The reason I like his map is very simple: it shows me what I don’t know, and the first step to learning is knowing what you don’t know. Below is his fantastic map of the learning path to becoming a data scientist.
Here is the curriculum list, with a link for each keyword.
For the Toolbox section (the last one), each tool links to its developer page.
If a link is wrong, or you know a better one, please suggest it in a comment.
Again, thanks to Swami Chandrasekaran, we have the list of keywords to become a data scientist!
Spark and Shark speed up interactive and complex analytics on Hadoop data by up to 40x.
Spark runs MapReduce on data cached in-memory. Shark runs HiveQL on top of Spark.
This article summarizes Matei Zaharia’s seminar on Spark and Shark.
- Hadoop spends 90~95% of its time on replication and storing/reading data on disk.
- Only 5~10% of its time is spent processing the actual data.
- Step 1 : Load data in memory.
- Step 2 : Run MapReduce or iterations on data in memory.
- Fault tolerance is based on RDDs: each RDD remembers the steps (lineage) used to build its in-memory cache, so lost data can be recomputed.
- Spark uses Scala to define functions for filter, map, and reduce operations on the cloud.
- Analytic queries on Hadoop run up to 40 times faster.
- Full-text search of Wikipedia in <1s (vs 20s for on-disk data).
- A logistic regression performance test takes 6s with Spark versus 127s with Hadoop.
- PageRank performance: Hadoop 171s, basic Spark 72s, Spark + controlled partitioning 23s.
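The two steps above (load once, then iterate in memory) can be sketched in plain Python. This is not the Spark API: an ordinary in-memory list stands in for a cached RDD, and the sample log lines are made up for illustration.

```python
from functools import reduce

# Step 1: load the data into memory once. A plain list stands in for a
# cached RDD in this sketch; real Spark would partition it across a cluster.
cached = ["error: disk full", "info: ok", "error: timeout", "info: done"]

# Step 2: run filter/map/reduce-style queries repeatedly against the
# in-memory data; each query pays no disk I/O.
error_count = len([line for line in cached if line.startswith("error")])
total_chars = reduce(lambda a, b: a + b, (len(line) for line in cached))

print(error_count, total_chars)  # 2 48
```

The point of the speedups quoted above is exactly this pattern: the expensive load happens once, and every subsequent query or iteration works at memory speed.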
Use Cases:
- Aggregation on streaming data: load data every few seconds, then run analytic queries.
- Estimate city traffic from crowd-sourced GPS data using an iterative EM algorithm.
- HiveQL on MapReduce on Hadoop is too slow.
- Run HiveQL on Spark on Hadoop.
- Change the Hive client library to use Spark instead of MapReduce.
- Compact the on-disk size of the data by employing column-oriented storage using arrays of primitive types (this also speeds up aggregations).
- SELECT … WHERE LIKE ‘%XYZ%’: Hive 208s, Shark 182s, Shark (cached) 12s.
- SELECT … WHERE … GROUP BY … ORDER BY … LIMIT 1: Hive 447s, Shark 270s, Shark (cached) 126s.
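The column-oriented storage idea can be illustrated with Python's `array` module. This is a simplified single-machine model, not Shark's actual implementation, and the sample sales records are invented for the example.

```python
from array import array

# Row-oriented: one dict per record; values are boxed and scattered in memory.
rows = [{"id": 1, "amount": 10.0},
        {"id": 2, "amount": 20.5},
        {"id": 3, "amount": 30.0}]

# Column-oriented: one primitive array per column. Values are packed
# contiguously, which shrinks the footprint and lets an aggregation scan
# a single dense array instead of walking row objects.
ids = array("i", [r["id"] for r in rows])
amounts = array("d", [r["amount"] for r in rows])

total = sum(amounts)  # the aggregation touches only the column it needs
print(total)  # 60.5
```

This is why the cached Shark numbers above drop so sharply: a GROUP BY or SUM over a primitive column array is far cheaper than deserializing whole rows.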
Streaming Spark runs Spark MapReduce on data within a time window, repeatedly, as time advances and the window slides.
It can process 42M records/second (4 GB/s) on 100 nodes at sub-second latency.
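The sliding-window computation can be sketched as a small Python function. This is a simplified single-node model (a batch over a list, not a real stream), and the timestamped records are made up for illustration.

```python
from collections import Counter

def count_in_window(records, now, window):
    """Count words whose timestamps fall in (now - window, now].

    Re-running this as `now` advances mimics Streaming Spark's repeated
    MapReduce over a moving time window.
    """
    return Counter(word for t, word in records if now - window < t <= now)

# (timestamp_seconds, word) records standing in for stream elements.
stream = [(1, "a"), (2, "b"), (3, "a"), (9, "a")]
print(count_in_window(stream, now=3, window=3))  # Counter({'a': 2, 'b': 1})
```

Each re-run only touches the records inside the current window, which is what keeps the latency sub-second even at high record rates.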
The Video with Demo
Below is the original video of the seminar by Matei Zaharia.