Facebook Internals

This article describes the data model inside Facebook, which is built on TAO (The Associations and Objects).

To summarize, it has two layers.

1) An in-memory cache layer that uses a graph data model: objects (vertices) and the associations between them (edges).

2) Data in the in-memory cache layer is backed by MySQL storage.

This, in a nutshell, is Facebook's internals.
https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920
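To make the two-layer idea concrete, here is a minimal Python sketch of an object/association store with a backing store standing in for MySQL. All names and structure are my own illustration of the concept, not Facebook's actual TAO API.

```python
# Minimal sketch of an objects-and-associations cache over a backing store.
# Illustrative only; not Facebook's actual TAO API.
class TinyTAO:
    def __init__(self, backing_store):
        self.backing = backing_store   # stands in for MySQL
        self.obj_cache = {}            # object id -> object fields
        self.assoc_cache = {}          # (id1, assoc type) -> list of id2

    def obj_get(self, oid):
        # Serve from the in-memory layer; fall back to the backing store.
        if oid not in self.obj_cache:
            self.obj_cache[oid] = self.backing["objects"][oid]
        return self.obj_cache[oid]

    def assoc_add(self, id1, atype, id2):
        # Write-through: update the cache and the backing store together.
        self.assoc_cache.setdefault((id1, atype), []).append(id2)
        self.backing["assocs"].setdefault((id1, atype), []).append(id2)

    def assoc_get(self, id1, atype):
        # Edges are also cached in memory after the first read.
        if (id1, atype) not in self.assoc_cache:
            self.assoc_cache[(id1, atype)] = list(
                self.backing["assocs"].get((id1, atype), []))
        return self.assoc_cache[(id1, atype)]

store = {"objects": {1: {"type": "user", "name": "alice"},
                     2: {"type": "post", "text": "hello"}},
         "assocs": {}}
tao = TinyTAO(store)
tao.assoc_add(1, "authored", 2)
print(tao.obj_get(1)["name"])        # alice
print(tao.assoc_get(1, "authored"))  # [2]
```

The point of the sketch is the separation of concerns: reads are served from memory whenever possible, and the relational store underneath only needs to be touched on cache misses and writes.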

A Long Journey to Data Scientists

Becoming a data scientist is hard. Long way to go! A lot to learn!

Swami Chandrasekaran summarized the long journey becoming a data scientist in his post, Becoming a Data Scientist – Curriculum via Metromap.

The reason I like his map is very simple: it shows me what I don't know, and the first step to learning is knowing what you don't know. Below is his fantastic map of the learning path to becoming a data scientist.

[Image: RoadToDataScientist metro map]

Here is the curriculum as a list, with a link for each keyword.
For the Toolbox section (the last one), each tool links to its developer page.

If you spot a wrong link or know a better one, please suggest it in the comments.
Again, thanks to Swami Chandrasekaran, we have this list of keywords for becoming a data scientist!

  • Fundamentals
  1. Matrices & Linear Algebra Fundamentals
  2. Hash Functions, Binary Tree, O(n)
  3. Relational Algebra, DB Basics
  4. Inner, Outer, Cross, Theta Join
  5. CAP Theorem
  6. Tabular Data
  7. Entropy
  8. Data Frames & Series
  9. Sharding
  10. OLAP
  11. Multidimensional Data Model
  12. Extract/Transform/Load (ETL)
  13. Reporting vs BI vs Analytics
  14. JSON & XML
  15. NoSQL
  16. Regex
  17. Vendor Landscape
  18. Env Setup
  • Statistics
  1. Pick a Dataset (UCI Repo)
  2. Descriptive Statistics (mean, median, range, SD, Var)
  3. Exploratory Data Analysis
  4. Histograms
  5. Percentiles & Outliers
  6. Probability Theory
  7. Bayes Theorem
  8. Random Variables
  9. Cumulative Distribution Function (CDF)
  10. Continuous Distributions (Normal, Poisson, Gaussian)
  11. Skewness
  12. Analysis of Variance (ANOVA)
  13. Probability Density Function (PDF)
  14. Central Limit Theorem
  15. Monte Carlo Method
  16. Hypothesis Testing
  17. p-Value
  18. Chi-square Test
  19. Estimation
  20. Confidence Interval (CI)
  21. Maximum Likelihood Estimation (MLE)
  22. Kernel Density Estimate
  23. Regression
  24. Covariance
  25. Correlation
  26. Pearson Coeff
  27. Causation
  28. Least Squares Fit
  29. Euclidean Distance
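Several of the statistics stations above (mean, median, SD, variance, Pearson coefficient) fit in a few lines of plain Python. The sketch below uses only the standard library; the `pearson` helper is my own name, written out directly from the textbook definition.

```python
import statistics

# Descriptive statistics with Python's standard library.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(data))       # 5
print(statistics.median(data))     # 4.5
print(statistics.pstdev(data))     # population standard deviation: 2.0
print(statistics.pvariance(data))  # population variance: 4

# Pearson correlation coefficient, from its definition:
# covariance divided by the product of the standard deviations.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A perfectly linear relationship gives a coefficient of (approximately) 1.
print(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```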
  • Programming
  1. Python Basics
  2. Working in Excel
  3. R Setup, R Studio
  4. R Basics
  5. Expressions
  6. Variables
  7. IBM SPSS, Rapid Miner
  8. Vectors
  9. Matrices
  10. Arrays
  11. Factors
  12. Lists
  13. Data Frames
  14. Reading CSV Data
  15. Reading RAW Data
  16. Subsetting Data
  17. Manipulate Data Frames
  18. Functions
  19. Factor Analysis
  20. Install Pkgs
  • Machine Learning
  1. What is ML?
  2. Numerical Var
  3. Categorical Variable
  4. Supervised Learning
  5. Unsupervised Learning
  6. Concepts, Inputs & Attributes
  7. Training & Test Data
  8. Classifier
  9. Prediction
  10. Lift
  11. Overfitting
  12. Bias & Variance
  13. Trees & Classification
  14. Classification, Classification Rate
  15. Decision Trees
  16. Boosting
  17. Naïve Bayes Classifiers
  18. K-Nearest Neighbor
  19. Logistic Regression
  20. Regression, Ranking
  21. Linear Regression
  22. Perceptron
  23. Clustering, Hierarchical Clustering
  24. K-means Clustering
  25. Neural Networks
  26. Sentiment Analysis
  27. Collaborative Filtering
  28. Tagging
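As a taste of the machine learning list above, K-Nearest Neighbor is simple enough to write from scratch: classify a point by majority vote among its k closest training points, using the Euclidean distance that also appears in the Statistics list. This is a toy sketch, not a production classifier.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Sort training points by Euclidean distance to the query point.
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    # Majority vote among the labels of the k nearest neighbors.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "a"), ((1, 2), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_classify(train, (1.5, 1.5)))  # a
print(knn_classify(train, (5.5, 5.0)))  # b
```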
  • Text Mining / Natural Language Processing
  1. Corpus
  2. Named Entity Recognition
  3. Text Analysis
  4. UIMA
  5. Term Document Matrix
  6. Term Frequency & Weight
  7. Support Vector Machines
  8. Association Rules
  9. Market Based Analysis (Market Basket Analysis?)
  10. Feature Extraction
  11. Using Mahout
  12. Using Weka
  13. Using Natural Language Toolkit (NLTK)
  14. Classify Text (Document Classification?)
  15. Vocabulary Mapping
  • Data Visualization
  1. Data Exploration in R (Hist, Boxplot etc)
  2. Uni, Bi & Multivariate Viz
  3. ggplot2
  4. Histogram & Pie (Uni)
  5. Tree & Tree Map
  6. Scatter Plot (Bi)
  7. Line Charts (Bi)
  8. Spatial Charts
  9. Survey Plot
  10. Timeline
  11. Decision Tree
  12. D3.js
  13. InfoVis
  14. IBM ManyEyes
  15. Tableau
  • Big Data
  1. Map Reduce Framework
  2. Hadoop Components
  3. HDFS
  4. Data Replication Principles
  5. Setup Hadoop (IBM / Cloudera / HortonWorks)
  6. Name & Data Nodes
  7. Job & Task Tracker
  8. M/R Programming
  9. Sqoop : Loading Data in HDFS
  10. Flume, Scribe : For Unstructured Data
  11. SQL with Pig
  12. DWH with Hive
  13. Scribe, Chukwa for Weblogs
  14. Using Mahout
  15. Zookeeper, Avro
  16. Storm : Hadoop Realtime
  17. Rhadoop, RHIPE
  18. rmr
  19. Cassandra
  20. MongoDB, Neo4j
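The Map/Reduce programming model from the Big Data list can be illustrated without a cluster. This sketch runs the classic word count as three explicit phases: map (emit key/value pairs), shuffle (group values by key, which Hadoop does between the phases), and reduce (aggregate each group).

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

On a real cluster the map and reduce phases run in parallel across machines, but the shape of the computation is exactly this.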
  • Data Ingestion
  1. Summary of Data Formats
  2. Data Discovery
  3. Data Sources & Acquisition
  4. Data Integration
  5. Data Fusion
  6. Transformation, Enrichment
  7. Data Survey
  8. Google OpenRefine
  9. How much Data?
  10. Using ETL
  • Data Munging
  1. Dimensionality & Numerosity Reduction
  2. Normalization
  3. Data Scrubbing
  4. Handling Missing Values
  5. Unbiased Estimators
  6. Binning Sparse Values
  7. Feature Extraction
  8. Denoising
  9. Sampling
  10. Stratified Sampling
  11. Principal Component Analysis
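Two of the munging steps above, normalization and handling missing values, can be sketched in a few lines of plain Python. Min-max normalization and mean imputation are just one common choice for each; the function names are my own.

```python
def min_max_normalize(values):
    # Rescale values linearly into the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def fill_missing_with_mean(values):
    # Replace None entries with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(min_max_normalize([10, 20, 30]))       # [0.0, 0.5, 1.0]
print(fill_missing_with_mean([1, None, 3]))  # [1, 2.0, 3]
```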
  • Toolbox
  1. MS Excel w/ Analysis ToolPak
  2. Java, Python
  3. R, R-Studio, Rattle
  4. Weka, Knime, RapidMiner
  5. Hadoop Dist of Choice
  6. Spark, Storm
  7. Flume, Scribe, Chukwa
  8. Nutch, Talend, Scraperwiki
  9. Webscraper, Flume, Sqoop (Flume Dup?)
  10. tm, RWeka, NLTK
  11. RHIPE
  12. D3.js, ggplot2, Shiny
  13. IBM Languageware
  14. Cassandra, MongoDB

That’s it! I hope these links are helpful to data scientists learning new material day by day! Feel free to leave your comments ;-)

-So Big Data

Summary: Spark and Shark, High-Speed In-Memory Analytics over Hadoop

Introduction

Spark and Shark speed up interactive and complex analytics on Hadoop data by (up to) 40x.
Spark runs MapReduce on data cached in-memory. Shark runs HiveQL on top of Spark.
This article summarizes Matei Zaharia’s seminar on Spark and Shark.

Spark

Problems:
- Hadoop spends 90~95% of its time on replication and on storing/reading data on disk.
- Only 5~10% of its time is spent processing the actual data.

Solution:
- Step 1: Load the data into memory.
- Step 2: Run MapReduce or iterative computations on the in-memory data.

Features:
- Fault tolerance is based on RDDs (Resilient Distributed Datasets): an RDD remembers each step used to build the in-memory data, so lost partitions can be recomputed.
- Spark uses Scala to define the filter, map, and reduce operations that run on the cluster.
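Spark's real API is in Scala; purely to illustrate the filter/map/reduce chaining style described above, here is the same shape in plain Python (no cluster, no fault tolerance, just the programming model, with an invented log-mining example):

```python
from functools import reduce

# A log-mining pipeline in the filter -> map -> reduce shape that
# Spark exposes over an in-memory dataset.
lines = [
    "ERROR disk full",
    "INFO checkpoint ok",
    "ERROR network timeout",
    "WARN slow response",
]

errors = filter(lambda l: l.startswith("ERROR"), lines)  # keep error lines
messages = map(lambda l: l.split(" ", 1)[1], errors)     # drop the level tag
count = reduce(lambda acc, _: acc + 1, messages, 0)      # count the results
print(count)  # 2
```

In Spark the dataset would be an RDD partitioned across machines and each stage would run in parallel, but the chained functional style is the same.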

Performance:
- Analytic queries on Hadoop run up to 40 times faster.
- Full-text search of Wikipedia in <1s (vs 20s for on-disk data).
- A logistic regression performance test takes 6s with Spark, versus 127s with Hadoop.
- PageRank performance: Hadoop 171s, basic Spark 72s, Spark + controlled partitioning 23s.

Use Cases:
- Aggregation over streaming data: load data every few seconds, then run analytic queries on it.
- Estimating city traffic from crowd-sourced GPS data using an iterative EM algorithm.

Shark

Problem:
- HiveQL on MapReduce on Hadoop is too slow.

Solution:
- Run HiveQL on Spark on Hadoop.
- Change the Hive client library to use Spark instead of MapReduce.
- Compact the in-memory size of the data with column-oriented storage using arrays of primitive types (which also speeds up aggregations).
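The column-oriented idea can be sketched in Python: storing each column as an array of primitives avoids per-row object overhead, and an aggregation becomes a scan of one contiguous buffer. The sizes and table below are illustrative of the idea, not Shark's actual layout.

```python
from array import array

# Row-oriented: one dict per row, every value a boxed Python object.
rows = [{"id": i, "price": float(i)} for i in range(1000)]

# Column-oriented: one primitive array per column.
ids = array("i", range(1000))                         # machine ints
prices = array("d", (float(i) for i in range(1000)))  # 8-byte doubles

# Aggregating a column is a scan over one contiguous buffer,
# instead of chasing a pointer into every row object.
total = sum(prices)
print(total)            # 499500.0
print(prices.itemsize)  # 8 bytes per value, no per-object overhead
```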

Performance:
- SELECT … WHERE LIKE ‘%XYZ%’ : Hive 208s, Shark 182s, Shark (cached) 12s.
- SELECT … WHERE … GROUP BY … ORDER BY … LIMIT 1 : Hive 447s, Shark 270s, Shark (cached) 126s.

What’s Next

Streaming Spark runs Spark MapReduce on data within a time window, re-running repeatedly as time passes and the window slides.
It can process 42M records/second (4 GB/s) on 100 nodes at sub-second latency.
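The time-window idea can be sketched as a sliding window over a stream. This toy version keeps the last N records in a deque and recomputes an aggregate as each record arrives; Streaming Spark does this distributed and far more efficiently, but the windowing semantics are the same.

```python
from collections import deque

def sliding_window_sums(stream, window_size):
    # Keep only the most recent `window_size` records; emit the sum
    # of the current window after each new record arrives.
    window = deque(maxlen=window_size)
    sums = []
    for record in stream:
        window.append(record)  # oldest record falls out automatically
        sums.append(sum(window))
    return sums

print(sliding_window_sums([1, 2, 3, 4, 5], window_size=3))
# [1, 3, 6, 9, 12]
```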

The Video with Demo

Following is the original video of the seminar by Matei Zaharia.