13. Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality
14. Enterprise Slowness
1. Boston CXO Forum 24 October : Best Practice on Global
Innovation (IBM, EMC, P&G, Intuit)
Exploit vs Explore - M&A
2. Brad Feld (Managing Director at Foundry Group)
Hierarchy vs network
15. Big Data Dead Valley
[Figure: Techno Maturity / Risk plotted against Enterprise Maturity, positioning Start-ups, SMB and Enterprise]
20. All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )
21. Minimize A Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● Hard disk directly attached to the server
● Low-level Linux command-line tools (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
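A minimal sketch of the "work on files" idea, assuming a tab-separated log file: a few lines of Python stream the file directly and do roughly what `grep ERROR file | cut -f2 | sort | uniq -c` would do. The function name `count_field` and the sample data are hypothetical, made up for illustration.

```python
import collections
import os
import tempfile

def count_field(path, needle, field_index, sep="\t"):
    """Count values of one column among the lines containing `needle`."""
    counts = collections.Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # streams line by line, no database involved
            if needle in line:
                fields = line.rstrip("\n").split(sep)
                if field_index < len(fields):
                    counts[fields[field_index]] += 1
    return counts

# Hypothetical sample log, written to a temporary file for the demo.
sample = "ERROR\tdisk\nINFO\tdisk\nERROR\tnet\nERROR\tdisk\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

error_counts = count_field(sample_path, "ERROR", 1)
os.remove(sample_path)
print(error_counts.most_common())
```

The file is read once, sequentially, which keeps the job IO-friendly and avoids the extra layers (driver, server, query planner) a database would add.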
22. EMR vs AWS & S3 1.0
(no data locality optimization + network- &
~IO-bound)
EMR = 45 min
AWS = 4 min
23. EMR vs AWS & S3 2.0
EMR = 5+10 min*
AWS = ~4 min
*30 min preprocessing ;)
EMR = 5+4 if (big files & compressed files)
24. Scaling Machine Learning
● Scaling Data-Preprocessing = Hadoop
● Small dataset = GPU
● Train with Big Dataset = ?? Communication Infrastructures =
MPI & MapReduce (John Langford http://hunch.net/?p=2094)
29. Hadoop vs MPI
MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + bad
communication & programming complexity)
● Scale limited to ~100 nodes in practice (sharing unavoidable)
● Shared cluster -> slow-node issues appear before disk/node failures
MapReduce
● Setup and teardown costs are significant (interaction with the scheduler &
shipping the program to a large number of nodes)
● Worse: each MapReduce job waits for free nodes + many MapReduce iterations
are needed to reach a high-quality prediction
● Flaw: requires refactoring code into map/reduce
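The refactoring cost can be illustrated with a small sketch, assuming a one-parameter least-squares loss. The same computation is written first as a plain loop, then split into a mapper and a reducer. All function names here are hypothetical, and the "cluster" is simulated in a single process.

```python
from functools import reduce

def loss_plain(rows, weight):
    """Straightforward loop: squared error of y ~ weight * x."""
    total = 0.0
    for x, y in rows:
        total += (weight * x - y) ** 2
    return total

def map_phase(shard, weight):
    """Mapper: emit one partial sum per shard (would run next to the data)."""
    return sum((weight * x - y) ** 2 for x, y in shard)

def reduce_phase(left, right):
    """Reducer: combine two partial sums."""
    return left + right

def loss_mapreduce(shards, weight):
    """Same loss, restructured as a map step followed by a reduce step."""
    partials = [map_phase(shard, weight) for shard in shards]
    return reduce(reduce_phase, partials, 0.0)

rows = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
shards = [rows[:2], rows[2:]]  # pretend each shard lives on a different node
plain = loss_plain(rows, 2.0)
distributed = loss_mapreduce(shards, 2.0)
print(plain, distributed)
```

An iterative learner would have to relaunch this map/reduce pair once per pass over the data, which is where the setup and teardown costs listed above add up.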
30. Hadoop-compatible AllReduce -
Vowpal Wabbit (Hadoop + MPI)
● MPI = AllReduce (all nodes same state)
● MapReduce = Conceptual Simplicity
● MPI: No need to refactor code
● MapReduce: Data Locality (Map only)
● MPI: Ability to use local storage (or RAM): temp files on
local disk can be cached in RAM by the OS
● MapReduce: Automatic cleanup of local resources (tmp
files)
● MPI: Fast optimization approaches remain within the
conceptual scope: AllReduce = a function call
● MapReduce: Robustness (speculative execution to deal
with slow nodes)
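The "all nodes same state" property of AllReduce can be sketched as a single-process simulation. This is an illustration of the semantics only, not Vowpal Wabbit's implementation: a pairwise tree reduce followed by a broadcast is one common way to realize it, and the name `allreduce_sum` is hypothetical.

```python
def allreduce_sum(local_vectors):
    """Simulate AllReduce: return one copy of the global sum per node.

    Phase 1 reduces pairwise up a binary tree; phase 2 broadcasts the
    root's result back, so every node ends in the same state.
    """
    n = len(local_vectors)
    state = [list(v) for v in local_vectors]  # per-node working copy

    # Reduce: at each step, node i absorbs the vector of node i + stride.
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            partner = i + stride
            state[i] = [a + b for a, b in zip(state[i], state[partner])]
        stride *= 2

    # Broadcast: node 0 now holds the total; copy it to every node.
    return [list(state[0]) for _ in range(n)]

# Three simulated nodes, each holding a local gradient.
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = allreduce_sum(gradients)
print(synced)
```

From the learner's point of view this is a single function call inside the existing training loop, which is why no map/reduce refactoring is needed.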
38. Summary
● Big Data Big Picture
○ BigData : Cluster + IO bounded (Locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small Market/Maturity/Data:access,quality/Slowness
● EMR (aws) = Slow
● Minimize Tower of Abstraction
● Scaling ML: bottleneck = communication infrastructure
○ MPI: no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + requires
refactoring
○ Hadoop compatible AllReduce
39. References: MPI & Hadoop
blog:
http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094
Video & slides presentation, John Langford
Learning From Lots Of Data (full)
SPEAKER: John LANGFORD, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation :
vowpal_wabbit