The big data dead valley dilemma and much more.

The Big Data Dead Valley Dilemma
and Much More
francis@qmining.com
Founder QMining
@fraka6

Unhidden Agenda
● Big Data Big Picture
● Big Data Dead Valley Dilemma
● Elastic Map Reduce (EMR) numbers
● Scaling Learning (MPI & hadoop)

Big Data
=
Lots of Data
(evidence)
+
CPU bound
(forgotten)

Big Data
=
Lots of Data
(evidence)
-
IO bound
(reality)

IO bound
CPU usage < 100%, limited by data access:
● HD/Bus speed
● Network
● File server

Big Data Scalability
(ex: hadoop)
=
Cluster
+
Locality + node failure handling
(Data kept close to the CPU)

Big Data Dead Valley
[Figure: Techno Maturity / Risk vs. Enterprise size; regions: Start-ups, SMB, Enterprise]

Big Data
=
SMALL
MARKET
(B2B vs B2C)

WHY?????
Maturity
Data, Process, QA, infra, talent, $, Long term vision

Data -> Analytics -> BI -> Big Data -> Data Mining ->

Data Access & Quality
User data privacy, IT outsourcing protection, Data Quality

Enterprise Slowness
1. Boston CXO Forum, 24 October: Best Practice on Global Innovation (IBM, EMC, P&G, Intuit)
   Exploit vs. Explore, M&A
2. Brad Feld (Managing Director at Foundry Group)
   Hierarchy vs. network

Big Data Dead Valley
[Figure: Techno Maturity / Risk vs. Enterprise Maturity; regions: Start-ups, SMB, Enterprise]

QMarketing example
Leveraging hadoop
● map = hits to session
● reduce = sessions to ROI
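The two steps above can be sketched in plain Python, without a Hadoop cluster. This is only an illustration under stated assumptions: the hit fields (visitor id, channel, cost, revenue), the session key, and the ROI formula (revenue - cost) / cost are hypothetical and are not taken from QMarketing itself.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical hit records: (visitor_id, channel, cost, revenue).
HITS = [
    ("v1", "PPC", 0.50, 0.0),
    ("v1", "PPC", 0.00, 20.0),
    ("v2", "Email", 0.10, 0.0),
    ("v3", "PPC", 0.50, 0.0),
    ("v2", "Email", 0.00, 5.0),
]

def map_hit(hit):
    """Map step: key each hit by its session (assumed here: visitor id + channel)."""
    visitor, channel, cost, revenue = hit
    return (visitor, channel), (cost, revenue)

def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_sessions(sessions):
    """Reduce step: fold sessions into ROI per channel, (revenue - cost) / cost."""
    totals = defaultdict(lambda: [0.0, 0.0])  # channel -> [cost, revenue]
    for (_, channel), values in sessions:
        for cost, revenue in values:
            totals[channel][0] += cost
            totals[channel][1] += revenue
    return {ch: (rev - cost) / cost for ch, (cost, rev) in totals.items() if cost > 0}

roi = reduce_sessions(shuffle(map(map_hit, HITS)))
```

The resulting `roi` dictionary holds one ROI value per channel, which is the kind of number needed to fill the "?" column of a budget table.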

Online Marketing Management
Channel          % budget   ROI
----------------------------------------------
PPC              50%        ?
Organic          20%        ?
Email Campaign   20%        ?
Social Media     10%        ?

All abstractions leak
Abstract -> Procrastinate!
http://www.aleax.it/pycon_abst.pdf (Alex Martelli : "Abstraction as a Leverage" )

Minimize A Tower of Abstraction
Simplify & lower the layer of abstraction
Examples:
● Work on files, not a DB, if possible
● Hard disks directly attached to the server
● Low-level Linux command-line tools (cut, grep, sed, etc.)
● High-level languages: Python
Abstraction = 20X benefits
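As an illustration of working directly on files with a high-level language, here is a minimal sketch of a `grep | cut` style filter in Python. The function name `cut_grep`, the tab-separated layout, and the sample records are assumptions made for the example.

```python
import io

def cut_grep(lines, pattern, field, sep="\t"):
    """Yield one field from each line containing pattern (like grep | cut)."""
    for line in lines:
        if pattern in line:
            parts = line.rstrip("\n").split(sep)
            if field < len(parts):
                yield parts[field]

# Hypothetical tab-separated log; a real run would pass an open file object.
sample = io.StringIO("v1\tPPC\t/home\nv2\tEmail\t/buy\nv3\tPPC\t/buy\n")
pages = list(cut_grep(sample, "PPC", 2))
```

Because the function is a generator over lines, it streams the file and never needs to hold the whole dataset in memory.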

EMR vs AWS & S3 1.0
(no data locality optimization + network & ~IO bound)
EMR = 45 min
AWS = 4 min

EMR vs AWS & S3 2.0
EMR = 5+10 min*
AWS = ~4 min
*30 min preprocessing ;)
EMR = 5+4 min if (big files & compressed files)
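One possible way to get "big files & compressed files" is to concatenate many small input files into a single gzip file before uploading. The sketch below is an assumption about how that preprocessing could look. The slides do not say how the files were actually prepared, and the function name and `*.log` pattern are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path

def merge_to_gzip(src_dir, dest_path, pattern="*.log"):
    """Concatenate every matching small file into one gzip file; return file count."""
    count = 0
    with gzip.open(dest_path, "wb") as out:
        for path in sorted(Path(src_dir).glob(pattern)):
            out.write(path.read_bytes())
            count += 1
    return count

# Demo on a temporary directory with three tiny hypothetical log files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"part{i}.log").write_text(f"line {i}\n")
    merged = Path(tmp, "merged.gz")
    n_files = merge_to_gzip(tmp, merged)
    with gzip.open(merged, "rt") as f:
        content = f.read()
```

Note that a gzip file is not splittable by Hadoop, so in practice the number and size of merged files would need to be balanced against the number of mappers.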

Scaling Machine Learning
● Scaling Data-Preprocessing = Hadoop
● Small dataset = GPU
● Training with a big dataset = ?? Communication infrastructure =
MPI & MapReduce (John Langford, http://hunch.net/?p=2094)

Hadoop vs MPI
MPI
● No fault tolerance by default
● Poor understanding of where the data is (manual split across nodes + bad
communication & programming complexity)
● Scale limited to ~100 nodes in practice (sharing unavoidable)
● Shared cluster -> slow-node issues appear before disk/node failures
MapReduce
● Setup and teardown costs are significant (interaction with the scheduler &
communicating the program to a large number of nodes)
● Worst: MapReduce waits for free nodes + many MapReduce iterations are needed to
reach high-quality prediction
● Flaw: requires refactoring code into map/reduce

Hadoop-compatible AllReduce -
Vowpal Wabbit (Hadoop + MPI)
● MPI = All reduce (all nodes same state)
● MapReduce = Conceptual Simplicity
● MPI: No need to refactor code
● MapReduce: Data Locality (Map only)
● MPI: Ability to use local storage (or RAM): temp file on
local disk + allow to be cached in RAM by OS
● MapReduce: Automatic cleanup of local resources (tmp
files)
● MPI: Fast optimization approaches remain within the
conceptual scope: AllReduce = function call
● MapReduce robustness (speculative execution to deal
with slow nodes)
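To make "AllReduce = function call" concrete, here is a single-process simulation in Python. It is a sketch, not Vowpal Wabbit's implementation: the nodes are just list entries, and the toy model (fit w in y = w * x by gradient descent), the shards, and the learning rate are assumptions chosen for the example.

```python
def allreduce_sum(node_vectors):
    """Simulated AllReduce: sum vectors element-wise, then give every node the total."""
    total = [sum(values) for values in zip(*node_vectors)]
    return [list(total) for _ in node_vectors]

def local_gradient(weights, data):
    """Hypothetical per-node gradient of squared error for y = w * x."""
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in data)]

# Three simulated nodes, each holding a local shard of (x, y) pairs with y = 3x.
shards = [[(1.0, 3.0)], [(2.0, 6.0)], [(3.0, 9.0)]]
weights = [0.0]
for _ in range(50):
    grads = [local_gradient(weights, shard) for shard in shards]
    summed = allreduce_sum(grads)  # every node receives the same summed gradient
    weights = [w - 0.01 * g for w, g in zip(weights, summed[0])]
```

The training loop stays ordinary sequential code: the only distributed step is the `allreduce_sum` call, after which all nodes hold the same state, so no refactoring into separate map and reduce functions is needed.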

Summary
● Big Data Big Picture
○ Big Data: Cluster + IO bound (Locality)
● Big Data Dead Valley Dilemma (MMID)
○ Small Market/Maturity/Data:access,quality/Slowness
● EMR (aws) = Slow
● Minimize the Tower of Abstraction
● Scaling ML: bottleneck = communication infrastructure
○ MPI:no fault tolerance + where is the data?
○ Hadoop: slow setup & teardown + Require
Refactoring
○ Hadoop compatible AllReduce

Reference MPI & hadoop
blog:
http://bickson.blogspot.ca/2011/12/mpi-vs-hadoop.html
http://hunch.net/?p=2094
Video & slides of John Langford's presentation
Learning From Lots Of Data (full)
Speaker: John LANGFORD, Senior Research Scientist, Microsoft Research
Slides: http://lisaweb.iro.umontrea...
Implementation :
vowpal_wabbit

hum...
Questions?
francis@qmining.com
