Big data and Hadoop are growing rapidly because the scaling laws of computing have changed: analytics can now scale linearly on commodity hardware. This tipping point has led to widespread adoption of big data technologies by large enterprises and startups alike, across many applications and data scales. Large companies must address compatibility with existing systems to integrate big data; startups face far fewer such constraints.
Why is big data so fashionable with big and small companies across such different industries? What has suddenly changed?
Google searches for big data are up 10x over just four years ago.
Hadoop use is exploding as well. This example shows job trends for Hadoop, which is further evidence that you should pay attention during this talk.
But we have seen steady growth for a long time, and simple growth would only explain some kinds of companies (probably the big ones) starting with big data, followed by slow adoption elsewhere. Databases started with big companies and took 20 years or more to reach everywhere, because need exceeded cost at different times for different companies. The internet, on the other hand, happened to everybody at essentially the same time, so it changed nearly all industries at all scales nearly simultaneously. Why is big data exploding right now, and why is it exploding at all?
The different kinds of scaling laws have different shapes, and I think that shape is the key.
The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial steep climb.
In classical analytics, the cost of doing analytics increases sharply as data grows.
The result is a net value curve with a sharp optimum, located where value is still rising rapidly and cost is not yet rising rapidly.
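To make the shape argument concrete, here is a minimal numerical sketch. The functional forms are assumptions for illustration only: value that grows like the square root of data size, and a classical cost that grows quadratically.

```python
import numpy as np

# Toy model of classical analytics economics (the functional forms are
# illustrative assumptions, not measurements).
scale = np.linspace(0.01, 10, 1000)      # data scale, arbitrary units

value = np.sqrt(scale)                   # diminishing returns on more data
classical_cost = 0.05 * scale ** 2       # cost rises sharply with scale
net = value - classical_cost

best = scale[np.argmax(net)]
print(f"classical optimum at scale {best:.2f}, net value {net.max():.2f}")
```

With these assumed curves, the net value peaks sharply at a modest scale and turns negative not far beyond it.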
New techniques such as Hadoop result in linear scaling of cost. That is a change in shape, and it causes a qualitative change in the way costs trade off against value to give net value. As technology improves, the slope of this cost line is also dropping rapidly over time.
This next sequence shows how the net value changes with linear cost models of different slopes.
Notice how the best net value has jumped up significantly.
And as the line approaches horizontal, the highest net value occurs at a dramatically larger data scale.
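Running the same toy model with linear cost lines of decreasing slope shows both effects from this sequence: the peak net value jumps up, and the optimum moves out to dramatically larger scale (the curves are still assumed forms, not measurements).

```python
import numpy as np

scale = np.linspace(0.01, 1000, 100_000)  # much wider range of data scale
value = np.sqrt(scale)                    # same assumed value curve as before

# Linear cost models with progressively gentler slopes, as the
# technology improves over time.
for slope in (0.2, 0.1, 0.05, 0.02):
    net = value - slope * scale
    best = scale[np.argmax(net)]
    print(f"slope {slope:.2f}: optimum at scale {best:7.1f}, "
          f"net value {net.max():6.2f}")
```

In this sketch, cutting the slope by 10x moves the optimum out by 100x while the best net value climbs steadily.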
Exponential growth means growth by a constant factor in any constant time interval. With growth that fast (anything more than 3x per 10 time units), the accumulation of all history before 10 time units ago is less than half of the accumulation in the last 10 units alone. This is true at every point in time.
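Here is a quick arithmetic check of that claim as a small Python sketch (the 10x-per-10-units growth factor is an assumption, chosen to match the search-trend figure earlier):

```python
import math

# Exponential accumulation check. Assume data volume grows by a factor
# g over each 10-unit period (g = 10 matches the 10x figure earlier;
# the claim holds for any g > 3).
g = 10.0
k = math.log(g) / 10.0        # continuous growth rate

# With volume ~ exp(k*t), the total accumulated up to time t is
# exp(k*t)/k. Normalize so the total accumulated by "now" (t = 0) is 1.
before = math.exp(-10 * k)    # everything older than 10 time units
last_10 = 1.0 - before        # accumulated in the last 10 units

print(f"history before: {before:.2f}, last 10 units: {last_10:.2f}")
print(f"old history is {before / last_10:.0%} of the recent accumulation")
```

At 10x growth, everything older than 10 units is only about 11% of what accumulated in the last 10 units.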
Startups use this fact to their advantage and restructure everything around it, optimizing for development time initially and converting to computation-efficient systems later.
Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
The startup can throw away its history because that history is so small. That means the startup has almost no compatibility requirement: whatever data is lost due to lack of compatibility is a small fraction of the total.
A large enterprise cannot do that. It has to retain access to the old data and has to share between the old data and the Hadoop-accessible data. This sharing doesn't have to happen at the proof-of-concept stage, but it really must happen when Hadoop first goes to production.