3. A solution born at Twitter
Nathan Marz
Author of Big Data:
http://www.manning.com/marz/
4. When is big really big?
● OpenStreetMap.org ~1.5 M users, ~2.2 G nodes (http://j.mp/OSM-stats)
● Wikipedia 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
– Monitoring systems
– Any near real time system
8. Batch Layer
● Stores an immutable input data set
● Continuously recomputes the batch views
● Simple & distributed
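A minimal sketch of the idea on this slide, assuming a hypothetical page-view data set: the batch view is recomputed from the whole immutable input every time, never patched in place. The names PageView and compute_batch_view are illustrative and do not come from any specific tool.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class PageView(NamedTuple):
    """One immutable raw record (hypothetical schema)."""
    url: str
    timestamp: int


def compute_batch_view(master_dataset: Iterable[PageView]) -> Dict[str, int]:
    """Recompute the whole view from scratch: page views per URL."""
    return dict(Counter(event.url for event in master_dataset))


# The master data set is only ever appended to; a tuple stands in for it here.
master_dataset = (
    PageView("/home", 1),
    PageView("/about", 2),
    PageView("/home", 3),
)

batch_view = compute_batch_view(master_dataset)
print(batch_view)
```

Because the function reads only the raw records, a bug in the view logic can be fixed and the view rebuilt without losing data.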
9. Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the Batch Layer
● Trivial read-only database:
– Quick
– Very simple
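A rough sketch of what "trivial read-only database" can mean, assuming the batch view fits in a Python dict. The ServingLayer class and its method names are hypothetical; clients can only read, and the only write path is the batch layer swapping in a complete new view.

```python
from types import MappingProxyType
from typing import Mapping


class ServingLayer:
    """Read-only key lookup over the latest batch view (illustrative only)."""

    def __init__(self) -> None:
        self._view: Mapping[str, int] = MappingProxyType({})

    def load_batch_view(self, batch_view: Mapping[str, int]) -> None:
        # Called by the batch layer: replace the whole view in one step.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key: str, default: int = 0) -> int:
        # Random reads by key; there is no method for single-key writes.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load_batch_view({"/home": 2, "/about": 1})
print(serving.get("/home"))
```

Having no random writes is what keeps this layer simple: there is no locking and no compaction, only a swap of complete views.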
10. Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allows ad hoc queries
● Minimal maintenance
● Debuggable
11. What's missing?
While the Batch Layer computes the query on the full
data set, a pretty big chunk of data has just arrived
and been stored.
Should we wait a couple of hours to query this
data?
12. Speed Layer
[Diagram: New Data → Speed Layer → Near real time views → Query]
13. All together now!
[Diagram: New Data feeds both the Batch Layer (All Data) and the Speed Layer; the Batch Layer produces the Batch Views held by the Serving Layer, the Speed Layer produces the Near real time views, and a Query reads both]
14. How should all this mess work?
● All new data are sent to both the batch layer and the speed
layer (data are raw and immutable, append only)
● The batch layer continuously precomputes the query
functions over the whole data set to
produce the batch views
● The serving layer indexes the batch views
● By the time a batch run ends, the data in the batch views are a couple of hours old
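The first step above can be sketched as follows, assuming an in-memory stand-in for both layers. MasterDataset, ingest and speed_queue are hypothetical names; the point is only that each raw record is appended to both destinations and that the master data set offers no update or delete.

```python
from collections import deque
from typing import Deque, Iterator, List, Tuple

Record = Tuple[str, int]  # hypothetical raw record: (url, timestamp)


class MasterDataset:
    """Append-only store: records can be added and read, never changed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, record: Record) -> None:
        self._records.append(record)

    def __iter__(self) -> Iterator[Record]:
        # Iterate over a snapshot so readers never see a half-applied write.
        return iter(tuple(self._records))

    def __len__(self) -> int:
        return len(self._records)


def ingest(record: Record, master: MasterDataset, speed_queue: Deque[Record]) -> None:
    """Send one raw record to both layers."""
    master.append(record)       # batch layer input
    speed_queue.append(record)  # speed layer input


master = MasterDataset()
speed_queue: Deque[Record] = deque()
for record in [("/home", 1), ("/about", 2)]:
    ingest(record, master, speed_queue)
print(len(master), len(speed_queue))
```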
15. How should all this mess work?
● The speed layer processes only the new data
● It uses a fast read/write database and
incremental processing algorithms
● It produces the near real time views
● Queries are resolved by merging the results from
the real time and batch views
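A small sketch of the last two points, assuming the view is a count per URL. The variable and function names are hypothetical, and a Counter stands in for the fast read/write database: the speed layer updates its view incrementally per event, and a query adds the two views together.

```python
from collections import Counter
from typing import Dict

# Produced by the last batch run; possibly a couple of hours old.
batch_view: Dict[str, int] = {"/home": 2, "/about": 1}

# Maintained by the speed layer; covers only data newer than the batch view.
realtime_view: Counter = Counter()


def on_new_event(url: str) -> None:
    """Incremental update: touch one counter instead of recomputing everything."""
    realtime_view[url] += 1


def query(url: str) -> int:
    """Merge both views; for counts, merging is simply addition."""
    return batch_view.get(url, 0) + realtime_view[url]


on_new_event("/home")
on_new_event("/contact")
print(query("/home"), query("/contact"))
```

In a real system the real time entries would be discarded once a newer batch view covers the same data, so that nothing is counted twice; that step is left out here.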
16. Batch Layer - tools
● Hadoop
– YARN: framework for job scheduling and cluster
resource management
– MapReduce: a YARN-based system for parallel processing of
huge amounts of data
– HDFS: distributed file system with
high-throughput access to application data
– And even more...
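To make the Map / Reduce idea concrete, here is a single-process sketch of its three phases (map, shuffle, reduce) using a word count. This is not the Hadoop API; the function names are hypothetical, and a real job would run the map and reduce calls in parallel across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data", "Big views"])
print(word_counts)
```

Each map call depends only on its own line and each reduce call only on its own key, which is what lets a framework spread the work over many machines.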
17. Serving Layer - tools
● ElephantDB
– Read-only database, very small, very fast
● Anything else with the same
features will do here
● Cloudera Impala
18. Speed Layer - tools
● Storm project
– Very fast distributed computation system
● Apache HBase
● MongoDB