Big Data Story
50 Petabytes – Entire written works of mankind
PETABYTE (PB) [1,000 Terabytes]
5 Exabytes – All the words ever spoken by mankind
EXABYTE (EB) [1,000 Petabytes]
Big Data Story
250 billion DVDs
ZETTABYTE (ZB) [1,000 Exabytes]
Size of the entire World Wide Web – 11 trillion years to
download a Yottabyte file
YOTTABYTE (YB) [1,000 Zettabytes]
Big Data Story
US NSA data center – capable of storing a yottabyte of data
Big Data Story
Social Networking: 300 petabytes
Ecommerce: 1 exabyte
Search & others: 10 exabytes
Big Data Story
Wearable: Wrist/arm bands, watches, eyewear
Transportation: Cars, navigation devices
Buildings: Heating/ventilation systems, air conditioners
Health Technology: Body sensors, body implants, pills
Cities: Traffic/street lights, traffic sensors and signs
Big Data Story
Tile is the smart companion for all the things
you can't stand to lose
Big Data Story
Historical query-based estimates vs official influenza surveillance data
Early detection of a disease outbreak can save lives
http://www.google.org/flutrends/intl/en_us/about/how.html
Big Data Story
"Big data is becoming an effective basis of
competition in pretty much every industry"
- Michael Chui, McKinsey Global Institute
Big Data Story
Digitally mapping the global economy
to connect talent with opportunity at massive scale
Connections between people, jobs, skills, companies,
and professional knowledge
Big Data Story
Using Big Data To Solve Social Problems
USING DATA IN THE SERVICE OF HUMANITY
Big Data Story
DJ Patil
U.S. Chief Data Scientist
Weather, health care, climate, flight
Responsibly unleash the power of data
Big Data Story
“What we realized is data, when used
responsibly, is a force multiplier”
DJ Patil
U.S. Chief Data Scientist
White House Office of Science and Technology Policy
Big Data Technology
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
In the beginning…… 2003
Big Data Technology
2003  Google publishes GFS
2004  Google publishes MapReduce
2006  Hadoop is born
2008  Hadoop becomes a top-level Apache project; Y! sorts 1TB in 209 seconds (900-cluster)
2009  Google sorts 1TB in 69 seconds (100-cluster)
2012  Hadoop version 1.0
2014  Apache Spark 1.0
2015  Spark 1TB sorting
Big Data Technology
Batch King
An M/R application that works on 10GB of data
will also run on 10PB of data
Automatic parallelism and fault tolerance
Too low level and limiting
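The map and reduce steps that the framework parallelizes can be sketched in plain Python. This is a single-process illustration only, not Hadoop's API; the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data story", "big data technology"]))
```

Because each call to `map_phase` sees one line and each call to `reduce_phase` sees one key, a real framework can spread those calls across many machines without changing the user's code. That is the sense in which the same application runs on 10GB or 10PB. It also hints at why the model feels low level: even a word count needs three explicit functions.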
Big Data Technology
Spark: Cluster Computing with Working Sets
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
In recent days…… 2010
Lightning-fast cluster computing
Big Data Technology
Speed: In-memory computing
Versatile: General execution model
Ease of use: APIs in Scala, Java, Python
Big Data Technology
                  Hadoop MR Record   Spark Record
Data Size         100TB              100TB
Elapsed Time      72 mins            23 mins
# Nodes           2100               206
# Cores           50400              6592
Sort Rate         1.42 TB/min        4.27 TB/min
Sort Rate/node    0.67 GB/min        20.7 GB/min
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
New record in large-scale sorting
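The per-node figures follow from the sort rates and node counts reported for the two records. A small check, assuming 1 TB = 1,000 GB as in the unit definitions earlier in the deck; the helper name `per_node_rate_gb_min` is hypothetical.

```python
def per_node_rate_gb_min(sort_rate_tb_min: float, nodes: int) -> float:
    """Convert a cluster-wide sort rate in TB/min to GB/min per node."""
    return sort_rate_tb_min * 1000 / nodes  # assumes 1 TB = 1,000 GB


# Inputs are the sort rates and node counts reported for the two records.
hadoop = per_node_rate_gb_min(1.42, 2100)
spark = per_node_rate_gb_min(4.27, 206)

if __name__ == "__main__":
    print(f"Hadoop MR: {hadoop:.2f} GB/min per node")
    print(f"Spark:     {spark:.2f} GB/min per node")
    print(f"Spark / Hadoop per-node ratio: {spark / hadoop:.1f}")
```

The computed values agree with the reported 0.67 and 20.7 GB/min to within rounding, and the per-node ratio comes out at roughly 30 to 1.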
Big Data Technology
Lambda Architecture
An approach to build big data systems
Human fault tolerant
Data immutability
Re-computation
http://lambda-architecture.net/
query = function(all data)
Big Data Technology
Lambda Architecture
Data
Batch Layer: Master dataset
Serving Layer: Batch View, Batch View
Speed Layer: Speed View
Query
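The idea that a query is a function of all data, answered by combining a batch view with a speed view, can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions and not a reference implementation. The class `LambdaSketch`, the helper `compute_view`, and the page-view event shape are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (page, view_count): a hypothetical record shape


def compute_view(events: Iterable[Event]) -> Counter:
    """The query function: total views per page over the given events."""
    totals: Counter = Counter()
    for page, views in events:
        totals[page] += views
    return totals


class LambdaSketch:
    def __init__(self) -> None:
        self.master_dataset: List[Event] = []  # append-only, never updated
        self.recent: List[Event] = []          # events not yet in batch view
        self.batch_view: Counter = Counter()

    def ingest(self, event: Event) -> None:
        """New data goes to both the batch layer and the speed layer."""
        self.master_dataset.append(event)
        self.recent.append(event)

    def run_batch(self) -> None:
        """Batch layer: recompute the view from scratch over all data."""
        self.batch_view = compute_view(self.master_dataset)
        self.recent.clear()  # the batch view now covers these events

    def query(self, page: str) -> int:
        """Serving side: merge the batch view with the speed view."""
        speed_view = compute_view(self.recent)
        return self.batch_view[page] + speed_view[page]


if __name__ == "__main__":
    system = LambdaSketch()
    system.ingest(("home", 3))
    system.run_batch()
    system.ingest(("home", 2))   # arrives after the last batch run
    print(system.query("home"))  # batch view (3) + speed view (2)
```

Because the master dataset is only ever appended to and the batch view is rebuilt from it, a bug in `compute_view` can be fixed and the view recomputed. That is the sense in which immutability and re-computation make the design tolerant of human error.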
Big Data Challenges
TOP THREE ANALYTICS CHALLENGES
Analytical insights into business actions
Aggregating multiple data sources
Lack of appropriate analytical skills
MIT Sloan – The Talent Divide Research Report
Big Data Challenges
Data Scientist
The sexiest job of the 21st century
- Harvard Business Review 10/2012
Tend to be better programmers than most
statisticians and better statisticians than most
programmers
- Jeanne Harris