4. What is Big Data?
Big Data ≠ Data Volume
Big Data = Crude Oil
Think of data like ‘Crude Oil’
Big Data is about extracting ‘crude oil’; transporting it in ‘pipelines’; storing it in ‘mega tanks’
5. What is Data Science?
Data Science ≠ Statistical Analysis
Data Science = Oil Refinery
Data Science is about ‘treating’ data; applying ‘science’ to the data;
refining the ‘results’; and combining them to form ‘insight’
6. Knowns, Unknowns & DIKUW FTW!
known knowns
we know we know
known unknowns
we know we don’t know
unknown unknowns
we don’t know we don’t know
D - DATA
I - INFORMATION
K - KNOWLEDGE
U - UNDERSTANDING
W - WISDOM
PAST → FUTURE
Data Engineer, Data Analyst, Data Miner, Data Scientist
DATA: raw; numbers, letters, symbols; signals
INFORMATION: what; description, context, relationship; reports
KNOWLEDGE: how to; experience, tested, instruction; programs
UNDERSTANDING: why; cause & effect, proven; models
WISDOM: when; prediction, what’s best
known knowns
known unknowns unknown unknowns
7. Data Analytics to Data Discovery?
data you know
data you don’t know
questions you’re asking
questions you’re not asking
Data Analyst
Data Scientist
Data Analytics
Data Discovery
DATA MODELLING
Y = F(X, random noise, parameters)
ALGORITHMIC MODELLING
Y ← [ BLACK BOX ] ← X
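The contrast between the two modelling styles can be sketched in a few lines of Python. This is a minimal illustration, not something taken from the deck: the synthetic data, the straight-line form for F, the nearest-neighbour predictor standing in for the black box, and the names `fit_line` and `knn_predict` are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Data modelling: assume Y = a + b*X + noise and estimate a, b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def knn_predict(xs, ys, x_new, k=2):
    """Algorithmic modelling: no assumed form; average the k nearest observed Y values."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Synthetic data (an assumption for this sketch): a noisy straight line.
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + random.gauss(0, 0.5) for x in xs]

a, b = fit_line(xs, ys)
line_prediction = a + b * 10.5
knn_prediction = knn_predict(xs, ys, 10.5, k=2)
print(f"data modelling       : Y = {a:.2f} + {b:.2f} * X, prediction at 10.5 = {line_prediction:.2f}")
print(f"algorithmic modelling: prediction at 10.5 = {knn_prediction:.2f}")
```

The first approach returns parameters that can be read and interpreted; the second returns only predictions, which is the sense in which it behaves like a black box.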
8. DIVIDE
SCATTER
Split Data into Blocks
Replicate and Store
Petabytes of Resilience
CONQUER
EXPLORE
1000s of Parallel Threads
Explore Every Path
Machine Learning
INSIGHT
GATHER
Real Time Action
Periodic Dashboards
Iterative Evolution
What is the Big Idea?
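The divide, conquer, gather idea can be sketched on a single machine. The following is a toy illustration only: the sample records, the block size, and the function names `scatter`, `explore` and `gather` are assumptions chosen to mirror the slide labels, and a thread pool stands in for a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def scatter(records, block_size):
    """DIVIDE: split the records into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def explore(block):
    """CONQUER: each worker counts the words in its own block."""
    return Counter(word for line in block for word in line.lower().split())

def gather(partials):
    """INSIGHT: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Illustrative records (not from the deck).
records = [
    "big data is crude oil",
    "data science is the refinery",
    "refine data into insight",
    "insight drives action",
]

blocks = scatter(records, block_size=2)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(explore, blocks))
word_counts = gather(partials)
print(word_counts.most_common(3))
```

Each block is processed without knowledge of the others, so adding more blocks and more workers does not change the logic, only the degree of parallelism.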
9. Divide = HDFS
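WRITE
Client to Name Node: 1. Create Metadata
Client to Data Nodes: 2. Put Blocks
Name Node to Data Nodes: Control / Monitoring
Data Nodes: blocks 1, 2, 3, each stored as replicated copies
READ
Client to Name Node: 1. Get Metadata
Client to Data Nodes: 2. Fetch Blocks
Name Node to Data Nodes: Control / Monitoring
Data Nodes: blocks 1, 2, 3, 4, each stored as replicated copies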
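The write and read paths above can be mimicked with a small in-memory model. This is a hedged sketch, not the HDFS API: the class `ToyCluster`, its round-robin placement rule, the tiny block size and the replication factor of 2 are all assumptions made so the example stays short.

```python
class ToyCluster:
    """In-memory stand-in for a name node plus a few data nodes (not real HDFS)."""

    def __init__(self, node_count=3, block_size=4, replication=2):
        self.block_size = block_size
        self.replication = replication
        # One dict per data node: block_id -> bytes.
        self.data_nodes = [dict() for _ in range(node_count)]
        # "Name node" metadata: filename -> list of (block_id, [node indexes]).
        self.metadata = {}

    def write(self, filename, data):
        # Step 1: create metadata, deciding where every replica will live.
        placements = []
        for offset in range(0, len(data), self.block_size):
            block_no = offset // self.block_size
            nodes = [(block_no + r) % len(self.data_nodes) for r in range(self.replication)]
            placements.append((f"{filename}#{block_no}", nodes))
        self.metadata[filename] = placements
        # Step 2: put each block on every data node chosen for it.
        for block_no, (block_id, nodes) in enumerate(placements):
            start = block_no * self.block_size
            for node in nodes:
                self.data_nodes[node][block_id] = data[start:start + self.block_size]

    def read(self, filename, failed_nodes=()):
        # Step 1: get metadata. Step 2: fetch each block from any live replica.
        chunks = []
        for block_id, nodes in self.metadata[filename]:
            live = [n for n in nodes if n not in failed_nodes]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            chunks.append(self.data_nodes[live[0]][block_id])
        return b"".join(chunks)


cluster = ToyCluster(node_count=3, block_size=4, replication=2)
cluster.write("notes.txt", b"hello big data")
restored = cluster.read("notes.txt")
restored_after_failure = cluster.read("notes.txt", failed_nodes={0})
print(restored, restored_after_failure)
```

Because every block lives on two nodes in this sketch, the read still succeeds when one node is marked as failed, which is the resilience the replicated blocks in the diagram are meant to provide.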
13. Why is Big Data needed?
VOLUME
Exponential growth; 2x in 2 yrs
PB (1000 TB) is now common
VELOCITY
Event streams; never at rest
640k GB per internet minute
VARIETY
100s of data sources
85% not in a table
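The "2x in 2 yrs" growth claim implies a simple formula: volume after t years is the starting volume times 2^(t/2). A small sketch of that arithmetic follows; the 100 TB starting point is a hypothetical figure chosen for illustration, and the function names are not from the deck.

```python
import math

def projected_volume(start_tb, years, doubling_years=2.0):
    """Volume after `years`, assuming it doubles every `doubling_years`."""
    return start_tb * 2 ** (years / doubling_years)

def years_to_reach(start_tb, target_tb, doubling_years=2.0):
    """Years needed to grow from start_tb to target_tb at the same doubling rate."""
    return doubling_years * math.log2(target_tb / start_tb)

# Hypothetical starting point: 100 TB today.
volume_in_6_years = projected_volume(100, 6)   # three doublings
years_to_1_pb = years_to_reach(100, 1000)      # 1 PB = 1000 TB, as on the slide
print(f"{volume_in_6_years:.0f} TB after 6 years; 1 PB reached in {years_to_1_pb:.1f} years")
```

Under these assumptions a 100 TB store passes the petabyte mark in under seven years without any change in how the business operates, only through compounding.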
14. Where in the Value Chain?
Generation Transport Knowledge Output Value
BIG DATA SCIENCE
Straddles all four Challenge Areas
25. LAST WORDS OF WISDOM
TIME VALUE OF DATA
KNOWLEDGE IS POWER
NOT ALL ROADS LEAD TO ROME
I AM AN INDIVIDUAL
26. “The price of light is far less than the cost of darkness”
Editor's notes
COST – 20x less per TB vs. Teradata, Netezza, Oracle
– 75% less average marginal cost per capacity
SPEED – 10x faster than Teradata, Netezza
AGILITY – 115% lower average cost per data source vs. Oracle
SCIENCE – Machine learning, prediction
WHAT - What is Big Data Science?
WHY - Why is it needed?
WHERE - Where is it being used?
HOW - How will it evolve?
TIME VALUE - Yesterday’s data is less valuable than today’s data
- Today’s data together with its history is more valuable than today’s data alone
POWER - Getting from unknown unknowns to known unknowns or known knowns is powerful
LEAD TO ROME - Exploration with no direct business impact is not a bad thing
INDIVIDUAL - Treat and analyse every customer as an individual, not as an aggregate
- Aggregate only individual insights
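The last note, analysing individuals first and aggregating only their insights, can be illustrated with a tiny example. The customers and spend figures below are invented for the sketch, and `trend` is a deliberately simple stand-in for whatever per-customer analysis would be used in practice.

```python
from statistics import mean

# Hypothetical monthly spend per customer (illustrative values only).
spend = {
    "alice": [10, 20, 30],
    "bob": [30, 20, 10],
    "carol": [20, 20, 20],
}

def trend(values):
    """Individual insight: change from the first to the last observation."""
    return values[-1] - values[0]

# Aggregate first: the monthly averages are flat and hide every individual story.
monthly_average = [mean(month) for month in zip(*spend.values())]

# Individual first: derive one insight per customer, then aggregate the insights.
individual_trends = {name: trend(values) for name, values in spend.items()}
rising = sorted(name for name, t in individual_trends.items() if t > 0)
falling = sorted(name for name, t in individual_trends.items() if t < 0)

print("aggregate view :", monthly_average)
print("individual view:", {"rising": rising, "falling": falling})
```

The aggregate view suggests nothing is happening, while the individual view shows one customer growing and one leaving, which is the distinction the note is drawing.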