Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
5. Big Data – A Misnomer
• Misleading, invites quick assumptions
• Current challenges are driven by many things, not just the size of data
• ANY company can use the Big Data principles to improve specific business metrics
  • Increased data retention
  • Access to all the data
  • Machine learning for pattern detection, recommendations
• But what has happened to cause all this?
6. Explosive Data Growth
[Chart: gigabytes of data created (in billions), 2005–2015, split into structured and unstructured data]
1.8 trillion gigabytes of data was created in 2011…
§ More than 90% is unstructured data
§ Approx. 500 quadrillion files
§ Quantity doubles every 2 years
Source: IDC 2011
7. The ‘Big Data’ Phenomenon
Big Data Drivers:
§ The proliferation of data capture and creation technologies
§ Increased “interconnectedness” drives consumption (creating more data)
§ Inexpensive storage makes it possible to keep more, longer
§ Innovative software and analysis tools turn data into information
Big Data encompasses not only the content itself, but how it’s consumed:
More Devices → More Consumption → More Content → New & Better Information
§ Every gigabyte of stored content can generate a petabyte or more of transient data*
§ The information about you is much greater than the information you create
*Source: IDC 2011
8. The Current Solutions
[Chart: gigabytes of data created (in billions), 2005–2015; structured data is roughly 10% of the total]
Current database solutions are designed for structured data.
§ Optimized to answer known questions quickly
§ Schemas dictate form/context
§ Difficult to adapt to new data types and new questions
§ Expensive at petabyte scale
9. Data Management Strategies Have Stayed the Same
• Raw data on SAN, NAS and tape
• Data moved from storage to compute
• Relational models with predesigned schemas
10–13. Too Much Data, Too Many Sources
• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
• Archived data is lost
14–17. Can’t Use It the Way You Want To
• Analysis and processing takes too long
• Data exists in silos
• Can’t ask new questions
• Can’t analyze unstructured data
18. The Big Data Challenge
Big Data contains limitless insights… BUT its VOLUME, VARIETY and VELOCITY DEMAND A NEW APPROACH
Sources: WEB LOGS, SOCIAL MEDIA, TRANSACTIONAL DATA, SMART GRIDS, OPERATIONAL DATA, DIGITAL CONTENT, R&D DATA, AD IMPRESSIONS, FILES
19. Big Data Challenges
• Cost-effectively managing the volume, velocity and variety of data
• Deriving value across structured and unstructured data
• Adapting to context changes and integrating new data sources and types
20. Big Data Solution Requirements
• Cost-effectively manage the volume, variety and velocity of data
• Process and analyze large, complex data sets… quickly
• Flexibly adapt to context changes and new data types
23. Google File System (Storage)
• Foundation of scalable, fail-safe, self-healing storage
• One central place of truth
• Cost-effective hardware finally available
  • 19” rack servers with a decent amount of disk space
• Handling of failures built in
  • Components or entire servers
  • At scale there are always hardware faults
• Simple file system interface
• Finally no need for expensive, proprietary systems
24. MapReduce (Processing)
• First take on a distributed data processing framework
• Same concepts as Google File System, i.e.
  • Fail-safe and scalable
• Handles a wide range of data processing problems
  • BUT not all of them (more later)
• Simple API reading and writing key/value pairs
• Framework handles the heavy task of data movement
• Core concept is data locality, heavy I/O
  • Brings code to data, not the opposite (i.e. no HPC)
• Accessible in many programming languages
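The key/value API above can be sketched as a single-process simulation (function names are illustrative, not Google's actual API; a real MapReduce run distributes the shuffle across machines):

```python
from collections import defaultdict

# Minimal in-memory sketch of the MapReduce programming model:
# the user supplies a map and a reduce function over key/value pairs,
# the framework handles grouping (the "shuffle") between the phases.

def map_fn(_, line):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    shuffled = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            shuffled[k].append(v)   # shuffle: group values by key
    # Reduce phase: apply reduce_fn once per distinct key.
    results = {}
    for k, values in shuffled.items():
        for rk, rv in reduce_fn(k, values):
            results[rk] = rv
    return results

counts = map_reduce(
    [(0, "big data is big"), (1, "data locality")],
    map_fn, reduce_fn)
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The framework owns everything between the two user functions, which is exactly the "heavy task of data movement" the slide refers to.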
25. BigTable (Random Access)
• Adds database-like random access to data
• Effectively a key/value store with table semantics
• Used for small data points
  • Usually less than a megabyte per key/value
• Forfeits advanced concepts for ease of scalability
  • No transactions, no query language
• Powers many applications at Google
• Uses Google File System as the storage layer
• Tight integration with MapReduce for batch processing
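"A key/value store with table semantics" can be made concrete with a toy in-memory model (class and data are hypothetical; real BigTable persists sorted, versioned cells on GFS):

```python
import bisect

# Toy sketch of the BigTable data model: a sorted map from
# (row, column, timestamp) to an uninterpreted byte string.
# No transactions and no query language -- just puts, point gets,
# and range scans over contiguous rows.

class ToyTable:
    def __init__(self):
        self._rows = {}       # row key -> {column -> [(ts, value), ...]}
        self._row_keys = []   # row keys kept sorted, for range scans

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column):
        # Return the most recent version of one cell, or None.
        versions = self._rows.get(row, {}).get(column)
        return max(versions)[1] if versions else None

    def scan(self, start, end):
        # Range scan over sorted rows -- the access pattern the
        # on-disk layout is optimized for.
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, end)
        return [(r, self._rows[r]) for r in self._row_keys[lo:hi]]

t = ToyTable()
t.put("com.example/index", "anchor:home", b"Home", ts=1)
t.put("com.example/index", "anchor:home", b"Homepage", ts=2)
print(t.get("com.example/index", "anchor:home"))  # b'Homepage'
```

Note how "forfeiting advanced concepts" keeps the model trivially shardable: rows can be split into contiguous ranges and served by different machines.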
26. Dremel, Tenzing, Pregel (Query API)
• Dremel adds a specific file format and query language
  • Used for highly selective queries, data exploration
  • File layout is optimized for very effective scanning
  • Runs alongside MapReduce and the file system
• Tenzing adds SQL over various data sources
  • Can query raw files, Dremel files, or BigTable data, etc.
  • Brings a “known” paradigm to stored data
• Pregel adds a graph processing API
27. Percolator, Megastore (Transactions)
• Additions to BigTable to add “missing” features
• Percolator uses BigTable to update the search index incrementally, which needs transactions
  • Distributes updates with multi-phase commits
• Megastore drives Google App Engine, also adding transactions for the user API
  • Uses ranges of rows as entity groups
  • Reduces locking to small subsets
  • Optimistic, roll-forward only transactions
  • Java layer over the BigTable API
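The optimistic, multi-phase commit idea can be sketched with a versioned store (names and structure are illustrative, not Google's actual protocol): writers record the versions they read, then commit only if none of those cells changed in the meantime.

```python
# Hedged sketch of optimistic concurrency over a versioned key/value
# store. Phase 1 validates the read set; phase 2 rolls the writes
# forward. There is no locking and no rollback of committed state.

class VersionedStore:
    def __init__(self):
        self.data = {}      # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit(self, reads, writes):
        # Phase 1 (validate): abort if any cell we read has changed.
        for key, seen_version in reads.items():
            if self.data.get(key, (0, None))[0] != seen_version:
                return False
        # Phase 2 (roll forward): apply all writes, bumping versions.
        for key, value in writes.items():
            version = self.data.get(key, (0, None))[0]
            self.data[key] = (version + 1, value)
        return True

store = VersionedStore()
store.data["balance"] = (1, 100)
v, value = store.read("balance")
ok = store.commit(reads={"balance": v}, writes={"balance": value - 30})
print(ok)                        # True
print(store.read("balance")[1])  # 70
# A second commit based on the now-stale version fails:
print(store.commit(reads={"balance": v}, writes={"balance": 0}))  # False
```

Conflicts surface as failed commits that the client simply retries, which is why the approach works well when entity groups keep contention small.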
28. Spanner, F1 (World-Wide Data)
• Future of Google’s distributed storage and processing system
• Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database
  • Replicates across datacenters
  • Uses TrueTime (atomic clocks) for synchronization
  • Uses Colossus for storage (a GFS successor)
• F1 replaced MySQL for the AdWords service
  • SQL over data stored in Spanner
  • Colocated with Spanner processes
30. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
ü Scalable
ü Fault tolerant
ü Distributed
Has the Flexibility to Store and Mine Any Type of Data
§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema
Excels at Processing Complex Data
§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks
Scales Economically
§ Can be deployed on commodity hardware
§ Open source platform guards against vendor lock-in
CORE HADOOP SYSTEM COMPONENTS
§ Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
§ MapReduce/YARN: distributed computing framework
31. Core Hadoop: HDFS
Self-healing, high-bandwidth clustered storage.
[Diagram: a file split into blocks 1–5, each block stored on three different nodes]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
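The block-and-replica scheme can be sketched in a few lines (placement policy and sizes are simplified; real HDFS defaults to 128 MB blocks, 3 replicas, and rack-aware placement):

```python
# Sketch of how HDFS splits a file into fixed-size blocks and places
# each block on several nodes for redundancy.

BLOCK_SIZE = 4      # bytes here; ~128 MB in a real cluster
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the byte stream into fixed-size chunks; the last may be short.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Round-robin placement across nodes; real HDFS is rack-aware.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(blocks, nodes)
print(len(blocks))   # 4
print(placement[0])  # ['node1', 'node2', 'node3']
```

Losing any single node leaves at least two replicas of every block, which is the basis of the "self-healing" claim: the NameNode re-replicates under-replicated blocks in the background.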
32. Core Hadoop: MapReduce
Distributed computing framework.
[Diagram: the same blocks 1–5, processed in place on the nodes that store them]
Processes large jobs in parallel across many nodes and combines the results.
33. Why Hadoop Was Created
Exploding Data Volumes & Types Driving the Need for a Flexible, Scalable Solution
BIG DATA: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files
• Any Kind
• From Any Source
• Structured & Unstructured
• At Scale
HARD PROBLEMS
• Deep Analysis
• Exhaustive & Detailed
• Sophisticated Algorithms
• Generate Results Quickly
NEW OPPORTUNITIES
• Extract More Value
• From More Data
• More Cost Effectively
• With Greater Flexibility
It’s difficult to handle data this diverse, at this scale. Traditional platforms can’t keep pace.
New opportunities to derive value from all your data.
34. The Core Values of Hadoop
A platform for…
1.
§ Designed to store and process data at petabyte scale
§ Scale-out architecture increases capacity and processing power linearly
§ Perform operations in parallel across the entire cluster
2.
§ Store data in any format – free from rigid schemas
§ Define context at the time you ask the question
§ Process and analyze data using virtually any programming language
3.
§ Build out your cluster on your hardware of choice
§ Open source software guards against vendor lock-in
§ Wide integration ensures investment protection
41. Motivations to Change MR1
• Scaling
  • >4000 nodes
  • Fewer, larger clusters
  • No single source of truth, data in “silos” again
• HA of Job Tracker difficult
  • Large, complex state
• Poor resource utilization
  • Slots in MR1 are for either map or reduce
43. Split of Responsibilities
The Job Tracker is split into a Resource Manager and an Application Master:
• Resource Manager
  • One per cluster
  • Long-lived
  • App-level
• Application Master
  • One per app instance
  • Short-lived
  • Task-level scheduling and monitoring
44. Fine-grained Resource Control
• Node Manager is a generalized Task Tracker
• Task Tracker
  • Fixed number of map and reduce slots
• Node Manager
  • Containers with variable resource limits
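The difference between typed slots and variable containers can be shown with a toy scheduler (class names and numbers are made up for illustration):

```python
# Contrast sketch: MR1's fixed map/reduce slots vs. YARN's containers
# with variable resource limits.

class Mr1TaskTracker:
    def __init__(self, map_slots, reduce_slots):
        self.free = {"map": map_slots, "reduce": reduce_slots}

    def launch(self, kind):
        # Slots are typed: idle map slots cannot run reduce tasks.
        if self.free[kind] == 0:
            return False
        self.free[kind] -= 1
        return True

class YarnNodeManager:
    def __init__(self, memory_mb):
        self.free_mb = memory_mb

    def launch(self, container_mb):
        # Any task type fits, as long as its resource request does.
        if container_mb > self.free_mb:
            return False
        self.free_mb -= container_mb
        return True

tt = Mr1TaskTracker(map_slots=2, reduce_slots=0)
print(tt.launch("reduce"))  # False: map capacity sits idle, wasted

nm = YarnNodeManager(memory_mb=4096)
print(nm.launch(3072))      # True: one large container
print(nm.launch(2048))      # False: only 1024 MB left
```

This is the utilization argument from slide 41 in miniature: the MR1 node wastes its map capacity whenever the workload is reduce-heavy, while the YARN node packs whatever fits into its memory budget.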
49. Enterprise Data Evolution
[Chart: amount of data vs. business impact, from RDBMS/EDW (1980s) via Hadoop-optimized infrastructure (2000s) to a next-gen data computing platform and the data-driven organization (2010s); moving right improves operational efficiency, then creates competitive advantage]
• Data collection & reporting
• Process data faster
• Store data more cost-effectively
• Simplify infrastructure
• Combine data from across the business
• Ask new questions immediately
• Enable new real-time applications
50. Playing Catchup
• Improve overall performance
  • Google’s code is kernel modules and C++, as low-level as possible
  • Hadoop is Java, for ease of development in open source
  • Maybe rewrite parts of the stack?
  • Overall goal: saturate machine specs (I/O, CPU, RAM)
• Add missing features
  • Everything is based on “hearsay”, aka research papers and presentations
  • Add what is necessary, or for the sake of it?
51. Further Extend or Invent?
• YARN is a good example of what can be done
• Look at every component and evaluate
• Work with research, universities and companies to drive new development
• What else can be done with all that data?
53. From Framework to Platform to Commodity
• Hadoop distributions are already a commodity
• Move up the stack to reach the commercial space
• Simplify data processing
  • Continuuity
  • WibiData (Kiji)
  • Cloudera CDK
• Pure Hadoop solutions
  • Datameer
  • Platfora