The Art of Big Data

The road lies plain before
me;--'tis a theme
Single and of
determined bounds; …
- Wordsworth, The Prelude

Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Postgraduate School
Nov 29, 2011

Agenda
o  To cover the broad picture
o  Understand the waypoints &
o  Drill down into one area (NOSQL)
o  Can do others later …
o  Of the Big Data domain …

[Diagram: What is Big Data ? → Big Data to smart data → Big Data Pipeline (Storage - NOSQL; Processing - Hadoop; Analytics/Modeling - R; Analytic Algorithms; Visualization; …)]

Thanks to …
The giants whose
shoulders I am
standing on

Special
Thanks
to:

Peter
Ateshian,
NPS

Prof
Murali
Tummala,
NPS

Shirley
Bailes,O’Reilly

Ed
Dumbill,O’Reilly

Jeﬀ
Barr,AWS

Jenny
Kohr
Chynoweth,AWS

When I think of my own native land,
In a moment I seem to be there;

But, alas! recollection at hand

Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk

What is Big Data ?
“Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter

“Big data” is less about size, more about flow & velocity - persisting petabytes per year is easier than processing terabytes per hour. @twitter

Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

What is Big Data ?

Vinod Khosla’s Cool Dozen!
  Consumers : “Widespread innovation in
technologies that reduce data overload for
users” ~ Data Reduction

  Businesses : “Simple solutions to handle
the deluge of data generated from various
sources …” ~ Big Data Analytics

TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …

Ref:
http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/

EBC322

  Volume

o  Scale

  Velocity

o  Data change rate vs. decision window

  Variety
o  Different sources & formats
o  Structured vs. Unstructured

  Variability
o  Breadth of interpretation &
o  Depth of analytics

  Contextual
o  Dynamic variability
o  Recommendation

  Connectedness

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf

I.  Two Main Types – based on collection
i.  Big Data Streams
o  Data in “motion”
o  Twitter fire hose, Facebook, G+
ii.  Big Data Logs
o  Data “at rest”
o  Logs, DW, external market data, POS, …
II.  Typically, Big Data has a non-deterministic angle as well …
o  Creative Discovery
o  Iterative, Model based Analytics
o  Explore questions to ask
III.  Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models
o  Model Driven
o  Declaratively Interactive

http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data
http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor

AWS – 600 Billion objects!

Twitter
§  200 million tweets/day
§  Peak 10,000/second
§  How would you handle the fire hose for social network analytics ?

Zynga
§  “Analytics company, not a gaming company!”
§  Harvests data : 15 TB/day
§  Test new features
§  Target advertising
§  230 million players/month

Storage
§  4 U box = 40 TB, 1 PB = 25 boxes !

http://goo.gl/dcBsQ

•  6 Billion Messages per day
•  2 PB (w/compression) online
•  6 PB w/ replication
•  250 TB/Month growth
•  HBase Infrastructure

50 TB/Day
240 nodes, 84 PB
Path Analysis
A/B Testing
Teradata Installation
Very systematic
Diagram speaks volumes!

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf

•  “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
•  15 TB memory, across 90 IBM 760 servers, in 10 racks
•  1 TB of dataset
•  200 Million pages processed by Hadoop
•  This is a good example of Connected data
–  Contextual w/ variability
–  Breadth of interpretation
–  Analytics depth

http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/

[Diagram: Big Data and Cloud Architecture. Labels: Warehouse-style Applications; Distributed Applications; Storage (Block Store, Object Store, NOSQL); Analytics Parallelism (Map/Reduce, HPC); Web Analytics; Social Media Analytics; Log Analytics; Inference; Social Graph; Recommendation/Inference Engines; Machine Learning (Mahout; Classification, Clustering); Knowledge Graph; Search, Indexing]

“A towel is about the most massively useful thing an
interstellar hitchhiker can have … any man who can
hitch the length and breadth of the Galaxy, rough it …
win through, and still know where his towel is, is clearly
a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Published by Harmony Books in 1979

Big Data to Smart Data

1.  Don’t throw away any data !
2.  Be ready for different ways of organizing the data

Big data to smart data
•  summary

http://goo.gl/fGw7r

Big Data Pipeline

If a problem has no solution, it is not a problem,
but a fact, not to be solved but to be coped with,
over time …
- Peres’s Law

Big Data Pipeline
•  Stages
o  Collect
o  Store
o  Transform & Analyze
o  Model & Reason
o  Predict, Recommend & Visualize
•  Different systems have different characteristics
o  Infrastructure optimization based on application/hardware
attributes correlation (short term)
•  Hadoop, Splunk, internal Dashboard
o  Application performance trends (medium term)
•  Analytics, Modeling,…
o  Product Metrics
•  Feature set vs. usage, what is important to users, stratification
•  Modeling using R, Visualization layers like Tableau

Big Data Pipeline
Ref: http://goo.gl/Mm83k

[Diagram: Big Data pipeline stages plotted against data attributes. Attribute labels: Volume, Velocity, Variability, Variety, Connectedness, Context, Model, Infer-ability. Tool labels: Logs, Scribe, Flume, XML, various other files, …; NOSQL, HDFS, SQL, Hadoop tools, …; Hadoop, Pig, Hive, Dryad, .NET; hand coded programs, R, Mahout, SQL, BI tools, …; internal dashboards, Tableau]

Decomplexify! Contextualize! Network! Reason! Infer!

Build to Fail - “It is working” is not binary

The NOSQL !

I AM monarch of all I survey;
My right there is none to dispute;

From the centre all round to the sea
I am lord of the fowl and the brute

Agenda
•  Opening Gambit
–  NOSQL : Toil, Tears & Sweat !

•  The Pragmas
–  ABCs of NOSQL [ACID, BASE & CAP]

•  The Mechanics
–  Algorithmics & Mechanisms (For reference)

Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/

What is NOSQL Anyway ?
•  NOSQL != NoSQL or NOSQL != (!SQL)
•  NOSQL = Not Only SQL
•  Can be traced back to Eric Evans[2]!
–  You can ask him during the afternoon session!
•  Unfortunate Name, but is stuck now
•  Non Relational could have been better
•  Usually Operational, Definitely Distributed
•  NOSQL has certain semantics – need not stay that way

NOSQL
•  Key Value
–  In-memory: Memcached
–  Disk Based: Redis, Tokyo Cabinet, Dynamo, Voldemort
•  Column: SimpleDB, Google BigTable, HBase, Cassandra, HyperTable, Azure TS
•  Document: CouchDB, MongoDB, Lotus Domino, Riak
•  Graph: Neo4j, FlockDB, InfiniteGraph
Ref: [22,51,52]

When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.

NOSQL Tales from the field
WHAT WORKS

•  Designer Augmenting RDBMS with a Distributed key
Value Store[40 : A good talk by Geir]
•  Invitation only designer brand sales
•  Limited inventory sales – start at 12:00, members have
10 min to grab them. 500K mails every day
•  Keeps brand value, hidden from search
•  Interesting load properties
•  Each item a row in DB-BUY NOW reserves it
–  Can't order more
•  Started out as a Rails app
–  shared nothing
•  Narrow peaks – half of revenue

Christian Louboutin
Effect

•  ½ amz for Louboutin
•  Use Voldemort
•  Inventory, Shopping Cart,
Checkout
•  Partition by prod ID
•  Shared infrastructure – “fog”
not “cloud” - Joyent!
•  In-memory inventory
•  Not afraid of sale anymore!
And SQL DBs are
still relevant !

Typical NOSQL Example Bit.ly
•  Bit.ly URL shortening service, uses MongoDB
•  User, title, URL, hash, labels[I-5], sort by time
•  Scale – ~50M users, ~10K concurrent, ~1.25B shortens
per month
•  Criteria:
–  Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low
cost of ownership
•  Sharded by userid

•  New kind of “dictionary” a word repository, GPS for
English – context, pronunciations, twitter … developer
API
•  Characteristics[I-6,Tony Tam’s presentation]
–  RO-centric, 10,000 reads for every write
–  Hit a wall with MySQL (4B rows)
–  MongoDB read was so good that memcached layer was not
required
–  MongoDB used 4 times MySQL storage
•  Another example :
–  Voldemort – Unified Communications, IP-Phone data stored
keyed off of phone number. Data relatively stable

Large Hadron Collider@CERN
•  DAS is part of giant data management
enterprise (cms)
–  Polyglot Persistence (SQL + NOSQL, Mongo, Couch,
memcache, HDFS, Lustre, Oracle, MySQL, …)
•  Data Aggregation System [I-1,I-2,I-3,I-4]
–  Uses MongoDB
–  Distributed Model, 2-6 PB data
–  Combine info. from different metadata sources, query
without knowing their existence, user has domain
knowledge – but shouldn’t deal with various formats,
interfaces and query semantics
–  DAS aggregates, caches and presents data as JSON
documents – preserving security & integrity

And SQL DBs are
still relevant !

•  Digg
–  RDBMS places more burden on reads than writes[I-8]
–  Looked at NOSQL, selected Cassandra
•  Column oriented, so more structure than key-value
•  Heard from noSQL Boston[http://twitter.com/#search?q=%23nosqllive]
–  Baidu: 120 node HyperTable cluster managing
600TB of data
–  StumbleUpon uses HBase for Analytics
–  Twitter’s Current Cassandra cluster: 45 nodes

•  Adobe is a HBase shop [I-10,I-11,2]
•  Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
•  Dynamic schema & huge number of records[I-5]
•  40 million records in 2008 to 1 billion with 50 ms response
•  NOSQL not mature in 2008, now good enough
•  Prod Analytics: 40 nodes, largest has 100 nodes

•  BBC is a CouchDB shop [I-13]
•  Sweet spot:
•  Multi-master, multi datacenter replication
•  Interactive Mediums
•  Old data to CouchDB
•  Thus free up DB to do work!

•  Cloudkick is a Cassandra shop[I-12]
•  Cloudkick offers cloud management services
•  Store metrics data
•  Linear scalability for write load
•  Massive write performance
•  Memory table & serial commit log
•  Low operational costs
•  Data Structure
–  Metrics, Rolled-up data, Statuses at time slice : all indexed by
timestamp

•  Guardian/UK
–  Runs on Redis[I-14] !
–  “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
–  NOSQL can increase performance of relational data by offloading specific data and tasks

And SQL DBs are still relevant !
"The evil that SQL DBs do lives after them; the good is oft interred with their bones...",

NOSQL at Netflix
•  Netflix is fully in the cloud
•  Uses NOSQL across the globe
•  Customer Profiles, watchlog, usage logging (see next
slide)
–  No multi-record locking
•  No DBA !
•  Easier Schema Changes
•  Less complex, Highly Available data store
•  Joins happen in the applications

http://www.hpts.ws/sessions/nosql-ecosystem.pdf
http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf

21 NOSQL Themes
•  Web Scale
•  Scale Incrementally/continuous growth
•  Oddly shaped & exponentially connected
•  Structure data as it will be used – i.e. read, query
•  Know your queries/updates in advance[96], but you can change them later
•  Compute attributes at run time
•  Create a few large entities with optional parts
–  Normalization creates many small entities
•  Define Schemas in models (not in databases)
•  Avoid impedance mismatch
•  Narrow down & solve your core problem
•  Solve the right problem with the right tool
Ref: [I-8]

21 NOSQL Themes
•  Existing solutions are clunky[1] (in certain situations)
•  Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
•  Distribution & partitioning are built-in NOSQL
•  RDBMS distribution & sharding not fun and is expensive
–  Lose most functionality along the way
•  Data at the center, Flexible schema, Less joins
•  The value of NOSQL is in flexibility as much as it is in “Big Data”

21 NOSQL Themes
•  Requirements[3]
–  Data will not fit in one node
•  And so need data partition/distribution by the system
–  Nodes will fail, but data needs to be safe – replication!
–  Low latency for real-time use
•  Data Locality
–  Row based structures will need to read whole row, even for a column
–  Column based structures need to scan for each row
•  Solution : Column storage with Locality
–  Keep data that is read together, don’t read what you don’t care
•  For example friends – other data
Ref: [3]

ABCs of NOSQL - ACID, BASE & CAP
The woods are lovely, dark, and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
-Frost

CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
[Diagram: triangle with corners Consistency, Availability, Partition]
Which feature to discard depends on the nature of your system[41]

CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
C-A No P → Single DB server, no network partition
Which feature to discard depends on the nature of your system[41]

CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
C-P No A → Block transaction in case of partition failure
Which feature to discard depends on the nature of your system[41]

CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
A-P No C → Expiration based caching, voting majority
Interesting (& controversial) from NOSQL perspective


ABCs of NOSQL
•  ACID
o  Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
•  BASE[35,39]
o  Basically Available Soft state(Scalable) Eventually Consistent
•  CAP[36,39]
o  Consistency, Availability & Partitioning
o  This C is ~A+C
•  i.e. Atomic Consistency[36]

ACID
•  Atomicity
o  All or nothing
•  Consistent
o  From one consistent state to another
•  e.g. Referential Integrity
o  But it is also application dependent on
•  e.g. min account balance
•  Predicates, invariants,…
•  Isolation
•  Durability

CAP Pragmas
•  Preconditions
o  The domain is scalable web apps
o  Low Latency For real time use
o  A small sub-set of SQL Functionality
o  Horizontal Scaling
•  Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
•  Idempotency to consider
o  Updates inc/dec are rarely idempotent
o  Order preserving trx are not idempotent either
o  MVCC is an answer for this (CouchDB)
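
A minimal sketch of why inc/dec updates are rarely idempotent, and one common way to make a retried update safe. The technique shown is de-duplication by request id, which is a swapped-in illustration and not the MVCC approach the slide attributes to CouchDB; `apply_increment`, `Account` and `request_id` are hypothetical names.

```python
def apply_increment(balance, delta):
    """A plain increment: applying it twice changes the result twice."""
    return balance + delta


class Account:
    """Remembers which requests were already applied (assumed scheme, not MVCC)."""

    def __init__(self, balance=0):
        self.balance = balance
        self.applied = set()          # ids of requests already applied

    def apply(self, request_id, delta):
        # A redelivered request with a known id is ignored, so retries are safe.
        if request_id in self.applied:
            return self.balance
        self.applied.add(request_id)
        self.balance += delta
        return self.balance


# A retried plain increment double-counts ...
double_counted = apply_increment(apply_increment(100, 10), 10)

# ... while the id-tracked version does not.
account = Account(100)
account.apply("r1", 10)
account.apply("r1", 10)               # retry of the same request
```

The cost of this design is the extra state (the set of applied ids), which in a real system would need to be persisted and eventually trimmed.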

Consistency
•  Strict Consistency
o  Any read on Data X will return the most recent write on X[42]
•  Sequential Consistency
o  Maintains sequential order from multiple processes (No mention of time)
•  Linearizability
o  Add timestamp from loosely synchronized processes

Consistency
•  Write availability, not read availability[44]
•  Even load distribution is easier in eventually consistent systems
•  Multi-data center support is easier in eventually consistent systems
•  Some problems are not solvable with eventually consistent systems
•  Code is sometimes simpler to write in strongly consistent systems

CAP Essentials – 1 of 3
•  “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
o  C-A No P → Single DB server, no network partition
o  C-P No A → Block transaction in case of partition failure
o  A-P No C → Expiration based caching, voting majority
•  Which feature to discard depends on the nature of your system[41]

CAP Essentials – 2 of 3
•  Yield vs. Harvest[37]
o  Yield → Probability of completing a request
o  Harvest → Fraction of data reflected in the response
•  Some systems tolerate < 100% harvest (e.g. search i.e. approximate answers OK) others need 100% harvest (e.g. Trx i.e. correct behavior = single well defined response)
•  For sub-systems that tolerate harvest degradation, CAP makes sense
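
To make harvest concrete, here is a small sketch of a scatter-gather search that keeps answering when a partition is unreachable and reports how much of the data the answer reflects. The `search` function and the use of `None` to model an unreachable shard are assumptions made for illustration only.

```python
def search(shards, term):
    """Query every shard; skip unreachable ones instead of failing the request.

    Returns (hits, harvest), where harvest is the fraction of shards
    whose data is reflected in the response.
    """
    hits = []
    answered = 0
    for shard in shards:
        if shard is None:             # None models an unreachable partition
            continue
        answered += 1
        hits.extend(doc for doc in shard if term in doc)
    harvest = answered / len(shards) if shards else 0.0
    return hits, harvest


shards = [["big data", "nosql"], None, ["big table"], ["hadoop"]]
hits, harvest = search(shards, "big")
```

The request still completes (yield is preserved) while harvest drops to 3 of 4 shards, which is acceptable for search but not for a transaction.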

CAP Essentials – 3 of 3
•  Trading Harvest for yield – AP
•  Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o  Hence NotOnly SQL not No SQL
o  Intelligent homing to tolerate partition failures[44]
o  Multi zones in a region (150 miles - 5 ms)
o  Twitter tweets in Cassandra & MySQL
o  BBC using MongoDB for offloading DBMS
o  Polyglot persistence at LHC@CERN

CAP Essentials – 3 of 3
•  Trading Harvest for yield – AP
•  Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o  Hence NotOnly SQL not No SQL
o  Intelligent homing to tolerate partition failures[44]
o  Multi zones in a region (150 miles - 5 ms)
o  Twitter tweets in Cassandra and MySQL
o  BBC using MongoDB for offloading DBMS
o  Polyglot persistence at LHC@CERN

Most important point in the whole presentation

Eventual Consistency & AMZ
•  Distribution Transparency[38]
•  Larger distributed systems, network partitions are given
•  Consistency Models
o  Strong
o  Weak
•  Has an inconsistency window between update and guaranteed view
o  Eventual
•  If no new updates, all will see the value, eventually

Eventual Consistency & AMZ
•  Guarantee variations[38]
o  Read-Your-writes
o  Session consistency
o  Monotonic Read consistency
•  Access will not return previous value
o  Monotonic Write consistency
•  Serialize write by the same process
•  Guarantee order (vector clocks, mvcc)
o  Example : Amz Cart merger (let cart add even with partial failure)
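
The ordering guarantee and the cart merger can be sketched with vector clocks. This is an illustrative model only, not Amazon's implementation: the function names, the dict-based clock and the set-union cart merge are all assumptions.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify clock a relative to clock b."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"               # neither saw the other: needs reconciliation


def merge_clocks(a, b):
    """Component-wise maximum, used once the two versions are reconciled."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


def merge_carts(cart_a, cart_b):
    """Semantic reconciliation for the cart example: union, so no add is lost."""
    return cart_a | cart_b


base = increment({}, "x")             # first write, handled by node x
v1 = increment(base, "y")             # one client adds an item via node y
v2 = increment(base, "z")             # another adds an item via node z
merged_clock = merge_clocks(v1, v2)
merged_cart = merge_carts({"book"}, {"pen"})
```

Taking the union means an "add to cart" always survives a partial failure; the known trade-off of this choice is that a removed item can reappear after a merge.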

Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
o  Until Feb 2010, SimpleDB only supported eventual consistency i.e. GetAttributes after PutAttributes might not be the same for some time (1 second)
o  On Feb 24, AWS Added ConsistentRead=True attribute for read
o  Read will reflect all writes that got 200OK till that time!

Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
o  Also added conditional put/delete
o  Put if attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
o  Same conditional check capability for delete also
o  Only on one attribute !
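
The conditional put can be pictured with a tiny in-memory model. This is not the SimpleDB API or client: `ItemStore`, `ConditionalCheckFailed` and the keyword arguments are hypothetical names that only mirror the idea of checking one expected attribute before writing.

```python
class ConditionalCheckFailed(Exception):
    """Raised when the expected condition does not hold (hypothetical name)."""


class ItemStore:
    """In-memory stand-in for an item/attribute store (illustration only)."""

    def __init__(self):
        self.items = {}               # item name -> {attribute name: value}

    def put(self, item, attributes, expected_name=None,
            expected_value=None, expected_exists=None):
        current = self.items.get(item, {})
        if expected_name is not None:  # the condition covers one attribute only
            present = expected_name in current
            if expected_exists is not None and present != expected_exists:
                raise ConditionalCheckFailed(expected_name)
            if expected_value is not None and current.get(expected_name) != expected_value:
                raise ConditionalCheckFailed(expected_name)
        self.items.setdefault(item, {}).update(attributes)


store = ItemStore()
# Create only if no version attribute exists yet.
store.put("cart1", {"version": "1", "qty": "2"},
          expected_name="version", expected_exists=False)
# Update only if nobody changed the version since we read it.
store.put("cart1", {"version": "2", "qty": "3"},
          expected_name="version", expected_value="1")
try:
    # A stale writer still expects version 1 and is rejected.
    store.put("cart1", {"version": "2", "qty": "9"},
              expected_name="version", expected_value="1")
    stale_write_rejected = False
except ConditionalCheckFailed:
    stale_write_rejected = True
```

Using a version attribute as the single checked attribute gives optimistic concurrency control without multi-record locking.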

Eventual Consistency & AMZ – S3
•  S3 is an eventual consistency system
o  Versioning
o  “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
o  Repair Lost redundancy, repair bit-rot
o  Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
•  Approx 1/3rd less
o  CloudFront for caching

!SQL ?
•  “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
•  “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
•  “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”

Further deliberation
•  Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberations

NOSQL Internals & Algorithmics

Caveats
•  A representative subset of the mechanics and mechanisms used in the NOSQL world
•  Being refined & newer ones are being tried
•  At a system level – to show how the techniques play a part to deliver a capability
•  The NOSQL Papers and other references for further deliberation
•  Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …

NOSQL Mechanics
•  Horizontal Scalability
–  Gossip (Cluster membership)
–  Failure Detection
–  Consistent Hashing
–  Replication Techniques
•  Hinted Handoff
•  Merkle Trees
–  Sharding MongoDB
–  Regions in HBase
•  Performance
–  SStables/memtables
–  LSM w/Bloom Filter
•  Integrity/Version reconciliation
–  Timestamps
–  Vector Clocks
–  MVCC
–  Semantic vs. syntactic reconciliation

Consistent Hashing
•  Origin: web caching “To decrease ‘hot spots’”
•  Three goals[87]
–  Smooth evolution
•  When a new machine joins, minimum rebalance work and impact
–  Spread
•  Objects assigned to a min number of nodes
–  Load
•  # of distinct objects assigned to a node is small

Consistent Hashing
•  Hash Keyspace/Token is divided into partitions/ranges
•  Cassandra – choice
–  OrderPreserving partitioner – key = token (for range queries)
–  Also saw a CollatingOrderPreservingPartitioner
•  Partitions assigned to nodes that are logically arranged in a circle topology
•  Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load
•  Cassandra – monitor load & distribute
•  Specific join & leave protocols
•  Replication – next 3 consecutive
•  Cassandra – Rack-aware, Datacenter-aware
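
A minimal sketch of the ring described above: each node owns several points on a circle, a key belongs to the first point clockwise from its token, and replicas go to the next distinct nodes. `HashRing`, `token`, the MD5 hash and the default of 8 points per node are illustrative assumptions, not the Dynamo or Cassandra partitioner.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=8, replicas=3):
        self.points_per_node = points_per_node
        self.replicas = replicas
        self._ring = []               # sorted list of (token, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """A joining node takes several points, so it takes over only small ranges."""
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (token(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(t, n) for t, n in self._ring if n != node]

    def preference_list(self, key):
        """Walk clockwise from the key's token and collect distinct nodes."""
        if not self._ring:
            return []
        distinct_nodes = len({n for _, n in self._ring})
        wanted = min(self.replicas, distinct_nodes)
        i = bisect.bisect_right(self._ring, (token(key), ""))
        result = []
        while len(result) < wanted:
            node = self._ring[i % len(self._ring)][1]
            if node not in result:
                result.append(node)
            i += 1
        return result


ring = HashRing(["A", "B", "C", "D"])
keys = [f"key{i}" for i in range(200)]
owner_before = {k: ring.preference_list(k)[0] for k in keys}
ring.add("E")
owner_after = {k: ring.preference_list(k)[0] for k in keys}
moved = [k for k in keys if owner_before[k] != owner_after[k]]
```

The "smooth evolution" goal shows up in `moved`: every key that changes owner moves to the new node, and no key is shuffled between the existing nodes.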

Consistent Hashing - Hinted-handoff
•  What happens when a node is not available ?
–  May be under load
–  May be network partition
•  Sloppy Quorum & Hinted-handoff
•  R/W performed on the 1st n healthy nodes
•  Replica sent to a host node with hint in metadata & then transferred when the actual node is up
•  Burdens neighboring nodes
•  Cassandra 0.6.2 default is disabled (I think)
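
The hinted-handoff flow can be sketched as follows. This is a toy model under stated assumptions: `Node`, `write`, `handoff` and the way a stand-in node is picked are hypothetical and far simpler than what Dynamo or Cassandra do.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}                # key -> value held as a real replica
        self.hints = []               # (intended owner name, key, value)


def write(preference_list, fallbacks, key, value):
    """Sloppy-quorum write: a healthy stand-in keeps a hint for each down owner."""
    spare = list(fallbacks)
    for owner in preference_list:
        if owner.up:
            owner.data[key] = value
            continue
        stand_in = next((n for n in spare if n.up), None)
        if stand_in is not None:
            spare.remove(stand_in)
            stand_in.hints.append((owner.name, key, value))


def handoff(stand_in, nodes_by_name):
    """Deliver stored hints to owners that are back up; keep the rest."""
    remaining = []
    for owner_name, key, value in stand_in.hints:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner_name, key, value))
    stand_in.hints = remaining


a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
b.up = False
write([a, b, c], [d], "k", "v")       # b is down, so d holds a hint for it
hints_while_down = list(d.hints)
b.up = True
handoff(d, {"a": a, "b": b, "c": c, "d": d})
```

The stand-in carries extra writes while the owner is away, which is the "burdens neighboring nodes" cost noted on the slide.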

Consistent Hashing - Replication
•  What happens when a new node joins ?
–  It gets one or more partitions
–  Dynamo : Copy the whole partition
–  Cassandra : Replicate keyset
–  Cassandra : working on a bit torrent type protocol to copy from replicas

Anti-entropy
•  Merge and reconciliation operations
–  Operate on two states and return a new state[86]
•  Merkle Trees
–  Dynamo use of Merkle trees to detect inconsistencies between replicas
–  AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction [91,92]
–  Cassandra uses the Scuttlebutt Reconciliation[86]
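
How a Merkle tree narrows down where two replicas disagree can be sketched in a few lines. This is a generic illustration, not the Dynamo or Cassandra code: `build_merkle`, `differing_leaves`, SHA-256 and the duplicate-last-hash padding rule are assumptions, and both replicas are assumed to cover the same number of key ranges.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_merkle(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(repr(item).encode("utf-8")) for item in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:            # odd count: pair the last hash with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Compare two same-shaped trees top-down; return indexes of differing leaves."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("trees must cover the same number of leaves")
    if not a[0]:
        return []
    suspects = [0]                    # start at the root
    for depth in range(len(a) - 1, 0, -1):
        suspects = [i for i in suspects if a[depth][i] != b[depth][i]]
        children = []
        for i in suspects:            # only descend under hashes that differ
            for child in (2 * i, 2 * i + 1):
                if child < len(a[depth - 1]):
                    children.append(child)
        suspects = children
    return [i for i in suspects if a[0][i] != b[0][i]]


replica_a = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
replica_b = list(replica_a)
replica_b[3] = ("k3", "stale")
out_of_sync = differing_leaves(build_merkle(replica_a), build_merkle(replica_b))
```

When the roots match, the comparison stops after one hash; only subtrees whose hashes differ are explored, so the data exchanged is proportional to the amount of divergence.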

Gossip
•  Membership & Failure detection
•  Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies ![90]
•  Also used for state propagation
–  Used in Dynamo/Cassandra

Gossip
•  Cassandra exchanges heartbeat state, application state and so forth
•  Every second, each node picks a random live node and a random unreachable node and exchanges key-value structures
•  Some nodes play the part of seeds
•  Seed/initial contact points in static conf file (storage.conf file)
•  Could also come from a configuration service like zookeeper
•  To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added
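
A toy simulation of the exchange described above: every node bumps its own heartbeat, picks a peer, and the two merge what they know. The names `merge`, `gossip_round`, `random_peer` and `next_peer`, and the single-integer heartbeat state, are assumptions for illustration; Cassandra's gossiper carries richer state.

```python
import random


def merge(local, remote):
    """Keep, per node, the entry with the highest heartbeat version."""
    merged = dict(local)
    for node, version in remote.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_round(states, choose_peer):
    """Each node bumps its own heartbeat, then exchanges state with one peer."""
    for node in list(states):
        states[node][node] = states[node].get(node, 0) + 1
        peer = choose_peer(node, [n for n in states if n != node])
        combined = merge(states[node], states[peer])
        states[node] = dict(combined)
        states[peer] = dict(combined)


_rng = random.Random(42)


def random_peer(node, others):
    """The usual choice: any other node, picked at random."""
    return _rng.choice(others)


names = ["n0", "n1", "n2", "n3", "n4"]


def next_peer(node, others):
    """Deterministic chooser, used here only so the run is repeatable."""
    return names[(names.index(node) + 1) % len(names)]


states = {n: {n: 0} for n in names}   # at start each node knows only itself
for _ in range(2):
    gossip_round(states, next_peer)
```

With the deterministic chooser, two rounds are enough for every node to learn about every other node; with `random_peer` the spread is probabilistic and typically needs a number of rounds that grows with the logarithm of the cluster size.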

Membership & Failure detection
•  Consensus & Atomic Broadcast - impossible to solve in a distributed system[88,89]
–  Cannot differentiate between a slow system and a crashed system
•  Completeness
–  Every system that crashed will be eventually detected
•  Correctness
–  A correct process is never suspected
•  In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead !

Ø Accrual Failure Detector
•  Not Boolean value but a probabilistic number that “accrues” over an exponential scale
•  Captures the degree of confidence that a corresponding monitored process has crashed[94]
–  Suspicion Level
–  Ø = 1 -> prob(error) 10%
–  Ø = 2 -> prob(error) 1%
–  Ø = 3 -> prob(error) 0.1%
•  If process is dead,
–  Ø is monotonically increasing & Ø→∞ as t→∞
•  If process is alive and kicking, Ø=0
•  Account for lost messages, network latency and actual crash of system/process
•  Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
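
The suspicion level can be written as Ø = -log10(P(next heartbeat arrives later than the time already elapsed)), which yields the 10% / 1% / 0.1% table above. The sketch below assumes exponentially distributed inter-arrival times, a simplification chosen for brevity; the class and method names are hypothetical.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; grows the longer we hear nothing."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(later) = exp(-elapsed / mean), so
        # -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):        # heartbeats one second apart
    detector.heartbeat(t)
phi_right_after = detector.phi(3.0)
phi_one = detector.phi(3.0 + math.log(10))
```

An application then picks its own threshold on Ø instead of a fixed timeout, trading detection speed against the chance of wrongly suspecting a slow node.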

Write/Read Mechanisms
•  Read & Write to a random node (StorageProxy)
•  Proxy coordinates the read and write strategy (R/W = any, quorum et al)
•  Memtables/SSTables from big table
•  Bloom Filter/Index
•  LSM Trees

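[Diagram: write/read path within a node. Write → Commit Logs and MemTable (memory); MemTable is flushed to SSTables (disk), each with an Index and a Bloom Filter (BF); Read consults the MemTable and the SSTables]
Hbase – WAL, Memstore, HDFS File system
SSTable
• Immutable
• Compaction
• Maintain Index & Bloom Filter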
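
The write and read path in the diagram can be sketched as a very small log-structured store. `TinyLSM` and its in-memory lists are stand-ins chosen for illustration; real systems keep the commit log and SSTables on disk and add the index and Bloom filter per SSTable.

```python
class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []          # append-only; stands in for the on-disk log / WAL
        self.memtable = {}            # in-memory and mutable
        self.sstables = []            # oldest first; each an immutable sorted tuple

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. log first, for durability
        self.memtable[key] = value               # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a new immutable, sorted SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log = []      # flushed entries no longer need replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):    # the newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:              # oldest to newest
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)                      # reaches the limit, flushes SSTable 1
db.write("a", 3)
db.write("c", 4)                      # flushes SSTable 2, which shadows the old "a"
db.write("d", 5)                      # stays in the memtable
tables_before_compaction = len(db.sstables)
db.compact()
```

Writes never modify an SSTable in place, which is why writes are fast and why compaction is needed later to discard shadowed values.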

How… does HBase work again?

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/

Bloom Filter
•  The BloomFilter answers the question
•  “Might there be data for this key in this SSTable?” [Ref: Cassandra/Hbase mailer]
–  “Maybe” or
–  “Definitely not”
–  When the BloomFilter says "maybe" we have to go to disk to check out the content of the SSTable
•  Depends on implementation
–  Redone in Cassandra
–  Hbase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
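
A minimal Bloom filter that gives exactly those two answers. The sizes, the use of SHA-256 with a per-hash prefix, and the method names are illustrative assumptions, not the Cassandra or HBase implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        """Derive num_hashes bit positions for a key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means "definitely not"; True only means "maybe"."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(row_key)
```

A "maybe" can be a false positive, so the SSTable still has to be read; a "definitely not" lets the read skip the disk access for that SSTable entirely.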

Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
-Keats, Ode to a Nightingale

•  http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php
•  http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm
•  http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/
•  http://www.forbes.com/special-report/2011/migration.html
•  http://www.mercurynews.com/bay-area-news/ci_19368103
•  http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11
