1. The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude
Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Postgraduate School
Nov 29, 2011
2. What is Big Data ?
Big Data to smart data
Big Data Pipeline
Agenda
o To cover the broad picture
o Understand the waypoints &
o Drill down into one area (NOSQL)
o Can do others later …
o Of the Big Data domain …
[Diagram: Big Data domain map. Labels: Analytics/Modeling (Analytic Algorithms, R); Storage - NOSQL; Processing - Hadoop; Visualization; …]
3. Thanks to …
The giants whose shoulders I am standing on
Special Thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS
4. When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
5. What is Big Data ?
“Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter
“Big data” is less about size, more about flow & velocity - persisting petabytes per year is easier than processing terabytes per hour. @twitter
Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
6. What is Big Data ?
Vinod Khosla’s Cool Dozen!
Consumers : “Widespread innovation in
technologies that reduce data overload for
users” ~ Data Reduction
Businesses : “Simple solutions to handle
the deluge of data generated from various
sources …” ~ Big Data Analytics
TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …
Ref:
http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
7. EBC322
Volume
o Scale
Velocity
o Data change rate vs. decision window
Variety
o Different sources & formats
o Structured vs. Unstructured
Variability
o Breadth of interpretation &
o Depth of analytics
Contextual
o Dynamic variability
o Recommendation
Connectedness
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
12. I. Two Main Types – based on collection
i. Big Data Streams
o Data in “motion”
o Twitter fire hose, Facebook, G+
ii. Big Data Logs
o Data “at rest”
o Logs, DW, external market data, POS, …
II. Typically, Big Data has a non-deterministic angle as well …
o Creative Discovery
o Iterative, Model based Analytics
o Explore questions to ask
III. Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models
o Model Driven
o Declaratively Interactive
http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data
http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor
13. AWS – 600 Billion objects!
Twitter
§ 200 million tweets/day
§ Peak 10,000/second
§ How would you handle the fire hose for social network analytics ?
Zynga
§ “Analytics company, not a gaming company!”
§ Harvests data : 15 TB/day
§ Test new features
§ Target advertising
§ 230 million players/month
Storage
§ 4 U box = 40 TB, 1 PB = 25 boxes !
http://goo.gl/dcBsQ
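The arithmetic behind the Twitter and storage numbers on this slide can be checked with a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB), and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(events_per_day):
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(total_tb, tb_per_box):
    """Number of storage boxes needed, rounded up (ceiling division)."""
    return -(-total_tb // tb_per_box)


# 200 million tweets/day is roughly 2,300 tweets/second on average ...
avg_tweets = average_rate_per_second(200_000_000)
# ... so the quoted peak of 10,000/second is a bit over 4x the average.
peak_to_avg = 10_000 / avg_tweets
# 1 PB (taken as 1000 TB) at 40 TB per 4U box.
boxes_for_1pb = boxes_needed(1000, 40)

print(round(avg_tweets), round(peak_to_avg, 1), boxes_for_1pb)
```

The gap between the average and the peak rate is the reason a stream pipeline is sized for bursts rather than for the daily mean.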
16. • “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM 760 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
– Contextual w/ variability
– Breadth of interpretation
– Analytics depth
http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
17. [Diagram: Big Data application areas. Labels: Warehouse-style Applications; Block Store; Distributed Storage; Big Data Applications; Object Store; NOSQL; Analytics Parallelism; Map/Reduce; Web Analytics; HPC; Cloud Architecture; Social Media; Log Analytics; Inference; Social Graph; Recommendation/Inference Engines; Machine Learning; Knowledge Graph; Search, Indexing; Mahout; Classification, Clustering]
18. “A towel is about the most massively useful thing an
interstellar hitchhiker can have … any man who can
hitch the length and breadth of the Galaxy, rough it …
win through, and still know where his towel is, is clearly
a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Published by Harmony Books in 1979
Big Data to Smart Data
19. Big data to smart data • summary
1 Don’t throw away any data !
2 Be ready for different ways of organizing the data
http://goo.gl/fGw7r
20. Big Data Pipeline
If a problem has no solution, it is not a problem,
but a fact, not to be solved but to be coped with,
over time …
- Peres’s Law
21. Big Data Pipeline
• Stages
o Collect
o Store
o Transform & Analyze
o Model & Reason
o Predict, Recommend & Visualize
• Different systems have different characteristics
o Infrastructure optimization based on application/hardware attribute correlation (short term)
• Hadoop, Splunk, internal Dashboard
o Application performance trends (medium term)
• Analytics, Modeling,…
o Product Metrics
• Feature set vs. usage, what is important to users, stratification
• Modeling using R, Visualization layers like Tableau
22. Big Data Pipeline
Ref: http://goo.gl/Mm83k
[Diagram: pipeline tools plotted against the Big Data characteristics Volume, Velocity, Variability, Variety, Connectedness, Context, Model and Infer-ability. Tools shown: Logs, Scribe, Flume, XML, various other files, …; HDFS, NOSQL, SQL, Hadoop tools, …; Hadoop, Pig, Hive, .NET Dryad, SQL, BI Tools, …; hand coded programs, R, Mahout, …; internal dashboards, Tableau]
Decomplexify! Contextualize! Network! Reason! Infer!
23. Build to Fail - “It is working” is not binary
The NOSQL !
I AM monarch of all I survey;
My right there is none to dispute;
From the centre all round to the sea
I am lord of the fowl and the brute
- Cowper, The Solitude Of Alexander Selkirk
24. Agenda
• Opening Gambit
– NOSQL : Toil, Tears & Sweat !
• The Pragmas
– ABCs of NOSQL [ACID, BASE & CAP]
• The Mechanics
– Algorithmics & Mechanisms (For reference)
Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/
25. What is NOSQL Anyway ?
• NOSQL != NoSQL or NOSQL != (!SQL)
• NOSQL = Not Only SQL
• Can be traced back to Eric Evans[2]!
– You can ask him during the afternoon session!
• Unfortunate name, but it has stuck now
• Non Relational could have been better
• Usually Operational, Definitely Distributed
• NOSQL has certain semantics – need not stay that way
26. NOSQL
• Key Value
– In-memory: Memcached
– Disk Based: Redis, Tokyo Cabinet, Dynamo, Voldemort
• Column
– SimpleDB, Google BigTable, HBase, Cassandra, HyperTable, Azure TS
• Document
– CouchDB, MongoDB, Lotus Domino, Riak
• Graph
– Neo4j, FlockDB, InfiniteGraph
Ref: [22,51,52]
27. When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
NOSQL Tales from the field
WHAT WORKS
28. • Designer Augmenting RDBMS with a Distributed key
Value Store[40 : A good talk by Geir]
• Invitation only designer brand sales
• Limited inventory sales – start at 12:00, members have
10 min to grab them. 500K mails every day
• Keeps brand value, hidden from search
• Interesting load properties
• Each item a row in DB-BUY NOW reserves it
– Can't order more
• Started out as a Rails app
– shared nothing
• Narrow peaks – half of revenue
29. Christian Louboutin
Effect
• ½ amz for Louboutin
• Use Voldemort
• Inventory, Shopping Cart,
Checkout
• Partition by prod ID
• Shared infrastructure – “fog” not “cloud” - Joyent!
• In-memory inventory
• Not afraid of sale anymore!
And SQL DBs are
still relevant !
30. Typical NOSQL Example Bit.ly
• Bit.ly URL shortening service, uses MongoDB
• User, title, URL, hash, labels[I-5], sort by time
• Scale – ~50M users, ~10K concurrent, ~1.25B shortens
per month
• Criteria:
– Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low
cost of ownership
• Sharded by userid
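Sharding by a key such as the user id (or partitioning by product ID, as in the earlier example) can be sketched as below. This is a minimal illustration of hash-based shard selection, not how MongoDB or Voldemort actually route requests; `shard_for` and the shard count are hypothetical.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is
    randomized per process for strings and so is not stable.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route a few (made-up) users to 4 shards.
shards = {i: [] for i in range(4)}
for uid in ["alice", "bob", "carol", "dave"]:
    shards[shard_for(uid, 4)].append(uid)

print(shards)
```

All records for one user land on one shard, so per-user queries touch a single node. The weakness of plain modulo placement is that changing the shard count remaps most keys, which is what consistent hashing (covered later in the deck) addresses.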
31. • New kind of “dictionary” a word repository, GPS for
English – context, pronunciations, twitter … developer
API
• Characteristics[I-6,Tony Tam’s presentation]
– RO-centric, 10,000 reads for every write
– Hit a wall with MySQL (4B rows)
– MongoDB read was so good that memcached layer was not
required
– MongoDB used 4 times MySQL storage
• Another example :
– Voldemort – Unified Communications, IP-Phone data stored
keyed off of phone number. Data relatively stable
32. Large Hadron Collider@CERN
• DAS is part of a giant data management enterprise (CMS)
– Polyglot Persistence (SQL + NOSQL, Mongo, Couch, memcache, HDFS, Lustre, Oracle, MySQL, …)
• Data Aggregation System [I-1,I-2,I-3,I-4]
– Uses MongoDB
– Distributed Model, 2-6 PB data
– Combine info. from different metadata sources, query
without knowing their existence, user has domain
knowledge – but shouldn’t deal with various formats,
interfaces and query semantics
– DAS aggregates, caches and presents data as JSON
documents – preserving security & integrity
And SQL DBs are
still relevant !
34. • Digg
– RDBMS places more burden on reads than on writes[I-8]
– Looked at NOSQL, selected Cassandra
• Column oriented, so more structure than key-value
• Heard from noSQL Boston[http://twitter.com/#search?q=%23nosqllive]
– Baidu: 120 node HyperTable cluster managing
600TB of data
– StumbleUpon uses HBase for Analytics
– Twitter’s Current Cassandra cluster: 45 nodes
35. • Adobe is a HBase shop [I-10,I-11,2]
• Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
• Dynamic schema & huge number of records[I-5]
• 40 million records in 2008 to 1 billion with 50 ms response
• NOSQL not mature in 2008, now good enough
• Prod Analytics: 40 nodes, largest has 100 nodes
• BBC is a CouchDB shop [I-13]
• Sweet spot:
• Multi-master, multi datacenter replication
• Interactive Mediums
• Old data to CouchDB
• Thus free up DB to do work!
36. • Cloudkick is a Cassandra shop[I-12]
• Cloudkick offers cloud management services
• Store metrics data
• Linear scalability for write load
• Massive write performance
• Memory table & serial commit log
• Low operational costs
• Data Structure
– Metrics, Rolled-up data, Statuses at time slice : all indexed by
timestamp
37. • Guardian/UK
– Runs on Redis[I-14] !
– “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
– NOSQL can increase performance of relational data by offloading specific data and tasks
And SQL DBs are still relevant !
"The evil that SQL DBs do lives after them; the good is oft interred with their bones...",
38. NOSQL at Netflix
• Netflix is fully in the cloud
• Uses NOSQL across the globe
• Customer Profiles, watchlog, usage logging (see next
slide)
– No multi-record locking
• No DBA !
• Easier Schema Changes
• Less complex, Highly Available data store
• Joins happen in the applications
http://www.hpts.ws/sessions/nosql-ecosystem.pdf
http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf
39.
40. 21 NOSQL Themes
• Web Scale
• Scale Incrementally/continuous growth
• Oddly shaped & exponentially connected
• Structure data as it will be used – i.e. read, query
• Know your queries/updates in advance[96], but you can change them later
• Compute attributes at run time
• Create a few large entities with optional parts
– Normalization creates many small entities
• Define Schemas in models (not in databases)
• Avoid impedance mismatch
• Narrow down & solve your core problem
• Solve the right problem with the right tool
Ref: [I-8]
41. 21 NOSQL Themes
• Existing solutions are clunky[1] (in certain situations)
• Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
• Distribution & partitioning are built into NOSQL
• RDBMS distribution & sharding are not fun and are expensive
– Lose most functionality along the way
• Data at the center, Flexible schema, Fewer joins
• The value of NOSQL is in flexibility as much as it is in “Big Data”
42. 21 NOSQL Themes
• Requirements[3]
– Data will not fit in one node
• And so need data partition/distribution by the system
– Nodes will fail, but data needs to be safe – replication!
– Low latency for real-time use
• Data Locality
– Row based structures will need to read whole row, even for a column
– Column based structures need to scan for each row
• Solution : Column storage with Locality
– Keep data that is read together, don’t read what you don’t care about
• For example friends – other data
Ref: 3
43. ABCs of NOSQL - ACID, BASE & CAP
The woods are lovely, dark, and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
-Frost
44. CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
[Diagram: Consistency, Availability, Partition]
Which feature to discard depends on the nature of your system[41]
45. CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
C-A No P → Single DB server, no network partition
[Diagram: Consistency, Availability, Partition]
Which feature to discard depends on the nature of your system[41]
46. CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
C-P No A → Block transaction in case of partition failure
[Diagram: Consistency, Availability, Partition]
Which feature to discard depends on the nature of your system[41]
47. CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
A-P No C → Expiration based caching, voting majority
Interesting (& controversial) from NOSQL perspective
[Diagram: Consistency, Availability, Partition]
48. ABCs of NOSQL
• ACID
o Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
• BASE[35,39]
o Basically Available Soft state(Scalable) Eventually Consistent
• CAP[36,39]
o Consistency, Availability & Partitioning
o This C is ~A+C
• i.e. Atomic Consistency[36]
49. ACID
• Atomicity
o All or nothing
• Consistent
o From one consistent state to another
• e.g. Referential Integrity
o But it is also application dependent
• e.g. min account balance
• Predicates, invariants,…
• Isolation
• Durability
50. CAP Pragmas
• Preconditions
o The domain is scalable web apps
o Low Latency For real time use
o A small sub-set of SQL Functionality
o Horizontal Scaling
• Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
• Idempotency to consider
o Updates inc/dec are rarely idempotent
o Order preserving trx are not idempotent either
o MVCC is an answer for this (CouchDB)
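The idempotency point can be made concrete with a small sketch. It contrasts a retried increment with a retried "set", and then shows one common workaround, de-duplicating by request id. Note that this workaround is a different technique from the MVCC approach the slide mentions; the `Account` class and its method names are invented for illustration.

```python
def apply_increment(balance, delta):
    """Relative update: applying it twice changes the result."""
    return balance + delta


def apply_set(balance, new_value):
    """Absolute update: applying it twice gives the same result."""
    return new_value


balance = 100
inc_retried = apply_increment(apply_increment(balance, 10), 10)  # retry doubles the effect
set_retried = apply_set(apply_set(balance, 110), 110)            # retry is harmless


class Account:
    """Makes increments safe to retry by remembering request ids."""

    def __init__(self, balance):
        self.balance = balance
        self.seen_requests = set()

    def increment(self, request_id, delta):
        # A replayed request id is ignored, so a retry has no extra effect.
        if request_id not in self.seen_requests:
            self.seen_requests.add(request_id)
            self.balance += delta
        return self.balance


acct = Account(100)
acct.increment("req-1", 10)
acct.increment("req-1", 10)  # duplicate delivery of the same request
print(inc_retried, set_retried, acct.balance)
```

In an eventually consistent system messages may be delivered more than once, which is why the non-idempotent increment needs extra care.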
51. Consistency
• Strict Consistency
o Any read on Data X will return the most recent write on X[42]
• Sequential Consistency
o Maintains sequential order from multiple processes (No mention of time)
• Linearizability
o Add timestamp from loosely synchronized processes
52. Consistency
• Write availability, not read availability[44]
• Even load distribution is easier in eventually consistent systems
• Multi-data center support is easier in eventually consistent systems
• Some problems are not solvable with eventually consistent systems
• Code is sometimes simpler to write in strongly consistent systems
53. CAP Essentials – 1 of 3
• “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
o C-A No P → Single DB server, no network partition
o C-P No A → Block transaction in case of partition failure
o A-P No C → Expiration based caching, voting majority
• Which feature to discard depends on the nature of your system[41]
54. CAP Essentials – 2 of 3
• Yield vs. Harvest[37]
o Yield → Probability of completing a request
o Harvest → Fraction of data reflected in the response
• Some systems tolerate < 100% harvest (e.g. search i.e. approximate answers OK) others need 100% harvest (e.g. Trx i.e. correct behavior = single well defined response)
• For sub-systems that tolerate harvest degradation, CAP makes sense
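Yield and harvest as defined above are simple ratios, and a tiny sketch shows how they trade off. The scenario (a search index split over 10 shards with one shard unreachable) and the function names are assumptions made up for illustration.

```python
def yield_ratio(completed_requests, offered_requests):
    """Yield: fraction of requests that were completed."""
    return completed_requests / offered_requests


def harvest_ratio(shards_answering, total_shards):
    """Harvest: fraction of the data reflected in a response."""
    return shards_answering / total_shards


# Option 1: answer every query from the 9 reachable shards.
# Every request completes, but each answer covers 90% of the data.
degraded_yield = yield_ratio(1000, 1000)
degraded_harvest = harvest_ratio(9, 10)

# Option 2: refuse queries that would need the missing shard
# (assume 1 in 10 does). Answers are complete, fewer requests succeed.
strict_yield = yield_ratio(900, 1000)
strict_harvest = harvest_ratio(10, 10)

print(degraded_yield, degraded_harvest, strict_yield, strict_harvest)
```

Search can usually take option 1, while a transaction system generally has to take option 2, which matches the distinction drawn on the slide.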
55. CAP Essentials – 3 of 3
• Trading Harvest for yield – AP
• Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o Hence NotOnly SQL not No SQL
o Intelligent homing to tolerate partition failures[44]
o Multi zones in a region (150 miles - 5 ms)
o Twitter tweets in Cassandra & MySQL
o BBC using MongoDB for offloading DBMS
o Polyglot persistence at LHC@CERN
56. CAP Essentials – 3 of 3
• Trading Harvest for yield – AP
• Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o Hence NotOnly SQL not No SQL
o Intelligent homing to tolerate partition failures[44]
o Multi zones in a region (150 miles - 5 ms)
o Twitter tweets in Cassandra and MySQL
o BBC using MongoDB for offloading DBMS
o Polyglot persistence at LHC@CERN
Most important point in the whole presentation
57. Eventual Consistency & AMZ
• Distribution Transparency[38]
• In larger distributed systems, network partitions are a given
• Consistency Models
o Strong
o Weak
• Has an inconsistency window between the update and the guaranteed view
o Eventual
• If no new updates, all will see the value, eventually
58. Eventual Consistency & AMZ
• Guarantee variations[38]
o Read-Your-writes
o Session consistency
o Monotonic Read consistency
• Access will not return previous value
o Monotonic Write consistency
• Serialize write by the same process
• Guarantee order (vector clocks, mvcc)
o Example : Amz Cart merger (let cart add even with partial failure)
59. Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
o Until Feb 2010, SimpleDB only supported eventual consistency i.e. GetAttributes after PutAttributes might not be the same for some time (1 second)
o On Feb 24, AWS Added ConsistentRead=True attribute for read
o Read will reflect all writes that got 200OK till that time!
60. Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
o Also added conditional put/delete
o Put attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
o Same conditional check capability for delete also
o Only on one attribute !
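The idea behind a conditional put can be sketched with an in-memory stand-in. This is not the SimpleDB API: `ConditionalStore` and its `expected` / `expect_exists` parameters are hypothetical names that only mirror the "expected value" and "expected exists" checks described above, applied to a single attribute.

```python
class ConditionalStore:
    """Toy attribute store with a check-then-write put on one attribute."""

    def __init__(self):
        self._items = {}

    def put(self, item, attr, value, expected=None, expect_exists=None):
        current = self._items.get(item, {})
        # Existence check: the attribute must (or must not) already exist.
        if expect_exists is not None and (attr in current) != expect_exists:
            return False
        # Value check: the attribute must currently hold the expected value.
        if expected is not None and current.get(attr) != expected:
            return False
        self._items.setdefault(item, {})[attr] = value
        return True

    def get(self, item, attr):
        return self._items.get(item, {}).get(attr)


store = ConditionalStore()
created = store.put("cart-1", "version", "1", expect_exists=False)
updated = store.put("cart-1", "version", "2", expected="1")
stale = store.put("cart-1", "version", "3", expected="1")  # lost the race

print(created, updated, stale, store.get("cart-1", "version"))
```

A writer that gets `False` back re-reads the item and retries, which is the usual optimistic-concurrency loop built on top of a conditional put.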
61. Eventual Consistency & AMZ – S3
• S3 is an eventual consistency system
o Versioning
o “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
o Repair Lost redundancy, repair bit-rot
o Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
• Approx 1/3rd less
o CloudFront for caching
62. !SQL ?
• “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
• “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
• “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”
63. Further deliberation
• Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberation
65. Caveats
• A representative subset of the mechanics and mechanisms used in the NOSQL world
• Being refined & newer ones are being tried
• At a system level – to show how the techniques play a part to deliver a capability
• The NOSQL Papers and other references for further deliberation
• Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …
67. Consistent Hashing
• Origin: web caching “To decrease ‘hot spots’”
• Three goals[87]
– Smooth evolution
• When a new machine joins, minimum rebalance work and impact
– Spread
• Objects assigned to a min number of nodes
– Load
• # of distinct objects assigned to a node is small
68. Consistent Hashing
• Hash Keyspace/Token is divided into partitions/ranges
• Cassandra – choice
– OrderPreserving partitioner – key = token (for range queries)
– Also saw a CollatingOrderPreservingPartitioner
• Partitions assigned to nodes that are logically arranged in a circle topology
• Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load
• Cassandra – monitor load & distribute
• Specific join & leave protocols
• Replication – next 3 consecutive
• Cassandra – Rack-aware, Datacenter-aware
Hashing
-‐
Hinted-‐handoff
• What
happens
when
a
node
is
not
available
?
– May
be
under
load
– May
be
network
parXXon
• Sloppy
Quorum
&
Hinted-‐handoff
• R/W
performed
on
the
1st
n
healthy
nodes
• Replica
sent
to
a
host
node
with
hint
in
metadata
&
then
transferred
when
the
actual
node
is
up
• Burdens
neighboring
nodes
• Cassandra
0.6.2
default
is
disabled
(I
think)
70. Consistent Hashing - Replication
• What happens when a new node joins ?
– It gets one or more partitions
– Dynamo : Copy the whole partition
– Cassandra : Replicate keyset
– Cassandra : working on a bit torrent type protocol to copy from replicas
71. Anti-entropy
• Merge and reconciliation operations
– Operate on two states and return a new state[86]
• Merkle Trees
– Dynamo use of Merkle trees to detect inconsistencies between replicas
– AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction [91,92]
– Cassandra uses the Scuttlebutt Reconciliation[86]
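How a Merkle tree narrows down which ranges differ between two replicas can be sketched as follows. This is an illustrative toy, not Dynamo's or Cassandra's code: it assumes both replicas split their data into the same number of ranges, uses SHA-256, and the helper names are made up.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(ranges):
    """Build the tree bottom-up; levels[0] is the leaves, levels[-1] the root."""
    level = [_h(r) for r in ranges]
    levels = [level]
    while len(level) > 1:
        # Duplicate the last hash when a level has an odd number of nodes.
        padded = level + [level[-1]] if len(level) % 2 else level
        level = [_h(padded[i] + padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def differing_ranges(ranges_a, ranges_b):
    """Indices of ranges that differ, found by descending from the root."""
    la, lb = merkle_levels(ranges_a), merkle_levels(ranges_b)

    def walk(level, idx):
        if la[level][idx] == lb[level][idx]:
            return []          # identical subtree: nothing below needs checking
        if level == 0:
            return [idx]       # a leaf that differs
        out = []
        for child in (2 * idx, 2 * idx + 1):
            if child < len(la[level - 1]):
                out += walk(level - 1, child)
        return out

    return walk(len(la) - 1, 0)


replica_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
replica_b = [b"k1=v1", b"k2=STALE", b"k3=v3"]
out_of_sync = differing_ranges(replica_a, replica_b)
print(out_of_sync)
```

When the roots match, the replicas agree and nothing else is exchanged; otherwise only the mismatching subtrees are followed, so the amount of data compared stays small relative to the data set.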
72. Gossip
• Membership & Failure detection
• Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies ![90]
• Also used for state propagation
– Used in Dynamo/Cassandra
73. Gossip
• Cassandra exchanges heartbeat state, application state and so forth
• Every second, a node picks a random live node and a random unreachable node, and exchanges key-value structures
• Some nodes play the part of seeds
• Seed/initial contact points in static conf file (storage.conf)
• Could also come from a configuration service like zookeeper
• To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added
74. Membership & Failure detection
• Consensus & Atomic Broadcast - impossible to solve in an asynchronous distributed system[88,89]
– Cannot differentiate between a slow system and a crashed system
• Completeness
– Every system that crashed will be eventually detected
• Correctness
– A correct process is never suspected
• In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead !
75. φ Accrual Failure Detector
• Not a Boolean value but a probabilistic number that “accrues” over an exponential scale
• Captures the degree of confidence that a corresponding monitored process has crashed[94]
– Suspicion Level
– φ = 1 -> prob(error) 10%
– φ = 2 -> prob(error) 1%
– φ = 3 -> prob(error) 0.1%
• If process is dead,
– φ is monotonically increasing & φ→∞ as t→∞
• If process is alive and kicking, φ=0
• Account for lost messages, network latency and actual crash of system/process
• Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
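The suspicion level (phi) can be sketched as the negative base-10 log of the probability that a heartbeat would arrive even later than the current silence. The sketch below makes a simplifying assumption that is not in the slides: inter-arrival times are modeled as exponentially distributed, which gives phi = t / (mean × ln 10). The `AccrualDetector` class and its window size are hypothetical.

```python
import math
from collections import deque


class AccrualDetector:
    """Tracks heartbeat inter-arrival times and reports a suspicion level."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        silence = now - self.last_heartbeat
        # P(next heartbeat arrives later than `silence`) = exp(-silence / mean)
        # phi = -log10 of that probability = silence / (mean * ln 10)
        return silence / (mean * math.log(10))


detector = AccrualDetector()
for t in range(5):               # heartbeats at t = 0, 1, 2, 3, 4 seconds
    detector.heartbeat(float(t))

fresh = detector.phi(4.0)                          # just heard from the node
suspicion = detector.phi(4.0 + math.log(10))       # silence of ln(10) mean intervals
print(fresh, suspicion)
```

With this model a phi of 1 means that declaring the node dead now would be wrong about 10% of the time, a phi of 2 about 1%, matching the scale on the slide. The application picks the threshold instead of receiving a hard up/down verdict.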
76. Write/Read Mechanisms
• Read & Write to a random node (StorageProxy)
• Proxy coordinates the read and write strategy (R/W = any, quorum et al)
• Memtables/SSTables from BigTable
• Bloom Filter/Index
• LSM Trees
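77. [Diagram: write and read path on a node. Write → Commit Logs and MemTable (memory); Flushing → SSTables on disk, each with an Index and a Bloom Filter (BF); Read consults the MemTable and the SSTables. HBase – WAL, Memstore, HDFS file system]
SSTable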
• Immutable
• Compaction
• Maintain Index & Bloom Filter
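The write and read path from the diagram can be sketched as a toy log-structured store. This is a simplified illustration, not Cassandra or HBase code: everything lives in Python lists and dicts, the flush threshold is tiny, there is no index or Bloom filter, and the `TinyLSM` name is invented.

```python
class TinyLSM:
    """Commit log + memtable in 'memory', immutable sorted tables on 'disk'."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []     # append-only record of un-flushed writes
        self.memtable = {}       # recent writes, mutable
        self.sstables = []       # flushed tables, oldest first, never modified
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all tables; later tables overwrite older values.
        merged = {}
        for table in self.sstables:
            merged.update(dict(table))
        self.sstables = [tuple(sorted(merged.items()))]


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)   # reaches the limit and flushes the first SSTable
db.write("a", 3)   # newer value for "a" stays in the memtable
print(db.read("a"), db.read("b"), len(db.sstables))
```

Writes only append and update memory, which is where the write performance comes from. Reads may have to look in several tables, which is the cost that the index, the Bloom filter and compaction are there to reduce.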
78. How… does HBase work again?
http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
79. Bloom Filter
• The BloomFilter answers the question
• “Might there be data for this key in this SSTable?” [Ref: Cassandra/Hbase mailer]
– “Maybe" or – “Definitely not“
– When the BloomFilter says "maybe" we have to go to disk to check out the content of the SSTable
• Depends on implementation
– Redone in Cassandra
– Hbase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
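The "maybe" versus "definitely not" behaviour can be sketched with a small Bloom filter. This is a generic textbook version, not the Cassandra or HBase implementation: the bit-array size, the number of hash functions and the way hashes are derived from SHA-256 are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means 'definitely not'; True only means 'maybe'."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ["row-1", "row-2", "row-3"]:
    sstable_filter.add(row_key)

# A key that was added is always reported as "maybe" (no false negatives).
print(sstable_filter.might_contain("row-2"))
```

A "maybe" for a key that was never added is possible (a false positive), which is why the store still goes to disk to check. A "definitely not" lets it skip that SSTable without any disk read.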
80. Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
-Keats, Ode to a Nightingale