6. + Dataset
• Artsy.net:
  • About 26K artworks
  • About 45K artists
  • JSON
• Generated user log:
  • Simulated user “collect” activities
  • Multiplied real user locations
7. + Search Artwork by Title
• Raw JSON of 45K artworks
• Spark – Elasticsearch
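A title search like the one above typically sends a `match` query to Elasticsearch. The sketch below only builds the query body; the index name `artworks` and field name `title` are assumptions for illustration, not taken from the slides.

```python
import json

def title_query(text, size=10):
    """Build an Elasticsearch match-query body for artwork titles."""
    return {
        "size": size,                       # max hits to return
        "query": {"match": {"title": text}} # full-text match on the title field
    }

body = title_query("Water Lilies")
print(json.dumps(body))
```

In practice this body would be POSTed to the search endpoint of the artworks index after Spark has loaded the raw JSON into it.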
10. + Batch Processing Artists' Location
• Spark: processed 507GB of data
11. + Challenges
• Find the right database for your problem
  • Elasticsearch for search
  • Cassandra for time series
• Computers are multilingual – Python, Scala, Java…
• But challenges make life interesting
13. + About Me
• MS & BS in Systems Engineering, University of Virginia
• Machine Learning
• Natural Language Processing
• High Performance Computing
• Enjoy:
  • Sketching
  • Adventures
  • Laughing
  • Life
15. + Batch Processing Art Similarity
• Artworks are manually tagged
• Input format:
  • [art1]: [tag1][tag2][tag3]…
• Compute common tags between two artworks
• Spark – Cassandra
• Could also use Collaborative Filtering (MLlib in Spark)
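The similarity computation above reduces to intersecting tag sets. A minimal plain-Python sketch of that step (the slides run it as a Spark job over Cassandra; the tag data here is made up for illustration):

```python
# Hypothetical tag data in the [art]: [tag][tag]… format described above.
tags = {
    "art1": {"oil", "portrait", "19th-century"},
    "art2": {"oil", "landscape", "19th-century"},
    "art3": {"sketch", "portrait"},
}

def common_tags(a, b):
    """Number of tags shared by two artworks (set intersection)."""
    return len(tags[a] & tags[b])

print(common_tags("art1", "art2"))  # "oil" and "19th-century" -> 2
```

At scale, the same intersection runs per artwork pair across Spark partitions instead of a local dict.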
16. + Benchmark Reads/Writes
• Source: “Benchmarking Top NoSQL Databases”, End Point:
  http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf
• Charts compare operations/sec under a most-writes workload and a balanced writes/reads workload
• Databases compared: Cassandra | Couchbase | HBase | MongoDB
17. + Cassandra Time Series

              time_stamp_1   time_stamp_2   time_stamp_3   …
  art_id_1    3              1              2              …
  art_id_2    5              3              1              …
  art_id_3    1              4              2              …
  …           …              …              …              …

• Compound primary key: (art_id, time_stamp)
  • art_id: partition key, responsible for data distribution across nodes
  • time_stamp: clustering key (with CLUSTERING ORDER BY DESC), responsible for data sorting within the partition
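The schema above could be expressed in CQL roughly as follows; the table and column names are assumptions for illustration, only the key structure comes from the slide:

```sql
-- Hypothetical CQL for the time-series table sketched above.
CREATE TABLE art_activity (
    art_id         text,
    time_stamp     timestamp,
    activity_count int,
    PRIMARY KEY (art_id, time_stamp)          -- compound primary key:
                                              -- art_id = partition key,
                                              -- time_stamp = clustering key
) WITH CLUSTERING ORDER BY (time_stamp DESC); -- newest rows first in a partition
```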
18. + Transactions in Cassandra
• Write “Life is good” with consistency = ALL; the write is sent to Node 1 and Node 2
• Node 1 stores “Life is good” and reports SUCCESS
• Node 2 fails mid-write (“Life is …”, <Job failed>), reports FAIL, and rolls back
• Final report: FAIL, yet Node 1 still holds “Life is good”
• Write atomicity is at the partition level
• Datastax: http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
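A toy model of the scenario above: a write must be acknowledged by a number of replicas set by the consistency level, and a failed overall write does not undo the replicas that succeeded. Node structures and function names here are illustrative, not Cassandra APIs.

```python
def write(replicas, value, consistency):
    """Attempt the write on every replica; succeed only with enough acks."""
    acks = 0
    for node in replicas:
        if node["healthy"]:
            node["data"] = value  # a healthy replica applies the write locally
            acks += 1
    required = len(replicas) if consistency == "ALL" else 1
    return acks >= required

nodes = [{"healthy": True, "data": None},   # Node 1
         {"healthy": False, "data": None}]  # Node 2 (fails mid-write)

ok = write(nodes, "Life is good", "ALL")
print(ok)                # False: the overall report is FAIL...
print(nodes[0]["data"])  # ...but Node 1 still holds "Life is good"
```

This mirrors the slide: with consistency = ALL the client is told FAIL, yet there is no cross-node rollback of the replica that already succeeded.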
19. + Batch Processing Artists' Location
• Spark: processed 507GB of data

  File size (GB):  4.7   14.5   28.5   43    101     202.5   318.5   507
  Time (min):      1.5   5.3    6.5    9.5   21.25   42.87   67.1    110
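The scaling implied by the table can be checked with a few lines of plain arithmetic: for the larger files, throughput settles around 4.5–4.8 GB/min, i.e. roughly linear scaling.

```python
# Throughput (GB/min) for each run in the table above.
sizes_gb  = [4.7, 14.5, 28.5, 43, 101, 202.5, 318.5, 507]
times_min = [1.5, 5.3, 6.5, 9.5, 21.25, 42.87, 67.1, 110]

throughput = [s / t for s, t in zip(sizes_gb, times_min)]
print([round(x, 2) for x in throughput])
# the small runs vary more; from ~43GB up, throughput is nearly constant
```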
20. + Spark vs. Hadoop MapReduce
• Fault tolerance: Spark – via RDD lineage (each RDD remembers how it was built from other datasets, so lost data is rebuilt on failure); Hadoop – via replication
• Cache data in memory: Spark – yes; Hadoop – no
• In-memory data sharing: Spark – shares data across directed acyclic graphs of RDDs, well-suited to machine learning algorithms; Hadoop – each job reads data from stable storage (e.g. file systems)
• Write intermediate files to disk: Spark – yes, during shuffle; Hadoop – yes, between map and reduce
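Lineage-based recovery can be illustrated with a toy dataset class: each result remembers its parent and the transformation that produced it, so lost data is recomputed rather than restored from a replica. This is a sketch of the idea, not Spark's actual API.

```python
class Dataset:
    """Toy RDD-like dataset that records its lineage."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # Materialize the result, but keep a pointer to how it was built.
        return Dataset(data=[fn(x) for x in self.data], parent=self, fn=fn)

    def recompute(self):
        """Rebuild this dataset from its lineage after a simulated failure."""
        self.data = [self.fn(x) for x in self.parent.data]
        return self.data

base = Dataset(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.data = None           # simulate losing the computed partition
print(doubled.recompute())    # [2, 4, 6], rebuilt from parent + transform
```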
21. + Spark Streaming vs. Storm
• Processing model: Spark Streaming – micro-batches; Storm – one record at a time
• Latency: Spark Streaming – a few seconds; Storm – sub-second
• Fault tolerance (every record processed): Spark Streaming – exactly once (tracks processing at the batch level); Storm – at least once (may produce duplicates when recovering from a fault)
• Implemented in: Spark Streaming – Scala; Storm – Clojure
22. + Cassandra vs. PostgreSQL
• Database model: Cassandra – NoSQL; PostgreSQL – relational DBMS
• Scaling: Cassandra – horizontal (more data = more servers); PostgreSQL – vertical (more data = a bigger server)
• Distributed: Cassandra – distributed; PostgreSQL – not distributed
• Normalization: Cassandra – better with denormalized tables (store additional redundant information on disk: more writes, but simpler reads and faster query response); PostgreSQL – better with normalized tables, but still not distributed
• Consistency: Cassandra – the developer’s job; PostgreSQL – the software handles it
23. + Cassandra vs. HBase
• Data model: both adopt Google Bigtable
• Distributed: yes, for both
• Internode communication: Cassandra – integrated Gossip protocol; HBase – relies on ZooKeeper
• Availability: Cassandra – multiple seed nodes (concentration points for intercluster communication); HBase – standby master node
• Consistency: Cassandra – richer consistency support: you can configure how many replica nodes must successfully complete an operation before it is acknowledged (up to all replica nodes); HBase – strong row-level consistency
• Query: Cassandra – SQL-like queries; HBase – shell commands, e.g.
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
24. + Elasticsearch vs. Solr
• Lucene index: both use the Lucene index
• Distributed: Elasticsearch – yes; Solr – yes, but depends on ZooKeeper
• Nested objects: Elasticsearch – supports complex nested objects; Solr – can model them as flat objects, but those are hard to update
• Change # of shards: Elasticsearch – no (routing is hash(doc_id) % #_of_primary_shards); Solr – yes
• Automatic shard rebalancing (after adding new nodes): Elasticsearch – yes; Solr – no
• Faceting: Elasticsearch – richer faceting (e.g. exclude terms using a regex); Solr – supports faceting
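The routing formula above is why Elasticsearch cannot change the number of primary shards after index creation: a document's shard is `hash(doc_id) % #_of_primary_shards`, so changing the shard count would re-route existing documents. The sketch below uses integer ids and plain modulo to stand in for Elasticsearch's actual hash.

```python
def shard_for(doc_id, n_shards):
    """Route a document to a shard, as hash(doc_id) % n_shards."""
    return doc_id % n_shards

docs = range(10)
with_5 = [shard_for(d, 5) for d in docs]  # index created with 5 primaries
with_6 = [shard_for(d, 6) for d in docs]  # same docs if it grew to 6

moved = sum(a != b for a, b in zip(with_5, with_6))
print(moved)  # 5: half the documents would land on a different shard
```

Solr can change the shard count because it splits shards rather than re-routing by modulo.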