Motivation
Overview of the databases
Methodology
Results
Summary and conclusion
Clarifications
Since many people who read these slides did not get the oral
explanations that MUST go with them, here are a few words of
warning:
All the databases were used with their default configurations; I will
post them soon on nosqlbenchmarking.com
No index was set manually; doing so could have a big impact
on performance
Don't jump to conclusions: it would be WRONG
to say that Cassandra is very good and that HBase sucks.
The Cassandra implementation of MapReduce seems to be
buggy and does not scale. There must be something wrong with
my HBase configuration: HBase is known to run gigantic
clusters without problems.
Also keep in mind that a benchmark is always biased by the chosen
methodology, so:
The way I store data in each database could have an impact
on performance
The summary of the results should not be taken as absolute,
especially the first one. When I say Good or Bad, it is in THIS
particular case. Moreover, raw results are not the most important
thing; scalability is very important too. So good performance
for Cassandra MapReduce without scalability is NOT good.
The data set is too small, so I am really testing cache performance
(but this is the same for all the databases)
I will soon add a written analysis and a self-critique of these
results on www.nosqlbenchmarking.com
Motivation
YCSB
Yahoo! Cloud Serving Benchmark (YCSB) is the best-known NoSQL
benchmarking application, so why make another one?
YCSB uses data generated from statistical distributions
instead of real data
YCSB focuses only on read/write/update/scan performance
YCSB results for elasticity are not conclusive
Idea
Data and use case inspired by a concrete case: Wikipedia
Test read/update performance
Test MapReduce performance by computing an inverted
search index
Cassandra 0.6.10
Overview
Cassandra is a fully distributed column-oriented data store that
provides a MapReduce implementation using Hadoop.
All the nodes in the cluster play the same role
The data (existing and new) are sharded automatically among
the nodes
The developer can choose the consistency level for each
request
HBase 0.20.6
Overview
HBase is a column-oriented database that aims to provide low-latency
requests on top of Hadoop HDFS
An HBase cluster uses several kinds of servers:
HDFS needs at least one namenode and several datanodes
HBase needs a ZooKeeper cluster, a master and several
regionservers
The requests must be made to the master(s)
On the HDFS level, existing data are not sharded
automatically but new data are
On the HBase level, the data are divided into regions that are
sharded automatically across regionservers
mongoDB 1.6.5
Overview
mongoDB is a document-oriented database that stores JSON
dictionaries. It provides auto-sharding and a MapReduce
implementation.
A mongoDB cluster is made of several kinds of servers :
The shard servers that store data
The configuration servers
The router servers that receive and route the requests
Existing and new data are sharded automatically
MapReduce can only use one thread per server
Riak 0.14
Overview
Riak is a fully distributed key/bucket store with an implementation
of MapReduce.
Buckets can store the data directly or be a link to another
bucket
All the nodes in the cluster play the same role
The data (existing and new) are sharded automatically
among the nodes
The developer can choose the consistency level for each
request
The data
Wikipedia export
20,000 pages downloaded from Wikipedia
Every document is in XML format
All documents add up to 620 MB
Each document is associated with a single integer ID
Insertions
Each document is inserted only once during the whole benchmark
The client
Overview
Fully random requests
Acts as a perfect load balancer
The proportion of updates can be specified
... specific parts: read/write/update and MapReduce
Updates
The updates simply append the string "1" to the end of the
article.
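The client behaviour described above can be sketched as follows. This is only a minimal illustration, not the actual benchmark client: the in-memory `store` dictionary stands in for a real database, the names `run_client` and `apply_update` are hypothetical, and spreading requests round-robin over the nodes is an assumption about what "perfect load balancer" means.

```python
import random


def apply_update(article):
    """Update described on the slide: append the string "1" to the article."""
    return article + "1"


def run_client(store, nodes, total_requests, read_percentage, seed=None):
    """Issue random reads/updates against an in-memory stand-in for a database.

    store: dict mapping integer document ID -> article text (hypothetical stand-in)
    nodes: list of node addresses; requests are spread round-robin (an assumption)
    """
    rng = random.Random(seed)
    doc_ids = list(store)
    stats = {"read": 0, "update": 0, "per_node": {node: 0 for node in nodes}}
    for i in range(total_requests):
        node = nodes[i % len(nodes)]      # even spread across the nodes
        doc_id = rng.choice(doc_ids)      # fully random document
        if rng.random() * 100 < read_percentage:
            _ = store[doc_id]             # read request
            stats["read"] += 1
        else:
            store[doc_id] = apply_update(store[doc_id])  # update request
            stats["update"] += 1
        stats["per_node"][node] += 1
    return stats
```

With a read percentage of 100 the sketch issues only reads, and with 0 only updates; a real client would replace the dictionary accesses with calls to each database's driver.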
MapReduce
Overview
MapReduce is used to build a reverse index for a given keyword.
The reverse index is a list of pairs made of :
ID : the ID of the article if Count ≠ 0
Count : the number of occurrences of the keyword in this
article
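As a rough illustration of the reverse index described above, here is a pure-Python sketch of the map and reduce steps. It does not use any of the benchmarked databases' MapReduce APIs; the function names and the word-splitting rule (case-insensitive, split on non-word characters) are assumptions made for the example.

```python
import re
from collections import defaultdict


def map_article(doc_id, text, keyword):
    """Map step: emit (ID, Count) only when the keyword occurs in the article."""
    words = re.findall(r"\w+", text.lower())
    count = words.count(keyword.lower())
    if count != 0:
        yield doc_id, count


def reduce_counts(pairs):
    """Reduce step: merge the counts per ID and return a sorted list of pairs."""
    totals = defaultdict(int)
    for doc_id, count in pairs:
        totals[doc_id] += count
    return sorted(totals.items())


def reverse_index(articles, keyword):
    """Build the reverse index for one keyword over a dict of ID -> text."""
    pairs = (
        pair
        for doc_id, text in articles.items()
        for pair in map_article(doc_id, text, keyword)
    )
    return reduce_counts(pairs)
```

Every article has to be scanned by the map step, which is why this workload exercises the whole data set rather than a few keys.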
Justification
This kind of computation implies that all the documents are crawled
and takes advantage of the specific ...
The methodology
1 Start up a clean cluster of size 3 and insert all the documents
2 Choose a total number of requests and a read percentage, then
start the benchmark
3 Wait one minute and start the benchmark again
4 Wait five minutes and start the benchmark again
5 Start the MapReduce benchmark
6 Add a new node to the cluster and wait for it to be ready, then
immediately restart the benchmark with the new node's IP in the
list
7 Jump to 3 until there are no more computers to add to the
cluster
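The steps above can be sketched as a small driver loop. This is one possible reading of the procedure, not the actual benchmark harness: `insert_all`, `run_benchmark` and `run_mapreduce` are hypothetical hooks supplied by the caller, and the sketch assumes that steps 3 to 5 are run once more after the last node has been added.

```python
import time


def run_methodology(cluster, spare_nodes, insert_all, run_benchmark,
                    run_mapreduce, sleep=time.sleep):
    """Follow the numbered steps; every callable is a hypothetical hook.

    cluster: list of node addresses, starting at size 3 (step 1)
    spare_nodes: addresses of the computers still to be added (steps 6 and 7)
    """
    results = []
    insert_all(cluster)                                               # step 1
    results.append(("bench", len(cluster), run_benchmark(cluster)))   # step 2
    while True:
        sleep(60)                                                     # step 3
        results.append(("bench", len(cluster), run_benchmark(cluster)))
        sleep(300)                                                    # step 4
        results.append(("bench", len(cluster), run_benchmark(cluster)))
        results.append(("mapreduce", len(cluster), run_mapreduce(cluster)))  # step 5
        if not spare_nodes:                                           # step 7: nothing left to add
            break
        cluster.append(spare_nodes.pop(0))                            # step 6: add a node
        results.append(("bench", len(cluster), run_benchmark(cluster)))
    return results
```

Passing `sleep` as a parameter makes it possible to dry-run the sequence without actually waiting.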
Read/update results
Read/update results without HBase
MapReduce performance
The HBase case
Verifications made:
Checked the logs : nothing seemed problematic
HDFS level: running the balancer with a very low threshold
distributed the blocks evenly but had no impact on
performance
HBase level: the regions were always nearly evenly
distributed across the regionservers
The number of rows did not change and the content of each
row was correct
Summary of raw performances
DB         Read/update performance   MapReduce performance
Cassandra  Good                      Very good
HBase      Bad / N.A.                Average / N.A.
mongoDB    Good                      Poor but scalable
Riak       Poor / unstable           Average but scalable
Summary of scalability
Going from 3 to 8 servers multiplies capacity by 8/3, i.e. about 266%.
Here are the observed increases in performance:
DB                  Read/update  MapReduce
Cassandra           153%         112%
HBase               11%          43%
mongoDB             145%         211%
Riak                74%          189%
Riak (7 nodes max)  155%         168%
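To compare these figures with linear scaling, one can divide each observed percentage by the capacity percentage. The sketch below assumes that the table values and the 266% figure are expressed on the same scale, which the slides do not state explicitly; the function names are hypothetical.

```python
def capacity_percent(start_nodes, end_nodes):
    """Capacity of the final cluster as a percentage of the initial one."""
    return 100.0 * end_nodes / start_nodes


def fraction_of_linear(observed_percent, start_nodes=3, end_nodes=8):
    """Share of the ideal (linear) scaling that was actually observed."""
    return observed_percent / capacity_percent(start_nodes, end_nodes)


# Values copied from the table above: (read/update, MapReduce, final cluster size).
observed = {
    "Cassandra": (153, 112, 8),
    "HBase": (11, 43, 8),
    "mongoDB": (145, 211, 8),
    "Riak": (74, 189, 8),
    "Riak (7 nodes max)": (155, 168, 7),
}

efficiency = {
    db: (fraction_of_linear(ru, 3, n), fraction_of_linear(mr, 3, n))
    for db, (ru, mr, n) in observed.items()
}
```

Under that assumption a value of 1.0 would mean perfectly linear scaling, and none of the databases reaches it.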
Conclusion and future work
Conclusion
The elastic gain seems more apparent than with YCSB, but
it is not linear either
It is worth testing MapReduce performance, as the results
vary a lot between databases for both raw performance and
scalability
Future work
This is still a work in progress :
Applying this benchmark to other databases (Terrastore,
Voldemort, Scalaris ...)
Trying with a growing/bigger data set
Questions and remarks
Any questions or remarks?