2. Brief history of databases
● 1980: Oracle DBMS released
● 1995: first release of MySQL, a lightweight, asynchronously
replicated database
● 2000: MySQL 3.23 is adopted by most startups as a database platform
● 2004: Google develops BigTable
● 2009: The term "NoSQL" is coined
3. Evolution of database storage
● Monolithic: using traditional databases (Oracle, Sybase...)
○ High cost in infrastructure (big CPU, big SAN)
○ High cost in software licenses (commercial software)
○ Qualified personnel required (certified DBAs)
● LAMP platform (circa 2000)
○ Free software
○ Runs on commodity hardware
○ Low administration (no DBA needed until the data grows large)
● NoSQL
○ Scales indefinitely (replication only scales vertically)
○ Logic is in the application
○ Doesn't require SQL or internals knowledge
4. Scaling
● Vertical scaling (MySQL)
○ Scaling by adding more replicas
○ Load is evenly distributed across the replicas
○ Problems!
■ Every replica ends up caching the same data: inefficient resource usage
■ Efficiency drops once the dataset exceeds available memory; adding replicas no longer improves performance
■ Writes remain a bottleneck (every replica must apply every write)
● Horizontal scaling (NoSQL)
○ Data is distributed evenly across the nodes (hashing)
○ More capacity? Just add a node
○ Loss of traditional database properties (ACID)
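The horizontal-scaling idea above can be sketched in a few lines of Python (all node names are hypothetical). This uses naive modulo hashing to place keys on nodes; real systems typically use consistent hashing so that adding a node moves only a fraction of the keys:

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str, nodes=NODES) -> str:
    """Place a key on a node by hashing it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Keys spread roughly evenly across the nodes; no single
# node has to hold the whole dataset.
placement = {key: node_for(key) for key in ["user:1000", "user:1001", "user:1002"]}
```

Note that with plain modulo hashing, growing `NODES` remaps most keys; consistent hashing is the usual answer to the "just add a node" scenario.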
5. NoSQL definition
● Not only SQL (not necessarily exposed via a query language)
● Non-relational (denormalized data)
● Distributed (horizontal partitioning)
● Different implementations:
○ Key-value store
○ Document database
○ Graph database
6. Key-value stores
● Schema-less storage
● Basic associative arrays
{ "username" => "guillaume" }
● Key-value stores can have column families and subkeys
{ "user:name" => "guillaume", "user:uid" => 1000 }
● Implementations
○ K/V caches: Redis, Memcached
■ in-memory databases
○ Column databases: Cassandra, HBase
■ Data is stored in a column-oriented fashion (as opposed to the row-oriented storage of a traditional RDBMS)
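A minimal in-memory sketch in Python (hypothetical `put`/`get_family` helpers, not any product's API) of the "column family and subkeys" idea above, where families are just key prefixes:

```python
# A key-value store is essentially an associative array; here a dict
# stands in for the storage engine, and "family:qualifier" keys mimic
# the "user:name" / "user:uid" example above.
store = {}

def put(key, value):
    store[key] = value

def get_family(family):
    """Return every qualifier stored under a family prefix."""
    prefix = family + ":"
    return {k[len(prefix):]: v for k, v in store.items() if k.startswith(prefix)}

put("user:name", "guillaume")
put("user:uid", 1000)
# get_family("user") == {"name": "guillaume", "uid": 1000}
```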
7. Document Databases
● Data is organized into documents:
FirstName="Frank", City="Haifa", Hobby="Photographing"
● No strong typing or predefined fields; additional
information can be added easily
FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:"English"}, {Name:"Tagalog"}]
● An ensemble of documents is called a collection
● Uses structured standards: XML, JSON
● Implementations
○ CouchDB (Erlang)
○ MongoDB (C++)
○ RavenDB (.NET)
8. Graph Databases
● Uses graph theory structure to represent information
● Typically used for relations
○ Example: follower/following relationships on Twitter
9. Databases at Toluna
● MySQL
○ Traditional Master-Slave configuration
○ Very efficient for small requests
○ Not good for analytics
○ "Big Data" issues (e.g. the usersvotes table)
● Microsoft SQL Server
○ Good all-around performance
○ Monolithic
■ Suffers from locking issues
■ Hard to scale (many connections)
○ Potentially complex SQL programming to get the best out of it
11. Apache HBase
● A column database based on the Hadoop architecture
● Commercially supported (Cloudera)
● Available on Red Hat, Debian
● Designed for very large data storage (terabytes)
● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter
Pros
● Pure Java implementation
● Access to Hadoop MapReduce data via column storage
● True clustered architecture
Cons
● Java
● Hard to deploy and maintain
● Limited options via the API (get, put, scans)
12. Apache HBase: architecture
● Data is stored in cells
○ Primary row key
○ Column family
■ Limited in number
■ May have an unbounded number of qualifiers
○ Timestamp (version)
● Example cell structure
RowId  Column Family:Qualifier  Timestamp   Value
1000   user:name                1312868789  guillaume
1000   user:email               1312868789  g@dragonscale.eu
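The cell structure above can be modelled with a short Python sketch (hypothetical helpers, not the HBase API): each value lives under a (row key, family:qualifier) pair plus a timestamp, and a read returns the newest version:

```python
# Cells keyed by (row_key, "family:qualifier"); each cell keeps
# multiple timestamped versions, as in the example table above.
cells = {}

def put(row, column, value, ts):
    cells.setdefault((row, column), {})[ts] = value

def get(row, column):
    """Return the latest version (highest timestamp) of a cell."""
    versions = cells.get((row, column), {})
    if not versions:
        return None
    return versions[max(versions)]

put("1000", "user:name", "guillaume", 1312868789)
put("1000", "user:name", "guillaume-updated", 1312868999)  # newer version
# get("1000", "user:name") returns the value stored at timestamp 1312868999
```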
13. HBase Data operations
● Create a table and put some data
create 'usersvotes', 'userid', 'date', 'meta'
hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid,HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/
● Retrieve data
hbase(main):002:0> get 'usersvotes', '1071726'
COLUMN CELL
date: timestamp=1312780185940, value=1296523245
meta:answer timestamp=1312780185940, value=2
meta:country timestamp=1312780185940, value=ES
userid: timestamp=1312780185940, value=685352
4 row(s) in 0.0720 seconds
● For the specified row key, the latest version of each cell (the one with the highest timestamp) is returned
14. HBase Data operations: API
● If not using Java, HBase must be queried through a web service (XML, JSON, or protobuf)
● Type of operations
○ Get (read single value)
○ Put (write single value)
○ Get multi (read multiple versions)
○ Scan (retrieve multiple rows via a scan)
● MapReduce jobs can be run against the database using Java code
15. What is MapReduce ?
"Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them
in some way to get the output – the answer to the problem it was originally trying to solve.
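As an illustration only (not tied to any product in this document), the two steps map onto a toy Python word count:

```python
from collections import defaultdict
from itertools import chain

def map_step(document):
    # "Map": break the input into sub-problems and emit (key, 1) pairs.
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # "Reduce": combine all answers per key to get the final output.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["to be or not to be", "to do is to be"]
counts = reduce_step(chain.from_iterable(map_step(d) for d in docs))
# counts["to"] == 4, counts["be"] == 3
```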
16. MongoDB
● A document-oriented database
● Written in C++
● Uses JavaScript as its query and scripting language
● Commercial support (10gen)
● Large user base (NY Times, Disney, MTV, foursquare)
Pros
● Easy installation and deployment
● Sharding and replication
● Easy API (JavaScript), drivers for multiple languages
● Similarities to MySQL (indexes, queries)
Cons
● Versions < 1.8.0 had many issues (immature development): consistency problems, crashes...
17. MongoDB data structure
● No predefined fields; collections and fields are created implicitly
# Create document
> d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" };
# Save it into a new collection
> db.usersvotes.save(d);
# Retrieve documents from the collection
> db.usersvotes.find();
{ "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }
18. MongoDB Indexes and Queries
Let's create an index...
db.usersvotes.ensureIndex({ pollid: 1, date: -1 })
● Indexes can be created in the background
● Index keys have a sort direction (1 = ascending, -1 = descending)
● Queries without indexes are slow (scans)
Let's get a usersvotes stream
db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
# equivalent SQL: SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY date DESC LIMIT 10,10;
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 1, "country" : "GB" }
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 5, "country" : "GB" }
19. MongoDB MapReduce
● Uses javascript functions
● Single-threaded (JavaScript engine limitation)
● Parallelized (runs across all shards)
# Aggregate data over the country field
m = function() { emit(this.country, { count: 1 }); };
# Count items in each country
r = function(k, vals) {
    var result = { count: 0 };
    vals.forEach(function(value) {
        result.count += value.count;
    });
    return result;
};
# Start the job
res = db.usersvotes.mapReduce(m, r, { out: { inline: 1 } });
20. VoltDB: a SQL alternative
● "NewSQL"
● In-memory database
Pros
● Lock-free
● ACID properties
● Linear scalability
● Fault tolerant
● Java implementation
Cons
● Cannot query the database directly; Java stored procedures only
● Database shutdown needed to modify schema
● Database shutdown needed to add cluster nodes
● No more memory = no more storage
22. MySQL: some interesting facts
● In single node tests, MySQL was always faster than NoSQL
solutions
● Data loading was faster
○ Sample usersvotes data (1 GB TSV file)
■ MySQL: 20 seconds
■ MongoDB: >10 minutes
■ HBase: >30 minutes
● Proven technology
23. MySQL Analytics
● MySQL might be outperformed by other solutions in
analytics depending on the data size
● Several column-store solutions exist for MySQL (Infobright, ICE, Tokutek)
● Word-count operations can be offloaded to a full-text search engine (Sphinx, Solr, Lucene)
24. MySQL Big Data
● The Vote Stream case
● Simple query
explain select * from toluna_polls.usersvotes where pollid=843206 order by votedate desc limit 10,20
1, 'SIMPLE', 'usersvotes', 'ref', 'Index_POLLID', 'Index_POLLID', '8', 'const', 2556, 'Using where; Using filesort'
● Can easily be solved with a covering index
ALTER TABLE usersvotes ADD KEY (pollid, votedate)
● But !
○ usersvotes = 160 GB of data files
○ Adding index: offline operation, would take hours
○ Online schema change could be used, but might run out of
space and/or take days
25. Conclusions
● HBase: a good choice for analytics, but not well suited to traditional database operations
○ Most companies use HBase/Hadoop to offload analytical
data from their main database
○ Java experience needed (which Toluna has, in my opinion)
○ IT must be trained
● MongoDB
○ Very good choice for starting new web applications "from
the ground up"
● VoltDB
○ Great technology but lack of flexibility
● Traditional databases
○ Will probably be around for a long time