NoSQL solutions

A comparative study
Brief history of databases




 ● 1980: Oracle DBMS released
 ● 1995: first release of MySQL, a lightweight, asynchronously
   replicated database
 ● 2000: MySQL 3.23 is adopted by most startups as a
   database platform
 ● 2004: Google develops BigTable
 ● 2009: The term "NoSQL" is coined
Evolution of database storage

● Monolithic: using traditional databases (Oracle, Sybase...)
   ○ High cost in infrastructure (big CPU, big SAN)
   ○ High cost in software licenses (commercial software)
   ○ Qualified personnel required (certified DBAs)
● LAMP platform (circa 2000)
   ○ Free software
   ○ Runs on commodity hardware
   ○ Low administration (no DBA needed until your data
     grows large)
● NoSQL
   ○ Scales indefinitely (unlike replication, which only scales
     vertically)
   ○ Logic lives in the application
   ○ Doesn't require SQL or storage internals knowledge
Scaling
● Vertical scaling (MySQL)
   ○ Scaling by adding more replicas
   ○ Load is evenly distributed across the replicas
   ○ Problems!
       ■ Cached data is also distributed evenly: inefficient
         resource usage
       ■ Efficiency goes down as dataset exceeds available
         memory, cannot get more performance by adding
         replicas
       ■ Write bottleneck

● Horizontal scaling (NoSQL)
   ○ Data is distributed evenly across the nodes (hashing)
   ○ Need more capacity? Just add a node
   ○ Loss of traditional database properties (ACID)
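As a rough sketch of the hashing idea above (illustrative only, not any particular product's placement algorithm; node names and keys are made up), key-to-node placement can look like this:

```python
from hashlib import md5

def shard_for(key, nodes):
    """Pick a node for a key by hashing it (naive modulo placement)."""
    digest = int(md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
placement = {k: shard_for(k, nodes) for k in ["user:1000", "user:1001", "user:1002"]}
# Each key lands deterministically on one node. Note that adding a node
# changes the modulus and remaps most keys, which is why real systems
# prefer consistent hashing over this naive scheme.
```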
NoSQL definition




● Not only SQL (not necessarily exposed via a query language)
● Non-relational (denormalized data)
● Distributed (horizontal partitioning)
● Different implementations :
   ○ Key-value store
   ○ Document database
   ○ Graph database
Key-value stores

  ● Schema-less storage
  ● Basic associative arrays
{ "username" => "guillaume" }
  ● Key-value stores can have column families and subkeys
{ "user:name" => "guillaume", "user:uid" => 1000 }

  ● Implementations
      ○ K/V caches: Redis, memcache
         ■ in-memory databases
      ○ Column databases: Cassandra, HBase
         ■ Data is stored in a tabular fashion (as opposed to
           rows in traditional RDBMS)
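To make the "column families and subkeys" idea concrete, here is a toy in-memory store, a sketch only, not how Redis or Cassandra are actually implemented:

```python
class KVStore:
    """Minimal in-memory key-value store with 'family:qualifier' subkeys."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan(self, family):
        """Return every entry whose key belongs to a column family prefix."""
        prefix = family + ":"
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

store = KVStore()
store.put("user:name", "guillaume")
store.put("user:uid", 1000)
# store.scan("user") groups all subkeys of the "user" family together
```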
Document Databases

  ● Data is organized into documents:
FirstName="Frank", City="Haifa", Hobby="Photographing"
  ● No strong typing or predefined fields; additional
   information can be added easily
FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:"
English"}, {Name:"Tagalog"}]
  ● An ensemble of documents is called a collection
  ● Uses structured standards: XML, JSON
  ● Implementations
      ○ CouchDB (Erlang)
      ○ MongoDB (C++)
      ○ RavenDB (.NET)
Graph Databases

● Uses graph theory structure to represent information
● Typically used for relations
   ○ Example: followers/following in Twitter
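The followers/following example can be sketched as a pair of adjacency sets (user names are hypothetical; real graph databases add indexing, traversal languages and persistence on top):

```python
from collections import defaultdict

class FollowGraph:
    """Directed graph: an edge u -> v means u follows v."""
    def __init__(self):
        self.following = defaultdict(set)  # u -> users that u follows
        self.followers = defaultdict(set)  # v -> users that follow v

    def follow(self, who, whom):
        self.following[who].add(whom)
        self.followers[whom].add(who)

g = FollowGraph()
g.follow("alice", "bob")
g.follow("carol", "bob")
# g.followers["bob"] now contains both alice and carol
```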
Databases at Toluna

 ● MySQL
    ○ Traditional Master-Slave configuration
    ○ Very efficient for small requests
    ○ Not good for analytics
    ○ "Big Data" issues (e.g. the usersvotes table)

 ● Microsoft SQL Server
    ○ Good all-around performance
    ○ Monolithic
        ■ Suffers from locking issues
        ■ Hard to scale (many connections)
    ○ Potentially complex SQL programming to get the best out of it
Solutions ?

Let's evaluate some products...
Apache HBase

  ● A column database based on the Hadoop architecture
  ● Commercially supported (Cloudera)
  ● Available on Red Hat, Debian
  ● Designed for very big data storage (Terabytes)
  ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter

Pros
  ● Pure Java implementation
  ● Access to Hadoop MapReduce data via column storage
  ● True clustered architecture

Cons
 ● Java
 ● Hard to deploy and maintain
 ● Limited options via the API (get, put, scans)
Apache HBase: architecture

● Data is stored in cells
   ○ Primary row key
   ○ Column family
      ■ Limited in number
      ■ May have an unbounded set of qualifiers
   ○ Timestamp (version)
● Example cell structure
RowId     Column Family:Qualifier   Timestamp    Value

1000      user:name                 1312868789   guillaume

1000      user:email                1312868789   g@dragonscale.eu
HBase Data operations
  ● Create a table and load some data
create 'usersvotes', 'userid', 'date', 'meta'
hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid,
HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/
  ● Retrieve data
hbase(main):002:0> get 'usersvotes', 1071726
COLUMN                 CELL
 date:           timestamp=1312780185940, value=1296523245
 meta:answer         timestamp=1312780185940, value=2
 meta:country        timestamp=1312780185940, value=ES
 userid:          timestamp=1312780185940, value=685352
4 row(s) in 0.0720 seconds
  ● The latest version (highest timestamp) for the specified row
    key is returned
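The "highest timestamp wins" behaviour above can be sketched in a few lines of Python (timestamps and values are made up; HBase actually keeps versions sorted on disk rather than scanning them):

```python
def latest_cell(cells):
    """cells: list of (timestamp, value) versions of one cell.
    Return the most recent value, mimicking a default HBase get."""
    return max(cells, key=lambda cell: cell[0])[1]

# Two versions of the same cell; the later write shadows the earlier one.
versions = [(1312780185940, "ES"), (1312780190000, "FR")]
current = latest_cell(versions)  # "FR"
```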
HBase Data operations: API

● If not using Java, HBase must be queried through a
  web service (XML, JSON or protobuf)

● Type of operations
   ○ Get (read single value)
   ○ Put (write single value)
   ○ Get multi (read multiple versions)
   ○ Scan (retrieve multiple rows via a scan)

● MapReduce jobs can be run against the database using Java
  code
What is MapReduce ?
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes
those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them
in some way to get the output – the answer to the problem it was originally trying to solve.
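The two steps above can be sketched as a single-process word count in Python, no cluster, just the map and reduce phases over a few input partitions (the sample chunks are made up):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # "Map": one worker emits (word, 1) pairs for its input partition
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # "Reduce": the master combines all partial counts per key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["nosql scales", "nosql shards"]  # two input partitions
counts = reduce_phase(chain.from_iterable(map_phase(c) for c in chunks))
# counts["nosql"] == 2 because both partitions emitted it once
```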
MongoDB
 ● A document-oriented database
 ● Written in C++
 ● Uses JavaScript for queries and server-side functions
 ● Commercial support (10gen)
 ● Large user base (NY Times, Disney, MTV, foursquare)

Pros
  ● Easy installation and deployment
  ● Sharding and replication
  ● Easy API (JavaScript), drivers for multiple languages
  ● Similarities to MySQL (indexes, queries)

Cons
 ● Versions < 1.8.0 had many issues (development not mature):
   consistency, crashes...
MongoDB data structure

  ● No predefinition of fields; collections and fields are created implicitly

# Create document
> d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3,
"country" : "GB" };
# Save it into a new collection
> db.usersvotes.save(d);
# Retrieve documents from the collection
> db.usersvotes.find();
{ "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" :
NumberLong(1293874493), "answer" : 3, "country" : "GB" }
MongoDB Indexes and Queries
Let's create an index...
db.usersvotes.ensureIndex({pollid: 1, date :-1})

 ● Indexes can be created in the background
 ● Index keys are sortable
 ● Queries without indexes are slow (scans)
Let's get a usersvotes stream
db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
# equivalent SQL: SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY date
DESC LIMIT 10,10;
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong
(1295077466), "answer" : 1, "country" : "GB" }
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong
(1295077466), "answer" : 5, "country" : "GB" }
MongoDB MapReduce
   ● Uses JavaScript functions
   ● Single-threaded (JavaScript engine limitation)
   ● Parallelized (runs across all shards)

# Aggregate data over the country field
m = function() { emit( this.country, { count: 1 } ); };
# Count items in each country
r = function(k, vals) {
  var result = { count: 0 };
  vals.forEach(function(value) {
    result.count += value.count;
  });
  return result;
};
# Start the job
res = db.usersvotes.mapReduce(m, r, { out: { inline: 1 } });
VoltDB: a SQL alternative
 ● "NewSQL"
 ● In-memory database

Pros
 ● Lock-free
 ● ACID properties
 ● Linear scalability
 ● Fault tolerant
 ● Java implementation

Cons
 ● Cannot query the database directly; Java stored procedures only
 ● Database shutdown needed to modify schema
 ● Database shutdown needed to add cluster nodes
 ● No more memory = no more storage
What about MySQL ?

   Potential solutions
MySQL: some interesting facts

 ● In single node tests, MySQL was always faster than NoSQL
   solutions
 ● Data loading was faster
     ○ Sample usersvotes data (1 GB TSV file)
         ■ MySQL: 20 seconds
         ■ MongoDB: >10 minutes
         ■ HBase: >30 minutes
 ● Proven technology
MySQL Analytics

 ● MySQL might be outperformed by other solutions in
   analytics depending on the data size
 ● Several column-store engines exist for MySQL
   (Infobright, ICE, Tokutek)
 ● Word count operations can be offloaded to a full-text search
   engine (Sphinx, Solr, Lucene)
MySQL Big Data
  ● The Vote Stream case
  ● Simple query
 explain select * from toluna_polls.usersvotes where pollid=843206 order by votedate desc limit
10,20
1, 'SIMPLE', 'usersvotes', 'ref', 'Index_POLLID', 'Index_POLLID', '8', 'const', 2556, 'Using where;
Using filesort'


  ● Easily solved with a covering index
ALTER TABLE usersvotes ADD KEY (pollid, votedate)


  ● But !
     ○ usersvotes = 160 GB of datafiles
     ○ Adding the index is an offline operation that would take hours
     ○ Online schema change could be used, but might run out of
        space and/or take days
Conclusions

● HBase: good choice for analytics, but not well suited to
  traditional database operations
    ○ Most companies use HBase/Hadoop to offload analytical
      data from their main database
    ○ Java experience needed (which Toluna has imho)
    ○ IT must be trained
● MongoDB
    ○ Very good choice for starting new web applications "from
      the ground up"
● VoltDB
    ○ Great technology but lack of flexibility
● Traditional databases
    ○ Will probably be around for a long time
Questions ?

