2. What is NoSQL?
• NoSQL is not a standard.
• NoSQL does not mean "No SQL" but rather "Not Only SQL".
• It is also not an RDBMS replacement.
• CAP [Consistency, Availability, Partition Tolerance] theorem
• BASE [Basically Available, Soft state, Eventual consistency] vs. ACID
3. Characteristics of a NoSQL Database
• Flexible schema / schema less
• Non relational
• Often Distributed (Partitioned)
• Often Replicated
• Horizontally Scalable
• Eventually consistent
• Often cheaper than big-name commercial RDBMS products
• Simpler API than SQL (but not standardized across products, or even
across versions).
4. NoSQL pros/cons
Advantages
– Massive scalability
– High availability
– Lower cost (than competitive solutions at that scale)
– (usually) predictable elasticity
– Schema flexibility, sparse & semi-structured data
5. Disadvantages
– Limited query capabilities (so far)
– Eventual consistency is not intuitive to program for
• Makes client applications more complicated
– No standardization
• Portability might be an issue
– Access control is often less mature than in an RDBMS
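Why eventual consistency complicates client code can be illustrated with a minimal sketch. It assumes two replicas and a last-write-wins rule keyed on a timestamp; the names `Replica` and `anti_entropy` are hypothetical and do not come from any particular product.

```python
class Replica:
    """One copy of the data; stores (timestamp, value) per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Last-write-wins (an assumed policy): keep the newest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("cart:42", ["book"], ts=1)
stale = r2.read("cart:42")   # r2 has not seen the write yet
anti_entropy(r1, r2)
fresh = r2.read("cart:42")   # visible once the replicas have synchronized
```

A client that reads from `r2` before synchronization sees no cart at all, so the application has to tolerate or detect stale reads.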
6. Different types of NoSQL Databases
• NoSQL databases are classified in four major data models:
1. Key-value
2. Document
3. Column family
4. Graph
7. 1. Key-value data model
• The main idea is the use of a hash table
• Access data (values) by strings called keys
• Data has no required format – data may have any format
• Data model: (key, value) pairs
• Basic operations:
Insert(key, value), Fetch(key),
Update(key, value), Delete(key)
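The four basic operations can be sketched on top of a Python dict, which is itself a hash table. This is an illustrative in-memory model only; the class name `KeyValueStore` and the error behavior are assumptions, not the API of any real store.

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a hash table (dict)."""

    def __init__(self):
        self._table = {}

    def insert(self, key, value):
        if key in self._table:
            raise KeyError(f"key already exists: {key!r}")
        self._table[key] = value  # the value is opaque: any format is allowed

    def fetch(self, key):
        return self._table.get(key)  # None when the key is absent

    def update(self, key, value):
        if key not in self._table:
            raise KeyError(f"no such key: {key!r}")
        self._table[key] = value

    def delete(self, key):
        if key in self._table:
            del self._table[key]
            return True
        return False


store = KeyValueStore()
store.insert("user:1", b"\x00\x01 raw bytes")
store.insert("user:2", {"name": "Ada"})
store.update("user:2", {"name": "Ada", "city": "London"})
removed = store.delete("user:1")
```

Note that the store never inspects the value, which is why lookups are only possible by key.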
8. Contd..
• Key/value stores
• Can be in memory only, or backed by disk persistence.
• Some support versioning
• e.g. Voldemort (LinkedIn), Amazon SimpleDB, Memcached,
BerkeleyDB, Oracle NoSQL
9. 1.1 Voldemort
• Distributed key-value store
– Based on Dynamo
• Originally developed by LinkedIn, now open source
• Features
– Simple data model (no joins or complex queries, no referential integrity, …)
– P2P
– Scale-out / elastic
• Consistent hashing of keyspace
• Fixed partitions (no splits, but owner may change when re-balancing)
– Eventual consistency / High Availability
– Replication
– Failure handling
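The "fixed partitions, owner may change" idea can be sketched as follows. This is a simplified illustration, not Voldemort's actual implementation: the partition count, the MD5-based hash, and the names `partition_for`, `node_for`, and `add_node` are all assumptions.

```python
import hashlib

NUM_PARTITIONS = 8  # fixed for the lifetime of the cluster (assumed value)


def partition_for(key: str) -> int:
    """Hash a key to one of the fixed partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


# Partition -> owning node. Only this table changes when re-balancing.
owners = {p: f"node{p % 2}" for p in range(NUM_PARTITIONS)}


def node_for(key: str) -> str:
    return owners[partition_for(key)]


def add_node(new_node: str, partitions_to_move):
    """Re-balance by handing whole partitions to the new node."""
    for p in partitions_to_move:
        owners[p] = new_node


p = partition_for("user:1")
owner_before = node_for("user:1")
add_node("node2", [p])
owner_after = node_for("user:1")  # the partition is unchanged, its owner is new
```

Because keys never change partition, re-balancing moves whole partitions between nodes instead of rehashing individual keys.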
10. 1.2 Riak
• Like Voldemort, Riak is based on Amazon's Dynamo design
• Offers a key/value interface
• Designed to run on large distributed clusters
• Uses consistent hashing to avoid the need for a centralized
index server
• Querying is handled using MapReduce functions written in JavaScript
• Open source, with a commercial edition for enterprise customers
11. 2. Document-based data model
• Similar to the key-value model, except that the value is a document.
• Usually uses a JSON-like interchange format.
• Query model: JavaScript-like or custom.
• Aggregations: Map/Reduce
• Indexes are typically implemented as B-trees.
• Unlike simple key-value stores, both keys and values are fully
searchable in document databases.
• e.g. Couchbase, MongoDB, RavenDB, ArangoDB, MarkLogic,
OrientDB, Redis, RethinkDB
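A small sketch may help show what "values are searchable" and "Map/Reduce aggregation" mean in practice. It models documents as plain Python dicts; the helper names `find` and `map_reduce` are hypothetical and only loosely mirror what real document databases offer.

```python
from collections import defaultdict

# Documents in one collection need not share the same fields.
docs = {
    "u1": {"name": "Ada", "city": "London", "tags": ["admin", "dev"]},
    "u2": {"name": "Linus", "city": "Helsinki", "tags": ["dev"]},
    "u3": {"name": "Grace", "city": "London"},  # no "tags" field
}


def find(collection, **criteria):
    """Return ids of documents whose fields equal all given criteria."""
    return sorted(
        doc_id
        for doc_id, doc in collection.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


def map_reduce(collection, map_fn, reduce_fn):
    """Group values emitted by map_fn by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in collection.values():
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_city(doc):
    yield doc["city"], 1


londoners = find(docs, city="London")
per_city = map_reduce(docs, emit_city, lambda key, values: sum(values))
```

Here `find` scans every document; a real database would use an index on `city` to avoid the full scan.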
12. 2.1 CouchDB
• Schema-free, document oriented database
– Documents stored in JSON format (XML in old versions)
– B-tree storage engine
– MVCC model, no locking
– no joins, no PK/FK (UUIDs are auto assigned)
– Implemented in Erlang
• First version written in C++; the second, in Erlang, was reportedly 500 times
more scalable (source: “Erlang Programming” by Cesarini & Thompson)
– Replication (incremental)
• Documents
– UUID, version
– Old versions retained
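The MVCC behavior listed above (no locking, each document carries a version, old versions retained) can be sketched as optimistic concurrency control. This is a simplified model under stated assumptions: revisions are plain integers here rather than CouchDB's real revision strings, and `RevisionedStore` and `ConflictError` are hypothetical names.

```python
import uuid


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class RevisionedStore:
    def __init__(self):
        self._revisions = {}  # doc id -> list of (rev, body), oldest first

    def create(self, body):
        doc_id = uuid.uuid4().hex  # ids are assigned automatically
        self._revisions[doc_id] = [(1, dict(body))]
        return doc_id, 1

    def get(self, doc_id, rev=None):
        history = self._revisions[doc_id]
        if rev is None:
            return history[-1]  # latest revision
        for stored_rev, body in history:
            if stored_rev == rev:
                return stored_rev, body
        raise KeyError(f"no revision {rev} for {doc_id}")

    def update(self, doc_id, expected_rev, body):
        history = self._revisions[doc_id]
        latest_rev = history[-1][0]
        if expected_rev != latest_rev:
            # No lock was taken; the writer simply lost the race.
            raise ConflictError(f"expected rev {expected_rev}, latest is {latest_rev}")
        history.append((latest_rev + 1, dict(body)))
        return latest_rev + 1


store = RevisionedStore()
doc_id, rev1 = store.create({"title": "draft"})
rev2 = store.update(doc_id, rev1, {"title": "final"})
try:
    store.update(doc_id, rev1, {"title": "late edit"})  # stale revision
    conflict = False
except ConflictError:
    conflict = True
```

Readers are never blocked: they keep seeing the revision they fetched, while a writer with a stale revision must re-read and retry.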
13. 2.2 MongoDB
• Another popular Document Database
• Data is stored on disk but cached in memory for speed
• Supports Replication and Partitioning (Sharding)
• Very popular in Web Applications
• Data is stored internally as BSON and exchanged with
applications as JSON.
• Very easy to set up and get started.
• Not open-source but free to use (even commercially), with an
optional support license.
15. 2.3 Redis
• Often referred to as a data structure server
• Supports strings, hashes, lists, sets, sorted sets, bitmaps, and
HyperLogLogs.
• Data is kept in memory
• Extremely popular for short-lived data (sessions, caches)
• Can be used as a push/pull message queue
16. 3. Column family data model
• The column is the smallest unit of data.
• It is a tuple that contains a name, a value, and a timestamp.
• Multiple columns (values) per key.
• e.g. Cassandra, HBase, Amazon Redshift, HP Vertica,
Teradata, BigTable, Hypertable
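The column tuple and the "multiple columns per key" layout can be modeled in a few lines. This sketch assumes that the newest timestamp wins when the same column is written twice; the names `Column` and `ColumnFamily` are illustrative only.

```python
import time
from collections import namedtuple

# A column is a (name, value, timestamp) tuple.
Column = namedtuple("Column", ["name", "value", "timestamp"])


class ColumnFamily:
    def __init__(self):
        self._rows = {}  # row key -> {column name -> Column}

    def put(self, row_key, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        row = self._rows.setdefault(row_key, {})
        existing = row.get(name)
        # Assumed conflict rule: the write with the newest timestamp wins.
        if existing is None or ts >= existing.timestamp:
            row[name] = Column(name, value, ts)

    def get(self, row_key, name):
        column = self._rows.get(row_key, {}).get(name)
        return None if column is None else column.value

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


users = ColumnFamily()
users.put("u1", "email", "ada@old.example", timestamp=1)
users.put("u1", "email", "ada@new.example", timestamp=3)
users.put("u1", "email", "ada@late.example", timestamp=2)  # older, ignored
users.put("u2", "phone", "555-0100", timestamp=1)          # different columns per row
```

Rows are sparse: `u1` has only an `email` column and `u2` only a `phone` column, with no schema change needed.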
17. 3.1 Cassandra
• Data is stored column-wise as opposed to row-wise
• Supports partitioning (sharding) and replication, even across data
centers.
• Can store petabytes of data.
• Supports CQL, an SQL-like query interface.
• Open-source but commercially supported by DataStax.
18. 3.1 Cassandra – data model, partitioning
• Data model
– Same as BigTable
– Super Columns (nested Columns) and Super Column Families
– column order in a CF can be specified (name, time)
• Dynamic partitioning
– Consistent hashing
– Ring of nodes
– Nodes can be “moved” on the ring for load balancing
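The ring-based partitioning above can be sketched with a sorted list of node tokens. This is a minimal illustration, not Cassandra's partitioner: the MD5-derived token, the clockwise replica placement, and the names `token` and `Ring` are assumptions.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a node name or key to a position on the ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key, n=2):
        """Return the first n nodes found clockwise from the key's token."""
        tokens = [t for t, _ in self._ring]
        start = bisect.bisect_right(tokens, token(key)) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def add_node(self, node):
        bisect.insort(self._ring, (token(node), node))


keys = [f"key-{i}" for i in range(200)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.replicas(k, n=1)[0] for k in keys}
ring.add_node("node-d")
after = {k: ring.replicas(k, n=1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d`; all other keys stay where they were, which is the property that makes adding nodes cheap.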
19. 3.2 BigTable
• Sparse, distributed, persistent multidimensional sorted map
• (row, column, timestamp) dimensions; the value is an uninterpreted string
• Key features
– Hybrid row/column store
– Single master (stand-by replica)
– Versioning
– Compression
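The "multidimensional sorted map" with versioning can be sketched as a sorted list of (row, column, timestamp) keys. This is a single-process toy model, not BigTable's storage format; `SortedTable` and its methods are hypothetical names, and the row keys are made-up examples.

```python
import bisect


class SortedTable:
    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so the newest version sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = SortedTable()
table.put("com.example/a", "contents", 1, "v1")
table.put("com.example/a", "contents", 2, "v2")
table.put("com.example/b", "contents", 1, "b1")
table.put("org.other/x", "contents", 1, "o1")
latest = table.get("com.example/a", "contents")
scanned_rows = [row for row, _, _, _ in table.scan("com.example", "com.examplf")]
```

Because rows are kept sorted, a range scan over adjacent row keys touches one contiguous region of the map.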
20. BigTable - architecture
• Master server
– Assign tablets to Tablet Servers
– Balance TS load
– Garbage collection
– Schema management
– Client data does not move through the master; clients talk directly to tablet servers
– Tablet location lookups are not handled by the master
• Tablet server (many)
– thousands of tablets per TS
– Manages Read / Write / Split of its tablets
21. 3.3 HBase
• Developed by Powerset, now Apache
• Based on BigTable
– HDFS (GFS), ZooKeeper (Chubby)
– Master Node (Master Server), Region Servers (Tablet Servers)
– HStore (tablet), memcache (memtable), MapFile (SSTable)
• Features
– Data is stored sorted (no real indexes)
– Automatic partitioning
– Automatic re-balancing / re-partitioning
– Fault tolerance (HDFS, 3 replicas)
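The in-memory buffer and sorted on-disk files mentioned above (memtable and SSTable in BigTable terms) can be sketched as a write path that buffers writes and flushes them as sorted, immutable files. This is a toy model: the `Region` class, the flush threshold, and the in-memory "files" are assumptions, not HBase code.

```python
import bisect


class Region:
    def __init__(self, flush_threshold=3):
        self.memstore = {}     # in-memory buffer of recent writes
        self.store_files = []  # immutable sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted, immutable file."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # Check the newest file first so later writes shadow older ones.
        for store_file in reversed(self.store_files):
            i = bisect.bisect_left(store_file, (key,))
            if i < len(store_file) and store_file[i][0] == key:
                return store_file[i][1]
        return None


region = Region(flush_threshold=2)
region.put("row-b", "1")
region.put("row-a", "2")   # threshold reached: buffer is flushed, sorted by key
region.put("row-a", "3")   # newer value, still only in memory
first_file = region.store_files[0]
```

Lookups inside a file use binary search on the sorted keys, which is why no separate index structure is needed in this sketch.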
23. 3.4 Hypertable
• An open-source clone of BigTable
• Written in C++
• Designed for high performance
24. 4. Graph data model
• Based on Graph Theory.
• Typically scales vertically; clustering support is limited.
• You can use graph algorithms easily
• Transactions
• ACID
• For modeling the structure of Data
• Uses Property Graph Data Model (Nodes, Relationships,
properties)
• e.g. Neo4j, InfiniteGraph, OrientDB, Titan GraphDB
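The property graph model and the claim that graph algorithms are easy to apply can be illustrated with a small sketch. It assumes directed relationships and uses breadth-first search for the shortest path; `PropertyGraph` and its methods are hypothetical names, not the API of Neo4j or any other product.

```python
from collections import deque


class PropertyGraph:
    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, relationship type, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_relationship(self, src, rel_type, dst, **props):
        self.edges.append((src, rel_type, dst, props))

    def neighbors(self, node_id, rel_type=None):
        return [
            dst
            for src, rel, dst, _ in self.edges
            if src == node_id and (rel_type is None or rel == rel_type)
        ]

    def shortest_path(self, start, goal):
        """Breadth-first search following relationships in their direction."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.neighbors(path[-1]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None


g = PropertyGraph()
g.add_node("alice", age=34)
g.add_node("bob", age=29)
g.add_node("carol", age=41)
g.add_node("acme", kind="company")
g.add_relationship("alice", "KNOWS", "bob", since=2010)
g.add_relationship("bob", "KNOWS", "carol", since=2015)
g.add_relationship("alice", "WORKS_AT", "acme")
path = g.shortest_path("alice", "carol")
friends = g.neighbors("alice", "KNOWS")
```

Both nodes and relationships carry properties, and traversal follows relationships directly instead of joining tables.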
25. Other Types / Special Purpose
• Search databases: Solr, Elasticsearch
• Object Databases
• XML Databases