SlideShare una empresa de Scribd logo
1 de 63
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Belgaum, Karnataka
IV SEMESTER SEMINAR ON
“NoSQL Data Management: Concepts and Systems”
Md. Mushahid Faizan
(1RX16MCA34)
RNS INSTITUTE OF TECHNOLOGY
Estd: 2001
Department of MCA
Channasandra, Dr. Vishnuvardhan, Bengaluru-560098.
NoSQL Data Management
Overview
• Introduction to NoSQL
• Basic Concepts for NoSQL
• Overview of NoSQL Systems
2
NoSQL Data Management
History of NoSQL
● MultiValue databases at TRW in 1965.
● DBM is released by AT&T in 1979.
● Lotus Domino released in 1989.
● Carlo Strozzi used the term NoSQL in 1998 to name his lightweight,
open-source relational database that did not expose the standard SQL
interface.
● Graph database Neo4j is started in 2000.
● Google BigTable is started in 2004. Paper published in 2006.
● CouchDB is started in 2005.
● The research paper on Amazon Dynamo is released in 2007.
● The document database MongoDB is started in 2007 as a part of a open
source cloud computing stack and first standalone release in 2009.
● Facebooks open sources the Cassandra project in 2008.
● Project Voldemort started in 2008.
● The term NoSQL was reintroduced in early 2009.
● Some NoSQL conferences
NoSQL Matters, NoSQL Now!, INOSA
3
NoSQL Data Management
History of NoSQL
• SQL Databases were dominant for decades
 Persistent storage
 Standards based
 Concurrency Control
 Application Integration
 ACID
•  Designed to run on a single „big“ machine
• Cloud computing changes that dramatically
 Cluster of machines
 Large amount of unreliable machines
 Distributed System
 Schema-free unstructured Big Data
4
NoSQL Data Management
Methods to Run a Database
• Virtual Machine Image
 Users purchase virtual machine instances to run a database on these
 Upload and setup own image with database, or use ready-made images with optimized
database installations
 E.g. Oracle Database 11g Enterprise Edition image for Amazon EC2 and for Microsoft
Azure.
• Database as a service (DBaaS)
 Using a database without physically launching a virtual machine instance
 No configuration or management needed by application owners
 E.g. Amazon Web Services provide SimpleDB, Amazon Relational Database Service (RDS),
DynamoDB,
• Managed database hosting
 Not offered as a service, but hosted and managed by the cloud database vendor
 E.g. Rackspace offers managed hosting for MySQL
5
NoSQL Data Management
Which Data Model?
• Relational Databases
 Standard SQL database available for Cloud Environments as Virtual Machine
Image or as a service depending on the vendor
 Not cloud-ready: Difficult to scale
• NoSQL databases
 Database which is designed for the cloud
 Built to serve heavy read/write loads
 Good ability to scale up and down
 Applications built based on SQL data model require a complete rewrite
 E.g. Apache Cassandra, CouchDB and MongoDB
6
NoSQL Data Management
How to scale the data management?
• Vertical scaling – Scale up
• Horizontal scaling – Scale out
7
NoSQL Data Management
Definition and Goals of NoSQL databases
• No formal NoSQL definition
available!
• Store very large scale data
called “Big data”
• Typically scale horizontally
• Simple query mechanisms
• Often designed and set up for
a concrete application
• Typical characteristics of
NoSQL databases are:
 Non-relational
 Schema-free
 Open Source
 Simple API
 Distributed
 Eventual consistency
Source: https://clt.vtc.edu.hk/what-happens-online-in-60-seconds/
8
NoSQL Data Management
Non-relational
• NoSQL databases generally do not follow the relational model
• Do not provide tables with flat fixed-column records
• Work with self-contained (hierarchical) aggregates or BLOBs
• No need for object-relational mapping and data normalization
• No complex and costly features like query languages, query planners,
referential integrity, joins, ACID
Schema-free
9
• Most NoSQL databases are schema-free or have relaxed schemas
• No need for definition of any sort of schema of the data
• Allows heterogeneous structures of data in the same domain
NoSQL Data Management
Simple API
• Often simple interfaces for storage and querying data provided
• APIs often allow low-level data manipulation and selection methods
• Often no standard based query language is used
• Text-based protocols often using HTTP REST with JSON
• Web-enabled databases running as internet-facing services
Distributed
• Several NoSQL databases can be executed in a distributed fashion
• Providing auto-scaling and fail-over capabilities
• Often ACID is sacrificed for scalability and throughput
• Often no synchronous replication between distributed nodes is possible, e.g.
asynchronous Multi-Master Replication, peer-to-peer, HDFS Replication
• Only providing eventual consistency
10
NoSQL Data Management
Core Categories of NoSQL Systems
• Key-Value Stores
 Manage associative arrays
 Big hash table
• Wide Column Stores
 Each storage block contains only data from one column
 Read and write is done using columns (rather than rows – like
in SQL)
• Document Stores
 Store documents consisting of tagged values
 Data is a collection of key value pairs
 Provides structure and encoding of the managed data
 Encoded using XML, JSON, BSON
 Schema-free
• Graph DB
 Network database using graphs with node and edges for
storage
 Nodes represent entities, edges represent their relationships
• Other NoSQL systems
 ObjectDB
 XML DB
 Special Grid DB
 Triple Store
Value
{
“id”: 123,
“name”:”Matthias”,
“location”:{
“city”:”Hersonissos”,
“region”:”Crete”
}
}
A B
C
relation
11
NoSQL Data Management
Usage of NoSQL in Practice
• Google
 Big Table
 Google Apps, Google Search
• Facebook
 Social network
• Twitter
• Amazon
 DynamoDB and SimpleDB
• CERN
• GitHub
12
NoSQL Data Management
Overview
• Introduction to NoSQL
• Basic Concepts for NoSQL
 CAP-Theorem
 Eventual Consistency
 Consistent Hashing
 MVCC-Protocol
 Query Mechanisms for NoSQL
• Overview of NoSQL Systems
13
NoSQL Data Management
CAP Theorem – Brewer's theorem
NoSQL (AP)
Domain Name
System DNS (AP)
Cloud Computing
(AP)
 Consistency: all nodes see the
same data at the same time
 Availability: every request
receives a response about
whether it succeeded or failed
 Partition tolerance: the
system continues to operate
despite physical network
partitions
Choose two!
• States that it is impossible for a
distributed system to provide all
three of the following guarantees
14
NoSQL Data Management
Eventual consistency and BASE
• The term “eventual consistency”
 Copies of data on multiple machines to achieve high availability and scalability
 A change to a data item on one machine has to be propagated to other replicas
 Propagation is not instantaneous so the copies will be mutually inconsistent
 The change will eventually be propagated to all the copies
 Fast access requirement dominates
 Different replicas can return different values for the queried attribute of the object
 A System that achieved eventual consistency converged, or achieved replica
convergence
• Eventual consistency guarantees:
 If no new updates are made to a given data item eventually all accesses to that item will
return the last updated value
• Eventually consistent services provide weak consistency using BASE
 Basically Available, Soft state, Eventual consistency
 Basically available indicates that the system guaranteed availability (CAP theorem)
 Soft state indicates that the state of the system may change over time, even without input
 Eventual consistency indicates that the system will become consistent over time
15
NoSQL Data Management
Consistent Hashing
• Technique how to efficiently distribute replicas to nodes
• Consistent hashing is a special kind of hashing
 When hash table is resized only K/n keys need to be remapped on average
 K is the number of keys, and n is the number of slots
 In traditional hash tables nearly all keys have to be remapped
• Insert Servers on ring
 Hash based e.g. on IP, Name, …
 Take over objects between own and processor hash
• Insert Objects on ring
 Hash based on key
 Walks around the circle until
falling into the first bucket
• Delete Servers
 Copy objects to next server
• Virtual Servers
 More than one hash per server
• Replication
 Place objects multiple times
 Improves reliability
Server 1
0
12
2
Server 3
Server 2/2
16
SeSrevrevre2r/21
NoSQL Data Management
Multiversion Concurrency Control (MVCC)
• Concurrency control method to
provide concurrent access to the
database
• LOCKING
 All readers wait until the writer is done
 This can be very slow!
• MVCC
 An write adds a new version
 Read is always possible
 Any changes made by a writer will not
be seen by other users of the database
until they have been committed
 Conflicts (e.g. V1a ,V1b) can occur and
have to be handled
User 1 User 2
V
transaction T
User 1
write
read V (Delayed)
Lock!
V1
write
V0
MVCC
read V0-Old Version
V1 read V1
transaction T
User 1
User 1 User 2
17
NoSQL Data Management
Query Mechanisms for NoSQL
• REST based retrieval of a value based on its key/ID with
GET resource
• Document stores allow more complex queries
 E.g. CouchDB allows to define views with MapReduce
• MapReduce
 Available in many NoSQL databases
 Can run fully distributed
 It is Functional Programming, not writing queries!
 Map phase - perform filtering and sorting
 Reduce phase - performs a summary operation
 ~ SELECT and GROUP BY of a relational database
 More details later!
• Apache Spark is an open source big data processing
framework providing more operations than MapReduce
• Example use cases for MapReduce
 Distributed Search
 Counting – URL, Words
 Building linkage graphs for web sites
 Sorting distributed data
17
Source: @tgrall
NoSQL Data Management
Overview
• Introduction to NoSQL
• Basic Concepts for NoSQL
• Overview of NoSQL Systems
 Key-Value Stores
 Document Stores
 Wide-column stores
 Graph Stores
 Hadoop Map/Reduce and more …
19
NoSQL Data Management
Key Value Stores
Memcached
Project
Voldemort
• Developer: Basho Technologies
(http://basho.com/)
• Current version: 2.1.1
• Available since: 2009
• Licence: Apache licence 2.0
• Supported operating systems:
Linux, BSD, Mac OS, Solaris
• Client libraries for:
 Java, Ruby, Python, C#, Erlang
(the official ones)
 C, Node.js, PHP (the unofficial
ones)
 even more form the Riak
community
• Developer: Basho Technologies
(http://basho.com/)
• Current version: 2.1.1
• Available since: 2009
• Licence: Apache licence 2.0
• Supported operating systems:
Linux, BSD, Mac OS, Solaris
• Client libraries for:
 Java, Ruby, Python, C#, Erlang
(the official ones)
 C, Node.js, PHP (the unofficial
ones)
 even more form the Riak
community
20
NoSQL Data Management
Typical Use Cases
• Session data
• User profiles
• Sensor data (IOT)
timestamp x y z temperature
01.01.2014 350 120 78 -10°
01.01.2014 350 120 95 -9
01.01.2014 350 100 78 -10°
02.01.2014 350 120 78 -5°
02.01.2014 350 120 95 -8°
…
key value
sessionid=A08154711
userlogin=“xyz”
date_of_expiry=2015/12/31
id
21
NoSQL Data Management
Key Functionality
buckets
key value
key value
key value
key value
store <k,v>
get <k>
delete <k>
get bucket properties
set bucket properties
bucket types
create bucket type
activate bucket type
update bucket type
get status of bucket type
22
NoSQL Data Management
Instances and Vnodes
• Riak runs on potentially large
clusters
• Each host in the cluster runs a
single instance of Riak (Riak
node)
• Each Riak node manages a set of
virtual node (vnodes)
• Mapping of <bucket,key> pairs
 compute 160-bit hash
 map result to a ring position
 ring is divided into partitions
 each vnode is responsible for one
partition
docs.basho.com 23
NoSQL Data Management
Configure Replication
• Some bucket parameters
 N: replication factor
 R: number of servers that must
respond to a read request
 W: number of servers that must
respond to a write request
 DW: number of servers that
must report that a write has
successfully been written to disk
 …
 Parameters allow to trade
consistency vs. availability vs.
performance
docs.basho.com 24
NoSQL Data Management
Transactions and Consistency
• No multi-operation transactions are supported
• Eventual consistency is default (and was the only option before Riak 2.0)
 Vector clocks and Dotted Version Vectors (DVV) used to resolve object
conflicts
• Strong consistency as an alternative option
 A quorum of nodes is needed for any successful operation
25
NoSQL Data Management
Riak Search
• For Raik KV, a value is just a value possibly associated with a type
• Riak Search 2.0
 Based on Solr, the search platform built on Apache Lucene
 Define extractors, i.e., modules responsible for pulling out a list of fields and
values from a Riak object
 Define Solr schema to instruct Riak/Solr how to index a value
 Queries:
exact match, globs, inclusive/exclusive range queries, AND/OR/NOT, prefix
matching, proximity searches, term boosting, sorting, …
26
NoSQL Data Management
Document Stores
• Developer: MongoDB, Inc.
(https://www.mongodb.com/)
• Available since: 2009
• Licence: GNU AGPL v3.0
• Supported operating systems: all
major platforms
• Drivers for:
 C, C++, C#, Java, Node.js,
Perl, PHP, Python, Motor,
Ruby, Scala, …
• Developer: MongoDB, Inc.
(https://www.mongodb.com/)
• Available since: 2009
• Licence: GNU AGPL v3.0
• Supported operating systems: all
major platforms
• Drivers for:
 C, C++, C#, Java, Node.js,
Perl, PHP, Python, Motor,
Ruby, Scala, …
27
NoSQL Data Management
What are Documents?
• Aggregated data
• No fixed schema
• Internal structure matters
• Format: JSON, BSON, …
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address":
{
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumbers":
[
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
]
}
28
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address":
{
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumbers":
[
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
]
}
{
"firstName": “Paul",
"lastName": “Adam",
"age": 45,
"address":
{
"streetAddress": "22 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
}
}
{
"firstName": “Paul",
"lastName": “Adam",
"age": 45,
"address":
{
"streetAddress": "22 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
}
}
NoSQL Data Management
Typical Use Cases
• Simple content management
 e.g. blogging platforms
• Logging events
 cope with event type heterogeneity
 data associated with events changes over time
• E-Commerce applications
 flexible schema for product and order data
29
NoSQL Data Management
Data Organization
• Collections of documents that
share indexes (but not a schema)
• Collections are organized into
databases
• Documents stored in BSON
databases
collections
30
NoSQL Data Management
Key Functionality
collections
find <field critera>
insert document
update document(s)
create index
drop index
databases
create collection
drop collection
delete document(s)
query data inside
the documents
31
NoSQL Data Management
Querying Documents
www.mongodb.com 32
NoSQL Data Management
Availability
• Each Replica set is group of
MongoDB instances that hold the
same dataset
• one primary instance that takes
all write operations
• multiple secondary instances
• changes are replicated to the
secondaries
• if the primary is unavailable, the
replica set will elect a secondary
to be primary
• Specific secondaries
 priority 0 member
 hidden member
 delayed member
 arbiter
sec.
1
sec.
2
prim.
read/write
apply opplog
replica set
33
NoSQL Data Management
Scalability
• Replica sets for read scalability
 reading from secondaries may
provide stale data
• Sharding for write scalability
 at collection level
 using indexed field that is
available in all documents of the
collection
 range-based or hash-based
 Each shard is a MongoDB
instance
 background processes to
maintain even distribution:
splitting + balancer
 Shards may also hold replica sets
shard shard shard
1 2 3
router
gconfi
serveg rconfi
serveconfig r
server
metadata
read/write
34
NoSQL Data Management
Transactions and Consistency
• Write concern, w option
 default:
 num.:
 majority:
confirms write operations only on the primary
Guarantees that write operations have propagated successfully to
the specified number of replica set members including the primary
Confirms that write operations have propagated to the majority of
voting nodes
• Read preference
 describes how MongoDB clients route read operations to the members of a
replica set, i.e., from one of the secondaries or the primary
 eventual consistency!
• No multi-operation transactions supported. reading stale data
from the primary
is possible!
35
NoSQL Data Management
Wide-Column Stores / Column Family Stores
• 2006: originally project of
company Powerset
• 2008: HBase becomes Hadoop
sub-project.
• 2010: HBase becomes Apache
top-level project.
• runs on top of HDFS (Hadoop
Distributed File System)
• providing BigTable-like
capabilities for Hadoop
• APIs: Java, REST, Thrift, C/C++
• 2006: originally project of
company Powerset
• 2008: HBase becomes Hadoop
sub-project.
• 2010: HBase becomes Apache
top-level project.
• runs on top of HDFS (Hadoop
Distributed File System)
• providing BigTable-like
capabilities for Hadoop
• APIs: Java, REST, Thrift, C/C++
36
NoSQL Data Management
Logical Data Model
• Table rows contain:
 row key
 versions, typically a timestamp
 multiple column families per key
- define column families at design time
- add columns to a column family at runtime
• Metadata
 there is no catalog that provides the set of all columns for a certain table
 left to the user/application
sparse table
37
NoSQL Data Management
Physical Data Model
• Store each column familiy separately
• Sorted by timestamp (descending)
• Empty cells from the logical view are not stored
• Key/Value class
keylength valuelength key value
rowlength row
key
columnfamily
length
column
family
columnqualifier timestamp keytype
com.cnn.
www
2 anchor cnnsi.com t9 put
e.g.
38
NoSQL Data Management
Key Functionality
namespaces
get <k, t>
put <k, …>
scan
regions
split
merge
delete <k>
create
alter
drop
c, …, c
key family c, …, c
key family c, …, c
key family c, …, c
table
key family
39
NoSQL Data Management
MasterServer and RegionServer
• Failover
 HBase clients talk directly to the
RegionServers, hence they may
continue without MasterServer (at least
for some time)
 catalog table META exists as HBase
tables, i.e., not resident in the
MasterServer
HMaster
• monitors RegionServers
• operations related to metadata
(tables, column families, regions)
Backup HMaster
…
HRegionServer
• manages regions
• operations related to data (put,
get, …)
• operations related to regions
(split, merge, …)
META: list of regions for each table
• Failover
 Region immediately becomes
unavailable when the RegionServer is
down
 The Master will detect that the
RegionServer has failed
 region assignments will be considered
invalid
 assign region to a new RegionServer
40
NoSQL Data Management
Storage Structure
• Table T with column families a and b
HRegionServer
Store a
MemStore MemStore
StoreFile
StoreFile
StoreFile
Store b
StoreFile
StoreFile
Store a Store b
MemStore MemStore
StoreFile
StoreFile
StoreFile
StoreFile
HRegionServer
Store a Store b
MemStore MemStore
StoreFile StoreFile
StoreFile StoreFile
Block
<k,v> <k,v> …
Block
Block
Block
mainmemoryfilesystem
Log Log
regions
41
NoSQL Data Management
Write Data
1. Write change to log (WAL)
2. Write change to MemStore
3. Regularily flush MemStore to disk (into StoreFiles) and empty MemStore
HRegionServerHRegionServer
Store a
MemStore MemStore
StoreFile
StoreFile
StoreFile
Store b
StoreFile
StoreFile
Store a
MemStore MemStore
StoreFile
StoreFile
StoreFile
Store b
StoreFile
Store a
MemStore MemStore
StoreFile
StoreFile
Store b
StoreFile
StoreFile
mainmemoryfilesystem
Log Log
write table_T.family_a.field_f
  
42
NoSQL Data Management
Read Data
1. Read from Block Cache
2. Read from MemStore
3. Read from all relevant StoreFiles
4. Merge results
HRegionServerHRegionServer
Store a
MemStore MemStore
StoreFile
StoreFile
StoreFile
Store b
StoreFile
StoreFile
Store a
MemStore MemStore
StoreFile
StoreFile
StoreFile
Store b
StoreFile
Store a
MemStore MemStore
StoreFile
StoreFile
Store b
StoreFile
StoreFile
mainmemoryfilesystem
Log Log
read table_T.family_a.field_f
  
43
NoSQL Data Management
StoreFile Reorganisation
• minor compaction
 merge various StoreFiles, without
considering tombstones etc.
• major compaction
 reorg of store files, e.g. by
removing deleted rows
 significantly reduces file size
 at a configurable time interval
or manually
• Merge Regions
 consolidate several regions of one table into a single region
 offline reorg initiated manually!
• Split Regions
 distribute data from one region to several regions
 configurable by parameter max.filesize or manually
Store a
MemStore
StoreFile
StoreFile
StoreFile
Store a
MemStore
StoreFile
44
NoSQL Data Management
Transactions and Consistency
• No explicit transaction boundaries
• Atomicity
 atomic row-level operations (either "success" or "failure")
 operations spanning several rows (batch put) are not atomic
• Consistency
 Default: Strong consistency by routing all through a single region server
 Optional: Region replication for high availability
- Writes only through the primary
- Reads may also be processed by the secondaries
45
NoSQL Data Management
SELECT WRITETIME (name)
FROM excelsior.clicks
WHERE url = 'http://apache.org';
SELECT WRITETIME (name)
FROM excelsior.clicks
WHERE url = 'http://apache.org';
CQL in Cassandra
45
CREATE KEYSPACE demodb WITH REPLICATION =
{'class' : 'SimpleStrategy', 'replication_factor': 3};
CREATE KEYSPACE demodb WITH REPLICATION =
{'class' : 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE users (
user_name varchar,
password varchar,
gender varchar,
session_token varchar,
state varchar,
birth_year bigint,
PRIMARY KEY (user_name));
CREATE TABLE users (
user_name varchar,
password varchar,
gender varchar,
session_token varchar,
state varchar,
birth_year bigint,
PRIMARY KEY (user_name));
SELECT *
FROM emp
WHERE empID IN (130,104)
ORDER BY deptID DESC;
SELECT *
FROM emp
WHERE empID IN (130,104)
ORDER BY deptID DESC;
INSERT INTO excelsior.clicks (userid, url, date, name)
VALUES ( 3715e600-2eb0-11e2-81c1-0800200c9a66,
'http://apache.org', '2013-10-09', 'Mary')
USING TTL 86400;
INSERT INTO excelsior.clicks (userid, url, date, name)
VALUES ( 3715e600-2eb0-11e2-81c1-0800200c9a66,
'http://apache.org', '2013-10-09', 'Mary')
USING TTL 86400;
cassandra.apache.org
NoSQL Data Management
Graph Databases
Flock DB
• Developer: Neo Technology
(http://www.neotechnology.com)
• Available since: 2007
• Licence: GPLv3 and AGPLv3,
commercial
• Supported operating systems: all
major platforms
• Drivers for:
 Java, .NET, JavaScript, Python,
Ruby, PHP
 and R, Go, Clojure, Perl, Haskell
• Developer: Neo Technology
(http://www.neotechnology.com)
• Available since: 2007
• Licence: GPLv3 and AGPLv3,
commercial
• Supported operating systems: all
major platforms
• Drivers for:
 Java, .NET, JavaScript, Python,
Ruby, PHP
 and R, Go, Clojure, Perl, Haskell
47
NoSQL Data Management
Typical Use Cases
• Highly connected data,
e.g., social networks,
employees and their knowledge
• Location-based services,
e.g., planning delivery
• Recommendation systems,
e.g., bought products, often-visited attractions
people who visited … also visited …
5
15
7
9
1
20 12
50
graph with
distances
smartblogs.com
48
NoSQL Data Management
Graph Data Model
48
node
relationship
properties
• No need to define a schema
neo4j.com
NoSQL Data Management
Basic Functionality
match node
patterns
graph traversal
graph
create node/relationship
delete node/relationship
set property
remove property
create index
query index
50
NoSQL Data Management
Cypher Example
• Many query languages: Cypher, Gremlin, G, GraphLog, GRAM, GraphDB,
…
• No standard
CREATE (me:PERSON {name:”Holger”})
CREATE (mat:PERSON {name:”Matthias”})
CREATE (fra:PERSON {name:”Frank”})
CREATE (me) -[knows:KNOWS]-> (mat)
CREATE (me) -[knows:KNOWS]-> (fra)
CREATE (mat) -[knows:KNOWS]-> (me)
MATCH (n {name:”Holger”})-[:KNOWS]->(m)
51
NoSQL Data Management
S2
S3 S4
S1
Scalability
• Strategies for read scaling
a) large enough memory
for the working set of nodes
b) adding read-only slaves
graph
data
node memory
graph
data
graph
data
graph
data
graph
data
slave memory
c) application-level sharding
graph
data
north
graph
data
south
appl query
north/south?
automatic
sharding?
52
NoSQL Data Management
High Availability
M S1 S2 S3
 
propagate
 write  commit
• HA availability feature in neo4j
 cluster of 1 master and n slave nodes
 continues to operate from any number of nodes down to a single machine
 nodes may leave and re-join the cluster
 in case of master failure, a new master will be elected
 read operations are possible on any node
 write operations are possible on any node and propagated to the others
write on master
53
NoSQL Data Management
High Availability
M S1 S2 S3
 write  commit
 
propagate
 
propagate
 commit
• HA availability feature in neo4j
 cluster of 1 master and n slave nodes
 continues to operate from any number of nodes down to a single machine
 nodes may leave and re-join the cluster
 in case of master failure, a new master will be elected
 read operations are possible on any node
 write operations are possible on any node and propagated to the others
write on slave
M S1 S2 S3
pull asynchronously
54
NoSQL Data Management
Transactions and Consistency
• Set transaction boundaries explicitly
• Transactions are possible on any node in the cluster
• Transactions are atomic and durable
• Writes are eventually consistent
 optimistically pushed to slaves
 slaves can also be configured to pull updates asynchronously
55
NoSQL Data Management
Hadoop Ecosystem
HDFS
(Hadoop Distributed File System)
• Apache project http://hadoop.apache.org/
HBase
YARN
(Cluster Resource Management)
Zookeeper
(Coordination)
MapReduce Framework
Hive Pig …
56
NoSQL Data Management
Principles of Map Reduce
• User provides data in files
• Data model: key/value pairs (k, v)
• Based on higher-order functions MAP and REDUCE
• Tasks of the programmer
 User-defined functions m and r serve as input to MAP and REDUCE
 m and r define what the job actually does
• MAP m: Km,Vm ↦ Kr, Vr
∗
• REDUCE r: (Kr, Vr
∗
)↦(Kr, Vr)
• Example: Aggregate salary per department:
(employee, <name, department, salary, …>)
(department, salary)
MAP
(department, salary)
(department, <salary, salary, …>)
RED
57
NoSQL Data Management
…
Execution of Map/Reduce Jobs
57
file
task tracker m2
map()
combine()
partition()
task tracker m3
task tracker m1
split 1
split 2
split 3
split 4
split 5
k/v 1
k/v 2
k/v 1
k/v 2
k/v 1
k/v 2
task tracker r2
task tracker r1
shuffle()
sort()
reduce()
output()
file
phases defined by user
job trackerprogram
client start &
control
NoSQL Data Management
Fault Tolerance
58
• map node fails
 job tracker receives no report for a certain time -> mark node as failed
 restart map job on a different node
 new job reads another copy of the necessary input split
• reduce node fails
 job tracker receives no report for a certain time -> mark node as failed
 restart reduce job on a different node
 read necessary intermediate input data from map nodes
• To make this work, all relevant data has to be stored in a distributed file
system, in particular
 the input splits
 intermediate data produced by map jobs
Hadoop Files System (HDFS)
NoSQL Data Management
HDFS Architecture
59HDFS Users Guide at http://hadoop.apache.org
NoSQL Data Management
Is that all?
• Other systems build on or extend this
basic functionality
• Build an SQL layer on to of Hadoop MapReduce
 Hive
 Pig
• Others focus on datastream and in-memory processing
 Spark
 Flink
61
NoSQL Data Management
What we also Skipped Today
• Further classes of NoSQL systems
 Triple stores, …
• NewSQL
• Cloud offerings for the various types of NoSQL data stores
 e.g., Riak CS (Cloud Storage)
• More cloud platforms
 IBM Bluemix
 Google app engine
62
NoSQL Data Management
Conclusion
• Relational Databases provide
 Data spread over many tables
 Schema needs to be defined
 Structured query language (SQL)
 Transactions
 Strong Consistency
 General purpose applicability
• NoSQL
 Aggregated data in one object (identified by a key)
 No predefined schema
 No declarative query language
 Limited transactional capability
 Eventual consistency rather ACID property
 Focus on scalability and availability
 Often selected and customized for a concrete application scenario
To make a proper decision,
carefully examine your application
• the data model that is most
appropriate
• the query complexity
• the consistency needs
• the transactional requirements
63
To make a proper decision,
carefully examine your application
• the data model that is most
appropriate
• the query complexity
• the consistency needs
• the transactional requirements

Más contenido relacionado

La actualidad más candente

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingMinhazul Arefin
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databasesAshwani Kumar
 
Data mining in social network
Data mining in social networkData mining in social network
Data mining in social networkakash_mishra
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBLee Theobald
 
Chapter 06 Data Mining Techniques
Chapter 06 Data Mining TechniquesChapter 06 Data Mining Techniques
Chapter 06 Data Mining TechniquesHouw Liong The
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series dataKrish_ver2
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 

La actualidad más candente (20)

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computing
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data mining in social network
Data mining in social networkData mining in social network
Data mining in social network
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Web mining
Web miningWeb mining
Web mining
 
Chapter 06 Data Mining Techniques
Chapter 06 Data Mining TechniquesChapter 06 Data Mining Techniques
Chapter 06 Data Mining Techniques
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
 
Data Mining
Data MiningData Mining
Data Mining
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 

Similar a NoSql Data Management

Chapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choicesChapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choicesMaynooth University
 
NoSQL Database in .NET Apps
NoSQL Database in .NET AppsNoSQL Database in .NET Apps
NoSQL Database in .NET AppsShiju Varghese
 
Mongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMohan Rathour
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
Solr cloud the 'search first' nosql database extended deep dive
Solr cloud the 'search first' nosql database   extended deep diveSolr cloud the 'search first' nosql database   extended deep dive
Solr cloud the 'search first' nosql database extended deep divelucenerevolution
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageBethmi Gunasekara
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture PatternsMaynooth University
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityIvan Zoratti
 

Similar a NoSql Data Management (20)

NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
 
NOsql Presentation.pdf
NOsql Presentation.pdfNOsql Presentation.pdf
NOsql Presentation.pdf
 
NoSQL and MongoDB
NoSQL and MongoDBNoSQL and MongoDB
NoSQL and MongoDB
 
Chapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choicesChapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choices
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 
No SQL
No SQLNo SQL
No SQL
 
NoSQL Database in .NET Apps
NoSQL Database in .NET AppsNoSQL Database in .NET Apps
NoSQL Database in .NET Apps
 
Mongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorialMongo Bb - NoSQL tutorial
Mongo Bb - NoSQL tutorial
 
Big data stores
Big data  storesBig data  stores
Big data stores
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Database Technologies
Database TechnologiesDatabase Technologies
Database Technologies
 
Revision
RevisionRevision
Revision
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Solr cloud the 'search first' nosql database extended deep dive
Solr cloud the 'search first' nosql database   extended deep diveSolr cloud the 'search first' nosql database   extended deep dive
Solr cloud the 'search first' nosql database extended deep dive
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
NoSQL Data Architecture Patterns
NoSQL Data ArchitecturePatternsNoSQL Data ArchitecturePatterns
NoSQL Data Architecture Patterns
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
NoSQL Consepts
NoSQL ConseptsNoSQL Consepts
NoSQL Consepts
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
 

Último

Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 

Último (20)

Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 

NoSql Data Management

  • 1. VISVESVARAYA TECHNOLOGICAL UNIVERSITY Belgaum, Karnataka IV SEMESTER SEMINAR ON “NoSQL Data Management: Concepts and Systems” Md. Mushahid Faizan (1RX16MCA34) RNS INSTITUTE OF TECHNOLOGY Estd: 2001 Department of MCA Channasandra, Dr. Vishnuvardhan, Bengaluru-560098.
  • 2. NoSQL Data Management Overview • Introduction to NoSQL • Basic Concepts for NoSQL • Overview of NoSQL Systems 2
  • 3. NoSQL Data Management History of NoSQL ● MultiValue databases at TRW in 1965. ● DBM is released by AT&T in 1979. ● Lotus Domino released in 1989. ● Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface. ● Graph database Neo4j is started in 2000. ● Google BigTable is started in 2004. Paper published in 2006. ● CouchDB is started in 2005. ● The research paper on Amazon Dynamo is released in 2007. ● The document database MongoDB is started in 2007 as a part of a open source cloud computing stack and first standalone release in 2009. ● Facebooks open sources the Cassandra project in 2008. ● Project Voldemort started in 2008. ● The term NoSQL was reintroduced in early 2009. ● Some NoSQL conferences NoSQL Matters, NoSQL Now!, INOSA 3
  • 4. NoSQL Data Management History of NoSQL • SQL Databases were dominant for decades  Persistent storage  Standards based  Concurrency Control  Application Integration  ACID •  Designed to run on a single „big“ machine • Cloud computing changes that dramatically  Cluster of machines  Large amount of unreliable machines  Distributed System  Schema-free unstructured Big Data 4
  • 5. NoSQL Data Management Methods to Run a Database • Virtual Machine Image  Users purchase virtual machine instances to run a database on these  Upload and setup own image with database, or use ready-made images with optimized database installations  E.g. Oracle Database 11g Enterprise Edition image for Amazon EC2 and for Microsoft Azure. • Database as a service (DBaaS)  Using a database without physically launching a virtual machine instance  No configuration or management needed by application owners  E.g. Amazon Web Services provide SimpleDB, Amazon Relational Database Service (RDS), DynamoDB, • Managed database hosting  Not offered as a service, but hosted and managed by the cloud database vendor  E.g. Rackspace offers managed hosting for MySQL 5
  • 6. NoSQL Data Management Which Data Model? • Relational Databases  Standard SQL database available for Cloud Environments as Virtual Machine Image or as a service depending on the vendor  Not cloud-ready: Difficult to scale • NoSQL databases  Database which is designed for the cloud  Built to serve heavy read/write loads  Good ability to scale up and down  Applications built based on SQL data model require a complete rewrite  E.g. Apache Cassandra, CouchDB and MongoDB 6
  • 7. NoSQL Data Management How to scale the data management? • Vertical scaling – Scale up • Horizontal scaling – Scale out 7
  • 8. NoSQL Data Management Definition and Goals of NoSQL databases • No formal NoSQL definition available! • Store very large scale data called “Big data” • Typically scale horizontally • Simple query mechanisms • Often designed and set up for a concrete application • Typical characteristics of NoSQL databases are:  Non-relational  Schema-free  Open Source  Simple API  Distributed  Eventual consistency Source: https://clt.vtc.edu.hk/what-happens-online-in-60-seconds/ 8
  • 9. NoSQL Data Management Non-relational • NoSQL databases generally do not follow the relational model • Do not provide tables with flat fixed-column records • Work with self-contained (hierarchical) aggregates or BLOBs • No need for object-relational mapping and data normalization • No complex and costly features like query languages, query planners, referential integrity, joins, ACID Schema-free 9 • Most NoSQL databases are schema-free or have relaxed schemas • No need for definition of any sort of schema of the data • Allows heterogeneous structures of data in the same domain
  • 10. NoSQL Data Management Simple API • Often simple interfaces for storage and querying data provided • APIs often allow low-level data manipulation and selection methods • Often no standard based query language is used • Text-based protocols often using HTTP REST with JSON • Web-enabled databases running as internet-facing services Distributed • Several NoSQL databases can be executed in a distributed fashion • Providing auto-scaling and fail-over capabilities • Often ACID is sacrificed for scalability and throughput • Often no synchronous replication between distributed nodes is possible, e.g. asynchronous Multi-Master Replication, peer-to-peer, HDFS Replication • Only providing eventual consistency 10
  • 11. NoSQL Data Management Core Categories of NoSQL Systems • Key-Value Stores  Manage associative arrays  Big hash table • Wide Column Stores  Each storage block contains only data from one column  Read and write is done using columns (rather than rows – like in SQL) • Document Stores  Store documents consisting of tagged values  Data is a collection of key value pairs  Provides structure and encoding of the managed data  Encoded using XML, JSON, BSON  Schema-free • Graph DB  Network database using graphs with node and edges for storage  Nodes represent entities, edges represent their relationships • Other NoSQL systems  ObjectDB  XML DB  Special Grid DB  Triple Store Value { “id”: 123, “name”:”Matthias”, “location”:{ “city”:”Hersonissos”, “region”:”Crete” } } A B C relation 11
  • 12. NoSQL Data Management Usage of NoSQL in Practice • Google  Big Table  Google Apps, Google Search • Facebook  Social network • Twitter • Amazon  DynamoDB and SimpleDB • CERN • GitHub 12
  • 13. NoSQL Data Management Overview • Introduction to NoSQL • Basic Concepts for NoSQL  CAP-Theorem  Eventual Consistency  Consistent Hashing  MVCC-Protocol  Query Mechanisms for NoSQL • Overview of NoSQL Systems 13
  • 14. NoSQL Data Management CAP Theorem – Brewer's theorem NoSQL (AP) Domain Name System DNS (AP) Cloud Computing (AP)  Consistency: all nodes see the same data at the same time  Availability: every request receives a response about whether it succeeded or failed  Partition tolerance: the system continues to operate despite physical network partitions Choose two! • States that it is impossible for a distributed system to provide all three of the following guarantees 14
  • 15. NoSQL Data Management Eventual consistency and BASE • The term “eventual consistency”  Copies of data on multiple machines to achieve high availability and scalability  A change to a data item on one machine has to be propagated to other replicas  Propagation is not instantaneous so the copies will be mutually inconsistent  The change will eventually be propagated to all the copies  Fast access requirement dominates  Different replicas can return different values for the queried attribute of the object  A System that achieved eventual consistency converged, or achieved replica convergence • Eventual consistency guarantees:  If no new updates are made to a given data item eventually all accesses to that item will return the last updated value • Eventually consistent services provide weak consistency using BASE  Basically Available, Soft state, Eventual consistency  Basically available indicates that the system guaranteed availability (CAP theorem)  Soft state indicates that the state of the system may change over time, even without input  Eventual consistency indicates that the system will become consistent over time 15
  • 16. NoSQL Data Management Consistent Hashing • Technique how to efficiently distribute replicas to nodes • Consistent hashing is a special kind of hashing  When hash table is resized only K/n keys need to be remapped on average  K is the number of keys, and n is the number of slots  In traditional hash tables nearly all keys have to be remapped • Insert Servers on ring  Hash based e.g. on IP, Name, …  Take over objects between own and processor hash • Insert Objects on ring  Hash based on key  Walks around the circle until falling into the first bucket • Delete Servers  Copy objects to next server • Virtual Servers  More than one hash per server • Replication  Place objects multiple times  Improves reliability Server 1 0 12 2 Server 3 Server 2/2 16 SeSrevrevre2r/21
  • 17. NoSQL Data Management Multiversion Concurrency Control (MVCC) • Concurrency control method to provide concurrent access to the database • LOCKING  All readers wait until the writer is done  This can be very slow! • MVCC  An write adds a new version  Read is always possible  Any changes made by a writer will not be seen by other users of the database until they have been committed  Conflicts (e.g. V1a ,V1b) can occur and have to be handled User 1 User 2 V transaction T User 1 write read V (Delayed) Lock! V1 write V0 MVCC read V0-Old Version V1 read V1 transaction T User 1 User 1 User 2 17
  • 18. NoSQL Data Management Query Mechanisms for NoSQL • REST based retrieval of a value based on its key/ID with GET resource • Document stores allow more complex queries  E.g. CouchDB allows to define views with MapReduce • MapReduce  Available in many NoSQL databases  Can run fully distributed  It is Functional Programming, not writing queries!  Map phase - perform filtering and sorting  Reduce phase - performs a summary operation  ~ SELECT and GROUP BY of a relational database  More details later! • Apache Spark is an open source big data processing framework providing more operations than MapReduce • Example use cases for MapReduce  Distributed Search  Counting – URL, Words  Building linkage graphs for web sites  Sorting distributed data 17 Source: @tgrall
  • 19. NoSQL Data Management Overview • Introduction to NoSQL • Basic Concepts for NoSQL • Overview of NoSQL Systems  Key-Value Stores  Document Stores  Wide-column stores  Graph Stores  Hadoop Map/Reduce and more … 19
  • 20. NoSQL Data Management Key Value Stores Memcached Project Voldemort • Developer: Basho Technologies (http://basho.com/) • Current version: 2.1.1 • Available since: 2009 • Licence: Apache licence 2.0 • Supported operating systems: Linux, BSD, Mac OS, Solaris • Client libraries for:  Java, Ruby, Python, C#, Erlang (the official ones)  C, Node.js, PHP (the unofficial ones)  even more form the Riak community • Developer: Basho Technologies (http://basho.com/) • Current version: 2.1.1 • Available since: 2009 • Licence: Apache licence 2.0 • Supported operating systems: Linux, BSD, Mac OS, Solaris • Client libraries for:  Java, Ruby, Python, C#, Erlang (the official ones)  C, Node.js, PHP (the unofficial ones)  even more form the Riak community 20
  • 21. NoSQL Data Management Typical Use Cases • Session data • User profiles • Sensor data (IOT) timestamp x y z temperature 01.01.2014 350 120 78 -10° 01.01.2014 350 120 95 -9 01.01.2014 350 100 78 -10° 02.01.2014 350 120 78 -5° 02.01.2014 350 120 95 -8° … key value sessionid=A08154711 userlogin=“xyz” date_of_expiry=2015/12/31 id 21
  • 22. NoSQL Data Management Key Functionality buckets key value key value key value key value store <k,v> get <k> delete <k> get bucket properties set bucket properties bucket types create bucket type activate bucket type update bucket type get status of bucket type 22
  • 23. NoSQL Data Management Instances and Vnodes • Riak runs on potentially large clusters • Each host in the cluster runs a single instance of Riak (Riak node) • Each Riak node manages a set of virtual node (vnodes) • Mapping of <bucket,key> pairs  compute 160-bit hash  map result to a ring position  ring is divided into partitions  each vnode is responsible for one partition docs.basho.com 23
  • 24. NoSQL Data Management Configure Replication • Some bucket parameters  N: replication factor  R: number of servers that must respond to a read request  W: number of servers that must respond to a write request  DW: number of servers that must report that a write has successfully been written to disk  …  Parameters allow to trade consistency vs. availability vs. performance docs.basho.com 24
  • 25. NoSQL Data Management Transactions and Consistency • No multi-operation transactions are supported • Eventual consistency is default (and was the only option before Riak 2.0)  Vector clocks and Dotted Version Vectors (DVV) used to resolve object conflicts • Strong consistency as an alternative option  A quorum of nodes is needed for any successful operation 25
  • 26. NoSQL Data Management Riak Search • For Raik KV, a value is just a value possibly associated with a type • Riak Search 2.0  Based on Solr, the search platform built on Apache Lucene  Define extractors, i.e., modules responsible for pulling out a list of fields and values from a Riak object  Define Solr schema to instruct Riak/Solr how to index a value  Queries: exact match, globs, inclusive/exclusive range queries, AND/OR/NOT, prefix matching, proximity searches, term boosting, sorting, … 26
  • 27. NoSQL Data Management Document Stores • Developer: MongoDB, Inc. (https://www.mongodb.com/) • Available since: 2009 • Licence: GNU AGPL v3.0 • Supported operating systems: all major platforms • Drivers for:  C, C++, C#, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, … • Developer: MongoDB, Inc. (https://www.mongodb.com/) • Available since: 2009 • Licence: GNU AGPL v3.0 • Supported operating systems: all major platforms • Drivers for:  C, C++, C#, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, … 27
  • 28. NoSQL Data Management What are Documents? • Aggregated data • No fixed schema • Internal structure matters • Format: JSON, BSON, … { "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ] } 28 { "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ] } { "firstName": “Paul", "lastName": “Adam", "age": 45, "address": { "streetAddress": "22 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" } } { "firstName": “Paul", "lastName": “Adam", "age": 45, "address": { "streetAddress": "22 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021" } }
  • 29. NoSQL Data Management Typical Use Cases • Simple content management  e.g. blogging platforms • Logging events  cope with event type heterogeneity  data associated with events changes over time • E-Commerce applications  flexible schema for product and order data 29
  • 30. NoSQL Data Management Data Organization • Collections of documents that share indexes (but not a schema) • Collections are organized into databases • Documents stored in BSON databases collections 30
  • 31. NoSQL Data Management Key Functionality collections find <field critera> insert document update document(s) create index drop index databases create collection drop collection delete document(s) query data inside the documents 31
  • 32. NoSQL Data Management Querying Documents www.mongodb.com 32
  • 33. NoSQL Data Management Availability • Each Replica set is group of MongoDB instances that hold the same dataset • one primary instance that takes all write operations • multiple secondary instances • changes are replicated to the secondaries • if the primary is unavailable, the replica set will elect a secondary to be primary • Specific secondaries  priority 0 member  hidden member  delayed member  arbiter sec. 1 sec. 2 prim. read/write apply opplog replica set 33
  • 34. NoSQL Data Management Scalability • Replica sets for read scalability  reading from secondaries may provide stale data • Sharding for write scalability  at collection level  using indexed field that is available in all documents of the collection  range-based or hash-based  Each shard is a MongoDB instance  background processes to maintain even distribution: splitting + balancer  Shards may also hold replica sets shard shard shard 1 2 3 router gconfi serveg rconfi serveconfig r server metadata read/write 34
  • 35. NoSQL Data Management Transactions and Consistency • Write concern, w option  default:  num.:  majority: confirms write operations only on the primary Guarantees that write operations have propagated successfully to the specified number of replica set members including the primary Confirms that write operations have propagated to the majority of voting nodes • Read preference  describes how MongoDB clients route read operations to the members of a replica set, i.e., from one of the secondaries or the primary  eventual consistency! • No multi-operation transactions supported. reading stale data from the primary is possible! 35
  • 36. NoSQL Data Management Wide-Column Stores / Column Family Stores • 2006: originally project of company Powerset • 2008: HBase becomes Hadoop sub-project. • 2010: HBase becomes Apache top-level project. • runs on top of HDFS (Hadoop Distributed File System) • providing BigTable-like capabilities for Hadoop • APIs: Java, REST, Thrift, C/C++ • 2006: originally project of company Powerset • 2008: HBase becomes Hadoop sub-project. • 2010: HBase becomes Apache top-level project. • runs on top of HDFS (Hadoop Distributed File System) • providing BigTable-like capabilities for Hadoop • APIs: Java, REST, Thrift, C/C++ 36
  • 37. NoSQL Data Management Logical Data Model • Table rows contain:  row key  versions, typically a timestamp  multiple column families per key - define column families at design time - add columns to a column family at runtime • Metadata  there is no catalog that provides the set of all columns for a certain table  left to the user/application sparse table 37
  • 38. NoSQL Data Management Physical Data Model • Store each column familiy separately • Sorted by timestamp (descending) • Empty cells from the logical view are not stored • Key/Value class keylength valuelength key value rowlength row key columnfamily length column family columnqualifier timestamp keytype com.cnn. www 2 anchor cnnsi.com t9 put e.g. 38
  • 39. NoSQL Data Management Key Functionality namespaces get <k, t> put <k, …> scan regions split merge delete <k> create alter drop c, …, c key family c, …, c key family c, …, c key family c, …, c table key family 39
  • 40. NoSQL Data Management MasterServer and RegionServer • Failover  HBase clients talk directly to the RegionServers, hence they may continue without MasterServer (at least for some time)  catalog table META exists as HBase tables, i.e., not resident in the MasterServer HMaster • monitors RegionServers • operations related to metadata (tables, column families, regions) Backup HMaster … HRegionServer • manages regions • operations related to data (put, get, …) • operations related to regions (split, merge, …) META: list of regions for each table • Failover  Region immediately becomes unavailable when the RegionServer is down  The Master will detect that the RegionServer has failed  region assignments will be considered invalid  assign region to a new RegionServer 40
  • 41. NoSQL Data Management Storage Structure • Table T with column families a and b HRegionServer Store a MemStore MemStore StoreFile StoreFile StoreFile Store b StoreFile StoreFile Store a Store b MemStore MemStore StoreFile StoreFile StoreFile StoreFile HRegionServer Store a Store b MemStore MemStore StoreFile StoreFile StoreFile StoreFile Block <k,v> <k,v> … Block Block Block mainmemoryfilesystem Log Log regions 41
  • 42. NoSQL Data Management Write Data 1. Write change to log (WAL) 2. Write change to MemStore 3. Regularily flush MemStore to disk (into StoreFiles) and empty MemStore HRegionServerHRegionServer Store a MemStore MemStore StoreFile StoreFile StoreFile Store b StoreFile StoreFile Store a MemStore MemStore StoreFile StoreFile StoreFile Store b StoreFile Store a MemStore MemStore StoreFile StoreFile Store b StoreFile StoreFile mainmemoryfilesystem Log Log write table_T.family_a.field_f    42
  • 43. NoSQL Data Management Read Data 1. Read from Block Cache 2. Read from MemStore 3. Read from all relevant StoreFiles 4. Merge results HRegionServerHRegionServer Store a MemStore MemStore StoreFile StoreFile StoreFile Store b StoreFile StoreFile Store a MemStore MemStore StoreFile StoreFile StoreFile Store b StoreFile Store a MemStore MemStore StoreFile StoreFile Store b StoreFile StoreFile mainmemoryfilesystem Log Log read table_T.family_a.field_f    43
  • 44. NoSQL Data Management StoreFile Reorganisation • minor compaction  merge various StoreFiles, without considering tombstones etc. • major compaction  reorg of store files, e.g. by removing deleted rows  significantly reduces file size  at a configurable time interval or manually • Merge Regions  consolidate several regions of one table into a single region  offline reorg initiated manually! • Split Regions  distribute data from one region to several regions  configurable by parameter max.filesize or manually Store a MemStore StoreFile StoreFile StoreFile Store a MemStore StoreFile 44
  • 45. NoSQL Data Management Transactions and Consistency • No explicit transaction boundaries • Atomicity  atomic row-level operations (either "success" or "failure")  operations spanning several rows (batch put) are not atomic • Consistency  Default: Strong consistency by routing all through a single region server  Optional: Region replication for high availability - Writes only through the primary - Reads may also be processed by the secondaries 45
  • 46. NoSQL Data Management SELECT WRITETIME (name) FROM excelsior.clicks WHERE url = 'http://apache.org'; SELECT WRITETIME (name) FROM excelsior.clicks WHERE url = 'http://apache.org'; CQL in Cassandra 45 CREATE KEYSPACE demodb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3}; CREATE KEYSPACE demodb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3}; CREATE TABLE users ( user_name varchar, password varchar, gender varchar, session_token varchar, state varchar, birth_year bigint, PRIMARY KEY (user_name)); CREATE TABLE users ( user_name varchar, password varchar, gender varchar, session_token varchar, state varchar, birth_year bigint, PRIMARY KEY (user_name)); SELECT * FROM emp WHERE empID IN (130,104) ORDER BY deptID DESC; SELECT * FROM emp WHERE empID IN (130,104) ORDER BY deptID DESC; INSERT INTO excelsior.clicks (userid, url, date, name) VALUES ( 3715e600-2eb0-11e2-81c1-0800200c9a66, 'http://apache.org', '2013-10-09', 'Mary') USING TTL 86400; INSERT INTO excelsior.clicks (userid, url, date, name) VALUES ( 3715e600-2eb0-11e2-81c1-0800200c9a66, 'http://apache.org', '2013-10-09', 'Mary') USING TTL 86400; cassandra.apache.org
  • 47. NoSQL Data Management Graph Databases Flock DB • Developer: Neo Technology (http://www.neotechnology.com) • Available since: 2007 • Licence: GPLv3 and AGPLv3, commercial • Supported operating systems: all major platforms • Drivers for:  Java, .NET, JavaScript, Python, Ruby, PHP  and R, Go, Clojure, Perl, Haskell • Developer: Neo Technology (http://www.neotechnology.com) • Available since: 2007 • Licence: GPLv3 and AGPLv3, commercial • Supported operating systems: all major platforms • Drivers for:  Java, .NET, JavaScript, Python, Ruby, PHP  and R, Go, Clojure, Perl, Haskell 47
  • 48. NoSQL Data Management Typical Use Cases • Highly connected data, e.g., social networks, employees and their knowledge • Location-based services, e.g., planning delivery • Recommendation systems, e.g., bought products, often-visited attractions people who visited … also visited … 5 15 7 9 1 20 12 50 graph with distances smartblogs.com 48
  • 49. NoSQL Data Management Graph Data Model 48 node relationship properties • No need to define a schema neo4j.com
  • 50. NoSQL Data Management Basic Functionality match node patterns graph traversal graph create node/relationship delete node/relationship set property remove property create index query index 50
  • 51. NoSQL Data Management Cypher Example • Many query languages: Cypher, Gremlin, G, GraphLog, GRAM, GraphDB, … • No standard CREATE (me:PERSON {name:”Holger”}) CREATE (mat:PERSON {name:”Matthias”}) CREATE (fra:PERSON {name:”Frank”}) CREATE (me) -[knows:KNOWS]-> (mat) CREATE (me) -[knows:KNOWS]-> (fra) CREATE (mat) -[knows:KNOWS]-> (me) MATCH (n {name:”Holger”})-[:KNOWS]->(m) 51
  • 52. NoSQL Data Management S2 S3 S4 S1 Scalability • Strategies for read scaling a) large enough memory for the working set of nodes b) adding read-only slaves graph data node memory graph data graph data graph data graph data slave memory c) application-level sharding graph data north graph data south appl query north/south? automatic sharding? 52
  • 53. NoSQL Data Management High Availability M S1 S2 S3   propagate  write  commit • HA availability feature in neo4j  cluster of 1 master and n slave nodes  continues to operate from any number of nodes down to a single machine  nodes may leave and re-join the cluster  in case of master failure, a new master will be elected  read operations are possible on any node  write operations are possible on any node and propagated to the others write on master 53
  • 54. NoSQL Data Management High Availability M S1 S2 S3  write  commit   propagate   propagate  commit • HA availability feature in neo4j  cluster of 1 master and n slave nodes  continues to operate from any number of nodes down to a single machine  nodes may leave and re-join the cluster  in case of master failure, a new master will be elected  read operations are possible on any node  write operations are possible on any node and propagated to the others write on slave M S1 S2 S3 pull asynchronously 54
  • 55. NoSQL Data Management Transactions and Consistency • Set transaction boundaries explicitly • Transactions are possible on any node in the cluster • Transactions are atomic and durable • Writes are eventually consistent  optimistically pushed to slaves  slaves can also be configured to pull updates asynchronously 55
  • 56. NoSQL Data Management Hadoop Ecosystem HDFS (Hadoop Distributed File System) • Apache project http://hadoop.apache.org/ HBase YARN (Cluster Resource Management) Zookeeper (Coordination) MapReduce Framework Hive Pig … 56
  • 57. NoSQL Data Management Principles of Map Reduce • User provides data in files • Data model: key/value pairs (k, v) • Based on higher-order functions MAP and REDUCE • Tasks of the programmer  User-defined functions m and r serve as input to MAP and REDUCE  m and r define what the job actually does • MAP m: Km,Vm ↦ Kr, Vr ∗ • REDUCE r: (Kr, Vr ∗ )↦(Kr, Vr) • Example: Aggregate salary per department: (employee, <name, department, salary, …>) (department, salary) MAP (department, salary) (department, <salary, salary, …>) RED 57
  • 58. NoSQL Data Management … Execution of Map/Reduce Jobs 57 file task tracker m2 map() combine() partition() task tracker m3 task tracker m1 split 1 split 2 split 3 split 4 split 5 k/v 1 k/v 2 k/v 1 k/v 2 k/v 1 k/v 2 task tracker r2 task tracker r1 shuffle() sort() reduce() output() file phases defined by user job trackerprogram client start & control
  • 59. NoSQL Data Management Fault Tolerance 58 • map node fails  job tracker receives no report for a certain time -> mark node as failed  restart map job on a different node  new job reads another copy of the necessary input split • reduce node fails  job tracker receives no report for a certain time -> mark node as failed  restart reduce job on a different node  read necessary intermediate input data from map nodes • To make this work, all relevant data has to be stored in a distributed file system, in particular  the input splits  intermediate data produced by map jobs Hadoop Files System (HDFS)
  • 60. NoSQL Data Management HDFS Architecture 59HDFS Users Guide at http://hadoop.apache.org
  • 61. NoSQL Data Management Is that all? • Other systems build on or extend this basic functionality • Build an SQL layer on to of Hadoop MapReduce  Hive  Pig • Others focus on datastream and in-memory processing  Spark  Flink 61
  • 62. NoSQL Data Management What we also Skipped Today • Further classes of NoSQL systems  Triple stores, … • NewSQL • Cloud offerings for the various types of NoSQL data stores  e.g., Riak CS (Cloud Storage) • More cloud platforms  IBM Bluemix  Google app engine 62
  • 63. NoSQL Data Management Conclusion • Relational Databases provide  Data spread over many tables  Schema needs to be defined  Structured query language (SQL)  Transactions  Strong Consistency  General purpose applicability • NoSQL  Aggregated data in one object (identified by a key)  No predefined schema  No declarative query language  Limited transactional capability  Eventual consistency rather ACID property  Focus on scalability and availability  Often selected and customized for a concrete application scenario To make a proper decision, carefully examine your application • the data model that is most appropriate • the query complexity • the consistency needs • the transactional requirements 63 To make a proper decision, carefully examine your application • the data model that is most appropriate • the query complexity • the consistency needs • the transactional requirements