In formal database theory, tables are relations, rows are tuples, and fields are attributes.
Relational databases aren’t about “relations” between tables. The name refers to the tables (relations) that make up the database.
Relational databases actually have problems dealing with “relationships” in the informal sense – joins need to be planned for in advance, and schema design can be a multi-week (or even multi-month!) process that has to be complete before you can start building your application.
Disk read latency on a spinning 7200 RPM platter is about 4.17ms of average rotational latency, plus roughly 8ms of average seek time. You want as few seeks as possible, and normalizing data increases the number of seeks.
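The arithmetic behind those numbers is worth spelling out. A minimal sketch, assuming a 7200 RPM drive and the ~8ms average seek figure quoted above:

```python
# Back-of-envelope read latency for a 7200 RPM spinning disk.
rpm = 7200
ms_per_revolution = 60 / rpm * 1000            # 8.33 ms per full rotation
avg_rotational_ms = ms_per_revolution / 2      # on average you wait half a turn
avg_seek_ms = 8.0                              # typical figure for commodity drives

per_read_ms = avg_rotational_ms + avg_seek_ms
print(round(avg_rotational_ms, 2))  # 4.17
print(round(per_read_ms, 2))        # 12.17
```

At roughly 12ms per random read, a query that needs ten seeks spends over a tenth of a second just waiting on the platter – which is why normalization-induced seeks hurt.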
This made complete sense when disks only held 10MB of data. Now?
Modern relational databases spend much of their optimization effort combating this multi-seek problem… but can be limited by memory constraints on commodity hardware – which means you end up buying specialist hardware. Have I mentioned Larry Ellison owns an island? An entire Hawaiian island.
The turn of the millennium saw XML and object databases, like MarkLogic and Objectivity – but the real explosion in interest began in the middle of the decade, as the needs of data storage and retrieval really started to change.
Eric Evans popularized the term, using it as the title of Meetups held to discuss this new technology trend in San Francisco, my home town.
Next, we’ll go over some of these historical developments.
One of the earliest developments was the creation of memcached.
Developed at LiveJournal in 2003 as a way to speed up web applications, memcached has proved so useful that it’s still in wide use and under active development.
Described as: a high-performance, distributed memory object caching system.
Let’s unpack:
Distributed – runs across many computers
Memory – runs without touching disk
Object cache – designed to hold small lumps of data
High performance – because it never touches disk, and the objects are small, it’s optimized for speed
Advantage? Scale out architecture
With a single server, as in most relational systems, all you can do is buy a bigger machine – scale up. But this quickly gets ruinously expensive. NoSQL offers another way to scale – scale out.
With memcached, there’s no connection between the machines; where the data lives is determined by the client’s hash. That lets you set up multiple machines.
[click]
But other systems are possible. The servers can communicate among themselves and decide who keeps what data. Mongo, for instance, does this by having you set a shard key, so where data lives depends on its value. MarkLogic does this automatically, without setting a key.
This lets you scale to an effectively unlimited number of hosts.
[click]
From the 2006 Google paper: Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
Bigtable was never shipped outside of Google, but it’s considered a seminal paper for the NoSQL movement, and the ideas behind it are the basis for a family of databases called wide column stores. It’s also integral to many Google projects and is the data storage method exposed by App Engine, so you can still use it today.
Bigtable uses MVCC for writes, and as a result is able to do fast writes which scale well. It also supports indexing for queries.
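That one-sentence data model from the paper can be sketched directly. This is a toy model only – a map from (row key, column key, timestamp) to bytes, with multiple timestamped versions kept per cell, MVCC-style – not a real Bigtable API, and the row/column names are made up:

```python
# Toy Bigtable data model: a map keyed by (row, column, timestamp),
# where each value is an uninterpreted byte string.
table = {}  # (row, col, ts) -> bytes

def put(row: str, col: str, ts: int, value: bytes) -> None:
    # Writes never overwrite: each timestamp is a distinct version.
    table[(row, col, ts)] = value

def latest(row: str, col: str):
    # Reads pick the newest version of the cell, if any exist.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == col]
    return max(versions)[1] if versions else None

put("com.example/page1", "contents:", 1, b"<html>v1</html>")
put("com.example/page1", "contents:", 2, b"<html>v2</html>")
print(latest("com.example/page1", "contents:"))  # b'<html>v2</html>'
```

Because writes just append a new (key, timestamp) entry rather than mutating in place, they don’t block readers – which is part of why this style of store writes fast and scales well.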
History
Map and reduce functions need to be order-independent. Another scale-out architecture.
Hadoop began with Doug Cutting of the Internet Archive and Mike Cafarella of the University of Washington. Cutting went to Yahoo in 2006.
http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
Another influential system that never shipped publicly was Amazon’s Dynamo. Presented at the 2007 All Things Distributed conference, the Amazon Dynamo paper was every bit as exciting as the Bigtable paper.
From the paper, Dynamo is “a highly available key-value storage system that some of Amazon’s core services use to provide an ‘always-on’ experience. To achieve this level of availability, Dynamo sacrifices consistency”.
Paper is at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
DynamoDB info at http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
From the paper: Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories, which the system will need to reconcile in the future. This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).
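The paper’s mechanism for spotting those divergent sub-histories is the vector clock. Here’s a minimal sketch of the detection step, with made-up node names – real Dynamo also handles truncation, replication, and reconciliation strategies this toy ignores:

```python
# Vector clocks: each version carries a map of node -> update count.
def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every update recorded in version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a: dict, b: dict) -> bool:
    # Neither version descends from the other: two sub-histories
    # exist, and the application must reconcile them.
    return not descends(a, b) and not descends(b, a)

v1 = {"node-a": 2, "node-b": 1}  # updated via node-a during a partition
v2 = {"node-a": 1, "node-b": 2}  # concurrently updated via node-b
print(conflict(v1, v2))  # True: the app sees both versions and must merge
```

This is exactly the “explicitly acknowledge the possibility of multiple versions” design burden the quote describes: the store hands you both versions instead of silently dropping one.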
A somewhat artificial acronym for a very real thing.
We’ll cover this in more detail in just a moment.
A very made-up acronym for a much vaguer idea than ACID. It’s another way of saying “not ACID”. It’s also appropriate for some use cases… but beware using it where it’s not.
The most famous problem with this approach came from two different Bitcoin exchanges that went out of business because they relied on eventual consistency.
So, for things like survey data, cat pictures, or forum postings (for some forums, but not others, like in finance), BASE is fine. For anything having to do with money, regulatory compliance, inventory, etc., use ACID.
Consistency is a function of the other three properties: Durability, Isolation, and Atomicity.
So, this is essentially a summary of the preceding slides.
As a side note, Eventually Consistent is really just marketing speak: if you’re only consistent eventually, you’re Essentially Inconsistent.
Slide originally from Mike Bowers (but since modified), presented at MarkLogic World 2013
A database, not a filesystem. Not a cache (without a store). So: not Hadoop, not memcached (but memcachedb counts).
Cluster-friendly is about more than just running in an AMI – it means running on commodity hardware.
There are easily over 200 different NoSQL database systems, and they vary wildly in features and design centers.
Key-value stores are “hashtables in the sky”.
Redis is an open-source, networked, in-memory key-value data store with optional durability. It’s the most popular KV store.
As mentioned previously, it’s also considered a “data structure server”.
The ability to do a clustered, shared-nothing distribution of data is currently in beta.
Like key-value stores, but by allowing additional structure in the stored value, new possibilities open up for things like indexing, search, and aggregation.
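A tiny sketch of why that structure matters. With opaque blobs you can only look things up by key; once the store can see inside the document, it can maintain a secondary index and answer field queries. Document IDs and fields here are invented for illustration:

```python
# A toy document store: values are structured documents, not blobs.
docs = {
    "u1": {"name": "Ada", "city": "London"},
    "u2": {"name": "Bob", "city": "Doha"},
    "u3": {"name": "Eve", "city": "London"},
}

# Secondary index on "city" -- only possible because the store
# can interpret the value's structure.
index: dict[str, list[str]] = {}
for doc_id, doc in docs.items():
    index.setdefault(doc["city"], []).append(doc_id)

# "Find everyone in London" without scanning every document.
print(sorted(index["London"]))  # ['u1', 'u3']
```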
Mongo is the most widely used document DB, while MarkLogic is the largest NoSQL database company by revenue (according to independent web site estimates).
Binary JSON (BSON) oriented document database, with sharding and eventual consistency. First stable release in 2010.
Bigtable, Cassandra, Apache HBase, Apache Accumulo
All from the Bigtable starting point, and share that general architecture.
But really, it’s almost all Cassandra from a market-share perspective.
* Data model like Bigtable
* Distribution model like Dynamo
* Built at Facebook in 2008
* Apache project since 2010
Great for: Recommendations, Social Network analysis, Shortest path, Asset Management
Neo4J, Allegro, Titan, Objectivity
Databases where the primary things tracked are nodes (vertices) and the connections between those nodes, called edges.
Neo4J dominates the market from a share perspective.
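Shortest path, mentioned above, is the canonical graph-database traversal. A minimal breadth-first sketch over a made-up social graph (an adjacency list stands in for stored edges; names are invented):

```python
from collections import deque

# Tiny social graph: who is connected to whom.
edges = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def shortest_path(start: str, goal: str):
    # Breadth-first search: the first path to reach the goal
    # is guaranteed to be a shortest one (fewest hops).
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route exists

print(shortest_path("alice", "frank"))  # ['alice', 'bob', 'dave', 'frank']
```

Graph databases make this kind of hop-by-hop traversal cheap; in a relational store, each hop would be another join.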
The Semantic Web and Open Linked Data are really just a special case of Graph Databases.
Semantics is a new way of organizing and searching information
Data are modeled as triples: a combination of subject, predicate, and object – a fact.
For example, “John Smith lives in London” is a fact, and so is “London is in England”. Each of those facts can be modeled as a triple.
Any human would look at those two facts and immediately know that John Smith lives in England
With rules, MarkLogic Semantics can achieve the same result [CLICK]
Even though we never explicitly say that John Smith lives in England, we can query MarkLogic and find that it’s true
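The inference above can be sketched with a toy triple store and one rule. This is purely illustrative – it is not MarkLogic’s rule syntax, just the general shape of rule-based inference over triples:

```python
# A toy triple store: each entry is (subject, predicate, object).
triples = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
}

# Rule: if X livesIn Y and Y isIn Z, then X livesIn Z.
# Apply until no new facts appear (a fixed point).
inferred = set(triples)
changed = True
while changed:
    changed = False
    for (x, p1, y) in list(inferred):
        for (y2, p2, z) in list(inferred):
            if p1 == "livesIn" and p2 == "isIn" and y == y2:
                fact = (x, "livesIn", z)
                if fact not in inferred:
                    inferred.add(fact)
                    changed = True

# The derived fact was never stored explicitly, yet it's queryable.
print(("John Smith", "livesIn", "England") in inferred)  # True
```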
There are a large and growing number of Linked Open Data sets available, and more are coming every day.
These data sets are in a form that makes them easily consumed. That’s really important and we’ll describe what that form looks like in a minute
Examples
dbpedia (wikipedia as triples)
Einstein was born in Germany
Ireland's currency is the Euro
GeoNames:
Doha is the capital of Qatar
Doha has these lat/long coords
Others:
Data.gov, data.gov.uk
Legislation
Where the money goes
World Bank Linked Data
Patents.data.gov, reference.data.gov,
BBC Programmes, BBC Music, BBC Wildlife