2. Outline The Whys of Non-Relational Databases Vocabulary of the Non-Relational World MongoDB
3. Why did non-relational databases arise? Problems with relational databases in the web world The Whys of Non-Relational Databases
4. Problem - Schema Evolution Applications are evolving all the time Applications need new fields Applications need new indexes Data is growing – sometimes very fast Users need to be able to alter their schemas without making their data unavailable The web world expects 24x7 service RDBMSs can have a hard time doing this
5. Problem – Write Rates Replication is a solution for high read loads Sooner or later, writing becomes a bottleneck Sharding – partitioning a logical database across multiple database instances Joins and aggregation become a problem Distributed transactions are too slow for the web Manual management of shards Choosing shard partitions Rebalancing shards
6. An introduction to terminology you’re going to be seeing a lot Vocabulary of the Non-Relational World
7. Data Models A non-relational database’s data model determines the kinds of items it can contain and how they can be retrieved What can the system store, and what does it know about what it contains? The relational data model is about storing records made up of named, scalar-valued fields, as specified by a schema, or type definition What kind of queries can you do? SQL is a manifestation of the kinds of queries that fall out of relational algebra
9. Key-Value Stores A mapping from a key to a value The store doesn’t know anything about the key or value The store doesn’t know anything about the insides of the value Operations: Set, get, or delete a key-value pair
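A minimal sketch of the set/get/delete contract, assuming an in-memory store. The class name `KeyValueStore` and the key `user:42` are made up for illustration. The point is that the store never looks inside the value; the application does its own serialization.

```javascript
// A toy key-value store: values are opaque to the store.
class KeyValueStore {
  constructor() {
    this.data = new Map();
  }
  set(key, value) {
    this.data.set(key, value);
  }
  get(key) {
    return this.data.get(key); // undefined when the key is absent
  }
  delete(key) {
    return this.data.delete(key); // true if the key existed
  }
}

const store = new KeyValueStore();
// The application serializes the record; the store only sees a string.
store.set("user:42", JSON.stringify({ LastName: "Flintstone", FirstName: "Fred" }));
const fred = JSON.parse(store.get("user:42"));
```

Because the store cannot see `LastName`, a question like "all users named Flintstone" has to be answered by the application, not by the store.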
10. Document Stores The store is a container for documents Documents are made up of named fields Fields may or may not have type definitions e.g. XSDs for XML stores, vs. schema-less JSON stores Can create “secondary indexes” These provide the ability to query on any document field(s) Operations: Insert and delete documents Update fields within documents
11. Column-Oriented Stores Like a relational store, but flipped around: all data for a column is kept together An index provides a means to get a column value for a record Operations: Get, insert, delete records; update fields Streaming column data in and out of Hadoop
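The "flipped around" layout can be sketched as follows. This is an illustration only, assuming small in-memory arrays; the helper names `toColumns` and `getRecord` are hypothetical, and the array position plays the role of the index that locates a column value for a record.

```javascript
// Row layout: one object per record.
const rows = [
  { id: 1, name: "Fred", age: 40 },
  { id: 2, name: "Wilma", age: 38 },
  { id: 3, name: "Barney", age: 39 },
];

// Column layout: one array per column; position i belongs to record i.
function toColumns(records) {
  const columns = {};
  records.forEach((record, i) => {
    for (const [field, value] of Object.entries(record)) {
      if (!(field in columns)) {
        columns[field] = new Array(records.length).fill(null);
      }
      columns[field][i] = value;
    }
  });
  return columns;
}

// Rebuild record i by reading position i from every column.
function getRecord(columns, i) {
  const record = {};
  for (const [field, values] of Object.entries(columns)) {
    record[field] = values[i];
  }
  return record;
}

const columns = toColumns(rows);
// An aggregate over one field touches a single column, not whole records.
const totalAge = columns.age.reduce((sum, age) => sum + age, 0);
```

The trade-off shows up in `getRecord`: reassembling a whole record means visiting every column.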
12. Graph Databases Stores vertex-to-vertex edges Operations: Getting and setting edges Sometimes possible to annotate vertices or edges Query languages support finding paths between vertices, subject to various constraints
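As a rough illustration of edge storage plus a path query, here is a sketch that assumes directed, unannotated edges held in memory. The `Graph` class and its method names are invented for this example; real graph databases expose much richer query languages.

```javascript
// A toy directed graph: each vertex maps to the set of vertices it points at.
class Graph {
  constructor() {
    this.edges = new Map();
  }
  addEdge(from, to) {
    if (!this.edges.has(from)) this.edges.set(from, new Set());
    if (!this.edges.has(to)) this.edges.set(to, new Set());
    this.edges.get(from).add(to);
  }
  // Breadth-first search: returns a path with the fewest edges, or null.
  findPath(start, goal) {
    if (!this.edges.has(start) || !this.edges.has(goal)) return null;
    const previous = new Map([[start, null]]);
    const queue = [start];
    while (queue.length > 0) {
      const vertex = queue.shift();
      if (vertex === goal) {
        const path = [];
        for (let v = goal; v !== null; v = previous.get(v)) path.unshift(v);
        return path;
      }
      for (const next of this.edges.get(vertex)) {
        if (!previous.has(next)) {
          previous.set(next, vertex);
          queue.push(next);
        }
      }
    }
    return null;
  }
}

const graph = new Graph();
graph.addEdge("Fred", "Barney");
graph.addEdge("Barney", "Betty");
graph.addEdge("Fred", "Wilma");
const path = graph.findPath("Fred", "Betty");
```

A constraint such as "at most N hops" or "only edges of a given type" would be added as a filter inside the inner loop.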
13. Consistency Models Relational databases support transactions Can only see committed changes Commit/abort span multiple changes Isolation levels for reads: read committed, repeatable read, etc. Classic assumption: “I’m querying the one-and-only database” Scaling reads and scaling writes introduce different problems
15. Limitations of a Single Master Replication can provide arbitrary read scalability Subject to coping with read-consistency issues Sooner or later, writing becomes a bottleneck Physical limitations (seek time) Throughput of a single I/O subsystem
16. Sharding Partition the primary key space via hashing Set up a duplicate system for each shard The write-rate limitation now applies to each shard Joins or aggregation across shards are problematic Can the data be re-sharded on a live system? Can shards be re-balanced on a live system?
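Hash partitioning of the key space can be sketched like this. The hash shown is FNV-1a over the key's UTF-16 code units, picked only because it is short; the function names and the `user:N` keys are assumptions for the example. The last lines count how many keys would land on a different shard if a fifth shard were added, which is why the re-sharding question on this slide matters.

```javascript
// FNV-1a, 32-bit, over the string's UTF-16 code units.
function fnv1a(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Naive placement: hash the primary key, take it modulo the shard count.
function shardFor(key, shardCount) {
  return fnv1a(key) % shardCount;
}

const keys = Array.from({ length: 1000 }, (_, i) => `user:${i}`);
const moved = keys.filter((k) => shardFor(k, 4) !== shardFor(k, 5)).length;
console.log(`${moved} of ${keys.length} keys change shard when going from 4 to 5 shards`);
```

With plain modulo placement, most keys move when the shard count changes. Schemes such as consistent hashing or range-based chunks exist to reduce that movement.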
17. Multi-Site Operation Failure of a single-master system’s master A new master can be chosen But what if there’s a network partition? Can the application continue in read-only mode?
18. Dynamo Now a generic term for multi-master systems Writes can occur to any node The same record can be updated on different nodes by different clients All writes are replicated everywhere
19. Dynamo – the 2nd breakdown of consistency Collisions can occur Who wins? A collision resolution strategy is required Vector clocks http://en.wikipedia.org/wiki/Vector_clock Application access must be aware of this
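To make the vector clock idea concrete, here is a small sketch that assumes a clock is a plain object mapping node id to a counter. The function names `increment`, `compare`, and `merge` are chosen for this example and do not come from any particular product.

```javascript
// Record one more event on nodeId; returns a new clock.
function increment(clock, nodeId) {
  return { ...clock, [nodeId]: (clock[nodeId] || 0) + 1 };
}

// Returns "before", "after", "equal", or "concurrent" for clock a relative to b.
function compare(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[node] || 0;
    const y = b[node] || 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // a collision: neither version descends from the other
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Element-wise maximum: the clock of a version that has seen both inputs.
function merge(a, b) {
  const merged = { ...a };
  for (const [node, count] of Object.entries(b)) {
    merged[node] = Math.max(merged[node] || 0, count);
  }
  return merged;
}

const base = increment({}, "n1");      // written once on node n1
const left = increment(base, "n1");    // updated again on n1
const right = increment(base, "n2");   // updated independently on n2
const verdict = compare(left, right);
```

The clock only detects the collision. Deciding which value wins, or how to combine them, is still up to the resolution strategy, and the resolved version would carry `merge(left, right)` plus an increment.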
21. Key Client Implementation Concerns Monotonic reads Can my reads go back in time? Read-your-own-writes If I issue a query immediately after an insert or update, will I see my changes? Uninterrupted writes Am I always guaranteed the ability to write? Conflict Resolution Do I need to have a conflict resolution strategy?
22. Using a Single-Master System What does the intermediate agent or system do for… Monotonic reads? Read-your-own-writes? Uninterrupted writes? Conflict Resolution?
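One possible answer for the first two questions is sketched below. It assumes a hypothetical setup in which every write gets an increasing version number and the client remembers the highest version it has written or read; `Replica` and `Session` are invented names, not part of any driver.

```javascript
// A node that holds one value and the version of the last write it applied.
class Replica {
  constructor() {
    this.version = 0;
    this.value = null;
  }
  apply(version, value) {
    if (version > this.version) {
      this.version = version;
      this.value = value;
    }
  }
  read() {
    return { version: this.version, value: this.value };
  }
}

// Client-side agent: routes writes to the master and refuses stale reads.
class Session {
  constructor(master, replicas) {
    this.master = master;
    this.replicas = replicas;
    this.minVersion = 0; // nothing older than this may be returned
  }
  write(value) {
    const version = this.master.version + 1;
    this.master.apply(version, value);
    this.minVersion = Math.max(this.minVersion, version); // read-your-own-writes
    return version;
  }
  read() {
    // Use a replica only if it has caught up to what this client has seen.
    const fresh = this.replicas.find((r) => r.version >= this.minVersion);
    const result = (fresh || this.master).read();
    this.minVersion = Math.max(this.minVersion, result.version); // monotonic reads
    return result;
  }
}

const master = new Replica();
const lagging = new Replica();
const session = new Session(master, [lagging]);
session.write("a");
const firstRead = session.read();   // replica is behind, so the master answers
lagging.apply(1, "a");              // replication catches up
const secondRead = session.read();  // now the replica is fresh enough
```

In this sketch, uninterrupted writes are not addressed: if the master is unavailable, `write` has nowhere to go.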
23. Using a Multi-Master System What does the intermediate agent or system do for… Monotonic reads? Read-your-own-writes? Uninterrupted writes? Conflict Resolution?
24. Where MongoDB fits in the non-relational world MongoDB’s architecture and features Some real-world users MongoDB
25. MongoDB is a Document Store MongoDB stores JSON objects as BSON { LastName: 'Flintstone', FirstName: 'Fred', …} Secondary Indexes db.collection.ensureIndex({LastName : 1, FirstName : 1}); Simple QBE-like query syntax db.collection.find({LastName : 'Flintstone'}); db.collection.find({LastName : { $gte : 'Flintstone' }});
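To show what "query by example" means, here is a plain JavaScript sketch that evaluates the two query shapes from this slide against in-memory documents. It is not MongoDB's implementation: `matches` is a hypothetical helper that handles only exact equality and `$gte`, and the sample documents are invented.

```javascript
// Does doc satisfy the example-style query? Supports equality and $gte only.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    if (condition !== null && typeof condition === "object") {
      return Object.entries(condition).every(([op, operand]) => {
        if (op === "$gte") return value >= operand;
        throw new Error(`unsupported operator in this sketch: ${op}`);
      });
    }
    return value === condition;
  });
}

const people = [
  { LastName: "Flintstone", FirstName: "Fred" },
  { LastName: "Flintstone", FirstName: "Wilma" },
  { LastName: "Rubble", FirstName: "Barney" },
  { LastName: "Bedrock", FirstName: "Arnold" },
];

const flintstones = people.filter((p) => matches(p, { LastName: "Flintstone" }));
const fromF = people.filter((p) => matches(p, { LastName: { $gte: "Flintstone" } }));
```

The query is itself a document: fields name what to look at, and values are either a literal to match or an object of operators.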
26. MongoDB – Advanced Queries Geo-spatial queries Create a geo index Find points near a given point, sorted by radial distance Can be planar or spherical Find points within a certain radial distance, within a bounding box, or a polygon Built-in Map-Reduce The caller provides map and reduce functions written in JavaScript
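The shape of caller-supplied map and reduce functions can be sketched with a tiny in-memory driver. This is an illustration under stated assumptions, not MongoDB's engine: the `mapReduce` helper below is hypothetical, and it passes `emit` as an argument, whereas in the MongoDB shell `emit` is available to the map function as a global. As in MongoDB, the map function sees the current document as `this`.

```javascript
// Run map over every document, group emitted values by key, then reduce each group.
function mapReduce(documents, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of documents) {
    map.call(doc, emit); // the document is bound to `this`
  }
  const results = {};
  for (const [key, values] of groups) {
    results[key] = reduce(key, values);
  }
  return results;
}

// Count people per last name.
const countMap = function (emit) {
  emit(this.LastName, 1);
};
const countReduce = function (key, values) {
  return values.reduce((sum, n) => sum + n, 0);
};

const docs = [
  { LastName: "Flintstone", FirstName: "Fred" },
  { LastName: "Flintstone", FirstName: "Wilma" },
  { LastName: "Rubble", FirstName: "Barney" },
];
const counts = mapReduce(docs, countMap, countReduce);
```

Keeping reduce associative and commutative, as the sum is here, lets an engine apply it to partial groups in any order.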
27. MongoDB is a Single-Master System A database is served by members of a “replica set” The system elects a primary (master) Failure of the master is detected, and a new master is elected Application writes get an error if there is no quorum to elect a new master Reads continue to be fulfilled
29. MongoDB Supports Sharding A collection can be sharded Each shard is served by its own replica set New shards (each a replica set) can be added at any time Shard key ranges are automatically balanced
31. MongoDB Storage Management Data is kept in memory-mapped files Servers should have a lot of memory Files are allocated as needed Documents in a collection are kept on a list using a geographical addressing scheme Indexes (B*-trees) point to documents using geographical addresses
32. MongoDB Server Management Replica set members are aware of each other A majority of votes is required to elect a new primary Members can be assigned priorities to affect the election e.g., an “invisible” replica can be created with zero priority for backup purposes
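The majority and priority rules above can be sketched as a simple function. This is not MongoDB's election protocol: it assumes each member has exactly one vote, that reachability is known up front, and that the highest-priority reachable member wins. The member names are invented.

```javascript
// Returns the name of the member that could become primary, or null.
function electPrimary(members) {
  const reachable = members.filter((m) => m.reachable);
  // A strict majority of ALL members must be able to vote.
  if (reachable.length * 2 <= members.length) return null;
  // Zero-priority members vote but can never be elected.
  const candidates = reachable.filter((m) => m.priority > 0);
  if (candidates.length === 0) return null;
  return candidates.reduce((best, m) => (m.priority > best.priority ? m : best)).name;
}

const healthy = [
  { name: "a", priority: 2, reachable: true },
  { name: "b", priority: 1, reachable: true },
  { name: "backup", priority: 0, reachable: true },
];
const afterFailure = healthy.map((m) => (m.name === "a" ? { ...m, reachable: false } : m));
const partitioned = healthy.map((m) => (m.name === "backup" ? m : { ...m, reachable: false }));

const first = electPrimary(healthy);
const second = electPrimary(afterFailure);
const third = electPrimary(partitioned);
```

The last case is the "no quorum" situation: one reachable member out of three cannot form a majority, so no primary is chosen and writes would fail.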
33. MongoDB Access Drivers are available in many languages 10gen supported: C, C# (.NET), C++, Erlang, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala Community supported: Clojure, ColdFusion, F#, Go, Groovy, Lua, R http://www.mongodb.org/display/DOCS/Overview+-+Writing+Drivers+and+Tools
36. MongoDB Support Paid Support http://www.10gen.com/client-portal 10gen Hosted Monitoring Consulting, training Free Support http://groups.google.com/group/mongodb-user http://stackoverflow.com/questions/tagged/mongodb