2. Agenda Challenges of Relational Databases NoSQL: not only SQL Document store concept Document-oriented databases Raven DB Raven DB Demo MapReduce (optional)
3. Relational Databases properties ACID: Atomic, Consistent, Isolated, Durable Relational: based on relational algebra & Codd’s work Table / Row based Rich querying capabilities Foreign keys Schema
4. What do our apps need? Need to scale horizontally Partition and replication OnLine Transaction Processing (OLTP) and OnLine Analytical Processing (OLAP) Web 2.0 Performance, Performance, Performance Flexibility Big, even Huge datasets http://www.graph-database.org
5. Not only SQL philosophy Being non-relational, distributed, cloud-ready Open-source Horizontally scalable: easy replication support Schema-free Simple API BASE (not ACID): Basically Available, Soft state, Eventual consistency Huge data amount
6. noSQL Pros + Cheap, easy to implement + Removes impedance mismatch between objects and tables + Quickly process large amounts of data + Data modeling flexibility + Command Query Responsibility Segregation (CQRS), Event Sourcing
12. CAP Consistency Each client has the same view Availability All clients can read and write Partition tolerance Works well across different network partitions http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
15. Document-oriented databases are Collections of independent documents: XML, JSON, YAML Non-relational, i.e. do not store data in tables with uniform-sized fields for each record Not limited in the number of fields or their length Usually accessible via a RESTful HTTP/JSON API Horizontally scalable Can be distributed Fault-tolerant
16. Why a document store? Schema free User-generated content Storing full complex object graphs Low overhead – usually operates on a single document: one read, one write Fast Known format means the database can do interesting things with it…
17. Indexing Order in a schema-free world Materialized views Built in the background Allow stale reads Don’t slow down CRUD ops
18. Index concept

{ "name": "ayende",
  "twitter": "@ayende",
  "projects": [ "rhino mocks", "nhibernate", "raven db" ] }

from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }

GET /indexes/ProjectAndName?query=Project:raven

http://ayende.com/blog/4459/that-no-sql-thing-document-databases
19. Document DB family CouchDB: Apache project created by Damien Katz; RavenDB: Oren Eini and Hibernating Rhinos project; MongoDB: 10gen project; SimpleDB: Amazon project, used as a web service in concert with Amazon Elastic Compute Cloud
21. Raven DB Built on existing infrastructure (ESENT) that is known to scale to amazing sizes Can be transactional, i.e. ACID: supports System.Transactions and can take part in distributed transactions Indexes via Linq queries; implements IQueryable mapped to Lucene Supports map/reduce operations
22. Raven DB Comes with a fully functional .NET client API, Unit of Work, change tracking REST based, so you can access it via JavaScript directly Supports optimistic concurrency Can be extended with MEF Has trigger support Supports sharding and replication http://ravendb.net
24. Demo: RavenDB Setup, Server RavenDB Client API Denormalization, modeling documents CRUD Attachments Indexes MapReduce indexes Sharding
25. MapReduce MapReduce is a programming model and an associated implementation for processing and generating large data sets The Map function processes a key/value pair to generate a set of intermediate key/value pairs The Reduce function merges all intermediate values associated with the same intermediate key
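The two phases described above can be sketched in a few lines. This is a minimal, single-process Python illustration using word counting (the canonical MapReduce example); real implementations distribute both phases across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (key, value) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: merge all values that share the same intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

docs = ["raven db", "raven is a document db"]
counts = reduce_phase(map_phase(docs))
print(counts["raven"])  # -> 2
```

Because each intermediate key is reduced independently, the reduce work can be partitioned across machines by key, which is what makes the model parallelize so naturally.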
29. Sharding Sharding refers to horizontal partitioning of data across multiple machines The idea is to split the load across many commodity machines, instead of buying huge expensive servers
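A minimal sketch of the routing step behind sharding: hash the document key to pick a machine, so every read and write for a given key always lands on the same commodity server. The shard names here are invented for illustration.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical commodity machines

def shard_for(key: str) -> str:
    """Route a document key to a shard using a stable hash.

    A stable hash (not Python's built-in hash(), which varies per
    process) guarantees the same key always maps to the same shard.
    """
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Deterministic: repeated lookups for one key hit one machine.
assert shard_for("users/ayende") == shard_for("users/ayende")
```

Note the trade-off this simple modulo scheme implies: changing the number of shards remaps most keys, which is why production systems often use consistent hashing instead.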
After spending so long extolling the benefits of the various NoSQL solutions, I would like to point out at least one scenario where I haven’t seen a good NoSQL alternative to the RDBMS: reporting. One of the great things about an RDBMS is that, given the information it already has, it is very easy to massage the data into a lot of interesting forms. That is especially important when you are trying to do things like give users the ability to analyze the data on their own, such as by providing a report tool that allows them to query, aggregate and manipulate the data to their heart’s content. While it is certainly possible to produce reports on top of a NoSQL store, you wouldn’t be able to come close to the level of flexibility that an RDBMS offers. That is one of the major benefits of the RDBMS: its flexibility. NoSQL solutions will tend to outperform the RDBMS (as long as you stay in the appropriate niche for each NoSQL solution), and they certainly have a better scalability story than the RDBMS, but for user-driven reports the RDBMS is still my tool of choice.
A column database is a DBMS that stores its content by column rather than by row. This has advantages for data warehouses: it is more efficient with aggregates and with column-oriented data. Suited for OLAP, much less so for OLTP. The idea dates back to the 1970s. Example: Apache CASSANDRA.

Key-value DBs allow the user to store key/value pairs, where the key usually consists of a string and the value is a simple primitive. Suited for use cases where properties and values are enough: profiles, logs, etc. Typically eventually consistent; may support hierarchies, multi-valued entries, etc. Example: REDIS.IO.

A graph DB is a DB that uses a graph structure with nodes, edges, and properties. Suited for associative datasets, maps well onto object-oriented app structure, and avoids expensive joins.
There is a computer science theorem that quantifies the inevitable trade-offs. Eric Brewer’s CAP theorem says that if you want consistency, availability, and partition tolerance, you have to settle for two out of three. (For a distributed system, partition tolerance means the system will continue to work unless there is a total network failure. A few nodes can fail and the system keeps going.)
Horizontally Scalable

The problem is that SQL doesn’t scale well. In particular, it doesn’t scale horizontally. If your SQL performance is poor, you can’t just add more SQL servers to make it faster. In general, you need rather large computers to handle large databases, which means some very expensive hardware. In addition, since you need large computers, this doesn’t fit well with the cloud model.

Document-oriented databases (such as CouchDB and MongoDB) are designed for horizontal scalability. This means that as your database grows, you can simply add more commodity hardware, or more resources from the cloud. But how do they achieve this?

These types of databases operate on something similar to distributed hash tables (DHTs). DHTs store key/value pairs in hash buckets. These buckets hold a number of key/value pairs indexed by hash value: a number generated from the data in such a way that all key/value pairs are distributed evenly among the buckets. For example, if the DHT has 5 hash buckets and 50 key/value pairs are stored, each bucket should hold about 10 pairs.

One of the advantages here is that this is extremely easy to parallelize. Want more database servers? Just add more hash buckets. As your database grows, you just add more servers, and none of them needs to be a supercomputer either. This is what it means to be "horizontally scalable."

Schema-less

Another defining feature of document-oriented databases is that they are schemaless. This is a hard pill to swallow if you’ve been using relational databases for a long period of time. Instead of each record existing as a row of carefully designed columns, each record exists in a document. Think of it as a file on the filesystem. This document can store any data it wants; it doesn’t have to follow a schema.

While these documents are schemaless, though, they are not freeform. Many databases opt for the JSON format, which helps you store key/value pairs in a structured way.
A document can have any number of key/value pairs. Instead of sharing a schema, documents of the same type (for example, documents representing blog posts) all have a similar set of key/value pairs.

One consequence of this is compactness. Since all documents of the same type don’t have to carry the same set of key/value pairs, you can save space by leaving some of them off. So if not all blog posts have associated links, you can simply leave that key/value pair out entirely, not just leave it empty.

That has further implications. Not only does it save a bit of space, it makes adding features to a database relatively free. Running an ALTER statement on a large SQL database can take hours of crunching; if it goes wrong, you have to restore a backup of the database, figure out what went wrong, and try again. With document-oriented databases, you simply start adding new key/value pairs to your documents. It’s as easy as that.

Cloud Model

The trend in web applications (and many other fields) is toward cloud computing. If you’re not familiar with cloud computing, imagine a huge server farm. Your web application lives on one of these servers, shared with many other applications. Then someone posts a link to your application on a popular website and you’re suddenly inundated with traffic. On a traditional hosting platform, you’d reach the limitations of your virtual server and hit a brick wall. On a cloud, more servers can be dynamically allocated to deal with this traffic; once the spike is over, the space on those servers returns to the cloud. Nothing breaks, and your web application doesn’t even slow down.

One of the problems with SQL servers is that they don’t work well in a cloud. As databases grow and as traffic increases, larger and faster computers are required. Load balancing can be achieved by mirroring the servers, but they still need to be large and fast. This just doesn’t fit the cloud model.
NoSQL servers, on the other hand, can simply add more nodes.
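The hash-bucket distribution described above can be demonstrated in a few lines of Python. This is an illustrative sketch, not any particular database's partitioning code: 50 keys hashed into 5 buckets come out roughly evenly spread, and adding buckets corresponds to adding servers.

```python
import hashlib

def bucket_for(key: str, bucket_count: int) -> int:
    """Hash a key into one of `bucket_count` hash buckets, DHT-style."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % bucket_count

# Distribute 50 key/value pairs over 5 buckets and count per bucket.
buckets = [0] * 5
for i in range(50):
    buckets[bucket_for(f"key-{i}", 5)] += 1

print(buckets)  # roughly 10 keys per bucket
```

Because the hash spreads keys evenly, no single bucket (server) becomes a hot spot, which is the property that makes the scheme easy to parallelize.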
A document database is, at its core, a key/value store with one major exception: instead of storing just any blob, a document DB requires that the data be stored in a format the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything, as long as the database can understand it.

Why is this such a big thing? Because when the database understands the format of the data you send it, it can do server-side operations on that data. In most document DBs, that means we can allow queries on the document data. The known format also means it is much easier to write tooling for the database, since it is possible to show, display and edit the data.

Each document contains both the actual data and additional metadata that you can use. We can PUT this document in the database under the key ‘ayende’. We can also GET the document back using the key ‘ayende’.

A document database is schema-free; you don’t have to define your schema ahead of time and adhere to it. It also allows us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don’t really think about it.

It does not, however, support relations. Each document is standalone. It can refer to other documents by storing their keys, but there is nothing to enforce relational integrity.

The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key. By storing information in a form the database can understand, we can ask the server to do things for us.
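To make the distinction concrete, here is a toy Python sketch (the class name and methods are invented for illustration) of a key/value store that insists on a known format: because values must be JSON, the "server" can run queries over document fields, something a blob store cannot do.

```python
import json

class ToyDocumentStore:
    """A toy document store: a key/value store whose values must be
    JSON, so the server side can understand and query the documents."""

    def __init__(self):
        self._docs = {}

    def put(self, key, json_text):
        # json.loads rejects opaque blobs: only well-formed JSON gets in.
        self._docs[key] = json.loads(json_text)

    def get(self, key):
        return self._docs.get(key)

    def query(self, field, value):
        """Server-side query by field, possible only because the
        storage format is known to the database."""
        return [k for k, d in self._docs.items() if d.get(field) == value]

db = ToyDocumentStore()
db.put("ayende", '{"name": "ayende", "projects": ["raven db"]}')
print(db.query("name", "ayende"))  # -> ['ayende']
```

Note that `query` walks every document here; the indexing discussion that follows is exactly about avoiding that per-query scan.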
Such as defining the following index for us:

from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }

With that in place, we can now make queries on the stored documents:

GET /indexes/ProjectAndName?query=Project:raven

In the first case, you define an indexing function (in Raven’s case a Linq query, in CouchDB’s case a JavaScript function) and the server runs it to prepare the results; once the results are prepared, they can be served to the client with minimal computation. CouchDB and Raven differ in the method they use to update those indexes: Raven updates the index in the background as documents change, and queries against indexes never wait. A query may return a stale result (and is explicitly marked as such), but it returns immediately.

MongoDB’s indexes behave in much the same way RDBMS indexes behave; that is, they are updated as part of the insert process, so a large number of indexes is going to affect write performance.
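A Python sketch of what the ProjectAndName index materializes (the Linq query above, translated line for line): one index entry per (document, project) pair, prepared ahead of time so that a query is just a lookup over the prepared entries, with no per-query computation over the raw documents.

```python
# Hypothetical stored documents, keyed by document id.
docs = {
    "ayende": {"name": "ayende",
               "projects": ["rhino mocks", "nhibernate", "raven db"]},
}

# Build the index: for each document, emit one entry per project,
# mirroring `from doc in docs from prj in doc.projects select new {...}`.
index = []
for doc in docs.values():
    for prj in doc["projects"]:
        index.append({"Project": prj, "Name": doc["name"]})

# Querying the prepared index (cf. ?query=Project:raven) is a plain scan
# of precomputed entries; no documents are touched at query time.
results = [e for e in index if e["Project"] == "raven db"]
print(results)  # -> [{'Project': 'raven db', 'Name': 'ayende'}]
```

This also illustrates the stale-read trade-off: if a document changes after the index is built, queries keep answering instantly from the old entries until the background rebuild catches up.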
RavenDB: stores JSON, .NET solution; provides HTTP/JSON access; LINQ queries & sharding supported.
MongoDB: works through C# drivers, stores BSON; protocol: drivers for many languages; query method: dynamic object-based language & MapReduce; replication: master-slave & auto-sharding.
Eloquera Database: object database and document-oriented database.
CouchDB: stores JSON; protocol: REST; query method: MapReduce of JavaScript functions; replication: master-master; written in Erlang.
REST = REpresentational State Transfer.
Client API design guidelines

The Raven Client API design intentionally mimics the widely successful NHibernate API. The API is composed of the following main classes:

IDocumentStore - expensive to create, thread safe, and should only be created once per application. The document store is used to create DocumentSessions and to hold the conventions related to saving/loading data and any other global configuration.

IDocumentSession - instances of this interface are created by the DocumentStore; they are cheap to create and not thread safe. If an exception is thrown by an IDocumentSession method, the behavior of all of its methods (except Dispose) is undefined. The document session is used to interact with the Raven database: load data from the database, query the database, save and delete. Instances of this interface implement the Unit of Work pattern and change tracking.

IDocumentQuery - allows querying the indexes on the Raven server.
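The unit-of-work and change-tracking behavior attributed to IDocumentSession above can be sketched language-neutrally. This Python sketch is illustrative only, not the Raven client API: a session snapshots documents at load time and, on save, writes back only those that actually changed.

```python
import copy

class Session:
    """Sketch of the unit-of-work pattern: track loaded documents and
    write back only the modified ones when save_changes() is called."""

    def __init__(self, store):
        self._store = store      # the shared, long-lived document store
        self._loaded = {}        # key -> live document handed to caller
        self._originals = {}     # key -> snapshot taken at load time

    def load(self, key):
        doc = copy.deepcopy(self._store[key])
        self._loaded[key] = doc
        self._originals[key] = copy.deepcopy(doc)
        return doc

    def save_changes(self):
        written = []
        for key, doc in self._loaded.items():
            if doc != self._originals[key]:   # change tracking
                self._store[key] = copy.deepcopy(doc)
                written.append(key)
        return written

store = {"users/1": {"name": "ayende"}}
session = Session(store)
user = session.load("users/1")
user["name"] = "oren"                 # mutate the tracked document
print(session.save_changes())         # -> ['users/1']
```

This mirrors the lifecycle split the notes describe: the store is created once and shared, while sessions are cheap, short-lived, and not thread safe, since each carries its own tracked state.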
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Sharding is the partitioning of data at the resource level. The concept of sharding is to split data logically across different resources based on load requirements.

A database shard is a horizontal partition in a database or search engine. Each individual partition is referred to as a shard or database shard.

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means the database load can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers), then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

Sharding is in practice far more difficult than this. Although it has long been done by hand-coding (especially where rows have an obvious grouping, as in the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it and in terms of identifying candidates to be sharded separately. Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons), a shard approach may also be useful.