In formal database theory, tables are relations, rows are tuples, and fields are attributes.
Relational databases aren’t about “relations” between tables. The name refers to the tables (relations) that make up the database.
Relational databases actually have problems dealing with “relationships” in the informal sense – joins need to be planned for in advance, and schema design can be a multi-week (or even multi-month!) process that has to be complete before you can start building your application.
Disk read latency on a spinning 7200 RPM platter is about 4.17ms of average rotational latency, plus roughly 8ms of average seek time. You want as few seeks as possible, and normalizing data increases the number of seeks.
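The arithmetic behind those numbers is worth spelling out. A minimal sketch, assuming a 7200 RPM drive and the ~8ms average seek figure quoted above:

```python
# Back-of-envelope read latency for a 7200 RPM spinning disk.
rpm = 7200
ms_per_revolution = 60 / rpm * 1000            # 8.33 ms per full rotation
avg_rotational_ms = ms_per_revolution / 2      # on average you wait half a turn
avg_seek_ms = 8.0                              # typical figure for commodity drives

per_read_ms = avg_rotational_ms + avg_seek_ms
print(round(avg_rotational_ms, 2))  # 4.17
print(round(per_read_ms, 2))        # 12.17
```

At roughly 12ms per random read, a query that needs ten seeks spends over a tenth of a second just waiting on the platter – which is why normalization-induced seeks hurt.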
This made complete sense when disks only held 10MB of data. Now?
Modern relational databases spend much of their optimization effort combating this multi-seek problem… but can be limited by memory constraints on commodity hardware – which means you end up buying specialist hardware. Have I mentioned Larry Ellison owns an island? An entire Hawaiian island.
The turn of the millennium saw XML and object databases, like MarkLogic and Objectivity – but the real explosion in interest began in the middle of the decade, as the needs of data storage and retrieval really started to change.
Eric Evans popularized the term, using it as the title of Meetups held to discuss this new technology trend in San Francisco, my home town.
Next, we’ll go over some of these historical developments.
One of the earliest developments was the creation of memcached.
Developed at LiveJournal in 2003 as a way to speed up web applications, memcached has proved so useful that it’s still in wide use and under active development.
Described as: a high-performance, distributed memory object caching system.
Let’s unpack:
Distributed – runs across many computers
Memory – runs without touching disk
Object cache – designed to hold small lumps of data
High performance – because it never touches disk, and the objects are small, it’s optimized for speed
Advantage? Scale out architecture
With a single server, as in most relational systems, all you can do is buy a bigger machine – scale up. But this quickly gets ruinously expensive. NoSQL offers another way to scale – scale out.
With memcached, there’s no connection between the machines; where the data lives is determined by the client’s hash. That lets you set up multiple machines.
[click]
But other systems are possible. The servers can communicate among themselves and decide who keeps what data. Mongo, for instance, does this by having you set a shard key, so where data lives depends on its value. MarkLogic does this automatically, without setting a key.
This lets you scale to an effectively unlimited number of hosts.
[click]
From the 2006 Google paper: Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
Bigtable was never shipped outside of Google, but it’s considered a seminal paper for the NoSQL movement, and the ideas behind it are the basis for a family of databases called wide column stores. It’s also integral to many Google projects and is the data storage method exposed by App Engine, so you can still use it today.
Bigtable uses MVCC for writes, and as a result is able to do fast writes which scale well. It also supports indexing for queries.
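That one-sentence data model from the paper can be sketched directly. This is a toy model only – a map from (row key, column key, timestamp) to bytes, with multiple timestamped versions kept per cell, MVCC-style – not a real Bigtable API, and the row/column names are made up:

```python
# Toy Bigtable data model: a map keyed by (row, column, timestamp),
# where each value is an uninterpreted byte string.
table = {}  # (row, col, ts) -> bytes

def put(row: str, col: str, ts: int, value: bytes) -> None:
    # Writes never overwrite: each timestamp is a distinct version.
    table[(row, col, ts)] = value

def latest(row: str, col: str):
    # Reads pick the newest version of the cell, if any exist.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == col]
    return max(versions)[1] if versions else None

put("com.example/page1", "contents:", 1, b"<html>v1</html>")
put("com.example/page1", "contents:", 2, b"<html>v2</html>")
print(latest("com.example/page1", "contents:"))  # b'<html>v2</html>'
```

Because writes just append a new (key, timestamp) entry rather than mutating in place, they don’t block readers – which is part of why this style of store writes fast and scales well.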
History
Map and reduce functions need to be order-independent. Another scale-out architecture.
Hadoop began with Doug Cutting of the Internet Archive and Mike Cafarella of the University of Washington. Cutting went to Yahoo in 2006.
http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
Another influential system that never shipped publicly was Amazon’s Dynamo. Presented at the 2007 All Things Distributed conference, the Amazon Dynamo paper was every bit as exciting as the Bigtable paper.
From the paper, Dynamo is “a highly available key-value storage system that some of Amazon’s core services use to provide an ‘always-on’ experience. To achieve this level of availability, Dynamo sacrifices consistency”.
Paper is at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
DynamoDB info at http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
From the paper: Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories, which the system will need to reconcile in the future. This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).
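The paper’s mechanism for spotting those divergent sub-histories is the vector clock. Here’s a minimal sketch of the detection step, with made-up node names – real Dynamo also handles truncation, replication, and reconciliation strategies this toy ignores:

```python
# Vector clocks: each version carries a map of node -> update count.
def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every update recorded in version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a: dict, b: dict) -> bool:
    # Neither version descends from the other: two sub-histories
    # exist, and the application must reconcile them.
    return not descends(a, b) and not descends(b, a)

v1 = {"node-a": 2, "node-b": 1}  # updated via node-a during a partition
v2 = {"node-a": 1, "node-b": 2}  # concurrently updated via node-b
print(conflict(v1, v2))  # True: the app sees both versions and must merge
```

This is exactly the “explicitly acknowledge the possibility of multiple versions” design burden the quote describes: the store hands you both versions instead of silently dropping one.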
A somewhat artificial acronym for a very real thing.
We’ll cover this in more detail in just a moment.
A very made-up acronym for a much vaguer idea than ACID. It’s another way of saying “not ACID”. It’s also appropriate for some use cases… but beware using it where it’s not.
The most famous problem with this approach came from two different Bitcoin exchanges that went out of business because they relied on eventual consistency.
So, for things like survey data, cat pictures, or forum postings (for some forums, but not others, like in finance), BASE is fine. For anything having to do with money, regulatory compliance, inventory, etc., use ACID.
Consistency is a function of the other three properties: Durability, Isolation, and Atomicity.
So, this is essentially a summary of the preceding slides.
As a side note, Eventually Consistent is really just marketing speak: if you’re only consistent eventually, you’re Essentially Inconsistent.
Slide originally from Mike Bowers (but since modified), presented at MarkLogic World 2013
A database, not a filesystem. Not a cache (without a store). So: not Hadoop, not memcached (but memcachedb counts).
Cluster-friendly is about more than just running in an AMI – it means running on commodity hardware.
There are easily over 200 different NoSQL database systems, and they vary wildly in features and design centers.
Key-value stores are “hashtables in the sky”.
Redis is an open-source, networked, in-memory key-value data store with optional durability. It’s the most popular KV store.
As mentioned previously, it’s also considered a “data structure server”.
The ability to do a clustered, shared-nothing distribution of data is currently in beta.
Like key-value stores, but by allowing additional structure in the stored value, new possibilities open up for things like indexing, search, and aggregation.
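A tiny sketch of why that structure matters. With opaque blobs you can only look things up by key; once the store can see inside the document, it can maintain a secondary index and answer field queries. Document IDs and fields here are invented for illustration:

```python
# A toy document store: values are structured documents, not blobs.
docs = {
    "u1": {"name": "Ada", "city": "London"},
    "u2": {"name": "Bob", "city": "Doha"},
    "u3": {"name": "Eve", "city": "London"},
}

# Secondary index on "city" -- only possible because the store
# can interpret the value's structure.
index: dict[str, list[str]] = {}
for doc_id, doc in docs.items():
    index.setdefault(doc["city"], []).append(doc_id)

# "Find everyone in London" without scanning every document.
print(sorted(index["London"]))  # ['u1', 'u3']
```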
Mongo is the most widely used document DB, while MarkLogic is the largest NoSQL database company by revenue (according to independent web site estimates).
Binary JSON (BSON) oriented document database, with sharding and eventual consistency. First stable release in 2010.
Bigtable, Cassandra, Apache HBase, Apache Accumulo
All from the Bigtable starting point, and share that general architecture.
But really, it’s almost all Cassandra from a market-share perspective.
* Data model like Bigtable
* Distribution model like Dynamo
* Built at Facebook in 2008
* Apache project since 2010
Great for: Recommendations, Social Network analysis, Shortest path, Asset Management
Neo4J, Allegro, Titan, Objectivity
Databases where the primary things tracked are nodes (vertices) and the connections between those nodes, called edges.
Neo4J dominates the market from a share perspective.
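Shortest path, mentioned above, is the canonical graph-database traversal. A minimal breadth-first sketch over a made-up social graph (an adjacency list stands in for stored edges; names are invented):

```python
from collections import deque

# Tiny social graph: who is connected to whom.
edges = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  ["frank"],
    "erin":  [],
    "frank": [],
}

def shortest_path(start: str, goal: str):
    # Breadth-first search: the first path to reach the goal
    # is guaranteed to be a shortest one (fewest hops).
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route exists

print(shortest_path("alice", "frank"))  # ['alice', 'bob', 'dave', 'frank']
```

Graph databases make this kind of hop-by-hop traversal cheap; in a relational store, each hop would be another join.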
The Semantic Web and Open Linked Data are really just a special case of Graph Databases.
Semantics is a new way of organizing and searching information
Data are modeled as triples: a combination of subject, predicate, and object – a fact.
For example, “John Smith lives in London” is a fact, and so is “London is in England”. Each of those facts can be modeled as a triple.
Any human would look at those two facts and immediately know that John Smith lives in England
With rules, MarkLogic Semantics can achieve the same result [CLICK]
Even though we never explicitly say that John Smith lives in England, we can query MarkLogic and find that it’s true
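The inference above can be sketched with a toy triple store and one rule. This is purely illustrative – it is not MarkLogic’s rule syntax, just the general shape of rule-based inference over triples:

```python
# A toy triple store: each entry is (subject, predicate, object).
triples = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
}

# Rule: if X livesIn Y and Y isIn Z, then X livesIn Z.
# Apply until no new facts appear (a fixed point).
inferred = set(triples)
changed = True
while changed:
    changed = False
    for (x, p1, y) in list(inferred):
        for (y2, p2, z) in list(inferred):
            if p1 == "livesIn" and p2 == "isIn" and y == y2:
                fact = (x, "livesIn", z)
                if fact not in inferred:
                    inferred.add(fact)
                    changed = True

# The derived fact was never stored explicitly, yet it's queryable.
print(("John Smith", "livesIn", "England") in inferred)  # True
```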
There are a large and growing number of Linked Open Data sets available, and more are coming every day.
These data sets are in a form that makes them easily consumed. That’s really important and we’ll describe what that form looks like in a minute
Examples
dbpedia (wikipedia as triples)
Einstein was born in Germany
Ireland's currency is the Euro
GeoNames:
Doha is the capital of Qatar
Doha has these lat/long coords
Others:
Data.gov, data.gov.uk
Legislation
Where the money goes
World Bank Linked Data
Patents.data.gov, reference.data.gov,
BBC Programmes, BBC Music, BBC Wildlife