what’s our course? where are we starting, where are we going?
(image: somewhere on flickr)
Where do we start?
I’m approaching this as someone whose primary/only database use has been with relational databases, using SQL.
What are we looking for?
An understanding of what document-oriented databases are, how they differ from relational databases, why you’d use them over relational databases, and what some of the options are.
What pitfalls might we encounter?
I (Matt) am not an expert, so I'll probably miss stuff and might not be able to argue for document-oriented databases very eloquently. Hopefully I won't totally mislead anybody.
back to the title. are we really going to talk about and seriously consider schema-free databases? what’s the point of that?
the short answer is yes. hopefully this presentation will show why schema-free databases are sometimes very useful.
quick review: relational databases are made up of relations.
roughly, attributes are columns, tuples are rows. relations are collections of tuples with the same set of attributes, so tables.
nice, structured, data.
(image: http://en.wikipedia.org/wiki/File:Relational_database_terms.svg)
You could say that a relational database is defined by its structure.
Structured Query Language
For this presentation, it's analogous to statically typed programming languages (like C, C++, C#, Java)
so, what are some of the challenges?
ironically, some structured data can be difficult or tedious to implement.
For example, parent-child relationships can be difficult to represent and/or query on (select all work items where area path is in “Top-Level Component”)
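To make the work-item example concrete, here's a rough sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for illustration. The point is only that "everything under Top-Level Component" needs a recursive query once the hierarchy is more than one level deep.

```python
import sqlite3

# Hypothetical schema: areas form a tree through parent_id,
# and each work item points at exactly one area.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE area (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    CREATE TABLE work_item (id INTEGER PRIMARY KEY, area_id INTEGER, title TEXT);
    INSERT INTO area VALUES (1, NULL, 'Top-Level Component'),
                            (2, 1, 'Sub-Component'),
                            (3, 2, 'Sub-Sub-Component'),
                            (4, NULL, 'Other Component');
    INSERT INTO work_item VALUES (10, 1, 'fix top'),
                                 (11, 3, 'fix deep'),
                                 (12, 4, 'fix other');
""")

def work_items_under(conn, area_name):
    """Return titles of work items in the named area or any of its descendants."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            -- start at the named area
            SELECT id FROM area WHERE name = ?
            UNION ALL
            -- then keep adding children of anything already found
            SELECT area.id FROM area JOIN subtree ON area.parent_id = subtree.id
        )
        SELECT title FROM work_item
        WHERE area_id IN (SELECT id FROM subtree)
        ORDER BY id
    """, (area_name,)).fetchall()
    return [title for (title,) in rows]

print(work_items_under(conn, "Top-Level Component"))
```

It works, but a plain SELECT with a WHERE clause isn't enough; you need the recursive part (or several self-joins, or application code) just to walk the tree.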
Relational databases typically aren’t designed for replication and scale-out from the beginning. As we all know, neglecting to consider something like this will make it harder to do later. Even something like merging in a source control tool (git vs. svn)... if you start out trying to support it, you’ll do better than if you add it as a feature later.
one of the reasons that replication or distribution is difficult is that conflicts are sure to arise. two edits could conflict. two identical ids could be autogenerated... the application can solve these things, but the database isn’t going to provide too much out of the box.
Relational databases do solve problems for us, and they’re a powerful tool. I don’t want to discount that.
document-oriented databases.
can anyone tell me what document-oriented databases are made up of?
(image: http://www.flickr.com/photos/janodecesare/2978128591/sizes/o/)
we’re not doing waterfall, here
attributes.
notice the differences. there’s no schema to follow!
flexibility
couchdb is a very popular open-source document-oriented database.
JavaScript Object Notation
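Here's a small sketch of what "no schema" looks like in practice, using Python's json module. The two documents and their attribute names are invented for illustration (only the "_id" key follows CouchDB's convention). Both are valid JSON and could sit in the same database even though their attributes differ.

```python
import json

# Two hypothetical documents of the same "type" with different attributes.
docs = [
    {"_id": "card-1", "type": "contact", "name": "Ada",
     "email": "ada@example.com"},
    {"_id": "card-2", "type": "contact", "name": "Bob",
     "phones": ["555-0100", "555-0101"], "twitter": "@bob"},
]

# Each document serializes to JSON text and back without any table definition.
for doc in docs:
    text = json.dumps(doc, sort_keys=True)
    restored = json.loads(text)
    print(sorted(restored.keys()))

# The only attributes the two documents happen to share.
shared = set(docs[0]) & set(docs[1])
print(sorted(shared))
```

Nothing forces the second document to have an email, or the first to have a phones list. Whatever reads the data has to cope with attributes being absent.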
CAP theorem.
consistency: all reads return the same, “right” result; reads from two servers return the same result. This ends up being a challenge for lots of big web 2.0 properties -- I’ve read about how flickr, facebook deal with this.
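availability: every request gets a response, even if it isn't the very latest data. i.e. writes don't block reads.
partition tolerance: the database keeps working when it's split across servers that can't always reach each other.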
choose two.
“eventual consistency”
as you can see, one of the differences between couchdb and a relational database is the consistency/availability tradeoff. couchdb is written in erlang, so some of its features have an erlangish feel to them: data is always there (old revisions always exist, are immutable), and new versions get layered on top.
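A toy sketch of the "old revisions stay, new ones get layered on top" idea, in Python. This is not how couchdb is implemented: the RevisionStore class and its integer revision numbers are my own simplification (couchdb's real revision ids look different). It just shows a write being rejected when it was based on an out-of-date revision.

```python
class RevisionStore:
    """Toy append-only store: every write adds a new immutable revision."""

    def __init__(self):
        self._revisions = {}  # doc_id -> list of snapshots, oldest first

    def put(self, doc_id, body, expected_rev=None):
        history = self._revisions.get(doc_id, [])
        current_rev = len(history) or None  # None means "doesn't exist yet"
        if expected_rev != current_rev:
            # The caller edited a revision that is no longer the latest.
            raise ValueError(f"conflict: current revision is {current_rev}")
        # Copy the body so later changes by the caller can't alter history.
        self._revisions[doc_id] = history + [dict(body)]
        return len(history) + 1

    def get(self, doc_id, rev=None):
        history = self._revisions[doc_id]
        snapshot = history[-1] if rev is None else history[rev - 1]
        return dict(snapshot)


store = RevisionStore()
rev1 = store.put("note", {"text": "first draft"})
rev2 = store.put("note", {"text": "second draft"}, expected_rev=rev1)

print(store.get("note"))            # latest revision
print(store.get("note", rev=rev1))  # the old revision is still there
```

A second writer who still holds rev1 and tries to save gets a conflict instead of silently overwriting the newer text.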
can someone describe map-reduce?
enables parallelization.
views in couch are map/reduce
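A minimal sketch of the map/reduce idea in plain Python. Couch views are normally written as JavaScript functions; the run_view helper, the sample documents and the tag-counting example here are all hypothetical and only meant to show the shape: map emits key/value pairs per document, reduce combines the values for each key.

```python
from collections import defaultdict

# Hypothetical documents; note that one of them has no "tags" attribute.
docs = [
    {"_id": "a", "tags": ["couchdb", "erlang"]},
    {"_id": "b", "tags": ["couchdb"]},
    {"_id": "c"},
]

def map_doc(doc):
    """Map step: emit one (key, value) pair per tag on the document."""
    for tag in doc.get("tags", []):
        yield (tag, 1)

def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)

def run_view(docs, map_fn, reduce_fn=None):
    """Run map over every document, then optionally reduce per key."""
    # Each document is mapped independently, which is what makes
    # this step easy to run in parallel.
    rows = [row for doc in docs for row in map_fn(doc)]
    rows.sort(key=lambda kv: kv[0])
    if reduce_fn is None:
        return rows
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}

print(run_view(docs, map_doc))                 # raw map output
print(run_view(docs, map_doc, reduce_values))  # counts per tag
```

The map function never looks at more than one document at a time, so documents with missing attributes simply emit nothing.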
stale=ok means that the view is served from what has already been computed, without first checking whether it needs to be regenerated.
reduce=false skips the reduce function, if it was supplied.
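These options are query-string parameters on the view URL. Here's a small sketch that only builds such a URL (no request is sent). The view_url helper, the database name "notes", the design document "tags" and the view "by_tag" are all made up; the host and port assume a local couchdb on its default port.

```python
from urllib.parse import quote, urlencode

def view_url(base, db, design_doc, view, **params):
    """Build a URL for querying a view, with options as query parameters."""
    parts = (db, "_design", design_doc, "_view", view)
    path = "/".join(quote(part, safe="") for part in parts)
    query = urlencode(params)
    return f"{base}/{path}" + (f"?{query}" if query else "")

# Hypothetical database, design document and view names.
url = view_url("http://localhost:5984", "notes", "tags", "by_tag",
               stale="ok", reduce="false")
print(url)
```

Fetching that URL with any HTTP client should give back the raw map rows from whatever index already exists, assuming the view is defined on the server.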