This talk will introduce the features of MongoDB by walking through how one can build a simple location-based checkin application using MongoDB. The talk will cover the basics of MongoDB's document model, query language, map-reduce framework and deployment architecture.
3. Document Database
• Not for .PDF & .DOC files
• A document is essentially an associative array
• Document == JSON object
• Document == PHP Array
• Document == Python Dict
• Document == Ruby Hash
• etc
4. Open Source
• MongoDB is an open source project
• On GitHub
• Licensed under the AGPL
• Started & sponsored by 10gen
• Commercial licenses available
• Contributions welcome
7. Full Featured
• Ad Hoc queries
• Real time aggregation
• Rich query capabilities
• Strongly consistent
• Geospatial features
• Support for most programming languages
• Flexible schema
9. $ tar xvf mongodb-osx-i386-2.4.4.tar.gz
$ cd mongodb-osx-i386-2.4.4/bin
$ mkdir -p /data/db
$ ./mongod
Running MongoDB
10. $ mongo
MongoDB shell version: 2.4.4
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
http://docs.mongodb.org/
Questions? Try the support group
http://groups.google.com/group/mongodb-user
> db.test.insert({text: 'Welcome to MongoDB'})
> db.test.find().pretty()
{
"_id" : ObjectId("51c34130fbd5d7261b4cdb55"),
"text" : "Welcome to MongoDB"
}
Mongo Shell
23. > db.users.findOne()
{
"_id" : ObjectId("50804d0bd94ccab2da652599"),
"username" : "ngraham",
"first_name" : "Norman",
"last_name" : "Graham"
}
Querying for the user
24. > db.posts.insert({
title: "Hello World",
body: "This is my first blog post",
date: new Date("2013-06-20"),
username: "ngraham",
tags: ["adventure", "mongodb"],
comments: []
})
Creating a blog post
25. > db.posts.find().pretty()
{
"_id" : ObjectId("51c3bafafbd5d7261b4cdb5a"),
"title" : "Hello World",
"body" : "This is my first blog post",
"date" : ISODate("2013-06-20T00:00:00Z"),
"username" : "ngraham",
"tags" : [
"adventure",
"mongodb"
],
"comments" : [ ]
}
Finding the Post
26. > db.posts.find({tags:'adventure'}).pretty()
{
"_id" : ObjectId("51c3bcddfbd5d7261b4cdb5b"),
"title" : "Hello World",
"body" : "This is my first blog post",
"date" : ISODate("2013-06-20T00:00:00Z"),
"username" : "ngraham",
"tags" : [
"adventure",
"mongodb"
],
"comments" : [ ]
}
Querying an Array
Hi, my name’s Norman Graham and I’m a consulting engineer at 10gen, the MongoDB company. I normally spend my time helping people get the most from their MongoDB-based applications, and that can include anything from engineering assessments of production MongoDB deployments to training and advising engineering teams on schema design and MongoDB-related architectural decisions. Today, I’m hosting this webinar from the 10gen offices in Palo Alto, CA, just a few blocks from the campus of Stanford University. Our topic for today is Building Your First Application with MongoDB. My goal is to introduce you to some MongoDB concepts, maybe provide some context from the world of relational databases, and also to sketch out a very simple application as a codifying example of those concepts. The document model is really a fundamental departure from the relational model, but it’s a very easy transition for application developers to make. Flexible schemas are a great fit with the agile methods that most of us are using, and the documents themselves are a very natural match for objects in our programming languages. These features, along with the rest of the rich functionality in MongoDB, combine to create not only a great production environment, but also one that’s a joy to develop in and that greatly increases the speed and agility of the development process.
First, what is MongoDB? What are its salient properties?
MongoDB is a document database. It’s an open source project, and it strives to achieve high performance. It’s horizontally scalable and full featured. We will go through each of these in turn.
By documents we don’t mean Microsoft Word documents or PDF files. You can think of a document as an associative array. If you use JavaScript, a JSON object can be stored directly in MongoDB. If you are familiar with PHP, it stores data that looks like a PHP array. In Python, the dict is the closest analogy. And in Ruby, there is the hash. As you know if you use these things, they are not flat data structures. They are hierarchical data structures. For example, in Python you can store an array within a dict, and each array element could be another array or a dict. This is really the fundamental departure from relational, where you store rows, and the rows are flat.
AGPL stands for GNU Affero General Public License. MongoDB is open source. You can download the source right now on GitHub. We license it under the Affero variant of the GPL. The project was initiated and is sponsored by 10gen. You can get a commercial license by buying a subscription from 10gen. Subscribers also receive commercial support and, depending on the subscription level, access to some proprietary extensions, mostly interesting to large enterprises. Contributions to the source are welcome.
MongoDB was designed to be high performance and inexpensive to scale out. It is written in C++ and runs on commodity hardware. MongoDB doesn’t use any exotic operating system extensions. The database files themselves are memory mapped into the address space of the server process. You can get a build of MongoDB for most platforms, including Windows, Mac OS and the major variants of Linux. Big endian architectures are not supported today, which rules out ARM. MongoDB stores its data on disk using the BSON format, which can be viewed as binary JSON. BSON is also used for the wire protocol between applications and the database server. MongoDB is designed with the developer in mind. We have full support for primary and secondary indexes and, as I hope to demonstrate, the document model is a lot less work for a developer.
One of the primary design goals of MongoDB is that it be horizontally scalable. With a traditional RDBMS, when you need to handle a larger workload, you buy a bigger machine. The problem with that approach is that machines are not priced linearly. The largest computers cost exponentially more money than commodity hardware. And what’s more, if you have reasonable success in your business, you can quickly get to a point where you simply can’t buy a large enough machine for the workload. MongoDB was designed to be horizontally scalable through sharding: you add capacity by adding boxes.
Well, how did we achieve this horizontal scalability? If you think about the database landscape, you can plot each technology in terms of its scalability and its depth of functionality. At the top left we have the key value stores like memcached. These are typically very fast, but they lack key features to make a developer productive. On the far right are the traditional RDBMS technologies like Oracle and MySQL. These are very full featured, but will not scale easily. And the reason that they won’t scale is that certain features they support, such as joins between tables and transactions, are not easy to run in parallel across multiple computers. MongoDB strives to sit at the knee of the curve, achieving nearly as much scalability as key value stores, while only giving up the features that prevent scaling. So, as I said, MongoDB does not support joins or transactions. But we have certain compensating features, mostly beyond the scope of this talk, that mitigate the impact of that design decision.
But we left a lot of good stuff in, including everything you see here. Ad hoc queries means that you can explore data from the shell using a query language (not SQL, though). Real time aggregation gives you much of the functionality offered by GROUP BY in SQL. We have a strong consistency model by default. What that means is that when you read data from the database, you read what you wrote. That sounds fairly obvious, but some systems give up that guarantee to gain greater availability of writes. We have geospatial queries, the ability to find things based on location. And as you will see, we support all popular programming languages and offer a flexible, dynamic schema.
In today’s talk, we are going to talk about building an app rather than build one in real time through a demo. This session is just not long enough to actually build an app, so the entire talk is a bit meta. Sorry about that! On the plus side, I can tell you that building an app in real time in front of an audience is something that nearly never goes well, so I am sparing you that. Step one as an app developer is to download MongoDB. You can go to mongodb.org or do what most people do and google for “download mongodb”. That’s going to take you to this page, where you can choose an appropriate build. Even-numbered releases like 2.2 and 2.4 are the stable releases. 64 bit is better than 32 bit. Don’t go into production with our 32 bit build. If you recall, we memory map the data files, and so a 32 bit architecture limits you to 2GB of data.
To get MongoDB started, you download the tarball, expand it, and cd to the bin directory. Create a data directory in the standard place, /data/db. Now start mongod running. That’s it.
The first thing you will want to do after that is start the mongo shell. The mongo shell is an interactive program that connects to MongoDB and lets you perform ad hoc queries against the database. Here you can see we have started the mongo shell. Then we insert our first document into a collection called test. That document has a single key called “text” and its value is “Welcome to MongoDB”. Right after inserting it, we query the test collection and print out every document in it. There is only one, just the one we created. Plus you can see there is a strange _id field that is now part of the document. We will talk more about that later, but the short explanation is that every document must have a unique _id value and if you don’t specify one, Mongo creates one for you.
Alright, let’s take a step back now and better understand what a document database is since that is so central to the workings of mongodb.
If you come from the world of relational, it’s useful to go through the different concepts in a relational database and think about how they map to MongoDB. In relational, you have a table, or perhaps a view on a table. In MongoDB, we have collections. In relational, a table holds rows. In MongoDB, a collection holds documents. Indexes are very similar in both technologies. In relational you can create compound indexes that include multiple columns. In MongoDB, you can create indexes that include multiple keys. Relational offers the concept of a join. In MongoDB, we don’t support joins, but you can “pre-join” your data by embedding documents as values. In relational, you have foreign keys; MongoDB has references between collections. In relational, you might talk about partitioning the database to scale. We refer to this as sharding.
Today we are going to build a blog. Now, building a blog is a bit cliché, but it’s an application that everyone understands. You create posts, and those posts have a title. People come and comment on the blog. And perhaps you categorize your blog posts using tags. This is a screenshot of the education blog, which is built on Tumblr. We won’t be building anything as full featured as Tumblr today, but it’s the same idea.
Ok, the first step in building an application to manage this blog is to think about what entities we need to model and maintain.
The entities for our blog will be users, and by users we mean the authors of the blog posts. We will let people comment anonymously. We also have comments and tags.
In a relational solution, we would probably start by doing schema design. We would build out an entity relationship diagram.
Here is the entity relationship diagram for a small blogging system. Each of these boxes represents a relational table. You might have a user table, a posts table, which holds the blog posts, a tag table, and so on. In all, if you count the tables used to relate these tables, there would be 5 tables. Let’s look at the posts table. For each post you would assign a post_id. When a comment comes in, you would put it in the comments table and also store the post_id. The post_tags table relates a post and its tags. Posts and tags have a many-to-many relationship. To display the front page of the blog you would need to access every table.
In mongodb this process is turned on its head. We would do some basic planning on the collections and the document structure that would be typical in each collection and then immediately get to work on the application, letting the schema evolve over time. In mongo, every document in a collection does not need to have the same schema, although usually documents in the same collection have very nearly the same schema by convention.
In MongoDB, you might model this blogging system with only two collections: a user collection that would hold information about the users in the system, and an article collection that would embed comments, tags and category information right in the article. This way of representing the data has the benefit of being a lot closer to the way you model it within most object oriented programming languages. Rather than having to select from multiple tables to reconstruct an article, you can do a single query. Let’s pause and look at this closely. The idea of having an array of comments within a blog post is a fundamental departure from relational. There are some nice properties to this. First, when I fetch the post I get every piece of information I need to display it. Second, it’s fast. Disks are slow to seek but once you seek to a location, they have a lot of throughput.
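As a rough illustration of the embedded model, here is a plain JavaScript sketch that needs no database. The commenter and the field names inside each comment (`name`, `comment`) are hypothetical, chosen only for illustration, and `renderPost` is a made-up helper.

```javascript
// One post document with its tags and comments embedded.
// The comment entry and its field names are illustrative assumptions.
const post = {
  title: "Hello World",
  body: "This is my first blog post",
  username: "ngraham",
  tags: ["adventure", "mongodb"],
  comments: [
    { name: "anonymous", comment: "Nice post" }
  ]
};

// Everything needed to display the post comes from this one object;
// there is no lookup in separate comments or tags tables.
function renderPost(doc) {
  const lines = [
    doc.title + " by " + doc.username,
    doc.body,
    "Tags: " + doc.tags.join(", ")
  ];
  for (const c of doc.comments) {
    lines.push(c.name + " said: " + c.comment);
  }
  return lines.join("\n");
}

console.log(renderPost(post));
```

The point of the sketch is only that a single fetched document carries the whole article, which is what makes the single-query read possible.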
Ok, so how many of you by a show of hands are reasonably well acquainted with JavaScript? Ok, looks like most of you. Based on my market research, I am going to assume that blog users are generally familiar with JavaScript and build this application in the mongo shell. All blog readers and authors will have to use the shell to write posts, read the blog and comment on the blog. I am only kidding, of course. In a real application, we would use an application server, have a front end and program in the language of our choice. But in the interest of time, we are going to use JavaScript inside the mongo shell to get started today.
Now I would like to turn to working with data within MongoDB for our blog application.
Here is what a document looks like in JavaScript Object Notation, or JSON. Note that the document begins with an opening curly brace, and then has a sequence of key/value pairs. The keys are not protected by quotes; quotes around keys are optional within the shell. The values are protected by quotes because in this case we have specified strings. I am inserting myself into the users collection. This is the mongo shell, which is a full JavaScript interpreter, so I have assigned the JSON document to a variable.
To insert the user into MongoDB, we type db.users.insert(user) in the shell. This will create a new document in the users collection. Note that I did not create the collection before I used it. MongoDB will automatically create the collection when I first use it.
If we want to retrieve that document from MongoDB using the shell, we would issue the findOne command. The findOne command without any parameters will find any one document within the collection, and you can’t specify which one. But in this case, we only have one document within the collection and hence get the one we inserted. Note that the document now has an _id field. Let’s talk more about that. Every document must have an _id key. If you don’t specify one, the database will insert one for you. The default type is an ObjectId, which is a 12 byte field containing information about the time, a sequence number, which client machine and which process made the request. The _id inserted by the database is guaranteed to be unique. You can specify your own _id if you desire.
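A minimal sketch of such a document in plain JavaScript, assuming the same three fields that the findOne output shows for this user. The second variable, `sameUser`, is hypothetical and exists only to show that quoted and unquoted keys are equivalent.

```javascript
// Keys written without quotes, as in the shell example.
const user = {
  username: "ngraham",
  first_name: "Norman",
  last_name: "Graham"
};

// The same document with quoted keys; JavaScript treats both forms alike.
const sameUser = {
  "username": "ngraham",
  "first_name": "Norman",
  "last_name": "Graham"
};

// Strict JSON text always quotes its keys, as JSON.stringify shows.
const asJson = JSON.stringify(user);
console.log(asJson);
console.log(asJson === JSON.stringify(sameUser));
```

Both objects serialize to identical JSON text, so the choice of quoting keys in the shell is purely a matter of style.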
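To make the ObjectId structure concrete, here is a sketch in plain JavaScript that splits the _id from the users example into its parts. It assumes the layout used at the time of this talk: a 4 byte timestamp, a 3 byte machine identifier, a 2 byte process ID and a 3 byte counter. Later MongoDB versions may describe the middle bytes differently, and `parseObjectId` is a hypothetical helper, not part of any driver.

```javascript
// Split a 24-character hex ObjectId string into four fields,
// assuming the 4 + 3 + 2 + 3 byte layout (two hex characters per byte).
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return {
    timestamp: new Date(seconds * 1000), // seconds since the Unix epoch
    machine: hex.slice(8, 14),           // kept as hex, it is an opaque identifier
    processId: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16)
  };
}

const parts = parseObjectId("50804d0bd94ccab2da652599");
console.log(parts.timestamp.toISOString());
console.log(parts.machine, parts.processId, parts.counter);
```

For this particular _id the timestamp decodes to 18 October 2012, which is the kind of debugging hint the embedded time makes possible.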
Every document within a collection must have an _id. The _id is the primary key within the collection, and the _id of a document must be unique within the collection. In the case we showed, we did not specify an _id and hence the shell created one for us and inserted it into the database. ObjectId is the type used if the driver creates the _id. You can use any immutable value as the _id. In the case of the users collection, if we were ok with a username being immutable, we could have used that.
"50804d0bd94ccab2da652599" is a 24 character string (the 12 byte ObjectId, hex encoded). The actual representation on disk is 12 bytes. The first four bytes represent a timestamp. This is useful for debugging since you can tell the order in which documents were inserted. It’s only a rough approximation of true insertion order because these _id values are generated at the client rather than at the MongoDB server. The next three bytes identify a client machine uniquely. Typically, it’s a hash of the hostname. The next two bytes represent the process ID that issued the write. Finally, there are 3 bytes of data used to distinguish between inserts done by the same process on the same client machine. We are building our application in the shell, so it’s our client in this case.
Now it’s time to put our first blog post into the blogging system. For our first post, we are going to say “Hello World”. Note that because we are in the mongo shell, which supports JavaScript, I can create a new Date object to record the date of the post. We’ve done this without specifying previously what the posts collection is going to look like. This is true agile development. If we decide later to start tracking the IP address where the blog post came from, we can do that for new posts without fixing up the existing data. Of course, our application needs to understand that some keys may not be in all documents.
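Here is a small sketch, in plain JavaScript, of what tolerating a missing key can look like in application code. The `ip` key, the second post and the `describeOrigin` helper are all hypothetical, introduced only to illustrate the point.

```javascript
// Two posts from the same collection: the newer one tracks an IP
// address (hypothetical `ip` key), the older one predates that change.
const posts = [
  { title: "Hello World", username: "ngraham" },
  { title: "Second post", username: "ngraham", ip: "203.0.113.7" }
];

// Application code falls back to a default when the key is absent,
// so the older documents never need to be rewritten.
function describeOrigin(post) {
  return post.ip !== undefined ? post.ip : "unknown";
}

const origins = posts.map(describeOrigin);
console.log(origins);
```

The cost of the flexible schema is exactly this kind of defensive read; the benefit is that no migration is required when a new key is introduced.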
Once we insert a post, the first thing we want to do is find it and make sure it’s there. There is only one post in the collection today, so finding it is easy. We use the find command and append .pretty() on the end so that the mongo shell prints the document in a way that is easy for humans to read. Note again that there is now an _id field for the blog post, ending in DB5A. I also inserted an empty comments array to make it easy to add comments later. But I could have left this out.
Now that we have a blog post inserted, let’s look at how we would query for all blog posts that have a particular tag. This query shows the syntax for querying posts that have the tag “adventure”. This illustrates two things: first, how to query by example and second, that you can reach into an array to find a match. In this case, the query is returning all documents whose tags array has an element that matches the string “adventure”.
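The matching rule can be modelled in a few lines of plain JavaScript. This is a simplified sketch of the semantics for equality on simple values only, not MongoDB’s implementation; the `matches` helper and the second post are hypothetical.

```javascript
// Simplified model of query by example: a field matches when it equals
// the wanted value, or when it is an array that contains the value.
function matches(doc, query) {
  return Object.keys(query).every(function (key) {
    const wanted = query[key];
    const actual = doc[key];
    return Array.isArray(actual)
      ? actual.includes(wanted)
      : actual === wanted;
  });
}

const posts = [
  { title: "Hello World", tags: ["adventure", "mongodb"] },
  { title: "Recipes", tags: ["cooking"] } // hypothetical second post
];

// Same shape as the query document {tags: 'adventure'} used in the shell.
const found = posts.filter(function (p) {
  return matches(p, { tags: "adventure" });
});
console.log(found.map(function (p) { return p.title; }));
```

Note that the query document names the array field directly; no special operator is needed to look inside the array for a single value.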
Now let’s add a comment to the blog post. This would probably happen through a web server. For example, the user might see a post and comment on it, submitting the comment. Let’s imagine that Steve Blank came to my blog and posted the comment “Awesome Post”. The application server would then update the particular blog post. This query shows the syntax for an update. We specify the blog post through its _id. The ObjectId that I am creating in the shell is a throw-away data structure so that I can represent this binary value from JavaScript. Next, I use the $push operator to append a document to the end of the comments array. There is a rich set of operators that start with $ within MongoDB. This update appends the comment document to the end of the comments array for the blog post in question.
Of course, the first thing we want to do is query to make sure our comment is in the post. We do this by using findOne, which returns one document, and specifying the document with the same _id. The comment is within the document, in yellow. Note that MongoDB has decided to move the comments key within the document. This makes the point that the exact order of keys within a document is not guaranteed.
If this were a real application, it would have a front end UI, and that UI would not have the user log in to the mongo shell.
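The effect of $push on a document can be sketched in plain JavaScript. This models the behaviour only and is not how the server implements it; `applyPush` is a hypothetical helper, and the comment’s field names (`name`, `comment`) are assumptions. In the shell, the corresponding call would presumably take the form db.posts.update({_id: ...}, {$push: {comments: {...}}}).

```javascript
// Model of $push: append a value to the named array field.
// If the field is absent, start from an empty array.
function applyPush(doc, pushSpec) {
  for (const field of Object.keys(pushSpec)) {
    if (doc[field] === undefined) {
      doc[field] = [];
    }
    doc[field].push(pushSpec[field]);
  }
  return doc;
}

const post = {
  title: "Hello World",
  username: "ngraham",
  comments: []
};

// Field names inside the comment are illustrative assumptions.
applyPush(post, {
  comments: { name: "Steve Blank", comment: "Awesome Post" }
});
console.log(JSON.stringify(post.comments));
```

Because the comment lands inside the post document itself, a later read of the post returns the comment without any further query.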
Real applications are not built using the mongo shell.
MongoDB offers a set of drivers that bridge between application languages and the mongodb server. We have bindings for over 12 languages.
There are drivers that are created and maintained by 10gen for most popular languages. You can find them at api.mongodb.org.
There are also drivers maintained by the community. Drivers connect to the MongoDB servers. They translate BSON into native types within the language. They also take care of maintaining connections to replica sets. The mongo shell that I showed you today is not a driver, but works like one in some ways. You install a driver by going to api.mongodb.org, clicking through the documentation for a driver and finding the recommended way to install it on your computer. For example, for the Python driver, you typically use pip to install it.
We’ve covered a lot of ground in a pretty short time but still have barely scratched the surface of how to build an application within mongodb. To learn more about mongodb, you can use the online documentation, which is a great resource.
There are also introductory talks that delve deeper into many of the topics we covered here today. At 11:05, Gary Murakami is discussing schema design within MongoDB.
Randall Hunt is discussing replication. Replication is the way we achieve high availability and fault tolerance within mongodb.
10gen CEO Max Schireson is talking about indexing at 12:40pm.
We still have a few minutes before the next session and I would be happy to answer questions.