MongoDB is a document-oriented NoSQL database that uses flexible schemas and provides high performance, high availability, and easy scalability. It uses either the MMAP or the WiredTiger storage engine and supports features such as sharding, aggregation pipelines, geospatial indexing, and GridFS for large files. According to the benchmarks cited, MongoDB outperforms Cassandra and Couchbase, but it has limitations such as single-threaded aggregation and no joins across collections.
2. Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References
3. What is MongoDB?
• MongoDB is a NoSQL Document-Oriented database.
• It provides a flexible, semi-structured schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open source software.
• License: GNU Affero General Public License (AGPL) for the server and Apache License for the drivers
• MongoDB is a server process that runs on Linux, Windows and OS X. It can
run as either a 32-bit or a 64-bit application.
4. When to use MongoDB?
“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous
collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes,
AOL)
• Messaging: vary message meta-data easily per message or message type without needing to maintain
separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB
(Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location based systems: makes use of Geospatial indices (Foursquare, City government of Chicago)
10. To Sum Up: Internal File Format
• Files on disk are broken into extents which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially in size, up to 2 GB.
• Namespace entries in the ns (namespace) file point to the first extent
for that collection.
12. Storage Engine - MMAP (Memory Mapped)
• All data files are memory mapped to Virtual Memory by the
OS.
• MongoDB just reads / writes to RAM in the filesystem cache
• OS takes care of the rest!
• Virtual process size = total files size + overhead (connections,
heap)
• Uses memory-mapped files via the mmap() system call.
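As a rough illustration of the memory-mapping idea (this is not MongoDB's own code), the sketch below maps a small temporary file with Python's mmap module, writes to it as if it were ordinary memory, and then reads the bytes back from the file. The 4 KB file and its contents are made up for the example.

```python
import mmap
import os
import tempfile

def mmap_roundtrip() -> bytes:
    """Write to a file through a memory mapping, then read the bytes back from the file."""
    # Hypothetical 4 KB file standing in for a database data file.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * 4096)
        # Map the whole file (length 0 = entire file) into the process address space.
        with mmap.mmap(fd, 0) as mapped:
            # Writing to the mapping is an ordinary memory write ...
            mapped[0:5] = b"hello"
            # ... and flush() asks the OS to write the dirty page back to the file.
            mapped.flush()
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 5)
    finally:
        os.close(fd)
        os.remove(path)

print(mmap_roundtrip())
```

The application never issues an explicit write to the file for the payload; the operating system carries the change from the mapped page to disk, which is the division of labour the slide describes.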
13. Storage Engine - WiredTiger
• Designed especially for Write-Intensive applications
• Document-level (record-level) locking
• Compression
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log Structured Merge (LSM) trees for very high
insert workloads
14. What makes MongoDB cool?
• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geo-Spatial Indexing
15. Sharding
• Horizontal scaling - divides the data set and distributes the data over
multiple servers, or shards.
• Used to support deployments with very large data sets and high
throughput operations.
• Sharded Cluster Components –
• Shards – mongod instance or replica sets
• Config Server – Multiple mongod instances
• Routing Instances – Multiple mongos instances
• Data is partitioned into chunks, each covering a range of shard key
values and limited to a configurable maximum size.
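The range-based chunk lookup can be sketched as a binary search over split points. This is a toy model of the routing decision, not the mongos implementation; the split points, shard names, and chunk-to-shard assignment are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical split points on the shard key; they define four chunks:
# (-inf, 100), [100, 200), [200, 300), [300, +inf)
SPLIT_POINTS = [100, 200, 300]
# Hypothetical assignment of each chunk to a shard.
CHUNK_TO_SHARD = ["shardA", "shardB", "shardA", "shardC"]

def route(shard_key: int) -> str:
    """Return the shard that owns the chunk containing shard_key."""
    # Chunk ranges include their lower bound and exclude their upper bound.
    chunk_index = bisect_right(SPLIT_POINTS, shard_key)
    return CHUNK_TO_SHARD[chunk_index]

for key in (42, 100, 250, 999):
    print(key, "->", route(key))
```

A query that includes the shard key can be sent to a single shard this way, while a query without it has to be broadcast to every shard.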
17. Choosing a Shard key
The choice of shard key affects:
• Distribution of reads and writes
  • Uneven distribution of reads/writes across shards.
  • Solution – Hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution – Multi-tenant compound index
• The number of shards each query hits
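To see why hashed ids help with write distribution, here is a small simulation under stated assumptions: four shards, a 32-bit key space split into four equal ranges, and MD5 truncated to 32 bits as a stand-in hash. It is not meant to reproduce MongoDB's hashing or chunk balancing.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

N_SHARDS = 4
KEY_SPACE = 2 ** 32  # assumed size of the key space in this toy model
# Evenly spaced split points give each shard one range of the key space.
SPLITS = [KEY_SPACE * i // N_SHARDS for i in range(1, N_SHARDS)]

def shard_for(value: int) -> int:
    """Range-based placement: index of the range that contains value."""
    return bisect_right(SPLITS, value)

def hashed(key: int) -> int:
    """Stand-in hash (MD5 truncated to 32 bits), used only for illustration."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Monotonically increasing ids, such as counters or timestamps.
ids = range(3_500_000_000, 3_500_001_000)

by_range = Counter(shard_for(i) for i in ids)
by_hash = Counter(shard_for(hashed(i)) for i in ids)

print("range on raw id:   ", dict(by_range))
print("range on hashed id:", dict(by_hash))
```

With the raw ids every insert lands in the last range, so one shard takes all the writes. Hashing the id first spreads the same inserts over all four shards, at the cost of turning range queries on the id into scatter-gather queries.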
19. Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation
tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single
collection through a pipeline.
• The pipeline transforms the documents into aggregated results, and is
accessed through the aggregate database command.
• Operators (pipeline stages): $match, $project, $unwind, $sort, $limit
• The user chooses which stages to apply and in what order.
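The pipeline idea can be sketched with a tiny in-memory interpreter for the five stages listed above. This is a simplified model, not MongoDB's engine: $match handles equality only, $project handles field inclusion only, $sort takes a single key, and the posts collection is hypothetical. The pipeline itself is written as a list of stage documents, the same shape the aggregate command takes.

```python
from typing import Any, Dict, Iterable, List

Doc = Dict[str, Any]

def run_pipeline(docs: Iterable[Doc], pipeline: List[Dict[str, Any]]) -> List[Doc]:
    """Pass documents through each stage in order (toy subset of stage semantics)."""
    current = list(docs)
    for stage in pipeline:
        (op, arg), = stage.items()  # each stage document holds exactly one operator
        if op == "$match":      # keep docs whose fields equal the given values
            current = [d for d in current if all(d.get(k) == v for k, v in arg.items())]
        elif op == "$project":  # keep only the fields flagged with 1
            current = [{k: d[k] for k in arg if arg[k] and k in d} for d in current]
        elif op == "$unwind":   # one output doc per element of the array field
            field = arg.lstrip("$")
            current = [{**d, field: item} for d in current for item in d.get(field, [])]
        elif op == "$sort":     # single-key sort; 1 = ascending, -1 = descending
            (key, direction), = arg.items()
            current = sorted(current, key=lambda d: d[key], reverse=direction == -1)
        elif op == "$limit":    # keep the first N docs
            current = current[:arg]
        else:
            raise ValueError(f"unsupported stage: {op}")
    return current

# Hypothetical blog posts collection.
posts = [
    {"author": "ann", "title": "A", "tags": ["db", "nosql"], "votes": 5},
    {"author": "bob", "title": "B", "tags": ["web"], "votes": 9},
    {"author": "ann", "title": "C", "tags": ["nosql"], "votes": 7},
]

pipeline = [
    {"$match": {"author": "ann"}},
    {"$unwind": "$tags"},
    {"$sort": {"votes": -1}},
    {"$project": {"title": 1, "tags": 1}},
    {"$limit": 2},
]

for doc in run_pipeline(posts, pipeline):
    print(doc)
```

Placing $match first shrinks the working set before the more expensive stages run, which is why stage order is worth choosing deliberately.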
23. Capped Collection
• A capped collection is a fixed-size collection.
• Create one with the db.createCollection command and mark it as capped
• e.g. db.createCollection('logs', {capped: true, size: 2097152})
• When it reaches the size limit, old documents are automatically
removed
• Guarantees preservation of the insertion order
• Maintains insertion order identical to the order on disk by prohibiting
updates that increase document size
• Allows the use of a tailable cursor to retrieve documents
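The eviction behaviour can be modelled with a byte budget and a queue. This is only a sketch of the semantics described above: the CappedLog class is hypothetical, JSON stands in for BSON, and the 60-byte budget is chosen so that the effect is visible with a handful of documents.

```python
import json
from collections import deque

class CappedLog:
    """Toy capped collection: a byte budget, with the oldest documents evicted first."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self.used = 0
        self.docs = deque()  # encoded documents, kept in insertion order

    def insert(self, doc: dict) -> None:
        encoded = json.dumps(doc).encode()  # JSON stands in for BSON here
        if len(encoded) > self.size_bytes:
            raise ValueError("document larger than the collection")
        # Evict from the front (oldest) until the new document fits.
        while self.used + len(encoded) > self.size_bytes:
            self.used -= len(self.docs.popleft())
        self.docs.append(encoded)
        self.used += len(encoded)

    def find(self) -> list:
        """Return documents in insertion order."""
        return [json.loads(raw) for raw in self.docs]

log = CappedLog(size_bytes=60)
for i in range(5):
    log.insert({"seq": i, "msg": "ping"})
print(log.find())
```

Only the most recent documents that fit in the budget survive, and they come back in the order they were inserted, which is what makes this structure a natural fit for logs.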
24. GridFS
• GridFS is a specification for storing and retrieving files that exceed
the BSON (binary JSON) document size limit of 16MB.
• Instead of storing a file in a single document, GridFS divides a file into
parts, or chunks, and stores each of those chunks as a separate
document.
• By default GridFS limits chunk size to 255 KB.
• GridFS uses two collections to store files. One collection stores the file
chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16MB but also
for storing any files for which you want access without having to load
the entire file into memory.
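The chunking scheme can be sketched without a database. The helper functions below are hypothetical; the documents they build use the field names of the GridFS files and chunks collections (files_id, n, data, length, chunkSize), and the 600 KB payload is made-up data.

```python
import os

CHUNK_SIZE = 255 * 1024  # the default chunk size mentioned above

def split_file(file_id: str, filename: str, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (metadata_doc, chunk_docs) shaped like the files and chunks collections."""
    chunks = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    metadata = {"_id": file_id, "filename": filename,
                "length": len(data), "chunkSize": chunk_size}
    return metadata, chunks

def read_range(chunks, chunk_size: int, offset: int, size: int) -> bytes:
    """Read size bytes at offset, touching only the chunks that overlap the range."""
    first, last = offset // chunk_size, (offset + size - 1) // chunk_size
    needed = sorted((c for c in chunks if first <= c["n"] <= last), key=lambda c: c["n"])
    joined = b"".join(c["data"] for c in needed)
    start = offset - first * chunk_size
    return joined[start:start + size]

payload = os.urandom(600 * 1024)  # hypothetical 600 KB file
meta, parts = split_file("file-1", "video.bin", payload)
print(meta["length"], len(parts), [len(p["data"]) for p in parts])
```

Because each chunk carries its sequence number n, a reader can fetch just the chunks covering a byte range, which is how a file can be accessed without loading all of it into memory.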
25. GeoSpatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB
provides two special indexes:
• 2d indexes, which use planar geometry when returning results.
• 2dsphere indexes, which use spherical geometry when returning results.
• Store location data as GeoJSON objects with this coordinate-axis
order: longitude, latitude.
• GeoJSON Object Supported: Point, LineString, Polygon, etc.
• Query Operations: Inclusion, Intersection, Proximity.
• You cannot use a geospatial index as the shard key index.
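The difference between planar and spherical geometry can be shown with two distance functions. This is a plain-Python sketch rather than MongoDB's index code; it assumes a spherical Earth with a mean radius of 6371 km, and the four points are invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def planar_distance(a, b) -> float:
    """Euclidean distance in degrees, treating [lon, lat] as flat x/y coordinates."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def spherical_distance_km(a, b) -> float:
    """Great-circle (haversine) distance in km between two [lon, lat] pairs."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# GeoJSON points list longitude first, then latitude.
equator_a = {"type": "Point", "coordinates": [0.0, 0.0]}
equator_b = {"type": "Point", "coordinates": [10.0, 0.0]}
north_a = {"type": "Point", "coordinates": [0.0, 60.0]}
north_b = {"type": "Point", "coordinates": [10.0, 60.0]}

for p, q in ((equator_a, equator_b), (north_a, north_b)):
    a, b = p["coordinates"], q["coordinates"]
    print(planar_distance(a, b), round(spherical_distance_km(a, b), 1))
```

Both pairs are 10 degrees of longitude apart, so the planar distance is identical, yet on the sphere the pair at 60° north is only about half as far apart as the pair on the equator. That gap is why proximity queries over real-world locations generally want spherical geometry.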
27. Limitations
• Need to have enough memory to fit your working set into memory,
otherwise performance might suffer.
• MapReduce and Aggregation are single-threaded. To be more specific,
each runs on one thread per mongod.
• No joins across collections.
• On 32-bit builds, data size is limited to about 2.5 GB.
• Sharding comes with its own restrictions. If you plan to shard your data,
you need to shard early, as some things that are feasible on a single
server are not feasible on a sharded collection.
28. Conclusion
• MongoDB is a semi-structured document-oriented NoSQL Database.
• It has two storage engines: MMAP and WiredTiger
• Multiple Aggregation Frameworks: Aggregation Pipeline and Map-Reduce
• Support for GridFS, GeoSpatial Indexing, Capped Collection
• Better performance than Cassandra and Couchbase in the benchmarks cited.
• On-going work – In-memory and HDFS support