MongoDB Internals

MongoDB
https://www.mongodb.com/
Prutha Date (dprutha1@umbc.edu)
Siraj Memon (siraj1@umbc.edu)

Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References

What is MongoDB?
• MongoDB is a document-oriented NoSQL database.
• It provides a flexible, semi-structured schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open-source software.
• License: GNU Affero General Public License (AGPL) and Apache License
• MongoDB is a server process that runs on Linux, Windows and OS X. It can
run as either a 32-bit or a 64-bit application.
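
As a rough illustration of the flexible schema, the sketch below models two documents that could sit in the same collection while carrying different fields. The "users" collection, the field names, and the values are hypothetical and chosen only for illustration.

```javascript
// Two documents in the same (hypothetical) "users" collection.
// Nothing forces them to share the same set of fields.
const users = [
  { _id: 1, name: "Ada", email: "ada@example.com" },
  {
    _id: 2,
    name: "Grace",
    addresses: [
      { city: "Baltimore", zip: "21250" },
      { city: "Arlington", zip: "22201" },
    ],
  },
];

// Collect the distinct field names used across all documents.
function fieldNames(docs) {
  const names = new Set();
  for (const doc of docs) {
    for (const key of Object.keys(doc)) names.add(key);
  }
  return [...names].sort();
}

console.log(fieldNames(users));
```

The second document embeds an array of addresses directly, which a relational design would normally move into a separate table.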

When to use MongoDB?
“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous
collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes,
AOL)
• Messaging: vary message meta-data easily per message or message type without needing to maintain
separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB
(Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location based systems: makes use of Geospatial indices (Foursquare, City government of Chicago)

Terminologies – RDBMS vs MongoDB
*JSON – JavaScript Object Notation

Storage Internals - Directory Layout
By default, the data directory is /data/db.

To Sum Up: Internal File Format
• Files on disk are broken into extents which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially, up to 2 GB.
• Namespace entries in the ns (namespace) file point to the first extent
for that collection.
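
The growth pattern can be sketched as a capped geometric sequence. Only the 2 GB ceiling comes from the slide above; the 4 KB starting size and the doubling factor are assumptions picked for illustration, not MongoDB's actual allocation policy.

```javascript
const GB = 1024 * 1024 * 1024;

// Sizes of the first `count` extents of one collection.
// initialBytes and factor are illustrative assumptions; only the 2 GB cap
// is taken from the text.
function extentSizes(count, initialBytes = 4096, factor = 2, capBytes = 2 * GB) {
  const sizes = [];
  let size = initialBytes;
  for (let i = 0; i < count; i++) {
    sizes.push(size);
    // Grow geometrically, but never beyond the cap.
    size = Math.min(size * factor, capBytes);
  }
  return sizes;
}

const sizes = extentSizes(22);
console.log(sizes.map((s) => `${s / 1024} KB`).join(", "));
```

Under these assumptions the sequence reaches the cap after 20 extents, and every later extent stays at 2 GB.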

Storage Engine - MMAP (Memory Mapped)
• All data files are memory mapped to Virtual Memory by the
OS.
• MongoDB just reads / writes to RAM in the filesystem cache
• OS takes care of the rest!
• Virtual process size = total files size + overhead (connections,
heap)
• Uses memory-mapped files via the mmap() system call.

Storage Engine - WiredTiger
• Designed especially for Write-Intensive applications
• Document-level (record-level) locking
• Compression
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log Structured Merge (LSM) trees for very high
insert workloads

What makes MongoDB cool?
• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geo-Spatial Indexing

Sharding
• Horizontal scaling - divides the data set and distributes the data over
multiple servers, or shards.
• Used to support deployments with very large data sets and high
throughput operations.
• Sharded cluster components:
  • Shards – each a mongod instance or a replica set
  • Config servers – multiple mongod instances
  • Routing instances – multiple mongos instances
• Data is partitioned into chunks, each covering a range of shard key
values and bounded by a maximum chunk size.
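
The routing idea described above can be sketched as a lookup over a table of key ranges. The chunk boundaries, shard names, and function names below are hypothetical; this is a simplified model of what a mongos router does conceptually, not its implementation.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key
// and lives on one shard. Real clusters keep this metadata on the config servers.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Find the chunk whose range contains the shard key value and return the
// shard that owns it.
function routeByRange(chunkTable, key) {
  const chunk = chunkTable.find((c) => key >= c.min && key < c.max);
  if (!chunk) throw new Error(`no chunk covers key ${key}`);
  return chunk.shard;
}

// A range query touches every shard whose chunks overlap [low, high).
function shardsForRange(chunkTable, low, high) {
  const hit = chunkTable
    .filter((c) => c.min < high && c.max > low)
    .map((c) => c.shard);
  return [...new Set(hit)];
}

console.log(routeByRange(chunks, 1500));
console.log(shardsForRange(chunks, 900, 1100));
```

A query on a single shard key value is sent to one shard, while a range that crosses a chunk boundary fans out to each shard that owns part of the range.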

Choosing a Shard key
The choice of shard key affects:
• Distribution of reads and writes
  • Problem: uneven distribution of reads/writes across shards.
  • Solution – Hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution – Multi-tenant compound index
• The number of shards each query hits
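
To see why hashed ids help, the sketch below compares where 1000 monotonically increasing ids land under range-style placement versus hashed placement. FNV-1a is used purely as a stand-in hash and is an assumption of this example (MongoDB's hashed indexes use their own hash function); the split points and shard count are likewise hypothetical.

```javascript
// FNV-1a, used here only as a stand-in hash function.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force unsigned 32-bit
}

const SHARDS = 4;

// Range-style placement with hypothetical split points at 250, 500, 750.
function rangedShard(id) {
  if (id < 250) return 0;
  if (id < 500) return 1;
  if (id < 750) return 2;
  return 3;
}

// Hashed placement: the shard depends on the hash of the id, not its order.
function hashedShard(id) {
  return fnv1a(String(id)) % SHARDS;
}

function countPerShard(ids, place) {
  const counts = new Array(SHARDS).fill(0);
  for (const id of ids) counts[place(id)] += 1;
  return counts;
}

// 1000 new inserts with monotonically increasing ids (1000..1999).
const newIds = Array.from({ length: 1000 }, (_, i) => 1000 + i);
console.log("ranged:", countPerShard(newIds, rangedShard)); // every insert hits shard 3
console.log("hashed:", countPerShard(newIds, hashedShard)); // spread across all shards
```

With ranged placement every new id is larger than the last split point, so a single shard absorbs all the writes. Hashing removes that ordering, at the cost of turning range queries on the key into queries that hit many shards.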

Aggregation Framework
• Aggregation Pipeline
• Map-Reduce
• Single Purpose Aggregation Operations (deprecated in latest version)

Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation
tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single
collection through a pipeline.
• The pipeline transforms the documents into aggregated results, and is
accessed through the aggregate database command.
• Operators: $match, $project, $unwind, $sort, $limit
• The user chooses which stages to include and in what order.

Aggregation Pipeline - Example
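
A minimal sketch of a pipeline, not the example from the original slide. The "orders" documents and their fields are hypothetical, and the small in-memory evaluator only mimics a simplified subset of four stages so that the example runs without a server.

```javascript
// Hypothetical documents from an "orders" collection.
const orders = [
  { _id: 1, status: "shipped", total: 40, items: ["pen", "ink"] },
  { _id: 2, status: "pending", total: 15, items: ["paper"] },
  { _id: 3, status: "shipped", total: 90, items: ["desk"] },
  { _id: 4, status: "shipped", total: 25, items: ["lamp", "bulb"] },
];

// The pipeline is an ordered array of stages, each a one-key document.
const pipeline = [
  { $match: { status: "shipped" } },
  { $sort: { total: -1 } },
  { $limit: 2 },
  { $project: { _id: 1, total: 1 } },
];

// Toy in-memory versions of four stages: equality-only $match, single
// numeric field $sort, inclusion-only $project. Not MongoDB's engine.
const stages = {
  $match: (docs, spec) =>
    docs.filter((d) => Object.entries(spec).every(([k, v]) => d[k] === v)),
  $sort: (docs, spec) => {
    const [[field, dir]] = Object.entries(spec);
    return [...docs].sort((a, b) => (a[field] - b[field]) * dir);
  },
  $limit: (docs, n) => docs.slice(0, n),
  $project: (docs, spec) =>
    docs.map((d) =>
      Object.fromEntries(
        Object.keys(spec)
          .filter((k) => spec[k] && k in d)
          .map((k) => [k, d[k]])
      )
    ),
};

// Feed the output of each stage into the next one.
function runPipeline(docs, pipe) {
  return pipe.reduce((current, stage) => {
    const [[name, spec]] = Object.entries(stage);
    if (!stages[name]) throw new Error(`unsupported stage ${name}`);
    return stages[name](current, spec);
  }, docs);
}

console.log(runPipeline(orders, pipeline));
```

Against a real server, the same `pipeline` array would be passed to `db.orders.aggregate(pipeline)` in the mongo shell, assuming a collection named `orders` exists.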

Capped Collection
• A capped collection is a fixed-size collection.
• Create one with db.createCollection and mark it as capped:
• e.g. db.createCollection('logs', {capped: true, size: 2097152})
• When it reaches the size limit, the oldest documents are automatically
removed
• Guarantees preservation of the insertion order
• Maintains insertion order identical to the order on disk by prohibiting
updates that increase document size
• Allows the use of tailable cursor to retrieve documents
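
The eviction behaviour described above resembles a ring buffer. The class below is a simplified in-memory model under stated assumptions: the name `CappedLog` is hypothetical, and document size is approximated by JSON string length, whereas MongoDB measures BSON size.

```javascript
class CappedLog {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.usedBytes = 0;
    this.docs = []; // kept in insertion order
  }

  // Size is approximated by JSON length (an assumption of this sketch).
  static sizeOf(doc) {
    return JSON.stringify(doc).length;
  }

  insert(doc) {
    const size = CappedLog.sizeOf(doc);
    if (size > this.maxBytes) throw new Error("document larger than collection");
    // Evict the oldest documents until the new one fits.
    while (this.usedBytes + size > this.maxBytes) {
      const oldest = this.docs.shift();
      this.usedBytes -= CappedLog.sizeOf(oldest);
    }
    this.docs.push(doc);
    this.usedBytes += size;
  }

  // Documents come back in insertion order.
  find() {
    return [...this.docs];
  }
}

const log = new CappedLog(60);
for (let i = 1; i <= 5; i++) log.insert({ seq: i, msg: "event" });
console.log(log.find().map((d) => d.seq)); // [ 4, 5 ]
```

Each document here serializes to 23 characters, so a 60-byte cap holds two at a time and the three oldest are dropped.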

GridFS
• GridFS is a specification for storing and retrieving files that exceed
the BSON (binary JSON) document size limit of 16MB.
• Instead of storing a file in a single document, GridFS divides a file into
parts, or chunks, and stores each of those chunks as a separate
document.
• By default, GridFS limits chunk size to 255 KB.
• GridFS uses two collections to store files. One collection stores the file
chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16MB but also
for storing any files for which you want access without having to load
the entire file into memory.
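
The chunking scheme can be sketched as follows. The field names (`files_id`, `n`, `data`, `length`, `chunkSize`) follow the GridFS layout of a metadata document plus chunk documents; the function names, id value, and filename are hypothetical, and plain arrays stand in for the two collections.

```javascript
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 255 KB

// Split file bytes into one metadata document and a list of chunk documents.
function toGridFsDocs(fileId, filename, bytes, chunkSize = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    chunks.push({ files_id: fileId, n, data: bytes.subarray(offset, offset + chunkSize) });
  }
  const file = { _id: fileId, filename, length: bytes.length, chunkSize };
  return { file, chunks };
}

// Read bytes [start, end) by touching only the chunks that overlap the range.
// Assumes `chunks` is ordered by n and 0 <= start < end <= file.length.
function readRange(file, chunks, start, end) {
  const out = new Uint8Array(end - start);
  const first = Math.floor(start / file.chunkSize);
  const last = Math.floor((end - 1) / file.chunkSize);
  for (let n = first; n <= last; n++) {
    const chunkStart = n * file.chunkSize;
    const from = Math.max(start, chunkStart) - chunkStart;
    const to = Math.min(end, chunkStart + chunks[n].data.length) - chunkStart;
    out.set(chunks[n].data.subarray(from, to), chunkStart + from - start);
  }
  return out;
}

// A 600 KB file filled with a repeating byte pattern.
const bytes = Uint8Array.from({ length: 600 * 1024 }, (_, i) => i % 251);
const { file, chunks } = toGridFsDocs("file-1", "video.bin", bytes);
console.log(chunks.length, chunks[chunks.length - 1].data.length);
console.log(readRange(file, chunks, 261118, 261124));
```

The range read at the end crosses the boundary between the first and second chunks, so it touches two chunk documents and never looks at the third.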

GeoSpatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB
provides two special indexes:
• 2d indexes that use planar geometry when returning results.
• 2dsphere indexes that use spherical geometry to return results.
• Store location data as GeoJSON objects with this coordinate-axis
order: longitude, latitude.
• GeoJSON Object Supported: Point, LineString, Polygon, etc.
• Query Operations: Inclusion, Intersection, Proximity.
• You cannot use a geospatial index as the shard key index.
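
The coordinate order and the idea of a proximity query on a sphere can be sketched as below. The haversine formula and the 6371 km mean Earth radius are standard, but this is an illustration of spherical distance only, not MongoDB's index implementation; the place names and coordinates are made up.

```javascript
const EARTH_RADIUS_KM = 6371; // mean Earth radius

// GeoJSON points list longitude first, then latitude.
function point(longitude, latitude) {
  return { type: "Point", coordinates: [longitude, latitude] };
}

// Great-circle (haversine) distance between two GeoJSON points, in km.
function sphericalDistanceKm(a, b) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const [lon1, lat1] = a.coordinates;
  const [lon2, lat2] = b.coordinates;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Proximity query sketch: places within maxKm of a center, nearest first.
function near(places, center, maxKm) {
  return places
    .map((p) => ({ name: p.name, km: sphericalDistanceKm(p.location, center) }))
    .filter((p) => p.km <= maxKm)
    .sort((x, y) => x.km - y.km);
}

const places = [
  { name: "A", location: point(0, 1) }, // one degree north of the origin
  { name: "B", location: point(0.5, 0) }, // half a degree east of the origin
  { name: "C", location: point(10, 10) }, // far away
];
console.log(near(places, point(0, 0), 200));
```

A linear scan like this checks every place on each query; a geospatial index exists precisely to avoid that.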

Performance Analysis
• Yahoo! Cloud Serving Benchmark (YCSB)
• Throughput (ops/second)

WORKLOAD                                     Cassandra   Couchbase   MongoDB
50% read, 50% update                           134,839     106,638   160,719
95% read, 5% update                            144,455     187,798   196,498
50% read, 50% update (durability optimized)      6,289       1,236    31,864

Limitations
• Need to have enough memory to fit your working set into memory,
otherwise performance might suffer.
• Map-Reduce and aggregation jobs are single-threaded: each runs on one
thread per mongod.
• No joins across collections.
• 32-bit builds are limited to roughly 2 GB of data.
• Sharding has some unique exceptions. If you plan to shard your data,
you need to shard early as some things that are feasible on a single
server are not feasible on a sharded collection.

Conclusion
• MongoDB is a semi-structured document-oriented NoSQL Database.
• It has two storage engines: MMAP and WiredTiger
• Multiple aggregation frameworks: Aggregation Pipeline and Map-Reduce
• Support for GridFS, GeoSpatial Indexing, Capped Collection
• Higher throughput than Cassandra and Couchbase in the YCSB results above.
• On-going work – In-memory and HDFS support

References
• https://www.mongodb.com/presentations/storage-engine-internals
• http://docs.mongodb.org/manual/core/data-modeling-introduction/
• http://docs.mongodb.org/manual/core/aggregation-introduction/
• https://2013.nosql-matters.org/bcn/wp-content/uploads/2013/12/storage-talk-mongodb.pdf
• http://info-mongodb-com.s3.amazonaws.com/High Performance Benchmark White Paper final.pdf
• https://www.mongodb.com/collateral/mongodb-architecture-guide
• Book - MongoDB: The Definitive Guide by Kristina Chodorow and Michael Dirolf
