Daniel Coupal: "At this point, you may be familiar with the design of MongoDB databases and collections. But what are the frequent patterns you may have to model?
This presentation builds on the knowledge of how to represent common relationships (1-1, 1-N, N-N) in MongoDB. Going further than relationships, it aims to identify a set of common patterns, much as the Gang of Four did for object-oriented design. Finally, it will guide you through the steps of modeling those patterns into MongoDB collections."
Advanced Schema Design Patterns
1. October 12, 2017 | Bespoke | San Francisco
#MDBlocal
Advanced Schema
Design Patterns
2.
{ "name": "Daniel Coupal",
"jobs_at_MongoDB": [
{ "job": "Senior Curriculum Engineer",
"from": new Date("2016-11") },
{ "job": "Senior Technical Service Engineer",
"from": new Date("2013-11") }
],
"previous_jobs": [
"Consultant",
"Developer",
"Manager Quality & Tools Team",
"Manager Software Team",
"Tools Developer"
],
"likes": [ "food", "beers", "movies", "MongoDB" ]
}
Who Am I?
3.
The "Gang of Four":
A design pattern systematically names, explains,
and evaluates an important and recurring design
in object-oriented systems
MongoDB systems can also be built using its own
patterns
Pattern
4.
• Enable teams to use a common methodology and vocabulary
when designing schemas for MongoDB
• Giving you the ability to model schemas using building blocks
• Less art and more methodology
Why this Talk?
5.
Ensure:
• Good performance
• Scalability
despite constraints ➡
• Hardware
• RAM faster than Disk
• Disk cheaper than RAM
• Network latency
• Reduce costs $$$
• Database Server
• Maximum size for a document
• Atomicity of a write
• Data set
• Size of data
Why do we Create Models?
6.
• Don’t over-design!
• Design for:
  • Performance
  • Scalability
  • Simplicity
However …
7.
WMDB -
World Movie Database
Any events, characters and
entities depicted in this
presentation are fictional.
Any resemblance or similarity to
reality is entirely coincidental
8.
WMDB -
World Movie Database
First iteration
3 collections:
A. movies
B. moviegoers
C. screenings
9.
Our mission, should we decide to accept it, is to
fix this solution, so it can perform well and scale.
As always, should I or anyone in the audience do
it without training, WMDB will disavow any
knowledge of our actions.
This tape will self-destruct in five seconds. Good
luck!
Mission Possible
10.
Categories of Patterns
• Frequency of Access
  • Subset ✓
  • Approximation ✓
• Grouping
  • Computed ✓
  • Overflow
  • Bucket
• Representation
  • Attribute ✓
  • Schema Versioning ✓
  • Document Versioning
  • Tree
  • Pre-Allocation
11.
{
title: "Moonlight",
...
release_USA: "2016/09/02",
release_Mexico: "2017/01/27",
release_France: "2017/02/01",
release_Festival_Mill_Valley:
"2017/10/10"
}
Would need the following indexes:
{ release_USA: 1 }
{ release_Mexico: 1 }
{ release_France: 1 }
...
{ release_Festival_Mill_Valley: 1 }
...
Issue #1: Big Documents, Many Fields
and Many Indexes
12.
Pattern #1: Attribute
{
title: "Moonlight",
...
release_USA: "2016/09/02",
release_Mexico: "2017/01/27",
release_France: "2017/02/01",
release_Festival_Mill_Valley:
"2017/10/10"
}
13.
Problem:
• Lots of similar fields
• Common characteristic to search across those fields together
• Fields present in only a small subset of documents
Use cases:
• Product attributes like ‘color’, ‘size’, ‘dimensions’, ...
• Release dates of a movie in different countries, festivals
Attribute Pattern
14.
Solution:
• Field pairs in an array
Benefits:
• Allows for a non-deterministic list of attributes
• Easy to index
{ "releases.location": 1, "releases.date": 1 }
• Easy to extend with a qualifier, for example:
{ descriptor: "price", qualifier: "euros", value: NumberDecimal("100.00") }
Attribute Pattern - Solution
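As a rough illustration of the "field pairs in an array" idea, the sketch below reshapes the Moonlight document from the earlier slide in plain JavaScript. The helper names `toAttributePattern` and `releasedOn` are hypothetical, and the in-memory filter only stands in for a database query; in the mongo shell, a comparable search would typically use `$elemMatch` on `releases` so that both conditions apply to the same array element.

```javascript
// Hypothetical helper: move every flat release_<location> field into a
// single "releases" array of { location, date } pairs.
function toAttributePattern(movie) {
  const rest = {};
  const releases = [];
  for (const [key, value] of Object.entries(movie)) {
    if (key.startsWith("release_")) {
      releases.push({ location: key.slice("release_".length), date: value });
    } else {
      rest[key] = value;
    }
  }
  return { ...rest, releases };
}

const moonlight = toAttributePattern({
  title: "Moonlight",
  release_USA: "2016/09/02",
  release_Mexico: "2017/01/27",
  release_France: "2017/02/01",
  release_Festival_Mill_Valley: "2017/10/10",
});

// In-memory stand-in for a query on location and date together.
// Both conditions must match the same array element.
function releasedOn(movies, location, date) {
  return movies.filter((m) =>
    m.releases.some((r) => r.location === location && r.date === date)
  );
}

console.log(releasedOn([moonlight], "USA", "2016/09/02").map((m) => m.title));
```

With this shape, the single compound index on "releases.location" and "releases.date" shown above replaces one index per release field.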
15.
Possible solutions:
A. Reduce the size of your working set
B. Add more RAM per machine
C. Start sharding or add more shards
Issue #2: Working Set doesn’t fit in RAM
16.
WMDB -
World Movie Database
First iteration
3 collections:
A. movies
B. moviegoers
C. screenings
17.
In this example, we can:
• Limit the list of actors and
crew to 20
• Limit the embedded reviews
to the top 20
• …
Pattern #2: Subset
18.
Problem:
• There is a 1-N or N-N relationship, and only a few of the related
documents always need to be shown
• Only infrequently do you need to pull all of the dependent
documents
Use cases:
• Main actors of a movie
• List of reviews or comments
Subset Pattern
19.
Solution:
• Keep duplicates of a small subset of fields in the main collection
Benefits:
• Allows for fast data retrieval and a reduced working set size
• One query brings all the information needed for the "main page"
Subset Pattern - Solution
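A minimal sketch of the split, assuming a movie document that embeds its full cast: the main document keeps only the first 20 entries, while every entry is also written to a separate collection (the speaker notes call it "castandcrew"). The function name `applySubsetPattern`, the `movie_id` field, and the sample data are assumptions made for illustration, and the arrays stand in for real collections.

```javascript
// Hypothetical helper: keep a small subset embedded in the main document
// and emit one document per cast member for the full collection.
function applySubsetPattern(movie, limit) {
  const { cast, ...rest } = movie;
  const mainDoc = { ...rest, cast: cast.slice(0, limit) };
  const castAndCrewDocs = cast.map((member) => ({
    movie_id: movie._id,
    ...member,
  }));
  return { mainDoc, castAndCrewDocs };
}

// Sample movie with 25 cast members (made-up data).
const fullMovie = {
  _id: 1,
  title: "Moonlight",
  cast: Array.from({ length: 25 }, (_, i) => ({ name: `Person ${i + 1}` })),
};

const { mainDoc, castAndCrewDocs } = applySubsetPattern(fullMovie, 20);
console.log(mainDoc.cast.length, castAndCrewDocs.length);
```

The embedded subset is a duplicate of data that also lives in the full collection, so it needs one of the consistency strategies discussed on the next slide.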
21.
• How duplication is handled
A. Update both source and target in real time
B. Update target from source at regular intervals. Examples:
• Most popular items => update nightly
• Revenues from a movie => update every hour
• Last 10 reviews => update hourly? daily?
Aspect of Patterns: Consistency
23.
{
title: "Your Name",
...
viewings: 5000,
viewers: 385000,
revenues: 5074800
}
Issue #3: ..caused by repeated
calculations
24.
For example:
• Apply a sum, count, ...
• rollup data by minute, hour,
day
• As long as you don’t mess
with your source, you can
recreate the rollups
Pattern #3: Computed
25.
Problem:
• There is data that needs to be computed
• The same calculations would happen over and over
• Reads outnumber writes:
• example: 1K writes per hour vs 1M reads per hour
Use cases:
• Have revenues per movie showing, want to display sums
• Time series data, Event Sourcing
Computed Pattern
26.
Solution:
• Apply a computation or operation on data and store the result
Benefits:
• Avoid re-computing the same thing over and over
• Replaces a view
Computed Pattern - Solution
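The following sketch shows one way the rollup could look, assuming each screening document carries a `movie_id`, a `viewers` count, and a `revenue` amount (these field names and the sample numbers are made up). The totals are computed once from the source screenings and would then be stored on the movie document, rather than recalculated on every page view.

```javascript
// Hypothetical rollup: sum viewings, viewers and revenues per movie.
function computeRollup(screenings) {
  const totals = new Map();
  for (const s of screenings) {
    const t = totals.get(s.movie_id) || { viewings: 0, viewers: 0, revenues: 0 };
    t.viewings += 1;
    t.viewers += s.viewers;
    t.revenues += s.revenue;
    totals.set(s.movie_id, t);
  }
  return totals;
}

// Made-up source data standing in for the "screenings" collection.
const screeningDocs = [
  { movie_id: "your_name", viewers: 100, revenue: 1200 },
  { movie_id: "your_name", viewers: 80, revenue: 960 },
  { movie_id: "moonlight", viewers: 50, revenue: 600 },
];

const rollups = computeRollup(screeningDocs);
console.log(rollups.get("your_name"));
```

Because the screenings remain untouched, the stored totals can be rebuilt at any time by running the same computation again.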
27.
Issue #4: Lots of Writes
Web page counters
Updates on movie data
Screenings
Other
28.
Issue #4: … for non critical data
29.
• Only increment once in X
iterations
• Increment by X
Pattern #4: Approximation
30.
Problem:
• Data is difficult to calculate correctly
• May be too expensive to update the document every time to keep
an exact count
• No one gives a damn if the number is exact
Use cases:
• Population of a country
• Web site visits
Approximation Pattern
31.
Solution:
• Fewer, stronger writes
Benefits:
• Fewer writes, reducing contention on some documents
Approximation Pattern - Solution
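A small sketch of "only increment once in X iterations, increment by X", using the random-number variant. The factory `makeApproximateCounter` is a hypothetical name, and the `stored` variable merely stands in for a counter field that a real application would update in the database. The random source is injectable so the behavior can be checked deterministically.

```javascript
// Hypothetical approximate counter: roughly 1 hit in `step` triggers a
// write, and that write adds `step` to the stored count.
function makeApproximateCounter(step, random = Math.random) {
  let stored = 0; // stand-in for the counter field in the database
  let writes = 0; // number of writes actually performed
  return {
    hit() {
      // Math.floor(random() * step) is 0 with probability 1/step.
      if (Math.floor(random() * step) === 0) {
        stored += step;
        writes += 1;
      }
    },
    get value() {
      return stored;
    },
    get writes() {
      return writes;
    },
  };
}

const pageViews = makeApproximateCounter(1000);
for (let i = 0; i < 1000000; i++) {
  pageViews.hit();
}
// The stored value should land near 1,000,000 after about 1,000 writes.
console.log(pageViews.value, pageViews.writes);
```

The result varies from run to run, which is the trade-off this pattern accepts: an approximate count in exchange for far fewer writes.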
32.
• Keeping track of the schema version of a document
Issue #5: Need to change the list of fields
in the documents
33.
Add a field to track the
schema version number, per
document
Does not have to exist for
version 1
Pattern #5: Schema Versioning
34.
Problem:
• Updating the schema of a database is:
• Not atomic
• Long operation
• May not want to update all documents, only do it on updates
Use cases:
• Practically any database that will go to production
Schema Versioning Pattern
35.
Solution:
• Have a field keeping track of the schema version
Benefits:
• Don't need to update all the documents at once
• May not have to update documents until their next modification
Schema Versioning Pattern - Solution
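One possible shape for this, sketched under assumptions: a version-1 moviegoer document has a flat `phone` field, while version 2 stores a `contacts` array. Both shapes, the `schema_version` field name, and the `upgradeMoviegoer` helper are hypothetical. A document without the field is treated as version 1, and the upgrade is applied lazily when a document is read or next modified.

```javascript
const CURRENT_SCHEMA_VERSION = 2;

// Hypothetical lazy migration: upgrade a document only when it is touched.
function upgradeMoviegoer(doc) {
  const version = doc.schema_version || 1; // no field means version 1
  if (version >= CURRENT_SCHEMA_VERSION) {
    return doc; // already in the current shape
  }
  const { phone, ...rest } = doc;
  return {
    ...rest,
    contacts: phone ? [{ type: "phone", value: phone }] : [],
    schema_version: CURRENT_SCHEMA_VERSION,
  };
}

const oldDoc = { name: "Ana", phone: "555-0100" };
const newDoc = upgradeMoviegoer(oldDoc);
console.log(newDoc);
```

The application can handle both shapes during the transition, so there is no need for one long, non-atomic update of every document.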
36.
• Bucket
• grouping documents together, to have fewer documents
• Document Versioning
• tracking of content changes in a document
• Outlier
• Avoid letting a few documents drive the design and impact performance for all
• Tree(s)
• Pre-allocation
Other Patterns
38.
• Simple grouping from tables to collections is not optimal
• Learn a common vocabulary for designing schemas with
MongoDB
• Use patterns as "plug-and-play" for your future designs
• Attribute
• Subset
• Computed
• Approximation
• Schema Versioning
Take Aways
39.
A full design example for a
given problem:
• E-commerce site
• Content Management System
• Social Networking
• Single view
• …
References for complete Solutions
40.
• More patterns in a follow up to this presentation
• MongoDB in-person training courses on Schema Design
• Upcoming Online course at
MongoDB University:
• https://university.mongodb.com
• M220 Data Modeling
How Can I Learn More About Schema
Design?
41.
daniel.coupal@mongodb.com
Thank You for
using MongoDB!
Editor's notes
Welcome
[Remember]
Beware of transitions, keep them smooth
[TODOs]
Add the page numbers
Drawing of a working set
Consider removing ":" in the slide titles
Consider changing "revenues" => revenue, in few slides
More on the value and use cases for each pattern
Previous Jobs, Order of likes, =>Gang of Four
I like Food, Beers and Movies … and MongoDB.
My inspiration for this talk comes from the "Gang of Four".
How many of you are familiar with the "Gang of Four"?
Building blocks, Some patterns, => Same for MongoDB
Basically the ones who wrote this book on "Design Patterns"
GOF are Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides
https://en.wikipedia.org/wiki/Design_Patterns
Key words are "Elements of Reusable Software"
Assemble their experience on designing and implementing software over the years
They found that a lot of the solutions were sharing some "patterns"
Examples of patterns from "Design Patterns"
Types: Creational (5), Structural (7), Behavioral (11)
Singleton (restrict the creation to a single object for a given class)
Observer (number of objects to see an event)
Command (user operation)
Decorator (embellishing a UI element)
Memento (ability to restore an object to a previous state)
…
So, they went and made a catalog of those "patterns".
The idea is to enable people who write software to share a common language and have building blocks for solutions.
10 Years, Vocabulary, Building Blocks, "Art", => Example
We use this content in our internal training. However, this is the first time we are presenting it at a conference, well… including the "data modeling" workshop we ran yesterday.
The goal is not to teach you about doing schema design.
I am expecting you to either have done some with MongoDB or with a Relational Database
My goal is to help you formalize the process of creating schemas for MongoDB and to help you work as a team by sharing visuals and vocabulary.
Performance & scalability, "air"
Before we get going, let's just answer why we create models.
In a perfect world, you don't really have to model.
I mean if everything is super fast and resources are abundant, you really don't care where and how data is stored
Every day I get up I don't make plans on how I will breathe air.
However if you go to space or under water, you will need a "design" that will let you get the amount of air you need.
Design is optional, cost of developer, 5 or 10 shards?
If performance is not an issue, meaning you have resources to spare, then you are likely to model for simplicity. The reason is that software engineers are very expensive. You may not think so, but your manager does.
If you need to shard the database, it is likely that performance is very important
Why use 10 shards if you can cut the number of operations (reads and writes) in half and do the same work with 5 shards?
Entities
In order to illustrate this talk, let's assume there is a site called the "World Movie Database".
This site is so popular that everyone goes there on Thursdays before the release of new movies and it crashes the site.
Then some people tried to migrate the site to a NoSQL database, MongoDB obviously.
Collections, grouping not optimal, =>accept challenge
This is the first attempt at moving the schema from Relational to MongoDB.
There are 3 collections: movies, moviegoers and screenings.
Simply grouping entities into collections is not optimal.
The solution using this design did not perform much better than the previous one.
This is still normalized. When you remove this restriction, duplication is fine, 1-1 relationships are fine.
You open the door to some important transformations.
Those will be our patterns.
[NOTE] Use "Sync Visibility" once you activate the color layer to also see it in the PNG file.
Perform & Scale, without training, disavow
Our goal, needless to say, is to fix this website before it meets the same fate as this tape recorder.
GoF, top 5 patterns in order,
We will use patterns, like the Gang of Four.
Most patterns can be grouped in 3 categories.
We will cover those patterns identified with check marks in this presentation.
Also, I will cover the patterns in order of importance, or so.
For the other ones, I will refer you to the slides of this presentation and subsequent content we will have on the subject.
How do I search on movies being released on a given date in the USA?
The same would apply to products you could see on an e-commerce site.
For example, clothes may have a size that is expressed as S, M, L, while for some other products like a laptop, size would be something like 13", 15"
If you noticed from my personal info, I did use that pattern.
That allowed me to list my jobs at MongoDB and associate them with a given date.
Inventory of things to insure
Polymorphic entities
Vehicles: submarine, car
"Adding a qualifier on the attribute" may be "currency"
Working set, imagine no more RAM
With everyone pounding on the WMDB site, it was observed that the working set does not fit in memory.
What can you do?
Looking at the design we see that we are putting all the actors and all reviews for a given movie in the main document
[TODO] Add a drawing showing what the working set is
The collection "castandcrew" contains all the actors, but also the producers, costume makers, stunts, etc.
For this pattern to be worth it, it has to have a fair amount of information left aside.
Top level information for a first page
If this is slow, you may not keep your users on the site
You want them to validate that this is what they want, then dig for more if needed
Let's take a pause there.
Don't go get popcorn, not yet, this is just an intermission from our pattern list.
[TODO] make this "intermission" more appealing
Let’s pause from our pattern list, and let’s examine a characteristic or aspect of some patterns.
As you may guess, people pay attention to the popularity of the movies.
So, metrics like "revenues" and "viewers" are really important.
In the current design, those numbers are calculated every time the page of a movie is displayed.
Let’s calculate those numbers once in a while and stick the results on the page instead.
Also refer to "Rolled up" as CQRS - Command Query Responsibility Segregation
According to Bryan, that sounds good at a Party.
Another thing observed with the current design is that trying to keep track of every page view on the site resulted in very poor performance. That was seen with both the MMAPv1 and WiredTiger (WT) storage engines.
In MMAPv1, you get a lot of threads waiting for the write lock.
With WT, you get a lot of write conflicts that need to be retried.
One solution is to record "good enough" numbers. No one cares whether the count is 100 million or 100 million and a few. What is the tolerance level here? Let’s assume 1000.
In this case, we let the application increase the page-view count by 1000, but only 1/1000th of the time. Statistically, we should get a result very close to the exact count while doing only 1/1000th of the writes.
To draw a parallel with movies: we never see a movie as a continuous image. It is made by displaying 24 static images per second, yet this is enough for our eyes not to see the discontinuities.
How do you do that? Let’s have the application run a (X mod 1000) operation, where X is a random number. If the result is 0, let’s update the counter by 1000.
You can have a counter. Once you reach the count, you do the write.
Or you can use a random generator and when you get a specific value, you do the write.
As you may guess, this simple pattern is also applicable to Relational databases.
… it is just that NoSQL people have more tricks to handle performance bottlenecks.
Let's face it, configuration management and databases usually don't work well together.
Databases tend to keep the "latest" state of your data, while "CM" systems remember everything. Those of you who have checked stupid mistakes into Git, ClearCase, etc. know what I am talking about.
For this pattern, we are keeping track of the shape of the document. We are not addressing keeping track of the different contents of the document itself. That other case is solved by the Document Versioning pattern.
Instead of using a "version" field, we could discover the version number based on fields
- A few million references would not even fit into an embedded array. And if they did, you would not want to construct a query by passing a million values to the $in operator.
We touched a little on the Bucket pattern when we looked at the Outlier one. The Bucket pattern lets you group X sub-documents into one document. When the bucket is full, you create another one.
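As a rough sketch of that bucketing rule, the code below groups reviews for a movie into bucket documents of at most `bucketSize` entries and opens a new bucket once the current one is full. The function `addToBuckets`, the field names, and the sample reviews are assumptions for illustration, and the array stands in for a collection of bucket documents.

```javascript
// Hypothetical bucketing: append to the first non-full bucket for the
// movie, or create a new bucket when none has room left.
function addToBuckets(buckets, movieId, review, bucketSize) {
  let open = buckets.find(
    (b) => b.movie_id === movieId && b.count < bucketSize
  );
  if (!open) {
    open = { movie_id: movieId, count: 0, reviews: [] };
    buckets.push(open);
  }
  open.reviews.push(review);
  open.count += 1;
  return buckets;
}

// Seven made-up reviews with a bucket size of 3 give buckets of 3, 3 and 1.
const reviewBuckets = [];
for (let i = 1; i <= 7; i++) {
  addToBuckets(reviewBuckets, "moonlight", { text: `review ${i}` }, 3);
}
console.log(reviewBuckets.map((b) => b.count));
```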
Pre-allocation is the case where you pre-create an array of cells so that reads and writes can easily access the elements. This is a very important pattern if you are using MMAPv1, as continuously growing an array can have a negative effect. With WiredTiger it is not as crucial, but it may make the application code simpler.
Trees are commonly represented with one node per document, where you can list the parent, the children, the ancestors, or a combination of those.
[TODO] I need another title!
Elliot and Dev went to the future to see if there are still people using relational databases there, so we can work on the missing features in our next release.
I think they are looking at their watch to see if it is time to come back… or wait, maybe they want me to hurry up, so I will wrap up the presentation…
We did use a fictional site, however all the patterns we used would also apply to "Internet of Things", "Single View", "E-commerce" solutions.
10 years, future data big or not square, becoming an expert
MongoDB celebrates 10 years … very soon.
We are able to identify patterns because we have seen a lot of models with MongoDB over those first 10 years. Those are "plug-and-play" elements that let you go faster in your designs.
We do believe MongoDB has a bright future.
Most data that could be put in a Relational Database is already there. We are left with:
Data that is "not square", meaning it does not fit well in square tables.
Large datasets
We believe the document model and the scalability of MongoDB are well suited to storing that data.
Ensure you are ready for the future by becoming an expert on MongoDB and how to model for it
My goal was to introduce you to patterns. However, if you want more complete solutions to common problems, there are a few good books out there. Let me point you to these two:
The Little Mongo DB Schema Design Book, by Christian Kvalheim
MongoDB Applied Design Patterns, by Rick Copeland
I am leaving you with where you can find more information about schema design
M220 is likely to be available in Q4 2017
Thank you for attending my presentation, and this conference, but above all:
Thank you for using MongoDB!