MongoDB at Sailthru: Scaling and Schema Design
                                Ian White
                                @eonwhite
                               NoSQL Now!
                                 8/25/11




Sunday, August 7, 2011
Sailthru
                    • API-based transactional email led to...
                    • Mass campaign email led to...
                    • Intelligence and user behavior
                    • Three engineers built the ESP we always
                         wanted to use
                    • Some Clients: Huffpo-AOL, Thrillist,
                         Refinery 29, Flavorpill, Business Insider, Fab,
                         Totsy, New York Observer
How We Got To MongoDB from SQL
                    • JSON was part of Sailthru infrastructure
                         from start (SQL columns and S3)
                    • Kept a close eye on CouchDB project
                    • MongoDB felt like natural fit
                    • Used for user profiles and analytics initially
                    • Migrated one table at a time (very, very
                         carefully)


Sailthru Architecture
                    • User interface to display stats, build
                         campaigns and templates, etc (PHP/EC2)
                    • API, link rewriting, and onsite endpoints
                         (PHP/EC2)
                    • Core mailer engine (Java/EC2 and colo)
                    • Modified-postfix SMTP servers (colo)
                    • 11 database servers on EC2 (for now)
MongoDB Overview

                    • 11 instances on EC2 (5 two-member
                         replica sets, 1 backup server)
                    • About 40 collections
                    • About 1TB
                    • Largest single collection is 500m docs

Users are Documents

                    • Users aren’t records split among multiple
                         tables
                    • End user’s lists, clickstream interests,
                         geolocation, browser, time of day, and purchase
                         history become one ever-growing
                         document
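
As a rough illustration of the one-document-per-user idea, a profile might look something like the sketch below. Every field name and value here is hypothetical and is not Sailthru's actual schema; the point is only that one read returns everything known about the user.

```javascript
// Hypothetical user profile document: all field names are illustrative only.
const profile = {
  _id: 'user@example.com',
  lists: ['daily_newsletter', 'weekly_digest'],
  interests: { black: 12, green: 3 },        // clickstream interest counts
  geo: { city: { 'New York, NY US': 41 } },  // observed locations
  browser: { Chrome: 30, Safari: 11 },
  purchases: [{ title: 'Dress', qty: 1, price: 4900 }],
};

// Everything about the user arrives in a single document, so a summary
// needs no joins or extra lookups.
function summarize(p) {
  const topInterest = Object.keys(p.interests)
    .sort((a, b) => p.interests[b] - p.interests[a])[0];
  return {
    listCount: p.lists.length,
    topInterest: topInterest,
    purchaseCount: p.purchases.length,
  };
}
```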


Profiles Accessible Everywhere
                    • Put abandoned shopping cart notifications
                         within a mass email
                     {if profile.purchase_incomplete}
                      <p>This is what’s in your cart:</p>
                      {foreach profile.purchase_incomplete.items as item}
                        {item.qty} <a href="{item.url}">{item.title}</a><br/>
                      {/foreach}
                     {/if}




Profiles Accessible Everywhere
                    • Show a section of content conditional on
                         the user’s location

                     {if profile.geo.city['New York, NY US']}
                       <div>Come to the New York Meetup on the 27th!</div>
                     {/if}




Profiles Accessible Everywhere
                    • Show different content depending on user
                            interests as measured by on-site behavior
                         {select}
                           {case horizon_interest('black,dark')}
                             <img src="http://example.com/dress-image-black.jpg" />
                           {/case}
                           {case horizon_interest('green')}
                             <img src="http://example.com/dress-image-green.jpg" />
                           {/case}
                           {case horizon_interest('purple,polka_dot,pattern')}
                             <img src="http://example.com/dress-image-polkadot.jpg" />
                           {/case}
                         {/select}



Profiles Accessible Everywhere
                    • Pick top content from a data feed based on
                            tags


                         {content = horizon_select(content,10)}

                         {foreach content as c}
                           <a href="{c.url}">{c.title}</a><br/>
                         {/foreach}




Other Advantages of MongoDB
                    • High performance
                    • Take any parameters from our clients
                    • Really flexible development
                    • Great for analytics (internal and external)
                    • No more downtime for schema migrations
                         or reindexing


How We Run mongod
                    •    mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal


                    • Don’t ever run without replication
                    • Don’t ever kill -9
                    • Don’t run without writing to a log
                    • Run behind a firewall
                    • Use journaling now that it’s there
                    • Use --rest, it’s handy
Separate DBs By Collections
                    • Lower-effort than auto-sharding
                    • Separate databases for different usage
                         patterns
                    • Consider consequences of database failure/
                         unavailability
                    • But make sure your backup and monitoring
                         strategy is prepared for multiple DBs


Our Five Replica Sets
                    • main: most of the stuff on the UI, lots of
                         small/medium collections
                    • horizon: realtime onsite browsing data
                    • profile: user profile data (60m user docs)
                    • message: last three months of emails
                    • archive: emails older than three months
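
One way to picture splitting databases by collection is a small routing table in application code. This is only a sketch: the replica set names come from the list above, but the collection names and the helper `replicaSetFor` are hypothetical.

```javascript
// Which replica set holds which collection. Replica set names are from the
// slide; the collection names are made up for illustration.
const REPLICA_SET_FOR = new Map([
  ['client', 'main'],
  ['template', 'main'],
  ['horizon.pageview', 'horizon'],
  ['profile', 'profile'],
  ['message.blast', 'message'],
  ['message.archive', 'archive'],
]);

// Look up the replica set for a collection and fail loudly on unknown names,
// so a typo never silently lands data on the wrong servers.
function replicaSetFor(collection) {
  const set = REPLICA_SET_FOR.get(collection);
  if (set === undefined) {
    throw new Error('No replica set configured for collection: ' + collection);
  }
  return set;
}
```

Failing on unknown collections is a design choice for this sketch; a default replica set would also work, but it hides configuration mistakes.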
Monitoring

                    • Some stuff to monitor: faults/sec, index
                         misses, % locked, queue size, load average
                    • We check basic status once a minute on all
                         database servers (SMS alerts if down) and send email
                         warnings on thresholds every 10 minutes
                    • We have been beta-testing 10gen’s MMS product
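
A threshold check along these lines could be sketched as follows. The metric names mirror the list above, but the threshold values and the shape of the `metrics` object are assumptions made for illustration, not Sailthru's actual monitoring code.

```javascript
// Hypothetical warning thresholds; tune these for your own workload.
const THRESHOLDS = {
  faultsPerSec: 100,
  indexMissRatio: 0.01,
  lockedPercent: 20,
  queueSize: 50,
  loadAverage: 8,
};

// Returns a list of human-readable warnings for one server's metrics sample.
// Metrics that are missing from the sample are simply skipped.
function checkThresholds(serverName, metrics, thresholds = THRESHOLDS) {
  const warnings = [];
  for (const [name, limit] of Object.entries(thresholds)) {
    const value = metrics[name];
    if (typeof value === 'number' && value > limit) {
      warnings.push(`${serverName}: ${name}=${value} exceeds ${limit}`);
    }
  }
  return warnings;
}
```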

Backups
                    • Used to use mongodump - don’t do that
                         anymore
                    • Have single node of each replica set on a
                         backup server
                    • Two-hour slave delay
                    • fsync/lock, freeze xfs file system, EBS
                         snapshot, unfreeze, unlock
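
The ordering of the snapshot steps matters, and so does undoing them if something fails midway. A minimal sketch of that ordering follows; the `ops` callbacks are hypothetical placeholders for the real fsync/lock, xfs freeze, and EBS snapshot commands, which are not shown here.

```javascript
// Runs the snapshot sequence in order. The `ops` functions are hypothetical
// callbacks supplied by the caller; this sketch only fixes the ordering and
// makes sure unfreeze and unlock still run if a later step throws.
function snapshotBackup(ops) {
  ops.fsyncLock();
  try {
    ops.freezeFilesystem();
    try {
      ops.ebsSnapshot();
    } finally {
      ops.unfreezeFilesystem();
    }
  } finally {
    ops.fsyncUnlock();
  }
}
```

The nested try/finally means a failed snapshot never leaves the file system frozen or the database locked.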


The Great EC2 EBS Outage Adventure
                    • We survived
                    • Most of our nodes unavailable for 2-4 days
                    • Were able to spin up new instances from
                         backup server, snapshots, and get
                         operational within hours
                    • Wasn’t fun

DESIGN




Develop Your Mental Model of MongoDB

                    • You don’t need to look at the internals
                    • But try to gain a working understanding of
                         how MongoDB operates, especially RAM
                         and indexes




Big-Picture Design Questions
                    • What is the data I want to store?
                    • How will I want to use that data later?
                    • How big will the data get?
                    • If the answers are “I don’t know yet”, guess
                         with your best YAGNI



“But premature optimization is evil”
                    • Knuth said that about code, which is
                         flexible and easy to optimize later
                    • Data is not as flexible as code
                    • So doing some planning for performance is
                         usually good when it comes to your data



Specific MongoDB Design Questions
                    • Embed vs top-level collection?
                    • Denormalize (double-store data)?
                    • How many/which indexes?
                    • Arrays vs hashes for embedding?
                    • Implicit schema (field names and types)
Short Field Names?
                    • Disk space: cheap
                    • RAM: not cheap
                    • Developer Time: expensive
                    • Err towards compact, readable fieldnames
                    • Might be worth writing a mapper
                    • Probably wish we’d used c instead of
                         client_id
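
A field-name mapper of the kind mentioned above could be sketched like this. The short keys in `FIELD_MAP` are hypothetical, and this version only renames top-level fields; embedded documents would need a recursive variant.

```javascript
// Hypothetical long-name -> short-key table.
const FIELD_MAP = { client_id: 'c', email: 'e', template: 't' };

// Reverse table, built once, for reading stored documents back.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys according to `table`; unknown keys pass through as-is.
function rename(doc, table) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
    out[mapped] = value;
  }
  return out;
}

// Code works with readable names; the database stores compact ones.
const toStored = (doc) => rename(doc, FIELD_MAP);
const fromStored = (doc) => rename(doc, REVERSE_MAP);
```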

Favor Human-Readable Foreign Keys
                    • DBRefs are a bit cumbersome
                    • Referencing by MongoId often means doing
                         extra lookups
                    • Build human-readable references to save
                         you doing lookups and manual joins



Example



                    • Store the Template and the Email as strings
                         on the message object
                    •    { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }


                    • No external reference lookups required
                    • The tradeoff is basically just disk space
Embed vs Top-Level Collections?
                    • Major question of MongoDB schema design
                    • If you can ask the question at all, you might
                         want to err on the side of embedding
                    • Don’t embed if the embedding could get
                         huge
                    • Don’t feel too bad about denormalizing by
                         embedding AND storing in a top-level
                         collection
Typical Properties of Top-Level Collections

                    • Independence: They don’t “belong”
                         conceptually to another collection
                    • Nouns: the building blocks of your system
                    • Easily referenceable and updatable


Embedding Pros
                    • Super-fast retrieval of document with
                         related data
                    • Atomic updates
                    • “Ownership” of embedded document is
                         obvious
                    • Usually maps well to code structures

Embedding Cons

                    • Harder to get at, do mass queries
                    • Does not size up infinitely, will hit 16MB
                         limit
                    • Hard to create references to embedded
                         object
                    • Limited ability to indexed-sort the
                         embedded objects

If You Think You Can Embed
                    • You probably should
                    • I take advantage of embedding in my
                         designs more often now than I did three
                         years ago
                    • It’s a gift MongoDB gives you in exchange
                         for giving up your joins



Design Example: User Permissions
                    • Users can have various broad permission
                         levels for any number of clients
                    • For example, user ‘ploki’ might have
                         permission level ‘admin’ for client 76 and
                         permission level ‘reports_only’ for client
                         450



How Will We Use This Data?

                    • Retrieve all clients for a given user
                    • Retrieve all users for a given client
                    • Retrieve a permission level for a given
                         client for a given user




How Will This Data Grow?

                    • In the medium term, it will stay small
                    • Number of clients and number of users can
                         both grow infinitely




Back in SQL-land

                    • There’s a fairly standard way to do it
                    • It’s a many-to-many relationship, so
                    • Use a join table (client_user)



Should We Use a New Top-Level Collection?
                         db.client.user.save( {
                           client_id: 76,
                           username: 'ploki',
                           permission: 'admin',
                         });
                         db.client.user.save( {
                           client_id: 450,
                           username: 'ploki',
                           permission: 'reports_only',
                         });

                         db.client.user.ensureIndex( { client_id: 1 } );
                         db.client.user.ensureIndex( { username: 1 } );

                         // get all users belonging to a client
                         db.client.user.find( { client_id: 76 } );

                         // get all clients a user has access to
                         db.client.user.find( { username: 'ploki' } );

                         // get the permission level for our current user on a given client
                         db.client.user.findOne( { username: user.name, client_id: 76 } );

Probably Not


                    • Only needed if we have lots of clients per
                         user AND lots of users per client
                    • This is a case where we can embed, so let’s
                         do so




Three Ways to Embed
                    • Object. Not good: can’t do a multikeys index
                         on the keys of a hash
                         'clients': {
                           '76': 'admin',
                           '450': 'reports_only',
                         },
                         index: ???
                    • Array of objects. Okay: but have to search
                         through array to find by _id on retrieved doc
                         'clients': [
                           {'_id': 76, 'access': 'admin'},
                           {'_id': 450, 'access': 'reports_only'}
                         ],
                         index: { 'clients._id': 1 }
                    • Array and object. Our approach: fields next to
                         each other alphabetically
                         'clients': [ 76, 450 ],
                         'clients_access': {
                           '76': 'admin',
                           '450': 'reports_only',
                         }
                         index: { clients: 1 }
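
To see how the "array and object" layout answers the three access patterns listed earlier, here is a small in-memory sketch. It uses plain JavaScript objects rather than a live database, the second user is hypothetical, and the function names are invented for illustration.

```javascript
// In-memory stand-ins for user documents shaped like the "array and object" option.
const users = [
  { _id: 'ploki', clients: [76, 450],
    clients_access: { '76': 'admin', '450': 'reports_only' } },
  { _id: 'example_user', clients: [76],          // hypothetical second user
    clients_access: { '76': 'reports_only' } },
];

// All clients for a user: read the array straight off the document.
function clientsForUser(user) {
  return user.clients;
}

// All users for a client. Against MongoDB this would be a query on
// { clients: clientId }, which the multikey index { clients: 1 } can serve.
function usersForClient(allUsers, clientId) {
  return allUsers.filter((u) => u.clients.includes(clientId)).map((u) => u._id);
}

// Permission level for one user on one client: a direct key lookup on the
// retrieved document, with no array scan.
function permissionFor(user, clientId) {
  const access = user.clients_access[String(clientId)];
  return access === undefined ? null : access;
}
```

The cost of this layout is that `clients` and `clients_access` must be kept in sync on every write.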



Indexes
                    • Index all highly frequent queries
                    • Do less-indexed queries only on
                         secondaries
                    • Reduce the size of indexes wherever you
                         can on big collections
                    • Don’t sweat the medium-sized collections,
                         focus on the big wins


Take Advantage of Multiple-Field Indexes
                    • Order matters
                    • If you have an index on { client_id: 1, email: 1 }
                    • Then you also have the { client_id: 1 } index “for free”
                    • But not { email: 1 }
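
The prefix rule can be expressed as a tiny check. This is a deliberate simplification for illustration (it ignores sorts, range predicates, and whatever else the query planner considers), and `isIndexPrefix` is an invented helper, not a MongoDB API.

```javascript
// Simplified prefix rule: a compound index can serve a query on a set of
// fields when those fields are exactly the first k fields of the index.
function isIndexPrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  if (wanted.size === 0 || wanted.size > indexFields.length) {
    return false;
  }
  // The leading `wanted.size` index fields must all appear in the query.
  return indexFields.slice(0, wanted.size).every((field) => wanted.has(field));
}

const index = ['client_id', 'email'];
console.log(isIndexPrefix(index, ['client_id']));          // true
console.log(isIndexPrefix(index, ['client_id', 'email'])); // true
console.log(isIndexPrefix(index, ['email']));              // false
```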


Use your _id


                    • You must use an _id for every collection,
                         which will cost you index size
                    • So do something useful with _id


Take advantage of fast ^indexes
                    • Messages have _ids like: 32423.00000341
                    • Need all messages in blast 32423:
                    • db.message.blast.find(
                             { _id: /^32423\./ } );

                    •    (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
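
A sketch of building such _ids and the matching prefix pattern, assuming the _id is stored as a string and the blast id is numeric. The helpers `messageId` and `blastPrefix` are hypothetical. Escaping the dot matters: an unescaped `.` matches any character, so the pattern would also match ids from blast 324231.

```javascript
// Hypothetical helper: build a message _id from a blast id and a sequence
// number, zero-padded to eight digits as in the slide's example.
function messageId(blastId, seq) {
  return String(blastId) + '.' + String(seq).padStart(8, '0');
}

// Anchored prefix pattern for every message in a blast. The dot is escaped
// so it matches a literal "." rather than any character.
function blastPrefix(blastId) {
  return new RegExp('^' + String(blastId) + '\\.');
}

const id = messageId(32423, 341);
console.log(id);                                           // 32423.00000341
console.log(blastPrefix(32423).test(id));                  // true
console.log(blastPrefix(32423).test('324231.00000001'));   // false
```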




Manual Range Partitioning
                    • We moved a big message.blast collection
                         into per-day collections:
                    •    message.blast.20110605
                         message.blast.20110606
                         message.blast.20110607
                         etc...


                    • Keeps working set indexes smaller
                    • When we move data into the archive,
                         drop() is much faster than remove()
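
Picking the per-day collection can be a one-line naming function. The sketch below assumes days are bucketed in UTC and that years have four digits; both helper names are invented for illustration.

```javascript
// Name of the per-day collection for a given date, e.g. message.blast.20110605.
// Assumes UTC day boundaries.
function dailyCollectionName(base, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `${base}.${y}${m}${d}`;
}

// Per-day collections strictly older than `keepDays` days before `today`:
// candidates to copy into the archive and then drop(). YYYYMMDD suffixes
// sort lexicographically, so plain string comparison is enough.
function collectionsToArchive(names, base, today, keepDays) {
  const cutoff = new Date(today.getTime() - keepDays * 24 * 60 * 60 * 1000);
  const cutoffName = dailyCollectionName(base, cutoff);
  return names.filter((n) => n.startsWith(base + '.') && n < cutoffName);
}
```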


Questions?
                         Looking for a job?
                              ian@sailthru.com
                            twitter.com/eonwhite




Más contenido relacionado

Similar a MongoDB at Sailthru: Scaling and Schema Design

Deploying on the cutting edge
Deploying on the cutting edgeDeploying on the cutting edge
Deploying on the cutting edgeericholscher
 
Flowdock's full-text search with MongoDB
Flowdock's full-text search with MongoDBFlowdock's full-text search with MongoDB
Flowdock's full-text search with MongoDBFlowdock
 
JavaOne 2011 - Going Mobile With Java Based Technologies Today
JavaOne 2011 - Going Mobile With Java Based Technologies TodayJavaOne 2011 - Going Mobile With Java Based Technologies Today
JavaOne 2011 - Going Mobile With Java Based Technologies TodayWesley Hales
 
Mobile drupal: building a mobile theme
Mobile drupal: building a mobile themeMobile drupal: building a mobile theme
Mobile drupal: building a mobile themeJohn Albin Wilkins
 
[DCTPE2011] 7) Mobile Drupal(英/中雙語)
[DCTPE2011] 7) Mobile Drupal(英/中雙語)[DCTPE2011] 7) Mobile Drupal(英/中雙語)
[DCTPE2011] 7) Mobile Drupal(英/中雙語)Drupal Taiwan
 
Taking eZ Find beyond full-text search
Taking eZ Find beyond  full-text searchTaking eZ Find beyond  full-text search
Taking eZ Find beyond full-text searchPaul Borgermans
 
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Goikailan
 
Starting from scratch in 2017
Starting from scratch in 2017Starting from scratch in 2017
Starting from scratch in 2017Stefano Bonetta
 
Sneak Peek of Nuxeo 5.4
Sneak Peek of Nuxeo 5.4Sneak Peek of Nuxeo 5.4
Sneak Peek of Nuxeo 5.4Nuxeo
 
Full stack development using javascript what and why - ajay chandravadiya
Full stack development using javascript   what and why - ajay chandravadiyaFull stack development using javascript   what and why - ajay chandravadiya
Full stack development using javascript what and why - ajay chandravadiyaajayrcgmail
 
Javascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSJavascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSSylvain Zimmer
 
Top 8 Improvements in Drupal 8
Top 8 Improvements in Drupal 8Top 8 Improvements in Drupal 8
Top 8 Improvements in Drupal 8Angela Byron
 
Plone IDE - the future of Plone development
Plone IDE - the future of Plone developmentPlone IDE - the future of Plone development
Plone IDE - the future of Plone developmentMikko Ohtamaa
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayLuciano Resende
 
Why we love ArangoDB. The hunt for the right NosQL Database
Why we love ArangoDB. The hunt for the right NosQL DatabaseWhy we love ArangoDB. The hunt for the right NosQL Database
Why we love ArangoDB. The hunt for the right NosQL DatabaseAndreas Jung
 
Apcug 2011 07-17-intro_to_drupal_jeff_schuler
Apcug 2011 07-17-intro_to_drupal_jeff_schulerApcug 2011 07-17-intro_to_drupal_jeff_schuler
Apcug 2011 07-17-intro_to_drupal_jeff_schulerhewie
 

Similar a MongoDB at Sailthru: Scaling and Schema Design (20)

Deploying on the cutting edge
Deploying on the cutting edgeDeploying on the cutting edge
Deploying on the cutting edge
 
Flowdock's full-text search with MongoDB
Flowdock's full-text search with MongoDBFlowdock's full-text search with MongoDB
Flowdock's full-text search with MongoDB
 
JavaOne 2011 - Going Mobile With Java Based Technologies Today
JavaOne 2011 - Going Mobile With Java Based Technologies TodayJavaOne 2011 - Going Mobile With Java Based Technologies Today
JavaOne 2011 - Going Mobile With Java Based Technologies Today
 
Mobile drupal: building a mobile theme
Mobile drupal: building a mobile themeMobile drupal: building a mobile theme
Mobile drupal: building a mobile theme
 
[DCTPE2011] 7) Mobile Drupal(英/中雙語)
[DCTPE2011] 7) Mobile Drupal(英/中雙語)[DCTPE2011] 7) Mobile Drupal(英/中雙語)
[DCTPE2011] 7) Mobile Drupal(英/中雙語)
 
Node at artsy
Node at artsyNode at artsy
Node at artsy
 
Taking eZ Find beyond full-text search
Taking eZ Find beyond  full-text searchTaking eZ Find beyond  full-text search
Taking eZ Find beyond full-text search
 
App Engine Meetup
App Engine MeetupApp Engine Meetup
App Engine Meetup
 
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
2011 June - Singapore GTUG presentation. App Engine program update + intro to Go
 
Starting from scratch in 2017
Starting from scratch in 2017Starting from scratch in 2017
Starting from scratch in 2017
 
Sneak Peek of Nuxeo 5.4
Sneak Peek of Nuxeo 5.4Sneak Peek of Nuxeo 5.4
Sneak Peek of Nuxeo 5.4
 
Full stack development using javascript what and why - ajay chandravadiya
Full stack development using javascript   what and why - ajay chandravadiyaFull stack development using javascript   what and why - ajay chandravadiya
Full stack development using javascript what and why - ajay chandravadiya
 
Javascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSJavascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJS
 
eZ Publish nextgen
eZ Publish nextgeneZ Publish nextgen
eZ Publish nextgen
 
Top 8 Improvements in Drupal 8
Top 8 Improvements in Drupal 8Top 8 Improvements in Drupal 8
Top 8 Improvements in Drupal 8
 
Plone IDE - the future of Plone development
Plone IDE - the future of Plone developmentPlone IDE - the future of Plone development
Plone IDE - the future of Plone development
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
 
Why we love ArangoDB. The hunt for the right NosQL Database
Why we love ArangoDB. The hunt for the right NosQL DatabaseWhy we love ArangoDB. The hunt for the right NosQL Database
Why we love ArangoDB. The hunt for the right NosQL Database
 
Apcug 2011 07-17-intro_to_drupal_jeff_schuler
Apcug 2011 07-17-intro_to_drupal_jeff_schulerApcug 2011 07-17-intro_to_drupal_jeff_schuler
Apcug 2011 07-17-intro_to_drupal_jeff_schuler
 

Más de DATAVERSITY

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data LiteracyDATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for YouDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?DATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsDATAVERSITY
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best PracticesDATAVERSITY
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?DATAVERSITY
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesDATAVERSITY
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 

Más de DATAVERSITY (20)

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
MongoDB at Sailthru: Scaling and Schema Design

  • 1. MongoDB at Sailthru Scaling and Schema Design Ian White @eonwhite NoSQL Now! 8/25/11 Sunday, August 7, 2011
  • 2. Sailthru
      • API-based transactional email led to...
      • Mass campaign email led to...
      • Intelligence and user behavior
      • Three engineers built the ESP we always wanted to use
      • Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Fab, Totsy, New York Observer
  • 3. How We Got To MongoDB from SQL
      • JSON was part of Sailthru infrastructure from start (SQL columns and S3)
      • Kept a close eye on CouchDB project
      • MongoDB felt like natural fit
      • Used for user profiles and analytics initially
      • Migrated one table at a time (very, very carefully)
  • 4. Sailthru Architecture
      • User interface to display stats, build campaigns and templates, etc (PHP/EC2)
      • API, link rewriting, and onsite endpoints (PHP/EC2)
      • Core mailer engine (Java/EC2 and colo)
      • Modified-postfix SMTP servers (colo)
      • 11 database servers on EC2 (for now)
  • 5. MongoDB Overview
      • 13 instances on EC2 (6 two-member replica sets, 1 backup server)
      • About 40 collections
      • About 1TB
      • Largest single collection is 500m docs
  • 6. Users are Documents
      • Users aren’t records split among multiple tables
      • An end user’s lists, clickstream interests, geolocation, browser, time of day, and purchase history become one ever-growing document
  • 7. Profiles Accessible Everywhere
      • Put abandoned shopping cart notifications within a mass email
          {if profile.purchase_incomplete}
          <p>This is what's in your cart:</p>
          {foreach profile.purchase_incomplete.items as item}
          {item.qty} <a href="{item.url}">{item.title}</a><br/>
          {/foreach}
          {/if}
  • 8. Profiles Accessible Everywhere
      • Show a section of content conditional on the user’s location
          {if profile.geo.city['New York, NY US']}
          <div>Come to the New York Meetup on the 27th!</div>
          {/if}
  • 9. Profiles Accessible Everywhere
      • Show different content depending on user interests as measured by on-site behavior
          {select}
          {case horizon_interest('black,dark')}
          <img src="http://example.com/dress-image-black.jpg" />
          {/case}
          {case horizon_interest('green')}
          <img src="http://example.com/dress-image-green.jpg" />
          {/case}
          {case horizon_interest('purple,polka_dot,pattern')}
          <img src="http://example.com/dress-image-polkadot.jpg" />
          {/case}
          {/select}
  • 10. Profiles Accessible Everywhere
      • Pick top content from a data feed based on tags
          {content = horizon_select(content,10)}
          {foreach content as c}
          <a href="{c.url}">{c.title}</a><br/>
          {/foreach}
  • 11. Other Advantages of MongoDB
      • High performance
      • Take any parameters from our clients
      • Really flexible development
      • Great for analytics (internal and external)
      • No more downtime for schema migrations or reindexing
  • 12. How We Run mongod
      • mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
      • Don’t ever run without replication
      • Don’t ever kill -9
      • Don’t run without writing to a log
      • Run behind a firewall
      • Use journaling now that it’s there
      • Use --rest, it’s handy
  • 13. Separate DBs By Collections
      • Lower-effort than auto-sharding
      • Separate databases for different usage patterns
      • Consider consequences of database failure/unavailability
      • But make sure your backup and monitoring strategy is prepared for multiple DBs
  • 14. Our Five Replica Sets
      • main: most of the stuff on the UI, lots of small/medium collections
      • horizon: realtime onsite browsing data
      • profile: user profile data (60m user docs)
      • message: last three months of emails
      • archive: emails older than three months
  • 15. Monitoring
      • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average
      • We check basic status once per minute on all database servers (SMS alerts if down) and send email warnings on thresholds every 10 minutes
      • Have been beta-testing 10gen’s MMS product
  • 16. Backups
      • Used to use mongodump; don’t do that anymore
      • Have a single node of each replica set on a backup server
      • Two-hour slave delay
      • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
  • 17. The Great EC2 EBS Outage Adventure
      • We survived
      • Most of our nodes unavailable for 2-4 days
      • Were able to spin up new instances from the backup server and snapshots, and get operational within hours
      • Wasn’t fun
  • 19. Develop Your Mental Model of MongoDB
      • You don’t need to look at the internals
      • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  • 20. Big-Picture Design Questions
      • What is the data I want to store?
      • How will I want to use that data later?
      • How big will the data get?
      • If the answers are “I don’t know yet”, guess with your best YAGNI
  • 21. “But premature optimization is evil”
      • Knuth said that about code, which is flexible and easy to optimize later
      • Data is not as flexible as code
      • So doing some planning for performance is usually good when it comes to your data
  • 22. Specific MongoDB Design Questions
      • Embed vs top-level collection?
      • Denormalize (double-store data)?
      • How many/which indexes?
      • Arrays vs hashes for embedding?
      • Implicit schema (field names and types)
  • 23. Short Field Names?
      • Disk space: cheap
      • RAM: not cheap
      • Developer Time: expensive
      • Err towards compact, readable field names
      • Might be worth writing a mapper
      • Probably wish we’d used c instead of client_id
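The "mapper" idea on this slide could look roughly like the sketch below. It is plain JavaScript for illustration only, not Sailthru's code: apart from `client_id` mapping to `c`, the field table and the names `renameKeys`, `toStored`, and `fromStored` are hypothetical, and only top-level keys are handled.

```javascript
// Hypothetical table from readable names (application code) to short names
// (stored documents). Only client_id -> c is taken from the slide.
const FIELD_MAP = { client_id: "c", email: "e", template: "t" };

// Reverse lookup, built once.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys using the given table; unknown keys pass through unchanged.
function renameKeys(doc, table) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStored = (doc) => renameKeys(doc, FIELD_MAP);
const fromStored = (doc) => renameKeys(doc, REVERSE_MAP);

const stored = toStored({ client_id: 76, email: "support-alerts@sailthru.com" });
console.log(stored);             // short keys, as they would be saved
console.log(fromStored(stored)); // readable keys again
```

The application keeps working with readable names while the stored documents carry the compact ones, which is the RAM-versus-developer-time tradeoff the slide describes.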
  • 24. Favor Human-Readable Foreign Keys
      • DBRefs are a bit cumbersome
      • Referencing by MongoId often means doing extra lookups
      • Build human-readable references to save you doing lookups and manual joins
  • 25. Example
      • Store the Template and the Email as strings on the message object
      • { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
      • No external reference lookups required
      • The tradeoff is basically just disk space
  • 26. Embed vs Top-Level Collections?
      • Major question of MongoDB schema design
      • If you can ask the question at all, you might want to err on the side of embedding
      • Don’t embed if the embedding could get huge
      • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  • 27. Typical Properties of Top-Level Collections
      • Independence: They don’t “belong” conceptually to another collection
      • Nouns: the building blocks of your system
      • Easily referenceable and updatable
  • 28. Embedding Pros
      • Super-fast retrieval of document with related data
      • Atomic updates
      • “Ownership” of embedded document is obvious
      • Usually maps well to code structures
  • 29. Embedding Cons
      • Harder to get at, harder to do mass queries
      • Does not scale up infinitely, will hit the 16MB document limit
      • Hard to create references to an embedded object
      • Limited ability to indexed-sort the embedded objects
  • 30. If You Think You Can Embed
      • You probably should
      • I take advantage of embedding in my designs more often now than I did three years ago
      • It’s a gift MongoDB gives you in exchange for giving up your joins
  • 31. Design Example: User Permissions
      • Users can have various broad permission levels for any number of clients
      • For example, user ‘ploki’ might have permission level ‘admin’ for client 76 and permission level ‘reports_only’ for client 450
  • 32. How Will We Use This Data?
      • Retrieve all clients for a given user
      • Retrieve all users for a given client
      • Retrieve a permission level for a given client for a given user
  • 33. How Will This Data Grow?
      • In the medium term, it will stay small
      • Number of clients and number of users can both grow infinitely
  • 34. Back in SQL-land
      • There’s a fairly standard way to do it
      • It’s a many-many relationship, so
      • Use a join table (client_user)
  • 35. Should We Use a New Top-Level Collection?
          db.client.user.save( { client_id: 76, username: 'ploki', permission: 'admin', });
          db.client.user.save( { client_id: 450, username: 'ploki', permission: 'reports_only', });
          db.client.user.ensureIndex( { client_id: 1 } );
          db.client.user.ensureIndex( { username: 1 } );
          // get all users belonging to a client
          db.client.user.find( { client_id: 76 } );
          // get all clients a user has access to
          db.client.user.find( { username: 'ibwhite' } );
          // get permissions for our current user
          db.client.user.findOne( { username: user.name } );
  • 36. Probably Not
      • Only needed if we have lots of clients per user AND lots of users per client
      • This is a case where we can embed, so let’s do so
  • 37. Three Ways to Embed
      • Not good: Object
          'clients': { '76': 'admin', '450': 'reports_only' },
          index: ???
          can’t do a multikeys index on the keys of a hash
      • Okay: Array of objects
          'clients': [ {'_id': 76, 'access': 'admin'}, {'_id': 450, 'access': 'reports_only'} ],
          index: { 'clients._id': 1 }
          but have to search through array to find by _id on retrieved doc
      • Our approach: Array and object
          'clients': [ 76, 450 ],
          'clients_access': { '76': 'admin', '450': 'reports_only' },
          index: { clients: 1 }
          Fields next to each other alphabetically
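A rough sketch of how the "array and object" layout serves the three access patterns from slide 32. This is plain JavaScript over in-memory objects, not the deck's code: the second user document, the helper names, and the `db.user` collection name in the comment are assumptions made for illustration.

```javascript
// User documents in the "array and object" layout. The clients and
// clients_access field names come from the slide; the sample data is illustrative.
const ploki = {
  username: "ploki",
  clients: [76, 450],
  clients_access: { "76": "admin", "450": "reports_only" },
};
const ibwhite = {
  username: "ibwhite",
  clients: [76],
  clients_access: { "76": "admin" },
};

// Permission for one client on an already-retrieved document:
// a direct key lookup, no scan through an array of objects.
function permissionFor(userDoc, clientId) {
  return userDoc.clients_access[String(clientId)] ?? null;
}

// In-memory stand-in for a query like db.user.find({ clients: clientId })
// (collection name assumed), which an index on { clients: 1 } would serve.
function usersForClient(users, clientId) {
  return users.filter((u) => u.clients.includes(clientId));
}

// Both fields must be updated together, since the data is stored twice.
function grantAccess(userDoc, clientId, level) {
  if (!userDoc.clients.includes(clientId)) userDoc.clients.push(clientId);
  userDoc.clients_access[String(clientId)] = level;
  return userDoc;
}

console.log(permissionFor(ploki, 450)); // reports_only
console.log(usersForClient([ploki, ibwhite], 76).map((u) => u.username)); // [ 'ploki', 'ibwhite' ]
```

The cost of this layout is that every write has to keep `clients` and `clients_access` in sync, which is why the sketch routes changes through a single `grantAccess` helper.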
  • 38. Indexes
      • Index all highly frequent queries
      • Do less-indexed queries only on secondaries
      • Reduce the size of indexes wherever you can on big collections
      • Don’t sweat the medium-sized collections, focus on the big wins
  • 39. Take Advantage of Multiple-Field Indexes
      • Order matters
      • If you have an index on { client_id: 1, email: 1 }
      • Then you also have the { client_id: 1 } index “for free”
      • but not { email: 1 }
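The prefix rule on this slide can be expressed as a small check. This is a deliberately simplified sketch: it looks only at which fields an equality query uses, while real query planning also considers sorts, ranges, and other factors. The function name `isIndexPrefix` is made up for the example.

```javascript
// Returns true when the fields of an equality query form a leading prefix
// of a compound index's key list (in any order within that prefix).
function isIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = new Set(indexFields.slice(0, queryFields.length));
  return queryFields.every((field) => prefix.has(field));
}

const index = ["client_id", "email"]; // stands for { client_id: 1, email: 1 }

console.log(isIndexPrefix(index, ["client_id"]));          // true
console.log(isIndexPrefix(index, ["client_id", "email"])); // true
console.log(isIndexPrefix(index, ["email"]));              // false
```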
  • 40. Use your _id
      • You must use an _id for every collection, which will cost you index size
      • So do something useful with _id
  • 41. Take advantage of fast ^indexes
      • Messages have _ids like: 32423.00000341
      • Need all messages in blast 32423:
      • db.message.blast.find( { _id: /^32423./ } );
      • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
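One reason the dot is awkward: unescaped, `.` matches any character, so `/^32423./` would also match an _id belonging to blast 324230. The sketch below builds an anchored prefix pattern with the separator escaped and checks it against sample strings in plain JavaScript. The helper names and the sample ids other than 32423.00000341 are made up for illustration.

```javascript
// Escape regex metacharacters so a literal prefix (including ".") matches itself.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Anchored, case-sensitive prefix pattern for all message ids in one blast.
function blastPrefix(blastId) {
  return new RegExp("^" + escapeRegex(String(blastId) + "."));
}

const ids = [
  "32423.00000341",
  "32423.00000342",
  "324230.00000001", // different blast sharing the same leading digits
  "9.32423.1",
];

const pattern = blastPrefix(32423);
console.log(ids.filter((id) => pattern.test(id)));
// [ '32423.00000341', '32423.00000342' ]

// In the mongo shell the same pattern would be passed as the query value:
//   db.message.blast.find({ _id: pattern })
```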
  • 42. Manual Range Partitioning
      • We moved a big message.blast collection into per-day collections:
          message.blast.20110605
          message.blast.20110606
          message.blast.20110607
          etc...
      • Keeps working set indexes smaller
      • When we move data into the archive, drop() is much faster than remove()
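A minimal sketch of the naming side of this scheme, in plain JavaScript. The collection naming pattern comes from the slide; the helper names, the use of UTC dates, and the cutoff logic are assumptions for illustration, and the actual archiving step is not shown.

```javascript
// Name of the per-day collection for a date, e.g. message.blast.20110605.
// UTC is used so the name does not depend on the local time zone (an assumption).
function dailyCollectionName(date, base = "message.blast") {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return `${base}.${y}${m}${d}`;
}

// From a list of collection names, pick the daily ones strictly older than the
// cutoff date. Zero-padded YYYYMMDD suffixes sort correctly as plain strings.
function collectionsToArchive(names, cutoffDate, base = "message.blast") {
  const cutoff = dailyCollectionName(cutoffDate, base);
  const daily = new RegExp("^" + base.replace(/\./g, "\\.") + "\\.\\d{8}$");
  return names.filter((name) => daily.test(name) && name < cutoff);
}

const names = [
  "message.blast.20110605",
  "message.blast.20110606",
  "message.blast.20110607",
];

console.log(dailyCollectionName(new Date(Date.UTC(2011, 5, 5)))); // message.blast.20110605
console.log(collectionsToArchive(names, new Date(Date.UTC(2011, 5, 7))));
// [ 'message.blast.20110605', 'message.blast.20110606' ]

// After archiving, each returned collection could be dropped in the mongo shell
// with db.getCollection(name).drop(), rather than remove() on one big collection.
```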
  • 43. Questions? Looking for a job?
      • ian@sailthru.com
      • twitter.com/eonwhite