MongoDB at Sailthru: Scaling and Schema Design

Ian White
@eonwhite
NoSQL Now!
8/25/11

Sunday, August 7, 2011

Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
Refinery 29, Flavorpill, Business Insider, Fab,
Totsy, New York Observer

How We Got To
MongoDB from SQL
• JSON was part of Sailthru infrastructure
from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like a natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
carefully)


Sailthru Architecture
• User interface to display stats, build
campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
(PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified Postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)

MongoDB Overview

• 13 instances on EC2 (6 two-member
replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs


Users are Documents

• Users aren’t records split among multiple
tables
• End user’s lists, clickstream interests,
geolocation, browser, time of day, purchase
history becomes one ever-growing
document
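
As a rough illustration of the one-document-per-user idea, a profile might look something like the object below. Every field name and value here is a made-up placeholder rather than Sailthru's actual schema; only the general shape (lists, interests, geolocation, browser, purchases in one document) follows the slide.

```javascript
// Hypothetical user profile document: all field names and values are illustrative.
const userProfile = {
  _id: "user@example.com",
  lists: ["daily-newsletter", "weekly-digest"],
  interests: { black: 12, polka_dot: 3 },   // clickstream interest counts
  geo: { city: { "New York, NY US": 7 } },  // times seen per location
  browser: { Chrome: 15, Safari: 2 },
  purchases: [
    { title: "Example dress", qty: 1, url: "http://example.com/dress" },
  ],
};

// New behavior is folded into the same document instead of a row in another table.
function recordInterest(profile, tag) {
  profile.interests[tag] = (profile.interests[tag] || 0) + 1;
  return profile;
}

recordInterest(userProfile, "green");
```

The point of the sketch is that a new kind of observation only adds or bumps a field on the existing document; nothing has to be joined back together at read time.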


Profiles Accessible
Everywhere
• Put abandoned shopping cart notifications
within a mass email
{if profile.purchase_incomplete}
<p>This is what’s in your cart:</p>
{foreach profile.purchase_incomplete.items as item}
{item.qty} <a href="{item.url}">{item.title}</a><br/>
{/foreach}
{/if}


Profiles Accessible Everywhere
• Show a section of content conditional on
the user’s location

{if profile.geo.city['New York, NY US']}
<div>Come to the New York Meetup on the 27th!</div>
{/if}


Profiles Accessible Everywhere
• Show different content depending on user
interests as measured by on-site behavior
{select}
{case horizon_interest('black,dark')}
<img src="http://example.com/dress-image-black.jpg" />
{/case}
{case horizon_interest('green')}
<img src="http://example.com/dress-image-green.jpg" />
{/case}
{case horizon_interest('purple,polka_dot,pattern')}
<img src="http://example.com/dress-image-polkadot.jpg" />
{/case}
{/select}


Profiles Accessible Everywhere
• Pick top content from a data feed based on
tags

{content = horizon_select(content,10)}

{foreach content as c}
<a href="{c.url}">{c.title}</a><br/>
{/foreach}


Other Advantages of
MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
or reindexing


How We Run mongod
• mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal

• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy

Separate DBs By
Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
patterns
• Consider consequences of database failure/
unavailability
• But make sure your backup and monitoring
strategy is prepared for multiple DBs


Our Five Replica Sets
• main: most of the stuff on the UI, lots of
small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
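
One way to picture this split in application code is a small routing helper that decides which replica set a read or write goes to. This is only a sketch: the replica set names come from the slide, but the collection names, the fallback to main, and the 90-day approximation of "three months" are assumptions made for illustration.

```javascript
// Replica set names are from the slide; the collection names below are hypothetical.
const REPLICA_SET_FOR = new Map([
  ["profile", "profile"],
  ["horizon.pageview", "horizon"],
  ["template", "main"],
  ["campaign", "main"],
]);

// "Three months" approximated as 90 days (an assumption for this sketch).
const THREE_MONTHS_MS = 90 * 24 * 60 * 60 * 1000;

// Small/medium UI collections default to the main replica set.
function replicaSetFor(collection) {
  return REPLICA_SET_FOR.get(collection) || "main";
}

// Messages live in "message" while recent and in "archive" once they age out.
function replicaSetForMessage(sentAt, now) {
  const ageMs = now.getTime() - sentAt.getTime();
  return ageMs > THREE_MONTHS_MS ? "archive" : "message";
}
```

Keeping the mapping in one place means the backup and monitoring scripts can iterate over the same list of databases the application uses.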

Monitoring

• Some things to monitor: faults/sec, index
misses, % locked, queue size, load average
• We check basic status once a minute on all
database servers (SMS alerts if down) and send email
warnings on thresholds every 10 minutes
• We have been beta-testing 10gen’s MMS product


Backups
• Used to use mongodump - don’t do that
anymore
• Have single node of each replica set on a
backup server
• Two-hour slave delay
• fsync/lock, freeze XFS file system, EBS
snapshot, unfreeze, unlock
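
The ordering in that last bullet matters: whatever gets locked or frozen has to be released in reverse order, even if the snapshot fails. A minimal sketch of that control flow follows. The five step functions are hypothetical caller-supplied hooks (for example wrappers around the database lock command, the filesystem freeze tool, and the EBS snapshot call), not real APIs.

```javascript
// Runs the backup steps in order and always releases in reverse order.
// Every property of `steps` is a hypothetical hook supplied by the caller.
function snapshotBackup(steps) {
  steps.fsyncLock();            // flush writes to disk and block new ones
  try {
    steps.freezeFs();           // freeze the XFS file system
    try {
      steps.ebsSnapshot();      // take the EBS snapshot while nothing changes
    } finally {
      steps.unfreezeFs();       // always unfreeze, even if the snapshot threw
    }
  } finally {
    steps.fsyncUnlock();        // always unlock the database last
  }
}
```

Because this runs against a delayed backup node rather than a live primary, the lock should not be visible to application traffic.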


The Great EC2 EBS
Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from
backup server, snapshots, and get
operational within hours
• Wasn’t fun


DESIGN


Develop Your Mental
Model of MongoDB

• You don’t need to look at the internals
• But try to gain a working understanding of
how MongoDB operates, especially RAM
and indexes


Big-Picture Design
Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
with your best YAGNI


“But premature
optimization is evil”
• Knuth said that about code, which is
flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
usually good when it comes to your data


Specific MongoDB
Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)

Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable field names
• Might be worth writing a mapper
• Probably wish we’d used c instead of
client_id
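
A mapper of the kind mentioned above could be as small as a rename table applied on the way in and out of the database. This is a sketch only: client_id to c is the one mapping the slide suggests, and the other entries and function names are hypothetical. It handles top-level keys only.

```javascript
// Long name -> short name. Only client_id -> c comes from the slide; the rest are examples.
const SHORT_NAMES = { client_id: "c", email: "e", template: "t" };

// Inverted table for reading documents back out.
const LONG_NAMES = Object.fromEntries(
  Object.entries(SHORT_NAMES).map(([longName, shortName]) => [shortName, longName])
);

// Renames top-level keys found in the table and passes all other keys through.
function renameKeys(doc, table) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStored = (doc) => renameKeys(doc, SHORT_NAMES);   // readable -> compact
const fromStored = (doc) => renameKeys(doc, LONG_NAMES);  // compact -> readable
```

Application code keeps working with readable names while the stored documents, and therefore the working set in RAM, stay smaller.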


Favor Human-Readable
Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
extra lookups
• Build human-readable references to save
you doing lookups and manual joins


Example

• Store the Template and the Email as strings
on the message object
• { template: "Internal - Blast Notify", email:
"support-alerts@sailthru.com" }

• No external reference lookups required
• The tradeoff is basically just disk space
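
To make the saved lookup concrete, here is a small in-memory comparison. The Map stands in for a templates collection and the id value is invented; only the template name and email address come from the example above.

```javascript
// In-memory stand-in for a templates collection (the id is a made-up placeholder).
const templatesById = new Map([["4e4f1a2b", { name: "Internal - Blast Notify" }]]);

// Reference by opaque id: displaying the message needs a second lookup.
const messageWithId = { template_id: "4e4f1a2b", email: "support-alerts@sailthru.com" };
function describeWithLookup(msg) {
  const template = templatesById.get(msg.template_id); // the extra "manual join"
  return `${template.name} -> ${msg.email}`;
}

// Human-readable reference: the message already carries what we want to show.
const messageWithName = { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" };
function describeDirectly(msg) {
  return `${msg.template} -> ${msg.email}`;
}
```

Both functions produce the same string, but the second one never touches another collection.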

Embed vs Top-Level
Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
want to err on the side of embedding
• Don’t embed if the embedding could get
huge
• Don’t feel too bad about denormalizing by
embedding AND storing in a top-level
collection

Typical Properties of
Top-Level Collections

• Independence: They don’t “belong”
conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable


Embedding Pros
• Super-fast retrieval of document with
related data
• Atomic updates
• “Ownership” of embedded document is
obvious
• Usually maps well to code structures


Embedding Cons

• Harder to get at, do mass queries
• Does not scale infinitely; will hit the 16MB
document size limit
• Hard to create references to embedded
object
• Limited ability to indexed-sort the
embedded objects


If You Think You Can
Embed
• You probably should
• I take advantage of embedding in my
designs more often now than I did three
years ago
• It’s a gift MongoDB gives you in exchange
for giving up your joins


Design Example:
User Permissions
• Users can have various broad permission
levels for any number of clients
• For example, user ‘ploki’ might have
permission level ‘admin’ for client 76 and
permission level ‘reports_only’ for client
450


How Will We Use This
Data?

• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
client for a given user


How Will This Data
Grow?

• In the medium term, it will stay small
• Number of clients and number of users can
both grow without bound


Back in SQL-land

• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)


Should We Use a New
Top-Level Collection?
db.client.user.save( {
  client_id: 76,
  username: 'ploki',
  permission: 'admin',
});
db.client.user.save( {
  client_id: 450,
  username: 'ploki',
  permission: 'reports_only',
});

db.client.user.ensureIndex( { client_id: 1 } );
db.client.user.ensureIndex( { username: 1 } );

// get all users belonging to a client
db.client.user.find( { client_id: 76 } );

// get all clients a user has access to
db.client.user.find( { username: 'ploki' } );

// get the permission level for our current user on one client
db.client.user.findOne( { username: user.name, client_id: 76 } );


Probably Not

• Only needed if we have lots of clients per
user AND lots of users per client
• This is a case where we can embed, so let’s
do so


Three Ways to Embed

Object. Not good: can’t do a multikeys index on the keys of a hash
'clients': {
  '76': 'admin',
  '450': 'reports_only',
},
index: ???

Array of objects. Okay: but have to search through the array to find by _id on the retrieved doc
'clients': [
  {'_id': 76, 'access': 'admin'},
  {'_id': 450, 'access': 'reports_only'}
],
index: { 'clients._id': 1 }

Array and object. Our approach: fields next to each other alphabetically
'clients': [ 76, 450 ],
'clients_access': {
  '76': 'admin',
  '450': 'reports_only',
}
index: { clients: 1 }
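
The three access patterns from the earlier slide map onto the array-plus-object layout roughly as follows. This is an in-memory sketch using plain arrays, not driver code; the second user is a hypothetical addition so the queries have something to filter.

```javascript
// In-memory stand-in for a user collection using the "array and object" layout.
const users = [
  { username: "ploki", clients: [76, 450], clients_access: { "76": "admin", "450": "reports_only" } },
  { username: "example_user", clients: [76], clients_access: { "76": "reports_only" } }, // hypothetical
];

// All users for a client. In MongoDB this would be a find on { clients: 76 },
// which the multikey index { clients: 1 } can serve.
function usersForClient(clientId) {
  return users.filter((u) => u.clients.includes(clientId)).map((u) => u.username);
}

// All clients for a user: read the array straight off the retrieved document.
function clientsForUser(username) {
  const user = users.find((u) => u.username === username);
  return user ? user.clients : [];
}

// Permission level: a direct key lookup, no scan through an array of objects.
function permissionFor(username, clientId) {
  const user = users.find((u) => u.username === username);
  return user ? user.clients_access[String(clientId)] : undefined;
}
```

The array exists for indexing and the object exists for lookup, so the client ids are stored twice. That is the denormalization cost of this layout.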


Indexes
• Index all highly frequent queries
• Do less-indexed queries only on
secondaries
• Reduce the size of indexes wherever you
can on big collections
• Don’t sweat the medium-sized collections,
focus on the big wins


Take Advantage of
Multiple-Field Indexes
• Order matters
• If you have an index on { client_id: 1, email: 1 }

• Then you also have the { client_id: 1 } index “for free”

• But not { email: 1 }
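
The prefix rule can be written down as a small predicate. This is a deliberately simplified model of the rule on this slide (equality queries only, function name hypothetical), not a description of how the query planner actually chooses indexes.

```javascript
// Simplified model: a query gets the index "for free" when its fields are exactly
// a leading prefix of the compound index's fields, in any order within the query.
function isServedByIndex(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  if (wanted.size === 0 || wanted.size > indexFields.length) return false;
  const prefix = new Set(indexFields.slice(0, wanted.size));
  return [...wanted].every((field) => prefix.has(field));
}

const compoundIndex = ["client_id", "email"];
const byClient = isServedByIndex(compoundIndex, ["client_id"]);         // leading prefix
const byEmail = isServedByIndex(compoundIndex, ["email"]);              // not a prefix
const byBoth = isServedByIndex(compoundIndex, ["email", "client_id"]);  // full index
```

So when choosing field order for a compound index, the field that is queried on its own should come first.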


Use your _id

• Every document must have an _id, and every
collection indexes it, which will cost you index size
• So do something useful with _id


Take advantage of fast
^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find(
{ _id: /^32423\./ } );

• (Yeah, I know the . is ugly: it has to be escaped, or the regex matches any character there. Don’t use a dot if you do this.)
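
A sketch of building and matching such ids is below. The eight-digit zero padding is inferred from the example id 32423.00000341, the function names are hypothetical, and blast ids are assumed to be plain integers so they need no regex escaping. The sketch also shows why the dot is awkward: unescaped, it matches any character and pulls in other blasts.

```javascript
// Build a string _id of the form "<blastId>.<8-digit sequence>".
function makeMessageId(blastId, sequence) {
  return `${blastId}.${String(sequence).padStart(8, "0")}`;
}

// Anchored prefix regex with the separator escaped, so only this blast matches.
// Assumes blastId is an integer (no regex metacharacters).
function blastPrefixRegex(blastId) {
  return new RegExp(`^${blastId}\\.`);
}

const ids = [makeMessageId(32423, 341), makeMessageId(324235, 1)];
const escaped = ids.filter((id) => blastPrefixRegex(32423).test(id));
const unescaped = ids.filter((id) => /^32423./.test(id)); // also matches blast 324235
```

An anchored prefix pattern like this is what lets the query walk a contiguous range of the _id index instead of scanning it.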


Manual Range
Partitioning
• We moved a big message.blast collection
into per-day collections:
• message.blast.20110605
message.blast.20110606
message.blast.20110607
etc...

• Keeps working set indexes smaller
• When we move data into the archive,
drop() is much faster than remove()
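
With per-day collections, the application has to compute which collection a date falls into, and which collections a date range touches. A minimal sketch, assuming days are cut on UTC boundaries (the slide does not say which time zone is used) and using hypothetical function names:

```javascript
// Collection name for one day, e.g. message.blast.20110605 (UTC day boundaries assumed).
function blastCollectionName(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `message.blast.${year}${month}${day}`;
}

// All per-day collections a query over [start, end] would need to visit.
function blastCollectionsBetween(start, end) {
  const names = [];
  const day = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate()));
  while (day.getTime() <= end.getTime()) {
    names.push(blastCollectionName(day));
    day.setUTCDate(day.getUTCDate() + 1); // step one calendar day forward
  }
  return names;
}
```

Archiving a day then amounts to copying one collection elsewhere and dropping it, rather than removing millions of documents from a shared collection.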


Questions?
Looking for a job?
ian@sailthru.com
twitter.com/eonwhite

