Sailthru handles email delivery for your website, ensuring inbox delivery for transactional and mass mail. Sailthru started out as a MySQL-powered transactional-mail service. Starting in 2009, we migrated to the document-oriented NoSQL database MongoDB. Moving entirely to MongoDB has allowed us to build complex user profiles that power behaviorally targeted mass emails and onsite recommendations. This talk covers how and why we made the move, and how we use MongoDB today.
1. MongoDB at Sailthru
Scaling and Schema
Design
Ian White
@eonwhite
NoSQL Now!
8/25/11
Sunday, August 7, 2011
2. Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
Refinery 29, Flavorpill, Business Insider, Fab,
Totsy, New York Observer
3. How We Got To
MongoDB from SQL
• JSON was part of Sailthru infrastructure
from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
carefully)
4. Sailthru Architecture
• User interface to display stats, build
campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
(PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified-postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)
5. MongoDB Overview
• 13 instances on EC2 (6 two-member
replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs
6. Users are Documents
• Users aren’t records split among multiple
tables
• End user’s lists, clickstream interests,
geolocation, browser, time of day, purchase
history becomes one ever-growing
document
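As a rough sketch of what "one ever-growing document" can look like, the profile below is plain JavaScript. Only `purchase_incomplete` (with `items`, `qty`, `url`, `title`) and `geo.city` come from the template slides in this deck; every other field name, the value shapes, and the `cartLines` helper are assumptions made up for illustration.

```javascript
// Hypothetical user profile document. Field names other than
// purchase_incomplete and geo.city are invented for this example.
const profile = {
  _id: "reader@example.com",
  lists: ["daily", "weekly"],
  geo: { city: { "New York, NY US": 12 } },   // assumed shape: city -> hit count
  browser: { Firefox: 7, Safari: 5 },          // assumed shape: browser -> hit count
  purchase_incomplete: {
    items: [
      { qty: 1, title: "Black dress", url: "http://example.com/dress-black" }
    ]
  }
};

// Everything a template needs is reachable from the one document, no joins.
function cartLines(p) {
  if (!p.purchase_incomplete) return [];
  return p.purchase_incomplete.items.map(item => `${item.qty} x ${item.title}`);
}

console.log(cartLines(profile));
```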
7. Profiles Accessible
Everywhere
• Put abandoned shopping cart notifications
within a mass email
{if profile.purchase_incomplete}
<p>This is what’s in your cart:</p>
{foreach profile.purchase_incomplete.items as item}
{item.qty} <a href="{item.url}">{item.title}</a><br/>
{/foreach}
{/if}
8. Profiles Accessible
Everywhere
• Show a section of content conditional on
the user’s location
{if profile.geo.city['New York, NY US']}
<div>Come to the New York Meetup on the 27th!</div>
{/if}
9. Profiles Accessible
Everywhere
• Show different content depending on user
interests as measured by on-site behavior
{select}
{case horizon_interest('black,dark')}
<img src="http://example.com/dress-image-black.jpg" />
{/case}
{case horizon_interest('green')}
<img src="http://example.com/dress-image-green.jpg" />
{/case}
{case horizon_interest('purple,polka_dot,pattern')}
<img src="http://example.com/dress-image-polkadot.jpg" />
{/case}
{/select}
10. Profiles Accessible
Everywhere
• Pick top content from a data feed based on
tags
{content = horizon_select(content,10)}
{foreach content as c}
<a href="{c.url}">{c.title}</a><br/>
{/foreach}
11. Other Advantages of
MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
or reindexing
12. How We Run mongod
• mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy
13. Separate DBs By
Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
patterns
• Consider consequences of database failure/
unavailability
• But make sure your backup and monitoring
strategy is prepared for multiple DBs
14. Our Five Replica Sets
• main: most of the stuff on the UI, lots of
small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
15. Monitoring
• Some stuff to monitor: faults/sec, index
misses, % locked, queue size, load average
• We check basic status once a minute on all
database servers (SMS alerts if down) and send
email warnings on thresholds every 10 minutes
• We have been beta-testing 10gen’s MMS product
16. Backups
• Used to use mongodump - don’t do that
anymore
• Have single node of each replica set on a
backup server
• Two-hour slave delay
• fsync/lock, freeze xfs file system, EBS
snapshot, unfreeze, unlock
17. The Great EC2 EBS
Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from
backup server, snapshots, and get
operational within hours
• Wasn’t fun
19. Develop Your Mental
Model of MongoDB
• You don’t need to look at the internals
• But try to gain a working understanding of
how MongoDB operates, especially RAM
and indexes
20. Big-Picture Design
Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
with your best YAGNI
21. “But premature
optimization is evil”
• Knuth said that about code, which is
flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
usually good when it comes to your data
22. Specific MongoDB
Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)
23. Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable field names
• Might be worth writing a mapper
• Probably wish we’d used c instead of
client_id
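The mapper idea can be sketched as follows: code keeps readable names while stored documents use short ones. This is a minimal illustration, not Sailthru's implementation. The `FIELD_MAP` contents and the `toStored`/`fromStored` helpers are hypothetical (only `c` for `client_id` is hinted at in the slide), and the sketch renames top-level fields only.

```javascript
// Hypothetical mapping from readable field names to short stored names.
const FIELD_MAP = { client_id: "c", email: "e", template: "t" };

// Build the inverse map once so reads can be translated back.
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys according to a map; unknown keys pass through unchanged.
function rename(doc, map) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStored = doc => rename(doc, FIELD_MAP);      // before writing
const fromStored = doc => rename(doc, REVERSE_MAP);  // after reading

console.log(toStored({ client_id: 76, email: "someone@example.com" }));
```

The cost is one more layer to keep in sync, which is the developer-time tradeoff the slide points at.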
24. Favor Human-Readable
Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
extra lookups
• Build human-readable references to save
you doing lookups and manual joins
25. Example
• Store the Template and the Email as strings
on the message object
• { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
• No external reference lookups required
• The tradeoff is basically just disk space
26. Embed vs Top-Level
Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
want to err on the side of embedding
• Don’t embed if the embedding could get
huge
• Don’t feel too bad about denormalizing by
embedding AND storing in a top-level
collection
27. Typical Properties of
Top-Level Collections
• Independence: They don’t “belong”
conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable
28. Embedding Pros
• Super-fast retrieval of document with
related data
• Atomic updates
• “Ownership” of embedded document is
obvious
• Usually maps well to code structures
29. Embedding Cons
• Harder to get at, do mass queries
• Does not scale infinitely; a document will
eventually hit the 16MB size limit
• Hard to create references to embedded
object
• Limited ability to indexed-sort the
embedded objects
30. If You Think You Can
Embed
• You probably should
• I take advantage of embedding in my
designs more often now than I did three
years ago
• It’s a gift MongoDB gives you in exchange
for giving up your joins
31. Design Example:
User Permissions
• Users can have various broad permission
levels for any number of clients
• For example, user ‘ploki’ might have
permission level ‘admin’ for client 76 and
permission level ‘reports_only’ for client
450
32. How Will We Use This
Data?
• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
client for a given user
33. How Will This Data
Grow?
• In the medium term, it will stay small
• Number of clients and number of users can
both grow infinitely
34. Back in SQL-land
• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)
35. Should We Use a New
Top-Level Collection?
db.client.user.save({
  client_id: 76,
  username: 'ploki',
  permission: 'admin'
});
db.client.user.save({
  client_id: 450,
  username: 'ploki',
  permission: 'reports_only'
});
db.client.user.ensureIndex({ client_id: 1 });
db.client.user.ensureIndex({ username: 1 });
// get all users belonging to a client
db.client.user.find({ client_id: 76 });
// get all clients a user has access to
db.client.user.find({ username: 'ploki' });
// get the current user's permission for one client
db.client.user.findOne({ client_id: 76, username: user.name });
36. Probably Not
• Only needed if we have lots of clients per
user AND lots of users per client
• This is a case where we can embed, so let’s
do so
37. Three Ways to Embed
• Object: not good, can’t do a multikey index on the keys of a hash
'clients': {
  '76': 'admin',
  '450': 'reports_only'
}
index: ???
• Array of objects: okay, but have to search through the array on the retrieved doc to find by _id
'clients': [
  {'_id': 76, 'access': 'admin'},
  {'_id': 450, 'access': 'reports_only'}
]
index: { 'clients._id': 1 }
• Array and object (our approach): fields sit next to each other alphabetically
'clients': [ 76, 450 ],
'clients_access': {
  '76': 'admin',
  '450': 'reports_only'
}
index: { clients: 1 }
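The "array and object" layout can be exercised in plain JavaScript as below. This is a sketch under assumptions: the `username` field and the helper functions `permissionFor`, `grant`, and `usersOfClientQuery` are hypothetical names, and the last one only builds a query document rather than talking to a database.

```javascript
// A user document in the "array and object" layout.
const user = {
  username: "ploki",
  clients: [76, 450],                                       // indexable array
  clients_access: { "76": "admin", "450": "reports_only" }  // direct lookup by key
};

// Permission for one client: a key lookup on the already-retrieved document.
function permissionFor(userDoc, clientId) {
  const key = String(clientId);
  return Object.prototype.hasOwnProperty.call(userDoc.clients_access, key)
    ? userDoc.clients_access[key]
    : null;
}

// Grant or change access, keeping the array and the object in sync.
function grant(userDoc, clientId, level) {
  if (!userDoc.clients.includes(clientId)) userDoc.clients.push(clientId);
  userDoc.clients_access[String(clientId)] = level;
  return userDoc;
}

// Query document for "all users of a client". Matching a scalar against an
// array field matches any element, so an index on { clients: 1 } applies.
const usersOfClientQuery = clientId => ({ clients: clientId });

console.log(permissionFor(user, 76), usersOfClientQuery(76));
```

The price of this layout is that every write has to update both fields, which is why `grant` touches the array and the object together.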
38. Indexes
• Index all highly frequent queries
• Do less-indexed queries only on
secondaries
• Reduce the size of indexes wherever you
can on big collections
• Don’t sweat the medium-sized collections,
focus on the big wins
39. Take Advantage of
Multiple-Field Indexes
• Order matters
• If you have an index on { client_id: 1, email: 1 }
• Then you also have the { client_id: 1 } index “for free”
• But not { email: 1 }
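The prefix rule can be modelled in a few lines of JavaScript. This is a deliberately simplified sketch, not how MongoDB's query planner is implemented (the real planner also weighs sorts, ranges, and other candidate indexes), and `usablePrefix` is a hypothetical helper name.

```javascript
// Return the leading fields of a compound index that a query constrains.
// An empty result means the index cannot serve the query by prefix.
function usablePrefix(indexSpec, queryFields) {
  const prefix = [];
  for (const field of Object.keys(indexSpec)) {
    if (!queryFields.includes(field)) break;  // stop at the first gap
    prefix.push(field);
  }
  return prefix;
}

const index = { client_id: 1, email: 1 };

console.log(usablePrefix(index, ["client_id"]));
console.log(usablePrefix(index, ["client_id", "email"]));
console.log(usablePrefix(index, ["email"]));
```

A query on `email` alone yields an empty prefix, which is why that lookup would need its own index.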
40. Use your _id
• Every document must have an _id, and every
collection has an _id index, which will cost you index size
• So do something useful with _id
41. Take advantage of fast
^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find( { _id: /^32423\./ } );
• (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
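Because an unescaped `.` in a regular expression matches any character, a prefix pattern on ids like these is safer with the separator escaped. The sketch below shows the difference on a handful of made-up ids; `escapeRegex` and `prefixQuery` are hypothetical helpers, and the sample ids other than 32423.00000341 are invented.

```javascript
// Escape regex metacharacters so the prefix is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a query document with an anchored, literal prefix on _id.
function prefixQuery(prefix) {
  return { _id: new RegExp("^" + escapeRegex(prefix)) };
}

const ids = [
  "32423.00000341",
  "32423.00000342",
  "324230.00000001",  // a different blast whose id merely starts with 32423
  "32424.00000001"
];

const loose = /^32423./;                  // "." matches any character
const strict = prefixQuery("32423.")._id; // "." matches only a literal dot

console.log(ids.filter(id => loose.test(id)));
console.log(ids.filter(id => strict.test(id)));
```

The anchored `^` is what lets the pattern behave like a range scan over the index; the escaping only makes sure the range stops at the intended blast.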
42. Manual Range
Partitioning
• We moved a big message.blast collection
into per-day collections:
• message.blast.20110605
message.blast.20110606
message.blast.20110607
etc...
• Keeps working set indexes smaller
• When we move data into the archive,
drop() is much faster than remove()
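Routing reads and writes to per-day collections comes down to deriving a collection name from a date. A minimal sketch follows, assuming days are cut at UTC midnight (the slides do not say which timezone is used); `collectionNameFor` and `collectionsBetween` are hypothetical helpers.

```javascript
// Name of the per-day collection for a given date, e.g. message.blast.20110605.
// Assumption: days are bucketed in UTC.
function collectionNameFor(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `message.blast.${year}${month}${day}`;
}

// All per-day collection names covering an inclusive date range,
// for queries that span more than one day.
function collectionsBetween(start, end) {
  const names = [];
  const cursor = new Date(Date.UTC(
    start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate()
  ));
  while (cursor.getTime() <= end.getTime()) {
    names.push(collectionNameFor(cursor));
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return names;
}

console.log(collectionsBetween(
  new Date(Date.UTC(2011, 5, 5)),
  new Date(Date.UTC(2011, 5, 7))
));
```

Archiving a day then becomes a matter of copying one collection and dropping it, rather than removing documents one range at a time.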
43. Questions?
Looking for a job?
ian@sailthru.com
twitter.com/eonwhite