2. Let's Pretend We Are DevOps
[Meme image: "What my friends / society / my Mom / my boss think I do … what I think I do … what I really do" — DevOps]
4. Why do we need to shard?
• Reaching a limit on some resource
– RAM (working set)
– Disk space
– Disk IO
– Client network latency on writes (tag aware sharding)
– CPU
5. Do we need to shard right now?
• Two schools of thought:
1. Shard at the outset to avoid technical debt later
2. Shard later to avoid complexity and overhead today
• Either way, shard before you need to!
– 256GB data size threshold published in documentation
– Chunk migrations can cause memory contention and disk IO
[Diagram: RAM over time — "Things seemed fine…" (working set plus free RAM) vs. "… then I waited too long to shard" (working set plus chunk migration filling RAM)]
7. Traffic sensors to monitor interstate conditions
• 16,000 sensors
• Measure
– Speed
– Travel time
– Weather, pavement, and traffic conditions
• Support desktop, mobile, and car navigation systems
9. { _id: "900006:2014031206",
data: [
{ speed: NaN, time: NaN },
{ speed: NaN, time: NaN },
{ speed: NaN, time: NaN },
...
],
conditions: {
status: "unknown",
pavement: "unknown",
weather: "unknown"
}
}
Sample Document Structure
Pre-allocated 60-element array of per-minute data
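A minimal mongo-shell sketch of how such a document might be pre-allocated and then updated in place each minute (the collection name mdbw is taken from the stats on the next slide; sensor values are hypothetical):
var slots = [];
for (var i = 0; i < 60; i++) {
    slots.push({ speed: NaN, time: NaN });   // one slot per minute of the hour
}
db.mdbw.insertOne({
    _id: "900006:2014031206",                // "<linkID>:<YYYYMMDDHH>"
    data: slots,
    conditions: { status: "unknown", pavement: "unknown", weather: "unknown" }
});
// each per-minute reading overwrites its slot instead of growing the document
db.mdbw.updateOne(
    { _id: "900006:2014031206" },
    { $set: { "data.42.speed": 55, "data.42.time": 38 } }   // minute 42, hypothetical values
);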
10. > db.mdbw.stats()
{
"ns" : "test.mdbw",
"count" : 16000, // one hour's worth of documents
"size" : 65280000, // size of user data, padding included
"avgObjSize" : 4080,
"storageSize" : 93356032, // size of data extents, unused space included
"numExtents" : 11,
"nindexes" : 1,
"lastExtentSize" : 31354880,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 801248,
"indexSizes" : { "_id_" : 801248 },
"ok" : 1
}
collection stats
11. Storage model spreadsheet
sensors                        16,000
years to keep data             6
docs per day                   384,000
docs per year                  140,160,000
docs total across all years    840,960,000
index size per hour            801,248 bytes
storage per hour               63 MB
storage per day                1.5 GB
storage per year               539 GB
storage across all years       3,235 GB
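The spreadsheet figures follow directly from the collection stats; a quick sketch of the arithmetic (avgObjSize of 4080 bytes taken from db.mdbw.stats() above):
var sensors       = 16000;
var docsPerDay    = sensors * 24;            // 384,000 (one document per sensor per hour)
var docsPerYear   = docsPerDay * 365;        // 140,160,000
var docsTotal     = docsPerYear * 6;         // 840,960,000 across six years
var bytesPerHour  = sensors * 4080;          // ~63 MB
var bytesPerDay   = bytesPerHour * 24;       // ~1.5 GB
var bytesPerYear  = bytesPerDay * 365;       // ~539 GB
var bytesAllYears = bytesPerYear * 6;        // ~3,235 GB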
12. Why we need to shard now
539 GB in year one alone
[Chart: total storage (GB) by year 1–6, growing to ~3,235 GB by year six]
16,000 sensors today… … 47,000 tomorrow?
13. What will our sharded cluster look like?
• We need to model the application to answer this
question
• Model should include:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
• Two main collections
– summary data (fast query times)
– historical data (analysis of environmental conditions)
14. Option 1: Everything in one sharded cluster
[Diagram: one sharded cluster — Shard 1 (the primary shard), Shard 2, Shard 3, Shard 4 … Shard N, each a replica set with one primary and two secondaries]
• Issue: prevent analytics jobs from affecting application performance
• Summary data is small (16,000 * N bytes) and accessed frequently
15. Option 2: Distinct replica set for summaries
[Diagram: a separate replica set (one primary, two secondaries) for summary data alongside the sharded cluster of Shard 1, Shard 2, Shard 3 … Shard N]
• Pros: operational separation between business functions
• Cons: application must write to two different databases
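A sketch of what that double write might look like from the mongo shell (host names, database and collection names are placeholders, not from the talk):
var summaryConn = new Mongo("summary-rs0.example.net:27017");   // dedicated replica set
var clusterConn = new Mongo("mongos0.example.net:27017");       // sharded cluster router
var summaries   = summaryConn.getDB("traffic").summary;
var historical  = clusterConn.getDB("traffic").mdbw;
// each sensor check-in writes the latest reading to the summary set...
summaries.updateOne(
    { _id: "900006" },
    { $set: { speed: 55, time: 38, updated: new Date() } },
    { upsert: true }
);
// ...and the per-minute slot to the historical, sharded collection
historical.updateOne(
    { _id: "900006:2014031206" },
    { $set: { "data.42.speed": 55, "data.42.time": 38 } }
);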
16. Application read patterns
• Web browsers, mobile phones, and in-car navigation devices
• Working set will be kept in RAM
• 5M subscribers * 1% active * 50 sensors/query * 1 device query/min = 41,667 reads/sec
• 41,667 reads/sec * 4080 bytes = 162 MB/sec
– and that's without any protocol overhead
• Gigabit Ethernet is ≈ 118 MB/sec
[Diagram: replica set — primary and two secondaries behind a single 1 Gbps link]
17. Application read patterns (continued)
• Options
– provision more bandwidth ($$$)
– tune the application read pattern
– add a caching layer
– secondary reads from the replica set
[Diagram: replica set — primary and two secondaries, each with its own 1 Gbps link]
18. Secondary Reads from the Replica Set
• Stale data is OK in this use case
• Caution: a read preference of secondary could be disastrous in a 3-member replica set if a secondary fails!
• App servers with mixed read preferences of primary and secondary are operationally cumbersome
• Use the nearest read preference to access all nodes
[Diagram: replica set — primary and two secondaries, each serving reads over its own 1 Gbps link]
db.collection.find().readPref("nearest")
19. Replica Set Tags
• App servers in different data centers use replica set tags plus the nearest read preference
• db.collection.find().readPref("nearest", [ { "datacenter": "east" } ])
[Diagram: "east" data center — one primary and two secondaries]
> rs.conf()
{ "_id": "rs0",
  "version": 2,
  "members": [
    { "_id": 0,
      "host": "node0.example.net:27017",
      "tags": { "datacenter": "east" }
    },
    { "_id": 1,
      "host": "node1.example.net:27017",
      "tags": { "datacenter": "east" }
    },
    { "_id": 2,
      "host": "node2.example.net:27017",
      "tags": { "datacenter": "east" }
    }
  ]
}
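One way a configuration like the one above might be applied (a sketch; member indexes and tag values assumed):
cfg = rs.conf();
cfg.members[0].tags = { "datacenter": "east" };
cfg.members[1].tags = { "datacenter": "east" };
cfg.members[2].tags = { "datacenter": "east" };
rs.reconfig(cfg);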
21. Replica Set Tags
• Enables geographic distribution
• Allows scaling within each data center
[Diagram: nine-member replica set spread across the east, central, and west data centers — one primary and eight secondaries]
22. Analytic read patterns
• How does an analyst look at the data on the sharded cluster?
• 1 Year of data = 539 GB
[Chart: server RAM (GB) vs. number of shards — data points (3 shards, 256 GB), (3, 192), (5, 128), (9, 64), (17, 32)]
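Assuming the chart plots how many shards are needed for one year of data (539 GB) to fit in aggregate RAM, the curve is just a ceiling division:
var yearGB = 539;
[256, 192, 128, 64, 32].forEach(function (ramGB) {
    print(ramGB + " GB RAM per server -> " + Math.ceil(yearGB / ramGB) + " shards");
});
// 256 -> 3, 192 -> 3, 128 -> 5, 64 -> 9, 32 -> 17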
23. Application write patterns
• 16,000 sensors every minute = 267 writes/sec
• Could we handle 16,000 writes in one second?
– 16,000 writes * 4080 bytes = 62 MB
• Load test the app!
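A minimal single-client sketch of such a load test in the mongo shell (collection name mdbw assumed; a real test would drive writes from several app servers in parallel through the driver):
var start = Date.now();
for (var s = 0; s < 16000; s++) {
    db.mdbw.updateOne(
        { _id: s + ":2014031206" },                              // hypothetical linkID:hour keys
        { $set: { "data.0.speed": 55, "data.0.time": 38 } },
        { upsert: true }
    );
}
print("16,000 writes in " + (Date.now() - start) + " ms");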
24. Modeling the Application - summary
• We modeled:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
– the network, a little bit
26. Shard Key characteristics
• A good shard key has:
– sufficient cardinality
– distributed writes
– targeted reads ("query isolation")
• Shard key should be in every query if possible
– scatter gather otherwise
• Choosing a good shard key is important!
– affects performance and scalability
– changing it later is expensive
27. Hashed shard key
• Pros:
– Evenly distributed writes
• Cons:
– Random data (and index) updates can be IO intensive
– Range-based queries turn into scatter gather
[Diagram: mongos routing hashed writes across Shard 1, Shard 2, Shard 3 … Shard N]
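For reference, a hashed shard key is declared like this (database and collection names are placeholders):
sh.enableSharding("traffic");
sh.shardCollection("traffic.mdbw", { _id: "hashed" });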
28. Low cardinality shard key
• Induces "jumbo chunks"
• Example: sensor ID
• Makes sense for some use cases besides this one
[Diagram: mongos in front of Shard 1, Shard 2, Shard 3 … Shard N, with chunk ranges [a, b), [b, c), [c, d), [e, f)]
31. What is our shard key?
• Let's choose: linkID, date
– example: { linkID: 900006, date: 2014031206 }
– example: { _id: "900006:2014031206" }
– this application's _id is in this form already, yay!
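A sketch of sharding on the chosen key (database and collection names assumed). Since _id already encodes "linkID:date", range-sharding on _id gives the same write distribution and query targeting as a compound { linkID, date } key:
sh.enableSharding("traffic");
sh.shardCollection("traffic.mdbw", { _id: 1 });
// queries that include the shard key are routed to a single shard:
db.mdbw.find({ _id: "900006:2014031206" });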
32. Summary
• Model the read/write patterns and storage
• Choose an appropriate shard key
• DevOps influenced the application
– write recent summary data to separate database
– replica set tags for summary database
– avoid synchronous sensor checkins
– consider changing client polling frequency
– consider throttling REST API access to app servers
33. Sign up for our “Path to Proof” Program
and get free expert advice on
implementation, architecture, and
configuration.
www.mongodb.com/lp/contact/path-proof-program