7. First Deployment
• A single server with a really big disk
[Diagram: Application (c3.8xlarge) → mongod (i2.8xlarge, 251 GB RAM, 6 TB SSD)]
8. Second Deployment
• A really big cluster where everything is in RAM
[Diagram: Application / mongos (c3.8xlarge) → 100 x mongod (r3.2xlarge, 61 GB RAM and 100 GB disk each)]
10. Now... how much would you pay?
• Single server: $60,000 / yr
• Cluster: $700,000 / yr
11. Use Cases
• Bulk loading
– getting all data into the system
• Latency and throughput for queries
– point in space-time
– one station, one year
– the whole world, once upon a time
• Aggregation and Exploration
– warmest and coldest day ever, etc.
12. Bulk Loading: Principles
• On the application side:
– batch size
– number of client threads
– use unordered bulk writes
• On the server side:
– Journaling off (temporarily!)
– Index later
– In cluster: pre-split, no balancing
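The application-side principles above can be sketched as follows. This is a minimal illustration, not the code used for the measurements in this talk: `loadInBatches` and `makeFakeCollection` are hypothetical helpers, and the sketch assumes a collection object whose `insertMany(docs, { ordered: false })` call completes synchronously, as it does in the mongo shell. Running several such loaders in parallel (one per client thread) is left out.

```javascript
// Hypothetical helper: split an array of documents into fixed-size batches
// and send each batch as one unordered bulk write.
function loadInBatches(collection, docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let batches = 0;
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    // ordered: false lets the server keep going after a failed document.
    collection.insertMany(batch, { ordered: false });
    batches += 1;
  }
  return batches;
}

// Stand-in collection so the sketch runs without a server; in the mongo
// shell you would pass a real collection such as db.data instead.
function makeFakeCollection() {
  const calls = [];
  return {
    calls,
    insertMany(batch, options) {
      calls.push({ size: batch.length, ordered: options.ordered });
    },
  };
}

const fake = makeFakeCollection();
const sampleDocs = Array.from({ length: 250 }, (_, i) => ({ n: i }));
const batchCount = loadInBatches(fake, sampleDocs, 100);
console.log(batchCount, fake.calls.map((c) => c.size));
```

With a batch size of 100, the 250 sample documents go out as three bulk writes; the last batch simply carries the remainder.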
13. Bulk Loading: Single Server
[Chart: throughput by batch size and number of threads]
• 8 threads, batch size 100 → 85,000 doc/s
14. Bulk Loading: Single Server
• Settings: 8 threads, batch size 100
• Total loading time: 10 h 20 min
• Documents per second: 70,000
• Index build time: 7 h 40 min (ts_1_st_1)
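As a rough cross-check of these figures, the sketch below multiplies the reported rate by the reported loading time, and derives the index name from a key pattern. The name `ts_1_st_1` matches MongoDB's default naming for an ascending compound index on `ts` and `st`; the key pattern `{ ts: 1, st: 1 }` is inferred from that name and is an assumption, as are the helper names.

```javascript
// Convert a duration given in hours and minutes to seconds.
function toSeconds(hours, minutes) {
  return (hours * 60 + minutes) * 60;
}

// Figures reported on the slide.
const docsPerSecond = 70000;
const loadSeconds = toSeconds(10, 20);

// Approximate number of documents loaded (rate x time).
const estimatedDocs = docsPerSecond * loadSeconds;

// MongoDB's default index name joins each field and direction with "_".
function defaultIndexName(keyPattern) {
  return Object.entries(keyPattern)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

// Key pattern inferred from the index name on the slide (an assumption).
const keyPattern = { ts: 1, st: 1 };
console.log(estimatedDocs, defaultIndexName(keyPattern));

// In the mongo shell, the index would be built after loading with:
//   db.data.createIndex(keyPattern)
```

The product comes to roughly 2.6 billion documents. Since the reported rate is presumably rounded, this is only an estimate of the collection size.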
23. Analytics and Exploration
• Analytics means ad-hoc queries for which
we do not have an index
– Find all tornados
– Maximum reported temperature
• We cannot just index everything
– memory
– write performance
24. Analytics: Find all Tornados
db.data.find({
  "presentWeatherObservation.condition": "99"
})
• Cluster: 47 s
• Single server: 1 h 28 min
25. Analytics: Maximum Temperature
db.data.aggregate([
  { "$match": { "airTemperature.quality": { "$in": ["1", "5"] } } },
  { "$group": { "_id": null,
                "maxTemp": { "$max": "$airTemperature.value" } } }
])
• Result: 61.8 °C = 143 °F
• Cluster: 2 min
• Single server: 4 h 45 min
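To make the two pipeline stages concrete, here is a plain-JavaScript sketch of what the aggregation computes, run over an in-memory array instead of a collection. The function name and the sample documents are made up for illustration, and the sketch assumes `airTemperature.value` is numeric. It does not reproduce every server-side detail (for example, it returns a result even when nothing matches).

```javascript
// Plain-JavaScript equivalent of the $match + $group/$max pipeline,
// applied to an in-memory array instead of a collection.
function maxTemperature(docs, acceptedQualities) {
  let maxTemp = null;
  for (const doc of docs) {
    const reading = doc.airTemperature;
    // $match: keep only readings whose quality code is accepted.
    if (!reading || !acceptedQualities.includes(reading.quality)) continue;
    // $group with $max: track the largest value seen so far.
    if (maxTemp === null || reading.value > maxTemp) maxTemp = reading.value;
  }
  return { _id: null, maxTemp };
}

// Made-up sample documents, for illustration only.
const sampleReadings = [
  { airTemperature: { value: 21.5, quality: "1" } },
  { airTemperature: { value: 99.9, quality: "9" } }, // rejected by the quality filter
  { airTemperature: { value: 34.0, quality: "5" } },
  { st: "no reading" },
];
const maxResult = maxTemperature(sampleReadings, ["1", "5"]);
console.log(maxResult.maxTemp);
```

Every document has to be visited once, which is why this kind of query is a full scan when no suitable index exists.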
26. Summary: Single Server
Pro
• Cost-effective
• Very good latency for single queries
Con
• Some operations are prohibitive:
– Indexing
– Table Scans
27. Summary: Cluster
Con
• High cost
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible
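The trade-off between the two deployments can be put into numbers with a short calculation. The sketch below uses the timings and yearly costs reported earlier in this talk. It assumes the lower cost figure belongs to the single server and the higher one to the cluster; the variable names are illustrative.

```javascript
// Durations reported on the slides, converted to seconds.
const tornadoScan = { single: (1 * 60 + 28) * 60, cluster: 47 };
const maxTempAgg = { single: (4 * 60 + 45) * 60, cluster: 2 * 60 };

// Yearly cost figures from the slides (assignment to deployments assumed).
const yearlyCost = { single: 60000, cluster: 700000 };

function ratio(a, b) {
  return a / b;
}

const tornadoSpeedup = ratio(tornadoScan.single, tornadoScan.cluster);
const maxTempSpeedup = ratio(maxTempAgg.single, maxTempAgg.cluster);
const costRatio = ratio(yearlyCost.cluster, yearlyCost.single);

console.log(
  tornadoSpeedup.toFixed(1),
  maxTempSpeedup.toFixed(1),
  costRatio.toFixed(1)
);
```

Under these assumptions the cluster answers the two full-scan queries roughly 110 to 140 times faster, at roughly 12 times the yearly cost.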