2. BIGPANDA
SaaS platform that lets companies aggregate alerts
from all their monitoring systems into one place for
faster incident discovery and response.
3. HOW IT WORKS
Events:
• High CPU on prod-srv-1, 18/06/14 16:05, CRITICAL
• High CPU on prod-srv-1, 18/06/14 16:07, WARNING
• Memory usage on prod-srv-1, 18/06/14 16:08, CRITICAL
Entities:
• High CPU on prod-srv-1, WARNING
• Memory usage on prod-srv-1, CRITICAL
Incidents:
• 2 Alerts on prod-srv-1
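The flow on this slide (events folded into entities, entities grouped into incidents) can be sketched roughly as follows. This is a minimal illustration, not BigPanda's actual pipeline: the event field names, the `toEntities` and `toIncidents` helpers, and the choice to group incidents by host are all assumptions made for the example.

```javascript
// Hypothetical event shape modelled on the slide: check name, host, time, status.
const events = [
  { check: 'High CPU', host: 'prod-srv-1', time: '18/06/14 16:05', status: 'CRITICAL' },
  { check: 'High CPU', host: 'prod-srv-1', time: '18/06/14 16:07', status: 'WARNING' },
  { check: 'Memory usage', host: 'prod-srv-1', time: '18/06/14 16:08', status: 'CRITICAL' },
];

// Fold events into entities: one entity per (check, host), keeping the latest status.
// Assumes events arrive in chronological order.
function toEntities(eventList) {
  const byKey = new Map();
  for (const e of eventList) {
    byKey.set(`${e.check}|${e.host}`, { check: e.check, host: e.host, status: e.status });
  }
  return [...byKey.values()];
}

// Group entities into incidents. Grouping by host is an assumption for this sketch.
function toIncidents(entityList) {
  const byHost = new Map();
  for (const entity of entityList) {
    if (!byHost.has(entity.host)) byHost.set(entity.host, []);
    byHost.get(entity.host).push(entity);
  }
  return [...byHost].map(([host, alerts]) => ({
    title: `${alerts.length} Alerts on ${host}`,
    alerts,
  }));
}

const entities = toEntities(events);
const incidents = toIncidents(entities);
console.log(incidents[0].title); // "2 Alerts on prod-srv-1"
```

The two "High CPU" events collapse into a single entity whose status is the most recent one (WARNING), which is why the incident reports two alerts rather than three.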
4. PRODUCT REQUIREMENTS
• Events need to be processed into incidents and
streamed to the user’s browser as fast as possible
• Incidents need to reliably reflect the state as it is in
the monitoring system
• The service has to be up and running 24x7
5. MISSION CRITICAL
• It’s not rocket science, it’s not Google, but:
• It has to be super fast
• It has to be extremely reliable
• It has to always be available
8. WHY MONGO?
At first:
• NodeJS shop
• Schemaless
• Easy to master
Later on:
• Reliable
• Easy to evolve
• Partial and atomic updates
• Powerful query language
BECAUSE IT’S WEB SCALE!
10. HARDWARE
03/13: 3 x m1.medium
02/14: 1 x i2.xlarge + 2 x m1.medium
06/14: 2 x i2.xlarge + 1 x m3.xlarge
m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
i2.xlarge: 4 vCPUs, 30.5GB RAM, SSD 800GB
x3 reads, x4 writes
11. –Eliot Horowitz
“Schema design is … the largest factor when it comes
to performance and scalability … more important
than hardware, how you shard, or anything else,
schema is by far the most important thing.”
15. LEAN QUERIES
• Use projections to limit the fields returned by a query:
Model.find().select('-events')
• Mongoose users: use .lean() when possible for a performance
boost of more than 50%:
Model.find().lean()
• Stream results:
Model.find().stream().on('data', function(doc){})
16. RESULTS
• Average latency of all API calls went from 500ms
to under 20ms
• Average latency of full pipeline went from 2s to
under 500ms
• Peak-time latency of full pipeline went down from
5 minutes (!!) to less than 30s
18. ATOMIC & PARTIAL UPDATES
• Several services might try to update the same
document at the same time, but:
• Different systems update different parts of the
document
• Updates to the same document are sharded and
ordered at the application level
(read our awesome blog post: http://bit.ly/1nQVcbS)
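One way to picture "sharded and ordered at the application level" is to route every update for a given document id to the same queue, so that updates to one document are applied one at a time in arrival order. The sketch below is only an illustration under that assumption, not the approach described in the blog post: `SHARD_COUNT`, `shardFor`, `enqueueUpdate`, and the in-memory `store` are all hypothetical. Against MongoDB, each update function would typically issue a `$set` on just the fields its service owns.

```javascript
// Hypothetical sketch: updates for the same document id always land on the same
// in-process queue, so they run sequentially and in arrival order.
const SHARD_COUNT = 4; // assumed number of queues/workers

// Simple deterministic string hash (illustrative only).
function shardFor(docId) {
  let hash = 0;
  for (const ch of String(docId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % SHARD_COUNT;
}

// One promise chain per shard; each task starts after the previous one settles.
const tails = Array.from({ length: SHARD_COUNT }, () => Promise.resolve());

function enqueueUpdate(docId, applyUpdate) {
  const shard = shardFor(docId);
  const run = tails[shard].then(() => applyUpdate(docId));
  tails[shard] = run.catch(() => {}); // a failed update must not block later ones
  return run;
}

// Usage: two services update different fields of the same in-memory document.
const store = { 'incident-1': { status: 'CRITICAL', assignee: null } };
const applied = [];

const done = Promise.all([
  enqueueUpdate('incident-1', async (id) => {
    store[id].status = 'WARNING'; // partial update: only the status field
    applied.push('status');
  }),
  enqueueUpdate('incident-1', async (id) => {
    store[id].assignee = 'alice'; // partial update: only the assignee field
    applied.push('assignee');
  }),
]);
```

Because each update touches only its own field and the queue serializes writes per document, neither service overwrites the other's change.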
20. REPLICA SET
• 3-node replica set
• Using priorities to make the stronger nodes win
primary (master) elections
• Deployed across different availability zones
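As a rough illustration of the priority setup, a replica set configuration document might look like the one below. The set name, host names, and priority values are placeholders rather than the deck's actual settings. In the mongo shell, an object like this would be passed to `rs.initiate()`.

```javascript
// Hypothetical replica set configuration; the set name and host names are placeholders.
// Members with a higher priority are preferred when a primary is elected.
const replicaSetConfig = {
  _id: 'rs0',
  members: [
    { _id: 0, host: 'mongo-a.example.com:27017', priority: 2 }, // stronger node
    { _id: 1, host: 'mongo-b.example.com:27017', priority: 2 }, // stronger node
    { _id: 2, host: 'mongo-c.example.com:27017', priority: 1 }, // weaker node
  ],
};

// Helper for this sketch only: list the hosts that elections will favor.
function preferredPrimaries(config) {
  const top = Math.max(...config.members.map((m) => m.priority));
  return config.members.filter((m) => m.priority === top).map((m) => m.host);
}

console.log(preferredPrimaries(replicaSetConfig));
```

Placing each member in a different availability zone is a deployment decision and does not appear in this document.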
21. DISASTER RECOVERY
• Cold backup using MMS Backup
• Full production replication on another EC2 region:
using mongo’s replication mechanism to
continuously sync data to the backup region
For each customer:
• Aggregate alert notifications from multiple monitoring systems
• Group together alerts that belong to the same monitored appliance
• Group together, into “incidents”, alerts that are (topo-)logically related