Building an event system on top of MongoDB


BigPanda

How we built a super fast, extremely reliable and highly available event system on top of MongoDB


BUILDING A MISSION
CRITICAL EVENT SYSTEM
ON TOP OF MONGODB
by @shahar_kedar

BIGPANDA
SaaS platform that lets companies aggregate alerts
from all their monitoring systems into one place for
faster incident discovery and response.

HOW IT WORKS
Events:
• High CPU on prod-srv-1, 18/06/14 16:05, CRITICAL
• High CPU on prod-srv-1, 18/06/14 16:07, WARNING
• Memory usage on prod-srv-1, 18/06/14 16:08, CRITICAL

Entities:
• High CPU on prod-srv-1: WARNING
• Memory usage on prod-srv-1: CRITICAL

Incidents:
• 2 Alerts on prod-srv-1
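
The flow on this slide (events folded into entities, entities grouped into incidents) can be sketched in plain JavaScript. This is only an illustration, not BigPanda's implementation: the field names `host` and `check`, and the grouping rules (one entity per host and check, one incident per host), are assumptions made for the example.

```javascript
// Hypothetical event shape: { host, check, status, timestamp }.
// Assumed rule: an entity is the latest state of one check on one host.
function foldEventsIntoEntities(events) {
  const entities = new Map();
  const ordered = [...events].sort((a, b) => a.timestamp - b.timestamp);
  for (const event of ordered) {
    const key = `${event.host}/${event.check}`;
    const entity = entities.get(key) || {
      host: event.host,
      check: event.check,
      start: event.timestamp,
      events: [],
    };
    entity.status = event.status; // the most recent event decides the status
    entity.events.push(event);
    entities.set(key, entity);
  }
  return [...entities.values()];
}

// Assumed rule: an incident groups all entities that share a host.
function groupEntitiesIntoIncidents(entities) {
  const incidents = new Map();
  for (const entity of entities) {
    const incident = incidents.get(entity.host) || { host: entity.host, entities: [] };
    incident.entities.push({ check: entity.check, status: entity.status });
    incidents.set(entity.host, incident);
  }
  // The description mirrors the wording used on the slide.
  return [...incidents.values()].map((incident) => ({
    ...incident,
    description: `${incident.entities.length} Alerts on ${incident.host}`,
  }));
}

const events = [
  { host: 'prod-srv-1', check: 'High CPU', status: 'CRITICAL', timestamp: new Date('2014-06-18T16:05:00Z') },
  { host: 'prod-srv-1', check: 'High CPU', status: 'WARNING', timestamp: new Date('2014-06-18T16:07:00Z') },
  { host: 'prod-srv-1', check: 'Memory usage', status: 'CRITICAL', timestamp: new Date('2014-06-18T16:08:00Z') },
];

const incidents = groupEntitiesIntoIncidents(foldEventsIntoEntities(events));
console.log(incidents[0].description);
```

With the three sample events from the slide, the two High CPU events collapse into one entity whose status is the later WARNING, and both entities end up in a single incident.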

PRODUCT REQUIREMENTS
• Events need to be processed into incidents and
streamed to the user’s browser as fast as possible

• Incidents need to reliably reflect the state as it is in
the monitoring system

• The service has to be up and running 24x7

MISSION CRITICAL
• It’s not rocket science, it’s not Google, but:

• It has to be super fast

• It has to be extremely reliable

• It has to always be available

WHY MONGO?
At first:

• NodeJS shop

• Schemaless

• Easy to master

Later on:

• Reliable

• Easy to evolve

• Partial and atomic updates

• Powerful query language
BECAUSE IT’S WEB SCALE!

SUPER FAST
Hardware
Schema Design
Lean & Stream

HARDWARE
• 03/13: 3 x m1.medium
• 02/14: 1 x i2.xlarge + 2 x m1.medium
• 06/14: 2 x i2.xlarge + 1 x m3.xlarge

Instance types:
• m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
• m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
• i2.xlarge: 4 vCPUs, 30.5GB RAM, SSD 800GB

x3 reads
x4 writes

“Schema design is … the largest factor when it comes
to performance and scalability … more important
than hardware, how you shard, or anything else,
schema is by far the most important thing.”
–Eliot Horowitz

SCHEMA DESIGN

Event {
  timestamp: Date,
  status: String,
  description: String
}

Entity {
  start: Date,
  end: Date,
  status: String,
  description: String,
  events: [ <embedded> ],
  source_system: String
}

Incident {
  start: Date,
  end: Date,
  is_active: Boolean,
  description: String,
  entities: [
    { entityId: ObjectId, status: String }
  ]
}

DENORMALIZATION
• Go over the checklist (http://bit.ly/1vUdz2T)

• Incidents => Entities: partially embedded + ref
    • Cardinality: one-to-few
    • Direct access to Entities
    • Entities are frequently updated

• Entities => Events: embedded
    • Events are not directly accessed
    • Events are immutable
    • Cardinality: one-to-many ~ one-to-gazillion
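
To make the two embedding choices concrete, here is a small sketch of what the documents might look like. The values are invented for illustration: ids are plain strings rather than ObjectIds so the snippet runs without a driver, and `source_system: 'nagios'` is a hypothetical example value.

```javascript
// Entity document: events are fully embedded, since they are immutable
// and are never read on their own.
const entity = {
  _id: 'entity-1', // an ObjectId in MongoDB; a string keeps this sketch dependency-free
  start: new Date('2014-06-18T16:05:00Z'),
  end: null,
  status: 'WARNING',
  description: 'High CPU on prod-srv-1',
  source_system: 'nagios', // hypothetical value
  events: [
    { timestamp: new Date('2014-06-18T16:05:00Z'), status: 'CRITICAL', description: 'High CPU on prod-srv-1' },
    { timestamp: new Date('2014-06-18T16:07:00Z'), status: 'WARNING', description: 'High CPU on prod-srv-1' },
  ],
};

// Partial embedding: the incident stores a reference plus the one field it
// needs for display, and leaves the rest in the entity collection.
function toIncidentEntry(sourceEntity) {
  return { entityId: sourceEntity._id, status: sourceEntity.status };
}

const incident = {
  start: entity.start,
  end: null,
  is_active: true,
  description: '1 Alert on prod-srv-1',
  entities: [toIncidentEntry(entity)],
};

console.log(JSON.stringify(incident.entities));
```

The trade-off in this sketch is that `status` is stored twice, so whatever updates an entity's status also has to update the copy inside the incident.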

INDEXES
• Optimized indexes by inspecting query plans with explain(): 
db.collection.find({..}).explain()

• Removed redundant indexes

• Truncated events collections (TTL index)
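
As a rough sketch of the first and last points, the helpers below take a `collection` argument that is assumed to be a MongoDB Node.js driver `Collection` (anything exposing the same `createIndex` and `find` methods would do). The indexed field name `timestamp` and the TTL value are assumptions, not details from the talk.

```javascript
// Creates a TTL index: MongoDB removes documents once the indexed date field
// is older than ttlSeconds. Field name and TTL are illustrative choices.
async function ensureEventTtlIndex(collection, ttlSeconds) {
  return collection.createIndex({ timestamp: 1 }, { expireAfterSeconds: ttlSeconds });
}

// Returns the query plan, so you can check which index (if any) the query uses.
async function explainQuery(collection, filter) {
  return collection.find(filter).explain();
}
```

A call might look like `await ensureEventTtlIndex(db.collection('events'), 7 * 24 * 3600)`, where the collection name and the seven-day retention are again placeholders.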

LEAN QUERIES
• Use projections to limit the fields returned by a query: 
Model.find().select('-events')

• Mongoose users: use .lean() when possible for a performance boost of more than 50%: 
Model.find().lean()

• Stream results: 
Model.find().stream().on('data', function(doc){})
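
The three techniques can be combined. The sketch below assumes `Model` is a Mongoose model and uses hypothetical helper names. It also swaps in `Query#cursor()` for streaming, because `Query#stream()` as written on the slide belongs to older Mongoose releases; check which one your version provides.

```javascript
// Projection + lean: skip the large embedded `events` array and return plain
// JavaScript objects instead of full Mongoose documents.
function findWithoutEvents(Model, filter) {
  return Model.find(filter)
    .select('-events')
    .lean()
    .exec();
}

// Streaming: handle documents one at a time instead of buffering the whole
// result set in memory. Resolves when the cursor is exhausted.
function streamEach(Model, filter, onDoc) {
  const cursor = Model.find(filter).select('-events').lean().cursor();
  return new Promise((resolve, reject) => {
    cursor.on('data', onDoc);
    cursor.on('error', reject);
    cursor.on('end', resolve);
  });
}
```

Note that lean results have no Mongoose getters, virtuals or `save()`, so this sketch only suits read-only paths.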

RESULTS
• Average latency of all API calls went from 500ms
to under 20ms

• Average latency of full pipeline went from 2s to
under 500ms

• Peak time latency of full pipeline went down from
5m(!!) to less than 30s

EXTREMELY
RELIABLE
Atomic & Partial Updates

ATOMIC & PARTIAL UPDATES
• Several services might try to update the same
document at the same time, but:

• Different systems update different parts of the
document

• Updates to the same document are sharded and
ordered at the application level  
(read our awesome blog post: http://bit.ly/1nQVcbS)
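
A minimal sketch of both ideas follows. It is not the approach from the linked blog post, which may well differ. The update builders use MongoDB's real `$set` and `$push` operators; `shardFor` and `KeyedSerialQueue` are hypothetical helpers invented here to illustrate routing and ordering updates per document.

```javascript
// Partial updates: each writer names only the fields it owns, so MongoDB
// applies the change atomically without overwriting the rest of the document.
function buildStatusUpdate(status) {
  return { $set: { status } };
}

function buildAppendEventUpdate(event) {
  return { $push: { events: event } };
}

// Application-level sharding: the same document id always maps to the same worker.
function shardFor(documentId, shardCount) {
  let hash = 0;
  for (const char of String(documentId)) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash % shardCount;
}

// Application-level ordering: tasks for one key run strictly one after another,
// in submission order, while different keys proceed independently.
class KeyedSerialQueue {
  constructor() {
    this.tails = new Map();
  }

  enqueue(key, task) {
    const previous = this.tails.get(key) || Promise.resolve();
    const run = previous.then(() => task());
    const tail = run.catch(() => {}); // a failed task must not block later ones
    this.tails.set(key, tail);
    tail.then(() => {
      if (this.tails.get(key) === tail) this.tails.delete(key); // drop idle keys
    });
    return run;
  }
}
```

In practice each task would issue something like `collection.updateOne({ _id: id }, buildStatusUpdate('WARNING'))`, with the document id used as the queue key. The queue only orders updates within one process; across processes, routing by `shardFor` (or a similar scheme) has to ensure a document is always handled by the same worker.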

IMPOSSIBLE TO
KILL
Replica Set
Disaster Recovery

REPLICA SET
• 3-node replica set

• Using priorities to favor the election of stronger nodes as
primary

• Deployed on different availability zones
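
A replica set configuration along these lines could be built as follows. The set name, hostnames and priority values are placeholders and do not come from the talk; only the `_id`, `members`, `host` and `priority` fields are standard parts of a MongoDB replica set configuration document.

```javascript
// Builds a replica set configuration document. Members with a higher
// priority are preferred when a primary is elected.
function buildReplicaSetConfig(name, members) {
  return {
    _id: name,
    members: members.map((member, index) => ({
      _id: index,
      host: member.host,
      priority: member.priority,
    })),
  };
}

// Placeholder hosts: one per availability zone, stronger nodes ranked higher.
const replicaSetConfig = buildReplicaSetConfig('rs0', [
  { host: 'mongo-a.example.com:27017', priority: 3 }, // strongest node, zone A
  { host: 'mongo-b.example.com:27017', priority: 2 }, // zone B
  { host: 'mongo-c.example.com:27017', priority: 1 }, // weakest node, zone C
]);

console.log(JSON.stringify(replicaSetConfig, null, 2));
```

In the mongo shell, a document of this shape is what `rs.initiate()` accepts when a new replica set is created.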

DISASTER RECOVERY
• Cold backup using MMS Backup

• Full production replication on another EC2 region:
using mongo’s replication mechanism to
continuously sync data to the backup region



Editor's Notes

For each customer:
• aggregate alert notifications from multiple monitoring systems
• group together alerts that belong to the same monitored appliance
• group together, into “incidents”, alerts that are (topo-)logically related
