A discussion of Datadog’s experiences, both successes and challenges, in building its monitoring solutions on top of AWS Lambda and Amazon API Gateway, with the goal of reducing latency and increasing performance while cutting infrastructure costs.
16. Latency as seen by users
Upstream Latency: data points are available after 2 to 11 minutes
Scheduling Latency: we schedule crawlers every 1 to 10 minutes
Crawler Latency: how much time it takes to fetch the data
Crawler-based metrics can’t be real time
Throttling
API call charges
Infrastructure cost
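As a rough illustration of why crawler-based metrics cannot be real time, the latency figures above can be combined into best-case and worst-case bounds. This is a simplified additive model, not Datadog's actual accounting: the function name `latency_bounds` and the one-minute fetch time are assumptions made for the example.

```python
def latency_bounds(upstream, schedule_interval, fetch_minutes):
    """Return (best, worst) end-to-end latency in minutes.

    upstream: (min, max) minutes before data points are available upstream.
    schedule_interval: minutes between two crawler runs.
    fetch_minutes: assumed time the crawler needs to fetch the data.
    """
    # Best case: the crawler runs right as the data becomes available.
    best = upstream[0] + fetch_minutes
    # Worst case: the data lands just after a run, so it waits a full interval.
    worst = upstream[1] + schedule_interval + fetch_minutes
    return best, worst


# Slide figures: upstream 2 to 11 minutes, crawlers every 1 to 10 minutes.
# The 1-minute fetch time is a placeholder, not a number from the talk.
print(latency_bounds((2, 11), schedule_interval=10, fetch_minutes=1))
print(latency_bounds((2, 11), schedule_interval=1, fetch_minutes=1))
```

Even with the shortest scheduling interval, the upstream delay alone keeps the result in the range of minutes under this model.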
22. AWS Lambda
- Serverless / Operation-less
- Event-driven architecture with many integrations
- API Gateway for custom integrations
23. Agenda
● Pulling data via crawlers generates latency and operational cost
● Using Lambda to minimize latency and reduce operational costs
○ RDS enhanced monitoring with Lambda
33. Lambda allows sub-minute latency for those metrics
[Figure: crawler-based vs. Lambda-based]
36. Amazon RDS enhanced monitoring
+ Sub-minute latency
+ No crawler to run and maintain
+ No internal state to remember which points to process
+ No Ops
- Not as easy to set up and troubleshoot
- Not easy to update
- No ad hoc replay
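A minimal sketch of the push model, assuming the enhanced monitoring data reaches the function through a CloudWatch Logs subscription. CloudWatch Logs delivers such events to Lambda as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The names `decode_cloudwatch_logs_event` and `handler` are illustrative, and the forwarding step is left as a comment because the slides do not show it.

```python
import base64
import gzip
import json


def decode_cloudwatch_logs_event(event):
    """Decode the payload of a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Lambda entry point: runs once per pushed batch, keeps no state."""
    data = decode_cloudwatch_logs_event(event)
    messages = [log_event["message"] for log_event in data["logEvents"]]
    # A real function would forward `messages` to the monitoring intake here;
    # that call is omitted because its details are not part of the slides.
    return messages
```

Because each invocation receives exactly the points to process, the function needs no crawler schedule and no record of what it has already seen, which matches the advantages listed above.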
37. Agenda
● Pulling data via crawlers generates latency and operational cost
● Using Lambda to minimize latency and reduce operational costs
○ Agent Release Process with Lambda
42. Agenda
● Pulling data via crawlers generates latency and operational cost
● Using Lambda to minimize latency and reduce operational costs
○ Using Lambda to extract custom metrics from Lambda
49. How to submit data from CloudWatch Logs to Datadog?
CloudWatch Logs contains:
user.submit|1|timestamp1
user.submit|1|timestamp2
user.submit|1|timestamp1
Datadog Intake expects:
user.submit|1|timestamp2
user.submit|2|timestamp1
50. We need to aggregate the data points
CloudWatch Logs contains:
user.submit|1|timestamp1
user.submit|1|timestamp2
user.submit|1|timestamp1
Datadog Intake expects:
user.submit|1|timestamp2
user.submit|2|timestamp1
Aggregation:
user.submit|timestamp1:[1, 1]
user.submit|timestamp2:[1]
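The aggregation step above can be sketched as follows. The `name|value|timestamp` layout is taken from the slide; the function name `aggregate` and the choice to sum values per timestamp (counter semantics) are assumptions based on the expected output shown.

```python
from collections import defaultdict


def aggregate(lines):
    """Group points by (metric name, timestamp) and sum their values."""
    buckets = defaultdict(list)
    for line in lines:
        name, value, timestamp = line.split("|")
        buckets[(name, timestamp)].append(int(value))
    # Emit one point per bucket, in a stable order.
    return [
        f"{name}|{sum(values)}|{timestamp}"
        for (name, timestamp), values in sorted(buckets.items())
    ]


points = [
    "user.submit|1|timestamp1",
    "user.submit|1|timestamp2",
    "user.submit|1|timestamp1",
]
print(aggregate(points))
# ['user.submit|2|timestamp1', 'user.submit|1|timestamp2']
```

This works when all points for a timestamp arrive in one batch. When they are spread over several invocations, the buckets have to outlive a single call.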
52. How can we build this with Lambda?
CloudWatch Logs -> Push -> Aggregation -> Push -> Datadog Intake
53. How can we build this with Lambda?
CloudWatch Logs -> Push -> Aggregation Service -> Push -> Datadog Intake
54. We need a stateful Lambda pipeline to aggregate metrics
CloudWatch Logs -> Push -> Aggregation Service -> Push -> Datadog Intake
55. Building a stateful Lambda pipeline
Input points:
user.submit|1|ts1
user.submit|1|ts2
user.submit|1|ts1
Aggregation Service state:
user.submit|timestamp1:[1, 1]
user.submit|timestamp2:[1]
Output points:
user.submit|2|ts1
user.submit|1|ts2
56. A database stores the state
Aggregation Service state (user.submit|timestamp1:[1, 1], user.submit|timestamp2:[1]) is kept in an external store.
Stores shown: DynamoDB, Kinesis, ElastiCache
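To illustrate why the state has to live outside the function, here is a minimal sketch in which an in-memory class stands in for the external store. `InMemoryCounterStore`, `ingest`, and `flush` are hypothetical names; a real pipeline would replace the class with calls to a service such as DynamoDB or ElastiCache, whose client APIs are deliberately not shown here.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for an external store shared by all invocations (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = {}

    def add(self, key, amount):
        # Atomic increment, so concurrent invocations do not lose updates.
        with self._lock:
            self._totals[key] = self._totals.get(key, 0) + amount

    def drain(self):
        # Return everything accumulated so far and reset the store.
        with self._lock:
            totals, self._totals = self._totals, {}
            return totals


def ingest(store, lines):
    """One invocation: add each incoming point to the shared state."""
    for line in lines:
        name, value, timestamp = line.split("|")
        store.add((name, timestamp), int(value))


def flush(store):
    """Emit one aggregated point per (name, timestamp) and clear the state."""
    return sorted(f"{name}|{total}|{ts}" for (name, ts), total in store.drain().items())


store = InMemoryCounterStore()
ingest(store, ["user.submit|1|ts1", "user.submit|1|ts2"])  # first invocation
ingest(store, ["user.submit|1|ts1"])                       # second invocation
print(flush(store))
```

The two `ingest` calls model separate Lambda invocations that share nothing except the store, which is the property the slide attributes to the database.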
63. Takeaways
- Lambda allows us to move towards a push system
- Lambda is great for small, stateless, event-based tasks
- We’re seeing adoption amongst our users