As engineers, we're empowered by advancements in cloud platforms to build ever more complex systems that can achieve amazing feats at a scale previously only possible for the elite few. Monitoring tools have evolved over the years to accommodate our growing needs with these increasingly complex systems, but the emergence of serverless technologies like AWS Lambda has shifted the landscape and broken some of the underlying assumptions that existing tools are built upon. For example, you can no longer access the underlying host to install monitoring agents or daemons, and it's no longer feasible to use background threads to send monitoring data outside the critical path.
Furthermore, event-driven architectures have become easily accessible and widely adopted by those adopting serverless technologies. This trend adds another layer of complexity to how we monitor and debug our systems, as it involves tracing executions that flow through asynchronous invocations and are often fanned out and fanned in via various event processing patterns.
Join us in this talk as serverless expert Yan Cui gives an overview of the challenges of observing a serverless architecture, the tradeoffs to consider, the current state of the tooling for serverless observability, and a sneak peek at some of the new and upcoming tools that will hopefully show us what the future of serverless observability might look like.
14. “However, I would argue that the health of the system no
longer matters. We've entered an era where what matters is
the health of each individual event, or each individual user's
experience, or each shopping cart's experience (or other high
cardinality dimensions). With distributed systems you
don't care about the health of the system, you care about
the health of the event or the slice.”
- Charity Majors, http://bit.ly/2E2QngU
16. “These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics”
- Observability Engineering at Twitter, http://bit.ly/2DnjyuW
30. [Diagram: user requests → handlers]
critical paths: minimise user-facing latency
31. [Diagram: user requests → handlers, alongside StatsD and rsyslog]
critical paths: minimise user-facing latency
background processing: batched, asynchronous, low overhead
32. [Diagram: user requests → handlers, alongside StatsD and rsyslog]
critical paths: minimise user-facing latency
background processing: batched, asynchronous, low overhead
NO background processing except what platform provides
47. new challenges
• nowhere to install agents/daemons
• no background processing
• higher concurrency to telemetry system
• high chance of data loss (if batching)
57. new challenges
• nowhere to install agents/daemons
• no background processing
• higher concurrency to telemetry system
• high chance of data loss (if batching)
• asynchronous invocations
76. Those extra 10-20ms for sending custom metrics
compound when you have microservices and
multiple APIs are called within one slice
of a user event.
77. Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA
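To make the compounding concrete, here is a rough back-of-the-envelope sketch. The number of API calls per user event (5) is a made-up figure for illustration, and the calls are assumed to happen one after another; only the 10-20ms per-call overhead comes from the slide.

```javascript
// Back-of-the-envelope estimate of how per-call metric overhead adds up.
// Assumption: the calls are sequential, so their overheads simply sum.
function addedLatencyMs(callsInSlice, perCallOverheadMs) {
  return callsInSlice * perCallOverheadMs;
}

const calls = 5; // hypothetical number of API calls in one user event
const low = addedLatencyMs(calls, 10); // lower bound from the slide: 10ms per call
const high = addedLatencyMs(calls, 20); // upper bound from the slide: 20ms per call

console.log(`${calls} calls add ${low}-${high}ms of user-facing latency`);
```

Under these assumptions the overhead alone reaches the 100ms range quoted for Amazon above.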
78. console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
The first two lines are logs; the last two are metrics, in the format
MONITORING|timestamp|metric value|metric type|metric name
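Writing metrics as specially formatted log lines presumably only pays off if something downstream can pick them out again. The sketch below shows one way the MONITORING lines above could be separated from ordinary logs. The function names `parseMetric` and `splitLogsAndMetrics` are hypothetical, and treating the timestamp as epoch seconds is an assumption.

```javascript
// Parses one line in the MONITORING|timestamp|value|type|name format.
// Returns null for ordinary log lines or malformed metric lines.
function parseMetric(line) {
  const parts = line.trim().split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") {
    return null;
  }
  const [, timestamp, value, type, name] = parts;
  const metric = {
    timestamp: Number(timestamp), // assumed to be epoch seconds
    value: Number(value),
    type,
    name,
  };
  if (Number.isNaN(metric.timestamp) || Number.isNaN(metric.value)) {
    return null;
  }
  return metric;
}

// Splits a batch of log lines into plain logs and parsed metrics.
function splitLogsAndMetrics(lines) {
  const logs = [];
  const metrics = [];
  for (const line of lines) {
    const metric = parseMetric(line);
    if (metric) {
      metrics.push(metric);
    } else {
      logs.push(line);
    }
  }
  return { logs, metrics };
}

const sample = [
  "fetching user info from user-api",
  "MONITORING|1489795335|27.4|latency|user-api-latency",
  "MONITORING|1489795335|8|count|yubls-served",
];
const result = splitLogsAndMetrics(sample);
console.log(result.metrics);
```

Because the function only calls console.log, no extra network round trip is added to the critical path; forwarding the parsed metrics to a telemetry system would be a separate step not shown here.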
89. narrow focus on a function
good for homing in on performance issues
for a particular function, but offers little to
help you build intuition about how your
system operates as a whole.
93. don’t span over async invocations
good for identifying dependencies of a function,
but not good enough for tracing the entire call
chain as user request/data flows through the
system via async event sources.
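One common workaround, which the slide itself does not describe, is to carry a correlation ID inside the message so that logs on both sides of an async hop can be joined later. The sketch below assumes a JSON message envelope; the `__correlation` field and the `wrapMessage`, `unwrapMessage` and `logLine` helpers are hypothetical names, and no real queue or SDK is involved.

```javascript
const crypto = require("crypto");

// Wraps a payload together with its correlation IDs before it is
// published to an async event source (queue, stream, topic, ...).
function wrapMessage(payload, correlationIds) {
  return JSON.stringify({ __correlation: correlationIds, payload });
}

// Unwraps a received message. If the producer did not attach any
// correlation IDs, a fresh one is generated so the chain starts here.
function unwrapMessage(raw) {
  const { __correlation, payload } = JSON.parse(raw);
  const correlationIds =
    __correlation || { "x-correlation-id": crypto.randomUUID() };
  return { correlationIds, payload };
}

// Produces a structured log line that always includes the correlation IDs.
function logLine(correlationIds, message) {
  return JSON.stringify({ ...correlationIds, message });
}

// Upstream function: handles the user request and publishes an event.
const upstreamIds = { "x-correlation-id": crypto.randomUUID() };
const upstreamLog = logLine(upstreamIds, "publishing image task");
const published = wrapMessage({ userId: "theburningmonk" }, upstreamIds);

// Downstream function: invoked asynchronously with the published message.
const received = unwrapMessage(published);
const downstreamLog = logLine(received.correlationIds, "processing image task");

console.log(upstreamLog);
console.log(downstreamLog);
```

Both log lines carry the same `x-correlation-id`, so a log aggregation tool could group them even though the two functions never call each other directly.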
112. [Dashboard panel for the function SubscriberGetAccount]
• circle colour and size represent health and traffic volume
• 2 minutes of request rate to show relative changes in traffic
• no. of concurrent executions of this function (Concurrency: 370)
• Request rate (Req Rate: 20,056.0/s)
• Estimated cost (Est Cost: $54.0/s)
• Error percentage of last 10 seconds (0 %)
• Cold start percentage of last 10 seconds (0 %)
• last minute latency percentiles (Median 1ms, Mean 4ms, 90th 10ms, 99th 44ms, 99.5th 61ms)
• Rolling 10 second counters with 1 second granularity: Successes, Cold starts, Timeouts, Throttled Invocations, Errors (values shown: 200,545 / 0 / 19 / 94 / 0)
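The "rolling 10 second counters with 1 second granularity" can be pictured as one bucket per second, with buckets older than the window dropped. The class below is a minimal sketch of that idea and not the implementation behind the dashboard; `RollingCounter` and its methods are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```javascript
// A rolling counter that keeps one bucket per second and reports the
// total over the most recent `windowSeconds` seconds.
class RollingCounter {
  constructor(windowSeconds = 10) {
    this.windowSeconds = windowSeconds;
    this.buckets = new Map(); // epoch second -> count for that second
  }

  // Adds `by` to the bucket for the second containing `nowMs`.
  increment(nowMs = Date.now(), by = 1) {
    const second = Math.floor(nowMs / 1000);
    this.buckets.set(second, (this.buckets.get(second) || 0) + by);
    this.evict(second);
  }

  // Drops buckets that have fallen out of the window.
  evict(currentSecond) {
    const oldest = currentSecond - this.windowSeconds + 1;
    for (const second of this.buckets.keys()) {
      if (second < oldest) {
        this.buckets.delete(second);
      }
    }
  }

  // Sums the buckets that fall inside the window ending at `nowMs`.
  total(nowMs = Date.now()) {
    const currentSecond = Math.floor(nowMs / 1000);
    const oldest = currentSecond - this.windowSeconds + 1;
    let sum = 0;
    for (const [second, count] of this.buckets) {
      if (second >= oldest && second <= currentSecond) {
        sum += count;
      }
    }
    return sum;
  }
}

const timeouts = new RollingCounter(10);
timeouts.increment(0); // lands in second 0
timeouts.increment(4000, 2); // lands in second 4
timeouts.increment(9500); // lands in second 9

console.log(timeouts.total(9999)); // window covers seconds 0..9
console.log(timeouts.total(10000)); // window covers seconds 1..10
console.log(timeouts.total(14000)); // window covers seconds 5..14
```

One such counter per outcome (successes, cold starts, timeouts, throttled invocations, errors) would give the five numbers shown in the panel.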
131. [Architecture diagram: user, POST /user, create-user, create-auth0-user, Auth0, profile-images, process-images, image-tasks, resize-images, reformat-images, tag-user, Face API]
Logs (annotation: click here to go to code)
timestamp                component    level  message
2018/01/25 20:51:23.201  create-user  debug  saving user [theburningmonk] in the [user] table…
2018/01/25 20:51:23.215  create-user  debug  saved user [theburningmonk] in the [user] table
2018/01/25 20:51:23.585  create-user  debug  uploading profile image…
2018/01/25 20:51:23.587  create-user  debug  tagged user [theburningmonk] with Azure Face API…