This talk is a practical introduction to distributed tracing. We will see how to make the best use of open source distributed tracing platforms like Hypertrace with Azure to find the root cause of problems and anticipate issues in critical business applications.
2. Hello!
I am Jayesh Bapu Ahire
- Founding Engineer, Traceable.ai
- AWS ML Hero | Twilio Champion
- AI Researcher
- Find me at @Jayesh_Ahire1
3. Agenda
◎ Observability: Why and What?
◎ Telemetry
◎ Metrics, Logs and Traces!
◎ What is Distributed Tracing?
◎ Why do I care?
◎ Demo! (Hoping it won’t fail!)
5. “Monitoring tells you whether a
system is working; observability
lets you ask why it isn’t working.”
- Baron Schwartz
6. Why Observability?
◎ Monoliths to microservices
◎ Microservices create complex interactions.
◎ Failure modes are unpredictable.
◎ Rise of API ecosystem
◎ Monitoring alone can no longer help us.
8. What is Observability?
◎ In control theory, observability is defined as a
measure of how well internal states of a system
can be inferred from knowledge of that system’s
external outputs. Simply put, observability is how
well you can understand your complex system.
◎ Metrics, events, logs, and traces—or MELT—are at
the core of observability. But observability is about
a whole lot more than just data.
9. What is Observability?
What characteristics did the queries that timed
out at 500ms have in common? Service
versions? Browser plugins?
- Instrumentation produces data.
- Querying data answers our questions.
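The question above can be sketched as a query over structured events. This is a minimal illustration, not the API of any particular tool: the event fields (`duration_ms`, `service_version`, `browser_plugin`) and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical structured events emitted by instrumentation code.
EVENTS = [
    {"duration_ms": 520, "service_version": "v2", "browser_plugin": "adblock"},
    {"duration_ms": 80, "service_version": "v1", "browser_plugin": None},
    {"duration_ms": 610, "service_version": "v2", "browser_plugin": None},
    {"duration_ms": 95, "service_version": "v2", "browser_plugin": "adblock"},
    {"duration_ms": 505, "service_version": "v2", "browser_plugin": "adblock"},
]

TIMEOUT_MS = 500


def shared_characteristics(events, field, timeout_ms=TIMEOUT_MS):
    """Count the values of `field` among events at or above the timeout."""
    slow = [e for e in events if e["duration_ms"] >= timeout_ms]
    return Counter(e[field] for e in slow)


# Which service versions and browser plugins do the slow queries share?
by_version = shared_characteristics(EVENTS, "service_version")
by_plugin = shared_characteristics(EVENTS, "browser_plugin")
```

In this made-up data every slow query ran on the same service version, while the plugin values are mixed, so the version is the more promising lead.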
10. Telemetry aids observability
◎ Telemetry data isn't observability itself.
◎ Instrumentation code is how we get
telemetry.
◎ Telemetry data describes events in the
system.
All different views into the same underlying
truth.
11. Metrics, Logs and Traces!
◎ Metrics: Aggregated summary statistics.
◎ Logs: Detailed debugging information emitted by
processes.
◎ Distributed Tracing: Provides insights into the
full lifecycles, aka traces of requests to a
system, allowing you to pinpoint failures and
performance issues.
Structured data can be transmitted into any of
these!
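One way to picture the three signals as views of the same data is to derive each of them from one list of structured events. The sketch below assumes a hypothetical event shape (`trace_id`, `span_id`, `parent_id`, `service`, and so on); it is not a real telemetry format.

```python
import json
from collections import defaultdict

# Hypothetical structured events; all field names are illustrative.
events = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /cart",
     "service": "frontend", "start_ms": 0, "end_ms": 120, "status": 200},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "SELECT cart",
     "service": "cart-db", "start_ms": 10, "end_ms": 90, "status": 200},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "name": "GET /cart",
     "service": "frontend", "start_ms": 0, "end_ms": 40, "status": 500},
]


def to_metrics(events):
    """Metric view: aggregate event count and total latency per service."""
    out = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for e in events:
        m = out[e["service"]]
        m["count"] += 1
        m["total_ms"] += e["end_ms"] - e["start_ms"]
    return dict(out)


def to_log_line(event):
    """Log view: one detailed, machine-readable line per event."""
    return json.dumps(event, sort_keys=True)


def to_trace(events, trace_id):
    """Trace view: the spans of one request, ordered by start time."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    return sorted(spans, key=lambda e: e["start_ms"])


metrics = to_metrics(events)
log_lines = [to_log_line(e) for e in events]
trace_t1 = to_trace(events, "t1")
```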
16. What is Distributed Tracing?
◎ Distributed tracing tracks production
requests as they touch different parts of your
architecture over time.
◎ Requests have a unique trace ID, which you
can use to look up a trace diagram or the log
entries related to it.
◎ Causal diagrams are easier to understand
than scrolling through logs.
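The trace ID lookup can be sketched with an in-memory log index. `LogIndex`, `new_trace_id`, and the 32-hex-character ID format are assumptions made for this example; a real system would store logs in a backend and propagate the ID in request headers.

```python
import uuid
from collections import defaultdict


def new_trace_id():
    """Generate a random ID as 32 hex characters (an assumed format)."""
    return uuid.uuid4().hex


class LogIndex:
    """Hypothetical in-memory log store keyed by trace ID."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def log(self, trace_id, service, message):
        # Every log entry carries the trace ID of the request it belongs to.
        self._by_trace[trace_id].append((service, message))

    def lookup(self, trace_id):
        """Return all log entries recorded for one request, in order."""
        return list(self._by_trace.get(trace_id, []))


index = LogIndex()
checkout = new_trace_id()
search = new_trace_id()

# Entries from two concurrent requests arrive interleaved.
index.log(checkout, "frontend", "POST /checkout received")
index.log(search, "frontend", "GET /search received")
index.log(checkout, "payments", "card declined")

checkout_logs = index.lookup(checkout)
```

Looking up `checkout` returns only the two entries for that request, without scrolling past the unrelated search request.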
17. Why do I care?
◎ Reduce time in triage by contextualizing
errors and delays
◎ Visualize latency, such as time spent in my
service vs. waiting for other services
◎ Understand complex applications like async
code or microservices
◎ See your architecture with live dependency
diagrams built from traces
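Two of the points above, time in my service vs. waiting for other services and dependency diagrams built from traces, can be sketched from span data alone. The span tuples below are hypothetical, and the self-time calculation assumes a span's children run one after another without overlapping.

```python
from collections import defaultdict

# Hypothetical spans: (span_id, parent_id, service, start_ms, end_ms)
SPANS = [
    ("a", None, "frontend", 0, 100),
    ("b", "a", "cart", 10, 60),
    ("c", "b", "cart-db", 20, 50),
    ("d", "a", "payments", 60, 90),
]


def self_times(spans):
    """Duration of each span minus the time spent in its direct children.

    Assumes children do not overlap, which keeps the sketch simple.
    """
    duration = {sid: end - start for sid, _, _, start, end in spans}
    child_total = defaultdict(int)
    for _, parent, _, start, end in spans:
        if parent is not None:
            child_total[parent] += end - start
    return {sid: duration[sid] - child_total[sid] for sid in duration}


def dependency_edges(spans):
    """Derive caller -> callee service edges from parent/child spans."""
    service_of = {sid: service for sid, _, service, _, _ in spans}
    edges = set()
    for _, parent, service, _, _ in spans:
        if parent is not None and service_of[parent] != service:
            edges.add((service_of[parent], service))
    return sorted(edges)


times = self_times(SPANS)
edges = dependency_edges(SPANS)
```

Here the frontend span lasts 100 ms but spends only 20 ms in its own code; the rest is waiting on `cart` and `payments`.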
19. Tracing concepts
◎ Span
○ Represents a single unit of work in a system.
○ Typically encapsulates: operation name, a start and finish timestamp, the
parent span identifier, the span identifier, and context items.
◎ Trace
○ Defined implicitly by its spans. A trace can be thought of as a directed
acyclic graph of spans where the edges between spans are defined as
parent/child relationships.
◎ DistributedContext
○ Contains the tracing identifiers, tags, and options that are propagated
from parent to child spans
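The three concepts above can be sketched as plain data structures. This is an illustration under stated assumptions, not the data model of Hypertrace or any tracing library: the class fields, the `start_span` helper, and the counter-based IDs are made up for the example (real tracers use random IDs).

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional

_ids = itertools.count(1)  # deterministic IDs for the sketch only


@dataclass
class DistributedContext:
    """Identifiers and tags propagated from parent to child spans."""
    trace_id: str
    span_id: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    """A single unit of work in the system."""
    operation_name: str
    context: DistributedContext
    parent_id: Optional[str]
    start: float
    finish: Optional[float] = None


def start_span(name, start, parent=None, tags=None):
    """Create a span; a child inherits its parent's trace ID and tags."""
    span_id = f"span-{next(_ids)}"
    if parent is None:
        trace_id = f"trace-{next(_ids)}"
        inherited = dict(tags or {})
    else:
        trace_id = parent.context.trace_id
        inherited = {**parent.context.tags, **(tags or {})}
    ctx = DistributedContext(trace_id, span_id, inherited)
    return Span(name, ctx, parent.context.span_id if parent else None, start)


def children_by_parent(spans) -> Dict[Optional[str], List[Span]]:
    """The trace is implicit: group spans by parent ID to recover its edges."""
    tree: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        tree.setdefault(span.parent_id, []).append(span)
    return tree


root = start_span("GET /checkout", start=0.0, tags={"tenant": "acme"})
auth = start_span("auth.verify", start=1.0, parent=root)
db = start_span("db.query", start=2.0, parent=auth)
tree = children_by_parent([root, auth, db])
```

No `Trace` class is needed here: the parent/child links on the spans are enough to rebuild the graph.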
23. Why Hypertrace?
Topology
View the topology of all
services and backends in
real time. Scales to hundreds
of them.
Dashboards
Pre-canned dashboards for
Services, APIs and Backends
Extensibility
Write your own enrichers to
enrich trace data for your
business needs and create
views around your business
needs.
Interoperable
Works out of the box with
open source tracing formats like
Zipkin, OTel, Jaeger.
GraphQL APIs
Traces, spans and entities
exposed through GraphQL
APIs. Build the next creative
use case
API & Trace Analytics
Powerful slice & dice of all
the data powered by Apache
Pinot
27. Credits
Special thanks to all the people who made and released
these awesome resources for free:
◎ Presentation template by SlidesCarnival
◎ Photographs by Unsplash