Microservice architecture is widespread these days. It comes with many benefits, but also challenges to solve. The main goal of this talk is to walk through troubleshooting and debugging in the distributed microservice world. The talk covers:
main aspects of the logging,
monitoring,
distributed tracing,
debugging services on the cluster.
About speaker:
Andrey Kolodnitskiy is a Staff Engineer at Lohika, and his primary focus is distributed systems, microservices and JVM-based languages.
Engineers spend the majority of their time debugging and fixing issues. This talk is dedicated to the best practices and tools Andrey's team uses on its project, which help to find issues more efficiently.
4. The challenge
Monolithic application
• Single process
• Holistic view
• Simple infrastructure
• Can be deployed/debugged
locally
Microservice application
• Multiple processes
• Fragmented view
• Complex infrastructure
• Local deployment/debug
can be an issue
5. The challenge (most optimistic figures)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.370.9611&rep=rep1&type=pdf
7. Observability
Monitoring
• Provides high level view of the
system health and performance
(Grafana, Prometheus,
VictoriaMetrics)
Logging
• Keep record of input data,
processing and results in the
application (Elasticsearch, Fluent
bit, Kibana)
Tracing
• Insights about specific operation
(Open tracing, Jaeger)
8. Monitoring
Why
• A way to get a bird's-eye view of infrastructure and service health
• A way to get information about the performance of the system and individual components
• A way to be alerted on SLA/SLO
What
• Infrastructure health and resource utilization
• Application and individual service health and resource utilization
• Application and individual service performance
• Application and individual service errors
9. Monitoring – How?
• Define naming conventions for the metrics
• Structure dashboards
• Build dashboards to be used with predefined techniques, e.g., layer peeling, exemplars
• Dashboards for infrastructure and applications, i.e., follow a methodology (USE, RED)
• Dashboards for specific services, e.g., Java and Spring
• Avoid having a lot of custom dashboards and too much data
• Avoid high data cardinality when using tags
• Avoid false-positive alerts
• Look for predefined dashboards, e.g., Spring
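To see why high tag cardinality matters, here is a small sketch (plain Python, not tied to any monitoring client; the label sets are invented). In Prometheus-style systems each distinct combination of label values becomes its own time series, so the series count grows multiplicatively:

```python
# Each label combination becomes a separate time series, so the series
# count is the product of the per-label cardinalities.
def series_count(label_values: dict) -> int:
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

# Bounded labels keep the series count manageable.
ok = series_count({"method": ["GET", "POST"],
                   "status": ["2xx", "4xx", "5xx"]})
# An unbounded label such as user_id multiplies every other label.
bad = series_count({"method": ["GET", "POST"],
                    "status": ["2xx", "4xx", "5xx"],
                    "user_id": [str(i) for i in range(100_000)]})
print(ok, bad)  # 6 600000
```

This is why tags should be drawn from small, fixed value sets; request-scoped identifiers belong in logs and traces, not in metric labels.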
10. USE and RED
USE
• Utilization: the proportion of the resource that is used, so 100% utilization means no more work can be accepted;
• Saturation: the degree to which the resource has extra work which it can't service, often queued;
• Errors: the count of error events;
RED
• Rate: the number of requests our service is serving;
• Errors: the number of failed requests;
• Duration: the amount of time it takes to process a request;
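A minimal sketch of how the RED numbers could be derived from raw request records (the records and the 10-second window below are made up for illustration):

```python
# Hypothetical request records: (HTTP status, duration in ms)
requests = [(200, 12.0), (200, 8.5), (500, 40.0), (200, 15.0), (503, 52.0)]
window_seconds = 10

rate = len(requests) / window_seconds                       # Rate: requests per second
errors = sum(1 for status, _ in requests if status >= 500)  # Errors: failed requests
durations = sorted(d for _, d in requests)
median_ms = durations[len(durations) // 2]                  # Duration: naive median

print(rate, errors, median_ms)  # 0.5 2 15.0
```

In practice a system like Prometheus computes these from counters and histograms rather than raw records, but the three signals are exactly these.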
11. Logging
Why
• Monitoring and troubleshooting the application (for engineers)
• Helping operations
• Security and compliance
• A way to be alerted on SLA/SLO
What
• Application events:
• Availability events (startup/shutdown)
• Resources (connectivity issues)
• Threats
• Errors
• Processing events
• Highly depends on security/audit and compliance requirements:
• Login/logout
• Attempts to access unauthorized data
• User actions
12. Logging – How?
• Centralized logging
• Align on the log format and levels
• Use structured logs
• Ability to correlate requests across services
• Log messages, like code, will be read by other engineers – think of them and help them
• Do not trust clocks
• Do not log sensitive information
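As an illustration of structured logs with a correlation id, here is a minimal sketch using only Python's standard library (the service name and trace id are invented; a real setup would let a logging or tracing library inject the ids):

```python
import json
import logging
import sys

# Minimal structured-log formatter: every record becomes one JSON object
# carrying a trace_id, so lines can be correlated across services.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached via the `extra` argument at the call site
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("order-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "abc123"})
# prints: {"level": "INFO", "logger": "order-service", "message": "order created", "trace_id": "abc123"}
```

Because every line is machine-parseable JSON with the same trace id, Elasticsearch/Kibana can filter one request's logs across all services.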
13. Tracing with Jaeger
Why
• A way to get details about an individual request/event
• A way to get insights into performance
• A way to get cross-service dependencies
• Statistics on time spent
• Compare traces
• Share traces
What
• Timings and logs for:
• Database calls
• Calls to other services
• Message queues
• Heavy processing
14. Tracing – How?
• Pick either OpenTracing or OpenTelemetry
• OpenTelemetry is a merge of OpenTracing and OpenCensus
• OpenTelemetry is newer and provides a metrics API as well
• Key concepts:
• Spans:
• A named, timed operation representing a piece of the workflow.
• Contains: operation name, start and finish timestamps, tags, logs and context
• May contain other spans
• Tracers:
• The Tracer interface creates Spans and understands how to Inject (serialize) and Extract (deserialize) their metadata across process boundaries
• A new trace is started whenever a new Span is created without a reference to a parent Span.
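The concepts above can be sketched as a toy model (this is NOT the real OpenTracing API, just an illustration of the span/trace relationship; all names are invented):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

# A span is a named, timed operation. A span started without a parent
# begins a new trace; otherwise it joins the parent's trace.
@dataclass
class Span:
    operation: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    tags: dict = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    finish: Optional[float] = None

def start_span(operation: str, parent: Optional[Span] = None) -> Span:
    if parent is None:
        return Span(operation, trace_id=uuid.uuid4().hex)  # new trace
    return Span(operation, trace_id=parent.trace_id, parent_id=parent.span_id)

root = start_span("GET /orders")                  # starts a new trace
child = start_span("SELECT orders", parent=root)  # same trace, linked to root
child.finish = time.time()
```

Inject/Extract in the real API serialize exactly this (trace_id, span_id) pair into carriers such as HTTP headers, so the next service can continue the trace.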
15. Tracing – How?
• Add OpenTracing support to your application, e.g., opentracing-spring-jaeger-cloud-starter
• Add additional libraries where needed, e.g., for gRPC (opentracing-grpc)
• If the application is written in several languages, align on span tags and names and implement decorators
• Ensure the trace id and span id are used as the correlation id in logs
• If you have a service mesh, inter-service communication tracing comes for free and can be integrated with Jaeger, or you may look at tools like Kiali
• If your application uses Zipkin, it can still easily be switched to Jaeger
16. Tracing – How?
• Install and configure Jaeger:
• Client – libraries that implement the OpenTracing API and send data on to the agent
• Agent – network daemon that listens on UDP and sends data to the collector
• Collector – stores data in the storage
• Storage – storage for the spans (Cassandra, Elasticsearch, Kafka)
• Query – provides an API to read trace data from storage
• Ingester – reads data from Kafka and stores it in the storage
18. Tracing – How?
• Configure sampling
• Constant
• Probabilistic
• Rate limiting
• Remote
• Configure autoscaling for collectors
• Provide enough resources to the storage
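The sampler types above can be sketched roughly like this (simplified stand-ins for what Jaeger clients actually implement; the token-bucket rate limiter takes an injectable clock only to make the sketch testable):

```python
import random
import time

def constant_sampler(decision: bool):
    """Constant: sample everything or nothing."""
    return lambda: decision

def probabilistic_sampler(rate: float, rng=random.random):
    """Probabilistic: sample roughly `rate` of all traces."""
    return lambda: rng() < rate

class RateLimitingSampler:
    """Rate limiting: at most `max_per_second` sampled traces (token bucket)."""
    def __init__(self, max_per_second: float, clock=time.monotonic):
        self.rate = max_per_second
        self.clock = clock
        self.tokens = max_per_second  # allow a small initial burst
        self.last = clock()

    def sample(self) -> bool:
        now = self.clock()
        # Refill tokens proportionally to elapsed time, capped at the rate.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Remote sampling, the fourth option, fetches one of these strategies from the Jaeger backend at runtime, so it can be tuned without redeploying services.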
19. Recap
• So:
• The complex infrastructure is monitored and there is visibility into it
• Jaeger attempts to provide a holistic view
• Centralized logging and OpenTracing make it possible to trace a request through multiple processes
• How to troubleshoot, then:
• Identify what version is deployed
• Punish people who use latest instead of a specific deployment version
• Use metrics to check service and infrastructure health and resource consumption
• Find the error(s) in the logs and, by filtering by trace id, find the root operation
• Find the corresponding operations in Jaeger, analyze the calls and compare with the logs
• Build a hypothesis and test it or debug it
20. How can we debug services in Kubernetes?
• Port forward and remote debugging
• Tools like Telepresence and Squash
• Use cases:
• Issues reproduced only on the cluster
• Services accessible only on the cluster
• No ability to run service(s) locally
• Cloud native technologies
22. How does it solve it?
• Telepresence v1
• Provides the ability to export env vars and swap the deployment's container with a proxy
• Forwards the ports that the service exposes
• Routes all traffic through the proxy
• To achieve that:
• Run telepresence --swap-deployment {serviceName} --namespace
{namespaceName} --env-json ~/telepresence-legacy.json
• In other words:
• The service runs locally but has access to all the resources in the cluster; no debugging information is passed over the network, and no time is spent on container build/upload and deploy
23. Telepresence v1
• Telepresence v1 is a cool and reliable tool that does not require any cluster configuration
• Telepresence v1 is great but has significant limitations:
• Only one service at a time can be debugged
• The service is fully replaced, and thus all traffic goes to your machine
• Thus, telepresence v2 was implemented
24. Telepresence v2
• Access all resources in cluster like your machine is deployed there
• telepresence connect
• Debug multiple services at a time
• Execute multiple intercept commands and point them to different local ports
• Intercept specific ports
• telepresence list
• kubectl get service example-service --output yaml
• telepresence intercept example-service --port 8080:http --env-file ~/example-service-intercept.env
• Intercept specific requests
• telepresence intercept example-service --port 8080:http --env-file ~/example-service-intercept.env --preview-url=true
• Share dev environments
27. Telepresence v2 cons
• Brew by default updates you to the latest version, which may require cluster configuration
• It cannot intercept more than one port on the service
• It does not substitute the pod and thus if you consume messages
your breakpoint may not work
• It does not work with certain service meshes
28. So, what should I use?
• Use both
• V1 suits cases where:
• there is more than one port to intercept
• you need to consume messages from queues or Kafka
• it is OK to swap the deployment
• V2 suits cases where you need to:
• connect to cluster resources without extra port forwards
• intercept a specific port
• intercept specific requests
29. So, how would I do that?
• Install v2
• To install a specific version (2.3.5), use the following commands:
• sudo curl -fL https://app.getambassador.io/download/tel2/darwin/amd64/2.3.5/telepresence -o /usr/local/bin/telepresence
• sudo chmod a+x /usr/local/bin/telepresence
• Install V1:
• brew install --cask macfuse
• brew install datawire/blackbird/telepresence-legacy
• ln -s /usr/local/Cellar/telepresence-legacy/0.109/bin/telepresence /usr/local/bin/tel