
Apricot2017 Request tracing in distributed environment

Published in: Engineering

  1. Logging/Request Tracing in Distributed Environment
     7 February 2017
     Hieu LE, Fujitsu Vietnam Limited, PODC (Platform Offshore Development Center)
     Vietnam OpenStack Community - VFOSSA
     Copyright 2017 Fujitsu Vietnam Limited
  2. /me (APRICOT 2017)
     Hieu LE
     • Vietnam Official OpenStack Community Organizer
     • VFOSSA Executive Member
     • OpenStack Project leader @ Fujitsu
     • OpenStack ATC/AUC
     • Email:
  3. Outline
     • Intro
     • Current logging solution
       – Pros
       – Cons
     • Tracing requirements
     • Request tracing
       – Demo with OpenStack
  4. Intro
     • Distributed environment:
       – Cloud computing and fog computing.
       – IoT environment.
       – Micro-services architecture.
  5. IoT – Fog – Cloud
     [Figure: architecture diagram. Labels: (Virtual) Storage; Services/Servers; Virtual Compute Resources; Virtual Network; O2M2, Thingworx, DeviceHive, other platforms; multiple clouds; routing, optimizing paths, data pre-processing.]
  6. • What if something goes wrong in our system?
     • How can we resolve the problems as quickly as possible?
  7. Current Logging solution (1)
     • ELK, Graylog:
       – Collecting logs from systems and appliances.
       – Indexing and filtering.
       – Root cause analysis (RCA).
       – Multiple alert/notify mechanisms.
       – Visualization based on the user's needs.
  8. Current Logging solution (2)
     • Pros:
       – Quickly troubleshoot problems in systems and appliances.
       – Reduce the cost of storing logs to meet PCI DSS or HIPAA requirements.
     • Cons:
       – Depend mostly on system and appliance logs.
       – Require more effort for sizing, deploying, maintaining and operating the logging solution.
       – Consume resources (mostly storage).
       – May not be suitable for small sensors.
  9. Current Logging solution (3)
     • Example 01:
       – A single request to launch one VM in an OpenStack cloud can pass through at least four micro-services.
       – INFO-level logs sometimes contain misleading information, or not enough information for troubleshooting.
       – Turning on the DEBUG log level produces too much information and eats up storage.
       – The overhead threshold is hard to control.
  10. Current Logging solution (4)
      • Example 02:
        – ELK/Graylog requires tweaking and effort for visualization, collection, profiling and RCA in a distributed environment.
        – Consider the following queries in environments with more than 10 services:
          "Find me the root cause of all error requests where the requests process X business."
          "Find me requests where the user was logged in and the request took more than two seconds and a DB transaction was held open for more than 500 ms."
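The second query on slide 10 becomes simple once every record carries a per-request identifier. The following is a minimal sketch under stated assumptions: the `Span` record, its field names, the `user_logged_in` tag and the `db.transaction` operation name are all hypothetical and are not taken from ELK, Graylog or any tracing product.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical trace record: one timed step of one request."""
    trace_id: str                 # shared by every step of the same request
    span_id: str
    parent_id: Optional[str]      # None marks the root (the whole request)
    operation: str
    duration_ms: float
    tags: dict = field(default_factory=dict)


def slow_logged_in_requests(spans, min_request_ms=2000.0, min_db_ms=500.0):
    """Return trace ids where the user was logged in, the request took longer
    than min_request_ms, and a DB transaction lasted longer than min_db_ms."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)

    matches = []
    for trace_id, group in traces.items():
        roots = [s for s in group if s.parent_id is None]
        if not roots:
            continue  # incomplete trace: no root span was collected
        root = roots[0]
        logged_in = root.tags.get("user_logged_in") is True
        slow = root.duration_ms > min_request_ms
        long_db = any(
            s.operation == "db.transaction" and s.duration_ms > min_db_ms
            for s in group
        )
        if logged_in and slow and long_db:
            matches.append(trace_id)
    return sorted(matches)


# Illustrative data only: four requests, each with a root span and one DB span.
SPANS = [
    Span("req-1", "a1", None, "request", 2600.0, {"user_logged_in": True}),
    Span("req-1", "a2", "a1", "db.transaction", 750.0),
    Span("req-2", "b1", None, "request", 3100.0, {"user_logged_in": True}),
    Span("req-2", "b2", "b1", "db.transaction", 120.0),
    Span("req-3", "c1", None, "request", 2400.0, {"user_logged_in": False}),
    Span("req-3", "c2", "c1", "db.transaction", 900.0),
    Span("req-4", "d1", None, "request", 800.0, {"user_logged_in": True}),
    Span("req-4", "d2", "d1", "db.transaction", 650.0),
]
```

Without a shared `trace_id`, the same question has to be answered by correlating timestamps across the logs of more than ten services, which is the effort the slide refers to.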
  11. Tracing Requirements
      • Operator needs:
        – Address the data explosion: logs, metrics, events, active/passive checks, …
        – End-to-end debugging: understand what the real issue is and what is affected when errors occur.
        – Visibility: deliver centralized intelligence for cloud operations at scale.
        – Resource utilization: understand resource availability and utilization.
      • Solution requirements:
        – Able to collect, store and access all types of data in one place.
        – Highly performant and scalable platform.
        – Flexible processing pipeline that can support multiple use cases: diagnostics, root cause analysis, SLA calculations, utilization reporting, …
        – Extensible platform that can be extended to support new types of data and processing.
  12. Tracing Requirements
      • Users need a centralized solution that provides enough information for both the machine-centric view (monitoring) and the workflow-centric view (tracing).
        – Provide a general picture of every workflow: the communication steps and the request/response time of each step, for performance review.
        – Show the monitoring metrics of hardware and services for each step at the time of investigation.
        – Provide a general-purpose RCA method for quick troubleshooting.
  13. Workflow-centric solutions: quick survey
      Many solutions aim at workflow-centric tracing. They fall into 3 categories [1]:
      1. Explicit metadata propagation: inject tracing metadata into the current system (Zipkin, Kieker, X-Trace, Tracelytics, Cloudera HTrace, ExplorViz, OpenTracing - CNCF).
      2. Schema-based: rely on the event semantics of the system and use a temporal schema of custom log messages for tracing (Magpie).
      3. Black-box tracing: rely on log analysis to infer relationships among events (FChain, NetMedic).
      [1] HANSEL: Diagnosing Faults in OpenStack. IBM Research.
  14. Workflow-centric solutions (1)
      • Figure of the traditional workflow.
        [Figure labels: Req, Service A, Service B, Service C, Service D.]
  15. Workflow-centric solutions (2)
      • Explicit metadata propagation
        – Figure of the explicit metadata tracing workflow: inject metadata into each request/response and send it to a tracing mechanism (Zipkin, Dapper, ...).
        [Figure labels: Req, Service A, Service B, Service C, Service D, Tracing Mechanism.]
  16. Workflow-centric solutions (3)
      • Explicit metadata propagation
        – Pros:
          • Gives enough detail to trace problems.
          • High scalability.
        – Cons:
          • The code base must be modified to inject metadata into the header of each request and response.
          • Increases the size of network packets (slightly; for Zipkin, around 500 bytes).
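Explicit metadata propagation, as described on slides 15 and 16, can be sketched in a few lines. This is an illustration under stated assumptions, not Zipkin's or Dapper's implementation: the header names `X-Trace-Id` and `X-Span-Id`, the `traced` decorator, the in-memory `COLLECTOR` list (standing in for the tracing mechanism) and the four service functions are all hypothetical, and plain function calls stand in for network requests.

```python
import time
import uuid

# Stand-in for the tracing mechanism: a real system would ship spans to a collector.
COLLECTOR = []


def traced(service_name):
    """Wrap a handler so it reads trace metadata from the incoming headers,
    passes updated metadata downstream, and reports one span when it finishes."""
    def decorator(handler):
        def wrapper(headers):
            # Reuse the caller's trace id, or start a new trace at the entry point.
            trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
            parent_id = headers.get("X-Span-Id")      # None for the first service
            span_id = uuid.uuid4().hex[:16]
            outgoing = {**headers, "X-Trace-Id": trace_id, "X-Span-Id": span_id}
            started = time.perf_counter()
            try:
                return handler(outgoing)
            finally:
                COLLECTOR.append({
                    "trace_id": trace_id,
                    "span_id": span_id,
                    "parent_id": parent_id,
                    "service": service_name,
                    "duration_ms": (time.perf_counter() - started) * 1000.0,
                })
        return wrapper
    return decorator


@traced("service-d")
def service_d(headers):
    return "done"


@traced("service-c")
def service_c(headers):
    return service_d(headers)


@traced("service-b")
def service_b(headers):
    return service_c(headers)


@traced("service-a")
def service_a(headers):
    return service_b(headers)


if __name__ == "__main__":
    service_a({})  # the request enters at Service A with no trace metadata
    for span in COLLECTOR:
        print(span["service"], span["parent_id"], round(span["duration_ms"], 3))
```

The decorator is where both cons show up: every handler has to be wrapped (code modification), and every call carries two extra header fields (larger packets).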
  17. Workflow-centric solutions (4)
      • Schema-based: based on the semantics of events generated by the system (including OS, services and applications); all related event schemas are then joined for the final inference.
        [Figure labels: Req, Service A, Service B, Service C, Service D, Authenticate (x3), Get Image, Create port, IP and attach, Read/Write DB, Event Listener.]
  18. Workflow-centric solutions (5)
      • Schema-based
        – Pros:
          • Fewer modifications to the code base.
        – Cons:
          • Low scalability (the result is delayed until all events are collected).
          • Less detail than explicit metadata. The event semantics, the event list and the way schemas are joined determine the success of this approach, so a warehouse of event semantics has to be built.
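The schema-based join from slides 17 and 18 can be sketched as follows. This is a simplified illustration and not Magpie's algorithm: the event types, the attribute names and the `JOIN_ATTRIBUTES` schema are assumptions invented for the example. Events that share a value for any join attribute are stitched into the same workflow after collection.

```python
# Hypothetical schema: attributes whose shared values link two events together.
JOIN_ATTRIBUTES = ("request_id", "token", "port")


def stitch(events, join_attributes=JOIN_ATTRIBUTES):
    """Group already-collected events into workflows using the join schema,
    then order each workflow by timestamp."""
    parent = list(range(len(events)))          # union-find over event indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    first_seen = {}                            # (attribute, value) -> event index
    for index, event in enumerate(events):
        for attr in join_attributes:
            if attr in event:
                key = (attr, event[attr])
                if key in first_seen:
                    parent[find(index)] = find(first_seen[key])
                else:
                    first_seen[key] = index

    groups = {}
    for index, event in enumerate(events):
        groups.setdefault(find(index), []).append(event)
    return [sorted(group, key=lambda e: e["ts"]) for group in groups.values()]


# Illustrative events from different services, deliberately out of order.
EVENTS = [
    {"ts": 5, "type": "db.write", "port": "p1"},
    {"ts": 1, "type": "api.request", "request_id": "r1", "token": "t1"},
    {"ts": 2, "type": "auth.validate", "token": "t1"},
    {"ts": 6, "type": "api.request", "request_id": "r2", "token": "t2"},
    {"ts": 3, "type": "image.get", "request_id": "r1"},
    {"ts": 4, "type": "port.create", "request_id": "r1", "port": "p1"},
    {"ts": 7, "type": "auth.validate", "token": "t2"},
]
```

Note that `stitch` needs the full event list before it can answer, which mirrors the delay listed under cons, and that an event with none of the join attributes ends up in a workflow of its own, which is why the quality of the schema decides the result.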
  19. Workflow-centric solutions (6)
      • Black-box tracing: collect the logs of all services, then analyze them to infer the root cause of a problem.
        [Figure labels: Service A, Service B, Service C, Service D, DB, Logs (x5), Log Collector and Analyzer.]
  20. Workflow-centric solutions (7)
      • Black-box tracing:
        – Pros:
          • No modification to the code base.
        – Cons:
          • High error rate (almost all approaches are probabilistic data mining).
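A toy version of black-box inference shows where the error rate on slide 20 comes from. This sketch is not the method used by FChain or NetMedic: it assumes a hypothetical log format of `(timestamp_ms, service)` pairs and guesses that one service calls another whenever the second one usually logs shortly after the first.

```python
from collections import defaultdict
from itertools import permutations


def infer_edges(log_lines, window_ms=500, threshold=0.8):
    """Guess caller -> callee edges from timing alone.

    An edge (src, dst) is reported when at least `threshold` of src's log
    lines are followed by a dst log line within `window_ms` milliseconds."""
    times = defaultdict(list)
    for ts, service in log_lines:
        times[service].append(ts)

    edges = {}
    for src, dst in permutations(times, 2):
        followed = sum(
            1 for t in times[src]
            if any(0 < u - t <= window_ms for u in times[dst])
        )
        score = followed / len(times[src])
        if score >= threshold:
            edges[(src, dst)] = score
    return edges


# Illustrative logs for three requests that flow A -> B -> C.
LOGS = [
    (0, "service-a"), (100, "service-b"), (300, "service-c"),
    (10000, "service-a"), (10100, "service-b"), (10300, "service-c"),
    (20000, "service-a"), (20100, "service-b"), (20300, "service-c"),
]
```

On this data the function reports the two real edges, A to B and B to C, but also A to C, which is only an indirect relationship. Timing correlation alone cannot tell the two apart, so the output has to be treated as a probabilistic hint.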
  21. Example (1)
      Magpie: schema-based
  22. Example (2)
      Zipkin: explicit metadata propagation
  23. Demo with OpenStack
      OSProfiler: a small library for explicit metadata propagation
  24. Q & A
      Thank you!