
Teach your application eloquence. Logs, metrics, traces - Dmytro Shapovalov (RUS) | Ruby Meditation 26

Talk by Dmytro Shapovalov, Infrastructure Engineer at Cossack Labs, at Ruby Meditation #26, Kyiv, 16.02.2019
Next conference - http://www.rubymeditation.com/

Most modern applications live in close cooperation with each other. We will talk about ways to use modern techniques effectively to monitor the health of applications, and we will look at typical tasks and implementation mistakes through the eyes of an infrastructure engineer. We will also consider the Ruby libraries that help implement all of this.

Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation


  1. 1. Teach your application eloquence. Logs, metrics, traces. Dmytro Shapovalov Infrastructure Engineer @ Cossack Labs
  2. 2. Who we are • UK-based data security products and services company
 • Building security tools to prevent sensitive data leakage and to comply with data security regulations
 • Cryptographic tools, security consulting, training
 • We are cryptographers, system engineers, applied engineers, infrastructure engineers
 • We support the community, speak, teach, and open source a lot
  3. 3. What we are going to talk about • Why do we need telemetry? • What are the different kinds of telemetry? • The limits of applicability of various types of telemetry • Approaches and mistakes • Implementation
  4. 4. What is telemetry? «Gathering data on the use of applications and application components, measurements of start-up time and processing time, hardware, application crashes, and general usage statistics.»
  5. 5. Why do we need telemetry at all? Who are the consumers?
 − developers
 − devops/sysadmins
 − analysts
 − security staff What purposes?
 − debug
 − monitor state and health
 − measure and tune performance
 − business analysis
 − intrusion detection
  6. 6. It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning
  7. 7. It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning • COST of development • COST of use
  8. 8. What data do we have to export? … we can ask any specialist.
  9. 9. What data do we have to export? … we can ask any specialist. — ALL!… will be their answer.
  10. 10. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
  11. 11. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions
  12. 12. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events business:
 − SLI
 − user actions
 audiences: developers, devops/sysadmins
  13. 13. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events business:
 − SLI
 − user actions
 audiences: developers, devops/sysadmins, analysts
  14. 14. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events business:
 − SLI
 − user actions
 audiences: developers, devops/sysadmins, analysts, security staff
  15. 15. SIEM — security staff’s main instrument Complex analysis:
 − correlation
 − threats
 − patterns
 − compliance
 Sources: Applications, Network devices, Servers, Environment
  16. 16. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
  17. 17. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations
  18. 18. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations Traces • reports into logs • agents, collectors, stores with proprietary protocols • custom implementations
  19. 19. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
  20. 20. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
  21. 21. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning
  22. 22. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning + SIEM systems
  23. 23. Telemetry flow creation
  24. 24. Telemetry flow: creation → transport → aggregation → normalization → store → analyze + alerting → visualize → archive
  25. 25. Logs
  26. 26. Logs : kinds of data • initial information about the application • state changes (start/ready/…/stop) • health changes • audit trail (security-relevant list of activities: financial operations, health care data transactions, changing keys, changing configuration) • user sessions (sign-in attempts, sign-out, actions) • unexpected actions (wrong URLs, failed sign-ins, etc.) • various information in string format
  27. 27. Logs : on start • new state: starting • application name • component name • commit hash / build number • configuration in use • deprecation warnings • running mode
  28. 28. Logs : on ready • new state: ready • listen interfaces, ports and sockets • health
  29. 29. Logs : on state or health change • new state • reason • URL to documentation
  30. 30. Logs : on state or health change • new state • reason • URL to documentation Use a traffic-light highlighting system for health states:
 ● red — completely unhealthy
 ● yellow — partially healthy, reduced functionality
 ● green — completely healthy
  31. 31. Logs : on shutdown • reason • status of preparing to shutdown • new state: stopped (final goodbye)
  32. 32. Logs : each line • timestamps (ISO8601, TZ, reasonable precision) • PID • application/component short name • application version (JSON, CEF, protobuf) • severity (CEF: 0→10, RFC 5427: 7→0) • event code (HTTP style) • human-readable message
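
A minimal sketch (not from the talk) of one such log line in JSON, using only the Ruby standard library; the application name, version, and event code values are illustrative assumptions.

    require 'json'
    require 'logger'
    require 'time' # for Time#iso8601

    APP_NAME    = 'example-app' # application/component short name (illustrative)
    APP_VERSION = '1.4.2'       # illustrative

    logger = Logger.new($stdout)
    logger.formatter = proc do |severity, time, _progname, event|
      {
        ts:       time.utc.iso8601(3), # ISO 8601, UTC, millisecond precision
        pid:      Process.pid,
        app:      APP_NAME,
        version:  APP_VERSION,
        severity: severity,
        code:     event[:code],        # HTTP-style event code
        message:  event[:message]      # human-readable message
      }.to_json << "\n"
    end

    logger.info(code: 200, message: 'payment processed')
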
  33. 33. Logs : do not export! • passwords, tokens, any sensitive data — security risks • private data — legal risks Use: − masking − anonymisation / pseudonymisation
  34. 34. Logs : consumers • Console • Files • General purpose collector/store/alert/search system. • SIEM
  35. 35. Logs : consumers and formats
                     console/STDERR  file  syslog  ELK  SIEM  socket/HTTP/custom
 plain                     ✓
 syslog (RFC 3164)         ✓          ✓      ✓      ✓    ✓          ✓
 JSON                      ✓          ✓      ✓      ✓    ✓          ✓
 CEF                       ✓          ✓      ✓      ✓    ✓          ✓
 protobuf                                               ✓          ✓
  36. 36. Logs : CEF • old (2009), but widely used standard • simple: easy to generate, easy to parse (supported even by devices without powerful CPUs) • well documented:
 − field name dictionaries
 − field types
 CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
 Example: Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
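
A small sketch (an assumption, not material from the talk) of composing such a CEF line in plain Ruby; the vendor, product, and extension values are illustrative, and real CEF output would additionally need '|' and '\' escaped inside header fields.

    def cef_line(signature_id:, name:, severity:, extension: {})
      header = ['CEF:0', 'ExampleVendor', 'ExampleProduct', '1.0',
                signature_id, name, severity].join('|')
      ext = extension.map { |k, v| "#{k}=#{v}" }.join(' ')
      "#{header}|#{ext}"
    end

    puts cef_line(signature_id: 100, name: 'worm successfully stopped',
                  severity: 10,
                  extension: { src: '10.0.0.1', dst: '2.1.2.2', spt: 1232 })
    # => CEF:0|ExampleVendor|ExampleProduct|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
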
  37. 37. CEF naming, data formats + JSON/protobuf/… transport = painless logging
  38. 38. Logs : bear in mind [1/3] • Logs will be read by humans, often when a failure happens and with limited time to react. Be brief and eloquent. Give information that may help solve the problem. • Logs will be searched. Don’t be a poet, be a technical specialist. Use expected words. • Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application. • Carefully classify the severity of events. Emitting many error messages instead of warnings in non-critical situations leads to people ignoring information from the logs.
  39. 39. Logs : bear in mind [2/3] • Whenever possible, build on existing standards. Grouping event codes according to the HTTP status code table is not a bad idea. • Logs are the first resource for analyzing security incidents. • Logs will be archived and stored for a long period of time. It will be almost impossible to cut out some pieces of data later. • The following should be configurable: formats, transport protocols, paths, severity.
  40. 40. Logs : bear in mind [3/3] • Your application may run in many different environments with different standards of logging (VM, Docker). The application should be able to direct all logs into one channel. Splitting may be an option. • Do not implement log file rotation yourself. Provide a way to tell your application to gracefully recreate the log file after it has been rotated by an external service. • When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner — to stderr/stdout.
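
A minimal sketch of the reopen-after-rotation advice above, using only the standard library (the USR1 signal and the log path are assumptions; Logger#reopen is available in recent Rubies). The trap handler only sets a flag; the actual reopen happens outside the signal handler, where taking locks is safe.

    require 'logger'

    LOG_PATH = '/tmp/app.log' # illustrative path
    logger = Logger.new(LOG_PATH)
    reopen_requested = false

    # logrotate (or similar) renames the file, then signals the process
    Signal.trap('USR1') { reopen_requested = true }

    loop do
      if reopen_requested
        reopen_requested = false
        logger.reopen # recreates the log file at LOG_PATH after rotation
      end
      logger.info('heartbeat')
      sleep 1
    end
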
  41. 41. Logs : implementation • native Ruby methods • semantic_logger
 https://github.com/rocketjob/semantic_logger
 (a lot of destinations: DBs, HTTP, UDP, syslog) • ougai
 https://github.com/tilfin/ougai
 (JSON) • httplog
 https://github.com/trusche/httplog
 (HTTP logging, JSON support)
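
A quick sketch of structured JSON logging with the ougai gem listed above, following its documented API (output field names may differ between versions):

    require 'ougai'

    logger = Ougai::Logger.new($stdout)
    logger.info('user signed in', user_id: 42, method: 'oauth')
    logger.warn('health degraded', component: 'cache', reason: 'timeout')
    # each call emits one JSON line with the message plus the given fields
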
  42. 42. Metrics
  43. 43. Metrics : approaches • USE method
 Utilization, Saturation, Errors • Google SRE book
 Latency, Traffic, Errors, Saturation • RED method
 Rate, Errors, Duration
  44. 44. Metrics : utilization Usage of a resource: the average time that the resource was busy servicing work. • Hardware resources: CPU, disk system, network interfaces • File system: capacity, usage • Memory: capacity, cache, heap, queue • Resources: file descriptors, threads, sockets, connections
  45. 45. Metrics : traffic, rate • normal operations: 
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes A measure of how much demand is being placed on your system. (Google SRE book) The number of requests, per second, your services are serving. (RED Method)
  46. 46. Metrics : latency, duration The time it takes to service a request. (Google SRE book) • latency of operations: 
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes
  47. 47. Metrics : errors • error events:
 − hardware errors
 − software exceptions
 − invalid requests / input
 − authentication failures
 − invalid URLs The count of error events. (USE Method) The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)
  48. 48. Metrics : saturation • calculated value, measure of current load The degree to which the resource has extra work which it can't service, often queued. (USE Method) How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)
  49. 49. Metrics : saturation • can be calculated internally or measured externally • high utilization is a problem • high saturation is a problem • low utilization level does not guarantee that everything is OK • low saturation (in the case of a correct calculation) most likely indicates that everything is OK
  50. 50. OpenMetrics : based on Prometheus metric types • Gauge
 single numerical value
 − memory used
 − fan speed
 − connections count • Counter
 single monotonically increasing counter
 − operations done
 − errors occurred
 − requests processed • Histogram
 increment counter per buckets
 − requests count per latency buckets
 − CPU load values count per range buckets • Summary
 similar to the Histogram, but φ-quantiles are calculated on the client side; calculating other quantiles later is not possible https://openmetrics.io/ https://prometheus.io/docs/concepts/metric_types/
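
A sketch of three of these types with the prometheus-client gem (see slide 60); the metric names and values are illustrative, and the docstring:/buckets: keywords match recent versions of the gem.

    require 'prometheus/client'

    registry = Prometheus::Client.registry

    memory = registry.gauge(:process_resident_memory_bytes,
                            docstring: 'Resident memory size')
    errors = registry.counter(:errors_total, docstring: 'Errors occurred')
    latency = registry.histogram(:request_duration_seconds,
                                 docstring: 'Request latency',
                                 buckets: [0.01, 0.05, 0.1, 0.5, 1, 5])

    memory.set(123_456_789) # gauge: single numerical value
    errors.increment        # counter: monotonically increasing
    latency.observe(0.042)  # histogram: increments the matching bucket
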
  51.–54. OpenMetrics : Average vs Percentile (four chart slides contrasting an average with the 99th percentile of the same data)
  55.–57. Metrics : buckets
 bucket upper bounds: <10 <20 <30 <40 <50 <60 <70 <80 <90 <100
 counts per bucket:    1   1   1   1   1   1   1   1   1    1
 the 50th percentile falls in the <50 bucket, the 90th percentile in the <90 bucket
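
A small sketch (not from the talk) of how a percentile is estimated from such cumulative bucket counts, which is essentially what Prometheus-style histograms enable on the server side:

    # counts[i] holds the number of observations below BOUNDS[i]
    BOUNDS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].freeze

    def percentile(counts, p)
      target = counts.sum * p / 100.0
      seen = 0
      BOUNDS.each_with_index do |bound, i|
        seen += counts[i]
        return bound if seen >= target # first bucket reaching the target rank
      end
      BOUNDS.last
    end

    counts = Array.new(10, 1)   # one observation per bucket, as above
    puts percentile(counts, 50) # => 50
    puts percentile(counts, 90) # => 90
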
  58. 58. Metrics : export data • current state • current health • event counters:
 − AAA events
 − unexpected actions (wrong URLs, failed sign-ins)
 − errors during normal operations • performance metrics
 − normal operations
 − queues
 − utilization, saturation
 − query latency • application info:
 − version
 − warnings/notifications (gauge)
  59. 59. Metrics : formats • suggest using the Prometheus format
 − native for Prometheus
 − OpenMetrics — open source specification
 − simple and clear
 − HTTP-based
 − can be easily converted
 − libraries exist • Influx or a similar format if you really need to implement the push model • protobuf / gRPC
 − custom
 − high load

  60. 60. Metrics : implementation • Prometheus Ruby client
 https://github.com/prometheus/client_ruby • native Ruby methods
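
A minimal config.ru sketch exposing metrics over HTTP with the middleware shipped in the Prometheus Ruby client (per its README; the app itself is a placeholder):

    # config.ru
    require 'prometheus/middleware/collector'
    require 'prometheus/middleware/exporter'

    use Prometheus::Middleware::Collector # records per-request HTTP metrics
    use Prometheus::Middleware::Exporter  # serves them on GET /metrics

    run ->(_env) { [200, { 'content-type' => 'text/plain' }, ["OK\n"]] }
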
  61. 61. Metrics : bear in mind [1/2] • Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations may create the illusion of a performance increase when multiple failures occur. • Whenever possible, use saturation to determine the load of the system. Utilization alone is incomplete information. • Be sure to export the metrics of the component closest to the user. This makes it possible to evaluate the SLI. • Implement configurable bucket sizes.
  62. 62. Metrics : bear in mind [2/2] • Export appropriate metrics as buckets. This lowers the polling rate and makes it possible to get statistics as percentiles. • Add units to metric names. • Whenever possible, use SI units. • Follow the naming standard. The Prometheus “Metric and label naming” document is a good base.
  63. 63. Traces
  64. 64. Traces : definition In software engineering, tracing involves a specialized use of logging to record information about a program's execution. … There is not always a clear distinction between tracing and other forms of logging, except that the term tracing is almost never applied to logging that is a functional requirement of a program. — Wikipedia
  65. 65. Traces : use cases • Debugging during development • Measuring and tuning performance • Analyzing failures and security incidents https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • Approaches • Library comparison • Implementation example • Use cases
  66. 66. Traces : principles • Low overhead • Application-level transparency • Scalability
  67. 67. Traces : spans in trace tree https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf
  68. 68. Traces : kinds of data Per request/query tracking: • trace id • span id • parent span id • application info (product, component) • module name • method name • context data (session/request id, user id, …) • operation name and code • start time • end time
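
As an illustration only (the field names are assumptions, not a specific tracer's schema), one span from such a tree could be carried as a record like this:

    span = {
      trace_id:       '4bf92f3577b34da6a3ce929d0e0e4736',
      span_id:        '00f067aa0ba902b7',
      parent_span_id: nil, # nil for the root span
      product:        'billing',
      component:      'api',
      module_name:    'PaymentsController',
      method_name:    'create',
      context:        { request_id: 'r-123', user_id: 42 },
      operation:      'charge_card',
      code:           200,
      start_time:     Time.now,
      end_time:       Time.now + 0.042
    }
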
  69. 69. Traces : what it looks like
  70. 70. Traces : consumers • General purpose collectors:
 − Jaeger
 − Zipkin • Cloud collectors:
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • SIEM
  71. 71. Traces : formats • Proprietary protocols:
 − Jaeger
 − Zipkin
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • JSON:
 − SIEM • protobuf/gRPC:
 − custom
  72. 72. Traces : implementation • OpenCensus
 https://www.rubydoc.info/gems/opencensus
 (Zipkin, Google Cloud Stackdriver, JSON) • OpenTracing
 https://opentracing.io/guides/ruby/ • Jaeger client
 https://github.com/salemove/jaeger-client-ruby
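
A minimal sketch with the opencensus gem listed above, per its documented API (details may vary by version): nested in_span calls produce the parent/child span tree from slide 67.

    require 'opencensus'

    OpenCensus::Trace.start_request_trace do |_root_context|
      OpenCensus::Trace.in_span('handle_request') do |span|
        span.put_attribute('user_id', '42') # illustrative context data
        OpenCensus::Trace.in_span('db_query') do
          sleep 0.01 # stands in for real work
        end
      end
    end
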
  73. 73. Checklists
  74. 74. Checklist : Logs □ Each line:
 □ timestamps (ISO8601, TZ, reasonable precision)
 □ PID
 □ component name
 □ severity
 □ event code
 □ human-readable message
 □ Events to log:
 □ state changes (start/ready/pause/stop)
 □ health changes (new state, reason, doc URL)
 □ user sign-in attempts (including failed ones, with reasons), actions, sign-out
 □ audit trail
 □ errors
 □ On start:
 □ product name, component name
 □ version (+build, +commit hash)
 □ running mode (debug/normal, daemon/)
 □ deprecation warnings
 □ which configuration is in use (ENV, file, configuration service)
 □ On ready: communication sockets and ports
 □ On exit: reason
 □ Do not log:
 □ passwords, tokens
 □ personal data
  75. 75. Checklist : Metrics □ Data to export:
 □ application (version, warnings/notifications)
 □ utilization (resources, capacities, usage)
 □ saturation (internally calculated or appropriate metrics)
 □ rate (operations)
 □ errors
 □ latencies
 □ Split metrics by type
 □ Export as buckets when reasonable
 □ Configure bucket sizes
 □ Export metrics for SLI
 □ Determine the required resolution
 □ Normalize, use SI units, add units to names
 □ Prefer the poll model when possible
 □ Clear counters on restart
  76. 76. Links [1/2] • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
 https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf • How to Implement Tracing in a Modern Distributed Application
 https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • OpenTracing
 https://opentracing.io/ • OpenMetrics
 https://github.com/RichiH/OpenMetrics • OpenCensus
 https://opencensus.io
  77. 77. Links [2/2] • CEF
 https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf • Metrics : USE method
 http://www.brendangregg.com/usemethod.html • Google SRE book
 https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ • Metrics : RED method
 https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/ • MS Azure : monitoring and diagnostics
 https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring • Prometheus : Metric and label naming
 https://prometheus.io/docs/practices/naming/
  78. 78. Dmytro Shapovalov Infrastructure Engineer @ Cossack Labs Thank you! shadinua shad.in.ua shad.in.ua
