HighLoad++ 2017
Зал «Пекин+Шанхай», 7 ноября, 16:00
Abstract:
http://www.highload.ru/2017/abstracts/2842.html
A story about real life experience in Lamoda, featuring logging, forest animals, limited size buffers and morning routines.
Possible takeaways from this presentation:
1. Understanding the need for centralized log aggregation
2. Learning a few tips about logging and event aggregation
3. Saving a lot of money by implementing your own personal "poor-man's" NewRelic
...
4. Structured list of rants
● Concept of events
● Benefits of structure
● Poor man’s event tracing
● Debug logging in production
… does anyone here run docker in production?
5. Why do people log?
● To ensure their application is running
● To debug a running application
● To see activity traces
● To see errors or edge cases manifesting
● To gather audit information
● To collect runtime metrics/usage patterns
… ordered by subjective importance
7. How do people log?
● STDOUT
● STDERR
● FILE
● NETWORK
● UNIX SOCKET
Protocol?
Library?
Endpoints?
Path?
Retention?
TCP/UDP?
Security?
…?
11. Because...
1. By decoupling the logging transport from the application, you enable other
team(s) to get your logs to a central place without any effort on your side
2. By keeping the log transport free of complicated assumptions, your
application stays more portable
3. By stepping away from standard logging protocols (... did I mention I am
not a fan of RFC-5424?) you can use more elegant data encodings (... but I
like JSON?)
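The decoupling above can be sketched in a few lines. This is a minimal illustration, not Lamoda's actual code: the `log_event` helper and its field names are assumptions. The application only writes one JSON object per line to STDOUT; shipping those lines to a central place is left entirely to the environment (container runtime, journald, a sidecar).

```python
import datetime
import json
import sys


def log_event(message, **fields):
    """Emit one structured event as a single JSON line on STDOUT.

    No sockets, no syslog protocol, no file paths: the transport is
    whatever captures STDOUT, which keeps the application portable.
    """
    event = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "message": message,
    }
    event.update(fields)
    sys.stdout.write(json.dumps(event) + "\n")


log_event("request finished", http_method="GET", response_time=0.02)
```

Anything that can read a pipe can now own delivery and retention, answering the protocol/endpoint/retention questions outside the application.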
34. More points on debug logs
● Usually you filter out the crap and only look for understandable logs
○ Searching for needle in a barn is not a fun activity
● Logs are not a good place to store sensitive data
○ “user %s authorised with password %s bought item SKU#%s” is my favourite.
● You want to see patterns, not individual errors
○ “%s caught for %s user, while in %s” is hard to find when you are not 100% certain what
you are looking for
○ “exception caught in purchase pipeline” is better; I’ll cover the details in the next topic.
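Both points above can be shown in one small sketch — a hypothetical helper, not the talk's actual code: the message text stays constant and searchable, the specifics move into separate fields, and fields on an assumed deny-list (like `password`) are dropped before anything is serialized.

```python
import json

# Assumed field names to redact; adjust per application.
SENSITIVE = {"password", "card_number"}


def purchase_event(user_id, sku, **extra):
    """Build a log record with a stable, pattern-friendly message.

    The message never varies, so a query for
    "exception caught in purchase pipeline" matches every occurrence;
    the details live in structured fields, minus the sensitive ones.
    """
    fields = {k: v for k, v in extra.items() if k not in SENSITIVE}
    fields["message"] = "exception caught in purchase pipeline"
    fields["user_id"] = user_id
    fields["sku"] = sku
    return json.dumps(fields)
```

A record built this way is easy to find by message and safe to keep, unlike the "%s authorised with password %s" favourite.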
36. [pid: 5151|app: 0|req: 2166118/25642392] 10.5.244.32 () {42 vars in 865 bytes} [Wed
Jul 19 18:54:39 2017] GET
/api/v1/pp_preview/?models=reviews%2Cquestions&sku=XXXXXXXXXXX&limit=50&country=by =>
generated 1285 bytes in 20 msecs (HTTP/1.0 200) 5 headers in 154 bytes (1 switches on
core 0)
VS
{"pid": 5151, "req_n": 2166118, "source_ip": "10.5.244.32", "response_size": 1285,
"timestamp": "2017-07-19T18:54:39Z", "http_method": "GET",
"http_url": "/api/v1/pp_preview/?models=reviews%2Cquestions&sku=XXXXXXXXXXX&limit=50&country=by",
"response_time": "0.02", ... }
$ open "goo.gl/RS43Rg"
$ npm install -g bunyan
$ brew install jq
39. … structure with benefits ...
● Understanding/predicting necessary information
○ Do you need a particular field?
○ Is it useful now?
○ How about future?
○ What can you aggregate from a field?
● Think about aggregations
○ Average over time? You need a number.
○ Histogram on a field? You need a limited set of possible values.
○ Pie chart? Limited set of values.
○ … (yes, I don’t mention geo based aggregations, because you can google that one out)
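The aggregation constraints above can be demonstrated on the structured access-log events from the earlier slide. This is an illustrative sketch (the `aggregate` helper is not from the talk): `response_time` must be numeric to average, and `http_method` has a small closed set of values, so it histograms cleanly.

```python
import json
from collections import Counter


def aggregate(lines):
    """Average a numeric field and histogram an enumerable one
    across a stream of JSON event lines."""
    times = []
    methods = Counter()
    for line in lines:
        event = json.loads(line)
        times.append(float(event["response_time"]))  # number -> average works
        methods[event["http_method"]] += 1           # few values -> histogram works
    avg = sum(times) / len(times) if times else 0.0
    return avg, methods


events = [
    '{"http_method": "GET", "response_time": "0.02"}',
    '{"http_method": "GET", "response_time": "0.04"}',
    '{"http_method": "POST", "response_time": "0.30"}',
]
avg, hist = aggregate(events)
```

Had `response_time` been free-form text, or `http_method` an unbounded string, neither chart would be possible — which is exactly the "predict what you will aggregate" point.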
43. Rich Man’s tracing
● There are a number of libraries (including newrelic) which provide the same
functionality
● End result is similar: You see what parts are touched in a single operation
… yes, I shamelessly trimmed the output.
46. Why tracing?
● At Lamoda it was used to figure out which system in the chain forces the
process to time out.
○ Request returns 504
○ Request touches 8 subsystems
○ Somewhere in the chain nginx is configured with proxy_read_timeout 500ms;
○ None of the individual systems shows response durations higher than 500ms
○ … but the sum of a chain below the capped nginx -- is higher than 500ms!
● Also it helps to find reasons for failing requests
○ Because sometimes you just don’t know the dependencies of two systems
○ When you have multiple services responsible for handling a request, you are doomed to have
a bad time managing your dependencies
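The poor man's version of this tracing boils down to carrying one ID through the whole chain via the `X-Trace-ID` header mentioned later in the summary. A hedged sketch — the helper names are hypothetical: each service reuses the caller's ID (or mints one at the edge) and copies it onto every downstream request, so all subsystems in the chain log the same ID and the path of a single 504 can be stitched back together from the central logs.

```python
import uuid


def incoming_trace_id(headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return headers.get("X-Trace-ID") or uuid.uuid4().hex


def outgoing_headers(trace_id, headers=None):
    """Propagate the trace ID onto a downstream request's headers."""
    headers = dict(headers or {})
    headers["X-Trace-ID"] = trace_id
    return headers


# Edge service: no inbound ID, so a new one is minted...
tid = incoming_trace_id({})
# ...and every call to a subsystem carries it along.
downstream = outgoing_headers(tid, {"Accept": "application/json"})
```

Logging `trace_id` as a field in every structured event then makes "show me everything this request touched" a single search query.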
48. Event examples
● A user clicks on a button
● A hard disk driver emits a failure code
● An aggregated response time larger than X over the last Y
● Scheduled alarm clock goes off
… yes, every morning is an event in this sense...
49. Event anti-examples
● User clicked 15 buttons on average
● Hard disk is spinning normally
● Response time aggregation is X
● This morning it took me 3 alarms to get out of bed
55. Structured list of rants
● Concept of events
● Benefits of structure
● Poor man’s event tracing
● Debug logging in production
57. Summarized list of rants
● A log message is an event
○ Events can be aggregated to extract metrics
○ Metrics can be analysed to figure out complex events
● Structured events are easier to aggregate
○ Less parsing == more throughput
● “X-Trace-ID: ***” … it is just a neat trick when you need to …
● … find a gold nugget in a pile of debug logs.
58. … questions?
… I retain the right to silently ignore difficult ones
60. … and as promised, CLS™ architecture!
scalable, highly available, cloud ready, open source, much buzzword solution
[diagram: LOGS → CLS™ → UNICORNS]
61. [architecture diagram]
Producers (log source → rsyslogd / syslog-ng / fluentd) feed an entry buffer (kafka);
indexers (logstash — mutators, filters, parsers) consume the buffer and write to the
ElasticSearch store (strong ES / weak ES / cold ES), which also serves read-only archives.