HighLoad++ 2017
Зал «Пекин+Шанхай», 7 ноября, 16:00
Abstract:
http://www.highload.ru/2017/abstracts/2842.html
A story about real life experience in Lamoda, featuring logging, forest animals, limited size buffers and morning routines.
Possible takeaways from this presentation:
1. Understanding the need for centralized log aggregation
2. Learning a few tips about logging and event aggregation
3. Saving a lot of money by implementing your own personal "poor-man's" NewRelic
...
4. Structured list of rants
● Concept of events
● Benefits of structure
● Poor man’s event tracing
● Debug logging in production
… does anyone here run docker in production?
5. Why do people log?
● To ensure their application is running
● To debug a running application
● To see activity traces
● To see errors or edge cases manifesting
● To gather audit information
● To collect runtime metrics/usage patterns
… ordered by subjective importance
7. How do people log?
● STDOUT
● STDERR
● FILE
● NETWORK
● UNIX SOCKET
Protocol?
Library?
Endpoints?
Path?
Retention?
TCP/UDP?
Security?
…?
11. Because...
1. By decoupling the logging transport from the application, you enable other
team(s) to get your logs to a central place without any effort on your side
2. By keeping the log transport free of complicated assumptions, your
application stays more portable
3. By stepping away from standard logging protocols (... did I mention I am
not a fan of RFC-5424?) you can use more elegant data encodings (... but I
like JSON?)
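The decoupling above can be sketched in a few lines. This is a minimal illustration, not Lamoda's actual code: the `log_event` helper and its field names are assumptions. The application only writes one JSON object per line to STDOUT; shipping those lines to a central place is left entirely to the environment (container runtime, journald, a sidecar).

```python
import datetime
import json
import sys


def log_event(message, **fields):
    """Emit one structured event as a single JSON line on STDOUT.

    No sockets, no syslog protocol, no file paths: the transport is
    whatever captures STDOUT, which keeps the application portable.
    """
    event = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "message": message,
    }
    event.update(fields)
    sys.stdout.write(json.dumps(event) + "\n")


log_event("request finished", http_method="GET", response_time=0.02)
```

Anything that can read a pipe can now own delivery and retention, answering the protocol/endpoint/retention questions outside the application.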
34. More points on debug logs
● Usually you filter out the crap and only look for understandable logs
○ Searching for needle in a barn is not a fun activity
● Logs are not a good place to store sensitive data
○ “user %s authorised with password %s bought item SKU#%s” is my favourite.
● You want to see patterns, not individual errors
○ “%s caught for %s user, while in %s” is hard to find when you are not 100% certain what
you are looking for
○ “exception caught in purchase pipeline” is better; I’ll cover the details in the next topic.
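Both points above can be shown in one small sketch — a hypothetical helper, not the talk's actual code: the message text stays constant and searchable, the specifics move into separate fields, and fields on an assumed deny-list (like `password`) are dropped before anything is serialized.

```python
import json

# Assumed field names to redact; adjust per application.
SENSITIVE = {"password", "card_number"}


def purchase_event(user_id, sku, **extra):
    """Build a log record with a stable, pattern-friendly message.

    The message never varies, so a query for
    "exception caught in purchase pipeline" matches every occurrence;
    the details live in structured fields, minus the sensitive ones.
    """
    fields = {k: v for k, v in extra.items() if k not in SENSITIVE}
    fields["message"] = "exception caught in purchase pipeline"
    fields["user_id"] = user_id
    fields["sku"] = sku
    return json.dumps(fields)
```

A record built this way is easy to find by message and safe to keep, unlike the "%s authorised with password %s" favourite.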
36. [pid: 5151|app: 0|req: 2166118/25642392] 10.5.244.32 () {42 vars in 865 bytes} [Wed
Jul 19 18:54:39 2017] GET
/api/v1/pp_preview/?models=reviews%2Cquestions&sku=XXXXXXXXXXX&limit=50&country=by =>
generated 1285 bytes in 20 msecs (HTTP/1.0 200) 5 headers in 154 bytes (1 switches on
core 0)
VS
{"pid": 5151, "req_n": 2166118, "source_ip": "10.5.244.32", "response_size": 1285,
"timestamp": "2017-07-19T18:54:39Z", "http_method": "GET",
"http_url": "/api/v1/pp_preview/?models=reviews%2Cquestions&sku=XXXXXXXXXXX&limit=50&country=by",
"response_time": "0.02", ... }
$ open "goo.gl/RS43Rg"
$ npm install -g bunyan
$ brew install jq
39. … structure with benefits ...
● Understanding/predicting necessary information
○ Do you need a particular field?
○ Is it useful now?
○ How about future?
○ What can you aggregate from a field?
● Think about aggregations
○ Average over time? You need a number.
○ Histogram on a field? You need a limited set of possible values.
○ Pie chart? Limited set of values.
○ … (yes, I don’t mention geo based aggregations, because you can google that one out)
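The aggregation constraints above can be demonstrated on the structured access-log events from the earlier slide. This is an illustrative sketch (the `aggregate` helper is not from the talk): `response_time` must be numeric to average, and `http_method` has a small closed set of values, so it histograms cleanly.

```python
import json
from collections import Counter


def aggregate(lines):
    """Average a numeric field and histogram an enumerable one
    across a stream of JSON event lines."""
    times = []
    methods = Counter()
    for line in lines:
        event = json.loads(line)
        times.append(float(event["response_time"]))  # number -> average works
        methods[event["http_method"]] += 1           # few values -> histogram works
    avg = sum(times) / len(times) if times else 0.0
    return avg, methods


events = [
    '{"http_method": "GET", "response_time": "0.02"}',
    '{"http_method": "GET", "response_time": "0.04"}',
    '{"http_method": "POST", "response_time": "0.30"}',
]
avg, hist = aggregate(events)
```

Had `response_time` been free-form text, or `http_method` an unbounded string, neither chart would be possible — which is exactly the "predict what you will aggregate" point.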
43. Rich Man’s tracing
● There are a number of libraries (including newrelic) which provide the same
functionality
● End result is similar: You see what parts are touched in a single operation
… yes, I shamelessly trimmed the output.
46. Why tracing?
● At Lamoda it was used to figure out which system in the chain forces the
process to time out.
○ Request returns 504
○ Request touches 8 subsystems
○ Somewhere in the chain nginx is configured with proxy_read_timeout 500ms;
○ None of the individual systems shows response durations higher than 500ms
○ … but the sum of a chain below the capped nginx -- is higher than 500ms!
● Also it helps to find reasons for failing requests
○ Because sometimes you just don’t know the dependencies of two systems
○ When you have multiple services responsible for handling a request, you are doomed to have
a bad time managing your dependencies
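The poor man's version of this tracing boils down to carrying one ID through the whole chain via the `X-Trace-ID` header mentioned later in the summary. A hedged sketch — the helper names are hypothetical: each service reuses the caller's ID (or mints one at the edge) and copies it onto every downstream request, so all subsystems in the chain log the same ID and the path of a single 504 can be stitched back together from the central logs.

```python
import uuid


def incoming_trace_id(headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    return headers.get("X-Trace-ID") or uuid.uuid4().hex


def outgoing_headers(trace_id, headers=None):
    """Propagate the trace ID onto a downstream request's headers."""
    headers = dict(headers or {})
    headers["X-Trace-ID"] = trace_id
    return headers


# Edge service: no inbound ID, so a new one is minted...
tid = incoming_trace_id({})
# ...and every call to a subsystem carries it along.
downstream = outgoing_headers(tid, {"Accept": "application/json"})
```

Logging `trace_id` as a field in every structured event then makes "show me everything this request touched" a single search query.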
48. Event examples
● A user clicks on a button
● A hard disk driver emits a failure code
● An aggregated response time larger than X over the last Y
● Scheduled alarm clock goes off
… yes, every morning is an event in this sense...
49. Event anti-examples
● User clicked 15 buttons on average
● Hard disk is spinning normally
● Response time aggregation is X
● This morning it took me 3 alarms to get out of bed
55. Structured list of rants
● Concept of events
● Benefits of structure
● Poor man’s event tracing
● Debug logging in production
57. Summarized list of rants
● A log message is an event
○ Events can be aggregated to extract metrics
○ Metrics can be analysed to figure out complex events
● Structured events are easier to aggregate
○ Less parsing == more throughput
● “X-Trace-ID: ***” … it is just a neat trick when you need to …
● … find a gold nugget in a pile of debug logs.
58. … questions?
… I retain the right to silently ignore difficult ones
60. … and as promised, CLS™ architecture!
scalable, highly available, cloud ready, open source, much buzzword solution
[diagram: LOGS → CLS™ → UNICORNS]
61. [architecture diagram]
Producers (log source → rsyslogd / syslog-ng / fluentd) feed an entry buffer (kafka);
indexers (logstash — mutators, filters, parsers) consume the buffer and write to the
ElasticSearch store (strong ES / weak ES / cold ES), which also serves read-only archives.