Scaling an ELK stack at bol.com

A presentation about the deployment of an ELK stack at bol.com

At bol.com we use Elasticsearch, Logstash and Kibana in a logsearch system that allows our developers and operations people to easily access and search through log events coming from all layers of our infrastructure.

The presentation explains the initial design and its failures. It continues with the latest design (mid 2014) and its improvements. Finally, it gives a set of tips on scaling Logstash and Elasticsearch.

These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.


Scaling an ELK stack at bol.com

  1. Scaling an ELK stack. Elasticsearch NL meetup, 2014.09.22, Utrecht
  2. Who am I? Renzo Tomà • IT operations • Linux engineer • Python developer • Likes huge streams of raw data • Designed metrics & logsearch platform • Married, proud father of two. And you?
  3. ELK
  4. ELK at bol.com. Logsearch platform. For developers & operations. Search & analyze log events using Kibana. Events from many sources (e.g. syslog, accesslog, log4j, …). Part of our infrastructure. Why? Faster root cause analysis => quicker time-to-repair.
  5. Real world examples. Case: release of new webshop version. Nagios alert: jboss processing time. Metrics: increase in active threads (and proctime). => Inconclusive! Find all HTTP requests to www.bol.com which were slower than 5 seconds: @type:apache_access AND @fields.site:"www_bol_com" AND @fields.responsetimes:[5.000.000 TO *] => Hits for 1 URL. Enough for DEV to start its RCA.
  6. Real world examples. Case: strange performance spikes on webshop. Looks bad, but cause unknown. Find all errors in webshop log4j logging: @fields.application:wsp AND @fields.level:ERROR. Compare errors before vs during spike. Spot the difference. => Spikes caused by timeouts on a backend service. Metrics correlation: timeouts not cause, but symptom of full GC issue.
  7. Initial design (mid 2013’ish). [Flattened architecture diagram] Components: servers, routers, firewalls and a central syslog server (syslog), Apache webservers (accesslog, tailed via the remote_syslog pkg), Java webapplications (JVM, log4j syslog appender), Logstash, Elasticsearch, Kibana2. Logstash acts as syslog server and converts lines into events, into json docs. The syslog protocol over UDP is used as transport, even for accesslog + log4j.
  8. Initial attempt #fail. Single logstash instance not fast enough. Unable to keep up with events created. High CPU load, due to intensive grokking (regex). Network buffer overflow. UDP traffic dropped. Result: missing events.
  9. Initial attempt #fail. Log4j events can be multiline (e.g. stacktraces). Events are sent per line: 100 lines = 100 syslog msgs. Merging by Logstash. Remember the UDP drops? Result: - unparseable events (if 1st line was missing) - Swiss cheese: stacktrace lines were missing.
  10. Initial attempt #fail. Syslog RFC3164: “The total length of the packet MUST be 1024 bytes or less.” Rich Apache LogFormat + lots of cookies = 4kb easily. Anything after byte 1024 got trimmed. Result: unparseable events (grok pattern mismatch).
  11. The only way is up. Improvement proposals: - Use queuing to make Logstash horizontally scalable. - Drop syslog as transport (for non-syslog). - Reduce amount of grokking. Pre-formatting at source scales better. Less complexity.
  12. Latest design (mid 2014’ish). [Flattened architecture diagram] Components: servers, routers, firewalls and a central syslog server (syslog), lots of other sources, Apache webservers (accesslog in jsonevent format, shipped by a local Logsheep), Java webapplications (JVM, log4j with jsonevent layout and redis appender), Redis (queue), many Logstash instances, Elasticsearch, Kibana 2 + 3. Events arrive in jsonevent format. No grokking required.
  13. Current status #win. - Logstash: up to 10 instances per env (because of logstash 1.1 version) - ES cluster (v1.0.1): 6 data + 2 client nodes - Each datanode has 7 datadisks (striping) - Indexing at 2k – 4k docs added per second - Avg. index time: 0.5ms - Peak: 300M docs = 185GB, per day - Searches: just a few per hour - Shardcount: 3 per idx, 1 replica, 3000 total - Retention: up to 60 days
  14. Our lessons learned. Before anything else! Start collecting metrics so you get a baseline. No blind tuning. Validate every change fact-based. Our weapons of choice: • Graphite • Diamond (I am a contributor to the ES collector; sketch below) • Jcollectd Alternative: try Marvel.
  15. Logstash tip #1. Insert Redis as a queue between source and logstash instances (sketch below): - Scale Logstash horizontally - High availability (no events get lost) [Diagram: sources feed Redis queues, which feed multiple Logstash instances]
  16. Logstash tip #2. Tune your workers. Find your chokepoint and increase its workers to improve throughput (sketch below). [Diagram: input, filter and output worker stages] $ top -H -p $(pgrep logstash)
  17. Logstash tip #3. Grok is very powerful, but CPU intensive. Hard to write, maintain and debug. Fix: vertical scaling. Increase filterworkers or add more Logstash instances. Better: feed Logstash with jsonevent input (sketch below). Solutions: • Log4j: use log4j-jsonevent-layout • Apache: define json output with LogFormat
  18. Logstash tip #4 (last one). Use the HTTP protocol Elasticsearch output (sketch below). Avoid a version lock-in! HTTP may be slower, but newer ES means: - Lots of new features - Lots of bug fixes - Lots of performance improvements Most important: you decide what versions to use. Logstash v1.4.2 (June ‘14) requires ES v1.1.1 (April ‘14). Latest ES version is v1.3.2 (Aug ‘14).
  19. Elasticsearch tip #1. Do not download a ‘great’ configuration. Elasticsearch is very complex. Lots of moving parts. Lots of different use-cases. Lots of configuration options. The defaults cannot be optimal for every use-case. Start with the defaults (sketch below): • Load it (stresstest or pre-launch traffic). • Check your metrics. • Find your chokepoint. • Change setting. • Verify and repeat.
  20. Elasticsearch tip #2. Increase the ‘index.refresh_interval’ setting. Refresh: make newly added docs available for search. Default value: one second. High impact on heavy indexing systems (like ours). Change it at runtime & check the metrics (sketch below): $ curl -s -XPUT 0:9200/_all/_settings?index.refresh_interval=5s
  21. Elasticsearch tip #3. Use Curator to keep total shardcount constant (sketch below). Uncontrolled shard growth may trigger a sudden hockey stick effect. Our setup: - 6 datanodes - 6 shards per index - 3 primary, 3 replica. “One shard per datanode” (YMMV)
  22. Elasticsearch tip #4. Become experienced in rolling cluster restarts: - to roll out new Elasticsearch releases - to apply a config setting (e.g. heap, gc, ...) - because some day it will solve an incident. Control concurrency + bandwidth (sketch below): cluster.routing.allocation.node_concurrent_recoveries cluster.routing.allocation.cluster_concurrent_rebalance indices.recovery.max_bytes_per_sec Get confident enough to trust doing a rolling restart on a Saturday evening!
  23. Elasticsearch tip #5 (last one). Cluster restarts improve recovery time. Recovery: compares replica vs primary shard. If different, recreate the replica. Costly (iowait) and very time consuming. But … difference is normal. Primary and replica have their own segment merge management: same docs, but different bytes. After recovery: the replica is an exact copy of the primary. Note: this only works for stale shards (no more updates). You have a lot of those when using daily Logstash indices.
  24. You can contact me via: rtoma@bol.com, or …
  25. [image-only slide]
  26. Relocation in action
  27. Tools we use. http://redis.io/ Key/value memory store, no-frills queuing, extremely fast. Used to scale logstash horizontally. https://github.com/emicklei/log4j-redis-appender Send log4j events to a Redis queue, non-blocking, batch, failover. https://github.com/emicklei/log4j-jsonevent-layout Format log4j events in logstash event layout. Why have logstash do lots of grokking, if you can feed it with logstash friendly json. http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/ Format Apache access logging in logstash event layout. Again: avoid grokking. https://github.com/bolcom/ (SOON) Logsheep: custom multi-threaded logtailer / udp listener, sends events to redis. https://github.com/BrightcoveOS/Diamond/ Great metrics collector framework with Elasticsearch collector. I am a contributor. https://github.com/elasticsearch/curator Tool for automatic Elasticsearch index management (delete, close, optimize, bloom).
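
The sketches below illustrate the tips from the slides; they are minimal examples under stated assumptions, not the exact configurations used at bol.com.

Metrics baseline (slide 14). A sketch of enabling Diamond's Elasticsearch collector, assuming Diamond's per-collector ini convention; the file path, collector name and option names may differ per Diamond version and are assumptions here.

    # /etc/diamond/collectors/ElasticSearchCollector.conf  (path/name assumed)
    enabled = True
    # node to scrape stats from (assumed option names, defaults are usually local)
    host = 127.0.0.1
    port = 9200

Diamond then ships the collected Elasticsearch metrics to Graphite, which gives the baseline to validate every change against.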
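
Logstash tip #1 (slide 15). A minimal sketch of the Redis-as-queue pattern in Logstash 1.x config syntax; the hostname and key name are placeholders.

    # shipper side: push events onto a Redis list
    output {
      redis {
        host => "redis.example.com"   # placeholder
        data_type => "list"
        key => "logstash"
      }
    }

    # indexer side: any number of Logstash instances pop from the same list
    input {
      redis {
        host => "redis.example.com"   # placeholder
        data_type => "list"
        key => "logstash"
      }
    }

Adding indexer instances that read the same list is how Logstash scales horizontally here; events wait in Redis instead of being dropped.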
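
Logstash tip #2 (slide 16). A sketch of finding the chokepoint per thread and raising the filter worker count, assuming the Logstash 1.x agent flags (-w / --filterworkers); the config path is a placeholder.

    # which Logstash threads burn CPU: input, filter or output?
    $ top -H -p $(pgrep -f logstash)

    # if filtering is the bottleneck, start the agent with more filter workers
    $ logstash agent -f /etc/logstash/indexer.conf -w 4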
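
Logstash tip #3 (slide 17). A sketch of making Apache write access logs as Logstash-friendly JSON (in the spirit of the untergeek article on slide 27), so the events need no grok at all; the field names and log path are illustrative, not a fixed schema.

    # illustrative JSON access log format
    LogFormat "{ \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \"clientip\": \"%a\", \"verb\": \"%m\", \"request\": \"%U%q\", \"response\": %>s, \"bytes\": %B, \"responsetime_us\": %D }" logstash_json
    CustomLog /var/log/apache2/access_json.log logstash_json

On the Log4j side the equivalent is log4j-jsonevent-layout, as listed on the tools slide.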
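
Logstash tip #4 (slide 18). A sketch of the Logstash 1.4 elasticsearch output talking plain HTTP instead of joining the cluster as a node/transport client, which is what removes the version lock-in; the hostname is a placeholder.

    output {
      elasticsearch {
        host => "es-client.example.com"   # placeholder: e.g. an ES client node
        port => 9200
        protocol => "http"   # node/transport ties Logstash to a matching ES client version
      }
    }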
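
Elasticsearch tip #1 (slide 19). A sketch of pulling the numbers the tuning loop needs (heap, indexing, search, merges) from the ES 1.x stats APIs, so every change is verified against real metrics.

    # per-node JVM and indices statistics
    $ curl -s 'localhost:9200/_nodes/stats/jvm,indices?pretty'

    # cluster-wide view: status, node count, shard counts
    $ curl -s 'localhost:9200/_cluster/health?pretty'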
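
Elasticsearch tip #2 (slide 20). The same runtime change written with the JSON-body form of the update index settings API:

    # relax the refresh interval on all indices from the default 1s to 5s
    $ curl -XPUT 'localhost:9200/_all/_settings' -d '{
        "index": { "refresh_interval": "5s" }
      }'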
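
Elasticsearch tip #3 (slide 21). Curator automates the housekeeping that keeps the shard count flat: dropping daily Logstash indices that fall outside retention. Its CLI flags vary per Curator version, so this sketch shows the underlying API calls instead; the index name is an illustrative placeholder.

    # list daily logstash indices with their shard and size footprint
    $ curl -s 'localhost:9200/_cat/indices/logstash-*?v'

    # delete a daily index that is past the 60-day retention
    $ curl -XDELETE 'localhost:9200/logstash-2014.07.01'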
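
Elasticsearch tip #4 (slide 22). A sketch of applying the recovery/rebalance throttles named on the slide at runtime through the cluster settings API, plus the allocation switch commonly toggled around a rolling restart on ES 1.x; the values are illustrative, not recommendations.

    # throttle concurrent recoveries, rebalances and recovery bandwidth
    $ curl -XPUT 'localhost:9200/_cluster/settings' -d '{
        "transient": {
          "cluster.routing.allocation.node_concurrent_recoveries": 4,
          "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
          "indices.recovery.max_bytes_per_sec": "40mb"
        }
      }'

    # before stopping a node: pause shard allocation; set it back to "all" once the node has rejoined
    $ curl -XPUT 'localhost:9200/_cluster/settings' -d '{
        "transient": { "cluster.routing.allocation.enable": "none" }
      }'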
