Scaling an ELK stack 
Elasticsearch NL meetup 
2014.09.22, Utrecht
1 
Who am I? 
Renzo Tomà 
• IT operations 
• Linux engineer 
• Python developer 
• Likes huge streams of raw data 
• Designed metrics & logsearch platform 
• Married, proud father of two 
And you?
2 
ELK
3 
ELK at bol.com 
Logsearch platform. 
For developers & operations. 
Search & analyze log events using Kibana. 
Events from many sources (e.g. syslog, accesslog, log4j, …) 
Part of our infrastructure. 
Why? Faster root cause analysis => quicker time-to-repair.
4 
Real world examples 
Case: release of new webshop version. 
Nagios alert: jboss processing time. 
Metrics: increase in active threads (and proctime). 
=> Inconclusive! 
Find all HTTP requests to www.bol.com which were slower 
than 5 seconds: 
@type:apache_access AND @fields.site:"www_bol_com" AND 
@fields.responsetimes:[5000000 TO *] 
=> Hits for 1 URL. Enough for DEV to start its RCA.
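A minimal sketch of how such a query string could be assembled in Python. It assumes the response time field holds microseconds (as Apache's %D does), which is why 5 seconds becomes 5000000. The helper name slow_request_query is hypothetical.

```python
def slow_request_query(site, min_seconds):
    """Build a Lucene query string for slow Apache access log events.

    Assumption: @fields.responsetimes holds microseconds (Apache %D).
    """
    min_micros = int(min_seconds * 1000000)
    return (
        '@type:apache_access AND @fields.site:"{site}" AND '
        "@fields.responsetimes:[{micros} TO *]"
    ).format(site=site, micros=min_micros)


print(slow_request_query("www_bol_com", 5))
```

The resulting string can be pasted into the Kibana query box as is.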
5 
Real world examples 
Case: strange performance spikes on webshop. 
Looks bad, but cause unknown. 
Find all errors in webshop log4j logging: 
@fields.application:wsp AND @fields.level:ERROR 
Compare errors before vs during spike. Spot the difference. 
=> Spikes caused by timeouts on a backend service. 
Metrics correlation: the timeouts were not the cause, but a symptom of a 
full GC issue.
6 
Initial design (mid 2013’ish) 
[Diagram] Components shown: 
- Servers, routers, firewalls …: Syslog 
- Apache webservers: Accesslog, tail, remote_syslog pkg 
- Java webapplications (JVM): Log4j syslog appender 
- Central syslog server 
- Logstash: acts as syslog server. Converts lines into events, into json docs. 
- Elasticsearch 
- Kibana2 
Log events use the syslog protocol over UDP as transport. Even for accesslog + log4j.
7 
Initial attempt #fail 
Single logstash instance not fast enough. 
Unable to keep up with events created. 
High CPU load, due to intensive grokking (regex). 
Network buffer overflow. UDP traffic dropped. 
Result: missing events.
8 
Initial attempt #fail 
Log4j events can be multiline (e.g. stacktraces). 
Events are sent per line: 
100 lines = 100 syslog msgs 
Merging by Logstash. 
Remember the UDP drops? 
Result: 
- unparseable events (if 1st line was missing) 
- Swiss cheese. Stacktrace lines were missing.
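A small Python sketch of why dropped datagrams hurt multiline events so much. It is not the Logstash multiline filter, only the same idea: a line that does not look like the start of an event gets attached to the previous one. The timestamp layout and the sample stacktrace are assumptions made for illustration.

```python
import re

# Assumption: a new log4j event starts with a timestamp like "2014-09-22 10:15:00".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def merge_multiline(lines):
    """Merge continuation lines into the preceding event.

    Returns (events, orphans): orphans are lines that arrived before any
    event start was seen, e.g. because the first line was dropped.
    """
    events, orphans = [], []
    for line in lines:
        if EVENT_START.match(line):
            events.append([line])
        elif events:
            events[-1].append(line)
        else:
            orphans.append(line)
    return events, orphans


stacktrace = [
    "2014-09-22 10:15:00 ERROR wsp - Backend call failed",
    "java.net.SocketTimeoutException: Read timed out",
    "\tat com.example.Client.call(Client.java:42)",
    "\tat com.example.Shop.render(Shop.java:7)",
]

# All lines delivered: one complete event.
complete, _ = merge_multiline(stacktrace)

# First datagram dropped: nothing to attach the remaining lines to.
_, orphans = merge_multiline(stacktrace[1:])

# A middle datagram dropped: the event parses, but has a hole in it.
holey, _ = merge_multiline(stacktrace[:2] + stacktrace[3:])

print(len(complete[0]), len(orphans), len(holey[0]))
```

The orphans correspond to the unparseable events, the event with a hole to the Swiss cheese.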
9 
Initial attempt #fail 
Syslog RFC3164: 
“The total length of the packet MUST be 1024 bytes or 
less.” 
Rich Apache LogFormat + lots of cookies = 4 kB easily. 
Anything after byte 1024 got trimmed. 
Result: unparseable events (grok pattern mismatch)
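To make the effect concrete, a Python sketch under stated assumptions: the access log layout and the regex below are simplified stand-ins for a real LogFormat and grok pattern, and syslog header overhead is ignored. The point is only that a pattern expecting fields at the end of the line stops matching once the line is cut at 1024 bytes.

```python
import re

MAX_SYSLOG_PACKET = 1024  # RFC 3164 limit quoted above

# Hypothetical simplified access log layout, standing in for a grok pattern:
# client "request" status "cookie" response_time_in_microseconds
ACCESS_PATTERN = re.compile(
    r'^(?P<client>\S+) "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'"(?P<cookie>[^"]*)" (?P<micros>\d+)$'
)


def truncate_like_syslog(line):
    """Cut a line at the packet limit (syslog header overhead ignored)."""
    return line[:MAX_SYSLOG_PACKET]


def build_line(cookie):
    """Build one sample access log line with the given cookie value."""
    return '10.0.0.1 "GET / HTTP/1.1" 200 "{0}" 5200000'.format(cookie)


small = build_line("session=abc")
large = build_line("c=" + "x" * 4000)  # lots of cookies: about 4 kB

for line in (small, large):
    received = truncate_like_syslog(line)
    print(len(line), len(received), bool(ACCESS_PATTERN.match(received)))
```

The small line survives and matches. The large one arrives as 1024 bytes that end in the middle of the cookie, so the pattern no longer matches.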
10 
The only way is up. 
Improvement proposals: 
- Use queuing to make Logstash horizontally 
scalable. 
- Drop syslog as transport (for non-syslog). 
- Reduce amount of grokking. Pre-formatting at 
source scales better. Less complexity.
11 
Latest design (mid 2014’ish) 
[Diagram] Components shown: 
- Servers, routers, firewalls …: Syslog, central syslog server 
- Apache webservers: Accesslog in jsonevent format 
- Java webapplications (JVM): Log4j with jsonevent layout, Log4j redis appender 
- Lots of other sources 
- Local Logsheep (shown twice) 
- Redis (queue) 
- Logstash (many instances) 
- Elasticsearch 
- Kibana 2 + 3 
Log events arrive in jsonevent format. No grokking required.
12 
Current status #win 
- Logstash: up to 10 instances per env (because of logstash 1.1 version) 
- ES cluster (v1.0.1): 6 data + 2 client nodes 
- Each datanode has 7 datadisks (striping) 
- Indexing at 2k – 4k docs added per second 
- Avg. index time: 0.5ms 
- Peak: 300M docs = 185GB, per day 
- Searches: just a few per hour 
- Shardcount: 3 per idx, 1 replica, 3000 total 
- Retention: up to 60 days
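A quick back-of-the-envelope check of these numbers in Python. Two readings are assumptions: GB is taken as decimal gigabytes, and the 3000 total is taken to count primary and replica shards together.

```python
PEAK_DOCS_PER_DAY = 300 * 10**6
PEAK_BYTES_PER_DAY = 185 * 10**9   # 185 GB, assuming decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60

PRIMARIES_PER_INDEX = 3
REPLICAS_PER_PRIMARY = 1
TOTAL_SHARDS = 3000  # assumed to include replicas

avg_docs_per_second = PEAK_DOCS_PER_DAY / SECONDS_PER_DAY
avg_bytes_per_doc = PEAK_BYTES_PER_DAY / PEAK_DOCS_PER_DAY
shards_per_index = PRIMARIES_PER_INDEX * (1 + REPLICAS_PER_PRIMARY)
open_indices = TOTAL_SHARDS // shards_per_index

print("avg docs/s on a peak day: %.0f" % avg_docs_per_second)
print("avg bytes per doc: %.0f" % avg_bytes_per_doc)
print("shards per index: %d" % shards_per_index)
print("indices implied by shard total: %d" % open_indices)
```

A peak day averages out to roughly 3.5k docs per second, which sits inside the 2k to 4k indexing range quoted above.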
13 
Our lessons learned 
Before anything else! 
Start collecting metrics so you get a baseline. 
No blind tuning. Validate every change fact-based. 
Our weapons of choice: 
• Graphite 
• Diamond (I am a contributor to the ES collector) 
• Jcollectd 
Alternative: try Marvel.
14 
Logstash tip #1 
Insert Redis as queue between source and 
logstash instances: 
- Scale Logstash horizontally 
- High availability (no events get lost) 
[Diagram] Two Redis queues feeding three Logstash instances.
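The queue pattern itself can be sketched in plain Python. This is an in-process stand-in, not Redis: queue.Queue plays the role of the Redis list and three threads play the Logstash instances (all names are hypothetical). It shows the property that makes horizontal scaling work: each event is handed to exactly one consumer, and producers only need to know the queue. It does not demonstrate the high availability part.

```python
import queue
import threading

# Stand-in for a Redis list used as a queue (assumption: producers push,
# consumers do a blocking pop).
events = queue.Queue()
STOP = object()  # sentinel telling a worker to exit

processed = []
processed_lock = threading.Lock()


def logstash_worker(name):
    """Hypothetical consumer: pop events until the stop sentinel arrives."""
    while True:
        event = events.get()
        if event is STOP:
            break
        with processed_lock:
            processed.append((name, event))


workers = [
    threading.Thread(target=logstash_worker, args=("logstash-%d" % i,))
    for i in range(3)
]
for worker in workers:
    worker.start()

# Producers (log sources) only push to the queue.
for n in range(100):
    events.put("event-%d" % n)
for _ in workers:
    events.put(STOP)
for worker in workers:
    worker.join()

# Number of processed events and number of distinct events: equal means
# nothing was lost and nothing was handled twice.
print(len(processed), len({event for _, event in processed}))
```

Adding a fourth consumer is a one-line change, and the producers are not touched.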
15 
Logstash tip #2 
Tune your workers. Find your chokepoint and 
increase its workers to improve throughput. 
[Diagram] Logstash pipeline stages: Input => Filter => Output, with several Filter workers in parallel. 
$ top -H -p $(pgrep logstash)
16 
Logstash tip #3 
Grok is very powerful, but CPU intensive. Hard to 
write, maintain and debug. 
Fix: add capacity. Increase filterworkers (vertical) or add 
more Logstash instances (horizontal). 
Better: feed Logstash with jsonevent input. 
Solutions: 
• Log4j: use log4j-jsonevent-layout 
• Apache: define json output with LogFormat
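The difference between the two routes, sketched in Python. The line layout and the regex are hypothetical and much simpler than a real grok pattern. The regex has to know the layout of the line, while the JSON route only needs a decoder.

```python
import json
import re

# Grok-style route: a regex has to know the exact layout of the line.
# Hypothetical simplified layout, not an actual production pattern.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"(?P<application>\S+) - (?P<message>.*)$"
)


def parse_plain(line):
    """Extract fields from a plain text line, or None on a mismatch."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


def parse_json(line):
    """JSON route: the source already emitted structured fields."""
    return json.loads(line)


plain = "2014-09-22 10:15:00 ERROR wsp - Backend call failed"
structured = json.dumps(
    {
        "timestamp": "2014-09-22 10:15:00",
        "level": "ERROR",
        "application": "wsp",
        "message": "Backend call failed",
    }
)

print(parse_plain(plain) == parse_json(structured))
```

Both routes yield the same fields here. The JSON route keeps working when a field is added at the source, the regex route needs a new pattern.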
17 
Logstash tip #4 (last one) 
Use the HTTP protocol in the Elasticsearch output. 
Avoid version lock-in! 
HTTP may be slower, but newer ES means: 
- Lots of new features 
- Lots of bug fixes 
- Lots of performance improvements 
Most important: you decide what versions to use. 
Logstash v1.4.2 (June ‘14) requires ES v1.1.1 (April ‘14). 
Latest ES version is v1.3.2 (Aug ‘14).
18 
Elasticsearch tip #1 
Do not download a ‘great’ configuration. 
Elasticsearch is very complex. Lots of moving parts. 
Lots of different use-cases. Lots of configuration 
options. The defaults cannot be optimal. 
Start with defaults: 
• Load it (stresstest or pre-launch traffic). 
• Check your metrics. 
• Find your chokepoint. 
• Change setting. 
• Verify and repeat.
19 
Elasticsearch tip #2 
Increase the ‘index.refresh_interval’ setting. 
Refresh: make newly added docs available for 
search. Default value: one second. High impact 
on heavy indexing systems (like ours). 
Change it at runtime & check the metrics: 
$ curl -s -XPUT '0:9200/_all/_settings?index.refresh_interval=5s'
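The same change can be scripted. A minimal Python sketch using only the standard library: it builds the request without sending it, so it can be inspected first. The base URL is a placeholder, the function name is hypothetical, and passing the setting as a JSON body instead of a query parameter is a choice made for this sketch.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url, interval, index="_all"):
    """Build (but do not send) a PUT that updates index.refresh_interval.

    Assumptions: base_url such as "http://localhost:9200" is a placeholder,
    and the setting is passed as a JSON body instead of a query parameter.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url="{0}/{1}/_settings".format(base_url.rstrip("/"), index),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_refresh_interval_request("http://localhost:9200", "5s")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))

# To actually apply it against a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Try one value, watch the indexing metrics for a while, then decide whether to go higher.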
20 
Elasticsearch tip #3 
Use Curator to keep total shardcount constant. 
Uncontrolled shard growth may trigger a sudden 
hockey stick effect. 
Our setup: 
- 6 datanodes 
- 6 shards per index 
- 3 primary, 3 replica 
“One shard per datanode” (YMMV)
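How retention keeps the shard count constant, as a Python sketch. This is not Curator's API, only the arithmetic behind it, assuming one index per day named logstash-YYYY.MM.DD. The function names and the 90 day starting point are made up for illustration.

```python
from datetime import date, timedelta

SHARDS_PER_INDEX = 6  # 3 primary + 3 replica, as in the setup above


def index_name(day):
    """Daily Logstash-style index name, e.g. logstash-2014.09.22."""
    return day.strftime("logstash-%Y.%m.%d")


def expired_indices(names, today, retention_days):
    """Return the names whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        datepart = name.split("-", 1)[1]
        year, month, day = (int(part) for part in datepart.split("."))
        if date(year, month, day) < cutoff:
            expired.append(name)
    return expired


today = date(2014, 9, 22)
existing = [index_name(today - timedelta(days=n)) for n in range(90)]
to_delete = expired_indices(existing, today, retention_days=60)
remaining = len(existing) - len(to_delete)

# Indices to delete, indices kept, shards kept.
print(len(to_delete), remaining, remaining * SHARDS_PER_INDEX)
```

Run daily, the number of kept indices (and so the shard count) stays flat instead of growing with every new day.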
21 
Elasticsearch tip #4 
Become experienced in rolling cluster restarts: 
- to roll out new Elasticsearch releases 
- to apply a config setting (e.g. heap, gc, ..) 
- because it will solve an incident. 
Control concurrency + bandwidth: 
cluster.routing.allocation.node_concurrent_recoveries 
cluster.routing.allocation.cluster_concurrent_rebalance 
indices.recovery.max_bytes_per_sec 
Get confident enough to trust 
doing a rolling restart on a 
Saturday evening! 
(To get this graph. [Graph not reproduced])
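These three settings can be changed at runtime through the cluster settings API (PUT /_cluster/settings). A Python sketch that only builds the JSON body: the setting names come from the slide, every value is a placeholder and not a recommendation, and the helper name is hypothetical.

```python
import json


def cluster_settings_body(transient):
    """Wrap settings for a transient cluster settings update.

    Transient settings are forgotten after a full cluster restart.
    """
    return json.dumps({"transient": transient}, sort_keys=True)


# The setting names come from the slide; every value is a placeholder
# to be tuned against your own metrics.
throttle = cluster_settings_body(
    {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
        "indices.recovery.max_bytes_per_sec": "50mb",
    }
)

print(throttle)
```

Send the body with the HTTP client of your choice, change one value at a time and compare recovery time and iowait before and after.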
22 
Elasticsearch tip #5 (last one) 
Cluster restarts improve recovery time. 
Recovery: compares replica vs primary shard. If 
different, recreate the replica. Costly (iowait) and 
very time consuming. 
But … difference is normal. Primary and replica 
have their own segment merge management: 
same docs, but different bytes. 
After recovery: replica is exact copy of primary. 
Note: only works for stale shards (no more updates). 
You have a lot of those when using daily Logstash indices.
You can contact me via: 
rtoma@bol.com, or
24
Relocation in action
26 
Tools we use 
http://redis.io/ 
Key/value memory store, no-frills queuing, extremely fast. 
Used to scale logstash horizontally. 
https://github.com/emicklei/log4j-redis-appender 
Send log4j event to Redis queue, non-blocking, batch, failover 
https://github.com/emicklei/log4j-jsonevent-layout 
Format log4j events in logstash event layout. 
Why have logstash do lots of grokking, if you can feed it with logstash friendly json? 
http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/ 
Format Apache access logging in logstash event layout. Again: avoid grokking. 
https://github.com/bolcom/ (SOON) 
Logsheep: custom multi-threaded logtailer / udp listener, sends events to redis. 
https://github.com/BrightcoveOS/Diamond/ 
Great metrics collector framework with Elasticsearch collector. I am a contributor. 
https://github.com/elasticsearch/curator 
Tool for automatic Elasticsearch index management (delete, close, optimize, bloom).

Editor's notes

  1. Log4j, multiline: why? Events are sent per line, so Logstash needs to merge them (multiline filter). Lots of messages + UDP drops = unparseable events + Swiss cheese.