SlideShare una empresa de Scribd logo
1 de 50
Descargar para leer sin conexión
WWW.HAPROXY.COM
Observability
with HAProxy
WWW.HAPROXY.COM
● Introduction to “observability”
● Deep dive in HAProxy
● Tips (to get more from HAProxy)
● RoX (Return On eXperience)
● Conclusion (because we need one)
Agenda
WWW.HAPROXY.COMWWW.HAPROXY.COM
Introduction
WWW.HAPROXY.COM
Observability:
"control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs"
(wikipedia)
In short for us: WTF is going on ?
Definition
WWW.HAPROXY.COM
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
Observability vs monitoring
WWW.HAPROXY.COM
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve users experience where it matters
Observability is crucial
WWW.HAPROXY.COM
● central place
● in distributed systems, one LB per layer, even better when
sidecar
● sees multiple targets, eases comparisons
=> draw references
● Trusted low level component & excellent transparency
● many LB decisions actually depend on metrics related to
performance and observability
● again, logs provide long-term references
The LB as an observation tower
WWW.HAPROXY.COM
● Maintain two types of TCP connections
○ From client to HAPRoxy (1)
○ From HAProxy to server (2)
○ Only access to data payload (not the “packet”)
LB in Reverse-proxy mode
HAProxyClient Server
(1) (2)
WWW.HAPROXY.COM
● you should! Especially in distributed systems (microservices,
etc)!
● but often it's too late when the first incident happens!
● with existing LB's logs, it's already possible to do a lot
Note: check OpenTracing and Prometheus
Why not looking from other points?
WWW.HAPROXY.COM
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning (eg:
conntrack)
● connection slowdowns caused by inefficient firewall policies
(#rules)
● Non exhaustive list ...
What does the LB see?
WWW.HAPROXY.COM
● client-side issues (BW limitations)
● per-URL processing time (application issues, svc partners)
● per-node vs per-cluster variations
=> narrow down to individual node or shared resource
● deployment issues : new occasional error on a specific page,
can be addressed before going full-scale
What does the LB see?
WWW.HAPROXY.COMWWW.HAPROXY.COM
Deep dive in
HAProxy
WWW.HAPROXY.COM
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {}
"GET /index.html HTTP/1.1"
I need help !!!
WWW.HAPROXY.COM
● Logs :
○ Halog, ELK, Splunk, datadog…
○ Provides unique-id for tracing/event correlation
● Statistics :
○ Stats page, CLI, hatop
○ Prometheus, statsd, ...
● Stick-tables (per arbitrary key like IP, URL, cookie, ...) :
○ Byte count, cumulated/concurrent conns, errors, rates…
Accessing metrics in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
● HAProxy now supports heavier per-request workloads (Lua,
device identification, …)
● Processing times over 200 µs can become noticeable
● Actions:
○ log per-request total CPU time spent in analysers
○ log per-request total CPU time spent in TLS handshake
○ log per-request total latency added by other tasks
○ Ability to kill offending tasks
○ Ability to alert on high latencies
=> make HAProxy as observable as other components
And more to come in HAProxy 1.9 (Nov. 2018)
WWW.HAPROXY.COM
● Timers are averaged in the stats
● Each timer appears in the logs
● halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Event timing reports
Timers Term code Cookie code
WWW.HAPROXY.COM
● Who, when, why and how the session has been terminated
● Distinguish between timeout and abort
● Indicate who (client, server, haproxy, kill, ...)
● Indicate when (req, queue, connect, response, data...)
● Completed by persistence cookie indications
● Filtered and sorted by halog :
halog -tcn|-TCN ... # for filtering
halog -tc # for sorting
=> Graphing the number of errors reported by second helps
detecting an issue is in progress somewhere
Termination codes
WWW.HAPROXY.COM
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations between
application deployments
HTTP status code distribution
WWW.HAPROXY.COM
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
Other relevant metrics: queues
WWW.HAPROXY.COM
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates abnormally
slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
Other relevant metrics: LB fairness
WWW.HAPROXY.COM
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
Other relevant metrics: Error rate
WWW.HAPROXY.COM
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Useful entries in log-format
Timers
HTTP status Byte
count
Term
Code
Cookie
Code
Conn
Count
Queue
Length
WWW.HAPROXY.COMWWW.HAPROXY.COM
Tips !!!!
WWW.HAPROXY.COM
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s without
sweating
● that's 1.7B events/day or 350GB of uncompressed haproxy
logs/day
● compresses to 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic / not interested in this level of detail ?
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
Sampling: why / When
Tips !!!!
WWW.HAPROXY.COM
● you only want to catch suspicious events
● enable logs per url, source IP, etc...
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
WARNING: you'll lose any valid reference
Selective logging: why / When
Tips !!!!
WWW.HAPROXY.COM
● Poorly documented, use halog --help
● response time per url: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB of log file per second)
=> Use it in production to figure the relevant metrics
Other halog goodies
Tips !!!!
WWW.HAPROXY.COMWWW.HAPROXY.COM
ROX
(Return On eXperience)
WWW.HAPROXY.COM
● Many ops and dev do this every day all around the world
● Users are complaining that the application is slow
● Devops team check HAProxy logs (using halog or ELK) in order to isolate
potential sources of issue:
○ Check server Tc
○ Sort servers by response time
○ Sort URLs by response time
● HAProxy won’t fix the problem (in most cases), but will drastically reduce the
troubleshooting time by isolating the potential issues
Web application is slow!!!!!!!!
Return On eXperience
WWW.HAPROXY.COM
● Retail customer, with shops everywhere in France
● Cash register reports the following error message “Unable to reach central
system” when searching people in the loyalty card database, after waiting up to
10s.
● Argument between customer and “central system” software provider:
○ Customer: your software is slow
○ Provider: your network triggers this error
● After installing HAProxy, its logs reported the following:
○ Time to get the query from the client: 300ms (from North of France to Paris)
○ Server response time: 50s
● Conclusion: provider turned off debug mode in the “central system” software
Unavailable central system ?!?
Return On eXperience
WWW.HAPROXY.COM
● Customer with big TLS traffic, mostly API based
● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if
there seem to be a lot of unused resources on the box
● At first, this was qualified as an “HAProxy” issue (including the HW and OS
itself)
● Hard tuning the box did not improve anything
● halog Tc percentile reported higher values when the load increased (up to
30ms, on the LAN!!!)
● After asking customer to double check the Spanning Tree, it seemed a small 24
ports ToR switch became root bridge….
● Fixing spanning tree also fixed HAProxy’s TLS performance
HAProxy TLS performance issue
Return On eXperience
WWW.HAPROXY.COM
● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct
● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct
=> both haproxy and servers out of cause
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently at cause but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, origin could even be identified
Spotting a broken fiber channel between 2
core switches
Return On eXperience
WWW.HAPROXY.COM
● Tc abnormally high with lots of random values to several
seconds, and only for TLS
● Tc timer also covers TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> system was regularly running out of entropy due to
mistakenly using /dev/random as a random source for SSL
TIP: use /dev/urandom ;)
Wrong web server configuration using
/dev/random
Return On eXperience
WWW.HAPROXY.COMWWW.HAPROXY.COM
Conclusion
WWW.HAPROXY.COM
● exploit your stats
● Choose the right LB
● enable logs on LBs, no excuse for not doing it!
● process them automatically, manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are witnessed
● Enjoy :-)
Conclusion
Conclusion
WWW.HAPROXY.COM
QUESTION &
ANSWER

Más contenido relacionado

La actualidad más candente

Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
HA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedHA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedGanapathi Kandaswamy
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Vietnam Open Infrastructure User Group
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Vietnam Open Infrastructure User Group
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
Linking Metrics to Logs using Loki
Linking Metrics to Logs using LokiLinking Metrics to Logs using Loki
Linking Metrics to Logs using LokiKnoldus Inc.
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouseAltinity Ltd
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a NutshellKaran Singh
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to KubernetesVishal Biyani
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Learn O11y from Grafana ecosystem.
Learn O11y from Grafana ecosystem.Learn O11y from Grafana ecosystem.
Learn O11y from Grafana ecosystem.HungWei Chiu
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18CodeOps Technologies LLP
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
 
Nginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxNginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxwonyong hwang
 

La actualidad más candente (20)

Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
HA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedHA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and Keepalived
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
Room 1 - 4 - Phạm Tường Chiến & Trần Văn Thắng - Deliver managed Kubernetes C...
 
Airflow 101
Airflow 101Airflow 101
Airflow 101
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Linking Metrics to Logs using Loki
Linking Metrics to Logs using LokiLinking Metrics to Logs using Loki
Linking Metrics to Logs using Loki
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouse
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Ceph and Openstack in a Nutshell
Ceph and Openstack in a NutshellCeph and Openstack in a Nutshell
Ceph and Openstack in a Nutshell
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetes
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Learn O11y from Grafana ecosystem.
Learn O11y from Grafana ecosystem.Learn O11y from Grafana ecosystem.
Learn O11y from Grafana ecosystem.
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Power-up services with gRPC
Power-up services with gRPCPower-up services with gRPC
Power-up services with gRPC
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
Nginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxNginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptx
 

Similar a Observability with HAProxy

Observability tips for HAProxy
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxyWilly Tarreau
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Website & Internet + Performance testing
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testingRoman Ananev
 
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Ontico
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0Mike Belshe
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetJuan Berner
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackJakub Hajek
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaborationJulien Pivotto
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020OdessaJS Conf
 
University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)Mike Belshe
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...NETWAYS
 
WTF is Sensu and Monitoring
WTF is Sensu and MonitoringWTF is Sensu and Monitoring
WTF is Sensu and MonitoringToby Jackson
 
Approaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandApproaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandMaarten Balliauw
 

Similar a Observability with HAProxy (20)

Observability tips for HAProxy
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxy
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Website & Internet + Performance testing
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testing
 
Log Files
Log FilesLog Files
Log Files
 
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
 
University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
 
WTF is Sensu and Monitoring
WTF is Sensu and MonitoringWTF is Sensu and Monitoring
WTF is Sensu and Monitoring
 
Approaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandApproaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days Poland
 

Último

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Observability with HAProxy

  • 2. WWW.HAPROXY.COM ● Introduction to “observability” ● Deep dive in HAProxy ● Tips (to get more from HAProxy) ● RoX (Return On eXperience) ● Conclusion (because we need one) Agenda
  • 4. WWW.HAPROXY.COM Observability: "control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" (wikipedia) In short for us: WTF is going on ? Definition
  • 5. WWW.HAPROXY.COM ● Monitoring tells you how well something works (or not) ● Observability helps you detect what is not working and why => You monitor an observable system. Observability vs monitoring
  • 6. WWW.HAPROXY.COM ● can’t rely on impatient users’ complaints anymore => detect and fix trouble not reported yet ● stop adding duct tape, address root causes! ● improve users experience where it matters Observability is crucial
  • 7. WWW.HAPROXY.COM ● central place ● in distributed systems, one LB per layer, even better when sidecar ● sees multiple targets, eases comparisons => draw references ● Trusted low level component & excellent transparency ● many LB decisions actually depend on metrics related to performance and observability ● again, logs provide long-term references The LB as an observation tower
  • 8. WWW.HAPROXY.COM ● Maintain two types of TCP connections ○ From client to HAPRoxy (1) ○ From HAProxy to server (2) ○ Only access to data payload (not the “packet”) LB in Reverse-proxy mode HAProxyClient Server (1) (2)
  • 9. WWW.HAPROXY.COM ● you should! Especially in distributed systems (microservices, etc)! ● but often it's too late when the first incident happens! ● with existing LB's logs, it's already possible to do a lot Note: check OpenTracing and Prometheus Why not looking from other points?
  • 10. WWW.HAPROXY.COM ● global failures (aborts, timeouts) ● abnormal delays caused by network retransmits ● connection failures and retries caused by bad tuning (eg: conntrack) ● connection slowdowns caused by inefficient firewall policies (#rules) ● Non exhaustive list ... What does the LB see?
  • 11. WWW.HAPROXY.COM ● client-side issues (BW limitations) ● per-URL processing time (application issues, svc partners) ● per-node vs per-cluster variations => narrow down to individual node or shared resource ● deployment issues : new occasional error on a specific page, can be addressed before going full-scale What does the LB see?
  • 13. WWW.HAPROXY.COM haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" I need help !!!
  • 14. WWW.HAPROXY.COM ● Logs : ○ Halog, ELK, Splunk, datadog… ○ Provides unique-id for tracing/event correlation ● Statistics : ○ Stats page, CLI, hatop ○ Prometheus, statsd, ... ● Stick-tables (per arbitrary key like IP, URL, cookie, ...) : ○ Byte count, cumulated/concurrent conns, errors, rates… Accessing metrics in HAProxy
  • 30. WWW.HAPROXY.COM ● HAProxy now supports heavier per-request workloads (Lua, device identification, …) ● Processing times over 200 µs can become noticeable ● Actions: ○ log per-request total CPU time spent in analysers ○ log per-request total CPU time spent in TLS handshake ○ log per-request total latency added by other tasks ○ Ability to kill offending tasks ○ Ability to alert on high latencies => make HAProxy as observable as other components And more to come in HAProxy 1.9 (Nov. 2018)
  • 31. WWW.HAPROXY.COM ● Timers are averaged in the stats ● Each timer appears in the logs ● halog -rt/-RT/-pct for quick analysis ● Each timer crossing a limit triggers a timeout ● Each abort at a specific step causes a hard error => termination codes haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Event timing reports Timers Term code Cookie code
  • 32. WWW.HAPROXY.COM ● Who, when, why and how the session has been terminated ● Distinguish between timeout and abort ● Indicate who (client, server, haproxy, kill, ...) ● Indicate when (req, queue, connect, response, data...) ● Completed by persistence cookie indications ● Filtered and sorted by halog : halog -tcn|-TCN ... # for filtering halog -tc # for sorting => Graphing the number of errors reported by second helps detecting an issue is in progress somewhere Termination codes
  • 33. WWW.HAPROXY.COM ● Stats page: distribution per frontend/backend/server ● Filter by ranges: halog -hs/-HS ● Sorted output: halog -st => graph the distribution and watch for variations between application deployments HTTP status code distribution
  • 34. WWW.HAPROXY.COM ● Uses server maxconn ● Grows exponentially with slowdowns : easy to detect! ● Tells you how many extra servers you need ● Reported by halog -Q/-QS ● Shown in real time on the stats page per backend/srv => If you watch only one metric, watch this one! Other relevant metrics: queues
  • 35. WWW.HAPROXY.COM LB algorithm implies fairness between servers : ● Equal request count with roundrobin => Higher than average concurrency indicates abnormally slow server ● Equal load with leastconn => Low req count indicates abnormally slow server => graph relevant values within the farm Other relevant metrics: LB fairness
  • 36. WWW.HAPROXY.COM ● Global: halog -e ● Per server: halog -srv ● Per client IP: halog -e -ic (detect bad CDN nodes) ● Per URL: halog -ue ● Stats page: per frontend/backend/server ● Stick-tables: per arbitrary key using http_err_rate() => no threshold, watch for variations Other relevant metrics: Error rate
  • 37. WWW.HAPROXY.COM ● Default httplog format is quite rich ● Can be improved using the log-format directive ● Hint: log stick-table stats for similar keys haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Useful entries in log-format Timers HTTP status Byte count Term Code Cookie Code Conn Count Queue Length
  • 39. WWW.HAPROXY.COM "I can't enable logs, I have too much traffic!" ● an average syslog server can store 20k events/s without sweating ● that's 1.7B events/day or 350GB of uncompressed haproxy logs/day ● compresses to 1TB/month ● for $100 you can store 4 months with no loss ● have more traffic / not interested in this level of detail ? # log only 5% of requests http-request set-log-level silent unless { rand(100) -lt 5 } Sampling: why / When Tips !!!!
  • 40. WWW.HAPROXY.COM ● you only want to catch suspicious events ● enable logs per url, source IP, etc... ● disable logging unless Tc/Tq/Tr/Tw/... is above a certain threshold ● on the fly for selected keys from the CLI + stick-table ● also see "option dontlognormal" WARNING: you'll lose any valid reference Selective logging: why / When Tips !!!!
  • 41. WWW.HAPROXY.COM ● Poorly documented, use halog --help ● response time per url: halog -uat ● errors per server: halog -srv ● Percentiles on req/queue/conn/resp times: halog -pct ● detect stolen CPU / swap : halog -ac … -ad … ● very fast (1-2 GB of log file per second) => Use it in production to figure the relevant metrics Other halog goodies Tips !!!!
  • 43. WWW.HAPROXY.COM ● Many ops and dev do this every day all around the world ● Users are complaining that the application is slow ● Devops team check HAProxy logs (using halog or ELK) in order to isolate potential sources of issue: ○ Check server Tc ○ Sort servers by response time ○ Sort URLs by response time ● HAProxy won’t fix the problem (in most cases), but will drastically reduce the troubleshooting time by isolating the potential issues Web application is slow!!!!!!!! Return On eXperience
  • 44. WWW.HAPROXY.COM ● Retail customer, with shops everywhere in France ● Cash register reports the following error message “Unable to reach central system” when searching people in the loyalty card database, after waiting up to 10s. ● Argument between customer and “central system” software provider: ○ Customer: your software is slow ○ Provider: your network triggers this error ● After installing HAProxy, its logs reported the following: ○ Time to get the query from the client: 300ms (from North of France to Paris) ○ Server response time: 50s ● Conclusion: provider turned off debug mode in the “central system” software Unavailable central system ?!? Return On eXperience
  • 45. WWW.HAPROXY.COM ● Customer with big TLS traffic, mostly API based ● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if there seem to be a lot of unused resources on the box ● At first, this was qualified as an “HAProxy” issue (including the HW and OS itself) ● Hard tuning the box did not improve anything ● halog Tc percentile reported higher values when the load increased (up to 30ms, on the LAN!!!) ● After asking customer to double check the Spanning Tree, it seemed a small 24 ports ToR switch became root bridge…. ● Fixing spanning tree also fixed HAProxy’s TLS performance HAProxy TLS performance issue Return On eXperience
  • 46. WWW.HAPROXY.COM ● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct ● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct => both haproxy and servers out of cause ● issue rate stable at various traffic levels => not congestion ● inter-switch link apparently at cause but not for all flows ● inter-switch link made of two fibers balanced on MAC tuple ● thanks to long-term logs, origin could even be identified Spotting a broken fiber channel between 2 core switches Return On eXperience
  • 47. WWW.HAPROXY.COM ● Tc abnormally high with lots of random values to several seconds, and only for TLS ● Tc timer also covers TLS handshake => not a network, hardware or performance issue, only server config. => system was regularly running out of entropy due to mistakenly using /dev/random as a random source for SSL TIP: use /dev/urandom ;) Wrong web server configuration using /dev/random Return On eXperience
  • 49. WWW.HAPROXY.COM ● exploit your stats ● Choose the right LB ● enable logs on LBs, no excuse for not doing it! ● process them automatically, manually once in a while ● compare numbers between similar objects ● detect anomalies ● fix problems before they are witnessed ● Enjoy :-) Conclusion Conclusion