Operating Consul as an Early Adopter

Hashiconf 2015 talk about our experiences adopting consul from a very early version.

  1. Operating Consul As an Early Adopter • a talk by Nelson Elhage, @nelhage
  2. This Talk • consul @ Stripe • War Stories • Lessons Learned
  3. Consul at Stripe • The Good, The Bad, The Outages
  4. Why Consul? • Early 2014 • Stripe Infra gaining complexity • Nightmarish in-house service registry • Host lists distributed via puppet
  5. Why Consul? • Wanted a better service/host store • consul had everything baked in • Decided to do some test deployments
  6. Initial Rollout • Rolled out across all servers • (started with bake-in in QA) • No clients at all
  7. What Could Go Wrong? • We worried about memory leaks
  8. Our First Production Issue • Noticed one node taking >100M RAM • (others all <50M) • Reached out to armon for advice • bug in the stats framework: https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141
  9. Started Adding Clients • Hooked into our deploy tool • kept a manual emergency fallback • Generated LB config from consul • Noticed a surprising rate of errors
  10. Raft Instability • Seeing >1 failover/minute • Reached out to Armon • “Try 0.3” • “consul is not optimized for spinning disk”
  11. Rolling out 0.3 • Roll to QA first • Nothing works! • Check logs: TLS verification errors
  12. Rolling out 0.3 • 0.3 changed TLS verification to check the cert name • Change our SSL issuing to add SANs • 2014/06/16 16:52:57 [ERR] raft: Failed to make RequestVote RPC to 10.100.29.175:8300: x509: certificate is valid for [remote host], not [local host]
  13. 0.3 TLS Woes • Whoops! consul was checking the remote cert against the local node name • armon> we just use "demo.consul.io" as the CN for all of them • 0.3 essentially completely broke TLS
  14. 0.3.1 • I wrote and got merged a patch to restore 0.2 behavior • Rolled forward to 0.3.1 • Upgraded to SSD-backed servers
  15. Increasing Rollout • Switched various operational tools from flatfile to consul • Main app started using consul at startup
  16. Consensus is Hard
  17. consul-template • Generating haproxy config using consul-template • https://github.com/hashicorp/consul-template/issues/168 – `consul-template` takes O(N²) time with N services
  18. consul-template • Got that fixed, turned it on • consul immediately fell over • multiple elections/minute • 2M allocations/minute
  19. consul-template • Service Watches churn when any service changes health state • Watching services on a large cluster → self-DDoS
  20. consul-template • We use `consul-template -once` in cron now • Worse latency, but it works reliably
  21. consul for leader election • Our data team wanted a leader-election primitive • Built on top of consul, cribbing example code
  22. Sometime Later…
  23. goroutine leak • consul would rapidly eat all memory • larger heap -> large GC pauses -> raft instability • manually restarted cluster 1/day
  24. goroutine leak • Reached out to Armon • Very helpful in debugging • Found several unrelated memory leaks
  25. goroutine leak • Tried to figure out what changed • Eventually correlated to a session leak in our leader election code
  26. goroutine leak • Fixed our leader-election code • New policy: No non-discovery uses of consul
  27. consul DNS • Increasingly reliant on consul for internal discovery • Unhappy at exposure to periodic instability • Still have fallbacks, but outages remain painful
  28. consul DNS • Solution: Use consul-template to compile consul DNS to a zone file • Serve that out of a normal DNS server • Refresh every 15s
  29. Current Status • Run consul everywhere • Register all services • Request-path lookups hit cached DNS • Operational tools use HTTP interface • Also generate config from consul-template
  30. Final Stability Note • consul 0.5.2 fixed our memory leaks • consul has been quite stable for us of late • consul-template watches still don’t scale • 0.6 should help
  31. Lessons Learned • being an early adopter without bringing down the site (too many times)
  32. Expect It To Be Rough
  33. Monitoring, Monitoring, Monitoring
  34. (graph all the things)
  35. Incremental Rollout
  36. Limit Scope
  37. Isolation
  38. Upgrade Aggressively
  39. Get To Know Upstream
  40. Be Willing to Dive In
  41. Questions?
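
The goroutine-leak slides trace the memory growth to sessions leaked by leader-election code. Below is a minimal sketch of the cleanup discipline that avoids this kind of leak. It assumes Consul's HTTP session and KV endpoints (`/v1/session/create`, `/v1/kv/<key>?acquire=<session>`, `/v1/session/destroy/<id>`). The `ConsulHTTP` helper, the `leadership` function, the agent address, the TTL, and the stored value are all hypothetical illustrations, not the code described in the talk.

```python
import json
import urllib.request
from contextlib import contextmanager


class ConsulHTTP:
    """Tiny PUT-only client for a local agent (the address is an assumption)."""

    def __init__(self, base="http://127.0.0.1:8500"):
        self.base = base

    def put(self, path, body=None):
        # Consul's session and KV write endpoints accept PUT and reply with JSON.
        data = json.dumps(body).encode() if body is not None else b""
        req = urllib.request.Request(self.base + path, data=data, method="PUT")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read().decode())


@contextmanager
def leadership(client, key, ttl="15s"):
    """Try to acquire `key`; yield True if we lead. Always destroy the session."""
    created = client.put("/v1/session/create", {"TTL": ttl, "Behavior": "release"})
    session = created["ID"]
    try:
        # The acquire call returns JSON true if this session now holds the lock.
        yield client.put(f"/v1/kv/{key}?acquire={session}", {"holder": "hypothetical-node"})
    finally:
        # Without this step every attempt leaves a live session behind on the
        # servers, which is the kind of leak the slides describe.
        client.put(f"/v1/session/destroy/{session}")
```

A caller would wrap its leader-only work in `with leadership(ConsulHTTP(), "service/data/leader") as is_leader:` and check `is_leader` before proceeding. The `finally` clause runs whether the work succeeds, raises, or the lock was never acquired, and the TTL acts as a backstop if the process dies before cleanup.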
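
The consul DNS slides describe compiling service data into a zone file that an ordinary DNS server then serves, refreshed every 15s. The talk does this with consul-template; the sketch below swaps in plain Python to show the shape of the idea. The input mapping, the `service.consul.` origin, the SOA timer values, the nameserver address, and the function names are assumptions made for illustration.

```python
import os
import tempfile


def render_zone(services, origin="service.consul.", serial=1, ttl=15,
                ns_addr="127.0.0.1"):
    """Render {service name: [IPv4 addresses]} as zone-file text."""
    lines = [
        f"$ORIGIN {origin}",
        f"$TTL {ttl}",
        # SOA fields: primary NS, contact, serial, refresh, retry, expire, minimum.
        f"@ IN SOA ns.{origin} hostmaster.{origin} {serial} 3600 600 86400 {ttl}",
        f"@ IN NS ns.{origin}",
        f"ns IN A {ns_addr}",
    ]
    # Sort names and addresses so an unchanged catalog yields identical output.
    for name in sorted(services):
        for addr in sorted(services[name]):
            lines.append(f"{name} IN A {addr}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # os.replace swaps the file in one step, so a reader never sees a partial zone.
    os.replace(tmp, path)
```

Because the DNS server reads a static file, request-path lookups keep working from the last good zone even while the consul cluster is unstable. The cost is staleness bounded by the refresh interval. A real deployment would also need to bump the serial on each change and tell the DNS server to reload.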
