
Using SaltStack to Auto Triage and Remediate Production Systems


LinkedIn created an auto-remediation system named Nurse which leverages SaltStack and the CherryPy API to auto-triage and remediate issues with production systems. See how LinkedIn uses SaltStack with Nurse in its production environment and learn how to architect your own auto-triage and remediation system.


  1. Using SaltStack to Auto Triage and Remediate Production Systems
     Michael Kehoe, Senior Site Reliability Engineer, LinkedIn
  2. Topics
     • $ whoami
     • Salt @ LinkedIn
     • LinkedIn’s auto-remediation story
     • Nurse
     • Salt API
     • Auto-remediation Salt modules
     • How to get started
  3. $ whoami
  4. Salt @ LinkedIn
     • 11-14k minions per master
     • Up to 8.5 events/second per master
     • Deployment system heavily utilizes Salt
     • 5 dedicated SREs work on LinkedIn’s Salt infrastructure
     • A number of other SREs also contribute
  5. LinkedIn’s Auto-remediation Story: Alert Growth vs NOC Engineers
     [Chart: number of alerts and number of engineers per year, 2010-2017]
  6. LinkedIn’s Auto-remediation Story: The problem
     • Growing number of high-priority alerts vs engineers to watch them
     • Complicated ITRs (runbooks) took time for NOC engineers to execute and escalate
     • SREs would forget to run diagnostic tooling during outages
     • Longer MTTR == bad experience for members
  7. LinkedIn’s Auto-remediation Story: Building the solution
     • LinkedIn needed a workflow engine
     • Requirements
       • Take automated actions against applications/hosts
       • Perform the runbook automatically
       • Appropriate auditing
       • Scalable
  8. Auto-remediation: What’s already out there
     • StackStorm
     • fbar
     • SaltStack
     Started building an in-house solution in mid-2014
  9. Nurse
     (Image: freeflaticons.com)
  10. Nurse
      • ‘Event-based’ system
      • Built on top of LinkedIn’s existing monitoring infrastructure
      • Allows for manual execution via Web UI
      • Allows other systems to plug in via REST API
      • Built-in rules engine
        • Allows for complicated checks/branching
      • Global environment awareness
      • Scalable!
  11. Nurse
  12. Nurse
  13. Salt REST API: The basics
      • RESTful interface to Salt
      • Allows for external authentication
        • PAM
        • LDAP
        • Your own plugin…
      • Built on top of CherryPy
  14. Salt REST API: Interface
      • /login - Log in to receive a session token
      • /logout - Remove or invalidate sessions
      • /minions - Working directly with minions
      • /jobs - Getting lists of previously run jobs or getting the return from a single job
      • /run - Run commands without normal session handling
      • /events - Expose the Salt event bus
      • /hook - A generic web hook entry point that fires an event on Salt's event bus
      • /keys - Wrapper around key management
      • /ws - Open a WebSocket connection to Salt's event bus
      • /stats - Return a dump of statistics collected from the CherryPy server
      https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
  15. Salt REST API: Configuration (/etc/salt/master.d/salt-api.conf)
      external_auth:
        ldap:
          headless-nurse:
            - '*':
              - test.*
              - nurse.*
            - '@jobs'
      https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
  16. Salt REST API: Configuration (/etc/salt/master.d/salt-api.conf)
      rest_cherrypy:
        port: 8888
        ssl_crt: /etc/salt/pki/api/salt-api.crt
        ssl_key: /etc/salt/pki/api/salt-api.key
      https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
  17. Salt REST API: Configuration (/etc/salt/master.d/salt-api.conf)
      auth.ldap.basedn: 'dc=example,dc=com'
      auth.ldap.binddn: 'readonly-ldap@example'
      auth.ldap.bindpw: 'password'
      auth.ldap.filter: sAMAccountName={{ username }}
      auth.ldap.server: ldap.example.com
      auth.ldap.tls: True
      auth.ldap.persontype: 'person'
      auth.ldap.groupclass: 'group'
      auth.ldap.groupou: 'Users'
      https://docs.saltstack.com/en/latest/topics/eauth/index.html
  18. Salt REST API: Code interfaces
      • There is a Python interface to the Salt API: pepper
      • Provides a basic wrapper around the REST API
      • https://github.com/saltstack/pepper/
      • See https://github.com/SUSE/salt-netapi-client for Java bindings
  19. Salt REST API: Python code example
      from pepper.libpepper import Pepper

      api = Pepper('https://salt.example.com:8888')
      api.login('saltdev', 'saltdev', 'ldap')

      # Run simple function
      api.low([{'client': 'local', 'tgt': '*', 'fun': 'test.ping'}])

      # Execute a runner function
      api.runner('jobs.lookup_jid', jid=12345)
  20. Auto-remediation Salt Modules: Goals
      • Auto-triage issues
        • Reduce context-switching
        • Run labor-intensive tasks automatically
        • Gather data while the engineer is logging in after escalation
      • Auto-remediate issues
        • Let the engineer sleep
        • Faster MTTR
  21. Auto-remediation Salt Modules: Implementation
      • Auto-triage issues
        • Identify abusive clients
        • Collect and analyze thread/heap dumps
        • Parse and summarize log files
  22. Auto-remediation Salt Modules: Implementation
      • Auto-remediate issues
        • Restart/OOR applications
        • Scale up applications
        • Block abusive clients
        • Update A/B experiment definitions
        • Otherwise escalate…
  23. Auto-remediation Salt Modules: Implementation
      • Plenty of Salt modules/runners already out there
        • Over 300 modules are available in Salt core
        • Modules to remediate and notify
      • Write your own
        • Make sure you test!
  24. LinkedIn’s Auto-remediation Story: Success
      • 854k actions taken
      • 100% of service health check alerts are onboarded
      • ~37k man-hours have now been automated
      • Now automating ~1100 hours/week
  25. LinkedIn’s Auto-remediation Story: Success
      [Chart: number of alerts and number of engineers per year, 2010-2017]
  26. How to get started: Some lessons learnt
      • Decide how to architect YOUR ‘event bus’
        • Use external monitoring to trigger Salt via the API
        • Use reactors/beacons internally within Salt’s event bus
        • See the ‘Thorium’ documentation for Salt’s new reactor engine
      • Auditing is important!
        • Need to know what/who triggered actions
  27. How to get started: Some lessons learnt
      • Reporting is important!
        • Need to know how often automated actions are being taken
        • Find failure hotspots
        • Leverage the event bus or returners
      • Safety first
        • Don’t give yourself too much rope…
  28. Conclusion
      • Think carefully about your ‘Event-Driven Automation’ architecture
      • The sky is the limit with Salt… don’t limit yourself
      • Again… safety first
  29. Questions? Thank You
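
The Salt REST API slides above (the endpoint list and the pepper example) describe a token flow: POST credentials to /login, then send the returned token with each later call. A minimal sketch of that flow using only the Python standard library follows. The helper names `build_login_request`, `build_lowstate_request` and `send` are illustrative and not part of Salt or pepper, and the host is the same placeholder used on the slides. The request-building functions run without a network; `send` needs a reachable salt-api.

```python
import json
import urllib.request

# Placeholder host and port taken from the slides; replace with your own salt-api.
API_URL = "https://salt.example.com:8888"


def build_login_request(username, password, eauth="ldap"):
    """Build the POST /login request that exchanges credentials for a token."""
    body = json.dumps({"username": username, "password": password, "eauth": eauth})
    return urllib.request.Request(
        API_URL + "/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_lowstate_request(token, tgt, fun, arg=None):
    """Build a POST / request carrying one lowstate chunk for the local client."""
    chunk = {"client": "local", "tgt": tgt, "fun": fun}
    if arg:
        chunk["arg"] = list(arg)
    return urllib.request.Request(
        API_URL + "/",
        data=json.dumps([chunk]).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Auth-Token": token,  # session token returned by /login
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a live salt-api)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Against a real salt-api the calls would be chained roughly like this:
#   token = send(build_login_request("saltdev", "saltdev"))["return"][0]["token"]
#   result = send(build_lowstate_request(token, "*", "test.ping"))
```

Keeping request construction separate from sending makes the payloads easy to inspect and unit test, which fits the "make sure you test" advice on the modules slide. In practice pepper already wraps these calls, so a hand-rolled client like this is mainly useful for understanding what goes over the wire.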

Editor's notes

  • WhoAmI
    Background on Salt & Auto-remediation at LinkedIn
    Nurse – LinkedIn’s Auto-remediation engine
    Salt API & Autoremediation Salt Modules
    How to get started

  • Been at LinkedIn for nearly 2.5 years
    A part of LinkedIn’s ARVT group
    Occasional committer to LinkedIn’s Salt ecosystem
  • Want to give you some context into our infrastructure

    Average is 11-14k minions per production datacenter
    Up to 8.5 events/ second per master
    Deployment system heavily utilizes Salt
    5 dedicated SREs work on LinkedIn’s Salt infrastructure
    A number of other SREs also contribute

    Keep these numbers in mind as the session progresses
  • Let’s talk about LinkedIn’s autoremediation story
    The graph depicts the number of alerts our NOC watches vs the number of NOC engineers we employ globally
  • What is out there
    Has anyone heard of these?
  • Login/ logout
    Jobs
    Run
    Events
    Events – get data from event bus
    Hook – Fire an event INTO the salt bus
    WS – Open websocket
    Stats
  • So what are the goals of building an auto-remediation engine?
  • Remediate
    Apache/ Nginx restarts
    Notify
    Slack/ Hipchat modules
  • In 2014 we stopped increasing headcount; now we’re lowering our NOC engineer headcount
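
The "safety first" and "auditing is important" lessons from the slides can be made concrete with a guard that caps how many automated actions run in a time window and records who or what triggered each one. The sketch below is a hypothetical illustration, not Nurse's implementation: the class name `ActionBudget`, its parameters, and the "run"/"escalate" decision labels are all assumptions made for the example.

```python
import time
from collections import deque


class ActionBudget:
    """Allow at most `limit` automated actions per `window` seconds.

    Hypothetical safety guard: once the budget is spent, callers should
    escalate to a human instead of remediating again.
    """

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock       # injectable so tests can control time
        self.history = deque()   # timestamps of recently allowed actions
        self.audit_log = []      # (timestamp, trigger, action, decision) tuples

    def request(self, trigger, action):
        """Return True if the action may run; always record an audit entry."""
        now = self.clock()
        # Forget actions that have aged out of the sliding window.
        while self.history and now - self.history[0] >= self.window:
            self.history.popleft()
        allowed = len(self.history) < self.limit
        if allowed:
            self.history.append(now)
        decision = "run" if allowed else "escalate"
        self.audit_log.append((now, trigger, action, decision))
        return allowed


# Example: at most 2 restarts per 10 minutes for one application.
budget = ActionBudget(limit=2, window=600)
if budget.request(trigger="healthcheck-alert", action="restart"):
    pass  # call the remediation (for example via the Salt API) here
else:
    pass  # page the on-call engineer instead
```

The audit log doubles as raw material for reporting: counting "escalate" entries per application is one simple way to find the failure hotspots mentioned on the slides.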
