Observability - From fire fighting to smoke detection

How the nightmare of losing weeks of data transformed our team's way of working

Published in: Technology

  1. THE IMPORTANCE OF OBSERVABILITY: From Fire Fighting to Smoke Detection @ShaneCarroll84
  2. WHAT IS OBSERVABILITY?
  3. DASHBOARDS
  4. KEY METRICS
  5. OBJECTIVES
  6. ALERTS
  7. OBSERVABILITY IS LIKE A FITNESS TRACKER, BUT FOR YOUR SYSTEM
  8. MACGYVER’S JOB? Track content opens, clicks, views, interactions… Get this data and send it on to other teams.
  9. CHALLENGES: Considerable amount of old enterprise environments… Ancient systems that are not easily testable.
  10. HOW DID THE LACK OF OBSERVABILITY IMPACT OUR TEAM?
  11. IT ALL STARTED WITH A BUG… Missing a large number of tracking events for one customer… 🤔 A customer is telling us we’ve a problem with our system.
  12. WE WERE IN FIRE FIGHTING MODE: Scrambling to try to find the cause of the issue… Don’t know the full impact, causing stress for the team.
  13.
  14. LET’S LOOK AT THE LOGS… Logs were noisy and not easily searchable… Not all the information needed to isolate the issue was logged.
  15. MACGYVER, WE HAVE A PROBLEM: Issue with titles containing special characters… 🤦 Lots of customers impacted and over a million tracking events lost.
  16. WHY DID WE NOT GET AN ALERT? IT WAS LOST IN A SEA OF ALERTS.
  17. OUR PROCESS WAS BROKEN! Took weeks to gather missing data and reprocess events… Problem continued to impact customers and other teams.
  18. WE DIDN’T KNOW OUR SYSTEM’S HEALTH: What can we learn from this? Use a bad experience such as a bug to learn from and spark change.
  19. WHEN YOU LOOK AT YOUR CURRENT SYSTEM, HOW DO YOU KNOW IT’S HEALTHY?
  20. TESTING AT POPPULO: Rob Meaney, Head of Testing. CODS model. 10 P’s of Testability.
  21. LEARNING REVIEW: Detection, Impact, Isolation, Fix & Retest, Repair, Minimise Impact, Prevention.
  22. SHARE LEARNINGS
  23. RISK
  24. THREE AMIGOS: George Dinwiddie first came up with this strategy in his blog. The Three Amigos – Product, Developer, and Tester – discuss the new feature.
  25. THREE AMIGOS: Allow for discussion on risk before beginning work on a feature… We now ‘Three Amigo’ each new story before beginning any new development.
  26. REFINING ALERTS: Reduce noise and only alert on what is important to the team. Entire team takes responsibility for investigating and fixing alerts.
  27. DAILY STAND-UP: We changed our stand-up to include a new question… Any new alerts today? Small change, but now it’s part of our team process!
  28. IMPROVED LOGGING: Removed noise. Logged everything that helped isolate potential issues. Used structured logging.
  29. UNSTRUCTURED LOGS
  30. UNSTRUCTURED LOGS
  31. VISUALISATIONS
  32. DASHBOARDS: Identified critical metrics. Highlight failures. Show trends. Added important tests to the dashboard.
  33. KEY METRICS
  34. COMPARE DATA
  35. DRILL-DOWN
  36. LIVE ARCHITECTURE
  37. WATCHING TV AT WORK: Why invest time and effort into monitoring tools if no one looks at them? Dashboards are now in constant view!
  38. TRACING
  39. TRACING
  40. BUT BE CAREFUL: Don’t introduce a dependency in your system when adding monitoring tools!
  41. PERFORMANCE TESTS
  42. PERFORMANCE TESTS
  43. CLIENT-SIDE ERRORS
  44. CLIENT-SIDE ERRORS
  45. CLIENT-SIDE ERRORS
  46. CLIENT-SIDE ERRORS
  47. IN-HOUSE TOOLS
  48. IN-HOUSE TOOLS
  49. IN-HOUSE TOOLS
  50. SMOKE DETECTION MODE ACTIVATED: Risk is discussed early. Alerts are raised and actioned. Dashboards show our critical metrics.
  51. SMOKE DETECTION MODE ACTIVATED: Investigating logs and isolating issues is easier. Quickly see who's impacted. Replay scripts are available.
  52. HOW DO YOU GO FROM FIRE FIGHTING TO SMOKE DETECTION? LEARN FROM THE FIRE!
  53. THANK YOU!
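
The "IMPROVED LOGGING" slide mentions structured logging, and the final slides say it became easier to isolate issues and see who is impacted. The slides do not show the team's actual implementation, so the following is only a minimal sketch of the general idea using Python's standard `logging` module. The logger name `tracking`, the field names `customer_id`, `title` and `reason`, and the helper functions are all hypothetical and chosen purely for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} becomes an attribute
        # on the record; merge those fields into the JSON payload.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(stream=sys.stderr) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("tracking")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_dropped_event(logger, customer_id, title, reason):
    """Log a lost tracking event with the fields needed to isolate it later."""
    logger.warning(
        "tracking event dropped",
        extra={"context": {"customer_id": customer_id, "title": title, "reason": reason}},
    )


def affected_customers(log_text):
    """Answer 'who is impacted?' by scanning JSON log lines."""
    customers = set()
    for line in log_text.splitlines():
        record = json.loads(line)
        if record.get("message") == "tracking event dropped":
            customers.add(record["customer_id"])
    return customers
```

Because every line is a self-describing JSON object, a question such as "which customers lost events?" becomes a simple filter on a field rather than a text search through free-form messages. A log aggregation tool would normally do this filtering; `affected_customers` only stands in for that step.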
