
2 Epic Migrations at Flo:

From hardware to AWS, from EC2 to EKS. What did we learn from it?


  1. 2 Epic Migrations at Flo. Dmitry Yackevich, Director of Engineering at Flo Health
  2. WHO AM I? Dmitry Yackevich. ✔ Director of Engineering at Flo Health ✔ DevOps enabler (and sometimes disabler) at Flo, Pandadoc, Targetprocess and Workfusion
  3. Context
  4. WHY MIGRATIONS MATTER: Migrations are the only mechanism to effectively manage technical debt as your company and code grow. If you don't get effective at software and system migrations, you'll end up languishing in technical debt, and you'll still have to do one later anyway; it will probably be a full rewrite.
  5. A LONG, LONG TIME AGO: 2018 Q1
  6. Flo is an AI-powered health app for women that supports them throughout their entire reproductive period
  7. NOT JUST A PERIOD TRACKER
  8. WE REACHED THE LIMIT ON BARE METAL: 1. Backup/restore took a week 2. Constant outages 3. The DB cluster was near its capacity limit 4. Changes were hard
  9. Heroic mode
  10. SOLUTION: ✔ AWS + Terraform ✔ Ansible deployment ✔ Bitbucket Pipelines
  11. TERRAFORM WORKFLOW: Branch / Master
  12. ANSIBLE WORKFLOW: a typical deployment takes 15 minutes
  13. ARCHITECTURE: 2019 Q1
  14. AVAILABILITY 2%
  15. RESPONSE TIME
  16. Challenges
  17. PRICE RISES TOO FAST
  18. LOW RESOURCE UTILIZATION
  19. RAPID GROWTH: • 30 to 200 employees • 2 to 15 deploys/day • 9 to 120 services in production
  20. TIME MATTERS: Minimal time to market for a new service is 2 days. 15 deployments * 15 minutes ≈ 4 hours/day. Manual actions: adding SSH keys, permissions, etc. That's OK in most cases, but to build a high-speed process you need to get rid of it.
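The arithmetic on slide 20 can be checked directly. The 15 deploys/day and 15 minutes/deploy figures come from the slides; note the exact total is 3.75 hours, which the slide rounds up to 4:

```python
# Daily deployment overhead, using the figures from the slides.
deploys_per_day = 15     # slide 19: "2 to 15 deploy/day"
minutes_per_deploy = 15  # slide 12: typical Ansible deployment time

total_minutes = deploys_per_day * minutes_per_deploy
print(total_minutes, total_minutes / 60)  # 225 minutes, i.e. 3.75 hours/day
```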
  21. Plan
  22. WHAT WE WANT TO ACHIEVE: ✔ Autoscaling ✔ Turnkey modules ✔ Self-healing ✔ Cost optimization = Containers + immutable infrastructure + k8s
  23. EKS: ✔ Based on AWS services ✔ Control plane as a service ✔ Automatically provisions EC2, ALB, subnets, Route53
  24. RISKS: • Unstable setup • Too complicated • Performance degradation • Dark corners
  25. SCENARIO: • Proof of concept • Dogfooding • Early adopters • Migrate everything
  26. Boring migration
  27. PROOF OF CONCEPT: EKS SERVICE
  28. PROOF OF CONCEPT: 2048 GAME. It works! 💪
  29. DOGFOODING: We decided to move infrastructure services to EKS first: ✔ Jenkins ✔ Sentry ✔ Prometheus ✔ Grafana ✔ Fluentd ✔ PgBouncer
  30. DOGFOODING WITH SENTRY: ● 35 projects ● 5,000 events/minute at peak ● 450 events/minute on average
  31. DOGFOODING WITH SENTRY: AUTOSCALING
  32. DOGFOODING WITH SENTRY: HPA
  33. Migration | Dogfooding HPA and Autoscaling
  34. Migration | Dogfooding HPA and Autoscaling
  35. Migration | Dogfooding HPA and Autoscaling
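The HPA slides above are screenshots, but the scaling rule behind them is compact. A minimal sketch of the formula the Kubernetes HorizontalPodAutoscaler uses (desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)); the replica counts and CPU numbers below are illustrative, not Flo's:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HorizontalPodAutoscaler rule:
    desired = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Illustrative: 4 pods at 90% CPU against a 60% target scale up to 6.
print(hpa_desired_replicas(4, 90, 60))  # 6
# Load drops to 30% -> scale back down to 3 pods.
print(hpa_desired_replicas(6, 30, 60))  # 3
```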
  36. DOGFOODING PGBOUNCER: ✔ Test network performance ✔ Rock-solid stability ✔ All database requests go through the connection pooler
  37. Migration | Dogfooding PgBouncer
  38. LOAD BALANCER FOR ALL
  39. EARLY ADOPTERS: ✔ Test team adoption ✔ Customer-facing service ✔ Measure complexity
  40. Lessons Learned
  41. EARLY ADOPTERS: • Moving a service to k8s took at least 1 sprint • Everybody wants to use k8s for new services, not for old ones
  42. MIGRATION STATEMENT
  43. JAVA MEMORY TRICKS
  44. REQUESTS/LIMITS STRATEGY
  45. Each instance type has a hard limit on IP addresses (and thus Pods): ✔ .large instances have only 29 IP addresses ✔ .xlarge and .2xlarge: 58 ✔ .4xlarge: 234. Our winner: c5.2xlarge
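The pod limits quoted on slide 45 follow from the AWS VPC CNI formula (assuming the CNI's default secondary-IP mode): max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. A quick check against the published ENI figures for the c5 family:

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI pod limit: the first IP on each ENI is reserved for
    the node itself, plus 2 pods that run on the host network."""
    return enis * (ips_per_eni - 1) + 2

# Published ENI / IPs-per-ENI figures for the c5 family:
print(max_pods(3, 10))  # c5.large              -> 29
print(max_pods(4, 15))  # c5.xlarge, c5.2xlarge -> 58
print(max_pods(8, 30))  # c5.4xlarge            -> 234
```

These match the 29/58/234 numbers on the slide exactly.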
  46. SUBNETS: ● 700 IPs are not enough ● ~50 IPs per server, so there were only 4 EKS workers in one AZ ● Use /20 instead of /24
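The subnet sizing can be sanity-checked: AWS reserves 5 addresses in every subnet, so a /24 leaves 251 usable IPs while a /20 leaves 4,091. At roughly 50 pod IPs per worker, a /24 caps out at a handful of nodes per AZ (the slide reports 4, since other resources in the subnet consume addresses too). A sketch:

```python
def usable_ips(prefix_len: int, aws_reserved: int = 5) -> int:
    """Usable addresses in an AWS subnet: AWS reserves 5 per subnet
    (network, router, DNS, future use, broadcast)."""
    return 2 ** (32 - prefix_len) - aws_reserved

ips_per_worker = 50  # approximate figure from slide 46

for prefix in (24, 20):
    ips = usable_ips(prefix)
    print(f"/{prefix}: {ips} usable IPs, at most ~{ips // ips_per_worker} "
          f"workers' worth of pod IPs per AZ")
```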
  47. Results
  48. RESOURCE UTILIZATION: 40% VS 20%
  49. PERFORMANCE: Performance became more consistent and stable
  50. DYNAMIC UTILIZATION
  51. DYNAMIC UTILIZATION
  52. DEPLOYMENT TIME: decreased from 15 to 3 minutes
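Taken together with the ~15 deploys/day from slide 19, the 15-to-3-minute improvement frees about three hours of deployment time per day. The per-deploy figures are the slides'; the daily totals are my arithmetic:

```python
deploys_per_day = 15  # from slide 19

before = deploys_per_day * 15  # minutes/day at 15 min per deploy
after = deploys_per_day * 3    # minutes/day at 3 min per deploy
print(before, after, before - after)  # 225 45 180 -> 3 hours/day saved
```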
  53. CHANGE RATE
  54. COST OPTIMIZATION: ✔ -30% ✔ Predictable growth model ✔ Our scaling factor is utilization, not service count
  55. HOW TO RUN A MIGRATION? ✔ Never fear ✔ Derisk ✔ Find early adopters ✔ Push it to the end
  56. Q&A
  57. THANK YOU! Join us and contribute to global health! https://flo.health/careers
