Managing ECS hosts with AWS Lambda and Step Functions
Terraform at Comtravo
➢ Six environments maintained by Terraform.
➢ Integrated into our CI/CD pipeline.
➢ Each environment has:
○ 500+ AWS components.
○ 43 Lambdas.
○ 25 microservices.
CI/CD at Comtravo: Mono-repo Pull request
CI/CD at Comtravo: Mono-repo Merge to master
ECS at Comtravo
ECS: Many interesting challenges
One such challenge:
Update EC2 hosts in an ECS cluster
Update EC2 hosts in an ECS cluster: Use cases
➢ You have a custom AMI for your ECS cluster(s).
➢ You want to always roll out the latest ECS-optimized AMIs.
➢ You want to rotate the admin keys.
➢ You want to change the instance type.
➢ You want to use an updated user_data script.
Update EC2 hosts in an ECS cluster: The process
➢ Terraform emits an AWS CloudWatch event once a new launch configuration has been created.
➢ Detach “old instances” from the ASG and wait for capacity.
➢ “Move” services from old instances to new instances.
➢ Terminate old instances once they have no running tasks.
➢ Alert on failures.
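The detach step above can be sketched in Python. This is a minimal sketch, not the Lambda shown in the talk: it assumes that "old instances" are those whose launch configuration differs from the new one, and that `autoscaling` is a boto3 Auto Scaling client (or a test double with the same two methods). The function names are hypothetical.

```python
from typing import Any, Dict, List


def find_old_instances(asg: Dict[str, Any], new_launch_config: str) -> List[str]:
    """Return IDs of instances that were not launched from the new launch configuration."""
    return [
        inst["InstanceId"]
        for inst in asg.get("Instances", [])
        if inst.get("LaunchConfigurationName") != new_launch_config
    ]


def detach_old_instances(autoscaling: Any, asg_name: str, new_launch_config: str) -> List[str]:
    """Detach old instances from the ASG and return their IDs.

    `autoscaling` is assumed to behave like boto3.client("autoscaling").
    """
    response = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {asg_name}")

    old_ids = find_old_instances(groups[0], new_launch_config)
    if old_ids:
        # Leaving the desired capacity unchanged makes the ASG launch
        # replacements, which then use the new launch configuration.
        autoscaling.detach_instances(
            InstanceIds=old_ids,
            AutoScalingGroupName=asg_name,
            ShouldDecrementDesiredCapacity=False,
        )
    return old_ids
```

Detached instances keep running, so their tasks stay up while the replacements boot; that is what makes the later "move" step possible without downtime.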
Terraform + AWS Events + AWS Step Functions = Awesome
“I created a new launch configuration lc-1234 for ASG asg-1234 belonging to ECS cluster cluster-A”
AWS CloudWatch Events
[Figure: timeline of events for an ECS host (Task A started, Task C started, Task B stopped) interleaved with custom events]
Terraform Event Emitter
resource "null_resource" "launch-config-update" {
  # Emit a custom CloudWatch event that describes the new launch configuration.
  provisioner "local-exec" {
    command = "python ${path.module}/scripts/emit_launchconfig_event.py --launch_configuration_name ${aws_launch_configuration.ecs-lc.name} --autoscaling_group_name ${aws_autoscaling_group.ecs-asg.name} --ami ${var.aws_ami} --cluster ${var.cluster}"
  }

  # Re-run the provisioner whenever the launch configuration name changes.
  triggers = {
    launchConfigurationName = "${aws_launch_configuration.ecs-lc.name}"
  }
}
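The slides do not show `emit_launchconfig_event.py` itself. The following is a minimal sketch of what such a script might look like, using the flag names from the `local-exec` command and the event fields from the sample event. The `--environment` flag, its default, the use of `--cluster` as the cluster ARN, and all function names are assumptions made for illustration.

```python
import argparse
import json
from typing import Any, Dict, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Emit a launch configuration change event")
    parser.add_argument("--launch_configuration_name", required=True)
    parser.add_argument("--autoscaling_group_name", required=True)
    parser.add_argument("--ami", required=True)
    parser.add_argument("--cluster", required=True)
    # Hypothetical flag: the original command does not show how the
    # environment is passed to the script.
    parser.add_argument("--environment", default="alpha")
    return parser.parse_args(argv)


def build_entry(
    launch_configuration_name: str,
    autoscaling_group_name: str,
    ami: str,
    cluster_arn: str,
    environment: str,
) -> Dict[str, Any]:
    """Build one entry for the CloudWatch Events PutEvents API."""
    detail = {
        "ami": ami,
        "status": "ACTIVE",
        "agentConnected": False,
        "autoscalingGroupName": autoscaling_group_name,
        "environment": environment,
        "clusterArn": cluster_arn,  # assumption: --cluster carries the ARN
        "launchConfigurationName": launch_configuration_name,
    }
    return {
        "Source": f"comtravo.terraform.{environment}",
        "DetailType": "ECS Launch Configuration Change",
        "Resources": [launch_configuration_name],
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


def main(argv: Optional[List[str]] = None) -> None:
    import boto3  # imported lazily so the helpers above work without AWS access

    args = parse_args(argv)
    entry = build_entry(
        args.launch_configuration_name,
        args.autoscaling_group_name,
        args.ami,
        args.cluster,
        args.environment,
    )
    response = boto3.client("events").put_events(Entries=[entry])
    if response["FailedEntryCount"]:
        raise SystemExit(f"put_events failed: {response['Entries']}")
```

Calling `main()` from the script's entry point would send the event. Note that the request uses the key `DetailType`, while the delivered event exposes the same value as `detail-type`.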
Terraform Event
{
  "version": "0",
  "id": "f24d8f1c-8c3f-9b62-cb3c-54430739fc55",
  "source": "comtravo.terraform.alpha",
  "account": "1234567890",
  "time": "2018-05-09T13:35:43Z",
  "region": "eu-west-1",
  "resources": [
    "ct-backend-ecs-alpha-t2.large-generic20180509133303168200000003"
  ],
  "detail": {
    "ami": "ami-bfb5fec6",
    "status": "ACTIVE",
    "agentConnected": false,
    "autoscalingGroupName": "ct-backend-ecs-alpha-t2.large-generic20180503065507554700000005",
    "environment": "alpha",
    "clusterArn": "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha",
    "launchConfigurationName": "ct-backend-ecs-alpha-t2.large-generic20180509133303168200000003"
  },
  "detail-type": "ECS Launch Configuration Change"
}
AWS CloudWatch Event Rules
resource "aws_cloudwatch_event_rule" "ecs-manager" {
  name        = "capture-ecs-events-${terraform.workspace}"
  description = "Capture ECS related events"

  event_pattern = <<PATTERN
{
  "source": [
    "comtravo.terraform.${terraform.workspace}"
  ],
  "detail-type": [
    "ECS Launch Configuration Change"
  ],
  "detail": {
    "clusterArn": [
      "arn:aws:ecs:${var.region}:${var.ct_account_id}:cluster/ct-backend-ecs-${terraform.workspace}"
    ],
    "status": ["ACTIVE"]
  }
}
PATTERN
}
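To make the rule's behaviour concrete, here is a simplified matcher in Python. It is a sketch of the exact-value subset of event pattern matching only (each pattern leaf lists accepted values, nested objects are matched recursively), not a reimplementation of the full CloudWatch Events matching rules. The `matches` helper is hypothetical; the pattern is the rule above rendered for the `alpha` workspace with the placeholder account from the sample event.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field named in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern, e.g. "detail": recurse into the nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf pattern: a list of accepted values. An event value that is
            # itself a list matches if any of its elements is accepted.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in candidates):
                return False
    return True


PATTERN = {
    "source": ["comtravo.terraform.alpha"],
    "detail-type": ["ECS Launch Configuration Change"],
    "detail": {
        "clusterArn": ["arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha"],
        "status": ["ACTIVE"],
    },
}

EVENT = {
    "source": "comtravo.terraform.alpha",
    "detail-type": "ECS Launch Configuration Change",
    "detail": {
        "ami": "ami-bfb5fec6",
        "status": "ACTIVE",
        "clusterArn": "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-alpha",
    },
}

print(matches(PATTERN, EVENT))
```

Fields that the pattern does not mention, such as `ami`, are ignored, which is why one rule per workspace is enough to pick up every launch configuration change for that environment's cluster.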
AWS Step Functions
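The deck shows the state machine only in the demo. As a rough illustration, the update process described earlier (detach, wait for capacity, move services, terminate, alert on failures) could be written in Amazon States Language roughly as below, built here as a Python dictionary. Every state name, Lambda function name, and input field (`capacityReady`, `runningTasks`) is hypothetical, and each Lambda is assumed to return the full state it received plus its own result.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical Lambda ARN prefix; the account ID is the placeholder from the slides.
LAMBDA_PREFIX = "arn:aws:lambda:eu-west-1:1234567890:function:"


def task(function_name: str, next_state: Optional[str] = None) -> Dict[str, Any]:
    """Build a Task state that routes any error to the alerting state."""
    state: Dict[str, Any] = {
        "Type": "Task",
        "Resource": LAMBDA_PREFIX + function_name,
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "AlertOnFailure"}
        ],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def choice(variable: str, comparison: str, value: Any, then: str, otherwise: str) -> Dict[str, Any]:
    """Build a two-way Choice state."""
    return {
        "Type": "Choice",
        "Choices": [{"Variable": variable, comparison: value, "Next": then}],
        "Default": otherwise,
    }


STATE_MACHINE: Dict[str, Any] = {
    "Comment": "Sketch: replace ECS hosts after a launch configuration change",
    "StartAt": "DetachOldInstances",
    "States": {
        "DetachOldInstances": task("detach-old-instances", "WaitForCapacity"),
        "WaitForCapacity": {"Type": "Wait", "Seconds": 60, "Next": "CheckCapacity"},
        "CheckCapacity": task("check-capacity", "IsCapacityReady"),
        "IsCapacityReady": choice(
            "$.capacityReady", "BooleanEquals", True, "DrainOldInstances", "WaitForCapacity"
        ),
        "DrainOldInstances": task("drain-old-instances", "WaitForDrain"),
        "WaitForDrain": {"Type": "Wait", "Seconds": 60, "Next": "CountRunningTasks"},
        "CountRunningTasks": task("count-running-tasks", "AreTasksStopped"),
        "AreTasksStopped": choice(
            "$.runningTasks", "NumericEquals", 0, "TerminateOldInstances", "WaitForDrain"
        ),
        "TerminateOldInstances": task("terminate-old-instances"),
        "AlertOnFailure": {
            "Type": "Task",
            "Resource": LAMBDA_PREFIX + "alert-on-failure",
            "Next": "Failed",
        },
        "Failed": {"Type": "Fail", "Error": "EcsHostUpdateFailed", "Cause": "See $.error"},
    },
}


def dangling_targets(definition: Dict[str, Any]) -> List[str]:
    """Return transition targets that are not defined as states."""
    targets = [definition["StartAt"]]
    for state in definition["States"].values():
        targets.extend(state[key] for key in ("Next", "Default") if key in state)
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.append(rule["Next"])
    return sorted({t for t in targets if t not in definition["States"]})


definition_json = json.dumps(STATE_MACHINE, indent=2)
```

The two Wait/Choice loops are what let the workflow poll for capacity and for drained tasks without keeping a Lambda running, and the shared `Catch` clause is one way to cover the "alert on failures" bullet. A production definition would likely also bound the loops with a retry counter or timeout.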
DEMO
Questions
You all have been awesome!!!
Extras
ECS Challenge #1
ECS AGENT DISCONNECTS
#1 ECS agent disconnects - Initial solution
➢ Cron job on the ECS hosts to notify via an SNS event and restart the ECS agent.
➢ Chances are high that the ECS agent fails again because of some inherent problem within the instance.
#1 ECS agent disconnects - Better solution
➢ Detect ECS agent disconnects.
➢ Boot up a new ECS host and wait for it to be healthy.
➢ “Move” all the existing containers from the problematic instance to the new instance.
➢ Terminate the problematic instance.
➢ Alert on failures.
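One plausible way to implement the "move" step is to put the container instance into the DRAINING state and poll until it reports no tasks. The sketch below assumes `ecs` behaves like a boto3 ECS client; the function name, polling interval, and retry budget are hypothetical. With a disconnected agent the task count may never reach zero, so the function gives up after `max_polls` and leaves the decision to the caller.

```python
import time
from typing import Any, Callable


def drain_and_wait(
    ecs: Any,
    cluster: str,
    container_instance_arn: str,
    poll_seconds: int = 15,
    max_polls: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain a container instance; return True once no tasks remain on it."""
    # DRAINING tells the ECS service scheduler to start replacement tasks
    # on other instances and stop the service tasks on this one.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    for _ in range(max_polls):
        response = ecs.describe_container_instances(
            cluster=cluster,
            containerInstances=[container_instance_arn],
        )
        instance = response["containerInstances"][0]
        if instance["runningTasksCount"] == 0 and instance["pendingTasksCount"] == 0:
            return True
        sleep(poll_seconds)
    # Budget exhausted: the caller decides whether to alert or terminate anyway.
    return False
```

Inside a Step Functions workflow the polling loop would more naturally be a Wait state plus a Choice state, with this function split into a "drain" Lambda and a "count tasks" Lambda.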
#1 ECS agent disconnects: Detection
How do we detect ECS agent disconnects?
AWS CloudWatch Events to the rescue!
#1 ECS agent disconnects: ECS Events
[Figure: timeline of ECS events (Task A started, Task C started, Task B stopped) interleaved with "ECS agent disconnected" and "ECS agent connected" events]
#1 ECS agent disconnects: Filter ECS Events
{
  "detail": {
    "agentConnected": [
      false
    ],
    "clusterArn": [
      "arn:aws:ecs:eu-west-1:1234567890:cluster/ct-backend-ecs-qa"
    ],
    "status": [
      "ACTIVE"
    ]
  },
  "detail-type": [
    "ECS Container Instance State Change"
  ],
  "source": [
    "aws.ecs"
  ]
}
#1 ECS agent disconnects: Trigger step function
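The slide for this step is a diagram. As a sketch of how a matched "ECS Container Instance State Change" event might start the workflow, the code below assumes a Lambda target that calls the Step Functions `start_execution` API. `sfn` is assumed to behave like a boto3 Step Functions client; the function names, the execution name prefix, and the shape of the execution input are hypothetical.

```python
import json
from typing import Any, Dict


def build_execution_input(event: Dict[str, Any]) -> Dict[str, str]:
    """Pick the fields the workflow needs from the ECS state change event."""
    detail = event["detail"]
    return {
        "clusterArn": detail["clusterArn"],
        "containerInstanceArn": detail["containerInstanceArn"],
        "ec2InstanceId": detail["ec2InstanceId"],
    }


def start_replacement(sfn: Any, state_machine_arn: str, event: Dict[str, Any]) -> str:
    """Start one state machine execution for a disconnected container instance."""
    payload = build_execution_input(event)
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        # Naming the execution after the event ID is meant to keep a
        # redelivered event from starting a second execution.
        name=f"agent-disconnect-{event['id']}",
        input=json.dumps(payload),
    )
    return response["executionArn"]
```

A rule can also target the state machine directly; going through a small Lambda is only one option, chosen here because it allows the input to be trimmed before the workflow starts.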

Zero down time ECS cluster upgrades
