Patterns for building resilient and scalable
microservices platform on AWS
Boyan Dimitrov,
Platform Automation Lead @ Hailo
@nathariel
Back in 2011 we started simple
We quickly found out that supporting monoliths is hard:
• Hard to maintain the codebase
• Hard to build new features
• Hard to scale the dev teams
Failure to deliver business value
(Diagram: Frontend, Backend, MySQL)
So in 2013 we ended up doing…
At present we have
• Microservices ecosystem (99.9% written in Go)
• Designed specifically for the cloud – different building blocks and
components will constantly be in flux, broken or unavailable
• 1000+ AWS instances spanning multiple regions
• 200+ services in production
The Platform under the hood
(Diagram: the platform as four layers, bottom to top)
• Cloud Provider: VPC, Auto Scaling, S3, CF, EC2, Route 53, Redshift
• Orchestration: Env, DNS, Release, AutoScaling, Compute, EIP, Whisper
• Core Platform: Discovery, Monitoring, Routing, Provisioning, Login, Config
• Services
• Lowest level building blocks
• We mostly use basic PaaS components and services as they cover most of our
needs
• We expect every underlying component to fail and we designed for this
(Diagram: two regions, eu-west-1 and us-east-1, each with a Proxy Layer, Message Bus+, Go Services and C*)
(Diagram: inside eu-west-1, three availability zones, eu-west-1a, eu-west-1b and eu-west-1c. Each zone has a Proxy Layer (API), a Message Bus (RabbitMQ), Services (many Go services) and Shared Infra (C*, NSQ, ZK).)
Basic principles
• We use auto scaling groups for everything
  - Guarantees each component can be rebuilt automatically
  - Including our database clusters that run on ephemeral storage (we do keep 6 copies of each piece of data in 2 regions)
• Minimum of 3 AZs in every region
• Every workflow is automated
• Every component has to be self-healing and scalable
• Our “cloud provider abstraction” layer
• Main purpose is infrastructure and workflow automation and discovery
• Has a global view of everything happening across our infrastructure
• Provides additional capabilities on top of AWS
• Runs in dedicated VPCs across two regions
(Diagram: the Orchestration services: Env, DNS, Release, AutoScaling, Compute, EIP, Whisper)
It all started with a small challenge we had to overcome:
Payment providers whitelist source IP addresses
EIP Service
Elastic IP Provisioning Service
Maintains elastic IP pools across all our accounts and matches them against auto scaling groups and environments
(Diagram: an elastic IP table with 51.x.x.1, 51.x.x.2 and 51.x.x.3 assigned to "nat live" and a 50.x.x.x address assigned to "nat foo", feeding the NAT auto scaling groups of the LIVE and FOO environments)
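The pool-matching idea can be sketched in a few lines of Go. This is only an illustration under assumptions: the `PoolKey`, `EIPPool`, `Acquire` and `Release` names are hypothetical, the state lives in memory, and nothing here calls AWS. A real service would also have to associate the address with the instance through the provider API.

```go
package main

import (
	"errors"
	"fmt"
)

// PoolKey identifies an elastic IP pool by role and environment,
// for example {"nat", "live"}. Both fields are hypothetical labels.
type PoolKey struct {
	Role string
	Env  string
}

// EIPPool tracks which addresses are free and which are assigned.
type EIPPool struct {
	free     map[PoolKey][]string
	assigned map[string]string // instance ID -> address
}

func NewEIPPool() *EIPPool {
	return &EIPPool{free: map[PoolKey][]string{}, assigned: map[string]string{}}
}

// Add registers an address as available in the given pool.
func (p *EIPPool) Add(key PoolKey, addr string) {
	p.free[key] = append(p.free[key], addr)
}

// ErrPoolExhausted is returned when a pool has no free address left.
var ErrPoolExhausted = errors.New("no free elastic IP in pool")

// Acquire hands the next free address of the pool to an instance.
// Calling it again for the same instance returns the same address.
func (p *EIPPool) Acquire(key PoolKey, instanceID string) (string, error) {
	if addr, ok := p.assigned[instanceID]; ok {
		return addr, nil
	}
	addrs := p.free[key]
	if len(addrs) == 0 {
		return "", ErrPoolExhausted
	}
	addr := addrs[0]
	p.free[key] = addrs[1:]
	p.assigned[instanceID] = addr
	return addr, nil
}

// Release returns an instance's address to the pool, for example
// after the auto scaling group has terminated the instance.
func (p *EIPPool) Release(key PoolKey, instanceID string) {
	addr, ok := p.assigned[instanceID]
	if !ok {
		return
	}
	delete(p.assigned, instanceID)
	p.free[key] = append(p.free[key], addr)
}

func main() {
	pool := NewEIPPool()
	live := PoolKey{Role: "nat", Env: "live"}
	pool.Add(live, "51.x.x.1") // placeholder address, as on the slide
	pool.Add(live, "51.x.x.2")
	addr, err := pool.Acquire(live, "i-aaa")
	fmt.Println(addr, err)
}
```

Because a replacement NAT instance picks up an address from the same pool, the set of source addresses that a payment provider has to whitelist stays fixed even when instances are rebuilt.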
We do a lot of server discovery
• Both external and internal orchestration tools rely on AWS APIs for server discovery
• Puppet has AWS integration for clustering infra
• With enough clients, the AWS APIs start answering “RequestLimitExceeded”
• Exponential back-off mitigates the issue but does not solve it if you have many clients
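For reference, client-side exponential back-off with jitter could look like the Go sketch below. The names `ErrThrottled`, `Ceiling`, `Retry` and `NoSleep` are assumptions made for this example, and the throttling error is simulated rather than returned by a real AWS client.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// ErrThrottled stands in for a throttling response such as
// "RequestLimitExceeded"; the name is an assumption of this sketch.
var ErrThrottled = errors.New("RequestLimitExceeded")

// Ceiling returns the maximum delay before retry number attempt
// (0-based): base doubled once per attempt and capped at max.
func Ceiling(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// NoSleep is a drop-in sleeper for dry runs and tests.
func NoSleep(time.Duration) {}

// Retry calls fn until it succeeds, fails with a non-throttling error,
// or maxAttempts (at least 1) is reached. Between throttled attempts it
// sleeps for a random duration in [0, Ceiling], so that many clients do
// not retry in lockstep. base and max must not be negative.
func Retry(maxAttempts int, base, max time.Duration, sleep func(time.Duration), fn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = fn()
		if err == nil || !errors.Is(err, ErrThrottled) {
			return err
		}
		if attempt < maxAttempts-1 {
			limit := Ceiling(attempt, base, max)
			sleep(time.Duration(rand.Int63n(int64(limit) + 1)))
		}
	}
	return err
}

func main() {
	calls := 0
	err := Retry(5, 100*time.Millisecond, 2*time.Second, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return ErrThrottled // simulate two throttled calls
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Back-off only spreads the load of each client over time. The total request volume still grows with the number of clients, which is why it does not remove the limit on its own.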
Compute service to the rescue
• A distributed cache of all compute instances and their metadata
• Powerful query API (very fast!)
• Main interface for creating new compute instances
• Reconciles any changes in any AWS account within seconds
(Diagram labels: Compute Service, Services, Internal tools, External tools, Other providers)
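A minimal in-memory sketch of such a cache is shown below, assuming a hypothetical `Instance` record and tag-based queries. It is single-process rather than distributed, and the listing passed to `Reconcile` stands in for whatever the provider API returns.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Instance is a hypothetical, trimmed-down record of a compute instance.
type Instance struct {
	ID      string
	Account string
	Region  string
	Tags    map[string]string
}

// Cache holds the last known state of every instance, so that queries
// never have to touch the provider API.
type Cache struct {
	mu        sync.RWMutex
	instances map[string]Instance
}

func NewCache() *Cache { return &Cache{instances: map[string]Instance{}} }

// Reconcile replaces the cached view of one account and region with a
// fresh listing: instances in the listing are upserted and instances
// missing from it are dropped.
func (c *Cache) Reconcile(account, region string, listing []Instance) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, in := range c.instances {
		if in.Account == account && in.Region == region {
			delete(c.instances, id)
		}
	}
	for _, in := range listing {
		c.instances[in.ID] = in
	}
}

// Query returns the sorted IDs of all instances carrying every given tag.
func (c *Cache) Query(tags map[string]string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var ids []string
	for id, in := range c.instances {
		match := true
		for k, v := range tags {
			if in.Tags[k] != v {
				match = false
				break
			}
		}
		if match {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	c := NewCache()
	c.Reconcile("account-x", "eu-west-1", []Instance{
		{ID: "i-1", Account: "account-x", Region: "eu-west-1", Tags: map[string]string{"role": "nat"}},
		{ID: "i-2", Account: "account-x", Region: "eu-west-1", Tags: map[string]string{"role": "api"}},
	})
	fmt.Println(c.Query(map[string]string{"role": "nat"}))
}
```

With this shape only the reconciler talks to the provider, so the number of provider API calls no longer grows with the number of tools doing discovery.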
Everything in our platform emits events
So naturally we want to capture all external events as well!
Whisper Service
It’s all about event-driven compute – think Lambda but within our platform
• Hundreds of publishers & subscribers
• To subscribe to any new event source we only have to change a single service
(Diagram labels: Events, External sources, NSQ Topics, Actions)
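The event-to-action routing can be illustrated with a small in-process dispatcher. This is a sketch under assumptions: `Event`, `Action` and `Dispatcher` are hypothetical names, and a plain map replaces the NSQ topics so that the example stays self-contained.

```go
package main

import "fmt"

// Event is a hypothetical envelope for anything that happens inside
// or outside the platform.
type Event struct {
	Topic   string
	Source  string
	Payload string
}

// Action reacts to an event.
type Action func(Event)

// Dispatcher routes events to the actions subscribed to their topic.
type Dispatcher struct {
	actions map[string][]Action
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{actions: map[string][]Action{}}
}

// Subscribe registers an action for a topic.
func (d *Dispatcher) Subscribe(topic string, a Action) {
	d.actions[topic] = append(d.actions[topic], a)
}

// Publish runs every action subscribed to the event's topic and
// reports how many actions ran.
func (d *Dispatcher) Publish(e Event) int {
	subscribed := d.actions[e.Topic]
	for _, a := range subscribed {
		a(e)
	}
	return len(subscribed)
}

func main() {
	d := NewDispatcher()
	d.Subscribe("instance.terminated", func(e Event) {
		fmt.Println("releasing resources for", e.Payload)
	})
	// An external source is normalised into the same Event type,
	// so subscribers do not need to know where it came from.
	d.Publish(Event{Topic: "instance.terminated", Source: "aws", Payload: "i-aaa"})
}
```

Normalising external sources into one event type at a single entry point is what keeps the change confined to one service when a new source is added.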
What about AWS resource access?
AWS Auth under the hood
• Each external orchestration service instance has a “global” view of our infrastructure
• Relies heavily on STS to operate across different accounts and regions
• Each service has a designated role for every account and region
(Diagram: a service obtains temporary security credentials for a role in AWS Account X and for a role in AWS Account Y)
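One way to picture the per-role temporary credentials is a small cache keyed by role ARN, sketched below. The types and the `AssumeFunc` hook are assumptions for illustration. In a real service the hook would wrap an STS AssumeRole call, which is deliberately left out so that the example runs without the AWS SDK.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RoleARN builds the ARN of an IAM role in the given account.
func RoleARN(accountID, roleName string) string {
	return "arn:aws:iam::" + accountID + ":role/" + roleName
}

// Credentials is a hypothetical, trimmed-down set of temporary credentials.
type Credentials struct {
	AccessKeyID string
	TTLSeconds  int
}

// AssumeFunc exchanges a role ARN for temporary credentials.
// It is injected so that the cache itself has no AWS dependency.
type AssumeFunc func(roleARN string) (Credentials, error)

type entry struct {
	creds   Credentials
	expires time.Time
}

// CredentialCache keeps temporary credentials per role until they expire.
type CredentialCache struct {
	mu     sync.Mutex
	assume AssumeFunc
	cache  map[string]entry
}

func NewCredentialCache(assume AssumeFunc) *CredentialCache {
	return &CredentialCache{assume: assume, cache: map[string]entry{}}
}

// Get returns cached credentials for the role and refreshes them
// through assume when they are missing or expired.
func (c *CredentialCache) Get(roleARN string) (Credentials, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[roleARN]; ok && time.Now().Before(e.expires) {
		return e.creds, nil
	}
	creds, err := c.assume(roleARN)
	if err != nil {
		return Credentials{}, err
	}
	ttl := time.Duration(creds.TTLSeconds) * time.Second
	c.cache[roleARN] = entry{creds: creds, expires: time.Now().Add(ttl)}
	return creds, nil
}

func main() {
	cache := NewCredentialCache(func(arn string) (Credentials, error) {
		// Stand-in for a real call to the token service.
		return Credentials{AccessKeyID: "temporary-key", TTLSeconds: 900}, nil
	})
	// The account ID and role name below are made-up examples.
	creds, err := cache.Get(RoleARN("123456789012", "compute-eu-west-1"))
	fmt.Println(creds.AccessKeyID, err)
}
```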
Shared environments create contention. We decided to boost our developers' productivity and give them on-demand environments
(Diagram: several independent ENV boxes)
On demand environments
Environment Service (orchestration layer, on top of CloudFormation) builds the infrastructure and core platform for two kinds of environment:
• Single Instance Environment (SIE): single server on AWS, Vagrant support. ETA: ~12 min
• Multi Instance Environment (MIE): hundreds of servers / single AWS region, multi-region environments. ETA: ~40 min
Release Service: clones services, config and *data from any env (PROD)
All of this so we can do
(Diagram: the Orchestration layer managing several SIE and MIE environments alongside Pre Prod and Live)
Preparing for…
(Diagram: the Orchestration layer managing several SIE and MIE environments alongside Live)
• The only services directly aware of our cloud provider specifics – gives us a lot of
flexibility and lets us introduce changes quickly
• Each of them fulfills a very specific task and together they create powerful workflows
• Nothing else in our platform is aware of the underlying cloud layer
• We did not envision being “cloud agnostic” – it just happened
Provides the most essential platform functions for every service:
• Service Discovery
• Service Provisioning
• Routing & Load Balancing
• Authentication/Authorization
• Monitoring
• Configuration
Service Provisioning
Provisioning overview
(Diagram labels: Build Pipeline, Amazon S3, Docker Registry, Provisioning Manager, a Provisioning Service on each instance running services as a process or a container, two auto scaling groups)
Service deployment specifics
• Each service is decoupled from the rest and deployed individually
• We run multiple services on the same instance but each service is
deployed in at least 3 AZs
• We rely on auto scaling groups for organizing and scaling our
workload
• We use static partitioning to match a service to an auto scaling group,
which results in suboptimal resource utilisation (25% - 50%)
Deploying a service
service name version
auto scaling group
Coming soon: Elastic resource pools and QoS scheduling
(Diagram: a QoS Scheduler and an ECS Cluster Manager on top of an Elastic Resource Pool of six instances, each running an ECS Agent, spread across eu-west-1a, eu-west-1b and eu-west-1c on the AWS cloud provider)
So what does this mean?
Elastic resource pool: 75-80% utilization
One word – such difference!
(Diagram: six instances across eu-west-1a, eu-west-1b and eu-west-1c)
Why build our own scheduler?
We want a cloud-native scheduler that is aware of the cloud specifics and our
microservices ecosystem:
• Service Priority
• Service specific runtime metrics
• Interference
• Cloud awareness (availability zones, pool elasticity…)
Running services in a pay-as-you-go fashion will soon be as much a reality as
today's on-demand compute
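To make the first scheduling criterion concrete, here is a deliberately simplified Go sketch of priority-ordered placement onto a shared pool. It is not the scheduler from the talk. The `Service`, `Instance` and `Schedule` names and the single CPU dimension are assumptions, and it ignores runtime metrics, interference and availability zones.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical unit of work to place.
type Service struct {
	Name     string
	Priority int // higher values are placed first
	CPU      int // requested CPU units
}

// Instance is a hypothetical member of the shared resource pool.
type Instance struct {
	ID  string
	CPU int // total CPU units
}

// Schedule places services in descending priority order using first
// fit. It returns a map of service name to instance ID, plus the names
// of the services that did not fit anywhere.
func Schedule(services []Service, pool []Instance) (map[string]string, []string) {
	ordered := append([]Service(nil), services...) // do not reorder the caller's slice
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority > ordered[j].Priority
	})
	free := make([]int, len(pool))
	for i, in := range pool {
		free[i] = in.CPU
	}
	placed := map[string]string{}
	var unplaced []string
	for _, s := range ordered {
		fits := false
		for i := range pool {
			if free[i] >= s.CPU {
				free[i] -= s.CPU
				placed[s.Name] = pool[i].ID
				fits = true
				break
			}
		}
		if !fits {
			unplaced = append(unplaced, s.Name)
		}
	}
	return placed, unplaced
}

func main() {
	pool := []Instance{{ID: "i-0", CPU: 4}, {ID: "i-1", CPU: 4}}
	services := []Service{
		{Name: "reporting", Priority: 1, CPU: 3},
		{Name: "payments", Priority: 5, CPU: 4},
		{Name: "search", Priority: 3, CPU: 2},
	}
	placed, unplaced := Schedule(services, pool)
	fmt.Println(placed, unplaced)
}
```

When the pool is too small, it is the low-priority service that stays unplaced. A shared pool can also pack services from different groups onto the same instances, which is where the utilisation gain over static partitioning would come from.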
• Self-contained units of execution
• Built around business capabilities or domain objects
• Small enough to be rewritten in a few days
• They are all about adding business value
Service interactions – not as scary as it looks!
A microservice under the hood
(Diagram: a Service made of a Handler, Logic and Storage. The service-layer is a library for abstracting service-to-service comms; the platform-layer provides self-configuring external service adapters.)
Any service gets for free:
• Service to service communication libs
• Discovery
• Configuration
• A/B testing capabilities
• Monitoring & Instrumentation
• … and much more
Microservices are all about tooling
Live request tracing
You need to identify your main KPIs
Thanks!
Get a taxi home on us:
@nathariel
boyan@hailocab.com
@HailoTech


Editor's notes

  1. Seamless user experience
  2. Cloud
  3. We built our custom provisioning system and started by running a number of services on a single instance. Initially we ran services as normal processes on the instance, but this started causing noisy-neighbour problems. Several months ago we gradually started moving to containers, aiming for isolation and resource control capabilities.
  4. We want an elastic resource pool where services are scheduled on an as-needed basis. We don’t want to manage services manually; we want to leave that to a smart scheduler.