Patterns for building resilient and scalable microservices platform on AWS

Boyan Dimitrov,
Platform Automation Lead @ Hailo
@nathariel

Back in 2011 we started simple
We quickly found out that supporting monoliths is hard:
• Hard to maintain the codebase
• Hard to build new features
• Hard to scale the dev teams
Failure to deliver business value
[Diagram: Frontend, Backend, MySQL]

So in 2013 we ended up doing…

At present we have
• Microservices ecosystem (99.9% written in Go)
• Designed specifically for the cloud – different building blocks and
components will constantly be in flux, broken or unavailable
• 1000+ AWS instances spanning multiple regions
• 200+ services in production

[Diagram: platform layers, top to bottom]
Services
Core Platform: Provisioning, Discovery, Monitoring, Routing, Login, Config
Cloud Provider: Orchestration, Env, DNS, Release, AutoScaling, Compute, EIP, Whisper
AWS: VPC, EC2, Auto Scaling, S3, CF, Route 53, Redshift

• Lowest level building blocks
• We mostly use basic PaaS components and services as they cover most of our
needs
• We expect every underlying component to fail and we designed for this

[Diagram: two regions, eu-west-1 and us-east-1, each with a Proxy Layer, Message Bus+, Go Services and C*]

[Diagram: eu-west-1 across three AZs (eu-west-1a, eu-west-1b, eu-west-1c)]
Proxy Layer: API in every AZ
Message Bus: RabbitMQ in every AZ
Services: Go services in every AZ (x many)
Shared Infra: C*, NSQ and ZK in every AZ

Basic principles
• We use auto scaling groups for everything
  ◦ Guarantees each component can be rebuilt automatically
  ◦ Including our database clusters that run on ephemeral storage (we do keep
    6 copies of each piece of data in 2 regions)
• Minimum of 3 AZs in every region
• Every workflow is automated
• Every component has to be self healing and scalable

• Our “cloud provider abstraction” layer
• Main purpose is infrastructure and workflow automation and discovery
• Has a global view of everything happening across our infrastructure
• Provides additional capabilities on top of AWS
• Runs in dedicated VPCs across two regions
[Diagram: Cloud Provider layer: Orchestration, Env, DNS, Release, AutoScaling, Compute, EIP, Whisper]

It all started with a small challenge we had to overcome:
Payment providers whitelist source IP addresses

EIP Service
Elastic IP Provisioning Service
[Diagram: two NAT auto scaling groups, LIVE and FOO, and the elastic IP pool entries matched to them]
51.x.x.1 nat live
51.x.x.2 nat live
51.x.x.3 nat live
50.x.x.5
1
nat foo
Maintains elastic IP pools across all our accounts and matches them against
auto scaling groups and environments
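
The pool matching described for the EIP service can be sketched roughly as follows. This is a minimal illustration and not Hailo's implementation: the `Pools` type, the (role, environment) pool key and the `Claim`/`Release` methods are assumptions made for the example, and a real service would also associate the address with the instance through the EC2 API.

```go
package main

import (
	"errors"
	"fmt"
)

// poolKey identifies an elastic IP pool by role and environment,
// e.g. {"nat", "live"}. Both fields are assumptions for this sketch.
type poolKey struct{ Role, Env string }

// Pools tracks which pre-allocated elastic IPs are free or taken.
type Pools struct {
	free  map[poolKey][]string // unassigned IPs per pool
	owner map[string]string    // IP -> instance ID
}

func NewPools() *Pools {
	return &Pools{free: map[poolKey][]string{}, owner: map[string]string{}}
}

// Add registers IPs in the pool for a role/environment pair.
func (p *Pools) Add(role, env string, ips ...string) {
	k := poolKey{role, env}
	p.free[k] = append(p.free[k], ips...)
}

// ErrPoolExhausted is returned when a pool has no free address left.
var ErrPoolExhausted = errors.New("no free elastic IP in pool")

// Claim hands the next free IP of the matching pool to an instance.
func (p *Pools) Claim(role, env, instanceID string) (string, error) {
	k := poolKey{role, env}
	ips := p.free[k]
	if len(ips) == 0 {
		return "", ErrPoolExhausted
	}
	ip := ips[0]
	p.free[k] = ips[1:]
	p.owner[ip] = instanceID
	return ip, nil
}

// Release returns an IP to its pool, e.g. when an instance terminates.
func (p *Pools) Release(role, env, ip string) {
	if _, ok := p.owner[ip]; !ok {
		return // not claimed, nothing to do
	}
	delete(p.owner, ip)
	k := poolKey{role, env}
	p.free[k] = append(p.free[k], ip)
}

func main() {
	pools := NewPools()
	pools.Add("nat", "live", "51.x.x.1", "51.x.x.2", "51.x.x.3")
	ip, err := pools.Claim("nat", "live", "i-aaa")
	fmt.Println(ip, err)
}
```

Keeping the pool fixed is what makes whitelisting workable: an instance rebuilt by its auto scaling group picks up an address the payment provider already knows.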

We do a lot of server discovery
• Both external and internal orchestration tools rely on AWS APIs for server
discovery
• Puppet has AWS integration for clustering infra
• With this many API clients we keep hitting “RequestLimitExceeded”
• Exponential back-off mitigates the issue but does not solve it if you have
many clients
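
For reference, exponential back-off with jitter looks roughly like the sketch below. The base delay, the cap, the `retry` helper and the `errThrottled` stand-in are all assumptions made for illustration; the real AWS SDKs ship their own retry logic.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// backoff returns the delay before retry number attempt (0-based):
// base doubled once per attempt, capped at limit.
func backoff(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// withJitter picks a random delay in [0, d] ("full jitter") so that
// many clients do not retry in lockstep.
func withJitter(d time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// errThrottled stands in for an AWS "RequestLimitExceeded" response.
var errThrottled = errors.New("RequestLimitExceeded")

// retry calls fn until it succeeds, fails with a non-throttling error,
// or attempts run out. sleep is injectable to keep the sketch testable.
func retry(attempts int, sleep func(time.Duration), fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errThrottled) {
			return err
		}
		if i < attempts-1 {
			sleep(withJitter(backoff(i, 100*time.Millisecond, 5*time.Second)))
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(5, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return errThrottled // simulate two throttled calls
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Back-off only spreads the load over time. The request limit is shared per account, so with enough independent clients the total request rate still exceeds it.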

Compute service to the rescue
• A distributed cache of all compute instances and their metadata
• Powerful query API (very fast!)
• Main interface for creating new compute instances
• Reconciles any changes in any AWS account within seconds
[Diagram: the Compute Service connecting other providers, internal tools, external tools and services]
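
The idea of answering discovery queries from a reconciled cache rather than from the AWS APIs can be sketched as below. It is a single-process illustration only: the `Instance` fields, the tag-based `Query` and the per-account `Reconcile` are assumptions, and the real service is distributed.

```go
package main

import (
	"fmt"
	"sync"
)

// Instance is a cached view of one compute instance. Field names are
// illustrative, not the real service's schema.
type Instance struct {
	ID, Account, Region, AZ string
	Tags                    map[string]string
}

// Cache answers discovery queries from memory instead of the AWS APIs.
type Cache struct {
	mu        sync.RWMutex
	instances map[string]Instance // keyed by instance ID
}

func NewCache() *Cache { return &Cache{instances: map[string]Instance{}} }

// Reconcile replaces the cached state of one account with a fresh
// snapshot: instances missing from the snapshot are dropped.
func (c *Cache) Reconcile(account string, snapshot []Instance) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, in := range c.instances {
		if in.Account == account {
			delete(c.instances, id)
		}
	}
	for _, in := range snapshot {
		c.instances[in.ID] = in
	}
}

// Query returns every instance whose tags contain all the given pairs.
func (c *Cache) Query(tags map[string]string) []Instance {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []Instance
	for _, in := range c.instances {
		if matches(in.Tags, tags) {
			out = append(out, in)
		}
	}
	return out
}

// matches reports whether have contains every key/value pair in want.
func matches(have, want map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	c := NewCache()
	c.Reconcile("account-x", []Instance{
		{ID: "i-1", Account: "account-x", Region: "eu-west-1", AZ: "eu-west-1a",
			Tags: map[string]string{"role": "nat"}},
	})
	fmt.Println(len(c.Query(map[string]string{"role": "nat"})))
}
```

Only the reconciler talks to the provider, so the number of AWS API calls no longer grows with the number of tools doing discovery.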

Everything in our platform emits events
So naturally we want to capture all external events as well!

Whisper Service
It’s all about event driven compute – think Lambda but within our platform
Hundreds of publishers & subscribers
To subscribe to any new event source we only have to change a single service
[Diagram: external sources → events → NSQ topics → events → actions]
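
Event driven compute of this kind boils down to routing events to registered actions. The sketch below shows that core in process; the `Event`, `Action` and `Dispatcher` types and the topic name are assumptions for illustration, and in the platform the events would arrive from NSQ topics rather than from a direct function call.

```go
package main

import "fmt"

// Event is a simplified platform event; the fields are assumptions.
type Event struct {
	Topic   string
	Payload map[string]string
}

// Action reacts to an event, much like a Lambda function would.
type Action func(Event) error

// Dispatcher routes events from a topic to the actions subscribed to it.
type Dispatcher struct {
	actions map[string][]Action
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{actions: map[string][]Action{}}
}

// On subscribes an action to a topic.
func (d *Dispatcher) On(topic string, a Action) {
	d.actions[topic] = append(d.actions[topic], a)
}

// Dispatch runs every action subscribed to the event's topic. It returns
// the number of actions that succeeded and the errors of those that failed.
func (d *Dispatcher) Dispatch(e Event) (ran int, errs []error) {
	for _, a := range d.actions[e.Topic] {
		if err := a(e); err != nil {
			errs = append(errs, err)
			continue
		}
		ran++
	}
	return ran, errs
}

func main() {
	d := NewDispatcher()
	// "autoscaling.instance-terminated" is a made-up topic name.
	d.On("autoscaling.instance-terminated", func(e Event) error {
		fmt.Println("deregistering", e.Payload["instance"])
		return nil
	})
	ran, errs := d.Dispatch(Event{
		Topic:   "autoscaling.instance-terminated",
		Payload: map[string]string{"instance": "i-123"},
	})
	fmt.Println(ran, len(errs))
}
```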

What about AWS resource access?

AWS Auth under the hood
• Each external orchestration service instance has a “global” view of our
infrastructure
• Relies heavily on STS to operate across different accounts and regions
• Each service has a designated role for every account and region
[Diagram: a service obtains temporary security credentials for a role in AWS Account X and for a role in AWS Account Y]
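
A rough sketch of the cross-account pattern follows: one role per account and region, assumed through STS, with the temporary credentials cached until shortly before they expire. The STS call is abstracted behind a function type so the example runs without AWS access, and the "<service>-<region>" role naming scheme is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Credentials mirrors the shape of STS temporary security credentials.
type Credentials struct {
	AccessKeyID, SecretAccessKey, SessionToken string
	Expires                                    time.Time
}

// AssumeRoleFunc abstracts the STS AssumeRole call for this sketch.
type AssumeRoleFunc func(roleARN, sessionName string) (Credentials, error)

// roleARN builds the ARN of the role a service uses in one account.
// The "<service>-<region>" role name is a hypothetical convention.
func roleARN(account, service, region string) string {
	return fmt.Sprintf("arn:aws:iam::%s:role/%s-%s", account, service, region)
}

// CredentialCache reuses credentials until shortly before they expire.
type CredentialCache struct {
	assume AssumeRoleFunc
	now    func() time.Time
	cached map[string]Credentials // keyed by role ARN
}

func NewCredentialCache(assume AssumeRoleFunc) *CredentialCache {
	return &CredentialCache{
		assume: assume,
		now:    time.Now,
		cached: map[string]Credentials{},
	}
}

// Get returns valid credentials for the service's role in the given
// account and region, assuming the role again when the cache is stale.
func (c *CredentialCache) Get(account, service, region string) (Credentials, error) {
	arn := roleARN(account, service, region)
	// Refresh one minute early so callers never receive expiring credentials.
	if cr, ok := c.cached[arn]; ok && c.now().Add(time.Minute).Before(cr.Expires) {
		return cr, nil
	}
	cr, err := c.assume(arn, service)
	if err != nil {
		return Credentials{}, err
	}
	c.cached[arn] = cr
	return cr, nil
}

func main() {
	// fake stands in for a real STS client.
	fake := func(arn, session string) (Credentials, error) {
		return Credentials{AccessKeyID: "EXAMPLE", Expires: time.Now().Add(time.Hour)}, nil
	}
	cache := NewCredentialCache(fake)
	creds, err := cache.Get("111111111111", "compute", "eu-west-1")
	fmt.Println(creds.AccessKeyID, err)
}
```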

Shared environments create contention. We decided to boost our
developers' productivity and give them on-demand environments
[Diagram: several separate environments (ENV)]

On demand environments
Environment Service (orchestration layer, backed by CloudFormation)
• SIE (Single Instance Environment): single server on AWS
• MIE (Multi Instance Environment): hundreds of servers / single AWS region
• Each environment contains the infrastructure and the core platform

Environment Service and Release Service
[Diagram: the Release Service clones services, config and *data from any environment (PROD) into an SIE or MIE]
ETA: ~12 min (SIE), ~40 min (MIE)

Environment Service
[Diagram: same as above, with Vagrant support added for SIE and multi-region environments added for MIE]

All of this so we can do
[Diagram: orchestration across many SIE and MIE environments, Pre Prod and Live]

Preparing for…
[Diagram: orchestration across many SIE and MIE environments and Live]

• The only services directly aware of our cloud provider specifics – gives us a lot of
flexibility and lets us introduce changes quickly
• Each of them fulfills a very specific task and together they create powerful workflows
• Nothing else in our platform is aware of the underlying cloud layer
• We did not envision being “cloud agnostic” – it just happened

Provides the most essential platform functions for every service:
• Service Discovery
• Service Provisioning
• Routing & Load Balancing
• Authentication/Authorization
• Monitoring
• Configuration

Provisioning Service
Provisioning overview
[Diagram components: Build Pipeline, Amazon S3, Docker Registry, Provisioning Manager, and a Provisioning Service on each instance; instances live in auto scaling groups and run services as a process or a container]

Service deployment specifics
• Each service is decoupled from the rest and deployed individually
• We run multiple services on the same instance but each service is
deployed in at least 3 AZs
• We rely on auto scaling groups for organizing and scaling our
workload
• We use static partitioning to match a service to an auto scaling group
and this results in suboptimal resource utilisation (25–50%)

Deploying a service
[Diagram: a deployment is described by service name, version and auto scaling group]

Coming soon: Elastic resource pools and QoS scheduling
[Diagram components: QoS Scheduler and Cluster Manager in the Cloud Provider layer; AWS ECS; an Elastic Resource Pool of instances, each running an ECS Agent, across eu-west-1a, eu-west-1b and eu-west-1c]

So what does this mean?
Elastic resource pool: 75-80% utilization
One word – such difference!
[Diagram: instances pooled across eu-west-1a, eu-west-1b and eu-west-1c]

Why build our own scheduler?
We want a cloud-native scheduler that is aware of the cloud specifics and our
microservices ecosystem:
• Service priority
• Service-specific runtime metrics
• Interference
• Cloud awareness (availability zones, pool elasticity…)
Running services in a pay-as-you-go fashion will soon be as much a reality as
today's on-demand compute
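
To make two of these criteria concrete, here is a toy placement routine that honours service priority and availability zone spread. It is a sketch under stated assumptions and not the scheduler described in the talk: the `Node` and `Service` types, the single CPU dimension and the tie-breaking rules are all invented for illustration, and runtime metrics and interference are left out entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is one instance in the elastic resource pool.
type Node struct {
	ID, AZ  string
	FreeCPU int // free capacity in arbitrary units
}

// Service is a workload to place; a higher Priority is placed first.
type Service struct {
	Name     string
	Priority int
	CPU      int // per-replica demand
	Replicas int
}

// Schedule places replicas so that each service is spread over AZs:
// it prefers the AZ holding the fewest replicas of that service and,
// within it, the node with the most free capacity.
// It returns service name -> node IDs and leaves its inputs unmodified.
func Schedule(nodes []Node, services []Service) map[string][]string {
	pool := append([]Node(nil), nodes...)
	queue := append([]Service(nil), services...)
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].Priority > queue[j].Priority
	})
	placed := map[string][]string{}
	for _, s := range queue {
		perAZ := map[string]int{} // replicas of s per AZ so far
		for r := 0; r < s.Replicas; r++ {
			best := -1
			for i, n := range pool {
				if n.FreeCPU < s.CPU {
					continue
				}
				if best == -1 || better(n, pool[best], perAZ) {
					best = i
				}
			}
			if best == -1 {
				break // pool is full; a real scheduler might grow the pool
			}
			pool[best].FreeCPU -= s.CPU
			perAZ[pool[best].AZ]++
			placed[s.Name] = append(placed[s.Name], pool[best].ID)
		}
	}
	return placed
}

// better reports whether node a is a strictly better choice than b.
func better(a, b Node, perAZ map[string]int) bool {
	if perAZ[a.AZ] != perAZ[b.AZ] {
		return perAZ[a.AZ] < perAZ[b.AZ]
	}
	return a.FreeCPU > b.FreeCPU
}

func main() {
	nodes := []Node{
		{ID: "n1", AZ: "eu-west-1a", FreeCPU: 8},
		{ID: "n2", AZ: "eu-west-1b", FreeCPU: 8},
		{ID: "n3", AZ: "eu-west-1c", FreeCPU: 8},
	}
	services := []Service{{Name: "login", Priority: 9, CPU: 2, Replicas: 3}}
	fmt.Println(Schedule(nodes, services))
}
```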

• Self-contained units of execution
• Built around business capabilities or domain objects
• Small enough to be rewritten in a few days
• They are all about adding business value

Service interactions – not as scary as it looks!

A microservice under the hood
[Diagram: a Service is made of Handler, Logic and Storage, built on the service-layer and platform-layer libraries, which provide a library for abstracting service-to-service comms and self-configuring external service adapters]
Any service gets for free:
• Service to service communication libs
• Discovery
• Configuration
• A/B testing capabilities
• Monitoring & Instrumentation
• … and much more
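
What "gets for free" means in code can be sketched as follows: the service author writes only handlers, and shared plumbing adds routing and instrumentation around them. The `Service` type, `Register`, `Call` and the call counter below are assumptions for illustration and do not reproduce the Hailo libraries, where requests arrive over the message bus.

```go
package main

import (
	"errors"
	"fmt"
)

// Handler holds the business logic for one endpoint.
type Handler func(req map[string]string) (map[string]string, error)

// Service wires handlers to shared plumbing, so every handler is
// routed and instrumented without extra code in the handler itself.
type Service struct {
	Name     string
	handlers map[string]Handler
	Calls    map[string]int // per-endpoint call counter, a monitoring stand-in
}

func NewService(name string) *Service {
	return &Service{
		Name:     name,
		handlers: map[string]Handler{},
		Calls:    map[string]int{},
	}
}

// Register exposes a handler under an endpoint name.
func (s *Service) Register(endpoint string, h Handler) {
	s.handlers[endpoint] = h
}

// ErrNoEndpoint is returned for requests to an unregistered endpoint.
var ErrNoEndpoint = errors.New("unknown endpoint")

// Call dispatches a request to the matching handler and records it.
func (s *Service) Call(endpoint string, req map[string]string) (map[string]string, error) {
	h, ok := s.handlers[endpoint]
	if !ok {
		return nil, ErrNoEndpoint
	}
	s.Calls[endpoint]++
	return h(req)
}

func main() {
	svc := NewService("greeter") // hypothetical service name
	svc.Register("hello", func(req map[string]string) (map[string]string, error) {
		return map[string]string{"msg": "hello " + req["name"]}, nil
	})
	rsp, err := svc.Call("hello", map[string]string{"name": "world"})
	fmt.Println(rsp["msg"], err, svc.Calls["hello"])
}
```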

Microservices are all about tooling

You need to identify your main KPIs

Thanks!
Get a taxi home on us:
@nathariel
boyan@hailocab.com
@HailoTech
