Speaker: Miles Ward - Solutions Architect, Amazon Web Services
Today’s technology systems deliver ever more critical capabilities to enterprises, startups, and all users in-between. Amazon Web Services, the leader in Infrastructure-as-a-Service, has delivered several solutions that provide unique value for your efforts towards high-availability and fault-tolerance. Learn best practices for delivering these innovations to your operations from experienced HA innovator and AWS Solutions Architect Manager Miles Ward.
6. #6#
Cloud Computing Benefits
• No Up-Front Capital Expense
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
• Low Cost
• Deploy
7. #7#
Cloud Computing Fault-Tolerance Benefits
• No Up-Front HA Capital Expense
• Pay for DR Only When You Use it
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
• Low Cost Backups
• Deploy
8. #8#
AWS Cloud allows Overcast Redundancy
Have the shadow duplicate of your infrastructure ready to go when you need it…
…but only pay for what you actually use
12. #12#
Terminology
Fault tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
Disaster recovery: The process, policies and procedures related to restoring critical systems after a catastrophic event. Goal is to get application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
High availability: Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.
13. #13#
Terminology - continued
RTO (Recovery Time Objective): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
RPO (Recovery Point Objective): Acceptable data loss as a result of recovering from a disaster/catastrophic event
RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
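To make the two objectives concrete, here is a minimal Python sketch that checks a hypothetical recovery plan against target values. The function name, its parameters and all the numbers are illustrative assumptions, not part of the presentation. It also assumes that worst-case data loss equals one full interval between backups.

```python
from datetime import timedelta


def meets_objectives(backup_interval, estimated_recovery_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a hypothetical recovery plan.

    Assumption: worst-case data loss is one full backup interval, because a
    failure can happen just before the next backup would have run.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = estimated_recovery_time <= rto
    return rpo_ok, rto_ok


# Illustrative numbers only: hourly snapshots, 45-minute restore,
# targets of RPO = 15 minutes and RTO = 1 hour.
result = meets_objectives(
    backup_interval=timedelta(hours=1),
    estimated_recovery_time=timedelta(minutes=45),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=1),
)
print(result)  # (False, True): recovery is fast enough, but too much data could be lost
```

Tightening the RPO in this sketch means backing up or replicating more often, which is one way the cost tradeoff shows up in practice.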
14. #14#
Takeaways
• Understand core concepts behind HA and DR
• Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
• Best Practices for implementation of these architectural options within AWS (independent of RightScale)
• Multi-Availability Zone (AZ) and Multi-Region
• Architectural options and Considerations / pros and cons of these options
• Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
15. #15#
Regions & Availability Zones
• Zones within a region share a LAN (high bandwidth, low latency, private IP access)
• Zones utilize separate power sources, are physically segregated
• Regions are “islands”, and share no resources.
Japan: Availability Zones A, B
EU West Region: Availability Zones A, B
US East Region: Availability Zones A, B, C
US West Region: Availability Zones A, B
Singapore: Availability Zones A, B
Source: AWS
16. #16#
Designing for Failure
• Large scale failures in the cloud are rare but do happen
• Application owners are ultimately responsible for
availability and recoverability
• Balance cost and complexity of HA efforts against
risk(s) you are willing to bear
• Cloud infrastructure has made DR and HA remarkably
affordable versus past options
-Multi-Server
-Multi-AZ (Availability Zone)
-Multi-Region
“Everything fails, all the time.”
Werner Vogels, CTO Amazon.com
17. #17#
Designing for Failure – Basic Concepts
• Fault tolerance is the goal. Degradation of service may occur,
but application continues to function.
• Avoid single points of failure (SPOF)
• Assume everything fails (remember Werner’s mantra) and
design accordingly
• Plan and practice your recovery process (both for HA and DR)
• Remember that better HA and DR equals more $$$. So find
that acceptable balance.
18. #18#
High Availability
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)
Follow a few general best practices to absorb
application component outages…
19. #19#
General HA Best Practices
• Avoid single points of failure.
• Place each component (load balancers, app servers, databases) in at least two AZs.
• Replicate data across AZs (HA) and back up or replicate across regions for failover (DR)
• Set up monitoring, alerts and operations to identify problems and automate their resolution or the failover process.
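The "at least two AZs" rule lends itself to an automated check. The sketch below is one possible way to flag tiers that live in a single zone. The inventory format, the tier names and the function name are assumptions made for illustration; a real check would read the inventory from your cloud provider or management tool.

```python
from collections import defaultdict

# Hypothetical inventory: (tier, availability zone) for each running server.
SERVERS = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
]


def single_zone_tiers(servers, min_zones=2):
    """Return the sorted list of tiers that run in fewer than min_zones zones."""
    zones_by_tier = defaultdict(set)
    for tier, zone in servers:
        zones_by_tier[tier].add(zone)
    return sorted(t for t, zones in zones_by_tier.items() if len(zones) < min_zones)


print(single_zone_tiers(SERVERS))  # ['database']
```

In this made-up inventory the database tier is the single point of failure, since it has no replica in a second zone.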
20. #20#
• High availability for top web properties
with 270M visitors/month
• Migration from datacenter to AWS
• RightScale provides
-Self-service access to developers
-Consistency and low maintenance
-Usage and cost accounting
-Multi-region architectures to avoid downtime
21. #21#
Multi-Zone HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); zones US-EAST 1a and US-EAST 1b; load balancers, autoscaled app servers, master DB replicating to slave DB, EBS, snapshots, S3]
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• Consider local storage for additional slave database to remove dependency on attached volume.
• Consider distributed NoSQL databases with the same distribution considerations.
22. #22#
Disaster Recovery
DR presents a few new wrinkles compared to HA,
but there are multiple options depending on your
needs and budget…
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)
23. #23#
HA/DR Checklist for Risk Mitigation
• Determine who owns the architecture, DR process and testing.
• Develop expertise in-house and / or get outside help.
• Conduct a risk assessment for each application.
• Specify your target RTO and RPO.
• Design for failure starting with application architecture. This
will help drive the infrastructure architecture.
24. #24#
HA/DR Checklist for Risk Mitigation
• Implement HA best practices balancing cost, complexity and
risk.
-Automate infrastructure for consistency and reliability.
• Document operational processes and automations.
• Test the failover... then test it again.
• Release the Chaos Monkey.
25. #25#
Multi-Region/Cloud DR Options
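• Cold DR (Most Common): downtime > 1 Hour, availability 99%, cost $
• Warm DR (Recommended): downtime < 1 Hour, availability 99.5%, cost $$
• Hot DR (Least Common): downtime < 5 Mins, availability 99.9%, cost $$$
• Multi-Cloud HA (Live/Live Config): downtime 0, availability 99.999%, cost $$$$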
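The availability percentages on this slide are easier to compare once converted into allowed downtime. The following sketch does that arithmetic; the function name is an illustrative choice, and it assumes a 365-day year with availability measured over the whole year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; ignores leap years


def downtime_hours_per_year(availability_percent):
    """Convert an availability percentage into allowed downtime hours per year."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.5, 99.9, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Under these assumptions, 99% allows about 87.6 hours of downtime per year, 99.5% about 43.8 hours, 99.9% about 8.76 hours, and 99.999% about 5.3 minutes.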
26. #26#
Multi-Region Cold DR
[Diagram: DNS (172.168.7.31); regions US EAST and US WEST; load balancers, app servers, master DB, slave DBs, replication, EBS, snapshots, S3]
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
27. #27#
Multi-Region Warm DR
[Diagram: DNS (172.168.7.31); regions US EAST and US WEST; load balancers, app servers, master DB, slave DBs, replication, EBS, snapshots, S3]
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
28. #28#
Multi-Region Hot DR
[Diagram: DNS (172.168.7.31); regions US EAST and US WEST; load balancers, app servers, master DB, slave DBs, replication, EBS, snapshots, S3]
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
29. #29#
Hybrid HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); US-EAST and CHICAGO; load balancers, app servers, master DB, slave DBs, replication, EBS, snapshots, S3, SWIFT]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
30. #30#
Hybrid HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); US-EAST and CHICAGO; load balancers, app servers, master DB, slave DBs, replication, EBS volume, snapshots, S3, SWIFT]
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security requires additional effort as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
31. #31#
• Procurement software
• SLAs with their customers require HA
• Subway chain is a customer that procures perishable goods
through Coupa
33. #33#
Automating HA and DR
• Use dynamic DNS for your database servers
-Allow app servers to use a single FQDN.
-Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
-App servers can connect to all load balancers automatically at launch
-No manual intervention
-No DNS modifications
• Automated promotion of slave to master
-Process is automated
-Decision to run process is manual
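The combination of a single FQDN, a low TTL, and an automated promotion that a person decides to trigger can be sketched as a toy model. Everything here is hypothetical: the class names, the `db.example.internal` name, the addresses and the 60-second TTL are illustrative, and the DNS record is a plain in-memory object rather than a call to any real DNS service.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    fqdn: str
    address: str
    ttl_seconds: int  # low TTL so clients pick up a new master quickly


class DatabaseCluster:
    """Toy model: one master and its slaves behind a single FQDN."""

    def __init__(self, fqdn, master, slaves, ttl_seconds=60):
        self.master = master
        self.slaves = list(slaves)
        self.record = DnsRecord(fqdn, master, ttl_seconds)

    def promote_slave(self, operator_confirmed):
        """Promote the first slave to master and repoint the DNS record.

        The steps are automated, but the decision to run them stays manual:
        nothing happens unless an operator confirms.
        """
        if not operator_confirmed:
            return False
        if not self.slaves:
            raise RuntimeError("no slave available to promote")
        self.master = self.slaves.pop(0)
        self.record.address = self.master
        return True


cluster = DatabaseCluster("db.example.internal", "10.0.1.10", ["10.0.2.10"])
cluster.promote_slave(operator_confirmed=True)
print(cluster.record)
```

Because app servers only know the FQDN, they need no configuration change after the promotion; they reconnect to the new master once the cached record expires.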
34. #34#
How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and hybrid without modification
1. ServerTemplate contains a list of MultiCloud Images (MCIs): Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3
2. When the Server is created, a specific MCI is chosen: Cloud A, RightImage 1
3. The appropriate RightImage is used at launch: Cloud A, Image 1
RightImage: Stability across clouds
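The idea of resolving one cloud-independent image definition to a cloud-specific image at launch can be illustrated with a plain lookup. This is not RightScale's API: the dictionary, the function name and the error handling are assumptions made purely to show the concept, using the cloud and image names from the slide.

```python
# Hypothetical data: one MultiCloud Image (MCI) maps each cloud to a RightImage.
MULTICLOUD_IMAGE = {
    "Cloud A": "RightImage 1",
    "Cloud B": "RightImage 2",
    "Cloud C": "RightImage 3",
}


def image_for_launch(mci, cloud):
    """Return the image to launch in the given cloud, or raise if unsupported."""
    try:
        return mci[cloud]
    except KeyError:
        raise ValueError(f"MCI has no image for {cloud!r}") from None


print(image_for_launch(MULTICLOUD_IMAGE, "Cloud A"))  # RightImage 1
```

The server definition stays the same across clouds; only the lookup result changes, which is what lets the same template launch in another region or cloud without modification.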
35. #35#
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
36. #36#
DR Cost Comparison Example
Multi-Region Cold DR: Total $4480 / month
-Running: $4470 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge))
-Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge))
-Replication: $10 / month (25GB / day cross-zone)
Multi-Region Warm DR: Total $5630 / month
-Running: $5540 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
-Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge))
-Replication: $90 / month (25GB / day cross-region)
Multi-Region Hot DR: Total $8800 / month
-Running: $8440 / month (6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
-Replication: $360 / month (100GB / day cross-region)
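The totals on this slide are the sum of the running, staged and replication figures, which a few lines of Python can confirm. The dollar amounts are copied from the slide; the dictionary layout and function name are illustrative, and the staged cost for Hot DR is assumed to be $0 because the slide lists no staged servers for that option.

```python
# Monthly figures (USD) copied from the comparison on this slide.
# Assumption: Hot DR has no staged servers, so its staged cost is 0.
COSTS = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(option):
    """Sum the running, staged and replication costs for one DR option."""
    return sum(COSTS[option].values())


for name in COSTS:
    extra = monthly_total(name) - monthly_total("Cold DR")
    print(f"{name}: ${monthly_total(name)} / month (+${extra} vs Cold DR)")
```

In this example Warm DR costs $1150 per month more than Cold DR, while Hot DR costs $4320 more, which is the gap behind the "Recommended" and "Least Common" labels.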
37. #37#
Outage-Proofing Best Practices
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones; back up across regions
• Design stateless apps for resilience to reboot / relaunch
• Monitor, alert, and automate operations to speed up failover
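"Maintain capacity to absorb zone or region failures" comes down to simple arithmetic: size each zone so the remaining zones can carry the full load when one is lost. The sketch below shows one way to compute that. The function name and the example load of 6 app servers are assumptions, and it considers the loss of a single zone only.

```python
import math


def servers_per_zone(required_servers, zones):
    """Servers to run in each zone so that losing one zone still leaves
    at least required_servers running."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return math.ceil(required_servers / (zones - 1))


# Illustrative: peak load needs 6 app servers.
for zones in (2, 3):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

With these numbers, two zones need 12 servers in total while three zones need only 9, so spreading across more zones reduces the spare capacity you pay for.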
Cloud computing is a better way to run your business. The cloud helps companies of all sizes become more agile. Instead of running your applications yourself you can run them on the cloud where IT infrastructure is offered as a service like a utility. With the cloud, your company saves money: there are no up-front capital expenses as you don’t have to buy hardware for your projects. The massive scale and fast pace of innovation of the cloud drive the costs down for you. In the cloud, you pay only for what you use just like electricity.
The cloud can also help your company save time and improve agility – it’s faster to get started: you can build new environments in minutes as you don’t need to wait for new servers to arrive. The elastic nature of the cloud makes it easy to scale up and down as needed. At the end of the day you have more resources left for innovation which allows you to focus on projects that can really impact your businesses like building and deploying more applications.
“With the high growth nature of our business, we were looking for a cloud solution to enable us to scale fast. Think twice before buying your next server. Cloud computing is the way forward.” - Sami Lababidi, CTO, Playfish
AWS is useful for low-end traditional DR to high-end HA, but…
AWS encourages a rethinking of traditional DR / HA practices
• Everything in the cloud is “off-site” and (potentially) “multi-site”
• Using multiple sites (multiple AZs) comes largely for free
• Using multiple geographically-distributed sites (multiple Regions) is significantly cheaper and easier
• Tends to move the default design point away from “cold” Disaster Recovery toward “hot” High Availability
• Makes it easier to stack multiple mechanisms, e.g., Basic HA within one Region, DR site in second Region
• Cold DR (Most common... hours): Staged Server Configuration and generally no staged data. Bring up the servers and load the data to failover. Cold DR failover is typically manual.
• Warm DR (Recommended... <hour): Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated.
• Hot DR (Least common... but needed if <5 min): Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated.
• Hot HA: Live/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Failover to other region if one has problems. Hot HA is normally seamlessly automated.
Note: Other costs such as IOPS, volumes, other bandwidth, object storage, and snapshot storage are additional.