#1#
Safeguard Your Cloud Applications:
High Availability and Fault Tolerance
#2#
Agenda
• Terminology/Level-Setting
• Takeaways
• Cloud and Component Definitions
• Designing for Failure
• Architectural Options and Considerations
-High Availability
-Disaster Recovery
• Conclusions / Q&A
#3#
Faults?
• Facilities
• Hardware
• Networking
• Code
• People
#4#
What is “Fault-Tolerant”?
• Degrees of risk mitigation - not binary
• Automated
• Tested!
#5#
Old School Fault-Tolerance: Build Two
#6#
Cloud Computing Benefits
• No Up-Front Capital Expense
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
• Low Cost
• Deploy
#7#
Cloud Computing Fault-Tolerance Benefits
• No Up-Front HA Capital Expense
• Pay for DR Only When You Use it
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
• Low Cost Backups
• Deploy
#8#
AWS Cloud allows Overcast Redundancy
Have the shadow duplicate of your infrastructure ready to go when you need it…
…but only pay for what you actually use
#9#
Old Barriers to HA are now Surmountable
• Cost
• Complexity
• Expertise
#10#
AWS Building Blocks: Two Strategies
Services that are fault-tolerant with the right architecture:
• Amazon EC2
• Amazon Virtual Private Cloud (Amazon VPC)
• Amazon Elastic Block Store (EBS)
• Amazon Relational Database Service (Amazon RDS)
Inherently fault-tolerant services:
• Amazon S3
• Amazon SimpleDB
• Amazon DynamoDB
• Amazon CloudFront
• Amazon SWF
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon Route 53
• Elastic Load Balancing
• AWS Elastic Beanstalk
• Amazon ElastiCache
• Amazon Elastic MapReduce
• AWS Identity and Access Management (IAM)
#11#
The Stack:
Resources
Deployment
Management
Configuration
Networking
Facilities
Geographies
#12#
Terminology
• Fault tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
• Disaster recovery (DR): The processes, policies and procedures related to restoring critical systems after a catastrophic event. The goal is to get the application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
• High availability (HA): Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users.
#13#
Terminology - continued
• Recovery Time Objective (RTO): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
• Recovery Point Objective (RPO): Acceptable data loss as a result of recovering from a disaster/catastrophic event
RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
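The RTO/RPO tradeoff can be made concrete with a small calculation. The following is a minimal sketch, not part of the original deck: the `RecoveryPlan` fields and the sample numbers are hypothetical, and it assumes the worst-case data loss equals one full backup or replication interval.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of a DR plan; all values are in minutes."""
    backup_interval: float  # time between snapshots or replication syncs
    detection_time: float   # time to notice the outage and decide to fail over
    restore_time: float     # time to launch servers and bring the database online


def worst_case_rpo(plan: RecoveryPlan) -> float:
    # Data written just after the last backup is lost, so the worst case
    # is one full backup interval.
    return plan.backup_interval


def estimated_rto(plan: RecoveryPlan) -> float:
    # Service is restored once the outage is detected and recovery completes.
    return plan.detection_time + plan.restore_time


def meets_targets(plan: RecoveryPlan, rto_target: float, rpo_target: float) -> bool:
    return estimated_rto(plan) <= rto_target and worst_case_rpo(plan) <= rpo_target


# Hourly snapshots, 10 minutes to detect, 45 minutes to restore.
plan = RecoveryPlan(backup_interval=60, detection_time=10, restore_time=45)
print(estimated_rto(plan), worst_case_rpo(plan))
print(meets_targets(plan, rto_target=60, rpo_target=15))
```

With these sample numbers the plan fits a 60-minute RTO but not a 15-minute RPO; tightening the RPO would mean more frequent backups or continuous replication, which is where the cost tradeoff appears.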
#14#
Takeaways
• Understand core concepts behind HA and DR
• Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
• Best Practices for implementation of these architectural options within AWS (independent of RightScale)
• Multi-Availability Zone (AZ) and Multi-Region
• Architectural options and Considerations / pros and cons of these options
• Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
#15#
Regions & Availability Zones
• Zones within a region share a LAN (high bandwidth, low latency, private IP access)
• Zones utilize separate power sources, are physically segregated
• Regions are “islands”, and share no resources.
[Diagram: regions and their Availability Zones]
-Japan: Availability Zones A, B
-EU West Region: Availability Zones A, B
-US East Region: Availability Zones A, B, C
-US West Region: Availability Zones A, B
-Singapore: Availability Zones A, B
Source: AWS
#16#
Designing for Failure
• Large scale failures in the cloud are rare but do happen
• Application owners are ultimately responsible for
availability and recoverability
• Balance cost and complexity of HA efforts against
risk(s) you are willing to bear
• Cloud infrastructure has made DR and HA remarkably
affordable versus past options
-Multi-Server
-Multi-AZ (Availability Zone)
-Multi-Region
“Everything fails, all the time.”
Werner Vogels, CTO Amazon.com
#17#
Designing for Failure – Basic Concepts
• Fault tolerance is the goal. Degradation of service may occur,
but application continues to function.
• Avoid single points of failure (SPOF)
• Assume everything fails (remember Werner’s mantra) and
design accordingly
• Plan and practice your recovery process (both for HA and DR)
• Remember that better HA and DR equals more $$$. So find
that acceptable balance.
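Why removing a single point of failure helps can be shown with the standard availability formulas. This is a sketch under a strong assumption that the deck does not state: component failures are independent. The function names are illustrative.

```python
def series_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where any one copy is enough.

    Assumes the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


# A single 99% server versus two redundant 99% servers.
print(parallel_availability(0.99, 1))
print(parallel_availability(0.99, 2))

# Two tiers in series, each with a single 99% server (two SPOFs).
print(series_availability(0.99, 0.99))
```

Two independent 99% copies give roughly 99.99%, while chaining two single-server tiers drops below 99%. Real failures are often correlated (same zone, same bug), which is part of why the deck spreads components across zones and regions.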
#18#
High Availability
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)
Follow a few general best practices to absorb
application component outages…
#19#
General HA Best Practices
• Avoid single points of failure.
• Place each component (load balancers, app servers, databases) in at least two AZs.
• Replicate data across AZs (HA) and back up or replicate across regions for failover (DR)
• Set up monitoring, alerts and operations to identify problems and automate their resolution or the failover process.
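The "at least two AZs" rule above lends itself to an automated check. The sketch below works on a hypothetical inventory of `(tier, zone)` pairs; in practice that list would come from your cloud provider's API or management tool, which is not shown here.

```python
from collections import defaultdict


def tiers_missing_redundancy(servers, min_zones=2):
    """Return the tiers that run in fewer than `min_zones` availability zones.

    `servers` is an iterable of (tier, zone) pairs.
    """
    zones_by_tier = defaultdict(set)
    for tier, zone in servers:
        zones_by_tier[tier].add(zone)
    return sorted(
        tier for tier, zones in zones_by_tier.items() if len(zones) < min_zones
    )


# Hypothetical deployment: the database tier lives in a single zone.
inventory = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
]

print(tiers_missing_redundancy(inventory))
```

Running a check like this on a schedule, and alerting on a non-empty result, is one way to combine the redundancy and monitoring bullets.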
#20#
• High availability for top web properties
with 270M visitors/month
• Migration from datacenter to AWS
• RightScale provides
-Self-service access to developers
-Consistency and low maintenance
-Usage and cost accounting
-Multi-region architectures to avoid downtime
#21#
Multi-Zone HA
[Diagram labels: DNS; LOAD BALANCERS; APP SERVERS; AUTOSCALE; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; zones US-EAST 1a and US-EAST 1b; addresses 172.168.7.31 and 172.168.8.62]
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• Consider local storage for additional slave database to remove dependency on attached volume
• Consider distributed NoSQL databases with the same distribution considerations.
#22#
Disaster Recovery
DR presents a few new wrinkles compared to HA,
but there are multiple options depending on your
needs and budget…
Don’t sweat the small stuff.
And it’s all small stuff*
*(until it’s not)
#23#
HA/DR Checklist for Risk Mitigation
• Determine who owns the architecture, DR process and testing.
• Develop expertise in-house and / or get outside help.
• Conduct a risk assessment for each application.
• Specify your target RTO and RPO.
• Design for failure starting with application architecture. This
will help drive the infrastructure architecture.
#24#
HA/DR Checklist for Risk Mitigation
• Implement HA best practices balancing cost, complexity and
risk.
-Automate infrastructure for consistency and reliability.
• Document operational processes and automations.
• Test the failover... then test it again.
• Release the Chaos Monkey.
#25#
Multi-Region/Cloud DR Options
Option          Note                Cost   Downtime   Availability
Cold DR         (Most Common)       $      > 1 Hour   99%
Warm DR         (Recommended)       $$     < 1 Hour   99.5%
Hot DR          (Least Common)      $$$    < 5 Mins   99.9%
Multi-Cloud HA  (Live/Live Config)  $$$$   0          99.999%
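To put the availability percentages on this slide in perspective, they can be converted into a yearly downtime budget. This is the standard conversion, offered as a sketch; it is separate from the per-incident downtime figures on the slide, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year allowed by an availability percentage."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for pct in (99.0, 99.5, 99.9, 99.999):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct}% -> {minutes:.1f} min/year ({minutes / 60:.1f} hours)")
```

For example, 99% allows about 87.6 hours of downtime per year, while 99.999% allows only about 5.3 minutes.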
#26#
Multi-Region Cold DR
[Diagram labels: DNS; 172.168.7.31; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; regions US EAST and US WEST]
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
#27#
Multi-Region Warm DR
[Diagram labels: DNS; 172.168.7.31; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; regions US EAST and US WEST]
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
#28#
Multi-Region Hot DR
[Diagram labels: DNS; 172.168.7.31; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; regions US EAST and US WEST]
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
#29#
Hybrid HA
[Diagram labels: DNS; 172.168.7.31; 172.168.8.62; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS; S3; SWIFT; locations US-EAST and CHICAGO]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
#30#
Hybrid HA
[Diagram labels: DNS; 172.168.7.31; 172.168.8.62; LOAD BALANCERS; APP SERVERS; MASTER DB; SLAVE DB; REPLICATE; SNAPSHOTS; EBS VOLUME; S3; SWIFT; locations US-EAST and CHICAGO]
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security requires additional effort as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
#31#
• Procurement software
• SLAs with their customers require HA
• Subway chain is a customer that procures perishable goods
through Coupa
#32#
In the Dashboard
• Multi-region or cloud
• Multi-region Warm DR
• Staged servers
• Cost forecasting for DR environment
#33#
Automating HA and DR
• Use dynamic DNS for your database servers
-Allow app servers to use a single FQDN.
-Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
-App servers can connect to all load balancers automatically at launch
-No manual intervention
-No DNS modifications
• Automated promotion of slave to master
-Process is automated
-Decision to run process is manual
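The dynamic DNS and promotion points above can be sketched together. This is an illustration only: `DynamicDns` is a hypothetical in-memory stand-in for a real DNS provider, the host name and addresses are made up, and the database-side steps of a promotion (stopping replication, making the slave writable) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class DnsRecord:
    target: str       # address the FQDN currently points at
    ttl_seconds: int  # how long resolvers may cache the answer


class DynamicDns:
    """Hypothetical in-memory stand-in for a dynamic DNS service."""

    def __init__(self):
        self._records = {}

    def upsert(self, fqdn, target, ttl_seconds):
        self._records[fqdn] = DnsRecord(target, ttl_seconds)

    def lookup(self, fqdn):
        return self._records[fqdn]


def promote_slave(dns, master_fqdn, slave_address, operator_confirmed):
    """Repoint the master FQDN at the slave.

    The steps are automated, but they only run once a human has made the
    decision, mirroring the "decision is manual" point on the slide.
    """
    if not operator_confirmed:
        return False
    ttl = dns.lookup(master_fqdn).ttl_seconds
    dns.upsert(master_fqdn, slave_address, ttl)
    return True


dns = DynamicDns()
# App servers only ever know this one name; a low TTL keeps caches short-lived.
dns.upsert("master-db.example.com", "10.0.1.10", ttl_seconds=60)

promote_slave(dns, "master-db.example.com", "10.0.2.20", operator_confirmed=True)
print(dns.lookup("master-db.example.com"))
```

After the record changes, an app server may keep using the old address for up to one TTL, so in this sketch the DNS part of the cutover is bounded by roughly 60 seconds.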
#34#
How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and hybrid without modification
• RightImage: Stability across clouds
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.
[Diagram: a MultiCloud Image lists Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3. For a Server on Cloud A, RightImage 1 is used.]
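The three-step lookup described on this slide can be modeled as a simple mapping. The classes below are a hypothetical illustration of the idea, not the RightScale API: the class names, method names, and the "first MCI by default" rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class MultiCloudImage:
    """Maps each supported cloud to the machine image to use there."""
    name: str
    images_by_cloud: dict = field(default_factory=dict)

    def image_for(self, cloud):
        if cloud not in self.images_by_cloud:
            raise LookupError(f"{self.name} has no image for {cloud}")
        return self.images_by_cloud[cloud]


@dataclass
class ServerTemplate:
    """Step 1: a template carries a list of MultiCloud Images."""
    name: str
    mcis: list

    def launch_image(self, cloud, mci_name=None):
        # Step 2: pick a specific MCI (assumed default: the first in the list).
        chosen = self.mcis[0]
        if mci_name is not None:
            matches = [m for m in self.mcis if m.name == mci_name]
            if not matches:
                raise LookupError(f"no MCI named {mci_name}")
            chosen = matches[0]
        # Step 3: resolve the cloud-specific image at launch time.
        return chosen.image_for(cloud)


base = MultiCloudImage(
    "base",
    {"Cloud A": "RightImage 1", "Cloud B": "RightImage 2", "Cloud C": "RightImage 3"},
)
template = ServerTemplate("app-server", [base])
print(template.launch_image("Cloud A"))
```

The point of the indirection is that the same template definition can be launched in another cloud or region without edits; only the lookup result changes.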
#35#
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
#36#
DR Cost Comparison Example
              Multi-Region Cold DR        Multi-Region Warm DR        Multi-Region Hot DR
Total         $4480 / month               $5630 / month               $8800 / month
Running       $4470 / month               $5540 / month               $8440 / month
              3 Load Balancers (Large)    3 Load Balancers (Large)    6 Load Balancers (Large)
              6 App Servers (XLarge)      6 App Servers (XLarge)      12 App Servers (XLarge)
              1 Master DB (2XLarge)       1 Master DB (2XLarge)       1 Master DB (2XLarge)
              1 Slave DB (2XLarge)        2 Slave DB (2XLarge)        2 Slave DB (2XLarge)
Staged        $0 / month                  $0 / month                  -
              3 Load Balancers (Large)    3 Load Balancers (Large)
              6 App Servers (XLarge)      6 App Servers (XLarge)
              1 Slave DB (2XLarge)
Replication   $10 / month                 $90 / month                 $360 / month
              25GB / day cross-zone       25GB / day cross-region     100GB / day cross-region
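The totals in this comparison are the sum of the running, staged, and replication lines. The short check below reproduces that arithmetic from the slide's own figures; the only assumption is that Hot DR has a staged cost of $0, since the slide lists no staged servers for it.

```python
# Monthly costs in USD, taken from the comparison above.
scenarios = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # Assumption: no staged servers are listed for Hot DR, so staged = 0.
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(name):
    return sum(scenarios[name].values())


def premium_over(baseline, option):
    """Extra monthly cost of `option` compared with `baseline`."""
    return monthly_total(option) - monthly_total(baseline)


for name in scenarios:
    print(name, monthly_total(name))

print("Warm over Cold:", premium_over("Cold DR", "Warm DR"))
print("Hot over Cold:", premium_over("Cold DR", "Hot DR"))
```

On these figures Warm DR costs $1150 per month more than Cold DR, and Hot DR costs $4320 per month more, which matches the deck's framing of Warm DR as the lower-cost route to fairly rapid recovery.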
#37#
Outage-Proofing Best Practices
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones
• Design stateless apps for resilience to reboot / relaunch
• Back up across regions
• Monitor, alert, and automate operations to speed up failover
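"Maintain capacity to absorb zone failures" has a simple sizing rule behind it: if one zone disappears, the remaining zones must still carry peak load. The sketch below assumes servers are spread evenly across zones and that only one zone fails at a time; the function names and sample numbers are illustrative.

```python
import math


def servers_per_zone(servers_needed_at_peak, zones):
    """Servers to run in each zone so that losing one zone still covers peak."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return math.ceil(servers_needed_at_peak / (zones - 1))


def total_servers(servers_needed_at_peak, zones):
    return servers_per_zone(servers_needed_at_peak, zones) * zones


# Peak load needs 6 app servers.
print(total_servers(6, zones=2))
print(total_servers(6, zones=3))
```

With two zones the deployment needs 12 servers (100% headroom), while three zones need only 9 (50% headroom), so spreading across more zones lowers the cost of the spare capacity. Autoscaling can reduce the standing headroom further, at the price of a slower recovery while replacements launch.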
#38#
Resources and Q&A
AWS
Contact: aws.amazon.com/contact-us
RightScale
Try: RightScale Free Edition
www.rightscale.com/free
Contact:
Toll Free: 1.866.720.0208
Int’l: 1.805.855.0265

High Availability and Fault Tolerance: AWS + RightScale - RightScale Compute 2013

  • 1. #1 Safeguard Your Cloud Applications: High Availability and Fault Tolerance
  • 2. #2# Agenda • Terminology/Level-Setting • Takeaways • Cloud and Component Definitions • Designing for Failure • Architectural Options and Considerations High Availability Disaster Recovery • Conclusions / Q&A
  • 3. #3# Faults? • Facilities • Hardware • Networking • Code • People
  • 4. #4# What is “Fault-Tolerant”? • Degrees of risk mitigation - not binary • Automated • Tested!
  • 6. #6# No Up-Front Capital Expense Pay Only for What You Use Self-Service Infrastructure Easily Scale Up and Down Improve Agility & Time-to-Market Low Cost Cloud Computing Benefits Deploy
  • 7. #7# No Up-Front HA Capital Expense Pay for DR Only When You Use it Self-Service DR Infrastructure Easily Deliver Fault- Tolerant Applications Improve Agility & Time-to-Recovery Low Cost Backups Cloud Computing Fault-Tolerance Benefits Deploy
  • 8. #8# AWS Cloud allows Overcast Redundancy Have the shadow duplicate of your infrastructure ready to go when you need it… …but only pay for what you actually use
  • 9. #9# Old Barriers to HA are now Surmountable • Cost • Complexity • Expertise
  • 10. #10# AWS Building Blocks: Two Strategies Inherently fault- tolerant services Services that are fault-tolerant with the right architecture Amazon EC2 Amazon Virtual Private Cloud (Amazon VPC) Amazon Elastic Block Store (EBS) Amazon Relational Database Service (Amazon RDS) Amazon S3 Amazon SimpleDB Amazon DynamoDB Amazon CloudFront Amazon SWF Amazon SQS Amazon SNS Amazon SES Amazon Route 53 Elastic Load Balancing AWS Elastic Beanstalk Amazon ElastiCache Amazon Elastic MapReduce AWS Identity and Access Management (IAM)
  • 12. #12# Terminology Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fails. The process, policies and procedures related to restoring critical systems after a catastrophic event. Goal is to get application back up and running within a defined time period (RTO) and within a certain data loss window (RPO). Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.
  • 13. #13# Terminology - continued. Recovery Time Objective (RTO): time period in which service must be restored to meet BCP (Business Continuity Planning) objectives. Recovery Point Objective (RPO): acceptable data loss as a result of recovering from a disaster/catastrophic event. RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground.
  • 14. #14# Takeaways • Understand core concepts behind HA and DR • Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures • Best Practices for implementation of these architectural options within AWS (independent of RightScale) • Multi-Availability Zone (AZ) and Multi-Region • Architectural options and Considerations / pros and cons of these options • Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
  • 15. #15# Regions & Availability Zones • Zones within a region share a LAN (high bandwidth, low latency, private IP access) • Zones utilize separate power sources, are physically segregated • Regions are “islands”, and share no resources. [Diagram (source: AWS): Japan (Availability Zones A, B); EU West Region (A, B); US East Region (A, B, C); US West Region (A, B); Singapore (A, B)]
  • 16. #16# Designing for Failure • Large scale failures in the cloud are rare but do happen • Application owners are ultimately responsible for availability and recoverability • Balance cost and complexity of HA efforts against risk(s) you are willing to bear • Cloud infrastructure has made DR and HA remarkably affordable versus past options -Multi-Server -Multi-AZ (Availability Zone) -Multi-Region “Everything fails, all the time.” Werner Vogels, CTO Amazon.com
  • 17. #17# Designing for Failure – Basic Concepts • Fault tolerance is the goal. Degradation of service may occur, but application continues to function. • Avoid single points of failure (SPOF) • Assume everything fails (remember Werner’s mantra) and design accordingly • Plan and practice your recovery process (both for HA and DR) • Remember that better HA and DR equals more $$$. So find that acceptable balance.
  • 18. #18# High Availability Don’t sweat the small stuff. And it’s all small stuff* *(until it’s not) Follow a few general best practices to absorb application component outages…
  • 19. #19# General HA Best Practices • Avoid single points of failure. • Always place one of each component (load balancers, app servers, databases) in at least two AZs. • Replicate data across AZs (HA) and back up or replicate across regions for failover (DR). • Set up monitoring, alerts and operations to identify problems and automate their resolution or the failover process.
  • 20. #20# • High availability for top web properties with 270M visitors/month • Migration from datacenter to AWS • RightScale provides -Self-service access to developers -Consistency and low maintenance -Usage and cost accounting -Multi-region architectures to avoid downtime
  • 21. #21# Multi-Zone HA: Snapshot data volume for backups so the database can be readily recovered within the region. Place slave databases in one or more zones for failover. Consider local storage for an additional slave database to remove dependency on attached volume. Consider distributed NoSQL databases with the same distribution considerations. [Diagram labels: DNS, load balancers, app servers (autoscale), master DB, slave DB, replicate, snapshots, EBS, S3, 172.168.7.31, 172.168.8.62; zones US-EAST 1a and US-EAST 1b]
  • 22. #22# Disaster Recovery DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget… Don’t sweat the small stuff. And it’s all small stuff* *(until it’s not)
  • 23. #23# HA/DR Checklist for Risk Mitigation • Determine who owns the architecture, DR process and testing. • Develop expertise in-house and / or get outside help. • Conduct a risk assessment for each application. • Specify your target RTO and RPO. • Design for failure starting with application architecture. This will help drive the infrastructure architecture.
  • 24. #24# HA/DR Checklist for Risk Mitigation • Implement HA best practices balancing cost, complexity and risk. -Automate infrastructure for consistency and reliability. • Document operational processes and automations. • Test the failover... then test it again. • Release the Chaos Monkey.
  • 25. #25# Multi-Region/Cloud DR Options: Cold DR (Most Common): cost $, downtime > 1 Hour, availability 99%. Warm DR (Recommended): cost $$, downtime < 1 Hour, availability 99.5%. Hot DR (Least Common): cost $$$, downtime < 5 Mins, availability 99.9%. Multi-Cloud HA (Live/Live Config): cost $$$$, downtime 0, availability 99.999%.
  • 26. #26# Multi-Region Cold DR: Staged Server Configuration and generally no staged data • Not recommended if rapid recovery is required • Slow to replicate data to other cloud and bring database online [Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, 172.168.7.31; regions US East and US West]
  • 27. #27# Multi-Region Warm DR: Staged Server Configuration, pre-staged data and running Slave Database Server • Generally recommended DR solution • Minimal additional cost and allows fairly rapid recovery [Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, 172.168.7.31; regions US East and US West]
  • 28. #28# Multi-Region Hot DR: Parallel Deployment with all servers running but all traffic going to primary • Not recommended • Very high additional cost to allow rapid recovery [Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, 172.168.7.31; regions US East and US West]
  • 29. #29# Hybrid HA: Live/Live configuration. Geo-target IP services to direct traffic to regional LBs. • Possible, but not recommended (more to follow…) • Max additional cost and max availability, but complex to implement and manage [Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS, S3, SWIFT, 172.168.7.31, 172.168.8.62; sites US-EAST and CHICAGO]
  • 30. #30# Hybrid HA: Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared. You need DNS management or a global load balancer. Security requires additional effort as security groups are Region-specific. Machine Images are specific to the cloud/region. [Diagram labels: DNS, load balancers, app servers, master DB, slave DB, replicate, snapshots, EBS volume, S3, SWIFT, 172.168.7.31, 172.168.8.62; sites US-EAST and CHICAGO]
  • 31. #31# • Procurement software • SLA to their customers requires HA • Subway chain is a customer that procures perishable goods through Coupa
  • 32. #32# In the Dashboard Multi-region or cloud Multi-region Warm DR Staged servers Cost forecasting for DR environment
  • 33. #33# Automating HA and DR • Use dynamic DNS for your database servers: allow app servers to use a single FQDN; use a low TTL to allow rapid failover in the case of a change in master database. • Automatic connection of app servers to load balancing servers: app servers can connect to all load balancers automatically at launch; no manual intervention; no DNS modifications. • Automated promotion of slave to master: process is automated; decision to run process is manual.
  • 34. #34# How RightScale makes it possible: MultiCloud Images • MultiCloud Images can be launched across regions and hybrid without modification. 1. ServerTemplate contains a list of MultiCloud Images (MCIs): Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3. 2. When the Server is created, a specific MCI is chosen (Cloud A, RightImage 1). 3. The appropriate RightImage is used at launch (Cloud A Image 1). RightImage: stability across clouds.
  • 35. #35# How RightScale makes it possible ServerTemplates, Tags, and Inputs • Automated load balancer registration and database connections • Autoscaling across zones • Dynamic configuration
  • 36. #36# DR Cost Comparison Example. Multi-Region Cold DR: Total $4480 / month. Running $4470 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)). Staged $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge)). Replication $10 / month (25GB / day cross-zone). Multi-Region Warm DR: Total $5630 / month. Running $5540 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)). Staged $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge)). Replication $90 / month (25GB / day cross-region). Multi-Region Hot DR: Total $8800 / month. Running $8440 / month (6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)). Replication $360 / month (100GB / day cross-region).
  • 37. #37# Outage-Proofing Best Practices • Place in >1 zone: load balancers, app servers, databases • Maintain capacity to absorb zone or region failures • Replicate data across zones • Design stateless apps for resilience to reboot / relaunch • Back up across regions • Monitor, alert, and automate operations to speed up failover
  • 38. #38# Resources and Q&A. AWS: Contact: aws.amazon.com/contact-us. RightScale: Try: RightScale Free Edition www.rightscale.com/free. Contact: Toll Free: 1.866.720.0208, Int’l: 1.805.855.0265
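
The availability percentages on slide 25 (99.999%, 99.9%, 99.5%, 99%) are easier to compare once converted into the downtime they allow. The sketch below uses the standard conversion and assumes a 365-day year; the slides do not state a measurement window, and the function name is illustrative rather than anything from the deck. The results are annual budgets, which is a different measure from the per-incident recovery times shown on the slide.

```python
# Convert an availability target into the downtime it permits per year.
# Assumption (not from the slides): availability is measured over a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Availability figures as listed on slide 25.
    for target in (99.999, 99.9, 99.5, 99.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.1f} h/year)")
```

Under that assumption, 99.999% corresponds to roughly 5.3 minutes of downtime per year, while 99.9% allows about 8.8 hours.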

Editor's notes

  1. Cloud computing is a better way to run your business. The cloud helps companies of all sizes become more agile. Instead of running your applications yourself, you can run them on the cloud, where IT infrastructure is offered as a service like a utility. With the cloud, your company saves money: there are no up-front capital expenses as you don’t have to buy hardware for your projects. The massive scale and fast pace of innovation of the cloud drive the costs down for you. In the cloud, you pay only for what you use, just like electricity. The cloud can also help your company save time and improve agility: it’s faster to get started, and you can build new environments in minutes as you don’t need to wait for new servers to arrive. The elastic nature of the cloud makes it easy to scale up and down as needed. At the end of the day you have more resources left for innovation, which allows you to focus on projects that can really impact your business, like building and deploying more applications. “With the high growth nature of our business, we were looking for a cloud solution to enable us to scale fast. Think twice before buying your next server. Cloud computing is the way forward.” - Sami Lababidi, CTO, Playfish
  2. AWS is useful for low-end traditional DR to high-end HA, but… AWS encourages a rethinking of traditional DR / HA practices. Everything in the cloud is “off-site” and (potentially) “multi-site”. Using multiple sites (multiple AZs) comes largely for free. Using multiple geographically-distributed sites (multiple Regions) is significantly cheaper and easier. Tends to move the default design point away from “cold” Disaster Recovery toward “hot” High Availability. Makes it easier to stack multiple mechanisms, e.g., Basic HA within one Region, DR site in second Region.
  3. Cold DR (Most common... hours): Staged Server Configuration and generally no staged data. Bring up the servers and load the data to fail over. Cold DR failover is typically manual. Warm DR (Recommended... >hour): Staged Server Configuration, pre-staged data and running Database Slave Server. Warm DR failover is typically manual but can be automated. Hot DR (Least common... but needed if <5 min): Parallel Deployment with all servers running but all traffic going to primary. Hot DR failover is normally automated. Hot HA: Live/Live configuration. May use Geo-target IP services to direct traffic to regional load balancers. Failover to other region if one has problems. Hot HA is normally seamlessly automated.
  4. Note: Other costs such as IOPS, volumes, other bandwidth, object storage, and snapshot storage are additional.
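
The monthly figures on slide 36 can be cross-checked by summing the running, staged, and replication lines for each option. The sketch below does that with the numbers as listed on the slide; the dictionary layout and function names are illustrative assumptions, and Hot DR is given a staged cost of 0 because the slide lists no staged servers for it. The additional costs mentioned in note 4 are not included.

```python
from typing import Dict

# Monthly cost figures (USD) as listed on slide 36.
# Assumption: Hot DR has no staged line on the slide, so it is recorded as 0 here.
DR_COSTS: Dict[str, Dict[str, int]] = {
    "Multi-Region Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Multi-Region Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    "Multi-Region Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}


def monthly_total(costs: Dict[str, int]) -> int:
    """Sum the running, staged, and replication lines for one DR option."""
    return sum(costs.values())


def premium_over(baseline: str, option: str) -> int:
    """Return how much more per month `option` costs than `baseline`."""
    return monthly_total(DR_COSTS[option]) - monthly_total(DR_COSTS[baseline])


if __name__ == "__main__":
    for name, costs in DR_COSTS.items():
        print(f"{name}: ${monthly_total(costs)} / month")
    print("Warm over Cold:", premium_over("Multi-Region Cold DR", "Multi-Region Warm DR"))
    print("Hot over Cold:", premium_over("Multi-Region Cold DR", "Multi-Region Hot DR"))
```

The sums reproduce the slide's totals of $4480, $5630, and $8800 per month, which puts Warm DR $1150 per month above Cold DR in this example and Hot DR $4320 above it.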