Scalability and Reliability in the Cloud

HIGH SCALABILITY AND RELIABILITY IN THE CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT

@gmthomp greg.thompson@alcatel-lucent.com

About This Session
 Target audience is backend application
developers deploying infrastructure into a
cloud environment
 Will cover concepts for scalability and
reliability with the goal of helping application
developers understand some key
considerations when designing and building
the backend.

Design Time Decisions
 When first building your application backend,
consider a few important questions
 How fast should the application be recovered if a
failure occurs?
 What kind of down time is acceptable?
 Is the application maintaining stateful data?
 What kind of information needs to be shared across
multiple instances?

What is Scalability?
 Scalability describes how the application will handle increased traffic volume

Scalability – Factors to Consider
 Horizontal vs. Vertical
 Stateless vs. Stateful
 Understanding Limitations
 Connection Management
 Segmentation of traffic
 Segmentation of responsibility (distributed arch)
 Clustering
 Messaging

What Type of Scalability?
Vertical vs. Horizontal
Vertical
 Scaling up a single node
 Physical limitations: instances are very powerful but still have finite limits
 Resources such as number of sockets can only go so high
Horizontal
 Scaling out across multiple nodes
 Ability to distribute traffic over a number of nodes
 Allows for more flexibility over time

Will the App Maintain State?
Stateless Applications
 Application does not persist information about transactions
 Each transaction is independent and atomic
[Diagram: Request → Application → Response]

Will the App Maintain State?
Stateful Applications
 Application needs to maintain data about transactions in progress
 Requires storage
 Persistence may also be required depending on the …
[Diagram: First Request and Subsequent Request → Application → DB]
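A minimal sketch of the difference, assuming a hypothetical `SharedStore` that stands in for an external database or cache (it is an in-memory dictionary here, purely for illustration). The stateless handler needs nothing but the request. The stateful handler keeps its data outside the instance, so a subsequent request can land on a different instance and still see the state.

```python
from typing import Dict, Optional


class SharedStore:
    """Hypothetical stand-in for an external store shared by all instances."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


def handle_stateless(text: str) -> str:
    """Stateless: the response depends only on the request itself."""
    return text.upper()


class AppInstance:
    """One application node; all state lives in the shared store."""

    def __init__(self, store: SharedStore) -> None:
        self.store = store

    def handle_stateful(self, session_id: str) -> int:
        """Stateful: count the requests seen so far for this session."""
        count = (self.store.get(session_id) or 0) + 1
        self.store.put(session_id, count)
        return count


if __name__ == "__main__":
    store = SharedStore()
    node_a, node_b = AppInstance(store), AppInstance(store)
    node_a.handle_stateful("session-1")  # first request, handled by node A
    # The subsequent request goes to node B and still sees the earlier state.
    print(node_b.handle_stateful("session-1"))
```

Keeping the state in a store outside the instance is one way to let a stateful application scale horizontally; the trade-off is that the store itself must then be scaled and made reliable.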

Understanding Limitations
 Thorough testing is key to understanding bottlenecks
 Test real-world scenarios, including latency
 Push the system to the max to understand how it …

Connection Management
Mobile Device Connections
 Mobile devices don’t always behave like you expect
 Connectivity is often very dynamic
 Devices move between 4G/3G/2G/no G/Wi-Fi
 Not all TCP events will get reported, and sockets can remain open
 If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
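One common way to defuse this is an application-level idle timeout: record when each connection was last heard from and close the ones that have gone silent, rather than waiting for a TCP event that may never arrive. The sketch below is an illustration only. The `ConnectionTracker` class, its method names, and the 30-second timeout are assumptions, and actually closing the socket is left to the caller.

```python
import time
from typing import Callable, Dict, List


class ConnectionTracker:
    """Tracks last activity per connection and reports idle ones (hypothetical helper)."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def touch(self, conn_id: str) -> None:
        """Call whenever data or a heartbeat arrives on the connection."""
        self._last_seen[conn_id] = self._clock()

    def reap_idle(self) -> List[str]:
        """Forget connections idle for longer than the timeout and return their ids."""
        now = self._clock()
        stale = [conn_id for conn_id, seen in self._last_seen.items()
                 if now - seen > self.idle_timeout_s]
        for conn_id in stale:
            del self._last_seen[conn_id]
        return stale


if __name__ == "__main__":
    tracker = ConnectionTracker(idle_timeout_s=30.0)
    tracker.touch("device-1")
    # In a real server this would run periodically, and the caller would
    # close the socket behind every id returned here.
    print(tracker.reap_idle())
```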

Segmenting Traffic
 Once the application is
able to be scaled out,
traffic can be
segmented in different
ways
 Location (e.g. east coast vs. west coast)
 Pre-assigned criteria -
User ID, IP, or other
dynamic criteria
 Load Balanced
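As an illustration of segmenting by pre-assigned criteria, the sketch below maps a user ID to a node with a stable hash, so the same user always reaches the same node. The function name and node names are hypothetical. Note that simple modulo hashing reassigns most users whenever the number of nodes changes; consistent hashing is a common refinement if that matters.

```python
import hashlib
from typing import Sequence


def pick_node(user_id: str, nodes: Sequence[str]) -> str:
    """Return the node responsible for this user (stable for a fixed node list)."""
    if not nodes:
        raise ValueError("at least one node is required")
    # Use a cryptographic digest rather than hash(), which Python randomizes
    # per process and would therefore route differently on each instance.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    for user in ("user-1", "user-2", "user-3"):
        print(user, "->", pick_node(user, nodes))
```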

Segmenting Responsibility
 Segmenting
responsibility allows for
a distributed
architecture
 Each component can be
scaled independently
 Allows for more flexibility
in scaling
 Adds more complexity
and potential messaging
overhead

Clustering
 Clustering is the concept of having a group of nodes working together to provide the same capability
 Nodes typically co-located
 Common data shared as needed across the cluster
 Communication may be needed between nodes
[Diagram: four App Nodes connected to Shared Data]

Messaging
 Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes
 Types of Messaging
 JMS
 Open Source MQ packages
 Custom Designed
 Use of APIs
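The idea of components talking through messages rather than direct calls can be sketched in a single process. Here Python's standard `queue.Queue` stands in for a real message broker, which is an assumption made for illustration; the function name and the uppercase "processing" step are hypothetical. The producer only knows about the queue, not about the worker that consumes from it.

```python
import queue
import threading
from typing import Iterable, List


def run_pipeline(items: Iterable[str]) -> List[str]:
    """Send items through a queue to a worker thread and collect its results."""
    messages: "queue.Queue[object]" = queue.Queue()
    results: List[str] = []
    stop = object()  # sentinel telling the worker to shut down

    def worker() -> None:
        while True:
            msg = messages.get()
            if msg is stop:
                break
            results.append(str(msg).upper())  # placeholder for real processing

    consumer = threading.Thread(target=worker)
    consumer.start()
    for item in items:  # producer side: publish and move on
        messages.put(item)
    messages.put(stop)
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["order created", "order paid"]))
```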

Example of Scaled Architecture
[Diagram: two sites (Site 1 and Site 2), each showing Load Balancers, Web Servers, Component 1 and Component 2 instances, and a Database]

What is Reliability/Availability?
 Availability is typically
measured by the amount of
downtime your application
has in a given year
 Unplanned downtime and
planned downtime are both
considered
 Reliability is described by the
likelihood of failure based on
actual measurements
 We’ll focus more on
Availability
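To make the downtime-per-year measure concrete, the arithmetic can be sketched as follows. The function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime budget per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes: float) -> float:
    """Availability achieved given total (planned plus unplanned) downtime in a year."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

A 99.9% target, for example, leaves roughly 525.6 minutes (under nine hours) per year for planned and unplanned downtime combined.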

Reliability/Availability
Factors to Consider
 Cost vs. Need
 Problem detection
 Automation for recovery
 Active/standby, active/active, hot standby vs. cold
standby
 Local and Geo-redundancy
 Multi-zone, multi-cloud
 Test Until You Break the System

Reliability Requirements
Cost Considerations
 Number of instances
 Bandwidth requirements between sites
 Complexity of software
 Monitoring
Need
 User Experience
 Customer requirements
 Negative Publicity

Problem Detection
 Effective monitoring of
the application is key to
minimizing downtime
 Event reporting in the
software
 External monitoring –
test for successful
behavior
 Auto detection and
alerting to minimize cost
of operations personnel

Automation for Recovery
 Faster recovery of a failed component improves reliability
 Automatic detection and automatic recovery
 Automated installation is key for minimizing setup time during recovery

Availability Models
 N = number of nodes required for normal processing
 N+1 = one additional node to provide redundancy in case of failure
 N+K = K nodes provide additional redundancy
[Diagram: rows of nodes for N (N, N), N+1 (N, N, +1), and N+K (N, N, K, K)]
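The effect of adding spare nodes can be estimated with a simple model. The sketch below assumes that every node has the same availability and that nodes fail independently of one another, which real deployments often violate (shared power, network, or software faults). Under that assumption, the service is up whenever at least N of the N+K nodes are up.

```python
from math import comb


def cluster_availability(n: int, k: int, node_availability: float) -> float:
    """Probability that at least n of n+k nodes are up (independent failures assumed)."""
    total = n + k
    p = node_availability
    return sum(
        comb(total, up) * p**up * (1 - p) ** (total - up)
        for up in range(n, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical example: 2 nodes needed, each node 99% available.
    for spares in (0, 1, 2):
        print(f"N=2, K={spares}: {cluster_availability(2, spares, 0.99):.6f}")
```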

Redundancy Models
 Active/Cold Standby
 Backup site is booted up when needed
 Active/Hot Standby
 Backup site is running and ready to take over
 Active/Active
 Both sites active and processing traffic

Local and Geo-Redundancy
Local
 Backup instances are available within the same location
 Use of availability zones within a region is very similar
Geographic
 Backup instances are available in another geographic location
 Typically in a separate region to account for events such as natural disasters

Availability to the Max
Multi-Zone/Multi-Region
 Multi-zone typically provides instances running in different physical locations, but in the same region
 Multi-region provides different geographic regions of availability
Multi-Cloud
 If your application requires the maximum possible availability
 Run in different cloud providers in different regions

Test Until You Break the System
 Push the system to
the max and observe
the breaking points
 Fix the problem,
repeat
 The best way to find problems and prevent unplanned downtime is to test thoroughly with a mindset to break the system

THANK YOU!
Greg Thompson
@gmthomps
greg.thompson@alcatel-lucent.com
