Disaster recovery and business continuity planning are processes which help organizations prepare for disruptive events. The talk explains the basic concepts of business continuity, giving a brief overview on the business continuity plan and more detail informations (technical) on how to setup a Disaster Recovery site . We show two different approaches for creating a disaster recovery (DR) site, one the based on operating system layer and one based on the right design of the applications . The common elements on the two approaches are network design, data replication, monitoring system and system/configuration management. All these elements can be implemented with open source software, we explain advantages and disadvantages, performances and layouts on each solutions.
1. Create a backup site with opensource
Fabrizio Manfredi Furuholmen
Beolink.org
2. Agenda Beolink.org
Introduction (Part I)
Disaster Recovery Plan
Business continuity Plan
Disaster Recovery
Components (Part II)
Starting point
Software list
Design Pattern
Case Studies (Part III)
2
8/23/09
3. Disaster Statistics & Potential
Implications
Beolink.org
90% of businesses that lose data from a disaster are forced to close
down within 2 years since the disaster.
80% of businesses without a well structured recovery plan are
forced to close down within 12 months since the flood or fire.
43% of companies experiencing disasters never recover.
50% of companies experiencing a computer outage will be forced to
close down within 5 years
Companies experiencing a computer outage lasting longer than 10
days will never recover its full financial capacity.
Less than 50% of all organizations in the UK have a business
continuity plan or disaster recovery plan.
One out of 500 data centers experience a severe disaster every
year.
3
8/23/09
4. What is a Disaster Recovery Plan ? Beolink.org
Disaster recovery plan (DRP)
Disaster recovery plan
consists of the precautions
taken so that the effects of a
disaster will be minimized and
the organization will be able to
either maintain or quickly
resume mission critical
functions
4
8/23/09
5. Classification of Risk Beolink.org
Natural disasters
Such as tornadoes, floods, blizzards, earthquakes and
fire
Accidents
Sabotage
Power and energy disruptions
Service sector failure
Such as communications, transportation …
Environmental disasters
Such as pollution and hazardous materials spills
Cyber attacks and hacker activity
5
8/23/09
6. BIA
Beolink.org
How many transactions can be lost without
significantly impacting revenue or productivity?
Does a majority of the businesses depend upon
one or more mission critical applications?
How much revenue per hour would be lost
when these critical applications remain
unavailable?
Are there periods of time, for example the end
of fiscal quarters or holiday seasons, when an
outage would cause a greater disruption?
How will productivity be affected if critical
applications become unavailable?
In the event of an unexpected outage, how will
your partners, vendors and customers be
affected ?
Historically, what has been the total cost of lost
productivity and revenue during downtime?
6
8/23/09
7. Business Continuity Plan
Beolink.org
Disaster recovery planning (DRP)
is a subset of a larger process
known as Business Continuity
Plan.
Business Continuity Plan (BCP)
A business continuity plan
enables critical services or
products to be continually
delivered to clients.
7
8/23/09
8. Technology problem ?
Beolink.org
Understand your business
Organizational Strategy
Business Impacy Analysis
Risk Assessment & Control
BCM Strategies
Organization BCM Strategy
Process Level BCM Strategy
Resource Recovery BCM Strategy
Developing & Implementing BCM Response
Crisis Management
Business Continuity Plans
Business Unit Plan
Developing BCM Culture
Assessing
Designing and Delivery
Measuring Results
Exercise, Maintain & Audit
Exercising
BCM Controls
BS 25999 compliance
8
8/23/09
9. Disaster recovery starting point
Beolink.org
Recovery point objective (RPO)
Maximum data you can loose
Recovery time objective (RTO)
The amount of time you need to restart from
the RPO
9
8/23/09
10. Basic Steps
Beolink.org
ystem Duplication
S
ardware
H
etwork
N
perating System
O
pplication
A
ata Replication
D
ilesystem
F
atabase
D
pplication Data
A
witch Procedure
S
Activation
ollback
R
ow to switch back
H
ow to merge data
H
10
8/23/09
11. Disaster Recovery Components
Beolink.org
Elements
Routing
Application
Data
Infrastructures
Working mode
Off-line
Near on-line
Online
11
8/23/09
12. Standby
Beolink.org
Type OS Database Application Several
minutes
Off Line
OFF
OFF
OFF
Near ON
OFF/ON
OFF
Online
Online
ON
ON
ON
Low Cost High Cost
Keep your DR site Zero
unreachable until switch
12
8/23/09
16. Data Layer
Beolink.org
Off-Line Days
• Amanda, Bacula, tar : Tape
incremenetal, os dump Backup
Periodic Hours
Replication
Near on-line
• rsync: compression,
incremental
Asynchronous Mins
Replication
On Line
• GlousterFS, openAFS, DB
Replication
Synchronous Secs
Replication
16
8/23/09
17. Data Layer: Amanda
Beolink.org
Compression Amanda.conf:
inparallel 4 # maximum dumpers that will run in parallel (max 63)
Client Fast/Best # this maximum can be increased at compile-time,
Server Fast/Best # modifying MAX_DUMPERS in server-src/driverio.h
Multi Stream: netusage 600 Kbps
se
# maximum net bandwidth for Amanda, in KB per
Number of parrells dump …
Diskbase define dumptype comp-high {
global
Holding disk comment "very important partitions on fast machines"
changer file base compress client best
kencrypt no
Bandwith
priority high
}
Global
Per interface define interface lo0 {
comment "a local disk"
Encryption }
use 1000 kbps
SSH
KRB5 DiskList:
Dbserver /dumps comp-user-tar
…
17
8/23/09
18. Data Layer: rsync
Beolink.org
Diffs
Only actual changed pieces of files are
transferred, rather than the whole file.
Compression
The tiny pieces of diffs are then compressed on
the fly, further saving you file transfer time and
reducing the load on the network.
rsync --verbose --progress --stats
Encryption
--compress
The stream from rsync is passed through the
--rsh=/usr/local/bin/ssh
ssh protocol to encrypt your session.
--recursive
--times
Latancy
--perms
Pipelining of file transfers to minimize latency
--links
costs
--delete
--exclude "*bak"
--exclude "*~”
Simple Usage
/www/* webserver:path_name
18
8/23/09
19. Data Layer: GlusterFS
Beolink.org
Features
Automatic Replication
Aggregation
Scalable Striping
Distributed Locking
Performance Modules
Pluggable I/O Scheduler
Pluggable Transport
Pluggable Auth
Administration
NFS-like Backend
Self-healing
Staking Vol/modules
And much more ..
19
8/23/09
20. Data Layer: GlusterFS
Beolink.org
Replication volume posix
volume remote1
type protocol/client
Client side
type storage/posix
option transport-type tcp
Server side
option directory /data/export
option remote-host
end-volume
storage1.example.com
option remote-subvolume brick
Raid volume locks
end-volume
Raid 1
type features/locks
volume remote2
subvolumes posix
Stripe …
end-volume
end-volume
Raid 10
Volume brick
volume replicate
Performance
type performance/io-threads
type cluster/replicate
subvolumes remote1 remote2
I/O modules
option thread-count 8
Cache modules
subvolumes locks
end-volume
end-volume
volume writebehind
type performance/write-behind
volume server
option window-size 1MB
type protocol/server
subvolumes replicate
option transport-type tcp
end-volume
option auth.addr.brick.allow *
subvolumes brick
volume cache
type performance/io-cache
end-volume
option cache-size 512MB
subvolumes writebehind
end-volume
20
8/23/09
21. Data Layer: Database
Beolink.org
Builtin
Mysql Master – Slave
Mysql Multi-Master
External
Tungsten Replicator
Slony-I (postgresql)
21
8/23/09
22. Application Layer
Beolink.org
Data replication
Cluster JDBC
Group messaging (spread,
jgroup, ..)
Custom protocol
Distributed Application
Design
Data must be Unique
Primary key unique for all
sites (hash, mixing key ..)
22
8/23/09
23. Application Layer: spread
Beolink.org
Spread
functions as a unified message bus for
distributed applications, and provides highly
tuned application-level multicast, group
communication, and point to point support.
Reliable and scalable messaging and group communication
Easy to use, deploy and maintain.
Highly scalable from one local area network to complex wide
area networks.
Supports thousands of groups with different sets of
members.
Enables message reliability in the presence of machine
failures, process crashes and recoveries, and network
partitions and merges.
Provides a range of reliability, ordering and stability
guarantees for messages.
.Completely distributed algorithms with no central point of
failure.
23
8/23/09
24. Application Layer: Example
Beolink.org
Session
Distributed file system
Replicated Database
AppServer Cluster
24
8/23/09
25. Routing Layer
Beolink.org
DNS: Bind Several
minutes
keep low value of TTL
dynamic update
MPLS
Change MPLS configuration, keep
the same ip configuration
Routing Table : quagga
OSPF/BGP announce new routing
for your side
Seconds
25
8/23/09
26. Routing Layer: Quagga
Beolink.org
Quagga
CISCO IOS syntax
All routing protocol
Device base interface
Quagga.conf:
interface eth0
ip address XXX.XXX.XXX.XXX
!
! Work only for ! …
Ospf.conf
Autonomous System
Large network router ospf
ospf router-id XXX.XXX.XXX.XX
log-adjacency-changes
compatible rfc1583
auto-cost reference-bandwidth 10000
network XXX.XXX.XXX.XXX/XX area …
26
8/23/09
27. Routing Layer: Heart
Beolink.org
Monitoring System (outside)
Probe/check external
Remote Invocation scripts
Manual/Automatic
Linux HA (inside)
Cluster definition with Quorum
Server
Local Scripts / Fencing script
Manual/Automatic
27
8/23/09
28. Important Things
Beolink.org
Business How the service
Which order to
critical is linked to
do things
Application
others
Keep close
When to run When to run
disaster
them (2)
them (1)
recovery site
Where have
you put
employees ?
28
8/23/09
29. Case Studies
Beolink.org
Case Studies
29
8/23/09
30. Case study
Beolink.org
Company type: Heavy industry
Location: Single production site
Web Farm/IT: Inside the production site
APPS: Order management,
production control
RPO/RTO: 2h
DR site: distance 400 km
Cost: a Lot
Disaster Event: Flooding of the production site
GOOD Solution ?
30
8/23/09
31. Case study
Beolink.org
What happened ?
Perfect plan, everything OK
But…
Production restart after 6 months …
WRONG SOLUTION
Disaster not related to Business
Misunderstanding between BCP and DRP
IT Manager was Fired
31
8/23/09
32. Base Solution: small-middle companies
Beolink.org
Switch by hand
Configuration
Restore
Starting point
RTO/RPO Hours
Linear to storage
dimension
Synchronization
Backup
No more licenses
needed
32
8/23/09
33. Medium/High Solution: Enterprise
Beolink.org
Automatic Switch
IP migration (one IP
per service)
Change Firewall role
RTO/RPO seconds
Synchronization
Multi-master FS
Database Replication
Configuration
Management
Double licenses
needed 33
8/23/09
34. Conclusion
Beolink.org
“Everything
Only Business Right RTO/
right now” is Keep it Simple
Application
RPO
hard
Are you sure
Manual the data is on
Latency
Bandwidth
rollback
the
filesystem ?
Donʼt forget
Periodic Donʼt forget How Far Away
the power of
Testing
your time
is Far Enough?
rsync
34
8/23/09
35. I look forward to meeting you… Beolink.org
XVI European AFS meeting 2009
Rome: September 28-30
Who should attend:
Everyone interested in deploying a globally accessible file
system
Everyone interested in learning more about real world usage
of Kerberos authentication in single realm and federated
single sign-on environments
Everyone who wants to share their knowledge and
experience with other members of the AFS and Kerberos
communities
Everyone who wants to find out the latest developments
affecting AFS and Kerberos
More Info:
www.openafs.it
35
8/23/09