System Availability Talk
- 2. So what is High Availability?
• Five 9s?
• No Single point of failure?
• Multiple Data Centres?
• Fault Tolerance?
• Load Balancing?
• Uptime?
© 2012 Energized Work - www.energizedwork.com
- 3. The 9’s of Availability
- 4. The 9’s of Availability
Availability              Downtime per Year
One nine (90%)            36.5 days
Two nines (99%)           3.65 days
Three nines (99.9%)       8.76 hours
Four nines (99.99%)       52.56 minutes
Five nines (99.999%)      5.26 minutes
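
The figures in the table follow from simple arithmetic on the unavailable fraction of a year. A minimal sketch, assuming a 365-day year (the function names `downtime_hours_per_year` and `describe` are illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year, i.e. 8,760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime in hours per year for a given availability percentage."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


def describe(hours: float) -> str:
    """Render a duration in the largest unit that keeps the number readable."""
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.2f} minutes"


for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {describe(downtime_hours_per_year(pct))}")
```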
- 5. Problem with the 9’s
• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity
(99.9% * 99.9% * 99.9% ≈ 99.7%)
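
The multiplicity point can be checked directly: when a request has to pass through several components in series, their availabilities multiply. A small sketch, assuming the components fail independently (the helper name `serial_availability` is illustrative):

```python
from math import prod


def serial_availability(availabilities):
    """Combined availability of components that must ALL be up (a serial chain).

    Assumes failures are independent of one another.
    """
    return prod(availabilities)


# Three components in series, each offering "three nines".
chain = [0.999, 0.999, 0.999]
combined = serial_availability(chain)
print(f"{combined:.4%}")  # prints 99.7003%
```

Each extra component in the chain lowers the combined figure, so a system built from "three nines" parts does not itself deliver three nines.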
- 7. No Single Point of Failure (SPOF)
- 10. End with this
[Diagram: Users served by redundant pairs at every tier: Firewall 1 / Firewall 2, switch 1 / switch 2, WEB1 / WEB2, APP1 / APP2, DB1 / DB2]
- 11. Problems with eliminating SPOF
• It’s expensive ££
• Where do you draw the line?
• Are failures independent?
• Can you guarantee No SPOF?
• Increased complexity
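
The question of whether failures are independent matters because the usual redundancy arithmetic depends on it. The sketch below is only an illustration under stated assumptions: every component has the same hypothetical availability of 99%, the five tier names are taken from the earlier diagram, and failures are fully independent. Correlated failures (shared power, shared configuration, cascading failure) would make the real figure lower than this model suggests.

```python
from math import prod


def redundant_availability(a: float, copies: int = 2) -> float:
    """Availability of `copies` interchangeable components where one is enough.

    Only valid if the copies fail independently of each other.
    """
    return 1.0 - (1.0 - a) ** copies


def serial_availability(tiers) -> float:
    """Availability of tiers that must all be up at the same time."""
    return prod(tiers)


COMPONENT = 0.99  # hypothetical per-component availability, not from the talk
TIERS = ["firewall", "switch", "web", "app", "db"]

single = serial_availability([COMPONENT] * len(TIERS))
paired = serial_availability([redundant_availability(COMPONENT)] * len(TIERS))

print(f"one of each component: {single:.4%}")
print(f"redundant pairs:       {paired:.4%}")
```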
- 13. Solution: Get a 2nd Data Centre
- 14. Hot/Hot Multisite
• Full range of services available in multiple locations.
• Easy to automate failover of sites
• Data Consistency is hard.
• Capacity Planning concerns
- 15. Hot/Warm Multisite
• Simpler than Hot/Hot
• Read/write ratio dependent
• Replicate data synchronously or asynchronously?
- 16. Hot/Cold Multisite
• Easy to set up
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?
- 17. DR Multisite
• Fingers crossed you never need it.
• How can/should you test it?
• Cloud?
- 18. Problems with Multiple sites
• ££ - it’s expensive
• Managing more systems
• Managing consistency of Data
• Managing Capacity
• Is it still fail proof?
• Unless you test it, it’s just a plan
- 20. Complex Systems
• More redundancy and automation leads to more complexity.
• More complexity often adds more points of failure.
- 21. “How Complex Systems Fail”
Author: Dr. Richard Cook
• Catastrophe is always just around the corner.
• Human Operators have dual roles.
• Change introduces new forms of failure
- 23. Questions for the Customer
• What is the cost of downtime?
• What are the RTO and RPO?
- 24. RTO = Recovery Time Objective (maximum tolerable downtime)
RPO = Recovery Point Objective (maximum tolerable window of data loss)
- 25. Aggressive RTO & RPO are expensive and have a performance impact.
- 26. RTO / RPO example
Problem
• Simple DB
• Business can tolerate up to 15 minutes of downtime (RTO)
• Business can tolerate up to a 10-minute window of data loss (RPO)
- 27. RTO / RPO example
Possible solution
1. Continuously replicate data to a 2nd host
2. Continue with nightly backups and also copy DB transaction logs from the primary host to another system.
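
One way to reason about candidate solutions is to compare each one's worst-case data loss and recovery time against the stated objectives (15 minutes RTO, 10 minutes RPO). The sketch below is hypothetical throughout: the `Strategy` type, the 5-minute log-shipping interval, and every recovery time are illustrative placeholders rather than measurements from the talk. It also assumes worst-case data loss is roughly the interval between copies.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    worst_case_data_loss_min: float  # compare against the RPO
    worst_case_recovery_min: float   # compare against the RTO


def meets_objectives(s: Strategy, rto_min: float, rpo_min: float) -> bool:
    """True only if the strategy satisfies both the RTO and the RPO."""
    return (s.worst_case_recovery_min <= rto_min
            and s.worst_case_data_loss_min <= rpo_min)


RTO_MIN = 15  # from the problem statement
RPO_MIN = 10  # from the problem statement

# All figures below are placeholders for illustration, not measured values.
candidates = [
    Strategy("nightly backups only", 24 * 60, 240),
    Strategy("continuous replication to 2nd host", 1, 5),
    Strategy("nightly backups + logs shipped every 5 min", 5, 15),
]

for s in candidates:
    verdict = "meets" if meets_objectives(s, RTO_MIN, RPO_MIN) else "misses"
    print(f"{s.name}: {verdict} RTO/RPO")
```

The real inputs have to come from testing a restore or failover; until then they are estimates.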
- 28. So what’s more important?
Increasing Availability or Reducing Recovery Time?
Editor’s notes
- 28/10/10 © Energized Work Limited 2010 Agile Evangelists - LEANING
- Ask any business how much downtime is acceptable and you will get a consistent answer.
- Found more in marketing literature than technical literature.
- An SLA is just an instrument that makes business people comfortable (just like insurance).
- Points 1 & 2: diminishing returns. Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. Cascading failures.
- Read & write anywhere. Global Server Load Balancing with DNS.
- Read-intensive apps are well suited to this: reads are Hot/Hot.
- Cold site is so untrusted that perhaps spending hours restoring the primary DC is a better and safer bet.
- Talk about capacity planning. Hot/Hot: config switches. Most companies don’t thoroughly test DC failover. When a failure occurs, many companies will focus on fixing the failure in the primary DC rather than attempt a failover. So why bother having a 2nd DC anyway? If you plan on having multiple DCs or DR, then test your procedures when you’re not in an emergency situation. Game Day events.
- Mention John Allspaw’s QCon talk. 2. Dual roles of humans: defenders against failure and producers of failure. 3. Introducing a technology change to prevent low-consequence, high-frequency failures may introduce low-frequency, high-consequence failures, and new pathways to large-scale, catastrophic failure. Humans focus on the beneficial characteristics of the change; new failures may be difficult to foresee. Give config management example: knife, resolv.conf. 3. Also covers maintenance and why many find it difficult: build-and-forget mentality.
- Cost of downtime: easy or difficult to measure? Can downtime actually be equated to lost revenue? Give online shopping example.
- RTO and RPO are often in competition. Give an example of replication lag between 2 sites. Zero RPO example: if replication lags between systems and you have an aggressive RPO, you may be better off taking a few hours’ outage and focusing on restoring your primary site. Zero RTO example: if replication lags between DCs, you may decide to fail over immediately and accept the data loss for some in-flight transactions. Aggressive RTO & RPO are expensive and have a performance impact.
- Typical nightly backups aren’t going to cut it. Common practice is to back up systems nightly. Is your business happy to lose up to 24 hours of data? Probably not.
- 1. Covers you for any catastrophic hardware failure; the 2nd host has independent storage infrastructure. Data corruption would, however, result in 2 copies of crap. 2. Covers you for data corruption. Playing back transaction logs will also allow you to identify the place where corruption occurred.
- What about MTTD?
- My experience tells me most companies focus on availability. How many companies take nightly tape backups but have never bothered trying to restore or test them? If you think you can build a completely fail-proof system you are kidding yourself. How many companies have game days?