4. Agenda
1. Learn about SRE at other companies
   1. The case of Microsoft Azure SRE
   2. The case of New Relic
   3. The case of Pinterest
   4. The case of Netflix
2. Closing summary (of sorts)
5. The case of Microsoft Azure SRE
Caskey L. Dickson and Jake Welch
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_welch.pdf
https://www.usenix.org/conference/srecon16/program/presentation/dickson
7. Why do it?
• Builds relationships and trust between the teams
• SRE learns about the service
• Dramatically speeds up ‘newbie to expert’ process
• Accelerates the product's growth
• Exposes details that otherwise would be difficult (or painful) to learn of
• Eliminates "secret sauce" knowledge that only a few people hold
• Creates a shared backlog of improvements
• Issues are shared across the teams
8. Tone
• Not an attack on the service
• Not a judgment of past choices
• Focus on ‘How’ questions not ‘Why’ questions
• ‘Why’ questions can be seen as judgmental
• Every participant must understand this
• Managing emotions is critical to a safe discussion environment
10. The case of New Relic
Alice Goldfuss
https://www.usenix.org/conference/srecon16/program/presentation/goldfuss
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_goldfuss_0.pdf
13. The case of Pinterest
Ernie Souhrada
https://www.usenix.org/conference/srecon16/program/presentation/souhrada
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_souhrada.pdf
15. History
• Now 100% hosted on AWS, but previously ran in an on-premises environment
• This was before cloud services became widespread
• 1. Individual servers matter.
• 2. Failure is expensive, so it must be prevented.
• 3. Capacity planning can make or break you.
• 4. Sometimes your destiny is still outside your control.
Operational Materialism
(a materialist view of operations?)
16. Now
• 1. Cloud servers can, and do, fail at any time, for any reason.
• 2. Trying to prevent this server failure is an endless source of suffering for SREs and DBAs alike.
• Trying to prevent server failure leads only to suffering
• 3. Accepting the impermanence of our servers, we should design systems that are failure-resilient, not failure-resistant.
• Cloud-based servers can fail at any time, for any reason.
• Automated replacement
• Configuration management tools
• 4. We can break the cycle of suffering and create a better experience for end users, internal customers, and colleagues.
Operational Buddhism
Keep watching over things with a calm, Buddha-like mind? (lol)
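As a rough illustration of the "failure-resilient, not failure-resistant" idea and the "automated replacement" bullet on this slide, the sketch below simulates a tiny reconcile loop. It is not taken from the Pinterest talk: `Server`, `launch_server`, `reconcile`, and `inject_failures` are hypothetical names, and `launch_server` merely stands in for whatever cloud API and configuration management tool would boot a real replacement.

```python
import random
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # hypothetical ID source, used only to name simulated servers


@dataclass
class Server:
    name: str
    healthy: bool = True


def launch_server() -> Server:
    """Hypothetical stand-in for a cloud API call that boots a fresh,
    configuration-managed server."""
    return Server(name=f"srv-{next(_ids)}")


def reconcile(fleet: list[Server], desired: int) -> list[Server]:
    """Drop unhealthy servers and launch replacements until the fleet is
    topped back up to the desired size. Nothing is repaired in place."""
    survivors = [s for s in fleet if s.healthy]
    replacements = [launch_server() for _ in range(desired - len(survivors))]
    return survivors + replacements


def inject_failures(fleet: list[Server], rng: random.Random, rate: float) -> None:
    """Simulate servers failing at any time, for any reason."""
    for server in fleet:
        if rng.random() < rate:
            server.healthy = False


if __name__ == "__main__":
    rng = random.Random(0)
    fleet = [launch_server() for _ in range(5)]
    for tick in range(3):
        inject_failures(fleet, rng, rate=0.3)
        failed = [s.name for s in fleet if not s.healthy]
        fleet = reconcile(fleet, desired=5)
        print(f"tick {tick}: failed={failed} fleet={[s.name for s in fleet]}")
```

The point of the sketch is that no code path tries to prevent or repair a failure; the loop only observes it and restores capacity, which is the shift the slide describes.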
17. The case of Netflix
Jonah Horowitz
https://www.usenix.org/conference/srecon16/program/presentation/horowitz
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_horowitz.pdf