SREcon
2016-08-26
Internal study session
Tsuyoshi Nakamura

https://www.usenix.org/conference/srecon16

I first heard about SREcon at a study session, then worked hard to follow up on the videos and slides for each session.

Agenda
1. Learn about SRE at other companies
   1. In case of Microsoft Azure SRE
   2. In case of New Relic
   3. In case of Pinterest
   4. In case of Netflix
2. Closing summary

In case of Microsoft Azure SRE
Caskey L. Dickson and Jake Welch
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_welch.pdf
https://www.usenix.org/conference/srecon16/program/presentation/dickson

Service Roast
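Purpose: understand and clearly state the product's weaknesses, design oversights, and problems that everyone already knows about
Cover the full lifecycle of the service, from development to disaster recovery
List the points that need improvement and keep improving continuously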

Why do?
• Builds relationships and trust between the teams
• SRE learns about the service
• Dramatically speeds up ‘newbie to expert’ process
• Accelerates the growth of the product
• Exposes details that otherwise would be difficult (or painful) to learn of
• Eliminates "secret sauce" knowledge that only a few people hold
• Creates a shared backlog of improvements
• Issues are shared across teams

Tone
• Not an attack on the service
• Not a judgment of past choices
• Focus on ‘How’ questions not ‘Why’ questions
• Why’s can be seen as judgmental
• Every participant must understand this
• Managing emotions is critical to a safe discussion environment

In case of New Relic
Alice Goldfuss
https://www.usenix.org/conference/srecon16/program/presentation/goldfuss
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_goldfuss_0.pdf

Summary
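• The team adapted its process from government and military incident response
• An application of the Incident Command System
• Apparently quite well known in the US
• Each role is clearly defined
• Overall impact is given particular consideration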

In case of Pinterest
Ernie Souhrada
https://www.usenix.org/conference/srecon16/program/presentation/souhrada
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_souhrada.pdf

History
• Now 100% hosted on AWS, but previously an on-premises environment
• This was before cloud services became widespread
• 1. Individual servers matter.
• 2. Failure is expensive, so it must be prevented.
• 3. Capacity planning can make or break you.
• 4. Sometimes your destiny is still outside your control.
Operational Materialism
(Japanese gloss: 運用物質主義?)

Now
• 1. Cloud servers can, and do, fail at any time, for any reason.
• 2. Trying to prevent this server failure is an endless source of suffering for SREs and DBAs alike.
• Trying to prevent server failure leads only to suffering
• 3. Accepting the impermanence of our servers, we should design systems that are failure-resilient, not failure-resistant.
• Cloud-based servers can fail at any time, for any reason.
• Automated replacement
• Configuration management tools
• 4. We can break the cycle of suffering and create a better experience for end users, internal customers, and colleagues
Operational Buddhism
Keep watching over the servers with a calm, Buddha-like mind? lol
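
The "failure-resilient, not failure-resistant" idea with automated replacement can be sketched as a small reconcile loop. This is only an illustration and not Pinterest's implementation: `Server`, `Fleet`, `provision`, and `reconcile` are hypothetical names, and `provision` stands in for launching a cloud instance and applying configuration management to it.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Fleet:
    """Hypothetical fleet that replaces failed servers instead of repairing them."""
    desired_size: int
    servers: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def provision(self) -> Server:
        # Assumption: stands in for "launch an instance, then configure it".
        server = Server(name=f"db-{next(self._ids)}")
        self.servers.append(server)
        return server

    def reconcile(self) -> list:
        """Drop failed servers, then launch replacements up to desired_size.

        Returns the names of the servers that were dropped.
        """
        failed = [s for s in self.servers if not s.healthy]
        self.servers = [s for s in self.servers if s.healthy]
        while len(self.servers) < self.desired_size:
            self.provision()
        return [s.name for s in failed]


fleet = Fleet(desired_size=3)
fleet.reconcile()                  # initial launch of three servers
fleet.servers[1].healthy = False   # simulate a failure at an arbitrary time
replaced = fleet.reconcile()       # the failed server is dropped and replaced
```

The loop never tries to prevent or repair a failure. It only restores the desired fleet size, which is the shift in attitude the slide describes.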

In case of Netflix
Jonah Horowitz
https://www.usenix.org/conference/srecon16/program/presentation/horowitz
https://www.usenix.org/sites/default/files/conference/protected-files/srecon16_slides_horowitz.pdf

Topic
• The service runs in 190 countries, yet there are only 5 SREs!?
• SREs are expensive & hard to find
• Freedom & Responsibility

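🏁 Closing summary
• Naturally, the SRE role differs from company to company
• As I also felt with DevOps, growing a service at speed inevitably produces balls that drop between fielders (tasks that nobody owns)
• SRE seems to start from the question of how to pick up those dropped balls
• If you put the team first, you naturally end up doing SRE-like tasks
• I think the SRE label emerged partly from a wish to properly recognize that work
• Mindset may matter more than technical skill?!
• It also seems to include quite a few PM-like elements
• "SRE should not be a Servant"
• Useful resources for learning
• https://github.com/dastergon/awesome-sre/blob/master/README.md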
