Resilience and Compliance at Speed and Scale
1. Resilience and Compliance at Speed and Scale
ISACA SV Spring Conference
Jason Chan
chan@netflix.com
linkedin.com/in/jasonbchan
@chanjbs
2. About Me
Engineering Director @ Netflix:
Security: product, app, ops, IR, fraud/abuse
Previously:
Led infosec team @ VMware
Consultant - @stake, iSEC Partners
5. Common Controls to Promote Resilience
Architectural committees
Change approval boards
Centralized deployments
Vendor-specific, component-level HA
Standards and checklists
Architectural committees: designed to standardize on design patterns, vendors, etc.
Problems for Netflix:
Freedom and Responsibility Culture
Highly aligned and loosely coupled
Innovation cycles
6. Common Controls to Promote Resilience
Architectural committees
Change approval boards
Centralized deployments
Vendor-specific, component-level HA
Standards and checklists
Change approval boards: designed to control and de-risk change
Focus on artifacts, test and rollback plans
Problems for Netflix:
Freedom and Responsibility Culture
Highly aligned and loosely coupled
Innovation cycles
7. Common Controls to Promote Resilience
Architectural committees
Change approval boards
Centralized deployments
Vendor-specific, component-level HA
Standards and checklists
Centralized deployments: a separate Ops team deploys at a pre-ordained time (e.g. weekly, monthly)
Problems for Netflix:
Freedom and Responsibility Culture
Highly aligned and loosely coupled
Innovation cycles
8. Common Controls to Promote Resilience
Architectural committees
Change approval boards
Centralized deployments
Vendor-specific, component-level HA
Standards and checklists
Vendor-specific, component-level HA: high reliance on vendor solutions to provide HA and resilience
Problems for Netflix:
Traditional data-center-oriented systems do not translate well to the cloud
Heavy use of open source
9. Common Controls to Promote Resilience
Architectural committees
Change approval boards
Centralized deployments
Vendor-specific, component-level HA
Standards and checklists
Standards and checklists: designed for repeatable execution
Problems for Netflix:
Reliance on humans
11. What does the business value?
Customer experience
Innovation and agility
In other words:
Stability and availability for customer experience
Rapid development and change to continually improve product and outpace competition
Not that different from anyone else
12. Overall Approach
Understand and solve for relevant failure modes
Rely on automation and tools, not humans or committees
Make no assumptions that planned controls will work
Provide train tracks and guardrails, but invite deviation
14. Goals of Simian Army
“Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.”
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
17. Chaos Monkey
“By frequently causing failures, we force our services to be built in a way that is more resilient.”
Terminates cluster nodes during business hours
Rejects “If it ain’t broke, don’t fix it”
Goals:
Simulate random hardware failures, human error at small scale
Identify weaknesses
No service impact
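The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not Netflix's Chaos Monkey code: the function names, the 9:00 to 15:00 weekday window, and the instance ids are all assumptions made for the example, and the sketch only chooses a victim rather than calling any cloud API.

```python
import random
from datetime import datetime

def in_business_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The window is an assumed default, chosen so engineers are around to respond.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def pick_victim(instances, now, probability=1.0, rng=random):
    """Return one instance id to terminate, or None if nothing should be killed."""
    if not instances or not in_business_hours(now):
        return None
    if rng.random() >= probability:
        return None
    # Sort first so the choice does not depend on set iteration order.
    return rng.choice(sorted(instances))

# Hypothetical cluster of three nodes.
cluster = {"i-aaa", "i-bbb", "i-ccc"}
victim = pick_victim(cluster, datetime(2014, 3, 5, 10, 30))  # a Wednesday morning
print("would terminate:", victim)
```

Restricting terminations to business hours is what keeps the exercise cheap: a weakness surfaces while the owning team is at work instead of during an overnight page.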
20. Chaos Gorilla
Chaos Monkey’s bigger brother
Standard deployment pattern is to distribute load/systems/data across three data centers (AZs)
What happens if one is lost?
Goals:
Simulate data center loss, hardware/service failures at larger scale
Identify weaknesses, dependencies, etc.
Minimal service impact
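The question "what happens if one is lost?" can be reasoned about before running the real exercise. The sketch below is a hypothetical capacity check, not part of the Simian Army: the zone names, the capacity figures, and the `survives_zone_loss` helper are assumptions for illustration.

```python
def survives_zone_loss(capacity_by_zone, peak_load):
    """For each zone, report whether the remaining zones can carry peak_load."""
    total = sum(capacity_by_zone.values())
    return {
        zone: (total - capacity) >= peak_load
        for zone, capacity in capacity_by_zone.items()
    }

# Hypothetical figures: requests/sec each zone can serve.
zones = {"zone-a": 40, "zone-b": 40, "zone-c": 40}
result = survives_zone_loss(zones, peak_load=75)
print(result)
```

With three equal zones, the service tolerates a zone loss only if peak load stays under two thirds of total capacity, which is the headroom the three-AZ pattern implies.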
23. Chaos Kong
Simulate an entire region (US west coast, US east coast) failing
For example – hurricane, large winter storm, earthquake, etc.
Goals:
Exercise end-to-end large-scale failover (routing, DNS, scaling up)
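A rough way to picture the "scaling up" part of a regional failover is to work out how much extra traffic each surviving region must absorb. The following is an illustrative sketch under assumed numbers; the region names, traffic figures, and the proportional redistribution rule are not taken from the talk.

```python
def fail_over(traffic_by_region, failed_region):
    """Redistribute the failed region's traffic proportionally across survivors."""
    survivors = {r: t for r, t in traffic_by_region.items() if r != failed_region}
    if not survivors:
        raise ValueError("no surviving region to fail over to")
    lost = traffic_by_region.get(failed_region, 0)
    total = sum(survivors.values())
    if total == 0:
        # No existing traffic to weight by, so split the load evenly.
        share = lost / len(survivors)
        return {r: share for r in survivors}
    return {r: t + lost * t / total for r, t in survivors.items()}

# Hypothetical steady-state traffic in requests/sec.
traffic = {"us-east": 600, "us-west": 300, "eu-west": 100}
after = fail_over(traffic, "us-east")
print(after)
```

In this example each survivor ends up with 2.5 times its normal load, which is why routing changes and capacity scale-up have to be exercised together.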
26. Latency Monkey
Distributed systems have many upstream/downstream connections
How fault-tolerant are systems to dependency failure/slowdown?
Goals:
Simulate latencies and error codes, see how a service responds
Survivable services regardless of dependencies
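The idea of injecting latency and errors into a dependency, then watching how the caller responds, can be sketched as a simple wrapper. This is an assumption-laden illustration rather than Latency Monkey itself: `with_faults`, `InjectedFault`, and the `get_recommendations` caller with its static fallback are hypothetical names.

```python
import random
import time

class InjectedFault(Exception):
    """Stands in for a real dependency error such as an HTTP 503."""

def with_faults(call, delay_s=0.0, error_rate=0.0, rng=random, sleep=time.sleep):
    """Wrap a dependency call so it is slowed down and sometimes fails."""
    def wrapped(*args, **kwargs):
        sleep(delay_s)  # simulated added latency
        if rng.random() < error_rate:
            raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped

def get_recommendations(fetch, user_id):
    """Caller under test: fall back to a static list if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["popular-title"]

def real_fetch(user_id):
    # Placeholder for a real downstream call.
    return ["title-for-%s" % user_id]

flaky = with_faults(real_fetch, delay_s=0.05, error_rate=1.0)
print(get_recommendations(flaky, 42))
```

A service counts as survivable here if the caller still returns something useful when its dependency is slow or failing.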
29. Conformity Monkey
Without architecture review, how do you ensure designs leverage known successful patterns?
Conformity Monkey provides automated analysis for pattern adherence
Goals:
Evaluate deployment modes (data center distribution)
Evaluate health checks, discoverability, versions of key libraries
Help ensure service has best chance of successful operation
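Automated pattern checks of this kind can be pictured as a set of rules run against a description of each cluster. The sketch below is hypothetical: the dictionary fields, the three-zone minimum, and the library name are assumptions for illustration and do not reflect Conformity Monkey's actual rule set or data model.

```python
def check_conformity(cluster, min_zones=3, min_library_versions=None):
    """Return a list of human-readable violations for one cluster description."""
    violations = []
    if len(set(cluster.get("zones", []))) < min_zones:
        violations.append("deployed in fewer than %d zones" % min_zones)
    if not cluster.get("health_check_url"):
        violations.append("no health check configured")
    if not cluster.get("registered_in_discovery", False):
        violations.append("not registered with service discovery")
    for lib, minimum in (min_library_versions or {}).items():
        found = cluster.get("libraries", {}).get(lib)
        # Versions are compared as tuples of integers, e.g. (2, 1, 0).
        if found is None or tuple(found) < tuple(minimum):
            wanted = ".".join(str(part) for part in minimum)
            violations.append("library %s missing or older than %s" % (lib, wanted))
    return violations

# Hypothetical cluster description and minimum library version.
cluster = {
    "zones": ["zone-a"],
    "libraries": {"http-client": (1, 9, 0)},
}
for problem in check_conformity(cluster, min_library_versions={"http-client": (2, 0, 0)}):
    print(problem)
```

Because the output is a list of findings rather than a gate, teams stay free to deviate while still being told where they differ from the known-good patterns.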
32. Janitor Monkey
Clutter accumulates, in the form of:
Complexity
Vulnerabilities
Cost
Janitor identifies unused resources and reaps them to save money and reduce exposure
Goals:
Automated hygiene
More freedom for engineers to innovate and move fast
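The "identify unused resources" step might look something like the sketch below. It is a simplified illustration with assumed field names (`attached`, `last_used`) and an assumed 14-day idle threshold; it only lists candidates and leaves the actual reaping out.

```python
from datetime import datetime, timedelta

def find_reapable(resources, now, max_idle_days=14):
    """Return ids of resources that are detached and idle longer than the threshold."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        r["id"]
        for r in resources
        if not r["attached"] and r["last_used"] < cutoff
    ]

# Hypothetical inventory of storage volumes.
now = datetime(2014, 3, 5)
inventory = [
    {"id": "vol-1", "attached": True, "last_used": datetime(2014, 1, 1)},
    {"id": "vol-2", "attached": False, "last_used": datetime(2014, 1, 1)},
    {"id": "vol-3", "attached": False, "last_used": datetime(2014, 3, 1)},
]
print(find_reapable(inventory, now))
```

Separating identification from deletion leaves room for a notification or grace period before anything is actually removed.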
33. Non-Simian Approaches
Org model
Engineers write, deploy, support code
Culture
De-centralized with as few processes and rules as possible
Lots of local autonomy
“If you’re not failing, you’re not trying hard enough”
Peer pressure
Productive and transparent incident reviews
35. Control Objectives for Software Deployments
Visibility and transparency
Who did what, when?
What was the scope of the change or deployment?
Was it reviewed?
Was it tested?
Was it approved?
Typically attempted via:
Restricted access/SoD
CMDBs
Change management processes
Test results
Change windows
36. Large and Dynamic Systems Need a Different Approach
No operations organization
No acceptable windows for downtime
Thousands of deployments and changes per day
37. Control Objectives Haven’t Changed
Visibility and transparency
Who did what, when?
What was the scope of the change or deployment?
Was it reviewed?
Was it tested?
Was it approved?
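One way to keep these questions answerable at thousands of deployments per day is to have the deployment tooling emit a structured record for every change. The sketch below is a hypothetical record format, not a description of Netflix's tooling; the field names and the `unanswered` helper are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class DeploymentRecord:
    who: str                             # Who did what...
    when: datetime                       # ...when?
    scope: str                           # What was the scope of the change?
    reviewed_by: Optional[str] = None    # Was it reviewed?
    tests_passed: Optional[bool] = None  # Was it tested? None means no test run.
    approved_by: Optional[str] = None    # Was it approved?

def unanswered(record):
    """List the control questions this record cannot answer."""
    gaps = []
    if not record.reviewed_by:
        gaps.append("Was it reviewed?")
    if record.tests_passed is None:
        gaps.append("Was it tested?")
    if not record.approved_by:
        gaps.append("Was it approved?")
    return gaps

# Hypothetical deployment event.
record = DeploymentRecord(
    who="alice",
    when=datetime(2014, 3, 5, 10, 30, tzinfo=timezone.utc),
    scope="api-service v42 to us-east",
    reviewed_by="bob",
    tests_passed=True,
)
print(unanswered(record))
```

Records like this are produced as a side effect of deploying, so visibility does not depend on a change window or a separate operations team.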