2. Netflix, Inc.
“Netflix is the world’s leading Internet television
network with more than 33 million members in
40 countries enjoying more than one billion
hours of TV shows and movies per month,
including original series . . .”
Source: http://ir.netflix.com
3. Me
Director of Engineering @ Netflix
Responsible for:
Cloud app, product, infrastructure, ops security
Previously:
Led security team @ VMware
Earlier, primarily security consulting at @stake, iSEC Partners
12. On the way to the cloud . . .
(or NoOps,
depending on definitions)
13. Some As-Is #s
33m+ subscribers
10,000s of systems
100s of engineers, apps
~250 test deployments/day **
~70 production deployments/day *
** Sample based on one week’s activities
15. A common graph @ Netflix
Lots of watching in prime time
Not as much in early morning
Old way - pay and provision for peak, 24/7/365
Multiply this pattern across the dozens of apps that comprise the
Netflix streaming service
17. Autoscaling
Goals:
# of systems matches load requirements
Load per server is constant
Happens without intervention (the ‘auto’ in autoscaling)
Results:
Clusters continuously add & remove nodes
New nodes must mirror existing
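The first two goals can be read as a simple sizing rule: pick the node count that keeps per-node load at a fixed target. The following is a minimal sketch of that arithmetic, not Netflix's actual scaling policy; the function name, the target of 500 requests per node, and the sample load figures are all illustrative assumptions.

```python
import math


def desired_node_count(total_load, target_load_per_node, min_nodes=1):
    """Return the node count that keeps per-node load at or below the target."""
    if target_load_per_node <= 0:
        raise ValueError("target_load_per_node must be positive")
    return max(min_nodes, math.ceil(total_load / target_load_per_node))


# Hypothetical request rates: early morning, daytime, prime time.
sample_loads = [1200, 4800, 19500]
plan = [desired_node_count(load, target_load_per_node=500) for load in sample_loads]
print(plan)
```

Because the cluster size follows the load curve, nodes are added and removed continuously, which is why every new node has to be launched from the same image as the existing ones.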
18. Every change requires a new cluster push
(not an incremental change to existing systems)
20. Netflix Deployment Pipeline
Perforce/Git: code change, config change
YUM: RPM with app-specific bits
Bakery: base image + RPM
AMI: VM template ready to launch
ASG: cluster config, running systems
21. Operational Impact
No changes to running systems
No systems mgmt infrastructure (Puppet, Chef, etc.)
Fewer logins to prod
No snowflakes
Trivial “rollback”
22. Security Impact
Need to think differently on:
Vulnerability management
Patch management
User activity monitoring
File integrity monitoring
Forensic investigations
26. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Two contexts:
1. Integration with your engineering ecosystem
2. Integration of your security controls
Organization
SCM, build and release
Monitoring and alerting
27. Integration: Base AMI Testing
The base AMI is managed like other packages, via P4, Jenkins, etc.
We watch the SCM directory & kick off testing when it changes
Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
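A scan alert like the one above can be turned into an automated gate on the base AMI. The sketch below parses the severity counts from the alert text and applies a pass/fail threshold; the function names and the "zero criticals" policy are assumptions for illustration, since the slides do not say how results are acted on.

```python
import re

SEVERITY_LINE = re.compile(r"^(Critical|Severe|Moderate) Vulnerabilities:\s*(\d+)$")


def parse_scan_alert(text):
    """Extract per-severity vulnerability counts from a scan alert body."""
    counts = {}
    for line in text.splitlines():
        match = SEVERITY_LINE.match(line.strip())
        if match:
            counts[match.group(1).lower()] = int(match.group(2))
    return counts


def ami_passes(counts, max_critical=0):
    """Hypothetical policy: fail the AMI if criticals exceed the threshold."""
    return counts.get("critical", 0) <= max_critical


alert = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4"""

counts = parse_scan_alert(alert)
print(counts, ami_passes(counts))
```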
28. Integration: Control Packaging and Installation
From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
Pulls in the following RPMs:
HIDS agent
Config assessment/firewall agent
Host hardening package
WAF
29. Integration: Timeline (Chronos)
What IP addresses have been blacklisted by the WAF in
the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
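The two queries differ only in the event type and the start timestamp. As a minimal sketch, the request path can be built from a `datetime`, assuming the 17-digit `start` value is year, month, day, hour, minute, second, and milliseconds (this reading is inferred from the examples, not documented in the slides). The helper names are hypothetical.

```python
from datetime import datetime


def chronos_timestamp(moment):
    """Format a datetime as yyyyMMddHHmmssSSS (assumed Chronos format)."""
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def event_query(event_type, start):
    """Build the request path for events of one type since `start`."""
    return f"/api/v1/event?timelines=type:{event_type}&start={chronos_timestamp(start)}"


print(event_query("blacklist", datetime(2013, 1, 25)))
print(event_query("securitygroup", datetime(2013, 2, 6)))
```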
30. Integration: Static Analysis
Available self-service through build environment
FindBugs, PMD
Jenkins plugin to display graphs and support drill
through to results
32. Integration: Alerting (Central Alerting Gateway)
Single place to generate and deliver alerts
Python, Java libraries (or JSON post)
Ties in to PagerDuty notification/escalation system
Permits stateful alerting and some response
A prerequisite that our security tools will leverage
33. CAG Example
import CORE.Gateway
gw = CORE.Gateway.Gateway()
# testcluster is a defined app with associated escalation
# schedule in PagerDuty
gw.send("testcluster", "normal", "Something went wrong")
34. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Developers are lazy
35. Making it Easy: Cryptex
Crypto: DDIY (“Don’t Do It Yourself”)
Many uses of crypto in web/distributed systems:
Encrypt/decrypt (cookies, data, etc.)
Sign/verify (URLs, data, etc.)
Netflix also uses crypto heavily for device activation, DRM
playback, etc.
36. Making it Easy: Cryptex
Multi-layer crypto system (HSM basis, scale out layer)
Easy to use
Key management handled transparently
Access control and auditable operations
37. Making it Easy: Cloud-Based SSO
In the AWS cloud, access to data center services is
problematic
Examples: AD, LDAP, DNS
But, many cloud-based systems require authN, authZ
Examples: Dashboards, admin UIs
Asking developers to securely handle/accept credentials
is also problematic
38. Making it Easy: Cloud-Based SSO
Solution: Leverage OneLogin SaaS SSO (SAML) used
by IT for enterprise apps (e.g. Workday, Google Apps)
Uses Active Directory credentials
Provides a single & centralized login page
Developers don’t accept username & password directly
Built filter for our base server to make SSO/authN trivial
39. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Self-service is perhaps the most transformative cloud characteristic
Failing to adopt this for security controls will lead to friction
40. Self-Service: Security Groups
Asgard cloud orchestration tool allows developers to
configure their own firewall rules
Limited to same AWS account, no IP-based rules
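The two guardrails on self-service rules (same AWS account only, no IP-based sources) are easy to express as a check. The sketch below is illustrative only: the `IngressRule` type, its fields, and the messages are hypothetical, and the slides do not describe how Asgard enforces these limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical ingress rule: the source is a security group or a CIDR block."""
    port: int
    source_group_account: Optional[str] = None  # account owning the source group
    source_cidr: Optional[str] = None           # e.g. "203.0.113.0/24"


def rule_violations(rule: IngressRule, own_account: str) -> List[str]:
    """Return the reasons a rule falls outside the self-service limits."""
    problems = []
    if rule.source_cidr is not None:
        problems.append("IP-based source not allowed")
    if rule.source_group_account is not None and rule.source_group_account != own_account:
        problems.append("source group is in a different AWS account")
    if rule.source_cidr is None and rule.source_group_account is None:
        problems.append("rule has no source")
    return problems


ok_rule = IngressRule(port=7001, source_group_account="111111111111")
cidr_rule = IngressRule(port=22, source_cidr="203.0.113.0/24")
print(rule_violations(ok_rule, "111111111111"))
print(rule_violations(cidr_rule, "111111111111"))
```

Rules that fail the check would go through the exception path rather than self-service.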
41. Points of Emphasis
Integrate
Make the right way easy
Self-service, with exceptions
Trust, but verify
Culture precludes a traditional “command and control” approach
Organizational desire for agile, DevOps, CI/CD blurs traditional security engagement touchpoints
42. Trust but Verify: Security Monkey
Cloud APIs make verification and analysis of configuration and running state simpler
Security Monkey created as the framework for this analysis
Includes:
Certificate checking
Firewall analysis
IAM entity analysis
Limit warnings
Resource policy analysis
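The core idea behind this kind of verification is to snapshot configuration through the cloud API and compare it with the previous snapshot. The following is a minimal sketch of that comparison for security groups; it is not Security Monkey's code, and the snapshot shape (group name mapped to a set of `(protocol, port, source)` tuples) is an assumption for illustration.

```python
def diff_security_groups(previous, current):
    """Compare two snapshots mapping group name -> set of rule tuples."""
    prev_names, curr_names = set(previous), set(current)
    return {
        "added": sorted(curr_names - prev_names),
        "removed": sorted(prev_names - curr_names),
        "changed": sorted(
            name for name in prev_names & curr_names
            if previous[name] != current[name]
        ),
    }


# Hypothetical snapshots taken on two consecutive runs.
yesterday = {
    "web": {("tcp", 80, "sg-elb")},
    "db": {("tcp", 3306, "sg-web")},
}
today = {
    "web": {("tcp", 80, "sg-elb"), ("tcp", 8080, "sg-elb")},
    "cache": {("tcp", 11211, "sg-web")},
}

report = diff_security_groups(yesterday, today)
print(report)
```

A non-empty report is what would drive a "Changes Detected" notification.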
43. Trust but Verify: Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group
<sgname> (eu-west-1 / prod)
<#Security Group/<sgname> (eu-west-1 / prod)>
44. Trust but Verify: Exploit Monkey
AWS Autoscaling group is unit of deployment, so
changes signal a good time to rerun dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from
testapp-live-v001 to testapp-live-v002.
I'm starting a vulnerability scan against test app from these
private/public IPs:
10.29.24.174
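The trigger in this message is a change in an application's current ASG name. A minimal sketch of that detection, assuming the current ASG name per application is already available from some earlier lookup (the mapping shape and function name are hypothetical, and this is not Exploit Monkey's code):

```python
def changed_asgs(previous, current):
    """Return {app: (old_asg, new_asg)} for apps whose current ASG name differs.

    Apps seen for the first time appear with old_asg set to None.
    """
    return {
        app: (previous.get(app), asg_name)
        for app, asg_name in current.items()
        if previous.get(app) != asg_name
    }


last_seen = {"testapp-live": "testapp-live-v001", "otherapp": "otherapp-v007"}
now = {"testapp-live": "testapp-live-v002", "otherapp": "otherapp-v007"}

to_rescan = changed_asgs(last_seen, now)
print(to_rescan)
```

Each entry in the result marks a new cluster push, so it is a natural point to queue a fresh dynamic scan against the new instances.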
45. Trust but Verify: ELB Checker (gauntlt)
AWS Elastic Load Balancer (ELB) provides cross-datacenter
traffic balancing, but no security controls
If your cluster is attached to an ELB, it is available to the Internet
Engineers may misunderstand:
ELB use cases (and alternatives)
Security features
Other measures used to protect ELB-fronted clusters
46. Trust but Verify: ELB Checker (gauntlt)
1. Launch gauntlt test runner instance,
loaded with “master list” of ELBs and
expected state
2. Determine “target list” of current ELBs
to evaluate
3. Generate per-ELB listener gauntlt
attack files
4. Execute attacks
5. Alert on failures and new ELBs
6. Triage findings and update master list
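Steps 2 and 5 amount to comparing the master list with what is currently deployed. The sketch below shows that comparison with plain Python sets; it stands in for the alerting decision only and does not generate or run gauntlt attack files. The data shape (ELB name mapped to its listener ports) and the function name are assumptions.

```python
def evaluate_elbs(master, observed):
    """Compare expected ELB listener ports against the observed ones.

    Returns (new_elbs, failures): ELBs missing from the master list, and
    known ELBs whose listener ports differ from the expected state.
    """
    new_elbs = sorted(set(observed) - set(master))
    failures = sorted(
        name for name in set(master) & set(observed)
        if set(observed[name]) != set(master[name])
    )
    return new_elbs, failures


# Hypothetical master list and current state.
master_list = {"api-frontend": [443], "www": [80, 443]}
current_state = {"api-frontend": [443, 8080], "www": [80, 443], "debug-ui": [80]}

new_elbs, failures = evaluate_elbs(master_list, current_state)
print(new_elbs, failures)
```

After triage (step 6), accepted findings would be written back to the master list so the next run stays quiet for them.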
47. Takeaways
Netflix runs a large, dynamic service in AWS
Newer concepts like cloud & DevOps need an
updated approach to security
Specific context can help jumpstart a pragmatic
and effective security program
Don’t swim upstream - integrate and collaborate
with your engineering partners