This document summarizes an AWS re:Invent session on incident response in the cloud. The session covered basics of incident response, best practices for incident response in the cloud, and a case study of information spillage incident response from Johns Hopkins Applied Physics Laboratory (JHUAPL). It discussed the difference between events and incidents, and components of an effective incident response process including preparation, identification, containment, investigation, eradication, recovery, and follow-up. It provided advice on leveraging AWS services and capabilities to enhance incident response. The JHUAPL case study discussed their approach to incident response and how they apply it in AWS, including use of encryption, log aggregation, and isolation techniques during containment and eradication.
2. Incident Response in the Cloud
• Basics – Events and incidents – Things that “go bump”
• Incident response in the cloud – Best practices
• Case study: JHUAPL – Information spillage incident response
3. Events and Incidents
So, What’s the Difference?
All incidents are events—but all events are NOT incidents
Via “event management” we monitor (ex: use tools like Amazon
CloudWatch, AWS CloudTrail, Splunk, and others, to track,
monitor, analyze and audit EVENTS)
If event management identifies an event that, once analyzed, qualifies
as an incident, that “qualifying event” triggers the registration of an
incident and starts the incident management process, which initiates
any response actions that are required
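The event-to-incident flow above can be sketched in a few lines of Python. This is a minimal illustration only: the `Event` fields, the `IncidentRegister` class, and the qualification rule (service impacted, or a configuration item failed) are assumptions chosen to mirror the incident definition on the next slide, not part of any AWS API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One monitored event (all field names here are hypothetical)."""
    source: str                       # e.g. "cloudwatch" or "cloudtrail"
    name: str
    service_impacted: bool = False    # interruption or reduced quality of an IT service
    config_item_failed: bool = False  # failure that has not yet affected service


@dataclass
class IncidentRegister:
    """Stand-in for an incident tracking system."""
    incidents: List[Event] = field(default_factory=list)

    def register(self, event: Event) -> int:
        self.incidents.append(event)
        return len(self.incidents)    # simple 1-based incident number


def qualifies_as_incident(event: Event) -> bool:
    # All incidents are events, but only some events qualify as incidents.
    return event.service_impacted or event.config_item_failed


def handle_event(event: Event, register: IncidentRegister) -> Optional[int]:
    """Register an incident for a qualifying event; otherwise return None."""
    if qualifies_as_incident(event):
        return register.register(event)
    return None
```

Most events pass through `handle_event` without creating anything; only a qualifying event produces an incident number that the response process can then act on.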
4. What Is an Incident?
An unplanned interruption to an IT service or reduction in
the quality of an IT service
Failure of a configuration item that has not yet affected
service is also an incident
5. Incident Response in the Cloud
Bottom line up front: Incident response can be
complex but there is NO ROCKET SCIENCE
INVOLVED OR NEEDED!
7. Domains and Scope of Incident Response
Understanding the Interfaces and Boundaries
• Incident response of the customer “in the cloud”
• Incident response of the CSP for the cloud infrastructure and
services it provides
• Joint coordinated incident response of the customer and CSP in
cooperation with one another
• Joint coordinated incident response of multiple CSPs and/or
customers in cooperation with each other
8. IR Policy, Process, Procedure
Example Applicable and Governing Standards
Example governance in the federal community
NIST 800-53/NIST 800-171
NIST 800-61
FedRAMP Incident Communications Procedure
FedRAMP Continuous Monitoring Strategy Guide
CJCSM 6510.01B
DoD 8530.1/8530.2 – DoD PKI (aka “CAC”) required
Homework assignment
12. Incident Response in the Cloud
Am I prepared for
incidents and failures?
Both are guaranteed to happen!
13. Topics and Best Practices
• Is incident response in the cloud different?
• Building your IR policies, governance, plans, and
procedures/run books
• Preparation, training, and execution
• IR for continuous improvement
• IR exercises – training the way you fight
• Iterate and automate!
• Engaging your CSP (AWS)
15. Incident Response – Cloud Considerations
• Cloud is different and IR in the cloud is different!
Easier, faster, cheaper, more effective!
• Your ability to detect, react, and recover can be greatly
enhanced by leveraging the cloud
• Many capabilities for investigation are ONLY possible in
the cloud
• IR in the cloud is NOT just about being reactive!
16. Leveraging IR to improve your security posture in the cloud
Let’s look at some best practices…
17. Integrate Incident Response with Continuous Improvement
Establish control
Determine impact
Recover as needed
Investigate root cause
Implement improvement
Iterate!
Think: NTSB incident/accident investigation and
recommendations but geared to your AWS environment!
18. Preparation – Being Proactive!
• Architect for failure and IR throughout
• Implement clear, lightweight governance and ownership.
• Architect and build for speed, agility, security, and integrity
• Implement clear, simple controls and run books for
responders
• Leverage principle of “least privilege” throughout
• Validate readiness and run tests continually
• Consider “chaos engineering”
19. Clear Ownership and Governance
• Tools to identify resources and find owners and
administrators
– Tags are your friends... remember the power of a mission-focused
tagging taxonomy, rigorously enforced!
• Procedures to engage owners and administrators
• Procedures to engage your CSP (AWS)
• Don’t create policies and procedures you are not willing
and able to enforce!
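As a hedged sketch of "tools to identify resources and find owners", the helper below reads owner tags out of a response shaped like the one returned by the EC2 `describe_instances` call. The tag key `Owner` is an assumption for illustration; substitute whatever your tagging taxonomy defines.

```python
from typing import Dict, List, Optional


def owners_by_instance(describe_instances_response: dict,
                       tag_key: str = "Owner") -> Dict[str, Optional[str]]:
    """Map each instance ID to the value of its owner tag (None if missing)."""
    owners: Dict[str, Optional[str]] = {}
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Tags arrive as a list of {"Key": ..., "Value": ...} pairs.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            owners[instance["InstanceId"]] = tags.get(tag_key)
    return owners


def instances_without_owner(owners: Dict[str, Optional[str]]) -> List[str]:
    """Instances that violate the tagging policy and need follow-up."""
    return sorted(iid for iid, owner in owners.items() if not owner)
```

With boto3 installed and credentials configured, the input could come from `boto3.client("ec2").describe_instances()`. Running `instances_without_owner` on a schedule is one way to enforce the taxonomy before an incident forces the question.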
20. Take Advantage of Your CSP for IR in the Cloud
• “How can I leverage all the tools the CSP (AWS) makes
available to me?”
• “When do I need to engage my CSP (AWS) for
support?”
21. Apply DEV-SEC-OPS to Incident Response…
• Leverage what you ALREADY DO WELL!
• Start small and grow incrementally
• Build an IR “flight simulator” in the Cloud
• Schedule IR scenario planning and prioritization sessions
• Run your first incident response simulation (IRS) in the
cloud—AWS can help!
Iterate and improve!
Build/run another IRS…improve
Build/run another IRS…improve…repeat
22. Building an IRS Scenario Catalog
Incident Response Simulation How To
• Identify an issue of importance (historical or “What if?”)
• Leverage skilled users, security, and operations people
• Build a realistic simulation
• Invite other stakeholders
• Run the simulation live
• Complete an after action “hot wash”
• Identify how to improve and repeat!
• AWS is here to help!
23. Good Incident Response Is NO ACCIDENT!
Build for Speed, Agility, and Security… Use the Ecosystem!
• Real-time metrics and automation everywhere!
• Lightweight governance with delegation of decision-making/enforcement
• Develop thresholds for security engagement in support
processes
• Develop rapid security escalations for access
• Utilize secure communications for incidents with ability
to verify and authorize actions
24. Logs→Metrics→Alerts→Actions
[Diagram: monitoring and alerting pipeline]
• Log and event sources: AWS Config, AWS CloudTrail (API calls from/for most services), Amazon VPC Flow Logs, Amazon EC2 OS logs
• Metrics: CloudWatch/CloudWatch Logs (monitoring data from AWS services, custom metrics)
• Alerts: CloudWatch alarms
• Actions: Amazon SNS (email, HTTP/S, SMS, and mobile push notifications), Amazon SQS, AWS Lambda (Lambda function)
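The last hop of that pipeline (alarm → SNS → Lambda) might look like the sketch below. It assumes a CloudWatch alarm publishes to an SNS topic that invokes the function; the field names (`Records`, `Sns`, `Message`, `AlarmName`, `NewStateValue`, `NewStateReason`) follow the standard SNS and CloudWatch alarm notification formats, while the response action itself is left as a placeholder print.

```python
import json


def parse_alarm_notifications(event: dict) -> list:
    """Extract alarm name, state, and reason from an SNS-triggered Lambda event."""
    alarms = []
    for record in event.get("Records", []):
        # The CloudWatch alarm payload is a JSON string inside the SNS message.
        message = json.loads(record["Sns"]["Message"])
        alarms.append({
            "alarm": message.get("AlarmName"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
        })
    return alarms


def lambda_handler(event, context):
    """Act only on alarms that have entered the ALARM state."""
    firing = [a for a in parse_alarm_notifications(event) if a["state"] == "ALARM"]
    for alarm in firing:
        # Placeholder action: a real run book step (isolate, snapshot, page) goes here.
        print(f"IR action needed for alarm {alarm['alarm']}: {alarm['reason']}")
    return {"alarms_in_alarm_state": [a["alarm"] for a in firing]}
```

Keeping the parsing separate from the handler makes the function easy to exercise in an incident response simulation with canned events.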
25. When to Engage AWS?
Engage AWS Support any time an event may be occurring
that affects your ideal operational state
26. When Do I Contact AWS Security?
Obtaining permission to perform penetration testing/scanning
Reporting security vulnerabilities
Reporting suspicious emails
Reporting abuse of AWS resources
28. Engaging Human Support
ITIL Role: Incident Analyst
AWS Role: Cloud Support Engineer (CSE)
Responsibilities:
● Initial support and classification of concerns
● Owns issues, monitors, tracks, and communicates during the issue management process
● Resolves and supports recovery of concerns not requiring escalation to an AWS service subject matter expert
● Escalates concerns to the AWS service subject matter experts (as required)
● Can close issue-related cases when consensus is reached with the customer
ITIL Role: Incident Manager
AWS Role: Technical Account Manager (TAM)
Responsibilities:
● Monitors issue details from an AWS internal perspective on the customer’s behalf
● Investigates and diagnoses concerns, and coordinates between the customer, AWS Cloud Support Engineers, and AWS
subject matter experts. These engagements can be via videoconference, telephone conference, or any method the
customer chooses.
● Monitors customer-requested escalations for concerns and provides a conduit for customers to engage internal AWS
subject matter experts to meet the objectives of the issue management process
● Drives the efficiency and effectiveness of the AWS issue management process
● Produces customer-specific management information such as metrics, reports, and so on
● Records out-of-scope/intent issues related to service design for consideration toward future service releases and
improvements
ITIL Role: Subject Matter Expert
AWS Role: AWS Service Subject Matter Expert
Responsibilities:
● Analyzes concerns to identify service restoration actions to be taken
● Conducts event resolution actions to restore services to customers
● Assists issue management staff with assessing the impact of any events
30. Good IR Is NO ACCIDENT!
Build for Speed, Agility, and Security… Use the Ecosystem!
• Build securely and verify before deployment (provisioning enclave)
• Build in monitoring, metrics, alerts, and messaging
• Proactively analyze and preserve data
- Resource configs, logs, volatile memory, snapshots
• Build forensics AMIs, SGs, storage, and isolated subnets
• Build for rapid recovery—automate!
• Regularly run incident response simulations (IRS)—iterate and improve!
• Incidents DO NOT HAVE TO BE DISASTERS!
31. Common Objections
• “Running IR simulations is expensive and high risk…we
can’t afford to do ‘live fire’ exercises!”
• “I am an understaffed, interrupt driven ops
organization. I do not have time for drills.”
• “What if we fail? We could look bad.”
32. Why You Should Do It…
• If you already do it…just include your cloud!
• Helps you understand your AWS environment!
• Augments training and readiness—troops fight like
they train
• Fixes real issues and helps build a culture of continuous
improvement
• Helps build your own expertise and improve response
• Helps meet your security requirements
• Cloud allows you to execute quickly and economically
• Can you afford not to?
33. AWS Security Resources
AWS Compliance
https://aws.amazon.com/compliance/
AWS Security Blog
http://blogs.aws.amazon.com/security/
AWS Security Center
https://aws.amazon.com/security
Contact the AWS security team
aws-security@amazon.com
34. Other Incident Response Resources
SANS Reading Room, Incident Response
http://www.sans.org/reading-room/whitepapers/incident
FIRST
http://www.first.org/resources/guides
CERT, Incident Management
http://www.cert.org/incident-management/publications/
36. Johns Hopkins Applied Physics Laboratory
§ Division of Johns Hopkins University
§ University Affiliated Research Center
§ Technically skilled and operationally oriented
§ Objective and independent
§ Critical contributions to critical challenges
§ DoD
§ NASA
§ DHS
§ IC
37. What APL Missions Require
Reliable and elastic infrastructure
Scalable computing and storage – Medical image processing, big-
data analysis, machine learning
… With agility! (noun, “ability to move quickly and easily”)
Preconfigured and bootstrapped machine images, scripting,
templates to build cloud infrastructure via automation
… While maintaining security and governance
Multifactor authentication, security groups, access controls, data
encryption, secure monitoring, notifications, incident response
… And compliance to laws and regulations for sensitive data
FOUO/CUI (DoD) commercial and GovCloud, HIPAA (Medical)
38. What APL Cloud Team Provides
§ IT cloud team works closely with
APL mission areas to provide
cloud computing services and
infrastructure
§ Designs and architects network
and security enterprise wide
§ Creates the structure for security
monitoring and incident
response
39. Comparing Incident Response Contexts
IR-4 “Incident Handling”: Life’s a breach!
§ [intention: usually malicious]
§ Identification: Difficult detection/evasive tactics, exploits
§ Eradication: Can be difficult to locate all footholds; incomplete
§ Follow-Up: Lots of lessons learned
IR-9 “Information Spillage Response”: Cleanup on aisle 9!
§ [intention: usually inadvertent]
§ Identification: Always notified
§ Eradication: Fairly standard wipe (data sanitization) DoD processes
§ Follow-Up: Lots of official paperwork
40. Incident Response Approach
Preparation → Identification → Containment → Investigation → Eradication → Recovery → Follow-Up
* Applies to all types of IR, including IR-4 (breaches) and IR-9 (spills)
41. Preparation
§ Train incident handlers to respond to cloud-specific events
§ Ensure logging is enabled
– VPC Flow Logs, AWS CloudTrail, AWS Config, Amazon Simple
Notification Service (Amazon SNS) notifications
– OS and application logs from Amazon EC2 instances
§ Collect and aggregate the logs centrally for correlation and
analysis
– Example: Amazon CloudWatch, Amazon Elasticsearch Service, or a
security information and event management (SIEM) product (such
as Splunk)
If prevention is better than cure …
preparation is better than eradication
42. Preparation
§ Use Amazon Elastic Block Store (Amazon EBS) encryption when
creating EBS volumes
– Equivalent to full disk encryption (FDE) on corporate laptops
§ Use Amazon S3 server-side encryption (SSE) for objects landing in
Amazon S3
– Amazon S3-managed keys (SSE-S3): easiest key management
– AWS Key Management Service (AWS KMS)-managed keys (SSE-KMS): additional benefits
– SSE with customer-provided keys (SSE-C): customer manages the keys
Encryption in the preparation phase renders data as CIPHERTEXT
- Huge advantage for spillage cleanup in the eradication phase
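To make the three S3 options concrete, here is a minimal sketch that builds the keyword arguments for a boto3 `put_object` call under each mode. The parameter names (`ServerSideEncryption`, `SSEKMSKeyId`, `SSECustomerAlgorithm`, `SSECustomerKey`) are the ones boto3 uses; the helper name, mode labels, and any bucket or key values are assumptions for illustration.

```python
from typing import Optional


def sse_put_object_args(bucket: str, key: str, body: bytes, mode: str = "SSE-S3",
                        kms_key_id: Optional[str] = None,
                        customer_key: Optional[bytes] = None) -> dict:
    """Build put_object arguments for one of the three SSE modes."""
    args = {"Bucket": bucket, "Key": key, "Body": body}
    if mode == "SSE-S3":
        # Amazon S3 creates and manages the keys.
        args["ServerSideEncryption"] = "AES256"
    elif mode == "SSE-KMS":
        args["ServerSideEncryption"] = "aws:kms"
        if kms_key_id:
            # Without an explicit key ID, the AWS managed key for S3 is used.
            args["SSEKMSKeyId"] = kms_key_id
    elif mode == "SSE-C":
        if customer_key is None:
            raise ValueError("SSE-C requires a customer-provided key")
        # The customer supplies the key on every request; S3 does not store it.
        args["SSECustomerAlgorithm"] = "AES256"
        args["SSECustomerKey"] = customer_key
    else:
        raise ValueError(f"unknown SSE mode: {mode}")
    return args
```

With boto3 available, the result could be passed as `boto3.client("s3").put_object(**args)`. Choosing the mode at write time is what determines which keys must later be deleted during spillage cleanup.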
43. Preparation: multi-account isolation and policy enforcement with AWS Organizations
§ Enforces the “separation of duties” principle
§ Limits the blast radius in the event of compromise
§ Organize accounts along business lines or mission areas
§ Use overarching service control policies (SCPs) to apply restrictive
policies to member accounts
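As one hedged example of a restrictive SCP, the sketch below builds a policy document that denies tampering with CloudTrail logging in member accounts. The choice of actions is an assumption made for illustration (the session does not list specific SCPs); the document structure follows the standard IAM policy grammar that SCPs use.

```python
import json


def deny_logging_tampering_scp() -> dict:
    """Example SCP: member accounts cannot stop or delete CloudTrail trails."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                ],
                "Resource": "*",
            }
        ],
    }


def scp_as_json() -> str:
    """Serialize the policy for pasting into the console or an API call."""
    return json.dumps(deny_logging_tampering_scp(), indent=2)
```

An SCP like this keeps the logs that incident responders depend on intact even if an account's own administrators are compromised.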
44. AWS Orgs: Example Layout
[Diagram: example layout showing OUs, projects, and SCPs. Labels: Business Sectors 1 through 6; Projects A, B, and C]
45. Identification (also known as “detection”)
IR-4 “Incident Handling”
§ Use behavior-based rules for detection and searching
– CloudWatch rules
– SIEM tools, such as Splunk for AWS, or Amazon Elasticsearch Service
(Kibana visualizations)
IR-9 “Information Spillage Response”
§ Usually notified about which user accounts and systems have data that
need “cleaning up”
§ Can use data loss prevention (DLP) or the new service “Amazon Macie”
§ Open a spillage case # with AWS Business Support for cross-validation
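A detection rule of the "CloudWatch rules" kind can be expressed as an event pattern. The sketch below shows one such pattern (matching CloudTrail-recorded attempts to stop or delete a trail) plus a deliberately simplified local matcher for testing patterns against sample events. The pattern fields follow the CloudWatch Events format for "AWS API Call via CloudTrail" events; the choice of what to detect and the matcher itself are assumptions, and the matcher supports exact-value matching only.

```python
# Event pattern: each listed field must equal one of the allowed values.
LOGGING_TAMPER_PATTERN = {
    "source": ["aws.cloudtrail"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["cloudtrail.amazonaws.com"],
        "eventName": ["StopLogging", "DeleteTrail"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Fields that appear in the event but not in the pattern are ignored, so a pattern stays valid as events gain detail. Testing patterns locally against recorded events is a cheap way to validate detection rules before an incident response simulation.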
47. Containment and Investigation
IR-4 “Incident Handling”
§ Multiple use cases for live-box and dead-box isolation and forensics
§ Investigation is complex: correlation, threat intelligence, timeline analysis
§ Beyond the scope of this presentation
IR-9 “Information Spillage Response”
§ Closer to live-box forensics
§ Investigation is easier: usually limited to known users and host machines
§ Isolation using a security group
– Via console or automation for speed (see example below)
Containment isolation:
§ Save the current security group of the host or instance
§ Isolate the host using restrictive ingress and egress security group rules
CLI> aws ec2 modify-instance-attribute --instance-id <instance-id> \
     --groups "<Isolation-SG>"
§ Isolation-SG: only SSH (22) or RDP (3389) ingress rules with the IR enclave as source. No egress rules
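The save-then-isolate steps, and the later restore, can be automated along these lines. This is a sketch under assumptions: `ec2` is expected to behave like a boto3 EC2 client (only `describe_instances` and `modify_instance_attribute` are used), the isolation security group already exists, and the function names are hypothetical.

```python
from typing import List


def isolate_instance(ec2, instance_id: str, isolation_sg_id: str) -> List[str]:
    """Swap an instance onto the isolation security group.

    Returns the original security group IDs so they can be restored later.
    """
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    # Save the current groups BEFORE changing anything.
    original_group_ids = [g["GroupId"] for g in instance["SecurityGroups"]]
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])
    return original_group_ids


def restore_instance(ec2, instance_id: str, original_group_ids: List[str]) -> None:
    """Put the instance back on the security groups saved during isolation."""
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=list(original_group_ids))
```

In practice the saved group IDs should be written somewhere durable (for example the incident ticket) rather than held only in memory, since recovery may happen hours or days after containment.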
49. Eradication
§ If Amazon Elastic Block Store (Amazon EBS) encryption was used for
volumes:
– Delete the spilled file
– Create a new encrypted volume, copying all the good files (minus the spillage)
– Delete the affected encrypted volume and delete the key used to encrypt it
§ If Amazon S3 server-side encryption (SSE) was used for Amazon S3 objects:
– If Amazon S3-managed keys (SSE-S3) were used: simply delete the object!
– If AWS KMS-managed (SSE-KMS) or customer-provided (SSE-C) keys were used:
delete the object and the customer master keys (CMKs) used to
encrypt it
If encryption was used during preparation …
it’s as simple as deleting objects and keys
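The eradication decision above can be captured as a small lookup that a run book or automation could call. The sketch below only restates the steps from the slides as data; the function name and the `storage`/`encryption` labels are assumptions, and cases the session does not cover (such as unencrypted S3 objects) raise an error rather than guess.

```python
from typing import List


def eradication_steps(storage: str, encryption: str) -> List[str]:
    """Return the spillage cleanup steps for a storage type and encryption mode."""
    if storage == "ebs":
        if encryption == "ebs-encrypted":
            return [
                "delete the spilled file",
                "create a new encrypted volume and copy the good files to it",
                "delete the affected volume and the key used to encrypt it",
            ]
        if encryption == "none":
            # Without encryption, fall back to sanitizing the volume in place.
            return [
                "copy an authorized sanitization tool to the affected host",
                "connect to the host via SSH or RDP",
                "wipe the spilled files and slack space per the governing authority",
            ]
    if storage == "s3":
        if encryption == "SSE-S3":
            return ["delete the object"]
        if encryption in ("SSE-KMS", "SSE-C"):
            return ["delete the object", "delete the keys used to encrypt it"]
    raise ValueError(f"no documented procedure for {storage!r} with {encryption!r}")
```

Encoding the procedure this way makes the gap visible: any combination that raises `ValueError` is a case to resolve during preparation, not during an incident.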
50. Eradication
• Copy DoD (or authorized) sanitization tools to affected EC2 hosts
IR-Net# scp -i "host-private-key" bcwipe.exe ec2-user@TargetHost.amazonaws.com:[root_volume/bcwipe.exe]
• Connect remotely to the host via SSH (port 22) or RDP (port 3389) to perform
sanitization actions
IR-Net# ssh -i "host-private-key" ec2-user@TargetHost.amazonaws.com
• Once on the target host, wipe files and slack space, per authority (example: DoD 5220.22-M)
AffectedHost# bcwipe <spilled file>
If encryption was NOT USED during preparation …
you may be able to sanitize EBS volumes only
51. Recovery
§ Restore network access to its original state (prior to isolation)
– Restore the previous security group ingress and egress rules
CLI> aws ec2 modify-instance-attribute --instance-id <instance-id> \
     --groups "<Original Security Group>"
Follow-Up
§ Verify deletion of data encryption keys (if EBS or Amazon S3 encryption was
used)
§ Cross-validate with the AWS Support case #
§ Report spillage findings and response actions
– In accordance with DoD 5220 or appropriate authorities
52. Takeaways
§ Understand the differences between IR-4 (threat based) and IR-9 (spills)
and plan the handling and response accordingly
§ Use a phased approach for IR: Create well-defined steps and operational
procedures, including training for the response teams
§ The preparation step is critical
– Use encryption for EBS volumes, Amazon S3 storage, and wherever possible
– Use AWS Organizations to separate projects/functions and limit the blast radius
– Enable all critical logging mechanisms (EC2 OS, AWS CloudTrail, VPC Flow Logs)
– Create detection rules in Amazon CloudWatch, Amazon ES, or a third-party SIEM
§ Use the AWS CLI or SDKs, especially for quick “containment”, such as applying
predefined restrictive security groups