2. v
Your first stop
Subscribe to the RSS feed, and consider integrating it into your
operations dashboard. If the status page reports no issues, the
problem is unlikely to be with the AWS service itself.
http://status.aws.amazon.com/
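A minimal sketch of pulling feed entries into a dashboard, using only the Python standard library. The function names and the sample feed are illustrative assumptions; the slide gives only the status page URL, so the actual per-service feed URL is left for the caller to supply.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_status_feed(xml_text):
    """Return (title, pubDate) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        published = (item.findtext("pubDate") or "").strip()
        events.append((title, published))
    return events


def fetch_status_feed(feed_url, timeout=10):
    """Download and parse a feed. feed_url is supplied by the caller."""
    with urllib.request.urlopen(feed_url, timeout=timeout) as resp:
        return parse_status_feed(resp.read())


# Made-up sample document, used here only to exercise the parser offline.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example status feed</title>
<item><title>Service is operating normally</title>
<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, published in parse_status_feed(SAMPLE_FEED):
        print(published, "-", title)
```

An empty result means the feed has no entries, which matches the slide's point: look elsewhere before suspecting the AWS service.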
3. v
Your second stop
Amazon CloudWatch
Provides metrics across AWS services
that are not available to external
monitoring systems.
Detailed monitoring should be enabled.
Proactively configure notifications for
threshold alarms.
Use custom metrics.
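As one possible way to configure a threshold alarm, the sketch below builds the arguments for the CloudWatch `put_metric_alarm` API call. The alarm name, the 80% threshold, the five evaluation periods, and the SNS topic are assumptions chosen for illustration, not values from the slides.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # 1-minute datapoints need detailed monitoring enabled
        "EvaluationPeriods": 5,  # assumed: alarm after 5 consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notification target, e.g. an SNS topic
    }


def create_cpu_alarm(instance_id, sns_topic_arn):
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**cpu_alarm_params(instance_id, sns_topic_arn))
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and test without touching an AWS account.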
4. v
Troubleshooting – EC2
• Instance Launch
• EC2 Instance Health
• EC2 Instance Network connectivity
• EBS issues
5. v
Troubleshooting Instance launch
• Potential causes
• Account limits issue
• IAM user issue
• AutoScaling event terminating instance
• Bad/Corrupted AMI configuration
• Storage attachment issues
• AWS Infrastructure issues
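When an instance terminates right after launch, the `StateReason` field in a `describe_instances` response is often a useful first clue. The sketch below reads that field from an already fetched response dictionary; the mapping from reason codes to the causes listed above is an assumption made for illustration and is far from complete.

```python
# Assumed, partial mapping from EC2 state-reason codes to the causes above.
LIKELY_CAUSE = {
    "Client.VolumeLimitExceeded": "account limits issue",
    "Client.InvalidSnapshot.NotFound": "bad AMI or storage attachment issue",
    "Server.InsufficientInstanceCapacity": "AWS infrastructure issue",
    "Server.InternalError": "AWS infrastructure issue",
}


def launch_findings(describe_instances_response):
    """Return (instance_id, reason_code, likely_cause) for instances
    that carry a StateReason in a describe_instances response."""
    findings = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            reason = instance.get("StateReason")
            if not reason:
                continue  # running instances normally have no state reason
            code = reason.get("Code", "")
            cause = LIKELY_CAUSE.get(code, "unclassified")
            findings.append((instance["InstanceId"], code, cause))
    return findings
```

Codes that fall through as "unclassified" (for example a user-initiated shutdown) still deserve a look at the accompanying `Message` text.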
6. v
Troubleshooting EC2 Instance Health
• Potential causes
• EBS volume snapshot in progress
• Cloud-init user-data script failures
• Instance metadata access issues
• OS filesystem issues
• Kernel issues
• Underlying AWS infrastructure issues
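EC2 reports two status checks per instance, and which one fails helps narrow the list above. The sketch below interprets an already fetched `describe_instance_status` response; the function name and the verdict wording are assumptions, and the split between "infrastructure" and "OS-level" causes is a rule of thumb rather than a diagnosis.

```python
def classify_status_checks(describe_instance_status_response):
    """Map each instance ID to a rough verdict based on its status checks."""
    verdicts = {}
    for entry in describe_instance_status_response.get("InstanceStatuses", []):
        system = entry.get("SystemStatus", {}).get("Status")
        instance = entry.get("InstanceStatus", {}).get("Status")
        if system == "impaired":
            verdict = "system check failed: suspect underlying AWS infrastructure"
        elif instance == "impaired":
            verdict = "instance check failed: suspect OS, kernel, filesystem or boot scripts"
        elif system == "ok" and instance == "ok":
            verdict = "both checks pass"
        else:
            # e.g. "initializing" or "insufficient-data"
            verdict = "checks not conclusive yet"
        verdicts[entry["InstanceId"]] = verdict
    return verdicts
```

A failed instance check with a passing system check points toward the console output and boot logs rather than a support case.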
7. v
Troubleshooting Network Connectivity
• Potential causes
• CPU, Memory or I/O utilization of instance
• Number of active connections exceeding capacity / limits / memory.
• AWS EC2 security groups, network ACLs, or routing misconfiguration.
• Instance OS firewall blocking connectivity.
• SSH keys lost or misconfigured.
• Network carrier issues.
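A quick TCP probe can help separate some of these causes. In the sketch below (the function name and return labels are assumptions), a timeout often suggests packets are being dropped silently, as security groups and network ACLs tend to do, while a refused connection suggests the host is reachable but nothing is accepting on that port. Treat both readings as hints only.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection and summarise the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but no service accepted the connection.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


if __name__ == "__main__":
    # Port 22 (SSH) on localhost is just an example target.
    print(probe_tcp("127.0.0.1", 22))
```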
9. v
ELB Troubleshooting – API Response
• OutofService: A Transient Error Occurred
Internal ELB error. Retry the API call, and raise a support case if it keeps failing.
• CertificateNotFound: undefined
• ELBs scale with load and span AZs, so a newly created certificate can take
some time to sync. Retry the API call after a short wait.
• The certificate you are trying to use was not found, or is not well formed.
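Both errors above call for retrying the API call. A minimal retry helper with exponential backoff might look like the sketch below; the helper name, attempt count, and delays are assumptions, and real code would normally catch only the specific retryable exception rather than every `Exception`.

```python
import time


def retry_api_call(call, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Run call() until it succeeds, waiting 2s, 4s, 8s... between failures.

    Re-raises the last error once all attempts are used up, which is the
    point at which a support case is worth raising.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:  # narrow this to retryable errors in practice
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use it can be left at its default.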
10. v
ELB Troubleshooting – Error Messages
• HTTP 400: BAD_REQUEST - Client sent a bad request.
• HTTP 405: METHOD_NOT_ALLOWED - Length of the method in the request
header exceeds 127 characters.
• HTTP 408: Request Timeout - Indicates that the client cancelled the request
or failed to send a full request.
• HTTP 502: Bad Gateway - Indicates that the load balancer was unable to
parse the response sent from a registered instance.
• HTTP 503: Service Unavailable or HTTP 504 Gateway Timeout
• Insufficient capacity in the load balancer to handle the request.
• Registered instances closing the connection (KeepAlive issues)
• No registered instances, or no healthy instances
• Connection to the client is closed
• The instance's security group does not allow communication with load balancer.
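The error messages above can be kept as a small lookup table in a runbook script. The sketch below simply encodes the slide's list; the table and function names are illustrative assumptions.

```python
# Causes shared by HTTP 503 and HTTP 504, as listed on the slide.
_UNAVAILABLE_OR_TIMEOUT = [
    "Insufficient capacity in the load balancer to handle the request",
    "Registered instances closing the connection (KeepAlive issues)",
    "No registered instances, or no healthy instances",
    "Connection to the client is closed",
    "Instance security group does not allow communication with the load balancer",
]

ELB_ERROR_CAUSES = {
    400: ["Client sent a bad request"],
    405: ["Length of the method in the request header exceeds 127 characters"],
    408: ["Client cancelled the request or failed to send a full request"],
    502: ["Load balancer could not parse the response from a registered instance"],
    503: _UNAVAILABLE_OR_TIMEOUT,
    504: _UNAVAILABLE_OR_TIMEOUT,
}


def explain_elb_status(code):
    """Return the likely causes for an HTTP status code returned by the ELB."""
    return ELB_ERROR_CAUSES.get(code, ["Not covered by this table"])


if __name__ == "__main__":
    for cause in explain_elb_status(503):
        print("-", cause)
```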
11. v
ELB Troubleshooting – Response Metrics
• HTTPCode_ELB_4XX - Indicates a malformed or a cancelled request from
the client.
• HTTPCode_ELB_5XX - Either the load balancer or the registered instance is
causing the error or the load balancer is unable to parse the response.
• HTTPCode_Backend_2XX - Indicates a normal, successful response from
the registered instance(s).
• HTTPCode_Backend_3XX - Indicates some type of redirect response sent
from the registered instance(s).
• HTTPCode_Backend_4XX - Indicates some type of client error response
sent from the registered instance(s).
• HTTPCode_Backend_5XX - Indicates some type of server error response
sent from the registered instance(s).
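One way to use these metrics is to check which counters are non-zero and start with the side that is producing the errors. The sketch below assumes the metric sums for a period have already been fetched from CloudWatch into a dictionary; the function name and the ordering of findings are assumptions.

```python
def triage_elb_metrics(sums):
    """Turn per-period metric sums into ordered troubleshooting hints.

    sums maps metric names such as "HTTPCode_ELB_5XX" to their Sum value;
    missing metrics are treated as zero.
    """
    findings = []
    if sums.get("HTTPCode_ELB_5XX", 0) > 0:
        findings.append("ELB 5XX: check load balancer capacity, instance health, and response parsing")
    if sums.get("HTTPCode_Backend_5XX", 0) > 0:
        findings.append("Backend 5XX: registered instances are returning server errors")
    if sums.get("HTTPCode_ELB_4XX", 0) > 0:
        findings.append("ELB 4XX: clients are sending malformed or cancelled requests")
    if sums.get("HTTPCode_Backend_4XX", 0) > 0:
        findings.append("Backend 4XX: registered instances are returning client errors")
    if not findings:
        findings.append("No error counters are non-zero")
    return findings
```

Server-side findings are listed first on the assumption that they are usually the more urgent ones.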
12. v
ELB Troubleshooting – Other issues
• Load balancer health check failure
• Instance(s) closing the connection to the load balancer. (The ELB idle timeout defaults to 60 seconds.)
• Responses timing out.
• Non-200 response received
• Failing public key authentication.
• Registering instances taking longer than expected to be In Service.
13. v
ELB Troubleshooting – Potential problems
• Possible causes
• Clients caching the IP addresses returned by DNS lookups
• The number of back-end instances per AZ can be imbalanced
• Sticky sessions
• Request processing time - requests that can take a long time to process
• Unhealthy hosts
• Timeout settings (keep alives)
• NACLs and SGs
• ELB fails to scale on spiky traffic
• SSL/Certificate issues
15. v
Information required in support case
• Resource IDs of all resources involved in the problem description or
diagnosis steps.
• Instance types, AZ locations, AMI, storage configuration, etc. for any
patterns or trends.
• Exact times the problem began or stopped occurring, and the frequency of
occurrence if it repeats.
• Instance console or ELB logs/error logs
• Troubleshooting steps performed to date, protocols used, etc.
16. v
What next?
• Run books / Play books
• Automation
• Monthly internal reviews and documented mitigation strategies
• Continual improvement plan
• Collaboration and Communication with AWS TAM and SA
• Well Architected review
17. v
Resources and References
• Documentation
• Amazon Elastic Compute Cloud (EC2) – User Guide
• Elastic Load Balancing – Developer Guide
• Training
• AWS Certified SysOps Administrator - Associate
• AWS Certified Solutions Architect - Associate
EC2 Instance Health - Issues of EC2 instance failing to boot or launch.
EC2 Instance Network connectivity - Issues with inability to connect to instance.
EC2 Storage Devices & File Systems - Issues with storage devices, RAID configurations, storage type conversion.
Linux OS - Issues with Linux Kernel and OS configuration.
Linux Applications - Issues with Application running on EC2 Linux.
EC2 & Linux Performance - Issues with EC2 Linux & Storage Performance and monitoring.