This document describes a social media aggregation and recommendation application developed for a large client. It discusses how moving the application to AWS with Chef configuration management improved performance, reduced costs, and made the environment easier to manage and replicate. Key benefits included being able to quickly scale out stateless APIs, ensure consistent configurations, create staging environments in under a day, and reduce costs by only running non-production environments for 8 hours/day. The document also provides examples of infrastructure deployment and management commands using AWS services like EC2, ELB, CloudFormation, and lessons learned around high availability, performance testing, and instance sizing.
3. Description
●
Social feed aggregation/recommendation app
●
The client is a Global Fortune 500 company that makes video consoles, TVs and, many years ago, Walkmans…
●
Around 1,000,000 new registered users and 170,000 DAU expected on the platform by the end of 2013
●
All servers run in AWS; deployments and configuration management are handled by Chef.
4. System stats
Main Components
– Custom API (Java)
– Beanstalk
– RabbitMQ
– Redis
– MongoDB (sharded)
EC2
– Production env: reserved instances for the minimum configuration; on-demand instances for scale-out.
– Staging env: reserved instances, up half the day
– Elastic Load Balancers
– Security Groups and ACLs
– Key pairs per subnet
– Current EC2 region is US East
6. Architecture/Infrastructure
[Diagram: five VPC subnets, each with its own security group — DEV; Stage APP; Stage DB; Prod APP; Prod DB. A public subnet hosts DNS/VPN, Nexus, the Git server, Chef, Jenkins and a Nagios forwarder; each environment has its own NAT (DEV NAT, Stage NAT, two Prod NATs). Web servers sit behind ELB 1 in Stage and ELB 1 in Prod.]
8. Improvements achieved (I)
●
APIs are stateless, so you can scale out very easily. Nodes are created by Chef (knife).
●
Tight integration with Chef: ensure that you have the same configuration in all environments and avoid misconfigurations in production. Chef bootstrap of EC2 instances works fully integrated with knife.
●
A quick, reliable way to create an exact production mirror (staging) environment with Chef and CloudFormation:
– Before AWS/Chef → creating a staging env took 6 weeks
– After AWS/Chef → creating a staging env takes less than 1 day
9. Improvements achieved (II)
● Cost savings from managing non-production environments:
– Before AWS/Chef → environments up 24×7
– After AWS/Chef → environments up 8 hours on working days (cron scripts which use the API tools)
– Python script example
● Outage recovery plan handled with node snapshots (MongoDB) or Chef (the other nodes are stateless)
● Very quick response and customized consulting for the project provided by the Amazon team.
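The Python script itself is not reproduced in the deck; as a hypothetical sketch of its scheduling logic (the function name and the 09:00–17:00 working window are assumptions, not the project's actual values), the cron-driven decision of whether a non-production environment should be up could look like this:

```python
from datetime import datetime

# Assumed working window for non-production environments: 09:00-17:00, Mon-Fri.
WORK_START_HOUR = 9
WORK_END_HOUR = 17

def env_should_be_up(now: datetime) -> bool:
    """Return True if a non-production environment should be running now."""
    is_working_day = now.weekday() < 5            # Monday=0 ... Friday=4
    in_work_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return is_working_day and in_work_hours

# A cron job would call this and then start or stop the environment's
# instances, e.g. via the EC2 API tools (ec2-start-instances / ec2-stop-instances).
```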
10. Example: create a new node
Staging example with dynamic IP (DHCP)
knife ec2 server create -I ami-af71f8c6 -r "role[apache]" -f m1.medium --region us-east-1 -S scp-staging -i /Users/juanvi/keypairs/scp-staging.pem -g sg-2418e54b -s subnet-919cecfc -x ec2-user -N stapp-apache-Test -E staging
Staging example with static IP
ec2-run-instances ami-af71f8c6 -k vpc-public-10-234-1 -g sg-379e6d58 -s subnet-cb9596a0 -t m1.xlarge --private-ip-address 10.234.2.204
knife bootstrap 10.234.2.204 -i /Users/juanvi/keypairs/scp-staging.pem -r "role[webserver]" -N STAGING-public-webserver2 -x ec2-user -E staging --sudo
11. What we have learned
●
Strongly recommended to run servers in more than one availability zone to avoid total downtime in case of an outage (us-east-1a and us-east-1d)
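As an illustrative sketch of that recommendation (the helper and node names are assumptions, not the project's actual tooling), spreading API nodes round-robin across the two zones guarantees neither zone holds the whole fleet:

```python
from itertools import cycle

# The two zones the slide recommends.
AVAILABILITY_ZONES = ["us-east-1a", "us-east-1d"]

def assign_zones(node_names):
    """Round-robin nodes across AZs so a single-AZ outage
    never takes down the entire tier."""
    zones = cycle(AVAILABILITY_ZONES)
    return {name: next(zones) for name in node_names}
```

For example, `assign_zones(["api-1", "api-2", "api-3"])` places api-1 and api-3 in us-east-1a and api-2 in us-east-1d.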
12. What we have learned (II)
● For certain services, balance with TCP instead of HTTP. Balancing requests to our API nodes over TCP internally solved problems we had with HTTP requests that never closed their sessions. We only use HTTP balancing for requests that reach the public Apache.
● We noticed that many Apache connections were not closed properly in HTTP balancing mode, and at some hours we reached the connection limit. Solved with TCP balancing mode in the ELB.
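In boto (the Python AWS library of that era), ELB listeners are expressed as (lb_port, instance_port, protocol) tuples, so the split described above can be sketched as data; the port numbers here are illustrative assumptions, not the project's actual configuration:

```python
# ELB listeners as boto-style (lb_port, instance_port, protocol) tuples.
# Internal API tier: TCP balancing avoids the half-closed HTTP sessions
# described above. Public tier: HTTP balancing only on the public Apache.
INTERNAL_API_LISTENERS = [(8080, 8080, "TCP")]
PUBLIC_WEB_LISTENERS = [(80, 80, "HTTP")]

# With boto these would be passed to
# boto.ec2.elb.connect_to_region(...).create_load_balancer(name, zones, listeners)
```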
13. What we have learned (III)
● Use CloudFormation to create network resources automatically.
– Before CloudFormation → create all of the resources one by one
– After CloudFormation → automatically create all the nodes and network resources of an entire environment in one execution
– CloudFormation example
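The deck's CloudFormation example is not reproduced here; a minimal hypothetical template in the same spirit (all names and CIDR ranges are assumptions, loosely echoing the 10.234.x addressing shown earlier) that stands up a VPC, a subnet and a security group in one execution might look like:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal sketch: one VPC, one subnet, one security group",
  "Resources": {
    "StagingVPC": {
      "Type": "AWS::EC2::VPC",
      "Properties": { "CidrBlock": "10.234.0.0/16" }
    },
    "StagingSubnet": {
      "Type": "AWS::EC2::Subnet",
      "Properties": {
        "VpcId": { "Ref": "StagingVPC" },
        "CidrBlock": "10.234.2.0/24"
      }
    },
    "StagingSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Allow SSH from inside the VPC",
        "VpcId": { "Ref": "StagingVPC" },
        "SecurityGroupIngress": [
          { "IpProtocol": "tcp", "FromPort": "22", "ToPort": "22", "CidrIp": "10.234.0.0/16" }
        ]
      }
    }
  }
}
```

One `cfn-create-stack` (or console) run of such a template replaces creating each resource by hand.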
14. What we have learned (IV)
● Analyze performance tests to choose the minimum number of nodes that will run 24×7 and the sizes to reserve instances for. Reserved instances reduce the cost to 2/3.
– Before AWS/Chef → performance tests were limited because servers were too costly to have available; tests were simulated.
– After AWS/Chef → high-powered instances available on demand for just a few hours or days, at a reduced cost
15. What we have learned (V)
● Advisable to use a large number of small instances running close to 100% CPU usage, instead of a few powerful machines with wasted resources, launching new nodes and balancing requests among them when load increases.
● Pre-warm the balancers if you expect an exponential increase in requests.
● Ask support to raise the initial limit on the number of EC2 instances that can run simultaneously (20).
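The scale-out policy above can be sketched as a simple decision rule; the thresholds and minimum below are illustrative assumptions, not the project's actual values:

```python
def nodes_needed(current_nodes: int, avg_cpu_percent: float,
                 scale_out_at: float = 80.0, scale_in_at: float = 30.0,
                 min_nodes: int = 2) -> int:
    """Many small nodes near full CPU: add one when average CPU crosses
    the high threshold, remove one when it falls below the low threshold,
    never dropping under the reserved 24x7 minimum."""
    if avg_cpu_percent >= scale_out_at:
        return current_nodes + 1
    if avg_cpu_percent <= scale_in_at and current_nodes > min_nodes:
        return current_nodes - 1
    return current_nodes
```

New nodes would then be created with knife as in the earlier examples and registered with the ELB.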
16. Things to consider
• You must adapt to the instance sizes on offer, whose resources (CPU, RAM...) are predefined and not customizable
• You have no control over the evolution of the products your service depends on
• You don't have access to the logs of some components (for example, load balancers)
• Danger of lock-in to AWS services, with the consequent difficulty of migrating to another DC.