Rails Operations - Lessons Learned
1. Rails Operations
Lessons learned from deploying and managing hundreds
of Rails applications
Thanks for coming out this morning. I know it’s hungover o’clock,
so it means a lot. You are dedicated, upstanding individuals.
3. • @techpickles
• http://github.com/technicalpickles
• http://technicalpickles.com
I am from the internet
4.
5. Awesomeness Engineer
of Supreme Versatility
II
My official title is Awesomeness Engineer of Supreme Versatility.
(I recently was promoted)
6. Managed hosting and
operations
We’re mostly known for our hosting. What isn’t as well known is our managed services. For this, we engage more closely
with our customers.
When bringing on new managed customers, we work with them to spec out servers and review their application’s needs. We get
them up and running on these servers with our configuration management tool, moonshine. And once deployed, we
provide 24x7 monitoring. If your server goes down, we let you know, and get it back online as soon as possible,
regardless of when it happens.
And that’s not all. Once live, we provide operational support. Anything from application performance analysis,
recommending architecture improvements, installing and managing new software on servers, or just being there to give
feedback on how the application is operating.
You can basically think of us as a Rails Operations company.
7. I’m talking about
Rails Operations
Conveniently enough, I’m talking about Rails Operations today.
8. WTF is
Rails Operations?
I found this hard to distill down to a simple statement.
I think it’s safe to say that the majority of us are developers. We write code, build applications, launch products.
In a lot of organizations, operations is something different. People associate operations with system administration. And
to an extent, this can be fairly accurate. Different people, different teams. As developers, we write some
code, toss it over the wall, and let _them_ handle it.
I think this is a bit flawed. The code you write has an operational impact. The systems you run it on have an
operational impact on your code. It’s a complex relationship, and when developer and operations teams are
separate, it’s hard to bridge the gap between them, since it’s neither’s responsibility.
10. Very important assumption
You develop code that will eventually go into production,
and, as part of some business model, generate revenue
That is to say, you are part of some organization
20. • Lives on servers located in data centers
and clouds
• Real users
• Either code works, or it doesn’t
• Either the application is available or not
21. NO
Just because your application works in production, doesn’t mean
people are using it or buying your product.
22. but it PRESERVES
potential revenue
If you have good operations, that means users will be able to see
your application working and actually be able to use it.
26. • Working features (or at least that work
enough)
• Infrastructure to keep the application up
and running (or at least up enough)
• A business model
• Sheer determination
• Good luck
27. Lessons learned
Alright. I’ve given you a definition of Rails Operations, and had a
brief detour to talk about the business and where development
and operations fit into it.
Now for some lessons. Basically, I’ll be going over some patterns,
some antipatterns, and other practices and topics.
28. Common threads
Putting this all together, I kept coming back to some common
threads. That is, some ideas that apply to many aspects. I’m going
to start you off with a few together, and then just jump into the
lessons. We’ll probably pick up a few more along the way.
29. Give a damn
If you don’t care about what you’re doing, everything else I’m
talking about today probably doesn’t matter. I don’t think you
need to worry about this though, since you are here.
30. Earlier we talked about how operations preserves revenue. To that
end, our goal is to mitigate risk as much as makes sense.
31. Tradeoffs and compromise. Each possible solution has them. The
trick is understanding that there are tradeoffs. What tradeoffs you
make depends on what your priorities are. For example:
* Dollar signs
* Time
* Sanity
* Technical debt
* Higher risk
34. You write code that
manages your servers’
configuration
Take a moment to think about how you might describe a server to someone.
There’s plenty of nouns:
* packages
* users
* files
* cronjobs
* services
And some verbs:
* running commands
35. • apache package is installed
• apache service is running
• deploy user exists
• cron jobs
• etc
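One way to picture those statements as code is a small, made-up Ruby DSL. This is only a sketch: ServerManifest, package, service, user and pending_changes are hypothetical names for illustration, not moonshine's (or any real tool's) API, and the "actual" state is hard-coded rather than read from a server.

```ruby
# Hypothetical sketch: describe the desired server state as data, then
# compare it with what is actually on the machine.
class ServerManifest
  attr_reader :resources

  def initialize(&block)
    @resources = []
    instance_eval(&block) if block
  end

  # Each declaration records a noun (type + name) and the state we want.
  def package(name)
    @resources << { type: :package, name: name, state: :installed }
  end

  def service(name)
    @resources << { type: :service, name: name, state: :running }
  end

  def user(name)
    @resources << { type: :user, name: name, state: :present }
  end

  # Given the observed state, return only the resources that still need
  # work. On a server that already matches, this returns an empty list.
  def pending_changes(actual)
    @resources.reject { |r| actual[[r[:type], r[:name]]] == r[:state] }
  end
end

manifest = ServerManifest.new do
  package "apache"
  service "apache"
  user    "deploy"
end

# Pretend observation: apache is installed but stopped, and the deploy
# user does not exist yet.
actual = {
  [:package, "apache"] => :installed,
  [:service, "apache"] => :stopped
}

manifest.pending_changes(actual).each do |r|
  puts "#{r[:type]} #{r[:name]} should be #{r[:state]}"
end
```

The point of the shape is that you state what should be true, and the tool works out what (if anything) has to change.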
37. Automation
Bootstrapping. Anyone that has setup a new server from scratch
can tell you... it’s time consuming, labor intensive, and error
prone.
Bootstrapping is just part of it though, and it only ever happens
once. What’s more interesting is that you can use this to
manage your infrastructure as it evolves. Need to start using
redis? Just add it to your configuration management, and you’ll
have it next deploy.
38. The best way to illustrate why you should be using configuration management is to explore the
consequences of not using it.
Imagine it’s time to add a new application server. Your application is under heavy load, and needs this
server to be up and serving requests. How long will it take you to get it up? And how will you know it’s
setup correctly? If you’re doing this all manually, you can’t really know the answers to these questions.
Here’s another example. Adding a new dependency to your application. It can be a gem, a native
package, a new daemon, whatever. How do you ensure this gets on the server when you need it?
Deploy and pray? Log into the server and install it yourself? This sucks, and it’s kind of risky, especially if
you’re talking about production.
39. As always, there’s tradeoffs to be made.
Setting up and learning how to do configuration management takes time. Time that could be
spent working on user-facing tasks.
On the other hand, you’re taking on the risk of having to cold deploy, or having deploys fail because of missing
dependencies.
Usually, the balance is to take the risk and have it burn you enough times that it’s more
painful to keep going without configuration management than it is to stop and set it up.
If you already know how to do it, it’s a no brainer. Just DO IT.
43. configuration
management
+
staging servers
=
VERY YES
If you use configuration management, and have staging servers,
then this is a huge win.
We talked about adding new dependencies earlier. If you are
doing configuration management, then staging is the first place
you can see if ur doing it right.
44. There’s basically no downside to using staging servers. The only
tradeoff though is that servers do cost dollar signs and staging
servers are no different. This leads us to a new thread...
45. Maths... look around you. In most cases, you can do some dollar sign math to justify costs of a thing. Let’s try this.
A staging server may cost $60/mo
But how can you calculate the cost of not having a staging server? Let’s assume that if you don’t have a staging server,
you’re bound to do a bad deploy that it could have prevented. Some code that doesn’t work outright, or is otherwise
flawed. Let’s say it causes an hour of downtime while you determine the problem and try to fix it. Do you know how much it
costs your business in lost revenue to be down an hour?
This is actually a pretty mature question, and I’d be surprised if many people can answer it off hand. In any event, I think
we can do some fuzzy math to say yeah, it probably is more than $60. If that’s the case, then one failed deploy a month is
enough to validate a staging server.
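The fuzzy math can be written down in a few lines. This is a sketch with made-up inputs: only the $60/mo figure comes from the slide, and the revenue-per-hour and bad-deploy numbers are placeholders you would swap for your own estimates.

```ruby
# Only the staging cost comes from the talk; everything passed in below
# is a hypothetical estimate.
STAGING_COST_PER_MONTH = 60.0

# Expected monthly loss from bad deploys that staging would have caught.
def expected_downtime_cost(revenue_per_hour:, hours_down_per_bad_deploy:, bad_deploys_per_month:)
  revenue_per_hour * hours_down_per_bad_deploy * bad_deploys_per_month
end

# Hourly revenue at which the staging server exactly pays for itself.
def break_even_revenue_per_hour(hours_down_per_bad_deploy:, bad_deploys_per_month:)
  STAGING_COST_PER_MONTH / (hours_down_per_bad_deploy * bad_deploys_per_month)
end

def staging_pays_for_itself?(**estimate)
  expected_downtime_cost(**estimate) > STAGING_COST_PER_MONTH
end

# Placeholder estimate: $100/hour of revenue, one hour-long bad deploy a month.
estimate = {
  revenue_per_hour: 100.0,
  hours_down_per_bad_deploy: 1.0,
  bad_deploys_per_month: 1.0
}

puts expected_downtime_cost(**estimate)
puts staging_pays_for_itself?(**estimate)
puts break_even_revenue_per_hour(hours_down_per_bad_deploy: 1.0, bad_deploys_per_month: 1.0)
```

With one hour-long bad deploy a month, the break-even point is $60 of revenue per hour; anything above that and the staging server is the cheaper option.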
47. capistrano-gitflow
Whenever possible, I like to enforce standards by means of automation.
For the flow of code from development -> staging -> production, we have capistrano-gitflow.
Originally done up by apinstein, I did some refactorings and cleaned it up enough to be usable as a
gem
Effectively, this enforces development -> staging -> production. Whenever you deploy to staging, it
tags the current branch including information about the date, the user deploying, and a small blurb
about the changes. Assuming this is cool, you can promote a tag to production and go on from there.
If you haven’t deployed to staging yet, you’ll be prompted and it will default to using the last
production tag.
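To make the tagging idea concrete, here is a rough sketch of building a staging tag from the date, the deploying user, and a blurb, then promoting it. The tag format and the staging_tag / production_tag helpers are assumptions for illustration; the gem's real tag format and internals may well differ.

```ruby
# Hypothetical tag format, for illustration only.
def staging_tag(time:, user:, blurb:)
  # Turn the blurb into something safe to use in a git tag name.
  slug = blurb.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  "staging-#{time.strftime('%Y-%m-%d-%H-%M')}-#{user}-#{slug}"
end

# Promotion keeps the identifying details and only swaps the prefix, so
# production gets the same code that was checked on staging.
def production_tag(staging_tag_name)
  staging_tag_name.sub(/\Astaging-/, "production-")
end

tag = staging_tag(
  time: Time.utc(2011, 8, 27, 9, 30),
  user: "deployer",
  blurb: "Fix signup form!"
)

puts tag
puts production_tag(tag)
```

Because the tag carries who, when, and what, you can answer "what is running in production right now?" from the tag list alone.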
49. A play on release early,
release often.
Although technically, I guess it’s the same
It’s basically the same thing we hear in the open source
community.
The sooner you release code, the sooner you can validate it and
the sooner you can get feedback. Does it work? Does it not break
the entire site? Are users happy?
50. By deploying early and often, we’re also limiting risk. The fewer
changes that go out in a single deploy, the fewer things there are
that can possibly break. By waiting to deploy, you’re accumulating
a larger set of changes to deploy, and therefore there’s more
surface area to debug if it breaks.
51. In a way, you can consider undeployed code a liability.
Imagine spending a day or two doing some code cleanups to get ready for a sprint. Should you deploy
when you are done and happy with the refactorings, or should you go ahead and do your sprint?
If it were me, I’d deploy the refactorings first. That way, the code is out there, and you’ll know if it
performs equally to its nonrefactored version. It’s really easy to introduce performance killing changes
in even a few line diff.
If you instead wait and deploy with new features, if anything goes awry, you have significantly more
code to spelunk to track down a potential problem.
57. What does it even
mean?
This drives me nuts. By saying something ‘feels’ slow, there’s an
implied assumption. The assumption is that it should be fast.
Saying it like that is...weird, because it gives no indication of what
is slow or not.
The trick is in determining what the assumption is, and then
finding a way to measure and identify the problem.
How can we do this?
60. Metrics everywhere!
With the right tools, you can easily be continuously collecting data
so you have it in your pocket when you need it.
61. • New Relic - http://newrelic.com
• Scout - http://scoutapp.com
These are the two we use and highly recommend.
New Relic is really great for giving a high level view of your application. We’re talking at the request response level,
including all sorts of fun maths with most time consuming requests, highest standard deviation, etc. It also breaks down
requests by where time is spent, like whether it’s all in the view, the controller, the database, partials, etc.
Scout is useful for other reasons. While New Relic is good for high level understanding of your application, Scout is a bit
more low level. You can use it to collect metrics about your servers, and how well they are running. Memory, CPU, disk
space, IO, mysql connection stats, and so on.
I really believe these are a great combination, because New Relic can point you in the direction of a problem area, and Scout
can help you better understand what’s contributing to it at a system level.
62. The front page feels
slow
The front page is taking 10 seconds to load, but we
really need it to be loading in under 1 second
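Turning "feels slow" into a number can be as simple as timing the thing and comparing it to the target. A minimal sketch using Ruby's standard Benchmark module; the measure helper is a hypothetical name, and the sleep is a stand-in for actually requesting the front page.

```ruby
require "benchmark"

TARGET_SECONDS = 1.0 # from the slide: "under 1 second"

# Time a block and report how it compares with the target.
def measure(label, target: TARGET_SECONDS)
  elapsed = Benchmark.realtime { yield }
  { label: label, elapsed: elapsed, target: target, too_slow: elapsed > target }
end

# Stand-in for fetching the front page; replace with a real request.
result = measure("front page") { sleep 0.05 }

puts format("%s took %.2fs (target %.1fs)", result[:label], result[:elapsed], result[:target])
puts "too slow" if result[:too_slow]
```

In practice a tool like New Relic collects these timings for you continuously; the sketch just shows what the stated assumption looks like once it has a number attached.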
63. The primary key seems
like it’s increasing
rapidly
The primary key is at 90% of its maximum, up from
80% yesterday, and looks like it’ll run out overnight.
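The "run out overnight" claim is just a straight-line extrapolation from two measurements. A sketch, assuming linear growth and (for percent_used) a signed 32-bit integer key; the helper names and that column type are assumptions, not something from the talk.

```ruby
# Assumption: the key is a signed 32-bit integer column.
MAX_SIGNED_INT32 = 2**31 - 1

def percent_used(current_id, max = MAX_SIGNED_INT32)
  current_id * 100.0 / max
end

# Linear extrapolation from two daily observations.
def days_until_exhausted(percent_yesterday:, percent_today:)
  growth_per_day = percent_today - percent_yesterday
  return Float::INFINITY if growth_per_day <= 0 # not growing, never runs out
  (100.0 - percent_today) / growth_per_day
end

# The slide's numbers: 80% yesterday, 90% today.
puts days_until_exhausted(percent_yesterday: 80.0, percent_today: 90.0) # prints 1.0
```

Ten points of growth per day with ten points left gives one day, which is where "overnight" comes from.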
64. IO seems high
IO fluctuates up to 90% sometimes, but doesn’t appear
to have a negative effect
68. How would you prefer
to know?
• Angry tweets
• Angry email from your boss
• You personally checking everything all the
time
• An automated system to let you know
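The last option boils down to: measure things on a schedule, compare them to thresholds, and tell a human only when something crosses the line. A toy sketch; Check and alerts are hypothetical names, the probes return canned values instead of measuring anything, and a real setup would send a page or email rather than print.

```ruby
# A check pairs a name and a threshold with a probe that returns the
# current value when called.
Check = Struct.new(:name, :threshold, :probe) do
  def run
    value = probe.call
    { name: name, value: value, alert: value > threshold }
  end
end

# Run every check and keep only the ones that crossed their threshold.
def alerts(checks)
  checks.map(&:run).select { |r| r[:alert] }
end

# Canned probes for illustration; real ones would measure something.
checks = [
  Check.new("disk used %", 90, -> { 42 }),
  Check.new("front page seconds", 1, -> { 10 })
]

alerts(checks).each do |a|
  puts "ALERT: #{a[:name]} is #{a[:value]}"
end
```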
73. Single point of contact
If everything is awful, there needs to be a single point of contact. They
take point, acknowledge the problem, and begin looking into it. If need be,
they bring on others.
88. Remember what’s important for the business? Do you want to
become the expert at <insert technology here>? Is it really the
most valuable thing you can be doing?
89. If you’re still going to go
hipster...
• experiment in branches
• understand operational impact
• Staging!
91. Further Reading
• Web Operations - John Allspaw and Jesse
Robbins
• Continuous Delivery - Jez Humble and
David Farley
• “Web Operations for Developers 101”
http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/ref=sr_1_1?s=books&ie=UTF8&qid=1314447411&sr=1-1
http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912/ref=sr_1_4?s=books&ie=UTF8&qid=1314447411&sr=1-4
http://www.paperplanes.de/2011/7/25/web_operations_101_for_developers.html