Rails Operations - Lessons Learned

Rails Operations
Lessons learned from deploying and managing hundreds
of Rails applications

Thanks for coming out this morning. I know it’s hungover o’clock,
so it means a lot. You are dedicated, upstanding individuals.

• @techpickles
• http://github.com/technicalpickles
• http://technicalpickles.com

I am from the internet

Awesomeness Engineer
of Supreme Versatility
II

My official title is Awesomeness Engineer of Supreme Versatility II.
(I was recently promoted.)

Managed hosting and
operations

We’re mostly known for our hosting. What isn’t as well known is our managed services. For this, we engage more closely
with our customers.

When bringing on new managed customers, we work with them to spec out servers and review their application’s needs. We get
them up and running on these servers with our configuration management tool, Moonshine. And once deployed, we
provide 24x7 monitoring. If your server goes down, we let you know, and get it back online as soon as possible,
regardless of when it happens.

And that’s not all. Once live, we provide operational support. That covers anything from application performance analysis
and recommending architecture improvements to installing and managing new software on servers, or just being there to give
feedback on how the application is operating.

You can basically think of us as a Rails Operations company.

I’m talking about
Rails Operations

Conveniently enough, I’m talking about Rails Operations today.

WTF is
Rails Operations?

I found this hard to distill down to a simple statement.

I think it’s safe to say that the majority of us are developers. We write code, build applications, launch products.

In a lot of organizations, operations is something different. People associate operations with system administration. And
to an extent, this can be fairly accurate. Different people, different teams. As developers, we write some
code, toss it over the wall, and let _them_ handle it.

I think this is a bit flawed. The code you write has an operational impact. The systems you run it on have an
operational impact on your code. It’s a complex relationship, and when developer and operations teams are
separate, it’s hard to bridge the gap between them, since it’s neither’s responsibility.

Development and maintenance
of a production Rails application

The simplest definition I’ve found is this.

Very important assumption
You develop code that will eventually go into production
and, thanks in part to some business model, generate revenue.

That is to say, you are part of some organization

Before we dig in too
deep...

Let’s talk about the business. We need to start with where
development and operations fit within the rest of The Business.

Question
Does development generate revenue?

• Takes place on laptops, desktop machines,
staging servers

• No real users
• Unknown if it truly works
• Tests are green, but...

but it CREATES
potential revenue

• Step 1: Development
• Step 2: .......
• Step 3: PROFIT

Question
Does operations generate revenue?

• Lives on servers located in data centers
and clouds

• Real users
• Either code works, or it doesn’t
• Either the application is available or not

NO

Just because your application works in production doesn’t mean
people are using it or buying your product.

but it PRESERVES
potential revenue

If you have good operations, that means users will be able to see
your application working and actually be able to use it.

• Step 1: Development
• Step 2: Operations
• Step 3: ......
• Step 4: PROFIT!

Question
Uh, what generates revenue?

• Working features (or at least that work
enough)

• Infrastructure to keep the application up
and running (or at least up enough)

• A business model
• Sheer determination
• Good luck

Lessons learned

Alright. I’ve given you a definition of Rails Operations, and had a
brief detour to talk about the business and where development
and operations fit into it.

Now for some lessons. Basically, I’ll be going over some patterns,
some antipatterns, and other practices and topics.

Common threads

Putting this all together, I kept coming back to some common
threads. That is, some ideas that apply to many aspects. I’m going
to start you off with a few together, and then just jump into the
lessons. We’ll probably pick up a few more along the way.

Give a damn

If you don’t care about what you’re doing, everything else I’m
talking about today probably doesn’t matter. I don’t think you
need to worry about this though, since you are here.

Earlier we talked about how operations preserves revenue. To that
end, our goal is to mitigate risk as much as makes sense.

Tradeoffs and compromise. Each possible solution has them. The
trick is understanding that there are tradeoffs. What tradeoffs you
make depends on what your priorities are. For example:

* Dollar signs
* Time
* Sanity
* Technical debt
* Higher risk

Configuration
Management
Pattern

It’s about managing
configuration.
duh.

You write code that
manages your servers’
configuration

Take a moment to think about how you might describe a server to someone.
There are plenty of nouns:

* packages
* users
* files
* cronjobs
* services

And some verbs:

* running commands

• apache package is installed
• apache service is running
• deploy user exists
• cron jobs
• etc

• Moonshine
• Puppet
• Chef
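As a rough illustration of "code that manages your servers’ configuration", here is a minimal Ruby sketch. It is not Moonshine, Puppet, or Chef syntax. The `Resource` struct, the `pending_changes` helper, and the observed state are all made up for this example. The idea it shows is the common one: describe the state you want, compare it with what the server has, and act only on the difference.

```ruby
# Hypothetical sketch: not Moonshine, Puppet, or Chef syntax.
# Each resource names a thing on the server and the state it should be in.
Resource = Struct.new(:type, :name, :state)

DESIRED = [
  Resource.new(:package, "apache", :installed),
  Resource.new(:service, "apache", :running),
  Resource.new(:user,    "deploy", :present),
].freeze

# Compare desired state with what was observed on the server and
# return only the resources that still need work.
def pending_changes(desired, observed)
  desired.reject { |r| observed[[r.type, r.name]] == r.state }
end

# Observed state of a made-up server: apache is installed but stopped,
# and the deploy user does not exist yet.
observed = {
  [:package, "apache"] => :installed,
  [:service, "apache"] => :stopped,
}

pending_changes(DESIRED, observed).each do |r|
  puts "#{r.type} #{r.name} should be #{r.state}"
end
```

Running the same description against a server that already matches produces no pending changes, which is what makes it safe to apply on every deploy.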

Automation

Bootstrapping. Anyone who has set up a new server from scratch
can tell you... it’s time consuming, labor intensive, and error
prone.

Bootstrapping is just part of it, though, and it only ever happens once.
What’s more interesting is that you can use this to
manage your infrastructure as it evolves. Need to start using
redis? Just add it to your configuration management, and you’ll
have it next deploy.

The best way to illustrate why you should be using configuration management is to explore the
consequences of not using it.

Imagine it’s time to add a new application server. Your application is under heavy load, and needs this
server to be up and serving requests. How long will it take you to get it up? And how will you know it’s
set up correctly? If you’re doing this all manually, you can’t really know the answers to these questions.

Here’s another example. Adding a new dependency to your application. It can be a gem, a native
package, a new daemon, whatever. How do you ensure this gets on the server when you need it?
Deploy and pray? Log into the server and install it yourself? This sucks, and it’s kind of risky, especially if
you’re talking about production.

As always, there are tradeoffs to be made.

On one side: setting up and learning how to do configuration management takes time. Time that could be
spent working on user-facing tasks.

On the other side: without it, you take on the risk of having to cold deploy, or having deploys fail because of missing
dependencies.

Usually, the balance is to take the risk and have it burn you enough times that it’s more
painful to keep going without configuration management than it is to stop and set it up.

If you already know it, it’s a no-brainer. Just DO IT.

Preproduction servers

Staging servers are all about being a testbed between development and production.

Helps ensure
correctness of deploy

configuration
management
+
staging servers
=
VERY YES
If you use configuration management, and have staging servers,
then this is a huge win.

We talked about adding new dependencies earlier. If you are
doing configuration management, then staging is the first place
you can see if ur doing it right.

There’s basically no downside to using staging servers. The only
tradeoff is that servers do cost dollar signs, and staging
servers are no different. This leads us to a new thread...

Maths... look around you. In most cases, you can do some dollar sign math to justify costs of a thing. Let’s try this.

A staging server may cost $60/mo

But how can you calculate the cost of not having a staging server? Let’s assume that if you don’t have a staging server,
you’re bound to do a bad deploy that it could have prevented. Some code that doesn’t work outright, or is otherwise
flawed. Let’s say it causes an hour of downtime while you determine the problem and try to fix it. Do you know how much it
costs your business in lost revenue to be down an hour?

This is actually a pretty mature question, and I’d be surprised if many people can answer it off hand. In any event, I think
we can do some fuzzy math to say yeah, it probably is more than $60. If that’s the case, then one failed deploy a month is
enough to justify a staging server.
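The fuzzy math above can be written down as a small Ruby sketch. Only the $60/month figure comes from the talk. The revenue per hour, the downtime per incident, and the number of bad deploys per month are assumptions you would replace with your own numbers, and the method names are made up for this example.

```ruby
# Assumed figures, except the $60/month staging server from the talk.
STAGING_COST_PER_MONTH = 60.0

# Expected monthly loss from bad deploys that staging would have caught.
def downtime_cost(revenue_per_hour:, hours_down_per_incident:, incidents_per_month:)
  revenue_per_hour * hours_down_per_incident * incidents_per_month
end

# Staging pays for itself when the avoided loss is at least its cost.
def staging_worth_it?(avoided_loss, staging_cost = STAGING_COST_PER_MONTH)
  avoided_loss >= staging_cost
end

# Assumption: the business loses $100 for every hour the site is down.
loss = downtime_cost(revenue_per_hour: 100.0,
                     hours_down_per_incident: 1.0,
                     incidents_per_month: 1)

puts "Avoided loss: $#{format('%.2f', loss)} per month"
puts "Staging worth it? #{staging_worth_it?(loss)}"
```

The useful part is less the answer than being forced to put a number on `revenue_per_hour` at all.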

Repeat after me
• development

• staging

• production

capistrano-gitflow

Whenever possible, I like to enforce standards by means of automation.

For the flow of code from development -> staging -> production, we have capistrano-gitflow.
It was originally written by apinstein; I did some refactoring and cleaned it up enough to be usable as a
gem.

Effectively, this enforces development -> staging -> production. Whenever you deploy to staging, it
tags the current branch including information about the date, the user deploying, and a small blurb
about the changes. Assuming this is cool, you can promote a tag to production and go on from there.
If you haven’t deployed to staging yet, you’ll be prompted and it will default to using the last
production tag.
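To make the tagging idea concrete, here is a minimal Ruby sketch of building a tag name from the date, the deploying user, and a short blurb. The `staging_tag` helper, the tag format, and the sample values are hypothetical. The format capistrano-gitflow actually uses may differ.

```ruby
# Hypothetical helper: builds a staging tag name from the date, the deploying
# user, and a short blurb. Not the actual capistrano-gitflow implementation.
def staging_tag(time:, user:, blurb:)
  # Turn the blurb into a tag-safe slug: lowercase, hyphen-separated.
  slug = blurb.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  stamp = time.utc.strftime("%Y-%m-%d-%H-%M")
  "staging-#{stamp}-#{user}-#{slug}"
end

# Made-up deploy details for illustration.
tag = staging_tag(time: Time.utc(2011, 8, 27, 9, 30),
                  user: "deployer",
                  blurb: "Fix signup redirect")
puts tag
```

Because the tag carries who deployed what and when, promoting to production becomes a matter of picking a tag that has already been seen on staging.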

Deploy early, deploy
often
Pattern

A play on release early,
release often.
Although technically, I guess it’s the same

It’s basically the same thing we hear in the open source
community.

The sooner you release code, the sooner you can validate it and
the sooner you can get feedback. Does it work? Does it not break
the entire site? Are users happy?

By deploying early and often, we’re also limiting risk. The fewer
changes that go out in a single deploy, the fewer things there are
that can possibly break. By waiting to deploy, you’re accumulating
a larger set of changes to deploy, and therefore there’s more
surface area to debug if it breaks.

In a way, you can consider undeployed code a liability.

Imagine spending a day or two doing some code cleanups to get ready for a sprint. Should you deploy
when you are done and happy with the refactorings, or should you go ahead and do your sprint?

If it were me, I’d deploy the refactorings first. That way, the code is out there, and you’ll know if it
performs equally to its nonrefactored version. It’s really easy to introduce performance-killing changes
in even a few-line diff.

If you instead wait and deploy with new features, if anything goes awry, you have significantly more
code to spelunk to track down a potential problem.

Feeling Driven
Development
Antipattern

The front page feels
slow

The primary key seems
like it’s increasing
rapidly

What does it even
mean?

This drives me nuts. By saying something ‘feels’ slow, there’s an
implied assumption. The assumption is that it should be fast.
Saying it like that is...weird, because it gives no indication of what
is slow or not.

The trick is in determining what the assumption is, and then
finding a way to measure and identify the problem.

How can we do this?

Science Driven
Development
Counterpattern

Metrics everywhere!

With the right tools, you can easily be continuously collecting data
so you have it in your pocket when you need it.

• New Relic - http://newrelic.com
• Scout - http://scoutapp.com

These are the two we use and highly recommend.

New Relic is really great for giving a high level view of your application. We’re talking at the request/response level,
including all sorts of fun maths with most time consuming requests, highest standard deviation, etc. It also breaks down
requests by where time is spent. Like if it’s all in the view, the controller, the database, partials, etc.

Scout is useful for other reasons. While New Relic is good for high level understanding of your application, Scout is a bit
more low level. You can use it to collect metrics about your servers, and how well they are running. Memory, CPU, disk
space, IO, mysql connection stats, and so on.

I really believe these are a great combination, because New Relic can point you in the direction of a problem area, and Scout
can help you better understand what’s contributing to it at a system level.

The front page feels
slow
The front page is taking 10 seconds to load, but we
really need it to be loading in under 1 second

The primary key seems
like it’s increasing
rapidly
The primary key is at 90% of its maximum, up from
80% yesterday, and looks like it’ll run out overnight.

IO seems high
IO fluctuates up to 90% sometimes, but doesn’t appear
to have a negative effect
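The primary key example above is really a small projection: two measurements and a growth rate. A minimal Ruby sketch of that arithmetic follows. It assumes growth stays linear at the rate seen between the two daily samples, and the `days_until_exhausted` name is made up for this example.

```ruby
# Days until a counter reaches 100% of its maximum, assuming it keeps
# growing linearly at the rate seen between two daily samples.
def days_until_exhausted(yesterday_pct:, today_pct:)
  growth_per_day = today_pct - yesterday_pct
  # No growth (or shrinking) means it never runs out at this rate.
  return Float::INFINITY unless growth_per_day.positive?

  (100.0 - today_pct) / growth_per_day
end

# Numbers from the slide: 80% yesterday, 90% today.
puts days_until_exhausted(yesterday_pct: 80.0, today_pct: 90.0)

# For scale: the largest value of a signed 32-bit integer key.
# Whether a given table uses a 32-bit key is an assumption.
puts 2**31 - 1
```

With the slide’s numbers the projection comes out to one day, which is the measured version of "seems like it’s increasing rapidly".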

How do you know
when everything is
awful?

How would you prefer
to know?
• Angry tweets
• Angry email from your boss
• You personally checking everything all the
time
• An automated system to let you know

What to monitor

It’s not a problem til it’s a problem

Define priority

Does it wake someone up?

Single point of contact

If everything is awful, there needs to be a single point of contact. They
take point, acknowledge the problem, and begin looking into it. If need be,
they bring on others.
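One way to make "define priority" and "does it wake someone up?" explicit is to write the rules down. This is a minimal Ruby sketch, not the configuration of any particular monitoring tool. The metric names, thresholds, and the `priority_for` helper are all made up for illustration.

```ruby
# Hypothetical alert rules: names and thresholds are made up for illustration.
Rule = Struct.new(:metric, :threshold, :priority)

RULES = [
  Rule.new(:disk_used_pct, 95, :page),   # wakes someone up
  Rule.new(:disk_used_pct, 80, :email),  # can wait until morning
  Rule.new(:site_down,      1, :page),
].freeze

# Returns the highest priority triggered by a sample: :page beats :email,
# and nil means nothing needs attention.
def priority_for(metric, value)
  triggered = RULES.select { |r| r.metric == metric && value >= r.threshold }
  return nil if triggered.empty?

  triggered.any? { |r| r.priority == :page } ? :page : :email
end

p priority_for(:disk_used_pct, 85)
p priority_for(:disk_used_pct, 97)
p priority_for(:disk_used_pct, 40)
```

Writing the rules as data forces the conversation about which problems deserve a page before one of them happens at 3am.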

Vertical scaling
Pattern

Your app is slow
Now what?

Resources are
(relatively) cheap

Developers are
(relatively) expensive

As always there’s a balance.

Remember, it’s a tradeoff to optimize for developer time by
vertically scaling. It buys you time to either deal

“I read a blog post
about how mongo is
totally web scale”

Remember what’s important for the business? Do you want to
become the expert at <insert technology here>? Is it really the
most valuable thing you can be doing?

If you’re still going to go
hipster...

• experiment in branches
• understand operational impact
• Staging!

Test in production
Wait, what?

Further Reading

• Web Operations - John Allspaw and Jesse
Robbins
• Continuous Delivery - Jez Humble and
David Farley
• “Web Operations 101 for Developers”

http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/ref=sr_1_1?s=books&ie=UTF8&qid=1314447411&sr=1-1

http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912/ref=sr_1_4?s=books&ie=UTF8&qid=1314447411&sr=1-4

http://www.paperplanes.de/2011/7/25/web_operations_101_for_developers.html

Want to talk ops?
find me here
josh@railsmachine
@techpickles

Do you like these
things?
• Rails
• Operations
• Ping Pong
• Beer
We are hiring
