After years of running one of the largest OpenStack clouds, we've learned a thing or two. Early architectural decisions about networking, storage, and scaling have real and lasting consequences. We'll walk through some of these early decisions, some of which turned out to be good and many of which turned out to be bad. Also included are some strategies for thinking through the long-term impacts, to help you avoid similar pitfalls in your own cloud.
(Video at https://www.youtube.com/watch?v=LzIkTqfb1nI )
KRIS
Late 2013, Havana, POC/”dev pilot” cloud
Morphed into production cloud by 2014
Liberty since 2015, blocked on containerization
KRIS
So what did we end up building?
Totally separate clouds at multiple locations (no shared Keystone)
VMs boot to local storage, and may have a Cinder volume to store “persistent data”
Network-centric approach to networking: let the networking gear take care of the packets
No live migration, meaning we didn’t want to have pets. Teams were advised to be able to rebuild their servers. Anything stateful (e.g. databases) should not go on OpenStack.
MIKE
Decisions have long lasting impacts
Tough to change later
Talk about general categories of things
Infrastructure & Scaling
Management (Config Management)
Product
We’re going to work backwards, and Kris is going to kick us off with some issues around product
MIKE
Free for everybody, then it gets all used up and isn’t there when needed
Tragedy of the commons (find some good images for this)
Had a lot of trouble keeping up with capacity consumption, so we would run out of space
We had ridiculously high quotas, intending to report usage back to corporate finance (but that never happened)
KRIS
I know there is a lot of text on this slide, but I am not going to go through everything here
DOCUMENT WHAT YOU ARE PROVIDING
SLAs
Patching policy
How do you want end users to use what you built
Example deployments/architectures
Integrations with legacy applications
If you are running a cloud and you don’t have this documented, please, please, please do the work to get it documented and agreed upon.
Product vision drives your technical requirements
Small changes to the vision/requirements can fundamentally shift what you need to provide.
After getting this documented, also document the process for changing the vision.
MIKE
Be clear on what you’re providing and what you’re not (before you build it)
Know where you are going!
Even if you don’t plan to actually charge others real money for using your cloud, you need to show them what they’re using and translate that to value somehow
Definitely enforce some quota control (see the sketch after this list)
Talk about how we opened up our quotas with the intention of reporting usage back to the finance department against budgets (which never happened)
Unless you can actually scale hardware super fast (you can’t), it can’t just be a free-for-all
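A minimal sketch of what basic quota enforcement could look like, using the openstacksdk cloud layer. The project names, limits, and the idea of driving them from a reviewed config are illustrative assumptions, not our actual tooling.

```python
# Hypothetical sketch: apply per-project compute quotas from a reviewed
# config, instead of leaving effectively unlimited quotas in place.
import openstack

# Illustrative limits (RAM is in MB); real values should come from
# capacity planning, not guesswork.
PROJECT_QUOTAS = {
    "team-a": {"cores": 200, "ram": 512 * 1024, "instances": 100},
    "team-b": {"cores": 50, "ram": 128 * 1024, "instances": 25},
}

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry

for project_name, limits in PROJECT_QUOTAS.items():
    project = conn.identity.find_project(project_name)
    if project is None:
        continue
    conn.set_compute_quotas(project.id, **limits)  # cloud-layer helper
    print(project_name, conn.get_compute_quotas(project.id))
```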
Education and evangelism
Good docs, getting started guides, sample code
Give them something to copy and paste (see the sample snippet below)
Start with teams that are already “cloud ready” as early adopters
Provide ongoing architectural guidance and constructive feedback
Don’t be arrogant or treat people who aren’t at your level yet poorly
This should go without saying, but if we’re honest, we all have condescending attitudes toward some people
Help when things go wrong
Describe SendGrid ProdOps team
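As an example of the kind of copy-and-paste getting-started snippet mentioned above, here is a minimal “boot your first VM” sketch using the openstacksdk. The cloud, image, flavor, network, and keypair names are placeholders, not anything from our actual environment.

```python
# Minimal "getting started" example: boot a VM with the openstacksdk.
import openstack

conn = openstack.connect(cloud="mycloud")          # entry in clouds.yaml

image = conn.compute.find_image("ubuntu-16.04")    # placeholder image name
flavor = conn.compute.find_flavor("m1.small")      # placeholder flavor
network = conn.network.find_network("provider-net")

server = conn.compute.create_server(
    name="my-first-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="my-keypair",
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```

A snippet like this, dropped into a getting-started guide, gives new teams something that works on the first try and that they can adapt from there.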
Now, moving on to the more technical architecture decisions
MIKE
We knew we would grow fast (see earlier graph)
Known challenges with scaling Nova/RMQ
Easier to move to cells v1 early, rather than a fire drill scaling exercise later
Knew we would take on some ongoing debt to forward port v1 patches for each new version
Cells v2 was coming “real soon now”
Details about how we did it: link to my YVR talk about moving to cells
MIKE
Good
Helped us to scale and segment our infrastructure (failure boundaries)
Gained a lot of expertise with Nova
Street cred in community (LDT group, etc.)
Bad
Neutron doesn’t scale the same way, which ended up being our main bottleneck (not Nova)
Forward porting patches becomes more and more difficult over time (eternal thanks to Sam from NeCTAR)
Unknown how/if we can do an online migration to cells v2
Cells v2 still coming “real soon now” (mostly there now)
KRIS
Ran all API/server services, plus RabbitMQ, on one set of servers
Glance kept separate to stay network-adjacent to the computes
Most Nova services moved later as part of cells v1
A symptom of starting small with a POC environment and then growing larger
KRIS
Good
Less hardware to deal with
Simpler architecture
Easier network/firewall ACLs
It helped us get started quickly
Bad
Any problem is very impactful; it takes out a wide swath of services
Resource contention (RabbitMQ and Neutron fighting over RAM and OOM-killing each other)
No separate admin vs. public endpoints, making it more difficult to do maintenance without exposing errors to users
KRIS
Neutron assumes (or used to assume) L2 everywhere and that it’s available anywhere
In our datacenter network
L2 stops at TOR
So getting to a server in another rack goes through the gateway of the local switch to the spine and into the other rack
Persistent IPs can be routed to any VM within the network
Overlays viewed as unnecessarily complex, difficult to troubleshoot
Provider network per rack (L2 domain); we pick a network for you based on AZ selection
Local patches to do network scheduling (see the sketch below)
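A hypothetical sketch of the idea behind that scheduling, assuming each rack’s provider network is named after its AZ. The naming convention, openstacksdk usage, and selection logic are illustrative only; our actual implementation lives in local Nova/Neutron patches.

```python
# Hypothetical illustration of per-rack provider network selection;
# not our actual patched scheduler code.
import openstack

conn = openstack.connect(cloud="mycloud")

def pick_network_for_az(az_name):
    """Pick the provider network for a rack, assuming it is named after
    the AZ, e.g. AZ "rack12" -> network "provider-rack12" (an assumption
    made for this sketch)."""
    net = conn.network.find_network("provider-" + az_name)
    if net is None:
        raise LookupError("no provider network for AZ " + az_name)
    return net

# The VM would then be booted with availability_zone="rack12" and
# networks=[{"uuid": pick_network_for_az("rack12").id}].
```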
KRIS
Good
Able to provide the same networking paradigm to VMs as to metal
Simple infrastructure, VMs just get an IP and they’re good to go
Network IP usage API implemented and committed upstream (see the sketch after this list)
Kicked off segmented networks spec as collaboration between LDT and Neutron
This remains the thing I’m most proud of accomplishing/helping with in OpenStack
Bad
Our Neutron doesn’t work like everybody else’s
People love their L2 adjacency
Unable to support more complex networking features out of the box (e.g. LBaaS)
Unsure how we will go about migrating to real Neutron segmented networks
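The IP usage data mentioned under “Good” is exposed through Neutron’s network-ip-availability extension. Below is a small sketch of reading it via the openstacksdk; the proxy method and attribute names reflect my understanding of the SDK and are worth double-checking against the version you run.

```python
# Sketch: report per-network IP usage via Neutron's
# network-ip-availability extension.
import openstack

conn = openstack.connect(cloud="mycloud")

for avail in conn.network.network_ip_availabilities():
    # used_ips / total_ips come from the extension's response body.
    print(avail.network_name, avail.used_ips, "/", avail.total_ips)
```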
MIKE
Big Puppet shop
Pretty good for server bootstrapping and config management
Wanted a one-stop shop for all config
OpenStack API providers for managing users, groups, roles, AZs, networks, etc.
Code review/pipeline already in place
Config mostly in Puppet and Hiera repos
State of OpenStack resources lives inside the service APIs
Physical hosts in manually curated Ansible hosts file
MIKE
Good
Single place for all config (in theory)
Helpful for new server bootstrapping and initial config
Noop mode helpful to see what will happen
Bad
Config and current state was actually split across Puppet and Hiera repos, as well as the service APIs
Difficulties with API providers led to duplicate objects (networks, AZs)
Difficult to do non-omnibus targeted deployments (Puppet upgraded RabbitMQ, whoops!)
Roles and grants still managed manually, ad hoc
Noop report not always accurate!
Sometimes servers get missed because we forget to put them in the hosts file
Difficult to do more intelligent orchestration of things when the data is all over the place
MIKE
Think of your future self
Almost nothing will be temporary
Unless you have a specific plan and timeline for moving away from it, and you can trust yourself to follow through
Try to quantify the interest you will pay on the tech debt
Consider your expected scale (more than seat-of-your-pants)
It’s just as bad to overestimate and overbuild as to underestimate
Automate first (or at least make sure the capability is there)
MIKE
Keep it simple
The perfect design is not when nothing else can be added, but when nothing else can be removed.
As we were working on this, the Stella Report came out, which articulates pretty well a lot of the ideas we were thinking about.
Particularly around the idea of complexity, and a term they coined “dark debt”
Dark debt/unknown unknowns that come from complexity (link to Velocity talk/Stella paper)
The best thing you can do to minimize it is to keep things as simple as possible
MIKE
Do as much as you can to simplify, but it’s still complex.
Spread the knowledge wealth
Try to keep everybody up to speed with what’s going on
Keep the “mental models” of the system accurate and up to date (above the line/below the line)
Avoid individual/tribal knowledge