Rackspace uses OpenStack to power both its public cloud and many private clouds.
Let's take a look at how OpenStack Compute (Nova) works with other OpenStack services to convert a user's REST API call into accessible compute resources, be they virtual machines, containers or bare metal.
Now that you understand how Nova is a highly distributed system, let's look at how you can upgrade the control plane, spread across thousands of nodes, with minimal downtime.
Upgrade Needs
• No User Impact
• Scope:
– To the next release
– Continuous Deployment
• Existing Configuration works
• Warn before removing features
API Users
• The Absent
– Cloud upgrades
– But old script works
• The Active
– Uses newest APIs
– Check availability
• Multi-Cloud
– Multiple clouds
– Different versions
– Single script
• Ops & Dev
– Who is using what?
– How to evolve API?
API Evolution
• v2.0
– First API
– Base + Extensions
– Now Deprecated
• v2.1
– No Extensions
– Evolve using “Micro-versions”
– Better Validation
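A minimal sketch of how a single script might cope with clouds running different versions: pick the highest micro-version that both the script and the cloud support. The helper names and the version ranges in the usage note are illustrative assumptions; the header shown is the one Nova uses to select a micro-version, and a request without it gets the minimum version.

```python
def parse_microversion(text):
    """Turn a string such as '2.12' into the comparable tuple (2, 12)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def negotiate(client_min, client_max, server_min, server_max):
    """Return the highest micro-version both sides support, or None."""
    # Compare as integer tuples: as strings, '2.9' would sort after '2.12'.
    low = max(parse_microversion(client_min), parse_microversion(server_min))
    high = min(parse_microversion(client_max), parse_microversion(server_max))
    if low > high:
        return None  # no overlap: the script cannot talk to this cloud
    return "%d.%d" % high


def request_headers(microversion):
    """Build the header that asks Nova for a specific micro-version."""
    return {"X-OpenStack-Nova-API-Version": microversion}
```

For example, a script that understands 2.1 to 2.25 talking to a cloud that advertises 2.1 to 2.12 would send 2.12; a script that needs at least 2.30 would get None back and should fail with a clear message instead of sending the request.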
Nova Architecture
Components:
• API Nodes (behind LB)
• Conductor(s)
• Other Control Nodes
• Compute nodes
• Database
• Message Queue
Upgrade features:
• Isolate from DB using oslo.versionedobjects
• Versioned RPC Signature
• Schema and Data Migrations
• Graceful Shutdown
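A toy illustration of the idea behind versioned objects: each field records the version that introduced it, so a newer service can serialize an object in a form an older service still understands. This is not the oslo.versionedobjects API; the class, field names and version numbers are made up for the sketch, and the real library does considerably more.

```python
class ToyInstance:
    """Toy versioned object (hypothetical, not oslo.versionedobjects)."""

    VERSION = (1, 2)
    # Field name -> version in which the field first appeared (made up).
    FIELD_ADDED_IN = {
        "uuid": (1, 0),
        "host": (1, 0),
        "task_state": (1, 1),
        "tags": (1, 2),
    }

    def __init__(self, **fields):
        self.fields = fields

    def to_primitive(self, target_version=None):
        """Serialize, dropping fields the target version does not know."""
        target = target_version or self.VERSION
        data = {
            name: value
            for name, value in self.fields.items()
            if self.FIELD_ADDED_IN[name] <= target
        }
        return {"version": "%d.%d" % target, "data": data}
```

Calling `ToyInstance(uuid="abc", host="c1", tags=["x"]).to_primitive((1, 0))` leaves out `tags`, which is what a not-yet-upgraded compute node expecting version 1.0 would need.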
Rackspace public cloud powered by OpenStack Nova
Started working on OpenStack at Citrix in 2010
Joined nova-core in June 2013, Nova PTL for Liberty and Mitaka
Image from unsplash.com
What is Nova?
Data plane / VM downtime, Control plane / API downtime = Lost income and Support Calls
No API downtime + No VM downtime = Happy users
No lost income, lower cost of upgrade
https://upload.wikimedia.org/wikipedia/commons/7/78/Airforce_forklift.jpg
https://images.unsplash.com/photo-1429497419816-9ca5cfb4571a?q=80&fm=jpg&s=4bf1164d23eea4f04aeefe1732149cf3
This talk will focus on the control plane
Flow:
API (-> DB) -> Conductor (-> Scheduler) -> Compute (talks to other services)
Why:
Scale small and large: API requests vs Compute nodes
Note the upgrade features.
Let's take a look at our users, and what they want.
Reference:
https://dague.net/2015/06/05/the-nova-api-in-kilo-and-beyond-2/
http://www.danplanet.com/blog/2015/06/26/upgrading-nova-to-kilo-with-minimal-downtime/
Aim: zero downtime.
Note: no rollback
(1) Expand the DB schema, check that all data migrations are complete, and remove any cruft from previous releases
(2) Pin RPC, then upgrade the whole control plane together, conductor first
(3) Talk about graceful compute shutdown, and its limitations
(4) Unpin RPC by rechecking
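The pin and unpin steps above can be sketched roughly as follows: while old and new services coexist, senders cap their messages at the oldest RPC version any running service reports, and recompute the cap once everything is upgraded. The function names, service names and version numbers are hypothetical and only illustrate the idea; this is not Nova's actual code.

```python
def rpc_version_cap(service_versions):
    """Pin to the oldest RPC version reported by any running service."""
    return min(service_versions.values())


def build_call(method, args, arg_added_in, cap):
    """Build a message, dropping arguments the pinned version lacks."""
    sendable = {
        name: value
        for name, value in args.items()
        # Arguments with no recorded version are assumed to be original.
        if arg_added_in.get(name, (1, 0)) <= cap
    }
    return {"method": method, "version": "%d.%d" % cap, "args": sendable}


# During the upgrade one compute node is still on the old release.
mixed = {"conductor": (4, 5), "compute-1": (4, 0), "compute-2": (4, 5)}
pinned = rpc_version_cap(mixed)

# After every node is upgraded, rechecking lifts the pin.
upgraded = dict(mixed, **{"compute-1": (4, 5)})
unpinned = rpc_version_cap(upgraded)
```

With the mixed fleet the cap is (4, 0), so an argument introduced in 4.3 is left out of every message; once `compute-1` is upgraded the cap rises to (4, 5) and the newer argument is sent.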
Stats: over 100 blueprints, 100 specs to review, 500 outstanding patches, etc.
http://docs.openstack.org/developer/nova/process.html
Always on expectations
But dealing with failure is hard
Hide complexity but keep control = decomposition
Software structure often mirrors team structure
Need better layers to support this problem