2. Deployer use cases
• As a deployer I want to ensure that an instance is
reserved and provisioned without falling back
and/or exposing internal OpenStack errors
to users.
• As a deployer I want to be able to allocate,
schedule and reserve resources before they are
consumed so that I can make
advanced/complex/custom scheduling decisions
using the combination of those resources as a
whole.
• I want to convey to my users that OpenStack is a
reliable and dependable system that is resilient to
API outages, resource failures…
3. Developer use cases
• I want to be able to add new (and improved!) states to
OpenStack and know what the impacts will be on the
other states in OpenStack in an easy-to-understand
manner.
• I want to be able to undo (and redo) resource
allocation decisions in a transactional and verifiably
correct manner on errors or on other ‘smart’
algorithmic placement logic.
• I want to be able to quickly and easily understand an
API request from start to finish & I want other
developers to have a single place to understand the
same.
4. User use cases
• I want to ensure that my instances are reliably
brought up without involving myself to resolve
(or raise to support) errors inside of OpenStack.
• I want to ensure that my instances (and
associated resources) are optimally scheduled in
a reliable and correct manner or not have them
scheduled to begin with.
• I want my resources to be fully utilized, and not
have zombie resources being ‘locked’ due to the
lack of transactional semantics (and recovery) in
the underlying code.
5. The problem
• Hard to [follow, recover from, debug, ensure
reliability, correctness, extend, audit…] ad-hoc
distributed state transitions.
– Created by continual placement of new features
without revisiting the underlying state management
system.
• The never ending battle between new hotness vs. stability
– Majority of focus (understandably) on getting
OpenStack operational.
– Typical technical debt.
• Acceptable for a new project like OpenStack to get off the
ground, but now is the time to focus on features that add
stability/scalability...
6. The problem
• Inter-state ‘cutting’ results in instances which
require manual or periodic tasks to recover.
– Distributed systems should always be able to
automatically recover from failures, and not require
manual/periodic intervention.
• Continually adding local [solutions,fixes,patches]
• Lack of [focus,time,desire] to fix the system as a whole?
• How many inter-state race conditions are hiding
underneath the covers??
– Can verification even be done with the current
codebase (in a reasonable time period)?
7. CREATE SERVER API (admin/user)
[Diagram: create-server request flow, numbered steps 1 to 16, connecting the
request, nova-api, keystone, glance, MySQL, RabbitMQ, nova-scheduler,
nova-compute, Libvirt, Volume Service and Network Service.]
8. Create Server - Transitions and States
ID  Service             Operation              vm_state  task_state            power_state
1   Nova API            Initial State          -         -                     -
2   Keystone            Authenticate user      -         -                     -
3   Nova API/Glance     Show image             -         -                     -
4   Nova API/MySQL      Create entry           BUILDING  SCHEDULING            -
5   Nova API/RabbitMQ   Cast to Scheduler      BUILDING  SCHEDULING            -
6   Scheduler           Received at Scheduler  BUILDING  SCHEDULING            -
7   Scheduler/RabbitMQ  Cast to Compute        BUILDING  SCHEDULING            -
8   Compute             Received at Compute    BUILDING  SCHEDULING            -
9   Compute/Glance      Show image             BUILDING  SCHEDULING            -
10  Compute/MySQL       Update DB              BUILDING  NETWORKING            -
11  Compute/RabbitMQ    Call on Network        BUILDING  NETWORKING            -
12  Network             Allocate Network       BUILDING  NETWORKING            -
13  Compute/Volume      Attach volume          BUILDING  BLOCK_DEVICE_MAPPING  -
14  Compute/MySQL       Update DB              BUILDING  SPAWNING              -
15  Compute/Libvirt     Spawn instance         BUILDING  SPAWNING              -
16  Compute/MySQL       Update DB              ACTIVE    None                  RUNNING
9. What happens if we cut here??
Or here??
Or here??
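One rough way to reason about the "cut" question is to treat the transition table from the Create Server slide as data and ask which state an instance would be left in if the flow stopped after a given step. The sketch below is illustrative only: the `Step` type and the helpers `state_after_cut` and `stranded_cut_points` are hypothetical names, not part of Nova, and it assumes the states listed for a step are exactly what remains recorded if nothing runs after that step.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    """One row of the create-server transition table (hypothetical type)."""
    id: int
    service: str
    operation: str
    vm_state: str
    task_state: str
    power_state: str


# Rows copied from the transition table; "-" means no state is recorded yet.
STEPS = [
    Step(1, "Nova API", "Initial State", "-", "-", "-"),
    Step(2, "Keystone", "Authenticate user", "-", "-", "-"),
    Step(3, "Nova API/Glance", "Show image", "-", "-", "-"),
    Step(4, "Nova API/MySQL", "Create entry", "BUILDING", "SCHEDULING", "-"),
    Step(5, "Nova API/RabbitMQ", "Cast to Scheduler", "BUILDING", "SCHEDULING", "-"),
    Step(6, "Scheduler", "Received at Scheduler", "BUILDING", "SCHEDULING", "-"),
    Step(7, "Scheduler/RabbitMQ", "Cast to Compute", "BUILDING", "SCHEDULING", "-"),
    Step(8, "Compute", "Received at Compute", "BUILDING", "SCHEDULING", "-"),
    Step(9, "Compute/Glance", "Show image", "BUILDING", "SCHEDULING", "-"),
    Step(10, "Compute/MySQL", "Update DB", "BUILDING", "NETWORKING", "-"),
    Step(11, "Compute/RabbitMQ", "Call on Network", "BUILDING", "NETWORKING", "-"),
    Step(12, "Network", "Allocate Network", "BUILDING", "NETWORKING", "-"),
    Step(13, "Compute/Volume", "Attach volume", "BUILDING", "BLOCK_DEVICE_MAPPING", "-"),
    Step(14, "Compute/MySQL", "Update DB", "BUILDING", "SPAWNING", "-"),
    Step(15, "Compute/Libvirt", "Spawn instance", "BUILDING", "SPAWNING", "-"),
    Step(16, "Compute/MySQL", "Update DB", "ACTIVE", "None", "RUNNING"),
]


def state_after_cut(last_completed: int) -> Tuple[str, str, str]:
    """Return (vm_state, task_state, power_state) if the flow stops after this step."""
    for step in STEPS:
        if step.id == last_completed:
            return (step.vm_state, step.task_state, step.power_state)
    raise ValueError("unknown step id: %r" % (last_completed,))


def stranded_cut_points() -> List[int]:
    """Step ids after which a cut leaves the instance stuck in BUILDING."""
    return [step.id for step in STEPS if step.vm_state == "BUILDING"]


if __name__ == "__main__":
    print(state_after_cut(12))
    print(stranded_cut_points())
```

Under that assumption, a cut after any of steps 4 through 15 leaves a record in BUILDING with no component responsible for moving it forward or cleaning it up.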
10. Solutions solutions solutions
• Nova has mostly stabilized (code-wise)
– It appears to be a good time to rethink and rework
some of the foundations (with as little impact as
we can).
– Eventually as other core components (quantum)
stabilize similar analysis can be done there (if needed)
• Prototyping a potential solution and discussing next steps
with the community.
– That’s why we are here folks
11. Create request without orchestration
https://docs.google.com/document/d/1xpUszQFEtKmRAf1Wz_XpwyJslhI5X6siM29amPnKifE
12. Create request with orchestration
https://docs.google.com/document/d/1xpUszQFEtKmRAf1Wz_XpwyJslhI5X6siM29amPnKifE
13. Key Benefits
• Less scattering of state management
– Makes it easier to understand…
• Less scattering of recovery scenarios
– Clearly defined rollbacks…
• Faster and more dependable resource acquisition
– Compute node will perform initialization and final acquisition of resources.
– Reservations and initial acquisitions will be done before request to provision
instances, hence faster VM spawns.
• Scheduler can make better ‘overall’ scheduling decisions.
– Ex. no need for compute <-> scheduler retry hacks
– Can make advanced scheduling decisions based on volume choices, locality,
network choices... When you are able to acquire/release resources before
their use, anything is possible…
– No more need for 'hinting'...
• Creates a single place where others can extend or alter nova state
transitions to plug in their own ‘custom/internal’ state transitions.
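The "clearly defined rollbacks" and "reserve before provisioning" points above can be pictured with a very small sketch. This is not the prototype discussed in the talk and not an existing OpenStack API: `Task`, `Reserve`, `Spawn`, `run_workflow` and `new_context` are hypothetical names, and the "resources" are just entries in a dictionary. The idea it illustrates is that each step pairs an action with its undo, and a single runner undoes completed steps in reverse order when a later step fails.

```python
class Task:
    """Hypothetical unit of work paired with an undo action."""
    name = "task"

    def apply(self, ctx):
        raise NotImplementedError

    def rollback(self, ctx):
        raise NotImplementedError


class Reserve(Task):
    """Reserve a named resource (network, volume, ...) before it is used."""

    def __init__(self, resource):
        self.resource = resource
        self.name = "reserve-" + resource

    def apply(self, ctx):
        ctx["reserved"].add(self.resource)
        ctx["log"].append("apply " + self.name)

    def rollback(self, ctx):
        ctx["reserved"].discard(self.resource)
        ctx["log"].append("rollback " + self.name)


class Spawn(Task):
    """Final provisioning step; `fail=True` simulates a spawn error."""
    name = "spawn"

    def __init__(self, fail=False):
        self.fail = fail

    def apply(self, ctx):
        if self.fail:
            raise RuntimeError("spawn failed")
        ctx["spawned"] = True
        ctx["log"].append("apply spawn")

    def rollback(self, ctx):
        ctx["spawned"] = False
        ctx["log"].append("rollback spawn")


def run_workflow(tasks, ctx):
    """Apply tasks in order; on failure, roll back completed tasks in reverse."""
    completed = []
    try:
        for task in tasks:
            task.apply(ctx)
            completed.append(task)
    except Exception:
        # Only tasks that fully completed are undone; the failing task is
        # assumed to have left nothing behind (a simplification).
        for task in reversed(completed):
            task.rollback(ctx)
        return False
    return True


def new_context():
    return {"reserved": set(), "log": [], "spawned": False}


if __name__ == "__main__":
    ctx = new_context()
    ok = run_workflow([Reserve("network"), Reserve("volume"), Spawn(fail=True)], ctx)
    print(ok, ctx["reserved"], ctx["log"])
```

Because recovery lives in one runner rather than in each service, a failed spawn leaves no reserved network or volume behind, which is the "zombie resource" case the user use cases describe.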
14. DEMO AND DISCUSSION
https://etherpad.openstack.org/the-future-of-orch