After years of running one of the largest OpenStack clouds, we've learned a thing or two. Early architectural decisions about networking, storage, and scaling have real and lasting consequences. We'll walk through some of these early decisions, some of which turned out to be good and many of which turned out to be bad. Also included are some strategies for thinking through the long-term impacts, to help you avoid similar pitfalls in your own cloud.
(Video at https://www.youtube.com/watch?v=LzIkTqfb1nI )
KRIS
Late 2013, Havana, POC/”dev pilot” cloud
Morphed into production cloud by 2014
Liberty since 2015, blocked on containerization
KRIS
So what did we end up building?
Totally separate clouds at multiple locations (no shared Keystone)
VMs boot to local storage, and may have a Cinder volume to store “persistent data”
Network-centric approach to networking: let the networking gear take care of the packets
No live migration, meaning we didn’t want to have pets. Teams were advised to be able to rebuild their servers. Anything stateful (e.g. databases) should not go on OpenStack.
MIKE
Decisions have long lasting impacts
Tough to change later
Talk about general categories of things
Infrastructure & Scaling
Management (Config Management)
Product
We’re going to work backwards, and Kris is going to kick us off with some issues around product
MIKE
Free for everybody, then it gets all used up and isn’t there when needed
Tragedy of the commons (find some good images for this)
Had a lot of trouble keeping up with capacity consumption, so we would run out of space
We had ridiculously high quotas, intending to report usage back to corporate finance (but that never happened)
KRIS
I know there is a lot of text on this slide, but I am not going to go through everything here
DOCUMENT WHAT YOU ARE PROVIDING
SLAs
Patching policy
How do you want end users to use what you built
Example deployments/architectures
Integrations with legacy applications
If you are running a cloud and you don’t have this documented, please, please, please do the work to get it documented and agreed upon.
Product vision drives your technical requirements
Small changes to the vision/requirements can fundamentally shift what you need to provide.
After getting this documented, also document the process for changing the vision.
MIKE
Be clear on what you’re providing and what you’re not (before you build it)
Know where you are going!
Even if you don’t plan to actually charge others real money for using your cloud, you need to show them what they’re using and translate that to value somehow
Definitely enforce some quota control (see the sketch after this list)
Talk about how we opened up our quotas with the intention of reporting usage back to the finance department against budgets (which never happened)
Unless you can actually scale hardware super fast (you can’t), it can’t just be a free-for-all
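A minimal sketch of what basic quota enforcement could look like, using the openstacksdk cloud layer. The project names, limits, and the idea of driving them from a reviewed config are illustrative assumptions, not our actual tooling.

```python
# Hypothetical sketch: apply per-project compute quotas from a reviewed
# config, instead of leaving effectively unlimited quotas in place.
import openstack

# Illustrative limits (RAM is in MB); real values should come from
# capacity planning, not guesswork.
PROJECT_QUOTAS = {
    "team-a": {"cores": 200, "ram": 512 * 1024, "instances": 100},
    "team-b": {"cores": 50, "ram": 128 * 1024, "instances": 25},
}

conn = openstack.connect(cloud="mycloud")  # assumes a clouds.yaml entry

for project_name, limits in PROJECT_QUOTAS.items():
    project = conn.identity.find_project(project_name)
    if project is None:
        continue
    conn.set_compute_quotas(project.id, **limits)  # cloud-layer helper
    print(project_name, conn.get_compute_quotas(project.id))
```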
Education and evangelism
Good docs, getting started guides, sample code
Give them something to copy and paste (see the sample snippet below)
Start with teams that are already “cloud ready” as early adopters
Provide ongoing architectural guidance and constructive feedback
Don’t be arrogant or treat people who aren’t at your level yet poorly
This should go without saying, but if we’re honest, we all have condescending attitudes toward some people
Help when things go wrong
Describe SendGrid ProdOps team
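As an example of the kind of copy-and-paste getting-started snippet mentioned above, here is a minimal “boot your first VM” sketch using the openstacksdk. The cloud, image, flavor, network, and keypair names are placeholders, not anything from our actual environment.

```python
# Minimal "getting started" example: boot a VM with the openstacksdk.
import openstack

conn = openstack.connect(cloud="mycloud")          # entry in clouds.yaml

image = conn.compute.find_image("ubuntu-16.04")    # placeholder image name
flavor = conn.compute.find_flavor("m1.small")      # placeholder flavor
network = conn.network.find_network("provider-net")

server = conn.compute.create_server(
    name="my-first-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="my-keypair",
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```

A snippet like this, dropped into a getting-started guide, gives new teams something that works on the first try and that they can adapt from there.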
Now, moving on to the more technical architecture decisions
MIKE
We knew we would grow fast (see earlier graph)
Known challenges with scaling Nova/RMQ
Easier to move to cells v1 early, rather than a fire drill scaling exercise later
Knew we would take on some ongoing debt to forward port v1 patches for each new version
Cells v2 was coming “real soon now”
Details about how we did it: link to my YVR talk about moving to cells
MIKE
Good
Helped us to scale and segment our infrastructure (failure boundaries)
Gained a lot of expertise with Nova
Street cred in community (LDT group, etc.)
Bad
Neutron doesn’t scale the same way, which ended up being our main bottleneck (not Nova)
Forward porting patches becomes more and more difficult over time (eternal thanks to Sam from NeCTAR)
Unknown how/if we can do an online migration to cells v2
Cells v2 still coming “real soon now” (mostly there now)
KRIS
Ran all API/server services, plus RabbitMQ, on one set of servers
Glance kept separate to stay network-adjacent to the computes
Most Nova services moved later as part of cells v1
A symptom of starting small with a POC environment and then growing larger
KRIS
Good
Less hardware to deal with
Simpler architecture
Easier network/firewall ACLs
It helped us get started quickly
Bad
Any problem is very impactful; it takes out a wide swath of services
Resource contention (RabbitMQ and Neutron fighting over RAM and OOM-killing each other)
No separate admin vs. public endpoints, making it more difficult to do maintenance without exposing errors to users
KRIS
Neutron assumes (or used to assume) L2 everywhere and that it’s available anywhere
In our datacenter network
L2 stops at TOR
So getting to a server in another rack goes through the gateway of the local switch to the spine and into the other rack
Persistent IPs can be routed to any VM within the network
Overlays viewed as unnecessarily complex, difficult to troubleshoot
Provider network per rack (L2 domain); we pick a network for you based on AZ selection
Local patches to do network scheduling (see the sketch below)
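A hypothetical sketch of the idea behind that scheduling, assuming each rack’s provider network is named after its AZ. The naming convention, openstacksdk usage, and selection logic are illustrative only; our actual implementation lives in local Nova/Neutron patches.

```python
# Hypothetical illustration of per-rack provider network selection;
# not our actual patched scheduler code.
import openstack

conn = openstack.connect(cloud="mycloud")

def pick_network_for_az(az_name):
    """Pick the provider network for a rack, assuming it is named after
    the AZ, e.g. AZ "rack12" -> network "provider-rack12" (an assumption
    made for this sketch)."""
    net = conn.network.find_network("provider-" + az_name)
    if net is None:
        raise LookupError("no provider network for AZ " + az_name)
    return net

# The VM would then be booted with availability_zone="rack12" and
# networks=[{"uuid": pick_network_for_az("rack12").id}].
```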
KRIS
Good
Able to provide the same networking paradigm to VMs as to metal
Simple infrastructure, VMs just get an IP and they’re good to go
Network IP usage API implemented and committed upstream (see the sketch after this list)
Kicked off segmented networks spec as collaboration between LDT and Neutron
This remains the thing I’m most proud of accomplishing/helping with in OpenStack
Bad
Our Neutron doesn’t work like everybody else’s
People love their L2 adjacency
Unable to support more complex networking features out of the box (e.g. LBaaS)
Unsure how we will go about migrating to real Neutron segmented networks
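The IP usage data mentioned under “Good” is exposed through Neutron’s network-ip-availability extension. Below is a small sketch of reading it via the openstacksdk; the proxy method and attribute names reflect my understanding of the SDK and are worth double-checking against the version you run.

```python
# Sketch: report per-network IP usage via Neutron's
# network-ip-availability extension.
import openstack

conn = openstack.connect(cloud="mycloud")

for avail in conn.network.network_ip_availabilities():
    # used_ips / total_ips come from the extension's response body.
    print(avail.network_name, avail.used_ips, "/", avail.total_ips)
```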
MIKE
Big Puppet shop
Pretty good for server bootstrapping and config management
Wanted a one-stop shop for all config
OpenStack API providers for managing users, groups, roles, AZs, networks, etc.
Code review/pipeline already in place
Config mostly in Puppet and Hiera repos
State of OpenStack resources lives inside the service APIs
Physical hosts in manually curated Ansible hosts file
MIKE
Good
Single place for all config (in theory)
Helpful for new server bootstrapping and initial config
Noop mode helpful to see what will happen
Bad
Config and current state was actually split across Puppet and Hiera repos, as well as the service APIs
Difficulties with API providers led to duplicate objects (networks, AZs)
Difficult to do non-omnibus targeted deployments (Puppet upgraded RabbitMQ, whoops!)
Roles and grants still managed manually, ad hoc
Noop report not always accurate!
Sometimes servers get missed because we forget to put them in the hosts file
Difficult to do more intelligent orchestration of things when the data is all over the place
MIKE
Think of your future self
Almost nothing will be temporary
Unless you have a specific plan and timeline for moving away from it, and you can trust yourself to follow through
Try to quantify the interest you will pay on the tech debt
Consider your expected scale (more than seat-of-your-pants)
It’s just as bad to overestimate and overbuild as to underestimate
Automate first (or at least make sure the capability is there)
MIKE
Keep it simple
The perfect design is not when nothing else can be added, but when nothing else can be removed.
As we were working on this, the Stella Report came out, which articulates pretty well a lot of the ideas we were thinking about.
Particularly around the idea of complexity, and a term they coined “dark debt”
Dark debt/unknown unknowns that come from complexity (link to Velocity talk/Stella paper)
The best thing you can do to minimize it is to keep things as simple as possible
MIKE
Do as much as you can to simplify, but it’s still complex.
Spread the knowledge wealth
Try to keep everybody up to speed with what’s going on
Keep the “mental models” of the system accurate and up to date (above the line/below the line)
Avoid individual/tribal knowledge