Is OpenStack Neutron production ready for large scale deployments?

Copyright © 2016 Mirantis, Inc. All rights reserved
www.mirantis.com
Oleg Bondarev, Senior Software Engineer, Mirantis
Elena Ezhova, Software Engineer, Mirantis

Why are we here?
“We've learned from experience that the truth will come out.”
Richard Feynman

Key highlights (Spoilers!)
Mitaka-based OpenStack deployed
by Fuel
2 hardware labs were used for
testing
378 nodes was the size of the
largest lab
Line-rate throughput was achieved
Over 24500 VMs were launched on
a 200-node lab
...and yes, Neutron works at scale!

Agenda
Labs overview & tools
Testing methodology
Results and analysis
Issues
Outcomes

Deployment description
Mirantis OpenStack with Mitaka-based Neutron
ML2 OVS
VxLAN/L2pop
DVR
rootwrap-daemon ON
ovsdb native interface OFF
ofctl native interface OFF
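
The settings above roughly correspond to a handful of Neutron options. The sketch below renders them as an INI fragment. The option names are the ones we believe Mitaka-era Neutron uses; the values, and the merging of ML2 and OVS agent sections into one file, are assumptions for illustration and not the lab's actual configuration.

```python
import configparser
import io

# Illustrative only: sections of ml2_conf.ini / openvswitch_agent.ini are
# merged here for brevity, and the values are assumptions, not lab config.
DEPLOYMENT_OPTIONS = {
    "ml2": {
        "tenant_network_types": "vxlan",
        "mechanism_drivers": "openvswitch,l2population",
    },
    "agent": {
        "tunnel_types": "vxlan",
        "l2_population": "True",
        "enable_distributed_routing": "True",  # DVR on the OVS agent
        "root_helper_daemon": "sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf",
    },
    "ovs": {
        "ovsdb_interface": "vsctl",   # ovsdb native interface OFF
        "of_interface": "ovs-ofctl",  # ofctl native interface OFF
    },
}


def render_ini(sections):
    """Render a dict of {section: {option: value}} as INI text."""
    parser = configparser.ConfigParser()
    parser.read_dict(sections)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


rendered = render_ini(DEPLOYMENT_OPTIONS)
print(rendered)
```

On the server side, DVR routers by default would additionally need router_distributed = True in neutron.conf, which is left out of the fragment.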

Environment description. 200 node lab
3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers
CPU      2x Intel Xeon E5-2650v3, Socket 2011, 2.3 GHz, 25 MB cache, 10 cores, 105 W
RAM      8x 16 GB Samsung M393A2G40DB0-CPB DDR-IV PC4-2133P ECC Reg. CL13
Network  2x Intel Corporation I350 Gigabit Network Connection (public network)
         2x Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Computes
CPU      1x Intel Xeon Ivy Bridge 6C E5-2620 v2, 2.1 GHz, 15 MB cache, 7.2 GT/s QPI, 80 W, Socket 2011R, 1600
RAM      4x Samsung DDRIII 8 GB DDR3-1866 1Rx4 ECC Reg. RoHS M393B1G70QH0-CMA
Network  1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+

Environment description. 378 node lab
3 controllers, 375 computes
Model    Dell PowerEdge R63
CPU      2x Intel E5-2680 v3, 2.5 GHz, 12 cores
RAM      256 GB, Samsung M393A2G40DB0-CPB
Network  2x Intel X710 Dual Port, 10-Gigabit
Storage  3.6 TB, SSD, RAID 1 - Dell PERC H730P Mini, 2 disks Intel S3610
Model    Lenovo RD550-1U
CPU      2x E5-2680v3, 12-core CPUs
RAM      256 GB
Network  2x Intel X710 Dual Port, 10-Gigabit
Storage  2x Intel S3610 800GB SSD
2x DP and 3Yr Standard Support 23 176
RD650-2

Tools
Control plane testing
Rally
Data plane testing
Shaker
Density testing
Heat
Custom (ancillary) scripts
System resource monitoring
Grafana/Prometheus
Additionally

Integrity test
Control group of resources that must
stay persistent no matter what other
operations are performed on the
cluster.
2 server groups of 10 instances
2 subnets connected by router
Connectivity checks by floating IPs
and fixed IPs
Checks are run between other tests
to ensure dataplane operability

Integrity test
● From fixed IP to fixed
IP in the same subnet
● From fixed IP to fixed
IP in different subnets

Integrity test
● From floating IP to
floating IP
● From fixed IP to
floating IP
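
The four kinds of checks listed on the integrity test slides can be enumerated with a short sketch. Everything here is hypothetical: the instance names, the documentation-range IP addresses, one server group per subnet, and the choice to check every ordered pair of instances are assumptions, not a description of the scripts actually used.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Instance:
    name: str
    subnet: str
    fixed_ip: str
    floating_ip: str


def build_control_group(groups=2, per_group=10):
    """Build a hypothetical control group: one server group per subnet."""
    instances = []
    for g in range(groups):
        for i in range(per_group):
            instances.append(Instance(
                name=f"group{g}-vm{i}",
                subnet=f"subnet{g}",
                fixed_ip=f"10.0.{g}.{i + 10}",
                floating_ip=f"203.0.113.{g * per_group + i + 10}",
            ))
    return instances


def plan_checks(instances):
    """Return (kind, source address, destination address) for every ordered pair."""
    checks = []
    for src, dst in permutations(instances, 2):
        if src.subnet == dst.subnet:
            kind = "fixed-to-fixed, same subnet"
        else:
            kind = "fixed-to-fixed, different subnets"
        checks.append((kind, src.fixed_ip, dst.fixed_ip))
        checks.append(("floating-to-floating", src.floating_ip, dst.floating_ip))
        checks.append(("fixed-to-floating", src.fixed_ip, dst.floating_ip))
    return checks


checks = plan_checks(build_control_group())
print(len(checks), "planned connectivity checks")
```

The sketch only plans the checks; running them (for example with ping from inside each instance) is left out.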

Rally control plane tests
Basic Neutron test suite
Tests with increased number of iterations and
concurrency
Neutron scale test with many servers/networks

Rally basic Neutron test suite
create_and_update_
create_and_list_
create_and_delete_
● floating_ips
● networks
● subnets
● security_groups
● routers
● ports
Verify that cloud is
healthy, Neutron
services up and
running

Rally high load tests, increased
iterations/concurrency
Concurrency 50-100
Iterations 2000-5000
API tests
create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets
Boot VMs tests
boot-and-list-server
boot-and-delete-server-with-secgroups
boot-runcommand-delete
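
For reference, a high-load run like the ones above could be described to Rally with a task of roughly this shape. This is a minimal sketch assuming Rally's classic task format with a constant runner; the context values (a single tenant and user, unlimited network quota) are assumptions, and the task files actually used in the labs are not shown in this deck.

```python
import json


def constant_load_task(scenario, times, concurrency, args=None):
    """Build a Rally task dict that runs one scenario under constant load."""
    return {
        scenario: [
            {
                "args": args or {},
                "runner": {
                    "type": "constant",
                    "times": times,              # total iterations
                    "concurrency": concurrency,  # parallel iterations
                },
                "context": {
                    # Assumed values, chosen only to keep the example small.
                    "users": {"tenants": 1, "users_per_tenant": 1},
                    "quotas": {"neutron": {"network": -1}},
                },
            }
        ]
    }


task = constant_load_task(
    "NeutronNetworks.create_and_list_networks", times=2000, concurrency=50
)
print(json.dumps(task, indent=2))
```

The printed JSON could be saved to a file and passed to rally task start.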

API tests
All test runs were successful, no errors.
Results on Lab 378 were slightly better than on Lab 200.

Scenario                 Iterations/Concurrency  Time, Lab 200            Time, Lab 378
create-and-list-routers  2000/50                 avg 15.59, max 29.00     avg 12.942, max 19.398
create-and-list-subnets  2000/50                 avg 25.973, max 64.553   avg 17.415, max 50.41
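
To put "slightly better" into numbers, the relative reduction in time between the two labs can be computed from the figures above. This is only a small arithmetic sketch; the dictionary layout and function name are ours.

```python
# Times copied from the API test results for Lab 200 and Lab 378.
RESULTS = {
    "create-and-list-routers": {
        "lab200": {"avg": 15.59, "max": 29.00},
        "lab378": {"avg": 12.942, "max": 19.398},
    },
    "create-and-list-subnets": {
        "lab200": {"avg": 25.973, "max": 64.553},
        "lab378": {"avg": 17.415, "max": 50.41},
    },
}


def reduction_percent(before, after):
    """Relative reduction of `after` compared to `before`, in percent."""
    return (before - after) / before * 100.0


for scenario, labs in RESULTS.items():
    for metric in ("avg", "max"):
        pct = reduction_percent(labs["lab200"][metric], labs["lab378"][metric])
        print(f"{scenario} {metric}: {pct:.1f}% lower on Lab 378")
```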

Boot VMs tests
First run on Lab 200:
● 7.75% failures, concurrency 100
● 1.75% failures, concurrency 15
Fixes applied on Lab 378:
● 0% failures, concurrency 100
● 0% failures, concurrency 50

Trends
create_and_list_networks
● create - slow linear growth
● list - linear growth

create_and_list_networks trends
create network
list networks

Trends
create_and_list_routers
● create - stable
● list - linear growth (6.5 times in 2000 iterations)
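
A growth figure such as "6.5 times in 2000 iterations" can be read as the ratio between the operation time at the end of the run and at the start. The sketch below shows one way to compute it; the windowed-average approach is our assumption, and the duration series is synthetic, generated only so the example runs.

```python
def growth_factor(durations, window=1):
    """Ratio of the mean of the last `window` samples to the mean of the first."""
    head = durations[:window]
    tail = durations[-window:]
    return (sum(tail) / len(tail)) / (sum(head) / len(head))


# Synthetic data, NOT measurements: a perfectly linear rise from 1.0 to 6.5
# over 2000 iterations, mimicking the shape described on the slide.
ITERATIONS = 2000
synthetic = [1.0 + 5.5 * i / (ITERATIONS - 1) for i in range(ITERATIONS)]

print(f"growth over the run: x{growth_factor(synthetic):.1f}")
```

With real Rally output, a larger window would smooth out individual slow iterations before the ratio is taken.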

create_and_list_routers trends
create router
list routers

Trends
create_and_list_subnets
● create - slow linear growth
● list - linear growth (20 times in 2000 iterations)

create_and_list_subnets trends
create subnet
list subnets

create_and_list_ports trends
average load

Trends
create_and_list_ports
● gradual growth
create_and_list_secgroups
● create 10 sec groups - stable, with peaks
● list - rapid growth (17.2 times)

create_and_list_secgroups trends
create 10 security groups
list security groups

Rally scale with many networks
100 networks per iteration
1 VM per network
Iterations 20, concurrency 3

Rally scale with many VMs
1 network per iteration
100 VMs per network
Iterations 20, concurrency 3

Shaker: Architecture
Shaker is a distributed data-plane
testing tool for OpenStack.

Shaker: L2 scenario
Tests the bandwidth
between pairs of instances
on different nodes in the
same virtual network.

Shaker: L3 East-West scenario
Tests the bandwidth
between pairs of
instances on different
nodes deployed in
different virtual networks
plugged into the same
router.

Shaker: L3 North-South scenario
Tests the bandwidth
between pairs of
instances on different
nodes deployed in
different virtual networks.

Shaker: Lab 200, MTU 1500
Standard configuration
Bi-directional L3 East-West
scenario:
● 561 Mbits/sec upload,
528 Mbits/sec
download
Intel 82599ES 10-Gigabit

Shaker: Lab 200, MTU 9000
Enabled jumbo frames
Bi-directional L3 East-West
scenario:
● 3615 Mbits/sec upload,
3844 Mbits/sec
download
x7 increase in throughput
Intel 82599ES 10-Gigabit
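
The "x7" figure can be checked against the Lab 200 numbers on the two slides above. A short arithmetic sketch (variable names are ours):

```python
# Bi-directional L3 East-West throughput on Lab 200, in Mbits/sec.
MTU_1500 = {"upload": 561, "download": 528}
MTU_9000 = {"upload": 3615, "download": 3844}

upload_gain = MTU_9000["upload"] / MTU_1500["upload"]
download_gain = MTU_9000["download"] / MTU_1500["download"]
combined_gain = sum(MTU_9000.values()) / sum(MTU_1500.values())

print(f"upload:   x{upload_gain:.2f}")
print(f"download: x{download_gain:.2f}")
print(f"combined: x{combined_gain:.2f}")
```

The per-direction gains land a little below and a little above 7, so "x7" is a fair rounding of the combined result.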

Shaker: Lab 378,
L3 East-West Bi-directional test
HW offloads-capable NIC
Hardware offloads boost with
small MTU (1500):
● x3.5 throughput increase
in bi-directional test
Increasing MTU from 1500 to
9000 also gives a significant
boost:
● 75% throughput
increase in bi-directional
test (offloads on)
Intel X710 Dual Port 10-Gigabit

Shaker: Lab 378,
L3 East-West Download test
HW offloads-capable NIC
Hardware offloads boost with
small MTU (1500):
● x2.5 throughput increase
in download
Increasing MTU from 1500 to
9000 also gives a significant
boost:
● 41% throughput
increase in download
test (offloads on)

Shaker: Lab 378,
Near line-rate results in L2 and
L3 east-west Shaker tests
even with concurrency >50:
● 9800 Mbits/sec in
download/upload tests
● 6100 Mbits/sec each
direction in bi-directional
tests

Shaker: Lab 378,
Full L2 Download test

Shaker: Lab 378,

Shaker: Lab 378,
Full L3 North-South Download test

Shaker: Lab 378,
L3 East-west Bi-directional test

Dataplane testing outcomes
Neutron DVR+VxLAN+L2pop installations are capable of almost line-rate
performance.
Main bottlenecks: hardware configuration and MTU settings.
Solution:
1. Use HW offloads-capable NICs
2. Enable jumbo frames
North-South scenario needs improvement
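
One way to see why MTU matters so much: at a fixed bit rate, larger packets mean far fewer packets per second for the host to process. The sketch below is back-of-the-envelope only. It treats every packet as exactly MTU-sized and ignores Ethernet framing and VxLAN encapsulation overhead, and attributing the gain to per-packet processing cost is our reading, not a measured result from the labs.

```python
def packets_per_second(rate_bits_per_sec, packet_bytes):
    """Packets per second needed to carry a given bit rate with fixed-size packets."""
    return rate_bits_per_sec / (packet_bytes * 8)


LINE_RATE = 10e9  # 10 Gbit/s NICs were used in both labs

pps_1500 = packets_per_second(LINE_RATE, 1500)
pps_9000 = packets_per_second(LINE_RATE, 9000)

print(f"MTU 1500: {pps_1500:,.0f} packets/sec")
print(f"MTU 9000: {pps_9000:,.0f} packets/sec")
print(f"reduction: x{pps_1500 / pps_9000:.0f}")
```

Hardware offloads attack the same cost from the other side by letting the NIC handle segmentation, so the host deals with fewer, larger units of work.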

Density test
Aim:
Boot the maximum number of VMs the cloud can manage.
Make sure VMs are properly wired and have access to the
external network.
Verify that data-plane is not affected by high load on the
cloud.

3 controllers, 196 computes, 1 node for Grafana/Prometheus
Controllers
CPU      20 cores
RAM      128 GB
Network  2x Intel Corporation I350 Gigabit Network Connection (public network)
         2x Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Computes
CPU      6 cores
RAM      32 GB
Network  1x AOC-STGN-i2S 2-port 10 Gigabit Ethernet SFP+

Density test process
Heat used for creating 1 network
with a subnet, 1 DVR router,
and 1 cirros VM per compute
node.
1 Heat stack == 196 VMs
Upon spawn VMs get their IPs
from metadata and send them
to the external HTTP server
Iteration 1
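
A stack of the shape described above could be generated along these lines. This is a hedged sketch, not the template used in the test: the external network name, image and flavor names, CIDR, and host names are hypothetical, and pinning one VM per compute node through the "nova:<host>" availability zone form is an assumption about how the VMs were spread. The user data that reports the VM's IP to the external HTTP server is omitted.

```python
import json


def density_stack_template(compute_hosts, external_network="public",
                           image="cirros", flavor="m1.micro",
                           cidr="10.0.0.0/16"):
    """Build a HOT template: 1 network, 1 subnet, 1 DVR router, 1 VM per host."""
    resources = {
        "net": {"type": "OS::Neutron::Net"},
        "subnet": {
            "type": "OS::Neutron::Subnet",
            "properties": {"network": {"get_resource": "net"}, "cidr": cidr},
        },
        "router": {
            "type": "OS::Neutron::Router",
            "properties": {
                "distributed": True,  # DVR router
                "external_gateway_info": {"network": external_network},
            },
        },
        "router_interface": {
            "type": "OS::Neutron::RouterInterface",
            "properties": {
                "router": {"get_resource": "router"},
                "subnet": {"get_resource": "subnet"},
            },
        },
    }
    for index, host in enumerate(compute_hosts):
        resources[f"vm_{index}"] = {
            "type": "OS::Nova::Server",
            "depends_on": "subnet",  # the port needs the subnet to exist
            "properties": {
                "image": image,
                "flavor": flavor,
                "availability_zone": f"nova:{host}",  # pin to one compute node
                "networks": [{"network": {"get_resource": "net"}}],
            },
        }
    return {"heat_template_version": "2015-04-30", "resources": resources}


# Hypothetical host names for the 196 compute nodes.
hosts = [f"node-{n}" for n in range(1, 197)]
template = density_stack_template(hosts)
print(json.dumps(template, indent=2)[:400])
```

JSON is valid YAML, so the dumped template can be handed to Heat as is.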

Density test process
Heat stacks were created in
batches of 1 to 5 (5 most of the
time)
1 iteration == 196*5 VMs
Integrity test was run periodically
Constant monitoring of lab status
using Grafana dashboard
Iteration k

Density test results
125 Heat stacks were created
Total 24500 VMs on a cluster
Number of bugs filed and fixed: 8
Days spent: 3
People involved: 12
Data-plane connectivity lost: 0 times
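
The headline numbers follow directly from the stack layout. A quick arithmetic check (constant names are ours):

```python
COMPUTES = 196           # one VM per compute node in each Heat stack
STACKS = 125             # Heat stacks created during the test
STACKS_PER_ITERATION = 5  # largest batch size used

vms_per_stack = COMPUTES
total_vms = STACKS * vms_per_stack
vms_per_compute = total_vms // COMPUTES
vms_per_iteration = vms_per_stack * STACKS_PER_ITERATION
min_iterations = STACKS // STACKS_PER_ITERATION

print(f"total VMs:          {total_vms}")
print(f"VMs per compute:    {vms_per_compute}")
print(f"VMs per iteration:  {vms_per_iteration}")
print(f"minimum iterations: {min_iterations}")
```

Since some batches were smaller than 5 stacks, the real number of iterations was higher than this minimum.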

Grafana dashboard during density test

Density test load analysis

Issues faced
● Ceph failure!
● Bugs
● LP #1614452 Port create time grows at scale due to dvr arp update
● LP #1610303 l2pop mech fails to update_port_postcommit on a loaded cluster
● LP #1528895 Timeouts in update_device_list (too slow with large # of VIFs)
● LP #1606827 Agents might be reported as down for 10 minutes after all controllers restart
● LP #1606844 L3 agent constantly resyncing deleted router
● LP #1549311 Unexpected SNAT behavior between instances with DVR+floating ip
● LP #1609741 oslo.messaging does not redeclare exchange if it is missing
● LP #1606825 nova-compute hangs while executing a blocking call to librbd
● Limits
● ARP table size on nodes
● cpu_allocation_ratio
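
The cpu_allocation_ratio limit can be illustrated with a rough calculation. The assumptions are loud ones: each VM is taken to use 1 vCPU, the compute nodes are counted either as 6 physical cores or as 12 logical CPUs with hyper-threading, and Nova's long-standing default ratio of 16.0 is used for comparison. None of these values are stated in the deck.

```python
import math


def min_cpu_allocation_ratio(vms_per_host, vcpus_per_vm, host_cpus):
    """Smallest ratio that lets the scheduler place this many VMs on one host."""
    return vms_per_host * vcpus_per_vm / host_cpus


def max_vms_per_host(host_cpus, cpu_allocation_ratio, vcpus_per_vm):
    """VMs that fit on one host under a given overcommit ratio."""
    return math.floor(host_cpus * cpu_allocation_ratio / vcpus_per_vm)


VMS_PER_HOST = 24500 // 196  # 125 VMs per compute at the end of the test
VCPUS_PER_VM = 1             # assumption: smallest possible flavor
DEFAULT_RATIO = 16.0         # assumption: Nova default left unchanged

for host_cpus in (6, 12):    # physical cores vs. logical CPUs (assumed HT)
    needed = min_cpu_allocation_ratio(VMS_PER_HOST, VCPUS_PER_VM, host_cpus)
    fits = max_vms_per_host(host_cpus, DEFAULT_RATIO, VCPUS_PER_VM)
    print(f"{host_cpus} CPUs: ratio >= {needed:.1f} needed, "
          f"default ratio fits {fits} VMs")
```

Under these assumptions the default ratio is only enough if the scheduler sees 12 logical CPUs per node, which shows how close the test ran to this limit.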

Outcomes
● No major issues in Neutron
● No threatening trends in control-plane tests
● Data-plane tests showed stable performance on all hardware
● Data-plane does not suffer from control-plane failures
● 24K+ VMs on 200 nodes without serious performance
degradation
● Neutron is ready for large-scale production deployments on
350+ nodes

Links
http://docs.openstack.org/developer/performance-docs/test_plans/neutron_features/vm_density/plan.html
http://docs.openstack.org/developer/performance-docs/test_results/neutron_features/vm_density/results.html

Thank you
for your time
