The document discusses the results of testing the scalability of OpenStack Neutron in large deployments. Two hardware labs with 378 and 200 nodes were used. Rally and Shaker tools tested the control and data planes. Over 24500 VMs were launched on the 200-node lab with no loss of data plane connectivity. Near line-rate throughput was achieved in data plane tests. Some issues were encountered and fixed, such as bugs and Ceph failure. The outcomes indicate Neutron can scale to large deployments.
Good afternoon, everyone! My name is Elena Ezhova, I am a Software Engineer at Mirantis, and this is Oleg Bondarev, Senior Software Engineer at Mirantis. Today we are going to talk about Neutron performance at scale and find out whether it is ready for large deployments.
So, why are we here? For quite a long time there has been a misconception that Neutron is not production-ready and has certain performance issues. That’s why we aspired to put an end to these rumors and perform Neutron-focused performance and scale testing. And now we’d like to share our results.
Here are some key points of our testing:
First, we deployed Mirantis OpenStack 9.0 with Mitaka-based Neutron on 2 hardware labs, with the largest lab having 378 nodes in total.
Secondly, we were able to achieve near line-rate throughput in dataplane tests and boot over 24 thousand VMs during the density test.
...and finally, that’s the major spoiler by the way, we can confirm that Neutron works at scale!
But let’s not get ahead of ourselves and stick to the agenda.
We shall start with describing the clusters we used for testing, their hardware and software configuration along with the tools that we used.
Then we’ll go on to describe tests that were performed, results we got and their analysis.
After that we’ll take a look at issues that were faced during the testing process as well as some performance considerations.
Finally, we’ll round out with the conclusions and outcomes.
We were testing the Mitaka-based Mirantis OpenStack 9.0 distribution, with Neutron using the ML2/OVS plugin.
We’ve used VxLAN segmentation type as it is a common choice in production.
We were also using DVR for enhanced data-plane performance.
As to hardware, we were lucky to be able to experiment on two different hardware labs.
The first one had 200 nodes: 3 were controllers, 1 ran Prometheus with Grafana for cluster status monitoring, and the remaining 196 were computes.
Here, as you can see, controllers were more powerful than computes, all of them having standard NICs with Intel 82599 controllers.
Now, the second lab had more nodes and had way more powerful hardware.
It had 378 nodes: 3 controllers and all the rest computes. As I said, these servers are more powerful than those on the first lab as they have more CPU, RAM and, what’s important, modern X710 Intel NICs.
Now a quick look at the tools that were used in testing process.
All the tests that we were running can be roughly classified into three groups: control plane, data plane and density tests.
For control plane testing we were using Rally.
For data-plane testing we used a specially designed tool called Shaker, and for density testing we mostly relied on our custom scripts and Heat templates for creating stacks.
Prometheus with Grafana dashboard was quite useful for monitoring cluster state.
And, of course we were using our eyes, hands and sometimes even the 6th sense for tracking down issues.
So, what exactly were we doing?
The very first thing we wanted to know once the cloud was deployed was whether it was working correctly, meaning: do we have internal and external connectivity? What’s more, we always needed a way to check that the data plane still works after mass creation and deletion of resources, heavy workloads, etc.
The solution was to create an Integrity test. It is very simple and straightforward.
We create a control group of 20 instances, all of which are located on different compute nodes. Half of them are in one subnet and have floating IPs, the other half are in another subnet and have only fixed IPs. Both subnets are plugged into a router with a gateway to an external network.
For each of the instances we check that it’s possible to:
1. SSH into it.
2. Ping an external resource (e.g. Google).
3. Ping other VMs (by fixed or floating IPs).
This infrastructure should be persistent: the resources must not be cleaned up after a connectivity check is made.
Lists of IPs to ping are formed so as to check all possible combinations with minimum redundancy. Having VMs in different subnets, with and without floating IPs, allows us to check that all possible traffic routes are working.
For example, the check validates that ping passes:
From fixed IP to fixed IP in the same subnet
From fixed IP to fixed IP in different subnets, when packets have to go through the qrouter namespace
From floating IP to floating IP, traffic goes through FIP namespace to the external network
From fixed IP to floating IP, when traffic goes through a controller.
This connectivity check is very helpful for verifying that data-plane connectivity is not lost during testing, and it helped us spot early on when something went wrong with the data plane.
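The route combinations listed above can be sketched as a small classifier. This is only an illustration of the idea, not the actual test script: the `Instance` type, its fields, the route labels and the `ping_plan` helper are all hypothetical names made up for the example, and the model is simplified so that the route depends only on the source VM, the destination VM and which destination address is pinged.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of one control-group VM."""
    name: str
    subnet: str
    has_floating_ip: bool


def route_type(src: Instance, dst: Instance, use_floating: bool) -> str:
    """Label the traffic path exercised by pinging dst from src."""
    if use_floating:
        if not dst.has_floating_ip:
            raise ValueError(f"{dst.name} has no floating IP")
        if src.has_floating_ip:
            return "floating-to-floating"
        return "fixed-to-floating"
    if src.subnet == dst.subnet:
        return "fixed-to-fixed-same-subnet"
    return "fixed-to-fixed-different-subnets"


def ping_plan(instances):
    """Keep the first (src, dst, use_floating) triple seen per route type."""
    plan = {}
    for src, dst in permutations(instances, 2):
        for use_floating in (False, True):
            if use_floating and not dst.has_floating_ip:
                continue  # nothing to ping by floating IP
            label = route_type(src, dst, use_floating)
            plan.setdefault(label, (src.name, dst.name, use_floating))
    return plan


# Two VMs per subnet are enough to cover every route type in this model.
vms = [
    Instance("a1", "A", True),
    Instance("a2", "A", True),
    Instance("b1", "B", False),
    Instance("b2", "B", False),
]
for label, triple in ping_plan(vms).items():
    print(label, triple)
```

Keeping one ping per route type is the "minimum redundancy" idea in its smallest form; a real check would spread those pings across all 20 instances.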
Now I’d like to pass the ball to Oleg who will tell you of control plane testing process and results.
Rally is a well known and I’d say “official” tool for testing control plane performance of OpenStack clusters. I won’t talk much about the tool itself, let’s move to the tests and results.
We started with the so-called basic Neutron test suite. These are essentially Neutron API tests, like create and list nets, subnets, routers, etc., which don’t involve spawning VMs. This test suite ships with Rally itself and we didn’t modify the test options much, as the main purpose is to validate cluster operability.
Secondly, we ran a “hardened” version of the same tests with increased numbers of iterations and concurrency. Plus we added several tests which spawn VMs.
Finally, we ran two tests specifically targeted at creating many networks and servers in different proportions (servers per network): many nets with one VM in each vs. fewer nets with many servers in each.
Not much to add here, as I already said these are basic Neutron API tests to validate cluster (and Rally) operability. The picture shows that there is no big difference between avg and max response times which is positive.
Moving on. The following tests were run with concurrency 50-100 and 2000-5000 iterations. The create_and_list tests are additive: they do not delete resources on each iteration, so the load (in terms of number of resources) grows with each iteration.
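For reference, a Rally task for one of these additive tests might look roughly like the sketch below, which builds the task as a Python dict and prints it as JSON. The runner values are taken from the low end of the ranges just mentioned. The `users` and `quotas` context values are illustrative assumptions rather than the settings used in the labs, and the helper name is made up for the example.

```python
import json


def create_and_list_networks_task(times: int, concurrency: int) -> dict:
    """Build a Rally task dict for the create-and-list-networks scenario."""
    return {
        "NeutronNetworks.create_and_list_networks": [
            {
                "args": {"network_create_args": {}},
                # "constant" keeps `concurrency` iterations in flight
                # until `times` iterations have run in total.
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                "context": {
                    # Assumed values, not taken from the talk.
                    "users": {"tenants": 1, "users_per_tenant": 1},
                    # -1 lifts the network quota so that an additive
                    # test is not stopped by the default limit.
                    "quotas": {"neutron": {"network": -1}},
                },
            }
        ]
    }


task = create_and_list_networks_task(times=2000, concurrency=50)
print(json.dumps(task, indent=2))
```

The printed JSON can be saved to a file and passed to Rally as a task definition.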
We also added VM booting tests, of which boot_runcommand_delete is the most interesting, since it tests successful VM spawning and external connectivity through a floating IP, all at a high rate.
Speaking of results I’d like to note that all highlighted tests were successful (each iteration) and results on a more powerful lab are better, which is expected.
For boot-and-delete-server-with-secgroups and boot-runcommand-delete there were some failures initially on lab 200 (I’ll talk about failures later). After investigating and applying fixes, on lab 378 we got a 100% success rate for these tests, even with greater concurrency.
Speaking of trends we see that for create and list nets it is a linear growth for list and slow linear growth for create. This has a simple explanation - the more resources we have, the more time neutron server needs for processing.
create & list from 200 node lab
It’s even better for routers - no time increase for create and slow linear growth for list.
create & list from 200 node lab
Same for subnets - slow linear growth for both create and list
create & list from 200 node lab
Here is an aggregated graph for ports - gradual growth as well with some peaks
List security groups is something to look into and profile, as its growth doesn’t seem quite linear. For create, response times are more or less stable and do not depend on the number of resources created.
In this test on each iteration 100 networks are created with a VM in each network. There were 20 iterations with concurrency 3 and as you can see from the graph this is a really slow response time increase.
And it’s even better for so called “Rally scale with many VMs” test, where it is 1 net with 100 VMs per iteration, 20 iterations and concurrency 3 - a pretty stable time for each iteration. Probably we should’ve done more iterations but we were very limited in time and had to give a priority to other tests.
Just like with this talk! So now I’ll pass the ball to Elena and she will speak about Shaker and data plane testing.
Thanks, Oleg! Shaker is a distributed data-plane testing tool for OpenStack that was developed at Mirantis. Shaker wraps around popular system network testing tools like iperf3, netperf and others. Shaker is able to deploy OpenStack instances and networks in different topologies using Heat.
Shaker starts lightweight agents inside VMs, these agents execute tests and report the results back to a centralized server. In case of network testing only master agents are involved, while slaves are used as back-ends handling incoming packets.
There are three typical dataplane test scenarios.
The L2 scenario tests the bandwidth between pairs of VMs in the same subnet. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available computes are used.
The L3 east-west scenario is the same as the previous one, with the only difference being that the VMs in each pair are deployed in different subnets.
In the L3 north-south scenario VMs with master agents are located in one subnet, and VMs with slave agents are reached via their floating IPs.
Our data plane performance testing started on the 200-node lab deployed with the standard configuration, which also means that we had an MTU of 1500. Having run the Shaker test suite we saw disquietingly low throughput: in east-west bi-directional tests upload did not even reach 500 Mbits/sec!
These results suggested that it would be reasonable to raise the MTU from the default 1500 to 9000, which is commonly used in production installations. By doing so we were able to increase throughput almost 7 times, and it reached almost 4 Gbits/sec in each direction in the same test case. Such a difference in results shows that performance depends to a great extent on the lab configuration.
Now, if you remember, I mentioned that we actually had two hardware labs, where the second lab had more advanced hardware and, most importantly, more advanced Intel X710 NICs.
Among other things, these NICs make fuller use of hardware offloads, which are especially needed when VxLAN segmentation (with its 50 bytes of overhead) comes in. Hardware offloads can significantly increase throughput while reducing the load on the CPU.
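To put the 50 bytes of VxLAN overhead and the MTU change into numbers, here is a back-of-the-envelope sketch. It only counts how many full-size packets are needed to carry one gigabit of tenant IP traffic and assumes the instance MTU is the underlay MTU minus the encapsulation overhead. It says nothing about the measured results, and the function names are made up for the example.

```python
VXLAN_OVERHEAD_BYTES = 50  # overhead figure quoted in the talk


def instance_mtu(underlay_mtu: int) -> int:
    """Largest tenant IP packet that fits after VxLAN encapsulation."""
    return underlay_mtu - VXLAN_OVERHEAD_BYTES


def packets_per_gbit(underlay_mtu: int) -> float:
    """Full-size packets needed to carry 1 Gbit of tenant IP traffic."""
    return 1e9 / (8 * instance_mtu(underlay_mtu))


for mtu in (1500, 9000):
    print(
        f"underlay MTU {mtu}: instance MTU {instance_mtu(mtu)}, "
        f"about {packets_per_gbit(mtu):.0f} packets per Gbit"
    )

ratio = packets_per_gbit(1500) / packets_per_gbit(9000)
print(f"MTU 1500 needs about {ratio:.1f}x more packets than MTU 9000")
```

Every packet has a fixed processing cost, so fewer and larger packets mean less per-packet work. Segmentation offloads attack the same cost from the other side by moving the splitting into the NIC.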
Let’s see what difference advanced offload-capable hardware makes.
On the 300+ node lab we ran Shaker tests with different lab configurations: MTU 1500 and 9000 and hardware offloads on and off.
As it can be seen on the chart, hardware offloads are most effective with smaller MTU, mostly due to segmentation offloads:
we can see x3.5 throughput increase in bi-directional test (compare columns 1 and 2)
Increasing MTU from 1500 to 9000 also gives a significant boost:
75% throughput increase in bi-directional test (offloads on) (columns 2 and 4)
The situation is the same for unidirectional test cases (download in this example): hardware offloads give x2.5 throughput increase (compare columns 1 and 2).
And combining enabled hardware offloads with jumbo frames helps to increase throughput by 41% (columns 2 and 4).
These results show that it makes a lot of sense to enable jumbo frames and hardware offloads in production environments whenever possible.
So, here are the real numbers that we got on this lab:
We were able to achieve near line-rate results in L2 and L3 east-west Shaker tests even with concurrency > 50, which means that there were more than 50 pairs of instances sending traffic simultaneously:
9.8 Gbits/sec in download and upload tests
Over 6 Gbits/sec each direction in bi-directional tests
Now, let’s compare the results we got on the 200-node lab, which had less advanced hardware, with the results on the 300+ node lab, which had more advanced hardware.
On this chart you can see how average throughput between VMs in the same network changes with increasing concurrency. On a 300+ node lab throughput remains line-rate even when concurrency reaches 99.
The situation is almost the same with the L3 east-west download test, where the VMs are in different subnets connected to the same router.
Here it can be seen that running the same test on a lab with jumbo frames enabled and hardware offloads supported leads to a significant increase in throughput, which stays stable even with high concurrency.
L3 North-South performance is still far from perfect, mostly because in this scenario, even with DVR, all the traffic goes through the controller, which may get flooded under high concurrency. Apart from that, the resulting throughput depends on many factors, including the switch configuration, the lab topology (whether nodes are situated in the same rack or not, etc.) and the MTU in the external network, which must always be assumed to be no more than 1500.
The results of bi-directional tests are the most important as in real environments there is usually traffic going in and out and therefore it is important that throughput is stable in both directions. Here we can see that on the 300+ node lab the average throughput in both directions was almost 3 times higher than on the 200-node lab with the same MTU 9000.
The average results that are shown on the previous graphs are often affected by corner cases when the channel gets stuck due to various reasons and throughput drops significantly. To have a fuller understanding of what throughput is achievable you can take a look at a chart with most successful results, where upload/download exceeds 7 Gbits/sec on a 378-node lab.
To sum up, the dataplane testing has shown that Neutron DVR+VxLAN installations are capable of very high, almost line-rate performance.
There are two major factors: hardware configuration and MTU settings. This means that to get the best results you need a modern NIC capable of hardware offloads and jumbo frames enabled. Even on older NICs that don’t support all offloads, network performance can be improved drastically, as the results that we got on the 200-node lab clearly show.
The North-South scenario clearly needs improvement, as DVR is not currently truly distributed and in this scenario all traffic goes through the controller, which eventually gets clogged.
Now, Oleg will tell you about Density testing and share probably the most exciting results that we got.
Right! With the density test we aimed at 3 main things:
Boot as many VMs as the cloud can manage
But not only boot - make sure VMs are properly wired and have access to the external network
Verify that data-plane is not affected by high load on the cloud
So essentially the main idea was to load the cluster to death to see what the limits are and where the bottlenecks are. And additionally to check what happens to the data plane when the control plane breaks.
We only had a chance to run the density test on the 200-node lab. Just to remind you about the hardware: it was 3 controllers with 20 cores and 128 gigs of RAM, and 196 computes with 6 cores and 32 gigs of RAM. One node was taken for cluster health monitoring, with Grafana/Prometheus on it.
Now about the process. We used Heat for the first version of density test on this lab.
1 Heat stack is 1 private net with a subnet connected by a router to a public net, plus 1 VM per compute node. So 1 stack means 196 new VMs. To verify external connectivity and metadata access, each VM has to get some metadata from the metadata server and send this info to an external HTTP server. This way the server can check that all VMs got metadata and external access.
We created Heat stacks in batches of 1-5 (5 most of the time), so 1 iteration means up to about 1000 new VMs (5 × 196 = 980).
After each iteration we checked data plane integrity by running the connectivity check which Elena described earlier. We also constantly monitored cluster health to be able to detect and investigate any problem at an early stage.
I’ll speak about the issues we faced a bit later. Now about the results: it was a 3 (or maybe 4) day journey with over 10 people from different teams involved, and in the end we successfully created 125 stacks on this cluster, which is more than 24k VMs that were successfully spawned and got external connectivity. Data plane connectivity for the control group of VMs was never lost.
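The VM counts follow directly from the stack layout, as this short worked example shows. It uses only numbers mentioned in the talk (196 computes, 125 stacks, batches of at most 5 stacks); the variable and function names are made up for the example.

```python
from math import ceil

COMPUTE_NODES = 196        # each stack boots one VM per compute node
STACKS_CREATED = 125
MAX_STACKS_PER_BATCH = 5


def vms_for(stacks: int, computes: int = COMPUTE_NODES) -> int:
    """Number of VMs booted by the given number of Heat stacks."""
    return stacks * computes


total_vms = vms_for(STACKS_CREATED)
max_vms_per_iteration = vms_for(MAX_STACKS_PER_BATCH)
# Lower bound only: some batches had fewer than 5 stacks.
min_iterations = ceil(STACKS_CREATED / MAX_STACKS_PER_BATCH)

print(f"total VMs: {total_vms}")
print(f"VMs per full iteration: {max_vms_per_iteration}")
print(f"iterations needed: at least {min_iterations}")
```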
This is how one of the Grafana pages looked during the density tests. It shows CPU and memory load as well as load on the DB and network. These are aggregated graphs for all controllers and computes. The peaks here correspond to batches of VMs being spawned. You can also see how memory usage grows on compute nodes, while staying pretty stable on controllers. This is, by the way, close to the final iterations: as you can see, memory on computes is nearly exhausted.
And this is how CPU and memory consumption changed from the first to the last iteration. As you can see, we almost reached the memory limit on computes, which we expected to be the limiting factor, but it wasn’t.
Actually the bottleneck appeared to be in Ceph which was used in our deployment.
The initial failure was with the lack of allowed PIDs per OSD node, then Ceph monitors started to consume all (and even more) resources on controllers in order to restart, causing all other services (Rabbit, OpenStack services) to suffer.
After this Ceph failure the cluster could not be recovered, so the density test had to be stopped before the capacity of compute nodes was exhausted.
The Ceph team commented that 3 Ceph monitors aren't enough for over 20000 VMs (each having 2 drives) and recommended having at least 1 monitor per ~1000 client connections. It’s also better to move monitors to dedicated nodes.
One pretty important note: Connectivity check of Integrity test passed 100% even when cluster went crazy. That is a good illustration of control plane failures not affecting data plane.
Other issues:
At some point we had to increase ARP table size on computes and then on controllers;
Later we had to increase cpu_allocation_ratio on computes. It’s a Nova config option that controls how many VMs can be spawned on a given compute node depending on the number of real cores;
Several Neutron bugs, nothing critical though. The most interesting one is port creation time growth, which was fixed by a two-line patch. Another thing that deserves attention is OVS agent restart on a loaded compute node: there might be timeouts on the agent side when it tries to update the status of a large number of interfaces at once. It’s a well-known issue which has two alternative patches on review and just needs to reach consensus;
A bug in oslo.messaging which affected us quite a lot and took our messaging team some time to investigate and fix; the gist is that agents were reporting to queues consumed by nobody;
A Nova bug where mass VM deletion leads to nova-compute hanging; it’s related to Nova-Ceph interactions.
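To see why cpu_allocation_ratio had to be raised, here is a rough capacity estimate. It rests on loud assumptions that are not stated in the talk: every VM has 1 vCPU, the 6 cores per compute are what Nova counts, and the starting ratio is 16.0 (a commonly cited Nova default, which may not be what this deployment used). The function names are made up for the example.

```python
CORES_PER_COMPUTE = 6      # from the lab description
COMPUTE_NODES = 196
TOTAL_VMS = 24500          # 125 stacks x 196 VMs
VCPUS_PER_VM = 1           # assumption: flavor size is not given in the talk
ASSUMED_START_RATIO = 16.0  # assumption: actual lab setting is not given


def vcpu_capacity(cores: int, ratio: float) -> int:
    """vCPUs that can be scheduled on one node at a given overcommit ratio."""
    return int(cores * ratio)


def required_ratio(vms_per_node: int, vcpus_per_vm: int, cores: int) -> float:
    """Smallest overcommit ratio that fits the given number of VMs."""
    return vms_per_node * vcpus_per_vm / cores


vms_per_node = TOTAL_VMS // COMPUTE_NODES
capacity = vcpu_capacity(CORES_PER_COMPUTE, ASSUMED_START_RATIO)
needed = required_ratio(vms_per_node, VCPUS_PER_VM, CORES_PER_COMPUTE)

print(f"VMs per compute at the end of the test: {vms_per_node}")
print(f"vCPU capacity at ratio {ASSUMED_START_RATIO}: {capacity}")
print(f"ratio needed for {vms_per_node} single-vCPU VMs: {needed:.1f}")
```

Under these assumptions a node runs out of schedulable vCPUs well before the final iterations, which is consistent with the ratio having to be increased partway through.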
And finally here are the main outcomes of our scale testing:
No major issues in Neutron were found during testing (all labs, all tests).
Issues found were either already fixed upstream or were fixed upstream during our testing; one is in progress and close to being fixed.
Rally tests did not reveal any significant issues.
No threatening trends in Rally tests results.
Data-plane tests showed stable performance on all hardware. It was demonstrated that high network performance can be achieved even on old hardware that doesn’t support VxLAN offloads; it just needs proper MTU settings. On servers with modern NICs throughput is almost line-rate.
Data-plane connectivity is not lost even during serious issues with control plane.
Density testing clearly demonstrated that Neutron is capable of managing over 24500 VMs on 200 nodes (3 controllers) without serious performance degradation. In fact we weren’t even able to spot significant bottlenecks in the Neutron control plane, as we had to stop the test due to issues not related to Neutron.
Neutron is ready for large-scale production deployments on 350+ nodes.
Our process and results have been shared on docs.openstack.org; here are the links.