2. Agenda
● Kubernetes/OpenShift runtimes & scalability goals
● OpenShift system testing: what does it cover?
● Installing large clusters
● Scalability test tools (the Kubernetes performance
test repo and the OpenShift SVT repo)
● Sample results
4. K8s and OpenShift runtimes
● Primarily targeted at cloud platforms
○ Amazon EC2, Google Cloud Platform, Microsoft Azure
○ Enterprise-hosted cloud offerings/infra
○ On-prem cloud infra such as OpenStack
○ Bare metal and other virtualization environments, too
● Cluster sizes from all-in-one dev/sandbox to
multi-master, 1000+ nodes or federated clusters
5. What does a cluster look like? AWS sample:
(Architecture diagram)
● Control Plane: master1 + etcd1, master2 + etcd2, master3 + etcd3 (each backed by SSD)
● Infrastructure Group: infra1 (HAProxy router1, docker-registry1), infra2 (HAProxy router2, docker-registry2)
● Nodes: node 1 … node 1000
● Persistent Volume Storage: EBS (Persistent Volumes), S3 (Registry)
● Load balancers: Application ELB (Routes), External ELB (Console), Internal ELB (Nodes), all reached via the Internet
6. Kubernetes SIG-scale
● Scalability special interest group
○ https://github.com/kubernetes/community/tree/master/sig-scalability
● Container workload is what matters - listen to your applications
○ The numbers here are more “control plane” - think small pods/containers
● Stated future goals:
○ Assumption: cores/node = 64 (higher in the future)
○ Pods/core = 10 (depends on workload)
○ Pods/node = 500–640 (depends on workload; these would be small pods)
○ Nodes/cluster = 5,000
○ Pods/cluster = 500,000 (note: less than nodes × pods/node)
○ pod startup time < 5 seconds
○ Schedule 100 pods/second
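A quick back-of-the-envelope check of how these targets interact; a sketch using only the SIG-scale numbers stated above (the low end of the pods/node range is the assumption):

# Back-of-the-envelope check of the SIG-scale targets listed above.
nodes_per_cluster = 5_000
pods_per_node = 500              # low end of the 500-640 range
pods_per_cluster_cap = 500_000   # stated cluster-wide target

# Naive capacity if every node were packed to its per-node limit:
naive_capacity = nodes_per_cluster * pods_per_node
print(naive_capacity)            # 2,500,000 pods

# The cluster-wide cap is the binding limit, not the per-node one:
print(pods_per_cluster_cap / nodes_per_cluster)  # 100 pods/node on average at 5,000 nodes

# At 100 pods scheduled per second, filling the cluster to the cap takes:
print(pods_per_cluster_cap / 100 / 60)           # ~83 minutes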
9. System Test team in Red Hat
● Kubernetes and OpenShift Scalability
○ Cluster horizontal scale
■ # of nodes
■ # of running pods across all nodes
■ application traffic
○ Node vertical scale
■ # of pods running on a single node
■ workload that a single node can support (applications, builds, storage)
○ Application scalability
■ Scale # of application replicas up/down
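As an illustration of the replica scale-up/scale-down exercise, here is a minimal sketch using the upstream Kubernetes Python client; the deployment name, namespace, and replica counts are placeholders, and this is not the SVT tooling itself:

# Minimal replica scale-up/down sketch with the Kubernetes Python client.
# Assumes a working kubeconfig; "sample-app" / "svt-demo" are placeholder names.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale(deployment, namespace, replicas):
    """Patch the Scale subresource of a Deployment to the requested replica count."""
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale("sample-app", "svt-demo", 50)   # scale up
scale("sample-app", "svt-demo", 1)    # scale back down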
10. System Test team in Red Hat
● Performance
○ Resource usage and response times for scenarios and workloads
■ Application workload and access performance
■ Builds (OpenShift)
■ Metrics and Log collection
○ OpenShift infrastructure performance
■ Resource usage of processes under load
■ Network (SDN) throughput
■ Routing
■ Storage (EBS, Ceph, Gluster, Cinder, etc)
11. System Test team in Red Hat
● Reliability
○ Simulated user workloads
■ monthly, weekly, daily, hourly, minute activities
■ accelerated to run faster than real-time
○ Run for extended periods and measure CPU, memory, I/O,
network over time
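A minimal sketch of what "accelerated" simulated user activity can look like: periodic tasks defined at real-world intervals are replayed with a time-compression factor. The task names and the acceleration factor are illustrative assumptions, not the actual SVT reliability harness:

# Sketch of an accelerated reliability workload: real-world periods are
# divided by an acceleration factor so hourly/daily activity happens in minutes.
import time

ACCELERATION = 60  # illustrative: 1 simulated hour passes per real minute

# (task name, real-world period in seconds) - placeholder activities
TASKS = [
    ("create/delete pods",       60 * 60),         # hourly
    ("run builds",               24 * 60 * 60),    # daily
    ("redeploy applications",    7 * 24 * 3600),   # weekly
]

def run(duration_s):
    start = time.time()
    next_due = {name: start for name, _ in TASKS}
    while time.time() - start < duration_s:
        now = time.time()
        for name, period in TASKS:
            if now >= next_due[name]:
                print(f"{now - start:8.1f}s  running: {name}")
                next_due[name] = now + period / ACCELERATION
        time.sleep(1)

run(duration_s=300)  # drive 5 real minutes (~5 simulated hours of activity)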
12. SVT Challenges/Fun
● Installation
○ 1000+ node installs are time consuming (multiple hours)
○ On public cloud providers, time = $$$. Maximize time spent testing
○ A 500 node test cluster on AWS runs around USD $1,500–2,000/day (rough estimate below)
● Verifying that a cluster is viable
○ Don’t waste time on buggy clusters
● Loading up a cluster with application containers
● Putting a workload on the cluster
● Collecting performance data in large clusters
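A rough sense of where that daily figure comes from; the instance type and hourly rate are assumptions for illustration, not actual billing data:

# Rough daily-cost estimate for a 500 node test cluster on AWS.
# Hourly rate is an assumed on-demand price per worker node; real costs also
# include masters, infra nodes, EBS volumes, ELBs, and data transfer.
nodes = 500
hourly_rate_per_node = 0.15   # assumed USD/hour for a modest instance type
hours_per_day = 24

daily_compute = nodes * hourly_rate_per_node * hours_per_day
print(f"~${daily_compute:,.0f}/day for node compute alone")  # ~$1,800/day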
18. Kubernetes e2e and perf test
● e2e (end-to-end) tests
○ https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md
○ Subset of e2e tests are tagged as Conformance.
○ Conformance = minimum supported functionality for operational cluster
○ OpenShift also adds some additional Conformance tests if you yum install
atomic-openshift-tests on top of OpenShift
● Performance tests
○ https://github.com/kubernetes/perf-tests
○ Work in progress
19. OpenShift SVT repo
● https://github.com/openshift/svt
● Tools for OpenShift performance, scale, reliability
○ cluster load-up
○ traffic generation
○ concurrent builds, deployments, pod start/stop
○ reliability testing
○ network performance
○ logging and metrics tests
● Automated and executed from Jenkins
20. Cluster load-up
● cluster-loader - python tool to quickly load clusters according to a YAML test
specification. Takes advantage of OpenShift’s template capabilities
● Can be used with Kubernetes or OpenShift
● SVT repository has sample YAML configurations for node vertical, cluster horizontal,
“Quick Start” applications with and without persistent storage.
“I want an environment with thousands of deployments, pods (with persistent storage), build
configurations, routes, services, secrets and more…”
projects:
  - num: 1000
    basename: nginx-explorer
    tuning: default
    templates:
      - num: 10
        file: cluster-loader/nginx.yaml
      - num: 20
        file: cluster-loader/explorer-pod.yaml
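Read literally, that spec asks for 1,000 projects, each instantiating the two templates 10 and 20 times respectively. A tiny sketch of that expansion (it only counts what the spec implies and is not the cluster-loader implementation; assumes PyYAML is available):

# Count what the example spec above expands to (illustration only).
import yaml

spec = yaml.safe_load("""
projects:
  - num: 1000
    basename: nginx-explorer
    tuning: default
    templates:
      - num: 10
        file: cluster-loader/nginx.yaml
      - num: 20
        file: cluster-loader/explorer-pod.yaml
""")

total_projects = sum(p["num"] for p in spec["projects"])
total_template_instances = sum(
    p["num"] * t["num"] for p in spec["projects"] for t in p["templates"]
)
print(total_projects)            # 1000 projects
print(total_template_instances)  # 30000 template instantiations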
21. Cluster traffic generation
● cluster-loader can also run in traffic generation mode
● Runs a JMeter pod to generate traffic against applications (installed
by cluster-loader or otherwise)
● Hit rate, throughput, response codes, response times, etc
● Discovers applications, exposed routes, etc
● Currently OpenShift only, but working on an upstream version.
23. Performance Tools
● PBench: Performance and Benchmark Analysis
Framework
○ pbench-agent: collection agent and harness for running tests.
■ Collects data from sar, vmstat, iostat, pidstat, perf, etc
■ Extensible: additional data collectors can be added
■ Packages raw data from a test and ships it to pbench-server
○ pbench-server: processes raw data from all systems under test
○ web-server: provides visualization of data
https://github.com/distributed-system-analysis/pbench
26. Master 1 is the controller leader for most of the run.
Master 2 has to pick up controller leadership when Master 1 fails.
Loading on OSP 8 cluster:
● 500 nodes
● 20K projects
● 52K pods
Masters are 40 vCPU and peak at 22 cores used.
27. Create/delete hundreds of pods: Amazon EBS IOPS credit exhaustion - the AWS “I/O cliff”
gp2 EBS volumes on EC2 can run “fast” until their IOPS burst credits are exhausted
After that, they are throttled to a baseline of 3 IOPS/GB until credits build back up
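A rough model of that cliff using the gp2 behavior AWS documents (3 IOPS/GB baseline, burst up to 3,000 IOPS, initial balance of 5.4 million I/O credits); the volume size and sustained load are assumptions for illustration:

# Rough model of the gp2 "I/O cliff": how long a small volume can burst
# before being throttled to its baseline. Volume size and offered load
# are assumptions; baseline/burst/credit figures follow AWS's gp2 documentation.
volume_gb = 100
baseline_iops = 3 * volume_gb      # 3 IOPS per GB -> 300 IOPS
burst_iops = 3000                  # gp2 burst ceiling
initial_credits = 5_400_000        # initial I/O credit balance

offered_iops = 2000                # assumed sustained load from pod create/delete churn
# Credits drain at (consumed - earned) per second while bursting above baseline.
drain_per_second = offered_iops - baseline_iops   # 1700 credits/s
seconds_until_cliff = initial_credits / drain_per_second

print(f"~{seconds_until_cliff / 60:.0f} minutes until throttled to {baseline_iops} IOPS")
# ~53 minutes, after which the volume falls back to 300 IOPS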