Let’s build a private cloud
2
Who am I?
● Kevin Honka
● Senior System Engineer at AD IT Systems
● Twitter: @piratehonk
● Mastodon: @piratehonk@norden.social
● Mail: kevin (at) honka.dev
3
Roadmap
● What is a private cloud
● How does one build it
● Pros / Cons
● Monitoring
● Difficulties
4
What is a private cloud?
● Similar to
  – Google Cloud
  – AWS
  – Azure
● But on our own hardware
5
What is a private cloud?
● KVM on steroids
● Loads of services
● Lots of infrastructure automation
● Kept together by tears and duct tape
6
Building a cloud
● Commercial
  – VMware
  – Nutanix
  – Red Hat OpenStack / OpenShift
  – Mirantis OpenStack
  – Nebula
7
Building a cloud
● Non-commercial
  – Apache Mesos
  – OpenStack Foundation
8
Building a cloud
● Manpower
● Money
● Time
● Knowledge
● Scale
9
Building a cloud
● OpenStack
  – SDN
  – Easily scalable
  – Good documentation
  – External support available from multiple companies
10
Building a cloud
● Minimum of 3 nodes
● Split control, network, storage, compute
● Scale later when necessary
11
Building a cloud
● 3 high-performance servers for OpenStack
  – Dedicated Fiber Channel
  – Dual sockets with high-core-count CPUs
  – All RAM slots occupied for optimal usage
● 4 high-I/O servers for Ceph
  – Dedicated Fiber Channel
  – Single socket with a medium CPU
  – NVMe SSDs for storage
12
Building a cloud
● 4-node Ceph cluster with default settings
  – Set up using cephadm
● 3-node OpenStack cluster
  – 1 control/network node
  – 2 compute nodes
● Set up with kolla-ansible
  – A single run takes around 30 to 60 minutes
13
Building a cloud
● kolla-ansible
  – Modified Ansible
  – Runs on a single YAML file
16
Pros
● Good documentation
● Highly customizable
● Control over all services
● Great for learning new things
  – KVM
  – Linux storage handling
  – Networking
17
Cons
● Steep learning curve
  – KVM
  – Networking
  – OpenStack services
  – Integration into existing environments
● Overspecific documentation
● Takes a long time to reach an MVP
18
Monitoring
● Prometheus
  – Exporters for most services
  – Predefined alert rules available
● Graylog
  – Log collection via NXLog
  – Processing for data extraction
19
Monitoring
● Many different formats
  – JSON
  – Classic system logging
  – Custom logging setups
  – Multi-line logs
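Multi-line logs are usually the awkward case for a collector. One common approach is to treat every line that starts with a timestamp as the beginning of a new record and to append everything else to the previous record. The following is a minimal sketch of that idea, assuming the `YYYY-MM-DD HH:MM:SS.mmm` prefix used by the OpenStack logs in this talk; `group_records` is a hypothetical helper name and the sample lines are made up.

```python
import re

# Assumption: a new record starts with a timestamp like "2022-11-09 10:03:15.070".
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} ")


def group_records(lines):
    """Merge continuation lines (e.g. tracebacks) into the record they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            # A timestamped line (or the very first line) opens a new record.
            records.append(line)
        else:
            # Anything else continues the previous record.
            records[-1] += "\n" + line
    return records


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2022-11-09 10:03:15.070 25 ERROR demo.service Something failed",
        "Traceback (most recent call last):",
        '  File "demo.py", line 1, in <module>',
        "ValueError: boom",
        "2022-11-09 10:03:16.001 25 INFO demo.service Recovered",
    ]
    for record in group_records(sample):
        print(repr(record))
```

Log shippers typically offer an equivalent multi-line setting, so in practice this logic would live in the collector configuration rather than in a script.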
20
Placement logs
● 2022-11-09 10:03:15.070 25 INFO placement.requestlog [req-6e970f73-e493-4418-ad9c-25b7ff34ba57 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" status: 200 len: 575 microversion: 1.0
● 2022-11-09 10:03:15.097 24 INFO placement.requestlog [req-d1118868-c205-4b38-8567-eb1c7db17811 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers?in_tree=480ccd47-c2c0-4a49-8972-b1486598f6e9" status: 200 len: 817 microversion: 1.14
● 2022-11-09 10:03:15.123 22 INFO placement.requestlog [req-e77ad21e-ead7-4b5e-9e03-b3350b188234 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/inventories" status: 200 len: 410 microversion: 1.0
● 2022-11-09 10:03:15.143 23 INFO placement.requestlog [req-7e8f0090-69f7-4b5c-b63d-6d8aeffa6312 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/aggregates" status: 200 len: 54 microversion: 1.19
● 2022-11-09 10:03:15.178 21 INFO placement.requestlog [req-416b2d05-eebc-4fac-b75d-10c02c7df252 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/traits" status: 200 len: 1593 microversion: 1.6
21
Placement logs
● 2022-11-09 10:03:15.070 25 INFO placement.requestlog [req-6e970f73-e493-4418-ad9c-25b7ff34ba57 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" status: 200 len: 575 microversion: 1.0
22
Log components
● Time: 2022-11-09 10:03:15.070
● Log level: INFO
● Origin: 10.XX.XX.XX
● HTTP method: GET
● URL: /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations
● HTTP status: “status: 200”
23
Tracing
● Request ID: req-6e970f73-e493-4418-ad9c-25b7ff34ba57
● Other identifiers:
  – ba26e9e5beaa41018db4a3e00c6e7ef9
  – 9abdc13c709a42949a985af187d64a4b
● Random data: default default
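The breakdown of a placement log line into components can be sketched as a single regular expression. This is only an illustration built from the sample line in this talk: the group names are my own labels, and the pattern is an assumption about this one log format, not a general OpenStack parser. With oslo.log's default context format, the two 32-character values are typically the user ID and project ID and the trailing `default default` the user and project domains, but that depends on the deployment's logging configuration and is not stated in the slides.

```python
import re

# Pattern derived from the sample line only; group names are illustrative.
LINE_RE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+) "
    r"\[(?P<request_id>req-[0-9a-f-]+) (?P<context>[^\]]*)\] "
    r"(?P<origin>\S+) "
    r'"(?P<method>[A-Z]+) (?P<url>[^"]+)" '
    r"status: (?P<status>\d{3}) "
    r"len: (?P<length>\d+) "
    r"microversion: (?P<microversion>[\d.]+)"
)


def parse_placement_line(line):
    """Return the log components as a dict, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["length"] = int(fields["length"])
    # Remaining identifiers inside the brackets, kept as an uninterpreted list.
    fields["context"] = fields["context"].split()
    return fields


SAMPLE = (
    "2022-11-09 10:03:15.070 25 INFO placement.requestlog "
    "[req-6e970f73-e493-4418-ad9c-25b7ff34ba57 "
    "ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b "
    "- default default] 10.XX.XX.XX "
    '"GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" '
    "status: 200 len: 575 microversion: 1.0"
)

if __name__ == "__main__":
    for key, value in parse_placement_line(SAMPLE).items():
        print(f"{key}: {value}")
```

Extracting the request ID as its own field is what makes tracing possible: the same `req-…` value can then be searched across the logs of every service.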
24
Difficulties
● Slow interface / API
● Performance inconsistencies in VMs
● Bad I/O
25
Slow Interface / API
● Long loading times
● Sometimes timeouts
26
Slow Interface / API
● Why is this happening?
● How do we resolve this?
27
Slow Interface / API
● Understanding how the services work
● Which path do requests take?
28
Slow Interface / API
29
Slow Interface / API
30
Slow Interface / API
● Why is this happening?
  – Too many connections via HAProxy
    ● A single request can generate 500 to 2000 internal requests
● How do we resolve this?
  – Use HAProxy only for incoming requests
  – Remove HAProxy completely
31
Slow Interface / API
● Use HAProxy only for incoming requests
  – Minimal impact
  – Easy to configure
● Remove HAProxy completely
  – Loss of high availability
  – One less service to worry about
32
Slow Interface / API
● Monitoring takeaways
  – Check logs for dropped connections
  – Monitor open TCP connections and times of the Linux kernel
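As a rough illustration of monitoring open TCP connections, the sketch below counts sockets per TCP state from the text of `/proc/net/tcp` on Linux. It covers IPv4 only (`/proc/net/tcp6` has the same layout for IPv6); the function name is hypothetical, and in a real setup an existing exporter would normally provide these numbers.

```python
from collections import Counter

# TCP state codes from the hexadecimal "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as handle:
            print(dict(count_tcp_states(handle.read())))
    except OSError:
        print("/proc/net/tcp is not available on this system")
```

A steadily growing number of sockets in a single state on the control node is the kind of signal that would have pointed at the proxy earlier.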
33
Performance inconsistencies
● I/O wait
● CPU lag
34
Performance inconsistencies
● Why is this happening?
  – Problem with KVM?
  – Hardware issues?
  – Something with the network?
  – Ceph issues?
35
Performance inconsistencies
36
Performance inconsistencies
● No progress after a week of debugging
● A hint from @isotopp@chaos.social
  – Old story about a MySQL DB
  – Something about NUMA swapping
37
Performance inconsistencies
● NUMA node0 CPU(s): 0-31,64-95
● NUMA node1 CPU(s): 32-63,96-127
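To reason about which CPUs belong to which NUMA node, the two lines above (in the form printed by `lscpu`) can be expanded into sets of CPU numbers. The sketch below is an illustration with hypothetical function names; it only parses text and does not query the hardware itself.

```python
import re

# Matches lines such as "NUMA node0 CPU(s): 0-31,64-95".
NUMA_LINE = re.compile(r"NUMA node(\d+) CPU\(s\):\s*(\S+)")


def parse_cpu_list(spec):
    """Expand a CPU list such as "0-31,64-95" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_numa_topology(lscpu_text):
    """Map NUMA node number to its set of CPU numbers."""
    return {int(node): parse_cpu_list(spec)
            for node, spec in NUMA_LINE.findall(lscpu_text)}


def nodes_used(topology, cpus):
    """Return the NUMA nodes touched by the given CPUs."""
    wanted = set(cpus)
    return {node for node, members in topology.items() if members & wanted}


if __name__ == "__main__":
    text = ("NUMA node0 CPU(s): 0-31,64-95\n"
            "NUMA node1 CPU(s): 32-63,96-127\n")
    topology = parse_numa_topology(text)
    # CPUs 31 and 32 are neighbours by number but sit on different nodes.
    print(nodes_used(topology, [31, 32]))
```

A process whose threads are spread over CPUs from more than one node is a candidate for the cross-socket memory access described in this talk.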
38
NUMA swapping
39
NUMA swapping
40
NUMA swapping
41
NUMA swapping
● KVM processes jump between cores
● On socket change, memory is behind a different CPU
  – Increased memory access time
  – Slower PCIe access
42
Performance inconsistencies
● Activate CPU pinning
  – CPU cores will be exclusive to a single KVM thread
  – Fewer available resources on compute nodes
  – Need more compute nodes for the same number of VMs
● Run KVM NUMA-aware
  – KVM threads will always run on the same NUMA node
  – No exclusive cores
43
Performance inconsistencies
● Monitoring takeaways
  – Impossible to monitor
● Intel Resource Director Technology can help
  – Not available on AMD systems
44
Bad I/O
● Ceph RBD volumes for VMs
● Causes?
  – Network?
  – Wrong configuration?
  – Hardware limits?
45
Bad I/O
● Symptoms
  – Slow writes; less than 300 op/s
  – Inconsistent reads; fluctuating between 20k and 20 op/s
  – Slow commits; more than 50 msec
46
Bad I/O
● Searching for a solution
  – Many tips for optimizations
    ● Stabilized I/O but did not increase it to estimated levels
  – Estimation
    ● NVMe SSDs
    ● At least 100k op/s
    ● Fast commit to disk; less than 500 µsec
47
Bad I/O
● Searching for a solution
  – Network works at peak, with 20GBps
  – Hardware resources are hardly touched
  – Possible problem with Ceph?
    ● Nothing in the documentation
    ● No recommendations
  – Accept it as fate and move to local storage?
48
Bad I/O
● A random link to a Ceph mailing list
  – OSDs should be at most 1TB, else performance will be poor
49
Bad I/O
● Reconfiguring the Ceph cluster to OSDs with a max size of 1TB
  – OSDs increase from 20 to 60
  – Each OSD gets its own core
    ● No NUMA swapping
  – Each SSD contains 3 OSDs
50
Bad I/O
● Success?
  – Partially
  – I/O performance
    ● Commit down to 40 µsec
    ● Consistent 15k+ op/s
  – Could be better, could be worse
51
Bad I/O
● Monitoring takeaway
  – Collect the metrics from libvirt
  – Plotting graphs can actually help here
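libvirt exposes per-disk counters that only become useful once they are turned into rates. The sketch below shows that step, assuming two samples in the five-element order that the libvirt Python bindings return from a domain's `blockStats()` call (read requests, read bytes, write requests, write bytes, errors). The function name and the numbers are made up for illustration; a Prometheus exporter with a rate query would do the same job in a real setup.

```python
def iops_from_samples(previous, current, interval_seconds):
    """Turn two cumulative counter samples into read and write op/s.

    Each sample is a tuple (rd_req, rd_bytes, wr_req, wr_bytes, errs);
    this ordering is an assumption based on libvirt's blockStats() result.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    read_ops = (current[0] - previous[0]) / interval_seconds
    write_ops = (current[2] - previous[2]) / interval_seconds
    return read_ops, write_ops


if __name__ == "__main__":
    # Made-up counter values taken 10 seconds apart.
    before = (1000, 4096000, 500, 2048000, 0)
    after = (4000, 16384000, 800, 3276800, 0)
    reads, writes = iops_from_samples(before, after, 10)
    print(f"read: {reads:.0f} op/s, write: {writes:.0f} op/s")
```

Plotted over time, these two rates make fluctuations like the ones listed under the symptoms visible at a glance.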
52
53
Monitoring takeaways
● Use existing tools
  – Prometheus exporter
    ● OpenStack
    ● Ceph
● Visualize everything!
  – Use existing dashboards and customize
54
What happened since then?
● Implementation of Prometheus for all services and servers
● Grafana dashboards for everything important
● Custom alert rules based on aggregated metrics
55
Questions?
56
Thank you and safe travels

OSMC 2022 | Let’s build a private cloud – how hard can it be? by Kevin Honka

  • 1. Let’s build a private clod
  • 2. 2 Who am I? ● Kevin Honka ● Senior System Engineer at AD IT Systems ● Twitter: @piratehonk ● Mastodon: @piratehonk@norden.social ● Mail: kevin (at) honka.dev
  • 3. 3 Roadmap ● What is a private cloud ● How does one build it ● Pros / Cons ● Monitoring ● Difficulties
  • 4. 4 What is a private cloud? ● Similar to – Google cloud – AWS – Azure ● But on our own hardware
  • 5. 5 What is a private cloud? ● KVM on steroids ● Loads of services ● Lots of Infrastructure automation ● Kept together by tears and duct tape
  • 6. 6 Building a cloud ● Commercial – VMWare – Nutanix – Red Hat Openstack / Openshift – Mirantis Openstack – Nebula
  • 7. 7 Building a cloud ● Non Commercial – Apache Mesos – Openstack Foundation
  • 9. 9 Building a cloud ● Openstack – SDN – Easily scalable – Good documentation – External support available from multiple companies
  • 10. 10 Building a cloud ● Minimum of 3 nodes ● Split control, network, storage, compute ● Scale later when necessary
  • 11. 11 Building a cloud ● 3 high performance servers for Openstack – Dedicated Fiber Channel – Dual sockets with high core CPUs – All RAM slots occupied for optimal usage ● 4 high I/O servers for Ceph – Dedicated Fiber Channel – Single socket with medium cpu – Nvme SSDs for Storage
  • 12. 12 Building a cloud ● 4 Node Ceph cluster with default settings – Setup using cephadm ● 3 Node Openstack cluster – 1 Control/Network Node – 2 Compute Nodes ● setup with kolla-ansible – A single run takes around 30 – 60 Minutes
  • 13. 13 Building a cloud ● kolla-ansible – Modifed ansible – Runs on a single YAML File
  • 14.
  • 15.
  • 16. 16 Pros ● Good documentation ● Highly customizable ● Control over all services ● Great for learning new things – Kvm – Linux Storage handling – Networking
  • 17. 17 Cons ● Steep learning curve – Kvm – Networking – Openstack services – Integration in existing environments ● Overspecific documentation ● Takes a long time to mvp
  • 18. 18 Monitoring ● Prometheus – Exporter for most services – Predefined alert rules available ● Graylog – Log collection via nxlog – Processing for data extraction
  • 19. 19 Monitoring ● Many different formats – Json – Classic system logging – Custom logging setups – Multi line logs
  • 20. 20 Placement logs ● 2022-11-09 10:03:15.070 25 INFO placement.requestlog [req-6e970f73-e493-4418-ad9c-25b7ff34ba57 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" status: 200 len: 575 microversion: 1.0 ● 2022-11-09 10:03:15.097 24 INFO placement.requestlog [req-d1118868-c205-4b38-8567-eb1c7db17811 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers?in_tree=480ccd47-c2c0-4a49-8972-b1486598f6e9" status: 200 len: 817 microversion: 1.14 ● 2022-11-09 10:03:15.123 22 INFO placement.requestlog [req-e77ad21e-ead7-4b5e-9e03-b3350b188234 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/inventories" status: 200 len: 410 microversion: 1.0 ● 2022-11-09 10:03:15.143 23 INFO placement.requestlog [req-7e8f0090-69f7-4b5c-b63d-6d8aeffa6312 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/aggregates" status: 200 len: 54 microversion: 1.19 ● 2022-11-09 10:03:15.178 21 INFO placement.requestlog [req-416b2d05-eebc-4fac-b75d-10c02c7df252 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/traits" status: 200 len: 1593 microversion: 1.6
  • 21. 21 Placement logs ● 2022-11-09 10:03:15.070 25 INFO placement.requestlog [req-6e970f73-e493-4418-ad9c-25b7ff34ba57 ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b - default default] 10.XX.XX.XX "GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" status: 200 len: 575 microversion: 1.0
  • 22. 22 Log components ● Time: 2022-11-09 10:03:15.070 ● Loglevel: INFO ● Origin: 10.XX.XX.XX ● HTTP Method: GET ● URL: /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations ● HTTP Status: “status: 200”
  • 23. 23 Tracing ● Request ID: req-6e970f73-e493-4418-ad9c-25b7ff34ba57 ● Other identifiers (user and project IDs): – ba26e9e5beaa41018db4a3e00c6e7ef9 – 9abdc13c709a42949a985af187d64a4b ● Domain fields: default default
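The log components and tracing fields listed above can be pulled out of a placement line with a single regular expression. The following is a minimal sketch, assuming every line follows the exact layout of the samples in this deck; the field names (`request_id`, `context`, `origin`, and so on) and the helper `parse_placement_line` are illustrative choices, not part of Openstack or Graylog.

```python
import re

# Assumed layout, derived only from the sample placement.requestlog lines above.
LOG_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+) "
    r"\[(?P<request_id>req-[0-9a-f-]+) (?P<context>[^\]]*)\] "
    r"(?P<origin>\S+) "
    r'"(?P<method>[A-Z]+) (?P<url>\S+)" '
    r"status: (?P<status>\d{3}) "
    r"len: (?P<length>\d+) "
    r"microversion: (?P<microversion>\S+)$"
)


def parse_placement_line(line):
    """Return the log components as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are converted so they can be aggregated or alerted on.
    fields["status"] = int(fields["status"])
    fields["length"] = int(fields["length"])
    return fields


SAMPLE = (
    "2022-11-09 10:03:15.070 25 INFO placement.requestlog "
    "[req-6e970f73-e493-4418-ad9c-25b7ff34ba57 "
    "ba26e9e5beaa41018db4a3e00c6e7ef9 9abdc13c709a42949a985af187d64a4b "
    "- default default] 10.XX.XX.XX "
    '"GET /resource_providers/480ccd47-c2c0-4a49-8972-b1486598f6e9/allocations" '
    "status: 200 len: 575 microversion: 1.0"
)

if __name__ == "__main__":
    print(parse_placement_line(SAMPLE))
```

Grouping parsed lines by `request_id` is what makes tracing possible: all log entries that belong to one request share that value.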
  • 24. 24 Difficulties ● Slow Interface / API ● Performance inconsistencies in VMs ● Bad I/O
  • 25. 25 Slow Interface / API ● Long loading times ● Sometimes timeouts
  • 26. 26 Slow Interface / API ● Why is this happening? ● How do we resolve this?
  • 27. 27 Slow Interface / API ● Understanding how the services work ● Which path do requests take?
  • 30. 30 Slow Interface / API ● Why is this happening? – Too many connections via HAProxy ● A single request can generate 500 to 2000 internal requests ● How do we resolve this? – Use HAProxy only for incoming requests – Remove HAProxy completely
  • 31. 31 Slow Interface / API ● Use HAProxy only for incoming requests – Minimal impact – Easy to configure ● Remove HAProxy completely – Loss of high availability – One less service to worry about
  • 32. 32 Slow Interface / API ● Monitoring takeaways – Check logs for dropped connections – Monitor the Linux kernel's open TCP connections and their timings
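One way to watch open TCP connections is to count sockets per state from the kernel's `/proc/net/tcp` table. The sketch below is illustrative and not the tooling used in the talk; the function name `count_tcp_states` is hypothetical, it covers IPv4 only (`/proc/net/tcp6` has the same layout), and in practice an existing Prometheus exporter would likely provide the same numbers.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text content of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header and is skipped.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields: slot, local address, remote address, state, ...
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # Linux only: read the live table from the kernel.
    with open("/proc/net/tcp") as handle:
        for state, count in sorted(count_tcp_states(handle.read()).items()):
            print(f"{state}: {count}")
```

A steadily growing count of `ESTABLISHED` or `TIME_WAIT` sockets on the node running HAProxy would be the kind of signal this takeaway refers to.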
  • 34. 34 Performance inconsistencies ● Why is this happening? – Problem with KVM? – Hardware issues? – Something with the Network? – Ceph Issues?
  • 36. 36 Performance inconsistencies ● No progress after a week of debugging ● A hint from @isotopp@chaos.social – Old story about a MySQL DB – Something about NUMA swapping
  • 37. 37 Performance inconsistencies ● NUMA node0 CPU(s): 0-31,64-95 ● NUMA node1 CPU(s): 32-63,96-127
  • 41. 41 NUMA swapping ● KVM processes jump between cores ● After a socket change, the memory sits behind a different CPU – Increased memory access time – Slower PCIe access
  • 42. 42 Performance inconsistencies ● Activate CPU pinning – CPU cores will be exclusive to a single KVM thread – Fewer available resources on compute nodes – Need more compute nodes for the same number of VMs ● Run KVM NUMA aware – KVM threads will always run on the same NUMA node – No exclusive cores
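To make the effect concrete, the CPU lists from the NUMA layout shown earlier (`0-31,64-95` and `32-63,96-127`) can be expanded into sets, which makes it easy to check whether a move from one core to another crosses the NUMA boundary. This is a small sketch; the helper names `parse_cpu_list`, `numa_node_of`, and `crosses_numa_boundary` are hypothetical.

```python
def parse_cpu_list(cpu_list):
    """Expand a kernel-style CPU list such as "0-31,64-95" into a set of ints."""
    cpus = set()
    for part in cpu_list.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


# NUMA layout taken from the lscpu-style output on the earlier slide.
NUMA_NODES = {
    0: parse_cpu_list("0-31,64-95"),
    1: parse_cpu_list("32-63,96-127"),
}


def numa_node_of(cpu):
    """Return the NUMA node a CPU belongs to, or None if it is unknown."""
    for node, cpus in NUMA_NODES.items():
        if cpu in cpus:
            return node
    return None


def crosses_numa_boundary(old_cpu, new_cpu):
    """True if a thread moving from old_cpu to new_cpu changes NUMA node."""
    return numa_node_of(old_cpu) != numa_node_of(new_cpu)


if __name__ == "__main__":
    print(crosses_numa_boundary(10, 70))   # both CPUs are on node 0
    print(crosses_numa_boundary(10, 100))  # node 0 to node 1
```

A thread that moves from CPU 10 to CPU 70 stays on node 0, while a move to CPU 100 lands on node 1 and leaves its memory behind the other socket.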
  • 43. 43 Performance inconsistencies ● Monitoring takeaways – Impossible to monitor ● Intel Resource Director Technology can help – Not available on AMD systems
  • 44. 44 Bad I/O ● Ceph RBD volumes for VMs ● Causes? – Network? – Wrong configuration? – Hardware limits?
  • 45. 45 Bad I/O ● Symptoms – Slow writes; less than 300 op/s – Inconsistent reads; fluctuating between 20k and 20 op/s – Slow commits; more than 50 msec
  • 46. 46 Bad I/O ● Searching for a solution – Many tips for optimizations ● Stabilized I/O but did not raise it to the estimated levels – Estimation ● NVMe SSDs ● At least 100k op/s ● Fast commit to disk; less than 500 usec
  • 47. 47 Bad I/O ● Searching for a solution – Network works at peak, with 20GBps – Hardware resources are hardly touched – Possible Problem with Ceph? ● Nothing in the documentation ● No recommendations – Accept it as fate and move to local storage?
  • 48. 48 Bad I/O ● A random link to a Ceph mailing list – OSDs should be at most 1TB, otherwise performance will be poor
  • 49. 49 Bad I/O ● Reconfiguring the Ceph cluster to OSDs with a max size of 1TB – OSDs increase from 20 to 60 – Each OSD gets its own core ● No NUMA swapping – Each SSD contains 3 OSDs
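The OSD numbers on this slide are consistent with 20 SSDs: one OSD per SSD before the change (20 OSDs) and three per SSD afterwards (60 OSDs). The arithmetic can be sketched as below. The SSD capacity is not stated in the talk, so the 3 TB value is purely an assumption chosen to reproduce the slide's numbers, and `plan_osds` is a hypothetical helper.

```python
import math


def plan_osds(device_count, device_size_tb, max_osd_size_tb=1.0):
    """Return (OSDs per device, total OSDs) when each OSD is capped in size."""
    per_device = max(1, math.ceil(device_size_tb / max_osd_size_tb))
    return per_device, per_device * device_count


if __name__ == "__main__":
    # ASSUMPTION: 3 TB per SSD; the slides only give the OSD counts.
    per_device, total = plan_osds(device_count=20, device_size_tb=3.0)
    print(per_device, total)
```

With each OSD pinned to its own core, the total OSD count is also the number of cores that have to be reserved for Ceph across the storage nodes.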
  • 50. 50 Bad I/O ● Success? – Partially – I/O Performance ● Commit down to 40 µsec ● Consistent 15k+ op/s – Could be better, could be worse
  • 51. 51 Bad I/O ● Monitoring takeaway – Collect the metrics from libvirt – Plotting graphs can actually help here
  • 53. 53 Monitoring takeaways ● Use existing tools – Prometheus exporters ● Openstack ● Ceph ● Visualize everything! – Use existing dashboards and customize them
  • 54. 54 What happened since then? ● Implementation of Prometheus for all Services and Servers ● Grafana Dashboards for everything important ● Custom alert rules based on aggregated metrics
  • 56. 56 Thank you and safe travels