Efficient monitoring is crucial when managing your Cloud infrastructure. The metrics collected by OpenNebula can be used to trigger automatic scaling, or to quickly detect failures and automatically restart virtual machines. During this talk, I will show how OpenNebula can be used to efficiently monitor thousands of virtual machines at sub-1 minute intervals. I will show how OpenNebula can be enhanced and optimized, and how different metrics collection tools such as Ganglia and Host sFlow can be used with OpenNebula to monitor large-scale Cloud infrastructures.
Bio:
Simon Boulet is an Entrepreneur and an IT Consultant from Montreal, Canada. He has worked on various Cloud infrastructure projects, including projects for the CBC/Radio-Canada public television that had important scaling needs for hosting online interactive TV shows. Prior to becoming an IT Consultant, Simon was IT Director at iWeb, Canada’s largest Web Hosting company, where he led iWeb’s first steps into Cloud Computing with the development of the Smart Servers. Simon is also an active and frequent contributor to OpenNebula, with a deep understanding of OpenNebula internals, and has contributed several enhancements and bug fixes that made it through the official releases of OpenNebula.
2. Goals
1. Show how to configure OpenNebula to
achieve sub-1 minute monitoring interval
2. Demonstrate the use of OpenNebula in
large-scale cloud infrastructures
3. Suggest enhancements to OpenNebula
performance and monitoring
3. How Big Exactly is Large-scale?
How many hosts?
1,000? 2,000? 10,000 VMs?
4. Monitoring in OpenNebula
● Detects when a VM or host changes status
(Running, Stopped, etc.)
● Built-in metrics: CPU, memory and network
usage
● You can add as many metrics as you like by
customizing the driver
● Can be used to perform various tasks (auto
scaling, high-availability redeployment, etc.)
5. Don't Expect the Default
Configuration to Perform Optimally
● Database: Use MySQL database backend,
not the default SQLite
● Logs: Use Syslog log system, and disable
debug logging (debug_level=1)
● Number of threads: Adjust the number of
driver threads (see the -t option in your *MAD
config options)
6. Use OpenNebula >= 4.0
Prior versions did monitoring in two phases:
1. The IM Monitor action monitored Hosts
2. The VMM Poll action monitored VMs
(100 Hosts + 1,000 VMs) * 4 polls per minute (15-second interval) = 4,400
actions per minute
Since OpenNebula 4.0, the IM Monitor action is
capable of returning the information of VMs
running on the monitored host
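The arithmetic on this slide can be checked with a short Python sketch. The function name and parameters are illustrative and not part of OpenNebula. It assumes one action per host per interval, plus one action per VM per interval in the two-phase scheme.

```python
def monitoring_actions_per_minute(hosts, vms, interval_seconds, combined=False):
    """Estimate monitoring driver actions per minute.

    combined=False models the two-phase scheme: one IM Monitor action per
    host plus one VMM Poll action per VM, every interval.
    combined=True models 4.0 and later, where the host action also
    returns the information of the VMs running on that host.
    """
    polls_per_minute = 60 / interval_seconds
    targets = hosts if combined else hosts + vms
    return targets * polls_per_minute


print(monitoring_actions_per_minute(100, 1000, 15))                 # two-phase
print(monitoring_actions_per_minute(100, 1000, 15, combined=True))  # 4.0+
```

With the numbers from the slide, the two-phase scheme needs 4,400 actions per minute, while the combined scheme needs 400.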
7. Monitoring History
By default OpenNebula keeps 24h of
monitoring history
24h at a 15-second interval = 5,760 records per VM
Average record size: 4KB
5,760 records * 4KB = 23MB of monitoring history per VM
100 VMs = 2.3GB
10,000 VMs = 230GB
HOST_MONITORING_EXPIRATION_TIME and
VM_MONITORING_EXPIRATION_TIME config options
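A rough way to reproduce these figures for your own deployment is sketched below. The function is hypothetical. It assumes the 4KB average record size from the slide (taken here as 4,000 bytes, in decimal units) and a retention period given in seconds.

```python
def history_size_bytes(vms, interval_seconds, retention_seconds, record_bytes=4000):
    """Estimate the size of the VM monitoring history.

    record_bytes is the average record size (an assumption; measure your own).
    """
    records_per_vm = retention_seconds // interval_seconds
    return vms * records_per_vm * record_bytes


DAY = 24 * 60 * 60  # the default 24h of history, in seconds

for vms in (1, 100, 10_000):
    size_gb = history_size_bytes(vms, 15, DAY) / 10**9
    print(f"{vms:>6} VMs: {size_gb:.3f} GB")
```

Shortening the retention period or lengthening the monitoring interval reduces the total proportionally.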
8. Monitoring History (continued)
● Reduce history to 30 minutes (1800
seconds)
● Use MySQL MEMORY storage engine for
vm_monitoring and host_monitoring tables
It's OK to lose monitoring history when MySQL
is restarted
The most recent monitoring values are stored in the VM
template
Set MySQL max_heap_table_size large enough to hold all your monitoring
history
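To pick a starting value for max_heap_table_size, a back-of-the-envelope estimate such as the one below may help. It is only a sketch: the 4,000-byte record size and the 1.5x headroom factor are assumptions, and the real memory use depends on how MySQL stores the rows, so check it against your own tables.

```python
def min_heap_table_bytes(vms, interval_seconds=15, retention_seconds=1800,
                         record_bytes=4000, headroom=1.5):
    """Estimate the bytes needed to hold the VM monitoring history in memory.

    headroom is an arbitrary safety margin (assumption, not a MySQL setting).
    """
    records = vms * (retention_seconds // interval_seconds)
    return int(records * record_bytes * headroom)


size = min_heap_table_bytes(vms=4000)
print(f"max_heap_table_size = {size}")  # plain byte value
```

For 4,000 VMs monitored every 15 seconds with 30 minutes of history, this comes to roughly 2.9GB, compared with about 92GB for the default 24h of history before any headroom.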
9. Watch your Load Average
As of 4.2, the maximum number of
simultaneous XML-RPC API connections is
limited to 15
Overloaded OpenNebula = Slow XML-RPC API response =
API Limit / Timeout
● Reduce load at deployment time by
adjusting number of VMs simultaneously
deployed by scheduler
● Watch next release (4.4) for
XML-RPC API concurrency
enhancements
10. Local Caching Nameserver
OpenNebula uses DNS names for monitoring
hosts (unless you named your hosts using their
IP addresses instead of names)
● Use a local caching nameserver to speed up
DNS lookup (such as dnsmasq).
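To see whether name resolution is worth optimizing on your front-end, you could time the lookups with a small script like this one. It is a minimal sketch using only the Python standard library. The helper name is made up, and "localhost" stands in for your real host names.

```python
import socket
import time


def time_lookups(hostnames, resolver=socket.gethostbyname):
    """Return {hostname: seconds taken to resolve}, or None if resolution failed."""
    timings = {}
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolver(name)
            timings[name] = time.perf_counter() - start
        except OSError:  # socket.gaierror is a subclass of OSError
            timings[name] = None
    return timings


if __name__ == "__main__":
    # Replace "localhost" with the names of your monitored hosts.
    for name, seconds in time_lookups(["localhost"]).items():
        print(name, "failed" if seconds is None else f"{seconds * 1000:.2f} ms")
```

Running it before and after installing a caching nameserver gives a rough idea of the time saved on each monitoring cycle.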
11. Beware of SSH Transport
Most OpenNebula drivers (KVM, Xen, etc.) use
SSH connections to perform actions
OK for deploying new VM, but expensive when
doing VM monitoring
12. Meet Ganglia
“Ganglia is a scalable distributed system monitor tool for high-performance
computing systems such as clusters and grids.”
- Wikipedia
OpenNebula has built-in support for Ganglia
By default Ganglia and OpenNebula must run
on the same machine
Set GANGLIA_HOST in /var/lib/one/remotes/im/ganglia.d/ganglia_probe and
/var/lib/one/remotes/vmm/kvm/poll_ganglia
14. Ganglia Driver Limitations
1. Currently only 1 Ganglia Collector is
supported
2. Need to run script on each host to export
OpenNebula-specific metric
(OPENNEBULA_VMS_INFORMATION)
3. Ganglia has a maximum length of 1392 bytes
for string metrics
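Because of the third limitation, it can be useful to check the size of a metric value before exporting it. The sketch below is illustrative only: the helper is hypothetical and the sample payload format is invented, not the format used by the OpenNebula Ganglia driver.

```python
GANGLIA_MAX_STRING_BYTES = 1392  # limit quoted on the slide


def fits_in_ganglia_string_metric(value, limit=GANGLIA_MAX_STRING_BYTES):
    """Return True if the UTF-8 encoded value fits in one string metric."""
    return len(value.encode("utf-8")) <= limit


# Made-up payload: one short entry per VM running on the host.
payload = ";".join(f"vm{i}:running" for i in range(40))
print(len(payload.encode("utf-8")), fits_in_ganglia_string_metric(payload))
```

The more VMs a host runs and the more data is exported per VM, the sooner the limit is reached.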
15. Host sFlow
“The Host sFlow agent exports physical and virtual server performance
metrics using the sFlow protocol. The agent provides scalable, multi-vendor,
multi-OS performance monitoring with minimal impact on the systems being
monitored.”
- http://host-sflow.sourceforge.net/
Exports a standard set of hypervisor and VM
metrics
Official support for Xen, KVM and Hyper-V, but
uses Libvirt to gather metrics (and Libvirt
supports LXC, OpenVZ, VMware, etc.)
17. Host sFlow (continued)
Sample Metrics
Not currently supported in OpenNebula. Contact me if you're interested.
Host Metrics
vnode_mem_total       Hypervisor Total Memory
vnode_domains         Hypervisor VM Count
VM Metrics
<VM ID>.vcpu_state    VM State (Running, Stopped, etc.)
<VM ID>.vmem_util     VM Memory Utilization
<VM ID>.vdisk_free    VM Free Disk Space
18. 4,000 VMs at Sub-1 Minute Interval
OpenNebula 4.2 + xml-rpc patch (upcoming in 4.4)
Experimental Host sFlow Driver
1 OpenNebula Core (EC2 High-CPU XLarge instance)
1 Sunstone Web Server (EC2 Standard Medium instance)
1 Ganglia Collector (EC2 Standard Medium instance)
100 Hosts (EC2 High-CPU Medium instances)
~40 VMs per Host
~4,000 VMs (OpenVZ)
15 - 60 second monitoring interval
22. Looking Forward
There’s room for optimizations
● The command line tools can get very slow when
returning very large result sets (but not the API…)
● Distributed driver, for example using ZeroMQ for
distributing tasks to multiple workers
● Investigate PoolSQL locks being held for long periods
and blocking other threads (discussed in bug #1818)
● Gather metrics about OpenNebula internals: locks wait,
effective monitoring interval, memory footprints, etc.
● Investigate very large Sunstone memory usage
23. Thank you!
Questions?
“OpenNebula captured my interest for several technical
reasons besides the fact that it is truly open. It's architecture
is very elegant; it has C++ bones, ruby muscles and bash
tendons. It's extensible and understandable. It has no peer
as far as I can tell.”
Christopher Barry, Infrastructure Engineer, RJMetrics,
September 2012
http://opennebula.org/users:testimonials