OSDC 2014
Overlay Datacenter Information
Christian Kniep

Bull SAS
2014-04-10
About Me
❖ Me (>30y)
❖ SysOps (>10y)
❖ SysOps v1.1 (>8y)
❖ BSc (2008-2011)
❖ DevOps (>4y)
❖ R&D [OpsDev?] (>1y)
2
Agenda
3
❖ Cluster Stack
❖ Motivation (InfiniBand use-case)
❖ QNIB/ng
❖ QNIBTerminal (virtual cluster using docker)
(Diagram labels: Cluster Stack, IB, QNIBng, QNIBTerminal; parts I., II., III.)
Cluster Stack Work Environment
4
Cluster?
5
"A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system." - wikipedia.org
User
HPC-Cluster
6
High Performance Computing
❖ HPC: Surfing the bottleneck
❖ Weakest link breaks performance
Cluster Layers
7
(rough estimate)
Hardware: IPMI, lm_sensors, IB counter
Operating System: Kernel, Userland tools
MiddleWare: MPI, ISV-libs
Services: Storage, Job Scheduler, sshd
Software: End user application
Excel: KPI, SLA
(Diagram labels: Events, Metrics; roles: End User, PowerUser/ISV, Mgmt, SysOps, SysOpsL1, SysOpsL2, SysOpsL3, SysOpsMgmt, ISVMgmt)
Layer
n
❖ Every Layer is composed of layers
❖ How deep to go?
8
Little Data w/o Connection
9
❖ Multiple data sources
❖ No way of connecting them
❖ Connecting is manual labour
❖ Experience driven
❖ Niche solutions misleading
IB + QNIBng Motivation
10
Modular Switch
11-15
❖ Looks like one "switch"
❖ Composed of a network itself
❖ Which route is taken is transparent to the application
❖ LB1<>FB1<>LB4
❖ LB1<>FB2<>LB4
❖ LB1 ->FB1 ->LB4 / LB1 <-FB2 <-LB4
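The route ambiguity inside a modular switch can be made concrete with a small sketch. It assumes a hypothetical two-tier layout in which every leaf board (LB) is wired to every fabric board (FB). The board names, the board count and the function names are illustrative and not taken from a real chassis or from OpenSM.

```python
from itertools import product


def internal_routes(src_leaf, dst_leaf, fabric_boards):
    """List every leaf -> fabric -> leaf path inside the modular switch."""
    return [(src_leaf, fb, dst_leaf) for fb in fabric_boards]


def round_trips(src_leaf, dst_leaf, fabric_boards):
    """Pair each forward path with each return path; they need not match."""
    forward = internal_routes(src_leaf, dst_leaf, fabric_boards)
    backward = internal_routes(dst_leaf, src_leaf, fabric_boards)
    return list(product(forward, backward))


if __name__ == "__main__":
    fabric = ["FB1", "FB2"]  # assumption: two fabric boards
    for fwd, bwd in round_trips("LB1", "LB4", fabric):
        kind = "symmetric" if fwd[1] == bwd[1] else "asymmetric"
        print(" -> ".join(fwd), "/", " -> ".join(bwd), kind)
```

With two fabric boards there are already four possible round trips between LB1 and LB4, two of them asymmetric, which is why the application cannot tell which internal link it depends on.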
Debug-Nightmare
16
❖ Job seems to fail due to bad internal link
❖ 96 port switch
❖ multiple autonomous job-cells
❖ Relevant information
  ❖ Job status (Resource Scheduler)
  ❖ Routes (IB Subnet Manager)
  ❖ IB Counter (Command Line)
❖ changing one plug, recomputes routes :)
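Overlaying the three information sources named on this slide could look roughly like the sketch below. All data structures are hypothetical: in practice the job's node list would come from the resource scheduler, the routes from the subnet manager and the error counts from the IB counters, each in its own format. The function name and link names are made up for illustration.

```python
def suspect_links(job_nodes, routes, error_counters, threshold=0):
    """Return links used by the job whose error count exceeds threshold.

    job_nodes: set of node names allocated to the job
    routes: dict mapping (src, dst) node pairs to a list of link names
    error_counters: dict mapping link name to an error count
    """
    suspects = {}
    for (src, dst), links in routes.items():
        # Only routes between two nodes of the same job matter here.
        if src in job_nodes and dst in job_nodes:
            for link in links:
                errors = error_counters.get(link, 0)
                if errors > threshold:
                    suspects[link] = errors
    return suspects


if __name__ == "__main__":
    job = {"node01", "node13"}
    routes = {
        ("node01", "node13"): ["LB1-FB1", "FB1-LB4"],
        ("node35", "node95"): ["LB2-FB2"],
    }
    counters = {"FB1-LB4": 17, "LB2-FB2": 3}
    print(suspect_links(job, routes, counters))
```

The link with errors on the unrelated route (LB2-FB2) is ignored, because no pair of nodes in the job communicates over it.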
IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring
Michael Hoefling, Michael Menth, Christian Kniep, Marcus Camen
(Poster, Communication Networks; further poster headings: Rate Measurement in IB Networks, IBPM: Demo Overview)
Background: InfiniBand (IB)
❖ State-of-the-art communication technology for interconnection in high-performance computing data centers
❖ Point-to-point bidirectional links
❖ High throughput (40 Gbit/s with QDR)
❖ Low latency
❖ Dynamic on-line network reconfiguration
Idea
❖ Extract raw network information from IB network
❖ Analyze output
❖ Derive statistics about performance of the network
Topology Extraction
❖ Subnet discovery using ibnetdiscover
❖ Produces human readable file of network topology
❖ Process output to produce graphical representation of the network
Remote Counter Readout
❖ Each port has its own set of performance counters
❖ Counters measure, e.g., transferred data, congestion, errors, link state changes
ibsim-Based Network Simulation
❖ ibsim simulates an IB network
❖ Simple topology changes possible (GUI)
ibsim limitations
❖ No performance simulation possible
❖ No data rate changes possible
Real IB Network
❖ Physical network
❖ Allows performance measurements
❖ GUI controlled traffic scenarios
17
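Rate measurement from port counters boils down to dividing a counter delta by the sampling interval. The sketch below assumes two samples of a data counter that counts in units of 4 octets, as the InfiniBand PortXmitData counter does, and that the counter neither wrapped nor saturated between the samples. The function name is hypothetical and this is not the IBPM implementation.

```python
def data_rate_gbit_s(count_start, count_end, seconds, octets_per_unit=4):
    """Derive a data rate in Gbit/s from two samples of a data counter."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    delta = count_end - count_start
    if delta < 0:
        # A smaller second sample hints at a counter reset in between.
        raise ValueError("counter went backwards, was it reset?")
    bits = delta * octets_per_unit * 8
    return bits / seconds / 1e9


if __name__ == "__main__":
    # 312,500,000 units of 4 octets within one second
    print(data_rate_gbit_s(0, 312_500_000, 1.0))  # prints 10.0
```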
OpenSM
18
❖ OpenSM Performance Manager
  ❖ Sends token to all ports
  ❖ All ports reply with metrics
  ❖ Callback triggered for every reply
❖ osmeventplugin
  ❖ Dumps info to file
(Diagram labels: OpenSM, PerfMgmt, osmeventplugin, Sw, node)
OpenSM
19
❖ qnib
  ❖ sends metrics to RRDtool
  ❖ events to PostgreSQL
❖ qnibng
  ❖ sends metrics to graphite
  ❖ events to logstash
(Diagram labels: OpenSM, PerfMgmt, qnib, qnibng)
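Sending a metric to graphite can be sketched with carbon's plaintext protocol, which takes one "path value timestamp" line per data point, by default on TCP port 2003. This is a minimal illustration and not the qnibng code. The metric path, host and function names are assumptions.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=5.0):
    """Send preformatted metric lines to a carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric path for one switch port counter.
    line = format_metric("ib.switch1.port12.xmit_data", 42, 1397088000)
    print(line, end="")
    # send_metrics([line])  # needs a reachable carbon daemon
```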
Graphite Events port is up/down
20
21
22
QNIBTerminal Proof of Concept
23
Cluster Stack Mock-Up
❖ IB events and metrics are not enough
❖ How to get real-world behavior?
❖ Wanted:
  ❖ Slurm (Resource Scheduler)
  ❖ MPI enabled compute nodes
  ❖ As much additional cluster stack as possible (Graphite, elasticsearch/logstash/kibana, Icinga, Cluster-FS, …)
24
Classical Virtualization
❖ Big overhead for simple node
❖ Resources provisioned in advance
❖ Host resources allocated
25
LXC (docker)
❖ minimal overhead (a couple of MB)
❖ no resource pinning
❖ cgroups option
❖ highly automatable
26
NOW: Watch OSDC2014 talk 'Docker' by Tobias Schwab
Virtual Cluster Nodes
❖ Master Node (etcd, DNS, slurmctld)
❖ monitoring (graphite + statsd)
❖ log mgmt (ELK)
❖ compute nodes (slurmd)
❖ alarming (Icinga) [not integrated]
27
(Diagram labels: host, master, monitoring, logmgmt, compute0, compute1, computeN)
Master Node
❖ takes care of inventory (etcd)
❖ provides DNS (+PTR)
❖ Integrate Rudder, ansible, chef,…?
28
Non-Master Nodes (in general)
❖ are started with master as DNS
❖ mounting /scratch, /chome (sits on SSDs)
❖ supervisord kicks in and starts services and setup-scripts
❖ sending metrics to graphite
❖ logs to logstash
29
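How a non-master node might be started can be sketched by assembling a `docker run` command line. The flags used (`-d`, `--name`, `-h`, `--dns`, `-v`) are standard docker options, while the image name, the container name and the master's IP address are assumptions made up for this example.

```python
def docker_run_args(name, image, dns_ip, mounts=("/scratch", "/chome")):
    """Assemble argv for starting a detached node with master as DNS."""
    args = ["docker", "run", "-d", "--name", name, "-h", name, "--dns", dns_ip]
    for path in mounts:
        # Bind-mount the host directory to the same path in the container.
        args += ["-v", f"{path}:{path}"]
    args.append(image)
    return args


if __name__ == "__main__":
    # Hypothetical image name and master address.
    argv = docker_run_args("compute0", "example/compute", "172.17.0.2")
    print(" ".join(argv))
    # subprocess.run(argv, check=True) would start the container
```

Once the container is up, supervisord inside the image would take over and start the services, as the slide describes.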
docker-compute
❖ slurmd
❖ sshd
❖ logstash-forwarder
❖ openmpi
❖ qperf
30
docker-graphite (monitoring)
❖ full graphite stack + statsd
❖ stresses IO (<3 SSDs)
❖ needs more care (optimize IO)
31
docker-elk (Log Mgmt)
❖ elasticsearch, logstash, kibana
❖ inputs: syslog, lumberjack
❖ filters: none
❖ outputs: elasticsearch
32
It’s alive!
33
Start Compute Node
34
Start Compute Node
35
Check Slurm Config
36
Run MPI-Job
37
TCP benchmark
38
QNIBTerminal Future Work
39
docker-icinga
40
❖ Icinga to provide
❖ state-of-the-cluster overview
❖ bundle with graphite/elk
❖ no big deal…
❖ Is this going to scale?
docker-(GlusterFS,Lustre)
❖ Cluster scratch to integrate with
❖ Use of kernel-modules freezes attempt
❖ Might be pushed in VirtualBox (vagrant)
41
Humans!
❖ How do SysOps/DevOps/Mgmt
  ❖ react to the changes?
  ❖ adopt them?
  ❖ fear them?
42
Big Data!
❖ Truckload of
  ❖ Events
  ❖ Metrics
  ❖ Interaction
43
node01.system.memory.usage 9
node13.system.memory.usage 14
node35.system.memory.usage 12
node95.system.memory.usage 11
target=sumSeries(node{01,13,35,95}.system.memory.usage)
job1.node01.system.memory.usage 9
job1.node13.system.memory.usage 14
job1.node35.system.memory.usage 12
job1.node95.system.memory.usage 11
target=sumSeries(job1.*.system.memory.usage)
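The difference between the two targets can be replayed offline. The sketch below imitates, in a simplified way, how a graphite-style pattern with `{a,b}` alternatives and `*` wildcards selects series before they are summed. It is not graphite's own matcher. The helper names are hypothetical and the job prefix `job1` follows the metric names on the slide.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternatives into a list of plain patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    out = []
    for alt in m.group(1).split(","):
        out.extend(expand_braces(head + alt + tail))
    return out


def matches(pattern, name):
    """Match segment by segment, so '*' never crosses a dot."""
    p_parts, n_parts = pattern.split("."), name.split(".")
    return len(p_parts) == len(n_parts) and all(
        fnmatch.fnmatchcase(n, p) for p, n in zip(p_parts, n_parts)
    )


def sum_series(pattern, metrics):
    """Sum the values of all metrics selected by the pattern."""
    patterns = expand_braces(pattern)
    return sum(v for k, v in metrics.items()
               if any(matches(p, k) for p in patterns))


if __name__ == "__main__":
    usage = {"node01": 9, "node13": 14, "node35": 12, "node95": 11}
    metrics = {f"{n}.system.memory.usage": v for n, v in usage.items()}
    metrics.update({f"job1.{k}": v for k, v in list(metrics.items())})
    print(sum_series("node{01,13,35,95}.system.memory.usage", metrics))  # 46
    print(sum_series("job1.*.system.memory.usage", metrics))  # 46
```

Both targets yield the same sum, but the job-prefixed form does not require knowing which nodes belonged to the job.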
pipework / mininet
❖ Currently all containers are bound to docker0 bridge
❖ Creating topology with virtual/real switches would be nice
❖ First iteration might use pipework
❖ More complete one should use vSwitches (mininet?)
44
Dockerfiles
❖ Only 3 images are fd20 based
45
Questions?
❖ Pictures
❖ p2: http://de.wikipedia.org/wiki/Datei:Audi_logo.svg
http://commons.wikimedia.org/wiki/File:Daimler_AG.svg
http://ffb.uni-lueneburg.de/20JahreFFB/
❖ p4: https://www.flickr.com/photos/adeneko/4229090961
❖ p6: cae t100
https://www.flickr.com/photos/losalamosnatlab/7422429706
❖ p8: http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
❖ p9: https://www.flickr.com/photos/riafoge/6796129047
❖ p10: https://www.flickr.com/photos/119364768@N03/12928685224/
❖ p11: http://www.mellanox.com/page/products_dyn?product_family=74
❖ p23: https://www.flickr.com/photos/jaxport/3077543062
❖ p25/26: https://blog.trifork.com/2013/08/08/next-step-in-virtualization-docker-lightweight-containers/
❖ p33: https://www.flickr.com/photos/fkehren/5139094564
❖ p39: https://www.flickr.com/photos/brizzlebornandbred/12852909293
46

Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 

OSDC 2014: Christian Kniep - Understand your data center by overlaying multiple information layers

  • 1. OSDC 2014: Overlay Datacenter Information. Christian Kniep, Bull SAS, 2014-04-10
  • 2-7. About Me
      ❖ Me (>30y)
      ❖ SysOps (>10y)
      ❖ SysOps v1.1 (>8y)
      ❖ BSc (2008-2011)
      ❖ DevOps (>4y)
      ❖ R&D [OpsDev?] (>1y)
  • 10-13. Agenda
      ❖ Cluster Stack
      ❖ Motivation (InfiniBand use-case)
      ❖ QNIB/ng
      ❖ QNIBTerminal (virtual cluster using docker)
      Diagram labels: Cluster Stack, IB, QNIBng, QNIBTerminal; parts I., II., III.
  • 14. Cluster Stack: Work Environment
  • 15-17. Cluster?
      "A computer cluster consists of a set of loosely connected or tightly connected computers that work together so that in many respects they can be viewed as a single system." (wikipedia.org)
      Diagram label: User
  • 19-20. HPC-Cluster (High Performance Computing)
      ❖ HPC: Surfing the bottleneck
      ❖ Weakest link breaks performance
  • 22-35. Cluster Layers (rough estimate)
      ❖ Hardware: IPMI, lm_sensors, IB counter
      ❖ Operating System: Kernel, Userland tools
      ❖ MiddleWare: MPI, ISV-libs
      ❖ Services: Storage, Job Scheduler, sshd
      ❖ Software: End user application
      ❖ Excel: KPI, SLA
      Diagram labels: Events, Metrics
      Roles: End User, PowerUser/ISV, SysOps, SysOpsL1, SysOpsL2, SysOpsL3, SysOpsMgmt, ISVMgmt, Mgmt
  • 36. Layer n
      ❖ Every layer is composed of layers
      ❖ How deep to go?
  • 37-41. Little Data w/o Connection
      ❖ Multiple data sources
      ❖ No way of connecting them
      ❖ Connecting is manual labour
      ❖ Experience driven
      ❖ Niche solutions misleading
  • 42. IB + QNIBng: Motivation
  • 43-47. Modular Switch
      ❖ Looks like one "switch"
      ❖ Composed of a network itself
      ❖ Which route is taken is transparent to the application
          ❖ LB1<>FB1<>LB4
          ❖ LB1<>FB2<>LB4
          ❖ LB1 ->FB1 ->LB4 / LB1 <-FB2 <-LB4
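The route choice listed on the Modular Switch slides can be illustrated with a small Python sketch. It is only an illustration under assumptions: the board list (LB1 to LB4, FB1 and FB2), the full LB-to-FB wiring and the helper `routes` are hypothetical, and in a real fabric the routes are computed by the IB subnet manager.

```python
from itertools import product

# Hypothetical internal topology: every LB is wired to every FB.
LBS = ["LB1", "LB2", "LB3", "LB4"]
FBS = ["FB1", "FB2"]


def routes(src, dst):
    """Return every src -> FB -> dst path through the modular switch."""
    if src not in LBS or dst not in LBS:
        raise ValueError("unknown board")
    return [(src, fb, dst) for fb in FBS]


forward_options = routes("LB1", "LB4")
return_options = routes("LB4", "LB1")

# Forward and return traffic may use different FBs
# (the last case listed on the slide).
asymmetric = [
    (fwd, ret)
    for fwd, ret in product(forward_options, return_options)
    if fwd[1] != ret[1]
]
```

Even in this tiny model there are two candidate paths per direction, which is why the application cannot tell which internal link it is using.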
  • 48-52. Debug-Nightmare
      ❖ Job seems to fail due to bad internal link
      ❖ 96 port switch
      ❖ multiple autonomous job-cells
      ❖ Relevant information
          ❖ Job status (Resource Scheduler)
          ❖ Routes (IB Subnet Manager)
          ❖ IB Counter (Command Line)
      ❖ changing one plug recomputes routes :)
  • 53. Poster (Communication Networks): IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring. Michael Hoefling, Michael Menth, Christian Kniep, Marcus Camen
      Background: InfiniBand (IB)
          ❖ State-of-the-art communication technology for interconnection in high-performance computing data centers
          ❖ Point-to-point bidirectional links
          ❖ High throughput (40 Gbit/s with QDR)
          ❖ Low latency
          ❖ Dynamic on-line network reconfiguration in cooperation with
      Rate Measurement in IB Networks
          ❖ Idea: Extract raw network information from IB network; analyze output; derive statistics about performance of the network
          ❖ Topology Extraction: Subnet discovery using ibnetdiscover; produces human readable file of network topology; process output to produce graphical representation of the network
          ❖ Remote Counter Readout: Each port has its own set of performance counters; counters measure, e.g., transferred data, congestion, errors, link state changes
      IBPM: Demo Overview
          ❖ ibsim-Based Network Simulation: ibsim simulates an IB network; simple topology changes possible (GUI)
          ❖ ibsim limitations: No performance simulation possible; no data rate changes possible
          ❖ Real IB Network: Physical network; allows performance measurements; GUI controlled traffic scenarios
  • 56-62. OpenSM
      ❖ OpenSM Performance Manager
          ❖ Sends token to all ports
          ❖ All ports reply with metrics
          ❖ Callback triggered for every reply
      ❖ osmeventplugin
          ❖ Dumps info to file
      Diagram: OpenSM (PerfMgmt, osmeventplugin) connected to switches (Sw) and nodes
  • 65-68. OpenSM (diagram: OpenSM PerfMgmt with qnib, then qnibng)
      ❖ qnib
          ❖ sends metrics to RRDtool
          ❖ events to PostgreSQL
      ❖ qnibng
          ❖ sends metrics to graphite
          ❖ events to logstash
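What "metrics to graphite, events to logstash" can look like on the wire is sketched below in Python. This is not the qnibng code. The metric path, the event field names and the helpers `graphite_line`, `event_json` and `send` are assumptions made for illustration. The only fixed part is Graphite's plaintext format, which is one `path value timestamp` line per sample.

```python
import json
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def event_json(message, **fields):
    """Serialise an event as one JSON line (field names are assumptions)."""
    return json.dumps({"message": message, **fields}, sort_keys=True) + "\n"


def send(host, port, payload):
    """Send a payload over TCP; host and port depend on the local setup."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("utf-8"))


# Hypothetical metric path and event for an IB switch port.
metric = graphite_line("ib.switch1.port3.xmit_data", 1024, 1397120000)
event = event_json("port is down", switch="switch1", port=3)

# Example call, left commented out because it needs a running Graphite:
# send("graphite.example", 2003, metric)  # 2003: default plaintext port
```

Keeping formatting separate from sending makes the sketch testable without a running Graphite or logstash instance.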
  • 69. Graphite Events: port is up/down
  • 72. QNIBTerminal: Proof of Concept
  • 73. Cluster Stack Mock-Up
      ❖ IB events and metrics are not enough
      ❖ How to get real-world behavior?
      ❖ Wanted:
          ❖ Slurm (Resource Scheduler)
          ❖ MPI enabled compute nodes
          ❖ As much additional cluster stack as possible (Graphite, elasticsearch/logstash/kibana, Icinga, Cluster-FS, …)
  • 74. Classical Virtualization
      ❖ Big overhead for simple node
      ❖ Resources provisioned in advance
      ❖ Host resources allocated
  • 75-76. LXC (docker)
      ❖ minimal overhead (a couple of MB)
      ❖ no resource pinning
      ❖ cgroups option
      ❖ highly automatable
      NOW: Watch the OSDC 2014 talk "Docker" by Tobias Schwab
  • 78-82. Virtual Cluster Nodes
      ❖ Master Node (etcd, DNS, slurmctld)
      ❖ monitoring (graphite + statsd)
      ❖ log mgmt (ELK)
      ❖ compute nodes (slurmd)
      ❖ alarming (Icinga) [not integrated]
      Diagram: host running master, monitoring, logmgmt, compute0, compute1, computeN
  • 83. Master Node
      ❖ takes care of inventory (etcd)
      ❖ provides DNS (+PTR)
      ❖ Integrate Rudder, ansible, chef, …?
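How forward and reverse (PTR) DNS entries could be derived from an inventory is sketched below in Python. The talk does not say how the master feeds DNS from etcd, so everything here is an assumption: the inventory is a plain dictionary standing in for etcd, and the domain `cluster.local`, the addresses and the helper `dns_records` are hypothetical.

```python
import ipaddress


def dns_records(inventory, domain="cluster.local"):
    """Build (name, type, value) tuples: one address record plus its PTR."""
    records = []
    for host, addr in sorted(inventory.items()):
        fqdn = f"{host}.{domain}."
        ip = ipaddress.ip_address(addr)
        rtype = "A" if ip.version == 4 else "AAAA"
        records.append((fqdn, rtype, str(ip)))
        # reverse_pointer yields e.g. '3.0.17.172.in-addr.arpa'
        records.append((ip.reverse_pointer + ".", "PTR", fqdn))
    return records


# Hypothetical inventory; in the talk's setup this data lives in etcd.
inventory = {"master": "172.17.0.2", "compute0": "172.17.0.3"}
records = dns_records(inventory)
```

Generating both record types from one source keeps name and reverse lookups consistent when containers come and go.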
  • 84. Non-Master Nodes (in general)
      ❖ are started with master as DNS
      ❖ mounting /scratch, /chome (sits on SSDs)
      ❖ supervisord kicks in and starts services and setup-scripts
      ❖ sending metrics to graphite
      ❖ logs to logstash
  • 85-91. docker-compute
      ❖ slurmd
      ❖ sshd
      ❖ logstash-forwarder
      ❖ openmpi
      ❖ qperf
  • 92. docker-graphite (monitoring)
      ❖ full graphite stack + statsd
      ❖ stresses IO (<3 SSDs)
      ❖ needs more care (optimize IO)
  • 93. docker-elk (Log Mgmt)
      ❖ elasticsearch, logstash, kibana
      ❖ inputs: syslog, lumberjack
      ❖ filters: none
      ❖ outputs: elasticsearch
  • 107-108. docker-icinga
      ❖ Icinga to provide
          ❖ state-of-the-cluster overview
      ❖ bundle with graphite/elk
      ❖ no big deal…
      ❖ Is this going to scale?
  • 109. docker-(GlusterFS,Lustre)
      ❖ Cluster scratch to integrate with
      ❖ Use of kernel-modules freezes attempt
      ❖ Might be pushed in VirtualBox (vagrant)
  • 110-113. Humans
      ❖ How do SysOps/DevOps/Mgmt
          ❖ react to the changes
          ❖ adopt them
          ❖ fear them
  • 114-121. Truckload of Big Data
      ❖ Events
      ❖ Metrics
      ❖ Interaction
      Metrics per node:
          node01.system.memory.usage 9
          node13.system.memory.usage 14
          node35.system.memory.usage 12
          node95.system.memory.usage 11
          target=sumSeries(node{01,13,35,95}.system.memory.usage)
      Metrics prefixed with the job:
          job1.node01.system.memory.usage 9
          job1.node13.system.memory.usage 14
          job1.node35.system.memory.usage 12
          job1.node95.system.memory.usage 11
          target=sumSeries(job1.*.system.memory.usage)
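Why a per-job prefix such as `job1` simplifies the query can be shown with a small Python sketch. It mimics, in a simplified way, how a wildcard stands for exactly one dot-separated segment of a metric path. The helpers `job_metric`, `matches` and `sum_series` are hypothetical stand-ins and not Graphite's implementation. Brace lists like `node{01,13}` are not handled, and the `job2` sample is invented to show that other jobs are excluded.

```python
from fnmatch import fnmatchcase


def job_metric(job, node, name):
    """Prefix a node metric with the job that ran on the node."""
    return f"{job}.{node}.{name}"


def matches(pattern, path):
    """Match segment by segment so '*' never spans a dot."""
    pat_parts, path_parts = pattern.split("."), path.split(".")
    return len(pat_parts) == len(path_parts) and all(
        fnmatchcase(seg, pat) for pat, seg in zip(pat_parts, path_parts)
    )


def sum_series(samples, pattern):
    """Sum the latest value of every series whose path matches."""
    return sum(v for path, v in samples.items() if matches(pattern, path))


# Values taken from the slide; the job2 entry is invented.
usage = [("node01", 9), ("node13", 14), ("node35", 12), ("node95", 11)]
samples = {job_metric("job1", n, "system.memory.usage"): v for n, v in usage}
samples["job2.node02.system.memory.usage"] = 7

total = sum_series(samples, "job1.*.system.memory.usage")
```

With the prefix in place the query no longer needs the list of nodes a job ran on; the job name alone selects them.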
  • 122. pipework / mininet
      ❖ Currently all containers are bound to the docker0 bridge
      ❖ Creating a topology with virtual/real switches would be nice
      ❖ First iteration might use pipework
      ❖ A more complete one should use vSwitches (mininet?)
  • 123. Dockerfiles
      ❖ Only 3 images are fd20 based
  • 124. Questions?
      Pictures:
      ❖ p2: http://de.wikipedia.org/wiki/Datei:Audi_logo.svg, http://commons.wikimedia.org/wiki/File:Daimler_AG.svg, http://ffb.uni-lueneburg.de/20JahreFFB/
      ❖ p4: https://www.flickr.com/photos/adeneko/4229090961
      ❖ p6: cae t100, https://www.flickr.com/photos/losalamosnatlab/7422429706
      ❖ p8: http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
      ❖ p9: https://www.flickr.com/photos/riafoge/6796129047
      ❖ p10: https://www.flickr.com/photos/119364768@N03/12928685224/
      ❖ p11: http://www.mellanox.com/page/products_dyn?product_family=74
      ❖ p23: https://www.flickr.com/photos/jaxport/3077543062
      ❖ p25/26: https://blog.trifork.com/2013/08/08/next-step-in-virtualization-docker-lightweight-containers/
      ❖ p33: https://www.flickr.com/photos/fkehren/5139094564
      ❖ p39: https://www.flickr.com/photos/brizzlebornandbred/12852909293