Ceph in the GRNET cloud stack
Nikos Kormpakis
September 26, 2017
GRNET - Greek Research and Technology Network
About GRNET
GRNET’s Infrastructure
• 5 datacenters (3 in Athens, 1 in Louros, 1 in Crete)
• ~1000 VMCs (Virtual Machine Containers)
• ~180 SCs (Storage Containers, only for Ceph)
• 4 NetApp boxes (3x FAS8040, 1x IBM N5300)
• EMC boxes, an IBM TS4500 tape library, etc.
GRNET Cloud Platform
• ViMa: VPS for GRNET Peers (not really "cloud")
• ganetimgr
• NFS and DRBD
• ~okeanos: Elastic VMs for students, researchers
• Synnefo
• DRBD, RBD, Archipelago (Ceph)
30 Ganeti clusters, QEMU/KVM, Debian Jessie, and a lot more...
Ceph @ GRNET
Ceph Infrastructure
4 Ceph clusters
• 2 production clusters
• 2 testing clusters
Some facts:
• Each ~okeanos installation has its own cluster
• Variety of hardware
• Spine-leaf network topology
• Each cluster lives only in one DC
• Mix of Ceph versions and setups
• No NAT, everything dualstack
Where do we use Ceph?
Two large use cases
• Block Storage for VMs
~okeanos (and maybe later for ViMa), using Archipelago and RBD.
• Pithos+
Dropbox-like object storage, part of ~okeanos, using Archipelago.
Ceph Clusters
rd0 Cluster
rd0, old cluster (~okeanos)
• ~300TB raw storage, 164 OSDs, 3 MONs, 6 OSDs/node
• MONs colocated with OSDs
• HP ProLiant DL380 G7 with 2x HP DS2600 enclosures/node
• SAS disks for journals and OSDs, all on RAID1...
• Hammer (v0.94.9), 3.16 kernel, with default crushmap
• No more space allocated, will be deprecated
• Used only for Archipelago (more on it later)
• Started 4 years ago as an experiment for Archipelago
rd0 Cluster
What’s wrong with this cluster?
• All disks are on RAID1 -> low performance, 50% space loss
• Broadcom NetXtreme II BCM5709 NICs really suck: a lot of flapping
has caused multiple outages (one of them major)
• replica count = 2, min_size=1...
• No deep-scrub due to performance issues
• Non-optimal tunables set
• No warranty
• ext4 OSDs
rd0 Cluster
About the future of rd0
• Old, but functional hardware
• Suitable for users with little or no I/O performance requirements
(e.g. short-lived VMs for student classes)
• Install refurbished X540 NICs and get rid of the old ones
• Upgrade to Jewel yesterday
• Recreate all OSDs with ceph-deploy (XFS)
• Add more nodes to cluster
• Increase replica count to 3, set optimal tunables (see the sketch
after this list)
• Maybe... use single disk OSDs instead of RAID1
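A minimal sketch of those two changes with the Python rados bindings;
the pool name 'rbd' and the config path are placeholders, not
necessarily rd0's actual values:

import json
import rados

# Connect as a client with admin caps; paths are assumptions.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def pool_set(pool, var, val):
    # Same effect as `ceph osd pool set <pool> <var> <val>`.
    cmd = json.dumps({'prefix': 'osd pool set',
                      'pool': pool, 'var': var, 'val': str(val)})
    ret, out, err = cluster.mon_command(cmd, b'')
    if ret != 0:
        raise RuntimeError(err)

# Triple replication; never serve writes from a single surviving copy.
pool_set('rbd', 'size', 3)
pool_set('rbd', 'min_size', 2)

# Same as `ceph osd crush tunables optimal`; expect data movement.
cluster.mon_command(json.dumps({'prefix': 'osd crush tunables',
                                'profile': 'optimal'}), b'')
cluster.shutdown()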
rd1 Cluster
rd1, new Ceph production cluster (~okeanos-knossos)
• ~550TB raw storage, 144 OSDs, 3 MONs, 12 OSDs/node
• Lenovo ThinkServer RD550 with SA120 arrays (JBOD enabled)
• SSDs for MON stores, 12x 4TB SAS disks for OSD stores, 6x 200GB SSDs
for OSD journals
• 2 journals per SSD, 20GB per journal
• 2x Intel Xeon E5-2640V3 2.60GHz, 128GB RAM
• 2x10G LACP
• Jewel (v10.2.9), 4.9 kernel, default crushmap, optimal tunables
rd1 Cluster
Configuration, tuning etc
• RBD and Archipelago
• librbd for VMs (more on krbd vs librbd)
• MONs colocated with OSDs, each MON on a different rack
• sysctl and controller settings (TCP tuning, pid/thread limits,
performance governor, etc.)
• Configs mostly from mailing list suggestions (plus testing, of course)
• deep-scrub enabled with low-priority settings (see the sketch after
this list)
• Cephx auth enabled
• Separate client and replication networks
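The slides don't spell out the exact scrub knobs, so the values below
are illustrative, not our production numbers. The sketch shows how
low-priority scrub settings of this kind can be pushed to running OSDs
with the Python rados bindings:

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Illustrative Jewel-era low-priority scrub settings (values assumed).
args = ['--osd_scrub_sleep=0.1',
        '--osd_disk_thread_ioprio_class=idle',
        '--osd_disk_thread_ioprio_priority=7']
cmd = json.dumps({'prefix': 'injectargs', 'injected_args': args})

# Same effect as `ceph tell osd.N injectargs ...`; rd1 has 144 OSDs,
# assumed here to be numbered contiguously from 0.
for osdid in range(144):
    cluster.osd_command(osdid, cmd, b'')
cluster.shutdown()

To survive daemon restarts, the same options also go into the [osd]
section of ceph.conf.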
rd1 Cluster
Pain points
• Large number of threads with librbd images; considering the async
messenger
• Trouble with systemd/udev (had to upgrade to systemd v230)
• Issues with DC network hardware cause flaps, packet loss, etc
• (Almost) no monitoring for (lib)rbd on clients, working on it
• There’s still a lot to learn about RBD
OK, what is Archipelago?
Archipelago
Archipelago is a storage layer that provides a unified way to handle
volumes/files, independently of the storage backend.
• Written in C
• Uses blktap
• Two backends supported: NFS and RADOS (librados)
• Nothing needed on Ceph’s side, just two pools: blocks and maps
• blocks pool used for actual disk blocks
• objects in maps pool contain the map for each volume
• Has its own userspace tools: vlmc and xseg
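For illustration only (this is not Archipelago's own tooling), a
sketch that peeks at those two pools with the Python rados bindings;
apart from the pool names, everything here is an assumption:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# `maps` holds one small map object per volume; `blocks` holds the
# data chunks those maps point to.
for pool in ('maps', 'blocks'):
    ioctx = cluster.open_ioctx(pool)
    for i, obj in enumerate(ioctx.list_objects()):
        print(pool, obj.key)
        if i >= 4:  # just a peek
            break
    ioctx.close()
cluster.shutdown()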
Archipelago
Issues with Archipelago
• Debugging can be a serious pain
• Issues with shared memory (shm) can cause data corruption
• No easy way to map objects to a particular volume
• No garbage collection :)
• Synnefo has a 'hardcoded' dependency on Archipelago: VM images are
still stored using it. This will eventually be replaced (custom
solution, Glance?).
RBD
RBD vs Archipelago
Since 2017, all new ~okeanos clusters use RBD instead of Archipelago.
We made this decision for the following reasons:
• RBD is a part of Ceph
• RBD is stable
• Features that were missing from Archipelago
• Multiple ways to access images (krbd, librbd)
• Large community of users and devs behind it
• No need to maintain an in-house tool
krbd vs librbd. Which one?
Two things to evaluate: operational issues and performance. We went
with librbd:
• Almost the same performance (one exception: sequential reads)
• krbd uses the host’s page cache: can cause crashes, hung tasks
• librbd has no kernel version dependency
• librbd has a nice admin socket
• librbd’s cache is easily configurable through ceph.conf
• No stale mapped devices with librbd
Issues:
• No monitoring yet on librbd (in progress; see the sketch after this
slide)
• librbd opens up a lot of connections -> evaluating the async messenger
• Had to write custom extstorage provider for Ganeti
Mega props to alexaf for his work on librbd vs krbd.
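Since librbd monitoring is still in progress, here is a rough sketch
of the direction: poll a guest's perf counters through librbd's admin
socket. The socket path is hypothetical; `ceph --admin-daemon <sock>
perf dump` is the standard interface:

import json
import subprocess

# Hypothetical path; QEMU/librbd admin sockets end up wherever the
# `admin socket` option in the [client] section of ceph.conf points.
SOCK = '/var/run/ceph/guests/client.okeanos.12345.asok'

def perf_dump(sock):
    # Dump all perf counters behind the socket as JSON.
    out = subprocess.check_output(['ceph', '--admin-daemon', sock,
                                   'perf', 'dump'])
    return json.loads(out.decode())

# librbd sections are named per image, e.g. librbd-<id>-<pool>-<image>.
for name, stats in perf_dump(SOCK).items():
    if name.startswith('librbd'):
        print(name, 'rd ops:', stats.get('rd'), 'wr ops:', stats.get('wr'))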
Ops
Automation & Config Management
We use Puppet and Python Fabric extensively across our infrastructure.
• Using puppet only for configuration management
• No remote execution or Ceph provisioning
• In-house Puppet modules for Ceph: most modules out there focus on
deployment tasks
• Extensive use of role/profile pattern
• Python Fabric scripts for a lot of tasks: deployments, upgrades,
maintenance, etc. (see the sketch after this list)
• mcollective for non-critical operations
• A large number of Engineering Runbooks for most operations.
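A hypothetical maintenance task in the spirit of those scripts,
written against the Fabric 1.x API that was current at the time; host
and OSD naming are illustrative:

from fabric.api import run, sudo, task

@task
def restart_osd(osd_id):
    """Set noout, restart one OSD, then wait for the cluster to settle."""
    sudo('ceph osd set noout')
    sudo('systemctl restart ceph-osd@{}'.format(osd_id))
    sudo('ceph osd unset noout')
    # Block until recovery finishes before anyone touches the next OSD.
    run('while ! ceph health | grep -q HEALTH_OK; do sleep 10; done')

Run as, e.g., fab -H rd1-node01 restart_osd:12.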
Monitoring
We monitor our Ceph infrastructure extensively with several tools:
• icinga/checkmk, mostly in-house stuff for host and ceph status
• munin, mainly for debugging purposes
• collectd/graphite for ceph metrics
• prometheus for networking and disk monitoring
• ELK for collecting, storing and analyzing host and ceph logs
Outages
Major outage on rd0
On 2016-09-09 we had our first multi-day Ceph outage
• OSDs of 1 node started flapping
• After 5 hours, flapping OSDs crashed
• 100 PGs peering+down, unfound objects (replica count = 2, anyone?)
• MONs leaked FDs, lost communication
• Failed OSDs could not start with multiple assertions
• LevelDB corruption
The cluster recovered after 4 days of gdb sessions, the creation of a
temporary OSD, LevelDB repairs, and more...
Root cause was never found: Hardware? Cosmic Radiation? Ghosts?
(Report on the outage on last slide)
Minor outages
• Nodes flapping due to faulty hardware
• Second node crashing while recovery is in progress
• Volumes mapped on two nodes
• The same VM running on two different nodes...
Lessons learned
Lessons learned
• Never ever ever ever run a cluster with replica count = 2
• One crazy OSD/switch/NIC can ruin a whole cluster
• Ceph is hard :)
• Be up-to-date: a lot of bugfixes, improvements, features
• Docs are not always up-to-date, most info found in ML
• Benchmarking is difficult
• Deep-scrub can degrade performance, but is very important
• Test hardware thoroughly before running Ceph on it
Wishlist and TODO
Wishlist and TODO
• More tuning of ceph.conf and the crushmap
• Try to keep up with new Ceph versions and features (bluestore,
CephFS)
• Evaluate async messenger
• Evaluate RBD features (striping, snapshotting, layering etc)
• Better monitoring
• Try prometheus exporter for ceph and experiment with alerting
• Better ELK filtering
• Maybe increase log levels on some subsystems
• Better hard disk monitoring
• Gain some experience on rgw, CephFS
• Gain some insight into the Ceph code
• Some more automation would be nice
• ...
Links
https://grnet.gr
https://twitter.com/grnetnoc
https://www.synnefo.org/
https://www.synnefo.org/docs/archipelago/latest/
https://grnet.github.io/ganetimgr/
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
https://www.terena.org/activities/tf-storage/ws18/slides/120215-archipelago.pdf
whoami
whoami
Nikos Kormpakis aka nkorb aka nkorbb
Systems & Services Engineer at GRNET, messing around with:
• IaaS platforms (ViMa, ~okeanos)
• Storage systems (Ceph, NetApp ONTAP boxes)
• Puppet
• a ton of other stuff (monitoring, multiple web stacks etc)
Mainly interested in storage, virtualization, OS internals, networking
stuff, libre software, etc.
Find me at nkorb_at_noc(dot)grnet(dot)gr & blog.nkorb.gr