The Scenario of Permanent
Data Loss & Durability
in Ceph
Assumption
• 10 OSDs
• 512 PGs
• 3 Replica Pool
Goal
• Understand Ceph's durability and operational mechanisms by walking through a case in which data in Ceph is permanently lost.
Summary of Scenario
• The 1st OSD serving a particular PG is lost.
• Every PG that used the lost OSD drops from 3 replicas to 2.
• Ceph selects a new OSD and starts rebuilding the failed PGs' 3rd replica.
• Before recovery of the 1st OSD completes, the 2nd OSD of the same PG is lost.
• That PG now has only 1 replica left.
• Ceph again selects a new OSD and tries to re-replicate up to the desired count.
• Before recovery for the 1st and 2nd OSDs completes, the last OSD is lost as well.
• That PG's data is permanently lost (permanent data loss).
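The failure sequence above can be replayed as a toy sketch. This is illustrative only, under the deck's assumption that no failed replica is rebuilt in time; it is not Ceph's actual peering/recovery logic, and `simulate_failures` is a hypothetical helper.

```python
# Toy replay of the scenario: three overlapping OSD failures hit the same
# 3-replica PG before any recovery completes. Not Ceph's actual logic.

def simulate_failures(start_replicas, overlapping_failures):
    """Replica count of one PG after each failure, assuming no failed
    replica is rebuilt before the next failure arrives."""
    remaining = start_replicas
    history = []
    for _ in range(overlapping_failures):
        remaining -= 1
        history.append(remaining)
    return history

# 3-replica PG, three failures in a row: the last step reaches 0 copies,
# i.e. permanent data loss.
print(simulate_failures(3, 3))  # [2, 1, 0]
```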
Normal…☺
[Slide diagrams: a row of ten OSDs (OSD #1 … OSD #10); a pool with 3 replicas; PGs #1 through #512 are mapped onto the OSDs via CRUSH.]
512 × 3 ÷ 10 ≅ 150 PGs per OSD
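The PGs-per-OSD figure follows directly from the stated assumptions; a one-line check (the exact value is 153.6, which the slides round to ~150):

```python
# Expected PG replicas ("PG copies") per OSD under the stated assumptions.
pgs = 512      # PGs in the pool
replicas = 3   # replication factor
osds = 10      # OSDs in the cluster

pg_copies_per_osd = pgs * replicas / osds
print(pg_copies_per_osd)  # 153.6, which the slides round to ~150
```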
Trouble…
One of the OSDs serving PG#3 is lost…
[Slide diagrams: OSD #3 fails. Each of the ~150 PGs it carried drops from 3 replicas to 2 ("Replicas 3 to 2" — one OSD failure degrades 150 PGs at once). A new OSD is added and recovery starts; because the 150 PGs are homogeneously spread across the cluster, re-replication runs for a long time ("Recovering…", "Long Time…").]
Oops…
Before recovery of PG#3's 1st OSD completes, PG#3's 2nd OSD is lost as well…
[Slide diagrams: while OSD #3 is still recovering, OSD #5 — the 2nd OSD serving PG#3 — also fails; PG#3 drops from 2 replicas to 1 ("Replicas 2 to 1"). Another new OSD is added and a second recovery starts, again taking a long time.]
Ooooooooooops…
Even the last (3rd) OSD that still held an intact copy of PG#3's data is lost…
[Slide diagrams: while both recoveries are still in progress, OSD #8 — the last OSD holding PG#3 — also fails. No complete copy of PG#3 remains anywhere in the cluster: permanent data loss ("???").]
Appendix
About “Long Time”
About “Long Time” (1)
• In this scenario, losing the 1st OSD means that about 150 of the 512 PGs each lost one replica (512 × 3 ÷ 10).
• To restore each lost replica, one of the remaining 9 OSDs is chosen as a replacement.
• The OSDs exchange recovery copies of the 150 affected PGs among themselves.
• This inevitably takes time, and how long depends on the Ceph cluster architecture.
About “Long Time” (2)
• (Assumption)
• PGs are spread uniformly across all OSDs, every OSD is a 1TB SSD on its own dedicated machine, and each machine is connected to the switch at 10Gbps. Assume that fully recovering one lost OSD takes M minutes.
About “Long Time” (3)
• In general, the number of PGs in Ceph has little effect on durability or recovery speed.
• However, growing from 10 to 20 OSDs speeds up recovery, which in turn significantly improves durability.
• With 10 OSDs each handles 150 PGs; with 20 OSDs, only 75. (Fewer PGs per OSD -> less to recover per OSD.)
• If each of 10 OSDs must re-replicate 100GB, each of 20 OSDs only needs to re-replicate 50GB.
• Also, recovery is now performed by 19 OSDs instead of 9, so absent compute/network bottlenecks, recovery is naturally faster.
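The halving of per-OSD recovery burden can be checked with the same arithmetic as before (exact values are 153.6 and 76.8; the slides round to 150 and 75):

```python
# Per-OSD recovery burden as the cluster grows, assuming uniform PG spread.
def pgs_per_osd(pgs, replicas, osds):
    return pgs * replicas / osds

for osds in (10, 20):
    print(osds, pgs_per_osd(512, 3, osds))
# Doubling the OSD count halves the PG copies per OSD, so a failed OSD
# implies half the data to re-replicate, spread over ~twice as many peers.
```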
About “Long Time” (4)
• What if this cluster grows to 40 OSDs?
• Each OSD handles only 35 PGs.
• When one OSD dies, recovery is even faster than in the 20-OSD case.
• Going further, what about 200 OSDs?
• Each OSD handles a mere 7 PGs.
• When one OSD dies, at most 21 OSDs (7 PGs × 3 replicas) can participate in the recovery. (What about the remaining 179 OSDs? ->  inefficient)
• Recovery takes longer than with 40 OSDs.
• (With 40 OSDs, each OSD holds 35 PGs, so all 40 OSDs recover in parallel.)
• In this case, the PG count should be increased (to get as many OSDs as possible participating in recovery).
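The parallelism cap above can be mirrored with the slides' own back-of-envelope counting. This is a sketch of that estimate, not an exact model of CRUSH placement; `max_recovery_participants` is a hypothetical helper, and integer division gives 38 PGs per OSD at 40 OSDs where the slides quote ~35.

```python
# Rough cap on how many OSDs can help recover one failed OSD's PGs:
# a surviving OSD participates only if it shares a PG with the failed one,
# and the failed OSD touched roughly pgs_per_osd * replicas OSD slots.
def max_recovery_participants(pgs, replicas, osds):
    per_osd = pgs * replicas // osds           # PG copies per OSD
    return min(osds - 1, per_osd * replicas)   # survivors that can share work

print(max_recovery_participants(512, 3, 40))   # all 39 survivors can help
print(max_recovery_participants(512, 3, 200))  # only 21 -> 179 OSDs idle
```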
Appendix
About “Durability”
About “Durability” (1)
• The PG count itself is unrelated to durability.
• PGs are just a logical grouping that reduces CRUSH computation cost.
• A short recovery time is obviously good, but it is not the whole story.
• Under the original assumptions:
• The 1st of the 10 OSDs fails, and the 150 affected PGs are being recovered onto the remaining 9 OSDs.
• If a 2nd OSD then fails, the replicas of about 17 (≅ 150 ÷ 9) PGs exist as "only a single copy" across the entire cluster.
• In the worst case, with 2 OSDs already lost (8 OSDs available), if yet another OSD fails (3 concurrent losses),
• all data in about 2 (≅ 17 ÷ 8) PGs is permanently lost.
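The chain of estimates can be reproduced in three lines (integer division yields 153 where the slides round to ~150; the 17 and 2 match):

```python
# Counting degraded PGs under the slides' back-of-the-envelope model.
degraded = 512 * 3 // 10      # PGs that lost one replica after failure #1
single_copy = degraded // 9   # PGs down to a single copy after failure #2
lost = single_copy // 8       # PGs permanently lost after failure #3
print(degraded, single_copy, lost)  # 153 17 2
```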
About “Durability” (2)
• Revisiting the same 3-concurrent-failure case in a cluster with 20 OSDs:
• 2nd OSD lost
• PGs left with only one replica: about 4 (≅ 75 ÷ 19)
• Far fewer than the 17 in the 10-OSD case.
• 3rd OSD lost
• Permanent loss occurs only if the failed 3rd OSD happens to be one that serves a PG already down to a single replica.
• In other words, even 3 concurrent OSD failures do not make permanent loss 100% certain.
About “Durability” (3)
• (Assumption) If the probability of an additional OSD failing during recovery is 0.0001%,
• the probability of "permanent data loss" is:
• 10 OSDs (& 512 PGs)
• 17 PGs × 10 OSDs × 0.0001%
• 20 OSDs (& 512 PGs)
• 4 PGs × 20 OSDs × 0.0001%
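Plugging the numbers into the slides' rough model shows the 10-OSD cluster is roughly twice as likely to lose data:

```python
# Comparing the slides' rough loss-probability estimates.
p_fail = 0.0001 / 100              # 0.0001% expressed as a fraction

p_loss_10_osds = 17 * 10 * p_fail  # 10 OSDs, 512 PGs
p_loss_20_osds = 4 * 20 * p_fail   # 20 OSDs, 512 PGs
print(p_loss_10_osds / p_loss_20_osds)  # 2.125 -> ~2x riskier at 10 OSDs
```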
About “Durability” (4)
• Summary
• The more OSDs, the faster recovery is.
• Faster recovery reduces the risk of a "cascading failure" across PGs.
• At 50 OSDs or fewer, 512 PGs and 4096 PGs make no difference in terms of durability.
• Recommended PG formula:
Total PGs = (OSDs × 100) ÷ Replicas
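The formula is easy to apply in code; rounding the result up to the nearest power of two follows common Ceph placement-group guidance (the rounding step is an addition beyond the slides' formula):

```python
# The recommended-PG formula from the slides, with the customary
# round-up to the nearest power of two.
def recommended_pgs(osds, replicas):
    raw = osds * 100 / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(recommended_pgs(10, 3))  # raw ~333 -> 512 (the PG count assumed here)
print(recommended_pgs(20, 3))  # raw ~667 -> 1024
```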
APPENDIX
Replacement of PG Replicas
[Slide diagrams: a small 5-OSD cluster (OSD #1 … OSD #5) hosting PGs #1–#10 in a pool with 3 replicas, each PG's three copies spread across the OSDs. One OSD fails, and its PG copies (PG #1, #2, #3, #7, #9) are re-created on the surviving OSDs. A second OSD fails while that recovery is underway; now PG #2 and PG #5 are each down to a single replica ("R: 1"). If the OSD holding that last copy also fails, PG #2 (or PG #5) is lost.]
Thank You
Jung-In.Jung (call518@gmail.com)
2018/08/03
References
• http://docs.ceph.com/docs/master/architecture/#mapping-pgs-to-osds
• http://docs.ceph.com/docs/master/rados/operations/placement-groups/
• https://ceph.com/geen-categorie/how-data-is-stored-in-ceph-cluster/
  • 50. OSD #1 Pool /w 3 Replicas PG #1 PG #2 PG #3 PG #6PG #4 PG #5 PG #7 PG #8 PG #9 PG #10 OSD #2 OSD #3 OSD #3 OSD #5OSD #3 PG #1 PG #1 PG #1PG #2 PG #2 PG #2 PG #3 PG #3 PG #3PG #4 PG #4 PG #4 PG #5 PG #5 PG #6 PG #6PG #6PG #7 PG #7PG #7 PG #8 PG #8 PG #8PG #9 PG #9 PG #9 PG #10 PG #5 PG #10 PG #10 Failure PG #1 PG #2 PG #3 PG #7 PG #9 FailurePG #2PG #7 PG #4PG #8 PG #5 R: 1 R: 1 If failue, PG#2 is loss If failue, PG#5 is loss