
[OpenStack Days Korea 2016] Track1 - All-Flash Ceph Configuration and Optimization

Jugwan Eom / SK Telecom

  1. All-Flash Ceph Configuration and Optimization. Feb. 18, 2016. SDS Tech. Lab, Corporate R&D Center, SK Telecom. OpenStack Days in Korea.
  2. Why are we focusing on all-flash Ceph? Technology trends in storage systems run from hybrid scale-up and hybrid scale-out systems to all-flash scale-up and all-flash scale-out systems, along two axes: increasing effective capacity and increasing performance. The requirements for an All-IT network/infrastructure storage system are scalability, availability, and performance.
  3. What is Ceph? (http://docs.ceph.com/docs/master/_images/stack.png) The stack exposes objects to applications, virtual disks to hosts/VMs, and files & directories to clients. • Ceph is a unified, distributed, massively scalable open-source storage solution: object, block, and file storage • Mostly LGPL open-source project • Failure is normal • Self-managing • Scales out on commodity hardware • Everything runs in software
  4. Ceph Architecture. Applications reach the Ceph storage system through librados, krbd, or KVM with librbd; the monitors distribute the OSD cluster maps to clients and OSDs, and I/O then flows directly between clients and OSDs over the service network, while the OSDs talk to each other over the storage network.
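The direct client-to-OSD data path above can be exercised from any librados client. Below is a minimal sketch using the python-rados bindings, assuming a reachable cluster, the default /etc/ceph/ceph.conf path, and an existing pool named 'rbd'; the pool and object names are illustrative, not from the talk.

```python
# Minimal librados client sketch: the monitors are contacted only to fetch the
# cluster maps; the object write/read below then goes directly to the OSDs.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # default config path (assumption)
cluster.connect()                                       # pulls monitor/OSD/CRUSH maps
try:
    ioctx = cluster.open_ioctx('rbd')                   # 'rbd' pool assumed to exist
    try:
        ioctx.write_full('demo-object', b'hello ceph')  # placed by CRUSH, replicated by the primary OSD
        print(ioctx.read('demo-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```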
  5. Ceph Operation: Ceph Block Device. An application goes through librbd and librados to the OSD service. Data placement uses the CRUSH algorithm: a Ceph block device is a sequence of fixed-size objects (default 4MB), so a 1GB block image is 256 objects, and a hash maps each object to a placement group (PG#0-PG#3). On the OSD, FileStore writes the journal with O_DIRECT and O_DSYNC and the data with buffered I/O to an XFS file system on a disk or RAID group, and writes are synchronously replicated from the primary OSD (OSD #0) to the secondary (OSD #1).
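As a rough illustration of the layout described above, the sketch below computes how a block image splits into fixed-size objects and maps an object name to a PG. The hash and the pg_num value are simplified stand-ins for Ceph's actual object-to-PG hashing and CRUSH placement, not the real algorithm.

```python
# Simplified sketch of RBD image -> objects -> PG mapping (not Ceph's real hash or CRUSH).
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024           # default RBD object size: 4 MB
image_size  = 1 * 1024 * 1024 * 1024    # 1 GB block image
pg_num      = 128                       # example PG count for the pool (assumption)

num_objects = image_size // OBJECT_SIZE
print(num_objects)                      # 256 objects, as on the slide

def object_to_pg(object_name: str, pg_count: int) -> int:
    # Stand-in hash: Ceph really uses rjenkins hashing plus CRUSH for OSD placement.
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
    return h % pg_count

print(object_to_pg('rbd_data.1234.0000000000000000', pg_num))
```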
  6. Ceph OSD Node Configuration. Journal / data disk configuration • Common combinations (journal / data): SSD / HDD, no external journal disk / SSD, PCIe SSD / SATA SSD, NVRAM / SSD [chart: 4KB random write IOPS and latency (ms) by journal type, SSD vs. NVRAM]. Number of OSDs per node • 1 OSD per disk or RAID group • the Ceph OSD daemon is a CPU-intensive process [chart: 4KB random write IOPS and latency (ms) with 3, 4, 6, 8, and 12 OSDs per node].
  7. Ceph on All-Flash: Performance Issues [charts: 4KB random write and 4KB random read IOPS and latency (ms) vs. thread count from 4 to 128]. Issue: low throughput and high latency. • SSD spec: 4KB random read up to 95K IOPS, 4KB random write up to 85K IOPS (14K IOPS sustained), latency < 1 ms • Ideal throughput: 4KB random read 95K x 10 SSDs x 4 nodes ≈ 3,800K IOPS; 4KB random write 14K x 10 SSDs x 4 nodes / 2 (replication) ≈ 280K IOPS. Sustained performance measured after writing 2x (80% of usable space); 5 clients using krbd. Measured vs. ideal: 4KB random read 102K vs. 3,800K IOPS; 4KB random write 17K vs. 280K IOPS.
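The ideal-throughput figures above follow directly from the per-SSD spec and the cluster layout; a quick sanity check of that arithmetic:

```python
# Reproduce the slide's ideal-throughput estimate from the SSD spec and cluster layout.
ssds_per_node = 10
nodes         = 4
replication   = 2

read_iops_per_ssd            = 95_000   # 4KB random read, SSD spec
sustained_write_iops_per_ssd = 14_000   # 4KB random write, sustained

ideal_read  = read_iops_per_ssd * ssds_per_node * nodes                             # 3,800,000 IOPS
ideal_write = sustained_write_iops_per_ssd * ssds_per_node * nodes // replication   # 280,000 IOPS
print(ideal_read, ideal_write)
```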
  8. Ceph I/O Flow in the OSD. Inside the OSD, the Messenger hands requests to the PG layer (PGBackend: ReplicatedBackend or ECBackend), which issues transactions to an ObjectStore implementation (JournalingObjectStore/FileStore, KeyValueStore, MemStore, BlueStore). In FileStore: 1. Journal: libaio with O_DIRECT and O_DSYNC, after which the write is committed. 2. Data: buffered I/O with syncfs() later, after which the write is applied.
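The two FileStore write paths differ mainly in the open flags and in when durability is enforced. The Linux-only sketch below shows the idea with a plain synchronous write instead of libaio; the file names are placeholders, O_DIRECT needs a filesystem and block size that support it, and Python's os.sync() stands in for the syncfs(2) call FileStore actually uses.

```python
# Sketch of the two FileStore write paths (Linux only; file names are placeholders).
import mmap
import os

# 1. Journal-style write: O_DIRECT | O_DSYNC with a page-aligned buffer.
#    The data is on stable storage when os.write() returns ("committed").
buf = mmap.mmap(-1, 4096)                       # anonymous mapping: page-aligned, zero-filled
buf.write(b'journal entry')                     # rest of the 4 KB block stays zero padding
jfd = os.open('journal.bin', os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o644)
os.write(jfd, buf)                              # aligned 4 KB write, durable on return
os.close(jfd)

# 2. Data-style write: plain buffered I/O; durability comes from a later
#    filesystem-wide sync ("applied" once that sync completes).
dfd = os.open('object.bin', os.O_WRONLY | os.O_CREAT, 0o644)
os.write(dfd, b'object data')                   # lands in the page cache only
os.close(dfd)
os.sync()                                       # FileStore uses syncfs(2); os.sync() is the closest stdlib call
```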
  9. Ceph Write I/O Flow: Receiving a Request. 1. The client sends the write request over the public network. 2. The primary OSD's Messenger receives the write request. 3. The request is queued in the operation work queue. 4. An operation thread takes the PG lock and performs the operation (ReplicatedBackend). 5. Replication operations are sent to the secondary OSD over the cluster network, where the Messenger queues them in that OSD's operation work queue. 6. The transaction is enqueued to FileStore and the PG lock is released.
  10. Ceph Write I/O Flow: in FileStore. 1. The transaction is queued (writeq). 2. The writer thread processes the journal transactions. 3. The journal entry is written to the journal disk. 4. The AIO completes and the write finisher thread marks the write committed. 5. The op is queued on the operation work queue. 6. It is also queued to a finisher thread, which sends the RepOp reply to the primary if this is a secondary OSD. 7. Operation threads, holding the PG lock, write the data to the data disk. 8. The result is queued to a finisher thread; once both journal and data have completed, the write is applied.
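To make the committed/applied split in the two slides above concrete, here is a toy producer/consumer pipeline, not Ceph's code: a writer thread journals each op and fires its commit callback, then an operation thread applies it to the data store and fires its apply callback.

```python
# Toy journal-then-apply pipeline (illustration only, not Ceph's implementation).
import queue
import threading

write_q, op_q = queue.Queue(), queue.Queue()
journal, store = [], {}

def writer():
    while (op := write_q.get()) is not None:
        journal.append(op['data'])        # stand-in for the AIO journal write
        op['on_commit']()                 # journal durable -> "committed"
        op_q.put(op)
    op_q.put(None)

def applier():
    while (op := op_q.get()) is not None:
        store[op['name']] = op['data']    # stand-in for the buffered data write
        op['on_apply']()                  # data written -> "applied"

threads = [threading.Thread(target=writer), threading.Thread(target=applier)]
for t in threads:
    t.start()

write_q.put({'name': 'obj1', 'data': b'payload',
             'on_commit': lambda: print('committed'),
             'on_apply': lambda: print('applied')})
write_q.put(None)                         # shut the pipeline down
for t in threads:
    t.join()
print(store)
```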
  11. Optimization Targets and Issues. PG lock: more than 30% of total latency is spent acquiring the PG lock • op-processing worker threads get blocked, delaying unrelated ops • the critical section under the PG lock is large • ACK handling on the secondary OSD is delayed, increasing I/O latency. Ceph & system tuning: results fluctuate widely during performance measurement • changing Ceph configuration parameters individually has no effect; an optimal combination is needed • the memory allocator's CPU usage is high • the TCP/IP Nagle algorithm. Logging: performance changes greatly depending on whether logging is disabled • time is spent on logging in the OSD's I/O path. Transactions: transaction handling has a large impact on performance • transaction processing is inefficient: unnecessary operations and lock contention.
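One of the tuning items above is the TCP Nagle algorithm, which Ceph controls through its ms_tcp_nodelay messenger option. The self-contained loopback sketch below only shows what disabling Nagle on a socket looks like in general; it is not Ceph's code.

```python
# Generic illustration of disabling Nagle (TCP_NODELAY); Ceph's messenger does the
# equivalent internally when ms_tcp_nodelay is enabled.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # tiny loopback peer so the
srv.bind(('127.0.0.1', 0))                                 # example is self-contained
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send small messages immediately
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b'small request')     # not held back waiting to coalesce with later data
print(conn.recv(64), cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))

cli.close()
conn.close()
srv.close()
```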
  12. VM Performance: Test Environment. Service network (10GbE), storage network (10GbE). Physical clients (x5): DELL R720XD, Intel Xeon E5-2670v3 @ 2.60GHz x2 (10 cores), 128GB memory, CentOS 7.0. OSD node / monitor (x4): DELL R630, Intel Xeon E5-2690v3 @ 2.60GHz x2 (12 cores), 128GB memory, 10GbE NIC, CentOS 7.0, journal on RAMDISK. Switches (x2): Cisco Nexus 5548UP 10G. Disks: SK Hynix 480GB SATA SSD, 10 per OSD node, in RAID 0 groups of 3, 3, 2 and 2 SSDs (4 RAID groups), giving 4 devices and 4 OSD daemons per OSD node. Ceph version: SKT Ceph and community Ceph 0.94.4. VMs (up to 4 per physical client): 2 cores and 4GB memory per guest, librbd. FIO test configuration: run time 300 s, ramp time 10 s, 8 threads, queue depth 8. Sustained performance: measured after writing 2x (80% of usable space).
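As a sketch of how one case of this test (4KB random write) might be driven with fio's librbd engine, the snippet below builds the fio command from the run time, ramp time, thread count, and queue depth in the table; the pool, image, and client names are placeholders, not values from the talk.

```python
# Sketch: run the 4KB random-write case with fio's librbd ("rbd") engine.
# Pool/image/client names are placeholders; numeric parameters follow the slide's table.
import subprocess

cmd = [
    'fio',
    '--name=4k-randwrite',
    '--ioengine=rbd',          # librbd-based fio engine
    '--clientname=admin',      # cephx user (assumption)
    '--pool=rbd',              # placeholder pool
    '--rbdname=test-image',    # placeholder RBD image
    '--rw=randwrite',
    '--bs=4k',
    '--numjobs=8',             # "Threads 8"
    '--iodepth=8',             # "Queue Depth 8"
    '--runtime=300',           # "Run Time 300"
    '--ramp_time=10',          # "Ramp Time 10"
    '--time_based',
    '--group_reporting',
]
subprocess.run(cmd, check=True)
```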
  13. VM Performance Comparison: Random Workloads. [Chart: IOPS (KIOPS) and latency (ms) for 4KB and 32KB random writes and random reads, SKT Ceph vs. community Ceph; SKT Ceph shows far higher IOPS in every case, e.g. 71 KIOPS vs. 3 KIOPS for 4KB random write and 185 KIOPS vs. 114 KIOPS for 4KB random read.]
  14. VM Performance Comparison: Sequential Workloads. [Chart: bandwidth (MB/s) and latency (ms) for 1MB and 4MB sequential writes and reads, SKT Ceph vs. community Ceph; sequential bandwidth is comparable on both, roughly 2,700-2,950 MB/s for writes and about 4,280-4,300 MB/s for reads.]
  15. SKT AF-Ceph. AFC-S: 4 data nodes + 1 management node, based on commodity servers and SSDs. Monitor node (management server); data nodes (OSD nodes) with an NVRAM journal and 10 SATA SSDs each as the data store. System configuration: 4 data nodes + 1 monitor node in 5U; 40 SATA SSDs (in 4U); 8GB NVRAM journal; capacity 40TB total / 20TB usable with 1TB SSDs, or 80TB total / 40TB usable with 2TB SSDs. Node hardware: Intel Xeon E5-2690v3 2-socket CPU, 128GB RAM (DDR3 1866MHz), 2x 10GbE for service and storage networks. AFC-N: 2U microserver (4 data nodes) + 1U NVMe all-flash JBOF (NV-Array with NV-Drive NVMe SSDs; E5 2-socket server, 4 nodes in 2U): high performance (PCIe 3.0), high density (24x 2.5" NVMe SSDs, up to 96TB), planned for Q4 2016.
  16. SKT AF-Ceph Management Features. Real-time monitoring, multi-dashboard, rule-based alarms, drag & drop admin, REST API, real-time graphs, graph merge, drag & zooming, auto configuration, cluster management, RBD management, object storage management.
  17. End of Document. Contact: Jugwan Eom, jugwan.eom@sk.com
