©2017 LinkedIn Corporation. All Rights Reserved.
Kafka at half the price
Dong Lin
Streams Infrastructure
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
RAID-10 setup with RF=2
[Diagram: a producer writes to Broker 1 and Broker 2; each broker stores two mirrored copies of partitions A, B and C]
- Tolerate only one broker failure
RAID-10 setup with RF=3
[Diagram: a producer writes to Broker 1, Broker 2 and Broker 3; each broker stores two mirrored copies of partitions A, B and C]
- Tolerate up to two broker failures
- 50% more storage cost
JBOD setup with RF=2
[Diagram: a producer writes to Broker 1 and Broker 2; each broker stores a single copy of partitions A, B and C]
- Tolerate only one broker failure
- 50% less storage cost
JBOD setup with RF=3
[Diagram: a producer writes to Broker 1, Broker 2 and Broker 3; each broker stores a single copy of partitions A, B and C]
- Tolerate up to two broker failures
- 25% less storage cost
RAID vs. JBOD
Setup     Replication    Storage cost    Broker failure tolerance   Disk failure tolerance
RAID-10   2 (baseline)   4X              1 (too small)              3
RAID-10   3              6X (50% up)     2                          5
JBOD      2              2X (50% down)   1 (too small)              1 (too small)
JBOD      3 (future)     3X (25% down)   2 (100% up)                2 (33% down)
JBOD      4              4X              3 (200% up)                3
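The numbers in the table can be reproduced with a small calculation. This is a sketch under stated assumptions, not the presenter's own derivation: it assumes RAID-10 keeps two copies of every block per broker, that JBOD keeps one, and that tolerance means the worst-case number of failures that can never lose every copy of a partition. The function names are hypothetical.

```python
def raid_vs_jbod(setup: str, rf: int) -> dict:
    """Storage cost and worst-case failure tolerance for one setup."""
    if setup == "RAID-10":
        copies_per_broker = 2  # assumption: each block is mirrored once
    elif setup == "JBOD":
        copies_per_broker = 1
    else:
        raise ValueError(f"unknown setup: {setup}")
    total_copies = rf * copies_per_broker
    return {
        "storage_cost": total_copies,        # in units of the raw data size
        "broker_failures": rf - 1,           # one surviving broker is enough
        "disk_failures": total_copies - 1,   # one surviving copy is enough
    }


def percent_change(value: int, baseline: int) -> int:
    """Relative change against the baseline, rounded to a whole percent."""
    return round(100 * (value - baseline) / baseline)


baseline = raid_vs_jbod("RAID-10", 2)
for setup, rf in [("RAID-10", 2), ("RAID-10", 3), ("JBOD", 2), ("JBOD", 3), ("JBOD", 4)]:
    row = raid_vs_jbod(setup, rf)
    change = percent_change(row["storage_cost"], baseline["storage_cost"])
    print(setup, rf, row, f"storage {change:+d}% vs baseline")
```

Under these assumptions JBOD with RF=4 costs the same as the RAID-10 baseline while tolerating three broker failures instead of one.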
Problem 1: All replicas become offline if any log directory fails
[Diagram: a broker with Disk A, Disk B and Disk C hits an IOException when accessing Disk B; afterwards the replicas on all three disks are offline]
Solution: Only replicas on the failed disk become offline
[Diagram: a broker with Disk A, Disk B and Disk C hits an IOException when accessing Disk B; afterwards only the replicas on Disk B are offline]
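The behaviour described above can be illustrated with a toy model. This is a minimal sketch, not Kafka's implementation: the `Broker` class, its methods and the replica names are all hypothetical, and an I/O error is simulated by a direct method call.

```python
class Broker:
    """Toy broker that tracks which replicas live in which log directory."""

    def __init__(self, replicas_by_dir):
        self.replicas_by_dir = {d: set(r) for d, r in replicas_by_dir.items()}
        self.offline_dirs = set()

    def handle_io_error(self, log_dir):
        # Mark only the failing directory offline instead of stopping the broker.
        self.offline_dirs.add(log_dir)

    def offline_replicas(self):
        return set().union(*(self.replicas_by_dir[d] for d in self.offline_dirs))

    def online_replicas(self):
        all_replicas = set().union(*self.replicas_by_dir.values())
        return all_replicas - self.offline_replicas()


broker = Broker({"diskA": ["t-0", "t-1"], "diskB": ["t-2"], "diskC": ["t-3"]})
broker.handle_io_error("diskB")
print(sorted(broker.offline_replicas()))
print(sorted(broker.online_replicas()))
```

Replicas on Disk A and Disk C keep serving traffic; only the replica on Disk B is reported offline.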
Problem 2: Controller does not recognize disk failure
[Diagram: Zookeeper, Controller, and Broker 1 with Partition 1 and Partition 2; an X marks the failure]
STEP 1: I am online
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 3: Become leader for partitions 1 and 2
STEP 4: partition 2 is offline
No further leader election
Solution: Broker notifies and provides partition list to controller
[Diagram: Zookeeper, Controller, and Broker 1 with Partition 1 and Partition 2; an X marks the failure]
STEP 1: Notify disk failure
STEP 2: Broker 1 has new disk failure
STEP 3: Become leader for partitions 1 and 2
STEP 4: partition 2 is offline
STEP 5: Elect another broker as leader for partition 2
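The election in STEP 5 can be sketched as a pure function. This is an illustration only: the function name, the data shapes and the "first live replica wins" rule are assumptions made for the example, not the controller's actual election logic.

```python
def elect_leaders(assignment, offline):
    """Pick a leader per partition, skipping replicas reported offline.

    assignment: partition id -> ordered list of broker ids holding a replica
    offline:    set of (broker id, partition id) pairs reported as offline
    Returns partition id -> leader broker id, or None if no replica is usable.
    """
    leaders = {}
    for partition, replicas in assignment.items():
        candidates = [b for b in replicas if (b, partition) not in offline]
        leaders[partition] = candidates[0] if candidates else None
    return leaders


# Broker 1 reports that partition 2 is offline after a disk failure.
assignment = {1: [1, 2], 2: [1, 2]}
offline = {(1, 2)}
print(elect_leaders(assignment, offline))
```

Broker 1 stays leader for partition 1, while partition 2 moves to broker 2.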
Problem 3: Broker always creates log for partition if it does not exist
[Diagram: Zookeeper, Controller, and Broker 1 with Partition 1 and Partition 2; an X marks the failure, and a second Partition 2 appears on Broker 1]
STEP 1: I am online
STEP 2:
- Broker -> is alive?
- Broker -> partition list
STEP 3: Become follower for partition 2. Create partition 2 if non-existent
STEP 4: Created partition 2 (problematic)
Good disk may become overloaded
Solution: Controller specifies whether to create log for partition
[Diagram: Zookeeper, Controller, and Broker 1 with Partition 1 and Partition 2; an X marks the failure]
STEP 1: I am online
STEP 2:
- Broker -> is alive?
- Broker -> partition list
- Broker -> is new partition?
STEP 3: Become follower for partition 2. This is NOT a new partition
STEP 4: Partition 2 is not available and there is offline log dir
STEP 5: Exclude broker 1 from leader election for partition 2
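The broker-side decision in STEP 3 and STEP 4 can be sketched as follows. This is a simplified model under stated assumptions: the function, its parameters and the string results are hypothetical, and the real request and error types are not shown.

```python
def become_follower(partition, is_new, online_logs, has_offline_dir):
    """Decide whether to create a log when asked to follow a partition.

    partition:       partition id from the controller's request
    is_new:          flag set by the controller for newly created partitions
    online_logs:     set of partition ids whose logs are on healthy disks
    has_offline_dir: True if at least one log directory has failed
    """
    if partition in online_logs:
        return "ok"
    if is_new or not has_offline_dir:
        # Safe to create: the log cannot be hiding on a failed disk.
        online_logs.add(partition)
        return "created"
    # The log probably lives on the failed disk, so do not recreate it elsewhere.
    return "error: partition not available and there is an offline log dir"


logs = {1}
print(become_follower(2, is_new=False, online_logs=logs, has_offline_dir=True))
print(become_follower(3, is_new=True, online_logs=logs, has_offline_dir=True))
```

Refusing to recreate partition 2 keeps the good disk from absorbing the failed disk's load, and the error lets the controller exclude the broker from leader election for that partition.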
Problem 4: No mechanism to move replicas between disks
[Diagram: Broker 1 with partitions P1 to P7 placed on Disk 1 and Disk 2]
Example workflow to move replicas between disks
[Diagram: a Client and a Broker with Disk 1 and Disk 2]
STEP 1: DescribeDirRequest
STEP 2: DescribeDirResponse (partition list and size)
STEP 3: ChangeDirRequest
STEP 4: create p1.move
STEP 5: ChangeDirResponse (Inprogress)
STEP 6: copy data from p1.log to p1.move
STEP 7: delete p1.log and rename p1.move to p1.log
STEP 8: Verify new assignment via DescribeDirRequest
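STEPs 4, 6 and 7 amount to a copy followed by a swap. The sketch below imitates them with ordinary files. It is an illustration only: `move_replica` and the directory names are hypothetical, and it ignores that a live broker keeps appending to the log while the copy is running.

```python
import os
import shutil
import tempfile


def move_replica(src_dir, dst_dir, name):
    """Move <name>.log from src_dir to dst_dir via a temporary .move file."""
    src = os.path.join(src_dir, name + ".log")
    tmp = os.path.join(dst_dir, name + ".move")
    final = os.path.join(dst_dir, name + ".log")
    shutil.copyfile(src, tmp)  # STEP 4 and STEP 6: create p1.move and fill it
    os.remove(src)             # STEP 7: delete the old log ...
    os.rename(tmp, final)      # ... and rename p1.move to p1.log
    return final


with tempfile.TemporaryDirectory() as root:
    disk1 = os.path.join(root, "disk1")
    disk2 = os.path.join(root, "disk2")
    os.makedirs(disk1)
    os.makedirs(disk2)
    with open(os.path.join(disk1, "p1.log"), "wb") as f:
        f.write(b"records")
    moved = move_replica(disk1, disk2, "p1")
    print(os.path.basename(moved), os.listdir(disk1), os.listdir(disk2))
```

Writing to a `.move` file first means a crash during the copy leaves the original log untouched.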
Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk
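The 100X figure for RAID-0 can be checked with a back-of-the-envelope model. This is a sketch under assumptions that the slides do not state: disks fail independently with the same probability, the two replicas of a partition sit on different brokers, and a partition is unavailable only when both replicas are lost. The per-disk failure probability of 0.001 is an arbitrary illustrative value.

```python
def unavailability(p_disk, disks_per_broker, rf, raid0):
    """Probability that every replica of one partition is lost."""
    if raid0:
        # RAID-0 stripes across all disks: any single disk failure kills the broker.
        p_replica = 1 - (1 - p_disk) ** disks_per_broker
    else:
        # JBOD: a replica is lost only if its own disk fails.
        p_replica = p_disk
    return p_replica ** rf


p = 0.001  # assumed per-disk failure probability
raid0 = unavailability(p, disks_per_broker=10, rf=2, raid0=True)
jbod = unavailability(p, disks_per_broker=10, rf=2, raid0=False)
print(f"RAID-0 / JBOD unavailability ratio: {raid0 / jbod:.1f}")
```

For small failure probabilities the ratio approaches (10p)^2 / p^2 = 100, which matches the slide's claim.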
one-broker-per-machine vs. one-broker-per-disk
[Diagram: One-broker-per-machine: a physical machine with Disk 1, Disk 2 and Disk 3 served by a single Broker 1. One-broker-per-disk: a physical machine with Disk 1, Disk 2 and Disk 3 served by Broker 1, Broker 2 and Broker 3, one broker per disk]
one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disks per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput
Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
                           IO threads   Network threads   Replica-fetcher threads
  One-broker-per-machine   160          120               140
  One-broker-per-disk      16           12                14
▪ Producers deployed on 15 machines
  acks   threads   sync   retries   retry backoff   message size   batch size   request timeout
  all    50        true   MAX_INT   60 sec          100 KB         1 MB         MAX_INT
▪ Topic configuration
  partitions   replication factor   min-insync-replicas
  512          3                    3
One-broker-per-machine throughput
Average throughput is 2.3 GBps
One-broker-per-disk throughput
Average throughput is 2 GBps
Changes in operational procedures
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
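Why the replication factor and min.insync.replicas need adjusting can be seen from a one-line calculation. This is a sketch of the general rule rather than anything taken from the slides: with acks=all, a partition accepts writes only while the number of in-sync replicas is at least min.insync.replicas. The function name is hypothetical.

```python
def tolerated_replica_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


for rf, min_isr in [(2, 1), (3, 2), (4, 2)]:
    print(f"RF={rf}, min.insync.replicas={min_isr}: "
          f"tolerates {tolerated_replica_failures(rf, min_isr)} lost replica(s)")
```

Under JBOD every failed disk removes the replicas it holds, so raising the replication factor while keeping min.insync.replicas below it preserves write availability through a single disk failure.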
Future work
▪ Use a more intelligent solution to select the log directory for a new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. a disk with degraded performance
References
▪ KIP-112: Handle disk failure for JBOD
▪ KIP-113: Support replicas movement between log directories
Online book store management system project.pdf
 
Explosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdfExplosives Industry manufacturing process.pdf
Explosives Industry manufacturing process.pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdfONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
An improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technologyAn improvement in the safety of big data using blockchain technology
An improvement in the safety of big data using blockchain technology
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 

Kafka at half the price with JBOD setup

  • 1. Kafka at half the price. Dong Lin, Streams Infrastructure. ©2017 LinkedIn Corporation. All Rights Reserved.
  • 2. Agenda
      ▪ Motivation
        – Why switch from RAID-10 to JBOD?
        – Tradeoff between cost and fault-tolerance
      ▪ Design
        – How to run Kafka with disk failure
        – How to move replicas between disks
      ▪ Alternatives
      ▪ Evaluation
      ▪ Changes in operational procedures
      ▪ Future work
      ▪ Reference
  • 3. RAID-10 setup with RF=2
      [Diagram: a producer, Broker 1 and Broker 2; each broker holds two mirrored copies of partitions A, B, C]
  • 4. RAID-10 setup with RF=2
      [Diagram: a producer, Broker 1 and Broker 2; each broker holds two mirrored copies of partitions A, B, C]
      - Tolerate only one broker failure
  • 5. RAID-10 setup with RF=3
      [Diagram: a producer, Broker 1, Broker 2 and Broker 3; each broker holds two mirrored copies of partitions A, B, C]
      - Tolerate up to two broker failures
      - 50% more storage cost
  • 6. JBOD setup with RF=2
      [Diagram: a producer, Broker 1 and Broker 2; each broker holds one copy of partitions A, B, C]
      - Tolerate only one broker failure
      - 50% less storage cost
  • 7. JBOD setup with RF=3
      [Diagram: a producer, Broker 1, Broker 2 and Broker 3; each broker holds one copy of partitions A, B, C]
      - Tolerate up to two broker failures
      - 25% less storage cost
  • 8. RAID vs. JBOD
      Setup     Replication    Storage cost     Broker failure tolerance   Disk failure tolerance
      RAID-10   2 (baseline)   4X               1 (too small)              3
      RAID-10   3              6X (50% up)      2                          5
  • 9. RAID vs. JBOD
      Setup     Replication    Storage cost     Broker failure tolerance   Disk failure tolerance
      RAID-10   2 (baseline)   4X               1 (too small)              3
      RAID-10   3              6X (50% up)      2                          5
      JBOD      2              2X (50% down)    1 (too small)              1 (too small)
      JBOD      3 (future)     3X (25% down)    2 (100% up)                2 (33% down)
  • 10. RAID vs. JBOD
      Setup     Replication    Storage cost     Broker failure tolerance   Disk failure tolerance
      RAID-10   2 (baseline)   4X               1 (too small)              3
      RAID-10   3              6X (50% up)      2                          5
      JBOD      2              2X (50% down)    1 (too small)              1 (too small)
      JBOD      3 (future)     3X (25% down)    2 (100% up)                2 (33% down)
      JBOD      4              4X               3 (200% up)                3
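
The figures in the RAID vs. JBOD table can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes RAID-10 keeps two mirrored copies of each replica on a broker, JBOD keeps one, and a partition survives as long as one physical copy remains. The function name `storage_profile` is made up for this example and does not come from the slides.

```python
def storage_profile(setup: str, rf: int) -> dict:
    """Return storage cost and failure tolerance for one table row.

    Assumption: RAID-10 stores every replica twice (mirroring),
    JBOD stores it once.
    """
    copies_per_broker = 2 if setup == "RAID-10" else 1
    total_copies = copies_per_broker * rf
    return {
        # Storage cost as a multiple of the raw data size (the "X" column).
        "storage_cost": total_copies,
        # Data survives while at least one broker with a replica is alive.
        "broker_failures_tolerated": rf - 1,
        # Data survives while at least one physical copy is readable.
        "disk_failures_tolerated": total_copies - 1,
    }


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value against the RAID-10, RF=2 baseline."""
    return (value - baseline) / baseline * 100
```

Under these assumptions, `storage_profile("JBOD", 3)` gives a 3X storage cost, which `relative_change(3, 4)` reports as 25% below the 4X baseline.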
  • 11. Agenda
      ▪ Motivation
        – Why switch from RAID-10 to JBOD?
        – Tradeoff between cost and fault-tolerance
      ▪ Design
        – How to run Kafka with disk failure
        – How to move replicas between disks
      ▪ Alternatives
      ▪ Evaluation
      ▪ Changes in operational procedures
      ▪ Future work
      ▪ Reference
  • 12. Problem 1: All replicas become offline if any log directory fails
      [Diagram: a broker with Disk A, Disk B and Disk C, shown before and after an IOException when accessing Disk B]
  • 13. Solution: Only replicas on the failed disk become offline
      [Diagram: a broker with Disk A, Disk B and Disk C, shown before and after an IOException when accessing Disk B]
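
The behaviour described on slide 13 can be sketched as a small model. This is not Kafka's implementation; the class `Broker` and its methods are hypothetical names used only to show that an I/O error takes one log directory offline rather than the whole broker.

```python
class Broker:
    """Toy broker that tracks which replicas live in which log directory."""

    def __init__(self, replicas_by_dir):
        # Mapping: log directory name -> replicas stored there.
        self.replicas_by_dir = {d: set(r) for d, r in replicas_by_dir.items()}
        self.offline_dirs = set()

    def on_io_exception(self, log_dir):
        # Mark only the failing directory offline; the broker keeps running.
        self.offline_dirs.add(log_dir)

    def online_replicas(self):
        return {
            replica
            for log_dir, replicas in self.replicas_by_dir.items()
            if log_dir not in self.offline_dirs
            for replica in replicas
        }

    def offline_replicas(self):
        return {
            replica
            for log_dir in self.offline_dirs
            for replica in self.replicas_by_dir.get(log_dir, set())
        }
```

With three directories and an error on `disk_b`, only the replicas stored on `disk_b` show up in `offline_replicas()`; the rest stay available.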
  • 14. Problem 2: Controller does not recognize disk failure
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; X marks the failure]
      STEP 1: I am online
      STEP 2: Broker -> is alive? Broker -> partition list
      STEP 3: Become leader for partitions 1 and 2
      STEP 4: partition 2 is offline
      No further leader election
  • 15. Solution: Broker notifies and provides partition list to controller
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; X marks the failure]
      STEP 1: Notify disk failure
      STEP 2: Broker 1 has new disk failure
      STEP 3: Become leader for partitions 1 and 2
      STEP 4: partition 2 is offline
      STEP 5: Elect another broker as leader for partition 2
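
Step 5 on slide 15 amounts to a leader election that skips replicas reported as offline. The sketch below is a simplified illustration, not the controller's actual algorithm: `elect_leaders`, the "first eligible replica wins" rule and the data shapes are all assumptions made for this example.

```python
def elect_leaders(assignment, offline_replicas):
    """Pick a leader per partition, skipping replicas reported offline.

    assignment: partition id -> ordered list of broker ids holding a replica.
    offline_replicas: set of (broker id, partition id) pairs that brokers
        have reported as sitting on a failed disk.
    Returns partition id -> leader broker id, or None if no replica is usable.
    """
    leaders = {}
    for partition, brokers in assignment.items():
        eligible = [b for b in brokers if (b, partition) not in offline_replicas]
        # Assumption: the first eligible replica in the list becomes leader.
        leaders[partition] = eligible[0] if eligible else None
    return leaders
```

In the slide's scenario, broker 1 stays leader for partition 1 while partition 2 moves to another broker once broker 1 reports it offline.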
  • 16. Problem 3: Broker always creates log for partition if not exist
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; X marks the failure]
      STEP 1: I am online
      STEP 2: Broker -> is alive? Broker -> partition list
      STEP 3: Become follower for partition 2. Create partition 2 if non-existent
  • 17. Problem 3: Broker always creates log for partition if not exist
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1, Partition 2 and a newly created Partition 2; X marks the failure]
      STEP 1: I am online
      STEP 2: Broker -> is alive? Broker -> partition list
      STEP 3: Become follower for partition 2. Create partition 2 if non-existent
      STEP 4: Created partition 2 (problematic)
  • 18. Problem 3: Broker always creates log for partition if not exist
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1, Partition 2 and a newly created Partition 2; X marks the failure]
      STEP 1: I am online
      STEP 2: Broker -> is alive? Broker -> partition list
      STEP 3: Become follower for partition 2. Create partition 2 if non-existent
      STEP 4: Created partition 2 (problematic)
      Good disk may become overloaded
  • 19. Solution: Controller specifies whether to create log for partition
      [Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; X marks the failure]
      STEP 1: I am online
      STEP 2: Broker -> is alive? Broker -> partition list. Broker -> is new partition?
      STEP 3: Become follower for partition 2. This is NOT a new partition
      STEP 4: Partition 2 is not available and there is offline log dir
      STEP 5: Exclude broker 1 from leader election for partition 2
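
The broker-side decision on slide 19 can be written as one small function. This is a hedged sketch: the function name, its parameters and the string return values are invented for illustration, and the branch for "not new, no offline directory" (create the log as before) is an assumption, since the slide only covers the case where an offline log directory exists.

```python
def handle_become_follower(partition, is_new, local_logs, offline_dir_exists):
    """Decide what a broker does when told to follow a partition.

    partition: partition id named in the controller's request.
    is_new: flag from the controller saying the partition was just created.
    local_logs: set of partition ids found on the broker's healthy disks
        (mutated when a log is created).
    offline_dir_exists: True if at least one log directory is offline.
    """
    if partition in local_logs:
        return "follow"  # Log is present on a good disk; nothing to create.
    if is_new or not offline_dir_exists:
        # Assumption: with all disks healthy, a missing log is still created.
        local_logs.add(partition)
        return "created"
    # Not new, not found, and a disk is offline: the log is probably on the
    # failed disk, so report it instead of recreating it on a good disk.
    return "offline"
```

Returning "offline" here corresponds to STEP 4, which lets the controller exclude the broker from leader election in STEP 5.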
  • 20. Problem 4: No mechanism to move replicas between disks
      [Diagram: Broker 1 with partitions P1 to P7 spread across Disk 1 and Disk 2]
  • 21. Example workflow to move replicas between disks
      [Diagram: a Client and a Broker with Disk 1 and Disk 2]
      STEP 1: DescribeDirRequest
      STEP 2: DescribeDirResponse (partition list and size)
      STEP 3: ChangeDirRequest
      STEP 4: create p1.move
      STEP 5: ChangeDirResponse (Inprogress)
      STEP 6: copy data from p1.log to p1.move
      STEP 7: delete p1.log and rename p1.move to p1.log
      STEP 8: Verify new assignment via DescribeDirRequest
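
Steps 4, 6 and 7 of the workflow on slide 21 follow a copy-then-swap pattern. The sketch below mimics that pattern with ordinary files; it is an illustration under simplifying assumptions (one file per replica, no concurrent writes), and `move_replica` is a hypothetical helper, not a Kafka API.

```python
import shutil
from pathlib import Path


def move_replica(src_disk: Path, dst_disk: Path, name: str) -> Path:
    """Move <name>.log from src_disk to dst_disk through a .move staging file."""
    source = src_disk / f"{name}.log"
    staging = dst_disk / f"{name}.move"
    final = dst_disk / f"{name}.log"

    staging.touch()                    # STEP 4: create p1.move on the target disk
    shutil.copyfile(source, staging)   # STEP 6: copy data from p1.log to p1.move
    source.unlink()                    # STEP 7: delete the old p1.log ...
    staging.rename(final)              # ... and rename p1.move to p1.log
    return final
```

Staging the copy under a different name means a half-copied file is never mistaken for the live log if the move is interrupted.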
  • 22. Agenda
      ▪ Motivation
        – Why switch from RAID-10 to JBOD?
        – Tradeoff between cost and fault-tolerance
      ▪ Design
        – How to run Kafka with disk failure
        – How to move replicas between disks
      ▪ Alternatives
      ▪ Evaluation
      ▪ Changes in operational procedures
      ▪ Future work
      ▪ Reference
  • 23. Alternatives
      ▪ RAID-0 doesn’t provide disk fault tolerance
        – Assume each broker has 10 disks and RF = 2
        – RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
      ▪ RAID-5 and RAID-6 have poor performance
      ▪ Hardware RAID is expensive
      ▪ One broker per disk
  • 24. One-broker-per-machine vs. one-broker-per-disk
      [Diagram: One-broker-per-machine: a physical machine with Disk 1, Disk 2, Disk 3 and a single Broker 1. One-broker-per-disk: a physical machine with Disk 1, Disk 2, Disk 3 and Broker 1, Broker 2, Broker 3]
  • 25. One-broker-per-machine vs. one-broker-per-disk
      ▪ Both solutions use JBOD as disk configuration
      ▪ Main drawbacks of one-broker-per-disk (assume 10 disks per machine)
        – 100X threads and 100X sockets per machine
        – 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
        – 10X broker instances and configuration files to manage
        – 10X time to bounce a cluster if we bounce one broker at a time
        – 10X load on external service (e.g. a service used to query per-topic ACL)
        – Less efficient quota enforcement
        – Less efficient rebalance across disks on the same machine
        – Lower throughput
  • 26. Experimental setup
      ▪ Brokers deployed on 15 machines with 10 disks per machine
          Setup                    IO threads   Network threads   Replica-fetcher threads
          One-broker-per-machine   160          120               140
          One-broker-per-disk      16           12                14
      ▪ Producers deployed on 15 machines
          acks   threads   sync   retries   retry backoff   message size   batch size   request timeout
          all    50        true   MAX_INT   60 sec          100 KB         1 MB         MAX_INT
      ▪ Topic configuration
          partition   replication factor   min-insync-replicas
          512         3                    3
  • 27. One-broker-per-machine throughput
      Average throughput is 2.3 GBps
  • 28. One-broker-per-disk throughput
      Average throughput is 2 GBps
  • 29. Agenda
      ▪ Motivation
        – Why switch from RAID-10 to JBOD?
        – Tradeoff between cost and fault-tolerance
      ▪ Design
        – How to run Kafka with disk failure
        – How to move replicas between disks
      ▪ Alternatives
      ▪ Evaluation
      ▪ Changes in operational procedures
      ▪ Future work
      ▪ Reference
  • 30. Changes in operational procedure
      ▪ Adjust replication factor and min.insync.replicas
      ▪ Configure num.replica.move.threads for broker
      ▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
  • 31. Future work
      ▪ Use more intelligent solution to select log directory for new replica
      ▪ Automatic load balancing across log directories on the same broker
        – Reduced operational overhead
      ▪ Distribute segments of a given replica across multiple log directories
        – Less overhead for rebalance between disks
        – Higher partition size limit
      ▪ Handle partial disk failure, e.g. disk with degraded performance
  • 32. References
      ▪ KIP-112: Handle disk failure for JBOD (link)
      ▪ KIP-113: Support replicas movement between log directories (link)