Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
40. WHAT WE CONSIDERED
➤ VMs + Chef
    ➤ Mutable infrastructure
➤ VMs + Machine/Docker Images
    ➤ Same amount of toil, new cloud infra problems
➤ Managed Instance Groups
    ➤ Not good for stateful workloads
➤ …Kubernetes?!
48. RESOURCE MANAGEMENT
➤ Examples of separation:
    ➤ zones
    ➤ node pools - varying workloads, e.g. high CPU requirement
➤ Workload allocation controlled by:
    ➤ nodeSelectors
        pool: highmem
        failure-domain.beta.kubernetes.io/zone: us-central1-a
    ➤ taints + tolerations
        - key: pool
          operator: Equal
          value: highmem
          effect: NoSchedule
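For context, here is a minimal sketch of where the two fragments above would sit in a pod spec. The pod name and the untagged image are placeholders, and it assumes the target nodes already carry the label `pool=highmem` and the taint `pool=highmem:NoSchedule`. Note that newer Kubernetes releases deprecate the `failure-domain.beta.kubernetes.io/zone` label in favor of `topology.kubernetes.io/zone`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kafka-example              # hypothetical name
spec:
  # nodeSelector: only schedule onto nodes that carry both labels.
  nodeSelector:
    pool: highmem
    failure-domain.beta.kubernetes.io/zone: us-central1-a
  # tolerations: allow scheduling onto nodes tainted pool=highmem:NoSchedule,
  # which keeps pods without this toleration off that pool.
  tolerations:
    - key: pool
      operator: Equal
      value: highmem
      effect: NoSchedule
  containers:
    - name: kafka
      image: confluentinc/cp-kafka # tag omitted on purpose; pin one in practice
```

The selector pulls the pod toward the pool, while the taint and toleration pair keeps other workloads away from it; using both gives a dedicated pool.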
56. SERVICE DISCOVERY
➤ Within the cluster: ClusterIP Service
    ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker
➤ External services to Kafka:
    ➤ Bootstrapping: Cloud DNS to LoadBalancer
    ➤ Direct to broker: NodePort*
➤ Configuring your Kafka listeners
    ➤ https://rmoff.net/2018/08/02/kafka-listeners-explained/
* In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.
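As an illustration of the listener configuration this slide points to, here is a sketch of a container `env` fragment for a single broker. It assumes the Confluent image convention of mapping `KAFKA_*` environment variables to broker properties. The listener names, ports, and every hostname are hypothetical, and PLAINTEXT is used only to keep the example short.

```yaml
# Fragment of a broker container spec (broker 0); not a complete manifest.
env:
  # Map each named listener to a security protocol.
  - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
    value: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
  # Sockets the broker binds to inside the pod.
  - name: KAFKA_LISTENERS
    value: "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093"
  # Addresses handed back to clients after bootstrap; each one must be
  # resolvable and reachable from where that kind of client runs.
  - name: KAFKA_ADVERTISED_LISTENERS
    value: "INTERNAL://kafka-0.kafka-headless.kafka.svc.cluster.local:9092,EXTERNAL://kafka-0.example.com:9093"
  # Broker-to-broker traffic stays on the in-cluster listener.
  - name: KAFKA_INTER_BROKER_LISTENER_NAME
    value: "INTERNAL"
```

The point is that the bootstrap address (the LoadBalancer behind Cloud DNS) only gets a client its first metadata response; after that, clients connect to whatever each broker advertises, so the advertised addresses have to work for direct-to-broker traffic.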
60. SOME NOTES ON MANAGEMENT
➤ Installation, deploy:
    ➤ Confluent Helm charts
        ➤ https://github.com/confluentinc/cp-helm-charts
        ➤ No Tiller, Helm templating only
    ➤ Confluent Docker images, with additions
75. WHAT WE LEARNED
➤ Ephemeral resources
    ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions
    ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)
    ➤ Check the Apache Kafka JIRA, and upgrade Kafka
        ➤ Producers and consumers too!
        ➤ Examples: KAFKA-7755, KAFKA-7890
        ➤ See also: cp-helm-charts issue #240
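One possible way to act on the DNS cache point for a JVM-based client running in Kubernetes is sketched below. `networkaddress.cache.ttl` is a Java security property, so it is normally set in the JVM's `java.security` file or through `Security.setProperty` rather than with `-D`. The sketch instead sets the JDK-specific system property `sun.net.inetaddr.ttl`, which is consulted when the security property is unset, via the standard `JAVA_TOOL_OPTIONS` environment variable. The container name, image, and the 30-second value are all assumptions for illustration.

```yaml
# Fragment of a pod spec for a Kafka producer/consumer; not a complete manifest.
containers:
  - name: my-kafka-client            # hypothetical name
    image: example/my-kafka-client   # hypothetical image
    env:
      # The JVM reads JAVA_TOOL_OPTIONS at startup. A short positive TTL means
      # a broker hostname is re-resolved soon after its pod is rescheduled
      # onto a new IP, instead of the old address being cached indefinitely.
      - name: JAVA_TOOL_OPTIONS
        value: "-Dsun.net.inetaddr.ttl=30"
```

Whether this is needed depends on the JVM version and on whether a security manager is installed, so it is worth checking the effective TTL before changing anything.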
96. WHAT WE HAD TO FIGURE OUT
➤ Different kinds of updates/rolling restarts
    ➤ Changes to Kafka cluster: versions, broker properties
    ➤ Workload configuration: resources, security policies
    ➤ Upgrading Kubernetes nodes
➤ How health checks affect updates
    ➤ podDisruptionBudget, podManagementPolicy
    ➤ Health check overrides
    ➤ What happens if you deploy a change that breaks the health checks?
        ➤ See Kubernetes issue #62750
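To make the knobs named above concrete, here is a minimal sketch of a PodDisruptionBudget alongside the relevant StatefulSet fields. All names, labels, the replica count, and the probe settings are assumptions rather than the configuration used in the talk. `policy/v1` requires Kubernetes 1.21 or later; older clusters use `policy/v1beta1`.

```yaml
# Limit voluntary disruptions (e.g. node drains during a Kubernetes node
# upgrade) to one broker at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb                  # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka                   # hypothetical label
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless      # hypothetical headless Service
  replicas: 3
  # OrderedReady (the default) creates and deletes pods one at a time and
  # waits for each to be Ready; Parallel does not wait.
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate            # a rolling update waits on pod readiness
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka   # tag omitted on purpose
          # The rolling update only moves on once this probe passes, so a
          # change that breaks the probe can stall the rollout.
          readinessProbe:
            tcpSocket:
              port: 9092
            initialDelaySeconds: 30
            periodSeconds: 10
```

A plain TCP check is the simplest possible probe; it says the port is open, not that the broker has caught up on replication, which is one reason health checks deserve their own testing before a production rollout.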
108. MIGRATING PRODUCTION ARCHITECTURE
➤ Get your hands dirty!
➤ Simulate common maintenance tasks
➤ Benchmark for performance
    ➤ kafka-producer-perf-test, kafka-consumer-perf-test
    ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…
    ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs
➤ Simulate failure
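One way to run the producer benchmark from inside the cluster is as a Kubernetes Job, sketched below. The Job name, topic, bootstrap address, and all numeric values are placeholders; the flags are those of the stock `kafka-producer-perf-test` tool as shipped in the Confluent image.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kafka-producer-perf          # hypothetical name
spec:
  backoffLimit: 0                    # do not retry a failed benchmark run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: confluentinc/cp-kafka   # tag omitted on purpose
          command:
            - kafka-producer-perf-test
            - --topic
            - perf-test              # hypothetical topic
            - --num-records
            - "1000000"
            - --record-size          # bytes per record: one variable to sweep
            - "1024"
            - --throughput           # -1 disables throttling
            - "-1"
            - --producer-props
            - bootstrap.servers=kafka:9092   # hypothetical Service address
            - batch.size=16384       # producer batch size: another variable
```

Rerunning the same Job while changing one variable at a time (record size, batch size, broker disk type, CPU count) keeps the comparisons meaningful.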
115. WHY IT WORKS FOR US
➤ Increased automation
➤ Simpler configuration
➤ Efficient resource usage
    ➤ Bin packing
    ➤ GKE autoscaling
➤ Improved developer workflows for streaming services
    ➤ e.g. adding new Kafka Streams applications, Kafka Connect workloads