Kafka on Kubernetes: Keeping It Simple (Nikki Thean, Etsy) Kafka Summit SF 2019

KAFKA ON KUBERNETES:
Keeping It Simple

We’re not doing that… 
right?
Definitely not!

ONE YEAR LATER
1. Production Kafka cluster on Kubernetes.
2. Suggesting this idea to other people.

RUNNING KAFKA ON
KUBERNETES DOESN'T
HAVE TO BE COMPLICATED.*
* "Complicated" = custom resource definitions, plugins, operators, etc.

WHAT YOU’LL GET OUT OF THIS
➤ Example of real-life production setup
➤ Technical tips and tricks
➤ Advice for migrating production systems

SYSTEMS MENTIONED
➤ Kafka
➤ Kubernetes
➤ Chef
➤ Terraform
➤ Helm
➤ Prometheus
➤ Google Cloud Platform

WHAT WE BUILT
and how it works

WHAT WE CONSIDERED
➤ VMs + Chef
    ➤ Mutable infrastructure
➤ VMs + Machine/Docker Images
    ➤ Same amount of toil, new cloud infra problems
➤ Managed Instance Groups
    ➤ Not good for stateful workloads
➤ …Kubernetes?!

RESOURCE MANAGEMENT
➤ Examples of separation:
    ➤ zones
    ➤ node pools - varying workloads, e.g. high CPU requirement
➤ Workload allocation controlled by:
    ➤ nodeSelectors
        pool: highmem
        failure-domain.beta.kubernetes.io/zone: us-central1-a
    ➤ taints + tolerations
        - key: pool
          operator: Equal
          value: highmem
          effect: NoSchedule
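
As a rough sketch of how the selector and toleration from the slide might sit together in a pod template: the label and taint names come from the slide, while the surrounding StatefulSet structure is an assumption for illustration and not Etsy's actual manifest.

```yaml
# Hypothetical fragment of a StatefulSet pod template; only the
# scheduling-related fields are shown.
spec:
  template:
    spec:
      # Attract the pod to nodes that carry both labels.
      nodeSelector:
        pool: highmem
        failure-domain.beta.kubernetes.io/zone: us-central1-a
      # Allow the pod onto nodes tainted pool=highmem:NoSchedule,
      # which repel every pod that lacks this toleration.
      tolerations:
        - key: pool
          operator: Equal
          value: highmem
          effect: NoSchedule
```

The two mechanisms are complementary: the nodeSelector pulls Kafka pods onto the pool, and the taint keeps other workloads off it. On newer Kubernetes versions the zone label is `topology.kubernetes.io/zone`; the `failure-domain.beta` form shown on the slide is the older, deprecated name.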

SERVICE DISCOVERY
➤ Within the cluster: ClusterIP Service
    ➤ e.g. Kafka broker to Kafka broker, Kafka Connect worker to Kafka broker
➤ External services to Kafka:
    ➤ Bootstrapping: Cloud DNS to LoadBalancer
    ➤ Direct to broker: NodePort*
➤ Configuring your Kafka listeners
    ➤ https://rmoff.net/2018/08/02/kafka-listeners-explained/
* In our specific case, ClusterIP due to VPC-native IP aliasing on GCP. Also possible: dedicated LoadBalancers.
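
One way the external path could be wired up, as a sketch only. Every name, port, and hostname below (`kafka-bootstrap`, `kafka-headless`, `kafka-0.example.com`, ports 9092/9093/31090, the `app: kafka` label) is an assumption made for illustration; the slides do not show the real values.

```yaml
# Bootstrap entry point: a LoadBalancer Service in front of all brokers.
# A Cloud DNS record would point at the load balancer's address.
apiVersion: v1
kind: Service
metadata:
  name: kafka-bootstrap
spec:
  type: LoadBalancer
  selector:
    app: kafka
  ports:
    - name: external
      port: 9093
      targetPort: 9093
---
# Direct-to-broker access: one NodePort Service per broker pod, selected
# via the pod-name label that the StatefulSet controller adds to each pod.
apiVersion: v1
kind: Service
metadata:
  name: kafka-0-external
spec:
  type: NodePort
  selector:
    statefulset.kubernetes.io/pod-name: kafka-0
  ports:
    - name: external
      port: 9093
      targetPort: 9093
      nodePort: 31090
---
# Broker properties for broker 0, written here as a YAML map of
# Kafka property name to value (not a Kubernetes object).
listeners: "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093"
advertised.listeners: "INTERNAL://kafka-0.kafka-headless.default.svc.cluster.local:9092,EXTERNAL://kafka-0.example.com:31090"
listener.security.protocol.map: "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT"
inter.broker.listener.name: "INTERNAL"
```

The key point is that each broker advertises a different address per listener: clients inside the cluster are told the in-cluster DNS name, while external clients are told an address they can actually reach after bootstrapping through the load balancer.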

SOME NOTES ON MANAGEMENT
➤ Installation, deploy:
    ➤ Confluent Helm charts
    ➤ https://github.com/confluentinc/cp-helm-charts
    ➤ No Tiller, Helm templating only
    ➤ Confluent Docker images, with additions
➤ Monitoring:
    ➤ Kafka exposes JMX metrics by default
    ➤ Prometheus JMX Exporter as Java agent (vs. Helm chart sidecar)
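
A minimal sketch of the Java-agent approach, assuming the exporter jar and its config file are already present in the image (for example as one of the "additions" mentioned above). The file paths, the container name, and port 5556 are illustrative assumptions, and the sketch assumes the image's start script honours `KAFKA_OPTS` the way the stock Kafka scripts do.

```yaml
# Container fragment: load the Prometheus JMX exporter inside the broker
# JVM instead of running it as a separate sidecar container.
containers:
  - name: kafka-broker
    env:
      - name: KAFKA_OPTS
        # Agent syntax: -javaagent:<jar>=<port>:<config file>
        value: "-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=5556:/etc/jmx-exporter/kafka.yml"
    ports:
      - name: metrics
        containerPort: 5556
---
# Minimal exporter config (the file passed to the agent above).
# The catch-all pattern exports every MBean attribute the exporter can
# read; a real setup would likely narrow this down.
lowercaseOutputName: true
rules:
  - pattern: ".*"
```

Running the exporter in-process avoids opening a remote JMX port between containers, at the cost of tying the exporter's lifecycle to the broker's.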

WHAT WE LEARNED
and what we think you should know

WHAT WE LEARNED
➤ Ephemeral resources
    ➤ Don’t assume static IPs; make sure Kafka clients can handle pod evictions
    ➤ Check your JVM DNS cache TTL (networkaddress.cache.ttl)
    ➤ Check the Apache Kafka JIRA, and upgrade Kafka
        ➤ Producers and consumers too!
        ➤ Examples: KAFKA-7755, KAFKA-7890
        ➤ See also: cp-helm-charts issue #240
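
For the DNS cache point, one possible way to pin the TTL for a JVM-based client running in Kubernetes is sketched below. The container name and the 30-second value are arbitrary assumptions. `networkaddress.cache.ttl` itself is a Java security property (normally set in the `java.security` file); the sketch uses its system-property fallback, `sun.net.inetaddr.ttl`, which only takes effect when the security property is not set.

```yaml
# Hypothetical producer/consumer container fragment.
containers:
  - name: my-kafka-client
    env:
      # JAVA_TOOL_OPTIONS is read by the JVM at startup, so no change to
      # the application's own launch command is needed.
      - name: JAVA_TOOL_OPTIONS
        # Cache successful DNS lookups for 30 seconds (example value).
        value: "-Dsun.net.inetaddr.ttl=30"
```

A short TTL lets clients notice that a rescheduled broker pod has a new IP address instead of holding on to the old one.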

WHAT WE LEARNED
➤ Different kinds of updates/rolling restarts
    ➤ Changes to Kafka cluster: versions, broker properties
    ➤ Workload configuration: resources, security policies
    ➤ Upgrading Kubernetes nodes

Health checks are important for self-healing clusters!

CONTAINER PROBES
➤ Liveness Probe: “Should I restart this container?”
➤ Readiness Probe: “Should this container accept traffic?”
➤ https://github.com/andreas-schroeder/kafka-health-check
➤ Endpoint: is this broker healthy?
    ➤ Don’t be too strict!
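
A sketch of what lenient probes might look like on the broker container. The HTTP path and port 8000 assume a health-check endpoint such as the one linked above is running next to the broker; those values, the container name, and all timings are assumptions for illustration, not the settings used in the talk.

```yaml
containers:
  - name: kafka-broker
    # Liveness: a deliberately loose TCP check, so a slow or briefly
    # overloaded broker is not restarted needlessly.
    livenessProbe:
      tcpSocket:
        port: 9092
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 6
    # Readiness: ask the health endpoint whether this broker is healthy
    # before it is counted as ready.
    readinessProbe:
      httpGet:
        path: /
        port: 8000
      periodSeconds: 10
      failureThreshold: 3
```

Keeping the liveness probe looser than the readiness probe is one way to follow the "don't be too strict" advice: a failing readiness check only marks the pod as not ready, whereas a failing liveness check restarts the container.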

WHAT WE HAD TO FIGURE OUT
➤ How health checks affect updates
    ➤ podDisruptionBudget, podManagementPolicy
➤ Health check overrides
    ➤ What happens if you deploy a change that breaks the health checks?
    ➤ See Kubernetes issue #62750
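
The two settings named above could look roughly like this. Object names, the `app: kafka` label, the replica count, and the image tag are illustrative assumptions; `policy/v1` is the current PodDisruptionBudget API version, and clusters older than Kubernetes 1.21 would use `policy/v1beta1` instead.

```yaml
# Allow at most one broker to be taken down at a time by voluntary
# disruptions such as node drains during a Kubernetes upgrade.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  # Parallel lets pods be created and deleted without waiting on their
  # predecessors; per the Kubernetes docs it affects scaling operations
  # only, not rolling updates.
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka-broker
          image: confluentinc/cp-kafka:5.3.1
```

Because a rolling update waits for each updated pod to become ready before moving on, readiness checks directly gate how an update progresses, which is why the health-check design and these two settings have to be considered together.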

MIGRATING PRODUCTION ARCHITECTURE
➤ Get your hands dirty!
➤ Simulate common maintenance tasks
➤ Benchmark for performance
    ➤ kafka-producer-perf-test, kafka-consumer-perf-test
    ➤ Variables: disk type, CPU count, producer record size, producer batch size, Java opts…
    ➤ We were able to use Compute Engine persistent disks (shared storage) rather than local SSDs
➤ Simulate failure
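
One way to run a producer benchmark from inside the cluster is a one-off Job, sketched below. The Job name, image tag, topic, bootstrap address, and every numeric value are assumptions chosen for illustration; record size, batch size, and CPU count are the kind of variables the slide suggests changing between runs. The topic is assumed to exist already (or to be auto-created).

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kafka-producer-perf
spec:
  backoffLimit: 0            # do not retry a failed benchmark run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: perf
          image: confluentinc/cp-kafka:5.3.1
          command:
            - kafka-producer-perf-test
            - --topic
            - perf-test
            - --num-records
            - "1000000"
            - --record-size        # bytes per record
            - "1024"
            - --throughput         # -1 means no throttling
            - "-1"
            - --producer-props
            - bootstrap.servers=kafka:9092
            - batch.size=16384
            - acks=all
          resources:
            requests:
              cpu: "1"
            limits:
              cpu: "1"
```

Re-running the same Job against brokers backed by different disk types, with one variable changed at a time, gives comparable numbers for a decision such as persistent disks versus local SSDs.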

WHY IT WORKS FOR US
(for now, at least!)

WHY IT WORKS FOR US
➤ Increased automation
➤ Simpler configuration
➤ Efficient resource usage
    ➤ Bin packing
    ➤ GKE autoscaling
➤ Improved developer workflows for streaming services
    ➤ e.g. adding new Kafka Streams applications, Kafka Connect workloads

THANK YOU!
Twitter: @NikkiThean 
Confluent Slack: @nikki 
Email: nikki.thean@gmail.com
Thank you to Kamo for drawing inspiration!
