6. Confluent Platform and Apache Kafka
Apache Kafka
• Open source
• Publish and Subscribe
• Processing
• Storage
Confluent Open Source
• Open source
• Data management
• Connectors
• Clients
Confluent Enterprise
• Administration features
• Operations features
• Monitoring features
• 30-day free evaluation
7. What is Apache Kafka?
• A distributed streaming platform
• Pub-sub paradigm, similar to a message queue
• Fault-tolerant
• Allows stream processing of events as they occur
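The pub-sub idea above can be sketched with a toy in-memory log. This is not the Kafka client API; the `ToyTopic` class and the consumer group names are hypothetical, and the sketch only shows that each subscriber group keeps its own read offset over a shared append-only log.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory, single-partition stand-in for a topic (illustration only)."""

    def __init__(self):
        self.log = []                    # append-only list of messages
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """Return unread messages for a group and advance that group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["login", "click", "purchase"]:
    topic.publish(event)

# Two independent subscriber groups each see every event.
fraud_events = topic.poll("fraud-detection")
search_events = topic.poll("search-indexing")
# A second poll by the same group returns nothing new.
fraud_again = topic.poll("fraud-detection")
print(fraud_events, search_events, fraud_again)
```

Unlike a classic queue, reading does not remove messages: the log is kept, and only the per-group offset moves.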
10. Before: Many Ad Hoc Pipelines
[Diagram: many systems wired together by separate point-to-point pipelines. Labels: Search, Security, Fraud Detection, Application; User Tracking, Operational Logs, Operational Metrics; Hadoop, Search, Monitoring, Data Warehouse; Espresso, Cassandra, Oracle.]
11. After: Central Hub with Kafka
[Diagram: the same systems connected through a central Kafka hub. Labels: Search, Security, Fraud Detection, Application; User Tracking, Operational Logs, Operational Metrics; Espresso, Cassandra, Oracle; Hadoop, Log Search, Monitoring, Data Warehouse; Kafka.]
12. What’s in a Kafka Cluster?
• A typical Kafka cluster consists of these:
– Broker
– ZooKeeper
– Kafka Connect
– Kafka Streams
• Confluent Platform add-ons:
– Schema Registry
– REST Proxy
– Others (Replicator, Auto Data Balancer, etc.)
13. Why Kafka on DC/OS?
• Ease of management
• Container support
• Stateful and stateless services
• Service discovery and routing
16. How do you manage all these?
• A cluster consists of a mix of stateful and stateless services
• Plus other external systems
• Service discovery and load balancing concerns
• DC/OS comes to the rescue
17. Where to place these services?
• Brokers should NOT be co-located with each other
• Broker nodes should have dedicated disks
• Same for ZooKeeper servers
• Brokers should not be co-located with ZooKeeper servers
• And which containerizer should I use: Mesos or Docker?
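The co-location rules above can be checked mechanically. The sketch below is a hypothetical helper, not part of DC/OS; it assumes service names start with "broker" or "zookeeper" and that the placement is given as a host-to-services mapping.

```python
def placement_violations(placement):
    """Return rule violations for a mapping of hostname -> list of service names.

    Assumption: service names start with "broker" or "zookeeper".
    """
    problems = []
    for host, services in sorted(placement.items()):
        brokers = [s for s in services if s.startswith("broker")]
        zookeepers = [s for s in services if s.startswith("zookeeper")]
        if len(brokers) > 1:
            problems.append(f"{host}: {len(brokers)} brokers share one node")
        if brokers and zookeepers:
            problems.append(f"{host}: broker co-located with ZooKeeper")
    return problems


# Hypothetical plan that breaks both rules.
plan = {
    "node-1": ["broker-0", "zookeeper-0"],
    "node-2": ["broker-1", "broker-2"],
    "node-3": ["zookeeper-1"],
}
violations = placement_violations(plan)
for line in violations:
    print(line)
```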
18. DC/OS benefits for Kafka
• Easy to configure cluster size
• Node and placement constraints
• Handling of stateful services like brokers
19. How many brokers?
• Recommended: 3 brokers to start
• At least 1-2 GB memory each
• A few CPU cores are OK (~4)
• Don’t place them together!
• In Marathon’s config.json:
"placement_constraint": "hostname:MAX_PER:1"
• Don’t change this: "PLACEMENT_STRATEGY": "NODE"
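To see what a constraint like `hostname:MAX_PER:1` implies for cluster sizing, here is a toy placement routine. It is not Marathon's scheduler; the function name and the host addresses are illustrative assumptions. It only shows that with at most one broker per host, three brokers need at least three nodes.

```python
def place_with_max_per(hosts, broker_count, max_per=1):
    """Assign brokers to hosts, never exceeding max_per brokers on one host."""
    capacity = len(hosts) * max_per
    if broker_count > capacity:
        raise ValueError(
            f"{broker_count} brokers need more than {len(hosts)} hosts "
            f"when at most {max_per} may share a host"
        )
    assignment = {}
    for broker_id in range(broker_count):
        # Round-robin over hosts; the capacity check above keeps every host
        # at or under max_per.
        assignment[f"broker-{broker_id}"] = hosts[broker_id % len(hosts)]
    return assignment


hosts = ["10.0.10.1", "10.0.10.2", "10.0.10.3"]
assignment = place_with_max_per(hosts, broker_count=3)
print(assignment)
```

Asking for a fourth broker on the same three hosts raises `ValueError`, which mirrors a task that stays unscheduled until another node is available.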
22. Kafka Storage Volumes
• Messages are flushed to disks
• HDDs are usually sufficient; Kafka I/O is mostly sequential, so SSDs add little for the cost
• RAID is better than JBOD
• MOUNT volumes better than ROOT
23. Pinning the brokers
• You could tie brokers to nodes with bigger storage by using placement constraints
• "placement_constraint": "10.0.10.1|10.0.10.2|10.0.10.3"
24. What about ZooKeeper?
• DC/OS comes with ZooKeeper; should I use it?
• Depends on how many services need ZooKeeper
26. Kafka Broker Restarts
• Upon restart, a broker has to catch up with the others (because of replication, etc.)
• No rolling restarts in the DC/OS UI
• Better to do a rolling restart via the CLI:
• dcos kafka broker restart <broker_id>
• Check broker logs!
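A rolling restart can be scripted as "restart one broker, wait until it has caught up, then move on". The sketch below is a minimal outline under stated assumptions: the command prefix is supplied by the caller because the exact CLI form depends on the installed DC/OS Kafka CLI version, and `run` and `is_caught_up` are hypothetical callables the operator provides.

```python
import time


def rolling_restart(broker_ids, command_prefix, run, is_caught_up,
                    poll_seconds=5.0, max_polls=60):
    """Restart brokers one at a time, waiting for each to catch up.

    run(argv) executes one CLI command; is_caught_up(broker_id) returns True
    once the broker is back in sync. Both are supplied by the caller.
    """
    for broker_id in broker_ids:
        run(list(command_prefix) + [str(broker_id)])
        for _ in range(max_polls):
            if is_caught_up(broker_id):
                break
            time.sleep(poll_seconds)
        else:
            # The loop ran out of polls without the broker catching up.
            raise TimeoutError(
                f"broker {broker_id} did not catch up; check its logs"
            )


# Dry run: record the commands instead of executing them.
issued = []
rolling_restart(
    broker_ids=[0, 1, 2],
    command_prefix=["dcos", "kafka", "broker", "restart"],  # assumed CLI form
    run=issued.append,
    is_caught_up=lambda broker_id: True,
    poll_seconds=0,
)
for argv in issued:
    print(" ".join(argv))
```

For a real run, `run` could be `subprocess.check_call`, and `is_caught_up` would need a real health signal, for example a check that the broker has no under-replicated partitions.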
30. Kafka Streams with states
• Each Kafka Streams application instance has its own embedded state store
• State stores have to be synced/assigned upon restarts, which can take time
• When containers restart, they may be spawned on a different agent node and start with an empty state store, hence a longer startup time
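Why an empty state store means a longer startup can be shown with a toy replay. This is not the Kafka Streams API; `restore_state`, the checkpoint argument, and the sample records are illustrative assumptions. It models a store being rebuilt by replaying a changelog of key/value updates.

```python
def restore_state(changelog, local_store=None, checkpoint=0):
    """Rebuild a key-value store by replaying changelog records.

    changelog is a list of (key, value) records; checkpoint is the number of
    records already applied to local_store. Returns (store, records_replayed).
    """
    store = dict(local_store or {})
    replayed = 0
    for key, value in changelog[checkpoint:]:
        store[key] = value
        replayed += 1
    return store, replayed


changelog = [("alice", 1), ("bob", 1), ("alice", 2), ("carol", 1), ("bob", 2)]

# Same node, local store survived: only the tail after the checkpoint is replayed.
warm_store, warm_replayed = restore_state(
    changelog, local_store={"alice": 2, "bob": 1, "carol": 1}, checkpoint=4
)

# New node, empty store: the whole changelog is replayed.
cold_store, cold_replayed = restore_state(changelog)
print(warm_replayed, cold_replayed)
```

Both paths end with the same contents, but the cold start replays every record, so restore time grows with the size of the changelog.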
31. Running other Kafka Components
• These components are stateless and lighter on resources
• Example: Kafka Connect workers
32. Failures
• DC/OS will restart services when they fail
• But admins should look into why
• Use alerting tools, e.g. Nagios
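An alerting check can be as small as a function that maps cluster state to the Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The sketch below is a hypothetical check; the thresholds are assumptions, and how the number of running brokers is obtained is left to the caller.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_brokers(expected, alive):
    """Map a broker count to a Nagios-style (exit_code, message) pair."""
    if alive >= expected:
        return OK, f"OK - {alive}/{expected} brokers running"
    if alive > 0:
        return WARNING, f"WARNING - only {alive}/{expected} brokers running"
    return CRITICAL, f"CRITICAL - 0/{expected} brokers running"


code, message = check_brokers(expected=3, alive=2)
print(code, message)
```

A real plugin would print the message and call `sys.exit(code)` so that Nagios can pick up the status.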