2. 2
Apache Kafka committer and PMC member. A frequent
speaker on both Hadoop and Cassandra, Joe is the Co-
Founder and CTO of Elodina Inc. Joe has been a distributed
systems developer and architect for over {years}, building
backend systems that have supported over one hundred
million unique devices a day and processed trillions of events. He
blogs and hosts a podcast about Hadoop and related systems
at All Things Hadoop.
@allthingshadoop
$(whoami)
3. 3
● Introduction to Apache Kafka
● Brokers “as a Service”
● Producers & Consumers “as a Service”
● More Use Cases for Kafka
Overview
5. 5
Apache Kafka was first open sourced by LinkedIn in 2011
Papers
● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://kafka.apache.org/
Apache Kafka
24. Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
● Go (aka golang) - Pure Go implementation with full protocol support.
Consumer and Producer implementations included, GZIP and Snappy
compression supported.
● Python - Pure Python implementation with full protocol support. Consumer
and Producer implementations included, GZIP and Snappy compression
supported.
● C - High performance C library with full protocol support
● Ruby - Pure Ruby, Consumer and Producer implementations included,
GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.x).
● Clojure - Clojure DSL for the Kafka API
● JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
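As a taste of what these clients do under the hood, here is a toy sketch of key-based partition selection, the mechanism that gives Kafka per-key ordering. It uses CRC32 only to stay dependency-free; the official Java client actually hashes keys with murmur2, so this is an illustration of the idea, not the real algorithm:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Toy key-based partitioner: hash the key and take it modulo the
    partition count. (The official Java client uses murmur2; CRC32 is
    used here only to keep the sketch dependency-free.)"""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land on the same partition,
# which is what gives Kafka per-key ordering.
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Messages without a key are instead spread across partitions (round-robin or sticky, depending on the client).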
27. 27
CURRENT STATE OF IMPLEMENTATION
11 STEPS BEFORE ANY BUSINESS VALUE IS CREATED
1 SET UP instances → AWS / GCE / etc.
2 Repeat above by # of instances
3 SET UP uniformly, harden, and secure every machine
4 DOWNLOAD Apache Kafka
5 LEARN to install and run on multiple nodes / high availability
6 LEARN to run on multiple data centers / multiple racks
7 CONFIGURE nodes and topics specifically by cluster
8 MONITOR performance, isolate bottlenecks
9 OPTIMIZE the system / team to hands-off through the next objective
10 MONITOR for failure and build a disaster recovery protocol
11 FAILURE RECOVERY: investigation, recovery, and spin-back-up time
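Steps 4 and 5 correspond roughly to the stock Apache Kafka quickstart. A minimal single-node sketch (the release version and download URL shown are illustrative; substitute a current one):

```shell
# Step 4: download and unpack a Kafka release (version is illustrative)
curl -O https://archive.apache.org/dist/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz
tar -xzf kafka_2.10-0.8.2.1.tgz && cd kafka_2.10-0.8.2.1

# Step 5: start ZooKeeper, then a single broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &

# Smoke-test: create a topic and list it
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --zookeeper localhost:2181
```

Everything after this point (multi-node, multi-DC, monitoring, recovery) is where the real operational cost lives.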
AND the process must repeat by # of instances and technologies
28. 28
ELODINA AUTOMATES DEPLOYMENT, SCALING AND MAINTENANCE
Reduce steps and learning curve to a THREE-stage repeatable process
Platform modules allow for
deployment in minutes
DEPLOY
The grid scales automatically with low
latency based on real-time traffic
patterns.
SCALE
Single destination to observe and
troubleshoot from CLI, REST API, or
GUI
OBSERVE
29. 29
BUILT- IN FRAMEWORKS DIRECTLY IN PLATFORM
Leading technologies deployable across any compute resource
Technologies / Resources
30. 30
IMMEDIATE OPERATIONAL BENEFITS
Removing Fragmentation with Interoperability
Cut through a crowded market: no more deciding which software, or stack of
software, to choose and how to make it interoperate with your data center
Immediate Efficiency & Reliability
Operational resources deployed across multiple data centers and multiple regions,
streamlined with dynamic compute and automated scheduling capabilities
Automated Speed and Recovery
Reduce costs and time to market across the development cycle, and automate
recovery from failure
39. Scheduler
● Provides the operational automation for a Kafka Cluster.
● Manages the changes to the broker's configuration.
● Exposes a REST API for the CLI to use or any other client.
● Runs on Marathon for high availability.
● Broker Failure Management “stickiness”
Executor
● The executor interacts with the Kafka broker as an
intermediary for the scheduler
Scheduler & Executor
40. Typical Operations
● Run the scheduler with Docker
● Run the scheduler on Marathon
● Changing the location where data is stored
● Starting 3 brokers
● View broker log
● High Availability Scheduler State
● Failed Broker Recovery
● Passing multiple options
● Broker metrics
● Rolling restart
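Several of these operations map onto the scheduler's CLI. A sketch, assuming the mesos/kafka `kafka-mesos.sh` wrapper of the era; exact flag names and values are illustrative and may differ by version:

```shell
# Add and start three brokers (ids 0..2) with per-broker resources
./kafka-mesos.sh broker add 0..2 --cpus 1 --mem 2048
./kafka-mesos.sh broker start 0..2

# Inspect cluster state (the CLI talks to the scheduler's REST API)
./kafka-mesos.sh broker list

# Rolling restart with a per-broker timeout
./kafka-mesos.sh broker restart 0..2 --timeout 2m
```

The same operations are available to any HTTP client, since the CLI is just a thin wrapper over the scheduler's REST API.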
49. Topology Master
The Topology Master (TM) manages a topology
throughout its entire lifecycle, from the time it’s
submitted until it’s ultimately killed. When Heron
deploys a topology it starts a single TM and
multiple containers. The TM creates an
ephemeral ZooKeeper node to ensure that
there’s only one TM for the topology and that
the TM is easily discoverable by any process in
the topology. The TM also constructs the
physical plan for a topology which it relays to
different components.
Container
Each Heron topology consists of multiple containers, each of which
houses multiple Heron Instances, a Stream Manager, and a Metrics Manager.
Containers communicate with the topology’s TM to ensure that the topology forms
a fully connected graph. For an illustration, see the figure in the Topology Master
section above.
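The "ephemeral node as a lock" pattern the TM relies on can be sketched with an in-memory stand-in for ZooKeeper (a real deployment would use a ZooKeeper client such as kazoo; the znode path below is hypothetical):

```python
class FakeZooKeeper:
    """Minimal in-memory stand-in for ZooKeeper ephemeral nodes,
    just enough to illustrate the single-TM guarantee."""
    def __init__(self):
        self._nodes = {}

    def create_ephemeral(self, path: str, owner: str) -> bool:
        # ZooKeeper's create() fails with NodeExistsError if the path
        # is already taken; we model that by returning False.
        if path in self._nodes:
            return False
        self._nodes[path] = owner
        return True

    def session_expired(self, owner: str):
        # Ephemeral nodes vanish when their session dies, freeing
        # the lock for a replacement Topology Master.
        self._nodes = {p: o for p, o in self._nodes.items() if o != owner}

zk = FakeZooKeeper()
path = "/heron/topology/word-count/tm"     # hypothetical znode path
assert zk.create_ephemeral(path, "tm-1")   # first TM wins the node
assert not zk.create_ephemeral(path, "tm-2")  # a second TM is rejected
zk.session_expired("tm-1")                 # tm-1 crashes / loses session
assert zk.create_ephemeral(path, "tm-2")   # replacement takes over
```

The ephemeral node also doubles as a discovery point: any process in the topology can read it to find the current TM.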
50. Stream Manager
The Stream Manager (SM) manages the
routing of tuples between topology
components. Each Heron Instance in a
topology connects to its local SM, while all of
the SMs in a given topology connect to one
another to form a network. Below is a visual
illustration of a network of SMs:
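Since every SM connects to every other SM, the network is a complete graph: n containers yield n(n-1)/2 links, so the SM mesh grows quadratically with container count. A quick sketch:

```python
from itertools import combinations

def sm_links(num_containers: int):
    """Every Stream Manager connects to every other one, so the
    SM network is a complete graph over the containers."""
    sms = [f"sm-{i}" for i in range(num_containers)]
    return list(combinations(sms, 2))

# 4 containers -> 4 * 3 / 2 = 6 pairwise SM connections
assert len(sm_links(4)) == 6
```

This is why Heron keeps one SM per container rather than one per Heron Instance: the all-to-all mesh stays small even when instance counts are large.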
51. Heron Instance
A Heron Instance (HI) is a process that handles a single task of a spout or bolt, which allows for easy
debugging and profiling.
Currently, Heron only supports Java, so all HIs are JVM processes, but this will change in the future.
Heron Instance Configuration
HIs have a variety of configurable parameters that you can adjust at each phase of a topology’s lifecycle.
54. Metrics Manager
Each topology runs a Metrics Manager (MM) that collects and exports metrics from all components in a
container. It then routes those metrics to both the Topology Master and to external collectors, such as
Scribe, Graphite, or analogous systems.
You can adapt Heron to support additional systems by implementing your own custom metrics sink.
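Heron's actual sink interface is Java (`IMetricsSink`, with init / process-record / flush / close hooks); the Python class below only mirrors that shape for illustration and is not Heron's real API:

```python
class ConsoleMetricsSink:
    """Sketch of a custom metrics sink: buffer records as they
    arrive, then emit them on flush. Mirrors the shape of Heron's
    Java IMetricsSink for illustration only."""
    def __init__(self):
        self.buffer = []

    def init(self, conf: dict):
        # Sinks receive their configuration at startup.
        self.prefix = conf.get("prefix", "heron")

    def process_record(self, name: str, value: float):
        # Called for each metric routed to this sink.
        self.buffer.append((f"{self.prefix}.{name}", value))

    def flush(self):
        # Periodically invoked; a real sink would ship to Scribe,
        # Graphite, etc. instead of printing.
        for name, value in self.buffer:
            print(f"{name} = {value}")
        self.buffer.clear()

sink = ConsoleMetricsSink()
sink.init({"prefix": "wordcount"})
sink.process_record("tuples_emitted", 1024)
sink.flush()
```

A Graphite or Scribe sink would differ only in where `flush` sends the buffered records.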
55. Cluster-level Components
Heron CLI
Heron has a CLI tool called heron that is used to manage topologies. Documentation can be found in Managing Topologies.
Heron Tracker
The Heron Tracker (or just Tracker) is a centralized gateway for cluster-wide information about topologies, including which topologies are running,
being launched, being killed, etc. It relies on the same ZooKeeper nodes as the topologies in the cluster and exposes that information through a
JSON REST API. The Tracker can be run within your Heron cluster (on the same set of machines managed by your Heron scheduler) or outside
of it.
Instructions on running the tracker including JSON API docs can be found in Heron Tracker.
Heron UI
Heron UI is a rich visual interface that you can use to interact with topologies. Through Heron UI you can see color-coded visual representations of
the logical and physical plan of each topology in your cluster.
For more information, see the Heron UI document.