A modern business operates 24/7 and generates data continuously. Shouldn’t we process it continuously too?
A rich ecosystem of real-time data-processing frameworks, tools, and systems has been forming around Apache Kafka that allows data to be processed continuously as it occurs. This talk will introduce Kafka and explain why it has become the de facto standard for streaming data. It draws on practical experience building stream-processing applications to discuss the differences between stream-processing architectures and the challenges each presents. It outlines the Streams API in Kafka and explains how it helps tame some of the complexity in real-time architectures.
TODO: fix title
Introduce self
What is Stream Processing
Brief intro to Kafka
Kafka Streams
Exciting! Important!
Doesn’t mean you drop everything on the floor if anything slows down
Streaming algorithms: online, bounded-space computation
Can compute a median (approximately) on the fly
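A rough sketch of the idea (my example, not from the talk): a median can be estimated in fixed space by keeping a uniform random sample of the stream (reservoir sampling).

```java
import java.util.Arrays;
import java.util.Random;

// Estimates the median of an unbounded stream in O(k) space by
// maintaining a uniform random sample (reservoir sampling).
public class StreamingMedian {
    private final long[] reservoir;
    private final Random rng = new Random();
    private long seen = 0;

    public StreamingMedian(int capacity) {
        this.reservoir = new long[capacity];
    }

    public void observe(long value) {
        seen++;
        if (seen <= reservoir.length) {
            reservoir[(int) (seen - 1)] = value;       // fill phase
        } else {
            long j = (long) (rng.nextDouble() * seen); // keep with probability k/seen
            if (j < reservoir.length) {
                reservoir[(int) j] = value;
            }
        }
    }

    public long estimateMedian() {
        int n = (int) Math.min(seen, reservoir.length);
        if (n == 0) throw new IllegalStateException("no data seen yet");
        long[] sample = Arrays.copyOf(reservoir, n);
        Arrays.sort(sample);
        return sample[n / 2]; // median of the sample approximates the stream median
    }
}
```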
About how inputs are translated into outputs (very fundamental)
HTTP/REST
All databases
Run all the time
Each request totally independent; no real ordering
Can fail individual requests if you want
Very simple!
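A toy request/response service using the JDK's built-in HttpServer (endpoint and payload are invented) to make the shape concrete: one request in, one response out, nothing shared between calls.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Each request is handled in isolation: no ordering between requests,
// and a failure affects only the one caller.
public class RequestResponseShape {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/score", exchange -> {
            byte[] body = "42".getBytes();              // one input -> one output
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();                                 // runs all the time
    }
}
```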
About the future!
“Ed, the MapReduce job never finishes if you watch it like that”
Job kicks off at a certain time
Cron!
Processes all the input, produces all the output
Data is usually static
Hadoop!
DWH, JCL
Archaic but powerful. Can do analytics! Complex algorithms!
Also can be really efficient!
Inherently high latency
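The batch shape in miniature (file paths and the cron schedule are made up): read all the input, write all the output, exit, and wait for the next scheduled run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Typically launched on a schedule, e.g. from cron: 0 2 * * *
// Latency is inherently at least the scheduling interval.
public class DailySalesTotal {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("/data/sales-2016-04-25.csv"));
        long total = lines.stream()
                .skip(1)                                 // skip header row
                .mapToLong(l -> Long.parseLong(l.split(",")[1]))
                .sum();
        Files.writeString(Path.of("/data/daily-total.txt"), Long.toString(total));
    }   // processes all the input, produces all the output, then exits
}
```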
Generalizes request/response and batch.
Program takes some inputs and produces some outputs
Could be all inputs
Could be one at a time
Runs continuously forever!
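A minimal consumer loop showing that shape (topic name and config values are placeholders; assumes a recent Kafka client with poll(Duration)):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// The stream-processing shape: handle inputs as they arrive, forever.
public class StreamShape {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sales"));
            while (true) {                               // runs continuously
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.key() + " -> " + r.value()); // one at a time
                }
            }
        }
    }
}
```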
Companies == streams
What does a retail store do?
Streams
Retail
- Sales
- Shipments and logistics
- Pricing
- Re-ordering
- Analytics
- Fraud and theft
Quick run-through of the features in Kafka.
Logs
Distributed
Fault-tolerant
Change to “Logs unify batch and stream processing”
Can’t just scale storage, need to scale processing
Important: order
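A keyed-producer sketch (topic, keys, and values are invented): records with the same key go to the same partition, and within a partition the log preserves order.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// The partitioned, replicated log: keys pin records to a partition,
// which is what gives you ordering; replication gives fault tolerance.
public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // both events for store-17 land in the same partition, in this order
            producer.send(new ProducerRecord<>("sales", "store-17", "sale:9.99"));
            producer.send(new ProducerRecord<>("sales", "store-17", "refund:9.99"));
        }
    }
}
```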
Streaming platform is the successor to messaging
Stream processing is how you build asynchronous services.
That is going to be the key to solving my pipeline sprawl problem.
Instead of having N^2 different pipelines, one for each pair of systems, I am going to have a central place that hosts all these event streams: the streaming platform. (With ten systems that is on the order of a hundred point-to-point pipelines, versus ten connections to the hub.)
This is a central way that all these systems and applications can plug in to get the streams they need.
So I can capture streams from databases, and feed them into DWH, Hadoop, monitoring and analytics systems.
The key advantage is that there is a single integration point for each system that wants data.
Now obviously to make this work I’m going to need to ensure I have met the reliability, scalability, and latency guarantees for each of these systems.
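A sketch of why the hub scales (topic and group names are invented): every downstream system is just another consumer group on the same stream, so adding one never touches the producers.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each consumer group keeps its own read position, so the DWH loader,
// Hadoop, and monitoring can all read the full stream independently.
public class HubPattern {
    static KafkaConsumer<String, String> consumerFor(String system) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", system);                  // independent read position
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("user-activity"));   // same stream for everyone
        return consumer;
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> dwh = consumerFor("dwh-loader");
        KafkaConsumer<String, String> monitoring = consumerFor("monitoring");
        // each group sees the complete stream, at its own pace
    }
}
```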
Current state
Add screenshot example
TODO: Summarize
Change to “Logs make reprocessing easy”
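For instance (topic and partition here are placeholders), reprocessing is just rewinding the read position, because the log retains history:

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Reprocessing with a log: point the consumer back at offset 0
// and the same code recomputes everything from history.
public class Reprocess {
    static void rewind(KafkaConsumer<String, String> consumer) {
        consumer.assign(List.of(new TopicPartition("sales", 0)));
        consumer.seekToBeginning(consumer.assignment()); // replay from the start
    }
}
```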
Time is hard
Need a model of time
Request/response ignores the issue: you just set an aggressive timeout
Batch usually solves the issue by just freezing all the data for the day
Stream processing needs to actually address the issue
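In Kafka Streams the model of time is pluggable. A hedged sketch of a custom TimestampExtractor (the Order payload is hypothetical; the two-argument extract signature and the default.timestamp.extractor config are from recent Streams versions; modern Java assumed):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Event time vs. arrival time: order records by when the event
// actually happened, not when the record reached Kafka.
public class OrderTimestampExtractor implements TimestampExtractor {
    record Order(long eventTimeMs) {}                   // stand-in payload type

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof Order order) {
            return order.eventTimeMs();                 // embedded event time
        }
        return record.timestamp();                      // fall back to record time
    }
}
```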
Kafka Streams:
Manage the set of live processors and route data to them
Uses Kafka’s group management facility
External framework
Start and restart processes
Package processes
Deploy code
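A minimal Streams app to tie that together (topic names and the filter are invented): the application id acts as the group, and Kafka's group management spreads partitions across however many instances of this program you choose to package and deploy.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// A Streams app is just a normal program: you deploy and restart it
// with whatever tooling you like; Kafka routes partitions to the
// live instances for you.
public class SalesFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-filter"); // the "group"
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sales = builder.stream("sales");
        sales.filter((store, event) -> event.startsWith("sale:"))
             .to("valid-sales");

        new KafkaStreams(builder.build(), props).start(); // runs until stopped
    }
}
```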
Companies == streams
What does a retail store do?
Streams
Retail
- Sales
- Shipments and logistics
- Pricing
- Re-ordering
- Analytics
- Fraud and theft
But…no notion of time
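Windowing is how time gets added back in. A sketch (window size and topic are mine; ofSizeWithNoGrace is the recent-Streams spelling): count sales per store in five-minute buckets.

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

// Windowed aggregation: results are keyed by (store, window), and the
// counts keep updating as records for that window arrive.
public class WindowedSales {
    static void build(StreamsBuilder builder) {
        KStream<String, String> sales = builder.stream("sales");
        KTable<Windowed<String>, Long> counts = sales
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count();
    }
}
```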
Also:
Other talks
Kafka Summit
Streaming data hackathon
Stop by the Confluent booth and ask your questions about Kafka or stream processing
Get a Kafka t-shirt and sticker.
We’re also giving away a few books: the early release of Kafka: The Definitive Guide, Making Sense of Stream Processing, and I Heart Logs
Meet the authors and get your book signed.
We also want to invite you to participate in the Stream Data Hackathon in San Francisco on the evening of April 25, the day before Kafka Summit
You might be interested in some of the other Confluent talks. If you miss one, you'll have access to the video recordings.