This document covers core stream processing concepts (events, event time vs. processing time, sources and sinks) and the main stream operations: aggregation, windowing (tumbling, hopping, session, sliding) and joining. It includes Kafka Streams examples for counting, windowed counting and joining streams.
12. Batch Processing
Batch processing is the processing of blocks of (bounded) data that have already
been stored over a period of time.
Regards to Nir :)
13. Stream Processing
What is a Stream? An unbounded (infinite) data flow.
“a type of data processing that is designed with infinite data sets in mind”
Tyler Akidau, Google
So what is Stream Processing?
16. Let’s cover a few terms we need before diving into stream use cases and implementations
● Event -
from the Oxford dictionary: “A thing that happens or takes place”
in computer systems: “an action or occurrence recognized by software, often originating
asynchronously from the external environment, that may be handled by the software”.
Usually, an event carries additional data describing its state at the time it occurred
● Event Time - the time at which the event actually occurred
● Processing Time - the time at which the event is observed in the system
● Wall clock time - the time that a clock on the wall (or a stopwatch in hand) would
measure as having elapsed between the start of the process and 'now'
● Upstream - the stream processor the current stream comes from
● Downstream - the stream processor the current stream goes to
● Source - the structure (data source) to get data from (in Kafka Streams, a topic)
● Sink - the destination structure to send data to (in Kafka Streams, a topic)
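For example, a minimal Kafka Streams topology wires a source topic straight to a sink topic. A sketch in the snippet style of the following slides (the topic names "events-in" and "events-out" are hypothetical):
// read every record from the source topic and forward it, unchanged, to the sink topic
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("events-in", Consumed.with(Serdes.String(), Serdes.String()))
.to("events-out", Produced.with(Serdes.String(), Serdes.String()));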
17. So what operations could be done with a stream? Aggregation
count
sum, min, max, ...
reduce
aggregate (agg)
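For example, sum can be expressed as a reduce over a grouped stream. A sketch (the topic names "numbers" and "sums" are hypothetical; the same-era Serialized API is assumed):
// keep a running total per key by adding every incoming value to the aggregate
builder.stream("numbers", Consumed.with(Serdes.String(), Serdes.Integer()))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.reduce((total, value) -> total + value)
.toStream()
.to("sums", Produced.with(Serdes.String(), Serdes.Integer()));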
18. So what operations could be done with a stream? Aggregation
// count the number of events per key
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic-from", Consumed.with(Serdes.String(), Serdes.Integer()))
.peek((k, v) -> log.info("get event {} with key {}", v, k))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.count()
.toStream()
.peek((k, v) -> log.info("produce value {} with key {}", v, k))
.to("topic-to", Produced.with(Serdes.String(), Serdes.Long()));
// the same pipeline, but this time produce only keys seen more than once
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic-from", Consumed.with(Serdes.String(), Serdes.Integer()))
.peek((k, v) -> log.info("get event {} with key {}", v, k))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.count()
.filter((k, v) -> v > 1)
.toStream()
.peek((k, v) -> log.info("produce value {} for key {}", v, k))
.to("topic-to", Produced.with(Serdes.String(), Serdes.Long()));
Input: s1->1, s2->2, s1->1, s3->3, s1->1
Output (first example, no filter):
2019-06-01 12:10:35.499 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.510 INFO 19671 ..... : produce value 1 with key s1
2019-06-01 12:10:35.525 INFO 19671 ..... : get event 2 with key s2
2019-06-01 12:10:35.526 INFO 19671 ..... : produce value 1 with key s2
2019-06-01 12:10:35.527 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.527 INFO 19671 ..... : produce value 2 with key s1
2019-06-01 12:10:35.527 INFO 19671 ..... : get event 3 with key s3
2019-06-01 12:10:35.528 INFO 19671 ..... : produce value 1 with key s3
2019-06-01 12:10:35.528 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.528 INFO 19671 ..... : produce value 3 with key s1
Output (second example, with the filter v > 1):
2019-06-01 12:10:35.499 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.525 INFO 19671 ..... : get event 2 with key s2
2019-06-01 12:10:35.527 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.527 INFO 19671 ..... : produce value 2 with key s1
2019-06-01 12:10:35.527 INFO 19671 ..... : get event 3 with key s3
2019-06-01 12:10:35.528 INFO 19671 ..... : get event 1 with key s1
2019-06-01 12:10:35.528 INFO 19671 ..... : produce value 3 with key s1
The changelog topic automatically created by Kafka Streams for the aggregation state store:
counter-stream-KSTREAM-AGGREGATE-STATE-STORE-0000000002-changelog
19. So what operations could be done with a stream? Windowing
Static (Tumbling) Window - a fixed-length window that repeats at a non-overlapping interval.
Every record appears in exactly one window (only once)
20. So what operations could be done with a stream? Windowing
final StreamsBuilder builder = new StreamsBuilder();
builder
.stream("counter-topic", Consumed.with(Serdes.String(), Serdes.Integer()))
.peek((key, value) -> log.info("received {}", key))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(5)))
.count()
// key is now Windowed<String>, value is Long
.toStream((key, value) -> {
log.info("{} - {}",
new Date(key.window().start()), new Date(key.window().end()));
return key.key();
})
.filter((k, v) -> v > 1)
.peek((k, v) -> log.info("produce value {} for key {}", v, k))
.to("counter-topic-to", Produced.with(Serdes.String(), Serdes.Long()));
Output:
..... ..... : Sat Jun 01 19:01:40 2019 - Sat Jun 01 19:01:45 2019
..... ..... : received s1
..... ..... : received s2
..... ..... : received s1
..... ..... : produce value 2 for key s1
..... ..... : received s3
..... ..... : received s1
..... ..... : produce value 3 for key s1
..... ..... : Sat Jun 01 19:01:50 2019 - Sat Jun 01 19:01:55 2019
..... ..... : received s1
..... ..... : received s2
..... ..... : received s1
..... ..... : produce value 2 for key s1
..... ..... : received s3
..... ..... : received s1
..... ..... : produce value 3 for key s1
..... ..... : Sat Jun 01 19:01:55 2019 - Sat Jun 01 19:02:00 2019
..... ..... : received s1
..... ..... : received s2
..... ..... : received s1
..... ..... : produce value 2 for key s1
..... ..... : received s3
..... ..... : received s1
..... ..... : produce value 3 for key s1
21. So what operations could be done with a stream? Windowing
Hopping Window - similar to a tumbling window, but hopping windows generally have an
overlapping interval: one record can appear in more than one window
...
.windowedBy(
TimeWindows.of(TimeUnit.SECONDS.toMillis(5))
.advanceBy(TimeUnit.SECONDS.toMillis(1))
)
...
final StreamsBuilder builder = new StreamsBuilder();
builder
.stream("counter-topic", Consumed.with(Serdes.String(), Serdes.Integer()))
.peek((key, value) -> log.info("received {}", key))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.windowedBy(
TimeWindows.of(TimeUnit.SECONDS.toMillis(3))
.advanceBy(TimeUnit.SECONDS.toMillis(1))
)
.count()
// key is now Windowed<String>, value is Long
.toStream((key, value) -> {
log.info("{} - {}",
new Date(key.window().start()), new Date(key.window().end()));
return key.key();
})
.filter((k, v) -> v > 1)
.peek((k, v) -> log.info("produce value {} for key {}", v, k))
.to("counter-topic-to", Produced.with(Serdes.String(), Serdes.Long()));
22. So what operations could be done with a stream? Windowing
Session Window - a special type of window that captures a period of activity in the data,
terminated by a gap of inactivity.
...
.windowedBy(SessionWindows.with(TimeUnit.SECONDS.toMillis(5)))
...
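A fuller sketch (hypothetical topic names), counting events per key within a session that closes after 5 seconds of inactivity:
// count events per key per session; a new session starts after a 5-second gap of inactivity
builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.Integer()))
.groupByKey(Serialized.with(Serdes.String(), Serdes.Integer()))
.windowedBy(SessionWindows.with(TimeUnit.SECONDS.toMillis(5)))
.count()
// drop the window part of the key before producing
.toStream((windowedKey, count) -> windowedKey.key())
.to("clicks-per-session", Produced.with(Serdes.String(), Serdes.Long()));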
23. So what operations could be done with a stream? Windowing - Join
A sliding window, as opposed to a tumbling window, slides over the stream of data, so
sliding windows can overlap.
24. So what operations could be done with a stream? KTable
KTable - a table representation of a stream (backed locally by the RocksDB key-value store)
How to create a KTable: any aggregate or reduce operation returns a KTable:
stream.groupByKey().reduce((aggValue, newValue) -> newValue, Materialized.with(Serdes.String(), new
JSONSerde<>(MyObject.class)));
Or simply create one from the stream builder:
builder.table("my-topic", Consumed.with(Serdes.String(), new JSONSerde<>(MyObject.class)));
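A common use of a KTable is enriching a stream by key. A hedged sketch (the topic names and the String event type are illustrative; JSONSerde and MyObject as above):
// join each event with the latest state stored for its key in the KTable
KTable<String, MyObject> table =
builder.table("state-topic", Consumed.with(Serdes.String(), new JSONSerde<>(MyObject.class)));
builder.stream("events-topic", Consumed.with(Serdes.String(), Serdes.String()))
.join(table, (event, state) -> event + " / " + state,
Joined.with(Serdes.String(), Serdes.String(), new JSONSerde<>(MyObject.class)))
.to("enriched-events", Produced.with(Serdes.String(), Serdes.String()));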
25. So what operations could be done with a stream? Join
Let’s look at a use case: we send messages and, shortly after sending, receive their statuses asynchronously.
First we need to define streams to receive data from the "sent-messages" and "status-messages" topics:
KStream<String, SentMessage> sentMessageStreamBuilder = builder.stream("sent-messages",
Consumed.with(Serdes.String(), new JSONSerde<>(SentMessage.class))
);
KStream<String, MessageStatus> statusMessageStreamBuilder = builder.stream("status-messages",
Consumed.with(Serdes.String(), new JSONSerde<>(MessageStatus.class))
);
We want to define (a business decision) how long we should wait for DRs (delivery reports): 1 hour, 6 hours, ...
KStream<String, MessageStatus> joinedStream = sentMessageStreamBuilder.join(
statusMessageStreamBuilder,
(sent, dr) -> dr.toBuilder().id(sent.getId()).build(),
JoinWindows.of(TimeUnit.SECONDS.toMillis(60L)),
Joined.with(Serdes.String(), new JSONSerde<>(SentMessage.class), new JSONSerde<>(MessageStatus.class))
);
26. So what operations could be done with a stream? Join
So we received the following messages:
message SentMessage(id=1d51a90a-fb90-4988-9636-141d43ba5865, providerId=119,
extMessageId=cc3e48f6-e641-43d6-a30e-2bbd1a33bc02, from=972544406, to=972544306, status=SENT,
statusTime=1559893509893, sentTime=1559893509888, order=6)
message SentMessage(id=4922c6dc-c3f3-44ee-b0c9-e66fa71e39e6, providerId=31,
extMessageId=16f409b4-7dac-4625-9730-80a3523a5962, from=972544403, to=972544303, status=SENT,
statusTime=1559893509893, sentTime=1559893509888, order=3)
message SentMessage(id=aa501317-a1c1-43e5-92c4-c0549b9a30df, providerId=63,
extMessageId=17f54e6f-df15-45ee-859f-48029a3d81d5, from=972544407, to=972544307, status=SENT, statusTime=1559893509893,
sentTime=1559893509888, order=7)
message SentMessage(id=9aed49ff-ceb4-4f80-a3f9-95a6e35400fb, providerId=7,
extMessageId=721cae9a-b102-4a05-bf32-035e10ce098f, from=972544402, to=972544302, status=SENT,
statusTime=1559893509893, sentTime=1559893509888, order=2)
message SentMessage(id=175f47bc-9fd5-49a5-bfa6-52c66253e3d0, providerId=44,
extMessageId=d2a8e6d2-e0db-44b2-b5b8-08ab3f235010, from=972544405, to=972544305, status=SENT,
statusTime=1559893509893, sentTime=1559893509888, order=5)
message SentMessage(id=ed78978b-f719-4699-9d70-5673f57ba59d, providerId=45,
extMessageId=1dfbbfeb-baa7-445a-8c5d-f950bd051c95, from=972544401, to=972544301, status=SENT, statusTime=1559893509893,
sentTime=1559893509888, order=1)
message SentMessage(id=8653b530-ac66-46a4-aaf4-fe3140addcd2, providerId=113,
extMessageId=592f22ed-eb14-4015-be08-8614a44768e8, from=972544409, to=972544309, status=SENT,
statusTime=1559893509893, sentTime=1559893509888, order=9)
27. So what operations could be done with a stream? Join
We received the following statuses:
status MessageStatus(id=4922c6dc-c3f3-44ee-b0c9-e66fa71e39e6, providerId=31, from=972544403,
to=972544303, extMessageId=16f409b4-7dac-4625-9730-80a3523a5962, status=DELIVERED,
statusTime=1559893509908)
status MessageStatus(id=aa501317-a1c1-43e5-92c4-c0549b9a30df, providerId=63, from=972544407,
to=972544307, extMessageId=17f54e6f-df15-45ee-859f-48029a3d81d5, status=DELIVERED,
statusTime=1559893509908)
status MessageStatus(id=9aed49ff-ceb4-4f80-a3f9-95a6e35400fb, providerId=7, from=972544402, to=972544302,
extMessageId=721cae9a-b102-4a05-bf32-035e10ce098f, status=DELIVERED, statusTime=1559893509908)
status MessageStatus(id=175f47bc-9fd5-49a5-bfa6-52c66253e3d0, providerId=44, from=972544405,
to=972544305, extMessageId=d2a8e6d2-e0db-44b2-b5b8-08ab3f235010, status=DELIVERED,
statusTime=1559893509908)
status MessageStatus(id=ed78978b-f719-4699-9d70-5673f57ba59d, providerId=45, from=972544401,
to=972544301, extMessageId=1dfbbfeb-baa7-445a-8c5d-f950bd051c95, status=DELIVERED,
statusTime=1559893509908)
status MessageStatus(id=8653b530-ac66-46a4-aaf4-fe3140addcd2, providerId=113, from=972544409,
to=972544309, extMessageId=592f22ed-eb14-4015-be08-8614a44768e8, status=DELIVERED,
statusTime=1559893509908)
28. So what operations could be done with a stream? Join
And the result of the join operation:
final status for MessageStatus(id=4922c6dc-c3f3-44ee-b0c9-e66fa71e39e6, providerId=31, from=972544403,
to=972544303, extMessageId=16f409b4-7dac-4625-9730-80a3523a5962, status=DELIVERED,
statusTime=1559893509908)
final status for MessageStatus(id=aa501317-a1c1-43e5-92c4-c0549b9a30df, providerId=63, from=972544407,
to=972544307, extMessageId=17f54e6f-df15-45ee-859f-48029a3d81d5, status=DELIVERED,
statusTime=1559893509908)
final status for MessageStatus(id=9aed49ff-ceb4-4f80-a3f9-95a6e35400fb, providerId=7, from=972544402,
to=972544302, extMessageId=721cae9a-b102-4a05-bf32-035e10ce098f, status=DELIVERED,
statusTime=1559893509908)
final status for MessageStatus(id=175f47bc-9fd5-49a5-bfa6-52c66253e3d0, providerId=44, from=972544405,
to=972544305, extMessageId=d2a8e6d2-e0db-44b2-b5b8-08ab3f235010, status=DELIVERED,
statusTime=1559893509908)
final status for MessageStatus(id=ed78978b-f719-4699-9d70-5673f57ba59d, providerId=45, from=972544401,
to=972544301, extMessageId=1dfbbfeb-baa7-445a-8c5d-f950bd051c95, status=DELIVERED,
statusTime=1559893509908)
final status for MessageStatus(id=8653b530-ac66-46a4-aaf4-fe3140addcd2, providerId=113, from=972544409,
to=972544309, extMessageId=592f22ed-eb14-4015-be08-8614a44768e8, status=DELIVERED,
statusTime=1559893509908)
29. Streaming 101: The world beyond batch
Streaming 102: The world beyond batch
Kafka Streams’ Take on Watermarks and Triggers
Introducing Kafka Streams: Stream Processing Made Simple
Big Data Battle : Batch Processing vs Stream Processing
Taming IoT Data: Making Sense of Sensors with SQL Streaming by Hans-Peter Grahsl
Developing Event-Driven Microservices with Event Sourcing and CQRS
Data Stream Processing Concepts and Implementations by Matthias Niehoff
Applying Reactive Programming with Rx
A pattern language for microservices
Kafka Streams - Not Looking at Facebook
Leveraging the Power of a Database Unbundled
Enabling Exactly Once in Kafka Streams
The Event Streaming Platform Explained (For Technical Leaders and Executives)
Sliding Vs Tumbling Windows
Kafka Streams: Streams DSL
Window Functions in Stream Analytics
Streams vs Serverless: Friend or Foe? by Ben Stopford
Introducing Stream Windows in Apache Flink
Flink Streaming - Tumbling and Sliding Windows
Preview of Kafka Streams
Kafka Streams – A First Impression
Kafka Stream Playground Github
Editor’s notes
There are a lot of different solutions for stream processing (mostly from the Apache Foundation)
streaming and batching
evolving architecture
For example, processing all the transactions that have been performed by a major financial firm in a week. This data contains millions of records per day that can be stored as a file, records, etc. The file undergoes processing at the end of the day for the various analyses the firm wants to do, and obviously processing it takes a large amount of time. That is what batch processing is: the program doesn’t react to incoming events, but processes a bounded (and huge) set of data.
We have a similar ETL process for reports in Charlie (implemented inside Oracle with jobs and queries)
“infinite” - we can’t bound the data by a “start” and an “end”, since we have no historical knowledge about the data (opposite to batch, where we can find data boundaries by time or some other parameter)
It can be confusing, since we could call almost any system “streaming”, but let’s divide it into an ingestion stage, where we want to return OK to the customer/system as fast as possible, and the inner propagation of events inside the system. Inside our system we want to react to ingested data in “near real time” and deliver (transform) data to some endpoint (external or internal). For example, when we receive a message, we want to send it to its destination as fast as possible; or, in ad-tech, when we receive a click on some item, we want to find an advertisement to show to the user before he leaves the page (and without affecting page load time).
In streaming we work with discrete elements (events). Sometimes we wish to group them by time - this is called windowing and we’ll talk about it a little bit later; sometimes we want to aggregate (count) them, again during some period of time (since we are “unbounded”, NOT since the beginning of time); and sometimes we simply deliver them further along our system pipeline (pipeline is actually a good word :) )
Another thing about streaming is, in my opinion, resources. Nowadays most systems use shared resources (CPU, memory and so on). We don’t want to take and hold resources when we don’t need them (yes, yes, non-blocking IO and other nice phrases), and this is applicable to the whole system architecture. It means we want to start working only when some “event” occurs. And that takes me to another streaming definition (if not by name, at least by result): event-driven design. We want to react to an event, do the minimum required work, and send the task/event/job to another process which takes care of the next stage, and so on. It means that we stream events, transform events into other events, compare them to some data and/or enrich them, and store them for future processing (in a SQL DB, NoSQL DB, messaging queue, file system...)
Event - something that happened in the past and has, at least, a name (and usually some state which describes the event at the SPECIFIC time it occurred), and usually we know what time the event occurred. An event is something that already happened and, unlike in science fiction, it can’t be changed. This makes events a perfect fit for the logic of a read-only append log (anybody say Kafka?). Every stream processor creates a new event (based on the received event), so when a downstream processor (see previous slide) receives an event, it’s already a new one. Events are, by definition, immutable.
Event Time - we’ll talk a little more about this when we reach windowing. In most cases we’ll have problems if we don’t normalize time to some agreed vector. It should be normalized (at one of the stages) to some standard time (GMT/IST, ...)
Processing Time - the time when a specific element (module) processes the event (usually the time when the message is received by the system, Kafka for example, but it could be the time when a specific microservice processes the event)
Wall-clock time - since it’s the time on a specific computer, processing time is usually set by wall-clock time. It’s important to make sure all components in the system have the same wall-clock time (as much as possible), otherwise you’ll get time skew, which can introduce unexpected behavior in the system
Flink, Kafka Streams/KSQL and Spark Streaming
Let’s look at a simple example where we want to count the number of events per key. For every key that appears again, the counter increases
But this is not so useful (usually), so let’s show only those keys that appear more than once
You can see the topic automatically created by Kafka Streams with the aggregated data for every key. So how does Kafka know to aggregate a specific value (after all, Kafka is not a search engine, so why would it be efficient for Kafka to consume from this topic for every incoming event)? We’ll explain this later (KTable)
Let’s start with the simplest type of window: static (tumbling). (I prefer to call it static since its boundaries don’t move and there is no overlapping)
A tumbling window has a fixed length. The next window is placed right after the end of the previous one on the time axis. Tumbling windows do not overlap and span the whole time domain, i.e. each event is assigned to exactly one window.
Let’s send the same input to Kafka, but 3 times with a 7-second interval.
Here is a simple example of a window: we count keys which appear more than once during a window of 5 sec
We can see that the count result from the previous example appears 3 times (we have 3 window ranges from the input) and resets for every window
Here is an example of the Kafka topic which stores the aggregated results, but per window
In the result (sink) topic we see only the final result
Like tumbling windows, hopping windows have a fixed length. However, they introduce a second configuration parameter: the hop size h. Instead of moving the window of length s forward in time by s, we move it by h.
A common use case for hopping windows is moving-average computations (or an irate-like function, as in Prometheus :) )
Hopping windows are usually confused with sliding windows. In Kafka Streams there is a hopping window, but not a sliding one (sliding exists in the join operation)
Show animation at: https://dev.to/frosnerd/window-functions-in-stream-analytics-1m6c
Simple example: when we want to know how many operations a user performed while logged in (in a session), we can do it with a session window: we count every click while he is in the session. When the session expires (we can set the window time to the same value as the session expiration time) and the user logs in again, we start the counter from zero, and so on
We can describe a sliding window as a window which starts when an event with some key arrives.
If you read articles and watch videos about streaming and batching, you’ll be confused by the hopping and sliding window definitions. Since we are looking into Kafka Streams, we follow Kafka’s notion of windowing.
In order to perform a join, the keys in Kafka should be the same (represent the same value: an “id”, for example)
Since data from the streams is actually stored in memory, we should consider how long we want to keep the data in memory.
The last line in the join statement defines how to (de)serialize the data from the topics
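For example (a sketch; the one-hour figure is illustrative), the retention of the join window could be set explicitly with the same-era until API:
// keep join state for 1 hour so late delivery reports can still be matched
JoinWindows.of(TimeUnit.SECONDS.toMillis(60L))
.until(TimeUnit.HOURS.toMillis(1L));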
So what happened in our application during the join:
During the join, Kafka Streams created 2 KTables:
● a sent-message KTable
● a status-message KTable
When a message comes to the sent-message stream, it:
● checks whether the status-message KTable contains a record for this key whose time fits the defined join window (actually, in the KTable the key is a compound key of the original key and the window, where the window start is the time the event was received and the end is start + the window size/length)
● inserts a record into the sent-message KTable with the key: original key + window, based on the event-received time and the window size/length
When a message comes to the status-message stream, the same happens, but in the opposite direction
Questions: expiration, compaction, application crash and so on….