Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk explains what an event streaming platform such as Apache Kafka is, and covers some of the use cases and design patterns around its use, including several examples of where it is solving real business problems. New developments in this area, such as KSQL, will also be discussed.
What is Apache Kafka and What is an Event Streaming Platform?
1.
What is Apache Kafka and What is an Event Streaming Platform?
Bern Apache Kafka® Meetup
2.
Join the Confluent Community Slack Channel: cnfl.io/community-slack
Subscribe to the Confluent blog: cnfl.io/read
Welcome to the Apache Kafka® Meetup in Bern!
6:00pm: Doors open
6:00pm - 6:30pm: Food, Drinks and Networking
6:30pm - 6:50pm: Matthias Imsand, Amanox Solutions
6:50pm - 7:35pm: Gabriel Schenker, Confluent
7:35pm - 8:00pm: Additional Q&A & Networking
Apache, Apache Kafka, Kafka and the Kafka logo are trademarks of the Apache Software Foundation. The Apache Software Foundation has no affiliation
with and does not endorse the materials provided at this event.
3.
About Me
● Gabriel N. Schenker
● Lead Curriculum Developer @ Confluent
● Formerly at Docker, Alienvault, …
● Lives in Appenzell, AI
● GitHub: github.com/gnschenker
● Twitter: @gnschenker
9.
ETL/Data Integration vs. Messaging
ETL/Data Integration (stored records): Batch, Expensive, Time Consuming, Difficult to Scale
Messaging (transient messages): No Persistence, Data Loss, No Replay
Event Streaming Paradigm: High Throughput, Durable, Persistent, Maintains Order, Fast (Low Latency)
10.
Event Streaming Paradigm
To rethink data as not stored records or transient messages, but instead as a continually updating stream of events
16.
[Diagram: Apache Kafka® (CONNECT, STREAMS, CLIENTS) as a Universal Event Pipeline connecting data stores (Mainframes, Hadoop, Data Warehouse, ...), logs (Device Logs, ..., Splunk, ...), 3rd party apps, and custom apps/microservices to contextual event-driven apps such as Real-Time Inventory, Real-Time Fraud Detection, Real-Time Customer 360, Machine Learning Models, and Real-Time Data Transformation.]
33.
The Serializer
Kafka doesn’t care about what you send to it as long as it’s been converted to a byte stream beforehand. Serializers turn formats such as JSON, CSV, Avro, Protobuf, or XML (if you must) into those bytes.
Reference
https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
34.
The Serializer
// Producer configuration: String keys, Avro-serialized values backed by Confluent Schema Registry.
private Properties settings = new Properties();
settings.put("bootstrap.servers", "broker1:9092,broker2:9092");
settings.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
settings.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
settings.put("schema.registry.url", "https://schema-registry:8083");
producer = new KafkaProducer<String, Invoice>(settings);
Reference
https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
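Building on the configuration above, a minimal usage sketch; the topic name "invoices", the key, and the Invoice fields are made up for illustration, and Invoice is assumed to be an Avro-generated class.

// Hypothetical usage of the producer configured above.
Invoice invoice = Invoice.newBuilder()
        .setId("inv-0001")     // illustrative fields only
        .setAmount(99.95)
        .build();

// The KafkaAvroSerializer registers/looks up the Invoice schema in Schema Registry
// and writes the value as Avro-encoded bytes.
producer.send(new ProducerRecord<String, Invoice>("invoices", "inv-0001", invoice));
producer.flush();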
35.
Record Keys and why they’re important - Ordering
Producer Record: Topic, [Partition], [Key], Value
Record keys determine the partition with the default Kafka partitioner.
If a key isn’t provided, messages will be produced in a round-robin fashion across partitions.
36.
Record Keys and why they’re important - Ordering
Producer Record: Topic, [Partition], Key (e.g. AAAA, BBBB, CCCC, DDDD), Value
Record keys determine the partition with the default Kafka partitioner, and therefore guarantee order for a key.
Keys are used in the default partitioning algorithm:
partition = hash(key) % numPartitions
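A small sketch of what that default partitioning looks like in code; Kafka’s DefaultPartitioner uses the murmur2 hash from org.apache.kafka.common.utils.Utils, and the key and partition count below are made up for illustration.

byte[] keyBytes = "AAAA".getBytes(StandardCharsets.UTF_8);
int numPartitions = 6;   // made-up partition count

// partition = hash(key) % numPartitions
// toPositive() clears the sign bit so the modulo result is never negative
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

// The same key always hashes to the same partition, which is what preserves per-key ordering.
System.out.println("Key AAAA goes to partition " + partition);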
40.
Record Keys and why they’re important - Key Cardinality
Key cardinality affects the amount of work done by the individual consumers in a group; a poor key choice can lead to uneven workloads.
Keys in Kafka don’t have to be primitives like strings or ints. Like values, they can be anything: JSON, Avro, etc. So create a key that will evenly distribute groups of records around the partitions.
Car·di·nal·i·ty /ˌkärdəˈnalədē/ Noun: the number of elements in a set or other grouping, as a property of that grouping.
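To make the cardinality point concrete, a small contrast sketch; the Order type, its getters, and the "orders" topic are hypothetical.

// Keying by country yields only a handful of distinct keys, so a few partitions
// (and the consumers assigned to them) do most of the work.
ProducerRecord<String, Order> skewed =
        new ProducerRecord<>("orders", order.getCountry(), order);

// Keying by customer id yields many distinct keys, spreading records more evenly
// across partitions while still preserving per-customer ordering.
ProducerRecord<String, Order> balanced =
        new ProducerRecord<>("orders", order.getCustomerId(), order);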
41.
You don’t have to, but... use a Schema!
A Data Producer Service sends JSON to a Data Consumer Service. The consumer expects records like:
{
  "Name": "John Smith",
  "Address": "123 Apple St.",
  "City": "Philadelphia",
  "State": "PA",
  "Zip": "19101"
}
but the producer starts sending:
{
  "Name": "John Smith",
  "Address": "123 Apple St.",
  "Zip": "19101"
}
“Where’s record.City?”
Reference
https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
42.
Schema Registry: Make Data Backwards Compatible and Future-Proof
● Define the expected fields for each Kafka topic
● Automatically handle schema changes (e.g. new fields)
● Prevent backwards-incompatible changes
● Support multi-data-center environments
[Diagram: App 1 and App 2 run their serializers against the Schema Registry before writing to a Kafka topic; example consumers include Elastic, Cassandra, and HDFS.]
Open Source Feature
43.
Developing with Confluent Schema Registry
We provide a Maven plugin with several goals for developing with the Confluent Schema Registry:
● download - download a subject’s schema to your project
● register - register a new schema to the schema registry from your development env
● test-compatibility - test changes made to a schema against compatibility rules set by the schema registry
Reference
https://docs.confluent.io/current/schema-registry/docs/maven-plugin.html
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.0.0</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://192.168.99.100:8081</param>
    </schemaRegistryUrls>
    <outputDirectory>src/main/avro</outputDirectory>
    <subjectPatterns>
      <param>^TestSubject000-(key|value)$</param>
    </subjectPatterns>
  </configuration>
</plugin>
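Assuming the plugin configuration shown above, the goals are invoked from the project directory as mvn schema-registry:download, mvn schema-registry:register, and mvn schema-registry:test-compatibility.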
44.
Avro allows for evolution of schemas
The Data Producer Service and Data Consumer Service exchange Avro records through Kafka, with their schemas tracked by the Schema Registry. Version 2 of the schema adds City and State; because those fields have the default "NA", records from either schema version can still be read.
Version 2 record:
{
  "Name": "John Smith",
  "Address": "123 Apple St.",
  "City": "Philadelphia",
  "State": "PA",
  "Zip": "19101"
}
Record read where City and State fall back to their defaults:
{
  "Name": "John Smith",
  "Address": "123 Apple St.",
  "Zip": "19101",
  "City": "NA",
  "State": "NA"
}
Reference
https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
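On the consuming side, a minimal configuration sketch mirroring the producer settings from the earlier slide; the group id and topic name are made up, and Invoice is again assumed to be an Avro-generated class.

// Consumer counterpart to the Avro producer: String keys, Avro values
// deserialized via Confluent Schema Registry.
Properties settings = new Properties();
settings.put("bootstrap.servers", "broker1:9092,broker2:9092");
settings.put("group.id", "invoice-processor");   // made-up consumer group
settings.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
settings.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
settings.put("schema.registry.url", "https://schema-registry:8083");
settings.put("specific.avro.reader", "true");    // return generated Invoice objects instead of GenericRecord

Consumer<String, Invoice> consumer = new KafkaConsumer<>(settings);
consumer.subscribe(Arrays.asList("invoices"));   // made-up topic name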
45.
Use Kafka’s Headers
Reference
https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
Producer Record: Topic, [Partition], [Timestamp], [Headers], [Key], Value
Kafka headers are simply an interface requiring a key of type String and a value of type byte[]; the headers are carried as an iterable collection on the ProducerRecord.
Example Use Cases
● Data lineage: reference previous topic partition/offsets
● Producing host/application/owner
● Message routing
● Encryption metadata (which key pair was this message
payload encrypted with?)
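A small illustrative sketch of attaching headers to a record; the topic name, key, and header values are made up.

// Headers are String -> byte[] pairs attached to the record alongside key and value.
ProducerRecord<String, String> record =
        new ProducerRecord<>("payments", "key-1", "payload");

record.headers()
      .add("producing-host", "app-01.example.com".getBytes(StandardCharsets.UTF_8))
      .add("trace-id", "7f3c2a".getBytes(StandardCharsets.UTF_8));

producer.send(record);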
54.
A basic Java Consumer
final Consumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList(topic));
try {
  while (true) {
    // poll() waits up to 100 ms for new records
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
      // Do Some Work …
    }
  }
} finally {
  // Always close the consumer so its partitions are released promptly
  consumer.close();
}