LivePerson moved from an ETL-based data platform to a new platform built on emerging open-source technologies: Hadoop, Kafka, Storm, Avro, and more.
This presentation tells the story and focuses on Kafka.
2. Open Source paradigm
The Cathedral & the Bazaar by Eric S. Raymond, 1999
The struggle between top-down and bottom-up design
3. Challenges of a data platform[1]
• High throughput
• Horizontal scale to address growth
• High availability of data services
• No data loss
• Satisfy Real-Time demands
• Enforce structural data with schemas
• Process Big Data and Enterprise Data
• Single Source of Truth (SSOT)
4. SLAs of the data platform
[Diagram: real-time servers publish to a Data Bus, which feeds both real-time dashboards and offline customers through the BI DWH]
Offline SLA: 98% of data delivered in < 1/2 hr; 99.999% in < 4 hrs.
Real-time SLA: 98% of sends in < 500 msec; no send takes > 2 sec.
5. Legacy data flow in LivePerson
[Diagram: real-time servers feed an ETL pipeline (Sessionize, Modeling, Schema, View) that loads the BI DWH (Oracle); customers view reports from the DWH]
6. 1st phase - move to Hadoop
[Diagram: real-time servers now write raw data to HDFS; the ETL pipeline (Sessionize, Modeling, Schema, View) still runs, a Hadoop MR job transfers data to the new BI DWH (Vertica) alongside the legacy BI DWH (Oracle), and customers view reports from the DWH]
7. 2nd phase - move to Kafka
[Diagram: real-time servers produce events to a Kafka topic (Topic-1); from Kafka the data reaches HDFS, and a Hadoop MR job transfers it to the BI DWH (Vertica), from which customers view reports]
8. 3rd phase - integrate with new producers
[Diagram: new real-time servers produce to a second Kafka topic (Topic-2) alongside the existing Topic-1; the Kafka-to-HDFS-to-Vertica flow is unchanged]
9. 4th phase - add real-time BI
[Diagram: a Storm topology consumes the Kafka topics to serve real-time BI, in parallel to the batch flow from Kafka through HDFS to the BI DWH (Vertica)]
10. 5th phase - standardize the data model using Avro
[Diagram: messages on the Kafka topics are now Avro-encoded; Camus consumes them from Kafka into HDFS, while the Storm topology and the Vertica flow are unchanged]
11. 6th phase - define the Single Source of Truth (SSOT)
[Diagram: the same architecture, with HDFS now serving as the Single Source of Truth for all downstream consumers]
12. Kafka[2] as Backbone for Data
• Central "Message Bus"
• Support multiple topics (MQ style)
• Write-ahead persistence to files
• Distributed & Highly Available
• Horizontal Scale
• High throughput (tens of MB/sec per server)
• Service is agnostic to consumers' state
• Retention policy
16. Replaying messages in Kafka
[Diagram: ZooKeeper coordinates the broker nodes; each partition log is bounded by a min offset and a max offset]
FetchRequest fetchRequest = new FetchRequest(topic, partition, offset, size);
currentOffset: taken from ZooKeeper
Earliest offset: -2
Latest offset: -1
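A minimal replay sketch, assuming the Kafka 0.7-era SimpleConsumer API that the fetch-request line above belongs to; the host, port, topic, partition, and buffer sizes are example values:

import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

public class ReplayExample {
    public static void main(String[] args) {
        SimpleConsumer consumer = new SimpleConsumer("localhost", 9092, 10000, 64 * 1024);
        // -2 asks for the earliest offset still on disk, -1 for the latest;
        // a full replay starts from the earliest available offset.
        long[] offsets = consumer.getOffsetsBefore("topic1", 0, -2L, 1);
        long offset = offsets[0];
        ByteBufferMessageSet messages =
                consumer.fetch(new FetchRequest("topic1", 0, offset, 1024 * 1024));
        for (MessageAndOffset mo : messages) {
            // reprocess the message; the next fetch would continue from mo.offset()
        }
        consumer.close();
    }
}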
17. Kafka API[3]
• Producer API
• Consumer API
o High-level API
uses ZooKeeper to access brokers and to save offsets
o SimpleConsumer API
direct access to Kafka brokers
• Kafka-Spout, Camus, and KafkaHadoopConsumer all use the SimpleConsumer API
18. Kafka API[3]
• Producer
List<KeyedMessage<String, String>> messages = new ArrayList<KeyedMessage<String, String>>();
messages.add(new KeyedMessage<String, String>("topic1", null, msg1));
producer.send(messages);
• Consumer
Map<String, List<KafkaStream<byte[], byte[]>>> streams =
    consumer.createMessageStreams(Collections.singletonMap("topic1", 1));
for (MessageAndMetadata<byte[], byte[]> message : streams.get("topic1").get(0)) {
    // do something with message
}
19. Kafka in Unit Testing
• Use the kafka.server.KafkaServer class
• Run an embedded broker inside the test JVM
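A minimal sketch of such an embedded broker, assuming Kafka 0.8 on the test classpath and a local ZooKeeper on port 2181; KafkaServerStartable is the thin wrapper around KafkaServer that handles startup and shutdown, and the port and log directory are arbitrary test values:

import java.util.Properties;
import kafka.server.KafkaConfig;
import kafka.server.KafkaServerStartable;

public class EmbeddedKafkaBroker {
    private KafkaServerStartable server;

    public void start() {
        Properties props = new Properties();
        props.put("broker.id", "0");
        props.put("port", "9092");
        props.put("log.dir", "/tmp/embedded-kafka-logs"); // throwaway test directory
        props.put("zookeeper.connect", "localhost:2181");
        server = new KafkaServerStartable(new KafkaConfig(props));
        server.startup(); // broker is now ready for test producers and consumers
    }

    public void stop() {
        server.shutdown();
    }
}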
20. Introducing Avro[5]
• Schema representation using JSON
• Supported types
o Primitive types: boolean, int, long, string, etc.
o Complex types:
Record, Enum, Union, Arrays, Maps, Fixed
• Data is serialized using its schema
• Avro files include a file header containing the schema
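A minimal sketch tying these points together: a schema declared in JSON and a record serialized with it. The Event record and its fields are illustrative, not LivePerson's actual event schema:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON: a record with a string field and a long field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"sessionId\",\"type\":\"string\"},"
            + "{\"name\":\"timestamp\",\"type\":\"long\"}]}");

        GenericRecord event = new GenericData.Record(schema);
        event.put("sessionId", "102122");
        event.put("timestamp", 12346L);

        // Binary encoding carries no schema; the reader must obtain the same
        // schema elsewhere (an Avro data file embeds it in the file header instead).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();
        System.out.println(out.size() + " bytes");
    }
}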
21. Add Avro protocol to the story
[Diagram: Producer 1 sends to Kafka topics (Topic 1, Topic 2); a Schema Repo serves both the producer and the consumers (Camus/Storm)]
Producer flow:
1. Register schema 1.0 in the Schema Repo.
2. Create the message according to schema 1.0.
3. Encode the message with schema 1.0.
4. Add the schema revision to the message header.
5. Send the message.
Consumer flow (Camus/Storm):
1. Read the message.
2. Extract the header and obtain the schema version.
3. Get schema 1.0 from the Schema Repo by version.
4. Decode the message with schema 1.0.
Kafka message layout: header followed by the Avro message, e.g.
{event1: {header: {sessionId: "102122", timestamp: "12346"}}}...
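A minimal sketch of the versioned wire format described above: the schema revision is prepended to the Avro payload so a consumer can fetch the right schema before decoding. The one-byte header and the SchemaRepo interface are assumptions for illustration, not LivePerson's actual protocol:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class VersionedMessage {

    // Hypothetical lookup interface backed by the schema repository.
    interface SchemaRepo {
        Schema getByVersion(int version);
    }

    static byte[] encode(int schemaVersion, byte[] avroPayload) {
        return ByteBuffer.allocate(1 + avroPayload.length)
                .put((byte) schemaVersion) // header: schema revision
                .put(avroPayload)          // body: Avro-encoded record
                .array();
    }

    static GenericRecord decode(byte[] kafkaMessage, SchemaRepo repo) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(kafkaMessage);
        int version = buf.get();           // extract header, obtain schema version
        Schema schema = repo.getByVersion(version);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }
}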
22. Kafka + Storm + Avro example
• Demonstrates passing Avro data from Kafka to Storm
• Explains Avro schema revision evolution
• Requires Kafka and ZooKeeper to be installed
• Uses the Storm and Kafka-Spout Maven artifacts
• A Maven plugin generates Java classes from the Avro schema
• https://github.com/ransilberman/avro-kafka-storm
24. Challenges of Kafka
• Still not mature enough
• Not enough supporting tools (viewers, maintenance)
• Duplicate messages may occur
• API is not documented well enough
• Open source - supported by the community only
• Difficult to replay messages from a specific point in time
• Eventually consistent...
25. Eventually Consistent
Because it is a distributed system:
• No guarantee of delivery order
• No way to tell to which broker a message was sent
• Kafka does not guarantee that there are no duplicates
• ...but eventually, all messages will arrive!
[Diagram: an event makes its way from where it is generated to its destination across a desert]
26. Major Improvements in Kafka 0.8[4]
• Partitions replication
• Message send guarantee
• Consumer offsets are represented as sequential numbers instead of byte positions (e.g., 1, 2, 3, ...)
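A sketch of how the new send guarantee can be tuned, assuming the Kafka 0.8 Java producer, where request.required.acks controls how many acknowledgments a send waits for:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class GuaranteedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // 0 = fire-and-forget, 1 = wait for the leader, -1 = wait for all replicas
        props.put("request.required.acks", "-1");
        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("topic1", null, "hello"));
        producer.close();
    }
}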
27. Addressing Data Challenges
• High throughput
o Kafka, Hadoop
• Horizontal scale to address growth
o Kafka, Storm, Hadoop
• High availability of data services
o Kafka, Storm, Zookeeper
• No data loss
o Highly Available services, No ETL
28. Addressing Data Challenges Cont.
• Satisfy Real-Time demands
o Storm
• Enforce structural data with schemas
o Avro
• Process Big Data and Enterprise Data
o Kafka, Hadoop
• Single Source of Truth (SSOT)
o Hadoop, No ETL
29. References
• [1] Satisfying New Requirements for Data Integration, by David Loshin
• [2] Apache Kafka
• [3] Kafka API
• [4] Kafka 0.8 Quick Start
• [5] Apache Avro
• [6] Apache Storm