Solution for events logging
How Akka Streams, coupled with Kafka, makes it easier to manage data flows
Architecture: Step 1
• Clients communicate with the server through a Kafka broker
• The consumer writes each record to both storages: PostgreSQL and Hadoop (possibly with Hive)
• PostgreSQL is used for operational data with a retention period of about 1-3 months; it is the storage with fast search capability
• Hadoop and related components are used as cheaper storage with slower access
Bird's-eye view (architecture diagram)
Challenges
1. It is difficult to maintain consistency when writing to multiple repositories simultaneously
2. The consumer must use the existing access-rights provider, monitoring, and logging systems
3. How to reduce the amount of developed code and speed up the process of adding new repositories
4. How to choose the optimal storage format from all the options offered by Hadoop and related products
5. No vendor lock-in allowed
Architecture: Step 2
Final solution
1. Consistency – Kafka provides the ability to create several consumer groups, and each group processes all messages independently (see the consumer-group sketch after this list)
2. Many tools, such as Fluentd, Logstash, and Flume, provide "get -> process -> put" functionality. However, for two main reasons (avoiding vendor lock-in and integration with our proprietary systems) we decided to develop our own
3. We chose the Akka Streams library to reduce the amount of developed code and to speed up the process of adding new storages
4. Hadoop write stages
   1. The consumer service writes files directly to HDFS in the Apache Avro format. Avro's pros: fast writes and compression. Cons: it has no indexes, so searching is very slow
   2. Apache Oozie schedules a task that converts the Avro files into an ORC table using Hive. Apache ORC files have several advantages:
      • They have indexes
      • They allow columnar encryption and/or masking
      Cons
      • A file can be written only once, because the indexes must be appended at the end of the file
5. Both PostgreSQL and Hive have JDBC drivers for accessing the data
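A minimal Alpakka Kafka sketch of point 1 (the topic name, group ids, bootstrap address, and String deserializers are assumptions, not from the slides): two consumer groups subscribe to the same topic, so the PostgreSQL and HDFS pipelines each receive every message and commit their offsets independently.

```scala
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import org.apache.kafka.common.serialization.StringDeserializer

object ConsumerGroupsSketch {
  implicit val system: ActorSystem = ActorSystem("events-logging")

  def settings(groupId: String): ConsumerSettings[String, String] =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("kafka:9092") // assumed broker address
      .withGroupId(groupId)

  // Same topic, different consumer groups: each group sees the full stream.
  val postgresSource =
    Consumer.committableSource(settings("events-postgres"), Subscriptions.topics("events"))
  val hdfsSource =
    Consumer.committableSource(settings("events-hdfs"), Subscriptions.topics("events"))
}
```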
Akka Streams: Consumer application architecture
Consumer application
• HTTP endpoints are protected by the desired access provider. The endpoints are developed with Akka HTTP
• The HTTP endpoint is backed by the KafkaProcessor actor, which can process two types of messages – Start and Stop (see the sketch after this list)
• KafkaProcessor can start & stop several kinds of streams:
  • PostgresFlow
  • HdfsAvroFlow
• The BaseKafkaFlow trait provides common functions:
  • Logging & monitoring functionality
  • Getting messages from, and committing offsets to, Kafka
  • Extension points for parsing and storing messages
• Each final stream extends the BaseKafkaFlow trait and implements the flow stages for:
  • Parser
  • Store
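A minimal sketch of that wiring under stated assumptions (Start and Stop are the slide's message names; the route paths and the actor internals are illustrative only, not the deck's actual code):

```scala
import akka.actor.{Actor, ActorRef}
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

case object Start
case object Stop

// Owns the lifecycle of the streams (PostgresFlow, HdfsAvroFlow, ...).
class KafkaProcessor extends Actor {
  def receive: Receive = {
    case Start => // materialize and start the configured streams
    case Stop  => // stop the running streams gracefully
  }
}

object ConsumerApi {
  // In the real application these routes would sit behind the access provider.
  def routes(processor: ActorRef): Route =
    pathPrefix("consumer") {
      (post & path("start")) { processor ! Start; complete("starting") } ~
      (post & path("stop"))  { processor ! Stop;  complete("stopping") }
    }
}
```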
Akka Streams coding: Base classes
These two traits – BaseProcessorFlow and BaseKafkaFlow – provide functions that cover the entire processing procedure.
The type parameters In and P allow different sources to be used and different kinds of messages to be processed.
For BaseKafkaFlow the source type is KafkaMessage (its definition appeared as a code snippet on the slide; a plausible sketch follows below).
BaseProcessorFlow has two abstract values:
• parse – used to convert an incoming value into a PassThrough[In, P]
• saveBatch – used to store a batch of data and pass all processed data downstream. This stage can be a cause of backpressure
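A plausible sketch of these abstractions (the slide's code snippet is not captured in this transcript, so the exact signatures, and the guess that KafkaMessage is Alpakka's CommittableMessage, are assumptions):

```scala
import scala.concurrent.Future
import akka.kafka.ConsumerMessage.CommittableMessage

// Carries the original input alongside its parsed form through the stream.
final case class PassThrough[In, P](original: In, parsed: P)

trait BaseProcessorFlow[In, P] {
  def parse(in: In): PassThrough[In, P] // convert an incoming value

  // Store a batch and emit what was processed; waiting on this Future
  // is what makes the stage a potential source of backpressure.
  def saveBatch(batch: Seq[PassThrough[In, P]]): Future[Seq[PassThrough[In, P]]]
}

object KafkaTypes {
  // For BaseKafkaFlow: a Kafka record together with its committable offset.
  type KafkaMessage = CommittableMessage[String, String]
}
```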
BaseKafkaFlow extends the base processor with the ability to receive and send messages. BaseKafkaFlow is an actor, and it is stateful. It can process the following messages:
• StartKafkaConsumer(startConfig) – carries the parameters needed to create and start a stream connected to the Kafka brokers. The stream reads messages from Kafka, processes them, and commits the offsets to the Kafka partitions once the messages have been processed successfully. Moreover, startConfig contains the consumer-group information.
• StopKafkaConsumer – used to stop all streams gracefully (a hedged sketch of this handling follows)
The main application actor can start many actors of different classes derived from BaseKafkaFlow.
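A hedged sketch of this message handling (StartConfig's fields, runStream, and the use of Alpakka's DrainingControl for graceful shutdown are assumptions):

```scala
import akka.actor.Actor
import akka.kafka.scaladsl.Consumer.DrainingControl

final case class StartConfig(topic: String, groupId: String) // assumed fields
final case class StartKafkaConsumer(startConfig: StartConfig)
case object StopKafkaConsumer

abstract class KafkaFlowActor extends Actor {
  // The actor's state: the materialized control of the running stream, if any.
  private var control: Option[DrainingControl[_]] = None

  // Builds source -> parse -> saveBatch -> commit and materializes it.
  protected def runStream(cfg: StartConfig): DrainingControl[_]

  def receive: Receive = {
    case StartKafkaConsumer(cfg) =>
      control = Some(runStream(cfg))
    case StopKafkaConsumer =>
      // Drain in-flight messages and commit their offsets before stopping.
      control.foreach(_.drainAndShutdown()(context.dispatcher))
      control = None
  }
}
```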
Akka Streams coding: Storages
PostgresFlow implements the parser, which constructs a SomeEvent object stored as JSON. If the JSON parsing does not succeed, it constructs a SomeEvent object that contains the error instead.
saveBatch stores the data in the PostgreSQL database. When the data cannot be saved because of a database error, the function retries the save repeatedly. To avoid overloading the CPU, it uses an exponential backoff to control the time between attempts. The whole batch is saved in a single transaction in the saveEvents function (a sketch of this retry follows).
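A minimal sketch of that retry policy (the shapes of SomeEvent and saveEvents are assumptions; the slides do not show them):

```scala
import java.sql.SQLException
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import akka.actor.Scheduler
import akka.pattern.after

object PostgresRetrySketch {
  final case class SomeEvent(json: String, error: Option[String] = None) // assumed shape
  def saveEvents(batch: Seq[SomeEvent]): Future[Unit] = ??? // one JDBC transaction (stub)

  // Retry the whole batch after an exponentially growing, capped delay,
  // so failed saves do not spin the CPU or hammer the database.
  def saveBatch(batch: Seq[SomeEvent],
                delay: FiniteDuration = 1.second,
                maxDelay: FiniteDuration = 1.minute)
               (implicit ec: ExecutionContext, scheduler: Scheduler): Future[Unit] =
    saveEvents(batch).recoverWith { case _: SQLException =>
      after(delay, scheduler)(saveBatch(batch, (delay * 2).min(maxDelay), maxDelay))
    }
}
```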
Akka Streams coding: Storages
Storing Avro files requires more complex code, so custom flow processing is used: HdfsAvroFileFlow defines the flow stage.
HdfsAvroFileFlowLogic defines the processing logic. It is needed because we must keep state for the open Avro output stream. One stream = one file in HDFS. The logic itself is very simple: open the file if it is not open, and write to it if it is. From time to time the file is forcibly closed and rotation occurs (see the sketch below).
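A minimal GraphStage sketch of this logic (the real HdfsAvroFileFlow / HdfsAvroFileFlowLogic code is not shown in the transcript; the AvroWriter interface and the count-based rotation rule are assumptions):

```scala
import akka.stream._
import akka.stream.stage._

trait AvroWriter[T] { def append(t: T): Unit; def close(): Unit } // assumed HDFS Avro writer

final class HdfsAvroFileFlow[T](open: () => AvroWriter[T], rotateAfter: Int)
    extends GraphStage[FlowShape[Seq[T], Seq[T]]] {

  private val in  = Inlet[Seq[T]]("HdfsAvroFileFlow.in")
  private val out = Outlet[Seq[T]]("HdfsAvroFileFlow.out")
  override val shape: FlowShape[Seq[T], Seq[T]] = FlowShape(in, out)

  // The logic holds the state: the currently open file, if any (one stream = one file).
  override def createLogic(attrs: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      private var writer: Option[AvroWriter[T]] = None
      private var batches = 0

      override def onPush(): Unit = {
        val batch = grab(in)
        val w = writer.getOrElse { val opened = open(); writer = Some(opened); opened }
        batch.foreach(w.append)       // write, opening the file first if needed
        batches += 1
        if (batches >= rotateAfter) { // force-close from time to time: rotation
          w.close(); writer = None; batches = 0
        }
        push(out, batch)              // pass the written data downstream
      }

      override def onPull(): Unit = pull(in)
      override def postStop(): Unit = writer.foreach(_.close())
      setHandlers(in, out, this)
    }
}
```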
The writeAvroEvents function writes the data and passes only the successfully written data downstream. Thanks to this, only data that was actually written to HDFS is committed to Kafka.
However, there is one point to consider: the Kafka stream will not reread uncommitted records. To reread them, the stream must be restarted, which can be done either by sending a special message to the actor or by throwing an exception during this stage (a hedged sketch follows).
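One hedged way to get that restart behaviour is Akka's RestartSource (the backoff values are assumptions; `settings` is the ConsumerSettings factory from the consumer-group sketch above). A rebuilt consumer resumes from the last committed offset, so uncommitted records are read again:

```scala
import scala.concurrent.duration._
import akka.kafka.Subscriptions
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.RestartSource

object RestartSketch {
  import ConsumerGroupsSketch.settings

  val restartingSource = RestartSource.onFailuresWithBackoff(
    minBackoff = 3.seconds, maxBackoff = 30.seconds, randomFactor = 0.2
  ) { () =>
    Consumer.committableSource(settings("events-hdfs"), Subscriptions.topics("events"))
    // ... parse, write Avro to HDFS (throw here to force a restart), commit offsets ...
  }
}
```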
Links
• Akka main site – https://akka.io/
• Akka Streams – https://doc.akka.io/docs/akka/current/stream/index.html
• Author email – anatolse@gmail.com