Slides for the talk given on 20-07-2019 at Nairobi JVM. The talk covered building data pipelines with Apache Kafka as a message broker (enterprise service bus) and Apache Spark as a distributed computing engine that enables efficient processing of large volumes of data.
9. 7 Vs of Big data
● Volume
● Velocity
● Veracity
● Variety
● Variability
● Visualization
● Value
10. What could go wrong?
● Data loss
● Data duplication
● Data corruption
● Data formatting
● Latency
● Process failure
● Flow control
● Data velocity
● Data volume
13. Apache Kafka in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Kafka is a distributed streaming platform capable of handling trillions of
events a day. It was initially conceived as a messaging system.
14. Apache kafka
Kafka provides the following three key capabilities of a streaming platform:
● Publish and subscribe to streams of records, similar to a message queue
or enterprise messaging system (EMS).
● Store streams of records in a fault tolerant durable way.
● Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
● Building real-time streaming data pipelines that reliably get data
between systems or applications
● Building real-time streaming applications that transform or react to
the streams of data
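The publish/subscribe and storage capabilities above can be pictured with a plain-JVM analogy: a minimal append-only log sketch (stdlib only; `MiniLog`, `append`, and `poll` are illustrative names, not Kafka classes).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal analogy for Kafka's model: an append-only log that producers
// write to and consumers read from at their own offset.
public class MiniLog {
    private final List<String> records = new ArrayList<>();

    // Producer side: append a record; the log keeps it (durability in
    // real Kafka comes from replicated disk storage).
    public synchronized int append(String record) {
        records.add(record);
        return records.size() - 1; // offset of the new record
    }

    // Consumer side: read everything from a given offset onward.
    // Reading does not remove records, so many consumers can each
    // keep their own position in the same log.
    public synchronized List<String> poll(int fromOffset) {
        return new ArrayList<>(records.subList(fromOffset, records.size()));
    }

    public static void main(String[] args) {
        MiniLog topic = new MiniLog();
        topic.append("event-1");
        topic.append("event-2");
        // Two independent consumers at different offsets:
        System.out.println(topic.poll(0)); // [event-1, event-2]
        System.out.println(topic.poll(1)); // [event-2]
    }
}
```

The key difference from a classic message queue is visible here: records are retained and addressed by offset, so multiple consumer groups can replay the same stream independently.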
19. KAFKA APIS
● Producer API: allows applications to publish a stream of records to one or more
Kafka topics.
● Consumer API: allows applications to subscribe to one or more topics and
process the stream of records provided to them.
● Streams API: allows an application to act as a stream processor, consuming an
input stream from one or more topics and producing an output stream to one or
more output topics, effectively transforming the input streams to output
streams
● Connector API: allows building and running reusable producers or consumers that
connect Kafka topics to existing applications or data systems.
● AdminClient API: supports managing and inspecting topics, brokers, ACLs, and
other Kafka objects.
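A minimal sketch of the Producer and Consumer APIs in Java. This assumes the `kafka-clients` dependency on the classpath and a broker running at `localhost:9092`; the topic name `test` and group id `demo-group` are assumptions, and the snippet needs a live broker to actually run.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer API: publish one record to the "test" topic.
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092");
        prodProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("test", "key-1", "hello"));
        }

        // Consumer API: subscribe to the topic and poll for records.
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("group.id", "demo-group");
        consProps.put("auto.offset.reset", "earliest");
        consProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(Collections.singletonList("test"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```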
20. KAFKA CLI TOOLS
● Topic creation: bin/kafka-topics.sh --create --zookeeper
localhost:2181 --replication-factor 1 --partitions 5 --topic test
● Publish to topic: bin/kafka-console-producer.sh --broker-list
localhost:9092 --topic test
● Consume from topic: bin/kafka-console-consumer.sh --bootstrap-server
localhost:9092 --topic test --from-beginning
22. Apache Spark in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Spark is a unified analytics engine for large-scale data processing. Its
design focuses on speed, ease of use, generality, and deployment across multiple
environments. It provides support for the following languages: Scala, Java,
Python, R, and SQL.
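As a sketch of what that ease of use looks like in practice, here is a minimal word count on Spark's Java RDD API (assuming the `spark-core` dependency on the classpath; the input path `input.txt`, output path `counts-output`, and app name are illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process on all cores; the same code
        // runs unchanged on a cluster with a different master URL.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // assumed input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum); // sums counts per word across partitions
            counts.saveAsTextFile("counts-output"); // assumed output path
        }
    }
}
```

The transformations (`flatMap`, `mapToPair`, `reduceByKey`) are lazy; Spark builds an execution plan and only runs it, in parallel across partitions, when the `saveAsTextFile` action is called.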
27. Apache Cassandra in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Cassandra is a distributed NoSQL database for managing large amounts of
structured data across many commodity servers, providing a highly available
service with no single point of failure.