SlideShare una empresa de Scribd logo
1 de 15
Descargar para leer sin conexión
Integrating Apache Pulsar

with 

Big Data Ecosystem
Yijie Shen
20190817
Data analytics with Apache Pulsar
Why so many analytic frameworks ?
Each kind has its best fit
•Interactive Engine
• Time critical
• Medium data size
• Rerun on failure
•Batch Engine
• The amount of data can be very
large
• Could run on a huge cluster
• Fine-grained fault tolerance
•Streaming
• Ever running jobs
• Time critical
• Need scalability as well as
resilient on failures
•Serverless
• Simple processing logic
• Processing data with high
velocity
Don’t ask, I don’t know.
Why Apache Pulsar fits all ?
It’s a Pulsar Meetup, dude...
Pulsar – A cloud-native architecture
Stateless Serving
Durable Storage
Pulsar – Segment-based storage
•Managed ledger
• The storage layer for a single topic
•Ledger
• Single writer, append-only
• Replicated to multiple bookies
Pulsar – Infinite stream storage
•Reduce storage cost
• offloading segment to tiered storage one-by-one
Pulsar Schema
• Consensus of data at server-side
• Built-in schema registry
• Data schema on a per-topic basis
•Send and receive typed message directly
• Validation
• Multi-version
Durable and ordered source
•Failures are inevitable for engines
•Re-schedule failed tasks
• Tasks assigned to fixed (start, end] in Spark
• Tasks recover from checkpoint (start in Flink
•Exactly-once
• Based on message order in topic
• Seek & read
•Messages ”keep-alive” by subscription
• Move sub cursor on commit
task1 task2
Durable cursor
Two levels of reading API
•Consumer
• Subscribe / seek / receive
• Per topic partition
• Pulsar-Spark, Pulsar-Flink
•Segment
• Read directly from Bookies
• For parallelism
• Presto
Processing typed records
•Regard Pulsar as structured storage
•Fetching schema as the first step
• With Pulsar Admin API
• Dynamic / multi-versioned schema not supported in Spark/Flink
• But you could try AUTO_CONSUME
•SerDe your messages into InternalRow / Row
• Avro schema and avro/json/protobuf Message
• Or parse the Avro record as we do in pulsar-spark[1]
•Message metadata as metadata fields
• __key, __publishTime, __eventTime, __messageId, __topic
Topic/Partition add/delete discovery
• Streaming jobs are long
running
• Topics & partitions may be
added on removed during a job
• Periodically check topic for
status
• Spark: during incremental
planning
• Flink: with a monitoring thread
in each task
Pulsar-Spark as an example
• Happens during logical planning
• getBatch(start: Option[Offset],
end: Offset)
• Discovery topic differences between
start and end
• Start – last end
• End – getOffset()
• Connector
• provide available offset for all topic/
partitions for each getOffset
• Create DataFrame/DataSet based on
existing topic/partitions
• SS take care of the rest
Offset {
topicOffsets: Map[String, MessageId
}
Various APIs use Pulsar as source
val df = spark

.read

.format("pulsar")

.option("service.url", "pulsar://...")

.option("admin.url", "http://...")

.option("topic", "topic1")

.load()
val prop = new Properties()

prop.setProperty(“service.url”, serviceUrl)

prop.setProperty(“admin.url”, adminUrl)

prop.setProperty(“partitionDiscoveryIntervalMillis”, "5000")

prop.setProperty(“startingOffsets”, "earliest")
env.addSource(new FlinkPulsarSource(sourceProps))
show tables in pulsar."public/default";
select * from pulsar."public/
default".generator_test;
Spark
Flink
Presto
Pulsar-Spark and Pulsar-Flink
•Pulsar-Spark based on Spark 2.4 is now open sourced
• https://github.com/streamnative/pulsar-spark
•Pulsar-Flink based on Flink 1.9 will open-source soon
•Roadmaps for these two projects
• End-to-end exactly once with pulsar transaction support
• Fine-grained batch parallelism on segment level
• Pulsar-spark / Pulsar-flink

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
 
Building event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache PulsarBuilding event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache Pulsar
 
Building a FaaS with pulsar
Building a FaaS with pulsarBuilding a FaaS with pulsar
Building a FaaS with pulsar
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
High performance messaging with Apache Pulsar
High performance messaging with Apache PulsarHigh performance messaging with Apache Pulsar
High performance messaging with Apache Pulsar
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Transaction preview of Apache Pulsar
Transaction preview of Apache PulsarTransaction preview of Apache Pulsar
Transaction preview of Apache Pulsar
 
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
 
What's new in apache pulsar 2.4.0
What's new in apache pulsar 2.4.0What's new in apache pulsar 2.4.0
What's new in apache pulsar 2.4.0
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)
 
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...
 
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...
 
Open keynote_carolyn&matteo&sijie
Open keynote_carolyn&matteo&sijieOpen keynote_carolyn&matteo&sijie
Open keynote_carolyn&matteo&sijie
 
Effectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache PulsarEffectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache Pulsar
 
kafka
kafkakafka
kafka
 
Getting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison HighamGetting Pulsar Spinning_Addison Higham
Getting Pulsar Spinning_Addison Higham
 
Large scale log pipeline using Apache Pulsar_Nozomi
Large scale log pipeline using Apache Pulsar_NozomiLarge scale log pipeline using Apache Pulsar_Nozomi
Large scale log pipeline using Apache Pulsar_Nozomi
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
 

Similar a Integrating Apache Pulsar with Big Data Ecosystem

One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
Tim Vaillancourt
 

Similar a Integrating Apache Pulsar with Big Data Ecosystem (20)

Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 
Pulsar - flexible pub-sub for internet scale
Pulsar - flexible pub-sub for internet scalePulsar - flexible pub-sub for internet scale
Pulsar - flexible pub-sub for internet scale
 
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Building an Event Bus at Scale
Building an Event Bus at ScaleBuilding an Event Bus at Scale
Building an Event Bus at Scale
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Hands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache PulsarHands-on Workshop: Apache Pulsar
Hands-on Workshop: Apache Pulsar
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage Systems
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 

Más de StreamNative

Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
StreamNative
 

Más de StreamNative (20)

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
 
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...
 
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
 
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
 
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
 
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
 
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
 
Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022
 
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
 
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
 
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
 
Welcome and Opening Remarks - Pulsar Summit SF 2022
Welcome and Opening Remarks - Pulsar Summit SF 2022Welcome and Opening Remarks - Pulsar Summit SF 2022
Welcome and Opening Remarks - Pulsar Summit SF 2022
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 

Último

➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
nirzagarg
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
nirzagarg
 
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
nilamkumrai
 

Último (20)

➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men  🔝mehsana🔝   Escorts...
➥🔝 7737669865 🔝▻ mehsana Call-girls in Women Seeking Men 🔝mehsana🔝 Escorts...
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
 
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
VIP Model Call Girls NIBM ( Pune ) Call ON 8005736733 Starting From 5K to 25K...
 
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
( Pune ) VIP Pimpri Chinchwad Call Girls 🎗️ 9352988975 Sizzling | Escorts | G...
 
Real Escorts in Al Nahda +971524965298 Dubai Escorts Service
Real Escorts in Al Nahda +971524965298 Dubai Escorts ServiceReal Escorts in Al Nahda +971524965298 Dubai Escorts Service
Real Escorts in Al Nahda +971524965298 Dubai Escorts Service
 
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
Yerawada ] Independent Escorts in Pune - Book 8005736733 Call Girls Available...
Yerawada ] Independent Escorts in Pune - Book 8005736733 Call Girls Available...Yerawada ] Independent Escorts in Pune - Book 8005736733 Call Girls Available...
Yerawada ] Independent Escorts in Pune - Book 8005736733 Call Girls Available...
 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
 

Integrating Apache Pulsar with Big Data Ecosystem

  • 1.
  • 2. Integrating Apache Pulsar
 with 
 Big Data Ecosystem Yijie Shen 20190817
  • 3. Data analytics with Apache Pulsar
  • 4. Why so many analytic frameworks ? Each kind has its best fit •Interactive Engine • Time critical • Medium data size • Rerun on failure •Batch Engine • The amount of data can be very large • Could run on a huge cluster • Fine-grained fault tolerance •Streaming • Ever running jobs • Time critical • Need scalability as well as resilient on failures •Serverless • Simple processing logic • Processing data with high velocity Don’t ask, I don’t know.
  • 5. Why Apache Pulsar fits all ? It’s a Pulsar Meetup, dude...
  • 6. Pulsar – A cloud-native architecture Stateless Serving Durable Storage
  • 7. Pulsar – Segment-based storage •Managed ledger • The storage layer for a single topic •Ledger • Single writer, append-only • Replicated to multiple bookies
  • 8. Pulsar – Infinite stream storage •Reduce storage cost • offloading segment to tiered storage one-by-one
  • 9. Pulsar Schema • Consensus of data at server-side • Built-in schema registry • Data schema on a per-topic basis •Send and receive typed message directly • Validation • Multi-version
  • 10. Durable and ordered source •Failures are inevitable for engines •Re-schedule failed tasks • Tasks assigned to fixed (start, end] in Spark • Tasks recover from checkpoint (start in Flink •Exactly-once • Based on message order in topic • Seek & read •Messages ”keep-alive” by subscription • Move sub cursor on commit task1 task2 Durable cursor
  • 11. Two levels of reading API •Consumer • Subscribe / seek / receive • Per topic partition • Pulsar-Spark, Pulsar-Flink •Segment • Read directly from Bookies • For parallelism • Presto
  • 12. Processing typed records •Regard Pulsar as structured storage •Fetching schema as the first step • With Pulsar Admin API • Dynamic / multi-versioned schema not supported in Spark/Flink • But you could try AUTO_CONSUME •SerDe your messages into InternalRow / Row • Avro schema and avro/json/protobuf Message • Or parse the Avro record as we do in pulsar-spark[1] •Message metadata as metadata fields • __key, __publishTime, __eventTime, __messageId, __topic
  • 13. Topic/Partition add/delete discovery • Streaming jobs are long running • Topics & partitions may be added on removed during a job • Periodically check topic for status • Spark: during incremental planning • Flink: with a monitoring thread in each task Pulsar-Spark as an example • Happens during logical planning • getBatch(start: Option[Offset], end: Offset) • Discovery topic differences between start and end • Start – last end • End – getOffset() • Connector • provide available offset for all topic/ partitions for each getOffset • Create DataFrame/DataSet based on existing topic/partitions • SS take care of the rest Offset { topicOffsets: Map[String, MessageId }
  • 14. Various APIs use Pulsar as source val df = spark
 .read
 .format("pulsar")
 .option("service.url", "pulsar://...")
 .option("admin.url", "http://...")
 .option("topic", "topic1")
 .load() val prop = new Properties()
 prop.setProperty(“service.url”, serviceUrl)
 prop.setProperty(“admin.url”, adminUrl)
 prop.setProperty(“partitionDiscoveryIntervalMillis”, "5000")
 prop.setProperty(“startingOffsets”, "earliest") env.addSource(new FlinkPulsarSource(sourceProps)) show tables in pulsar."public/default"; select * from pulsar."public/ default".generator_test; Spark Flink Presto
  • 15. Pulsar-Spark and Pulsar-Flink •Pulsar-Spark based on Spark 2.4 is now open sourced • https://github.com/streamnative/pulsar-spark •Pulsar-Flink based on Flink 1.9 will open-source soon •Roadmaps for these two projects • End-to-end exactly once with pulsar transaction support • Fine-grained batch parallelism on segment level • Pulsar-spark / Pulsar-flink