SlideShare una empresa de Scribd logo
1 de 77
Descargar para leer sin conexión
1
UNIFYING MESSAGING, QUEUING, STREAMING &
COMPUTE WITH APACHE PULSAR
KARTHIK RAMASAMY
CO-FOUNDER AND CEO
2
Connected World
3
Ubiquity of Real-Time Data Streams & Events
EVENT/STREAM DATA PROCESSING
4
✦ Events are analyzed and processed as they arrive
✦ Decisions are timely, contextual and based on fresh data
✦ Decision latency is eliminated
✦ Data in motion
Ingest/
Buffer
Analyze Act
MICROSERVICES
MODEL INFERENCEWORKFLOWS ANALYTICS
MONITORING
EVENT/STREAM PROCESSING PATTERNS
STREAM PROCESSING PATTERN
6
ComputeMessaging
Storage
Data	Inges6on Data	Processing
Results	StorageData	Storage
Data	
Serving
ELEMENTS OF EVENT/STREAM PROCESSING
7
Aggregation
Systems
Messaging
Systems
Result
Engine
HDFS
Queryable
Engines
APACHE PULSAR
8
Flexible Messaging + Streaming System
backed by a durable log storage
Key Concepts
Core concepts: Tenants, namespaces, topics
10
Apache Pulsar Cluster
Tenants Namespaces Topics
Marketing
Sales
Security
Analytics
Campaigns
Data
transformation
Data Integration
Microservices
Visits
Conversions
Responses
Conversions
Transactions
Interactions
Log events
Signatures
Accesses
Topics
11
TopicProducers
Consumers
Time
Consumers
Consumers
Producers
Topic partitions
12
Topic - P0
Time
Topic - P1
Topic - P2
Producers
Producers
Consumers
Consumers
Consumers
Segments
13
Time
Segment 1 Segment 2 Segment 3
Segment 1 Segment 2 Segment 3 Segment 4
Segment 1 Segment 2 Segment 3
P0
P1
P2
Architecture
APACHE PULSAR
15
Bookie Bookie Bookie
Broker Broker Broker
Producer Consumer
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
APACHE PULSAR - BROKER
16
✦ Broker is the only point of interaction for clients (producers and consumers)
✦ Brokers acquire ownership of group of topics and “serve” them
✦ Broker has no durable state
✦ Provides service discovery mechanism for client to connect to right broker
APACHE PULSAR - BROKER
17
APACHE PULSAR - CONSISTENCY
18
Bookie
Bookie
BookieBrokerProducer
APACHE PULSAR - DURABILITY (NO DATA LOSS)
19
Bookie
Bookie
BookieBrokerProducer
Journal
Journal
Journal
fsync
fsync
fsync
APACHE PULSAR - ISOLATION
20
APACHE PULSAR - SEGMENT STORAGE
21
234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
APACHE PULSAR - RESILIENCY
22
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
APACHE PULSAR - SEAMLESS CLUSTER EXPANSION
23
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
Segment Y
Segment Z
Segment X
APACHE PULSAR - TIERED STORAGE
24
Low Cost Storage
1234…20212223…40414243…60616263…
Segment 3
Segment 2Segment 3
Segment 4
Segment 3
Segment 1
Segment 4 Segment 4
Multi-tiered storage and serving
25
Partition
Broker Broker Broker
.
.
. .
.
. .
.
. .
.
.
Processing
(brokers)
Warm
Storage
Cold
Storage
Tailing reads: served from
in-memory cache
Catch-up reads: served
from persistent storage
layer
Historical reads: served
from cold storage
PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
26
Legacy Architectures
# Storage co-resident with processing
# Partition-centric
# Cumbersome to scale--data
redistribution, performance impact
Logical
View
Apache Pulsar
# Storage decoupled from processing
# Partitions stored as segments
# Flexible, easy scalability
Partition
Processing
& Storage
Segment 1 Segment 3Segment 2 Segment n
Partition
Broker
Partition
(primary)
Broker
Partition
(copy)
Broker
Partition
(copy)
Broker Broker Broker
Segment 1
Segment 2
Segment n
.
.
.
Segment 2
Segment 3
Segment n
.
.
.
Segment 3
Segment 1
Segment n
.
.
.
Segment 1
Segment 2
Segment n
.
.
.
Processing
(brokers)
Storage
PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
27
✦ In Kafka, partitions are assigned to brokers “permanently”
✦ A single partition is stored entirely in a single node
✦ Retention is limited by a single node storage capacity
✦ Failure recovery and capacity expansion require expensive “rebalancing”
✦ Rebalancing has a big impact over the system, affecting regular traffic
UNIFIED MESSAGING MODEL - STREAMING
28
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 1
Consumer 2
Subscription
A
M4
M3
M2
M1
M0
M4
M3
M2
M1
M0
X
Exclusive
UNIFIED MESSAGING MODEL - STREAMING
29
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 1
Consumer 2
Subscription
B
M4
M3
M2
M1
M0
M4
M3
M2
M1
M0
Failover
In case of failure in
consumer 1
UNIFIED MESSAGING MODEL - QUEUING
30
Pulsar topic/
partition
Producer 2
Producer 1
Consumer 2
Consumer 3
Subscription
C
M4
M3
M2
M1
M0
Shared
Traffic is equally distributed
across consumers
Consumer 1
M4M3
M2M1M0
DISASTER RECOVERY
31
Topic	(T1) Topic	(T1)
Topic	(T1)
Subscrip6on	
(S1)
Subscrip6on	
(S1)
Producer		
(P1)
Consumer		
(C1)
Producer		
(P3)
Producer		
(P2)
Consumer		
(C2)
Data	Center	A Data	Center	B
Data	Center	C
Integrated in the
broker message flow
Simple configuration
to add/remove regions
Asynchronous (default)
and synchronous
replication
• Two independent clusters,
primary and standby
• Configured tenants and
namespaces replicate to standby
• Data published to primary is
asynchronously replicated to
standby
• Producers and consumers
restarted in second datacenter
upon primary failure
Asynchronous replication example
32
Producers
(active)
Datacenter 1
Consumers
(active)
Pulsar Cluster
(primary)
Datacenter 2
Producers
(standby)
Consumers
(standby)
Pulsar Cluster
(standby)
Pulsar
replication
ZooKeeper ZooKeeper
ZooKeeper
• Each topic owned by one
broker at a time, i.e. in one
datacenter
• ZooKeeper cluster spread
across multiple locations
• Broker commits writes to
bookies in both datacenters
• In event of datacenter failure,
broker in surviving datacenter
assumes ownership of topic
Synchronous replication example
33
Producers
Datacenter 1
Consumers
Pulsar Cluster
Datacenter 2
Producers
Consumers
Replicated subscriptions
34
Producers
Datacenter 1
Consumers
Pulsar
Cluster 1
Subscriptions
Datacenter 2
Consumers
Pulsar
Cluster 2
Subscriptions
Pulsar
Replication
MarkerMarker Marker
MULTITENANCY - CLOUD NATIVE
35
Apache Pulsar Cluster
Product
Safety
ETL
Fraud
Detection
Topic-1
Account History
Topic-2
User Clustering
Topic-1
Risk Classification
MarketingCampaigns
ETL
Topic-1
Budgeted Spend
Topic-2
Demographic Classification
Topic-1
Location Resolution
Data
Serving
Microservice
Topic-1
Customer Authentication
10 TB
7 TB
5 TB
✦ Authentication
✦ Authorization
✦ Software isolation
๏ Storage quotas, flow control, back pressure, rate limiting
✦ Hardware isolation
๏ Constrain some tenants on a subset of brokers/bookies
PULSAR CLIENTS
36
Apache Pulsar Cluster
Java
Python
Go
C++ C
PULSAR PRODUCER
37
PulsarClient client = PulsarClient.create(
“http://broker.usw.example.com:8080”);
Producer producer = client.createProducer(
“persistent://my-property/us-west/my-namespace/my-topic”);
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
// Message was persisted
});
PULSAR CONSUMER
38
PulsarClient client = PulsarClient.create(
"http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
"persistent://my-property/us-west/my-namespace/my-topic",
"my-subscription-name");
while (true) {
// Wait for a message
Message msg = consumer.receive();
System.out.println("Received message: " + msg.getData());
// Acknowledge the message so that it can be deleted by broker
consumer.acknowledge(msg);
}
SCHEMA REGISTRY
39
✦ Provides type safety to applications built on top of Pulsar
✦ Two approaches
✦ Client side - type safety enforcement up to the application
✦ Server side - system enforces type safety and ensures that producers and consumers remain synced
✦ Schema registry enables clients to upload data schemas on a topic basis.
✦ Schemas dictate which data types are recognized as valid for that topic
PULSAR SCHEMAS - HOW DO THEY WORK?
40
✦ Enforced at the topic level
✦ Pulsar schemas consists of
✦ Name - Name refers to the topic to which the schema is applied
✦ Payload - Binary representation of the schema
✦ Schema type - JSON, Protobuf and Avro
✦ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
SCHEMA VERSIONING
41
PulsarClient client = PulsarClient.builder()
.serviceUrl(“http://broker.usw.example.com:6650")
.build()
Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class))
.topic(“sensor-data”)
.sendTimeout(3, TimeUnit.SECONDS)
.create()
Scenario What happens
No schema exists for the topic Producer is created using the given schema
Schema already exists; producer
connects using the same schema
that’s already stored
Schema is transmitted to the broker, determines that it is already stored
Schema already exists; producer
connects using a new schema that is
compatible
Schema is transmitted, compatibility determined and stored as new schema
Processing framework
HOW TO PROCESS DATA MODELED AS STREAMS
43
✦ Consume data as it is produced (pub/sub)
✦ Light weight compute - transform and react to data as it arrives
✦ Heavy weight compute - continuous data processing
✦ Interactive query of stored streams
LIGHT WEIGHT COMPUTE
44
f(x)
Incoming	Messages Output	Messages
ABSTRACT VIEW OF COMPUTE REPRESENTATION
TRADITIONAL COMPUTE REPRESENTATION
45
DAG
%
%
%
%
%
Source 1
Source 2
Action
Action
Action
Sink 1
Sink 2
REALIZING COMPUTATION - EXPLICIT CODE
46
public static class SplitSentence extends BaseBasicBolt {
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
@Override
public Map<String, Object> getComponentConfiguration() {
return null;
}
public void execute(Tuple tuple, BasicOutputCollector
basicOutputCollector) {
String sentence = tuple.getStringByField("sentence");
String words[] = sentence.split(" ");
for (String w : words) {
basicOutputCollector.emit(new Values(w));
}
}
}
STITCHED BY PROGRAMMERS
REALIZING COMPUTATION - FUNCTIONAL
47
Builder.newBuilder()
.newSource(() -> StreamletUtils.randomFromList(SENTENCES))
.flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("s+")))
.reduceByKeyAndWindow(word -> word, word -> 1,
WindowConfig.TumblingCountWindow(50),
(x, y) -> x + y);
TRADITIONAL REAL TIME - SEPARATE SYSTEMS
48
Messaging Compute
TRADITIONAL REAL TIME SYSTEMS
49
DEVELOPER EXPERIENCE
✦ Powerful API but complicated
✦ Does everyone really need to learn functional programming?
✦ Configurable and scalable but management overhead
✦ Edge systems have resource and management constraints
TRADITIONAL REAL TIME SYSTEMS
50
OPERATIONAL EXPERIENCE
✦ Multiple systems to operate
✦ IoT deployments routinely have thousands of edge systems
✦ Semantic differences
✦ Mismatch and duplication between systems
✦ Creates developer and operator friction
LESSONS LEARNT - USE CASES
51
✦ Data transformations
✦ Data classification
✦ Data enrichment
✦ Data routing
✦ Data extraction and loading
✦ Real time aggregation
✦ Microservices
Significant set of processing tasks are exceedingly simple
EMERGENCE OF CLOUD - SERVERLESS
52
✦ Simple function API
✦ Functions are submitted to the system
✦ Runs per events
✦ Composition APIs to do complex things
✦ Wildly popular
SERVERLESS VS STREAMING
53
✦ Both are event driven architectures
✦ Both can be used for analytics and data serving
✦ Both have composition APIs
๏ Configuration based for serverless
๏ DSL based for streaming
✦ Serverless typically does not guarantee ordering
✦ Serverless is pay per action
STREAM NATIVE COMPUTE USING FUNCTIONS
54
✦ Simplest possible API -function or a procedure
✦ Support for multi language
✦ Use of native API for each language
✦ Scale developers
✦ Use of message bus native concepts - input and output as topics
✦ Flexible runtime - simple standalone applications vs managed system applications
APPLYING INSIGHT GAINED FROM SERVERLESS
PULSAR FUNCTIONS
55
SDK LESS API
import java.util.function.Function;
public class ExclamationFunction implements Function<String, String> {
@Override
public String apply(String input) {
return input + "!";
}
}
PULSAR FUNCTIONS
56
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;
public class ExclamationFunction implements PulsarFunction<String, String> {
@Override
public String process(String input, Context context) {
return input + "!";
}
}
PULSAR FUNCTIONS
57
✦ Function executed for every message of input topic
✦ Support for multiple topics as inputs
✦ Function output goes into output topic - can be void topic as well
✦ SerDe takes care of serialization/deserialization of messages
๏ Custom SerDe can be provided by the users
๏ Integration with schema registry
PROCESSING GUARANTEES
58
✦ ATMOST_ONCE
๏ Message acked to Pulsar as soon as we receive it
✦ ATLEAST_ONCE
๏ Message acked to Pulsar after the function completes
๏ Default behavior - don’t want people to loose data
✦ EFFECTIVELY_ONCE
๏ Uses Pulsar’s inbuilt effectively once semantics
✦ Controlled at runtime by user
DEPLOYING FUNCTIONS - BROKER
59
Broker 1
Worker
Function
wordcount-1
Function
transform-2
Broker 1
Worker
Function
transform-1
Function
dataroute-1
Broker 1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
DEPLOYING FUNCTIONS - WORKER NODES
60
Worker
Function
wordcount-1
Function
transform-2
Worker
Function
transform-1
Function
dataroute-1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
Broker 1 Broker 2 Broker 3
Node 4 Node 5 Node 6
DEPLOYING FUNCTIONS - KUBERNETES
61
Function
wordcount-1
Function
transform-1
Function
transform-3
Pod 1 Pod 2 Pod 3
Broker 1 Broker 2 Broker 3
Pod 7 Pod 8 Pod 9
Function
dataroute-1
Function
wordcount-2
Function
transform-2
Pod 4 Pod 5 Pod 6
BUILT-IN STATE MANAGEMENT IN FUNCTIONS
62
✦ Functions can store state in inbuilt storage
๏ Framework provides a simple library to store and retrieve state
✦ Support server side operations like counters
✦ Simplified application development
๏ No need to standup an extra system
DISTRIBUTED STATE IN FUNCTIONS
63
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;
public class CounterFunction implements PulsarFunction<String, Void> {
@Override
public Void process(String input, Context context) throws Exception {
for (String word : input.split(".")) {
context.incrCounter(word, 1);
}
return null;
}
}
PULSAR - DATA IN AND OUT
64
✦ Users can write custom code using Pulsar producer and consumer API
✦ Challenges
๏ Where should the application to publish data or consume data from Pulsar?
๏ How should the application to publish data or consume data from Pulsar?
✦ Current systems have no organized and fault tolerant way to run applications that ingress and egress
data from and to external systems
PULSAR IO TO THE RESCUE
65
Apache Pulsar ClusterSource Sink
PULSAR IO - EXECUTION
66
Broker 1
Worker
Sink
Cassandra-1
Source
Kinesis-2
Broker 2
Worker
Source
Kinesis-1
Source
Twitter-1
Broker 3
Worker
Sink
Cassandra-2
Source
Kinesis-3
Node 1 Node 2 Node 3
Fault tolerance Parallelism Elasticity Load Balancing On-demand updates
INTERACTIVE QUERYING OF STREAMS - PULSAR SQL
67
1234…20212223…40414243…60616263…
Segment 1
Segment 3
Segment 2
Segment 2
Segment 1
Segment 3
Segment 4
Segment 3
Segment 2
Segment 1
Segment 4
Segment 4
Segment
Reader
Segment
Reader
Segment
Reader
Segment
Reader
Coordinator
PULSAR PERFORMANCE
68
PULSAR PERFORMANCE - LATENCY
69
APACHE PULSAR VS. APACHE KAFKA
70
Mul$-tenancy	
A	single	cluster	can	support	many	
tenants	and	use	cases
Seamless	Cluster	Expansion	
Expand	the	cluster	without	any	
down	$me
High	throughput	&	Low	Latency	
Can	reach	1.8	M	messages/s	in	a	
single	par$$on	and	publish	latency	
of	5ms	at	99pct
Durability	
Data	replicated	and	synced	to	disk
Geo-replica$on	
Out	of	box	support	for	geographically	
distributed	applica$ons
Unified	messaging	model	
Support	both	Topic	&	Queue	
seman$c	in	a	single	model
Tiered	Storage	
Hot/warm	data	for	real	$me	access	and	
cold	event	data	in	cheaper	storage
Pulsar	Func$ons	
Flexible	light	weight	compute
Highly	scalable	
Can	support	millions	of	topics,	makes	data	
modeling	easier
APACHE PULSAR IN PRODUCTION @SCALE
71
4+	years	
Serves	2.3	million	topics	
700	billion	messages/day	
500+	bookie	nodes	
200+	broker	nodes	
Average	latency	<	5	ms	
99.9%	15	ms	(strong	durability	guarantees)	
Zero	data	loss	
150+	applica6ons	
Self	served	provisioning	
Full-mesh	cross-datacenter	replica6on	-	8+	data	centers
Growing ecosystem
72
73
✓ Understanding How Pulsar Works

https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-
works
✓ How To (Not) Lose Messages on Apache Pulsar Cluster

https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-
apache-pulsar-cluster
MORE READINGS
MORE READINGS
74
✓ Unified queuing and streaming

https://streaml.io/blog/pulsar-streaming-queuing
✓ Segment centric storage

https://streaml.io/blog/pulsar-segment-based-architecture
✓ Messaging, Storage or Both

https://streaml.io/blog/messaging-storage-or-both
✓ Access patterns and tiered storage

https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar
✓ Tiered Storage in Apache Pulsar

https://streaml.io/blog/tiered-storage-in-apache-pulsar
QUESTIONS
75
STAY IN TOUCH
@karthikz
TWITTER
EMAIL
karthik@streaml.io
76
77
@karthikz

Más contenido relacionado

La actualidad más candente

Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache Pulsar
StreamNative
 

La actualidad más candente (20)

Apache Pulsar at Yahoo! Japan
Apache Pulsar at Yahoo! JapanApache Pulsar at Yahoo! Japan
Apache Pulsar at Yahoo! Japan
 
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming DataApache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
 
Elastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache PulsarElastic Data Processing with Apache Flink and Apache Pulsar
Elastic Data Processing with Apache Flink and Apache Pulsar
 
When apache pulsar meets apache flink
When apache pulsar meets apache flinkWhen apache pulsar meets apache flink
When apache pulsar meets apache flink
 
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre ZembBuilding a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Query Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache FlinkQuery Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache Flink
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarUnifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
 
kafka
kafkakafka
kafka
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Integrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data EcosystemIntegrating Apache Pulsar with Big Data Ecosystem
Integrating Apache Pulsar with Big Data Ecosystem
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
A Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and ProcessingA Unified Platform for Real-time Storage and Processing
A Unified Platform for Real-time Storage and Processing
 
Pulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless EvolutionPulsar Storage on BookKeeper _Seamless Evolution
Pulsar Storage on BookKeeper _Seamless Evolution
 
Messaging queue - Kafka
Messaging queue - KafkaMessaging queue - Kafka
Messaging queue - Kafka
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
kafka
kafkakafka
kafka
 

Similar a Apache Pulsar Seattle - Meetup

Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data Applications
Arun Kejariwal
 
(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference
Timothy Spann
 
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera StreamingPrinceton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Timothy Spann
 
Kafka-and-event-driven-architecture-OGYatra20.ppt
Kafka-and-event-driven-architecture-OGYatra20.pptKafka-and-event-driven-architecture-OGYatra20.ppt
Kafka-and-event-driven-architecture-OGYatra20.ppt
Inam Bukhary
 

Similar a Apache Pulsar Seattle - Meetup (20)

Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data Applications
 
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
 
(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference(Current22) Let's Monitor The Conditions at the Conference
(Current22) Let's Monitor The Conditions at the Conference
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
 
Solr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationSolr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin Presentation
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Timothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLTimothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for ML
 
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
 
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera StreamingPrinceton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
 
BigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache ApexBigDataSpain 2016: Introduction to Apache Apex
BigDataSpain 2016: Introduction to Apache Apex
 
Apache Pulsar Overview
Apache Pulsar OverviewApache Pulsar Overview
Apache Pulsar Overview
 
Kafka-and-event-driven-architecture-OGYatra20.ppt
Kafka-and-event-driven-architecture-OGYatra20.pptKafka-and-event-driven-architecture-OGYatra20.ppt
Kafka-and-event-driven-architecture-OGYatra20.ppt
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
 
Pulsar - flexible pub-sub for internet scale
Pulsar - flexible pub-sub for internet scalePulsar - flexible pub-sub for internet scale
Pulsar - flexible pub-sub for internet scale
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Architectures with Windows Azure
Architectures with Windows AzureArchitectures with Windows Azure
Architectures with Windows Azure
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 

Más de Karthik Ramasamy

Más de Karthik Ramasamy (11)

Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/day
 
Apache Pulsar @Splunk
Apache Pulsar @SplunkApache Pulsar @Splunk
Apache Pulsar @Splunk
 
Pulsar summit-keynote-final
Pulsar summit-keynote-finalPulsar summit-keynote-final
Pulsar summit-keynote-final
 
Creating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarCreating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache Pulsar
 
Exactly once in Apache Heron
Exactly once in Apache HeronExactly once in Apache Heron
Exactly once in Apache Heron
 
Tutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesTutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming Architectures
 
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeperStreaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
 
Modern Data Pipelines
Modern Data PipelinesModern Data Pipelines
Modern Data Pipelines
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paper
 
Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014
 

Último

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Último (20)

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Apache Pulsar Seattle - Meetup

  • 1. 1 UNIFYING MESSAGING, QUEUING, STREAMING & COMPUTE WITH APACHE PULSAR KARTHIK RAMASAMY CO-FOUNDER AND CEO
  • 3. 3 Ubiquity of Real-Time Data Streams & Events
  • 4. EVENT/STREAM DATA PROCESSING 4 ✦ Events are analyzed and processed as they arrive ✦ Decisions are timely, contextual and based on fresh data ✦ Decision latency is eliminated ✦ Data in motion Ingest/ Buffer Analyze Act
  • 6. STREAM PROCESSING PATTERN 6 ComputeMessaging Storage Data Inges6on Data Processing Results StorageData Storage Data Serving
  • 7. ELEMENTS OF EVENT/STREAM PROCESSING 7 Aggregation Systems Messaging Systems Result Engine HDFS Queryable Engines
  • 8. APACHE PULSAR 8 Flexible Messaging + Streaming System backed by a durable log storage
  • 10. Core concepts: Tenants, namespaces, topics 10 Apache Pulsar Cluster Tenants Namespaces Topics Marketing Sales Security Analytics Campaigns Data transformation Data Integration Microservices Visits Conversions Responses Conversions Transactions Interactions Log events Signatures Accesses
  • 12. Topic partitions 12 Topic - P0 Time Topic - P1 Topic - P2 Producers Producers Consumers Consumers Consumers
  • 13. Segments 13 Time Segment 1 Segment 2 Segment 3 Segment 1 Segment 2 Segment 3 Segment 4 Segment 1 Segment 2 Segment 3 P0 P1 P2
  • 15. APACHE PULSAR 15 Bookie Bookie Bookie Broker Broker Broker Producer Consumer SERVING Brokers can be added independently Traffic can be shifted quickly across brokers STORAGE Bookies can be added independently New bookies will ramp up traffic quickly
  • 16. APACHE PULSAR - BROKER 16 ✦ Broker is the only point of interaction for clients (producers and consumers) ✦ Brokers acquire ownership of group of topics and “serve” them ✦ Broker has no durable state ✦ Provides service discovery mechanism for client to connect to right broker
  • 17. APACHE PULSAR - BROKER 17
  • 18. APACHE PULSAR - CONSISTENCY 18 Bookie Bookie BookieBrokerProducer
  • 19. APACHE PULSAR - DURABILITY (NO DATA LOSS) 19 Bookie Bookie BookieBrokerProducer Journal Journal Journal fsync fsync fsync
  • 20. APACHE PULSAR - ISOLATION 20
  • 21. APACHE PULSAR - SEGMENT STORAGE 21 234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  • 22. APACHE PULSAR - RESILIENCY 22 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4
  • 23. APACHE PULSAR - SEAMLESS CLUSTER EXPANSION 23 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Y Segment Z Segment X
  • 24. APACHE PULSAR - TIERED STORAGE 24 Low Cost Storage 1234…20212223…40414243…60616263… Segment 3 Segment 2Segment 3 Segment 4 Segment 3 Segment 1 Segment 4 Segment 4
  • 25. Multi-tiered storage and serving 25 Partition Broker Broker Broker .
.
. .
.
. .
.
. .
.
. Processing (brokers) Warm Storage Cold Storage Tailing reads: served from in-memory cache Catch-up reads: served from persistent storage layer Historical reads: served from cold storage
  • 26. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE? 26 Legacy Architectures # Storage co-resident with processing # Partition-centric # Cumbersome to scale--data redistribution, performance impact Logical View Apache Pulsar # Storage decoupled from processing # Partitions stored as segments # Flexible, easy scalability Partition Processing & Storage Segment 1 Segment 3Segment 2 Segment n Partition Broker Partition (primary) Broker Partition (copy) Broker Partition (copy) Broker Broker Broker Segment 1 Segment 2 Segment n .
.
. Segment 2 Segment 3 Segment n .
.
. Segment 3 Segment 1 Segment n .
.
. Segment 1 Segment 2 Segment n .
.
. Processing (brokers) Storage
  • 27. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE? 27 ✦ In Kafka, partitions are assigned to brokers “permanently” ✦ A single partition is stored entirely in a single node ✦ Retention is limited by a single node storage capacity ✦ Failure recovery and capacity expansion require expensive “rebalancing” ✦ Rebalancing has a big impact over the system, affecting regular traffic
  • 28. UNIFIED MESSAGING MODEL - STREAMING 28 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription A M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 X Exclusive
  • 29. UNIFIED MESSAGING MODEL - STREAMING 29 Pulsar topic/ partition Producer 2 Producer 1 Consumer 1 Consumer 2 Subscription B M4 M3 M2 M1 M0 M4 M3 M2 M1 M0 Failover In case of failure in consumer 1
  • 30. UNIFIED MESSAGING MODEL - QUEUING 30 Pulsar topic/ partition Producer 2 Producer 1 Consumer 2 Consumer 3 Subscription C M4 M3 M2 M1 M0 Shared Traffic is equally distributed across consumers Consumer 1 M4M3 M2M1M0
  • 31. DISASTER RECOVERY 31 Topic (T1) Topic (T1) Topic (T1) Subscrip6on (S1) Subscrip6on (S1) Producer (P1) Consumer (C1) Producer (P3) Producer (P2) Consumer (C2) Data Center A Data Center B Data Center C Integrated in the broker message flow Simple configuration to add/remove regions Asynchronous (default) and synchronous replication
  • 32. • Two independent clusters, primary and standby • Configured tenants and namespaces replicate to standby • Data published to primary is asynchronously replicated to standby • Producers and consumers restarted in second datacenter upon primary failure Asynchronous replication example 32 Producers (active) Datacenter 1 Consumers (active) Pulsar Cluster (primary) Datacenter 2 Producers (standby) Consumers (standby) Pulsar Cluster (standby) Pulsar replication ZooKeeper ZooKeeper
  • 33. ZooKeeper • Each topic owned by one broker at a time, i.e. in one datacenter • ZooKeeper cluster spread across multiple locations • Broker commits writes to bookies in both datacenters • In event of datacenter failure, broker in surviving datacenter assumes ownership of topic Synchronous replication example 33 Producers Datacenter 1 Consumers Pulsar Cluster Datacenter 2 Producers Consumers
  • 34. Replicated subscriptions 34 Producers Datacenter 1 Consumers Pulsar Cluster 1 Subscriptions Datacenter 2 Consumers Pulsar Cluster 2 Subscriptions Pulsar Replication MarkerMarker Marker
  • 35. MULTITENANCY - CLOUD NATIVE 35 Apache Pulsar Cluster Product Safety ETL Fraud Detection Topic-1 Account History Topic-2 User Clustering Topic-1 Risk Classification MarketingCampaigns ETL Topic-1 Budgeted Spend Topic-2 Demographic Classification Topic-1 Location Resolution Data Serving Microservice Topic-1 Customer Authentication 10 TB 7 TB 5 TB ✦ Authentication ✦ Authorization ✦ Software isolation ๏ Storage quotas, flow control, back pressure, rate limiting ✦ Hardware isolation ๏ Constrain some tenants on a subset of brokers/bookies
  • 36. PULSAR CLIENTS 36 Apache Pulsar Cluster Java Python Go C++ C
  • 37. PULSAR PRODUCER 37 PulsarClient client = PulsarClient.create( “http://broker.usw.example.com:8080”); Producer producer = client.createProducer( “persistent://my-property/us-west/my-namespace/my-topic”); // handles retries in case of failure producer.send("my-message".getBytes()); // Async version: producer.sendAsync("my-message".getBytes()).thenRun(() -> { // Message was persisted });
  • 38. PULSAR CONSUMER 38 PulsarClient client = PulsarClient.create( "http://broker.usw.example.com:8080"); Consumer consumer = client.subscribe( "persistent://my-property/us-west/my-namespace/my-topic", "my-subscription-name"); while (true) { // Wait for a message Message msg = consumer.receive(); System.out.println("Received message: " + msg.getData()); // Acknowledge the message so that it can be deleted by broker consumer.acknowledge(msg); }
  • 39. SCHEMA REGISTRY 39 ✦ Provides type safety to applications built on top of Pulsar ✦ Two approaches ✦ Client side - type safety enforcement up to the application ✦ Server side - system enforces type safety and ensures that producers and consumers remain synced ✦ Schema registry enables clients to upload data schemas on a topic basis. ✦ Schemas dictate which data types are recognized as valid for that topic
  • 40. PULSAR SCHEMAS - HOW DO THEY WORK? 40 ✦ Enforced at the topic level ✦ Pulsar schemas consists of ✦ Name - Name refers to the topic to which the schema is applied ✦ Payload - Binary representation of the schema ✦ Schema type - JSON, Protobuf and Avro ✦ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
  • 41. SCHEMA VERSIONING 41 PulsarClient client = PulsarClient.builder() .serviceUrl(“http://broker.usw.example.com:6650") .build() Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class)) .topic(“sensor-data”) .sendTimeout(3, TimeUnit.SECONDS) .create() Scenario What happens No schema exists for the topic Producer is created using the given schema Schema already exists; producer connects using the same schema that’s already stored Schema is transmitted to the broker, determines that it is already stored Schema already exists; producer connects using a new schema that is compatible Schema is transmitted, compatibility determined and stored as new schema
  • 43. HOW TO PROCESS DATA MODELED AS STREAMS 43 ✦ Consume data as it is produced (pub/sub) ✦ Light weight compute - transform and react to data as it arrives ✦ Heavy weight compute - continuous data processing ✦ Interactive query of stored streams
  • 44. LIGHT WEIGHT COMPUTE 44 f(x) Incoming Messages Output Messages ABSTRACT VIEW OF COMPUTE REPRESENTATION
  • 45. TRADITIONAL COMPUTE REPRESENTATION 45 DAG % % % % % Source 1 Source 2 Action Action Action Sink 1 Sink 2
  • 46. REALIZING COMPUTATION - EXPLICIT CODE 46 public static class SplitSentence extends BaseBasicBolt { @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } @Override public Map<String, Object> getComponentConfiguration() { return null; } public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) { String sentence = tuple.getStringByField("sentence"); String words[] = sentence.split(" "); for (String w : words) { basicOutputCollector.emit(new Values(w)); } } } STITCHED BY PROGRAMMERS
  • 47. REALIZING COMPUTATION - FUNCTIONAL 47 Builder.newBuilder() .newSource(() -> StreamletUtils.randomFromList(SENTENCES)) .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("s+"))) .reduceByKeyAndWindow(word -> word, word -> 1, WindowConfig.TumblingCountWindow(50), (x, y) -> x + y);
  • 48. TRADITIONAL REAL TIME - SEPARATE SYSTEMS 48 Messaging Compute
  • 49. TRADITIONAL REAL TIME SYSTEMS 49 DEVELOPER EXPERIENCE ✦ Powerful API but complicated ✦ Does everyone really need to learn functional programming? ✦ Configurable and scalable but management overhead ✦ Edge systems have resource and management constraints
  • 50. TRADITIONAL REAL TIME SYSTEMS 50 OPERATIONAL EXPERIENCE ✦ Multiple systems to operate ✦ IoT deployments routinely have thousands of edge systems ✦ Semantic differences ✦ Mismatch and duplication between systems ✦ Creates developer and operator friction
  • 51. LESSONS LEARNT - USE CASES 51 ✦ Data transformations ✦ Data classification ✦ Data enrichment ✦ Data routing ✦ Data extraction and loading ✦ Real time aggregation ✦ Microservices Significant set of processing tasks are exceedingly simple
  • 52. EMERGENCE OF CLOUD - SERVERLESS 52 ✦ Simple function API ✦ Functions are submitted to the system ✦ Runs per events ✦ Composition APIs to do complex things ✦ Wildly popular
  • 53. SERVERLESS VS STREAMING 53 ✦ Both are event driven architectures ✦ Both can be used for analytics and data serving ✦ Both have composition APIs ๏ Configuration based for serverless ๏ DSL based for streaming ✦ Serverless typically does not guarantee ordering ✦ Serverless is pay per action
  • 54. STREAM NATIVE COMPUTE USING FUNCTIONS 54 ✦ Simplest possible API -function or a procedure ✦ Support for multi language ✦ Use of native API for each language ✦ Scale developers ✦ Use of message bus native concepts - input and output as topics ✦ Flexible runtime - simple standalone applications vs managed system applications APPLYING INSIGHT GAINED FROM SERVERLESS
  • 55. PULSAR FUNCTIONS 55 SDK LESS API import java.util.function.Function; public class ExclamationFunction implements Function<String, String> { @Override public String apply(String input) { return input + "!"; } }
  • 56. PULSAR FUNCTIONS 56 SDK API import org.apache.pulsar.functions.api.PulsarFunction; import org.apache.pulsar.functions.api.Context; public class ExclamationFunction implements PulsarFunction<String, String> { @Override public String process(String input, Context context) { return input + "!"; } }
  • 57. PULSAR FUNCTIONS 57 ✦ Function executed for every message of input topic ✦ Support for multiple topics as inputs ✦ Function output goes into output topic - can be void topic as well ✦ SerDe takes care of serialization/deserialization of messages ๏ Custom SerDe can be provided by the users ๏ Integration with schema registry
  • 58. PROCESSING GUARANTEES 58 ✦ ATMOST_ONCE ๏ Message acked to Pulsar as soon as we receive it ✦ ATLEAST_ONCE ๏ Message acked to Pulsar after the function completes ๏ Default behavior - don’t want people to loose data ✦ EFFECTIVELY_ONCE ๏ Uses Pulsar’s inbuilt effectively once semantics ✦ Controlled at runtime by user
  • 59. DEPLOYING FUNCTIONS - BROKER 59 Broker 1 Worker Function wordcount-1 Function transform-2 Broker 1 Worker Function transform-1 Function dataroute-1 Broker 1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3
  • 60. DEPLOYING FUNCTIONS - WORKER NODES 60 Worker Function wordcount-1 Function transform-2 Worker Function transform-1 Function dataroute-1 Worker Function wordcount-2 Function transform-3 Node 1 Node 2 Node 3 Broker 1 Broker 2 Broker 3 Node 4 Node 5 Node 6
  • 61. DEPLOYING FUNCTIONS - KUBERNETES 61 Function wordcount-1 Function transform-1 Function transform-3 Pod 1 Pod 2 Pod 3 Broker 1 Broker 2 Broker 3 Pod 7 Pod 8 Pod 9 Function dataroute-1 Function wordcount-2 Function transform-2 Pod 4 Pod 5 Pod 6
  • 62. BUILT-IN STATE MANAGEMENT IN FUNCTIONS 62 ✦ Functions can store state in inbuilt storage ๏ Framework provides a simple library to store and retrieve state ✦ Support server side operations like counters ✦ Simplified application development ๏ No need to standup an extra system
  • 63. DISTRIBUTED STATE IN FUNCTIONS 63 import org.apache.pulsar.functions.api.Context; import org.apache.pulsar.functions.api.PulsarFunction; public class CounterFunction implements PulsarFunction<String, Void> { @Override public Void process(String input, Context context) throws Exception { for (String word : input.split(".")) { context.incrCounter(word, 1); } return null; } }
  • 64. PULSAR - DATA IN AND OUT 64 ✦ Users can write custom code using Pulsar producer and consumer API ✦ Challenges ๏ Where should the application to publish data or consume data from Pulsar? ๏ How should the application to publish data or consume data from Pulsar? ✦ Current systems have no organized and fault tolerant way to run applications that ingress and egress data from and to external systems
  • 65. PULSAR IO TO THE RESCUE 65 Apache Pulsar ClusterSource Sink
  • 66. PULSAR IO - EXECUTION 66 Broker 1 Worker Sink Cassandra-1 Source Kinesis-2 Broker 2 Worker Source Kinesis-1 Source Twitter-1 Broker 3 Worker Sink Cassandra-2 Source Kinesis-3 Node 1 Node 2 Node 3 Fault tolerance Parallelism Elasticity Load Balancing On-demand updates
  • 67. INTERACTIVE QUERYING OF STREAMS - PULSAR SQL 67 1234…20212223…40414243…60616263… Segment 1 Segment 3 Segment 2 Segment 2 Segment 1 Segment 3 Segment 4 Segment 3 Segment 2 Segment 1 Segment 4 Segment 4 Segment Reader Segment Reader Segment Reader Segment Reader Coordinator
  • 69. PULSAR PERFORMANCE - LATENCY 69
  • 70. APACHE PULSAR VS. APACHE KAFKA 70 Mul$-tenancy A single cluster can support many tenants and use cases Seamless Cluster Expansion Expand the cluster without any down $me High throughput & Low Latency Can reach 1.8 M messages/s in a single par$$on and publish latency of 5ms at 99pct Durability Data replicated and synced to disk Geo-replica$on Out of box support for geographically distributed applica$ons Unified messaging model Support both Topic & Queue seman$c in a single model Tiered Storage Hot/warm data for real $me access and cold event data in cheaper storage Pulsar Func$ons Flexible light weight compute Highly scalable Can support millions of topics, makes data modeling easier
  • 71. APACHE PULSAR IN PRODUCTION @SCALE 71 4+ years Serves 2.3 million topics 700 billion messages/day 500+ bookie nodes 200+ broker nodes Average latency < 5 ms 99.9% 15 ms (strong durability guarantees) Zero data loss 150+ applica6ons Self served provisioning Full-mesh cross-datacenter replica6on - 8+ data centers
  • 73. 73 ✓ Understanding How Pulsar Works
 https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar- works ✓ How To (Not) Lose Messages on Apache Pulsar Cluster
 https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an- apache-pulsar-cluster MORE READINGS
  • 74. MORE READINGS 74 ✓ Unified queuing and streaming
 https://streaml.io/blog/pulsar-streaming-queuing ✓ Segment centric storage
 https://streaml.io/blog/pulsar-segment-based-architecture ✓ Messaging, Storage or Both
 https://streaml.io/blog/messaging-storage-or-both ✓ Access patterns and tiered storage
 https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar ✓ Tiered Storage in Apache Pulsar
 https://streaml.io/blog/tiered-storage-in-apache-pulsar