Integrating Apache Pulsar with Big Data Ecosystem


At the Apache Pulsar Beijing Meetup, Yijie Shen presented the current state of Apache Pulsar's integration with the big data ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink, and Presto to form a unified data processing system.


Integrating Apache Pulsar with Big Data Ecosystem
Yijie Shen
2019-08-17

Why so many analytic frameworks?
Each kind has its best fit
•Interactive Engine
• Time critical
• Medium data size
• Rerun on failure
•Batch Engine
• The amount of data can be very large
• Could run on a huge cluster
• Fine-grained fault tolerance
•Streaming
• Ever-running jobs
• Time critical
• Need scalability as well as resilience to failures
•Serverless
• Simple processing logic
• Processing data with high velocity
Don’t ask, I don’t know.

Why does Apache Pulsar fit all?
It’s a Pulsar Meetup, dude...

Pulsar – A cloud-native architecture
Stateless Serving
Durable Storage

Pulsar – Segment-based storage
•Managed ledger
• The storage layer for a single topic
•Ledger
• Single writer, append-only
• Replicated to multiple bookies
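The segment idea above can be pictured with a small toy model. This is only a sketch: `Ledger`, `ManagedLedger`, and `maxEntriesPerLedger` are hypothetical names made up for illustration, not BookKeeper or Pulsar classes, and replication to bookies is left out entirely. It shows a topic stored as an ordered list of append-only ledgers, where a full ledger is closed and a new one is opened.

```scala
// Toy model only: illustrative classes, not the BookKeeper or Pulsar API.
final case class Ledger(id: Long, entries: Vector[String], closed: Boolean)

final case class ManagedLedger(ledgers: Vector[Ledger], maxEntriesPerLedger: Int) {
  // Append to the open (last) ledger; close it and roll over when it is full.
  def append(entry: String): ManagedLedger = ledgers.lastOption match {
    case Some(open) if !open.closed && open.entries.size < maxEntriesPerLedger =>
      copy(ledgers = ledgers.init :+ open.copy(entries = open.entries :+ entry))
    case Some(full) =>
      val closedLedger = full.copy(closed = true)
      val next = Ledger(full.id + 1, Vector(entry), closed = false)
      copy(ledgers = ledgers.init :+ closedLedger :+ next)
    case None =>
      copy(ledgers = Vector(Ledger(0L, Vector(entry), closed = false)))
  }

  // Reading the topic means reading every segment in order.
  def readAll: Vector[String] = ledgers.flatMap(_.entries)
}

object ManagedLedgerDemo {
  def main(args: Array[String]): Unit = {
    val empty = ManagedLedger(Vector.empty, maxEntriesPerLedger = 2)
    val ml = Seq("m1", "m2", "m3", "m4", "m5").foldLeft(empty)((acc, e) => acc.append(e))
    println(ml.ledgers.map(_.entries.size)) // entries per segment
    println(ml.readAll)                     // messages in publish order
  }
}
```

Because closed ledgers never change, they are natural units to move or read independently, which is what the later slides on tiered storage and segment-level reads rely on.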

Pulsar – Infinite stream storage
•Reduce storage cost
• Offload segments to tiered storage one by one

Pulsar Schema
•Consensus of data at server-side
• Built-in schema registry
• Data schema on a per-topic basis
•Send and receive typed message directly
• Validation
• Multi-version

Durable and ordered source
•Failures are inevitable for engines
•Re-schedule failed tasks
• Tasks assigned to fixed (start, end] in Spark
• Tasks recover from the checkpointed start position in Flink
•Exactly-once
• Based on message order in topic
• Seek & read
•Messages kept alive by subscription
• Move subscription cursor on commit
[Figure: task1, task2, durable cursor]
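A minimal sketch of why an ordered, replayable source makes task retries safe. Everything here is a toy: `Topic`, `Cursor`, and the integer positions are assumptions for illustration and do not correspond to the Pulsar client API. A task reads a fixed range (start, end]; if it fails, the rescheduled task seeks to the same start and sees the same messages, and the cursor only moves once the range is committed.

```scala
// Toy model only: an in-memory "topic" and "cursor", not the Pulsar client API.
final case class Topic(messages: Vector[String]) {
  // Seek & read the range (start, end] using 0-based positions;
  // start = -1 means "before the first message".
  def read(start: Int, end: Int): Vector[String] =
    messages.slice(start + 1, end + 1)
}

// The cursor is the last committed position; messages after it are retained.
final case class Cursor(committed: Int) {
  def commit(end: Int): Cursor = Cursor(math.max(committed, end))
}

object DurableCursorDemo {
  def main(args: Array[String]): Unit = {
    val topic = Topic(Vector("a", "b", "c", "d"))
    var cursor = Cursor(-1)

    // The first attempt over (-1, 1] "fails" before committing.
    val firstAttempt = topic.read(cursor.committed, 1)
    // The rescheduled task re-reads the same range and gets identical input.
    val retry = topic.read(cursor.committed, 1)
    assert(firstAttempt == retry)

    cursor = cursor.commit(1) // move the cursor only on commit
    println(topic.read(cursor.committed, 3))
  }
}
```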

Two levels of reading API
•Consumer
• Subscribe / seek / receive
• Per topic partition
• Pulsar-Spark, Pulsar-Flink
•Segment
• Read directly from Bookies
• For parallelism
• Presto

Processing typed records
•Regard Pulsar as structured storage
•Fetching schema as the first step
• With Pulsar Admin API
• Dynamic / multi-versioned schemas are not supported in Spark/Flink
• But you could try AUTO_CONSUME
•SerDe your messages into InternalRow / Row
• Avro schema and Avro/JSON/Protobuf messages
• Or parse the Avro record as we do in pulsar-spark[1]
•Message metadata as metadata fields
• __key, __publishTime, __eventTime, __messageId, __topic
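To illustrate how a message could become one row with the metadata fields listed above, here is a rough sketch. The `Msg` case class, the `toRow` helper, and the column order (schema fields first, metadata fields after) are assumptions made for this example; the actual connectors build Spark `InternalRow` or Flink `Row` objects from the deserialized payload.

```scala
// Hypothetical message shape for illustration; not the Pulsar client's Message class.
final case class Msg(
    key: String,
    publishTime: Long,
    eventTime: Long,
    messageId: String,
    topic: String,
    value: Map[String, Any])

object RowMapping {
  // Metadata column names as listed on the slide.
  val metadataFields: Seq[String] =
    Seq("__key", "__publishTime", "__eventTime", "__messageId", "__topic")

  // Assumed column order: schema fields first, then the metadata fields.
  def columns(schemaFields: Seq[String]): Seq[String] = schemaFields ++ metadataFields

  // Flatten one message into a row; fields missing from the payload become null.
  def toRow(schemaFields: Seq[String], m: Msg): Seq[Any] = {
    val valueColumns = schemaFields.map(f => m.value.getOrElse(f, null))
    valueColumns ++ Seq(m.key, m.publishTime, m.eventTime, m.messageId, m.topic)
  }
}

object RowMappingDemo {
  def main(args: Array[String]): Unit = {
    val msg = Msg("k1", 1000L, 900L, "1:0:0", "topic1", Map("id" -> 1, "name" -> "x"))
    val fields = Seq("id", "name")
    println(RowMapping.columns(fields).zip(RowMapping.toRow(fields, msg)))
  }
}
```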

Topic/Partition add/delete discovery
• Streaming jobs are long running
• Topics & partitions may be added or removed during a job
• Periodically check topic status
• Spark: during incremental planning
• Flink: with a monitoring thread in each task
Pulsar-Spark as an example
• Happens during logical planning
• getBatch(start: Option[Offset], end: Offset)
• Discover topic differences between start and end
• Start – last end
• End – getOffset()
• Connector
• Provide available offsets for all topics/partitions for each getOffset
• Create DataFrame/DataSet based on existing topics/partitions
• SS (Structured Streaming) takes care of the rest
Offset {
topicOffsets: Map[String, MessageId]
}
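The comparison of start and end offsets can be sketched as follows. This is a simplified, standalone illustration rather than the connector's code: `Long` positions stand in for Pulsar's `MessageId`, and `TopicRange`, `plan`, and `removed` are hypothetical helpers. Topics that appear only in the end offset are treated as newly discovered and read from their beginning; topics that appear only in the start offset are reported as removed.

```scala
// Hypothetical stand-in: Long positions instead of Pulsar's MessageId.
final case class Offset(topicOffsets: Map[String, Long])

// start = None means the topic is new and should be read from its beginning.
final case class TopicRange(topic: String, start: Option[Long], end: Long)

object OffsetDiff {
  // Ranges to read for one micro-batch, one per topic present in `end`.
  def plan(start: Option[Offset], end: Offset): Seq[TopicRange] = {
    val previous = start.map(_.topicOffsets).getOrElse(Map.empty[String, Long])
    end.topicOffsets.toSeq.sortBy(_._1).map { case (topic, endPos) =>
      TopicRange(topic, previous.get(topic), endPos)
    }
  }

  // Topics that were in the previous batch but are gone now.
  def removed(start: Option[Offset], end: Offset): Set[String] =
    start.map(_.topicOffsets.keySet).getOrElse(Set.empty[String]) -- end.topicOffsets.keySet
}

object OffsetDiffDemo {
  def main(args: Array[String]): Unit = {
    val lastEnd = Offset(Map("t1" -> 10L, "t2" -> 5L))
    val current = Offset(Map("t1" -> 15L, "t3" -> 2L))
    println(OffsetDiff.plan(Some(lastEnd), current))    // t1 continues, t3 is new
    println(OffsetDiff.removed(Some(lastEnd), current)) // t2 was deleted
  }
}
```

On the very first batch `start` is `None`, so every topic is planned as new.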

Various APIs using Pulsar as a source
Spark
// Batch read of a Pulsar topic as a DataFrame
val df = spark
  .read
  .format("pulsar")
  .option("service.url", "pulsar://...")
  .option("admin.url", "http://...")
  .option("topic", "topic1")
  .load()
Flink
import java.util.Properties
// Connection and discovery settings passed to the source
val sourceProps = new Properties()
sourceProps.setProperty("service.url", serviceUrl)
sourceProps.setProperty("admin.url", adminUrl)
sourceProps.setProperty("partitionDiscoveryIntervalMillis", "5000")
sourceProps.setProperty("startingOffsets", "earliest")
env.addSource(new FlinkPulsarSource(sourceProps))
Presto
show tables in pulsar."public/default";
select * from pulsar."public/default".generator_test;

Pulsar-Spark and Pulsar-Flink
•Pulsar-Spark based on Spark 2.4 is now open sourced
• https://github.com/streamnative/pulsar-spark
•Pulsar-Flink based on Flink 1.9 will be open-sourced soon
•Roadmaps for these two projects
• End-to-end exactly once with pulsar transaction support
• Fine-grained batch parallelism on segment level
• Pulsar-spark / Pulsar-flink

Más contenido relacionado

La actualidad más candente

Pulsar - Distributed pub/sub platform

Matteo Merli

Building event streaming pipelines using Apache Pulsar

StreamNative

At Clever Cloud, we are working on extremely light virtual machines to run WebAssembly binaries. As it’s WASM, we can write code using a lot of languages. We use a custom unikernel to run this WASM as Function-as-a-Service, using one VM per function execution. These VM can run on events from messages coming through Pulsar, or from HTTP invocation, the run is on-demand as only the consumers stay up. This can be a new model: Pulsar functions for real isolation in multi-tenancy use cases. This talk will show the use case, explain the virtualization underneath and demonstrate the multi-tenancy use case.

Building a FaaS with pulsar

StreamNative

The last few years have seen the emergence of Serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves. At the same time, its ability to elastically scale has simplified operations significantly. Combined together with the ubiquity of their presence across all cloud providers, serverless today has become the leading choice to do event processing at scale for a lot of companies. In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions where the events are processed as soon as they arrive in a streaming manner and that provides flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share the real world usage of Pulsar Functions.

Serverless Event Streaming with Pulsar Functions

StreamNative

High performance messaging with Apache Pulsar

Matteo Merli

Apache Kafka - Martin Podval

Martin Podval

Transaction preview of Apache Pulsar

StreamNative

Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...

StreamNative

Introduction Apache Kafka

Joe Stein

What's new in apache pulsar 2.4.0

StreamNative

Kafka blr-meetup-presentation - Kafka internals

Ayyappadas Ravindran (Appu)

In this presentation, we will cover: - How to performance test and optimize a Pulsar cluster. We will present how we load tested Pulsar with locust and, following this, how we tuned our configurations for our use cases. - Event sourcing pattern with Apache Pulsar. Avro schema usage, compatibility choices and schema evolution on pulsar topics that worked for us. - Bonus: How we source Apache Flink from apache pulsar and run our workflows. By attending this webinar, you can expect to come away with: - How to performance test a Pulsar cluster for your use case. - How to leverage the highly configurable broker and Bookkeeper to suit your needs. - Event sourcing patterns on top of Apache Pulsar. - Avro schema usage, compatibility choices, and evolution. - Familiarise with pulsar connector for Flink and possible use cases.

Lessons from managing a Pulsar cluster (Nutanix)

StreamNative

Whether you are deploying a new application in Microservices or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of API’s – or – using an event bus for better durability, reliability and extensibility of your application. If you choose to go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro Schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change and much, much more. In this talk we will discuss all the nuances and considerations around using Avro Schemas for your JSON event payloads. From developer tools, to DevOps approaches, versioning, governance and some “gotchas” we found when working with Avro Schemas and the Confluent Schema Registry.

Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...

HostedbyConfluent

Kafka-on-Pulsar has been one of the most anticipated features in the Pulsar ecosystem. The Kafka-on-Pulsar project was initiated by StreamNative and the OVHCloud team quickly joined the project to collaborate on its development. Kafka-on-Pulsar enables Kafka applications to leverage Pulsar’s powerful features, such as streamlined operations with enterprise-grade multi-tenancy, without modifying code. In this webinar, Sijie Guo, from StreamNative, and Pierre Zemb, from OVHCloud, will introduce KoP and discuss the following: 1. What are the key benefits? 2. What is the protocol handler and how does it work? 3. How KoP is implemented? 4. What are the new use cases it unlocks? 5. Watch a Live Demo!

Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...

StreamNative

Open keynote_carolyn&matteo&sijie

StreamNative

Effectively-once semantics in Apache Pulsar

Matteo Merli

kafka

Amikam Snir

Pulsar is a great technology, but it is also a new, less well-known technology competing against incumbent technologies, which is always a bit of a tough sell. In this talk, we will go over the whole end-to-end process of how we researched, advocated, built, integrated, and established Apache Pulsar at Instructure in less than a year. We will share details of how Pulsar's capabilities differentiate it, how we deploy Pulsar, and how we focused on an ecosystem of tools to accelerate adoption. We will also discuss one major motivating use case of change-data-capture for hundreds of databases servers at scale.

Getting Pulsar Spinning_Addison Higham

StreamNative

Yahoo Japan Corporation has been using Apache Pulsar as a centralized pub-sub messaging platform for more than 3 years. We adopted Pulsar because of its great performance, scalability and multi-tenancy capability. It plays an important role to provide our 100+ services in various areas such as e-commerce media, advertising and more. Recently, we addressed to solve our new use case: A large scale log pipeline. In our production environment, we are starting to run a lot of our services on container environments. Our goal is to send all logs and metrics from application containers to various monitoring or analyzing platforms. We expect Pulsar to keep its performance even in tremendously high traffic volume situations (i.e. in tens of Gbps). In this presentation, we will talk about our architecture design, producer/consumer side implementation and the result of performance test. We will also share our experience and knowledge from our production environment operations for more than 3 years. Takeaway: - Practical use case of Apache Pulsar on production - Knowledge of operating Apache Pulsar for large scale data stream

Large scale log pipeline using Apache Pulsar_Nozomi

StreamNative

Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer that does message processing and dispatching, from the storage layer that handles persistent message storage, using Apache Bookkeeper. This separation of concerns leads to a very efficient design, in terms of performance and cost. Messaging systems that provide guaranteed delivery, when used in production use cases, impose on the underlying storage, demands that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. The strategy of I/O isolation provides better performance from each storage node at less cost, and the separation between computing and storage means that compute nodes can be scaled independently from storage. Irrespective of the choice of storage, Pulsar can be configured to get the best performance for any of those storage configurations. This paper also discusses how some of the latest technologies like NVMe and Persistent Memory can be leveraged at a very low cost overhead, by Pulsar, without any architectural or design changes, with some data from real use cases. The fundamental choice of using Bookkeeper as the storage layer for Pulsar is validated from our experience.

Pulsar Storage on BookKeeper _Seamless Evolution

StreamNative

La actualidad más candente (20)

Pulsar - Distributed pub/sub platform

Building event streaming pipelines using Apache Pulsar

Building a FaaS with pulsar

Serverless Event Streaming with Pulsar Functions

High performance messaging with Apache Pulsar

Apache Kafka - Martin Podval

Transaction preview of Apache Pulsar

Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...

Introduction Apache Kafka

What's new in apache pulsar 2.4.0

Kafka blr-meetup-presentation - Kafka internals

Lessons from managing a Pulsar cluster (Nutanix)

Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr...

Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu...

Open keynote_carolyn&matteo&sijie

Effectively-once semantics in Apache Pulsar

kafka

Getting Pulsar Spinning_Addison Higham

Large scale log pipeline using Apache Pulsar_Nozomi

Pulsar Storage on BookKeeper _Seamless Evolution

Similar a Integrating Apache Pulsar with Big Data Ecosystem

As organizations are getting better at capturing streaming data and the data velocity and volume are ever-increasing, the traditional messaging queues or log storage systems are suffering from scalability or operational and maintenance problems. Apache Pulsar is a multi-tenant, high-performance distributed pub-sub messaging system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. In this talk, I will use one of the most popular stream processing engines, Apache Flink, as an example, to share our experience in building a stream processing and storage stack. Some of the traits are: * How to ensure end-to-end exactly-once semantics based on Pulsar's durable and replayable storage as well as Pulsar transaction. * How to implement Pulsar topics as infinite tables based on Pulsar's schema. * How to efficiently store stream states in Pulsar based on Pulsar's layered storage API. * A usage scenario that chaining all functionalities in the streaming platform.

Virtual Flink Forward 2020: Build your next-generation stream platform based ...

Flink Forward

Apache Content Technologies

gagravarr

Pub-Sub messaging is a very convenient abstraction that allows system and application developers to decouple components and let them communicate, by acting as durable buffer for transient data, or as a persistent log from where to recover after crashes. This talk will present an overview of Apache Pulsar, the reasons that led to its development and how it enabled many teams at Yahoo and to build scalable and reliable applications. Apache Pulsar has become the defacto pub-sub messaging at Yahoo serving 100+ applications and processing 100’s of billions of messages for over 3+ years. In this talk, we will explore in detail different categories of use cases that highlight how Pulsar can be applied to solve a broad range of problems thanks to its flexible messaging model that supports both queuing and streaming semantics with a focus on durability and transaction guarantees.

Pulsar - flexible pub-sub for internet scale

Matteo Merli

Messaging, storage, or both? The real time story of Pulsar and Apache Distri...

Streamlio

A presentation cum workshop on Real time Analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging while other side Spark Streaming brings Spark's language-integrated API to stream processing, allows to write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper and Spark with a Web click streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.

Real time Analytics with Apache Kafka and Apache Spark

Rahul Jain

Within the ASF, there are a wide variety of projects with technologies to help you store, retrieve, host, transform and generate content. This talk will review the landscape of Apache content technologies, provide a quick introduction to the more common and more interesting projects, and flag up new and innovative features within them. It'll also highlight talks from the rest of the week on many of the projects covered, so that you'll know where and when to go to learn more about those projects and technologies which catch your eye!

If You Have The Content, Then Apache Has The Technology!

gagravarr

Lessons Learned: Using Spark and Microservices

Alexis Seigneurin

At Hootsuite, we've been transitioning from a single monolithic PHP application to a set of scalable Scala-based microservices. To avoid excessive coupling between services, we've implemented an event system using Apache Kafka that allows events to be reliably produced + consumed asynchronously from services as well as data stores. In this presentation, I talk about: - Why we chose Kafka - How we set up our Kafka clusters to be scalable, highly available, and multi-data-center aware. - How we produce + consume events - How we ensure that events can be understood by all parts of our system (Some that are implemented in other programming languages like PHP and Python) and how we handle evolving event payload data.

Building an Event Bus at Scale

jimriecken

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example

confluent

Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Jeremy Beard

Fundamentals and Architecture of Apache Kafka

Angelo Cesaro

Presto At Treasure Data

Taro L. Saito

Hands-on Workshop: Apache Pulsar

Sijie Guo

One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...

Tim Vaillancourt

Drupal performance

Piyuesh Kumar

Cloud computing UNIT 2.1 presentation in

RahulBhole12

Ruby and Distributed Storage Systems

SATOSHI TAGOMORI

Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019

confluent

Serverlesss Big Data Analytics with Amazon Athena and Quicksight

Amazon Web Services

Sa introduction to big data pipelining with cassandra & spark west mins...

Simon Ambridge

Similar a Integrating Apache Pulsar with Big Data Ecosystem (20)

Virtual Flink Forward 2020: Build your next-generation stream platform based ...

Apache Content Technologies

Pulsar - flexible pub-sub for internet scale

Messaging, storage, or both? The real time story of Pulsar and Apache Distri...

Real time Analytics with Apache Kafka and Apache Spark

If You Have The Content, Then Apache Has The Technology!

Lessons Learned: Using Spark and Microservices

Building an Event Bus at Scale

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example

Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Fundamentals and Architecture of Apache Kafka

Presto At Treasure Data

Hands-on Workshop: Apache Pulsar

One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...

Drupal performance

Cloud computing UNIT 2.1 presentation in

Ruby and Distributed Storage Systems

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019

Serverlesss Big Data Analytics with Amazon Athena and Quicksight

Sa introduction to big data pipelining with cassandra & spark west mins...

Más de StreamNative

So, you are a responsible software engineer building microservices for Apache Kafka, and life is good. Eventually, you hear the community talking about the outstanding experience they are having with Apache Pulsar features. They talk about infinite event stream retention, a rebalance-free architecture, native support for event processing, and multi-tenancy. Exciting, right? Most people would want to migrate their code to Pulsar. Especially when you know that Pulsar also supports Kafka clients natively via the protocol handler known as KoP — which enables the Kafka client APIs on Pulsar. But, as said before, you are responsible; and you don't believe in fairy tales, just like you don't believe that migrations like this happen effortlessly. This session will discuss the architecture behind protocol handlers, what it means having one enabled on Pulsar, and how the KoP works. It will detail the effort required to migrate a microservice written for Kafka to Pulsar, and whether the code need to change for this.

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022

StreamNative

This talk describes Klaviyo’s internal messaging system, an asynchronous application framework built around Pulsar that provides a set of high-quality tools for building business-critical asynchronous data flows in unreliable environments. This framework includes: a pulsar ORM and schema migrator for topic configuration; a retry/replay system; a versioned schema registry; a consumer framework oriented around preventing message loss and in hostile environments while maximizing observability; an experimental “online schema change” for topics; and more. Development of this system was informed by lessons learned during heavy use of datastores like RabbitMQ and Kafka, and frameworks like Celery, Spark, and Flink. In addition to the capabilities of this system, this talk will also cover (sometimes painful) lessons learned about the process of converting a heterogenous async-computing environment onto Pulsar and a unified model.

Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...

StreamNative

In this talk, learn how Toast leverages our Envoy control-plane to manage blue-green deploys of Pulsar consumers, and how this has helped drive adoption across the engineering organization. Dive into the history of Pulsar at Toast, starting from its introduction in 2019 to provide event-driven architecture across a rapidly scaling restaurant software platform. We will detail some of the hurdles that we encountered gaining buy-in across a diverse set of teams, and dive deep into how we enforce best practices and integrate with our service control plane.

Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...

StreamNative

Event streaming architectures launched a reexamination of applications and systems architectures across the board. We live in a world where answers are needed now in a constant real-time flow. Yet beyond the event streaming system itself, what are the corequisites to ensure our large scale distributed database systems can keep pace with this always-on, always-current real time flow of data? What are the requirements and expectations for this next tech cycle?

Distributed Database Design Decisions to Support High Performance Event Strea...

StreamNative

Pulsar Functions is a succinct framework provided by Apache Pulsar to conduct real-time data processing. Its use cases include ETL pipeline, event-driven applications, and simple data analytics. While Pulsar Functions already provides an extremely simple programming interface, we want to further lower the barrier for users to access real-time data. Since SQL is one of the universal languages in the technology world and well accepted by the vast majority of data engineers, we decided to add a SQL expressing layer on top of Pulsar Functions runtime. In this talk, we will discuss the architecture and implementation of this new service. We will see how SQL syntax, Pulsar Functions, and Function Mesh can work together to deliver a unique user development experience for real-time data jobs in the cloud environment. We will also walk through use cases like filtering, routing, and projecting messages as well as integrating with the Pulsar IO Connectors framework.

Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022

StreamNative

Starting with version 2.10, the Apache ZooKeeper dependency has been eliminated and replaced with a pluggable framework that enables you to reduce the infrastructure footprint of Apache Pulsar by leveraging alternative metadata and coordination systems based on your deployment environment. In this talk, walk through the steps required to utilize the existing etcd service running inside Kubernetes to act as Pulsar's metadata store, thereby eliminating the need to run ZooKeeper entirely, leaving you with a Zookeeper-less Pulsar.

Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022

StreamNative

Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency. In this talk, learn how this can be validated for Apache Pulsar Kubernetes deployments. Various failures are injected using Chaos Mesh to simulate network and other infrastructure failure conditions. There are many questions that are asked about failure scenarios, but it could be hard to find answers to these important questions. When a failure happens, how long does it take to recover? Does it cause unavailability? How does it impact throughput and latency? Are the guarantees of no message loss and strong message ordering kept, even when components fail? If a complete availability zone fails, is the system configured correctly to handle AZ failures? This talk will help you find answers to these questions and apply the tooling and practices to your own testing and validation.

Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...

StreamNative

Despite what the Ghostbusters said, we’re going to go ahead and cross (or, join) the streams. This session covers getting started with streaming data pipelines, maximizing Pulsar’s messaging system alongside one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate the use of Flink SQL, which provides various abstractions and allows your pipeline to be language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), then this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink, (or even a specific programming language). The examples provided are built for highly complex systems, but the talk itself will be accessible to any experience level.

Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...

StreamNative

Apache Pulsar depends upon message acknowledgments to provide at-least-once or exactly-once processing guarantees. With these guarantees, any transmission between the broker and its producers and consumers requires an acknowledgment. But what happens if an acknowledgment is not received? Resending the message introduces the potential of duplicate processing and increases the likelihood of out or order processing. Therefore, it is critical to understand the Pulsar message redelivery semantics in order to prevent either of these conditions. In this talk, we will walk you through the redelivery semantics of Apache Pulsar, and highlight some of the control mechanisms available to application developers to control this behavior. Finally, we will present best practices for configuring message redelivery to suit various use cases.

Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022

StreamNative

Lakehouses are quickly growing in popularity as a new approach to Data Platform Architecture bringing some of the long-established benefits from OLTP world to OLAP, including transactions, record-level updates/deletes, and changes streaming. In this talk, we will discuss Apache Hudi and how it unlocks possibilities of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will present: - What Lakehouses are, and why they are needed. - What Apache Hudi is and how it works. - Provide a use-case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar.

Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...

StreamNative

Pulsar is a horizontally scalable messaging system, so the traffic in a logical cluster must be balanced across all the available Pulsar brokers as evenly as possible, in order to ensure full utilization of the broker layer. You can use multiple settings and tools to control the traffic distribution which requires a bit of context to understand how the traffic is managed in Pulsar. In this talk, we will walk you through the load balancing capabilities of Apache Pulsar, and highlight some of the control mechanisms available to control the distribution of load across the Pulsar brokers. Finally, we will discuss the various loading shedding strategies that are available. At the end of the talk, you will have a better understanding of how Pulsar's broker level auto-balancing works, and how to properly configure it to meet your workload demands.

Understanding Broker Load Balancing - Pulsar Summit SF 2022

StreamNative

Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...

StreamNative

In today’s world, we are seeing a big shift toward the Cloud. With this shift comes a big shift in the expectations we have for a messaging system, especially when the messaging system is presented as managed service in a large-scale, multi-tenant environment. For any large-scale enterprise, it’s very important to evaluate messaging system and be confident before expanding complex distributed data systems like Apache Pulsar from on-premise to elastically scalable, fully managed services on cloud services. We must consider aspects such as: migration from and integration with large-scale on-premise clusters, security, cost efficiency, and the cloud friendliness of the architecture, modeling cost and capacity, tenant isolation, deployment robustness, availability, monitoring, etc. Not every messaging system is built to be cloud-native and run as a managed service with cost efficiency. We have been running large-scale Apache Pulsar at Yahoo for the last 8 years on various platforms and hardware configurations while meeting application SLAs and serving more than 1M topics in a cluster. In this talk, we will talk about Pulsar’s journey in Yahoo! from an on-premise platform to a hybrid cloud and on-premise system. We will talk about Pulsar’s architecture and features that make Pulsar a good cloud-native messaging-system choice for any enterprise.

Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022

StreamNative

Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person. Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.

Event-Driven Applications Done Right - Pulsar Summit SF 2022

StreamNative

Our services team creates, builds, and maintains the as-a-service offerings for base platform services within our organization. Several thousand applications use these custom services daily, generating more than 700 million requests per minute. One of these services was our publish/subscribe offering, BQ, with a custom SDK and custom metrics, based on Apache Pulsar. BQ is the core communication service within our organization, handling more than 200M RPM. All the core processes of the organization depend on this service for operation: the CDC of any of our RDBMS or NoSQL offerings, all the eventing efforts of the organization, async communication between apps, notification systems, etc. The backend of the solution was Apache Pulsar running on EC2 on AWS, and on top of that we built several components as wrappers of the actual backend, creating our own SDKs and abstractions and in many ways extending the features provided by Pulsar. We had a multi-cluster setup 100% on AWS, with custom Pulsar Docker images running on large ASG setups, along with our own wrapping and admin APIs and DBs. All of this in turn made the solution volatile.

Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022

StreamNative

There is an increasing need to unleash analytical capabilities directly to the end-users to democratize decision-making. User-Facing Analytics is a new frontier that will shape the products of tomorrow and push the limits of existing technology. It demands a solution that will scale to millions of users to provide fast, real-time insights. In this session, Xiang will talk about his journey to build Apache Pinot to tackle the analytics problem space with the architectural changes and technology inventions made over the past decade. He will also talk about how other big data companies such as LinkedIn, Uber, and Stripe power their user-facing analytical applications.

Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022

StreamNative

Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022

StreamNative

Welcome and Opening Remarks - Pulsar Summit SF 2022

StreamNative

Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data. Milvus relies on Pulsar as the log pub/sub system. Pulsar helps Milvus reduce system complexity by decoupling each microservice and makes the system stateless by disaggregating log storage and computation, which also makes the system further extendable. We will introduce the overall design, the implementation details of Milvus, and its roadmap in this talk. Takeaways: 1) Get a general idea of what a vector database is and its real-world use cases. 2) Understand the major design principles of Milvus 2.0. 3) Learn how to build a complex system with the help of a modern log system like Pulsar.

Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...

StreamNative

MQTT (Message Queuing Telemetry Transport) is a message protocol based on the pub/sub model. Its compact message structure, low resource consumption, and high efficiency make it suitable for IoT applications with low bandwidth and unstable network environments. This session will introduce MQTT on Pulsar (MoP), which allows developers who use the MQTT transport protocol to use Apache Pulsar. I will share the architecture, principles, and future plans of MoP to help you understand Apache Pulsar's capabilities and practices in the IoT industry.

MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...

StreamNative

More from StreamNative (20)