Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. While the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, covering batch and streaming with the same API and delegating under the hood to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like.
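To make the unification concrete, here is a minimal sketch (not from the talk) of how a single Table API query can run in either batch or streaming mode simply by switching the environment settings; the "clicks" table is a hypothetical placeholder:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class UnifiedTableJob {
        public static void main(String[] args) {
            // Flip inStreamingMode() to inBatchMode(); the query below stays the same.
            EnvironmentSettings settings =
                    EnvironmentSettings.newInstance().inStreamingMode().build();
            TableEnvironment tEnv = TableEnvironment.create(settings);

            // "clicks" stands for any registered table, bounded or unbounded.
            tEnv.sqlQuery("SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id")
                .execute()
                .print();
        }
    }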
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang (Flink Forward)
Currently Flink supports the resource management systems YARN and Mesos. However, these were not designed for fast-moving cloud-native architectures and do not support mixed workloads (e.g. batch, streaming, deep learning, web services) particularly well. At the same time, Kubernetes is evolving very fast to fill those gaps and has become the de-facto orchestration framework, so running Flink on Kubernetes is a basic requirement for many users. In this talk, we will first quickly go through the Kubernetes architecture and the efforts we have made to run Flink on Kubernetes. Then we will deep-dive into the technical details of how to make Flink run natively on Kubernetes. Native means that Flink's KubernetesResourceManager calls the Kubernetes APIs directly to allocate and release TaskManager pods. Next we will share some practices for application lifecycle management and production optimizations (e.g. high availability, storage, networking). Finally, we will conclude the talk with the advantages of running Flink on Kubernetes and a simple demo. This talk is aimed at users and companies who are looking to run Flink on a Kubernetes cluster. We assume that listeners have some basic knowledge of cluster orchestration and containers.
At Yelp we run hundreds of Flink jobs to power a wide range of applications: push notifications, data replication, ETL, sessionizing, and more. Routine operations like deploys, restarts, and savepointing for so many jobs would take quite a bit of developers’ time without the right degree of automation. The latest addition to our toolshed is a Kubernetes operator managing the deployment and the lifetime of Flink clusters on PaaSTA, Yelp’s Platform As A Service.
We replaced our deployment framework, which launched Flink clusters on top of AWS EMR, with a Kubernetes operator managing fully Docker-ized Flink clusters. Compared to EMR, this architecture has allowed us both to drastically reduce the deployment time of our Flink clusters and to share our hardware resources more efficiently. In addition, we now offer our developers the same interface they are used to for running REST services, batch jobs, and many other workloads on PaaSTA.
This talk will give a brief overview of Yelp’s PaaSTA before diving into the details of how the Kubernetes operator has been implemented and how it has been integrated with Yelp developers’ workflow (deploys, logs, savepoints, upgrades, etc.), ending with a glimpse of the future features we are planning for the operator (Flink as a library, autoscaling, etc.).
We present FLOW, a web service that lets users do FLink On Web. FLOW aims to minimize the effort of hand-writing streaming applications; similar in spirit to Hortonworks Stream Analytics Manager, StreamAnalytix, and Nussknacker, it lets users drag and drop graphical icons representing streaming operators on a GUI.
FLOW builds on the Flink Table API and lets users assemble graphical icons associated with not only basic SQL operations but also advanced ones like window aggregation, temporal join, and pattern recognition (the MATCH_RECOGNIZE clause). Its data preview function lets users observe on screen how sample data changes before and after applying each operation. In addition, FLOW shows the sample data as time-series charts and geographical maps by interacting with Elasticsearch and Kibana. Therefore, domain experts with basic knowledge of SQL can design their streaming applications easily on a GUI, without an understanding of the Flink DataStream API or the Flink CEP library.
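As a rough illustration of the kind of Table API window aggregation FLOW could generate behind the scenes (a sketch under assumptions, not FLOW's actual output; the "rides" table and its event-time attribute "rideTime" are hypothetical):

    import static org.apache.flink.table.api.Expressions.$;
    import static org.apache.flink.table.api.Expressions.lit;

    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.Tumble;

    public class PopularPlaces {
        // Counts rides per area over 15-minute tumbling event-time windows.
        public static Table popularPlaces(TableEnvironment tEnv) {
            return tEnv.from("rides")
                    .window(Tumble.over(lit(15).minutes()).on($("rideTime")).as("w"))
                    .groupBy($("area"), $("w"))
                    .select($("area"), $("w").end().as("windowEnd"), $("area").count().as("cnt"));
        }
    }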
In this talk, we first present what motivated the development of FLOW, then show how FLOW can be used to solve the "Popular Places" exercise in its own style, and lastly explain how FLOW leverages the Flink Table API.
Flink Connector Development Tips & Tricks (Eron Wright)
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
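As a small taste of the lifecycle and fault-tolerance topics, here is a minimal checkpointed source, closely modeled on the counter example in the Flink documentation:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    // Emits an increasing counter and checkpoints its offset so a restart
    // resumes where it left off.
    public class CounterSource extends RichSourceFunction<Long> implements CheckpointedFunction {
        private volatile boolean running = true;
        private long offset = 0L;
        private ListState<Long> offsetState;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            while (running) {
                // Emit and advance the offset under the checkpoint lock for consistency.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(offset);
                    offset++;
                }
            }
        }

        @Override
        public void cancel() { running = false; }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            offsetState.clear();
            offsetState.add(offset);
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            offsetState = context.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("offset", Long.class));
            for (Long l : offsetState.get()) {
                offset = l;
            }
        }
    }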
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac..." (Flink Forward)
Within fintech, catching fraudsters is one of the primary opportunities for us to use streaming applications to apply ML models in real time. This talk is a review of our journey to bring fraud decisioning to our tellers at Capital One using Kafka, Flink, and AWS Lambda. We will share our learnings and experiences with common problems such as custom windowing, breaking a monolithic app down into small queryable-state apps, feature engineering with Jython, dealing with back pressure from combining two disparate streams, model/feature validation in a regulatory environment, and running Flink jobs on Kubernetes.
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...Flink Forward
Operationalizing machine learning models is never easy. Our team at Comcast has been challenged with operationalizing predictive ML models to improve customer care experiences. Using Apache Flink, we have been able to apply real-time streaming to all aspects of the machine learning lifecycle. This includes data feature exploration and preparation by data scientists, deploying live models to serve near-real-time predictions, and validating results for model retraining and iteration. We will share best practices and lessons learned from Flink’s role in our operationalized lifecycle, including:
• Executing as the “Prediction Pipeline” – a model container environment for near-real-time streaming and batch predictions
• Preparing streaming features and data sets for model training, as input for production model predictions, and for a continually-updated customer context
• Using connected streams and savepoints for “Live in the Dark”, multi-variant testing, and validation scenarios
• Incorporating Flink’s Queryable State as an approach to the online “Feature Store” – a data catalog for reuse by multiple models and use cases (a client-side sketch follows this list)
• Enabling versioned models, versioned feature sets, and versioned data through DevOps approaches.
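A minimal sketch of what a client-side lookup against such a Queryable State "Feature Store" might look like; the host, job id, state names, and types are hypothetical placeholders:

    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.api.common.JobID;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.queryablestate.client.QueryableStateClient;

    public class FeatureStoreLookup {
        public static Double lookup(String jobIdHex, String customerId) throws Exception {
            // 9069 is the default port of Flink's queryable-state proxy.
            QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);

            ValueStateDescriptor<Double> descriptor =
                    new ValueStateDescriptor<>("feature", Types.DOUBLE);

            // "feature-store" must match the name under which the job exposed its state.
            CompletableFuture<ValueState<Double>> future = client.getKvState(
                    JobID.fromHexString(jobIdHex), "feature-store",
                    customerId, Types.STRING, descriptor);

            return future.get().value();
        }
    }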
Virtual Flink Forward 2020: Keynote: The Evolution of Data Infrastructure at ... (Flink Forward)
Over the past few years almost all data processing has moved from batch to stream processing. This isn’t simply driven by a desire for lower latency, but by a fundamental understanding that streams are a more effective primitive for data processing, providing a better impedance match to varied downstream systems and services. Splunk, like many others, has been evolving its core data infrastructure to better provide a simpler and more consistent programming model, address correctness and latency of data, and allow for a more open integration model with our data platform. Throughout this process, we’ve come to view Apache Flink as a critical backbone in our core data infrastructure. Join us to learn more about how our data infrastructure - and how we think about it - has fundamentally changed.
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin... (Flink Forward)
Stream storage enables ingestion and permanent storage of continuously generated data. Implementing such a system correctly is an engineering endeavor. In this presentation, we share our experience designing and implementing core features of our stream store, Pravega, necessary to implement a complete and effective connector for Apache Flink. Pravega is a system for storing unbounded amounts of data per stream, while guaranteeing strong delivery semantics. It supports features including watermarking, transactions, scaling of streams in response to workload changes, and recent and historical data via a common API. While designing and implementing such features of Pravega, we soon realized that they look orthogonal on the surface, but actually interrelate with each other in intricate ways. For example, when using transactions to guarantee exactly-once semantics, time used for watermarks can only advance upon commit. As another example, scaling streams changes the degree of parallelism of streams dynamically, which directly impacts the context of transactions, the coordination of checkpoints, and the time bounds exposed to support watermarks. All these mechanisms need to be aware of dynamic changes to streams. The lessons we learned from Pravega and share in this presentation are fundamental, and we expect them to be broadly applicable and of interest to a stream processing audience. Pravega is a fully open-source project and welcomes external contributions.
Flink Forward San Francisco 2018: Jörg Schad and Biswajit Das - "Operating Fl..." (Flink Forward)
Flink has officially supported Apache Mesos since the 1.2 release, and many users were using them together even before that. The latest releases, 1.4 and 1.5 (not yet released at the time of writing), add a deeper integration for resource schedulers such as Mesos, which has also resulted in many new features around this integration. But what does that mean in practice for operating a large cluster? In this talk, we will discuss operational best practices (alongside some pitfalls) for operating large Flink clusters on top of Apache Mesos, including topics such as:
- Deployments
- Monitoring
- Scaling
- Upgrades
- Debugging
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customers and the wider community, we have seen great success stories and we have seen things go wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing – Apache Flink-specific as well as across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... (Flink Forward)
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m..." (Flink Forward)
“Customer experience is the next big battleground for telcos,” Amit Akhelikar, Global Director of Lynx Analytics, recently proclaimed at TM Forum Live! Asia in Singapore. But how to fight in this battle? A common approach has been to keep “under control” some well-known network quality indicators, like dropped calls, radio access congestion, availability, and so on; but this has proven not to be enough to keep customers happy, just as a siege weapon is not enough to conquer a city. But what if it were possible to know how customers perceive services, at least the most demanded ones, like web browsing or video streaming? That would be like a squad of archers ready for battle. And even with that, how to extract value from it and take action in no time, giving our skilled archers the right targets? Meet CANVAS (Customer And Network Visualization and AnalyticS), one of the first LATAM implementations of a Flink-based stream processing use case for a telco, which successfully combines leading and innovative technologies like Apache Hadoop, YARN, Kafka, NiFi, Druid, and advanced visualizations with Flink core features like non-trivial stateful stream processing (joins, windows, and aggregations on event time) and CEP capabilities for alarm generation, delivering a next-generation tool for SOC (Service Operation Center) teams.
Deploying Flink on Kubernetes - David Anderson (Ververica)
Kubernetes has rapidly established itself as the de facto standard for orchestrating containerized infrastructures. And with the recent completion of FLIP-6, the refactoring of Flink's deployment and process model, Kubernetes has become a natural choice for Flink deployments. In this talk we will walk through how to get Flink running on Kubernetes.
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment.
The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: ad-hoc query processing should be a composable layer that can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: in contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput, via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared-computation stream processing framework.
To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream delivers results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
Streaming your Lyft Ride Prices - Flink Forward SF 2019 (Thomas Weise)
At Lyft we dynamically price our rides using a combination of data sources, machine learning models, and streaming infrastructure for low latency, reliability, and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (say, by raising rates when there is a lot of demand) and to passengers (say, by offering a cheaper rate for returning 10 minutes later). The streaming platform powers pricing by bringing together the best of both worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach..." (Flink Forward)
Apache Flink is a popular framework for real-time stream computing. Many stream-compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program, using historic data to seed the state before shifting to real-time data. This talk will discuss alternatives for bootstrapping programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream-compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
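One hedged sketch of the bootstrapping idea: union a bounded historic source with the live feed so keyed state is seeded with history before real-time results are emitted. HistoricLoginSource, LiveLoginSource, and LoginCountLast7Days are hypothetical placeholders, not Lyft's implementation:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BootstrappedLoginCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical sources: one replays the last 7 days of login events,
            // the other tails the live feed (e.g. a Kafka topic). Both emit user ids.
            DataStream<String> historic = env.addSource(new HistoricLoginSource());
            DataStream<String> live = env.addSource(new LiveLoginSource());

            // The union seeds the keyed state with history before live results flow.
            historic.union(live)
                    .keyBy(userId -> userId)
                    .process(new LoginCountLast7Days()) // hypothetical KeyedProcessFunction
                    .print();

            env.execute("bootstrapped-login-counts");
        }
    }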
Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin..." (Flink Forward)
Stream processing in conjunction with consistent, durable, reliable stream storage is kicking the revolution up a notch in Big Data processing. This modern paradigm is enabling a new generation of data middleware that delivers on the streaming promise of a simplified and unified programming model. From data ingest, transformation, and messaging to search, time series, and more, a robust streaming data ecosystem means we’ll all be able to more quickly build applications that solve problems we could not solve before.
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect... (Flink Forward)
HyperConnect's 1-to-1 video matchmaking system combines various machine learning techniques to maximize user satisfaction. Our matchmaking system manages a large user context containing actions from just a few seconds ago, and reacts in milliseconds to produce meaningful new results in each user session. This is difficult to do the traditional way, so distributed streaming is essential for these cases. Topics include:
- Why our team chose Apache Flink in comparison with alternatives
- Matchmaking streaming architecture, with detailed abstraction levels based on Flink operators
- Pairwise scoring microservice management with Flink
- Stateful matchmaking computation with low latency, fault tolerance, and scalability
- How to manage large-scale events: classifying feature types, collecting with a multi-window stream
- Applications: personalization, multi-armed bandits on streams
Scaling stream data pipelines with Pravega and Apache Flink (Till Rohrmann)
Extracting insights out of continuously generated data requires a stream processor with powerful data analytics features such as Apache Flink. A stream data pipeline with Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such stream data pipelines is coping with variations in the workload. Daily cycles and seasonal spikes might require the provisioning of the application to adapt accordingly. Pravega has a feature called stream scaling, which enables the capacity offered for the ingestion of events of a stream to grow and shrink over time according to workload. Such a feature is useful when the downstream application has the ability to accommodate such changes and scale its provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and show how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.
Flink Forward San Francisco 2019: Using Flink to inspect live data as it flow... (Flink Forward)
Using Flink to inspect live data as it flows through a data pipeline
One of the hardest challenges in authoring a data pipeline in Flink is understanding what your data looks like at each stage of the pipeline. Pipeline authors would love to answer questions like "why is no data coming through my filter?", "why did my regex not extract any fields?", or "is my pipeline even reading anything from Kafka?" Unit and integration testing of pipeline logic goes a long way, and metrics are another great tool for understanding what a pipeline is doing, but sometimes you need the data itself to answer why a pipeline is behaving the way it is.
To answer these questions for ourselves and our customers, at Splunk we created a simple yet robust architecture for extracting data as it moves through a pipeline. You'll learn about our implementation of this architecture, including the lessons learned while creating it, and how you can apply this architecture yourself. You'll hear about how to rewrite your Flink job graph at job submission time, how to retrieve data from all the nodes in the job graph, and how to expose this information to a user interface through a REST API.
One of the biggest challenges in streaming application development is making sure your pipeline does exactly what it is supposed to do. The combination of different data sources, sinks, and complex application behavior, such as time-based functionality or interaction with external systems, doesn’t make the problem of proper testing any easier. In this talk we show you some of the excellent testing utilities built into Flink that can be used to unit-test parts of your application and to integration-test complex data pipelines. We will also look at some external libraries developed by the community that can be used to further improve the testing experience and reduce time to production. Last but not least, we will share some tools and best practices that can help debug problems that managed to fall through the cracks. By the end of this talk you will be familiar with some of the best tools to test and debug your streaming pipelines, to give you extra confidence in your applications.
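For instance, keeping pipeline logic in small, named functions makes it testable without a running cluster, as in this minimal sketch modeled on the Flink testing documentation:

    import static org.junit.Assert.assertEquals;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.junit.Test;

    public class IncrementMapFunctionTest {

        // Factoring pipeline logic into its own class makes it testable in isolation.
        static class IncrementMapFunction implements MapFunction<Long, Long> {
            @Override
            public Long map(Long record) {
                return record + 1;
            }
        }

        @Test
        public void testIncrement() throws Exception {
            IncrementMapFunction incrementer = new IncrementMapFunction();
            assertEquals(Long.valueOf(3L), incrementer.map(2L));
        }
    }

For stateful operators and timers, Flink additionally ships operator test harnesses (e.g. KeyedOneInputStreamOperatorTestHarness) that drive elements, watermarks, and checkpoints through a single operator.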
The streaming space is evolving at an ever-increasing pace. This trend is also reflected in Apache Flink, whose latest major release again included many new features. For streaming practitioners it is essential to learn about Flink's newest capabilities, because they often enable completely new use cases and applications.
In this talk, I want to give a brief overview of Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins, and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture of Flink's current and future capabilities.
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (Flink Forward)
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
This is a talk that I gave at the Data Council Berlin Meetup on May 16th, 2019
Abstract:
Stream processing is being rapidly adopted by the enterprise. While in the past stream processing frameworks mostly provided Java- or Scala-based APIs, stream processing with SQL is growing increasingly popular because it makes stream processing accessible to non-programmers and significantly reduces the effort required to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. Fabian Hueske discusses the current state of Flink’s SQL support and explains the importance of Flink’s unified approach to process static and streaming data. After covering the basics, he shares common real-world use cases ranging from low-latency ETL to pattern detection and demonstrates how easily they can be addressed with Flink SQL.
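To give a flavor of such queries, here is a sketch of a low-latency windowed aggregation in Flink SQL; the "clicks" table is a hypothetical registered table with an event-time attribute:

    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class HourlyClicks {
        // Hourly click counts per user; the same query works on static and streaming input.
        public static Table hourlyCounts(TableEnvironment tableEnv) {
            return tableEnv.sqlQuery(
                    "SELECT user_id, " +
                    "       TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end, " +
                    "       COUNT(*) AS click_cnt " +
                    "FROM clicks " +
                    "GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)");
        }
    }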
What's new for Apache Flink's Table & SQL APIs? (Timo Walther)
About three years ago, the Apache Flink community started adding a Table & SQL API to process static and streaming data in a unified fashion. It makes data processing accessible to non-programmers and significantly reduces the effort to solve common tasks. Today, Flink SQL already powers production systems at Alibaba, Huawei, Lyft, and Uber. But we are only getting started! The community is currently re-shaping the future of data processing.
Even for followers of the Flink mailing lists, it can be quite difficult to keep track of all the developments happening in Flink's relational APIs. In this talk, we give an overview of recent contributions, such as pluggable optimizers, the new type system with consistent type inference, SQL DDL support, and the Python Table API. We elaborate on how all these efforts interact and discuss the future roadmap.
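As an illustration of the SQL DDL support mentioned above, a hedged sketch using the executeSql entry point of more recent Flink releases; the connector options are illustrative and vary by Flink version and connector:

    import org.apache.flink.table.api.TableEnvironment;

    public class ClicksDdl {
        public static void register(TableEnvironment tableEnv) {
            // Defines a Kafka-backed source table, including a watermark, entirely in SQL.
            tableEnv.executeSql(
                    "CREATE TABLE clicks (" +
                    "  user_id STRING," +
                    "  url STRING," +
                    "  event_time TIMESTAMP(3)," +
                    "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
                    ") WITH (" +
                    "  'connector' = 'kafka'," +
                    "  'topic' = 'clicks'," +
                    "  'properties.bootstrap.servers' = 'localhost:9092'," +
                    "  'format' = 'json'" +
                    ")");
        }
    }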
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink... (HostedbyConfluent)
Apache Kafka is one of the connectors most commonly used with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing, e.g. banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch execution.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe in depth the two-phase commit protocol implemented in KafkaSink and emphasize the difficulties we overcame when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
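For reference, a minimal sketch of configuring the Flink 1.14 KafkaSink for exactly-once delivery; the broker address, topic, and id prefix are placeholders, and builder method names vary slightly across Flink releases:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    public class ExactlyOnceSink {
        public static KafkaSink<String> build() {
            return KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("events")
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    // EXACTLY_ONCE writes inside Kafka transactions that are committed
                    // as part of Flink's two-phase commit on checkpoints.
                    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                    // Must be unique per application; Kafka's transaction.timeout.ms
                    // must exceed the checkpoint interval.
                    .setTransactionalIdPrefix("my-app")
                    .build();
        }
    }

The sink is then attached to a stream with stream.sinkTo(ExactlyOnceSink.build()).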
Unified Data Processing with Apache Flink and Apache Pulsar - Seth Wiesman (StreamNative)
Come learn how the combination of Apache Pulsar and Apache Flink is making stateful stream processing even more expressive and flexible, supporting streaming applications that were previously not considered streamable. The new world of applications and fast data architectures has broken up the database: raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Pulsar provides a strong solution for the event log, while Apache Flink forms a powerful foundation for computation over event streams.
We will discuss the key concepts behind Apache Flink's approach to stream processing and how it serves as a powerful abstraction for stateful event-driven applications. We will then see how to use Flink in conjunction with Apache Pulsar to create a unified data processing platform.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee... (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high-throughput and low-latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink and its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how the community is evolving Flink into a truly unified stream and batch processor, and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
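As a taste of the API levels mentioned above, a minimal DataStream word count in Java (the input values are illustrative):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be or not to be")
           // split each line into (word, 1) pairs
           .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (line, out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           // lambdas lose generic type information, so declare it explicitly
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("word-count");
    }
}
```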
OSMC 2019 | The Telegraf Toolbelt: It Can Do That, Really? by David McKayNETWAYS
Telegraf is an agent for collecting, processing, aggregating, and writing metrics.
With over 200 plugins, Telegraf can fetch metrics from a variety of sources, allowing you to build aggregations and write those metrics to InfluxDB, Prometheus, Kafka, and more.
In this talk, we will take a look at some of the lesser-known, but awesome, plugins that are often overlooked, as well as how to use Telegraf to monitor Cloud Native systems.
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...confluent
While many companies are embracing Apache Kafka as their core event streaming platform they may still have events they want to unlock in other systems. Kafka Connect provides a common API for developers to do just that and the number of open-source connectors available is growing rapidly. The IBM MQ sink and source connectors allow you to flow messages between your Apache Kafka cluster and your IBM MQ queues. In this session I will share our lessons learned and top tips for building a Kafka Connect connector. I’ll explain how a connector is structured, how the framework calls it and some of the things to consider when providing configuration options. The more Kafka Connect connectors the community creates the better, as it will enable everyone to unlock the events in their existing systems.
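To make the connector structure concrete, here is a minimal skeleton against the Kafka Connect source API. The config key and class names are illustrative; a real connector would partition its work across tasks and emit records from poll():

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Skeleton of a source connector; the "queue.name" option is illustrative. */
public class ExampleSourceConnector extends SourceConnector {
    private Map<String, String> settings;

    @Override public void start(Map<String, String> props) { this.settings = props; }

    @Override public Class<? extends Task> taskClass() { return ExampleSourceTask.class; }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // The framework asks the connector to split its work into at most
        // maxTasks configs; here every task simply gets a copy of the settings.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(new HashMap<>(settings));
        }
        return configs;
    }

    @Override public void stop() { }

    @Override
    public ConfigDef config() {
        // Declares the options users may set, with types and documentation.
        return new ConfigDef().define("queue.name", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "The queue to read messages from");
    }

    @Override public String version() { return "0.1.0"; }

    /** Tasks do the actual polling; this one emits nothing. */
    public static class ExampleSourceTask extends SourceTask {
        @Override public String version() { return "0.1.0"; }
        @Override public void start(Map<String, String> props) { }
        @Override public List<SourceRecord> poll() { return null; } // null = nothing to emit yet
        @Override public void stop() { }
    }
}
```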
Webinar: Flink SQL in Action - Fabian HueskeVerverica
Stream processing is rapidly adopted by the enterprise. While in the past, stream processing frameworks mostly provided Java or Scala-based APIs, stream processing with SQL is recently gaining a lot of attention because it makes stream processing accessible to non-programmers and significantly reduces the effort to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. In this talk, I will discuss the current state of Flink’s SQL support and explain the importance of Flink’s unified approach to process static and streaming data. Once the basics are covered, I will present common real-world use cases ranging from low-latency ETL to pattern detection and demonstrate how easily they can be addressed by Flink SQL.
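As a flavor of what such SQL pipelines look like, a minimal sketch written against a recent Flink Table API; the datagen table is a stand-in for a real source:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlInAction {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A throwaway source backed by Flink's built-in data generator.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_name STRING," +
                "  url STRING" +
                ") WITH (" +
                "  'connector' = 'datagen'," +
                "  'rows-per-second' = '5'" +
                ")");

        // A continuous query: per-user click counts, updated as rows arrive.
        tEnv.executeSql(
                "SELECT user_name, COUNT(url) AS cnt FROM clicks GROUP BY user_name")
            .print();
    }
}
```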
We discuss existing and new hardware virtualization features. First, we review the existing hardware features that are not used by Xen today, showing example use cases. 1) The "descriptor-table exiting" control should be useful for guest kernels or a security agent to enhance security features. 2) The VMX-preemption timer allows the hypervisor to preempt guest VM execution after a specified amount of time, which is useful for implementing fair scheduling. The hardware can save the timer value on each successive VM exit, after setting the initial VM quantum. 3) VMFUNC is an operation provided by the processor that can be invoked from VMX non-root operation without a VM exit. Today, EPTP switching is available, and we discuss how we can use the feature. Second, we talk about new hardware features, especially interrupt optimizations.
NETCONF & YANG Enablement of Network DevicesCisco DevNet
A technical discussion and a demo showing how Tail-f's ConfD management agent can be used to implement NETCONF and YANG, the industry-leading solution for providing a programmable management interface in a network element. ConfD is recognized as the best-in-breed embedded software for implementing management functions in network elements, including physical devices and virtualized network functions (VNF) for NFV.
This Workshop is a best fit for engineers who are involved in the design and development of embedded software for network devices. Attendees will gain a basic understanding of what NETCONF and YANG are and how ConfD provides a solution for embedding this technology in the network devices. More information about ConfD can be found at: https://developer.cisco.com/site/confD/
Watch the DevNet 1216 replay from the Cisco Live On-Demand Library at: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=92703&backBtn=true
Check out more and register for Cisco DevNet: http://ow.ly/jCNV3030OfS
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streaming SQL, Time-Versioned Tables, cluster-library duality, language portability, etc.
We will take a sneak peek into our crystal ball and present what the Flink community is working on next.
Similar to Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Ververica
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (our internal document store), and feature management. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery, and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn, and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
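One widely used mitigation for key skew, not necessarily the speakers' solution, is two-stage aggregation over salted keys; a minimal sketch (window sizes and salt fan-out are arbitrary, and with this toy bounded input the processing-time windows may not fire — the pattern is what matters):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env.fromElements("hot", "hot", "hot", "cold");

        // Stage 1: salt the key so one hot key is spread over 8 parallel sub-keys.
        DataStream<Tuple2<String, Long>> partials = events
                .map(k -> Tuple2.of(k + "#" + ThreadLocalRandom.current().nextInt(8), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1);

        // Stage 2: strip the salt and merge the partial counts per original key.
        partials
                .map(t -> Tuple2.of(t.f0.substring(0, t.f0.indexOf('#')), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                .print();

        env.execute("salted-aggregation");
    }
}
```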
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by Nico Kruber
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by Thomas Weise
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by Mason Chen
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options, we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as our unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest now supports 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. The journey started back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We'll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out the UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step toward stateful upgrades.
by David Anderson
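A minimal sketch of such a hybrid pipeline, written against a recent Flink release (1.13+ for toChangelogStream); the data is illustrative:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Start in the DataStream API...
        DataStream<String> words = env.fromElements("flink", "table", "flink");
        tEnv.createTemporaryView("words", words);

        // ...switch to SQL for the declarative part (atomic values become column f0)...
        Table counts = tEnv.sqlQuery(
                "SELECT f0 AS word, COUNT(*) AS cnt FROM words GROUP BY f0");

        // ...and come back as a changelog stream carrying inserts and updates.
        DataStream<Row> changelog = tEnv.toChangelogStream(counts);
        changelog.print();

        env.execute("hybrid-pipeline");
    }
}
```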
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by Sijie Guo & Neng Lu
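A hedged sketch of what the described SQL connector usage could look like; the WITH option keys are assumptions modeled on StreamNative's connector documentation, so check your connector version for the exact names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Option keys below are assumptions; verify against your connector docs.
        tEnv.executeSql(
                "CREATE TABLE pulsar_events (" +
                "  user_name STRING," +
                "  action STRING" +
                ") WITH (" +
                "  'connector' = 'pulsar'," +
                "  'topics' = 'persistent://public/default/events'," +
                "  'service-url' = 'pulsar://localhost:6650'," +
                "  'admin-url' = 'http://localhost:8080'," +
                "  'format' = 'json'" +
                ")");

        // A continuous aggregation over the Pulsar-backed table.
        tEnv.executeSql("SELECT action, COUNT(*) FROM pulsar_events GROUP BY action").print();
    }
}
```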
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Streaming:
Time in the data stream must be quasi-monotonic to produce time progress (watermarks; see the sketch after these notes)
Always have close-to-latest incremental results
Resource requirements change over time
Recovery must catch up very fast
Batch:
Order of time in the data does not matter (parallel unordered reads)
Bulk operations (two-phase hash/sort)
Longer time for recovery (no low-latency SLA)
Resource requirements change fast throughout the execution of a single job
Understanding this difference will help later, when we discuss scheduling changes.
Different requirements
Optimization potential for batch and streaming
Also: historic developments and slow-changing organizations
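To make the watermark point above concrete: in Flink's later (1.11+) WatermarkStrategy API, bounded out-of-orderness is exactly this quasi-monotonic assumption. A minimal sketch; the Event type is a stand-in:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {
    /** Minimal event type for the example. */
    public static class Event {
        public long timestamp;
        public String payload;
    }

    public static WatermarkStrategy<Event> strategy() {
        // Tolerate up to 5 seconds of out-of-orderness; beyond that, records
        // count as late. This is what quasi-monotonic time buys you: watermarks
        // can advance while still allowing bounded disorder in the stream.
        return WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.timestamp);
    }
}
```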
You have to decide between DataSet and DataStream when writing a job
Two (slightly) different APIs, with different capabilities
Different set of supported connectors: no Kafka DataSet connector, no HBase DataStream connector
Different performance characteristics
Different fault-tolerance behavior
Different scheduling logic
With Table API, you only have to learn one API
Still, the set of supported connectors depends on the underlying execution API
Feature set depends on whether there is an implementation for your underlying API
You cannot combine more batch-y with more stream-y sources/sinks
A “soft problem”: with two stacks of everything, less developer power goes into each individual stack, which means fewer features, worse performance, and more bugs that get fixed more slowly. (The sketch below shows the same job in both APIs.)
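Since the split shows up in the very first line of code a user writes, here is a minimal side-by-side sketch of the same aggregation in both APIs (the data is illustrative):

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoStacks {
    public static void main(String[] args) throws Exception {
        // DataSet (batch) flavor: bounded input, groupBy on a field position.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 1))
                .groupBy(0)
                .sum(1)
                .print();  // print() triggers execution in the DataSet API

        // DataStream flavor of the same logic: keyBy instead of groupBy,
        // and an explicit execute() call to run the job.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 1))
                .keyBy(t -> t.f0)
                .sum(1)
                .print();
        streamEnv.execute("stream-flavor");
    }
}
```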
Recall the earlier processing-styles slide:
batch wants step by step
streaming is all at once
This has been mentioned a lot.
Lyft has given a talk about this at last FF
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Batch:
random reads
coordinated by the JobManager
Streaming:
sequential reads
no coordination between sources
The new source interface must support both batch and streaming use cases, allow Flink to be clever, be able to deal with event time, watermarks, and source idiosyncrasies, and enable snapshotting
This should enable new features: generic idleness detection, event-time alignment*
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Talk about how this will enable event-time alignment for sources in a generic way (a usage sketch follows below)
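This unified source design eventually shipped as FLIP-27. The sketch below shows it from the user side with the FLIP-27-based KafkaSource from a later Flink release (broker and topic are placeholders); note how the same builder serves streaming and, with the commented-out bound, batch:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker-1:9092")       // placeholder address
                .setTopics("events")                        // placeholder topic
                .setStartingOffsets(OffsetsInitializer.earliest())
                // Same source, batch semantics: stop at the current end offsets.
                // .setBounded(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        env.execute("unified-source");
    }
}
```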
Mention here that you can basically build a job jar that includes flink-runtime and execute it any way you want: put it in Docker, Spring Boot, or just start multiple instances of it.
As-a-library mode
Note that this nicely jibes with the pull-based model. Enables the things we need for batch.
Mention the dog with the hose. Sources just keep spitting out records as fast as they can.
Possibly put these on separate slides, with fewer words. Or even some graphics.
There are some quirks when you use DataStream for batch (a sketch of how this eventually landed follows below):
a groupReduce would be a window with a GlobalWindow
MapPartition would have to finalize things in close()
Joins would have to specify a global window
Of course, state requirements are bad for the naïve approach, i.e., large state and inefficient access patterns
Joins and grouping can be a lot faster with specific algorithms
Hash Join, Merge join, etc…
For example:
a different window operator
different join implementations
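For context on where these notes ended up: with FLIP-140, later Flink releases let a DataStream program opt into batch execution, which swaps in the specialized operators and algorithms discussed here. A minimal sketch against that later API (not part of this talk's codebase):

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchOnDataStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH mode swaps in blocking shuffles, staged scheduling, and
        // sort-based grouping instead of keeping everything in keyed state.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a")
           .map(w -> Tuple2.of(w, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("batch-on-datastream");
    }
}
```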
The scheduling stuff and networking would be a whole talk on their own. Memory management is another issue.
The pull-based operator model is how most databases were/are implemented.
Note how the pull model enables hash join, merge join, …
Side inputs benefit from a pull-based model
Bring the dog-drinking-from-hose example, also for Join operator
This will allow porting batch operators/algorithms to StreamOperator
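To illustrate why the pull model suits batch algorithms like hash joins, here is a generic Volcano-style sketch; this is not Flink code, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Minimal Volcano-style pull model: each operator pulls from its children. */
interface PullOperator<T> {
    /** Returns the next record, or null when the input is exhausted. */
    T next();
}

/** A hash join is natural in the pull model: drain the build side fully, then probe. */
class HashJoin implements PullOperator<String[]> {
    private final Map<String, List<String[]>> buildTable = new HashMap<>();
    private final PullOperator<String[]> probeSide;
    private final int probeKey;
    private Iterator<String[]> matches = Collections.emptyIterator();
    private String[] currentProbe;

    HashJoin(PullOperator<String[]> buildSide, PullOperator<String[]> probeSide,
             int buildKey, int probeKey) {
        this.probeSide = probeSide;
        this.probeKey = probeKey;
        // Build phase: pull the build side to completion before emitting anything.
        for (String[] row; (row = buildSide.next()) != null; ) {
            buildTable.computeIfAbsent(row[buildKey], k -> new ArrayList<>()).add(row);
        }
    }

    @Override
    public String[] next() {
        while (true) {
            if (matches.hasNext()) {
                // Emit the concatenation of the matched build row and probe row.
                String[] buildRow = matches.next();
                String[] out = new String[buildRow.length + currentProbe.length];
                System.arraycopy(buildRow, 0, out, 0, buildRow.length);
                System.arraycopy(currentProbe, 0, out, buildRow.length, currentProbe.length);
                return out;
            }
            currentProbe = probeSide.next();
            if (currentProbe == null) return null;  // probe side exhausted, join done
            matches = buildTable
                    .getOrDefault(currentProbe[probeKey], Collections.emptyList())
                    .iterator();
        }
    }
}
```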