Gaining actionable insights in real time lets organizations seize opportunities and avert threats. Sensing the world, detecting actionable insights, and acting on them is now easier than ever thanks to advances in streaming SQL. The topics discussed in this slide deck are listed below.
- Building stream processing applications using streaming SQL
- Deploying and monitoring streaming applications
- Scaling streaming applications
- Building domain-specific business UIs
- Visualizing stream processing outputs via dashboards
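To make this concrete, here is a minimal sketch of the kind of query such applications are built from, written in the streaming SQL dialect popularized by Apache Calcite; the Orders stream and its rowtime and productId columns are invented for illustration:

```sql
-- Count orders per product over one-hour tumbling windows,
-- emitting a row each time a window closes.
-- The Orders stream and its columns are hypothetical.
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS windowEnd,
  productId,
  COUNT(*) AS orderCount
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
```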
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
Streaming is necessary to handle IoT data rates and latency but SQL is unquestionably the lingua franca of data. Apache Samza and Apache Storm have new high-level query interfaces based on standard SQL with streaming extensions, both powered by Apache Calcite. Calcite's relational algebra allows query optimization and federation with data-at-rest in databases, memory, or HDFS.
A talk given by Julian Hyde at Hadoop Summit, San Jose, on 2016/06/29.
Why you care about relational algebra (even though you didn’t know it) by Julian Hyde
A talk given by Julian Hyde at Enterprise Data World in Washington, DC on April 2nd, 2015.
With data in different systems, in different formats, and accessed via different tools, we need a lingua franca for data. Not all tools speak SQL, and data cannot be moved into a single convenient location.
Relational algebra underpins SQL and many other DB languages. It is also perfect for optimizing, caching and mediating.
Apache Calcite (formerly Optiq) is a framework for building and optimizing expressions in relational algebra. We show how to write queries, optimize queries using rewrite rules, and write adapters for back-end systems. We also show how to configure Calcite to materialize queries, so your interactive analytics are effectively running against a fast in-memory database.
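As a rough illustration of that materialization step, the sketch below uses the CREATE MATERIALIZED VIEW syntax from Calcite's server module; the sales table and its columns are assumptions:

```sql
-- Materialize an aggregate once so later queries can be rewritten against it.
-- Table and column names are illustrative.
CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;

-- The planner can now answer this from the in-memory materialization
-- instead of rescanning the base table.
SELECT region, SUM(amount) FROM sales GROUP BY region;
```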
Streaming is a paradigm for data processing that is rapidly growing in popularity because it offers high throughput and low-latency responses and efficiently manages multitudes of IoT devices. Is it an alternative to database processing, or is it complementary? Julian Hyde argues for applying the database paradigm to streaming systems, using SQL as a high-level language for streaming. He presents streaming SQL, a superset of standard SQL developed in collaboration with several Apache projects, and the use cases it can solve, such as combining data in flight with historic data at rest. He also shows how query optimization techniques can make streaming applications more efficient.
A talk given by Julian Hyde at the 9th XLDB conference at SLAC, Menlo Park, on 2016/05/25.
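One of those use cases, combining data in flight with data at rest, can be written as an ordinary join in streaming SQL. A minimal sketch, assuming a hypothetical Orders stream and a Products table stored in a database:

```sql
-- Enrich each streaming order with static product data.
-- Orders (a stream) and Products (a table at rest) are illustrative names.
SELECT STREAM
  o.rowtime,
  o.orderId,
  p.productName,
  o.quantity * p.unitPrice AS total
FROM Orders AS o
JOIN Products AS p ON o.productId = p.productId;
```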
Apache Calcite: One Frontend to Rule Them All by Michael Mior
Apache Calcite is an open source framework that allows for a unified query interface over heterogeneous data sources. It provides an ANSI-compliant SQL parser, a logical query optimizer, and acts as a middleware layer that can integrate data from multiple sources. Calcite uses a relational algebra approach and has pluggable adapters that allow it to connect to different backends like MySQL, MongoDB, and streaming data sources. It supports features like SQL queries, views, optimization rules, and works across both batch and streaming data. The project aims to continue adding new capabilities like geospatial queries and improved cost modeling.
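A sketch of the kind of federated query this enables, assuming one schema backed by a MySQL adapter and another by a MongoDB adapter; the schema, table, and column names are hypothetical:

```sql
-- Join a relational table with a document collection through Calcite.
-- Where possible, filters and projections are pushed down to each backend.
SELECT c.name, SUM(s.amount) AS total_sales
FROM mysql_db.customers AS c
JOIN mongo_db.sales AS s ON c.id = s.customer_id
GROUP BY c.name;
```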
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident by Julian Hyde
This document discusses streaming SQL and how it can be used to query streaming data sources like IoT devices, web servers, and databases. Some key points discussed include:
- Streaming SQL extends standard SQL to work over both streaming and static data sources. It allows queries to be executed continuously over streaming data.
- The replay principle states that streaming queries should produce the same results as equivalent non-streaming queries over the same static data. Techniques like watermarks and monotonic columns help ensure this.
- Windowing functions allow aggregating over sliding windows of records in a stream. Various window types like tumbling and hopping windows are described.
- Apache Calcite is an open source framework that can optimize streaming SQL queries.
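To make the windowing point concrete, here is a sketch of a hopping-window aggregation in Calcite's streaming dialect; the SensorReadings stream and its columns are invented:

```sql
-- Hopping windows: 10 minutes wide, advancing every 5 minutes,
-- so each reading contributes to two overlapping windows.
SELECT STREAM
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE) AS windowEnd,
  AVG(temperature) AS avgTemp
FROM SensorReadings
GROUP BY HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '10' MINUTE);
```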
Cost-based query optimization in Apache Hive by Julian Hyde
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
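For readers who want to try CBO, a minimal sketch: hive.cbo.enable is the actual Hive configuration switch, while the tables are hypothetical. CBO leans on column statistics, so gathering them first matters:

```sql
-- Turn on cost-based optimization and collect the statistics it relies on.
SET hive.cbo.enable=true;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;

-- EXPLAIN shows the plan (including join order) chosen by the optimizer.
EXPLAIN
SELECT c.name, SUM(o.amount) AS total
FROM orders o JOIN customers c ON o.customer_id = c.id
GROUP BY c.name;
```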
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth... by Christian Tzolov
When working with big data and IoT systems we often feel the need for a common query language. System-specific languages usually take longer to adopt and are harder to integrate into existing stacks.
To fill this gap, some NoSQL vendors are building SQL access to their systems. Building a SQL engine from scratch is a daunting job, and frameworks like Apache Calcite can help you with the heavy lifting. Calcite allows you to integrate a SQL parser, a cost-based optimizer, and JDBC support with your NoSQL system.
We will walk through the process of building a SQL access layer for Apache Geode (an in-memory data grid). I will share my experience, pitfalls, and technical considerations, such as balancing between SQL/RDBMS semantics and the design choices and limitations of the data system.
Hopefully this will enable you to add SQL capabilities to your preferred NoSQL data system.
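As a rough idea of the end result, a query like the following could be issued over JDBC once such an adapter is in place; the "BookOrder" region and its fields are assumptions:

```sql
-- A Geode region exposed as a table through a Calcite adapter.
-- Filters and projections like these are natural candidates for
-- push-down into Geode's own query engine (OQL).
SELECT customerId, totalPrice
FROM "BookOrder"
WHERE totalPrice > 100
ORDER BY totalPrice DESC;
```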
Visualforce Remote Objects, Visual Workflow, and Developer Console received new features. Canvas Apps can now be added to page layouts and support SAML single sign-on. The Apex Flex Queue pilot allows submitting more batch jobs simultaneously. Push notifications can now be configured for Mobile SDK connected apps. Change Sets and deployment tools gained additional monitoring capabilities.
Harnessing the power of YARN with Apache Twill by Terence Yim
This document discusses Apache Twill, which aims to simplify developing distributed applications on YARN. Twill provides a Java thread-like programming model for YARN applications, avoiding the complexity of directly using YARN APIs. Key features of Twill include real-time logging, resource reporting, state recovery, elastic scaling, command messaging between tasks, service discovery, and support for executing bundled JAR applications on YARN. Twill handles communication with YARN and the Application Master while providing an easy-to-use API for application developers.
The Mechanics of Testing Large Data Pipelines (QCon London 2016) by Mathieu Bastian
A talk about testing large data pipelines, mostly inspired by my experience at LinkedIn working on relevancy and recommender system pipelines.
Abstract: Applied machine learning data pipelines are being developed at a very fast pace and often exceed traditional web/business application codebases in scale and complexity. The algorithms and processes these data workflows implement fulfill business-critical applications, which require robust and scalable architectures. But how do we make these data pipelines robust? When the number of developers and data jobs grows while the underlying data changes, how do we test that everything works as expected?
In software development we divide things into clean, independent modules and use unit and integration testing to prevent bugs and regressions. So why is it more complicated with big data workflows? Partly because these workflows usually pull data from dozens of sources out of our control and have a large number of interdependent data processing jobs, and partly because we don't yet know how to test them properly, or lack the right tools.
Streaming is necessary to handle data rates and latency but SQL is unquestionably the lingua franca of data. Where do the two meet?
Apache Calcite is extending SQL to include streaming, and the Samza, Storm, and Flink projects are each building it into their engines. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at the first Kafka Summit, San Francisco, 2016/04/26.
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Implementing Server Side Data Synchronization for Mobile Apps by Michele Orselli
The document describes an architecture and implementation for server-side data synchronization for mobile apps. It discusses syncing scenarios, challenges with the existing solution, and the new architecture and implementation. The key aspects covered are using GUIDs as unique identifiers, suggesting a "from" timestamp for incremental syncing, transferring record states instead of operations, and algorithms for resolving conflicts, including for hierarchical data by sorting by hierarchy and updating ids.
Omid: scalable and highly available transaction processing for Apache Phoenix by DataWorks Summit
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure correctness, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid—an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid, as well as Tephra, are now configurable choices for the Phoenix transaction processing backend, being enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration requires introducing many new features and operations to Omid and will become generally available early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Yahoo Research, Oath, Senior Research Scientist
James Taylor
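As a sketch of what the Phoenix side looks like, the TRANSACTIONAL and TRANSACTION_PROVIDER table properties below are real Phoenix options, while the table itself is invented:

```sql
-- Declare a transactional Phoenix table backed by the Omid provider.
CREATE TABLE accounts (
  id      BIGINT PRIMARY KEY,
  balance DECIMAL
) TRANSACTIONAL=true, TRANSACTION_PROVIDER='OMID';

-- Writes to the table now run with ACID guarantees.
UPSERT INTO accounts VALUES (1, 100.00);
```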
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications.
Implementing data sync apis for mobile apps @cloudconf by Michele Orselli
Today mobile apps are everywhere. These apps cannot count on a reliable and constant internet connection: working in offline mode is becoming a common pattern. This is quite easy for read-only apps, but it quickly becomes tricky for apps that create data in offline mode. This talk is a case study of a possible architecture for enabling data synchronization in these situations. Some of the topics covered will be:
- id generation
- hierarchical data
- managing different data types
- sync algorithm
Apache Beam is a unified programming model for batch and streaming data processing. It defines concepts for describing what computations to perform (the transformations), where the data is located in time (windowing), when to emit results (triggering), and how to accumulate results over time (accumulation mode). Beam aims to provide portable pipelines across multiple execution engines, including Apache Flink, Apache Spark, and Google Cloud Dataflow. The talk will cover the key concepts of the Beam model and how it provides unified, efficient, and portable data processing pipelines.
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami... by Flink Forward
We have built a Flink-based system to allow our business users to configure processing rules on a Kafka stream dynamically. Additionally it allows the state to be built dynamically using replay of targeted messages from a long term storage system. This allows for new rules to deliver results based on prior data or to re-run existing rules that had breaking changes or a defect. Why we submitted this talk: We developed a unique solution that allows us to handle on the fly changes of business rules for stateful stream processing. This challenge required us to solve several problems -- data coming in from separate topics synchronized on a tracer-bullet, rebuilding state from events that are no longer on Kafka, and processing rule changes without interrupting the stream.
Server side data sync for mobile apps with silex by Michele Orselli
Today mobile apps are everywhere. These apps cannot count on a reliable and constant internet connection: working in offline mode is becoming a common pattern. This is quite easy for read-only apps, but it quickly becomes tricky for apps that create data in offline mode. This talk is a case study of a possible architecture for enabling data synchronization in these situations. Some of the topics covered will be:
- id generation
- hierarchical data
- managing different data types
- sync algorithm
The document discusses new features and enhancements in Apache Hive 3, including:
1) Materialized views which allow optimizing workloads and queries without changing SQL.
2) Improved ACID support, which makes ACID tables as fast as regular tables.
3) Enhancements to constraints, defaults, and surrogate keys which help the optimizer produce better plans and improve data integrity.
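A sketch of the first item in Hive 3's own syntax; the tables are hypothetical, and note that automatic query rewriting requires the source table to be transactional:

```sql
-- Hive 3 can transparently rewrite matching queries to read this view.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT sale_date, SUM(amount) AS revenue
FROM sales
GROUP BY sale_date;
```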
WSO2 Complex Event Processor (CEP) 4.0 provides real-time analytics capabilities. Some key features of CEP 4.0 include a re-architecture of the CEP server, improvements to the Siddhi query engine (version 3.0), scalable distributed processing using Apache Storm, and better high availability support. CEP 4.0 also offers a domain specific execution manager, real-time dashboards, improved usability, and pluggable event receivers and publishers. It can execute queries in a distributed manner across multiple nodes for scalability.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data... by Flink Forward
http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
This document provides an overview of WSO2 Complex Event Processor (CEP). It discusses key CEP concepts like event streams, queries, and execution plans. It also demonstrates various query patterns for filtering, transforming, enriching, joining, and detecting patterns in event streams. The document outlines the architecture of CEP and shows how to define streams, tables, queries, and adaptors to integrate CEP with external systems. It provides examples of windowing, aggregations, functions, and extensions that can be used in Siddhi queries to process event streams.
(DEV204) Building High-Performance Native Cloud Apps In C++ by Amazon Web Services
The document provides an overview of the AWS SDK for C++, including its core features such as credential management, asynchronous requests, rate limiting, error handling, and memory allocation. It also discusses how to override the HTTP/TLS stack and integrate high-level APIs. The presentation encourages attendees to contribute high-level APIs and send pull requests to the SDK's GitHub repository.
Slides for my talk at Hadoop Summit Dublin, April 2016.
The talk motivates how streaming can subsume batch use cases, using continuous counting as an example.
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club by Data Con LA
Abstract:
Data engineering at Dollar Shave Club has grown significantly over the last year. In that time, it has expanded in scope from conventional web-analytics and business intelligence to include real-time, big data and machine learning applications. We have bootstrapped a dedicated data engineering team in parallel with developing a new category of capabilities. And the business value that we delivered early on has allowed us to forge new roles for our data products and services in developing and carrying out business strategy. This progress was made possible, in large part, by adopting Apache Spark as an application framework. This talk describes what we have been able to accomplish using Spark at Dollar Shave Club.
Bio:
Brett Bevers, Ph.D. Brett is a backend engineer and leads the data engineering team at Dollar Shave Club. More importantly, he is an ex-academic who is driven to understand and tackle hard problems. His latest challenge has been to develop tools powerful enough to support data-driven decision making in high value projects.
As the world moves into an era where data is the most valuable asset, efficiently processing large volumes of data in real time can give businesses a competitive advantage, and making business decisions within milliseconds has become mandatory in many domains. Streaming analytics plays a key role in making these decisions and is also a vital part of the digital transformation of businesses. WSO2 Stream Processor provides a high-performance, lean, enterprise-ready streaming solution to solve data integration and analytics challenges. It provides real-time, interactive, predictive, and batch processing technologies to deal with large volumes of data and generate meaningful decisions/output from it. This session explains how to enable digital transformation through streaming analytics and how easily streaming applications can be implemented.
- The Architecture of WSO2 Stream Processor
- Understanding streaming constructs
- Patterns of processing data in real time, incrementally, and with intelligence
- Applying patterns when building streaming apps
- Deployment patterns
[WSO2Con USA 2018] Patterns for Building Streaming Apps by WSO2
This slide deck explains how to enable digital transformation through streaming analytics and how easily streaming applications can be implemented.
Watch video: https://wso2.com/library/conference/2018/07/wso2con-usa-2018-patterns-for-building-streaming-apps/
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da... by WSO2
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will
* Plug in the WSO2 Analytics Platform to some common business use cases
* Showcase the numerous capabilities of the platform
* Demonstrate how to collect data, analyze, predict and communicate effectively
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da... by WSO2
The WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes the WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and persisted data. The platform can collect data from various sources, analyze it using Spark SQL for batch or Siddhi for real-time, and communicate results through alerts, dashboards, and APIs. It also features predictive analytics capabilities via machine learning algorithms.
The WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes the WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and stored data. The platform can collect data, perform batch analytics using Spark SQL, real-time analytics using Siddhi, and predictive analytics using machine learning models. It also supports dashboards, APIs, alerts, and other methods for communicating results.
[WSO2Con EU 2017] Streaming Analytics Patterns for Your Digital Enterprise by WSO2
The WSO2 analytics platform provides a high performance, lean, enterprise-ready, streaming solution to solve data integration and analytics challenges faced by connected businesses. This platform offers real-time, interactive, machine learning and batch processing technologies that empower enterprises to build a digital business. This session explores how to enable digital transformation by building a data analytics platform.
[WSO2Con Asia 2018] Patterns for Building Streaming Apps by WSO2
This slide deck explains how to enable digital transformation through streaming analytics and how easily streaming applications can be implemented.
Learn more: https://wso2.com/library/conference/2018/08/wso2con-asia-2018-patterns-for-building-streaming-apps/
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will
* Plug in the WSO2 Analytics Platform to some common business use cases
* Showcase the numerous capabilities of the platform
* Demonstrate how to collect data, analyze, predict and communicate effectively
* Demonstrate how it can analyze integration, security and IoT scenarios
Stick around till the end and you will walk away with the necessary skills to create a winning data strategy for your organization to stay ahead of its competition.
Sumedha Rubasinghe, Director of API Architecture, presented this talk at the API Strategy & Practice Conference in Chicago, where he illustrated how organisations can analyse who uses their APIs, and how statistics help in capacity planning, deployment, maintenance schedules, and trend analyses, as well as assist the decision-making process. The session discussed scalable collection of statistics for API ecosystems, key design considerations, real-time and offline analysis, and WSO2’s approach to dealing with these challenges.
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System by Accumulo Summit
Timely was born to visualize and analyze metric data at a scale untenable for existing solutions. We're returning to talk about what we've achieved over the past year, provide a detailed look at the production architecture, and discuss features added in that time, including alerting and support for external analytics.
– Speakers –
Drew Farris
Chief Technologist, Booz Allen Hamilton
Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he helps his client solve problems related to large scale analytics, distributed computing and machine learning. He is a member of the Apache Software Foundation and a contributing author to Manning Publications’ “Taming Text” and the Booz Allen Hamilton “Field Guide to Data Science”.
Bill Oley
Senior Lead Engineer, Booz Allen Hamilton
Bill Oley is a senior lead software engineer at Booz Allen Hamilton where he helps his clients analyze and solve problems related to large scale data ingest, storage, retrieval, and analysis. He is particularly interested in improving visibility into large scale systems by making actionable metrics scalable and usable. He has 16 years of experience designing and developing fault-tolerant distributed systems that operate on continuous streams of data. He holds a bachelor's degree in computer science from the United States Naval Academy and a master's degree in computer science from The Johns Hopkins University.
— More Information —
For more information see http://www.accumulosummit.com/
KPI definition with Business Activity Monitor 2.0 by WSO2
This document discusses key performance indicator (KPI) definition using WSO2 Business Activity Monitoring (BAM) 2.0. It provides an overview of BAM and the BAM 2.0 analytics design flow. It also presents two use cases - one for monitoring retail store performance and another for monitoring statistics across data centers and clusters. The document explains how to capture and analyze data, define indexes, create analyzer sequences, and design visualizations for these use cases in BAM 2.0.
Data to Insight in a Flash: Introduction to Real-Time Analytics with WSO2 Com... by WSO2
In this webinar, Sriskandarajah Suhothayan, technical lead at WSO2, will take a closer look at the following use cases:
Natural language processing capabilities of WSO2 CEP: Introducing basic constructs of the CEP
Analyzing a soccer game in Real time: Explaining how complicated scenarios can be implemented
Geo fencing capabilities of WSO2 CEP: Focusing on the CEP’s virtualization support
Complex Event Processor 3.0.0 - An overview of upcoming features by WSO2
This document provides an overview of upcoming features in WSO2 Complex Event Processor 3.0.0. It discusses the Siddhi CEP engine, event processing queries including filters, windows, joins, patterns and event tables. It also covers high availability, persistence, scaling, integration with BAM, and performance comparisons with Esper. The document concludes with a demo of monitoring stock prices and tweets to detect significant stock price changes when a company is highly discussed on Twitter.
The document provides information on performance testing processes and tools. It outlines 8 key steps: 1) create scripts, 2) create test scenarios, 3) execute load testing, 4) analyze results, 5) test reporting, 6) performance tuning, 7) communication planning, and 8) troubleshooting. It also discusses tools like LoadRunner, Controller, and Analysis for executing and analyzing tests. The document emphasizes having a thorough test process and communication plan to ensure performance testing is done correctly.
Introduction to WSO2 Data Analytics Platform by Srinath Perera
This document provides an introduction to the WSO2 Analytics Platform. It discusses how the platform allows users to collect data from various sources using a sensor API, then perform analysis on the data through both batch and real-time means. Batch analysis uses technologies like Apache Spark and Hadoop to perform tasks like finding averages, max/min, and building KPIs. Real-time analysis uses complex event processing to run queries over streaming data and detect patterns. The platform also enables predictive analytics using machine learning algorithms and anomaly detection. Results are then communicated through dashboards and alerts.
Azure Stream Analytics: Analyse Data in Motion by Ruhani Arora
The document discusses evolving approaches to data warehousing and analytics using Azure Data Factory and Azure Stream Analytics. It provides an example scenario of analyzing game usage logs to create a customer profiling view. Azure Data Factory is presented as a way to build data integration and analytics pipelines that move and transform data between on-premises and cloud data stores. Azure Stream Analytics is introduced for analyzing real-time streaming data using a declarative query language.
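A sketch of an Azure Stream Analytics query for the game-usage scenario; TIMESTAMP BY and TumblingWindow are part of the ASA query language, while the input/output names and fields are assumptions:

```sql
-- Count game-start events per title in five-minute tumbling windows.
SELECT
    GameTitle,
    COUNT(*) AS starts,
    System.Timestamp() AS windowEnd
INTO [profiling-output]
FROM [game-events] TIMESTAMP BY EventTime
WHERE EventType = 'GameStart'
GROUP BY GameTitle, TumblingWindow(minute, 5);
```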
This document discusses performance engineering for batch and web applications. It begins by outlining why performance testing is important. Key factors that influence performance testing include response time, throughput, tuning, and benchmarking. Throughput represents the number of transactions processed in a given time period and should increase linearly with load. Response time is the duration between a request and first response. Tuning improves performance by configuring parameters without changing code. The performance testing process involves test planning, creating test scripts, executing tests, monitoring tests, and analyzing results. Methods for analyzing heap dumps and thread dumps to identify bottlenecks are also provided. The document concludes with tips for optimizing PostgreSQL performance by adjusting the shared_buffers configuration parameter.
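As a pointer for that last tip, shared_buffers is a real PostgreSQL setting; the value below is purely illustrative:

```sql
-- Inspect and raise PostgreSQL's shared buffer cache.
SHOW shared_buffers;
ALTER SYSTEM SET shared_buffers = '4GB';
-- A restart is required before the new value takes effect.
```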
This document provides an overview and agenda for a webinar on the WSO2 CEP 3.1.0 product release. It introduces the presenters, Sriskandarajah Suhothayan and Lasantha Fernando. It also provides background on WSO2, an overview of the WSO2 CEP architecture including event streams, execution plans, and the Siddhi query language. The agenda outlines topics like event processing scenarios, new features in 3.1.0, high availability, and a demo.
Intro to Windows Server AppFabric
by Ron Jacobs, Senior Technical Evangelist at Microsoft
Windows Server AppFabric is a set of integrated technologies that make it easier to build, scale and manage Web and composite applications that run on IIS.
This presentation will help SQL Server developers and DBAs get up to speed on AppFabric. You'll also learn how Windows AppFabric caching can help you scale your Data Tier.
You will learn:
•The core capabilities of Windows Server AppFabric
•How the distributed nature of AppFabric’s cache allows large amounts of data to be stored in-memory for extremely fast access and help you scale your SQL Data Tier
•How to get started with Windows Server AppFabric
This document discusses implementing real-time IoT stream processing using Azure Stream Analytics. It introduces the Lambda architecture pattern for processing real-time and batch data streams. Azure Stream Analytics is presented as a tool for real-time stream processing that can ingest data from sources like IoT Hub and Event Hubs and output to databases, services, and functions. The document demonstrates Stream Analytics queries, windowing functions, and integrating with Azure Machine Learning models. It also discusses running Stream Analytics on IoT Edge devices to analyze and filter data locally.
Similar to Building Streaming Applications with Streaming SQL (20)
Learn SQL from basic queries to advanced queries by manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
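For example, a first aggregation query of the kind the foundations material covers; the orders table and its columns are invented:

```sql
-- Total and average order value per customer, biggest spenders first.
SELECT customer_id,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_order,
       SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;
```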
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... by Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performance by roli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Ipsos - AI - Monitor 2024 Report.pdf by Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Global Situational Awareness of A.I. and where it's headed by vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
3. Streaming Application
"An application that provides analytical operators to orchestrate data flow, calculate analytics, and detect patterns on event data from multiple, disparate live data sources to allow developers to build applications that sense, think, and act in real time."
- Forrester
8. ● Lightweight, lean, and cloud native
● Easy-to-learn streaming SQL
● High-performance analytics with just 2 nodes (HA)
● Native support for streaming machine learning
● Long-term aggregations without batch analytics
● Highly scalable deployment with exactly-once processing
● Tools for development and monitoring
● Tools for business users to write their own rules
Overview of WSO2 Stream Processor
9. WSO2 Stream Processor
• Editor/Studio - Developer environment
• Worker/Resource - Resource node
• Dashboard
– Portal - Business dashboard
– Business Rules Manager - Management console for business users
– Status Dashboard - Monitoring dashboard
• Manager - Job manager for distributed processing
Profiles
16. Stream Processing with WSO2 Stream Processor
Siddhi Streaming App
- Processes events in a streaming manner
- An isolated unit with a set of queries, input and output streams
- SQL-like query language

from Sales#window.time(1 hour)
select region, brand, avg(quantity) as AvgQuantity
group by region, brand
insert into LastHourSales;
[Diagram: a Siddhi app inside the Stream Processor, with input streams flowing through Filter, Aggregate, Join, Transform, and Pattern operators (plus Siddhi extensions) to output streams.]
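To make the app structure concrete, here is a minimal, self-contained Siddhi app built around the query above; the HTTP source, log sink, and the Sales stream schema are assumptions added for illustration:

@App:name('LastHourSalesApp')

-- Hypothetical HTTP/JSON source feeding the Sales input stream
@source(type = 'http', @map(type = 'json'))
define stream Sales (region string, brand string, quantity long);

-- Log sink so the output is visible while testing
@sink(type = 'log')
define stream LastHourSales (region string, brand string, AvgQuantity double);

-- Average quantity per region and brand over a sliding one-hour window
from Sales#window.time(1 hour)
select region, brand, avg(quantity) as AvgQuantity
group by region, brand
insert into LastHourSales;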
20. Stream Processor Studio
• Writing Siddhi applications
– Syntax highlighting
– Auto completion
– Error reporting
– Documentation support
• Debugging Siddhi apps
– Inspect events
– Inspect query states
Developer Environment
21. Stream Processor Studio
• Testing Siddhi apps via event simulation
– Send events one by one
– Simulate random data
– Simulate via CSV file
– Simulate from a database
• Support for running and testing in Python
– via PySiddhi
• IDE tools
– IntelliJ IDEA plugin
Developer Environment
29. Use Case 1: Production at each factory should not drop below 5,000 units per hour!
30. 1.1 Monitor and identify events that indicate low production
31. Total Amount Produced
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream
select sum(amount) as hourlyTotal
insert into LowProductionAlertStream;
Calculate the total amount produced over all time
32. Total Amount Produced in the Last Hour
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.time(1 hour)
select sum(amount) as hourlyTotal
insert into LowProductionAlertStream;
Calculate the total amount produced in the last hour
33. Amount Produced Per Product
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.time(1 hour)
select name, sum(amount) as hourlyTotal
group by name
insert into LowProductionAlertStream;
Calculate the total amount produced for each product
34. Identify Low Production Rates
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.time(1 hour)
select name, sum(amount) as hourlyTotal
group by name
having hourlyTotal < 5000
insert into LowProductionAlertStream;
Filter events where the produced amount is less than 5,000
35. Consider Working Hours in the Calculation
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.time(1 hour)
select name, sum(amount) as hourlyTotal,
       time:extract(currentTimeMillis(), 'HOUR') as currentHour
group by name
having hourlyTotal < 5000 and
       currentHour > 9 and currentHour < 17
insert into LowProductionAlertStream;
Use a function to extract the hour of the event arrival time
36. Rate-Limit Low Production Alerts
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.time(1 hour)
select name, sum(amount) as hourlyTotal,
       time:extract(currentTimeMillis(), 'HOUR') as currentHour
group by name
having hourlyTotal < 5000 and
       currentHour > 9 and currentHour < 17
output last every 15 min
insert into LowProductionAlertStream;
Send at most one alert every 15 minutes
38. Send Alerts via Email
@sink(type = 'email', to = 'manager@sf.com',
      subject = 'Low Production of {{name}}!',
      @map(type = 'text', @payload("""
Hi Manager,
Production of {{name}} has gone down to {{hourlyTotal}}
in the last hour!
From Sweet Factory""")))
define stream LowProductionAlertStream (name string, hourlyTotal double,
    currentHour int);
Context-sensitive email
39. Use Case 2: Raw material storage at the factories should be closely monitored
40. 2.1 Store raw material shipment details in a data store
41. Data Store Integration
● Allows store, retrieve, remove, and modify operations on the data store while processing events on the fly
● Provides a REST endpoint to query the data store
● Query optimization using primary and index keys
● Supported operations: Search, Insert, Delete, Update, Insert/Update (sketched below)
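As an illustration of the remove and modify operations, a minimal, hypothetical sketch (DiscardedShipmentStream and StockCorrectionStream are invented names; LatestShipmentDetailTable is the table defined on the following slides):

-- Remove a shipment record when a discard event arrives
from DiscardedShipmentStream
delete LatestShipmentDetailTable
on LatestShipmentDetailTable.name == name;

-- Overwrite the stored amount when a stock correction arrives
from StockCorrectionStream
select name, amount
update LatestShipmentDetailTable
on LatestShipmentDetailTable.name == name;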
42. Store Raw Material Info
@source(type = 'http', @map(type = 'json'))
define stream RawMaterialStream (name string, amount double);

define table LatestShipmentDetailTable (name string, amount double);
In-memory table to store the last shipment of each raw material
43. Store Data with a Primary Key & Index
@source(type = 'http', @map(type = 'json'))
define stream RawMaterialStream (name string, amount double);

@primaryKey('name')
@Index('amount')
define table LatestShipmentDetailTable (name string, amount double);
Support for primary keys and indexes for fast data access
44. Store in an External Data Store
@source(type = 'http', @map(type = 'json'))
define stream RawMaterialStream (name string, amount double);

@store(type = 'rdbms', ... )
@primaryKey('name')
@Index('amount')
define table LatestShipmentDetailTable (name string, amount double);
Table backed by RDBMS, MongoDB, HBase, Cassandra, Solr, Hazelcast, etc.
45. Insert Events into the Table
@source(type = 'http', @map(type = 'json'))
define stream RawMaterialStream (name string, amount double);

@store(type = 'rdbms', ... )
@primaryKey('name')
@Index('amount')
define table LatestShipmentDetailTable (name string, amount double);

from RawMaterialStream
select name, amount
insert into LatestShipmentDetailTable;
Insert into the table from the stream
46. Update or Insert Events into the Table
@source(type = 'http', @map(type = 'json'))
define stream RawMaterialStream (name string, amount double);

@store(type = 'rdbms', ... )
@primaryKey('name')
@Index('amount')
define table LatestShipmentDetailTable (name string, amount double);

from RawMaterialStream
select name, amount
update or insert into LatestShipmentDetailTable
on LatestShipmentDetailTable.name == name;
Update or insert (upsert) into the table from the stream
48. Streaming Data Summarization
Aggregations Over Long Time Periods
• Incremental aggregation for every second, minute, hour, day, ..., year
• Support for out-of-order event arrival
• Fast data retrieval from memory and disk for real-time updates
[Diagram: incremental aggregation hierarchy, where per-second buckets roll up into the current minute, minutes into the current hour, and so on up the time hierarchy.]
49. Define Aggregation
define stream RawMaterialStream (name string, amount double);

define aggregation RawMaterialAggregation
from RawMaterialStream
select name, sum(amount) as totalAmount, avg(amount) as averageAmount
group by name
aggregate every min ... year;
Calculate the total and average amounts at every granularity from minute up to year
50. Define Aggregation ...
define stream RawMaterialStream (name string, amount double);

@store(type = 'rdbms', ... )
define aggregation RawMaterialAggregation
from RawMaterialStream
select name, sum(amount) as totalAmount, avg(amount) as averageAmount
group by name
aggregate every min ... year;
Like tables, aggregations can be stored in RDBMS, MongoDB, HBase, Cassandra, Solr, Hazelcast, etc.
51. Data Retrieval API
• Performs data searches on data stores or pre-defined aggregations
• Supports both REST and Java APIs
52. Retrieve Summarized Data
Perform a REST call:

curl -X POST https://localhost:9443/stores/query \
  -H "content-type: application/json" \
  -u "admin:admin" \
  -d '{"appName" : "Sweet-Factory-Analytics-3",
       "query" : "from RawMaterialAggregation on name == \"caramel\" within \"2018-**-** **:**:**\" per \"minutes\" select name, totalAmount, averageAmount;"}'
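The same summaries can also be consumed from within a Siddhi app by joining a stream against the aggregation; a minimal sketch, assuming a hypothetical RawMaterialRequestStream and output stream (the within/per clauses mirror the REST query above):

define stream RawMaterialRequestStream (name string);

-- Retrieve per-day summaries for the requested material within a time range
from RawMaterialRequestStream as R
join RawMaterialAggregation as A
on A.name == R.name
within "2018-01-01 00:00:00 +05:30", "2018-12-31 23:59:59 +05:30"
per "days"
select A.name, A.totalAmount, A.averageAmount
insert into DailyRawMaterialStream;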
54. Portal
Dashboards & Widgets for Business Users
• Generate dashboards and widgets
• Fine-grained permissions
– Dashboard level
– Widget level
– Data level
• Localization support
• Inter-widget communication
• Shareable dashboards with widget state persistence
56. Use Case 3: Warehouse managers should be alerted if there will be a shortage of raw material for future production cycles
57. 3.1 Check if the current raw material input rate is enough for production
58. Join Raw Material with Production Input
define stream RawMaterialStream (name string, amount double);
define stream ProductionInputStream (name string, amount double);

from ProductionInputStream#window.time(1 hour) as p
join RawMaterialStream#window.time(1 hour) as r
on r.name == p.name
select r.name, sum(r.amount) as totalRawMaterial,
       sum(p.amount) as totalConsumption
group by r.name
having (totalConsumption - totalRawMaterial) * 100.0 / totalRawMaterial > 5
insert into RawMaterialInputRateAlertStream;
Identify a 5% increase by joining the two streams
59. Join with an External Window
define window RawMaterialWindow (name string, amount double) time(1 hour);
define stream ProductionInputStream (name string, amount double);

from ProductionInputStream#window.time(1 hour) as p
join RawMaterialWindow as r
on r.name == p.name
select r.name, sum(r.amount) as totalRawMaterial,
       sum(p.amount) as totalConsumption
group by r.name
having (totalConsumption - totalRawMaterial) * 100.0 / totalRawMaterial > 5
insert into RawMaterialInputRateAlertStream;
Join a stream with a separately defined window
62. Using a Pre-built PMML Model
define stream ProductionInputStream
    (name string, currentHourAmount double, previousHourAmount double);

from ProductionInputStream#pmml:predict('file/model.pmml', name,
    previousHourAmount, currentHourAmount)
select name, nextHourAmount, getEventTime() as currentTime
insert into PredictedProdInputStream;
Predict required raw materials using a static model
63. Online Machine Learning
define stream ProductionInputStream (currentHourAmount double,
    previousHourAmount double);
define stream ProductionInputResultsStream (currentHourAmount double,
    previousHourAmount double, nextHourAmount double);

from ProductionInputResultsStream#streamingml:updateAMRulesRegressor(
    currentHourAmount, previousHourAmount, nextHourAmount)
select *
insert into TrainOutputStream;

from ProductionInputStream#streamingml:AMRulesRegressor(
    currentHourAmount, previousHourAmount)
select currentHourAmount, previousHourAmount, prediction as nextHourAmount
insert into PredictedProdInputStream;
Predict required raw materials while learning in a streaming manner
64. 3.3 Check predicted raw material availability against warehouse stocks and alert if insufficient
65. Predict & Alert
define window RawMaterialWindow (name string, amount double) time(1 hour);
define stream ProductionInputResultsStream ( currentHourAmount double,
previousHourAmount double, nextHourAmount double );
from ProductionInputResultsStream#streamingml:updateAMRulesRegressor
(currentHourAmount, previousHourAmount, nextHourAmount )
select *
insert into TrainOutputStream;
from PredictedProdInputStream as p join RawMaterialWindow as r
on r.name == p.name
select r.name, p.predictedAmount, sum(r.amount) as totalRawMaterial
having totalRawMaterial < totalConsumption
insert into RawMaterialInputRateAlertStream ;
66. Use Case 4: Factory managers should be alerted if production does not start within 15 minutes of raw material arrival
67. Non-occurrence through Patterns
define stream RawMaterialStream (name string, amount double);
define stream ProductionInputStream (name string, amount double);

from every (e1 = RawMaterialStream)
    -> not ProductionInputStream[name == e1.name and
       amount == e1.amount] for 15 min
select e1.name, e1.amount
insert into ProductionStartDelayed;
Identify the non-occurrence pattern
68. Use Case 5: Alert factory managers if the rate of production continuously decreases for a given time period X
70. Identify Trends
define stream SweetProductionStream (name string, amount double);

from SweetProductionStream#window.timeBatch(1 min)
select name, sum(amount) as amount, currentTimeMillis() as timestamp
group by name
insert into LastMinProdStream;

partition with (name of LastMinProdStream)
begin
  from every e1 = LastMinProdStream,
       e2 = LastMinProdStream[timestamp - e1.timestamp < 10 * 60000
            and e1.amount > amount]*,
       e3 = LastMinProdStream[timestamp - e1.timestamp > 10 * 60000
            and e2[last].amount > amount]
  select e1.name, e1.amount as initialAmount, e3.amount as finalAmount
  insert into ContinuousProdReductionStream;
end;
Identify decreasing trends over 10 minutes
72. Business Rules Manager
• Hides Siddhi app creation complexity from business users
• Build rules via a simple web-based UI
– From scratch: build custom filters on event streams
– From a template: build rules from a developer-created template
Dashboard for Rule Management
75. Template as Business Rules
define stream SweetProductionStream(...);
...

partition with (name of LastMinProdStream)
begin
  from every e1 = LastMinProdStream,
       e2 = LastMinProdStream[timestamp - e1.timestamp < $TimeInMin * 60000
            and e1.amount > amount]*,
       e3 = LastMinProdStream[timestamp - e1.timestamp > $TimeInMin * 60000
            and e2[last].amount > amount]
  select e1.name, e1.amount as initialAmount, e3.amount as finalAmount
  insert into ContinuousProdReductionStream;
end;
Identify a decreasing trend over X minutes (templated as $TimeInMin)
77. Minimum HA with 2 Nodes
• High performance
– Processes around 100k events/sec with just 2 nodes, while most others need 5+
• Zero downtime
• Zero event loss
• Simple deployment with an RDBMS
– No ZooKeeper, Kafka, etc.
• Multi-data-center support
[Diagram: two Stream Processor nodes running the same Siddhi apps consume from the event sources and share an event store.]
78. • Exactly-once processing
• Fault tolerance
• Highly scalable
• No back pressure
• Distributed deployment configuration via annotations (sketched below)
• Pluggable distribution options (YARN, Kubernetes, etc.)
Distributed Deployment
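A rough sketch of such annotations; the @dist parameter names follow the WSO2 SP 4.x distributed deployment documentation as best I recall, and the query itself is hypothetical:

-- Run four parallel instances of this query in the 'filtering' execution group
@dist(parallel = '4', execGroup = 'filtering')
from SweetProductionStream[amount > 100]
select name, amount
insert into HighVolumeStream;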
92. • Finance and Banking
• Retail
• Location
• Operational
• Smart Energy
• Social Media
• System and Network
• Healthcare
Available Options
93. Running Siddhi on the Edge
● Lightweight and lean
● Out-of-the-box support for consuming events from Android sensors
● Support for Python via PySiddhi
  ○ https://github.com/wso2/PySiddhi/
On Android & Raspberry Pi