3. Agenda
Overview
What is Streaming Data?
Streaming Data Pipeline
Streaming Platform components
What is Stishovite?
4. Overview
• Monitoring Events: monitor events in real time
• Monitoring Alerts: send alerts based on detection of event patterns in data streams
• Dashboards: real-time operational dashboards
• Search: full-text querying, aggregations, and geo data in near real time
• Analytics: analyze big volumes of data quickly and in near real time
5. Streaming Data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
This data needs to be processed sequentially and incrementally on a record-by-record basis or
over sliding time windows, and used for a wide variety of analytics including correlations,
aggregations, filtering, and sampling.
Stream processing has become the de facto standard for building real-time ETL and stream
analytics applications. We see batch workloads move into stream processing to act on the
data and derive insights faster. With the explosion of data such as IoT and machine-generated
data, stream processing combined with predictive analytics is driving tremendous business value.
Streaming Data
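The record-by-record, sliding-window style of processing described above can be sketched in plain Python (a toy illustration; the sensor readings, timestamps, and 10-second window are made up for the example):

```python
from collections import deque

def sliding_window_average(events, window_seconds):
    """Process events one record at a time, keeping a sliding
    time window and emitting the running average per record."""
    window = deque()   # (timestamp, value) pairs currently in the window
    averages = []
    for ts, value in events:
        window.append((ts, value))
        # Evict records that have fallen out of the window.
        while window and window[0][0] <= ts - window_seconds:
            window.popleft()
        averages.append(sum(v for _, v in window) / len(window))
    return averages

# Hypothetical sensor readings: (timestamp_in_seconds, temperature)
events = [(0, 20.0), (5, 22.0), (12, 24.0)]
print(sliding_window_average(events, window_seconds=10))
```

The same incremental pattern underlies the correlations, aggregations, filtering, and sampling mentioned above: each record updates a small amount of state instead of triggering a full batch recomputation.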
6. Streaming Data examples include:
• Website, network, and application monitoring
• Fraud detection
• Advertising
• Internet of Things: sensors (trucks, transportation vehicles, industrial equipment)
• Machine-generated data
• Social analytics
• Private Searching
• Others
Streaming Data Examples
7. • Persistence
• Performance
• Scale
• Parallel & Partitioned
• Messaging
• Processing
• Storage
Key Requirements for Streaming Data
8. State of Stream Processing
Stateless
• Filter
• Map
Stateful
• Aggregate
• Join
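The stateless/stateful distinction can be sketched in plain Python (hypothetical web-request records; the field names are made up for the example):

```python
# Stateless operators: each record is handled entirely on its own.
def filter_errors(records):
    return [r for r in records if r["status"] >= 500]    # filter

def to_codes(records):
    return [r["status"] for r in records]                # map

# Stateful operator: the result depends on all records seen so far.
def count_by_status(records):
    counts = {}                                          # aggregation state
    for r in records:
        counts[r["status"]] = counts.get(r["status"], 0) + 1
    return counts

records = [{"status": 200}, {"status": 500}, {"status": 200}]
print(filter_errors(records))    # stateless: no memory between records
print(count_by_status(records))  # stateful: accumulates across records
```

Joins are stateful for the same reason: matching a record from one stream against another requires buffering what has already arrived.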
10. We need to collect the data, process the data, store the data, and finally serve the data for
analysis, searching, machine learning and dashboards.
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
(the concrete technology for each stage is still to be chosen)
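The four stages can be sketched as composable functions in plain Python (an illustrative toy; a real pipeline would plug a concrete technology into each stage):

```python
def collect():
    """Collect: pull raw records from a source (here, a hard-coded list)."""
    yield from ["alpha", "beta", "alpha"]

def process(records):
    """Process: normalize each record as it streams through."""
    for r in records:
        yield r.upper()

store = []  # Store: a stand-in for a fast, scalable data store

def serve():
    """Serve: answer a query over the stored data, e.g. for a dashboard."""
    return {r: store.count(r) for r in set(store)}

# Run the pipeline end to end, record by record.
for record in process(collect()):
    store.append(record)
print(serve())
```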
11. We need to collect the data from a wide array of inputs and write them into a wide array of
outputs in real time.
Collect Data
• Pull-based
• Push-based
Change Data Capture (CDC)
Database Changefeeds
Custom Collectors
• Java
• Python
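Change Data Capture can be illustrated with a toy changefeed in plain Python (an in-memory sketch; real CDC tools instead read the database's transaction log, and the table and event shapes here are made up):

```python
changefeed = []  # the stream of change events a CDC collector would emit

class Table:
    """A toy table that publishes every mutation as a change event."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        op = "update" if key in self.rows else "insert"
        self.rows[key] = value
        changefeed.append({"op": op, "key": key, "value": value})

    def delete(self, key):
        del self.rows[key]
        changefeed.append({"op": "delete", "key": key})

users = Table()
users.upsert(1, "ana")        # emits an insert event
users.upsert(1, "ana maria")  # emits an update event
users.delete(1)               # emits a delete event
print(changefeed)
```

A downstream consumer replaying this feed in order can reconstruct the table's state, which is what makes changefeeds a convenient pull point for streaming pipelines.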
12. When data is ingested in real time, each data item is imported as it is emitted by the source. An
effective data ingestion process begins by prioritizing data sources, validating individual files
and routing data items to the correct destination.
Streaming Data Ingestion
Kafka Topics
13. Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-
throughput, horizontally scalable, and allows geographically distributing data streams and
stream processing applications.
Apache Kafka
14. Kafka’s system design can be thought of as that of a distributed commit log, where incoming
data is written sequentially to disk. There are four main components involved in moving data in
and out of Kafka:
• Topics
• Producers
• Consumers
• Brokers
How Kafka Works
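A minimal in-memory model of those four pieces (this is not the real Kafka API, just a sketch of the commit-log idea: a broker appends records to a topic's log, and each consumer tracks its own read offset):

```python
class Broker:
    """Toy broker: one append-only log (commit log) per topic."""
    def __init__(self):
        self.topics = {}

    def append(self, topic, record):
        log = self.topics.setdefault(topic, [])
        log.append(record)
        return len(log) - 1          # offset of the newly written record

class Producer:
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, record):
        return self.broker.append(topic, record)

class Consumer:
    """Each consumer remembers its own offset per topic, so reads
    never remove data and many consumers can share one log."""
    def __init__(self, broker):
        self.broker = broker
        self.offsets = {}

    def poll(self, topic):
        offset = self.offsets.get(topic, 0)
        records = self.broker.topics.get(topic, [])[offset:]
        self.offsets[topic] = offset + len(records)
        return records

broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker)
producer.send("clicks", {"page": "/home"})
producer.send("clicks", {"page": "/about"})
print(consumer.poll("clicks"))   # both records
print(consumer.poll("clicks"))   # nothing new: []
```

Because the log is append-only and consumers track offsets themselves, the same topic can feed independent consumers at different speeds, which is the property the sequential-write design buys.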
16. Collect & Ingest Data
We need to collect the data, process the data, store the data, and finally serve the data for
analysis, machine learning, and dashboards.
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
17. Data Stream Processing
There are a wide variety of technologies, frameworks, and libraries for building applications
that process streams of data. Frameworks such as Flink, Storm, Samza, and Spark can all
process streams of data in real time, with code written in Java, Python, or Scala, and they do
an excellent job. But if you are looking for something simpler for building data pipelines with
minimal data processing, you should try:
18. Apache NiFi is an integrated data platform that enables the automation of data flow between
systems. It provides real-time control that makes it easy to manage the movement of data
between any source and any destination. Apache NiFi helps move and track data.
Apache NiFi
Apache NiFi is used for:
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
• Conversion between formats
• Extraction/Parsing/Splitting/Aggregation
• Schema translation
• Routing decisions
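One of those preparation steps, conversion between formats, can be sketched in plain Python (a hypothetical CSV-to-JSON-lines step; NiFi itself would do this with built-in processors rather than custom code):

```python
import csv
import io
import json

def csv_to_json_lines(csv_text):
    """Convert CSV records into JSON lines, one object per input row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

csv_text = "id,city\n1,Lima\n2,Quito\n"
print(csv_to_json_lines(csv_text))
```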
19. Data Stream Processing
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
20. For storing lots of streaming data, we need a data store that supports fast writes and scales.
Storing Streaming Data
21. Storing Streaming Data
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
22. End applications like dashboards, business intelligence tools, and other applications that use
the processed event data.
Serving the Data
23. Complete Workflow of Streaming Data
Streaming Data Pipeline
Data Sources → Collect & Ingest Data → Process Data → Store Data → Serve Data
24. Stishovite is a centralized console to manage the entire pipeline of the xGem Streaming
Platform.
The xGem Streaming Platform is an integration of different open-source products.
https://gitlab.com/xgem/stishovite
What is Stishovite?