Más contenido relacionado
La actualidad más candente (20)
Similar a Next Generation Tooling for building streaming analytics app (20)
Next Generation Tooling for building streaming analytics app
- 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Gen Tooling for Building
Stream Analytics App
Berlin Talk
George Vetticaden
VP of Emerging Products at Hortonworks
- 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Streaming Analytics Manager (SAM)
What is it?
Helps users build and deploy complex
stream analytics apps without writing
code using graphical interface
An Open Source ASF Licensed tool
– Project Name: Streamline,
https://github.com/hortonworks/streamline
Key Design Principles
Build stream analytics apps w/o specialized
skillsets.
Support multiple underlining streaming
engine (Storm, Spark Streaming, Flink)
Extensibility – Provide SDK to plug in custom
sources/sinks/processors/udfs
Schema is a first class citizen
- 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SAM is All about Doing Real-Time Analytics on the Stream
Real-Time
Prescriptive
Analytics
Real-Time Analytics
Real-Time
Predictive
Analytics
Real-Time
Descriptive
Analytics
What should we do
right now?
What could happen
now/soon?
What is happening
right now?
- 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Showcasing SAM with an Use Case
- 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Trucking company w/ large fleet of international trucks
A truck generates millions of events for a given route;
an event could be:
'Normal' events: starting / stopping of the vehicle
‘Violation’ events: speeding, excessive acceleration and
breaking, unsafe tail distance
‘Speed’ Events: The speed of a driver that comes in every
minute.
Company uses an application that monitors truck
locations and violations from the truck/driver in real-
time
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers
- 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Two Streaming Sources: Geo Event and Speed Stream
eventType
truckId
driverId
driverName longitude
eventTime routeId
eventSource
2018-01-22 15:00:58.493|1516654858493| truck_geo_event |40| 23 | G Vetticaden | 1090292248 | Peoria to Ceder Rapids Route 2 | Lane Departure| 40.7 | -89.52 |1|
route
latitude
correlationId
speed
truckId
driverId
driverName
eventTime routeIdeventSource
2018-01-22 15:00:58.673|1516654858673| truck_speed_event|40| 23 | G Vetticaden| 1090292248 | Peoria to Ceder Rapids Route 2 | 73 |
route
Each Truck emits different event stream
– Truck Geo Event
– Truck Speed Event
Truck Geo Event:
Truck Speed Event:
eventTimeLong
eventTimeLong
- 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Common Streaming Analytics Requirements
Streaming Analytics
Requirement #
Requirement Description
Req. #1 Create streams consuming from the two Kafka topics that NiFi delivered the
enriched geo and speed streams to.
Req. #2 Join the streams of the Geo and Speed sensors over a time based aggregation
window.
Req. #3 Apply rules on the stream to filter on events of interest.
Req. #4 Calculate the average speed of driver over 3 minute window and create alert
for speeding driver
Req. #5 Enrich the stream with features required for a machine learning (ML) model.
The enrichment entails performing lookups for driver HR info, hours/miles
driven in the past week, weather info.
Req. #6 Normalize the events in the stream to feed into the PMML model.
Req. #7 Execute a predictive logistical regression model on the stream built with
Spark ML to predict if a driver is in danger going to commit a violation.
Req. #8 Alert and feed into real-time dashboard if model predicted a violation.
- 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Implementing these 8 Requirements in Storm – Complex and Lots of
Java Code
- 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Implementing this with Spark Structured Streaming – Complex and
Lots of Scala Code
- 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Implementing this with SAM – Easy as Pie
- 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo Implement Requirements 1-4
Streaming Analytics
Requirement #
Requirement Description
Req. #1 Create streams consuming from the two Kafka topics that NiFi delivered the
enriched geo and speed streams to.
Req. #2 Join the streams of the Geo and Speed sensors over a time based aggregation
window.
Req. #3 Apply rules on the stream to filter on events of interest.
Req. #4 Calculate the average speed of driver over 3 minute window and create alert
for speeding driver
Req. #5 Enrich the stream with features required for a machine learning (ML) model.
The enrichment entails performing lookups for driver HR info, hours/miles
driven in the past week, weather info.
Req. #6 Normalize the events in the stream to feed into the PMML model.
Req. #7 Execute a predictive logistical regression model on the stream built with
Spark ML to predict if a driver is in danger going to commit a violation.
Req. #8 Alert and feed into real-time dashboard if model predicted a violation.
- 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo Implementing
Streaming Reqs 1 - 4 with SAM
- 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Implement Requirements 5-8
Streaming Analytics
Requirement #
Requirement Description
Req. #1 Create streams consuming from the two Kafka topics that NiFi delivered the
enriched geo and speed streams to.
Req. #2 Join the streams of the Geo and Speed sensors over a time based aggregation
window.
Req. #3 Apply rules on the stream to filter on events of interest.
Req. #4 Calculate the average speed of driver over 3 minute window and create alert
for speeding driver
Req. #5 Enrich the stream with features required for a machine learning (ML) model.
The enrichment entails performing lookups for driver HR info, hours/miles
driven in the past week, weather info.
Req. #6 Normalize the events in the stream to feed into the PMML model.
Req. #7 Execute a predictive logistical regression model on the stream built with
Spark ML to predict if a driver is in danger going to commit a violation.
Req. #8 Alert and feed into real-time dashboard if model predicted a violation.
- 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Real-Time Predictive Analytics
Question: No violation events but what might happen that I need to be worried about?
My data science team has a model that can predict that based on
– Weather
– Roads
– Driver HR info like driver certification status, wagePlan
– Driver timesheet info like hours, and miles logged over the last week
- 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Building the Predictive Model
Identify suitable ML algorithms to train a model – we will
use classification algorithms as we have labeled events
data
2
Transform enriched events data to a format that is
friendly to Spark MLlib – many ML libs expect
training data in a certain format
3
Train a logistic classification Spark model on YARN, with
above events as training input, and iterate to fine tune
generated model
4
Explore small subset of events to identify predictive
features and make a hypothesis. E.g. hypothesis: “foggy
weather causes driver violations”
1
- 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logistical Regression Model
- 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scoring the Predictive Model using SAM
Use SAM’s enrich/custom processors to enrich the event
with the features required for the model6
Enrich with Features
Use SAM’s projection/custom processors to
transform/normalize the streaming event and the
features required for the model
7
Transform/Normalize
Use SAM’s PMML processor to score the model for each
stream event with its required features8
Score Model
Use SAM’s rule and notification processors to alert,
notify and take action using the results of the model9
Alert / Notify / Action
Export the Spark Mllib model and import into the SAM’s
Model Registry
5 Model
Registry
- 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo Implementing
Streaming Reqs 5 - 8
- 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SAM “Test” Mode
Key Highlights
Persona User: Developer
Allows Developers to test SAM app
locally without deploying to cluster
SAM Test Mode Lifecycle
1. Create Named Test
2. Mock out sources (e.g: Kafka) with test
data
3. Execute the Test Case
4. Visualize Validate the data as it traverses
through the SAM DAG Visualization
5. Download the Test Case Results
Can do steps 1-5 via UI or REST
Enables writing automated test
(Junit)
- 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Build Automated Junit Tests and CI & CD
Pipelines with SAM
- 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Visualizing Output of Test Mode is Great…But want CI/CD
Key Highlights
What Users Really Want?
– Automated Unit Tests – Write Junit tests
for streaming apps like their traditional
apps
– Continuous Integration - Incorporate
streaming apps into their CI Envs with
Jenkins
– Continuous Delivery – Deliver to business
new features in a continuous fashion.
SAM Addresses these needs. How?
– SAM has first class support for REST.
– Everything in UI can be done with SAM
REST. This includes SAM Test Mode.
SAM REST allows you to
– Create service pools, create envs, create
test case, setup test data, download test
results, validate, deploy app to diff envs
- 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo Automated JUnit Tests / CI / CD