1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Gen Tooling for Building
Stream Analytics App
Berlin Talk
George Vetticaden
VP of Emerging Products at Hortonworks
Streaming Analytics Manager (SAM)
What is it?
• Helps users build and deploy complex
stream analytics apps through a graphical
interface, without writing code
• An open source tool under the Apache license
– Project name: Streamline,
https://github.com/hortonworks/streamline
Key Design Principles
• Build stream analytics apps without specialized
skill sets.
• Support multiple underlying streaming
engines (Storm, Spark Streaming, Flink)
• Extensibility – provide an SDK to plug in custom
sources/sinks/processors/UDFs
• Schema is a first-class citizen
Who Uses SAM?
SAM Is All About Doing Real-Time Analytics on the Stream
Real-Time Analytics
• Real-Time Prescriptive Analytics: What should we do right now?
• Real-Time Predictive Analytics: What could happen now/soon?
• Real-Time Descriptive Analytics: What is happening right now?
Showcasing SAM with a Use Case
Trucking company with a large fleet of international trucks
A truck generates millions of events for a given route;
an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and
braking, unsafe tail distance
• 'Speed' events: the speed of a driver, reported every
minute
The company uses an application that monitors truck
locations and violations from the truck/driver in real time
Route? Truck? Driver?
Analysts query a broad history to understand if today's
violations are part of a larger problem with specific
routes, trucks, or drivers
Two Streaming Sources: Geo Event and Speed Stream
• Each truck emits two different event streams
– Truck Geo Event
– Truck Speed Event
Truck Geo Event:
Fields: eventTime | eventTimeLong | eventSource | truckId | driverId | driverName | routeId | route | eventType | latitude | longitude | correlationId
2018-01-22 15:00:58.493|1516654858493| truck_geo_event |40| 23 | G Vetticaden | 1090292248 | Peoria to Ceder Rapids Route 2 | Lane Departure| 40.7 | -89.52 |1|
Truck Speed Event:
Fields: eventTime | eventTimeLong | eventSource | truckId | driverId | driverName | routeId | route | speed
2018-01-22 15:00:58.673|1516654858673| truck_speed_event|40| 23 | G Vetticaden| 1090292248 | Peoria to Ceder Rapids Route 2 | 73 |
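The two sample records are pipe-delimited, so they can be turned into named fields with a few lines of code. The following is a minimal Python sketch, not part of SAM. It assumes the field order eventTime, eventTimeLong, eventSource, truckId, driverId, driverName, routeId, route, followed by eventType, latitude, longitude, correlationId for geo events and by speed for speed events. That order is inferred from the sample records and slide labels; the function and constant names are hypothetical.

```python
# Assumed field order, inferred from the sample records on the slide.
COMMON_FIELDS = ["eventTime", "eventTimeLong", "eventSource", "truckId",
                 "driverId", "driverName", "routeId", "route"]
GEO_FIELDS = COMMON_FIELDS + ["eventType", "latitude", "longitude", "correlationId"]
SPEED_FIELDS = COMMON_FIELDS + ["speed"]

# Fields converted from text to numbers; everything else stays a string.
CONVERTERS = {"eventTimeLong": int, "truckId": int, "driverId": int,
              "routeId": int, "speed": int, "latitude": float, "longitude": float}


def parse_event(line, fields):
    """Split one pipe-delimited record into a dict keyed by field name."""
    # Drop the trailing '|' and surrounding whitespace before splitting.
    values = [v.strip() for v in line.strip().strip("|").split("|")]
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return {name: CONVERTERS.get(name, str)(value)
            for name, value in zip(fields, values)}


GEO_LINE = ("2018-01-22 15:00:58.493|1516654858493| truck_geo_event |40| 23 "
            "| G Vetticaden | 1090292248 | Peoria to Ceder Rapids Route 2 "
            "| Lane Departure| 40.7 | -89.52 |1|")
SPEED_LINE = ("2018-01-22 15:00:58.673|1516654858673| truck_speed_event|40| 23 "
              "| G Vetticaden| 1090292248 | Peoria to Ceder Rapids Route 2 | 73 |")

geo_event = parse_event(GEO_LINE, GEO_FIELDS)
speed_event = parse_event(SPEED_LINE, SPEED_FIELDS)
```

Both events carry the same truckId, driverId and routeId, which is what makes joining the two streams possible later on.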
Common Streaming Analytics Requirements
Streaming Analytics Requirement # / Requirement Description
Req. #1 Create streams consuming from the two Kafka topics to which NiFi delivered the enriched geo and speed streams.
Req. #2 Join the streams of the geo and speed sensors over a time-based aggregation window.
Req. #3 Apply rules on the stream to filter on events of interest.
Req. #4 Calculate the average speed of each driver over a 3-minute window and create an alert for speeding drivers.
Req. #5 Enrich the stream with features required for a machine learning (ML) model. The enrichment entails performing lookups for driver HR info, hours/miles driven in the past week, and weather info.
Req. #6 Normalize the events in the stream to feed into the PMML model.
Req. #7 Execute a predictive logistic regression model, built with Spark ML, on the stream to predict if a driver is in danger of committing a violation.
Req. #8 Alert and feed into a real-time dashboard if the model predicted a violation.
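To make Req. #4 concrete, here is a minimal Python sketch of a tumbling 3-minute window that averages speed per driver and emits an alert when the average exceeds a limit. It is plain Python, not SAM, Storm or Spark code. The limit of 80, the tumbling (non-overlapping) window, and the function name are assumptions chosen for illustration; the slides do not state them.

```python
from collections import defaultdict

WINDOW_MS = 3 * 60 * 1000   # 3-minute window, from Req. #4
SPEED_LIMIT = 80            # hypothetical threshold, not from the slides


def average_speed_alerts(speed_events, window_ms=WINDOW_MS, limit=SPEED_LIMIT):
    """Group speed events into tumbling windows per driver and alert on high averages."""
    buckets = defaultdict(list)
    for event in speed_events:
        # Align each event to the start of the window it falls into.
        window_start = event["eventTimeLong"] // window_ms * window_ms
        buckets[(event["driverId"], window_start)].append(event["speed"])

    alerts = []
    for (driver_id, window_start), speeds in sorted(buckets.items()):
        average = sum(speeds) / len(speeds)
        if average > limit:
            alerts.append({"driverId": driver_id,
                           "windowStart": window_start,
                           "avgSpeed": average})
    return alerts


sample_events = [
    {"driverId": 23, "eventTimeLong": 0, "speed": 85},
    {"driverId": 23, "eventTimeLong": 60_000, "speed": 90},
    {"driverId": 23, "eventTimeLong": 120_000, "speed": 95},
    {"driverId": 11, "eventTimeLong": 30_000, "speed": 60},
    {"driverId": 11, "eventTimeLong": 90_000, "speed": 70},
]
alerts = average_speed_alerts(sample_events)
```

A streaming engine would evaluate windows incrementally as events arrive; the batch-style loop here only shows the grouping and averaging logic.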
Implementing these 8 Requirements in Storm – Complex and Lots of
Java Code
Implementing this with Spark Structured Streaming – Complex and
Lots of Scala Code
Implementing this with SAM – Easy as Pie
Demo Implement Requirements 1-4
Demo Implementing
Streaming Reqs 1 - 4 with SAM
Implement Requirements 5-8
Real-Time Predictive Analytics
• Question: There are no violation events, but what might happen that I need to be worried about?
• My data science team has a model that can predict that based on
– Weather
– Roads
– Driver HR info, such as driver certification status and wagePlan
– Driver timesheet info, such as hours and miles logged over the last week
Building the Predictive Model
1. Explore a small subset of events to identify predictive
features and make a hypothesis, e.g. "foggy
weather causes driver violations"
2. Identify suitable ML algorithms to train a model; we will
use classification algorithms as we have labeled events
data
3. Transform enriched events data to a format that is
friendly to Spark MLlib; many ML libs expect
training data in a certain format
4. Train a logistic regression model in Spark on YARN, with
the above events as training input, and iterate to fine-tune
the generated model
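As an illustration of the "transform to an ML-friendly format" step, the sketch below encodes one labeled, enriched event as a LIBSVM-style line (`label index:value ...`, with 1-based indices and zero values omitted), a text format that Spark MLlib can load. The slides do not say which format the demo actually used, so LIBSVM is an assumption, and the feature names are hypothetical stand-ins for the weather, HR and timesheet features mentioned in the talk.

```python
# Hypothetical feature names; their order defines the 1-based LIBSVM indices.
FEATURES = ["foggy", "certified", "hoursLogged", "milesLogged"]


def to_libsvm(label, event):
    """Encode a labeled event as 'label index:value ...', skipping zero values."""
    parts = [str(int(label))]
    for index, name in enumerate(FEATURES, start=1):
        value = float(event[name])
        if value != 0.0:
            parts.append(f"{index}:{value}")
    return " ".join(parts)


# A driver in foggy weather, not certified, 52 hours and 2700 miles logged,
# labeled 1 because a violation occurred.
line = to_libsvm(1, {"foggy": 1, "certified": 0,
                     "hoursLogged": 52, "milesLogged": 2700})
```

Skipping zero-valued features keeps the file sparse, which is the usual reason for choosing this format over a dense CSV.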
Logistic Regression Model
Scoring the Predictive Model using SAM
5. Model Registry: Export the Spark MLlib model and import it into SAM's
Model Registry
6. Enrich with Features: Use SAM's enrich/custom processors to enrich the event
with the features required for the model
7. Transform/Normalize: Use SAM's projection/custom processors to
transform/normalize the streaming event and the
features required for the model
8. Score Model: Use SAM's PMML processor to score the model for each
stream event with its required features
9. Alert / Notify / Action: Use SAM's rule and notification processors to alert,
notify and take action using the results of the model
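The normalize, score and alert steps can be sketched in a few lines of Python. This is not SAM's PMML processor and it does not read a PMML file; it only shows what scoring a logistic regression model per event amounts to. The intercept, weights, scaling constants, feature names and the 0.5 alert threshold are all hypothetical values made up for the example.

```python
import math

# Hypothetical model parameters; a real model would come from the exported PMML.
INTERCEPT = -2.0
WEIGHTS = {"foggy": 1.5, "certified": -0.5, "hoursLogged": 0.8, "milesLogged": 0.6}
# Hypothetical scaling constants used to bring features into a comparable range.
SCALE = {"hoursLogged": 70.0, "milesLogged": 3000.0}


def normalize(event):
    """Keep only the model's features and scale the numeric ones."""
    return {name: float(event[name]) / SCALE.get(name, 1.0) for name in WEIGHTS}


def violation_probability(event):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    features = normalize(event)
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_alert(event, threshold=0.5):
    """Rule step: alert when the predicted probability reaches the threshold."""
    return violation_probability(event) >= threshold


risky = {"foggy": 1, "certified": 0, "hoursLogged": 70, "milesLogged": 3000}
safe = {"foggy": 0, "certified": 1, "hoursLogged": 35, "milesLogged": 1500}
```

With these made-up parameters, `should_alert(risky)` is true and `should_alert(safe)` is false; in SAM the same decision would be expressed as a rule on the PMML processor's output.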
Demo Implementing
Streaming Reqs 5 - 8
SAM "Test" Mode
Key Highlights
• Persona user: Developer
• Allows developers to test a SAM app
locally without deploying to a cluster
• SAM Test Mode lifecycle
1. Create a named test
2. Mock out sources (e.g. Kafka) with test
data
3. Execute the test case
4. Visualize and validate the data as it traverses
the SAM DAG visualization
5. Download the test case results
• Steps 1-5 can be done via the UI or REST
• Enables writing automated tests
(JUnit)
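The lifecycle above follows the usual shape of a unit test: replace the real source with fixed test data, run the app, and check what comes out. The sketch below shows that shape with Python's `unittest` rather than JUnit, and it does not call SAM at all. `speeding_pipeline` is a hypothetical stand-in for a deployed app, and the speed limit of 80 is an assumed value.

```python
import unittest


def speeding_pipeline(source, limit=80):
    """Stand-in for a streaming app: keep events whose speed exceeds the limit."""
    return [event for event in source() if event["speed"] > limit]


class SpeedingPipelineTest(unittest.TestCase):
    def test_flags_only_speeding_events(self):
        # Step 2: mock out the source (e.g. Kafka) with fixed test data.
        test_data = [
            {"driverId": 23, "speed": 95},
            {"driverId": 11, "speed": 60},
        ]

        def mock_source():
            return iter(test_data)

        # Step 3: execute the test case against the mocked source.
        results = speeding_pipeline(mock_source)

        # Step 4: validate the data that came out of the pipeline.
        self.assertEqual([event["driverId"] for event in results], [23])
```

Saved in a file, the test can be run with `python -m unittest`, which is the kind of command a CI job would invoke.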
Build Automated JUnit Tests and CI & CD
Pipelines with SAM
Visualizing Output of Test Mode is Great… But Users Want CI/CD
Key Highlights
• What do users really want?
– Automated unit tests – write JUnit tests
for streaming apps as they do for their traditional
apps
– Continuous integration – incorporate
streaming apps into their CI environments with
Jenkins
– Continuous delivery – deliver new features to the
business in a continuous fashion
• SAM addresses these needs. How?
– SAM has first-class support for REST.
– Everything in the UI can be done with SAM
REST. This includes SAM Test Mode.
• SAM REST allows you to
– Create service pools, create environments, create
test cases, set up test data, download test
results, validate, and deploy apps to different environments
Demo Automated JUnit Tests / CI / CD
