Innovation and Reinvention Driving Transformation
OCTOBER 9, 2018
2018 HPCC Systems® Community Day
Gus Reyna, Lili Xu
Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes
Introduction
• Background
• Approach
• Exploratory Analysis
• Next steps
Using HPCC Systems Machine Learning to map thousands of violation
descriptions to Standard Violation Codes
Background
• Public records data: birth/marriage/death certificates,
business/professional/contractor licenses,
foreclosures and tax liens, etc.
• Businesses that use this data in their information
technology processes must account for state
variations of similar events.
• LexisNexis maps public record data from different
states to standard categories.
• Businesses use the standard categories to create
one system that can be used in all states.
Problem
• A team of Subject Matter Experts (SMEs) maps
public record data to standardized categories.
• Data grew faster than the team’s mapping
capacity.
• Time to market for data products increased.
Solution
• Grow the team, train the team.
• BUT trainers are SMEs, who are not
mapping while they’re training new
team members.
• AND it takes time to learn how to map public
records data to standard categories.
Solution
• Use HPCC Systems machine learning
to generate 3 recommended standard
categories for public record data …
• which shortens the time it takes new team members
to become effective mappers and …
• reduces the time required for SMEs
to train new team members.
Approach
• Categorized Data → Clean & Build Vocabulary → Training Data and Validation Data
• Training Data → Build Models → Models**
• Validation Data → Run Through Model
• New Data → Run Through Model → Top 3 Category Recommendations → Mappers
** Support Vector Machine (SVM), Naïve Bayes

Category     Description
Category 1   DOGS RUNNING AT LARGE
Category 1   CANINE RUNNING AT LARGE PROHIBITED
Category 2   RUNNING A RED LIGHT
Category 3   FISHING WITHOUT LICENSE
Build Vocabulary
• Categorized Data → Clean & Build Vocabulary

Category     Description
Category 1   DOGS RUNNING AT LARGE
Category 1   CANINE RUNNING AT LARGE PROHIBITED
Category 2   RUNNING A RED LIGHT
Category 3   FISHING WITHOUT LICENSE

WORD    COUNT
RUN     3
LARG    2
CANIN   1
…       …

http://textanalysisonline.com/nltk-porter-stemmer
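
The vocabulary step on this slide (stem each word, then count how often each stem occurs) can be sketched in a few lines of Python. This is only an illustration: `crude_stem` is a hypothetical, simplified suffix stripper written for this example, not the Porter stemmer linked on the slide and not the HPCC Systems implementation. It happens to reproduce the stems shown for these sample descriptions, but a real stemmer covers far more cases.

```python
from collections import Counter

def crude_stem(word):
    """Hypothetical stand-in for a real stemmer (NOT the Porter algorithm)."""
    w = word.upper()
    for suffix in ("ING", "ED", "S"):
        # Strip one common suffix, but keep at least three letters.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Collapse a doubled final consonant left behind (RUNN -> RUN).
            if w[-1] == w[-2] and w[-1] not in "AEIOU":
                w = w[:-1]
            break
    # Drop a trailing E on longer words (LARGE -> LARG, CANINE -> CANIN).
    if w.endswith("E") and len(w) > 4:
        w = w[:-1]
    return w

# Sample descriptions taken from the slide.
descriptions = [
    "DOGS RUNNING AT LARGE",
    "CANINE RUNNING AT LARGE PROHIBITED",
    "RUNNING A RED LIGHT",
    "FISHING WITHOUT LICENSE",
]

# Vocabulary: stem -> number of occurrences across all descriptions.
vocabulary = Counter(
    crude_stem(word) for text in descriptions for word in text.split()
)

for stem, count in vocabulary.most_common(3):
    print(stem, count)
```

Counting stems rather than raw words lets RUNNING, RUNS and RUN share one vocabulary entry, which is why the slide's table lists RUN, LARG and CANIN.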
Build & Validate Model
• Training Data → Build Models → Models*
• Validation Data → Run Through Model → Recommendations
* Support Vector Machine (SVM), Naïve Bayes

Categorized data:
Category     Description
Category 1   DOGS RUNNING AT LARGE
Category 1   CANINE RUNNING AT LARGE PROHIBITED
Category 2   RUNNING A RED LIGHT
Category 3   FISHING WITHOUT LICENSE

Training data (stemmed):
Category     Description
Category 1   CANIN RUN AT LARG PROHIBIT
Category 2   RUN A RED LIGHT
Category 3   FISH WITHOUT LICENS

Validation data (stemmed):
Category     Description
Category 1   DOG RUN AT LARG

Model output for the validation record:
Recommendations                      Category     Description
Category 1, Category 2, Category 3   Category 1   DOGS RUNNING AT LARGE
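
To make the build-and-validate step concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier that ranks categories for a stemmed description. It is an assumption-laden toy: the function names `train` and `recommend` are invented for this example, add-one smoothing is one of several possible choices, and the talk's actual models were built with HPCC Systems Machine Learning rather than Python. The training and validation rows are the stemmed examples from this slide.

```python
import math
from collections import Counter, defaultdict

def train(rows):
    """rows: list of (category, stemmed description) pairs."""
    word_counts = defaultdict(Counter)  # category -> stem counts
    doc_counts = Counter()              # category -> number of records
    for category, text in rows:
        doc_counts[category] += 1
        word_counts[category].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def recommend(model, text, k=3):
    """Return the k most likely categories, best first."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.split():
            if word in vocab:  # stems never seen in training are ignored
                score += math.log(
                    (word_counts[category][word] + 1)
                    / (total_words + len(vocab))
                )
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

training_rows = [
    ("Category 1", "CANIN RUN AT LARG PROHIBIT"),
    ("Category 2", "RUN A RED LIGHT"),
    ("Category 3", "FISH WITHOUT LICENS"),
]
model = train(training_rows)

# Validation record from the slide: the known answer is Category 1.
print(recommend(model, "DOG RUN AT LARG"))
```

Returning a ranked list rather than a single label matches the workflow described in the talk, where mappers review the top 3 recommendations instead of accepting an automatic assignment.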
Process New Data
• New Data → Models* → Top 3 Category Recommendations → Mappers
* Support Vector Machine (SVM), Naïve Bayes

Training Data:
Category     Description
Category 1   CANINE RUNNING AT LARGE PROHIBITED
Category 2   RUNNING A RED LIGHT
Category 3   FISHING WITHOUT LICENSE

New Data:
Description
CATS RUNNING AT LARGE
HUNTING WITHOUT LICENSE

Recommended Categories   Description
1, 2, 3                  CATS RUNNING AT LARGE
3, 1, 2                  HUNTING WITHOUT LICENSE
Outcome
• Backlog of public record data awaiting mapping to
standard categories eliminated.
• Time to market for data products shortened;
no more delays from mapping data to
categories.
• Happy mapping team: could now work on data
enhancement projects.
Exploratory Analysis
NLP Toolkits on HPCC Systems Platform
RECORD DESCRIPTION: OPERATING/VEH/OVER MAX HGT

Stage                Output
TOKENIZER            OPERATING, VEH, OVER, MAX, HGT
SEMANTIC ANALYZER    OPERATING, VEHICLE, OVER, MAX, HEIGHT
STEMMER              OPER, VEHICL, OVER, MAX, HEIGHT
STOP-WORDS REMOVER   OPER, VEHICL, MAX, HEIGHT
N-GRAM               OPER VEHICL, VEHICL MAX, MAX HEIGHT
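
The preprocessing chain on this slide (tokenize, expand abbreviations, stem, remove stop words, build n-grams) can be sketched as a sequence of small Python functions. Everything here is illustrative: the lookup tables `ABBREVIATIONS`, `STEMS` and `STOP_WORDS` are hypothetical and contain only the entries needed to reproduce the slide's example, the stem table is a stand-in for a real stemmer, and the stage order is inferred from the intermediate outputs shown on the slide. The toolkit presented in the talk runs on the HPCC Systems platform, not in Python.

```python
import re

# Hypothetical lookup tables, sized for the slide's single example.
ABBREVIATIONS = {"VEH": "VEHICLE", "HGT": "HEIGHT"}
STEMS = {"OPERATING": "OPER", "VEHICLE": "VEHICL"}  # stand-in for a stemmer
STOP_WORDS = {"OVER"}

def tokenize(text):
    """Split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.upper()) if t]

def expand(tokens):
    """Replace known abbreviations with their full form."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def stem(tokens):
    """Reduce each token to its stem when one is known."""
    return [STEMS.get(t, t) for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def preprocess(text):
    return ngrams(remove_stop_words(stem(expand(tokenize(text)))))

print(preprocess("OPERATING/VEH/OVER MAX HGT"))
```

Keeping each stage as its own function makes it easy to inspect intermediate results, which is how the slide presents the pipeline.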
Latent Dirichlet Allocation - Topic Model
• Unsupervised Natural Language Processing (NLP) algorithm
• Explores the topics in a set of documents
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from one of its document's topics
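
The three modeling assumptions listed above describe LDA's generative story, which can be sketched directly. This is a toy illustration, not the scalable HPCC Systems implementation and not an inference algorithm: it only samples a document from given topics. The topic-word probabilities in `TOPICS` are made up for the example (the stems come from the deck), and `alpha` is an assumed symmetric Dirichlet parameter.

```python
import random

# Hypothetical topics: each topic is a distribution over words.
TOPICS = [
    {"SPEED": 0.6, "DRIVE": 0.3, "UNLAW": 0.1},
    {"INSUR": 0.5, "PROOF": 0.3, "REQUIR": 0.2},
]

def sample_dirichlet(rng, alpha, k):
    """Draw a k-dimensional symmetric Dirichlet sample via gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topics, alpha, length):
    # Each document is a mixture of topics.
    theta = sample_dirichlet(rng, alpha, len(topics))
    words = []
    for _ in range(length):
        # Pick a topic for this word position from the document's mixture...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ...then draw the word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
theta, words = generate_document(rng, TOPICS, alpha=0.5, length=8)
print([round(t, 3) for t in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.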
LDA Topic Model
[Table: stemmed terms for TOPIC 1 through TOPIC 10]
First row of terms: TOPIC 1 OPER, TOPIC 2 UNLAW, TOPIC 3 PROOF, TOPIC 4 DRIVE, TOPIC 5 IMPROP, TOPIC 6 PARK, TOPIC 7 VEHICL, TOPIC 8 SPEED, TOPIC 9 INSUR, TOPIC 10 MOTOCYCL
Terms in the remaining rows: DRIVE, EY, FAIL, IMPROP, INSUR, MOTORCYCL, OPER, PARK, PROOF, PROTECT, REQUIR, SPEED, UNLAW, VEHICL
Scalable LDA on HPCC Systems Platform
• Massive Parallel Topic Modeling
• Flexible Hyper-Parameter Setup
• Experiments Topic Range [10 – 103]
LDA TOPIC MODEL RESULT
Next Steps
• Continue exploratory analysis
• Additional algorithms
• Automatically map data to categories, not just
make recommendations
• Refine existing and build new models
• Solve other business problems with HPCC
Systems Machine Learning
• Uniform language
• Ease of data access
• High productivity
19
Using HPCC Systems Machine Learning to map thousands of
violation descriptions to Standard Violation Codes

Más contenido relacionado

Similar a Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes

Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
Presentacion day f-core v1.2.1.2-technical - english
Presentacion day f-core v1.2.1.2-technical - englishPresentacion day f-core v1.2.1.2-technical - english
Presentacion day f-core v1.2.1.2-technical - englishJose Luis Sanchez del Coso
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software EngineeringMiroslaw Staron
 
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014Amazon Web Services
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
HPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as datasource{d}
 
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkAI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkValue Amplify Consulting
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar Revolution Analytics
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11HPCC Systems
 
Ed Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisEd Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisVolha Banadyseva
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
FIM and System Call Auditing at Scale in a Large Container Deployment
FIM and System Call Auditing at Scale in a Large Container DeploymentFIM and System Call Auditing at Scale in a Large Container Deployment
FIM and System Call Auditing at Scale in a Large Container DeploymentPriyanka Aash
 
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...Caristix
 
: HL7 Survival Guide - Chapter 7 – Gap Analysis
: HL7 Survival Guide - Chapter 7 – Gap Analysis: HL7 Survival Guide - Chapter 7 – Gap Analysis
: HL7 Survival Guide - Chapter 7 – Gap AnalysisCaristix
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Data Con LA
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...HPCC Systems
 
Big Data LDN 2017: Delivering Instant Experience with Redid Enterprise
Big Data LDN 2017: Delivering Instant Experience with Redid EnterpriseBig Data LDN 2017: Delivering Instant Experience with Redid Enterprise
Big Data LDN 2017: Delivering Instant Experience with Redid EnterpriseMatt Stubbs
 

Similar a Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes (20)

Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
Presentacion day f-core v1.2.1.2-technical - english
Presentacion day f-core v1.2.1.2-technical - englishPresentacion day f-core v1.2.1.2-technical - english
Presentacion day f-core v1.2.1.2-technical - english
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
(SDD405) Amazon Kinesis Deep Dive | AWS re:Invent 2014
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
HPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & Analytics
 
source{d} Engine - your code as data
source{d} Engine - your code as datasource{d} Engine - your code as data
source{d} Engine - your code as data
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache SparkAI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
AI Class Topic 4: Text Analytics, Sentiment Analysis and Apache Spark
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
 
Ed Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual AnalysisEd Snelson. Counterfactual Analysis
Ed Snelson. Counterfactual Analysis
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
FIM and System Call Auditing at Scale in a Large Container Deployment
FIM and System Call Auditing at Scale in a Large Container DeploymentFIM and System Call Auditing at Scale in a Large Container Deployment
FIM and System Call Auditing at Scale in a Large Container Deployment
 
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...
HL7 Survival Guide - Chapter 3 - The Heart of the Matter: Data Formats, Workf...
 
: HL7 Survival Guide - Chapter 7 – Gap Analysis
: HL7 Survival Guide - Chapter 7 – Gap Analysis: HL7 Survival Guide - Chapter 7 – Gap Analysis
: HL7 Survival Guide - Chapter 7 – Gap Analysis
 
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
Big Data Day LA 2016/ Big Data Track - Apply R in Enterprise Applications, Lo...
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
Meetup: Big Data NLP with HPCC Systems® - A Development Ride from Spray to TH...
 
Big Data LDN 2017: Delivering Instant Experience with Redid Enterprise
Big Data LDN 2017: Delivering Instant Experience with Redid EnterpriseBig Data LDN 2017: Delivering Instant Experience with Redid Enterprise
Big Data LDN 2017: Delivering Instant Experience with Redid Enterprise
 

Más de HPCC Systems

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...HPCC Systems
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsHPCC Systems
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn HPCC Systems
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingHPCC Systems
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle ChangesHPCC Systems
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index HPCC Systems
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningHPCC Systems
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesHPCC Systems
 
Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsHPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch HPCC Systems
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem HPCC Systems
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis ToolHPCC Systems
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony HPCC Systems
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterHPCC Systems
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...HPCC Systems
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...HPCC Systems
 

Más de HPCC Systems (20)

Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...Natural Language to SQL Query conversion using Machine Learning Techniques on...
Natural Language to SQL Query conversion using Machine Learning Techniques on...
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Towards Trustable AI for Complex Systems
Towards Trustable AI for Complex SystemsTowards Trustable AI for Complex Systems
Towards Trustable AI for Complex Systems
 
Welcome
WelcomeWelcome
Welcome
 
Closing / Adjourn
Closing / Adjourn Closing / Adjourn
Closing / Adjourn
 
Community Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon CuttingCommunity Website: Virtual Ribbon Cutting
Community Website: Virtual Ribbon Cutting
 
Path to 8.0
Path to 8.0 Path to 8.0
Path to 8.0
 
Release Cycle Changes
Release Cycle ChangesRelease Cycle Changes
Release Cycle Changes
 
Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index Geohashing with Uber’s H3 Geospatial Index
Geohashing with Uber’s H3 Geospatial Index
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
 
Docker Support
Docker Support Docker Support
Docker Support
 
Expanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network CapabilitiesExpanding HPCC Systems Deep Neural Network Capabilities
Expanding HPCC Systems Deep Neural Network Capabilities
 
Leveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC SystemsLeveraging Intra-Node Parallelization in HPCC Systems
Leveraging Intra-Node Parallelization in HPCC Systems
 
DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch DataPatterns - Profiling in ECL Watch
DataPatterns - Profiling in ECL Watch
 
Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem Leveraging the Spark-HPCC Ecosystem
Leveraging the Spark-HPCC Ecosystem
 
Work Unit Analysis Tool
Work Unit Analysis ToolWork Unit Analysis Tool
Work Unit Analysis Tool
 
Community Award Ceremony
Community Award Ceremony Community Award Ceremony
Community Award Ceremony
 
Dapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL NeaterDapper Tool - A Bundle to Make your ECL Neater
Dapper Tool - A Bundle to Make your ECL Neater
 
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
A Success Story of Challenging the Status Quo: Gadget Girls and the Inclusion...
 
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
Beyond the Spectrum – Creating an Environment of Diversity and Empowerment wi...
 

Último

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 

Último (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130

Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes

  • 3. Innovation and Reinvention Driving Transformation
      October 9, 2018
      2018 HPCC Systems® Community Day
      Gus Reyna, Lili Xu
      Using HPCC Systems ML to Map Thousands of Public Records Data Descriptions to Standard Codes
  • 4. Introduction
      • Background
      • Approach
      • Exploratory Analysis
      • Next steps
      Using HPCC Systems Machine Learning to map thousands of violation descriptions to Standard Violation Codes
  • 5. Background
      • Public records data: birth/marriage/death certificates, business/professional/contractor licenses, foreclosures and tax liens, etc.
      • Businesses that use this data in their information technology processes must account for state variations of similar events.
      • LexisNexis maps public record data from different states to standard categories.
      • Businesses use the standard categories to create one system that can be used in all states.
  • 6. Problem
      • A team of subject matter experts (SMEs) maps public record data to standardized categories.
      • Data grew faster than the team's mapping capacity.
      • Time to market for data products increased.
  • 7. Solution
      • Grow the team, train the team.
      • BUT the trainers are SMEs, who are not mapping while they are training new team members.
      • AND it takes time to learn how to map public records data to standard categories.
  • 8. Solution
      • Use HPCC Systems machine learning to generate 3 recommended standard categories for public record data …
      • which shortens the time it takes new team members to become effective mappers and …
      • reduces the time required for SMEs to train new team members.
  • 9. Approach
      Flow diagram:
      • Categorized Data → Clean & Build Vocabulary → Training Data and Validation Data
      • Training Data → Build Models → Models**
      • Validation Data → Run Through Model
      • New Data → Run Through Model → Top 3 Category Recommendations → Mappers
      ** Support Vector Machine (SVM), Naïve Bayes
      Example categorized data:
      Category    Description
      Category 1  DOGS RUNNING AT LARGE
      Category 1  CANINE RUNNING AT LARGE PROHIBITED
      Category 2  RUNNING A RED LIGHT
      Category 3  FISHING WITHOUT LICENSE
  • 10. Build Vocabulary
      Categorized Data → Clean & Build Vocabulary
      Categorized data:
      Category    Description
      Category 1  DOGS RUNNING AT LARGE
      Category 1  CANINE RUNNING AT LARGE PROHIBITED
      Category 2  RUNNING A RED LIGHT
      Category 3  FISHING WITHOUT LICENSE
      Resulting vocabulary:
      WORD   COUNT
      RUN    3
      LARG   2
      CANIN  1
      …      …
      Stemmer reference: http://textanalysisonline.com/nltk-porter-stemmer
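The vocabulary step on slide 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical, crude suffix stripper written for this example, not the Porter stemmer the slide links to and not the HPCC Systems implementation, and `build_vocabulary` is likewise a made-up helper name.

```python
import re
from collections import Counter


def toy_stem(word):
    """Crude suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ING", "ED", "S", "E"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Collapse a doubled final consonant left behind (RUNN -> RUN).
            if stem[-1] == stem[-2] and stem[-1] not in "AEIOU":
                stem = stem[:-1]
            return stem
    return word


def build_vocabulary(descriptions):
    """Count how often each stemmed word occurs across all descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[A-Z]+", text.upper())
        counts.update(toy_stem(token) for token in tokens)
    return counts


# The four example descriptions from the slide.
descriptions = [
    "DOGS RUNNING AT LARGE",
    "CANINE RUNNING AT LARGE PROHIBITED",
    "RUNNING A RED LIGHT",
    "FISHING WITHOUT LICENSE",
]
vocab = build_vocabulary(descriptions)
print(vocab["RUN"], vocab["LARG"], vocab["CANIN"])
```

On the slide's four example rows this gives the same counts as the slide for RUN, LARG, and CANIN. A real pipeline would use a proper stemmer and would probably also drop stop words such as AT and A.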
  • 11. Build & Validate Model
      Training Data → Build Models → Models*; Validation Data → Run Through Model → Recommendations
      * Support Vector Machine (SVM), Naïve Bayes
      Categorized data:
      Category    Description
      Category 1  DOGS RUNNING AT LARGE
      Category 1  CANINE RUNNING AT LARGE PROHIBITED
      Category 2  RUNNING A RED LIGHT
      Category 3  FISHING WITHOUT LICENSE
      Training data (stemmed):
      Category    Description
      Category 1  CANIN RUN AT LARG PROHIBIT
      Category 2  RUN A RED LIGHT
      Category 3  FISH WITHOUT LICENS
      Validation data (stemmed):
      Category    Description
      Category 1  DOG RUN AT LARG
      Recommendations for the validation record:
      Recommendations                     Category    Description
      Category 1, Category 2, Category 3  Category 1  DOGS RUNNING AT LARGE
  • 12. Process New Data
      New Data → Models* → Top 3 Category Recommendations → Mappers
      * Support Vector Machine (SVM), Naïve Bayes
      Training data:
      Category    Description
      Category 1  CANINE RUNNING AT LARGE PROHIBITED
      Category 2  RUNNING A RED LIGHT
      Category 3  FISHING WITHOUT LICENSE
      New data:
      Description
      CATS RUNNING AT LARGE
      HUNTING WITHOUT LICENSE
      Output:
      Recommended Categories  Description
      1, 2, 3                 CATS RUNNING AT LARGE
      3, 1, 2                 HUNTING WITHOUT LICENSE
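To make the "top 3 category recommendations" idea on slides 11 and 12 concrete, here is a minimal sketch of a multinomial Naïve Bayes ranker with add-one smoothing in plain Python. It is an assumption-laden illustration, not the HPCC Systems ML code from the talk: the function names `train_naive_bayes` and `top_categories` are hypothetical, the tokens are stemmed by hand to mirror the slide, and words never seen in training are simply ignored.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: iterable of (category, list_of_stemmed_tokens)."""
    doc_counts = Counter()              # training descriptions per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for category, tokens in rows:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def top_categories(model, tokens, k=3):
    """Return the k categories with the highest log-probability score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][token] + 1) / (total_words + len(vocab))
            )
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Training rows from the slide, stemmed by hand (an assumption).
training = [
    ("Category 1", ["CANIN", "RUN", "AT", "LARG", "PROHIBIT"]),
    ("Category 2", ["RUN", "A", "RED", "LIGHT"]),
    ("Category 3", ["FISH", "WITHOUT", "LICENS"]),
]
model = train_naive_bayes(training)
cats_ranking = top_categories(model, ["CAT", "RUN", "AT", "LARG"])
hunting_ranking = top_categories(model, ["HUNT", "WITHOUT", "LICENS"])
print(cats_ranking)
print(hunting_ranking)
```

On this toy data the first-ranked category agrees with the slide for both new descriptions (Category 1 for CATS RUNNING AT LARGE, Category 3 for HUNTING WITHOUT LICENSE). The order of the lower-ranked categories depends on the model and smoothing used, so it need not match the slide.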
  • 13. Approach (same flow diagram and example data as slide 9)
  • 14. Outcome
      • Backlog of public record data to standard category mapping eliminated.
      • Time to market for data products shortened; no more delays from mapping data to categories.
      • Happy mapping team, which could now work on data enhancement projects.
  • 15. Exploratory Analysis
  • 16. NLP Toolkits on HPCC Systems Platform
      Stage               Record description after the stage
      (input)             OPERATING/VEH/OVER MAX HGT
      TOKENIZER           OPERATING VEH OVER MAX HGT
      SEMANTIC ANALYZER   OPERATING VEHICLE OVER MAX HEIGHT
      STEMMER             OPER VEHICL OVER MAX HEIGHT
      STOP-WORDS REMOVER  OPER VEHICL MAX HEIGHT
      N-GRAM              OPER VEHICL | VEHICL MAX | MAX HEIGHT
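The slide 16 pipeline can be mimicked with a short Python sketch. Everything here is an assumption made for illustration: the lookup tables `ABBREVIATIONS`, `STEMS`, and `STOP_WORDS` are hypothetical stand-ins containing only the words from the slide's example, and the stage order (tokenize, expand abbreviations, stem, remove stop words, build bigrams) is inferred from the intermediate records shown on the slide rather than stated by the authors.

```python
import re

# Hypothetical lookup tables covering only the slide's example record.
ABBREVIATIONS = {"VEH": "VEHICLE", "HGT": "HEIGHT"}  # "semantic analyzer" stand-in
STEMS = {"OPERATING": "OPER", "VEHICLE": "VEHICL"}   # stand-in for a real stemmer
STOP_WORDS = {"OVER"}


def tokenize(text):
    """Split on anything that is not a letter or digit (e.g. '/' and spaces)."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def expand_abbreviations(tokens):
    return [ABBREVIATIONS.get(token, token) for token in tokens]


def stem(tokens):
    return [STEMS.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("OPERATING/VEH/OVER MAX HGT")
tokens = expand_abbreviations(tokens)
tokens = stem(tokens)
tokens = remove_stop_words(tokens)
bigrams = ngrams(tokens, n=2)
print(tokens)
print(bigrams)
```

For the slide's record this ends with the tokens OPER, VEHICL, MAX, HEIGHT and the bigrams OPER VEHICL, VEHICL MAX, MAX HEIGHT, which are the values shown on the slide.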
  • 17. Latent Dirichlet Allocation - Topic Model
      • Unsupervised Natural Language Processing (NLP) algorithm
      • Explores the topics in documents
      • Each topic is a distribution over words
      • Each document is a mixture of topics
      • Each word is drawn from the topics
      [Figure: LDA Topic Model]
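The three statements on slide 17 describe LDA's generative story, which can be sketched as a small sampler. This shows only how a document would be generated under the model, not how topics are inferred from data and not the scalable HPCC Systems implementation. The vocabulary reuses stemmed words from the talk, while the topic distributions, the `alpha` values, and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Each document is a mixture of topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position from the document's mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Each topic is a distribution over words; draw the word from it.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words


vocab = ["SPEED", "PARK", "INSUR", "PROOF"]
topics = [
    [0.6, 0.3, 0.05, 0.05],  # hypothetical "moving violation" topic
    [0.05, 0.05, 0.5, 0.4],  # hypothetical "insurance" topic
]
rng = random.Random(0)
theta, words = generate_document(topics, vocab, alpha=[0.5, 0.5], length=5, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic distributions and the per-document mixtures.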
  • 18. Scalable LDA on HPCC Systems Platform
      • Massive Parallel Topic Modeling
      • Flexible Hyper-Parameter Setup
      • Experiments Topic Range [10 – 103]
      [Table: LDA topic model result, listing the top stemmed words for TOPIC 1 through TOPIC 10. Words appearing in the table: OPER, UNLAW, PROOF, DRIVE, IMPROP, PARK, VEHICL, SPEED, INSUR, MOTORCYCL, REQUIR, EY, FAIL, PROTECT]
  • 19. Next Steps
      • Continue exploratory analysis
      • Additional algorithms
      • Automatically map data to categories, not just make recommendations
      • Refine existing and build new models
      • Solve other business problems with HPCC Systems Machine Learning
      • Uniform language
      • Ease of data access
      • High productivity

Editor's notes

  1. Public records: http://www.epchc.org/i-want-to/request-a-public-record
     Map of states: https://openclipart.org/detail/230078/multicolored-united-states-map