As part of the 2018 HPCC Systems Summit Community Day event:
Up first, Farah Alshanik of Clemson University briefly discusses her poster, Equivalence Terms of Text Search Bundle.
Following, Lili Xu and Gus Reyna present their breakout session in the Machine Learning track.
Incorporating public records data into business processes is challenging: states describe similar events in disparate ways, and a standard is needed that gives each event one consistent meaning. This session tells the story of how HPCC Systems Machine Learning addressed the problem of mapping thousands of disparate public record data descriptions to a corresponding set of standard codes, and the future direction for this approach.
Lili Xu is a PhD candidate in the DICE lab, directed by Dr. Apon, in the School of Computing at Clemson University. This is her third internship with the HPCC Systems team, working on machine learning applications. Her research areas are machine learning, natural language processing and high performance computing. She speaks only three languages but programs in more than three.
Gus Reyna is a Director with LexisNexis Risk Solutions where he leads the engineering team for the Motor Vehicle Report (MVR) data products. He has been working at LexisNexis for 9 years building data solutions on the HPCC Systems platform.
4. Introduction
• Background
• Approach
• Exploratory Analysis
• Next steps
Using HPCC Systems Machine Learning to map thousands of violation
descriptions to Standard Violation Codes
5. Background
• Public records data: birth/marriage/death certificates,
business/professional/contractor licenses,
foreclosures and tax liens, etc.
• Businesses that use this data in their information
technology processes must account for state
variations of similar events.
• LexisNexis maps public record data from different
states to standard categories.
• Businesses use the standard categories to create
one system that can be used in all states.
6. Problem
• A team of subject matter experts (SMEs) maps
public record data to standardized categories.
• Data grew faster than the team’s mapping
capacity.
• Time to market for data products increased.
7. Solution
• Grow the team, train the team.
• BUT trainers are SMEs, who are not
mapping while they are training new
team members
• AND it takes time to learn how to map public
records data to standard categories
8. Solution
• Use HPCC Systems machine learning
to generate 3 recommended standard
categories for public record data …
• Which shortens the time it takes new team
members to become effective mappers and …
• Reduces the time required for SMEs
to train new team members
9. Approach
[Flow diagram]
Categorized Data → Clean & Build Vocabulary → Training Data and Validation Data
Training Data → Build Models → Models**
Validation Data → Run Through Model
New Data → Run Through Model → Top 3 Category Recommendations → Mappers
** Support Vector Machine (SVM), Naïve Bayes
Categorized data example:
Category    Description
Category 1  DOGS RUNNING AT LARGE
Category 1  CANINE RUNNING AT LARGE PROHIBITED
Category 2  RUNNING A RED LIGHT
Category 3  FISHING WITHOUT LICENSE
10. Build Vocabulary
Categorized Data → Clean & Build Vocabulary
Category    Description
Category 1  DOGS RUNNING AT LARGE
Category 1  CANINE RUNNING AT LARGE PROHIBITED
Category 2  RUNNING A RED LIGHT
Category 3  FISHING WITHOUT LICENSE
Vocabulary after stemming (http://textanalysisonline.com/nltk-porter-stemmer):
WORD   COUNT
RUN    3
LARG   2
CANIN  1
….     ….
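The vocabulary step above can be sketched in a few lines of Python. This is only an illustration, not the presenters' HPCC Systems code: `toy_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real Porter stemmer, and `build_vocabulary` is an assumed helper name.

```python
from collections import Counter

def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a Porter stemmer."""
    w = word.upper()
    if w.endswith("ING") and len(w) > 5:
        w = w[:-3]
        if len(w) >= 2 and w[-1] == w[-2]:  # undouble: RUNN -> RUN
            w = w[:-1]
    elif w.endswith("ED") and len(w) > 4:
        w = w[:-2]
    elif w.endswith("S") and not w.endswith("SS") and len(w) > 3:
        w = w[:-1]
    if w.endswith("E") and len(w) > 3:      # LARGE -> LARG
        w = w[:-1]
    return w

def build_vocabulary(descriptions):
    """Count how often each stemmed word occurs across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(toy_stem(token) for token in text.split())
    return counts

descriptions = [
    "DOGS RUNNING AT LARGE",
    "CANINE RUNNING AT LARGE PROHIBITED",
    "RUNNING A RED LIGHT",
    "FISHING WITHOUT LICENSE",
]
vocabulary = build_vocabulary(descriptions)
for word, count in vocabulary.most_common():
    print(word, count)
```

On the four sample descriptions this yields the counts shown on the slide for RUN, LARG and CANIN. A production pipeline would use a tested stemmer instead of the toy rules.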
11. Build & Validate Model
Categorized data, split into Training Data and Validation Data:
Category    Description
Category 1  DOGS RUNNING AT LARGE
Category 1  CANINE RUNNING AT LARGE PROHIBITED
Category 2  RUNNING A RED LIGHT
Category 3  FISHING WITHOUT LICENSE
Training Data (stemmed) → Build Models → Models*
Category    Description
Category 1  CANIN RUN AT LARG PROHIBIT
Category 2  RUN A RED LIGHT
Category 3  FISH WITHOUT LICENS
Validation Data (stemmed) → Run Through Model
Category    Description
Category 1  DOG RUN AT LARG
Result:
Recommendations                     Category    Description
Category 1, Category 2, Category 3  Category 1  DOGS RUNNING AT LARGE
* Support Vector Machine (SVM), Naïve Bayes
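As a rough illustration of how a model can turn a stemmed description into ranked category recommendations, here is a minimal multinomial Naïve Bayes sketch in Python. It is an assumption-laden toy, not the HPCC Systems implementation: the function names `train_naive_bayes` and `recommend` are hypothetical, add-one smoothing is assumed, and words unseen in training are simply ignored.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (category, stemmed description) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of descriptions
    vocab = set()
    for category, text in examples:
        tokens = text.split()
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def recommend(model, text, k=3):
    """Return the k categories with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in text.split() if t in vocab]  # skip unseen words
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # class prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[category] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

training = [
    ("Category 1", "CANIN RUN AT LARG PROHIBIT"),
    ("Category 2", "RUN A RED LIGHT"),
    ("Category 3", "FISH WITHOUT LICENS"),
]
model = train_naive_bayes(training)
recommendations = recommend(model, "DOG RUN AT LARG")
print(recommendations)
```

With the three training rows from the slide, the validation description DOG RUN AT LARG ranks Category 1 first, which is the category a mapper had assigned. Real validation would compare recommendations against the known category over many held-out records.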
12. Process New Data
Training Data
Category    Description
Category 1  CANINE RUNNING AT LARGE PROHIBITED
Category 2  RUNNING A RED LIGHT
Category 3  FISHING WITHOUT LICENSE
NEW Data
Description
CATS RUNNING AT LARGE
HUNTING WITHOUT LICENSE
NEW Data → Models* → Top 3 Category Recommendations → Mappers
Recommended Categories  Description
1, 2, 3                 CATS RUNNING AT LARGE
3, 1, 2                 HUNTING WITHOUT LICENSE
* Support Vector Machine (SVM), Naïve Bayes
13. Approach
[Flow diagram, repeated from slide 9: Categorized Data → Clean & Build Vocabulary → Training Data and Validation Data → Build Models → Models (SVM, Naïve Bayes); New Data → Run Through Model → Top 3 Category Recommendations → Mappers]
14. Outcome
• Backlog of public record data to standard
category mapping eliminated.
• Time to market for data products shortened,
no more delays from data mapping to
categories
• Happy mapping team, which could now work on data
enhancement projects.
16. NLP Toolkits on HPCC Systems Platform
RECORD DESCRIPTION: OPERATING/VEH/OVER MAX HGT
Stage               Output
TOKENIZER           OPERATING | VEH | OVER | MAX | HGT
SEMANTIC ANALYZER   OPERATING | VEHICLE | OVER | MAX | HEIGHT
STEMMER             OPER | VEHICL | OVER | MAX | HEIGHT
STOP-WORDS REMOVER  OPER | VEHICL | MAX | HEIGHT
N-GRAM              OPER VEHICL | VEHICL MAX | MAX HEIGHT
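The toolkit stages on this slide can be chained as ordinary functions. The Python sketch below is illustrative only: the lookup tables `ABBREVIATIONS`, `STEMS` and `STOP_WORDS` are hypothetical and contain just enough entries for the slide's example, and the stage order is an assumption inferred from the intermediate outputs shown.

```python
import re

# Hypothetical lookup tables, sized only for the slide's example record.
ABBREVIATIONS = {"VEH": "VEHICLE", "HGT": "HEIGHT"}    # semantic analyzer
STEMS = {"OPERATING": "OPER", "VEHICLE": "VEHICL"}     # stand-in for a stemmer
STOP_WORDS = {"OVER"}

def tokenize(text):
    """Split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.upper()) if t]

def expand_abbreviations(tokens):
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def stem(tokens):
    return [STEMS.get(t, t) for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("OPERATING/VEH/OVER MAX HGT")
expanded = expand_abbreviations(tokens)
stemmed = stem(expanded)
filtered = remove_stop_words(stemmed)
bigrams = ngrams(filtered)
print(bigrams)
```

Keeping each stage as a separate function makes it easy to reorder or drop stages when experimenting with features.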
17. Latent Dirichlet Allocation - Topic Model
• Unsupervised natural language processing (NLP) algorithm
• Explores the topics in documents
• Each topic is a distribution over words
• Each document is a mixture of topics
• Each word is drawn from one of the topics
LDA Topic Model
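The three modeling assumptions listed above describe a generative process, which can be sketched directly in Python. This only simulates how LDA assumes documents arise; it does not fit a model and is not the HPCC Systems implementation. The topic-word probabilities, the `alpha` value and the function names are all made up for illustration.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topics, alpha, length):
    """Generate one document under the LDA assumptions."""
    # Each document is a mixture of topics.
    mixture = sample_dirichlet(rng, [alpha] * len(topics))
    words = []
    for _ in range(length):
        # Each word is drawn from one topic chosen according to the mixture.
        topic = rng.choices(range(len(topics)), weights=mixture)[0]
        # Each topic is a distribution over words.
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Hypothetical topics; the probabilities are invented for this sketch.
topics = [
    {"SPEED": 0.7, "DRIVE": 0.2, "UNLAW": 0.1},
    {"INSUR": 0.5, "PROOF": 0.3, "REQUIR": 0.2},
    {"PARK": 0.6, "IMPROP": 0.3, "VEHICL": 0.1},
]
rng = random.Random(0)
mixture, words = generate_document(rng, topics, alpha=0.5, length=6)
print(mixture, words)
```

Fitting LDA runs this process in reverse: given only the words, it infers the topic distributions and the per-document mixtures.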
18. Scalable LDA on HPCC Systems Platform
• Massively Parallel Topic Modeling
• Flexible Hyper-Parameter Setup
• Experiments Topic Range [10 – 103]
LDA TOPIC MODEL RESULT
[Table: top stemmed terms for TOPIC 1 through TOPIC 10. Terms appearing across the topics: OPER, UNLAW, PROOF, DRIVE, IMPROP, PARK, VEHICL, SPEED, INSUR, MOTORCYCL, REQUIR, EY, FAIL, PROTECT]
19. Next Steps
• Continue exploratory analysis
• Additional algorithms
• Automatically map data to categories, not just
make recommendations
• Refine existing and build new models
• Solve other business problems with HPCC
Systems Machine Learning
• Uniform language
• Ease of data access
• High productivity
Editor's notes
Public records: http://www.epchc.org/i-want-to/request-a-public-record
Map of states: https://openclipart.org/detail/230078/multicolored-united-states-map