Data Science @ The Search Party (Dr. Jan Luts)

Data Science @ The Search Party
Jan Luts

Outline
• About myself
• The Search Party
• What is data science @ The Search Party?
• Deduplication of candidates
• Visualization of career paths
• Technology - Software
• Conclusion

About myself
• Master in Information Sciences, Universiteit Hasselt, Belgium
• Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium
• Master in Statistics, Katholieke Universiteit Leuven, Belgium
• PhD and Postdoc in Engineering, Department of Electrical Engineering,
Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens)
“Predictive computer models, machine learning, decision support systems”
• Postdoc, School of Mathematical Sciences, University of Technology Sydney,
Australia (Matt Wand) “Mean field variational Bayes,
semiparametric regression, streaming data, real-time analysis”
• October 2013: Data Scientist, The Search Party, Sydney

The Search Party
There are major forces acting on Recruitment as an industry…
• Traditional recruitment model under pressure from technology
• Pressure on pricing damaging agency profitability
• Bulk of agency costs are people who drive revenue
• Global economic uncertainty
• Corp. investment in internal talent sourcing teams

We allow potential employers to search a vast ocean of the world’s
best candidates
We connect employers with the agencies who represent them to agree
a fee and arrange an introduction
Supporting this evolution is the world’s first marketplace for talent…

Data
• 2 million candidates
• 46 million skills

Data
• 14 million employment history records
o Concrete Formworker, Doran Contractors, 1999-2012
o Site Supervisor, Allied Gold, 1997-2000
o Java Developer, IBM, 2010-2011

Data
• 40000 vacancies

Data
• 29 industries, 384 subsectors
Engineering
Accounting
Administration & Office Support
Advertising, Arts & Media
Banking & Financial Services
Call Centre & Customer Services
Community Services & Development
Construction
Consulting & Strategy
Design & Architecture
Education & Training

Data
• 75 GB marketplace logs
Create Candidate
Publish Candidate
Forgot Password
Submit Candidate
Vote Up
Vote Down
Request Candidate
Appeared In Search Results
Account Login
Upload CV

Data
• 100 recruitment agencies

Data science @ The Search Party!
• Testing hypotheses
• Design of experiments
• Cross-validation
• Training data vs. test data
• Performance measure
• Building a prediction model
• Regression
• Support vector machines
• Variable selection
• Sensitivity, specificity
• Cost and benefit
• Clustering
• Topic modeling
• Distributed computing
• Programming
• Software engineering
• Data structures
• Term frequency - inverse document frequency
• Entity resolution
• Sentence detection
• Tokenization
• Sentiment analysis
• Part-of-speech tagging
→ statistics
→ machine learning
→ data mining
→ computer science
→ information retrieval
→ natural language processing

Deduplication of candidates
[Diagram labels: Recruiter 1, Recruiter 2, Recruiter 3, The Search Party Database]

(Figure from Lise Getoor)

Clustering
• Entity resolution does not happen independently for each
pair of candidates
• Number of clusters is unknown
• Many, many small (possibly singleton) clusters

Correlation clustering
• Take a pair‐wise similarity graph as input
• Edge x_ij ∈ {0,1}, with x_ij = 1 if candidates i and j are assigned to
the same cluster; p_ij is the ‘belief’ that candidates i and j are
the same
• Optimize:
Define:
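
The objective can be illustrated with a small sketch. It assumes a log-odds weighting that is commonly paired with this formulation: w = log(p_ij / (1 - p_ij)), where max(w, 0) is the cost of separating a pair and max(-w, 0) is the cost of merging it. This weighting, the function names and the toy beliefs are assumptions for illustration, not necessarily the definition used at The Search Party.

```python
import math
from itertools import combinations

def log_odds(p, eps=1e-9):
    """Net edge weight: positive when the pair is more likely the same person."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return math.log(p / (1 - p))

def clustering_cost(labels, belief):
    """Total disagreement cost of a cluster assignment.

    labels: dict mapping candidate id -> cluster id
    belief: dict mapping (i, j) with i < j -> p_ij
    """
    cost = 0.0
    for i, j in combinations(sorted(labels), 2):
        w = log_odds(belief[(i, j)])
        if labels[i] == labels[j]:   # x_ij = 1
            cost += max(-w, 0.0)     # merged although the belief says "different"
        else:                        # x_ij = 0
            cost += max(w, 0.0)      # separated although the belief says "same"
    return cost

# Hypothetical beliefs for three candidates.
belief = {(1, 2): 0.9, (1, 3): 0.2, (2, 3): 0.1}
good = {1: "a", 2: "a", 3: "b"}
bad = {1: "a", 2: "b", 3: "b"}
print(clustering_cost(good, belief), clustering_cost(bad, belief))
```

The assignment that agrees with every belief has zero cost; optimization means searching for the assignment with the lowest total cost.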

Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for
correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear
Programming for Natural Language Processing (ILP '09). Association for
Computational Linguistics, Stroudsburg, PA, USA, 19-27.

Pairwise similarity matrix
• We need a measure that quantifies the similarity between
candidates:
• Candidate 1: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 2: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 3: Jam Lutf, jan.m.luts@gmail.com
• Candidate 4: J Luts, KULeuven
• Candidate 5: Ian Luts, jan.m.luts@gmail.com, KULeuven, UTS, TSP
• Candidate 6: Jan Luts, john@staffrecruitment.com, UTS, TSP

Term frequency - inverse document frequency
Terms:       jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
Candidate 1  1     1     1     1     1     1     1     1     0     0
Candidate 2  1     1     1     1     1     1     1     1     0     0
Candidate 3  1     1     1     1     1     1     1     1     0     0
Candidate 4  0     0     0     0     0     0     0     0     0     0
Candidate 5  1     1     1     1     1     0     1     1     0     0
Candidate 6  0     0     0     1     1     1     0     1     1     1
→ These are called ‘term frequencies’
→ Inverse document frequency for ‘.com’: log(6/5)
→ TF-IDF for ‘.com’ for candidate 6: 1 * log(6/5) = 0.18
→ TF-IDF for ‘jan_’ for candidate 6: 1 * log(6/1) = 1.79
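
The two TF-IDF values above can be reproduced with a short sketch. The term-frequency rows are copied from the table; the function names are hypothetical, and the natural logarithm is assumed because it matches the values shown.

```python
import math

terms = ["jan.", "an.m", "n.m.", "luts", "uts@", "mail", "gmai", ".com", "@hot", "jan_"]

# Term frequencies per candidate, as in the table above.
tf = {
    "Candidate 1": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 2": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 3": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Candidate 4": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Candidate 5": [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    "Candidate 6": [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
}

def idf(term):
    """log(number of candidates / number of candidates containing the term)."""
    col = terms.index(term)
    doc_freq = sum(1 for row in tf.values() if row[col] > 0)
    return math.log(len(tf) / doc_freq)

def tf_idf(candidate, term):
    """Term frequency multiplied by inverse document frequency."""
    return tf[candidate][terms.index(term)] * idf(term)

print(round(tf_idf("Candidate 6", ".com"), 2))  # 0.18
print(round(tf_idf("Candidate 6", "jan_"), 2))  # 1.79
```

A rare term such as ‘jan_’ gets a much larger weight than a term such as ‘.com’ that occurs for almost every candidate.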


Pairwise similarity matrix
• Combine cosine similarity values for name, email
address, phone number, mobile number, skills,
employment history, …
        Cand 1  Cand 2  Cand 3  Cand 4  Cand 5  Cand 6
Cand 1  1       1       0.8     0.9     0.95    0.75
Cand 2          1       0.8     0.9     0.95    0.75
Cand 3                  1       0.6     0.87    0.7
Cand 4                          1       0.75    0.7
Cand 5                                  1       0.8
Cand 6                                          1
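
A minimal sketch of how one entry of such a matrix could be computed. The cosine similarity is standard; combining the per-field values with a weighted mean, the weights themselves and the name similarity of 0.9 are assumptions made for illustration only. The two vectors are the raw term-frequency rows of candidates 1 and 5 from the table of term frequencies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty field carries no evidence
    return dot / (norm_u * norm_v)

def combined_similarity(field_sims, weights):
    """Weighted mean of per-field similarities (the weighting is an assumption)."""
    total = sum(weights[field] for field in field_sims)
    return sum(weights[field] * sim for field, sim in field_sims.items()) / total

terms_1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # candidate 1
terms_5 = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]  # candidate 5

sims = {
    "email": cosine_similarity(terms_1, terms_5),
    "name": 0.9,  # hypothetical value for a second field
}
weights = {"email": 2.0, "name": 1.0}  # hypothetical weights
score = combined_similarity(sims, weights)
print(round(sims["email"], 3), round(score, 3))
```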

Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for
correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear
Programming for Natural Language Processing (ILP '09). Association for
Computational Linguistics, Stroudsburg, PA, USA, 19-27.
O(n²): does not scale with an increasing number of candidates!
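
The quadratic growth is easy to check: n candidates give n(n-1)/2 distinct pairs. The sketch below evaluates this for the six example candidates and for the 2 million candidates mentioned earlier; the function name is hypothetical.

```python
def pair_count(n):
    """Number of distinct candidate pairs among n candidates."""
    return n * (n - 1) // 2

print(pair_count(6))          # 15
print(pair_count(2_000_000))  # 1999999000000
```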

‘Big Data’
• ‘Big Data’ criticism:
• ‘You May Not Need Big Data After All’, HBR, December 2013
• ‘Google Flu Trends: The Limits of Big Data’, NYT, March 2014
• ‘Big data: are we making a big mistake?’, FT Magazine, March 2014
• ‘The backlash against big data’, The Economist, April 2014
• @ The Search Party:
• Sampling can help sometimes, but not always …
• We have a lot of data, this creates new problems …
• … and we just have to deal with it
• We need the right tools and algorithms to process millions of data
points

• So how can we do correlation clustering on millions of
candidates?
o Blocking: e.g. split the data set into separate blocks based on
gender, geographical location, …
o Canopy clustering:
▪ Pre-clustering algorithm used as a preprocessing
step: use a cheap distance measure to partition the
data into overlapping subsets (i.e. canopies)
▪ Run expensive clustering on each canopy
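
Blocking can be sketched in a few lines. The records, the `location` field and the function names below are hypothetical; the point is only that pairs are compared within a block, so the number of comparisons drops.

```python
from collections import defaultdict

def block_by(records, key):
    """Split records into disjoint blocks that share the same value for `key`."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return dict(blocks)

def pairs_within(blocks):
    """Number of pairwise comparisons when only comparing inside each block."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())

# Hypothetical candidate records.
records = [
    {"id": 1, "location": "Sydney"},
    {"id": 2, "location": "Sydney"},
    {"id": 3, "location": "Melbourne"},
    {"id": 4, "location": "Melbourne"},
    {"id": 5, "location": "Sydney"},
    {"id": 6, "location": "Brisbane"},
]
blocks = block_by(records, "location")
print(pairs_within(blocks))  # 4 comparisons instead of 15
```

The drawback is that blocks are disjoint: two records of the same person with different values for the blocking key are never compared, which is one motivation for overlapping canopies.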

Canopy clustering
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-
dimensional data sets with application to reference matching. In Proceedings of the sixth
ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00).
ACM, New York, NY, USA, 169-178.
• Start with a list of the candidates in any order, and with
two distance thresholds, T1 and T2, where T1 > T2.
• Pick a candidate off the list, make it a canopy center and
approximately measure its distance to all other candidates.
• Put all candidates that are within distance threshold T1
into a canopy. Remove from the list all candidates that are
within distance threshold T2. Repeat until the list is empty.
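
The three steps above can be sketched as follows. This is a single-machine illustration on hypothetical one-dimensional points with absolute difference as the cheap distance; it assumes that candidates already removed from the list can still be placed in later canopies, which is what makes canopies overlap.

```python
def canopy_clustering(points, distance, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("T1 must be larger than T2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # pick a candidate off the list
        # Loose threshold T1: everything nearby goes into this canopy.
        members = [p for p in points if distance(center, p) <= t1]
        canopies.append((center, members))
        # Tight threshold T2: nearby candidates can no longer become centers.
        remaining = [p for p in remaining[1:] if distance(center, p) > t2]
    return canopies

points = [0.0, 0.5, 1.2, 5.0, 5.4, 9.0]  # hypothetical data
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=1.5, t2=1.0)
for center, members in canopies:
    print(center, members)
```

In this run the points 0.0, 0.5 and 1.2 end up in two canopies (centers 0.0 and 1.2), which shows the overlap.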

Canopy clustering
Five canopies found
Do correlation clustering on each canopy

Strategy outline:
• Do canopy clustering using TF-IDFs
• Do expensive correlation clustering for each canopy using a
similarity matrix based on all available candidate information
(e.g. name, email, phone, mobile, employment history,
publications, certificates, …)
• We need to do < 0.005 of all possible pairwise comparisons
Optimization:
• Parallelization of TF-IDF computation, canopy clustering
• Run correlation clustering in parallel for each canopy
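
Running the expensive step per canopy in parallel could look like the sketch below. The greedy pivot procedure is a simple stand-in for correlation clustering, not the method used in the pipeline, and the thread pool only illustrates the structure (a real deployment would distribute the canopies over processes or cluster nodes). All names and the toy beliefs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def pivot_cluster(canopy, belief, threshold=0.5):
    """Greedy pivot clustering of one canopy (stand-in for the expensive step)."""
    unassigned = list(canopy)
    clusters = []
    while unassigned:
        pivot = unassigned.pop(0)
        # Everything believed to be the same person as the pivot joins its cluster.
        cluster = [pivot] + [c for c in unassigned if belief(pivot, c) > threshold]
        unassigned = [c for c in unassigned if c not in cluster]
        clusters.append(cluster)
    return clusters

def cluster_all(canopies, belief, workers=4):
    """Cluster every canopy independently, several canopies at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda canopy: pivot_cluster(canopy, belief), canopies))

# Hypothetical data: candidates 1 and 2 are believed to be duplicates.
duplicates = {frozenset({1, 2})}
belief = lambda a, b: 0.9 if frozenset({a, b}) in duplicates else 0.1
canopies = [[1, 2, 3], [4, 5]]
result = cluster_all(canopies, belief)
print(result)  # [[[1, 2], [3]], [[4], [5]]]
```

Because canopies may overlap, a candidate can appear in the results of more than one canopy; reconciling those results is left out of this sketch.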

Large-scale data processing with Hadoop:
• Open-source software framework for distributed computing
• MapReduce programming model
• Resilient to failure

How to do canopy clustering on Hadoop?
• Two steps:
• Canopy generation: identify the canopy centers
• Canopy filling: assign candidates to canopies

Canopy generation on Hadoop
Map:
Initialize: centers1 = {}, centers2 = {}, centers3 = {}, centers4 = {}
For each batch in parallel: if ∀i, distance(candidate x, center i) > T2,
output the pair (‘intermediateCenter’, candidate x)
[Diagram: Candidates Batch 1, Batch 2, Batch 3, Batch 4 → Intermediate Centers]
Reduce:
Initialize: finalCenters = {}
If ∀i, distance(intermediateCenter x, finalCenter i) > T2,
output the pair (‘finalCenter’, intermediateCenter x)
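
The map and reduce phases of canopy generation can be simulated in plain Python. This is not Hadoop API code: the batches are ordinary lists, the data are hypothetical one-dimensional points, and the sketch assumes that every emitted candidate is also added to the set of centers it was compared against.

```python
def select_centers(items, distance, t2):
    """Keep an item as a center if it is farther than T2 from all centers so far."""
    centers = []
    for x in items:
        if all(distance(x, c) > t2 for c in centers):
            centers.append(x)
    return centers

def map_phase(batch, distance, t2):
    # Each mapper sees one batch and emits ('intermediateCenter', candidate) pairs.
    return [("intermediateCenter", x) for x in select_centers(batch, distance, t2)]

def reduce_phase(pairs, distance, t2):
    # A single reducer receives all intermediate centers and thins them out again.
    intermediate = [x for _, x in pairs]
    return [("finalCenter", x) for x in select_centers(intermediate, distance, t2)]

dist = lambda a, b: abs(a - b)
batches = [[0.0, 0.4, 3.0], [0.2, 3.1, 7.0], [6.8, 0.1], [3.2, 9.9]]  # hypothetical
emitted = [pair for batch in batches for pair in map_phase(batch, dist, 1.0)]
final = reduce_phase(emitted, dist, 1.0)
print([x for _, x in final])  # [0.0, 3.0, 7.0, 9.9]
```

The reduce step is needed because different batches can each nominate a center for the same region; here nine intermediate centers shrink to four final ones.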

Canopy filling on Hadoop
Map:
Retrieve canopyCenters from the canopy generation job
For each batch in parallel: ∀i, if distance(candidate x, center i) < T1,
output the pair (center i, candidate x)
[Diagram: Candidates Batch 1, Batch 2, Batch 3, Batch 4 → Center-Candidate Batch 1, Batch 2, Batch 3]
Reduce:
For each batch: output the list of all candidates belonging to the
same canopy with center i
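
Canopy filling can be simulated in the same way. Again this is plain Python rather than Hadoop API code, with hypothetical centers, batches and threshold; grouping by key stands in for the shuffle between the map and reduce phases.

```python
from collections import defaultdict

def fill_map(batch, centers, distance, t1):
    # Emit (center, candidate) for every center within the loose threshold T1.
    return [(c, x) for x in batch for c in centers if distance(x, c) < t1]

def fill_reduce(pairs):
    # Collect all candidates that share the same canopy center.
    canopies = defaultdict(list)
    for center, candidate in pairs:
        canopies[center].append(candidate)
    return dict(canopies)

dist = lambda a, b: abs(a - b)
centers = [0.0, 3.0, 7.0]                       # hypothetical canopy centers
batches = [[0.0, 0.4, 1.6], [3.1, 7.0], [6.8, 3.2]]
pairs = [p for batch in batches for p in fill_map(batch, centers, dist, 2.0)]
canopies = fill_reduce(pairs)
print(canopies)
# {0.0: [0.0, 0.4, 1.6], 3.0: [1.6, 3.1, 3.2], 7.0: [7.0, 6.8]}
```

The candidate 1.6 lies within T1 of two centers and is therefore emitted twice, so it ends up in both canopies.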

Deduplication of candidates - Summary
• Our dedupe pipeline is a blend of concepts from information
retrieval (TF-IDF), statistics and machine learning
(correlation clustering)
• Applying it to large data sets causes new problems and
requires redesigning/adjusting the algorithms (canopy
clustering, distributed computing, Hadoop)
• Integration in the existing platform:
o How do data get in and out of the dedupe pipeline?
o Making it work in a ‘production environment’: Fail-safe
code - in case of failure, handle it in a safe way

Visualization of career paths
• 14 million employment history records:
• Longitudinal data: transitions between different jobs
• Available data: job titles, employer, full description, skills,
start dates, end dates, different versions of CV…

• Visualize transition between jobs based on job title:
[Figure: network graph of transitions between job titles. Nodes: network consultant, senior network consultant, technical project manager, senior network engineer, technical consultant, network analyst, network manager, consultant, network engineer, network architect, project manager, IT manager. Edge weights: .05, .04, .04, .11, .10, .12, .10, .09, .06, .08, .18]
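
Edge weights for such a graph could be estimated from ordered job-title sequences, as in the sketch below. The careers are hypothetical, and normalizing each count by the total number of transitions out of the source title is an assumption; the figure may use a different normalization.

```python
from collections import Counter, defaultdict

def transition_probabilities(careers):
    """Estimate P(next title | current title) from ordered job-title sequences."""
    counts = defaultdict(Counter)
    for titles in careers:
        for current, following in zip(titles, titles[1:]):
            counts[current][following] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

# Hypothetical careers, oldest job first.
careers = [
    ["network engineer", "senior network engineer", "network architect"],
    ["network engineer", "network consultant"],
    ["network engineer", "senior network engineer", "network manager"],
    ["network consultant", "senior network consultant"],
]
probs = transition_probabilities(careers)
for dst, p in probs["network engineer"].items():
    print("network engineer ->", dst, round(p, 2))
```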

Demo

Conclusion
• Innovative work in a challenging environment
• Variety: understanding business problems, literature
review, algorithm design, prototyping, evaluation,
implementation, optimization
• Data science: statistics has a very important role to play
• Software engineering skills
• Big data: large data sets cause new problems
• Team work
• Passion!
