University of Technology Sydney
Seminar 24 June 2014.
http://www.statsoc.org.au/events/ssai-events/data-science-search-party/
The Search Party is a Sydney-based technology platform that is a positive disruptor for the recruitment industry. We have created the first online marketplace for talent, which makes it quicker and easier to hire better people, while giving recruitment agencies a sustainable and profitable revenue stream. In this presentation I will give an overview of the challenges we face in extracting value from the large amounts of data that circulate daily through our software platform. These data originate from job seekers, employers and recruiters, and processing them requires interdisciplinary work at the intersection of statistics, machine learning, data mining, computer science, information retrieval and natural language processing. The ultimate goal is to accurately match vacancies with job seekers and automate the recruitment service.
2. Outline
• About myself
• The Search Party
• What is data science @ The Search Party?
• Deduplication of candidates
• Visualization of career paths
• Technology - Software
• Conclusion
4. About myself
• Master in Information Sciences, Universiteit Hasselt, Belgium
• Master in Bioinformatics, Katholieke Universiteit Leuven, Belgium
• Master in Statistics, Katholieke Universiteit Leuven, Belgium
• PhD and Postdoc in Engineering, Department of Electrical Engineering,
Katholieke Universiteit Leuven (Sabine Van Huffel, Johan Suykens)
“Predictive computer models, machine learning, decision support systems”
• Postdoc, School of Mathematical Sciences, University of Technology Sydney,
Australia (Matt Wand) “Mean field variational Bayes,
semiparametric regression, streaming data, real-time analysis”
• October 2013: Data Scientist, The Search Party, Sydney
6. The Search Party
There are major forces acting on Recruitment as an industry…
• Traditional recruitment model under pressure from technology
• Pressure on pricing damaging agency profitability
• Bulk of agency costs are people who drive revenue
• Global economic uncertainty
• Corp. investment in internal talent sourcing teams
7. We allow potential employers to search a vast ocean of the world’s best candidates
We connect employers with the agencies who represent these candidates to agree a fee and arrange an introduction
Supporting this evolution is the world’s first marketplace for talent…
17. Data
• 2 million candidates
• 46 million skills
• 14 million employment history records
Example employment history records:
Concrete Formworker, Doran Contractors, 1999-2012
Site Supervisor, Allied Gold, 1997-2000
Java Developer, IBM, 2010-2011
18. Data
• 40000 vacancies
19. Data
• 29 industries, 384 subsectors
Engineering
Accounting
Administration & Office Support
Advertising, Arts & Media
Banking & Financial Services
Call Centre & Customer Services
Community Services & Development
Construction
Consulting & Strategy
Design & Architecture
Education & Training
20. Data
• 75 GB marketplace logs
Create Candidate
Publish Candidate
Forgot Password
Submit Candidate
Vote Up
Vote Down
Request Candidate
Appeared In Search Results
Account Login
Upload CV
21. Data
• 100 recruitment agencies
22. Data science @ The Search Party!
• Testing hypotheses
• Design of experiments
• Cross-validation
• Training data vs. test data
• Performance measure
• Building a prediction model
• Regression
• Support vector machines
• Variable selection
• Sensitivity, specificity
• Cost and benefit
• Clustering
• Topic modeling
• Distributed computing
• Programming
• Software engineering
• Data structures
• Term frequency - inverse document frequency
• Entity resolution
• Sentence detection
• Tokenization
• Sentiment analysis
• Part-of-speech tagging
Disciplines: statistics, machine learning, data mining, computer science, information retrieval, natural language processing
29. Clustering
• Entity resolution does not happen independently for each
pair of candidates
• Number of clusters is unknown
• Many, many small (possibly singleton) clusters
30. Correlation clustering
• Take a pair‐wise similarity graph as input
• Edge x_ij ∈ {0,1}, with x_ij = 1 if candidates i and j are assigned to
the same cluster. p_ij is the ‘belief’ that candidates i and j are
the same
• Optimize:
Define:
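A common way to write such an optimization problem, given here as a sketch under assumptions rather than as the exact form used in the talk, is to maximize the log-likelihood of the cluster assignments under the pairwise beliefs p_ij, subject to transitivity constraints that make the x_ij describe a valid partition. The log-weights and the constraint form are assumptions.

```latex
\max_{x} \; \sum_{i<j} \Bigl[ x_{ij}\,\log p_{ij} + (1 - x_{ij})\,\log\bigl(1 - p_{ij}\bigr) \Bigr]
\quad \text{subject to} \quad
x_{ij} \in \{0,1\}, \qquad
x_{ij} + x_{jk} - x_{ik} \le 1 \quad \forall\, i, j, k
```

The transitivity constraint says that if i and j share a cluster and j and k share a cluster, then i and k must share it too.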
31. Correlation clustering
Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for
correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear
Programming for Natural Language Processing (ILP '09). Association for
Computational Linguistics, Stroudsburg, PA, USA, 19-27.
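To make the idea of clustering from pairwise beliefs concrete, here is a minimal sketch of a pivot-style greedy heuristic. It is a simple stand-in chosen for illustration, not necessarily the method used in the talk. The function name `pivot_clustering`, the 0.5 threshold and the toy beliefs are all assumptions, and the pivot is taken in list order (rather than at random) so that the result is reproducible.

```python
def pivot_clustering(n, belief, threshold=0.5):
    """Greedy pivot heuristic (illustrative): take the first unassigned item
    as pivot and put every unassigned item whose belief of being the same
    as the pivot exceeds the threshold into the pivot's cluster."""
    unassigned = list(range(n))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [j for j in unassigned[1:] if belief(pivot, j) > threshold]
        clusters.append(cluster)
        # Remove everything that has just been clustered.
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters


# Hypothetical pairwise beliefs p_ij for five candidates; unlisted pairs get 0.1.
p = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7, (3, 4): 0.95}


def belief(i, j):
    return p.get((min(i, j), max(i, j)), 0.1)


clusters = pivot_clustering(5, belief)
print(clusters)
```

Unlike an exact solver, this heuristic never revisits a decision, so it trades solution quality for speed.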
32. Pairwise similarity matrix
• We need a measure that quantifies the similarity between
candidates:
• Candidate 1: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 2: Jan Luts, jan.m.luts@gmail.com, KULeuven, UTS
• Candidate 3: Jam Lutf, jan.m.luts@gmail.com
• Candidate 4: J Luts, KULeuven
• Candidate 5: Ian Luts, jan.m.luts@gmail.com, KULeuven, UTS, TSP
• Candidate 6: Jan Luts, john@staffrecruitment.com, UTS, TSP
33. Term frequency - inverse document frequency
Terms:       jan.  an.m  n.m.  luts  uts@  mail  gmai  .com  @hot  jan_
Candidate 1  1     1     1     1     1     1     1     1     0     0
Candidate 2  1     1     1     1     1     1     1     1     0     0
Candidate 3  1     1     1     1     1     1     1     1     0     0
Candidate 4  0     0     0     0     0     0     0     0     0     0
Candidate 5  1     1     1     1     1     0     1     1     0     0
Candidate 6  0     0     0     1     1     1     0     1     1     1
These are called ‘term frequencies’
Inverse document frequency for ‘.com’: log(6/5)
TF-IDF for ‘.com’ for candidate 6: 1 * log(6/5) = 0.18
TF-IDF for ‘jan_’ for candidate 6: 1 * log(6/1) = 1.79
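The arithmetic on this slide can be reproduced with a short sketch. The term-frequency matrix below is copied from the table above and the natural logarithm is used, which matches the quoted values. The helper names and the use of cosine similarity to compare candidates are assumptions made for illustration.

```python
import math

# Term-frequency matrix from the slide: rows are candidates 1-6,
# columns are the character 4-grams listed under "Terms".
terms = ["jan.", "an.m", "n.m.", "luts", "uts@", "mail", "gmai", ".com", "@hot", "jan_"]
tf = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]


def inverse_document_frequency(tf_matrix):
    """log(N / number of rows containing the term); 0 for unseen terms."""
    n_docs = len(tf_matrix)
    idf = []
    for col in range(len(tf_matrix[0])):
        doc_count = sum(1 for row in tf_matrix if row[col] > 0)
        idf.append(math.log(n_docs / doc_count) if doc_count else 0.0)
    return idf


def tf_idf(tf_matrix):
    """Multiply each term frequency by the term's inverse document frequency."""
    idf = inverse_document_frequency(tf_matrix)
    return [[count * weight for count, weight in zip(row, idf)] for row in tf_matrix]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


weights = tf_idf(tf)
print(round(weights[5][terms.index(".com")], 2))  # 0.18
print(round(weights[5][terms.index("jan_")], 2))  # 1.79
```

Rare terms such as ‘jan_’ receive a much larger weight than common ones such as ‘.com’, so they dominate any similarity computed on the weighted vectors.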
35. Correlation clustering
Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for
correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear
Programming for Natural Language Processing (ILP '09). Association for
Computational Linguistics, Stroudsburg, PA, USA, 19-27.
O(n²) pairwise comparisons: does not scale with an increasing number of candidates!
36. ‘Big Data’
• ‘Big Data’ criticism:
• ‘You May Not Need Big Data After All’, HBR, December 2013
• ‘Google Flu Trends: The Limits of Big Data’, NYT, March 2014
• ‘Big data: are we making a big mistake?’, FT Magazine, March 2014
• ‘The backlash against big data’, The Economist, April 2014
• @ The Search Party:
• Sampling can help sometimes, but not always …
• We have a lot of data, this creates new problems …
• … and we just have to deal with it
• We need the right tools and algorithms to process millions of data
points
37. Deduplication of candidates
• So how can we do correlation clustering on millions of
candidates?
o Blocking: e.g. split the data set into separate blocks based on
gender, geographical location, …
o Canopy clustering: a pre-clustering algorithm used as a
preprocessing step. Use a cheap distance measure to partition the
data into overlapping subsets (i.e. canopies), then run expensive
clustering on each canopy
38. Canopy clustering
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-
dimensional data sets with application to reference matching. In Proceedings of the sixth
ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00).
ACM, New York, NY, USA, 169-178.
• Start with a list of the candidates in any order, and with
two distance thresholds, T1 and T2, where T1 > T2.
• Pick a candidate from the list, make it a canopy center and
approximately measure its distance to all other candidates.
• Put all candidates that are within distance threshold T1
into a canopy. Remove from the list all candidates that are
within distance threshold T2. Repeat until the list is empty.
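The three steps above can be sketched in a few lines. This is a minimal single-machine illustration, not the production implementation. The function name, the one-dimensional toy points and the absolute-difference distance are assumptions, and the first candidate in the list is always picked as the next center.

```python
def canopy_clustering(points, distance, t1, t2):
    """Return a list of (center_index, member_indices) canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); t1 must exceed t2.
    """
    if not t1 > t2:
        raise ValueError("T1 must be larger than T2")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Cheap distance from the center to every point, including points
        # that were already removed from the list: canopies may overlap.
        dists = [distance(points[center], p) for p in points]
        members = [i for i, d in enumerate(dists) if d < t1]
        canopies.append((center, members))
        # Drop the center and every point within T2 of it from the list.
        remaining = [i for i in remaining if i != center and dists[i] >= t2]
    return canopies


# Hypothetical one-dimensional example.
points = [0.0, 0.5, 1.2, 5.0, 5.3]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=1.5, t2=1.0)
print(canopies)
```

In this example the point at index 2 is close enough to the first center to join its canopy (within T1) but not close enough to be removed (within T2), so it later becomes a center of its own and the two canopies overlap.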
40. Deduplication of candidates
Strategy outline:
• Do canopy clustering using TF-IDFs
• Do expensive correlation clustering for each canopy using a
similarity matrix based on all available candidate information
(e.g. name, email, phone, mobile, employment history,
publications, certificates, …)
• We need to do < 0.005 of all possible pairwise comparisons
Optimization:
• Parallelization of TF-IDF computation, canopy clustering
• Run correlation clustering in parallel for each canopy
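The saving from restricting comparisons to canopies can be measured directly: count the distinct pairs that share at least one canopy and divide by the n(n-1)/2 pairs of an exhaustive comparison. The sketch below uses hypothetical canopies and a hypothetical function name; it illustrates the bookkeeping, not the figures quoted above.

```python
from itertools import combinations


def fraction_of_pairs_compared(canopies, n_candidates):
    """Fraction of all candidate pairs that fall together in some canopy."""
    pairs = set()
    for members in canopies:
        # Sorting makes (a, b) and (b, a) the same pair across canopies.
        pairs.update(combinations(sorted(set(members)), 2))
    total = n_candidates * (n_candidates - 1) // 2
    return len(pairs) / total if total else 0.0


# Hypothetical canopies over 8 candidates; candidate 2 sits in two canopies.
example_canopies = [[0, 1, 2], [2, 3], [4, 5, 6, 7]]
fraction = fraction_of_pairs_compared(example_canopies, 8)
print(round(fraction, 3))
```

Pairs that appear in several overlapping canopies are counted once, so the figure reflects distinct comparisons rather than total work per canopy.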
41. Large-scale data processing: Hadoop
• Open-source software framework for distributed computing
• MapReduce programming model
• Resilient to failure
42. How to do canopy clustering on Hadoop?
• Two steps:
• Canopy generation: identify the canopy centers
• Canopy filling: assign candidates to canopies
43. Canopy generation on Hadoop
Input: the candidates, split into batches (Batch 1, Batch 2, Batch 3, Batch 4)
Map:
Initialize: centers1 = {}, centers2 = {}, centers3 = {}, centers4 = {}
For each batch in parallel: if ∀i, distance(candidate x, center i) > T2,
output the pair (‘intermediateCenter’, candidate x)
Output of the map step: intermediate centers
Reduce:
Initialize: finalCenters = {}
If ∀i, distance(intermediateCenter x, finalCenter i) > T2,
output the pair (‘finalCenter’, intermediateCenter x)
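The map and reduce steps above can be simulated in plain Python without a Hadoop cluster. This is a sketch under assumptions: the function names and the one-dimensional toy data are hypothetical, and each accepted candidate is also added to the running list of centers, which the slide implies through its initialization step but does not state.

```python
def select_centers(items, distance, t2):
    """Keep an item as a center if it is farther than T2 from every center so far."""
    centers = []
    for item in items:
        if all(distance(item, c) > t2 for c in centers):
            centers.append(item)
    return centers


def map_canopy_centers(batch, distance, t2):
    """Map step: each batch independently proposes intermediate centers."""
    return [("intermediateCenter", c) for c in select_centers(batch, distance, t2)]


def reduce_canopy_centers(pairs, distance, t2):
    """Reduce step: apply the same T2 rule to all intermediate centers."""
    intermediate = [value for key, value in pairs if key == "intermediateCenter"]
    return [("finalCenter", c) for c in select_centers(intermediate, distance, t2)]


def abs_distance(a, b):
    return abs(a - b)


# Hypothetical candidates (as numbers) split into four batches.
batches = [[0.0, 0.4, 3.0], [0.2, 3.1, 6.0], [6.2, 9.0], [9.1, 0.1]]
intermediate = [pair for batch in batches
                for pair in map_canopy_centers(batch, abs_distance, t2=1.0)]
final = reduce_canopy_centers(intermediate, abs_distance, t2=1.0)
print([center for _, center in final])
```

The reduce step is needed because different batches can propose near-identical centers (here 0.0 and 0.2, or 3.0 and 3.1); applying the T2 rule again removes them.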
44. Canopy filling on Hadoop
Input: the candidates, split into batches (Batch 1, Batch 2, Batch 3, Batch 4)
Map:
Retrieve canopyCenters from the canopy generation job
For each batch in parallel: ∀i, if distance(candidate x, center i) < T1,
output the pair (center i, candidate x)
Output of the map step: center-candidate pairs (Center-Candidate Batch 1, Batch 2, Batch 3)
Reduce:
For each batch: output the list of all candidates belonging to the
same canopy with center i
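Canopy filling can be simulated in the same way. In this sketch the map step emits one (center, candidate) pair for every center within T1 of a candidate, and the reduce step groups the pairs by center. The function names, the centers and the toy candidates are assumptions for illustration only.

```python
from collections import defaultdict


def map_fill(batch, centers, distance, t1):
    """Map step: emit (center, candidate) for every center within T1."""
    return [(center, candidate)
            for candidate in batch
            for center in centers
            if distance(candidate, center) < t1]


def reduce_fill(pairs):
    """Reduce step: collect all candidates that share a canopy center."""
    canopies = defaultdict(list)
    for center, candidate in pairs:
        canopies[center].append(candidate)
    return dict(canopies)


def abs_distance(a, b):
    return abs(a - b)


# Hypothetical centers (as produced by a canopy generation job) and batches.
centers = [0.0, 3.0, 6.0, 9.0]
batches = [[0.0, 0.4, 1.5], [3.0, 3.1], [6.0, 6.2], [9.0, 9.1]]
pairs = [pair for batch in batches
         for pair in map_fill(batch, centers, abs_distance, t1=2.0)]
filled = reduce_fill(pairs)
print(filled)
```

The candidate 1.5 lies within T1 of two centers and therefore lands in two canopies, which is the overlap that later lets the expensive clustering step see it in both contexts.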
45. Deduplication of candidates - Summary
• Our dedupe pipeline is a blend of concepts from information
retrieval (TF-IDF), statistics and machine learning
(correlation clustering)
• Applying it to large data sets causes new problems and
requires redesigning/adjusting the algorithms (canopy
clustering, distributed computing, Hadoop)
• Integration in the existing platform:
o How do data get in and out of the dedupe pipeline?
o Making it work in a ‘production environment’: fail-safe
code that handles failures in a safe way
46. Outline
• About myself
• The Search Party
• What is data science @ The Search Party?
• Deduplication of candidates
• Visualization of career paths
• Technology - Software
• Conclusion
47. Visualization of career paths
• 14 million employment history records:
• Longitudinal data: transitions between different jobs
• Available data: job titles, employer, full description, skills,
start dates, end dates, different versions of CV…
48. Visualization of career paths
• Visualize transition between jobs based on job title:
[Figure: graph of transitions between job titles. Nodes: network consultant, senior network consultant, technical project manager, senior network engineer, technical consultant, network analyst, network manager, consultant, network engineer, network architect, project manager, IT manager. Edge labels: .05, .04, .04, .11, .10, .12, .10, .09, .06, .08, .18]
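Edge weights for such a transition graph can be derived by counting consecutive job titles in each career. The sketch below computes, for every title, the proportion of moves that go to each following title. The career data are invented, and whether the figure's edge labels were normalized this way is an assumption.

```python
from collections import Counter, defaultdict


def transition_proportions(careers):
    """For each job title, the share of observed moves to each next title."""
    counts = defaultdict(Counter)
    for titles in careers:
        # Pair each job with the one that follows it in the same career.
        for current, following in zip(titles, titles[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for current, nexts in counts.items()
    }


# Hypothetical careers, each listed in chronological order.
careers = [
    ["network engineer", "senior network engineer", "network architect"],
    ["network engineer", "network consultant"],
    ["network engineer", "senior network engineer", "IT manager"],
    ["network consultant", "senior network consultant"],
]
proportions = transition_proportions(careers)
for title, nexts in proportions.items():
    print(title, nexts)
```

Each (title, next title, proportion) triple can then be drawn as a weighted directed edge in a graph visualization tool.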
53. Conclusion
• Innovative work in a challenging environment
• Variety: understanding business problems, literature
review, algorithm design, prototyping, evaluation,
implementation, optimization
• Data science: statistics has a very important role to play
• Software engineering skills
• Big data: large data sets cause new problems
• Team work
• Passion!