Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
1. Machine intelligence in HR technology: resume analysis at scale.
Similarity matching, resume processing and no-frills deployment of deep learning models
2. Matching jobs to people
—
We apply data science to large numbers of resumes in real time, telling recruiters
who the most qualified candidates are for their job requirements and explaining why.
Resume processing and profile analysis
—
Opening scans through resume files and database candidate profiles to recommend the
perfect candidates for any given raw job description by analyzing patterns in candidate
history, weighing up skills and fetching candidate code & portfolios to support the decision.
A high-level overview of our platform is available here:
https://speakerdeck.com/amorroxic/opening-dot-io-system-architecture
3. Quick overview: resume logic pipeline
[Diagram stages: input doc->pdf (string / pdf byte array); download (byte array); read pdf (resume text); text stream feeding parallel tasks that each emit json: feature extraction, topics extraction, education parser, elasticsearch percolator, … (10 other tasks); combine (json); extra tasks emitting json: regex (email, etc), salary regression, …]
Reactive streams - successive aggregation of state generated by specialized actors
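The aggregation idea above can be sketched as independent tasks that each emit a JSON fragment, folded into one running state as they complete. This is only a minimal illustration using Python's asyncio, not the reactive-streams/actor stack the slide refers to; the task names, the simplified email pattern and the output keys are all assumptions made for the example.

```python
import asyncio
import re

# Hypothetical, simplified pattern; production email matching needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

async def extract_contacts(text: str) -> dict:
    """Regex task: pull email addresses out of the resume text."""
    return {"emails": EMAIL_RE.findall(text)}

async def extract_stats(text: str) -> dict:
    """Stand-in for a feature-extraction task (illustrative only)."""
    return {"word_count": len(text.split())}

async def analyse(text: str) -> dict:
    """Run all tasks concurrently and merge their JSON fragments into one state."""
    tasks = [extract_contacts(text), extract_stats(text)]
    state: dict = {}
    for fragment in asyncio.as_completed(tasks):
        state.update(await fragment)  # successive aggregation of state
    return state

profile = asyncio.run(
    analyse("Jane Doe, Scala engineer. Contact: jane.doe@example.com")
)
```

Each task only knows about its own fragment, so adding another parser means adding one more coroutine to the list.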
5. Matching pipeline
[Diagram: the search input (provided title) and the job description / job title each go through neural parsing and an encoder network, and each comes out as a dense vector.]
Matching jobs/candidates and people similar to each other in high volumes of resumes
—
All input encoded as dense vectors
Similarity = angular/cosine sim between sets of encodings
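As a rough illustration of the similarity measure named above, the sketch below computes cosine and angular similarity between dense vectors and ranks candidates against a job encoding. The vectors and candidate ids are made-up toy values, and the mapping of the angle to a [0, 1] score is one common convention, assumed here rather than taken from the talk.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_similarity(a, b):
    """Map the angle to [0, 1]: 1 = same direction, 0 = opposite."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return 1.0 - math.acos(cos) / math.pi

# Toy encodings (illustrative values only).
job = [0.2, 0.9, 0.1]
candidates = {"cand_a": [0.1, 0.8, 0.2], "cand_b": [0.9, 0.1, 0.0]}

ranked = sorted(
    candidates,
    key=lambda cid: cosine_similarity(job, candidates[cid]),
    reverse=True,
)
```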
6. Real time queries
[Diagram: two dense inputs are each looked up in random projection trees, each lookup returning candidates; the two result sets are blended as A * x + B * (1 - x).]
Fast matching - computing similarity over vast vector collections (x2)
—
Expensive to compute similarity metrics in real time -> k-nn approximations.
[Diagram labels: job title and job description, each as dense input; x is the search bias.]
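The blend A * x + B * (1 - x) from this slide can be sketched as follows. Which result set plays the role of A and which of B is not stated in the transcript, so the title/description assignment below is an assumption, as is scoring a candidate as 0 when it is missing from one result set.

```python
def blend(scores_a, scores_b, x):
    """Combine two {candidate_id: similarity} maps as A * x + B * (1 - x)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("search bias x must lie in [0, 1]")
    ids = set(scores_a) | set(scores_b)
    combined = {
        cid: scores_a.get(cid, 0.0) * x + scores_b.get(cid, 0.0) * (1.0 - x)
        for cid in ids
    }
    # Highest blended score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Toy scores (illustrative values only).
title_hits = {"c1": 0.9, "c2": 0.4}
description_hits = {"c2": 0.8, "c3": 0.6}

ranking = blend(title_hits, description_hits, x=0.25)
```

With x close to 1 the first result set dominates the ranking; with x close to 0 the second one does.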
7. PARSING:
Multi-class, seq2seq, character-level output (dates / OOV names / ..)
SIMILARITY/ENCODERS:
siamese networks
UP-SKILLING
model ensembles (input -> latent space -> salary regression -> sequences)
SUMMARIES
current area of research
We train multiple models for various contexts (jobs / resumes / ..)
Encoding input and NLP models architecture
General considerations
—
Mostly seq2seq, siamese, attention architectures
Input is mostly word vectors - however at times we augment input features
with ngrams / character-level information
8. Caution on word embedding
A potentially trivial example, but it shows why models should ideally be trained on data specific to the problem domain
—
fastText (own corpus, 10gb)
“scala’s”, “java/c++/scala”, “java/scala”, “clojure”, ..
similarity “scala” - “opera” = 0.17 (very syntax oriented)
fastText (own corpus, no character n-grams)
“kotlin”, “clojure”, “haskell”, “scala’s”, “f#”, ..
similarity “scala” - “opera” = 0.14 (good)
fastText (facebook pre-trained vectors, en wiki)
“traviata”, “barbiere”, “teatro”, “verdi”, ..
similarity “scala” - “opera” = 0.57 (very broad)
word2vec (own corpus)
“kotlin”, “clojure”, “haskell”, “f#”, ..
similarity “scala” - “opera” = 0.05 (very specific)
Query word: “scala”
Syntactic bias: char n-grams in skipgram/cbow
Semantic bias: no char n-grams in skipgram/cbow
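One way to see where the syntactic bias comes from: fastText represents a word by its character n-grams (with `<` and `>` as boundary markers, lengths 3 to 6 by default), so words that share spelling share most of their representation. The sketch below only counts shared n-grams; it does not train or load any model, and the Jaccard overlap is an illustrative proxy rather than how fastText itself scores similarity.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in fastText-style boundary markers."""
    token = f"<{word}>"
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }

def jaccard(a, b):
    """Share of n-grams two words have in common."""
    return len(a & b) / len(a | b)

overlap_variant = jaccard(char_ngrams("scala"), char_ngrams("scala's"))
overlap_other = jaccard(char_ngrams("scala"), char_ngrams("kotlin"))
```

A spelling variant such as "scala's" overlaps heavily with "scala", while "kotlin" shares no n-grams with it, which is consistent with the surface-form neighbours listed for the n-gram model above.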
9. Similarity network architecture
Sequence encoders and similarity core
—
Recurrent networks sharing weights (siamese architecture)
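[Diagram: two recurrent encoders with shared weights. Branch (a) reads "she loves data science" as inputs x1(a)..x4(a) with hidden states h1(a)..h4(a); branch (b) reads "machine learning rocks" as inputs x1(b)..x3(b) with hidden states h1(b)..h3(b). Both branches feed the objective score.]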
Input encoding derives from the trained sim network:
activations from the last dense layer before output.
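The weight-sharing idea can be sketched with a tiny, untrained recurrent encoder applied to both sentences. This is a toy under several assumptions: random weights and word vectors, a plain tanh RNN, the last hidden state used as the encoding (the slide takes activations from the last dense layer instead), and cosine as the score. All names and sizes are hypothetical.

```python
import math
import random

rng = random.Random(0)
DIM, HIDDEN = 8, 4  # toy sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# One set of weights, used by both branches (the "siamese" part).
W_in = rand_matrix(HIDDEN, DIM)
W_rec = rand_matrix(HIDDEN, HIDDEN)
embeddings = {}  # toy word vectors, created on first use

def embed(word):
    if word not in embeddings:
        embeddings[word] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return embeddings[word]

def encode(sentence):
    """Run a plain tanh RNN over the words; the last hidden state is the encoding."""
    h = [0.0] * HIDDEN
    for word in sentence.split():
        x = embed(word)
        h = [
            math.tanh(
                sum(W_in[i][j] * x[j] for j in range(DIM))
                + sum(W_rec[i][k] * h[k] for k in range(HIDDEN))
            )
            for i in range(HIDDEN)
        ]
    return h

def score(a, b):
    """Cosine similarity between the two encodings."""
    ea, eb = encode(a), encode(b)
    dot = sum(p * q for p, q in zip(ea, eb))
    norm = math.sqrt(sum(p * p for p in ea)) * math.sqrt(sum(q * q for q in eb))
    return dot / norm

s = score("she loves data science", "machine learning rocks")
```

Because both sentences pass through the same `encode`, training against the objective score would update a single set of weights, and `encode` alone can then be reused to produce input encodings.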
10. Models as http micro-services
—
Components: Simeria (horizontal scale), Yenisei (vertical), model servers
All native binaries: golang (simeria), C (yenisei & model servers)
Identical provisioning for dev/prod (Ansible), model hot-swap / roll-back with zero downtime (TensorFlow Serving), AWS/Azure VMs.
Deployments at scale - opening Baikal VMs
[Diagram: processing / search exchanges json with simeria (horizontal), which fans out to two yenisei instances (vertical), each in front of three model servers. Links are labelled http and grpc; the outputs are a vector and, via an LSH query, candidates.]
11. Search approximation take 1: random projection trees
We were forced to optimize this from day one. The problem is not high traffic in regular usage but large I/O spikes at ingestion: each customer potentially has 1M+ resumes, which means 60M I/O requests (conversions, screenshots, etc.), 100M queries (regressions, vectors, etc.) and real-time search.
—
Reduced number of lookups via hyperplanes:
k random partitions of set elements using a suitable sim metric (e.g. cosine)
[Diagram: the dense input is routed to partitions of ids; similarity (sim) is computed only within the selected partitions, and the results are sorted into candidates.]
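A minimal sketch of the idea, assuming unit-normalised vectors and splits made by the hyperplane halfway between two randomly chosen points (the approach popularised by libraries such as Annoy). It is not the implementation used in the talk; the leaf size, number of trees and random data are arbitrary choices for illustration.

```python
import math
import random

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, rng, leaf_size=8):
    """Recursively split ids with the hyperplane halfway between two random points."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    normal = [a - b for a, b in zip(p, q)]
    offset = dot(normal, [(a + b) / 2 for a, b in zip(p, q)])
    left = [i for i in ids if dot(vectors[i], normal) > offset]
    right = [i for i in ids if dot(vectors[i], normal) <= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, vectors, rng, leaf_size),
        "right": build_tree(right, vectors, rng, leaf_size),
    }

def query(forest, vectors, q, top_k=3):
    """Collect one leaf per tree, then rank only those ids by exact cosine."""
    q = unit(q)
    candidates = set()
    for node in forest:
        while "leaf" not in node:
            side = "left" if dot(q, node["normal"]) > node["offset"] else "right"
            node = node[side]
        candidates.update(node["leaf"])
    # Vectors are unit length, so the dot product equals cosine similarity.
    return sorted(candidates, key=lambda i: dot(q, vectors[i]), reverse=True)[:top_k]

rng = random.Random(42)
raw = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
vectors = [unit(v) for v in raw]
forest = [build_tree(list(range(len(vectors))), vectors, rng) for _ in range(5)]
hits = query(forest, vectors, raw[17])
```

Each query touches only a handful of leaves instead of all 500 vectors; adding trees raises recall at the cost of memory and build time.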
12. Random projection trees: issues
Good.
—
Good recall, fast queries
The bad.
—
Slow to generate
The ugly.
—
Memory usage
Alternatives.
—
Locality sensitive hashing
Hashing functions generating identical hashes for similar (but not identical) input.
Various implementations for different distances: Hyperplane, Cross polytope (cosine), MinHash LSH (Jaccard), …
Survey: https://arxiv.org/pdf/1408.2927.pdf
We use Super-Bit LSH (internal variant, golang) but there’s a wide array of libraries readily available:
FALCONN, ANNOY, FLANN, RPFOREST, ..
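For orientation, here is a rough sketch of the published Super-Bit idea: sign random projections whose hyperplanes are orthogonalised in batches, so that the Hamming distance between signatures estimates the angle between vectors. It is written in Python for brevity and is not the internal golang variant mentioned above; the function names and parameters are hypothetical.

```python
import math
import random

def orthonormal_batch(dim, depth, rng):
    """Draw `depth` Gaussian vectors and orthonormalise them (Gram-Schmidt)."""
    basis = []
    for _ in range(depth):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis

def make_hasher(dim, depth, batches, seed=0):
    """Return a function mapping a vector to a (depth * batches)-bit signature."""
    if depth > dim:
        raise ValueError("depth cannot exceed the vector dimension")
    rng = random.Random(seed)
    planes = [p for _ in range(batches) for p in orthonormal_batch(dim, depth, rng)]

    def signature(vec):
        # One bit per hyperplane: which side of it the vector falls on.
        return tuple(sum(x * y for x, y in zip(vec, p)) >= 0 for p in planes)

    return signature

def estimated_angle(sig_a, sig_b):
    """Angle estimate in radians from the fraction of differing bits."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.pi * hamming / len(sig_a)

signature = make_hasher(dim=4, depth=4, batches=8)
v = [1.0, 2.0, -0.5, 0.3]
same = estimated_angle(signature(v), signature([3 * x for x in v]))
opposite = estimated_angle(signature(v), signature([-x for x in v]))
```

Similar inputs agree on most bits, so signatures (or prefixes of them) can serve as bucket keys for candidate lookup.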
14. Supporting infrastructure (i/o & conversion)
[Diagram: a document byte array is sent by http post to the conversion service (doc->pdf), which returns a pdf byte array and writes to storage. A string or pdf byte array is sent by http post to the screenshot service, which returns an http response (zip with images) and writes to storage, yielding json {“path”: …”, “url”: “… }. Several conversion service and screenshot service instances run side by side.]
Load balancing containerized services via Fabio
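A client for the "byte array in, pdf byte array out" flow might look like the sketch below. The endpoint URL appears later in this deck; everything else is an assumption: that the document is sent as the raw POST body, the content type, and the helper names. The `opener` parameter exists only so the function can be exercised without network access.

```python
import io
import urllib.request

# URL as listed in the deck; the request format below is assumed, not documented.
CONVERT_URL = "http://convert.opening.io/doc-to-pdf"

def build_conversion_request(document: bytes, url: str = CONVERT_URL):
    """Build an HTTP POST carrying the document bytes as the request body."""
    return urllib.request.Request(
        url,
        data=document,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},  # assumption
    )

def convert(document: bytes, opener=urllib.request.urlopen) -> bytes:
    """Send the document and return the response body (expected: pdf bytes)."""
    with opener(build_conversion_request(document)) as response:
        return response.read()

# Offline usage with a stand-in opener instead of the real service.
fake_pdf = convert(b"doc bytes", opener=lambda request: io.BytesIO(b"%PDF-1.4"))
```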
15. Supporting infrastructure (i/o & conversion)
[Diagram labels: libreoffice, ramdisk, golang web server (iris)]
Docker containers (http micro servers, golang)
Deployed via Mesos / Marathon
Mesos - kernel abstraction over a cluster
(exposes several machines as if they were one)
Marathon - Mesos init system
Discovery & http load balancing - Consul / Fabio
Conversion (document->pdf) service: http://convert.opening.io/doc-to-pdf
URL screenshots service: http://convert.opening.io/visitor
Conversion (pdf screenshots) service: http://convert.opening.io/pdf-to-img
Generic demo: http://engineering.opening.io/demo.html
[Diagram label: doc to pdf container]