From a talk I gave to a group of Connecticut College students in November of 2012. This looks at some of the challenges of dealing with huge amounts of member-inputted data as well as techniques used to solve these challenges and product applications of that member-inputted data.
Big Data and Data Standardization at LinkedIn
1. Reading the Tea Leaves: Big Data at LinkedIn
Alexis Baird
Product Manager
LinkedIn
Recruiting Solutions
2. What is LinkedIn?
§ LinkedIn’s mission: “Connect the world’s professionals to
make them more productive and successful”
§ The site officially launched on May 5, 2003
§ Now has >187 million members worldwide
§ LinkedIn has >3,000 employees in offices all around the
world
§ Headquartered in Mountain View, CA
§ Three different lines of revenue:
– Subscriptions
– Talent Solutions
– Marketing Solutions
5. Big Data at LinkedIn
§ 187+ million members from >200 countries
§ Each month, 52 million members come to the site
generating ~2 billion page views:
– Performing searches
– Connecting with other members
– Editing their profile
– Sharing, commenting on, or liking news articles
– Participating in group discussions
– And much more…
6. Big Data Challenges
§ Storage and processing constraints
§ Noisy signal
– Variation
– People are not always rational or consistent
8. Data Standardization
§ Take an input (usually a user-entered string) and turn it
into a meaningful abstract id
– Inputs: “Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View”
– Output: Company_id = 1035 (“Microsoft Corporation”)
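As a rough illustration of the idea (not LinkedIn's actual implementation), the mapping from free-text company strings to one id can be sketched with a small alias table. The table contents, the function name `standardize_company`, and the rule of splitting on "/", "-" and "," are assumptions made for this example; only the id 1035 and the sample strings come from the slide.

```python
import re

# Hypothetical alias table: every key is a lowercase string that should
# resolve to the same abstract company id (1035 is the id from the slide).
COMPANY_ALIASES = {
    "microsoft": 1035,
    "microsoft corporation": 1035,
    "msft": 1035,
    "bing": 1035,
}


def standardize_company(raw):
    """Return a company id for a user-entered string, or None if unknown."""
    text = raw.strip().lower()
    if text in COMPANY_ALIASES:
        return COMPANY_ALIASES[text]
    # Assumed fallback: try each piece of strings such as
    # "Microsoft/Bing" or "Microsoft-Mountain View".
    for part in re.split(r"[/\-,]", text):
        part = part.strip()
        if part in COMPANY_ALIASES:
            return COMPANY_ALIASES[part]
    return None


for entry in ["Microsoft", "MSFT", "Bing", "Microsoft/Bing", "Microsoft-Mountain View"]:
    print(entry, "=>", standardize_company(entry))
```

A lookup table like this only covers variants someone has already seen, which is one reason the occupation slides later in the talk move on to similarity and clustering.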
16. How LinkedIn matches people to jobs
[Diagram, reconstructed from the slide labels: a job and a candidate are compared feature by feature, producing a set of similarity scores.]
§ Job fields: title, geo, company, industry, description, functional area, …
§ Candidate fields:
– General: expertise, specialties, education, headline, geo, experience, …
– Current position: title, summary, tenure length, industry, functional area, …
§ Matching types:
– Binary: exact matches on geo, industry, …
– Soft: transition probabilities, similarity, …
– Text
§ Derived from corpus stats and the user base: transition probabilities, connectivity, yrs of experience to reach title, education needed for this title, …
§ Example scores shown in the diagram:
– Similarity (candidate expertise, job description): 0.56
– Similarity (candidate specialties, job description): 0.2
– Transition probability (candidate industry, job industry): 0.43
– Title similarity: 0.8
– Similarity (headline, title): 0.7
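To make the feature-by-feature comparison concrete, here is a minimal sketch that computes a few of the features named on this slide for one candidate and one job. It is an illustration under assumptions, not the production system: the dictionary field names, the bag-of-words cosine similarity, and the function names are all hypothetical, and the slide does not say how the individual scores are computed or combined.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of free text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_features(candidate, job):
    """Binary exact-match features plus soft text-similarity features."""
    return {
        "geo_exact": 1.0 if candidate["geo"] == job["geo"] else 0.0,
        "industry_exact": 1.0 if candidate["industry"] == job["industry"] else 0.0,
        "expertise_vs_description": cosine_similarity(
            bag_of_words(candidate["expertise"]), bag_of_words(job["description"])
        ),
        "headline_vs_title": cosine_similarity(
            bag_of_words(candidate["headline"]), bag_of_words(job["title"])
        ),
    }


# Made-up example records, for illustration only.
candidate = {
    "geo": "San Francisco Bay Area",
    "industry": "Internet",
    "expertise": "python data mining search relevance",
    "headline": "Software Engineer",
}
job = {
    "geo": "San Francisco Bay Area",
    "industry": "Internet",
    "description": "Build search relevance models with Python and data mining",
    "title": "Senior Software Engineer",
}
print(match_features(candidate, job))
```

Transition probabilities and the other derived features would need corpus-level statistics, so they are left out of this sketch.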
22. Data Standardization: Occupations
§ How do we know a “senior software developer” and a
“software developer” are the same occupation?
– Strip a special set of words known to indicate seniority
§ How do we know a “software developer” and a “software
engineer” are the same occupation?
– Term similarity
§ How do we know a “programmer” and a “software
developer” are the same occupation but a “programmer”
and a “program director” are not?
– Need something more complicated
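The three questions above can be tried out in a few lines. This is a sketch under assumptions: the seniority word list is hypothetical (the talk does not give the actual set), word-set overlap is only one simple reading of "term similarity", and `difflib.SequenceMatcher` stands in for a generic surface-level string comparison.

```python
from difflib import SequenceMatcher

# Hypothetical seniority words; the real list is not given in the talk.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal", "staff"}


def strip_seniority(title):
    """Lowercase a title and drop words that only indicate seniority."""
    tokens = title.lower().replace(".", "").split()
    return " ".join(t for t in tokens if t not in SENIORITY_WORDS)


def token_overlap(a, b):
    """Jaccard overlap between the word sets of two titles."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def string_similarity(a, b):
    """Character-level similarity ratio between two titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(strip_seniority("Senior Software Developer"))
print(token_overlap("software developer", "software engineer"))
print(string_similarity("programmer", "program director"))
print(string_similarity("programmer", "software developer"))
```

On these inputs the character-level comparison rates “programmer” closer to “program director” than to “software developer”, and the word overlap between “programmer” and “software developer” is zero. Surface similarity therefore points the wrong way for the third question, which is why something more complicated is needed.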
23. Data Standardization: Occupations
1. Rule-based string clean up:
– ~2 million different titles => 24,000 different “cleaned” titles
– E.g. “Sr software dev” => “senior software developer”
2. Create “virtual profiles” for each title using various
extracted and normalized profile features (e.g. skills,
degree, field of study, summary, job description, honors)
3. Cluster similar titles
4. Get rid of uninformative titles spread across too many
different topics
5. Apply hand QA to tune the clusters/name the clusters
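Steps 1 to 3 of this pipeline can be sketched on toy data as follows. Everything specific here is an assumption made for illustration: the abbreviation table, the use of skills alone as the virtual profile, cosine similarity as the distance, the greedy threshold clustering, and the 0.5 threshold. The talk does not say which clustering algorithm was used, and steps 4 and 5 are not covered.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical clean-up rules (step 1); the real rule set is not given.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}


def clean_title(raw):
    """Step 1: lowercase, drop punctuation, expand known abbreviations."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def virtual_profiles(members):
    """Step 2: pool the skills of all members sharing a cleaned title."""
    profiles = defaultdict(Counter)
    for raw_title, skills in members:
        profiles[clean_title(raw_title)].update(s.lower() for s in skills)
    return profiles


def cosine(a, b):
    """Cosine similarity between two skill-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_titles(profiles, threshold=0.5):
    """Step 3: greedy clustering; a title joins the first cluster whose
    first member has a sufficiently similar virtual profile."""
    clusters = []
    for title in sorted(profiles):
        for group in clusters:
            if cosine(profiles[title], profiles[group[0]]) >= threshold:
                group.append(title)
                break
        else:
            clusters.append([title])
    return clusters


# Made-up members, for illustration only.
members = [
    ("Programmer", ["Python", "Java", "SQL"]),
    ("Software Dev", ["Java", "Python", "Git"]),
    ("Program Director", ["Budgeting", "Fundraising", "Leadership"]),
]
print(clean_title("Sr software dev"))
print(cluster_titles(virtual_profiles(members)))
```

Because the comparison runs on what people with a title actually list on their profiles rather than on the title string, “programmer” can land with “software developer” while “program director” stays apart.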
25. Lessons learned
§ Know your machine learning!
§ Know your success metric!
§ Need to allow for ambiguity within a given title
– “Head of production”
– DDS
§ Some titles are not standardizable:
26. Takeaways
§ The more information you give, the better your
standardization will be
§ Why do you want LI to do a good job standardizing the
data on your profile?
– Better recommendations:
§ News
§ Jobs
§ Groups
§ Connections
§ Etc.
– Recruiters can find you more easily
– Potential connections can find you
27. Thank You!
We’re hiring!
[Infographic: LinkedIn by the numbers]
§ 175M+ members (2/sec)
§ 62% non U.S.
§ 25th most visited website worldwide (Comscore 6-12)
§ >2M company pages
§ 85% of Fortune 500 companies use LinkedIn to hire
§ Chart: LinkedIn Members (Millions), 2004 to 2011, with labeled values 2, 4, 8, 17, 32, 55, 90
Learn more at http://data.linkedin.com/