Better data beats better algorithms, but better data can be hard to come by. In this talk, Vitaly Gordon, Senior Data Scientist at LinkedIn, and Patrick Philips, Crowdsourcing Expert at LinkedIn, will show how the LinkedIn data science team hacks data science using sophisticated data mining and crowdsourcing techniques to leverage the data they already have and create the data that's missing.
Supervised (gold, agreement) & unsupervised (behavioral)
Context: why it matters
+ Off-topic comments lower the perceived value of Influencer content, the LI network, etc.
+ Legit members may leave low-quality topics -> no hell-banning
Especially if you only guess on the hard ones
+ Gold and wawa don’t work as well with binary tasks
+ references to article, other comments, etc.
Sampling: took clusters where at least one item scored poorly with existing classifier
+ Still a biased dataset -> skew gold to catch positive cases (80% of golds have at least one comment flagged)
+ Treat any comment that got at least 1 vote as “suspect”
+ NEXT TIME: set minimum agreement thresholds and collect more labels dynamically
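The "collect more labels dynamically" idea above could be sketched roughly as follows. This is a hypothetical stopping rule, not the team's actual implementation; the function name and all thresholds are illustrative and would need tuning against a gold set.

```python
from collections import Counter

def needs_more_labels(votes, min_labels=3, min_agreement=0.7, max_labels=7):
    """Decide whether to request another judgment for an item.

    votes: labels collected so far, e.g., ["spam", "ok", "spam"].
    Thresholds are hypothetical; calibrate on your own gold questions.
    """
    if len(votes) < min_labels:
        return True          # always collect a minimum number of judgments
    if len(votes) >= max_labels:
        return False         # cap spend on hopelessly ambiguous items
    top_count = Counter(votes).most_common(1)[0][1]
    agreement = top_count / len(votes)
    return agreement < min_agreement

# Unanimous early votes stop at the minimum; split votes keep collecting.
assert not needs_more_labels(["spam", "spam", "spam"])
assert needs_more_labels(["spam", "ok", "spam", "ok"])
```

Paying for extra judgments only on contested items is what makes the agreement threshold cheaper than a fixed judgment count per item.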
+ Using results to evaluate new implementations of spam classifier
+ Improve Prec without drop in Rec
+ 18k comments labeled in 54 hrs for $180
+ as simple as possible, but not any simpler
need to find timely, relevant content for many subjects
Free-text tagging = standardization pain, plus hard to manage quality
+ double-pass -> annoying
Standardized taxonomy: 1,200 topics selected as representative LinkedIn member interests
+ random guessing: 1,200 topics is still a lot
Pick “likely” labels for evaluation:
+ weak classifier to identify skills in an article -> expand to related skills
+ weak classifier to identify industry of article -> expand to related skills
+ pick labels based on source of article (e.g., Forbes -> economy, marketing, etc.)
+ 100 candidate labels for each article
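The shortlisting step above can be sketched as combining weak-classifier seeds, a related-topics lookup, and per-source priors, then capping at 100 candidates. All names here (`related`, `source_topics`, the example topics) are hypothetical stand-ins for LinkedIn's taxonomy and classifiers.

```python
def candidate_labels(seed_skills, seed_industries, source,
                     related, source_topics, cap=100):
    """Build a shortlist of candidate topics for one article.

    seed_skills / seed_industries: weak-classifier outputs for the article.
    related: dict mapping a topic to related topics (taxonomy expansion).
    source_topics: dict mapping a publication to its typical topics,
                   e.g., {"forbes": ["economy", "marketing"]}.
    """
    candidates = []
    for seed in list(seed_skills) + list(seed_industries):
        candidates.append(seed)
        candidates.extend(related.get(seed, []))   # expand to related skills
    candidates.extend(source_topics.get(source, []))
    # de-dup while preserving priority order, then cap the list
    seen, out = set(), []
    for c in candidates:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out[:cap]
```

Workers then only grade ~100 plausible topics per article instead of all 1,200, which is what keeps the task tractable.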
+ 400k article-topic pairs
+ e.g., 60k pairs in ~1 week @ 7c each
+ 4 labels for each item, take the average value (rather than looking for consensus)
+ bootstrap additional gold from items completed with high agreement
Lessons
+ difference between very & somewhat relevant: “is this the primary topic?”
+ some non-English articles, some garbled articles
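Averaging graded judgments and bootstrapping gold from high-agreement items could look like the sketch below. The grading scale and the "unanimous votes become gold" rule are assumptions for illustration, not the team's exact pipeline.

```python
def aggregate_relevance(judgments):
    """Average ~4 graded judgments per article-topic pair instead of
    majority voting, and promote high-agreement pairs to new gold.

    judgments: dict mapping (article, topic) -> list of scores, assuming
               a graded scale like 0 = not, 1 = somewhat, 2 = very relevant.
    Returns (scores, new_gold).
    """
    scores, new_gold = {}, {}
    for pair, votes in judgments.items():
        scores[pair] = sum(votes) / len(votes)   # average, not consensus
        # high agreement here = unanimous; a looser threshold also works
        if len(set(votes)) == 1:
            new_gold[pair] = votes[0]
    return scores, new_gold
```

Averaging keeps the signal from "somewhat relevant" votes that a strict consensus rule would throw away, and the bootstrapped gold grows the quality-control pool without extra expert labeling.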
Working towards a “less” supervised way to create new channels
Preprocessing the data to select likely matches greatly reduced the number of labels needed
Search:
+ helps members find and be found
+ People, Jobs, Groups, and more
LI search is personalized:
+ tuple of (user, query, document)
+ too much to ask a random person to label for training
+ “imagine that you’re X and see Y” has its limits
+ train from logs
Indirect measures:
+ CTR@1, CTR@P1, session abandonment, etc.
Explicit measures:
+ what about non-personalized search (such as for recruiters)?
+ what about identifying items that are off-topic for all members?
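Two of the indirect measures named above can be computed straight from click logs. The log schema below is a simplified assumption (a session = one query with a list of clicked result positions); real search logs carry much more.

```python
def search_quality(sessions):
    """Sketch of indirect relevance measures over search logs.

    sessions: list of dicts like {"clicks": [1, 3]} where "clicks"
              holds the 1-based result positions the member clicked.
    CTR@1       = share of searches whose top result was clicked.
    abandonment = share of searches with no click at all.
    """
    n = len(sessions)
    ctr_at_1 = sum(1 in s["clicks"] for s in sessions) / n
    abandonment = sum(not s["clicks"] for s in sessions) / n
    return {"CTR@1": ctr_at_1, "abandonment": abandonment}
```

These are cheap to track continuously, which is exactly why they complement, rather than replace, the explicit crowd judgments described next.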
1,000 query-result pairs
+ retrieve all queries where result@1 didn’t get a click
+ remove any queries tagged as {firstname, lastname} where the name in the query matched the name in the profile (we know these perform well)
Binary tasks bad -> added a second set of questions
+ allows us to audit the query tagger at the same time
Using results to triage queries for additional manual review
+ also adds an explicit relevance metric to track over time (wtf@1)
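The wtf@1 metric mentioned above reduces to a simple ratio once the crowd judgments are in; this is an assumed reading of the metric (share of sampled queries whose top result was judged irrelevant), since the slide only names it.

```python
def wtf_at_1(judged_pairs):
    """Explicit relevance metric over crowd-judged (query, result@1) pairs.

    judged_pairs: list of (query, is_relevant) tuples, where is_relevant
    is the crowd's verdict on the top result for that query.
    """
    bad = sum(1 for _, relevant in judged_pairs if not relevant)
    return bad / len(judged_pairs)
```

Tracking this over time gives an explicit counterpart to the indirect CTR measures, including for searches where clicks are an unreliable proxy.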
Other behavioral stuff:
+ individual judgment duration, scrolls, clicks, mouse movement
+ jQuery is your friend
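Once those behavioral signals are captured client-side (jQuery event handlers streaming duration, scrolls, etc. back with each judgment), a server-side quality filter might look like this minimal sketch. The field names and thresholds are hypothetical; in practice you would calibrate them against trusted workers' behavior.

```python
def suspicious_workers(judgments, min_seconds=3.0, min_scrolls=1):
    """Flag workers whose behavior suggests low-effort judging.

    judgments: list of dicts like
      {"worker": "w1", "duration_s": 1.2, "scrolls": 0}
    A judgment that was too fast, or never scrolled the content,
    is treated as a signal the worker may be clicking through.
    """
    flagged = set()
    for j in judgments:
        if j["duration_s"] < min_seconds or j["scrolls"] < min_scrolls:
            flagged.add(j["worker"])
    return flagged
```

This is the "unsupervised (behavioral)" side of quality control: no gold questions needed, just the worker's interaction trace.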
Picking the right problem gets you a long way there
+ SkillRank example
----- Meeting Notes (8/15/13 16:55) -----
+ name queries really aren't that useful, so we excluded those
+ ran it internally first, then with turkers
++ nearly identical; arguably the turker run was better