Better data beats better algorithms, but better data can be hard to come by. In this talk, Vitaly Gordon, Senior Data Scientist at LinkedIn, and Patrick Philips, Crowdsourcing Expert at LinkedIn, will show how the LinkedIn data science team hacks data science using sophisticated data mining and crowdsourcing techniques to leverage the data they already have and create the data that's missing.
Supervised (gold, agreement) & unsupervised (behavioral)
Context: why it matters
+ Off-topic comments lower the perceived value of Influencer content, the LI network, etc.
+ Legit members may leave low-quality topics -> no hell-banning
Especially if you only guess on the hard ones
+ Gold and wawa don’t work as well with binary tasks
+ references to article, other comments, etc.
Sampling: took clusters where at least one item scored poorly with existing classifier
+ Still a biased dataset -> skew gold to catch positive cases (80% of golds have at least one comment flagged)
+ Treat any comment that got at least 1 vote as “suspect”
+ NEXT TIME: set minimum agreement thresholds and collect more labels dynamically
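The "collect more labels dynamically" idea above could be sketched roughly as follows. This is a hypothetical stopping rule, not the team's actual implementation; the function name and all thresholds are illustrative and would need tuning against a gold set.

```python
from collections import Counter

def needs_more_labels(votes, min_labels=3, min_agreement=0.7, max_labels=7):
    """Decide whether to request another judgment for an item.

    votes: labels collected so far, e.g., ["spam", "ok", "spam"].
    Thresholds are hypothetical; calibrate on your own gold questions.
    """
    if len(votes) < min_labels:
        return True          # always collect a minimum number of judgments
    if len(votes) >= max_labels:
        return False         # cap spend on hopelessly ambiguous items
    top_count = Counter(votes).most_common(1)[0][1]
    agreement = top_count / len(votes)
    return agreement < min_agreement

# Unanimous early votes stop at the minimum; split votes keep collecting.
assert not needs_more_labels(["spam", "spam", "spam"])
assert needs_more_labels(["spam", "ok", "spam", "ok"])
```

Paying for extra judgments only on contested items is what makes the agreement threshold cheaper than a fixed judgment count per item.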
+ Using results to evaluate new implementations of spam classifier
+ Improve Prec without drop in Rec
+ 18k comments labeled in 54 hrs for $180
+ as simple as possible, but not any simpler
need to find timely, relevant content for many subjects
Free-text tagging = standardization pain, plus hard to manage quality
+ double-pass -> annoying
Standardized taxonomy: 1,200 topics selected as representative LinkedIn member interests
+ random guessing: 1,200 topics is still a lot
Pick “likely” labels for evaluation:
+ weak classifier to identify skills in an article -> expand to related skills
+ weak classifier to identify industry of article -> expand to related skills
+ pick labels based on source of article (e.g., Forbes -> economy, marketing, etc.)
+ 100 candidate labels for each article
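The shortlisting step above can be sketched as combining weak-classifier seeds, a related-topics lookup, and per-source priors, then capping at 100 candidates. All names here (`related`, `source_topics`, the example topics) are hypothetical stand-ins for LinkedIn's taxonomy and classifiers.

```python
def candidate_labels(seed_skills, seed_industries, source,
                     related, source_topics, cap=100):
    """Build a shortlist of candidate topics for one article.

    seed_skills / seed_industries: weak-classifier outputs for the article.
    related: dict mapping a topic to related topics (taxonomy expansion).
    source_topics: dict mapping a publication to its typical topics,
                   e.g., {"forbes": ["economy", "marketing"]}.
    """
    candidates = []
    for seed in list(seed_skills) + list(seed_industries):
        candidates.append(seed)
        candidates.extend(related.get(seed, []))   # expand to related skills
    candidates.extend(source_topics.get(source, []))
    # de-dup while preserving priority order, then cap the list
    seen, out = set(), []
    for c in candidates:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out[:cap]
```

Workers then only grade ~100 plausible topics per article instead of all 1,200, which is what keeps the task tractable.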
+ 400k article-topic pairs
+ e.g., 60k pairs in ~1 week @ 7c each
+ 4 labels for each item, take the average value (rather than looking for consensus)
+ bootstrap additional gold from items completed with high agreement
Lessons
+ difference between very & somewhat relevant: “is this the primary topic?”
+ some non-English articles, some garbled articles
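Averaging graded judgments and bootstrapping gold from high-agreement items could look like the sketch below. The grading scale and the "unanimous votes become gold" rule are assumptions for illustration, not the team's exact pipeline.

```python
def aggregate_relevance(judgments):
    """Average ~4 graded judgments per article-topic pair instead of
    majority voting, and promote high-agreement pairs to new gold.

    judgments: dict mapping (article, topic) -> list of scores, assuming
               a graded scale like 0 = not, 1 = somewhat, 2 = very relevant.
    Returns (scores, new_gold).
    """
    scores, new_gold = {}, {}
    for pair, votes in judgments.items():
        scores[pair] = sum(votes) / len(votes)   # average, not consensus
        # high agreement here = unanimous; a looser threshold also works
        if len(set(votes)) == 1:
            new_gold[pair] = votes[0]
    return scores, new_gold
```

Averaging keeps the signal from "somewhat relevant" votes that a strict consensus rule would throw away, and the bootstrapped gold grows the quality-control pool without extra expert labeling.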
Working towards a “less” supervised way to create new channels
Preprocessing the data to select likely matches greatly reduced the number of labels needed
Search:
+ helps members find and be found
+ People, Jobs, Groups, and more
LI search is personalized:
+ tuple of (user, query, document)
+ too much to ask a random person to label for training
+ “imagine that you’re X and see Y” has its limits
+ train from logs
Indirect measures:
+ CTR@1, CTR@P1, session abandonment, etc.
Explicit measures:
+ what about non-personalized search (such as for recruiters)?
+ what about identifying items that are off-topic for all members?
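Two of the indirect measures named above can be computed straight from click logs. The log schema below is a simplified assumption (a session = one query with a list of clicked result positions); real search logs carry much more.

```python
def search_quality(sessions):
    """Sketch of indirect relevance measures over search logs.

    sessions: list of dicts like {"clicks": [1, 3]} where "clicks"
              holds the 1-based result positions the member clicked.
    CTR@1       = share of searches whose top result was clicked.
    abandonment = share of searches with no click at all.
    """
    n = len(sessions)
    ctr_at_1 = sum(1 in s["clicks"] for s in sessions) / n
    abandonment = sum(not s["clicks"] for s in sessions) / n
    return {"CTR@1": ctr_at_1, "abandonment": abandonment}
```

These are cheap to track continuously, which is exactly why they complement, rather than replace, the explicit crowd judgments described next.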
1,000 query-result pairs
+ retrieve all queries where result@1 didn’t get a click
+ remove any queries tagged as {firstname, lastname} where the name in the query matched the name in the profile (we know these perform well)
Binary tasks bad -> added a second set of questions
+ allows us to audit the query tagger at the same time
Using results to triage queries for additional manual review
+ also adds an explicit relevance metric to track over time (wtf@1)
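The wtf@1 metric mentioned above reduces to a simple ratio once the crowd judgments are in; this is an assumed reading of the metric (share of sampled queries whose top result was judged irrelevant), since the slide only names it.

```python
def wtf_at_1(judged_pairs):
    """Explicit relevance metric over crowd-judged (query, result@1) pairs.

    judged_pairs: list of (query, is_relevant) tuples, where is_relevant
    is the crowd's verdict on the top result for that query.
    """
    bad = sum(1 for _, relevant in judged_pairs if not relevant)
    return bad / len(judged_pairs)
```

Tracking this over time gives an explicit counterpart to the indirect CTR measures, including for searches where clicks are an unreliable proxy.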
Other behavioral stuff:
+ individual judgment duration, scrolls, clicks, mouse movement
+ jQuery is your friend
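Once those behavioral signals are captured client-side (jQuery event handlers streaming duration, scrolls, etc. back with each judgment), a server-side quality filter might look like this minimal sketch. The field names and thresholds are hypothetical; in practice you would calibrate them against trusted workers' behavior.

```python
def suspicious_workers(judgments, min_seconds=3.0, min_scrolls=1):
    """Flag workers whose behavior suggests low-effort judging.

    judgments: list of dicts like
      {"worker": "w1", "duration_s": 1.2, "scrolls": 0}
    A judgment that was too fast, or never scrolled the content,
    is treated as a signal the worker may be clicking through.
    """
    flagged = set()
    for j in judgments:
        if j["duration_s"] < min_seconds or j["scrolls"] < min_scrolls:
            flagged.add(j["worker"])
    return flagged
```

This is the "unsupervised (behavioral)" side of quality control: no gold questions needed, just the worker's interaction trace.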
Picking the right problem gets you a long way there
+ SkillRank example
----- Meeting Notes (8/15/13 16:55) -----
+ name queries really aren't that useful, so we excluded those
+ ran it internally first, then with turkers
++ nearly identical; arguably the turker run was better