Hacking Data Science
Overview of ML pipeline
Gather data
Feature engineering
Model fitting
Evaluation
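As a rough illustration of how the four stages connect, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the hard-coded comments, the link-count feature, and the threshold "model" are hypothetical and are not taken from the deck.

```python
# Toy pipeline: gather data -> feature engineering -> model fitting -> evaluation.
# All data, names, and the threshold "model" are hypothetical illustrations.

def gather_data():
    """Return (comment, is_spam) pairs; hard-coded here for illustration."""
    return [
        ("Great post, thanks for sharing", 0),
        ("Buy cheap followers at http://spam.example", 1),
        ("Interesting point about hiring", 0),
        ("Click http://a.example and http://b.example now", 1),
        ("I disagree with the second argument", 0),
        ("Free money http://c.example", 1),
    ]

def extract_feature(comment):
    """Feature engineering: number of links in the comment."""
    return comment.count("http://")

def fit_threshold(features, labels):
    """Model fitting: pick the link-count threshold with the best training accuracy."""
    best_threshold, best_accuracy = 0, -1.0
    for threshold in sorted(set(features)):
        predictions = [int(f >= threshold) for f in features]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold

def evaluate(predictions, labels):
    """Evaluation: fraction of correct predictions."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

data = gather_data()
features = [extract_feature(text) for text, _ in data]
labels = [label for _, label in data]
threshold = fit_threshold(features, labels)
predictions = [int(f >= threshold) for f in features]
# Evaluated on the training data only to keep the sketch short;
# a real pipeline would score held-out examples instead.
accuracy = evaluate(predictions, labels)
print(threshold, accuracy)
```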
©2013 LinkedIn Corporation. All Rights Reserved. 2
Understanding Seniority
Companies are not standard
Titles are not enough
Things change
Learning to target better
Classifying names to genders
Let’s look at Monica again
Not so fast …
Not so fast …
Even slower …
Sometimes the answer is just under your nose
Comment Spam on Influencer content
Challenge 1: Binary tasks are too guessable
Challenge 2: Context matters
Spam Comment Annotation Task
Quality: Gold distributions and skewed datasets
Using results to evaluate new features
Model          ΔP (precision)   ΔR (recall)   ΔPRC (PR curve)
Baseline       -                -             -
Variation 1    +                -             +
Variation 2    -                +             +
Variation 3    -                ++            --
Variation 4    -                +++           ++
Variation 5    -                +++           ++
Variation 6    -                +++           ++
Variation 7    -                ++++          +++
Variation 8    -                ++++          +++
Variation 9    -                ++++          +++
Variation 10   -                ++++          +++
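To make the ΔP and ΔR columns concrete, here is a minimal Python sketch that computes precision and recall for a baseline and one variation against the same labeled comments, then takes the differences. The labels and predictions are made up for illustration and are not the deck's data. The ΔPRC column would need classifier scores rather than binary predictions, so it is left out.

```python
def precision_recall(predictions, labels):
    """Return (precision, recall) for binary predictions, where 1 means spam."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical crowd labels and classifier outputs for eight comments.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
baseline = [1, 1, 0, 0, 0, 0, 0, 0]
variation = [1, 1, 1, 0, 1, 0, 0, 0]

base_p, base_r = precision_recall(baseline, labels)
var_p, var_r = precision_recall(variation, labels)

# A negative delta_p with a positive delta_r matches the "-  +" pattern in the table.
delta_p = var_p - base_p
delta_r = var_r - base_r
print(f"delta precision: {delta_p:+.2f}, delta recall: {delta_r:+.2f}")
```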
“As simple as possible, but not simpler”
LinkedIn Channels
Labels aren’t free
Suggest likely candidates for topics then expand
Evaluate suggested article-topic pairs
• Using results to evaluate new implementations of the spam classifier
  – Improve precision without a drop in recall
• 18k comments labeled in 54 hrs for $180
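The labeling figures above imply a unit cost and a throughput, worked out in the short Python sketch below. It assumes "18k" means exactly 18,000 comments. The deck does not say how many judgments were collected per comment, so the result is a cost per labeled comment, not per judgment.

```python
comments_labeled = 18_000   # "18k comments", assumed to be exactly 18,000
hours = 54
total_cost_usd = 180

cost_per_comment = total_cost_usd / comments_labeled   # dollars per labeled comment
comments_per_hour = comments_labeled / hours            # labeled comments per hour

print(f"${cost_per_comment:.2f} per comment, about {comments_per_hour:.0f} comments per hour")
```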
Quality: Not by Gold alone
Using results to evaluate existing classification framework
“Help your helpers”
Search is a major portal to information
LinkedIn Search is personalized
Evaluation is still possible
Search Evaluation – WTF@1
Quality: Behavioral metrics are good too!
“Pick a solvable problem”
Standardizing titles
Which question is easier?
1. What is a better name for the title “account executive”?
2. How similar are “account executive” and “sales executive”?
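One reason the second question may be easier to crowdsource is that answers on a fixed scale can be combined numerically, while free-text name suggestions cannot. The Python sketch below averages hypothetical 1-5 similarity ratings per unordered title pair. The ratings, the scale, and the averaging rule are assumptions made for illustration. The deck does not state how answers were collected or aggregated.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotator ratings: (title A, title B, similarity on a 1-5 scale).
ratings = [
    ("account executive", "sales executive", 4),
    ("sales executive", "account executive", 5),
    ("account executive", "sales executive", 4),
    ("account executive", "software engineer", 1),
    ("software engineer", "account executive", 2),
]

def aggregate(ratings):
    """Average ratings per unordered title pair, so (A, B) and (B, A) are pooled."""
    by_pair = defaultdict(list)
    for title_a, title_b, score in ratings:
        by_pair[frozenset((title_a, title_b))].append(score)
    return {pair: mean(scores) for pair, scores in by_pair.items()}

similarity = aggregate(ratings)
pair = frozenset(("account executive", "sales executive"))
print(f"{similarity[pair]:.2f}")
```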
Notable Experts
First attempt
Second attempt
Third attempt
What makes the best data mining expert?
• Education?
• Industry experience?
• Number of publications?
• Communication skills?
• Hacking skills?
• Knowledge of statistics?
• Number of endorsements?
“More bad data != better data”
Summary
1. Use the data you already have
2. Keep it simple, but not too simple
3. Pick a solvable problem
4. Help your helpers
5. Sample intelligently
6. More (bad) data != better data
Questions?

Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.
Take control of your SAP testing with UiPath Test Suite
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Crowdsourcing Series: LinkedIn. By Vitaly Gordon & Patrick Philips.

Editor's notes

  1-10. Supervised (gold, agreement) & unsupervised (behavioral)
  11. Context: why it matters
      Off-topic comments lower the perceived value of Influencer content, LI network, etc.
      Legit members may leave low-quality topics -> no hell-banning
  12. Especially if you only guess on the hard ones
      + Gold and wawa don’t work as well with binary tasks
  13. + references to article, other comments, etc.
  14-15. Sampling: took clusters where at least one item scored poorly with existing classifier
      Still a biased dataset -> skew gold to catch positive cases (80% of Golds have at least one comment flagged)
      Treat any comment that got at least 1 vote as “suspect”
      NEXT TIME: set minimum agreement thresholds and collect more labels dynamically
  16. + Using results to evaluate new implementations of spam classifier
      Improve precision without a drop in recall
      + 18k comments labeled in 54 hrs for $180
  17. + simple as possible, but not any simpler
  18. need to find timely, relevant content for many subjects
  19. Free-text tagging = standardization pain, plus hard to manage quality
      + double-pass -> annoying
      Standardized taxonomy: 1,200 topics selected as representative of LinkedIn members' interests
      + random guessing: 1,200 topics is still a lot
  20. Pick “likely” labels for evaluation:
      + weak classifier to identify skills in an article -> expand to related skills
      + weak classifier to identify industry of article -> expand to related skills
      + pick labels based on source of article (e.g., Forbes -> economy, marketing, etc.)
      + 100 candidate labels for each article
  21-22. + 400k article-topic pairs
      + e.g., 60k pairs in ~1 week @ 7c each
      + 4 labels for each item, take the average value (rather than looking for consensus)
      + bootstrap additional gold from items completed with high agreement
      Lessons
      + difference between very & somewhat relevant: “is this the primary topic”
      + some non-English articles, some garbled articles
  23. Working towards a “less” supervised way to create new channels
  24. Preprocessing the data to select likely matches greatly reduced the number of labels needed
  25. Search:
      + helps members find and be found
      + People, Jobs, Groups and more
  26. LI search is personalized:
      + tuple of (user, query, document)
      Too much to ask a random person to label for training
      + “imagine that you’re X and see Y” has its limits
      + train from logs
  27. Indirect measures:
      + CTR@1, CTR@P1, Session Abandonment, etc.
      Explicit measures:
      + what about non-personalized search (such as for recruiters)?
      + what about identifying items that are off-topic for all members?
  28. 1000 query-result pairs
      + retrieve all queries where result@1 didn’t get a click
      + remove any queries tagged as {firstname, lastname} where the name in the query matched the name in the profile (we know these perform well)
      Binary tasks bad – added a second set of questions
      + allows us to audit query tagger at the same time
      Using results to triage queries for additional manual review
      + also adds an explicit relevance metric to track over time (wtf@1)
  29. Other behavioral stuff:
      + individual judgment duration, scrolls, clicks, mouse movement
      + jQuery is your friend
  30. Picking the right problem gets you a long way there
      SkillRank example
      ----- Meeting Notes (8/15/13 16:55) -----
      + name queries really aren't that useful so we excluded those
      + ran it internally first, then with turkers
      ++ nearly identical, arguably it was better
  31-34. Supervised (gold, agreement) & unsupervised (behavioral)
  35. Picking the right problem gets you a long way there
      SkillRank example
      ----- Meeting Notes (8/15/13 16:55) -----
      + name queries really aren't that useful so we excluded those
      + ran it internally first, then with turkers
      ++ nearly identical, arguably it was better
  36. Other fun lessons
      5. Not by gold alone
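
The notes describe two ways of combining crowd judgments: for comment spam, any comment with at least one spam vote is treated as "suspect"; for article-topic relevance, the four labels per item are averaged rather than reduced to a consensus, and items completed with high agreement are reused as additional gold. The following is only a minimal sketch of those rules, not the authors' implementation. The function names, the numeric relevance scale, the input shapes, and the `gold_agreement` threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def suspect_comments(votes):
    """Return ids of comments that received at least one spam vote.

    votes: iterable of (comment_id, flagged) pairs, one per worker judgment
    (hypothetical input shape).
    """
    return {comment_id for comment_id, flagged in votes if flagged}


def aggregate_relevance(judgments, gold_agreement=1.0):
    """Average the relevance labels per item and mark gold candidates.

    judgments: iterable of (item_id, score) pairs, where score is a number
    on an assumed scale (e.g. 0 = not relevant, 1 = somewhat, 2 = very).
    gold_agreement: assumed threshold; the share of workers that must pick
    the same label before the item is proposed as new gold.
    """
    by_item = defaultdict(list)
    for item_id, score in judgments:
        by_item[item_id].append(score)

    results = {}
    for item_id, scores in by_item.items():
        # Most frequent label; ties resolve to the smallest label.
        mode = max(sorted(set(scores)), key=scores.count)
        agreement = scores.count(mode) / len(scores)
        results[item_id] = {
            "mean": mean(scores),          # averaged value, not a consensus
            "mode": mode,
            "agreement": agreement,
            "gold_candidate": agreement >= gold_agreement,
        }
    return results
```

With four labels per item, a threshold of 1.0 proposes as gold only the items where all four workers agreed; a lower threshold trades gold quality for volume.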