Integrating Crowd & Cloud Resources for Big Data
Michael Franklin

Middleware 2012, Montreal
December 6, 2012

UC BERKELEY
Expeditions in Computing
CROWDSOURCING
WHAT IS IT?
Citizen Science




NASA “Clickworkers” 2000
Citizen Journalism/Participatory Sensing




Communities & Expertise
Data Collection & Curation
e.g., Freebase
An Academic View




From Quinn & Bederson, “Human Computation: A Survey
and Taxonomy of a Growing Field”, CHI 2011.
The Way Industry Looks At It
 How Industry Looks At It
Useful Taxonomies
• Doan, Halevy, Ramakrishnan; (Crowdsourcing)
  CACM 4/11
  –   nature of collaboration (implicit vs. explicit)
  –   architecture (standalone vs. piggybacked)
  –   must recruit users/workers? (yes or no)
  –   What do users/workers do?
• Quinn & Bederson; (Human Computation) CHI ’11
  –   Motivation (Pay, Altruism, Enjoyment, Reputation)
  –   Quality Control (many mechanisms)
  –   Aggregation (how are results combined?)
  –   Human Skill (Visual recognition, language, …)
  –   …
Types of Tasks

Task Granularity               Examples
Complex Tasks                  • Build a website
                               • Develop a software system
                               • Overthrow a government?
Simple Projects                • Design a logo and visual identity
                               • Write a term paper
Macro Tasks                    • Write a restaurant review
                               • Test a new website feature
                               • Identify a galaxy
Micro Tasks                    • Label an image
                               • Verify an address
                               • Simple entity resolution


 Inspired by the report: “Paid Crowdsourcing”,
 Smartsheet.com, 9/15/2009
MICRO-TASK MARKETPLACES
Amazon Mechanical Turk (AMT)
Microtasking – Virtualized Humans
• Current leader: Amazon Mechanical Turk
• Requesters place Human Intelligence Tasks
  (HITs)
      –   set price per “assignment” (usually cents)
      –   specify # of replicas (assignments), expiration, …
      –   User Interface (for workers)
      –   API-based: “createHit()”, “getAssignments()”,
          “approveAssignments()”, “forceExpire()”
• Requesters approve jobs and payment
• Workers (a.k.a. “turkers”) choose jobs, do them,
  get paid
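
The HIT lifecycle in the bullets above can be sketched as a toy in-memory model. This is not the Mechanical Turk API: the class names `Hit` and `Assignment`, the methods `submit` and `approve_all`, and the sample question are all hypothetical, chosen only to show how price per assignment, replica count, and approval fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Assignment:
    worker_id: str
    answer: str
    approved: bool = False


@dataclass
class Hit:
    """Hypothetical stand-in for a HIT; not the real AMT data model."""
    question: str
    reward_cents: int        # price per assignment
    max_assignments: int     # number of replicas requested
    expired: bool = False
    assignments: list = field(default_factory=list)

    def submit(self, worker_id: str, answer: str) -> bool:
        """A worker turns in one assignment; refused if the HIT is full or expired."""
        if self.expired or len(self.assignments) >= self.max_assignments:
            return False
        self.assignments.append(Assignment(worker_id, answer))
        return True

    def approve_all(self) -> int:
        """The requester approves all pending assignments; returns cents owed."""
        pending = [a for a in self.assignments if not a.approved]
        for a in pending:
            a.approved = True
        return self.reward_cents * len(pending)


hit = Hit("Is this address valid: 1 Main St?", reward_cents=2, max_assignments=3)
accepted = [hit.submit(f"worker{i}", "yes") for i in range(4)]
owed = hit.approve_all()
print(accepted, owed)
```

The fourth worker is turned away because only three replicas were requested, and payment is owed only for the assignments the requester approves.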
AMT Worker Interface
Microtask Aggregators
Crowdsourcing for Data Management
• Relational                        • Beyond relational
      –   data cleaning               –   graph search
      –   data entry                  –   classification
      –   information extraction      –   transcription
      –   schema matching             –   mobile image search
      –   entity resolution           –   social media analysis
      –   data spaces                 –   question answering
      –   building structured KBs     –   NLP
      –   sorting                     –   text summarization
      –   top-k                       –   sentiment analysis
      –   ...                         –   semantic wikis
                                      –   ...
TOWARDS HYBRID
CROWD/CLOUD COMPUTING
Not Exactly Crowdsourcing, but…




“The hope is that, in not too many years, human brains
and computing machines will be coupled together very
tightly, and that the resulting partnership will think as no
human brain has ever thought and process data in a way
not approached by the information-handling machines
we know today.”
AMP: Integrating Diverse Resources

Algorithms: Machine Learning and Analytics
Machines: Cloud Computing
People: CrowdSourcing & Human Computation
The Berkeley AMPLab
• Goal: Data analytics stack integrating A, M & P
  • BDAS: Released as BSD/Apache Open Source
• 6 year duration: 2011-2017
• 8 CS Faculty
  • Directors: Franklin(DB), Jordan (ML), Stoica (Sys)
• Industrial Support & Collaboration:




• NSF Expedition and DARPA XData
People in AMP
• Long term Goal: Make people
  an integrated part of the system!
  • Leverage human activity
  • Leverage human intelligence
• Current AMP People Projects
  – Carat: Collaborative Energy
    Debugging
  – CrowdDB: “The World’s Dumbest
    Database System”
  – CrowdER: Hybrid computation for
    Entity Resolution
  – CrowdQ: Hybrid Unstructured
    Query Answering

[Figure: Machines + Algorithms and people, linked by arrows labeled Questions, Answers, and data, activity]
Carat: Leveraging Human Activity



                      ~500,000
                      downloads
                      to date
                   A. J. Oliner, et al. Collaborative
                   Energy Debugging for Mobile
                   Devices. Workshop on Hot
                   Topics in System Dependability
                   (HotDep), 2012.

Carat: How it works




     Collaborative Detection of Energy Bugs

Leveraging Human Intelligence
First Attempt: CrowdDB

[Figure: CrowdDB architecture. CrowdSQL goes in, Results come out. Query path: Parser, Optimizer, Executor, Files Access Methods, Disk 1, Disk 2. Supporting components: MetaData, Statistics. Crowd components: Turker Relationship Manager, UI Creation, Form Editor, UI Template Manager, HIT Manager.]

See also:
  Qurk – MIT
  Deco – Stanford

CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Query Processing with the VLDB Crowd, VLDB 2011
DB-hard Queries
Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $210Bn
Intl. Business Machines   Armonk, NY                 $200Bn
Microsoft                 Redmond, WA                $250Bn

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "IBM"

Number of Rows: 0

Problem: Entity Resolution
DB-hard Queries
Company_Name              Address                    Market Cap
Google                    Googleplex, Mtn. View CA   $210Bn
Intl. Business Machines   Armonk, NY                 $200Bn
Microsoft                 Redmond, WA                $250Bn

SELECT Market_Cap
FROM Companies
WHERE Company_Name = "Apple"

Number of Rows: 0

Problem: Closed-World Assumption
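
Both empty results (the "IBM" query on the previous slide and the "Apple" query here) can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The table and values are taken from the slide; the helper name `market_cap` and the choice to store Market_Cap as text are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Companies (Company_Name TEXT, Address TEXT, Market_Cap TEXT)"
)
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [
        ("Google", "Googleplex, Mtn. View CA", "$210Bn"),
        ("Intl. Business Machines", "Armonk, NY", "$200Bn"),
        ("Microsoft", "Redmond, WA", "$250Bn"),
    ],
)


def market_cap(name):
    """Exact-match lookup, as a conventional DBMS would evaluate it."""
    cur = conn.execute(
        "SELECT Market_Cap FROM Companies WHERE Company_Name = ?", (name,)
    )
    return cur.fetchall()


ibm_rows = market_cap("IBM")                            # entity resolution problem
apple_rows = market_cap("Apple")                        # closed-world problem
full_name_rows = market_cap("Intl. Business Machines")  # exact string matches
print(len(ibm_rows), len(apple_rows), full_name_rows)
```

The database holds the IBM row but cannot tell that "IBM" and "Intl. Business Machines" name the same company, and it treats the absence of an Apple row as if Apple did not exist.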

DB-hard Queries
SELECT Image
FROM Pictures
WHERE Image contains "Good Looking Dog"

Number of Rows: 0

Problem: Subjective Comparison
Leveraging Human Intelligence
First Attempt: CrowdDB

Where to use the crowd:
• Cleaning and Disambiguation
• Find missing data
• Make subjective comparisons

[Figure: CrowdDB architecture diagram (CrowdSQL, Parser, Optimizer, Executor, Files Access Methods, MetaData, Statistics, Turker Relationship Manager, UI Creation, Form Editor, UI Template Manager, HIT Manager, Disk 1, Disk 2, Results)]
CrowdDB - Worker Interface




Mobile Platform




CrowdSQL
DDL Extensions:

Crowdsourced columns:
CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

Crowdsourced tables:
CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING,
  PRIMARY KEY (university, department));

DML Extensions:

CrowdEqual:
SELECT *
FROM companies
WHERE Name ~= "Big Blue"

CROWDORDER operators (currently UDFs):
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");
CrowdDB Query: Picture ordering
Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Worker prompt: Which picture visualizes better "Golden Gate Bridge"


Data-Size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 Assignments per HIT
Price: 1 cent per HIT
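
A rough cost estimate follows from the four parameters above. The slide does not say how many orderings were requested, so this sketch assumes that each "ordering" is one pairwise comparison, that every pair of pictures within a subject area is compared once, and that the 1 cent price is paid per assignment. All three are assumptions.

```python
from math import ceil, comb

subject_areas = 30
pictures_per_subject = 8
comparisons_per_hit = 4      # batching
assignments_per_hit = 3      # replication
price_cents = 1              # assumed to be paid per assignment

# Assumption: every unordered pair within a subject area is compared once.
pairs_per_subject = comb(pictures_per_subject, 2)
total_comparisons = subject_areas * pairs_per_subject

hits = ceil(total_comparisons / comparisons_per_hit)
assignments = hits * assignments_per_hit
cost_cents = assignments * price_cents

print(pairs_per_subject, total_comparisons, hits, assignments, cost_cents)
```

Under these assumptions the experiment needs 210 HITs and 630 assignments, for about $6.30 before any marketplace fees.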

[Figure: worker interface with a Submit button; pictures annotated with (turker-votes, turker-ranking, expert-ranking)]
User Interface vs. Quality



Three user-interface variants and their observed error rates:

Variant                  Plan operators                                                    Error rate
(Department first)       MTJoin(Professor), p.name = "carey"; MTProbe(Dep)                 ≈10%
(Professor first)        MTJoin(Dep), p.dep = d.name; MTProbe(Professor), name=Carey       ≈10%
(De-normalized Probe)    MTProbe(Professor, Dep), name=Carey                               ≈80%

Worker forms: “Please fill out the missing professor data” (Name, E-Mail, Department) and “Please fill out the missing department data” (Department, Name, Phone). The de-normalized probe asks for the professor and department fields in a single form.
Turker Affinity and Errors




[Figure: turker affinity and errors; x-axis: Turker Rank]
A Bigger Underlying Issue

     Closed-World       Open-World




What Does This Query Mean?

SELECT COUNT(*) FROM IceCreamFlavors




 Trushkowsky et al. Crowdsourced Enumeration Queries, ICDE 2013 (to appear)
Estimating Completeness
SELECT COUNT(*) FROM US States
US States using Mechanical Turk
• Species Estimation techniques perform well on average
• Uniform under-predicts slightly, coeff. of var. = 0.5
• Decent estimate after 100 HITs

[Figure: "States: unique items", average over US States runs; x-axis: # answers (HITs), 0 to 300; y-axis: avg # unique answers, 0 to 50]
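
To make "species estimation" concrete, here is a minimal sketch of one classic estimator, the bias-corrected Chao1 formula. It is a stand-in chosen for brevity and is not necessarily the estimator used in the experiments above. The function name `chao1` and the sample answers are hypothetical.

```python
from collections import Counter


def chao1(answers):
    """Bias-corrected Chao1 estimate of the total number of distinct items.

    f1 = items seen exactly once, f2 = items seen exactly twice.
    Estimate = observed + f1 * (f1 - 1) / (2 * (f2 + 1)).
    """
    counts = Counter(answers)
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical worker answers: two states seen twice, two seen once.
sample = ["CA", "CA", "NY", "NY", "TX", "WA"]
estimate = chao1(sample)
print(round(estimate, 3))
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that more HITs are likely to reveal new items. When no item has been seen exactly once, the estimate equals the observed count.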
Estimating Completeness
SELECT COUNT(*) FROM IceCreamFlavors
• Ice Cream Flavors
    – Estimators don’t converge
    – Very highly skewed (CV = 5.8)
    – Detect that # HITs insufficient (beginning of curve)

[Figure: Ice Cream Flavors]

Few, short lists of ice cream flavors (e.g. “alumni swirl, apple cobbler crunch, arboretum breeze, …” from Penn State Creamery)
pay-as-you-go
• “I don’t believe it is usually possible to estimate the
  number of species... but only an appropriate lower bound
  for that number. This is because there is nearly always a
  good chance that there are a very large number of
  extremely rare species” (Good, 1953)
• So instead, can ask: “What’s the benefit of
  m additional HITs?”
                     Ice Cream after 1500 HITs
                    m     Actual   Shen   Spline
                    10    1        1.79   1.62
                    50    7        8.91   8.22
                    200   39       35.4   32.9
CrowdER - Entity Resolution




                    DB




Hybrid Entity-Resolution


                                                               Threshold = 0.2
                                                               #Pairs = 8,315
                                                               #HITs = 508
                                                               Cost= $38.1

                                                               Time = 4.5h
                                                               Time(QT) = 20h




        J. Wang et al. CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012
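
The hybrid idea can be sketched as a machine pass that scores every record pair and sends only the pairs at or above a threshold to the crowd. This sketch assumes token-level Jaccard similarity as the machine score and reuses the slide's threshold of 0.2. The function names and product records are hypothetical, and the real system's scoring and HIT generation may differ.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def candidate_pairs(records, threshold):
    """Machine pass: keep only pairs similar enough to be worth asking people about."""
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if jaccard(a, b) >= threshold
    ]


# Hypothetical product records.
records = [
    "iPad 2 16GB WiFi White",
    "Apple iPad 2 16GB",
    "iPhone 4 32GB Black",
    "Canon EOS camera",
]
all_pairs = len(list(combinations(records, 2)))
to_crowd = candidate_pairs(records, threshold=0.2)
print(all_pairs, to_crowd)
```

Here six possible pairs shrink to the single pair that people need to verify, which is where the cost and time savings come from.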
CrowdQ – Query Generation
 • Help find answers to unstructured queries
     – Approach: Generate a structured query via templates
 • Machines do parsing and ontology lookup
 • People do the rest: verification, entity extraction, etc.




Demartini et al. CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear)
SO, WHERE DOES
MIDDLEWARE FIT IN?
Generic Architecture
[Figure: application layered above Hybrid Platform]

“Middleware is the software that resides between applications
and the underlying architecture. The goal of middleware is to
facilitate the development of applications by providing
higher-level abstractions for better programmability,
performance, scalability, security, and a variety of
essential features.”
                         Middleware 2012 web page
The Challenge




Some issues:
• Incentives
• Latency & Prediction
• Failure Modes
• Work Conditions
• Interface
• Task Structuring
• Task Routing
• …
Can you incentivize workers?




     http://waxy.org/2008/11/the_faces_of_mechanical_turk/
Incentives




Can you trust the crowd?
      On Wikipedia “any user can
      change any entry, and if
      enough users agree with
      them, it becomes true.”

    “The Elephant population in Africa has
    tripled over the past six months.”[1]


Wikiality: Reality as decided on by majority rule.[2]
[1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
[2] http://www.urbandictionary.com/define.php?term=wikiality
Answer Quality Approaches

• Some General Techniques
     – Approval Rate / Demographic Restrictions
     – Qualification Test
     – Gold Sets/Honey Pots
     – Redundancy and Voting
     – Statistical Measures and Bias Reduction
     – Verification/Review
• Query Specific Techniques
• Worker Relationship Management
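
Of the general techniques listed above, "Redundancy and Voting" is the simplest to illustrate: ask several workers the same question and keep the most common answer. The sketch below is one plain majority-vote rule. The function name, the tie handling, and the sample answers are assumptions rather than any particular system's policy.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winner, share of votes); winner is None when the top answers tie."""
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    share = top_count / len(answers)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share   # tie: more assignments would be needed
    return top_answer, share


# Hypothetical answers from three replicated assignments.
winner, share = majority_vote(["IBM", "IBM", "HP"])
tie_winner, tie_share = majority_vote(["yes", "no"])
print(winner, round(share, 2), tie_winner, tie_share)
```

An odd number of assignments per HIT reduces ties for yes/no questions, and a tie is a natural signal to request further assignments.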
Can you organize the crowd?



                                        Independent agreement to identify patches


                                                               Soylent, a prototype...




                                        Randomize order of suggestions




     [Bernstein et al: Soylent: A Word Processor with a Crowd Inside. UIST, 2010]
Can You Predict the Crowd?

     Streakers        List walking




Can you build a low-latency crowd?




from: M S Bernstein, J Brandt, R C Miller, D R Karger, “Crowds in Two
Seconds: Enabling Realtime Crowdsourced Applications”, UIST 2011.
Can you help the crowd?
For More Information
Crowdsourcing Tutorials:
 • P. Ipeirotis, Managing Crowdsourced Human Computation,
   WWW ’11, March 2011.
 • O. Alonso, M. Lease, Crowdsourcing for Information Retrieval:
   Principles, Methods, and Applications, SIGIR July 2011.
 • A. Doan, M. Franklin, D. Kossmann, T. Kraska,
   Crowdsourcing Applications and Platforms: A Data
   Management Perspective, VLDB 2011.


AMPLab: amplab.cs.berkeley.edu
     • Papers
     • Project Descriptions and Pages
     • News updates and Blogs

56

Más contenido relacionado

Destacado

C2 empowering modern human resources and talent management in the cloud
C2   empowering modern human resources and talent management in the cloudC2   empowering modern human resources and talent management in the cloud
C2 empowering modern human resources and talent management in the cloudDr. Wilfred Lin (Ph.D.)
 
Managing human resources at data centers 1.0
Managing human resources at data centers 1.0Managing human resources at data centers 1.0
Managing human resources at data centers 1.0aqel aqel
 
Big Data Triage with Rosette Human Language Technology Conference
Big Data Triage with Rosette Human Language Technology ConferenceBig Data Triage with Rosette Human Language Technology Conference
Big Data Triage with Rosette Human Language Technology ConferenceBasis Technology
 
Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Steve Pell
 
HR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskHR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskLBi Software
 
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarBig Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarH3 HR Advisors, Inc.
 
Data Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRData Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRJosh Bersin
 
Big Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPTBig Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPTNikhil Atkuri
 
21st Century Talent Management: Imperatives for 2014 and 2015
21st Century Talent Management: Imperatives for 2014 and 201521st Century Talent Management: Imperatives for 2014 and 2015
21st Century Talent Management: Imperatives for 2014 and 2015Josh Bersin
 
Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsJuan Antonio Vizcaino
 
Walmart value chain-analysis
Walmart value chain-analysisWalmart value chain-analysis
Walmart value chain-analysisMonica Mishra
 
Strategic planning process and human resource management
Strategic planning process and human resource managementStrategic planning process and human resource management
Strategic planning process and human resource managementJC
 
How Human Resources processes are improved by Advanced Analytics and Big Data
How Human Resources processes are improved by Advanced Analytics and Big DataHow Human Resources processes are improved by Advanced Analytics and Big Data
How Human Resources processes are improved by Advanced Analytics and Big DataCapgemini
 
How Knowledge Management and Big Data Multiply the Impact of CI
How Knowledge Management and Big Data Multiply the Impact of CIHow Knowledge Management and Big Data Multiply the Impact of CI
How Knowledge Management and Big Data Multiply the Impact of CIIntelCollab.com
 
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...Stuart Gow
 
Airline Revenue - Case Study and Industry Analysis
Airline Revenue - Case Study and Industry AnalysisAirline Revenue - Case Study and Industry Analysis
Airline Revenue - Case Study and Industry AnalysisFrank A.
 
Go Big on Community Management!
Go Big on Community Management!Go Big on Community Management!
Go Big on Community Management!Gary Vaynerchuk
 
HR Data Management: The Hard Way vs The Easy Way
HR Data Management: The Hard Way vs The Easy WayHR Data Management: The Hard Way vs The Easy Way
HR Data Management: The Hard Way vs The Easy WaySage HRMS
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 

Destacado (20)

C2 empowering modern human resources and talent management in the cloud
C2   empowering modern human resources and talent management in the cloudC2   empowering modern human resources and talent management in the cloud
C2 empowering modern human resources and talent management in the cloud
 
Managing human resources at data centers 1.0
Managing human resources at data centers 1.0Managing human resources at data centers 1.0
Managing human resources at data centers 1.0
 
Big Data Triage with Rosette Human Language Technology Conference
Big Data Triage with Rosette Human Language Technology ConferenceBig Data Triage with Rosette Human Language Technology Conference
Big Data Triage with Rosette Human Language Technology Conference
 
Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss?
 
HR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskHR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDesk
 
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarBig Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
 
Data Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRData Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HR
 
Big Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPTBig Data in Manufacturing Final PPT
Big Data in Manufacturing Final PPT
 
21st Century Talent Management: Imperatives for 2014 and 2015
21st Century Talent Management: Imperatives for 2014 and 201521st Century Talent Management: Imperatives for 2014 and 2015
21st Century Talent Management: Imperatives for 2014 and 2015
 
Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomics
 
Walmart value chain-analysis
Walmart value chain-analysisWalmart value chain-analysis
Walmart value chain-analysis
 
Strategic planning process and human resource management
Strategic planning process and human resource managementStrategic planning process and human resource management
Strategic planning process and human resource management
 
How Human Resources processes are improved by Advanced Analytics and Big Data
How Human Resources processes are improved by Advanced Analytics and Big DataHow Human Resources processes are improved by Advanced Analytics and Big Data
How Human Resources processes are improved by Advanced Analytics and Big Data
 
How Knowledge Management and Big Data Multiply the Impact of CI
How Knowledge Management and Big Data Multiply the Impact of CIHow Knowledge Management and Big Data Multiply the Impact of CI
How Knowledge Management and Big Data Multiply the Impact of CI
 
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...
Strategic Human Resource Management (SHRM) - MBA 423 Human Resources Manageme...
 
Airline Revenue - Case Study and Industry Analysis
Airline Revenue - Case Study and Industry AnalysisAirline Revenue - Case Study and Industry Analysis
Airline Revenue - Case Study and Industry Analysis
 
Go Big on Community Management!
Go Big on Community Management!Go Big on Community Management!
Go Big on Community Management!
 
HR Data Management: The Hard Way vs The Easy Way
HR Data Management: The Hard Way vs The Easy WayHR Data Management: The Hard Way vs The Easy Way
HR Data Management: The Hard Way vs The Easy Way
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 

Similar a Middeware2012 crowd

Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.docbutest
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemTreasure Data, Inc.
 
Crowdsourcing challenges and opportunities 2012
Crowdsourcing challenges and opportunities 2012Crowdsourcing challenges and opportunities 2012
Crowdsourcing challenges and opportunities 2012xin wang
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Your Big Data Arsenal - Strata 2013
Your Big Data Arsenal - Strata 2013Your Big Data Arsenal - Strata 2013
Your Big Data Arsenal - Strata 2013Matt Asay
 
Essential Tools For Your Big Data Arsenal
Essential Tools For Your Big Data ArsenalEssential Tools For Your Big Data Arsenal
Essential Tools For Your Big Data ArsenalMongoDB
 
The Secret Formula to Staying Customer Conscious During Late-Stage Product De...
The Secret Formula to Staying Customer Conscious During Late-Stage Product De...The Secret Formula to Staying Customer Conscious During Late-Stage Product De...
The Secret Formula to Staying Customer Conscious During Late-Stage Product De...Aggregage
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
Middeware2012 crowd

  • 1. Integrating Crowd & Cloud Resources for Big Data. Michael Franklin, UC Berkeley. Middleware 2012, Montreal, December 6, 2012. Expeditions in Computing.
  • 6. Data Collection & Curation e.g., Freebase
  • 7. An Academic View From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.
  • 8. The Way Industry Looks At It
  • 9. Useful Taxonomies
      • Doan, Halevy, Ramakrishnan; (Crowdsourcing) CACM 4/11
        – nature of collaboration (implicit vs. explicit)
        – architecture (standalone vs. piggybacked)
        – must recruit users/workers? (yes or no)
        – What do users/workers do?
      • Quinn & Bederson; (Human Computation) CHI '11
        – Motivation (Pay, Altruism, Enjoyment, Reputation)
        – Quality Control (many mechanisms)
        – Aggregation (how are results combined?)
        – Human Skill (Visual recognition, language, …)
        – …
  • 10. Types of Tasks
      Task Granularity    Examples
      Complex Tasks       • Build a website
                          • Develop a software system
                          • Overthrow a government?
      Simple Projects     • Design a logo and visual identity
                          • Write a term paper
      Macro Tasks         • Write a restaurant review
                          • Test a new website feature
                          • Identify a galaxy
      Micro Tasks         • Label an image
                          • Verify an address
                          • Simple entity resolution
      Inspired by the report: “Paid Crowdsourcing”, Smartsheet.com, 9/15/2009
  • 13. Microtasking – Virtualized Humans
      • Current leader: Amazon Mechanical Turk
      • Requestors place Human Intelligence Tasks (HITs)
        – set price per “assignment” (usually cents)
        – specify # of replicas (assignments), expiration, …
        – User Interface (for workers)
        – API-based: “createHit()”, “getAssignments()”, “approveAssignments()”, “forceExpire()”
      • Requestors approve jobs and payment
      • Workers (a.k.a. “turkers”) choose jobs, do them, get paid
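The requester/worker lifecycle listed on this slide can be sketched with a small in-memory model. This is only an illustration: the `ToyMarketplace` class, its method names and its fields are hypothetical stand-ins loosely named after the calls quoted on the slide, and it does not use or reproduce the real Mechanical Turk API.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Hit:
    hit_id: int
    question: str
    reward_cents: int       # price per assignment
    max_assignments: int    # number of replicas requested
    expired: bool = False
    # worker_id -> [answer, approved_flag]
    assignments: dict = field(default_factory=dict)


class ToyMarketplace:
    """In-memory stand-in for a micro-task marketplace (hypothetical, not the AMT API)."""

    def __init__(self):
        self._ids = count(1)
        self.hits = {}
        self.paid_cents = {}

    def create_hit(self, question, reward_cents, max_assignments):
        hit = Hit(next(self._ids), question, reward_cents, max_assignments)
        self.hits[hit.hit_id] = hit
        return hit.hit_id

    def submit(self, hit_id, worker_id, answer):
        """A worker picks up the HIT and answers it; returns False if not accepted."""
        hit = self.hits[hit_id]
        if hit.expired or len(hit.assignments) >= hit.max_assignments:
            return False
        if worker_id in hit.assignments:
            return False  # one assignment per worker per HIT
        hit.assignments[worker_id] = [answer, False]
        return True

    def get_assignments(self, hit_id):
        return {w: rec[0] for w, rec in self.hits[hit_id].assignments.items()}

    def approve_assignment(self, hit_id, worker_id):
        """Requester approves the work; payment happens only on approval."""
        hit = self.hits[hit_id]
        record = hit.assignments[worker_id]
        if not record[1]:
            record[1] = True
            self.paid_cents[worker_id] = self.paid_cents.get(worker_id, 0) + hit.reward_cents

    def force_expire(self, hit_id):
        self.hits[hit_id].expired = True
```

In this sketch a requester would call `create_hit`, poll `get_assignments`, and then `approve_assignment` for each acceptable answer, mirroring the "requestors approve jobs and payment" step.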
  • 18. Crowdsourcing for Data Management
      • Relational
        – data cleaning
        – data entry
        – information extraction
        – schema matching
        – entity resolution
        – data spaces
        – building structured KBs
        – sorting
        – top-k
        – ...
      • Beyond relational
        – graph search
        – classification
        – transcription
        – mobile image search
        – social media analysis
        – question answering
        – NLP
        – text summarization
        – sentiment analysis
        – semantic wikis
        – ...
  • 20. Not Exactly Crowdsourcing, but… “The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.”
  • 21. AMP: Integrating Diverse Resources
      Algorithms: Machine Learning and Analytics
      Machines: Cloud Computing
      People: CrowdSourcing & Human Computation
  • 22. The Berkeley AMPLab
      • Goal: Data analytics stack integrating A, M & P
      • BDAS: Released as BSD/Apache Open Source
      • 6 year duration: 2011-2017
      • 8 CS Faculty
      • Directors: Franklin (DB), Jordan (ML), Stoica (Sys)
      • Industrial Support & Collaboration
      • NSF Expedition and DARPA XData
  • 23. People in AMP
      • Long term Goal: Make people an integrated part of the system!
      • Leverage human activity
      • Leverage human intelligence
      • Current AMP People Projects
        – Carat: Collaborative Energy Debugging
        – CrowdDB: “The World's Dumbest Database System”
        – CrowdER: Hybrid computation for Entity Resolution
        – CrowdQ: Hybrid Unstructured Query Answering
      [Diagram labels: Machines + Algorithms; data, activity; Questions; Answers]
  • 24. Carat: Leveraging Human Activity. ~500,000 downloads to date. A. J. Oliner, et al. Collaborative Energy Debugging for Mobile Devices. Workshop on Hot Topics in System Dependability (HotDep), 2012.
  • 25. Carat: How it works. Collaborative Detection of Energy Bugs
  • 26. Leveraging Human Intelligence
      First Attempt: CrowdDB
      [Architecture diagram: CrowdSQL in, Results out. Components: Parser, Optimizer, Executor, Files Access Methods, Disk 1, Disk 2, MetaData, Statistics, Turker Relationship Manager, UI Creation, Form Editor, UI Template Manager, HIT Manager]
      See also: Qurk – MIT; Deco – Stanford
      CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
      Query Processing with the VLDB Crowd, VLDB 2011
  • 27. DB-hard Queries
      Company_Name               Address                      Market Cap
      Google                     Googleplex, Mtn. View CA     $210Bn
      Intl. Business Machines    Armonk, NY                   $200Bn
      Microsoft                  Redmond, WA                  $250Bn
      SELECT Market_Cap From Companies Where Company_Name = “IBM”
      Number of Rows: 0
      Problem: Entity Resolution
  • 28. DB-hard Queries
      Company_Name               Address                      Market Cap
      Google                     Googleplex, Mtn. View CA     $210Bn
      Intl. Business Machines    Armonk, NY                   $200Bn
      Microsoft                  Redmond, WA                  $250Bn
      SELECT Market_Cap From Companies Where Company_Name = “Apple”
      Number of Rows: 0
      Problem: Closed-World Assumption
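The two zero-row results above can be reproduced with any conventional relational engine. The sketch below uses Python's built-in `sqlite3` module as a stand-in; the table contents come from the slides, while the `market_cap` helper and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Companies (Company_Name TEXT, Address TEXT, Market_Cap TEXT)"
)
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [
        ("Google", "Googleplex, Mtn. View CA", "$210Bn"),
        ("Intl. Business Machines", "Armonk, NY", "$200Bn"),
        ("Microsoft", "Redmond, WA", "$250Bn"),
    ],
)


def market_cap(name):
    """Exact-match lookup, as a conventional DBMS would evaluate it."""
    return conn.execute(
        "SELECT Market_Cap FROM Companies WHERE Company_Name = ?", (name,)
    ).fetchall()


# Entity resolution: the row exists, but under a different spelling.
print(market_cap("IBM"))      # no rows
# Closed-world assumption: the row is simply absent from the table.
print(market_cap("Apple"))    # no rows
# Only an exact string match succeeds.
print(market_cap("Google"))
```

Both failing queries are "correct" from the engine's point of view, which is the gap the crowd is meant to fill.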
  • 29. DB-hard Queries SELECT Image From Pictures Where Image contains “Good Looking Dog” Number of Rows: 0 Problem: Subjective Comparison
  • 30. Leveraging Human Intelligence
      First Attempt: CrowdDB
      [Architecture diagram: CrowdSQL in, Results out. Components: Parser, Optimizer, Executor, Files Access Methods, Disk 1, Disk 2, MetaData, Statistics, Turker Relationship Manager, UI Creation, Form Editor, UI Template Manager, HIT Manager]
      Where to use the crowd:
      • Cleaning and Disambiguation
      • Find missing data
      • Make subjective comparisons
      CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
      Query Processing with the VLDB Crowd, VLDB 2011
  • 31. CrowdDB - Worker Interface
  • 33. CrowdSQL
      DDL Extensions:
        Crowdsourced columns
          CREATE TABLE company (
            name STRING PRIMARY KEY,
            hq_address CROWD STRING);
        Crowdsourced tables
          CREATE CROWD TABLE department (
            university STRING,
            department STRING,
            phone_no STRING)
            PRIMARY KEY (university, department);
      DML Extensions:
        CrowdEqual:
          SELECT *
          FROM companies
          WHERE Name ~= “Big Blue”
        CROWDORDER operators (currently UDFs):
          SELECT p FROM picture
          WHERE subject = "Golden Gate Bridge"
          ORDER BY CROWDORDER(p, "Which pic shows better %subject");
  • 34. CrowdDB Query: Picture ordering
      Query:
        SELECT p FROM picture
        WHERE subject = "Golden Gate Bridge"
        ORDER BY CROWDORDER(p, "Which pic shows better %subject");
      Worker question: Which picture visualizes better "Golden Gate Bridge"? [Submit]
      Data-Size: 30 subject areas, with 8 pictures each
      Batching: 4 orderings per HIT
      Replication: 3 Assignments per HIT
      Price: 1 cent per HIT
      [Figure: (turker-votes, turker-ranking, expert-ranking)]
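One simple way to turn replicated pairwise comparisons into an ordering is to take the majority per pair and rank pictures by the number of pairs they win. The sketch below assumes that aggregation rule and a hypothetical `(a, b, winner)` vote format; the slides do not state how CrowdDB itself combines the three assignments per HIT.

```python
from collections import Counter, defaultdict


def rank_by_majority(votes):
    """Rank items from pairwise votes.

    votes: iterable of (a, b, winner) tuples, where winner is a or b.
    Each pair is decided by majority over its replicated votes; items are
    then sorted by number of pairs won (ties broken by item name).
    """
    per_pair = defaultdict(Counter)
    items = set()
    for a, b, winner in votes:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")
        key = tuple(sorted((a, b)))
        per_pair[key][winner] += 1
        items.update(key)

    wins = {item: 0 for item in items}
    for tally in per_pair.values():
        ranked = tally.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # split vote: nobody wins this pair
        wins[ranked[0][0]] += 1

    return sorted(items, key=lambda item: (-wins[item], item))


# Three assignments per comparison, as in the experiment's replication setting.
votes = [
    ("p1", "p2", "p1"), ("p1", "p2", "p1"), ("p1", "p2", "p2"),
    ("p1", "p3", "p1"), ("p1", "p3", "p1"), ("p1", "p3", "p1"),
    ("p2", "p3", "p2"), ("p2", "p3", "p3"), ("p2", "p3", "p2"),
]
print(rank_by_majority(votes))
```

An odd replication factor such as 3 avoids split votes on a pair when every assignment is completed.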
  • 35. User Interface vs. Quality
      [Figure: three worker-interface designs for filling in missing professor and department data, each with its query plan (MTJoin and MTProbe over Professor and Department, with p.dep = d.name and p.name = "carey")]
      (Department first): ≈10% Error-Rate
      (Professor first): ≈10% Error-Rate
      (De-normalized Probe): ≈80% Error-Rate
  • 36. Turker Affinity and Errors [Figure; x-axis: Turker Rank]
  • 37. A Bigger Underlying Issue: Closed-World vs. Open-World
  • 38. What Does This Query Mean? SELECT COUNT(*) FROM IceCreamFlavors. Trushkowsky et al. Crowdsourcing Enumeration Queries, ICDE 2013 (to appear)
  • 39. Estimating Completeness
      SELECT COUNT(*) FROM US States
      US States using Mechanical Turk: Species Estimation techniques perform well on average
      • Uniform under-predicts slightly, coeff of var. = 0.5
      • Decent estimate after 100 HITs
      [Plot: “States: unique items”; x-axis: # Answers (HITs), 0 to 300; y-axis: avg # unique answers, 0 to 50]
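To make "species estimation" concrete, the sketch below applies one classic estimator, the bias-corrected Chao1 formula, to a list of crowd answers. Chao1 is chosen here purely for illustration; the slides do not say which estimators were used, and the function name and sample data are hypothetical.

```python
from collections import Counter


def chao1_estimate(answers):
    """Bias-corrected Chao1 estimate of the total number of distinct items.

    S_est = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    where S_obs is the number of distinct answers seen so far, f1 the number
    seen exactly once (singletons) and f2 the number seen exactly twice.
    """
    counts = Counter(answers)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical answers collected from workers asked to name a US state.
answers = ["CA", "NY", "CA", "TX", "WA", "NY", "CA"]
print(chao1_estimate(answers))
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that more answers are still worth collecting.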
  • 40. Estimating Completeness
      SELECT COUNT(*) FROM IceCreamFlavors
      • Ice Cream Flavors
        – Estimators don't converge
        – Very highly skewed (CV = 5.8)
        – Detect that # HITs insufficient
      Few, short lists of ice cream flavors (e.g. “alumni swirl, apple cobbler crunch, arboretum breeze,…” from Penn State Creamery)
      [Plot: Ice Cream Flavors (beginning of curve)]
  • 41. pay-as-you-go
      • “I don't believe it is usually possible to estimate the number of species... but only an appropriate lower bound for that number. This is because there is nearly always a good chance that there are a very large number of extremely rare species” – Good, 1953
      • So instead, can ask: “What's the benefit of m additional HITs?”
      Ice Cream after 1500 HITs:
        m      Actual    Shen    Spline
        10     1         1.79    1.62
        50     7         8.91    8.22
        200    39        35.4    32.9
  • 42. CrowdER - Entity Resolution
  • 43. Hybrid Entity-Resolution
      Threshold = 0.2; #Pairs = 8,315; #HITs = 508; Cost = $38.1; Time = 4.5h; Time(QT) = 20h
      J. Wang et al. CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012
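The hybrid idea is that a cheap machine pass prunes the candidate pairs and only the survivors are sent to people. A minimal sketch follows, assuming token-level Jaccard similarity as the machine measure; the 0.2 threshold is the value shown on the slide, while the function names and sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity over lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pairs_for_crowd(records, threshold=0.2):
    """Machine pass: keep only pairs similar enough to be worth a human check."""
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if jaccard(a, b) >= threshold
    ]


records = [
    "iPad 2 16GB WiFi White",
    "Apple iPad 2nd generation 16GB WiFi White",
    "Samsung Galaxy Tab",
]
# Only the surviving pairs would be batched into HITs for crowd verification.
print(pairs_for_crowd(records))
```

A lower threshold sends more pairs to the crowd (higher cost, fewer missed matches); a higher one does the opposite.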
  • 44. CrowdQ – Query Generation
      • Help find answers to unstructured queries
        – Approach: Generate a structured query via templates
      • Machines do parsing and ontology lookup
      • People do the rest: verification, entity extraction, etc.
      Demartini et al. CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear)
  • 46. Generic Architecture
      “Middleware is the software that resides between applications and the underlying architecture. The goal of middleware is to facilitate the development of applications by providing higher-level abstractions for better programmability, performance, scalability, security, and a variety of essential features.” (Middleware 2012 web page)
      [Diagram labels: application, Hybrid Platform]
  • 47. The Challenge
      Some issues: Incentives; Latency & Prediction; Failure Modes; Work Conditions; Interface; Task Structuring; Task Routing; …
  • 48. Can you incentivize workers? http://waxy.org/2008/11/the_faces_of_mechanical_turk/
  • 50. Can you trust the crowd?
      On Wikipedia “any user can change any entry, and if enough users agree with them, it becomes true.” “The Elephant population in Africa has tripled over the past six months.”[1]
      Wikiality: Reality as decided on by majority rule.[2]
      [1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
      [2] http://www.urbandictionary.com/define.php?term=wikiality
  • 51. Answer Quality Approaches
      • Some General Techniques
        – Approval Rate / Demographic Restrictions
        – Qualification Test
        – Gold Sets/Honey Pots
        – Redundancy and Voting
        – Statistical Measures and Bias Reduction
        – Verification/Review
      • Query Specific Techniques
      • Worker Relationship Management
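Two of the general techniques listed above, gold sets and redundancy with voting, combine naturally: score each worker on questions with known answers, drop workers who fall below a cutoff, then take the majority label from the rest. The sketch below assumes that combination; the data layout, function names and the 0.5 cutoff are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def worker_accuracy(answers, gold):
    """Fraction of gold questions each worker answered correctly.

    answers: list of (worker, item, label); gold: dict item -> true label.
    Workers who saw no gold question are absent from the result.
    """
    seen, correct = Counter(), Counter()
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: correct[w] / seen[w] for w in seen}


def majority_labels(answers, gold, min_accuracy=0.5):
    """Majority vote per non-gold item, counting only workers who pass the gold check."""
    accuracy = worker_accuracy(answers, gold)
    tallies = defaultdict(Counter)
    for worker, item, label in answers:
        if item in gold:
            continue  # gold items are for screening, not for output
        if accuracy.get(worker, 0.0) < min_accuracy:
            continue  # unscreened or unreliable worker
        tallies[item][label] += 1
    return {item: tally.most_common(1)[0][0] for item, tally in tallies.items()}


gold = {"g1": "cat"}
answers = [
    ("w1", "g1", "cat"), ("w2", "g1", "cat"), ("w3", "g1", "dog"),
    ("w1", "x", "a"), ("w2", "x", "a"), ("w3", "x", "b"),
    ("w1", "y", "b"), ("w3", "y", "c"),
]
print(worker_accuracy(answers, gold))
print(majority_labels(answers, gold))
```

Note that this sketch ignores workers who never saw a gold question; a real deployment would need a policy for them.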
  • 52. Can you organize the crowd? [Figure labels: Independent agreement to identify patches; “Soylent, a prototype...”; Randomize order of suggestions] [Bernstein et al: Soylent: A Word Processor with a Crowd Inside. UIST, 2010]
  • 53. Can You Predict the Crowd? [Figure labels: Streakers; List walking]
  • 54. Can you build a low-latency crowd? From: M S Bernstein, J Brandt, R C Miller, D R Karger, “Crowds in Two Seconds: Enabling Realtime Crowdsourced Applications”, UIST 2011.
  • 55. Can you help the crowd?
  • 56. For More Information
      Crowdsourcing Tutorials:
      • P. Ipeirotis, Managing Crowdsourced Human Computation, WWW '11, March 2011.
      • O. Alonso, M. Lease, Crowdsourcing for Information Retrieval: Principles, Methods, and Applications, SIGIR July 2011.
      • A. Doan, M. Franklin, D. Kossmann, T. Kraska, Crowdsourcing Applications and Platforms: A Data Management Perspective, VLDB 2011.
      AMPLab: amplab.cs.berkeley.edu
      • Papers
      • Project Descriptions and Pages
      • News updates and Blogs

Editor's notes

  1. Fix ME!!!!
  2. For the database administrator it is the correct answer, but for the CEO it is not really understandable
  3. Equal is not a good fit
  4. 210 HITs. It took 68 minutes to complete the whole experiment.
  5. -
  6. Lead off by saying that a heavily skewed distribution will be difficult to estimate, only a lower bound (say quote). Instead, reason about the cost vs. benefit tradeoff. When you ask a slightly different question, you can still make progress!
  7. General Techniques (NON-DB techniques)