CUbRIK research at CIKM 2012: Map to Humans and Reduce Error
Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries
Mihai Georgescu, Dang Duc Pham, Claudiu S. Firan, Julien Gaugaz, Wolfgang Nejdl
Approach
• Find duplicate entities based on metadata
• Focus on scientific publications in the FreeSearch system
• An automatic method and human labelers work together towards improving their performance at identifying duplicate entities

Crowdsourcing
• Workers see the metadata of two publications side by side, e.g.:

  Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling
  Authors: Soraya B. Rana, Adele E. Howe, L. Darrell Whitley, Keith E. Mathias
  Venue: Proceedings of the Third International Conference on Artificial Intelligence Planning Systems, Menlo Park, CA
  Book: AIPS, pp. 174-181
  Publisher: The AAAI Press
  Year: 1996
  Language: English
  Type: conference (inproceedings)
  Abstract: The choice of search algorithm can play a vital role in the success of a scheduling application. In this paper, we investigate the contribution of search algorithms in solving a real-world warehouse scheduling problem. We compare performance of three types of scheduling algorithms: heuristic, genetic algorithms and local search.

  next to a near-identical record (Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling.; Authors: Soraya Rana, Adele E. Howe, L. Darrell Whitley, Keith Mathias; Year: 1996; Language: English; Type: conference)

• Judgment for a publications pair: "After carefully reviewing the publications metadata presented to you, how would you classify the 2 publications referred: o Duplicates / o Not Duplicates"
• 1 HIT = 5 pairs, 5 ct/HIT, 3 to 5 assignments per HIT

Crowd Soft Decision (CSD)
• Aggregation of all individual votes W_i,j(k) ∈ {-1,1} on a pair (i,j) into a CSD_i,j ∈ [0,1]:

  CSD_i,j = Σ_k weight_i,j(k) · (W_i,j(k) + 1) / 2,   with weight_i,j(k) = c_k / Σ_v c_v

  where c_k is the confidence we have in worker k
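The aggregation above can be sketched as follows (a minimal illustration; the function name and helper variables are ours, and majority voting is simply the special case where every weight is c_k = 1):

```python
def crowd_soft_decision(votes, weights):
    """Aggregate votes W_{i,j}(k) in {-1, +1} into a CSD in [0, 1].

    votes[k] is worker k's vote on the pair (i, j); weights[k] is the
    (unnormalized) weight of that vote, e.g. the worker confidence c_k.
    Equivalent to sum_k (c_k / sum_v c_v) * (W_k + 1) / 2.
    """
    assert votes and len(votes) == len(weights)
    total = sum(weights)
    weighted = sum(w * v for v, w in zip(votes, weights))
    # Map the weighted vote average from [-1, 1] to [0, 1].
    return 0.5 * (1.0 + weighted / total)

# Majority voting (MV): all workers equal, c_k = 1.
csd = crowd_soft_decision([1, 1, -1], [1.0, 1.0, 1.0])  # 0.5 * (1 + 1/3) ≈ 0.667
cd = 1 if csd > 0.5 else 0  # compare CSD to 0.5 => CD ∈ {0, 1}
```

With two of three workers voting "duplicate", the soft decision lands above 0.5 and the hard crowd decision is CD = 1.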
Method
• Actively learn how to deduplicate from the crowd by optimizing the parameters of the automatic method
• Use MTurk HITs to get labeled data, while tackling the quality issues of the crowdsourced work

Automatic Method
• The DuplicatesScorer produces an automatic duplicate score (ADS) for each pair in P_cand
• Its parameters are DSParams = {(fieldName, fieldWeight)} and a threshold
• Compare ADS to threshold => AD ∈ {0,1}
• Identify the pairs with ADS = threshold ± ε, sample them into P_train, and set P_cand = P_cand - P_train; get crowd labels for these pairs

Worker Confidence
• Assess how reliable the individual workers are when compared to the overall performance of the crowd
• Simple measure: the proportion of a worker's pairs that carry the same label as the one assigned by the crowd
• Use an EM algorithm to iteratively compute the worker confidence: compute the CSD, then update c_k
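The EM-style alternation can be sketched as follows. This is an illustrative simplification with data structures of our choosing: the poster does not spell out the exact update rule, so here c_k is the (smoothed) proportion of a worker's votes that agree with the current hard crowd decision:

```python
def em_worker_confidence(votes, n_iters=10):
    """votes: dict mapping pair -> {worker: vote in {-1, +1}}.
    Returns (CSD per pair, confidence c_k per worker)."""
    workers = {k for pv in votes.values() for k in pv}
    conf = {k: 1.0 for k in workers}  # start from MV: all workers equal
    csd = {}
    for _ in range(n_iters):
        # E-step: crowd soft decision per pair, votes weighted by c_k.
        for pair, pv in votes.items():
            total = sum(conf[k] for k in pv)
            weighted = sum(conf[k] * v for k, v in pv.items())
            csd[pair] = 0.5 * (1.0 + weighted / total)
        # M-step: c_k = share of the worker's votes that agree with the
        # hard crowd decision CD = [CSD > 0.5]; Laplace smoothing keeps
        # every weight strictly positive.
        for k in workers:
            judged = [(p, pv[k]) for p, pv in votes.items() if k in pv]
            agree = sum((v == 1) == (csd[p] > 0.5) for p, v in judged)
            conf[k] = (agree + 1) / (len(judged) + 2)
    return csd, conf
```

On a toy batch where workers a and b always agree and worker c always dissents, the loop drives c's confidence well below a's, and the CSDs sharpen accordingly.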
Crowd Decision
• The aggregated decision from all workers for a pair produces a CSD
• A worker's contribution to the CSD is proportional to the confidence c_k we have in that worker
• Compare CSD to 0.5 => CD ∈ {0,1}

Crowd Decision Strategies
• MV: Majority Voting; all users are equal, c_k = 1
• Iter: c_k computed using the EM algorithm
• Boost: c_k computed using the EM algorithm, with boosted weights in the computation of the CSD
• Heur: heuristic 3/3 or 4/5 agreement

Optimization Loop
• Identify the duplicate pairs from P_train => P_dupl
• Optimize DSParams and the threshold to fit the Crowd Decision data in P_train: initial DSParams and threshold => better DSParams and threshold
• Repeat until P_cand = φ
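One simple way to realize the "optimize DSParams and threshold to fit the Crowd Decision" step is an exhaustive grid search that maximizes agreement (Accuracy) between the automatic decision AD and the crowd decision CD on P_train. The similarity inputs, grid values, and helper names below are illustrative assumptions, not the poster's actual optimizer:

```python
import itertools

def ads(sims, weights):
    """Automatic Duplicate Score: normalized weighted sum of per-field
    similarities, following DSParams = {(fieldName, fieldWeight)}."""
    total = sum(weights.values())
    return sum(w * sims[f] for f, w in weights.items()) / total

def fit_to_crowd(train, fields,
                 weight_grid=(0.0, 0.5, 1.0), thresholds=(0.3, 0.5, 0.7)):
    """train: list of (per-field similarity dict, crowd decision CD in {0,1}).
    Grid-search field weights and threshold to maximize accuracy vs. CD."""
    best_acc, best_params, best_th = -1.0, None, None
    for combo in itertools.product(weight_grid, repeat=len(fields)):
        if sum(combo) == 0:
            continue  # at least one field must carry weight
        weights = dict(zip(fields, combo))
        for th in thresholds:
            # AD = [ADS >= threshold]; accuracy = agreement with CD.
            acc = sum((ads(s, weights) >= th) == (cd == 1)
                      for s, cd in train) / len(train)
            if acc > best_acc:
                best_acc, best_params, best_th = acc, weights, th
    return best_acc, best_params, best_th
```

A coarse grid like this is cheap for a handful of metadata fields; any continuous optimizer over the same objective would slot into the loop the same way.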
Experiment Setup
• 3 batches:
  o 60 HITs with qualification test
  o 60 HITs without qualification test
  o 120 HITs without qualification test
• Each pair judged by 3 or 5 workers

Duplicate Detection Strategies
• sign: just signatures
• DS/m, DS/o: just the DuplicatesScorer
• sign+DS/m, sign+DS/o: first compute signatures, then base the decision on the DuplicatesScorer
• CD-MV: directly use the Crowd Decision obtained via Majority Voting

Precision (P), Recall (R) and Accuracy (A) per duplicate detection strategy:

      sign   sign+DS/m   sign+DS/o   DS/m   DS/o   CD-MV
  P   0.95   0.95        1.00        0.48   0.66   0.63
  R   0.20   0.20        0.20        0.67   0.56   0.97
  A   0.77   0.77        0.77        0.70   0.79   0.83

Optimization Strategies
• Accuracy: compare CD to AD and optimize DSParams and the threshold to maximize Accuracy
• Sum-Err: compare ADS to CSD and optimize DSParams to minimize the sum of errors
• Sum-log-err: compare ADS to CSD and optimize DSParams to minimize the sum of the logs of the errors
• Pearson: compare ADS to CSD and optimize DSParams to maximize the Pearson correlation
• A further variant compares CD to AD and optimizes only the threshold to maximize Accuracy

Accuracy (%) per optimization strategy (rows) and crowd decision strategy (columns):

                3 workers   5 workers
                MV          MV      Iter    Manual   Boost   Heur
  Accuracy      79.19       80.00   79.73   80.00    78.92   79.73
  Sum-Err       76.49       79.46   79.46   79.46    79.46   79.19
  Sum-log-err   71.89       78.11   78.38   78.92    80.27   76.76
  Pearson       73.24       79.46   79.46   80.54    79.46   81.08
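The sign strategy compares normalized metadata signatures. The concrete signature below (lowercased alphanumeric title plus sorted author last names) is an assumption for illustration only, since the poster does not define it:

```python
import re
import unicodedata

def signature(record):
    """Illustrative signature: strip accents and punctuation from the
    title, lowercase it, and append the sorted author last names."""
    t = unicodedata.normalize('NFKD', record['title'])
    t = t.encode('ascii', 'ignore').decode('ascii')
    t = re.sub(r'[^a-z0-9]', '', t.lower())
    last_names = sorted(a.split()[-1].lower() for a in record['authors'])
    return t + '|' + ','.join(last_names)

# The example pair from the crowdsourcing task above: the records differ
# in punctuation and middle initials, but normalize to the same key.
r1 = {'title': 'Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling',
      'authors': ['Soraya B. Rana', 'Adele E. Howe', 'L. Darrell Whitley', 'Keith E. Mathias']}
r2 = {'title': 'Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling.',
      'authors': ['Soraya Rana', 'Adele E. Howe', 'L. Darrell Whitley', 'Keith Mathias']}
duplicate = signature(r1) == signature(r2)  # True
```

Exact signature matching explains the P/R pattern in the table above: any metadata difference the normalization misses breaks the match, so signatures alone are precise but recall-poor.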
Contact: Mihai Georgescu
email: georgescu@L3S.de
dblp.kbs.uni-hannover.de
L3S Research Center / Leibniz Universität Hannover
Appelstrasse 4, 30167 Hannover, Germany
phone: +49 511 762-19715 www.cubrikproject.eu