Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries
Mihai Georgescu, Dang Duc Pham, Claudiu S. Firan, Julien Gaugaz, Wolfgang Nejdl

   • Find duplicate entities based on metadata
   • Focus on scientific publications in the Freesearch system
   • An automatic method and human labelers work together towards improving their performance at identifying duplicate entities
   • Actively learn how to deduplicate from the crowd by optimizing the parameters of the automatic method
   • Use MTurk HITs to get labeled data, while tackling the quality issues of the crowdsourced work

Crowdsourcing:
   • 1 HIT = 5 pairs
   • 5 ct / HIT
   • 3 -> 5 assignments

Example pair as presented to workers in a HIT:

   Publication 1
      Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling
      Authors: Soraya Rana, Adele E. Howe, L. Darrell, Whitley Keith Mathias
      Venue: Proceedings of the Third International Conference on Artificial Intelligence Planning Systems, Menlo Park, CA
      Publisher: The AAAI Press
      Year: 1996
      Language: English
      Type: conference
      Abstract: The choice of search algorithm can play a vital role in the success of a scheduling application. In this paper, we investigate the contribution of search algorithms in solving a real-world warehouse scheduling problem. We compare performance of three types of scheduling algorithms: heuristic, genetic algorithms and local search.

   Publication 2
      Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling.
      Authors: Soraya B. Rana, Adele E. Howe, L. Darrell Whitley, Keith E. Mathias
      Book: AIPS Pg. 174-181 [Contents]
      Year: 1996
      Language: English
      Type: conference (inproceedings)

   After carefully reviewing the publications metadata presented to you, how would you classify the 2 publications referred:
   Judgment for publications pair:
      o Duplicates
      o Not Duplicates

Automatic Method
 • DuplicatesScorer produces an ADS
 • DSParams = {(fieldName, fieldWeight)} and threshold
 • Compare ADS to threshold => AD ∈ {1, 0}

Crowd Decision
 • Aggregated decision from all workers for a pair produces a CSD
 • Worker contribution to the CSD is proportional to the confidence c_k we have in that worker
 • Compare CSD to 0.5 => CD ∈ {1, 0}

Crowd Soft Decision
 • Aggregation of all individual votes W_{i,j}(k) ∈ {-1, 1}
 • CSD ∈ [0, 1]

\[
CSD_{i,j} = \frac{1 + \sum_{k \in W_{i,j}} weight_{i,j}(k)\, W_{i,j}(k)}{2},
\qquad
weight_{i,j}(k) = \frac{c_k}{\sum_{v \in W_{i,j}} c_v}
\]

Worker Confidence
 • Assess how reliable the individual workers are when compared to the overall performance of the crowd
 • Simple measure: proportion of pairs that have the same label as the one assigned by the crowd
 • Use an EM algorithm to iteratively compute the worker confidence
      o Compute CSD
      o Update c_k

Crowd Decision Strategies:
 • MV: Majority Voting; all users are equal, c_k = 1
 • Iter: c_k computed using the EM algorithm
 • Boost: c_k computed using the EM algorithm, using boosted weights in the computation of CSD
 • Heur: Heuristic 3/3 or 4/5

Workflow diagram:
 • Input: initial DSParams and threshold; P_cand = ∅
 • Identify pairs with ADS = threshold ± ε; sample and add to P_cand
 • Get crowd labels for P_cand
 • Compute crowd decisions and worker confidences
 • High confidence pairs => P_train; P_cand = P_cand - P_train
 • Optimize DSParams and threshold to fit the data in P_train
 • Identify duplicate pairs from P_train => P_dupl
 • Output: better DSParams and threshold; P_dupl
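
The Crowd Soft Decision and the iterative worker confidence update can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed number of rounds, the tie rule (a CSD of exactly 0.5 counts as "not duplicate") and the small positive floor on c_k are all hypothetical choices. The confidence update uses the simple measure named on the poster (share of a worker's pairs that match the crowd label); the poster's actual EM procedure and the Boost variant may differ.

```python
from collections import defaultdict


def crowd_soft_decision(votes, confidence):
    """CSD for one pair.

    votes: {worker_id: +1 (duplicate) or -1 (not duplicate)}
    confidence: {worker_id: c_k}, assumed strictly positive
    """
    total = sum(confidence[k] for k in votes)
    # weight(k) = c_k / sum of c_v over the workers who voted on this pair
    weighted = sum(confidence[k] / total * w for k, w in votes.items())
    return (1 + weighted) / 2


def iterate_confidences(pair_votes, rounds=10):
    """Alternate between computing CSD and updating c_k.

    pair_votes: {pair_id: {worker_id: vote}}
    Returns (csd_per_pair, confidence_per_worker).
    """
    workers = {k for votes in pair_votes.values() for k in votes}
    # Start with c_k = 1 for everyone, which equals majority voting (MV).
    confidence = {k: 1.0 for k in workers}
    for _ in range(rounds):
        csd = {p: crowd_soft_decision(v, confidence)
               for p, v in pair_votes.items()}
        agree = defaultdict(int)
        seen = defaultdict(int)
        for p, votes in pair_votes.items():
            crowd_label = 1 if csd[p] > 0.5 else -1  # assumed tie rule
            for k, w in votes.items():
                seen[k] += 1
                agree[k] += int(w == crowd_label)
        # Simple measure: share of pairs where the worker matches the crowd.
        # The floor keeps every weight positive (an assumption of this sketch).
        confidence = {k: max(agree[k] / seen[k], 1e-6) for k in workers}
    # Recompute CSD once more so it reflects the final confidences.
    csd = {p: crowd_soft_decision(v, confidence)
           for p, v in pair_votes.items()}
    return csd, confidence
```

With three workers where one disagrees with the other two on most pairs, the dissenting worker's c_k drops after the first round, so the CSD of the contested pairs moves further away from 0.5 than under plain majority voting.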


Duplicate Detection Strategies
   • Just signatures
       o sign
   • Just the DuplicatesScorer
       o DS/m
       o DS/o
   • First compute signatures, then base the decision on the DuplicatesScorer
       o sign + DS/m
       o sign + DS/o
   • Directly use the Crowd Decision obtained via Majority Voting
       o CD-MV

   [Bar chart: P, A and R per strategy, on a scale from 0 to 1.00; values in the table below]

         sign    sign+DS/m    sign+DS/o    DS/m    DS/o    CD-MV
   R     0.20    0.20         0.20         0.67    0.56    0.97
   A     0.77    0.77         0.77         0.70    0.79    0.83
   P     0.95    0.95         1.00         0.48    0.66    0.63

Crowd Decision and Optimization Strategies

Experiment Setup
   • 3 batches:
       o 60 HITs with qualification test
       o 60 HITs without qualification test
       o 120 HITs without qualification test

Optimization strategies
   • Accuracy: compare CD to AD and optimize DSParams and threshold to maximize accuracy
   • Compare ADS to CSD and optimize DSParams, then compare CD to AD and optimize the threshold to maximize accuracy. DSParams are optimized by one of:
       o Sum-Err: minimize the sum of errors
       o Sum-log-err: minimize the sum of log of errors
       o Pearson: maximize the Pearson correlation

Results per optimization strategy (rows) and crowd decision strategy (columns):

                    3 workers    5 workers
   Optimization     MV           MV       Iter     Manual    Boost    Heur
   Accuracy         79.19        80.00    79.73    80.00     78.92    79.73
   Sum-Err          76.49        79.46    79.46    79.46     79.46    79.19
   Sum-log-err      71.89        78.11    78.38    78.92     80.27    76.76
   Pearson          73.24        79.46    79.46    80.54     79.46    81.08
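
To make the last optimization step concrete, the sketch below scores a pair with weighted per-field similarities and then picks the threshold that maximizes accuracy against the crowd decisions. It is an illustration under assumptions rather than the poster's DuplicatesScorer: the poster does not say how fields are compared, so `difflib.SequenceMatcher` stands in as a hypothetical similarity measure, the function names are invented, and only the threshold is tuned here (the field weights in DSParams are held fixed).

```python
from difflib import SequenceMatcher


def ads(pub_a, pub_b, field_weights):
    """Weighted average of per-field string similarity, in [0, 1].

    pub_a, pub_b: {fieldName: text}
    field_weights: {fieldName: fieldWeight}, the weight part of DSParams
    """
    total = sum(field_weights.values())
    score = 0.0
    for field, weight in field_weights.items():
        # Hypothetical similarity; missing fields compare as empty strings.
        sim = SequenceMatcher(None, pub_a.get(field, ""),
                              pub_b.get(field, "")).ratio()
        score += weight * sim
    return score / total


def accuracy(scores, crowd_decisions, threshold):
    """Share of pairs where AD (score >= threshold) equals CD."""
    hits = sum((s >= threshold) == bool(cd)
               for s, cd in zip(scores, crowd_decisions))
    return hits / len(scores)


def best_threshold(scores, crowd_decisions):
    """Try every observed score as a threshold and keep the most accurate."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: accuracy(scores, crowd_decisions, t))
```

Restricting the candidate thresholds to the observed scores is enough for this objective, because accuracy only changes when the threshold crosses one of them.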




Contact: Mihai Georgescu
email: georgescu@L3S.de
L3S Research Center / Leibniz Universität Hannover
Appelstrasse 4, 30167 Hannover, Germany
phone: +49 511 762-19715
dblp.kbs.uni-hannover.de
www.cubrikproject.eu

Más contenido relacionado

Destacado

CUbRIK research at CIKM 2012: Pic Alert
CUbRIK research at CIKM 2012: Pic AlertCUbRIK research at CIKM 2012: Pic Alert
CUbRIK research at CIKM 2012: Pic AlertCUbRIK Project
 
Stay at Peac in a Luxurious Hotel in Haridwar Near Ganga
Stay at Peac in a Luxurious Hotel in Haridwar Near GangaStay at Peac in a Luxurious Hotel in Haridwar Near Ganga
Stay at Peac in a Luxurious Hotel in Haridwar Near GangaLeisure Hotels
 
απειλουμενα θαλασσια οντα
απειλουμενα θαλασσια οντααπειλουμενα θαλασσια οντα
απειλουμενα θαλασσια ονταfilaretus
 
Cuotas del condominio mensual torre d 20130131
Cuotas del condominio mensual   torre d 20130131Cuotas del condominio mensual   torre d 20130131
Cuotas del condominio mensual torre d 20130131Antonio Cazorla
 
Peter and Kimble Reference017
Peter and Kimble Reference017Peter and Kimble Reference017
Peter and Kimble Reference017Ryan Dibble
 

Destacado (11)

CUbRIK research at CIKM 2012: Pic Alert
CUbRIK research at CIKM 2012: Pic AlertCUbRIK research at CIKM 2012: Pic Alert
CUbRIK research at CIKM 2012: Pic Alert
 
Stay at Peac in a Luxurious Hotel in Haridwar Near Ganga
Stay at Peac in a Luxurious Hotel in Haridwar Near GangaStay at Peac in a Luxurious Hotel in Haridwar Near Ganga
Stay at Peac in a Luxurious Hotel in Haridwar Near Ganga
 
Ppgbiotec selecao2013-resultados (2)
Ppgbiotec selecao2013-resultados (2)Ppgbiotec selecao2013-resultados (2)
Ppgbiotec selecao2013-resultados (2)
 
απειλουμενα θαλασσια οντα
απειλουμενα θαλασσια οντααπειλουμενα θαλασσια οντα
απειλουμενα θαλασσια οντα
 
Linkis
LinkisLinkis
Linkis
 
Bachelors Degree
Bachelors DegreeBachelors Degree
Bachelors Degree
 
Cuotas del condominio mensual torre d 20130131
Cuotas del condominio mensual   torre d 20130131Cuotas del condominio mensual   torre d 20130131
Cuotas del condominio mensual torre d 20130131
 
Peter and Kimble Reference017
Peter and Kimble Reference017Peter and Kimble Reference017
Peter and Kimble Reference017
 
tic
tictic
tic
 
Chetan cv
Chetan cvChetan cv
Chetan cv
 
שלי
שלישלי
שלי
 

Más de CUbRIK Project

Matching Game Mechanics and Human Computation Tasks in Games with a Purpose
Matching Game Mechanics and Human Computation Tasks in Games with a PurposeMatching Game Mechanics and Human Computation Tasks in Games with a Purpose
Matching Game Mechanics and Human Computation Tasks in Games with a PurposeCUbRIK Project
 
Humanist machine interaction with histoGraph
Humanist machine interaction with histoGraphHumanist machine interaction with histoGraph
Humanist machine interaction with histoGraphCUbRIK Project
 
histoGraph presented to MMSP 2013
histoGraph presented to MMSP 2013histoGraph presented to MMSP 2013
histoGraph presented to MMSP 2013CUbRIK Project
 
histoGraph for historians
histoGraph for historianshistoGraph for historians
histoGraph for historiansCUbRIK Project
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitiesCUbRIK Project
 
CUbRIK research on social aspects
CUbRIK research on social aspectsCUbRIK research on social aspects
CUbRIK research on social aspectsCUbRIK Project
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphCUbRIK Project
 
The CUbRIK histoGraph Factsheet
The CUbRIK histoGraph FactsheetThe CUbRIK histoGraph Factsheet
The CUbRIK histoGraph FactsheetCUbRIK Project
 
CUbRIK Fashion Trend Analysis: a Business Intelligence Application
CUbRIK Fashion Trend Analysis: a Business Intelligence ApplicationCUbRIK Fashion Trend Analysis: a Business Intelligence Application
CUbRIK Fashion Trend Analysis: a Business Intelligence ApplicationCUbRIK Project
 
CUbRIK Social Graph Visual Interface
CUbRIK Social Graph Visual InterfaceCUbRIK Social Graph Visual Interface
CUbRIK Social Graph Visual InterfaceCUbRIK Project
 
Mining Emotions in Short Films: User Comments or Crowdsourcing?
Mining Emotions in Short Films: User Comments or Crowdsourcing?Mining Emotions in Short Films: User Comments or Crowdsourcing?
Mining Emotions in Short Films: User Comments or Crowdsourcing?CUbRIK Project
 
CUbRIK and gaming experience@Qualinet


CUbRIK research at CIKM 2012: Map to Humans and Reduce Error

  • 1. Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries
    Mihai Georgescu, Dang Duc Pham, Claudiu S. Firan, Julien Gaugaz, Wolfgang Nejdl

    • Find duplicate entities based on metadata
    • Focus on scientific publications in the Freesearch system
    • An automatic method and human labelers work together towards improving their performance at identifying duplicate entities
    • Actively learn how to deduplicate from the crowd by optimizing the parameters of the automatic method
    • MTurk HITs to get labeled data, while tackling the quality issues of the crowdsourced work

    Crowdsourcing:
    Example publication pair shown to workers:

    Publication 1
      Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling
      Authors: Soraya Rana, Adele E. Howe, L. Darrell, Whitley Keith Mathias
      Venue: Proceedings of the Third International Conference on Artificial Intelligence Planning Systems, Menlo Park, CA
      Publisher: The AAAI Press
      Year: 1996
      Language: English
      Type: conference (inproceedings)
      Abstract: The choice of search algorithm can play a vital role in the success of a scheduling application. In this paper, we investigate the contribution of search algorithms in solving a real-world warehouse scheduling problem. We compare performance of three types of scheduling algorithms: heuristic, genetic algorithms and local search.

    Publication 2
      Title: Comparing Heuristic, Evolutionary and Local Search Approaches to Scheduling.
      Authors: Soraya B. Rana, Adele E. Howe, L. Darrell Whitley, Keith E. Mathias
      Book: AIPS
      Pg. 174-181
      Year: 1996
      Language: English
      Type: conference

    After carefully reviewing the publications metadata presented to you, how would you classify the 2 publications referred:
    Judgment for publications pair:
      o Duplicates
      o Not Duplicates

    • 1 HIT = 5 pairs
    • 5ct / HIT
    • 3 -> 5 assignments

    Crowd Soft Decision
    • Aggregation of all individual votes $W_{i,j}(k) \in \{-1, 1\}$
    • $CSD \in [0, 1]$

      $CSD_{i,j} = \frac{1 + \sum_{k \in W_{i,j}} weight_{i,j}(k) \, W_{i,j}(k)}{2}$

      $weight_{i,j}(k) = \frac{c_k}{\sum_{v \in W_{i,j}} c_v}$

    Worker Confidence
    • Assess how reliable the individual workers are when compared to the overall performance of the crowd
    • Simple measure: proportion of pairs that have the same label as the one assigned by the crowd
    • Use an EM algorithm to iteratively compute the worker confidence
      o Compute CSD
      o Update c_k

    Automatic Method
    • DuplicatesScorer produces an ADS
    • DSParams = {(fieldName, fieldWeight)} and threshold
    • Compare ADS to threshold => AD ∈ {1, 0}

    Crowd Decision
    • Aggregated decision from all workers for a pair produces a CSD
    • Worker contribution to the CSD is proportional to the confidence c_k we have in that worker
    • Compare CSD to 0.5 => CD ∈ {1, 0}

    Flow diagram (box labels):
    • Initial DSParams, Threshold; P_cand = ∅
    • Identify pairs with ADS = threshold ± ε; sample and add to P_cand
    • Get crowd labels for P_cand
    • Compute crowd decisions and worker confidences
    • High confidence pairs => P_train; P_cand = P_cand - P_train
    • Optimize DSParams and threshold to fit to the Crowd Decision data in P_train
    • Better DSParams, Threshold
    • Identify duplicate pairs from P_train => P_dupl

    Crowd Decision Strategies:
    • MV: Majority Voting; all users are equal, c_k = 1
    • Iter: c_k computed using the EM algorithm
    • Boost: c_k computed using the EM algorithm using boosted weights in the computation of CSD
    • Heur: Heuristic 3/3 or 4/5

    Experiment Setup
    • 3 Batches:
      o 60 HITs with qualification test
      o 60 HITs without qualification test
      o 120 HITs without qualification test

    Duplicate Detection Strategies
    • Just signatures
      o sign
    • Just the DuplicatesScorer
      o DS/m
      o DS/o
    • First compute signatures and then base decision on DuplicatesScorer
      o sign + DS/m
      o sign + DS/o
    • Directly use Crowd Decision obtained via Majority Voting
      o CD-MV

          sign   sign+DS/m   sign+DS/o   DS/m   DS/o   CD-MV
    R     0.20   0.20        0.20        0.67   0.56   0.97
    A     0.77   0.77        0.77        0.70   0.79   0.83
    P     0.95   0.95        1.00        0.48   0.66   0.63

    Crowd Decision and Optimization Strategies
    • Compare CD to AD and optimize DSParams and threshold to maximize Accuracy (Accuracy)
    • Compare ADS to CSD and optimize DSParams
      o minimize the sum of errors (Sum-Err)
      o minimize the sum of log of errors (Sum-log-err)
      o maximize the Pearson correlation (Pearson)
    • Compare CD to AD and optimize threshold to maximize Accuracy

    Optimization    3 workers   5 workers
    strategy        MV          MV      Iter    Manual  Boost   Heur
    Accuracy        79.19       80.00   79.73   80.00   78.92   79.73
    Sum-Err         76.49       79.46   79.46   79.46   79.46   79.19
    Sum-log-err     71.89       78.11   78.38   78.92   80.27   76.76
    Pearson         73.24       79.46   79.46   80.54   79.46   81.08

    Contact: Mihai Georgescu
    email: georgescu@L3S.de
    dblp.kbs.uni-hannover.de
    L3S Research Center / Leibniz Universität Hannover
    Appelstrasse 4, 30167 Hannover, Germany
    phone: +49 511 762-19715
    www.cubrikproject.eu
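
The crowd soft decision and the iterative worker-confidence estimate described on the poster can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the sample votes, and the tie handling are hypothetical, and because the poster does not spell out the EM update, the sketch assumes the "simple measure" (the share of pairs on which a worker agrees with the crowd) as the update rule for c_k.

```python
from collections import defaultdict


def crowd_soft_decision(votes, confidence):
    """Return the CSD in [0, 1] for one publication pair.

    votes: {worker_id: +1 (duplicate) or -1 (not duplicate)}
    confidence: {worker_id: c_k}
    """
    total = sum(confidence[k] for k in votes)
    if total == 0:
        return 0.5  # assumption: no trusted voter leaves the pair undecided
    # Each vote is weighted by c_k normalised over the workers who voted.
    weighted = sum(confidence[k] / total * w for k, w in votes.items())
    return (1 + weighted) / 2


def estimate_confidences(pair_votes, rounds=10):
    """Alternate between computing CSD and updating c_k (assumed update rule)."""
    workers = {k for votes in pair_votes.values() for k in votes}
    confidence = {k: 1.0 for k in workers}  # c_k = 1 is plain majority voting
    for _ in range(rounds):
        agree, seen = defaultdict(int), defaultdict(int)
        for votes in pair_votes.values():
            # CSD above 0.5 means "duplicate"; ties count as "not duplicate" (assumption).
            crowd = 1 if crowd_soft_decision(votes, confidence) > 0.5 else -1
            for k, w in votes.items():
                seen[k] += 1
                agree[k] += int(w == crowd)
        # New c_k: proportion of a worker's votes that match the crowd decision.
        confidence = {k: agree[k] / seen[k] for k in workers}
    csd = {pair: crowd_soft_decision(v, confidence) for pair, v in pair_votes.items()}
    return confidence, csd


# Hypothetical votes from three workers on three publication pairs.
pair_votes = {
    ("p1", "p2"): {"a": 1, "b": 1, "c": -1},
    ("p3", "p4"): {"a": -1, "b": -1, "c": 1},
    ("p5", "p6"): {"a": 1, "b": 1, "c": 1},
}
confidence, csd = estimate_confidences(pair_votes)
for pair, value in csd.items():
    print(pair, round(value, 3), "duplicate" if value > 0.5 else "not duplicate")
```

With all confidences fixed at 1 the CSD is simply the fraction of "duplicate" votes, which corresponds to the MV strategy. Once a worker who often disagrees with the crowd receives a lower c_k, the CSD of contested pairs moves further from 0.5.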