SIGIR 2018 · July 11th · Ann Arbor
Picture by qiyang
MOTIVATION
Experiments in IR
• Core research: “how well?”
– Input: IR systems
– Conditions: test collection
– Output: AP, P@10
• Evaluation research: “what if?”
– Input: AP, P@10
– Conditions: no. topics, stat. signif. tests
– Output: p-values, Kendall τ
Current Limitations
1. Finite data
– Limitation: limited to dozens of systems and topics from past evaluations like TREC
– How we make do: artificially create other collections by resampling from the existing data
2. Unknown truth
– Limitation: we don’t know the true properties of systems, such as their mean or variance over topics
– How we make do: split the topics into two sets and consider the results with one subset as the truth
3. Lack of control
– Limitation: we can’t control properties of systems, such as their true mean; systems are how they are
– How we make do: artificial modifications of effectiveness scores that lead to invalid data
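The resampling workaround for finite data can be sketched as follows. This is only an illustration: it assumes scores are stored as one row per topic and one column per system, and both the score values and the helper `resample_topics` are hypothetical, not taken from the talk.

```python
import random

def resample_topics(scores, n_topics, rng):
    """Draw n_topics rows (topics) with replacement from a topics x systems table."""
    return [rng.choice(scores) for _ in range(n_topics)]

# Hypothetical AP scores: 4 topics (rows) x 2 systems (columns).
scores = [
    [0.31, 0.28],
    [0.55, 0.61],
    [0.12, 0.09],
    [0.47, 0.40],
]

rng = random.Random(0)
new_collection = resample_topics(scores, n_topics=10, rng=rng)

# Every "new" topic is a copy of an observed one, so no new data is created.
only_copies = all(row in scores for row in new_collection)
```

The resampled collection can be larger than the original, but it only ever contains the topics already observed.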
PROPOSAL
Stochastic Simulation
[Diagram: core research (IR systems, test collection) produces AP and P@10 scores; a Model fitted to them produces new AP and P@10 scores for evaluation research (no. topics, stat. signif. tests; p-values, Kendall τ)]
• Build a generative model of the joint distribution of system scores
• Simulate scores on new, random topics
– Addresses lack of data, unknown truth, and lack of control
• Fit the model to existing data to make it realistic
• Needs to be flexible to model real data
Model
• A model of the joint distribution of system scores
• We use copula models, which separate:
1. Marginal distributions, of individual systems
2. Dependence structure, among systems
• Easy to customize: plug and play, then simulate
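The separation into margins and dependence structure is the one given by Sklar's theorem. As a reminder, with notation that is assumed here rather than taken from the slides ($F$ for the joint distribution of the scores of $m$ systems, $F_j$ for the margin of system $j$, and $C$ for the copula):

```latex
F(x_1, \dots, x_m) = C\bigl(F_1(x_1), \dots, F_m(x_m)\bigr)
```

Changing a margin $F_j$ leaves $C$ untouched, and vice versa, which is what makes the model "plug and play".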
• Fit the model*, given scores $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$:
1. Fit the margins
2. Turn to pseudo-observations: $U_i = F_X(X_i)$ and $V_i = F_Y(Y_i)$
3. Fit the copula
• Or instantiate at will
• Simulate from the model:
1. Generate pseudo-observations $U$ and $V$
2. Turn into effectiveness scores: $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(V)$
* generalizes to several systems
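The fit and simulate steps can be sketched for two systems as follows. This is a minimal illustration under stated assumptions, not the method used in the talk: it swaps in a bivariate Gaussian copula and plain normal margins (which ignore the [0, 1] bounds) to stay within the Python standard library, and the scores `x` and `y` and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

STD = NormalDist()  # standard normal, used by the Gaussian copula

def clamp(p, eps=1e-12):
    """Keep probabilities strictly inside (0, 1) so inv_cdf is defined."""
    return min(max(p, eps), 1.0 - eps)

def fit_margin(xs):
    """Fit step 1: fit a margin (assumption: a plain normal distribution)."""
    return NormalDist(mean(xs), stdev(xs))

def fit_gaussian_copula(us, vs):
    """Fit step 3: correlation of the normal scores of the pseudo-observations."""
    zu = [STD.inv_cdf(clamp(u)) for u in us]
    zv = [STD.inv_cdf(clamp(v)) for v in vs]
    num = sum(a * b for a, b in zip(zu, zv))
    den = math.sqrt(sum(a * a for a in zu) * sum(b * b for b in zv))
    return num / den

def simulate(n, fx, fy, rho, rng):
    """Simulate: draw pseudo-observations (U, V), then invert the margins."""
    scores = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u, v = clamp(STD.cdf(z1)), clamp(STD.cdf(z2))
        scores.append((fx.inv_cdf(u), fy.inv_cdf(v)))
    return scores

# Hypothetical scores of two systems on the same 8 topics.
x = [0.21, 0.35, 0.40, 0.18, 0.52, 0.33, 0.47, 0.29]
y = [0.25, 0.30, 0.46, 0.15, 0.49, 0.38, 0.41, 0.31]

fx, fy = fit_margin(x), fit_margin(y)        # 1. fit the margins
us = [fx.cdf(xi) for xi in x]                # 2. pseudo-observations U_i = F_X(X_i)
vs = [fy.cdf(yi) for yi in y]                #    and V_i = F_Y(Y_i)
rho = fit_gaussian_copula(us, vs)            # 3. fit the copula
new_scores = simulate(1000, fx, fy, rho, random.Random(0))
```

Because the margins and the copula are separate objects, either can be replaced (for example, a different `fx`) without refitting the other.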
Modeling Dependences
• Gaussian copulas
– Only correlation
– Only symmetric
• R-Vine copulas
– Allows tail dependence
– Allows asymmetry
– Built from pair-copulas (bivariate)
– E.g.: F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), …
– ~40 alternatives based on 12 different families
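To illustrate what a non-Gaussian pair-copula adds, the sketch below samples from a Clayton copula, which has lower tail dependence $2^{-1/\theta}$ and Kendall $\tau = \theta/(\theta+2)$. The Clayton family is chosen here only as a well-known example; the slides do not say which 12 families were used, and the function names are hypothetical.

```python
import random

def rclayton(n, theta, rng):
    """Sample n pairs (u, v) from a Clayton copula by conditional inversion."""
    pairs = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
        w = 1.0 - rng.random()
        v = ((u ** (-theta)) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs

def kendall_tau(pairs):
    """Plain O(n^2) Kendall tau; tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    n = len(pairs)
    for i in range(n):
        ui, vi = pairs[i]
        for j in range(i + 1, n):
            s = (ui - pairs[j][0]) * (vi - pairs[j][1])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

theta = 2.0
sample = rclayton(800, theta, random.Random(1))
tau_hat = kendall_tau(sample)          # empirical Kendall tau of the sample
tau_theory = theta / (theta + 2.0)     # theoretical value for the Clayton family
```

A Gaussian copula with the same Kendall τ would have no tail dependence, so very low scores would co-occur less often than in this sample.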
Modeling Margins
• All effectiveness measures have discrete distributions, but for some we can fairly assume they’re continuous
– AP, nDCG
• For some others, this assumption is clearly wrong, so we must preserve the support
– P@10: 1, 0.9, 0.8, …
– RR: 1, 1/2, 1/3, …
Modeling Margins
Continuous
• (Truncated) Normal
• Beta
• (Truncated) Normal Kernel Smoothing
• Beta Kernel Smoothing
Discrete
• Beta-Binomial
• Discrete Kernel Smoothing
• Discrete Kernel Smoothing w/ controlled smoothness
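As an example of a discrete margin that preserves the support, the sketch below draws P@10 scores from a Beta-Binomial distribution. The parameter values and the function name are hypothetical, chosen only for illustration.

```python
import random

def rbetabinom_p_at_k(n, k, alpha, beta, rng):
    """Draw n P@k scores: p ~ Beta(alpha, beta), hits ~ Binomial(k, p), score = hits / k."""
    scores = []
    for _ in range(n):
        p = rng.betavariate(alpha, beta)
        hits = sum(rng.random() < p for _ in range(k))
        scores.append(hits / k)
    return scores

# Hypothetical parameters; the expected score is alpha / (alpha + beta) = 0.4.
scores = rbetabinom_p_at_k(1000, k=10, alpha=2.0, beta=3.0, rng=random.Random(7))

# The support of P@10: 0, 0.1, ..., 1.
support = {i / 10 for i in range(11)}
on_support = set(scores) <= support
```

A continuous margin such as a Beta would instead produce values like 0.437, which P@10 can never take.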
Transform to Predefined Mean
Problem: given $F$, transform to $\tilde{F}$ such that $\tilde{\mu} = \mu^*$, preserving the support
Solution: transform with a specific Beta
find $\alpha, \beta > 1$ such that $\tilde{\mu} = \mu^*$
where $\tilde{F}(x) = F_{Beta}(F(x); \alpha, \beta)$
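To see why the support is preserved, consider a discrete margin with support $x_1 < \dots < x_m$ and write $\tilde{F}$ for the transformed distribution and $\tilde{\mu}$ for its mean (notation assumed here). Because $F_{Beta}(0; \alpha, \beta) = 0$ and $F_{Beta}(1; \alpha, \beta) = 1$, the transformed function is again a distribution function on the same points, and its mean is:

```latex
\tilde{\mu}(\alpha, \beta)
  = \sum_{k=1}^{m} x_k \left[ \tilde{F}(x_k) - \tilde{F}(x_{k-1}) \right],
\qquad
\tilde{F}(x_k) = F_{Beta}\bigl(F(x_k); \alpha, \beta\bigr),
\quad
\tilde{F}(x_0) := 0
```

Only the probability mass at each $x_k$ changes, so searching for $\alpha, \beta$ with $\tilde{\mu}(\alpha, \beta) = \mu^*$ never moves a score off the support.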
RESULTS
Data
• TREC Web Ad hoc runs 2010-2014
– 50 topics and 30-88 systems each
– 12924 total system-topic pairs
• Continuous measures: AP, nDCG@20, ERR@20
• Discrete measures: P@10, P@20, RR
• Points of Interest
1. Margins
2. Copulas
3. Simulated scores
1. Margins
• 1572 system-measure pairs
• 5425 models successfully fitted
• Log-Likelihood:
– Kernel Smoothing (esp. discrete)
– Normal & Beta 25% of cases
• AIC and BIC:
– Normal & Beta 67% of cases
– Beta-Binomial 50% of P@k
• Transform all to the mean in the given data and select again:
– Kernel Smoothing nearly always
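The three selection criteria can be sketched as below: log-likelihood rewards fit alone, while AIC ($2k - 2\log L$) and BIC ($k \ln n - 2\log L$) also penalize the number of parameters $k$. The scores are hypothetical, and the fits are simplifications assumed for brevity: a plain (not truncated) normal fitted by maximum likelihood and a Beta fitted by the method of moments.

```python
import math
from statistics import NormalDist, mean, pstdev, variance

def aic(loglik, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion (lower is better)."""
    return k * math.log(n) - 2 * loglik

def normal_loglik(xs):
    d = NormalDist(mean(xs), pstdev(xs))  # maximum-likelihood estimates
    return sum(math.log(d.pdf(x)) for x in xs)

def beta_loglik(xs):
    m, v = mean(xs), variance(xs)
    c = m * (1 - m) / v - 1               # method of moments, not maximum likelihood
    a, b = m * c, (1 - m) * c
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return sum((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm for x in xs)

# Hypothetical scores of one system on 10 topics.
xs = [0.12, 0.25, 0.31, 0.08, 0.44, 0.19, 0.27, 0.36, 0.15, 0.22]

# Each candidate margin: (log-likelihood, number of parameters).
models = {"normal": (normal_loglik(xs), 2), "beta": (beta_loglik(xs), 2)}

best_by_loglik = max(models, key=lambda name: models[name][0])
best_by_aic = min(models, key=lambda name: aic(*models[name]))
best_by_bic = min(models, key=lambda name: bic(*models[name], len(xs)))
```

With equal parameter counts, as here, all three criteria agree. They can differ once a flexible model such as kernel smoothing, with more effective parameters, enters the comparison.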
2. Dependence
• 39627 system pairs
• Fit pair-copulas and select according to Log-Likelihood
• Wide diversity
• Gaussian copulas rarely selected; correlation is not enough
• Complex models are preferred
3. Simulation: Scores
• Simulate 1000 new topics and record deviations from the model
– $\mu - \bar{X}$ and $\sigma^2 - s^2$
• Repeat 1000 times
• Full knowledge of truth encoded in the model
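The check described above can be sketched as follows: because the model is known, the true $\mu$ and $\sigma^2$ are known, and the deviations of the sample statistics can be recorded directly. This is a simplified stand-in, assuming a single normal margin with hypothetical parameters and 200 repetitions instead of 1000 to keep it fast.

```python
import random
from statistics import mean, variance

MU, SIGMA = 0.3, 0.1  # hypothetical "true" mean and standard deviation of the model

def one_trial(n_topics, rng):
    """Simulate n_topics scores and return the deviations (mu - mean, sigma^2 - s^2)."""
    xs = [rng.gauss(MU, SIGMA) for _ in range(n_topics)]
    return MU - mean(xs), SIGMA ** 2 - variance(xs)

rng = random.Random(42)
deviations = [one_trial(1000, rng) for _ in range(200)]

# If simulation is faithful to the model, both deviations average out to about zero.
mean_dev_mu = mean(d[0] for d in deviations)
mean_dev_var = mean(d[1] for d in deviations)
```

With real test collections this comparison is impossible, since the true mean and variance of a system are unknown.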
3. Simulation: Dependencies
• Web 2010, nDCG@20
• Simulate 500 new topics
• Dependence captured in the model
SAMPLE APPLICATIONS
1. Type I Errors? [Voorhees, 2009]
[Diagram: systems S1 and S2 are compared twice, giving $D_1$ + p-value and $D_2$ + p-value; do the two results conflict?]
• Limited data
• Unknown truth (is H0 true?)
• No control over H0
• Cannot measure Type I error rates directly
• Conflict rates at α=5%:
– AP: 2.8%
– P@10: 10.9%
1. Type I Errors [With simulation]
[Diagram: each simulated comparison gives a p-value; is it a Type I error?]
Same margins
• Type I errors at α=5% and 1%:
– AP: 4.9% and 0.9%
– P@10: 4.9% and 1%
Transformed margins
• Type I errors at α=5% and 1%:
– AP: 4.9% and 0.9%
– P@10: 5% and 1%
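The "same margins" experiment can be sketched as below: both systems share one margin, so H0 (equal means) is true by construction and every significant result is a Type I error. The assumptions are loud ones: a hypothetical P@10 margin, a Gaussian copula with ρ = 0.5, 50 topics, and a paired t-test at α=5%; the slides do not state which test or copula produced the reported rates.

```python
import bisect
import itertools
import math
import random
from statistics import NormalDist, mean, stdev

SUPPORT = [i / 10 for i in range(11)]  # P@10 support: 0, 0.1, ..., 1
# Hypothetical probability mass on each support point (sums to 1).
PMF = [0.05, 0.10, 0.15, 0.15, 0.15, 0.10, 0.10, 0.08, 0.06, 0.04, 0.02]
CDF = list(itertools.accumulate(PMF))
STD = NormalDist()
T_CRIT = 2.0096  # two-sided 5% critical value of Student's t with 49 d.f.

def inv_cdf(u):
    """Inverse CDF of the discrete margin: smallest support point with CDF >= u."""
    return SUPPORT[min(bisect.bisect_left(CDF, u), len(SUPPORT) - 1)]

def simulate_pair(n_topics, rho, rng):
    """Scores of two systems with the same margin, linked by a Gaussian copula."""
    xs, ys = [], []
    for _ in range(n_topics):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xs.append(inv_cdf(STD.cdf(z1)))
        ys.append(inv_cdf(STD.cdf(z2)))
    return xs, ys

def significant(xs, ys):
    """Paired t-test at alpha = 5% for n = 50 topics."""
    d = [x - y for x, y in zip(xs, ys)]
    s = stdev(d)
    if s == 0:
        return False  # all differences identical: no evidence against H0
    return abs(mean(d) / (s / math.sqrt(len(d)))) > T_CRIT

rng = random.Random(2018)
trials = 2000
type1_rate = sum(significant(*simulate_pair(50, 0.5, rng)) for _ in range(trials)) / trials
```

Unlike a conflict rate between two topic subsets, this rate is measured against a known truth, so it can be compared directly with the nominal α.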
2. Sneak Peek: Statistical Power and σ [Webber et al, 2008]
• Show empirical evidence of the problem of sequential testing
• Limited data
• Unknown truth (true σ)
TAKE HOME
Today
• Part of Evaluation Research has data-related limitations
– Lack of data, no knowledge of truth, no control
– How valid are our results?
• We propose a methodology for stochastic simulation to eliminate these limitations
– Flexible, realistic, highly customizable
– Allows us to study new problems, directly
Tomorrow
• Even more flexibility
• Simulate new systems for given topics
• Add third factors
– Fixed: already possible
– Random: we’ll see
• Simulate full runs (doc scores & relevance)
simIReff
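• All results fully reproducible
• Developed a full R package for simulation
https://github.com/julian-urbano/simIReff
library(simIReff)
effs <- effDiscFitAndSelect(data, support("p20"))  # fit and select discrete margins (P@20 support)
cop <- effcopFit(data, effs)                        # fit the copula, given the margins
y <- reffcop(1000, cop)                             # simulate scores on 1000 new topics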

Stochastic Simulation of Test Collections: Evaluation Scores
