Overview of the NTCIR-15
We Want Web with CENTRE (WWW-3) Task
December 9, 2020@NTCIR-15 (virtual conference)
Web Search is not a solved problem!
• Are we making progress?
(Example: does deep
learning-based reranking
really outperform a
properly-tuned BM25 for
any query?)
• Can we
replicate/reproduce the
findings? (same method,
same/different data,
different research groups)
TALK OUTLINE
• Chinese subtask
• English subtask
• CENTRE
• Summary
• NTCIR-16 WWW-4
Chinese subtask definition
• Input: 80 WWW-2 topics and 80 new WWW-3
topics (participants had access to the original qrels
for WWW-2)
• Output: TREC-style run file
• Target corpus: SogouT-16
• All runs were pooled and relevance assessments
were conducted for 80 WWW-3 new topics
• Runs are also scored based on the 80 WWW-3
topics
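As a rough illustration of the run-file and pooling bullets above, the sketch below parses the common six-column TREC run layout (topic ID, the literal Q0, document ID, rank, score, run tag) and forms a pool from the top documents of each run. The topic ID, document IDs, run tags and the pool depth of 2 are made-up values for the example, not figures from the task.

```python
from collections import defaultdict

def parse_run(lines):
    """Parse TREC-style run lines into {topic_id: [doc_id, ...]} ordered by rank."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank lines and anything that is not a result line
        topic_id, _q0, doc_id, rank, _score, _tag = parts
        ranked[topic_id].append((int(rank), doc_id))
    return {t: [d for _, d in sorted(pairs)] for t, pairs in ranked.items()}

def pool(runs, depth):
    """Union of the top-`depth` documents of every run, per topic."""
    pooled = defaultdict(set)
    for run in runs:
        for topic_id, docs in run.items():
            pooled[topic_id].update(docs[:depth])
    return dict(pooled)

# Hypothetical run lines, for illustration only.
run_a = parse_run([
    "0101 Q0 doc-a 1 12.5 demo-run-1",
    "0101 Q0 doc-b 2 11.0 demo-run-1",
    "0101 Q0 doc-c 3 9.7 demo-run-1",
])
run_b = parse_run([
    "0101 Q0 doc-b 1 3.2 demo-run-2",
    "0101 Q0 doc-d 2 3.1 demo-run-2",
])
pooled = pool([run_a, run_b], depth=2)
```

Documents in the pool are the ones that would be sent to relevance assessors; anything outside the pool stays unjudged.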
Topics
• The 80 queries were sampled from one day of Sogou’s
query logs in August 2018; they comprise 54 torso
queries, 13 tail queries and 13 hot queries.
Runs and qrels
• 11 runs from 3 teams (including the organisers’
baseline) were submitted and pooled
Official results (nDCG and Q)
Official results (nERR and iRBU)
Randomised Tukey HSD test results (nDCG and Q)
[Figure: "outperforms" relationships between runs]
Randomised Tukey HSD test results (nERR and iRBU)
[Figure: "outperforms" relationships between runs]
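For readers unfamiliar with the randomised Tukey HSD test named in these slide titles, here is a minimal sketch of the usual procedure: shuffle each topic's scores across systems, record the largest difference between system means, and count how often it reaches the observed difference for each pair. The toy score matrix, the 2000 trials and the seed are assumptions for illustration; the slides do not give these settings.

```python
import random

def randomised_tukey_hsd(scores, trials=2000, seed=0):
    """scores[t][s] is the score of system s on topic t.

    Returns {(i, j): p_value} for every system pair with i < j.
    """
    rng = random.Random(seed)
    n_topics = len(scores)
    n_systems = len(scores[0])

    def column_means(matrix):
        return [sum(row[s] for row in matrix) / n_topics for s in range(n_systems)]

    observed = column_means(scores)
    pairs = [(i, j) for i in range(n_systems) for j in range(i + 1, n_systems)]
    counts = {pair: 0 for pair in pairs}

    for _ in range(trials):
        # Under the null hypothesis all systems are equivalent,
        # so each topic's row of scores can be shuffled freely.
        permuted = [rng.sample(row, n_systems) for row in scores]
        means = column_means(permuted)
        max_diff = max(means) - min(means)
        for i, j in pairs:
            if max_diff >= abs(observed[i] - observed[j]):
                counts[(i, j)] += 1
    return {pair: counts[pair] / trials for pair in pairs}

# Toy data: systems 0 and 1 are identical, system 2 is clearly better.
scores = [[0.2 + 0.01 * t, 0.2 + 0.01 * t, 0.8 + 0.01 * t] for t in range(10)]
p_values = randomised_tukey_hsd(scores)
```

Comparing every pair against the largest difference in each trial is what keeps the familywise error rate under control when many runs are compared at once.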
English subtask definition
• Input: 80 WWW-2 topics and 80 new WWW-3
topics (participants had access to the original qrels
for WWW-2)
• Output: TREC-style run file
• Target corpus: ClueWeb12-B13
• All runs were pooled and relevance assessments
were conducted for all 160 topics
• Runs are scored based on the 80 WWW-3 topics
The original plan with a REV run
(a revived system from NTCIR-14)
• Replicability: compare a repli run with a REV run on
the WWW-2 topics
• Reproducibility: compare a repro run effectiveness
on the WWW-3 topics with a REV run effectiveness
on the WWW-2 topics
• Progress: compare new runs and a REV run (SOTA
from NTCIR-14) on the WWW-3 topics
But unfortunately, we could not obtain a reliable
REV run that represents the SOTA from NTCIR-14
on the NTCIR-15 WWW-3 topics.
Runs and qrels
• 37 runs from 9 teams (including the organisers’
baseline) were submitted and pooled
Official top 10 runs (nDCG and Q)
Official top 10 runs (nERR and iRBU)
Randomised Tukey HSD test results (nDCG and Q) – top runs only
[Figure: "outperforms" relationships between runs]
Randomised Tukey HSD test results (nERR and iRBU) – top runs only
[Figure: "outperforms" relationships between runs]
Replicability and Reproducibility
Terminology
“An experimental result is not fully established unless
it can be independently reproduced.”
OLD ACM Terminology (Version 1.0):
• Replicability: Different team, same experimental
setup
• Reproducibility: Different team, different
experimental setup
With the new ACM terminology (Version 1.1)
replicability and reproducibility are swapped!
Version 1.0: https://www.acm.org/publications/policies/artifact-review-badging
Version 1.1: https://www.acm.org/publications/policies/artifact-review-and-badging-current
Replicability Measures
• Ranking: Kendall’s τ and RBO
• Absolute per-topic effectiveness: RMSE_abs
• Statistical approach: p-value of paired t-test
• Effect over a baseline: RMSE_Δ, Effect Ratio (ER_repli) and
Delta Relative Improvement (ΔRI_repli)
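For reference, the effectiveness-based measures listed here can be written out as follows. This is a sketch of the definitions as given in [Breuer+ SIGIR2020], with notation chosen for this note: M_j(r) is the score of run r on topic j, n is the number of topics, a and b are the original advanced and baseline runs, and primed symbols refer to the replicated runs.

```latex
\begin{align*}
\mathrm{RMSE}_{\mathrm{abs}} &= \sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(M_j(r) - M_j(r')\bigr)^2} \\
\Delta M_j &= M_j(a) - M_j(b), \qquad \Delta M'_j = M_j(a') - M_j(b') \\
\mathrm{RMSE}_{\Delta} &= \sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(\Delta M_j - \Delta M'_j\bigr)^2} \\
\mathrm{ER} &= \frac{\frac{1}{n}\sum_{j=1}^{n}\Delta M'_j}{\frac{1}{n}\sum_{j=1}^{n}\Delta M_j} \\
\mathrm{RI} &= \frac{\overline{M}(a) - \overline{M}(b)}{\overline{M}(b)}, \qquad
\mathrm{RI}' = \frac{\overline{M}(a') - \overline{M}(b')}{\overline{M}(b')}, \qquad
\Delta\mathrm{RI} = \mathrm{RI} - \mathrm{RI}'
\end{align*}
```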
Reproducibility measures
• Statistical approach: p-value of unpaired t-test (instead of the paired t-test)
Replicability & Reproducibility
Runs
• Target Runs submitted at WWW-2:
• Advanced: THUIR-E-CO-MAN-Base2 (LambdaMART)
• Baseline: THUIR-E-CO-PU-Base4 (BM25)
• Replicability and Reproducibility runs submitted at
WWW-3:
• Advanced: KASYS-E-CO-REP-2 and SLWWW-E-CO-REP-4
• Baseline: KASYS-E-CO-REP-3
• Replicability: WWW-2 qrels and topics;
• Reproducibility: WWW-2 qrels and topics
compared against WWW-3 qrels and topics.
Replicability recap
[Diagram: WWW-2 runs and WWW-3 runs set against WWW-2 topics and WWW-3 topics; each pair of runs consists of an A-run (advanced) and a B-run (baseline), with the Effect between them]
Replicability Results: Ranking of
Documents
• Kendall’s τ and RBO: computed between the original
ranking of documents and the replicated ranking;
• The closer to 1 the better the replicated run;
• Scores close to 0 mean that the original and replicated
runs are not correlated;
• It is extremely hard to obtain the same list of
documents!
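The two ranking measures above can be sketched as follows. Two simplifications are assumptions of this sketch, not statements about how the official scores were computed: Kendall's τ is taken only over documents that appear in both rankings, and RBO is the extrapolated form for two equal-length, duplicate-free rankings with persistence p = 0.9. The document IDs are made up.

```python
from itertools import combinations

def kendall_tau(original, replicated):
    """Kendall's tau over the documents that appear in both rankings."""
    pos = {doc: i for i, doc in enumerate(replicated)}
    common = [doc for doc in original if doc in pos]
    if len(common) < 2:
        raise ValueError("need at least two shared documents")
    concordant = discordant = 0
    for a, b in combinations(common, 2):
        # a precedes b in the original; check the order in the replicated run.
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def rbo_ext(original, replicated, p=0.9):
    """Extrapolated rank-biased overlap for two equal-length rankings."""
    if len(original) != len(replicated):
        raise ValueError("this sketch assumes equal-length rankings")
    depth = len(original)
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for d in range(1, depth + 1):
        seen_a.add(original[d - 1])
        seen_b.add(replicated[d - 1])
        overlap = len(seen_a & seen_b)  # shared documents in the top d
        weighted_sum += (overlap / d) * p ** d
    return (overlap / depth) * p ** depth + ((1 - p) / p) * weighted_sum

# Hypothetical rankings for one topic.
original = ["d1", "d2", "d3", "d4"]
same = list(original)
reverse = list(reversed(original))
disjoint = ["x1", "x2", "x3", "x4"]
```

With these inputs an identical ranking scores 1 on both measures, a reversed ranking gets τ = -1, and a ranking with no shared documents gets an RBO of 0.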
Replicability Results: RMSE and p-values
• RMSE: the closer to 0 the better;
• p-value: a small p-value means that the runs are
significantly different (without specifying whether
they are better or not);
• Results: large RMSEs and very small p-values
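A minimal sketch of the two quantities on this slide, computed from per-topic scores of an original run and its replication. The scores are invented. The function returns the paired t statistic and its degrees of freedom; the p-value would then be read from the t distribution (for instance via a statistics package, which this sketch deliberately does not depend on).

```python
import math
from statistics import mean, stdev

def rmse(original, replicated):
    """Root mean square error between per-topic scores of two runs."""
    if len(original) != len(replicated):
        raise ValueError("runs must be scored on the same topics")
    squared = [(o - r) ** 2 for o, r in zip(original, replicated)]
    return math.sqrt(mean(squared))

def paired_t_statistic(original, replicated):
    """t statistic and degrees of freedom of a paired t-test on per-topic scores."""
    diffs = [o - r for o, r in zip(original, replicated)]
    n = len(diffs)
    # stdev is the sample standard deviation; it is zero if all diffs are equal.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-topic scores on four topics.
original_scores = [0.50, 0.60, 0.70, 0.40]
replicated_scores = [0.40, 0.55, 0.50, 0.35]
error = rmse(original_scores, replicated_scores)
t_stat, dof = paired_t_statistic(original_scores, replicated_scores)
```

The pairing is possible because both runs are scored on the same topics, which is exactly the replicability setting.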
Replicability Results: Effect over a
Baseline
• Implications of ER scores:
• ER ≤ 0: Failed replication, A-run failed to outperform the
B-run;
• 0 < ER < 1: Somewhat successful, the replicated effect is
smaller compared to the original effect;
• ER = 1: Perfect replication;
• ER > 1: Successful replication, the replicated effect is
larger compared to the original effect.
• Similar interpretation of ΔRI but 0 is the perfect
replication;
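The ER and ΔRI interpretations above can be made concrete with a small sketch. It assumes the usual definitions (ER is the mean replicated per-topic effect divided by the mean original effect; ΔRI is the original relative improvement minus the replicated one); the scores and the wording of the returned labels are made up for illustration.

```python
import math
from statistics import mean

def effect_ratio(orig_a, orig_b, rep_a, rep_b):
    """ER: mean replicated per-topic effect divided by mean original effect."""
    original_effect = mean(a - b for a, b in zip(orig_a, orig_b))
    replicated_effect = mean(a - b for a, b in zip(rep_a, rep_b))
    return replicated_effect / original_effect

def delta_ri(orig_a, orig_b, rep_a, rep_b):
    """Delta RI: original relative improvement minus the replicated one."""
    ri_original = (mean(orig_a) - mean(orig_b)) / mean(orig_b)
    ri_replicated = (mean(rep_a) - mean(rep_b)) / mean(rep_b)
    return ri_original - ri_replicated

def interpret_er(er):
    """Map an ER score to the cases listed on the slide."""
    if er <= 0:
        return "failed replication"
    if math.isclose(er, 1.0):
        return "perfect replication"
    if er < 1:
        return "replicated effect smaller than original"
    return "replicated effect larger than original"

# Hypothetical per-topic scores for the A-run and B-run.
orig_a, orig_b = [0.6, 0.7, 0.8], [0.4, 0.5, 0.6]
rep_a, rep_b = [0.5, 0.6, 0.7], [0.4, 0.5, 0.6]
er = effect_ratio(orig_a, orig_b, rep_a, rep_b)
dri = delta_ri(orig_a, orig_b, rep_a, rep_b)
```

Here the replicated A-run still beats the replicated B-run, but by half the original margin, so ER is about 0.5 and ΔRI is positive.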
Reproducibility recap
[Diagram: WWW-2 runs and WWW-3 runs set against WWW-2 topics and WWW-3 topics; each pair of runs consists of an A-run (advanced) and a B-run (baseline), with the Effect between them]
Reproducibility Results: p-values
and Effects over a Baseline
• Recall that there is no target original run;
• Reproducibility is even harder than replicability!
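Because the original and reproduced runs are scored on different topic sets, per-topic scores cannot be paired, so an unpaired test is the natural choice. The sketch below computes Welch's form of the unpaired t statistic, which does not assume equal variances; whether the official analysis used this form or the pooled-variance form is not stated in the slides, so treat that choice as an assumption. The scores are invented.

```python
import math
from statistics import mean, variance

def welch_t_statistic(scores_x, scores_y):
    """Welch's unpaired t statistic and its approximate degrees of freedom."""
    n_x, n_y = len(scores_x), len(scores_y)
    # Squared standard errors of the two sample means.
    se2_x = variance(scores_x) / n_x
    se2_y = variance(scores_y) / n_y
    t = (mean(scores_x) - mean(scores_y)) / math.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    return t, dof

# Hypothetical per-topic scores: original run on four topics,
# reproduced run on five different topics.
original_scores = [0.5, 0.6, 0.7, 0.4]
reproduced_scores = [0.3, 0.4, 0.5, 0.2, 0.35]
t_stat, dof = welch_t_statistic(original_scores, reproduced_scores)
```

The two samples may have different sizes, which the paired test used for replicability cannot handle.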
Summary
• Chinese subtask (only 3 teams)
Best run: RUCIR-C-CD-NEW-4
• English subtask (9 teams)
Best runs: KASYS-E-CO-NEW-{1,4} and
mpii-E-CO-NEW-1. KASYS uses a BERT-based method from
[Yilmaz+ EMNLP2019].
• CENTRE:
We need a community effort since replicability and
reproducibility are very tough problems!
Thank you participants!
And many thanks to the NTCIR PC chairs, GCs, and staff!
WWW will be back
(IF our task proposal is accepted)
• English subtask only
• New English corpus! (Common Crawl?)
• New target for replicability, reproducibility, and a
baseline for progress:
University of Tsukuba’s BERT-based run from WWW-3
• Topics to be released in October 2021
• Run submission deadline in November 2021
• Please follow @ntcirwww on Twitter!
Selected references
[Breuer+ SIGIR2020] How to Measure the Reproducibility
of System-oriented IR Experiments, ACM SIGIR 2020.
(More about CENTRE)
[Sakai+ TOIS2020] Retrieval Evaluation Measures that
Agree with Users' SERP Preferences: Traditional,
Preference-based, and Diversity Measures, ACM TOIS
39(2), to appear, 2020.
(Evaluation measures including iRBU)
[Yilmaz+ EMNLP2019] Cross-Domain Modeling of
Sentence-Level Evidence for Document Retrieval, EMNLP
2019.
(University of Tsukuba’s top run is based on this)

 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

NTCIR15WWW3overview

  • 1. Overview of the NTCIR-15 We Want Web with CENTRE (WWW-3) Task December 9, 2020@NTCIR-15 (virtual conference)
  • 2. Web Search is not a solved problem! • Are we making progress? (Example: does deep learning-based reranking really outperform a properly-tuned BM25 for any query?) • Can we replicate/reproduce the findings? (same method, same/different data, different research groups)
  • 3. TALK OUTLINE • Chinese subtask • English subtask • CENTRE • Summary • NTCIR-16 WWW-4
  • 4. Chinese subtask definition • Input: 80 WWW-2 topics and 80 new WWW-3 topics (participants had access to the original qrels for WWW-2) • Output: TREC-style run file • Target corpus: SogouT-16 • All runs were pooled and relevance assessments were conducted for the 80 new WWW-3 topics • Runs are also scored based on the 80 WWW-3 topics
  • 5. Topics • The 80 queries were sampled from one day of Sogou’s query logs in August 2018; they comprise 54 torso queries, 13 tail queries and 13 hot queries.
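As an illustration of the "TREC-style run file" output, here is a minimal sketch of a parser. It assumes the conventional six-column TREC layout (`topic_id Q0 doc_id rank score run_tag`); the slide does not say whether WWW-3 imposed further constraints. The topic IDs, document IDs and run tag in the sample are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_trec_run(text: str) -> Dict[str, List[Tuple[str, int, float]]]:
    """Parse lines of the form: topic_id Q0 doc_id rank score run_tag."""
    run = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic_id, _q0, doc_id, rank, score, _run_tag = parts
        run[topic_id].append((doc_id, int(rank), float(score)))
    # Order each topic's documents by the rank column.
    for docs in run.values():
        docs.sort(key=lambda d: d[1])
    return dict(run)


# Hypothetical sample data, for illustration only.
sample = """\
0101 Q0 doc-b 2 11.5 DEMO-RUN
0101 Q0 doc-a 1 12.0 DEMO-RUN
0102 Q0 doc-c 1 9.25 DEMO-RUN
"""
run = parse_trec_run(sample)
print(run["0101"])
```

Grouping by topic first keeps the later per-topic evaluation steps simple, since every measure in this talk is computed topic by topic.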
  • 6. Runs and qrels • 11 runs from 3 teams (including the organisers’ baseline) were submitted and pooled
  • 9. Randomised Tukey HSD test results (nDCG and Q) [Figure: pairs of runs linked by an OUTPERFORMS relation]
  • 10. Randomised Tukey HSD test results (nERR and iRBU) [Figure: pairs of runs linked by an OUTPERFORMS relation]
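The randomised Tukey HSD test named on these slides can be sketched as follows. This is a minimal illustration of the general technique, not the organisers' tool: in each trial the per-topic scores are shuffled across systems, and the largest difference between system means is compared with each observed pairwise difference. The system names, scores, trial count and seed are hypothetical.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def randomised_tukey_hsd(
    scores: Dict[str, List[float]], trials: int = 1000, seed: int = 0
) -> Dict[Tuple[str, str], float]:
    """Return a p-value per system pair; scores[s][t] is system s on topic t."""
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    rng = random.Random(seed)

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs)

    observed = {
        (a, b): abs(mean(scores[a]) - mean(scores[b]))
        for a, b in combinations(systems, 2)
    }
    counts = {pair: 0 for pair in observed}
    for _ in range(trials):
        totals = [0.0] * len(systems)
        for t in range(n_topics):
            row = [scores[s][t] for s in systems]
            rng.shuffle(row)  # permute this topic's scores across systems
            for i, value in enumerate(row):
                totals[i] += value
        means = [total / n_topics for total in totals]
        d_star = max(means) - min(means)  # largest difference in this trial
        for pair, diff in observed.items():
            if d_star >= diff:
                counts[pair] += 1
    return {pair: count / trials for pair, count in counts.items()}


# Hypothetical toy scores: A is clearly better, B and C are identical.
demo = {"A": [0.9] * 20, "B": [0.1] * 20, "C": [0.1] * 20}
p_values = randomised_tukey_hsd(demo)
print(p_values)
```

Because every pair is compared against the same maximum difference, the procedure accounts for multiple comparisons across all submitted runs.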
  • 12. English subtask definition • Input: 80 WWW-2 topics and 80 new WWW-3 topics (participants had access to the original qrels for WWW-2) • Output: TREC-style run file • Target corpus: clueweb12-B13 • All runs were pooled and relevance assessments were conducted for all 160 topics • Runs are scored based on the 80 WWW-3 topics
  • 13. The original plan with a REV run (a revived system from NTCIR-14) • Replicability: compare a repli run with a REV run on the WWW-2 topics • Reproducibility: compare a repro run effectiveness on the WWW-3 topics with a REV run effectiveness on the WWW-2 topics • Progress: compare new runs and a REV run (SOTA from NTCIR-14) on the WWW-3 topics But unfortunately, we could not obtain a reliable REV run that represents the SOTA from NTCIR-14 on the NTCIR-15 WWW-3 topics.
  • 14. Runs and qrels • 37 runs from 9 teams (including the organisers’ baseline) were submitted and pooled
  • 15. Official top 10 runs (nDCG and Q)
  • 16. Official top 10 runs (nERR and iRBU)
  • 17. Randomised Tukey HSD test results (nDCG and Q) – top runs only [Figure: pairs of runs linked by an OUTPERFORMS relation]
  • 18. Randomised Tukey HSD test results (nERR and iRBU) – top runs only [Figure: pairs of runs linked by an OUTPERFORMS relation]
  • 20. Replicability and Reproducibility Terminology “An experimental result is not fully established unless it can be independently reproduced.” OLD ACM Terminology (Version 1.0): • Replicability: Different team, same experimental setup • Reproducibility: Different team, different experimental setup With the new ACM terminology (Version 1.1) replicability and reproducibility are swapped! Version 1.0: https://www.acm.org/publications/policies/artifact-review-badging Version 1.1: https://www.acm.org/publications/policies/artifact-review-and-badging-current
  • 21. Replicability Measures • Ranking: Kendall’s τ and RBO • Absolute per-topic effectiveness: RMSE_abs • Statistical approach: p-value of paired t-test • Effect over a baseline: RMSE_Δ, Effect Ratio (ER_repli) and Delta Relative Improvement (ΔRI_repli) • Reproducibility measures unpaired
  • 22. Replicability & Reproducibility Runs • Target Runs submitted at WWW-2: • Advanced: THUIR-E-CO-MAN-Base2 (LambdaMART) • Baseline: THUIR-E-CO-PU-Base4 (BM25) • Replicability and Reproducibility runs submitted at WWW-3: • Advanced: KASYS-E-CO-REP-2 and SLWWW-E-CO-REP-4 • Baseline: KASYS-E-CO-REP-3 • Replicability: WWW-2 qrels and topics; • Reproducibility: WWW-2 qrels and topics compared against WWW-3 qrels and topics.
  • 23. Replicability recap [Figure: WWW-2 and WWW-3 runs against WWW-2 and WWW-3 topics; labels: A-run (advanced), B-run (baseline), Effect]
  • 24. Replicability Results: Ranking of Documents • Kendall’s τ and RBO: computed between the original ranking of documents and the replicated ranking; • The closer to 1 the better the replicated run; • Scores close to 0 mean that the original and replicated runs are not correlated; • It is extremely hard to obtain the same list of documents!
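The two ranking measures on this slide can be sketched as below. This is a minimal illustration under stated assumptions: `kendall_tau` is the tau-a variant and assumes both rankings contain exactly the same documents with no ties (how documents present in only one ranking were handled is not stated on the slide), and `rbo_ext` is the extrapolated form of rank-biased overlap for equal-length lists. The persistence value `p=0.9` and the document IDs are hypothetical.

```python
from itertools import combinations
from typing import List


def kendall_tau(original: List[str], replicated: List[str]) -> float:
    """Kendall's tau-a over two rankings of the same documents (no ties)."""
    position = {doc: i for i, doc in enumerate(replicated)}
    concordant = discordant = 0
    # In each pair (a, b), a is ranked above b in the original list.
    for a, b in combinations(original, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def rbo_ext(original: List[str], replicated: List[str], p: float = 0.9) -> float:
    """Extrapolated rank-biased overlap for two equal-length rankings."""
    k = len(original)
    seen_orig, seen_repl = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for depth in range(1, k + 1):
        seen_orig.add(original[depth - 1])
        seen_repl.add(replicated[depth - 1])
        overlap = len(seen_orig & seen_repl)  # shared documents up to this depth
        weighted_sum += (overlap / depth) * p ** depth
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum


# Hypothetical rankings, for illustration only.
orig = ["d1", "d2", "d3", "d4"]
print(kendall_tau(orig, ["d1", "d3", "d2", "d4"]))
print(rbo_ext(orig, ["d1", "d3", "d2", "d4"]))
```

RBO weights agreement near the top of the ranking more heavily, which is why it is often paired with Kendall's τ when comparing search result lists.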
  • 25. Replicability Results: RMSE and p-values • RMSE: the closer to 0 the better; • p-value: a small p-value means that the runs are significantly different (without specifying whether they are better or not). Callouts on the results: large RMSEs; very small p-values
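A minimal sketch of the two quantities on this slide, computed from per-topic scores of an original run and its replication. The scores are hypothetical. Only the paired t statistic is computed here; turning it into a p-value needs the t distribution with n − 1 degrees of freedom, which a statistics package can supply (for example `scipy.stats.ttest_rel`).

```python
import math
from typing import List


def rmse(original: List[float], replicated: List[float]) -> float:
    """Root mean square error between per-topic scores of two runs."""
    n = len(original)
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(original, replicated)) / n)


def paired_t_statistic(original: List[float], replicated: List[float]) -> float:
    """t statistic of a paired t-test on per-topic score differences."""
    diffs = [o - r for o, r in zip(original, replicated)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(var_diff / n)


# Hypothetical per-topic scores for four topics.
original_scores = [0.50, 0.60, 0.70, 0.40]
replicated_scores = [0.40, 0.55, 0.50, 0.35]
print(rmse(original_scores, replicated_scores))
print(paired_t_statistic(original_scores, replicated_scores))
```

The two measures answer different questions: RMSE reports how far apart the per-topic scores are, while the t-test asks whether the mean difference is distinguishable from zero.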
  • 26. Replicability Results: Effect over a Baseline • Implications of ER scores: • ER ≤ 0: Failed replication, the A-run failed to outperform the B-run; • 0 < ER < 1: Somewhat successful, the replicated effect is smaller than the original effect; • ER = 1: Perfect replication; • ER > 1: Successful replication, the replicated effect is larger than the original effect. • ΔRI is interpreted similarly, but 0 indicates perfect replication;
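The slide gives the interpretation of ER and ΔRI but not their formulas. The sketch below assumes the definitions commonly attributed to [Breuer+ SIGIR2020]: ER is the mean per-topic improvement of the replicated A-run over the replicated B-run, divided by the same quantity for the original runs, and ΔRI is the original relative improvement minus the replicated one. Both definitions, the sign convention and all scores are assumptions made for illustration.

```python
import math
from typing import List


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def effect_ratio(orig_a, orig_b, rep_a, rep_b) -> float:
    """Assumed ER: mean replicated effect divided by mean original effect."""
    orig_effect = mean([a - b for a, b in zip(orig_a, orig_b)])
    rep_effect = mean([a - b for a, b in zip(rep_a, rep_b)])
    return rep_effect / orig_effect


def relative_improvement(a_run, b_run) -> float:
    """Relative improvement of the advanced run over the baseline run."""
    return (mean(a_run) - mean(b_run)) / mean(b_run)


def delta_ri(orig_a, orig_b, rep_a, rep_b) -> float:
    """Assumed sign convention: original RI minus replicated RI."""
    return relative_improvement(orig_a, orig_b) - relative_improvement(rep_a, rep_b)


def interpret_er(er: float) -> str:
    """Map an ER score to the cases listed on the slide."""
    if er <= 0:
        return "failed replication"
    if math.isclose(er, 1.0):
        return "perfect replication"
    if er < 1:
        return "replicated effect smaller than original"
    return "replicated effect larger than original"


# Hypothetical per-topic scores for three topics.
orig_a, orig_b = [0.6, 0.7, 0.5], [0.4, 0.5, 0.3]
rep_a, rep_b = [0.5, 0.6, 0.4], [0.4, 0.5, 0.3]
er = effect_ratio(orig_a, orig_b, rep_a, rep_b)
dri = delta_ri(orig_a, orig_b, rep_a, rep_b)
print(er, interpret_er(er), dri)
```

In this toy case the replicated A-run still beats its baseline, but by half the original margin, so ER is about 0.5 and ΔRI is positive.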
  • 27. Reproducibility recap [Figure: WWW-2 and WWW-3 runs against WWW-2 and WWW-3 topics; labels: A-run (advanced), B-run (baseline), Effect]
  • 28. Reproducibility Results: p-values and Effects over a Baseline • Recall that there is no target original run; • Reproducibility is even harder than replicability!
  • 30. Summary • Chinese subtask (only 3 teams) Best run: RUCIR-C-CD-NEW-4 • English subtask (9 teams) Best runs: KASYS-E-CO-NEW-{1,4} and mpii-E-CO-NEW-1. KASYS uses a BERT-based method from [Yilmaz+ EMNLP2019]. • CENTRE: We need a community effort since replicability and reproducibility are very tough problems!
  • 31. Thank you participants! And many thanks to the NTCIR PC chairs, GCs, and staff!
  • 33. WWW will be back (IF our task proposal is accepted) • English subtask only • New English corpus! (Common Crawl?) • New target for replicability, reproducibility, and a baseline for progress: University of Tsukuba’s BERT-based run from WWW-3 • Topics to be released in October 2021 • Run submission deadline in November 2021 • Please follow @ntcirwww on Twitter!
  • 34. Selected references • [Breuer+ SIGIR2020] How to Measure the Reproducibility of System-oriented IR Experiments, ACM SIGIR 2020. (More about CENTRE) • [Sakai+ TOIS2020] Retrieval Evaluation Measures that Agree with Users' SERP Preferences: Traditional, Preference-based, and Diversity Measures, ACM TOIS 39(2), to appear, 2020. (Evaluation measures including iRBU) • [Yilmaz+ EMNLP2019] Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval, EMNLP 2019. (University of Tsukuba’s top run is based on this)