Information Retrieval Meta-Evaluation:
Challenges and Opportunities in the Music Domain

Julián Urbano · @julian_urbano
University Carlos III of Madrid

ISMIR 2011 · Miami, USA · October 26th
Picture by Daniel Ray
Picture by Bill Mill
current evaluation practices hinder the proper development of Music IR

we lack meta-evaluation studies

we can’t complete the IR research & development cycle
how did we get here?




Picture by NASA History Office
[timeline of Text IR evaluation, 1960-2011]
users, the basis: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995)
large-scale collections: TREC (1992-today)
multi-language & multi-modal: NTCIR (1999-today), CLEF (2000-today)
ISMIR 2001 resolution on the need to create standardized MIR test collections,
tasks, and evaluation metrics for MIR research and development

[timeline as above, now joined by ISMIR (2000-today) and MIREX (2005-today, >1200 runs!)]

3 workshops (2002-2003): The MIR/MDL Evaluation Project

follow the steps of the Text IR folks
but carefully: not everything applies to music
are we done already?

Evaluation is not easy
nearly 2 decades of Meta-Evaluation in Text IR: a lot of things have happened here!
some good practices inherited from here, with a positive impact on MIR

“not everything applies”, but much of it does!
we still have a very long way to go
Picture by Official U.S. Navy Imagery
evaluation

Cranfield Paradigm: Task, User Model
Experimental Validity
how well an experiment meets the well-grounded requirements of the scientific method
do the results fairly and actually assess what was intended?

Meta-Evaluation
analyze the validity of IR Evaluation experiments
              Task   User model   Documents   Queries   Ground truth   Systems   Measures
Construct      x         x                                                            x
Content        x         x            x          x                        x
Convergent               x                                     x                      x
Criterion                                         x            x                      x
Internal                              x           x            x          x           x
External                              x           x            x          x
Conclusion               x                        x            x          x           x
experimental failures
construct validity

#fail
measure quality of a Web search engine by the number of visits

what?
do the variables of the experiment correspond to the theoretical meaning
of the concept they purport to measure?

how?
thorough selection and justification of the variables used
construct validity in IR

effectiveness measures and their user model [Carterette, SIGIR2011]

set-based measures do not resemble real users [Sanderson et al., SIGIR2010]
rank-based measures are better [Järvelin et al., TOIS2002]

graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]
other forms of ground truth are better [Bennett et al., SIGIRForum2008]
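To make the contrast concrete, here is a minimal sketch in plain Python (the judgments and runs are made up for illustration) of a set-based measure with binary relevance versus a rank-based measure with graded relevance. The two can disagree about which of two rankings is better, which is exactly why the choice of measure must be justified by the task and user model.

```python
import math

def precision_at_k(ranking, qrels, k):
    """Set-based view: fraction of the top-k documents that are (binary) relevant."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k

def ndcg_at_k(ranking, qrels, k):
    """Rank-based view with graded relevance: normalized discounted cumulative gain."""
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranking[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: graded judgments (0 = not relevant, 1 = marginal, 3 = highly relevant)
qrels = {"d1": 3, "d2": 1, "d3": 1, "d4": 0, "d5": 0, "d6": 0}
run_a = ["d2", "d3", "d4", "d1", "d5", "d6"]   # two marginal documents at the top
run_b = ["d1", "d4", "d5", "d2", "d3", "d6"]   # the highly relevant document first

for name, run in (("A", run_a), ("B", run_b)):
    print(name, round(precision_at_k(run, qrels, 3), 3), round(ndcg_at_k(run, qrels, 3), 3))
# P@3 prefers run A (0.667 vs 0.333); nDCG@3 prefers run B (~0.73 vs ~0.39):
# different user models, different winners.
```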
content validity

#fail
measure reading comprehension only with sci-fi books

what?
do the experimental units reflect and represent the elements of the domain under study?

how?
careful selection of the experimental units
content validity in IR

tasks closely resembling real-world settings
systems completely fulfilling real-user needs

heavy user component, difficult to control
evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]

the actual value of systems is really unknown [Marchionini, CACM2006]
sometimes they just do not work with real users [Turpin et al., SIGIR2001]
content validity in IR

documents resembling real-world settings: large and representative samples
(especially for Machine Learning)

careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]
random selection is not good
some queries are better at differentiating bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011]
convergent validity

#fail
measures of math skills not correlated with abstract thinking

what?
do the results agree with other results, theoretical or experimental, that they should be related to?

how?
careful examination and confirmation of the relationship between the results and others supposedly related
convergent validity in IR

ground truth data is subjective: differences across groups and over time
different results depending on who evaluates

absolute numbers change, but relative differences hold for the most part [Voorhees, IP&M2000]

for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010]
convergent validity in IR

measures are precision- or recall-oriented, so they should be correlated with each other
but they actually are not (reliability?) [Kekäläinen, IP&M2005][Sakai, IP&M2007]
better correlated with others than with themselves! [Webber et al., SIGIR2008]

correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]
ranks, unconventional judgments, discounted gain… [Bennett et al., SIGIRForum2008][Järvelin et al., TOIS2002]
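A hedged sketch of how such agreement is typically checked: rank the systems under each measure and compute Kendall's tau between the two rankings. The systems and scores below are invented for illustration.

```python
from itertools import combinations

def kendall_tau(scores_x, scores_y):
    """Kendall's tau between two score dictionaries over the same systems (no ties assumed)."""
    systems = sorted(scores_x)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        if (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical per-system averages under two measures, e.g. a precision-oriented
# and a recall-oriented one; the systems and numbers are made up.
precision_like = {"s1": 0.42, "s2": 0.38, "s3": 0.35, "s4": 0.31, "s5": 0.28}
recall_like    = {"s1": 0.55, "s2": 0.48, "s3": 0.51, "s4": 0.40, "s5": 0.44}

print(kendall_tau(precision_like, recall_like))   # 0.6
# A tau well below 1 means the two measures rank the systems differently,
# i.e. they do not converge on the same conclusions about which system is best.
```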
criterion validity

#fail
ask whether the new drink is good instead of whether it is better than the old one

what?
are the results correlated with those of other experiments already known to be valid?

how?
careful examination and confirmation of the correlation between our results and previous ones
criterion validity in IR
(less effort, but same results?)

practical large-scale methodologies: pooling [Buckley et al., SIGIR2004]
judgments by non-experts [Bailey et al., SIGIR2008]
crowdsourcing for low cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]
estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]
select which documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]
use no relevance judgments at all [Soboroff et al., SIGIR2001]
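To illustrate the pooling idea with made-up runs: only the union of the top-k documents of every submitted run gets judged, which is what makes large-scale judging affordable in the first place. A minimal sketch, assuming hypothetical document ids:

```python
def depth_k_pool(runs, k):
    """Union of the top-k documents of every submitted run; only these get judged."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Hypothetical runs for one query (document ids are invented)
runs = {
    "run1": ["d1", "d2", "d3", "d4", "d5"],
    "run2": ["d2", "d6", "d1", "d7", "d8"],
    "run3": ["d9", "d2", "d5", "d1", "d3"],
}

pool = depth_k_pool(runs, k=2)
print(sorted(pool))   # ['d1', 'd2', 'd6', 'd9'] -> 4 judgments instead of 9 documents
# Documents that are never pooled (e.g. d7, d8) end up treated as non-relevant,
# which is where the incompleteness problem discussed under internal validity comes from.
```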
internal validity

#fail
measure usefulness of Windows vs Linux vs iOS only with Apple employees

what?
can the conclusions be rigorously drawn from the experiment alone and not from other overlooked factors?

how?
careful identification and control of possible confounding variables and selection of design
internal validity in IR

inconsistency: performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]
incompleteness: performance depends on pools; system reinforcement [Zobel, SIGIR2008]
affects reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]

train-test: same characteristics in queries and docs (especially for Machine Learning)
improvements on the same collections: overfitting [Voorhees, CLEF2002]

measures must be fair to all systems
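A toy illustration, with invented judgments, of why incompleteness reinforces the pooled systems: a later system that ranks a relevant but never-judged document highly is penalized under the usual "unjudged = non-relevant" assumption.

```python
def precision_at_k(ranking, qrels, k):
    """Unjudged documents (absent from qrels) are treated as non-relevant."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k

# Hypothetical complete judgments vs the subset that actually made it into the pool
complete_qrels = {"d1": 1, "d2": 1, "d7": 1, "d9": 0}
pooled_qrels   = {"d1": 1, "d2": 1, "d9": 0}   # d7 was never pooled, so never judged

new_system = ["d7", "d1", "d2"]   # a later system that ranks the unjudged d7 first

print(precision_at_k(new_system, complete_qrels, 3))  # 1.0 with complete judgments
print(precision_at_k(new_system, pooled_qrels, 3))    # ~0.67 with the incomplete pool
# The collection systematically favors the systems that contributed to the pool,
# which is the reusability / system-reinforcement issue referred to above.
```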
external validity

#fail
study cancer treatment mostly with teenage males

what?
can the results be generalized to other populations and experimental settings?

how?
careful design and justification of sampling and selection methods
external validity in IR

weakest point of IR Evaluation [Voorhees, CLEF2002]

large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]

test collections are themselves an evaluation result,
but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
external validity in IR

systems perform differently with different collections
cross-collection comparisons are unjustified: results highly depend on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]
interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]

do not claim anything about the state of the art based on a handful of experiments
baselines (meaningful, not random!) can be used to compare across collections [Armstrong et al., CIKM2009]
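A small sketch of that last idea, with all numbers invented: instead of comparing absolute scores across collections, report each system's improvement over a shared, meaningful baseline run on each collection.

```python
# Hypothetical mean scores of the same systems on two different test collections.
# Absolute numbers are not comparable across collections, but improvement over a
# shared, meaningful baseline (not a random one!) gives a per-collection yardstick.
scores = {
    "collectionA": {"baseline": 0.30, "system_x": 0.36, "system_y": 0.33},
    "collectionB": {"baseline": 0.45, "system_x": 0.47, "system_y": 0.52},
}

for coll, runs in scores.items():
    base = runs["baseline"]
    for system, score in runs.items():
        if system == "baseline":
            continue
        rel = 100.0 * (score - base) / base
        print(f"{coll}: {system} improves {rel:+.1f}% over the baseline")
# system_x looks better on collection A (+20.0%) and system_y on collection B (+15.6%):
# comparing 0.36 on A directly against 0.52 on B would be meaningless.
```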
conclusion validity

#fail
more access to the Internet in China than in the US because of the larger total number of users

what?
are the conclusions justified based on the results?

how?
careful selection of the measuring instruments and statistical methods used to draw grand conclusions
conclusion validity in IR

measures should be sensitive and stable [Buckley et al., SIGIR2000]
and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007]
with little effort [Sanderson et al., SIGIR2005]

always bearing in mind the user model and the task
conclusion validity in IR

statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]

correct interpretation of the statistics
hypothesis testing is troublesome
statistical significance ≠ practical significance
increasing #queries (sample size) increases the power to detect ever smaller differences (effect size)
eventually, everything is statistically significant
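A minimal paired randomization test, with made-up per-query scores, to illustrate the point: the p-value only says the difference is unlikely under the null hypothesis, while the size of the mean difference is what matters in practice.

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=42):
    """Paired two-sided randomization (sign-flipping) test on per-query differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return sum(diffs) / len(diffs), extreme / trials

# Hypothetical per-query AP scores for two systems over the same queries
system_a = [0.31, 0.42, 0.28, 0.55, 0.47, 0.39, 0.44, 0.36, 0.50, 0.41]
system_b = [0.30, 0.40, 0.27, 0.53, 0.46, 0.38, 0.43, 0.35, 0.49, 0.40]

mean_diff, p_value = randomization_test(system_a, system_b)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.4f}")
# The consistent ~0.01 difference comes out statistically significant here;
# whether a 0.01 gain matters to any user is a separate, practical question.
```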
challenges




Picture by Brian Snelson
IR Research & Development Cycle
MIR evaluation practices do not allow us to complete this cycle
IR Research & Development Cycle

loose definition of task intent and user model
realistic data
IR Research & Development Cycle

collections are too small and/or biased
lack of realistic, controlled public collections
standard formats and evaluation software to minimize bugs
private, undescribed and unanalyzed collections emerge
can’t replicate results, often leading to wrong conclusions
IR Research & Development Cycle

lack of baselines as a lower bound (random is not a baseline!)
undocumented measures, no accepted evaluation software
proper statistics, correct interpretation of statistics
IR Research & Development Cycle

raw musical material unknown
undocumented queries and/or documents
go back to private collections: overfitting!
IR Research & Development Cycle

collections can’t be reused
blind improvements
go back to private collections: overfitting!
Picture by Donna Grayson
collections

large, heterogeneous and controlled

not a hard endeavour, except for the damn copyright

Million Song Dataset!
still problematic (new features? actual music)

standardize collections across tasks
better understanding and use of improvements
raw music data

essential for the Learning and Improvement phases

use copyright-free data
Jamendo!
study possible biases

reconsider artificial material
evaluation model

let teams run their own algorithms (needs public collections)

relieves IMIRSEL and promotes wider participation
successfully used for 20 years in Text IR venues
adopted by MusiCLEF

the only viable alternative in the long run
MIREX-DIY platforms still don’t allow full completion of the IR Research & Development Cycle
organization

IMIRSEL plans, schedules and runs everything

add a 2nd tier of organizers, task-specific:
logistics, planning, evaluation, troubleshooting…

format of large forums like TREC and CLEF
smooth the process and develop tasks that really push the limits of the state of the art
overview papers

every year, by task organizers

detail the evaluation process, data and results
discussion to boost Interpretation and Learning

the perfect wrap-up for team papers,
which rarely discuss results and are often not even drafted
specific methodologies

MIR has unique methodologies and measures

meta-evaluate: analyze and improve

human effects on the evaluation

user satisfaction
standard evaluation software

bugs are inevitable

open evaluation software to everybody:
gain reliability
speed up the development process
serve as documentation for newcomers

promote standardization of formats
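As one concrete example of what shared formats and shared software buy, here is a hypothetical reader for run files in the TREC-style convention used in Text IR (one `query-id Q0 doc-id rank score run-tag` record per line); the function name and behavior are a sketch, not an existing tool, but having every task read runs through one such open routine avoids each team re-implementing and re-debugging it.

```python
from collections import defaultdict

def read_trec_run(path):
    """Read a TREC-style run file: 'query_id Q0 doc_id rank score run_tag' per line."""
    run = defaultdict(list)
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            fields = line.split()
            if not fields:
                continue
            if len(fields) != 6:
                raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
            qid, _, doc_id, _, score, _ = fields
            run[qid].append((doc_id, float(score)))
    # Rank by score, descending, so every measure sees exactly the same ordering
    return {qid: [d for d, _ in sorted(docs, key=lambda x: -x[1])]
            for qid, docs in run.items()}
```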
baselines

help measure the overall progress of the field

standard formats + standard software + public controlled collections + raw music + task-specific organization

measure the state of the art
commitment

we need to acknowledge the current problems

MIREX should not only be a place to evaluate and improve systems,
but also a place to meta-evaluate and improve how we evaluate,
and a place to design tasks that challenge researchers

analyze our evaluation methodologies
we all need to start questioning evaluation practices

it’s worth it

Picture by Brian Snelson

it’s not that everything we do is wrong…
it’s that we don’t know it!
