information retrieval evaluation statistical significance effect sizes evaluation measures ntcir test collections statistical power graded relevance relevance assessments natural language processing short text conversation confidence intervals web search reproducibility replicability inter-assessor agreement dialogues topic set size design sample sizes topic set sizes preference assessments dialogue systems user preferences tukey hsd test anova t-tests multiple comparison procedures trec clef unanimity gain values lancers information retrieval bayesian inference power information access sigir stc measures power analysis failure analysis progress monitoring