
Multi-method Evaluation in Scientific Paper Recommender Systems


Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation (IUadaptME) workshop, conducted as part of UMAP 2018.



  1. Multi-method Evaluation in Scientific Paper Recommender Systems. Aravind Sesagiri Raamkumar, Schubert Foo. Wee Kim Wee School of Communication and Information, NTU. IUadaptME Workshop | UMAP’18, July 8th 2018.
  2. Focus area: Scientific Paper Recommender Systems (SPRS), also referred to as Citation Recommender Systems, Literature Recommender Systems and Research Paper Recommender Systems.
  3. (figure-only slide)
  4. (figure-only slide)
  5. SPRS Studies
     • Major areas
       – Literature Review (LR) tasks
         • Building an initial reading list at the start of the LR
         • Finding similar papers based on a single paper
         • Finding similar papers based on multiple papers
         • Searching for papers based on input text
       – User footprint
       – Researcher’s publication history
       – Social network of authors
     • Recommendations generated based on:
       – Citation network
       – Metadata fields
       – Text content from papers
       – System logs
  6. Rec4LRW System (Recommender System for Literature Review and Writing)
     • Task 1: Building an initial reading list of research papers
       – Author-specified Keywords based Retrieval (AKR) technique
     • Task 2: Finding similar papers based on a set of papers
       – Integrated Discovery of Similar Papers (IDSP) technique
     • Task 3: Shortlisting papers from the reading list for inclusion in the manuscript
       – Citation Network based Shortlisting (CNS) technique
  7. Rec4LRW Task Screens: screenshots of Task 1 and Task 2, highlighting the information cue labels and the Seed Basket (SB).
  8. Rec4LRW Task Screens: screenshots of Task 2 and Task 3, highlighting the shared co-relations and the Reading List (RL).
  9. Rec4LRW Task Screens: screenshot of Task 3, highlighting the cluster viewing option.
  10. Rec4LRW Evaluation Strategy
      • Offline evaluation of Task 1: rank aggregation method
      • User evaluation of the three tasks: survey-based evaluations
      • User evaluation of the overall system: survey-based evaluations
      “Offline evaluations are more prevalent in the SPRS area, accounting for about 69% of all studies.”
  11. Offline Evaluation of Task 1
      Evaluated techniques (label / abbreviation / description):
      • A / AKRv1: basic AKR technique with weights WCC = 0.25, WRC = 0.25, WCO = 0.5
      • B / AKRv2: basic AKR technique with weights WCC = 0.1, WRC = 0.1, WCO = 0.8
      • C / HAKRv1: HITS-enhanced AKR technique boosted with weights WCC = 0.25, WRC = 0.25, WCO = 0.5
      • D / HAKRv2: HITS-enhanced AKR technique boosted with weights WCC = 0.1, WRC = 0.1, WCO = 0.8
      • E / CFHITS: IBCF technique boosted with HITS
      • F / CFPR: IBCF technique boosted with PageRank
      • G / PR: PageRank technique
      Experiment setup
      • A total of 186 author-specified keywords from the ACM DL dataset were identified as the seed research topics
      • The experiment was performed in three sequential steps (a sketch of this pipeline appears after the slide transcript):
        1. The top 200 papers were retrieved using the BM25 similarity algorithm
        2. The top 20 papers were identified using the specific ranking schemes of the seven techniques
        3. The evaluation metrics were measured for the seven techniques
      Evaluation approach
      • The numbers of Recent (R1), Popular (R2), Survey (R3) and Diverse (R4) papers were enumerated for each of the 186 topics and seven techniques
      • Ranks were assigned to the techniques based on the highest counts in each recommendation list
      • The RankAggreg library was used to perform rank aggregation
  12. Offline Evaluation of Task 1: Results
      Optimal aggregated ranks (positions 1 to 7) and minimum objective function score, per paper type (requirement):
      • Recent Papers (R1): B, A, C, D, E, F, G (score 10.66)
      • Popular Papers (R2): F, E, C, D, G, A, B (score 11.89)
      • Literature Survey Papers (R3): C, G, D, A, E, F, B (score 13.38)
      • Diverse Papers (R4): C, D, G, A, B, F, E (score 12.15)
      • The HITS-enhanced version of the AKR technique, HAKRv1 (C), was the best all-round performing technique
      • The HAKRv1 technique was particularly good at retrieving literature survey papers and papers from different sub-topics, while the basic AKRv1 technique (A) was good at retrieving recent papers
      (A rank-aggregation sketch appears after the slide transcript.)
  13. Rec4LRW User Study: Evaluation Goals
      1. Ascertain the agreement percentages of the evaluation measures for the three tasks and the overall system, and identify whether the values are above a preset threshold criterion of 75%
      2. Test the hypothesis that students benefit more from the recommendation tasks/system than staff
      3. Measure the correlations between the measures and build a regression model with ‘agreeability on a good list’ as the dependent variable
      4. Track the change in user perceptions across the three tasks
      5. Compare the pre-study and post-study variables to understand whether the target participants benefitted from the tasks
      6. Identify the most preferred and most critical aspects of the task recommendations and the system, using the participants’ subjective feedback
  14. User Study Details
      • The Rec4LRW system was made available over the internet
      • Participants were recruited with the intent of reaching a worldwide audience
      • Only researchers with paper-authoring experience were recruited, via a pre-screening survey
      • 230 researchers participated in the pre-screening survey
      • 149 participants were deemed eligible and invited to the study
      • Participants were provided with a user guide
      • Participants were required to execute all three tasks
      • Evaluation questionnaires were embedded in the screen of each task of the Rec4LRW system
  15. Task Evaluation Measures
      • Common measures: Relevance, Usefulness, Good_List
      • Tasks 1 and 2: Good_Spread, Diversity, Interdisciplinarity, Popularity, Recency, Good_Mix, Familiarity, Novelty, Serendipity, Expansion_Required, User_Satisfaction
      • Task 2: Seedbasket_Similarity, Shared_Corelations, Seedbasket_Usefulness
      • Task 3: Importance, Certainty, Shortlisting_Feature
      Qualitative feedback questions:
      1) From the displayed information, what features did you like the most?
      2) Please provide your personal feedback about the execution of this task
  16. System Evaluation Measures
      • Effort to Use the System (EUS): Convenience, Effort_Required, Mouse_Clicks, Little_Time, Much_Time
      • Perceived Usefulness (PU): Productivity_Improvability, Enhance_Effectiveness, Ease_Job, Work_Usefulness
      • Perceived System Effectiveness (PSE): Recommend, Pleasant_Experience, Useless, Awareness, Better_Choice, Findability, Accomplish_Tasks, Performance_Improvability
  17. Sample Evaluation Questionnaire
  18. Analysis Procedures
      Quantitative data
      • Agreement Percentage (AP) calculated by considering only responses of 4 (‘Agree’) and 5 (‘Strongly Agree’) on the 5-point Likert scale
      • Independent-samples t-test for hypothesis testing
      • Spearman coefficient for correlation measurement
      • MLR used for the predictive models, with a paired-samples t-test for model validation
      Qualitative data
      • The descriptive coding method was used to code the participant feedback
      • Two coders performed the coding in a sequential manner
      Inter-coder reliability (κ) for preferred / critical aspects:
      • Task 1: 0.918 / 0.727
      • Task 2: 0.930 / 0.758
      • Task 3: 0.877 / 0.902
      (A sketch of these computations appears after the slide transcript.)
  19. Participant Demographics
      Participants per stage: Task 1: 132, Task 2: 121, Task 3: 119
      Position: Student 62 (47%), Staff 70 (53%)
      Experience level: Beginner 15 (11.4%), Intermediate 61 (46.2%), Advanced 34 (25.8%), Expert 22 (16.7%)
      Discipline: Computer Science & Information Systems 51 (38.6%), Library and Information Studies 30 (22.7%), Electrical & Electronic Engineering 30 (22.7%), Communication & Media Studies 8 (6.1%), Mechanical, Aeronautical & Manufacturing Engineering 5 (3.8%), Biological Sciences 2 (1.5%), Statistics & Operational Research 1 (0.8%), Education 1 (0.8%), Politics & International Studies 1 (0.8%), Economics & Econometrics 1 (0.8%), Civil & Structural Engineering 1 (0.8%), Psychology 1 (0.8%)
      Country: Singapore 107 (81.1%), India 4 (3%), Malaysia 3 (2.3%), Sri Lanka 3 (2.3%), Pakistan 3 (2.3%), Indonesia 2 (1.5%), Germany 2 (1.5%), Australia 1 (0.8%), Iran 1 (0.8%), Thailand 1 (0.8%), China 1 (0.8%), USA 1 (0.8%), Canada 1 (0.8%), Sweden 1 (0.8%), Slovenia 1 (0.8%)
  20. Results for Goals 1 & 2
  21. Results for Goals 3 and 4
      Predictors for “Good_List” (independent variables per task):
      • Task 1: Recency, Novelty, Serendipity, Usefulness, User_Satisfaction
      • Task 2: Seedbasket_Similarity, Usefulness
      • Task 3: Relevance, Usefulness, Certainty
      Transition of user perception from Task 1 to Task 2
      (A regression sketch for the Task 1 model appears after the slide transcript.)
  22. Results for Goal 5
      Need_Assistance (pre-study) vs. Good_List (post-study): bar charts of response counts for Tasks 1, 2 and 3 (pre-study scale: Never, Rarely, Sometimes, Often, Always; post-study ratings 1 to 5)
  23. Results for Goal 6
      Top 5 preferred aspects (in rank order):
      • Task 1 (N=109): Information Cue Labels (41%), Rich Metadata (21%), Diversity of Papers (13%), Recommendation Quality (9%), Recency of Papers (4%)
      • Task 2 (N=100): Shared Co-citations & Co-references (28%), Recommendation Quality (27%), Information Cue Labels (16%), Seed Basket (14%), Rich Metadata (9%)
      • Task 3 (N=91): Shortlisting Feature & Recommendation Quality (24%), Information Cue Labels (15%), View Papers in Clusters (11%), Rich Metadata (7%), Ranking of Papers (3%)
      Top 5 critical aspects (in rank order):
      • Task 1 (N=109): Broad topics not suitable (20%), Limited dataset (7%), Quality can be improved (6%), Different algorithm required (5%), Free-text search required (4%)
      • Task 2 (N=100): Quality can be improved (16%), Limited dataset (12%), Recommendation algorithm could include more dimensions (7%), Speed can be improved (7%), Repeated recommendations from Task 1 (3%)
      • Task 3 (N=91): Rote selection of papers for task execution (16%), Limited dataset (5%), Algorithm can be improved (5%), Not sure of the usefulness (4%), UI can be improved (3%)
  24. SPRRF: Scientific Paper Retrieval and Recommender Framework
      • Seven themes identified using the holistic coding method: Distinct User Groups; Usefulness of Information Cue Labels; Forced Serendipity vs. Natural Serendipity; Learning Algorithms vs. Fixed-Logic Algorithms; Inclusion of Control Features in the UI; Inclusion of Bibliometric Data; Diversification of Corpus
      • SPRRF conceptualized as a mental model based on the themes
      • The framework still needs to be validated
  25. Questions for Discussion
      • How dependable are the gold-standard lists in SPRS evaluation, given that relevance is largely dependent on user perspective?
      • Should SPRS evaluations be conducted in a parallel or serial manner?
      • What type of data should be collected during usability testing in SPRS evaluation?
  26. THANK YOU. Contact: aravind002@ntu.edu.sg
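
As a companion to the Task 1 offline pipeline on slide 11, the following Python sketch walks through step 1 (BM25 retrieval of a top-200 candidate set) and step 2 (re-ranking with an AKR-style weighted combination). It is a minimal illustration, not the Rec4LRW implementation: the deck names only the weight labels WCC, WRC and WCO, so the three component scores, the helper functions and the use of the rank_bm25 package are assumptions.

```python
# Minimal sketch of the Task 1 offline pipeline (slide 11), assuming a small
# in-memory corpus. The feature definitions behind WCC / WRC / WCO are
# illustrative guesses, not the deck's exact formulas.
from rank_bm25 import BM25Okapi


def retrieve_candidates(corpus_texts, query, k=200):
    """Step 1: retrieve the top-k papers for a seed keyword query with BM25."""
    tokenized = [doc.lower().split() for doc in corpus_texts]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus_texts)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def akr_score(paper, w_cc=0.25, w_rc=0.25, w_co=0.5):
    """Step 2: AKR-style weighted combination. The three normalized component
    scores are hypothetical stand-ins for whatever WCC, WRC and WCO weight."""
    return (w_cc * paper["citation_score"]
            + w_rc * paper["reference_score"]
            + w_co * paper["cooccurrence_score"])


def rerank_top20(papers, candidate_ids, **weights):
    """Step 2 (continued): keep the 20 highest-scoring candidate papers."""
    scored = sorted(candidate_ids,
                    key=lambda i: akr_score(papers[i], **weights),
                    reverse=True)
    return scored[:20]
```

Step 3 of the slide (measuring the evaluation metrics per technique) would then count Recent, Popular, Survey and Diverse papers in each top-20 list.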
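
Slide 12 reports optimal aggregated ranks produced with the R package RankAggreg. The sketch below is a simplified stand-in rather than that package: it brute-forces the ordering of the seven technique labels that minimizes the summed Spearman footrule distance to the per-topic rankings, the same family of objective RankAggreg optimizes with cross-entropy or genetic search. The example rankings are hypothetical, and the resulting objective values are not comparable to the scores on the slide.

```python
from itertools import permutations


def footrule(candidate, ranking):
    """Spearman footrule distance between a candidate ordering and one observed ranking."""
    pos_c = {label: i for i, label in enumerate(candidate)}
    pos_r = {label: i for i, label in enumerate(ranking)}
    return sum(abs(pos_c[label] - pos_r[label]) for label in candidate)


def aggregate(rankings):
    """Brute-force aggregation over all orderings of the labels (feasible for 7 techniques)."""
    labels = rankings[0]
    best, best_score = None, float("inf")
    for cand in permutations(labels):
        score = sum(footrule(cand, r) for r in rankings)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score


# Hypothetical per-topic rankings of techniques A..G for one requirement:
rankings = [list("BACDEFG"), list("ABCDEFG"), list("BACDEGF")]
print(aggregate(rankings))
```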
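
Slide 18 lists the statistical procedures applied to the survey and coding data. The sketch below shows how each step could be computed with NumPy, SciPy and scikit-learn; the deck does not state which software was actually used, and the response vectors here are made-up examples. Agreement Percentage counts only ratings of 4 and 5, the independent-samples t-test covers the student-versus-staff hypothesis, Spearman's rho covers the correlation analysis, and Cohen's kappa is one common way to obtain the inter-coder κ values reported on the slide.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score


def agreement_percentage(responses):
    """Share of 5-point Likert responses that are 4 ('Agree') or 5 ('Strongly Agree')."""
    responses = np.asarray(responses)
    return 100.0 * np.mean(responses >= 4)


# Hypothetical Likert responses for one measure, split by participant position
students = [5, 4, 4, 3, 5, 4, 2, 5]
staff = [3, 4, 2, 3, 4, 3, 5, 2]
print("AP (students):", agreement_percentage(students))

# Independent-samples t-test: do students rate the measure differently from staff?
t_stat, p_val = stats.ttest_ind(students, staff)

# Spearman correlation between two measures (e.g. Usefulness vs. Good_List)
usefulness = [4, 5, 3, 4, 5, 2, 4, 5]
good_list = [4, 4, 3, 5, 5, 3, 4, 4]
rho, p_rho = stats.spearmanr(usefulness, good_list)

# Inter-coder agreement for the descriptive coding (two coders, same feedback items)
coder1 = ["cue_labels", "metadata", "quality", "cue_labels"]
coder2 = ["cue_labels", "metadata", "cue_labels", "cue_labels"]
kappa = cohen_kappa_score(coder1, coder2)

print(t_stat, p_val, rho, p_rho, kappa)
```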
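
Slides 18 and 21 describe multiple linear regression models with Good_List as the dependent variable; for Task 1 the retained predictors are Recency, Novelty, Serendipity, Usefulness and User_Satisfaction. Below is a minimal sketch of fitting such a model with statsmodels OLS, assuming the survey responses sit in a pandas DataFrame; the data and the estimation package are illustrative, not taken from the study.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical Likert-scale survey responses for Task 1 (one row per participant)
df = pd.DataFrame({
    "Recency":           [4, 5, 3, 4, 2, 5, 4, 3],
    "Novelty":           [3, 4, 4, 5, 2, 4, 3, 3],
    "Serendipity":       [2, 4, 3, 4, 1, 5, 3, 2],
    "Usefulness":        [4, 5, 4, 5, 2, 5, 4, 3],
    "User_Satisfaction": [4, 4, 3, 5, 2, 5, 4, 3],
    "Good_List":         [4, 5, 3, 5, 2, 5, 4, 3],
})

predictors = ["Recency", "Novelty", "Serendipity", "Usefulness", "User_Satisfaction"]
X = sm.add_constant(df[predictors])      # add an intercept term
model = sm.OLS(df["Good_List"], X).fit() # ordinary least squares fit
print(model.summary())                   # coefficients, R-squared, p-values
```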
