Usage-Based vs. Citation-Based
Recommenders in a Digital Library

André Vellino
School of Information Studies
University of Ottawa
blog: http://synthese.wordpress.com
twitter: @vellino
e-mail: avellino@uottawa.ca
Context
—  Canada Institute for Scientific and Technical Information
  (aka Canada’s National Science Library)
—  Has a full-text digital collection (Scientific, Technical,
  Medical) with text-mining rights for research purposes only
  —  Elsevier and Springer (mostly)
      —  ~8M articles
      —  ~2800 journals
      —  ~ 3TB
—  Plan: a Hybrid, Multi-Dimensional Recommender
  —  Usage-based (CF)
  —  Content-based (CBF)
  —  User-Context
Sparsity of Usage Data is a Problem in
Digital Libraries
    Amazon:             ~70 M users,    ~93 M items
    Digital Libraries:  ~70,000 users,  ~7 M items
Data is Sparse Too

—  Sparseness of a dataset: S = (edges in the user-item graph) / (total number of possible edges)
—  Mendeley data           S = 2.66 x 10^-5
—  Netflix                 S = 1.18 x 10^-2

—  But also, Mendeley data isn’t “highly connected”
   —  83.6% of Mendeley articles were referenced by only 1 user
   —   6% of the articles were referenced by 3 or more users.
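
The sparseness ratio above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usage data is available as (user, item) pairs and that "total number of possible edges" means users times items; the function names and the toy data are hypothetical and not taken from the Mendeley or Netflix datasets.

```python
from collections import Counter


def sparseness(interactions):
    """S = observed user-item edges / possible user-item edges.

    Assumes the number of possible edges is |users| * |items|.
    """
    edges = set(interactions)
    users = {user for user, _ in edges}
    items = {item for _, item in edges}
    if not edges:
        return 0.0
    return len(edges) / (len(users) * len(items))


def single_user_share(interactions):
    """Fraction of items that are referenced by exactly one user."""
    users_per_item = Counter(item for _, item in set(interactions))
    if not users_per_item:
        return 0.0
    singles = sum(1 for n in users_per_item.values() if n == 1)
    return singles / len(users_per_item)


# Hypothetical toy data: 3 users, 3 articles, 4 edges.
toy = [("u1", "a1"), ("u1", "a2"), ("u2", "a2"), ("u3", "a3")]
print(sparseness(toy), single_user_share(toy))
```

The second function mirrors the "referenced by only 1 user" statistic: a dataset can have a tolerable S and still be poorly connected if most items have a single user.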
Ex Libris bX solution to data sparsity (2009):
   Harvest lots of usage (co-download) behaviour from world-wide SFX
   (Ex Libris OpenURL resolver) logs and apply collaborative filtering
   to correlate articles.

     Johan Bollen and Herbert Van de Sompel. An architecture for the
     aggregation and analysis of scholarly usage data. (in JCDL 2006)
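
To make the co-download idea concrete, here is a minimal Python sketch. It is a generic co-occurrence count, not the bX algorithm itself (the slides do not describe its internals), and it assumes the resolver logs have already been grouped into per-session lists of article identifiers; all names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_download_counts(sessions):
    """Count how often each pair of articles appears in the same session."""
    counts = Counter()
    for articles in sessions:
        # De-duplicate within a session and order pairs consistently.
        for a, b in combinations(sorted(set(articles)), 2):
            counts[(a, b)] += 1
    return counts


def also_downloaded(article, counts, top_n=3):
    """Articles most often co-downloaded with `article`, best first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == article:
            related[b] += n
        elif b == article:
            related[a] += n
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical sessions harvested from resolver logs.
sessions = [["a1", "a2", "a3"], ["a1", "a2"], ["a2", "a4"]]
print(also_downloaded("a1", co_download_counts(sessions)))
```

Pooling logs from many institutions increases the number of sessions per article, which is what makes the pair counts dense enough to be useful.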
TechLens+ Citation-Based Recommendation
[Figure: articles (p2, p3, p5) and their references]

        R. Torres, S. McNee, M. Abel, J. Konstan, and J. Riedl. Enhancing Digital
        Libraries with TechLens+. (in JCDL 2004)
Does “Rating” Citations w/ PageRank Help?
[Figure: ratings matrix. Columns: citations p1 to p8. Rows: articles p1 to p4 and users u1, u2.
 Non-empty entries per row: p1: 0.4; p2: 0.5, 0.4; p3: 0.2, 0.6; p4: 0.7, 0.5; u1: 0.5, 0.3, 0.6; u2: 0.2, 0.3.
 Alternative: every entry = constant.]

Answer:
    Using PageRank to “rate” citations is not significantly
    better than using a constant (0/1).

Note:
    There is ongoing work w/ NRC on a machine learning method
    for extracting “most important references”, which might help more.
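
The two rating schemes compared above can be sketched as follows. This is a minimal illustration under stated assumptions: a textbook power-iteration PageRank with a damping factor of 0.85, where the rating of a citation is taken to be the PageRank of the cited paper. The slides do not say exactly how PageRank values were mapped to ratings, so that mapping, the function names, and the toy citation data are all hypothetical.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """Power-iteration PageRank over {paper: [papers it cites]}.

    Assumes each reference list contains no duplicates.
    """
    nodes = set(cites) | {c for refs in cites.values() for c in refs}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by papers without outgoing citations is spread evenly.
        dangling = sum(rank[p] for p in nodes if not cites.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for p, refs in cites.items():
            for c in refs:
                new_rank[c] += damping * rank[p] / len(refs)
        rank = new_rank
    return rank


def rating_matrix(cites, weights=None):
    """Rows are citing articles, columns are cited papers.

    With weights=None every citation gets the constant rating 1.0;
    otherwise the rating is the weight of the cited paper.
    """
    return {
        p: {c: (1.0 if weights is None else weights[c]) for c in refs}
        for p, refs in cites.items()
    }


# Hypothetical citation data.
citations = {"p1": ["p7"], "p2": ["p4", "p7"], "p3": ["p1", "p8"], "p4": ["p3", "p5"]}
constant = rating_matrix(citations)
rated = rating_matrix(citations, pagerank(citations))
```

Either matrix can then be passed to the same collaborative filter, which is what makes the constant-versus-PageRank comparison a controlled one.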
Sarkanto (NRC Article Recommender)
—  Uses TechLens+ strategy of replacing User-Item matrix with
    Article-Article matrix from citation data
—  Uses TASTE recommender (now the recommendation
    component of Mahout)
—  Is now decoupled from the user-based recommender
—  Can be compared side by side w/ ‘bX’ recommendations
Try it here:

     http://lab.cisti-icist.nrc-cnrc.gc.ca/Sarkanto/
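
The strategy of treating each article as a "user" whose "items" are the papers it cites can be sketched as below. This is not the Taste/Mahout implementation; it is a minimal stand-in that assumes binary ratings and cosine similarity between reference sets, with hypothetical names and data.

```python
from math import sqrt


def cosine(refs_a, refs_b):
    """Cosine similarity between two reference sets (binary ratings)."""
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / sqrt(len(refs_a) * len(refs_b))


def recommend(seed, references, top_n=3):
    """Articles whose reference sets are most similar to the seed's."""
    seed_refs = references[seed]
    scored = []
    for other, refs in references.items():
        if other == seed:
            continue
        score = cosine(seed_refs, refs)
        if score > 0:
            scored.append((score, other))
    # Highest similarity first; ties broken by article id for stability.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


# Hypothetical article -> set of cited papers.
references = {
    "a": {"r1", "r2", "r3"},
    "b": {"r1", "r2"},
    "c": {"r3", "r9"},
    "d": {"r7"},
}
print(recommend("a", references))
```

Because every article with a reference list has a row in this matrix, the approach does not depend on any user having interacted with the article.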
Sarkanto compared w/ bX

Sarkanto: “These are articles whose co-citations are similar to this one.”
bX: “Users who viewed this article also viewed these articles.”
Experiments
—  Sarkanto generated ~ 1.9 million citation-based
    recommendations (statically)
—  Experimental comparison done on 1886 randomly selected
    articles from a subset of ~ 1.2M articles (down from ~ 8M)
—  Questions asked in the experiment:
  —  How many recommendations are produced by each recommender?
  —  Coverage (how often does a seed article generate a
      recommendation?)
  —  How semantically diverse are the recommendations?
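
The first two questions reduce to simple counts over the seed articles. A minimal sketch, assuming each recommender's output is stored as a mapping from seed article to a list of recommended articles (the names and data below are hypothetical):

```python
def coverage(seeds, recommendations):
    """Fraction of seed articles that receive at least one recommendation."""
    covered = sum(1 for seed in seeds if recommendations.get(seed))
    return covered / len(seeds)


def covered_by_both(seeds, recs_a, recs_b):
    """Fraction of seeds for which both recommenders return something."""
    both = sum(1 for seed in seeds if recs_a.get(seed) and recs_b.get(seed))
    return both / len(seeds)


def mean_list_length(seeds, recommendations):
    """Average number of recommendations per seed (zero if none)."""
    return sum(len(recommendations.get(seed, [])) for seed in seeds) / len(seeds)


# Hypothetical outputs of two recommenders for four seed articles.
seeds = ["s1", "s2", "s3", "s4"]
citation_recs = {"s1": ["x", "y"], "s2": ["z"]}
usage_recs = {"s2": ["q"], "s3": ["r", "t", "u"]}
print(coverage(seeds, citation_recs), covered_by_both(seeds, citation_recs, usage_recs))
```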
Measuring Semantic Diversity




—  Question: what is the semantic distance between the source
    article and the recommendations?
—  In this setup it was not possible to compare the semantic distance
    directly without the full text for both sets of recommendations
—  Full text is available for the Sarkanto recommendations but not for
    the bX recommendations
Journal-Journal Semantic Distance
  —  Concatenate the full-text of all the articles in each journal
  —  From a Lucene index of the full text in each journal, use
     Dominic Widdows’ Semantic Vectors package to create
     —  a term-journal matrix,
     —  reduced dimensionality term-vectors (512) for each journal
        using random projections
  —  Apply multidimensional scaling (MDS) in R to the journal-journal
     distance matrix (2300 x 2300) to obtain a 2-D map
G. Newton, A. Callahan, and M. Dumontier. Semantic journal mapping for
search visualization in a large scale article digital library. In Second Workshop
on Very Large Digital Libraries, ECDL 2009
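
The random-projection step and the resulting journal-journal distances can be illustrated with a small pure-Python sketch. It does not reproduce the Semantic Vectors package or the Lucene indexing; it assumes each journal is already summarised as a term-count dictionary, uses dense random +/-1 vectors, and measures cosine distance. The dimension, names, and data are hypothetical (the slides use 512 dimensions).

```python
import random
from math import sqrt


def random_projection(term_counts, dim=8, seed=0):
    """Project each journal's term counts onto `dim` random +/-1 directions."""
    rng = random.Random(seed)
    vocabulary = sorted({t for counts in term_counts.values() for t in counts})
    # One fixed random vector per term; journals are sums of their terms' vectors.
    term_vectors = {
        t: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for t in vocabulary
    }
    projected = {}
    for journal, counts in term_counts.items():
        vec = [0.0] * dim
        for term, count in counts.items():
            for i, x in enumerate(term_vectors[term]):
                vec[i] += count * x
        projected[journal] = vec
    return projected


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric journal-journal distance matrix as a nested dict."""
    return {a: {b: cosine_distance(u, v) for b, v in vectors.items()}
            for a, u in vectors.items()}


# Hypothetical term counts per journal.
journals = {
    "J1": {"protein": 4, "gene": 2},
    "J2": {"protein": 8, "gene": 4},
    "J3": {"graph": 5, "algorithm": 3},
}
distances = distance_matrix(random_projection(journals))
```

A matrix like `distances` is the kind of input that an MDS routine takes to produce 2-D coordinates for plotting.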
2-D Journal Distance Map
[Figure: 2-D map of journals. Coloured clusters represent journal subject headings (from publisher metadata).]

http://cuvier.cisti.nrc.ca/~gnewton/torngat/applet.2009.07.22/index.html
Results: Diversity of Recommendations

—  ~13% of seed articles generated recommendations for both
    bX and Sarkanto (i.e. not much overlap!)
—  Citation-based recommendations appear to be more
    semantically diverse than User-based.
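
One way to turn "semantic diversity" into a number, given 2-D journal coordinates, is the mean distance between the seed article's journal and the journals of its recommendations. The sketch below assumes that definition, which the slides do not spell out; the coordinates and names are hypothetical.

```python
from math import dist


def mean_journal_distance(seed_journal, rec_journals, coords):
    """Average 2-D distance from the seed's journal to each recommended
    article's journal. Returns None when there are no recommendations."""
    if not rec_journals:
        return None
    total = sum(dist(coords[seed_journal], coords[j]) for j in rec_journals)
    return total / len(rec_journals)


# Hypothetical 2-D journal coordinates.
coords = {"J1": (0.0, 0.0), "J2": (3.0, 4.0), "J3": (0.0, 0.0)}
print(mean_journal_distance("J1", ["J2", "J3"], coords))
```

A larger average for one recommender than for the other, over the same seeds, would indicate that its recommendations range further across subject areas.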
Conclusions
—  Citation-based and User-based recommendations are
    complementary
—  Different kinds of data sources (users vs. citations) produce
    different kinds of (non-overlapping) results
—  Citation-based recommendations are more semantically diverse
    —  Hypothesis: “user-based recommendations may be biased by the semantic
        similarity of search-engine results”
