1/15
Why repository harvestability
matters
By Lucas Anastasiou,
KMi, The Open University
UKCoRR members meeting
3rd December 2013
2/15
Outline
• Exposing your metadata
• OAI organic issues
• OAI misuse
• Open Access principles
• Monitor tools
• Conclusion
3/15
Exposing your metadata
• One of the direct objectives of maintaining an
(institutional) repository is visibility and
dissemination
• OAI-PMH: a low-barrier mechanism for
repository interoperability
4/15
Exposing your content
• OAI-PMH is the protocol for listing your repository’s
resources and providing incremental updates
– Your repository’s ‘RSS feed’
• OAI-PMH is widely adopted
– OpenDOAR reports 2508 repositories worldwide
– Mature ‘out-of-the-box’ repository software supports it:
EPrints, DSpace, bepress
• ResourceSync on the horizon
– Do we need another standard?
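The ‘RSS feed’ analogy can be made concrete with a small sketch of how a harvester might ask for everything changed since its last visit. The base URL `https://repository.example.org/oai` and the helper name `list_records_url` are assumptions for illustration only; `verb`, `metadataPrefix` and `from` are standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    from_date is an optional YYYY-MM-DD datestamp; when given, the
    repository is asked only for records changed on or after that day.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint: ask for everything changed since 1 December 2013.
print(list_records_url("https://repository.example.org/oai", from_date="2013-12-01"))
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2013-12-01
```

A repository that ignores the `from` argument forces harvesters to re-download the full corpus on every visit.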
5/15
OAI-PMH organic issues
• No search mechanism: cannot filter by entity criteria
– E.g. get me all records of author X
• Built for metadata harvesting, not for content harvesting
• Weak pagination mechanism
– resumptionToken handling is a common source of problems
• Low granularity (per day)
• No default way to expose content differently according to
licensing
– though there is a workaround
• Harvesting the full corpus is a long procedure: data cannot be processed on
the fly, so a local copy is needed to work with it
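To illustrate the pagination point, here is a minimal sketch of a harvesting loop that follows resumptionTokens and stops if the repository hands back a token it has already issued. `fetch_page` is a hypothetical caller-supplied function (not part of any library) that returns the XML text of one ListRecords response; this is one possible shape for such a loop, not how any particular harvester is implemented.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch_page):
    """Yield record identifiers from successive ListRecords pages.

    fetch_page(token) is a hypothetical callable returning the XML text of
    one response; token is None for the first request.
    """
    seen_tokens = set()
    token = None
    while True:
        root = ET.fromstring(fetch_page(token))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token_el = root.find(f".//{OAI_NS}resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:
            return  # an absent or empty token marks the final page
        if token in seen_tokens:
            # A repeated token would make the harvester loop forever.
            raise RuntimeError(f"resumptionToken repeated: {token}")
        seen_tokens.add(token)
```

Because the whole list has to be walked page by page before anything can be queried, a harvester ends up keeping its own local copy of the corpus.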
6/15
OAI-PMH misuse
Common issues identified across CORE’s 609 tracked repositories
[Bar chart: number of repositories affected by each issue, x-axis 0 to 20]
• Misuse of external resolution service
• Error 500
• Soft/Hard 404
• Exclusion from robots.txt
• Bad ListRecords implementation
• DNS resolution error
• No incremental updates
• Invalid XML characters
• Wrong semantic tag
7/15
Open Access content principles
• Content referencing
– Open repositories should always
establish a link from the metadata record to the item the
metadata record describes using a dereferenceable
identifier pointing to the version held in the repository.
• Content accessibility to machine agents
– Open repositories must provide universal access to
machines with the same level of access as humans have.
8/15
Common issues (1)
Exclusion from robots.txt
User-agent: *
Sitemap: http://repository.jisc.ac.uk/sitemap.xml
Disallow: /
Since OAI-PMH is used for content harvesting, excluding harvesters from
content is a (major) issue
Breaks the principle that content should be universally machine
accessible
Presumption of innocence.
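A repository manager can check what a robots.txt file like the one on this slide means for a well-behaved harvester using Python's standard `urllib.robotparser`. This is a sketch under assumptions: the helper `can_harvest`, the user-agent name `example-harvester` and the PDF path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_harvest(robots_txt, url, user_agent="example-harvester"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The rules shown on the slide: every path is closed to every robot.
blocking = """User-agent: *
Sitemap: http://repository.jisc.ac.uk/sitemap.xml
Disallow: /
"""

# Hypothetical full-text URL on the same host.
print(can_harvest(blocking, "http://repository.jisc.ac.uk/123/1/paper.pdf"))  # False
```

An empty `Disallow:` line would have the opposite effect and leave the whole site open to robots.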
9/15
Common issues (2)
Misuse of external resolving service
. . .
<dc:identifier>
http://hdl.handle.net/123456789/1656
</dc:identifier>
. . .
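A check for this kind of identifier could look like the sketch below. It assumes that the prefix `123456789` in the slide's example is an unconfigured placeholder rather than a registered handle prefix; the function name `is_placeholder_handle` and the `repository.example.org` URL are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: this prefix is a placeholder left in place of a registered one.
PLACEHOLDER_PREFIX = "123456789"


def is_placeholder_handle(identifier):
    """Return True if identifier is an hdl.handle.net URL using the placeholder prefix."""
    parsed = urlparse(identifier.strip())
    if parsed.netloc.lower() != "hdl.handle.net":
        return False
    prefix = parsed.path.lstrip("/").split("/", 1)[0]
    return prefix == PLACEHOLDER_PREFIX


print(is_placeholder_handle("http://hdl.handle.net/123456789/1656"))  # True
print(is_placeholder_handle("http://repository.example.org/1656/"))   # False
```

An identifier flagged this way cannot be followed back to the item it is supposed to describe.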
10/15
Common issues (3)
11/15
Discoverability / Harvestability
• Improve discoverability by service providers
• Open Access repositories registries
– OpenDOAR
– ROAR
• Once a document is online and Open Access it should be
retrievable and processable by humans and machine agents
alike
• Follow guidelines / best practices, avoid pitfalls
12/15
Monitor tools
• Validation tools to check your repository’s metadata/content
exposure
– CORE repository analytics
– RIOXX Guidelines
• http://rioxx.net/
• Consistency of metadata fields, tracking of research outputs across
scholarly systems
– OpenAIRE
• https://www.openaire.eu/
• OpenAIRE guidelines, OpenAIRE validator service
13/15
CORE repository analytics
Simple information on how much content has been aggregated
by the CORE service, repository harvesting status and logs of
harvesting attempts (with issues included)
http://core.kmi.open.ac.uk/repository_analytics
14/15
So why does it matter after all?
• To achieve the primary goal of an IR: to ‘open and disseminate
research outputs to a worldwide audience’
• Provide the best quality content to service providers that
make your repository more ‘discoverable’, ‘accessible’,
‘reusable’. Aggregators can act as ambassadors on your behalf
• Add value (or avoid removing value) by disseminating your
content in the best possible way
15/15
Conclusion
The currently used protocol (OAI-PMH) has limitations
– By design
– By misuse
No need to embrace a new standard (yet); make the best
of the current standard
Unleashing your data to the web makes your organisation’s
research output more visible and expands your audience

Editor's Notes

  1. A low-level technique for achieving interoperability across the distributed ecosystem of institutional repositories is OAI-PMH
  2. Built for metadata, not for content; inconsistency for page content; granularity (by day); long procedure
  3. Built for metadata, not for content; inconsistency for page content; granularity (by day), e.g. arXiv has 2700 daily updates; long procedure
  4. X-axis: number of repositories. Similar problems were reported for the PerX project. UK repos: all 154.
     • Repository uses a URL such as http://hdl.handle.net/123456789/ in <dc:identifier>: misuse of external resolving service (hdl.handle.net) (3)
     • Error 500 (9)
     • Soft/hard 404s (15): hard 404s are easy to detect, while for soft 404s you need to examine the response and deduce the error
     • Whether CORE is allowed in robots.txt (6)
     • resumptionToken repeats: bad OAI-PMH implementation (4)
     • resumptionToken not valid (8)
     • No <dc:identifier> (1)
     • DNS not resolved/timeout (2)
     • 401 Unauthorized (requires authentication) (1)
     • Bad hdl.handle.net links (http://hdl.handle.net/10311/228) (2)
     • Invalid XML characters, ideally reporting the character and its location (16)
     • Connection timeout (2)
     • Records stored in <dc:identifier.uri> rather than <dc:identifier>: wrong semantics (2)
     • Detect whether the repository supports incremental harvesting: 3 repos do not support incremental updates
     • Detect when an incremental update returns: <error code="noRecordsMatch">The combination of the values of the from, until, set and metadataPrefix arguments result in an empty set.</error>
     • Detect when the repository is using a default/generic OAI identifier, e.g. oai:generic.eprints.org
  5. Concepts to facilitate the transition from open access metadata to open access content. Principle 1: the identifier must be actionable; if a record contains a URL, this URL must be accessible
  6. The service must be unobtrusively accessible. There are other ways to identify abusive robots and limit them
  7. Identifier is not actionable
  8. Make sure these registries hold the latest entry for your endpoint. Follow guidelines/best practices. Avoid pitfalls. Differentiate your licensed content using the OAI protocol’s Sets. AVOID password protection
  9. Raises an alarm to repository administrators if these numbers are not consistent with their datasets
  10. Focus on building services on content rather than on the content itself. Other service providers: CORE, BASE, Google Scholar?, RepUK. As if nobody’s reading
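Note 4 asks for invalid XML characters to be reported together with the character and its location. One way this could be done is sketched below, using the character ranges allowed by the XML 1.0 specification; the function name and the report format are assumptions, not a description of CORE's own tooling.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (line, column, code point) for each character XML 1.0 forbids."""
    problems = []
    # Split on "\n" only, so that forbidden control characters do not shift line numbers.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            problems.append((line_no, match.start() + 1, code_point))
    return problems


sample = "<a>ok</a>\n<b>bad\x0cchar</b>"
print(find_invalid_xml_chars(sample))  # [(2, 7, 'U+000C')]
```

A report of this kind tells the repository administrator exactly which record field needs cleaning, rather than only that parsing failed.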