1. The document discusses the importance of exposing institutional repository metadata and content through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH allows repositories to be interoperable and their content discoverable.
2. However, OAI-PMH has some limitations, such as no built-in search and weak pagination. It is also sometimes misused, for example when a repository's robots.txt file excludes harvesters from its content.
3. Tools like CORE Repository Analytics and RIOXX Guidelines can help monitor repositories and ensure metadata and content are exposed properly according to open access principles to maximize discoverability.
3. 3/15
Exposing your metadata
• One of the direct objectives of maintaining an
(institutional) repository is visibility and
dissemination
• OAI-PMH: a low-barrier mechanism for
repository interoperability
4. 4/15
Exposing your content
• OAI-PMH is the protocol to provide a list of your repository
resources and provide incremental updates
– Your repository’s ‘RSS feed’
• OAI-PMH is widely adopted
– OpenDOAR reports 2508 repositories worldwide
– Mature ‘out-of-the-box’ repository software supports it:
EPrints, DSpace, bepress
• ResourceSync on the horizon
– Do we need another standard?
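To make the 'RSS feed' analogy concrete, here is a minimal sketch of how a harvester might build an incremental ListRecords request. The endpoint URL and the function name are hypothetical; `verb`, `metadataPrefix`, `from` and `set` are standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Incremental update: only records changed on or after this date.
        params["from"] = from_date
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
print(list_records_url("https://repository.example.org/oai", from_date="2015-01-01"))
```

A harvester would typically store the date of its last successful run and pass it as `from_date` on the next one.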
5. 5/15
OAI-PMH organic issues
• No search mechanism: cannot filter by entity criteria
– E.g. get me all records by author X
• Built for metadata harvesting, not for content harvesting
• Weak pagination mechanism
– faulty resumptionToken handling is a common issue
• Low granularity (per day)
• No default way to expose content differently according to
licensing
– though there is a workaround
• Harvesting the full corpus is a long procedure: data cannot be
processed on the fly, so a local copy is needed to work with it
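The pagination weakness can be illustrated with a minimal sketch of a harvesting loop that follows resumptionTokens and stops if the repository hands back a token it has already issued. The `fetch` callable is a hypothetical stand-in for the HTTP request, so the sketch makes no assumption about any particular client library.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, max_pages=10000):
    """Yield <record> elements page by page, following resumptionTokens.

    `fetch(token)` is a hypothetical callable supplied by the caller. It
    returns the XML text of a ListRecords response: the first page when
    `token` is None, otherwise the page named by `token`.
    """
    token = None
    seen = set()
    for _ in range(max_pages):
        root = ET.fromstring(fetch(token))
        yield from root.iter(OAI_NS + "record")
        node = root.find(f".//{OAI_NS}resumptionToken")
        token = (node.text or "").strip() if node is not None else ""
        if not token:
            return  # missing or empty token: the list is complete
        if token in seen:
            # A repeated token would otherwise loop forever.
            raise RuntimeError(f"resumptionToken repeated: {token}")
        seen.add(token)
```

Because every page must be fetched in sequence, a full harvest of a large repository takes as many round trips as there are pages.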
6. 6/15
OAI-PMH misuse
Common issues identified by CORE’s 609 tracked repositories
[Bar chart: number of repositories affected by each issue; x-axis from 0 to 20]
• Misuse of external resolution service
• Error 500
• Soft/Hard 404
• Exclusion from robots.txt
• Bad ListRecords implementation
• DNS resolution error
• No incremental updates
• Invalid XML characters
• Wrong semantic tag
7. 7/15
Open Access content principles
• Content referencing
– Open repositories should always
establish a link from the metadata record to the item the
metadata record describes using a dereferenceable
identifier pointing to the version held in the repository.
• Content accessibility to machine agents
– Open repositories must provide universal access to
machines with the same level of access as humans have.
8. 8/15
Common issues (1)
Exclusion from robots.txt
User-agent: *
Sitemap: http://repository.jisc.ac.uk/sitemap.xml
Disallow: /
Since OAI-PMH is used for content harvesting, excluding harvesters
from content is a (major) issue
It breaks the principle that content should be universally machine
accessible
Presumption of innocence
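As a minimal sketch of why this robots.txt is a problem, the rules from the slide can be fed to Python's standard `urllib.robotparser` to see what a well-behaved harvester would conclude. The user-agent name "CORE" and the document path are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules shown on the slide, verbatim.
ROBOTS_TXT = """\
User-agent: *
Sitemap: http://repository.jisc.ac.uk/sitemap.xml
Disallow: /
"""


def can_harvest(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical agent name and document path, for illustration only.
print(can_harvest(ROBOTS_TXT, "CORE", "http://repository.jisc.ac.uk/1234/1/paper.pdf"))
```

With `Disallow: /` applied to every user agent, the check returns False for any path, so a compliant harvester can read the metadata over OAI-PMH but cannot retrieve the full text it points to.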
9. 9/15
Common issues (2)
Misuse of external resolving service
. . .
<dc:identifier>
http://hdl.handle.net/123456789/1656
</dc:identifier>
. . .
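A harvester could flag this kind of identifier automatically. The sketch below assumes that the handle prefix 123456789 is an unconfigured placeholder that the Handle resolver cannot route to the repository; the function name and the sample record wrapper are illustrative.

```python
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"
# Assumption: this prefix marks a placeholder, not a registered handle prefix.
PLACEHOLDER_PREFIX = "hdl.handle.net/123456789/"


def placeholder_handles(record_xml):
    """Return the dc:identifier values that use the placeholder handle prefix."""
    root = ET.fromstring(record_xml)
    values = [(el.text or "").strip() for el in root.iter(DC_IDENTIFIER)]
    return [v for v in values if PLACEHOLDER_PREFIX in v]


SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>
    http://hdl.handle.net/123456789/1656
  </dc:identifier>
</oai_dc:dc>"""

print(placeholder_handles(SAMPLE))
```

Any record this function flags carries an identifier that does not lead back to the item held in the repository.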
11. 11/15
Discoverability / Harvestability
• Improve discoverability by service providers
• Open Access repositories registries
– OpenDOAR
– ROAR
• Once a document is online and Open Access, it should be
retrievable and processable by both humans and machine
agents
• Follow guidelines / best practices, avoid pitfalls
12. 12/15
Monitoring tools
• Validation tools to check your repository's metadata/content
exposure
– CORE repository analytics
– RIOXX Guidelines
• http://rioxx.net/
• Consistency of metadata fields, tracking of research outputs across
scholarly systems
– OpenAIRE
• https://www.openaire.eu/
• OpenAIRE guidelines, OpenAIRE validator service
13. 13/15
CORE repository analytics
Simple information on how much content has been aggregated
by the CORE service, the repository's harvesting status, and logs
of harvesting attempts (with issues included)
http://core.kmi.open.ac.uk/repository_analytics
14. 14/15
So why does it matter after all?
• To achieve the primary goal of an IR: to ‘open and disseminate
research outputs to a worldwide audience’
• Provide the best quality content to service providers that
make your repository more ‘discoverable’, ‘accessible’ and
‘reusable’. Aggregators can act as ambassadors on your behalf
• Add value (or avoid removing value) by disseminating your
content in the best possible way
15. 15/15
Conclusion
The currently used protocol (OAI-PMH) has limitations
– By design
– By misuse
No need to embrace a new standard (yet); make the best of
the current one
Unleashing your data to the web makes your organisation's
research output more visible and expands your audience
Editor's notes
OAI-PMH is a low-level technique for achieving interoperability across the distributed ecosystem of institutional repositories
Built for metadata, not for content
Inconsistency for page content
Granularity (by day), e.g. arXiv has 2700 daily updates
Long procedure
X-axis: number of repositories
Similar problems were reported by the PerX project
UK repos: all 154
• Repository uses a URL such as http://hdl.handle.net/123456789/ in <dc:identifier> (3)
misuse of external resolving service (hdl.handle.net)
• Error 500 (9)
• Soft/hard 404s (15)
Hard 404s are easy to detect; for soft 404s you need to examine the response and infer the error
• Whether CORE is allowed in robots.txt (6)
• resumptionToken repeats (4)
bad OAI-PMH implementation
• resumptionToken not valid (8)
• No <dc:identifier> (1)
• DNS not resolved/timeout (2)
• 401 Unauthorized (requires authentication) (1)
• Bad hdl.handle.net links (http://hdl.handle.net/10311/228) (2)
• Report invalid XML characters, ideally with the character and its location (16)
• Connection timeout (2)
• Records stored in <dc:identifier.uri> rather than <dc:identifier> (2)
wrong semantics
• Detect whether incremental harvesting is supported by the repository
3 repos do not support incremental updates
• Detect when doing incremental updates:
<error code="noRecordsMatch">The combination of the values of the from, until, set and metadataPrefix arguments result in an empty set.</error>
• Detect when the repository is using a default/generic OAI identifier, e.g. oai:generic.eprints.org
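The noRecordsMatch case noted above can be told apart from a real failure by reading the error code in the response. A minimal sketch follows; the function names and the sample response are illustrative, while the `code` attribute on `<error>` is part of the OAI-PMH response format.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_error_codes(response_xml):
    """Return the list of error codes present in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [el.get("code") for el in root.iter(OAI_NS + "error")]


def is_empty_increment(response_xml):
    """True when the response only says that nothing changed in the date range."""
    return "noRecordsMatch" in oai_error_codes(response_xml)


SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="noRecordsMatch">The combination of the values of the from,
  until, set and metadataPrefix arguments result in an empty set.</error>
</OAI-PMH>"""

print(is_empty_increment(SAMPLE))
```

Treating this code as "nothing new" rather than as a harvesting error keeps the logs of incremental runs free of false alarms.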
Concepts to facilitate the transition from open access metadata to open access content
Principle 1:
The identifier must be actionable: if a record contains a URL, this URL must be accessible
The service must be unobtrusively accessible
There are other ways to identify abusive robots and limit them
Identifier is not actionable
Make sure these registries hold the latest entry for your endpoint
Follow guidelines/ best practices
Avoid pitfalls
Differentiate your licensed content using OAI protocol’s Sets
AVOID password protection
Raises an alarm to repository administrators if these numbers are not consistent with their datasets
Focus on building services on content rather than on the content
Other service providers: CORE, BASE, Google Scholar?, RepUK
As if nobody’s reading