SlideShare a Scribd company logo
1 of 123
Advanced Overview of Version 2.0 of the
Open Archives Initiative
Protocol for Metadata Harvesting
Michael L. Nelson
Old Dominion University
Norfolk VA
mln@cs.odu.edu
Herbert Van de Sompel
Los Alamos National Laboratory
Los Alamos NM
herbertv@lanl.gov
Simeon Warner
Cornell University
Ithaca NY
simeon@cs.cornell.edu
ACM/IEEE Joint Conference on Digital Libraries
Houston, Texas
13:30 - 17:00
May 27 2003
latest version at: http://www.cs.odu.edu/~mln/jcdl03/
Scope and Focus
• This Tutorial is not…
– an introduction to OAI-PMH
– a listing of all the wonderful projects that use
OAI-PMH
– a discussion of the merits of metadata
harvesting vs. distributed searching
• A passing familiarity is assumed for:
– web / http interaction
– Dublin Core / metadata
Outline
• How 2.0 evolved from SFC and 1.x
– people, processes, events
• 2.0 highlights
– comparison with 1.x
• Guidelines, recommendations, best
practices for 2.0 implementations
– harvesters, repositories, aggregators, optional
containers
• Novel applications of OAI-PMH
from
Santa Fe Convention
to
OAI-PMH v.2.0
about eprints
document
like objects
resources
metadata OAMS
unqualified
Dublin Core
unqualified
Dublin Core
transport HTTP HTTP HTTP
responses XML XML XML
requests HTTP GET/POST HTTP GET/POST HTTP GET/POST
verbs Dienst OAI-PMH OAI-PMH
nature experimental experimental stable
model
metadata
harvesting
metadata
harvesting
metadata
harvesting
Santa Fe
convention
OAI-PMH
v.1.0/1.1
OAI-PMH
v.2.0
Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
• input:
• the UPS prototype
• RePEc /SODA “data provider / service
provider model”
• Dienst protocol
• deliberations at Santa Fe meeting
[10/99]
OAI-PMH v.1.0 [01/2001]
• goal: optimize discovery of document-like
objects
• input:
• SFC
• DLF meetings on metadata harvesting
• deliberations at Cornell meeting [09/00]
• alpha test group of OAI-PMH v.1.0
• low-barrier interoperability specification
• metadata harvesting model: data provider / service
provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
OAI-PMH v.1.0 [01/2001]
Selected Pre- 2.0 OAI Highlights
• October 21-22, 1999 - initial UPS meeting
• February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
– precursor to the OAI metadata harvesting protocol
• June 3, 2000 - workshop at ACM DL 2000 (Texas)
• August 25, 2000 - OAI steering committee formed, DLF/CNI support
• September 7-8, 2000 - technical meeting at Cornell University
– defined the core of the current OAI metadata harvesting protocol
• September 21, 2000 - workshop at ECDL 2000 (Portugal)
• November 1, 2000 - Alpha test group announced (~15 organizations)
• January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the
U.S. (Washington DC)
– purpose: freeze protocol for 12-16 months, generate critical mass
• February 26, 2001 - OAI Open Day in Europe (Berlin)
• July 3, 2001 - OAI protocol 1.1 announced
– to reflect changes in the W3C’s XML latest schema
recommendation
• September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
OAI-PMH v.2.0 [06/2002]
• goal: recurrent exchange of metadata about resources
between systems
• input:
• OAI-PMH v.1.0
• feedback on OAI-implementers
• deliberations by OAI-tech [09/01 - 06/02]
• alpha test group of OAI-PMH v.2.0 [03/02 - 06/02]
•officially released June 14, 2002
• low-barrier interoperability specification
• metadata harvesting model: data provider / service
provider
• metadata about resources
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• stable
OAI-PMH v.2.0 [06/2002]
releasing OAI-PMH v.2.0
(illustrating the OAI process)
See also
Lagoze, Carl and Van de Sompel, Herbert. The making of the Open
Archives Initiative Protocol for Metadata Harvesting. 2003. Library Hi
Tech. v21, N2. Draft
 pre-alpha phase
 alpha-phase
 creation of OAI-tech
 beta-phase
• created for 1 year period
• charge:
• review functionality and nature of OAI-PMH v.1.0
• investigate extensions
• release stable version of OAI-PMH by 05/02
• determine need for infrastructure to support broad
adoption of the protocol
• communication: listserv, SourceForge, conference calls
creation of OAI-tech [06/01]
US representatives
Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim
Cole - (U of Illinois at Urbana Champaign) - Hussein Suleman
(Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson
(NASA) - Caroline Arms (LoC) - Mohammad Zubair (Old
Dominion U) - Steven Bird (U Penn.)
European representatives
Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) -
Thomas Baron (CERN) - Les Carr (U of Southampton)
OAI-tech
• review process by OAI-tech:
• identification of issues
• conference call to filter/combine issues
• white paper per issue
• on-line discussion per white paper
• proposal for resolution of issue by OAI-exec
• discussion of proposal & closure of issue
• conference call to resolve open issues
pre-alpha phase [09/01 – 02/02]
• creation of revised protocol document
• in-person meeting Lagoze - Van de Sompel -
Nelson – Warner
• autonomous decisions
• internal vetting of protocol document
pre-alpha phase [02/02]
• alpha-1 release to OAI-tech March 1st
2002
• OAI-tech extended with alpha testers
• discussions/implementations by OAI-tech
• ongoing revision of protocol document
alpha phase [02/02 – 05/02]
• The British Library
• Cornell U. -- NSDL project & e-print arXiv
• Ex Libris
• FS Consulting Inc -- harvester for my.OAI
• Humboldt-Universität zu Berlin
• InQuirion Pty Ltd, RMIT University
• Library of Congress
• NASA
• OCLC
OAI-PMH 2.0 alpha testers (1/2)
OAI-PMH 2.0 alpha testers (2/2)
• Old Dominion U. -- ARC , DP9
• U. of Illinois at Urbana-Champaign
• U. Of Southampton -- OAIA (now Celestial),
CiteBase, eprints.org
• UCLA, John Hopkins U., Indiana U., NYU -- sheet
music collection
• UKOLN, U. of Bath -- RDN
• Virginia Tech -- repository explorer
beta phase [05/02-06/02]
• beta release on May 1st 2002 to:
• registered data providers and service
providers
• interested parties
• fine tuning of protocol document
• preparation for the release of 2.0
conformant tools by alpha testers
OAI-PMH v.2.0 highlights
 important improvements in 2.0
 quick recap
Overview of OAI-PMH Verbs
Verb Function
Identify description of repository
ListMetadataFormats metadata formats supported by
repository
ListSets sets defined by repository
ListIdentifiers OAI unique ids contained in
repository
ListRecords listing of N records
GetRecord listing of a single record
metadata
about the
repository
harvesting
verbs
most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
important improvements
resource
all available metadata
about David
item
Dublin Core
metadata
MARC
metadata
SPECTRUM
metadata records
item = identifier
record = identifier + metadata format + datestamp
set-membership is
item-level property
resource – item - record
OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
• all OK at HTTP level? => 200 OK
• something wrong at OAI-PMH level? =>
OAI-PMH error (e.g. badVerb)
• http codes 302, 503, etc. still available to
implementers, but no longer represent OAI-PMH
events
other improvements
• all protocol responses can be validated with
a single XML Schema
• easier for data providers
• no redundancy in type definitions
• SOAP-ready
• clean for error handling
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-0208T08:55:46Z</responseDate>
<request verb=“GetRecord”… …>http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
…..
</metadata>
</record>
</GetRecord>
</OAI-PMH>
response no errors
note no http encoding
of the OAI-PMH request
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-0208T08:55:46Z</responseDate>
<request>http://arXiv.org/oai2</request>
<error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
response with error
with errors, only the correct
attributes are echoed in
<request>
• Identify more expressive
Identify
<Identify>
<repositoryName>Library of Congress 1</repositoryName>
<baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
<adminEmail>rgillian@visi.net</adminEmail>
<earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
<deletedRecord>transient</deletedRecord>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>deflate</compression>
protocol vs periphery
• clear distinction between protocol and
periphery
• fixed protocol document
• extensible implementation guidelines:
• e.g. sample metadata formats, description
containers, about containers
• allows for OAI guidelines and community
guidelines
• introduction of provenance container to
facilitate tracing of harvesting history
provenance
<about>
<provenance>
<originDescription>
<baseURL>http://an.oa.org</baseURL>
<identifier>oai:r1:plog/9801001</identifier>
<datestamp>2001-08-13T13:00:02Z</datestamp>
<metadataPrefix>oai_dc</metadataPrefix>
<harvestDate>2001-08-15T12:01:30Z</harvestDate>
<originDescription>
… … …
</originDescription>
</originDescription>
</provenance>
</about>
• introduction of friends container to
facilitate dynamic discovery of repositories
friends
<description>
<friends>
<baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
<baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
<baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
<baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
</friends>
</description>
• introduction of branding container for
DPs to suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
http://www.openarchives.org/OAI/2.0/branding.xsd">
<collectionIcon>
<url>http://my.site/icon.png</url>
<link>http://my.site/homepage.html</link>
<title>MySite(tm)</title>
<width>88</width>
<height>31</height>
</collectionIcon>
<metadataRendering
metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
<metadataRendering
metadataNamespace="http://another.place/MARC"
mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>
branding
• revision of oai-identifier
<description>
<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-
identifier"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-
identifier
http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
</oai-identifier>
</description>
oai-identifier
domain based
repository names
• OAI 1.x: oai_dc Schema defined by OAI
• OAI 2.0: oai_dc Schema imports from DCMI
Schema for unqualified DC elements
oai_dc
• OAI 1.x: oai_marc
• OAI 2.0: LoC marxml, oai_marc
– http://www.loc.gov/standards/marcxml/
MARC21
did not make it into OAI-PMH v.2.0
• SOAP implementation
• Result set filtering
• “all” / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”
Detailed Review of the
OAI-PMH 2.0 Verbs
Identify
• Arguments
– none
• Errors
– none
• Arguments
– none
• Errors
– badArgument
1.1 2.0
ListMetadataFormats
• Arguments
– identifier
(OPTIONAL)
• Errors
– id does not exist
• Arguments
– identifier
(OPTIONAL)
• Errors
– badArgument
– noMetadataFormats
– idDoesNotExist
1.1 2.0
ListSets
• Arguments
– resumptionToken
(EXCLUSIVE)
• Errors
– no set hierarchy
• Arguments
– resumptionToken
(EXCLUSIVE)
• Errors
– badArgument
– badResumptionToken
– noSetHierarchy
1.1 2.0
ListIdentifiers
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken
(EXCLUSIVE)
• Errors
– no records match
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken
(EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– noRecordsMatch
1.1 2.0
ListRecords
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken
(EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– no records match
– metadata format cannot be
disseminated
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken
(EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– noRecordsMatch
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– badArgument
1.1 2.0
GetRecord
• Arguments
– identifier (REQUIRED)
– metadataPrefix
(REQUIRED)
• Errors
– id does not exist
– metadata format cannot
be disseminated
• Arguments
– identifier (REQUIRED)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
1.1 2.0
Argument Summary
metadataPrefix from until set resumptionToken identifier
Identify      
ListMetadata
Formats
     optional
ListSets     exclusive 
ListIdentifiers
 optional optional optional exclusive 
ListRecords
 optional optional optional exclusive 
GetRecord
     
Error Summary
Identify BA
ListMetadata
Formats
BA NMF IDDNE
ListSets BA BRT NSH
ListIdentifiers BA BRT CDF NRM NSH
ListRecords BA BRT CDF NRM NSH
GetRecord BA CDF IDDNE
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification
Repository
Implementation
(see also Repository Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm)
Minimal Repository
• 2.0 provides many expressive, but optional
features
– but still low barrier!
• if you are writing your own repository software,
the quickest path to implementation can involve
initially:
– only supporting DC
– skipping: <about>, sets, compression
– skip flow control (resumptionTokens) if < 1000 items
• add optional features as requirements and
familiarity allows
Be Honest with datestamp!
• a change in the process of dynamic generation of a
metadata format really does mean all records have
been updated!
– harvester caveat: an incremental harvest could yield an
entire repository dump if all the date stamps change (for
example, if the metadata mapping rules change)
if (internalItemDatestamp > disseminationInterfaceDatestamp) {
datestamp = internalItemDatestamp
} else {
datestamp = disseminationInterfaceDatestamp
}
Not Hiding Updates
• OAI-PMH is designed to allow incremental
harvesting
• Updates must be available by the end of the
period of the datestamp assigned, i.e.
– Day granularity => during same day
– Seconds granularity => during same second
• Reason: harvesters need to overlap requests by
just one datestamp interval (one day or one
second)
– in 1.x, 2 intervals were required (in many circumstances)
State in resumptionTokens
• HTTP is stateless
• resumptionTokens allow state information to be
passed back to the repository to create a
complete list from sequence of incomplete lists
• EITHER – all state in resumptionToken
• OR – cache result set in repository
Caching the Result Set
• Repository caches results of initial request,
returns only incomplete list
• resumptionToken does not contain all state
information, it includes:
– a session id
– offset information, necessary for idempotency
• resumptionToken allows repository to return
next incomplete list
• increased complexity due to cache management
– but a potential performance win
All State in the
resumptionToken
• Arrange that remaining items/headers in complete list
response can be specified with a new query and encode that
in resumptionToken
• One simple approach is to return items/headers in id order
and make the new query specify the same parameters and
the last id return (or by date)
– simple to implement, but possibly longer execution times
• Can encode parameters very simply:
<resumptionToken>metadataPrefix=oai_dc
from=1999-02-03&until=2002-04-01&
lastid=fghy:45:123</resumptionToken>
resumptionToken attributes (1)
• expirationDate – likely to be useful when cache clean-up
schedule is known
– Do not specify expirationDate if all state in
resumptionToken
• badResumptionToken error to be used if
resumptionToken expired
– May also be used if request cannot be completed for some other
reason
• e.g.: if repository changes cause the incomplete list to have no
records
– issue badRT’s judiciously; it can invalidate a lot of effort by a lot
of harvesters
resumptionToken attributes (2)
• completeListSize and cursor optionally
provide information about size of complete
list and number of records so far
disseminated
– not (currently) widely used
– use consistently if used
– designed for status monitoring
– caveat harvester: completeListSize may be
approximate and may be revised
resumptionToken
The only defined use of resumptionToken is as follows:
•a repository must include a resumptionToken element as
part of each response that includes an incomplete list;
•in order to retrieve the next portion of the complete
list, the next request must use the value of that
resumptionToken element as the value of the
resumptionToken argument of the request;
•the response containing the incomplete list that
completes the list must include an empty resumptionToken
element;
Flow Control & Load Balancing
How to respond to a “bad” harvester:
1. HTTP status code 200; response to OAI-PMH request
with a resumptionToken.
2. HTTP status code 503; with the Retry-After header
set to an appropriate value if subsequent request
follows too quickly or if the server is heavily loaded.
3. HTTP status code 403; with an appropriate reason
specified if subsequent requests do not adhere to
Retry-After delays.
302 Load Balancing
• Interactive users on main DL machine should not be
impacted by metadata harvesting
– don’t take deliveries through the front door
– not part of the protocol; defined outside the protocol
OAI
Server
naca.larc.nasa.gov/oai/
if load > 0.50
redirect request
OAI
Server
buckets.dsi.internet2.edu/naca/oai/
harvester
http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc
HTTP Status Code 302
http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc
<?xml version=“1.0” encoding=“UTF-8”?>
…
<ListIdentifiers>
…
</ListIdentifiers>
DNS Load Balancing
• using a DNS rotor, establish
– a.foo.org, b.foo.org, c.foo.org
– each with a synchronized copy of the repository
– let DNS & chance distribute the load
– implication: if resumptionTokens could issued to
loosely synchronized servers, it is likely that
the rTs will be stateful
Load Balancing Caveats
• Copies of the repository must be synchronized
– (cf. Pande, et al. JCDL 02)
• Complex hierarchies are possible
– programmer must insure no cycles in redirection graphs!
• The baseURL in the reply must always point to the
original repository, not the repository that
eventually answered the request
Error Handling: Verbosity
More is better…
<error code="badArgument">Illegal argument ‘foo’</error>
<error code="badArgument">Illegal argument ‘bar’</error>
is preferred over:
<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>
which is preferred over:
<error code="badArgument">Illegal arguments</error>
Error Handling: Levels
• the OAI-PMH error / exception conditions are for
OAI-PMH semantic events
• they are not for situations when:
– the database is down
– a record is malformed
• remember: record = id + datestamp + metadataPrefix
• if you’re missing one of those, you don’t have an OAI record!
– and other conditions that occur outside the OAI scope
• use http codes 500, 503 or other appropriate values to indicate
non-OAI problems
Error Handling: Extensions
• Arguments that are not 'required', 'optional' or
'exclusive’ are 'illegal' and should generate
badArgument errors.
• If you want to extend the OAI-PMH:
– stop and consider: do you really need to?
• maybe you should have different OAI-PMH interfaces, or creative
metadata formats
– if you really, really want to, tunnel your extensions through the
“set” feature
• see http://www.dlib.org/dlib/december01/suleman/12suleman.html for
examples
Idempotency of “List”
Requests (1)
• Purpose is to allow harvesters to recover from lost
responses or crashes without starting a large
harvest from scratch
• Recover by re-issuing request using
resumptionToken from previous request
• IMPLICATION: harvester must accept both the
most recent resumptionToken issued and the
previous one
Idempotency of “List”
Requests (2)
• response to a re-issued request must contain all unchanged
records
• any changed records will get new datestamps after time of
initial request
• changes will be picked up by subsequent harvest if not
included
[no experience yet with incomplete responses to ListSets or
ListMetadataFormats requests]
Case Study: “bucket” based repositories
• Buckets: see Nelson & Maly, CACM 44(5)
• 2.0
– NTRS - ntrs.nasa.gov/ (MySQL, DC)
– LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)
– NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer)
• 1.1
– LTRS - techreports.larc.nasa.gov/ltrs/oai/
– NACA - naca.larc.nasa.gov/ltrs/oai/
– Open Video - www.open-video.org/oai/ (MySQL, local)
– JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local)
– GLTRS (filesystem, HTML scraping)
• Characteristics:
– resumptionToken support initially skipped; added later (all)
• highly encoded rT’s: [2001-01-01!!!!301!600]
– sets initially skipped, added later (LTRS)
– initially had load balancing with 2 NACA repositories…
Case Study: “bucket” based repositories
• in bucket terminology:
– 6 OAI verbs (methods) added to the existing list
of methods
• http://ntrs.nasa.gov/?method=list_methods
• http://ntrs.nasa.gov/?method=list_source&target=ListIdentifiers
– a data element is added to the bucket that contains
the specifics of the particular repository and its
metadata format
• http://ntrs.nasa.gov/?method=display&pkg_name=oai&element_na
me=oai.pl
Harvester
Implementation and Use
(see also Harvester Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-
repository.htm )
Be a Polite OAI Neighbor
• Re-use existing free harvester software/libraries:
http://www.openarchives.org/tools/index.html
• If you insist on writing your own harvester, read
http://www.robotstxt.org/wc/robots.html
• Provide meaningful User-Agent & From headers
– Should be present in HTTP headers of all robot requests
– Should be configured even if using someone else’s harvester
Harvesting Sequence
• Issue Identify request
– Check OAI-PMH version
– Check baseURL, granularity, compression
• Issue ListMetadataFormats request
– Get information regarding selected metadataPrefix
• Issue ListSets request if using sets
– Check set structure matches expectation
• Issue ListIdentifier or ListRecords
request
– Continue until end of complete list
Listen to the Repository
• Check Identify’s <granularity> element if you wish to use finer
than YYYY-MM-DD
• If you harvest with sets, remember that “:” indicates hierarchy
– harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
• Check for and handle non-200 HTTP status codes, 503, 302 and
4xx in particular
• Empty resumptionToken => end of complete list
• Ask for compressed responses if the repository supports them
Harvesting Everything
• Issue an Identify request to find protocol version, finest datestamp
granularity supported, if compression is supported…
• Issue a ListMetadataFormats request to obtain a list of all
metadataPrefixes supported.
• Harvest using a ListRecords request for each metadataPrefix
supported. Knowledge of the datestamp granularity allows for less
overlap in incremental harvesting if granularities finer than a day are
supported.
• Set structure can be inferred from the setSpec elements in the header
blocks of each record returned (consistency checks are possible).
• Items may be reconstructed from the constituent records.
• Provenance and other information in <about> blocks may be re-
assembled at the item level if it is the same for all metadata formats
harvested. However, this information may be supplied differently for
different metadata formats and may thus need to be store separately
for each metadata format.
Harvesting v1.1 and v2.0
• Not difficult to handle
both cases, test
Identify response:
– v1.1: <Identify>
<protocolVersion>
– v2.0 <OAI-PMH>
<Identify>
<protocolVersion>
• Different error
and exception
handling
• Many similarities,
harvesters can
share lots of code
Harvesting Demo
• Harvester written in Perl (Uses LWP, Expat and
XML::Parser, no schema validation)
• Handles v1.0, v1.1 and v2.0
• Sequence of requests: Identify, ListMetadataFormats,
ListSets then ListRecords/ListIdentifiers
• Support for incremental harvesting, uses responseDate
from last harvest to get new start datestamp
• Supports response compression (gzip, compress)
• UTF-8 conditioning to deal with some “imperfect”
repositories
Harvesting logs
• Alan Kent’s v2.0 harvester logs:
http://www.inquirion.com:8123/public/collList;collListCmd=list
• Alan Kent’s summary of v1.1 harvesting results
http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm
• Celestial v1.1 harvesting logs
http://celestial.eprints.org/cgi-bin/status
• DP9 gateway using arc harvested information
http://arc.cs.odu.edu:8080/dp9/index.jsp
<friends> example (1)
A light-weight, data-provider driven way to communicate
the existence of “others”, e.g.
http://ntrs.nasa.gov/?verb=Identify
…
<description>
<friends …namespace stuff… >
<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
<baseURL>http://eprints.riacs.edu/perl/oai/</baseURL>
<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
</friends>
</description>
…
http://techreports.larc.nasa.gov/ltrs/oai2.0/ http://naca.larc.nasa.gov/oai2.0/
http://ntrs.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://eprints.riacs.edu/perl/oai/
harvester
Identify
<friends> example (2)
<friends>
Use of <friends>
Aggregator / Cache / Proxy
Implementation
(see also Aggregator Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-
aggregator.htm)
<provenance> & datestamps
• Reminder: datestamps are local to the
repository, a re-exporting service
must use new local datestamps
• Such services should use the
<provenance> container to preserve
the original datestamps and other
information
Identifiers are Local
• Identifiers are local to the repository
• Unless you absolutely did not change the
metadata and the identifier corresponds to a
recognized URI scheme, use a new identifier upon
re-exporting
– use the <provenance> container to preserve the
harvesting history
oai-identifier
• Just one option for identifiers in OAI-PMH
• The v2.0 oai-identifier scheme is not
compatible with v1.1:
– repositoryName now domain name based
– not reliant upon OAI centralized registration
• One-to-one mapping for escaping
characters: %3F allowed, %3f not
– allows simple comparison
Derived from the same item?
3 ways to determine if records share provenance
from the same item:
1. both records have the same identifier and the
baseURL in the request elements of the OAI-PMH
reponses which include the record are the same;
2. both records have the same identifier and that
identifier belongs to some recognized URI scheme;
3. the provenance containers of both records have the
same entries for both the identifier and baseURL;
<provenance> example (1)
<responseDate>2002-02-08T08:55:46.1</responseDate>
<request verb="GetRecord" metadataPrefix="odd_fmt"
identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
<GetRecord ...namespace stuff…
<record>
<header>
<identifier>oai:odd.oa.org:z1x2y3</identifier>
<datestamp>1999-08-07T06:05:04Z</datestamp>
</header>
<metadata> …metadata record in odd_fmt… </metadata>
</record>
</GetRecord>
Consider a request from crosswalker.oa.org:
http://odd.oa.org?verb=GetRecord
&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
Imagine that crosswalker.oa.org cross-walks
harvested metadata from odd_fmt into oai_marc and
then re-exposes the metadata with new identifiers.
A request from getmarc.oa.org:
http://crosswalker.oa.org?verb=GetRecord
&identifier=oai:cw.oa.org:z1x2y3
&metadataPrefix=oai_marc
might then yield the following response from
crosswalker.oa.org:
<provenance> example (2)
<provenance> example (3)
<record>
<header>
<identifier>oai:cw.oa.org:z1x2y3</identifier>
<datestamp>2002-02-09T01:15:43Z</datestamp>
</header>
<metadata> ...metadata record in oai_marc... </metadata>
<about>
<provenance …namespace stuff… >
<originDescription harvestDate="2002-02-08T08:55:46Z“
altered="true">
<baseURL>http://odd.oa.org</baseURL>
<identifier>oai:odd.oa.org:z1x2y3</identifier>
<datestamp>1999-08-07T06:05:04Z</datestamp>
<metadataNamespace>http://odd.oa.org/odd_fmt</..>
</originDescription>
</provenance>
</about>
</record>
<provenance> example (4)
This oai_marc record is then re-exposed by
getmarc.oa.org with the same identifier
oai:cw.oa.og:z1x2y3 (because the record
has not been altered).
The associated <provenance> container might
be:
<provenance> example (5)
<record>
<header>
<identifier>oai:cw.oa.org:z1x2y3</identifier>
<datestamp>2002-03-01T01:46:11Z</datestamp>
</header>
<metadata> ...metadata record in oai_marc... </metadata>
<about>
<provenance …namespace stuff…>
<originDescription harvestDate=“2002-03-01T01:23:45” altered=“false”>
<baseURL>http://crosswalker.oa.org/<baseURL>
<identifier>oai:cw.oa.org:z1x2y3</identifier>
<datestamp>2002-02-09T01:15:43Z</datestamp>
<metadataNamespace>http://../oai_marc</metadataNamespace>
<originDescription harvestDate="2002-02-08T08:55:46Z” altered="true">
<baseURL>http://odd.oa.org</baseURL>
<identifier>oai:odd.oa.org:z1x2y3</identifier>
<datestamp>1999-08-07T06:05:04Z</datestamp>
<metadataNamespace>http://odd.oa.org/odd_fmt</metadateNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
Case Studies
Example data-provider
implementations for:
arXiv.org
NSDL metadata repository
arXiv (1)
http://arXiv.org/oai2
• Existing system, running >11 years, written mostly in
Perl
• Flat file system for ‘database’
• 230k papers with metadata in homebrew format
• ~200 updates/day. OAI repository just one view of
system, must integrate with daily update schedule
arXiv (2)
• Write in Perl
– Easy integration with rest of system, reuse
code from v1.0/v1.1 interface
– Use libwww; XML::DOM
• Daily rebuild of datestamp database
– No existing date in system appropriate
– Base on Unix cdate of metadata files
• On-the-fly metadata translation
– Straightforward, avoids data duplication
arXiv (3)
• Flow control to avoid loading server and to
avoid harvesters tripping robot alarms
– resumptionTokens to limit response size
(1500 records or 15k identifiers / response)
– 503 Retry-After replies based on client ip
• Implement resumptionTokens that
include all state
– Avoid need to cache result sets / clean cache
NSDL Metadata Repository
http://services.nsdl.org:8080/nsdloai/OAI
• Implemented as an integral part of a new system
• Expect heavy load; db target size >10M items; stateless
resumptionTokens
• Java servlets; Xerces; Oracle (JDBC interface); strict validation
throughout
• Based on rewrite of Cocoa (NCSA UIUC)
• Integral to NSDL services model: provides data for user
interface and search services
looking ahead:
novel uses of OAI-PMH
• “Using OAI-PMH…Differently” Young, Van de
Sompel, Hickey, D-Lib Magazine, 9(7/8),
2003
– DL Usage logs ~ LANL
– Registry of metadata formats for OpenURL ~
OCLC & LANL
• http://www.openurl.info/registry/
• http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf
– GSFAD Thesaurus ~ OCLC
• Other uses?
OAI-PMH access to DL usage logs
• usage logs filtered and stored in MySQL db
• accessible as 2 OAI-PMH repositories:
• document oriented
• agent oriented (user-proxy)
• interlinked
• recommender system:
• harvests logs
• interpretes logs
• exposes relationships (OpenURL access)
Phase 1: creating recommender system
Document
logs
local
Agent logs
local
document log agent log
Log processing
Log
based
recom.
system
About local and remote data
ISI
citations
local
Inspec
bibliographic
local
PubMed
bibliographic
remote
Biosis
bibliographic
local
biblio
or
citation
i
d
recommended
identifiers
Phase 2: requesting recommendations
Log
based
recom.
system
relate
OpenURL
agent
alog:IP:128.1.22.13
Repository 1
docs accessed
by agent
about
agent
alog
document
dlog:ori:pmid:258471
Repository 2
agents accessing
the document
about
document
dlog
• log repository screencam
OAI-PMH access to DL usage logs
OAI-PMH-conformant OpenURL Registry
• NISO OpenURL Framework builds on Registry
• Registry entry:
• unique identifier
• always DC record
• sometimes XHTML or XML Schema
definition
OAI-PMH-conformant OpenURL Registry
• Collaboration with OCLC Office of Research:
• Registry is OAI-PMH harvestable
• Registry is browseable through overlaying
of PURL and XSLT
registered
item
ori:enc:UTF-8
OpenURL Registry
about
registered
item
character
encoding
XML Schema
defining the
metadata format
xsd
registered
item
ori:fmt:xml:xsd:book
about
registered
item
OpenURL Registry
metadata
format
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="gui.xsl" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-02-08T12:00:01Z</responseDate>
...
</OAI-PMH>
OAI-PMH-conformant OpenURL Registry
• Insert XSLT stylesheet reference in OAI-
PMH response => make repository browseable
OAI-PMH-conformant OpenURL Registry
• Use PURL partial redirects to obtain
publisheable URLs
baseURL?
verb=GetRecord &
metadataPrefix=xsd &
identifier=ori:fmt:xml:xsd:book
http://www.openurl.info/registry
/xsd
/ori:fmt:xml:xsd:book
• OpenURL Registry screencam
OAI-PMH-conformant OpenURL Registry
OAI-PMH-conformant GSFAD Thesaurus
• OCLC Office of Research:
• GSFAD Thesaurus is OAI-PMH harvestable
• Thesaurus is user-browseable through
overlaying of PURL and XSLT
• Thesaurus is accessible by machines via
OAI-PMH-based web services
Native GSFAD
record
MARCXML
concept
Adventure+films
Xwalked
GSFAD
record
GSFAD Thesaurus
Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their
present path of increasing sophistication
• citation indexing, search results viz, personalization,
recommendations, subject-based filtering, etc.
– growth rates remain the same (~5x DPs as SPs)
• Premise: OAI-PMH is applicable to any
scenario that needs to update / synchronize
distributed state
– Future opportunities are possible by creatively
interpreting the OAI-PMH data model
Typical Values
• repository
– collection of publications
• resource
– scholarly publication
• item
– all metadata (DC + MARC)
• record
– a single metadata format
• datestamp
– last update / addition of a record
• metadata format
– bibliographic metadata format
• set
– originating institution or subject categories
Repositories…
• Stretching the idea of a repository a bit:
– contextually sensitive repositories
• “personalization for harvesters”
• communication between strangers, or communication
between friends?
– OAI-PMH for individual complex objects?
• OAI-PMH without MySQL?!
– Fedora, Multi-valent documents, buckets
– tar, jar, zip, etc. files
Resource
• What if resource were:
– computer system status
• uptime, who, w, df, ps, etc.
– or generalized “system” status
• e.g., sports league standings
– people
• personnel databases
• authority files for authors
Item
• What if item were:
– software
• union of versions + formats
– all forms of metadata
• administrative + structural
• citations, annotations, reviews, etc.
– data
• e.g., newsfeeds and other XML expressible content
– metadataPrefixes or sets could be defined to be different
versions
Record
• What if record were:
– specific software instantiations / updates
– access / retrieval logs for DLs (or computer systems)
– push / pull model inversion
• put a harvester on the client behind a firewall, the client
contacts a DP and receives “instructions” on how to submit
the desired document (e.g., send email to a specified
address)
Datestamp
• semantics of datestamp are strongly influenced by
the choice of resource / item / record /
metadataPrefix, but it could be used to:
– signify change of set membership (e.g., workflow: item
moves from “submitted” to “approved”)
– change datestamp to reflect access to the DP
• e.g., in conjunction with metadataPrefixes of “accessed” or
“mirrored”
metadataPrefix
• what if metadataPrefix were:
– instructions for extracting / archiving / scraping the
resource
• verb=ListRecords&metadataPrefix=extract_TIFFs
– code fragments to run locally
• (harvested from a trusted source!)
– XSLT for other metadataPrefixes
• branding container is at the repository-level, this could be
record- or item-level
Set
• sets are already used for tunneling OAI-PMH
extensions (see Suleman & Fox, D-Lib 7(12))
• other uses:
– in aggregators, automatically create 1 set per baseURL
– have “hidden” sets (or metadataPrefix) that have
administrative or community-specific values (or triggers)
• set=accessed>1000&from=2001-01-01
• set=harvestMeWithTheseARGS&until=2002-05-
05&metadataPrefix=oai_marc

More Related Content

Similar to oai-2.0-adv.ppt

CERN User Story
CERN User StoryCERN User Story
CERN User StoryTim Bell
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiativeHerbert Van de Sompel
 
OOR--Open-Ontology-Repository--jun2010
OOR--Open-Ontology-Repository--jun2010OOR--Open-Ontology-Repository--jun2010
OOR--Open-Ontology-Repository--jun2010Peter Yim
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineKen Karapetyan
 
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...Laurent Lefort
 
The eXtensible Catalog Drupal Toolkit
The eXtensible Catalog Drupal ToolkitThe eXtensible Catalog Drupal Toolkit
The eXtensible Catalog Drupal ToolkitPéter Király
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008Nancy Elkington
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Ricard de la Vega
 
Nanotech Standards - Industry Participation - Strengthening Ties
Nanotech Standards - Industry Participation - Strengthening TiesNanotech Standards - Industry Participation - Strengthening Ties
Nanotech Standards - Industry Participation - Strengthening Tiesharidoss
 
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkBL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkIMPACT Centre of Competence
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough GuideTony Hammond
 
ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013Ahmed AlSum
 
SeaDataCloud - Common Vocabulary Services
SeaDataCloud - Common Vocabulary ServicesSeaDataCloud - Common Vocabulary Services
SeaDataCloud - Common Vocabulary ServicesEUDAT
 

Similar to oai-2.0-adv.ppt (20)

The aDORe Federation Architecture
The aDORe Federation ArchitectureThe aDORe Federation Architecture
The aDORe Federation Architecture
 
CERN User Story
CERN User StoryCERN User Story
CERN User Story
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiative
 
OOR--Open-Ontology-Repository--jun2010
OOR--Open-Ontology-Repository--jun2010OOR--Open-Ontology-Repository--jun2010
OOR--Open-Ontology-Repository--jun2010
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
Dealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data onlineDealing with the complex challenge of managing diverse chemistry data online
Dealing with the complex challenge of managing diverse chemistry data online
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
 
Ji cv6n1
Ji cv6n1Ji cv6n1
Ji cv6n1
 
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...
Semantically-Enabling the Web of Things: The W3C Semantic Sensor Network Onto...
 
The eXtensible Catalog Drupal Toolkit
The eXtensible Catalog Drupal ToolkitThe eXtensible Catalog Drupal Toolkit
The eXtensible Catalog Drupal Toolkit
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
 
Globe seminar
Globe seminarGlobe seminar
Globe seminar
 
Nanotech Standards - Industry Participation - Strengthening Ties
Nanotech Standards - Industry Participation - Strengthening TiesNanotech Standards - Industry Participation - Strengthening Ties
Nanotech Standards - Industry Participation - Strengthening Ties
 
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation FrameworkBL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
BL Demo Day - July2011 - (9) IMPACT Interoperability and Evaluation Framework
 
OpenURL - The Rough Guide
OpenURL - The Rough GuideOpenURL - The Rough Guide
OpenURL - The Rough Guide
 
ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013
 
SeaDataCloud - Common Vocabulary Services
SeaDataCloud - Common Vocabulary ServicesSeaDataCloud - Common Vocabulary Services
SeaDataCloud - Common Vocabulary Services
 
Digitisation and institutional repositories 3
Digitisation and institutional repositories 3Digitisation and institutional repositories 3
Digitisation and institutional repositories 3
 

Recently uploaded

Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfUK Journal
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 

Recently uploaded (20)

Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 

oai-2.0-adv.ppt

  • 1. Advanced Overview of Version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting Michael L. Nelson Old Dominion University Norfolk VA mln@cs.odu.edu Herbert Van de Sompel Los Alamos National Laboratory Los Alamos NM herbertv@lanl.gov Simeon Warner Cornell University Ithaca NY simeon@cs.cornell.edu ACM/IEEE Joint Conference on Digital Libraries Houston, Texas 13:30 - 17:00 May 27 2003 latest version at: http://www.cs.odu.edu/~mln/jcdl03/
  • 2. Scope and Focus • This Tutorial is not… – an introduction to OAI-PMH – a listing of all the wonderful projects that use OAI-PMH – a discussion of the merits of metadata harvesting vs. distributed searching • A passing familiarity is assumed for: – web / http interaction – Dublin Core / metadata
  • 3. Outline • How 2.0 evolved from SFC and 1.x – people, processes, events • 2.0 highlights – comparison with 1.x • Guidelines, recommendations, best practices for 2.0 implementations – harvesters, repositories, aggregators, optional containers • Novel applications of OAI-PMH
  • 5. about eprints document like objects resources metadata OAMS unqualified Dublin Core unqualified Dublin Core transport HTTP HTTP HTTP responses XML XML XML requests HTTP GET/POST HTTP GET/POST HTTP GET/POST verbs Dienst OAI-PMH OAI-PMH nature experimental experimental stable model metadata harvesting metadata harvesting metadata harvesting Santa Fe convention OAI-PMH v.1.0/1.1 OAI-PMH v.2.0
  • 6. Santa Fe Convention [02/2000] • goal: optimize discovery of e-prints • input: • the UPS prototype • RePEc /SODA “data provider / service provider model” • Dienst protocol • deliberations at Santa Fe meeting [10/99]
  • 7. OAI-PMH v.1.0 [01/2001] • goal: optimize discovery of document-like objects • input: • SFC • DLF meetings on metadata harvesting • deliberations at Cornell meeting [09/00] • alpha test group of OAI-PMH v.1.0
  • 8. • low-barrier interoperability specification • metadata harvesting model: data provider / service provider • focus on document-like objects • autonomous protocol • HTTP based • XML responses • unqualified Dublin Core • experimental: 12-18 months OAI-PMH v.1.0 [01/2001]
  • 9. Selected Pre- 2.0 OAI Highlights • October 21-22, 1999 - initial UPS meeting • February 15, 2000 - Santa Fe Convention published in D-Lib Magazine – precursor to the OAI metadata harvesting protocol • June 3, 2000 - workshop at ACM DL 2000 (Texas) • August 25, 2000 - OAI steering committee formed, DLF/CNI support • September 7-8, 2000 - technical meeting at Cornell University – defined the core of the current OAI metadata harvesting protocol • September 21, 2000 - workshop at ECDL 2000 (Portugal) • November 1, 2000 - Alpha test group announced (~15 organizations) • January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC) – purpose: freeze protocol for 12-16 months, generate critical mass • February 26, 2001 - OAI Open Day in Europe (Berlin) • July 3, 2001 - OAI protocol 1.1 announced – to reflect changes in the W3C’s XML latest schema recommendation • September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
  • 10. OAI-PMH v.2.0 [06/2002] • goal: recurrent exchange of metadata about resources between systems • input: • OAI-PMH v.1.0 • feedback on OAI-implementers • deliberations by OAI-tech [09/01 - 06/02] • alpha test group of OAI-PMH v.2.0 [03/02 - 06/02] •officially released June 14, 2002
  • 11. • low-barrier interoperability specification • metadata harvesting model: data provider / service provider • metadata about resources • autonomous protocol • HTTP based • XML responses • unqualified Dublin Core • stable OAI-PMH v.2.0 [06/2002]
  • 12. releasing OAI-PMH v.2.0 (illustrating the OAI process) See also Lagoze, Carl and Van de Sompel, Herbert. The making of the Open Archives Initiative Protocol for Metadata Harvesting. 2003. Library Hi Tech. v21, N2. Draft
  • 13.  pre-alpha phase  alpha-phase  creation of OAI-tech  beta-phase
  • 14. • created for 1 year period • charge: • review functionality and nature of OAI-PMH v.1.0 • investigate extensions • release stable version of OAI-PMH by 05/02 • determine need for infrastructure to support broad adoption of the protocol • communication: listserv, SourceForge, conference calls creation of OAI-tech [06/01]
  • 15. US representatives Thomas Krichel (Long Island U) - Jeff Young (OCLC) - Tim Cole - (U of Illinois at Urbana Champaign) - Hussein Suleman (Virginia Tech) - Simeon Warner (Cornell U) - Michael Nelson (NASA) - Caroline Arms (LoC) - Mohammad Zubair (Old Dominion U) - Steven Bird (U Penn.) European representatives Andy Powell (Bath U. & UKOLN) - Mogens Sandfaer (DTV) - Thomas Baron (CERN) - Les Carr (U of Southampton) OAI-tech
  • 16. • review process by OAI-tech: • identification of issues • conference call to filter/combine issues • white paper per issue • on-line discussion per white paper • proposal for resolution of issue by OAI-exec • discussion of proposal & closure of issue • conference call to resolve open issues pre-alpha phase [09/01 – 02/02]
  • 17. • creation of revised protocol document • in-person meeting Lagoze - Van de Sompel - Nelson – Warner • autonomous decisions • internal vetting of protocol document pre-alpha phase [02/02]
  • 18. • alpha-1 release to OAI-tech March 1st 2002 • OAI-tech extended with alpha testers • discussions/implementations by OAI-tech • ongoing revision of protocol document alpha phase [02/02 – 05/02]
  • 19. • The British Library • Cornell U. -- NSDL project & e-print arXiv • Ex Libris • FS Consulting Inc -- harvester for my.OAI • Humboldt-Universität zu Berlin • InQuirion Pty Ltd, RMIT University • Library of Congress • NASA • OCLC OAI-PMH 2.0 alpha testers (1/2)
  • 20. OAI-PMH 2.0 alpha testers (2/2) • Old Dominion U. -- ARC , DP9 • U. of Illinois at Urbana-Champaign • U. Of Southampton -- OAIA (now Celestial), CiteBase, eprints.org • UCLA, John Hopkins U., Indiana U., NYU -- sheet music collection • UKOLN, U. of Bath -- RDN • Virginia Tech -- repository explorer
  • 21. beta phase [05/02-06/02] • beta release on May 1st 2002 to: • registered data providers and service providers • interested parties • fine tuning of protocol document • preparation for the release of 2.0 conformant tools by alpha testers
  • 23.  important improvements in 2.0  quick recap
  • 24. Overview of OAI-PMH Verbs Verb Function Identify description of repository ListMetadataFormats metadata formats supported by repository ListSets sets defined by repository ListIdentifiers OAI unique ids contained in repository ListRecords listing of N records GetRecord listing of a single record metadata about the repository harvesting verbs most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
  • 26. resource all available metadata about David item Dublin Core metadata MARC metadata SPECTRUM metadata records item = identifier record = identifier + metadata format + datestamp set-membership is item-level property resource – item - record
  • 27. OAI-PMH vs HTTP • clear separation of OAI-PMH and HTTP • OAI-PMH error handling • all OK at HTTP level? => 200 OK • something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb) • http codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
  • 28. other improvements • all protocol responses can be validated with a single XML Schema • easier for data providers • no redundancy in type definitions • SOAP-ready • clean for error handling
  • 29. <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-0208T08:55:46Z</responseDate> <request verb=“GetRecord”… …>http://arXiv.org/oai2</request> <GetRecord> <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> </GetRecord> </OAI-PMH> response no errors note no http encoding of the OAI-PMH request
  • 30. <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-0208T08:55:46Z</responseDate> <request>http://arXiv.org/oai2</request> <error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error> </OAI-PMH> response with error with errors, only the correct attributes are echoed in <request>
  • 31. • Identify more expressive Identify <Identify> <repositoryName>Library of Congress 1</repositoryName> <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL> <protocolVersion>2.0</protocolVersion> <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail> <adminEmail>rgillian@visi.net</adminEmail> <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp> <deletedRecord>transient</deletedRecord> <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> <compression>deflate</compression>
  • 32. protocol vs periphery • clear distinction between protocol and periphery • fixed protocol document • extensible implementation guidelines: • e.g. sample metadata formats, description containers, about containers • allows for OAI guidelines and community guidelines
  • 33. • introduction of provenance container to facilitate tracing of harvesting history provenance <about> <provenance> <originDescription> <baseURL>http://an.oa.org</baseURL> <identifier>oai:r1:plog/9801001</identifier> <datestamp>2001-08-13T13:00:02Z</datestamp> <metadataPrefix>oai_dc</metadataPrefix> <harvestDate>2001-08-15T12:01:30Z</harvestDate> <originDescription> … … … </originDescription> </originDescription> </provenance> </about>
  • 34. • introduction of friends container to facilitate dynamic discovery of repositories friends <description> <friends> <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL> <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL> <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL> <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL> </friends> </description>
  • 35. • introduction of branding container for DPs to suggest rendering & association hints <branding xmlns="http://www.openarchives.org/OAI/2.0/branding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/ http://www.openarchives.org/OAI/2.0/branding.xsd"> <collectionIcon> <url>http://my.site/icon.png</url> <link>http://my.site/homepage.html</link> <title>MySite(tm)</title> <width>88</width> <height>31</height> </collectionIcon> <metadataRendering metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/" mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering> <metadataRendering metadataNamespace="http://another.place/MARC" mimeType="text/css">http://another.place/MARCrender.css</metadataRendering> </branding> branding
  • 36. • revision of oai-identifier <description> <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai- identifier" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai- identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd"> <scheme>oai</scheme> <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier> <delimiter>:</delimiter> <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier> </oai-identifier> </description> oai-identifier domain based repository names
  • 37. • OAI 1.x: oai_dc Schema defined by OAI • OAI 2.0: oai_dc Schema imports from DCMI Schema for unqualified DC elements oai_dc
  • 38. • OAI 1.x: oai_marc • OAI 2.0: LoC marxml, oai_marc – http://www.loc.gov/standards/marcxml/ MARC21
  • 39. did not make it into OAI-PMH v.2.0
  • 40. • SOAP implementation • Result set filtering • “all” / “best” metadata • GetRecord -> GetRecords • Machine readable rights management • XML format for “mini-archives”
  • 41. Detailed Review of the OAI-PMH 2.0 Verbs
  • 42. Identify • Arguments – none • Errors – none • Arguments – none • Errors – badArgument 1.1 2.0
  • 43. ListMetadataFormats • Arguments – identifier (OPTIONAL) • Errors – id does not exist • Arguments – identifier (OPTIONAL) • Errors – badArgument – noMetadataFormats – idDoesNotExist 1.1 2.0
  • 44. ListSets • Arguments – resumptionToken (EXCLUSIVE) • Errors – no set hierarchy • Arguments – resumptionToken (EXCLUSIVE) • Errors – badArgument – badResumptionToken – noSetHierarchy 1.1 2.0
  • 45. ListIdentifiers • Arguments – from (OPTIONAL) – until (OPTIONAL) – set (OPTIONAL) – resumptionToken (EXCLUSIVE) • Errors – no records match • Arguments – from (OPTIONAL) – until (OPTIONAL) – set (OPTIONAL) – resumptionToken (EXCLUSIVE) – metadataPrefix (REQUIRED) • Errors – badArgument – cannotDisseminateFormat – badResumptionToken – noSetHierarchy – noRecordsMatch 1.1 2.0
  • 46. ListRecords • Arguments – from (OPTIONAL) – until (OPTIONAL) – set (OPTIONAL) – resumptionToken (EXCLUSIVE) – metadataPrefix (REQUIRED) • Errors – no records match – metadata format cannot be disseminated • Arguments – from (OPTIONAL) – until (OPTIONAL) – set (OPTIONAL) – resumptionToken (EXCLUSIVE) – metadataPrefix (REQUIRED) • Errors – noRecordsMatch – cannotDisseminateFormat – badResumptionToken – noSetHierarchy – badArgument 1.1 2.0
  • 47. GetRecord • Arguments – identifier (REQUIRED) – metadataPrefix (REQUIRED) • Errors – id does not exist – metadata format cannot be disseminated • Arguments – identifier (REQUIRED) – metadataPrefix (REQUIRED) • Errors – badArgument – cannotDisseminateFormat – idDoesNotExist 1.1 2.0
  • 48. Argument Summary metadataPrefix from until set resumptionToken identifier Identify       ListMetadata Formats      optional ListSets     exclusive  ListIdentifiers  optional optional optional exclusive  ListRecords  optional optional optional exclusive  GetRecord      
  • 49. Error Summary Identify BA ListMetadata Formats BA NMF IDDNE ListSets BA BRT NSH ListIdentifiers BA BRT CDF NRM NSH ListRecords BA BRT CDF NRM NSH GetRecord BA CDF IDDNE Generate badVerb on any input not matching the 6 defined verbs this is an inversion of the table in section 3.6 of the OAI-PMH specification
  • 50. Repository Implementation (see also Repository Implementation Guidelines: http://www.openarchives.org/OAI/2.0/guidelines-repository.htm)
  • 51. Minimal Repository • 2.0 provides many expressive, but optional features – but still low barrier! • if you are writing your own repository software, the quickest path to implementation can involve initially: – only supporting DC – skipping: <about>, sets, compression – skip flow control (resumptionTokens) if < 1000 items • add optional features as requirements and familiarity allows
  • 52. Be Honest with datestamp! • a change in the process of dynamic generation of a metadata format really does mean all records have been updated! – harvester caveat: an incremental harvest could yield an entire repository dump if all the date stamps change (for example, if the metadata mapping rules change) if (internalItemDatestamp > disseminationInterfaceDatestamp) { datestamp = internalItemDatestamp } else { datestamp = disseminationInterfaceDatestamp }
  • 53. Not Hiding Updates • OAI-PMH is designed to allow incremental harvesting • Updates must be available by the end of the period of the datestamp assigned, i.e. – Day granularity => during same day – Seconds granularity => during same second • Reason: harvesters need to overlap requests by just one datestamp interval (one day or one second) – in 1.x, 2 intervals were required (in many circumstances)
  • 54. State in resumptionTokens • HTTP is stateless • resumptionTokens allow state information to be passed back to the repository to create a complete list from sequence of incomplete lists • EITHER – all state in resumptionToken • OR – cache result set in repository
  • 55. Caching the Result Set • Repository caches results of initial request, returns only incomplete list • resumptionToken does not contain all state information, it includes: – a session id – offset information, necessary for idempotency • resumptionToken allows repository to return next incomplete list • increased complexity due to cache management – but a potential performance win
  • 56. All State in the resumptionToken • Arrange that remaining items/headers in complete list response can be specified with a new query and encode that in resumptionToken • One simple approach is to return items/headers in id order and make the new query specify the same parameters and the last id return (or by date) – simple to implement, but possibly longer execution times • Can encode parameters very simply: <resumptionToken>metadataPrefix=oai_dc from=1999-02-03&until=2002-04-01& lastid=fghy:45:123</resumptionToken>
  • 57. resumptionToken attributes (1) • expirationDate – likely to be useful when cache clean-up schedule is known – Do not specify expirationDate if all state in resumptionToken • badResumptionToken error to be used if resumptionToken expired – May also be used if request cannot be completed for some other reason • e.g.: if repository changes cause the incomplete list to have no records – issue badRT’s judiciously; it can invalidate a lot of effort by a lot of harvesters
  • 58. resumptionToken attributes (2) • completeListSize and cursor optionally provide information about size of complete list and number of records so far disseminated – not (currently) widely used – use consistently if used – designed for status monitoring – caveat harvester: completeListSize may be approximate and may be revised
  • 59. resumptionToken The only defined use of resumptionToken is as follows: •a repository must include a resumptionToken element as part of each response that includes an incomplete list; •in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request; •the response containing the incomplete list that completes the list must include an empty resumptionToken element;
  • 60. Flow Control & Load Balancing How to respond to a “bad” harvester: 1. HTTP status code 200; response to OAI-PMH request with a resumptionToken. 2. HTTP status code 503; with the Retry-After header set to an appropriate value if subsequent request follows too quickly or if the server is heavily loaded. 3. HTTP status code 403; with an appropriate reason specified if subsequent requests do not adhere to Retry-After delays.
  • 61. 302 Load Balancing • Interactive users on main DL machine should not be impacted by metadata harvesting – don’t take deliveries through the front door – not part of the protocol; defined outside the protocol OAI Server naca.larc.nasa.gov/oai/ if load > 0.50 redirect request OAI Server buckets.dsi.internet2.edu/naca/oai/ harvester http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc HTTP Status Code 302 http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc <?xml version=“1.0” encoding=“UTF-8”?> … <ListIdentifiers> … </ListIdentifiers>
  • 62. DNS Load Balancing • using a DNS rotor, establish – a.foo.org, b.foo.org, c.foo.org – each with a synchronized copy of the repository – let DNS & chance distribute the load – implication: if resumptionTokens could issued to loosely synchronized servers, it is likely that the rTs will be stateful
  • 63. Load Balancing Caveats • Copies of the repository must be synchronized – (cf. Pande, et al. JCDL 02) • Complex hierarchies are possible – programmer must insure no cycles in redirection graphs! • The baseURL in the reply must always point to the original repository, not the repository that eventually answered the request
  • 64. Error Handling: Verbosity More is better… <error code="badArgument">Illegal argument ‘foo’</error> <error code="badArgument">Illegal argument ‘bar’</error> is preferred over: <error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error> which is preferred over: <error code="badArgument">Illegal arguments</error>
  • 65. Error Handling: Levels • the OAI-PMH error / exception conditions are for OAI-PMH semantic events • they are not for situations when: – the database is down – a record is malformed • remember: record = id + datestamp + metadataPrefix • if you’re missing one of those, you don’t have an OAI record! – and other conditions that occur outside the OAI scope • use http codes 500, 503 or other appropriate values to indicate non-OAI problems
  • 66. Error Handling: Extensions • Arguments that are not 'required', 'optional' or 'exclusive’ are 'illegal' and should generate badArgument errors. • If you want to extend the OAI-PMH: – stop and consider: do you really need to? • maybe you should have different OAI-PMH interfaces, or creative metadata formats – if you really, really want to, tunnel your extensions through the “set” feature • see http://www.dlib.org/dlib/december01/suleman/12suleman.html for examples
  • 67. Idempotency of “List” Requests (1) • Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch • Recover by re-issuing request using resumptionToken from previous request • IMPLICATION: harvester must accept both the most recent resumptionToken issued and the previous one
  • 68. Idempotency of “List” Requests (2) • response to a re-issued request must contain all unchanged records • any changed records will get new datestamps after time of initial request • changes will be picked up by subsequent harvest if not included [no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]
  • 69. Case Study: “bucket” based repositories • Buckets: see Nelson & Maly, CACM 44(5) • 2.0 – NTRS - ntrs.nasa.gov/ (MySQL, DC) – LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer) – NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer) • 1.1 – LTRS - techreports.larc.nasa.gov/ltrs/oai/ – NACA - naca.larc.nasa.gov/ltrs/oai/ – Open Video - www.open-video.org/oai/ (MySQL, local) – JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local) – GLTRS (filesystem, HTML scraping) • Characteristics: – resumptionToken support initially skipped; added later (all) • highly encoded rT’s: [2001-01-01!!!!301!600] – sets initially skipped, added later (LTRS) – initially had load balancing with 2 NACA repositories…
  • 70. Case Study: “bucket” based repositories • in bucket terminology: – 6 OAI verbs (methods) added to the existing list of methods • http://ntrs.nasa.gov/?method=list_methods • http://ntrs.nasa.gov/?method=list_source&target=ListIdentifiers – a data element is added to the bucket that contains the specifics of the particular repository and its metadata format • http://ntrs.nasa.gov/?method=display&pkg_name=oai&element_na me=oai.pl
  • 71. Harvester Implementation and Use (see also Harvester Implementation Guidelines: http://www.openarchives.org/OAI/2.0/guidelines- repository.htm )
  • 72. Be a Polite OAI Neighbor • Re-use existing free harvester software/libraries: http://www.openarchives.org/tools/index.html • If you insist on writing your own harvester, read http://www.robotstxt.org/wc/robots.html • Provide meaningful User-Agent & From headers – Should be present in HTTP headers of all robot requests – Should be configured even if using someone else’s harvester
  • 73. Harvesting Sequence • Issue Identify request – Check OAI-PMH version – Check baseURL, granularity, compression • Issue ListMetadataFormats request – Get information regarding selected metadataPrefix • Issue ListSets request if using sets – Check set structure matches expectation • Issue ListIdentifier or ListRecords request – Continue until end of complete list
  • 74. Listen to the Repository • Check Identify’s <granularity> element if you wish to use finer than YYYY-MM-DD • If you harvest with sets, remember that “:” indicates hierarchy – harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d” • Check for and handle non-200 HTTP status codes, 503, 302 and 4xx in particular • Empty resumptionToken => end of complete list • Ask for compressed responses if the repository supports them
  • 75. Harvesting Everything • Issue an Identify request to find protocol version, finest datestamp granularity supported, if compression is supported… • Issue a ListMetadataFormats request to obtain a list of all metadataPrefixes supported. • Harvest using a ListRecords request for each metadataPrefix supported. Knowledge of the datestamp granularity allows for less overlap in incremental harvesting if granularities finer than a day are supported. • Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible). • Items may be reconstructed from the constituent records. • Provenance and other information in <about> blocks may be re- assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be store separately for each metadata format.
  • 76. Harvesting v1.1 and v2.0 • Not difficult to handle both cases, test Identify response: – v1.1: <Identify> <protocolVersion> – v2.0 <OAI-PMH> <Identify> <protocolVersion> • Different error and exception handling • Many similarities, harvesters can share lots of code
  • 77. Harvesting Demo • Harvester written in Perl (Uses LWP, Expat and XML::Parser, no schema validation) • Handles v1.0, v1.1 and v2.0 • Sequence of requests: Identify, ListMetadataFormats, ListSets then ListRecords/ListIdentifiers • Support for incremental harvesting, uses responseDate from last harvest to get new start datestamp • Supports response compression (gzip, compress) • UTF-8 conditioning to deal with some “imperfect” repositories
  • 78. Harvesting logs • Alan Kent’s v2.0 harvester logs: http://www.inquirion.com:8123/public/collList;collListCmd=list • Alan Kent’s summary of v1.1 harvesting results http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm • Celestial v1.1 harvesting logs http://celestial.eprints.org/cgi-bin/status • DP9 gateway using arc harvested information http://arc.cs.odu.edu:8080/dp9/index.jsp
  • 79. <friends> example (1) A light-weight, data-provider driven way to communicate the existence of “others”, e.g. http://ntrs.nasa.gov/?verb=Identify … <description> <friends …namespace stuff… > <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL> <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL> <baseURL>http://eprints.riacs.edu/perl/oai/</baseURL> <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL> </friends> </description> …
  • 82. Aggregator / Cache / Proxy Implementation (see also Aggregator Implementation Guidelines: http://www.openarchives.org/OAI/2.0/guidelines- aggregator.htm)
  • 83. <provenance> & datestamps • Reminder: datestamps are local to the repository, a re-exporting service must use new local datestamps • Such services should use the <provenance> container to preserve the original datestamps and other information
  • 84. Identifiers are Local • Identifiers are local to the repository • Unless you absolutely did not change the metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting – use the <provenance> container to preserve the harvesting history
  • 85. oai-identifier • Just one option for identifiers in OAI-PMH • The v2.0 oai-identifier scheme is not compatible with v1.1: – repositoryName now domain name based – not reliant upon OAI centralized registration • One-to-one mapping for escaping characters: %3F allowed, %3f not – allows simple comparison
  • 86. Derived from the same item? 3 ways to determine if records share provenance from the same item: 1. both records have the same identifier and the baseURL in the request elements of the OAI-PMH reponses which include the record are the same; 2. both records have the same identifier and that identifier belongs to some recognized URI scheme; 3. the provenance containers of both records have the same entries for both the identifier and baseURL;
  • 87. <provenance> example (1) <responseDate>2002-02-08T08:55:46.1</responseDate> <request verb="GetRecord" metadataPrefix="odd_fmt" identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request> <GetRecord ...namespace stuff… <record> <header> <identifier>oai:odd.oa.org:z1x2y3</identifier> <datestamp>1999-08-07T06:05:04Z</datestamp> </header> <metadata> …metadata record in odd_fmt… </metadata> </record> </GetRecord> Consider a request from crosswalker.oa.org: http://odd.oa.org?verb=GetRecord &identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt and the following response from odd.oa.org:
  • 88. Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers. A request from getmarc.oa.org: http://crosswalker.oa.org?verb=GetRecord &identifier=oai:cw.oa.org:z1x2y3 &metadataPrefix=oai_marc might then yield the following response from crosswalker.oa.org: <provenance> example (2)
  • 89. <provenance> example (3) <record> <header> <identifier>oai:cw.oa.org:z1x2y3</identifier> <datestamp>2002-02-09T01:15:43Z</datestamp> </header> <metadata> ...metadata record in oai_marc... </metadata> <about> <provenance …namespace stuff… > <originDescription harvestDate="2002-02-08T08:55:46Z“ altered="true"> <baseURL>http://odd.oa.org</baseURL> <identifier>oai:odd.oa.org:z1x2y3</identifier> <datestamp>1999-08-07T06:05:04Z</datestamp> <metadataNamespace>http://odd.oa.org/odd_fmt</..> </originDescription> </provenance> </about> </record>
  • 90. <provenance> example (4) This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.og:z1x2y3 (because the record has not been altered). The associated <provenance> container might be:
  • 91. <provenance> example (5) <record> <header> <identifier>oai:cw.oa.org:z1x2y3</identifier> <datestamp>2002-03-01T01:46:11Z</datestamp> </header> <metadata> ...metadata record in oai_marc... </metadata> <about> <provenance …namespace stuff…> <originDescription harvestDate=“2002-03-01T01:23:45” altered=“false”> <baseURL>http://crosswalker.oa.org/<baseURL> <identifier>oai:cw.oa.org:z1x2y3</identifier> <datestamp>2002-02-09T01:15:43Z</datestamp> <metadataNamespace>http://../oai_marc</metadataNamespace> <originDescription harvestDate="2002-02-08T08:55:46Z” altered="true"> <baseURL>http://odd.oa.org</baseURL> <identifier>oai:odd.oa.org:z1x2y3</identifier> <datestamp>1999-08-07T06:05:04Z</datestamp> <metadataNamespace>http://odd.oa.org/odd_fmt</metadateNamespace> </originDescription> </originDescription> </provenance> </about> </record>
  • 92. Case Studies Example data-provider implementations for: arXiv.org NSDL metadata repository
  • 93. arXiv (1) http://arXiv.org/oai2 • Existing system, running >11 years, written mostly in Perl • Flat file system for ‘database’ • 230k papers with metadata in homebrew format • ~200 updates/day. OAI repository just one view of system, must integrate with daily update schedule
  • 94. arXiv (2) • Write in Perl – Easy integration with rest of system, reuse code from v1.0/v1.1 interface – Use libwww; XML::DOM • Daily rebuild of datestamp database – No existing date in system appropriate – Base on Unix cdate of metadata files • On-the-fly metadata translation – Straightforward, avoids data duplication
  • 95. arXiv (3) • Flow control to avoid loading server and to avoid harvesters tripping robot alarms – resumptionTokens to limit response size (1500 records or 15k identifiers / response) – 503 Retry-After replies based on client ip • Implement resumptionTokens that include all state – Avoid need to cache result sets / clean cache
  • 96. NSDL Metadata Repository http://services.nsdl.org:8080/nsdloai/OAI • Implemented as an integral part of a new system • Expect heavy load; db target size >10M items; stateless resumptionTokens • Java servlets; Xerces; Oracle (JDBC interface); strict validation throughout • Based on rewrite of Cocoa (NCSA UIUC) • Integral to NSDL services model: provides data for user interface and search services
  • 98. • “Using OAI-PMH…Differently” Young, Van de Sompel, Hickey, D-Lib Magazine, 9(7/8), 2003 – DL Usage logs ~ LANL – Registry of metadata formats for OpenURL ~ OCLC & LANL • http://www.openurl.info/registry/ • http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf – GSFAD Thesaurus ~ OCLC • Other uses?
  • 99.
  • 100. OAI-PMH access to DL usage logs • usage logs filtered and stored in MySQL db • accessible as 2 OAI-PMH repositories: • document oriented • agent oriented (user-proxy) • interlinked • recommender system: • harvests logs • interpretes logs • exposes relationships (OpenURL access)
  • 101. Phase 1: creating recommender system Document logs local Agent logs local document log agent log Log processing Log based recom. system About local and remote data
  • 105. • log repository screencam OAI-PMH access to DL usage logs
  • 106. OAI-PMH-conformant OpenURL Registry • NISO OpenURL Framework builds on Registry • Registry entry: • unique identifier • always DC record • sometimes XHTML or XML Schema definition
  • 107. OAI-PMH-conformant OpenURL Registry • Collaboration with OCLC Office of Research: • Registry is OAI-PMH harvestable • Registry is browseable through overlaying of PURL and XSLT
  • 109. XML Schema defining the metadata format xsd registered item ori:fmt:xml:xsd:book about registered item OpenURL Registry metadata format
  • 110. <?xml version="1.0" encoding="UTF-8" ?> <?xml-stylesheet type="text/xsl" href="gui.xsl" ?> <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"> <responseDate>2002-02-08T12:00:01Z</responseDate> ... </OAI-PMH> OAI-PMH-conformant OpenURL Registry • Insert XSLT stylesheet reference in OAI- PMH response => make repository browseable
  • 111. OAI-PMH-conformant OpenURL Registry • Use PURL partial redirects to obtain publisheable URLs baseURL? verb=GetRecord & metadataPrefix=xsd & identifier=ori:fmt:xml:xsd:book http://www.openurl.info/registry /xsd /ori:fmt:xml:xsd:book
  • 112. • OpenURL Registry screencam OAI-PMH-conformant OpenURL Registry
  • 113. OAI-PMH-conformant GSFAD Thesaurus • OCLC Office of Research: • GSFAD Thesaurus is OAI-PMH harvestable • Thesaurus is user-browseable through overlaying of PURL and XSLT • Thesaurus is accessible by machines via OAI-PMH-based web services
  • 115. Other Uses For the OAI-PMH • Assumptions: – Traditional DLs / SPs will continue on their present path of increasing sophistication • citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc. – growth rates remain the same (~5x DPs as SPs) • Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state – Future opportunities are possible by creatively interpreting the OAI-PMH data model
  • 116. Typical Values • repository – collection of publications • resource – scholarly publication • item – all metadata (DC + MARC) • record – a single metadata format • datestamp – last update / addition of a record • metadata format – bibliographic metadata format • set – originating institution or subject categories
  • 117. Repositories… • Stretching the idea of a repository a bit: – contextually sensitive repositories • “personalization for harvesters” • communication between strangers, or communication between friends? – OAI-PMH for individual complex objects? • OAI-PMH without MySQL?! – Fedora, Multi-valent documents, buckets – tar, jar, zip, etc. files
  • 118. Resource • What if resource were: – computer system status • uptime, who, w, df, ps, etc. – or generalized “system” status • e.g., sports league standings – people • personnel databases • authority files for authors
  • 119. Item • What if item were: – software • union of versions + formats – all forms of metadata • administrative + structural • citations, annotations, reviews, etc. – data • e.g., newsfeeds and other XML expressible content – metadataPrefixes or sets could be defined to be different versions
  • 120. Record • What if record were: – specific software instantiations / updates – access / retrieval logs for DLs (or computer systems) – push / pull model inversion • put a harvester on the client behind a firewall, the client contacts a DP and receives “instructions” on how to submit the desired document (e.g., send email to a specified address)
  • 121. Datestamp • semantics of datestamp are strongly influenced by the choice of resource / item / record / metadataPrefix, but it could be used to: – signify change of set membership (e.g., workflow: item moves from “submitted” to “approved”) – change datestamp to reflect access to the DP • e.g., in conjunction with metadataPrefixes of “accessed” or “mirrored”
  • 122. metadataPrefix • what if metadataPrefix were: – instructions for extracting / archiving / scraping the resource • verb=ListRecords&metadataPrefix=extract_TIFFs – code fragments to run locally • (harvested from a trusted source!) – XSLT for other metadataPrefixes • branding container is at the repository-level, this could be record- or item-level
  • 123. Set • sets are already used for tunneling OAI-PMH extensions (see Suleman & Fox, D-Lib 7(12)) • other uses: – in aggregators, automatically create 1 set per baseURL – have “hidden” sets (or metadataPrefix) that have administrative or community-specific values (or triggers) • set=accessed>1000&from=2001-01-01 • set=harvestMeWithTheseARGS&until=2002-05- 05&metadataPrefix=oai_marc