Hutt, Arwen and Jenn Riley. "Semantics and Syntax of Dublin Core Usage in Open Archives Initiative Data Providers of Cultural Heritage Materials." Joint Conference on Digital Libraries, Denver, CO, June 7-11, 2005.
1. Semantics and Syntax of Dublin Core
Usage in Open Archives Initiative Data
Providers of Cultural Heritage Materials
Arwen Hutt, University of Tennessee
Jenn Riley, Indiana University
2. OAI-PMH
Open Archives Initiative Protocol for Metadata Harvesting
Originally developed for sharing metadata
about e-prints
Two players
Data providers
Service providers
Requires unqualified Dublin Core be exposed
for all resources, but supplemental metadata
formats are allowed
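As a rough illustration of how a service provider asks a data provider for unqualified Dublin Core, the sketch below builds an OAI-PMH ListRecords request. The `verb` and `metadataPrefix` parameters and the `oai_dc` prefix come from the protocol; the base URL and the function name are hypothetical.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

# "https://example.org/oai" is a placeholder, not a real data provider.
url = list_records_url("https://example.org/oai")
```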
3. Dublin Core [Unqualified]
Simple, flexible metadata format
15 elements
All repeatable
None required
“Core” across all knowledge domains
4. “Cultural heritage” defined
The intellectual, creative, and material output
of society
Libraries, museums and archives generally
considered cultural heritage institutions
Often primary source materials
Tend to be older analog materials digitized for
network access
5. Significant variability in OAI metadata
Ward: found that only a small number of DC
elements were used in the majority of OAI
records
Liu: Arc service provider studied controlled
vocabulary usage in DC subject, type, format,
language, and date fields
NSDL: found errors: missing data, incorrect
data, confusing data, insufficient data
UIUC: date, coverage, format, and type
vocabulary varies significantly
6. Goals of the study
Focus on cultural heritage community
Examined 3 DC fields: date, creator,
contributor
Semantic content
Syntactic form
Results could inform community best
practices
One step towards improving the overall
quality of OAI metadata
7. Harvesting statistics
Successfully harvested metadata from 35
data providers
750,945 total records harvested
5% sample* from each data provider taken
for analysis (37,564 records)
* Minimum of 1 record per provider, values rounded up to the nearest whole number
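The per-provider sampling rule above (5%, rounded up, at least one record) can be sketched as follows. This is a minimal illustration, not the study's actual script: the function names and the use of a seeded random sample are assumptions, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
import random

def sample_size(total, percent=5):
    """Records to sample from one provider: percent of total, rounded up, minimum 1."""
    if total <= 0:
        return 0
    # Integer ceiling division: ceil(total * percent / 100).
    return max(1, (total * percent + 99) // 100)

def sample_records(records, percent=5, seed=0):
    """Draw the sample without replacement (the seed is an assumption, for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, sample_size(len(records), percent))
```

Under this rule a provider with 20 records contributes 1 record and a provider with 21 records contributes 2.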
8. Processing steps
Date, creator, contributor elements extracted
into “silos”
Repeated values grouped, keeping
connections between elements and the
records in which they appeared
Certain characteristics tracked about each
element
Example
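The extraction and grouping steps might look roughly like the sketch below. The study used Perl and XSLT; this Python version is only an illustration, and the record identifiers, sample XML, and function name are hypothetical. Each silo maps a distinct value to the set of records in which it appeared, so repeated values are grouped without losing the link back to the source records.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"
FIELDS = ("date", "creator", "contributor")

def build_silos(records):
    """records: iterable of (record_id, xml_string) pairs."""
    silos = {field: defaultdict(set) for field in FIELDS}
    for record_id, xml_text in records:
        root = ET.fromstring(xml_text)
        for field in FIELDS:
            for elem in root.iter(f"{{{DC_NS}}}{field}"):
                value = (elem.text or "").strip()
                if value:
                    # Group repeated values, remembering every record they came from.
                    silos[field][value].add(record_id)
    return silos

# Hypothetical sample records for illustration only.
SAMPLE = [
    ("rec1", '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
             "<dc:creator>Berlin, Irving</dc:creator><dc:date>1930</dc:date></dc>"),
    ("rec2", '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
             "<dc:creator>Berlin, Irving</dc:creator></dc>"),
]
silos = build_silos(SAMPLE)
```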
9. Characteristics recorded for all
elements
The presence of multiple discrete values in a
single element
<creator>Hutt, Arwen; Riley, Jenn</creator>
The presence of pseudo-qualifiers within the
value that refined the meaning of the element
<creator>Berlin, Irving
[composer]</creator>
Whether the value was appropriate within the
specified element based on DC rules and
usage guidelines
<date>Las Vegas, Nevada</date>
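Two of the characteristics above lend themselves to simple pattern matching. The sketch below is an assumption about how such checks could be written, not the study's own rules: it treats a semicolon as the separator between discrete values and a trailing bracketed or parenthesized word as a pseudo-qualifier.

```python
import re

# A trailing "[role]" or "(role)" at the end of the value.
PSEUDO_QUALIFIER = re.compile(r"[\[(]\s*([A-Za-z][A-Za-z .-]*?)\s*[\])]\s*$")

def split_values(value):
    """Split a value on semicolons, dropping empty pieces."""
    return [part.strip() for part in value.split(";") if part.strip()]

def has_multiple_values(value):
    return len(split_values(value)) > 1

def pseudo_qualifier(value):
    """Return the trailing qualifier text, or None if there is none."""
    match = PSEUDO_QUALIFIER.search(value)
    return match.group(1) if match else None
```

Whether a value is appropriate for its element (the third characteristic) is a judgement that simple patterns like these do not capture.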
10. Additional characteristics of <date>
The semantic type of the value (creation, copyright or
digitization)
<date>2000</date>
The general specificity of the date (single date, range
or period)
<date>19th Century</date>
Indication that a date is not definitive (that it is
estimated or approximate)
<date>ca. 1930</date>
Whether the value is purely numeric or contains non-
numeric text
<date>March 18, 1902</date>
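Three of the date characteristics above (specificity, approximation, numeric versus textual) could be approximated with heuristics like the following. The patterns and category labels are assumptions made for illustration; the semantic type (creation, copyright or digitization) usually cannot be inferred from the value alone and is left out.

```python
import re

APPROX = re.compile(r"\b(?:ca?\.|circa\b)|\?", re.IGNORECASE)
PERIOD = re.compile(r"\bcentury\b|\bdynasty\b|\b\d{3}0['’]?s\b", re.IGNORECASE)
YEAR = re.compile(r"\b\d{4}\b")
NUMERIC = re.compile(r"[\d\s/.:-]+")

def describe_date(value):
    """Return rough characteristics of a <date> value."""
    v = value.strip()
    if PERIOD.search(v):
        specificity = "period"
    elif len(YEAR.findall(v)) >= 2:
        specificity = "range"
    else:
        specificity = "single"
    return {
        "approximate": bool(APPROX.search(v)),
        # Digits plus common separators only; anything else counts as text.
        "numeric": bool(NUMERIC.fullmatch(v)),
        "specificity": specificity,
    }
```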
11. Additional characteristics of <creator>
and <contributor>
The semantic type of the value (personal
name, corporate name or other)
<creator>Newton, Isaac</creator>
Whether the entity is known, unknown or
ambiguous
<creator>Vermeer, Johannes, 1632-1675 ?</creator>
Whether the value is inverted or in direct
order
<creator>Charles Schultz</creator>
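Name order and uncertainty can be guessed with a crude heuristic, sketched below under stated assumptions: trailing life dates are stripped, a remaining comma is taken to mean inverted order, and a trailing question mark is taken to mean the attribution is uncertain. Corporate names containing commas would be misclassified, which is one reason manual review remains necessary.

```python
import re

# Trailing life dates such as ", 1632-1675 ?" or ", 1920-".
LIFE_DATES = re.compile(r"[,\s]*\d{4}\s*-\s*(\d{4})?\s*\??\s*$")

def name_order(value):
    """Return 'inverted' (Surname, Forename) or 'direct' (Forename Surname)."""
    head = LIFE_DATES.sub("", value.strip())
    return "inverted" if "," in head else "direct"

def is_uncertain(value):
    """A trailing question mark is treated as a sign of an ambiguous entity."""
    return value.rstrip().endswith("?")
```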
12. Strategies for categorization
Automatic
Iteratively developed
Pattern matching
Identification of commonly occurring values
Manual
Where feasible
Not perfect!
13. Findings for <date>
Values largely appropriate for element
Few “pseudo-qualifiers”
Different events represented
Values mostly numeric
Many dates not expressible in W3CDTF
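To make the last finding concrete, the sketch below checks whether a value is already written in W3CDTF syntax (year, year-month, full date, or date with time and time zone). It is a syntactic check only, written for illustration: it does not validate calendar rules such as days per month, and it does not decide whether a textual date could be converted to W3CDTF.

```python
import re

W3CDTF = re.compile(
    r"\d{4}"                                   # YYYY
    r"(-(0[1-9]|1[0-2])"                       # -MM
    r"(-(0[1-9]|[12]\d|3[01])"                 # -DD
    r"(T([01]\d|2[0-3]):[0-5]\d"               # Thh:mm
    r"(:[0-5]\d(\.\d+)?)?"                     # optional :ss and fraction
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d))?"       # time zone designator
    r")?)?"
)

def is_w3cdtf(value):
    """True if the value matches one of the W3CDTF syntactic forms."""
    return W3CDTF.fullmatch(value.strip()) is not None
```

Values such as "ca. 1930", "19th Century", and "1900-1910" fail this check, which illustrates why approximate dates, periods, and ranges are hard to express in the format.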
14. Findings for <creator>
Values largely appropriate for element
Most were personal names
Many “pseudo-qualifiers,” in comparison to
other elements
Often included information intended to
disambiguate a name
Some indication of the use of controlled
vocabularies, but many different name forms
present
15. Findings for <contributor>
Used infrequently
Many values inappropriate for element
Majority personal names, but higher
proportion of corporate names than
occurred in <creator>
Few “pseudo-qualifiers”
16. OAI DC record & intellectual object
1:1 principle – each DC record describes only
one version of a resource
BUT
Cultural heritage materials often digitized
from analog originals, resulting in multiple
versions of each intellectual object
17. OAI DC record & intellectual object
Two choices for data providers
Adhere to 1:1 rule but omit pertinent
information
Violate the 1:1 rule but create more
complete records
Many data providers in practice violate
the 1:1 rule
18. OAI DC record & aggregated search
environment
Extraction of records from original
collection context
Aggregation with records from other
collections
19. Moving towards better metadata –
some possibilities
Remove the OAI requirement for simple
Dublin Core (or “the Nuclear Option”)
Develop best practice documentation for
cultural heritage materials that deviate from
current DC best practice
Combination of data provider education and
service provider normalization
Improved communication between data and
service providers
Encourage use of other metadata formats
supplementing simple DC
20. Some other relevant initiatives
Digital Library Federation and NSDL OAI and
Shareable Metadata Best Practices Working
Group
Development of general OAI best practices
Development of strategies for communication
with vendors
DLF Aquifer Metadata Working Group
Development of profile for DLF institutions
(strong focus on cultural heritage)
Recommendations for specific metadata
elements
21. Plans for extension of this research
Primary analysis of the subject, coverage and
publisher elements
Analyze temporal information across date,
subject and coverage elements
Analyze geographic information across
subject and coverage elements
Analyze name information across creator,
contributor and publisher elements
Our study was performed on metadata shared by OAI-PMH.
OAI-PMH is a protocol for sharing metadata, not content.
Data providers “expose” metadata for service providers to come get. Service providers make use of that metadata in some way. Currently, by far the most common service provided is cross-repository searching. Our study focused on data providers of cultural heritage materials.
Explain the difference between qualified and simple DC
Example of 1:1 principle – Mona Lisa painting, painted by Leonardo many years ago – digital image created by Jenn Riley in 2000
All the research in this area talks about the variability problem
Quickly!
What we work with
Better chance of finding patterns within a general community
Semantic content – the meaning of the content of a metadata element
Syntactic form – the structure or format of the value
Processing performed with Perl and XSLT scripts
Show sample
Grouping
Link back to original record
Attributes – some attributes are used across all silos, but some are specific to particular elements
[We picked characteristics to record based on our experience as OAI data providers and on reports in the OAI literature]
Date specificity – Ming dynasty, 1980s, 1900-1910, etc.
Perl scripts
Not perfect
Too many records to manually check each one
Certain characteristics require subjective judgements
Digitization and creation most prevalent, but a few copyright also.
3 to 1 numeric to textual
W3CDTF – a profile of ISO 8601 recommended by Dublin Core as a best practice for date encoding, but 17% of dates cannot be represented by it.
We’re seeing a need for better support for variations on date values
Extraction – on the horse example, Roosevelt
Aggregation
– Administrative metadata is rarely useful in an aggregated environment.
– System hacks, such as typing in every year of a date range to make all years in the range searchable.