Pandora
1. Trends in Use of
Pandora Archive
Presentation at IIPC Open Day
The Broad Value of Web Archives
30th April, 2012, Library of Congress
Monica Omodei
Director, Web Archiving and Digital Preservation
National Library of Australia
momodei @ nla.gov.au
2. About the Pandora Archive
• Selective, collaborative approach
– high-value, discrete, timely collecting
– A number of partners contribute to Pandora
• Targeted Australian content
– selection policy, nominations are reviewed
• Historical – started 1996
• Bibliocentric approach
– archived sites/publications are fully catalogued
• Publicly accessible
– full content keyword search through national resource
discovery service trove.nla.gov.au
– Browse shows a reconstituted version of the original site
– Metadata indexed in Google
3. Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of titles > 30.5K
• Number of title instances > 73.5K
4. Whole domain archive
• We have also commissioned the IA to crawl
the .au domain for us annually since 2005
• Legislation prevents us from making this
accessible yet
• Hopefully soon we will be able to allow
access to researchers
11. The Bad News
• We have no legal deposit legislation for electronic
publications, so permission to archive must be
obtained
– significant content missed because permission to
copy refused
• QA and fixing process can be labour intensive
– Technical infrastructure ten years old
• Selection guidelines outdated and don't align
• Significant content missed because of resourcing
constraints and high labour cost
• Search and browse functionality very limited
– no URL search, no time-based searching
• Current infrastructure doesn't scale for broader
themed collections with multiple sites or for
domain-scale archiving
12. Glass half full
• Situation will improve markedly if Legal Deposit
provisions are extended to digital publications
– The Australian Attorney-General has released a
consultation paper with a model for this extension
• Broader coverage will be achieved when
infrastructure is upgraded, improving scalability
and reducing labour costs for QA/fixing
– We have commenced a multi-year Digital Library
Infrastructure Replacement Project which includes
upgrading our web archiving tools
– We are currently trialling Heritrix for collaborative
thematic collecting, and Wayback for access to our
commissioned .gov.au sub-domain archive
13. DLIR Project
• Digital Library Infrastructure Replacement
• RFP was followed by RFT for components
where reasonable solutions had been
proposed (including core repository)
• The RFT evaluation recommended
proceeding to contract negotiations with
the selected tenderer for each component
• Currently preparing a submission for
ministerial approval prior to contract
negotiations with vendors
14. Patterns of Use
• Which archived sites are popular
and why?
• Is use of our archive growing?
• What is the relative interest in
older vs more recent captures?
• Who is using our archives?
• And what for?
15. Which archived sites are popular?
• Data source – filtered, aggregated web
access log data which counts accesses to
titles
• Examined top 30 archived titles (by number
of accesses) for each year 2009 to 2012
• Selected some to examine and
speculate as to why they might be
popular
• Included consistently high-ranking titles, and
ones that were very variable between
years
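
The ranking step above can be sketched in a few lines of Python. This is only an illustration, not the Library's actual tooling: it assumes the filtered log has already been reduced to (year, title_id) pairs, and the function names and the notion of "consistent" (appearing in every year's top list) are hypothetical.

```python
from collections import Counter, defaultdict


def top_titles_by_year(accesses, n=30):
    """Count accesses per title per year and keep the n most accessed.

    accesses: iterable of (year, title_id) pairs, one per logged access
    (an assumed, already-filtered form of the web access log).
    """
    counts = defaultdict(Counter)
    for year, title_id in accesses:
        counts[year][title_id] += 1
    return {year: c.most_common(n) for year, c in counts.items()}


def split_by_consistency(top_by_year):
    """Separate titles that rank in every year from those that do not."""
    years = set(top_by_year)
    seen = defaultdict(set)
    for year, ranked in top_by_year.items():
        for title_id, _count in ranked:
            seen[title_id].add(year)
    consistent = sorted(t for t, ys in seen.items() if ys == years)
    variable = sorted(t for t, ys in seen.items() if ys != years)
    return consistent, variable
```

Titles in the second list are the ones whose popularity varies between years and so are the natural candidates for a closer look.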
16. Reasons for popularity of archived version
• Were once popular and are now
decommissioned, particularly if the
domain name continues to exist and
redirects to the archive
• May not be that popular as live sites
but their live site links prominently to
Pandora as an archive for their
content
• Popular referencing sources cite the
archive as well as the live site (if it
still exists)
27. Conclusions
• Be more proactive in identifying
unresponsive domains
• Market automatic redirect
services to web site owners/
managers
• Allow Google to index archive
content for sites which are no
longer live
28. Is use of Pandora growing?
Annual access figures for Pandora Web Site and Archive
NB robots.txt was not introduced on the site until 2005
Web site design change in 2008 affected measure downward
29. Interest in older vs recent content
• Filtered access logs to referrals
from the entry page to the archived
instance
• Aggregated accesses by age (year)
of archived instance
• Added number of instances of that
age in the archive as a reference
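
A minimal Python sketch of this aggregation follows. The record layout (dicts with "referrer" and "capture_year" keys) and the is_entry_page predicate are assumptions made for illustration; the real log format is not described in the slides.

```python
from collections import Counter


def accesses_by_capture_year(records, instance_years, is_entry_page):
    """Return (capture_year, accesses, instances_held) rows, oldest first.

    records: log entries as dicts with hypothetical keys "referrer" and
        "capture_year" (the year the accessed instance was archived).
    instance_years: capture year of every instance held in the archive.
    is_entry_page: caller-supplied test for "referred from the entry page".
    """
    accesses = Counter(
        rec["capture_year"] for rec in records if is_entry_page(rec["referrer"])
    )
    instances = Counter(instance_years)
    years = sorted(set(accesses) | set(instances))
    # A Counter returns 0 for missing years, so gaps show up as zero rows.
    return [(year, accesses[year], instances[year]) for year in years]
```

Keeping the instance count beside the access count makes it possible to tell whether older captures attract fewer accesses simply because the archive holds fewer of them.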
31. Who is using the archive?
• Online survey linked to from search
service – approx. 450 respondents
• Age, gender, location, education
• How did they arrive?
• What type of information and for
what purpose?
• Is it still available on the live web?
32. But first an anecdote
Article in major newspaper – quote:
"WE at Spring Loaded are no conspiracy theorists, but
the disappearance of Liberal Party policies is curious.
First went the policy documents. A recent revamp of
the website saw the pre-election press releases go.
But thanks to the National Library of Australia's
Internet archive, many of the policies can be seen at
http://pandora.nla.gov.au. When Spring Loaded asked
about the missing policies, the Liberal Party said there
was nothing untoward."
33. Examples of lost web sites
• Qantas's own special web site presenting
its case during the major dispute with
pilots, engineers and cabin crew unions that
grounded the airline in 2011
• Jeff Kennett's campaign web site in the
1999 Victorian State election – the first use
of the web by a politician during a
campaign in Australia
41. Other questions
• Did you realise that you were going to enter
an archived version of a web site, not the live
one? (60% yes, 40% no)
• Was the resource you were looking for no
longer available on the live web? (50-50)
• Have you visited other web archives? (60%
yes, 40% no)
42. Conclusions
• We need to market our archive better
• Promote redirects for closing, unsupported
web sites
• Convert archives to ARC/WARC so the Memento API
will find content
• Allow Google indexing of content for archived
web sites where the live version is extinct or
substantially altered
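
For context on the Memento point, here is a hedged Python sketch of what a Memento client sends. Under the Memento protocol (RFC 7089), a client asks a TimeGate for the capture of a page closest to a given date by sending an Accept-Datetime request header. The TimeGate base URL below is a placeholder, not a Pandora endpoint, and the base-plus-original-URL pattern is a common convention assumed here rather than something the protocol mandates.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when):
    """Build the Accept-Datetime header for a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Memento datetimes are expressed in GMT in the HTTP date format.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": stamp}


def timegate_url(timegate_base, original_url):
    """Join a TimeGate base and the original URL (assumed URI pattern)."""
    return timegate_base.rstrip("/") + "/" + original_url


# Hypothetical TimeGate; substitute the archive's real endpoint.
url = timegate_url("https://timegate.example.org/timegate", "http://www.nla.gov.au/")
headers = accept_datetime_header(datetime(2012, 4, 30, tzinfo=timezone.utc))
```

An archive that supports the protocol answers such a request by pointing the client at the nearest capture, which is how Memento-aware tools would discover archived content.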