Capture all the URLs:
First Steps in Web Archiving

Kristen Yarmey
Digital Services Librarian
University of Scranton

Judy Silva
Fine & Performing Arts Librarian and Archivist
Slippery Rock University of Pennsylvania

Alexis Antracoli
Records Management Archivist
Drexel University
Where We’re Going
Kristen:
• Intro to web archiving
• Web archives in higher ed
• Archive-It and other tools

Judy:
• First steps
• Getting buy-in
• Selecting and scoping

Alexis:
• Metadata
• Policies
• Workflow

All:
• Challenges
• Lessons learned
• What’s next?
• Q&A
Why archive the web?
What do we put on the web?
• University publications
– Course catalogs
– Student handbooks
– Newsletters
– Press releases
– Alumni Journal
– Admissions viewbook
– University calendar

• Governance/Planning documents and records
– Policies
– Assessment reports (Fact Book)
– Faculty Senate agendas, minutes, and reports
– Presidential announcements
– Email

• Campus life
– Student clubs
– Housing contract
– Wellness programming
– Community outreach
– Athletics scores
– Alumni class pages

• Events
– Presidential inauguration
– New building construction/dedication

• Social Media presence
– Facebook
– Twitter
– Blogs
– YouTube
– …
Web Archiving in Higher Ed
“We have the responsibility
to preserve things like course information, course roster
information and policies — all sorts of things that we used to
get in paper but are now just showing up as websites.”
Dean B. Krafft, Chief Technology Strategist, Cornell University

“Almost every office and unit on campus has a web site with business information. ... Many of our campus publications are only on the web now as pdfs or html. [This content] isn’t preserved anywhere else.”
Ed Busch, Electronic Records Archivist, Michigan State University
Goals:
• Preserve dynamic content
– Text
– Images
– Animation
– Video
– …

• Preserve context
– Hyperlinks
– Embedded media
– Document method and date of capture
– Relate to prior and later versions

• Provide access
– Full text search
– Browsability
– User-friendly interface
Once something is posted on the web, it’s there forever… right?

New York Times, September 23, 2013
Web Archiving in Higher Ed
“One finding revealed by the survey was the
preponderance of universities that have
initiated web archiving programs in the last 5
years.”
Web Archiving Survey Report by National Digital Stewardship Alliance
June 2012
Tools

National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
Tools: In-House Options
Proprietary tools:
• Adobe Acrobat – convert websites into PDFs (internal links remain active but other dynamic functionality is lost)
• Grab-a-Site and WebWhacker – download files from a website
• Teleport Pro – “webspidering”
Open source tools:
• Heritrix – crawler
• HTTrack – downloads web content to a local directory
• Wayback – discovery
• Memento – access framework
• NutchWAX – search
• Solr – search
• WARCreate – Google Chrome extension for creating WARC files (view with Wayback, store your own data)
• Wget – retrieve files from a website
• Web Curator Tool – workflow management
• NetarchiveSuite – software package
• Xenu’s Link Sleuth – finds broken links
National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
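As a rough illustration of the in-house route, the sketch below builds a Wget command line that mirrors one site and records the capture as a WARC file. It assumes GNU Wget 1.14 or later (the release that added WARC output); the function name, the example URL, and the output file name are hypothetical, and the script only prints the command rather than running a crawl.

```python
import shlex


def build_wget_warc_command(seed_url, warc_name, wait_seconds=1):
    """Return a wget argument list that mirrors seed_url into a WARC file.

    Assumption: GNU Wget 1.14+ is installed; nothing is executed here.
    """
    return [
        "wget",
        "--mirror",                   # follow links recursively within the site
        "--page-requisites",          # also fetch images, CSS, and scripts
        f"--wait={wait_seconds}",     # pause between requests to be polite
        f"--warc-file={warc_name}",   # write the capture to <warc_name>.warc.gz
        seed_url,
    ]


# Hypothetical seed and file name, for illustration only.
command = build_wget_warc_command("http://www.example.edu/", "example-edu-2013")
print(shlex.join(command))
```

Printing the command first makes it easy to review the options with a web administrator before any crawl is actually run.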
Tools: Outsourcing Options
Vendor services
• Archive-It
• California Digital Library Web Archiving Service (WAS)
• OCLC Web Harvester

National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
Archive-It
• Subscription service
• Branch of nonprofit Internet Archive
• Crawls, harvests, and hosts web content, using open source tools and standard formats
• Yearly fees, based on “data budget”
Archive-It: Partners
Archive-It Partners
• 279 collecting organizations total
• 118 colleges & universities
Pennsylvania Partners:
• Bryn Mawr, Haverford, and Swarthmore (joint, 2005)
• Bucknell (2012)
• Chemical Heritage Foundation (2010)
• Curtis Institute of Music (2010)
• Drexel (2009)
• Free Library of Philadelphia (2010)
• Gettysburg College (2013)
• La Salle University (2012)
• Pennsylvania State University (2012)
• Slippery Rock University of Pennsylvania (2011)
• Temple University (2013)
• University of Pennsylvania Law School (2011)
• University of Scranton (2012)
Archive-It: Crawl
• Collection
• Seeds (regular, one-time, or RSS)
• Documents = any file with a distinct URL, including…
– HTML
– Images
– Video
– Audio
– PDF
–…

• Scope = which URLs are captured and which are not
• Frequency = how often seed is crawled
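The vocabulary above (collection, seed, frequency) can be pictured as a small data model. This is only a sketch for thinking about the terms, not Archive-It's actual software or API; the class names, the frequency labels, and the example URLs are all assumptions.

```python
from dataclasses import dataclass, field

# Seed types named on the slide.
SEED_TYPES = {"regular", "one-time", "rss"}


@dataclass
class Seed:
    url: str
    seed_type: str = "regular"
    frequency: str = "monthly"  # hypothetical label for how often to crawl

    def __post_init__(self):
        if self.seed_type not in SEED_TYPES:
            raise ValueError(f"unknown seed type: {self.seed_type}")


@dataclass
class Collection:
    name: str
    seeds: list = field(default_factory=list)

    def add_seed(self, url, seed_type="regular", frequency="monthly"):
        """Create a seed, attach it to this collection, and return it."""
        seed = Seed(url, seed_type, frequency)
        self.seeds.append(seed)
        return seed

    def seeds_with_frequency(self, frequency):
        """List the seed URLs scheduled at the given frequency."""
        return [seed.url for seed in self.seeds if seed.frequency == frequency]


# Hypothetical collection and seeds, for illustration only.
collection = Collection("Example University Web Archive")
collection.add_seed("http://www.example.edu/", frequency="monthly")
collection.add_seed("http://www.example.edu/news/feed", seed_type="rss", frequency="daily")
print(collection.seeds_with_frequency("daily"))
```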
Archive-It: Access
Users can:
• Search
• Browse

From:
• Archive-It website
• Portal page
• Embedded search boxes
• Library catalog
• Finding aids
• 404 error pages
• Wayback Machine

Content can be public or private.
Archive-It: Manage
Metadata
• Dublin Core
• Collection, seed, document level

Storage
• Archive-It hosts content and backup on multiple servers
• Partner can request copy of data

Support
• Training sessions
• Partner support
• User community
Why Archive-It?
Campus Stakeholders
• Library
• Information Technology
• Public Relations
• Administration
• New President
Funding
• Information Technology
• Public Relations
• Library
• Provost’s Office
• Grants
• Donors
Selecting Content
Selecting More Content
• Athletics
• Student organizations
• Alumni
• President’s page
• Provost’s page
• 125th Anniversary
• University Curriculum Committee minutes (password protected)
What is a seed?
• A seed is any URL that you want to capture:
– An entire website: http://www.whitehouse.gov/
– A specific part of a website: http://www.whitehouse.gov/issues/foreign-policy/
– A specific URL: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf
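One way to picture how the three kinds of seed above relate to scope is a simple prefix test: a seed ending in "/" covers everything beneath that path on the same host, while a seed pointing at a single file covers only that file. This is a simplified assumption for illustration; the slides do not describe Archive-It's actual scoping rules, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed_url, candidate_url):
    """Return True if candidate_url falls under seed_url (simplified prefix rule)."""
    seed = urlsplit(seed_url)
    candidate = urlsplit(candidate_url)

    # Different hosts are treated as out of scope in this sketch.
    if seed.netloc.lower() != candidate.netloc.lower():
        return False

    seed_path = seed.path or "/"
    candidate_path = candidate.path or "/"

    # A directory-style seed covers everything beneath it.
    if seed_path.endswith("/"):
        return candidate_path.startswith(seed_path)

    # A file-style seed covers only that exact document.
    return candidate_path == seed_path


site = "http://www.whitehouse.gov/"
section = "http://www.whitehouse.gov/issues/foreign-policy/"
print(in_scope(site, section))
print(in_scope(section, "http://www.whitehouse.gov/issues/"))
```

With these inputs the first call is in scope (the section sits under the whole-site seed) and the second is not (the parent page lies outside the section seed).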
Scoping & Crawls
Before You Start
Building the Program
• Policy
• Records Management Benefits
• Standardizing Metadata
• Developing Quality Control Procedures
• Working within organizational constraints
Collection Development
• Developed policy
– Mission
– Scope
– Designated Community
– Intellectual Property
– Access

• Determined/reviewed seeds to crawl and frequency
• Maintain an up-to-date list of seeds that are regularly crawled
“Brasseri F – Archives oubliees,” by GuillaBar.
http://www.flickr.com/photos/guillabar/8666232614/
Updating Metadata
• Selected fields to use consistently:
– Title
– Creator
– Description
– Collector

• Standardized names
• Eliminated groups
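A consistency check along these lines could be scripted. The sketch below flags seed records that lack any of the four selected fields and maps name variants to one preferred form. The record layout, the function names, and the example names are hypothetical; the slides do not say how the cleanup was actually carried out.

```python
# The four fields the slide says are used consistently.
REQUIRED_FIELDS = ("Title", "Creator", "Description", "Collector")


def missing_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not str(record.get(name, "")).strip()]


def standardize_name(name, preferred_names):
    """Collapse extra whitespace and swap known variants for the preferred form."""
    cleaned = " ".join(name.split())
    return preferred_names.get(cleaned.lower(), cleaned)


# Hypothetical record and name authority list, for illustration only.
record = {"Title": "University Homepage", "Creator": " Univ.  of Example ", "Description": ""}
preferred = {"univ. of example": "University of Example"}

print(missing_fields(record))
print(standardize_name(record["Creator"], preferred))
```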
Quality Control Procedures
• New program
• Excel spreadsheet
• Track by seed
• Check basic yes/no problems:
– Crawl too large
– Date Queued
– Robots.txt

• Track errors:
– Various seed errors
– Embedded file problems

• Track updates:
– New URLs
– Recrawls
– Patch crawls
– Web administrator contacts
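The per-seed tracking described above could also be kept as a CSV file that opens in Excel. The sketch below records two of the yes/no checks for each seed, using Python's standard robots.txt parser for the robots.txt question. The column names, the size limit, and the example robots.txt text are assumptions, not the presenters' actual spreadsheet.

```python
import csv
import io
from urllib.robotparser import RobotFileParser

# Hypothetical column layout for the tracking sheet.
QC_COLUMNS = ["seed", "crawl_too_large", "robots_txt_block", "notes"]


def blocked_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def qc_row(seed, crawl_size_gb, size_limit_gb, robots_txt, notes=""):
    """Build one yes/no tracking row for a seed."""
    return {
        "seed": seed,
        "crawl_too_large": "yes" if crawl_size_gb > size_limit_gb else "no",
        "robots_txt_block": "yes" if blocked_by_robots(robots_txt, seed) else "no",
        "notes": notes,
    }


def write_qc_log(rows, handle):
    """Write the tracking rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=QC_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical robots.txt content, seeds, and sizes, for illustration only.
robots = "User-agent: *\nDisallow: /private/"
rows = [
    qc_row("http://www.example.edu/", 1.2, 5.0, robots),
    qc_row("http://www.example.edu/private/minutes.html", 7.5, 5.0, robots, "contact web admin"),
]
buffer = io.StringIO()
write_qc_log(rows, buffer)
print(buffer.getvalue())
```

Keeping the checks as plain "yes"/"no" values mirrors the spreadsheet approach and lets staff filter the sheet without extra tooling.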

“Our Quality Control,” by Paphio.
http://www.flickr.com/photos/paphio/3313728492/
Challenges
• Staffing
• Time-intensive
• Correcting technical problems
• Not yet knowing how people will use the crawls as a resource
• Capturing online publications and email newsletters
Lessons Learned

“Lessons” by Pavel Ivashkov.
http://www.flickr.com/photos/ipasha/5588688937/

• Web archiving takes time
• There are ways to make it work with a small staff
• Metadata can be basic and still useful
• Quality control is important
• You can’t correct every error with limited staff
• Need to keep up with new sites and URL changes
Up Next
• Additional outreach to Web administrators
• Official launch of Web archiving program to University
• Exploring cross-training to improve quality control program
• Institute regular scanning of environment for new content and updates
• Social Media
Resources
• Archive-It Knowledge Center (October 2013)
• Brenda Reyes Ayala’s Web Archiving Bibliography (June 2013)
• Kalpesh Padia et al., Visualizing Digital Collections at Archive-It (August 2012)
• National Digital Stewardship Alliance, Web Archiving Survey Report (June 2012)
• International Internet Preservation Consortium, Future of the Web Workshop (May 2012)
• Jinfang Niu, “An Overview of Web Archiving” (D-Lib Magazine, March 2012)
• Inside Higher Ed, Archiving the Web for Scholars (May 2011)
• WebArchivists, Web-Archives Timeline

 
Library of Resources
Library of ResourcesLibrary of Resources
Library of Resources
 
Preserving Your Family Memories (Personal Digital Archiving)
Preserving Your Family Memories (Personal Digital Archiving)Preserving Your Family Memories (Personal Digital Archiving)
Preserving Your Family Memories (Personal Digital Archiving)
 


Capture All the URLs: First Steps in Web Archiving

  • 1. Capture all the URLs: First Steps in Web Archiving • Kristen Yarmey, Digital Services Librarian, University of Scranton • Judy Silva, Fine & Performing Arts Librarian and Archivist, Slippery Rock University of Pennsylvania • Alexis Antracoli, Records Management Archivist, Drexel University
  • 2. Where We’re Going Kristen: • Intro to web archiving • Web archives in higher ed • Archive-It and other tools Judy: • First steps • Getting buy-in • Selecting and scoping Alexis: • Metadata • Policies • Workflow All: • Challenges • Lessons learned • What’s next? • Q&A
  • 4. What do we put on the web? • University publications – Course catalogs – Student handbooks – Newsletters – Press releases – Alumni Journal – Admissions viewbook – University calendar • Governance/Planning documents and records – Policies – Assessment reports (Fact Book) – Faculty Senate agendas, minutes, and reports – Presidential announcements – Email • Campus life – Student clubs – Housing contract – Wellness programming – Community outreach – Athletics scores – Alumni class pages • Events – Presidential inauguration – New building construction/dedication • Social Media presence – Facebook – Twitter – Blogs – YouTube – …
  • 5. Web Archiving in Higher Ed “We have the responsibility to preserve things like course information, course roster information and policies – all sorts of things that we used to get in paper but are now just showing up as websites.” Dean B. Krafft, Chief Technology Strategist, Cornell University. “Almost every office and unit on campus has a web site with business information. ... Many of our campus publications are only on the web now as pdfs or html. [This content] isn’t preserved anywhere else.” Ed Busch, Electronic Records Archivist, Michigan State University
  • 6. Goals: • Preserve dynamic content – Text – Images – Animation – Video – … • Preserve context – Hyperlinks – Embedded media – Document method and date of capture – Relate to prior and later versions • Provide access – Full text search – Browsability – User-friendly interface
  • 7. Once something is posted on the web, it’s there forever… right? New York Times, September 23, 2013
  • 12. Web Archiving in Higher Ed “One finding revealed by the survey was the preponderance of universities that have initiated web archiving programs in the last 5 years.” Web Archiving Survey Report by National Digital Stewardship Alliance June 2012
  • 13. National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
  • 14. Tools National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
  • 15. Tools: In-House Options Proprietary tools: • Adobe Acrobat – convert websites into PDFs (internal links remain active but other dynamic functionality is lost) • Grab-a-Site and WebWhacker – download files from a website • Teleport Pro – “webspidering” Open source tools: • Heritrix – crawler • HTTrack – downloads web content to a local directory • Wayback – discovery • Memento – access framework • NutchWAX – search • Solr – search • WARCreate – Google Chrome extension for creating WARC files (view with Wayback, store your own data) • Wget – retrieve files from a website • Web Curator Tool – workflow management • NetarchiveSuite – software package • Xenu’s Link Sleuth – finds broken links (National Digital Stewardship Alliance Web Archiving Survey Report, June 2012)
  • 16. Tools: Outsourcing Options Vendor services: • Archive-It • California Digital Library Web Archiving Service (WAS) • OCLC Web Harvester (National Digital Stewardship Alliance Web Archiving Survey Report, June 2012)
  • 17. Archive-It • Subscription service • Branch of nonprofit Internet Archive • Crawls, harvests, and hosts web content, using open source tools and standard formats • Yearly fees, based on “data budget”
  • 18. Archive-It: Partners • 279 collecting organizations total • 118 colleges & universities Pennsylvania Partners: • Bryn Mawr, Haverford, and Swarthmore (joint, 2005) • Bucknell (2012) • Chemical Heritage Foundation (2010) • Curtis Institute of Music (2010) • Drexel (2009) • Free Library of Philadelphia (2010) • Gettysburg College (2013) • La Salle University (2012) • Pennsylvania State University (2012) • Slippery Rock University of Pennsylvania (2011) • Temple University (2013) • University of Pennsylvania Law School (2011) • University of Scranton (2012)
  • 19. Archive-It: Crawl • Collection • Seeds (regular, one-time, or RSS) • Documents = any file with a distinct URL, including… – HTML – Images – Video – Audio – PDF –… • Scope = which URLs are captured and which are not • Frequency = how often seed is crawled
  • 20. Archive-It: Access Users can: • Search • Browse From: • Archive-It website • Portal page • Embedded search boxes • Library catalog • Finding aids • 404 error pages • Wayback Machine Content can be public or private.
  • 23. Archive-It: Manage Metadata: • Dublin Core • Collection, seed, document level Storage: • Archive-It hosts content and backup on multiple servers • Partner can request copy of data Support: • Training sessions • Partner support • User community
  • 25. Campus Stakeholders • Library • Information Technology • Public Relations • Administration • New President
  • 26. Funding • Information Technology • Public Relations • Library • Provost’s Office • Grants • Donors
  • 28. Selecting More Content • Athletics • Student organizations • Alumni • President’s page • Provost’s page • 125th Anniversary • University Curriculum Committee minutes (password protected)
  • 30. What is a seed? • A seed is any URL that you want to capture: – An entire website: http://www.whitehouse.gov/ – A specific part of a website: http://www.whitehouse.gov/issues/foreign-policy/ – A specific URL: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf
  • 33. Building the Program • Policy • Records Management Benefits • Standardizing Metadata • Developing Quality Control Procedures • Working within organizational constraints
  • 34. Collection Development • Developed policy – Mission – Scope – Designated Community – Intellectual Property – Access • Determined/reviewed seeds to crawl and frequency • Maintain an up-to-date list of seeds that are regularly crawled (“Brasseri F – Archives oubliees,” by GuillaBar. http://www.flickr.com/photos/guillabar/8666232614/)
  • 35. Updating Metadata • Selected fields to use consistently: – Title – Creator – Description – Collector • Standardized names • Eliminated groups
  • 36. Quality Control Procedures • New program • Excel spreadsheet • Track by seed • Check basic yes/no problems: – Crawl too large – Date Queued – Robots.txt • Track errors: – Various seed errors – Embedded file problems • Track updates: – New URLs – Recrawls – Patch crawls – Web administrator contacts (“Our Quality Control,” by Paphio. http://www.flickr.com/photos/paphio/3313728492/)
  • 37. Challenges • Staffing • Time-intensive • Correcting technical problems • Not yet knowing how people will use the crawls as a resource • Capturing online publications and email newsletters
  • 38. Lessons Learned • Web archiving takes time • There are ways to make it work with a small staff • Metadata can be basic and still useful • Quality Control is important • You can’t correct every error with limited staff • Need to keep up with new sites and URL changes (“Lessons,” by Pavel Ivashkov. http://www.flickr.com/photos/ipasha/5588688937/)
  • 39. Up Next • Additional outreach to Web administrators • Official launch of Web archiving program to University • Exploring cross-training to improve quality control program • Institute regular scanning of environment for new content and updates • Social Media
  • 40. Resources • Archive-It Knowledge Center (October 2013) • Brenda Reyes Ayala’s Web Archiving Bibliography (June 2013) • Kalpesh Padia et al., Visualizing Digital Collections at Archive-It (August 2012) • National Digital Stewardship Alliance, Web Archiving Survey Report (June 2012) • International Internet Preservation Consortium, Future of the Web Workshop (May 2012) • Jinfang Niu, “An Overview of Web Archiving” (D-Lib Magazine, March 2012) • Inside Higher Ed, Archiving the Web for Scholars (May 2011) • WebArchivists, Web-Archives Timeline
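
The seed and scope ideas from slides 19 and 30 (a seed can be a whole site, one section of a site, or a single document, and scope decides which URLs are captured) can be illustrated with a small sketch. This is not how Archive-It implements scoping. The Seed class, its frequency labels, the simple "same host and path prefix" rule, and the example.edu URL are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Seed:
    """A starting URL plus an illustrative crawl-frequency label (hypothetical)."""
    url: str
    frequency: str = "one-time"  # e.g. "one-time", "weekly", "monthly", "annual"

    def kind(self) -> str:
        """Classify the seed the way slide 30 does, based only on its path."""
        path = urlparse(self.url).path
        if path in ("", "/"):
            return "entire website"
        if path.endswith("/"):
            return "part of a website"
        return "specific URL"

    def in_scope(self, candidate_url: str) -> bool:
        """Simplified scope rule: same host, and the path sits under the seed's path."""
        seed = urlparse(self.url)
        cand = urlparse(candidate_url)
        if cand.netloc.lower() != seed.netloc.lower():
            return False
        seed_path = seed.path or "/"
        cand_path = cand.path or "/"
        if seed_path.endswith("/"):
            # Directory-style seed: anything beneath it is in scope.
            return cand_path.startswith(seed_path)
        # Single-document seed: only that exact path is in scope.
        return cand_path == seed_path


if __name__ == "__main__":
    seeds = [
        Seed("http://www.whitehouse.gov/", "monthly"),
        Seed("http://www.whitehouse.gov/issues/foreign-policy/"),
        Seed("http://www.example.edu/catalog/2013.pdf", "annual"),  # hypothetical URL
    ]
    for seed in seeds:
        print(seed.kind(), seed.frequency, seed.url)
```

A real crawler applies many more rules (subdomains, embedded media hosted elsewhere, time and data limits), so a rule like this is best read as a way to think through a seed list before setting up a crawl.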

Editor's Notes

  1. Recognition of changing platform. All of our stuff is going here, and it’s dynamic.
  2. This is harder than you think. Digital files are highly vulnerable.
  3. Already seeing this happen.
  4. Often a combination of tools
  5. Often a combination of tools
  6. As we’ve seen, Archive-It is the popular favorite and has an impressive list of users. Started looking into web archiving in 2009; followed the topic on professional listservs and saw Archive-It mentioned repeatedly. Had read about Brewster Kahle’s work. Tried the Wayback Machine and was impressed with what was being collected already. In 2010 saw a presentation at MARAC (Mid-Atlantic Archives Conference) by a colleague, Rebecca Goldman at Drexel, who suggested I try a webinar, and that was it. • Few options in 2009 • Archive-It endorsed by colleagues • Internet Archive’s Wayback Machine • Presentation at professional conference • Attended a webinar • Ran a trial • Archive-It support
  7. First contacted Information Technology (partly to determine they were not already archiving the website somehow). VP Info Technology referred us to PR . . . At Slippery Rock University the Public Relations Office is responsible for the website, so they were a natural in helping to select content for capture and preservation. PR publishes more and more content in electronic format (in some cases only electronic): course catalogs, alumni magazine, press releases . . . Administration seeking storage solutions as network drives fill. Library willing to test the web archiving concept with our content.
  8. IT deferred concept (and funding) to PR. PR said they could not afford it. Library ended up paying for it. Will ask Provost’s Office next year.
  9. • Library Homepage • Archives: Digital Collections • University Homepage • RockPride (campus e-newsletter) • Catalogs: undergraduate and graduate. Decided to use library content as a prototype; it allowed us to practice and then showcase our own content: the library homepage and Digital Collections. The library is responsible for the annual Student Research Symposium, so that provides a stepping stone beyond library content and some additional stakeholders (faculty whose students are participants in the symposium). Also the e-newsletter of faculty publications. PR’s suggestions: University homepage, campus e-newsletter, and catalogs (undergraduate and graduate).
  10. Popular campus activities like Athletics, student organizations, alumni. Influential friends: president and provost. One-time events: anniversaries, etc. It is now possible to archive intranet content behind a username and password (provided that the partner supplies those credentials in the web application).
  11. Archive-It help pages are user-friendly.
  12. Options: • Limit or expand • Set a time • Set a data limit • Crawl frequency. University Homepage: the homepage URL does not necessarily need to be crawled further than the root at the moment; once a month. RockPride (campus e-newsletter): a new RockPride online magazine runs every Friday during the school semester, and roughly once a month throughout the summer. Catalog: undergraduate and graduate catalogs, back to 2004; annually.
  13. Look at other college and university sites (available from the Archive-It site) to see what they were harvesting and how they were naming collections. Stumbling blocks: Kristen and Alexis?
  14. [Alexis first, then Kristen and Judy can add]
  15. [Alexis first, then Kristen and Judy can add]
  16. [Alexis first, then Kristen and Judy can add]
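
The quality-control tracking on slide 36 (a spreadsheet kept per seed, with yes/no checks such as "crawl too large" and robots.txt, plus errors and follow-up actions) might be approximated in code roughly as follows. This is only a sketch: the QCRecord class, its column names, the example.edu URLs, and the "example-crawler" user agent are invented for illustration and do not come from the presenters' spreadsheet or from Archive-It. It relies on Python's standard csv and urllib.robotparser modules and reads robots.txt rules from a string rather than over the network.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from urllib.robotparser import RobotFileParser


@dataclass
class QCRecord:
    """One spreadsheet row per seed (column names are hypothetical)."""
    seed: str
    crawl_too_large: bool
    blocked_by_robots: bool
    seed_errors: str = ""
    embedded_file_problems: str = ""
    follow_up: str = ""  # e.g. "recrawl", "patch crawl", "contact web administrator"


def is_blocked_by_robots(robots_txt: str, url: str,
                         user_agent: str = "example-crawler") -> bool:
    """Return True if the given robots.txt text disallows this URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def to_csv(records: list[QCRecord]) -> str:
    """Serialize QC records to CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(QCRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical robots.txt content and seed URLs.
    robots = "User-agent: *\nDisallow: /private/\n"
    seed_urls = [
        "http://www.example.edu/catalog/",
        "http://www.example.edu/private/minutes.html",
    ]
    rows = [
        QCRecord(
            seed=url,
            crawl_too_large=False,
            blocked_by_robots=is_blocked_by_robots(robots, url),
        )
        for url in seed_urls
    ]
    print(to_csv(rows))
```

Keeping the log as plain CSV means it can still be opened and edited in Excel, which fits the small-staff workflow described in the slides.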