An Institutional Perspective to Rescue Scholarly Orphans
1. @mart1nkle1n @hvdsomp
CNI Spring 2019, April 8, 2019, St. Louis, MO
Martin Klein
LANL
@mart1nkle1n
https://orcid.org/0000-0003-0130-2097
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
An Institutional Perspective to Rescue Scholarly Orphans
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
2. @mart1nkle1n @hvdsomp
Scholarly Orphans Team
• Los Alamos National Laboratory:
• Lyudmila Balakireva
• Martin Klein
• James Powell
• Harihar Shankar
• Herbert Van de Sompel
• Old Dominion University:
• Sawood Alam
• Grant Atkins
• Shawn Jones
• Mat Kelly
• Michael L. Nelson
4. @mart1nkle1n @hvdsomp
Research and Research Communication on the Web
• Consideration
• Researchers are increasingly using a variety of web platforms for collaboration and communication
• Why?
• Many of these platforms have desirable characteristics
• Versioning
• Time stamping
• Social embedding
• Their institutions do not provide platforms that have global reach
• Collaboration, cf. GitHub ~ productivity
• Communication, cf. SlideShare ~ visibility
5. @mart1nkle1n @hvdsomp
Emma Schymanski
https://orcid.org/0000-0001-6868-8145
https://github.com/schymane
https://www.slideshare.net/EmmaSchymanski
https://figshare.com/authors/Emma_Schymanski/5087039
https://publons.com/author/1538491/emma-schymanski#profile
https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
6. @mart1nkle1n @hvdsomp
Shawn Jones
https://orcid.org/0000-0002-4372-870X
http://www.shawnmjones.org/
https://github.com/shawnmjones
https://www.slideshare.net/shawnmjones
https://en.wikipedia.org/wiki/User:Shawnmjones
https://www.blogger.com/profile/17827543974149663194
7. @mart1nkle1n @hvdsomp
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Web platforms:
• Dedicated to scholarship:
• Commercial: e.g., Figshare, Publons
• Not for profit: e.g., OSF, Zenodo
• General purpose:
• Commercial: e.g., GitHub, SlideShare
• Not for profit: e.g., Wikipedia, Wikidata
8. @mart1nkle1n @hvdsomp
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo: The researchers' institutions are in the dark
• Do not know about the existence of these artifacts
• Do not have a copy of these artifacts
9. @mart1nkle1n @hvdsomp
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo: Uncertainty regarding long-term access
• Commercial: changing business model, no preservation commitment
• Not for profit: unpredictable funding stream
10. @mart1nkle1n @hvdsomp
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo: Not systematically archived
• No frameworks like LOCKSS/Portico exist for these artifacts
• Researchers only selectively deposit artifacts in portals that provide archival guarantees, e.g., to obtain a citable DOI
• Can't expect researchers to (also) upload all artifacts to IRs
• Web archives only incidentally archive these artifacts, cf. anecdotal evidence and evidence from the Hiberlink project
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
11. @mart1nkle1n @hvdsomp
Emma’s SlideShare Artifact: 0 Mementos
https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
http://timetravel.mementoweb.org/
12. @mart1nkle1n @hvdsomp
Shawn’s GitHub Artifact: 1 Memento
https://github.com/shawnmjones/mediawiki
https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
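Memento counts like the ones on these two slides can be derived from a TimeMap, the list of known captures that Memento-compliant archives expose for a URI. The following is a minimal sketch that counts the entries carrying the "memento" relation in a link-format TimeMap. The sample TimeMap text, including its capture datetime, is made up for illustration and is not the actual archive response for Shawn's repository.

```python
import re

# One link-value in a link-format TimeMap: <target>; param="value"; ...
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[\w-]+\s*=\s*"[^"]*"\s*)*)')
REL_RE = re.compile(r'rel\s*=\s*"([^"]*)"')


def count_mementos(timemap_text):
    """Count links whose rel attribute includes the token 'memento'."""
    count = 0
    for _target, params in LINK_RE.findall(timemap_text):
        rel = REL_RE.search(params)
        # rel may hold several tokens, e.g. "first last memento"
        if rel and "memento" in rel.group(1).split():
            count += 1
    return count


# Illustrative TimeMap (assumed content, not a real archive response).
SAMPLE_TIMEMAP = '''<https://github.com/shawnmjones/mediawiki>; rel="original",
<https://web.archive.org/web/timemap/link/https://github.com/shawnmjones/mediawiki>; rel="self"; type="application/link-format",
<https://web.archive.org/web/https://github.com/shawnmjones/mediawiki>; rel="timegate",
<https://web.archive.org/web/20180101000000/https://github.com/shawnmjones/mediawiki>; rel="first last memento"; datetime="Mon, 01 Jan 2018 00:00:00 GMT"
'''

print(count_mementos(SAMPLE_TIMEMAP))
```

Links with the relations "original", "self", and "timegate" are skipped, so only actual captures are counted.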
13. @mart1nkle1n @hvdsomp
Scholarly Orphans – Project Overview
How to capture Scholarly Orphans for long-term archiving?
14. @mart1nkle1n @hvdsomp
The Scholarly Orphans Project
• Explores an institution-driven paradigm
• Academic institutions typically have a long shelf life
• A basic premise underlying e.g., LOCKSS, perma.cc
• An academic institution should be interested in capturing the
artifacts (intellectual property) its scholars deposit on the web
• Collecting and archiving such artifacts aligns with the
mission of academic libraries
16. @mart1nkle1n @hvdsomp
The Scholarly Orphans Project
• Explores a paradigm inspired by web archiving
• Scale of the problem
• Can’t expect researchers to upload all artifacts in an institutional
repository
• Bilateral agreements for archival purposes with most web
portals unlikely
21. @mart1nkle1n @hvdsomp
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an
institutional researcher in a portal, one reasonably needs:
• The web identity of the researcher in the portal
• Algorithmic discovery
• Discovery via a registry
• Manual collection
• A portal API that supports:
• Access by web identity
• Access to contributions “since …” for the web identity
• Result of tracking:
• URI(s) of new artifact(s) discovered in the portal
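The tracking requirements above can be sketched as a small polling loop. This is a sketch under stated assumptions, not the project's tracker: the `Tracker` class, the `fake_portal` function, and the `portal.example` URIs are all hypothetical stand-ins for a portal API that supports access by web identity and a "since" filter.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List, Tuple

# (artifact URI, deposit time), as a hypothetical portal API might return it
Contribution = Tuple[str, datetime]
PortalClient = Callable[[str, datetime], Iterable[Contribution]]


@dataclass
class Tracker:
    client: PortalClient
    # Most recent deposit time already seen, per web identity
    last_seen: Dict[str, datetime] = field(default_factory=dict)

    def poll(self, web_identity: str, start: datetime) -> List[str]:
        """Return URIs of artifacts deposited after the last poll."""
        cutoff = self.last_seen.get(web_identity, start)
        # Filter again locally in case the portal ignores the 'since' argument.
        found = sorted(
            (c for c in self.client(web_identity, cutoff) if c[1] > cutoff),
            key=lambda c: c[1],
        )
        if found:
            self.last_seen[web_identity] = found[-1][1]
        return [uri for uri, _ in found]


def fake_portal(web_identity, since):
    """Hypothetical portal: returns all deposits and ignores 'since'."""
    utc = timezone.utc
    deposits = {
        "alice": [
            ("https://portal.example/alice/old-talk", datetime(2018, 7, 15, tzinfo=utc)),
            ("https://portal.example/alice/slides-1", datetime(2018, 8, 3, tzinfo=utc)),
            ("https://portal.example/alice/slides-2", datetime(2018, 8, 20, tzinfo=utc)),
        ]
    }
    return deposits.get(web_identity, [])


tracker = Tracker(fake_portal)
start = datetime(2018, 8, 1, tzinfo=timezone.utc)
print(tracker.poll("alice", start))  # the two deposits made after the start date
print(tracker.poll("alice", start))  # nothing new on the second poll
```

Keeping the cutoff per web identity lets each researcher be polled at a different frequency without re-reporting artifacts.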
22. @mart1nkle1n @hvdsomp
Tracking Artifacts - Challenges
• Portal API access by web identity
• Broadly supported by general purpose portals
• Typically not supported by scholarly portals
• Some lack an API altogether
• Should add ORCID access to APIs
• OAI-PMH and ResourceSync need sets per web identity
• Professional versus personal contributions
• Tracking frequency/scale
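To illustrate the point about sets per web identity, here is a sketch of an OAI-PMH selective-harvesting request. The `verb`, `metadataPrefix`, `set`, and `from` arguments are standard OAI-PMH; the `orcid:` set naming scheme and the `repository.example` endpoint are assumptions, since each repository defines its own sets.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def list_records_url(base_url, orcid, since):
    """Build a ListRecords request limited to one researcher's set.

    Assumes (hypothetically) that the repository exposes one set per
    ORCID, named 'orcid:<id>'.
    """
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "set": "orcid:" + orcid,
        "from": since,  # day granularity, YYYY-MM-DD
    })
    return base_url + "?" + query


url = list_records_url(
    "https://repository.example/oai",  # hypothetical endpoint
    "0000-0003-0130-2097",
    "2018-08-01",
)
print(url)
```

Without such a per-identity set, a tracker would have to harvest the whole repository and filter by author afterwards.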
24. @mart1nkle1n @hvdsomp
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact
discovered in a portal
• Its task is to create a representative institutional capture of the
artifact
• Result of capture:
• WARC file for new artifact in an institutional archive
25. @mart1nkle1n @hvdsomp
Capturing Artifacts - Challenges
• Delineate the web boundary of the artifact
• More than the input artifact URI
• The boundary is in the eye of the beholder
• Create a high-fidelity capture using an approach that scales for a
steady stream of new artifacts
• Determine the web boundary of the artifact
• Handle dynamic content & interactive features of web pages
• We made a significant breakthrough with the Memento Tracer
framework
Memento Tracer: http://tracer.mementoweb.org
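To make the boundary problem concrete, the sketch below applies a deliberately naive heuristic: treat every linked URI on the same host and under the artifact's path as part of the artifact. This is not how Memento Tracer works; the heuristic, the `naive_boundary` function, and the sample page are assumptions for illustration, and the rule would fail on many real portals, which is part of why the boundary is described as being in the eye of the beholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def naive_boundary(artifact_uri, html):
    """Return the artifact URI plus same-host links below its path."""
    collector = LinkCollector()
    collector.feed(html)
    base = urlsplit(artifact_uri)
    prefix = base.path.rstrip("/") + "/"
    boundary = {artifact_uri}
    for link in collector.links:
        absolute = urljoin(artifact_uri, link).split("#")[0]
        parts = urlsplit(absolute)
        if parts.netloc == base.netloc and parts.path.startswith(prefix):
            boundary.add(absolute)
    return sorted(boundary)


# Hypothetical landing page of an artifact.
PAGE = """
<a href="/alice/project/tree/master">code</a>
<a href="/alice/project/issues#top">issues</a>
<a href="/bob/other">someone else's work</a>
<script src="https://cdn.example/app.js"></script>
"""

for uri in naive_boundary("https://portal.example/alice/project", PAGE):
    print(uri)
```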
27. @mart1nkle1n @hvdsomp
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file generated by the capture process
• Its task is to ingest the WARC file into a cross-institutional web archive
• This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback
• Result of archiving:
• Mementos pertaining to the newly discovered artifact in a cross-institutional, Memento-compliant web archive
• Possibility to link to artifacts using Robust Links:
<a href="URI-A"
data-versionurl="URI-M"
data-versiondate="date-of-capture">link text</a>
Robust Links: http://robustlinks.mementoweb.org/about/
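A Robust Link of the form shown above can be generated once the archiving step has produced a memento. The following is a minimal sketch: the `robust_link` helper and the `archive.example` memento URI are hypothetical, while the `data-versionurl` and `data-versiondate` attribute names come from the slide.

```python
from datetime import date
from html import escape


def robust_link(uri_a, uri_m, captured, text):
    """Build an HTML anchor that carries both the original URI (URI-A)
    and the URI of its archived snapshot (URI-M)."""
    return (
        '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'
    ).format(
        escape(uri_a, quote=True),
        escape(uri_m, quote=True),
        captured.isoformat(),  # ISO 8601 date of capture
        escape(text),
    )


link = robust_link(
    "https://github.com/shawnmjones/mediawiki",
    # Hypothetical memento URI in an institutional archive
    "https://archive.example/20180901/https://github.com/shawnmjones/mediawiki",
    date(2018, 9, 1),
    "mediawiki repository",
)
print(link)
```

Escaping the attribute values keeps the markup valid when a URI contains characters such as `&` in a query string.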
28. @mart1nkle1n @hvdsomp
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
• Cross-institutional distributed file system with redundancy
• Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
30. @mart1nkle1n @hvdsomp
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
31. @mart1nkle1n @hvdsomp
myresearch.institute - Portals
• Tracking started August 27, 2018
• Tracking covers artifacts created on or after August 1, 2018
39. @mart1nkle1n @hvdsomp
Scholarly Orphans – Pipeline
• 10,187 unique artifacts tracked, captured, and archived since August 1, 2018
• 41MB event database
• 61GB of WARC files
• 2.3GB of web archive index
40. @mart1nkle1n @hvdsomp
Scholarly Orphans – Pipeline
• Capture process, time elapsed after tracking
• 50% of artifacts captured within 9 minutes
• 75% of artifacts captured within 1 hour 21 minutes
• Archiver process, time elapsed after capture
• 50% of artifacts archived within 10 minutes
• 75% of artifacts archived within 57 minutes
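Figures of this kind are percentiles of per-artifact latency. The sketch below shows one way to compute them with the nearest-rank method. The latency values are made up for illustration and are not the project's data, and the slides do not say which percentile method was used.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of all values are less than or equal to it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical minutes between tracking and capture for eight artifacts.
latencies = [2, 5, 9, 12, 30, 81, 200, 600]

print(percentile(latencies, 50))  # half of the artifacts were captured within this time
print(percentile(latencies, 75))  # three quarters were captured within this time
```

Reporting percentiles rather than a mean keeps a few very slow captures from dominating the summary.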
41. @mart1nkle1n @hvdsomp
Summary
• The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals
• These artifacts are out of scope of existing archival approaches such as LOCKSS, Portico, and web archives
• Institutions have a long shelf life, should be interested in collecting these artifacts, and operate at a feasible scale for identity/artifact discovery
• Prototype at myresearch.institute illustrates feasibility, opportunities, and challenges of this institutional perspective
42. @mart1nkle1n @hvdsomp
What Our Researchers Say…
“Ha, this is awesome! Thanks for letting me know - carry on as usual, and feel free to monitor away. I'll try not to change my behaviour or anything now with this new knowledge :)”
“This is fine, since everything you are capturing is public to start with. I also wonder if you know about Software Heritage?”
“I’m very comfortable with being part of this (very important) research project”
“I'm cool with it :-)”
“Interesting project! I’m happy to participate.”
“One more thing, is it possible to get a copy of the URI-Rs that you guys detected so that I can feed them into an archive of my choice?...”