About the Webinar
The digitization of resources can provide expanded access to information as well as a preservation mechanism for now-fragile materials. Preserving the digital copy of the resource is an issue now being addressed, but what about the software used to create digital files? How can software on media which can no longer be read -- or no longer be read easily -- be preserved? If that software can’t be accessed, what happens to the material created by, and only read by, that software?
Progress has been made in formulating standards for the preservation and description of digital materials, and a framework for addressing digital item preservation has been proposed. However, despite meetings such as the Library of Congress’ “Preserving.exe: Toward a National Strategy for Preserving Software,” no formal standard or framework yet exists for software digitization and preservation. This webinar will feature three presenters who will speak on aspects of software digitization and preservation, including a how-to approach (technical aspects), a metadata component, and observations from the field as part of the continuing discussion on the state of the field and the need for standardization.
Agenda
Introduction
Todd Carpenter, Executive Director, NISO
Software artifacts: Migration and Emulation
Michael Lesk, Professor of Library and Information Science, Rutgers University
Emulation in practice: Emulation as a Service at Yale University Library: Lessons learnt and plans for the future
Euan Cochrane, Digital Preservation Manager, Yale University Library
No (You Can't Expect To Run Your Files Just Because You Saved Them)
Jon Ippolito, Professor of New Media and Director of the Digital Curation graduate program, University of Maine
NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run Them?
1. NISO Webinar:
Software Preservation and Use:
I Saved the Files But Can I Run Them?
Wednesday, May 13, 2015
Speakers:
Michael Lesk
Professor of Library and Information Science, Rutgers University
Euan Cochrane
Digital Preservation Manager, Yale University Library
Jon Ippolito
Professor of New Media and
Director of the Digital Curation Graduate Program,
University of Maine
http://www.niso.org/news/events/2015/webinars/software/
3. Software preservation
The hard problem is not bad tape; it’s obsolescence.
There are two common answers to the obsolescence
problem.
Migration or emulation?
Migration: Convert the old information to a new
format, e.g., BMP to JPEG.
Emulation: Use old information on a new version of
an old machine, e.g. using a website that looks like
an arcade game platform.
4. Why might old software be lost?
All the copies were thrown away.
The copies still exist, but the media have worn out.
The media are OK, but we have no device to read them.
We can read the bits, but we don’t know what they mean.
We understand the bits, but have no software to process them.
We have software but nothing to run it on.
The software depends on an environment that no longer exists.
We could process the bits, but we lack legal permission.
5. Discarded
We know the first telegram:
“What hath God wrought?” May 24, 1844; Samuel F. Morse, in
Washington, to Alfred Vail, in Baltimore.
We know the first telephone call:
“Mr. Watson—Come here—I want to see you.” March 10, 1876.
Alexander Graham Bell to Thomas A. Watson, in Boston.
We don’t know the first email message. It was in the spring of
1964, in either Cambridge (UK), Cambridge (Mass.), or
Pittsburgh; but whatever it was, it was thrown out, and nobody
kept good records.
The solution to this problem is multiple copies. Digital copies are
perfect and cheap; use them.
6. Media fragility
In the 1970s Brazil stored Landsat space photography of their
country on magnetic tape. These tapes were stored in humid
conditions and deteriorated until they were unreadable.
Magnetic tape is often fragile; audio tape is lost as well. It helps
to start with better quality tape, and linear tape (audio) is better
than helical tape (VHS cassettes). Sometimes it helps to heat the
tape, once; hence one of the great titles in preservation
literature, “If I knew you were coming, I’d have baked a tape”
(Eddie Ciletti).
Again the solution to this is multiple copies, regularly inspected.
Note projects like LOCKSS: Lots of copies keep stuff safe.
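"Multiple copies, regularly inspected" can be sketched in a few lines. This is only an illustration, assuming that each copy is an ordinary file and that a SHA-256 digest was recorded when the copies were made; the function names are hypothetical and are not taken from LOCKSS or any other project.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inspect_copies(paths, expected_digest):
    """Return the copies that are missing or no longer match the recorded digest."""
    damaged = []
    for path in paths:
        if not Path(path).is_file() or sha256_of(path) != expected_digest:
            damaged.append(str(path))
    return damaged
```

Any copy reported as damaged can be replaced from one that still matches, which is the reason for keeping more than one.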
7. Devices gone
Where today would you find a diskette drive? And that’s an easy
one: what about a paper tape reader?
The answer to those is eBay, but what about special-purpose
technology that failed in the marketplace, such as kinds of 12”
writeable optical disk from the early 1990s?
Again, the answer is multiple copies on current devices. Even if
your organization thinks it’s prepared to keep its 1980-vintage
DEC computer running for a long time, where would you find
spare parts when it broke? Or a technician who knew what to do
with them?
8. Forgot the format
It is possible to have a format and not know what it is. Suppose you
have a file made by Volkswriter, marketed by Lifetime Software
(which, despite its name, ceased operating independently in 1991).
How would you find out the control codes?
If you can’t find documentation, it may be easier to view this as a
decipherment problem: if you find a funny symbol at “plus ?a
change, plus c’est …” it’s the French ç character.
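The decipherment idea can be tried mechanically: decode the same bytes under several candidate character sets and keep the ones that produce the word you expect. The sketch below is an assumption-laden illustration, not a description of how Volkswriter stored text; the candidate list and the function name are hypothetical.

```python
def guess_encodings(raw, expected,
                    candidates=("cp437", "cp850", "latin-1", "mac_roman")):
    """Return the candidate codecs under which `expected` appears in the decoded text."""
    matches = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot represent the bytes at all
        if expected in text:
            matches.append(name)
    return matches


# Byte 0x87 is the "funny symbol" sitting where the cedilla should be.
sample = b"plus \x87a change"
print(guess_encodings(sample, "ça"))
```

More than one codec may match a short sample, so a longer passage with several accented letters narrows the choice.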
Now we’re into the real issues: is it better to try to find a copy of the
software or to convert the file to a current standard, like Word? In this
case (word processing) conversion is probably easier.
Solution: use standard formats. Preferably public ones.
9. No software is available
Again, the vendor who wrote the software originally used for your
file might have gone out of business. If your file is in a public format,
there is probably an alternative. But if it was in a proprietary
format, it may be difficult to find something that reads it. There
was a time, for example, when Microsoft deliberately arranged for
old MS-Word documents to be unreadable on newer versions so
that customers would be forced to upgrade continuously. And in
those days, Microsoft tried to prohibit other vendors from selling
software that read and translated the “.doc” format; some of them
did it anyway, and Microsoft gave up.
The solution is public formats and current formats; for example the
newer “.docx” files in Microsoft Word have a public description.
10. No machine to run the software
Now we’re into the hard part of the problem: you might have some
kind of program but it was coded to run on a long-gone machine
(Commodore 64, anyone?). Your choice is between:
Finding a machine for sale on eBay – but you can’t get parts to fix it,
and you may have trouble finding out how to make it work.
Migrating whatever this is to a modern platform, ideally expressing
it in public standard terms.
Finding an emulator for the old machine: something that will run
the old code as it was.
11. Migration vs. Emulation
Migration means converting files to newer formats. For example,
Amiga graphics to Tiff or JPEG. If you migrate to a public standard
you minimize the chance of having to do it again. It’s hard to guess
which commercial formats will survive: if you had asked me in the
1990s whether a Kodak image format would survive, I would have
said yes. You have to do it for every format. But you get modern
capabilities with the converted files.
Emulation means programming a current machine to behave like an
old machine. This is a difficult task, but emulators exist for many
common machines, particularly game platforms. A notable project
is Olive (olivearchive.org) which is aimed at preservation of
intellectual content beyond video games (CMU, IBM, and others).
You get only the old behavior of the program.
12. Examples
Migration:
JSTOR, and many old journal systems: the early issues, whatever
their original formats, are now in PDF. Often they were just OCRd
from the printed version, rather than translated digitally (high
proofreading cost but minimal programming complexity). You can
use all modern PDF tools on the articles.
Emulation:
The Internet Arcade is a collection of 1970s-80s arcade games that
you can run in an emulator:
https://archive.org/details/internetarcade
13. Some very special cases
Colossus, 1942. Colossus re-build, 1996
Charles Babbage’s
Difference Engine,
as rebuilt by the
Science Museum
(London), 2002.
14. Analogy
Consider performing early music. Should you play it on old
instruments or modern ones? Old instruments are more authentic,
but have a different effect on the modern ear. Bach’s listeners had
not heard a piano and the organ did not sound “old fashioned”.
Emulation is finding an old church (there are some in Germany
whose architecture and organ pipes are not changed from Bach’s
day) and using old-fashioned performance techniques.
Migration is using a piano (and keyed flutes and trumpets, etc) but
trying to produce the same emotional effect.
Similarly with old books: Caslon and Baskerville did not look old to
people who had never seen Helvetica.
15. If you lack source code
In general, you can’t migrate a piece of software without the source
code, since you want to recompile it on a new machine. There are
de-assemblers, but the result is going to be a real pain to
understand. So if you have only the object code, you may be driven
to emulation. Since many software vendors keep source code very
secret, and did so in the past as well, it’s not uncommon to have
only the binary form of some program.
A legal warning: if you can’t find the vendor (out of business) and
get permission, you may not have permission even to use the binary
code, although this may depend on the terms of the original
purchase. It may or may not have allowed transferring the program
to a new user.
16. Features in old and new versions
Suppose you take an ancient word processor file and migrate it to a
modern format. Then you can do things like export HTML, or PDF.
Any tool that will use the modern format can work with your old
file. But the tool will give a modern result – it will run faster, use
modern display fonts, and the like.
If you are using an emulator, you get the old behavior. If the
program only displayed green on black, you get green on black. This
is “authentic” but you may not like it. And you may not be able to
create HTML or PDF from the program. If you are trying to merge
many such older documents into a digital library, the format
incompatibilities will make things worse.
17. Metadata
If you really want to preserve a complex software object, it
helps to know exactly what programs were used to create it.
That means not just the name, but the exact version. Other
issues that are more serious for digital preservation include
provenance: where did this come from? This is relevant for
answering questions about the material, or finding the people
who might know the answer. Similarly it may assist with rights
metadata, or technical metadata. Modern formats sometimes
have technical metadata included in the file (eg in a JPG
header) but older formats often don’t.
Again, it is easiest if you use well-known and common formats.
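A record of this kind can be kept as a small structured document stored alongside the object. The sketch below is one possible shape, assuming JSON as the storage format; the field names and the sample values are made up for illustration and do not follow any particular metadata standard.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class CreatingSoftware:
    name: str
    version: str       # the exact version, not just the product name
    platform: str


@dataclass
class PreservationRecord:
    title: str
    file_format: str
    created_with: CreatingSoftware
    provenance: str                     # where did this come from?
    rights: str = "unknown"
    technical_notes: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Sample values are placeholders, not a real accession.
record = PreservationRecord(
    title="Field notes",
    file_format="word processor document",
    created_with=CreatingSoftware("WordStar", "4.0", "MS-DOS"),
    provenance="donor's 5.25-inch diskette, box 3",
)
print(record.to_json())
```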
18. Standards
“The good thing about standards is that you have so much
choice.”
Even ASCII (ISO 646) is ambiguous. The UK changed the “#”
character to mean “£” and Germany changed “}” to “ü” .
Particularly worrisome are “wrapper” formats. Tiff may
contain different kinds of image compression algorithms (such
as G4 fax, or Lempel-Ziv), and thus a Tiff reader may not be
able to read all Tiff images. Some image viewers understand
progressive images in GIF or JPG; some don’t. PDF can include
the kitchen sink (eg 3-D viewers).
Solution: emphasize the best and most public formats.
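To see why a Tiff reader may not read every Tiff image, it helps to look at the Compression tag (tag 259) in the file's first image directory. The sketch below reads only that tag; it assumes a classic (non-BigTIFF) file held in memory and ignores every other tag. The function name is hypothetical; the tag number and code values are those of the TIFF 6.0 specification.

```python
import struct

# A few Compression values defined by TIFF 6.0 and its technical notes.
COMPRESSION_NAMES = {
    1: "uncompressed",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax",
    5: "LZW (Lempel-Ziv-Welch)",
    7: "JPEG",
    32773: "PackBits",
}


def tiff_compression(data):
    """Return the Compression code from the first image directory of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i          # each directory entry is 12 bytes
        tag, _field_type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == 259:
            # A single SHORT value is stored left-justified in the value field.
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return 1  # TIFF 6.0 default when the tag is absent: no compression
```

A viewer that only implements codes 1 and 5 will open the wrapper of a G4 fax image and still fail to show the picture.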
19. Missing environment
What would it mean to preserve the “Amazon home page”? It is
different for every person using it and for each instance – it’s
synthesized using the browsing and order history of the user, the
current incentives for sales at Amazon, and lots else (geography,
source computer, etc.). There are many pieces of software that
depend on almost everything around them: think about all the
install scripts that ask “we want to use your location,” “we want to
use your browser history,” and so on. (And of course many
programs don’t ask, they just use them.)
No good answer for this. You have to judge what you mean by
preserving the object – what will the users want the behavior to be?
20. Protection from abuse
If you run a general-purpose preservation operation, you need
to think about whether anything in your preservation files is
dangerous or doubtful in some way. People might try to use
your system to distribute malware (viruses) or to enable
software piracy.
Thus, unfortunately, you may want to put out calls like “please
send in examples of early APL software” but you can’t just
accept anything, and can’t rely on statements made by
unknown volunteers about what they are submitting.
21. Legal permission
You may have an object, and know what to do with it, but not have
legal permission to preserve it. For example, many of the video
game companies object to attempts to imitate the old games – to
them, this is creating competition for new games.
Unfortunately, given the copyright trolls out there, who try to make
a living by finding people who have downloaded something they
shouldn’t have, and then threatening them with lawsuits, this is not
an area where it is easier to get forgiveness than permission.
Libraries are often justifiably paranoid.
There is of course the preservation exception in the law; but it limits
a library to on-premises use.
22. Good and bad
Why software preservation is hard: the material is not self-
describing, there were many early products that vanished without
adequate documentation, software can be very complex, it requires
special hardware to run, and so on….
Why software preservation is easy: as with all digital information, it
can be copied without error; if one person has migrated a format or
emulated a machine, that can be used by others; and computers are
new enough that there is probably no computer without some user
who is still alive. I learned to program on a Univac I; that doesn’t
mean I have a tape drive that uses its steel tapes (yes, steel), but at
least I know what they are.
23. Conclusion
The biggest technical choice is migration vs. emulation. I
would generally say:
migration for static formats
emulation for executable programs
There are some ambitious programs: the Computer History
Museum in Mountain View has been able to salvage old
machines like the Xerox Alto.
But the industry does a lot less than we would like; it is more
common to have legal problems in salvaging software than to
get financial help from its original marketer.
24. Emulation in Practice
Emulation as a Service at Yale University Library; lessons learnt and plans for
the future
Euan Cochrane, Digital Preservation Manager, Yale University Library
25. Overview
1. Why should we care about emulation?
2. What is emulation?
3. How do we do emulation?
4. What is Emulation as a Service (EaaS)?
5. How we use EaaS
6. Lessons learnt using EaaS
7. Future work at Yale University Library (YUL)
27. Why? - Executable content
• Video games
• Research data workflows
• Digital Art
• Software as artifact
• Digital artifact museums
(preserving the tools and infrastructure of the digital age)
28. Why? – Software dependent
content
Content that requires software in order to be rendered
or interacted with:
• Office files (documents, spreadsheets, slide sets, etc)
• CAD files
• Outlook inboxes
• eBooks with note taking capability
• Desktop environments
• Code
• Any proprietary, or effectively proprietary, formats
30. Old software is required to
authentically render old content
Original content in original
software (WordPerfect in
Windows 95)
Original content in newer
software (LibreOffice Writer in
Windows Vista)
31. Research results are at risk of loss
without original software
Original content in original software
(WordStar for DOS in Microsoft DOS)
[NB: equation predicting tree growth
rates includes exponents documented
using upper line of text]
Original content in newer software
(LibreOffice Writer in Windows
Vista)
33. How? – Emulation and virtualization
software tools
• An emulation software package
(“emulator”) is used to create a
virtual version of one computer
within another computer that has
different hardware
• Old software can be run on the
“emulated” computer hardware just
like it was running on the original
physical computer.
• Many emulators were originally
developed to run old video games
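At its core an emulator is a loop that fetches each instruction of the old machine and carries out its effect in software. The sketch below shows that loop for an imaginary 8-bit accumulator machine; the instruction set (LOAD, ADD, PRINT, JNZ, HALT) is invented for illustration and does not correspond to any real hardware or to any emulator named in these slides.

```python
def run(program, max_steps=10_000):
    """Interpret a program for a made-up accumulator machine; return what it printed."""
    acc = 0        # the machine's single 8-bit register
    pc = 0         # program counter
    output = []
    for _ in range(max_steps):
        op, arg = program[pc]
        pc += 1
        if op == "LOAD":
            acc = arg % 256
        elif op == "ADD":
            acc = (acc + arg) % 256    # wraps around like 8-bit hardware
        elif op == "PRINT":
            output.append(acc)
        elif op == "JNZ":              # jump if the accumulator is not zero
            if acc != 0:
                pc = arg
        elif op == "HALT":
            return output
        else:
            raise ValueError(f"unknown instruction: {op}")
    raise RuntimeError("step limit reached without HALT")


# A countdown written for the imaginary machine; adding 255 subtracts 1 modulo 256.
countdown = [("LOAD", 3), ("PRINT", None), ("ADD", 255), ("JNZ", 1), ("HALT", None)]
print(run(countdown))  # prints [3, 2, 1]
```

The old program runs unchanged; only the interpreter has to be rewritten when the host hardware changes.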
34. How? – Software tools
• Emulation is often used to support old hardware
devices that require obsolete software
(e.g. assembly line management software, scientific instruments,
industrial machinery, etc)
• Emulation is widely used by mobile phone
application developers to develop software for
phone-hardware using desktop-PC hardware
(i.e. phone hardware is emulated on desktop pcs to build phone-
compatible applications)
• Virtualization = emulation but with compatible
hardware
(some of the host machine’s hardware is used directly by the
“virtualized” computer)
Virtualization bridges the gap between departure of recently obsolete
hardware and the arrival of hardware powerful enough to emulate it
35. How? – Preserving software
and dependencies
• We need to curate and preserve operating systems to support access to
assets that depend on them
• We need to curate and preserve software applications to support access to
content that depends on them
• We need to curate and preserve fonts, scripts, plug-ins and other
dependencies to support access to content that requires them
• We need to preserve whole desktop environments (e.g. Salman Rushdie’s
desktop at Emory University) to support access to the experience of interacting
with them
• We need to curate and preserve pre-configured disk images with software
already installed on them – for running on emulated hardware
36. How? - Documentation
• We need unique, persistent identifiers for software
• We need software catalogues
• We need unique, persistent identifiers for disk images
(installed environments/virtual hard drives)
• We need disk image/virtual hard drive catalogues
• We need unique, persistent identifiers for
emulated/virtualized hardware configurations
• We need hardware configuration catalogues
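One reason for wanting all three kinds of identifier is that an emulated environment is only usable if every part it cites can be resolved. The sketch below checks that; the identifier scheme ("sw:", "img:", "hw:") and the catalogue contents are hypothetical placeholders, with plain dictionaries standing in for real catalogues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentRecord:
    software_ids: tuple
    disk_image_id: str
    hardware_config_id: str


def missing_references(record, software, disk_images, hardware_configs):
    """List the identifiers a record cites that no catalogue can resolve."""
    missing = [s for s in record.software_ids if s not in software]
    if record.disk_image_id not in disk_images:
        missing.append(record.disk_image_id)
    if record.hardware_config_id not in hardware_configs:
        missing.append(record.hardware_config_id)
    return missing


# Placeholder catalogues: identifier -> description.
software = {"sw:0001": "operating system", "sw:0002": "word processor"}
disk_images = {"img:0001": "installed environment with both packages"}
hardware_configs = {"hw:0001": "emulated PC configuration"}

env = EnvironmentRecord(("sw:0001", "sw:0002"), "img:0001", "hw:0001")
print(missing_references(env, software, disk_images, hardware_configs))  # prints []
```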
37. How? - Documentation
We don’t have these (yet!)*
*Mostly, the Internet Archive is doing great work, as are NIST and PRONOM
39. How? – Configuring emulated hardware
• Admins configure an emulator
• Admins install and/or configure the emulated software
• Requires various emulator-specific, technically challenging tools
40. How? – accessing emulated
environments at libraries and
archives
• Users access emulated environments via dedicated machines
• Use dedicated software
• At libraries and archives this is mostly restricted to reading rooms
43. Emulation as a Service –What is
it?
Remote access to pre-configured emulated and virtualized
environments via any modern web browser
Abstracts configuration challenges away from end-users
Changes to environments can be saved or discarded at the end
of a session (a fresh/unchanged version is always available)
Interactivity can be restricted where appropriate (e.g. limited
ability to download or copy content to local computer)
Relatively simple way to provide custom online environments
(virtual reading rooms?)
44. Emulation as a Service (EaaS)–
Why?
• A lot of old digital content can only be properly accessed using
emulation tools
• Emulation is technically specialized
• Old software can be challenging for modern users to understand
• Modern users don’t expect to have to come into a reading room
to access digital content
• Maintain control over content: users can’t copy data in or out
unless authorized (screenshots are inevitably excluded)
45. Emulation as a Service (EaaS)–
Why?
• Strong separation between environments, objects and
emulators/configurations
• Emulation can be provided remotely (outsourced), with disk
image archives and/or content maintained locally
• Small derivative environments can be created from base
environments, saving space
• Standard environments can be reused and customized
• Provides ability to cite environments
48. EaaS – How it works
(For Technical Administrators)
• Admins configure an emulator on local PC
• Admins configure the emulated software on a local PC
• Configured environment gets saved as a “disk image” with configuration metadata
49. EaaS – How it works
(For Technical Administrators)
• Admins confirm the software environment stored on the disk image works on local PC
• Admins/Archivists/Librarians ingest it into the EaaS service
50. EaaS – How it works
(For Librarians/Archivists)
• Pre-configured software environments (e.g. a Windows 95 + Office 95 environment) can have files added to them and be saved as a variant or as a stand-alone new environment
• Only difference (delta) between base-environments and customized environment retained – saving space by not duplicating virtual hard drive content
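The space saving from keeping only the delta can be illustrated with a copy-on-write overlay: reads fall through to the shared base image unless a block has been changed in the derived environment. This is a sketch of the general technique under simplifying assumptions (a tiny fixed block size, everything in memory); the class names are hypothetical and nothing here describes how the EaaS software actually stores images.

```python
BLOCK_SIZE = 4  # unrealistically small, to keep the example readable


class BaseImage:
    """A read-only disk image split into fixed-size blocks."""

    def __init__(self, data):
        self._blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

    def block_count(self):
        return len(self._blocks)

    def read_block(self, index):
        return self._blocks[index]


class DerivedImage:
    """A customized environment that stores only the blocks it has changed."""

    def __init__(self, base):
        self.base = base
        self.delta = {}  # block index -> replacement bytes

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must cover exactly one block")
        self.delta[index] = data  # the base image is never modified

    def read_block(self, index):
        if index in self.delta:
            return self.delta[index]
        return self.base.read_block(index)

    def contents(self):
        return b"".join(self.read_block(i) for i in range(self.base.block_count()))


base = BaseImage(b"AAAABBBBCCCC")
variant = DerivedImage(base)
variant.write_block(1, b"XXXX")
print(variant.contents(), len(variant.delta))
```

Discarding a session's changes then amounts to throwing away the delta, which is why a fresh, unchanged base is always available.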
51. EaaS – How it works
(For Librarians/Archivists)
• CD-ROMs and other software can be ingested, installed/configured on top of a base environment, and tested using an online interface
• Newly customized environment can be stored for future use and
52. EaaS – How it works
(For Librarians/Archivists)
• Librarians/Archivists can also ingest disk images captured from machines they have acquired (e.g. authors’/politicians’ desktops)
53. EaaS – How it works
(For end-users)
• Users can click on links in a
catalogue/finding aid to
access
environments/content
54. EaaS – How it works
(For developers and system
integrators)
• Provides generic access to functionality of many emulators and virtualization
tools via a WebService and REST API
• Emulation functionality can be incorporated into existing workflows
• Emulated (or virtualized) environments can be embedded into web pages for
online access and online exhibitions
• Emulated environment citations, thumbnails, and URIs/URLs enable easy
integration with existing catalogues and finding aids
• One-click “image-disk-and-emulate” workflows being developed (collaborating
with digital forensics initiatives)
• Open Source (currently available on request, code will be published in the
future)
55. EaaS – Background
• bwFLA EaaS project from University of Freiburg in
Germany (http://bw-fla.uni-freiburg.de)
• Personally collaborated with bwFLA at Freiburg
while at Archives New Zealand
• Now at Yale University Library and brought
collaboration along
• Yale University Library has (/had!) the only installation
outside of Germany
58. Related work
• Olive Archive https://olivearchive.org/
• Internet Archive
https://archive.org/details/software
• Keep Emulation framework
http://emuframework.sourceforge.net/
• QEMU http://wiki.qemu.org/Main_Page
59. EaaS at Yale
• Testing and providing requirements for ongoing
development
• Imaging general collections digital media & Trialing
access via EaaS
• Investigating workflow integration (virtual reading
rooms?)
• Finding gaps in supporting infrastructure
61. Lessons learnt
• Software licensing needs to be solved (abandonware
and out-of-cart software are huge problems)
• Scale is manageable through standardization and
sharing
• Archivists and Librarians can use EaaS with relatively
little training
• The possibilities of using EaaS in workflows are huge
• If EaaS becomes an assumption, creators may change
62. Future work at Yale University
Library
• Move EaaS into production
• Increase software archiving
• Develop standard shareable environment images
• Collaborate with others to maximize efficiency of software archiving
• Develop emulation testing standards and frameworks
• Explore options for preserving networked environments
• Make progress on the licensing issues
64. NISO Webinar • May 13, 2015
Questions?
All questions will be posted with presenter answers on
the NISO website following the webinar:
http://www.niso.org/news/events/2015/webinars/software/
65. Thank you for joining us today.
Please take a moment to fill out the brief online survey.
We look forward to hearing from you!
THANK YOU