NISO Webinar:
Software Preservation and Use:
I Saved the Files But Can I Run Them?
Wednesday, May 13, 2015
Speakers:
Michael Lesk
Professor of Library and Information Science, Rutgers University
Euan Cochrane
Digital Preservation Manager, Yale University Library
Jon Ippolito
Professor of New Media and
Director of the Digital Curation Graduate Program,
University of Maine
http://www.niso.org/news/events/2015/webinars/software/
Software preservation
Michael Lesk
Prof. of Library and Information Science
Rutgers University
New Brunswick, NJ 08901
lesk@acm.org
www.lesk.com
Software preservation
The hard problem is not bad tape; it’s obsolescence.
There are two common answers to the obsolescence
problem.
Migration or emulation?
Migration: Convert the old information to a new
format, e.g., BMP to JPEG.
Emulation: Use old information on a new version of
an old machine, e.g. using a website that looks like
an arcade game platform.
Why might old software be lost?
All the copies were thrown away.
The copies still exist, but the media have worn out.
The media are OK, but we have no device to read them.
We can read the bits, but we don’t know what they mean.
We understand the bits, but have no software to process them.
We have software but nothing to run it on.
The software depends on an environment that no longer exists.
We could process the bits, but we lack legal permission.
Discarded
We know the first telegram:
“What hath God wrought?” May 24, 1844; Samuel F. Morse, in
Washington, to Alfred Vail, in Baltimore.
We know the first telephone call:
“Mr. Watson—Come here—I want to see you.” March 10, 1876.
Alexander Graham Bell to Thomas A. Watson, in Boston.
We don’t know the first email message. It was in the spring of
1964, in either Cambridge (UK), Cambridge (Mass.), or
Pittsburgh; but whatever it was, it was thrown out, and nobody
kept good records.
The solution to this problem is multiple copies. Digital copies are
perfect and cheap; use them.
Media fragility
In the 1970s Brazil stored Landsat space photography of their
country on magnetic tape. These tapes were stored in humid
conditions and deteriorated until they were unreadable.
Magnetic tape is often fragile; audio tape is lost as well. It helps
to start with better quality tape, and linear tape (audio) is better
than helical tape (VHS cassettes). Sometimes it helps to heat the
tape, once; hence one of the great titles in preservation
literature, “If I knew you were coming, I’d have baked a tape”
(Eddie Ciletti).
Again the solution to this is multiple copies, regularly inspected.
Note projects like LOCKSS: Lots of copies keep stuff safe.
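A minimal sketch of what "multiple copies, regularly inspected" can look like, assuming the copies are ordinary files reachable from one machine. The names sha256_of and audit_copies are illustrative and are not taken from LOCKSS, which compares copies between peers with a more elaborate polling protocol.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(copies: list[Path]) -> dict[str, list[Path]]:
    """Compare copies of one object and report which ones look damaged.

    The digest shared by the most copies is treated as correct. This is
    only a heuristic: on a tie, a person has to decide which copy to trust.
    """
    digests = {}
    for path in copies:
        # A missing copy is recorded as None and always counts as suspect.
        digests[path] = sha256_of(path) if path.exists() else None
    counts = Counter(d for d in digests.values() if d is not None)
    if not counts:
        return {"good": [], "suspect": list(copies)}
    majority, _ = counts.most_common(1)[0]
    good = [p for p, d in digests.items() if d == majority]
    suspect = [p for p, d in digests.items() if d != majority]
    return {"good": good, "suspect": suspect}
```

Run on a schedule, a check like this turns silent decay into a repair job: any copy reported as suspect can be replaced from one of the good ones.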
Devices gone
Where today would you find a diskette drive? And that’s an easy
one: what about a paper tape reader?
The answer to those is eBay, but what about special-purpose
technology that failed in the marketplace, such as kinds of 12”
writeable optical disk from the early 1990s?
Again, the answer is multiple copies on current devices. Even if
your organization thinks it’s prepared to keep its 1980-vintage
DEC computer running for a long time, where would you find
spare parts when it broke? Or a technician who knew what to do
with them?
Forgot the format
It is possible to have a format and not know what it is. Suppose you
have a file made by Volkswriter, marketed by Lifetime Software
(which, despite its name, ceased operating independently in 1991).
How would you find out the control codes?
If you can’t find documentation, it may be easier to view this as a
decipherment problem: if you find a funny symbol at “plus ?a
change, plus c’est …” it’s the French ç character.
Now we’re into the real issues: is it better to try to find a copy of the
software or to convert the file to a current standard, like Word? In this
case (word processing) conversion is probably easier.
Solution: use standard formats. Preferably public ones.
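The decipherment idea can be sketched in a few lines: take a passage whose text you can guess, and see which candidate character encodings turn the mystery byte into the expected letter. The sample byte 0x87 and the list of encodings are assumptions chosen for illustration (0x87 is "ç" in the IBM PC code page 437); the source does not say which encoding Volkswriter actually used.

```python
def candidate_encodings(sample: bytes, expected: str,
                        encodings=("cp437", "cp850", "latin-1", "mac_roman")):
    """Return the encodings under which `sample` decodes to contain `expected`."""
    matches = []
    for name in encodings:
        try:
            text = sample.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes at all; rule it out.
            continue
        if expected in text:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Hypothetical bytes from an old file: the "funny symbol" is byte 0x87.
    sample = b"plus \x87a change"
    print(candidate_encodings(sample, "plus \u00e7a change"))
```

More than one encoding may match a short sample, so in practice several known phrases would be tried before settling on a conversion table.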
No software is available
Again, the vendor who wrote the software originally used for your
file might have gone out of business. If your file is in a public format,
there is probably an alternative. But if it was in a proprietary
format, it may be difficult to find something that reads it. There
was a time, for example, when Microsoft deliberately arranged for
old MS-Word documents to be unreadable on newer versions so
that customers would be forced to upgrade continuously. And in
those days, Microsoft tried to prohibit other vendors from selling
software that read and translated the “.doc” format; some of them
did it anyway, and Microsoft gave up.
The solution is public formats and current formats; for example the
newer “.docx” files in Microsoft Word have a public description.
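As a rough illustration of telling the two generations of Word file apart without trusting the file extension, the sketch below looks at the first bytes of a file. It assumes the widely documented container signatures: the OLE2 compound file header used by Word 97-2003 era ".doc" files, and the ZIP header used by ".docx". The function names are hypothetical, and a match only identifies the container, not the application: other OLE2 and ZIP files start the same way, and older Word for DOS files use neither.

```python
from pathlib import Path

# Signature of the OLE2 compound file container (legacy binary Office files).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header (the container used by .docx).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes: 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: Path) -> str:
    """Read just enough of the file to check its signature."""
    with path.open("rb") as handle:
        return sniff_container(handle.read(8))
```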
No machine to run the software
Now we’re into the hard part of the problem: you might have some
kind of program but it was coded to run on a long-gone machine
(Commodore 64, anyone?). Your choice is between:
Finding a machine for sale on eBay – but you can’t get parts to fix it,
and you may have trouble finding out how to make it work.
Migrating whatever this is to a modern platform, ideally expressing
it in public standard terms.
Finding an emulator for the old machine: something that will run
the old code as it was.
Migration vs. Emulation
Migration means converting files to newer formats. For example,
Amiga graphics to TIFF or JPEG. If you migrate to a public standard
you minimize the chance of having to do it again. It’s hard to guess
which commercial formats will survive: if you had asked me in the
1990s whether a Kodak image format would survive, I would have
said yes. You have to do it for every format. But you get modern
capabilities with the converted files.
Emulation means programming a current machine to behave like an
old machine. This is a difficult task, but emulators exist for many
common machines, particularly game platforms. A notable project
is Olive (olivearchive.org) which is aimed at preservation of
intellectual content beyond video games (CMU, IBM, and others).
You get only the old behavior of the program.
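To make "programming a current machine to behave like an old machine" concrete, here is a toy emulator. The machine is invented for this sketch (an 8-bit accumulator design with four opcodes) and does not correspond to any real hardware or to how Olive works; real emulators follow the same fetch, decode, execute pattern but must also reproduce timing, graphics, sound, and peripherals.

```python
from dataclasses import dataclass, field

# Opcodes of an invented 8-bit accumulator machine (not a real CPU).
LOAD, ADD, STORE, HALT = 0x01, 0x02, 0x03, 0xFF


@dataclass
class ToyMachine:
    memory: bytearray = field(default_factory=lambda: bytearray(256))
    acc: int = 0          # accumulator register
    pc: int = 0           # program counter
    halted: bool = False

    def step(self) -> None:
        """Fetch, decode, and execute one instruction."""
        opcode = self.memory[self.pc]
        if opcode == HALT:
            self.halted = True
            return
        operand = self.memory[self.pc + 1]  # every other opcode takes an address
        if opcode == LOAD:
            self.acc = self.memory[operand]
        elif opcode == ADD:
            # Wrap at 8 bits, as the imagined hardware would.
            self.acc = (self.acc + self.memory[operand]) & 0xFF
        elif opcode == STORE:
            self.memory[operand] = self.acc
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at address {self.pc}")
        self.pc += 2

    def run(self, max_steps: int = 10_000) -> None:
        """Execute until HALT, with a step limit to catch runaway programs."""
        for _ in range(max_steps):
            self.step()
            if self.halted:
                return
        raise RuntimeError("step limit reached without HALT")


def load_program(program: bytes, data: dict[int, int]) -> ToyMachine:
    """Place the program at address 0 and initial data at the given addresses."""
    machine = ToyMachine()
    machine.memory[:len(program)] = program
    for address, value in data.items():
        machine.memory[address] = value
    return machine
```

The 8-bit wraparound in ADD is the point of the exercise: the emulator reproduces the old machine's limits as well as its features, which is why you get only the old behavior.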
Examples
Migration:
JSTOR, and many old journal systems: the early issues, whatever
their original formats, are now in PDF. Often they were just OCR’d
from the printed version, rather than translated digitally (high
proofreading cost but minimal programming complexity). You can
use all modern PDF tools on the articles.
Emulation:
The Internet Arcade is a collection of 1970s-80s arcade games that
you can run in an emulator:
https://archive.org/details/internetarcade
Some very special cases
Colossus, 1943. Colossus rebuild, 1996
Charles Babbage’s
Difference Engine,
as rebuilt by the
Science Museum
(London), 2002.
Analogy
Consider performing early music. Should you play it on old
instruments or modern ones? Old instruments are more authentic,
but have a different effect on the modern ear. Bach’s listeners had
not heard a piano and the organ did not sound “old fashioned”.
Emulation is finding an old church (there are some in Germany
whose architecture and organ pipes are not changed from Bach’s
day) and using old-fashioned performance techniques.
Migration is using a piano (and keyed flutes and trumpets, etc.) but
trying to produce the same emotional effect.
Similarly with old books: Caslon and Baskerville did not look old to
people who had never seen Helvetica.
If you lack source code
In general, you can’t migrate a piece of software without the source
code, since you want to recompile it on a new machine. There are
disassemblers, but the result is going to be a real pain to
understand. So if you have only the object code, you may be driven
to emulation. Since many software vendors keep source code very
secret, and did so in the past as well, it’s not uncommon to have
only the binary form of some program.
A legal warning: if you can’t find the vendor (out of business) and
get permission, you may not have permission even to use the binary
code, although this may depend on the terms of the original
purchase. It may or may not have allowed transferring the program
to a new user.
Features in old and new versions
Suppose you take an ancient word processor file and migrate it to a
modern format. Then you can do things like export HTML, or PDF.
Any tool that will use the modern format can work with your old
file. But the tool will give a modern result – it will run faster, use
modern display fonts, and the like.
If you are using an emulator, you get the old behavior. If the
program only displayed green on black, you get green on black. This
is “authentic” but you may not like it. And you may not be able to
create HTML or PDF from the program. If you are trying to merge
many such older documents into a digital library, the format
incompatibilities will make things worse.
Metadata
If you really want to preserve a complex software object, it
helps to know exactly what programs were used to create it.
That means not just the name, but the exact version. Other
issues that are more serious for digital preservation include
provenance: where did this come from? This is relevant for
answering questions about the material, or finding the people
who might know the answer. Similarly it may assist with rights
metadata, or technical metadata. Modern formats sometimes
have technical metadata included in the file (e.g. in a JPG
header) but older formats often don’t.
Again, it is easiest if you use well-known and common formats.
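A sketch of the kind of record this implies, kept alongside the preserved object. The class and field names are assumptions made for illustration rather than an established metadata schema; an archive would more likely map these fields onto a standard it already uses.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PreservationRecord:
    """Provenance, rights, and technical metadata for one preserved object."""
    title: str
    creating_program: str   # name of the software that made the object
    program_version: str    # exact version, not just the product name
    provenance: str         # where the object came from, and from whom
    rights: str             # what the archive is permitted to do with it
    technical: dict = field(default_factory=dict)  # e.g. encoding, media type

    def to_json(self) -> str:
        """Serialize to a plain, public format that outlives this program."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "PreservationRecord":
        return cls(**json.loads(text))
```

Storing the record as plain JSON follows the same advice given for the content itself: a well-known, public format is the one most likely to remain readable.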
Standards
“The good thing about standards is that you have so much
choice.”
Even ASCII (ISO 646) is ambiguous. The UK changed the “#”
character to mean “£” and Germany changed “}” to “ü” .
Particularly worrisome are “wrapper” formats. TIFF may
contain different kinds of image compression algorithms (such
as G4 fax, or Lempel-Ziv), and thus a TIFF reader may not be
able to read all TIFF images. Some image viewers understand
progressive images in GIF or JPG; some don’t. PDF can include
the kitchen sink (e.g. 3-D viewers).
Solution: emphasize the best and most public formats.
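The ISO 646 ambiguity is easy to demonstrate: the same bytes produce different text depending on which national variant the reader assumes. The sketch below models only the two substitutions mentioned above; the table and function names are illustrative, and the real national variants redefine further code positions.

```python
# Code positions that differ from US ASCII in each variant.
# Only the substitutions named on the slide are modelled here.
US_VARIANT: dict[int, str] = {}
UK_VARIANT = {0x23: "\u00a3"}      # "#" position shows as the pound sign
GERMAN_VARIANT = {0x7D: "\u00fc"}  # "}" position shows as u-umlaut


def decode_iso646(data: bytes, variant: dict[int, str]) -> str:
    """Decode 7-bit data, applying one variant's substitutions."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"byte {byte:#04x} is not 7-bit")
        chars.append(variant.get(byte, chr(byte)))
    return "".join(chars)


if __name__ == "__main__":
    data = b"#5 }"
    for name, variant in (("US", US_VARIANT), ("UK", UK_VARIANT),
                          ("German", GERMAN_VARIANT)):
        print(name, decode_iso646(data, variant))
```

Nothing in the bytes themselves says which variant was intended, which is why the encoding has to be recorded as metadata.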
Missing environment
What would it mean to preserve the “Amazon home page”? It is
different for every person using it and for each instance: it’s
synthesized from the browsing and order history of the user, the
current incentives for sales at Amazon, and lots else (geography,
source computer, etc.). There are many pieces of software that
depend on almost everything around them; think about all the
install scripts that ask “we want to use your location,” “we want to
use your browser history,” and so on. (And of course many
programs don’t ask, they just use them.)
No good answer for this. You have to judge what you mean by
preserving the object – what will the users want the behavior to be?
Protection from abuse
If you run a general-purpose preservation operation, you need
to think about whether anything in your preservation files is
dangerous or doubtful in some way. People might try to use
your system to distribute malware (viruses) or to enable
software piracy.
Thus, unfortunately, you may want to put out calls like “please
send in examples of early APL software” but you can’t just
accept anything, and can’t rely on statements made by
unknown volunteers about what they are submitting.
Legal permission
You may have an object, and know what to do with it, but not have
legal permission to preserve it. For example, many of the video
game companies object to attempts to imitate the old games – to
them, this is creating competition for new games.
Unfortunately, given the copyright trolls out there, who try to make
a living by finding people who have downloaded something they
shouldn’t have, and then threatening them with lawsuits, this is not
an area where it is easier to get forgiveness than permission.
Libraries are often justifiably paranoid.
There is of course the preservation exception in the law; but it limits
a library to on-premises use.
Good and bad
Why software preservation is hard: the material is not self-
describing, there were many early products that vanished without
adequate documentation, software can be very complex, it requires
special hardware to run, and so on….
Why software preservation is easy: as with all digital information, it
can be copied without error; if one person has migrated a format or
emulated a machine, that can be used by others; and computers are
new enough that there is probably no computer without some user
who is still alive. I learned to program on a Univac I; that doesn’t
mean I have a tape drive that uses its steel tapes (yes, steel), but at
least I know what they are.
Conclusion
The biggest technical choice is migration vs. emulation. I
would generally say:
migration for static formats
emulation for executable programs
There are some ambitious programs: the Computer History
Museum in Mountain View has been able to salvage old
machines like the Xerox Alto.
But the industry does a lot less than we would like; it is more
common to have legal problems in salvaging software than to
get financial help from its original marketer.
Emulation in Practice
Emulation as a Service at Yale University Library; lessons learnt and plans for
the future
Euan Cochrane, Digital Preservation Manager, Yale University Library
Overview
1. Why should we care about emulation?
2. What is emulation?
3. How do we do emulation?
4. What is Emulation as a Service (EaaS)?
5. How we use EaaS
6. Lessons learnt using EaaS
7. Future work at Yale University Library (YUL)
Emulation– why?
Why? - Executable content
• Video games
• Research data workflows
• Digital Art
• Software as artifact
• Digital artifact museums
(preserving the tools and infrastructure of the digital age)
Why? – Software dependent
content
Content that requires software in order to be rendered
or interacted with:
• Office files (documents, spreadsheets, slide sets, etc)
• CAD files
• Outlook inboxes
• eBooks with note taking capability
• Desktop environments
• Code
• Any proprietary, or effectively proprietary, formats
[Chart: Operating System Usage Over Time, 2003 to 2015. Vertical axis: share of usage, 0% to 80%. Series: Win8, Win7, Vista, Win2003, Older Win, WinXP, W2000, Win98, Win95, WinNT, Linux, Mac, Mobile]
Why? – Software dependent
content
Old software is required to
authentically render old content
Original content in original
software (WordPerfect in
Windows 95)
Original content in newer
software (LibreOffice Writer in
Windows Vista)
Research results are at risk of loss
without original software
Original content in original software
(WordStar for DOS in Microsoft DOS)
[NB: equation predicting tree growth
rates includes exponents documented
using upper line of text]
Original content in newer software
(LibreOffice Writer in Windows
Vista)
Emulation – How?
How? – Emulation and virtualization
software tools
• An emulation software package
(“emulator”) is used to create a
virtual version of one computer
within another computer that has
different hardware
• Old software can be run on the
“emulated” computer hardware just
like it was running on the original
physical computer.
• Many emulators were originally
developed to run old video games
How? – Software tools
• Emulation is often used to support old hardware
devices that require obsolete software
(e.g. assembly line management software, scientific instruments,
industrial machinery, etc)
• Emulation is widely used by mobile phone
application developers to develop software for
phone-hardware using desktop-PC hardware
(i.e. phone hardware is emulated on desktop pcs to build phone-
compatible applications)
• Virtualization = emulation but with compatible
hardware
(some of the host machine’s hardware is used directly by the
“virtualized” computer)
Virtualization bridges the gap between departure of recently obsolete
hardware and the arrival of hardware powerful enough to emulate it
How? – Preserving software
and dependencies
• We need to curate and preserve operating systems to support access to
assets that depend on them
• We need to curate and preserve software applications to support access to
content that depends on them
• We need to curate and preserve fonts, scripts, plug-ins and other
dependencies to support access to content that requires them
• We need to preserve whole desktop environments (e.g. Salman Rushdie’s
desktop at Emory University) to support access to the experience of interacting
with them
• We need to curate and preserve pre-configured disk images with software
already installed on them – for running on emulated hardware
How? - Documentation
• We need unique, persistent identifiers for software
• We need software catalogues
• We need unique, persistent identifiers for disk images
(installed environments/virtual hard drives)
• We need disk image/virtual hard drive catalogues
• We need unique, persistent identifiers for
emulated/virtualized hardware configurations
• We need hardware configuration catalogues
We don’t have these (yet!)*
*Mostly; the Internet Archive is doing great work, as are NIST and
PRONOM
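One way to picture the catalogues and identifiers called for above is sketched here. Everything in it is an assumption for illustration: the "urn:example:" prefix, the use of content hashes as identifiers, and the class and function names. Real persistent identifiers are normally assigned and maintained by a registry rather than computed locally.

```python
import hashlib
import json


def content_id(kind: str, payload: bytes) -> str:
    """Derive a repeatable identifier from content (an assumption of this sketch)."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"urn:example:{kind}:{digest}"


def config_id(config: dict) -> str:
    """Identify a hardware configuration by its canonical JSON form."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return content_id("hardware", canonical)


class Catalogue:
    """Maps identifiers to descriptions and refuses conflicting re-use."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def register(self, identifier: str, description: dict) -> str:
        existing = self._entries.get(identifier)
        if existing is not None and existing != description:
            raise ValueError(f"{identifier} already describes something else")
        self._entries[identifier] = description
        return identifier

    def lookup(self, identifier: str) -> dict:
        return self._entries[identifier]


def describe_environment(software_ids, disk_image_id: str, hardware_id: str) -> dict:
    """Tie together the three kinds of identifier needed to re-run content."""
    return {
        "software": sorted(software_ids),
        "disk_image": disk_image_id,
        "hardware": hardware_id,
    }
```

With the three catalogues in place, citing an environment reduces to quoting one identifier whose description points at the software, the disk image, and the hardware configuration.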
How? – Configuring emulated
hardware
• Admins configure an
emulator
• Admins install and/or
configure the emulated
software
• Requires various emulator
specific, technically
challenging tools
How? – accessing emulated
environments at libraries and
archives
• Users access
emulated
environments via
dedicated
machines
• Use dedicated
software
• At libraries and
archives this is
mostly restricted to
reading rooms
How? – This is too hard!
Emulation as a Service
Emulation as a Service – What is it?
• Remote access to pre-configured emulated and virtualized
environments via any modern web browser
• Abstracts configuration challenges away from end-users
• Changes to environments can be saved or discarded at the end
of a session (a fresh/unchanged version is always available)
• Interactivity can be restricted where appropriate (e.g. limited
ability to download or copy content to local computer)
• Relatively simple way to provide custom online environments
(virtual reading rooms?)
Emulation as a Service (EaaS) – Why?
• A lot of old digital content can only be properly accessed using
emulation tools
• Emulation is technically specialized
• Old software can be challenging for modern users to understand
• Modern users don’t expect to have to come into a reading room
to access digital content
• Maintain control over content: users can’t copy data in or out
unless authorized (screenshots are inevitably excluded)
• Strong separation between environments, objects and
emulators/configurations
• Emulation can be provided remotely (outsourced) with disk
image archives and/or content maintained locally
• Small derivative environments can be created from base
environments, saving space
• Standard environments can be reused and customized
• Provides ability to cite environments
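The space saving from derivative environments can be illustrated with a copy-on-write overlay: the derived image stores only the blocks that differ from its base. This is a sketch of the general technique with invented class names; the source does not describe how the bwFLA EaaS software implements it.

```python
class BaseImage:
    """A read-only disk image, modelled as a fixed sequence of blocks."""

    def __init__(self, blocks: list[bytes]) -> None:
        self._blocks = tuple(blocks)

    def __len__(self) -> int:
        return len(self._blocks)

    def read(self, index: int) -> bytes:
        return self._blocks[index]


class DerivedImage:
    """An environment that stores only its differences from a base image."""

    def __init__(self, base: BaseImage) -> None:
        self._base = base
        self._delta: dict[int, bytes] = {}

    def read(self, index: int) -> bytes:
        # Changed blocks come from the delta, everything else from the base.
        if index in self._delta:
            return self._delta[index]
        return self._base.read(index)

    def write(self, index: int, data: bytes) -> None:
        if not 0 <= index < len(self._base):
            raise IndexError("block index outside the base image")
        if data == self._base.read(index):
            # Writing the original content back removes the difference.
            self._delta.pop(index, None)
        else:
            self._delta[index] = data

    def stored_blocks(self) -> int:
        """Number of blocks this derivative actually has to keep."""
        return len(self._delta)

    def discard_changes(self) -> None:
        """Return to the fresh, unchanged base environment."""
        self._delta.clear()
```

Because the base is never modified, many derivatives can share it, and discarding a session's changes is just a matter of dropping the delta.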
EaaS usage Examples
• Puppet Motel
• Hebrew Texts
• Companies Data
• See:
http://blogs.loc.gov/digitalpreservation/2014/08/emulation-as-a-service-eaas-at-yale-university-library/
EaaS – How it works
Architecture and design
EaaS – How it works
(For Technical Administrators)
• Admins configure an emulator on local PC
• Admins configure the emulated software on a local PC
• Configured environment gets saved as a “disk image” with
configuration metadata
• Admins confirm the software environment stored on the disk image
works on local PC
• Admins/Archivists/Librarians ingest it into the EaaS service
EaaS – How it works
(For Librarians/Archivists)
• Pre-configured software environments (e.g. a Windows 95 + Office 95
environment) can have files added to them and be saved as a variant
or as a stand-alone new environment
• Only difference (delta) between base-environments and customized
environment retained – saving space by not duplicating virtual
hard drive content
• CD-ROMs and other software can be ingested, installed/configured
on top of a base environment, and tested using an online interface
• Newly customized environment can be stored for future use and …
• Librarians/Archivists can also ingest disk images captured from
machines they have acquired (e.g. authors’/politicians’ desktops)
EaaS – How it works
(For end-users)
• Users can click on links in a
catalogue/finding aid to
access
environments/content
EaaS – How it works
(For developers and system
integrators)
• Provides generic access to functionality of many emulators and virtualization
tools via a web service and REST API
• Emulation functionality can be incorporated into existing workflows
• Emulated (or virtualized) environments can be embedded into web pages for
online access and online exhibitions
• Emulated environment citations, thumbnails, and URIs/URLs enable easy
integration with existing catalogues and finding aids
• One-click “image-disk-and-emulate” workflows being developed (collaborating
with digital forensics initiatives)
• Open Source (currently available on request, code will be published in the
future)
EaaS – Background
• bwFLA EaaS project from University of Freiburg in
Germany (http://bw-fla.uni-freiburg.de)
• Personally collaborated with bwFLA at Freiburg
while at Archives New Zealand
• Now at Yale University Library and brought
collaboration along
• Yale University Library has (or had!) the only installation
outside of Germany
EaaS Demo
(Semi-)Public Demo
https://demo.bw-fla.uni-freiburg.de
Username: bwfla
Password: demo
Related work
• Olive Archive https://olivearchive.org/
• Internet Archive
https://archive.org/details/software
• Keep Emulation framework
http://emuframework.sourceforge.net/
• QEMU http://wiki.qemu.org/Main_Page
EaaS at Yale
• Testing and providing requirements for ongoing
development
• Imaging general collections digital media & Trialing
access via EaaS
• Investigating workflow integration (virtual reading
rooms?)
• Finding gaps in supporting infrastructure
Lessons learnt
• It works and we can do this now!*
*with caveats
Lessons learnt
• Software licensing needs to be solved (abandonware
and out-of-cart software are huge problems)
• Scale is manageable through standardization and
sharing
• Archivists and Librarians can use EaaS with relatively
little training
• The possibilities of using EaaS in workflows are huge
• If EaaS becomes an assumption, creators may change
Future work at Yale University
Library
• Move EaaS into production
• Increase software archiving
• Develop standard shareable environment images
• Collaborate with others to maximize efficiency of software archiving
• Develop emulation testing standards and frameworks
• Explore options for preserving networked environments
• Make progress on the licensing issues
Thank you
https://demo.bw-fla.uni-freiburg.de
Username: bwfla
Password: demo
NISO Webinar • May 13, 2015
Questions?
All questions will be posted with presenter answers on
the NISO website following the webinar:
http://www.niso.org/news/events/2015/webinars/software/

Mattingly "Text and Data Mining: Searching Vectors"
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"
 
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
 
Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"
 
Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"
 
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
 
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
 
Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"
 
Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"
 
Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"
 

Último

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 

Último (20)

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 

NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run Them?

  • 1. NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run Them? Wednesday, May 13, 2015 Speakers: Michael Lesk Professor of Library and Information Science, Rutgers University Euan Cochrane Digital Preservation Manager, Yale University Library Jon Ippolito Professor of New Media and Director of the Digital Curation Graduate Program, University of Maine http://www.niso.org/news/events/2015/webinars/software/
  • 2. Software preservation Michael Lesk Prof. of Library and Information Science Rutgers University New Brunswick, NJ 08901 lesk@acm.org www.lesk.com
  • 3. Software preservation The hard problem is not bad tape; it’s obsolescence. There are two common answers to the obsolescence problem. Migration or emulation? Migration: Convert the old information to a new format, e.g., BMP to JPEG. Emulation: Use old information on a new version of an old machine, e.g. using a website that looks like an arcade game platform.
  • 4. Why might old software be lost? All the copies were thrown away. The copies still exist, but the media have worn out. The media are OK, but we have no device to read them. We can read the bits, but we don’t know what they mean. We understand the bits, but have no software to process them. We have software but nothing to run it on. The software depends on an environment that no longer exists. We could process the bits, but we lack legal permission.
  • 5. Discarded We know the first telegram: “What hath God wrought?” May 24, 1844; Samuel F. Morse, in Washington, to Alfred Vail, in Baltimore. We know the first telephone call: “Mr. Watson—Come here—I want to see you.” March 10, 1876. Alexander Graham Bell to Thomas A. Watson, in Boston. We don’t know the first email message. It was in the spring of 1964, in either Cambridge (UK), Cambridge (Mass.), or Pittsburgh; but whatever it was, it was thrown out, and nobody kept good records. The solution to this problem is multiple copies. Digital copies are perfect and cheap; use them.
  • 6. Media fragility In the 1970s Brazil stored Landsat space photography of their country on magnetic tape. These tapes were stored in humid conditions and deteriorated until they were unreadable. Magnetic tape is often fragile; audio tape is lost as well. It helps to start with better quality tape, and linear tape (audio) is better than helical tape (VHS cassettes). Sometimes it helps to heat the tape, once; hence one of the great titles in preservation literature, “If I knew you were coming, I’d have baked a tape” (Eddie Ciletti). Again the solution to this is multiple copies, regularly inspected. Note projects like LOCKSS: Lots of copies keep stuff safe.
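One way to make "multiple copies, regularly inspected" concrete is a periodic fixity check: record a checksum when the object is ingested, then compare every copy against it. The sketch below is only an illustration using Python's standard `hashlib`; the function names `sha256_of` and `inspect_copies` are made up for this example and are not part of LOCKSS or any other tool named in the slides.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inspect_copies(copies: list[Path], expected: str) -> dict[str, bool]:
    """Map each copy's path to True if it exists and matches the recorded digest."""
    return {
        str(copy): copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }
```

A copy that reports False is missing or damaged and can be replaced from one that still matches, which is the point of keeping several.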
  • 7. Devices gone Where today would you find a diskette drive? And that’s an easy one: what about a paper tape reader? The answer to those is eBay, but what about special-purpose technology that failed in the marketplace, such as kinds of 12” writeable optical disk from the early 1990s? Again, the answer is multiple copies on current devices. Even if your organization thinks it’s prepared to keep its 1980-vintage DEC computer running for a long time, where would you find spare parts when it broke? Or a technician who knew what to do with them?
  • 8. Forgot the format It is possible to have a format and not know what it is. Suppose you have a file made by Volkswriter, marketed by Lifetime Software (which, despite its name, ceased operating independently in 1991). How would you find out the control codes? If you can’t find documentation, it may be easier to view this as a decipherment problem: if you find a funny symbol at “plus ?a change, plus c’est …” it’s the French ç character. Now we’re into the real issues: is it better to try to find a copy of the software or to convert the file to a current standard, like Word? In this case (word processing) conversion is probably easier. Solution: use standard formats. Preferably public ones.
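The decipherment approach can be started mechanically: list every byte that is not plain printable ASCII, count it, and show it in context so a human can guess its meaning. This is a minimal sketch under that assumption; `unusual_bytes` is a hypothetical helper, and the byte values in the usage note are invented, not real Volkswriter codes.

```python
from collections import Counter

# Printable ASCII plus tab, line feed and carriage return.
PRINTABLE = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def unusual_bytes(data: bytes, context: int = 8) -> dict[int, tuple[int, str]]:
    """For each byte outside printable ASCII, return (count, first context)."""
    counts = Counter(b for b in data if b not in PRINTABLE)
    report = {}
    for value, count in counts.items():
        pos = data.index(bytes([value]))
        window = data[max(0, pos - context): pos + context + 1]
        # Show unknown bytes as escapes so they stand out in the context.
        text = window.decode("ascii", errors="backslashreplace")
        report[value] = (count, text)
    return report
```

Run on a file containing `b"plus \x87a change"`, the report shows byte 0x87 sitting where a French reader expects ç, which is exactly the kind of clue the slide describes.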
  • 9. No software is available Again, the vendor who wrote the software originally used for your file might have gone out of business. If your file is in a public format, there is probably an alternative. But if it was in a proprietary format, it may be difficult to find something that reads it. There was a time, for example, when Microsoft deliberately arranged for old MS-Word documents to be unreadable on newer versions so that customers would be forced to upgrade continuously. And in those days, Microsoft tried to prohibit other vendors from selling software that read and translated the “.doc” format; some of them did it anyway, and Microsoft gave up. The solution is public formats and current formats; for example the newer “.docx” files in Microsoft Word have a public description.
  • 10. No machine to run the software Now we’re into the hard part of the problem: you might have some kind of program but it was coded to run on a long-gone machine (Commodore 64, anyone?). Your choice is between: finding a machine for sale on eBay, but you can’t get parts to fix it, and you may have trouble finding out how to make it work; migrating whatever this is to a modern platform, ideally expressing it in public standard terms; or finding an emulator for the old machine: something that will run the old code as it was.
  • 11. Migration vs. Emulation Migration means converting files to newer formats. For example, Amiga graphics to Tiff or JPEG. If you migrate to a public standard you minimize the chance of having to do it again. It’s hard to guess which commercial formats will survive: if you had asked me in the 1990s whether a Kodak image format would survive, I would have said yes. You have to do it for every format. But you get modern capabilities with the converted files. Emulation means programming a current machine to behave like an old machine. This is a difficult task, but emulators exist for many common machines, particularly game platforms. A notable project is Olive (olivearchive.org) which is aimed at preservation of intellectual content beyond video games (CMU, IBM, and others). You get only the old behavior of the program.
  • 12. Examples Migration: JSTOR, and many old journal systems: the early issues, whatever their original formats, are now in PDF. Often they were just OCRd from the printed version, rather than translated digitally (high proofreading cost but minimal programming complexity). You can use all modern PDF tools on the articles. Emulation: The Internet Arcade is a collection of 1970s-80s arcade games that you can run in an emulator: https://archive.org/details/internetarcade
  • 13. Some very special cases Colossus, 1942. Colossus re-build, 1996 Charles Babbage’s Difference Engine, as rebuilt by the Science Museum (London), 2002.
  • 14. Analogy Consider performing early music. Should you play it on old instruments or modern ones? Old instruments are more authentic, but have a different effect on the modern ear. Bach’s listeners had not heard a piano and the organ did not sound “old fashioned”. Emulation is finding an old church (there are some in Germany whose architecture and organ pipes are not changed from Bach’s day) and using old-fashioned performance techniques. Migration is using a piano (and keyed flutes and trumpets, etc) but trying to produce the same emotional effect. Similarly with old books: Caslon and Baskerville did not look old to people who had never seen Helvetica.
  • 15. If you lack source code In general, you can’t migrate a piece of software without the source code, since you want to recompile it on a new machine. There are disassemblers, but the result is going to be a real pain to understand. So if you have only the object code, you may be driven to emulation. Since many software vendors keep source code very secret, and did so in the past as well, it’s not uncommon to have only the binary form of some program. A legal warning: if you can’t find the vendor (out of business) and get permission, you may not have permission even to use the binary code, although this may depend on the terms of the original purchase. It may or may not have allowed transferring the program to a new user.
  • 16. Features in old and new versions Suppose you take an ancient word processor file and migrate it to a modern format. Then you can do things like export HTML, or PDF. Any tool that will use the modern format can work with your old file. But the tool will give a modern result – it will run faster, use modern display fonts, and the like. If you are using an emulator, you get the old behavior. If the program only displayed green on black, you get green on black. This is “authentic” but you may not like it. And you may not be able to create HTML or PDF from the program. If you are trying to merge many such older documents into a digital library, the format incompatibilities will make things worse.
  • 17. Metadata If you really want to preserve a complex software object, it helps to know exactly what programs were used to create it. That means not just the name, but the exact version. Other issues that are more serious for digital preservation include provenance: where did this come from? This is relevant for answering questions about the material, or finding the people who might know the answer. Similarly it may assist with rights metadata, or technical metadata. Modern formats sometimes have technical metadata included in the file (eg in a JPG header) but older formats often don’t. Again, it is easiest if you use well-known and common formats.
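As an illustration of the kind of record this slide asks for, the sketch below stores the creating program, its exact version, provenance and rights in a small JSON sidecar. The field names and the class `PreservationRecord` are assumptions made for this example; they do not follow any particular metadata standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical minimal metadata for one preserved object."""
    object_id: str          # local identifier for the preserved file
    created_with: str       # name of the program that wrote the file
    software_version: str   # the exact version, not just the name
    file_format: str        # format name as best known
    provenance: str         # where the object came from
    rights: str             # what is known about legal permission


def to_sidecar_json(record: PreservationRecord) -> str:
    """Serialize the record as stable, human-readable JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Writing the output next to the object keeps the technical metadata available even when the file format itself has no header to hold it.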
  • 18. Standards “The good thing about standards is that you have so much choice.” Even ASCII (ISO 646) is ambiguous. The UK changed the “#” character to mean “£” and Germany changed “}” to “ü” . Particularly worrisome are “wrapper” formats. Tiff may contain different kinds of image compression algorithms (such as G4 fax, or Lempel-Ziv), and thus a Tiff reader may not be able to read all Tiff images. Some image viewers understand progressive images in GIF or JPG; some don’t. PDF can include the kitchen sink (eg 3-D viewers). Solution: emphasize the best and most public formats.
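The "wrapper" problem can be checked before trusting a reader: a TIFF file declares its compression scheme in tag 259 of its image file directory. The sketch below reads that tag from the first directory of a classic (non-BigTIFF) file using only the standard `struct` module. The function name `tiff_compression` is made up for this example, and the name table covers only some of the values in the TIFF 6.0 specification.

```python
import struct

# Some Compression values from the TIFF 6.0 specification (partial list).
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax",
    5: "LZW",
    32773: "PackBits",
}
COMPRESSION_TAG = 259


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value from the first IFD of a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian file
    elif data[:2] == b"MM":
        endian = ">"   # big-endian file
    else:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i   # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == COMPRESSION_TAG:
            # Compression is one SHORT held in the first two bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # TIFF 6.0 default when the tag is absent
```

A preservation workflow could flag any file whose value is missing from the table as one that a given TIFF reader may not be able to open.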
  • 19. Missing environment What would it mean to preserve the “Amazon home page”? It is different for every person using it and for each instance: it is synthesized using the browsing and order history of the user, the current incentives for sales at Amazon, and lots else (geography, source computer, etc.). There are many pieces of software that depend on almost everything around them: think about all the install scripts that ask “we want to use your location,” “we want to use your browser history,” and so on. (And of course many programs don’t ask, they just use them.) No good answer for this. You have to judge what you mean by preserving the object: what will the users want the behavior to be?
  • 20. Protection from abuse If you run a general-purpose preservation operation, you need to think about whether anything in your preservation files is dangerous or doubtful in some way. People might try to use your system to distribute malware (viruses) or to enable software piracy. Thus, unfortunately, you may want to put out calls like “please send in examples of early APL software” but you can’t just accept anything, and can’t rely on statements made by unknown volunteers about what they are submitting.
  • 21. Legal permission You may have an object, and know what to do with it, but not have legal permission to preserve it. For example, many of the video game companies object to attempts to imitate the old games – to them, this is creating competition for new games. Unfortunately, given the copyright trolls out there, who try to make a living by finding people who have downloaded something they shouldn’t have, and then threatening them with lawsuits, this is not an area where it is easier to get forgiveness than permission. Libraries are often justifiably paranoid. There is of course the preservation exception in the law; but it limits a library to on-premises use.
  • 22. Good and bad Why software preservation is hard: the material is not self-describing, there were many early products that vanished without adequate documentation, software can be very complex, it requires special hardware to run, and so on…. Why software preservation is easy: as with all digital information, it can be copied without error; if one person has migrated a format or emulated a machine, that can be used by others; and computers are new enough that there is probably no computer without some user who is still alive. I learned to program on a Univac I; that doesn’t mean I have a tape drive that uses its steel tapes (yes, steel), but at least I know what they are.
  • 23. Conclusion The biggest technical choice is migration vs. emulation. I would generally say: migration for static formats; emulation for executable programs. There are some ambitious programs: the Computer History Museum in Mountain View has been able to salvage old machines like the Xerox Alto. But the industry does a lot less than we would like; it is more common to have legal problems in salvaging software than to get financial help from its original marketer.
  • 24. Emulation in Practice Emulation as a Service at Yale University Library; lessons learnt and plans for the future Euan Cochrane, Digital Preservation Manager, Yale University Library
  • 25. Overview 1. Why should we care about emulation? 2. What is emulation? 3. How do we do emulation? 4. What is Emulation as a Service (EaaS)? 5. How we use EaaS 6. Lessons learnt using EaaS 7. Future work at Yale University Library (YUL)
  • 27. Why? - Executable content • Video games • Research data workflows • Digital Art • Software as artifact • Digital artifact museums (preserving the tools and infrastructure of the digital age)
  • 28. Why? – Software dependent content Content that requires software in order to be rendered or interacted with: • Office files (documents, spreadsheets, slide sets, etc) • CAD files • Outlook inboxes • eBooks with note taking capability • Desktop environments • Code • Any proprietary, or effectively proprietary, formats
  • 29. Why? – Software dependent content [Chart: Operating System Usage Over Time, 2003 to 2015; vertical axis 0% to 80% share; series: Win8, Win7, Vista, Win2003, Older Win, WinXP, W2000, Win98, Win95, WinNT, Linux, Mac, Mobile]
  • 30. Old software is required to authentically render old content Original content in original software (WordPerfect in Windows 95) Original content in newer software (LibreOffice Writer in Windows Vista)
  • 31. Research results are at risk of loss without original software Original content in original software (WordStar for DOS in Microsoft DOS) [NB: equation predicting tree growth rates includes exponents documented using upper line of text] Original content in newer software (LibreOffice Writer in Windows Vista)
  • 33. How? – Emulation and virtualization software tools • An emulation software package (“emulator”) is used to create a virtual version of one computer within another computer that has different hardware • Old software can be run on the “emulated” computer hardware just like it was running on the original physical computer. • Many emulators were originally developed to run old video games
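At its core an emulator is a loop that fetches each instruction of the old machine and carries out its effect on the new one. The sketch below shows that loop for a made-up accumulator machine; the instruction names (`LOAD`, `ADD`, `PRINT`, `JNZ`, `HALT`) are invented for illustration and do not correspond to any real hardware or to the emulators discussed in the webinar.

```python
def run_toy_machine(program: list[tuple], max_steps: int = 10_000) -> list[int]:
    """Interpret a program for a made-up accumulator machine; return its output."""
    acc = 0          # the machine's single accumulator register
    pc = 0           # program counter: index of the next instruction
    output = []
    for _ in range(max_steps):   # step limit guards against endless loops
        if pc >= len(program):
            break
        op, *args = program[pc]  # fetch and decode
        pc += 1
        if op == "LOAD":
            acc = args[0]
        elif op == "ADD":
            acc += args[0]
        elif op == "PRINT":
            output.append(acc)
        elif op == "JNZ":        # jump to args[0] if the accumulator is non-zero
            if acc != 0:
                pc = args[0]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output
```

The "old software" here is just a list such as `[("LOAD", 3), ("PRINT",), ("ADD", -1), ("JNZ", 1), ("HALT",)]`; it runs unchanged on any host that can run the loop, which is the property emulation offers for preserved programs.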
  • 34. How? – Software tools • Emulation is often used to support old hardware devices that require obsolete software (e.g. assembly line management software, scientific instruments, industrial machinery, etc) • Emulation is widely used by mobile phone application developers to develop software for phone-hardware using desktop-PC hardware (i.e. phone hardware is emulated on desktop PCs to build phone-compatible applications) • Virtualization = emulation but with compatible hardware (some of the host machine’s hardware is used directly by the “virtualized” computer) Virtualization bridges the gap between departure of recently obsolete hardware and the arrival of hardware powerful enough to emulate it
  • 35. How? – Preserving software and dependencies • We need to curate and preserve operating systems to support access to assets that depend on them • We need to curate and preserve software applications to support access to content that depends on them • We need to curate and preserve fonts, scripts, plug-ins and other dependencies to support access to content that requires them • We need to preserve whole desktop environments (e.g. Salman Rushdie’s desktop at Emory University) to support access to the experience of interacting with it • We need to curate and preserve pre-configured disk images with software already installed on them – for running on emulated hardware
  • 36. How? - Documentation • We need unique, persistent identifiers for software • We need software catalogues • We need unique, persistent identifiers for disk images (installed environments/virtual hard drives) • We need disk image/virtual hard drive catalogues • We need unique, persistent identifiers for emulated/virtualized hardware configurations • We need hardware configuration catalogues
  • 37. How? - Documentation • We need unique, persistent identifiers for software • We need software catalogues • We need unique, persistent identifiers for disk images (installed environments/virtual hard drives) • We need disk image/virtual hard drive catalogues • We need unique, persistent identifiers for emulated/virtualized hardware configurations • We need hardware configuration catalogues We don’t have these (yet!)* *Mostly; the Internet Archive is doing great work, as are NIST and PRONOM
  • 38. How? – Configuring emulated hardware
  • 39. How? – Configuring emulated hardware • Admins configure an emulator • Admins install and/or configure the emulated software • Requires various emulator specific, technically challenging tools
  • 40. How? – accessing emulated environments at libraries and archives • Users access emulated environments via dedicated machines • Use dedicated software • At libraries and archives this is mostly restricted to reading rooms
  • 41. How? – This is too hard!
  • 42. Emulation as a Service
  • 43. Emulation as a Service – What is it? • Remote access to pre-configured emulated and virtualized environments via any modern web browser • Abstracts configuration challenges away from end-users • Changes to environments can be saved or discarded at the end of a session (a fresh/unchanged version is always available) • Interactivity can be restricted where appropriate (e.g. limited ability to download or copy content to local computer) • Relatively simple way to provide custom online environments (virtual reading rooms?)
  • 44. Emulation as a Service (EaaS)– Why? • A lot of old digital content can only be properly accessed using emulation tools • Emulation is technically specialized • Old software can be challenging for modern users to understand • Modern users don’t expect to have to come into a reading room to access digital content • Maintain control over content: users can’t copy data in or out unless authorized (screenshots are inevitably excluded)
  • 45. Emulation as a Service (EaaS)– Why? • Strong separation between environments, objects and emulators/configurations • Emulation can be provided remotely (outsourced), with disk image archives and/or content maintained locally • Small derivative environments can be created from base-environments, saving space • Standard environments can be reused and customized • Provides ability to cite environments
  • 46. EaaS usage Examples • Puppet Motel • Hebrew Texts • Companies Data • See: http://blogs.loc.gov/digitalpreservation/2014/08/emulation-as-a-service-eaas-at-yale-university-library/
  • 47. EaaS – How it works Architecture and design
  • 48. EaaS – How it works (For Technical Administrators) • Admins configure an emulator on local PC • Admins configure the emulated software on a local PC • Configured environment gets saved as a “disk image” with configuration metadata
  • 49. EaaS – How it works (For Technical Administrators) • Admins confirm the software environment stored on the disk image works on local PC • Admins/Archivists/Librarians ingest it into the EaaS service
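The administrator steps above (save a configured environment as a disk image with configuration metadata, confirm it, then ingest it) could be modelled roughly as follows. This is a minimal sketch under stated assumptions: the `EnvironmentRecord` fields and the helper names are hypothetical and do not reflect the actual EaaS metadata schema. The checksum is one plausible way to confirm that the image being ingested is the one that was tested locally.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EnvironmentRecord:
    """Hypothetical configuration metadata stored alongside a disk image."""
    title: str          # e.g. "Windows 95 + Office 95"
    emulator: str       # which emulator the image was configured for
    machine: str        # emulated hardware profile
    image_sha256: str   # fixity value for the disk image


def describe(image: bytes, title: str, emulator: str, machine: str) -> EnvironmentRecord:
    # Record the configuration together with a checksum of the image.
    digest = hashlib.sha256(image).hexdigest()
    return EnvironmentRecord(title, emulator, machine, digest)


def verify(image: bytes, record: EnvironmentRecord) -> bool:
    # Check before ingest that the image still matches its metadata.
    return hashlib.sha256(image).hexdigest() == record.image_sha256


def to_json(record: EnvironmentRecord) -> str:
    # Serialize the metadata so it can travel with the disk image.
    return json.dumps(asdict(record), sort_keys=True, indent=2)
```

Keeping the metadata separate from the image bytes mirrors the separation between environments, objects, and emulator configurations described earlier in the talk.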
  • 50. EaaS – How it works (For Librarians/Archivists) • Pre-configured software environments (e.g. a Windows 95 + Office 95 environment) can have files added to them and be saved as a variant or as a stand-alone new environment • Only the difference (delta) between the base environment and the customized environment is retained, saving space by not duplicating virtual hard drive content
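To make the space saving concrete, here is a small sketch of storing only the blocks that differ between a base image and a customized one. It illustrates the general delta idea and is not the mechanism EaaS itself uses; `make_delta`, `apply_delta`, and the 4096-byte block size are assumptions, and both images are assumed to be the same length.

```python
def make_delta(base: bytes, custom: bytes, block_size: int = 4096) -> dict[int, bytes]:
    """Return only the blocks of `custom` that differ from `base`, keyed by offset."""
    if len(base) != len(custom):
        raise ValueError("images must be the same size")
    delta: dict[int, bytes] = {}
    for offset in range(0, len(base), block_size):
        block = custom[offset:offset + block_size]
        if block != base[offset:offset + block_size]:
            delta[offset] = block
    return delta


def apply_delta(base: bytes, delta: dict[int, bytes]) -> bytes:
    """Rebuild the customized image from the shared base plus its delta."""
    image = bytearray(base)
    for offset, block in delta.items():
        image[offset:offset + len(block)] = block
    return bytes(image)
```

For example, changing a single byte in a 16 KiB image yields a delta holding one 4 KiB block, so the derivative costs a quarter of the space of a full copy, and the ratio improves as images grow.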
  • 51. EaaS – How it works (For Librarians/Archivists) • CD-ROMs and other software can be ingested, installed/configured on top of a base environment, and tested using an online interface • Newly customized environment can be stored for future use and
  • 52. EaaS – How it works (For Librarians/Archivists) • Librarians/Archivists can also ingest disk images captured from machines they have acquired (e.g. authors’/politicians’ desktops)
  • 53. EaaS – How it works (For end-users) • Users can click on links in a catalogue/finding aid to access environments/content
  • 54. EaaS – How it works (For developers and system integrators) • Provides generic access to functionality of many emulators and virtualization tools via WebService and REST API • Emulation functionality can be incorporated into existing workflows • Emulated (or virtualized) environments can be embedded into web pages for online access and online exhibitions • Emulated environment citations, thumbnails, and URIs/URLs enable easy integration with existing catalogues and finding aids • One-click “image-disk-and-emulate” workflows being developed (collaborating with digital forensics initiatives) • Open Source (currently available on request, code will be published in the future)
  • 55. EaaS – Background • bwFLA EaaS project from University of Freiburg in Germany (http://bw-fla.uni-freiburg.de) • Personally collaborated with bwFLA at Freiburg while at Archives New Zealand • Now at Yale University Library and brought collaboration along • Yale University Library has (or had!) the only installation outside of Germany
  • 58. Related work • Olive Archive https://olivearchive.org/ • Internet Archive https://archive.org/details/software • Keep Emulation framework http://emuframework.sourceforge.net/ • QEMU http://wiki.qemu.org/Main_Page
  • 59. EaaS at Yale • Testing and providing requirements for ongoing development • Imaging general collections digital media & Trialing access via EaaS • Investigating workflow integration (virtual reading rooms?) • Finding gaps in supporting infrastructure
  • 60. Lessons learnt • It works and we can do this now!* *with caveats
  • 61. Lessons learnt • Software licensing needs to be solved (abandonware and out-of-cart software are huge problems) • Scale is manageable through standardization and sharing • Archivists and Librarians can use EaaS with relatively little training • The possibilities of using EaaS in workflows are huge • If EaaS becomes an assumption, creators may change
  • 62. Future work at Yale University Library • Move EaaS into production • Increase software archiving • Develop standard shareable environment images • Collaborate with others to maximize efficiency of software archiving • Develop emulation testing standards and frameworks • Explore options for preserving networked environments • Make progress on the licensing issues
  • 64. NISO Webinar: Software Preservation and Use: I Saved the Files But Can I Run Them? • May 13, 2015 • Questions? All questions will be posted with presenter answers on the NISO website following the webinar: http://www.niso.org/news/events/2015/webinars/software/
  • 65. Thank you for joining us today. Please take a moment to fill out the brief online survey. We look forward to hearing from you!
