Snapshot of how we thought about migration infrastructure then: PLANETS for the infrastructure, MIXED as a plugin for the tabular data conversion functionality.
6. testbed: spreadsheets
XML is an appropriate choice for the
long-term preservation of
spreadsheets. XML can be used to
specify the context, content and
structure of spreadsheets.
7. testbed: databases
At present, XML is the most
effective strategy for the
durable preservation of
databases. XML is highly
capable of representing the
context, content, and structure
of databases.
This strategy can be
implemented using a number
of different methods.
8. what do repositories want
Conversion to preservable formats.
Automatically.
At most once.
Faithfully.
9. preservation strategy
Migration and emulation are complementary
strategies. Migration is best for offering
usable content. Emulation is best for
invoking the original experience.
Migration to XML is
normalised migration,
hence we call it smart migration.
10. Ingredients
suitable xml formats for your data
software to convert
legacy data to xml
ingest data to xml
xml to dissemination data
connectors to your repository workflow
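The ingredients above can be pictured as a round trip: legacy data is converted to a preservation XML, and that XML is later converted to a dissemination format. What follows is a minimal sketch in Python, not the MIXED software. It assumes that a CSV file stands in for the legacy spreadsheet and that a made-up vocabulary (`table`, `row`, `cell`) stands in for a real preservation schema; the function names and the sample data are hypothetical as well.

```python
import csv
import io
import xml.etree.ElementTree as ET


def legacy_to_xml(csv_text: str, source_name: str) -> str:
    """Convert legacy tabular data (here: CSV) to a hypothetical preservation XML."""
    # Context: record where the data came from (assumed attribute name).
    root = ET.Element("table", {"source": source_name})
    for r, record in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        # Structure: rows and cells are numbered explicitly.
        row = ET.SubElement(root, "row", {"n": str(r)})
        for c, value in enumerate(record, start=1):
            cell = ET.SubElement(row, "cell", {"n": str(c)})
            cell.text = value  # Content: the cell value itself.
    return ET.tostring(root, encoding="unicode")


def xml_to_dissemination(xml_text: str) -> str:
    """Convert the preservation XML to a dissemination format (here: CSV again)."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in root.findall("row"):
        writer.writerow([cell.text or "" for cell in row.findall("cell")])
    return out.getvalue()


# Made-up sample data, only to show the round trip.
legacy = "year,deposits\n2008,41\n2009,57\n"
preserved = legacy_to_xml(legacy, "deposits.csv")
assert xml_to_dissemination(preserved) == legacy
```

The final assertion is a small check on faithfulness: whatever goes into the preservation XML should come out again unchanged.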
15. Data kinds
Data comes in kinds, defined by the typical
applications that manipulate it.
Spreadsheets, databases, rich
text, images, audio, video, drawings, ...
The need for these applications is the
basic reason for the threat of data loss
caused by software obsolescence.
16. standards for data kinds
binary vendor formats (doc)
ascii vendor formats (rtf)
open formats (HTML export)
interchange formats (ad-hoc XML)
standard formats (defined XML: OOXML)
preservation formats (selected XML: SDFP)
17. SDFP
Standard Data Formats for Preservation
Spreadsheets: ODF subset
Databases: e-David-XML
Statistical Data: DDI
26. issues
how loose/tight are the components
connected?
write pure Java code of our own, or borrow
existing programs in other languages?
modularity of file type recognition (JHOVE)
27. Using MIXED
• history
• defining
• developing
• using
• exploiting
29. improvements for repositories
• users can select format most usable to
them, irrespective of producer
• users can select the preservation
format, in case usable formats are not
supported
• fewer uncertainties in interpretation, either
by humans or by software
30. further improvements
combine data from heterogeneous sources
• different formats (straightforward)
• different data models (advanced)
• different data kinds
33. Data on an Infrastructure
• higher demand for interoperability
• more needs for standards
• more opportunities for re-use
• more scope for digital preservation tools
35. Conversion as a service
• a uniform resource
• yielding uniform results
• easily accessible
• product of community effort
• a good conversion requires a lot of intelligent
work
• quality is reached in an iterative manner
36. MIXED as Infrastructure
• provides a standard for preservation
formats
• implements the tools to maintain the
standard
• accumulates the shared wisdom of data
formats
I first want to express my delight that you have made it to Scheveningen, to this consultation workshop for MIXED.
This is what MIXED is about, according to the White Paper. I wonder whether in the future I should just write Wordles instead of White Papers.
This is an overview of my talk.
Let us now state very briefly what MIXED is, at least, the tangible project result.
Really, MIXED is not so surprising. The idea is quite natural. There have been attempts to put it on the agenda of digital preservation.
They made explicit statements about various data kinds: spreadsheets
The verdict on XML for databases is also positive
So the existing tools for converting to XML leave something to be desired. That leads to the question:
Let us talk a bit about preservation strategy, because you cannot effectively build tools if you do not have a strategy
What are the ingredients for this to work out as desired?
Let's zoom in on the processes of smart migration. First let us ignore the passage of time; we take that into account after this slide.
Now, considering time, the question is, how long do we have to maintain parts of the system?
Let us have a closer look at what defines MIXED
The first characteristic is a buzz word: XML
The question is: what do we want to model with our XML?
File formats have evolved in a positive direction. As an illustration, look at the formats that Microsoft Word can deal with.
So now we are approaching MIXED again, by means of the selection of a few XML schemas.
SDFP is open and eXtensible. The objective is to gather the best preservation formats for each data kind under one umbrella.
Below XML there is still some structure left: basic data types. XML Schema can define them. We chose the ISO definitions.
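To make the point about basic data types concrete, here is a minimal sketch in Python that normalises date values to the ISO 8601 form `YYYY-MM-DD`, which is also the lexical form of `xs:date` in XML Schema. It assumes that the ISO definitions mentioned above include ISO 8601 for dates. The list of accepted input patterns and the function name `to_iso_date` are hypothetical and not part of MIXED.

```python
from datetime import datetime

# Hypothetical patterns a legacy spreadsheet might use for dates.
# Day-first order is an assumption; real data needs a documented choice.
KNOWN_PATTERNS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def to_iso_date(raw: str) -> str:
    """Return the date in ISO 8601 form, or raise if no pattern matches."""
    for pattern in KNOWN_PATTERNS:
        try:
            return datetime.strptime(raw.strip(), pattern).date().isoformat()
        except ValueError:
            continue  # Try the next known pattern.
    raise ValueError(f"unrecognised date: {raw!r}")


print(to_iso_date("24-03-2009"))  # prints 2009-03-24
```

Raising an error on an unrecognised value, rather than guessing, keeps the conversion faithful: a value that cannot be typed reliably is better flagged than silently altered.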
MIXED is far from complete. It is an experiment, so we have limited our scope in two ways: one is just a matter of choice, the other may be a more intrinsic limitation.
Here are the more intrinsic limitations. This is what we do
and this is what we don’t
By way of introducing Jan's presentation, I want to say a few words about the MIXED software.