Archiving Statistics on the Internet
Annegrete Wulff
Statistics Denmark
awu@dst.dk 29 January 2015
Archiving Internet published statistics
International Marketing and Output Database Conference
Cork, Ireland, 24–28 September 2007
Introduction
Slogans like “Time for Numbers – Numbers on Time”¹ or “Wissen.Nutzen”²
refer to the fact that statistics should be timely and are the basis for
good planning.
Access to the latest figures as soon as they are published is increasingly
important. The use of the Internet makes this possible.
Yesterday’s figures and historic data are, however, also part of the
description of our societies.
Most statistical offices – among them Statistics Denmark – used to have
simple, objective and easy-to-follow rules for archiving their data:
− A few copies of every printed publication were kept in stock and the
National Archive received a copy.
− The statistical files behind the books were delivered to the National
Archive.
− The documentation belonging to the statistical files was archived there
as well.
Over the past 10 years the Internet has become an ever more important
publishing medium and hard copy publications have diminished in
importance.
What is archived – and what is not?
Our legacy of statistics is still accessible and readable in the archives and in
the libraries as printed books. It does not include all data that passed
through our production. It represents everything that was published to
be read by a range of users.
In this paper I exclude the readiness archiving we do in order to secure
continued production. That means backup procedures for servers with data,
programs and systems will not be taken into account here. I shall focus on
the archiving of data and information disseminated to the public in
electronic form.
¹ Statistics Denmark
² DESTATIS, Germany
Today’s practice
Currently all pdf publications are archived. If major errors are noted, a new
release of the pdf is published and both versions are archived. In the case of
minor errors the existing pdf is overwritten; thus the original is not
archived. The electronic archive is accessible on the Internet and from our
internal server. The archiving of Statistical Abstracts (Daily News) dates
back to 1999.
An in-house crawler was put into operation in 2005. It is used to
detect invalid links on the site as well as to archive the full site.
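The double role of such a crawler (link checking and archiving) can be illustrated with a minimal sketch. It is not the in-house crawler itself: the names `LinkExtractor` and `crawl` are hypothetical, and an in-memory dictionary of pages stands in for fetching over HTTP, so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(site, start):
    """Walk `site` (a dict mapping URL -> HTML text) starting at `start`.

    Returns (archive, broken): `archive` holds a copy of every page
    reached, `broken` lists (source page, target) pairs whose target is
    not part of `site`. Assumption of this sketch: links to other sites
    would also be reported as broken, since nothing is fetched.
    """
    archive, broken = {}, []
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in archive:
            continue  # already archived via another link
        archive[url] = site[url]
        parser = LinkExtractor()
        parser.feed(site[url])
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in site:
                broken.append((url, target))
            elif target not in archive:
                queue.append(target)
    return archive, broken
```

A real crawler would replace the dictionary lookup with an HTTP request and write each page to disk with a date stamp; the traversal and the bookkeeping of invalid links would stay the same.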
Crawling and archiving are carried out according to the schedule
below:
Yearly:
− Snapshot of all pages and subpages of the website, including all pdf
files.
Monthly:
− All pages on the site of the Danish version www.dst.dk as well as the
English version www.dst.dk/uk are archived (pdf documents are
excluded as they are archived separately). This is a time-consuming
process taking approximately 20 hours.
− Snapshot of the StatBank interface in the Danish as well as the English
version. As the StatBank is an interactive databank, figures are retrieved
through the user’s selection. Dynamic pages that contain forms, JavaScript
or other elements that require “human interaction” cannot be archived.
So neither the functionality of the StatBank nor the data resulting from a
selection is archived this way.
− www.Alexa.com (the web archive) has recorded examples of our web site
since 1997. In 1997 only three downloads of the site were made. In
2006 there were around 100. They are accessible on the Internet.
Weekly:
− Economic key indicators on the web
Daily:
− www.dst.dk front page of the web site, Danish version
− www.dst.dk/uk front page of the web site, English version
Three times a day:
− Figures in the IMF DSBB agreement www.dst.dk/imf
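The schedule above can be written down as a small lookup table. The sketch below is only an illustration: the frequencies and targets come from the list, whereas the job descriptions, the function names and the concrete run times (midnight, Mondays, the first of the month, 1 January, and the hours 0, 8 and 16) are assumptions. The external Alexa web archive is left out because it is not run by us.

```python
from datetime import datetime

# Frequencies and targets follow the schedule in the text; the wording of
# the job descriptions is hypothetical.
SCHEDULE = {
    "yearly": ["full site snapshot including pdf files"],
    "monthly": ["all pages of www.dst.dk and www.dst.dk/uk, pdf excluded",
                "StatBank interface, Danish and English"],
    "weekly": ["economic key indicators"],
    "daily": ["front page www.dst.dk", "front page www.dst.dk/uk"],
    "three_times_a_day": ["www.dst.dk/imf"],
}


def due_frequencies(now):
    """Return the frequencies whose jobs should start in the hour of `now`.

    Assumed run times: 00:00, 08:00 and 16:00 for the IMF figures; all
    other jobs start at midnight on their respective day.
    """
    due = []
    if now.hour in (0, 8, 16):
        due.append("three_times_a_day")
    if now.hour == 0:
        due.append("daily")
        if now.weekday() == 0:      # Monday
            due.append("weekly")
        if now.day == 1:
            due.append("monthly")
            if now.month == 1:
                due.append("yearly")
    return due


def due_jobs(now):
    """Flatten the due frequencies into the list of jobs to run."""
    return [job for freq in due_frequencies(now) for job in SCHEDULE[freq]]
```

Called once an hour, `due_jobs(datetime.now())` would hand the crawler its work list; on most hours of the day that list is empty.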
The user interface and layout of the StatBank are archived according to the
procedure described in the previous section. The data in the StatBank,
however, is not saved in that connection. The StatBank is the primary source
for all our published statistics, so it would seem logical to secure the
archiving of this primary source first. However, this is not the case.
Statistics Denmark is preparing a set of rules for archiving.
In doing so, we will weigh the costs against the usefulness.
An error in a table may turn up after data has been published. As a result
the data needs to be corrected. There are two ways of handling this:
1. A new file with corrected data is loaded and the original file is
“unpublished” but still kept.
2. A new file with corrected data is loaded and overwrites the original file.
Both methods are used, although the one where all loads are kept is the
more common. All loaded data is stored (even data which was never
published), but only for the period or part of the file that is actually
updated. The file also contains some metadata, but only as codes. As the
archived files are not stored in the macro database environment, reading
these files may be misleading if the metadatabase has been changed over time.
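The difference between the two ways of handling a correction can be made concrete with a small sketch. It is not the StatBank implementation: the classes `Load` and `TableStore`, their methods and the period keys are hypothetical and only illustrate what is, and is not, left in the archive afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Load:
    """One load of a table: the periods it updates and whether it is public."""
    data: dict                 # e.g. {"2006Q2": 110.0}; updated periods only
    published: bool = True


@dataclass
class TableStore:
    loads: list = field(default_factory=list)

    def load(self, data):
        """Add a regular update containing only the changed periods."""
        self.loads.append(Load(dict(data)))

    def correct_keep(self, data):
        """Method 1: unpublish the latest load but keep it, then add the fix."""
        self.loads[-1].published = False
        self.load(data)

    def correct_overwrite(self, data):
        """Method 2: replace the latest load; the erroneous figures are lost."""
        self.loads[-1] = Load(dict(data))

    def current(self):
        """Merge all published loads in order; later loads win per period."""
        merged = {}
        for entry in self.loads:
            if entry.published:
                merged.update(entry.data)
        return merged
```

With either method `current()` returns the same corrected table, which is what the majority of users see. Only with the first method does the erroneous load remain in `loads`, where it could later be made accessible separately from the corrected data.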
The fact that not all erroneously published figures are archived and
available has not been regarded as a major problem so far. Statistics Denmark
considers the corrected figures to be the ones of interest for the majority of
the users. Moreover, it might confuse the majority of users if the erroneous
data were also available just to satisfy a very small minority. Nevertheless,
when resources in our unit permit, we should look for a way of archiving
these files that makes them easier to access
without interfering with the corrected data.
It should be mentioned that series and time periods holding correct data are
never deleted from the StatBank.
If everything that we publish is to be available to the public in the future,
we need to take the following “products” into account:
− All databank tables – every single update and revision
− All versions of pdf documents
− Every single page of the web site
− Electronic, interactive publications – all updates.
Should we choose an ideal or a pragmatic solution? Will the archiving
activity be enormous? Can we archive in a way that allows us to still retrieve
and access the information?
These are some of the challenges we need to solve.
Why do we archive?
There may be a range of reasons for an organisation to archive its products
and activities. Some are “need to have” while others may be classified as
“nice to have”.
Pdf publications follow the same rules as printed publications. We are
obliged to deliver a copy to the Royal Library. We also keep a copy in
our own archive. From our archive the pdf is accessible to the public,
while this is not the case in the Royal Library.
There are no legal obligations concerning any of the other electronically
published products.
Timeliness is an important quality indicator. However, not only the latest
updated statistics are of interest. Historians and others with an interest in
the past often need to complement their research with statistical data. In
this way our output database StatBank grows larger and larger as we do not