Presented at International Marketing and Output Database Conference, Ireland, Cork 2007
What's a City Transport System Got to Do With Publishing Data in an Output Database?
International Marketing and Output Database Conference
Blarney, Cork, 24th–28th September 2007
Katja Šnuderl
Statistical Office of the Republic of Slovenia
katja.snuderl@gov.si
Abstract
It is easy to compare a city transport system to the process of publishing statistical
data on a statistical website. There are completely unorganized systems, where
everyone drives to work in their own cars, takes whatever route is most convenient at
the time and expects to park as close as possible to their destinations. This is similar
to those systems in which there are no rules as to how, when, where and in what
form data are published. There are several reasons why neither such a transport
system nor such a statistical output database is preferable.
Conversely, there are completely organized systems, where all of the commuters use
a public transportation system designed to their needs. Users adjust to the various
schedules and transportation availability in order to reach their goals. This
corresponds to a metadata-driven system where a well organized metadata repository
runs data publishing through a pre-defined process based on integrated databases
and templates.
This article focuses on work done and lessons learned during a project of upgrading
the Slovenian statistical output database from a file server to a macro database.
Context
Following the general trend of making statistical data available on the web, the
Statistical Office of the Republic of Slovenia (Statistics Slovenia) decided to build an
output database. The first databases (Agriculture Census and Population Census), in
2003, were based on the PC-Axis file format and tools. As the concept proved
efficient, Statistics Slovenia decided to migrate all of its dissemination to the
output database. The dilemma of choosing between a file server system and an SQL
macro model was always present, until some of the largest tables hit the technical
limitations of the file server system. Within a new project in the field of External
Trade, a new PC-Axis SQL macro database was built. Experience with both
systems and with migrating from one to the other helped us identify a metaphor
that can help "non-IT people" understand the differences between table and database
management.
2. Katja Šnuderl: What's a City Transport System Got to Do With Publishing Data in an Output Database?
1. Introduction
It is all about people. There is no IT solution being run by machines for machines.
Each is created by people, maintained by people and used by people. Therefore, when
building an output database it is important to understand how the human mind
works.
Somehow we seem to believe that everything that looks simple is simple. In
reality, making a simple application, in which users can understand the features easily
and learn just by doing, takes thorough analysis of users' needs, their behaviour and
the technical possibilities, as well as an exacting decision process. It takes less work
to make something that looks complicated and is difficult to use.
In terms of a transport system we could say that good transport networks don't just
happen. It takes a lot of effort to turn a chaotic situation into a well run public
service. Good route maps and schedules are based on user needs analysis and
technical possibilities. They evolve for years.
Basic preconditions for a successful project are sharing information (among all
participants and cooperating parties in the project), understanding the project goal,
and decision and (management) support. No support is possible without
understanding the problems. Comparing the building of an output database with a
transport system can sometimes help us explain the basics of standardization and
change to someone who sees building a database purely as an IT matter.
Management can support our needs even without understanding IT matters, provided we
know how to explain them in an understandable way. Since transport is something
most people know and use, it makes a useful comparison.
2. "Keep it as it is, we're fine"
There is always a problem when a system
changes. The new one doesn't always
support all the options the old one had.
Many people ask why a system that runs
well should be changed at all, but if this
view were always respected we would still
be using carriages.
The project on External Trade was built in order to replace dissemination of data in
the Statistical Databank, an older instance of the output database. The Statistical
Databank had a lot of regular users who extracted data monthly. However, only one
kind of extraction was possible: one flow (exports or imports) for one time period by
tariff codes (for one country or total) or by countries (for one tariff code or total). In
the new database users can combine flows, several time periods, many tariff codes
and many countries. The output table always has a multidimensional structure and
also presents empty cells: if a user selects a country with no flows, the country is
listed in the table with the appropriate statistical sign.
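A minimal sketch of how such a table could be assembled, assuming the stored flows are held in a dictionary keyed by (flow, period, country). The function name build_table, the NO_DATA sign and the sample figures are hypothetical and are not taken from the Statistics Slovenia application. Every selected combination produces a cell, and combinations without data receive a statistical sign instead of being dropped.

```python
from itertools import product

# Hypothetical statistical sign for "no occurrence"; the sign actually used is an assumption.
NO_DATA = "-"

def build_table(data, flows, periods, countries):
    """Return one row per selected combination, including empty cells."""
    rows = []
    for flow, period, country in product(flows, periods, countries):
        # Missing combinations are kept and marked, not skipped.
        value = data.get((flow, period, country), NO_DATA)
        rows.append((flow, period, country, value))
    return rows

# Illustrative figures only.
data = {
    ("exports", "2007M01", "AT"): 120,
    ("imports", "2007M01", "AT"): 95,
}

table = build_table(data, ["exports", "imports"], ["2007M01"], ["AT", "IS"])
for row in table:
    print(row)
```

Because the rows come from the full cross product of the selections, the country without flows still appears once per flow and period.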
Regular users, who had been adjusting to the old database for years, had many problems
and special requests when the new system was introduced. We had to enlarge the
selection size limit in the first week and we introduced new functions to filter data
according to existing data flows. Luckily all users will benefit from the new functions,
though not all parameters of the previous output were met.
3. "Don't just replace cars with buses"
We often say that there is no IT solution that
could change a process by itself. Changing
only the technical part of the process is
similar to giving people buses instead of cars.
Without changing anything else, people
would probably start driving one bus each to
the workplace. A project manager should take
care that new tools are not used in
old and obsolete ways. At the same time it is
essential to know that users have to adjust to
new tools at different levels and not all of
them can be the "drivers".
At Statistics Slovenia we chose a step-by-step approach when building the output
database. The first stage was building the file server, where procedures and tools are
easiest to understand for statisticians who were used to preparing tables in
spreadsheets. The first tables were always prepared by the support team in order to
meet all the general rules. The first examples also helped statisticians understand the
multidimensional table structure. At the beginning we always took what was available
and tried to create a comprehensive multidimensional table from existing tabulations
(published tables). In the last year a major step was made when we introduced new
tabulation rules based on our experience. The new rules introduce a clear
multidimensional structure, where the statistician only defines the content of the
table. The programming unit then prepares a new tabulation with the available tool
(from the view of the source or the responsible person) according to the general rules
of tabulation for the PC-Axis database. The main result of the whole exercise is a better
understanding of the multidimensional table structure among the statisticians and the
programming unit. However, when preparing these tables statisticians had to learn and
use new tools for table management. They have to update existing tables with new time
periods themselves.
When building the new macro database, the next step was taken. Here statisticians
only deal with content definition and don't manage the tables in any way. Once the
data for the new time period are ready, the support unit pulls data into the macro
database. The statistician can make the final check whether data and metadata are
ready to be published. The procedure of pulling data is manual for now and will be
automated when it is stable. At the early stage we prefer to do it manually in order to
learn how the automated process should run.
4. Transport logistics is complex
In reality nobody expects a city tram system to cover all the
areas of the city. Transport modules (trains, trams, metro,
buses, cars, etc.) are differentiated but at the same time
integrated and can be used successively. In the same way a
good IT system should be developed in modules - coherent,
integrated and supporting each other.
When building our new output database a decision was made that new applications
shouldn't depend on any other system within Statistics Slovenia. It was understood
that the dissemination "module" will be integrated with the metadata system, but
only at a later stage. Working the other way round could well have slowed down the
project or even caused it to fail. We decided to pull classifications from the
classification server and maybe at a later stage use direct views. However, as not all
classifications are always prepared in the server, a backup option to import
classifications as TXT files was introduced.
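The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the TXT layout (one tab-separated "code, label" pair per line), the function names and the stand-in for the classification server are all hypothetical, since the paper does not describe the actual interfaces.

```python
import csv
import io

def parse_classification_txt(text, delimiter="\t"):
    """Parse 'code<TAB>label' lines into a dict (assumed layout)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}

def load_classification(name, fetch_from_server, txt_fallback):
    """Prefer the classification server; fall back to an imported TXT file."""
    try:
        codes = fetch_from_server(name)
    except LookupError:
        codes = None
    if codes:
        return codes, "server"
    return parse_classification_txt(txt_fallback), "txt"

# Stand-in for the classification server: it only knows one classification.
def fake_server(name):
    known = {"countries": {"AT": "Austria", "IT": "Italy"}}
    return known[name]  # raises KeyError (a LookupError) when missing

codes, source = load_classification(
    "tariff", fake_server, "01\tLive animals\n02\tMeat\n"
)
print(source, codes)
```

Returning the source alongside the codes makes it easy to see which tables still rely on the backup route.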
A similar solution was introduced for importing data into the output database. We
expect all data to be available in micro or macro databases eventually. Currently at
Statistics Slovenia we still maintain a variety of data sources. Input tables are
created from relational databases, flat files and Excel spreadsheets. Tools for
tabulation are varied, from SQL queries and views to Cobol, TPL, SAS and Excel
tabulations. We even prepared a simple converter that turns TXT files from TPL
into the correct CSV structure. So even though the project was run on data
for External Trade (available in an Oracle database), procedures to import data from
other SQL databases or CSV files or even existing PC-Axis files were developed.
Having the old output database (file server) and building the new one at the same
time brought us the luxury of having an option to keep them both. Our strategy is to
eventually migrate all data to the SQL Macro Database, but there is no need to do it
before input data sources are consolidated. For now both systems will be supported
and integrated.
Another aspect of the coexistence of transport systems is the image of simplicity. When a
system runs smoothly and is easy to use, usually a lot of effort has gone into
integrating and coordinating different modules. Intuitive tools are based on a lot of
experience, selection of needs and testing. On the other hand, a system that looks
complicated and is difficult to use is very easy to develop. You simply respect all needs
and make no selection. In the process of preparing the specifications of the output
database project a lot of emphasis was given to the expected outcome, especially
with the end-user solution (web interface to view the data) in order to make it
intuitive and easy to use. Unfortunately less experience was available when
building the database management application, so the tool turned out to be rather
complicated to use.
5. "Let the grass grow, please!"
Allowing exceptions to rules is similar to
building parking places where people tend
to park on the grass. In the end everything is a
parking place and the chaos remains.
There is no green left to calm the
nervous drivers down.
Already in our first output database some general rules were introduced. We had the
file naming convention (unique file names within the whole system), corporate
metadata, common classifications and some standard links (to methodological
explanations, the release calendar and questionnaires). But in a file server it is
difficult to validate whether each and every file complies with the rules. As this was
done manually, not all exceptions were noticed and some were even agreed upon.
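One of these rules, unique file names within the whole system, lends itself to an automated check. The sketch below is an assumption about how such a check might look, not a description of a tool used at Statistics Slovenia; the function name and the sample paths are invented for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def find_duplicate_names(paths):
    """Return file names (compared case-insensitively) used by more than one path."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[PurePosixPath(p).name.lower()].append(p)
    return {name: ps for name, ps in by_name.items() if len(ps) > 1}

# Hypothetical file server listing.
paths = [
    "Prices/0400601E.px",
    "Tourism/2118101E.px",
    "Archive/0400601e.px",
]

duplicates = find_duplicate_names(paths)
for name, locations in duplicates.items():
    print(name, "is used by", locations)
```

Run regularly over the file server, a check like this would surface exceptions that a manual review can miss.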
On the other hand, when we built the macro database we formed some very strict
rules. For example, all classifications in use have to be maintained in the classification
server. Even though there is an alternative to import classifications, all tables with
exceptions will be maintained in the file server. This decision is based on workload
balancing: in the macro database the management of metadata is done by the
support unit. If statisticians insist on maintaining an exception to the rule, they have
to manage the table themselves. They can only do that within the file server. Even in
the long run we don't plan to transfer management of the metadata from the support
unit to the statisticians.
6. Why bother with anything other than a taxi?
In some big cities around the world people use taxis rather than public
transportation. There is no worrying about schedules and no need to learn which route
or which number to take. In output database management terms there can be a support
unit that manages all the
dissemination of statistical data.
Statisticians are only involved in
managing the statistical process up
to dissemination. They don't have to
learn or use any new tools to
prepare data for dissemination.
Statistics Slovenia is relatively small. The output database support unit grew to 5
members who work on regular production and development in parallel. Therefore the
process of producing files to be published was organized within the subject-matter
units from the very beginning. One of the arguments for such a decision was also
knowledge, as only statisticians knew the content of a statistical survey and could
define expected outputs. But through managing the file server, the support unit also
gathered experience and knowledge. While building the new macro
database we wondered whether there was any need to put any technical burden on the
content managers. We decided not to do so at the start, so all technical matters are
handled within the output database unit.
It is always a matter of balancing – if all management is given to subject-matter
units, it is not very probable that the coherence principles would be met. If all
management is centralized, subject-matter units could oppose solutions that don't
support their special requirements. So it is important to set some clear rules and
introduce validation tools that support these rules on one hand, and balance
management between the content managers and output database team on the other.
7. "Lost"
Not many people get lost in the Paris Metro network. At
every station it is easy to find maps and information on
where to exit and where to continue in the right
direction. But in another country it is fairly easy to miss
The Hague train station and end up in Rotterdam.
As an output database includes more and more data, it
also grows larger and larger. It is important to build a
navigation system that helps users easily navigate within the database. This applies
both to entering the database to find the data and to finding the way back later.
The first challenge is how to build an efficient way to find the data. The new output
database at Statistics Slovenia offers several options. One is browsing through the
content tree from the starting page of the database. There all subjects are available
and users have to open the content tree and check whether the table titles seem to
match their needs. This option is available without additional maintenance of
metadata, just using the database content definitions. Besides the entry page,
we've introduced an option to open the content tree at any level within a subject
area. For this purpose we use content identification numbers, unique and
standardized among different dissemination products. For example, on our website
every theme (e.g. Prices) has an ID number. Opening the database content tree with
the same ID number opens only items within the same theme (Prices). Identifications
go down to a single table. When the content tree opens partially, the current location
is read from the database and written in the header section.
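Opening the tree partially by ID can be illustrated with a small recursive lookup. The tree structure, the ID values and the function name below are assumptions made for the example; the paper only states that IDs are unique, shared across dissemination products and go down to a single table.

```python
def find_node(node, content_id, path=()):
    """Depth-first search for content_id; returns (subtree, titles from the root)."""
    here = path + (node["title"],)
    if node["id"] == content_id:
        return node, here
    for child in node.get("children", []):
        found = find_node(child, content_id, here)
        if found:
            return found
    return None

# Hypothetical content tree with invented ID numbers.
tree = {"id": "0", "title": "Database", "children": [
    {"id": "04", "title": "Prices", "children": [
        {"id": "04006", "title": "Consumer prices", "children": [
            {"id": "0400601", "title": "Consumer price indices"},
        ]},
    ]},
    {"id": "21", "title": "Tourism", "children": []},
]}

subtree, location = find_node(tree, "04")
header = " > ".join(location)  # current location for the header section
print(header)
print([child["title"] for child in subtree["children"]])
```

The same lookup works for a theme, a sub-theme or a single table, because every level carries its own ID.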
In the next step we will add an option to search for tables. We plan to introduce a
keyword search, where a pre-defined list of keywords will be prepared and linked to
the tables. Users will only be able to select words from the list. The words will be
suggested while typing the letters. The list of keywords will be maintained regularly in
order to support users' needs.
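As the keyword search is still only planned, the following is merely one way it could work: a sorted, pre-defined keyword list with prefix matching for suggestions while typing. The class name, the table identifiers and the keywords are hypothetical.

```python
from bisect import bisect_left

class KeywordIndex:
    """Pre-defined keywords linked to tables, with prefix suggestions."""

    def __init__(self, keyword_to_tables):
        self._tables = {k.lower(): v for k, v in keyword_to_tables.items()}
        self._sorted = sorted(self._tables)

    def suggest(self, typed, limit=5):
        """Return up to `limit` keywords that start with the typed letters."""
        prefix = typed.lower()
        start = bisect_left(self._sorted, prefix)
        suggestions = []
        for word in self._sorted[start:]:
            if not word.startswith(prefix) or len(suggestions) == limit:
                break
            suggestions.append(word)
        return suggestions

    def tables_for(self, keyword):
        """Only words from the list return tables; anything else returns nothing."""
        return self._tables.get(keyword.lower(), [])

# Hypothetical keyword list.
index = KeywordIndex({
    "exports": ["T1"],
    "export prices": ["T2"],
    "earnings": ["T3"],
})
print(index.suggest("exp"))
print(index.tables_for("exports"))
```

Restricting searches to the maintained list keeps results predictable, at the cost of regular upkeep of the keywords.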
When users select data from a table in the database, they are often interested in
continuing work on other tables from the same content. To support such requests, we
introduced a "List of tables" command in the menu bar, which opens the content tree
for the same content.
8. "Shinkansen or a good old tram?"
We are still far from Shinkansen and the Japanese transportation system. Actually in
Slovenia one can experience that it is not enough to replace old trains with new ones
that can reach 200 km/h. Here at some places they have to slow down to 50
km/h or less, otherwise the tracks would collapse. Or you get stuck at a train station
because nobody knows how to unlock a secured carriage and after half an hour of
trying and thinking they have to move people and uncouple the carriage so the train
can proceed.
So, what we have done for now is limit parking places for cars within the city and
introduce many bus routes, one intercity train route ending in the suburbs and one tram
route from the suburbs to the centre. The system might not be the most modern, but it
has proven to be reliable. In reality we reduced the number of published Excel
spreadsheets in favour of multidimensional tables, introduced standard procedures for
tabulation of multidimensional tables, included the classification server in the
dissemination process and built a macro database for data on External Trade. The
next "tram" routes will be prepared for Earnings and Tourism Statistics.
After deciding to maintain both systems (the file server and the macro database) our
main goal was to integrate them without putting a burden on the user when searching
for data. The basic principles are:
a) Single entry point
b) Same "Look and feel"
c) Same functions + advanced options in the macro database
d) Same support (header menus)
e) Single registration for advanced users (option to save queries).
A lot of effort was put into coherent design of the two systems, adjusted to the design
of the Statistics Slovenia website. The only connecting point of the two databases is
the content tree view, the entry page of the database. From there users are
redirected either to a table in the file server database or in the macro database. In
the tree view there are also links to related content: First Releases, methodological
explanations, statistical questionnaires, special publications, links to external websites
(data on websites of other governmental bodies) and links to the Eurostat database.
To view or download data, a user can select any values from the table, change
the text/code presentation of values, pivot the table, view selection-specific footnotes,
change the presentation of decimals, display data as a graph or map and download data
in several formats. Advanced features in the macro database support selection and
filtering of hierarchical variables by levels, removing empty lines, sorting and a better
structured presentation of footnotes.
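Two of the advanced features, filtering a hierarchical variable by level and removing empty lines, can be sketched in a few lines. The row layout (code, hierarchy level, cell values), the empty-cell sign and the sample values are assumptions for illustration only.

```python
def filter_rows(rows, level=None, drop_empty=False, empty_sign="-"):
    """Keep rows at the requested hierarchy level and optionally drop empty lines."""
    result = []
    for code, row_level, values in rows:
        if level is not None and row_level != level:
            continue
        if drop_empty and all(v == empty_sign for v in values):
            continue
        result.append((code, row_level, values))
    return result

# Hypothetical rows: (code, hierarchy level, cell values).
rows = [
    ("EU", 1, [30, 45]),
    ("AT", 2, [10, 20]),
    ("IS", 2, ["-", "-"]),
]

selected = filter_rows(rows, level=2, drop_empty=True)
print(selected)
```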
With the new database structure we are also introducing pre-defined tables, where
less experienced users can look at data just by clicking the table title. The content of
pre-defined tables was defined by each theme editor.
9. Conclusion
Our habits differ in different societies. Not every country has as many problems with
transport systems as Slovenia. But still, there are some basic principles that everyone
understands and that can be used when explaining the principles of building a new IT
solution to a non-IT person.
During the project of building the new dissemination macro database our main goal
was to build a system that will support different contents, different input data formats
and versatile users. From the start we have been careful about standardisation,
coherence and process management. We are building on our experiences with the file
server database. At the same time we are trying to meet most users' needs.
Statistics is produced by people for people and our role in this process is to make it
accessible, reliable and understandable.