Icc2013 country names

Variability of country names and
identifiers in datasets –
Reconciling practical and cultural
perspectives
International Cartographic Conference, Dresden
Laura Kostanski | Sara-Jane Farmer | Rob Atkinson
August 2013
GOVERNMENT AND COMMERCIAL SERVICES THEME

Today’s Presentation
• Overview
• Cultural Reasons for Multiple Country Names
• Impact of Cultural Reasons
• Multiple Country Name Datasets
• Reconciling Information
• Spatial Identifier Reference Framework (SIRF) Approach

Overview
• There are multiple country name datasets in use
  • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO
• Multiple stakeholders in creation and use of data using these names
  • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups
• Time spent accessing and reconciling data is costly and delays production of results from analysis
• The same issues apply to most, perhaps all, identifiers of spatial objects
• Preview of how we might tackle this problem

Context

CSIRO. UNSDI Gazetteer for Social Protection in Indonesia

Data Analysis
Utopia Way Inc. investigated files in the data.un.org dataset. …
Country names were discovered in multiple fields, such as:
• country of birth,
• country of citizenship,
• country or area,
• country or territory,
• country or territory of asylum or residence,
• country or territory of origin,
• reference area.
The investigation identified significant issues with country name alignments and mismatches.
An automated matching process was set up to explore the extent of the issue.
In all, 21,195,188 rows of data were analysed.
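
The slides do not say how the automated matching was implemented. As a minimal sketch, assuming the simplest possible approach (exact comparison of each field value against a reference index), the process could look like the following. The names `REFERENCE_NAMES`, `COUNTRY_FIELDS` and `match_report` are hypothetical, and the sample rows are illustrative rather than taken from data.un.org.

```python
from collections import Counter

# Hypothetical reference index: a tiny stand-in for a country name list
# such as ISO 3166. A real index is far longer.
REFERENCE_NAMES = {"Yemen", "Germany", "Saint Lucia", "British Virgin Islands"}

# A few of the fields named on the slide as carrying country names.
COUNTRY_FIELDS = ("country of birth", "country or area", "reference area")


def match_report(rows, reference=REFERENCE_NAMES, fields=COUNTRY_FIELDS):
    """Count exact matches and collect the values that fail to match."""
    matched = 0
    unmatched = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None:
                continue
            if value in reference:
                matched += 1
            else:
                unmatched[value] += 1
    return matched, unmatched


# Illustrative rows only; these are not taken from data.un.org.
sample_rows = [
    {"country or area": "Yemen"},
    {"country or area": "YEMEN"},
    {"country of birth": "St. Lucia"},
    {"reference area": "Germany"},
    {"reference area": "Germany "},
]

matched, unmatched = match_report(sample_rows)
print("matched:", matched)
for value, count in unmatched.most_common():
    print("unmatched:", repr(value), count)
```

Exact comparison is deliberately strict: every unmatched value becomes a candidate for one of the error classes, which is what makes the extent of the problem measurable.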

Common “Errors”

Index error | Examples
Withdrawn countries with no ISO 3166 code | “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
Abbreviation | “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
Added markers | “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
Capitalisation | “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
Brackets “()” or “[]” instead of commas | “Virgin Islands (British)” for “British Virgin Islands”.
Standards confusion | The ISO 3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
Use of familiar names | Brunei, Ivory Coast, China, Libya
Issues with character translation | Cote d'Ivoire, Åland Islands, Curaçao, Réunion
Misspellings | Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
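
Several of the error classes above are mechanical and could be reduced to a comparison key before matching. The following is a rough sketch under stated assumptions: the rule order, the `match_key` and `resolve` functions, and the four-entry index are all hypothetical, and the rules cover only the patterns quoted in the examples (whitespace, the “MDG_” and “+” markers, bracketed qualifiers, listed abbreviations, capitalisation).

```python
import re

# Abbreviations listed on the slide, mapped to their expansions.
ABBREVIATIONS = {
    "Rep.": "Republic",
    "St.": "Saint",
    "Is.": "Island",
    "Isds": "Islands",
    "&": "and",
}


def match_key(raw):
    """Reduce a raw name to a key that ignores the mechanical error classes."""
    name = re.sub(r"\s+", " ", raw).strip()      # double and trailing spaces
    name = re.sub(r"^MDG_", "", name)            # "MDG_" region prefix
    name = name.rstrip("+").strip()              # "+" region suffix
    # "Virgin Islands (British)" -> "British Virgin Islands"
    bracket = re.fullmatch(r"(.+?) [(\[](.+)[)\]]", name)
    if bracket:
        name = f"{bracket.group(2)} {bracket.group(1)}"
    words = [ABBREVIATIONS.get(word, word) for word in name.split(" ")]
    return " ".join(words).casefold()            # capitalisation differences


# Hypothetical index of preferred names, keyed by their own match key.
PREFERRED = ["Yemen", "Saint Lucia", "British Virgin Islands", "Southern Asia"]
INDEX = {match_key(name): name for name in PREFERRED}


def resolve(raw):
    """Return the preferred name for a raw value, or None if unknown."""
    return INDEX.get(match_key(raw))


print(resolve("YEMEN"))
print(resolve("St.  Lucia "))
print(resolve("Virgin Islands (British)"))
print(resolve("MDG_Southern Asia"))
print(resolve("South Asia"))
```

The last call returns `None`: misspellings, familiar names and withdrawn countries are not mechanical, so they still need a curated alias list or human review.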

Data sets providing country names
Organisation | Name of Data Set
United Nations Statistics Division | Country and Region Codes for Statistical Use
Working Group on Country Names, United Nations Group of Experts on Geographic Names | List of Country Names
Terminology Section, Department for General Assembly and Conference Management | Multilingual Terminology Database (UNTERM)
International Organization for Standardization (ISO) | ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
Food and Agriculture Organisation of the United Nations | Global Administrative Unit Layers (GAUL)
United Nations Geospatial Information Working Group (UNGIWG) | Second Administrative Level Boundaries (SALB)
National Geospatial-Intelligence Agency | Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
NATO | Standards Agreement (STANAG) 1059

Two Aspects of Country Name Datasets
1: Development of datasets
Why is there a proliferation of country name sources?
• Cultural issues
• Development practices

2: Usage
How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging
technologies such as the Spatial Identifier Reference Framework (SIRF)
assist in reducing the ambiguity associated with multiple,
heterogeneous country name sources?

• Can we do better? What do we need to do it?

Cultural Issues
• Toponyms provide communities with identity (Toponymic Identity is both
reflected and reinforced)
• Country names are the highest-order toponyms
• Problems are similar at lower levels, compounded by scale (size of problem)
and higher rates of change (e.g. electoral boundaries, urban growth)

Endonym/Exonym
Beyond an individual’s attachment to the Endonym of their country, other languages often use multiple Exonyms for it.
• e.g. Deutschland (Endonym) is Germany in English and Allemagne in French (Exonyms)

Other Cultural Country Naming Considerations
Formal/Informal naming applications
(particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)

Political/Non-Political Usage
e.g. ‘Commonwealth of Australia’

Change over time
e.g. Czechoslovakia

Non-standardised international conventions
e.g. Saint or St? The or none?

The Impact
All of these cultural mores impact on the ability of people and organisations to
record country name information in a standardised, transparent manner.
Thus, there exists a proliferation of country name lists which are officially
promoted by international agencies.
This impact is then intensified in usage.

Options
Suggested improvements to the indices and standards include:
1. Improve access to source data
a. Make the UN’s regions list available as a CSV file online, including withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
b. Make the UN’s economic status list available as a CSV file online.

2. Lobby to improve content
a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
b. ISO to correct inconsistencies in the ISO countries list (e.g. “republic” not “Republic” in Bolivia’s name).

3. Policy
a. Make a definitive statement about which GIS naming standard (ISO, UNSTATS etc.) UN online development data should attempt to adhere to.

4. Better citation mechanisms
– Standardised metadata and identifiers that “resolve”, i.e. link back to data
– Shared infrastructure to link all the information together
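
Option 1a asks for assignment and withdrawal dates because a code or name is only meaningful for the years in which it was in force. A minimal sketch of such a time-aware lookup follows. The CSV layout (`code,name,assigned,withdrawn`) and the functions `load_codes` and `name_for` are assumptions, since no such file is described in the slides, and the sample rows are illustrative and should be verified against the official lists before use.

```python
import csv
import io

# Hypothetical CSV layout; the slide only asks that such a file exist.
# Rows are illustrative and should be checked against the official list.
SAMPLE_CSV = """code,name,assigned,withdrawn
CS,Czechoslovakia,1974,1993
CS,Serbia and Montenegro,2003,2006
CZ,Czechia,1993,
"""


def load_codes(text):
    """Parse the CSV text into a list of records with integer years."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "code": row["code"],
            "name": row["name"],
            "assigned": int(row["assigned"]),
            # An empty withdrawal year means the code is still in use.
            "withdrawn": int(row["withdrawn"]) if row["withdrawn"] else None,
        })
    return records


def name_for(records, code, year):
    """Return the name a code referred to in a given year, or None."""
    for rec in records:
        if rec["code"] != code:
            continue
        still_valid = rec["withdrawn"] is None or year < rec["withdrawn"]
        if rec["assigned"] <= year and still_valid:
            return rec["name"]
    return None


records = load_codes(SAMPLE_CSV)
print(name_for(records, "CS", 1980))
print(name_for(records, "CS", 2004))
print(name_for(records, "CS", 1998))
```

The same code can point at different countries in different years, so a lookup without the data year can silently return the wrong name.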

Spatial Identifier Reference Framework
CSIRO has been working with stakeholders including UN, National
agencies and others on a set of standards and infrastructure
services to support discovering and linking multiple sources of
spatial references.
This is being presented in more detail in:
6D.3 Spatial Identifier Reference Framework (SIRF): Realising the
potential of SDI Using Spatial Identifiers to Link Multiple
Information Systems (#633)
Paul Box 1, Robert Atkinson 1, Laura Kostanski 2
S6-D - SDI
Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room:
Conference Level - C1

One real world feature: a bus station

Currently systems are disconnected and difficult to integrate. The feature is represented in multiple systems using different names, and classified and represented in different ways:

Source | Identifier | Feature Type | Footprint
BIG, National Gazetteer of Indonesia (Gazetir Indonesia) | Merak, Stasiun Bis | Transport | Point
Department of Transport, Bus Terminals (Terminus Dataset) | Merak | Terminal | Polygon

Spatial Identifier REFERENCE FRAMEWORK
Links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
• “Merak, Stasiun Bis” (gazetteer entry in Gazetir Indonesia) is the same as “Merak” (gazetteer entry in Terminus Dataset)
• The gazetteer entries are used in linked resources: Navigation application, Online Public Transport Map, Passenger Travel Stats Application
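
The linking idea can be sketched with plain data structures: “same as” assertions connect gazetteer entries, and following them lets an application reach every resource linked to any equivalent entry. This is an illustration only, not SIRF's data model or API. The names `SAME_AS`, `USED_IN`, `equivalents` and `linked_resources` are hypothetical, and the assignment of each resource to a particular entry is an assumption made for the sketch.

```python
from collections import defaultdict

# Gazetteer entries identified by (gazetteer, identifier) pairs.
BIG_ENTRY = ("Gazetir Indonesia", "Merak, Stasiun Bis")
TRANSPORT_ENTRY = ("Terminus Dataset", "Merak")

# "Same as" assertions between entries for one real-world feature.
SAME_AS = [(BIG_ENTRY, TRANSPORT_ENTRY)]

# Resources that use each entry. Which resource uses which entry is an
# assumption made for this sketch.
USED_IN = {
    BIG_ENTRY: ["Navigation application", "Online Public Transport Map"],
    TRANSPORT_ENTRY: ["Passenger Travel Stats Application"],
}


def equivalents(entry, same_as=SAME_AS):
    """Return every entry reachable from `entry` through same-as links."""
    neighbours = defaultdict(set)
    for a, b in same_as:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {entry}, [entry]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def linked_resources(entry):
    """List resources linked to the entry or to any equivalent entry."""
    found = []
    for eq in sorted(equivalents(entry)):
        found.extend(USED_IN.get(eq, []))
    return found


print(linked_resources(TRANSPORT_ENTRY))
```

Starting from the Department of Transport's “Merak”, the lookup also reaches resources that only know the feature as “Merak, Stasiun Bis”, without either system changing its own identifier.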

Identifiers
This is the “tricky part”
Let's start with the practical implication…
Catchment | ExtractionRate | Storage
1123343 | 730 | 300

Catchment Boundary | Area | Geometry
1123343 | 33535.4 | 151.3344,35.330…….

“Distributed” references
How to ask for this entity:
Catchment | ExtractionRate | Storage
1123343 | 730 | 300

Internet

How to deliver this entity (SDI resource access):
Catchment Boundary | Area | Geometry
1123343 | 33535.4 | 151.3344,-35.330…….
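
A small sketch of what a shared identifier makes possible: two separately held tables can be combined on request because both use the catchment identifier 1123343. Dictionaries stand in for the remote sources here, and no network calls are made. The names `WATER_USE`, `BOUNDARIES`, `SOURCES` and `resolve` are hypothetical, and the geometry is left out because it is truncated on the slide.

```python
# Two separately maintained sources keyed by the same catchment identifier.
# In practice each would sit behind its own web service; dictionaries stand
# in for them here.
WATER_USE = {"1123343": {"ExtractionRate": 730, "Storage": 300}}
BOUNDARIES = {"1123343": {"Area": 33535.4}}  # geometry omitted: truncated in source

SOURCES = {"water-use": WATER_USE, "catchment-boundary": BOUNDARIES}


def resolve(identifier, sources=SOURCES):
    """Ask every source for the identifier and merge what comes back."""
    merged = {"Catchment": identifier}
    for table in sources.values():
        record = table.get(identifier)
        if record is not None:
            merged.update(record)
    return merged


print(resolve("1123343"))
```

The join works only because both sources agree on the identifier. If one of them used a name instead, the reconciliation problems described for country names would reappear.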

One real world feature: a bus station

[Diagram: the same two gazetteer records (BIG National Gazetteer of Indonesia; Department of Transport Bus Terminals) and linked resources as on the earlier bus station slide, now annotated with the labels Provenance, URI, Describe, Discover and Link.]

Thank you
For more information
Rob.atkinson@csiro.au
