1. Variability of country names and identifiers in datasets: Reconciling practical and cultural perspectives
International Cartographic Conference, Dresden
Laura Kostanski | Sara-Jane Farmer | Rob Atkinson
August 2013
GOVERNMENT AND COMMERCIAL SERVICES THEME
2. Today’s Presentation
• Overview
• Cultural Reasons for Multiple Country Names
• Impact of Cultural Reasons
• Multiple Country Name Datasets
• Reconciling Information
• Spatial Identifier Reference Framework (SIRF) Approach
3. Overview
• There are multiple country name datasets in use
  • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO
• Multiple stakeholders in creation and use of data using these names
  • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups
• Time spent accessing and reconciling data is costly and delays production of results from analysis
• The same issues apply to most, perhaps all, identifiers of spatial objects
• Preview of how we might tackle this problem
5. Data Analysis
Utopia Way Inc. investigated files in the data.un.org dataset. …
Country names were discovered in multiple fields, such as:
• country of birth,
• country of citizenship,
• country or area,
• country or territory,
• country or territory of asylum or residence,
• country or territory of origin,
• reference area.
The investigation identified significant issues with country name alignments and mismatches.
An automated matching process was set up to explore the extent of the issue.
In all, 21,195,188 rows of data were analysed.
6. Common “Errors”
Index error: Examples
• Withdrawn countries with no ISO3166 code: “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
• Abbreviation: “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
• Added markers: “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
• Capitalisation: “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
• Brackets “()” or “[]” instead of commas: “Virgin Islands (British)” for “British Virgin Islands”.
• Standards confusion: The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
• Use of familiar names: Brunei, Ivory Coast, China, Libya
• Issues with character translation: Cote d'Ivoire, Åland Islands, Curaçao, Réunion
• Misspellings: Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
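Several of the error classes listed above are mechanical and can be reduced before any matching is attempted. The following is a minimal sketch, not the matching process used in the investigation: the function name normalise_name and the abbreviation table are assumptions made for illustration, built only from the examples on this slide.

```python
import re

# Expansions taken from the slide's examples; the table itself is an assumption.
ABBREVIATIONS = {
    "Rep.": "Republic",
    "St.": "Saint",
    "Is.": "Island",
    "Isds": "Islands",
    "&": "and",
}


def _expand(token: str) -> str:
    """Expand one abbreviation, keeping a trailing comma if present."""
    core = token.rstrip(",")
    return ABBREVIATIONS.get(core, core) + token[len(core):]


def normalise_name(raw: str) -> str:
    """Return a comparison key for a country or region label (hypothetical helper)."""
    name = raw.strip()
    # Drop added markers: an "MDG_" prefix and a trailing "+".
    if name.startswith("MDG_"):
        name = name[len("MDG_"):]
    name = name.rstrip("+").strip()
    # Turn "Name (Qualifier)" or "Name [Qualifier]" into "Name, Qualifier".
    name = re.sub(r"\s*[\(\[]([^\)\]]*)[\)\]]", r", \1", name)
    # Force one space after each comma so "Yemen,Rep." splits into tokens.
    name = re.sub(r",\s*", ", ", name)
    # Expand abbreviations; split() also collapses double and trailing spaces.
    tokens = [_expand(tok) for tok in name.split()]
    # Fold case so "YEMEN" and "Yemen" compare equal.
    return " ".join(tokens).casefold()
```

Normalisation only narrows the set of variants. “Yemen,Rep.” and “Yemen, Republic of” still produce different keys, so a corrections table against a chosen standard is still needed for the remaining cases.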
8. Data sets providing country names
Organisation: Name of Data Set
• United Nations Statistics Division: Country and Region Codes for Statistical Use
• Working Group on Country Names, United Nations Group of Experts on Geographic Names: List of Country Names
• Terminology Section, Department for General Assembly and Conference Management: Multilingual Terminology Database (UNTERM)
• International Standards Organisation (ISO): ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
• Food and Agriculture Organisation of the United Nations: Global Administrative Unit Layers (GAUL)
• United Nations Geospatial Information Working Group (UNGIWG): Second Administrative Level Boundaries (SALB)
• National Geospatial Intelligence Agency: Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
• NATO: Standards Agreement (STANAG) 1059
9. Two Aspects of Country Name Datasets
1: Development of datasets
Why is there a proliferation of country name sources?
• Cultural issues
• Development practices
2: Usage
How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging
technologies such as the Spatial Identifier Reference Framework (SIRF)
assist in reducing the ambiguity associated with multiple,
heterogeneous country name sources?
• Can we do better? What do we need to do it?
10. Cultural Issues
• Toponyms provide communities with identity (Toponymic Identity is both
reflected and reinforced)
• Country names are the highest-order toponyms
• Problems are similar at lower levels, compounded by scale (size of problem)
and higher rates of change (e.g. electoral boundaries, urban growth)
11. Endonym/Exonym
Beyond an individual’s attachment to the Endonym of their country, there are often multiple Exonyms for it in use in other languages.
• e.g. Deutschland (endonym) = Germany (English exonym) or Allemagne (French exonym)
12. Other Cultural Country Naming Considerations
Formal/Informal naming applications
(particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)
Political/Non-Political Usage
e.g. ‘Commonwealth of Australia’
Change over time
e.g. Czechoslovakia
Non-standardised international conventions
e.g. Saint or St? The or none?
13. The Impact
All of these cultural mores impact on the ability of people and organisations to
record country name information in a standardised, transparent manner.
Thus, there exists a proliferation of country name lists which are officially
promoted by international agencies.
This impact is then intensified in usage.
14. Options
Suggested improvements to the indices and standards include:
1. Improve access to source data
   a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
   b. Make the UN’s economic status list available as a csv file online.
2. Lobby to improve content
   a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
   b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
3. Policy
   a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.
4. Better citation mechanisms
   – Standardised metadata and identifiers that “resolve”, i.e. link back to data
   – Shared infrastructure to link all the information together
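Option 1a asks for assignment and withdrawal dates because a name can only be matched correctly for an earlier year if the period of validity of its code is known. A minimal sketch of such a lookup follows. The CodeAssignment record and code_for function are hypothetical, and the year ranges are illustrative (they reflect when the states existed, not official ISO or UN assignment and withdrawal dates).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeAssignment:
    """One code with its period of validity (hypothetical record layout)."""
    code: str
    name: str
    first_year: int
    last_year: Optional[int]  # None means the code is still current


# Illustrative rows only; the years are not official assignment dates.
ASSIGNMENTS = [
    CodeAssignment("CSK", "Czechoslovakia", 1918, 1992),
    CodeAssignment("CZE", "Czech Republic", 1993, None),
    CodeAssignment("SVK", "Slovakia", 1993, None),
]


def code_for(name: str, year: int, assignments=ASSIGNMENTS) -> Optional[str]:
    """Return the code valid for this name in this year, or None if there is none."""
    wanted = name.strip().casefold()
    for entry in assignments:
        if entry.name.casefold() != wanted:
            continue
        still_valid = entry.last_year is None or year <= entry.last_year
        if entry.first_year <= year and still_valid:
            return entry.code
    return None
```

With a table like this, a row labelled “Czechoslovakia” in a 1990 dataset resolves to a code, while the same label in a 2000 dataset is reported as unmatched instead of being silently mapped to a successor state.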
15. Spatial Identifier Reference Framework
CSIRO has been working with stakeholders including UN, National
agencies and others on a set of standards and infrastructure
services to support discovering and linking multiple sources of
spatial references.
This is being presented in more detail in:
6D.3 Spatial Identifier Reference Framework (SIRF): Realising the
potential of SDI Using Spatial Identifiers to Link Multiple
Information Systems (#633)
Paul Box 1, Robert Atkinson 1, Laura Kostanski 2
S6-D - SDI
Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room:
Conference Level - C1
16. One real world feature: a bus station
[Diagram] The bus station is represented in multiple systems using different names, and classified and represented in different ways:
• BIG, National Gazetteer of Indonesia (Gazetir Indonesia): Identifier “Merak, Stasiun Bis”, Feature Type Transport, Footprint Point
• Department of Transport, Bus Terminals (Terminus Dataset): Identifier “Merak”, Feature Type Terminal, Footprint Polygon
Currently systems are disconnected and difficult to integrate.
The Spatial Identifier Reference Framework links gazetteers (based on the same feature in different gazetteers) used in web applications and other online resources. In the diagram the two gazetteer entries are linked as “Same as”, and the entries are “Used in” linked resources: a Navigation application, an Online Public Transport Map and a Passenger Travel Stats Application.
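The linking idea on this slide can be sketched as a small graph of “same as” links between gazetteer entries. This is not SIRF’s data model or API. The example.org URIs are placeholders, and the assignment of linked resources to each entry is an assumption read off the diagram.

```python
from collections import defaultdict

# Placeholder identifiers for the two gazetteer entries (hypothetical URIs).
BIG_ENTRY = "http://example.org/gazetteer-indonesia/merak-stasiun-bis"
DOT_ENTRY = "http://example.org/terminus-dataset/merak"

# "Same as" links between entries that describe the same real world feature.
SAME_AS = [(BIG_ENTRY, DOT_ENTRY)]

# Resources that use each entry (assignment assumed for illustration).
USED_IN = {
    BIG_ENTRY: ["Navigation application", "Online Public Transport Map"],
    DOT_ENTRY: ["Passenger Travel Stats Application"],
}


def equivalents(identifier, links):
    """Return every identifier reachable from this one through same-as links."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, stack = {identifier}, [identifier]
    while stack:
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def linked_resources(identifier, links, usage):
    """List resources that use any identifier equivalent to the given one."""
    resources = []
    for ident in sorted(equivalents(identifier, links)):
        resources.extend(usage.get(ident, []))
    return resources
```

Starting from either entry, linked_resources returns the resources attached to both, without either source having to adopt the other’s name or classification.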
17. Identifiers
This is the “tricky part”
Let’s start with the practical implication…
Catchment | ExtractionRate | Storage
1123343 | 730 | 300

Catchment | Boundary Area | Geometry
1123343 | 33535.4 | 151.3344,35.330…….
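One reading of the two tables on this slide is that a shared identifier (the Catchment value) is what allows records held in separate systems to be brought together. A minimal sketch of that join follows. The helper join_on is hypothetical, the column names follow the slide as far as they can be read, and the geometry column is left out because its value is truncated on the slide.

```python
# Rows transcribed from the slide; the "Boundary Area" column name is a reading
# of the flattened table and may not match the original layout.
EXTRACTION = [{"Catchment": "1123343", "ExtractionRate": 730, "Storage": 300}]
BOUNDARY = [{"Catchment": "1123343", "Boundary Area": 33535.4}]


def join_on(key, left, right):
    """Merge rows from two tables that carry the same identifier value."""
    index = {row[key]: row for row in right}
    return [
        {**left_row, **index[left_row[key]]}
        for left_row in left
        if left_row[key] in index
    ]
```

The join only works while both systems use exactly the same identifier for the catchment, which is the dependency the rest of the presentation is concerned with.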
19. SDI resource access
[Diagram: the bus station example from slide 16 (one real world feature held as “Merak, Stasiun Bis” in the BIG National Gazetteer of Indonesia and as “Merak” in the Department of Transport Bus Terminals dataset, linked through the Spatial Identifier Reference Framework), annotated with the labels Provenance, URI, Describe, Discover and Link.]
20. Thank you
For more information
Rob.atkinson@csiro.au
Editor's notes
Forms of country names range from those in use by the countries themselves (endonyms) to externally used alternatives (exonyms), to various common abbreviations (e.g. USA) and codes (such as those in ISO 3166). Indexes are produced by a diversity of communities including United Nations agencies, Non-Government Organisations (NGOs, such as humanitarian relief or environmental assessment groups) and commercial enterprises (postal agencies, distribution companies).
Each of these issues is experienced to differing degrees, with particular regions more affected than others. Utopia Way Inc. investigated 5577 csv files in the data.un.org dataset (UN Statistics Division’s Internet-accessible repository for data) to explore country name alignments and mismatches published by UN agencies in their datasets. In all, 21,195,188 rows of data were analysed. Data that was excluded from the investigation included:
-footnotes at the end of each dataset;
-rows beyond the first 50000 in 159 files (the UN interface limits downloads to 50000 rows of data, so these files are incomplete); and,
-25 files published in multi-sheet Excel format.
Indices and headers from all the datasets were collated into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections from the data.un.org geographical indices into both ISO 3166 and the United Nations Statistics Division’s list of region, country and economic group names, Country and Region Codes for Statistical Use.
Geographical references in the headers are:
-country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area.
-OID.
-WMO station number, station name, national station id number.
-City.
-Area, residence area, city type.
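The header search described above can be sketched with the standard csv module. This is an illustration under stated assumptions and not the tooling used in the investigation: the function names are hypothetical, only the country-level header names from the first list item are checked, and the sample rows are made-up placeholders.

```python
import csv
import io

# Country-level header names listed in these notes, lower-cased for comparison.
GEO_HEADERS = {
    "country of birth",
    "country of citizenship",
    "country or area",
    "country or territory",
    "country or territory of asylum or residence",
    "country or territory of origin",
    "reference area",
}


def geographic_columns(csv_text):
    """Return the header names in a csv file that hold country references."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in header if name.strip().casefold() in GEO_HEADERS]


def distinct_names(csv_text):
    """Collect the distinct values found in the geographic columns."""
    columns = geographic_columns(csv_text)
    names = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            if row.get(column):
                names.add(row[column])
    return names


# Placeholder file contents; the years and values are not real data.
SAMPLE = "Country or Area,Year,Value\nYemen,2010,1\nYEMEN,2011,2\n"
```

Running distinct_names over every file and pooling the results gives the kind of indices list that the corrections were built from.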
Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). The placenames in the indices are a mix of country, region and economic group names, with different spellings and formats for similar names. For example, in one instance the following spellings can be located for one country: “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of”.
Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org.
ISO3166 is a widely-used standard, but contains codes for countries and their subregions only (e.g. it has no official lists of larger regions or economic areas). It is published as tables online and is available (although without the list of withdrawn codes) in the Python library pycountry.
The UNstats list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an html table (http://unstats.un.org/unsd/methods/m49/m49regin.htm). This table was scraped (the data copied from its html page) by hand for this research, but this process could be automated using e.g. ScraperWiki. There are two main lists in the UNstats table: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities, e.g. Netherlands Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states.
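As an alternative to scraping by hand, the table extraction could be automated with Python’s standard html.parser module. The sketch below is an assumption-laden illustration: the class TableRows and function extract_rows are hypothetical, and SAMPLE_HTML is invented markup that does not reproduce the layout of the real UNstats page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented markup for illustration; the real page is structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>Numerical code</th><th>Country or area name</th></tr>
  <tr><td>004</td><td>Afghanistan</td></tr>
  <tr><td>887</td><td>Yemen</td></tr>
</table>
"""


def extract_rows(html):
    """Parse an html string and return its table rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The rows returned by extract_rows can then be written out with the csv module, which would produce the kind of csv file that the options slide asks the UN to publish directly.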
Work by other groups (e.g. the World Wide Human Geography Data working group) has also translated data.un.org files into the FIPS 10-4 standard. This standard is common in US Government work; it includes codes for country names and administrative districts in each country, but does not include regions (e.g. Africa). It is similar, but not identical, to ISO3166.
The indices were checked against both these standards. Against the ISO3166 standard, the common data.un.org csv index errors were those tabulated on slide 6 (Common “Errors”).
Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether St. Helena refers to just the island of Saint Helena, or “Saint Helena, Ascension and Tristan da Cunha” and whether Palestine and Palestinian Territories refer to “Palestinian Territory, Occupied”. Other issues include whether Micronesia refers to the region (Micronesia) or country (Micronesia, Federated States of), and whether there should be separate codes for changing states, e.g. Ethiopia before and after 1993.
The current correction files for UNSTATS and ISO3166 standards, along with a CSV file containing the UNSTATS standard codes from http://unstats.un.org/unsd/methods/m49/m49regin.htm, can be found at xxxxxx
From an international standards perspective there are multiple competing GRDs of country names published by various agencies including the UN and ISO. There are diverse reasons for the existence of the varieties, including different end-user requirements which predicate whether official endonyms are required for mapping purposes or country codes used for statistical purposes. A brief summary of the key GRDs is provided in table one to contextualise the current international GRD situation.
As indicated, there are multiple official GRDs of country names published at the international level by the UN and other organisations. The existence of multiple datasets related to country name standardisation is analogous to the mismatched country name data held within UN datasets.
Analysis of the key UN databases held in data.un.org has identified key matching, linking and interoperability issues currently experienced in the domain of GRDs which contain country names. These can be summarised as:
Non-standardised use of country endonyms/exonyms by UN agencies
Mismatches between data instances in authoritative country name GRDs
Temporality of country name GRDs
This paper explores two aspects of the propagation of country name SISets: development (cultural/qualitative) and usage (data management/quantitative). From a development perspective the fundamental question is asked of why, when country names can be considered one of the highest-order administrative categories for geospatial organization, there is a proliferation of ‘official’ country name SIRDs. Within the domain of usage the authors query how, in a digital age of ‘big data’ analytics and Spatial Data Infrastructures (SDIs), newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) can assist in reducing the ambiguity associated with multiple, heterogeneous country name SIRDs.
Rob: essentially, Toponymic Attachment means that people have strong affinities with place names, for cultural, social, branding and wayfinding purposes. Because of this, people are very hesitant to stop using a name. In fact, it is nearly impossible to get them to stop using a name. Also, people will create new ‘nicknames’ for things so that they can create a ‘clique’ or communicate in a community with their own ‘special terms’. It’s all about creating and reinforcing identity.
Thus, in a world of multiple names in multiple databases, there is a massive headache for data junkies. Data users want straightforward stuff, and usually the people who create standards want people to be using the same names in the same way all the time. But, that’s not how the real world works.
So, data junkies can try and tell people to use standardised names, and can create ISO lists etc etc etc. But because of human nature, the standardised lists will always have gaps and mistakes and won’t truly reflect usage on the ground. This accounts for some of the reason why there are multiple country name lists.
And that is why SIRF is awesome: instead of force-feeding people the standardised-name-line, it allows for a holistic view of naming which accounts for multiple representations, permutations, interpretations etc. It is, as I like to say, ‘ideologically promiscuous’.
Until now the preference of many agencies has been to homogenize geospatial information for ‘ease of use’ purposes- either through aggregating and de-duplicating existing SIDs or by disregarding competing information. SIRF is a system being developed by CSIRO using Linked Data mechanisms to support interoperability between heterogeneous geospatial information datasets and systems. SIRF harmonises disparate SIRDs through cross-walking and data linking methods, the benefits of which are outlined in detail by the authors. The framework system brings to the geospatial data management world, for the first time, the capability to streamline information integration processes whilst acknowledging the reality of multiple, competing SIRDs.