Variability of country names and
identifiers in datasets –
Reconciling practical and cultural
perspectives
International Cartographic Conference, Dresden
Laura Kostanski | Sara-Jane Farmer | Rob Atkinson
August 2013
GOVERNMENT AND COMMERCIAL SERVICES THEME
Today’s Presentation
• Overview
• Cultural Reasons for Multiple Country Names
• Impact of Cultural Reasons
• Multiple Country Name Datasets
• Reconciling Information
• Spatial Identifier Reference Framework (SIRF) Approach
Overview
• There are multiple country name datasets in use
• e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO
• Multiple stakeholders in creation and use of data using these names
• e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups
• Time spent accessing and reconciling data is costly and delays production of results from analysis
• The same issues apply to most, perhaps all, identifiers of spatial objects
• Preview of how we might tackle this problem
Context

CSIRO. UNSDI Gazetteer for Social Protection in Indonesia
Data Analysis
Utopia Way Inc. investigated files in the data.un.org dataset. …
Country names were discovered in multiple fields, such as:
• country of birth,
• country of citizenship,
• country or area,
• country or territory,
• country or territory of asylum or residence,
• country or territory of origin,
• reference area.
The investigation identified significant issues with country name alignments and mismatches.
An automated matching process was set up to explore the extent of the issue.
In all, 21,195,188 rows of data were analysed.
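The slides do not say how the automated matching was implemented. As a rough sketch of the idea only, the following assumes a reference list of accepted names (a tiny hypothetical subset here) and counts which values in a country field match it exactly and which do not. The names `REFERENCE_NAMES` and `match_report` are illustrative, not taken from the study.

```python
from collections import Counter

# Hypothetical reference list; a real run would load the full chosen standard.
REFERENCE_NAMES = {"Yemen", "Germany", "British Virgin Islands", "Timor-Leste"}


def match_report(values, reference=REFERENCE_NAMES):
    """Count how often each value matches, or fails to match, the reference list."""
    matched, unmatched = Counter(), Counter()
    for value in values:
        if value in reference:
            matched[value] += 1
        else:
            unmatched[value] += 1
    return matched, unmatched


# Values as they might appear in a "country or area" column.
rows = ["Yemen", "YEMEN", "Germany", "Virgin Islands (British)", "Yemen", "East Timor"]
matched, unmatched = match_report(rows)
```

Sorting the `unmatched` counter by frequency gives a quick view of which mismatches affect the most rows.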
Common “Errors”

Index error | Examples
Withdrawn countries with no ISO3166 code | “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
Abbreviation | “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
Added markers | “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
Capitalisation | “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
Brackets “()” or “[]” instead of commas | “Virgin Islands (British)” for “British Virgin Islands”.
Standards confusion | The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
Use of familiar names | Brunei, Ivory Coast, China, Libya
Issues with character translation | Cote d'Ivoire, Åland Islands, Curaçao, Réunion
Misspellings | Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
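Several of the error classes above are mechanical and could, in principle, be reduced before matching. The sketch below is one possible normalisation step, not the process used in the study. The abbreviation table is built from the examples listed above, and the function name `normalise` is an assumption. Familiar names and genuine misspellings (such as “South Asia” vs “Southern Asia”) cannot be fixed by rules like these and would need a curated alias list.

```python
import re
import unicodedata

# Abbreviations taken from the examples above; the table is illustrative, not complete.
ABBREVIATIONS = {
    "Rep.": "Republic",
    "St.": "Saint",
    "Is.": "Island",
    "Isds": "Islands",
    "&": "and",
}


def normalise(name):
    """Return a comparison key for a country or region name."""
    name = name.strip()                      # leading and trailing spaces
    name = re.sub(r"^MDG_", "", name)        # drop the "MDG_" region marker
    name = name.rstrip("+")                  # drop a trailing "+" region marker
    # Brackets instead of commas: "Virgin Islands (British)" -> "Virgin Islands, British"
    name = re.sub(r"\s*[\(\[]([^\)\]]*)[\)\]]", r", \1", name)
    # split() with no argument also collapses double spaces
    words = [ABBREVIATIONS.get(word, word) for word in name.split()]
    name = " ".join(words)
    # Strip diacritics so that "Réunion" and "Reunion" compare equal
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    return name.casefold()                   # capitalisation differences
```

The key is meant only for comparison; the original spelling, including diacritics, should be kept for display.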
Long names, short names
Data sets providing country names
Organisation | Name of Data Set
United Nations Statistics Division | Country and Region Codes for Statistical Use
Working Group on Country Names, United Nations Group of Experts on Geographic Names | List of Country Names
Terminology Section, Department for General Assembly and Conference Management | Multilingual Terminology Database (UNTERM)
International Organization for Standardization (ISO) | ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
Food and Agriculture Organisation of the United Nations | Global Administrative Unit Layers (GAUL)
United Nations Geospatial Information Working Group (UNGIWG) | Second Administrative Level Boundaries (SALB)
National Geospatial-Intelligence Agency | Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
NATO | Standards Agreement (STANAG) 1059
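One practical consequence of having several code lists is that a bare code is ambiguous unless the scheme is recorded with it. As a small illustration (not part of the original slides), “AS” is American Samoa in ISO 3166-1 alpha-2 but Australia in FIPS 10-4. The sketch below keys every code by its scheme; the scheme labels and function names are assumptions.

```python
# (scheme, code) -> name. Only a handful of entries, for illustration.
CODES = {
    ("ISO 3166-1 alpha-2", "AU"): "Australia",
    ("ISO 3166-1 alpha-2", "AS"): "American Samoa",
    ("FIPS 10-4", "AS"): "Australia",
    ("ISO 3166-1 alpha-2", "DE"): "Germany",
    ("FIPS 10-4", "GM"): "Germany",
}


def lookup(scheme, code):
    """Resolve a code only in the context of its scheme."""
    return CODES.get((scheme, code))


def meanings(code):
    """List every (scheme, name) pair a bare code could refer to."""
    return sorted((scheme, name) for (scheme, c), name in CODES.items() if c == code)
```

Calling `meanings("AS")` returns two different places, which is why a dataset needs to state which standard its codes come from.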
Two Aspects of Country Name Datasets
1: Development of datasets
Why is there a proliferation of country name sources?
• Cultural issues
• Development practices

2: Usage
How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging
technologies such as the Spatial Identifier Reference Framework (SIRF)
assist in reducing the ambiguity associated with multiple,
heterogeneous country name sources?

• Can we do better? What do we need to do it?
Cultural Issues
• Toponyms provide communities with identity (Toponymic Identity is both
reflected and reinforced)
• Country names are the highest-order toponyms
• Problems are similar at lower levels, compounded by scale (size of problem)
and higher rates of change (e.g. electoral boundaries, urban growth)
Endonym/Exonym
Above and beyond associations with an individual’s attachment to the Endonym
of their country, there are often multiple Exonyms used by other languages.
• e.g. Deutschland (endonym) = Germany or Allemagne (exonyms)
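One way to handle endonyms and exonyms in data is to store every label against a single identifier and look names up through a reverse index. The sketch below assumes ISO 3166-1 alpha-3 codes as the identifier and ISO 639-1 language tags as keys; the structure and function names are illustrative only.

```python
# Labels per language for one country, keyed by an identifier.
LABELS = {
    "DEU": {"de": "Deutschland", "en": "Germany", "fr": "Allemagne"},
}
# Which language's label is treated as the endonym (an assumption for this sketch).
ENDONYM_LANGUAGE = {"DEU": "de"}


def build_index(labels):
    """Map every known name, in any language, back to its identifier."""
    index = {}
    for code, by_language in labels.items():
        for name in by_language.values():
            index[name.casefold()] = code
    return index


def exonyms(code):
    """Return the labels used by other languages for this country."""
    endonym_language = ENDONYM_LANGUAGE[code]
    return {lang: name for lang, name in LABELS[code].items() if lang != endonym_language}


INDEX = build_index(LABELS)
```

With this index, “Allemagne”, “Germany” and “Deutschland” all resolve to the same identifier.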
Other Cultural Country Naming Considerations
Formal/Informal naming applications
(particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)

Political/Non-Political Usage
e.g. ‘Commonwealth of Australia’

Change over time
e.g. Czechoslovakia

Non-standardised international conventions
e.g. “Saint” or “St”? “The” or no article?
The Impact
All of these cultural mores impact on the ability of people and organisations to
record country name information in a standardised, transparent manner.
Thus, there exists a proliferation of country name lists which are officially
promoted by international agencies.
This impact is then intensified in usage.
Options
Suggested improvements to the indices and standards include:
1. Improve access to source data
a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
b. Make the UN’s economic status list available as a csv file online.

2. Lobby to improve content
a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).

3. Policy
a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.

4. Better citation mechanisms
• Standardised metadata and identifiers that “resolve”, i.e. link back to data
• Shared infrastructure to link all the information together
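Option 1a asks for assignment and withdrawal dates because a code alone does not identify a country across time. A minimal sketch of a date-aware lookup follows. The alpha-2 code “CS” was used for Czechoslovakia and later reused for Serbia and Montenegro; the years below are approximate illustrations and should be checked against ISO 3166-3 before use. The class and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeAssignment:
    code: str
    name: str
    assigned: int              # first year the code applies (illustrative)
    withdrawn: Optional[int]   # year of withdrawal, or None while in use


ASSIGNMENTS = [
    CodeAssignment("CS", "Czechoslovakia", 1974, 1993),
    CodeAssignment("CS", "Serbia and Montenegro", 2003, 2006),
    CodeAssignment("DE", "Germany", 1974, None),
]


def name_for(code, year):
    """Return the name a code referred to in a given year, if any."""
    for entry in ASSIGNMENTS:
        still_valid = entry.withdrawn is None or year < entry.withdrawn
        if entry.code == code and entry.assigned <= year and still_valid:
            return entry.name
    return None
```

A row dated 1990 with code “CS” then resolves to Czechoslovakia, while the same code on a 2004 row resolves to Serbia and Montenegro.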
Spatial Identifier Reference Framework
CSIRO has been working with stakeholders, including the UN, national
agencies and others, on a set of standards and infrastructure
services to support discovering and linking multiple sources of
spatial references.
This is being presented in more detail in:
6D.3 Spatial Identifier Reference Framework (SIRF): Realising the
potential of SDI Using Spatial Identifiers to Link Multiple
Information Systems (#633)
Paul Box 1, Robert Atkinson 1, Laura Kostanski 2
S6-D - SDI
Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room:
Conference Level - C1
One real world feature: a bus station
Represented in multiple systems using different names, and classified and represented in different ways:
System | Identifier | Feature Type | Footprint
BIG, National Gazetteer of Indonesia | Merak, Stasiun Bis | Transport | Point
Department of Transport, Bus Terminals | Merak | Terminal | Polygon
Currently systems are disconnected and difficult to integrate.

Spatial Identifier Reference Framework: links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
• Merak, Stasiun Bis (gazetteer entry) in Gazetir Indonesia (gazetteer)
• Same as: Merak (gazetteer entry) in Terminus Dataset (gazetteer)
• Linked resources that use these entries: Navigation application, Online Public Transport Map, Passenger Travel Stats Application
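The linking idea in the bus station example can be sketched with a few lines of code. This is not SIRF's data model or API; the entry identifiers below are hypothetical, and the assignment of each application to a gazetteer entry is one reading of the diagram. The point is only that a “same as” link lets a query on one gazetteer entry reach resources that cite the other.

```python
from collections import defaultdict

# Hypothetical identifiers for the two gazetteer entries describing the same bus station.
SAME_AS = [("gazetir-indonesia/merak-stasiun-bis", "terminus-dataset/merak")]

# Which online resources cite which entry (assumed for illustration).
USED_IN = {
    "gazetir-indonesia/merak-stasiun-bis": ["Navigation application"],
    "terminus-dataset/merak": ["Passenger Travel Stats Application"],
}


def equivalents(entry_id, same_as=SAME_AS):
    """Follow "same as" links in both directions and return all equivalent entries."""
    neighbours = defaultdict(set)
    for a, b in same_as:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {entry_id}, [entry_id]
    while stack:
        current = stack.pop()
        for other in neighbours[current] - seen:
            seen.add(other)
            stack.append(other)
    return seen


def linked_resources(entry_id):
    """Collect resources that use this entry or any entry equivalent to it."""
    resources = set()
    for equivalent in equivalents(entry_id):
        resources.update(USED_IN.get(equivalent, []))
    return sorted(resources)
```

Starting from either entry gives the same set of linked resources, even though each application only knows about one gazetteer.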
Identifiers
This is the “tricky part”
Let's start with the practical implication…
Catchment | ExtractionRate | Storage
1123343 | 730 | 300

Catchment Boundary | Area | Geometry
1123343 | 33535.4 | 151.3344,35.330…….
“Distributed” references
Catchment | ExtractionRate | Storage
1123343 | 730 | 300

How to ask for this entity
Internet
How to deliver this entity

Catchment Boundary | Area | Geometry
1123343 | 33535.4 | 151.3344,-35.330…….
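The two tables share the catchment identifier 1123343 but live in different systems, so the identifier has to work as a reference across the internet. A minimal sketch follows, assuming the identifier is published as a URI under a hypothetical base address and that each system can be queried by that identifier. The geometry is left out because the slide elides it.

```python
# Hypothetical base address; the slides do not give a URI pattern.
BASE_URI = "https://example.org/id/catchment/"

# Two separate systems, each keyed by the shared catchment identifier.
EXTRACTION_SYSTEM = {"1123343": {"ExtractionRate": 730, "Storage": 300}}
BOUNDARY_SYSTEM = {"1123343": {"Area": 33535.4}}


def uri_for(catchment_id):
    """How to ask for this entity: turn a local identifier into a URI."""
    return BASE_URI + catchment_id


def id_from_uri(uri):
    """Recover the local identifier from a URI minted by uri_for."""
    if not uri.startswith(BASE_URI):
        raise ValueError(f"not a catchment URI: {uri}")
    return uri[len(BASE_URI):]


def describe(uri):
    """How to deliver this entity: merge what each system knows about it."""
    catchment_id = id_from_uri(uri)
    record = {"Catchment": catchment_id}
    record.update(EXTRACTION_SYSTEM.get(catchment_id, {}))
    record.update(BOUNDARY_SYSTEM.get(catchment_id, {}))
    return record
```

In a real deployment each dictionary lookup would be a request to a separate service; the dictionaries stand in for those services here.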
SDI resource access
The bus station diagram again (one real world feature, recorded as “Merak, Stasiun Bis” in the BIG National Gazetteer of Indonesia and as “Merak” in the Department of Transport Bus Terminals dataset), annotated with: Provenance, URI, Describe, Discover, Link.
Spatial Identifier Reference Framework: links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
Thank you
For more information
Rob.atkinson@csiro.au

GOVERNMENT AND COMMERCIAL SERVICES THEME

Más contenido relacionado

Similar a Icc2013 country names

Open Data Islands and Communities
Open Data Islands and CommunitiesOpen Data Islands and Communities
Open Data Islands and CommunitiesAlan Dix
 
Spatial Information Systems yesterday, today and tomorrow
Spatial Information Systems yesterday, today and tomorrowSpatial Information Systems yesterday, today and tomorrow
Spatial Information Systems yesterday, today and tomorrowBeniamino Murgante
 
2012 03-28 ungiwg12 unsdi
2012 03-28 ungiwg12 unsdi2012 03-28 ungiwg12 unsdi
2012 03-28 ungiwg12 unsdisirf13
 
Space For Human Services Planning
Space For Human Services PlanningSpace For Human Services Planning
Space For Human Services PlanningBrian Cooper
 
rworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Datarworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global DataDr. Volkan OBAN
 
Bigdataforesight
BigdataforesightBigdataforesight
Bigdataforesightsuresh sood
 
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...Stephane Fellah
 
Spatial data infrastructure in Kyrgyzstan
Spatial data infrastructure in KyrgyzstanSpatial data infrastructure in Kyrgyzstan
Spatial data infrastructure in KyrgyzstanUnison Group
 
GIS Lecture Note.ppt
GIS Lecture Note.pptGIS Lecture Note.ppt
GIS Lecture Note.pptwarkisafile1
 
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...nabo_ghea
 
Hawke's Bay Open Data Conference - 2 May 2019
Hawke's Bay Open Data Conference - 2 May 2019Hawke's Bay Open Data Conference - 2 May 2019
Hawke's Bay Open Data Conference - 2 May 2019enotsluap
 

Similar a Icc2013 country names (20)

Open Data Islands and Communities
Open Data Islands and CommunitiesOpen Data Islands and Communities
Open Data Islands and Communities
 
Geo Open Data
Geo Open DataGeo Open Data
Geo Open Data
 
Spatial Information Systems yesterday, today and tomorrow
Spatial Information Systems yesterday, today and tomorrowSpatial Information Systems yesterday, today and tomorrow
Spatial Information Systems yesterday, today and tomorrow
 
Big Data and Me
Big Data and MeBig Data and Me
Big Data and Me
 
2009 09 19 Learning Unit Sdi
2009 09 19 Learning Unit Sdi2009 09 19 Learning Unit Sdi
2009 09 19 Learning Unit Sdi
 
2012 03-28 ungiwg12 unsdi
2012 03-28 ungiwg12 unsdi2012 03-28 ungiwg12 unsdi
2012 03-28 ungiwg12 unsdi
 
Space For Human Services Planning
Space For Human Services PlanningSpace For Human Services Planning
Space For Human Services Planning
 
rworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Datarworldmap: A New R package for Mapping Global Data
rworldmap: A New R package for Mapping Global Data
 
Bigdataforesight
BigdataforesightBigdataforesight
Bigdataforesight
 
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...
Constructing Semantic Gazetteers: Managing GeoSpatial Vocabularies Using Open...
 
RDA Presentation to G8
RDA Presentation to G8RDA Presentation to G8
RDA Presentation to G8
 
Spatial data infrastructure in Kyrgyzstan
Spatial data infrastructure in KyrgyzstanSpatial data infrastructure in Kyrgyzstan
Spatial data infrastructure in Kyrgyzstan
 
Data, Indicators and Maps on Homelessness
Data, Indicators and Maps on HomelessnessData, Indicators and Maps on Homelessness
Data, Indicators and Maps on Homelessness
 
GIS Lecture Note.ppt
GIS Lecture Note.pptGIS Lecture Note.ppt
GIS Lecture Note.ppt
 
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...
Managing Social Science Data from the Arctic with ELOKA, ACADIS, NSIDC, and (...
 
2009 unicef open everything nyc
2009 unicef open everything nyc2009 unicef open everything nyc
2009 unicef open everything nyc
 
Ongoing Research in Data Studies
Ongoing Research in Data StudiesOngoing Research in Data Studies
Ongoing Research in Data Studies
 
Homelessness Data Discussion
Homelessness Data DiscussionHomelessness Data Discussion
Homelessness Data Discussion
 
Geo dataintro
Geo dataintroGeo dataintro
Geo dataintro
 
Hawke's Bay Open Data Conference - 2 May 2019
Hawke's Bay Open Data Conference - 2 May 2019Hawke's Bay Open Data Conference - 2 May 2019
Hawke's Bay Open Data Conference - 2 May 2019
 

Último

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Icc2013 country names

  • 1. Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives International Cartographic Conference, Dresden Laura Kostanski| Sara-Jane Farmer | Rob Atkinson August 2013 GOVERNMENT AND COMMERCIAL SERVICES THEME
  • 2. Today’s Presentation • Overview • Cultural Reasons for Multiple Country Names • Impact of Cultural Reasons • Multiple Country Name Datasets • Reconciling Information • Spatial Identifier Reference Framework (SIRF) Approach
  • 3. Overview • There are multiple country name datasets in use • e.g. ISO 3166, UNSTATS , Alexandria Digital Library, CIA Fact Book, UN-FAO • Multiple stakeholders in creation and use of data using these names • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups. • Time spent accessing and reconciling data is costly and delays production of results from analysis • The same issues apply to most, perhaps all, identifiers of spatial objects • Preview of how we might tackle this problem
  • 4. Context CSIRO. UNSDI Gazetteer for Social Protection in Indonesia
  • 5. Data Analysis Utopia Way Inc. investigated files in the data.un.org dataset. … Country names were discovered in multiple fields, such as: •country of birth, •country of citizenship, •country or area, •country or territory, •country or territory of asylum or residence, •country or territory of origin, •reference area. and identified significant issues with country name alignments and mismatches. An automated matching process was set up to explore the extent of the issue. In all, 21,195,188 rows of data were analysed.
  • 6. Common “Errors” Index error Withdrawn countries with no ISO3166 code Abbreviation Added markers Capitalisation Brackets “()” or “[]” instead of commas Standards confusion Examples “East Timor", "Czechoslovakia, Czechoslovak Socialist Republic”, "USSR, Union of Soviet Socialist Republics", “Yemen, Yemen Arab Republic", “Yemen, Democratic, People's Democratic Republic of", “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands", “Wake Island", “Serbia and Montenegro". “Rep.” for “Republic”, “St.” for “Saint”, “Is.” For “Island”, “Isds” for “Islands”, “&” for “and”. “+” added to the end of region names, to differentiate them from countrynames. “MDG_” added to region names, e.g. “MDG_Southern Asia”. “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”. “Virgin Islands (British)” for “British Virgin Islands”. The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not). Use of familiar names issues with character translation Brunei, Ivory Coast, China, Libya Cote d'Ivoire, Åland Islands, Curaçao, Réunion Misspellings Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
  • 8. Data sets providing country names Organisation Name of Data Set United Nations Statistics Division Country and Region Codes for Statistical Use Working Group on Country Names, United Nations Group of Experts on Geographic Names Terminology Section, Department for General Assembly and Conference Management International Standards Organisation (ISO) Food and Agriculture Organisation of the United Nations United Nations Geospatial Information Working Group (UNGIWG) National Geospatial Intelligence Agency List of Country Names NATO Standards Agreement (STANAG) 1059 Multilingual Terminology Database (UNTERM) ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3) Global Administrative Unit Layers (GAUL) Second Administrative Level Boundaries (SALB) Federal Information Processing Standard (FIPS) 10-4 : Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
  • 9. Two Aspects of Country Name Datasets 1: Development of datasets Why is there a proliferation of country name sources? • Cultural issues • Development practices 2: Usage How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources? • Can we do better? What do we need to do it?
  • 10. Cultural Issues • Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced) • Country names are the highest-order toponyms • Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)
  • 11. Endonym/Exonym Above and beyond associations with an individual’s attachment to the Endonym of their country, there are often multiple Exonyms used by other languages. • e.g. Deutschland= Germany or Allemagne
  • 12. Other Cultural Country Naming Considerations Formal/Informal naming applications (particularly prevalent in the social media world- e.g. ‘Oz’ for Australia) Political/Non-Political Usage e.g. ‘Commonwealth of Australia’ Change over time e.g. Czechoslovakia Non-standardised international conventions e.g. Saint or St? The or none?
  • 13. The Impact All of these cultural mores impact on the ability of people and organisations to record country name information in a standardised, transparent manner. Thus, there exists a proliferation of country name lists which are officially promoted by international agencies. This impact is then intensified in usage,
  • 14. Options Suggested improvements to the indices and standards include: 1. Improve access to source data a. b. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years). Make the UN’s economic status list available as a csv file online. 2. Lobby to improve content a. b. ISO to create a region (Africa, West Africa, North America etc.) code standard. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name). 3. Policy a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to. 4. Better citation mechanisms – – Standardised metadata and identifiers that “resolve” – i.e. links back to data Shared infrastructure to link all the information together
  • 15. Spatial Identifier Reference Framework CSIRO has been working with stakeholders including UN, National agencies and others on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references. This is being presented in more detail in: 6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633) Paul Box 1, Robert Atkinson 1, Laura Kostanski 2 S6-D - SDI Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room: Conference Level - C1
  • 16. One real world feature: a bus station. [Diagram] The feature is represented in multiple systems using different names, and classified and represented in different ways. BIG National Gazetteer of Indonesia: identifier “Merak, Stasiun Bis”, feature type Transport, footprint Point. Department of Transport Bus Terminals: identifier “Merak”, feature type Terminal, footprint Polygon. Currently systems are disconnected and difficult to integrate. Diagram labels: gazetteer entries “Merak, Stasiun Bis” (Gazetir Indonesia gazetteer) and “Merak” (Terminus Dataset gazetteer), joined by a “same as” link; linked resources reached through “used in” links: navigation application, online public transport map, passenger travel stats application. The Spatial Identifier Reference Framework links gazetteers (based on the same feature in different gazetteers) used in web applications and other online resources.
  • 17. Identifiers This is the “tricky part”. Let’s start with the practical implication… [Two tables] Catchment / ExtractionRate / Storage: 1123343 / 730 / 300. Catchment Boundary Area Geometry: 1123343 / 33535.4 / 151.3344,35.330…….
  • 18. “Distributed” references [Diagram labels: “How to ask for this entity”, “Internet”, “How to deliver this entity”] Catchment / ExtractionRate / Storage: 1123343 / 730 / 300. Catchment Boundary Area Geometry: 1123343 / 151.3344,-35.330……. / 33535.4
  • 19. SDI resource access [Diagram: the bus station diagram from slide 16, repeated with the additional labels “Provenance URI”, “Describe”, “Discover” and “Link”.]
  • 20. Thank you. For more information: Rob.atkinson@csiro.au
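The practical implication shown on slides 17 and 18 (two separately published tables that can only be combined through a shared identifier) can be sketched roughly as follows. This is a minimal illustration, not the SIRF implementation: the field names `extraction_rate`, `storage` and `area`, the helper `join_on_identifier`, and the reading of 33535.4 as the area value are all assumptions made for the example, and the geometry is left out because it is truncated on the slide.

```python
# Two independently published tables that share a catchment identifier.
# Values are taken from slides 17 and 18; field names are assumed.
extraction = {"1123343": {"extraction_rate": 730, "storage": 300}}
boundaries = {"1123343": {"area": 33535.4}}


def join_on_identifier(left, right):
    """Merge records from two sources wherever the identifier matches.

    Hypothetical helper: identifiers found in only one source are dropped.
    """
    return {
        key: {**left[key], **right[key]}
        for key in left.keys() & right.keys()
    }


merged = join_on_identifier(extraction, boundaries)
print(merged)
```

The join only works because both tables use exactly the same identifier string; once the two sources name the same feature differently, some form of cross-reference between identifiers is needed first.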

Editor's notes

  1. Forms of country names range from those in use by the countries themselves (endonyms) to externally used alternatives (exonyms), to various common abbreviations (e.g. USA) and codes (such as those in ISO 3166). Indexes are produced by a diversity of communities including United Nations agencies, Non-Government Organisations (NGOs, such as humanitarian relief or environmental assessment groups) and commercial enterprises (postal agencies, distribution companies).
  2. Each of these issues is experienced to differing degrees, with particular regions more affected than others. Utopia Way Inc. investigated 5577 csv files in the data.un.org dataset (the UN Statistics Division’s Internet-accessible repository for data) to explore country name alignments and mismatches published by UN agencies in their datasets. In all, 21,195,188 rows of data were analysed. Data excluded from the investigation were: footnotes at the end of each dataset; rows beyond the 50000-row download limit of the UN interface, which leaves 159 files in the set incomplete; and 25 files published in multi-sheet Excel format. Indices and headers from all the datasets were collated into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections from the data.un.org geographical indices into both ISO 3166 and the United Nations Statistics Division’s list of region, country and economic group names, “Country and Region Codes for Statistical Use”. Geographical references in the headers are: country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area; OID; WMO station number, station name, national station id number; City; Area, residence area, city type.
  3. Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). The placenames in the indices are a mix of country, region and economic group names, with different spellings and formats for similar names. For example, the following spellings can all be found for one country: “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of”. Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org. ISO3166 is a widely-used standard, but contains codes for countries and their subregions only (e.g. it has no official lists of larger regions or economic areas). It is published as tables online and is available (although without the list of withdrawn codes) in the Python library pycountry. The UNstats list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an html table (http://unstats.un.org/unsd/methods/m49/m49regin.htm). This table was scraped (the data copied from its html page) by hand for this research, but this process could be automated using e.g. ScraperWiki. There are two main lists in the UNstats table: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities, e.g. Netherlands Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states. Work by other groups (e.g. the World Wide Human Geography Data working group) has also translated data.un.org files into the FIPS 10-4 standard. This standard is common in US Government work; it includes codes for country names and administrative districts in each country, but does not include regions (e.g. Africa). It is similar, but not identical, to ISO3166.
The indices were checked against both these standards.  Against the ISO3166 standard, common data.un.org csv index errors were:
  4. Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether St. Helena refers to just the island of Saint Helena or to “Saint Helena, Ascension and Tristan da Cunha”, and whether Palestine and Palestinian Territories refer to “Palestinian Territory, Occupied”. Other issues include whether Micronesia refers to the region (Micronesia) or country (Micronesia, Federated States of), and whether there should be separate codes for changing states, e.g. Ethiopia before and after 1993. The current correction files for UNSTATS and ISO3166 standards, along with a CSV file containing the UNSTATS standard codes from http://unstats.un.org/unsd/methods/m49/m49regin.htm can be found at xxxxxx
  5. From an international standards perspective there are multiple competing GRDs of country names published by various agencies including the UN and ISO. There are diverse reasons for the existence of the varieties, including different end-user requirements, which determine whether official endonyms are required for mapping purposes or country codes for statistical purposes. A brief summary of the key GRDs is provided in table one to contextualise the current international GRD situation. As indicated, there are multiple official GRDs of country names published at the international level by the UN and other organisations. The existence of multiple datasets related to country name standardisation is analogous to the mismatched country name data held within UN datasets. Examples: Analysis of the key UN databases held in data.un.org has identified key matching, linking and interoperability issues currently experienced in the domain of GRDs which contain country names. These can be summarised as: non-standardised use of country endonyms/exonyms by UN agencies; mismatches between data instances in authoritative country name GRDs; and temporality of country name GRDs.
  6. This paper explores two aspects of the propagation of country name SISets: development (cultural/qualitative) and usage (data management/quantitative). From a development perspective the fundamental question is why, when country names can be considered one of the highest-order administrative categories for geospatial organization, there is a proliferation of ‘official’ country name SIRDs. Within the domain of usage the authors query how, in a digital age of ‘big data’ analytics and Spatial Data Infrastructures (SDIs), newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) can assist in reducing the ambiguity associated with multiple, heterogeneous country name SIRDs.
  7. Rob: essentially, Toponymic Attachment means that people have strong affinities with place names, for cultural, social, branding and wayfinding purposes. Because of this, people are very hesitant to stop using a name. In fact, it is nearly impossible to get them to stop using a name. Also, people will create new ‘nicknames’ for things so that they can create a ‘clique’ or communicate in a community with their own ‘special terms’. It’s all about creating and reinforcing identity. Thus, in a world of multiple names in multiple databases, there is a massive headache for data junkies. Data users want straightforward stuff, and usually the people who create standards want people to be using the same names in the same way all the time. But that’s not how the real world works. So data junkies can try to tell people to use standardised names, and can create ISO lists etc. But because of human nature, the standardised lists will always have gaps and mistakes and won’t truly reflect usage on the ground. This accounts for some of the reason why there are multiple country name lists. And it is the reason why SIRF is awesome: instead of force-feeding people the standardised-name line, it allows for a holistic view of naming which accounts for multiple representations, permutations, interpretations etc. It is, as I like to say, ‘ideologically promiscuous’.
  8. Until now the preference of many agencies has been to homogenize geospatial information for ‘ease of use’ purposes, either through aggregating and de-duplicating existing SIDs or by disregarding competing information. SIRF is a system being developed by CSIRO using Linked Data mechanisms to support interoperability between heterogeneous geospatial information datasets and systems. SIRF harmonises disparate SIRDs through cross-walking and data linking methods, the benefits of which are outlined in detail by the authors. The framework brings to the geospatial data management world, for the first time, the capability to streamline information integration processes whilst acknowledging the reality of multiple, competing SIRDs.
  9. Data products linked in practice....
  10. On the web you may not know the data product...
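The matching of name variants described in notes 2 and 3 (for example “Yemen”, “YEMEN”, “Yemen,Rep.” and “Yemen, Republic of” all referring to one country) can be sketched along the following lines. This is an illustrative sketch only, not the process Utopia Way Inc. actually ran: the `CORRECTIONS` table and the helpers `normalise` and `to_code` are hypothetical, and the table holds only the variants quoted in the notes, mapped to the ISO 3166-1 alpha-3 code for Yemen.

```python
import re

# Hypothetical corrections table: normalised name variant -> ISO 3166-1 alpha-3.
# Only the variants quoted in note 3 are included.
CORRECTIONS = {
    "yemen": "YEM",
    "yemen, rep.": "YEM",
    "yemen, republic of": "YEM",
}


def normalise(name):
    """Lower-case, trim, and put exactly one space after each comma."""
    name = name.strip().lower()
    name = re.sub(r"\s*,\s*", ", ", name)
    return re.sub(r"\s+", " ", name)


def to_code(name):
    """Return the code for a name variant, or None if it needs manual review."""
    return CORRECTIONS.get(normalise(name))


for variant in ["Yemen", "YEMEN", "Yemen,Rep.", "Yemen, Republic of", "Micronesia"]:
    print(repr(variant), "->", to_code(variant))
```

Returning `None` for an unmatched name, rather than guessing, keeps ambiguous cases such as Micronesia (region or country) visible for manual resolution, in line with the unresolved queries listed in note 4.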