SlideShare a Scribd company logo
1 of 26
Download to read offline
Integrating Cross-Domain
         Metadata
    Terra Populus experience report


   Peter Clark pclark@umn.edu @pclark
Alex Jokela ajokela@umn.edu @alexjokela
Overview of Terra Populus
● National Science Foundation
   ○ Award ID: 0940818
● Core goal:
  Integrate, preserve, and disseminate data describing
  changes in the human population and environment
  over time.

● Focus for this talk is on integrating and
  linking interdisciplinary data
   ○ Challenges for both data and metadata
Motivation: Demographics




Data: Population Reference Bureau
Motivation: Environment




Image credit: United Nations Environmental Programme (via http://climate.nasa.gov/state_of_flux)
Flavors of Data and Metadata
● Four types of data in Terra Populus
  ○ Individual demographic
     ■ Brazil 1970-2000 (4 points in time)
     ■ Malawi 1987-2009 (3 points in time)
  ○ Area-based demographic and economic
  ○ Environment and land use
     ■ Global Landscapes Initiative
     ■ Global Land Cover GLC2000
  ○ Climate
     ■ WorldClim: 1950-2000
● Three different formats:
  ○ Microdata (survey)
  ○ Area-level (tables)
  ○ Raster (images/bitmaps)
Microdata
● Individual-level survey response data
  ○ Usually grouped into households
  ○ Usually has a couple hundred variables
  ○ Usually doesn't give good geographic locality
      due to confidentiality
● All censuses tend to ask about similar topics:
  ○ location, dwelling characteristics, age, sex,
      occupation, marital status, family size, home
      ownership, nativity and immigration...

● Each of these is a Variable
US Census 2000 Microdata
H000012712724     27600        51205120 22                                   34473
P00001270100004501000010600011110000010147031400100005002699901200000   0027010000
P00001270300004503011010170010110000010147050600215007003299901200000   0027010000
P00001270200004519000010540010110000010147051600100008003299901200000   0027010000
H000013412724     27720        51205120 23                                    2862
P00001340100008023000010200010110000010147010200216011008203202200000   0027010000
H000014112724     27500        51205120 22                                   22517
P00001410100027001000010540010110000010147010100100009003299901200000   0055010000
P00001410200022602000220440010110000010147010100100009002202602200000   0027010000
P00001410300031503000010210010110000010147050600206011003202202200000   0027010000
P00001410400029303011010170010110000010147050600205007003202202200000   0027010000
P00001410500024703011010130010110000010147050000204004003202202200000   0027010000
US Census 2000 Microdata
H000012712724     27600        51205120 22                                  34473
P00001270100004501000010600011110000010147031400100005002699901200000 0027010000
P00001270300004503011010170010110000010147050600215007003299901200000 0027010000
P00001270200004519000010540010110000010147051600100008003299901200000 0027010000
Area-level aggregate data
● Demographic and economic characteristics
  of a location
   ○ Location can mean nation, state, county, city,
     metropolitan area
   ○ Might be a political/administrative entity or just a
     statistical boundary.
● Some aggregate characteristics are scalar:
   ○ Total Population, GDP, % literacy, average
     household size
● More often it's a crosstab or table...
   ○ We treat each cell in the table as a variable
Educational Attainment by Sex:
 Minnesota

Subject                                   Estimate   Estimate   Estimate
                                          (Total)    (Male)     (Female)
Population 18 to 24 years                 506,399    258,167    248,232

Less than high school graduate            12.9%      14.2%      11.6%
High school graduate (inc. equivalency)   25.8%      28.3%      23.1%

Some college or associate's degree        50.2%      48.5%      51.9%

Bachelor's degree or higher               11.2%      9.0%       13.4%
Raster data
● Basically, just a picture
   ○ But usually heavily processed
● Geocoded
● Pixel = Measurement
   ○ either direct or interpolated
● Type
   ○ Categorical (such as land use)
   ○ Continuous (temperature, elevation)
Global Land Cover 2000
Metadata Integration - same domain
● Census/Survey
  ○ Different questions
  ○ Different answers
  ○ Different encodings
● Area-level
  ○ Different table structures
  ○ Different geographies
  ○ Different categories
● Raster
  ○ Different alignments
  ○ Different resolutions
  ○ Different processing
Metadata integration - cross-domain
● Semantics are key
  ○ Terms can mean different things to different
    disciplines.
  ○ What does "data" mean?


● How do you link across domains?
  ○ What questions are you trying to answer?
Microdata integration
● Old data usually presents recovery, cleanup
  and parsing issues
● The older the data, the less likely to have a
  machine readable codebook

● If you're lucky, you get the data as an SPSS
  or SAS file
Historical Data challenges: ETL
Historical Microdata - metadata
Microdata Integration in TerraPop
● Data structure can vary on many dimensions
  ○ questions/variables
  ○ categories & topcoding/bottomcoding
  ○ universe statements


● We've defined a unified coding scheme
● We deal with these using translation tables
Translation Tables for microdata
CODE   LABEL                       China 1982   Columbia 1973    Kenya 1989     Mexico 1970             USA 1990

100    Single/Never Married        1=never      4=single         1=single       9=single                6=never

200    Married/In Union

210      Married (Not Specified)   2=married    2=married        2=monogamous                           1=married

211        Civil                                                                3=only civil

212        Religious                                                            4=only religious

213        Civil and Religious                                                  2=civil and religious

214        Polygamous                                            3=polygamous

200      Consensual Union                       1=free union                    5=free union

300    Separated/Divorced                       3=sep/divorced

310      Separated                                               6=separated    8=separated             3=separated

330      Divorced                  4=divorced                    5=divorced     7=divorced              4=divorced

400    Widowed                     3=widowed    5=widowed        4=widowed      6=widowed               5=widowed

999    Missing/Unknown             0=missing    6=unknown        B=Blank        1=unknown
Area-level metadata
Presents all the same problems as microdata,
but two points of additional complexity

● Geography
  ○ boundaries, names, and codes change
● Table Structure
  ○ Changes over time in concert with changes in survey
    structure and binning
Area-level Metadata in TerraPop
We're cheating!

● Generating our own from microdata
  ○ Minimizes structural differences
  ○ But doesn't eliminate semantic differences


● But we lose out on fine-grain geography
Raster metadata
● Best opportunity for things to be easy
  ○ ISO-19115 standard
  ○ ArcGIS
  ○ GeoTIFF


● Need to integrate:
  ○ Spatial: projection, origin, resolution
  ○ Semantic integration is harder
  ○ Across datasets and within a dataset over time
● Raster data is often heavily processed, and
  there's no standard for describing that.
Raster metadata...
INTERNATIONAL JOURNAL OF CLIMATOLOGY
Int. J. Climatol. 25: 1965–1978 (2005)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/joc.1276

VERY HIGH RESOLUTION INTERPOLATED CLIMATE SURFACES FOR GLOBAL LAND AREAS

ROBERT J. HIJMANS,a,* SUSAN E. CAMERON,a,b JUAN L. PARRA,a PETER G. JONESc and ANDY JARVISc,d a Museum of Vertebrate Zoology, University of California, 3101
Valley Life Sciences Building, Berkeley, CA, USA
b Department of Environmental Science and Policy, University of California, Davis, CA, USA; and Rainforest Cooperative Research Centre, University of Queensland, Australia
c International Center for Tropical Agriculture, Cali, Colombia

d International Plant Genetic Resources Institute, Cali, Colombia

Received 18 November 2004 Revised 25 May 2005 Accepted 6 September 2005


ABSTRACT
We developed interpolated climate surfaces for global land areas (excluding Antarctica) at a spatial resolution of 30 arc s (often referred to as 1-km spatial
resolution). The climate elements considered were monthly precipitation and mean, minimum, and maximum temperature. Input data were gathered from a
variety of sources and, where possible, were restricted to records from the 1950–2000 period. We used the thin-plate smoothing spline algorithm implemented
in the ANUSPLIN package for interpolation, using latitude, longitude, and elevation as independent variables. We quantified uncertainty arising from the input
data and the interpolation by mapping weather station density, elevation bias in the weather stations, and elevation variation within grid cells and through data
partitioning and cross validation. Elevation bias tended to be negative (stations lower than expected) at high latitudes but positive in the tropics. Uncertainty is
highest in mountainous and in poorly sampled areas. Data partitioning showed high uncertainty of the surfaces on isolated islands, e.g. in the Pacific.
Aggregating the elevation and climate data to 10 arc min resolution showed an enormous variation within grid cells, illustrating the value of high-resolution
surfaces. A comparison with an existing data set at 10 arc min resolution showed overall agreement, but with significant variation in some regions. A
comparison with two high-resolution data sets for the United States also identified areas with large local differences, particularly in mountainous areas.
Compared to previous global climatologies, ours has the following advantages: the data are at a higher spatial resolution (400 times greater or more); more
weather station records were used; improved elevation data were used; and more information about spatial patterns of uncertainty in the data is available.
Owing to the overall low density of available climate stations, our surfaces do not capture of all variation that may occur at a resolution of 1 km, particularly of
precipitation in mountainous areas. In future work, such variation might be captured through knowledge- based methods and inclusion of additional co-variates,
particularly layers obtained through remote sensing. Copyright 2005 Royal Meteorological Society.
Integration in Terra Populus
● Everything is a Variable.

● Deal with each domain separately

● Integration across domains is geospatial
  ○ Tabulate microdata to area-level
  ○ Summarize raster to area-level, based on boundary
    files


● Provide output in all three domains
  ○ Different disciplines have different preferences
Observations and Lessons
● Unifying view of data as Variables

● Need a human in the loop:
  ○ Integration is messy, hard & difficult to automate
  ○ Many issues of semantics to sort out
● Mappings let you automate ~80%.
● Support hand-coding of special cases.
● Useful to have a sense of what the output
  should look like for sanity checks.
● Simple is good.
Coming this summer!


http://www.terrapop.org
     @terrapopulus

  http://www.ipums.org
  http://www.nhgis.org
http://www.pop.umn.edu

More Related Content

Similar to Integrating Cross-Domain Metadata

25 oct 2011 geocoding symposium washington 0950 richard abas
25 oct 2011 geocoding symposium washington 0950 richard abas25 oct 2011 geocoding symposium washington 0950 richard abas
25 oct 2011 geocoding symposium washington 0950 richard abasrichard abas
 
Does microfinance reduce rural poverty?
Does microfinance reduce rural poverty?Does microfinance reduce rural poverty?
Does microfinance reduce rural poverty?essp2
 
Does microfinance reduce rural poverty? Evidence based on long term household...
Does microfinance reduce rural poverty? Evidence based on long term household...Does microfinance reduce rural poverty? Evidence based on long term household...
Does microfinance reduce rural poverty? Evidence based on long term household...guest9970726
 

Similar to Integrating Cross-Domain Metadata (6)

Terra Populus Overview
Terra Populus OverviewTerra Populus Overview
Terra Populus Overview
 
Geomandering
GeomanderingGeomandering
Geomandering
 
Introduction automated zone design
Introduction automated zone designIntroduction automated zone design
Introduction automated zone design
 
25 oct 2011 geocoding symposium washington 0950 richard abas
25 oct 2011 geocoding symposium washington 0950 richard abas25 oct 2011 geocoding symposium washington 0950 richard abas
25 oct 2011 geocoding symposium washington 0950 richard abas
 
Does microfinance reduce rural poverty?
Does microfinance reduce rural poverty?Does microfinance reduce rural poverty?
Does microfinance reduce rural poverty?
 
Does microfinance reduce rural poverty? Evidence based on long term household...
Does microfinance reduce rural poverty? Evidence based on long term household...Does microfinance reduce rural poverty? Evidence based on long term household...
Does microfinance reduce rural poverty? Evidence based on long term household...
 

Recently uploaded

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

Integrating Cross-Domain Metadata

  • 1. Integrating Cross-Domain Metadata Terra Populus experience report Peter Clark pclark@umn.edu @pclark Alex Jokela ajokela@umn.edu @alexjokela
  • 2. Overview of Terra Populus ● National Science Foundation ○ Award ID: 0940818 ● Core goal: Integrate, preserve, and disseminate data describing changes in the human population and environment over time. ● Focus for this talk is on integrating and linking interdisciplinary data ○ Challenges for both data and metadata
  • 4. Motivation: Environment Image credit: United Nations Environmental Programme (via http://climate.nasa.gov/state_of_flux)
  • 5. Flavors of Data and Metadata ● Four types of data in Terra Populus ○ Individual demographic ■ Brazil 1970-2000 (4 points in time) ■ Malawi 1987-2009 (3 points in time) ○ Area-based demographic and economic ○ Environment and land use ■ Global Landscapes Initiative ■ Global Land Cover GLC2000 ○ Climate ■ WorldClim: 1950-2000 ● Three different formats: ○ Microdata (survey) ○ Area-level (tables) ○ Raster (images/bitmaps)
  • 6. Microdata ● Individual-level survey response data ○ Usually grouped into households ○ Usually has a couple hundred variables ○ Usually doesn't give good geographic locality due to confidentiality ● All censuses tend to ask about similar topics: ○ location, dwelling characteristics, age, sex, occupation, marital status, family size, home ownership, nativity and immigration... ● Each of these is a Variable
  • 7. US Census 2000 Microdata H000012712724 27600 51205120 22 34473 P00001270100004501000010600011110000010147031400100005002699901200000 0027010000 P00001270300004503011010170010110000010147050600215007003299901200000 0027010000 P00001270200004519000010540010110000010147051600100008003299901200000 0027010000 H000013412724 27720 51205120 23 2862 P00001340100008023000010200010110000010147010200216011008203202200000 0027010000 H000014112724 27500 51205120 22 22517 P00001410100027001000010540010110000010147010100100009003299901200000 0055010000 P00001410200022602000220440010110000010147010100100009002202602200000 0027010000 P00001410300031503000010210010110000010147050600206011003202202200000 0027010000 P00001410400029303011010170010110000010147050600205007003202202200000 0027010000 P00001410500024703011010130010110000010147050000204004003202202200000 0027010000
  • 8. US Census 2000 Microdata H000012712724 27600 51205120 22 34473 P00001270100004501000010600011110000010147031400100005002699901200000 0027010000 P00001270300004503011010170010110000010147050600215007003299901200000 0027010000 P00001270200004519000010540010110000010147051600100008003299901200000 0027010000
  • 9. Area-level aggregate data ● Demographic and economic characteristics of a location ○ Location can mean nation, state, county, city, metropolitan area ○ Might be a political/administrative entity or just a statistical boundary. ● Some aggregate characteristics are scalar: ○ Total Population, GDP, % literacy, average household size ● More often it's a crosstab or table... ○ We treat each cell in the table as a variable
  • 10. Educational Attainment by Sex: Minnesota Subject Estimate Estimate Estimate (Total) (Male) (Female) Population 18 to 24 years 506,399 258,167 248,232 Less than high school graduate 12.9% 14.2% 11.6% High school graduate (inc. equivalency) 25.8% 28.3% 23.1% Some college or associate's degree 50.2% 48.5% 51.9% Bachelor's degree or higher 11.2% 9.0% 13.4%
  • 11. Raster data ● Basically, just a picture ○ But usually heavily processed ● Geocoded ● Pixel = Measurement ○ either direct or interpolated ● Type ○ Categorical (such as land use) ○ Continuous (temperature, elevation)
  • 13. Metadata Integration - same domain ● Census/Survey ○ Different questions ○ Different answers ○ Different encodings ● Area-level ○ Different table structures ○ Different geographies ○ Different categories ● Raster ○ Different alignments ○ Different resolutions ○ Different processing
  • 14. Metadata integration - cross-domain ● Semantics are key ○ Terms can mean different things to different disciplines. ○ What does "data" mean? ● How do you link across domains? ○ What questions are you trying to answer?
  • 15. Microdata integration ● Old data usually presents recovery, cleanup and parsing issues ● The older the data, the less likely to have a machine readable codebook ● If you're lucky, you get the data as an SPSS or SAS file
  • 18. Microdata Integration in TerraPop ● Data structure can vary on many dimensions ○ questions/variables ○ categories & topcoding/bottomcoding ○ universe statements ● We've defined a unified coding scheme ● We deal with these using translation tables
  • 19. Translation Tables for microdata CODE LABEL China 1982 Columbia 1973 Kenya 1989 Mexico 1970 USA 1990 100 Single/Never Married 1=never 4=single 1=single 9=single 6=never 200 Married/In Union 210 Married (Not Specified) 2=married 2=married 2=monogamous 1=married 211 Civil 3=only civil 212 Religious 4=only religious 213 Civil and Religious 2=civil and religious 214 Polygamous 3=polygamous 200 Consensual Union 1=free union 5=free union 300 Separated/Divorced 3=sep/divorced 310 Separated 6=separated 8=separated 3=separated 330 Divorced 4=divorced 5=divorced 7=divorced 4=divorced 400 Widowed 3=widowed 5=widowed 4=widowed 6=widowed 5=widowed 999 Missing/Unknown 0=missing 6=unknown B=Blank 1=unknown
  • 20. Area-level metadata Presents all the same problems as microdata, but two points of additional complexity ● Geography ○ boundaries, names, and codes change ● Table Structure ○ Changes over time in concert with changes in survey structure and binning
  • 21. Area-level Metadata in TerraPop We're cheating! ● Generating our own from microdata ○ Minimizes structural differences ○ But doesn't eliminate semantic differences ● But we lose out on fine-grain geography
  • 22. Raster metadata ● Best opportunity for things to be easy ○ ISO-19115 standard ○ ArcGIS ○ GeoTIFF ● Need to integrate: ○ Spatial: projection, origin, resolution ○ Semantic integration is harder ○ Across datasets and within a dataset over time ● Raster data is often heavily processed, and there's no standard for describing that.
  • 23. Raster metadata... INTERNATIONAL JOURNAL OF CLIMATOLOGY Int. J. Climatol. 25: 1965–1978 (2005) Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/joc.1276 VERY HIGH RESOLUTION INTERPOLATED CLIMATE SURFACES FOR GLOBAL LAND AREAS ROBERT J. HIJMANS,a,* SUSAN E. CAMERON,a,b JUAN L. PARRA,a PETER G. JONESc and ANDY JARVISc,d a Museum of Vertebrate Zoology, University of California, 3101 Valley Life Sciences Building, Berkeley, CA, USA b Department of Environmental Science and Policy, University of California, Davis, CA, USA; and Rainforest Cooperative Research Centre, University of Queensland, Australia c International Center for Tropical Agriculture, Cali, Colombia d International Plant Genetic Resources Institute, Cali, Colombia Received 18 November 2004 Revised 25 May 2005 Accepted 6 September 2005 ABSTRACT We developed interpolated climate surfaces for global land areas (excluding Antarctica) at a spatial resolution of 30 arc s (often referred to as 1-km spatial resolution). The climate elements considered were monthly precipitation and mean, minimum, and maximum temperature. Input data were gathered from a variety of sources and, where possible, were restricted to records from the 1950–2000 period. We used the thin-plate smoothing spline algorithm implemented in the ANUSPLIN package for interpolation, using latitude, longitude, and elevation as independent variables. We quantified uncertainty arising from the input data and the interpolation by mapping weather station density, elevation bias in the weather stations, and elevation variation within grid cells and through data partitioning and cross validation. Elevation bias tended to be negative (stations lower than expected) at high latitudes but positive in the tropics. Uncertainty is highest in mountainous and in poorly sampled areas. Data partitioning showed high uncertainty of the surfaces on isolated islands, e.g. in the Pacific. Aggregating the elevation and climate data to 10 arc min resolution showed an enormous variation within grid cells, illustrating the value of high-resolution surfaces. A comparison with an existing data set at 10 arc min resolution showed overall agreement, but with significant variation in some regions. A comparison with two high-resolution data sets for the United States also identified areas with large local differences, particularly in mountainous areas. Compared to previous global climatologies, ours has the following advantages: the data are at a higher spatial resolution (400 times greater or more); more weather station records were used; improved elevation data were used; and more information about spatial patterns of uncertainty in the data is available. Owing to the overall low density of available climate stations, our surfaces do not capture of all variation that may occur at a resolution of 1 km, particularly of precipitation in mountainous areas. In future work, such variation might be captured through knowledge- based methods and inclusion of additional co-variates, particularly layers obtained through remote sensing. Copyright 2005 Royal Meteorological Society.
  • 24. Integration in Terra Populus ● Everything is a Variable. ● Deal with each domain separately ● Integration across domains is geospatial ○ Tabulate microdata to area-level ○ Summarize raster to area-level, based on boundary files ● Provide output in all three domains ○ Different disciplines have different preferences
  • 25. Observations and Lessons ● Unifying view of data as Variables ● Need a human in the loop: ○ Integration is messy, hard & difficult to automate ○ Many issues of semantics to sort out ● Mappings let you automate ~80%. ● Support hand-coding of special cases. ● Useful to have a sense of what the output should look like for sanity checks. ● Simple is good.
  • 26. Coming this summer! http://www.terrapop.org @terrapopulus http://www.ipums.org http://www.nhgis.org http://www.pop.umn.edu