Gulf of Mexico Hydrocarbon Database: Integrating Heterogeneous Data for Improved Model Development
Anne E. Thessen, Sean McGinnis, Elizabeth North, and Ian Mitchell
http://www.slideshare.net/athessen
Thank You to Data Providers
• NOAA/NOS Office of Response and Restoration
• Commonwealth Scientific and Industrial Research Organization
• Environmental Protection Commission of Hillsborough County
• National Estuarine Research Reserves
• Sarah Allan
• Kim Anderson
• Jamie Pierson
• Nan Walker
• Ed Overton
• Richard Aronson
• Ryan Moody
• Charlotte Brunner
• William Patterson
• Kyeong Park
• Kendra Daly
• Liz Kujawinski
• Jana Goldman
• Jay Lunden
• Samuel Georgian
• Leslie Wade
• British Petroleum
• Joe Montoya
• Terry Hazen
• Mandy Joye
• Richard Camilli
• Chris Reddy
• John Kessler
• David Valentine
• Tom Soniat
• Matt Tarr
• Tom Bianchi
• Tom Miller
• Elise Gornish
• Terry Wade
• Steven Lohrenz
• Dick Snyder
• Paul Montagna
• Patrick Bieber
• Wei Wu
• Mitchell Roffer
• Dongjoo Joung
• Mark Williams
• Don Blake
• Jordan Pino
• John Valentine
• Jeffrey Baguely
• Gary Ervin
• Erik Cordes
• Michaeol Perdue
• Bill Stickle
• Andrew Zimmerman
• Andrew Whitehead
• Alice Ortmann
• Alan Shiller
• Laodong Guo
• A. Ravishankara
• Ken Aikin
• Tom Ryerson
• Prabhakar Clement
• Christine Ennis
• Eric Williams
• Ed Sherwood
• Julie Bosch
• Wade Jeffrey
• Chet Pilley
• Just Cebrian
• Ambrose Bordelon
LTRANS
• Lagrangian Transport Model
• Open Source
• http://northweb.hpl.umces.edu/LTRANS.htm
• Used to predict transport of particles, subsurface hydrocarbons, and surface oil slicks (in development)
GISR Deepwater Horizon Database
[Figure legend: Number of Data Points]
• Over 8 million georeferenced data points
• Over 13 GB
• Over 2000 analytes and parameters
Database Contents
• Oceanographic Data
  – Salinity
  – Temperature
  – Oxygen
  – More
• Air
• Water
• Tissue
• Sediment/Soil
• Chemistry Data
  – Hydrocarbons
  – Heavy metals
  – Nutrients
  – More
[Example plot: naphthalene data from early August 2010; n > 10,000]
Challenges
• Obtaining the data
• Heterogeneity
• Metadata
• Comparison
The Great Data Hunt
• Discovery
  – Project directory
  – Funding agency records
  – Literature
  – Internet search
[Pie chart: relevant vs. total data sets discovered; n = 146]
The Great Data Hunt
• Access
  – Online
  – Ask directly
  – Literature
[Pie chart: data and response; no data and response; no data, no response; data, no response]
We received responses to 58% of our inquiries and obtained 40% of the identified data sets.
Heterogeneity
• Heterogeneity
  – Terms
  – Units
  – Format
  – Structure
  – Quality Codes
[Figure: synonyms for Benzoic Acid (Carboxybenzene, E210, Dracylic Acid, C7H6O2); 2,212 terms before reconciliation, 1,367 after]
Heterogeneity
• Heterogeneity
  – Terms
  – Units
  – Format
  – Structure
  – Quality Codes
[Figure: units reported for n-Decane (parts per trillion, ppbv, μg/g, ng/g, ppt, mg/kg, μg/kg, ppb); 122 units before reconciliation, 37 after]
Metadata
• Metadata
  – Missing
  – Not computable
[Diagram: each data point linked to Name, Unit, Location, Time, and Attribution]
Metadata
• Metadata
  – Missing
  – Not computable
[Diagram: each data point linked to Name, Unit, Method, Location, Time, Attribution, and Uncertainty]
Comparing to Model Output
[Diagram: model output in netCDF format matched to the database in SQL via a nearest neighbor algorithm; each record carries Parameter, Depth, Latitude, Longitude, and TimeStamp fields]
Comparing to Model Output
• Set limits on what is considered a nearest neighbor
• Not all data points have to be matched
• Data points can have many neighbors
• Matching is done before query
Attribution and Citation
• Literature citation
• Repository identifier
• Generate new
Future Work
• More data
• User feedback
• Web Access
• Users’ Guide
• Manuscripts
• Improved query
Questions?
The Great Data Hunt
• Discovery
• Access
  – Online
  – Ask directly
  – Literature
We received responses to 58% of our inquiries and obtained 40% of the identified data sets.
40% of those responses were received within 24 hours and 27% were received within the first week.
[Histogram: Number of Responses vs. Time to First Response (Days); bins: First Day, 2 to 7, 8 to 30, 31 to 60, 61 to 90, 91 to 120, 121 to 150, 151 to 180]
The Great Data Hunt
• Discovery
• Access
  – Online
  – Ask directly
  – Literature
0-24 email exchanges per data set
We received responses to 58% of our inquiries and obtained 40% of the identified data sets.
40% of those responses were received within 24 hours and 27% were received within the first week.
[Histogram: Number of Data Sets vs. Number of Emails (0 to 24)]
Why didn’t people share?
• Paper not published yet – 30%
• Passed the buck – 17%
• Too busy – 9%
• Medical problems – 9%
• Poor quality – 9%


Editor’s notes

  1. Hello, my name is Anne Thessen and I’m going to speak to you about some model development that we’ve been doing. First I would like to acknowledge my coauthors, Sean McGinnis, Elizabeth North and Ian Mitchell. We are part of the Gulf Integrated Spill Research Consortium. Our work was funded by the Gulf of Mexico Research Initiative. We received institutional support from Arizona State University and the University of Maryland Center for Environmental Science. I recently started my own business called The Data Detektiv that does the type of data work I’m about to present. If you like what you see and have need of this sort of expertise in your project please see me after the talk. These slides will be posted to slideshare later today.
  2. This talk is primarily about building a database from multiple data sets. Here is a list of all the data providers. As you can see, we have many. It takes a village to build a database. We have a fantastic product and we are just starting to scratch the surface of what it can tell us, so we really appreciate all these folks sharing data and answering questions about their data.
  3. The goal of our project is to modify an existing Lagrangian transport model, called LTRANS, so that it can be effectively used to understand the processes that determine transport and fate of hydrocarbons in the Gulf of Mexico. I won’t say much more about the model itself, but if you are interested, here is where you can learn more. The figure shows model output and field data together. The small points are model output and the large circles are field data. You can see there is a good match here, but we are dealing with geographic sampling bias.
  4. To determine the efficacy of the model, we are comparing the output to field data collected after the Deepwater Horizon explosion. To accomplish this, we are compiling a database of oceanographic and hydrocarbon field measurements called the GISR Deepwater Horizon database. It can be queried to get the output we need for analysis. Currently, it is over 13 GB in size and contains over 8 million georeferenced data points gathered from published and unpublished sources, industry, government databases, volunteer networks and individual researchers.
  5. The database contains multiple types of oceanographic and chemistry data. This plot is an example of database content. It shows naphthalene data from the beginning of August 2010. We have well over 10,000 naphthalene data points.
  6. We encountered four major challenges while building and using this database. I will talk about each of them in turn.
  7. The first challenge was finding and accessing data sets. A significant number of data sets were not in a repository or part of the published literature - and we expected this. To discover data we looked through project directories, databases of awarded projects, the literature, and the internet. That gave us a list of contacts. We ended up identifying 146 potentially relevant projects. We approached each contact via email to find out if they had data and if they were relevant. At the end of the process we identified 95 relevant data sets.
  8. Once the data sets were discovered, they had to be accessed. Some were freely available and were simply downloaded. Some data sets were in repositories that may or may not have required working with the data manager to gain access. Others were published as a table in supplementary material. Most data sets involved communicating with the provider to get the complete data set and the metadata. There were a few instances where the provider instructed us to take the data from the figure, but we tried to avoid doing that. Out of those 95 relevant data sets, we received responses to 58% of our inquiries and were able to obtain 40% of the data sets. This chart is a breakdown of the 95 data sets. The dark orange represents the data sets for which we asked and received both a response and the data. The dark purple are the data sets that were freely available online, so no communication was necessary. The light orange represents the data sets that were denied us. The light purple represents the inquiries that went completely unanswered. You can look at this another way. The orange represents communication and the purple represents no communication. The dark colors represent data while the light colors represent no data. This is quite good compared to sharing in some communities, which can be as low as 10%.
  9. Then came the process of integrating the data sets, which brings us to our second challenge - heterogeneity. We encountered heterogeneity in terms, units, formats, structures and quality codes. We normalized terms, units and codes algorithmically. Some of the formats and structures were normalized algorithmically. Terms were normalized using a Google Fusion Table that lists a “preferred name” and all of the synonyms for that name. An algorithm generates a table relating each synonym to the preferred name. This is connected to the database such that the preferred name can be pulled from the table. That way, when the database is queried for a particular analyte, we don’t miss data because one synonym is used in the original data set instead of another. For example, benzoic acid has five synonyms in the table. We had over 2,000 terms before reconciliation and 1,367 terms after reconciliation.
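The note above only describes the term-reconciliation mechanism in words. The following is a minimal illustrative sketch in Python of how a synonym table of this kind could be applied; the synonym list and names are hypothetical examples, and the project's own implementation used a Google Fusion Table feeding an SQL database, not this code.

    # Illustrative sketch only: map every reported analyte name to a single
    # preferred name before loading or querying the database. The synonym
    # list below is a hypothetical example, not the project's actual table.

    SYNONYMS = {
        "benzoic acid": ["carboxybenzene", "e210", "dracylic acid", "c7h6o2"],
    }

    # Invert into a lookup table: reported name -> preferred name.
    PREFERRED = {
        name: preferred
        for preferred, synonyms in SYNONYMS.items()
        for name in [preferred, *synonyms]
    }

    def normalize_term(reported_name: str) -> str:
        """Return the preferred analyte name; fall back to the input if unknown."""
        return PREFERRED.get(reported_name.strip().lower(), reported_name)

    assert normalize_term("Carboxybenzene") == "benzoic acid"
    assert normalize_term("E210") == "benzoic acid"

Attaching a table like this to the database means a query for benzoic acid also returns rows originally labelled Carboxybenzene or E210, without altering the original data sets.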
  10. The units are handled similarly, except there is a transformation step, wherein some math is done to convert the value to the “preferred unit”. This allows us to normalize terms and units without changing the original data set. For example, n-Decane was represented by 6 different units. The number of different units in the database was decreased substantially after reconciliation. Formats varied from Access databases and shape files to pdf tables. All data sets, except for the databases, were normalized to our schema and then imported into an SQL database. The databases were transformed to SQL and then joined. Sometimes this had to be done manually. Sometimes we were able to write scripts to help. We are in the process of normalizing the quality codes, but we will probably handle this in the same manner as the terms and units.
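As with terms, the unit transformation step can be pictured as a small lookup of multiplicative factors into a preferred unit. The sketch below is a hypothetical Python illustration; the choice of ng/g as the preferred unit and the factor table are examples, not the project's actual conversion rules, and volume-based units such as ppbv would need extra information (e.g. molar mass) and are left out.

    # Illustrative sketch only: convert mass-fraction measurements to a single
    # preferred unit. The preferred unit and factors are hypothetical examples.

    PREFERRED_UNIT = "ng/g"

    TO_PREFERRED = {          # multiplicative factor from reported unit to ng/g
        "ng/g": 1.0,
        "ppb": 1.0,           # 1 ppb by mass = 1 ng/g
        "ug/g": 1_000.0,      # 1 ug/g = 1 ppm = 1,000 ng/g
        "μg/g": 1_000.0,
        "mg/kg": 1_000.0,
        "ug/kg": 1.0,
        "μg/kg": 1.0,
        "ppt": 0.001,         # parts per trillion
    }

    def normalize_value(value: float, unit: str) -> tuple[float, str]:
        """Convert (value, unit) to the preferred unit when a factor is known."""
        factor = TO_PREFERRED.get(unit.strip().lower())
        if factor is None:
            return value, unit        # leave unrecognised units untouched
        return value * factor, PREFERRED_UNIT

    print(normalize_value(0.5, "ug/g"))   # (500.0, 'ng/g')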
  11. The third challenge was metadata (or lack thereof). Metadata was often missing, in a separate location or in a separate format from the actual data. At a minimum, for the database to work, we needed to know the basics of what, where and when.
  12. Ideally, we were able to get more information, such as methods and uncertainty. Compiling the metadata was an exercise in detective work that involved searching through multiple files and contacting data providers. This was often a very time-consuming process.
  13. The final challenge was in actually putting the database to use. We have only scratched the surface in this regard. We developed the “nearest neighbor” algorithm to connect a data point in the model output to its partner in the database (or vice versa) based on space and time. This is accomplished via a C# script that takes as input a link to each dataset and the names of the fields to be considered in the distance function between two points. This distance function is currently implemented as a stepped function, where the candidate points are filtered first by date, then by geospatial distance, and finally by depth. The output is given in SQL and links data points via their data point ID in the database.
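The actual implementation is a C# script that writes its matches to SQL; the Python sketch below is only meant to illustrate the stepped filtering the note describes (date, then horizontal distance, then depth). The data structures, thresholds, and helper names are hypothetical.

    # Illustrative sketch only of stepped nearest-neighbor matching:
    # filter candidates by time, then horizontal distance, then depth.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from math import radians, sin, cos, asin, sqrt

    @dataclass
    class Point:
        point_id: int
        time: datetime
        lat: float
        lon: float
        depth: float          # metres

    def haversine_km(a: Point, b: Point) -> float:
        """Great-circle distance between two points in kilometres."""
        dlat = radians(b.lat - a.lat)
        dlon = radians(b.lon - a.lon)
        h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(h))

    def nearest_neighbours(obs: Point, model_points: list[Point],
                           max_dt=timedelta(days=1), max_km=5.0, max_dz=10.0) -> list[int]:
        """Return IDs of model points within the time, distance, and depth limits."""
        step1 = [p for p in model_points if abs(p.time - obs.time) <= max_dt]
        step2 = [p for p in step1 if haversine_km(obs, p) <= max_km]
        step3 = [p for p in step2 if abs(p.depth - obs.depth) <= max_dz]
        return [p.point_id for p in step3]   # one observation may have many neighbours

In this sketch every point that passes all three filters is kept, so a data point can have several neighbours and unmatched points are simply dropped, which mirrors the features listed on the nearest-neighbor slide.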
  14. Some important features of nearest-neighbor include…
  15. An important part of reusing other people’s data is citing them appropriately. Data set citation is still a relatively new concept, but it’s starting to gain momentum through tools like ImpactStory, which summarizes sharing activities, and repositories like FigShare and Dryad. We worked with each of the data providers to find out how they wanted to be cited. Typically, if the data had a publication, the provider wanted the publication to be cited. Not all data sets had a publication. Data sets in repositories often had a citation already developed and provided by the repository. There were plenty of data sets that were unpublished and not in a repository. For these data sets we worked with the provider to generate a citation. This involved encouraging the provider to deposit data and receive a citable, unique identifier for the data. If the data set was already online, such as on a personal web site, the access URL was given in the citation. We also plan to develop a citation for the database as a whole with all of the providers as authors. In the future, when a user executes a query, they will also be presented with a list of citations for the data sets that appear in the query results. That way they can cite the database as a whole or the individual data sets they actually use.
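One way to present per-query citations, as described above, is to track which data sets a query touched and join against a citation table. The sqlite3 sketch below is purely illustrative; the schema, table names, and columns are hypothetical and not the GISR database's actual layout.

    # Illustrative sketch only: return query results together with citations
    # for the data sets they came from. Schema and names are hypothetical.
    import sqlite3

    def query_with_citations(conn: sqlite3.Connection, analyte: str):
        rows = conn.execute(
            "SELECT value, unit, latitude, longitude, depth, time, dataset_id "
            "FROM measurement WHERE analyte = ?",
            (analyte,),
        ).fetchall()

        dataset_ids = sorted({row[-1] for row in rows})
        citations = []
        if dataset_ids:
            placeholders = ",".join("?" * len(dataset_ids))
            citations = [
                c for (c,) in conn.execute(
                    f"SELECT citation FROM dataset WHERE dataset_id IN ({placeholders})",
                    dataset_ids,
                )
            ]
        return rows, citations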
  16. We have accomplished a lot, but we still have much to do to fulfill our goals. There will be a lot of additional data released over the next year that will be added to the database. We will be giving web access to the contributors and plan to incorporate their feedback to improve usability before opening to a wider audience. We are currently drafting and refining a users’ guide. There will be manuscripts published on the process of gathering data that I just described and a more technical paper on the database itself. At the top of my wish list is building a more semantically-intelligent structure for improved query. We don’t currently have funds for this, but an ontology of terms would enable users to query for classes of parameters instead of single parameters and to use terms of their choosing. As I said before, this is a really great resource that we have only just begun to use. We look forward to getting many insights out of this database and making it available for others to get even more.
  17. With that, I can take questions now or if you want to speak in more detail about this project or The Data Detektiv I would be happy to sit down over coffee, food or beer. I have worked on many different data types including oceanographic, chemical, ecological and taxonomic data sets. I can help you solve your data problems, so you can spend more time on research.
  18. Most of the responses we received were quite timely. 40% were received within the first 24 hours and 27% were received within 2-7 days.
  19. This process can be quite labor intensive. Some data sets required up to 24 email exchanges to get all of the data and metadata situated. The average was 7.8 emails.
  20. We were actively denied data by 24% of the 95 contacts made. The other 36% did not respond to our requests at all. We know nothing about the data set or why it wasn’t shared. For those 24% that did give us a reason we see that “paper not published yet” was the primary reason. All of these folks expressed willingness to share after publication. So the sharing rate will increase dramatically once all these papers come out. 17% directed me to another person who did not respond at all. Only 9% told me they were too busy. Another 9% said the data or the samples got messed up in some way and was not useful. Interestingly, medical problems were also cited as a reason for not sharing. I also want to say, as an aside, that we know there are large data sets that are not available to us because of legal reasons. They were part of the Natural Resources Damage Assessment. These data sets were not included in any of these statistics. This does not add up to 100. The last 26% was a combination of random, one-off reasons or folks being hesitant and not really giving a reason.