Examining Extended and Scientific Metadata for Scalable Index Designs
Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*, Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
*University of California Santa Cruz
^Conservatoire National des Arts et Métiers
What we call metadata
• Data for the system
• External to the file
• Small
• Dense
Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne, "Operating System Concepts, Eighth Edition"
What everyone else calls metadata
• Data for the user
• Embedded in:
  • the file
  • the inode
  • a separate file
  • a notebook somewhere on their desk
• Wildly varying size
• Sparse
[Figure: where metadata lives: embedded metadata, metadata files, inode metadata, and metadata outside the system]
A scientist at work
• “Show me the data set about bears in Alaska from last fall”
• “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kilopascals”
• A mix of system and scientific metadata
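The second query above mixes system metadata (when the file was written) with scientific metadata (volcano, code library, pressure). A minimal sketch of such a query as a predicate over records follows. The record layout and every field name (`mtime`, `volcano`, `library`, `pressure_kpa`) are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical records: each mixes system metadata (path, mtime) with
# scientific metadata (volcano, library, pressure_kpa). The field names
# are illustrative assumptions, not taken from any real catalog.
now = datetime(2024, 1, 15)
records = [
    {"path": "/sim/run1.nc", "mtime": now - timedelta(days=3),
     "volcano": "Vesuvius", "library": "libflow-2.1", "pressure_kpa": 640.0},
    {"path": "/sim/run2.nc", "mtime": now - timedelta(days=2),
     "volcano": "Vesuvius", "library": "libflow-2.1", "pressure_kpa": 410.0},
    {"path": "/sim/run3.nc", "mtime": now - timedelta(days=40),
     "volcano": "Vesuvius", "library": "libflow-2.1", "pressure_kpa": 900.0},
    # Sparse record: no library was recorded for this run.
    {"path": "/sim/run4.nc", "mtime": now - timedelta(days=1),
     "volcano": "Etna", "pressure_kpa": 700.0},
]


def matches(rec, since, volcano, library, min_pressure_kpa):
    """True if the record satisfies both the system and scientific predicates."""
    return (
        rec["mtime"] >= since                      # system metadata
        and rec.get("volcano") == volcano          # scientific metadata
        and rec.get("library") == library
        and rec.get("pressure_kpa", float("-inf")) > min_pressure_kpa
    )


hits = [
    r["path"] for r in records
    if matches(r, now - timedelta(days=7), "Vesuvius", "libflow-2.1", 500.0)
]
print(hits)
```

This version scans every record, which is exactly what an index is meant to avoid; it only makes the shape of the query concrete.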
Our options
• Relational databases
• Column stores
• Spatial trees (e.g., Spyglass, SmartStore)
• Inverted indexes
• Bitmap indexes (e.g., FastBit)
• The choice of index depends on the data, but what does the data look like?
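To make one of these options concrete, here is a toy bitmap index: one bitmap per distinct value, with a set bit for each row holding that value. This is only a sketch that uses Python integers as uncompressed bitsets. It is not FastBit's API, and production bitmap indexes compress their bitmaps. The column contents are made up.

```python
def build_bitmap_index(values):
    """Map each distinct non-null value to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for row, value in enumerate(values):
        if value is None:
            continue  # nulls set no bits, so sparse columns stay cheap
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns for five rows.
discipline = ["bio", "astro", "bio", None, "ocean"]
fmt = ["XML", "CSV", "CSV", "CSV", "NetCDF"]

idx_disc = build_bitmap_index(discipline)
idx_fmt = build_bitmap_index(fmt)

# A conjunctive query becomes a bitwise AND of two bitmaps.
bio_and_csv = rows_of(idx_disc["bio"] & idx_fmt["CSV"])
print(bio_and_csv)
```

Combining predicates with bitwise operations is the main attraction of this design. The cost is one bitmap per distinct value, which grows quickly for high-cardinality columns.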
Outline
• The data in brief
• Dimensionality
• Sparsity
• Atomicity
• Entropy
The metadata in brief
Dataset  Discipline    Native format  Record count  Subsampled?  Sample count  Total size
Dryad    Biology       XML            31K           No           31K           400 MB
WISE     Astronomy     CSV            564M          Yes          10K           1 TB
ARGO     Oceanography  NetCDF         2B            Yes          635K          330 GB
ORNL     Climatology   CSV            1478          No           1478          154 KB
Dimensionality
            Dryad  WISE  Argo  ORNL  Total
Dimensions  44     285   108   14    451
• Much higher dimensional than POSIX data
• Curse of dimensionality concerns
Sparsity
Sparse even within a discipline (extremely sparse across all disciplines)
• CDF of sparsity
• For a randomly chosen element from X% of columns, there is a Y% chance it will be null
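One way to produce a sparsity CDF like the one described is to compute the null fraction of every column, sort the fractions, and pair each with the share of columns at or below it. The sketch below follows that reading of the plot, which is an assumption rather than the authors' stated procedure. The sample rows and column names are invented.

```python
def null_fractions(rows, columns):
    """Fraction of rows in which each column is missing or None."""
    n = len(rows)
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


def sparsity_cdf(fractions):
    """Return (share_of_columns, null_fraction) points, ordered so that
    the given share of columns has a null fraction at or below the value."""
    ordered = sorted(fractions.values())
    k = len(ordered)
    return [((i + 1) / k, f) for i, f in enumerate(ordered)]


# Hypothetical records; a missing key and an explicit None both count as null.
rows = [
    {"title": "a", "species": "Ursus arctos", "depth_m": None},
    {"title": "b", "species": None},
    {"title": "c"},
    {"title": "d", "depth_m": 12.5},
]
columns = ["title", "species", "depth_m"]

fractions = null_fractions(rows, columns)
cdf = sparsity_cdf(fractions)
for share, frac in cdf:
    print(f"{share:.0%} of columns are at most {frac:.0%} null")
```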
Atomicity (Dryad)
• How many times can a field be present for a single item?
• E.g., a single paper can have multiple authors
• Truncated to show detail. One study had 800 species!
Some disciplines have many field values per item. Others have range values (e.g., May-June 2010)
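The atomicity question can be answered by counting, for each item, how many times a field occurs and then building a histogram of those counts. A minimal sketch follows, assuming each item is a list of (field, value) pairs so that fields may repeat. The items themselves are invented examples, not Dryad records.

```python
from collections import Counter

# Hypothetical items: a field may appear several times within one item,
# for example several authors on a single paper.
items = [
    [("title", "Bears of Alaska"), ("author", "A"), ("author", "B"),
     ("species", "Ursus arctos")],
    [("title", "Salmon runs"), ("author", "C")],
    [("title", "Tundra survey"), ("author", "A"), ("author", "D"),
     ("author", "E")],
]


def multiplicity_histogram(items, field):
    """Map 'occurrences of field in one item' to 'number of such items'."""
    return Counter(
        sum(1 for name, _ in item if name == field) for item in items
    )


author_hist = multiplicity_histogram(items, "author")
species_hist = multiplicity_histogram(items, "species")
print(dict(author_hist))
print(dict(species_hist))
```

A field whose histogram has mass above 1 is non-atomic and needs an index that can hold several values per item.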
Entropy
• Row organization versus column
• How compressible is the data?
• How selective are queries?
• Plenty of compression available
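A common way to quantify both questions is the Shannon entropy of a column, H = -Σ p·log2(p), measured in bits per value. Low entropy suggests a column will compress well, while high entropy suggests that a point query on it is selective. The sketch below assumes this standard definition, and whether the slides measured entropy in exactly this way is not stated. The example columns are invented.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits per value: H = -sum(p * log2(p)) over distinct values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical columns of eight values each.
flag_column = ["ok"] * 7 + ["bad"]   # skewed: well under one bit per value
id_column = list(range(8))           # all distinct: log2(8) = 3 bits per value
constant_column = ["CSV"] * 8        # a single value: zero bits

h_flag = shannon_entropy(flag_column)
h_id = shannon_entropy(id_column)
h_const = shannon_entropy(constant_column)
print(h_flag, h_id, h_const)
```

Computing the same measure per row instead of per column gives a rough sense of whether a row or column layout groups similar values together.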
Bringing it all together
• Scientific data is:
  • Sparse
  • High-dimensional
  • Compressible
  • Non-atomic (one to many)
  • A mix of cardinal, ordinal, spatial, and binary data
• Query models:
  • Spatial
  • Range and point
  • Keyword
Comparing indexes
                     Column stores  Row stores    Spatial trees  Inverted indexes  HDF5  FastBit
High dimensional     Yes            Yes           No             Yes               Yes   Yes
Sparse               Yes            Stores nulls  No             Yes               Yes   Stores nulls
Multiple values      Yes            Yes           No             List, not range   Yes   Yes
Non-numeric data     Yes            Yes           No             Yes               Yes   No
Range queries        Yes            Yes           Yes            No                Yes   Yes
Specialized indexes  Yes            Yes           No             No                No    No
High compression     Yes            No            No             Yes               No    Yes
Conclusions
• Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
• Current approaches to scientific indexing are not a complete solution
• Column stores are a natural fit for scientific metadata and queries
• Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
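The case for column stores can be illustrated with a toy layout in which each column keeps only the rows that actually have a value, and each entry may hold several values. This is a sketch under those assumptions, not a description of any particular column store. The class name, method names, and field names are all hypothetical.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy column store: each column maps row id -> list of values.
    Absent fields take no space, and a field may hold several values."""

    def __init__(self):
        self.columns = defaultdict(dict)

    def insert(self, row_id, record):
        for field, value in record.items():
            values = value if isinstance(value, list) else [value]
            self.columns[field][row_id] = values

    def range_query(self, field, low, high):
        """Row ids with at least one value of `field` in [low, high].
        Only the named column is read; other columns are never touched."""
        return sorted(
            row_id
            for row_id, values in self.columns.get(field, {}).items()
            if any(low <= v <= high for v in values)
        )


store = SparseColumnStore()
store.insert(1, {"author": ["A", "B"], "pressure_kpa": 640.0})
store.insert(2, {"author": "C"})
store.insert(3, {"pressure_kpa": [410.0, 520.0], "depth_m": 12.5})

high_pressure = store.range_query("pressure_kpa", 500.0, 1000.0)
print(high_pressure)
```

The query still scans one column linearly. A real system would add sorted or compressed encodings, or one of the specialized indexes named above, on top of this layout.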
Questions?
Data types (raw and semantic)
          Dryad  WISE  Argo  ORNL  Total
String    100%   4%    62%   29%   28%
Numeric   0%     96%   38%   71%   72%
Str/Num   96%    68%   77%   72%   73%
Date      2%     4%    7%    7%    5%
Spatial   2%     9%    2%    21%   7%
Flagsets  0%     19%   14%   0%    15%
• Support for spatial search is useful
• Application hinting is needed for good search (is this a string, a location, or a flag set?)
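The need for application hinting shows up as soon as a raw type is inferred from the value alone. In the sketch below a flag set such as "0110" parses as a number, so only a hint supplied by the application can recover its semantic type. The hint table and the field names (`qc_flags`, `latitude`) are hypothetical.

```python
def raw_type(value):
    """Classify a raw metadata value as 'numeric' or 'string' by parsing it."""
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "string"


def semantic_type(field, value, hints):
    """Prefer an application-supplied hint; fall back to the raw type."""
    return hints.get(field, raw_type(value))


# Hypothetical hints an application might register for its own fields.
hints = {"qc_flags": "flagset", "latitude": "spatial"}

print(raw_type("0110"))                         # looks numeric without a hint
print(semantic_type("qc_flags", "0110", hints))
print(semantic_type("latitude", "45.2N", hints))
print(semantic_type("title", "Ursus arctos", hints))
```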
How can we support this?
• Search functionality which:
  • Supports these kinds of queries
  • Does not double the size of storage
  • Does not require a linear scan over petabytes of data
• The answers to queries are documents
• We rarely need an entire row
• Complex transactions and joins are less important
