Research Dataspaces: Pay-as-you-go Integration and Analysis

Bill Howe, PhD
University of Washington

3/12/09 Bill Howe, eScience Institute
Data acquisition is no longer the
bottleneck to scientific discovery
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing,

Two dimensions
[Figure: data sources plotted by # of bytes and # of data types, in three regions (Biology, Oceanography, Astronomy). Points: LSST, SDSS, PanSTARRS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons.]

Building a Research Data Management System:
Status Quo
1. Establish (scientific) consensus: scope, vision, requirements, terminology
2. Derive and encode a domain model (schema): encode shared knowledge in a machine-readable manner
a. Relational schema, ontology, metadata standards, conventions, controlled vocabularies, object model, API
b. Mappings between existing models
3. Retrofit new domain model to existing data: populate the schema, attach semantics, clean data
4. Build applications: use domain model to inform design
5. Analyze data: do science

The Value of a Data Repository
V_R = B D^2 + U D + C
D = # of datasets in the repository
B = # of binary operations facilitated
U = # of unary operations facilitated
C = intrinsic value of the schema (for communication, etc.)
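
A small worked example of the formula above, as a sketch only. The function name and the input numbers are made up for illustration; the point is that the binary-operation term grows with the square of the number of datasets.

```python
def repository_value(num_datasets, binary_ops, unary_ops, schema_value=0.0):
    """Evaluate V_R = B*D^2 + U*D + C for illustrative inputs."""
    d = num_datasets
    return binary_ops * d ** 2 + unary_ops * d + schema_value


# Hypothetical repositories that differ only in the number of datasets D.
small = repository_value(10, binary_ops=1, unary_ops=1)
large = repository_value(100, binary_ops=1, unary_ops=1)
print(small, large)
```

With B = U = 1 and C = 0, ten times as many datasets gives roughly a hundred times the value, because each pair of datasets is a candidate for a binary operation such as a join or comparison.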

Quote
A typical biological data management system involves accessing
or gathering data from multiple sources, followed by data
correlation, classification, review, and curation using domain
specific tools (e.g., functional clusters, ontologies) and expertise.
In practice, biological data management is less daunting when it is
considered in the context of an iterative strategy based on gradual
data integration while accumulating domain specific knowledge
throughout the integration process.
Victor Markowitz, LBNL

Outline
 Challenges
 Dataspaces
 Dataspace Support Platforms
 Next Steps

slide source: Alon Halevy; Franklin, Halevy, Maier 2005
Dataspaces

Data Management Solutions

Databases vs. Dataspaces
Databases | Dataspaces
Single Schema | Data “Coexistence”
Centralized Administration | Autonomous Sources
Structured Query | Search, Browse, Approximate Answers
Strict Integrity Constraints | Patterns and trends; few global properties

Dataspaces vs. Databases (2)
 Databases are Exclusive
 Reject data that violates types, schema, integrity constraints, rules + triggers
 In return: structured query, logical and physical data independence, transactions
 …over the clean subset of your data
 Dataspaces are Inclusive
 Few restrictions; all data is welcome
 In return, best effort services at first:
 Cataloging, keywords, attribute-value
 …over (almost) everything

Dataspace Services
Catalog
Keyword search
Structured Query
Analysis and Vis
Task-specific Tools
Time
Over time, a dataset becomes accessible by additional services

Example: The Internet

Example: Ocean Circulation
Forecasting System
[Figure: system diagram. Forcings (i.e., inputs): atmospheric models, tides, river discharge. Components: FORTRAN on a cluster, perl and cron, filesystem, RDBMS. Data held: simulation results, config and log files, intermediate files, annotations, data products, relations. Outputs: salinity isolines, station extractions, model-data comparisons, products via the web.]

Example: Environmental
Metagenomics
[Figure: metagenomics workflow diagram. Labels: sampling (metagenomes 1-4), environment metadata, sequencing, raw data; annotation tables (Pfams, TIGRfams, COGs, FIGfams), precomputed; CAMERA annotation of Pfams, TIGRfams, COGs, FIGfams; HMMer search of meta*ome against seed alignment; PPLACER with reference tree and aligned meta*ome fragments; STATs; taxonomic info; analyzed data; SQLShare. Questions: correlate diversity w/environment; correlate diversity and nutrients; find new genes; find new taxa and their distributions; compare meta*omes.]
src: Robin Kodner

Example: CHAVI
[Figure: a "Relational Dataspace Interface and Analysis" layer over the sources B Cell Control, T Cell Control, NK Cell Control, Genetics Databases, NHP Database, Virus Seq. Data.]
src: Bart Haynes

Outline
 Challenges
 Dataspaces
 Next Steps

Example Systems cast as DSSPs
 Atlas (LabKey)
 data model: tables and files
 Mark Igra will present
 “Data Warehouse” prototype (SCHARP)
 data model: relations
 SQLShare (UW eScience)
 Quarry [Howe, et al. 2006]
 data model: triples
 iTrails [Salles et al. 2007]
 data model: triples
 Google Fusion Tables [Halevy 2010]

[Figure: metagenomics workflow. Labels: environmental sampling (metadata); sequencing; public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); search hits; taxonomic info; phylogeny. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?]
“90% of my time spent manipulating data rather than doing science”

Simple Example
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein C
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN,
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein C
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf fa
chr_24[160001-260000].65 3542
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and p
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and p
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length
1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285
2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233
3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872
…
2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089
2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316
…
3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105
…
COGAnnotation_coastal_sample.txt
-- Join the Phaeo genome annotations to the surface-sample COG annotations
SELECT *
FROM annotationsummary_combinedorfannotation16_phaeo_genome
JOIN COGAnnotation_surface
  ON phaeo_gene = surf_hit;
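
A runnable sketch of the same kind of join, using Python's built-in sqlite3 module rather than SQLShare. The short table and column names (`phaeo_genome`, `sample_hits`, `cog_hit`, `hit`) are assumptions chosen for illustration; the rows are a few of the rows shown on this slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phaeo_genome (query TEXT, length INTEGER, cog_hit TEXT);
    CREATE TABLE sample_hits (id INTEGER, query TEXT, hit TEXT, e_value REAL);
""")

# Rows copied from the two files on the slide; a missing COG hit is stored as NULL.
conn.executemany("INSERT INTO phaeo_genome VALUES (?, ?, ?)", [
    ("chr_4[480001-580000].287", 4500, None),
    ("chr_9[400001-500000].503", 4211, "COG4547"),
    ("chr_9[160001-260000].243", 3002, "COG5077"),
    ("chr_12[720001-820000].86", 2895, "COG5032"),
])
conn.executemany("INSERT INTO sample_hits VALUES (?, ?, ?, ?)", [
    (1, "FHJ7DRN01A0TND.1", "COG0414", 1.0e-08),
    (2853, "FHJ7DRN02HXTBY.5", "COG5077", 7.0e-09),
    (3566, "FHJ7DRN02FUJW3.1", "COG5032", 1.0e-09),
])

# Join the two tables on the COG id they share.
matches = conn.execute("""
    SELECT g.query, s.query, g.cog_hit
    FROM phaeo_genome AS g
    JOIN sample_hits AS s ON g.cog_hit = s.hit
    ORDER BY g.cog_hit
""").fetchall()

for row in matches:
    print(row)
```

Rows without a COG hit drop out of the inner join, so no cleaning step is needed before asking the question.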

[Figure: metagenomics workflow with SQLShare and SQL added. Labels: environmental sampling (metadata); sequencing; public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); search hits; taxonomic info; phylogeny; SQL; SQLShare. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?]
“That took me a week with Excel”
“I can do science again”

SQLShare Motivation
 Conventional wisdom says “Scientists won’t write SQL”
 We don’t believe it
 Instead, we implicate difficulty in
 installation
 configuration
 schema design
 performance tuning
 data ingest
 over-reliance on GUIs
We ask “What kind of technology would
make SQL a natural fit for hypothesis
testing?”

SQLShare Features
 Collaborative SQL authoring and sharing
 Views for incremental abstraction and integration
 Semi-automatic integration
 Identify “natural” unions and joins
 SQL Autocomplete
 User starts typing, system uses query logs to make suggestions
[Khoussainova 10]
 English Query
 Bootstrap a SQL query from an English question
 Simple Visualization
 via Integration with Google Fusion Tables
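
One way to picture "identify natural joins" is to score column pairs by how much their values overlap. The sketch below is an assumption for illustration, not SQLShare's actual method; the function names, the threshold, and the table and column names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two value collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def join_candidates(tables, threshold=0.3):
    """tables maps table name -> {column name: list of values}.

    Returns (score, left column, right column) for every cross-table
    column pair whose value overlap reaches the threshold.
    """
    found = []
    for (t1, cols1), (t2, cols2) in combinations(tables.items(), 2):
        for c1, v1 in cols1.items():
            for c2, v2 in cols2.items():
                score = jaccard(v1, v2)
                if score >= threshold:
                    found.append((score, f"{t1}.{c1}", f"{t2}.{c2}"))
    return sorted(found, reverse=True)


tables = {
    "phaeo_genome": {
        "query": ["chr_9.503", "chr_9.243", "chr_12.86"],
        "cog_hit": ["COG4547", "COG5077", "COG5032"],
    },
    "sample_hits": {
        "query": ["FHJ7DRN01A0TND.1", "FHJ7DRN02HXTBY.5", "FHJ7DRN02FUJW3.1"],
        "hit": ["COG0414", "COG5077", "COG5032"],
    },
}
candidates = join_candidates(tables)
print(candidates)
```

Here the two `query` columns share a name but no values, while `cog_hit` and `hit` share values under different names, which is why the sketch compares values instead of column names.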

Outline
 Challenges
 Dataspaces
 Next Steps

Next Steps
 Define scope
 Define HIV Dataspace team
 Build a minimal technical team: a “Data Wrangler” and an “Application Wrangler”
 Identify and catalog dataspace “participants” (i.e., sources)
 Review data access rights and security requirements
 Gather “spanning basis” of questions to answer
 Jim Gray’s “20 questions” methodology
 Gather “spanning basis” of existing data
 use exemplars if necessary
 load data “as is” into a database

Next Steps (2)
 Answer initial questions (Data wrangler)
 RDBMS example: create views
 Visualize initial answers (Application wrangler)
 Demonstrate early progress
 Check breadth (what’s missing?)
 Check depth (Did “hard” questions get answered?)

Summary
 Conventional “schema-first” approaches
break down in research contexts
 The dataspace abstraction and DSSPs
offer a way forward
 Systems and best practices are emerging
in the literature and from production
deployments

BACKUP SLIDES

Feature: Sharing SQL

Feature: SQL Autocomplete
 User requests suggestions on the fly while typing a query
 Recommends snippets:
 predicates in the WHERE clause
 tables in the FROM clause
 attributes in the SELECT clause
 Recommendations are context-aware
 Leverages past queries by user and
collaborators
Src: Nodira Khoussainova
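
A minimal sketch of context-aware suggestion from a query log. This is an assumed simplification for illustration, not the method of [Khoussainova 10]: the log is taken to be already parsed into the set of tables each past query used, and `suggest_tables` and the table names are hypothetical.

```python
from collections import Counter


def suggest_tables(query_log, tables_so_far, k=3):
    """Rank tables for the FROM clause by co-occurrence with what is typed.

    query_log: iterable of table-name collections, one per past query.
    tables_so_far: tables the user has already put in the FROM clause.
    """
    typed = set(tables_so_far)
    counts = Counter()
    for past_tables in query_log:
        past = set(past_tables)
        # Context-aware: only learn from past queries that used
        # every table typed so far.
        if typed <= past:
            counts.update(past - typed)
    return [table for table, _ in counts.most_common(k)]


log = [
    {"phaeo_genome", "sample_hits"},
    {"phaeo_genome", "sample_hits", "cog_database"},
    {"phaeo_genome", "tigrfam_go"},
    {"sample_hits", "tigrfam_go"},
]
suggestions = suggest_tables(log, ["phaeo_genome"])
print(suggestions)
```

The same counting idea extends to WHERE predicates and SELECT attributes by logging those snippets alongside the table names.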

Feature: English Query
 Lots of research on Natural Language
Interfaces to Databases
 cf. [Etzioni 2008, Zettlemoyer 2009]
 Very hard problem, in general
 Significant simplification: user can inspect
and “fix” the generated SQL prior to
execution

Feature: Simple Visualization
For each phaeo gene, count the number of matches in the COGAnnotation_surface
dataset, joining on COG id. Return the top 10 most commonly found genes.
Implementation: Export to Google Fusion Tables
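
The English question above can be written as a grouped join. The sketch below runs it with Python's built-in sqlite3 module; the table and column names are assumptions, the gene rows come from the earlier Phaeo genome listing, and the surface-sample rows (`read_1` and so on) are toy data invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phaeo_genome (query TEXT, cog_hit TEXT);
    CREATE TABLE surface_hits (query TEXT, hit TEXT);
""")
conn.executemany("INSERT INTO phaeo_genome VALUES (?, ?)", [
    ("chr_9[400001-500000].503", "COG4547"),
    ("chr_9[160001-260000].243", "COG5077"),
    ("chr_12[720001-820000].86", "COG5032"),
    ("chr_12[800001-900000].109", "COG5032"),
])
# Toy surface-sample hits (hypothetical read ids).
conn.executemany("INSERT INTO surface_hits VALUES (?, ?)", [
    ("read_1", "COG5032"),
    ("read_2", "COG5032"),
    ("read_3", "COG5032"),
    ("read_4", "COG5077"),
    ("read_5", "COG0414"),
])

# For each gene, count matching surface hits; keep the 10 most common.
top_genes = conn.execute("""
    SELECT g.query AS phaeo_gene, COUNT(*) AS n_matches
    FROM phaeo_genome AS g
    JOIN surface_hits AS s ON g.cog_hit = s.hit
    GROUP BY g.query
    ORDER BY n_matches DESC, phaeo_gene
    LIMIT 10
""").fetchall()

for gene, n in top_genes:
    print(gene, n)
```

A result of this shape (one label column, one count column) is what a simple bar chart needs.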

Dataspaces: Summary
A “Dataspace Support Platform” should
 use a “lowest common denominator” data model
 not rely crucially on upfront global consensus
 not rely crucially on “perfect” metadata
 embrace exceptions, but exploit patterns
 support task-specific, “top down” integration
 …but seek and exploit cross-cutting patterns where possible
 deliver incremental return for incremental investment
 …in data quality enhancement
 …in metadata normalization
 …in usage standardization
 …in application “convergence”

Timeline
[Figure: plot of value for users against time, scope, effort. Labels: Semantic Web (RDF/OWL, ontologies); insular data sources; data integration tools; federated databases; dataspace support platforms; dataspaces.]

Example: Metagenomics
1. Who is there?
Which organisms make up the population?
2. What are they doing?
Which metabolic pathways are present and active?
(and who is doing what?)
3. Compare datasets
- across a transect (nearshore vs. deep ocean)
- before/after some event (e.g., Spring freshet)
- across salinity/temperature gradients
- diurnal cycles (day/night)
Metagenomics, metatranscriptomics, metaproteomics: study microbial populations sampled from the environment instead of individual organisms
Source: Robin Kodner, Armbrust Lab

id query hit e_value query_start query_end hit_start hit_end hit_length
6409 FHJ7DRN01BYA61.1 TIGR00149 2.20E-21 1 84 43 125 134
6410 FHJ7DRN01BDTEA.1 TIGR00149 3.40E-09 3 42 30 69 134
6411 FHJ7DRN02HEUGQ.1 TIGR00149 1.70E-05 4 46 1 46 134
6412 FHJ7DRN01CA4BO.1 TIGR00149 5.30E-05 4 45 1 45 134
6413 FHJ7DRN01DM2FK.3 TIGR01651 5.70E-64 1 76 511 586 606
6414 FHJ7DRN01B8BPS.1 TIGR01651 1.20E-36 1 52 500 551 606
6415 FHJ7DRN02JM54P.1 TIGR01651 2.20E-24 15 80 301 366 606
6416 FHJ7DRN02FK6C5.2 TIGR00039 2.70E-16 1 45 37 85 153
6417 FHJ7DRN01D019A.1 TIGR00039 8.90E-12 5 65 48 118 153
6418 FHJ7DRN02FYAFO.1 TIGR00039 1.60E-11 1 76 67 153 153
coastal sample
Complex Example
…
[H] COG4547 Cobalamin biosynthesis protein CobT
(nicotinate-mononucleotide:5, 6-dimethylbenzimidazole
phosphoribosyltransferase)
Ype: YPMT1.87
Atu: AGl2410
Sme: SMc00701
Bme: BMEI0050
Mlo: mll3561
Ccr: CC0672
…
[J] COG5099 RNA-binding protein of the Puf family,
translational repressor
Sce: YGL014w YGL178w YJR091c YLL013c YPR042c
Spo: SPAC1687.22c SPAC4G8.03c SPAC4G9.05
SPAC6G9.14 SPBC56F2.08c SPBP35G2.14 SPCC1682.08c
Ecu: ECU11g1730
…
COG database
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein Cob
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN, SP
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein Cob
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf fam
chr_24[160001-260000].65 3542
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydrola
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and pro
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and pro
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
SwissProt web service
Browser Cross-Reference
TIGR01650 GO:0051116 contributes_to
TIGR01651 GO:0009236 NULL
TIGRFAM to GO Mapping

Background: Relational Databases
Pre-relational brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1970
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Databases: Summary
 A General Data Model: “just tables”
 Logical and Physical Data Independence
 Declarative Query Language
 via the “Relational Algebra”
 Good Scalability
 “SQL is the most successful parallel language in the world”
 Results
 $15B industry
 Nearly every (non-search engine) website you visit is backed by an RDBMS
 One of the all-time best examples of CS research impact

So what went wrong?
 DBAs!
 “Schema design” became paramount
 “Applications write queries, not users”
 Applications became tightly coupled to schema
 Ad hoc queries, ad hoc views, ad hoc data confounded predictable performance, centralized management, and strong global guarantees
 Result: Other tools enlisted to fill the gap
 Java, etc.; XML, RDF, etc.; Web Services

Key Idea: Data Independence
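Views (logical data independence separates views from relations):
SELECT seq
FROM all_sequences
WHERE seq = 'GATTACGATATTA';
Relations (physical data independence separates relations from files and pointers):
SELECT dna
FROM ncbi_sequences
WHERE dna = 'GATTACGATATTA';
Files and pointers:
/* illustrative fragment, intentionally left incomplete */
char buf[8192];
FILE *f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (1) {
    fread(buf, 1, 8192, f);
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
. . .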
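
A small sketch of both kinds of independence using Python's built-in sqlite3 module. The names `all_sequences` and `ncbi_sequences` follow the slide; the second source `local_sequences`, its column `bases`, and the index name are hypothetical additions for illustration.

```python
import sqlite3

QUERY = "SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA'"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ncbi_sequences (dna TEXT);
    INSERT INTO ncbi_sequences VALUES ('GATTACGATATTA'), ('CCCGGGAAATTT');
    CREATE VIEW all_sequences AS SELECT dna AS seq FROM ncbi_sequences;
""")
before = conn.execute(QUERY).fetchall()

# Physical change: add an index. The query text and its answer stay the same.
conn.execute("CREATE INDEX idx_dna ON ncbi_sequences (dna)")
indexed = conn.execute(QUERY).fetchall()

# Logical change: a second source appears; only the view definition changes.
conn.executescript("""
    CREATE TABLE local_sequences (bases TEXT);
    INSERT INTO local_sequences VALUES ('GATTACGATATTA');
    DROP VIEW all_sequences;
    CREATE VIEW all_sequences AS
        SELECT dna AS seq FROM ncbi_sequences
        UNION ALL
        SELECT bases AS seq FROM local_sequences;
""")
after = conn.execute(QUERY).fetchall()

print(before, indexed, after)
```

The user's query is never edited: the index changes how the answer is found, and the redefined view changes which sources contribute to it.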

Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
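
To make the three operators concrete, here is a minimal sketch over lists of dictionaries. It is an illustration under simplifying assumptions (bag semantics, equality join on one key, no duplicate column names), not how an RDBMS implements them. The sample rows come from the TIGRFAM hits and the TIGRFAM to GO mapping shown elsewhere in this deck; the column names `tigrfam` and `go` are assumed.

```python
def select(rows, predicate):
    """Keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Keep only the named columns (duplicates are not removed)."""
    return [{c: r[c] for c in columns} for r in rows]


def join(left, right, left_key, right_key):
    """Equality join: pair rows whose key values match."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


hits = [
    {"query": "FHJ7DRN01DM2FK.3", "hit": "TIGR01651", "e_value": 5.7e-64},
    {"query": "FHJ7DRN01BYA61.1", "hit": "TIGR00149", "e_value": 2.2e-21},
    {"query": "FHJ7DRN02HEUGQ.1", "hit": "TIGR00149", "e_value": 1.7e-05},
]
go_map = [
    {"tigrfam": "TIGR01650", "go": "GO:0051116"},
    {"tigrfam": "TIGR01651", "go": "GO:0009236"},
]

# Operators compose: select strong hits, join to GO terms, project two columns.
strong = select(hits, lambda r: r["e_value"] < 1e-10)
result = project(join(strong, go_map, "hit", "tigrfam"), ["query", "go"])
print(result)
```

Because each operator takes tables and returns a table, the calls can be nested or reordered, which is what makes algebraic reasoning about queries possible.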

Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
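
The rewrite above can be sketched as a tiny rule-based simplifier over expression trees. This is an illustrative toy, not a query optimizer: expressions are nested tuples `(op, left, right)`, and only the four laws listed on the slide are encoded.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(e):
    """Apply the slide's algebraic laws bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == "+" and b == 0:  # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:  # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 lets us write it as (x + y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(e):
    """Number of operator nodes in the tree."""
    return 0 if not isinstance(e, tuple) else 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e, z):
    """Evaluate the tree for a given value of the variable z."""
    if isinstance(e, tuple):
        return OPS[e[0]](evaluate(e[1], z), evaluate(e[2], z))
    return z if e == "z" else e


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)
print(optimized, count_ops(original), count_ops(optimized))
```

Both trees evaluate to the same number for any z; the optimizer only changes how much work is done to get it.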

My Interests
Computer Science
Scientific Data Management
Databases
Data-Intensive Scalable Computing
Research Data Integration
Cloud Computing
Visual Data Analytics

Research Cycle
[Figure: research cycle. Stages: Observe, Experiment, Analyze, Publish/Share, Synthesis.]

[Figure: related research areas. Labels: Web Services; Data Management; Query Languages; Storage; Access Methods; Cloud Computing; Visualization; Workflow; Information Integration; Information Extraction; Data Mining; Distributed Programming Models; complexity-hiding interfaces.]
