Research Dataspaces:
Pay-as-you-go Integration and Analysis
Bill Howe, PhD
University of Washington
3/12/09 Bill Howe, eScience Institute
Data acquisition is no longer the
bottleneck to scientific discovery
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
- Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
- Oceanography: high-resolution models, cheap sensors, satellites
- Biology: lab automation, high-throughput sequencing
Two dimensions
[Figure: projects plotted by # of bytes against # of data types. Field labels: Biology, Oceanography, Astronomy. Labeled points: LSST, SDSS, PanSTARRS, IOOS, OOI, Galaxy, BioMart, GEO, LANL HIV, Pathway Commons.]
Building a Research Data Management System:
Status Quo
1. Establish (scientific) consensus
   Scope, vision, requirements, terminology
2. Derive and encode a domain model (schema)
   Encode shared knowledge in a machine-readable manner
   a. Relational schema, ontology, metadata standards, conventions, controlled vocabularies, object model, API
   b. Mappings between existing models
3. Retrofit new domain model to existing data
   Populate the schema, attach semantics, clean data
4. Build applications
   Use domain model to inform design
5. Analyze data
   Do science
The Value of a Data Repository
$V_R = B D^2 + U D + C$
D = # of datasets in the repository
B = # of binary operations facilitated
U = # of unary operations facilitated
C = intrinsic value of the schema (for communication, etc.)
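A minimal sketch of the formula above, assuming hypothetical coefficient values (the slide gives none). It shows how the quadratic term, the one tied to binary operations, comes to dominate as the number of datasets D grows.

```python
def repository_value(d, b, u, c):
    """Evaluate V_R = B*D^2 + U*D + C for a repository holding d datasets."""
    return b * d**2 + u * d + c


# Hypothetical coefficients, chosen only for illustration.
B, U, C = 2, 5, 100

for d in (10, 100, 1000):
    total = repository_value(d, B, U, C)
    binary_share = B * d**2 / total
    print(f"D={d:5d}  V_R={total:9d}  share from binary operations={binary_share:.2f}")
```

Under these assumed numbers the binary-operation term accounts for more than half of the value at D = 10 and nearly all of it at D = 1000.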
Quote
A typical biological data management system involves accessing
or gathering data from multiple sources, followed by data
correlation, classification, review, and curation using domain
specific tools (e.g., functional clusters, ontologies) and expertise.
In practice, biological data management is less daunting when it is
considered in the context of an iterative strategy based on gradual
data integration while accumulating domain specific knowledge
throughout the integration process.
Victor Markowitz, LBNL
Outline
- Challenges
- Dataspaces
- Dataspace Support Platforms
- Next Steps
slide source: Alon Halevy
Franklin, Halevy, Maier 2005
Dataspaces
Data Management Solutions
Databases vs. Dataspaces
Databases                      | Dataspaces
Single Schema                  | Data “Coexistence”
Centralized Administration     | Autonomous Sources
Structured Query               | Search, Browse, Approximate Answers
Strict Integrity Constraints   | Patterns and trends; few global properties
Dataspaces vs. Databases (2)
- Databases are Exclusive
  - Reject data that violates types, schema, integrity constraints, rules + triggers
  - In return:
    - structured query, logical and physical data independence, transactions
    - …over the clean subset of your data
- Dataspaces are Inclusive
  - Few restrictions; all data is welcome
  - In return, best effort services at first:
    - Cataloging, keywords, attribute-value
    - …over (almost) everything
Dataspace Services
Catalog
Keyword search
Structured Query
Analysis and Vis
Task-specific Tools
Time
Over time, a dataset becomes accessible by additional services
Dataspace Services
Keyword Search
Structured Query
Analysis and Visualization
Task-specific Applications
Cataloguing
Dataspace Services
Cataloguing
Keyword Search
Structured Query
Analysis and Vis
Task-specific Tools
Example: The Internet
Example: Ocean Circulation Forecasting System
[Figure: system diagram. Forcings (i.e., inputs): atmospheric models, tides, river discharge. Components: FORTRAN, cluster, perl and cron, filesystem, RDBMS. Contents: simulation results, config and log files, intermediate files, annotations, data products, relations. Outputs: salinity isolines, station extractions, model-data comparisons, products via the web.]
Example: Environmental Metagenomics
[Figure: metagenomics pipeline diagram. Labels: SAMPLING (metagenomes 1-4); sequencing; raw data; environment metadata; ANNOTATION TABLES (Pfams, TIGRfams, COGs, FIGfams), precomputed; CAMERA annotation of Pfams, TIGRfams, COGs, FIGfams; seed alignment; HMMer search of meta*ome; PPLACER; reference tree; aligned meta*ome fragments; STATs; taxonomic info; analyzed data; SQLShare. Questions: correlate diversity w/environment; correlate diversity and nutrients; find new genes; find new taxa and their distributions; compare meta*omes.]
src: Robin Kodner
Example: CHAVI
[Figure labels: Relational Dataspace; Interface and Analysis; B Cell Control; T Cell Control; NK Cell Control; Genetics Databases; NHP Database; Virus Seq. Data]
src: Bart Haynes
Outline
- Challenges
- Dataspaces
- Dataspace Support Platforms
- Next Steps
Example Systems cast as DSSPs
- Atlas (LabKey)
  - data model: tables and files
  - Mark Igra will present
- “Data Warehouse” prototype (SCHARP)
  - data model: relations
- SQLShare (UW eScience)
  - data model: relations
- Quarry [Howe, et al. 2006]
  - data model: triples
- iTrails [Salles et al. 2007]
  - data model: triples
- Google Fusion Tables [Halevy 2010]
  - data model: relations
[Figure: metagenomics workflow. Labels: Environmental Sampling; Sequencing; metadata; Public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); Phylogeny; search hits; taxonomic info. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?]
“90% of my time spent manipulating data rather than doing science”
Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
###query                    length  COG hit #1  e-value #1  identity #1  score #1  hit length #1  description #1
chr_4[480001-580000].287    4500
chr_4[560001-660000].1      3556
chr_9[400001-500000].503    4211    COG4547     2.00E-04    19           44.6      620            Cobalamin biosynthesis protein C
chr_9[320001-420000].548    2833    COG5406     2.00E-04    38           43.9      1001           Nucleosome binding factor SPN,
chr_27[320001-404298].20    3991    COG4547     5.00E-05    18           46.2      620            Cobalamin biosynthesis protein C
chr_26[320001-420000].378   3963    COG5099     5.00E-05    17           46.2      777            RNA-binding protein of the Puf fa
chr_26[400001-441226].196   2949    COG5099     2.00E-04    17           43.9      777            RNA-binding protein of the Puf fa
chr_24[160001-260000].65    3542
chr_5[720001-820000].339    3141    COG5099     4.00E-09    20           59.3      777            RNA-binding protein of the Puf fa
chr_9[160001-260000].243    3002    COG5077     1.00E-25    26           114       1089           Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86    2895    COG5032     2.00E-09    30           60.5      2105           Phosphatidylinositol kinase and p
chr_12[800001-900000].109   1463    COG5032     1.00E-09    30           60.1      2105           Phosphatidylinositol kinase and p
chr_11[1-100000].70         2886
chr_11[80001-180000].100    1523

COGAnnotation_coastal_sample.txt
id    query             hit      e_value   identity_  score  query_start  query_end  hit_start  hit_end  hit_length
1     FHJ7DRN01A0TND.1  COG0414  1.00E-08  28         51     1            74         180        257      285
2     FHJ7DRN01A1AD2.2  COG0092  3.00E-20  47         89.9   6            85         41         120      233
3     FHJ7DRN01A2HWZ.4  COG3889  0.0006    26         35.8   9            94         758        845      872
…
2853  FHJ7DRN02HXTBY.5  COG5077  7.00E-09  37         52.3   3            77         313        388      1089
2854  FHJ7DRN02HZO4J.2  COG0444  2.00E-31  67         127    1            73         135        207      316
…
3566  FHJ7DRN02FUJW3.1  COG5032  1.00E-09  32         54.7   1            75         1965       2038     2105
…
select *
from annotationsummary_combinedorfannotation16_phaeo_genome,
COGAnnotation_surface
where phaeo_gene = surf_hit
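A minimal sketch of this kind of join, run with Python's built-in sqlite3 module rather than SQLShare. The table and column names (phaeo_genome, cog_annotation_sample, cog_hit, hit) are simplified stand-ins chosen for illustration, and only a few rows from the two tables on this slide are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phaeo_genome (query TEXT, length INTEGER, cog_hit TEXT);
CREATE TABLE cog_annotation_sample (id INTEGER, query TEXT, hit TEXT, e_value REAL);
""")

# A few rows copied from the genome annotation summary; one has no COG hit.
conn.executemany("INSERT INTO phaeo_genome VALUES (?, ?, ?)", [
    ("chr_9[160001-260000].243", 3002, "COG5077"),
    ("chr_12[720001-820000].86", 2895, "COG5032"),
    ("chr_4[480001-580000].287", 4500, None),
])

# A few rows copied from the sample annotation file.
conn.executemany("INSERT INTO cog_annotation_sample VALUES (?, ?, ?, ?)", [
    (1, "FHJ7DRN01A0TND.1", "COG0414", 1.00e-08),
    (2853, "FHJ7DRN02HXTBY.5", "COG5077", 7.00e-09),
    (3566, "FHJ7DRN02FUJW3.1", "COG5032", 1.00e-09),
])

# Pair each genome region with the sample reads that hit the same COG.
rows = conn.execute("""
SELECT g.query, s.query, g.cog_hit
FROM phaeo_genome AS g
JOIN cog_annotation_sample AS s ON g.cog_hit = s.hit
ORDER BY g.cog_hit
""").fetchall()

for row in rows:
    print(row)
```

Genome regions without a COG hit drop out of the result, because a NULL never equals anything in a join condition.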
[Figure: the same metagenomics workflow, now with SQL and SQLShare between the data sources and the questions. Labels: Environmental Sampling; Sequencing; metadata; Public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); Phylogeny; search hits; taxonomic info. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?]
“That took me a week with Excel”
“I can do science again”
SQLShare Motivation
- Conventional wisdom says “Scientists won’t write SQL”
- We don’t believe it
- Instead, we implicate difficulty in
  - installation
  - configuration
  - schema design
  - performance tuning
  - data ingest
  - over-reliance on GUIs
We ask “What kind of technology would make SQL a natural fit for hypothesis testing?”
SQLShare Features
- Collaborative SQL authoring and sharing
- Views for incremental abstraction and integration
- Semi-automatic integration
  - Identify “natural” unions and joins
- SQL Autocomplete
  - User starts typing, system uses query logs to make suggestions [Khoussainova 10]
- English Query
  - Bootstrap a SQL query from an English question
- Simple Visualization
  - via integration with Google Fusion Tables
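One way to picture "identify natural joins" is to score pairs of columns by how much their values overlap. The sketch below is an assumption made for illustration, not SQLShare's actual method: it uses Jaccard similarity, and the function names, the threshold, and the sample columns are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections (0.0 when either is empty)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_joins(left, right, threshold=0.3):
    """Return (left_column, right_column, score) for column pairs whose
    values overlap at least `threshold`, best candidates first.

    `left` and `right` map column names to lists of values.
    """
    candidates = []
    for left_col, left_values in left.items():
        for right_col, right_values in right.items():
            score = jaccard(left_values, right_values)
            if score >= threshold:
                candidates.append((left_col, right_col, score))
    return sorted(candidates, key=lambda c: -c[2])


# Hypothetical columns, with values taken from the example tables in this deck.
genome = {
    "query": ["chr_9[160001-260000].243", "chr_12[720001-820000].86"],
    "cog_hit": ["COG4547", "COG5099", "COG5077", "COG5032"],
}
sample = {
    "query": ["FHJ7DRN01A0TND.1", "FHJ7DRN02HXTBY.5"],
    "hit": ["COG0414", "COG5077", "COG5032"],
}

suggestions = suggest_joins(genome, sample)
print(suggestions)
```

Only the cog_hit/hit pair clears the threshold here. The two `query` columns share a name but no values, which is why the sketch compares values rather than column names.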
Outline
- Challenges
- Dataspaces
- Dataspace Support Platforms
- Next Steps
Next Steps
- Define scope
- Define HIV Dataspace team
- Build a minimal technical team
  - “Data Wrangler”
  - “Application Wrangler”
- Identify and catalog dataspace “participants” (i.e., sources)
- Review data access rights and security requirements
- Gather “spanning basis” of questions to answer
  - Jim Gray’s “20 questions” methodology
- Gather “spanning basis” of existing data
  - use exemplars if necessary
  - load data “as is” into a database
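Loading data "as is" can be as simple as the sketch below, which assumes a tab-delimited file with a header row and uses Python's csv and sqlite3 modules. The helper name load_as_is and the table name are hypothetical. Every column is stored as TEXT so that no value is rejected at load time; types can be imposed later.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, text, delimiter="\t"):
    """Create `table` from delimited text, keeping every column as TEXT.

    Short rows are padded with NULLs and long rows are truncated, so nothing
    is rejected on ingest. Assumes header names contain no double quotes.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    width = len(header)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = [(row + [None] * width)[:width] for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


# Two rows in the layout of the sample annotation file shown earlier.
raw = (
    "id\tquery\thit\te_value\n"
    "1\tFHJ7DRN01A0TND.1\tCOG0414\t1.00E-08\n"
    "2\tFHJ7DRN01A1AD2.2\tCOG0092\t3.00E-20\n"
)

conn = sqlite3.connect(":memory:")
loaded = load_as_is(conn, "coastal_sample", raw)
hit = conn.execute(
    'SELECT hit FROM "coastal_sample" WHERE id = ?', ("2",)
).fetchone()[0]
print(loaded, hit)
```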
Next Steps (2)
- Answer initial questions (Data wrangler)
  - RDBMS example: create views
- Visualize initial answers (Application wrangler)
- Demonstrate early progress
- Check breadth (what’s missing?)
- Check depth (did “hard” questions get answered?)
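A sketch of the "create views" step, assuming data was first loaded "as is" with every column as text. The table and view names and the 1e-5 e-value cutoff are hypothetical. The first view adds types without touching the raw table; the second answers one question on top of the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw table: everything is text, exactly as it arrived.
conn.execute("CREATE TABLE raw_hits (id TEXT, query TEXT, hit TEXT, e_value TEXT)")
conn.executemany("INSERT INTO raw_hits VALUES (?, ?, ?, ?)", [
    ("1", "FHJ7DRN01A0TND.1", "COG0414", "1.00E-08"),
    ("3", "FHJ7DRN01A2HWZ.4", "COG3889", "0.0006"),
    ("2854", "FHJ7DRN02HZO4J.2", "COG0444", "2.00E-31"),
])

# First view: attach types. The raw data stays untouched.
conn.execute("""
CREATE VIEW hits AS
SELECT CAST(id AS INTEGER) AS id,
       query,
       hit,
       CAST(e_value AS REAL) AS e_value
FROM raw_hits
""")

# Second view: a question-specific abstraction layered on the first.
conn.execute("""
CREATE VIEW confident_hits AS
SELECT * FROM hits WHERE e_value < 1e-5
""")

result = conn.execute("SELECT hit FROM confident_hits ORDER BY id").fetchall()
print(result)
```

Each view is a small, reversible investment: if the cutoff or the typing turns out to be wrong, the view is redefined and the loaded data never changes.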
Summary
- Conventional “schema-first” approaches break down in research contexts
- The dataspace abstraction and DSSPs offer a way forward
- Systems and best practices are emerging in the literature and from production deployments
BACKUP SLIDES
Feature: Sharing SQL
Feature: SQL Autocomplete
- User requests suggestions on the fly while typing a query
- Recommends snippets:
  - predicates in the WHERE clause
  - tables in the FROM clause
  - attributes in the SELECT clause
- Recommendations are context-aware
- Leverages past queries by user and collaborators
Src: Nodira Khoussainova
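To make "context-aware suggestions from past queries" concrete, here is a deliberately crude sketch. It is not the system cited above: it assumes a plain list of logged query strings, uses regular expressions instead of a SQL parser, and ranks WHERE predicates by how often they appeared in past queries over the same tables. All names are hypothetical.

```python
import re
from collections import Counter


def tables_in(query):
    """Table names that follow FROM or JOIN (regex heuristic, not a parser)."""
    return re.findall(
        r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", query, flags=re.IGNORECASE
    )


def where_predicates(query):
    """Split the WHERE clause on AND into individual predicate strings."""
    match = re.search(r"\bwhere\b(.*)", query, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    parts = re.split(r"\band\b", match.group(1), flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def suggest_predicates(partial, log, k=3):
    """Suggest up to k predicates seen in logged queries that share a table
    with the partially typed query."""
    context = {t.lower() for t in tables_in(partial)}
    counts = Counter()
    for past in log:
        if context & {t.lower() for t in tables_in(past)}:
            counts.update(where_predicates(past))
    return [predicate for predicate, _ in counts.most_common(k)]


# Hypothetical query log shared by a user and collaborators.
log = [
    "SELECT * FROM hits WHERE e_value < 1e-5",
    "SELECT query FROM hits WHERE e_value < 1e-5 AND hit = 'COG5077'",
    "SELECT * FROM samples WHERE depth > 100",
]

suggested = suggest_predicates("SELECT query FROM hits WHERE ", log)
print(suggested)
```

The predicate on `samples` is never offered, because the partial query does not mention that table; that filter is the only "context" this sketch models.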
Feature: English Query
- Lots of research on Natural Language Interfaces to Databases
  - cf. [Etzioni 2008, Zettlemoyer 2009]
- Very hard problem, in general
- Significant simplification: user can inspect and “fix” the generated SQL prior to execution
Feature: Simple Visualization
For each phaeo gene, count the number of matches in the COGAnnotation_surface
dataset, joining on COG id. Return the top 10 most commonly found genes.
Implementation: Export to Google Fusion Tables
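The English question above maps onto a join, a GROUP BY, and a LIMIT. The sketch below runs one plausible translation with sqlite3. The column names (phaeo_gene, cog_id, read_id) and the rows are hypothetical, since the slide does not show the schema of either table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phaeo_genome (phaeo_gene TEXT, cog_id TEXT);
CREATE TABLE COGAnnotation_surface (read_id TEXT, cog_id TEXT);
""")

# Hypothetical rows, for illustration only.
conn.executemany("INSERT INTO phaeo_genome VALUES (?, ?)", [
    ("gene_a", "COG5077"),
    ("gene_b", "COG5032"),
    ("gene_c", "COG0414"),
])
conn.executemany("INSERT INTO COGAnnotation_surface VALUES (?, ?)", [
    ("r1", "COG5077"),
    ("r2", "COG5077"),
    ("r3", "COG5077"),
    ("r4", "COG5032"),
    ("r5", "COG9999"),
])

# For each phaeo gene, count matching surface reads; keep the 10 most common.
top = conn.execute("""
SELECT g.phaeo_gene, COUNT(*) AS n_matches
FROM phaeo_genome AS g
JOIN COGAnnotation_surface AS s ON g.cog_id = s.cog_id
GROUP BY g.phaeo_gene
ORDER BY n_matches DESC, g.phaeo_gene
LIMIT 10
""").fetchall()

print(top)
```

With an inner join, genes that have no match in the surface dataset do not appear at all. A LEFT JOIN with COUNT(s.read_id) would list them with a count of zero, which is the kind of detail a user can "fix" by inspecting the generated SQL.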
Dataspaces: Summary
A “Dataspace Support Platform” should
- use a “lowest common denominator” data model
- not rely crucially on upfront global consensus
- not rely crucially on “perfect” metadata
- embrace exceptions, but exploit patterns
- support task-specific, “top down” integration
  - …but seek and exploit cross-cutting patterns where possible
- deliver incremental return for incremental investment
  - …in data quality enhancement
  - …in metadata normalization
  - …in usage standardization
  - …in application “convergence”
Timeline
[Figure: value for users plotted against time, scope, effort. Labels: Insular Data Sources; Data Integration Tools; Federated Databases; Semantic Web, RDF/OWL, Ontologies; Dataspace support platforms; Dataspaces.]
Example: Metagenomics
1. Who is there?
Which organisms make up the population?
2. What are they doing?
Which metabolic pathways are present and active?
(and who is doing what?)
3. Compare datasets
- across a transect (nearshore vs. deep ocean)
- before/after some event (e.g., Spring freshet)
- across salinity/temperature gradients
- diurnal cycles (day/night)
metagenomics
metatranscriptomics
metaproteomics
Study microbial populations
sampled from the environment
instead of individual organisms
Source: Robin Kodner, Armbrust Lab
id query hit e_value query_start query_end hit_start hit_end hit_length
6409 FHJ7DRN01BYA61.1 TIGR00149 2.20E-21 1 84 43 125 134
6410 FHJ7DRN01BDTEA.1 TIGR00149 3.40E-09 3 42 30 69 134
6411 FHJ7DRN02HEUGQ.1 TIGR00149 1.70E-05 4 46 1 46 134
6412 FHJ7DRN01CA4BO.1 TIGR00149 5.30E-05 4 45 1 45 134
6413 FHJ7DRN01DM2FK.3 TIGR01651 5.70E-64 1 76 511 586 606
6414 FHJ7DRN01B8BPS.1 TIGR01651 1.20E-36 1 52 500 551 606
6415 FHJ7DRN02JM54P.1 TIGR01651 2.20E-24 15 80 301 366 606
6416 FHJ7DRN02FK6C5.2 TIGR00039 2.70E-16 1 45 37 85 153
6417 FHJ7DRN01D019A.1 TIGR00039 8.90E-12 5 65 48 118 153
6418 FHJ7DRN02FYAFO.1 TIGR00039 1.60E-11 1 76 67 153 153
coastal sample
Complex Example
…
[H] COG4547 Cobalamin biosynthesis protein CobT
(nicotinate-mononucleotide:5, 6-dimethylbenzimidazole
phosphoribosyltransferase)
Ype: YPMT1.87
Atu: AGl2410
Sme: SMc00701
Bme: BMEI0050
Mlo: mll3561
Ccr: CC0672
…
[J] COG5099 RNA-binding protein of the Puf family,
translational repressor
Sce: YGL014w YGL178w YJR091c YLL013c YPR042c
Spo: SPAC1687.22c SPAC4G8.03c SPAC4G9.05
SPAC6G9.14 SPBC56F2.08c SPBP35G2.14 SPCC1682.08c
Ecu: ECU11g1730
…
COG database
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
###query                    length  COG hit #1  e-value #1  identity #1  score #1  hit length #1  description #1
chr_4[480001-580000].287    4500
chr_4[560001-660000].1      3556
chr_9[400001-500000].503    4211    COG4547     2.00E-04    19           44.6      620            Cobalamin biosynthesis protein Cob
chr_9[320001-420000].548    2833    COG5406     2.00E-04    38           43.9      1001           Nucleosome binding factor SPN, SP
chr_27[320001-404298].20    3991    COG4547     5.00E-05    18           46.2      620            Cobalamin biosynthesis protein Cob
chr_26[320001-420000].378   3963    COG5099     5.00E-05    17           46.2      777            RNA-binding protein of the Puf fam
chr_26[400001-441226].196   2949    COG5099     2.00E-04    17           43.9      777            RNA-binding protein of the Puf fam
chr_24[160001-260000].65    3542
chr_5[720001-820000].339    3141    COG5099     4.00E-09    20           59.3      777            RNA-binding protein of the Puf fam
chr_9[160001-260000].243    3002    COG5077     1.00E-25    26           114       1089           Ubiquitin carboxyl-terminal hydrola
chr_12[720001-820000].86    2895    COG5032     2.00E-09    30           60.5      2105           Phosphatidylinositol kinase and pro
chr_12[800001-900000].109   1463    COG5032     1.00E-09    30           60.1      2105           Phosphatidylinositol kinase and pro
chr_11[1-100000].70         2886
chr_11[80001-180000].100    1523
SwissProt web service
Browser Cross-Reference
TIGR01650 GO:0051116 contributes_to
TIGR01651 GO:0009236 NULL
TIGR01651 GO:0051116 NULL
TIGR01660 GO:0008940 NULL
TIGR01660 GO:0009061 NULL
TIGR01660 GO:0009325 NULL
TIGR01663 GO:0000012 NULL
TIGR01663 GO:0046403 NULL
TIGRFAM to GO Mapping
Background: Relational Databases
Pre-relational brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1970
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
Relational Databases: Summary
- A General Data Model: “just tables”
- Logical and Physical Data Independence
- Declarative Query Language
  - via the “Relational Algebra”
- Good Scalability
  - “SQL is the most successful parallel language in the world”
- Results
  - $15B industry
  - Nearly every (non-search engine) website you visit is backed by an RDBMS
  - One of the all-time best examples of CS research impact
So what went wrong?
- DBAs!
  - “Schema design” became paramount
  - “Applications write queries, not users”
    - Applications became tightly coupled to schema
    - Ad hoc queries, ad hoc views, ad hoc data confounded predictable performance, centralized management, and strong global guarantees
- Result: Other tools enlisted to fill the gap
  - Java, etc.; XML, RDF, etc.; Web Services
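Key Idea: Data Independence
[Figure: three layers. Views over relations: logical data independence. Relations over files and pointers: physical data independence.]

Query against a view:
SELECT seq
FROM all_sequences
WHERE seq = 'GATTACGATATTA';

Query against a relation:
SELECT dna
FROM ncbi_sequences
WHERE dna = 'GATTACGATATTA';

Access through files and pointers:
/* scan the raw file by hand; the byte offset is hard-coded */
f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (fread(buf, 1, 8192, f) > 0) {
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
        . . .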
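The sketch below illustrates both kinds of independence with sqlite3, reusing the names all_sequences and ncbi_sequences from the queries above. The second source, local_sequences, and the index name are hypothetical. The application query string never changes while the storage and the logical layout change underneath it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (dna TEXT)")
conn.executemany(
    "INSERT INTO ncbi_sequences VALUES (?)",
    [("GATTACGATATTA",), ("CCGGTTAA",)],
)

# The application only ever sees the view.
conn.execute("CREATE VIEW all_sequences AS SELECT dna AS seq FROM ncbi_sequences")
app_query = "SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA'"
before = conn.execute(app_query).fetchall()

# Physical change: add an index. The query text is unaffected.
conn.execute("CREATE INDEX idx_dna ON ncbi_sequences (dna)")
after_index = conn.execute(app_query).fetchall()

# Logical change: a second source appears and the view is redefined over both.
conn.execute("CREATE TABLE local_sequences (bases TEXT)")
conn.execute("INSERT INTO local_sequences VALUES ('GATTACGATATTA')")
conn.execute("DROP VIEW all_sequences")
conn.execute("""
CREATE VIEW all_sequences AS
SELECT dna AS seq FROM ncbi_sequences
UNION ALL
SELECT bases AS seq FROM local_sequences
""")
after_view = conn.execute(app_query).fetchall()

print(len(before), len(after_index), len(after_view))
```

Contrast this with the file-and-pointer fragment, where a change to the byte offset or the block size means editing and recompiling the program.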
Key Idea: An Algebra of Tables
[Figure: tables combined by select, project, and join operators]
Other operators: aggregate, union, difference, cross product
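A small sketch of three of these operators, assuming a relation is represented as a list of dictionaries. The function names and sample rows are illustrative; a real engine works very differently inside. The point is that each operator takes tables and returns a table, so operators compose.

```python
def select(relation, predicate):
    """Keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Keep only the named attributes, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def join(left, right, left_attr, right_attr):
    """Equi-join: pair rows whose join attributes are equal."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_attr] == r[right_attr]
    ]


genes = [
    {"gene": "chr_9[160001-260000].243", "cog": "COG5077"},
    {"gene": "chr_12[720001-820000].86", "cog": "COG5032"},
]
hits = [
    {"read": "FHJ7DRN02HXTBY.5", "hit": "COG5077", "e_value": 7e-9},
    {"read": "FHJ7DRN02FUJW3.1", "hit": "COG5032", "e_value": 1e-9},
    {"read": "FHJ7DRN01A2HWZ.4", "hit": "COG3889", "e_value": 0.0006},
]

# Compose the operators: join, then select, then project.
result = project(
    select(join(genes, hits, "cog", "hit"), lambda row: row["e_value"] < 5e-9),
    ["gene", "read"],
)
print(result)
```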
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
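Written out one law at a time, in the order given on the slide (rules 1, 3, 4, 2), the rewrite proceeds as follows:

```latex
\begin{align*}
N &= \bigl((z \cdot 2) + ((z \cdot 3) + 0)\bigr) / 1 \\
  &= \bigl((z \cdot 2) + (z \cdot 3)\bigr) / 1 && \text{rule 1: } x + 0 = x \\
  &= \bigl(z \cdot (2 + 3)\bigr) / 1           && \text{rule 3: } n x + n y = n (x + y) \\
  &= \bigl((2 + 3) \cdot z\bigr) / 1           && \text{rule 4: } x y = y x \\
  &= (2 + 3) \cdot z                           && \text{rule 2: } x / 1 = x
\end{align*}
```

The starting expression uses two multiplications, two additions, and one division; the final one uses a single addition and a single multiplication.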
My Interests
Computer Science
Scientific Data Management
Databases
Data-Intensive Scalable Computing
Research Data Integration
Cloud Computing
Visual Data Analytics
Research Cycle
Observe
Experiment
Analyze
Publish/Share
Synthesis
[Figure labels: Data Management; Web Services; Query Languages; Storage; Access Methods; Cloud Computing; Visualization; Workflow; Information Integration; Information Extraction; Data Mining; Distributed Programming Models; complexity-hiding interfaces]

Accelerating data-intensive science by outsourcing the mundane
 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebBeyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
Accomplishments And Challenges In Bioinformatics
Accomplishments And Challenges In BioinformaticsAccomplishments And Challenges In Bioinformatics
Accomplishments And Challenges In Bioinformatics
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
Data Standards & Best Practices for the Stratigraphic Record
Data Standards & Best Practices for the Stratigraphic RecordData Standards & Best Practices for the Stratigraphic Record
Data Standards & Best Practices for the Stratigraphic Record
 

Más de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersUniversity of Washington
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
 

Más de University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 

Último

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Último (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Research Dataspaces: Pay-as-you-go Integration and Analysis

  • 1. Research Dataspaces: Pay-as-you-go Integration and Analysis. Bill Howe, PhD, University of Washington.
  • 2. 3/12/09, Bill Howe, eScience Institute. Data acquisition is no longer the bottleneck to scientific discovery. Old model: “Query the world” (data acquisition coupled to a specific hypothesis). New model: “Download the world” (data acquired en masse, in support of many hypotheses). Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS). Oceanography: high-resolution models, cheap sensors, satellites. Biology: lab automation, high-throughput sequencing.
  • 3. [Figure: data sources placed along two dimensions, # of bytes and # of data types, grouped as Biology, Oceanography, and Astronomy. Labeled sources: LSST, SDSS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons, PanSTARRS.]
  • 4. Building a Research Data Management System: Status Quo. 1. Establish (scientific) consensus: scope, vision, requirements, terminology. 2. Derive and encode a domain model (schema): encode shared knowledge in a machine-readable manner (a. relational schema, ontology, metadata standards, conventions, controlled vocabularies, object model, API; b. mappings between existing models). 3. Retrofit new domain model to existing data: populate the schema, attach semantics, clean data. 4. Build applications: use domain model to inform design. 5. Analyze data: do science.
  • 5. The Value of a Data Repository: $V_R = B D^2 + U D + C$, where D = # of datasets in the repository, B = # of binary operations facilitated, U = # of unary operations facilitated, C = intrinsic value of the schema (for communication, etc.).
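The repository-value formula on slide 5 can be tried out numerically. The sketch below is only an illustration: the function name, parameter names, and all input numbers are hypothetical, and the slide does not say how B, U, or C would be measured in practice.

```python
def repository_value(num_datasets, binary_ops, unary_ops, schema_value=0.0):
    """Evaluate V_R = B * D^2 + U * D + C from slide 5.

    num_datasets  -> D, number of datasets in the repository
    binary_ops    -> B, number of binary operations facilitated
    unary_ops     -> U, number of unary operations facilitated
    schema_value  -> C, intrinsic value of the schema
    """
    d = num_datasets
    return binary_ops * d ** 2 + unary_ops * d + schema_value


# Hypothetical inputs: only D changes between the two calls.
small = repository_value(num_datasets=10, binary_ops=2, unary_ops=5, schema_value=100.0)
large = repository_value(num_datasets=100, binary_ops=2, unary_ops=5, schema_value=100.0)
print(small, large)
```

With B and U held fixed, growing D from 10 to 100 makes the quadratic term dominate, which is one way to read the slide's argument that value comes from operations that combine datasets.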
  • 6. Quote: “A typical biological data management system involves accessing or gathering data from multiple sources, followed by data correlation, classification, review, and curation using domain specific tools (e.g., functional clusters, ontologies) and expertise. In practice, biological data management is less daunting when it is considered in the context of an iterative strategy based on gradual data integration while accumulating domain specific knowledge throughout the integration process.” Victor Markowitz, LBNL
  • 7. Outline: Challenges; Dataspaces; Dataspace Support Platforms; Next Steps.
  • 8. Dataspaces [Franklin, Halevy, Maier 2005]. Slide source: Alon Halevy.
  • 9. Data Management Solutions
  • 10. Databases vs. Dataspaces. Single schema vs. data “coexistence”; centralized administration vs. autonomous sources; structured query vs. search, browse, approximate answers; strict integrity constraints vs. patterns and trends, few global properties.
  • 11. Dataspaces vs. Databases (2). Databases are exclusive: they reject data that violates types, schema, integrity constraints, rules + triggers. In return: structured query, logical and physical data independence, transactions … over the clean subset of your data. Dataspaces are inclusive: few restrictions; all data is welcome. In return, best-effort services at first: cataloging, keywords, attribute-value … over (almost) everything.
  • 12. Dataspace Services: Catalog, Keyword search, Structured Query, Analysis and Vis, Task-specific Tools. Time axis: over time, a dataset becomes accessible by additional services.
  • 13. Dataspace Services: Keyword Search, Structured Query, Analysis and Visualization, Task-specific Applications, Cataloguing.
  • 14. Dataspace Services: Cataloguing, Keyword Search, Structured Query, Analysis and Vis, Task-specific Tools.
  • 15. Example: The Internet
  • 16. Example: Ocean Circulation Forecasting System. [Diagram labels: forcings (i.e., inputs): atmospheric models, tides, river discharge; FORTRAN; cluster; perl and cron; filesystem; RDBMS; simulation results; config and log files; intermediate files; annotations; data products; relations; salinity isolines; station extractions; model-data comparisons; products via the web.]
  • 17. Example: Environmental Metagenomics. [Diagram labels: sampling; environment metadata; sequencing; raw data; metagenomes 1 to 4; CAMERA annotation; annotation tables (Pfams, TIGRfams, COGs, FIGfams); seed alignment; HMMer search of meta*ome; reference tree; aligned meta*ome fragments; precomputed; PPLACER of Pfams, TIGRfams, COGs, FIGfams; STATs; taxonomic info; analyzed data; SQLShare. Questions: correlate diversity w/environment; correlate diversity and nutrients; find new genes; find new taxa and their distributions; compare meta*omes.] Src: Robin Kodner
  • 18. Example: CHAVI. [Diagram labels: Relational Dataspace; Interface and Analysis; B Cell Control; T Cell Control; NK Cell Control; Genetics Databases; NHP Database; Virus Seq. Data.] Src: Bart Haynes
  • 19. Outline: Challenges; Dataspaces; Dataspace Support Platforms; Next Steps.
  • 20. Example Systems cast as DSSPs. Atlas (LabKey), data model: tables and files (Mark Igra will present). “Data Warehouse” prototype (SCHARP), data model: relations. SQLShare (UW eScience), data model: relations. Quarry [Howe, et al. 2006], data model: triples. iTrails [Salles et al. 2007], data model: triples. Google Fusion Tables [Halevy 2010], data model: relations.
  • 21. [Diagram labels: Environmental Sampling; Sequencing; Public annotation DBs; metadata; search hits; taxonomic info; Pfams, TIGRfams, COGs, FIGfams; Phylogeny. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?] “90% of my time spent manipulating data rather than doing science”
  • 22. Simple Example
      ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome:
      ###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
      chr_4[480001-580000].287 | 4500 | | | | | |
      chr_4[560001-660000].1 | 3556 | | | | | |
      chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein C
      chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN,
      chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein C
      chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf fa
      chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf fa
      chr_24[160001-260000].65 | 3542 | | | | | |
      chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf fa
      chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydr
      chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and p
      chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and p
      chr_11[1-100000].70 | 2886 | | | | | |
      chr_11[80001-180000].100 | 1523 | | | | | |
      COGAnnotation_coastal_sample.txt:
      id | query | hit | e_value | identity_ | score | query_start | query_end | hit_start | hit_end | hit_length
      1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
      2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
      3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
      …
      2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
      2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
      …
      3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
      …
      Query: select * from annotationsummary_combinedorfannotation16_phaeo_genome, COGAnnotation_surface where phaeo_gene = surf_hit
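The join on slide 22 can be reproduced on a tiny scale with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the table names and the `phaeo_gene` / `surf_hit` columns come from the slide's query, but the `query_id` column, the two-column layout, and the choice of rows (a few COG ids copied from the slide's sample tables) are hypothetical and only serve to make the query runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical two-column layouts; the real files on the slide have more columns.
conn.executescript("""
CREATE TABLE annotationsummary_combinedorfannotation16_phaeo_genome (
    query_id TEXT, phaeo_gene TEXT);
CREATE TABLE COGAnnotation_surface (
    query_id TEXT, surf_hit TEXT);
""")

conn.executemany(
    "INSERT INTO annotationsummary_combinedorfannotation16_phaeo_genome VALUES (?, ?)",
    [
        ("chr_9[400001-500000].503", "COG4547"),
        ("chr_9[160001-260000].243", "COG5077"),
        ("chr_12[720001-820000].86", "COG5032"),
    ],
)
conn.executemany(
    "INSERT INTO COGAnnotation_surface VALUES (?, ?)",
    [
        ("FHJ7DRN01A0TND.1", "COG0414"),
        ("FHJ7DRN02HXTBY.5", "COG5077"),
        ("FHJ7DRN02FUJW3.1", "COG5032"),
    ],
)

# Same join condition as the slide; aliases and ORDER BY are added for readability.
rows = conn.execute("""
SELECT p.query_id, s.query_id, p.phaeo_gene
FROM annotationsummary_combinedorfannotation16_phaeo_genome AS p,
     COGAnnotation_surface AS s
WHERE p.phaeo_gene = s.surf_hit
ORDER BY p.phaeo_gene
""").fetchall()

for row in rows:
    print(row)
```

Only the COG ids present in both tables survive the join, which is the kind of cross-dataset comparison the slide contrasts with manual spreadsheet work.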
  • 23. [Diagram labels: Environmental Sampling; Sequencing; Public annotation DBs; metadata; search hits; taxonomic info; Pfams, TIGRfams, COGs, FIGfams; Phylogeny; SQL; SQLShare. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?] “That took me a week with Excel” “I can do science again”
  • 26. SQLShare Motivation. Conventional wisdom says “Scientists won’t write SQL.” We don’t believe it. Instead, we implicate difficulty in installation, configuration, schema design, performance tuning, data ingest, and over-reliance on GUIs. We ask: “What kind of technology would make SQL a natural fit for hypothesis testing?”
  • 27. SQLShare Features. Collaborative SQL authoring and sharing. Views for incremental abstraction and integration. Semi-automatic integration: identify “natural” unions and joins. SQL Autocomplete: user starts typing, system uses query logs to make suggestions [Khoussainova 10]. English Query: bootstrap a SQL query from an English question. Simple Visualization: via integration with Google Fusion Tables.
  • 28. Outline: Challenges; Dataspaces; Dataspace Support Platforms; Next Steps.
  • 29. Next Steps. Define scope. Define HIV Dataspace team. Build a minimal technical team: “Data Wrangler”, “Application Wrangler”. Identify and catalog dataspace “participants” (i.e., sources). Review data access rights and security requirements. Gather “spanning basis” of questions to answer (Jim Gray’s “20 questions” methodology). Gather “spanning basis” of existing data: use exemplars if necessary; load data “as is” into a database.
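The last step on slide 29, loading data “as is” into a database, might look roughly like the sketch below. It is an assumption-laden illustration, not how SQLShare or any system named in the talk ingests data: the helper names `quote_ident` and `load_as_is`, the table name, and the tab-delimited sample are all hypothetical. Every column is stored as text and short rows are padded with NULL, so no schema design is needed up front.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote an arbitrary header string so SQLite accepts it as an identifier."""
    return '"' + name.replace('"', '""') + '"'


def load_as_is(conn, table, text, delimiter="\t"):
    """Create a table whose columns mirror the file header and load every row as TEXT."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(quote_ident(name) + " TEXT" for name in header)
    conn.execute("CREATE TABLE {} ({})".format(quote_ident(table), columns))
    placeholders = ", ".join("?" for _ in header)
    # Pad short rows with NULL; a row longer than the header would raise on insert.
    padded = [row + [None] * (len(header) - len(row)) for row in reader]
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_ident(table), placeholders),
        padded,
    )
    return len(padded)


# Hypothetical sample: header names with spaces and hyphens, one row without a hit.
sample = "\n".join([
    "query\tlength\tCOG hit\te-value",
    "chr_4[480001-580000].287\t4500",
    "chr_9[400001-500000].503\t4211\tCOG4547\t2.00E-04",
])

conn = sqlite3.connect(":memory:")
loaded = load_as_is(conn, "phaeo_annotations", sample)
hits = conn.execute(
    'SELECT "COG hit" FROM phaeo_annotations WHERE "COG hit" IS NOT NULL'
).fetchall()
print(loaded, hits)
```

Keeping everything as text defers typing and cleaning decisions until a question actually requires them, which matches the pay-as-you-go theme of the talk.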
  • 30. 3/12/09 Bill Howe, eScience Institute30 Next Steps (2)
    - Answer initial questions (Data wrangler)
      - RDBMS example: create views
    - Visualize initial answers (Application wrangler)
    - Demonstrate early progress
    - Check breadth (what’s missing?)
    - Check depth (Did “hard” questions get answered?)
  • 31. 3/12/09 Bill Howe, eScience Institute31 Summary
    - Conventional “schema-first” approaches break down in research contexts
    - The dataspace abstraction and DSSPs offer a way forward
    - Systems and best practices are emerging in the literature and from production deployments
  • 33. 3/12/09 Bill Howe, eScience Institute33 BACKUP SLIDES
  • 34. 3/12/09 Bill Howe, eScience Institute34 Feature: Sharing SQL
  • 35. 3/12/09 Bill Howe, eScience Institute35 Feature: SQL Autocomplete
    - User requests suggestions on-the-fly as he/she types the query
    - Recommends snippets:
      - predicates in the WHERE clause
      - tables in the FROM clause
      - attributes in the SELECT clause
    - Recommendations are context-aware
    - Leverages past queries by user and collaborators
    Src: Nodira Khoussainova
  • 36. 3/12/09 Bill Howe, eScience Institute36 Feature: English Query
    - Lots of research on Natural Language Interfaces to Databases
      - cf. [Etzioni 2008, Zettlemoyer 2009]
    - Very hard problem, in general
    - Significant simplification: user can inspect and “fix” the generated SQL prior to execution
  • 37. 3/12/09 Bill Howe, eScience Institute37 Feature: Simple Visualization For each phaeo gene, count the number of matches in the COGAnnotation_surface dataset, joining on COG id. Return the top 10 most commonly found genes. Implementation: Export to Google Fusion Tables
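The English request on this slide maps onto a join, a `GROUP BY`, and a `LIMIT`. The sketch below is one plausible translation, run in SQLite from Python; the table names `phaeo` and `cog_annotation_surface`, their columns, and the toy rows are all hypothetical stand-ins, not the schema of the real datasets or the SQL that SQLShare generated.

```python
import sqlite3


def top_matched_genes(conn, k=10):
    """Return (gene, match_count) pairs, most frequently matched genes first."""
    return conn.execute("""
        SELECT p.gene, COUNT(*) AS n_matches
        FROM phaeo AS p
        JOIN cog_annotation_surface AS s ON p.cog_id = s.cog_id
        GROUP BY p.gene
        ORDER BY n_matches DESC, p.gene
        LIMIT ?
    """, (k,)).fetchall()


# Toy data (hypothetical): three genes, three sample reads.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phaeo (gene TEXT, cog_id TEXT);
CREATE TABLE cog_annotation_surface (read_id TEXT, cog_id TEXT);
INSERT INTO phaeo VALUES
    ('gene_a', 'COG5032'), ('gene_b', 'COG5077'), ('gene_c', 'COG4547');
INSERT INTO cog_annotation_surface VALUES
    ('read_1', 'COG5032'), ('read_2', 'COG5032'), ('read_3', 'COG5077');
""")

top = top_matched_genes(conn)
print(top)
```

Because this is an inner join, genes with no matching reads (`gene_c` here) do not appear in the result at all; a `LEFT JOIN` with `COUNT(s.read_id)` would list them with a count of zero instead.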
  • 38. 3/12/09 Bill Howe, eScience Institute38 Dataspaces: Summary
    A “Dataspace Support Platform” should
    - use a “lowest common denominator” data model
    - not rely crucially on upfront global consensus
    - not rely crucially on “perfect” metadata
    - embrace exceptions, but exploit patterns
    - support task-specific, “top down” integration
      - …but seek and exploit cross-cutting patterns where possible
    - deliver incremental return for incremental investment
      - …in data quality enhancement
      - …in metadata normalization
      - …in usage standardization
      - …in application “convergence”
  • 39. 3/12/09 Bill Howe, eScience Institute39 Timeline [Chart; x-axis: time, scope, effort; y-axis: value for users] Semantic Web; RDF/OWL Ontologies; Insular Data Sources; Data Integration Tools; Federated Databases; Dataspace support platforms; Dataspaces
  • 40. 3/12/09 Bill Howe, eScience Institute40 Example: Metagenomics (metagenomics, metatranscriptomics, metaproteomics)
    Study microbial populations sampled from the environment instead of individual organisms
    1. Who is there? Which organisms make up the population?
    2. What are they doing? Which metabolic pathways are present and active? (and who is doing what?)
    3. Compare datasets
       - across a transect (nearshore vs. deep ocean)
       - before/after some event (e.g., Spring freshet)
       - across salinity/temperature gradients
       - diurnal cycles (day/night)
    Source: Robin Kodner, Armbrust Lab
  • 41. Complex Example

    coastal sample
    id | query | hit | e_value | query_start | query_end | hit_start | hit_end | hit_length
    6409 | FHJ7DRN01BYA61.1 | TIGR00149 | 2.20E-21 | 1 | 84 | 43 | 125 | 134
    6410 | FHJ7DRN01BDTEA.1 | TIGR00149 | 3.40E-09 | 3 | 42 | 30 | 69 | 134
    6411 | FHJ7DRN02HEUGQ.1 | TIGR00149 | 1.70E-05 | 4 | 46 | 1 | 46 | 134
    6412 | FHJ7DRN01CA4BO.1 | TIGR00149 | 5.30E-05 | 4 | 45 | 1 | 45 | 134
    6413 | FHJ7DRN01DM2FK.3 | TIGR01651 | 5.70E-64 | 1 | 76 | 511 | 586 | 606
    6414 | FHJ7DRN01B8BPS.1 | TIGR01651 | 1.20E-36 | 1 | 52 | 500 | 551 | 606
    6415 | FHJ7DRN02JM54P.1 | TIGR01651 | 2.20E-24 | 15 | 80 | 301 | 366 | 606
    6416 | FHJ7DRN02FK6C5.2 | TIGR00039 | 2.70E-16 | 1 | 45 | 37 | 85 | 153
    6417 | FHJ7DRN01D019A.1 | TIGR00039 | 8.90E-12 | 5 | 65 | 48 | 118 | 153
    6418 | FHJ7DRN02FYAFO.1 | TIGR00039 | 1.60E-11 | 1 | 76 | 67 | 153 | 153

    COG database
    …
    [H] COG4547 Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5,6-dimethylbenzimidazole phosphoribosyltransferase)
        Ype: YPMT1.87
        Atu: AGl2410
        Sme: SMc00701
        Bme: BMEI0050
        Mlo: mll3561
        Ccr: CC0672
    …
    [J] COG5099 RNA-binding protein of the Puf family, translational repressor
        Sce: YGL014w YGL178w YJR091c YLL013c YPR042c
        Spo: SPAC1687.22c SPAC4G8.03c SPAC4G9.05 SPAC6G9.14 SPBC56F2.08c SPBP35G2.14 SPCC1682.08c
        Ecu: ECU11g1730
    …

    ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
    ###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
    chr_4[480001-580000].287 | 4500
    chr_4[560001-660000].1 | 3556
    chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein Cob
    chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN, SP
    chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein Cob
    chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf fam
    chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf fam
    chr_24[160001-260000].65 | 3542
    chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf fam
    chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydrola
    chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and pro
    chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and pro
    chr_11[1-100000].70 | 2886
    chr_11[80001-180000].100 | 1523

    SwissProt web service; Browser Cross-Reference

    TIGRFAM to GO Mapping
    TIGR01650 | GO:0051116 | contributes_to
    TIGR01651 | GO:0009236 | NULL
    TIGR01651 | GO:0051116 | NULL
    TIGR01660 | GO:0008940 | NULL
    TIGR01660 | GO:0009061 | NULL
    TIGR01660 | GO:0009325 | NULL
    TIGR01663 | GO:0000012 | NULL
    TIGR01663 | GO:0046403 | NULL
  • 42. 3/12/09 Bill Howe, eScience Institute42 Background: Relational Databases. Pre-relational brittleness: if your data changed, your application often broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1970. Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
  • 43. 3/12/09 Bill Howe, eScience Institute43 Relational Databases: Summary
    - A General Data Model: “just tables”
    - Logical and Physical Data Independence
    - Declarative Query Language
      - via the “Relational Algebra”
    - Good Scalability
      - “SQL is the most successful parallel language in the world”
    - Results
      - $15B industry
      - Nearly every (non-search engine) website you visit is backed by an RDBMS
      - One of the all-time best examples of CS research impact
  • 44. 3/12/09 Bill Howe, eScience Institute44 So what went wrong?
    - DBAs!
      - “Schema design” became paramount
      - “Applications write queries, not users”
      - Applications became tightly coupled to schema
    - Ad hoc queries, ad hoc views, ad hoc data confounded predictable performance, centralized management, and strong global guarantees
    - Result: Other tools enlisted to fill the gap
      - Java, etc.; XML, RDF, etc.; Web Services
  • 45. 3/12/09 Bill Howe, eScience Institute45 Key Idea: Data Independence

    views:
        SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA';
    (logical data independence)
    relations:
        SELECT dna FROM ncbi_sequences WHERE dna = 'GATTACGATATTA';
    (physical data independence)
    files and pointers:
        f = fopen("table_file", "rb");
        fseek(f, 10030440, SEEK_SET);
        while (1) {
            fread(buf, 1, 8192, f);
            if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
                . . .
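Logical data independence can be demonstrated in a few lines: an application that queries only the view keeps working when the table underneath is renamed. This is a minimal sketch using SQLite from Python. The names `all_sequences`, `ncbi_sequences`, `seq` and `dna` come from the slide; `genbank_sequences`, the index name and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ncbi_sequences (dna TEXT);
INSERT INTO ncbi_sequences VALUES ('GATTACGATATTA'), ('CCGTA');
-- The view hides the base table's name and column name from applications.
CREATE VIEW all_sequences AS SELECT dna AS seq FROM ncbi_sequences;
""")


def find_sequence(conn, target):
    """Application code: it only knows about the view, never the base table."""
    return conn.execute(
        "SELECT seq FROM all_sequences WHERE seq = ?", (target,)
    ).fetchall()


before = find_sequence(conn, "GATTACGATATTA")

# Reorganize storage: rename the base table (hypothetical new name), add an
# index (a physical change), and redefine the view over the new table.
conn.executescript("""
DROP VIEW all_sequences;
ALTER TABLE ncbi_sequences RENAME TO genbank_sequences;
CREATE INDEX idx_dna ON genbank_sequences (dna);
CREATE VIEW all_sequences AS SELECT dna AS seq FROM genbank_sequences;
""")

after = find_sequence(conn, "GATTACGATATTA")
print(before == after)
```

Adding the index stands in for physical data independence (the query text is unchanged while the access path may differ), and redefining the view stands in for logical data independence.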
  • 46. 3/12/09 Bill Howe, eScience Institute46 Key Idea: An Algebra of Tables select project join join Other operators: aggregate, union, difference, cross product
  • 47. 3/12/09 Bill Howe, eScience Institute47 Key Idea: Algebraic Optimization
    N = ((z*2)+((z*3)+0))/1
    Algebraic Laws:
    1. (+) identity: x+0 = x
    2. (/) identity: x/1 = x
    3. (*) distributes: (n*x+n*y) = n*(x+y)
    4. (*) commutes: x*y = y*x
    Apply rules 1, 3, 4, 2: N = (2+3)*z
    Two operations instead of five, no division operator.
    Same idea works with the Relational Algebra!
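The rewrite on this slide can be checked mechanically. The sketch below writes out each intermediate expression as a Python function, in the order the slide applies the rules (1, 3, 4, 2), and confirms that every step yields the same value; the function names are illustrative only.

```python
def original(z):
    # Five operations: two multiplications, two additions, one division.
    return ((z * 2) + ((z * 3) + 0)) / 1


def after_rule_1(z):
    # (+) identity: (z*3)+0 becomes z*3
    return ((z * 2) + (z * 3)) / 1


def after_rule_3(z):
    # (*) distributes: z*2 + z*3 becomes z*(2+3)
    return (z * (2 + 3)) / 1


def after_rule_4(z):
    # (*) commutes: z*(2+3) becomes (2+3)*z
    return ((2 + 3) * z) / 1


def optimized(z):
    # (/) identity: x/1 becomes x. Two operations remain.
    return (2 + 3) * z


steps = [original, after_rule_1, after_rule_3, after_rule_4, optimized]
equivalent = all(
    step(z) == original(z) for step in steps for z in range(-5, 6)
)
print(equivalent)
```

A query optimizer does the analogous thing with relational operators: it rewrites a query into an equivalent expression that is cheaper to evaluate, relying on algebraic laws rather than on testing sample inputs as this check does.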
  • 48. 3/12/09 Bill Howe, eScience Institute48 My Interests Computer Science Scientific Data Management Databases Data-Intensive Scalable Computing Research Data Integration Cloud Computing Visual Data Analytics
  • 49. 3/12/09 Bill Howe, eScience Institute49 Research Cycle: Observe, Experiment, Analyze, Publish/Share, Synthesis
  • 50. 3/12/09 Bill Howe, eScience Institute50 Web Services Data Management Query Languages Storage Cloud Computing Visualization; Workflow Information Integration Information Extraction, Access Methods Data Mining, Distributed Programming Models, complexity-hiding interfaces

Editor's notes

  1. My name is Bill Howe and I’m from the University of Washington eScience Institute
  2. Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem. “Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.” “tens of terabytes of data per day” -- genome center at Washington University. Increase data collection exponentially with FlowCAM.
  3. Steps 1, 2, 3, 4 are expensive. At the frontier of research, 1, 2, 3 are by definition elusive. By definition, the domain is not fully understood. By definition, researchers do not universally agree on the interpretation of data; there is no universal domain model. How do you build a domain model, build consensus? You perform experiments, analyze the data, publish the results -- the overall goal of the scientific enterprise is to achieve shared knowledge. So it is a mistake to presuppose the existence of an ontology as a way to facilitate data analysis. We perform data analysis in order to establish an ontology; we can’t make establishing an ontology a prerequisite for data analysis. Shared knowledge is the end result of research, not a precondition for it.

     So in practice, you find individual researchers, individual analysts managing their own data ad hoc -- mostly in text files, and mostly with SAS, Perl, Python, MATLAB, maybe Excel if they do not have a programming background. So what results is this ecosystem of “desultory” data -- varying levels of metadata, varying levels of quality, varying levels of accessibility.

     Now, it is a phenomenally good idea to strive for machine readable representations of knowledge. When these exist, there are a variety of ways to exploit them to improve interoperability, build applications, facilitate communication. But at the frontier of research, they simply can’t exist until we have a chance to analyze all the data. So this talk is about technology that can tolerate the heterogeneity and ambiguity of research data.

     Aside: Formally, an interpretation of a model is a mapping from the elements in the model to the elements in the real world. A sound interpretation of a logical model is one where all true statements in the artificial model are true when mapped into the real world.

     A global ontology or other form of comprehensive schema presumes some measure of consensus (which may take the form of mappings between different sub-ontologies or sub-schemas -- yet these are still a form of consensus). Even if you can successfully capture global agreement into a shared schema, it’s only a snapshot -- its half-life will be very short. Even if your schema is sound and complete, and even if you can keep up with changing requirements,
  4. My claim is that the value of a repository is quadratic in the number of datasets it holds. The reason is that every pairwise comparison of two datasets potentially provides new insight. Some things to observe: Many systems only provide basic retrieval -- a unary operation -- and so they scale linearly. (Though of course the user can download two datasets and compare them locally, so the B coefficient is non-zero.) I think of C as the benefit of having a shared domain model to focus the conversation, resolve ambiguities, and encourage converging mental models among users, especially new users (students). So this is important, but it does not get any benefit from having 1, 10, 1000, or 1M datasets. Its value is constant. So my recommended strategy is to make sure B is non-zero, and crank up D as high as possible. Now, having a rich, thorough domain model can help facilitate more operations -- maybe a new genome browser might be easier to build if we have a well-tuned schema to work from. However, it’s demonstrably NOT the case that a genome browser *could not* be built without such a schema. Indeed, a huge number of simple desktop genome browsers exist that do not have any shared semantics. The disadvantage to having a rich, thorough domain model is that it restricts D -- it limits the amount of data you can put into the system. Data with missing, incomplete, or ambiguous metadata cannot be ingested. So you’ve increased C (and possibly B and U) at the expense of D -- this is not a good idea. So we need systems that are inclusive, that emphasize breadth over depth (at least initially), that emphasize coverage.
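To make the argument concrete, the slide's formula V_R = B·D² + U·D + C can be evaluated for a few repository sizes. The coefficient values below are made up purely for illustration (the talk gives none); only the shape of the growth matters.

```python
def repository_value(d, b, u, c):
    """V_R = B*D^2 + U*D + C, the repository-value formula from the slide.

    d: number of datasets
    b: binary operations facilitated
    u: unary operations facilitated
    c: intrinsic value of the schema
    """
    return b * d**2 + u * d + c


# Hypothetical coefficients: a strong schema (large C) and modest B and U.
for d in (10, 100, 1000):
    print(d, repository_value(d, b=1, u=10, c=1000))
```

With these illustrative numbers the constant term dominates at 10 datasets and is negligible at 1000, which is the point of the note: once B is non-zero, growing D matters far more than growing C.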
  5. Many failed systems in and out of science attribute their failure to inflexibility -- too hard to get data into the system due to over-engineered metadata standards. The value of a data repository is quadratic in the number of datasets it holds: Vr = QD^2 + RD + C, where D = number of datasets, Q = analysis capability, R = simple accessibility, and C = communicability (the intrinsic value of domain knowledge, a metadata standard). So a rich and thorough domain model increases C, but can decrease D -- it’s too difficult to put data into the correct format, with all the required metadata, so the system is underused. FGDC
  6. Data in a variety of formats, ensconced in autonomous systems with different capabilities and different schemas. How do you get started in this environment? What’s the first thing you do?
  7. Data sources do not share a schema, and may not exhibit a schema at all. Data is allowed to exist in its native form behind its native interfaces. These data sources are also autonomous, so you’re not necessarily allowed to take all the flat files and replace them with XML. You have to pay for all this freedom and flexibility somewhere, and here’s where you do it: With lots of global properties, you can define sophisticated services that exploit them: structured query, and strong integrity guarantees.
  8. Put another way, databases are inherently exclusive, helping you reject data that does not conform to your schema or satisfy your integrity constraints. The dataspace support platform is inclusive -- everybody is welcome
  9. The dataspace provides a hierarchy of services to accommodate varying degrees of data “maturity”
  10. Add a screenshot of Google. There is no global schema for the Internet. Search is approximate, “best effort”…and highly effective.
  11. What do you need to do to forecast the physical state of the ocean? You’re going to be solving a set of partial differential equations, so you need forcings at the boundaries of the domain -- river discharge, tides, and atmospheric conditions, bathymetry. Every day, you can download results of atmospheric forecasts, compute tidal forcings, and estimate river discharge, as well as some observational data to compare with your simulation and see how well you’re doing. These data go into files and relational databases. When the forecast is ready to run, these inputs are staged out to a compute cluster, along with the FORTRAN code that will solve the equations and some post-processing routines. The forecast executes, incrementally generating data files, visualizations, log files, and status information. This information is pushed back to the storage servers and the visualizations are served over the web. This process generates lots of intermediate data of a variety of types. We want to provide browse and query services over these data without disturbing the operational system and without a lengthy design phase -- we want results by 5:00pm. Some data are loaded into a relational database; others are left as files (no need for ad hoc query; one-time use; large size). SELFE: Eulerian-Lagrangian semi-implicit finite element model. Solves 3D Navier-Stokes equations. Produces 6 variables * 700MB/day. Hindcast runs: compare code versions, compare inputs, long-term behavior, what-if analysis (river dredging), tsunami model assumptions. Data products: animations, maps, timeseries plots, station extractions, model-data comparisons.
  12. Slide from Robin Kodner. Key idea: This protocol is far more precise than BLAST for sequence searching, but it generates a lot of heterogeneous intermediate results that must be analyzed -- this step had completely roadblocked the research. With SQLShare, you can just throw all the data up to the cloud and start asking questions right away -- collaboratively. None of the overhead associated with an RDBMS.

     == Name ==
     SQLShare: Cloud-based Collaborative Query

     == PI ==
     Ginger Armbrust

     == Science Perspective ==
     From Robin: “The SQLShare database is allowing me to do basic sorting and clustering of my data that took me a week to do in Excel, now in a matter of seconds. It is also making it possible to correlate the analyzed data that results from different kinds of analysis from different analysis pathways, which maximizes the use of the data. Further, it allows for finding correlations between different projects and the corresponding environmental metadata that would be impossible without the database. Without the database, I'd only be able to utilize a fraction of my data, and find only a fraction of the interesting nuggets that we are looking for. I conceived of the database to help me with metagenomic data but it's so useful, we are now using it to do comparative genomics and evolutionary studies.”

     == Computational View ==
     Goal: Tolerate the “spreadsheet tsunami”. Each user has O(100) files with O(100k) rows each; heterogeneous, changing schemas.
     Observation: Databases underused in science.
     Hypothesis: Scientists dislike RDBMS, not SQL (installation, configuration, schema design, physical tuning).
     Approach: Just put it in the cloud and query it; ignore DB design; do auto-scaling and auto-tuning.
     System-enabled sharing of data and queries. Only makes sense for science! Ex: Two labs both buy an AB SOLiD sequencer, so both may use the same queries to process the output.

     == Resources ==
     Azure for the application, SQL Azure for the system data, EC2 for the user data (to avoid 10GB limit on SQL Azure)

     == Comparison ==
     Quotes: “I can do science again” “That took me a week to do with spreadsheets!” “I spend 90% of my time manipulating data in spreadsheets.” “My research was stuck on data analysis before SQLShare”
  13. Environmental samples are sequenced. Sequence fragments are looked up in public databases, and passed through phylogenetic analysis to place them at the appropriate location in the tree. Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files. This process is repeated many times, leading to 100s of “desultory” spreadsheets. The actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc. It’s a mess -- when asked how much time is spent “handling data” as opposed to “doing science”, one postdoc said a staggering 90%!
  14. Here are two datasets: Sequence annotations for the Phaeodactylum organism and sequence annotations from an environmental sample. The task is to compare these sets of annotations to determine what role Phaeo is serving in the metagenomic population, if present. Previously, researchers had to manually cross-reference data between spreadsheets. But the join between these datasets is trivially expressed in SQL. Now, that was just the first step -- counting subsets, finding intersections, finding “top K” matches, etc. must also be performed manually, but are also easily expressed in SQL.
  15. -- No schema design: Just upload everything “as is” and start querying. No one wants to create one, and the schema’s going to change anyway. -- We find that ALL of the scientists’ English queries are expressible in SQL. Some can be complex, however. -- Challenge: SQL is hard -- Solution: Let scientists train themselves. Give them examples to modify instead of a “blinking cursor.” More generally: Facilitate collaborative query authoring, sharing, and reuse. Support collaboration between the “carpet lab” and the “tile lab” (Computer geeks work in carpeted offices, bio geeks work in the wet lab.) -- How? 1) Use the cloud to logically and physically co-locate all data across all labs -- no more islands 2) Let queries be saved and shared 3) Log everything and do machine learning on the log to perform “Query autocomplete” (Nodira and Magda’s work) 4) Automatically adapt queries for use on ‘similar’ datasets (change table names, etc.) many more ideas….
  16. Items to point out: -- IMPORTANT: These are not trivial queries! But with some help, scientists can write them. We give them an example, and they modify the example and save it for reuse. Queries can be optionally shared across users.
  17. Currently we have insular data sources. Pay as you go. Smoothing the ROI curve!
  18. Science slide for Ginger
  19. In the previous case, the same source of database identifiers was used; when they differ the process can be more complicated. Here we have two datasets: Phaeo gene annotations again, and a set of sample annotations with references to the TIGRFam database. The workflow here might look like: (1) Find an annotation of interest in the Phaeo dataset. (2) Look up the COG Id to get the Protein Name. (3) Search for the Protein Name in various online databases (here we use SwissProt) to collect additional information. (4) Browse to the cross-reference information to find the TIGRFam Id. (5) Find the Gene Ontology synonym of the TIGRFam Id to collect additional metadata (other metadata not shown -- another step). (6) Finally, match TIGRFam Ids back to the original sample. By putting all of this data into a database, you can write these expressions as joins. More importantly, you can go beyond “lookup” tasks and express the actual science questions directly: What percentage of Phaeo genes are present in this sample? What metabolic processes are those genes involved in? Note that we do NOT want to attempt to create “YAUDB” (yet another universal database). These data are uploaded and manipulated in an exploratory, task-specific manner. We aim to provide SQL over YOUR data, not a universal reference database from scratch. (That being said, our research involves learning a universal database schema -- incrementally and organically -- based on the uploaded data, the executed queries, and any available user input.) Bill
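Once the cross-reference is a table, the question "What percentage of Phaeo genes are present in this sample?" becomes a pair of joins. The following is a rough sketch in SQLite from Python under stated assumptions: the three table layouts, the `cog_to_tigrfam` cross-reference table, and every identifier in the toy data are hypothetical placeholders, not real COG or TIGRFAM mappings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phaeo (gene TEXT, cog_id TEXT);
CREATE TABLE cog_to_tigrfam (cog_id TEXT, tigrfam_id TEXT);  -- hypothetical cross-reference
CREATE TABLE sample_hits (read_id TEXT, tigrfam_id TEXT);

-- Placeholder identifiers only; none of these are real database entries.
INSERT INTO phaeo VALUES
    ('g1', 'COG_A'), ('g2', 'COG_B'), ('g3', 'COG_C'), ('g4', 'COG_D');
INSERT INTO cog_to_tigrfam VALUES ('COG_A', 'TIGR_X'), ('COG_B', 'TIGR_Y');
INSERT INTO sample_hits VALUES ('read_1', 'TIGR_X'), ('read_2', 'TIGR_X');
""")


def percent_genes_in_sample(conn):
    """Share of Phaeo genes whose cross-referenced family appears in the sample."""
    (pct,) = conn.execute("""
        SELECT 100.0
               * COUNT(DISTINCT CASE WHEN s.tigrfam_id IS NOT NULL THEN p.gene END)
               / COUNT(DISTINCT p.gene)
        FROM phaeo AS p
        LEFT JOIN cog_to_tigrfam AS x ON p.cog_id = x.cog_id
        LEFT JOIN sample_hits AS s ON x.tigrfam_id = s.tigrfam_id
    """).fetchone()
    return pct


pct = percent_genes_in_sample(conn)
print(pct)
```

The `LEFT JOIN`s keep genes that have no cross-reference or no sample hit in the denominator, and `COUNT(DISTINCT ...)` prevents a gene matched by several reads from being counted more than once.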
  20. It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  21. So what’s wrong? Applications write queries, not users Schema design, tuning, “protectionist” attitudes
  22. It turns out that you can express a wide variety of computations using only a handful of operators.
  23. Data Management != Storage Management. Storage Management is SATA/SCSI/Fiber; backup policies and procedures; redundancy decisions (RAID 0, 1+0, 0+1, 5). Access methods. Query languages. Data Mining, Analysis, Visualization. Data Integration.