2. 3/12/09 Bill Howe, eScience Institute
Data acquisition is no longer the
bottleneck to scientific discovery
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
3.
Two dimensions (figure)
Axes: # of bytes, # of data types
Domains: Biology, Oceanography, Astronomy
Projects plotted: LSST, SDSS, PanSTARRS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons
4.
Building a Research Data Management System:
Status Quo
1. Establish (scientific) consensus: scope, vision, requirements, terminology
2. Derive and encode a domain model (schema): encode shared knowledge in a machine-readable manner
a. Relational schema, ontology, metadata standards, conventions, controlled vocabularies, object model, API
b. Mappings between existing models
3. Retrofit new domain model to existing data: populate the schema, attach semantics, clean data
4. Build applications: use domain model to inform design
5. Analyze data: do science
5.
The Value of a Data Repository
V_R = B D^2 + U D + C
D = # of datasets in the repository
B = # of binary operations facilitated
U = # of unary operations facilitated
C = intrinsic value of the schema (for communication, etc.)
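To see how the quadratic term takes over, here is a minimal sketch that simply evaluates the formula above. The coefficient values are made up for illustration; the slide does not give numbers for B, U, or C.

```python
def repository_value(d, b=1.0, u=1.0, c=0.0):
    """Evaluate V_R = B*D^2 + U*D + C for a repository holding d datasets."""
    return b * d**2 + u * d + c


# Hypothetical coefficients, chosen only to show the shape of the curve.
B, U, C = 0.1, 1.0, 50.0

for d in (1, 10, 100, 1000):
    print(d, repository_value(d, B, U, C))
```

With a small D the constant C dominates; as D grows, the B*D^2 term dwarfs both U*D and C, which is the argument for keeping B non-zero and D as large as possible.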
6.
Quote
A typical biological data management system involves accessing
or gathering data from multiple sources, followed by data
correlation, classification, review, and curation using domain
specific tools (e.g., functional clusters, ontologies) and expertise.
In practice, biological data management is less daunting when it is
considered in the context of an iterative strategy based on gradual
data integration while accumulating domain specific knowledge
throughout the integration process.
Victor Markowitz, LBNL
7.
Outline
Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
8.
slide source: Alon Halevy; Franklin, Halevy, Maier 2005
Dataspaces
10.
Databases vs. Dataspaces
Databases | Dataspaces
Single Schema | Data “Coexistence”
Centralized Administration | Autonomous Sources
Structured Query | Search, Browse, Approximate Answers
Strict Integrity Constraints | Patterns and trends; few global properties
11.
Dataspaces vs. Databases (2)
Databases are Exclusive
Reject data that violates types,
schema, integrity constraints, rules +
triggers
In return:
structured query, logical and physical
data independence, transactions
…over the clean subset of your data
Dataspaces are Inclusive
Few restrictions; all data is welcome
In return, best effort services at first:
Cataloging, keywords, attribute-value
…over (almost) everything
12.
Dataspace Services
Catalog
Keyword search
Structured Query
Analysis and Vis
Task-specific Tools
Time
Over time, a dataset becomes accessible by additional services
13.
Dataspace Services
Keyword Search
Structured Query
Analysis and Visualization
Task-specific Applications
Cataloguing
14.
Dataspace Services
Cataloguing
Keyword Search
Structured Query
Analysis and Vis
Task-specific Tools
16.
Example: Ocean Circulation Forecasting System
(Diagram labels)
Forcings (i.e., inputs): atmospheric models, tides, river discharge
Storage: filesystem, RDBMS
Compute: cluster, FORTRAN
Glue: perl and cron
Contents: simulation results, config and log files, intermediate files, annotations, data products, relations
Products via the web: salinity isolines, station extractions, model-data comparisons
17.
Example: Environmental Metagenomics
(Diagram labels, src: Robin Kodner)
Sampling: metagenome 1, metagenome 2, metagenome 3, metagenome 4
Inputs: sequencing (raw data), environment metadata (raw data)
Annotation tables: Pfams, TIGRfams, COGs, FIGfams
Processing: CAMERA annotation; seed alignment (precomputed) of Pfams, TIGRfams, COGs, FIGfams; HMMer search of meta*ome; reference tree (precomputed); aligned meta*ome fragments; PPLACER; STATs; taxonomic info
Integration: analyzed data in SQLShare
Questions: correlate diversity w/environment; correlate diversity and nutrients; find new genes; find new taxa and their distributions; compare meta*omes
18.
Example: CHAVI
(Diagram labels, src: Bart Haynes)
Center: Relational Dataspace Interface and Analysis
Sources: B Cell Control, T Cell Control, NK Cell Control, Genetics Databases, NHP Database, Virus Seq. Data
19.
Outline
Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
20.
Example Systems cast as DSSPs
Atlas (LabKey)
data model: tables and files
Mark Igra will present
“Data Warehouse” prototype (SCHARP)
data model: relations
SQLShare (UW eScience)
data model: relations
Quarry [Howe, et al. 2006]
data model: triples
iTrails [Salles et al. 2007]
data model: triples
Google Fusion Tables [Halevy 2010]
data model: relations
21.
(Diagram labels)
Pipeline: Environmental Sampling; Sequencing; Public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); Phylogeny
Intermediate data: metadata, search hits, taxonomic info
Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?
“90% of my time spent manipulating data rather than doing science”
22.
Simple Example
### query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500
chr_4[560001-660000].1 | 3556
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein C
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN,
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein C
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf fa
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf fa
chr_24[160001-260000].65 | 3542
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf fa
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and p
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and p
chr_11[1-100000].70 | 2886
chr_11[80001-180000].100 | 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
id | query | hit | e_value | identity_ | score | query_start | query_end | hit_start | hit_end | hit_length
1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
…
2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
…
3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
…
COGAnnotation_coastal_sample.txt
select *
from annotationsummary_combinedorfannotation16_phaeo_genome,
COGAnnotation_surface
where phaeo_gene = surf_hit
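A runnable sketch of the same join, using Python's built-in sqlite3 module as a stand-in for SQLShare. The short table names and the two-column layout are assumptions made for illustration; only the phaeo_gene and surf_hit column names and a handful of rows are taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schemas: one query id and one COG id per row.
conn.executescript("""
CREATE TABLE phaeo_annotations (query TEXT, phaeo_gene TEXT);
CREATE TABLE cog_annotation_surface (query TEXT, surf_hit TEXT);
""")

phaeo_rows = [
    ("chr_9[160001-260000].243", "COG5077"),
    ("chr_12[720001-820000].86", "COG5032"),
    ("chr_12[800001-900000].109", "COG5032"),
    ("chr_9[400001-500000].503", "COG4547"),
]
sample_rows = [
    ("FHJ7DRN01A0TND.1", "COG0414"),
    ("FHJ7DRN02HXTBY.5", "COG5077"),
    ("FHJ7DRN02FUJW3.1", "COG5032"),
]
conn.executemany("INSERT INTO phaeo_annotations VALUES (?, ?)", phaeo_rows)
conn.executemany("INSERT INTO cog_annotation_surface VALUES (?, ?)", sample_rows)

# The join replaces the manual cross-referencing between two spreadsheets.
matches = conn.execute("""
    SELECT p.query, s.query, p.phaeo_gene
    FROM phaeo_annotations AS p
    JOIN cog_annotation_surface AS s ON p.phaeo_gene = s.surf_hit
    ORDER BY p.query
""").fetchall()

for row in matches:
    print(row)
```

Rows whose COG id appears in only one of the two tables drop out of the result, which is what the manual cross-referencing was doing by hand.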
23.
(Diagram labels)
Pipeline: Environmental Sampling; Sequencing; Public annotation DBs (Pfams, TIGRfams, COGs, FIGfams); Phylogeny
Intermediate data: metadata, search hits, taxonomic info
Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes?
Integration: SQL, SQLShare
“That took me a week with Excel”
“I can do science again”
26.
SQLShare Motivation
Conventional wisdom says “Scientists won’t write SQL”
We don’t believe it
Instead, we implicate difficulty in
installation
configuration
schema design
performance tuning
data ingest
over-reliance on GUIs
We ask “What kind of technology would
make SQL a natural fit for hypothesis
testing?”
27.
SQLShare Features
Collaborative SQL authoring and sharing
Views for incremental abstraction and integration
Semi-automatic integration
Identify “natural” unions and joins
SQL Autocomplete
User starts typing, system uses query logs to make suggestions
[Khoussainova 10]
English Query
Bootstrap a SQL query from an English question
Simple Visualization
via Integration with Google Fusion Tables
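One way to picture "natural" unions and joins is sketched below. This is an assumption-laden illustration, not the SQLShare implementation: it treats two tables as a union candidate when their column names match, and two columns as a join candidate when their value sets overlap (Jaccard similarity above a hypothetical threshold). The table names and values are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two value collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def union_candidates(tables):
    """Pairs of tables whose column-name sets are identical."""
    return [(s, t) for s, t in combinations(sorted(tables), 2)
            if set(tables[s]) == set(tables[t])]


def join_candidates(tables, threshold=0.3):
    """Column pairs from different tables whose values overlap enough."""
    found = []
    for s, t in combinations(sorted(tables), 2):
        for cs, vs in tables[s].items():
            for ct, vt in tables[t].items():
                score = jaccard(vs, vt)
                if score >= threshold:
                    found.append((f"{s}.{cs}", f"{t}.{ct}", round(score, 2)))
    return found


# Each table is a dict of column name -> list of values (all hypothetical).
tables = {
    "sample_a": {"query": ["q1", "q2"], "hit": ["COG5077", "COG5032"]},
    "sample_b": {"query": ["q3", "q4"], "hit": ["COG5032", "COG0414"]},
    "cog_names": {"cog_id": ["COG5077", "COG5032", "COG0414"],
                  "name": ["n1", "n2", "n3"]},
}

print(union_candidates(tables))
print(join_candidates(tables))
```

The "semi-automatic" part is that suggestions like these would be shown to the user to accept or reject, rather than applied blindly.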
28.
Outline
Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
29.
Next Steps
Define scope
Define HIV Dataspace team
Build a minimal technical team
“Data Wrangler”
“Application Wrangler”
Identify and catalog dataspace “participants” (i.e., sources)
Review data access rights and security requirements
Gather “spanning basis” of questions to answer
Jim Gray’s “20 questions” methodology
Gather “spanning basis” of existing data
use exemplars if necessary
load data “as is” into a database
31.
Summary
Conventional “schema-first” approaches
break down in research contexts
The dataspace abstraction and DSSPs
offer a way forward
Systems and best practices are emerging
in the literature and from production
deployments
35.
Feature: SQL Autocomplete
User requests suggestions on the fly while typing a query
Recommends snippets:
predicates in the WHERE clause
tables in the FROM clause
attributes in the SELECT clause
Recommendations are context-aware
Leverages past queries by user and
collaborators
Src: Nodira Khoussainova
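A toy version of log-driven suggestions for the FROM clause is sketched below. It is a plain frequency ranking over a hypothetical query log, not the method from [Khoussainova 10]; the regular expressions only handle simple queries, and the table names are invented.

```python
import re
from collections import Counter


def tables_in(query):
    """Table names that follow FROM or JOIN in a simple SQL string."""
    return re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", query,
                      flags=re.IGNORECASE)


def suggest_tables(partial_query, query_log, k=3):
    """Rank tables by how many logged queries used them."""
    counts = Counter()
    for past in query_log:
        counts.update(set(tables_in(past)))
    # Crude context-awareness: filter by whatever is typed after FROM so far.
    m = re.search(r"\bfrom\s+(\w*)$", partial_query, flags=re.IGNORECASE)
    prefix = m.group(1).lower() if m else ""
    ranked = [t for t, _ in counts.most_common()
              if t.lower().startswith(prefix)]
    return ranked[:k]


# Hypothetical log of queries by the user and collaborators.
query_log = [
    "SELECT * FROM phaeo_annotations WHERE e_value < 1e-5",
    "SELECT hit, COUNT(*) FROM cog_annotation_surface GROUP BY hit",
    "SELECT * FROM phaeo_annotations p JOIN cog_annotation_surface s ON p.hit = s.hit",
    "SELECT query FROM phaeo_annotations",
]

print(suggest_tables("SELECT * FROM p", query_log))
print(suggest_tables("SELECT * FROM ", query_log))
```

The same counting idea could be extended to WHERE predicates and SELECT attributes, conditioned on the tables already chosen.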
36.
Feature: English Query
Lots of research on Natural Language
Interfaces to Databases
cf. [Etzioni 2008, Zettlemoyer 2009]
Very hard problem, in general
Significant simplification: user can inspect
and “fix” the generated SQL prior to
execution
37.
Feature: Simple Visualization
For each phaeo gene, count the number of matches in the COGAnnotation_surface
dataset, joining on COG id. Return the top 10 most commonly found genes.
Implementation: Export to Google Fusion Tables
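The English question on this slide could translate to SQL roughly as follows. This is a sketch run against SQLite from Python; the table layouts, gene names, and read ids are hypothetical, and only the task (count matches per Phaeo gene, join on COG id, keep the top 10) comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schemas sharing a COG id column.
conn.executescript("""
CREATE TABLE phaeo_annotations (gene TEXT, cog_id TEXT);
CREATE TABLE cog_annotation_surface (read_id TEXT, cog_id TEXT);
""")
conn.executemany(
    "INSERT INTO phaeo_annotations VALUES (?, ?)",
    [("gene_a", "COG5032"), ("gene_b", "COG5077"), ("gene_c", "COG4547")],
)
conn.executemany(
    "INSERT INTO cog_annotation_surface VALUES (?, ?)",
    [("r1", "COG5032"), ("r2", "COG5032"), ("r3", "COG5077"), ("r4", "COG0414")],
)

# Count matches per gene, most frequent first, at most ten rows.
top_genes = conn.execute("""
    SELECT p.gene, COUNT(*) AS n_matches
    FROM phaeo_annotations AS p
    JOIN cog_annotation_surface AS s ON s.cog_id = p.cog_id
    GROUP BY p.gene
    ORDER BY n_matches DESC, p.gene
    LIMIT 10
""").fetchall()

print(top_genes)
```

The result is a small two-column table, which is the kind of output that is easy to hand off to a charting tool.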
38.
Dataspaces: Summary
A “Dataspace Support Platform” should
use a “lowest common denominator” data model
not rely crucially on upfront global consensus
not rely crucially on “perfect” metadata
embrace exceptions, but exploit patterns
support task-specific, “top down” integration
….but seek and exploit cross-cutting patterns where possible
deliver incremental return for incremental investment
…in data quality enhancement
…in metadata normalization
…in usage standardization
…in application “convergence”
39.
Timeline (figure)
Axes: time, scope, effort; value for users
Labels: Insular Data Sources; Data Integration Tools; Federated Databases; Semantic Web, RDF/OWL, Ontologies; Dataspace support platforms; Dataspaces
40.
Example: Metagenomics
1. Who is there?
Which organisms make up the population?
2. What are they doing?
Which metabolic pathways are present and active?
(and who is doing what?)
3. Compare datasets
- across a transect (nearshore vs. deep ocean)
- before/after some event (e.g., Spring freshet)
- across salinity/temperature gradients
- diurnal cycles (day/night)
metagenomics
metatranscriptomics
metaproteomics
Study microbial populations
sampled from the environment
instead of individual organisms
Source: Robin Kodner, Armbrust Lab
42.
Background: Relational Databases
Pre-relational brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd 1970
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
43.
Relational Databases: Summary
A General Data Model: “just tables”
Logical and Physical Data Independence
Declarative Query Language
via the “Relational Algebra”
Good Scalability
“SQL is the most successful parallel language in the world”
Results
$15B industry
Nearly every (non-search engine) website you visit is backed by an RDBMS
One of the all-time best examples of CS research impact
44.
So what went wrong?
DBAs!
“Schema design” became paramount
“Applications write queries, not users”
Applications became tightly coupled to schema
Ad hoc queries, ad hoc views, ad hoc data
confounded predictable performance, centralized
management, and strong global guarantees
Result: Other tools enlisted to fill the gap
Java, etc.; XML, RDF, etc.; Web Services
45.
Key Idea: Data Independence
physical data independence
logical data independence
files and pointers
relations
views
SELECT seq
FROM all_sequences
WHERE seq = 'GATTACGATATTA';
SELECT dna
FROM ncbi_sequences
WHERE dna = 'GATTACGATATTA';
/* the same lookup written directly against files and pointers */
FILE *f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (1) {
    fread(buf, 1, 8192, f);
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
. . .
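A small demonstration of both kinds of independence, sketched with SQLite from Python. The table and view names follow the slide; the lab_sequences table, the index name, and the extra sequence values are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ncbi_sequences (dna TEXT);
INSERT INTO ncbi_sequences VALUES ('GATTACGATATTA'), ('CCGGTTAA');
CREATE VIEW all_sequences AS SELECT dna AS seq FROM ncbi_sequences;
""")

# The application only ever issues this query against the view.
query = "SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA'"
before = conn.execute(query).fetchall()

conn.executescript("""
-- Physical change: add an index; no query needs to be rewritten.
CREATE INDEX idx_dna ON ncbi_sequences (dna);

-- Logical change: a second (hypothetical) source table appears and the
-- view is redefined to cover both; the application query is untouched.
CREATE TABLE lab_sequences (bases TEXT);
INSERT INTO lab_sequences VALUES ('TTAGGC');
DROP VIEW all_sequences;
CREATE VIEW all_sequences AS
    SELECT dna AS seq FROM ncbi_sequences
    UNION ALL
    SELECT bases AS seq FROM lab_sequences;
""")
after = conn.execute(query).fetchall()

print(before, after)
```

The fopen/fseek version has neither property: a change to the file layout or the offset breaks the program.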
46.
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
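The three operators named on this slide can be sketched in a few lines over lists of dictionaries. This is a teaching toy under simple assumptions (equi-join on one shared column, no duplicate elimination), not how a database engine implements them; the gene and COG rows are made up.

```python
def select(rows, predicate):
    """Keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def join(left, right, key):
    """Equi-join on one shared column, using a hash index on the right side."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [{**left_row, **right_row}
            for left_row in left
            for right_row in index.get(left_row[key], [])]


# Hypothetical rows; the names are placeholders.
genes = [{"gene": "g1", "cog": "COG5032"}, {"gene": "g2", "cog": "COG5077"}]
cogs = [{"cog": "COG5032", "name": "name_a"}, {"cog": "COG5077", "name": "name_b"}]

# Operators compose: each one takes tables and returns a table.
result = project(
    select(join(genes, cogs, "cog"), lambda row: row["cog"] == "COG5032"),
    ["gene", "name"],
)
print(result)
```

Because every operator returns a table, expressions can be nested and rearranged freely, which is what makes algebraic optimization possible.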
47.
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
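The rewrite on this slide can be mechanized with a tiny rule-based simplifier. The sketch below encodes expressions as nested tuples (an encoding chosen here for illustration) and applies the four laws bottom-up; a relational optimizer applies the same idea to select, project, and join.

```python
def simplify(e):
    """Apply the slide's algebraic laws bottom-up to a tuple expression."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:          # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:          # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y), then law 4 to write it as (x + y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(e):
    """Number of operator nodes in the expression."""
    return 0 if not isinstance(e, tuple) else 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e, z):
    """Evaluate an expression for a given value of the variable z."""
    if e == "z":
        return z
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    x, y = evaluate(a, z), evaluate(b, z)
    if op == "+":
        return x + y
    if op == "*":
        return x * y
    return x / y


# N = ((z*2) + ((z*3) + 0)) / 1
expr = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(expr)

print(optimized, count_ops(expr), count_ops(optimized))
```

Both forms compute the same value for any z; the optimized one just does less work.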
48.
My Interests
Computer Science
Scientific Data
Management
Databases
Data-Intensive Scalable Computing
Research Data Integration
Cloud Computing
Visual Data Analytics
49.
Research Cycle
Observe
Experiment
Analyze
Publish/Share
Synthesis
50.
Web
Services
Data Management
Query
Languages
Storage
Cloud Computing
Visualization;
Workflow
Information Integration
Information Extraction,
Access
Methods
Data Mining,
Distributed Programming Models,
complexity-hiding interfaces
Editor's notes
My name is Bill Howe and I’m from the University of Washington eScience Institute
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
“Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.”
“tens of terabytes of data per day” -- genome center at Washington University
Increase data collection exponentially with FlowCAM
Steps 1,2,3,4 are expensive
At the frontier of research, 1,2,3 are by definition elusive. By definition, the domain is not fully understood. By definition, researchers do not universally agree on the interpretation of data; there is no universal domain model.
How do you build a domain model, build consensus? You perform experiments, analyze the data, publish the results -- the overall goal of the scientific enterprise is to achieve shared knowledge. So it is a mistake to presuppose the existence of an ontology as a way to facilitate data analysis.
We perform data analysis in order to establish an ontology; we can’t make establishing an ontology a prerequisite for data analysis.
shared knowledge is the end result of research, not a precondition for it.
So in practice, you find individual researchers, individual analysts managing their own data ad hoc -- mostly in text files, and mostly with SAS, Perl, Python, MATLAB, or maybe Excel if they do not have a programming background.
So what results is this ecosystem of “desultory” data -- varying levels of metadata, varying levels of quality, varying levels of accessibility.
Now, it is a phenomenally good idea to strive for machine readable representations of knowledge. When these exist, there are a variety of ways to exploit them to improve interoperability, build applications, facilitate communication. But at the frontier of research, they simply can’t exist until we have a chance to analyze all the data.
So this talk is about technology that can tolerate the heterogeneity and ambiguity of research data.
Aside: Formally, an interpretation of a model is a mapping from the elements in the model to the elements in the real world. A sound interpretation of a logical model is one where all true statements in the artificial model are true when mapped into the real world.
A global ontology or other form of comprehensive schema presumes some measure of consensus (which may take the form of mappings between different sub-ontologies or sub-schemas, yet these are still a form of consensus).
Even if you can successfully capture global agreement into a shared schema, it’s only a snapshot -- its half-life will be very short.
Even if your schema is sound and complete, and even if you can keep up with changing requirements,
My claim is that the value of a repository is quadratic in the number of datasets it holds. The reason is that every pairwise comparison of two datasets potentially provides new insight.
Some things to observe: Many systems only provide basic retrieval -- a unary operation -- and so they scale linearly. (Though of course the user can download two datasets and compare them locally, so the B coefficient is non-zero.)
I think of C as the benefit of having a shared domain model to focus the conversation, resolve ambiguities, and encourage converging mental models among users, especially new users (students). So this is important, but it does not get any benefit from having 1, 10, 1000, or 1M datasets. Its value is constant.
So my recommended strategy is to make sure B is non-zero, and crank up D as high as possible.
Now, having a rich, thorough domain model can help facilitate more operations -- maybe a new genome browser might be easier to build if we have a well-tuned schema to work from. However, it’s demonstrably NOT the case that a genome browser *could not* be built without such a schema. Indeed, a huge number of simple desktop genome browsers exist that do not have any shared semantics.
The disadvantage to having a rich, thorough domain model is that it restricts D -- it limits the amount of data you can put into the system. Data with missing, incomplete, or ambiguous metadata cannot be ingested. So you’ve increased C (and possibly B and U) at the expense of D -- this is not a good idea.
So we need systems that are inclusive, that emphasize breadth over depth (at least initially), emphasize coverage.
Many failed systems in and out of science attribute their failure to inflexibility -- too hard to get data into the system due to over-engineered metadata standards.
The value of a data repository is quadratic in the number of datasets it holds
Vr = QD^2 + RD + C
D=number of datasets
Q=analysis capability
R=simple accessibility
C=communicability <-- intrinsic value of domain knowledge, metadata standard
So a rich and thorough domain model increases C, but can decrease D -- it’s too difficult to put data into the correct format, with all the required metadata, so the system is underused.
FGDC
Data in a variety of formats ensconced in autonomous systems with different capabilities and different schemas
How do you get started in this environment? What’s the first thing you do?
Data sources do not share a schema, and may not exhibit a schema at all. Data is allowed to exist in its native form behind its native interfaces.
These data sources are also autonomous, so you’re not necessarily allowed to take all the flat files and replace them with XML.
You have to pay for all this freedom and flexibility somewhere, and here’s where you do it: With lots of global properties, you can define sophisticated services that exploit them: structured query, and strong integrity guarantees
Put another way, databases are inherently exclusive, helping you reject data that does not conform to your schema or satisfy your integrity constraints.
The dataspace support platform is inclusive -- everybody is welcome
The dataspace provides a hierarchy of services to accommodate varying degrees of data “maturity”
Add a screenshot of Google
There is no global schema for the Internet
Search is approximate, “best effort”…and highly effective
What do you need to forecast the physical state of the ocean? You’re going to be solving a set of partial differential equations, so you need forcings at the boundaries of the domain -- river discharge, tides, atmospheric conditions, and bathymetry. Every day, you can download results of atmospheric forecasts, compute tidal forcings, and estimate river discharge, as well as some observational data to compare with your simulation and see how well you’re doing.
These data go into files and relational databases. When the forecast is ready to run, these inputs are staged out to a compute cluster, along with the FORTRAN code that will solve the equations and some post-processing routines. The forecast executes, incrementally generating data files, visualizations, log files, and status information.
This information is pushed back to the storage servers and the visualizations are served over the web.
This process generates lots of intermediate data of a variety of types -- We want to provide browse and query services over these data without disturbing the operational system and without a lengthy design phase -- we want results by 5:00pm
Some data loaded into a relational database
Others left as files (no need for ad hoc query; one-time use; large size)
SELFE: Eulerian-Lagrangian semi-implicit finite element model. Solves the 3D Navier-Stokes equations. Produces 6 variables * 700 MB/day.
Hindcast runs: compare code versions, compare inputs, long term behavior, what if analysis (river dredging), Tsunami model assumptions
Data products:
Animations, maps, timeseries plots, station extractions, model-data comparisons
Slide from Robin Kodner. Key idea: This protocol is far more precise than BLAST for sequence searching, but it generates a lot of heterogeneous intermediate results that must be analyzed -- this step had completely roadblocked the research. With SQLShare, you can just throw all the data up to the cloud and start asking questions right away -- collaboratively. None of the overhead associated with an RDBMS.
== Name ==
SQLShare: Cloud-based Collaborative Query
== PI ==
Ginger Armbrust
== Science Perspective ==
From Robin:
“The SQLShare database is allowing me to do basic sorting and clustering of my data that took me a week to do in Excel, now in a matter of seconds. It is also making it possible to correlate the analyzed data that results from different kinds of analysis from different analysis pathways, which maximizes the use of the data. Further, it allows for finding correlations between different projects and the corresponding environmental metadata that would be impossible without the database. Without the database, I'd only be able to utilize a fraction of my data, and find only a fraction of the interesting nuggets that we are looking for. I conceived of the database to help me with metagenomic data but it's so useful, we are now using it to do comparative genomics and evolutionary studies.”
== Computational View ==
Goal: Tolerate the “spreadsheet tsunami”
Each user has O(100) files with O(100k) rows each
heterogeneous, changing schemas
Observation: Databases underused in science
Hypothesis: Scientists dislike RDBMS, not SQL
Installation, configuration, schema design, physical tuning
Approach: Just put it in the cloud and query it
ignore DB design; do auto-scaling and auto-tuning
System-enabled sharing of data and queries
Only makes sense for science!
Ex: Two labs both buy an AB SOLiD sequencer, so both may use the same queries to process the output
== Resources ==
Azure for the application, SQL Azure for the system data, EC2 for the user data (to avoid 10GB limit on SQL Azure)
== Comparison ==
Quotes: “I can do science again” “That took me a week to do with spreadsheets!” “I spend 90% of my time manipulating data in spreadsheets.” “My research was stuck on data analysis before SQLShare”
Environmental samples are sequenced.
Sequence fragments are looked up in public databases, and passed through phylogenetic analysis to place them at the appropriate location in the tree.
Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files.
This process is repeated many times, leading to 100s of “desultory” spreadsheets
The actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc.
It’s a mess -- when asked how much time is spent “handling data” as opposed to “doing science”, one postdoc said a staggering 90%!
Here are two datasets: Sequence annotations for the Phaeodactylum organism and sequence annotations from an environmental sample.
The task is to compare these sets of annotations to determine what role Phaeo is serving in the metagenomic population, if present.
Previously, researchers had to manually cross-reference data between spreadsheets.
But the join between these datasets is trivially expressed in SQL
Now, that was just the first step -- counting subsets, finding intersections, finding “top K” matches, etc. must also be performed manually, but are also easily expressed in SQL.
-- No schema design: Just upload everything “as is” and start querying. No one wants to create one, and the schema’s going to change anyway.
-- We find that ALL of the scientists’ English queries are expressible in SQL. Some can be complex, however.
-- Challenge: SQL is hard
-- Solution: Let scientists train themselves. Give them examples to modify instead of a “blinking cursor.” More generally: Facilitate collaborative query authoring, sharing, and reuse. Support collaboration between the “carpet lab” and the “tile lab” (Computer geeks work in carpeted offices, bio geeks work in the wet lab.)
-- How?
1) Use the cloud to logically and physically co-locate all data across all labs -- no more islands
2) Let queries be saved and shared
3) Log everything and do machine learning on the log to perform “Query autocomplete” (Nodira and Magda’s work)
4) Automatically adapt queries for use on ‘similar’ datasets (change table names, etc.)
many more ideas….
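The "no schema design" point can be made concrete with a sketch that loads a delimited file as is: column names come from the header row, every column is stored as text, and ragged rows are padded or trimmed. SQLite stands in for the cloud database here, and the function name, table name, and sample rows are assumptions for illustration.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, text, delimiter="\t"):
    """Create a table from a delimited file's header; every column is TEXT."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    # Pad short rows and trim long ones so messy files still load.
    width = len(header)
    fixed = [(row + [None] * (width - len(row)))[:width] for row in body]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', fixed)


# A few tab-separated lines standing in for an uploaded spreadsheet export.
sample = (
    "query\thit\te_value\n"
    "FHJ7DRN01A0TND.1\tCOG0414\t1.00E-08\n"
    "FHJ7DRN02HXTBY.5\tCOG5077\t7.00E-09\n"
)

conn = sqlite3.connect(":memory:")
load_as_is(conn, "coastal_sample", sample)
rows = conn.execute('SELECT hit FROM "coastal_sample" ORDER BY hit').fetchall()
print(rows)
```

Types, keys, and cleaner column names can be layered on later through views, once queries show which ones matter.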
Items to point out:
-- IMPORTANT: These are not trivial queries! But with some help, scientists can write them. We give them an example, and they modify the example and save it for reuse. Queries can be optionally shared across users.
Currently we have insular data sources
Pay as you go
Smoothing the ROI curve!
Science slide for Ginger
In the previous case, the same source of database identifiers was used; when they differ the process can be more complicated.
Here we have two datasets: Phaeo gene annotations again, and a set of sample annotations with references to the TIGRFam database.
The workflow here might look like:
Find an annotation of interest in Phaeo dataset
Look up COG Id to get Protein Name
Search for Protein Name in various online databases (here we use SwissProt) to collect additional information
Browse to cross-reference information to find TIGRFam Id,
Find Gene Ontology synonym of the TIGRFam Id to collect additional metadata (other metadata not shown -- another step)
Finally, match TIGRFam Ids back to original sample.
By putting all of this data into a database, you can write these expressions as joins. More importantly, you can go beyond “lookup” tasks and express the actual science questions directly:
What percentage of Phaeo genes are present in this sample? What metabolic processes are those genes involved in?
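As a sketch of what "express the science question directly" might look like, the query below computes the percentage of Phaeo genes found in a sample when the two datasets use different identifier systems. The cog_to_tigrfam mapping table, all table layouts, and all ids are hypothetical stand-ins for the cross-references collected in the manual workflow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phaeo_genes (gene TEXT, cog_id TEXT);
CREATE TABLE cog_to_tigrfam (cog_id TEXT, tigrfam_id TEXT);
CREATE TABLE sample_annotations (read_id TEXT, tigrfam_id TEXT);

-- Made-up rows: four genes, three mapped to TIGRFam ids, two seen in the sample.
INSERT INTO phaeo_genes VALUES
    ('g1', 'COG_A'), ('g2', 'COG_B'), ('g3', 'COG_C'), ('g4', 'COG_D');
INSERT INTO cog_to_tigrfam VALUES
    ('COG_A', 'TIGR_1'), ('COG_B', 'TIGR_2'), ('COG_C', 'TIGR_3');
INSERT INTO sample_annotations VALUES
    ('r1', 'TIGR_1'), ('r2', 'TIGR_1'), ('r3', 'TIGR_3');
""")

# Two joins replace the lookup chain; the outer query answers the question.
(pct_present,) = conn.execute("""
    SELECT 100.0 * COUNT(DISTINCT s.gene) / (SELECT COUNT(*) FROM phaeo_genes)
    FROM (
        SELECT p.gene
        FROM phaeo_genes AS p
        JOIN cog_to_tigrfam AS m ON m.cog_id = p.cog_id
        JOIN sample_annotations AS a ON a.tigrfam_id = m.tigrfam_id
    ) AS s
""").fetchone()

print(pct_present)
```

COUNT(DISTINCT ...) matters here: a gene matched by several reads should still count once.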
Note that we do NOT want to attempt to create “YAUDB” (yet another universal database). These data are uploaded and manipulated in an exploratory, task-specific manner. We aim to provide SQL over YOUR data, not a universal reference database from scratch.
(That being said, our research involves learning a universal database schema -- incrementally and organically -- based on the uploaded data, the executed queries, and any available user input.)
Bill
It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
So what’s wrong?
Applications write queries, not users
Schema design, tuning, “protectionist” attitudes
It turns out that you can express a wide variety of computations using only a handful of operators.
Data Management != Storage Management
Storage Management is
SATA/SCSI/Fiber
Backup policies and procedures
redundancy decisions (RAID 0, 1+0, 0+1, 5)
Access methods
Query languages
Data Mining, Analysis, Visualization
Data Integration