SlideShare a Scribd company logo
1 of 37
SQL is Dead; Long Live SQL  Lightweight Query Services for  Ad Hoc Research Data Bill Howe Garret Cole Nodira Khoussainova  Luke Zettlemoyer  Shaminoo Kapoor Yuan Zhou Patrick Michaud  Kevin Pittman Charlon Palacay
 
http://escience.washington.edu
 
 
Context: The Long Tail  [Wired 2004] The long tail is getting fatter:   notebooks become spreadsheets (MB), spreadsheets become databases (GB),  databases become clusters (TB) clusters become clouds (PB) data size rank ,[object Object],[object Object],[object Object],CERN (~15PB/year) LSST (~100PB) PanSTARRS (~40PB) Ocean Modelers citizen science SDSS (~100TB) Seis-mologists Microbiologists CARMEN (~50TB)
Problem How much time do you spend “handling data” as opposed to “doing science”? Mode answer? 90%
Example: Environmental Metagenomics 5/18/10 Garret Cole, eScience Institute
5/18/10 Garret Cole, eScience Institute
5/18/10 Garret Cole, eScience Institute
5/18/10 Garret Cole, eScience Institute
measurements search results sequence data
5/18/10 Garret Cole, eScience Institute SQL
Demo
Ad Hoc Research Data 5/18/10 Garret Cole, eScience Institute Fasta Spreadsheets ASCII
Simple Example 5/18/10 Garret Cole, eScience Institute ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome COGAnnotation_coastal_sample.txt SELECT  *   FROM   Phaeo P, Coastal C  WHERE   P.hit = C.hit
Complex Example coastal sample COG database ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome SwissProt web service Browser Cross-Reference TIGR01650 GO:0051116 contributes_to TIGR01651 GO:0009236 NULL TIGR01651 GO:0051116 NULL TIGR01660 GO:0008940 NULL TIGR01660 GO:0009061 NULL TIGR01660 GO:0009325 NULL TIGR01663 GO:0000012 NULL TIGR01663 GO:0046403 NULL TIGRFAM to GO Mapping coastal sample
More examples ,[object Object],[object Object],[object Object],SELECT  *   FROM   plasmiddb WHERE   NOT  ( ISDATE (cloned)  OR  cloned = ‘yes’) SELECT  hit,  COUNT (*) as cnt  FROM  tigrfamannotation_surface  GROUP BY  hit  ORDER BY  cnt  DESC SELECT   count (*) FROM   [bombardment_log] WHERE   bomb_date  BETWEEN   ’ 7/1/2010'   AND  ’7/31/2010' AND   rescue clone IS NOT NULL AND   [expression?] = 'yes'
My favorite example ,[object Object],SELECT  col0  FROM  [refseq_hma_fasta_TGIRfam_refs] UNION SELECT  col0  FROM  [est_hma_fasta_TGIRfam_refs] UNION SELECT  col0  FROM  [combo_hma_fasta_TGIRfam_refs] EXCEPT SELECT  col0  FROM  [refseq_hma_fasta_TGIRfam_refs] INTERSECT SELECT  col0  FROM  [est_hma_fasta_TGIRfam_refs] INTERSECT SELECT  col0  FROM  [combo_hma_fasta_TGIRfam_refs]
Digression: Relational Database History Pre-Relational: if your data changed, your application broke. “ Activities of users at terminals and most application programs  should remain unaffected when the internal representation of data is changed  and even when some aspects of the external representation are changed.” Key Ideas:  Programs that manipulate tabular data exhibit an  algebraic structure  allowing reasoning and manipulation independently of physical data representation -- Codd 1979
Key Idea: Data Independence physical data independence logical data independence files and pointers relations views SELECT *  FROM my_sequences  SELECT  seq   FROM  ncbi_sequences WHERE  seq  = ‘GATTACGATATTA’; f = fopen( ‘table_file’ ); fseek( 10030440 ); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
Key Idea: An  Algebra of Tables select project join join Other operators: aggregate, union, difference, cross product
Key Idea: Algebraic Optimization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Same idea works with the Relational Algebra!
RDBMS: Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
What’s the point? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],So we ask: What kind of platform can deliver SQL to scientists?
Architecture ASPX App Silverlight Uploader Python API “ Flagship” SQLShare  App  (PHP) on EC2 REST Windows Azure SQL Azure SQL Server on EC2 Excel Addin
Bootstrapping new projects ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],A mini version of Jim Gray’s 20 questions methodology
Other use cases we are seeing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],*Khoussinova, CIDR 2009 **credit: Alex Szalay
Usage ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Automating the Process ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
http://sqlshare.escience.washington.edu (Ask me about the “NoSQL” movement)
23 rd  International Conference on  Scientific and Statistical Database Management  (SSDBM 2011) http://www.ssdbm2011. ssdbm .org Portland, Oregon, USA July 20 – 22, 2011 Abstracts due: January 31, 2011 Papers due: February 7, 2011
Backup slides
The “NoSQL” Movement
Everything is a View ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],* [Khoussainova, CIDR 2009]
Background: Sloan Digital Sky Survey ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
How do we repeat the success of SDSS in the long tail? How do we build the next 100 SDSS-like systems?

More Related Content

What's hot

Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Josef Hardi
 
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...Sease
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsDan Sullivan, Ph.D.
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008Peiling Wang
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life SciencesEamonn Maguire
 
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke InformatieWorkshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatiejstaaks
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsYalçın Yenigün
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Analysis of the “KDD Cup-1999” Datasets
Analysis of the  “KDD Cup-1999”  DatasetsAnalysis of the  “KDD Cup-1999”  Datasets
Analysis of the “KDD Cup-1999” DatasetsRafsanjani, Muhammod
 
Scaling Analytics with elasticsearch
Scaling Analytics with elasticsearchScaling Analytics with elasticsearch
Scaling Analytics with elasticsearchdnoble00
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data ManagementC. Tobin Magle
 
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...Hilmar Lapp
 
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...Concordia University
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
Using elasticsearch with rails
Using elasticsearch with railsUsing elasticsearch with rails
Using elasticsearch with railsTom Z Zeng
 
HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkEamonn Maguire
 
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Gilles Fedak
 

What's hot (20)

Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
Text mining meets neural nets
Text mining meets neural netsText mining meets neural nets
Text mining meets neural nets
 
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Ski...
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in Bioinformatics
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
 
Reproducible, Open Data Science in the Life Sciences
Reproducible, Open  Data Science in the  Life SciencesReproducible, Open  Data Science in the  Life Sciences
Reproducible, Open Data Science in the Life Sciences
 
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke InformatieWorkshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
 
Building High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning ApplicationsBuilding High Available and Scalable Machine Learning Applications
Building High Available and Scalable Machine Learning Applications
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Analysis of the “KDD Cup-1999” Datasets
Analysis of the  “KDD Cup-1999”  DatasetsAnalysis of the  “KDD Cup-1999”  Datasets
Analysis of the “KDD Cup-1999” Datasets
 
Scaling Analytics with elasticsearch
Scaling Analytics with elasticsearchScaling Analytics with elasticsearch
Scaling Analytics with elasticsearch
 
2012 User's Conference Indivo Updates
2012 User's Conference Indivo Updates2012 User's Conference Indivo Updates
2012 User's Conference Indivo Updates
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
 
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...
Towards ubiquitous OWL computing: Simplifying programmatic authoring of and q...
 
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...
ICSE2016 - Detecting Problems in Database Access Code of Large Scale Systems ...
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
Using elasticsearch with rails
Using elasticsearch with railsUsing elasticsearch with rails
Using elasticsearch with rails
 
HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 Talk
 
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
 

Viewers also liked

PPL presentation 2010
PPL presentation 2010PPL presentation 2010
PPL presentation 2010SlimTrabelsi
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Fastest Way to DRUPAL
Fastest Way to DRUPALFastest Way to DRUPAL
Fastest Way to DRUPALBrahm
 
Ad tech singapore (public)
 Ad tech singapore (public) Ad tech singapore (public)
Ad tech singapore (public)paulwongjm
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceUniversity of Washington
 

Viewers also liked (6)

End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
PPL presentation 2010
PPL presentation 2010PPL presentation 2010
PPL presentation 2010
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Fastest Way to DRUPAL
Fastest Way to DRUPALFastest Way to DRUPAL
Fastest Way to DRUPAL
 
Ad tech singapore (public)
 Ad tech singapore (public) Ad tech singapore (public)
Ad tech singapore (public)
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory Science
 

Similar to SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science

Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Matthew J Collins
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresSteven Johnson
 
Database Research Principles Revealed
Database Research Principles RevealedDatabase Research Principles Revealed
Database Research Principles Revealedinfoblog
 
Being RDBMS Free -- Alternate Approaches to Data Persistence
Being RDBMS Free -- Alternate Approaches to Data PersistenceBeing RDBMS Free -- Alternate Approaches to Data Persistence
Being RDBMS Free -- Alternate Approaches to Data PersistenceDavid Hoerster
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Session #5 content providers
Session #5  content providersSession #5  content providers
Session #5 content providersVitali Pekelis
 
Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPBob Ward
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaoneSimon Elliston Ball
 

Similar to SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science (20)

Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Database Research Principles Revealed
Database Research Principles RevealedDatabase Research Principles Revealed
Database Research Principles Revealed
 
Being RDBMS Free -- Alternate Approaches to Data Persistence
Being RDBMS Free -- Alternate Approaches to Data PersistenceBeing RDBMS Free -- Alternate Approaches to Data Persistence
Being RDBMS Free -- Alternate Approaches to Data Persistence
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Session #5 content providers
Session #5  content providersSession #5  content providers
Session #5 content providers
 
Inside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTPInside SQL Server In-Memory OLTP
Inside SQL Server In-Memory OLTP
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaone
 

More from University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersUniversity of Washington
 

More from University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science

  • 1. SQL is Dead; Long Live SQL Lightweight Query Services for Ad Hoc Research Data Bill Howe Garret Cole Nodira Khoussainova Luke Zettlemoyer Shaminoo Kapoor Yuan Zhou Patrick Michaud Kevin Pittman Charlon Palacay
  • 2.  
  • 4.  
  • 5.  
  • 6.
  • 7. Problem How much time do you spend “handling data” as opposed to “doing science”? Mode answer? 90%
  • 8. Example: Environmental Metagenomics 5/18/10 Garret Cole, eScience Institute
  • 9. 5/18/10 Garret Cole, eScience Institute
  • 10. 5/18/10 Garret Cole, eScience Institute
  • 11. 5/18/10 Garret Cole, eScience Institute
  • 13. 5/18/10 Garret Cole, eScience Institute SQL
  • 14. Demo
  • 15. Ad Hoc Research Data 5/18/10 Garret Cole, eScience Institute Fasta Spreadsheets ASCII
  • 16. Simple Example 5/18/10 Garret Cole, eScience Institute ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome COGAnnotation_coastal_sample.txt SELECT * FROM Phaeo P, Coastal C WHERE P.hit = C.hit
  • 17. Complex Example coastal sample COG database ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome SwissProt web service Browser Cross-Reference TIGR01650 GO:0051116 contributes_to TIGR01651 GO:0009236 NULL TIGR01651 GO:0051116 NULL TIGR01660 GO:0008940 NULL TIGR01660 GO:0009061 NULL TIGR01660 GO:0009325 NULL TIGR01663 GO:0000012 NULL TIGR01663 GO:0046403 NULL TIGRFAM to GO Mapping coastal sample
  • 18.
  • 19.
  • 20. Digression: Relational Database History Pre-Relational: if your data changed, your application broke. “ Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation -- Codd 1979
  • 21. Key Idea: Data Independence physical data independence logical data independence files and pointers relations views SELECT * FROM my_sequences SELECT seq FROM ncbi_sequences WHERE seq = ‘GATTACGATATTA’; f = fopen( ‘table_file’ ); fseek( 10030440 ); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
  • 22. Key Idea: An Algebra of Tables select project join join Other operators: aggregate, union, difference, cross product
  • 23.
  • 24.
  • 25.
  • 26. Architecture ASPX App Silverlight Uploader Python API “ Flagship” SQLShare App (PHP) on EC2 REST Windows Azure SQL Azure SQL Server on EC2 Excel Addin
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. http://sqlshare.escience.washington.edu (Ask me about the “NoSQL” movement)
  • 32. 23 rd International Conference on Scientific and Statistical Database Management (SSDBM 2011) http://www.ssdbm2011. ssdbm .org Portland, Oregon, USA July 20 – 22, 2011 Abstracts due: January 31, 2011 Papers due: February 7, 2011
  • 35.
  • 36.
  • 37. How do we repeat the success of SDSS in the long tail? How do we build the next 100 SDSS-like systems?

Editor's Notes

  1. Spreadsheets to Databases Other Large Scale Query Services Ad Hoc Research Data Cloud-based Ad Hoc Integration Dataspace Cloud-sourced integration SQLShare: Database as a Service for Ad Hoc Science Data
  2. The long tail of eScience -- huge number of scientists who struggle with data management, but do not have access to IT reesources -- no clusters, no system administrators, no programmers, and no computer scientists. They rely on spreadsheets, email, and maybe a shared file system. Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources. However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel? Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
  3. (granted we had a minute for Bill (clearly Bill) to describe this new eScience movement) We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.
  4. Lets see how heterogeneous some of this data can actually get before its ready to be digested into a database. In Enviromental Metagenomics, we start by collecting by taking samples from the water
  5. The DNA material in the samples taken from the water are then sequenced in a machine to produce millions of short strings
  6. These DNA reads can then be cross-referenced in public databases to determine what organisms were present in the water, and what genes were being expressed
  7. Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files. This process is repeated many times, leading to 100s of spreadsheets At this point, the actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc. It’s a mess -- when asked how much time is spent “handling data” as opposed to “doing science”! We’ve heard that 90% of their work is manipulating the data before they can actually answer a question!
  8. At this point, we try to relieve the scientists of having to deal with multiple spreadsheets and file formats by just “dumping everything” into this mystical program that unifies everything under one banner. With no schema design to constrain the data, everything is uploaded “as is” and is ready to be queried upon. -- How? 1) Use the cloud to logically and physically co-locate all data across all labs -- no more islands 2) Let queries be saved and shared 3) Log everything and do machine learning on the log to perform “Query autocomplete” (Nodira and Magda’s work) 4) Automatically adapt queries for use on ‘similar’ datasets (change table names, etc.) many more ideas….
  9. What exactly is Ad Hoc Research data? It is data that can come in any size shape or form, where the data is heterogeneous within its structure, format, quality, and more.
  10. Previously, researchers had to manually cross-reference data between spreadsheets. We’ve heard that this process in particular could take a week to do depending on the number of data sources. But as common database users, we recognize that the join between these datasets is trivially expressed in SQL
  11. In the previous case, the same source of database identifiers were used; when they differ the process can be more complicated. Here we have two datasets: Phaeo gene annotations again, and set of sample annotations with references to the TIGRFam database. The workflow here might look like: Find an annotation of interest in Phaeo dataset Look up COG Id to get Protein Name Search for Protein Name in various online databases (here we use SwissProt) to collect additional information Browse to cross-reference information to find TIGRFam Id, Find Gene Ontology synonym of the TIGRFam Id to collect additional metadata (other metadata not shown -- another step) Finally, match TIGRFam Ids back to original sample. By putting all of this data into a database, you can write these expressions as joins. More importantly, you can go beyond “lookup” tasks and express the actual science questions directly: What percentage of Phaeo genes are present in this sample? What metabolic processes are those genes involved in? Note that we do NOT want to attempt to create “YAUDB” (yet another universal database). these data are uploaded and manipulated in an exploratory, task-specific manner. We aim to provide SQL over YOUR data, not a universal reference database from scratch. (That being said, our research involves learning a universal database schema -- incrementally and organically -- based on the upoloaded data, the executed queries, and any available user input. Bill
  12. To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on? Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart. If we could just turn their questions in SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database. This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do.
  13. It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  14. It turns out that you can express a wide variety of computations using only a handful of operators.
  15. So what’s wrong? Applications write queries, not users Schema design, tuning, “protectionist” attitudes
  16. Wed - Fri