SlideShare una empresa de Scribd logo
1 de 27
Unison: Enabling easy, rapid, and
comprehensive proteomic mining
http://unison-db.org/
Online access, download, documentation, references.

Reece Hart
Genentech, Inc.

UCSF / SF PostgreSQL Users' Group
March 11, 2009
San Francisco, CA
                      Slides available at http://harts.net/reece/pubs/
A Bestiary of Life Sciences Data Types
              Genomics
                                                        Proteomics
        assemblies, transcripts,
                                               sequences, domains, PTMs,
         probes, trans. factors,
                                             localization, structure, orthology,
           expression, SNPs,
                                                 predictions, networks …
             haplotypes …



                                                                        Annotation
     Chemistry                                                    GO, taxonomy, SCOP,
compounds, HCS, HTS,                                               disease, OMIM …
    properties …




       Clinincal                                                          LIMS
   assays, protocols,                                          animal records, protocols
    patient records,                                               request systems,
      samples …                                                 personnel, samples …
                                     Communications
                                   literature, patents, and
                                        presentations …
                                                                                           2
Types of Integration

    Source Aggregation
➢                                      RefSeq       In-house
                                      Sequences     Structures
         Aggregates data of the
     ●


         same type from multiple     UniProt         PDB
         sources.                   Sequences     Structures
         Ensures completeness of
     ●


         data.


                                                    Structures
                                    Sequences
    Semantic Integration
➢
         Integrates fundamentally
     ●


         distinct data types.
         Abstracts types to
     ●


         essential features.
                                                    Structures
                                    Sequences
         Improves contextual
     ●


         understanding of data.
                                                                 3
A Survey of Integration Methods

Presentation
                                                                                        Mashups
                                                       Link Integration                 AJAX, iframe
                                                       Hypertext links between
                                                                sites




Middle Tier

                                                         Server
                                                        Mashups


Database
Integration
                                                  F                              W
(Federation /
Warehouse)
                                          Federation                                 Warehouse

Source Databases
                                            A                       B                  C
or Files
For review, see:
Goble C, Stevens R
                                                                                                       4
J Biomed Inform. 2008 Oct;41(5):687-93.
The Problems in a Nutshell
    Data integration is complex.
➢
         Establishing semantic equivalences and
     ●


         relationships are difficult.
         Source database contents are updated often.
     ●




    Existing tools don't cut it.
➢
         Licensing restrictions prevent sharing.
     ●


         Narrow in type of data and/or content, and not
     ●


         easily updated.
         Not specifically designed for mining.
     ●




    Scientists develop ad hoc and integration
➢
    solutions.
         Results are difficult to repeat.
     ●


         It wastes a lot of time.
     ●


         Questions don't get asked.
     ●
                                                          5
Unison in a Nutshell




                         Domain,
                                                         Structures
                  Structure & Homology
                                                         & Ligands
                       Predictions

                                        Protein
                                     Sequences and
                                      Annotations
                                                        Auxiliary
                      Genomes,
                                                      Annotations
                    Gene Mapping &
                                                     GO, RIF, SCOP,
                      Structure,
                                                          etc.
                       Probes



      Sequences and Annotations         Auxiliary Data   Precomputed predictions
UniProt, IPI, Ensembl, RefSeq, PDB, HomoloGene, Gene     Domains, homology, structure, TMs,
 PHANTOM, HUGE, ROUGE, MGC,               Ontology,      localization, signals, disorder, etc.
              Derwent, pataa, nr, etc. taxonomy, PDB,    >200M predictions, 23 types,
                                                                                               6
>13M seqs, >17k species, 69 origins HUGO, SCOP, etc.     ~6 CPU-years
Unison Contents
  patents                       HUGO
  Geneseq:AAP60074              TNFSF9
                                                      homologs
  1991-10-29                    TNFSF10
  SUNTORY                       TNFSF11
                                                      NP_000585.2 NP_036807.1 | RAT
  EP205038-A; New tumour...
                                                      NP_000585.2 NP_038721.1 | MOUSE
                                                      NP_000585.2 XP_858423.1 | CANFA


 GO                                                                                                      SNPs
 Function                                                                                                P84L
   transcription                                                                                         A94T
      initiation

                       aliases
      elongation

                       TNFA_HUMAN
Entrez                                                sequences                         protein features
                       Q1XHZ6
                       IPI00001671.1
gene_id                                               >Unison:98
                       INCY:1109711.FL1p
symbol                                                MSTESMIRDVE...FGIIAL
                       CCDS4702.1
locus                                                 >Unison:23782
                       gi:25952111
                                                      VRSSSRTPSD...FGIIAL                  1   |    23   |         | SS
                                                                                         108   |   143   | 1.8e-06 | EGF
                                                                                         162   |   184   |         | TM

taxonomy                                                                                 133   |   138   |         | ITIM

                                                                   alignments
9606 Homo sapiens
                                                                   TNFA 1tnfA
10090 Mus musculus
                                                                                                   aa-to-resid
                                                                   TNFA 1tnfB
10028 Rattus rattus

                              loci                                 ...
                                                                                                   MSTESMIR
                                                                   TNFA 5tswF
                                                                                                   DVEFGIIA
                                1 233 6+:31651498-31653288
                                                                                                   TESMIRDV
                                                                                                   IIAMDAC

                                                                                structures
                                                                                                                SCOP
                                                                                1tnf
  genomes                                                                       1a8m                            all alpha
                                            probes                              2tun                            all beta
  Hs35
                                                                                4tsv                             Ig
  Hs36
                                            HGU133P                             5tsw                             TNF-like
  RAT
                                            WHG                                                                 alpha+beta
                                                                                                                             7
Ex1: Mine for sequences w/conserved features.
  patents                       HUGO
  Geneseq:AAP60074              TNFSF9
                                                      homologs
  1991-10-29                    TNFSF10
  SUNTORY                       TNFSF11
                                                      NP_000585.2 NP_036807.1 | RAT
  EP205038-A; New tumour...
                                                      NP_000585.2 NP_038721.1 | MOUSE
                                                      NP_000585.2 XP_858423.1 | CANFA


 GO                                                                                                      SNPs
 Function                                                                                                P84L
   transcription                                                                                         A94T
      initiation

                       aliases
      elongation

                       TNFA_HUMAN
Entrez                                                sequences                         protein features
                       Q1XHZ6
                       IPI00001671.1
gene_id                                               >Unison:98
                       INCY:1109711.FL1p
symbol                                                MSTESMIRDVE...FGIIAL
                       CCDS4702.1
locus                                                 >Unison:23782
                       gi:25952111
                                                      VRSSSRTPSD...FGIIAL                  1   |    23   |         | SS
                                                                                         108   |   143   | 1.8e-06 | EGF
                                                                                         162   |   184   |         | TM

taxonomy                                                                                 133   |   138   |         | ITIM

                                                                   alignments
9606 Homo sapiens
                                                                   TNFA 1tnfA
10090 Mus musculus
                                                                                                   aa-to-resid
                                                                   TNFA 1tnfB
10028 Rattus rattus

                              loci                                 ...
                                                                                                   MSTESMIR
                                                                   TNFA 5tswF
                                                                                                   DVEFGIIA
                                1 233 6+:31651498-31653288
                                                                                                   TESMIRDV
                                                                                                   IIAMDAC

                                                                                structures
                                                                                                                SCOP
                                                                                1tnf
  genomes                                                                       1a8m                            all alpha
                                            probes                              2tun                            all beta
  Hs35
                                                                                4tsv                             Ig
  Hs36
                                            HGU133P                             5tsw                             TNF-like
  RAT
                                            WHG                                                                 alpha+beta
                                                                                                                             8
Ex2: Locate SNPs and domains on structure.
  patents                       HUGO
  Geneseq:AAP60074              TNFSF9
                                                      homologs
  1991-10-29                    TNFSF10
  SUNTORY                       TNFSF11
                                                      NP_000585.2 NP_036807.1 | RAT
  EP205038-A; New tumour...
                                                      NP_000585.2 NP_038721.1 | MOUSE
                                                      NP_000585.2 XP_858423.1 | CANFA


 GO                                                                                                      SNPs
 Function                                                                                                P84L
   transcription                                                                                         A94T
      initiation

                       aliases
      elongation

                       TNFA_HUMAN
Entrez                                                sequences                         protein features
                       Q1XHZ6
                       IPI00001671.1
gene_id                                               >Unison:98
                       INCY:1109711.FL1p
symbol                                                MSTESMIRDVE...FGIIAL
                       CCDS4702.1
locus                                                 >Unison:23782
                       gi:25952111
                                                      VRSSSRTPSD...FGIIAL                  1   |    23   |         | SS
                                                                                         108   |   143   | 1.8e-06 | EGF
                                                                                         162   |   184   |         | TM

taxonomy                                                                                 133   |   138   |         | ITIM

                                                                   alignments
9606 Homo sapiens
                                                                   TNFA 1tnfA
10090 Mus musculus
                                                                                                   aa-to-resid
                                                                   TNFA 1tnfB
10028 Rattus rattus

                              loci                                 ...
                                                                                                   MSTESMIR
                                                                   TNFA 5tswF
                                                                                                   DVEFGIIA
                                1 233 6+:31651498-31653288
                                                                                                   TESMIRDV
                                                                                                   IIAMDAC

                                                                                structures
                                                                                                                SCOP
                                                                                1tnf
  genomes                                                                       1a8m                            all alpha
                                            probes                              2tun                            all beta
  Hs35
                                                                                4tsv                             Ig
  Hs36
                                            HGU133P                             5tsw                             TNF-like
  RAT
                                            WHG                                                                 alpha+beta
                                                                                                                             9
Analysis and data mining have distinct needs.
                                                       (semantic integration)
                                                       feature types/models HMM, TM, signal, etc.
   (source integration)
   sequences non-redundant superset of all sequences
                                                                                                                                         Sequence Analysis
                                                                                                                                         i.e., show predictions for a given sequence
                                                                                                                                         Typically involves minutes to hours of computing per sequence.
                                                         Typically entails days to months of computing results.
                                                         i.e., show sequences that contain specified features.

                                                                                                                  Feature-Based Mining
                                                                                                                                            Prediction results
                                                                                                                                            method-specific data such as score, e-value, p-
                                                                                                                                            value, kinase probability, etc.




                                                                                                                                                                                         parameters
                                                                                                                                                                         execution arguments/options for every
                                                                                                                                                                                      prediction type and result



                                                                                                                                                                                                                   10
Mining for ITIMs the Old Way

                        Ig            TM         ITIM



     Collect sequences.
 ➢
     Prune redundant sequences. (How?!)
 ➢
     For each unique sequence, predict
 ➢
          Immunoglobulin domains.
      ●


          Transmembrane domains.
      ●


          ITIM domains.
      ●


     Write a program that filters predictions.
 ➢
     Summarize hits with external data.
 ➢
     Do it again when source data are updated.
 ➢




For Review: Daëron M Immunol Rev. 2008 Aug;224:11-43    11
Mining for ITIMs the Unison Way

                                Ig                  TM             ITIM
SELECT IG.pseq_id,
          IG.start as ig_start,IG.stop as ig_stop,IG.score,IG.eval,
          TM.start as tm_start,TM.stop as tm_stop,
          ITIM.start as itim_start,ITIM.stop as itim_stop
 FROM pahmm_current_pfam_v IG
 JOIN pftmhmm_tms_v TM ON IG.pseq_id=TM.pseq_id                           AND IG.stop<TM.start
 JOIN pfregexp_v ITIM                ON TM.pseq_id=ITIM.pseq_id AND TM.stop<ITIM.start
WHERE IG.name='ig' AND IG.eval<1e-2
          AND ITIM.acc='MOD_TYR_ITIM';

              Ig      Ig                      TM     Tm    ITIM     ITIM
           start   stop score               start   stop   start    stop best_annotation
pseq_id                              eval
                                                                     523 UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecNam
    234     262     316    30    7.40E-06    440     462    518
                                                                     391 UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecNam
    254     158     213    36    1.90E-07    284     306    386
                                                                     436 UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecNam
    544     157     215    24    6.60E-04    348     370    431
    797     254     312    40    7.60E-09   1099    1121   1361     1366 UniProtKB/Swiss-Prot:DCC_HUMAN (RecName
   1113      42     102    30    1.20E-05    243     265    300      305 UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecNam
                                                                     335 UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecNam
   1114      42     102    30    6.50E-06    243     265    330
                                                                     306 UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecNam
   1115      42     102    31    4.20E-06    243     265    301
                                                                     401 UniProtKB/TrEMBL:Q95368_HUMAN (SubName
   1116      42      97    30    1.10E-05    339     361    396
                                                                                                              12
   1134     340     388    26    1.40E-04    603     625    688      693 UniProtKB/Swiss-Prot:PECA1_HUMAN (RecNam
Unison has many applications.
Unison Web Tools                                   Other In-House Tools                                                  Ad Hoc Mining



                                                                                                                             Mining and
                                                                                                                             analysis
                                                                                                                             projects




                                              Domain,
                                                                                 Structures
                                       Structure & Homology
                                                                                 & Ligands
                                            Predictions

                                                               Protein
                                                            Sequences and
                                                             Annotations
                                                                                Auxiliary
                                            Genomes,
                                                                              Annotations
                                          Gene Mapping &
                                                                             GO, RIF, SCOP,
                                            Structure,
                                                                                  etc.
                                              Probes



                          Sequences and Annotations          Auxiliary Data      Precomputed predictions
                     UniProt, IPI, Ensembl, RefSeq, PDB    HomoloGene, Gene      Domains, homology, structure, TMs,
                   STRING, PHANTOM, HUGE, ROUGE,           Ontology, taxonomy,   localization, signals, disorder, etc.
                                                                                                                                          13
                           MGC, Derwent, pataa, nr, etc.   PDB, HUGO, SCOP,      >200M predictions, 23 types,
                     >13M seqs, >17k species, 69 origins           etc.          ~6 CPU-years
Unison facilitates complex mining.




                             Jason Hackney
                             Nandini Krishnamurthy
                             Li Li
                             Yun Li
                             Jinfeng Liu
                             Shiu-ming Loh
                             Kiran Mukhyala     14
Data integration led to Bcl-2 discoveries.



                                                     +           sequences,
                                                                   models,
                                                               HMM alignments,
                                                                 automation




        Custom model building

      Z'fish        Source Database        Human                                   %
                                                     E-value    Score   % Ide
     Protein         and Accession         Protein                              Coverage
       Bax         RefSeq:NP_571637                  2.00E-47    189     51        98


⇒                                                                                          ⇒
                                            BAX
      Bax2     E35:ENSDARP00000040899                1.00E-14    81      33        51
       Bik        UP:Q5RGV6_BRARE           BIK      1.41E+04    20      47        12
      Bmf        RefSeq:NP_001038689                 1.00E-05    50      32        91
                                            BMF
     Bmf2      E35:FGENESH00000082230                1.10E-02    42      41        42
     BBC3      E35:FGENESH00000078270      PUMA      2.10E+01    30      25        49

                        4 novel Bcl-2 proteins in zebrafish




Kratz et al., Cell Death Differ. (2006).                                                       15
Unison Web Tools




                   16
Structure Viewer with User Features!
http://unison-db.org/pseq_structure.pl?q=TNFA_HUMAN;userfeatures=Estrand@164-174,mysnp@170




                                                                                             17
Unison is a platform for diverse tools.




                                    Matt Brauer
                                    Guy Cavet
                                    Josh Kaminker
                                    Scott Lohr
                                    Kathryn Woods
                                    Jean Yuan
                                    Peng Yue 18
Unison Build Process

      Phase 0    Phase 1     Phase 2    Phase 3     Phase 4     Phase 5    Phase 6
     Download    Load Aux     Load      Update       Update     Update     House-
                   Data     Sequences    Sets      Predictions Mat Views   keeping
        2d          4h         2d         1h       50 CPU-d       6h          0

                            Makefile
Makefile
                            loads auxiliary data
downloads all data
                            loads sequences and annotations
                                (in-house is just another source)
                            updates sequence sets
                            updates precomputed predictions
                                (incremental update!)
                            updates precomputed analyses and mat'd views
                            builds public database



     Runs in a cron job
 ➢
     Requires ~10% time of 1 person
 ➢
     Consistent, reliable builds
 ➢

                                                                                     19
Benefit Lessons

    Integrate to enable reasoning based on a
➢
    corpus of data of multiple types and/or
    from multiple origins.
         To analyze biological data in broad context.
     ●


         To generate hypotheses by data mining.
     ●


         To enable business decisions based on a holistic
     ●


         view of decision criteria.

    Ancillary benefits:
➢
         Data preparation is hard. Centralization means
     ●


         that questions get asked and asked efficiently.
         Integrated data provides a consistent foundation
     ●


         on which others can build.
         Integration improves currency.
     ●




                                                            20
Design Lessons

    Know what data to integrate, how they'll
➢
    be used, and the converse.

    Integrate on simple, intuitively meaningful
➢
    abstract concepts.
         Precise definitions are critical.
     ●


         Represent proprietary data elsewhere, if needed.
     ●




    Aggregate on data types.
➢
         Corollary: Partitioning on content makes data
     ●


         silos.

    Design for Integrity.
➢
         Reliability is everything.
     ●

                                                            21
Process Lessons

    Explicitly track the provenance of data.
➢
         All data in Unison are tied to an origin –
     ●


         predictions, annotations, sequences, models.

    Plan for updates.
➢
         Updates are completely automated and
     ●


         idempotent.

              idempotent
              i⋅dem⋅po⋅tent (/ˈaɪdəmˈpoʊtnt, ˈɪdəm-/)
              adj. [from mathematical techspeak] Acting as if
              used only once, even if used multiple times.


              idempotent. Dictionary.com. Jargon File 4.2.0.
              http://dictionary.reference.com/browse/idempotent (accessed:
              February 25, 2009).
                                                                             22
Other Lessons

    Design security from the start.
➢
         Internal version of Unison use Kerberos.
     ●


         Especially important in a world of distributed
     ●


         services and data.

    Include web services early in the design.
➢
         (Ooops, I blew it on this.)
     ●




                                                          23
A Few Reasons for PostgreSQL.

    Excellent support for server-side functions
➢
         in PL/PGSQL, Perl, C, Java, Python, R, sh, ...
     ●


    Table inheritance
➢
         Facilitates type abstraction
     ●


    GSSAPI/Kerberos support
➢
         No password admin
     ●


         User identity all the way to the database
     ●


    psql rocks
➢
    Pedantic and responsive development
➢
    community
    Ease community adoption (?)
➢




                                                          24
Kiran Mukhyala

                                        Fernando Bazan, Matt Brauer,
                                        David Cavanaugh, Jason Hackney,
                                        Pete Haverty, Ken Jung, Josh
                                        Kaminker, Nandini Krishnamurthy,
                                        Li Li, Yun Li, Scott Lohr, Shiuh-
                                        ming Loh, Jinfeng Liu, Peng Yue,
                                        Jianjun Zhang, Yan Zhang

                                        Simran Hansrai, Marc Lambert,
                                        Dave Windgassen

                                        A huge open science and open
                                        source community.

                                        http://unison-db.org/
                                        Open access web site, downloads,
                                        documentation, references,
  “Are you sure about this              credits.
 Stan? It seems odd that a
pointy head and a long beak             unison-db.org:5432
                                        PostgreSQL & odbc/jdbc/sdbc
 is what makes them fly.”
                                        access                              25
  J. Workman, Science 245:1399 (1989)
26
Unison form follows function.
            Params/Models




Sequences   Results




                                   27

Más contenido relacionado

Destacado

Sql 學習繪本
Sql 學習繪本Sql 學習繪本
Sql 學習繪本chuyenyin
 
Presentation mr ono
Presentation mr onoPresentation mr ono
Presentation mr onorjmchicago
 
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...Leonardo Duran
 
Unit 1 paper 1 answers
Unit 1   paper 1 answersUnit 1   paper 1 answers
Unit 1 paper 1 answersCAPE ECONOMICS
 
D acad-016 estructura de tesina o memoria de estadía profesional
D acad-016 estructura de tesina o memoria de estadía profesional D acad-016 estructura de tesina o memoria de estadía profesional
D acad-016 estructura de tesina o memoria de estadía profesional Edgar Mata
 
Spring_GHUCCTS_Newsletter_1st_edition
Spring_GHUCCTS_Newsletter_1st_editionSpring_GHUCCTS_Newsletter_1st_edition
Spring_GHUCCTS_Newsletter_1st_editionAlexander V. Libin
 
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...Andreas Ferus
 

Destacado (11)

Sql 學習繪本
Sql 學習繪本Sql 學習繪本
Sql 學習繪本
 
Presentation mr ono
Presentation mr onoPresentation mr ono
Presentation mr ono
 
Psicologia genesis
Psicologia genesisPsicologia genesis
Psicologia genesis
 
Landforms
Landforms Landforms
Landforms
 
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...
Evaluación de la Seguridad Informática y Mitigación de Vulnerabilidades en un...
 
Unit 1 paper 1 answers
Unit 1   paper 1 answersUnit 1   paper 1 answers
Unit 1 paper 1 answers
 
D acad-016 estructura de tesina o memoria de estadía profesional
D acad-016 estructura de tesina o memoria de estadía profesional D acad-016 estructura de tesina o memoria de estadía profesional
D acad-016 estructura de tesina o memoria de estadía profesional
 
Modern Loom
Modern Loom Modern Loom
Modern Loom
 
Aims, Objectives and Goals
Aims, Objectives and GoalsAims, Objectives and Goals
Aims, Objectives and Goals
 
Spring_GHUCCTS_Newsletter_1st_edition
Spring_GHUCCTS_Newsletter_1st_editionSpring_GHUCCTS_Newsletter_1st_edition
Spring_GHUCCTS_Newsletter_1st_edition
 
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...
Open Access an Kunstuniversitäten - am Beispiel der Akademie der bildenden Kü...
 

Más de Joshua Drake

Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessJoshua Drake
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessJoshua Drake
 
An evening with Postgresql
An evening with PostgresqlAn evening with Postgresql
An evening with PostgresqlJoshua Drake
 
Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)Joshua Drake
 
Introduction to PgBench
Introduction to PgBenchIntroduction to PgBench
Introduction to PgBenchJoshua Drake
 
Developing A Procedural Language For Postgre Sql
Developing A Procedural Language For Postgre SqlDeveloping A Procedural Language For Postgre Sql
Developing A Procedural Language For Postgre SqlJoshua Drake
 
PostgreSQL Conference: East 08
PostgreSQL Conference: East 08PostgreSQL Conference: East 08
PostgreSQL Conference: East 08Joshua Drake
 
PostgreSQL Conference: West 08
PostgreSQL Conference: West 08PostgreSQL Conference: West 08
PostgreSQL Conference: West 08Joshua Drake
 
What MySQL can learn from PostgreSQL
What MySQL can learn from PostgreSQLWhat MySQL can learn from PostgreSQL
What MySQL can learn from PostgreSQLJoshua Drake
 
Northern Arizona State ACM talk (10/08)
Northern Arizona State ACM talk (10/08)Northern Arizona State ACM talk (10/08)
Northern Arizona State ACM talk (10/08)Joshua Drake
 

Más de Joshua Drake (14)

Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own Business
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own Business
 
An evening with Postgresql
An evening with PostgresqlAn evening with Postgresql
An evening with Postgresql
 
Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)Dumb Simple PostgreSQL Performance (NYCPUG)
Dumb Simple PostgreSQL Performance (NYCPUG)
 
East09 Keynote
East09 KeynoteEast09 Keynote
East09 Keynote
 
Go Replicator
Go ReplicatorGo Replicator
Go Replicator
 
Pitr Made Easy
Pitr Made EasyPitr Made Easy
Pitr Made Easy
 
Introduction to PgBench
Introduction to PgBenchIntroduction to PgBench
Introduction to PgBench
 
Developing A Procedural Language For Postgre Sql
Developing A Procedural Language For Postgre SqlDeveloping A Procedural Language For Postgre Sql
Developing A Procedural Language For Postgre Sql
 
PostgreSQL Conference: East 08
PostgreSQL Conference: East 08PostgreSQL Conference: East 08
PostgreSQL Conference: East 08
 
PostgreSQL Conference: West 08
PostgreSQL Conference: West 08PostgreSQL Conference: West 08
PostgreSQL Conference: West 08
 
What MySQL can learn from PostgreSQL
What MySQL can learn from PostgreSQLWhat MySQL can learn from PostgreSQL
What MySQL can learn from PostgreSQL
 
Northern Arizona State ACM talk (10/08)
Northern Arizona State ACM talk (10/08)Northern Arizona State ACM talk (10/08)
Northern Arizona State ACM talk (10/08)
 
Plproxy
PlproxyPlproxy
Plproxy
 

Último

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 

Último (20)

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 

Unison Ucsf Sfpug

  • 1. Unison: Enabling easy, rapid, and comprehensive proteomic mining http://unison-db.org/ Online access, download, documentation, references. Reece Hart Genentech, Inc. UCSF / SF PostgreSQL Users' Group March 11, 2009 San Francisco, CA Slides available at http://harts.net/reece/pubs/
  • 2. A Bestiary of Life Sciences Data Types Genomics Proteomics assemblies, transcripts, sequences, domains, PTMs, probes, trans. factors, localization, structure, orthology, expression, SNPs, predictions, networks … haplotypes … Annotation Chemistry GO, taxonomy, SCOP, compounds, HCS, HTS, disease, OMIM … properties … Clinincal LIMS assays, protocols, animal records, protocols patient records, request systems, samples … personnel, samples … Communications literature, patents, and presentations … 2
  • 3. Types of Integration Source Aggregation ➢ RefSeq In-house Sequences Structures Aggregates data of the ● same type from multiple UniProt PDB sources. Sequences Structures Ensures completeness of ● data. Structures Sequences Semantic Integration ➢ Integrates fundamentally ● distinct data types. Abstracts types to ● essential features. Structures Sequences Improves contextual ● understanding of data. 3
  • 4. A Survey of Integration Methods Presentation Mashups Link Integration AJAX, iframe Hypertext links between sites Middle Tier Server Mashups Database Integration F W (Federation / Warehouse) Federation Warehouse Source Databases A B C or Files For review, see: Goble C, Stevens R 4 J Biomed Inform. 2008 Oct;41(5):687-93.
  • 5. The Problems in a Nutshell Data integration is complex. ➢ Establishing semantic equivalences and ● relationships are difficult. Source database contents are updated often. ● Existing tools don't cut it. ➢ Licensing restrictions prevent sharing. ● Narrow in type of data and/or content, and not ● easily updated. Not specifically designed for mining. ● Scientists develop ad hoc and integration ➢ solutions. Results are difficult to repeat. ● It wastes a lot of time. ● Questions don't get asked. ● 5
  • 6. Unison in a Nutshell Domain, Structures Structure & Homology & Ligands Predictions Protein Sequences and Annotations Auxiliary Genomes, Annotations Gene Mapping & GO, RIF, SCOP, Structure, etc. Probes Sequences and Annotations Auxiliary Data Precomputed predictions UniProt, IPI, Ensembl, RefSeq, PDB, HomoloGene, Gene Domains, homology, structure, TMs, PHANTOM, HUGE, ROUGE, MGC, Ontology, localization, signals, disorder, etc. Derwent, pataa, nr, etc. taxonomy, PDB, >200M predictions, 23 types, 6 >13M seqs, >17k species, 69 origins HUGO, SCOP, etc. ~6 CPU-years
  • 7. Unison Contents patents HUGO Geneseq:AAP60074 TNFSF9 homologs 1991-10-29 TNFSF10 SUNTORY TNFSF11 NP_000585.2 NP_036807.1 | RAT EP205038-A; New tumour... NP_000585.2 NP_038721.1 | MOUSE NP_000585.2 XP_858423.1 | CANFA GO SNPs Function P84L transcription A94T initiation aliases elongation TNFA_HUMAN Entrez sequences protein features Q1XHZ6 IPI00001671.1 gene_id >Unison:98 INCY:1109711.FL1p symbol MSTESMIRDVE...FGIIAL CCDS4702.1 locus >Unison:23782 gi:25952111 VRSSSRTPSD...FGIIAL 1 | 23 | | SS 108 | 143 | 1.8e-06 | EGF 162 | 184 | | TM taxonomy 133 | 138 | | ITIM alignments 9606 Homo sapiens TNFA 1tnfA 10090 Mus musculus aa-to-resid TNFA 1tnfB 10028 Rattus rattus loci ... MSTESMIR TNFA 5tswF DVEFGIIA 1 233 6+:31651498-31653288 TESMIRDV IIAMDAC structures SCOP 1tnf genomes 1a8m all alpha probes 2tun all beta Hs35 4tsv Ig Hs36 HGU133P 5tsw TNF-like RAT WHG alpha+beta 7
  • 8. Ex1: Mine for sequences w/conserved features. patents HUGO Geneseq:AAP60074 TNFSF9 homologs 1991-10-29 TNFSF10 SUNTORY TNFSF11 NP_000585.2 NP_036807.1 | RAT EP205038-A; New tumour... NP_000585.2 NP_038721.1 | MOUSE NP_000585.2 XP_858423.1 | CANFA GO SNPs Function P84L transcription A94T initiation aliases elongation TNFA_HUMAN Entrez sequences protein features Q1XHZ6 IPI00001671.1 gene_id >Unison:98 INCY:1109711.FL1p symbol MSTESMIRDVE...FGIIAL CCDS4702.1 locus >Unison:23782 gi:25952111 VRSSSRTPSD...FGIIAL 1 | 23 | | SS 108 | 143 | 1.8e-06 | EGF 162 | 184 | | TM taxonomy 133 | 138 | | ITIM alignments 9606 Homo sapiens TNFA 1tnfA 10090 Mus musculus aa-to-resid TNFA 1tnfB 10028 Rattus rattus loci ... MSTESMIR TNFA 5tswF DVEFGIIA 1 233 6+:31651498-31653288 TESMIRDV IIAMDAC structures SCOP 1tnf genomes 1a8m all alpha probes 2tun all beta Hs35 4tsv Ig Hs36 HGU133P 5tsw TNF-like RAT WHG alpha+beta 8
  • 9. Ex2: Locate SNPs and domains on structure. patents HUGO Geneseq:AAP60074 TNFSF9 homologs 1991-10-29 TNFSF10 SUNTORY TNFSF11 NP_000585.2 NP_036807.1 | RAT EP205038-A; New tumour... NP_000585.2 NP_038721.1 | MOUSE NP_000585.2 XP_858423.1 | CANFA GO SNPs Function P84L transcription A94T initiation aliases elongation TNFA_HUMAN Entrez sequences protein features Q1XHZ6 IPI00001671.1 gene_id >Unison:98 INCY:1109711.FL1p symbol MSTESMIRDVE...FGIIAL CCDS4702.1 locus >Unison:23782 gi:25952111 VRSSSRTPSD...FGIIAL 1 | 23 | | SS 108 | 143 | 1.8e-06 | EGF 162 | 184 | | TM taxonomy 133 | 138 | | ITIM alignments 9606 Homo sapiens TNFA 1tnfA 10090 Mus musculus aa-to-resid TNFA 1tnfB 10028 Rattus rattus loci ... MSTESMIR TNFA 5tswF DVEFGIIA 1 233 6+:31651498-31653288 TESMIRDV IIAMDAC structures SCOP 1tnf genomes 1a8m all alpha probes 2tun all beta Hs35 4tsv Ig Hs36 HGU133P 5tsw TNF-like RAT WHG alpha+beta 9
  • 10. Analysis and data mining have distinct needs. (semantic integration) feature types/models HMM, TM, signal, etc. (source integration) sequences non-redundant superset of all sequences Sequence Analysis i.e., show predictions for a given sequence Typically involves minutes to hours of computing per sequence. Typically entails days to months of computing results. i.e., show sequences that contain specified features. Feature-Based Mining Prediction results method-specific data such as score, e-value, p- value, kinase probability, etc. parameters execution arguments/options for every prediction type and result 10
  • 11. Mining for ITIMs the Old Way Ig TM ITIM Collect sequences. ➢ Prune redundant sequences. (How?!) ➢ For each unique sequence, predict ➢ Immunoglobulin domains. ● Transmembrane domains. ● ITIM domains. ● Write a program that filters predictions. ➢ Summarize hits with external data. ➢ Do it again when source data are updated. ➢ For Review: Daëron M Immunol Rev. 2008 Aug;224:11-43 11
  • 12. Mining for ITIMs the Unison Way Ig TM ITIM SELECT IG.pseq_id, IG.start as ig_start,IG.stop as ig_stop,IG.score,IG.eval, TM.start as tm_start,TM.stop as tm_stop, ITIM.start as itim_start,ITIM.stop as itim_stop FROM pahmm_current_pfam_v IG JOIN pftmhmm_tms_v TM ON IG.pseq_id=TM.pseq_id AND IG.stop<TM.start JOIN pfregexp_v ITIM ON TM.pseq_id=ITIM.pseq_id AND TM.stop<ITIM.start WHERE IG.name='ig' AND IG.eval<1e-2 AND ITIM.acc='MOD_TYR_ITIM'; Ig Ig TM Tm ITIM ITIM start stop score start stop start stop best_annotation pseq_id eval 523 UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecNam 234 262 316 30 7.40E-06 440 462 518 391 UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecNam 254 158 213 36 1.90E-07 284 306 386 436 UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecNam 544 157 215 24 6.60E-04 348 370 431 797 254 312 40 7.60E-09 1099 1121 1361 1366 UniProtKB/Swiss-Prot:DCC_HUMAN (RecName 1113 42 102 30 1.20E-05 243 265 300 305 UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecNam 335 UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecNam 1114 42 102 30 6.50E-06 243 265 330 306 UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecNam 1115 42 102 31 4.20E-06 243 265 301 401 UniProtKB/TrEMBL:Q95368_HUMAN (SubName 1116 42 97 30 1.10E-05 339 361 396 12 1134 340 388 26 1.40E-04 603 625 688 693 UniProtKB/Swiss-Prot:PECA1_HUMAN (RecNam
  • 13. Unison has many applications. Unison Web Tools Other In-House Tools Ad Hoc Mining Mining and analysis projects Domain, Structures Structure & Homology & Ligands Predictions Protein Sequences and Annotations Auxiliary Genomes, Annotations Gene Mapping & GO, RIF, SCOP, Structure, etc. Probes Sequences and Annotations Auxiliary Data Precomputed predictions UniProt, IPI, Ensembl, RefSeq, PDB HomoloGene, Gene Domains, homology, structure, TMs, STRING, PHANTOM, HUGE, ROUGE, Ontology, taxonomy, localization, signals, disorder, etc. 13 MGC, Derwent, pataa, nr, etc. PDB, HUGO, SCOP, >200M predictions, 23 types, >13M seqs, >17k species, 69 origins etc. ~6 CPU-years
  • 14. Unison facilitates complex mining. Jason Hackney Nandini Krishnamurthy Li Li Yun Li Jinfeng Liu Shiu-ming Loh Kiran Mukhyala 14
  • 15. Data integration led to Bcl-2 discoveries. + sequences, models, HMM alignments, automation Custom model building Z'fish Source Database Human % E-value Score % Ide Protein and Accession Protein Coverage Bax RefSeq:NP_571637 2.00E-47 189 51 98 ⇒ ⇒ BAX Bax2 E35:ENSDARP00000040899 1.00E-14 81 33 51 Bik UP:Q5RGV6_BRARE BIK 1.41E+04 20 47 12 Bmf RefSeq:NP_001038689 1.00E-05 50 32 91 BMF Bmf2 E35:FGENESH00000082230 1.10E-02 42 41 42 BBC3 E35:FGENESH00000078270 PUMA 2.10E+01 30 25 49 4 novel Bcl-2 proteins in zebrafish Kratz et al., Cell Death Differ. (2006). 15
  • 17. Structure Viewer with User Features! http://unison-db.org/pseq_structure.pl?q=TNFA_HUMAN;userfeatures=Estrand@164-174,mysnp@170 17
  • 18. Unison is a platform for diverse tools. Matt Brauer Guy Cavet Josh Kaminker Scott Lohr Kathryn Woods Jean Yuan Peng Yue 18
  • 19. Unison Build Process Phase 0 Phase 1 Phase 2 Phase 3 Phase 4 Phase 5 Phase 6 Download Load Aux Load Update Update Update House- Data Sequences Sets Predictions Mat Views keeping 2d 4h 2d 1h 50 CPU-d 6h 0 Makefile Makefile loads auxiliary data downloads all data loads sequences and annotations (in-house is just another source) updates sequence sets updates precomputed predictions (incremental update!) updates precomputed analyses and mat'd views builds public database Runs in a cron job ➢ Requires ~10% time of 1 person ➢ Consistent, reliable builds ➢ 19
  • 20. Benefit Lessons Integrate to enable reasoning based on a ➢ corpus of data of multiple types and/or from multiple origins. To analyze biological data in broad context. ● To generate hypotheses by data mining. ● To enable business decisions based on a holistic ● view of decision criteria. Ancillary benefits: ➢ Data preparation is hard. Centralization means ● that questions get asked and asked efficiently. Integrated data provides a consistent foundation ● on which others can build. Integration improves currency. ● 20
  • 21. Design Lessons Know what data to integrate, how they'll ➢ be used, and the converse. Integrate on simple, intuitively meaningful ➢ abstract concepts. Precise definitions are critical. ● Represent proprietary data elsewhere, if needed. ● Aggregate on data types. ➢ Corollary: Partitioning on content makes data ● silos. Design for Integrity. ➢ Reliability is everything. ● 21
  • 22. Process Lessons Explicitly track the provenance of data. ➢ All data in Unison are tied to an origin – ● predictions, annotations, sequences, models. Plan for updates. ➢ Updates are completely automated and ● idempotent. idempotent i⋅dem⋅po⋅tent (/ˈaɪdəmˈpoʊtnt, ˈɪdəm-/) adj. [from mathematical techspeak] Acting as if used only once, even if used multiple times. idempotent. Dictionary.com. Jargon File 4.2.0. http://dictionary.reference.com/browse/idempotent (accessed: February 25, 2009). 22
  • 23. Other Lessons Design security from the start. ➢ Internal version of Unison use Kerberos. ● Especially important in a world of distributed ● services and data. Include web services early in the design. ➢ (Ooops, I blew it on this.) ● 23
  • 24. A Few Reasons for PostgreSQL. Excellent support for server-side functions ➢ in PL/PGSQL, Perl, C, Java, Python, R, sh, ... ● Table inheritance ➢ Facilitates type abstraction ● GSSAPI/Kerberos support ➢ No password admin ● User identity all the way to the database ● psql rocks ➢ Pedantic and responsive development ➢ community Ease community adoption (?) ➢ 24
  • 25. Kiran Mukhyala Fernando Bazan, Matt Brauer, David Cavanaugh, Jason Hackney, Pete Haverty, Ken Jung, Josh Kaminker, Nandini Krishnamurthy, Li Li, Yun Li, Scott Lohr, Shiuh- ming Loh, Jinfeng Liu, Peng Yue, Jianjun Zhang, Yan Zhang Simran Hansrai, Marc Lambert, Dave Windgassen A huge open science and open source community. http://unison-db.org/ Open access web site, downloads, documentation, references, “Are you sure about this credits. Stan? It seems odd that a pointy head and a long beak unison-db.org:5432 PostgreSQL & odbc/jdbc/sdbc is what makes them fly.” access 25 J. Workman, Science 245:1399 (1989)
  • 26. 26
  • 27. Unison form follows function. Params/Models Sequences Results 27