Challenges with data quality,
     sharing, and versioning
      David Dooling <ddooling@wustl.edu>
                              GIA 2009
Production Centers
• Tony Cox, Sanger
  Sequencing
  Scale
  Infrastructure
  Data flow
• Toby Bloom, Broad
  Quality
  Integration
  Standards
  Sharing
• David Dooling, WUStL
  Scale
  Quality
  Sharing
  Versioning
sub scale {



Moore’s Law

[Chart: Personnel, Moore’s Law, CPU, Storage, and Sequence plotted by year, 2000 to 2010]
Images




                 200 TB/week




Images




                   10 PB/year




Perspective




                   20 PB/day

Perspective




                       2 PB/s

FASTQ
  @HWI-EAS404:5:1:6:180#0/1
  GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
  +HWI-EAS404:5:1:6:180#0/1
  aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a
  @HWI-EAS404:5:1:6:396#0/1
  TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
  +HWI-EAS404:5:1:6:396#0/1
  Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ
  @HWI-EAS404:5:1:6:1344#0/1
  GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
  +HWI-EAS404:5:1:6:1344#0/1
  aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[
  @HWI-EAS404:5:1:6:1814#0/1
  AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
  +HWI-EAS404:5:1:6:1814#0/1
  aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X


                               7 TB/week
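
Each read on the FASTQ slide occupies four lines: an `@` header, the bases, a `+` separator, and one quality character per base. The sketch below shows one way to read such records and turn the quality characters into integer scores. The offset of 64 is an assumption (it matches older Illumina pipelines and fits the characters shown on the slide); the names `FastqRecord` and `parse_fastq` are hypothetical and not taken from any pipeline mentioned in the talk.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 64) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry.

    offset=64 is an assumption; Sanger-style FASTQ files use 33 instead.
    """
    stripped = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# First eight bases and quality characters of the second read on the slide,
# truncated here only to keep the example short.
example = [
    "@HWI-EAS404:5:1:6:396#0/1",
    "TATTTACT",
    "+HWI-EAS404:5:1:6:396#0/1",
    "Yaaa_baa",
]
records = list(parse_fastq(example))
print(records[0].name, records[0].qualities)
```

The parser rejects any record whose sequence and quality lines differ in length, so it will not silently accept damaged input.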
FASTQ


                               350 TB/year
Mapping




                       2 TB/week




Mapping




                       100 TB/year
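
The weekly and yearly figures on the Images, FASTQ, and Mapping slides are the same rates in different units. A rough check, assuming 52 production weeks per year (the slides do not state the factor and appear to round down), could look like this:

```python
WEEKS_PER_YEAR = 52  # assumption: the slides do not state the factor used

# Weekly rates in TB, taken from the slides.
weekly_tb = {"images": 200, "FASTQ": 7, "mapping output": 2}


def yearly_tb(tb_per_week: float, weeks: int = WEEKS_PER_YEAR) -> float:
    """Scale a weekly data rate (TB/week) to a yearly one (TB/year)."""
    return tb_per_week * weeks


for name, rate in weekly_tb.items():
    print(f"{name}: {rate} TB/week is about {yearly_tb(rate):,.0f} TB/year")
```

This gives 10,400, 364, and 104 TB/year, in line with the rounded 10 PB, 350 TB, and 100 TB quoted on the slides.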




Mapping




                  42,000 core-hr/week




Mapping




                       5 core-yr/week




Mapping




                       250 core cluster
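
The three compute figures on the Mapping slides are one number in three units. A small sketch of the conversion, assuming the cluster runs around the clock at full utilization (an idealization, not something the slides state):

```python
HOURS_PER_WEEK = 7 * 24    # 168
HOURS_PER_YEAR = 365 * 24  # 8760

core_hours_per_week = 42_000  # from the slide

# One core running for a full year delivers HOURS_PER_YEAR core-hours.
core_years_per_week = core_hours_per_week / HOURS_PER_YEAR

# Cores needed to finish one week of mapping within that same week.
cores_needed = core_hours_per_week / HOURS_PER_WEEK

print(f"{core_years_per_week:.1f} core-years of work arrive each week")
print(f"{cores_needed:.0f} cores keep pace at 100% utilization")
```

The first result rounds to the 5 core-yr/week on the slide, and the second is the 250-core cluster. Any real cluster would need headroom beyond this, since utilization is never perfect.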




The Weakest Link




The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps
The balanced PS [1]

        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacity)
        40   gosub get(network_capacity)
        50   gosub get(cluster_nodes)

        [1] PS: Pipeline for Sequencing
The unbalanced PS



        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacity)
        40   gosub get(network_capacity)
        50   gosub get(cluster_nodes)
        60   goto 10




The GHz race




} # scale



sub quality {



Honda




Ford




Quality is Job 1




...must be more than
         just a slogan



Quality missteps
          Initial low fidelity between predicted
           base quality values and observed quality




                       Tsonev, S. SEP 2007

An aside




            “basecall calibration predicted vs. observed”
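
A calibration check of this kind compares the quality a base caller predicts with the error rate actually observed. The sketch below uses the standard Phred definition, Q = -10 log10(p). The function names and the counts are hypothetical, invented purely for illustration; they are not data from the talk.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def empirical_quality(mismatches: int, bases: int) -> float:
    """Phred-scaled quality computed from an observed mismatch count."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches == 0:
        return math.inf  # no observed errors: quality is unbounded
    return -10 * math.log10(mismatches / bases)


# Hypothetical counts, invented for illustration only:
# predicted Q -> (bases called at that Q, mismatches against a reference)
observed = {10: (1000, 100), 20: (1000, 10), 30: (1000, 10)}

for predicted, (bases, mismatches) in observed.items():
    actual = empirical_quality(mismatches, bases)
    print(f"predicted Q{predicted}: observed Q{actual:.1f}")
```

In this made-up table the Q10 and Q20 bins are well calibrated, while the Q30 bin behaves like Q20, which is the kind of gap between prediction and observation the slide title refers to.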
Cult of traces




Quality is the key
Need high fidelity between predicted and observed quality

                 50 bytes per base


                 20 bytes per base


                  2 bytes per base


                       3 bits per base

The down side




http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf




                                         http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg




} # quality



sub sharing {



1000 Genomes




3.8 Tb




~50 B/b




190 TB
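
The 190 TB figure follows from the two preceding slides, reading "3.8 Tb" as 3.8 terabases and "~50 B/b" as roughly 50 bytes stored per base. A minimal sketch of the multiplication, using decimal terabytes (10^12 bytes, an assumption) and also applying the other per-base densities listed on the "Quality is the key" slide for comparison:

```python
BYTES_PER_TB = 10 ** 12  # decimal terabyte (assumption)

total_bases = 3_800_000_000_000  # 3.8 Tb, from the slide


def storage_tb(bases: int, bytes_per_base: float) -> float:
    """Storage in TB needed for `bases` at a given per-base density."""
    return bases * bytes_per_base / BYTES_PER_TB


# Densities from the slides; 3 bits per base is 3/8 of a byte.
for label, density in [
    ("50 bytes/base", 50),
    ("20 bytes/base", 20),
    ("2 bytes/base", 2),
    ("3 bits/base", 3 / 8),
]:
    print(f"{label}: {storage_tb(total_bases, density):,.3f} TB")
```

At 50 bytes per base this gives the 190 TB on the slide; at 3 bits per base the same data would fit in under 1.5 TB.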




Submitted to central
         repositories



... and replicated
            across the pond



The goal of this project is to provide a system
for storing and retrieving huge amounts of
data, distributed among a large number of
heterogenous server nodes, under a single
virtual filesystem tree with a variety of
standard access methods.




Write-only databases




          Search limited to sequence and
           values of specific XML entities
              submitted as metadata
Write-only databases




                             x
          Search limited to sequence and
           values of specific XML entities
              submitted as metadata
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT>
      <CENTER_NAME>SDSU</CENTER_NAME>
      <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME>
      <PROJECT_ID>28373</PROJECT_ID>
    </DESCRIPTOR>
    <STUDY_ATTRIBUTES>
      <STUDY_ATTRIBUTE>
        <TAG>NCBI parent project ID</TAG>
        <VALUE>28725</VALUE>
      </STUDY_ATTRIBUTE>
    </STUDY_ATTRIBUTES>
  </STUDY>
</STUDY_SET>

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <EXPERIMENT alias="LowSalternSDbayVir111005_experiment"
              expected_number_runs="2" accession="SRX000217">
    <TITLE>454 sequencing of saltern metagenome fragment library</TITLE>
    <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/>
    <DESIGN>
      <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME>
        <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <SINGLE/>
        </LIBRARY_LAYOUT>
        <LIBRARY_CONSTRUCTION_PROTOCOL>
          none provided
        </LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Technical Read</READ_CLASS>
            <READ_TYPE>Adapter</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>5</BASE_COORD>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <LS454>
        <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL>
        <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE>
        <FLOW_COUNT>168</FLOW_COUNT>
      </LS454>
    </PLATFORM>
    <PROCESSING>
      <BASE_CALLS>
        <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE>
        <BASE_CALLER>454BaseCaller</BASE_CALLER>
      </BASE_CALLS>
      <QUALITY_SCORES qtype="phred">
        <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER>
        <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS>
        <MULTIPLIER>1</MULTIPLIER>
      </QUALITY_SCORES>
    </PROCESSING>
  </EXPERIMENT>
</EXPERIMENT_SET>

<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="D0IIGP3" instrument_model="454 GS 20"
       run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC"
       total_data_blocks="1" accession="SRR001053">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121"
                number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D0IIGP301.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
  <RUN alias="D1LDSHL" instrument_model="454 GS 20"
       run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC"
       total_data_blocks="1" accession="SRR001054">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935"
                number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D1LDSHL01.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
</RUN_SET>
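
Because searches are limited to the values of specific XML entities, it helps to see how little is needed to pull those values out of a submission. The sketch below reads the TAG/VALUE pairs of the sample record with Python's standard `xml.etree.ElementTree`. The embedded document is a shortened copy of the SAMPLE_SET on the slide (namespace declaration and name block omitted for brevity), and `sample_attributes` is a hypothetical helper, not part of any repository tooling.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Shortened copy of the SAMPLE_SET shown on the slide.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each sample accession to its TAG -> VALUE metadata pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for sample in root.iter("SAMPLE"):
        pairs = {
            attribute.findtext("TAG"): attribute.findtext("VALUE")
            for attribute in sample.iter("SAMPLE_ATTRIBUTE")
        }
        result[sample.get("accession")] = pairs
    return result


print(sample_attributes(SAMPLE_XML))
```

Free-form values such as `11/10/05` come back as plain strings, so any search over them depends on every submitter formatting them the same way.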




} # sharing



sub versioning {



The Cathedral and the Bazaar
Linux overturned much of what I thought I
knew. I had been preaching the Unix gospel of
small tools, rapid prototyping and evolutionary
programming for years. But I also believed
there was a certain critical complexity above
which a more centralized, a priori approach was
required. I believed that the most important
software (operating systems and really large
tools like the Emacs programming editor)
needed to be built like cathedrals, carefully
crafted by individual wizards or small bands of
mages working in splendid isolation, with no
beta to be released before its time.
The Vatican and the Reformation




The popes




                   Will this scale?
GenBank genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

git genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG
GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT
TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT
GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT
ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA
AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA
TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA
CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA
TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA
AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT
TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC
AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
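
In this file the only version information for the reference travels in the FASTA header line. The sketch below splits that header into its parts, assuming the last field follows the Ensembl-style layout `coord_system:assembly:name:start:end:strand`; that reading of the fields, and the names `SeqRegion` and `parse_header`, are assumptions made for illustration.

```python
from typing import NamedTuple


class SeqRegion(NamedTuple):
    coord_system: str
    assembly: str
    name: str
    start: int
    end: int
    strand: int


def parse_header(header: str) -> SeqRegion:
    """Split the location field of a FASTA header into named parts.

    Assumes the last whitespace-separated field has six colon-separated
    parts: coord_system:assembly:name:start:end:strand.
    """
    location = header.lstrip(">").split()[-1]
    parts = location.split(":")
    if len(parts) != 6:
        raise ValueError(f"unexpected location field: {location!r}")
    coord_system, assembly, name, start, end, strand = parts
    return SeqRegion(coord_system, assembly, name, int(start), int(end), int(strand))


region = parse_header(">7 dna:chromosome chromosome:NCBI36:7:1:158821424:1")
print(region.assembly, region.name, region.end)
```

Here the assembly label `NCBI36` is the only thing tying the sequence to a particular release of the reference.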


The Human Reference




The Human Reference
[Figure, panels (a), (b), and (c): graphs with nodes labeled A to H and numeric edge labels]
                                                       139                                                                                                D
                                                                                                                                                                                                                                                                                         F
                                                                          13(2)                55(3)
                      E                                                                                              D                                          21                                                                                                    158
                                                                                                                                                                                                              32(3)               45(3)
                                                                                                                                                          A
                                                   13(2)                                                                                                                                                                                                                                 C
                                                                                                                                                                                                  s5766
                                                                                                                                                                                   13(2)
                                                                          38(6)
                                                                                                                                                                                                              20(2)
                                                                                                                                                                18
                      F                                                                                              G
                                                                                                                                                          B
                                                                                         8
                                                       8(5)                                                                                                                                                                                                                              A
                                                                                                                                                                                        81
                                                                          18(6)                58(7)                 E
                                                       171                                                                                                C
                      G                                                                                                                                                           123(2)                                                          82
                                                                                                                                                                                                                                                                                         B




             D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in
             large genomes. Genome Biology 2006, 7:R7
} # versioning



sub thank {"you"}



Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

  • 1. Challenges with data quality, sharing, and versioning David Dooling <ddooling@wustl.edu> GIA 2009
  • 2. Production Centers • Tony Cox, Sanger • David Dooling, WUStL Sequencing Scale Scale Quality Infrastructure Sharing Data flow Versioning • Toby Bloom, Broad Quality Integration Standards Sharing <ddooling@wustl.edu>
  • 4. Moore’s Law ,-./011-2# 300.-4/#567# 8,9# :;0.6<-# :-=>-1?-# !quot;quot;quot;# !quot;quot;$# !quot;quot;!# !quot;quot;%# !quot;quot;&# !quot;quot;'# !quot;quot;(# !quot;quot;)# !quot;quot;*# !quot;quot;+# !quot;$quot;# <ddooling@wustl.edu>
  • 5. Images 200 TB/week <ddooling@wustl.edu>
  • 6. Images 10 PB/year <ddooling@wustl.edu>
  • 7. Perspective 20 PB/day <ddooling@wustl.edu>
  • 8. Perspective 2 PB/s <ddooling@wustl.edu>
  • 9. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 7 TB/week <ddooling@wustl.edu>
  • 10. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 350 TB/year <ddooling@wustl.edu>
  • 11. Mapping 2 TB/week <ddooling@wustl.edu>
  • 12. Mapping 100 TB/year <ddooling@wustl.edu>
  • 13. Mapping 42,000 core-hr/week <ddooling@wustl.edu>
  • 14. Mapping 5 core-yr/week <ddooling@wustl.edu>
  • 15. Mapping 250 core cluster <ddooling@wustl.edu>
  • 17. The Balanced PC • Clock speed • AGP • Front-side bus • Hypertransport • 1 Gbps • PCI-X • SATA • PCI-Express • Infiniband • Multi-core • Front-side bus • GPU • 10 Gbps <ddooling@wustl.edu>
  • 18. The balanced PS 1 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 1 - Pipeline for Sequencing <ddooling@wustl.edu>
  • 19. The unbalanced PS 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 60 goto 10 <ddooling@wustl.edu>
  • 33. Quality is Job 1 <ddooling@wustl.edu>
  • 34. ...must be more than just a slogan <ddooling@wustl.edu>
  • 35. Quality missteps Initial low fidelity between base quality values and quality Tsonev, S. SEP 2007 <ddooling@wustl.edu>
  • 36. An aside “basecall calibration predicted vs. observed” <ddooling@wustl.edu>
  • 38. Quality is the key Need high fidelity between prediction and observed 50 bytes per base 20 bytes per base 2 bytes per base 3 bits per base <ddooling@wustl.edu>
  • 39. The down side http://www3.appliedbiosystems.com/cms/ groups/mcb_marketing/documents/ generaldocuments/cms_057559.pdf http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg <ddooling@wustl.edu>
  • 46. Submitted to central repositories <ddooling@wustl.edu>
  • 47. ... and replicated across the pond <ddooling@wustl.edu>
  • 48. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods. <ddooling@wustl.edu>
  • 49. Write-only databases Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 50. Write-only databases x Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 51. Speaking of XML <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <LS454> <STUDY_SET xmlns:xsi=quot;http://www.w3.org/2001/ <EXPERIMENT_SET xmlns:xsi=quot;http://www.w3.org/ <INSTRUMENT_MODEL>GS 20</ <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT XMLSchema-instancequot;> 2001/XMLSchema-instancequot;> INSTRUMENT_MODEL> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <STUDY alias=quot;LowSalternSDbayVir111005quot; <EXPERIMENT ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT accession=quot;SRP000145quot;> alias=quot;LowSalternSDbayVir111005_experimentquot; <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGT ACGTACGTACGTACGTACGTACGTACGTACG</VALUE> <DESCRIPTOR> expected_number_runs=quot;2quot; accession=quot;SRX000217quot;> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT </RUN_ATTRIBUTE> <STUDY_TITLE>Solar Salterns, viral <TITLE>454 sequencing of saltern metagenome ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <RUN_ATTRIBUTE> fraction from low salinity saltern in San Diego, fragment library</TITLE> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</ <TAG>key_sequence</TAG> CA </STUDY_TITLE> <STUDY_REF accession=quot;SRP000145quot; FLOW_SEQUENCE> <VALUE>TCAG</VALUE> <STUDY_TYPE refname=quot;LowSalternSDbayVir111005quot;/> <FLOW_COUNT>168</FLOW_COUNT> </RUN_ATTRIBUTE> existing_study_type=quot;Metagenomicsquot;/> <DESIGN> </LS454> </RUN_ATTRIBUTES> <STUDY_ABSTRACT>Viral community from a <DESIGN_DESCRIPTION>454 Sequencing of </PLATFORM> </RUN> quot;lowquot; salinity saltern and sequenced at 454 Life viral fraction from low salinity saltern in San <PROCESSING> <RUN alias=quot;D1LDSHLquot; instrument_model=quot;454 GS Sciences. 
</STUDY_ABSTRACT> Diego, CA</DESIGN_DESCRIPTION> <BASE_CALLS> 20quot; run_date=quot;2006-04-06T09:25:19Zquot; <CENTER_NAME>SDSU</CENTER_NAME> <SAMPLE_DESCRIPTOR accession=quot;SRS000373quot; <SEQUENCE_SPACE>Base Space</ run_file=quot;D1LDSHLquot; run_center=quot;454MSCquot; refname=quot;28373quot;/> SEQUENCE_SPACE> total_data_blocks=quot;1quot; accession=quot;SRR001054quot;> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</ <LIBRARY_DESCRIPTOR> <BASE_CALLER>454BaseCaller</BASE_CALLER> <EXPERIMENT_REF accession=quot;SRX000217quot; CENTER_PROJECT_NAME> <LIBRARY_NAME>lowSalternSDbayVir111005</ </BASE_CALLS> refname=quot;LowSalternSDbayVir111005_experimentquot;/> <PROJECT_ID>28373</PROJECT_ID> LIBRARY_NAME> <QUALITY_SCORES qtype=quot;phredquot;> <DATA_BLOCK name=quot;D1LDSHLquot; region=quot;1quot; </DESCRIPTOR> <LIBRARY_STRATEGY>OTHER</ <QUALITY_SCORER>454BaseCaller</ total_spots=quot;70935quot; total_reads=quot;70935quot; <STUDY_ATTRIBUTES> LIBRARY_STRATEGY> QUALITY_SCORER> number_channels=quot;1quot; format_code=quot;1quot; sector=quot;0quot;> <STUDY_ATTRIBUTE> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <FILES> <TAG>NCBI parent project ID</TAG> <LIBRARY_SELECTION>RANDOM</ <MULTIPLIER>1</MULTIPLIER> <FILE filename=quot;D1LDSHL01.sffquot; <VALUE>28725</VALUE> LIBRARY_SELECTION> </QUALITY_SCORES> filetype=quot;sffquot;/> </STUDY_ATTRIBUTE> <LIBRARY_LAYOUT> </PROCESSING> </FILES> </STUDY_ATTRIBUTES> <SINGLE/> </EXPERIMENT> </DATA_BLOCK> </STUDY> </LIBRARY_LAYOUT> </EXPERIMENT_SET> <RUN_ATTRIBUTES> </STUDY_SET> <LIBRARY_CONSTRUCTION_PROTOCOL> <RUN_ATTRIBUTE> none provided <TAG>flow_count</TAG> <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> </LIBRARY_CONSTRUCTION_PROTOCOL> <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <VALUE>168</VALUE> <SAMPLE_SET xmlns:xsi=quot;http://www.w3.org/2001/ </LIBRARY_DESCRIPTOR> <RUN_SET xmlns:xsi=quot;http://www.w3.org/2001/ </RUN_ATTRIBUTE> XMLSchema-instancequot;> <SPOT_DESCRIPTOR> 
XMLSchema-instancequot;> <RUN_ATTRIBUTE> <SAMPLE alias=quot;28373quot; accession=quot;SRS000373quot;> <SPOT_DECODE_SPEC> <RUN alias=quot;D0IIGP3quot; instrument_model=quot;454 GS <TAG>flow_sequence</TAG> <SAMPLE_NAME> <NUMBER_OF_READS_PER_SPOT>2</ 20quot; run_date=quot;2006-03-17T09:39:51Zquot; <TAXON_ID>496920</TAXON_ID> NUMBER_OF_READS_PER_SPOT> run_file=quot;D0IIGP3quot; run_center=quot;454MSCquot; <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <COMMON_NAME>saltern metagenome</ <READ_SPEC> total_data_blocks=quot;1quot; accession=quot;SRR001053quot;> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT COMMON_NAME> <READ_INDEX>0</READ_INDEX> <EXPERIMENT_REF accession=quot;SRX000217quot; ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT </SAMPLE_NAME> <READ_CLASS>Technical Read</ refname=quot;LowSalternSDbayVir111005_experimentquot;/> ACGTACGTACGTACGTACGTACGTACGTACG</VALUE> <DESCRIPTION>viral fraction from low READ_CLASS> <DATA_BLOCK name=quot;D0IIGP3quot; region=quot;1quot; </RUN_ATTRIBUTE> salinity saltern in San Diego, CA </ <READ_TYPE>Adapter</READ_TYPE> total_spots=quot;51121quot; total_reads=quot;51121quot; <RUN_ATTRIBUTE> DESCRIPTION> <BASE_COORD>1</BASE_COORD> number_channels=quot;1quot; format_code=quot;1quot; sector=quot;0quot;> <TAG>key_sequence</TAG> <SAMPLE_ATTRIBUTES> </READ_SPEC> <FILES> <VALUE>TCAG</VALUE> <SAMPLE_ATTRIBUTE> <READ_SPEC> <FILE filename=quot;D0IIGP301.sffquot; </RUN_ATTRIBUTE> <TAG>collection_date</TAG> <READ_INDEX>1</READ_INDEX> filetype=quot;sffquot;/> </RUN_ATTRIBUTES> <VALUE>11/10/05</VALUE> <READ_CLASS>Application Read</ </FILES> </RUN> </SAMPLE_ATTRIBUTE> READ_CLASS> </DATA_BLOCK> </RUN_SET> <SAMPLE_ATTRIBUTE> <READ_TYPE>Forward</READ_TYPE> <RUN_ATTRIBUTES> <TAG>lat_lon</TAG> <BASE_COORD>5</BASE_COORD> <RUN_ATTRIBUTE> <VALUE>32.599040, -117.107356</VALUE> </READ_SPEC> <TAG>flow_count</TAG> </SAMPLE_ATTRIBUTE> </SPOT_DECODE_SPEC> <VALUE>168</VALUE> </SAMPLE_ATTRIBUTES> </SPOT_DESCRIPTOR> </RUN_ATTRIBUTE> </SAMPLE> </DESIGN> 
<RUN_ATTRIBUTE> </SAMPLE_SET> <PLATFORM> <TAG>flow_sequence</TAG> <ddooling@wustl.edu>
  • 54. The Cathedral and the Bazaar: Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.
  • 55. The Vatican and the Reformation
  • 56. The popes: Will this scale?
  • 59. The Human Reference >7 dna:chromosome chromosome:NCBI36:7:1:158821424:1 ...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
  • 61. The Human Reference [Figure: repeat-domain graphs, panels (a), (b) and (c), with nodes labeled A to H and edge multiplicities.] D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7

Editor's notes

  1. What are the challenges that the large genome centers are currently facing that the typical researcher will be facing soon? Do not store images. Do not store SRF. Keep FASTQ.
  2. This acceleration breaks everything
  3. 3.4*125/75*35 = 198.333333333333
  4. We need to stop having to deal with images. It should be transparent to the end user.
  5. LHC http://atlasexperiment.org/
  6. (90*2+90/125*50)*35 = 7560, uncompressed
  7. For a 75-base read, you need 200 bytes; 25% of that is the headers. Save 12.5% by simply not repeating the sequence header.
  8. 8*90/12*35 = 2100
  9. Cost of software
  10. The chain is only as strong as its weakest link. Images: Assembly line backing up? Keystone cops piling up? Stooges? Transition: a situation not unlike that faced by PC manufacturers over the past decade.
  11. This analogy works on another level as well...
  12. Intel convinced everyone that the speed of the computer was equal to the clock speed of the processor. Many people believed this, even when using a 56k modem, even when the AMD Opteron came out, even when Intel went to multi-core and lower clock speeds. A cautionary tale for those joining the Gb race. Which wraps up the scale up...
  13. ... and leads us into quality
  14. ... and leads us into quality
  15. Make the best small engine in the world
  16. Made high-quality cars for years. Recognized after years of consistent performance.
  17. Now enjoy premium cost and high resale value. Everyone I know has a Honda Odyssey.
  18. Money from the T-bird allowed them to design, develop, and introduce the...
  19. It’s gotten better
  20. Google image search, second or third result. Draw your own conclusions.
  21. This distrust of base calls and quality values has reinforced the cult of traces. This does not scale for human resources, disk space, etc. This leads to a very bad situation for those of us responsible for the computing, storage, and network infrastructure.
  22. Quality is at the core of all other issues: storage, compute, throughput, etc. If it’s a bad base, call it a bad base. Don’t forget the GHz race.
  23. Reducing data to base calls and quality values does reduce its value, especially for data not natively in “base space”. Is there a richness in this data that is lost? But you gain not having to have custom tool chains for each native data type.
  24. 2 bits/base is absolute minimum
  25. Grid
  26. No one ever feels lucky
  27. No one ever feels lucky
  28. They have learned their lesson, by creating an incredible amount of XML to submit: Study, Sample, Experiment, Run
  29. He may know a lot about software, but he does not know anything about building cathedrals
  30. Currently, revisions are tightly controlled by central repositories: NCBI, UCSC, EBI
  31. Push and pull around diffs. Balance curation with rapid advances. Debian web of trust.
  32. How far will FASTA get you? C. elegans: part of genome repeat structure, http://genomebiology.com/2006/7/1/R7. Can you use the current de Bruijn graph assembly engines for alignment?
  33. Talk to me
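
The storage arithmetic in notes 7 and 24 can be checked with a rough sketch. It assumes a four-line FASTQ record with one newline per line and one quality character per base, and it borrows the 25-character header from the FASTQ slide. The function names are hypothetical, and the 2-bit figure covers only A, C, G and T (no N, no quality values).

```python
def fastq_record_bytes(header_len: int, read_len: int, repeat_header: bool = True) -> int:
    """Bytes in one four-line FASTQ record, counting one newline per line.

    header_len is the length of the header text including its leading '@'.
    Assumption: the '+' line either repeats the full header or is a bare '+'.
    """
    header_line = header_len + 1                      # "@name" plus newline
    sequence_line = read_len + 1                      # bases plus newline
    plus_line = header_line if repeat_header else 2   # "+name\n" or "+\n"
    quality_line = read_len + 1                       # one quality char per base
    return header_line + sequence_line + plus_line + quality_line


def packed_bases_bytes(read_len: int) -> int:
    """Bytes needed for the bases alone at 2 bits per base, rounded up."""
    return (2 * read_len + 7) // 8


header = "@HWI-EAS404:5:1:6:180#0/1"   # header taken from the FASTQ slide
read_len = 75                           # read length quoted in note 7

full = fastq_record_bytes(len(header), read_len)
trimmed = fastq_record_bytes(len(header), read_len, repeat_header=False)
header_share = 2 * (len(header) + 1) / full   # both header lines, with newlines
saving = (full - trimmed) / full
floor = packed_bases_bytes(read_len)

print(f"full record:        {full} bytes")
print(f"headers:            {header_share:.1%} of the record")
print(f"bare '+' line:      {trimmed} bytes, saving {saving:.1%}")
print(f"2 bits/base floor:  {floor} bytes for the bases alone")
```

Under these assumptions the record comes to about 200 bytes with roughly a quarter in headers, in line with note 7. The saving from a bare '+' line lands a little under the 12.5% quoted there, because the '+' and its newline still have to be stored.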