Bacterial Pathogen Genomics at NCBI

"Bacterial Pathogen Genomics at NCBI", a presentation by Dr. Bill Klimke at the Standards for Pathogen Identification via NGS (SPIN) workshop, hosted by the National Institute of Standards and Technology in October 2014.

  1. Bacterial Pathogen Genomics at NCBI
  2. Our Current Model: publicly available data
     Diagram labels: FDA, USDA, CDC; State, Local and Foreign Public Health Agencies; Industry/Academia; National Network of Sequencers; International Network of Sequencers; DATA ACQUISITION; DATA ASSEMBLY AND STORAGE and Analysis; NCBI, EMBL, DDBJ (INDIS) (Public Access Database); Additional DATA ANALYSIS
  3. Automated Bacterial Assembly
     SRA reads (sample 1) → Trim reads (Ns, adaptor) → Find closest reference genome(s) (reference distance tree) → De novo assembly panel (SOAPdenovo, GS-assembler (newbler), MaSuRCA, Celera Assembler, SPAdes) and Argo (reference-assisted assembly) → ArgoCA (Combined Assembly) → Reads remapped to combined assembly
     Outputs: contig FASTA, read placements (BAM), quality profile
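The "Trim reads (Ns, adaptor)" step on slide 3 can be illustrated with a minimal Python sketch. This is not the NCBI pipeline's trimmer: the function name `trim_read`, the example read, and the adaptor sequence are all hypothetical, and only exact adaptor matches are clipped.

```python
def trim_read(seq: str, adaptor: str) -> str:
    """Clip an exact adaptor match and strip N runs from both ends.

    Hypothetical helper for illustration only; real trimmers also handle
    partial adaptor matches, mismatches, and base qualities.
    """
    seq = seq.upper()
    if adaptor:
        # Remove the adaptor and everything 3' of it, if present.
        pos = seq.find(adaptor.upper())
        if pos != -1:
            seq = seq[:pos]
    # Strip ambiguous bases (N) from the 5' and 3' ends.
    return seq.strip("N")


# Made-up read and adaptor, not taken from any real run.
print(trim_read("NNACGTACGTNNAGATCGGAAGAGC", "AGATCGGAAGAGC"))
```

Keeping the untrimmed reads means a step like this can be rerun with different settings at any time.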
  4. WGS & Epidemiologically Relevant Distance (ERD)
     • WGS allows high-resolution genotypic comparison of pathogen isolates
     • What is the epidemiological relevance of genotypic distance?
     • Many methods to compute; we need some common principles…
  5. Since all approaches start with sequence reads, we must retain them for independent confirmation
     [Figure, two panels: SRA format bytes per sequenced base versus number of bases (millions), with and without qualities. Panel 1: FDA-CFSAN, microbial foodborne pathogen research (MiSeq runs). Panel 2: Oxford University, Population Genomics of Mycobacterium tuberculosis (MiSeq and HiSeq runs).]
     Storage is manageable…
  6. Reliable, transparent, high-throughput, high-resolution ERDs? The major challenge is to distinguish independent events (SNPs) from single events that generate multiple nucleotide differences, i.e. collapsed repeats and other artifacts, alignment errors (reference-based alignments), sequence quality, and recombination.
  7. Fairly uniform distribution of differences along the two genomes…? [Plot: cumulative count of differences]
  8. Iterative density filtering (Richa Agarwala's modification of Science. 2011 Jan 28;331(6016):430-4).
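Slides 6 to 8 describe removing dense runs of differences, which are more likely to come from a single event such as recombination or a collapsed repeat than from independent SNPs. The Python sketch below shows one plausible reading of "iterative density filtering". It is not the cited method or the NCBI modification: the window size, the `factor` multiplier, the `min_count` floor, and the function name are assumptions made for illustration.

```python
from bisect import bisect_left


def density_filter(positions, genome_length, window=1000, factor=5.0, min_count=3):
    """Iteratively drop differences that sit in unusually dense windows.

    Assumed rule (hypothetical): a window anchored at a difference is
    "dense" if it holds more differences than max(min_count, factor times
    the count expected from the current background rate).
    """
    kept = sorted(set(positions))
    while kept:
        # Background expectation is re-estimated from what is still kept.
        expected = len(kept) / genome_length * window
        cutoff = max(min_count, factor * expected)
        dense = set()
        for i, start in enumerate(kept):
            # Differences falling in [start, start + window).
            j = bisect_left(kept, start + window)
            if j - i > cutoff:
                dense.update(kept[i:j])
        if not dense:
            break
        kept = [p for p in kept if p not in dense]
    return kept


# Made-up difference positions: five clustered near 50 kb, three isolated.
diffs = [100, 50_000, 50_010, 50_020, 50_030, 50_040, 120_000, 300_000]
print(density_filter(diffs, genome_length=1_000_000))
```

Re-estimating the background rate after each pass is what makes the procedure iterative: once a dense cluster is removed, the expected count per window drops and the next pass is judged against the lower rate.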
  9. Processing Status
     Table: Samples currently processed (as of Sept 5, 2014) in NCBI Pathogen Pipeline
     Center                  Listeria  Salmonella  E. coli  Total
     CDC                     903       -           -        903
     FDA + State Partners*   858       6129        307      7294
     100K                    -         565         34       599
     FERA                    14        -           -        14
     Total                   1775      6694        341      8810
  10. How to measure the system?
      • Need the raw data (sequence reads) in unprocessed form
      • Any read trimming/filtering, along with the assembly, can be regenerated
  11. Assembly metrics
      • Map the reads back to the assembly and generate a profile of each position (coverage, alleles, qualities)
      • Compare the assembly against other assemblies of the same organism (genus, species) and check the expected genome size, or similarity to related genomes
      • Annotation metrics such as frameshifted proteins
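Two of the simpler checks implied by slide 11 can be sketched in Python: a contiguity statistic (N50) and a comparison against an expected genome size. The function names, the 20% tolerance, and the 3,000,000-base expected size are assumptions for illustration, not values from the talk.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def size_within_expected(assembly_size, expected_size, tolerance=0.2):
    """True if the assembly size is within +/- tolerance of the expected size."""
    return abs(assembly_size - expected_size) <= tolerance * expected_size


print(n50([10, 20, 30, 40]))
# Assembly sizes as reported on slides 24 and 21; expected size is hypothetical.
print(size_within_expected(3180478, 3_000_000))
print(size_within_expected(5251272, 3_000_000))
```

An assembly far larger than expected for the organism is one signal that more than one genome may be present in the sample.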
  12. What is the actual measurement for sequence similarity?
      • The number of pairwise SNPs between two genomes
      What is the threshold?
      • A pairwise distance (an observationally determined cutoff below which a cluster of 2 or more isolates is considered close enough to warrant further investigation)
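The measurement and threshold on slide 12 can be illustrated with a short Python sketch: count pairwise SNPs between aligned sequences, then link isolates whose distance is at or below a cutoff. Single-linkage grouping, skipping of N and gap characters, treating "below" as "at or below", and the toy sequences are all assumptions; the slide does not specify how clusters are formed.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count differing positions in two aligned sequences, ignoring N and gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"N", "-"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in skip and y not in skip
    )


def cluster_by_threshold(seqs: dict, threshold: int):
    """Single-linkage clusters: two isolates are linked if distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(seqs, 2):
        if snp_distance(seqs[u], seqs[v]) <= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    # Report only clusters of two or more isolates, as on the slide.
    return [g for g in groups.values() if len(g) >= 2]


# Toy aligned sequences, not real isolates.
isolates = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTA", "D": "ACGTANGT"}
print(cluster_by_threshold(isolates, threshold=1))
```

With single linkage, an isolate joins a cluster if it is close to any member, so clusters can chain; whether that is appropriate is part of choosing the threshold.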
  13. Sensitivity vs. Specificity (sequence clustering)
      • Sensitivity: measure of isolates which belong to the cluster within epidemiologically relevant distance: true positives / (true positives + false negatives), where false negatives are isolates not correctly identified
      • Specificity: measure of isolates which are excluded from a cluster within epidemiologically relevant distance: true negatives / (true negatives + false positives)
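The two ratios on slide 13 translate directly into code. The Python sketch below derives the four counts from a predicted cluster and an epidemiologically confirmed set; the helper names and the toy isolate labels are hypothetical.

```python
def confusion_counts(predicted: set, truth: set, all_isolates: set):
    """Return (tp, fn, fp, tn) for a predicted cluster against the true cluster."""
    tp = len(predicted & truth)                  # in the cluster, correctly
    fn = len(truth - predicted)                  # belong, but were missed
    fp = len(predicted - truth)                  # included, but do not belong
    tn = len(all_isolates - predicted - truth)   # correctly excluded
    return tp, fn, fp, tn


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


# Toy example: 10 isolates, 4 truly related, 4 clustered by sequence.
everything = set("abcdefghij")
truth = set("abcd")
predicted = set("abce")
tp, fn, fp, tn = confusion_counts(predicted, truth, everything)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Neither function guards against an empty denominator, so callers need at least one truly related and one truly unrelated isolate.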
  14. Processing Problems
      Columns: Organism; Total samples; Not expected species (1); Mixed organisms; Less than 5X coverage; Duplicates; PacBio; Poor 2nd read; Failed assembly stage
      (empty cells not preserved; counts listed in slide order)
      Listeria: 1775; 20; 2 (?); 1; 5; 1
      Salmonella: 6694; 35; 5; 9; 12
      E. coli: 341; 8; 1
      (1) Not L. monocytogenes, S. enterica, or E. coli
  15. PROBLEMS!
  16. Reference Materials
  17. Tree leaf labels (scale bar 0.05):
      Streptococcus massiliensis 4401825 - CANO - GCA_000341525.1
      Streptococcus massiliensis DSM 18628 - ARCE - GCA_000380065.1
      Streptococcus intermedius BA1 - ANFT - GCA_000313655.1
      Streptococcus intermedius B196 - - GCA_000463355.1
      Streptococcus intermedius C270 - - GCA_000463385.1
      Streptococcus intermedius F0413 - AFXO - GCA_000234035.1
      Streptococcus intermedius SK54 - AJKN - GCA_000258445.1
      Streptococcus intermedius JTH08 - - GCA_000306805.1
      Streptococcus intermedius ATCC 27335 - ATFK - GCA_000413475.1
      Streptococcus intermedius F0395 - AFXN - GCA_000234015.1
      Streptococcus sp. AS20 - JANS - GCA_000524255.1
      Streptococcus constellatus subsp. constellatus SK53 - AICQ - GCA_000257785.1
      Streptococcus constellatus subsp. constellatus SK53 - BASU - GCA_000474075.1
      Streptococcus constellatus subsp. pharyngis C1050 - - GCA_000463425.1
      Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - AFUP - GCA_000223295.2
      Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - BASX - GCA_000474135.1
      Streptococcus constellatus subsp. pharyngis C232 - - GCA_000463395.1
      Streptococcus constellatus subsp. pharyngis C818 - - GCA_000463445.1
      Streptococcus anginosus SK1138 - ALJO - GCA_000287595.1
      Streptococcus sp. CM7 - JATP - GCA_000526035.1
      Streptococcus sp. OBRC6 - JACR - GCA_000517685.1
      Streptococcus anginosus F0211 - AECT - GCA_000184365.2
      Streptococcus anginosus 1505 - BASW - GCA_000474115.1
      Streptococcus sp. ACC21 - JAQU - GCA_000524375.1
      Streptococcus sp. AC15 - JDFJ - GCA_000565055.1
      Streptococcus anginosus subsp. whileyi MAS624 - - GCA_000478925.1
      Streptococcus anginosus subsp. whileyi CCUG 39159 - AICP - GCA_000257765.1
      Streptococcus anginosus C238 - - GCA_000463505.1
      Streptococcus anginosus DORA_7 - AZMF - GCA_000508545.1
      Streptococcus anginosus 1_2_62CV - ADME - GCA_000186545.1
      Streptococcus anginosus C1051 - - GCA_000463465.1
      Streptococcus anginosus T5 - BASY - GCA_000474155.1
      Streptococcus anginosus SK52 = DSM 20563 - AFIM - GCA_000214555.2
      Streptococcus anginosus SK52 = DSM 20563 - AREF - GCA_000373605.1
      Streptococcus anginosus SK52 = DSM 20563 - BAST - GCA_000474055.1
      Streptococcus intermedius SK54 - BASV - GCA_000474095.1
  18. Tree leaf labels (scale bar 0.01):
      Escherichia coli KTE179 - ANYQ - GCA_000326485.1
      Escherichia coli KTE229 - ANXK - GCA_000353165.1
      Escherichia coli H252 - AEFI - GCA_000190895.1
      Escherichia coli HVH 180 (4-3051617) - AVYH - GCA_000458685.1
      Escherichia coli HVH 73 (4-2393174) - AVUX - GCA_000457025.1
      Escherichia coli HVH 104 (4-6977960) - AVVT - GCA_000457455.1
      Escherichia coli HVH 19 (4-7154984) - AVTL - GCA_000456265.1
      Escherichia coli 908675 - AXTY - GCA_000488755.1
      Escherichia coli HVH 127 (4-7303629) - AVWO - GCA_000457855.1
      Escherichia coli HVH 12 (4-7653042) - AVTG - GCA_000494955.1
      Escherichia coli KOEGE 32 (66a) - AWAD - GCA_000459635.1
      Escherichia coli UMEA 3041-1 - AWAW - GCA_000460015.1
      Escherichia coli HVH 148 (4-3192490) - AVXH - GCA_000495015.1
      Escherichia coli HVH 59 (4-1119338) - AVUQ - GCA_000456885.1
      Escherichia coli HVH 222 (4-2977443) - AVZU - GCA_000459455.1
      Escherichia coli UMEA 3140-1 - AWBK - GCA_000460295.1
      Escherichia coli HVH 178 (4-3189163) - AVYG - GCA_000495055.1
      Escherichia coli KTE4 - ANSO - GCA_000350645.1
      Escherichia coli KTE3 - ASTO - GCA_000407685.1
      Escherichia coli KTE240 - ASUS - GCA_000408305.1
      Escherichia coli BIDMC 49b - JAPT - GCA_000522365.1
      Escherichia coli BIDMC 49a - JAPU - GCA_000522385.1
      Escherichia coli APEC O1 - - GCA_000014845.1
      Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - BAIM - GCA_000613265.1
      Escherichia coli JCM 20135 - BAKV - GCA_000614505.1
      Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - AGSE - GCA_000690815.1
      Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - JMST - GCA_000734955.1
      Escherichia coli HVH 214 (4-3062198) - AZJN - GCA_000507665.1
      Escherichia coli UMEA 3162-1 - AWBU - GCA_000460475.1
      Escherichia coli HVH 191 (3-9341900) - AVYR - GCA_000458875.1
      Escherichia coli HVH 170 (4-3026949) - AVYA - GCA_000458555.1
      Escherichia coli S88 - - GCA_000026285.1
      Escherichia coli UMEA 3893-1 - AWEI - GCA_000461775.1
      Escherichia coli HVH 217 (4-1022806) - AVZQ - GCA_000459375.1
      Escherichia coli KTE5 - ANSP - GCA_000350665.1
      Escherichia coli KTE7 - ASTP - GCA_000407705.1
      Escherichia coli HVH 32 (4-3773988) - AVTX - GCA_000456505.1
      Escherichia coli UMEA 3206-1 - AWCK - GCA_000460795.1
      Escherichia coli UMEA 3203-1 - AWCJ - GCA_000460775.1
      Escherichia coli KTE62 - ANUK - GCA_000351605.1
      Escherichia coli KTE27 - ASTY - GCA_000407885.1
      Escherichia coli cloneA_i1 - AEYT - GCA_000233675.2
      Escherichia coli 597 - AYQU - GCA_000503475.1
      Escherichia coli HVH 203 (4-3126218) - AVZD - GCA_000459115.1
      Escherichia coli UMEA 3702-1 - AWDZ - GCA_000461595.1
      Escherichia coli UMEA 3662-1 - AWDU - GCA_000461495.1
      Escherichia coli HVH 5 (4-7148410) - AVTB - GCA_000456085.1
      Escherichia coli HVH 102 (4-6906788) - AVVR - GCA_000465155.1
      Escherichia coli HVH 201 (4-4459431) - AVZB - GCA_000459075.1
      Escherichia coli HM605 - AJWU - GCA_000264175.1
      Escherichia coli HM605 - CADZ - GCA_000285375.1
  19. http://www.ncbi.nlm.nih.gov/assembly/?term=%22anomalous%22[Properties]
  20. Contamination (multiple organisms)
  21. Assembly for sample SAMN02727350
      Type                               Number of contigs   Sum of contig lengths
      Full assembly                      667                 5251272
      Contigs with Listeria hits         37                  3031650
      Contigs with Staphylococcus hits   630                 2203573
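The figures on slide 21 can be turned into per-genus shares of the assembly with a few lines of Python. The numbers are the ones on the slide; the function names and the 5% flagging cutoff are hypothetical, and nothing here reflects how the NCBI pipeline itself flags mixed samples.

```python
# Values as reported on slide 21 for SAMN02727350.
FULL_ASSEMBLY_LENGTH = 5251272
CONTIG_LENGTH_BY_GENUS = {"Listeria": 3031650, "Staphylococcus": 2203573}


def length_shares(length_by_genus, total_length):
    """Fraction of the full assembly length attributed to each genus."""
    return {genus: length / total_length for genus, length in length_by_genus.items()}


def looks_mixed(shares, min_share=0.05):
    """True if more than one genus holds at least min_share of the assembly.

    The 5% cutoff is an assumption for illustration.
    """
    return sum(1 for share in shares.values() if share >= min_share) > 1


shares = length_shares(CONTIG_LENGTH_BY_GENUS, FULL_ASSEMBLY_LENGTH)
for genus, share in shares.items():
    print(f"{genus}: {share:.1%}")
print("mixed sample suspected:", looks_mixed(shares))
```

Roughly 58% of the assembled bases fall in contigs with Listeria hits and roughly 42% in contigs with Staphylococcus hits.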
  22. Contamination (carryover contamination)
  23. Contamination (multiple strains)
  24. Table: Assembly stats for SAMN02693748
      Measurement                  Result
      num_input_reads              4212706
      aligned_reads                4040070
      assembly_num_bases           3180478
      assembly_num_contigs         50
      assembly_N50                 2817733
      poor_quality_support_bases   132321
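Two ratios make the slide 24 statistics easier to compare across samples: the fraction of input reads that align back to the assembly, and the fraction of assembled bases with poor read support. The Python sketch below uses the slide's values and field names; the `summarize` helper is hypothetical, and reading a high poor-support fraction as a sign of mixed strains is an interpretation, not something the slide states.

```python
# Values as reported on slide 24 for SAMN02693748.
stats = {
    "num_input_reads": 4212706,
    "aligned_reads": 4040070,
    "assembly_num_bases": 3180478,
    "assembly_num_contigs": 50,
    "assembly_N50": 2817733,
    "poor_quality_support_bases": 132321,
}


def summarize(s):
    """Derive two simple ratios from the raw assembly statistics."""
    return {
        "aligned_fraction": s["aligned_reads"] / s["num_input_reads"],
        "poor_support_fraction": s["poor_quality_support_bases"] / s["assembly_num_bases"],
    }


summary = summarize(stats)
print(f"aligned reads: {summary['aligned_fraction']:.1%}")
print(f"bases with poor support: {summary['poor_support_fraction']:.1%}")
```

About 96% of reads align, while about 4% of assembled bases have poor support.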
  25. Organism                                        Biosample     SRA Run     Similarity to:
      Listeria monocytogenes IEH-NGS-LIS-00100        SAMN02567873  SRR1207486  Listeria SLCC7179
                                                                    SRR1220750  Listeria J0161
      Salmonella enterica Enteritidis MDH-2014-00798  SAMN02741943  SRR1553852  Schwarzengrund str. CVM19633
                                                                    SRR1272871  Enteritidis str. P125109
      Salmonella enterica Fluntern MDH-2013-00153     SAMN02378158  SRR1067624  Javiana and Schwarzengrund
                                                                    SRR1395304  Cubana and Agona
  26. Proficiency Testing
      • Replicate results (phylogeny, SNPs) from published studies
      • Resequencing
        ✓ same isolate on multiple platforms
        ✓ same isolate in multiple libraries
        ✓ same isolate in multiple labs
      • Blinded submissions
        ✓ already-characterized isolates
        ✓ mixed sample isolates
        ✓ metagenomic isolates
      • Corner cases
        ✓ Extreme coverage
        ✓ Duplicates
        ✓ Sample mixups
  27. Acknowledgements
      National Center for Biotechnology Information, National Library of Medicine, Bethesda MD 20892 USA (http://www.ncbi.nlm.nih.gov)
      NCBI: Richa Agarwala, Azat Badretdin, Slava Brover, Joshua Cherry, Vyacheslav Chetvernin, Robert Cohen, Michael DiCuccio, Mike Feldgarden, Dan Haft, William Klimke, Arjun Prasad, Edward Rice, Kirill Rotmistrovskyy, Stephen Sherry, Sergey Shiryev, Martin Shumway, Tatiana Tatusova, Igor Tolstoy, Chunlin Xiao, Leonid Zaslavsky, Alexander Zasypkin, Alejandro A. Schaffer, Lukas Wagner, Aleksandr Morgulis, David Lipman, James Ostell
      CDC, FDA/CFSAN, NIHGRI, UC-Davis, USDA
      Vendors: PacBio, Illumina, Roche
      • This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
