Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

  1. Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark Zhong Wang, Ph.D. Group Lead, Genome Analysis 05/23/2019
  2. DOE JGI: A brief history. 1999-2007; 2008-now: JGI as the DOE sequencing center dedicated to plants and microbes.
  3. Our Mission. DOE JGI serves as a genomic user facility in support of the DOE missions: bioenergy, carbon cycling, & biogeochemistry. • Walnut Creek 1999-2019 • Berkeley, CA • 250 employees • $70M annual budget
  4. Our sequencer lineup. Short-read technologies (Illumina): MiSeq, NextSeq 500, HiSeq 2500, NovaSeq 6000. Long-read technologies: PacBio RSII, PacBio Sequel, Oxford Nanopore MinION and PromethION. 200 Tb of sequencing data in FY18.
  5. Genomics big data is not typical big data: it is unstructured, and its volume, variety, and veracity increase during analytics.
  6. A metagenome is the genome of a microbial community. A 10-second "intimate kiss" transfers ~80 million bacteria. Metagenomics questions: Who is there? What do they do? How do they interact?
  7. Microbial communities are "dark matter". Number of species: human ~1,000; cow ~6,000; soil >100,000. More than 90% of the species have never been seen before.
  8. Metagenome sequencing and assembly: harvest microbes from the microbial community, extract the metagenome DNA, shear and sequence it into short reads, then assemble the reads into reconstructed genomes.
  9. The metagenome assembly problem, by analogy: a genome ~= a book; a metagenome ~= a library of books; sequencing ~= shredding the library, sampling the pieces, and reading them; assembly ~= the "reconstructed" library.
  10. Scale is an enemy. [Bar chart: dataset sizes in gigabases (Gb), log scale from 1 to 1,000,000, for typical, human, cow, ocean, and soil metagenomes.]
  11. Complexity is another: the pipeline must remove contaminants and sequencing errors, build an overlap graph or de Bruijn graph, and derive contigs or clusters, all while coping with repetitive elements, homologous genes, and horizontally transferred genes.
  12. The ideal solution (easy to develop, robust, scales to big data, efficient) and the failed ones: BigMem is easy to develop but expensive and does not scale; MPI is fast but hard to develop and not robust; Hadoop is easy to develop and scales but is slow.
  13. Addressing big data: Apache Spark. A new scalable programming paradigm, compatible with Hadoop-supported storage systems. It improves efficiency through in-memory computing primitives and general computation graphs, and improves usability through rich APIs in Java, Scala, and Python plus an interactive shell. Spark checks all four boxes: it scales to big data, and it is efficient, easy to develop, and robust (see the sketch below).
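
As an illustration of the usability claims above, here is a minimal Scala sketch (not from the talk; the app name and HDFS path are placeholders) that loads reads into an RDD and caches them in memory so later stages avoid re-reading from disk:

```scala
import org.apache.spark.sql.SparkSession

object SparkQuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MetagenomeClustering")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load reads (one sequence per line; placeholder path) and cache them
    // so later graph-building stages reuse the in-memory copy instead of
    // re-reading from disk.
    val reads = sc.textFile("hdfs:///data/reads.txt").cache()
    println(s"Loaded ${reads.count()} reads")

    spark.stop()
  }
}
```
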
  14. Goal: metagenome read clustering. Clustering reads reduces the metagenome problem to a set of single-genome problems, enabling parallel processing and individualized optimization of each cluster (reads → read clusters).
  15. Algorithm. A k-mer is to a read what a word is to a sentence. Three stages: (1) k-mer-mapping reads; (2) graph construction and edge reduction, producing a read graph containing all reads (node: a read; edge weight: the number of k-mers two reads share); (3) graph partitioning with the Label Propagation Algorithm (LPA). A sketch of the whole pipeline follows.
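
A minimal sketch of the three stages on toy data, in Scala with GraphX, whose built-in label propagation stands in for the talk's LPA step. The read IDs, sequences, and k = 5 are illustrative assumptions, not the authors' implementation:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.LabelPropagation
import org.apache.spark.sql.SparkSession

object ReadClusteringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadClusteringSketch").getOrCreate()
    val sc = spark.sparkContext
    val k = 5 // toy k-mer size; real pipelines use a larger k

    // Toy reads: 1 and 2 overlap heavily, 3 is unrelated.
    val reads = sc.parallelize(Seq(
      (1L, "ACGTACGTACGT"),
      (2L, "CGTACGTACGTT"),
      (3L, "GGGGCCCCAAAA")
    ))

    // Stage 1 (k-mer mapping): map every distinct k-mer to the reads containing it.
    val kmerToReads = reads
      .flatMap { case (id, seq) => seq.sliding(k).map(kmer => (kmer, id)) }
      .distinct()
      .groupByKey()

    // Stage 2 (graph construction): reads sharing a k-mer become an edge;
    // the edge weight counts how many distinct k-mers the pair shares.
    val edges = kmerToReads
      .flatMap { case (_, ids) =>
        val sorted = ids.toSeq.sorted
        for (a <- sorted; b <- sorted if a < b) yield ((a, b), 1)
      }
      .reduceByKey(_ + _)
      .map { case ((a, b), w) => Edge(a, b, w) }

    // Stage 3 (graph partitioning): label propagation assigns each read
    // a cluster label by iteratively adopting the majority label of its neighbors.
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    val clusters = LabelPropagation.run(graph, maxSteps = 5)
    clusters.vertices.collect().foreach { case (read, label) =>
      println(s"read $read -> cluster $label")
    }

    spark.stop()
  }
}
```

On this toy input, reads 1 and 2 end up in one cluster and read 3 stays apart, mirroring how clustering splits a metagenome into per-genome subproblems.
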
  16. Clustering performance on long reads (read length = 500-20,000 bp).
  17. Short reads? Not so much (read length = 150 bp).
  18. Can long reads come to the rescue?
  19. Hybrid clustering
  20. A tradeoff between cost and performance. [Plot: mean cluster size (K), number of reads (M), and number of clusters as functions of the percentage of long reads used, from 0% to 100%.]
  21. Short-read only: there is still a way out
  22. More samples, better results: one vs 50
  23. More data, better results: clustering success is dependent on coverage
  24. Can we scale to big data?
  25. Hardware and software environments. Customized cluster: 20 nodes, 8 cores per node (160 total), 64 GB memory per node (1,280 GB total), Hadoop 2.7.3, Spark 2.1.1. EMR: 20 nodes, 8 cores per node (160 total), 61 GB per node (1,220 GB total), Hadoop 2.7.3, Spark 2.2.0. Bridges: 8 nodes, 28 cores per node (224 total), 128 GB per node (1,024 GB total), Hadoop 2.7.2, Spark 2.1.0.
  26. A quick reminder of the three stages: k-mer-mapping reads (KMR); graph construction and edge reduction (Edges); Label Propagation Algorithm (LPA). Node: a read; edge: the number of k-mers two reads share. A sketch of the edge-reduction step follows.
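
The edge-reduction step is named in the slides but not detailed. The sketch below shows two plausible heuristics under stated assumptions: capping k-mer frequency (repetitive k-mers connect unrelated reads) and requiring a minimum number of shared k-mers per edge. The function, its name, and both thresholds are illustrative, not the authors' code:

```scala
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

object EdgeReduction {
  // Hedged sketch of edge reduction: drop k-mers that occur in too many
  // reads (likely repetitive elements), and drop edges supported by too
  // few shared k-mers (likely spurious overlaps). maxKmerFreq and
  // minSharedKmers are illustrative defaults.
  def reduce(
      kmerToReads: RDD[(String, Iterable[Long])],
      edges: RDD[Edge[Int]],
      maxKmerFreq: Int = 100,
      minSharedKmers: Int = 2
  ): (RDD[(String, Iterable[Long])], RDD[Edge[Int]]) = {
    val keptKmers = kmerToReads.filter { case (_, ids) => ids.size <= maxKmerFreq }
    val keptEdges = edges.filter(_.attr >= minSharedKmers)
    (keptKmers, keptEdges)
  }
}
```
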
  27. Scaling to bigger data volumes on a 20-node cluster. [Plot: execution time (mins, 0-800) vs. data size (20-100 GB) for KMR, Edges, LPA, and the total.]
  28. Increasing the number of nodes on a 50 GB dataset. [Plot: execution time (mins, 0-500) vs. number of nodes (25-100) for KMR, Edges, LPA, and the total.]
  29. Fine-tuning parallelism. [Plot: execution time (mins, 0-350) vs. Spark default parallelism (log10 scale, 1-8) for 50 GB and 20 GB datasets.] A configuration sketch follows.
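
The parallelism sweep above corresponds to tuning Spark's `spark.default.parallelism` setting, which controls the default partition count for RDD shuffle operations such as groupByKey and reduceByKey. The value below is illustrative, not the talk's optimum:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismTuning {
  def main(args: Array[String]): Unit = {
    // Too few shuffle partitions underuse the cluster; too many add
    // scheduling overhead, which matches the U-shaped curves in the plot.
    val spark = SparkSession.builder()
      .appName("ReadClusteringSketch")
      .config("spark.default.parallelism", "1000") // illustrative value
      .getOrCreate()

    // ... run the KMR / Edges / LPA stages here ...

    spark.stop()
  }
}
```
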
  30. Dataset complexity vs. performance. [Bar chart: total execution time broken into KMR, Edges, and LPA; 146.33 mins for the human Alzheimer Iso-Seq dataset (PacBio) vs. 44.5 mins for the cow rumen dataset (Illumina).]
  31. Platform comparison, clouds vs. HPC (same configurations as slide 25): the customized cluster finished in 106 min, EMR in 105 min, and Bridges in 126 min.
  32. Now we have a big hammer…
  33. Clustering for identifying genome contaminants: Russula (70 Mb), Bradyrhizobium (7.2 Mb), Collimonas (5.3 Mb).
  34. Targeting big metagenome projects: two lakes, 1.2 Tbp. Dr. Morgan-Kiss @ Miami University; Dr. Slonczewski @ Kenyon College.
  35. Acknowledgements. Spark team: Lizhen Shi @ FSU; Xiandong Meng; Kexue Li, Lili Wang, and Li Deng @ Shanghai U; Kurt LaButti; Elizabeth Tseng @ PacBio; Lisa Gerhardt and Evan Racah @ NERSC; Yong Qin, Gary Jung, Greg Kurtzer, and Bernard Li @ HPC; Philip Blood and Bryon Gill @ PSC.