Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
1. Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Zhong Wang, Ph.D.
Group Lead, Genome Analysis
05/23/2019
2. DOE JGI: A brief history
1999-2007
2008-now: JGI as the DOE sequencing center dedicated to plants and microbes.
3. Our Mission
DOE JGI, serving as a genomic user facility in support of the DOE missions: bioenergy, carbon cycling, & biogeochemistry
• Walnut Creek 1999-2019
• Berkeley, CA
• 250 employees
• $70M annual budget
5. Genomics big data is not typical big data
Unstructured
Volume, variety, and veracity increase during analytics
6. Metagenome is the genome of a microbial community
10-second "intimate kiss" = 80 million bacteria
Metagenomics questions: Who is there? What do they do? How do they interact?
7. Microbial communities are “dark matter”
Number of species
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000
>90% of the species haven’t been seen before
8. Metagenome sequencing and assembly
Microbial community → Harvest microbes → Extract DNA → Metagenome DNA → Shear & sequencing → Short reads → Assembly → Reconstructed genomes
9. The metagenome assembly problem
Library of books → Shredded library → “Reconstructed” library
Genome ~= Book
Metagenome ~= Library
Sequencing ~= sampling the pieces and reading them
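To make the library analogy concrete, here is a minimal Python sketch of shotgun sampling: short fixed-length pieces are drawn at random from several "books" (genomes). The function name `shotgun_reads`, the toy sequences, and the uniform sampling model are assumptions for illustration only, not part of the talk.

```python
import random

def shotgun_reads(genomes, read_len, n_reads, seed=0):
    """Sample fixed-length reads uniformly from a dict of genome strings.

    Returns (read, source_name) pairs. The source name is kept only so the
    toy example can be checked; real reads carry no such label, which is
    why clustering and assembly are needed.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    reads = []
    for _ in range(n_reads):
        name = rng.choice(names)
        genome = genomes[name]
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((genome[start:start + read_len], name))
    return reads

# Hypothetical two-book "library" (toy sequences, not real genomes).
library = {
    "book_A": "ACGTTGCATGCCGATAGCTAGGCTA",
    "book_B": "TTGACCGGTAACGTCAGTCCATGGA",
}
reads = shotgun_reads(library, read_len=8, n_reads=10)
```

Once the source labels are discarded, the pieces from all books are mixed together, which is the situation a metagenome assembler starts from.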
10. Scale is an enemy
[Figure: data size in gigabases (Gb), log scale from 1 to 1,000,000, for Typical, Human, Cow, Ocean, and Soil]
11. Complexity is another…
Remove contaminants, sequencing errors
Overlap graph
de Bruijn graph
Contigs or clusters
Repetitive elements
Homologous genes
Horizontally transferred genes
12. The ideal solution and the failed ones
Easy to develop
Robust
Scale to big data
Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scales
• Slow
13. Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
Scale to big data
Efficient
Easy to develop
Robust
14. Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to single-genome problems
• Parallel processing
• Individualized optimization
Reads → Read clusters
15. Algorithm
Read graph containing all reads
Node: read
Edge: number of k-mers two reads share
A k-mer is to a read what a word is to a sentence
• Kmer-mapping reads
• Graph construction and edge reduction
• Graph partitioning: Label Propagation Algorithm (LPA)
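The first two steps on this slide can be sketched on a single machine as follows. This is a minimal illustration in plain Python, not SpaRC's Spark implementation: the function names, the `min_shared` threshold used for edge reduction, and the toy reads are all assumptions, and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_to_reads(reads, k):
    """Kmer-mapping reads: map each k-mer to the ids of reads containing it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index

def read_graph(index, min_shared=1):
    """Graph construction: edge weight = number of k-mers two reads share.

    Edge reduction (assumed here to be a simple threshold): drop edges
    whose weight is below `min_shared`.
    """
    weights = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}

# Toy reads: 0 and 1 overlap, 2 and 3 overlap, the two pairs share nothing.
reads = ["ACGTACGG", "GTACGGTT", "TTTTCCCC", "TTCCCCAA"]
edges = read_graph(kmer_to_reads(reads, k=4), min_shared=2)
# edges == {(0, 1): 3, (2, 3): 3}
```

In a distributed setting the k-mer index and the pair counts would be built with grouped key-value operations rather than in-memory dictionaries; the dictionaries here only make the data flow visible.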
20. A tradeoff between cost and performance
[Figure: mean cluster size (K), #reads (M), and #clusters versus percent of long reads used (0% to 100%)]
26. A quick reminder…
Read graph containing all reads
Node: read
Edge: number of k-mers two reads share
A k-mer is to a read what a word is to a sentence
• Kmer-mapping reads (KMR)
• Graph construction and edge reduction (Edges)
• Graph partitioning: Label Propagation Algorithm (LPA)
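As a reminder of what the LPA stage does, here is a minimal single-machine sketch of label propagation on a weighted read graph. It is an illustration under stated assumptions, not SpaRC's distributed implementation: the function name, the in-place update order, the smallest-label tie-break, and the toy graph are all hypothetical choices.

```python
from collections import defaultdict

def label_propagation(edges, max_iter=20):
    """Cluster nodes of a weighted undirected graph by label propagation.

    `edges` maps (node_a, node_b) to a positive weight. Every node starts
    with its own id as its label, then repeatedly adopts the label with the
    largest total edge weight among its neighbours. Ties go to the smallest
    label so the result is deterministic (an assumption of this sketch).
    """
    neighbours = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w
    labels = {node: node for node in neighbours}
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbours):
            score = defaultdict(int)
            for other, w in neighbours[node].items():
                score[labels[other]] += w
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Toy read graph: two triangles of reads joined by one weak edge (2, 3).
toy_edges = {
    (0, 1): 3, (0, 2): 3, (1, 2): 3,
    (3, 4): 3, (3, 5): 3, (4, 5): 3,
    (2, 3): 1,
}
labels = label_propagation(toy_edges)
```

Reads that end up with the same label form one cluster; in the toy graph the weak edge between the two triangles is not enough to merge them.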
27. Scale to bigger data volume on a 20-node cluster
[Figure: execution time (mins) versus data size (20 to 100 GB) for KMR, Edges, LPA, and Total]
28. Increasing nodes on a 50G-dataset
[Figure: execution time (mins) versus number of nodes (25 to 100) on the 50G dataset for KMR, Edges, LPA, and Total]
34. Targeting big metagenome projects
Dr. Morgan-Kiss @ Miami University
Dr. Slonczewski @ Kenyon College
Two lakes, 1.2 Tbp
35. Acknowledgements
Spark Team
Lizhen Shi @FSU
Xiandong Meng
Kexue Li, Lili Wang and Li Deng @Shanghai U
Kurt Labutti
Elizabeth Tseng @PacBio
Lisa Gerhardt, Evan Racah @NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @HPC
Philip Blood, Bryon Gill @PSC