Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
1. “Using Supercomputers and Gene Sequencers
to Discover Your Inner Microbiome”
Keynote Talk
International Conference on Computational Science
San Diego, CA
June 6, 2016
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
2. Abstract
The human body is host to 100 trillion microorganisms, ten times the number of DNA-bearing cells in the
human body, and these microbes contain 300 times the number of DNA genes that our human DNA does.
The microbial component of our "superorganism" is composed of hundreds of species with immense
biodiversity. To put a more personal face on the "patient of the future," I have been collecting massive
amounts of data from my own body over the last seven years, which reveals detailed examples of the
episodic evolution of this coupled immune-microbial system. Collaborating with the UC San Diego Knight
Lab, we have genetically sequenced a time series of my gut microbiome, as well as single moments from
50 patients with autoimmune disease. An elaborate software pipeline, running on high performance
computers, reveals the details of the microbial ecology and its genetic components, in health as well as in
disease. Not only can we compare a person with a disease to a healthy population, but we can also follow
the dynamics of the diseased patient. We can look forward to revolutionary changes in medical practice
over the next decade.
3. Forty Years of Computing Gravitational Waves
From Colliding Black Holes
1977
L. Smarr and K. Eppley
Gravitational Radiation Computed
from an Axisymmetric
Black Hole Collision
2016
LIGO Consortium
Spiral Black Hole Collision
40 Years
4. Complexity of Computing First Gut Microbiome Dynamics
Versus First Dynamics of Colliding Black Holes
• My 1975 PhD Dissertation
– Solving Einstein’s Equations of General Relativity for Colliding Black Holes and Gravitational Waves
– CDC 6600 Megaflop/s
– Hundreds of Hours
• Rob Knight and Smarr Gut Microbiome Map
– Mapping From Illumina Sequencing to Taxonomy and Gene Abundance Dynamics
– Comet Petaflop/s
– Comet Core is 40,000x CDC 6600 Speed
– Million Core-Hours
– 10,000x Supercomputer Time
• Gut Microbiome Takes ~ ½ Billion Times the Compute Power of Early Solutions of
Dynamic General Relativity
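The ratio quoted on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, not the speaker's own calculation: it reads "Hundreds of Hours" as roughly 100 hours (an assumption, since the slide gives no exact figure) and treats the 1975 run as a single CDC 6600.

```python
# Figures taken from the slide, except cdc_hours, which is an assumed
# reading of "Hundreds of Hours".
core_speedup = 40_000          # one Comet core vs. the CDC 6600
comet_core_hours = 1_000_000   # "Million Core-Hours"
cdc_hours = 100                # assumption: lower end of "hundreds of hours"

# Ratio of machine time, then ratio of total work done.
time_ratio = comet_core_hours / cdc_hours
compute_ratio = core_speedup * time_ratio

print(f"time ratio:    {time_ratio:,.0f}x")
print(f"compute ratio: {compute_ratio:,.0f}x")
```

Under these assumptions the time ratio matches the slide's 10,000x, and the product of 400 million is in line with the "~ ½ Billion Times" estimate.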
5. As a Model for the Precision Medicine Initiative,
I Have Tracked My Internal Biomarkers To Understand My Body’s Dynamics
My Quarterly
Blood Draw
Calit2 64 Megapixel VROOM
6. Only One of My Blood Measurements
Was Far Out of Range--Indicating Chronic Inflammation
Normal Range <1 mg/L
27x Upper Limit
C-Reactive Protein (CRP) is a Blood Biomarker
for Detecting Presence of Inflammation
Episodic Peaks in Inflammation
Followed by Spontaneous Drops
7. Adding Stool Tests Revealed
Oscillatory Behavior in an Immune Variable Which is Antibacterial
Normal Range
<7.3 µg/mL
124x Upper Limit for Healthy
Lactoferrin is a Protein Shed from Neutrophils -
An Antibacterial that Sequesters Iron
Typical
Lactoferrin Value for
Active Inflammatory
Bowel Disease
(IBD)
8. To Understand the Interaction of Genetics and the Immune System
We Must Consider the Human Microbiome
Your Microbiome is
Your “Near-Body” Environment
and its Cells
Contain 100x as Many DNA Genes
As Your Human DNA-Bearing Cells
Your Body Has 10 Times
As Many Microbe Cells As DNA-Bearing
Human Cells
Inclusion of the “Dark Matter” of the Body
Will Radically Alter Medicine
9. Most of Evolutionary Time
Was in the Microbial World
You
Are
Here
Source: Carl Woese et al.
Tree of Life Derived from 16S rRNA Sequences
10. The Cost of Sequencing DNA
Has Fallen Over 100,000x in the Last Ten Years
This Has Enabled Sequencing of
Both Human and Microbial Genomes
11. Interest in the Human Microbiome
Has Moved Quickly From Frontier Science to Public Awareness
June 8, 2012
June 14, 2012
August 18, 2012
June, 2012
13. Mapping Out the Dynamics of Autoimmune Microbiome Ecology
Couples Next Generation Genome Sequencers to Big Data Supercomputers
• “Healthy” Individuals
– 250 Subjects, 1 Point in Time
• Inflammatory Bowel Disease (IBD) Patients
– 5 Ileal Crohn’s Patients, 3 Points in Time
– 2 Ulcerative Colitis Patients, 6 Points in Time
– Larry Smarr (Colonic Crohn’s), 7 Points in Time
• Each Sample Has 100-200 Million Illumina Short Reads (100 bases)
• Total of 27 Billion Reads
Or 2.7 Trillion Bases
Source: Jerry Sheehan, Calit2
Weizhong Li, Sitao Wu, CRBS, UCSD
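The two totals on this slide are the same quantity in different units. A minimal check, assuming every read is exactly 100 bases long as the slide states:

```python
# Figures from the slide; uniform read length is a simplifying assumption.
total_reads = 27_000_000_000   # "27 Billion Reads"
read_length = 100              # bases per Illumina short read

total_bases = total_reads * read_length
print(f"{total_bases:,} bases")   # 2.7 trillion
```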
14. Computational NextGen Sequencing Pipeline:
From Sequence to Taxonomy and Function
PI: (Weizhong Li, CRBS, UCSD):
NIH R01HG005978 (2010-2013, $1.1M)
15. We Used SDSC’s Gordon Data-Intensive Supercomputer
to Completely Analyze a Subset of These Gut Microbiomes
• ~180,000 Core-Hours on Gordon
– KEGG Protein Family Annotation: 90,000 Core-Hours
– Mapping: 36,000 Core-Hours
– Used 16 Cores/Node and up to 50 Nodes
– Duplicate Removal: 18,000 Core-Hours
– Assembly: 18,000 Core-Hours
– Other: 18,000 Core-Hours
• Gordon RAM Required
– 64GB RAM for Reference DB
– 192GB RAM for Assembly
• Gordon Disk Required
– Ultra-Fast Disk Holds Ref DB for All Nodes
– 8TB for All Subjects
Enabled by
a Grant of Time
on Gordon from
SDSC Director
Mike Norman
Source: Weizhong Li, UCSD
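The core-hour breakdown on this slide can be tallied as a quick sanity check. The sketch below uses only the slide's figures. The wall-clock estimate is a lower bound that assumes perfect parallel scaling on the largest allocation mentioned (16 cores/node on 50 nodes), which real jobs will not achieve.

```python
# Core-hours per pipeline stage, as listed on the slide.
gordon_core_hours = {
    "KEGG protein family annotation": 90_000,
    "mapping": 36_000,
    "duplicate removal": 18_000,
    "assembly": 18_000,
    "other": 18_000,
}

total = sum(gordon_core_hours.values())
shares = {step: hours / total for step, hours in gordon_core_hours.items()}

for step, share in shares.items():
    print(f"{step:32s} {share:5.0%}")

# Assumption: ideal scaling on 16 cores/node x 50 nodes = 800 cores.
max_cores = 16 * 50
min_wall_hours = total / max_cores
print(f"total: {total:,} core-hours, at least {min_wall_hours:.0f} wall-clock hours")
```

The stages sum to the ~180,000 core-hours quoted, with KEGG annotation taking half of the budget.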
16. We Used Dell’s HPC Cloud to Extend Our Taxonomic Analysis
to All of Our Human Gut Microbiomes
• Dell’s Sanger Cluster
– 32 Nodes, 512 Cores
– 48GB RAM per Node
• We Processed the Taxonomic Relative Abundance
– Used ~35,000 Core-Hours on Dell’s Sanger
• Produced Relative Abundance of
~10,000 Bacteria, Archaea, Viruses in ~300 People
– ~3 Million Spreadsheet Cells
Source: Weizhong Li, UCSD
Enabled by
a Grant of Time
From Dell/R Systems
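Relative abundance here means each taxon's share of a sample's total. The sketch below illustrates the idea with made-up read counts. The function name and the counts are hypothetical, and the normalization used in the actual pipeline is not described in these slides.

```python
def relative_abundance(counts):
    """Convert per-taxon read counts for one sample into fractions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical counts for a single sample (not data from the study).
sample = {"Bacteroidetes": 600, "Firmicutes": 300, "Proteobacteria": 100}
abund = relative_abundance(sample)
print(abund)

# Size of the full table quoted on the slide: ~10,000 taxa x ~300 people.
table_cells = 10_000 * 300
print(f"{table_cells:,} cells")
```

The last line reproduces the slide's figure of about 3 million spreadsheet cells.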
17. We Found Major State Shifts in Microbial Ecology Phyla
Between Healthy and Three Forms of IBD
Most
Common
Microbial
Phyla
Average HE
Average
Ulcerative Colitis
Average LS
Colonic Crohn’s Disease
Average
Ileal Crohn’s Disease
Collapse of Bacteroidetes
Explosion of Actinobacteria
Explosion of
Proteobacteria
Hybrid of UC and CD
High Level of Archaea
18. Building a UC San Diego High Performance Cyberinfrastructure
to Support Distributed Microbiome Analysis
FIONA
12 Cores/GPU
128 GB RAM
3.5 TB SSD
48TB Disk
10Gbps NIC
Knight Lab
10Gbps
Gordon
Prism@UCSD
Data Oasis
7.5PB,
200GB/s
Knight 1024 Cluster
In SDSC Co-Lo
CHERuB
100Gbps
Emperor & Other Vis Tools
64Mpixel Data Analysis Wall
120Gbps
40Gbps
1.3Tbps
PRP/
19. We Use OpenOrd on Calit2’s 64M Pixel Tiled Wall
to Explore Clustering of Patients and Microbe Species
Ileal
Crohn’s
Healthy
Ulcerative
Colitis
www.sandia.gov/~smartin/presentations/OpenOrd.pdf
Source:
Philip Weber,
QI, UCSD
25. Larry’s 40 Stool Samples Over 3.5 Years
to Rob’s lab on April 30, 2015
26. Larry Smarr Gut Microbiome Ecology Shifted After Drug Therapy
Between Two Time-Stable Equilibriums Correlated to Physical Symptoms
Lialda & Uceris
12/1/13 to 1/1/14
Frequent IBD Symptoms
Weight Loss
7/1/12 to 12/1/14
Blue Balls on
Diagram to the Right
Principal Coordinate Analysis of
Microbiome Ecology
PCoA by Justine Debelius and Jose Navas,
Knight Lab, UCSD
Weight Data from Larry Smarr, Calit2, UCSD
Weekly Weight
Few IBD Symptoms
Weight Gain 1/1/14 to 8/1/16
Red Balls on
Diagram to the Right
27. Each Microbe Contains
a Few Thousand Genes on Its DNA
E. Coli Contains ~5000 Genes on its Circular Chromosome,
Which is 1000x the Length of the Cell!
Several Million Genes Can Occur in the Human Gut Microbiome
28. In a “Healthy” Gut Microbiome:
Large Taxonomy Variation, Low Protein Family Variation
Source: Nature, 486, 207-212 (2012)
Over 200 People
29. We Computed the Relative Abundance
of 10,000 KEGG Orthologous Protein Families In Health and Disease States
http://www.genome.jp/kegg/
Kyoto Encyclopedia
of Genes and
Genomes (KEGG)
30. Using PCA on the 10,000 KEGG Protein Families
We Can Discover Over- and Under-Abundant Genes in Health and Disease
Source: Bryn Taylor, Justine Debelius, Rob Knight, Mehrdad Yazdani, Larry Smarr, UCSD; Weizhong Li, JCVI
31. Using Kolmogorov-Smirnov Test and Random Forest Machine Learning,
We Can Classify Over- and Under-Abundant Protein Families
Source: Bryn Taylor, Justine Debelius, Rob Knight, Mehrdad Yazdani, Larry Smarr, UCSD; Weizhong Li, JCVI
Note: Orders of Magnitude Increase or Decrease in
Protein Families Between Health and Disease
Next Step: Which Proteins (Functions) are Altered?
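The Kolmogorov-Smirnov test compares two samples by the largest gap between their empirical cumulative distributions. The sketch below computes only that statistic, in plain Python, for one hypothetical protein family. The abundance values are invented for illustration, no p-value is computed, and the Random Forest step is not shown. It is not the team's implementation.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max distance between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every observation equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical relative abundances of one protein family (not study data).
healthy = [0.10, 0.12, 0.11, 0.13, 0.09]
disease = [0.010, 0.020, 0.015, 0.012, 0.008]

print(ks_statistic(healthy, disease))   # fully separated samples
print(ks_statistic(healthy, healthy))   # identical samples
```

A statistic near 1 marks a family whose abundance distribution barely overlaps between the two groups, which is the kind of order-of-magnitude shift the slide notes.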
32. To Expand IBD Project the Knight/Smarr Labs Were Awarded
~ 1 Million Core-Hours on SDSC’s Comet Supercomputer
• 8x Compute Resources Over Prior Study
• Smarr Gut Microbiome Time Series
– From 7 Samples Over 1.5 Years
– To 50 Samples Over 4 Years
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients
to ~100 Patients
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD patients
• New Software Suite from Knight Lab
– Re-annotation of Reference Genomes, Functional / Taxonomic Variations
– Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner
33. We Used SDSC’s Comet to Uniformly Compute
Protein-Coding Genes, RNAs, & CRISPR Annotations
• We Downloaded from NCBI Over 60,000 Bacterial and Archaeal Genomes
– Required 5 Core-Hours Per Genome
– 300,000 Core-Hours to Complete
– Ran 24 Cores in Parallel
– Over 400 Days Wall-Clock Time
• Requires a Variety of Software Programs
– Prodigal for Gene Prediction
– Diamond for Protein Homolog Search Against UniRef db
– Infernal for ncRNA Prediction
– RNAMMER for rRNA Prediction
– Aragorn for tRNA Prediction
• Will Make These Results a New Community Database
– Knight Lab, Calit2, SDSC
Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD
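The compute figures on this slide fit together as follows. This is a rough sketch that assumes exactly 60,000 genomes (the slide says "over 60,000") and perfect efficiency across the 24 cores, so the real wall-clock time could differ.

```python
# Figures from the slide; exact genome count and ideal scaling are assumptions.
genomes = 60_000
core_hours_per_genome = 5
cores = 24

total_core_hours = genomes * core_hours_per_genome
wall_hours = total_core_hours / cores
wall_days = wall_hours / 24

print(f"{total_core_hours:,} core-hours")
print(f"{wall_hours:,.0f} wall-clock hours, about {wall_days:.0f} days")
```

Under these assumptions the run takes roughly 520 days, consistent with the slide's "Over 400 Days Wall-Clock Time".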
34. Next Large Supercomputer Project:
Addressing the Challenges of Metagenomic Assembly
• Differences Between Closely Related Strains
• Varying Coverage Depth Across Individual Genomes
• Inter-Species Repeats (Ribosomal Genes, HGTs, etc.)
• Huge Size and Complexity of Datasets
metaSPAdes: a new versatile assembler
for metagenomic data
Nagarajan and Pop Nature Reviews Genetics 2013
Sergey Nurk (1), Dmitry Meleshko (1), Anton Korobeynikov (1) and Pavel Pevzner (1,2)
(1) Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
(2) University of California San Diego, La Jolla, USA
35. Massive Research is Underway to Discover
A Wide Range of New Techniques for Manipulating Your Microbiome
www.huffingtonpost.com/entry/gut-bacteria-microbiome-disease_us_57068c55e4b053766188f383
www.synlogictx.com
36. Genetic Sequencing of Humans and Their Microbes
Is a Huge Growth Area and the Future Foundation of Medicine
Source: @EricTopol
Twitter 9/27/2014
37. Thanks to Our Great Team!
Calit2@UCSD
Future Patient Team
Jerry Sheehan
Tom DeFanti
Joe Keefe
John Graham
Kevin Patrick
Mehrdad Yazdani
Jurgen Schulze
Andrew Prudhomme
Philip Weber
Fred Raab
Ernesto Ramirez
UCSD CSE Department
Pavel Pevzner
JCVI Team
Karen Nelson
Shibu Yooseph
Manolito Torralba
Ayasdi
Devi Ramanan
Pek Lum
UCSD Metagenomics Team
Weizhong Li
Sitao Wu
SDSC Team
Michael Norman
Mahidhar Tatineni
Robert Sinkovits
Ilkay Altintas
UCSD Health Sciences Team
David Brenner
Rob Knight Lab
Justine Debelius
Jose Navas
Bryn Taylor
Gail Ackermann
Greg Humphrey
William J. Sandborn Lab
Elisabeth Evans
John Chang
Dell/R Systems
Brian Kucic
John Thompson
Thomas Hill