1. Taxon diversity analysis for bulk insect samples using the
Illumina HiSeq platform
Xin ZHOU, Shanlin LIU, Yiyuan LI,
Qing YANG, and Xu SU
Department of Science and Technology
Environmental Genomics Research Group
BGI, China
Adelaide, Australia, 3 December 2011
2. Problem Solutions?
Opt.1: ......zzzzZZZZZ
Opt.2: morph sorting → indiv. ID → … → Opt.1
Opt.3: morph sorting → indiv. barcoding → … → Opt.1
Opt.4: grinding up → NGS → CLUSTERING/BLAST → DIVERSITY!
Zhou et al. 2011, 4th International Barcode of Life Conference
3. Environmental barcoding of bulk insects
Sample                   Marker                  Platform
aquatic insects          mini-barcode (130 bp)   454
bat diet (insects)       COI fragment, 157 bp    454
Malaise trap (insects)   COI fragment, ~400 bp   454
Malaise trap study: Biodiversity soup: metabarcoding of arthropods for rapid
biodiversity assessment and biomonitoring, Yu D.W. et al., in review
4. Major NGS platforms applicable in environmental barcoding
NGS platform                         Read length      Data/run (GB)   Run time   Requirement of library construction
454 platform (GS FLX Titanium XL+)   ~400bp           0.7             23 hr.     Yes
Illumina platform (HiSeq 2000)       150bp PE reads   600             14 d.      Yes
Illumina platform (MiSeq)            150bp PE reads   2               27 hr.     Yes
Ion Torrent                          200bp            ~1              3.5 hr.    No
Illumina HiSeq:
• higher throughput
• lower cost per bp
• increasing read length
• variety of bioinformatics tools available from genomic pipelines
5. Sequencing capacity at BGI
• 28 Illumina GAIIx
• 137 Illumina HiSeq 2000
• 25 Life Tech SOLiD 4
• 16 ABI 3730XL
• 110 MegaBACEs
• 2 Illumina iScan
• 1 Roche 454
• 1 Ion Torrent
• 1 Illumina MiSeq
Data production:
• 100 Gb / day (2009)
• >5 Tb / day (end of 2010)
• >1500X human genome / day
6. What I am NOT going to talk about:
• Primer optimization
• Systematic comparisons of NGS platforms
• Quantitative diversity analysis
What I AM going to talk about:
• Can Illumina NGS be used in diversity analysis?
7. Can Illumina NGS be used in diversity analysis?
Sequencing error rate
Read-length
8. Sequencing error rate
No indel issue in homopolymers
Sequencing quality keeps increasing
Recent improvement in sequencing quality using Illumina’s V3 chemistry
(even at 100 bp, only about 10% of the base calls have an error rate >1%)
Rare nucleotide errors can be easily corrected by:
• increasing sequencing depth
• paired-end (PE) sequencing
• setting stringent matching criteria in the overlapping fragment by allowing
only >99% identity
[Diagram: two 150bp PE reads overlapping within a 250nt insert]
PE sequencing enables forming sequence contigs
9. Read length
[Diagram: two 150bp PE reads overlapping within a 250nt insert]
Read length keeps increasing
150PE enables contig read of 250bp
Shotgun reads can be further assembled into longer fragments (“shotgun”
assembly strategy used in genome sequencing projects)
Option of scaffold assembly
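How a 150PE pair becomes a 250bp contig can be sketched in a few lines. With two 150bp reads over a 250nt insert, the reads overlap by 150 + 150 - 250 = 50bp, and the pair is accepted only if that overlap agrees at more than 99% identity. This is a minimal illustration, not the pipeline used in the talk: it assumes the insert size is known exactly and that read 2 comes from the opposite strand, and the names `revcomp` and `merge_pair` are hypothetical.

```python
# Illustrative sketch only; function names and the fixed insert size are assumptions.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, insert_size, min_identity=0.99):
    """Merge a paired-end read pair into one contig of length insert_size.

    read1 is the forward read; read2 is read from the opposite strand.
    Returns None if the reads do not overlap, or if the overlap identity
    does not exceed min_identity.
    """
    r2 = revcomp(read2)
    overlap = len(read1) + len(r2) - insert_size
    if overlap <= 0 or overlap > min(len(read1), len(r2)):
        return None
    tail = read1[-overlap:]   # end of read 1
    head = r2[:overlap]       # start of read 2 (forward orientation)
    matches = sum(a == b for a, b in zip(tail, head))
    if matches / overlap <= min_identity:
        return None
    return read1 + r2[overlap:]
```

With a 50bp overlap, a >99% threshold means a single mismatch (98% identity) already rejects the pair.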
10. Illumina environmental barcoding
Illumina e-barcoding
PCR based:
  Lib1 (658bp, 150PE): full length COI barcode, PE sequencing
  Lib2 (200bp, 150PE): COI amplicons, shotgun PE sequencing
  → Full length COI
PCR free:
  Mitochondrial shotgun PE sequencing
  → Full length COI without PCR bias
11. Approach #1: PCR-based
Sample information
                             Mock           XSBN (provided by Yu et al.)
# Specimens                  23             292
# Haplotypes (2%)            12             230
Soup protocol                DNA extracted individually and mixed for PCR
PCR primers                  LepF1/LepR1    Customized
Sequence length              658 bp         700 bp
Sequencing library details   Full length (658bp) + Shotgun library (~200bp)
Sequencing protocol          150PE
12. Approach #1: PCR-based
Pre-analysis data filtering
Lib 1                            Mock         XSBN
Raw data                         1.67G        4.04G
Filtering adapter                1.60G        1.28G
High quality (Q20)               0.35G        0.50G
# Reads (primer removed)         1,081,997    1,150,477
# Unique reads (abundance > 1)   36,618       45,444
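The last row of the table (unique reads with abundance > 1) corresponds to collapsing identical reads and discarding singletons. A minimal sketch, assuming reads are plain strings that have already been adapter-, quality- and primer-trimmed; the function name `unique_reads` is illustrative and not from the talk:

```python
from collections import Counter


def unique_reads(reads, min_abundance=2):
    """Collapse identical reads and keep those seen at least min_abundance times.

    Returns a list of (sequence, count), most abundant first.
    min_abundance=2 corresponds to "abundance > 1".
    """
    counts = Counter(reads)
    kept = [(seq, n) for seq, n in counts.items() if n >= min_abundance]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Dropping singletons removes many reads that carry a one-off sequencing error, at the cost of also dropping genuinely rare templates.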
13. OTU filtering workflow
        Unique reads      OTU cluster   Alignment   Remove    Compared to
        (abundance > 1)   (98%)                     chimera   reads of Lib 2
Mock    36,618            784           490         119       44
XSBN    45,444            4,189         3,887       403       399
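The OTU clustering step (98%) can be illustrated with a greedy centroid scheme. The talk does not say which clustering program was used, so this is a stand-in technique, not the authors' method. The sketch also assumes equal-length sequences compared position by position, whereas real tools align the sequences first. The names `identity` and `greedy_otus` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(unique, threshold=0.98):
    """Greedy centroid clustering of (sequence, abundance) pairs.

    Sequences are visited from most to least abundant. Each one joins the
    first existing centroid it matches at >= threshold identity; otherwise
    it founds a new OTU.
    """
    otus = []
    for seq, n in sorted(unique, key=lambda item: (-item[1], item[0])):
        for otu in otus:
            if identity(seq, otu["centroid"]) >= threshold:
                otu["members"].append(seq)
                otu["size"] += n
                break
        else:
            # No centroid was close enough: this sequence starts a new OTU.
            otus.append({"centroid": seq, "members": [seq], "size": n})
    return otus
```

Visiting abundant sequences first makes the most common haplotype the centroid, so low-abundance error variants tend to be absorbed into it rather than founding their own OTUs.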
14. Sanger Reference
Results: Sanger reference vs. NGS OTUs (Blast at 100% identity)
        Sanger reference only   Shared   NGS OTUs only   Primers
Mock    4                       8        36              LepF1/R1
XSBN    32                      198      197             Customized primers
15. Sanger Reference
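Mock: Sanger reference vs. NGS OTUs (4 / 8 / 36)
4 Sanger references missing from NGS OTUs: false negatives, not found in raw
data (likely due to primer failure)
8 shared
36 NGS OTUs without a Sanger reference: “false positives”?
• 31 can be found in our total sample, from which our mock samples were assembled
• 5 likely to be PCR errors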
16. Sanger Reference
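XSBN: Sanger reference vs. NGS OTUs (32 / 198 / 197)
32 Sanger references missing from NGS OTUs:
• 17 not found in raw data (primer failure)
• 15 were lost in data filtering
198 shared
197 NGS OTUs without a Sanger reference: cross-sample contamination?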
17. Sanger Reference
Sanger reference vs. NGS OTUs, before and after removal of sequences with abundance <10
         Sanger reference only   Shared   NGS OTUs only
Before   32                      198      197
After    49                      181      84
Significantly fewer false positives
Slight drop of true positives
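The trade-off on this slide can be expressed as recall and precision, treating the Sanger references as the truth set. The sketch assumes the three numbers on the slide are the Sanger-only, shared and NGS-only regions of the diagram (the first two sum to the 230 XSBN haplotypes in both cases, which supports that reading). The function name is illustrative.

```python
def recall_precision(sanger_only, shared, ngs_only):
    """Recall = shared / all Sanger references; precision = shared / all NGS OTUs."""
    recall = shared / (sanger_only + shared)
    precision = shared / (shared + ngs_only)
    return recall, precision


before = recall_precision(32, 198, 197)  # all OTUs
after = recall_precision(49, 181, 84)    # sequences with abundance < 10 removed
print("before: recall %.3f, precision %.3f" % before)
print("after:  recall %.3f, precision %.3f" % after)
```

Under that reading, recall falls from about 0.86 to 0.79 while precision rises from about 0.50 to 0.68, which is the "slight drop of true positives" against markedly fewer false positives.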
18. Approach #1: PCR-based
What’s next?
Illumina
e-barcoding
Obtaining full-length barcodes via shotgun read assembly
(new program in development – “SOAPbarcode”)
New algorithm to filter out false positive OTUs
19. Approach #2: PCR-free method
Individual barcoding
Total MT isolation & DNA extraction
Shotgun sequencing
Reference based method | Reference independent method
20. Building reference library: individual barcoding
1. 89 individuals;
2. 84 reference barcodes;
3. 39 OTUs (2%);
Taxon group # OTUs
Lepidoptera 25
Diptera 7
Hemiptera 4
Hymenoptera 2
Psocoptera 1
Total 39
21. Total MT isolation & DNA extraction
Sample mixture → Total MT isolation → MT DNA extraction
22. Shotgun sequencing
Insert size: 200bp;
Read length: 100bp PE;
Percentage of base pairs:
Q20 (Sequencing error rate < 1%)     96.2%
Q30 (Sequencing error rate < 0.1%)   92.9%
GC content                           38.0%
23. Pre-analysis
Data filtering:
1. Adaptor contamination removal;
2. Quality control:
in each read, only allowing <10bp with seq. error rate >1%
Raw data 2.45G
After filtering 2.20G
Ratio of high quality reads 89.91%
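The quality-control rule (fewer than 10 bases per read with an error rate above 1%) maps directly onto Phred scores, since an error probability of 10^(-Q/10) puts the 1% boundary at Q20. A minimal sketch, assuming per-base qualities are already decoded to integers; the function names and the `min_q` parameter are illustrative:

```python
def error_probability(q):
    """Phred quality score to base-calling error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_qc(quals, min_q=20, max_low_quality=10):
    """True if fewer than max_low_quality bases fall below min_q.

    A base below Q20 has an estimated error rate above 1%.
    """
    low = sum(1 for q in quals if q < min_q)
    return low < max_low_quality
```

For example, a 100bp read with nine Q10 bases passes, while one with ten Q10 bases is discarded.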
24. Approach #2: PCR-free method
Method 1: Reference based
Blast reads to reference barcodes,
confident identification is made only when:
1. Best BLAST hit >98% identity;
2. Reference coverage > 90%;
Taxon groups   # OTUs
Lepidoptera    20
Diptera        2
Hemiptera      3
Psocoptera     1
Total          26
Not found      13
[Diagram: Reference 1, coverage 100%: correct mapping; Reference 2, coverage 30%: incorrect mapping]
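The two criteria (best BLAST hit >98% identity, reference coverage >90%) can be combined as sketched below. The input layout is an assumption: one tuple per read holding its best hit as (reference id, percent identity, start on reference, end on reference), which would have to be parsed from BLAST output beforehand. The function names are hypothetical.

```python
from collections import defaultdict


def covered_fraction(intervals, ref_length):
    """Fraction of a reference covered by 1-based inclusive intervals."""
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / ref_length


def detected_references(best_hits, ref_lengths, min_identity=98.0, min_coverage=0.90):
    """Return reference ids whose confident hits cover enough of the barcode.

    best_hits: iterable of (ref_id, percent_identity, ref_start, ref_end),
    one entry per read (its best BLAST hit).
    """
    intervals = defaultdict(list)
    for ref_id, pident, start, end in best_hits:
        if pident > min_identity:
            # Coordinates may be reversed for minus-strand hits.
            intervals[ref_id].append((min(start, end), max(start, end)))
    return {
        ref_id
        for ref_id, spans in intervals.items()
        if covered_fraction(spans, ref_lengths[ref_id]) > min_coverage
    }
```

The coverage test is what separates the two cases in the diagram: reads from a related taxon can match a conserved stretch of a reference at high identity, but they rarely tile more than 90% of it.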
25. Potential sources of failure in detecting taxa
Taxon specific
or
Bio-mass
(size & number)
26. Failures in taxon detection
Taxon bias?
Taxon groups   # Total OTUs   # OTUs missing (undetected)
Lepidoptera    25             5
Diptera        7              5
Hymenoptera    2              2
Hemiptera      4              1
Psocoptera     1              0
Total          39             13
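The per-group detection rate implied by the table is simple arithmetic, shown here with the slide's counts. With totals this small (one to 25 OTUs per group) the rates are only indicative, and the function name is illustrative.

```python
def detection_rates(total, missing):
    """Fraction of reference OTUs detected per taxon group."""
    return {taxon: (total[taxon] - missing[taxon]) / total[taxon] for taxon in total}


total = {"Lepidoptera": 25, "Diptera": 7, "Hymenoptera": 2, "Hemiptera": 4, "Psocoptera": 1}
missing = {"Lepidoptera": 5, "Diptera": 5, "Hymenoptera": 2, "Hemiptera": 1, "Psocoptera": 0}

rates = detection_rates(total, missing)
for taxon, rate in rates.items():
    print(f"{taxon}: {rate:.2f}")
```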
27. Failures in taxon detection
OR bio-mass (body size, # individuals)?
Readily detected: average length > 5mm
Missing: average length < 5mm
28. Approach #2: PCR-free method
Method 2: Reference independent
(Will we be able to identify diversity without reference MT genomes
for the targeted species?)
Workflow:
1. Assembly of COI gene using genome
assembly program (SOAPdenovo);
2. Annotation using ~240 MT genomes
downloaded from GenBank;
29. PCR-Free reference-independent: results
23/31 falling in standard COI barcode
region (mostly >600 bp);
1 of 23 is not in our reference barcodes;
(Insecta; Lepidoptera; Pyralidae);
Multiple genes obtained simultaneously;
1 nearly complete mitochondrial genome (~15k bp);
3 fragments >6000 bp;
30. Reference independent
23/31 falling in standard COI barcode region (mostly >600 bp);
1 of 23 was not present in our reference barcodes
(Insecta; Lepidoptera; Pyralidae);
Number of individuals we collected: 89 individuals
5 individuals failed in Sanger sequencing
Barcode references: 39 OTUs (84 individuals)
Reference based: 26 OTUs
Reference independent: 23 OTUs
3 OTUs not detected in reference independent method because:
(1) sequencing depth is too low (<10X) to allow for reliable assembly
(2) relatively small body-size
31. PCR-free method
Multiple MT genes obtained simultaneously
Gene Number
ATP6 29
ATP8 4
COX1 31
COX2 33
COX3 31
CYTB 31
ND1 35
ND2 34
ND3 24
ND4 30
ND4L 16
ND5 30
ND6 24
32. PCR-free method
1 nearly complete mitochondrial genome (~15k bp);
3 fragments longer than 6k bp;
Barcode region
33. Approach #2: PCR-free method
What’s next?
Currently:
MT DNA 5-10% after isolation;
Non-target DNA affects MT assembly (e.g.,
bacterial & genomic DNA);
Taxonomic/biomass bias
Potential solutions:
1. Wet-lab protocol optimization
Pre-sorting insects by body-size
Alternative MT isolation methods
2. Increase sequencing depth
34. Conclusions
Illumina HiSeq delivers performance comparable to
other NGS platforms in analyzing bulk insect
samples, with potential advantages in achieving
higher sensitivity at lower cost;
Deep sequencing capacity enables a novel PCR-free
approach, which may eventually solve biases
caused by DNA amplification;
It shares issues with other NGS platforms
(non-quantitative, inflation of OTUs, etc.);
Methodology optimization is much needed in
many details of the pipeline;
Collaborative and synergistic efforts made by the
community would greatly advance the progress.
35. Acknowledgements
Funder:
Collaborators: Douglas W. Yu
Kunming Institute of Zoology, Chinese Academy of Sciences
Mehrdad Hajibabaei, Shadi Shokralla
University of Guelph
Owain Edwards
CSIRO Ecosystem Sciences
LU Jianliang
WU Qiong
AN Sainan
ZHOU Yizhuang
ZHAO Jing
36. Thanks for your
attention!