1. QIIME: Quantitative Insights Into Microbial Ecology (part 1)
Thomas Jeffries
Federico M. Lauro
Grazia Marina Quero
Tiziano Minuzzo
The Omics Analysis Sydney Tutorial
Australian Museum
23rd-24th February 2015
2. QIIME
• Open source software package for taxonomic analysis of 16S rRNA sequences
• Developed at the University of Colorado Boulder & Northern Arizona University
• www.qiime.org (great resource)
• Good community support
• Can google most problems
• Multi-platform
• Widely used
Caporaso
Knight
4. Data formats
• 454:
DNA sequences (FASTA, .fna)
Quality (.qual)
Mapping file (.txt)
• Illumina
Sequences and quality in same file (.fastq)
Also supports paired end
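To make the "sequences and quality in same file" point concrete, here is a minimal Python sketch of reading FASTQ records. It is not QIIME code: the function name parse_fastq and the example read are made up for illustration, and it assumes 4-line records with Phred+33 quality encoding.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines.

    Assumes simple 4-line records and Phred+33 quality encoding.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # Each quality character encodes one Phred score (ASCII code minus 33)
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Hypothetical single-read example
example = ["@read1", "ACGT", "+", "II5!"]
records = list(parse_fastq(example))
print(records)
```

The 454 formats carry the same information, just split across the .fna (sequences) and .qual (scores) files.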
5. Getting into QIIME
• Command line interface
• Some very basic commands needed for QIIME:
example:
/folder$ programme.py -i file_in -o file_out
ls :list files in working directory
cd : changes directory
cd .. : goes back to parent directory
‘tab’ key: magically fills out file names
mkdir : makes a directory
pwd : tells you where you are
6. QIIME tutorial and example data
• Many tutorials @ http://qiime.org/tutorials/index.html
• Good place to start: http://qiime.org/tutorials/tutorial.html
• Great Microbial Ecology course (includes QIIME): http://edamame-course.org/
• A few of the commands have changed in the new version. The current commands are used in this talk, and I have renamed the files to make them easier to follow
15. 1. Check mapping file format
• Checks that the format of the mapping file is OK
validate_mapping_file.py -m my_mapping_file.txt -o validate_mapping_file_output
• Message when the file is OK: “No errors or warnings were found in mapping file”
16. 1. Check mapping file
Mapping file columns:
• Name (ID) of sample
• Sequencing barcode
• Primer
• Sample categories (treatments)
The file must be tab separated!
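A rough Python sketch of the kind of check involved is given here. It is not what validate_mapping_file.py actually does internally; it only assumes the QIIME 1 header convention (#SampleID, BarcodeSequence and LinkerPrimerSequence first, Description last) and tab-separated fields. The function name and the sample rows are hypothetical.

```python
REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
REQUIRED_LAST = "Description"


def check_mapping_text(text):
    """Return a list of problems found in mapping-file text (empty list = OK)."""
    rows = [ln for ln in text.splitlines() if ln.strip()]
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0].split("\t")
    if header[:3] != REQUIRED_FIRST:
        problems.append("first three columns must be " + ", ".join(REQUIRED_FIRST))
    if header[-1] != REQUIRED_LAST:
        problems.append("last column must be " + REQUIRED_LAST)
    for n, row in enumerate(rows[1:], start=2):
        if row.startswith("#"):
            continue  # comment rows are skipped
        if len(row.split("\t")) != len(header):
            problems.append(f"row {n}: expected {len(header)} tab-separated fields")
    return problems


# Hypothetical mapping file with one sample
good = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n"
    "S1\tAACC\tGTGCCAGC\tcontrol\tfirst sample\n"
)
# Same file with a space typed instead of a tab
bad = good.replace("S1\tAACC", "S1 AACC")

print(check_mapping_text(good))
print(check_mapping_text(bad))
```

Spaces in place of tabs are the most common mistake, which is why the second example fails.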
17. Hands on – validate your mapping file
validate_mapping_file.py -o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt
18. 2. De-multiplex - 454
• Using sample-specific barcodes, assign each sequence to a sample (renames sequences)
• Performs some QC:
Removes sequences shorter than 200 bp
Removes sequences with a mean quality score < 25
Removes sequences with > 6 ambiguous bases or homopolymer runs longer than 6 bases
split_libraries.py -m my_mapping_file.txt -f my_sequence_file.fna -q my_quality_file.qual -o split_library_output
• Produces seqs.fna
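The QC thresholds above can be illustrated with a short Python sketch. This is only an approximation of the idea, not the split_libraries.py implementation: the function name passes_qc is made up, and treating the quality cutoff as a mean score over the read is an assumption.

```python
import re


def passes_qc(seq, quals, min_len=200, min_mean_qual=25,
              max_ambig=6, max_homopolymer=6):
    """Return True if a read passes the four QC thresholds from the slide."""
    if not seq or len(seq) < min_len:
        return False
    # Assumption: the quality cutoff is applied to the mean score of the read
    if sum(quals) / len(quals) < min_mean_qual:
        return False
    # Anything that is not A, C, G or T counts as an ambiguous base
    if sum(1 for base in seq.upper() if base not in "ACGT") > max_ambig:
        return False
    # Longest run of one repeated character (homopolymer)
    longest_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return longest_run <= max_homopolymer


good_read = "ACGT" * 50                 # 200 bp, no long runs
print(passes_qc(good_read, [30] * 200))                  # passes
print(passes_qc("ACGT" * 10, [30] * 40))                 # too short
print(passes_qc(good_read, [20] * 200))                  # low quality
print(passes_qc("NACGT" * 40, [30] * 200))               # too many ambiguous bases
print(passes_qc("A" * 7 + good_read, [30] * 207))        # homopolymer too long
```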
19. 2. De-multiplex - Illumina (Step 1)
• If the samples contain paired-end reads, you first need to join them and update the barcodes using:
join_paired_ends.py -f my_forw_reads.fastq -r my_rev_reads.fastq -b my_barcodes.fastq -o my_joined.fastq
20. 2. De-multiplex - Illumina (Step 2)
• Then you can proceed to the split libraries step. If the sequences are NOT paired-end, go directly to split_libraries_fastq.py. This step also performs QC on the Illumina reads:
split_libraries_fastq.py -m my_mapping_file.txt -i my_sequence_file.fastq -b my_barcodes.fastq -o split_library_output
• Data from multiple lanes can be processed together by separating inputs with a comma (,)
• Produces seqs.fna
22. 3. OTU picking strategies
• De novo OTU picking: clustering of sequences at 97% similarity
Reads must overlap
No reference database necessary
Computationally expensive
• Closed-reference
Reads do not need to overlap
Needs a reference database
Discards sequences with no match to the reference, which removes erroneous reads but also novel taxa
• Open-reference
Reads must overlap
Reads are clustered against the reference, and non-matching reads are then clustered de novo
24. 3. Pick OTUs
Note: the following steps can be automated with a single workflow script (this is what we are doing):
pick_de_novo_otus.py -i seqs.fna -o otus
To pick OTUs as a separate step:
pick_otus.py -i seqs.fna -o picked_otus_default
• Clusters your sequences at 97% similarity (you can change this if you wish) and produces ‘seqs_otus.txt’, which maps each sequence to a cluster
• Uses the UCLUST algorithm (Edgar, 2010, Bioinformatics)
25. 3. Pick OTUs
Generate OTUs by clustering reads based on similarity (default is 97%):
• Sort reads according to size (long -> short)
• Cluster the sorted reads into OTUs (figure: reads grouped into OTU1 to OTU5)
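The sort-then-cluster idea can be sketched in a few lines of Python. This is a toy greedy clustering, not UCLUST: identity is computed position by position without any alignment (a strong simplification), and the function names and the reads are invented for the example.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first centroid it matches, longest reads first."""
    centroids = []   # (otu_id, centroid sequence), in order of creation
    clusters = {}    # otu_id -> list of read ids
    by_length = sorted(reads.items(), key=lambda kv: len(kv[1]), reverse=True)
    for read_id, seq in by_length:
        for otu_id, centroid in centroids:
            if identity(seq, centroid) >= threshold:
                clusters[otu_id].append(read_id)
                break
        else:
            # No centroid is similar enough: this read seeds a new OTU
            otu_id = f"OTU{len(centroids) + 1}"
            centroids.append((otu_id, seq))
            clusters[otu_id] = [read_id]
    return clusters


def mutate(seq, positions):
    """Helper for the example: change the base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


base = "ACGT" * 25                                   # 100 bp
reads = {
    "r2": mutate(base, [0, 50]),                     # 98% identical to base
    "r3": mutate(base, range(0, 100, 10)),           # 90% identical to base
    "r1": base + "AC",                               # longest read, seeds OTU1
}
clusters = greedy_cluster(reads)
print(clusters)
```

Lowering the threshold to 0.90 would merge all three reads into one OTU, which is the effect of changing the similarity cutoff.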
26. 4. Pick representative sequences
• We want a representative sequence for each OTU: annotating every sequence is time consuming, and they are already clustered
• This makes a file that has 1 sequence for each OTU (rep_set1.fna). By default pick_rep_set.py takes the first sequence in each OTU (the cluster seed when OTUs were picked with UCLUST); -m most_abundant takes the most abundant sequence instead
pick_rep_set.py -i seqs_otus.txt -f seqs.fna -o rep_set1.fna
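As an illustration of the "most abundant sequence per OTU" idea, here is a small Python sketch. It assumes the OTU map is tab-delimited with the OTU ID first and the member sequence IDs after it; the function name, sequence IDs and sequences are hypothetical, and this is not the pick_rep_set.py code.

```python
from collections import Counter


def pick_most_abundant(otu_map_lines, seqs):
    """Return {otu_id: representative sequence} using the most common sequence."""
    rep_set = {}
    for line in otu_map_lines:
        otu_id, *members = line.rstrip("\n").split("\t")
        # Count identical sequence strings among the members of this OTU
        counts = Counter(seqs[m] for m in members)
        rep_set[otu_id] = counts.most_common(1)[0][0]
    return rep_set


# Hypothetical demultiplexed reads (IDs follow the SampleID_number pattern)
seqs = {"S1_1": "ACGT", "S1_2": "ACGT", "S2_1": "ACGA", "S2_2": "TTTT"}
otu_map = ["0\tS1_1\tS1_2\tS2_1", "1\tS2_2"]

rep_set = pick_most_abundant(otu_map, seqs)
print(rep_set)
```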
27. 5. Annotate (assign taxonomy to each OTU)
• Compare each representative sequence to a database using one of several algorithms:
• UCLUST, BLAST, RDP Classifier, etc.
• New default: UCLUST against the Greengenes database
assign_taxonomy.py -i rep_set1.fna
(output in directory: uclust_assigned_taxonomy)
• BLAST example (reference sequences and taxonomy downloaded from database):
assign_taxonomy.py -i rep_set1.fna -r ref_seq_set.fna -t id_to_taxonomy.txt -m blast
28. 5. Annotate
• Some useful databases that are compatible with QIIME:
http://greengenes.secondgenome.com
General-purpose 16S rRNA database and the default in QIIME
http://unite.ut.ee
Fungal Internal Transcribed Spacer (ITS)
Good for soil fungi
http://www.arb-silva.de
Contains both 16S and 18S rRNA (Eukaryotes)
Good representation of marine taxa
30. Workflow summary (figure): mixed amplicons from Species A, B and C are split into samples (Sample 1 to 3) using barcodes; clustering is used to choose OTUs (OTU 1 to 3); a representative sequence is picked for each OTU and assigned taxonomy against a reference database
31. 6. Putting it all together: making an OTU table
• Need to combine the OTU identity with the abundance information in the clusters and link back to each sample so we can do ECOLOGY
make_otu_table.py -i seqs_otus.txt -t rep_set1_tax_assignments.txt -o otu_table.biom
• The table is in .biom format:
• http://biom-format.org/documentation/biom_format.html
• Convert to text file:
biom convert -i otu_table.biom -o otu_table.txt --table-type "otu table" --header-key taxonomy -b
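What "linking abundance back to each sample" means can be sketched in Python as follows. This is a simplified illustration, not make_otu_table.py: it assumes a tab-delimited OTU map and sequence IDs of the form SampleID_number (as written during demultiplexing), and the function name and data are hypothetical.

```python
from collections import defaultdict


def make_otu_counts(otu_map_lines):
    """Return {otu_id: {sample_id: number of reads}} from OTU map lines."""
    table = defaultdict(lambda: defaultdict(int))
    for line in otu_map_lines:
        otu_id, *members = line.rstrip("\n").split("\t")
        for seq_id in members:
            # Sample ID is everything before the final underscore
            sample_id = seq_id.rsplit("_", 1)[0]
            table[otu_id][sample_id] += 1
    return {otu: dict(samples) for otu, samples in table.items()}


otu_map = ["0\tS1_1\tS1_2\tS2_1", "1\tS2_2"]
otu_counts = make_otu_counts(otu_map)
print(otu_counts)
```

The real OTU table also carries the taxonomy assigned to each OTU as metadata, which is what the -t option adds.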
32. Closed-reference OTU picking
pick_closed_reference_otus.py -i seqs.fna -r reference.fna -o otus_w_tax/ -t taxa_map.txt
• The reference is a database, e.g. the Greengenes unaligned 97% OTUs and the matching taxa map (same files as for BLAST)
• Output has all of your sequences matched against Greengenes, plus an OTU table
• So this picks OTUs and assigns taxonomy in 1 step, but you lose non-matching sequences. Do we care? For taxa summaries no, for beta-diversity maybe
• Quick: good for Illumina
33. 7. Aligning sequences
• Back to our representative sequences
• How closely related are the organisms present in the samples, i.e. what is the phylogeny of our community and how does this shift between samples?
• Default: PyNAST aligns sequences to a reference set of pre-aligned sequences (e.g. Greengenes ALIGNED), which is more computationally efficient than de novo alignment
• Can also select other methods, e.g. MUSCLE
align_seqs.py -i rep_set1.fna -o pynast_aligned/
34. 7. Aligning sequences
• Not all regions of the rRNA gene are informative or useful for phylogenetic inference
• Gaps: short length sequence vs full length rRNA gene
filter_alignment.py -i rep_set1_aligned.fasta -o filtered_alignment/
• Optional lanemask template that defines informative regions for some databases
filter_alignment.py -i seqs_rep_set_aligned.fasta -m lanemask_in_1s_and_0s -o filtered_alignment/
• If you are going to use this alignment for making a phylogenetic tree, this step is essential
35. A note on chimera removal
• Chimeras are sequences formed from the DNA of 2 or more organisms (an artifact of PCR amplification)
• QIIME uses ChimeraSlayer to detect chimeric sequences using your alignment and a reference database
identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a reference_set1_aligned.fasta -o chimeric_seqs.txt
• You should then remove these OTUs from your OTU table and alignment before proceeding with tree building and visualization of results:
• Use -e chimeric_seqs.txt when making the OTU table, and filter_fasta.py for the alignment
36. 8. Make a phylogenetic tree
make_phylogeny.py -i rep_set1_aligned_pfiltered.fasta -o rep_phylo.tre
• Builds a tree from the alignment using FastTree
• Outputs a tree in newick format (.tre), which can be opened with software such as FigTree or used to calculate phylogenetic metrics
• Chimeras should also be filtered from the tree
37. We now have 2 final outputs:
• OTU Table
1. Taxonomic composition
2. α-diversity (e.g. ‘species’ richness)
3. β-diversity (e.g. abundance similarity between samples)
• Phylogenetic tree
1. Phylogenetic β-diversity
QIIME has powerful visualization and statistical tools
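To show what the two non-phylogenetic diversity outputs measure, here is a small Python sketch using observed OTU richness as an α-diversity example and Bray-Curtis dissimilarity as a β-diversity example. The choice of these two metrics and the counts are illustrative assumptions; QIIME offers many other metrics.

```python
def observed_otus(counts):
    """Alpha diversity: number of OTUs with at least one read in a sample."""
    return sum(1 for c in counts if c > 0)


def bray_curtis(a, b):
    """Beta diversity: 0 = identical abundances, 1 = no shared OTUs."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator


# Hypothetical OTU counts for two samples (same OTU order, same depth)
sample_a = [10, 0, 5]
sample_b = [5, 5, 5]

print(observed_otus(sample_a), observed_otus(sample_b))
print(bray_curtis(sample_a, sample_b))
```

Phylogenetic β-diversity (e.g. UniFrac) additionally needs the tree, which is why the tree is listed as a separate output.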
38. Hands on – reformatting outputs
We have automated (piped) most of the steps I have talked about
We need to convert the OTU table to a text file and filter the alignment
biom convert -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.biom -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/otu_table.txt --table-type "otu table" --header-key taxonomy -b
filter_alignment.py -i moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/seqs_rep_set_aligned.fasta -o moving_pictures_tutorial-1.8.0/illumina/otus_denovo/pynast_aligned_seqs/filtered_alignment
39. 9. Merging the mapping files
• We started with 6 lanes of Illumina but now we have a single OTU table. The merged mapping file will have duplicated barcodes, but these are no longer used (the data are already demultiplexed):
merge_mapping_files.py -o combined_mapping_file.txt -m mapfile1.txt,mapfile2.txt…,mapfilexxx.txt
41. Visualizing diversity 1 – community composition
biom summarize-table -i otu_table.biom -o otu_table_summary.txt
Counts/Sample detail:
L3S237: 138.0
L3S235: 187.0
L3S372: 205.0
L3S373: 228.0
L3S367: 259.0
L3S370: 273.0
L3S368: 274.0
L3S369: 284.0
• Summary of OTU table: we want to standardize the number of sequences (sampling depth) to allow accurate comparison, e.g. to 138 sequences, the smallest sample above
single_rarefaction.py -i otu_table.biom -o otu_table_even146.biom -d 138
alpha_rarefaction.py -i otu_table.biom -m combined_mapping_file.txt -o rarefaction/ -t rep_set.tre
42. Visualizing diversity 1 – community composition
• How ‘deep’ do we need to go to adequately sample the community? = Rarefaction analysis
• The number of species increases until a point where producing more sequence does not significantly increase the number of observed species
• Repeated subsampling of your data at different intervals, plotting subsample size against the number of observed species. If the curves flatten, then you have sequenced at sufficient depth.
• Rarefaction is a trade-off between ‘keeping’ samples below a given sequence cut-off and losing diversity
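The subsampling behind rarefaction can be sketched in Python as below. This is a minimal illustration, not the single_rarefaction.py implementation: reads are drawn at random without replacement down to a fixed depth, and a sample with fewer reads than the depth is dropped. The function name, the seed argument and the counts are assumptions made for the example.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample {otu_id: count} to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads (sample dropped).
    """
    # Expand the counts into one entry per read
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)   # fixed seed so the example is repeatable
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


# Hypothetical sample with 138 reads in total
sample = {"OTU1": 100, "OTU2": 30, "OTU3": 8}

shallow = rarefy(sample, 50)
print(shallow, "observed OTUs:", len(shallow))
print(rarefy(sample, 200))   # deeper than the sample: dropped
```

Repeating this at increasing depths and plotting the number of observed OTUs against depth gives the rarefaction curve.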
45. Software references:
QIIME: Caporaso et al. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7(5): 335-336.
UCLUST: Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19): 2460-2461.
BLAST: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3): 403-410.
Greengenes: McDonald et al. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6(3): 610-618.
RDP Classifier: Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16): 5261-5267.
PyNAST: Caporaso JG et al. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266-267.
ChimeraSlayer: Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research 21: 494-504.
MUSCLE: Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5): 1792-1797.
FastTree: Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5(3): e9490.
UniFrac: Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12): 8228-8235.
Emperor: Vazquez-Baeza Y, Pirrung M, Gonzalez A, Knight R. 2013. Emperor: A tool for visualizing high-throughput microbial community data. Gigascience 2(1): 16.