Introduction to Bioinformatics

Part 0: So You Want To Be a Computational Biologist?
Leighton Pritchard and Peter Cock

Table of Contents
Introduction
Recording Your Work
Conclusion

What is this “bioinformatics” thing,
anyway?
• Bioinformatics: biology using computational and
mathematical tools
• A discipline within biology
• Loman & Watson (2013) “So you want to be a computational
biologist?” http://dx.doi.org/10.1038/nbt.2740
• Welch et al. (2014) “Bioinformatics Curriculum Guidelines:
Toward a Definition of Core Competencies”
http://dx.doi.org/10.1371/journal.pcbi.1003496
• Watson (2014) “The only core competency you need”
http://bit.ly/1fS4iDJ (blog)

Some uncomfortable truths
• This one-day course will not make you a bioinformatician

• But practice will. . .
• The best way to learn is to do (“I don’t know how to do this yet, but I will find out.”)
• http://bit.ly/Rq0D61 (“Bioinformatics is a way of life”)
• Most bioinformatics is problem-solving
• We will introduce some useful tools and concepts

What it takes to be a bioinformatician
• Patience (problem-solving)
• Suspicion (statistics)
• Biological knowledge
• Social skills (no-one knows everything: ask!)
• Lots of practice
• Self-confidence (challenge results and dogma)
• Core domain skills: biology, computer science, statistics
• Deliver results (qualified, honest)
• Watson (2014) “What it takes to be a bioinformatician”
http://bit.ly/1jDuQsO (blog)

More general advice?
• Ask us (we do this a lot)
• BioStars (https://www.biostars.org)
• SeqAnswers (http://seqanswers.com/)
• PLoS Comp Biol collections (http://www.ploscollections.org/static/pcbiCollections)

Why Do It?
• Doing bioinformatics is doing science: keep a lab book!
• You will not remember which files you used, analysis details, etc. after a week/month/six months/a year/three years
• Noble (2009)
http://dx.doi.org/10.1371/journal.pcbi.1000424
• Baggerly & Coombes (2009)
http://arxiv.org/pdf/1010.1092.pdf

How To Do It? I
• There is no one correct way, but. . .
• Think about data/docs/project structure before you start

How To Do It? II
• Use plain text where possible
• Use version control
• Keep backups
• Different tools suit different purposes: code vs. data vs. analysis vs. ...
• Find a way that works for you.

How To Do It? III
• Reproducibility is key!
• Scripts and pipelines are better for this than notes of what
you did
• Also better for version control, and reuse
• Avoid unnecessary duplication
• Someone else may have solved your problem
• One (backed up) read-only copy of raw data, keep analyses
separate

Plain Text Files
• README.txt/README.md in each directory/folder
• Plain text is always human-readable
• Markdown (https://daringfireball.net/projects/markdown/basics)
• RST (http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html)

Galaxy workflows
• Used through a web browser, with a graphical interface
• Reproducible, shareable, documented, reusable analyses
• Wraps standard bioinformatics tools
• Local instance (http://ppserver/galaxy) uses JHI cluster

script
• Writes your terminal activity to a plain text file
• Saves effort copy/pasting and typing commands into a lab
book, as you go
• Easy to use with other tools
• Run man script at your terminal to find out more

MediaWiki
• Useful for shared projects/data
• Automatic version control and attribution
• Many local instances at JHI (ask around)

A language notebook
• e.g. IPython Notebook, Mathematica, MATLAB cells
• Integrates live code and analysis with lab-book

LaTeX
• Powerful, versatile typesetting system (e.g. these slides)
• Similar to markup/markdown
• Pros: great for mathematical/computing work, writing a thesis
• Cons: not easy to pick up

In Conclusion
• Bioinformatics is just biology using computers and
mathematics
• You still need to “do science” in the same way:
• Keep accurate records
• Plan and conduct experiments (with controls)
• Follow the literature
• Professional development

An Introduction to Bioinformatics Tools
Part 1: Golden Rules of Bioinformatics

On Confidence
“Ignorance more frequently begets confidence than does
knowledge: it is those who know little, not those who know much,
who so positively assert. . .”
- Charles Darwin

Table of Contents
Rule 0
Rule 1
Rule 2
Rule 3
Conclusions

Zeroth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
• local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)

First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of
pedantry)
• Who? When? How?
• You need to understand the experiment to analyse it (easier if
you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)

Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
• Software does not distinguish meaningful from meaningless
data
• Software has bugs
• Algorithms have assumptions, conditions, and applicable
domains
• Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
• Test output for robustness to parameter (including data)
choice

Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
• Beware cognitive errors, such as confirmation bias!
• System 1 vs. System 2 ≈ intuition vs. reason
• Think statistically!
• Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
• Always account for multiple tests.
• Avoid “data dredging”: intensive computation is not an
adequate substitute for expertise
• Use test-driven development of analyses and code
• Use examples that pass and fail
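
As a hedged illustration of "always account for multiple tests", the sketch below applies two standard corrections (Bonferroni and Benjamini-Hochberg) to a list of p-values. The function names and the five p-values are invented for this example; they do not come from the course material.

```python
def bonferroni(p_values, alpha=0.05):
    """Return True/False per test: significant after Bonferroni correction."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return True/False per test: significant at false discovery rate alpha."""
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Every test ranked at or below that k is called significant
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            significant[idx] = True
    return significant


# Invented p-values from five hypothetical tests
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
uncorrected = [p <= 0.05 for p in p_values]
print(sum(uncorrected), "significant without correction")                  # 4
print(sum(bonferroni(p_values)), "significant after Bonferroni")           # 1
print(sum(benjamini_hochberg(p_values)), "significant after Benjamini-Hochberg")  # 3
```

Four of the five tests look significant at 0.05 if each is judged alone; fewer survive once the number of tests is taken into account.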

In Conclusion
• Always communicate!
• worst errors are silent
• Don’t trust the data
• formatting/validation/category errors - check!
• suitability for scientific question
• Don’t trust the software
• software is not an authority
• always benchmark, always validate
• Don’t trust yourself
• beware cognitive errors
• think statistically
• biological “stories” can be constructed from nonsense

Tools
Part 2: BLAST

Table of Contents
Introduction
Alignment
BLAST
BLAST Statistics
Using BLAST

Learning Outcomes
• How BLAST searches work
• How the way BLAST searches work affects your results
• Why search parameters matter
• Setting search parameters

Why So Much Detail?
• You’re going to go away and do lots of BLAST searches
• Everyone uses BLAST - not everyone uses it well
• Easier to fix problems if you know how it works
• Understanding what’s going on helps avoid misuse/abuse
• Understanding what’s going on helps use the tool more
effectively
• Not so much detail, really
• like knowing about Tm and ion concentration effects, not
molecular orbitals or thermodynamics (but ask if you’re
interested ;) )

What BLAST Is
• BLAST:
• Basic (it’s actually sophisticated)
• Local Alignment (what it does: local sequence alignment)
• Search Tool (what it does: search against a database)
• The most important software package in bioinformatics?
• Fast, robust, sequence similarity search tool
• Does not necessarily produce optimal alignments
• Not foolproof.

What A BLAST Search Is
• Every BLAST search is an in silico hybridisation experiment
• BLAST search = identification of similar sequences in a given database
• Results depend on:
• query sequence
• BLAST program (including version and BLAST vs BLAST+)
• database
• parameters

Alignment Search Space
Consider two biological sequences to be aligned. . .
• One sequence on the x-axis, the other on the y-axis
• Each point in space is a pairing of two letters
• Ungapped alignments are diagonal lines in the search space, gapped alignments have short ‘breaks’
• There may be one or more “optimal” alignments

Global vs Local Alignment
• Global alignment: sequences are aligned along their entire
lengths
• Local alignment: the best subsequence alignment is found
• Consider an alignment of the same gene from two
distantly-related eukaryotes, where:
• Exons are conserved and small in relation to gene locus size
• Introns are not well-conserved but large in relation to gene
locus size
• Local alignment will align the conserved exon regions
• Global alignment will align the whole (mostly unrelated) locus

Our Goal
• We aim to align the words
• COELACANTH
• PELICAN
• Each identical letter (match) scores +1
• Each different letter (mismatch) scores -1
• Each gap scores -1
• All sequence alignment is maximisation of an alignment score
- a mathematical operation.

Fill the matrix – represents all possible
alignments & scores
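
A minimal sketch of filling that matrix for a global alignment, using the slide's scoring scheme (match +1, mismatch -1, gap -1). The function name is made up for this example, and it returns only the optimal score rather than the alignment itself.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of two strings."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of seq_a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of seq_b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # letter of seq_a against a gap
                score[i][j - 1] + gap,       # letter of seq_b against a gap
            )
    return score[-1][-1]


print(needleman_wunsch_score("COELACANTH", "PELICAN"))  # prints 0
```

One alignment achieving that score pairs COELACANTH with -PELICAN--: five matches, two mismatches and three gaps.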

Algorithms
• Global: Needleman-Wunsch (as in example)
• Local: Smith-Waterman (differs from example)

Algorithms
• Biological information encapsulated only in the scoring
scheme (matches, mismatches, gaps)
• NW/SW are guaranteed to find the optimal match with respect to the scoring system being used
• BUT the optimal alignment is a biological approximation: no scoring scheme encapsulates biological “truth”
• Any pair of sequences can be aligned: finding meaning is up to you
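
To make the local case concrete, here is a hedged sketch of Smith-Waterman scoring with match +1, mismatch -1, gap -1. Unlike global alignment, cell scores are never allowed to fall below zero, and the answer is the best cell anywhere in the matrix. The function name is an assumption for this example.

```python
def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of two strings."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + pair,  # extend along the diagonal
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("COELACANTH", "PELICAN"))  # prints 4
```

The best local alignment here is ELACAN against ELICAN (five matches, one mismatch); the unrelated flanking letters are simply left out.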

BLAST Is A Heuristic
• BLAST does not use Needleman-Wunsch or Smith-Waterman
• BLAST approximates dynamic programming methods
• BLAST is not guaranteed to give a mathematically optimal alignment
• BLAST does not explore the complete search space
• BLAST uses heuristics (loosely-defined rules) to refine High-scoring Segment Pairs (HSPs)
• BLAST reports only “statistically-significant” alignments (dependent on parameters)

Steps in the Algorithm
1. Seeding
2. Extension
3. Evaluation

Word Hits
• A word hit is a short sequence and its neighbourhood
• neighbourhood: words of same length whose aligned score is
greater than or equal to a threshold value T
• Three parameters: scoring matrix, word size W, and T
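
A small sketch of how a word's neighbourhood depends on the scoring scheme and the threshold T. The pair scores below are invented toy values over a four-letter alphabet, not a real substitution matrix such as BLOSUM62, and the function names are assumptions for this example.

```python
from itertools import product

HYDROPHOBIC = set("ILV")  # toy grouping, assumed for illustration only


def toy_score(a, b):
    """Invented pair score: NOT a real substitution matrix."""
    if a == b:
        return 4
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1
    return -2


def neighbourhood(word, alphabet, threshold, score=toy_score):
    """Return all words of len(word) scoring >= threshold against word."""
    hits = []
    for letters in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, letters))
        if total >= threshold:
            hits.append("".join(letters))
    return sorted(hits)


print(neighbourhood("ILK", "IKLV", threshold=12))  # ['ILK']
print(neighbourhood("ILK", "IKLV", threshold=9))   # ['IIK', 'ILK', 'IVK', 'LLK', 'VLK']
```

Lowering T from 12 to 9 grows the neighbourhood from one word to five, which is why a lower threshold produces more seed hits.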

Seeding
• BLAST assumption: significant alignments have words in common
• BLAST finds word (neighbourhood) hits in the database index
• Word hits are used to seed alignments

Seeding Controls Sensitivity
• Word size W controls number of hits (smaller words ⇒ more hits)
• Threshold score T controls number of hits (lower threshold ⇒ more hits)
• Scoring matrix controls which words match

The Two-Hit Algorithm
• BLAST assumption: word hits cluster on the diagonal for significant alignments
• The acceptable distance A between words on the diagonal is a
parameter of your model
• Smaller distances isolate single words, and reduce search space

Extension
• The best-scoring seeds are extended in each direction
• BLAST does not explore the complete search space, so a rule
(heuristic) to stop extension is needed
• Two-stage process:
• Extend, tracking the alignment score and the drop-off score (how far the score has fallen below its best so far)
• When the drop-off score reaches a threshold X, trim the alignment back to the top score

Example
• Consider two sentences (match=+1, mismatch=-1)
• The quick brown fox jumps over the lazy dog.
• The quiet brown cat purrs when she sees him.

Example
• Extend to the right from the seed T
• The quic
• The quie
• 123 4565 <- score
• 000 0001 <- drop-off score

Example
• Extend to drop-off threshold
• The quick brown fox jump
• The quiet brown cat purr
• 123 45654 56789 876 5654 <- score
• 000 00012 10000 123 4345 <- drop-off score

Example
• Trim back from drop-off threshold to get optimal alignment
• The quick brown
• The quiet brown
• 123 45654 56789 <- score
• 000 00012 10000 <- drop-off score
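
The worked example can be reproduced with a short sketch of ungapped extension. It assumes, as the score rows on the slides imply, that aligned spaces are skipped without scoring and that extension stops once the drop-off reaches X = 5. The function name and return format are choices made for this example.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from position 0.

    Stops when the running score falls x_drop below the best score seen,
    then reports the alignment trimmed back to that best score.
    Returns (best_score, trimmed_length).
    """
    score = best = best_end = 0
    for i, (a, b) in enumerate(zip(query, subject)):
        if a == " " and b == " ":
            continue  # aligned spaces are not scored in the slide example
        score += match if a == b else mismatch
        if score > best:
            best, best_end = score, i + 1
        if best - score >= x_drop:
            break  # drop-off threshold reached: stop extending
    return best, best_end


query = "The quick brown fox jumps over the lazy dog."
subject = "The quiet brown cat purrs when she sees him."
best, end = extend_right(query, subject, x_drop=5)
print(best, repr(query[:end]))  # prints: 9 'The quick brown'
```

Extension runs on as far as "jump"/"purr", where the score has dropped from 9 to 4, and the alignment is then trimmed back to "The quick brown".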

Notes on implementation
• X controls termination of alignment extension, but dependent
on:
• substitution matrix
• gap opening and extension parameters

Evaluation
• The principle is easy: use a score threshold S to determine
strong and weak alignments
• S is monotonic with E, so an equivalent threshold can be
calculated
• Score S is independent of database size and search space. E
values are not.
• Alignment consistency of HSPs is also a factor in the report

Log-odds Matrices
• Substitution matrices are your model of evolution
• Substitution matrices are log-odds matrices
• Positive numbers indicate likely substitutions/similarity
• Negative numbers indicate unlikely substitutions/dissimilarity
[Figure: the BLOSUM62 matrix]

Choice of Matrix
• Substitution matrix determines the raw alignment score S
• S is the sum of pairwise scores in an alignment
• BLAST provides, for proteins:
• BLOSUM45 BLOSUM50 BLOSUM62 BLOSUM80 BLOSUM90
• PAM30 PAM70 PAM250
• BLOSUM matrices are derived empirically from multiple sequence alignments; for BLOSUMn, sequences sharing ≥ n% identity are clustered and counted as one
• For nucleotides: ‘matrix’ defined by match/mismatch (reward/penalty) parameters

Definition
• The Karlin-Altschul equation
$E = k m n e^{-\lambda S}$
• Symbols:
• k: minor constant, adjusts for correlation between alignments
• m: number of letters in query sequence
• n: number of letters in the database
• λ: scoring matrix scaling factor
• S: raw alignment score

Interpretation
• The Karlin-Altschul equation
$E = k m n e^{-\lambda S}$
• E is the number of alignments of a similar score expected by
chance when querying a database of the same size and letter
frequency, where the letters in that database are
randomly-ordered
• Small changes in score S can produce large changes in E
• BUT biological sequence databases are not random!
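
A hedged numerical sketch of the Karlin-Altschul equation, to show how sensitive E is to the score S and to database size n. The values of k and λ below are illustrative assumptions; real values depend on the scoring matrix and gap penalties, and BLAST reports them in its output. The query and database lengths are also invented.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = k * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Assumed, illustrative parameter values (not taken from the slides)
K, LAMBDA = 0.041, 0.267

for score in (60, 70, 80):
    e = expect_value(score, query_len=250, db_len=10_000_000, k=K, lam=LAMBDA)
    print(f"S = {score}: E = {e:.2e}")
```

With these assumed parameters, each extra 10 points of raw score reduces E by a factor of exp(2.67), roughly 14-fold, while doubling the database size doubles E for the same alignment.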

Multiple BLAST tools
• BLASTN vs MEGABLAST vs TBLASTX vs ...?
• Korf et al. (2003) “BLAST” is very good for the theory, but its practical examples are dated due to changes introduced with BLAST+

Multiple flavours of BLAST
• NCBI “legacy” BLAST
• Now obsolete and not being updated
• Spawned offshoots including:
• WU-BLAST aka AB-BLAST (commercial)
• MPI-BLAST for use on clusters
• Versions to run on graphics cards
• NCBI BLAST+
• Re-written in 2009 using C++ instead of C
• Many improvements
• Slightly different output
• Different commands used to run it

Multiple ways to run BLAST
• BLAST+ at the command line (today)
• Via a script or programming language
• Via a graphical tool like BioEdit, CLCbio, Blast2GO
• Via the NCBI website
• Via a genome consortium website
• Via a Galaxy web server
• etc
• Offers flexibility but different settings/options/versions

Multiple places to run BLAST
• On the NCBI servers, e.g. via website or tool
• On 3rd party servers, e.g. via websites
• On your own computer
• On our Linux cluster

Core BLAST tools: Query sequences vs
Database
• Nucleotide vs Nucleotide:
• blastn (covering blastn, megablast, dc-megablast)
• Translated nucleotide vs Protein:
• blastx
• Protein vs Translated nucleotide:
• tblastn
• Protein vs Protein:
• blastp, psiblast, phiblast, deltablast
See http://blast.ncbi.nlm.nih.gov/ for a reminder ;)

The BLAST tools have built in help
$ blastp -h
USAGE
  blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-seg SEG_options] [-soft_masking soft_masking]
    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
    ...
    [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
    [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Protein-Protein BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments

Minimal example of BLAST+ at the
command line
$ blastp -query my_input.fasta -db my_database -out my_output.txt
• Replace blastp with the appropriate tool, e.g. blastn
• Replace my_input.fasta with your actual filename
• Replace my_database with your actual database, e.g. nr
• Replace my_output.txt with your desired output filename
• Best to avoid spaces in your folder and filenames!
e.g.
$ blastp -query query.fasta -db dbA -out my_output.txt

Setting the BLAST+ output format
$ blastp -help
USAGE
...

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
     10 = Comma-separated values,
     11 = BLAST archive format (ASN.1)

   ...
   Default = '0'
   ...

Setting the BLAST+ output format
Default is plain text pairwise alignments, for humans:
$ blastp -query query.fasta -db dbA -out my_output.txt
...
XML output can be useful (e.g. for BLAST2GO):
$ blastp -query query.fasta -db dbA -out my_output.xml -outfmt 5
...
Tabular output is easiest to filter, sort, etc:
$ blastp -query query.fasta -db dbA -out my_output.tab -outfmt 6
...
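
As a sketch of why tabular output is easy to filter, the Python below reads -outfmt 6 rows and keeps hits that pass identity and length cut-offs. The column names are those of the default twelve-column tabular format; the two example rows and the cut-off values are invented for illustration.

```python
import csv
import io

# Names of the default 12 columns written by -outfmt 6
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_identity=90.0, min_length=100):
    """Yield hits (as dicts) that pass identity and alignment-length cut-offs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["pident"]) >= min_identity and int(row["length"]) >= min_length:
            yield row


# Two invented rows standing in for the contents of my_output.tab
example = io.StringIO(
    "q1\ts1\t95.00\t200\t10\t0\t1\t200\t5\t204\t1e-100\t350\n"
    "q1\ts2\t92.59\t27\t2\t0\t10\t36\t40\t66\t0.004\t41.0\n"
)
kept = list(filter_hits(example))
print([hit["sseqid"] for hit in kept])  # prints ['s1']
```

For a real file, pass an open file handle instead of the StringIO object, e.g. `with open("my_output.tab") as handle:`.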

Setting the e-value threshold
Check the built in help:
$ blastp -help
USAGE
...
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = '10'
...
Example using 0.00001, i.e. 1 × 10^-5, written in scientific notation as 1e-5:
$ blastp -query query.fasta -db dbA -out my_output.txt -evalue 1e-5
...
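
When BLAST is run "via a script or programming language", the same options are passed as a list of arguments. The sketch below builds such a command in Python using only the flags shown above; the helper names blast_command and run_blast are hypothetical, and the search is not actually launched unless run_blast is called.

```python
import shutil
import subprocess


def blast_command(program, query, database, out, evalue=1e-5, outfmt=6):
    """Build a BLAST+ command line as a list of arguments."""
    return [program, "-query", query, "-db", database, "-out", out,
            "-evalue", str(evalue), "-outfmt", str(outfmt)]


def run_blast(cmd):
    """Run the command; raise an error if the program is missing or the search fails."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


cmd = blast_command("blastp", "query.fasta", "dbA", "my_output.tab")
print(" ".join(cmd))
# run_blast(cmd)  # uncomment once query.fasta and the dbA database exist
```

Recording the exact command in a script, rather than retyping it, also documents the parameters used for each search.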

In Conclusion
• Every BLAST search is an experiment
• Badly-designed searches can give you bad results
• Knowing how BLAST works helps improve search design
• BLAST results still require inspection and interpretation

Tools
Part 3: Workshop

Table of Contents
Introduction
Workshop Data
Gene Prediction
Genome Comparisons
Gene Comparisons
Conclusions

Learning Outcomes
• Workshop example: bacterial genome annotation (because the genomes are small and the data are easy to handle)
• The role of biological insight in a bioinformatics workflow
• Visual interaction with sequence data
• Using alternative tools
• Comparison of tools and outputs
• Online tools for automated function prediction

What You Will Be Doing
Illustrative example of concepts: Functional annotation of a draft
bacterial genome
1. Gene prediction
2. Genome comparisons
3. Gene comparisons

Locate your data
• You are in group A, B, C or D - this decides your chromosome
sequence:
chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta
• Each sequence represents a single stitched, ordered draft
bacterial genome comprising a number of contigs.
• You will use your sequence as the basis of the exercises in the
workshop.
• You also have a GFF file describing the location of assembled contigs:
chrA_contigs.gff, chrB_contigs.gff, chrC_contigs.gff, chrD_contigs.gff

Inspect the data
$ head -n 3 chrA.fasta
>chrA
ttttcttgattgaccttgttcgagtggagtccgccgtgtcactttcgctttggcagcagt
gtcttgcccgtttgcaggatgagttacctgccacagaattcagtatgtggatacgcccgt
$ head -n 3 chrA_contigs.gff
##gff-version 3
chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b
chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018
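
The contig coordinates in the GFF file also tell you where the joins between contigs lie. A minimal sketch, assuming the file is tab-delimited GFF3 with one "contig" feature per line (the function name is made up for this example):

```python
def contig_gaps(gff_lines):
    """Return (gap_start, gap_end, length) for gaps between consecutive contigs."""
    contigs = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[2] == "contig":
            contigs.append((int(fields[3]), int(fields[4])))
    contigs.sort()
    gaps = []
    for (_, prev_end), (next_start, _) in zip(contigs, contigs[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1, next_start - prev_end - 1))
    return gaps


# The two contig lines shown above, written with explicit tabs
example = [
    "##gff-version 3",
    "chrA\tstitching\tcontig\t1\t154993\t.\t.\t.\tID=contig00005_b;Name=contig00005_b",
    "chrA\tstitching\tcontig\t155036\t241491\t.\t.\t.\tID=contig00018;Name=contig00018",
]
print(contig_gaps(example))  # prints [(154994, 155035, 42)]
```

The first two contigs are separated by 42 bases (positions 154994 to 155035), which is where to look in the chromosome sequence.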

Inspect the data
Starting Artemis
$ art &

Load the chromosome sequence
Select the sequence for your group

Load the contig GFF
Select the file for your group

Find the stitching sequence
The contigs are stitched with a specific sequence: see if you can find and identify it.

Lines of Evidence
• ab initio genecalling:
• Unsupervised methods - not trained on a dataset
• Supervised methods - trained on a dataset
• homology matches
• alignment to genes from related organisms (annotation
transfer)
• from known gene products (e.g. proteins, ncRNA)
• from transcripts/other intermediates (e.g. ESTs, cDNA,
RNAseq)

Consensus Methods
• Combine weighted evidence from multiple sources, using linear
combination or graph theoretical methods
• For eukaryotes:
• EVM http://evidencemodeler.sourceforge.net/
• Jigsaw http://www.cbcb.umd.edu/software/jigsaw/
• GLEAN http://sourceforge.net/projects/glean-gene/

Basic Gene Finding
• We could use Artemis to identify the longest coding region in each ORF, but that takes lots of manual steps
• This is the most basic gene finding, and can easily be automated, e.g. EMBOSS getorf
• Dedicated gene finders are usually more appropriate...

Finding Open Reading Frames
• ORF finding is naive; it does not consider:
• Start codon
• Splicing
• Promoter/RBS motifs
• Wider context (e.g. overlapping genes)

Prokaryotic Prediction Methods
• Prokaryotes “easier” than eukaryotes for gene prediction
• Less uncertainty in predictions (isoforms, gene structure)
• Very gene-dense (over 90% of chromosome is coding sequence)
• No intron-exon structure
• Problem is: “which possible ORF contains the true gene, and
which start site is correct?”
• Still not a solved problem

Two ab initio Prokaryotic Prediction
Methods
You will be using two tools
• Glimmer
• Interpolated Markov models
• Can be trained on “gold standard” datasets
• Prodigal
• Log-likelihood model based on GC frame plots, followed by
dynamic programming
• Can be trained on “gold standard” datasets

Using Glimmer
Supervised - we train on a related complete genome sequence,
then run glimmer3
$ build-icm -r NC_004547.icm < NC_004547.ffn
$ glimmer3 -o 50 -g 110 -t 30 chrA.fasta NC_004547.icm chrA_glimmer3
• -o 50: max overlap bases
• -g 110: min gene length
• -t 30: threshold score

Using Glimmer
glimmer3 output is not standard GFF format:
$ head -n 4 chrA_glimmer3.predict
>chrA
orf00001 36 1430 +3 8.81
orf00002 1435 2535 +1 11.51
orf00005 2676 3761 +3 8.63
We could Google for help, or use provided conversion script:
$ python glimmer_to_gff.py chrA_glimmer3.predict

Using Glimmer
We now have output in GFF
$ head -n 3 chrA_glimmer3.gff
chrA Glimmer CDS 36 1430 8.81 + 0 ID=orf00001;Name=orf00001
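
For readers curious what such a conversion involves, the sketch below turns .predict lines into GFF-style lines matching the output shown above. It is an illustration written for this purpose, not the workshop's glimmer_to_gff.py script, and it does not handle genes that wrap around the origin of a circular chromosome.

```python
def predict_to_gff(lines, source="Glimmer"):
    """Convert glimmer3 .predict lines into tab-delimited GFF feature lines."""
    seqid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seqid = line[1:].split()[0]  # sequence name from the header line
            continue
        orf_id, start, end, frame, score = line.split()
        start, end = int(start), int(end)
        strand = frame[0]  # "+3" -> "+", "-2" -> "-"
        # GFF needs start <= end; reverse-strand calls list the larger value first
        low, high = min(start, end), max(start, end)
        attributes = f"ID={orf_id};Name={orf_id}"
        yield "\t".join([seqid, source, "CDS", str(low), str(high),
                         score, strand, "0", attributes])


example = [">chrA", "orf00001 36 1430 +3 8.81", "orf00009 500 100 -2 5.00"]
for gff_line in predict_to_gff(example):
    print(gff_line)
```

The second input line is an invented reverse-strand prediction, included to show the coordinate swap.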

Using Prodigal
Unsupervised (i.e. untrained) mode
$ prodigal -f gff -o chrA_prodigal.gff -i chrA.fasta

Using Prodigal
Prodigal GFF output is correctly formatted and informative
$ head -n 6 chrA_prodigal.gff
##gff-version 3
# Sequence Data: seqnum=1;seqlen=4727782;seqhdr="chrA"
# Model Data: version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=54.48;transl_table=11;uses_sd=1
chrA Prodigal_v2.50 CDS 3 1430 188.5 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=188.54;cscore=185.37;sscore=3.18;rscore=0.00;uscore=3.18;tscore=0.00
chrA Prodigal_v2.50 CDS 1435 2535 185.6 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=185.61;cscore=184.24;sscore=1.36;rscore=-7.73;uscore=3.48;tscore=4.37
chrA Prodigal_v2.50 CDS 2676 3761 146.2 + 0 ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=146.19;cscore=149.82;sscore=-3.63;rscore=-7.73;uscore=-0.28;tscore=4.37

Comparing predictions in Artemis

Do ORF(orange)/CDS(green,blue) prediction methods agree?

Do glimmer(green)/prodigal(blue) CDS prediction methods
agree?
How do we know which (if either) is best?

Using a “Gold Standard”
A general approach for all predictive methods
• Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard”
• Evaluate your predictive method against that set for
• sensitivity, specificity, accuracy, precision, etc.
Many methods are available; covering them is beyond the scope of this introduction

Contingency Tables
                          Condition (Gold standard)
                          True              False
Test outcome   Positive   True Positive     False Positive
               Negative   False Negative    True Negative
Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)
If you don’t have this information, you can’t interpret predictive
results properly.
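
These definitions translate directly into code. In the sketch below the four counts are invented; PPV and FDR are included as well because they appear in the gene-caller comparison later in this part.

```python
def performance(tp, fp, fn, tn):
    """Return common performance metrics from contingency-table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TPR: true cases that test positive
        "specificity": tn / (fp + tn),  # TNR: false cases that test negative
        "fpr": fp / (fp + tn),          # 1 - specificity
        "ppv": tp / (tp + fp),          # positive calls that are correct
        "fdr": fp / (tp + fp),          # 1 - PPV
    }


# Invented counts for a hypothetical predictor evaluated on 1000 examples
metrics = performance(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity are computed down the columns of the table (per true condition), whereas PPV and FDR are computed along the "Positive" row (per test outcome).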

Why Performance Metrics Matter
• You go for a checkup, and are tested for disease X
• The test has sensitivity = 0.95 (predicts disease where there is disease)
• The test has FPR = 0.01 (predicts disease where there is no disease)
• Your test is positive
• What is the probability that you have disease X?
• 0.01, 0.05, 0.50, 0.95, 0.99?

• Unless you know the baseline occurrence of disease X, you cannot know.
• Baseline occurrence: fX
• fX = 0.01 ⇒ P(disease|+ve) = 0.490 ≈ 0.5
• fX = 0.8 ⇒ P(disease|+ve) = 0.997 ≈ 1.0

• Imagine a predictor for protein functional class
• Predictor has sensitivity = 0.95, FPR = 0.01
• You run the predictor on 20,000 proteins in an organism
• We estimate ≈ 200 members in protein complement, so fX = 0.01
• fX = 0.01 ⇒ P(member|+ve) = 0.490 ≈ 0.5
Bayes’ Theorem
• May seem counter-intuitive: 95% sensitivity, 99% specificity ⇒ 50% chance of any prediction being incorrect (at fX = 0.01)
• Probability given by Bayes’ Theorem
• $P(X|+) = \frac{P(+|X)P(X)}{P(+|X)P(X) + P(+|\bar{X})P(\bar{X})}$
• This is commonly overlooked in the literature (confirmation bias?)
• e.g. in paper describing novel TTSS predictor:
“The surprisingly high number of (false) positives in genomes
without TTSS exceeds the expected false positive rate”
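
The figures quoted in this part can be checked with a few lines of Python. This is a sketch of the Bayes' Theorem calculation for a test with sensitivity 0.95 and FPR 0.01; the function name is an assumption for this example.

```python
def posterior(sensitivity, fpr, prevalence):
    """P(X | +): probability of the condition given a positive test."""
    true_pos = sensitivity * prevalence   # P(+|X) P(X)
    false_pos = fpr * (1 - prevalence)    # P(+|not X) P(not X)
    return true_pos / (true_pos + false_pos)


for f_x in (0.01, 0.8):
    print(f"fX = {f_x}: P(X|+) = {posterior(0.95, 0.01, f_x):.3f}")
# prints:
# fX = 0.01: P(X|+) = 0.490
# fX = 0.8: P(X|+) = 0.997

# The same result as expected counts for 20,000 proteins with 200 true members
expected_tp = 0.95 * 200      # about 190 true positives
expected_fp = 0.01 * 19_800   # about 198 false positives
print(f"{expected_tp / (expected_tp + expected_fp):.3f}")  # prints 0.490
```

Seen as counts, the problem is plain: about 190 correct predictions are mixed in with about 198 incorrect ones, so roughly half of all positive calls are wrong.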

Interpreting Performance Metrics
• Use Bayes’ Theorem!
• Predictions apply to groups, not individual members of the
group. e.g.
• Test for airport smugglers has P(smuggler|+) = 0.9
• Test gives 100 positives
• Which specific individuals are truly smugglers?
• The test does not allow you to determine this - you need more evidence for each individual
• Same principle applies to all other tests (including protein functional class prediction) - you should not ‘cherry-pick’ for publication without other evidence

“Gold Standard” results
• Tested glimmer and prodigal on two “gold standards”
• Manually annotated (>3 expert person years) close relative
• Community-annotated close relative
• Both methods trained directly on the annotated genes in each
organism!

genecaller          glimmer     prodigal
predicted           4752        4287
missed              284 (6%)    407 (9%)
Exact Prediction
  sensitivity       62%         71%
  FDR               41%         25%
  PPV               59%         75%
Correct ORF
  sensitivity       94%         91%
  FDR               10%         3%
  PPV               90%         97%

genecaller          glimmer     prodigal
predicted           4679        4467
missed              112 (3%)    156 (3%)
Exact Prediction
  sensitivity       62%         86%
  FDR               31%         14%
  PPV               69%         86%
Correct ORF
  sensitivity       97%         97%
  FDR               7%          3%
  PPV               93%         97%

Gene/CDS Prediction
• Many alternative methods, all perform differently
• To assess/choose methods, performance metrics are required
• Even on (relatively simple) prokaryotes, current best methods
imperfect
• Manual assessment and intervention is essential, and usually
the longest part of the process

Run a megaBLAST Comparison
BLAST your chromosome against the comparator sequence.
Put results in chrA_megablast_Pba.tab
$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_megablast_Pba.tab -outfmt 6
$ head -n 3 chrA_megablast_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:10948-12453 80.34 1511 287 10 4579450 4580955 1506 1 0.0 1136
chrA gi|50118965|ref|NC_004547.2|:c33859-32447 82.04 1409 253 0 4563151 4564559 1 1409 0.0 1201
chrA gi|50118965|ref|NC_004547.2|:c34917-33868 82.48 1050 184 0 4562093 4563142 1 1050 0.0 920
Note this defaults to using MEGABLAST...

Run a BLASTN Comparison
BLAST your chromosome against the comparator sequence
Put results in chrA_blastn_Pba.tab
$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_blastn_Pba.tab -outfmt 6 -task blastn
$ head -n 3 chrA_blastn_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:5629-7497 79.68 1865 379 0 4584915 4586779 1865 1 0.0 1654
chrA gi|50118965|ref|NC_004547.2|:5629-7497 92.59 27 2 0 4479367 4479393 1254 1280 0.004 41.0
chrA gi|50118965|ref|NC_004547.2|:5629-7497 100.00 17 0 0 4613022 4613038 52 36 2.1 31.9
Note we added -task blastn

Do BLASTN and megaBLAST comparisons agree?
Check the number of alignments returned with wc
$ wc chrA_megablast_Pba.tab
2675 32100 242539 chrA_megablast_Pba.tab
$ wc chrA_blastn_Pba.tab
31792 381504 2850953 chrA_blastn_Pba.tab
What is this telling us?
Why do the results differ?

BLASTN vs megaBLAST
• Legacy BLASTN uses the BLAST algorithm, megaBLAST
does not
• (though BLAST+ BLASTN now uses megaBLAST by default)
• megaBLAST uses a fast, greedy algorithm due to Zhang et al.
(2000) http://www.ncbi.nlm.nih.gov/pubmed/10890397
• megaBLAST is optimised for
• genome-level searches
• queries on large sequence sets (automatic query packing)
• long alignments of similar sequences, with SNPs/sequencing
errors
• A discontinuous mode (dc-megaBLAST) is recommended for
more divergent sequences

Viewing alignments in ACT
Start ACT from the command line:
$ act &

Use the “File”, “Open...” menu item

Increase the Number of Comparisons
Use more files ...

Remove Weak Matches
Use filter sliders

MUMmer
• MUMmer is a suite of alignment programs and scripts
• mummer, promer, nucmer, etc.
• Very different to BLAST (suffix tree alignment) - very fast
• Extremely flexible
• Used for genome comparisons, assemblies, scaffolding, repeat
detection, etc.
• Forms the basis for other aligners/assemblers

Run a MUMmer Comparison
Create a new sub-directory for MUMmer output.
$ pwd
.../data/workshop/chromosomes
$ mkdir nucmer_out
Run nucmer to create chrA_NC_004547.delta
$ nucmer --prefix=nucmer_out/chrA_NC_004547 chrA.fasta NC_004547.fna
Then filter this file to generate a coordinate table for visualisation
$ delta-filter -q nucmer_out/chrA_NC_004547.delta > nucmer_out/chrA_NC_004547.filter
$ show-coords -rcl nucmer_out/chrA_NC_004547.filter > nucmer_out/chrA_NC_004547_filtered.coords

MUMmer output is very different from BLAST output
$ head nucmer_out/chrA_NC_004547_filtered.coords
...

Use a one-line shell command to convert to ACT-friendly format:
$ tail -n +6 nucmer_out/chrA_NC_004547_filtered.coords | awk '{print $7" "$10" "$1" "$2" "$12" "$4" "$5" "$13}' > chrA_mummer_NC_004547.crunch
$ head chrA_mummer_NC_004547.crunch
2526 82.49 15 2540 4727782 4985117 4982588 5064019
2944 82.29 2676 5619 4727782 4982544 4979600 5064019
85 95.29 11092 11176 4727782 758690 758774 5064019
1356 81.69 17446 18801 4727782 77639 78994 5064019
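
The same column reordering can be written in Python, which may be easier to read and reuse than the awk one-liner. This is a sketch: the function name is invented, and the sample input line is constructed so that it reproduces the first output row above (its remaining columns are made-up placeholders, not real show-coords output).

```python
def coords_to_crunch(lines, header_lines=5):
    """Reorder whitespace-separated show-coords columns like the awk command.

    tail -n +6 starts at line 6, so the first five lines are skipped here.
    """
    for line in lines[header_lines:]:
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or malformed lines
        # awk fields $7 $10 $1 $2 $12 $4 $5 $13 are Python indices 6 9 0 1 11 3 4 12
        yield " ".join(fields[i] for i in (6, 9, 0, 1, 11, 3, 4, 12))


# Five placeholder header lines plus one constructed data line
sample = ["header"] * 5 + [
    "15 2540 | 4985117 4982588 | 2526 2530 | 82.49 | 4727782 5064019 | 0.05 0.05 | chrA ref"
]
print(list(coords_to_crunch(sample)))
# prints ['2526 82.49 15 2540 4727782 4985117 4982588 5064019']
```

Note that the "|" separators in the coordinate table count as fields, which is why the column numbers are not consecutive.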

Select Files
Select your chromosome, and the megaBLAST/MUMmer output

Genome Alignments
• Alignment result depends on algorithm, and parameter choices
• Some algorithms/parameter sets more sensitive than others
• Appropriate visualisation is essential
Much more detail at http://www.slideshare.net/leightonp/comparative-genomics-and-visualisation-part-1

Reciprocal Best BLAST Hits (RBBH)
• To compare our genecall proteins to the NC_004547.faa reference set...
• BLAST reference proteins against our proteins
• BLAST our proteins against reference proteins
• Pairs with each other as best BLAST Hit are called RBBH

One-way BLAST vs RBBH
One-way BLAST includes many low-quality hits

One-way BLAST vs RBBH
Reciprocal best BLAST hits remove many low-quality matches

Reciprocal Best BLAST Hits (RBBH)
• Pairs with each other as best BLAST hit are called RBBH
• Should filter on percentage identity and alignment length
• RBBH pairs are candidate orthologues
• (most orthologues will be RBBH, but the relationship is complicated)
• Outperforms OrthoMCL, etc. (why and how is beyond the scope of this course...)
http://dx.doi.org/10.1093/gbe/evs100
http://dx.doi.org/10.1371/journal.pone.0018755
(We have a tool for this on our in-house Galaxy server)
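
The logic of RBBH is short enough to sketch in Python. The version below works on (query, subject, bitscore) tuples, keeps the first hit when scores tie, and omits the identity and length filtering recommended above; the function names and example hits are invented, and this is not the Galaxy tool mentioned.

```python
def best_hits(rows):
    """Map each query to its best-scoring subject."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


# Invented hits: our proteins vs the reference, and the reference vs ours
ours_vs_ref = [("geneA", "ref1", 500.0), ("geneA", "ref2", 80.0),
               ("geneB", "ref1", 120.0)]
ref_vs_ours = [("ref1", "geneA", 498.0), ("ref1", "geneB", 118.0),
               ("ref2", "geneB", 60.0)]
print(reciprocal_best_hits(ours_vs_ref, ref_vs_ours))  # prints [('geneA', 'ref1')]
```

geneB's best hit is ref1, but ref1's best hit is geneA, so only the geneA/ref1 pair is reciprocal; the one-way hit is discarded.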

In Conclusion
• The tools you will need to use will be task-dependent, but
some things are universal. . .
• Good experimental design (including BLAST searches, etc.)
• Keeping accurate records for reproduction/replication
• Validation/sanity checking of results
• Comparison and benchmarking of methods
• (Cross-)validation of predictive methods
Remember: everything gets easier with practice, so practice
lots!
