Introduction to Bioinformatics
Part 0: So You Want To Be a Computational
Biologist?
Leighton Pritchard and Peter Cock
Bertrand Russell
Table of Contents
Introduction
Recording Your Work
Conclusion
What is this “bioinformatics” thing,
anyway?
• Bioinformatics: biology using computational and
mathematical tools
• A discipline within biology
• Loman & Watson (2013) “So you want to be a computational
biologist?” http://dx.doi.org/10.1038/nbt.2740
• Welch et al. (2014) “Bioinformatics Curriculum Guidelines:
Toward a Definition of Core Competencies”
http://dx.doi.org/10.1371/journal.pcbi.1003496
• Watson (2014) “The only core competency you need”
http://bit.ly/1fS4iDJ (blog)
Some uncomfortable truths
• This one-day course will not make you a bioinformatician
• But practice will. . .
• The best way to learn is to do (“I don’t know how to do this
yet, but I will find out.”)
• http://bit.ly/Rq0D61 (“Bioinformatics is a way of life”)
• Most bioinformatics is problem-solving
• We will introduce some useful tools and concepts
What it takes to be a bioinformatician
• Patience
(problem-solving)
• Suspicion (statistics)
• Biological knowledge
• Social skills (no-one
knows everything: ask!)
• Lots of practice
• Self-confidence (challenge
results and dogma)
• Core domain skills:
biology, computer science,
statistics
• Deliver results (qualified,
honest)
• Watson (2014) “What it takes to be a bioinformatician”
http://bit.ly/1jDuQsO (blog)
More general advice?
• Ask us (we do this a lot)
• BioStars (https://www.biostars.org)
• SeqAnswers (http://seqanswers.com/)
• PLoS Comp Biol collections (http:
//www.ploscollections.org/static/pcbiCollections)
Why Do It?
• Doing bioinformatics is doing science: keep a lab book!
• You will not remember multiple files, analysis details, etc. in a
week/month/six months/a year/three years
• Noble (2009)
http://dx.doi.org/10.1371/journal.pcbi.1000424
• Baggerly & Coombes (2009)
http://arxiv.org/pdf/1010.1092.pdf
How To Do It? I
• There is no one correct way, but. . .
• Think about data/docs/project structure before you start
How To Do It? II
• Use plain text where possible
• Use version control
• Keep backups
• Different tools suit different purposes: code vs. data vs.
analysis vs. . . .
• Find a way that works for you.
How To Do It? III
• Reproducibility is key!
• Scripts and pipelines are better for this than notes of what
you did
• Also better for version control, and reuse
• Avoid unnecessary duplication
• Someone else may have solved your problem
• One (backed up) read-only copy of raw data, keep analyses
separate
Plain Text Files
• README.txt/README.md in each directory/folder
• Plain text is always human-readable
• Markdown (https:
//daringfireball.net/projects/markdown/basics)
• RST (http://docutils.sourceforge.net/docs/ref/rst/
restructuredtext.html)
Galaxy workflows
• Use through browser, graphical interface
• Reproducible, shareable, documented, reusable analyses
• Wraps standard bioinformatics tools
• Local instance (http://ppserver/galaxy) uses JHI cluster
script
• Writes your terminal activity to a plain text file
• Saves effort copy/pasting and typing commands into a lab
book, as you go
• Easy to use with other tools
• use man script at your terminal to find out more
MediaWiki
• Useful for shared projects/data
• Automatic version control and attribution
• Many local instances at JHI (ask around)
A language notebook
• e.g. iPython Notebook, Mathematica, MatLab cells
• Integrates live code and analysis with lab-book
LATEX
• Powerful, versatile typesetting system (e.g. these slides)
• Similar to markup/markdown
• Pros: great for mathematical/computing work, writing a thesis
• Cons: not easy to pick up
In Conclusion
• Bioinformatics is just biology using computers and
mathematics
• You still need to “do science” in the same way:
• Keep accurate records
• Plan and conduct experiments (with controls)
• Follow the literature
• Professional development
An Introduction to Bioinformatics
Tools
Part 1: Golden Rules of Bioinformatics
Leighton Pritchard and Peter Cock
On Confidence
“Ignorance more frequently begets confidence than does
knowledge: it is those who know little, not those who know much,
who so positively assert. . .”
- Charles Darwin
Table of Contents
Rule 0
Rule 1
Rule 2
Rule 3
Conclusions
Zeroth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
• local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)
First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of
pedantry)
• Who? When? How?
• You need to understand the experiment to analyse it (easier if
you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)
Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
• Software does not distinguish meaningful from meaningless
data
• Software has bugs
• Algorithms have assumptions, conditions, and applicable
domains
• Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
• Test output for robustness to parameter (including data)
choice
Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
• Beware cognitive errors, such as confirmation bias!
• System 1 vs. System 2 ≈ intuition vs. reason
• Think statistically!
• Large datasets can be counterintuitive and appear to confirm a
large number of contradictory hypotheses
• Always account for multiple tests.
• Avoid “data dredging”: intensive computation is not an
adequate substitute for expertise
• Use test-driven development of analyses and code
• Use examples that pass and fail
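One simple way to "account for multiple tests" is a Bonferroni correction. The sketch below is only an illustration: the function name and the example p-values are made up for this purpose, and other corrections (for example false discovery rate control) may suit a given analysis better.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay significant after Bonferroni correction.

    Each test is compared against alpha divided by the number of tests,
    which keeps the chance of any false positive at or below alpha.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from three tests: all three are below 0.05,
# but only the first survives the corrected threshold of 0.05 / 3.
print(bonferroni_significant([0.001, 0.02, 0.04]))
```

The same idea explains why large datasets appear to confirm many hypotheses: with thousands of tests, some small p-values are expected by chance alone.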
In Conclusion
• Always communicate!
• worst errors are silent
• Don’t trust the data
• formatting/validation/category errors - check!
• suitability for scientific question
• Don’t trust the software
• software is not an authority
• always benchmark, always validate
• Don’t trust yourself
• beware cognitive errors
• think statistically
• biological “stories” can be constructed from nonsense
An Introduction to Bioinformatics
Tools
Part 2: BLAST
Leighton Pritchard and Peter Cock
Table of Contents
Introduction
Alignment
BLAST
BLAST Statistics
Using BLAST
Learning Outcomes
• How BLAST searches work
• How the way BLAST searches work affects your results
• Why search parameters matter
• Setting search parameters
About Bioinformatics Tools
A Recent Twitter Conversation
Why So Much Detail?
• You’re going to go away and do lots of BLAST searches
• Everyone uses BLAST - not everyone uses it well
• Easier to fix problems if you know how it works
• Understanding what’s going on helps avoid misuse/abuse
• Understanding what’s going on helps use the tool more
effectively
• Not so much detail, really
• like knowing about Tm and ion concentration effects, not
molecular orbitals or thermodynamics (but ask if you’re
interested ;) )
What BLAST Is
• BLAST:
• Basic (it’s actually sophisticated)
• Local Alignment (what it does: local sequence alignment)
• Search Tool (what it does: search against a database)
• The most important software package in bioinformatics?
• Fast, robust, sequence similarity search tool
• Does not necessarily produce optimal alignments
• Not foolproof.
What A BLAST Search Is
• Every BLAST search is an in silico hybridisation experiment
• BLAST search = identification of similar sequences in a given
database
• Results depend on:
• query sequence
• BLAST program (including version and BLAST vs BLAST+)
• database
• parameters
Alignment Search Space
Consider two biological sequences to be aligned. . .
• One sequence on the x-axis, the other on the y-axis
• Each point in space is a pairing of two letters
• Ungapped alignments are diagonal lines in the search space,
gapped alignments have short “breaks”
• There may be one or more “optimal” alignments
Global vs Local Alignment
• Global alignment: sequences are aligned along their entire
lengths
• Local alignment: the best subsequence alignment is found
• Consider an alignment of the same gene from two
distantly-related eukaryotes, where:
• Exons are conserved and small in relation to gene locus size
• Introns are not well-conserved but large in relation to gene
locus size
• Local alignment will align the conserved exon regions
• Global alignment will align the whole (mostly unrelated) locus
Our Goal
• We aim to align the words
• COELACANTH
• PELICAN
• Each identical letter (match) scores +1
• Each different letter (mismatch) scores -1
• Each gap scores -1
• All sequence alignment is maximisation of an alignment score
- a mathematical operation.
Initialise the matrix
Fill the cells
Fill the matrix – represents all possible
alignments & scores
Traceback
Algorithms
• Global: Needleman-Wunsch (as in example)
• Local: Smith-Waterman (differs from example)
• Biological information encapsulated only in the scoring
scheme (matches, mismatches, gaps)
• NW/SW are guaranteed to find the optimal match with
respect to the scoring system being used
• BUT the optimal alignment is a biological approximation: no
scoring scheme encapsulates biological “truth”
• Any pair of sequences can be aligned: finding meaning is up
to you
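The matrix-filling procedure for the COELACANTH/PELICAN example can be sketched in a few lines of Python. This is a minimal illustration of Needleman-Wunsch scoring with the slide's scheme (match +1, mismatch -1, gap -1); the function name is an assumption for this sketch, and it returns only the optimal global score, without the traceback step.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    # First row of the matrix: aligning a prefix of b against nothing
    # costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # First column: aligning a prefix of a against nothing.
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    # Bottom-right cell holds the score of the best full-length alignment.
    return prev[-1]


print(needleman_wunsch_score("COELACANTH", "PELICAN"))  # prints 0
```

A score of 0 corresponds to five matches (E, L, C, A, N), two mismatches and three gaps. Changing the scoring scheme changes which alignment is "optimal", which is the point made above: the biology lives in the scores.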
BLAST Is A Heuristic
• BLAST does not use Needleman-Wunsch or Smith-Waterman
• BLAST approximates dynamic programming methods
• BLAST is not guaranteed to give a mathematically optimal
alignment
• BLAST does not explore the complete search space
• BLAST uses heuristics (loosely-defined rules) to refine
High-scoring Segment Pairs (HSPs)
• BLAST reports only “statistically-significant” alignments
(dependent on parameters)
Steps in the Algorithm
1. Seeding
2. Extension
3. Evaluation
Word Hits
• A word hit is a short sequence and its neighbourhood
• neighbourhood: words of same length whose aligned score is
greater than or equal to a threshold value T
• Three parameters: scoring matrix, word size W , and T
Seeding
• BLAST assumption: significant alignments have words in
common
• BLAST finds word (neighbourhood) hits in the database index
• Word hits are used to seed alignments
Seeding Controls Sensitivity
• Word size W controls number of hits (smaller words =⇒
more hits)
• Threshold score T controls number of hits (lower threshold
=⇒ more hits)
• Scoring matrix controls which words match
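The effect of word size W on the number of seeds can be seen with a toy example. The sketch below is a simplification and not BLAST's implementation: it finds exact shared words only, so it ignores the neighbourhood, the threshold T and the scoring matrix. The function name is invented for illustration.

```python
def shared_words(query, subject, w):
    """Return (query_pos, subject_pos) for every exact word of length w
    that occurs in both sequences (0-based positions)."""
    # Index every word in the subject, as a database index would.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    # Look up each query word in the index.
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


for w in (2, 3, 4):
    hits = shared_words("COELACANTH", "PELICAN", w)
    print(w, len(hits), hits)
```

Smaller words give more hits (three at W=2, one at W=3, none at W=4). All the hits here have the same value of query_pos - subject_pos, meaning they lie on the same diagonal of the search space.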
The Two-Hit Algorithm
• BLAST assumption: word hits cluster on the diagonal for
significant alignments
• The acceptable distance A between words on the diagonal is a
parameter of your model
• Smaller distances isolate single words, and reduce search space
Extension
• The best-scoring seeds are extended in each direction
• BLAST does not explore the complete search space, so a rule
(heuristic) to stop extension is needed
• Two-stage process:
• Extend, tracking the alignment score and the drop-off score (how far
the score has fallen below the best score seen so far)
• When the drop-off score reaches a threshold X, trim the alignment
back to the top score
Example
• Consider two sentences (match=+1, mismatch=-1)
• The quick brown fox jumps over the lazy dog.
• The quiet brown cat purrs when she sees him.
• Extend to the right from the seed T
• The quic
• The quie
• 123 4565 <- score
• 000 0001 <- drop-off score
• Extend to drop-off threshold
• The quick brown fox jump
• The quiet brown cat purr
• 123 45654 56789 876 5654 <- score
• 000 00012 10000 123 4345 <- drop-off score
• Trim back from drop-off threshold to get optimal alignment
• The quick brown
• The quiet brown
• 123 45654 56789 <- score
• 000 00012 10000 <- drop-off score
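The extension and trimming steps in this example can be reproduced with a short sketch. It assumes an ungapped, left-to-right extension with match +1 and mismatch -1, ignores spaces as the slides do, and uses a drop-off threshold of X = 5. The function name is illustrative; real BLAST extends in both directions and also handles gaps.

```python
def xdrop_extend(a, b, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment left to right until the score falls
    x_drop below the best score seen; return (best_score, best_length)."""
    score = best = best_len = 0
    for length, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # drop-off threshold reached: stop extending
    return best, best_len


s1 = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
s2 = "The quiet brown cat purrs when she sees him.".replace(" ", "")
best, length = xdrop_extend(s1, s2, x_drop=5)
print(best, s1[:length], s2[:length])  # 9 Thequickbrown Thequietbrown
```

Extension stops at "jump"/"purr", where the drop-off score reaches 5, and the alignment is trimmed back to the top score of 9 at "brown".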
Notes on implementation
• X controls termination of alignment extension, but dependent
on:
• substitution matrix
• gap opening and extension parameters
Evaluation
• The principle is easy: use a score threshold S to determine
strong and weak alignments
• S is monotonic with E, so an equivalent threshold can be
calculated
• Score S is independent of database size and search space. E
values are not.
• Alignment consistency of HSPs is also a factor in the report
Log-odds Matrices
• Substitution matrices are your model of evolution
• Substitution matrices are log-odds matrices
• Positive numbers indicate likely substitutions/similarity
• Negative numbers indicate unlikely substitutions/dissimilarity
BLOSUM62
Choice of Matrix
• Substitution matrix determines the raw alignment score S
• S is the sum of pairwise scores in an alignment
• BLAST provides, for proteins:
• BLOSUM45 BLOSUM50 BLOSUM62 BLOSUM80 BLOSUM90
• PAM30 PAM70 PAM250
• BLOSUM matrices empirically defined from multiple sequence
alignments of ≥ n% identity, for BLOSUMn
• For nucleotides: ‘matrix’ defined by match/mismatch
(reward/penalty) parameters
Definition
• The Karlin-Altschul equation
$E = kmn\,e^{-\lambda S}$
• Symbols:
• k: minor constant, adjusts for correlation between alignments
• m: number of letters in query sequence
• n: number of letters in the database
• λ: scoring matrix scaling factor
• S: raw alignment score
Interpretation
• The Karlin-Altschul equation
$E = kmn\,e^{-\lambda S}$
• E is the number of alignments of a similar score expected by
chance when querying a database of the same size and letter
frequency, where the letters in that database are
randomly-ordered
• Small changes in score S can produce large changes in E
• BUT biological sequence databases are not random!
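The sensitivity of E to the score and to the database size can be checked numerically. In the sketch below the values of k and λ are placeholders chosen for illustration only, not the parameters of any particular scoring matrix, and the sequence lengths are invented.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = k * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumptions for illustration only).
K, LAMBDA = 0.1, 0.3
m, n = 300, 1_000_000  # hypothetical query and database sizes in letters

for score in (50, 60, 70):
    print(score, expect_value(score, m, n, K, LAMBDA))

# Same score against a database ten times larger gives a ten-fold larger E.
print(expect_value(60, m, 10 * n, K, LAMBDA))
```

Each 10-point increase in S divides E by exp(10λ), about 20-fold with these placeholder values, while E grows linearly with database size. The same alignment can therefore pass or fail an E-value threshold depending on the database searched.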
Multiple BLAST tools
• BLASTN vs MEGABLAST vs TBLASTX vs ...?
• Korf et al. (2003) BLAST is very good on the theory,
but its practical examples are dated because of the changes made in BLAST+
Multiple flavours of BLAST
• NCBI “legacy” BLAST
• Now obsolete and not being updated
• Spawned offshoots including:
• WU-BLAST aka AB-BLAST (commercial)
• MPI-BLAST for use on clusters
• Versions to run on graphics cards
• NCBI BLAST+
• Re-written in 2009 using C++ instead of C
• Many improvements
• Slightly different output
• Different commands used to run it
Multiple ways to run BLAST
• BLAST+ at the command line (today)
• Via a script or programming language
• Via a graphical tool like BioEdit, CLCbio, Blast2GO
• Via the NCBI website
• Via a genome consortium website
• Via a Galaxy web server
• etc
• Offers flexibility but different settings/options/versions
Multiple places to run BLAST
• On the NCBI servers, e.g. via website or tool
• On 3rd party servers, e.g. via websites
• On your own computer
• On our Linux cluster
Core BLAST tools: Query sequences vs
Database
• Nucleotide vs Nucleotide:
• blastn (covering blastn, megablast, dc-megablast)
• Translated nucleotide vs Protein:
• blastx
• Protein vs Translated nucleotide:
• tblastn
• Protein vs Protein:
• blastp, psiblast, phiblast, deltablast
See http://blast.ncbi.nlm.nih.gov/ for a reminder ;)
The BLAST tools have built in help
$ blastp -h
USAGE
  blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
    [-sum_statistics] [-seg SEG_options] [-soft_masking soft_masking]
    [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
    ...
    [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
    [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Protein-Protein BLAST 2.2.29+

Use '-help' to print detailed descriptions of command line arguments
Minimal example of BLAST+ at the
command line
$ blastp -query my_input.fasta -db my_database -out my_output.txt
• Replace blastp with the appropriate tool, e.g. blastn
• Replace my_input.fasta with your actual filename
• Replace my_database with your actual database, e.g. nr
• Replace my_output.txt with your desired output filename
• Best to avoid spaces in your folder and filenames!
e.g.
$ blastp -query query.fasta -db dbA -out my_output.txt
Setting the BLAST+ output format
$ blastp -help
USAGE
...

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)

   ...
   Default = '0'
 ...
Setting the BLAST+ output format
Default is plain text pairwise alignments, for humans:
$ blastp -query query.fasta -db dbA -out my_output.txt
...
XML output can be useful (e.g. for BLAST2GO):
$ blastp -query query.fasta -db dbA -out my_output.xml -outfmt 5
...
Tabular output is easiest to filter, sort, etc:
$ blastp -query query.fasta -db dbA -out my_output.tab -outfmt 6
...
Setting the e-value threshold
Check the built in help:
$ blastp -help
USAGE
...
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = '10'
...
Example using 0.00001, i.e. 1 × 10^-5, written in scientific notation as 1e-5:
$ blastp -query query.fasta -db dbA -out my_output.txt -evalue 1e-5
...
In Conclusion
• Every BLAST search is an experiment
• Badly-designed searches can give you bad results
• Knowing how BLAST works helps improve search design
• BLAST results still require inspection and interpretation
An Introduction to Bioinformatics
Tools
Part 3: Workshop
Leighton Pritchard and Peter Cock
Table of Contents
Introduction
Workshop Data
Gene Prediction
Genome Comparisons
Gene Comparisons
Conclusions
Learning Outcomes
• Workshop example: bacterial genome annotation
(because they’re small and data easy to handle)
• The role of biological insight in a bioinformatics workflow
• Visual interaction with sequence data
• Using alternative tools
• Comparison of tools and outputs
• Online tools for automated function prediction
What You Will Be Doing
Illustrative example of concepts: Functional annotation of a draft
bacterial genome
1. Gene prediction
2. Genome comparisons
3. Gene comparisons
Locate your data
• You are in group A, B, C or D - this decides your chromosome
sequence:
chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta
• Each sequence represents a single stitched, ordered draft
bacterial genome comprising a number of contigs.
• You will use your sequence as the basis of the exercises in the
workshop.
• You also have a GFF file describing the location of assembled
contigs:
chrA_contigs.gff, chrB_contigs.gff,
chrC_contigs.gff, chrD_contigs.gff
Inspect the data
$ head -n 3 chrA.fasta
>chrA
ttttcttgattgaccttgttcgagtggagtccgccgtgtcactttcgctttggcagcagt
gtcttgcccgtttgcaggatgagttacctgccacagaattcagtatgtggatacgcccgt
$ head -n 3 chrA_contigs.gff
##gff-version 3
chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b
chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018
Starting Artemis
$ art &
Load the chromosome sequence
Select the sequence for your group
Load the contig GFF
Select the file for your group
Find the stitching sequence
The contigs are stitched with a specific sequence: see if you can
find, and identify it.
Lines of Evidence
• ab initio genecalling:
• Unsupervised methods - not trained on a dataset
• Supervised methods - trained on a dataset
• homology matches
• alignment to genes from related organisms (annotation
transfer)
• from known gene products (e.g. proteins, ncRNA)
• from transcripts/other intermediates (e.g. ESTs, cDNA,
RNAseq)
Consensus Methods
• Combine weighted evidence from multiple sources, using linear
combination or graph theoretical methods
• For eukaryotes:
• EVM http://evidencemodeler.sourceforge.net/
• Jigsaw http://www.cbcb.umd.edu/software/jigsaw/
• GLEAN http://sourceforge.net/projects/glean-gene/
Basic Gene Finding
• We could use Artemis to identify the longest coding region in
each ORF, lots of manual steps
• This is the most basic gene finding, and can easily be
automated, e.g. EMBOSS getorf
• Dedicated gene finders usually more appropriate...
Finding Open Reading Frames
• ORF finding is naive, does not consider:
• Start codon
• Splicing
• Promoter/RBS motifs
• Wider context (e.g. overlapping genes)
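To make the limitations concrete, here is roughly what a naive ORF finder does. This is a sketch under stated assumptions, not EMBOSS getorf itself: it scans only the three forward frames, treats an ORF as any stretch between stop codons (TAA, TAG, TGA), and takes no account of start codons, RBS motifs or neighbouring genes. The function name and the minimum-length default are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def naive_orfs(seq, min_codons=2):
    """Return (frame, start, end) for stop-free stretches on the forward
    strand; coordinates are 0-based and end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos))
                start = pos + 3  # next ORF begins after the stop
        # A stretch running to the end of the sequence, in whole codons.
        end = start + ((len(seq) - start) // 3) * 3
        if (end - start) // 3 >= min_codons:
            orfs.append((frame, start, end))
    return orfs


for frame, start, end in naive_orfs("ATGAAATAGCCC"):
    print(frame, start, end)
```

Even this 12-base toy sequence yields a candidate in every frame, which is why dedicated gene finders are needed to decide which ORFs are real genes.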
Prokaryotic Prediction Methods
• Prokaryotes “easier” than eukaryotes for gene prediction
• Less uncertainty in predictions (isoforms, gene structure)
• Very gene-dense (over 90% of chromosome is coding sequence)
• No intron-exon structure
• Problem is: “which possible ORF contains the true gene, and
which start site is correct?”
• Still not a solved problem
Two ab initio Prokaryotic Prediction
Methods
You will be using two tools
• Glimmer
• Interpolated Markov models
• Can be trained on “gold standard” datasets
• Prodigal
• Log-likelihood model based on GC frame plots, followed by
dynamic programming
• Can be trained on “gold standard” datasets
Using Glimmer
Supervised - we train on a related complete genome sequence,
then run glimmer3
$ build-icm -r NC_004547.icm < NC_004547.ffn
$ glimmer3 -o 50 -g 110 -t 30 chrA.fasta NC_004547.icm chrA_glimmer3
• -o 50: max overlap bases
• -g 110: min gene length
• -t 30: threshold score
Using Glimmer
glimmer3 output is not standard GFF format:
$ head -n 4 chrA_glimmer3.predict
>chrA
orf00001 36 1430 +3 8.81
orf00002 1435 2535 +1 11.51
orf00005 2676 3761 +3 8.63
We could Google for help, or use the provided conversion script:
$ python glimmer_to_gff.py chrA_glimmer3.predict
Using Glimmer
We now have output in GFF
$ head -n 3 chrA_glimmer3.gff
chrA Glimmer CDS 36 1430 8.81 + 0 ID=orf00001;Name=orf00001
chrA Glimmer CDS 1435 2535 11.51 + 0 ID=orf00002;Name=orf00002
chrA Glimmer CDS 2676 3761 8.63 + 0 ID=orf00005;Name=orf00005
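The workshop's glimmer_to_gff.py script is not reproduced in these slides. The sketch below shows how such a conversion could work, inferred only from the input and output shown above; the function name is hypothetical, and it assumes that reverse-strand predictions are listed with start greater than end and that no gene wraps around the origin.

```python
def glimmer_predict_to_gff(lines, source="Glimmer"):
    """Convert glimmer3 .predict lines to tab-separated GFF feature lines."""
    seqid = None
    features = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line names the sequence the following ORFs belong to.
            seqid = line[1:].split()[0]
            continue
        orf_id, start, end, frame, score = line.split()
        start, end = int(start), int(end)
        strand = "+" if frame.startswith("+") else "-"
        # GFF requires start <= end whatever the strand.
        low, high = min(start, end), max(start, end)
        attributes = f"ID={orf_id};Name={orf_id}"
        features.append("\t".join(
            [seqid, source, "CDS", str(low), str(high), score, strand, "0",
             attributes]))
    return features


predict = [
    ">chrA",
    "orf00001 36 1430 +3 8.81",
    "orf00002 1435 2535 +1 11.51",
    "orf00005 2676 3761 +3 8.63",
]
print("\n".join(glimmer_predict_to_gff(predict)))
```

Keeping the conversion in a script rather than editing files by hand makes the step repeatable, in line with the advice on recording your work.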
Using Prodigal
Unsupervised (i.e. untrained) mode
$ prodigal -f gff -o chrA_prodigal.gff -i chrA.fasta
Using Prodigal
Prodigal GFF output is correctly formatted and informative
$ head -n 6 chrA_prodigal.gff
##gff-version 3
# Sequence Data: seqnum=1;seqlen=4727782;seqhdr="chrA"
# Model Data: version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=54.48;transl_table=11;uses_sd=1
chrA Prodigal_v2.50 CDS 3 1430 188.5 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=188.54;cscore=185.37;sscore=3.18;rscore=0.00;uscore=3.18;tscore=0.00
chrA Prodigal_v2.50 CDS 1435 2535 185.6 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=185.61;cscore=184.24;sscore=1.36;rscore=-7.73;uscore=3.48;tscore=4.37
chrA Prodigal_v2.50 CDS 2676 3761 146.2 + 0 ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=146.19;cscore=149.82;sscore=-3.63;rscore=-7.73;uscore=-0.28;tscore=4.37
Comparing predictions in Artemis
Do ORF(orange)/CDS(green,blue) prediction methods agree?
Comparing predictions in Artemis
Do glimmer(green)/prodigal(blue) CDS prediction methods
agree?
How do we know which (if either) is best?
Using a “Gold Standard”
A general approach for all predictive methods
• Define a known, “correct” set of true/false, positive/negative
etc. examples - the “gold standard”
• Evaluate your predictive method against that set for
• sensitivity, specificity, accuracy, precision, etc.
Many methods available, coverage beyond the scope of this
introduction
Contingency Tables
                          Condition (Gold standard)
                          True              False
Test outcome   Positive   True Positive     False Positive
               Negative   False Negative    True Negative
Sensitivity = TPR = TP/(TP + FN)
Specificity = TNR = TN/(FP + TN)
FPR = 1 − Specificity = FP/(FP + TN)
If you don’t have this information, you can’t interpret predictive
results properly.
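The metrics above follow directly from the four cells of the contingency table. A minimal sketch, with made-up counts chosen so that sensitivity is 0.95 and FPR is 0.01; the function name is illustrative, and PPV (the fraction of positive calls that are correct) is included because it appears in the results later.

```python
def performance(tp, fp, fn, tn):
    """Compute standard metrics from contingency-table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (fp + tn),  # true negative rate
        "fpr": fp / (fp + tn),          # 1 - specificity
        "ppv": tp / (tp + fp),          # positive predictive value
    }


# Hypothetical counts: 100 true cases, 1000 true non-cases.
for name, value in performance(tp=95, fp=10, fn=5, tn=990).items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the test itself, whereas PPV also depends on how many true cases were in the set being tested.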
Why Performance Metrics Matter
• You go for a checkup, and are tested for disease X
• The test has sensitivity = 0.95 (predicts disease where there is
disease)
• The test has FPR = 0.01 (predicts disease where there is no
disease)
• Your test is positive
• What is the probability that you have disease X?
• 0.01, 0.05, 0.50, 0.95, 0.99?
Why Performance Metrics Matter
• What is the probability that you have disease X?
• Unless you know the baseline occurrence of disease X, you
cannot know.
• Baseline occurrence: fX
• fX = 0.01 =⇒ P(disease|+ve) = 0.490 ≈ 0.5
• fX = 0.8 =⇒ P(disease|+ve) = 0.997 ≈ 1.0
Why Performance Metrics Matter
• Imagine a predictor for protein functional class
• Predictor has sensitivity = 0.95, FPR = 0.01
• You run the predictor on 20,000 proteins in an organism
• We estimate ≈ 200 members in protein complement, so
fX = 0.01
• fX = 0.01 =⇒ P(member|+ve) = 0.490 ≈ 0.5
Bayes’ Theorem
• May seem counter-intuitive: 95% sensitivity, 99% specificity
=⇒ 50% chance of any prediction being incorrect
• Probability given by Bayes’ Theorem
• $P(X|+) = \dfrac{P(+|X)\,P(X)}{P(+|X)\,P(X) + P(+|\bar{X})\,P(\bar{X})}$
• This is commonly overlooked in the literature (confirmation
bias?)
• e.g. in paper describing novel TTSS predictor:
“The surprisingly high number of (false) positives in genomes
without TTSS exceeds the expected false positive rate”
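The figures quoted for the disease test and the protein class predictor can be reproduced by applying Bayes' Theorem directly. A small sketch, in which the function and argument names are illustrative:

```python
def posterior(sensitivity, fpr, prevalence):
    """P(X | positive test) from Bayes' Theorem.

    sensitivity = P(+|X), fpr = P(+|not X), prevalence = P(X).
    """
    true_pos = sensitivity * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same test (sensitivity 0.95, FPR 0.01), different baseline occurrence.
print(round(posterior(0.95, 0.01, 0.01), 3))  # 0.49
print(round(posterior(0.95, 0.01, 0.8), 3))   # 0.997
```

With a baseline of 1%, the 198 expected false positives among 19,800 non-members roughly equal the 190 expected true positives among 200 members, hence the near 50% chance that any single positive is wrong.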
Interpreting Performance Metrics
• Use Bayes’ Theorem!
• Predictions apply to groups, not individual members of the
group. e.g.
• Test for airport smugglers has P(smuggler|+) = 0.9
• Test gives 100 positives
• Which specific individuals are truly smugglers?
• The test does not allow you to determine this - you need more
evidence for each individual
• Same principle applies to all other tests, (including protein
functional class prediction) - you should not ‘cherry-pick’ for
publication without other evidence
“Gold Standard” results
• Tested glimmer and prodigal on two ”gold standards”
• Manually annotated (>3 expert person years) close relative
• Community-annotated close relative
• Both methods trained directly on the annotated genes in each
organism!
“Gold Standard” results
genecaller         glimmer      prodigal
predicted          4752         4287
missed             284 (6%)     407 (9%)
Exact Prediction
  sensitivity      62%          71%
  FDR              41%          25%
  PPV              59%          75%
Correct ORF
  sensitivity      94%          91%
  FDR              10%          3%
  PPV              90%          97%
“Gold Standard” results
genecaller         glimmer      prodigal
predicted          4679         4467
missed             112 (3%)     156 (3%)
Exact Prediction
  sensitivity      62%          86%
  FDR              31%          14%
  PPV              69%          86%
Correct ORF
  sensitivity      97%          97%
  FDR              7%           3%
  PPV              93%          97%
Gene/CDS Prediction
• Many alternative methods, all perform differently
• To assess/choose methods, performance metrics are required
• Even on (relatively simple) prokaryotes, current best methods
imperfect
• Manual assessment and intervention is essential, and usually
the longest part of the process
Run a megaBLAST Comparison
BLAST your chromosome against the comparator sequence.
Put results in chrA_megablast_Pba.tab
$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_megablast_Pba.tab -outfmt 6
$ head -n 3 chrA_megablast_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:10948-12453 80.34 1511 287 10 4579450 4580955 1506 1 0.0 1136
chrA gi|50118965|ref|NC_004547.2|:c33859-32447 82.04 1409 253 0 4563151 4564559 1 1409 0.0 1201
chrA gi|50118965|ref|NC_004547.2|:c34917-33868 82.48 1050 184 0 4562093 4563142 1 1050 0.0 920
Note this defaults to using MEGABLAST...
Run a BLASTN Comparison
BLAST your chromosome against the comparator sequence
Put results in chrA_blastn_Pba.tab
$ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_blastn_Pba.tab -outfmt 6 -task blastn
$ head -n 3 chrA_blastn_Pba.tab
chrA gi|50118965|ref|NC_004547.2|:5629-7497 79.68 1865 379 0 4584915 4586779 1865 1 0.0 1654
chrA gi|50118965|ref|NC_004547.2|:5629-7497 92.59 27 2 0 4479367 4479393 1254 1280 0.004 41.0
chrA gi|50118965|ref|NC_004547.2|:5629-7497 100.00 17 0 0 4613022 4613038 52 36 2.1 31.9
Note we added -task blastn
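The tabular output (-outfmt 6) has twelve tab-separated columns in a fixed order, so it is easy to read into a script. The sketch below parses the three BLASTN rows shown above and keeps only the longer, stronger matches. The thresholds (70% identity, 100 bp) and the function names are illustrative assumptions, not workshop recommendations.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle):
    """Yield one dict per alignment from a tab-separated BLAST table."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


def strong_hits(hits, min_identity, min_length):
    """Keep alignments that pass both the identity and the length threshold."""
    return [h for h in hits
            if h["pident"] >= min_identity and h["length"] >= min_length]


# The three BLASTN rows from the slide, rebuilt as tab-separated text.
subject = "gi|50118965|ref|NC_004547.2|:5629-7497"
rows = [
    ["chrA", subject, "79.68", "1865", "379", "0",
     "4584915", "4586779", "1865", "1", "0.0", "1654"],
    ["chrA", subject, "92.59", "27", "2", "0",
     "4479367", "4479393", "1254", "1280", "0.004", "41.0"],
    ["chrA", subject, "100.00", "17", "0", "0",
     "4613022", "4613038", "52", "36", "2.1", "31.9"],
]
example = "\n".join("\t".join(row) for row in rows)

hits = list(read_hits(io.StringIO(example)))
strong = strong_hits(hits, min_identity=70.0, min_length=100)  # assumed thresholds
print(len(hits), "alignments,", len(strong), "after filtering")
```

To run this on a real file, replace the StringIO object with open("chrA_blastn_Pba.tab").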
Do BLASTN and megaBLAST comparisons agree?
Check the number of alignments returned with wc
$ wc chrA_megablast_Pba.tab
2675 32100 242539 chrA_megablast_Pba.tab
$ wc chrA_blastn_Pba.tab
31792 381504 2850953 chrA_blastn_Pba.tab
What is this telling us?
Why do the results differ?
BLASTN vs megaBLAST
• Legacy BLASTN uses the BLAST algorithm, megaBLAST
does not
• (though BLAST+ BLASTN now uses megaBLAST by default)
• megaBLAST uses a fast, greedy algorithm due to Zhang et al.
(2000) http://www.ncbi.nlm.nih.gov/pubmed/10890397
• megaBLAST is optimised for
• genome-level searches
• queries on large sequence sets (automatic query packing)
• long alignments of similar sequences, with SNPs/sequencing
errors
• A discontinuous mode (dc-megaBLAST) is recommended for
more divergent sequences
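One reason the two tasks return different numbers of alignments is the seed word size: in BLAST+ the megablast task defaults to a word size of 28, while -task blastn defaults to 11 (these defaults are not stated on the slides). The sketch below is a deliberately simplified model, not the BLAST algorithm: it only counts exact word matches at the same offset in two sequences, and the sequences are made up for illustration.

```python
def count_seed_words(query, subject, word_size):
    """Count offsets where query and subject share an exact word of word_size."""
    n_windows = min(len(query), len(subject)) - word_size + 1
    return sum(query[i:i + word_size] == subject[i:i + word_size]
               for i in range(max(n_windows, 0)))


def mutate_every(seq, step, offset):
    """Return seq with a substitution at every position where i % step == offset."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[base] if i % step == offset else base
                   for i, base in enumerate(seq))


# Made-up 200 bp query and a copy with one substitution every 20 bases (95% identity).
query = "ACGTTGCA" * 25
subject = mutate_every(query, step=20, offset=10)

for word_size in (11, 28):
    print(word_size, count_seed_words(query, subject, word_size))
```

With a mismatch every 20 bases no 28-base window is free of differences, so a large word size finds no seeds at all, while a word size of 11 still finds many. Real BLAST seeding looks for words at any offset and then extends them, so treat this only as intuition for why megaBLAST suits highly similar sequences.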
Viewing alignments in ACT
Start ACT from the command line:
$ act &
Use the “File”, “Open...” menu item
Increase the Number of Comparisons
Use more files ...
Select chromosome sequences
Add BLAST/megaBLAST results
Zoom Out
Remove Weak Matches
Use filter sliders
MUMmer
• MUMmer is a suite of alignment programs and scripts
• mummer, promer, nucmer, etc.
• Very different to BLAST (suffix tree alignment) - very fast
• Extremely flexible
• Used for genome comparisons, assemblies, scaffolding, repeat
detection, etc.
• Forms the basis for other aligners/assemblers
Run a MUMmer Comparison
Create a new sub-directory for MUMmer output.
$ pwd
.../data/workshop/chromosomes
$ mkdir nucmer_out
Run nucmer to create chrA_NC_004547.delta
$ nucmer --prefix=nucmer_out/chrA_NC_004547 chrA.fasta NC_004547.fna
Then filter this file to generate a coordinate table for visualisation
$ delta-filter -q nucmer_out/chrA_NC_004547.delta > nucmer_out/chrA_NC_004547.filter
$ show-coords -rcl nucmer_out/chrA_NC_004547.filter > nucmer_out/chrA_NC_004547_filtered.coords
Run a MUMmer Comparison
MUMmer output is very different from BLAST output
$ head nucmer_out/chrA_NC_004547_filtered.coords
...
Run a MUMmer Comparison
Use a one-line shell command to convert to ACT-friendly format:
$ tail -n +6 nucmer_out/chrA_NC_004547_filtered.coords | awk '{print $7" "$10" "$1" "$2" "$12" "$4" "$5" "$13}' > chrA_mummer_NC_004547.crunch
$ head chrA_mummer_NC_004547.crunch
2526 82.49 15 2540 4727782 4985117 4982588 5064019
2944 82.29 2676 5619 4727782 4982544 4979600 5064019
85 95.29 11092 11176 4727782 758690 758774 5064019
1356 81.69 17446 18801 4727782 77639 78994 5064019
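The same conversion can be written as a short Python function, which is easier to read and to extend than the one-liner. This sketch assumes the column layout implied by the awk command: whitespace-separated fields with "|" separators counted as fields, and five header lines to skip. The example input line is reconstructed from the first output row above; its second length and the coverage values are placeholders.

```python
def coords_to_crunch(line):
    """Reorder one show-coords line into the space-separated crunch layout."""
    fields = line.split()
    # Zero-based equivalents of awk's $7 $10 $1 $2 $12 $4 $5 $13
    picks = (6, 9, 0, 1, 11, 3, 4, 12)
    return " ".join(fields[i] for i in picks)


def convert(lines, header_lines=5):
    """Skip the header (as tail -n +6 does) and convert the remaining lines."""
    return [coords_to_crunch(line) for line in lines[header_lines:] if line.strip()]


# Placeholder header plus one data line; 2530 and 0.05 are made-up values.
example = ["header"] * 5 + [
    "      15     2540  |  4985117  4982588  |     2526     2530  |"
    "    82.49  |  4727782  5064019  |     0.05     0.05  | chrA\tNC_004547"
]
print(convert(example)[0])
```

To convert the real file, read it with open(...).readlines(), pass the list to convert(), and write the result out one row per line.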
Select Files
Select your chromosome, and the megaBLAST/MUMmer output
View Basic Alignment
Filter Weak BLAST Matches
Genome Alignments
• Alignment result depends on algorithm, and parameter choices
• Some algorithms/parameter sets more sensitive than others
• Appropriate visualisation is essential
Much more detail at http://www.slideshare.net/leightonp/
comparative-genomics-and-visualisation-part-1
Table of Contents
Introduction
Workshop Data
Gene Prediction
Genome Comparisons
Gene Comparisons
Conclusions
Reciprocal Best BLAST Hits (RBBH)
• To compare our genecall proteins to the NC_004547.faa reference
set...
• BLAST reference proteins against our proteins
• BLAST our proteins against reference proteins
• Pairs with each other as best BLAST Hit are called RBBH
One-way BLAST vs RBBH
One-way BLAST includes many low-quality hits
One-way BLAST vs RBBH
Reciprocal best BLAST hits remove many low-quality matches
Reciprocal Best BLAST Hits (RBBH)
• Pairs with each other as best BLAST hit are called RBBH
• Should filter on percentage identity and alignment length
• RBBH pairs are candidate orthologues
• (most orthologues will be RBBH, but the relationship is
complicated)
• Outperforms OrthoMCL, etc. (why and how is beyond the scope of
this course...)
http://dx.doi.org/10.1093/gbe/evs100
http://dx.doi.org/10.1371/journal.pone.0018755
(We have a tool for this on our in-house Galaxy server)
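The RBBH idea can be sketched in a few lines of Python. This is a minimal illustration and not the Galaxy tool mentioned above: it takes hits as (query, subject, bitscore) tuples, uses bitscore alone to pick the best hit, and skips the identity and length filtering recommended on the slide. All sequence names and scores are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    forward_best = best_hits(forward)
    reverse_best = best_hits(reverse)
    return sorted((a, b) for a, b in forward_best.items()
                  if reverse_best.get(b) == a)


# Hypothetical hits: our proteins vs the reference, and the reverse search.
ours_vs_ref = [("geneA", "ref1", 500.0), ("geneA", "ref2", 120.0),
               ("geneB", "ref2", 300.0), ("geneC", "ref1", 80.0)]
ref_vs_ours = [("ref1", "geneA", 495.0), ("ref2", "geneB", 310.0),
               ("ref2", "geneA", 110.0)]

print(reciprocal_best_hits(ours_vs_ref, ref_vs_ours))
```

Here geneC has ref1 as its best hit, but ref1's best hit is geneA, so geneC is dropped. This is how the reciprocal requirement removes many of the low-quality one-way matches.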
Table of Contents
Introduction
Workshop Data
Gene Prediction
Genome Comparisons
Gene Comparisons
Conclusions
In Conclusion
• The tools you will need to use will be task-dependent, but
some things are universal. . .
• Good experimental design (including BLAST searches, etc.)
• Keeping accurate records for reproduction/replication
• Validation/sanity checking of results
• Comparison and benchmarking of methods
• (Cross-)validation of predictive methods
Remember: everything gets easier with practice, so practice
lots!

Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 

Último (20)

Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 

Introduction to Bioinformatics

  • 1. Introduction to Bioinformatics Part 0: So You Want To Be a Computational Biologist? Leighton Pritchard and Peter Cock
  • 4. What is this “bioinformatics” thing, anyway? • Bioinformatics: biology using computational and mathematical tools • A discipline within biology • Loman & Watson (2013) “So you want to be a computational biologist?” http://dx.doi.org/10.1038/nbt.2740 • Welch et al. (2014) “Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies” http://dx.doi.org/10.1371/journal.pcbi.1003496 • Watson (2014) “The only core competency you need” http://bit.ly/1fS4iDJ (blog)
  • 7. Some uncomfortable truths • This one-day course will not make you a bioinformatician • But practice will. . . • The best way to learn is to do (“I don’t know how to do this yet, but I will find out.”) • http://bit.ly/Rq0D61 (“Bioinformatics is a way of life”) • Most bioinformatics is problem-solving • We will introduce some useful tools and concepts
  • 8. What it takes to be a bioinformatician • Patience (problem-solving) • Suspicion (statistics) • Biological knowledge • Social skills (no-one knows everything: ask!) • Lots of practice • Self-confidence (challenge results and dogma) • Core domain skills: biology, computer science, statistics • Deliver results (qualified, honest) • Watson (2014) “What it takes to be a bioinformatician” http://bit.ly/1jDuQsO (blog)
  • 9. More general advice? • Ask us (we do this a lot) • BioStars (https://www.biostars.org) • SeqAnswers (http://seqanswers.com/) • PLoS Comp Biol collections (http://www.ploscollections.org/static/pcbiCollections)
  • 11. Why Do It? • Doing bioinformatics is doing science: keep a lab book! • You will not remember multiple files, analysis details, etc. in a week/month/six months/a year/three years • Noble (2009) http://dx.doi.org/10.1371/journal.pcbi.1000424 • Baggerly & Coombes (2009) http://arxiv.org/pdf/1010.1092.pdf
  • 12. How To Do It? I • There is no one correct way, but. . . • Think about data/docs/project structure before you start
  • 13. How To Do It? II • Use plain text where possible • Use version control • Keep backups • Different tools suit different purposes: code vs. data vs. analysis vs. . . . • Find a way that works for you.
  • 14. How To Do It? III • Reproducibility is key! • Scripts and pipelines are better for this than notes of what you did • Also better for version control, and reuse • Avoid unnecessary duplication • Someone else may have solved your problem • One (backed up) read-only copy of raw data, keep analyses separate
  • 15. Plain Text Files • README.txt/README.md in each directory/folder • Plain text is always human-readable • Markdown (https://daringfireball.net/projects/markdown/basics) • RST (http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html)
  • 16. Galaxy workflows • Use through browser, graphical interface • Reproducible, shareable, documented, reusable analyses • Wraps standard bioinformatics tools • Local instance (http://ppserver/galaxy) uses JHI cluster
  • 17. script • Writes your terminal activity to a plain text file • Saves effort copy/pasting and typing commands into a lab book, as you go • Easy to use with other tools • use man script at your terminal to find out more
  • 18. MediaWiki • Useful for shared projects/data • Automatic version control and attribution • Many local instances at JHI (ask around)
  • 19. A language notebook • e.g. IPython Notebook, Mathematica, MATLAB cells • Integrates live code and analysis with lab-book
  • 20. LaTeX • Powerful, versatile typesetting system (e.g. these slides) • Similar to markup/markdown • Pros: great for mathematical/computing work, writing a thesis • Cons: not easy to pick up
  • 22. In Conclusion • Bioinformatics is just biology using computers and mathematics • You still need to “do science” in the same way: • Keep accurate records • Plan and conduct experiments (with controls) • Follow the literature • Professional development
  • 23. An Introduction to Bioinformatics Tools Part 1: Golden Rules of Bioinformatics Leighton Pritchard and Peter Cock
  • 24. On Confidence “Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert. . .” - Charles Darwin
  • 25. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 26. Zeroth Golden Rule of Bioinformatics • No-one knows everything about everything - talk to people! • local bioinformaticians, mailing lists, forums, Twitter, etc. • Keep learning - there are lots of resources • There is no free lunch - no method works best on all data • The worst errors are silent - share worries, problems, etc. • Share expertise (see first item)
  • 28. First Golden Rule of Bioinformatics • Always inspect the raw data (trends, outliers, clustering) • What is the question? Can the data answer it? • Communicate with data collectors! (don’t be afraid of pedantry) • Who? When? How? • You need to understand the experiment to analyse it (easier if you helped design it). • Be wary of block effects (experimenter, time, batch, etc.)
  • 30. Second Golden Rule of Bioinformatics • Do not trust the software: it is not an authority • Software does not distinguish meaningful from meaningless data • Software has bugs • Algorithms have assumptions, conditions, and applicable domains • Some problems are inherently hard, or even insoluble • You must understand the analysis/algorithm • Always sanity test • Test output for robustness to parameter (including data) choice
  • 32. Third Golden Rule of Bioinformatics • Everyone has expectations of their data/experiment • Beware cognitive errors, such as confirmation bias! • System 1 vs. System 2 ≈ intuition vs. reason • Think statistically! • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses • Always account for multiple tests. • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise • Use test-driven development of analyses and code • Use examples that pass and fail
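One simple way to account for multiple tests is the Bonferroni correction: multiply each raw p-value by the number of tests performed. The sketch below is only an illustration; the p-values are made up, and other corrections (e.g. false discovery rate methods) may suit your data better.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each raw p-value."""
    n_tests = len(p_values)
    results = []
    for p in p_values:
        # Multiply by the number of tests, capping at 1.0
        adjusted = min(1.0, p * n_tests)
        results.append((adjusted, adjusted <= alpha))
    return results


raw = [0.001, 0.02, 0.04, 0.30]  # made-up p-values, for illustration only
for p, (adj, sig) in zip(raw, bonferroni(raw)):
    print(f"raw={p:.3f} adjusted={adj:.3f} significant={sig}")
```

Note how results that look "significant" on their own (0.02, 0.04) no longer pass once four tests are accounted for.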
  • 34. In Conclusion • Always communicate! • worst errors are silent • Don’t trust the data • formatting/validation/category errors - check! • suitability for scientific question • Don’t trust the software • software is not an authority • always benchmark, always validate • Don’t trust yourself • beware cognitive errors • think statistically • biological “stories” can be constructed from nonsense
  • 35. An Introduction to Bioinformatics Tools Part 2: BLAST Leighton Pritchard and Peter Cock
  • 37. Learning Outcomes • How BLAST searches work • How the way BLAST searches work affects your results • Why search parameters matter • Setting search parameters
  • 39. A Recent Twitter Conversation
  • 41. Why So Much Detail? • You’re going to go away and do lots of BLAST searches • Everyone uses BLAST - not everyone uses it well • Easier to fix problems if you know how it works • Understanding what’s going on helps avoid misuse/abuse • Understanding what’s going on helps use the tool more effectively • Not so much detail, really • like knowing about Tm and ion concentration effects, not molecular orbitals or thermodynamics (but ask if you’re interested ;) )
  • 44. What BLAST Is • BLAST: • Basic (it’s actually sophisticated) • Local Alignment (what it does: local sequence alignment) • Search Tool (what it does: search against a database) • The most important software package in bioinformatics? • Fast, robust, sequence similarity search tool • Does not necessarily produce optimal alignments • Not foolproof.
  • 45. What A BLAST Search Is • Every BLAST search is an in silico hybridisation experiment • BLAST search = identification of similar sequences in a given database • Results depend on: • query sequence • BLAST program (including version and BLAST vs BLAST+) • database • parameters
  • 46. Alignment Search Space Consider two biological sequences to be aligned. . . • One sequence on the x-axis, the other on the y-axis • Each point in space is a pairing of two letters • Ungapped alignments are diagonal lines in the search space, gapped alignments have short “breaks” • There may be one or more “optimal” alignments
  • 48. Global vs Local Alignment • Global alignment: sequences are aligned along their entire lengths • Local alignment: the best subsequence alignment is found • Consider an alignment of the same gene from two distantly-related eukaryotes, where: • Exons are conserved and small in relation to gene locus size • Introns are not well-conserved but large in relation to gene locus size • Local alignment will align the conserved exon regions • Global alignment will align the whole (mostly unrelated) locus
  • 51. Our Goal • We aim to align the words • COELACANTH • PELICAN • Each identical letter (match) scores +1 • Each different letter (mismatch) scores -1 • Each gap scores -1 • All sequence alignment is maximisation of an alignment score - a mathematical operation.
  • 54. Fill the matrix – represents all possible alignments & scores
  • 58. Algorithms • Global: Needleman-Wunsch (as in example) • Local: Smith-Waterman (differs from example) • Biological information encapsulated only in the scoring scheme (matches, mismatches, gaps) • NW/SW are guaranteed to find the optimal match with respect to the scoring system being used • BUT the optimal alignment is a biological approximation: no scoring scheme encapsulates biological “truth” • Any pair of sequences can be aligned: finding meaning is up to you
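The matrix-filling idea can be sketched in a few lines of Python. This is a minimal illustration rather than the slides' own worked example: it uses the scoring scheme given earlier (match +1, mismatch -1, gap -1), returns only the best score (no traceback), and the function name `alignment_score` is made up here. With `local=True` it follows the Smith-Waterman rule of never letting a cell drop below zero.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment: an alignment may restart anywhere
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print("global:", alignment_score("COELACANTH", "PELICAN"))
print("local: ", alignment_score("COELACANTH", "PELICAN", local=True))
```

Changing `match`, `mismatch` or `gap` changes which alignment is "optimal", which is the point of the slide: the biology lives in the scoring scheme.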
  • 63. BLAST Is A Heuristic • BLAST does not use Needleman-Wunsch or Smith-Waterman • BLAST approximates dynamic programming methods • BLAST is not guaranteed to give a mathematically optimal alignment • BLAST does not explore the complete search space • BLAST uses heuristics (loosely-defined rules) to refine High-scoring Segment Pairs (HSPs) • BLAST reports only “statistically-significant” alignments (dependent on parameters)
  • 64. Steps in the Algorithm 1. Seeding 2. Extension 3. Evaluation
  • 65. Word Hits • A word hit is a short sequence and its neighbourhood • neighbourhood: words of same length whose aligned score is greater than or equal to a threshold value T • Three parameters: scoring matrix, word size W , and T
  • 66. Seeding • BLAST assumption: significant alignments have words in common • BLAST finds word (neighbourhood) hits in the database index • Word hits are used to seed alignments
  • 67. Seeding Controls Sensitivity • Word size W controls number of hits (smaller words =⇒ more hits) • Threshold score T controls number of hits (lower threshold =⇒ more hits) • Scoring matrix controls which words match
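To see how W and T control the number of word hits, here is a toy sketch that enumerates the neighbourhood of a word. It is an assumption-laden illustration: it uses a simple +1/-1 match/mismatch score over a DNA alphabet in place of a real substitution matrix, and the function name `neighbourhood` is made up here.

```python
from itertools import product


def neighbourhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All words of the same length scoring >= threshold against `word`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        # Toy scoring: +1 per identical letter, -1 per different letter
        score = sum(match if x == y else mismatch for x, y in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


for t in (3, 1, -1):
    print(f"T={t}: {len(neighbourhood('ACG', t))} words in the neighbourhood")
```

Lowering T admits more neighbouring words, so more seeds are found and the search becomes more sensitive (and slower).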
  • 68. The Two-Hit Algorithm • BLAST assumption: word hits cluster on the diagonal for significant alignments • The acceptable distance A between words on the diagonal is a parameter of your model • Smaller distances isolate single words, and reduce search space
  • 69. Extension • The best-scoring seeds are extended in each direction • BLAST does not explore the complete search space, so a rule (heuristic) to stop extension is needed • Two-stage process: • Extend, keeping alignment score, and drop-off score • When drop-off score reaches a threshold X, trim alignment back to top score
  • 71. Example • Consider two sentences (match=+1, mismatch=-1) • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Extend to the right from the seed T • The quic • The quie • 123 4565 <- score • 000 0001 <- drop-off score
  • 72. Example • Consider two sentences (match=+1, mismatch=-1) • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Extend to drop-off threshold • The quick brown fox jump • The quiet brown cat purr • 123 45654 56789 876 5654 <- score • 000 00012 10000 123 4345 <- drop-off score
  • 73. Example • Consider two sentences (match=+1, mismatch=-1) • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Trim back from drop-off threshold to get optimal alignment • The quick brown • The quiet brown • 123 45654 56789 <- score • 000 00012 10000 <- drop-off score
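The extend-then-trim rule from this example can be sketched as follows. This is an ungapped, right-only toy version, not BLAST's implementation; like the example it leaves aligned spaces unscored, and the function name `extend_right` is made up here.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Extend from the first letter; stop when the drop-off reaches x_drop.

    Returns the top score and both sequences trimmed back to that score.
    """
    score = best = best_end = 0
    for pos, (q, s) in enumerate(zip(query, subject)):
        if q == " " and s == " ":
            continue  # aligned spaces are not scored in the example
        score += match if q == s else mismatch
        if score > best:
            best, best_end = score, pos + 1
        if best - score >= x_drop:
            break  # drop-off threshold X reached: stop extending
    return best, query[:best_end], subject[:best_end]


a = "The quick brown fox jumps over the lazy dog."
b = "The quiet brown cat purrs when she sees him."
print(extend_right(a, b, x_drop=5))
print(extend_right(a, b, x_drop=2))
```

With X = 5 the extension survives the "ck"/"et" mismatches and is trimmed back to "The quick brown"; with X = 2 it stops early at "The qui". A smaller X is faster but can miss the better alignment.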
  • 74. Notes on implementation • X controls termination of alignment extension, but dependent on: • substitution matrix • gap opening and extension parameters
  • 75. Evaluation • The principle is easy: use a score threshold S to determine strong and weak alignments • S is monotonic with E, so an equivalent threshold can be calculated • Score S is independent of database size and search space. E values are not. • Alignment consistency of HSPs is also a factor in the report
  • 77. Log-odds Matrices • Substitution matrices are your model of evolution • Substitution matrices are log-odds matrices • Positive numbers indicate likely substitutions/similarity • Negative numbers indicate unlikely substitutions/dissimilarity BLOSUM62
  • 78. Choice of Matrix • Substitution matrix determines the raw alignment score S • S is the sum of pairwise scores in an alignment • BLAST provides, for proteins: • BLOSUM45 BLOSUM50 BLOSUM62 BLOSUM80 BLOSUM90 • PAM30 PAM70 PAM250 • BLOSUM matrices empirically defined from multiple sequence alignments of ≥ n% identity, for BLOSUMn • For nucleotides: ‘matrix’ defined by match/mismatch (reward/penalty) parameters
  • 79. Definition • The Karlin-Altschul equation $E = kmn\,e^{-\lambda S}$ • Symbols: • k: minor constant, adjusts for correlation between alignments • m: number of letters in query sequence • n: number of letters in the database • λ: scoring matrix scaling factor • S: raw alignment score
  • 80. Interpretation • The Karlin-Altschul equation $E = kmn\,e^{-\lambda S}$ • E is the number of alignments of a similar score expected by chance when querying a database of the same size and letter frequency, where the letters in that database are randomly-ordered • Small changes in score S can produce large changes in E • BUT biological sequence databases are not random!
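A quick numerical sketch of the Karlin-Altschul equation shows how sensitive E is to S and to database size. The default values of k and λ below are illustrative assumptions only (the real values depend on the scoring matrix and gap penalties), and the function name `expected_hits` is made up here.

```python
import math


def expected_hits(raw_score, query_len, db_len, k=0.041, lam=0.267):
    """E = k * m * n * exp(-lambda * S); k and lam are illustrative values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


m = 300      # letters in the query (made-up)
n = 10 ** 9  # letters in the database (made-up)
for s in (40, 50, 60, 70):
    print(f"S={s}: E={expected_hits(s, m, n):.3g}")
```

Each extra 10 points of raw score shrinks E by a factor of exp(10λ), while doubling the database size doubles E for the same alignment.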
  • 82. Multiple BLAST tools • BLASTN vs MEGABLAST vs TBLASTX vs ...? • Korf et al. (2003) BLAST is really good for the theory, but its practical examples are dated due to changes with BLAST+
  • 83. Multiple flavours of BLAST • NCBI “legacy” BLAST • Now obsolete and not being updated • Spawned offshoots including: • WU-BLAST aka AB-BLAST (commercial) • MPI-BLAST for use on clusters • Versions to run on graphics cards • NCBI BLAST+ • Re-written in 2009 using C++ instead of C • Many improvements • Slightly different output • Different commands used to run it
  • 84. Multiple ways to run BLAST • BLAST+ at the command line (today) • Via a script or programming language • Via a graphical tool like BioEdit, CLCbio, Blast2GO • Via the NCBI website • Via a genome consortium website • Via a Galaxy web server • etc • Offers flexibility but different settings/options/versions
  • 85. Multiple places to run BLAST • On the NCBI servers, e.g. via website or tool • On 3rd party servers, e.g. via websites • On your own computer • On our Linux cluster
  • 86. Core BLAST tools: Query sequences vs Database • Nucleotide vs Nucleotide: • blastn (covering blastn, megablast, dc-megablast) • Translated nucleotide vs Protein: • blastx • Protein vs Translated nucleotide: • tblastn • Protein vs Protein: • blastp, psiblast, phiblast, deltablast See http://blast.ncbi.nlm.nih.gov/ for a reminder ;)
  • 87. The BLAST tools have built-in help
      $ blastp -h
      USAGE
        blastp [-h] [-help] [-import_search_strategy filename]
          [-export_search_strategy filename] [-task task_name] [-db database_name]
          [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
          [-negative_gilist filename] [-entrez_query entrez_query]
          [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
          [-subject subject_input_file] [-subject_loc range] [-query input_file]
          [-out output_file] [-evalue evalue] [-word_size int_value]
          [-gapopen open_penalty] [-gapextend extend_penalty]
          [-xdrop_ungap float_value] [-xdrop_gap float_value]
          [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value]
          [-sum_statistics] [-seg SEG_options] [-soft_masking soft_masking]
          [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
          ...
          [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped]
          [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

      DESCRIPTION
         Protein-Protein BLAST 2.2.29+

      Use '-help' to print detailed descriptions of command line arguments
  • 88. Minimal example of BLAST+ at the command line
      $ blastp -query my_input.fasta -db my_database -out my_output.txt
    • Replace blastp with the appropriate tool, e.g. blastn • Replace my_input.fasta with your actual filename • Replace my_database with your actual database, e.g. nr • Replace my_output.txt with your desired output filename • Best to avoid spaces in your folder and filenames! e.g.
      $ blastp -query query.fasta -db dbA -out my_output.txt
  • 89. Setting the BLAST+ output format
      $ blastp -help
      USAGE
      ...

       *** Formatting options
       -outfmt <String>
         alignment view options:
           0 = pairwise,
           1 = query-anchored showing identities,
           2 = query-anchored no identities,
           3 = flat query-anchored, show identities,
           4 = flat query-anchored, no identities,
           5 = XML Blast output,
           6 = tabular,
           7 = tabular with comment lines,
           8 = Text ASN.1,
           9 = Binary ASN.1,
          10 = Comma-separated values,
          11 = BLAST archive format (ASN.1)

         ...
         Default = ‘0’
      ...
  • 90. Setting the BLAST+ output format Default is plain text pairwise alignments, for humans:
      $ blastp -query query.fasta -db dbA -out my_output.txt
      ...
    XML output can be useful (e.g. for BLAST2GO):
      $ blastp -query query.fasta -db dbA -out my_output.xml -outfmt 5
      ...
    Tabular output is easiest to filter, sort, etc:
      $ blastp -query query.fasta -db dbA -out my_output.tab -outfmt 6
      ...
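As an illustration of why tabular output is easy to work with, the sketch below filters and sorts hits in Python. It assumes the default twelve columns of -outfmt 6 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the three example rows are made up, and `read_hits` is a hypothetical helper name.

```python
import csv
import io

# Default column order for -outfmt 6 (and 7, which adds '#' comment lines)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle, max_evalue=1e-5):
    """Return hits with E-value <= max_evalue, best bit score first."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Made-up rows standing in for my_output.tab
example = io.StringIO(
    "q1\tsA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
    "q1\tsB\t35.0\t80\t50\t2\t10\t89\t40\t119\t0.5\t30.1\n"
    "q1\tsC\t99.0\t210\t2\t0\t1\t210\t1\t210\t1e-110\t400\n"
)
hits = read_hits(example)
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"], hit["bitscore"])
```

For a real file, replace the `io.StringIO` object with `open("my_output.tab")`.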
  • 91. Setting the e-value threshold Check the built-in help:
      $ blastp -help
      USAGE
      ...
       -evalue <Real>
         Expectation value (E) threshold for saving hits
         Default = ‘10’
      ...
    Example using 0.00001, i.e. 1 × 10^-5, written in scientific notation as 1e-5:
      $ blastp -query query.fasta -db dbA -out my_output.txt -evalue 1e-5
      ...
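If you run BLAST+ from a script, building the command as a list keeps the settings explicit and easy to record. This is a minimal sketch using only the options shown on these slides; `blast_command` is a hypothetical helper, and actually running the command requires BLAST+ to be installed and the query and database to exist.

```python
import shlex


def blast_command(tool, query, db, out, evalue=None, outfmt=None):
    """Build a BLAST+ command line as a list of arguments."""
    cmd = [tool, "-query", query, "-db", db, "-out", out]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    if outfmt is not None:
        cmd += ["-outfmt", str(outfmt)]
    return cmd


cmd = blast_command("blastp", "query.fasta", "dbA", "my_output.tab",
                    evalue=1e-5, outfmt=6)
# Print the exact command so it can be pasted into your lab book
print(shlex.join(cmd))

# To run it for real (assumes blastp is on your PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Recording the printed command alongside the BLAST+ version and database used makes the search reproducible.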
  • 92. In Conclusion • Every BLAST search is an experiment • Badly-designed searches can give you bad results • Knowing how BLAST works helps improve search design • BLAST results still require inspection and interpretation
  • 93. An Introduction to Bioinformatics Tools Part 3: Workshop Leighton Pritchard and Peter Cock
  • 94. Table of Contents Introduction Workshop Data Gene Prediction Genome Comparisons Gene Comparisons Conclusions
  • 95. Learning Outcomes • Workshop example: bacterial genome annotation (because bacterial genomes are small and the data are easy to handle) • The role of biological insight in a bioinformatics workflow • Visual interaction with sequence data • Using alternative tools • Comparison of tools and outputs • Online tools for automated function prediction
  • 96. What You Will Be Doing Illustrative example of concepts: Functional annotation of a draft bacterial genome 1. Gene prediction 2. Genome comparisons 3. Gene comparisons
  • 98. Locate your data • You are in group A, B, C or D - this decides your chromosome sequence: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta • Each sequence represents a single stitched, ordered draft bacterial genome comprising a number of contigs. • You will use your sequence as the basis of the exercises in the workshop.
  • 99. Locate your data • You are in group A, B, C or D - this decides your dataset: chrA.fasta, chrB.fasta, chrC.fasta, chrD.fasta • You also have a GFF file describing the location of assembled contigs: chrA_contigs.gff, chrB_contigs.gff, chrC_contigs.gff, chrD_contigs.gff
  • 100. Inspect the data
      $ head -n 3 chrA.fasta
      >chrA
      ttttcttgattgaccttgttcgagtggagtccgccgtgtcactttcgctttggcagcagt
      gtcttgcccgtttgcaggatgagttacctgccacagaattcagtatgtggatacgcccgt
      $ head -n 3 chrA_contigs.gff
      ##gff-version 3
      chrA stitching contig 1 154993 . . . ID=contig00005_b;Name=contig00005_b
      chrA stitching contig 155036 241491 . . . ID=contig00018;Name=contig00018
  • 101. Inspect the data Starting Artemis
      $ art &
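The contig GFF can also be inspected with a few lines of Python. This sketch assumes standard tab-separated GFF3 with nine columns and uses the two contig lines shown above as its input; `parse_gff_line` is a hypothetical helper, not part of the workshop materials.

```python
def parse_gff_line(line):
    """Split one GFF3 feature line into a small dictionary."""
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.rstrip("\n").split("\t")
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {"seqid": seqid, "type": ftype,
            "start": int(start), "end": int(end), "attributes": attributes}


gff_text = (
    "##gff-version 3\n"
    "chrA\tstitching\tcontig\t1\t154993\t.\t.\t.\tID=contig00005_b;Name=contig00005_b\n"
    "chrA\tstitching\tcontig\t155036\t241491\t.\t.\t.\tID=contig00018;Name=contig00018\n"
)
contigs = [parse_gff_line(line) for line in gff_text.splitlines()
           if line and not line.startswith("#")]

for contig in contigs:
    # GFF coordinates are 1-based and inclusive
    length = contig["end"] - contig["start"] + 1
    print(contig["attributes"]["ID"], length)

# Bases between consecutive contigs, i.e. not covered by either feature
gap = contigs[1]["start"] - contigs[0]["end"] - 1
print("bases between the first two contigs:", gap)
```

The uncovered bases between neighbouring contigs are a good place to start looking at how the contigs were joined.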
  • 102. Load the chromosome sequence Select the sequence for your group
  • 105. Load the contig GFF Select the file for your group
  • 107. Find the stitching sequence The contigs are stitched with a specific sequence: see if you can find, and identify it.
  • 109. Lines of Evidence • ab initio genecalling: • Unsupervised methods - not trained on a dataset • Supervised methods - trained on a dataset • homology matches • alignment to genes from related organisms (annotation transfer) • from known gene products (e.g. proteins, ncRNA) • from transcripts/other intermediates (e.g. ESTs, cDNA, RNAseq)
  • 110. Consensus Methods • Combine weighted evidence from multiple sources, using linear combination or graph theoretical methods • For eukaryotes: • EVM http://evidencemodeler.sourceforge.net/ • Jigsaw http://www.cbcb.umd.edu/software/jigsaw/ • GLEAN http://sourceforge.net/projects/glean-gene/
  • 111. Basic Gene Finding • We could use Artemis to identify the longest coding region in each ORF, but this takes many manual steps • This is the most basic form of gene finding, and is easily automated, e.g. with EMBOSS getorf • Dedicated gene finders are usually more appropriate...
  • 112. Finding Open Reading Frames • ORF finding is naive; it does not consider: • Start codon • Splicing • Promoter/RBS motifs • Wider context (e.g. overlapping genes)
  • 113. Prokaryotic Prediction Methods • Prokaryotes “easier” than eukaryotes for gene prediction • Less uncertainty in predictions (isoforms, gene structure) • Very gene-dense (over 90% of chromosome is coding sequence) • No intron-exon structure • Problem is: “which possible ORF contains the true gene, and which start site is correct?” • Still not a solved problem
  • 114. Two ab initio Prokaryotic Prediction Methods You will be using two tools • Glimmer • Interpolated Markov models • Can be trained on “gold standard” datasets • Prodigal • Log-likelihood model based on GC frame plots, followed by dynamic programming • Can be trained on “gold standard” datasets
  • 115. Using Glimmer Supervised - we train on a related complete genome sequence, then run glimmer3
      $ build-icm -r NC_004547.icm < NC_004547.ffn
      $ glimmer3 -o 50 -g 110 -t 30 chrA.fasta NC_004547.icm chrA_glimmer3
    • -o 50: max overlap bases • -g 110: min gene length • -t 30: threshold score
  • 116. Using Glimmer glimmer3 output is not standard GFF format:
      $ head -n 4 chrA_glimmer3.predict
      >chrA
      orf00001 36 1430 +3 8.81
      orf00002 1435 2535 +1 11.51
      orf00005 2676 3761 +3 8.63
    We could Google for help, or use the provided conversion script:
      $ python glimmer_to_gff.py chrA_glimmer3.predict
  • 117. Using Glimmer We now have output in GFF
      $ head -n 3 chrA_glimmer3.gff
      chrA Glimmer CDS 36 1430 8.81 + 0 ID=orf00001;Name=orf00001
      chrA Glimmer CDS 1435 2535 11.51 + 0 ID=orf00002;Name=orf00002
      chrA Glimmer CDS 2676 3761 8.63 + 0 ID=orf00005;Name=orf00005
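The workshop's glimmer_to_gff.py is not reproduced in these slides, so the following is only a sketch of what such a conversion might look like, written to reproduce the GFF lines shown above. The function name is hypothetical, and it assumes that reverse-strand predictions are listed with start greater than end and that no gene wraps around the origin.

```python
def glimmer_predict_to_gff(lines, source="Glimmer"):
    """Convert glimmer3 .predict lines into tab-separated GFF-style CDS lines."""
    gff_lines = []
    seqid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line names the sequence the following ORFs belong to.
            seqid = line[1:].split()[0]
            continue
        orf_id, start, end, frame, score = line.split()[:5]
        strand = "+" if frame.startswith("+") else "-"
        # Assumption: GFF wants start <= end, whatever the strand.
        low, high = sorted((int(start), int(end)))
        attributes = f"ID={orf_id};Name={orf_id}"
        gff_lines.append("\t".join(
            [seqid, source, "CDS", str(low), str(high), score, strand, "0", attributes]
        ))
    return gff_lines
```

Reading the file with `open("chrA_glimmer3.predict")` and passing the handle to the function should give one GFF line per predicted ORF.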
  • 118. Using Prodigal Unsupervised (i.e. untrained) mode
      $ prodigal -f gff -o chrA_prodigal.gff -i chrA.fasta
  • 119. Using Prodigal Prodigal GFF output is correctly formatted and informative
      $ head -n 6 chrA_prodigal.gff
      ##gff-version 3
      # Sequence Data: seqnum=1;seqlen=4727782;seqhdr="chrA"
      # Model Data: version=Prodigal.v2.50;run_type=Single;model="Ab initio";gc_cont=54.48;transl_table=11;uses_sd=1
      chrA Prodigal_v2.50 CDS 3 1430 188.5 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;score=188.54;cscore=185.37;sscore=3.18;rscore=0.00;uscore=3.18;tscore=0.00
      chrA Prodigal_v2.50 CDS 1435 2535 185.6 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=185.61;cscore=184.24;sscore=1.36;rscore=-7.73;uscore=3.48;tscore=4.37
      chrA Prodigal_v2.50 CDS 2676 3761 146.2 + 0 ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;score=146.19;cscore=149.82;sscore=-3.63;rscore=-7.73;uscore=-0.28;tscore=4.37
  • 123. Comparing predictions in Artemis Do ORF(orange)/CDS(green,blue) prediction methods agree?
  • 124. Comparing predictions in Artemis Do glimmer(green)/prodigal(blue) CDS prediction methods agree? How do we know which (if either) is best?
  • 125. Using a “Gold Standard” A general approach for all predictive methods • Define a known, “correct” set of true/false, positive/negative etc. examples - the “gold standard” • Evaluate your predictive method against that set for sensitivity, specificity, accuracy, precision, etc. Many evaluation methods are available; covering them is beyond the scope of this introduction
  • 126. Contingency Tables
                                Condition (Gold standard)
                                True              False
      Test outcome   Positive   True Positive     False Positive
                     Negative   False Negative    True Negative
      Sensitivity = TPR = TP/(TP + FN)
      Specificity = TNR = TN/(FP + TN)
      FPR = 1 − Specificity = FP/(FP + TN)
    If you don’t have this information, you can’t interpret predictive results properly.
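The three rates above can be computed directly from the four cells of the table. This is a small sketch; the function name and the counts in the usage note are invented for illustration.

```python
def contingency_rates(tp, fp, fn, tn):
    """Return sensitivity, specificity and false positive rate from cell counts."""
    return {
        "sensitivity": tp / (tp + fn),   # TPR = TP/(TP + FN)
        "specificity": tn / (fp + tn),   # TNR = TN/(FP + TN)
        "fpr": fp / (fp + tn),           # FPR = 1 - specificity
    }
```

For example, `contingency_rates(tp=95, fp=10, fn=5, tn=990)` gives a sensitivity of 0.95 and an FPR of 0.01, the same values used for the hypothetical test on the following slides.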
  • 128. Why Performance Metrics Matter • You go for a checkup, and are tested for disease X • The test has sensitivity = 0.95 (predicts disease where there is disease) • The test has FPR = 0.01 (predicts disease where there is no disease) • Your test is positive • What is the probability that you have disease X? • 0.01, 0.05, 0.50, 0.95, 0.99?
  • 130. Why Performance Metrics Matter • What is the probability that you have disease X? • Unless you know the baseline occurrence of disease X, you cannot know. • Baseline occurrence: fX • fX = 0.01 ⇒ P(disease|+ve) = 0.490 ≈ 0.5 • fX = 0.8 ⇒ P(disease|+ve) = 0.997 ≈ 1.0
  • 132. Why Performance Metrics Matter • Imagine a predictor for protein functional class • Predictor has sensitivity = 0.95, FPR = 0.01 • You run the predictor on 20,000 proteins in an organism • We estimate ≈ 200 members in protein complement, so fX = 0.01 • fX = 0.01 ⇒ P(class member|+ve) = 0.490 ≈ 0.5
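Counting expected predictions rather than probabilities can make this result easier to accept. The sketch below uses only the numbers on the slide (20,000 proteins, fX = 0.01, sensitivity 0.95, FPR 0.01); the function name is an assumption.

```python
def expected_positives(n_total, prevalence, sensitivity, fpr):
    """Return the expected numbers of true and false positive predictions."""
    members = n_total * prevalence          # proteins truly in the class
    non_members = n_total - members         # everything else
    true_pos = members * sensitivity        # class members correctly flagged
    false_pos = non_members * fpr           # non-members wrongly flagged
    return true_pos, false_pos


true_pos, false_pos = expected_positives(20000, 0.01, 0.95, 0.01)
print(f"{true_pos:.0f} true positives, {false_pos:.0f} false positives")
```

With these inputs roughly 190 of the 200 real members are found, but about 198 of the 19,800 non-members are flagged too, so only about half of the positive calls are correct.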
  • 133. Bayes’ Theorem • May seem counter-intuitive: at fX = 0.01, 95% sensitivity and 99% specificity ⇒ ≈ 50% chance of any positive prediction being incorrect • Probability given by Bayes’ Theorem • $P(X|+) = \frac{P(+|X)P(X)}{P(+|X)P(X) + P(+|\bar{X})P(\bar{X})}$ • This is commonly overlooked in the literature (confirmation bias?) • e.g. in paper describing novel TTSS predictor: “The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate”
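Bayes' Theorem as written above translates into a few lines of Python. In this sketch P(+|X) is the sensitivity, P(+|X̄) is the FPR and P(X) is the baseline occurrence fX; the function name is chosen for illustration only.

```python
def posterior_given_positive(prior, sensitivity, fpr):
    """P(X|+) from Bayes' Theorem, for a test with given sensitivity and FPR."""
    true_pos = sensitivity * prior          # P(+|X) P(X)
    false_pos = fpr * (1.0 - prior)         # P(+|not X) P(not X)
    return true_pos / (true_pos + false_pos)


for f_x in (0.01, 0.8):
    print(f_x, round(posterior_given_positive(f_x, 0.95, 0.01), 3))
```

The two priors are the baseline occurrences used on the earlier slide, so the printed values can be checked against the figures quoted there.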
  • 135. Interpreting Performance Metrics • Use Bayes’ Theorem! • Predictions apply to groups, not individual members of the group. e.g. • Test for airport smugglers has P(smuggler|+) = 0.9 • Test gives 100 positives • Which specific individuals are truly smugglers? • The test does not allow you to determine this - you need more evidence for each individual • The same principle applies to all other tests (including protein functional class prediction) - you should not ‘cherry-pick’ for publication without other evidence
  • 136. “Gold Standard” results • Tested glimmer and prodigal on two “gold standards” • Manually annotated (>3 expert person years) close relative • Community-annotated close relative • Both methods trained directly on the annotated genes in each organism!
  • 137. “Gold Standard” results
      genecaller            glimmer     prodigal
      predicted             4752        4287
      missed                284 (6%)    407 (9%)
      Exact Prediction
        sensitivity         62%         71%
        FDR                 41%         25%
        PPV                 59%         75%
      Correct ORF
        sensitivity         94%         91%
        FDR                 10%         3%
        PPV                 90%         97%
  • 138. “Gold Standard” results
      genecaller            glimmer     prodigal
      predicted             4679        4467
      missed                112 (3%)    156 (3%)
      Exact Prediction
        sensitivity         62%         86%
        FDR                 31%         14%
        PPV                 69%         86%
      Correct ORF
        sensitivity         97%         97%
        FDR                 7%          3%
        PPV                 93%         97%
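The slides do not say how these figures were calculated, so the following is only a sketch of one plausible scoring scheme. It assumes that an "exact prediction" matches a reference gene in strand, start and stop, that a "correct ORF" matches in strand and stop only, and that genes are represented as hypothetical (strand, start, stop) tuples with at most one gene per stop position.

```python
def score(n_true_pos, n_predicted, n_reference):
    """Sensitivity, PPV and FDR from counts of correct calls."""
    ppv = n_true_pos / n_predicted
    return {
        "sensitivity": n_true_pos / n_reference,
        "ppv": ppv,
        "fdr": 1.0 - ppv,   # FDR and PPV always sum to 1
    }


def evaluate_calls(predicted, reference):
    """Compare sets of (strand, start, stop) gene calls to a gold standard."""
    exact_hits = len(predicted & reference)
    # Ignore the start coordinate: same strand and stop means same ORF.
    ref_ends = {(strand, stop) for strand, _start, stop in reference}
    pred_ends = {(strand, stop) for strand, _start, stop in predicted}
    orf_hits = len(pred_ends & ref_ends)
    return {
        "exact": score(exact_hits, len(predicted), len(reference)),
        "correct_orf": score(orf_hits, len(predicted), len(reference)),
    }
```

Under this scheme a call with the right stop but the wrong start counts towards "Correct ORF" and against "Exact Prediction", which would be consistent with the gap between the two blocks of the tables above.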
  • 139. Gene/CDS Prediction • Many alternative methods, all perform differently • To assess/choose methods, performance metrics are required • Even on (relatively simple) prokaryotes, current best methods imperfect • Manual assessment and intervention is essential, and usually the longest part of the process
  • 141. Run a megaBLAST Comparison BLAST your chromosome against the comparator sequence. Put results in chrA_megablast_Pba.tab
      $ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_megablast_Pba.tab -outfmt 6
      $ head -n 3 chrA_megablast_Pba.tab
      chrA gi|50118965|ref|NC_004547.2|:10948-12453 80.34 1511 287 10 4579450 4580955 1506 1 0.0 1136
      chrA gi|50118965|ref|NC_004547.2|:c33859-32447 82.04 1409 253 0 4563151 4564559 1 1409 0.0 1201
      chrA gi|50118965|ref|NC_004547.2|:c34917-33868 82.48 1050 184 0 4562093 4563142 1 1050 0.0 920
    Note this defaults to using MEGABLAST...
  • 142. Run a BLASTN Comparison BLAST your chromosome against the comparator sequence. Put results in chrA_blastn_Pba.tab
      $ blastn -query chrA.fasta -subject NC_004547.fna -out chrA_blastn_Pba.tab -outfmt 6 -task blastn
      $ head -n 3 chrA_blastn_Pba.tab
      chrA gi|50118965|ref|NC_004547.2|:5629-7497 79.68 1865 379 0 4584915 4586779 1865 1 0.0 1654
      chrA gi|50118965|ref|NC_004547.2|:5629-7497 92.59 27 2 0 4479367 4479393 1254 1280 0.004 41.0
      chrA gi|50118965|ref|NC_004547.2|:5629-7497 100.00 17 0 0 4613022 4613038 52 36 2.1 31.9
    Note we added -task blastn
  • 143. Do BLASTN and megaBLAST comparisons agree? Check the number of alignments returned with wc
      $ wc chrA_megablast_Pba.tab
      2675 32100 242539 chrA_megablast_Pba.tab
      $ wc chrA_blastn_Pba.tab
      31792 381504 2850953 chrA_blastn_Pba.tab
    What is this telling us? Why do the results differ?
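One way to explore the difference beyond line counts is to summarise the alignment lengths in each table. The sketch below assumes the default 12-column layout of `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name and the choice of summary statistics are illustrative only.

```python
import csv
from statistics import median

FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def summarise_hits(lines):
    """Summarise alignment lengths from BLAST tabular (-outfmt 6) lines."""
    lengths = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(FIELDS, row))
        lengths.append(int(hit["length"]))
    if not lengths:
        return {"alignments": 0, "median_length": None, "shortest": None}
    return {
        "alignments": len(lengths),
        "median_length": median(lengths),
        "shortest": min(lengths),
    }
```

Passing an open file handle for each .tab file, e.g. `summarise_hits(open("chrA_blastn_Pba.tab"))`, and comparing the two summaries is a reasonable first check; note that the three BLASTN rows shown earlier already include matches of only 27 and 17 bases.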
  • 145. BLASTN vs megaBLAST • Legacy BLASTN uses the BLAST algorithm, megaBLAST does not • (though BLAST+ BLASTN now uses megaBLAST by default) • megaBLAST uses a fast, greedy algorithm due to Zhang et al. (2000) http://www.ncbi.nlm.nih.gov/pubmed/10890397 • megaBLAST is optimised for • genome-level searches • queries on large sequence sets (automatic query packing) • long alignments of similar sequences, with SNPs/sequencing errors • A discontinuous mode (dc-megaBLAST) is recommended for more divergent sequences
  • 146. Viewing alignments in ACT Start ACT from the command line:
      $ act &
  • 147. Use the “File”, “Open...” menu item
  • 148. Increase the Number of Comparisons Use more files ...
  • 152. Remove Weak Matches Use filter sliders
  • 153. MUMmer • MUMmer is a suite of alignment programs and scripts • mummer, promer, nucmer, etc. • Very different to BLAST (suffix tree alignment) - very fast • Extremely flexible • Used for genome comparisons, assemblies, scaffolding, repeat detection, etc. • Forms the basis for other aligners/assemblers
  • 154. Run a MUMmer Comparison Create a new sub-directory for MUMmer output.
      $ pwd
      .../data/workshop/chromosomes
      $ mkdir nucmer_out
    Run nucmer to create chrA_NC_004547.delta
      $ nucmer --prefix=nucmer_out/chrA_NC_004547 chrA.fasta NC_004547.fna
    Then filter this file to generate a coordinate table for visualisation
      $ delta-filter -q nucmer_out/chrA_NC_004547.delta > nucmer_out/chrA_NC_004547.filter
      $ show-coords -rcl nucmer_out/chrA_NC_004547.filter > nucmer_out/chrA_NC_004547_filtered.coords
  • 155. Run a MUMmer Comparison MUMmer output is very different from BLAST output
      $ head nucmer_out/chrA_NC_004547_filtered.coords
      ...
  • 156. Run a MUMmer Comparison Use a one-line shell command to convert to ACT-friendly format:
      $ tail -n +6 nucmer_out/chrA_NC_004547_filtered.coords | awk '{print $7" "$10" "$1" "$2" "$12" "$4" "$5" "$13}' > chrA_mummer_NC_004547.crunch
      $ head chrA_mummer_NC_004547.crunch
      2526 82.49 15 2540 4727782 4985117 4982588 5064019
      2944 82.29 2676 5619 4727782 4982544 4979600 5064019
      85 95.29 11092 11176 4727782 758690 758774 5064019
      1356 81.69 17446 18801 4727782 77639 78994 5064019
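For anyone less comfortable with awk, the same column shuffle can be sketched in Python. It mirrors the one-liner above: skip the first five header lines, split each row on whitespace (so the `|` separators count as fields, as they do in awk), and emit awk fields 7, 10, 1, 2, 12, 4, 5 and 13. The function name is hypothetical.

```python
from itertools import islice

# 0-based positions of awk fields $7 $10 $1 $2 $12 $4 $5 $13
CRUNCH_COLUMNS = (6, 9, 0, 1, 11, 3, 4, 12)


def coords_to_crunch(lines, header_lines=5):
    """Convert show-coords -rcl output lines into space-separated crunch rows."""
    rows = []
    for line in islice(lines, header_lines, None):
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or truncated lines
        rows.append(" ".join(fields[i] for i in CRUNCH_COLUMNS))
    return rows
```

Writing the result out with `"\n".join(rows)` should give a file equivalent to the .crunch file produced by the shell command, assuming the coords file has the five-line header that `tail -n +6` skips.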
  • 157. Select Files Select your chromosome, and the megaBLAST/MUMmer output
  • 159. Filter Weak BLAST Matches
  • 160. Genome Alignments • Alignment result depends on algorithm, and parameter choices • Some algorithms/parameter sets more sensitive than others • Appropriate visualisation is essential Much more detail at http://www.slideshare.net/leightonp/comparative-genomics-and-visualisation-part-1
  • 162. Reciprocal Best BLAST Hits (RBBH) • To compare our genecall proteins to the NC_004547.faa reference set... • BLAST reference proteins against our proteins • BLAST our proteins against reference proteins • Pairs of proteins that are each other’s best BLAST hit are called RBBH
  • 163. One-way BLAST vs RBBH One-way BLAST includes many low-quality hits
  • 164. One-way BLAST vs RBBH Reciprocal best BLAST hits remove many low-quality matches
  • 165. Reciprocal Best BLAST Hits (RBBH) • Pairs of proteins that are each other’s best BLAST hit are called RBBH • Should filter on percentage identity and alignment length • RBBH pairs are candidate orthologues • (most orthologues will be RBBH, but the relationship is complicated) • Outperforms OrthoMCL, etc. (why and how is beyond the scope of this course...) http://dx.doi.org/10.1093/gbe/evs100 http://dx.doi.org/10.1371/journal.pone.0018755 (We have a tool for this on our in-house Galaxy server)
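The RBBH idea can be sketched in a few lines of Python. This is not the Galaxy tool mentioned above; it assumes each BLAST search has already been reduced to (query, subject, bitscore) tuples, uses bitscore to pick the best hit, keeps the first hit seen on ties, and leaves out the identity and alignment-length filtering recommended on the slide. All names are illustrative.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(forward_rows)   # our proteins vs reference
    reverse = best_hits(reverse_rows)   # reference vs our proteins
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A protein whose best hit points at a reference protein that prefers a different partner is dropped, which is how the reciprocal requirement removes many of the weaker one-way matches.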
  • 167. In Conclusion • The tools you will need to use will be task-dependent, but some things are universal. . . • Good experimental design (including BLAST searches, etc.) • Keeping accurate records for reproduction/replication • Validation/sanity checking of results • Comparison and benchmarking of methods • (Cross-)validation of predictive methods Remember: everything gets easier with practice, so practice lots!