1. January 7th, 2015
Mariam Quiñones
Computational Biology Specialist
Bioinformatics and Computational Biosciences Branch
Office of Cyber Infrastructure and Computational Biology
2. Upcoming Seminars on NGS Analysis
http://inside.niaid.nih.gov/topic/training/scientificsoftwaretraining/Pages/default.aspx
3. BCBB: A Branch Devoted to Bioinformatics and
Computational Biosciences
Researchers’ time is increasingly important
BCBB saves our collaborators time and effort
Researchers speed projects to completion using
BCBB consultation and development services
No need to hire extra post docs or use external
consultants or developers
5. Contact BCBB…
NIH Users: Access a menu of BCBB services on the
NIAID Intranet:
• http://bioinformatics.niaid.nih.gov/
Outside of NIH –
• search “BCBB” on the NIAID Public Internet Page:
www.niaid.nih.gov
– or – use this direct link
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
Email us at:
• ScienceApps@niaid.nih.gov
6. Why has the scientific community adopted
deep sequencing?
• Cheaper, faster sequencing
• No need for cloning or probes
• Many applications
• Higher specificity and sensitivity (RNA-seq, ChIP-Seq)
• More...
7. What is Next Generation Sequencing?
Image from: http://s.ngm.com/2009/06/tag-caves/img/01-rumbling-falls-615.jpg
It is sequencing
produced by 2nd and 3rd
generation instruments
(e.g. Illumina, PacBio)
• It is also known as High-Throughput Next Generation Sequencing (HT-NGS)
or “Deep Sequencing”. It provides deeper coverage than typical Sanger
sequencing.
8. Agenda for today
Overview of Next Generation Sequencing
NGS sequencing platforms
NGS Analysis Basics
• File formats
• Quality Control
• Viewing alignment files
Common applications of NGS
9. Remember Sanger?
• Sanger introduced the “dideoxy method” (also known as Sanger
sequencing) in December 1977
Alignment of reads using tools such as ‘Sequencher’
IMAGE: http://www.lifetechnologies.com
10. Sanger to Next Generation Sequencing
• 1977: Sanger, dideoxy chain termination
• 1986: Hood et al., fluorescently labeled ddNTPs, partial automation
• 1990: NIH begins Human Genome Project
• 2001: HGP/Celera draft assembly published (Nature / Science)
• 2004: Next-gen sequencing (454 Roche)
• 2006: First Solexa sequencer, Genome Analyzer, 1G/run
• 1990 – 2003
• “shotgun”
2007
J.Craig Venter James Watson
11. Greater vision: Genomics to Bedside
“Only a population perspective can fulfill the promise
of genomic medicine. The scientific landscape for
genomics is exciting, and the promise for improving
health is great. Applying genomic tools in clinical and
public health practice will require a multidisciplinary
research collaboration of basic sciences with clinical
and population sciences (e.g., epidemiologists;
behavioral, social, and communication scientists;
health services researchers; and public health
practitioners)”
Am J Public Health. 2012 January; 102(1): 34–37
12. Popular Sequencing Platforms (non-Illumina)
SOLiD – 5500 xl series
320 Gb / 8 day run
GS FLX Titanium XL+
700bp reads
Up to 700 Mb / 23 hours
PacBio RS II
500 Mb – 1 Gb / 4 hr run
(up to 40kb read lengths)
Ion Torrent 318
1.2–2 Gb / 7 hr
pH sensing
Sequencing by
Synthesis
Single molecule
Ion Torrent Proton
(2 exomes / 2-4 hr run)
13. Roche 454
Pyrosequencing
Used mostly for targeted
sequencing such as 16S rRNA
Long reads (>500 bp) but with high
error rate in homopolymer
regions
Much lower yield than other
platforms
14. PacBio - Single Molecule Real Time
• Very long reads that
are good to span
repeats but with 11%
error rate
• Consensus analysis
of reads corrects error
rate
• It’s good for base
modification detection
• It can be combined
with shorter reads to
improve de novo
assemblies
Genome Res. 2013 Jan;23(1):121-8. doi: 10.1101/gr.141705.112. Epub 2012 Oct 11
15. Ion Torrent sequence detection
http://en.wikipedia.org/wiki/Ion_semiconductor_sequencing
16. New kid - MinION
Bases identified by
changes in current
18.
HiSeq X = $1000/genome at 30X
And more throughput..
BROAD Institute, Macrogen…
19. Illumina
Sequencing by synthesis
It uses dNTPs containing a terminator (with a
fluorescent label) that blocks further
polymerization, so only one base is added
per cycle
http://nxseq.bitesizebio.com/articles/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
20. Where are these sequences being stored?
Some data repositories include:
• NCBI SRA database http://www.ncbi.nlm.nih.gov/sra
• European Nucleotide Archive (ENA)
http://www.ebi.ac.uk/ena/about/sra_submissions
• 1000 Genomes data http://www.1000genomes.org/data
• Human Microbiome Project (microbiome data) http://hmpdacc.org/
22. Major challenges when working with sequencing data
We need:
Algorithms for managing (LIMS), analyzing and visualizing data
Reproducible workflows and standards for analysis
Better transfer and data storage technology
Specialized tools for integrating various data types
Emerging solutions
Algorithms that can parallelize jobs in a cluster
ABySS (uses MPI), ALLPATHS-LG, DISCOVAR
GATK (Genome Analysis Toolkit) uses MapReduce (Google’s framework)
Web tools with workflow capabilities
Galaxy Bioinformatics https://usegalaxy.org
Various Cloud based solutions (e.g. Illumina BaseSpace)
Lots of open source tools: see http://seqanswers.com/wiki/Software
23. Galaxy https://usegalaxy.org
Makes analysis methods available to
the community and facilitates
reproducibility via creation of reusable
workflows (read Galaxy slides)
Free web service, also compatible with
Cloud http://usegalaxy.org/cloud
Open source
Provides a Genome Track Browser to
visualize custom data.
25. First the basics – NGS 101
Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped the reads to my genome of interest?
27. Common Sequence file formats
Next gen sequence file formats are based on the
commonly used
FASTA format
>sequence_ID and optional comments
ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
TTCGAAATTGGCGTCAGT
In FASTQ, a Phred quality score is added for each base:
@HWI-ST406:207:D1DGFACXX:8:1101:20481:2058 1:N:0:AGTCAA
CATGGGGATCGAATTCATCGCCGTCCCCTCTGTTCCGATTTATTCCATATGTGCTTCGCAACAACGCTTTCTCACAGAATACAGGAGCTTCTATACTGTA
+
BBBFFFFFFFFFFIIIIIIFFIIFFIIIFFIIFFFIFBFIIIIIIIIIFIIFBFFIFFFBFFBFFBFBFFFFFFFBBFFFFFFBBBBBBBBBBBFFFBFB
28. Raw sequence file formats
FASTQ format (fasta format with quality values for each base)
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - base calls
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ - Base quality+33
Full read header description:
@<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
The text before the space is the read ID; the space separates it from the remaining fields.
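As a rough illustration of the header layout described on this slide, the following Python sketch splits a header line into its named fields. The class name `ReadHeader` and the function `parse_fastq_header` are made up for this example, and the code assumes the header follows exactly the colon-separated layout shown above.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    """Fields of an Illumina-style FASTQ header (hypothetical helper type)."""
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str


def parse_fastq_header(line: str) -> ReadHeader:
    """Split a header line into the fields named on the slide."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The single space separates the read ID from the remaining fields.
    read_id, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, filtered, control, barcode = description.split(":")
    return ReadHeader(
        instrument, run_id, flowcell,
        int(lane), int(tile), int(x_pos), int(y_pos),
        int(read_number), filtered == "Y", int(control), barcode,
    )


header = parse_fastq_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(header.flowcell_id, header.lane, header.barcode)  # FC706VJ 2 ATCACG
```

Headers written by other instruments or older pipelines may not match this layout, so a real parser would need to handle those cases.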
29. Quality values
Quality scores are normally expected up to 40 in a Phred scale.
ASCII characters <http://en.wikipedia.org/wiki/ASCII>
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
The highest base quality score in this sequence: ‘D’=(68-33)=35
From http://en.wikipedia.org/wiki/FASTQ_format
If base quality = 35:
P = 10^(-35/10) ≈ 0.00032 (about 1 in 3,200 base calls incorrect)
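The decoding on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the Phred+33 encoding shown above; the function names `phred_scores` and `error_probability` are made up for the example.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list:
    """Convert each ASCII quality character to its Phred score."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quals = phred_scores("BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@")
best = max(quals)
print(best, round(error_probability(best), 5))  # 35 0.00032
```

The lowest score in the same read is 22 (the character '7'), which corresponds to an error probability of roughly 0.006.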
30. Other read formats
SFF (Roche 454 or Ion Torrent)
• SFF files contain flowgrams, Phred quality scores, and clipping information
• 454 reads are often reported as FASTA and QUAL files or converted to FASTQ
31. First the basics – NGS 101
Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped the reads to my genome of interest?
32. Basic Concepts in Quality Control of
sequence data
The sequencing facility runs quality control tests to ensure that the
actual run was successful and/or to determine whether a new library is good
enough to be sequenced further.
The user should run quality control tests prior to full bioinformatics
analyses
• This will avoid misinterpretation of the data due to unexpected bias
• QC measurements can report the following:
– Percent GC in sample reads
– Presence of overrepresented kmers and sequences such as adapters
– Per base quality score
– Distribution of nucleotide bases
After mapping reads to a genome, additional tests can be run to
determine:
– Mapping error rate
– Percent of possible PCR duplicates (reads with the same start and end position in
the reference genome)
– Distribution of insert size (paired-end reads)
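Two of the pre-mapping measurements listed above, percent GC and per-base quality, can be sketched in plain Python. This is only an illustration on toy data; the function names are made up, Phred+33 encoding is assumed, and dedicated QC tools report far more than this.

```python
def gc_percent(reads):
    """Percent of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    gc = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    return 100.0 * gc / total if total else 0.0


def mean_quality_per_position(quality_lines, offset=33):
    """Mean Phred score at each read position (reads may differ in length)."""
    sums, counts = [], []
    for line in quality_lines:
        for i, ch in enumerate(line):
            if i == len(sums):
                # First read that reaches this position: open a new column.
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy data invented for this example.
reads = ["ACGT", "GGCC", "AT"]
quals = ["IIII", "II55", "I5"]
means = mean_quality_per_position(quals)
print(gc_percent(reads), means)
```

A drop in mean quality toward the end of the reads is the kind of pattern that would suggest trimming before analysis.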
34. Quality control with FastQC
By: FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Mean quality per base
Sequence content per base
GC content per read
For filtering or trimming by quality use tools such as
FastX-toolkit, Btrim, PrinSeq.
35. First the basics – NGS 101
Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped (aligned) the reads to my genome
of interest?
36. File formats for aligned reads
SAM (sequence alignment map)
CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT
CTTGGGCTGCGTCGTGTCTTCGCTTCACACC
Chr_ start end
Chr2 100000 100050
Chr_ start end
Chr2 50000 50050
37. Most commonly used alignment file formats
SAM (sequence alignment map)
Unified format for storing alignments to a reference genome
BAM (binary version of SAM) – used commonly to deliver data
Compressed SAM file; it is normally indexed
BED
Commonly used to report features described by chrom, start, end, name,
score, and strand.
For example:
chr1 11873 14409 uc001aaa.3 0 +
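The example BED line can be read into named fields with a short Python sketch. The names `BedFeature` and `parse_bed_line` are made up for this illustration; the length calculation relies on BED coordinates being 0-based with an exclusive end.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """First six BED columns (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        # BED start is 0-based and end is exclusive, so length is end - start.
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse one whitespace-delimited BED line with at least six columns."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedFeature(chrom, int(start), int(end), name, int(score), strand)


feature = parse_bed_line("chr1 11873 14409 uc001aaa.3 0 +")
print(feature.name, feature.strand, feature.length)  # uc001aaa.3 + 2536
```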
38. SAM/BAM format (sequence alignment map)
QNAME FLAG RNAME POSITION MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT
http://samtools.sourceforge.net/samtools.shtml#8
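To make the column list concrete, here is a minimal Python sketch that splits one tab-delimited SAM alignment line into the fields named on this slide and checks two FLAG bits. The example record is invented, and the function names are hypothetical. The slide's MRNM and MPOS columns are called RNEXT and PNEXT in the current SAM specification.

```python
# Column names as used on the slide; POS is the 1-based leftmost position.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "TLEN", "SEQ", "QUAL"]
INTEGER_COLUMNS = ("FLAG", "POS", "MAPQ", "MPOS", "TLEN")


def parse_sam_line(line):
    """Return a dict of the 11 mandatory fields plus a list of optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in INTEGER_COLUMNS:
        record[key] = int(record[key])
    record["OPT"] = fields[11:]
    return record


def is_unmapped(record):
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(record["FLAG"] & 0x4)


def is_reverse_strand(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record["FLAG"] & 0x10)


# Invented record for illustration only.
line = "read1\t16\tChr2\t100000\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], is_reverse_strand(record))
```

Header lines in a SAM file start with '@' and would need to be skipped before calling this function.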
39. First the basics – NGS 101
How to visualize an alignment?
• Use a genome browser
40. What is a Genome Browser?
Graphical interface for display of genomic information
from biological databases
Besides genome sequence, they provide additional data:
• known and predicted genes
• ESTs
• mRNAs
• CpG islands
• assembly gaps and coverage
• chromosomal bands
• homology to other organisms
• RNA-seq data
• Transcription factor binding sites
• GC percent
• Splicing variants
• Known SNPs
• Associated publications
• Sequence repeats
41. Viewing reads in browser
If your genome is available via the UCSC genome
browser http://genome.ucsc.edu/, import bam format
file to the UCSC genome browser by hosting the file on
a server and providing the link.
If your genome is not in UCSC, use another browser
such as IGV http://www.broadinstitute.org/igv/ , or IGB
http://bioviz.org/igb/
• Import genome (fasta)
• Import annotations (gff3 or bed format)
• Import data (bam)
42. Data from ENCODE, Expression (RNA-Seq), Methylation
and Transcription Factor binding (ChIP-Seq) and more.
43. Use UCSC Custom Track to display data
Is used to display your own data and annotations
A variety of formats are accepted (click on the links to file types for more information)
Data remains available for a limited time after upload
Next-gen sequencing data can also be imported, typically by hosting a BAM file on a server and providing the link
44. Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv/
Java tool that runs on the user’s computer
Allows for upload of custom annotations in many
formats
• Add your genome (for example, the latest P. falciparum
version) in fasta format.
• Add gene expression (gct format) or sequence
alignments (bam format).
• Add custom annotations (bed, wig, gff3 formats).
45. Example of use of IGV to visualize custom genome and data
Reads imported in bam format
Annotation imported in bed format
46. Beyond the basics – a growing list of NGS applications
High throughput sequencing:
• RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
• Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
• CNV, structural variations
• Targeted resequencing, “Exome analysis”
• Whole genome sequencing
• Metagenomics (16S microbiome, environmental WGS)
• Somatic mutations
• Variants in Mendelian diseases
• De novo genome assembly
48. How to align reads to a genome?
Step 1: Choose appropriate alignment software
http://seqanswers.com/wiki/Software
• Common tools:
– Bowtie: fast, accurate (e.g. for ChIP-Seq)
– BWA: fast, accurate, gapped alignment (variant analysis)
– TopHat: uses Bowtie for initial mapping and then maps
junctions (good for RNA-seq mapping)
– GSMapper, MIRA: developed for 454 Roche data
– QIIME, mothur: suites for processing and alignment of 16S
rRNA amplicon data for microbiome
49. How to align reads to a genome?
Step 2: Map the reads to generate an alignment file
(bam).
To visualize bam output files, sort and index the
output file using tools such as samtools or Picard.
50. Counting experiments (RNA-seq)
Methods
Reverse transcribe to cDNA
Prepare library (usually paired
end and/or strand specific)
Features
A design for capture is not
required
Alignment depth is
proportional to the abundance
of the transcript
Applications
Identify coding sequences,
miRNAs, alternative splicing,
antisense transcripts
Quantify differential
expression
RPKM - Reads Per Kilobase of exon model per Million mapped reads
(Haas & Zody, 2010)
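The RPKM definition above translates directly into a small calculation. The following Python sketch assumes the standard reading of the name: read count divided by exon length in kilobases and by total mapped reads in millions. The function name and the numbers in the example are made up.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp exon model,
# in a library with 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

Because the value is normalized for both transcript length and sequencing depth, it allows rough comparison between genes and between libraries.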
51. Strategies for Mapping Junction Reads
Split reads and align separately to reference
• Sometimes based on intermediate reference of reconstructed splice
junction sequences
• Finds known and novel splice sites
• e.g., TopHat, SOAPsplice, Trinity
Frontiers in Genetics, Huang 2011
Slide courtesy of Andrew Oler (BCBB)
53. Counting experiments (ChIP-seq)
(Chromatin Immunoprecipitation and Sequencing)
Features
Allows genome-wide discovery of
protein-DNA interactions (e.g.
transcription factor, histone
modification)
DNA and proteins are cross-linked
and purified; then bound DNA is
analyzed by massively parallel short-read
sequencing
It is cheaper and provides a better
signal-to-noise ratio than ChIP-chip,
and does not depend on probes
54. Counting experiments (ChIP-seq)
Features
Analysis typically involves
mapping, peak detection
and binding motif analysis
Challenges include scoring
diffuse or low intensity
peaks in relation to
background and coverage
Common tools: USeq,
MACS
http://bit.ly/qLjRGA
55. ChIP-seq Downstream Analysis
Supplemental Table 2: D1 Histone-enriched loci (Illumina GAII FDR < 0.0001)
GO Category | Total Genes | Changed Genes | Enrichment | FDR
Cell fate commitment | 75 | 60 | 1.59848 | 0
Sequence-specific DNA binding | 424 | 337 | 1.588112 | 0
Cellular morphogenesis during differentiation | 125 | 99 | 1.582495 | 0
Cell projection organization and biogenesis | 169 | 131 | 1.548823 | 0
Cell part morphogenesis | 169 | 131 | 1.548823 | 0
Embryonic morphogenesis | 88 | 68 | 1.543986 | 0
Regionalization | 82 | 63 | 1.535126 | 0
Neurogenesis | 221 | 168 | 1.518918 | 0
Wnt receptor signaling pathway | 107 | 80 | 1.493907 | 0
Regulation of cell differentiation | 119 | 88 | 1.477587 | 0
Regulation of transcription from RNA polymerase II promoter | 99 | 72 | 1.453164 | 0
Organ morphogenesis | 304 | 221 | 1.452566 | 0
Embryonic development | 226 | 164 | 1.449949 | 0
Regulation of developmental process | 191 | 138 | 1.443653 | 0
Voltage-gated ion channel activity | 171 | 123 | 1.43723 | 0
Nervous system development | 604 | 433 | 1.432413 | 0
Cation channel activity | 228 | 162 | 1.419703 | 0
Transcription factor activity | 791 | 552 | 1.394376 | 0
Muscle development | 136 | 94 | 1.38104 | 0
# Peaks Found in Different Tissues
Allele-specific Binding
Oler et al., NSMB, 2010; Mikkelsen et al., Nature, 2007; Park, Nat Rev Genet, 2009; Barski et al., Cell, 2007
Slide courtesy of Andrew Oler / Vijay Nagarajan (BCBB)
57. Beyond the basics – a growing list of NGS applications
58. De novo genome assembly
AllPATHS-LG
http://www.broadinstitute.org/news/2787
59. De novo genome assembly
A good assembly needs:
• Library preparation that minimizes GC bias, which leads to poor coverage
• High coverage (e.g. 100-fold Illumina) with low error rate
• For a small genome, if possible, add 50-fold PacBio coverage (1500 bp read length) to
reduce the number of contigs. Alternatively, use mate pairs and paired ends of
various insert sizes.
De novo assemblers for large genomes
• ALLPATHS-LG and DISCOVAR – developed and recommended by BROAD
Institute http://www.broadinstitute.org/science/programs/genome-biology/crd
• SOAPdenovo – developed and used by BGI http://1.usa.gov/oTUrWC
• ABYSS http://www.bcgsc.ca/platform/bioinfo/software/abyss
De novo assemblers for smaller genomes
• VELVET
• NEWBLER (454)
Related publications
http://1.usa.gov/id8h5d
60. Use case: Panda Genome
Published Nature 2010
• SOAPdenovo
(de Bruijn graph algorithm)
• 56 fold coverage
• 500bp insert paired end
• 2kb mate pair
• Genome was 94% complete
Image courtesy of Zhihe Zhang
In Scientific American
De novo genome assembly
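The de Bruijn graph idea mentioned for the panda assembly can be illustrated with a toy Python sketch. This is not how SOAPdenovo is implemented; it only shows the core idea of breaking reads into k-mers, linking overlapping (k-1)-mers, and walking an unbranched path to recover a contig. The reads, the value of k, and the function names are all invented, and the sketch ignores sequencing errors, reverse complements, and repeats.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break  # stop on a cycle instead of looping forever
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unbranched(graph, "ATG"))  # ATGGCGTGCA
```

Repeats longer than k create branches in the graph, which is one reason paired-end and mate-pair libraries of different insert sizes help with scaffolding.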
61. Asian Honey Bee (published January 2015)
• A 238 Mbp draft of the A. cerana genome was assembled, and 10,651 genes were generated from it.
• 72% of the A. cerana-specific genes had more than one GO term, and
1,696 enzymes were categorized into 125 pathways.
• Genes involved in chemoreception and immunity were carefully
identified and compared to those from other sequenced insect
models. These included 10 gustatory receptors, 119 odorant
receptors, 10 ionotropic receptors, and 160 immune-related genes.
• 3 libraries
• Pair end: 500bp
• Mate pair: 3kb and 10kb
• 2,430 scaffolds
• RNA-seq data also assembled
• Tools: AllPaths-LG, RepeatMasker
• RNA-seq tools: Trinity, TopHat, Cufflinks
62. Schematic overview of SOAPdenovo algorithm
http://1.usa.gov/oTUrWC
• Preassembly sequencing error correction
• Contig assembly
• Scaffolding
• Gap closure
63. Beyond the basics – a growing list of NGS applications
64. Metagenomics and microbiome analysis
Analysis methods:
• Reference-based analysis
• 16S rRNA – OTU-based methods
• Shotgun data (454, Illumina)
• Assign taxonomy (RDP classifier, BLAST)
• Pipelines for 16S rRNA: QIIME, mothur
• Other tools: MEGAN, CARMA,
MetaPhyler
• De novo assembly of WGS and functional
analysis of microbiomes
• Methods are under development with
the goal of dealing with insufficient
coverage, sequencing errors, repeats
• Tools: MG-RAST, metAMOS, HUMAnN
• It looks at gene classes, metabolic
pathways
http://bit.ly/o4dGqH http://www.hmpdacc.org/
65. Sample study: skin microbiome
The skin is an ecosystem, host to a microbial
milieu that, for the most part, is harmless.
Analysis of 16S ribosomal RNA genes reveals a
greater diversity of organisms than has been
found by culture-based methods
The cutaneous immune system modulates
colonization by the microbiota and is also vital
during infection and wounding. Dysregulation of
the skin immune response is evident in several
skin disorders
Elizabeth A. Grice & Julia A. Segre
Nature Reviews Microbiology 9, 244-253
Recommended software: mothur, qiime
66. A growing list of applications
67. Variant Analysis
…like finding a needle in a ‘deep’ haystack
SNPs – single nucleotide
polymorphisms
Indels – insertions and
deletions
CNVs – copy number
variations
SVs – structural variations
Variant = any position that
differs from a specified reference
sequence
68. Efforts at creating databases of variants:
HapMap Project
• Project that started in 2002 with the goal of describing patterns of human
genetic variation and creating a haplotype map using SNPs present in at
least 1% of the population, which were deposited in dbSNP.
• It used 269 individuals.
Haplotypes – adjacent SNPs that are inherited together
1000 Genomes
• Started in 2008 with the goals of using at least 1000 individuals (about
2,500 samples at 4X coverage), interrogating 1000 gene regions in 900
samples (exome analysis), and finding most genetic variants with allele
frequencies above 1% (down to 0.1% in coding regions), as well as
indels and structural variants
• Make data available to the public ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
or via Amazon Cloud http://s3.amazonaws.com/1000genomes
69. ENCODE: Encyclopedia of DNA Elements
Main Goal:
• Find all functional elements in the genome
70. Exome-Seq
Targeted exome capture
targets ~20,000 variants
near coding sequences
and a few rare missense or
loss of function variants
Provides high depth of
coverage for more accurate
variant calling
It is starting to be used as a
diagnostic tool
Ann Neurol. 2012 Jan;71(1):5-14.
-Nimblegen
-Agilent
-Illumina
71. VCF format (version 4.0)
Format used to report information about a position in the
genome
Used by the 1000 Genomes Project to report all variants
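As an illustration of what a VCF record holds, the sketch below parses the eight fixed, tab-delimited columns of one data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_line` is made up, the example line is illustrative rather than taken from this deck, and per-sample genotype columns are ignored.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # INFO entries without '=' are flags, stored here as True.
        info_dict[key] = value if sep else True
    return {
        "CHROM": chrom,
        "POS": int(pos),            # 1-based position on the reference
        "ID": var_id,
        "REF": ref,
        "ALT": alt.split(","),      # several alternate alleles are comma-separated
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_dict,
    }


# Illustrative record; header lines starting with '#' must be skipped first.
record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```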