January 7th, 2015
Mariam Quiñones
Computational Biology Specialist
Bioinformatics and Computational Biosciences Branch
Office of Cyber Infrastructure and Computational Biology
Upcoming Seminars on NGS Analysis
2
http://inside.niaid.nih.gov/topic/training/scientificsoftwaretraining/Pages/default.aspx
BCBB: A Branch Devoted to Bioinformatics and
Computational Biosciences
  Researchers’ time is increasingly important
  BCBB saves our collaborators time and effort
  Researchers speed projects to completion using
BCBB consultation and development services
  No need to hire extra post docs or use external
consultants or developers
3
BCBB Staff
4
Bioinformatics Software
Developers
Computational Biologists
Project Managers and
Analysts
Contact BCBB…
  NIH Users: Access a menu of BCBB services on the
NIAID Intranet:
•  http://bioinformatics.niaid.nih.gov/
  Outside of NIH –
•  search “BCBB” on the NIAID Public Internet Page:
www.niaid.nih.gov
– or – use this direct link
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
  Email us at:
•  ScienceApps@niaid.nih.gov
5
Why has the scientific community adopted
deep sequencing?
6
•  Cheaper, faster sequencing
•  No need for cloning or probes
•  Many applications
•  Higher specificity and sensitivity (RNA-seq, ChIP-Seq)
•  More..
What is Next Generation Sequencing?
7
Image from: http://s.ngm.com/2009/06/tag-caves/img/01-rumbling-falls-615.jpg
•  It is sequencing produced by 2nd and 3rd generation instruments
(e.g. Illumina, PacBio)
•  It is also known as High-Throughput Next Generation Sequencing (HT-NGS)
or “Deep Sequencing”. It provides deeper coverage than typical Sanger
sequencing
Agenda for today
  Overview of Next Generation Sequencing
  NGS sequencing platforms
  NGS Analysis Basics
•  File formats
•  Quality Control
•  Viewing alignment files
  Common applications of NGS
8
Remember Sanger?
9
•  Sanger introduced the “dideoxy method” (also known as Sanger
sequencing) in December 1977
Alignment of reads using tools such as ‘Sequencher’
IMAGE: http://www.lifetechnologies.com
Sanger   Next Generation Sequencing
10
•  1977: Sanger, dideoxy chain termination
•  1986: Hood et al., fluorescently labeled ddNTPs, partial automation
•  1990: NIH begins Human Genome Project (1990 – 2003)
•  2001: HGP/Celera draft assembly published in Nature / Science (“shotgun”)
•  2004: Next-Gen Sequencing (454 Roche)
•  2006: First Solexa sequencer, Genome Analyzer, 1G/run
•  2007: J. Craig Venter, James Watson
Greater vision: Genomics to Bedside
  “ Only a population perspective can fulfill the promise
of genomic medicine. The scientific landscape for
genomics is exciting, and the promise for improving
health is great. Applying genomic tools in clinical and
public health practice will require a multidisciplinary
research collaboration of basic sciences with clinical
and population sciences (e.g., epidemiologists;
behavioral, social, and communication scientists;
health services researchers; and public health
practitioners)”
Am J Public Health. 2012 January; 102(1): 34–37
11
Popular Sequencing Platforms (non-Illumina)
12
SOLiD – 5500 xl series
320 Gb / 8 day run
GS FLX Titanium XL+
700bp reads
Up to 700 Mb / 23 hours
PacBio RS II
500 Mb – 1 Gb / 4 hr run
(up to 40kb read lengths)
Ion Torrent 318
1.2–2 Gb / 7 hr
pH sensing
Sequencing by
Synthesis
Single molecule
Ion Torrent Proton
(2 exomes / 2-4 hr run)
Roche 454
  Pyrosequencing
  Used mostly for targeted
sequencing such as 16S rRNA
  Long reads (>500 bp) but with high
error rate in homopolymer
regions
  Much lower yield than other
platforms
13
PacBio - Single Molecule Real Time
14
•  Very long reads that
are good to span
repeats but with 11%
error rate
•  Consensus analysis
of reads corrects error
rate
•  It’s good for base
modification detection
•  It can be combined
with shorter reads to
improve de novo
assemblies
Genome Res. 2013 Jan;23(1):121-8. doi: 10.1101/gr.
141705.112. Epub 2012 Oct 11
Ion Torrent sequence detection
15
http://en.wikipedia.org/wiki/
Ion_semiconductor_sequencing
New kid - MinION
16
Bases identified by
changes in current
Illumina platforms
17
ILLUMINA
18
HiSeq X = $1000/genome at 30X
And more throughput..
BROAD Institute, Macrogen…
Illumina
  Sequence by
synthesis
  It uses dNTPs
containing a
terminator (with a
fluorescent label)
which blocks further
polymerization
allowing only one
base added
19
http://nxseq.bitesizebio.com/articles/
sequencing-by-synthesis-explaining-the-
illumina-sequencing-technology/
Where are these sequences being stored?
Some data repositories include:
•  NCBI SRA database http://www.ncbi.nlm.nih.gov/sra
•  European Nucleotide Archive (ENA)
http://www.ebi.ac.uk/ena/about/sra_submissions
•  1000 genomes data http://www.1000genomes.org/data
•  Human Microbiome Project (microbiome data) http://hmpdacc.org/
Large Sequencing Projects
21
http://cancergenome.nih.gov/cancergenomics
www.1000genomes.org/
http://commonfund.nih.gov/hmp/
http://www.icgc.org/
http://img.jgi.doe.gov/cgi-bin/m/main.cgi
http://www.genome10k.org/
Major challenges when working with sequencing data
We need:
  Algorithms for managing (LIMS), analyzing and visualizing data
  Reproducible workflows and standards for analysis
  Better transfer and data storage technology
  Specialized tools for integrating various data types
22
Emerging solutions
Algorithms that can parallelize jobs in a cluster
  ABySS uses MPI, AllPaths LG, Discovar
  GATK (Genome Analysis Toolkit) uses a MapReduce-style framework (the programming model introduced by Google)
Web tools with workflow capabilities
  Galaxy Bioinformatics https://usegalaxy.org
  Various Cloud based solutions (e.g. Illumina BaseSpace)
  Lots of open source tools: see http://seqanswers.com/wiki/Software
Galaxy https://usegalaxy.org
  Makes analysis methods available to
the community and facilitates
reproducibility via creation of reusable
workflows (read Galaxy slides)
  Free web service, also compatible with
Cloud http://usegalaxy.org/cloud
  Open source
  Provides a Genome Track Browser to
visualize custom data.
23
Cartoons from: fixingpcerrors.com and squido.com
I have data,
where do I start?
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped the reads to my genome of interest?
25
Understanding file
formats
@F29EPBU01CZU4O
GCTCCGTCGTAAAAGGGG
+
24469:666811//..,,
@F29EPBU01D60ZF
CTCGTTCTTGATTAATGAAACATTCTTGGCAAA
TGCTTTCGCTCTGGTCCGTCTTGCGCCGGTCCA
AGAATTTCACCTCTAGCGGCGCAATACGAATG
CCCAAACACACCCAACACACCA
+
G???HHIIIIIIIIIBG555?
=IIIIIIIIHHGHHIHHHIIIIIIHHHIIHHHIIIIIIIIIH99;;CB
BCCEI???DEIIIIII??;;;IIGDBCEA?
9944215BB@>>@A=BEIEEE
@F29EPBU01EIPCX
TTAATGATTGGAGTCTTGGAAGCTTGACTACCC
TACGTTCTCCTACAAATGGACCTTGAGAGCTTG
TTTGGAGGTTCTAGCAGGGGAGCGCATCTCCC
CAAACACACCCAACACACCA
+
IIIIIIIIIIIIIIIIIIIIIIHHHHIIIIHHHIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIH
HHIIIIIIIIEIIB94422=4GEEEEEIBBBBHHHFIH??
?CII=?AEEEE
@F29EPBU01DER7Q
TGACGTGCAAATCGGTCGTCCGACCTCGGTAT
AGGGGCGAAGACTAATCGAACCATCTAGTAGC
Common Sequence file formats
  Next gen sequence file formats are based on the
commonly used
FASTA format
>sequence_ID and optional comments
ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
TTCGAAATTGGCGTCAGT
  The Phred quality scores per base were added
27
@HWI-ST406:207:D1DGFACXX:8:1101:20481:2058 1:N:0:AGTCAA
CATGGGGATCGAATTCATCGCCGTCCCCTCTGTTCCGATTTATTCCATATGTGCTTCGCAACAACGCTTTCTCACAGAATACAGGAGCTTCTATACTGTA
+
BBBFFFFFFFFFFIIIIIIFFIIFFIIIFFIIFFFIFBFIIIIIIIIIFIIFBFFIFFFBFFBFFBFBFFFFFFFBBFFFFFFBBBBBBBBBBBFFFBFB
Raw sequence file formats
  FASTQ format (fasta format with quality values for each base)
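The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (wrapped, multi-line records are not handled); the names `FastqRecord` and `read_fastq` are hypothetical and chosen only for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # header line without the leading '@'
    sequence: str  # base calls
    quality: str   # ASCII-encoded qualities, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# First record from the 454 example shown earlier in the slides.
example = io.StringIO(
    "@F29EPBU01CZU4O\n"
    "GCTCCGTCGTAAAAGGGG\n"
    "+\n"
    "24469:666811//..,,\n"
)
records = list(read_fastq(example))
print(records[0].read_id, len(records[0].sequence))
```

For real files, pass an open file handle instead of the `io.StringIO` object.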
28
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - base calls
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ - Base quality+33
Full read header description
@<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
Read ID: the part of the header before the space
A space separates the Read ID from the remaining fields
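As a rough illustration of the header layout described above, the sketch below splits an Illumina-style header into named fields. The function name `parse_illumina_header` and the dictionary keys are assumptions made for this example, not part of any library.

```python
from typing import Dict

# Field names follow the header description on the slide; the key spellings
# are hypothetical.
READ_ID_FIELDS = ["instrument", "run_id", "flowcell_id", "lane", "tile", "x_pos", "y_pos"]
EXTRA_FIELDS = ["read_number", "is_filtered", "control_number", "barcode"]


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split '@<read id> <extra fields>' into a dictionary of strings."""
    read_id, extra = header.lstrip("@").split(" ", 1)
    id_values = read_id.split(":")
    extra_values = extra.split(":")
    if len(id_values) != len(READ_ID_FIELDS) or len(extra_values) != len(EXTRA_FIELDS):
        raise ValueError(f"unexpected header layout: {header!r}")
    fields = dict(zip(READ_ID_FIELDS, id_values))
    fields.update(zip(EXTRA_FIELDS, extra_values))
    return fields


fields = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(fields["flowcell_id"], fields["lane"], fields["barcode"])
```

Older Illumina pipelines and other platforms use different header layouts, so this parser would need adjusting for those files.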
Quality values
29
Quality scores are normally expected up to 40 in a Phred scale.
ASCII characters <http://en.wikipedia.org/wiki/ASCII>
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
The highest base quality score in this sequence: ‘D’ = (68-33) = 35
If base quality = 35, $P = 10^{-35/10} \approx 0.00032$ (or about 1/3200 incorrect)
From http://en.wikipedia.org/wiki/FASTQ_format
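The quality arithmetic on this slide can be checked with a short sketch, assuming the Phred+33 encoding shown in the example. The helper names `phred_scores` and `error_probability` are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


quality = "BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@"
scores = phred_scores(quality)
best = max(scores)
print(best, round(error_probability(best), 5))  # prints: 35 0.00032
```

Files from older Illumina pipelines may use an offset of 64 instead of 33; in that case pass `offset=64`.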
Other read formats
  SFF (Roche 454 or Ion Torrent)
•  SFF files contain flowgrams, Phred quality scores, and clipping information
•  454 reads are often reported as fasta and qual or converted to fastq
30
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped the reads to my genome of interest?
31
Basic Concepts in Quality Control of
sequence data
  The sequencing facility runs quality control tests to ensure that the
actual run was successful and/or to determine whether a new library is good
enough to be sequenced further.
  The user should run quality control tests prior to full bioinformatics
analyses
•  This will avoid misinterpretation of the data due to unexpected bias
•  QC measurements can report the following:
–  Percent GC in sample reads
–  Presence of overrepresented kmers and sequences such as adapters
–  Per base quality score
–  Distribution of nucleotide bases
  After mapping reads to a genome, additional tests can be run to
determine:
–  Mapping error rate
–  Percent of possible PCR duplicates (reads with the same start and end position in the
reference genome)
–  Distribution of insert size (paired ends)
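Two of the pre-mapping QC measurements listed above, percent GC and per-base quality, can be sketched in plain Python. This is an illustration under simple assumptions (Phred+33 qualities, reads held in memory), not a substitute for a dedicated QC tool; the function names are hypothetical.

```python
from typing import List, Sequence


def gc_percent(reads: Sequence[str]) -> float:
    """Percentage of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    gc = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    return 100.0 * gc / total if total else 0.0


def mean_quality_per_position(qualities: Sequence[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    longest = max((len(q) for q in qualities), default=0)
    means = []
    for position in range(longest):
        column = [ord(q[position]) - offset for q in qualities if position < len(q)]
        means.append(sum(column) / len(column))
    return means


reads = ["GGCC", "ATAT"]
qualities = ["IIII", "5III"]
gc = gc_percent(reads)
per_position = mean_quality_per_position(qualities)
print(gc, per_position)
```

A drop in the per-position means toward the 3' end of the reads is a common reason to trim before alignment.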
32
Demo – Fastq Quality
33
Quality control with FastQC
By: FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/
fastqc/
Mean quality per base
Sequence content per base
GC content per read
For filtering or trimming by quality use tools such as
FastX-toolkit, Btrim, PrinSeq.
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How to know if the sequencing facility has
provided good quality reads?
• What to expect if the sequencing facility has
mapped (aligned) the reads to my genome
of interest?
35
File formats for aligned reads
  SAM (sequence alignment map)
36
CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT
CTTGGGCTGCGTCGTGTCTTCGCTTCACACC
Chr_ start end
Chr2 100000 100050
Chr_ start end
Chr2 50000 50050
Most commonly used alignment file formats
  SAM (sequence alignment map)
Unified format for storing alignments to a reference genome
  BAM (binary version of SAM) – used commonly to deliver data
Compressed SAM file, is normally indexed
  BED
Commonly used to report features described by chrom, start, end, name,
score, and strand.
For example:
chr1 11873 14409 uc001aaa.3 0 +
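A BED line like the one above can be parsed as follows. This is a minimal sketch for the six columns named on the slide; `BedFeature` and `parse_bed_line` are hypothetical names. BED start coordinates are 0-based and the end is exclusive, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first six whitespace-separated columns of a BED line."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    return BedFeature(chrom, int(start), int(end), name, int(score), strand)


feature = parse_bed_line("chr1 11873 14409 uc001aaa.3 0 +")
length = feature.end - feature.start
print(feature.name, length)
```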
37
SAM/BAM format (sequence alignment map)
38
QNAME FLAG RNAME POS MAPQ CIGAR MRNM (RNEXT) MPOS (PNEXT) TLEN
SEQ QUAL OPT
http://samtools.sourceforge.net/samtools.shtml#8
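To make the column layout concrete, the sketch below splits one SAM alignment line into its eleven mandatory fields and decodes the bitwise FLAG. The alignment line is invented for illustration, and the function names are hypothetical. The column names RNEXT and PNEXT are the current specification's names for the fields older documents call MRNM and MPOS.

```python
from typing import Dict, List

SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split a tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record: Dict[str, object] = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = fields[11:]  # optional TAG:TYPE:VALUE fields
    return record


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical alignment record, for illustration only.
line = "\t".join(["read1", "99", "chr2", "100000", "60", "50M", "=",
                  "100200", "250", "A" * 50, "I" * 50, "NM:i:0"])
record = parse_sam_line(line)
flags = decode_flag(record["FLAG"])
print(record["RNAME"], record["POS"], flags)
```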
First the basics – NGS 101
 How to visualize an alignment?
• Use a genome browser
39
What is a Genome Browser?
Graphical interface for display of genomic information
from biological databases
Besides genome sequence, they provide additional data:
• known and predicted genes
• ESTs
• mRNAs
• CpG islands
• assembly gaps and coverage
• chromosomal bands
• homology to other organisms
• RNA-seq data
• Transcription factor binding sites
• GC percent
• Splicing variants
• Known SNPs
• Associated publications
• Sequence repeats
40
Viewing reads in browser
  If your genome is available via the UCSC genome
browser http://genome.ucsc.edu/, import bam format
file to the UCSC genome browser by hosting the file on
a server and providing the link.
  If your genome is not in UCSC, use another browser
such as IGV http://www.broadinstitute.org/igv/ , or IGB
http://bioviz.org/igb/
•  Import genome (fasta)
•  Import annotations (gff3 or bed format)
•  Import data (bam)
Data from ENCODE, Expression (RNA-Seq), Methylation
and Transcription Factor binding (ChIP-Seq) and more.
Use UCSC Custom Track to display data
Next-gen sequencing
can also be imported
typically by hosting a
BAM file in a server
and providing the link
  Is used to display your own data and annotations
  A variety of formats are accepted (click on the links to file types for more information)
  Data remains available for a limited time after upload
43
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv/
  Java tool that runs on the user’s computer
  Allows for upload of custom annotations in many
formats
•  Add your genome (for example P. falciparum latest
version) in fasta format.
•  Add gene expression (gct format) or sequence
alignments (bam format).
•  Add custom annotations (bed, wig format, gff3).
44
Example of use of IGV to visualize custom genome and data
Reads imported in bam format
Annotation imported in bed format
reads
Beyond the basics – a growing list of
NGS applications
High throughput sequencing:
•  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
•  Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
•  CNV, structural variations
•  Targeted resequencing, “Exome analysis”
•  Whole genome sequencing
•  Metagenomics (16S microbiome, environmental WGS)
•  Somatic mutations
•  Variants in Mendelian diseases
•  De novo genome assembly
47
RNA-Seq, Chip-Seq,
Resequencing,
Variant Analysis…
Most applications of NGS require alignment to a
known genome as the first step
Slide modified from Andrew Oler (BCBB)
How to align reads to a genome?
  Step 1: Choose an appropriate alignment software
http://seqanswers.com/wiki/Software
•  Common tools:
–  Bowtie: FAST, Accurate (e.g. for ChIP-Seq)
–  BWA: FAST, Accurate, gapped alignment (variant analysis)
–  TopHat: Uses Bowtie for initial mapping and then maps
junctions (good for RNA-seq mapping)
–  GSMapper, MIRA: developed for 454 Roche data
–  QIIME, mothur: Suite for processing and alignment of 16S
rRNA amplicon data for microbiome
48
How to align reads to a genome?
  Step 2: Map the reads to generate an alignment file
(bam).
To visualize bam output files, sort and index the
output file using tools such as samtools or Picard.
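The sort-and-index step can be scripted. The sketch below only builds the two samtools command lines (samtools 1.x syntax) and prints them; the file names and the helper name `samtools_sort_index_commands` are placeholders. It assumes samtools is installed if the commands are actually run.

```python
import shlex
from typing import List


def samtools_sort_index_commands(bam_in: str, bam_sorted: str) -> List[List[str]]:
    """Return the commands that coordinate-sort a BAM file and then index it."""
    return [
        ["samtools", "sort", "-o", bam_sorted, bam_in],
        ["samtools", "index", bam_sorted],
    ]


commands = samtools_sort_index_commands("aligned.bam", "aligned.sorted.bam")
for command in commands:
    print(shlex.join(command))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(command, check=True)`.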
49
Counting experiments (RNA-seq)
Methods
  Reverse transcribe to cDNA
  Prepare library (usually paired
end and/or strand specific)
Features
  A design for capture is not
required
  Alignment depth is
proportional to the abundance
of the transcript
Applications
  Identify coding sequences,
miRNAs, alternative splicing,
antisense transcripts
  Quantify differential
expression
50
RPKM - Reads Per Kilobase of exon model per Million mapped reads
(Haas & Zody, 2010)
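RPKM normalizes a read count by both transcript length and sequencing depth: RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the exon model, N the total number of mapped reads, and L the exon model length in bp. A small sketch, with hypothetical counts chosen only to show the arithmetic:

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2 kb exon model, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

In this example 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 (million mapped reads) gives 25.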
Strategies for Mapping Junction Reads
  Split reads and align separately to reference
•  Sometimes based on intermediate reference of reconstructed splice
junction sequences
•  Finds known and novel splice sites
•  e.g., TopHat, SOAPsplice, Trinity
51
Frontiers in Genetics, Huang 2011
Slide courtesy of Andrew Oler (BCBB)
Strand specific RNA-seq can more easily
reveal antisense transcript regulation
52
Counting experiments (ChIP-seq)
(Chromatin Immunoprecipitation and Sequencing)
Features
  Allows genome wide discovery of
protein-DNA interactions (e.g.
transcription factor, histone
modification)
  DNA and proteins are cross-linked
and purified; then bound DNA is
analyzed by massively parallel short-
read sequencing
  It is cheaper and provides a better
signal to noise ratio than ChIP-chip,
and is not dependent on probes
53
Counting experiments (ChIP-seq)
Features
  Analysis typically involves
mapping, peak detection
and binding motif analysis
  Challenges include scoring
diffuse or low intensity
peaks in relation to
background and coverage
  Common tools: USeq,
MACS
54
http://bit.ly/qLjRGA
ChIP-seq Downstream Analysis
55
Supplemental Table 2: D1 Histone-enriched loci (Illumina GAII FDR < 0.0001)
GO Category | Total Genes | Changed Genes | Enrichment | FDR
Cell fate commitment | 75 | 60 | 1.59848 | 0
Sequence-specific DNA binding | 424 | 337 | 1.588112 | 0
Cellular morphogenesis during differentiation | 125 | 99 | 1.582495 | 0
Cell projection organization and biogenesis | 169 | 131 | 1.548823 | 0
Cell part morphogenesis | 169 | 131 | 1.548823 | 0
Embryonic morphogenesis | 88 | 68 | 1.543986 | 0
Regionalization | 82 | 63 | 1.535126 | 0
Neurogenesis | 221 | 168 | 1.518918 | 0
Wnt receptor signaling pathway | 107 | 80 | 1.493907 | 0
Regulation of cell differentiation | 119 | 88 | 1.477587 | 0
Regulation of transcription from RNA polymerase II promoter | 99 | 72 | 1.453164 | 0
Organ morphogenesis | 304 | 221 | 1.452566 | 0
Embryonic development | 226 | 164 | 1.449949 | 0
Regulation of developmental process | 191 | 138 | 1.443653 | 0
Voltage-gated ion channel activity | 171 | 123 | 1.43723 | 0
Nervous system development | 604 | 433 | 1.432413 | 0
Cation channel activity | 228 | 162 | 1.419703 | 0
Transcription factor activity | 791 | 552 | 1.394376 | 0
Muscle development | 136 | 94 | 1.38104 | 0
# Peaks Found in Different Tissues
Allele-specific Binding
Oler et al., NSMB, 2010; Mikkelsen et al., Nature, 2007; Park, Nat Rev Genet, 2009; Barski et al., Cell, 2007
Slide courtesy of Andrew Oler / Vijay Nagarajan (BCBB)
Are you still awake?
56
Beyond the basics – a growing list of
NGS applications
De novo genome assembly
58
AllPATHS-LG
http://www.broadinstitute.org/news/2787
De novo genome assembly
  A good assembly needs:
•  Library preparation that minimizes GC bias, which leads to poor coverage
•  High coverage (e.g. 100-fold Illumina) with low error rate
•  For a small genome, if possible, add 50-fold PacBio coverage (1500bp read length) to
reduce the number of contigs. Alternatively, use mate pairs and paired ends of
various insert sizes.
  De novo assemblers for large genomes
•  ALLPATHS-LG and DISCOVAR – developed and recommended by BROAD
Institute http://www.broadinstitute.org/science/programs/genome-biology/crd
•  SOAPdenovo – developed and used by BGI http://1.usa.gov/oTUrWC
•  ABYSS http://www.bcgsc.ca/platform/bioinfo/software/abyss
  De novo assemblers for smaller genomes
•  VELVET
•  NEWBLER (454)
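The fold-coverage targets mentioned for a good assembly follow from a simple relation: coverage = (number of reads × read length) / genome size. The sketch below applies it in both directions; the genome size and read length are hypothetical values picked for the example.

```python
import math


def fold_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage for a given sequencing yield."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical 5 Mb bacterial genome sequenced with 100 bp reads.
needed = reads_needed(100, 100, 5_000_000)
achieved = fold_coverage(needed, 100, 5_000_000)
print(needed, achieved)
```

This is an average only; GC bias and repeats make real coverage uneven, which is why the slide stresses library preparation.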
59
Related publications
http://1.usa.gov/id8h5d
Use case: Panda Genome
Published Nature 2010
•  SOAPdenovo
(de Bruijn graph algorithm)
•  56 fold coverage
•  500bp insert paired end
•  2kb mate pair
•  Genome was 94% complete
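The core idea of a de Bruijn graph can be sketched in a few lines: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases. This is only the graph construction step under toy assumptions (error-free reads, one strand); real assemblers such as SOAPdenovo add error correction, graph simplification, scaffolding and gap closure. The function name and the reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeated edges
            # record how often the k-mer was observed.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Walking the edges AC, CG, GT, TC, CA spells out ACGTCA, the sequence both reads were drawn from.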
Image courtesy of Zhihe Zhang
In Scientific American
De novo genome assembly
Asian Honey Bee (published January 2015)
  A 238 Mbp draft of the A. cerana genome was assembled, yielding 10,651 genes.
•  72% of the A. cerana-specific genes had more than one GO term, and
1,696 enzymes were categorized into 125 pathways.
•  Genes involved in chemoreception and immunity were carefully
identified and compared to those from other sequenced insect
models. These included 10 gustatory receptors, 119 odorant
receptors, 10 ionotropic receptors, and 160 immune-related genes.
61
•  3 libraries
•  Paired end: 500bp
•  Mate pair: 3kb and 10kb
•  2,430 scaffolds
•  RNA-seq data also assembled
•  Tools: AllPaths-LG, RepeatMasker,
•  RNA-seq tools: Trinity, TopHat, Cufflinks
62
Schematic overview of
SOAP denovo algorithm
http://1.usa.gov/oTUrWC
Contig assembly
Scaffolding
Preassembly sequencing
error correction
Gap closure
Beyond the basics – a growing list of
NGS applications
64
Metagenomics and microbiome analysis
Analysis methods:
•  Reference based analysis
•  16S rRNA – OTU based methods
•  Shotgun data (454, Illumina)
•  Assign taxonomy (RDP classifier, blast)
•  Pipelines for 16S rRNA: qiime, mothur
•  Other tools: MEGAN, CARMA,
metaphyler
•  De novo assembly of WGS and functional
analysis of microbiomes
•  Methods are under development with
the goal of dealing with insufficient
coverage, sequencing errors, repeats
•  Tools: MG-RAST, metAMOS, HUMAnN
•  It looks at gene classes, metabolic
pathways
http://bit.ly/o4dGqH http://www.hmpdacc.org/
Sample study: skin microbiome
  The skin is an ecosystem, host to a microbial
milieu that, for the most part, is harmless.
  Analysis of 16S ribosomal RNA genes reveals a
greater diversity of organisms than has been
found by culture-based methods
  The cutaneous immune system modulates
colonization by the microbiota and is also vital
during infection and wounding. Dysregulation of
the skin immune response is evident in several
skin disorders
65
Elizabeth A. Grice & Julia A. Segre
Nature Reviews Microbiology 9, 244-253
Recommended software: mothur, qiime
A growing list of applications
Variant Analysis
…like finding a needle in a ‘deep’ haystack
67
  SNPs – single nucleotide
polymorphisms
  Indels – insertions and
deletions
  CNVs – copy number
variations
  SVs – structural variations
Variant = any position that differs
from a specified reference
sequence
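The definition of a variant as a difference from a reference can be illustrated for the simplest case, single-base substitutions between two sequences of equal length that are already aligned. Indels, CNVs and structural variants need a real aligner and variant caller; the sequences and the function name here are hypothetical.

```python
from typing import List, Tuple


def find_mismatches(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position + 1, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


variants = find_mismatches("ACGTACGT", "ACGAACGT")
print(variants)  # prints: [(4, 'T', 'A')]
```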
68
Efforts at creating databases of variants:
HapMap Project
•  Project that started in 2002 with the goal of describing patterns of human
genetic variation and creating a haplotype map using SNPs present in at
least 1% of the population, which were deposited in dbSNP.
•  It used 269 individuals.
Haplotypes – adjacent SNPs that are inherited together
1000 Genomes
•  Started in 2008 with the goals of sequencing at least 1000 individuals (about
2,500 samples at 4X coverage), interrogating 1000 gene regions in 900
samples (exome analysis), and finding most genetic variants with allele
frequencies above 1% (down to 0.1% in coding regions), as well as
indels and structural variants
•  Make data available to the public ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
or via Amazon Cloud http://s3.amazonaws.com/1000genomes
ENCODE: Encyclopedia of DNA Elements
69
Main Goal:
•  Find all functional elements in the genome
Exome-Seq
Targeted exome capture
  targets ~20,000 variants
near coding sequences
and a few rare missense or
loss of function variants
  Provides high depth of
coverage for more accurate
variant calling
  It is starting to be used as a
diagnostic tool
70
Ann Neurol. 2012 Jan;71(1):5-14.
-Nimblegen
-Agilent
-Illumina
VCF format (version 4.0)
71
  Format used to report information about a position in the
genome
  Used by the 1000 Genomes Project to report all variants
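Each VCF data line describes one genomic position in eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sketch below parses one illustrative line into a dictionary; the function name `parse_vcf_line` is hypothetical, and per-sample genotype columns are ignored.

```python
from typing import Dict

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the eight fixed columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record: Dict[str, object] = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(fields[1])       # 1-based position
    record["ALT"] = fields[4].split(",")  # one or more alternate alleles
    info: Dict[str, object] = {}
    if fields[7] != ".":
        for entry in fields[7].split(";"):
            key, _, value = entry.partition("=")
            info[key] = value if value else True  # flags carry no value
    record["INFO"] = info
    return record


# Illustrative data line (header lines starting with '#' are not handled here).
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
variant = parse_vcf_line(line)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"]["DP"])
```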
VCF format
72
http://www.broadinstitute.org/gsa/wiki/index.php/Understanding_the_Unified_Genotyper's_VCF_files
Thank You
Questions or comments? Please contact:
mariam.quinones@niaid.nih.gov
ScienceApps@niaid.nih.gov
73

CSU Next Generation Sequencing Core 06/09/2015CSU Next Generation Sequencing Core 06/09/2015
CSU Next Generation Sequencing Core 06/09/2015
 
Colorado State University Next Generation Sequencing Core 060915
Colorado State University Next Generation Sequencing Core 060915Colorado State University Next Generation Sequencing Core 060915
Colorado State University Next Generation Sequencing Core 060915
 
Invicta eshre-poster-pregnancy rate after frozen blastocyst
Invicta eshre-poster-pregnancy rate after frozen blastocystInvicta eshre-poster-pregnancy rate after frozen blastocyst
Invicta eshre-poster-pregnancy rate after frozen blastocyst
 
Exploring new frontiers with next-generation sequencing
Exploring new frontiers with next-generation sequencingExploring new frontiers with next-generation sequencing
Exploring new frontiers with next-generation sequencing
 
Nextgenerationsequencing ngs 131218163555-phpapp02
Nextgenerationsequencing     ngs  131218163555-phpapp02Nextgenerationsequencing     ngs  131218163555-phpapp02
Nextgenerationsequencing ngs 131218163555-phpapp02
 
Clinical Validation of an NGS-based (CE-IVD) Kit for Targeted Detection of Ge...
Clinical Validation of an NGS-based (CE-IVD) Kit for Targeted Detection of Ge...Clinical Validation of an NGS-based (CE-IVD) Kit for Targeted Detection of Ge...
Clinical Validation of an NGS-based (CE-IVD) Kit for Targeted Detection of Ge...
 
Avances en genética. Utilidad de la NGS y la bioinformática.
Avances en genética. Utilidad de la NGS y la bioinformática.Avances en genética. Utilidad de la NGS y la bioinformática.
Avances en genética. Utilidad de la NGS y la bioinformática.
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGS
 

Similar a Overview of Next Gen Sequencing Data Analysis

International Cancer Genomics Consortium (ICGC) Data Coordinating Center
International Cancer Genomics Consortium (ICGC) Data Coordinating CenterInternational Cancer Genomics Consortium (ICGC) Data Coordinating Center
International Cancer Genomics Consortium (ICGC) Data Coordinating CenterNeuro, McGill University
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Spark Summit
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theoryC. Tobin Magle
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchEuropean Bioinformatics Institute
 
Datat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planC. Tobin Magle
 
Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Sanjay Padhi, Ph.D
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionChris Dwan
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe
 
Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2GigaScience, BGI Hong Kong
 
The Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationMaori Ito
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistrySunghwan Kim
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekData Driven Innovation
 
Supporting high throughput high-biotechnologies in today’s research environme...
Supporting high throughput high-biotechnologies in today’s research environme...Supporting high throughput high-biotechnologies in today’s research environme...
Supporting high throughput high-biotechnologies in today’s research environme...Ed Dodds
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Databricks
 
Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016Benjamin Good
 
NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw Alexander Pico
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeChunlei Wu
 

Similar a Overview of Next Gen Sequencing Data Analysis (20)

International Cancer Genomics Consortium (ICGC) Data Coordinating Center
International Cancer Genomics Consortium (ICGC) Data Coordinating CenterInternational Cancer Genomics Consortium (ICGC) Data Coordinating Center
International Cancer Genomics Consortium (ICGC) Data Coordinating Center
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theory
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
Datat and donuts: how to write a data management plan
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management plan
 
Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021 Tag.bio aws public jun 08 2021
Tag.bio aws public jun 08 2021
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2Scott Edmunds flashtalk slides from Beyond the PDF2
Scott Edmunds flashtalk slides from Beyond the PDF2
 
The Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data Integration
 
PubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistryPubChem: a public chemical information resource for big data chemistry
PubChem: a public chemical information resource for big data chemistry
 
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
 
Supporting high throughput high-biotechnologies in today’s research environme...
Supporting high throughput high-biotechnologies in today’s research environme...Supporting high throughput high-biotechnologies in today’s research environme...
Supporting high throughput high-biotechnologies in today’s research environme...
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
 
Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016Wikidata workshop for ISB Biocuration 2016
Wikidata workshop for ISB Biocuration 2016
 
NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw NetBioSIG2013-Talk Robin Haw
NetBioSIG2013-Talk Robin Haw
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
 

Más de Bioinformatics and Computational Biosciences Branch

Más de Bioinformatics and Computational Biosciences Branch (20)

Hong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptxHong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptx
 
Virus Sequence Alignment and Phylogenetic Analysis 2019
Virus Sequence Alignment and Phylogenetic Analysis 2019Virus Sequence Alignment and Phylogenetic Analysis 2019
Virus Sequence Alignment and Phylogenetic Analysis 2019
 
Nephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele resultsNephele 2.0: How to get the most out of your Nephele results
Nephele 2.0: How to get the most out of your Nephele results
 
Introduction to METAGENOTE
Introduction to METAGENOTE Introduction to METAGENOTE
Introduction to METAGENOTE
 
Intro to homology modeling
Intro to homology modelingIntro to homology modeling
Intro to homology modeling
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
Homology modeling: Modeller
Homology modeling: ModellerHomology modeling: Modeller
Homology modeling: Modeller
 
Protein docking
Protein dockingProtein docking
Protein docking
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
Protein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on RosettaProtein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on Rosetta
 
UNIX Basics and Cluster Computing
UNIX Basics and Cluster ComputingUNIX Basics and Cluster Computing
UNIX Basics and Cluster Computing
 
Statistical applications in GraphPad Prism
Statistical applications in GraphPad PrismStatistical applications in GraphPad Prism
Statistical applications in GraphPad Prism
 
Intro to JMP for statistics
Intro to JMP for statisticsIntro to JMP for statistics
Intro to JMP for statistics
 
Categorical models
Categorical modelsCategorical models
Categorical models
 
Better graphics in R
Better graphics in RBetter graphics in R
Better graphics in R
 
Automating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtoolsAutomating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtools
 
Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)
 
Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)
 
GraphPad Prism: Curve fitting
GraphPad Prism: Curve fittingGraphPad Prism: Curve fitting
GraphPad Prism: Curve fitting
 
Appendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductorAppendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductor
 

Último

Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 

Último (20)

Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 

Overview of Next Gen Sequencing Data Analysis

  • 1. January 7th, 2015 Mariam Quiñones Computational Biology Specialist Bioinformatics and Computational Biosciences Branch Office of Cyber Infrastructure and Computational Biology
  • 2. Upcoming Seminars on NGS Analysis http://inside.niaid.nih.gov/topic/training/scientificsoftwaretraining/Pages/default.aspx
  • 3. BCBB: A Branch Devoted to Bioinformatics and Computational Biosciences • Researchers’ time is increasingly important • BCBB saves our collaborators time and effort • Researchers speed projects to completion using BCBB consultation and development services • No need to hire extra postdocs or use external consultants or developers
  • 4. BCBB Staff: Bioinformatics Software Developers, Computational Biologists, Project Managers and Analysts
  • 5. Contact BCBB… • NIH users: access a menu of BCBB services on the NIAID Intranet: http://bioinformatics.niaid.nih.gov/ • Outside of NIH: search “BCBB” on the NIAID public Internet page (www.niaid.nih.gov), or use this direct link: http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx • Email us at: ScienceApps@niaid.nih.gov
  • 6. Why has the scientific community adopted deep sequencing? • Cheaper, faster sequencing • No need for cloning or probes • Many applications • Higher specificity and sensitivity (RNA-seq, ChIP-seq) • More...
  • 7. What is Next Generation Sequencing? It is sequencing produced by 2nd and 3rd generation instruments (e.g. Illumina, PacBio). • It is also known as High-Throughput Next Generation Sequencing (HT-NGS) or “Deep Sequencing”. It provides deeper coverage than typical Sanger sequencing. Image from: http://s.ngm.com/2009/06/tag-caves/img/01-rumbling-falls-615.jpg
  • 8. Agenda for today • Overview of Next Generation Sequencing • NGS sequencing platforms • NGS Analysis Basics: file formats, quality control, viewing alignment files • Common applications of NGS
  • 9. Remember Sanger? • Sanger introduced the “dideoxy method” (also known as Sanger sequencing) in December 1977 • Alignment of reads using tools such as ‘Sequencher’ IMAGE: http://www.lifetechnologies.com
  • 10. Sanger to Next Generation Sequencing • 1977: Sanger, dideoxy chain termination • 1986: Hood et al., fluorescently labeled ddNTPs, partial automation • 1990: NIH begins Human Genome Project • 2001: HGP/Celera draft assembly published in Nature / Science • 2004: Next-Gen Sequencing (454 Roche) • 2006: First Solexa sequencer, Genome Analyzer, 1G/run • 1990 – 2003 • “shotgun” • 2007: J. Craig Venter, James Watson
  • 11. Greater vision: Genomics to Bedside • “Only a population perspective can fulfill the promise of genomic medicine. The scientific landscape for genomics is exciting, and the promise for improving health is great. Applying genomic tools in clinical and public health practice will require a multidisciplinary research collaboration of basic sciences with clinical and population sciences (e.g., epidemiologists; behavioral, social, and communication scientists; health services researchers; and public health practitioners)” Am J Public Health. 2012 January; 102(1): 34–37
  • 12. Popular Sequencing Platforms (non-Illumina) • SOLiD 5500 xl series: 320 Gb / 8 day run • GS FLX Titanium XL+: 700 bp reads, up to 700 Mb / 23 hours • PacBio RS II: 500 Mb – 1 Gb / 4 hr run (up to 40 kb read lengths) • Ion Torrent 318: 1.2–2 Gb / 7 hr • Ion Torrent Proton: 2 exomes / 2-4 hr run • Technologies shown: pH sensing, sequencing by synthesis, single molecule
  • 13. Roche 454 • Pyrosequencing • Used mostly for targeted sequencing such as 16S rRNA • Long reads (>500 bp) but with a high error rate in homopolymer regions • Much lower yield than other platforms
  • 14. PacBio - Single Molecule Real Time • Very long reads that are good for spanning repeats, but with an 11% error rate • Consensus analysis of reads corrects errors • Good for base modification detection • Can be combined with shorter reads to improve de novo assemblies. Genome Res. 2013 Jan;23(1):121-8. doi: 10.1101/gr.141705.112. Epub 2012 Oct 11
  • 15. Ion Torrent sequence detection http://en.wikipedia.org/wiki/Ion_semiconductor_sequencing
  • 16. New kid - MinION: bases identified by changes in current
  • 18. HiSeq X = $1000/genome at 30X, and more throughput... Broad Institute, Macrogen…
  • 19. Illumina • Sequencing by synthesis • It uses dNTPs containing a terminator (with a fluorescent label) that blocks further polymerization, so only one base is added per cycle http://nxseq.bitesizebio.com/articles/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
  • 20. Where are these sequences being stored? Some data repositories include: • NCBI SRA database http://www.ncbi.nlm.nih.gov/sra • European Nucleotide Archive (ENA) http://www.ebi.ac.uk/ena/about/sra_submissions • 1000 Genomes data http://www.1000genomes.org/data • Human Microbiome Project (microbiome data) http://hmpdacc.org/
  • 22. Major challenges when working with sequencing data. We need: • Algorithms for managing (LIMS), analyzing and visualizing data • Reproducible workflows and standards for analysis • Better transfer and data storage technology • Specialized tools for integrating various data types. Emerging solutions: Algorithms that can parallelize jobs in a cluster • ABySS (uses MPI), AllPaths LG, Discovar • GATK (Genome Analysis Toolkit) uses MapReduce (Google’s framework). Web tools with workflow capabilities • Galaxy Bioinformatics https://usegalaxy.org • Various cloud-based solutions (e.g. Illumina BaseSpace) • Lots of open source tools: see http://seqanswers.com/wiki/Software
  • 23. Galaxy https://usegalaxy.org   Makes analysis methods available to the community and facilitates reproducibility via creation of reusable workflows (read Galaxy slides)   Free web service, also compatible with Cloud http://usegalaxy.org/cloud   Open source   Provides a Genome Track Browser to visualize custom data. 23
  • 24. I have data, where do I start? (Cartoons from: fixingpcerrors.com and squido.com)
  • 25. First the basics – NGS 101: Sequence data
      •  What does a short read look like?
      •  How do I know if the sequencing facility has provided good quality reads?
      •  What should I expect if the sequencing facility has mapped the reads to my genome of interest?
  • 27. Common Sequence file formats
      Next gen sequence file formats are based on the commonly used FASTA format:
      >sequence_ID and optional comments
      ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
      TTCGAAATTGGCGTCAGT
      The Phred quality scores per base were added (FASTQ):
      @HWI-ST406:207:D1DGFACXX:8:1101:20481:2058 1:N:0:AGTCAA
      CATGGGGATCGAATTCATCGCCGTCCCCTCTGTTCCGATTTATTCCATATGTGCTTCGCAACAACGCTTTCTCACAGAATACAGGAGCTTCTATACTGTA
      +
      BBBFFFFFFFFFFIIIIIIFFIIFFIIIFFIIFFFIFBFIIIIIIIIIFIIFBFFIFFFBFFBFFBFBFFFFFFFBBFFFFFFBBBBBBBBBBBFFFBFB
  • 28. Raw sequence file formats
      FASTQ format (FASTA format with quality values for each base)
      @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
      AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA   (base calls)
      +
      BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@   (base quality + 33)
      Full read header description:
      @<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
      A space separates the read ID (the part before it) from the remaining fields.
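The header layout described above can be split into named fields with a few lines of Python. This is only a sketch that assumes headers follow exactly the colon-separated layout shown on the slide; the `ReadHeader` type and `parse_fastq_header` function are names made up for illustration, not part of any sequencing toolkit.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    """Fields of a FASTQ header in the layout shown on the slide."""
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str


def parse_fastq_header(line: str) -> ReadHeader:
    """Split a header line into the read ID fields and the description fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The single space separates the read ID from the rest of the header.
    read_id, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, filtered, control, barcode = description.split(":")
    return ReadHeader(
        instrument, run_id, flowcell,
        int(lane), int(tile), int(x_pos), int(y_pos),
        int(read_number), filtered == "Y", int(control), barcode,
    )


header = parse_fastq_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(header.lane, header.is_filtered, header.barcode)
```

Headers written by other instruments or older pipelines use different layouts, so a real parser would need to check the field count before unpacking.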
  • 29. Quality values
      Quality scores are normally expected up to 40 on the Phred scale and are encoded as ASCII characters <http://en.wikipedia.org/wiki/ASCII>
      BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
      The highest base quality score in this sequence: 'D' = (68 - 33) = 35
      If base quality = 35: $P = 10^{-35/10} = 0.00032$ (about 1 in 3200 incorrect)
      From http://en.wikipedia.org/wiki/FASTQ_format
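The arithmetic on this slide can be reproduced directly. The sketch below assumes the Phred+33 encoding used in the example; the function names are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


quality = "BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@"
scores = phred_scores(quality)
best = max(scores)    # 'D' has ASCII code 68, so the score is 68 - 33
worst = min(scores)   # '7' has ASCII code 55, so the score is 55 - 33
print(best, error_probability(best))
print(worst, error_probability(worst))
```

Older Illumina pipelines used an offset of 64 instead of 33, which is why the offset is a parameter here rather than a constant.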
  • 30. Other read formats
      SFF (Roche 454 or Ion Torrent)
      •  SFF files contain flowgrams, Phred quality scores, and clipping information
      •  454 reads are often reported as FASTA and QUAL files or converted to FASTQ
  • 31. First the basics – NGS 101: Sequence data
      •  What does a short read look like?
      •  How do I know if the sequencing facility has provided good quality reads?
      •  What should I expect if the sequencing facility has mapped the reads to my genome of interest?
  • 32. Basic Concepts in Quality Control of sequence data
      The sequencing facility runs quality control tests to confirm that the run was successful and/or to decide whether a new library is good enough to sequence further.
      The user should run quality control tests before full bioinformatics analyses
      •  This avoids misinterpreting the data because of unexpected bias
      •  QC measurements can report the following:
         –  Percent GC in sample reads
         –  Presence of overrepresented k-mers and sequences such as adapters
         –  Per base quality score
         –  Distribution of nucleotide bases
      After mapping reads to a genome, additional tests can be run to determine:
         –  Mapping error rate
         –  Percent of possible PCR duplicates (reads with the same start and end position in the reference genome)
         –  Distribution of insert size (paired ends)
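Two of the QC measurements listed above, percent GC and per base quality, are simple enough to sketch by hand. The code below is an illustration on made-up toy reads, assuming Phred+33 qualities; it is not how FastQC or any other QC tool is implemented.

```python
from typing import List


def gc_percent(reads: List[str]) -> float:
    """Percent of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    gc = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    return 100.0 * gc / total if total else 0.0


def mean_quality_per_position(qualities: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (shorter reads are skipped past their end)."""
    longest = max((len(q) for q in qualities), default=0)
    means = []
    for i in range(longest):
        column = [ord(q[i]) - offset for q in qualities if len(q) > i]
        means.append(sum(column) / len(column))
    return means


# Toy data, invented for this example.
toy_reads = ["GGCC", "ATAT"]
toy_qualities = ["II", "5I"]   # 'I' = Phred 40, '5' = Phred 20
print(gc_percent(toy_reads))
print(mean_quality_per_position(toy_qualities))
```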
  • 33. Demo – FASTQ quality
  • 34. Quality control with FastQC
      By: FastQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
      Plots shown: mean quality per base, sequence content per base, GC content per read
      For filtering or trimming by quality, use tools such as FASTX-Toolkit, Btrim, PRINSEQ.
  • 35. First the basics – NGS 101: Sequence data
      •  What does a short read look like?
      •  How do I know if the sequencing facility has provided good quality reads?
      •  What should I expect if the sequencing facility has mapped (aligned) the reads to my genome of interest?
  • 36. File formats for aligned reads
      SAM (sequence alignment map)
      Figure: example reads and the mapped coordinates reported for them
      CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT
      CTTGGGCTGCGTCGTGTCTTCGCTTCACACC
      Chr_   start    end
      Chr2   100000   100050
      Chr2   50000    50050
  • 37. Most commonly used alignment file formats
      SAM (sequence alignment map): unified format for storing alignments to a reference genome
      BAM (binary version of SAM): compressed SAM file, normally indexed; commonly used to deliver data
      BED: commonly used to report features described by chrom, start, end, name, score, and strand. For example:
      chr1 11873 14409 uc001aaa.3 0 +
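The BED example line can be read into named fields as follows. This is a sketch that assumes a six-column, whitespace-separated BED line like the one on the slide; `BedFeature` and `parse_bed_line` are illustrative names. BED coordinates are 0-based with an exclusive end, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first six columns of a BED line."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedFeature(chrom, int(start), int(end), name, int(score), strand)


feature = parse_bed_line("chr1 11873 14409 uc001aaa.3 0 +")
length = feature.end - feature.start
print(feature.name, feature.strand, length)
```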
  • 38. SAM/BAM format (sequence alignment map)
      Columns: QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT
      (MRNM and MPOS are named RNEXT and PNEXT in the current SAM specification)
      http://samtools.sourceforge.net/samtools.shtml#8
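To make the column list concrete, here is a minimal sketch that splits one SAM alignment line into the eleven mandatory fields and decodes the bitwise FLAG. The example line is invented for illustration, and the helper names are not part of samtools or any library; real projects normally read SAM/BAM through a dedicated library instead.

```python
from typing import Dict, List

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# Bit meanings from the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split a tab-separated SAM alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = columns[11:]   # optional TAG:TYPE:VALUE fields
    return record


def decode_flag(flag: int) -> List[str]:
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical alignment line, made up for this example.
example = "read1\t99\tChr2\t100000\t60\t10M\t=\t100200\t210\tCTTGGGCTGC\tBBBFFFFFFF\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```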
  • 39. First the basics – NGS 101: How to visualize an alignment?
      •  Use a genome browser
  • 40. What is a Genome Browser?
      Graphical interface for display of genomic information from biological databases
      Besides genome sequence, they provide additional data:
      •  Known and predicted genes
      •  ESTs
      •  mRNAs
      •  CpG islands
      •  Assembly gaps and coverage
      •  Chromosomal bands
      •  Homology to other organisms
      •  RNA-seq data
      •  Transcription factor binding sites
      •  GC percent
      •  Splicing variants
      •  Known SNPs
      •  Associated publications
      •  Sequence repeats
  • 41. Viewing reads in browser
      If your genome is available via the UCSC genome browser http://genome.ucsc.edu/, import a BAM file by hosting it on a server and providing the link.
      If your genome is not in UCSC, use another browser such as IGV http://www.broadinstitute.org/igv/ or IGB http://bioviz.org/igb/
      •  Import genome (fasta)
      •  Import annotations (gff3 or bed format)
      •  Import data (bam)
  • 42. Data from ENCODE, Expression (RNA-Seq), Methylation and Transcription Factor binding (Chip-Seq) and more.
  • 43. Use UCSC Custom Track to display data
      •  Used to display your own data and annotations
      •  A variety of formats are accepted (click on the links to file types for more information)
      •  Data remains available for a limited time after upload
      •  Next-gen sequencing data can also be imported, typically by hosting a BAM file on a server and providing the link
  • 44. Integrative Genomics Viewer (IGV) http://www.broadinstitute.org/igv/
      Java tool that runs on the user's computer
      Allows upload of custom annotations in many formats
      •  Add your genome (for example the latest version of P. falciparum) in fasta format.
      •  Add gene expression (gct format) or sequence alignments (bam format).
      •  Add custom annotations (bed, wig, gff3 format).
  • 45. Example of use of IGV to visualize a custom genome and data (figure: reads imported in bam format, annotation imported in bed format)
  • 46. Beyond the basics – a growing list of NGS applications
      High throughput sequencing:
      •  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
      •  Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
      •  CNV, structural variations
      •  Targeted resequencing ("Exome analysis")
      •  Whole genome sequencing
      •  Metagenomics (16S microbiome, environmental WGS)
      •  Somatic mutations
      •  Variants in Mendelian diseases
      •  De novo genome assembly
  • 47. Most applications of NGS (RNA-Seq, ChIP-Seq, resequencing, variant analysis…) require alignment to a known genome as the first step. Slide modified from Andrew Oler (BCBB)
  • 48. How to align reads to a genome?
      Step 1: Choose an appropriate alignment software http://seqanswers.com/wiki/Software
      •  Common tools:
         –  Bowtie: fast, accurate (e.g. for ChIP-Seq)
         –  BWA: fast, accurate, gapped alignment (variant analysis)
         –  TopHat: uses Bowtie for initial mapping and then maps junctions (good for RNA-seq mapping)
         –  GSMapper, MIRA: developed for 454 Roche data
         –  QIIME, mothur: suites for processing and alignment of 16S rRNA amplicon data for microbiome
  • 49. How to align reads to a genome?
      Step 2: Map the reads to generate an alignment file (bam). To visualize bam output files, sort and index the output file using tools such as samtools or Picard.
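The sort-and-index step can be scripted. The sketch below builds the two samtools command lines and runs them only when asked. It assumes samtools 1.3 or later is installed (older releases used a different `sort` syntax), and the file name `sample.bam` and the helper names are placeholders chosen for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def sort_and_index_commands(bam_path: str) -> List[List[str]]:
    """Build the samtools commands that coordinate-sort a BAM file and index the result."""
    bam = Path(bam_path)
    sorted_bam = bam.with_name(bam.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        ["samtools", "index", str(sorted_bam)],
    ]


def sort_and_index(bam_path: str) -> None:
    """Run the commands; raises if samtools is missing or a step fails."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for command in sort_and_index_commands(bam_path):
        subprocess.run(command, check=True)


commands = sort_and_index_commands("sample.bam")
for command in commands:
    print(" ".join(command))
```

Indexing produces a `.bai` file next to the sorted BAM, which genome browsers such as IGV expect to find alongside the alignment file.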
  • 50. Counting experiments (RNA-seq)
      Methods
      •  Reverse transcribe to cDNA
      •  Prepare library (usually paired end and/or strand specific)
      Features
      •  A design for capture is not required
      •  Alignment depth is proportional to the abundance of the transcript
      Applications
      •  Identify coding sequences, miRNAs, alternative splicing, antisense transcripts
      •  Quantify differential expression
      RPKM: Reads Per Kilobase of exon model per Million mapped reads (Haas & Zody, 2010)
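The RPKM definition translates into a one-line formula: reads mapped to the gene, divided by exon length in kilobases and by total mapped reads in millions. The numbers in the example below are invented for illustration.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp exon model, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Normalizing by length and library size makes genes of different sizes and samples of different depths comparable, which raw read counts are not.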
  • 51. Strategies for Mapping Junction Reads
      Split reads and align separately to reference
      •  Sometimes based on an intermediate reference of reconstructed splice junction sequences
      •  Finds known and novel splice sites
      •  e.g., TopHat, SOAPsplice, Trinity
      Frontiers in Genetics, Huang 2011. Slide courtesy of Andrew Oler (BCBB)
  • 52. Strand specific RNA-seq can more easily reveal antisense transcript regulation
  • 53. Counting experiments (ChIP-seq) (Chromatin Immunoprecipitation and Sequencing)
      Features
      •  Allows genome-wide discovery of protein-DNA interactions (e.g. transcription factor, histone modification)
      •  DNA and proteins are cross-linked and purified; then bound DNA is analyzed by massively parallel short-read sequencing
      •  It is cheaper and provides a better signal-to-noise ratio than ChIP-chip, and it does not depend on probes
  • 54. Counting experiments (ChIP-seq)
      Features
      •  Analysis typically involves mapping, peak detection and binding motif analysis
      •  Challenges include scoring diffuse or low intensity peaks in relation to background and coverage
      •  Common tools: USeq, MACS
      http://bit.ly/qLjRGA
  • 55. ChIP-seq Downstream Analysis
      Supplemental Table 2: D1 Histone-enriched loci (Illumina GAII FDR < 0.0001)
      GO Category | Total Genes | Changed Genes | Enrichment | FDR
      Cell fate commitment | 75 | 60 | 1.59848 | 0
      Sequence-specific DNA binding | 424 | 337 | 1.588112 | 0
      Cellular morphogenesis during differentiation | 125 | 99 | 1.582495 | 0
      Cell projection organization and biogenesis | 169 | 131 | 1.548823 | 0
      Cell part morphogenesis | 169 | 131 | 1.548823 | 0
      Embryonic morphogenesis | 88 | 68 | 1.543986 | 0
      Regionalization | 82 | 63 | 1.535126 | 0
      Neurogenesis | 221 | 168 | 1.518918 | 0
      Wnt receptor signaling pathway | 107 | 80 | 1.493907 | 0
      Regulation of cell differentiation | 119 | 88 | 1.477587 | 0
      Regulation of transcription from RNA polymerase II promoter | 99 | 72 | 1.453164 | 0
      Organ morphogenesis | 304 | 221 | 1.452566 | 0
      Embryonic development | 226 | 164 | 1.449949 | 0
      Regulation of developmental process | 191 | 138 | 1.443653 | 0
      Voltage-gated ion channel activity | 171 | 123 | 1.43723 | 0
      Nervous system development | 604 | 433 | 1.432413 | 0
      Cation channel activity | 228 | 162 | 1.419703 | 0
      Transcription factor activity | 791 | 552 | 1.394376 | 0
      Muscle development | 136 | 94 | 1.38104 | 0
      Other panels: # Peaks Found in Different Tissues; Allele-specific Binding
      Oler et al., NSMB, 2010; Mikkelsen et al., Nature, 2007; Park, Nat Rev Genet, 2009; Barski et al., Cell, 2007
      Slide courtesy of Andrew Oler / Vijay Nagarajan (BCBB)
  • 56. Are you still awake?
  • 58. De novo genome assembly: ALLPATHS-LG http://www.broadinstitute.org/news/2787
  • 59. De novo genome assembly
      A good assembly needs:
      •  Library preparation that minimizes GC bias, which leads to poor coverage
      •  High coverage (e.g. 100-fold Illumina) with low error rate
      •  For a small genome, if possible, add 50-fold PacBio coverage (1500bp read length) to reduce the number of contigs. Alternatively, use mate pairs and paired ends of various insert sizes.
      De novo assemblers for large genomes
      •  ALLPATHS-LG and DISCOVAR: developed and recommended by BROAD Institute http://www.broadinstitute.org/science/programs/genome-biology/crd
      •  SOAPdenovo: developed and used by BGI http://1.usa.gov/oTUrWC
      •  ABySS http://www.bcgsc.ca/platform/bioinfo/software/abyss
      De novo assemblers for smaller genomes
      •  Velvet
      •  Newbler (454)
      Related publications http://1.usa.gov/id8h5d
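The coverage targets above can be turned into a rough read budget using the usual approximation coverage = (number of reads × read length) / genome size. The genome size and read length in the example are hypothetical, and the estimate ignores reads lost to quality filtering or contamination.

```python
import math


def mean_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Expected average fold coverage of the genome."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Minimum number of reads required to reach the target fold coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical 5 Mbp bacterial genome sequenced with 100 bp reads.
needed = reads_needed(100, 100, 5_000_000)
print(needed, mean_coverage(needed, 100, 5_000_000))
```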
  • 60. De novo genome assembly use case: Panda Genome (published in Nature, 2010)
      •  SOAPdenovo (de Bruijn graph algorithm)
      •  56-fold coverage
      •  500bp insert paired end
      •  2kb mate pair
      •  Genome was 94% complete
      Image courtesy of Zhihe Zhang, in Scientific American
  • 61. Asian Honey Bee (published January 2015)
      Produced a 238 Mbp draft of the A. cerana genome and generated 10,651 genes.
      •  72% of the A. cerana-specific genes had more than one GO term, and 1,696 enzymes were categorized into 125 pathways.
      •  Genes involved in chemoreception and immunity were carefully identified and compared to those from other sequenced insect models. These included 10 gustatory receptors, 119 odorant receptors, 10 ionotropic receptors, and 160 immune-related genes.
      Sequencing and assembly:
      •  3 libraries
      •  Paired end: 500bp
      •  Mate pair: 3kb and 10kb
      •  2,430 scaffolds
      •  RNA-seq data also assembled
      •  Tools: ALLPATHS-LG, RepeatMasker
      •  RNA-seq tools: Trinity, TopHat, Cufflinks
  • 62. Schematic overview of the SOAPdenovo algorithm http://1.usa.gov/oTUrWC
      Steps shown: preassembly sequencing error correction, contig assembly, scaffolding, gap closure
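The contig assembly step in assemblers of this kind rests on a de Bruijn graph, where every k-mer in the reads becomes an edge between its (k-1)-base prefix and suffix. The toy sketch below illustrates that idea only; it is not SOAPdenovo's implementation, and it ignores reverse complements, sequencing errors, repeats, and scaffolding. The reads and helper names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:      # stop instead of looping around a cycle
            break
        contig += next_node[-1]    # each step adds one new base
        seen.add(next_node)
        node = next_node
    return contig


toy_reads = ["ATGGC", "TGGCG"]     # two overlapping toy reads
graph = build_de_bruijn(toy_reads, k=3)
contig = walk_unambiguous(graph, "AT")
print(contig)
```

Because overlapping reads share k-mers, they collapse onto the same edges, so the walk recovers a sequence longer than either read without ever comparing the reads pairwise.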
  • 64. Metagenomics and microbiome analysis
      Analysis methods:
      •  Reference based analysis
      •  16S rRNA: OTU based methods
      •  Shotgun data (454, Illumina)
      •  Assign taxonomy (RDP classifier, BLAST)
      •  Pipelines for 16S rRNA: QIIME, mothur
      •  Other tools: MEGAN, CARMA, MetaPhyler
      •  De novo assembly of WGS and functional analysis of microbiomes
      •  Methods are under development with the goal of dealing with insufficient coverage, sequencing errors, repeats
      •  Tools: MG-RAST, metAMOS, HUMAnN
      •  It looks at gene classes, metabolic pathways
      http://bit.ly/o4dGqH http://www.hmpdacc.org/
  • 65. Sample study: skin microbiome
      •  The skin is an ecosystem, host to a microbial milieu that, for the most part, is harmless.
      •  Analysis of 16S ribosomal RNA genes reveals a greater diversity of organisms than has been found by culture-based methods
      •  The cutaneous immune system modulates colonization by the microbiota and is also vital during infection and wounding. Dysregulation of the skin immune response is evident in several skin disorders
      Elizabeth A. Grice & Julia A. Segre, Nature Reviews Microbiology 9, 244-253
      Recommended software: mothur, QIIME
  • 66. A growing list of applications
      High throughput sequencing:
      •  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
      •  Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
      •  CNV, structural variations
      •  Targeted resequencing ("Exome analysis")
      •  Whole genome sequencing
      •  Metagenomics (16S microbiome, environmental WGS)
      •  Somatic mutations
      •  Variants in Mendelian diseases
      •  De novo genome assembly
  • 67. Variant Analysis …like finding a needle in a 'deep' haystack
      Variant = any position that differs from a specified reference sequence
      •  SNPs: single nucleotide polymorphisms
      •  Indels: insertions and deletions
      •  CNVs: copy number variations
      •  SVs: structural variations
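The definition of a variant as a position that differs from the reference can be shown on a toy example. The sketch below handles only single-base substitutions between two already aligned sequences of equal length; real variant callers work from many aligned reads and model sequencing error, which this deliberately does not. Both sequences are invented.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


snps = find_snps("ACGTACGT", "ACGAACGC")
print(snps)
```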
  • 68. Efforts at creating databases of variants:
      HapMap Project
      •  Started in 2002 with the goal of describing patterns of human genetic variation and creating a haplotype map using SNPs present in at least 1% of the population, which were deposited in dbSNP.
      •  It used 269 individuals.
      •  Haplotypes: adjacent SNPs that are inherited together
      1000 Genomes
      •  Started in 2008 with the goals of sequencing at least 1000 individuals (about 2,500 samples at 4X coverage), interrogating 1000 gene regions in 900 samples (exome analysis), and finding most genetic variants with allele frequencies above 1% (down to 0.1% in coding regions), as well as indels and structural variants
      •  Makes data available to the public at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ or via Amazon Cloud http://s3.amazonaws.com/1000genomes
  • 69. ENCODE: Encyclopedia of DNA Elements
      Main Goal:
      •  Find all functional elements in the genome
  • 70. Exome-Seq
      Targeted exome capture (NimbleGen, Agilent, Illumina)
      •  Targets ~20,000 variants near coding sequences and a few rare missense or loss of function variants
      •  Provides high depth of coverage for more accurate variant calling
      •  It is starting to be used as a diagnostic tool
      Ann Neurol. 2012 Jan;71(1):5-14.
  • 71. VCF format (version 4.0)
      •  Format used to report information about a position in the genome
      •  Used by the 1000 Genomes Project to report all variants
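A VCF data line has eight fixed tab-separated columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses one such line into a dictionary; the example line follows the style of the sample record in the VCF 4.0 specification, and the function name is illustrative. Header lines (starting with `#`) and per-sample genotype columns are not handled here.

```python
from typing import Dict


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the eight fixed columns of a VCF data line."""
    chrom, pos, variant_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_fields: Dict[str, object] = {}
    if info != ".":
        for entry in info.split(";"):
            key, has_value, value = entry.partition("=")
            # Keys without '=' are flags (present means true).
            info_fields[key] = value if has_value else True
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),                      # several alternate alleles are comma-separated
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_fields,
    }


variant = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2")
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"], variant["INFO"]["DP"])
```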
  • 73. Thank You
      For questions or comments, please contact: mariam.quinones@niaid.nih.gov or ScienceApps@niaid.nih.gov