20140710 3 l_paul_ercc2.0_workshop
- 1. © Lexogen, 2013
Spike-In RNA Variants:
Design, Production and Application
ERCC 2.0 workshop
Stanford University – July 10-11, 2014
PPT Number TBD
Project Number 0221
Theme T5.2 Mixquer Transcript Quantification (WAFF)
Author Lukas Paul
- 2. © Lexogen, 2014
1. Company introduction
2. ERCC spike-in mixes in Lexogen's R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
ERCC 2.0 Workshop
Vertraulich / Confidential
- 3. © Lexogen, 2014
Lexogen: Company
• Founded in 2007
• Based in Vienna, Austria
• 28 employees (75% in R&D)
• Lexogen, Inc.: overnight delivery to US customers
• Services & products with focus on
o Transcriptome profiling technologies
o Complementary technologies to Next Generation Sequencing
o Innovative solutions for transcriptome research
Lexogen’s mission is to develop innovative technologies that make it possible to resolve all complexities of the transcriptome, one of the most enigmatic and exciting areas in biology.
www.LEXOGEN.com
- 4. © Lexogen, 2014
1. Company introduction
2. ERCC spike-in mixes in Lexogen's R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
ERCC 2.0 Workshop
- 5. © Lexogen, 2014
SENSE™ mRNA-Seq Library Preparation Kit
• Convenient, fragmentation-free workflow
• Core technology: reverse transcription and ligation on intact RNA
• Results in very high preservation of strand orientation
PN0203 PPT0383
- 6. © Lexogen, 2014
ERCC-based Validation of Strandedness
• Strandedness is usually quantified by comparing the orientation of a mapped read with the genome annotation
• Problem: the annotation is incomplete, and natural antisense transcription interferes
Use of ERCC transcripts with known orientation provides an absolute means to determine strandedness
Total RNA   Strand Specificity (ERCCs only) [a]   False Antisense Reads [b]   Sense Reads (genome-wide) [c]
2 µg        99.997%                               0.003%                      99.890%
1 µg        99.986%                               0.014%                      99.815%
500 ng      99.997%                               0.003%                      99.821%
50 ng       99.965%                               0.035%                      99.779%
[a] Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
[b] Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
[c] Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
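The two ERCC-based measures defined in footnotes [a] and [b] are complementary fractions of the same read pool. A minimal sketch of that arithmetic follows; the function name and the read counts are hypothetical and chosen only for illustration, they are not taken from the experiment.

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> float:
    """Fraction of spike-in reads that map in the known (sense) orientation."""
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no spike-in reads mapped")
    return sense_reads / total


# Hypothetical counts (not from the slides), chosen to illustrate the arithmetic.
sense, antisense = 99_997, 3
spec = strand_specificity(sense, antisense)
print(f"strand specificity: {spec:.3%}")
print(f"false antisense reads: {1 - spec:.3%}")
```

Because the spike-in orientation is known by construction, every antisense read counted here is an artifact of library preparation rather than biology.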
- 7. © Lexogen, 2014
ERCC-validated Strandedness Determines False Positive Background of Library Preparation Method
Knowing the strandedness of the library preparation protocol allows for determining whether a detected transcript is truly antisense or belongs to the false positive background.
[Figure: number of true antisense transcripts at 98% and 99.9% strandedness; values shown: 1153 and 2415]
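One way to apply a known strandedness is to estimate how many antisense reads a locus would receive purely from library-preparation errors and compare that with the observed count. The sketch below is an illustration under stated assumptions, not the method used for the figure: the `fold` threshold and the read counts are hypothetical.

```python
def expected_false_antisense(sense_reads: int, strandedness: float) -> float:
    """Antisense reads expected from protocol error alone.

    If a fraction `strandedness` of reads keeps the correct orientation,
    then for every correctly oriented read there are on average
    (1 - strandedness) / strandedness wrongly oriented ones.
    """
    if not 0.0 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0, 1]")
    return sense_reads * (1.0 - strandedness) / strandedness


def exceeds_background(antisense_reads: int, sense_reads: int,
                       strandedness: float, fold: float = 2.0) -> bool:
    """True if the antisense signal exceeds `fold` times the expected background.

    `fold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    return antisense_reads > fold * expected_false_antisense(sense_reads, strandedness)


# Hypothetical locus: 10,000 sense reads and 50 antisense reads.
for s in (0.98, 0.999):
    print(s, exceeds_background(50, 10_000, s))
```

With these example numbers the same 50 antisense reads are indistinguishable from background at 98% strandedness but stand out at 99.9%, which is the effect the slide describes.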
- 8. © Lexogen, 2014
“ERCC-validated” Strandedness in Lexogen’s Portfolio
• SENSE mRNA-Seq library preparation kit
• SENSE Total RNA-Seq library preparation kit
• QuantSeq™ 3’ mRNA library preparation kit; ERCCs are also used to assess the correctness of 3’ end mapping
[Figure: QuantSeq workflow]
- 9. © Lexogen, 2014
Correlation Between ERCC Input and FPKM Measured
[Figure: scatter plot of measured FPKM (y-axis, 10^-2 to 7.5x10^4) against ERCC input, N of molecules [10^2] (x-axis, 1 to 10^6)]
o SENSE, R² = 0.910
• Competitors, R² = 0.834
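The R² values quoted for this plot summarize how well measured FPKM tracks the known spike-in input. A minimal sketch of such a calculation follows, assuming the correlation is taken on log10-transformed values (the slide does not say how R² was computed); the data points are hypothetical.

```python
import math


def r_squared_loglog(inputs, fpkms):
    """Squared Pearson correlation of log10(input) against log10(FPKM).

    Pairs with a non-positive value are skipped because their log is undefined.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(inputs, fpkms) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two usable data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("no variance in one of the variables")
    return sxy * sxy / (sxx * syy)


# Hypothetical, perfectly proportional data (not from the experiment).
molecules = [1e2, 1e3, 1e4, 1e5]
fpkm = [0.5, 5.0, 50.0, 500.0]
print(round(r_squared_loglog(molecules, fpkm), 6))
```

Working in log space keeps the few highly abundant spike-ins from dominating the fit across a dynamic range of several orders of magnitude.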
- 10. © Lexogen, 2014
Further Use for ERCC: Transcript Length Coverage
• Native genes: interference from divergent annotations and differentially expressed transcript variants
• Primer selectivity
ERCCs show seamless coverage from first to last nucleotide
Native transcripts start with high coverage, indicative of 5’-truncated annotations
Example: SQUARE™ library prep with intrinsic over-representation of termini
[Figure panels: ERCC-0096; Top 500 transcripts]
- 11. © Lexogen, 2014
1. Company introduction
2. ERCC spike-in mixes in Lexogen's R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
ERCC 2.0 Workshop
- 12. © Lexogen, 2014
Spike-In RNA Variants (SIRVs) - Rationale
• ERCC spike-in controls were designed as mono-exonic RNAs without sequence overlap.
• As a complement, we found it desirable to have a set of nucleic acids simulating transcript variants that can be used as external spike-in controls.
• This reference set would
o comprise two or more transcript families, with transcripts of the same family representing reference transcript variants of the same gene
o enable the controlled identification and/or quantification of transcript variants in one or more samples and
o permit the assessment, validation and correction of bioinformatics pipelines.
- 13. © Lexogen, 2014
Spike-In RNA Variants – Gene Structure
Reference genes
• 7 human genes selected because of diversity in exon-intron structure
• Annotated transcripts (Ensembl database) aligned to gene in CLC Main Workbench
• "Master transcript" created for each gene (sequence of all transcript variants)
[Figures: transcript alignments for KLK5 and LDHD, CLC Main Workbench 5]
- 14. © Lexogen, 2014
Addition of Transcript Variants
• Annotated transcript variants were analyzed for alternative splicing (AS) events
• AS events not covered by a variant within a family were incorporated in a new variant based on the master transcript
• To cover non-splicing variants, antisense and overlapping transcripts were added (mono- and poly-exonic)
• Further, Transcription Start-Site (TSS) and End-Site (TES) variants were added
[Figure: KLK5 and SIRV1 transcript structures]
- 15. © Lexogen, 2014
Spike-In RNA Variants (SIRV): Nucleotide Sequence
AIM
• The nucleotide sequence of the SIRVs should be non-homologous at least to eukaryotic genomes and transcriptomes.
• In the best case, they should not align with any naturally occurring sequence.
SOLUTION
• Genomic sequences from viruses were used to fill in exon sequences. This alone would work in external controls for eukaryotes.
• Sequences were then inverted (flipped) to lose alignment identity.
Final sequences do not align with any entry in the NCBI nt collection when searched with BLAST using standard parameters.
SIRV sequences also do not align with ERCC sequences.
In silico experiments confirmed that NGS reads generated from the SIRVs would not map to the genome of any model organism or the “ERCCome”.
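The inversion step can be illustrated in a few lines. This sketch assumes that "inverted (flipped)" means reversing the nucleotide order without complementing; the slides do not spell this out. A plain reversal differs from the reverse complement, which would still align to the opposite strand of the source sequence. The example sequence is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def invert_sequence(seq: str) -> str:
    """Reverse the nucleotide order without complementing (assumed meaning of 'flipped')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Opposite-strand sequence, shown only for contrast."""
    return seq.translate(COMPLEMENT)[::-1]


original = "ATGGCC"  # hypothetical toy sequence
print(invert_sequence(original))     # CCGGTA
print(reverse_complement(original))  # GGCCAT
```

Base composition and GC content are unchanged by the reversal, so the flipped sequence keeps the compositional properties of the source while losing its alignment identity.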
- 16. © Lexogen, 2014
Re-establishing Exon-Intron Junction Dinucleotides
• Most junctions are common, i.e. are also annotated in the master transcript.
• These intron sequences are currently annotated as NN (see below), hence junction recognition is no problem for alignment programs
               NN-NN          GT-AG          GC-AG       AT-AC
SIRVs          198 (61.11%)   116 (35.80%)   7 (2.16%)   3 (0.93%)
SIRVs, NN-NN and GT-AG combined: 314 (96.91%)
ICE database                  98.70%         0.79%       0.08%
• Exon-defined intron boundaries were converted to GT-AG (97%), GC-AG (2%) or AT-AC (1%)
Nucleotide conversion to conform with GT-AG rule
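A junction class is determined by the first two and last two nucleotides of the intron. The following sketch shows how intron sequences could be classified and tallied; the function names and the example introns are hypothetical and are not SIRV sequences.

```python
from collections import Counter

# Donor/acceptor dinucleotide pairs named on the slide.
KNOWN_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_type(intron: str) -> str:
    """Classify an intron by its 5' donor and 3' acceptor dinucleotides."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    pair = (intron[:2], intron[-2:])
    return f"{pair[0]}-{pair[1]}" if pair in KNOWN_PAIRS else "other"


def junction_shares(introns):
    """Fraction of introns in each junction class."""
    counts = Counter(junction_type(i) for i in introns)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical introns for illustration only.
examples = ["GTAAGTTTTCAG", "GTCCCTTTGAAG", "GCAAGTTTAG", "ATATCCTTAC"]
print(junction_shares(examples))
```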
- 17. © Lexogen, 2014
SIRV Properties - Summary
SIRVs are modelled on mammalian sequences
• Set of seven SIRV families with 6-18 transcript variants each
• 74 transcript variants in total, average length 1200 nt (median 917 nt)
• Variants include alternative splicing, start- and end-site variations, antisense and overlapping transcripts
• GC content: 30-50% (in analogy to ERCC standards)
• Poly(A) tail: A(30) at 3’-end (ERCCs: 19-25 adenosines)
• Length: 220-2,557 nt, longer SIRVs were trimmed by exon removal
Further modifications
• GT-AG exon-intron junction dinucleotide rule observed
• Homopolymer runs: ≤7 nt
• 5’ truncation to obtain 5’ G, needed for T7 transcription
• No homology to NCBI nt collection entries or ERCC sequences due to sequence inversion
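Several of the listed properties can be checked mechanically for a candidate sequence. The sketch below is an illustration only: it assumes that the length range, GC range and homopolymer limit apply to the transcript body without the A(30) tail, which the slides do not state, and it uses a hypothetical toy sequence.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated nucleotide."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def check_sirv(seq: str, tail_len: int = 30, max_run: int = 7,
               min_len: int = 220, max_len: int = 2557) -> dict:
    """Check a sequence against the design properties listed on the slide."""
    seq = seq.upper()
    has_tail = seq.endswith("A" * tail_len)
    # Assumption: body-level limits exclude the poly(A) tail.
    body = seq[:-tail_len] if has_tail else seq
    return {
        "starts_with_G": seq.startswith("G"),
        "has_poly_a_tail": has_tail,
        "gc_30_to_50_percent": 0.30 <= gc_content(body) <= 0.50,
        "homopolymer_ok": longest_homopolymer(body) <= max_run,
        "length_ok": min_len <= len(body) <= max_len,
    }


# Hypothetical toy sequence, not a real SIRV.
toy = "GATTACAGCTA" * 25 + "A" * 30
print(check_sirv(toy))
```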
- 18. © Lexogen, 2014
SIRV Design - Overview
• Take natural gene structure and annotated transcript variants
• Shorten transcript length to a maximum of 2500 nt
• Fill gene structure with heterologous sequence
• Duplicate and modify to add alternative splicing variants
• Add transcription start-site and end-site variants
• Add antisense and overlapping variants
Note: observe GU-AG intron rule
[Figure labels: cassette exon, intron retention, A5SS, A3SS, MXE, alternative first exon, alternative last exon, alternative start-site, alternative end-site, antisense, overlapping, overlapping antisense]
- 19. © Lexogen, 2014
1. Company introduction
2. ERCC spike-in mixes in Lexogen's R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
ERCC 2.0 Workshop
- 20. © Lexogen, 2014
SIRV Production: In vitro Transcription Construct
• Synthetic constructs, cloned for singularization and amplification
• Run-off T7 transcription
• Transcript starts with 5’ G, cap optional
• Poly(A) tail added
Construct layout (5’ to 3’): T7 promoter, restriction site, G, sequence, A(30), restriction site
Transcript length: 220 - 2557 nt
- 21. © Lexogen, 2014
SIRV Production, QC and Quantification
Production
• Plasmid linearization
• T7 run-off transcription
• Purification (essential!)
• Storage in Na-citrate buffer
Quality Control
• Photometric (Nanodrop): purity, quantification
• Microfluidics (Bioanalyzer): integrity, quantification
• Planned: qPCR: accurate quantification
- 22. © Lexogen, 2014
SIRVs: Mixes & RNA-Seq Samples
Initially, 2 mixes were prepared from 60 purified transcript variants:
1. Equimolar: 1:1:1…
2. Low dynamic range: 1:10:100
3 samples were prepared from these, each with Illumina TruSeq library prep without poly(A) selection:
1. Equimolar mix, SIRVs only
2. Equimolar mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
3. Low dynamic range mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
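The two mix designs translate into expected molar fractions per transcript. The sketch below illustrates this for 60 transcripts; how the individual transcripts were assigned to the 1, 10 and 100 levels is not given in the slides, so the cyclic assignment used here is purely hypothetical.

```python
def molar_fractions(ratios):
    """Convert relative molar ratios into fractions that sum to 1."""
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    return [r / total for r in ratios]


N_TRANSCRIPTS = 60

# Equimolar mix: every transcript at the same relative amount.
equimolar = molar_fractions([1] * N_TRANSCRIPTS)

# Low dynamic range mix: hypothetical cyclic assignment to the levels 1, 10, 100.
low_dynamic = molar_fractions([10 ** (i % 3) for i in range(N_TRANSCRIPTS)])

print(f"equimolar fraction per transcript: {equimolar[0]:.4f}")
print(f"low dynamic range span: {max(low_dynamic) / min(low_dynamic):.0f}-fold")
```

These fractions are the reference against which measured abundances can later be compared.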
- 23. © Lexogen, 2014
SIRVs: RNA-Seq Experiment
• Illumina MiSeq run: 1x150 nt, 27M reads obtained
• Mapping with TopHat (v2.0.8) against combined transcriptomic and genomic reference (Ensembl GRCh37.75), Ambion’s ERCC92, and SIRVs
Sample                             Total reads   Mapping reads (%)    Uniquely mapping reads (%)
#1, equimolar, SIRVs               10,246,442    8,585,641 (83.79%)   8,505,344 (83.01%)
#2, equimolar, SIRVs, ERCCs, UHR   10,119,416    8,642,852 (85.41%)   8,399,336 (83.00%)
#3, 1:10:100, SIRVs, ERCCs, UHR    6,308,855     5,404,486 (85.67%)   5,268,757 (83.51%)
Reads per reference:
Sample      GRCh37.75            ERCC92           SIRVs
Sample #1   4,330 (0.05%)        11 (0.00%)       8,505,555 (99.95%)
Sample #2   7,521,308 (89.55%)   38,031 (0.45%)   839,997 (10.00%)
Sample #3   4,156,399 (78.89%)   22,207 (0.42%)   1,090,151 (20.69%)
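The per-reference percentages are each reference's share of the reads assigned across the three references. A minimal sketch of that calculation, using the read counts reported for sample #2 (the function name is hypothetical):

```python
def reference_shares(counts: dict) -> dict:
    """Share of reads assigned to each reference, as a fraction of all assigned reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned")
    return {ref: n / total for ref, n in counts.items()}


# Read counts reported for sample #2.
sample2 = {"GRCh37.75": 7_521_308, "ERCC92": 38_031, "SIRVs": 839_997}

for ref, share in reference_shares(sample2).items():
    print(f"{ref}: {share:.2%}")
```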
- 24. © Lexogen, 2014
SIRV RNA-Seq: Input / Output Correlation
[Figure panels: FPKM plotted against input molecules for samples #1, #2 and #3; sample #1 FPKM vs. sample #2 FPKM]
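For reference, FPKM normalizes a fragment count by transcript length and sequencing depth. The sketch below implements the textbook definition only; it is not Cufflinks' estimator, which additionally has to distribute reads among overlapping isoforms. The numbers are hypothetical.

```python
def fpkm(fragments: float, transcript_length_nt: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_nt <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragments * 1e9 / (transcript_length_nt * total_mapped_fragments)


# Hypothetical example: 1,000 fragments on a 1,000 nt transcript, 1 million mapped fragments.
print(fpkm(1_000, 1_000, 1_000_000))
```

In an ideal equimolar mix all transcripts would yield the same FPKM regardless of length, which makes deviations in the input/output plots directly interpretable.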
- 25. © Lexogen, 2014
SIRVs RNA-Seq: Transcript Hypotheses
Transcript Hypotheses by Cufflinks
• Not complete: e.g., A3SS and exons not recognized despite multiple exon-exon reads
[Figure: Cufflinks transcript hypotheses]
- 26. © Lexogen, 2014
Spike-In RNA Variants: Short Summary
Design & production
• 74 transcript variants in 7 families (6-18 variants / family)
• Mimic eukaryotic genes in length and GC content; A(30) tail
• Include variation in alternative splicing, transcription start-sites and end-sites, sense/antisense and overlapping genes
• No homology to NCBI nt collection entries or ERCC sequences
• Produced from stock plasmids as T7 run-off transcripts
Mixtures
• 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations
Application in RNA-Seq
• Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
• Low input / output correlation as determined by TopHat/Cufflinks-derived FPKM
• Cufflinks cannot reconstruct all SIRV transcript variants, even in the equimolar mix, which will lead to wrong FPKM values
- 27. © Lexogen, 2014
Spike-In RNA Variants: Outlook
Optimizing production & quantification
• Large-scale production and purification of transcripts
• qPCR-based quantification in addition to Nanodrop & Bioanalyzer results
Application
• Evaluation of software for its performance in transcript hypothesis building
and transcript isoform quantification
Open questions
• Concentration range?
• Sufficient variant complexity? Length? Capping? SNPs?
• How many different mixes?
• Pipeline validation (Consortium?)
• Sample comparison (DE)
• Technical variation
• Master mix vs. modules: ERCCs, SIRVs, ncRNA standards & miRNA standards
(complexity, price, validation?)