20140710 3 l_paul_ercc2.0_workshop

© Lexogen, 2013
Spike-In RNA Variants:
Design, Production and Application
ERCC 2.0 workshop
Stanford University – July 10-11, 2014
PPT Number TBD
Project Number 0221
Theme T5.2 Mixquer Transcript Quantification (WAFF)
Author Lukas Paul

© Lexogen, 2014
1. Company introduction
2. ERCC spike-in mixes in Lexogen's R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
Confidential

Lexogen: Company
• Founded in 2007
• Based in Vienna, Austria
• 28 employees (75% in R&D)
• Lexogen, Inc.: overnight delivery to US customers
• Services & products with focus on
o Transcriptome profiling technologies
o Complementary technologies to Next Generation Sequencing
o Innovative solutions for transcriptome research
Lexogen’s mission is to develop innovative technologies that make it possible to
resolve all complexities of the transcriptome - one of the most enigmatic and
exciting areas in biology.
www.LEXOGEN.com

SENSE™ mRNA-Seq Library Preparation Kit
• Convenient, fragmentation-free workflow
• Core technology: reverse transcription and ligation on intact RNA
• Results in very high preservation of strand orientation
PN0203 PPT0383

ERCC-based Validation of Strandedness
• Strandedness usually quantified by comparing the orientation of a mapped
read with the genome annotation
• Problem: annotation incomplete & natural antisense transcription interferes
Use of ERCC transcripts with known orientation provides
an absolute means to determine strandedness
Total RNA   Strand specificity   False antisense   Sense reads
            (ERCCs only) (a)     reads (b)         (genome-wide) (c)
2 µg        99.997%              0.003%            99.890%
1 µg        99.986%              0.014%            99.815%
500 ng      99.997%              0.003%            99.821%
50 ng       99.965%              0.035%            99.779%
(a) Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
(b) Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
(c) Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
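The strand specificity defined in footnote a can be sketched as a simple ratio. This is a minimal illustration, not the pipeline used for the table; the function names and the read counts in the example are hypothetical.

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> float:
    """Fraction of spike-in reads mapped in the known (sense) orientation."""
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no reads mapped to the spike-in reference")
    return sense_reads / total


def false_antisense_rate(sense_reads: int, antisense_reads: int) -> float:
    """Fraction of spike-in reads mapped against the known orientation."""
    return 1.0 - strand_specificity(sense_reads, antisense_reads)


# Hypothetical counts, chosen only for illustration.
print(f"strand specificity: {strand_specificity(99_997, 3):.3%}")
print(f"false antisense:    {false_antisense_rate(99_997, 3):.3%}")
```

Because the orientation of every ERCC transcript is known, no genome annotation enters the calculation.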

ERCC-validated Strandedness Determines False Positive
Background of Library Preparation Method
Knowing the strandedness of the library preparation
protocol allows for determining whether a detected
transcript is truly antisense or belongs to the false positive
background.
[Figure: true antisense transcripts by library strandedness. Labels shown: strandedness 98% and 99.9%; true antisense transcripts 1153 and 2415]
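One way to use a known strandedness for this decision is sketched below. It assumes that a fraction (1 - strandedness) of all reads from a sense transcript appears in the antisense orientation, and it uses a hypothetical fold threshold (`min_fold`) that is not taken from the slides.

```python
def expected_false_antisense(sense_reads: float, strandedness: float) -> float:
    """Antisense reads expected purely from imperfect strandedness.

    If s is the fraction of reads with correct orientation, observed sense
    reads are about s * N and false antisense reads about (1 - s) * N.
    """
    if not 0.0 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0, 1]")
    return sense_reads * (1.0 - strandedness) / strandedness


def is_above_background(antisense_reads: float, sense_reads: float,
                        strandedness: float, min_fold: float = 2.0) -> bool:
    """True if antisense signal exceeds min_fold times the expected background.

    min_fold is an illustrative assumption, not a value from the presentation.
    """
    background = expected_false_antisense(sense_reads, strandedness)
    return antisense_reads > min_fold * background


# Same hypothetical locus evaluated at the two strandedness levels on the slide.
for s in (0.98, 0.999):
    print(s, is_above_background(100, 10_000, s))
```

With these illustrative counts, the same antisense signal is indistinguishable from background at 98% strandedness but clearly exceeds it at 99.9%.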

“ERCC-validated” Strandedness in Lexogen’s Portfolio
• SENSE mRNA-Seq library preparation kit
• SENSE Total RNA-Seq library preparation kit
• QuantSeq™ 3’ mRNA library preparation kit (see workflow, right); ERCCs are also used to assess correctness of 3’ end mapping

Correlation Between ERCC Input and FPKM Measured
[Figure: scatter plot with axes FPKM and N of molecules [10²]]
• SENSE: R² = 0.910
• Competitors: R² = 0.834
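An input/output correlation of this kind can be sketched as the squared Pearson correlation between spike-in input amounts and measured FPKM. Computing it on log10-transformed values is an assumption here (the slide does not state the transformation), and the data in the example are invented.

```python
import math


def r_squared_log10(inputs, fpkms):
    """Squared Pearson correlation of log10(input) vs. log10(FPKM).

    Pairs with a non-positive value are skipped because log10 is undefined.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(inputs, fpkms) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive data pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("no variance in one of the variables")
    return sxy ** 2 / (sxx * syy)


# Invented example: FPKM strictly proportional to input.
print(r_squared_log10([1, 10, 100, 1000], [0.5, 5, 50, 500]))
```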

Further Use for ERCC: Transcript Length Coverage
• Native genes: interference from divergent annotations and differentially
expressed transcript variants
• Primer selectivity: aa
→ ERCCs with seamless coverage from first to last nucleotide
→ Native transcripts start with high coverage indicative of 5’ truncated
annotations
Example: SQUARE™ library prep with intrinsic over-representation of termini
[Figure panels: ERCC-0096; Top 500 transcripts]

3. Design and rationale of Spike-In RNA variants
4. Production and application of Spike-In RNA variants

Spike-In RNA Variants (SIRVs) - Rationale
• ERCC spike-in controls were designed as mono-exonic RNAs without
sequence overlap.
• As a complement, we found it desirable to have a set of nucleic acids
simulating transcript variants that can be used as external spike-in controls.
• This reference set would
o comprise two or more transcript families, with transcripts of the same
family representing reference transcript variants of the same gene
o enable the controlled identification and/or quantification of transcript
variants in one or more samples and
o permit the assessment, validation and correction of bioinformatics
pipelines.

Spike-In RNA Variants – Gene Structure
Reference genes
• 7 human genes selected because of diversity in exon-intron structure
• Annotated transcripts (Ensembl database) aligned to gene in CLC workbench
• "Master transcript" created for each gene (sequence of all transcript variants)
[Figures: KLK5 and LDHD, shown in CLC Main Workbench 5]
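One possible reading of the "master transcript" is the union of all exonic regions across a gene's annotated variants. The sketch below implements that reading as an interval merge. The interpretation, the function name and the coordinates are assumptions for illustration; the slides do not show how the master transcripts were built in CLC workbench.

```python
def merge_exons(transcripts):
    """Union of exon intervals over all transcript variants of one gene.

    transcripts: dict mapping variant name -> list of (start, end) exons,
    1-based inclusive coordinates on the same strand.
    Returns a sorted list of non-overlapping (start, end) intervals.
    """
    intervals = sorted(exon for exons in transcripts.values() for exon in exons)
    merged = []
    for start, end in intervals:
        # Overlapping or directly adjacent exons are fused into one block.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical gene with two annotated variants.
variants = {
    "variant_1": [(1, 100), (201, 300)],
    "variant_2": [(50, 120), (201, 250), (401, 500)],
}
print(merge_exons(variants))
```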

Addition of Transcript Variants
• Annotated transcript variants were analyzed for alternative splicing (AS) events
• AS events not covered by a variant within a family were incorporated in a
new variant based on the master transcript
• To cover non-splicing variants, antisense and overlapping transcripts were
added (mono- and poly-exonic)
• Further, Transcription Start-Site (TSS) and End-Site (TES) variants were
added
[Figure: KLK5; SIRV1]

Spike-In RNA Variants (SIRV): Nucleotide Sequence
AIM
• The nucleotide sequence of the SIRVs should be non-homologous at least
to eukaryotic genomes and transcriptomes.
• In the best case they should not align with any naturally occurring sequence.
SOLUTION
• Genomic sequences from viruses were used to fill in exon sequences.
→ Would work in external controls for eukaryotes.
• Sequences were then inverted (flipped) to lose alignment identity.
→ Final sequences do not align with any entry in the NCBI nt collection when
searched with BLAST using standard parameters.
→ SIRV sequences also do not align with ERCC sequences.
→ In silico experiments confirmed that NGS reads generated from the SIRVs
would not map to the genome of any model organism or the “ERCCome”.
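The inversion step can be illustrated as a plain reversal of the sequence string. This assumes that "inverted (flipped)" means reading the sequence backwards without complementing it. The reverse complement is shown for contrast only, since it corresponds to the opposite strand of the same molecule and would still align to the source.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def invert_sequence(seq: str) -> str:
    """Reverse the base order without complementing (assumed meaning of 'inverted')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Opposite strand of the same molecule; shown only for comparison."""
    return seq.translate(_COMPLEMENT)[::-1]


example = "GATTACA"  # arbitrary illustrative sequence
print(invert_sequence(example))
print(reverse_complement(example))
```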

Re-establishing Exon-Intron Junction Dinucleotides
• Most junctions are common, i.e. they are also annotated in the master
transcript.
• These intron sequences are currently annotated as NN (see table), hence
junction recognition is no problem for alignment programs.
• Exon-defined intron boundaries were converted to GT-AG (97%),
GC-AG (2%) or AT-AC (1%).
Nucleotide conversion to conform with GT-AG rule:
               NN-NN          GT-AG          GC-AG       AT-AC
SIRVs          198 (61.11%)   116 (35.80%)   7 (2.16%)   3 (0.93%)
               NN-NN + GT-AG: 314 (96.91%)
ICE database                  98.70%         0.79%       0.08%
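The shares per junction type follow directly from the junction counts given for the SIRVs (198, 116, 7 and 3). A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def junction_percentages(counts):
    """Percentage of each junction type relative to all junctions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no junctions counted")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Junction counts as listed for the SIRVs.
sirv_junctions = {"NN-NN": 198, "GT-AG": 116, "GC-AG": 7, "AT-AC": 3}
pct = junction_percentages(sirv_junctions)
for name, value in pct.items():
    print(f"{name}: {value:.2f}%")
print(f"NN-NN + GT-AG: {pct['NN-NN'] + pct['GT-AG']:.2f}%")
```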

SIRV Properties - Summary
SIRVs are modelled on mammalian sequences
• Set of seven SIRV families with 6-18 transcript variants each
• 74 transcript variants in total, average length 1200 nt (median 917 nt)
• Variants include alternative splicing, start- and end-site variations,
antisense and overlapping transcripts
• GC content: 30-50% (in analogy to ERCC standards)
• Poly(A) tail: A(30) at 3’-end (ERCCs: 19-25 adenosines)
• Length: 220-2,557 nt, longer SIRVs were trimmed by exon removal
Further modifications
• GT-AG exon-intron junction dinucleotide rule observed
• Homopolymer runs: ≤ 7 nt
• 5’ truncation to obtain 5’ G, needed for T7 transcription
• No homology to NCBI nt collection entries or ERCC sequences due to
sequence inversion
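The numeric constraints in this summary can be expressed as a small checker. This is a sketch only: it assumes the limits apply to the transcript body without the A(30) tail, and the function names and the test sequence are hypothetical.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq.upper())),
               default=0)


def check_design(seq: str) -> dict:
    """Check a transcript body (poly(A) tail excluded) against the listed limits."""
    if not seq:
        raise ValueError("empty sequence")
    return {
        "starts_with_G": seq[0].upper() == "G",          # needed for T7 transcription
        "length_ok": 220 <= len(seq) <= 2557,            # nt
        "gc_ok": 0.30 <= gc_content(seq) <= 0.50,
        "homopolymer_ok": longest_homopolymer(seq) <= 7,
    }


# Artificial sequence used only to exercise the checks.
print(check_design("G" + "ACGTA" * 50))
```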

SIRV Design - Overview
Take natural gene structure and annotated transcript variants
Shorten transcript length to a maximum of 2500 nt
Fill gene structure with heterologous sequence
Duplicate and modify to add alternative splicing variants
Add transcription start-site and end-site variants
Add antisense and overlapping variants
Observe GU-AG intron rule
[Figure labels: cassette exon; alternative start-site; alternative end-site; alternative first exon; alternative last exon; intron retention; A5SS; A3SS; MXE; overlapping; antisense; overlapping, antisense]

SIRV Production: In vitro Transcription Construct
Synthetic constructs cloned for singularization and amplification
Run-off T7 transcription
Transcript starts with 5’ G, cap optional; poly(A) tail added
[Figure: construct, 5’ to 3’: T7 promoter, restr. site, G, sequence, A(30), restr. site; 220 - 2557 nt]
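The transcript layout described here (5’ G, sequence, A(30) tail) can be sketched as simple string assembly. The 5’ truncation to the first G follows the design summary; the function name and the input sequence are hypothetical, and promoter and restriction sites are deliberately left out because the slides do not give their sequences.

```python
def build_runoff_transcript(body: str, tail_length: int = 30) -> str:
    """Truncate the body at its first G and append a poly(A) tail.

    Models only the transcribed part of the construct (DNA alphabet),
    not the T7 promoter or the flanking restriction sites.
    """
    first_g = body.upper().find("G")
    if first_g == -1:
        raise ValueError("sequence contains no G to start transcription from")
    return body[first_g:] + "A" * tail_length


# Arbitrary illustrative input.
transcript = build_runoff_transcript("ATGCCTTAGC")
print(transcript, len(transcript))
```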

SIRV Production, QC and quantification
Production
• Plasmid linearization
• T7 run-off transcription
• Purification (essential!)
• Storage in Na-Citrate buffer
Quality Control
• Photometric (Nanodrop): purity, quantification
• Microfluidics (Bioanalyzer): integrity, quantification
• Planned: qPCR for accurate quantification
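To combine transcripts of different lengths at defined molar ratios, a mass concentration from photometric quantification has to be converted into a molar one. The sketch below uses the common approximation for single-stranded RNA, MW ≈ length × 320.5 + 159 g/mol. This is a general rule of thumb, not a value from the presentation, and the example numbers are invented.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def rna_molecular_weight(length_nt: int) -> float:
    """Approximate molecular weight of ssRNA in g/mol (rule-of-thumb formula)."""
    return length_nt * 320.5 + 159.0


def molar_concentration_nm(ng_per_ul: float, length_nt: int) -> float:
    """Convert a mass concentration (ng/µl) into nmol/l."""
    # 1 ng/µl equals 1e-3 g/l; dividing by g/mol gives mol/l; times 1e9 gives nM.
    return ng_per_ul * 1e6 / rna_molecular_weight(length_nt)


def molecules_per_ul(ng_per_ul: float, length_nt: int) -> float:
    """Number of RNA molecules in one microliter."""
    return molar_concentration_nm(ng_per_ul, length_nt) * 1e-15 * AVOGADRO


# Invented example: a 1000 nt transcript measured at 100 ng/µl.
print(round(molar_concentration_nm(100, 1000), 2), "nM")
```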

SIRVs: Mixes & RNA-Seq Samples
Initially, 2 mixes were prepared from 60 purified transcript variants:
1. Equimolar: 1:1:1…
2. Low dynamic range: 1:10:100
3 samples were prepared from these:
1. Equimolar mix, SIRVs only
2. Equimolar mix, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
3. Low dynamic range, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
Illumina TruSeq library prep without poly(A) selection

SIRVs: RNA-Seq Experiment
• Illumina MiSeq run: 1x150 nt, 27M reads obtained
• Mapping with TopHat (v2.0.8) against combined transcriptomic and
genomic reference (Ensembl GRCh37.75), Ambion’s ERCC92, and SIRVs

Sample                            Total reads   Mapping reads        Uniquely mapping reads
#1, equimolar SIRVs               10,246,442    8,585,641 (83.79%)   8,505,344 (83.01%)
#2, equimolar SIRVs, ERCCs, UHR   10,119,416    8,642,852 (85.41%)   8,399,336 (83.00%)
#3, 1:10:100 SIRVs, ERCCs, UHR     6,308,855    5,404,486 (85.67%)   5,268,757 (83.51%)

Sample      GRCh37.75            ERCC92            SIRVs
Sample #1   4,330 (0.05%)        11 (0.00%)        8,505,555 (99.95%)
Sample #2   7,521,308 (89.55%)   38,031 (0.45%)    839,997 (10.00%)
Sample #3   4,156,399 (78.89%)   22,207 (0.42%)    1,090,151 (20.69%)
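The per-reference shares can be recomputed from the read counts. The sketch below does this for sample #2 with the counts listed above; the helper name is hypothetical.

```python
def read_fractions(counts):
    """Share of reads assigned to each reference, as fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {ref: n / total for ref, n in counts.items()}


# Uniquely mapping reads of sample #2, split by reference.
sample_2 = {"GRCh37.75": 7_521_308, "ERCC92": 38_031, "SIRVs": 839_997}
for ref, frac in read_fractions(sample_2).items():
    print(f"{ref}: {frac:.2%}")
```

For sample #2 the three counts add up to the number of uniquely mapping reads reported for that sample (8,399,336).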

SIRV RNA-Seq: Input / Output correlation
[Figure: panels #1, #2 and #3 (x-axis: Molecules) and panel #1 vs #2 (axes: sample #1 FPKM, sample #2 FPKM)]

SIRVs RNA-Seq: Transcript Hypotheses
Transcript Hypotheses by Cufflinks
• Not complete: e.g., A3SS and exons not recognized despite multiple
exon-exon reads
[Figure: Cufflinks transcript hypotheses]

Spike-In RNA Variants: Short Summary
Design & production
• 74 transcript variants in 7 families (6-18 variants / family)
• Mimic eukaryotic genes in length and GC content; A(30) tail
• Include variation on alternative splicing, transcription start-sites and
end-sites, sense/antisense and overlapping genes
• No homology to NCBI nt collection entries or ERCC sequences
• Produced from stock plasmids as T7 run-off transcripts
Mixtures
• 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations
Application in RNA-Seq
• Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
• Low input / output correlation as determined by TopHat/Cufflinks-derived
FPKM
• Cufflinks cannot reconstruct all SIRV transcript variants, even in the
equimolar mix, which will lead to wrong FPKM values

Spike-In RNA Variants: Outlook
Optimizing production & quantification
• Large-scale production and purification of transcripts
• qPCR-based quantification in addition to Nanodrop & Bioanalyzer results
Application
• Evaluation of software for its performance in transcript hypothesis building
and transcript isoform quantification
Open questions
• Concentration range?
• Sufficient variant complexity? Length? Capping? SNPs?
• How many different mixes?
• Pipeline validation (Consortium?)
• Sample comparison (DE)
• Technical variation
• Master mix vs. modules: ERCCs, SIRVs, ncRNA standards & miRNA standards
(complexity, price, validation?)
