17. Long oligos synthesized on arrays (DNA)
RNA baits synthesized from DNA oligo template
RNA baits hybridized to DNA sequencing library
Targets captured using beads and biotin-labeled baits
RNA bait degraded, leaving sequencing library enriched for target regions
18. Data Flow
FASTQ files generated by Illumina pipeline
Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign
SAM/BAM used extensively
Follow Broad Institute GATK pipeline for exome capture
Use Picard Java library for quality assessment
Processed BAM files available via local http for browsing
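
The contig exclusion mentioned above could be sketched as a simple name filter. This is only an illustration: the `_random` and `_hap` suffixes and the `chrUn` prefix for unmapped sequence are assumptions about UCSC-style contig naming, and `keep_contig` and `filter_contigs` are hypothetical helpers, not part of Novoalign or GATK.

```python
import re

# Assumed UCSC-style names: chr1_random, chr6_cox_hap1, chrUn...
# These patterns are illustrative and should be checked against the
# actual reference FASTA headers.
_EXCLUDE = re.compile(r"(_random$|_hap\d*$|^chrUn)")


def keep_contig(name: str) -> bool:
    """Return True if the contig should stay in the alignment reference."""
    return _EXCLUDE.search(name) is None


def filter_contigs(names):
    """Keep only contigs that are not random, unmapped, or haplotype."""
    return [name for name in names if keep_contig(name)]
```

For example, `filter_contigs(["chr1", "chr1_random", "chr6_cox_hap1"])` keeps only `chr1`.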
20. Realignment around Indels
The problem:
Aligners align each read independently
Potentially leads to increased error rates around indels
A potential solution:
Locally realign reads in regions that might harbor an indel
Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
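
One way to find "regions that might harbor an indel" is to collect the positions where existing alignments already contain an insertion or deletion. The sketch below does only that target-finding step from CIGAR strings; it does not perform the realignment itself. The function names, the 0-based coordinates, and the 10-base padding are assumptions made for illustration.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def indel_positions(pos, cigar):
    """Yield (ref_start, ref_end) for each I or D in one alignment.

    pos is the 0-based leftmost reference position of the read.
    Insertions are reported as zero-length spans.
    """
    ref = pos
    for length, op in _CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D":
            yield (ref, ref + length)
        elif op == "I":
            yield (ref, ref)
        if op in _REF_CONSUMING:
            ref += length


def candidate_intervals(reads, pad=10):
    """Merge padded indel spans from (pos, cigar) pairs into intervals."""
    spans = sorted(
        (max(0, start - pad), end + pad)
        for pos, cigar in reads
        for start, end in indel_positions(pos, cigar)
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]
```

Reads whose indels fall close together collapse into a single interval, so each suspicious region would be realigned once rather than once per read.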
21. Quality Recalibration
Since most SNV callers will rely on quality scores to
estimate error probabilities, having the best possible
estimates for error rates is important
Reported error rates from the Illumina sequencer
generally reflect technical parameters of the base call
process, but not other systematic biases
Quality recalibration can include covariates to
account for systematic biases
Cycle count, dinucleotide context, original quality,
and sample/library variables
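
As a rough sketch of the idea, an empirical quality can be computed per covariate bin by counting mismatches against the reference and converting the observed error rate to the Phred scale, Q = -10 log10(p). The function names, the covariate key layout, and the +1/+2 smoothing are assumptions for illustration, not the exact method of any particular tool. A real pipeline would also need to avoid counting true variants as errors.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)


def recalibration_table(observations):
    """Build an empirical quality per covariate bin.

    observations: iterable of (covariate_key, is_mismatch), where
    covariate_key might be (reported_quality, cycle, dinucleotide).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for key, is_mismatch in observations:
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1
    # +1 / +2 smoothing (an assumption) avoids log10(0) for bins
    # with no observed mismatches.
    return {
        key: phred((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }
```

A bin reported at Q30 that shows 9 mismatches in 998 bases would be recalibrated to about Q20 under this scheme.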
43. Data
Delivery
Via USB
results
Storage
Sizes are LARGE
400GB per sample as delivered with raw reads included
Should use 2-location backed-up storage
Not trivial to find such storage, so might resort to multiple USB drives
Minimize:
Data movement
Keeping multiple copies indefinitely
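
For planning purposes, the storage figures above reduce to simple arithmetic. The helper below is a hypothetical sketch that uses the 400GB-per-sample figure and two storage locations as defaults.

```python
def storage_needed_gb(n_samples, gb_per_sample=400, locations=2):
    """Total storage in GB when every sample is kept at each location."""
    return n_samples * gb_per_sample * locations
```

Ten samples kept in two locations come to 8000GB, or 8TB in decimal units.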
46. Data
Delivery
Storage
Processing
Data are typically tab-delimited text files, so Excel can be useful for examining individual small files
Generally, command-line tools are needed
MacOS and Linux are the only supported operating systems, but Windows might work
Some analyses (snpdiff) require large memory
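
For files too large to open in Excel, a tab-delimited file can be streamed row by row so that memory use stays small. This is a minimal sketch using the Python standard library. `filter_rows` is a hypothetical helper, and it assumes the first line of the file is a header.

```python
import csv


def filter_rows(path, predicate):
    """Yield rows (as dicts keyed by header name) that satisfy predicate.

    The file is read one line at a time, so it never needs to fit
    in memory.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if predicate(row):
                yield row
```

A call such as `filter_rows("variants.txt", lambda row: row["gene"] == "EGFR")` would stream only the matching rows. The file name and the `gene` column are made up for the example.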
48. Workflows
Tumor/Normal
Copy Number
Structural Variation
Annotated Somatic Variants
Germline
List of annotated genotypes per individual, summarized into a single file that can be used for filtering
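
The germline summary can be pictured as a table with one row per variant and one column per individual. The sketch below builds such a table from per-individual genotype calls. The function name, the variant key format, and the `./.` placeholder for a missing call are assumptions for illustration, not the workflow's actual output format.

```python
def summarize_genotypes(per_individual, missing="./."):
    """Combine per-individual calls into one variant-by-individual table.

    per_individual: {individual: {variant_key: genotype}}
    Returns (header, rows), both lists ready to write as tab-delimited text.
    """
    individuals = sorted(per_individual)
    variants = sorted(
        {variant for calls in per_individual.values() for variant in calls}
    )
    header = ["variant"] + individuals
    rows = [
        [variant]
        + [per_individual[ind].get(variant, missing) for ind in individuals]
        for variant in variants
    ]
    return header, rows
```

With every individual in the same row, filters such as "heterozygous in the child, homozygous reference in both parents" become a single pass over the table.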
50. Germline Workflow Output
Future directions
Be “smarter” about inheritance framework
Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)
58. Frequent genetic alterations in three critical signalling pathways.
The Cancer Genome Atlas Research Network. Nature 000, 1-8 (2008). doi:10.1038/nature07385
61. Chromatin
Chromatin is the complex of protein and DNA that makes up the chromosomes.
It is not a static structure.
62. DNAse is an enzyme that cuts DNA at locations where DNA is accessible
These “accessible” regions have been associated with open chromatin
Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription
63. DNAse Hypersensitivity
Method for finding regions of “open” chromatin
In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:
Histone modification
Transcription start sites
Early replicating regions
Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.
68. DNAse HS Sites and Gene Expression
DNAse HS sites near transcription start sites are associated with actively transcribed genes.
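
To make "near" concrete, an HS site can be tested against a sorted list of transcription start sites with a binary search. This is a sketch for a single chromosome. The function name and the 1000-base window are assumptions for illustration, not values taken from the analysis described here.

```python
from bisect import bisect_left


def near_tss(site_mid, tss_sorted, window=1000):
    """Return True if site_mid lies within window bases of any TSS.

    tss_sorted must be a sorted list of TSS positions on one chromosome.
    Only the two TSSs flanking the site need to be checked.
    """
    i = bisect_left(tss_sorted, site_mid)
    neighbors = tss_sorted[max(0, i - 1): i + 1]
    return any(abs(site_mid - tss) <= window for tss in neighbors)
```

Sites flagged this way could then be compared against expression measurements for the corresponding genes.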
70. Nucleosome Positioning
Distances between sequences in non-DNAse HS regions have an oscillating pattern with a period that corresponds to a single turn of the double helix
DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when wrapped around a nucleosome
A nucleosome is wrapped by 147 base pairs when complexed with DNA
Implication: Nucleosomes are positioned in a highly organized, precise manner
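
The oscillating pattern can be looked for by building a histogram of distances between cut positions and scoring candidate periods against it. The sketch below is one simple way to do that, using a Fourier-style score. The function names, the 100-base distance cap, and the 8 to 13 base search range are assumptions for illustration, not the method used in the analysis described here. For scale, 147 base pairs at 10.4 bases per turn is roughly 14 helical turns per nucleosome.

```python
import cmath
from collections import Counter


def distance_histogram(positions, max_distance=100):
    """Count pairwise distances (up to max_distance) between positions."""
    ordered = sorted(positions)
    hist = Counter()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            distance = right - left
            if distance > max_distance:
                break
            hist[distance] += 1
    return hist


def periodicity_score(hist, period):
    """Score in [0, 1]; 1 means every distance is a multiple of period."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    vector = sum(
        count * cmath.exp(2j * cmath.pi * distance / period)
        for distance, count in hist.items()
    )
    return abs(vector) / total


def best_period(hist, low=8.0, high=13.0, step=0.1):
    """Return the candidate period in [low, high] with the highest score."""
    n_steps = int(round((high - low) / step))
    candidates = [round(low + k * step, 10) for k in range(n_steps + 1)]
    return max(candidates, key=lambda p: periodicity_score(hist, p))
```

On synthetic positions spaced exactly 10 bases apart, the search recovers a period of 10.0. On real cut-site data the peak would be broader and the score well below 1.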