Forsharing cshl2011 sequencing

High-Resolution Views of Cancer Genomes

Overview of BAC in the Genome

Repeats are not created equal

Genomic Sequencing

Targeting the Exome
  Long oligos synthesized on arrays (DNA)
  RNA baits synthesized from DNA oligo template
  RNA baits hybridized to DNA sequencing library
  Targets captured using beads and biotin-labeled baits
  RNA bait degraded, leaving sequencing library enriched for target regions

Data Flow
  FASTQ files generated by Illumina pipeline
  Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign
  SAM/BAM used extensively
  Follow Broad Institute GATK pipeline for exome capture
  Use Picard Java library for quality assessment
  Processed BAM files available via local http for browsing
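
The contig exclusion mentioned in the alignment step can be sketched as a simple name filter. This is a minimal illustration, not the presenter's actual code: the function name `keep_contig`, the example contig list, and the assumption that "unmapped" sequences carry a `chrUn` prefix are all hypothetical.

```python
def keep_contig(name: str) -> bool:
    """Return True for primary contigs; False for _random, unmapped, and hap contigs.

    Assumption: unmapped sequence is named with a "chrUn" prefix.
    """
    lowered = name.lower()
    if lowered.endswith("_random"):
        return False
    if "_hap" in lowered:
        return False
    if lowered.startswith("chrun"):
        return False
    return True


# Hypothetical contig names in the UCSC naming style.
contigs = ["chr1", "chr1_random", "chr6_cox_hap1", "chrX", "chrM"]
primary = [c for c in contigs if keep_contig(c)]
print(primary)  # ['chr1', 'chrX', 'chrM']
```

Filtering on names before alignment keeps reads from being split between a primary chromosome and its alternate or unplaced copies.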

Data Pipeline....
  Samtools import
  Samtools sort
  Picard MarkDuplicates
  GATK Indel Realignment
  GATK Quality Recalibration
  Picard QC metrics
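
The ordering of the stages above matters because each one consumes the previous stage's output. The sketch below only plans file names for a dry run and does not invoke any tool; the file suffixes, the `plan` function, and the sample name are hypothetical and not taken from the slides.

```python
from typing import List, Tuple

# Stage label and a hypothetical suffix for the file that stage writes.
STAGES = [
    ("samtools import", "imported"),
    ("samtools sort", "sorted"),
    ("Picard MarkDuplicates", "dedup"),
    ("GATK indel realignment", "realigned"),
    ("GATK quality recalibration", "recal"),
]


def plan(sample: str) -> List[Tuple[str, str, str]]:
    """Return (stage, input_file, output_file) triples in execution order."""
    steps = []
    current = f"{sample}.sam"
    for stage, suffix in STAGES:
        output = f"{sample}.{suffix}.bam"
        steps.append((stage, current, output))
        current = output
    # QC metrics are computed on the final, fully processed BAM.
    steps.append(("Picard QC metrics", current, f"{sample}.metrics.txt"))
    return steps


for stage, src, dst in plan("tumor"):
    print(f"{stage}: {src} -> {dst}")
```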

Realignment around Indels
  The problem
    Aligners align each read independently
    Potentially leads to increased error rates around indels
  A potential solution
    Locally realign reads in regions that might harbor an indel
    Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
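
A toy example may make the problem concrete. It is not the GATK algorithm: it only compares the mismatch count of one read placed without gaps against the same read placed with a candidate deletion. The reference string, the deletion coordinates, and both function names are invented for illustration.

```python
def mismatches_ungapped(ref: str, read: str, start: int) -> int:
    """Count mismatches when the read is placed at `start` with no gaps."""
    return sum(1 for i, base in enumerate(read) if ref[start + i] != base)


def mismatches_with_deletion(ref: str, read: str, start: int,
                             del_pos: int, del_len: int) -> int:
    """Count mismatches when `del_len` reference bases at `del_pos` are skipped."""
    count = 0
    ref_i = start
    for base in read:
        if ref_i == del_pos:
            ref_i += del_len  # jump over the deleted reference bases
        if ref[ref_i] != base:
            count += 1
        ref_i += 1
    return count


ref = "ACGTTGCAAGCTTAGGCCATGA"
# The sample carries a 3-base deletion of reference positions 10-12 ("CTT").
sample = ref[:10] + ref[13:]
# A read that starts upstream of the deletion and ends just past it.
read = sample[2:14]

print(mismatches_ungapped(ref, read, 2))             # 4
print(mismatches_with_deletion(ref, read, 2, 10, 3))  # 0
```

With only four read bases beyond the deletion, an aligner scoring this read alone may prefer four mismatches over opening a gap, and those mismatches can then look like SNVs. Considering all reads over the candidate indel together is what lets local realignment choose the gapped placement.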

Quality Recalibration
  Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important
  Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases
  Quality recalibration can include covariates to account for systematic biases
    Cycle count, dinucleotide context, original quality, and sample/library variables
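
The idea of an empirical quality per covariate bin can be sketched as follows. This is a simplified illustration rather than the GATK implementation: it uses only two covariates (reported quality and cycle), and the add-one smoothing and the function names are assumptions. The Phred relation Q = -10 log10(p) is standard.

```python
import math
from collections import defaultdict


def phred(error_probability: float) -> float:
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_probability)


def empirical_qualities(observations):
    """Estimate a quality for each (reported_quality, cycle) bin.

    observations: iterable of (reported_quality, cycle, is_mismatch) tuples,
    one per base. Assumption: add-one smoothing avoids a zero error rate.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {
        key: phred((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }


# Bases reported as Q30 at cycle 1, but 9 of 998 disagree with the reference.
observations = [(30, 1, True)] * 9 + [(30, 1, False)] * 989
recalibrated = empirical_qualities(observations)
print(round(recalibrated[(30, 1)], 1))  # about 20: reported Q30 behaves like Q20
```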

Variant Calling and Evaluation

A developing art

Sequencing Tumor/Normal Pairs

Somatic (tumor only) Variant

Likely False Positive (normal only)
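
The tumor-only and normal-only categories above amount to set differences between the two call sets. In this minimal sketch the variant tuples, the function name, and the "germline_candidates" label for calls shared by both samples are assumptions for illustration.

```python
def classify_calls(tumor_calls, normal_calls):
    """Partition variant calls from a tumor/normal pair by where they appear."""
    tumor, normal = set(tumor_calls), set(normal_calls)
    return {
        "somatic_candidates": tumor - normal,      # tumor only
        "germline_candidates": tumor & normal,     # in both (assumed label)
        "likely_false_positives": normal - tumor,  # normal only
    }


# Hypothetical calls as (chromosome, position, ref, alt).
tumor_calls = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
normal_calls = [("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")]

result = classify_calls(tumor_calls, normal_calls)
print(result["somatic_candidates"])      # {('chr1', 100, 'A', 'G')}
print(result["likely_false_positives"])  # {('chr3', 300, 'G', 'A')}
```

The normal-only count is a useful sanity check: the same caller and coverage should rarely produce a variant in the normal that the tumor lacks, so that count gives a rough feel for the error rate among the tumor-only calls.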

NCI60 Exome Sequencing

No Normals Available!

Variants by Genomic Location

Type 1: in dbSNP, Type 2: not in dbSNP

Coding, novel (no dbSNP)
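
When no matched normal is available, membership in dbSNP is one way to separate known polymorphisms from novel calls. The sketch below assumes each variant is a dictionary with a hypothetical `key` tuple and an `is_coding` flag supplied by some upstream annotation step; none of these names come from the slides.

```python
def dbsnp_type(variant_key, dbsnp_keys) -> int:
    """Return 1 if the variant is in dbSNP, 2 if it is not."""
    return 1 if variant_key in dbsnp_keys else 2


def coding_novel(variants, dbsnp_keys):
    """Keep variants that are coding and absent from dbSNP (Type 2)."""
    return [
        v for v in variants
        if v["is_coding"] and dbsnp_type(v["key"], dbsnp_keys) == 2
    ]


# Hypothetical data: keys are (chromosome, position, ref, alt).
dbsnp_keys = {("chr1", 100, "A", "G")}
variants = [
    {"key": ("chr1", 100, "A", "G"), "is_coding": True},   # Type 1
    {"key": ("chr7", 500, "C", "T"), "is_coding": True},   # Type 2, coding
    {"key": ("chr9", 900, "G", "A"), "is_coding": False},  # Type 2, non-coding
]

kept = coding_novel(variants, dbsnp_keys)
print([v["key"] for v in kept])  # [('chr7', 500, 'C', 'T')]
```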

Copy Number from Exomes

Complete Genome Sequencing

Complete Genomics Data

Data
  Delivery
    Via USB results
  Storage
    Sizes are LARGE
    400GB per sample as delivered with raw reads included
    Should use 2-location backed-up storage
    Not trivial to find such storage, so might resort to multiple USB drives
  Minimize:
    Data movement
    Keeping multiple copies indefinitely
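
A quick calculation shows how fast the storage requirement grows. The 400 GB per sample and the two storage locations come from the slide; the sample count, the function name, and the use of decimal terabytes (1 TB = 1000 GB) are assumptions.

```python
GB_PER_SAMPLE = 400  # as delivered, raw reads included


def storage_needed_tb(n_samples: int, locations: int = 2) -> float:
    """Total storage in decimal terabytes across all backup locations."""
    return n_samples * GB_PER_SAMPLE * locations / 1000


# Hypothetical project of 10 genomes kept in 2 locations.
print(storage_needed_tb(10))  # 8.0
```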

Breakdown of Data Sizes

Data
  Delivery
  Storage
  Processing
    Data are typically tab-delimited text files, so Excel can be useful for examining individual small files
    Generally, command-line tools needed
    MacOS and Linux only supported operating systems, but Windows might work....
    Some analyses (snpdiff) require large memory
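
For files too large for Excel, a tab-delimited file can be streamed one row at a time so that memory use stays flat. This sketch uses Python's standard `csv` module on an in-memory stand-in for a file; the column names and the function name are hypothetical and do not describe the actual delivered file layout.

```python
import csv
import io


def count_by_column(handle, column: str) -> dict:
    """Stream a tab-delimited file and count rows per value of `column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:  # one row in memory at a time
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Hypothetical miniature file; a real one would be opened with open(path).
demo = io.StringIO(
    "chromosome\tbegin\tend\tvarType\n"
    "chr1\t100\t101\tsnp\n"
    "chr1\t200\t203\tdel\n"
    "chr2\t50\t51\tsnp\n"
)
counts = count_by_column(demo, "varType")
print(counts)  # {'snp': 2, 'del': 1}
```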

Workflows
  Tumor/Normal
    Copy Number
    Structural Variation
    Annotated Somatic Variants
  Germline
    List of annotated genotypes per individual, summarized into a single file that can be used for filtering
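
The germline summary can be pictured as a table with one row per variant and one column per individual. This is a sketch under assumptions, not the workflow's actual output format: the variant identifiers, the genotype strings, the `summarize` function, and the "NA" placeholder for a missing call are all hypothetical.

```python
def summarize(genotypes_by_individual: dict) -> list:
    """Merge per-individual genotype calls into one table (list of rows)."""
    individuals = sorted(genotypes_by_individual)
    variants = sorted(
        {v for calls in genotypes_by_individual.values() for v in calls}
    )
    table = [["variant"] + individuals]
    for variant in variants:
        row = [variant]
        for person in individuals:
            # "NA" marks a variant with no call in this individual (assumption).
            row.append(genotypes_by_individual[person].get(variant, "NA"))
        table.append(row)
    return table


genotypes = {
    "child": {"chr1:100A>G": "0/1"},
    "mother": {"chr1:100A>G": "0/1", "chr2:50C>T": "1/1"},
}
table = summarize(genotypes)
for row in table:
    print("\t".join(row))
```

Writing the rows out tab-delimited gives a single file that can be filtered with ordinary command-line tools or a spreadsheet.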

Germline Workflow
  Output
  Future directions
    Be “smarter” about inheritance framework
    Further refinements of comparison to other data types (exomes, snp arrays, RNA-seq)

Medvedev et al., Nature 2009

Frequent genetic alterations in three critical signalling pathways.

The Cancer Genome Atlas Research Network. Nature 000, 1-8 (2008). doi:10.1038/nature07385

Chromatin
  Chromatin is the complex of protein and DNA that make up the chromosomes. It is not a static structure.
  DNAse is an enzyme that cuts DNA at locations where DNA is accessible
  These “accessible” regions have been associated with open chromatin
  Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription

DNAse Hypersensitivity
  Method for finding regions of “open” chromatin
  In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:
    Histone modification
    Transcription start sites
    Early replicating regions
    Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.

DNAse-chip Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse-Seq Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse Sites Relative to Genes

DNAse HS Sites and Gene Expression
  DNAse HS sites near transcription start sites are associated with actively transcribed genes.

Nucleosome Positioning
  Distances between sequences in non-DNAse HS regions have an oscillating pattern with frequency that corresponds to a single turn of the double-helix
  DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when wrapped around a nucleosome
  A nucleosome is wrapped by 147 base pairs when complexed with DNA
  Implication: Nucleosomes are positioned in a highly organized, precise manner
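
The two numbers on this slide can be combined in a short calculation: 147 base pairs at 10.4 bases per helical turn is roughly 14 turns, so a cut that favors the exposed minor groove would be expected at offsets spaced about 10.4 bases apart along the wrapped DNA. The variable names and the rounding of offsets to whole bases are illustrative choices.

```python
NUCLEOSOME_BP = 147    # base pairs wrapped around one nucleosome
BASES_PER_TURN = 10.4  # helical period when wrapped around a nucleosome

turns = NUCLEOSOME_BP / BASES_PER_TURN
print(round(turns, 1))  # 14.1

# Expected offsets (in bases) at which the minor groove faces outward,
# measured from the first exposed position.
offsets = [round(i * BASES_PER_TURN) for i in range(int(turns) + 1)]
print(offsets[:5])  # [0, 10, 21, 31, 42]
```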
