Investigating the 3D structure of the genome with Hi-C data analysis

Sylvain Foissac & Nathalie Villa-Vialaneix
prenom.nom@inra.fr
MIAT Seminar - Toulouse, June 2, 2017

Outline
1 Normalization
2 TAD identification
3 A/B compartments
4 Differential analysis

1 Normalization

Purpose of normalization
1 within matrix normalization: make bins comparable within a matrix
(not needed for differential analysis)
2 between matrix normalization: make the same bin pair comparable
between two matrices (needed for differential analysis)

Different within matrix normalizations
to correct technical biases (GC content, mappability...)
explicit correction [Yaffe and Tanay, 2011, Hu et al., 2012]: every factor causing bias is identified and estimated
non-parametric correction: ICE correction using matrix balancing [Imakaev et al., 2012]
$K_{ij} = b_i \, \widetilde{K}_{ij} \, b_j$ for a $\widetilde{K}$ such that, $\forall\, i = 1, \ldots, p$, $\sum_{j=1}^{p} \widetilde{K}_{ij}$ is constant
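The matrix balancing idea above can be sketched in a few lines. This is a minimal illustration, not the reference ICE implementation: the function name `ice_balance`, the stopping rule and the iteration cap are assumptions made for the example, and the input is assumed to be a symmetric count matrix with no all-zero row.

```python
def ice_balance(counts, n_iter=200, tol=1e-10):
    """Rescale a symmetric matrix until all its row sums are equal.

    Returns (balanced, bias) with counts[i][j] ~ bias[i] * balanced[i][j] * bias[j].
    Assumes every row has a non-zero sum (hypothetical simplification).
    """
    p = len(counts)
    balanced = [list(map(float, row)) for row in counts]
    bias = [1.0] * p
    for _ in range(n_iter):
        row_sums = [sum(row) for row in balanced]
        mean_sum = sum(row_sums) / p
        # Relative coverage of each bin; 1.0 everywhere means balanced.
        factors = [s / mean_sum for s in row_sums]
        for i in range(p):
            for j in range(p):
                balanced[i][j] /= factors[i] * factors[j]
        # Accumulate the per-bin bias removed so far.
        bias = [b * f for b, f in zip(bias, factors)]
        if max(abs(f - 1.0) for f in factors) < tol:
            break
    return balanced, bias
```

Bins with zero coverage would have to be masked before calling such a function; how that is done depends on the tool actually used.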

picture from [Schmitt et al., 2016]

to take distances into account
theoretical distribution taken from [Belton et al., 2012]
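$K^{d}_{ij} = \dfrac{K_{ij} - \overline{K}_{d(i,j)}}{\sigma(D_{d(i,j)})}$
with $\overline{K}_{d}$ the average count at distance $d$ and $\sigma(D_{d})$ the standard deviation of counts at distance $d$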
available in HiTC [Servant et al., 2012]
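The distance-based normalization can be illustrated as a z-score computed separately on each diagonal of the matrix. The sketch below assumes that the distance d(i,j) is simply |i - j| in bins and uses the population standard deviation; both choices, and the name `distance_zscore`, are assumptions for the example and not necessarily what HiTC does.

```python
from statistics import mean, pstdev


def distance_zscore(counts):
    """Center and scale each entry by the mean and sd of its diagonal."""
    p = len(counts)
    z = [[0.0] * p for _ in range(p)]
    for d in range(p):
        # All bin pairs (i, i + d) are at the same genomic distance d.
        diagonal = [counts[i][i + d] for i in range(p - d)]
        mu = mean(diagonal)
        sigma = pstdev(diagonal)
        for i in range(p - d):
            # Diagonals with no variability are set to 0 by convention.
            value = (counts[i][i + d] - mu) / sigma if sigma > 0 else 0.0
            z[i][i + d] = value
            z[i + d][i] = value
    return z
```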

Between matrix normalization
correct for differences in sequencing depth
standard approach: similar to RNA-seq normalization
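As a point of comparison, the simplest depth correction rescales every matrix to a common total count. This is only a sketch of that baseline, with a hypothetical function name, and it is not the RNA-seq style normalization of any specific package.

```python
def scale_to_common_depth(matrices):
    """Rescale each contact matrix so that all matrices share the same total."""
    totals = [sum(sum(row) for row in m) for m in matrices]
    target = sum(totals) / len(totals)  # common depth: average total count
    return [
        [[value * target / total for value in row] for row in m]
        for m, total in zip(matrices, totals)
    ]
```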

However...
density adjustment by LOESS fit [Robinson and Oshlack, 2010] (implemented in csaw)

2 TAD identification

Topologically Associated Domains (TADs)
[Rao et al., 2014]

TAD method jungle
Directionality index [Dixon et al., 2012]: compute divergence between
up/downstream interaction counts + HMM to identify TADs
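The directionality index itself (before the HMM step, which is not shown) can be sketched as follows. The window is expressed in bins, which is an assumption of the example; the formula is the signed chi-square-like statistic of [Dixon et al., 2012], with A the upstream count, B the downstream count and E = (A + B) / 2.

```python
def directionality_index(counts, window):
    """Signed imbalance between upstream and downstream contacts of each bin."""
    p = len(counts)
    di = []
    for i in range(p):
        upstream = sum(counts[i][j] for j in range(max(0, i - window), i))
        downstream = sum(counts[i][j] for j in range(i + 1, min(p, i + window + 1)))
        expected = (upstream + downstream) / 2
        if expected == 0 or upstream == downstream:
            di.append(0.0)  # no contacts or perfectly balanced bin
            continue
        sign = 1.0 if downstream > upstream else -1.0
        chi2 = ((upstream - expected) ** 2 + (downstream - expected) ** 2) / expected
        di.append(sign * chi2)
    return di
```

On a matrix with two well separated blocks, the index switches from negative to positive at the block boundary, which is the pattern the HMM is then used to detect.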

armatus [Filippova et al., 2013]: maximizes a criterion that evaluates a within/between count ratio + combines multi-resolution results into a consensus segmentation
segmentation method [Brault et al., 2017]: block boundary estimation in the matrix
... (many others); interestingly, very few provide a hierarchical clustering
Comparisons in: [Fotuhi Siahpirani et al., 2016, Dali and Blanchette, 2017]

DI evolution with respect to armatus TADs

CTCF at TAD boundaries

Enrichment of genomic features around TAD boundaries
Homo sapiens [Dixon et al., 2012]
Sus scrofa (PORCINET project)

Current methodological development
Constrained HAC as a way to compare/combine TADs between samples
Constrained HAC: hierarchical clustering with contiguity constraints
Challenges (currently under development with Pierre Neuvial and Marie Chavent):
methodological issues: what happens when using Ward's linkage criterion with a non-Euclidean similarity (counts of the Hi-C matrix)? what happens when adding constraints to HAC? (partially solved)
development of the R package adjclust (Google Summer of Code selected project)

3 A/B compartments

A/B compartments
[Lieberman-Aiden et al., 2009]
[Giorgetti et al., 2013]
Method (in theory):
compute Pearson correlations between bins (using interaction counts with all the other bins of the same chromosome)
compute eigenvectors (or perform PCA) on this correlation matrix
assign bins to the A or B compartment according to the sign (+/-) of the PC values
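The three theoretical steps can be sketched with the standard library only. In this illustration the leading eigenvector is obtained by power iteration (an assumption of the example, any eigen-solver would do), the function names are hypothetical, and the output is a +/- sign per bin because the sign of an eigenvector is arbitrary: deciding which sign is A requires external information.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two count profiles (non-constant rows)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_matrix(counts):
    """Correlation between the interaction profiles of every pair of bins."""
    return [[pearson(r1, r2) for r2 in counts] for r1 in counts]


def leading_eigenvector(matrix, n_iter=500):
    """Power iteration; the start vector is an arbitrary non-constant choice."""
    p = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(p)]
    for _ in range(n_iter):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    return vec


def compartment_signs(counts):
    """Return a '+' or '-' label per bin from the sign of the first eigenvector."""
    pc1 = leading_eigenvector(correlation_matrix(counts))
    return ["+" if v >= 0 else "-" for v in pc1], pc1
```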

A/B compartments in practice
after ICED and distance-based normalizations
Method:
differentiate between A and B using the sign of the correlation between PCs and diagonal counts
choose the relevant PC and method by maximizing −log10(p-value) of a two-group Student test comparing diagonal counts between the + and − groups of the PC
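The PC selection step can be sketched as below. With a pooled-variance Student test the number of degrees of freedom is the same for every PC (number of bins minus 2), so ranking PCs by |t| gives the same order as ranking by −log10(p-value); the sketch uses |t| to avoid needing a t distribution. Function names are hypothetical, and each sign group is assumed to contain at least two bins.

```python
from math import sqrt
from statistics import mean, variance


def student_t(x, y):
    """Pooled-variance two-sample Student t statistic."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


def score_pc(pc, diagonal_counts):
    """|t| between the diagonal counts of bins with positive and negative PC values."""
    plus = [c for v, c in zip(pc, diagonal_counts) if v >= 0]
    minus = [c for v, c in zip(pc, diagonal_counts) if v < 0]
    return abs(student_t(plus, minus))


def best_pc(pcs, diagonal_counts):
    """Index of the PC whose sign best separates the diagonal counts."""
    scores = [score_pc(pc, diagonal_counts) for pc in pcs]
    return max(range(len(pcs)), key=scores.__getitem__)
```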

Biological validation

4 Differential analysis

Filtering
In differential analysis of sequencing data, filtering is a crucial step:
removing low-count features (which have little or no chance of being found differential) improves test power (by reducing the impact of the multiple testing correction) and saves unnecessary computation time
filtering can be 1/ performed at the beginning of the analysis or after estimating the parameters of the model used for differential analysis; 2/ fixed to an arbitrary value (minimum total count per sample) or automated from the data
for Hi-C data:
filtering was performed at the beginning of the analysis (to limit the computational burden)
filtering used either an arbitrary threshold or a threshold based on an estimate of the background noise given by a quantile of inter-chromosomal counts (as in the R package diffHic)
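A data-driven threshold of this kind can be sketched as follows. The quantile level, the use of the mean count across samples and the function names are assumptions made for the illustration; this is not the diffHic API nor its exact filtering rule.

```python
def background_threshold(inter_counts, quantile=0.5):
    """Empirical quantile of inter-chromosomal counts, used as a noise level."""
    ordered = sorted(inter_counts)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def filter_pairs(pair_counts, threshold):
    """Keep bin pairs whose mean count across samples exceeds the threshold.

    pair_counts maps (bin_i, bin_j) to the list of counts, one per sample.
    """
    return {
        pair: counts
        for pair, counts in pair_counts.items()
        if sum(counts) / len(counts) > threshold
    }
```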

Filtering
500 kb bins, automatic filter (removes counts below about 5): 96.4% of pairs filtered out
[Figure: two panels, "before filtering" and "after filtering"]

Exploratory analysis (500kb bins)
[Figure: chromosome 1, heatmap of pairwise cosine similarities (Frobenius norm) between the six samples LW90-160216-GCCAAT, LW90-160223-CTTGTA, LW90-160308-AGTTCC, LW110-160307-CGATGT, LW110-160308-AGTCAA and LW110-160517-ACAGTG; off-diagonal values range from 0.8288 to 0.9118; color scale from -1 to 1]
good reproducibility between experiments
no clear organization with respect to the condition
all data after filtering and between matrix normalization (LOESS): 2 outliers but PC1 is organized with respect to the condition
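The similarity used in the heatmap can be written as the cosine associated with the Frobenius inner product between two contact matrices. The sketch below is a plain illustration with hypothetical function names and assumes two non-zero matrices of identical dimensions.

```python
from math import sqrt


def frobenius_inner(a, b):
    """Sum of entrywise products of two matrices of the same shape."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def cosine_similarity(a, b):
    """Cosine between two matrices seen as vectors (Frobenius geometry)."""
    return frobenius_inner(a, b) / sqrt(frobenius_inner(a, a) * frobenius_inner(b, b))
```

Because the cosine is invariant to a global rescaling of either matrix, this similarity is not affected by differences in sequencing depth alone.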

Methods for differential analysis of Hi-C
Similar to RNA-seq [Lun and Smyth, 2015] and R package diffHic (essentially a wrapper for edgeR):
count data are modeled by a negative binomial distribution
parameters (mean and variance per gene, here per bin pair) are estimated from the data: a variance vs mean relationship is modeled
the test is either an exact test (similar to Fisher's) or a likelihood ratio test (GLM)
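To make the variance vs mean relationship concrete, the negative binomial can be parameterized by a mean and a dispersion, so that the variance is mean + dispersion × mean². The sketch below only evaluates this distribution; it does not reproduce the estimation or testing procedures of edgeR, and the function names are hypothetical.

```python
from math import lgamma, log


def nb_logpmf(k, mu, dispersion):
    """Log-probability of count k under a negative binomial (mu > 0, dispersion > 0)."""
    size = 1.0 / dispersion  # usual "size" parameter
    return (
        lgamma(k + size) - lgamma(size) - lgamma(k + 1)
        + size * log(size / (size + mu))
        + k * log(mu / (size + mu))
    )


def nb_variance(mu, dispersion):
    """Quadratic mean-variance relationship of the negative binomial."""
    return mu + dispersion * mu ** 2
```

When the dispersion tends to 0 the variance tends to the mean, which is the Poisson case; the dispersion therefore captures the extra variability between replicates.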

Complementary remarks about DE analysis
Hi-C data contain more zeros than RNA-seq data: some people propose to use a zero-inflated negative binomial distribution (unpublished as far as I know)
the method provides a p-value for every pair of bins: the analysis is based on a very large number of bin pairs at finer resolutions (500kb after filtering: 998 623 pairs of bins; without filtering: 13 509 221 pairs of bins): problem solved for 500kb bins but still under study for 40kb bins
tests are performed as if bin pairs were independent whereas they are spatially correlated: estimation of model parameters might be improved if 1/ smoothed with respect to spatial proximity (similar to what is sometimes done in methylation data analysis); 2/ performed independently for pairs of bins at a given distance (future work).
post-analysis of the spatial distribution of p-values: work in progress with Pierre Neuvial (submitted CNRS project)

because last page had no picture
probably not suited for the youngest

Preliminary results
913 bin pairs found differential (after multiple testing correction)
most of them are related to 3 chromosomes
parameter setting (filters...) and biological analysis are work-in-progress...

Differential TADs (state-of-the-art)
Detecting differential domains between the two conditions
Existing approaches:
[Fraser et al., 2015] (3 conditions, no replicate)
HMM on TAD boundaries (with a tolerance threshold) to identify
different TAD boundaries between samples
HAC on TADs, cophenetic distance to obtain local conserved structure
by using a z-score approach

R package diffHic computes up/down-stream counts (with ± 100Kb)
and uses the GLM model implemented in edgeR with an interaction
between stream direction (up/down) and condition.
However, the first approach does not take biological variability into account
(no replicate) and the second uses only a very aggregated criterion.

Differential TADs (perspectives)
Ideas for future work
Using constrained HAC, are we able to:
compute a consensus dendrogram using several biological replicates;
differentiate branches significantly (in which sense?) different
between conditions taking into account the within condition variability?

Conclusions and perspectives
Honestly, it's late and I really do not believe that I will have enough time to
make a conclusion and discuss perspectives so...
Questions?

References
Belton, J., Patton MacCord, R., Harmen Gibcus, J., Naumova, N., Zhan, Y., and Dekker, J. (2012).
Hi-C: a comprehensive technique to capture the conformation of genomes.
Methods, 58:268–276.
Brault, V., Chiquet, J., and Lévy-Leduc, C. (2017).
Efficient block boundaries estimation in block-wise constant matrices: an application to HiC data.
Electronic Journal of Statistics, 11(1):1570–1599.
Dali, R. and Blanchette, M. (2017).
A critical assessment of topologically associating domain prediction tools.
Nucleic Acids Research, 45(6):2994–3005.
Dixon, J., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J., and Ren, B. (2012).
Topological domains in mammalian genomes identified by analysis of chromatin interactions.
Nature, 485:376–380.
Filippova, D., Patro, R., Duggal, G., and Kingsford, C. (2013).
Identification of alternative topological domains in chromatin.
Algorithms for Molecular Biology, 9:14.
Fotuhi Siahpirani, A., Ay, F., and Roy, S. (2016).
A multi-task graph-clustering approach for chromosome conformation capture data sets identifies conserved modules of
chromosomal interactions.
Genome Biology, 17:114.
Fraser, J., Ferrai, C., Chiariello, A., Schueler, M., Rito, T., Laudanno, G., Barbieri, M., Moore, B., Kraemer, D., Aitken, S., Xie, S.,
Morris, K., Itoh, M., Kawaji, H., Jaeger, I., Hayashizaki, Y., Carninci, P., Forrest, A., The FANTOM Consortium, Semple, C.,
Dostie, J., Pombo, A., and Nicodemi, M. (2015).
Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation.
Molecular Systems Biology, 11:852.
Giorgetti, L., Servant, N., and Heard, E. (2013).
Changes in the organization of the genome during the mammalian cell cycle.
Genome Biology, 14:142.
Hu, M., Deng, K., Selvaraj, S., Qin, Z., Ren, B., and Liu, J. (2012).
HiCNorm: removing biases in Hi-C data via Poisson regression.
Bioinformatics, 28(23):3131–3133.
Imakaev, M., Fudenberg, G., McCord, R., Naumova, N., Goloborodko, A., Lajoie, B., Dekker, J., and Mirny, L. (2012).
Iterative correction of Hi-C data reveals hallmarks of chromosome organization.
Nature Methods, 9:999–1003.
Lieberman-Aiden, E., van Berkum, N., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B., Sabo, P., Dorschner,
M., Sandstrom, R., Bernstein, B., Bender, M., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L., Lander, E., and
Dekker, J. (2009).
Comprehensive mapping of long-range interactions reveals folding principles of the human genome.
Science, 326(5950):289–293.
Lun, A. and Smyth, G. (2015).
diffHic: a Bioconductor package to detect differential genomic interactions in Hi-C data.
BMC Bioinformatics, 16:258.
Rao, S., Huntley, M., Durand, N., Stamenova, E., Bochkov, I., Robinson, J., Sanborn, A., Machol, I., Omer, A., Lander, E., and
Lieberman Aiden, E. (2014).
A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping.
Cell, 159(7):1665–1680.
Robinson, M. and Oshlack, A. (2010).
A scaling normalization method for differential expression analysis of RNA-seq data.
Genome Biology, 11:R25.
Schmitt, A., Hu, M., and Ren, B. (2016).
Genome-wide mapping and analysis of chromosome architecture.
Nature Reviews Molecular Cell Biology, 17(12):743–755.
Servant, N., Lajoie, B., Nora, E., Giorgetti, L., Chen, C., Heard, E., Dekker, J., and Barillot, E. (2012).
HiTC: exploration of high-throughput 'C' experiments.
Bioinformatics, 28(21):2843–2844.
Yaffe, E. and Tanay, A. (2011).
Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.
Nature Genetics, 43:1059–1065.
