This document discusses genetic linkage analysis and map construction. It begins by presenting Mendel's experiments with plant hybrids and rediscovery of his work in 1900. It then discusses estimation of recombination frequency using different genetic populations like backcross, doubled haploid, recombinant inbred lines and describes maximum likelihood estimation. The document also discusses significance tests for linkage and examples of linkage analysis in different populations like F2, BC1F2. It describes expected genotypic frequencies and effect of distortion. Finally, it mentions three-point analysis and construction of linkage maps.
3. Experiments with Plant Hybrids (1866)
Seed shape: 5474 round vs 1850 wrinkled
Cotyledon color: 6022 yellow vs 2001 green
Seed coat color: 705 grey-brown vs 224 white
Pod shape: 882 inflated vs 299 constricted
Unripe pod color: 428 green vs 152 yellow
Flower position: 651 axial vs 207 terminal
Stem length: 787 long (185-230cm) vs 277 short (20-50cm)
Rediscovered in 1900
13. Mendel and Fisher
Fisher (1936), Annals of Science 1:115-
Fisher's conclusion: the data were so close to the values that Mendel expected under his theory that there must have been some manipulation, or omission, of data
Dominant trait: 1/3 AA + 2/3 Aa
Family size: 10
Mendel: Non-segregating (AA) : Segregating (Aa) = 1:2
Fisher: Pr{Aa family classified as AA} = 0.75^10 = 0.0563
Pr{Segregating (Aa)} = 2/3*(1-0.0563) = 0.6291
Pr{Non-segregating (AA)} = 1/3 + 2/3*0.0563 = 0.3709
Non-segregating (AA) : Segregating (Aa) = 0.3709 : 0.6291 = 1 : 1.696
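Fisher's correction can be checked with a few lines of arithmetic. The sketch below is only an illustration of the calculation on this slide; the function name and the `family_size` parameter are hypothetical.

```python
def fisher_expected_ratio(family_size: int = 10) -> tuple[float, float]:
    """Return (P(classified non-segregating), P(classified segregating))
    for plants showing the dominant trait (1/3 AA + 2/3 Aa)."""
    # An Aa parent is misclassified as AA when all of its progeny
    # happen to show the dominant trait (probability 3/4 each).
    p_misclassified = 0.75 ** family_size
    p_segregating = (2 / 3) * (1 - p_misclassified)
    p_non_segregating = 1 - p_segregating
    return p_non_segregating, p_segregating


p_nonseg, p_seg = fisher_expected_ratio(10)
print(f"{p_nonseg:.4f} : {p_seg:.4f}")
```

With larger families the misclassification probability shrinks and the expected ratio approaches Mendel's 1:2.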
18. Genetic markers in linkage analysis
Morphological traits
hybridization experiments
Cytogenetic and biochemical markers (e.g. isozyme)
DNA molecular markers
RFLP, SSR, SNP etc.
19. The four gametes (haplotypes) of an F1
P1: AB/AB (AABB)    P2: ab/ab (aabb)
F1: AB/ab (AaBb)
Gametes of the F1 after meiosis:
Gamete      AB              Ab            aB            ab
Frequency   (1-r)/2         r/2           r/2           (1-r)/2
Type        Parental type   Recombinant   Recombinant   Parental type
21. MLE of recombination frequency
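Likelihood function
$$L = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\left[\tfrac{1}{2}(1-r)\right]^{n_1}\left[\tfrac{1}{2}r\right]^{n_2}\left[\tfrac{1}{2}r\right]^{n_3}\left[\tfrac{1}{2}(1-r)\right]^{n_4} = C\,(1-r)^{n_1+n_4}\,r^{n_2+n_3}$$
Logarithm of likelihood
$$\ln L = \ln C + (n_1+n_4)\ln(1-r) + (n_2+n_3)\ln r$$
MLE of r
$$\hat r = \frac{n_2+n_3}{n_1+n_2+n_3+n_4} = \frac{n_2+n_3}{n}$$
Fisher information
$$I = -E\left(\frac{d^2\ln L}{dr^2}\right) = E\left[\frac{n_1+n_4}{(1-r)^2} + \frac{n_2+n_3}{r^2}\right] = \frac{n}{r(1-r)}$$
Variance of estimated r
$$V_{\hat r} = \frac{1}{I} = \frac{r(1-r)}{n}$$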
22. Significance test of linkage
Null hypothesis H0: r = 0.5 (no genetic linkage, or
locus A-a and B-b are independent)
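Alternative hypothesis HA: r ≠ 0.5 (the two loci are linked)
Likelihood ratio test (LRT) or LOD score
$$\mathrm{LRT} = -2\ln\left[\frac{L(r=0.5)}{L(\hat r)}\right] \sim \chi^2(df=1)$$
$$\mathrm{LOD} = \log_{10}\left[\frac{L(\hat r)}{L(r=0.5)}\right]$$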
23. An example P1BC1 population
Genotypes of two inbred parents P1 and P2
are AABB and aabb
Observed samples of the four genotypes in P1BC1
AABB 162    AABb 40    AaBB 41    AaBb 158
$$\hat r = \frac{40+41}{162+40+41+158} = \frac{81}{401} = 20.20\%$$
$$V_{\hat r} = \frac{\hat r(1-\hat r)}{n} = 4.02\times10^{-4}$$
24. Test of linkage
Null hypothesis H0: r = 0.5
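Alternative hypothesis HA: r ≠ 0.5
$$\frac{L(\hat r)}{L(r=0.5)} = \frac{(1-\hat r)^{n_1+n_4}\,\hat r^{\,n_2+n_3}}{\left(\frac{1}{2}\right)^{n_1+n_2+n_3+n_4}} = 1.2\times10^{33}$$
Each genotype frequency carries a factor 1/2 under both hypotheses, so that factor cancels together with the multinomial constant.
Likelihood ratio test (LRT) (P<0.0001) and LOD score
$$\mathrm{LRT} = 2\ln\left[\frac{L(\hat r)}{L(r=0.5)}\right] = 152.37$$
$$\mathrm{LOD} = \log_{10}\left[\frac{L(\hat r)}{L(r=0.5)}\right] = 33.09$$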
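The backcross example can be reproduced with a short script. This is a minimal sketch, not the implementation of any particular package: the function name is hypothetical, it assumes 0 < r < 1 so that the logarithms are defined, and it evaluates both likelihoods with the same genotype frequencies (1-r)/2 and r/2.

```python
import math


def backcross_linkage(n1, n2, n3, n4):
    """n1, n4: parental-type counts; n2, n3: recombinant-type counts.
    Returns (r_hat, variance of r_hat, LRT, LOD)."""
    n = n1 + n2 + n3 + n4
    r_hat = (n2 + n3) / n
    var_r = r_hat * (1 - r_hat) / n

    def loglik(r):
        # ln L up to the multinomial constant, which cancels in the ratio
        return ((n1 + n4) * math.log((1 - r) / 2)
                + (n2 + n3) * math.log(r / 2))

    lrt = 2 * (loglik(r_hat) - loglik(0.5))
    lod = lrt / (2 * math.log(10))
    return r_hat, var_r, lrt, lod


r_hat, var_r, lrt, lod = backcross_linkage(162, 40, 41, 158)
print(f"r = {r_hat:.4f}, Vr = {var_r:.2e}, LRT = {lrt:.2f}, LOD = {lod:.2f}")
```

The LOD score is simply the LRT divided by 2 ln(10), so either statistic can be reported.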
25. Genotypic frequencies in RIL populations, compared with DH
Genotype   DH theoretical frequency   RIL theoretical frequency
AABB       f1=(1-r)/2                 f1=(1-R)/2
AAbb       f2=r/2                     f2=R/2
aaBB       f3=r/2                     f3=R/2
aabb       f4=(1-r)/2                 f4=(1-R)/2
R=2r/(1+2r), so r=R/(2-2R)
26. Example: markers C263 and XNpb387 scored in 10 RILs
RIL     Marker 1 (C263)   Marker 2 (XNpb387)   Parent type or recombinant
RIL1    0 or A            0 or A               P1 type
RIL2    2 or B            2 or B               P2 type
RIL3    0 or A            2 or B               Recombinant
RIL4    0 or A            0 or A               P1 type
RIL5    0 or A            0 or A               P1 type
RIL6    0 or A            2 or B               Recombinant
RIL7    0 or A            0 or A               P1 type
RIL8    2 or B            2 or B               P2 type
RIL9    0 or A            0 or A               P1 type
RIL10   0 or A            0 or A               P1 type
n1=6, n2=2, n3=0, n4=2
R=2/10=0.2
r=R/(2-2R)=0.125
LOD = log10[(1-R)^8 R^2 / (1/2)^10] = 0.84
LRT = 2 ln(10) x LOD = 3.85 (P ≈ 0.05)
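The RIL calculation differs from the backcross case only in the conversion from R to r. The sketch below assumes the genotype frequencies (1-R)/2 and R/2 from the RIL table and tests R = 0.5, which is equivalent to r = 0.5; the function name is hypothetical and 0 < R < 1 is assumed.

```python
import math


def ril_linkage(n_parental, n_recombinant):
    """Return (R, r, LRT, LOD) for two markers scored in RILs."""
    n = n_parental + n_recombinant
    R = n_recombinant / n          # proportion of recombinant lines
    r = R / (2 - 2 * R)            # inverse of R = 2r / (1 + 2r)

    def loglik(x):
        # each parental class has frequency (1-x)/2, each recombinant class x/2
        return (n_parental * math.log((1 - x) / 2)
                + n_recombinant * math.log(x / 2))

    lrt = 2 * (loglik(R) - loglik(0.5))
    lod = lrt / (2 * math.log(10))
    return R, r, lrt, lod


R, r, lrt, lod = ril_linkage(n_parental=8, n_recombinant=2)
print(f"R = {R:.3f}, r = {r:.3f}, LRT = {lrt:.2f}, LOD = {lod:.2f}")
```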
28. MLE of r in F2: dominant markers
Logarithm of the likelihood, with k = (1-r)^2
$$\ln L = C + n_1\ln(3-2r+r^2) + (n_3+n_7)\ln(2r-r^2) + n_9\ln(1-2r+r^2)$$
$$\ln L = C + n_1\ln(2+k) + (n_3+n_7)\ln(1-k) + n_9\ln k$$
MLE of r
$$\hat k = (1-\hat r)^2 = \frac{-(2n-3n_1-n_9) + \sqrt{(2n-3n_1-n_9)^2 + 8\,n\,n_9}}{2n}$$
Variance of the estimated r
$$V_{\hat r} = \frac{(1-k)(2+k)}{2n(1+2k)} = \frac{(2r-r^2)(3-2r+r^2)}{2n(3-4r+2r^2)}$$
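Setting the derivative of the log-likelihood with respect to k to zero gives the quadratic n k^2 + (2n - 3n1 - n9) k - 2 n9 = 0, whose positive root is the MLE of k. A minimal sketch of the closed-form estimate follows; the function name is hypothetical, and n1, n3, n7, n9 are taken to be the counts of the A_B_, A_bb, aaB_ and aabb phenotype classes.

```python
import math


def f2_dominant_mle(n1, n3, n7, n9):
    """Closed-form MLE of r for two dominant markers in an F2 (coupling phase).
    Returns (r_hat, variance of r_hat)."""
    n = n1 + n3 + n7 + n9
    b = 2 * n - 3 * n1 - n9
    # positive root of n*k^2 + b*k - 2*n9 = 0, where k = (1 - r)^2
    k = (-b + math.sqrt(b * b + 8 * n * n9)) / (2 * n)
    r_hat = 1 - math.sqrt(k)
    var_r = (1 - k) * (2 + k) / (2 * n * (1 + 2 * k))
    return r_hat, var_r


# Counts equal to their expectations for r = 0.2 and n = 100
print(f2_dominant_mle(66, 9, 9, 16))
```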
29. MLE of r in F2: co-dominant markers
(Newton-Raphson algorithm)
Log-likelihood function
$$\ln L = \ln C + (2n_1+2n_9+n_2+n_4+n_6+n_8)\ln(1-r) + (n_2+n_4+n_6+n_8+2n_3+2n_7)\ln r + n_5\ln(1-2r+2r^2)$$
The first-order derivative of LogL
$$f'(r) = \frac{d\ln L}{dr} = -\frac{2n_1+2n_9+n_2+n_4+n_6+n_8}{1-r} + \frac{n_2+n_4+n_6+n_8+2n_3+2n_7}{r} + \frac{n_5(4r-2)}{1-2r+2r^2}$$
The second-order derivative of LogL
$$f''(r) = \frac{d^2\ln L}{dr^2} = -\frac{2n_1+2n_9+n_2+n_4+n_6+n_8}{(1-r)^2} - \frac{n_2+n_4+n_6+n_8+2n_3+2n_7}{r^2} + \frac{8\,n_5\,r(1-r)}{(1-2r+2r^2)^2}$$
The iteration algorithm:
$$r_{i+1} = r_i - \frac{f'(r_i)}{f''(r_i)}$$
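The Newton-Raphson iteration can be sketched as follows. The ordering of the nine genotype counts (AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb) is an assumption inferred from the log-likelihood, and the starting value, tolerance and clamping to (0, 0.5] are illustrative choices rather than part of the slide.

```python
def f2_codominant_newton(counts, r0=0.25, tol=1e-10, max_iter=50):
    """Newton-Raphson MLE of r for two co-dominant markers in an F2.
    counts = (n1, ..., n9) in the genotype order assumed above."""
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = counts
    a = 2 * n1 + 2 * n9 + n2 + n4 + n6 + n8   # coefficient of ln(1 - r)
    b = n2 + n4 + n6 + n8 + 2 * n3 + 2 * n7   # coefficient of ln(r)
    r = r0
    for _ in range(max_iter):
        q = 1 - 2 * r + 2 * r * r
        d1 = -a / (1 - r) + b / r + n5 * (4 * r - 2) / q
        d2 = -a / (1 - r) ** 2 - b / r ** 2 + 8 * n5 * r * (1 - r) / q ** 2
        # keep the update inside the valid range of r
        r_new = min(max(r - d1 / d2, 1e-6), 0.5)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r


# Counts equal to their expectations for r = 0.2 and n = 1000
expected = (160, 80, 10, 80, 340, 80, 10, 80, 160)
print(round(f2_codominant_newton(expected), 6))
```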
30. MLE of r in F2: co-dominant
markers (EM algorithm)
EM for expectation and maximization
E-step: for an initial r0, calculate the probability of
crossover in each marker type
M-step: Update r, and repeat from the E-step
$$r' = \frac{1}{n}\sum_k n_k\,P_k(R \mid G)$$
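A minimal sketch of the EM iteration for two co-dominant markers in an F2 is given below. It assumes the same genotype order as the co-dominant log-likelihood (AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb) and interprets P_k(R|G) as the expected proportion of recombinant gametes in genotype k: 0 for the parental homozygotes, 1/2 for single heterozygotes, 1 for AAbb and aaBB, and r^2/((1-r)^2 + r^2) for the double heterozygote, whose phase is unobserved. The function name and stopping rule are hypothetical.

```python
def f2_codominant_em(counts, r0=0.25, tol=1e-10, max_iter=1000):
    """EM estimate of r for two co-dominant markers in an F2."""
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = counts
    n = sum(counts)
    one_recombinant = n2 + n4 + n6 + n8   # exactly one recombinant gamete
    two_recombinant = n3 + n7             # both gametes recombinant
    r = r0
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is in repulsion
        # phase, i.e. was formed from two recombinant gametes
        p5 = r * r / ((1 - r) ** 2 + r * r)
        # M-step: average expected proportion of recombinant gametes
        r_new = (0.5 * one_recombinant + two_recombinant + p5 * n5) / n
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r


expected = (160, 80, 10, 80, 340, 80, 10, 80, 160)
print(round(f2_codominant_em(expected), 6))
```

Only the double heterozygote contributes an r-dependent term, so the iteration converges quickly when that class is not dominant.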
37. Distortion has little effect on
linkage analysis!
DH pop   Theo. Freq.    Distortion    Freq. in distortion
AABB     f1=(1-r)/2     (1-r)/2       (1-r)/(1+s)
AAbb     f2=r/2         r/2           r/(1+s)
aaBB     f3=r/2         s r/2         r s/(1+s)
aabb     f4=(1-r)/2     s (1-r)/2     (1-r) s/(1+s)
Sum      1              (1+s)/2       1
$$\hat r = \frac{r}{1+s} + \frac{rs}{1+s} = \frac{r(1+s)}{1+s} = r$$
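The table can be verified numerically. In this sketch s is the relative fitness applied to the two genotypes that are aa at the first locus, as in the table; the function name is hypothetical.

```python
def dh_frequencies_with_distortion(r, s):
    """Genotype frequencies in a DH population when genotypes carrying aa
    are weighted by s, renormalised to sum to 1."""
    raw = {
        "AABB": (1 - r) / 2,
        "AAbb": r / 2,
        "aaBB": s * r / 2,
        "aabb": s * (1 - r) / 2,
    }
    total = sum(raw.values())  # equals (1 + s) / 2
    return {genotype: freq / total for genotype, freq in raw.items()}


freq = dh_frequencies_with_distortion(r=0.2, s=0.5)
r_estimated = freq["AAbb"] + freq["aaBB"]
print(round(r_estimated, 6))
```

The recombinant fraction is unchanged for any s > 0, which is the point of the slide.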
39. Linkage analysis of three markers
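$$r_{13} = r_{12} + r_{23} - 2(1-\delta)\,r_{12}r_{23}$$
where δ is the degree of interference.
When δ = 0 (no interference),
$$1-r_{13} = (1-r_{12})(1-r_{23}) + r_{12}r_{23}$$
$$r_{13} = r_{12}(1-r_{23}) + (1-r_{12})r_{23} = r_{12} + r_{23} - 2r_{12}r_{23}$$
When δ = 1 (complete interference),
$$r_{13} = r_{12} + r_{23}$$
The order of the three loci can be determined after linkage analysis (3!/2=3 potential orders): 1-2-3, or 1-3-2, or 2-1-3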
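A small sketch of the three-point calculation follows. The ordering rule used here (the pair of loci with the largest pairwise recombination frequency is placed at the two ends) is a common heuristic and an assumption of this example, not something the slide states; both function names are hypothetical.

```python
def r13_expected(r12, r23, interference=0.0):
    """Recombination frequency between the outer loci for order 1-2-3."""
    return r12 + r23 - 2 * (1 - interference) * r12 * r23


def order_three_loci(r12, r13, r23):
    """Return the locus order as a tuple, with the middle locus in the centre."""
    pairwise = {(1, 2): r12, (1, 3): r13, (2, 3): r23}
    ends = max(pairwise, key=pairwise.get)      # most distant pair
    middle = ({1, 2, 3} - set(ends)).pop()
    return (ends[0], middle, ends[1])


print(r13_expected(0.1, 0.2))            # no interference
print(r13_expected(0.1, 0.2, 1.0))       # complete interference
print(order_three_loci(0.1, 0.26, 0.2))
```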
40. Mapping distance and
recombination frequency
Mapping distance is additive: m13 = m12 + m23
Unit of mapping distance
M (Morgan) or cM (centi-Morgan), 1M=100cM
The function of mapping distance on recombination frequency (Mapping function):
m = f(r)
41. Common mapping functions
Morgan function (complete interference)
In M: m = r (M)
In cM: m = 100 r (cM)
Haldane function (no interference)
In M:
$$m = f(r) = -\tfrac{1}{2}\ln(1-2r), \qquad r = \tfrac{1}{2}\left(1-e^{-2m}\right)$$
In cM:
$$m = f(r) = -50\ln(1-2r), \qquad r = \tfrac{1}{2}\left(1-e^{-m/50}\right)$$
Kosambi function (interference depends on length of interval)
In M:
$$m = \frac{1}{4}\ln\frac{1+2r}{1-2r}, \qquad r = \frac{1}{2}\,\frac{e^{4m}-1}{e^{4m}+1}$$
In cM:
$$m = 25\ln\frac{1+2r}{1-2r}, \qquad r = \frac{1}{2}\,\frac{e^{m/25}-1}{e^{m/25}+1}$$
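The three mapping functions and their inverses are easy to tabulate. The sketch below works in cM and assumes 0 <= r < 0.5; the function names are hypothetical.

```python
import math


def morgan_cm(r):
    return 100 * r


def haldane_cm(r):
    return -50 * math.log(1 - 2 * r)


def haldane_r(m_cm):
    return 0.5 * (1 - math.exp(-m_cm / 50))


def kosambi_cm(r):
    return 25 * math.log((1 + 2 * r) / (1 - 2 * r))


def kosambi_r(m_cm):
    e = math.exp(m_cm / 25)
    return 0.5 * (e - 1) / (e + 1)


for r in (0.01, 0.1, 0.2, 0.3, 0.4):
    print(f"r={r:.2f}  Morgan={morgan_cm(r):6.2f}  "
          f"Kosambi={kosambi_cm(r):6.2f}  Haldane={haldane_cm(r):6.2f}")
```

For small r the three distances nearly coincide; as r approaches 0.5 the Haldane and Kosambi distances grow without bound while the Morgan distance stays at 50 cM.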
42. Comparison of the three functions
[Figure: mapping distance (cM) plotted against recombination frequency for the Morgan, Haldane and Kosambi functions]
43. Three steps in linkage map construction
Step 1: Grouping. Grouping can be based on
(i) a threshold of LOD score
(ii) a threshold of marker distance (cM)
(iii) anchor information
Step 2: Ordering. Three ordering algorithms are
(i) SER: SERiation (Buetow and Chakravarti, 1987. Am J Hum Genet 41:180-188)
(ii) RECORD: REcombination Counting and ORDering (Van Os et al., 2005. Theor Appl Genet 112:30-40)
(iii) nnTwoOpt: nearest neighbor is used for tour construction, and two-opt is used for tour improvement, as in the Travelling Salesman Problem (TSP) (Lin and Kernighan, 1973. Oper Res 21:498-516)
44. Three steps in linkage map construction
Due to the large number of markers (n), it is impossible to compare all possible orders (say n=50, possible orders are n!/2=1.52x10^64). Orders from the above algorithms are local optima and are not guaranteed to be the global optimum.
Step 3: Rippling. Five rippling criteria are
(i) SARF (Sum of Adjacent Recombination Frequencies)
(ii) SAD (Sum of Adjacent Distances)
(iii) SALOD (Sum of Adjacent LOD scores)
(iv) COUNT (number of recombination events)
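The ordering idea behind nearest neighbor plus two-opt can be sketched on a small example, using SARF as the criterion to minimise. This is an illustrative sketch only, not the code of any mapping package; the function names, the starting marker and the toy marker positions are all assumptions.

```python
import math


def sarf(order, r):
    """Sum of adjacent recombination frequencies for a marker order."""
    return sum(r[a][b] for a, b in zip(order, order[1:]))


def nearest_neighbor(r, start=0):
    """Build an initial order by repeatedly adding the closest unplaced marker."""
    order = [start]
    unplaced = set(range(len(r))) - {start}
    while unplaced:
        nxt = min(unplaced, key=lambda j: r[order[-1]][j])
        order.append(nxt)
        unplaced.remove(nxt)
    return order


def two_opt(order, r):
    """Improve an order by reversing segments while SARF decreases."""
    best = order[:]
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if sarf(cand, r) < sarf(best, r) - 1e-12:
                    best, improved = cand, True
    return best


# Toy data: five markers at known positions (cM), Haldane function for r
positions = [25, 0, 60, 12, 40]
r = [[0.5 * (1 - math.exp(-abs(p - q) / 50)) for q in positions]
     for p in positions]
initial = nearest_neighbor(r)
final = two_opt(initial, r)
print(initial, final)
```

Nearest neighbor alone can be trapped by its starting marker; the two-opt pass repairs the order here, but on real data the result is still only a local optimum.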
54. What is QTL Mapping?
The procedure to map individual genetic factors
with small effects on the quantitative traits, to
specific chromosomal segments in the genome
The key questions in QTL mapping studies are:
How many QTL are there?
Where are they in the marker map?
How large an influence does each of them
have on the trait of interest?
57. Bi-parental mapping populations (linkage
mapping)
Temporary population: F2 and BC
Permanent population: RIL, DH, CSSL
Secondary population
Association mapping
Natural populations: human and animals
58. Single marker analysis (Sax 1923; Soller et al. 1976)
The single marker analysis identifies QTLs based on the difference
between the mean phenotypes for different marker groups, but cannot
separate the estimates of recombination fraction and QTL effect.
Interval mapping (IM) (Lander and Botstein 1989)
IM is based on maximum likelihood parameter estimation and provides
a likelihood ratio test for QTL position and effect. The major
disadvantage of IM is that the estimates of locations and effects of QTLs
may be biased when QTLs are linked.
Regression interval mapping (RIM)
(Haley and Knott 1992; Martinez and Curnow 1992 )
RIM was proposed to approximate maximum likelihood interval mapping
to save computation time at one or multiple genomic positions.
59. Composite interval mapping (CIM) (Zeng 1994)
CIM combines IM with multiple marker regression analysis,
which controls for the effects of QTLs in other intervals or on
other chromosomes while the current interval is tested, and thus
increases the precision of QTL detection.
Multiple interval mapping (MIM) (Kao et al. 1999)
MIM is a state-of-the-art gene mapping procedure. But
implementation of the multiple-QTL model is difficult, since the
number of QTL defines the dimension of the model which is
also an unknown parameter of interest.
Bayesian model (Sillanpää and Corander 2002)
In any Bayesian model, a prior distribution has to be
specified. Based on the prior, Bayesian statistics derives the
posterior, and then conducts inference based on the posterior
distribution. However, Bayesian models have not been widely
used in practice, partly due to the complexity of
computation and the lack of user-friendly software.
61. Backcrosses (P1BC1 and P2BC1) of P1: MMQQ and P2: mmqq
BC1 (P1BC1)
Genotype   Frequency   Genotypic value
MMQQ       (1-r)/2     m+a
MMQq       r/2         m+d
MmQQ       r/2         m+a
MmQq       (1-r)/2     m+d
BC2 (P2BC1)
Genotype   Frequency   Genotypic value
MmQq       (1-r)/2     m+d
Mmqq       r/2         m-a
mmQq       r/2         m+d
mmqq       (1-r)/2     m-a
62. Two marker types:
Marker type MM: (1-r) MMQQ + r MMQq
$$\mu_{MM} = (1-r)(m+a) + r(m+d) = m + (1-r)a + rd$$
Marker type Mm: r MmQQ + (1-r) MmQq
$$\mu_{Mm} = r(m+a) + (1-r)(m+d) = m + ra + (1-r)d$$
Difference in phenotype between the two types
$$\mu_{MM} - \mu_{Mm} = (1-2r)(a-d)$$
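The confounding of r with the QTL effect in single marker analysis can be seen numerically. The sketch below evaluates the two marker-class means of the P1BC1 population for hypothetical values of m, a and d.

```python
def marker_class_means(m, a, d, r):
    """Expected phenotypic means of marker types MM and Mm in P1BC1."""
    mean_MM = (1 - r) * (m + a) + r * (m + d)
    mean_Mm = r * (m + a) + (1 - r) * (m + d)
    return mean_MM, mean_Mm


for r in (0.0, 0.1, 0.3, 0.5):
    mean_MM, mean_Mm = marker_class_means(m=10.0, a=2.0, d=0.5, r=r)
    print(f"r={r:.1f}  MM={mean_MM:.3f}  Mm={mean_Mm:.3f}  "
          f"difference={mean_MM - mean_Mm:.3f}")
```

A tightly linked QTL with a small effect and a loosely linked QTL with a large effect can give the same difference, so the marker contrast alone cannot separate r from (a - d).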
63. Linear model (j = 1, 2, ..., n)
$$y_j = b_0 + b^{*} x_j^{*} + e_j$$
b* represents the QTL effect; x*_j is the indicator variable (0 or 1) for the QTL genotype
Likelihood profile
Support interval: One-LOD interval
64. P1: Mi Q Mi+1 / Mi Q Mi+1        P2: mi q mi+1 / mi q mi+1
F1: Mi Q Mi+1 / mi q mi+1    x    P1: Mi Q Mi+1 / Mi Q Mi+1
Backcross progeny (chromosome from P1 / chromosome from F1), four marker types:
Mi Q Mi+1 / Mi Q Mi+1
Mi Q Mi+1 / Mi Q mi+1    or    Mi Q Mi+1 / Mi q mi+1
Mi Q Mi+1 / mi q Mi+1    or    Mi Q Mi+1 / mi Q Mi+1
Mi Q Mi+1 / mi q mi+1
66. Assumption: No more than one QTL
per chromosome or linkage group
Large confidence interval
Biased effect estimation
Composite interval mapping (CIM)
(Zeng 1994)
67. In the algorithm of CIM, both QTL effect at the
current testing position and regression coefficients
of the marker variables used to control genetic
background were estimated simultaneously in an
expectation and maximization (EM) algorithm.
Thus, this algorithm could not completely ensure
that the effect of QTL at current testing interval
was not absorbed by the background marker
variables and therefore may result in biased
estimation of the QTL effect.
68. Theoretical basis of ICIM
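$$G = \mu + \sum_{j=1}^{m} a_j g_j + \sum_{j<k} aa_{jk}\,g_j g_k$$
$$E(g_j \mid X) = \lambda_j x_j + \rho_j x_{j+1}$$
$$E(g_j g_k \mid X) = \lambda_j\lambda_k\,x_j x_k + \lambda_j\rho_k\,x_j x_{k+1} + \rho_j\lambda_k\,x_{j+1}x_k + \rho_j\rho_k\,x_{j+1}x_{k+1}$$
$$y_i = b_0 + \sum_{j=1}^{m+1} b_j x_{ij} + \sum_{j<k} b_{jk}\,x_{ij}x_{ik} + e_i$$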
69. One-dimensional scanning (interval mapping)
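$$\Delta y_i = y_i - \sum_{j \ne k,\,k+1} \hat b_j x_{ij}$$
Two-dimensional scanning (interval mapping)
$$\Delta y_i = y_i - \sum_{r \ne j,\,j+1,\,k,\,k+1} \hat b_r x_{ir} - \sum_{\substack{r \ne j,\,j+1 \\ s \ne k,\,k+1}} \hat b_{rs}\,x_{ir}x_{is}$$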
70. [Figure: three pairs of plots showing LOD score and estimated effect against scanning position along the genome (chromosomes 1 to 4)]
74. One-locus model in F2
One-locus model: G = μ + aw + dv
where μ is the mean of the two homozygous genotypes QQ and qq, a is the additive effect, and d is the dominance effect. w and v are the indicators for genotypes at the QTL, valued at 1 and 0 for QQ, 0 and 1 for Qq, and -1 and 0 for qq, respectively.
75. The expected genotypic value of an
individual with known marker types
$$E(G \mid x_1,x_2,y_1,y_2) = \mu + a\,E(w \mid x_1,x_2,y_1,y_2) + d\,E(v \mid x_1,x_2,y_1,y_2)$$
76. Probability of the three QTL
genotypes under given marker types
r1: recombination frequency between the left marker and the QTL; r2: between the QTL and the right marker
Left marker   Right marker   QQ (w=1, v=0; m+a)         Qq (w=0, v=1; m+d)                             qq (w=-1, v=0; m-a)
AA            BB             (1/4)(1-r1)^2 (1-r2)^2     (1/2) r1(1-r1) r2(1-r2)                        (1/4) r1^2 r2^2
AA            Bb             (1/2)(1-r1)^2 r2(1-r2)     (1/2) r1(1-r1)(1-r2)^2 + (1/2) r1(1-r1) r2^2   (1/2) r1^2 r2(1-r2)
AA            bb             (1/4)(1-r1)^2 r2^2         (1/2) r1(1-r1) r2(1-r2)                        (1/4) r1^2 (1-r2)^2
77. Estimation of marker class mean
Marker class   n    Frequency      x1   x2   y1   y2   E(w|x1,x2,y1,y2)   E(v|x1,x2,y1,y2)   Genetic mean of the class
AABB           n1   (1/4)(1-r)^2   1    1    0    0    f1                 g1                 f1 a + g1 d
AABb           n2   (1/2) r(1-r)   1    0    0    1    f2                 g2                 f2 a + g2 d
AAbb           n3   (1/4) r^2      1    -1   0    0    f3                 g3                 f3 a + g3 d
x1, x2, y1, y2 are the indicators for the marker class, and
$$f_1 = 1 - \frac{2r_1r_2}{1-r}, \qquad g_1 = \frac{2r_1(1-r_1)\,r_2(1-r_2)}{(1-r)^2}$$
$$f_2 = \frac{(1-2r_1)\,r_2(1-r_2)}{r-r^2}, \qquad g_2 = \frac{r_1(1-r_1)(1-2r_2+2r_2^2)}{r-r^2}$$
$$f_3 = \frac{r_2-r_1}{r}, \qquad g_3 = \frac{2r_1(1-r_1)\,r_2(1-r_2)}{r^2}$$
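The conditional expectations f and g can be computed for any marker class by enumerating the F1 gametes. The sketch below assumes no interference between the two sub-intervals, codes alleles from P1 (A, Q, B) as 1 and alleles from P2 (a, q, b) as 0, and uses hypothetical function names.

```python
from itertools import product


def gamete_freqs(r1, r2):
    """Frequencies of the eight F1 gametes (left marker, QTL, right marker)."""
    freqs = {}
    for left, qtl, right in product((1, 0), repeat=3):
        p_left = r1 if left != qtl else 1 - r1
        p_right = r2 if right != qtl else 1 - r2
        freqs[(left, qtl, right)] = 0.5 * p_left * p_right
    return freqs


def qtl_joint_freqs(n_A, n_B, r1, r2):
    """Joint frequencies of QQ, Qq, qq (keyed by number of Q alleles) with the
    marker class carrying n_A copies of A and n_B copies of B."""
    gametes = gamete_freqs(r1, r2)
    joint = {2: 0.0, 1: 0.0, 0: 0.0}
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        if g1[0] + g2[0] == n_A and g1[2] + g2[2] == n_B:
            joint[g1[1] + g2[1]] += p1 * p2
    return joint


def f_and_g(n_A, n_B, r1, r2):
    """E(w | markers) and E(v | markers) for one marker class."""
    joint = qtl_joint_freqs(n_A, n_B, r1, r2)
    total = sum(joint.values())
    return (joint[2] - joint[0]) / total, joint[1] / total


r1, r2 = 0.1, 0.2
for label, (n_A, n_B) in {"AABB": (2, 2), "AABb": (2, 1), "AAbb": (2, 0)}.items():
    f, g = f_and_g(n_A, n_B, r1, r2)
    print(f"{label}: f = {f:.4f}, g = {g:.4f}")
```

Enumerating gametes avoids typing each cell of the table by hand and extends directly to the six marker classes not shown on the slide.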
78. Relationship between marker
class mean and marker effect
(including marker interactions)
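$$\begin{pmatrix} f_1a+g_1d \\ f_2a+g_2d \\ f_3a+g_3d \\ f_4a+g_4d \\ g_5d \\ -f_4a+g_4d \\ -f_3a+g_3d \\ -f_2a+g_2d \\ -f_1a+g_1d \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & -1 & 0 & 0 & -1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & -1 & 1 & 0 & 0 & 0 & -1 & 0 \\ 1 & -1 & 1 & 0 & 0 & -1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\ 1 & -1 & -1 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} \mu^{(d)} \\ A_1^{(a)} \\ A_2^{(a)} \\ D_1^{(d)} \\ D_2^{(d)} \\ AA_{12}^{(d)} \\ AD_{12} \\ DA_{12} \\ DD_{12}^{(d)} \end{pmatrix}$$
Rows are the nine marker classes AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb, with class means written as deviations from μ. Columns of the design matrix are 1, x1, x2, y1, y2, x1x2, x1y2, y1x2, y1y2. The superscript (a) or (d) marks whether a marker effect arises from the additive or the dominance effect of the QTL.
79. Relationship between marker
effects and QTL effects
$$\mu^{(d)} = \tfrac{1}{2}(g_1+g_3)\,d$$
$$A_1^{(a)} = f_2\,a$$
$$A_2^{(a)} = \tfrac{1}{2}(f_1-f_3)\,a$$
$$D_1^{(d)} = \left(-\tfrac{1}{2}g_1 - \tfrac{1}{2}g_3 + g_4\right)d$$
$$D_2^{(d)} = \left(-\tfrac{1}{2}g_1 + g_2 - \tfrac{1}{2}g_3\right)d$$
$$AA_{12}^{(d)} = \tfrac{1}{2}(g_1-g_3)\,d$$
$$AD_{12} = 0$$
$$DA_{12} = 0$$
$$DD_{12}^{(d)} = \left(\tfrac{1}{2}g_1 - g_2 + \tfrac{1}{2}g_3 - g_4 + g_5\right)d$$
80. The linear model of genotypic
values on markers in F2
$$E(w \mid x_1,x_2,y_1,y_2) = \lambda_1 x_1 + \lambda_2 x_2$$
$$E(v \mid x_1,x_2,y_1,y_2) = \rho_0 + \rho_1 y_1 + \rho_2 y_2 + \lambda_{12}\,x_1x_2 + \rho_{12}\,y_1y_2$$
81. The linear model of genotypic
values on markers in F2
$$E(G \mid x_1,x_2,y_1,y_2) = \mu + \mu^{(d)} + A_1^{(a)}x_1 + D_1^{(d)}y_1 + A_2^{(a)}x_2 + D_2^{(d)}y_2 + AA_{12}^{(d)}x_1x_2 + DD_{12}^{(d)}y_1y_2$$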
82. Properties of the linear model in F2
The additive and dominance effects of the
flanked QTL are completely absorbed by the
six variables in the model above.
Interactions between marker variables may be
declared as interaction between QTL by
mistake when using ANOVA.
But from our analysis, interactions between
marker variables can be caused simply by
dominance effects of QTL .
83. Multiple QTL model in F2
For multiple QTL, assume there are m
QTL located on m intervals defined by
m+1 markers on one chromosome, then
the genotypic value of an F2 individual is
defined as:
$$G = \mu + \sum_{j=1}^{m}\left[a_j w_j + d_j v_j\right]$$
84. The linear model in F2 under
multiple QTL
The genotypic value of an F2
individual with known marker types
can be re-organized as:
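$$E(G) = \beta_0 + \sum_{j=1}^{m+1}\alpha_j x_j + \sum_{j=1}^{m+1}\beta_j y_j + \sum_{j=1}^{m}\alpha_{j,j+1}\,x_jx_{j+1} + \sum_{j=1}^{m}\beta_{j,j+1}\,y_jy_{j+1}$$
where x_j and y_j are the additive and dominance indicators of marker j.
85. The linear model for QTL
mapping in F2
$$P = E(G) + \varepsilon = \beta_0 + \sum_{j=1}^{m+1}\alpha_j x_j + \sum_{j=1}^{m+1}\beta_j y_j + \sum_{j=1}^{m}\alpha_{j,j+1}\,x_jx_{j+1} + \sum_{j=1}^{m}\beta_{j,j+1}\,y_jy_{j+1} + \varepsilon$$
where P is the phenotypic value and ε is the residual.
87. ICIM (Inclusive Composite
Interval Mapping) in F2
$$\Delta P_i = P_i - \sum_{j \ne k,\,k+1}\left[\hat\alpha_j x_{ij} + \hat\beta_j y_{ij}\right] - \sum_{j \ne k}\left[\hat\alpha_{j,j+1}\,x_{ij}x_{i,j+1} + \hat\beta_{j,j+1}\,y_{ij}y_{i,j+1}\right]$$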
88. Hypothesis test of QTL
mapping in F2
The two hypotheses used to test the existence
of QTL at the scanning position are:
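$$H_0: \mu_1 = \mu_2 = \mu_3 \quad \text{vs.} \quad H_A: \text{at least two of } \mu_1, \mu_2 \text{ and } \mu_3 \text{ are not equal}$$
The logarithm likelihood under HA is
$$L_A = \sum_{j=1}^{9}\sum_{i \in S_j}\log\left[\sum_{k=1}^{3}\pi_{jk}\,f(\Delta P_i;\,\mu_k,\sigma^2)\right]$$
where S_j denotes individuals belonging to the j-th marker class (j = 1, ..., 9), ΔP_i is the adjusted phenotype of individual i, π_jk (k = 1, 2, 3) is the proportion of the k-th QTL genotype in the j-th class, and f(·; μ_k, σ²) is the density function of the normal distribution N(μ_k, σ²).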
89. EM algorithm of QTL mapping
in F2
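Use the EM algorithm to estimate μ1, μ2 and μ3.
The genetic effects in G = μ + aw + dv are then estimated by
$$\hat\mu = \tfrac{1}{2}(\hat\mu_1+\hat\mu_3), \qquad \hat a = \tfrac{1}{2}(\hat\mu_1-\hat\mu_3), \qquad \hat d = \hat\mu_2 - \tfrac{1}{2}(\hat\mu_1+\hat\mu_3)$$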
90. EM algorithm of QTL mapping in F2
Parameters under H0 were calculated as:
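$$\hat\mu_0 = \frac{1}{n}\sum_{i=1}^{n}\Delta P_i, \qquad \hat\sigma_0^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\Delta P_i - \hat\mu_0\right)^2$$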
From which the maximum likelihood
under H0, and the LOD score between HA
and H0 can be calculated.
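The EM step and the LOD score at one scanning position can be sketched as a three-component normal mixture with known mixing proportions. This is a simplified illustration, not the QTL IciMapping implementation: the function names, the starting values and the fixed number of iterations are assumptions, and the mixing proportions (the conditional QTL genotype probabilities of each individual's marker class) are taken as input.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_qtl_f2(y, priors, n_iter=200):
    """y: adjusted phenotypes. priors[i]: proportions of QQ, Qq, qq in the
    marker class of individual i. Returns (mu, var, a, d, lod)."""
    n = len(y)
    mu0 = sum(y) / n
    var0 = sum((v - mu0) ** 2 for v in y) / n
    sd0 = math.sqrt(var0)
    mu, var = [mu0 + sd0, mu0, mu0 - sd0], var0   # illustrative start values
    for _ in range(n_iter):
        # E-step: posterior probability of each QTL genotype per individual
        post = []
        for yi, pri in zip(y, priors):
            w = [p * normal_pdf(yi, m, var) for p, m in zip(pri, mu)]
            total = sum(w)
            post.append([x / total for x in w])
        # M-step: weighted means and a common variance
        for k in range(3):
            wk = sum(p[k] for p in post)
            if wk > 0:
                mu[k] = sum(p[k] * yi for p, yi in zip(post, y)) / wk
        var = sum(p[k] * (yi - mu[k]) ** 2
                  for p, yi in zip(post, y) for k in range(3)) / n
    logl_a = sum(math.log(sum(p * normal_pdf(yi, m, var)
                              for p, m in zip(pri, mu)))
                 for yi, pri in zip(y, priors))
    logl_0 = sum(math.log(normal_pdf(yi, mu0, var0)) for yi in y)
    lod = (logl_a - logl_0) / math.log(10)
    a = 0.5 * (mu[0] - mu[2])
    d = mu[1] - 0.5 * (mu[0] + mu[2])
    return mu, var, a, d, lod


# Toy data in which every individual's QTL genotype is known with certainty
y = [12.1, 11.9, 12.0, 10.4, 10.6, 10.5, 8.1, 7.9, 8.0]
priors = [(1, 0, 0)] * 3 + [(0, 1, 0)] * 3 + [(0, 0, 1)] * 3
mu, var, a, d, lod = em_qtl_f2(y, priors)
print([round(m, 3) for m in mu], round(a, 3), round(d, 3))
```

In a real scan the proportions would be recomputed at every scanning position from the recombination frequencies to the flanking markers, and the LOD profile would be the sequence of `lod` values along the genome.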
93. QTL distribution models in
simulation
F2 populations were simulated by
the genetics and breeding
simulation tool of QuLine.
QTL mapping using ICIM was
implemented by the software QTL
IciMapping.
94. Theoretical marker effects in the
genetic model used in simulation
The expected additive, dominance,
additive by additive, and dominance by
dominance effects of the two flanking
markers associated with each QTL are
shown in the following table.
The table indicates that the dominance of a QTL
can complicate the coefficients of the
two markers flanking the QTL, and cause
interactions between markers.
101. 180 individuals
The cross was made in Chengdu, China,
in July 2002 between the indica rice
variety and Nipponbare.
137 SSR markers.
The linkage map covered 2046.2 cM in total, and
the average marker distance was 17.1 cM.
A number of agronomic traits were
investigated in the field.