The role of a biometrician in an International Agricultural Center: service and research in an interdisciplinary frame
1. The role of a biometrician in an
International Agricultural Center:
service and research in an
interdisciplinary frame
Some ideas and examples
JFD
2. Interdisciplinary collaboration
• No man (or woman) is an island
• Few problems are simple problems
• Problems are better solved using teams of specialists
working together
• The biometrician needs the scientists’ data to propose and
test methodologies, and the scientists need analytical
tools to address their problems
3. It should be possible for the biometrician to
participate in all or some of:
• Planning of surveys and experiments
• Data processing, analysis and interpretation
• Design of novel tools/methodologies for analysis
• Writing and/or editing results
• Seminars and courses on new methodologies
and software (the free software issue)
• Capacity building (internal and external)
4. Trying to
• Allow researchers to work in depth within
their disciplines and problems, not on
methods for data analysis, computer
routines, software, etc.
• Supply new points of view and new
methodological tools for the research work
• Improve the quality of inferences
5. Four levels of participation
1. Routine analysis and known problems: short
and fast responses
2. New problems, known methodologies: we
need to think together
3. New problems, existing methodology: we need
to understand, study and propose solutions
4. New problems, unknown methodology: we
need to do some methodological research
7. Experimental Design
(continuous traits, Mixed Model)
• Spatial analysis of an Experimental Design
• Variety trial designed as a row-column design
with 64 varieties planted in two contiguous
replicates laid out in 8 rows and 16 columns
Nguyen and Williams, 1993
(Austral. J. Stat. 35: 363-370)
12. Spatial statistical model
$$y_{ijkl} = \mu + \tau_i + \varepsilon_{ijkl}, \qquad \varepsilon \sim N(0, R)$$
Where:
$\tau_i$ = effect of entry $i$
i = 1,2,…, t (entries)
j = 1,2,…, J (replications)
k = 1,2,…, K (rows)
l = 1,2,…, L (columns)
13. Results
Precision of estimates and tests of hypotheses
(rcbd: randomized complete block design; row-col: row-column design; spat-sph: spatial model with spherical correlation)

Average standard error of ls-means
             rcbd     row-col   spat-sph
Average      0.5076   0.4041    0.4526

Standard errors of some ls-means differences
Label             rcbd     row-col   spat-sph
5 vs 3       0    0.7179   0.5256    0.4737
5 vs 2       1    0.7179   0.4833    0.4286
5 vs 14      2    0.7179   0.5232    0.4906
5 vs 44      2    0.7179   0.5159    0.4938
14. Conclusion
When possible, use layouts that allow
a spatial analysis, with the aim of
obtaining lower standard errors
for the contrasts
15. Spatial distribution of two insect pests
(apple and peach trees) in a region
• Grafolita (Cydia molesta)
• Carpocapsa pomonella
• Pheromone traps
16. Objectives
• To obtain regionalized maps of the spatial
distribution of an insect pest across time and
over the whole crop growth period
• To define the “convenient” distance between
(pheromone) traps
• Software: SAS, S+, GS+, GIS software, R?
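19. Spherical:
$$f(d_{ij}) = \left[ 1 - 1.5\,\frac{d_{ij}}{\rho} + 0.5 \left( \frac{d_{ij}}{\rho} \right)^{3} \right] I(d_{ij} < \rho)$$
Exponential:
$$f(d_{ij}) = e^{-d_{ij}/\rho}$$
Gaussian:
$$f(d_{ij}) = e^{-d_{ij}^{2}/\rho^{2}}$$
$\rho$ (range): distance from which observations are independent
$f(d_{ii}) = 1$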
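A minimal sketch of the three spatial correlation functions, assuming the usual parameterization in which d is the distance between two traps and rho is the range (spherical: 1 - 1.5 d/rho + 0.5 (d/rho)^3 for d < rho and 0 otherwise; exponential: exp(-d/rho); Gaussian: exp(-d^2/rho^2)). The function names and the coordinate layout are illustrative assumptions, not part of the original analysis.

```python
import math

def spherical(d, rho):
    """Spherical correlation: exactly zero at or beyond the range rho."""
    if d >= rho:
        return 0.0
    r = d / rho
    return 1.0 - 1.5 * r + 0.5 * r ** 3

def exponential(d, rho):
    """Exponential correlation: decays as exp(-d / rho)."""
    return math.exp(-d / rho)

def gaussian(d, rho):
    """Gaussian correlation: decays as exp(-(d / rho)^2)."""
    return math.exp(-(d / rho) ** 2)

def correlation_matrix(coords, rho, f):
    """Build F = {f(d_ij)} from a list of (x, y) coordinates."""
    return [
        [f(math.hypot(a[0] - b[0], a[1] - b[1]), rho) for b in coords]
        for a in coords
    ]

# Hypothetical trap coordinates (metres) and a hypothetical range of 100 m
traps = [(0.0, 0.0), (50.0, 0.0), (0.0, 200.0)]
F = correlation_matrix(traps, 100.0, spherical)
```

With the spherical function, traps further apart than the range get a correlation of exactly zero, which is what makes the range usable as a "convenient" trap spacing.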
20. A linear mixed model is used to
select the best-fitting model and to
estimate the range
$$y_{ij} = \mu + \varepsilon_{ij}, \qquad V(\varepsilon) = R = \sigma^{2} F, \qquad F = \{ f(d_{ij}) \}$$
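21. • Null model: the independence model
$R_0 = \sigma^{2} I$, in which the correlation between observations $i$ and $j$ is 0 for all $i \neq j$
• Alternative model: the spatially related model
$R = \sigma^{2} F$, where $F = \{ f(d_{ij}) \}$ is the matrix of spatial correlations
• The two models are compared using a Likelihood Ratio Test (LRT)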
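The comparison of the independence model with the spatial model can be sketched as follows. The statistic is twice the difference in log-likelihood, referred here to a chi-square distribution with 1 degree of freedom, on the assumption that the spatial model adds a single parameter (the range). Because the null value lies on the boundary of the parameter space, the resulting p-value should be read as approximate. The log-likelihood values are hypothetical, and the function name is illustrative.

```python
import math

def lrt_one_df(loglik_null, loglik_alt):
    """Likelihood ratio test with 1 degree of freedom.

    Returns the statistic 2 * (logL_alt - logL_null) and its chi-square(1)
    upper-tail probability, computed as erfc(sqrt(stat / 2)).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods from the two fitted models
stat, p = lrt_one_df(-250.0, -245.0)
```

A small p-value favours the spatially related model, whose estimated range can then be used for mapping.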
23. Prediction (interpolation)
Prediction of non observed values using the
observed values plus the knowledge of a good
spatial model: KRIGING procedures
Krige, Danie G. (1951). "A statistical approach to some basic mine
valuation problems on the Witwatersrand". J. of the Chem., Metal.
and Mining Soc. of South Africa 52 (6): 119–139.
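A minimal sketch of the prediction step, assuming ordinary kriging with an exponential correlation function, a known range and no nugget effect. The weights solve the system [F 1; 1' 0][w; lambda] = [f0; 1], where F holds the correlations between observed points and f0 the correlations between the observed points and the target. All names and data are hypothetical, and the small linear solver is included only to keep the example self-contained.

```python
import math

def exp_corr(d, rho):
    """Exponential spatial correlation with range parameter rho."""
    return math.exp(-d / rho)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(coords, values, target, rho):
    """Predict the value at target as a weighted sum of observed values."""
    n = len(coords)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Correlations among observed points, bordered by the sum-to-one constraint
    a = [[exp_corr(dist(coords[i], coords[j]), rho) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Correlations between observed points and the target location
    b = [exp_corr(dist(p, target), rho) for p in coords] + [1.0]
    weights = solve(a, b)[:n]  # drop the Lagrange multiplier
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical trap locations (metres) and catches
coords = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
catches = [10.0, 20.0, 30.0]
prediction = ordinary_kriging(coords, catches, (40.0, 40.0), 80.0)
```

Without a nugget the predictor interpolates exactly: predicting at an observed trap returns the observed catch.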
24. Carpocapsa
Mean   Standard deviation   Range    Number of traps
60     49                   6-259    111
25. Forecasting
(a more complex model)
• Forest inventory and growth models
– Nonlinear (but linearized) models
– Structural equation models
26. Forest growth model
• Eucalyptus (Bicostata)
• In a region
• Using the inventory data sets (yearly data)
• n = 2461
- from 506 plots (300 m², circular)
- Measured each year (from 3 to 15)
• “difference” models
30. Characteristics
• The result of one equation is an independent variable in another
• The “residuals” ($\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_5$) are correlated and, possibly, do not have
homogeneous variances
• A simultaneous estimation process is necessary
• 22 regression coefficients
• Possible methods: OLS, SUR, 2SLS, 3SLS
• Software: systemfit in R (free software)
31. Model evaluation
Internal: model fit on a randomly selected sub-dataset
External: fit of the estimated model on another “independent” sub-dataset
Year by year; long term
36. Competitiveness drivers
Driver Nr. variables
1. Knowledge 11
2. Innovation platform 12
3. Connectivity 12
4. Infrastructure 8
5. Macroeconomic variables 9
6. Social cohesion 13
Total 65
37. Objectives and Methods
• Generating a Competitiveness Index (CI)
– STAGE 1: PCA per driver
– STAGE 2: PCA using the first PC from STAGE 1
(CI = scores on the first PC from STAGE 2)
• Improvement comparison: advances by driver
– Relative improvement = (2000 vs. 1990)
– Weighted average per driver
(Weight = contribution of each driver to the CI)
38.
Driver             Nr. vars.   Contribution to CI   Explained variability (1)
Knowledge          11          20.1                 74.5
I Platform         12          18.4                 66.1
Connectivity       12          21.6                 82.5
Infrastructure     8           13.4                 78.0
Macro Econ -vars   9           5.6                  47.8
S-cohesion         13          20.9                 62.4
Sum                65          100
(1) First PC within driver
42. Association mapping
• The mixed model in association mapping
• Example:
– 46 wheat genotypes
– 374 markers (DArT)
– 5 traits: w1000, leaf rust, stem rust, maturity,
yield
– environments per trait: 15, 17, 5, 15, 10
43. Components of the mixed model
• Markers data (DArT): binary values {0, 1}
• Phenotypic data: traits (Y1,…,Ye)
• Underlying structure: groups (co-variables); Q matrix, from STRUCTURE using non-linked markers
• Relations between genotypes: coancestry (parentage); K matrix, K = 2pij or similarity from markers
45. Proposition
• The marker is associated with the phenotype if the
trait average for genotypes carrying a “0” is “very”
different from the trait average for genotypes carrying
a “1”
• “very” = statistically different in a test of hypothesis
• 3 possible models: one-way ANOVA, Q model, Q+K
model
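46. Q+K model (one site)
$$y_{ij} = \underbrace{\mu + \alpha_i + \sum_{l=1}^{g-1} \beta_l \, \hat{q}_{ijl}}_{\text{fixed}} + \underbrace{\gamma_{j(i)} + \varepsilon_{ij}}_{\text{random}}$$
In matrix form: $y = X\beta + Zu + \varepsilon$
i = 1,2 (two states of the marker: present or absent in the jth genotype)
$\hat{q}_{ijl}$ = membership probability of the jth genotype to the lth group
$\gamma_{j(i)}$ = genotype nested in the marker state (i=1 or i=0)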
47. Variance-covariance matrices
$$V(\varepsilon) = \sigma^{2} I$$
$$V(u) = A \, \sigma_a^{2}$$
• A is known: coefficient of parentage or some
similarity using non-linked markers
• The variances $\sigma^{2}$ and $\sigma_a^{2}$ must be estimated
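49. Bonferroni bound
(when tests are not independent)
• Example: m = 46 tests, α = 0.05,
$$P[\text{reject at least one } H_0 \mid H_0 \text{ true}] = 1 - (1 - \alpha)^{m} = 1 - (0.95)^{46} = 0.906$$
• If you want a type I error of 0.05 “on all tests”,
reject Ho when the p-value is below
$$\hat{\alpha} = 1 - (1 - 0.05)^{1/46} = 0.001114$$
• These two expressions hold exactly for independent tests (Šidák form); the Bonferroni bound proper, α/m = 0.05/46 ≈ 0.00109, is slightly stricter and also holds when tests are dependent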
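The arithmetic on this slide can be checked with a few lines. The sketch below computes the family-wise error rate for m tests, the per-test threshold 1 - (1 - α)^(1/m) shown on the slide (usually called the Šidák correction), and the Bonferroni threshold α/m for comparison. Function names are illustrative.

```python
def familywise_error(alpha, m):
    """P[at least one false rejection] for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def sidak_threshold(alpha, m):
    """Per-test level giving a family-wise error of alpha for independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def bonferroni_threshold(alpha, m):
    """Per-test level alpha / m; valid whether or not the tests are independent."""
    return alpha / m

fwer = familywise_error(0.05, 46)
sidak = sidak_threshold(0.05, 46)
bonf = bonferroni_threshold(0.05, 46)
```

In association mapping, marker tests are typically correlated, so the more conservative Bonferroni threshold is the safer of the two.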
51. Methodological Research
1. The Ward-MLM three-way strategy for
classifying genetic resources in multiple
environments
2. Sampling strategies for conserving
diversity when forming core subsets
using genetic markers
52. Interdisciplinary group
• Jose Crossa CIMMYT-Biometrics
• Suketoshi Taba CIMMYT-Maize Genebank
• Marilyn Warburton CIMMYT-USDA-Molecular
genetics
• Sarah Hearne IITA- Molecular genetics
• Steve A. Eberhart USDA-Genebank
• Jose Villaseñor C. P.-mathematics
• Jorge Franco UDELAR-CIMMYT-Biometrics
53. The Ward-MLM three-way strategy
for classifying genetic resources in
multiple environments: evaluate GxE
Franco et al. (2003) Crop Sci. 43: 1249-1258
54. Properties of the Ward-MLM strategy
1. It assigns the observations to an optimal number of
clusters based on membership probability (it is a
statistical method)
2. It uses discrete and continuous variables simultaneously
3. It follows the optimization of two objective functions
(minimum variance within group, and maximum
log-likelihood)
4. It allows the estimation of the quality of the resulting
clustering (average of the assignment probabilities)
5. It can be used on 3-way data sets (genotype ×
environment × trait).
55. Example
256 Caribbean maize genotypes
Discrete variable :
• Agronomic Scale (1=poor, 2=regular, 3=good)
Continuous variables:
• Days to anthesis (DA)
• Plant height (PH, cm)
• Days to senescence of the ear leaf (DS)
• Ear length (EL, cm)
Three environments (Mexico)
63. Sampling strategies for conserving diversity
when forming core subsets using genetic
markers
• DEF: A core collection (or core subset) is a
sample from a large germplasm collection that
contains, with a minimum of repetitiveness, the
maximum possible genetic diversity of the
species in question (Frankel and Brown, 1984)
• Forming core subsets requires sampling
64. STEP 1. Numerical classification
• The most widely used methods are Ward (minimum
variance within cluster) and UPGMA (average of
distances)
• They require an initial matrix of distances between
genotypes
• With SSR we can use genetic distances (Modified
Rogers, Cavalli-Sforza & Edwards). Both are Euclidean
metrics
65. STEP 2. Drawing accessions from clusters
(how many accessions from each cluster?)
Allocation methods determine the
number of observations to be randomly drawn from each
stratum (cluster)
Optimal (Neyman): proportional to the size and variability
of the cluster
P: proportional to the cluster size
L: proportional to the log of the cluster size
66. D-method: proportional to any measure of the
cluster diversity
$$n_i = n \, \frac{d_i}{\sum_{t} d_t}, \qquad t = 1, 2, \ldots, \text{number of clusters}$$
$n_i$: number of accessions to be drawn from the $i$th cluster
$d_i$: average of distances between accessions within
the $i$th cluster
$n$: size of the core subset
Franco et al. (2006) Crop Sci. 46: 854-864
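The allocation rules can be sketched with one shared helper that splits a core subset of size n across clusters in proportion to a weight: cluster size (P), log of cluster size (L), size times within-cluster standard deviation (Neyman), or mean within-cluster distance (D). The function names and the example figures are hypothetical. The allocations are returned as real numbers, and rounding them to integers is left open because the slides do not specify a rounding rule.

```python
import math

def allocate(weights, n):
    """Split a core subset of size n across clusters in proportion to weights."""
    total = sum(weights)
    return [n * w / total for w in weights]

def p_allocation(sizes, n):
    """P: proportional to cluster size."""
    return allocate(sizes, n)

def l_allocation(sizes, n):
    """L: proportional to the log of cluster size (a cluster of size 1 gets 0)."""
    return allocate([math.log(s) for s in sizes], n)

def neyman_allocation(sizes, std_devs, n):
    """Neyman: proportional to cluster size times within-cluster variability."""
    return allocate([s * sd for s, sd in zip(sizes, std_devs)], n)

def d_allocation(mean_distances, n):
    """D: proportional to the mean distance between accessions in each cluster."""
    return allocate(mean_distances, n)

# Hypothetical clusters: sizes and mean within-cluster distances
sizes = [100, 50, 10]
mean_distances = [0.2, 0.3, 0.5]
by_size = p_allocation(sizes, 20)
by_diversity = d_allocation(mean_distances, 20)
```

In this made-up example the smallest cluster is the most diverse, so the D method draws the most accessions from it while the P method draws the fewest.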
67. Diversity
Diversity indexes (allele richness):
• Expected heterozygosity (He)
• Number of effective alleles (Ne)
• Shannon Index
Distances among individuals or groups:
• Non-informative markers, p ∈ {0,1}: Simple Matching, Jaccard, Nei and Li (Dice)
• Informative markers, p ∈ [0,1]: Modified Rogers, Cavalli-Sforza and Edwards, Euclidean
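A minimal sketch of the three diversity indexes and one of the distances, using their standard textbook definitions for a single locus with allele frequencies p_k: He = 1 - Σ p_k², Ne = 1 / Σ p_k², Shannon = -Σ p_k ln p_k, and the Modified Rogers distance sqrt(Σ (p - q)² / (2m)) over m loci. The slides do not show the formulas, so treating these exact forms as the ones used in the study is an assumption. Function names and the example frequencies are hypothetical.

```python
import math

def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies at one locus."""
    return 1.0 - sum(p * p for p in freqs)

def effective_alleles(freqs):
    """Ne = 1 / sum of squared allele frequencies at one locus."""
    return 1.0 / sum(p * p for p in freqs)

def shannon_index(freqs):
    """Shannon index = -sum p ln p, skipping alleles with zero frequency."""
    return -sum(p * math.log(p) for p in freqs if p > 0)

def modified_rogers(p, q):
    """Modified Rogers distance between two accessions.

    p and q are lists with one entry per locus; each entry is the list of
    allele frequencies at that locus, in the same allele order for p and q.
    """
    m = len(p)
    ss = sum((a - b) ** 2 for lp, lq in zip(p, q) for a, b in zip(lp, lq))
    return math.sqrt(ss / (2.0 * m))

# Hypothetical locus with two equally frequent alleles
he = expected_heterozygosity([0.5, 0.5])
ne = effective_alleles([0.5, 0.5])
sh = shannon_index([0.5, 0.5])
```

Two accessions fixed for different alleles at every locus are at Modified Rogers distance 1, the maximum, and identical accessions are at distance 0.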
68. Example: three maize data sets
(SSR markers)
Obs. Alleles Markers Values Missing
Bulks 275 186 24 [0, 1] 1.5 %
Landraces 521 209 26 {0,.5,1} no
Populations 25 209 26 [0, 1] no
73. Conclusions
• When constructing core subsets of individuals
(landraces/accessions), the D allocation method used
with a stratified sampling strategy was better than the M
strategy
• For bulks and populations, the M strategy was better for
diversity indexes
• Some stratified sampling strategies (21-24) were always
better, showing the highest average distances (MR and
CE) between accessions
• All 25 strategies selected non-informative alleles, but the
M strategy selected fewer than the others
74. Uses of D allocation method
• The D method was used to define reference
collections for inbred lines and populations
of maize from Mexico, with Marilyn
Warburton from CIMMYT
• The D method was used in a collaboration with
Sarah Hearne of IITA to define the
reference germplasm collection of cowpea
accessions
75. Current research
A method for classifying genotypes
using phenotypic and genotypic
information simultaneously
Phenotypic: continuous and categorical
Genotypic: SSR, DArT, SNP
We have a draft