This document discusses strategies to improve the utilization of germplasm collections in seedbanks to increase genetic diversity in food crops. Scientists often need to screen smaller subsets of accessions for particular traits due to the large size of collections. The document proposes exploring climate data as a prediction model for pre-screening crop traits before full field trials in order to identify landraces with a higher probability of possessing interesting traits, which could reduce costs compared to large-scale field screening. It describes linking genebank accession and trait observation data to climate data from locations where landraces originated to build models predicting traits from climate variables.
Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)
2. Overall goal:
– User-friendly access to relevant information on plant genetic resources.
– Increased utilization of germplasm for genetic diversity in food crops.
Strategies to improve the utilization of germplasm in seedbank collections to increase the genetic diversity of food crops for enhanced food security.
3. • Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.
• How does the scientist select a small subset likely to have the useful trait?
• More than 560 000 wheat accessions are held in genebanks worldwide.
Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
4. “I am screening for variations in powdery mildew resistance
genes can you send me 1200 landrace accessions of bread
wheat”…
“I am screening for drought – could you send me some
landraces from Afghanistan and some other dry countries”…
“I am screening for rust can you send me 9000 bread wheat
samples”…
“I am looking for new salt tolerance genes can you send me
some wild relatives from salty areas”…
“I want about 500 bread durum acc to screen for RWA”…
“I am screening for Sunn Pest and can handle about 200 acc –
can you send me a selection of Triticum species”…
Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
5. • The scientist or the breeder needs a smaller subset to cope with the field screening experiments.
• A common approach is to create a so-called core collection.
Sir Otto H. Frankel (1900-1998) proposed (1984) that a limited or "core collection" could be established from an existing collection. With minimum similarity between its entries, the core collection is of limited size and chosen to represent the genetic diversity of a large collection, a crop, a wild species or group of species.
6. • Given that the trait property you are looking for is relatively rare:
• Perhaps as rare as a unique allele for one single landrace cultivar...
• Getting what you want is largely a question of LUCK!
Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
8. Objective of this study:
– Explore climate data as a prediction model for “pre-screening” of crop traits BEFORE full-scale field trials.
– Identification of landraces with a higher probability of holding an interesting trait property.
9. • Primitive crops and traditional landraces are a source of exotic traits and crop properties.
• Landraces are an interesting source of novel traits for the improvement of modern crops.
• Landraces are often not described for the economically valuable trait in question.
• Identification of crop traits is often the result of a larger field trial screening project (thousands of individual plants).
• Large-scale field trials are very costly (land area and human working hours).
10. The underlying assumption is that the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait.
The aim is to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).
11. Wild relatives are shaped by climate.
Primitive cultivated crops are shaped by climate and humans.
Traditional cultivated crops (landraces) are shaped by climate and humans.
Modern cultivated crops (cultivars) are mostly shaped by humans (plant breeders).
Perhaps future crops are shaped in the molecular laboratory…?
12. 1) Landrace samples (genebank seed accessions)
2) Trait observations (experimental design)
3) Climate data (for the landrace origin locations)
• The accession identifier (accession number) provides the bridge to the crop trait observations.
• The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
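The two bridges described on this slide can be sketched as a small join. This is only an illustration under stated assumptions: the accession numbers and coordinates echo the passport table shown later in the deck, while the trait values, the climate values, the field names and the helper names (`site_key`, `link_records`) are invented for the example.

```python
# Hypothetical records: trait and climate values are invented for illustration.
accessions = [
    {"acc_num": "NGB27", "lat": 61.0333, "lon": 27.3333},
    {"acc_num": "NGB456", "lat": 66.1167, "lon": 12.5},
]
trait_obs = [
    {"acc_num": "NGB27", "trait": "heading_days", "value": 52.0},
    {"acc_num": "NGB456", "trait": "heading_days", "value": 47.0},
]
# Climate records keyed by rounded (latitude, longitude) of the collecting site.
climate_by_site = {
    (61.03, 27.33): {"prec_annual": 600.0},
    (66.12, 12.5): {"prec_annual": 1100.0},
}


def site_key(lat, lon):
    """Round coordinates so that accession sites match climate keys."""
    return (round(lat, 2), round(lon, 2))


def link_records(accessions, trait_obs, climate_by_site):
    """Join trait observations to climate data via accession number and coordinates."""
    by_acc = {a["acc_num"]: a for a in accessions}
    rows = []
    for obs in trait_obs:
        acc = by_acc.get(obs["acc_num"])  # bridge 1: accession number
        if acc is None:
            continue
        climate = climate_by_site.get(site_key(acc["lat"], acc["lon"]))  # bridge 2: coordinates
        if climate is None:
            continue
        rows.append({**obs, **climate})
    return rows


linked = link_records(accessions, trait_obs, climate_by_site)
```

Each linked row carries both a trait score and the climate variables of the site of origin, which is the shape a trait-from-climate model needs.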
13. More than 6 million genebank accessions in more than 1 400 genebanks worldwide.
14. Photo captions: Faba bean, Finland; Field trials, Gatersleben, Germany; Cauliflower (S. Jeppson); Forage crops, Dotnuva, Lithuania; Radish (S. Jeppson); Linnés äpple.
Diseases: Powdery mildew (Blumeria graminis); Leaf spots (Ascochyta sp.); Yellow rust (Puccinia striiformis); Black stem rust (Puccinia graminis).
http://barley.ipk-gatersleben.de
15. The climate data is extracted from the WorldClim dataset. http://www.worldclim.org/
Data from weather stations worldwide are combined into a continuous surface layer.
Climate data for each landrace is extracted from this surface layer.
Precipitation: 20 590 stations
Temperature: 7 280 stations
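Extracting a climate value for a landrace from a surface layer amounts to finding the grid cell that contains the collecting site. The sketch below assumes a regular latitude/longitude grid stored as a list of rows, with a known north-west corner and cell size; the toy grid values and the function name `extract_value` are hypothetical and do not reflect the actual WorldClim file layout.

```python
import math


def extract_value(grid, lat, lon, north, west, cell_deg):
    """Return the value of the grid cell containing (lat, lon), or None if outside the grid.

    Assumes row 0 is the northernmost row and column 0 the westernmost column.
    """
    row = math.floor((north - lat) / cell_deg)
    col = math.floor((lon - west) / cell_deg)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Invented 2 x 2 precipitation surface covering 60-62 N, 26-28 E in 1-degree cells.
toy_prec = [
    [610.0, 640.0],
    [580.0, 600.0],
]
value = extract_value(toy_prec, 61.0333, 27.3333, north=62.0, west=26.0, cell_deg=1.0)
```

A real workflow would read the raster with a GIS library; the index arithmetic is the part this example is meant to show.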
16. This study is part of a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.
17. FIGS
The FIGS technology takes much of the guesswork out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments.
http://www.figstraitmine.org/
18. What is FIGS? http://www.figstraitmine.org/
Origin of Concept (1980s):
Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for boron toxicity.
Map labels: Mediterranean region; Queensland, Australia.
Slide made by M C Mackay 1995
20. • No sources of Sunn pest resistance previously found in hexaploid wheat.
• 2000 accessions screened at ICARDA without result.
• A FIGS set of 534 accessions was developed and screened.
• 10 resistant accessions were found!
• The FIGS selection started from 16 000 landraces from VIR, ICARDA and AWCC.
• Exclude origin CHN, PAK, IND where Sunn pest was only recently reported (6 328 acc).
• Only one accession per collecting site (2 830 acc).
• Excluding dry environments below 280 mm/year.
• Excluding sites of low winter temperature below 10 degrees Celsius (1 502 acc).
Slide adapted from Ken Street, ICARDA (FIGS team)
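The selection steps listed on this slide can be read as a chain of filters. The sketch below is an illustration only: the record fields (`country`, `lat`, `lon`, `prec_annual_mm`, `winter_temp_c`) and the toy records are assumptions, and the actual FIGS team workflow is not described in enough detail here to reproduce it.

```python
def figs_sunn_pest_filter(records):
    """Apply the four exclusion steps from the slide to a list of accession records."""
    # Step 1: exclude origins where Sunn pest was only recently reported.
    excluded_countries = {"CHN", "PAK", "IND"}
    step1 = [r for r in records if r["country"] not in excluded_countries]

    # Step 2: keep only one accession per collecting site.
    seen_sites = set()
    step2 = []
    for r in step1:
        site = (r["lat"], r["lon"])
        if site not in seen_sites:
            seen_sites.add(site)
            step2.append(r)

    # Step 3: exclude dry environments below 280 mm/year.
    step3 = [r for r in step2 if r["prec_annual_mm"] >= 280]

    # Step 4: exclude sites with winter temperature below 10 degrees Celsius.
    return [r for r in step3 if r["winter_temp_c"] >= 10]


# Invented toy records, for illustration only.
toy_records = [
    {"acc": "A1", "country": "SYR", "lat": 35.0, "lon": 38.0, "prec_annual_mm": 350, "winter_temp_c": 11},
    {"acc": "A2", "country": "SYR", "lat": 35.0, "lon": 38.0, "prec_annual_mm": 350, "winter_temp_c": 11},
    {"acc": "A3", "country": "CHN", "lat": 40.0, "lon": 100.0, "prec_annual_mm": 500, "winter_temp_c": 12},
    {"acc": "A4", "country": "IRN", "lat": 33.0, "lon": 52.0, "prec_annual_mm": 200, "winter_temp_c": 12},
    {"acc": "A5", "country": "TUR", "lat": 39.0, "lon": 35.0, "prec_annual_mm": 400, "winter_temp_c": 5},
]
selected = figs_sunn_pest_filter(toy_records)
```

In this toy run only A1 survives: A2 shares A1's site, A3 is excluded by country, A4 is too dry and A5 too cold.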
21. • The fundamental ecological niche of an organism was formalized by Hutchinson [1] in 1957 as a multidimensional hypervolume defining the ecological conditions that allow a species to exist.
• Full understanding of all the environmental conditions for any organism is a monumental task [2].
• A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche.
• Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc.
George Evelyn Hutchinson (1903–1991)
22. A flexible, user-friendly, cross-platform environment where the entire process of a fundamental niche modeling experiment can be carried out.
Input: species occurrence and environmental data.
Output: a fundamental niche model and projection of the model into an environmental scenario.
http://openmodeller.sourceforge.net/
24. – The initial model is developed from the training set
– Fine tuning of model parameters and settings
– No model can ever be absolutely correct!
– A simulation model can only be an approximation
– A model is always created for a specific purpose!
– The simulation model is applied to make predictions based on new fresh data
– Be aware of extrapolation
25. – For the initial calibration or training step.
– Further calibration, tuning step
– Often cross-validation on the training set is used to reduce the consumption of raw data.
– For the model validation or goodness of fit testing.
– External data, not used in the model calibration.
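The division of data into calibration and validation parts, with cross-validation inside the training part, might be sketched as follows. The function names, the 25% test fraction and the number of folds are illustrative assumptions, not settings taken from the study.

```python
import random


def split_train_test(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and set aside a test set not used in calibration."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out


train_idx, test_idx = split_train_test(14)
cv_splits = list(k_fold(train_idx, 3))
```

Cross-validation reuses the training samples for tuning, so the external test set stays untouched until the final validation.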
27. Name of the statistic           Symbol   Range
* Correlation coefficient         r        -1 to 1
* Coefficient of determination    r2       0 to 1
• A number of different coefficients have been developed to measure correlation in different situations.
• The best known is the Pearson product-moment correlation coefficient: the covariance of the two variables is divided by the product of their standard deviations.
• The correlation coefficient (r) indicates the strength and direction of a linear relationship between two random variables.
• The coefficient of determination (r2) indicates how well future outcomes are likely to be predicted by a statistical model.
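The Pearson coefficient defined on this slide (covariance divided by the product of the standard deviations) can be written out directly. This is a minimal sketch; the function name `pearson_r` is chosen for the example.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)


def r_squared(x, y):
    """Coefficient of determination for a simple linear relationship."""
    return pearson_r(x, y) ** 2
```

A perfectly increasing linear relationship gives r = 1, a perfectly decreasing one gives r = -1, and r2 is 1 in both cases.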
28. The distances between the model (predictions) and the reference values (validation) are the residuals.
Example of a bad model calibration.
Cross-validation indicates the appropriate model complexity.
Be aware of over-fitting! NB! Model validation!
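Residuals and the gap between calibration error and cross-validation error can be illustrated with a one-variable least-squares line standing in for the multivariate models used in the study. The substitution of a simple line, the function names and the toy data are assumptions made for the example.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def rmse(residuals):
    """Root mean square of a list of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def rmsec(x, y):
    """Calibration error: residuals of the model on its own training data."""
    slope, intercept = fit_line(x, y)
    return rmse([b - (slope * a + intercept) for a, b in zip(x, y)])


def rmsecv_leave_one_out(x, y):
    """Cross-validation error: each sample is predicted by a model fitted without it."""
    residuals = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        residuals.append(y[i] - (slope * x[i] + intercept))
    return rmse(residuals)


# Invented toy data.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [1.1, 1.9, 3.2, 3.8, 5.1]
```

For a least-squares line the leave-one-out error is never smaller than the calibration error; a large gap between the two is one warning sign of over-fitting.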
31. Station                      Altitude  Latitude  Longitude
Priekuli, Latvia             83 m      57.3167   25.3667
Bjørke forsøksgård, Norway   149 m     60.7667   11.2167
Landskrona, Sweden           3 m       55.8667   12.8333
32. accide AccNum Country Locality Elevation Latitude Longitude Coordinate
7436 NGB27 Finland Sarkalahti, Luumäki 95 m 61.0333 27.3333 SESTO
9717 NGB456 Norway Dønna, Nordland 71 m 66.1167 12.5 Georeferenced
9601 NGB468 Norway Trysil 400 m 61.2833 12.2833 Georeferenced
9600 NGB469 Norway BJØRNEBY 400 m 61.2833 12.2833 Georeferenced
7966 NGB775 Sweden Överkalix, Allsån 45 m 66.4 22.9333 SESTO
8510 NGB776 Sweden Överkalix 100 m 66.4 22.7667 SESTO
7810 NGB792 Finland Luusua, Kemijärvi 145 m 66.4833 27.35 SESTO
9538 NGB2072 Norway Finset 1220 m 60.6 7.5 Georeferenced
8482 NGB2565 Sweden Öland 11 m 56.7333 16.6667 Georeferenced
9102 NGB4641 Denmark Støvring, Jylland 55 m 56.8833 9.8333 Georeferenced
9015 NGB4701 Faroe Islands Faroe Islands 81 m 62.0167 -6.7667 Georeferenced
9039 NGB6300 Faroe Islands Faroe Islands 81 m 62.0167 -6.7667 Georeferenced
8531 NGB9529 Denmark Lyderupgaard 9 m 56.5667 9.35 Georeferenced
7344 NGB13458 Finland Koskenkylä, Rovaniemi 91 m 66.5167 25.8667 Georeferenced
33. From a total of 19 landrace accessions included in the dataset, only 4 included geo-referenced coordinates in the NordGen SESTO database. 10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources. For 5 accessions there was not enough information available to locate the original gathering location.
Right side illustration: Example of georeferencing for NGB9529, landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com
35. Score plots
The observations made at Priekuli (Latvia) are separated from the observations made at Bjørke (Norway) and Landskrona (Sweden) in PC1 and PC2.
The combined observations from each year (2002 and 2003) are less separated.
The two replicate series are NOT separated.
36. The bi-plot shows heading days and ripening days as the most influential trait variables for the separation of the observations from the different observation locations. Length of plant participates in spreading out the scores (in PC1 and PC2), but is less active in the separation of the groups.
The influence plot (residuals against leverage) shows the sample observed at Priekuli in 2003 (replicate 2) with a very high leverage, well separated from the “data cloud”. After looking into the raw data (see next slide), this data point was removed as an outlier (set to NaN).
37. Sample (FRO) observed at Priekuli in 2003 (replicate 2) has the lowest score for harvest index in the entire dataset. After looking into the raw data (see the table above), this observation point was removed as an outlier (set to NaN).
38. The initial PCA analysis of the climate data showed a nice spread of the scores. No surprises.
The influence plot identified sample (NOR) as a mild outlier. I decided to keep this sample, but to keep an eye out for it in the multi-way analysis.
40. • Plot of the trait scores (max – min) from each observation location and year.
• The different experimental conditions have a significant effect on the trait observations.
42. Mode 3 (climate variables: tmin, tmax, and prec) has very different ranges of numerical values. Scaling across mode 3 is thus applied to the multi-way models.
To the left is displayed the box-plot for the 3-way data, unfolded so as to keep the dimensions of mode 3 (figure labels: tmin, tmax, prec; Scaling across mode 3).
The 3-way climate data was reasonably well described by a PARAFAC model of two components.
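One way to read the scaling step is that each climate variable (each slab of mode 3) is divided by its own root mean square, so that tmin, tmax and prec end up on comparable scales. The sketch below follows that interpretation, which is an assumption; the exact scaling used in the study is not given on the slide, and the toy array is invented.

```python
import math


def scale_mode3(data):
    """Scale each mode-3 slab of data[landrace][month][variable] to unit root mean square.

    Assumes no slab is entirely zero.
    """
    n_i, n_j, n_k = len(data), len(data[0]), len(data[0][0])
    scales = []
    for k in range(n_k):
        sum_sq = sum(data[i][j][k] ** 2 for i in range(n_i) for j in range(n_j))
        scales.append(math.sqrt(sum_sq / (n_i * n_j)))
    scaled = [
        [[data[i][j][k] / scales[k] for k in range(n_k)] for j in range(n_j)]
        for i in range(n_i)
    ]
    return scaled, scales


# Invented toy array: 2 landraces x 2 months x 3 variables (tmin, tmax, prec).
toy_climate = [
    [[-5.0, 2.0, 40.0], [10.0, 20.0, 80.0]],
    [[-8.0, 1.0, 55.0], [12.0, 22.0, 70.0]],
]
scaled_climate, slab_scales = scale_mode3(toy_climate)
```

After scaling, precipitation in millimetres no longer dominates temperatures in degrees simply because its numbers are larger.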
44. [Figure: layout of the trait data array]
14 landraces (x2), 28 records.
Mode 2 (Traits), 6 traits: Heading days, Ripening days, Length of plant, Harvest index, Volumetric weight, Grain weight.
Mode 3, 6 instances: LVA 2002, LVA 2003, NOR 2002, NOR 2003, SWE 2002, SWE 2003.
Unfolded layout: Bjørke (N) 2002, Bjørke (N) 2003, Landskrona (S) 2002, Landskrona (S) 2003, Priekuli (Lv) 2002, Priekuli (Lv) 2003, each with 6 traits, over 28 records.
45. Climate data (mode 3):
• Minimum temperature
• Maximum temperature
• Precipitation
• … (many more can be added)
Array dimensions: 14 landraces (location of origin) × 12 monthly means × 3 climate variables.
Unfolded layout: Min. temperature (Jan, Feb, Mar, …), Max. temperature (Jan, Feb, Mar, …), Precipitation (Jan, Feb, Mar, …), for 14 samples.
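The unfolded layout on this slide places the monthly values of each climate variable side by side, one row per landrace. A minimal sketch of that unfolding, with an invented toy array and a function name (`unfold_by_landrace`) chosen for the example:

```python
def unfold_by_landrace(data):
    """Unfold data[landrace][month][variable] into one row per landrace.

    Columns are ordered variable by variable: all months of variable 0,
    then all months of variable 1, and so on.
    """
    n_months = len(data[0])
    n_vars = len(data[0][0])
    return [
        [landrace[m][v] for v in range(n_vars) for m in range(n_months)]
        for landrace in data
    ]


# Invented toy array: 2 landraces x 2 months x 3 variables (tmin, tmax, prec).
toy_array = [
    [[-5.0, 2.0, 40.0], [10.0, 20.0, 80.0]],
    [[-8.0, 1.0, 55.0], [12.0, 22.0, 70.0]],
]
unfolded = unfold_by_landrace(toy_array)
```

With the full dataset this would give 14 rows of 36 columns (3 variables × 12 months).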
46. • The initial PARAFAC models calibrated from the 4-way trait dataset failed to converge to any good models. The core-consistency remained very low.
• The problem proved to be a lack of systematic independent variation between instances of mode 3 (observation years) and mode 4 (observation locations).
• A two-component PARAFAC model was chosen for the new 3-way trait dataset.
(NOR) was identified as a mild outlier from the influence plot. Notice that both replications are located in the same part of the plot, and that they (together) are not isolated from the “data cloud”.
47. PARAFAC split-half (mode 1) analysis:
The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a solution very similar to the model calibrated from the complete dataset. The PARAFAC model is thus a general and stable model for the scope of Scandinavia.
48. Further search for any good PARAFAC split-half for the climate dataset:
A systematic recording of results from 10 different split-half alternatives resulted in two good split-halves. The PARAFAC model for the climate data is thus reasonably general (for Scandinavia), but less stable than the model for the 3-way trait data.
51. • Often the critical level (α) for the p-value is set at 0.05, 0.01 or 0.001.
• The modeling of 14 samples (landraces) gives:
– 12 degrees of freedom for the correlation tests
– One-tailed test (looking only at positive correlation of predictions versus the reference values).
– A coefficient of determination (r2) larger than 0.56 is significant at the 0.001 (0.1%) level for 14 values/samples.
Many introductory textbooks on statistics include a table of Critical Values for Pearson’s r.
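The threshold quoted on this slide can be checked from the relation between Pearson's r and Student's t, t = r·sqrt(df) / sqrt(1 − r²), which rearranges to r = t / sqrt(t² + df). The sketch below assumes the tabulated one-tailed critical t value of 3.930 for 12 degrees of freedom at the 0.001 level, taken from a standard t table rather than from the slides.

```python
import math

# Tabulated critical t for a one-tailed test at alpha = 0.001 with 12 degrees of freedom
# (value from a standard t table; an assumption, not stated in the slides).
T_CRIT_DF12_ONE_TAILED_0001 = 3.930


def critical_r(t_crit, df):
    """Convert a critical t value into the corresponding critical Pearson r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


n_samples = 14
df = n_samples - 2
r_crit = critical_r(T_CRIT_DF12_ONE_TAILED_0001, df)
r2_crit = r_crit ** 2
```

The resulting critical r is about 0.75, so r2 comes out at about 0.56, in agreement with the slide.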
53. • Latvia 2002 (LY11)
– May 2002 was extremely dry in Priekuli.
– June 2002 was extremely wet in Priekuli.
– The wet June caused germination on the spikes for many of the early varieties.
• Landskrona 2003 (LY32)
– June 2003 was extremely dry in Landskrona.
– June was the time for grain filling here.
• Too extreme for the genotype to be “normally” expressed?
• Too large effect from “G by E” interaction?
54. Rainfall (mm) by station, year and sowing week
Station                      Year  Sowing week  May   June   July   August
Bjørke forsøksgård, Norway   2002  17           82.9  67.4   128.5  136.5
Bjørke forsøksgård, Norway   2003  21           75.1  85.7   67.1   53.2
Landskrona, Sweden           2002  13           53.5  75.3   76.4   68.9
Landskrona, Sweden           2003  15           70.7  40.4   76.0   45.7
Priekuli, Latvia             2002  17           38.2  111.1  67.0   11.3
Priekuli, Latvia             2003  19           88.0  59.2   87.8   175.8
57. Exploring why some of the subsets (LY) give very bad N-PLS regressions...
59. All samples: RMSECV = 3.72. Without NGB456: RMSECV = 3.18.
Expl. X = 96%, r2 cal = 0.64. Expl. X = 98%, r2 cal = 0.54. r2 cv = 0.16, Expl. y = 54%. r2 cv = 0.33, Expl. y = 64%.
66. • The first dataset I started to work with is a “FIGS” dataset with genebank accessions of barley (Hordeum vulgare ssp. vulgare) collected from different countries worldwide and tested for susceptibility to net blotch infection. Net blotch is a common disease of barley caused by the fungus Pyrenophora teres.
• The barley plants were inoculated with the fungus and the percentage of the leaves infected with the disease was normalized to an interval scale (1 to 9).
• 1-3 are basically resistant (group 1)
• 4-6 are intermediate (group 2)
• 7-9 are susceptible (group 3)
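The grouping of the 1 to 9 net blotch scores into three classes is a simple binning rule. A minimal sketch, with a function name chosen for the example:

```python
def net_blotch_group(score):
    """Map a net blotch score (1 to 9) to group 1 (resistant), 2 (intermediate) or 3 (susceptible)."""
    if not 1 <= score <= 9:
        raise ValueError("net blotch score must be between 1 and 9")
    if score <= 3:
        return 1
    if score <= 6:
        return 2
    return 3


groups = [net_blotch_group(s) for s in range(1, 10)]
```

These three groups are the response classes used in the classification shown later in the deck.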
67. • Field locations (USA)
– Athens, Georgia (273 observations)
– Fargo, North Dakota (3381 observations)
– Langdon, North Dakota (858 observations)
– Stephen, Minnesota (139 observations)
• Observation years (1987 – 2004)
– 9 distinct years
• Greenhouse versus field trials
– Greenhouse (1676 observations)
– Field trial (2975 observations)
70. Individual 95% CIs For Mean Based on
Pooled StDev
Level N Mean StDev -----+---------+---------+---------+-
ATHENS 262 2,0840 0,6555 (---*---)
FARGO 789 1,6793 0,6023 (-*-)
LANGDON 1558 1,6727 0,6466 (-*)
STEPHEN 136 1,6103 0,7810 (-----*----)
-----+---------+---------+---------+-
1,60 1,80 2,00 2,20
• One-way ANOVA test for difference between the observation locations. The reported p-value of 0.000 (that is, p < 0.001) rejects the null hypothesis of no difference.
• The Tukey pair-wise comparison test gave the same result.
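The one-way ANOVA F statistic can be recomputed from the group sizes, means and standard deviations printed in the table above. This is a sketch of the textbook formula, not the statistics package's own routine; decimal commas from the output are written as decimal points.

```python
def one_way_anova_from_summary(groups):
    """Compute the one-way ANOVA F statistic from (n, mean, sd) summaries.

    Returns (f_stat, df_between, df_within).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Summary statistics as printed on the slide: (N, Mean, StDev) per location.
locations = {
    "ATHENS": (262, 2.0840, 0.6555),
    "FARGO": (789, 1.6793, 0.6023),
    "LANGDON": (1558, 1.6727, 0.6466),
    "STEPHEN": (136, 1.6103, 0.7810),
}
f_stat, df_between, df_within = one_way_anova_from_summary(list(locations.values()))
```

With 3 and 2741 degrees of freedom, an F statistic of this size is far beyond the usual critical values, which agrees with the very small p-value reported.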
72. • Agro-climatic Zone (UNESCO classification)
• Soil classification (FAO Soil map)
• Aridity (dryness)
• Precipitation
• Potential evapotranspiration (water loss)
• Temperature
• Maximum temperatures
• Minimum temperatures
(mean values for month and year)
73. Discriminant Analysis: obs_nb versus acz_moisture; ...
Quadratic Method for Response: obs_nb
Predictors: acz_moisture; acz_winter_temp; acz_summer_temp; arid_annual; pet_annual; prec_annual; temp_annual; tmax_annual; tmin_annual
Group        1     2     3
Count     1049  1190   234
Summary of classification
Put into Group    1     2     3
1               523   427    48
2               287   451    25
3               238   314   163
Total N        1048  1192   236
N correct       523   451   163
Proportion    0,499 0,378 0,691
N = 2476   N Correct = 1137   Proportion Correct = 0,459
• The proportion of correctly classified groups for the training dataset was 45.9%, and we would expect a similar success rate for the prediction of the “blinded” values.
• Remember that random classification into three groups gives 33.3%.
• A test set of 9 samples showed a proportion of correct classifications of 44.4%.
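The proportions in the classification summary follow directly from the confusion matrix: the diagonal holds the correctly classified observations. A short sketch using the counts printed on the slide, with rows as the assigned group and columns as the true group (the variable names are chosen for the example):

```python
# Rows: group the observation was put into; columns: true group (1, 2, 3).
confusion = [
    [523, 427, 48],
    [287, 451, 25],
    [238, 314, 163],
]

n_groups = len(confusion)
column_totals = [sum(confusion[r][c] for r in range(n_groups)) for c in range(n_groups)]
n_correct = [confusion[g][g] for g in range(n_groups)]

# Proportion correct per true group and overall.
proportion_per_group = [n_correct[g] / column_totals[g] for g in range(n_groups)]
total_n = sum(column_totals)
overall_proportion = sum(n_correct) / total_n

# Baseline for random assignment to three equally likely groups.
random_baseline = 1 / n_groups
```

The overall proportion of 0.459 is above the 0.333 random baseline, but the per-group values show that most of the gain comes from group 3.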
74. Michael Mackay: FIGS coordinator
Ken Street: FIGS project leader
Harold Bockelman: Net blotch data
Eddy De Pauw: Climate data
Dag Endresen: Data analysis