Scientific seminar with presentation of the FIGS method and results from the FIGS study with Nordic Barley landraces for the Vavilov Seminar at IPK Gatersleben (12 May 2010).
Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Sci. 50(6):2418-2430. doi: 10.2135/cropsci2010.03.0174
Predictive association between trait data and eco-geographic data for Nordic barley landraces (Gatersleben, 2010-05-12)
1. A Lifeboat to the Gene Pool: Predictive association between trait data and eco-geographic data for identification of trait properties useful for improvement of food crops. Vavilov Seminar at IPK Gatersleben, May 12, 2010 - Dag Endresen, NordGen
8. Crop Genetic Diversity: traditional landraces, crop wild relatives, modern cultivars. Genetic bottlenecks occur during crop domestication and during modern plant breeding. The circles represent allelic variation. The funnel represents the allelic variation of genes found in the crop wild relatives but gradually lost during domestication, traditional cultivation and modern plant breeding. Illustration based on: Tanksley, Steven D. and Susan R. McCouch 1997. Seed Banks and Molecular Maps: Unlocking Genetic Potential from the Wild. Science 277 (5329), 1063. (22 August 1997). doi:10.1126/science.277.5329.1063
9. Plant Genetic Resources for Crop Improvement. Primitive crops and traditional landraces are an important source of novel traits for the improvement of modern crops. Landraces are often not well described for the economically valuable traits. Identification of novel crop traits will often be the result of a larger field trial screening project (thousands of individual plants). Large-scale field trials are very costly in both land area and human working hours.
10. Challenges for improved utilization of genetic resources for crop improvement:
* Large gene bank collections
* Limited screening capacity
11. A needle in a haystack. Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait. How does the scientist select a small subset likely to have the useful trait? Example: More than 560 000 wheat accessions in genebanks worldwide. Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
12. Core collection subset. The scientist or the breeder needs a smaller subset to cope with the field screening experiments. A common approach is to create a so-called core collection. Sir Otto H. Frankel (1900-1998) proposed (1984) a limited set established from an existing collection with minimum similarity between its entries. The core collection is of limited size and chosen to represent the genetic diversity of a large collection.
13. Core subset selection. Given that the trait property you are looking for is relatively rare, perhaps as rare as a unique allele for one single landrace cultivar, getting what you want is largely a question of LUCK! Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
15. Focused Identification of Germplasm Strategy. Objective of this method: Explore climate data as a prediction model for “computer pre-screening” of crop traits BEFORE full scale field trials. Identification of landraces with a higher probability of holding an interesting trait property.
16. Climate effect during the cultivation process. Wild relatives are shaped by the environment. Primitive cultivated crops are shaped by local climate and humans. Traditional cultivated crops (landraces) are shaped by climate and humans. Modern cultivated crops are mostly shaped by humans (plant breeders). Perhaps future crops are shaped in the molecular laboratory…?
17. Predictive pattern between eco-geography and trait. The predictive pattern between the eco-geography and the traits can of course also have sources other than adaptation. During traditional cultivation the farmer will also select for and introduce germplasm for improved suitability of the landrace to the local conditions.
18. FIGS selection method. Assumption: the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait score. Aim: to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).
20. The longitude and latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
21. 1. Genetic resources, genebank collections: Lima, Peru; Alnarp, Sweden; Svalbard; Benin. More than 7.4 million genebank accessions in more than 1 400 genebanks worldwide.
22. 2. Trait data, descriptive crop data: field trials, Gatersleben, Germany; potato, Priekuli, Latvia; faba bean, Finland; Linnés äpple; forage crops, Dotnuva, Lithuania; radish (S. Jeppson). Diseases: powdery mildew, Blumeria graminis; leaf spots, Ascochyta sp.; yellow rust, Puccinia striiformis; black stem rust, Puccinia graminis. http://barley.ipk-gatersleben.de
23. 3. Climate data – WorldClim. The climate data can be extracted from the WorldClim dataset. http://www.worldclim.org/ Data from weather stations worldwide are combined into a continuous surface layer. Climate data for each landrace are extracted from this surface layer. Precipitation: 20 590 stations. Temperature: 7 280 stations.
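As a rough illustration of how a collecting-site coordinate can be linked to a cell of a gridded climate surface, the sketch below converts latitude and longitude to a row and column index. The grid layout (a global grid starting at 90°N, 180°W), the 10 arc-minute cell size and the function name `cell_index` are assumptions made for this example and are not taken from the WorldClim documentation; the coordinates are the map-centre values from the Google Maps link in the speaker notes.

```python
import math


def cell_index(lat, lon, cell_deg, north=90.0, west=-180.0):
    """Return (row, col) of the grid cell that contains the point.

    Assumes a regular latitude/longitude grid whose first row starts at
    `north` and whose first column starts at `west` (hypothetical layout).
    """
    row = math.floor((north - lat) / cell_deg)
    col = math.floor((lon - west) / cell_deg)
    return row, col


# Illustrative point: map-centre coordinates from the speaker notes
site_lat, site_lon = 56.606941, 9.297695
row, col = cell_index(site_lat, site_lon, cell_deg=1 / 6)  # 10 arc-minute cells
print(row, col)
```

The climate values for the accession would then be read from that row and column in each monthly layer.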
24. FIGS – Focused Identification of Germplasm Strategy. FIGS selection is a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.
25. What is FIGS? http://www.figstraitmine.org/ Focused Identification of Germplasm Strategy. Origin of the concept (1980s): wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for tolerance to boron toxicity. Mediterranean region; South Australia. Slide made by Michael Mackay, 1995.
26. FIGS. The FIGS technology takes much of the guesswork out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments. http://www.figstraitmine.org/ FIGS salinity set.
28. Ecological Niche Modeling, Species Distribution Models. The fundamental ecological niche of an organism was formalized by G. E. Hutchinson in 1957 as a multidimensional hypervolume defining the ecological conditions that allow a species to exist. A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche. Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc. George Evelyn Hutchinson (1903 – 1991)
30. Data for the simulation model. Training set: for the initial calibration or training step. Calibration set: for further calibration, the tuning step. Often cross-validation on the training set is used to reduce the consumption of raw data. Test set: for the model validation or goodness-of-fit testing; new external data, not used in the model calibration.
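A minimal sketch of dividing a dataset into training, calibration and test sets. The accession identifiers are hypothetical placeholders, the function name `three_way_split` is made up for this example, and the equal three-way division is only one possible choice.

```python
import random


def three_way_split(samples, seed=0):
    """Shuffle the samples and split them into training, calibration and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    third = len(shuffled) // 3
    training = shuffled[:third]
    calibration = shuffled[third:2 * third]
    test = shuffled[2 * third:]  # any remainder ends up in the test set
    return training, calibration, test


accessions = [f"ACC{i:03d}" for i in range(30)]  # hypothetical accession IDs
training, calibration, test = three_way_split(accessions)
print(len(training), len(calibration), len(test))
```

The test set is kept apart from every calibration step so that it can serve as new external data for validation.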
31. A model of the real world. Validation step: No model can ever be absolutely correct. A simulation model can only be an approximation. A model is always created for a specific purpose. Apply the model: The simulation model is applied to make predictions based on new, fresh data. Take care to avoid extrapolation problems.
34. Residuals (validate model fit). The residuals are the distances between the model predictions and the reference (validation) values. Example of a bad model calibration. Calibration step: cross-validation indicates the appropriate model complexity. Be aware of over-fitting! NB! Model validation!
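The residual calculation can be sketched as follows. The reference and predicted values are made-up numbers, and the root mean square error is added here as one common summary of the residuals; it is not named on the slide.

```python
import math


def residuals(reference, predicted):
    """Residual = reference value minus model prediction, per sample."""
    return [y - y_hat for y, y_hat in zip(reference, predicted)]


def rmse(reference, predicted):
    """Root mean square error: a single-number summary of the residuals."""
    res = residuals(reference, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))


reference = [2.0, 4.0, 6.0, 8.0]   # made-up reference (validation) values
predicted = [2.5, 3.5, 6.5, 7.5]   # made-up model predictions
print(residuals(reference, predicted), rmse(reference, predicted))
```

Computing the same summary on the calibration data and on held-out data is a simple way to spot over-fitting: a much larger error on the held-out data suggests the model is too complex.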
36. Morphological traits in Nordic Barley landraces. Field observations by Agnese Kolodinska Brantestam (2005). Multi-way N-PLS data analysis, Dag Endresen (2009). Sites: Priekuli (L), Bjorke (N), Landskrona (S).
37. Landrace origin locations (georeferencing). From a total of 19 landrace accessions included in the dataset, only 4 included geo-referenced coordinates in the NordGen SESTO database. 10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources. For 5 accessions there was not enough information available to locate the original gathering location. Right side illustration: example of georeferencing for NGB9529, a landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com
43. Climate data as a 2-way (bi-linear) array: 14 samples (landraces, by location of origin) × 36 variables, i.e. 3 climate variables (min. temperature, max. temperature, precipitation) × 12 monthly means (Jan, Feb, Mar, …). Many more layers can be added.
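The step from a 3-way arrangement (samples × climate variables × months) to the 2-way array with 36 variables can be sketched as below. The values are index-encoded placeholders, not real climate data, and the function name `unfold` is chosen for this example only.

```python
def unfold(cube):
    """Unfold samples x variables x months into samples x (variables * months)."""
    return [[value for variable in sample for value in variable] for sample in cube]


n_samples, n_variables, n_months = 14, 3, 12

# Placeholder values: each entry encodes its own (sample, variable, month) index
cube = [
    [[1000 * i + 100 * j + k for k in range(n_months)] for j in range(n_variables)]
    for i in range(n_samples)
]

flat = unfold(cube)
print(len(flat), len(flat[0]))
```

Each row of `flat` holds the 12 monthly values of the first climate variable, followed by those of the second and the third.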
46. Mean centering removes the absolute intensity so that the model does not focus on the variables with the highest numerical values (intensity).
47. Scaling makes the relative distribution of values (range spread) more equal between variables.
48. After auto-scaling all variables have a mean of zero and a standard deviation of one.
49. The objective is to help the model to separate the relevant information from the noise.
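Auto-scaling (mean centering followed by scaling to unit standard deviation) can be sketched as below. The two variables and their values are made up to show one variable with large and one with small numerical values; the use of the sample standard deviation (n - 1) is an assumption.

```python
import statistics


def autoscale(column):
    """Subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(value - mean) / sd for value in column]


# Made-up values for two variables on very different numerical scales
precipitation_mm = [420.0, 510.0, 630.0, 700.0, 880.0]
temperature_c = [2.1, 3.4, 1.8, 4.0, 2.7]

scaled_precipitation = autoscale(precipitation_mm)
scaled_temperature = autoscale(temperature_c)
print(statistics.mean(scaled_precipitation), statistics.stdev(scaled_precipitation))
```

After this step both variables have a mean of zero and a standard deviation of one, so neither dominates the model merely because of its units.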
50. Trait dataset - outlier. Outlier: NGB6300, replicate 2 from Priekuli 2003 (LYR122). The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage, well separated from the “data cloud”. After looking into the raw data (see the table above), this observation point was removed as an outlier (set to NaN).
51. PARAFAC split-half, trait data (3-way). PARAFAC split-half (mode 1) analysis: the two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to the same solution. The PARAFAC 3-way method thus produces a stable model for this dataset.
52. PARAFAC split-half, climate data (3-way). 10 different PARAFAC split-half alternatives resulted in 2 good splits.
54. Significance levels. The critical levels (α) for the p-value are often set at 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %). Modeling 14 samples (landraces) gives 12 degrees of freedom for the correlation tests (two are used for the means of x and y). A one-tailed test is used (looking only at positive correlation of predictions versus the reference values). A coefficient of determination (r²) larger than 0.56 is significant at the 0.001 (0.1 %) level for 14 values/samples. Many introductory textbooks on statistics include a table of critical values for Pearson’s r.
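The threshold of about 0.56 can be checked with the standard relation between Pearson's r and Student's t, t = r·sqrt(df / (1 − r²)), solved for r. The critical t value of 3.930 (one-tailed, α = 0.001, 12 degrees of freedom) is copied from a printed t-table and is an assumption of this sketch, as is the function name `critical_r`.

```python
import math


def critical_r(t_crit, n_samples):
    """Critical Pearson r for a given critical t value and sample size."""
    df = n_samples - 2  # two degrees of freedom are used for the means of x and y
    return t_crit / math.sqrt(t_crit ** 2 + df)


T_CRIT = 3.930  # assumed table value: one-tailed, alpha = 0.001, df = 12
r_crit = critical_r(T_CRIT, n_samples=14)
r2_crit = r_crit ** 2
print(round(r_crit, 2), round(r2_crit, 2))
```

With these inputs the critical r is roughly 0.75, which corresponds to an r² of roughly 0.56.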
55. N-PLS regression results. Traits: Heading, Ripening, Length, H-Index, Volwgt, TGW. Sites: Priekuli (L), Bjorke (N), Landskrona (S). The 5% and 1% significance levels are indicated by the horizontal green lines.
56. Experiment observation site. Latvia 2002 (LY11): May 2002 was extremely dry in Priekuli. June 2002 was extremely wet in Priekuli. The wet June caused germination on the spikes for many of the early varieties. Landskrona 2003 (LY32): June 2003 was extremely dry in Landskrona. June was the time for grain filling here. Too extreme for the genotype to be “normally” expressed? Too large an effect from “G by E” interaction?
68. Work in progress! (SIMCA, D-PLS, Multi-way). Ken Street: FIGS project leader. Harold Bockelman: net blotch data. Eddy De Pauw: climate data. Dag Endresen: data analysis.
69. Trait observation locations, USDA Research Stations. Dr Harold Bockelman extracted the trait data (C&E) from the GRIN database, USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs
70. Climate data: agro-climatic zone (UNESCO classification); soil classification (FAO soil map); aridity (dryness); precipitation; potential evapotranspiration (water loss); temperature, maximum temperatures, minimum temperatures (mean values for month and year). Eddy De Pauw (ICARDA, 2008)
72. 2 000 accessions screened at ICARDA without result (during the last 7 years).
73. A FIGS set of 534 accessions was developed and screened (2007, 2008).
Photo by Dag Endresen. Barley (Hordeum vulgare L.) at Gatersleben (June, 2007). URL: http://www.flickr.com/photos/dag_endresen/4189818373/
Photo: Dag Endresen. Barley seeds (Hordeum vulgare L.), genebank accession NGB11242, at the Nordic Gene Bank, Alnarp (July 2004). URL: http://www.flickr.com/photos/dag_endresen/4262545194/
Illustration based on: Tanksley, Steven D. and Susan R. McCouch 1997. Seed Banks and Molecular Maps: Unlocking Genetic Potential from the Wild Science 277 (5329), 1063. (22 August 1997). doi:10.1126/science.277.5329.1063
Modern agriculture uses advanced plant varieties based on the most productive genetics. The original land races and wild forms produce lower yields, but their greater genetic variation contains a higher diversity in e.g. resistance to disease. High-yielding modern crops are therefore vulnerable when a new disease arises.
Photo: Dag Endresen. Field of sugar beet (Beta vulgaris L.) at Alnarp (June 2005). URL: http://www.flickr.com/photos/dag_endresen/4189812241/
Some selected literature on core collections:
1995 :: Core Collections of Plant Genetic Resources. Author: Hodgkin, T.; Brown, A.H.D.; van Hintum, Th.J.L.; Morales, E.A.V. (eds.). ISBN-10: 0-471-95545-0. http://www.bioversityinternational.org/index.php?id=19&user_bioversitypublications_pi1[showUid]=2365
1999 :: Core collections for today and tomorrow. Author: Johnson, R.C.; Hodgkin, T. (eds.). ISBN-10: 92-9043-424-4. http://www.bioversityinternational.org/index.php?id=19&user_bioversitypublications_pi1[showUid]=2153
2000 :: Core Collections of plant genetic resources. IPGRI Technical Bulletin No. 3. Author: van Hintum, Th.J.L.; Brown, A.H.D.; Spillane, C.; Hodgkin, T. ISBN-10: 92-9043-454-6. http://www.bioversityinternational.org/index.php?id=19&user_bioversitypublications_pi1[showUid]=2540
2002 :: Accession management trials of genetic resources collections. IPGRI Technical Bulletin No. 5. Author: Sackville Hamilton, N.R.; Engels, J.M.M.; van Hintum, Th.J.L.; Koo, B.; Smale, B. ISBN-10: 92-9043-516-X. http://www.bioversityinternational.org/index.php?id=19&user_bioversitypublications_pi1[showUid]=2703
TODO: look up reference for rare alleles... (Erling Fimland mentioned alleles unique to a single landrace in domesticated husbandry animals) http://commons.wikimedia.org/wiki/File:P_game.svg
Illustration traditional cattle farming: http://commons.wikimedia.org/wiki/File:Traditional_farming_Guinea.jpg (USAID, Public Domain)
Estimates from FAO State of the World report 2009 (SoW 2009). http://www.fao.org/nr/cgrfa/cgrfa-meetings/cgrfa-comm/twelfth-reg/en/
The WorldClim dataset is described in: Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978. NOAA GHCN-Monthly version 2: http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php Weather stations: precipitation 20 590; temperature 7 280.
We often divide the data for a simulation model project into three equal parts: one set for initial model calibration or training; one set for further calibration or fine tuning; and one test set for validation of the model.
http://en.wikipedia.org/wiki/Correlation Formula (1): the correlation coefficient is obtained by dividing the sample covariance between the two variables by the product of their sample standard deviations. Formula (2): the same coefficient expressed through the standard score (Xi - X̄)/sX, the sample mean X̄ and the sample standard deviation sX (equal result as above).
---
http://en.wikipedia.org/wiki/Anscombe%27s_quartet Anscombe, Francis J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21. A set of four different pairs of variables created by Francis Anscombe. All four y variables have the same mean (7.5) and variance (4.12). All pairs have the same correlation (0.81) and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the variables is very different.
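Formula (1) can be written out in a few lines. The sketch below is an illustration only; it uses the first of the four Anscombe (1973) data sets as input, and the function name `pearson_r` is chosen for this example.

```python
import math


def pearson_r(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return covariance / (sd_x * sd_y)


# First data set of Anscombe's quartet
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

r = pearson_r(x1, y1)
print(round(r, 2))
```

The other three data sets of the quartet give nearly the same coefficient although their scatter plots look very different, which is why the plots should always be inspected alongside the number.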
Residuals
KRAK: http://www.krak.dk/query?mop=aq&mapstate=7%3B9.305588071850734%3B56.61105751259899%3Bh%3B9.282591620463698%3B56.61775781407488%3B9.328584523237769%3B56.60435721112311%3B853%3B469&what=map_adr# Google Maps: http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=107144586665622662057.00045ff98921bd0418037&ll=56.606941,9.297695&spn=0.055554,0.150204&t=h&z=13
Illustration of the 3-way cube model compared to the more common data array (2 variable dimensions, 2-way, bi-linear)
Box-plot of the trait scores to illustrate the effect of the preprocessing. First row is no preprocessing; row 2 is mean-centering (centering across mode 1, samples); last row is auto-scale (centering across mode 1 and scaling across mode 2, traits). Mean centering removes the absolute intensity information (the mean for each variable is subtracted from the individual data values). This pre-processing strategy is applied so that the model does not focus on the variables with the highest numerical values (intensity). Scaling: In general, scaling a variable in the data can be viewed as a multiplication of the corresponding column vector entries with some number. If the significances of the variables to the model are known prior to modeling, then it might be a good idea to upscale the highly relevant variables. In contrast, if a variable is supposed to bear merely noise, then its significance must be downscaled. However this is a rare case in reality. Therefore, unit-variance scaling (UV-scaling) is most often used. Moreover scaling itself is sometimes associated with UV-scaling. (Johann Gasteiger and Thomas Engel (editors). 2003. Chemoinformatics: a textbook. Wiley-VCH, Weinheim. ISBN 9783527306817. Page 214)
NGB6300 (accide 9039, FRO) observed at Priekuli, Latvia in 2003, replicate 2, is highlighted. NGB776 (accide 8510, SWE) observed at Landskrona, Sweden in 2002 (both replicates) is highlighted. Replicate 2 (LYR312) has the largest residual, replicate 1 (LYR311) is below.
Map illustrating the first successful split-half subsets. Set 1: NGB6300, NGB27, NGB469, NGB776, NGB4701, NGB2072, NGB4641 are indicated with blue placemarks. Set 2: NGB792, NGB13458, NGB9529, NGB468, NGB775, NGB456, NGB2565 are indicated with red placemarks. Map of the second good split-half. Set 1: NGB456, NGB9529, NGB469, NGB2072, NGB468, NGB4641, NGB776 are indicated with blue placemarks. Set 2: NGB4701, NGB27, NGB2565, NGB792, NGB13458, NGB6300, NGB775 are indicated with red placemarks.
http://en.wikipedia.org/wiki/Correlation, http://en.wikipedia.org/wiki/Coefficient_of_determination, http://en.wikipedia.org/wiki/Statistical_model_validation
Table of critical values for r: http://www.runet.edu/~jaspelme/statsbook/Chapter%20files/Table_of_Critical_Values_for_r.pdf
Table of critical values for r: http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm
Table of critical values for r: http://www.jeremymiles.co.uk/misc/tables/pearson.html
Dr Harold Bockelman extracted the trait data (C&E) from the GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs) USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041
* Bouhssini, M., Street, K., Joubi, A., Ibrahim, Z., Rihawi, F. (2009). Sources of wheat resistance to Sunn pest, Eurygaster integriceps Puton, in Syria. Genetic Resources and Crop Evolution. URL http://dx.doi.org/10.1007/s10722-009-9427-1, http://www.springerlink.com/content/587250g7qr073636/ (Recent FIGS study at ICARDA, Syria.)
Closing slide with lifeboat. Source image (Google images): http://www.nut.no/html/body_hyperbaric_lifeboat.html