Presented at Evolution 2013, June 24; describes an approach to teaching population genetics at the upper-undergraduate/beginning-graduate level, using simulations in R and incorporating available large genomic data sets.
Teaching Population Genetics with R
1. A Simulation-Based Approach to Teaching Population Genetics: R as a Teaching Platform
Bruce J. Cochrane
Department of Zoology/Biology
Miami University
Oxford OH
2. Two Time Points
• 1974
o Lots of Theory
o Not much Data
o Allozymes Rule
• 2013
o Even More Theory
o Lots of Data
o Sequences, -omics, ???
3. The Problem
• The basic approach hasn’t changed, e.g.
o Hardy-Weinberg
o Mutation
o Selection
o Drift
o Etc.
• Much of it is deterministic
4. And
• There is little initial connection with real data
o The world seems to revolve around A and a
• At least in my hands, it doesn’t work
5. The Alternative
• Take a numerical (as opposed to analytical) approach
• Focus on understanding random variables and distributions
• Incorporate “big data”
• Introduce current approaches (coalescence, Bayesian analysis, etc.) in this context
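As one illustration of the numerical approach, genetic drift can be treated as repeated binomial sampling rather than derived analytically. The sketch below is not from the original slides; the function name `drift` and its default parameters are hypothetical choices made for illustration.

```r
# Wright-Fisher drift: each generation, the 2N gene copies of the next
# generation are a binomial sample at the current allele frequency.
drift <- function(p0 = 0.5, N = 100, gens = 100) {
  p <- numeric(gens + 1)
  p[1] <- p0
  for (t in seq_len(gens)) {
    p[t + 1] <- rbinom(1, 2 * N, p[t]) / (2 * N)
  }
  p
}

set.seed(42)
runs <- replicate(20, drift())   # one column per replicate population
matplot(runs, type = "l", lty = 1,
        xlab = "Generation", ylab = "Frequency of A")
```

Each line in the plot is one population, so the spread among lines shows the distribution of outcomes that a deterministic treatment hides.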
6. Why R?
• Open Source
• Platform-independent (Windows, Mac, Linux)
• Object oriented
• Facile Graphics
• Web-oriented
• Packages available for specialized functions
7. Where We are Going
• The Basics – Distributions, chi-square and the Hardy-Weinberg Equilibrium
• Simulating the Ewens-Watterson Distribution
• Coalescence and summary statistics
• What works and what doesn’t
13. Calculating chi-squared
The function
chixw <- function(obs,exp,df=1){
chi <-sum((obs-exp)^2/exp) # chi-square statistic
pr <-1-pchisq(chi,df) # upper-tail probability
c(chi,pr)
}
A sample function call
obs <-c(315,108,101,32)
z <-sum(obs)/16
exp <-c(9*z,3*z,3*z,z)
chixw(obs,exp,3)
The output
chi-square = 0.47
probability(<.05) = 0.93
deg. freedom = 3
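As a cross-check on the hand-written function, the same test can be run with base R's `chisq.test`, supplying the 9:3:3:1 ratio as expected proportions. This is a sketch for comparison and is not part of the original slides.

```r
# Same observed counts as in the sample call above
obs <- c(315, 108, 101, 32)

# Expected proportions for a 9:3:3:1 dihybrid ratio
res <- chisq.test(obs, p = c(9, 3, 3, 1) / 16)

res$statistic   # chi-square statistic
res$parameter   # degrees of freedom (number of classes minus 1)
res$p.value     # upper-tail probability
```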
14. Basic Hardy Weinberg Calculations
The Biallelic Case
Sample input
obs <-c(13,35,70)
hw(obs)
Output
[1] "p= 0.2585 q= 0.7415"
obs exp
[1,] 13 8
[2,] 35 45
[3,] 70 65
[1] "chi squared = 5.732 p = 0.017 with 1 d. f."
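The course function `hw` is not shown on the slides. The following is a minimal sketch of what such a function might compute, under the assumption that expected counts are rounded to whole numbers before the chi-square statistic is calculated (the table and statistic above are consistent with that). The name `hw_sketch` and the returned list are hypothetical.

```r
hw_sketch <- function(obs) {
  n <- sum(obs)
  p <- (2 * obs[1] + obs[2]) / (2 * n)   # frequency of the first allele
  q <- 1 - p
  # Expected genotype counts under Hardy-Weinberg; rounding is an
  # assumption inferred from the output shown on the slide
  exp <- round(n * c(p^2, 2 * p * q, q^2))
  chi <- sum((obs - exp)^2 / exp)
  pr <- 1 - pchisq(chi, df = 1)          # 3 classes - 1 - 1 estimated frequency
  list(p = p, q = q, table = cbind(obs, exp), chi = chi, pr = pr)
}

res <- hw_sketch(c(13, 35, 70))
res$table
round(c(p = res$p, q = res$q, chi = res$chi, pr = res$pr), 4)
```

Using unrounded expected counts would give a somewhat larger statistic, so the rounding choice matters when comparing against other software.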
15. Illustrating With Ternary Plots
library(HardyWeinberg)
dat <-(HWData(100,100))
gdist <-dat$Xt #create a variable with the working data
HWTernaryPlot(gdist, hwcurve=TRUE,addmarkers=FALSE,region=0,vbounds=FALSE,axis=2,
vertexlab=c("0","","1"),main="Theoretical Relationship",cex.main=1.5)
16. Access to Data
• Direct access of data
o HapMap
o Dryad
o Others
• Manipulation and visualization within R
• Preparation for export (e.g. GenAlEx)
19. And Determining the Number of Outliers
nsnps <- length(hwdist)
quant <-quantile(hwdist,c(.025,.975))
low <-length(hwdist[hwdist<quant[1]])
high <-length(hwdist[hwdist>quant[2]])
accept <-nsnps-low-high
low; accept; high
[1] 982
[1] 37330
[1] 976
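Empirical quantiles place about 2.5% of values in each tail by construction, which is why the low and high counts above are each close to 2.5% of the total. A complementary check is to count values beyond theoretical cutoffs. The sketch below assumes the statistic is approximately standard normal under Hardy-Weinberg and uses simulated values (`z_demo`, a hypothetical stand-in) rather than the chromosome 21 data.

```r
set.seed(1)
z_demo <- rnorm(10000)            # simulated stand-in for per-SNP z statistics

cuts <- qnorm(c(0.025, 0.975))    # theoretical cutoffs, about -1.96 and 1.96

# Classify each value as below, within, or above the theoretical bounds
counts <- table(cut(z_demo, c(-Inf, cuts, Inf),
                    labels = c("low", "accept", "high")))
counts
```

With real data, an excess in either tail relative to 2.5% would then be informative, whereas counts based on the sample's own quantiles cannot show one.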
20. Sampling and Plotting Deviation from Hardy Weinberg
chr21.poly <-na.omit(chr21.sum) #remove all NA's (fixed SNPs)
chr21.samp <-sample(nrow(chr21.poly),1000, replace=FALSE)
plot(chr21.poly$z.HWE[chr21.samp])
21. Plotting F for Randomly Sampled Markers
chr21.sub <-chr21.poly[chr21.samp,]
Hexp <- 2*chr21.sub$MAF*(1-chr21.sub$MAF)
Fi <- 1-(chr21.sub$P.AB/Hexp)
plot(Fi,xlab="Locus",ylab="F")
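The same calculation can be worked for a single locus. This sketch applies F = 1 - Hobs/Hexp to the genotype counts from the biallelic Hardy-Weinberg example (13, 35, 70); the variable names are illustrative.

```r
obs <- c(13, 35, 70)                    # counts of AA, Aa, aa
n <- sum(obs)
p <- (2 * obs[1] + obs[2]) / (2 * n)    # allele frequency

Hobs <- obs[2] / n                      # observed heterozygosity
Hexp <- 2 * p * (1 - p)                 # expected under Hardy-Weinberg
F_locus <- 1 - Hobs / Hexp              # positive F = heterozygote deficit
F_locus
```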
23. The Ewens-Watterson Test
• Based on Ewens (1977) derivation of the theoretical
equilibrium distribution of allele frequencies under the
infinite allele model.
• Uses expected homozygosity (Σp²) as test statistic
• Compares observed homozygosity in sample to expected
distribution in n random simulations
• Observed data are
o n = number of samples
o k = number of alleles
o Allele Frequency Distribution
24. Classic Data (Keith et al., 1985)
• Xdh in D. pseudoobscura, analyzed by sequential
electrophoresis
• 89 samples, 15 distinct alleles
25. Testing the Data
1. Input the Data
Xdh <- c(52,9,8,4,4,2,2,1,1,1,1,1,1,1,1) # vector of allele counts
k <- length(Xdh) # number of alleles = k
n <- sum(Xdh) # number of samples = n
2. Calculate Expected Homozygosity
Fx <-fhat(Xdh)
3. Run the Analysis
Ewens(n,k,Fx)
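The course functions `fhat` and `Ewens` are not shown on the slides. The sketch below illustrates one way the test could be simulated, and it is an assumption rather than the author's implementation: allele configurations are drawn from a Hoppe urn, and only draws with exactly k alleles are kept. Because k is a sufficient statistic for theta under the infinite alleles model, the retained draws do not depend on the value of theta used. All function names and the default `theta = 5` are hypothetical.

```r
# Homozygosity: sum of squared allele frequencies
homozygosity <- function(counts) sum((counts / sum(counts))^2)

# One draw of allele counts for n gene copies from a Hoppe urn
hoppe_urn <- function(n, theta) {
  counts <- integer(0)
  for (i in seq_len(n)) {
    # A new allele arises with probability theta / (theta + i - 1)
    if (runif(1) < theta / (theta + i - 1)) {
      counts <- c(counts, 1L)
    } else {
      # Otherwise copy an existing allele in proportion to its count
      j <- sample.int(length(counts), 1, prob = counts)
      counts[j] <- counts[j] + 1L
    }
  }
  counts
}

# Rejection sampling: keep only configurations with exactly k alleles
sim_homozygosity <- function(n, k, nsim = 1000, theta = 5) {
  out <- numeric(nsim)
  done <- 0
  while (done < nsim) {
    counts <- hoppe_urn(n, theta)
    if (length(counts) == k) {
      done <- done + 1
      out[done] <- homozygosity(counts)
    }
  }
  out
}

Xdh <- c(52, 9, 8, 4, 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1)
f_obs <- homozygosity(Xdh)
set.seed(1)
f_sim <- sim_homozygosity(sum(Xdh), length(Xdh), nsim = 200)
mean(f_sim >= f_obs)   # share of neutral draws at least as homozygous
```

Rejection sampling is simple but wasteful; choosing theta so that the expected number of alleles is near k keeps the acceptance rate reasonable.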
27. With Newer (and more complete) Data
Lactase Haplotypes in European and African Populations
1. Download data for Lactase gene from HapMap (CEU, YRI)
o 25 SNPs
o 48,000 KB
2. Determine numbers of haplotypes and frequencies for each
3. Apply Ewens-Watterson test to each
29. Some Basic Statistics from Sequence Data
library(seqinr) # package names are case-sensitive; the CRAN package is seqinr
library(pegas)
dat <-read.fasta(file="./Data/FGB.fas")
#additional code needed to rearrange data
sites <-seg.sites(dat.dna)
nd <-nuc.div(dat.dna)
taj <-tajima.test(dat.dna)
length(sites); nd;taj$D
[1] 23
[1] 0.007561061
[1] -0.7759744
Intron sequences, 433 nucleotides each
from Peters JL, Roberts TE, Winker K, McCracken KG (2012)
PLoS ONE 7(2): e31972. doi:10.1371/journal.pone.0031972
30. Coalescence I – A Bunch of Trees
trees <-read.tree("http://dl.dropbox.com/u/9752688/ZOO%20422P/R/msfiles/tree.1.txt")
plot(trees[1:9],layout=9)
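The variation among trees comes from the random waiting times between coalescent events. As a sketch that needs no external files (and is not the simulation used to produce the trees above), the waiting time while k lineages remain can be drawn as an exponential with rate k(k-1)/2, with time measured in units of 2N generations. The function name `coal_times` is hypothetical.

```r
# Waiting times between successive coalescent events for a sample of n,
# from n lineages down to 2; rate is choose(k, 2) while k lineages remain
coal_times <- function(n) rexp(n - 1, rate = choose(n:2, 2))

set.seed(1)
tmrca <- replicate(5000, sum(coal_times(10)))   # time to common ancestor

mean(tmrca)          # theory: 2 * (1 - 1/10) = 1.8
hist(tmrca, main = "Simulated TMRCA, n = 10", xlab = "Time (units of 2N generations)")
```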
32. Coalescence III – Summary Statistics
system("./ms 50 1000 -s 10 -L | ./sample_stats >samp.ss")
# 1000 simulations of 50 samples, with the number of segregating sites fixed at 10
ss.out <-read_ss("samp.ss")
head(ss.out)
        pi  S         D   thetaH         H
1 1.825306 10 -0.521575 2.419592 -0.594286
2 2.746939 10  0.658832 2.518367  0.228571
3 3.837551 10  2.055665 3.631837  0.205714
4 2.985306 10  0.964128 2.280000  0.705306
5 1.577959 10 -0.838371 5.728163 -4.150204
6 2.991020 10  0.971447 3.539592 -0.548571
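The D column can be reproduced from pi, S and the sample size using Tajima's (1989) formula, which may help students connect the summary statistics. This is a sketch; the function name `tajima_d` is hypothetical and is not part of `sample_stats` or the course functions.

```r
tajima_d <- function(pi_hat, S, n) {
  i <- seq_len(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  # Difference between the two estimates of theta, scaled by its
  # estimated standard deviation
  (pi_hat - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
}

# First and third simulated samples from the table above (n = 50, S = 10)
tajima_d(1.825306, S = 10, n = 50)
tajima_d(3.837551, S = 10, n = 50)
```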
33. Coalescence IV – Distribution of Summary Statistics
hist(ss.out$D,main="Distribution of Tajima's D (N=1000)",xlab="D")
abline(v=mean(ss.out$D),col="blue")
abline(v=quantile(ss.out$D,c(.025,.975)),col="red")
34. Other Uses
• Data Manipulation
o Conversion of HapMap data for use elsewhere (e.g. GenAlEx)
o Other data sources via APIs (e.g. package rdryad)
• Other Analyses
o Hierarchical F statistics (hierfstat)
o Haplotype networking (pegas)
o Phylogenetics (ape, phyclust, others)
o Approximate Bayesian Computation (abc)
• Access for students
o Scripts available via LMS
o Course-specific functions can be accessed (source("http://db.tt/A6tReYEC"))
o Notes with embedded code in HTML (RStudio, knitr)
36. Challenges
• Some coding required
• Data Structures are a challenge
• Packages are heterogeneous
• Students resist coding
37. Nevertheless
• Fundamental concepts can be easily visualized graphically
• Real data can be incorporated from the outset
• It takes students from fundamental concepts to real-world
applications and analyses
For Further information:
cochrabj@miamioh.edu
Functions
http://db.tt/A6tReYEC