Lecture 13: Population Structure October 8, 2012
Last Time Effective population size calculations Historical importance of drift: shifting balance or noise? Population structure
Today Course feedback The F-Statistics Sample calculations of $F_{ST}$ Defining populations on genetic criteria
Midterm Course Evaluations Based on five responses: It's not too late to have an impact! Lectures are generally OK. Labs are valuable, but better organization and more feedback are needed. Difficulty level is OK. Book is awful.
F-Coefficients Quantification of the structure of genetic variation in populations: population structure. Partition variation among the Total Population (T), Subpopulations (S), and Individuals (I).
F-Coefficients Combine different sources of reduction in expected heterozygosity into one equation:
$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$
$F_{IT}$: overall deviation from H-W expectations
$F_{ST}$: deviation due to subpopulation differentiation
$F_{IS}$: deviation due to inbreeding within populations
F-Coefficients and IBD View F-statistics as probability of Identity by Descent for different samples:
$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$
$F_{IT}$: overall probability of IBD
$F_{ST}$: probability of IBD for 2 individuals in a subpopulation
$F_{IS}$: probability of IBD within an individual
F-Statistics Can Measure Departures from Expected Heterozygosity Due to Wahlund Effect
$$F_{ST} = \frac{H_T - H_S}{H_T}$$
where $H_T$ is the average expected heterozygosity in the total population
$$F_{IS} = \frac{H_S - H_I}{H_S}$$
where $H_S$ is the average expected heterozygosity in subpopulations
$$F_{IT} = \frac{H_T - H_I}{H_T}$$
where $H_I$ is observed heterozygosity within a subpopulation
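The three ratios can be checked against the multiplicative relationship $1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$ with a few lines of Python. This is a minimal sketch; the heterozygosity values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (F_IS, F_ST, F_IT) from observed and expected heterozygosities."""
    f_is = (h_s - h_i) / h_s  # deficit of heterozygotes within subpopulations
    f_st = (h_t - h_s) / h_t  # deficit due to subpopulation differentiation
    f_it = (h_t - h_i) / h_t  # overall deficit
    return f_is, f_st, f_it


# Hypothetical values (not from the lecture): H_I, H_S, H_T
h_i, h_s, h_t = 0.30, 0.40, 0.50
f_is, f_st, f_it = f_statistics(h_i, h_s, h_t)

print("F_IS =", round(f_is, 4))
print("F_ST =", round(f_st, 4))
print("F_IT =", round(f_it, 4))
# Both sides of 1 - F_IT = (1 - F_ST)(1 - F_IS)
print(round(1 - f_it, 4), round((1 - f_st) * (1 - f_is), 4))
```

The identity holds for any input because $(H_S/H_T)(H_I/H_S) = H_I/H_T$.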
Calculating $F_{ST}$ Recessive allele for flower color: $B_2B_2$ = white; $B_1B_1$ and $B_1B_2$ = dark pink
Subpopulation 1 (White: 10, Dark: 10): F(white) = 10/20 = 0.5; $F(B_2)_1 = q_1 = \sqrt{0.5} = 0.707$; $p_1 = 1 - 0.707 = 0.293$
Subpopulation 2 (White: 2, Dark: 18): F(white) = 2/20 = 0.1; $F(B_2)_2 = q_2 = \sqrt{0.1} = 0.32$; $p_2 = 1 - 0.32 = 0.68$
Calculating $F_{ST}$ Calculate Average $H_E$ of Subpopulations ($H_S$). For 2 subpopulations:
$$H_S = \frac{\sum_i 2 p_i q_i}{2} = \frac{2(0.707)(0.293) + 2(0.32)(0.68)}{2} = 0.425$$
Calculate Average $H_E$ for Merged Subpopulations ($H_T$): F(white) = 12/40 = 0.3; $q = \sqrt{0.3} = 0.55$; $p = 0.45$
$$H_T = 2pq = 2(0.55)(0.45) = 0.495$$
Bottom Line: $F_{ST} = (H_T - H_S)/H_T = (0.495 - 0.425)/0.495 = 0.14$
14% of the total variation in flower color alleles is due to variation among populations
AND
Expected heterozygosity within subpopulations is 14% lower than in the merged population (Wahlund Effect)
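The flower color calculation can be reproduced in a short script. This is a sketch that assumes Hardy-Weinberg proportions within each group, so that the recessive allele frequency is the square root of the white phenotype frequency; the function names are illustrative. Because it carries full precision instead of the rounded intermediates (0.32, 0.55, 0.425), it gives $F_{ST} \approx 0.146$ rather than the 0.14 obtained from the rounded values.

```python
import math


def recessive_allele_freq(n_recessive, n_total):
    """q = sqrt(frequency of the recessive phenotype), assuming Hardy-Weinberg."""
    return math.sqrt(n_recessive / n_total)


def expected_het(q):
    """Expected heterozygosity 2pq for a biallelic locus."""
    return 2 * (1 - q) * q


# (white plants, total plants) for each subpopulation, from the example
counts = [(10, 20), (2, 20)]

# H_S: mean expected heterozygosity over subpopulations
q_sub = [recessive_allele_freq(white, n) for white, n in counts]
h_s = sum(expected_het(q) for q in q_sub) / len(q_sub)

# H_T: expected heterozygosity after merging the subpopulations
total_white = sum(white for white, _ in counts)
total_n = sum(n for _, n in counts)
h_t = expected_het(recessive_allele_freq(total_white, total_n))

f_st = (h_t - h_s) / h_t
print("H_S =", round(h_s, 3), "H_T =", round(h_t, 3), "F_ST =", round(f_st, 3))
```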
Nei's Gene Diversity: $G_{ST}$ Nei's generalization of $F_{ST}$ to multiple, multiallelic loci
$$G_{ST} = \frac{D_{ST}}{H_T}, \qquad D_{ST} = H_T - H_S$$
$$H_S = \frac{1}{m}\sum_{i=1}^{m}\left(1 - \sum_{j=1}^{n} p_{ij}^2\right)$$
where $H_S$ is the mean $H_E$ of $m$ subpopulations, calculated for $n$ alleles with frequency $p_{ij}$ (allele $j$ in subpopulation $i$)
$$H_T = 1 - \sum_{j} \bar{p}_j^2$$
where $\bar{p}_j$ is the mean frequency of allele $j$ over all subpopulations
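A minimal sketch of the $G_{ST}$ calculation for a single multiallelic locus follows. The allele frequencies are hypothetical, subpopulations are weighted equally, and the function names are illustrative. For several loci, one common approach is to average $H_S$ and $H_T$ across loci before taking the ratio, which is not shown here.

```python
def gene_diversity(freqs):
    """Expected heterozygosity 1 - sum(p_j^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def nei_gst(subpop_freqs):
    """Return (G_ST, H_S, H_T) for one locus.

    subpop_freqs: one list of allele frequencies per subpopulation,
    all in the same allele order.
    """
    m = len(subpop_freqs)
    n_alleles = len(subpop_freqs[0])
    # H_S: mean within-subpopulation gene diversity
    h_s = sum(gene_diversity(f) for f in subpop_freqs) / m
    # H_T: gene diversity computed from mean allele frequencies
    mean_freqs = [sum(f[j] for f in subpop_freqs) / m for j in range(n_alleles)]
    h_t = gene_diversity(mean_freqs)
    d_st = h_t - h_s
    return d_st / h_t, h_s, h_t


# Hypothetical three-allele locus in two subpopulations
freqs = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3]]
g_st, h_s, h_t = nei_gst(freqs)
print("H_S =", round(h_s, 4), "H_T =", round(h_t, 4), "G_ST =", round(g_st, 4))
```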
Unbiased Estimate of $F_{ST}$ Weir and Cockerham's (1984) Theta ($\theta$). Compensates for sampling error, which can cause large biases in $F_{ST}$ or $G_{ST}$ (e.g., if samples represent different proportions of populations). Calculated in terms of correlation coefficients. Often simply referred to as $F_{ST}$ in the literature. Calculated by FSTAT software: http://www2.unil.ch/popgen/softwares/fstat.htm
Goudet, J. (1995). "FSTAT (Version 1.2): A computer program to calculate F-statistics." Journal of Heredity 86(6): 485-486.
Weir, B.S. and C.C. Cockerham. (1984). "Estimating F-statistics for the analysis of population structure." Evolution 38: 1358-1370.
Linanthus parryae population structure Annual plant in the Mojave desert is a classic example of migration vs. drift. Allele for blue flower color is recessive. Use F-statistics to partition variation among regions, subpopulations, and individuals. $F_{ST}$ can be calculated for any hierarchy: $F_{RT}$: variation due to differentiation of regions; $F_{SR}$: variation due to differentiation among subpopulations within regions. Schemske and Bierzychudek 2007 Evolution
Linanthus parryae population structure
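Hartl and Clark 2007
$$H_S = \frac{1}{30}\sum_{i=1}^{30}\left(1 - \sum_{m} p_{im}^2\right)$$
$$H_R = \frac{1}{N}\sum_{r=1}^{3} N_r\left(1 - \sum_{m} p_{rm}^2\right)$$
$$H_T = 1 - \sum_{m} p_m^2$$
Here $i$ indexes the 30 subpopulations, $r$ the 3 regions, and $m$ the alleles.
$$F_{SR} = \frac{H_R - H_S}{H_R} = \frac{0.1589 - 0.1424}{0.1589} = 0.1036$$
$$F_{RT} = \frac{H_T - H_R}{H_T} = \frac{0.2371 - 0.1589}{0.2371} = 0.3299$$
$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{0.2371 - 0.1424}{0.2371} = 0.3993$$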
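The hierarchical coefficients can be recomputed from the three heterozygosities given for Linanthus. This sketch uses the rounded values of $H_S$, $H_R$, and $H_T$ as printed, so the results may differ from the printed coefficients in the last decimal place. It also checks that the levels combine multiplicatively, $(1 - F_{ST}) = (1 - F_{SR})(1 - F_{RT})$, by analogy with the $F_{IS}$, $F_{ST}$, $F_{IT}$ relationship.

```python
# Heterozygosities as printed for the Linanthus parryae example
h_s = 0.1424  # mean expected heterozygosity within subpopulations
h_r = 0.1589  # mean expected heterozygosity within regions
h_t = 0.2371  # expected heterozygosity in the total population

f_sr = (h_r - h_s) / h_r  # subpopulations within regions
f_rt = (h_t - h_r) / h_t  # regions within the total
f_st = (h_t - h_s) / h_t  # subpopulations within the total

print("F_SR =", round(f_sr, 4))
print("F_RT =", round(f_rt, 4))
print("F_ST =", round(f_st, 4))
# Both sides of (1 - F_ST) = (1 - F_SR)(1 - F_RT)
print(round(1 - f_st, 4), round((1 - f_sr) * (1 - f_rt), 4))
```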
$F_{ST}$ as Variance Partitioning Think of $F_{ST}$ as the proportion of genetic variation partitioned among populations
$$F_{ST} = \frac{V(q)}{\bar{p}\,\bar{q}}$$
where $V(q)$ is the variance of $q$ across subpopulations and $\bar{p}$, $\bar{q}$ are the mean allele frequencies. The denominator is the maximum amount of variance that could occur among subpopulations.
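The variance form can be compared with the heterozygosity form numerically. The sketch below reuses the two recessive allele frequencies from the flower color example ($q = \sqrt{0.5}$ and $q = \sqrt{0.1}$), weights the subpopulations equally, and uses the population variance (dividing by the number of subpopulations). Under those assumptions the two forms agree exactly. The result, about 0.153, differs slightly from the flower color example because there the merged $q$ was estimated from the pooled phenotype frequency rather than from the mean of the subpopulation allele frequencies.

```python
from statistics import mean, pvariance

# Recessive allele frequencies in the two subpopulations of the flower example
q = [0.5 ** 0.5, 0.1 ** 0.5]

q_bar = mean(q)
p_bar = 1 - q_bar
v_q = pvariance(q)  # variance of q across subpopulations (divides by n)

# Variance form
f_st_var = v_q / (p_bar * q_bar)

# Heterozygosity form, with H_T computed from the mean allele frequency
h_s = mean([2 * x * (1 - x) for x in q])
h_t = 2 * p_bar * q_bar
f_st_het = (h_t - h_s) / h_t

print(round(f_st_var, 4), round(f_st_het, 4))
```

The agreement follows from $H_T - H_S = 2V(q)$ for a biallelic locus with equally weighted subpopulations.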
Analysis of Molecular Variance (AMOVA) Analogous to Analysis of Variance (ANOVA). Use pairwise genetic distances as the response. Test significance using permutations. Partition genetic diversity into different hierarchical levels, including regions, subpopulations, and individuals. Many types of marker data can be used. Method of choice for dominant markers, sequences, and SNPs.
Phi Statistics from AMOVA
$$\phi_{CT} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}$$
Correlation of random pairs of haplotypes drawn from a region relative to pairs drawn from the whole population ($F_{RT}$)
$$\phi_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}$$
Correlation of random pairs of haplotypes drawn from an individual subpopulation relative to pairs drawn from a region ($F_{SR}$)
$$\phi_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}$$
Correlation of random pairs of haplotypes drawn from an individual subpopulation relative to pairs drawn from the whole population ($F_{ST}$)
http://www.bioss.ac.uk/smart/unix/mamova/slides/frames.htm
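Given the three variance components from an AMOVA table, the phi statistics are simple ratios. The sketch below uses hypothetical variance components (they do not come from any dataset in the lecture) and an illustrative function name. It also shows that the phi statistics combine in the same multiplicative way as the hierarchical F-statistics.

```python
def phi_statistics(var_a, var_b, var_c):
    """Phi statistics from AMOVA variance components.

    var_a: among regions
    var_b: among subpopulations within regions
    var_c: within subpopulations
    """
    total = var_a + var_b + var_c
    phi_ct = var_a / total
    phi_sc = var_b / (var_b + var_c)
    phi_st = (var_a + var_b) / total
    return phi_ct, phi_sc, phi_st


# Hypothetical variance components
phi_ct, phi_sc, phi_st = phi_statistics(0.2, 0.1, 0.7)
print("phi_CT =", round(phi_ct, 4))
print("phi_SC =", round(phi_sc, 4))
print("phi_ST =", round(phi_st, 4))
# Both sides of (1 - phi_ST) = (1 - phi_SC)(1 - phi_CT)
print(round(1 - phi_st, 4), round((1 - phi_sc) * (1 - phi_ct), 4))
```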
What if you don't know how your samples are organized into populations (i.e., you don't know how many source populations you have)? What if reference samples aren't from a single population? What if they are offspring from parents coming from different source populations (admixture)?
What's a population anyway?
Defining populations on genetic criteria Assume subpopulations are at Hardy-Weinberg equilibrium and linkage equilibrium. Probabilistically assign individuals to populations to minimize departures from equilibrium. Can allow for admixture (individuals with different proportions of each population) and geographic information. Bayesian approach using a Markov chain Monte Carlo (MCMC) method to explore parameter space. Implemented in the STRUCTURE program. Londo and Schaal 2007 Mol Ecol 16:4523
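The core likelihood idea can be illustrated with a much simpler calculation than the full Bayesian model. The sketch below is not the STRUCTURE algorithm: it assumes the allele frequencies of each candidate population are already known, assumes Hardy-Weinberg and linkage equilibrium within populations, gives every population equal prior weight, and ignores admixture. All frequencies, genotypes, and function names are hypothetical.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs):
    """Log probability of a multilocus diploid genotype in one population.

    genotype: list of (allele1, allele2) tuples, one per locus
    allele_freqs: list of dicts mapping allele -> frequency, one per locus
    Assumes HWE within loci and independence among loci; frequencies must be > 0.
    """
    log_lik = 0.0
    for (a1, a2), freqs in zip(genotype, allele_freqs):
        p1, p2 = freqs[a1], freqs[a2]
        prob = p1 * p1 if a1 == a2 else 2 * p1 * p2
        log_lik += math.log(prob)
    return log_lik


def assignment_probabilities(genotype, populations):
    """Probability of origin in each population, assuming equal priors."""
    log_liks = {name: genotype_log_likelihood(genotype, freqs)
                for name, freqs in populations.items()}
    top = max(log_liks.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(ll - top) for name, ll in log_liks.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical allele frequencies at two loci in two populations
populations = {
    "A": [{"1": 0.9, "2": 0.1}, {"1": 0.8, "2": 0.2}],
    "B": [{"1": 0.2, "2": 0.8}, {"1": 0.3, "2": 0.7}],
}
individual = [("1", "1"), ("1", "2")]
probs = assignment_probabilities(individual, populations)
print({name: round(p, 4) for name, p in probs.items()})
```

In the full model the allele frequencies and the population memberships are both unknown and are estimated jointly, which is why an MCMC search of parameter space is needed.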
Example: Taita Thrush data* Three main sampling locations in Kenya Low migration rates (radio-tagging study) 155 individuals, genotyped at 7 microsatellite loci Slide courtesy of Jonathan Pritchard
Estimating K Structure is run separately at different values of K. The program computes a statistic that measures the fit of each value of K (sort of a penalized likelihood); this can be used to help select K.
Taita thrush data:
Assumed value of K    Posterior probability of K
1                     ~0
2                     ~0
3                     0.993
4                     0.007
5                     0.00005
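One way to turn a per-K fit statistic into posterior probabilities is Bayes' rule with a uniform prior over the candidate values of K: each posterior probability is proportional to the exponential of the estimated log probability of the data at that K. The sketch below assumes that approach. The log values are hypothetical placeholders, not the Taita thrush estimates, and the function name is illustrative.

```python
import math


def posterior_over_k(log_prob_data):
    """Posterior probability of each K under a uniform prior on K.

    log_prob_data: dict mapping K -> estimated ln Pr(data | K)
    """
    top = max(log_prob_data.values())  # subtract the maximum to avoid underflow
    weights = {k: math.exp(v - top) for k, v in log_prob_data.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical ln Pr(data | K) estimates for K = 1..5
log_prob_data = {1: -3050.0, 2: -3000.0, 3: -2950.0, 4: -2955.0, 5: -2960.0}
posterior = posterior_over_k(log_prob_data)
for k in sorted(posterior):
    print(k, f"{posterior[k]:.5f}")
```

Because the weights are exponentials of log probabilities, a difference of only a few log units is enough to concentrate nearly all posterior mass on a single K.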
Another method for inference of K The ΔK method of Evanno et al. (2005, Mol. Ecol. 14: 2611-2620). Eckert, Population Structure, 5-Aug-2008
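The ΔK statistic is based on the second-order rate of change of the log likelihood L(K) across successive values of K, scaled by the standard deviation of L(K) among replicate runs: $\Delta K = |L(K+1) - 2L(K) + L(K-1)| / s[L(K)]$. The sketch below assumes the second difference is taken on the mean L(K) over replicates, and that consecutive integer values of K were run with at least two replicates each. The replicate values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(runs):
    """Delta K for each interior K.

    runs: dict mapping K -> list of L(K) values from replicate runs.
    Requires consecutive integer K and non-identical replicates (stdev > 0).
    """
    ks = sorted(runs)
    mean_l = {k: mean(runs[k]) for k in ks}
    result = {}
    for k in ks[1:-1]:  # the smallest and largest K have no second difference
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        result[k] = second_diff / stdev(runs[k])
    return result


# Hypothetical L(K) values from three replicate runs per K
runs = {
    1: [-3050.0, -3051.0, -3049.0],
    2: [-3000.0, -3002.0, -2998.0],
    3: [-2950.0, -2951.0, -2949.0],
    4: [-2955.0, -2960.0, -2950.0],
    5: [-2960.0, -2970.0, -2950.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print({k: round(v, 2) for k, v in dk.items()}, "best K:", best_k)
```

Note that ΔK cannot be computed for the smallest or largest K tested, so it cannot by itself support K = 1.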
Inferred population structure [Figure: bar plot of individuals grouped by region: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America] "Each individual is a thin vertical line that is partitioned into K colored segments according to its membership coefficients in K clusters." Rosenberg et al. 2002 Science 298: 2381-2385
Inferred population structure regions Rosenberg et al. 2002 Science 298: 2381-2385