OEB 242: Population Genetics
Statistics Review

HYPOTHESIS TESTING

A null hypothesis has two parts: substantive (what values do we expect if nothing interesting is happening?) and formal (how much deviation from the expected values do we allow?).

Some exemplars:
  $H_0$: Alleles at locus A and locus B assort independently; thus any deviation from a 1:1:1:1 gametic ratio is no greater than could be explained by chance alone at the α = .05 level.
  $H_0$: The population is in Hardy-Weinberg equilibrium; thus any deviation from a $p^2 : 2pq : q^2$ genotypic ratio (1:2:1 when p = q = 1/2) is no greater than could be explained by chance alone at the α = .1 level.

The p-value represents P(data at least as extreme as those observed | $H_0$).

"Statistics means never having to say you're certain":
  Must specify a significance threshold.
  Fail to reject $H_0$ (why not "accept $H_0$"?)
  Reject $H_0$: what can you therefore conclude (if anything)?

Degrees of freedom are critical for connecting the test statistic to a p-value in, e.g., a chi-squared test. Given a contingency table, the d.f. represents the minimum number of entries necessary to repopulate the entire table, holding constant what is known about the dataset (the total number of datapoints and the proportions of datapoints that fall into one class or another). Find it by taking:
  d.f. = (total number of classes of data) - 1 (for fixing $N_{tot}$) - 1 (for every independent parameter estimated when furnishing expected values)

RANDOM VARIABLES

A random variable (r.v.) is an unspecified value that takes on actual values according to a probability distribution.
  The mean, or expected value, is a weighted average of the possible values an r.v. can take.
  The variance is the expected value of the squared deviations from the mean: $\mathrm{Var}(X) = E[(X - \mu)^2]$

We have used a few different kinds of random variables in this course:

Binomial random variables represent the number of successes in n independent trials, each of which has probability of success p (and probability of failure q = 1 - p). An example is the Wright-Fisher model of drift, where we imagine reproduction as sampling from an infinite pool of gametes.
  $P(X = k \mid X \sim \mathrm{Bin}(n, p)) = \binom{n}{k} p^k q^{n-k}$
  Mean = np; var = npq

Poisson random variables are the limit of binomial random variables with large n and small p (with λ = np). They are computationally more tractable and are useful for describing scenarios where you have very many chances to do something rare. Mutations, for example, are modeled as a Poisson process.
  $P(X = k \mid X \sim \mathrm{Pois}(\lambda)) = \frac{\lambda^k e^{-\lambda}}{k!}$
  Mean = var = λ

Geometric random variables represent the number of failures before getting one success, where each trial succeeds with probability p. The Kingman coalescent, for example, imagines non-coalescence as a failure with probability q = 1 - p and coalescence as a success, where p is the frequency of the gene copy in question (1/(2N) for a single copy in a diploid population of size N).
  $P(X = k \mid X \sim \mathrm{Geom}(p)) = q^k p$
  Mean = q/p; var = $q/p^2$

Exponential random variables are the continuous analogues of geometric random variables. The Kingman coalescent often uses this approximation, which holds when the population is large. Because the variable is continuous, the formula below is a probability density rather than a probability.
  $f(x \mid X \sim \mathrm{Exp}(\lambda)) = \lambda e^{-\lambda x}, \quad x \ge 0$
  Mean = $\lambda^{-1}$; var = $\lambda^{-2}$

There are a few different ways to talk about the dependencies of random variables:

Covariance is an analogue of variance for two random variables. It describes the extent to which two r.v.s track each other: if I change one, how does the other change? Covariance is the expected value of the products of the deviations from the means:
  $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$

The correlation coefficient is a scaled version of covariance that falls between -1 (perfectly anticorrelated) and 1 (perfectly correlated). To normalize, divide the covariance by the product of the standard deviations (i.e., the square roots of the variances) of the two random variables.

The slope of the regression line is slightly different: it measures the steepness of the association between two r.v.s (how much Y changes per unit change in X, i.e. $\mathrm{Cov}(X, Y) / \mathrm{Var}(X)$), whereas correlation measures the precision or tightness of that association. Recall, for example, that the slope of the regression line between midparent and offspring gives the narrow-sense heritability.

Good luck on the exam, try the review problems on the website, and don't forget to send in your final papers via email by 6PM on Weds 5/6 (end of reading period).
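The degrees-of-freedom rule from the hypothesis-testing notes above can be sketched for a Hardy-Weinberg test. This is a minimal illustration, not a prescribed course method: the function name `hwe_chi_squared` and the genotype counts are hypothetical, and the p-value shortcut via `math.erfc` is valid only because this particular test has exactly 1 d.f.

```python
import math

def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test of Hardy-Weinberg proportions.

    Returns (chi2, df, p_value). Hypothetical helper for illustration.
    """
    n = n_AA + n_Aa + n_aa
    # The allele frequency p is estimated from the data, which costs one d.f.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    observed = [n_AA, n_Aa, n_aa]
    expected = [n * p * p, n * 2 * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # 3 genotype classes - 1 (fixed total) - 1 (estimated p) = 1 d.f.
    df = 3 - 1 - 1
    # For 1 d.f. only: P(chi-squared > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, df, p_value

# Made-up counts: 30 AA, 40 Aa, 30 aa (too few heterozygotes for p = 0.5).
chi2, df, p_value = hwe_chi_squared(30, 40, 30)
print(chi2, df, round(p_value, 4))  # chi2 = 4.0 with 1 d.f., p about 0.0455
```

With these counts the test rejects $H_0$ at α = .05 and at α = .1, which shows why the threshold must be stated before looking at the p-value.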
OEB 242: Population Genetics
Chapter 4: Mutation and Neutral Theory

Throughout, N is the (diploid) population size, µ is the mutation rate per generation, and Θ = 4Nµ.

Infinite alleles model

Assume each mutation creates a new allele. Hence, homozygosity implies identity-by-descent (autozygosity).
  F = homozygosity = probability that two randomly chosen chromosomes (alleles) are IBD
  $F_t = (1 - \mu)^2 \left[ \frac{1}{2N} + \left(1 - \frac{1}{2N}\right) F_{t-1} \right]$
  Equilibrium value of F (mutation-drift balance; $F_t = F_{t-1}$): $\hat{F} \approx \frac{1}{1 + 4N\mu} = \frac{1}{1 + \Theta}$
  H = heterozygosity = 1 - F
  Equilibrium value of H (mutation-drift balance; $H_t = H_{t-1}$): $\hat{H} \approx \frac{4N\mu}{1 + 4N\mu} = \frac{\Theta}{1 + \Theta}$

Infinite sites model

Assume each mutation affects one base at a previously unmutated site, and that there is no recombination.

Used with the Kingman coalescent, in which the waiting time while k lineages remain is approximately exponential:
  $T_k \sim \mathrm{Exp}\left(\frac{k(k-1)}{4N}\right); \quad E[T_k] = \frac{4N}{k(k-1)}$ (so $E[T_2] = 2N$ generations for a pair of sequences)

Use it to predict the number of mutations separating two sequences (= pairwise diversity or per-site heterozygosity, Π) by multiplying the mutation rate by the length of the two branches:
  $E[\Pi] = 2\mu \, E[T_2] = 4N\mu = \Theta$

Use it to predict the number of segregating sites S in a sample of k alleles by multiplying the mutation rate by the total length of the tree:
  $E[S] = \Theta \sum_{i=1}^{k-1} \frac{1}{i}$

The neutral theory

The infinite sites model lets us estimate Θ in several different ways, which forms the basis of Tajima's D and other neutrality tests.
  $D = \frac{\hat{\Theta}_\Pi - \hat{\Theta}_S}{\sqrt{\widehat{\mathrm{Var}}(\hat{\Theta}_\Pi - \hat{\Theta}_S)}}$, where $\hat{\Theta}_\Pi = \Pi$ and $\hat{\Theta}_S = S \big/ \sum_{i=1}^{k-1} \frac{1}{i}$

The denominator is a normalizing factor that is difficult to solve analytically. The numerator tells us whether using pairwise diversity or the total number of segregating sites gives a greater estimate of Θ.
  If D is positive, this suggests a surplus of intermediate-frequency alleles (which inflate pairwise diversity disproportionately). This is consistent with deep/ancient coalescent times, and suggests balancing selection or admixture.
  If D is negative, this suggests a surplus of rare alleles (which inflate the number of segregating sites disproportionately). This is consistent with shallow/recent coalescent times, and suggests directional selection (e.g., a recent sweep) or population growth.

In the neutral theory, the probability of fixation of an allele is its frequency. As a corollary, the fixation rate is the neutral mutation rate, independent of population size. The probability that any one new allele fixes is 1/(2N), and the population-wide rate of new mutations is 2Nµ. The product of these two values is simply µ.
  The average time between fixation events is therefore 1/µ generations.
  The expected time to fixation of a new neutral allele, given that it will eventually fix, is 4N generations.
  The expected time to loss of a new neutral allele, conditional on its eventual loss, is 2 ln(2N) generations.
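The two estimates of Θ that feed the numerator of Tajima's D can be sketched on a toy sample. This is an illustration under stated assumptions: haplotypes are coded as equal-length strings of "0" (ancestral) and "1" (derived), the function name `theta_estimates` and the sample data are made up, and the normalizing denominator of D is deliberately not computed.

```python
from itertools import combinations

def theta_estimates(haplotypes):
    """Return (pi, theta_s, numerator) for a list of equal-length 0/1 strings.

    pi      : mean number of pairwise differences (estimates Theta)
    theta_s : S divided by sum_{i=1}^{k-1} 1/i (also estimates Theta)
    Hypothetical helper; only the numerator of Tajima's D is formed.
    """
    k = len(haplotypes)
    length = len(haplotypes[0])
    # A site is segregating if more than one state appears in the sample.
    S = sum(1 for site in range(length)
            if len({h[site] for h in haplotypes}) > 1)
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(x != y for x, y in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)
    a_k = sum(1 / i for i in range(1, k))
    theta_s = S / a_k
    return pi, theta_s, pi - theta_s

# Every variant is a singleton (rare alleles): numerator is negative.
rare = ["0000", "0001", "0010", "0100"]
# Both variants sit at 50% frequency (intermediate alleles): numerator is positive.
intermediate = ["00", "00", "11", "11"]

print(theta_estimates(rare))
print(theta_estimates(intermediate))
```

The sign of the third returned value matches the interpretation in the notes: singletons add to S faster than to Π, while intermediate-frequency variants do the opposite.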
OEB 242: Population Genetics
Chapter 7: Molecular Population Genetics

Here, we are looking at timescales that don't allow us to invoke the infinite sites/alleles models. Generally, we need to account for the possibility of multiple mutations at the same site.

Distinguish d (the number of differences observed) from k (the number of substitutions inferred).

The procedure for inferring substitutions from differences depends on whether we are talking about amino acids or about nucleotides, and on what assumptions we make about the probabilities of various mutations.

The Jukes-Cantor model, for example, assumes that all mutations are equally probable. From this assumption, one can establish a recurrence equation for the probability of a site taking on a given identity, which can then be translated into a differential equation and solved to find an estimator for k based on d.

We can test for selection in protein-coding regions by comparing either the rate of differences (dn/ds) or the rate of substitutions (Ka/Ks) for non-synonymous (amino-acid-changing) mutations versus synonymous mutations, per site. Mutations that change protein structure are presumably more visible to selection than those that do not.

In calculating these statistics, one must account for the number of sites that are potentially synonymous or non-synonymous. A twofold degenerate site is often counted as 2/3 non-synonymous, 1/3 synonymous.
  Under neutrality, the ratio is approximately 1.
  Under purifying (negative) selection, the ratio may be less than 1. (Changes to the protein are not tolerated by selection and are removed.)
  Under positive selection, the ratio may be greater than 1. (Changes to the protein are favored and accumulate at an accelerated rate.)

The molecular clock assumes that the number of substitutions is directly proportional to the amount of evolutionary time that separates two sequences.
  It is critical to realize, however, that $T_{MRCA}$ is half of this value (because the two lineages together sum to the total amount of evolutionary time).
  In general, substitution rates do vary across organisms, across genomic regions, etc., and so the clock is not always constant. But the point remains that we can quantify these rates and then make assumptions about them in order to interpret their significance.

The McDonald-Kreitman test assumes that, under neutrality, a molecular-clock-type assumption will preserve the dn/ds ratio across large evolutionary timescales. The test quantifies this by comparing the dn/ds for recent, microevolutionary events (which give rise to polymorphism within a population or species) with the dn/ds for ancient, macroevolutionary events (which give rise to differences between species).

The test is most straightforwardly implemented as a chi-squared test using a contingency table of non-synonymous/synonymous and polymorphic/divergent mutations. (In this case, there are four data classes and three fixed parameters: $N_{tot}$, % NS vs. S, and % P vs. D, meaning that we have 1 d.f.)

If polymorphism is not proportional to divergence, we must interpret:
  If dn/ds for divergence > dn/ds for polymorphism: suggests (e.g.) positive selection between species.
  If dn/ds for divergence < dn/ds for polymorphism: suggests (e.g.) purifying selection between species, or balancing selection within the population.
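The McDonald-Kreitman contingency-table test described above can be sketched as follows. The counts are invented for illustration (they are not from any real dataset), the function name `mk_test` is hypothetical, and the p-value again uses the `math.erfc` identity that holds only for 1 d.f. No continuity correction is applied, which a real analysis of small counts might want.

```python
import math

def mk_test(dn, ds, pn, ps):
    """Chi-squared test on a 2x2 McDonald-Kreitman table.

    dn, ds : non-synonymous / synonymous fixed differences (divergence)
    pn, ps : non-synonymous / synonymous polymorphisms
    Returns (chi2, p_value, dn/ds, pn/ps). Hypothetical helper.
    """
    table = [[dn, pn], [ds, ps]]            # rows: NS, S; columns: D, P
    n_tot = dn + ds + pn + ps
    row_totals = [dn + pn, ds + ps]
    col_totals = [dn + ds, pn + ps]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if NS/S status is independent of D/P status.
            expected = row_totals[i] * col_totals[j] / n_tot
            chi2 += (table[i][j] - expected) ** 2 / expected
    # 4 classes - 3 fixed parameters = 1 d.f., so erfc gives the tail area.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, dn / ds, pn / ps

# Made-up counts: 20 NS and 20 S fixed differences; 10 NS and 40 S polymorphisms.
chi2, p_value, div_ratio, poly_ratio = mk_test(20, 20, 10, 40)
print(chi2, round(p_value, 4), div_ratio, poly_ratio)  # chi2 = 9.0, p about 0.0027
```

Here the divergence ratio (1.0) exceeds the polymorphism ratio (0.25) and the table departs significantly from proportionality, the pattern the notes associate with positive selection between species.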
OEB 242: Population Genetics
Chapter 8: Evolutionary Quantitative Genetics

The Mendelian paradigm is monogenic, but we want to be able to talk about polygenic (complex, quantitative) traits.

We want to be able to ask questions like, "how much do genetics influence the phenotype?" Unfortunately, this is an ill-formed question, and there's no way to talk about how "genetic" a trait is in the abstract. We have to ground our discussion in populations.

This opens the door for the concept of heritability.
  Technical definition: proportion of phenotypic variance attributable to genetic variance.
  Interpretation: extent to which genetic differences among individuals explain phenotypic differences among individuals.
  It doesn't tell us, e.g., how many genes are involved in a trait, but it does help us understand the relative contributions of genetics and environment for a given population.

There are two ways to get a hold on it: by measuring variance or by calculating dominance coefficients.

Variance

One approach involves quantifying heritability by looking at the relationships among the variances of various quantities (e.g. phenotype) and how they are related from one individual to its family members.

Variance decomposition:
  $V_P = V_G + V_E + V_{GE}$
  (Variance due to genetics, environment, and gene-environment interactions together explain the total phenotypic variance.)
  $V_G = V_A + V_D + V_I$
  (Variance due to genetics, in turn, is explained by the variance due to additive allele effects, dominance effects, and epistatic interactions.)

  Broad sense: $H^2 = V_G / V_P$
  Narrow sense: $h^2 = V_A / V_P$
  $h^2$ = slope of the regression of mean offspring value on mean parent (midparent) value (Visscher, Hill & Wray, 2008)

We can use the narrow-sense heritability to predict the response to selection.
  "Breeder's equation": $R = h^2 S$, where
  S = selection differential = (mean phenotypic value of the breeding population) - (mean phenotypic value of the entire population), and
  R = response to selection = the change in the mean phenotypic value of the population after one generation of selection.

Dominance

Another approach to quantitative genetics posits two values, a and d, which can be used to represent the strength of dominance and the relationship among the phenotypic values associated with each genotypic class.

We assume the mean phenotypes (also called "genotypic values") of AA, AA' and A'A' are a, d, and -a, respectively, and that alleles A and A' have frequencies p and q = 1 - p.

Then, using HWE proportions, we calculate the population mean (which depends critically on genotype frequencies). The mean is $p^2 a + 2pqd - q^2 a$, which simplifies to $a(p - q) + 2pqd$.

We can then describe our genotypic values as deviations from the population mean. We say: (pop mean) + (something) = (genotypic value); therefore (something) = (genotypic value) - (pop mean). Now "something" is our genotypic value expressed as a deviation from the population mean (shown on the left side of the accompanying diagram).

We can then calculate a new statistic, the breeding value of each genotype. The breeding value of an allele represents the phenotypic contribution it would make if it were strictly additive, and hence the breeding value of a genotype equals the sum of the breeding values of its constituent alleles. Thus, if there is no dominance and all effects are purely additive, genotypic values are equal to breeding values. We could calculate the per-allele effect on phenotype to get breeding values, or we could project our genotypic values onto their least-squares regression line, as in the diagram.

On the right side of the diagram, we express breeding values as deviations from the population mean. As with the genotypic values, this convention makes analysis easier.

Because of the definition of breeding values given above, we can get $V_A$ by looking at the variance of the breeding values.

We can compare the breeding values to the genotypic values to get the dominance deviations. (As suggested above, when there is no dominance, breeding values = genotypic values and hence dominance deviation = 0.) In the diagram, these values appear in blue, and represent the distance between the genotypic values (black circles) and the breeding values (white circles).

We can look at the variance of the dominance deviations to calculate $V_D$.
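The regression construction of breeding values and dominance deviations described above can be sketched numerically for one locus. This is an illustration under stated assumptions: Hardy-Weinberg genotype frequencies, genotypic values a, d, -a as in the notes, 0 < p < 1, and a hypothetical function name `decompose`. The regression is a frequency-weighted least-squares fit of genotypic value on the number of A alleles carried.

```python
def decompose(p, a, d):
    """Split one-locus genotypic values into breeding values and dominance deviations.

    p    : frequency of allele A (must satisfy 0 < p < 1)
    a, d : genotypic values of AA and AA' (A'A' has value -a)
    Hypothetical helper; returns a dict of the derived quantities.
    """
    q = 1.0 - p
    freqs = [p * p, 2 * p * q, q * q]   # HWE frequencies of AA, AA', A'A'
    values = [a, d, -a]                 # genotypic values
    counts = [2, 1, 0]                  # copies of allele A per genotype

    mean = sum(f * g for f, g in zip(freqs, values))
    mean_count = sum(f * x for f, x in zip(freqs, counts))
    var_count = sum(f * (x - mean_count) ** 2 for f, x in zip(freqs, counts))
    cov = sum(f * (x - mean_count) * (g - mean)
              for f, x, g in zip(freqs, counts, values))

    # Slope of the weighted regression = average effect of swapping A' for A.
    slope = cov / var_count
    # Breeding values, expressed as deviations from the population mean.
    breeding = [slope * (x - mean_count) for x in counts]
    # Dominance deviation = (genotypic value - mean) - breeding value.
    dominance = [(g - mean) - b for g, b in zip(values, breeding)]

    v_a = sum(f * b * b for f, b in zip(freqs, breeding))
    v_d = sum(f * dev * dev for f, dev in zip(freqs, dominance))
    return {"mean": mean, "slope": slope, "breeding": breeding,
            "dominance": dominance, "V_A": v_a, "V_D": v_d}

result = decompose(p=0.5, a=1.0, d=0.5)
print(result["mean"], result["V_A"], result["V_D"])  # 0.25, 0.5, 0.0625
```

Setting d = 0 makes every dominance deviation zero, so breeding values equal genotypic deviations and $V_D = 0$, as the notes state for the purely additive case.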