Australian bird data set comparison between Arlequin and other programs

Peter Beerli, Kevin Rowe

March 7, 2006

1 Data set

We used a data set of Australian birds in 5 populations. Kevin ran the program Arlequin over all populations and all 10 loci. Results are in the Appendix. Several interesting things popped up:

1.1 Polymorphic sites versus substitutions

Arlequin reports 57 segregating sites (variable sites; listed as "Number of substitutions" in the Appendix) that translate into 53 polymorphic sites. Obviously only 4 mutations were singletons. This is somewhat striking given the data set. What happened to the missing sites?

1.2 Population size θ estimates

Arlequin reports several values for the population size, and it seems to use only the polymorphic sites. That is wrong: even for the Watterson estimator

\[ \theta = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} \tag{1} \]

we should use all segregating sites. Using (1) with 57 segregating sites in 96 samples, we calculate $\theta = 11.0974$, versus a reported value of 10.3186 based on 53 polymorphic sites. The results do not look very different, also when we calculate the population size per site instead of per locus:

\[ \Theta_{(57,96)} = \theta_{(57,96)} / 3766 = 0.00294673 \]
\[ \Theta_{(53,96)} = \theta_{(53,96)} / 3766 = 0.00273994 \]

Arlequin treated all loci as blocks in a single locus and so errs in the analysis [add here later with the per locus analysis]. Another problem seems to be whether to use all sites (3766) or
usable sites (867). I bet that several papers are out there that are unclear about how to calculate the population size per site.

Calculating population sizes under different population models is feasible if we know the variance of the number of offspring. Using the values above and the relationship that the population size in a Cannings model is $\theta = N_e \mu / \sigma^2$, and taking the Wright-Fisher population size $\theta_{WF}$ as a yardstick, we obtain the relationship

\[ \theta_C = \frac{\theta_{WF}}{\sigma^2} \]

The Wright-Fisher model uses an offspring-number variance of 1, so both sizes are the same. The population size $\theta_M$ of a Moran model is half the size of a Wright-Fisher model. $\theta_{WF}$ was derived under the assumption that all parents die every generation. We can express the survival into the next generation with a ratio $N_e k/(2N)$, with $k$ individuals dying every time unit. In a Moran model $k = 1$, so $N_e/(2N) = 1/2$, and we get

\[ \theta_M = \theta_{WF}/2 = \theta_{WF}\, 2/(2N) \]

An example of the influence of the model on the estimate: $\Theta_{WF} = 0.00274$; $\Theta_C = 0.274$ with $\sigma^2 = 0.01$, or $\Theta_C = 0.0000274$ with $\sigma^2 = 100$; and $\Theta_M = 0.00274/(2.919 \times 10^{-6}) = 1876.82$, with $\mu = 10^{-9}$ per site and generation. Translating all this into effective population size using $\mu$ results in $N_{WF} = 684{,}985$; $N_C = 6850$ and $N_C = 68{,}498{,}500$, respectively; and $N_M = 1{,}876{,}820{,}000{,}000$. Of course, we know that we are not waist-deep in these birds, so the Moran model is probably an unlikely candidate, but the Cannings model with low variance might describe the total population size of these birds better.

[I want to address differences due to estimation method and also look at the variance due to loci; this will all come in a followup]

2 Appendix

2.1 All populations together

RUN NUMBER 1 (07/03/06 at 15:41:17)

Project information:
NbSamples = 1
DataType = DNA
GenotypicData = 0

==============================
Settings used for Calculations
==============================

General settings:
-----------------
Deletion Weight = 1
Transition Weight Weight = 1
Tranversion Weight Weight = 1
Epsilon Value = 1e-06
Significant digits for output = 5
Use original haplotype definition
Alllowed level of missing data = 0.05

Active Tasks:
-------------
Molecular Diversity:
Molecular Distance :Pairwise difference
GammaA Value = 0
Theta estimators :
Theta(Hom)
Theta(S)
Theta(k)
Theta(Pi)
Tajima's selective neutrality test
---------------
Ewens-Watterson neutrality test
------------
No. of Simultated Samples = 10000
Fu's Neutraliy test:
No. of Simultated Samples = 10000

Warning: The locus separator has been removed
-------

==============================================================================
== ANALYSES AT THE INTRA-POPULATION LEVEL
==============================================================================

===============================================================================
== Sample : AllBirds
===============================================================================

================================
== Molecular diversity indices : (AllBirds)
================================

Reference: Tajima, F., 1983.
Tajima, F. 1993.
Nei, M., 1987.
Zouros, E., 1979.
Ewens, W.J. 1972.

Sample size : 96
No. of haplotypes : 96
Deletion weight : 1
Transition weight : 1
Transversion weight : 1
Allowed level of missing data : 5 %
Number of observed transitions : 36
Number of observed transversions : 21
Number of substitutions : 57
Number of observed indels : 0
Number of polymorphic sites : 53
Number of observed sites with transitions : 35
Number of observed sites with transversions : 21
Number of observed sites with substitutions : 53
Number of observed sites with indels : 0
Number of observed nucleotide sites : 3766
Number of usable nucleotide sites : 867

Nucleotide composition (Relative values)
C : 24.97%
T : 24.79%
A : 27.25%
G : 22.99%
Total :100.00%

Distance method : Pairwise difference (no Gamma correction)
Mean number of pairwise differences : 8.968640 +/- 4.167432
Nucleotide diversity : 0.010344 +/- 0.005324
(Standard deviations are for both the sampling and the stochastic processes)

Unable to compute Theta(Hom) when all gene copies are different
Unable to compute Theta(k) when all gene copies are different
Theta(S) : 10.318618
S.D. Theta(S) : 2.846638
Theta(Pi) : 8.968640
S.D. Theta(Pi) : 4.616216

==========================================
== Tajima's test of selective neutrality : (AllBirds)
==========================================

Reference: Tajima, F. 1989a.
Tajima, F., 1996.

Sample size : 96
No. of sites with substitutions (S) : 53
Mean No. of pairwise differences (Pi) : 8.96864
Distance method : Pairwise difference (no Gamma correction, indels not taken into account)
Tajima's D : -0.41913
P(D random < D obs) : 0.35961 (Beta distribution aproximation)
No. of simulations : 10000
Obs. Theta(S) : 10.31862
Mean Theta(S) : 10.30855
S.D. Theta(S) : 2.91849
Mean D : -0.08598
S.D. D : 0.90049
P(D simul < D obs) : 0.39370

==================================================
== Ewens-Watterson tests of selective neutrality : (AllBirds)
==================================================

Reference: Ewens, W.J. 1972.
Watterson, G., 1975.
Stewart, F. M. 1977.
Slatkin, M. 1994b.
Slatkin, M., 1996.

No. of genes in sample : 96.00000
No. of alleles in sample : 96
The test is impossible because all gene copies are different

===============================================
== Fu's Fs test of selective neutrality : (AllBirds)
===============================================

Reference: Fu, Y. X. (1996).

Original No. of alleles(k) : 96
Theta(Pi) : 8.96864
Exp. No. of alleles : 22.52870
Fs : -11.22769
No. of simulations :10000
Mean Theta(Pi) : 9.00305
S.D. Theta(Pi) : 4.64860
Mean k : 22.53300
S.D. k : 3.67288
Prob(sim_Fs <=obs_fs) : 0.00970

END OF RUN NUMBER 1 (07/03/06 at 15:45:25))
Total computing time for this run : 0h 0m 14s 941 ms
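
The Watterson values discussed in section 1.2 can be rechecked against the counts in the output above with a few lines of Python. This is only a sketch: the function names (`watterson_theta`, `cannings_theta`, `effective_size`) are our own and are not part of Arlequin. The conversion to effective population size assumes the diploid relation Θ = 4 N µ; that factor of 4 is an assumption on our part, chosen because it reproduces the Wright-Fisher figure quoted in the text.

```python
def watterson_theta(segregating_sites, sample_size):
    """Watterson estimator, equation (1): S divided by sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return segregating_sites / a_n


def cannings_theta(theta_wf, offspring_variance):
    """Rescale a Wright-Fisher theta by the offspring-number variance."""
    return theta_wf / offspring_variance


def effective_size(theta_per_site, mu, ploidy_factor=4):
    """Convert a per-site theta into N, ASSUMING theta = ploidy_factor * N * mu."""
    return theta_per_site / (ploidy_factor * mu)


N_SAMPLES = 96    # sample size from the Arlequin output
N_SITES = 3766    # number of observed nucleotide sites
MU = 1e-9         # mutation rate per site and generation used in the text

for s in (57, 53):
    theta = watterson_theta(s, N_SAMPLES)
    print(f"S = {s}: theta per locus = {theta:.4f}, "
          f"per site = {theta / N_SITES:.8f}")

theta_wf = watterson_theta(53, N_SAMPLES) / N_SITES
for variance in (0.01, 1.0, 100.0):
    print(f"offspring variance {variance}: "
          f"Theta_C = {cannings_theta(theta_wf, variance):.7f}")

print(f"N_WF = {effective_size(theta_wf, MU):.0f}")
```

With S = 53 the first function gives Arlequin's Theta(S), and with S = 57 it gives the value we argue should be used instead.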