GEOS 36501/EVOL 33001 13 January 2012


III. Sampling

1 Overview of Sampling, Error, Bias
1.1 Biased vs. random sampling
1.2 Biased vs. unbiased statistic (or estimator)
1.3 Precision vs. accuracy

2 Error Estimates With Assumed Sampling Distribution
2.1 Standard error: the standard deviation of the distribution of sample statistics that would result from an infinite number of trials of drawing a sample from the underlying probability distribution and calculating the sample statistic.
2.2 In practice we generally do not estimate error by repeated sampling from the underlying distribution (expensive and time-consuming), although there are exceptions.
2.3 Approximations based on the sample distribution (from Sokal and Rohlf).

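As an illustration of the definition in 2.1, the following R sketch estimates the standard error of the mean by brute-force repeated sampling and compares it with the familiar approximation, standard deviation divided by sqrt(n). The normal distribution, its parameters, the sample size, and the number of trials are all assumptions chosen for the example; they are not taken from the notes.

```r
# Sketch: standard error of the mean by repeated sampling vs. sd/sqrt(n).
# Assumed for illustration only: normal underlying distribution (mean 0, sd 2),
# sample size n = 25, and 10000 trials standing in for "infinitely many".
set.seed(1)
n <- 25
sigma <- 2
ntrial <- 10000

# draw ntrial samples of size n; each column is one sample
samples <- matrix(rnorm(n * ntrial, mean = 0, sd = sigma), n, ntrial)
means <- colMeans(samples)

se.repeated <- sd(means)        # SD of the sample statistic across trials
se.formula <- sigma / sqrt(n)   # approximation using the true sigma

# with a single observed sample, sigma is replaced by the sample SD
x <- samples[, 1]
se.one.sample <- sd(x) / sqrt(n)

print(c(repeated = se.repeated, formula = se.formula, one.sample = se.one.sample))
```

The first two values should agree closely; the third varies from sample to sample because it relies on a single sample's standard deviation.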

Limitations:
- Many approximation formulae make assumptions about the shape of the distribution and the sample size.
- We may be interested in a novel statistic or one whose sampling distribution is not well characterized.

3 Bootstrap Error Estimates
3.1 Estimate standard error by resampling from the single sample we have.
3.2 This approach uses sampling with replacement from the observed sample to simulate sampling without replacement from the underlying distribution.
3.3 Procedure
- Start with an observed sample of size n and the observed sample statistic; call it Z.
- Randomly pick a sample of size n, with replacement, from the observed sample.
- Calculate the sample statistic of interest on this random sample; call it Zboot.
- Repeat many times (generally hundreds to thousands, ideally until the estimate of SE stabilizes).
- Calculate the standard deviation of the Zboot values. This is an estimate of the standard error of the observed sample statistic Z: $SD(Z_{boot}) \approx SE(Z)$.
3.4 Simple (but not necessarily most useful) example: trimmed mean
- Define the p-% trimmed mean as the mean of the sample with the p% lowest and p% highest observations discarded. (The idea is to try to reduce the effect of outliers.)
- Suppose the data consist of 10 (ordered) observations: 1, 2, 3, 4, 8, 10, 12, 15, 20, 30.
- Let the trimmed mean (here with the two lowest and two highest observations discarded, i.e. p = 20) be denoted Z. Then Z = (3 + 4 + 8 + 10 + 12 + 15)/6 = 8.67.

R code to estimate SE(Z):

    # define function
    trim.mean <- function(x, ntrim) {
      n <- length(x)                  # size of the sample passed in (avoids relying on a global n)
      ii <- order(x)
      xtmp <- x[ii]
      return(mean(xtmp[(ntrim + 1):(n - ntrim)]))
    }
    data <- c(1, 2, 3, 4, 8, 10, 12, 15, 20, 30)   # specify data
    n <- length(data)
    ntrim <- 2                        # specify number to trim from each side
    Zobs <- trim.mean(data, ntrim)    # get observed value
    nrep <- 1000                      # specify number of bootstrap replicates
                                      # (value missing in this copy; 1000 is an assumed placeholder)
    Zboot <- rep(NA, nrep)            # assign memory
    for (i in 1:nrep)                 # get bootstrap replicates
      Zboot[i] <- trim.mean(sample(data, n, replace = TRUE), ntrim)
    SE <- sd(Zboot)                   # calculate bootstrap std. error
    hist(Zboot, breaks = 50)          # plot histogram of results

    # alternative code, without loops
    DATA <- matrix(sample(data, nrep * n, replace = TRUE), n, nrep)   # each column is a bootstrap replicate
    Zboot <- apply(DATA, 2, trim.mean, ntrim)
    SE <- sd(Zboot)

This yields Zobs = 8.67 and SE(Z) ≈ 3.1.

[Figure: histogram of Zboot; x-axis Zboot, y-axis Frequency.]
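The procedure recommends repeating "ideally until the estimate of SE stabilizes". One hedged way to check this is to compute the running SE at a few checkpoints, as sketched below for the same trimmed-mean example. The checkpoint values and the total of 5000 replicates are assumptions made for illustration.

```r
# Sketch: does the bootstrap SE settle down as replicates accumulate?
# The checkpoints and nrep = 5000 are illustrative assumptions.
set.seed(1)

trim.mean <- function(x, ntrim) {
  xs <- sort(x)
  mean(xs[(ntrim + 1):(length(xs) - ntrim)])
}

data <- c(1, 2, 3, 4, 8, 10, 12, 15, 20, 30)
ntrim <- 2
nrep <- 5000

# one trimmed mean per bootstrap resample
Zboot <- replicate(nrep, trim.mean(sample(data, length(data), replace = TRUE), ntrim))

# SE estimated from the first k replicates, for increasing k
checkpoints <- c(50, 100, 500, 1000, 5000)
se.running <- sapply(checkpoints, function(k) sd(Zboot[1:k]))

print(data.frame(nrep = checkpoints, SE = round(se.running, 2)))
```

If the last few rows differ only in the second decimal place, adding more replicates is unlikely to change the reported SE materially.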

3.5 Useful R function: sample(x,n,replace=TRUE [or FALSE]) returns a random sample of size n from the vector x, with or without replacement.
3.6 To sample from array X so that the variables (columns) stay together:

    nr <- dim(X)[1]                          # get number of rows
    i <- sample(1:nr, n, replace = TRUE)     # or replace = FALSE; returns vector of n integers sampled on [1, nr]
    XSAMP <- X[i, ]

4 Parametric bootstrap
4.1 Take the observed sample and estimate the relevant parameter from it.
4.2 Resample from a parametric distribution with parameter equal to the sample estimate (rather than resampling from the observed distribution).
4.3 This approach can also be applied to more complicated situations: for example, simulating a process with parameters estimated from data. We'll do lots of this later...
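Steps 4.1 and 4.2 can be sketched as follows. The example assumes, purely for illustration, that the data are exponentially distributed and that the statistic of interest is the median; the data themselves are simulated stand-ins, not values from the notes. A nonparametric bootstrap of the same statistic is included for comparison.

```r
# Sketch of a parametric bootstrap (assumptions: exponential model, median as statistic).
set.seed(1)

# hypothetical observed sample: 30 waiting times
obs <- rexp(30, rate = 0.5)
n <- length(obs)

# 4.1: estimate the parameter from the observed sample
rate.hat <- 1 / mean(obs)          # maximum-likelihood estimate of the exponential rate

# 4.2: resample from the fitted parametric distribution
nrep <- 2000
SIM <- matrix(rexp(n * nrep, rate = rate.hat), n, nrep)   # each column is one simulated sample
med.boot <- apply(SIM, 2, median)
SE.param <- sd(med.boot)

# for comparison: ordinary (nonparametric) bootstrap from the observed values
NP <- matrix(sample(obs, n * nrep, replace = TRUE), n, nrep)
SE.nonparam <- sd(apply(NP, 2, median))

print(c(parametric = SE.param, nonparametric = SE.nonparam))
```

The two estimates need not agree; the parametric version is only as good as the assumed distribution.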

5 Examples of Finite-sample Bias (sample-size bias)
5.1 Sample variance
- $\sum (x - \bar{x})^2 / n$ is biased. It is systematically too low, which makes sense since it is based on squared deviations from the sample mean rather than from the (unknown) true mean.
- $\sum (x - \bar{x})^2 / (n - 1)$ is unbiased.
5.2 Number of taxa
Rarefaction method (from Raup 1975)
- The abundance of species $i$ is $N_i$; $N = \sum N_i$.
- Consider a particular species, $i$.
- $\binom{N - N_i}{n}$ is the number of ways of drawing the non-$i$ individuals in a sample of $n$.
- $\binom{N}{n}$ is the number of ways of drawing all individuals.
- Therefore, the ratio of these two is the probability of not drawing any individuals of species $i$.
- Therefore 1 minus this ratio is the probability of drawing at least one individual of species $i$.
- So the expected number of species is just the sum of this probability, calculated for each species in turn. Writing $E(S_n)$ for the expected number of species in a sample of $n$ individuals:
  $$E(S_n) = \sum_i \left[ 1 - \binom{N - N_i}{n} \Big/ \binom{N}{n} \right]$$
Caveats
- Rarefaction is for interpolation rather than extrapolation.
- Collecting curves vs. rarefaction curves.
- Apparent leveling off of curves does not imply that nearly everything has been found (only that you're unlikely to find it with modest effort).
- Curves are affected by factors other than sample size (sampling method, taxonomic treatment, size of geographic area, etc.).
- Crossing of rarefaction curves can make interpretation difficult.
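The rarefaction calculation described above can be sketched in R as follows. The function name rarefy, the abundance vector, and the subsample size are hypothetical choices for illustration; the simulation at the end is only a sanity check that random subsampling without replacement gives about the same answer.

```r
# Sketch: expected number of species in a random subsample of n individuals.
# N.i = vector of species abundances; n = subsample size (n <= sum(N.i)).
rarefy <- function(N.i, n) {
  N <- sum(N.i)
  # probability of drawing no individual of species i, computed on the log
  # scale to avoid overflow; it is 0 when fewer than n non-i individuals exist
  p.absent <- exp(lchoose(N - N.i, n) - lchoose(N, n))
  sum(1 - p.absent)
}

# hypothetical abundances for seven species (N = 100 individuals)
N.i <- c(50, 30, 10, 5, 3, 1, 1)
n <- 20
expected <- rarefy(N.i, n)

# sanity check by direct random subsampling without replacement
set.seed(1)
pool <- rep(seq_along(N.i), N.i)    # one entry per individual, labeled by species
simulated <- mean(replicate(5000, length(unique(sample(pool, n)))))

print(c(expected = expected, simulated = simulated))
```

Evaluating rarefy over a range of n values traces out the rarefaction curve for this collection; at n equal to the full collection size it returns the observed number of species.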


Examples of application of taxonomic rarefaction (Raup 1975; Raup and Schopf 1978)
- This example suggests that the increase in observed family diversity in post-Paleozoic echinoids cannot be accounted for by an increase in the number of species sampled.

- This example suggests that much of the variation in the number of observed echinoid orders is consistent with differences in the number of sampled species. (But does this mean that's really all that is going on?!)

Interpretation of taxonomic rarefaction curves is not entirely straightforward. Sampling standardization will be treated in more detail later.

5.3 Range
Example: Range of samples from normal distribution
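A minimal simulation of this example, under assumed settings (standard normal distribution, the listed sample sizes, 2000 replicates each), shows the sample-size bias of the range: the average range keeps growing with n, whereas the average sample variance (with n - 1 in the denominator) stays near the true value of 1.

```r
# Sketch: mean range vs. mean variance of standard-normal samples of size n.
# Sample sizes and replicate count are illustrative assumptions.
set.seed(1)
sizes <- c(5, 10, 20, 50, 100, 500)
nrep <- 2000

# average range (max - min) over nrep samples, for each sample size
mean.range <- sapply(sizes, function(n) mean(replicate(nrep, diff(range(rnorm(n))))))

# average sample variance over nrep samples, for each sample size
mean.var <- sapply(sizes, function(n) mean(replicate(nrep, var(rnorm(n)))))

print(data.frame(n = sizes,
                 mean.range = round(mean.range, 2),
                 mean.var = round(mean.var, 2)))
```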


Example: Test for nonrandomness of sampling with respect to morphology (Foote 1997, Paleobiology 23:181)

Correction in the general case via rarefaction (random subsampling at controlled sample size) (Foote 1992, Paleobiology 18:1)
Caveat: Range at standardized sample size may not convey any information that isn't conveyed by sample variance.
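Random subsampling at a controlled sample size can be sketched as below. This is a generic illustration, not the procedure of Foote (1992): the two samples are simulated from the same standard normal distribution, and the function name rarefied.range and all sizes are hypothetical.

```r
# Sketch: compare ranges of two samples of unequal size by subsampling the
# larger one down to the size of the smaller (all inputs are hypothetical).
set.seed(1)
small <- rnorm(15)     # stand-in for a small morphological sample
large <- rnorm(200)    # stand-in for a large sample from the same distribution

rarefied.range <- function(x, n, nrep = 2000) {
  # range of nrep random subsamples of size n, drawn without replacement
  r <- replicate(nrep, diff(range(sample(x, n, replace = FALSE))))
  c(mean = mean(r), sd = sd(r))
}

raw <- c(small = diff(range(small)), large = diff(range(large)))
rr <- rarefied.range(large, length(small))

print(round(raw, 2))   # raw ranges: the large sample looks more variable
print(round(rr, 2))    # range of the large sample at n = 15
```

The raw comparison mostly reflects sample size; the rarefied range of the large sample is the fairer quantity to set against the range of the small sample.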

6 Extreme value statistics
6.1 Introduction to problem
- Our previous look at standard errors considered the sampling distribution of quantities such as the mean.
- We may also be interested in the distribution of extremes: for example, how is the largest of n observations distributed, or the second smallest, etc.?
- Applications: earthquakes, floods, etc.; evolutionary constraints.
6.2 Probability of number of observations exceeding some value, if distribution known
- $\Pr(X > x) = 1 - F(x)$, where $F(x)$ is the cumulative distribution.
- If there are $N$ observations, then the probability that exactly $k$ of them exceed some value $x$ is given by a simple binomial:
  $$\binom{N}{k} [1 - F(x)]^{k} F(x)^{N-k}$$
- Example: standard normal with $N = 10$, $x = 0.67$, and $k = 3$: $F(0.67) = 0.75$, so the probability is $\binom{10}{3} (0.25)^{3} (0.75)^{7} \approx 0.25$.
Future observations
- Suppose we have $n_1$ past observations ranked from $m = 1$ (largest) to $m = n_1$ (smallest), and we take $n_2$ future observations. What is the probability that exactly $k$ of the $n_2$ observations will exceed the $m$th value from the first set of $n_1$ observations?
- Simply find $F(x)$ corresponding to the $m$th value and plug it into the previous binomial equation.
- Clearly this works only if we know the distribution.
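The binomial calculation in 6.2 can be checked directly in R. The first part reproduces the worked example; the second part applies the same idea to future observations, using a simulated set of past observations and assumed values of n1, n2, and m that are not from the notes.

```r
# Worked example from 6.2: standard normal, N = 10, x = 0.67, k = 3.
N <- 10
x <- 0.67
k <- 3
Fx <- pnorm(x)                      # cumulative probability F(x), about 0.75
prob <- dbinom(k, size = N, prob = 1 - Fx)
prob.manual <- choose(N, k) * (1 - Fx)^k * Fx^(N - k)   # same thing written out
print(c(F = Fx, prob = prob, prob.manual = prob.manual))

# Future observations (assumed setup): n1 = 20 past values from a standard
# normal, threshold = m-th largest past value, n2 = 10 future values.
set.seed(1)
n1 <- 20
n2 <- 10
m <- 2
past <- rnorm(n1)
x.m <- sort(past, decreasing = TRUE)[m]    # m-th largest past observation
p.future <- dbinom(0:n2, size = n2, prob = 1 - pnorm(x.m))
print(round(p.future, 3))                  # P(exactly k exceedances), k = 0..n2
```

Both parts rely on knowing F; here that knowledge enters through pnorm.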

6.3 Probability of number of observations exceeding some value, even if distribution is not known
General expressions:

- Derivation: See Gumbel pp
- Intuitive explanation for insensitivity to distribution: A given number of points should cover a given proportion of the cumulative distribution, regardless of the shape of the distribution (provided that it is continuous).
- Example (table from Gumbel): Note the symmetry in the table. The probability of x exceedances above the largest is the same as the probability of x exceedances below the lowest, etc.
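The insensitivity to distribution can be illustrated by simulation. The sketch below counts how many of n2 future observations exceed the m-th largest of n1 past observations, using three different continuous distributions; the choices n1 = n2 = 10, m = 1, and 20000 simulated trials are assumptions made for the example.

```r
# Sketch: the distribution of the number of exceedances does not depend on
# the shape of the (continuous) underlying distribution.
set.seed(1)
n1 <- 10
n2 <- 10
m <- 1
nsim <- 20000

exceed.dist <- function(rdist) {
  x <- replicate(nsim, {
    past <- rdist(n1)
    future <- rdist(n2)
    threshold <- sort(past, decreasing = TRUE)[m]   # m-th largest past value
    sum(future > threshold)                         # number of exceedances
  })
  tabulate(x + 1, nbins = n2 + 1) / nsim            # relative frequency of 0..n2
}

p.norm <- exceed.dist(rnorm)
p.exp <- exceed.dist(rexp)
p.unif <- exceed.dist(runif)

out <- round(rbind(normal = p.norm, exponential = p.exp, uniform = p.unif), 3)
colnames(out) <- 0:n2
print(out)
```

The three rows should agree to within simulation noise, even though the distributions have very different shapes.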

Application to crinoid evolution (Foote 1994)


Relationship to theory of records
- Let there be $n_1$ past trials and $n_2$ future trials. What is the probability that the record ($m = 1$) set by the first set of trials will still stand after the second set (i.e. $x = 0$)? This is $w(0)$.
- Now, suppose we let $n_1 = n_2$; then we have:
  $$w(x) = \frac{m \binom{n_1}{m} \binom{n_2}{x}}{(n_1 + n_2) \binom{n_1 + n_2 - 1}{x + m - 1}},$$
  which, for $n_1 = n_2$, $m = 1$, and $x = 0$, gives
  $$w(0) = \frac{\binom{n_1}{1} \binom{n_1}{0}}{(2 n_1) \binom{2 n_1 - 1}{0}},$$
  which is equal to $\frac{1}{2}$.
- What is the expected number of exceedances above the past record?
  $$E(x) = \frac{m n_2}{n_1 + 1} = \frac{n_1}{n_1 + 1} \quad (\approx 1 \text{ for large } n_1)$$
- Thus, for athletic contests, if all trials reflect the same underlying pool of talent, equipment, etc., the waiting time between successive records should progressively double.
- Likewise for discoveries of the largest dinosaur, oldest primate, etc.
- Deviations suggest a change in rules or nonrandom searching.
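The expression for w(x) is easy to evaluate numerically. The sketch below checks that the probabilities sum to 1, that w(0) = 1/2 when n1 = n2 and m = 1, and that the mean matches m*n2/(n1 + 1). The function name w and the value n1 = n2 = 20 are illustrative choices.

```r
# Sketch: distribution-free probability of exactly x exceedances of the m-th
# largest of n1 past observations among n2 future observations.
w <- function(x, m, n1, n2) {
  m * choose(n1, m) * choose(n2, x) /
    ((n1 + n2) * choose(n1 + n2 - 1, x + m - 1))
}

n1 <- 20
n2 <- 20
m <- 1

probs <- w(0:n2, m, n1, n2)
total <- sum(probs)              # should be 1
w0 <- w(0, m, n1, n2)            # probability that the record stands
E.x <- sum((0:n2) * probs)       # expected number of exceedances

print(c(total = total, w0 = w0, E.x = E.x, formula = m * n2 / (n1 + 1)))

# For m = 1 and x = 0 the expression reduces to n1/(n1 + n2): the chance that
# a record survives n2 further trials falls to 1/2 once n2 equals n1.
survive <- sapply(c(10, 20, 40, 80), function(k) w(0, 1, n1, k))
print(round(survive, 3))
```

The reduction to n1/(n1 + n2) is what underlies the doubling of waiting times: a record set after n1 trials has an even chance of surviving the next n1 trials.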
