
SIO 221B, Rudnick, adapted from Davis

VII. Sampling errors

We do not have access to the true statistics, so we must compute sample statistics. By this we mean that the number of realizations we average over is necessarily finite. We wish to determine the accuracy of our sample statistics to aid in interpretation. The difference between the true and sample statistics is called the sampling error. We have already defined the true mean

$$\langle x \rangle = \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} x_n \qquad (1)$$

The sample mean is defined as

$$\{x\} = \frac{1}{N}\sum_{n=1}^{N} x_n \qquad (2)$$

Note the use of curly braces as a notation for the sample mean. The distinguishing factor between (1) and (2) is that the sample mean is over a finite number of realizations.

Consider the problem of estimating the true mean with the sample mean. Suppose the sample mean were calculated an infinite number of times; then we could consider the mean of the sample mean, and compare it to the true mean. The bias is defined to be the mean difference between the sample mean and the true mean

$$E_1 = \left\langle \{x\} - \langle x \rangle \right\rangle = \left\langle \frac{1}{N}\sum_{n=1}^{N} x_n \right\rangle - \langle x \rangle = \frac{1}{N}\sum_{n=1}^{N}\langle x_n \rangle - \langle x \rangle = \frac{1}{N}\left(N\langle x \rangle\right) - \langle x \rangle = 0 \qquad (3)$$

Thus the sample mean has the desirable property of zero bias.

Consider the variance of the sample mean

$$E_2 = \left\langle \left(\{x\} - \langle x \rangle\right)^2 \right\rangle = \left\langle \left[\frac{1}{N}\sum_{n=1}^{N} x_n - \langle x \rangle\right]^2 \right\rangle \qquad (4)$$

Moving the true mean inside the sum results in

$$E_2 = \left\langle \left[\frac{1}{N}\sum_{n=1}^{N} \left(x_n - \langle x \rangle\right)\right]^2 \right\rangle \qquad (5)$$

The quantity in square brackets in (5) is the fluctuation $x_n' = x_n - \langle x \rangle$, so

$$E_2 = \left\langle \left[\frac{1}{N}\sum_{n=1}^{N} x_n'\right]^2 \right\rangle \qquad (6)$$

The average squared sum of fluctuations is the sum over all elements of the covariance matrix of the fluctuations

$$E_2 = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} \left\langle x_n' x_m' \right\rangle \qquad (7)$$

Assuming the fluctuations are independent with equal variance, so that $\langle x_n' x_m' \rangle = \delta_{nm}\sigma^2$, then

$$E_2 = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} \delta_{nm}\sigma^2 = \frac{\sigma^2}{N} \qquad (8)$$

So the variance decreases as the inverse of $N$. The square root of $E_2$ is called the root-mean-square error or the standard error, and decreases as the inverse of the square root of $N$

$$E_2^{1/2} = \frac{\sigma}{\sqrt{N}} \qquad (9)$$

Recognizing this improvement of error by $N^{-1/2}$ is key to understanding many measures of statistical error. Note the parallel between sampling error and the random walk. Note also that the rms error depends on the true variance.
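The $N^{-1/2}$ behavior of (9) is easy to check numerically. Below is a minimal Monte Carlo sketch, assuming Python with numpy; all parameter values are illustrative.

```python
# Monte Carlo check of (3) and (9): the sample mean is unbiased and its
# rms error falls as sigma/sqrt(N). Parameters here are made up.
import numpy as np

rng = np.random.default_rng(0)
sigma, true_mean = 2.0, 5.0
M = 5000                                  # realizations of the sample mean

for N in (10, 100, 1000):
    x = rng.normal(true_mean, sigma, size=(M, N))
    sample_means = x.mean(axis=1)
    bias = sample_means.mean() - true_mean                  # ~0, eq. (3)
    rms_error = np.sqrt(((sample_means - true_mean) ** 2).mean())
    print(f"N={N:5d}  bias={bias:+.4f}  rms error={rms_error:.4f}  "
          f"sigma/sqrt(N)={sigma / np.sqrt(N):.4f}")
```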

Suppose first that we know the true mean, so we can estimate the variance as

$$\hat\sigma^2 = \{x'^2\} = \frac{1}{N}\sum_{n=1}^{N} x_n'^2 \qquad (10)$$

The mean of this estimate is

$$\left\langle \hat\sigma^2 \right\rangle = \frac{1}{N}\sum_{n=1}^{N} \left\langle x_n'^2 \right\rangle = \sigma^2 \qquad (11)$$

so this estimate of variance is unbiased. The variance of the estimate of variance is

$$F_2 = \left\langle \left(\{x'^2\} - \sigma^2\right)^2 \right\rangle = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} \left\langle x_n'^2 x_m'^2 \right\rangle - \sigma^4 \qquad (12)$$

Assuming the fluctuations are independent

$$F_2 = \frac{1}{N^2}\left[N(N-1)\sigma^4 + N\left\langle x'^4 \right\rangle\right] - \sigma^4 = \frac{1}{N}\left(\left\langle x'^4 \right\rangle - \sigma^4\right) \qquad (13)$$

So the error in the variance depends on the fourth moment. This is a problem of closure. To determine the error of any statistic we need to know a higher order statistic. Thus errors on an estimated statistic are always less well determined than the original estimate.
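A quick numerical check of (13), under the same assumptions as above (numpy available, normal data, true mean known, illustrative parameters):

```python
# Check (13): with the true mean known, the variance of the variance
# estimate is (<x'^4> - sigma^4)/N. For normal data, <x'^4> = 3 sigma^4.
import numpy as np

rng = np.random.default_rng(1)
sigma, N, M = 1.5, 50, 100000

x = rng.normal(0.0, sigma, size=(M, N))       # fluctuations x' (zero mean)
var_hat = (x ** 2).mean(axis=1)               # eq. (10), true mean known
F2_monte_carlo = var_hat.var()
F2_theory = (3 * sigma ** 4 - sigma ** 4) / N # eq. (13), normal fourth moment
print(F2_monte_carlo, F2_theory)              # should agree closely
```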

The reality is that we don't know the true mean in order to use (10), so we must estimate it using (2). Consider the following estimator

$$\hat\sigma^2 = A\left\{\left(x - \{x\}\right)^2\right\} \qquad (14)$$

where $A$ is a constant to be determined. The obvious choice is $A = 1$, but is it optimal in any sense? Let's investigate the bias in the estimate (14)

$$F_1 = \left\langle \hat\sigma^2 \right\rangle - \sigma^2 \qquad (15)$$

Now we rewrite the estimate (14) in terms of deviation from the mean to get

$$F_1 = A\left\langle \left\{\left(x' - \left(\{x\} - \langle x \rangle\right)\right)^2\right\} \right\rangle - \sigma^2 \qquad (16)$$

As the sample mean operator acts only on $x'$, expanding the quadratic yields

$$F_1 = A\left[\left\langle \{x'^2\} \right\rangle - \left\langle \left(\{x\} - \langle x \rangle\right)^2 \right\rangle\right] - \sigma^2 \qquad (17)$$

The second term in the angle brackets of (17) is the variance of the sample mean, which for independent data is given by (8). The first term in the angle brackets is just $\sigma^2$, by (11). The result is thus

$$F_1 = A\frac{N-1}{N}\sigma^2 - \sigma^2 \qquad (18)$$

If we want the estimate (14) to be unbiased, that is $F_1 = 0$, then

$$A = \frac{N}{N-1} \qquad (19)$$

So an unbiased estimate of variance is

$$\hat\sigma^2 = \frac{1}{N-1}\sum_{n=1}^{N}\left(x_n - \{x\}\right)^2 \qquad (20)$$

Clearly, dividing by $N-1$ rather than by $N$ makes little difference when $N$ is large. So any bias in the most obvious estimate of variance is small. Also, there are other estimates of variance that might be optimal in other ways, rather than to eliminate bias. For example, consider the variance of the sample variance

$$F_2 = \left\langle \left(\hat\sigma^2 - \sigma^2\right)^2 \right\rangle \qquad (21)$$

Expanding the quadratic, and applying the averaging operator to each term, results in

$$F_2 = \left\langle \hat\sigma^4 \right\rangle - 2\sigma^2\left\langle \hat\sigma^2 \right\rangle + \sigma^4 \qquad (22)$$

We already know the second term from the calculation of bias

$$\sigma^2\left\langle \hat\sigma^2 \right\rangle = A\frac{N-1}{N}\sigma^4 \qquad (23)$$

The first term in (22) is more challenging to evaluate. Writing the estimate in terms of fluctuations, and noting $\{x\} - \langle x \rangle = \{x'\}$,

$$\left\langle \hat\sigma^4 \right\rangle = A^2\left\langle \left\{\left(x - \{x\}\right)^2\right\}^2 \right\rangle = A^2\left\langle \left\{\left(x' - \{x'\}\right)^2\right\}^2 \right\rangle = A^2\left\langle \left(\{x'^2\} - \{x'\}^2\right)^2 \right\rangle \qquad (24)$$

Since the sample average operates only on $x'$ and not on $\{x'\}$, we find

$$\left\langle \hat\sigma^4 \right\rangle = A^2\left[\left\langle \{x'^2\}^2 \right\rangle - 2\left\langle \{x'^2\}\{x'\}^2 \right\rangle + \left\langle \{x'\}^4 \right\rangle\right] \qquad (25)$$

Expanding the sample means as sums, and applying the averaging operator to each term, results in

$$\left\langle \hat\sigma^4 \right\rangle = A^2\left[\frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\left\langle x_n'^2 x_m'^2 \right\rangle - \frac{2}{N^3}\sum_{n=1}^{N}\sum_{m=1}^{N}\sum_{p=1}^{N}\left\langle x_n'^2 x_m' x_p' \right\rangle + \frac{1}{N^4}\sum_{n=1}^{N}\sum_{m=1}^{N}\sum_{p=1}^{N}\sum_{q=1}^{N}\left\langle x_n' x_m' x_p' x_q' \right\rangle\right] \qquad (26)$$

So we have a number of mean products, and we have to sum over all the elements. We assume, as above, that the data are independent. Because nonzero contributions come only from even moments, we find

$$\left\langle \hat\sigma^4 \right\rangle = A^2\left[\frac{1}{N^2}\left(N(N-1)\sigma^4 + N\left\langle x'^4 \right\rangle\right) - \frac{2}{N^3}\left(N(N-1)\sigma^4 + N\left\langle x'^4 \right\rangle\right) + \frac{1}{N^4}\left(3N(N-1)\sigma^4 + N\left\langle x'^4 \right\rangle\right)\right] \qquad (27)$$

The factor of 3 in the last term comes from the fact that there are three possible pairings in the fourth-order product. This can be simplified to

$$\left\langle \hat\sigma^4 \right\rangle = A^2\frac{N-1}{N^3}\left[\left(N^2 - 2N + 3\right)\sigma^4 + (N-1)\left\langle x'^4 \right\rangle\right] \qquad (28)$$

We require the fourth moment. If we assume $x$ to have a normal distribution, then $\langle x'^4 \rangle = 3\sigma^4$, and the result is

$$\left\langle \hat\sigma^4 \right\rangle = A^2\frac{(N-1)(N+1)}{N^2}\sigma^4 \qquad (29)$$

The variance of the sample variance is found by substituting (23) and (29) into (22)

$$F_2 = A^2\frac{(N-1)(N+1)}{N^2}\sigma^4 - 2A\frac{N-1}{N}\sigma^4 + \sigma^4 \qquad (30)$$

We choose the value of $A$ that minimizes the variance by setting the derivative of (30) with respect to $A$ equal to zero, yielding

$$A = \frac{N}{N+1} \qquad (31)$$

So for minimum variance (31), $A$ is slightly less than 1, while for zero bias (19), $A$ is slightly greater than 1. Maybe a choice of 1 is not so bad after all. The resulting variance and bias, with $A$ given by (31), are

$$F_2 = \frac{2\sigma^4}{N+1}; \qquad F_1 = -\frac{2\sigma^2}{N+1} \qquad (32)$$

So there is a negative bias, which is understood as the value of $A$ in this case is less than that in the unbiased case (19). For completeness, the variance and bias in the unbiased estimator of variance are, with $A$ given by (19),

$$F_2 = \frac{2\sigma^4}{N-1}; \qquad F_1 = 0 \qquad (33)$$

where we have assumed the data to be normally distributed.
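The trade-off among $A = N/(N-1)$, $A = N/(N+1)$, and $A = 1$ can be seen directly by simulation. A sketch, assuming normal data and numpy; the expected values follow (32) and (33):

```python
# Compare the three choices of A in (14) for normal data with small N.
import numpy as np

rng = np.random.default_rng(2)
sigma, N, M = 1.0, 10, 500000

x = rng.normal(0.0, sigma, size=(M, N))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # {(x - {x})^2}

for label, A in (("unbiased, A=N/(N-1)", N / (N - 1)),
                 ("min var,  A=N/(N+1)", N / (N + 1)),
                 ("obvious,  A=1      ", 1.0)):
    est = A * ss
    F1 = est.mean() - sigma ** 2              # bias, eq. (15)
    F2 = ((est - sigma ** 2) ** 2).mean()     # spread about sigma^2, eq. (21)
    print(f"{label}  F1={F1:+.4f}  F2={F2:.4f}")
# Expect F2 = 2 sigma^4/(N+1) for A = N/(N+1), 2 sigma^4/(N-1) for unbiased.
```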

Correlated Observations

Real observations are almost always correlated. All the estimates of sampling error we have considered so far have assumed that the realizations are independent. An example of correlated observations occurs when we have closely spaced samples of a variable with very long scales, as in daily samples of temperature off the Scripps pier in the presence of ENSOs and PDOs. The time series in Figure 1 contains time scales long compared to the separation between observations A and B, but comparable to the separation between points A and C.

Recall the expression (7) for the variance of the sample mean, which consists of a sum over all elements of the covariance matrix. Let's relax the assumption of independence, but assume now that the statistics are stationary, so

$$\left\langle x_n' x_m' \right\rangle = \sigma^2\rho(n - m) \qquad (34)$$

where $\rho$ is the correlation. That is, the covariance depends only on the difference in sample index. Assuming that the samples are evenly spaced, the mean square error of the sample mean is

$$E_2 = \frac{\sigma^2}{N}\sum_{n=-(N-1)}^{N-1}\left(1 - \frac{|n|}{N}\right)\rho(n) \qquad (35)$$

Figure 1. A serially correlated time series. Observations A and B are closely spaced relative to the dominant time scales. Observations A and C are relatively far apart. Thus observation C is more likely than B to be independent of A.

Understand that the index $n$ in (35) refers to the diagonal of the covariance matrix: the covariance matrix is symmetric, and each diagonal band is made up of equal values. If the data are uncorrelated, then the only nonzero term in (35) is for $n = 0$, and (8) is recovered. The variance of the sample mean can be written

$$E_2 = \frac{\sigma^2}{N_E} \qquad (36)$$

where the equivalent number of independent samples in serially correlated data, $N_E$, is

$$N_E = N\left[\sum_{n=-(N-1)}^{N-1}\left(1 - \frac{|n|}{N}\right)\rho(n)\right]^{-1} \qquad (37)$$

$N_E$ is sometimes called the equivalent number of degrees of freedom. If the number of samples $N$ is large compared to the value of $n$ where the correlation $\rho$ vanishes, then

$$N_E = N\left[\sum_{n=-\infty}^{\infty}\rho(n)\right]^{-1} \qquad (38)$$

These ideas can be extended to continuous time series, which clearly have correlated samples. In this case the sample mean is

$$\{x\} = \frac{1}{T}\int_0^T x(t)\,dt \qquad (39)$$

The mean square error of the sample mean is

$$E_2 = \frac{1}{T^2}\int_0^T\!\!\int_0^T \left\langle x'(t_1)\,x'(t_2) \right\rangle\,dt_1\,dt_2 \qquad (40)$$

Assuming stationary statistics, then

$$E_2 = \frac{\sigma^2}{T}\int_{-T}^{T}\left(1 - \frac{|t|}{T}\right)\rho(t)\,dt \qquad (41)$$

The equivalent number of degrees of freedom is

$$N_E = T\left[\int_{-T}^{T}\left(1 - \frac{|t|}{T}\right)\rho(t)\,dt\right]^{-1} \qquad (42)$$

Taking the limit as the time series becomes infinitely long, the mean square error is

$$E_2 = \frac{\sigma^2}{T}\int_{-\infty}^{\infty}\rho(t)\,dt \qquad (43)$$

The integral of the correlation is thus an important quantity, often called the integral time scale. The ratio of the time series length to the integral time scale tells how many independent samples there are in a time series.
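As an illustration of (36)-(38), consider an AR(1) process with lag-one correlation $a$, for which $\rho(n) = a^{|n|}$ and $\sum\rho(n) = (1+a)/(1-a)$. The following sketch (numpy assumed; parameters are illustrative) compares the empirical $N_E$ with the large-$N$ prediction:

```python
# Equivalent degrees of freedom for serially correlated (AR(1)) data:
# N_E ~ N (1 - a)/(1 + a) from (38) with rho(n) = a^|n|.
import numpy as np

rng = np.random.default_rng(3)
a, N, M, sigma = 0.8, 1000, 2000, 1.0

# M realizations of an AR(1) process with stationary variance sigma^2
e = rng.normal(0.0, sigma * np.sqrt(1 - a ** 2), size=(M, N))
x = np.empty((M, N))
x[:, 0] = rng.normal(0.0, sigma, size=M)
for n in range(1, N):
    x[:, n] = a * x[:, n - 1] + e[:, n]

E2 = x.mean(axis=1).var()              # variance of the sample mean
N_E_empirical = sigma ** 2 / E2        # eq. (36)
N_E_theory = N * (1 - a) / (1 + a)     # eq. (38)
print(N_E_empirical, N_E_theory)       # far fewer than N=1000 samples
```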

There is an illuminating relationship between the spectrum and the variance of the sample mean. Note that the spectrum and the covariance are Fourier transform pairs

$$S(f) = \int_{-\infty}^{\infty} C(t)\,e^{-i2\pi ft}\,dt \qquad (44)$$

Here $S$ is the spectrum, $C$ the covariance, and $f$ frequency. Because $C = \sigma^2\rho$, and using (43), the following relationship holds for time series long compared to the integral time scale

$$E_2 = \frac{S(0)}{T} \qquad (45)$$

This somewhat surprising result shows that the spectrum at zero frequency, that is, the lowest frequencies in a continuous time series, determines the error in the sample mean. High frequencies in a continuous time series are effectively averaged out, but the lowest frequencies cause error. A series of independent data has a white spectrum, and that a white spectrum is nonzero at low frequency is key to the effect on the error in the sample mean.
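A sketch illustrating (45): two discrete series with the same variance but very different $S(0)$. The red (AR(1)) series has a large $S(0)$ and a correspondingly large sample-mean error, while a first-differenced series has $S(0) = 0$ and a sample-mean error that vanishes faster than the $1/T$ of (45). numpy is assumed and all parameters are illustrative.

```python
# The sample-mean error is set by the spectrum at zero frequency, eq. (45).
import numpy as np

rng = np.random.default_rng(4)
N, M, a = 1000, 4000, 0.9

e = rng.normal(size=(M, N + 1))

# High-passed series: x_n = (e_n - e_{n-1})/sqrt(2); unit variance, S(0) = 0
hipass = (e[:, 1:] - e[:, :-1]) / np.sqrt(2)

# Red series: AR(1) with unit variance; S(0) = (1 + a)/(1 - a)
red = np.empty((M, N))
red[:, 0] = rng.normal(size=M)
innov = np.sqrt(1 - a ** 2)
for n in range(1, N):
    red[:, n] = a * red[:, n - 1] + innov * e[:, n]

for name, x, S0 in (("high-passed", hipass, 0.0),
                    ("red noise  ", red, (1 + a) / (1 - a))):
    E2 = x.mean(axis=1).var()
    print(f"{name}  E2={E2:.2e}  S(0)/T={S0 / N:.2e}")
# For the high-passed series E2 is O(1/N^2), far below the white-noise 1/N;
# (45) is the leading-order term and vanishes here.
```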

Confidence limits

The concept of confidence limits is a bit subtle. The idea is to come up with an interval and state the probability that the true value of a statistic is in that interval. Suppose we could calculate a sample mean an infinite number of times, and come up with its true distribution (Figure 2). Given that distribution, we could say that the next sample mean we calculate would fall between lower and upper limits $x_L$ and $x_U$ with a probability of

$$P = \int_{x_L}^{x_U} F_{\{x\}}(r)\,dr \qquad (46)$$

Figure 2. The probability density function of the sample mean.

After we have calculated the next sample mean, it either does or does not fall within these limits. That is, the probability that it is within some interval is either 0 or 1. Since this procedure requires knowing the true distribution before calculating the sample mean, it is not satisfactory. The question we really want to ask is: What is the range of true means likely to have led to the sample mean we just calculated? The upper limit is the largest true mean for which there is a small probability $\varepsilon$ of getting a smaller sample mean than the one observed. The lower limit is the smallest true mean for which there is an $\varepsilon$ probability of getting a larger sample mean than the one observed (Figure 3). If the sample mean we observed is $\hat m$ and the cumulative distribution of the sample mean about its true mean is $D_{\{x\}}$, with assumed lower and upper true means of $x_L$ and $x_U$, then the following probability statements are appropriate

$$D_{\{x\}}\left(\hat m - x_L\right) = 1 - \varepsilon; \qquad D_{\{x\}}\left(\hat m - x_U\right) = \varepsilon \qquad (47)$$

Figure 3. The confidence interval for the sample mean is bounded by the largest and smallest true means that were likely to have led to the observed sample mean. The area underneath the blue curve to the right of $\hat m$ is the probability $\varepsilon$ of getting a larger sample mean than the one observed, given a true mean of $x_L$. Similarly, the area under the green curve to the left of $\hat m$ is $\varepsilon$.

The story we need to tell to have confidence limits is to say we know the distribution, but we don't know the mean. The confidence interval defined as in (47) is said to be at the $100(1-2\varepsilon)\%$ level. To make more progress, we need an assumed distribution. Because the sample mean is a sum of random variables, its distribution should approach normal, with a standard deviation given by (9), as the number of degrees of freedom gets large. The lower and upper limits (47) are then

$$x_L = \hat m - \frac{\sigma}{\sqrt{N}}z_{1-\varepsilon}; \qquad x_U = \hat m + \frac{\sigma}{\sqrt{N}}z_{1-\varepsilon}$$

where $z_{1-\varepsilon}$ is the inverse normal cumulative distribution function with zero mean and unit standard deviation, evaluated at $1-\varepsilon$. (demo) Note that the limits are related to the estimated mean as a sum.
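A minimal sketch of such a calculation, assuming scipy is available; the data values are made up and the true standard deviation is taken as known:

```python
# Normal confidence limits for a sample mean with known standard error.
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])  # made-up data
N = len(x)                     # treated as independent, so N_E = N
m_hat = x.mean()
sigma = 1.0                    # assume the true standard deviation is known
eps = 0.025                    # two 2.5% tails -> 95% confidence

z = stats.norm.ppf(1 - eps)    # inverse normal CDF evaluated at 1 - eps
half_width = sigma / np.sqrt(N) * z
print(f"{m_hat:.3f} +/- {half_width:.3f} "
      f"-> [{m_hat - half_width:.3f}, {m_hat + half_width:.3f}]")
```

Note that the interval is built by adding and subtracting the half width, consistent with the limits being related to the estimate as a sum.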

The chi-squared ($\chi^2$) distribution is useful for finding confidence intervals for sample variances. For example, estimates of spectra are often assumed to have a $\chi^2$ distribution. Consider an estimate of the sample variance made by the sum of squares of $N$ independent, identically and normally distributed variables

$$\hat\sigma^2 = \frac{\sigma^2}{N}\sum_{n=1}^{N} x_n^2 \qquad (48)$$

where

$$\left\langle x_n \right\rangle = 0; \qquad \left\langle x_n^2 \right\rangle = 1 \qquad (49)$$

The variable

$$\chi_N^2 = \sum_{n=1}^{N} x_n^2 = \frac{N\hat\sigma^2}{\sigma^2} \qquad (50)$$

is a $\chi^2$ variable with $N$ degrees of freedom. The mean of this variable is

$$\left\langle \chi_N^2 \right\rangle = \sum_{n=1}^{N} \left\langle x_n^2 \right\rangle = N \qquad (51)$$

and its variance is

$$\left\langle \left(\chi_N^2 - N\right)^2 \right\rangle = \sum_{n=1}^{N}\sum_{m=1}^{N} \left\langle x_n^2 x_m^2 \right\rangle - N^2 = N(N-1) + 3N - N^2 = 2N \qquad (52)$$

The lower and upper limits are then

$$\sigma_L^2 = \frac{N\hat\sigma^2}{\chi_{N;1-\varepsilon}^2}; \qquad \sigma_U^2 = \frac{N\hat\sigma^2}{\chi_{N;\varepsilon}^2} \qquad (53)$$

where $\chi_{N;\varepsilon}^2$ is the inverse $\chi^2$ cumulative distribution function with $N$ degrees of freedom, evaluated at $\varepsilon$ (Figure 4). As a $\chi^2$ variable is a sum, the $\chi^2$ distribution approaches normal as $N$ gets large. In contrast to the confidence limits for a sample mean, the limits for the sample variance are related to the estimate as a ratio.

Figure 4. The confidence interval for the sample variance. The different widths of the two $\chi^2$ pdfs result because a larger mean $\chi^2$ variable has a larger variance, as the $\chi^2$ pdf must pass through the origin.
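A sketch of (53), assuming scipy and normally distributed data with known zero mean; all parameters are illustrative:

```python
# Chi-squared confidence limits for a sample variance, eq. (53).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N, sigma, eps = 30, 2.0, 0.025

x = rng.normal(0.0, sigma, size=N)
var_hat = (x ** 2).mean()            # N var_hat/sigma^2 is chi^2 with N dof

lower = N * var_hat / stats.chi2.ppf(1 - eps, df=N)   # sigma_L^2
upper = N * var_hat / stats.chi2.ppf(eps, df=N)       # sigma_U^2
print(f"sigma^2 estimate {var_hat:.3f}, "
      f"95% interval [{lower:.3f}, {upper:.3f}] (true {sigma**2:.3f})")
```

Dividing by the inverse CDF, rather than adding and subtracting, reflects the limits being related to the estimate as a ratio.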
