Notes on MAS344 Computational Statistics School of Mathematical Sciences, QMUL. Steven Gilmour


Preface

MAS344 Computational Statistics is a third year course given at the School of Mathematical Sciences, Queen Mary, University of London. It consists of lectures and computer laboratories where students use the statistical package GenStat (both the command language and the menus). The practicals for the computer labs and other course information are available on the website set up for this course: sgg/mas344.html

The syllabus of the course includes:

Probability density functions: the empirical c.d.f.; q-q plots; histogram estimation; kernel density estimation.

Nonparametric tests: permutation tests; randomization tests; link to standard methods; rank tests.

Data splitting: the jackknife; bias estimation; cross-validation; model selection.

Bootstrapping: the parametric bootstrap; the simple bootstrap; the smoothed bootstrap; the balanced bootstrap; bias estimation; bootstrap confidence intervals; the bivariate bootstrap; bootstrapping linear models.

Steven Gilmour
January 2007

Contents

1 Probability Density Functions
  1.1 The Empirical Cumulative Distribution Function
      Trivial Example; Properties of the ECDF; Estimation Using the ECDF; Empirical Quantiles
  1.2 Quantile-Quantile Plots
      Examples
  1.3 Kolmogorov-Smirnov one-sample test
      Kolmogorov and Smirnov's approximation to the null distribution; Simulation of the null distribution
  1.4 Histogram estimation
      Properties of histogram estimators; Example
  1.5 Kernel Estimation
      Rosenblatt's histogram estimator; Kernel density estimators

2 Nonparametric Tests
  2.1 Permutation Tests
      Introductory example: visual acuity; The two-sample location problem; Trivial example
  2.2 Randomization Tests
      Test for independence of bivariate data; Matched pairs
  2.5 The Mann-Whitney-Wilcoxon Rank Sum Test
      Introductory example: Ice Data; The MWW test function and its null distribution; Rejecting the null hypothesis; Trivial example; Dealing with large samples; Normal approximation; Simulation
  2.6 General considerations about rank tests
      Key steps in constructing a test; Advantages of rank tests; Disadvantages of rank tests
  2.7 Rank tests versus permutation tests

3 Data-Splitting
  3.1 The Jackknife
  3.2 Cross-Validation

4 Bootstrapping
  4.1 Non-parametric computational estimation
  4.2 Bootstrap estimates of bias, standard error and MSE
      The parametric bootstrap; Simulation of random variables; Example: parametric bootstrap; The smoothed bootstrap; The balanced bootstrap; Bootstrapping bivariate data; Non-parametric bootstrap; Fully parametric bootstrap; Semi-parametric bootstrap; Summary of the bivariate bootstrap

Chapter 1

Probability Density Functions

This is a course on classical statistical inference, in which we develop computational methods which allow inferences to be based on less strict assumptions than the likelihood-based methods introduced in Fundamentals of Statistics II. In particular, we will study methods which do not require us to make strong distributional assumptions.

In this chapter, we study methods of estimating probability density functions (p.d.f.s). For a given sample of data, we might want to check whether or not it follows some particular parametric distribution, or we might want to estimate its p.d.f. without having to make a particular distributional assumption. A useful tool for both these tasks, and many others we will use in this course, is the empirical cumulative distribution function (c.d.f.).

1.1 The Empirical Cumulative Distribution Function

Recall that, for a random variable (r.v.) $Y$, the c.d.f. $F$ is defined as
$$F_Y(y) = P(Y \le y).$$
For a set of univariate data $y_1, \ldots, y_n$, the empirical c.d.f. (e.c.d.f.) is defined as
$$\hat{F}_Y(y) = \frac{\#\{i : y_i \le y\}}{n},$$
where $\#S$ denotes the number of elements in the set $S$. The e.c.d.f. is a step function, with steps of size $1/n$ at each data point.
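The e.c.d.f. is simple to compute directly. Below is a minimal sketch in Python (the course itself uses GenStat; this is only an illustration), applied to the data of the trivial example that follows.

import numpy as np

def ecdf(data):
    # Return a function computing the e.c.d.f. of the sample.
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    def F_hat(y):
        # #{i : x_i <= y} / n; searchsorted counts ordered values <= y
        return np.searchsorted(x, y, side="right") / n
    return F_hat

F = ecdf([2.0, 2.1, 2.8, 3.6, 4.9, 6.1, 7.2, 8.3])
print(F(3.0))    # 3/8 = 0.375
print(F(10.0))   # 1.0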

Trivial Example

Consider the following observations on a variable: 2.0, 2.1, 2.8, 3.6, 4.9, 6.1, 7.2, 8.3. The e.c.d.f. is
$$\hat{F}(y) = \begin{cases} 0 & y < 2.0 \\ 1/8 & 2.0 \le y < 2.1 \\ 2/8 & 2.1 \le y < 2.8 \\ 3/8 & 2.8 \le y < 3.6 \\ 4/8 & 3.6 \le y < 4.9 \\ 5/8 & 4.9 \le y < 6.1 \\ 6/8 & 6.1 \le y < 7.2 \\ 7/8 & 7.2 \le y < 8.3 \\ 1 & y \ge 8.3 \end{cases}$$

Properties of the ECDF

If $Y_1, \ldots, Y_n$ are independent and identically distributed (i.i.d.) r.v.s, each having the distribution of $Y$, then $P(Y_i \le y) = F(y)$ and clearly
$$E(\hat{F}_Y(y)) = \frac{E(\#\{i : Y_i \le y\})}{n} = F(y),$$
i.e. at any point $y$, the e.c.d.f. is an unbiased estimator of the c.d.f. In fact, at any fixed point $y$, the e.c.d.f. has sampling distribution given by
$$n\hat{F}_Y(y) \sim \mathrm{Binom}(n, F(y)).$$
Therefore $V(\hat{F}_Y(y)) = F(y)(1 - F(y))/n$. It can be shown that, among estimators which are unbiased at every point $y$, the e.c.d.f. has minimum variance.

For a continuous r.v. it is perhaps more natural to think of the empirical probability density function (e.p.d.f.). This has a spike of probability $1/n$ at each data point and value 0 elsewhere. This makes it clear that we are estimating a continuous function using a discrete function. In fact, the e.c.d.f. is equivalent to the c.d.f. of a discrete uniform distribution with values at the observed data points. Plot the e.p.d.f. for the trivial example above.

Estimation Using the ECDF

Since the c.d.f. contains all the information about $Y$, it seems natural to use the e.c.d.f. to estimate functions of the c.d.f. of particular interest, such as the mean and variance.

Consider the mean of the e.c.d.f. From the form of the e.p.d.f. it is clear that this is just the sample mean. The estimator of the variance is not the usual sample variance, but
$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
If we consider estimating $E(Y^r)$ for $r = 1, 2, \ldots$, we see that the estimator of any population moment is the corresponding sample moment. Hence, this is a way of deriving the method of moments familiar from Fundamentals of Statistics II.

Empirical Quantiles

Recall that the $\alpha$-quantile of $Y$, for $\alpha \in (0, 1)$, is the value $y_\alpha$ such that $F(y_\alpha) = \alpha$. For example, $y_{1/2}$ is the median, $y_{1/4}$ and $y_{3/4}$ are the lower and upper quartiles, etc. For a discrete distribution, the quantile might not exist for a specific value of $\alpha$.

Applying this definition to the e.c.d.f. we find that the $j/n$-quantile could be anything between the $j$th and $(j+1)$th ordered observations, $y_{[j]}$ and $y_{[j+1]}$. Equivalently, $y_{[j]}$ could be the $\alpha$-quantile for any $\alpha$ between $(j-1)/n$ and $j/n$. The most common convention, and the one that we will use, is to define $y_{[j]}$ to be the $(j - 0.5)/n$-quantile, or the $j$th sample quantile.

For our trivial example, the observations are the 1/16-, 3/16-, 5/16-, 7/16-, 9/16-, 11/16-, 13/16- and 15/16-quantiles.

1.2 Quantile-Quantile Plots

Given a univariate set of data, assumed i.i.d., we might want to determine whether they come from a specific reference distribution, e.g. Normal, Exponential, Gamma, etc. The simplest, most commonly used and probably most reliable way to do this is to use graphical methods. A histogram of the observations can be useful to detect gross discrepancies from the assumed distribution, e.g. a large degree of skewness when Normality is being assumed. However, histograms are not very reliable for detecting more subtle departures from assumptions. In addition, their appearance depends strongly on the bins used to define the histogram.

A more useful graphical method of comparing a sample with a reference distribution is to use a quantile-quantile plot, more commonly referred to as a q-q plot. The sample quantiles are plotted against the quantiles of the reference distribution.

For a reference distribution with c.d.f. $F$, plot the points
$$\left(F^{-1}\left(\frac{j - 0.5}{n}\right),\; y_{[j]}\right).$$
If the reference distribution is correct, the points should fall close to a straight line with intercept 0 and slope 1. Systematic departures from a straight line indicate that the distribution is inappropriate. Two particular departures, which are often of little interest, are:

if the mean is specified incorrectly, the line will have intercept different from 0;

if the variance is specified incorrectly, the line will have slope different from 1.

Note that these departures do not change the q-q plot from a straight line. Therefore we can use any reference distribution which has the correct shape, i.e. which differs only in location and scale and not in skewness, kurtosis, etc. Some specific cases:

To check whether the data come from a Uniform distribution, we can use U[0,1] and plot $y_{[j]}$ against $(j - 0.5)/n$.

To check whether the data come from a Normal distribution, we can use N(0,1) and plot $y_{[j]}$ against $\Phi^{-1}\left(\frac{j - 0.5}{n}\right)$. This is the Normal probability plot familiar from Statistical Modelling I.

To check whether the data come from a Gamma distribution, we can use any scale parameter $\beta$, but must use the correct shape parameter $\alpha$. If $2\alpha$ is an integer we can use the quantiles of the appropriate chi-squared distribution, since Gamma($\alpha$, 2) is equivalent to $\chi^2_{2\alpha}$.

Examples

Compare our small data set with the Uniform, Normal and $\chi^2_5$ distributions.

1.3 Kolmogorov-Smirnov one-sample test

We now consider a formal hypothesis test, the Kolmogorov-Smirnov test, for verifying that a sample comes from a population with some known distribution. Let $x_1, \ldots, x_n$ be observations on continuous i.i.d. r.v.s $X_1, \ldots, X_n$ with c.d.f. $F$. We want to test the hypothesis
$$H_0 : F(x) = F_0(x) \text{ for all } x, \tag{1.1}$$
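The plotting positions $(j - 0.5)/n$ make a q-q plot easy to construct. Here is a Python sketch for the three reference distributions used in the Examples above (an illustration only; the fitted straight line is just a rough numerical stand-in for judging the plot by eye).

import numpy as np
from scipy import stats

y = np.sort(np.array([2.0, 2.1, 2.8, 3.6, 4.9, 6.1, 7.2, 8.3]))
n = len(y)
p = (np.arange(1, n + 1) - 0.5) / n       # plotting positions (j - 0.5)/n

q_unif = p                                # U[0,1]: F^{-1}(p) = p
q_norm = stats.norm.ppf(p)                # N(0,1) quantiles
q_chi5 = stats.chi2.ppf(p, df=5)          # chi-squared, 5 d.f.

# If the reference shape is right, (q, y) is roughly a straight line.
for name, q in [("uniform", q_unif), ("normal", q_norm), ("chi2_5", q_chi5)]:
    slope, intercept = np.polyfit(q, y, 1)
    print(name, round(slope, 3), round(intercept, 3))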

where $F_0$ is a known c.d.f. The Kolmogorov distance $D_n$ is defined by
$$D_n = \sup_{x \in \mathbb{R}} |\hat{F}(x) - F_0(x)|. \tag{1.2}$$
Note that the supremum (1.2) must occur at one of the observed values $x_i$ or immediately to the left of $x_i$. The Kolmogorov distance is used as our test statistic. The null distribution of the statistic $D_n$ can be obtained by simulation or, for large samples, using the Kolmogorov-Smirnov distribution function.

Kolmogorov and Smirnov's approximation to the null distribution

The approximation is given by the following theorem.

Theorem 1.1 Let $F_0$ be a continuous c.d.f., and let $X_1, \ldots, X_n$ be a sequence of i.i.d. r.v.s with c.d.f. $F_0$. Then

1. The null distribution of $D_n$ does not depend on $F_0$; it depends only on $n$.

2. As $n \to \infty$ the distribution of $\sqrt{n}D_n$ is asymptotically Kolmogorov's distribution with c.d.f.
$$Q(x) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^2x^2}, \tag{1.3}$$
that is, $\lim_{n \to \infty} P(\sqrt{n}D_n \le x) = Q(x)$.

Example: Fire Occurrences

A nature reserve in Australia had 15 fires in a single year. The fires occurred on the following days of the year: 4, 18, 32, 37, 56, 64, 78, 89, 104, 134, 154, 178, 190, 220, 256. A researcher claims that the time between the occurrences of fire in the reserve, say $X$, follows an exponential distribution, i.e., $X \sim \mathrm{Exp}(\lambda)$, where $\lambda = 0.009$. Is the claim justified?

The null hypothesis is
$$H_0 : F(x) = F_0(x),$$
where $F_0(x) = 1 - e^{-0.009x}$ for $x \ge 0$.

We find that the maximum absolute difference between the empirical and exponential distribution functions is $d_{15} = 0.129$. Thus,
$$\sqrt{n}\,d_{15} = \sqrt{15} \times 0.129 = 0.5.$$
From the table of tail probabilities $P(Z > z) = 1 - Q(z)$ we find that the upper 10% point of the distribution is 1.22. Since $0.5 < 1.22$, we conclude that there is insufficient evidence to reject $H_0$.

Simulation of the null distribution

We may approximate the null distribution of $D_n$ by simulation. For this we use the standard uniform random variable.

Lemma 1.1 Let $X$ be a continuous r.v. with c.d.f. $F$ and let $U = F(X)$. Then $U \sim \mathrm{Uniform}[0, 1]$.

Proof Let $u \in [0, 1]$. $X$ is continuous, and so there exists $x_u \in \mathbb{R}$ such that $F(x_u) = u$; see Figure 1.1. Now,
$$F_U(u) = P(U \le u) = P(F(X) \le F(x_u)) = P(X \le x_u) = F(x_u) = u.$$
So $F_U(u) = u$ and the r.v. $U$ is uniform on [0, 1].

[Figure 1.1: Continuous Cumulative Distribution Function, showing the point $x_u$ with $F(x_u) = u$.]
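For concreteness, a Python sketch of the Kolmogorov distance computation follows (the course uses GenStat). One assumption is flagged in the code: applying $F_0$ to the recorded fire days themselves, rather than to the gaps between them, reproduces the quoted values $d_{15} \approx 0.129$ and $\sqrt{n}\,d_{15} \approx 0.5$.

import numpy as np

def kolmogorov_distance(data, F0):
    # D_n = sup_x |F_hat(x) - F0(x)| for a continuous F0.  The supremum
    # occurs at an order statistic x_(i) or just to its left, so it is
    # enough to compare F0(x_(i)) with i/n and (i - 1)/n.
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    u = F0(x)
    i = np.arange(1, n + 1)
    return float(max(np.max(i / n - u), np.max(u - (i - 1) / n)))

days = np.array([4, 18, 32, 37, 56, 64, 78, 89, 104,
                 134, 154, 178, 190, 220, 256])
F0 = lambda x: 1 - np.exp(-0.009 * x)
d15 = kolmogorov_distance(days, F0)       # assumption: days, not gaps
print(d15, np.sqrt(15) * d15)             # approx 0.129 and 0.50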

To perform the simulation of $D_n$ do the following:

generate random samples of size $n$ from the standard uniform distribution U[0, 1] with c.d.f. $F(u) = u$;

find the maximum absolute difference between $F(u)$ and the empirical distribution $\hat{F}(u)$ for the generated sample;

repeat this $N$ times to get the approximate distribution of $D_n$.

Small Example

Five independent weighings of a standard weight (in gm $\times 10^{-6}$) give the following discrepancies from the supposed true weight: $-1.2$, $0.2$, $-0.6$, $0.8$, $-1.0$. Are the discrepancies sampled from N(0, 1)?

We set the null hypothesis as $H_0 : F(x) = F_0(x)$, where $F_0(x) = \Phi(x)$, i.e., it is the c.d.f. of a standard normal r.v. To calculate the value of the test statistic (1.2) we need the empirical c.d.f. for the data and also the values of $\Phi$ at the data points. The empirical c.d.f. is
$$\hat{F}(x) = \begin{cases} 0 & x < -1.2 \\ 0.2 & -1.2 \le x < -1.0 \\ 0.4 & -1.0 \le x < -0.6 \\ 0.6 & -0.6 \le x < 0.2 \\ 0.8 & 0.2 \le x < 0.8 \\ 1 & x \ge 0.8 \end{cases}$$

Calculations:

x      F^(x)   Phi(x)   |F^(x) - Phi(x)|
-1.2   0.2     0.1151   0.0849
-1.0   0.4     0.1587   0.2413
-0.6   0.6     0.2743   0.3257
 0.2   0.8     0.5793   0.2207
 0.8   1.0     0.7881   0.2119

(The differences just to the left of each data point are all smaller than these.)

Hence, the observed value of $D_n$, say $d_0$, is $d_0 = 0.326$.

What is the null distribution of $D_n$? We have $n = 5$. Suppose we have randomly generated five values from the standard uniform distribution; the maximum absolute difference between their e.c.d.f. and the line $F(u) = u$ gives one simulated value $d_{max}$ of $D_5$ (see Figure 1.2). Another random sample of 5 uniform r.v.s will give another value $d_{max}$. Repeating this procedure we simulate a set of values for $D_5$. Then, having the approximate null distribution of the test statistic, we may calculate the p-value of the observed $d_0$. Below is a simulated distribution of $D_5$ obtained using a GenStat program.

[Figure 1.2: Calculating $d_{max}$ — the e.c.d.f. of five uniform values plotted against $F(u) = u$, with the largest vertical gap marked.]

[GenStat histogram of the simulated distribution of $D_5$; the bin labels were lost in transcription.]
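The same simulation is easy to reproduce outside GenStat. A Python sketch, assuming $N = 10000$ replicates and the observed value $d_0 = 0.326$ computed above:

import numpy as np

rng = np.random.default_rng(0)

def d_uniform(u):
    # D_n for a sample u against U[0,1], whose c.d.f. is F(t) = t
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

n, N = 5, 10000
d_null = np.array([d_uniform(rng.uniform(size=n)) for _ in range(N)])
d0 = 0.326                                # observed value from the example
print(np.mean(d_null >= d0))              # approximate p-value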

This shows that $d_0 = 0.326$ is close to the middle of the distribution, and so the data do not contradict the null hypothesis that the discrepancies are normally distributed with zero mean and variance equal to one.

1.4 Histogram estimation

We now return to the problem of non-parametric (or distribution-free) estimation of the p.d.f. We have already seen that the empirical p.d.f. can be used. It does, however, have some undesirable properties as an estimator of the p.d.f.:

It has value 0 everywhere, except at the data points;

It approximates a continuous function with a set of spikes of probability.

Just as we can get a smoother version of the raw data by drawing a histogram, we can get a smoother estimator of the p.d.f. by using the histogram as an estimate of the p.d.f. Assume that the p.d.f. $f(y)$ has support on the finite interval $[a, b]$. Partition this interval into $m$ non-overlapping bins $T_1, \ldots, T_m$, with widths $h_1, \ldots, h_m$. For a set of data with $n_k$ observations in $T_k$, the histogram estimator of $f(y)$ is
$$\hat{f}_H(y) = \frac{n_k}{nh_k} \quad \text{for } y \in T_k.$$
Of course, this is just a histogram of the data, scaled to have total area 1. Sometimes it is called the density histogram.

Properties of histogram estimators

Unlike the e.p.d.f., the histogram estimator has probability 0 of being exactly equal to any particular value, just like the p.d.f. We can work out its expectation and variance. For $y \in T_k$,
$$E(\hat{f}_H(y)) = \frac{E(n_k)}{nh_k} = \frac{\int_{T_k} f(t)\,dt}{h_k}.$$
We can see that $\hat{f}_H(y)$ is biased. It can be shown that the bias decreases as the bin width decreases. Of course, as the bin width tends to 0, the histogram estimator tends to the e.p.d.f. and so tends to being unbiased.

Since the observations are independent and have constant probability of being in $T_k$, $n_k$ has a binomial distribution, and so
$$V(\hat{f}_H(y)) = \frac{\int_{T_k} f(t)\,dt\left(1 - \int_{T_k} f(t)\,dt\right)}{nh_k^2}.$$
It is easily seen that the variance decreases as the bin width increases. We see that the effect of changing the bin width is the opposite on variance and bias. Thus, there must be some optimal bin width which minimises the mean square error (variance plus squared bias). Unfortunately, this is in general difficult to find.

Example

Given the following frequency table, write out and plot the histogram estimator, and estimate $P(Y \le 1.2)$.

[Frequency table with intervals $T_1, \ldots, T_9$ and their frequencies; the numerical values were lost in transcription.]
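A short Python sketch of the estimator $\hat{f}_H$ (for illustration; the data are those of the trivial example and the bin edges are invented, since the frequency table's values were lost):

import numpy as np

def hist_density(data, edges):
    # Histogram estimator: f_H(y) = n_k / (n h_k) for y in bin T_k
    data = np.asarray(data, dtype=float)
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    return counts / (len(data) * widths)

data = [2.0, 2.1, 2.8, 3.6, 4.9, 6.1, 7.2, 8.3]
edges = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # assumed bins
f_hat = hist_density(data, edges)
print(f_hat)                                   # one density value per bin
print(np.sum(f_hat * np.diff(edges)))          # total area = 1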

1.5 Kernel Estimation

Rosenblatt's histogram estimator

A major problem with histogram estimators is that the choice of break points between the bins is somewhat arbitrary. Rosenblatt in 1956 suggested a moving bins approach to get round this problem. At a specific point $y$, the p.d.f. is estimated to be
$$\hat{f}(y) = \frac{\#\{y_i : y_i \in (y - h/2,\; y + h/2]\}}{nh},$$
for some bin width $h$. This can also be written as
$$\hat{f}(y) = \frac{\hat{F}(y + h/2) - \hat{F}(y - h/2)}{h},$$
where $\hat{F}$ is the e.c.d.f.

Example

Calculate Rosenblatt's estimator for the trivial example in Section 1.1, for a chosen bin width $h$.

Kernel density estimators

Another way to write Rosenblatt's estimator is as
$$\hat{f}(y) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{y - y_i}{h}\right),$$
where
$$K(t) = \begin{cases} 1 & \text{if } |t| < 1/2; \\ 0 & \text{otherwise.} \end{cases}$$
The function $K$ is the kernel or filter. Other forms of $K$ can be used to develop other types of kernel density estimator (k.d.e.). It should be such that $K(t) \ge 0$ for all $t$ and $\int_{\mathbb{R}} K(t)\,dt = 1$. The kernel should also have the property that it emphasizes points near $y_i$. If we do this, it can be shown that the k.d.e. has similar properties to the histogram estimator, i.e. as $h$ is decreased the bias decreases, but the variance increases. Common choices of kernel function include:

the rectangular kernel used in Rosenblatt's estimator;

the triangular kernel,
$$K(t) = \begin{cases} 2 + 4t & -\frac{1}{2} < t \le 0; \\ 2 - 4t & 0 < t \le \frac{1}{2}; \\ 0 & \text{otherwise;} \end{cases}$$

the standard normal density,
$$K(t) = \frac{1}{\sqrt{2\pi}}e^{-t^2/2}.$$

Experience has shown that kernel density estimation is not sensitive to the choice of kernel, but is sensitive to the choice of $h$.
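A compact Python sketch of the k.d.e. with the three kernels just listed (illustrative only; the grid and bandwidth are arbitrary choices):

import numpy as np

def kde(y, data, h, kernel="normal"):
    # f_hat(y) = (1/(n h)) * sum_i K((y - y_i)/h)
    data = np.asarray(data, dtype=float)
    t = (np.asarray(y, dtype=float)[..., None] - data) / h
    if kernel == "rectangular":               # Rosenblatt's estimator
        K = (np.abs(t) < 0.5).astype(float)
    elif kernel == "triangular":
        K = np.clip(2 - 4 * np.abs(t), 0.0, None)
    else:                                     # standard normal density
        K = np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=-1) / (len(data) * h)

data = [2.0, 2.1, 2.8, 3.6, 4.9, 6.1, 7.2, 8.3]
grid = np.linspace(0, 10, 11)
print(kde(grid, data, h=1.0))
print(kde(grid, data, h=1.0, kernel="rectangular"))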

Chapter 2

Nonparametric Tests

In this chapter we will concentrate on hypothesis tests, for example for comparing the locations of two samples.

2.1 Permutation Tests

Introductory example: visual acuity (McClave and Dietrich, 1988)

In a comparison of visual acuity (VA) of deaf (D) and hearing (H) children, eye movement rates were taken on eight deaf and ten hearing children. A clinical psychologist believes that deaf children have greater visual acuity than hearing children. The larger a child's eye movement rate, the more visual acuity the child possesses.

[Table of VA measurements for the 8 deaf (D) and 10 hearing (H) children; the numerical values were lost in transcription.]

Is the visual acuity of deaf and hearing children equivalent? This is a two-treatment comparison problem. In the classical approach we assume that:

1. $x_1, \ldots, x_m$ are observations on r.v.s $X_1, \ldots, X_m$ i.i.d. $N(\mu_1, \sigma_1^2)$, and $y_1, \ldots, y_n$ are observations on r.v.s $Y_1, \ldots, Y_n$ i.i.d. $N(\mu_2, \sigma_2^2)$; and

2. $\sigma_1^2 = \sigma_2^2$.

Then, if the assumptions are met, we may test the hypothesis that the two groups are equivalent, i.e., $H_0 : \mu_1 = \mu_2$, using the well known exact two-sample t-test.

What if the assumptions are not fulfilled? Then some nonparametric methods prove to be useful, where the assumptions are much less restrictive. A test based on all possible permutations of the data is one of the possibilities.

The two-sample location problem

Assumptions: $x_1, \ldots, x_m$ are observations on i.i.d. r.v.s $X_1, \ldots, X_m$ with a c.d.f. $F_1$, and $y_1, \ldots, y_n$ are observations on i.i.d. r.v.s $Y_1, \ldots, Y_n$ with a c.d.f. $F_2$. The null hypothesis is
$$H_0 : F_1(x) = F_2(x) \text{ for all } x$$
and possible alternative hypotheses are:

$H_1 : F_1(x) \le F_2(x)$ with inequality for at least one $x$;

$H_1 : F_1(x) \ge F_2(x)$ with inequality for at least one $x$;

$H_1 :$ either $F_1(x) \le F_2(x)$ or $F_1(x) \ge F_2(x)$, with inequality for at least one $x$.

For this problem one suitable test statistic is
$$T = \bar{X} - \bar{Y}. \tag{2.1}$$
Other possibilities are:

the difference between sample medians, i.e., $T = \mathrm{Me}_1 - \mathrm{Me}_2$;

the difference between sample trimmed means, i.e., $T = \bar{X}_t - \bar{Y}_t$;

any monotonic function of $T$, say $g(T)$; for example $g(T) = \frac{T - a}{b}$, $b > 0$.

Null distribution of $T$: Under $H_0$ all of the $m + n$ observations form a single random sample, and every selection of $m$ observations out of $m + n$ is equally likely; there are $^{m+n}C_m$ such selections, and for every selection of the $m$ observations we get a value of $T$, say $t$.

Hence, the null distribution of $T$ is given by
$$P(T = t) = \frac{\#(t, m, n)}{^{m+n}C_m}, \tag{2.2}$$
where $\#(t, m, n)$ denotes the number of all subsets for which $T = t$. The p-value, for the two-sided alternative, is
$$P(|T| \ge |t_0|) = \frac{\#(t : |t| \ge |t_0|,\, m, n)}{^{m+n}C_m},$$
where $t_0$ is obtained from the sample and $\#(t : |t| \ge |t_0|,\, m, n)$ denotes the number of the subsets for which $|t| \ge |t_0|$.

Trivial example

Let $m = 2$, $n = 3$.

[Table of the two samples; the observed values were lost in transcription.]

We are interested in testing the hypothesis that the two samples come from the same population. Using $T = \bar{X}_2 - \bar{X}_1$ we have $t_0 = 36.5$ and $^5C_2 = 10$. All possible pairs $(x_1, x_2)$ and the corresponding values of the test statistic are tabulated; most of the $t$ values were lost in transcription, with only $t = 4.00$ for the pair (1,3), $t = 1.50$ for (2,4) and $t = 5.67$ for (2,5) surviving.

The null distribution of $T$ is given by the table of values $t$ and probabilities $P(T = t)$ [the numerical values were lost in transcription]. Hence the p-value for a two-sided test is $P(|T| \ge 36.5) = 0.1$. We can see that this data set is too small to be able to draw any real conclusions.

2.2 Randomization Tests

For even moderately large data sets, the number of permutations we have to handle becomes unmanageable. For example, for the visual acuity data set we would have to calculate the difference between means for $^{18}C_8 = 43758$ permutations. This is still possible with a computer, but this is still a small data set. A good solution is to calculate the test statistic for a large (size 1000 or 10,000 perhaps) random sample of permutations. Otherwise, proceed as with permutation tests. Such tests are sometimes called randomization tests. (However, sometimes permutation tests in which the permutations are defined by the randomization carried out in the experiment are called randomization tests, so care is needed.)

Example: visual acuity

Test the psychologist's claim using the VA data given earlier. If there is no difference between the two groups with respect to VA then both samples come from a population with a common distribution. Hence the null hypothesis is
$$H_0 : F_D(x) = F_H(x) \text{ for all } x$$
against the alternative
$$H_1 : F_D(x) \le F_H(x) \text{ with inequality for at least one } x.$$
We will use the permutation test statistic (2.1) to verify this hypothesis. The following GenStat program calculates the value of the statistic $T = \bar{D} - \bar{H}$, say $t_0$, simulates the null distribution of $T$, draws a histogram of it and calculates the p-value for the given data. Assume that the same vector of observations is read into GenStat under two different names: one named VA and the other named VAperm.

variate [values = 8(1), 10(0)] first
calculate second = 1 - first
calc permstat = sum(first*VA)/8 - sum(second*VA)/10
print permstat
scalar numsim
calculate numsim = 1000
variate [nvalues = numsim] perm
for i = 1...numsim
  randomize [seed = 0] VAperm
  calc perm$[i] = sum(first*VAperm)/8 - sum(second*VAperm)/10
endfor
histogram perm
calc v = perm >= permstat
calc p = sum(v)/numsim
print p

Here is the output from GenStat.

[Printed value of permstat, the GenStat histogram of perm, and the printed p-value; the numbers were lost in transcription.]

Conclusions

The p-value is very small, hence we reject the null hypothesis that both samples come from the same population. The data support the claim that deaf children have greater visual acuity than hearing children.
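For readers without GenStat, here is a minimal Python sketch of the same randomization test. The VA measurements were lost in this transcription, so the numbers below are purely hypothetical stand-ins; only the procedure mirrors the program above.

import numpy as np

rng = np.random.default_rng(1)

def randomization_test(x, y, num_sim=10000):
    # Two-sample randomization test with T = xbar - ybar; one-sided
    # p-value, as for the psychologist's claim.
    pooled = np.concatenate([x, y])
    m = len(x)
    t0 = np.mean(x) - np.mean(y)
    t_star = np.empty(num_sim)
    for i in range(num_sim):
        perm = rng.permutation(pooled)
        t_star[i] = np.mean(perm[:m]) - np.mean(perm[m:])
    return t0, np.mean(t_star >= t0)

# Hypothetical stand-ins for the lost VA data (8 deaf, 10 hearing)
deaf = np.array([2.75, 3.14, 3.23, 2.30, 2.64, 3.89, 4.12, 3.30])
hearing = np.array([0.89, 1.43, 1.06, 1.01, 0.94,
                    1.79, 1.12, 2.01, 1.12, 0.87])
print(randomization_test(deaf, hearing))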

2.3 Test for independence of bivariate data

Let observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a realization of i.i.d. r.v.s $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ with a c.d.f. $F_{X,Y}$. We may state the hypotheses as:

$H_0$: there is no association between the r.v.s $X$ and $Y$;
$H_1$: there is an association between the r.v.s $X$ and $Y$.

A suitable test statistic for a permutation test is based on the sample correlation coefficient
$$\hat{\rho} = \frac{\sum_{i=1}^{n} X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}.$$
For a given sample, $n\bar{x}\bar{y}$ and the denominator are constant over all permutations of the data. The only part of the coefficient sensitive to changes due to permutations is the sum $\sum_{i=1}^{n} x_iy_i$. Hence, the function
$$V_p = \sum_{i=1}^{n} X_iY_i \tag{2.3}$$
may be used as a test statistic for the hypothesis of independence. We get the distribution of $V_p$ by calculating values of $V_p$ for all $n!$ ways of pairing the $y_i$ with the $x_i$, or a random selection of them.

2.4 Matched pairs

Let $(y_1, \ldots, y_n)$ and $(z_1, \ldots, z_n)$ be a realization of r.v.s $(Y_1, \ldots, Y_n)$ and $(Z_1, \ldots, Z_n)$ such that the $Y$s and the $Z$s do not have to be independent. Then we analyze the differences $X_i = Y_i - Z_i$, which should be symmetrically distributed about zero if there is no difference between $Y$ and $Z$. The hypotheses are

$H_0$: the $X_i$ have mean zero;
$H_1$: the mean is not zero.

The permutation test statistic is
$$W_p^+ = \sum_{i=1}^{n}\Psi_i|x_i|, \tag{2.4}$$

where the r.v. $\Psi_i$ is
$$\Psi_i = \begin{cases} 1 & \text{if } x_i > 0 \\ 0 & \text{otherwise.} \end{cases}$$
The null distribution of $W_p^+$ is given by
$$P(W_p^+ = w) = \frac{\#(w, n)}{2^n},$$
where $\#(w, n)$ denotes the number of assignments of signs to the data values which give $W_p^+ = w$.

Trivial example

Take $n = 4$. Then we have $2^4 = 16$ possible assignments of positive and negative signs, which can be expressed as pluses and minuses or as zeroes and ones. Furthermore, let $(x_1, x_2, x_3, x_4) = (1, -2, 3, 2)$, so that $w_0 = 1 + 3 + 2 = 6$.

[Table of all 16 sign vectors $(\psi_1, \psi_2, \psi_3, \psi_4)$ and the corresponding values $w$; the entries were lost in transcription.]

Multiplying each column $\psi = (\psi_1, \psi_2, \psi_3, \psi_4)$ of the table by the vector of absolute values of the data, i.e., by $(|x_1|, |x_2|, |x_3|, |x_4|) = (1, 2, 3, 2)$, we get the values of the statistic $W_p^+$, that is the values $w$ in the last row of the table. These give us the exact discrete distribution of $W_p^+$. Here the p-value is
$$p = P(W_p^+ \ge 6) = \frac{4}{16} = 0.25.$$
This sample does not give sufficient evidence to reject the null hypothesis.

Example: Coca-Cola Advertising (Newbold, 1988)

The Coca-Cola Company ran a national advertising campaign based on the slogan "Twice the cola, twice the fun". To test whether the campaign had improved brand awareness, random samples of 500 people in each of 10 cities were asked to name five soft drinks, both before and after the campaign had run. The accompanying table shows the numbers naming Coca-Cola. Test the hypothesis that the campaign made no difference to the customers' awareness of the Coca-Cola brand.

[Table of the numbers naming Coca-Cola in each city (Atlanta, Boston, Chicago, Denver, Los Angeles, Miami, New Orleans, New York, Philadelphia, St. Louis), before ($b_i$) and after ($a_i$) the campaign, with the differences $x_i = a_i - b_i$, $|x_i|$ and $\psi_i$; the numerical values were lost in transcription.]

The null hypothesis is $H_0$: the $X_i$ have mean zero, against the alternative $H_1$: the mean is greater than zero, which would mean that the campaign had been successful. The value of the test statistic (2.4) is
$$w = \sum_{i=1}^{10}\Psi_i|x_i| = 179.$$
Below are the calculations and the simulation of the null distribution to find the p-value for this test.

calc diff = A-B
calc psi = diff > 0
calc absdiff = abs(diff)
calc w = sum(psi*absdiff)
scalar [1000] numsim
variate [nvalues=10] subset
variate [nvalues=numsim] wsim
for i=1...numsim
  calc subset = urand(0)
  calc subset = subset > 0.5
  calc wsim$[i] = sum(subset*absdiff)
endfor
hist wsim
variate [nvalues = numsim] v
calc v = wsim >= w
calc p=sum(v)/numsim
print p

GenStat output:

[Printed value of w, the GenStat histogram of wsim, and the printed p-value; the numbers were lost in transcription.]

2.5 The Mann-Whitney-Wilcoxon Rank Sum Test

Introductory example: Ice Data

Two methods have been used to measure the latent heat of ice. The measurements obtained are:

[Table of the latent-heat measurements and their ranks for Method A and Method B; the numerical values were lost in transcription.]

Are the two methods equivalent? This is a two-treatment comparison problem. As an alternative to a permutation test, a test based on ranks is one of the possibilities.

The MWW test function and its null distribution

Assumptions: $x_1, \ldots, x_m$ are observations on i.i.d. r.v.s $X_1, \ldots, X_m$ with c.d.f. $F_1$; $y_1, \ldots, y_n$ are observations on i.i.d. r.v.s $Y_1, \ldots, Y_n$ with c.d.f. $F_2$. The null hypothesis takes the form
$$H_0 : F_1(x) = F_2(x) \text{ for all } x$$
and possible alternative hypotheses are:

$H_1 : F_1(x) \le F_2(x)$ with inequality for at least one $x$;

$H_1 : F_1(x) \ge F_2(x)$ with inequality for at least one $x$;

$H_1 :$ either $F_1(x) \le F_2(x)$ or $F_1(x) \ge F_2(x)$, with inequality for at least one $x$.

Test statistic: replace $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ with ranks out of the $N = m + n$ integers, so that $R_1, \ldots, R_m, R_{m+1}, \ldots, R_{m+n}$ denote the ranks of the random samples of $X$s and $Y$s respectively. The MWW test statistic is
$$W = \sum_{i=1}^{m} R_i. \tag{2.5}$$

Theorem 2.1 Let random variables $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ be independently distributed according to a common c.d.f. $F$ and let $R_1 < \ldots < R_m$ denote the ranks of the $X$s in the combined ranking of all $N = m + n$ observations. Then for each of the $^NC_m$ possible $m$-tuples $(r_1, \ldots, r_m)$ we have
$$P(R_1 = r_1, \ldots, R_m = r_m) = \frac{1}{^NC_m}. \tag{2.6}$$

This means that if the $X_i$ and $Y_j$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, come from the same population then all orderings (ranks) are equally likely. Hence, the null distribution of $W$ is given by
$$P(W = w) = \frac{\#(w; m, N)}{^NC_m}, \tag{2.7}$$
where $\#(w; m, N)$ denotes the number of all combinations of $m$ ranks out of $\{1, \ldots, N\}$ for which the sum $\sum_{i=1}^{m} r_i$ is equal to $w$. Here we assume that there are no ties.

Theorem 2.2 The null distribution of the statistic $W$ is symmetric about $\frac{1}{2}m(N + 1)$.

Proof Let the $N$ subjects be ranked in inverse order. The subject that previously held rank 1 now has rank $N$, rank 2 is replaced by $N - 1$, and in general rank $R_i$ is replaced by $N - (R_i - 1)$. Let us denote by $R'_i$ the new inverse ranks, $i = 1, \ldots, m$, attached to $X_1, \ldots, X_m$. Then by Theorem 2.1
$$P(R'_1 = r_1, \ldots, R'_m = r_m) = \frac{1}{^NC_m}$$

and $W_{R'} = \sum_{i=1}^{m} R'_i$ has the same null distribution as $W_R = \sum_{i=1}^{m} R_i$. Now,
$$W_{R'} = \sum_{i=1}^{m}\left[N - (R_i - 1)\right] = m(N + 1) - \sum_{i=1}^{m} R_i = m(N + 1) - W_R.$$
Hence
$$W_{R'} - \tfrac{1}{2}m(N + 1) = \tfrac{1}{2}m(N + 1) - W_R.$$
$W_{R'}$ has the same distribution as $W_R$, so it follows that $W_R - \tfrac{1}{2}m(N + 1)$ and $\tfrac{1}{2}m(N + 1) - W_R$ have the same distribution, and
$$P\left(W_R - \tfrac{1}{2}m(N + 1) = k\right) = P\left(\tfrac{1}{2}m(N + 1) - W_R = k\right),$$
which is equivalent to
$$P\left(W_R = \tfrac{1}{2}m(N + 1) + k\right) = P\left(W_R = \tfrac{1}{2}m(N + 1) - k\right).$$
Hence, the distribution of $W_R$ is symmetric about $\frac{1}{2}m(N + 1)$.

Rejecting the null hypothesis

The null hypothesis is rejected (as in all hypothesis tests) if the observed value of $W$ is extreme compared with the null distribution. Let
$$W = \sum_{i=1}^{m} R_i, \qquad w = \sum_{i=1}^{m} r_i.$$
The p-value is defined as $P(W \ge w)$ or $P(W \le w)$ for a one-tailed test, and
$$P\left(|W - E(W)| \ge |w - E(W)|\right)$$
for a two-tailed test.

Trivial example

Let $m = 3$, $n = 5$.

[Table of the two samples with observations and ranks; the observed values were lost in transcription.]

We are interested in testing the hypothesis that the two samples come from the same population. Here we have $w_0 = 7$ and $^8C_3 = 56$. All possible 3-tuples $(r_1, r_2, r_3)$ and the corresponding values of the test statistic are:

(r_1, r_2, r_3)   w
(1,2,3)           6
(1,2,4)           7
(1,2,5)           8
...
(5,6,8)           19
(5,7,8)           20
(6,7,8)           21

The distribution of $W$ is given by the table of values $w = 6, \ldots, 21$ and probabilities $P(W = w)$ [the probabilities were lost in transcription]. Hence the p-value for a two-sided test is $\frac{4}{56} = 0.071$.

Dealing with large samples

There are three possibilities:

1. Tables: for $m$ and $n$ not too large, critical values of $W$ are tabulated.

2. Normal approximation to the distribution of $W$.

3. Simulation of the distribution of $W$.
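The exact null distribution in the trivial example is small enough to enumerate directly. A Python sketch (an illustration, not the course's GenStat) which reproduces the two-sided p-value 4/56:

from itertools import combinations
from math import comb

def mww_null(m, n):
    # Exact null distribution of W = sum of the m ranks, enumerating all
    # C(m+n, m) equally likely rank subsets (Theorem 2.1).
    N = m + n
    dist = {}
    for ranks in combinations(range(1, N + 1), m):
        w = sum(ranks)
        dist[w] = dist.get(w, 0) + 1
    total = comb(N, m)
    return {w: c / total for w, c in sorted(dist.items())}

dist = mww_null(3, 5)
w0 = 7
mu = 3 * (8 + 1) / 2    # E(W) = m(N + 1)/2 = 13.5, by Theorem 2.3
p_two_sided = sum(p for w, p in dist.items() if abs(w - mu) >= abs(w0 - mu))
print(p_two_sided)      # 4/56 = 0.0714...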

Normal approximation

$W$ is the sum of the random variables $R_1, \ldots, R_m$. Under the null hypothesis these are identically distributed. However, they are dependent, and so the Central Limit Theorem does not immediately apply. But if $m$ and $n$ are reasonably large, then the dependence is weak and it can be shown that the null distribution of $W$ is approximately normal, so
$$P\left(\frac{W - E(W)}{\sqrt{\mathrm{var}(W)}} \le a\right) \approx \Phi(a),$$
where $\Phi$ denotes the c.d.f. of a standard normal variable.

Theorem 2.3 Assume that $x_1, \ldots, x_m$ are observations on i.i.d. r.v.s $X_1, \ldots, X_m$ with c.d.f. $F_1$ and $y_1, \ldots, y_n$ are observations on i.i.d. r.v.s $Y_1, \ldots, Y_n$ with c.d.f. $F_2$. Under $H_0 : F_1 = F_2$, i.e., samples 1 and 2 come from the same population, the expected value and the variance of the MWW test statistic $W$ are, respectively:

(a) $E(W) = \frac{1}{2}m(N + 1)$;

(b) $\mathrm{var}(W) = \frac{1}{12}mn(N + 1)$.

Proof Part (a) Let $R_i$ be the rank of $Z_i$, where $(Z_1, \ldots, Z_N) = (X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ and $N = m + n$. $R_i$ is a random variable whose values are $k = 1, 2, \ldots, N$, each having equal probability $\frac{1}{N}$ of occurring. Hence,
$$E(R_i) = \sum_{k=1}^{N} kP(R_i = k) = \frac{1}{N}\sum_{k=1}^{N} k = \frac{1}{N}\cdot\frac{N(N + 1)}{2} = \frac{N + 1}{2}$$
and so
$$E(W) = E\left(\sum_{i=1}^{m} R_i\right) = \sum_{i=1}^{m} E(R_i) = m\,\frac{N + 1}{2}.$$
Part (a) also follows from the symmetry established in Theorem 2.2.

Part (b) We have $\mathrm{var}(W) = E(W^2) - [E(W)]^2$, so we need to calculate $E(W^2)$:
$$E(W^2) = E\left[\left(\sum_{i=1}^{m} R_i\right)^2\right] = E\left(\sum_{i=1}^{m} R_i^2 + \sum_{i=1}^{m}\sum_{j=1, j \ne i}^{m} R_iR_j\right) = \sum_{i=1}^{m} E(R_i^2) + \sum_{i=1}^{m}\sum_{j=1, j \ne i}^{m} E(R_iR_j).$$

Now,
$$E(R_i^2) = \sum_{k=1}^{N} k^2P(R_i = k) = \frac{1}{N}\sum_{k=1}^{N} k^2 = \frac{1}{N}\cdot\frac{N(N + 1)(2N + 1)}{6} = \frac{(N + 1)(2N + 1)}{6}.$$
Also, if $i \ne j$,
$$E(R_iR_j) = \sum_{k=1}^{N}\sum_{l=1, l \ne k}^{N} klP(R_i = k, R_j = l) = \frac{1}{N(N - 1)}\sum_{k=1}^{N}\sum_{l=1, l \ne k}^{N} kl = \frac{1}{N(N - 1)}\left[\left(\sum_{k=1}^{N} k\right)^2 - \sum_{k=1}^{N} k^2\right] = \ldots = \frac{(N + 1)(3N + 2)}{12}.$$
So,
$$E(W^2) = m\,\frac{(N + 1)(2N + 1)}{6} + m(m - 1)\frac{(N + 1)(3N + 2)}{12}$$
and finally we get
$$\mathrm{var}(W) = m\,\frac{(N + 1)(2N + 1)}{6} + m(m - 1)\frac{(N + 1)(3N + 2)}{12} - \frac{m^2(N + 1)^2}{4} = \ldots = \frac{mn(N + 1)}{12}.$$

Example on ice data, continued

Using the two-sample t-test for the two-tailed alternative hypothesis we get a p-value of [value lost in transcription]. In the MWW rank test, the test statistic turns out to be $w_0 = 53$, giving $p = 0.01$ from the tables for the two-tailed alternative. Using the normal approximation we get $p =$ [value lost in transcription]. In both cases we would reject the null hypothesis. However, the t-test indicates stronger evidence against $H_0$ than the rank test does. The t-test is more powerful under the condition that the normality assumption is fulfilled (which we can check using the methods of Chapter 1).

Simulation

$P(W = w)$ can be approximated by the proportion of a large number of permutations of the ranks which give the value $w$.

If $P(W = w) = p$ and there are $M$ permutations, then the observed proportion is a r.v. distributed as $\frac{1}{M}\mathrm{Bin}(M, p)$, with expected value $\frac{1}{M}(Mp) = p$ and variance $\frac{1}{M^2}\left(Mp(1 - p)\right) = \frac{p(1 - p)}{M}$.

2.6 General considerations about rank tests

Rank tests can be constructed for several other problems, as well as comparing the locations of two samples. Assume that $H_0$ implies that all reorderings of the data are equally likely outcomes.

Key steps in constructing a test

1. Replace the observations by their ranks.

2. Choose a test statistic which is sensitive in distinguishing between the null and alternative hypotheses.

3. Calculate the null distribution of the test statistic by recording its value on all possible orderings of the data.

4. Proceed in the usual way, either to find the critical region for a given significance level $\alpha$, or to calculate the p-value of the observed data.

Advantages of rank tests

1. We do not need to assume any parametric family of distributions for the data.

2. They can be devised to be suitable for a wide range of hypothesis testing problems.

3. They are robust: not sensitive to occasional errors in the data or to outliers.

4. They are easy to apply.

Disadvantages of rank tests

1. They can be computationally intensive; this is not a serious problem, as we can approximate the null distribution of the test statistic by a normal distribution or by simulation.

2. They have less power: by replacing observations by ranks some information has been discarded. Parametric tests are more powerful than rank tests provided that the assumptions on the populations are fulfilled.

2.7 Rank tests versus permutation tests

There are some similarities and some differences between the two kinds of nonparametric tests. Which one to choose depends on the given hypothesis-testing problem.

Similarities. Both types of test are:

non-parametric: the underlying distribution does not need to be assumed to belong to a particular family of distributions;

based on the general principles of statistical hypothesis-testing;

exact: for large samples we can get close to the p-value by taking enough simulations;

non-informative about the estimates of the parameters.

Differences.

Criterion        Rank Tests                     Permutation Tests
Robustness       Not sensitive to outliers.     Quite sensitive to outliers.
Power            Less powerful; loss of         More powerful; for large n
                 information by using ranks.    comparable to parametric tests.
Computational    Null distribution depends      Null distribution has to be
complexity       only on the sample size;       calculated for each data set.
                 critical values can be
                 tabulated.
Asymptotic       Usually asymptotically         Depends on the specific test.
distribution     normal (under H_0).
Ties             Slight problem.                No problem.


Chapter 3

Data-Splitting

We now move on to a discussion of resampling methods, which involve sampling from the observed data to try to learn something more about the process which generated the data. In this chapter, we discuss data-splitting methods, which involve sampling subsets of the data, often to see how well they can predict the remaining values of the data.

Let $x_1, \ldots, x_n$ be a realization of the i.i.d. r.v.s $X_1, \ldots, X_n$ with c.d.f. $F$. We are interested in the accuracy and precision of estimation of a population parameter $\theta$.

3.1 The Jackknife

The jackknife is one of the oldest resampling methods. Here we get replications of an estimator $\hat{\theta}$ by constructing new samples by simply omitting one observation at a time, so we get $n$ samples of size $n - 1$. Here is the procedure:

Empirical distribution $\hat{F}$:  $\{x_1, \ldots, x_n\}$

Jackknife samples of size $n - 1$:  $\{x_2, x_3, \ldots, x_n\}$, $\{x_1, x_3, \ldots, x_n\}$, $\ldots$, $\{x_1, \ldots, x_{n-1}\}$

Jackknife replications of $\hat{\theta}$:  $\theta^*_{(1)}, \theta^*_{(2)}, \ldots, \theta^*_{(n)}$

Jackknife estimates:
bias: $(n - 1)(\bar{\theta}^* - \hat{\theta})$, where $\bar{\theta}^* = \frac{1}{n}\sum_{i=1}^{n}\theta^*_{(i)}$;
variance: $\frac{n - 1}{n}\sum_{i=1}^{n}(\theta^*_{(i)} - \bar{\theta}^*)^2$.
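As a concrete illustration of this procedure (a Python sketch rather than the course's GenStat), the following computes the jackknife estimates of bias and variance for an arbitrary estimator. Applied to $\hat{\theta} = \hat{p}\hat{q}$ for the opinion survey data analysed later in this chapter, it reproduces the bias $-\hat{\theta}/(n - 1)$ derived there.

import numpy as np

def jackknife(data, estimator):
    # Jackknife estimates of the bias and variance of estimator(data)
    data = np.asarray(data, dtype=float)
    n = len(data)
    theta_hat = estimator(data)
    theta_i = np.array([estimator(np.delete(data, i)) for i in range(n)])
    theta_bar = theta_i.mean()
    bias = (n - 1) * (theta_bar - theta_hat)
    var = (n - 1) / n * np.sum((theta_i - theta_bar) ** 2)
    return bias, var

# Opinion survey: 75 "yes" answers out of 200; theta_hat = p_hat * q_hat
x = np.array([1] * 75 + [0] * 125)
bias, var = jackknife(x, lambda d: d.mean() * (1 - d.mean()))
print(bias)    # approx -0.00118 = -theta_hat/(n - 1)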

Here, $\theta^*_{(i)}$ is calculated in the same way as $\hat{\theta}$ except that the $i$th observation is omitted. The jackknife estimators of bias and variance of $\hat{\theta}$ are defined to be:
$$\mathrm{Bias}_{\hat{\theta}}(\theta^*_{Jack}) = (n - 1)(\bar{\theta}^* - \hat{\theta}), \tag{3.1}$$
$$\mathrm{var}_{\hat{\theta}}(\theta^*_{Jack}) = \frac{n - 1}{n}\sum_{i=1}^{n}(\theta^*_{(i)} - \bar{\theta}^*)^2. \tag{3.2}$$
For simple types of $\theta$ the jackknife estimator can be calculated explicitly.

Example: Jackknife estimator of the mean and of its variance

Mean. Let $\theta$ be the expected value of a r.v. $X$ with c.d.f. $F$ and let the estimator of $\theta$ be the average of a random sample of size $n$, i.e., $\hat{\theta} = \bar{X}$. The jackknife replications of $\hat{\theta}$ are calculated as
$$\theta^*_{(i)} = \frac{1}{n - 1}\sum_{j=1, j \ne i}^{n} X_j.$$
Is the jackknife estimate of bias of the mean equal to zero? We have
$$\bar{\theta}^* = \frac{1}{n}\sum_{i=1}^{n}\theta^*_{(i)} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{n - 1}\sum_{j=1, j \ne i}^{n} X_j = \frac{1}{n}\cdot\frac{1}{n - 1}(n - 1)\sum_{i=1}^{n} X_i = \bar{X} = \hat{\theta}.$$
The jackknife estimate of bias is therefore
$$\mathrm{Bias}_{\hat{\theta}}(\theta^*_{Jack}) = (n - 1)(\bar{\theta}^* - \hat{\theta}) = 0.$$

Variance of the mean. Here we calculate the jackknife variance of the mean:
$$\mathrm{var}_{\hat{\theta}}(\theta^*_{Jack}) = \frac{n - 1}{n}\sum_{i=1}^{n}(\theta^*_{(i)} - \bar{\theta}^*)^2 = \frac{n - 1}{n}\sum_{i=1}^{n}\left(\frac{1}{n - 1}\sum_{j=1, j \ne i}^{n} X_j - \bar{X}\right)^2 = \frac{n - 1}{n}\sum_{i=1}^{n}\left(\frac{n\bar{X} - X_i}{n - 1} - \bar{X}\right)^2 = \frac{n - 1}{n}\sum_{i=1}^{n}\frac{(\bar{X} - X_i)^2}{(n - 1)^2} = \frac{1}{n(n - 1)}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n}S^2.$$

So, the jackknife estimator of the variance of the mean is the familiar one.

Example: Opinion survey

An opinion survey asked a random sample of 200 people a yes/no question, of whom 75 answered yes. The estimate of the population proportion $p$ of those who would answer yes is $\hat{p} = \frac{75}{200} = \frac{3}{8}$. A social science researcher is interested in the parameter $\theta = p(1 - p) = pq$. A natural estimate of the parameter is
$$\hat{\theta} = \hat{p}\hat{q} = \frac{3}{8}\cdot\frac{5}{8} = \frac{15}{64} = 0.2344.$$
Is the estimator $\hat{\theta} = \hat{p}\hat{q}$ biased? If we knew that the probability of answering yes were the same for each person, then we could use the Bern($p$) distribution and calculate the bias. However, it may be rather unlikely that all people have the same attitude to the question asked, and so this assumption may not be feasible. Then a non-parametric method can help. Here, we calculate the jackknife estimate of the bias of $\hat{\theta} = \hat{p}\hat{q}$.

Let the random variable $X_i$ have two values: 1 if the answer is yes and 0 if the answer is no. Then the sum $\sum_{i=1}^{n} X_i$ is the number of yes answers in the survey. Also, note that $n\hat{p} = \sum_{i=1}^{n} X_i$. Then the $i$th jackknife replication of $\hat{\theta}$, which is based on the sample of size $n - 1$, can be written as
$$\theta^*_{(i)} = \begin{cases} \dfrac{n\hat{p} - 1}{n - 1}\left(1 - \dfrac{n\hat{p} - 1}{n - 1}\right) & \text{if } X_i = 1 \\ \dfrac{n\hat{p}}{n - 1}\left(1 - \dfrac{n\hat{p}}{n - 1}\right) & \text{if } X_i = 0. \end{cases}$$

So, we have
$$\bar{\theta}^* = \frac{1}{n}\sum_{i=1}^{n}\theta^*_{(i)} = \frac{1}{n}\left[n\hat{p}\,\frac{n\hat{p} - 1}{n - 1}\left(1 - \frac{n\hat{p} - 1}{n - 1}\right) + (n - n\hat{p})\,\frac{n\hat{p}}{n - 1}\left(1 - \frac{n\hat{p}}{n - 1}\right)\right].$$
Note that in the above formula $n\hat{p}$ is the number of ones and $n - n\hat{p}$ is the number of zeros. Simplifying the above formula we get
$$\bar{\theta}^* = \frac{n(n - 2)}{(n - 1)^2}\,\hat{p}(1 - \hat{p}) = \frac{n(n - 2)}{(n - 1)^2}\,\hat{p}\hat{q}.$$
Hence, the jackknife estimator of the bias of $\hat{\theta}$ is
$$\mathrm{Bias}_{\hat{\theta}}(\theta^*_{Jack}) = (n - 1)(\bar{\theta}^* - \hat{\theta}) = \left(\frac{n(n - 2)}{n - 1} - (n - 1)\right)\hat{p}\hat{q} = -\frac{\hat{p}\hat{q}}{n - 1} = -\frac{\hat{\theta}}{n - 1}.$$
So, in our example we get
$$\mathrm{Bias}_{\hat{\theta}}(\theta^*_{Jack}) = -\frac{0.2344}{199} = -0.0012.$$
Now we may correct the initial estimate of $\theta$ by the estimate of the bias to obtain
$$\hat{\theta}_{new} = 0.2344 - (-0.0012) = 0.2356.$$

3.2 Cross-Validation

The simplest type of cross-validation, known as leave-one-out cross-validation, uses the same idea as the jackknife, namely leaving out each observation in turn and estimating the parameter(s) from the remaining data. However, the object of interest is usually not the parameters themselves, but predictions of the observation left out. Given $n - 1$ of the observations, how well can the other one be predicted?

Leave-one-out cross-validation is used most often with regression-type models (linear models, ARIMA models, generalized linear models, etc.). Assume that we have bivariate data $(x_1, y_1), \ldots, (x_n, y_n)$ and that we have fitted a model $E(Y_i) = f(x_i; \beta)$, where $\beta$ is a vector of parameters. Let $\hat{\beta}$ be the estimate of $\beta$ from the full data set. Now, let $\hat{\beta}_{(-i)}$ be the estimate, using the same method of estimation, of $\beta$ using the $n - 1$ observations excluding $(x_i, y_i)$. Then $\hat{y}_{(-i)} = f(x_i; \hat{\beta}_{(-i)})$ is the prediction of $y_i$ using the rest of the data.

The overall quality of prediction is often summarised using
$$\sum_{i=1}^{n}(y_i - \hat{y}_{(-i)})^2.$$
In the context of linear models, this is called the prediction error sum of squares or PRESS statistic. Leave-one-out cross-validation is particularly useful in model selection, when we want to find the model which will give the best prediction of new observations.

The idea of cross-validation can be extended to leave-$k$-out. This is useful if, for example, influential observations might occur in groups. However, it is computationally much more demanding.
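A sketch of leave-one-out cross-validation computing the PRESS statistic for a straight-line model (Python; the data and the linear form $f(x; \beta) = \beta_0 + \beta_1 x$ are illustrative assumptions, not from the notes):

import numpy as np

def press(x, y):
    # Leave-one-out prediction error sum of squares for a straight-line
    # fit, refitting without observation i each time.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b1, b0 = np.polyfit(x[keep], y[keep], 1)   # slope, intercept
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)      # illustrative data
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])
print(press(x, y))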


Chapter 4

Bootstrapping

4.1 Non-parametric computational estimation

Let $x_1, \ldots, x_n$ be a realization of the i.i.d. r.v.s $X_1, \ldots, X_n$ with c.d.f. $F$. We are interested in the precision of estimation of a population parameter $\theta_F$. One possibility is to estimate $\theta_F$ by $\theta_{\hat{F}}$, where $\hat{F}$ is the empirical distribution function. We will denote an estimator of a parameter $\theta$ by $\hat{\theta}$.

Examples (where $f(x) = \frac{dF(x)}{dx}$):

1. $\theta_F = E(X) = \int xf(x)\,dx$. Then
$$\theta_{\hat{F}} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
which is the sample mean. Here we assign equal probability $\frac{1}{n}$ to each realization of $X$.

2. $\theta_F = \mathrm{var}(X) = \int(x - E(X))^2 f(x)\,dx$. Then
$$\theta_{\hat{F}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2,$$
which is the (population) variance of the sample.

3. $\theta_F = F(c) = P(X \le c)$. Then
$$\theta_{\hat{F}} = \frac{1}{n}\#\{i : x_i \le c\}.$$

Question How good is $\hat{\theta} = \theta_{\hat{F}}$ as an estimator of $\theta_F$? Three common measures of goodness are:
$$\mathrm{Bias}_\theta(\hat{\theta}) = E_F(\hat{\theta}) - \theta, \tag{4.1}$$
$$\mathrm{se}_\theta(\hat{\theta}) = \sqrt{\mathrm{var}(\hat{\theta})}, \tag{4.2}$$
$$\mathrm{MSE}_\theta(\hat{\theta}) = E_F\left[(\hat{\theta} - \theta)^2\right]. \tag{4.3}$$
We know that
$$\mathrm{MSE}_\theta(\hat{\theta}) = \mathrm{var}(\hat{\theta}) + \left(\mathrm{Bias}_\theta(\hat{\theta})\right)^2. \tag{4.4}$$
Also note that
$$\sqrt{\mathrm{MSE}_\theta(\hat{\theta})} = \sqrt{\mathrm{var}(\hat{\theta}) + \left(\mathrm{Bias}_\theta(\hat{\theta})\right)^2} = \mathrm{se}_\theta(\hat{\theta})\sqrt{1 + \left(\frac{\mathrm{Bias}_\theta(\hat{\theta})}{\mathrm{se}_\theta(\hat{\theta})}\right)^2}.$$

Problem How do we calculate $\mathrm{Bias}_\theta(\hat{\theta})$, $\mathrm{se}_\theta(\hat{\theta})$ and $\mathrm{MSE}_\theta(\hat{\theta})$? If we knew the distribution $F$ then we could calculate the expected value and variance of the estimator $\hat{\theta}$ directly from the definitions. This may be difficult if $f(x) = \frac{dF}{dx}$ is complicated. Then a practical alternative is simulation: generate a large number of random samples of size $n$ from a population with c.d.f. $F$ and calculate a value of $\hat{\theta}$ for each random sample. The mean and variance of the set of generated values of $\hat{\theta}$ will give a good approximation to $E_F(\hat{\theta})$ and $\mathrm{var}_F(\hat{\theta})$.

What if $F$ is unknown? Then simulation from $F$ is impossible. In such situations a further approximation is to replace $F$ by $\hat{F}$. Let $\theta^*$ be the estimate of $\theta$ calculated from a random sample from $\hat{F}$. The idea is that
$$\mathrm{Bias}_\theta(\hat{\theta}) \approx \mathrm{Bias}_{\hat{\theta}}(\theta^*)$$
and
$$\mathrm{var}_F(\hat{\theta}) \approx \mathrm{var}_{\hat{F}}(\theta^*).$$
The heuristic reasoning is that $\hat{F}$ is close to $F$, and so the relationship of $\theta^*$ to $\theta_{\hat{F}}$ should be close to the relationship of $\theta_{\hat{F}}$ to $\theta_F$, as shown in the diagram below.

True unknown $F$  --(data)-->  empirical $\hat{F}$  --(resampling)-->  resampled $\hat{F}^*$
$\theta_F$                     $\theta_{\hat{F}}$                      $\theta^*$

4.2 Bootstrap estimates of bias, standard error and MSE

Assume we do not know $F$. The bootstrap estimates of $\mathrm{Bias}_\theta(\hat{\theta})$, $\mathrm{se}_\theta(\hat{\theta})$ and $\mathrm{MSE}_\theta(\hat{\theta})$ are obtained by substituting $\hat{F}$ for $F$, $\hat{\theta}$ for $\theta$ and $\theta^*$ for $\hat{\theta}$ in (4.1), (4.2) and (4.3). $\hat{F}$ is the distribution which assigns probability $\frac{1}{n}$ to each observation $x_i$, so a random sample from $\hat{F}$ is just a random sample from the set $\{x_1, \ldots, x_n\}$ drawn with replacement. The procedure to calculate the estimates is the following:

construct $N$ samples of size $n$ from $\{x_1, \ldots, x_n\}$ with replacement, and denote the bootstrap samples by $\{x^*_1, \ldots, x^*_n\}_i$, $i = 1, \ldots, N$;

denote by $\theta^*_i$ the value of the estimator calculated for the $i$th bootstrap sample;

calculate the sample mean and variance of the bootstrap estimates $\theta^*_i$, $i = 1, \ldots, N$, that is
$$\bar{\theta}^* = \frac{1}{N}\sum_{i=1}^{N}\theta^*_i, \qquad s^2_{\theta^*} = \frac{1}{N - 1}\sum_{i=1}^{N}(\theta^*_i - \bar{\theta}^*)^2.$$
Then $\mathrm{Bias}_\theta(\hat{\theta})$, $\mathrm{var}_F(\hat{\theta})$, $\mathrm{se}_F(\hat{\theta})$ and $\mathrm{MSE}_\theta(\hat{\theta})$ are approximated, respectively, by $\mathrm{Bias}_{\hat{\theta}}(\theta^*)$, $\mathrm{var}_{\hat{F}}(\theta^*)$, $\mathrm{se}_{\hat{\theta}}(\theta^*)$ and $\mathrm{MSE}_{\hat{\theta}}(\theta^*)$, which are in turn approximated

by
$$\widehat{\mathrm{Bias}}_{\hat{\theta}}(\theta^*) = \bar{\theta}^* - \hat{\theta}, \quad \widehat{\mathrm{var}}_{\hat{F}}(\theta^*) = s^2_{\theta^*}, \quad \widehat{\mathrm{se}}_{\hat{F}}(\theta^*) = \sqrt{s^2_{\theta^*}}, \quad \widehat{\mathrm{MSE}}_{\hat{\theta}}(\theta^*) = s^2_{\theta^*} + (\bar{\theta}^* - \hat{\theta})^2.$$
The following diagram represents the bootstrap resampling method:

Empirical distribution $\hat{F}$:  $\{x_1, \ldots, x_n\}$

Bootstrap samples of size $n$:  $\{x^*_1, \ldots, x^*_n\}_1$, $\{x^*_1, \ldots, x^*_n\}_2$, $\ldots$, $\{x^*_1, \ldots, x^*_n\}_N$

Bootstrap replications of $\hat{\theta}$:  $\theta^*_1, \theta^*_2, \ldots, \theta^*_N$

Bootstrap estimates:
bias: $\bar{\theta}^* - \hat{\theta}$;
variance: $\frac{1}{N - 1}\sum_{i=1}^{N}(\theta^*_i - \bar{\theta}^*)^2$.

Example: sample mean and sample median

Let $x_1, \ldots, x_n$ be a realization of the i.i.d. r.v.s $X_1, \ldots, X_n$ with c.d.f. $F$. Consider $\theta_F = E_F(X_i)$ and let $\theta_{\hat{F}} = \bar{X}$. We know that the sample mean $\bar{X}$ is an unbiased estimator of $E(X_i)$. What is the bootstrap bias of the mean?
$$\mathrm{Bias}_{\hat{\theta}}(\theta^*) = E_{\hat{F}}(\theta^*) - \hat{\theta} = E_{\hat{F}}(\bar{X}^*) - \bar{X} = 0.$$
Is the estimate of the bootstrap bias of the sample mean, $\widehat{\mathrm{Bias}}_{\hat{\theta}}(\bar{X}^*)$, equal to zero as well? Let 2.3, 3.4, 2.5, 3.2, 2.7, 2.6, 3.1, 3.5, 2.9, 2.5 be a sample from a population with a c.d.f. $F$. Here the sample mean is 2.87 and the sample median is 2.80.
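The procedure above is short to code. The following Python sketch ($N$ and the seed are arbitrary choices) estimates the bootstrap bias, standard error and MSE of the sample median, and the bootstrap bias of the mean, for the ten observations just given.

import numpy as np

rng = np.random.default_rng(2)

def bootstrap(data, estimator, N=10000):
    # Simple bootstrap: resample n values with replacement, N times
    data = np.asarray(data, dtype=float)
    n = len(data)
    theta_hat = estimator(data)
    theta_star = np.array([estimator(rng.choice(data, size=n, replace=True))
                           for _ in range(N)])
    bias = theta_star.mean() - theta_hat
    se = theta_star.std(ddof=1)
    mse = se ** 2 + bias ** 2
    return bias, se, mse

data = [2.3, 3.4, 2.5, 3.2, 2.7, 2.6, 3.1, 3.5, 2.9, 2.5]
print(bootstrap(data, np.median))   # bias, se, MSE of the median
print(bootstrap(data, np.mean))     # bootstrap bias of the mean is near 0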



More information

MIT Spring 2015

MIT Spring 2015 MIT 18.443 Dr. Kempthorne Spring 2015 MIT 18.443 1 Outline 1 MIT 18.443 2 Batches of data: single or multiple x 1, x 2,..., x n y 1, y 2,..., y m w 1, w 2,..., w l etc. Graphical displays Summary statistics:

More information

Chapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics

Chapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics Understand Difference between Parametric and Nonparametric Statistical Procedures Parametric statistical procedures inferential procedures that rely

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018

Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018 Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018 Sampling A trait is measured on each member of a population. f(y) = propn of individuals in the popn with measurement

More information

Inferential Statistics

Inferential Statistics Inferential Statistics Eva Riccomagno, Maria Piera Rogantin DIMA Università di Genova riccomagno@dima.unige.it rogantin@dima.unige.it Part G Distribution free hypothesis tests 1. Classical and distribution-free

More information

Nonparametric Location Tests: k-sample

Nonparametric Location Tests: k-sample Nonparametric Location Tests: k-sample Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 04-Jan-2017 Nathaniel E. Helwig (U of Minnesota)

More information

Lecture 13: Subsampling vs Bootstrap. Dimitris N. Politis, Joseph P. Romano, Michael Wolf

Lecture 13: Subsampling vs Bootstrap. Dimitris N. Politis, Joseph P. Romano, Michael Wolf Lecture 13: 2011 Bootstrap ) R n x n, θ P)) = τ n ˆθn θ P) Example: ˆθn = X n, τ n = n, θ = EX = µ P) ˆθ = min X n, τ n = n, θ P) = sup{x : F x) 0} ) Define: J n P), the distribution of τ n ˆθ n θ P) under

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Nonparametric hypothesis tests and permutation tests

Nonparametric hypothesis tests and permutation tests Nonparametric hypothesis tests and permutation tests 1.7 & 2.3. Probability Generating Functions 3.8.3. Wilcoxon Signed Rank Test 3.8.2. Mann-Whitney Test Prof. Tesler Math 283 Fall 2018 Prof. Tesler Wilcoxon

More information

3.1 MWW Test. Normal Approximation to statistic W.

3.1 MWW Test. Normal Approximation to statistic W. School of Mathematical Sciences MAS 311 Computational Techniques in Statistics Practical 3 To get access to data sets, which are in Directory STU, you have to login in the Computing Techniques Environment.

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

Purposes of Data Analysis. Variables and Samples. Parameters and Statistics. Part 1: Probability Distributions

Purposes of Data Analysis. Variables and Samples. Parameters and Statistics. Part 1: Probability Distributions Part 1: Probability Distributions Purposes of Data Analysis True Distributions or Relationships in the Earths System Probability Distribution Normal Distribution Student-t Distribution Chi Square Distribution

More information

STA 2101/442 Assignment 2 1

STA 2101/442 Assignment 2 1 STA 2101/442 Assignment 2 1 These questions are practice for the midterm and final exam, and are not to be handed in. 1. A polling firm plans to ask a random sample of registered voters in Quebec whether

More information

Stat 710: Mathematical Statistics Lecture 31

Stat 710: Mathematical Statistics Lecture 31 Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:

More information

Design of the Fuzzy Rank Tests Package

Design of the Fuzzy Rank Tests Package Design of the Fuzzy Rank Tests Package Charles J. Geyer July 15, 2013 1 Introduction We do fuzzy P -values and confidence intervals following Geyer and Meeden (2005) and Thompson and Geyer (2007) for three

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2018 Examinations Subject CT3 Probability and Mathematical Statistics Core Technical Syllabus 1 June 2017 Aim The

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

Chapter 3. Comparing two populations

Chapter 3. Comparing two populations Chapter 3. Comparing two populations Contents Hypothesis for the difference between two population means: matched pairs Hypothesis for the difference between two population means: independent samples Two

More information

STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test.

STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test. STAT 135 Lab 8 Hypothesis Testing Review, Mann-Whitney Test by Normal Approximation, and Wilcoxon Signed Rank Test. Rebecca Barter March 30, 2015 Mann-Whitney Test Mann-Whitney Test Recall that the Mann-Whitney

More information

Rank-Based Methods. Lukas Meier

Rank-Based Methods. Lukas Meier Rank-Based Methods Lukas Meier 20.01.2014 Introduction Up to now we basically always used a parametric family, like the normal distribution N (µ, σ 2 ) for modeling random data. Based on observed data

More information

6 Single Sample Methods for a Location Parameter

6 Single Sample Methods for a Location Parameter 6 Single Sample Methods for a Location Parameter If there are serious departures from parametric test assumptions (e.g., normality or symmetry), nonparametric tests on a measure of central tendency (usually

More information

Y i = η + ɛ i, i = 1,...,n.

Y i = η + ɛ i, i = 1,...,n. Nonparametric tests If data do not come from a normal population (and if the sample is not large), we cannot use a t-test. One useful approach to creating test statistics is through the use of rank statistics.

More information

Nonparametric statistic methods. Waraphon Phimpraphai DVM, PhD Department of Veterinary Public Health

Nonparametric statistic methods. Waraphon Phimpraphai DVM, PhD Department of Veterinary Public Health Nonparametric statistic methods Waraphon Phimpraphai DVM, PhD Department of Veterinary Public Health Measurement What are the 4 levels of measurement discussed? 1. Nominal or Classificatory Scale Gender,

More information

Why is the field of statistics still an active one?

Why is the field of statistics still an active one? Why is the field of statistics still an active one? It s obvious that one needs statistics: to describe experimental data in a compact way, to compare datasets, to ask whether data are consistent with

More information

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression Robust Statistics robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Summary of Chapters 7-9

Summary of Chapters 7-9 Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

STA 2201/442 Assignment 2

STA 2201/442 Assignment 2 STA 2201/442 Assignment 2 1. This is about how to simulate from a continuous univariate distribution. Let the random variable X have a continuous distribution with density f X (x) and cumulative distribution

More information

5 Introduction to the Theory of Order Statistics and Rank Statistics

5 Introduction to the Theory of Order Statistics and Rank Statistics 5 Introduction to the Theory of Order Statistics and Rank Statistics This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order

More information

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015 STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots March 8, 2015 The duality between CI and hypothesis testing The duality between CI and hypothesis

More information

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Econometrics Working Paper EWP0401 ISSN 1485-6441 Department of Economics AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Lauren Bin Dong & David E. A. Giles Department of Economics, University of Victoria

More information

Statistics II Lesson 1. Inference on one population. Year 2009/10

Statistics II Lesson 1. Inference on one population. Year 2009/10 Statistics II Lesson 1. Inference on one population Year 2009/10 Lesson 1. Inference on one population Contents Introduction to inference Point estimators The estimation of the mean and variance Estimating

More information

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location Bootstrap tests Patrick Breheny October 11 Patrick Breheny STA 621: Nonparametric Statistics 1/14 Introduction Conditioning on the observed data to obtain permutation tests is certainly an important idea

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Lecture 26. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 26. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. s Sign s Lecture 26 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University December 19, 2007 s Sign s 1 2 3 s 4 Sign 5 6 7 8 9 10 s s Sign 1 Distribution-free

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49 4 HYPOTHESIS TESTING 49 4 Hypothesis testing In sections 2 and 3 we considered the problem of estimating a single parameter of interest, θ. In this section we consider the related problem of testing whether

More information

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA)

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA) BSTT523 Pagano & Gauvreau Chapter 13 1 Nonparametric Statistics Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA) In particular, data

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

Permutation Tests. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Permutation Tests. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Permutation Tests Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods The Two-Sample Problem We observe two independent random samples: F z = z 1, z 2,, z n independently of

More information

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University babu.

Bootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University  babu. Bootstrap G. Jogesh Babu Penn State University http://www.stat.psu.edu/ babu Director of Center for Astrostatistics http://astrostatistics.psu.edu Outline 1 Motivation 2 Simple statistical problem 3 Resampling

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. er15 Chapte Chi-Square Tests d Chi-Square Tests for -Fit Uniform Goodness- Poisson Goodness- Goodness- ECDF Tests (Optional) Contingency Tables A contingency table is a cross-tabulation of n paired observations

More information

Gov 2002: 3. Randomization Inference

Gov 2002: 3. Randomization Inference Gov 2002: 3. Randomization Inference Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last week: This week: What can we identify using randomization? Estimators were justified via

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

The assumptions are needed to give us... valid standard errors valid confidence intervals valid hypothesis tests and p-values

The assumptions are needed to give us... valid standard errors valid confidence intervals valid hypothesis tests and p-values Statistical Consulting Topics The Bootstrap... The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates. (Efron and Tibshrani, 1998.) What do we do when our

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

BEST TESTS. Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized.

BEST TESTS. Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized. BEST TESTS Abstract. We will discuss the Neymann-Pearson theorem and certain best test where the power function is optimized. 1. Most powerful test Let {f θ } θ Θ be a family of pdfs. We will consider

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Inferences About the Difference Between Two Means

Inferences About the Difference Between Two Means 7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent

More information