MA 575 Linear Models: Cedric E. Ginestet, Boston University
Bootstrap for Regression
Week 9, Lecture 1

1 The General Bootstrap

The bootstrap is a computer-intensive resampling algorithm for estimating the distribution of a random variable X from a set of observations x = {x_1, ..., x_n}, via its empirical distribution function (EDF). This technique therefore allows us to obtain empirical estimates, without making any assumptions about the distribution of X, when analytical ones are not available. The bootstrap was first introduced in 1979 as an algorithm for obtaining reliable estimates of standard errors (Efron and Tibshirani, 1993). According to legend, Baron Munchausen saved himself from drowning in quicksand by pulling himself up using only his bootstraps. The statistical bootstrap, which uses re-sampling from a given set of data to mimic the variability that produced the data in the first place, has a rather more dependable theoretical basis, and can be a highly effective procedure for the estimation of error quantities in statistical problems.

1.1 Motivations for Using the Bootstrap

When performing regression analysis, the distributional assumptions on the behavior of the error terms need not be satisfied. In such cases, it may be difficult to identify the distribution of the regression coefficients, and to compute a test statistic on that basis. Thus far, we have invoked the central limit theorem (CLT) to justify our use of normal assumptions. However, for small sample sizes, such theoretical justifications will not apply. Hence, we need to use a non-parametric toolkit, which does not make any assumption about the distribution of the data.

1.2 The Plug-in Principle

Consider a general scenario in which we have drawn realizations from an unknown population distribution F, such that y_1, ..., y_n \sim F. The sample mean of these realizations is then computed as
\bar{y}_n = \frac{1}{n} \sum_{i=1}^{n} y_i.
What is the standard error of the statistic \bar{y}_n?
Invoking the central limit theorem, we know that for moderately large n, we would obtain
\bar{y}_n \approx N(\mu_F, \sigma_F^2 / n),

Department of Mathematics and Statistics, Boston University
where \mu_F and \sigma_F^2 are the mean and variance of the unknown F-distributed random variable. Using this result, we can define the standard error of \bar{y}_n as follows,
\mathrm{se}_F(\bar{Y}_n) = \left( \mathrm{Var}[\bar{Y}_n] \right)^{1/2} = \left( \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[Y_i] \right)^{1/2} = \frac{\sigma_F}{\sqrt{n}},
where we have emphasized the dependence of this quantity on F through the use of a subscript. Here, the population standard deviation can be estimated using the sample estimate,
\hat{\sigma}_F^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y}_n)^2.
Alternatively, we can use the plug-in principle, which proposes to replace the unknown distribution F by a sample estimate \hat{F}, such that we may use \mathrm{se}_{\hat{F}}(\bar{Y}_n), where \hat{F} is obtained by bootstrapping the sampled data.

1.3 Sampling with Replacement

We first introduce the notion of a bootstrap sample, denoted y^*_b. Each such bootstrap sample is drawn from the empirical distribution function (EDF), constructed using the original sample y_1, ..., y_n, such that
\hat{F}_n(t; y) := \frac{1}{n} \sum_{i=1}^{n} I\{y_i \le t\},
where I\{\cdot\} is the indicator function, defined as follows,
I\{y_i \le t\} := \begin{cases} 1 & \text{if } y_i \le t, \\ 0 & \text{otherwise.} \end{cases}
The EDF, \hat{F}_n(y), is therefore obtained by assigning an equal probability 1/n and a label i_1, i_2, ..., i_n to each element in y. We can then sample with replacement from the EDF by drawing n values from the distribution of the indexes. That is, drawing samples from the EDF,
y^*_j \sim \hat{F}_n, \quad j = 1, ..., n;
is equivalent to drawing indexes from a uniform distribution on the indexes between 1 and n,
i_j \sim \mathrm{Unif}(1, ..., n), \quad j = 1, ..., n.
The resulting bootstrap sample consists of the following sequence of elements,
\{y^*_1 = y_{i_1}, \; y^*_2 = y_{i_2}, \; ..., \; y^*_n = y_{i_n}\},
forming an n-dimensional bootstrap sample. This procedure is repeated B times in order to produce b = 1, ..., B samples of the form,
y^*_b := [y^*_{1b}, ..., y^*_{nb}]^T.
Such bootstrap samples are best conceived as a resampling or a randomization of the original data.
Sampling with replacement ensures that the bootstrap samples are indeed probabilistically independent,
E[y^*_j (y^*_k)^T] = E[y^*_j] \, E[y^*_k]^T, \quad j \ne k, \quad j, k = 1, ..., B;
where we are here treating each y^*_j as a random vector. It is common practice to draw about B = 1000 bootstrap samples. However, Efron and Tibshirani (1993) originally advocated that anything between 25 and 200 samples was sufficient for most inferential purposes.
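As a concrete illustration, the index-based resampling scheme described above can be sketched in Python with NumPy. The exponential data and the seed are illustrative assumptions, not part of the lecture; only the index-drawing mechanics match the text.

```python
import numpy as np

rng = np.random.default_rng(575)  # arbitrary seed for reproducibility

# Original sample y_1, ..., y_n (illustrative data, not from the lecture).
y = rng.exponential(scale=2.0, size=50)
n = len(y)

B = 1000  # number of bootstrap samples, as suggested in the text

# Drawing from the EDF is equivalent to drawing indexes i_j ~ Unif(1, ..., n)
# with replacement; each row of `idx` yields one bootstrap sample y*_b.
idx = rng.integers(low=0, high=n, size=(B, n))
boot_samples = y[idx]  # shape (B, n): B bootstrap samples of size n

print(boot_samples.shape)
```

Each row of `boot_samples` is one randomization of the original data, with some observations repeated and others omitted.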
1.4 Bootstrapped Standard Error

Continuing the previous example, we may be interested in estimating the standard error of the statistic \bar{y}_n using the bootstrap. Such an estimate can be obtained by computing the statistic of interest (here, the sample mean of the data y) for each bootstrap sample,
\hat{\theta}^*_b := \frac{1}{n} \sum_{i=1}^{n} y^*_{ib}. \tag{1}
Once this is obtained, it suffices to compute the standard error of this distribution of bootstrapped sample means,
\mathrm{se}_{\hat{F}}(\bar{y}) := \left( \frac{1}{B-1} \sum_{b=1}^{B} (\hat{\theta}^*_b - \bar{\theta}^*)^2 \right)^{1/2},
where the mean of the bootstrapped sample means is given by
\bar{\theta}^* := \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.
The quantity \mathrm{se}_{\hat{F}}(\bar{y}) is then referred to as the bootstrapped standard error. Of course, this procedure could be repeated for any statistic \hat{\theta} := s(y), since we are only using the fact that the quantity of interest is a function of the data. In such cases, the bootstrap estimates in equation (1) would be computed using the bootstrap samples, such that \hat{\theta}^*_b := s(y^*_b).

The central advantage of using the bootstrap is that we can control the accuracy of the bootstrap estimate through our choice of B. A larger value of B will yield a better estimate of the ideal bootstrap estimate, which would be based on all possible resamples of the data vector y. Because the number of such resamples grows combinatorially with n, we have adopted a Monte Carlo method for estimating this quantity. Since the bootstrap does not make any assumption about the distribution of the data, it should be regarded as a non-parametric procedure.

2 Bootstrap for Regression

A key assumption made when conducting simple or multiple regression is that the error terms are normally distributed. In many practical situations, such an assumption may be untenable, or difficult to verify. When this occurs, one can resort to a bootstrap estimation of the standard errors in the model of interest. There exist two different methods for applying the bootstrap to regression.
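The bootstrapped standard error of Section 1.4 can be sketched as follows, alongside the plug-in estimate \hat{\sigma}_F / \sqrt{n} from Section 1.2 for comparison. The simulated normal data are an illustrative assumption; for normal data the two estimates should roughly agree.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=3.0, size=100)  # illustrative data
n, B = len(y), 2000

# theta*_b: the sample mean of each bootstrap sample, as in equation (1).
theta_star = np.array([
    y[rng.integers(0, n, size=n)].mean() for _ in range(B)
])

# Bootstrapped standard error: sample standard deviation of the B
# bootstrap means, with the 1/(B-1) normalization used in the text.
se_boot = theta_star.std(ddof=1)

# Plug-in / CLT-based comparison: sigma_hat / sqrt(n).
se_clt = y.std(ddof=1) / np.sqrt(n)

print(se_boot, se_clt)
```

Increasing B sharpens the Monte Carlo approximation of the ideal bootstrap estimate, at purely computational cost.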
One can either sample the pairs of predictors and observed values, or directly re-sample the residuals, once we have fitted the model.

2.1 Bootstrapping Cases

Firstly, a naive approach to bootstrap estimation in regression analysis is to re-sample cases. With this approach, we proceed by drawing bootstrap samples of pairs,
b^*_b := \{(y^*_{i_1 b}, x^*_{i_1 b}), ..., (y^*_{i_n b}, x^*_{i_n b})\},
for every b = 1, ..., B. For each vector of bootstrap replicates, we compute \hat{\beta}^*_b, which is obtained by minimizing the RSS based on each bootstrap sample, b^*_b, such that
\hat{\beta}^*_b := \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y^*_{ib} - (x^*_{ib})^T \beta)^2.
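A minimal sketch of bootstrapping cases follows. The simulated design, the deliberately non-normal (shifted exponential) errors, and the use of np.linalg.lstsq as the RSS minimizer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data (illustrative): y = 1 + 2x + error,
# with non-normal, mean-zero errors.
n = 80
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.exponential(1.0, size=n) - 1.0

B = 1000
beta_star = np.empty((B, 2))
for b in range(B):
    # Resample cases: draw n (y_i, x_i) pairs with replacement,
    # then minimize the RSS on the resampled pairs.
    i = rng.integers(0, n, size=n)
    beta_star[b], *_ = np.linalg.lstsq(X[i], y[i], rcond=None)

# Bootstrap standard errors of the OLS coefficients.
se_beta = beta_star.std(axis=0, ddof=1)
print(se_beta)
```

Note that each bootstrap fit re-solves the least-squares problem on the resampled pairs, so no homoscedasticity assumption is invoked.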
The bootstrap estimate of the standard error of an estimator in our model, say \hat{\beta}_l for instance, with l = 1, ..., p, can then be computed as
\mathrm{se}(\hat{\beta}_l) = \left( \frac{1}{B-1} \sum_{b=1}^{B} (\hat{\beta}^*_{lb} - \bar{\beta}^*_l)^2 \right)^{1/2},
where the bootstrap mean is \bar{\beta}^*_l := \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}^*_{lb}.

2.2 Bootstrapping Residuals

Alternatively, one can sample with replacement from the residuals of a fitted model based on the OLS estimator \hat{\beta}. This produces the following bootstrap sample, based on the fitted values \hat{y}_i's,
b^*_b := \{(x_1^T \hat{\beta} + \hat{e}^*_{i_1 b}, \, x_1), ..., (x_n^T \hat{\beta} + \hat{e}^*_{i_n b}, \, x_n)\},
where, for every j = 1, ..., n, we could also have defined y^*_j := x_j^T \hat{\beta} + \hat{e}_{i_j}. Note that the vector of predictors x_j^T does not have the same index as the residual \hat{e}_{i_j}. The latter quantity was sampled with replacement from the EDF of the residuals under the OLS estimator, \hat{\beta},
\{\hat{e}_1 = y_1 - \hat{y}_1, ..., \hat{e}_n = y_n - \hat{y}_n\}.
That is, in this procedure, we are first fitting our standard model to derive the OLS estimate, \hat{\beta}. This, in turn, allows us to resample the residuals, given that particular estimate.

This second strategy is less statistically robust than bootstrapping cases, as it assumes that homoscedasticity holds. That is, since we are breaking the dependence of the residuals on the vectors of predictors, x_i, we are implicitly assuming that the variance of the residuals does not depend on the values of x_i. When this assumption is unlikely to hold, it is preferable to bootstrap cases, which is more robust than bootstrapping the residuals.

3 Theory of the Bootstrap

3.1 Consistency of the EDF

For any set of random variables \{Y_1, ..., Y_n\} from some unknown cumulative distribution function (CDF), denoted F, the empirical distribution function (EDF), \hat{F}_n, is defined for any t \in \mathbb{R} as
\hat{F}_n(t; Y) := \frac{1}{n} \sum_{i=1}^{n} I\{Y_i \le t\},
where we have emphasized the fact that \hat{F}_n is a random quantity, which depends on the full n-dimensional random vector, Y. The EDF has two desirable properties. It is both (i) unbiased and (ii) consistent, with respect to F.
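These two properties can also be checked by simulation. A minimal sketch, assuming Exp(1) data so that the true CDF F(t) = 1 - e^{-t} is available in closed form; the sample sizes and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
t = 1.0
F_t = 1.0 - np.exp(-t)  # true CDF of the Exp(1) distribution at t

# Unbiasedness: average F_hat_n(t) over many replicated samples of size n;
# the average should match F(t) even for small n.
n, reps = 20, 5000
edf_values = np.array([
    (rng.exponential(size=n) <= t).mean() for _ in range(reps)
])
print(edf_values.mean(), F_t)

# Consistency: a single sample with large n gives F_hat_n(t) close to F(t).
big = rng.exponential(size=200_000)
print((big <= t).mean(), F_t)
```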
To show that \hat{F}_n is unbiased with respect to the target CDF, F, it suffices to take the expectation for some t \in \mathbb{R} and any n \in \mathbb{N},
E[\hat{F}_n(t; Y)] = \int_{\mathbb{R}^n} \hat{F}_n(t; y) \, dF(y_1) \cdots dF(y_n)
= \frac{1}{n} \sum_{i=1}^{n} \int_{\mathbb{R}} I\{y_i \le t\} \, dF(y_i)
= \frac{1}{n} \sum_{i=1}^{n} P[Y_i \le t] = P[Y \le t] = F(t),
where the penultimate step follows from the fact that Y_i \sim F, for every i = 1, ..., n.

Secondly, \hat{F}_n can also be shown to be consistent, in the sense that as n \to \infty, the estimate \hat{F}_n(t; Y_n) converges to F(t), for every t \in \mathbb{R}. That is, for every t \in \mathbb{R}, we have the following pointwise convergence,
P\left[ \lim_{n \to \infty} \hat{F}_n(t; Y_n) = F(t) \right] = 1.
This is simply the strong law of large numbers, which states that for any random variable X with finite mean, we have \bar{X}_n \to_{a.s.} E[X]. In this case, the sequence of random variables is composed of the \hat{F}_n(t; Y).

3.2 Unbiasedness vs. Consistency

Observe that the unbiasedness and consistency of an estimator are two different criteria.

i. Unbiasedness refers to the average behavior of an estimator: what is its expectation?
ii. Consistency captures the long-range behavior of an estimator, and is generally based on one of the laws of large numbers.

Observe that these two criteria are independent. An estimator can be unbiased yet inconsistent: take any sequence of sample means with expectation \theta, such that \bar{X}_n \to \theta, and some random variable Y centered at 0; then
E[\bar{X}_n + Y] = \theta, \quad \text{and} \quad \lim_{n \to \infty} (\bar{X}_n + Y) \ne \theta, \; a.s.
Inversely, we may also have an estimator which is consistent, yet biased, such as, for instance, \bar{X}_n + 1/n, which is biased for every n, but nonetheless consistent. That is,
E\left[ \bar{X}_n + \frac{1}{n} \right] = \theta + \frac{1}{n}, \quad \text{and} \quad \lim_{n \to \infty} \left( \bar{X}_n + \frac{1}{n} \right) = \theta, \; a.s.

3.3 Rates of Convergence

Taken together, these results show that the good performance of the bootstrap relies on the rate of convergence of the EDF, \hat{F}_n, to the population distribution, F. Therefore, we have replaced a distributional assumption on the random variables of interest by an appeal to the strong law of large numbers.
Since the strong law of large numbers converges at a rate O(1/n), it follows that we are gaining in accuracy over a reliance on the central limit theorem, whose convergence rate is only of order O(1/\sqrt{n}). Roughly, for any sequence of random variables X_1, ..., X_n, with mean E[X_i] = \mu, the sum S_n converges as follows,
\frac{S_n}{n} \to_{a.s.} \mu.
The strong law captures the first-order approximation of the sample mean. If, in addition, we know that \mathrm{Var}[X_i] = \sigma^2, for every i = 1, ..., n, we then have
\frac{S_n - n\mu}{\sqrt{n}} \to_d N(0, \sigma^2),
which represents a second-order approximation of the mean \mu. When using the bootstrap, we are exploiting the fact that the strong law of large numbers has a better rate of convergence than the central limit theorem.

References

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.