
Chapter 4

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

4.1 Introduction

It is by now well established that the ridge regression estimator (here we take the ordinary ridge estimator (ORE) in particular) can be of great use for estimating the unknown regression coefficients in the presence of multicollinearity. But apart from its ability to produce parameter estimates with smaller MSE than the OLSE, it must also provide sound solutions to more intricate inference problems, such as obtaining confidence intervals. The problem with the ORE is that its sampling distribution is unknown; hence we make use of resampling methods to obtain asymptotic confidence intervals for the regression coefficients based on the ORE. Recently, Firinguetti and Bobadilla [40] developed asymptotic confidence intervals for the regression coefficients based on the ORE and an Edgeworth expansion. Crivelli et al. [17] proposed a technique that combines the bootstrap and the Edgeworth expansion to obtain an approximation to the distribution of some ridge regression estimators, and carried out some simulation experiments. Alheety and Ramanathan [4] proposed a data-dependent confidence interval for the ridge parameter $k$.

Confidence intervals have become very popular in the applied statistician's collection of data-analytic tools. They combine point estimation and hypothesis testing into a single inferential statement. Recent advances in statistical methods and fast computing allow the construction of highly accurate approximate confidence intervals. A confidence interval for a given population parameter $\theta$ is a sample-based range $[\hat{\theta}_1, \hat{\theta}_2]$ for the unknown $\theta$. The range possesses the property that $\theta$ lies within its bounds with a high specified probability. Of course, this probability is with respect to all possible samples, each sample giving rise to a confidence interval which depends on the chance mechanism involved in drawing the samples. There are two approaches to confidence intervals: the construction of exact intervals for special cases, like the ratio of normal means or a single binomial parameter, and the most commonly used approximate confidence intervals, also known as standard intervals (or normal theory intervals), of the form
$$\hat{\theta} \pm z^{(\alpha)}\hat{\sigma}, \qquad (4.1.1)$$
where $\hat{\theta}$ is a point estimate of the parameter $\theta$, $\hat{\sigma}$ is an estimate of the standard deviation of $\hat{\theta}$, and $z^{(\alpha)}$ is the $100\alpha$-th percentile of a standard normal variate (for example, $z^{(0.95)} = 1.645$). The main drawback of standard intervals is that they are based on an asymptotic approximation that may not be accurate in practice. There has been considerable progress on developing better confidence interval techniques for improving the standard interval, involving bias corrections, parameter transformations, etc. Numerous methods have been proposed to improve upon standard intervals. These methods produce approximate confidence intervals that have better coverage accuracy than the standard one. Some references in this area include McCullagh [67], Cox and Reid [16], Efron [34], Hall [49], and DiCiccio and Tibshirani [24]. The confidence intervals based on resampling methods like the bootstrap and the jackknife can be seen as automatic algorithms for carrying out these improvements. We have already discussed that the bootstrap and the jackknife are two powerful methods for variance estimation, which is why we can use them to produce confidence intervals. We discuss the different methods for construction of the bootstrap and jackknife confidence intervals in Sections (4.2) and (4.3) respectively, Section (4.4) contains the model and the estimators, Section (4.5) gives a simulation study for the comparison of coverage probabilities and average confidence widths based on the ORE and the OLSE, and Section (4.6) consists of some concluding remarks.

4.2 Bootstrap Confidence Intervals

Suppose that we draw a sample $S = (x_1, x_2, \ldots, x_n)$ of size $n$ from a population of size $N$, and that we are interested in some statistic $\hat{\theta}$ as an estimate of the corresponding population parameter $\theta$. In classical inference, we derive the asymptotic sampling distribution of $\hat{\theta}$ when its exact distribution is intractable or has some drawbacks. This sampling distribution of $\hat{\theta}$ may be inaccurate if the assumptions about the population are wrong or the sample size is relatively small. On the other hand, as discussed earlier, the nonparametric bootstrap enables us to estimate the sampling distribution of a statistic empirically, without making any assumptions about the parent population and without deriving the sampling distribution explicitly. In this method, we draw a sample of size $n$ from the elements of the sample $S$, with replacement. This is the first bootstrap resample, $S^*_1 = (x^*_{11}, x^*_{12}, \ldots, x^*_{1n})$. Each element of the resample is selected from the original sample with probability $1/n$. We repeat this resampling step a large number of times, say $B$, so that the $r$-th bootstrap resample is $S^*_r = (x^*_{r1}, x^*_{r2}, \ldots, x^*_{rn})$. Next, we compute the statistic for each bootstrap resample, obtaining $\hat{\theta}^*_r$. The distribution of $\hat{\theta}^*_r$ around the original statistic $\hat{\theta}$ is similar to the sampling distribution of $\hat{\theta}$ around the population parameter $\theta$. The average of the statistics computed from the bootstrap resamples estimates the expectation of the bootstrapped statistic; it is given by
$$\bar{\theta}^* = \frac{1}{B}\sum_{r=1}^{B}\hat{\theta}^*_r.$$
Then $\hat{B} = \bar{\theta}^* - \hat{\theta}$ gives an estimate of the bias $E(\hat{\theta}) - \theta$ of $\hat{\theta}$. The estimate of the bootstrap variance of $\hat{\theta}^*$ is
$$\hat{V}(\hat{\theta}^*) = \frac{1}{B-1}\sum_{r=1}^{B}\left(\hat{\theta}^*_r - \bar{\theta}^*\right)^2,$$
which estimates the sampling variance of $\hat{\theta}$. With the help of this variance estimate, we can easily produce bootstrap confidence intervals. There are several methods for constructing bootstrap confidence intervals, each of which is briefly discussed below.
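Before turning to the individual methods, the scheme above is easy to state in code. Below is a minimal R sketch for the sample mean; the data, the statistic and the seed are illustrative choices, and B = 1999 matches the number of resamples used later in Section (4.5).

```r
# Nonparametric bootstrap for the sample mean: resample with
# replacement, recompute the statistic, then estimate bias and variance.
set.seed(1)
x <- rnorm(30, mean = 5)        # original sample S of size n (illustrative)
theta_hat <- mean(x)            # statistic computed on S
B <- 1999                       # number of bootstrap resamples
theta_star <- replicate(B, mean(sample(x, replace = TRUE)))  # theta*_r
bias_hat <- mean(theta_star) - theta_hat  # estimate of the bias
var_hat  <- var(theta_star)     # sum (theta*_r - mean)^2 / (B - 1)
```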

4.2.1 Normal Theory Method

The first method for constructing a bootstrap confidence interval is based on the assumption that the sampling distribution of $\hat{\theta}$ is normal. It uses the bootstrap estimate of the sampling variance to construct a $100(1-\alpha)\%$ confidence interval of the form
$$(2\hat{\theta} - \bar{\theta}^*) \pm z^{(1-\alpha/2)}\,\widehat{SE}(\hat{\theta}^*),$$
where $\widehat{SE}(\hat{\theta}^*) = \sqrt{\hat{V}(\hat{\theta}^*)}$ is the bootstrap estimate of the standard error of $\hat{\theta}$ and $z^{(1-\alpha/2)}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution.

4.2.2 Bootstrap Percentile and Centered Bootstrap Percentile Method

Another method for the construction of bootstrap confidence intervals is the bootstrap percentile method, the most popular of all, primarily due to its simplicity and natural appeal. In this method, we use the empirical quantiles of the $\hat{\theta}^*_r$ to form the confidence interval for $\theta$. The bootstrap percentile confidence interval has the form
$$\hat{\theta}^*_{(\mathrm{lower})} < \theta < \hat{\theta}^*_{(\mathrm{upper})},$$
where $\hat{\theta}^*_{(1)}, \hat{\theta}^*_{(2)}, \ldots, \hat{\theta}^*_{(B)}$ are the ordered bootstrap replicates of $\hat{\theta}$, with lower confidence limit $[(B+1)\alpha/2]$ and upper confidence limit $[(B+1)(1-\alpha/2)]$; the square brackets indicate rounding to the nearest integer. For example, suppose there are 1000 bootstrap replications (i.e. $B = 1000$) of $\hat{\theta}$, denoted by $(\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_{1000})$. After sorting them from smallest to largest, denote these bootstrap values by $(\hat{\theta}^*_{(1)}, \hat{\theta}^*_{(2)}, \ldots, \hat{\theta}^*_{(1000)})$. Then the bootstrap percentile confidence interval at the 95% level would be $[\hat{\theta}^*_{(25)}, \hat{\theta}^*_{(975)}]$.
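Continuing the sketch above, the normal-theory and percentile limits are obtained from the same replicates; the rounding of the order statistics follows the $[(B+1)\alpha/2]$ rule just stated.

```r
# Normal-theory and percentile bootstrap intervals from theta_star.
alpha <- 0.05
se_boot <- sqrt(var_hat)                       # bootstrap standard error
normal_ci <- (2 * theta_hat - mean(theta_star)) +
  c(-1, 1) * qnorm(1 - alpha/2) * se_boot
ts <- sort(theta_star)                         # ordered replicates
percentile_ci <- c(ts[round((B + 1) * alpha/2)],
                   ts[round((B + 1) * (1 - alpha/2))])
```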

In the percentile method, it is implicitly assumed that the distribution of $\hat{\theta}$ is symmetric around $\theta$, since the method approximates the sampling distribution of $(\hat{\theta} - \theta)$ by the bootstrap distribution of $(\hat{\theta} - \hat{\theta}^*)$, which is contrary to bootstrap theory. The coverage error is substantial if the distribution of $\hat{\theta}$ is not nearly symmetric. Now, suppose instead that the sampling distribution of $(\hat{\theta} - \theta)$ is approximated by the bootstrap distribution of $(\hat{\theta}^* - \hat{\theta})$, which is what bootstrap theory prescribes. Then the statement that $(\hat{\theta} - \theta)$ lies within the range $(B_{0.025} - \hat{\theta},\, B_{0.975} - \hat{\theta})$ carries a probability of 0.95, where $B_s$ denotes the $100s$-th percentile of the bootstrap distribution of $\hat{\theta}^*$. In the case of 1000 bootstrap replications, $B_{0.025} = \hat{\theta}^*_{(25)}$ and $B_{0.975} = \hat{\theta}^*_{(975)}$, and the above statement reduces to the statement that $\theta$ lies in the confidence interval
$$(2\hat{\theta} - B_{0.975}) < \theta < (2\hat{\theta} - B_{0.025}).$$
This range is known as the centered bootstrap percentile confidence interval. The method can be applied to any statistic and works well in cases where the bootstrap distribution is symmetric and centered around the observed statistic.
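In code, the centered percentile limits reuse the sorted replicates from the percentile sketch above:

```r
# Centered bootstrap percentile interval (2 theta_hat - B_{0.975},
# 2 theta_hat - B_{0.025}), with B_s read off the sorted replicates.
centered_ci <- c(2 * theta_hat - ts[round((B + 1) * (1 - alpha/2))],
                 2 * theta_hat - ts[round((B + 1) * alpha/2)])
```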

4.2.3 Bootstrap Studentized-t Method

Another method for constructing bootstrap confidence intervals is the studentized bootstrap, also called the bootstrap-t method. The studentized-t bootstrap confidence interval takes the same form as the normal confidence interval, except that instead of using the quantiles of a normal distribution, a bootstrapped t-distribution is constructed from which the quantiles are computed (see Davison and Hinkley [21] and Efron and Tibshirani [35]). Bootstrapping a statistical function of the form $t = (\hat{\theta} - \theta)/SE$, where $SE$ is the sample estimate of the standard error of $\hat{\theta}$, brings extra accuracy. This additional accuracy is due to the so-called one-term Edgeworth correction by the bootstrap (see Hall [50]). The basic example is the standard t statistic $t = (\bar{x} - \mu)/(s/\sqrt{n})$, a special case with $\theta = \mu$ (the population mean), $\hat{\theta} = \bar{x}$ (the sample mean) and $s$ the sample standard deviation. The bootstrap counterpart of this is $t^* = (\hat{\theta}^* - \hat{\theta})/SE^*$, where $SE^*$ is the standard error based on the bootstrap distribution. Denote the $100s$-th bootstrap percentile of $t^*$ by $b_s$ and consider the statement $(b_{0.025} < t < b_{0.975})$; substituting $t = (\hat{\theta} - \theta)/SE$, we get the confidence limits for $\theta$ as
$$\hat{\theta} - (SE)\,b_{0.975} < \theta < \hat{\theta} - (SE)\,b_{0.025}.$$
This range is known as the bootstrap-t confidence interval for $\theta$ at the 95% confidence level. Like the normal approximation, this confidence interval leads to interval estimates that are symmetric about the original point estimator, which may not be appropriate; it is best suited for bootstrapping a location statistic. The use of the studentized bootstrap is somewhat controversial. In some cases the endpoints of the intervals seem too wide, and the method seems sensitive to outliers. Also, this method is computationally very intensive if $SE^*$ is calculated using a double bootstrap; but if $SE^*$ is easily available, the method performs reliably in many examples, in both its standard and variance-stabilized forms (see Davison and Hinkley [21]).
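A minimal sketch of the bootstrap-t interval for the mean follows; here each resample supplies its own plug-in standard error $s^*/\sqrt{n}$, so no double bootstrap is needed (this choice of $SE^*$ is ours, for illustration).

```r
# Bootstrap-t: studentize each resample, take percentiles of t*,
# then invert to obtain the limits theta_hat - SE * b_s.
t_star <- replicate(B, {
  xs <- sample(x, replace = TRUE)
  (mean(xs) - theta_hat) / (sd(xs) / sqrt(length(xs)))
})
b <- quantile(t_star, c(1 - alpha/2, alpha/2))  # b_{0.975}, b_{0.025}
se_hat <- sd(x) / sqrt(length(x))               # SE of the original estimate
student_ci <- theta_hat - se_hat * b            # (lower, upper)
```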

Percentile bootstrap endpoints are simple to calculate and can work well, especially if the sampling distribution is symmetric, but they may not have the correct coverage when the sampling distribution of the statistic is skewed. Coverage of the percentile bootstrap can be improved by adjusting the endpoints for bias and nonconstant variance. This is the BCa method, which we discuss in the next subsection.

4.2.4 Bias Corrected and Accelerated (BCa) Method

If the distribution exhibits skewness or bias, we need to make some adjustments. One method which has proved reliable in such cases is the BCa method; its intervals tend to be closer to the true confidence interval than those of the percentile method. It is an automatic algorithm for producing highly accurate confidence limits from a bootstrap distribution. For a detailed review, see Efron [34], Hall [49], DiCiccio [23], and Efron and Tibshirani [35]. The BCa procedure is a method of setting approximate confidence intervals for $\theta$ from the percentiles of the bootstrap histogram. With $\theta$ the parameter of interest, $\hat{\theta}$ is an estimate of $\theta$ based on the observed data and $\hat{\theta}^*$ is a bootstrap replication of $\hat{\theta}$ obtained by resampling. Let $\hat{G}(c)$ be the cumulative distribution function of the $B$ bootstrap replications of $\hat{\theta}^*$,
$$\hat{G}(c) = \#\{\hat{\theta}^* < c\}/B. \qquad (4.2.1)$$
The upper endpoint $\hat{\theta}_{BCa}[\alpha]$ of the one-sided BCa interval at level $\alpha$, i.e. $\theta \in (-\infty, \hat{\theta}_{BCa}[\alpha])$, is defined in terms of $\hat{G}$ and two parameters: $z_0$, the bias correction, and $a$, the acceleration. That is how BCa comes to stand for bias corrected and accelerated. The BCa endpoint is given by
$$\hat{\theta}_{BCa}[\alpha] = \hat{G}^{-1}\,\Phi\!\left(z_0 + \frac{z_0 + z^{(\alpha)}}{1 - a(z_0 + z^{(\alpha)})}\right). \qquad (4.2.2)$$
Here $\Phi$ is the standard normal c.d.f., with $z^{(\alpha)} = \Phi^{-1}(\alpha)$. The central 90% BCa confidence interval is given by $(\hat{\theta}_{BCa}[0.05], \hat{\theta}_{BCa}[0.95])$. In (4.2.2), if $a$ and $z_0$ are zero, then $\hat{\theta}_{BCa}[\alpha] = \hat{G}^{-1}(\alpha)$, the $100\alpha$-th percentile of the bootstrap replications. Also, if $\hat{G}$ is normal, then $\hat{\theta}_{BCa}[\alpha] = \hat{\theta} + z^{(\alpha)}\hat{\sigma}$, the standard interval endpoint. In general, (4.2.2) makes three different corrections to the standard intervals, improving their coverage accuracy. Suppose that there exists a monotone increasing transformation $\phi = m(\theta)$ such that $\hat{\phi} = m(\hat{\theta})$ is normally distributed for every choice of $\theta$, but with a bias and a nonconstant variance,
$$\hat{\phi} \sim N(\phi - z_0\sigma_\phi,\, \sigma^2_\phi), \qquad \sigma_\phi = 1 + a\phi. \qquad (4.2.3)$$
The BCa algorithm estimates $z_0$ by
$$\hat{z}_0 = \Phi^{-1}\!\left(\frac{\#\{\hat{\theta}^*(b) < \hat{\theta}\}}{B}\right),$$
i.e. $\Phi^{-1}$ of the proportion of the bootstrap replications less than $\hat{\theta}$. The acceleration $a$ measures how quickly the standard error is changing on the normalized scale.

The acceleration $a$ is estimated using the empirical influence function of the statistic $\hat{\theta} = t(\hat{F})$,
$$U_i = \lim_{\epsilon \to 0} \frac{t\big((1-\epsilon)\hat{F} + \epsilon\delta_i\big) - t(\hat{F})}{\epsilon}, \qquad i = 1, 2, \ldots, n. \qquad (4.2.4)$$
Here $\delta_i$ is a point mass on $x_i$, so $(1-\epsilon)\hat{F} + \epsilon\delta_i$ is a version of $\hat{F}$ putting extra weight on $x_i$ and less weight on the other points. The estimate of $a$ is
$$\hat{a} = \frac{1}{6}\,\frac{\sum_{i=1}^{n} U_i^3}{\left(\sum_{i=1}^{n} U_i^2\right)^{3/2}}. \qquad (4.2.5)$$
There is a simpler way of calculating $U_i$ and $\hat{a}$: instead of (4.2.4), we can use the following jackknife influence function (see Hinkley [51]) in (4.2.5),
$$U_i = (n-1)\big(\hat{\theta} - \hat{\theta}_{(i)}\big), \qquad (4.2.6)$$
where $\hat{\theta}_{(i)}$ is the estimate of $\theta$ based on the reduced data set $x_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$.
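Putting the pieces together, here is a sketch of the BCa endpoints (4.2.2) for the running mean example, with $\hat{z}_0$ from the proportion of replicates below $\hat{\theta}$ and $\hat{a}$ from the jackknife influence values (4.2.6):

```r
# BCa: bias correction z0, acceleration a from jackknife influence
# values, then G-hat^{-1} of the adjusted normal level.
z0 <- qnorm(mean(theta_star < theta_hat))
n <- length(x)
theta_jack <- sapply(seq_len(n), function(i) mean(x[-i]))  # theta-hat_(i)
U <- (n - 1) * (theta_hat - theta_jack)                    # (4.2.6)
a_hat <- sum(U^3) / (6 * sum(U^2)^(3/2))                   # (4.2.5)
bca_endpoint <- function(lev) {                            # lev = nominal level
  z_l <- qnorm(lev)
  quantile(theta_star, pnorm(z0 + (z0 + z_l) / (1 - a_hat * (z0 + z_l))))
}
bca_ci <- c(bca_endpoint(0.025), bca_endpoint(0.975))
```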

In the next section, we discuss the confidence intervals based on the jackknife method.

4.3 Jackknife Confidence Intervals

The jackknife technique is generally used to reduce the bias of parameter estimates and to estimate the variance. Let $n$ be the total sample size; the procedure is to estimate the parameter $\theta$ $n$ times, each time deleting one data point. The resulting estimator is denoted by $\hat{\theta}_{(i)}$, where the $i$-th data point has been excluded before calculating the estimate. The so-called pseudo-values are computed as
$$\tilde{\theta}_i = n\hat{\theta} - (n-1)\hat{\theta}_{(i)},$$
where $\hat{\theta}$ is the value of the parameter estimated from the entire data set. These pseudo-values act as independent and identically distributed normal random variables. Their average gives the jackknife estimator of $\theta$,
$$\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\theta}_i = n\hat{\theta} - \frac{n-1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}.$$
Also, the variance of these pseudo-values provides the estimate of the variance of $\hat{\theta}$, given by
$$\tilde{V} = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\tilde{\theta}_i - \tilde{\theta})(\tilde{\theta}_i - \tilde{\theta})'. \qquad (4.3.1)$$
Apart from reducing bias and estimating variance, the jackknife technique also offers a very simple method of constructing a confidence interval for $\theta$ (see Miller [71]). Using the consistent variance estimator given in (4.3.1), we get the confidence interval for $\theta$ (see Hinkley [51]) as
$$\tilde{\theta} \pm t_{\left(1-\frac{\alpha}{2};\,n-1\right)}\sqrt{v_{ii}}, \qquad (4.3.2)$$
where $t_{\left(1-\frac{\alpha}{2};\,n-1\right)}$ is the upper $\frac{\alpha}{2}\,100\%$ point of the Student's t-distribution with $n-1$ degrees of freedom, and $v_{ii}$ is the relevant diagonal element of $\tilde{V}$; this interval is suitable even for small samples. The next section contains the model, the estimators, and confidence intervals in terms of the unknown regression coefficients.
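For the same toy sample, a sketch of the pseudo-values, the jackknife estimator, the variance (4.3.1), and the t-interval (4.3.2); the leave-one-out estimates theta_jack are those computed in the BCa sketch above.

```r
# Jackknife: pseudo-values, their mean, and the pseudo-value variance.
pseudo <- n * theta_hat - (n - 1) * theta_jack  # tilde-theta_i
theta_tilde <- mean(pseudo)                     # jackknife estimator
v <- var(pseudo) / n                  # sum (pseudo - mean)^2 / (n (n - 1))
jack_ci <- theta_tilde + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sqrt(v)
```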

4.4 The Model and the Estimators

Let us consider the following multiple linear regression model
$$y = X\beta + u, \qquad (4.4.1)$$
where $y$ is an $(n \times 1)$ vector of observations on the variable to be explained, $X$ is an $(n \times p)$ matrix of $n$ observations on $p$ explanatory variables, assumed to be of full column rank, $\beta$ is a $(p \times 1)$ vector of regression coefficients associated with them, and $u$ is an $(n \times 1)$ vector of disturbances whose elements are assumed to be independently and identically normally distributed with $E(u) = 0$ and $Var(u) = \sigma^2 I$.

The OLSE of $\beta$ in model (4.4.1) is given by
$$\hat{\beta}_{OLSE} = (X'X)^{-1}X'y. \qquad (4.4.2)$$
Now, if the data suffer from multicollinearity, we use the ORE given by Hoerl and Kennard [52],
$$\hat{\beta}_{ORE} = (X'X + kI)^{-1}X'y = (I - kA^{-1})\hat{\beta}, \qquad (4.4.3)$$
where $k > 0$ and $A = X'X + kI$. The jackknifed ridge estimator (JRE) corresponding to the ORE is calculated using the formula
$$\hat{\beta}_{JRE} = \left[I - (kA^{-1})^2\right]\hat{\beta}. \qquad (4.4.4)$$
Now for bootstrapping: first, for the model defined in (4.4.1), we fit the least squares regression for the full sample and calculate the standardized residuals $u_i$; then we draw a bootstrap sample of size $n$ with replacement, $(u^{(b)}_1, u^{(b)}_2, \ldots, u^{(b)}_n)$, from the residuals, giving probability $1/n$ to each $u_i$. After this, we obtain the bootstrap $y$ values using the resampled residuals, keeping the design matrix fixed:
$$y^{(b)} = X\hat{\beta}_{OLSE} + u^{(b)}.$$
We then regress these bootstrapped $y$ values on the fixed $X$ to obtain the bootstrap estimates of the regression coefficients. Thus, the ORE from the first bootstrap sample is
$$\hat{\beta}_{ORE(b1)} = (X'X + kI)^{-1}X'y^{(b1)}.$$
We repeat the above steps $B$ times, where $B$ is the number of bootstrap resamples. Based on these, the bootstrap estimator of $\beta$ is given by
$$\hat{\beta}^*_{ORE} = \frac{1}{B}\sum_{r=1}^{B}\hat{\beta}_{ORE(br)}, \qquad (4.4.5)$$
where $r = 1, \ldots, B$ indexes the resamples. The estimated bias is given by
$$Bias_{est} = \hat{\beta}^*_{ORE} - \hat{\beta}_{OLSE}.$$
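A minimal sketch of this fixed-X residual bootstrap for the ORE is given below. It assumes a design matrix X, a response y and a ridge constant k are already in scope (for example as generated in Section (4.5)), reuses B from the earlier sketches, treats k as fixed across resamples, and omits the standardization of the residuals for brevity.

```r
# Residual bootstrap for the ORE: resample residuals, rebuild y^(b)
# with X held fixed, and recompute the ridge estimate each time.
ridge_est <- function(X, y, k)
  as.vector(solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, y)))
beta_ols <- as.vector(solve(crossprod(X), crossprod(X, y)))  # OLSE (4.4.2)
res <- as.vector(y - X %*% beta_ols)          # residuals (unstandardized)
beta_star <- replicate(B, {
  yb <- as.vector(X %*% beta_ols) + sample(res, replace = TRUE)  # y^(b)
  ridge_est(X, yb, k)
})                                            # p x B matrix of replicates
beta_boot <- rowMeans(beta_star)              # bootstrap estimator (4.4.5)
bias_est <- beta_boot - beta_ols              # estimated bias
```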

The estimated variance of the ridge estimator through the bootstrap is given as
$$Var_{est} = \frac{1}{B-1}\sum_{r=1}^{B}\left(\hat{\beta}_{ORE(br)} - \hat{\beta}^*_{ORE}\right)^2. \qquad (4.4.6)$$
Now, based on these estimators, we construct the confidence intervals for the regression coefficient $\beta$ in the following subsection.

4.4.1 Confidence Intervals for Regression Coefficients using ORE

The different confidence intervals discussed in the last section take the following forms in terms of the regression coefficient $\beta$.

1. Normal theory method: A 95% confidence interval for $\beta$ based on the ORE is
$$(2\hat{\beta}_{ORE} - \hat{\beta}^*_{ORE}) - z^{(1-\alpha/2)}\sqrt{Var_{est}} < \beta < (2\hat{\beta}_{ORE} - \hat{\beta}^*_{ORE}) + z^{(1-\alpha/2)}\sqrt{Var_{est}},$$
where $\alpha = 0.05$, $Var_{est}$ is the bootstrap estimate of the variance of $\hat{\beta}_{ORE}$ as defined in (4.4.6), and $z^{(1-\alpha/2)}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution.

2. Percentile method: A 95% confidence interval based on 1000 bootstrap resamples would be
$$\hat{\beta}_{ORE(25)} < \beta < \hat{\beta}_{ORE(975)},$$
where $\hat{\beta}_{ORE(r)}$ is the $r$-th observation in the ordered bootstrap replicates of $\hat{\beta}_{ORE}$.

3. Studentized-t method: A 95% studentized t-interval would be
$$\hat{\beta}_{ORE} - (SE)\,b_{0.975} < \beta < \hat{\beta}_{ORE} - (SE)\,b_{0.025},$$
where $b_s$ is the $100s$-th bootstrap percentile of $t^*$, and $t^*$ is as defined in Section (4.2.3).
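A sketch of the normal-theory and percentile limits above for a single coefficient, using the replicates from the preceding sketch; the coordinate j = 1 is an arbitrary illustrative choice.

```r
# Componentwise variance (4.4.6), then two of the intervals for beta_j.
var_est <- apply(beta_star, 1, var)
j <- 1
beta_ore <- ridge_est(X, y, k)                # ORE on the original data
normal_ci <- (2 * beta_ore[j] - beta_boot[j]) +
  c(-1, 1) * qnorm(1 - alpha/2) * sqrt(var_est[j])
bs <- sort(beta_star[j, ])
percentile_ci <- c(bs[round((B + 1) * alpha/2)],
                   bs[round((B + 1) * (1 - alpha/2))])
```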

4. BCa method: A 95% confidence interval for $\beta$ based on the ORE is
$$\hat{\beta}_{BCa}[0.025] < \beta < \hat{\beta}_{BCa}[0.975],$$
where $\hat{\theta}_{BCa}[\alpha]$ is as defined in (4.2.2). Another method, the ABC (approximate bootstrap confidence intervals) method, has been proposed by Efron [34]; it gives an analytic adjustment to the BCa method for smoothly defined parameters in exponential families. ABC intervals approximate the BCa intervals of Efron [34] by analytic means, avoiding the BCa's Monte Carlo calculations, and are touted in the literature as improvements on the common parametric and nonparametric BCa procedures [see DiCiccio and Efron [25]; Efron and Tibshirani [35]; DiCiccio and Efron [26]]. We adopted this method in the linear model setup; however, we did not see significant improvement in its performance over the BCa method, so it is not pursued in the numerical investigations carried out later.

5. Jackknife method: A 95% jackknife confidence interval for $\beta$ based on the ORE is
$$\hat{\beta}_{JRE} - t_{\left(1-\frac{\alpha}{2};\,n-p\right)}\sqrt{v_{ii}} < \beta < \hat{\beta}_{JRE} + t_{\left(1-\frac{\alpha}{2};\,n-p\right)}\sqrt{v_{ii}}, \qquad (4.4.7)$$
where $\alpha = 0.05$, $t_{\left(1-\frac{\alpha}{2};\,n-p\right)}$ is the upper $\frac{\alpha}{2}\,100\%$ point of the Student's t-distribution with $(n-p)$ degrees of freedom, and $v_{ii}$ is as given in (4.3.1) with $(n-1)$ in the denominator replaced by $(n-p)$.

In order to compare these methods of constructing asymptotic confidence intervals based on the ORE and the OLSE, we calculate coverage probabilities, defined as the proportion of times the confidence interval includes the true parameter under repeated sampling from the same underlying population, and confidence widths, defined as the difference between the upper and lower confidence endpoints. In the next section, we carry out a simulation study to obtain the confidence widths and coverage probabilities of the confidence intervals developed using the ORE and the OLSE.

4.5 A Simulation Study

After going into some theoretical aspects of each method of constructing confidence intervals, we now apply the methods to simulated data and compare the performance of the ORE and the OLSE based on their coverage probabilities and confidence widths. The model is
$$y = X\beta + u, \qquad (4.5.1)$$
where the elements of $u$ are i.i.d. $N(0, 1)$. Here $\beta$ is taken as the normalized eigenvector corresponding to the largest eigenvalue of $X'X$. The explanatory variables are generated from the equation
$$x_{ij} = (1 - \rho^2)^{1/2} w_{ij} + \rho w_{ip}, \qquad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, p,$$
where the $w_{ij}$ are independent standard normal pseudo-random numbers and $\rho^2$ is the correlation between $x_{ij}$ and $x_{ij'}$ for $j, j' < p$ and $j \neq j'$. When $j$ or $j' = p$, the correlation is $\rho$. We have taken $\rho = 0.9$ and $0.99$ to investigate the effects of different degrees of collinearity, with sample sizes $n = 25$, $50$ and $100$. The feasible value of $k$ is obtained from the optimal formula $k = p\sigma^2/\beta'\beta$, as given by Hoerl et al. [53], so that $\hat{k} = p\hat{\sigma}^2/\hat{\beta}'\hat{\beta}$, where
$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n - p}.$$
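A sketch of this design in R; the number of regressors p = 4 is our illustrative choice, as the excerpt does not fix it.

```r
# Generate collinear regressors, take beta as the leading eigenvector
# of X'X, and compute the feasible ridge constant of Hoerl et al.
set.seed(2)
n <- 50; p <- 4; rho <- 0.9
W <- matrix(rnorm(n * p), n, p)
X <- sqrt(1 - rho^2) * W + rho * W[, p]  # x_ij = (1-rho^2)^{1/2} w_ij + rho w_ip
beta <- eigen(crossprod(X))$vectors[, 1] # normalized leading eigenvector
y <- as.vector(X %*% beta + rnorm(n))    # u ~ N(0, 1)
beta_hat <- as.vector(solve(crossprod(X), crossprod(X, y)))
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (n - p)
k <- p * sigma2_hat / sum(beta_hat^2)    # k-hat = p sigma^2 / (beta' beta)
```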

For calculating the different bootstrap confidence intervals (normal, percentile, studentized and BCa), we have used the function boot.ci in R. The confidence limits through the jackknife are calculated using (4.4.7), and the variance of the jackknifed estimator is calculated using the bootstrap method for variance estimation. We have calculated the coverage probabilities and average confidence widths using 1999 bootstrap resamples, and the experiment is repeated 500 times. The coverage probability, say CP, is calculated using the formula
$$CP = \frac{\#(\hat{\beta}_L < \beta < \hat{\beta}_U)}{N},$$
and the average confidence width, say CW, is calculated by
$$CW = \frac{\sum_{i=1}^{N}(\hat{\beta}_U - \hat{\beta}_L)}{N},$$
where $N$ is the simulation size and $\hat{\beta}_L$ and $\hat{\beta}_U$ are the lower and upper confidence endpoints respectively. The results for coverage probability and average confidence width at the 95% and 99% confidence levels with different values of $n$ and $\rho$ are summarized in Tables (4.1)-(4.4). Note that in all the tables, the column named OLSE gives the coverage probability and confidence width of the confidence intervals based on $\hat{\beta}_{OLSE}$ using the t-distribution. From Tables (4.1) and (4.2), we find that the coverage probabilities of all the intervals improve with increasing $n$ and become close to each other, which is due to the consistency of the estimators. It is evident from Tables (4.3) and (4.4) that the confidence intervals based on the ORE have shorter widths than the interval based on the OLSE. It is interesting to note that the coverage probabilities and confidence widths through the OLSE and through the jackknife are very close to each other. Also, from Tables (4.3) and (4.4), we find that with increasing collinearity between the explanatory variables, the difference between the confidence width of the interval based on the OLSE and the intervals based on the ORE increases. Also, with increasing sample size, the confidence widths of all the intervals decrease.
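For reference, here is a sketch of how the boot package can drive the residual resampling of Section (4.4) so that boot.ci returns several of these intervals at once; the studentized interval is omitted because type "stud" requires the statistic to also return a variance estimate, and the coefficient index j is an illustrative choice.

```r
# boot() resamples the residual vector; the statistic rebuilds y^(b)
# and returns the j-th ridge coefficient, so boot.ci() can compute
# normal, percentile and BCa limits for it.
library(boot)
j <- 1
res <- as.vector(y - X %*% beta_hat)
ridge_j <- function(r, idx) {
  yb <- as.vector(X %*% beta_hat) + r[idx]   # y^(b) from resampled residuals
  as.vector(solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, yb)))[j]
}
out <- boot(res, ridge_j, R = 1999)
boot.ci(out, conf = 0.95, type = c("norm", "perc", "bca"))
```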

According to Tables (4.1) and (4.2), all the bootstrap methods are generally conservative in terms of coverage probabilities; however, the jackknife method seems to give coverage probabilities closer to the target. In terms of confidence width, the resampling methods have smaller confidence widths than the OLSE, with the jackknife method having a larger confidence width than the bootstrap methods. Noting that the bootstrap methods are conservative with smaller confidence widths, they seem to have an advantage over the jackknife method.

4.6 Concluding Remarks

In the present chapter, we illustrated the use of different confidence intervals based on the bootstrap and jackknife resampling methods. We obtained the coverage probabilities and confidence widths based on the ORE using the different bootstrap and jackknife methods, and compared them with those of the OLSE, which we computed using t-intervals. The shorter confidence widths obtained through the ORE show its superiority over the OLSE in the presence of multicollinearity. The next chapter deals with jackknifing and bootstrapping another biased estimator, given by Liu [64], which overcomes a drawback of the ORE.

Table 4.1: Coverage probabilities through different methods at 95% confidence level. Columns: n, ρ, OLSE, Normal, Percentile, Studentized, BCa, Jackknife. [Numerical entries not reproduced.]

Table 4.2: Coverage probabilities through different methods at 99% confidence level. Columns: n, ρ, OLSE, Normal, Percentile, Studentized, BCa, Jackknife. [Numerical entries not reproduced.]

Table 4.3: Average confidence width through different methods at 95% confidence level using OLSE and ORE. Columns: n, ρ, OLSE, Normal, Percentile, Studentized, BCa, Jackknife. [Numerical entries not reproduced.]

Table 4.4: Average confidence width through different methods at 99% confidence level using ORE. Columns: n, ρ, OLSE, Normal, Percentile, Studentized, BCa, Jackknife. [Numerical entries not reproduced.]
