ABSTRACT

POST, JUSTIN BLAISE. Methods to Improve Prediction Accuracy under Structural Constraints. (Under the direction of Howard Bondell.)

Statisticians are often faced with the difficult task of model selection and making inference from the selected model. A method such as Ordinary Least Squares (OLS) can be used to fit a given model, but that model can often be improved by removing some extraneous variables, shrinking parameter estimates, or using estimates that are averages across many different models. This thesis investigates and provides new solutions for two common problems.

In multiple linear regression with correlated predictors, a common goal is to increase prediction accuracy. Ridge regression (Hoerl and Kennard, 1970) increases prediction accuracy in the correlated-predictor setting by trading an increase in bias for a reduction in variance. Ridge regression falls into a class of regularization, or penalized regression, methods. The amount of penalization is controlled by a tuning parameter that is often selected by minimizing some criterion. Once selected, this value is typically treated as if it were known in advance. However, there is inherent variability in the selection of the tuning parameter. In order to account for this extra level of variability and gain predictive power, a model averaged version of ridge regression utilizing a Mallows criterion is developed. Theory is given demonstrating that the estimator with weight vector chosen using the Mallows criterion is asymptotically optimal, in the sense that its loss is asymptotically equivalent to that of the infeasible optimal weight vector. We demonstrate that the proposed estimator performs well in both simulations and real data examples. We also propose similar estimators based on other weighting criteria, as well as on another penalization procedure, the least absolute shrinkage and selection operator (Tibshirani, 1996). Simulation results are given that show the usefulness of these estimators compared to their single-model counterparts.

When faced with categorical predictors and a continuous response, the objective of analysis often consists of two tasks: finding which factors are important and determining which levels of the factors differ significantly from one another. Often these tasks are done separately using Analysis of Variance followed by a post-hoc hypothesis testing procedure such as Tukey's Honestly Significant Difference test. When interactions between factors are included in the model, the collapsing of levels of a factor becomes a more difficult problem. When testing for differences between two levels of a factor, claiming no difference refers not only to equality of main effects, but also to equality of each interaction involving those levels. An appropriate regularization is constructed that encourages levels of factors to collapse and entire factors to be set to zero. It is shown that the procedure has the oracle property, implying that asymptotically it performs as well as if the exact structure were known beforehand. We also discuss the application to estimating interactions in the unreplicated case. Simulation studies show that the procedure outperforms post-hoc hypothesis testing procedures as well as similar methods that do not include a structural constraint. The method is also illustrated using a real data example.

Copyright 2012 by Justin Blaise Post
All Rights Reserved

Methods to Improve Prediction Accuracy under Structural Constraints

by
Justin Blaise Post

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Statistics

Raleigh, North Carolina
2012

APPROVED BY:
Jason Osborne
Brian Reich
Yichao Wu
Howard Bondell, Chair of Advisory Committee

DEDICATION

To my lovely wife Malorie and my loving family.

BIOGRAPHY

The author was born in a small town in rural Pennsylvania. He received a bachelor's degree in mathematics from Penn State Erie, the Behrend College. He has spent the past five years at North Carolina State University learning statistics.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Howard Bondell, for his guidance and my wife Malorie for her support and love.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1  Introduction
    1.1 Regularization Framework
    1.2 Plan of Dissertation
Chapter 2  Introduction to Model Averaging
    2.1 Bayesian Model Averaging Review
        2.1.1 The BMA Framework
    2.2 Frequentist Model Averaging Review
        2.2.1 Information Theory Weights
        2.2.2 Bootstrap Weights
        2.2.3 Cross Validation Weights
        2.2.4 Mallows Criterion Weights
Chapter 3  Model Averaged Ridge Regression
    3.1 Introduction and Motivation
    FMA Review - Basic Paradigm
    Adapting FMA to Ridge Regression
    Weight Choice
    Fixed Weights
    Adaptive Weights
    Model Averaged Ridge Regression using Mallows Criterion
    Bayesian Connection
    Asymptotic Properties
    Simulation Studies
    Set-up and Competitors
    Results
    Real Data Examples
    Pakistani Economic Data
    Longley's Economic Data
    Boston Housing Data
    Discussion
Chapter 4  Alternative Model Averaged Ridge Regression Estimators
    Information Criterion Weights
    AIC Weights with One Predictor and Known Variance
    4.2 Using a Distribution on the Tuning Parameter
    Ridge Regression with One Predictor
    Bootstrap Estimates
    Combining Model Weights and a Distribution
    Using the Mixed Model Formulation
    Normal Approximation to the ML and REML Estimates
    Taylor Expansion
    Bootstrap Estimates
    Simulation Comparisons
    Simulation Set-ups
    Simulation Results
Chapter 5  Model Averaging the LASSO
    The LASSO procedure
    Model Averaged LASSO using Mallows Criterion
    Model Averaged LASSO using a Mixed Model
    Taylor Expansion
    Bayesian Connection
    Simulation Studies
    Simulation Set-up
    Simulation Results
Chapter 6  Factor Selection and Structural Identification in the Interaction ANOVA Model
    Introduction
    Collapsing via Penalized Regression
    GASH-ANOVA Method
    Investigating Interactions in the Unreplicated Case
    Asymptotic Properties
    Computation and Tuning
    Simulation Studies
    Simulation Set-up
    Competitors and Methods of Evaluation
    Simulation Results
    Additional Simulation Set-up
    Additional Simulation Results
    Real Data Example
    Discussion
References
Appendices
Appendix A  Chapter 3 Proofs
    A.1 Lemma
    A.2 Lemma
    A.3 Proof of Theorem
    A.4 Proof of Theorem
Appendix B  Chapter 6 Proofs
    B.1 Proof of Theorem

LIST OF TABLES

Table 3.1  Average model error and standard errors over 500 independent data sets, sample size n = 20, for the Mallows model averaged ridge regression estimates (MA) and the corresponding minimum C_L model estimate (Best). Estimates found are model averaged over 100 equally spaced points over the degrees of freedom. Letters indicate groups of significantly different estimates.
Table 3.2  Average model error and standard errors over 500 independent data sets, sample size n = 50, for the Mallows model averaged ridge regression estimates (MA) and the corresponding minimum C_L model estimate (Best). Estimates found are model averaged over 100 equally spaced points over the degrees of freedom. Letters indicate groups of significantly different estimates.
Table 3.3  Average prediction error and standard error for 300 splits of the Pakistani economic data. The training sets had 18 observations and the test sets had 10 observations.
Table 3.4  Average prediction error and standard error for 300 splits of the Longley economic data. The training sets had 13 observations and the test sets had three observations.
Table 3.5  Average prediction error and standard error for 300 splits of the Boston housing data. The training sets had 350 observations and the test sets had 156 observations.
Table 4.1  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using AIC weights. Estimates found are the minimum AIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.2  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using AICc weights. Estimates found are the minimum AICc estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.3  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using BIC weights. Estimates found are the minimum BIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.4  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using AIC weights. Estimates found are the minimum AIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.5  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using AICc weights. Estimates found are the minimum AICc estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.6  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using BIC weights. Estimates found are the minimum BIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 5.1  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using Mallows criterion. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.2  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_1.
Table 5.3  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_2.
Table 5.4  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_3.
Table 5.5  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using AIC weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.6  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using AICc weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.7  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using BIC weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 6.1  Table of Cell Means for θ_1.
Table 6.2  Table of Cell Means for θ_2.
Table 6.3  Null Model Simulation Results.
Table 6.4  Simulation Results for Effect Vector θ_1. There are 23 truly nonzero level differences and 11 level differences that should be collapsed. There are 68 truly nonzero pairwise differences and 71 pairwise differences that should be collapsed.
Table 6.5  Simulation Results for θ_2. There are 18 truly nonzero level differences and 16 level differences that should be collapsed. There are 61 truly nonzero pairwise differences and 78 pairwise differences that should be collapsed.
Table 6.6  Table of Cell Means for θ_3.
Table 6.7  Simulation Results for θ_3 with error variance one. There are three truly nonzero level differences and three level differences that should be collapsed. There are three truly nonzero pairwise differences and 11 pairwise differences that should be collapsed.
Table 6.8  Simulation Results for θ_3 with error variance four. There are three truly nonzero level differences and three level differences that should be collapsed. There are three truly nonzero pairwise differences and 11 pairwise differences that should be collapsed.
Table 6.9  Distinct Levels within Factors.
Table 6.10 Treatment Combination Means and Distinct Levels.

LIST OF FIGURES

Figure 3.1  Ridge regression solution path for the full Pakistani economic data. Tick marks along the top of the graph represent where the ridge regression estimate has a given number of effective degrees of freedom. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Land Cultivated, 2=Inflation Rate, 3=Number of Establishments, 4=Population, and 5=Literacy Rate.
Figure 3.2  Ridge regression solution path for the full Longley economic data. Tick marks along the top of the graph represent where the ridge regression estimate has a given number of effective degrees of freedom. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Gross National Product implicit price deflator, 2=Gross National Product, 3=number of unemployed, 4=number of people in the armed forces, 5=noninstitutionalized population at least 14 years of age, and 6=year.
Figure 4.1  Ridge regression profile for β_1 from the simulation done in section 4.6. The number on the right of the graph describes which regression parameter (1-8) path the curve represents. The three vertical lines give the minimum AIC, AICc, and BIC chosen estimates.
Figure 4.2  AIC weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.3  AICc weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.4  BIC weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.5  Plots of Beta distributions for different values of the shape parameters α_1 and α_2.
Figure 4.6  Weighted ridge regression profiles for a one predictor example along with their Beta weighted estimates.
Figure 5.1  LASSO solution path for Pakistani economic data. Tick marks along the top of the graph represent the break points. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Land Cultivated, 2=Inflation Rate, 3=No. of Establishments, 4=Population, and 5=Literacy Rate.

Chapter 1  Introduction

Statisticians are often faced with the difficult task of model selection and making inference from the selected model. The topic of model selection has received much attention in the statistical literature. Historically, hypothesis testing methods were the focus; however, in recent decades there have been many developments on the topic that utilize regularization, or penalization, methods. Given a response and predictor variables, a method such as ordinary least squares (OLS) can be used to fit a given model. The fit of this model can often be improved upon by removing some extraneous variables, using estimates that are averages of estimates across different models, or shrinking the estimates in some way. This thesis investigates and provides new solutions for two common problems in this area.

1.1 Regularization Framework

Regularization methods are a powerful group of procedures that minimize a loss function subject to a penalty term. For a response y_i and predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n, a general one-dimensional regularization problem has the form

$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda J(f) \right],$$

where L(·,·) is a loss function, f(·) is a function in some space of functions $\mathcal{H}$, λ is a tuning parameter, and J(f) is a penalty functional (Hastie et al., 2009). Many common methods used today fall into this framework, including ridge regression (RR) (Hoerl and Kennard, 1970), the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), least angle regression (LARS) (Efron et al., 2002), and the elastic net (EN) (Zou and Hastie, 2005), which involves a two-dimensional functional with two tuning parameters, to name a few.

1.2 Plan of Dissertation

This thesis lays out new solutions that utilize the regularization framework for two common problems statisticians face. The first problem is that of improving the prediction accuracy of our model estimates. We develop a method that combines ridge regression and frequentist model averaging (FMA). The second problem is that of choosing the important factors in multi-factor analysis of variance (ANOVA) and deciding which levels of the factors are significantly different from one another when interactions are included in the model. We develop a new penalization method that simultaneously selects important predictors and also creates non-overlapping groups of levels within each factor. The makeup of this thesis is as follows: in chapter 2, we begin by giving a review of FMA. The new model averaged ridge regression estimator is given in chapter 3. Chapter 4 gives some alternative methods for estimating a model averaged ridge regression solution.

Chapter 5 discusses the extensions of FMA to the LASSO procedure. In chapter 6, the grouping and selection using heredity (GASH) ANOVA method is introduced as a solution to the multi-factor ANOVA with interaction problem. The material in chapters 3 and 6 comes almost entirely from two journal articles (one submitted and one to be submitted). Thus, the notation used in chapter 6 is different from that used in the rest of the dissertation, and any symbols and definitions apply only to that chapter. Also, some material is briefly repeated in chapter 3.
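As a concrete illustration of the regularization framework in Section 1.1, the following is a minimal Python sketch (added here for exposition, not part of the original text) of the penalized least squares problem with an L2 penalty, i.e. ridge regression, which admits a closed-form solution. The data and the tuning value are made up for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * b'b; closed-form ridge solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy example with highly correlated predictors
rng = np.random.default_rng(0)
n, p = 50, 5
Z = rng.normal(size=(n, 1))
X = 0.9 * Z + 0.1 * rng.normal(size=(n, p))     # columns share a common component
beta_true = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

b_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers OLS (when X'X is invertible)
b_ridge = ridge_fit(X, y, lam=5.0)   # lam > 0 shrinks the estimates toward zero
print(np.round(b_ols, 3), np.round(b_ridge, 3))
```

Setting the tuning parameter to zero recovers the OLS fit, while larger values shrink the coefficients toward zero; this bias-variance trade-off is the device exploited in the later chapters.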

Chapter 2  Introduction to Model Averaging

Consider the regression setup with n independent observations, each consisting of a response variable y_i and a countably infinite number of predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n. Assume the true model for the data can be written as

$$y_i = \mu_i + \epsilon_i = \sum_{j=1}^{\infty} \beta_j x_{ij} + \epsilon_i$$

and in vector form as y = μ + ε, where y = (y_1, ..., y_n)^T, μ = (μ_1, ..., μ_n)^T, β_1, β_2, ... are regression parameters, and ε = (ε_1, ..., ε_n)^T are error terms. In practice, we often approximate μ using a finite number of predictors p_n, which we can write in matrix form as

$$E[y] = \mu \approx X\beta,$$

where μ_i is approximated by $\sum_{j=1}^{p_n} \beta_j x_{ij}$, X is an n × p_n design matrix with (i, j) element x_ij, and β = (β_1, ..., β_{p_n})^T.

The standard way to fit the linear model above is using ordinary least squares (OLS). The OLS model fitting criterion is given by

$$\min_{\beta} \|y - X\beta\|^2$$

and, when X is full rank, the OLS solution is given by

$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y.$$

The OLS solution can often be improved upon in terms of prediction accuracy and interpretation (Burnham and Anderson, 2002). Interpretation of the estimates can be aided by eliminating some of the predictors from the model. Prediction accuracy can be improved by trading increased bias for lower variance. The OLS estimates are unbiased, but when many correlated predictors are included in the model the estimates may have high variance.

There are many methods for selecting the structure and fit of the model. For example, one can use best subset selection, forward/backward stepwise selection, or any of the regularization methods mentioned in chapter 1. These methods all have some criterion, such as hypothesis tests or tuning parameters, that controls the final model and model fit selected. Once a procedure is applied and the model and fit are selected, inference usually proceeds conditional on this result. Thus, this aspect is often treated as known in advance. This can lead to invalid inference because the variability inherent in the model selection process is ignored; i.e., if another data set were available and the selection procedure applied to that data set, a different structure or fit would likely be chosen. There are in general three ways to conduct unconditional inference (Burnham and Anderson, 2002).

One approach is to use theoretical adjustments in the standard error formulas, another is to use bootstrapping to assess the variability, and the third is to use model averaging. Model averaging usually does not help in terms of interpretability of the model, but, if the goal is to predict new responses, model averaging tends to yield very good prediction accuracy. Note that while this is being called unconditional inference, the inference is of course conditional on the overall set of models being considered. Model averaging is most often done in a Bayesian framework and is called Bayesian model averaging (BMA). However, there has been an increasing amount of literature in the last decade on frequentist model averaging (FMA) (Wang et al., 2009). We now give a brief review of BMA (for in-depth reviews of theory and application see Hoeting et al. (1999) and Clyde and George (2004)) followed by a more in-depth review of FMA.

2.1 Bayesian Model Averaging Review

The Bayesian paradigm gives a strong and natural theoretical approach to incorporating model and parameter uncertainty into inference, yielding the unconditional inference we desire. There has been extensive work done in the area of BMA, especially as of late due to advances in computing power and Markov chain Monte Carlo (MCMC) methods (Clyde and George, 2004).

2.1.1 The BMA Framework

Suppose we denote our set of possible models by M = {M_m}, m = 1, ..., M, where each model has parameter vector β_m consisting of elements from β. We require that the regression parameter β_j represent the effect for the same predictor in each of the M models. In order to perform proper BMA, a prior for each of the M models is needed, along with priors for each parameter vector given its model.

Denote the prior for model M_m by P(M_m) and the prior for the parameters of that model by P(β_m | M_m). We denote the combination of our data as Z = (X, y). Hence, for model M_m we have the joint distribution

$$P(Z, \beta_m, M_m) = P(Z \mid \beta_m, M_m)\,P(\beta_m \mid M_m)\,P(M_m),$$

where P(Z | β_m, M_m) is the distribution, or likelihood, of the data. Following Clyde and George (2004), this places all of the models into a hierarchical mixture model framework where the data come from first generating the model M_m from the candidate model set, then generating the parameter vector β_m for that model, and finally generating the data from P(Z | β_m, M_m). Using Bayes' rule, the posterior probability for model M_m is given by

$$P(M_m \mid Z) = \frac{P(Z \mid M_m)\,P(M_m)}{\sum_{k=1}^{M} P(Z \mid M_k)\,P(M_k)},$$

where $P(Z \mid M_m) = \int P(Z \mid \beta_m, M_m)\,P(\beta_m \mid M_m)\,d\beta_m$ (throughout we assume a continuous distribution, but the definition for a discrete distribution is straightforward). These posterior model probabilities can be used to select a single best model, compare models, or combine models using model averaging. If one model has a much higher posterior probability, or if a few models have high posterior probability, these models can be selected and reported. To compare model j to model m, a Bayesian would usually look at the posterior odds of the two models, defined as

$$\frac{P(M_j \mid Z)}{P(M_m \mid Z)} = \frac{P(Z \mid M_j)\,P(M_j)}{P(Z \mid M_m)\,P(M_m)}.$$
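As a small numerical sketch of the posterior model probability and posterior odds formulas above (the marginal likelihood values below are hypothetical, chosen only for illustration, and are not from the dissertation):

```python
import numpy as np

# Hypothetical marginal likelihoods P(Z | M_m) and priors P(M_m) for M = 3 models
marg_lik = np.array([2.0e-5, 5.0e-5, 1.0e-6])   # assumed values for illustration
prior    = np.array([1/3, 1/3, 1/3])            # equal prior model probabilities

# Posterior model probabilities via Bayes' rule
post = marg_lik * prior / np.sum(marg_lik * prior)
print(np.round(post, 3))        # e.g. [0.282 0.704 0.014]

# Posterior odds of model 1 versus model 2; with equal priors this is the
# ratio of marginal likelihoods, i.e. the Bayes factor discussed below
print(post[0] / post[1])
```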

When the prior probabilities of the two models are the same, they cancel and we are left with what is known as the Bayes factor. The Bayes factor describes the level of support for one model over another utilizing just the data at hand. The Bayes factor can sometimes be difficult to find exactly. For large sample sizes, the Schwarz (or Bayes) information criterion (BIC), given by $BIC = -2\log(L) + \log(n)\,p_n$ (Schwarz, 1978), where L is the likelihood of the data, offers a good approximation to the posterior probability for a given model:

$$P(M_m \mid Z) \approx \frac{\exp(-BIC_m/2)\,P(M_m)}{\sum_{k=1}^{M} \exp(-BIC_k/2)\,P(M_k)},$$

where BIC_m denotes the BIC value for model m. For this reason, the ratio of the exponentiated BIC values is often used to approximate the Bayes factor (Kass and Raftery, 1995).

To make inference for a particular quantity of interest η, one would look at the posterior distribution of η (Hoeting et al., 1999). This is given by

$$P(\eta \mid Z) = \sum_{m=1}^{M} P(\eta \mid M_m, Z)\,P(M_m \mid Z).$$

This posterior distribution is an average of the posterior distributions under the different models, weighted by their posterior model probabilities. The mean and variance of this posterior distribution are given by

$$E[\eta \mid Z] = \sum_{m=1}^{M} E[\eta \mid Z, M_m]\,P(M_m \mid Z)$$

and

$$Var[\eta \mid Z] = \sum_{m=1}^{M} \left( Var[\eta \mid Z, M_m] + E[\eta \mid Z, M_m]^2 \right) P(M_m \mid Z) - E[\eta \mid Z]^2.$$

It has been shown that by using these averaged estimates one tends to see better performance in terms of predictive ability than using an estimate from a single model (Madigan and Raftery, 1994).

The main issues and drawbacks associated with BMA are the posterior calculations and the prior model specifications (Clyde and George, 2004). The posterior calculations require evaluating integrals for each model in the candidate set that rarely have closed forms. This limitation has been somewhat eased by recent advances in computational speed and in MCMC methods. However, MCMC methods have their own difficulties, such as identifiability issues and determining whether the distributions have converged (Eberly and Carlin, 2000). The other main issue is the difficulty that comes in specifying meaningful and non-conflicting prior distributions for the models and their parameters. This becomes more difficult when M is very large.

2.2 Frequentist Model Averaging Review

Although the amount of literature on BMA is much larger than that on FMA, FMA has the advantage that its estimators are constructed using only the data at hand. FMA can be applied to a diverse number of models, including but not limited to logistic regression (Ghosh and Yuan 2009, Chen et al. 2009), power and exponential models (Burnham and Anderson, 2002), linear regression models, and combinations of different regression procedures (Yang, 2001).

Hjort and Claeskens (2003) give asymptotic theory for FMA estimators using a local misspecification framework. Because our focus is on the linear regression model, we choose to define the FMA estimators in this framework. The basic reason for using FMA is to provide more robust results.

Recall we suppose that there are M candidate models under consideration. When only considering linear mean structures, the M models usually arise from including or excluding predictors from the models (including derived predictors such as interactions or quadratic terms). Again, let β_j represent the effect for the same predictor in each of the M models. We define the FMA estimate for β_j, j = 1, ..., p_n, as

$$\hat{\bar{\beta}}_j = \sum_{m=1}^{M} w_m \hat{\beta}_{j,m},$$

where w_m is a weight for model m, $\sum_{m=1}^{M} w_m = 1$, and $\hat{\beta}_{j,m}$ is the estimate for β_j under the m-th model (if the corresponding predictor is not in model m, $\hat{\beta}_{j,m}$ is set to zero). Another technique for performing FMA is to average the predicted responses rather than the regression coefficients. In the context of the linear model, these two methods are equivalent. Also, we note that by setting the weight for a single model to one, the usual approach of selecting one best model falls into the FMA framework.

This leaves the choice of weight vector. There are many ways to choose the weights for FMA estimators, but the approaches basically fall into one of two categories. The first category involves fixed weights that are based on a model fit criterion, and the second category includes weights that are chosen adaptively to minimize or maximize some criterion. The method for choosing the weights is important, as different methods have different properties. The selection of weight vectors for FMA estimators is still an active area of research, which we review here (for a more detailed review, see Wang et al. (2009)).
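The FMA estimate defined above can be written directly in code. The following is a toy sketch (the submodels and weights are made up for illustration; this is not the dissertation's implementation): each candidate submodel is fit by OLS, coefficients of excluded predictors are set to zero, and the weighted average is returned.

```python
import numpy as np

def fma_coefficients(X, y, models, weights):
    """Frequentist model-averaged coefficients: sum_m w_m * beta_hat_{j,m},
    with beta_hat_{j,m} = 0 when predictor j is excluded from model m."""
    n, p = X.shape
    beta_avg = np.zeros(p)
    for idx, w in zip(models, weights):
        Xm = X[:, idx]
        bm = np.linalg.lstsq(Xm, y, rcond=None)[0]   # OLS fit of submodel m
        full = np.zeros(p)
        full[idx] = bm                               # zeros for omitted predictors
        beta_avg += w * full
    return beta_avg

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=40)

models  = [[0], [0, 2], [0, 1, 2]]          # candidate subsets of predictors
weights = [0.2, 0.5, 0.3]                   # nonnegative weights summing to one
print(np.round(fma_coefficients(X, y, models, weights), 3))
```

Any of the weighting schemes reviewed in the remainder of this section (information criterion, bootstrap, cross validation, or Mallows weights) could supply the weight vector used here.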

2.2.1 Information Theory Weights

The first type of fixed weights we review were developed by Buckland et al. (1997) and are based on information theory. Information theory gives a strong theoretical basis for model selection, so we briefly describe the approach before giving the weights. Information theorists do not believe that a true model exists; rather, they believe in an abstract notion of truth (call this f) that is being approximated by a model g (Burnham and Anderson, 2002). The Kullback-Leibler (KL) distance between f and g is defined generally for continuous functions as

$$I(f, g) = \int f(z) \log\left(\frac{f(z)}{g(z \mid \theta)}\right) dz,$$

where Z is the data and θ is the parameter vector for the approximating model g. The KL distance I(f, g) denotes the information lost when g is used to approximate f. Given the data Z with sample size n and the class of models g(z | θ), there exists a conceptual KL best model g(z | θ_0). The connection between KL distance and the maximum likelihood estimate (MLE) of θ, $\hat{\theta}_{ML}$, is that $\hat{\theta}_{ML}$ converges to θ_0 asymptotically, and the bias of $\hat{\theta}_{ML}$ is with respect to θ_0. This connection between likelihood inference and information theory creates a basis for using information theory for model selection.

Because f is unknown, we cannot estimate I(f, g). However, by noting that

$$I(f, g) = E_f[\log(f(Z))] - E_f[\log(g(Z \mid \theta))] = C - E_f[\log(g(Z \mid \theta))],$$

where C is a constant, one can estimate a relative distance to compare models. This quantity is minimized at θ_0, but we must estimate θ by some $\hat{\theta}$.

Therefore, to use KL information as a model selection criterion, we must change our target to minimizing the estimated expected KL distance. The critical issue is then to estimate

$$E_{Z}\left[E_{Z'}\left[\log\left(g\left(Z' \mid \hat{\theta}(Z)\right)\right)\right]\right],$$

where Z and Z' are independent random samples and the expectations are both with respect to f (Akaike, 1973). By use of Taylor expansions and other approximations (Burnham and Anderson, 2002), it can be shown that an unbiased estimator of our target is

$$\log\left(g\left(Z \mid \hat{\theta}(Z)\right)\right) - \widehat{\mathrm{tr}}\left\{ J(\theta_0)\left[I(\theta_0)\right]^{-1} \right\},$$

where tr stands for the trace,

$$J(\theta_0) = E_f\left[ \left[\frac{\partial}{\partial\theta} \log\big(g(Z \mid \theta)\big)\right] \left[\frac{\partial}{\partial\theta} \log\big(g(Z \mid \theta)\big)\right]^T \right]_{\theta=\theta_0},$$

and

$$I(\theta_0) = -E_f\left[ \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log\big(g(Z \mid \theta)\big) \right]_{\theta=\theta_0}.$$

If f is in the set of models g, then J(θ_0) = I(θ_0) and the trace term is simply the length of the parameter vector, p_n. Although this is not likely to be the case, p_n is still a good estimate of the trace term (Shibata, 1989). Note that θ should include the error variance σ^2 if it is unknown (making p_n from above p_n + 1). By using p_n as the estimate and multiplying by negative two, we get the usual Akaike's information criterion (AIC) (Akaike, 1974),

$$AIC = -2\log\left(g\left(Z \mid \hat{\theta}\right)\right) + 2p_n = -2\log(L) + 2p_n.$$
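For a Gaussian linear model fit by OLS, the AIC above can be computed directly from the residual sum of squares. The sketch below is an illustration added here (with simulated data, not from the dissertation); it counts the error variance as a parameter, as noted above.

```python
import numpy as np

def aic_linear(X, y):
    """AIC for a Gaussian linear model fit by OLS (error variance counted as a parameter)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    # Maximized Gaussian log-likelihood with sigma^2 estimated by RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + 2 * (p + 1)   # -2 log(L) + 2 * (number of parameters incl. sigma^2)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=30)
print(round(aic_linear(X, y), 2))          # AIC for the full three-parameter mean model
print(round(aic_linear(X[:, :2], y), 2))   # AIC after dropping the last predictor
```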

Thus, the use of AIC for model selection has strong theoretical underpinnings. For their weights, Buckland et al. (1997) decided to use the class of information criteria of the form

$$IC = -2\log(L) + Q(n, p_n),$$

where L represents the likelihood for the model and Q(n, p_n) is a penalty function that can depend on the sample size n and the number of parameters p_n. To compare model j to model m using this criterion, one can look at the ratio

$$\frac{L_j \exp(-Q_j/2)}{L_m \exp(-Q_m/2)} = \frac{\exp(-IC_j/2)}{\exp(-IC_m/2)},$$

where L_m, Q_m, and IC_m are the likelihood, penalty function, and IC for model m, respectively. This leads to the model averaging weight for model m given by

$$w_m = \frac{\exp(-IC_m/2)}{\sum_{k=1}^{M} \exp(-IC_k/2)}.$$

Burnham and Anderson (2002) modified these weights using the AIC differences, $\Delta AIC_m = AIC_m - AIC_{min}$, where AIC_m is the AIC for model m and AIC_min is the minimum of the AIC values over the model set M. They advocate using these AIC differences to rank the different candidate models and to form equivalent (when Q corresponds to AIC) but more easily interpreted weights, which they call Akaike weights, given by

$$w_m = \frac{\exp(-\Delta AIC_m/2)}{\sum_{k=1}^{M} \exp(-\Delta AIC_k/2)}.$$

These weights are proportional to the likelihood of the model given the data.

They also provide a numerical value of the chance of model m being the actual KL best model in the candidate model set. If desired, the weights can be redefined with prior probabilities, P(w_m), for the models. The weights are then given by

$$w_m = \frac{\exp(-\Delta_m/2)\,P(w_m)}{\sum_{k=1}^{M} \exp(-\Delta_k/2)\,P(w_k)}.$$

Note that these weights do not fully satisfy the BMA scheme because prior distributions on the parameters are also needed (Burnham and Anderson, 2002). Differences of this form can be found and used for any choice of penalty Q in the IC class, and we shall use this form of the weight. Typical penalty functions used for Q are the BIC and AIC penalties already discussed, as well as the small sample correction to AIC in the linear model with homoskedastic errors called AICc (Hurvich and Tsai (1989) and Hurvich and Tsai (1995)). The AICc criterion is given by

$$AICc = AIC + \frac{2p_n(p_n + 1)}{n - p_n - 1}.$$

When selecting a single best model, AIC tends to perform poorly when the ratio of the sample size to the number of parameters is small (< 40). Therefore, AICc is recommended in this case, as it has been shown to perform better than AIC for small samples even when its assumptions are not met (Burnham and Anderson (2002) and Sakamoto et al. (1986)). The same is likely true when using these values for model averaging.

Neither Buckland et al. (1997) nor Burnham and Anderson (2002) give any theory based on these information criteria weights. However, they give many examples to demonstrate their usefulness, as well as methods for constructing unconditional confidence intervals for the parameters. Further, Leung and Barron (2006) give risk bounds for weights of a different form that are based on information criteria.
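The Akaike weights are straightforward to compute from a vector of AIC (or AICc, BIC) values. A brief sketch, using hypothetical criterion values chosen only for illustration:

```python
import numpy as np

def akaike_weights(aic_values):
    """Model-averaging weights from AIC differences: w_m proportional to exp(-delta_m / 2)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()          # AIC differences relative to the best model
    w = np.exp(-delta / 2)
    return w / w.sum()

# Hypothetical AIC values for four candidate models
print(np.round(akaike_weights([102.3, 100.1, 100.9, 107.5]), 3))
```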

AIC Weights in the BMA Framework

The AIC weights (and AICc weights) can be motivated from the Bayesian point of view (Burnham and Anderson, 2002). If the approximation to the Bayesian posterior model probability utilizing BIC is used, a savvy prior for the models yields the AIC weights. Let the prior weight for model M_m be given by

$$P(M_m) \propto \exp\left(\Delta BIC_m/2\right) \exp\left(-\Delta AIC_m/2\right),$$

where ΔBIC_m is the BIC difference for model m. Thus, using these priors it is easily seen that

$$P(M_m \mid Z) \approx \frac{\exp(-\Delta BIC_m/2)\,P(M_m)}{\sum_{k=1}^{M} \exp(-\Delta BIC_k/2)\,P(M_k)} = \frac{\exp(-\Delta AIC_m/2)}{\sum_{k=1}^{M} \exp(-\Delta AIC_k/2)} = w_m.$$

This prior does not depend on the data, only on the sample size and number of parameters.

2.2.2 Bootstrap Weights

The other type of fixed weights we discuss are weights found using the bootstrap. The bootstrap was first introduced by Efron (1979) and has become a popular method for estimating sampling variances and other quantities that are analytically intractable. The bootstrap can be done in a parametric or nonparametric fashion and has been shown to work well in many situations, although it can fail with smaller sample sizes (Burnham and Anderson, 2002). The basic idea of the bootstrap is to estimate the conceptual probability distribution of the data, call this h, by creating a pseudo-population using the data values at hand (or by assuming a parametric distribution fit to the data).

Resamples with replacement from this pseudo-population, known as bootstrap samples, are used to obtain estimates of uncertainty about some quantity of interest from h. This set of bootstrap samples acts as a set of independent real data samples from h. The behavior of the estimator on truly independent data sets from h can then be deduced from its behavior over the bootstrap samples.

For example, suppose one wished to estimate the variance of an estimator $\hat{\psi}$ given a sample of size n arising from some distribution h. If the estimator is complicated, the theoretical variance may be very difficult to find analytically. Instead, one might use the nonparametric bootstrap method. First, one would create B bootstrap samples, each of size n, by sampling with replacement from the n observed values (randomly sampling from the empirical distribution of the data). The recommended number of bootstrap samples is anywhere from B = 1000 to B = 10,000. The estimator of interest would then be calculated on each bootstrap sample; call these estimators $\hat{\psi}_b$, b = 1, ..., B. Finally, to make inference about the variance of the estimator, one could look at the sample variance of the bootstrap estimators,

$$s^2_{boot} = \frac{\sum_{b=1}^{B} \left(\hat{\psi}_b - \bar{\hat{\psi}}\right)^2}{B - 1},$$

where $\bar{\hat{\psi}}$ is the sample mean of the bootstrap estimates.

The bootstrap has been applied to model selection, although not always successfully (Freedman et al., 1988). In terms of model averaging, the method of bagging (Breiman, 1996) is a bootstrap method that has been used successfully to reduce the variance of trees (although eliminating their interpretability). For our purposes, the bootstrap can be used to calculate the weights for our model averaged estimate. To accomplish this, one must first create B bootstrap samples from the data. Next, each model under consideration is fit to each of the B datasets.

For each bootstrap sample, the models are then ranked in terms of some criterion (such as minimum AIC, AICc, BIC, Mallows C_P, or cross validation value) and one model is selected as the best. The model selection relative frequency for model m, given by

$$w_m^{Boot} = \text{frequency of selection} / B,$$

can be estimated and used as a model weight. These model selection relative frequencies were shown to give positive results that were similar to, although not identical to, those from IC weights (Burnham and Anderson, 2002).

2.2.3 Cross Validation Weights

The first type of adaptive weights we discuss are weights based on cross validation (CV) and the jackknife. CV is a method to assess the performance of a model or to select model parameters. The idea has been around in some form since the 1930s, but k-fold CV was first clearly brought forward in the late 1960s and 1970s. CV is a relatively straightforward procedure that is one of the most widely used (Hastie et al., 2009), and we describe it here.

The hope when selecting a model is that it will perform very well on an independent test set arising from the same population. In a data-rich situation, one could separate the data into three parts: a training set, a validation set, and a test set. The different models would be fit on the training set, the model would be chosen based on its performance on the validation set, and the test set would be used to assess the prediction error of the chosen model. CV attempts to estimate the expected prediction error of a method by reusing folds of the data. For k-fold CV, the data is split into k equally sized and distinct subsets, or folds.

The models are fit on all but one fold of the data, and the remaining fold is used to evaluate the prediction error of that fit. The prediction errors for each fold are then averaged, and this value is used as the estimate of prediction error. If there is a tuning parameter, this process is done for many different values of the tuning parameter. The tuning parameter that yields the smallest estimated prediction error is the value chosen. There is no optimal choice for the number of folds k; however, 5-fold or 10-fold CV is often used. The jackknife is a similar procedure that is also known as leave-one-out CV (LooCV), as it uses n as the number of folds.

Weights using these methods have been created. The adaptive regression by mixing (ARM) procedure (Yang, 2001) uses CV weights to combine different regression models. The ARMS procedure (Yuan and Yang, 2005) adds a screening step that uses AIC or BIC. These procedures were shown to have a desirable risk bound. Hansen and Racine (2009) defined weights based on the jackknife criterion, which they termed Jackknife Model Averaging, that, under suitable conditions, were also shown to have desirable risk and loss properties.

2.2.4 Mallows Criterion Weights

Hansen (2007) proposed minimizing Mallows C_P criterion to select the weight vector for combining subset regressions. Mallows (1973) described a useful statistic for assessing the fit of a model, called the C_P statistic (known commonly as Mallows C_P). Denote a subset of the predictors by P, the number of predictors in that subset by |P|, and the least squares estimate on that subset by $\hat{\beta}_P$. Mallows C_P criterion is given by

$$C_P = \frac{1}{\hat{\sigma}^2} \left\| y - X_P \hat{\beta}_P \right\|^2 + 2|P| - n,$$

where $\hat{\sigma}^2$ is an estimate of the error variance σ^2.

Let our model set M be a (possibly infinite) sequence of candidate models where the m-th model uses any k_m > 0 regressors. The m-th candidate model is given by

$$y_i = \sum_{j=1}^{k_m} \beta_{j(m)} x_{i,j(m)} + v_{i(m)} + \epsilon_i \qquad (2.1)$$

and we define the approximation error v_{i(m)} by

$$v_{i(m)} = \mu_i - \sum_{j=1}^{k_m} \beta_{j(m)} x_{i,j(m)}.$$

Using matrix notation, model (2.1) can be rewritten as

$$y = \mu + \epsilon = X_m \beta_m + V_m + \epsilon.$$

Here X_m is an n × k_m matrix with ij-th element x_{i,j(m)} and V_m = (v_{1(m)}, ..., v_{n(m)})^T. Define $\hat{\beta}_{m,OLS}$ to be the OLS solution for the m-th candidate model. Thus, we have an estimate for μ from the m-th candidate model,

$$\hat{\mu}_m = X_m \hat{\beta}_{m,OLS} = X_m (X_m^T X_m)^{-1} X_m^T y = P_m y,$$

where P_m is the usual hat matrix. The Mallows model averaged estimate of μ is defined as

$$\hat{\mu}(w) = \sum_{m=1}^{M} w_m P_m y \equiv P(w)\, y. \qquad (2.2)$$

The weight vector is then chosen adaptively using a Mallows-type criterion given by

$$C_n(w) = (y - \hat{\mu}(w))^T (y - \hat{\mu}(w)) + 2\sigma^2\, \mathrm{tr}(P(w)).$$

The estimated weight vector is given by

$$\hat{w} = \underset{w}{\operatorname{argmin}}\ C_n(w),$$

the estimate of μ is $\hat{\mu}(\hat{w})$, and the estimate for β is given by

$$\hat{\beta}_{Mallows} = \sum_{m=1}^{M} \hat{w}_m \hat{\beta}_{m,OLS}.$$

Li (1987) demonstrated the asymptotic optimality of the minimum Mallows C_P chosen model in homoskedastic regression. Part of Hansen's motivation for using Mallows criterion as a basis for defining model averaging weights is that it is asymptotically equivalent to the squared error, and so it must also minimize the squared error in large samples (Hansen, 2007). Hansen gave theory, extended by Wan et al. (2010), showing that the method is asymptotically optimal in the sense that the chosen weight vector asymptotically achieves the best possible squared loss.
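To make the Mallows weight choice concrete, the following sketch (illustrative only, not Hansen's implementation) computes C_n(w) for a set of nested OLS candidate models and minimizes it by a naive grid search over the weight simplex. Since C_n(w) is quadratic in w, it is typically minimized with quadratic programming in practice; here σ^2 is replaced by an estimate from the largest candidate model, and the data and model set are made up for the example.

```python
import numpy as np
from itertools import product

def mallows_weights(X, y, models, sigma2, step=0.05):
    """Choose model-averaging weights by minimizing the Mallows criterion
    C_n(w) = ||y - mu_hat(w)||^2 + 2*sigma2*tr(P(w)) over a grid on the simplex."""
    # Hat matrices P_m, fitted values, and traces for each candidate model
    hats = []
    for idx in models:
        Xm = X[:, idx]
        hats.append(Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T))
    fits = [P @ y for P in hats]
    traces = [np.trace(P) for P in hats]

    best_w, best_c = None, np.inf
    grid = np.arange(0, 1 + 1e-9, step)
    for w in product(grid, repeat=len(models)):
        if abs(sum(w) - 1) > 1e-9:
            continue                      # keep only weight vectors on the simplex
        mu = sum(wm * f for wm, f in zip(w, fits))
        c = np.sum((y - mu) ** 2) + 2 * sigma2 * sum(wm * t for wm, t in zip(w, traces))
        if c < best_c:
            best_w, best_c = w, c
    return np.array(best_w)

rng = np.random.default_rng(3)
n = 60
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, 0.5, 0.25, 0.0]) + rng.normal(size=n)

models = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]      # nested candidate models
sigma2_hat = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2) / (n - 4)
print(np.round(mallows_weights(X, y, models, sigma2_hat), 2))
```

The resulting weight vector can then be plugged into the FMA coefficient average shown earlier to obtain the Mallows model averaged estimate of β.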

Chapter 3  Model Averaged Ridge Regression

3.1 Introduction and Motivation

Consider the regression setup with n independent observations, each consisting of a response variable y_i and a countably infinite number of predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n. The true model for the data can be written as

$$y_i = \mu_i + \epsilon_i = \sum_{j=1}^{\infty} \beta_j x_{ij} + \epsilon_i$$

and in vector form as y = μ + ε, where y = (y_1, ..., y_n)^T, μ = (μ_1, ..., μ_n)^T, ε = (ε_1, ..., ε_n)^T, β_j are the regression parameters, E(ε_i | X_i) = 0, and Var(ε_i | X_i) = σ^2. In practice, we approximate μ using a finite number of predictors p_n, which we can write in matrix form as

$$E[y] = \mu \approx X\beta,$$

where μ_i is approximated by $\sum_{j=1}^{p_n} \beta_j x_{ij}$, X is an n × p_n design matrix with (i, j) element x_ij, and β = (β_1, ..., β_{p_n})^T.

The standard way to fit the linear model above is using ordinary least squares (OLS). The OLS model fitting criterion is given by

$$\min_{\beta} \|y - X\beta\|^2$$

and, in the full rank case, the OLS solution is given by

$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y.$$

The OLS solution can often be improved upon in terms of prediction accuracy and interpretation (Burnham and Anderson, 2002). Interpretation of the estimates can be aided by eliminating some of the predictors from the model. Prediction accuracy can be improved by trading increased bias for lower variance. The OLS estimates are unbiased, but when many correlated predictors are included in the model the estimates may have high variance. There are many methods for selecting the structure and fit of the model, such as best subset selection, forward/backward stepwise selection, ridge regression (RR), the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), the elastic net (EN) (Zou and Hastie, 2005), and least angle regression (LARS) (Efron et al., 2002), to name a few. In this paper, our goal will be that of improving the prediction accuracy of our estimates.

Ridge regression is a commonly used method when the objective of the analysis is to increase prediction accuracy in the face of correlated predictors. When there exists high correlation among the predictors, the X^T X matrix may become close to singular and the OLS solution will exhibit high sampling variance.

Ridge regression is a penalized regression technique that places a constraint on the L_2 norm of the regression parameter vector. The ridge regression solution in Lagrangian form is given by

$$\hat{\beta}_{RR} = \underset{\beta}{\operatorname{argmin}}\ \|y - X\beta\|^2 + \lambda \beta^T \beta,$$

where λ > 0 is a tuning parameter. The closed form solution is given by

$$\hat{\beta}_{RR} = (X^T X + \lambda I_{p_n})^{-1} X^T y.$$

A nice property of ridge regression is that $\hat{\beta}_{RR}$ exists even when p_n > n. If the true mean is in the linear space spanned by the observed predictors, i.e. μ = Xβ, ridge regression has been shown to have favorable properties. The mean square error (MSE) of the ridge estimator is

$$MSE(\hat{\beta}_{RR}) = (\text{Bias of } \hat{\beta}_{RR})^2 + Var[\hat{\beta}_{RR}] = \lambda^2 \beta^T (X^T X + \lambda I_{p_n})^{-1} (X^T X + \lambda I_{p_n})^{-1} \beta + \sigma^2 (X^T X + \lambda I_{p_n})^{-1} X^T X (X^T X + \lambda I_{p_n})^{-1}.$$

It is well known that the OLS solution is unbiased and has variance $\sigma^2 (X^T X)^{-1}$. We can see that, compared to the OLS estimate, the ridge regression estimate increases the bias but decreases the variance by adding a term to the diagonal of the X^T X matrix. This is commonly referred to as a bias-variance trade-off. In the seminal ridge regression paper by Hoerl and Kennard (1970), it was shown that there exist λ values such that $\hat{\beta}_{RR}$ obtains a lower total MSE than $\hat{\beta}_{OLS}$. This result was extended to a more general MSE-type criterion by Theobald (1974). Lee and Triveld (1982) also showed a similar result when the errors are correlated. When the true mean is not in the linear space spanned by the predictors, ridge regression may still give good results. Uemukai (2011) gives conditions on the coefficients and predictors under which a single parameter ridge regression estimate from a model with omitted variables will achieve better MSE than the OLS estimate. Through numer-


More information

Model selection for good estimation or prediction over a user-specified covariate distribution

Model selection for good estimation or prediction over a user-specified covariate distribution Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2010 Model selection for good estimation or prediction over a user-specified covariate distribution Adam Lee

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Model Combining in Factorial Data Analysis

Model Combining in Factorial Data Analysis Model Combining in Factorial Data Analysis Lihua Chen Department of Mathematics, The University of Toledo Panayotis Giannakouros Department of Economics, University of Missouri Kansas City Yuhong Yang

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Dhruv B. Sharma, Howard D. Bondell and Hao Helen Zhang Abstract Statistical procedures for variable selection

More information

PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

More information

Lecture 14: Shrinkage

Lecture 14: Shrinkage Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Introduction to Machine Learning and Cross-Validation

Introduction to Machine Learning and Cross-Validation Introduction to Machine Learning and Cross-Validation Jonathan Hersh 1 February 27, 2019 J.Hersh (Chapman ) Intro & CV February 27, 2019 1 / 29 Plan 1 Introduction 2 Preliminary Terminology 3 Bias-Variance

More information

Jackknife Model Averaging for Quantile Regressions

Jackknife Model Averaging for Quantile Regressions Singapore Management University Institutional Knowledge at Singapore Management University Research Collection School Of Economics School of Economics -3 Jackknife Model Averaging for Quantile Regressions

More information

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Andreas Buja joint with the PoSI Group: Richard Berk, Lawrence Brown, Linda Zhao, Kai Zhang Ed George, Mikhail Traskin, Emil Pitkin,

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities

Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Joshua R. Loftus Outline 1 Intro and background 2 Framework: quadratic model selection events

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Bootstrap, Jackknife and other resampling methods

Bootstrap, Jackknife and other resampling methods Bootstrap, Jackknife and other resampling methods Part VI: Cross-validation Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD) 453 Modern

More information

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Outline lecture 2 2(30)

Outline lecture 2 2(30) Outline lecture 2 2(3), Lecture 2 Linear Regression it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic Control

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

STK-IN4300 Statistical Learning Methods in Data Science

STK-IN4300 Statistical Learning Methods in Data Science Outline of the lecture Linear Methods for Regression Linear Regression Models and Least Squares Subset selection STK-IN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information