ABSTRACT

POST, JUSTIN BLAISE. Methods to Improve Prediction Accuracy under Structural Constraints. (Under the direction of Howard Bondell.)

Statisticians are often faced with the difficult task of model selection and making inference from the selected model. A method such as Ordinary Least Squares (OLS) can be used to fit a given model, but that model can often be improved by removing some extraneous variables, shrinking parameter estimates, or using estimates that are averages across many different models. This thesis investigates and provides new solutions for two common problems.

In multiple linear regression with correlated predictors, a common goal is to increase prediction accuracy. Ridge regression (Hoerl and Kennard, 1970) increases prediction accuracy in the correlated-predictor setting by trading an increase in bias for a reduction in variance. Ridge regression falls into a class of regularization, or penalized regression, methods. The amount of penalization is controlled by a tuning parameter that is often selected by minimizing some criterion. Once selected, this value is typically treated as if it were known in advance. However, there is inherent variability in the selection of the tuning parameter. In order to account for this extra level of variability and gain predictive power, a model averaged version of ridge regression utilizing a Mallows criterion is developed. Theory is given demonstrating that the estimator with weight vector chosen using the Mallows criterion is asymptotically optimal, in the sense that its loss is asymptotically equivalent to that of the infeasible optimal weight vector. We demonstrate that the proposed estimator performs well in both simulations and real data examples. We also propose similar estimators based on other weighting criteria, as well as on another penalization procedure, the least absolute shrinkage and selection operator (Tibshirani, 1996). Simulation results are given that show the usefulness of these estimators compared to their single-model counterparts.

When faced with categorical predictors and a continuous response, the objective of analysis often consists of two tasks: finding which factors are important and determining which levels of the factors differ significantly from one another. Often these tasks are done separately using Analysis of Variance followed by a post-hoc hypothesis testing procedure such as Tukey's Honestly Significant Difference test. When interactions between factors are included in the model, the collapsing of levels of a factor becomes a more difficult problem. When testing for differences between two levels of a factor, claiming no difference refers not only to equality of main effects, but also to equality of each interaction involving those levels. An appropriate regularization is constructed that encourages levels of factors to collapse and entire factors to be set to zero. It is shown that the procedure has the oracle property, implying that asymptotically it performs as well as if the exact structure were known beforehand. We also discuss the application to estimating interactions in the unreplicated case. Simulation studies show that the procedure outperforms post-hoc hypothesis testing procedures as well as similar methods that do not include a structural constraint. The method is also illustrated using a real data example.

Copyright 2012 by Justin Blaise Post
All Rights Reserved

Methods to Improve Prediction Accuracy under Structural Constraints

by
Justin Blaise Post

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Statistics

Raleigh, North Carolina
2012

APPROVED BY:
Jason Osborne
Brian Reich
Yichao Wu
Howard Bondell, Chair of Advisory Committee

DEDICATION

To my lovely wife Malorie and my loving family.

BIOGRAPHY

The author was born in a small town in rural Pennsylvania. He received a bachelor's degree in mathematics from Penn State Erie, the Behrend College. He has spent the past five years at North Carolina State University learning statistics.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Howard Bondell, for his guidance and my wife Malorie for her support and love.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1  Introduction
    1.1 Regularization Framework
    1.2 Plan of Dissertation
Chapter 2  Introduction to Model Averaging
    2.1 Bayesian Model Averaging Review
        2.1.1 The BMA Framework
    2.2 Frequentist Model Averaging Review
        2.2.1 Information Theory Weights
        2.2.2 Bootstrap Weights
        2.2.3 Cross Validation Weights
        2.2.4 Mallows Criterion Weights
Chapter 3  Model Averaged Ridge Regression
    3.1 Introduction and Motivation
    FMA Review - Basic Paradigm
    Adapting FMA to Ridge Regression
    Weight Choice
    Fixed Weights
    Adaptive Weights
    Model Averaged Ridge Regression using Mallows Criterion
    Bayesian Connection
    Asymptotic Properties
    Simulation Studies
    Set-up and Competitors
    Results
    Real Data Examples
    Pakistani Economic Data
    Longley's Economic Data
    Boston Housing Data
    Discussion
Chapter 4  Alternative Model Averaged Ridge Regression Estimators
    Information Criterion Weights
    AIC Weights with One Predictor and Known Variance
    4.2 Using a Distribution on the Tuning Parameter
    Ridge Regression with One Predictor
    Bootstrap Estimates
    Combining Model Weights and a Distribution
    Using the Mixed Model Formulation
    Normal Approximation to the ML and REML Estimates
    Taylor Expansion
    Bootstrap Estimates
    Simulation Comparisons
    Simulation Set-ups
    Simulation Results
Chapter 5  Model Averaging the LASSO
    The LASSO procedure
    Model Averaged LASSO using Mallows Criterion
    Model Averaged LASSO using a Mixed Model
    Taylor Expansion
    Bayesian Connection
    Simulation Studies
    Simulation Set-up
    Simulation Results
Chapter 6  Factor Selection and Structural Identification in the Interaction ANOVA Model
    Introduction
    Collapsing via Penalized Regression
    GASH-ANOVA Method
    Investigating Interactions in the Unreplicated Case
    Asymptotic Properties
    Computation and Tuning
    Simulation Studies
    Simulation Set-up
    Competitors and Methods of Evaluation
    Simulation Results
    Additional Simulation Set-up
    Additional Simulation Results
    Real Data Example
    Discussion
References
Appendices
Appendix A  Chapter 3 Proofs
    A.1 Lemma
    A.2 Lemma
    A.3 Proof of Theorem
    A.4 Proof of Theorem
Appendix B  Chapter 6 Proofs
    B.1 Proof of Theorem

LIST OF TABLES

Table 3.1  Average model error and standard errors over 500 independent data sets, sample size n = 20, for the Mallows model averaged ridge regression estimates (MA) and the corresponding minimum C_L model estimate (Best). Estimates found are model averaged over 100 equally spaced points over the degrees of freedom. Letters indicate groups of significantly different estimates.
Table 3.2  Average model error and standard errors over 500 independent data sets, sample size n = 50, for the Mallows model averaged ridge regression estimates (MA) and the corresponding minimum C_L model estimate (Best). Estimates found are model averaged over 100 equally spaced points over the degrees of freedom. Letters indicate groups of significantly different estimates.
Table 3.3  Average prediction error and standard error for 300 splits of the Pakistani economic data. The training sets had 18 observations and the test sets had 10 observations.
Table 3.4  Average prediction error and standard error for 300 splits of the Longley economic data. The training sets had 13 observations and the test sets had three observations.
Table 3.5  Average prediction error and standard error for 300 splits of the Boston housing data. The training sets had 350 observations and the test sets had 156 observations.
Table 4.1  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using AIC weights. Estimates found are the minimum AIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.2  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using AICc weights. Estimates found are the minimum AICc estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.3  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_1, for the model averaged RR estimates using BIC weights. Estimates found are the minimum BIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.4  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using AIC weights. Estimates found are the minimum AIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.5  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using AICc weights. Estimates found are the minimum AICc estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 4.6  Average model error and standard errors over 200 independent data sets, sample size n = 20, coefficient vector β_2, for the model averaged RR estimates using BIC weights. Estimates found are the minimum BIC estimate and the model averaged estimate using a grid of 160 equally spaced points over the degrees of freedom.
Table 5.1  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using Mallows criterion. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.2  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_1.
Table 5.3  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_2.
Table 5.4  Percent of the time the model averaged LASSO estimates using Mallows criterion and the Mallows best chosen model selected the exact true model, the percent of time these procedures included the true model, and the average number of zero estimates for coefficient vector β_3.
Table 5.5  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using AIC weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.6  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using AICc weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 5.7  Average model error and standard errors over 200 independent data sets, sample size n = 20, for the model averaged LASSO estimates using BIC weights. Estimates found are model averaged over the Break points, a grid of 101 equally spaced points over the Fraction of the LASSO solution to the OLS solution, and the corresponding single Best model estimate. Letters indicate groups of significantly different estimates.
Table 6.1  Table of Cell Means for θ_1.
Table 6.2  Table of Cell Means for θ_2.
Table 6.3  Null Model Simulation Results.
Table 6.4  Simulation Results for Effect Vector θ_1. There are 23 truly nonzero level differences and 11 level differences that should be collapsed. There are 68 truly nonzero pairwise differences and 71 pairwise differences that should be collapsed.
Table 6.5  Simulation Results for θ_2. There are 18 truly nonzero level differences and 16 level differences that should be collapsed. There are 61 truly nonzero pairwise differences and 78 pairwise differences that should be collapsed.
Table 6.6  Table of Cell Means for θ_3.
Table 6.7  Simulation Results for θ_3 with error variance one. There are three truly nonzero level differences and three level differences that should be collapsed. There are three truly nonzero pairwise differences and 11 pairwise differences that should be collapsed.
Table 6.8  Simulation Results for θ_3 with error variance four. There are three truly nonzero level differences and three level differences that should be collapsed. There are three truly nonzero pairwise differences and 11 pairwise differences that should be collapsed.
Table 6.9  Distinct Levels within Factors.
Table 6.10 Treatment Combination Means and Distinct Levels.

LIST OF FIGURES

Figure 3.1  Ridge regression solution path for the full Pakistani economic data. Tick marks along the top of the graph represent where the ridge regression estimate has a given number of effective degrees of freedom. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Land Cultivated, 2=Inflation Rate, 3=Number of Establishments, 4=Population, and 5=Literacy Rate.
Figure 3.2  Ridge regression solution path for the full Longley economic data. Tick marks along the top of the graph represent where the ridge regression estimate has a given number of effective degrees of freedom. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Gross National Product implicit price deflator, 2=Gross National Product, 3=number of unemployed, 4=number of people in the armed forces, 5=noninstitutionalized population at least 14 years of age, and 6=year.
Figure 4.1  Ridge regression profile for β_1 from the simulation done in section 4.6. The number on the right of the graph describes which regression parameter (1-8) path the curve represents. The three vertical lines give the minimum AIC, AICc, and BIC chosen estimates.
Figure 4.2  AIC weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.3  AICc weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.4  BIC weighted ridge regression profile for β_1 from the simulation done in section 4.6. The model averaged ridge regression estimate is given by the area under the curves.
Figure 4.5  Plots of Beta distributions for different values of the shape parameters α_1 and α_2.
Figure 4.6  Weighted ridge regression profiles for a one predictor example along with their Beta weighted estimates.
Figure 5.1  LASSO solution path for Pakistani economic data. Tick marks along the top of the graph represent the break points. Tick marks along the right of the graph represent the full least squares solution with the following: 1=Land Cultivated, 2=Inflation Rate, 3=No. of Establishments, 4=Population, and 5=Literacy Rate.

Chapter 1  Introduction

Statisticians are often faced with the difficult task of model selection and making inference from the selected model. The topic of model selection has received much attention in the statistical literature. Historically, hypothesis testing methods were the focus; however, in recent decades there have been many developments on the topic that utilize regularization, or penalization, methods. Given a response and predictor variables, a method such as ordinary least squares (OLS) can be used to fit a given model. The fit of this model can often be improved upon by removing some extraneous variables, using estimates that are averages of estimates across different models, or shrinking the estimates in some way. This thesis investigates and provides new solutions for two common problems in this area.

1.1 Regularization Framework

Regularization methods are a powerful group of procedures that minimize a loss function subject to a penalty term. For a response y_i and predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n, a general one-dimensional regularization problem has the form

$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda J(f) \right],$$

where L(·,·) is a loss function, f(·) is a function in some space of functions $\mathcal{H}$, λ is a tuning parameter, and J(f) is a penalty functional (Hastie et al., 2009). Many common methods used today fall into this framework, including ridge regression (RR) (Hoerl and Kennard, 1970), the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), least angle regression (LARS) (Efron et al., 2002), and the elastic net (EN) (Zou and Hastie, 2005), which involves a two-dimensional functional with two tuning parameters, to name a few.

1.2 Plan of Dissertation

This thesis lays out new solutions that utilize the regularization framework for two common problems statisticians face. The first problem is that of improving the prediction accuracy of our model estimates. We develop a method that combines ridge regression and frequentist model averaging (FMA). The second problem is that of choosing the important factors in multi-factor analysis of variance (ANOVA) and deciding which levels of the factors are significantly different from one another when interactions are included in the model. We develop a new penalization method that simultaneously selects important predictors and also creates non-overlapping groups of levels within each factor. The makeup of this thesis is as follows: in chapter 2, we begin by giving a review of FMA. The new model averaged ridge regression estimator is given in chapter 3. Chapter 4 gives some alternative methods for estimating a model averaged ridge regression solution.

Chapter 5 discusses the extensions of FMA to the LASSO procedure. In chapter 6, the grouping and selection using heredity (GASH) ANOVA method is introduced as a solution to the multi-factor ANOVA with interaction problem. The material in chapters 3 and 6 comes almost entirely from two journal articles (one submitted and one to be submitted). Thus, the notation used in chapter 6 is different from that used in the rest of the dissertation, and any symbols and definitions apply only to that chapter. Also, some material is briefly repeated in chapter 3.
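As a concrete illustration of the regularization framework in Section 1.1, the following is a minimal Python sketch (added here for exposition, not part of the original text) of the penalized least squares problem with an L2 penalty, i.e. ridge regression, which admits a closed-form solution. The data and the tuning value are made up for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * b'b; closed-form ridge solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy example with highly correlated predictors
rng = np.random.default_rng(0)
n, p = 50, 5
Z = rng.normal(size=(n, 1))
X = 0.9 * Z + 0.1 * rng.normal(size=(n, p))     # columns share a common component
beta_true = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

b_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers OLS (when X'X is invertible)
b_ridge = ridge_fit(X, y, lam=5.0)   # lam > 0 shrinks the estimates toward zero
print(np.round(b_ols, 3), np.round(b_ridge, 3))
```

Setting the tuning parameter to zero recovers the OLS fit, while larger values shrink the coefficients toward zero; this bias-variance trade-off is the device exploited in the later chapters.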

Chapter 2  Introduction to Model Averaging

Consider the regression setup with n independent observations, each consisting of a response variable y_i and a countably infinite number of predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n. Assume the true model for the data can be written as

$$y_i = \mu_i + \epsilon_i = \sum_{j=1}^{\infty} \beta_j x_{ij} + \epsilon_i$$

and in vector form as y = μ + ε, where y = (y_1, ..., y_n)^T, μ = (μ_1, ..., μ_n)^T, β_1, β_2, ... are regression parameters, and ε = (ε_1, ..., ε_n)^T are error terms. In practice, we often approximate μ using a finite number of predictors p_n, which we can write in matrix form as

$$E[y] = \mu \approx X\beta,$$

where μ_i is approximated by $\sum_{j=1}^{p_n} \beta_j x_{ij}$, X is an n × p_n design matrix with (i, j) element x_ij, and β = (β_1, ..., β_{p_n})^T.

The standard way to fit the linear model above is using ordinary least squares (OLS). The OLS model fitting criterion is given by

$$\min_{\beta} \|y - X\beta\|^2$$

and, when X is full rank, the OLS solution is given by

$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y.$$

The OLS solution can often be improved upon in terms of prediction accuracy and interpretation (Burnham and Anderson, 2002). Interpretation of the estimates can be aided by eliminating some of the predictors from the model. Prediction accuracy can be improved by trading increased bias for lower variance. The OLS estimates are unbiased, but when many correlated predictors are included in the model the estimates may have high variance.

There are many methods for selecting the structure and fit of the model. For example, one can use best subset selection, forward/backward stepwise selection, or any of the regularization methods mentioned in chapter 1. These methods all have some criterion, such as hypothesis tests or tuning parameters, that controls the final model and model fit selected. Once a procedure is applied and the model and fit are selected, inference usually proceeds conditional on this result. Thus, this aspect is often treated as known in advance. This can lead to invalid inference because the variability inherent in the model selection process is ignored; i.e., if another data set were available and the selection procedure applied to that data set, a different structure or fit would likely be chosen. There are in general three ways to conduct unconditional inference (Burnham and Anderson, 2002).

One approach is to use theoretical adjustments in the standard error formulas, another is to use bootstrapping to assess the variability, and the third is to use model averaging. Model averaging usually does not help in terms of interpretability of the model, but, if the goal is to predict new responses, model averaging tends to yield very good prediction accuracy. Note that while this is being called unconditional inference, the inference is of course conditional on the overall set of models being considered. Model averaging is most often done in a Bayesian framework and is called Bayesian model averaging (BMA). However, there has been an increasing amount of literature in the last decade on frequentist model averaging (FMA) (Wang et al., 2009). We now give a brief review of BMA (for in-depth reviews of theory and application see Hoeting et al. (1999) and Clyde and George (2004)) followed by a more in-depth review of FMA.

2.1 Bayesian Model Averaging Review

The Bayesian paradigm gives a strong and natural theoretical approach to incorporating model and parameter uncertainty into inference, yielding the unconditional inference we desire. There has been extensive work done in the area of BMA, especially as of late due to advances in computing power and Markov chain Monte Carlo (MCMC) methods (Clyde and George, 2004).

2.1.1 The BMA Framework

Suppose we denote our set of possible models by M = {M_m}, m = 1, ..., M, where each model has parameter vector β_m consisting of elements from β. We require that the regression parameter β_j represent the effect for the same predictor in each of the M models. In order to perform proper BMA, a prior for each of the M models is needed, along with priors for each parameter vector given its model.

Denote the prior for model M_m by P(M_m) and the prior for the parameters of that model by P(β_m | M_m). We denote the combination of our data as Z = (X, y). Hence, for model M_m we have the joint distribution

$$P(Z, \beta_m, M_m) = P(Z \mid \beta_m, M_m)\,P(\beta_m \mid M_m)\,P(M_m),$$

where P(Z | β_m, M_m) is the distribution, or likelihood, of the data. Following Clyde and George (2004), this places all of the models into a hierarchical mixture model framework where the data come from first generating the model M_m from the candidate model set, then generating the parameter vector β_m for that model, and finally generating the data from P(Z | β_m, M_m). Using Bayes' rule, the posterior probability for model M_m is given by

$$P(M_m \mid Z) = \frac{P(Z \mid M_m)\,P(M_m)}{\sum_{k=1}^{M} P(Z \mid M_k)\,P(M_k)},$$

where $P(Z \mid M_m) = \int P(Z \mid \beta_m, M_m)\,P(\beta_m \mid M_m)\,d\beta_m$ (throughout we assume a continuous distribution, but the definition for a discrete distribution is straightforward). These posterior model probabilities can be used to select a single best model, compare models, or combine models using model averaging. If one model has a much higher posterior probability, or if a few models have high posterior probability, these models can be selected and reported. To compare model j to model m, a Bayesian would usually look at the posterior odds of the two models, defined as

$$\frac{P(M_j \mid Z)}{P(M_m \mid Z)} = \frac{P(Z \mid M_j)\,P(M_j)}{P(Z \mid M_m)\,P(M_m)}.$$
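As a small numerical sketch of the posterior model probability and posterior odds formulas above (the marginal likelihood values below are hypothetical, chosen only for illustration, and are not from the dissertation):

```python
import numpy as np

# Hypothetical marginal likelihoods P(Z | M_m) and priors P(M_m) for M = 3 models
marg_lik = np.array([2.0e-5, 5.0e-5, 1.0e-6])   # assumed values for illustration
prior    = np.array([1/3, 1/3, 1/3])            # equal prior model probabilities

# Posterior model probabilities via Bayes' rule
post = marg_lik * prior / np.sum(marg_lik * prior)
print(np.round(post, 3))        # e.g. [0.282 0.704 0.014]

# Posterior odds of model 1 versus model 2; with equal priors this is the
# ratio of marginal likelihoods, i.e. the Bayes factor discussed below
print(post[0] / post[1])
```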

When the prior probabilities of the two models are the same, they cancel and we are left with what is known as the Bayes factor. The Bayes factor describes the level of support for one model over another utilizing just the data at hand. The Bayes factor can sometimes be difficult to find exactly. For large sample sizes, the Schwarz (or Bayes) information criterion (BIC), given by $BIC = -2\log(L) + \log(n)\,p_n$ (Schwarz, 1978), where L is the likelihood of the data, offers a good approximation to the posterior probability for a given model:

$$P(M_m \mid Z) \approx \frac{\exp(-BIC_m/2)\,P(M_m)}{\sum_{k=1}^{M} \exp(-BIC_k/2)\,P(M_k)},$$

where BIC_m denotes the BIC value for model m. For this reason, the ratio of the exponentiated BIC values is often used to approximate the Bayes factor (Kass and Raftery, 1995).

To make inference for a particular quantity of interest η, one would look at the posterior distribution of η (Hoeting et al., 1999). This is given by

$$P(\eta \mid Z) = \sum_{m=1}^{M} P(\eta \mid M_m, Z)\,P(M_m \mid Z).$$

This posterior distribution is an average of the posterior distributions under the different models, weighted by their posterior model probabilities. The mean and variance of this posterior distribution are given by

$$E[\eta \mid Z] = \sum_{m=1}^{M} E[\eta \mid Z, M_m]\,P(M_m \mid Z)$$

and

$$Var[\eta \mid Z] = \sum_{m=1}^{M} \left( Var[\eta \mid Z, M_m] + E[\eta \mid Z, M_m]^2 \right) P(M_m \mid Z) - E[\eta \mid Z]^2.$$

It has been shown that by using these averaged estimates one tends to see better performance in terms of predictive ability than using an estimate from a single model (Madigan and Raftery, 1994).

The main issues and drawbacks associated with BMA are the posterior calculations and the prior model specifications (Clyde and George, 2004). The posterior calculations require evaluating integrals for each model in the candidate set that rarely have closed forms. This limitation has been somewhat eased by recent advances in computational speed and in MCMC methods. However, MCMC methods have their own difficulties, such as identifiability issues and determining whether the distributions have converged (Eberly and Carlin, 2000). The other main issue is the difficulty that comes in specifying meaningful and non-conflicting prior distributions for the models and their parameters. This becomes more difficult when M is very large.

2.2 Frequentist Model Averaging Review

Although the amount of literature on BMA is much larger than that on FMA, FMA has the advantage that its estimators are constructed using only the data at hand. FMA can be applied to a diverse number of models, including but not limited to logistic regression (Ghosh and Yuan 2009, Chen et al. 2009), power and exponential models (Burnham and Anderson, 2002), linear regression models, and combinations of different regression procedures (Yang, 2001).

Hjort and Claeskens (2003) give asymptotic theory for FMA estimators using a local misspecification framework. Because our focus is on the linear regression model, we choose to define the FMA estimators in this framework. The basic reason for using FMA is to provide more robust results.

Recall we suppose that there are M candidate models under consideration. When only considering linear mean structures, the M models usually arise from including or excluding predictors from the models (including derived predictors such as interactions or quadratic terms). Again, let β_j represent the effect for the same predictor in each of the M models. We define the FMA estimate for β_j, j = 1, ..., p_n, as

$$\hat{\bar{\beta}}_j = \sum_{m=1}^{M} w_m \hat{\beta}_{j,m},$$

where w_m is a weight for model m, $\sum_{m=1}^{M} w_m = 1$, and $\hat{\beta}_{j,m}$ is the estimate for β_j under the m-th model (if the corresponding predictor is not in model m, $\hat{\beta}_{j,m}$ is set to zero). Another technique for performing FMA is to average the predicted responses rather than the regression coefficients. In the context of the linear model, these two methods are equivalent. Also, we note that by setting the weight for a single model to one, the usual approach of selecting one best model falls into the FMA framework.

This leaves the choice of weight vector. There are many ways to choose the weights for FMA estimators, but the approaches basically fall into one of two categories. The first category involves fixed weights that are based on a model fit criterion, and the second category includes weights that are chosen adaptively to minimize or maximize some criterion. The method for choosing the weights is important, as different methods have different properties. The selection of weight vectors for FMA estimators is still an active area of research, which we review here (for a more detailed review, see Wang et al. (2009)).
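The FMA estimate defined above can be written directly in code. The following is a toy sketch (the submodels and weights are made up for illustration; this is not the dissertation's implementation): each candidate submodel is fit by OLS, coefficients of excluded predictors are set to zero, and the weighted average is returned.

```python
import numpy as np

def fma_coefficients(X, y, models, weights):
    """Frequentist model-averaged coefficients: sum_m w_m * beta_hat_{j,m},
    with beta_hat_{j,m} = 0 when predictor j is excluded from model m."""
    n, p = X.shape
    beta_avg = np.zeros(p)
    for idx, w in zip(models, weights):
        Xm = X[:, idx]
        bm = np.linalg.lstsq(Xm, y, rcond=None)[0]   # OLS fit of submodel m
        full = np.zeros(p)
        full[idx] = bm                               # zeros for omitted predictors
        beta_avg += w * full
    return beta_avg

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=40)

models  = [[0], [0, 2], [0, 1, 2]]          # candidate subsets of predictors
weights = [0.2, 0.5, 0.3]                   # nonnegative weights summing to one
print(np.round(fma_coefficients(X, y, models, weights), 3))
```

Any of the weighting schemes reviewed in the remainder of this section (information criterion, bootstrap, cross validation, or Mallows weights) could supply the weight vector used here.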

2.2.1 Information Theory Weights

The first type of fixed weights we review were developed by Buckland et al. (1997) and are based on information theory. Information theory gives a strong theoretical basis for model selection, so we briefly describe the approach before giving the weights. Information theorists do not believe that a true model exists; rather, they believe in an abstract notion of truth (call this f) that is being approximated by a model g (Burnham and Anderson, 2002). The Kullback-Leibler (KL) distance between f and g is defined generally for continuous functions as

$$I(f, g) = \int f(z) \log\left(\frac{f(z)}{g(z \mid \theta)}\right) dz,$$

where Z is the data and θ is the parameter vector for the approximating model g. The KL distance I(f, g) denotes the information lost when g is used to approximate f. Given the data Z with sample size n and the class of models g(z | θ), there exists a conceptual KL best model g(z | θ_0). The connection between KL distance and the maximum likelihood estimate (MLE) of θ, $\hat{\theta}_{ML}$, is that $\hat{\theta}_{ML}$ converges to θ_0 asymptotically, and the bias of $\hat{\theta}_{ML}$ is with respect to θ_0. This connection between likelihood inference and information theory creates a basis for using information theory for model selection.

Because f is unknown, we cannot estimate I(f, g). However, by noting that

$$I(f, g) = E_f[\log(f(Z))] - E_f[\log(g(Z \mid \theta))] = C - E_f[\log(g(Z \mid \theta))],$$

where C is a constant, one can estimate a relative distance to compare models. This quantity is minimized at θ_0, but we must estimate θ by some $\hat{\theta}$.

Therefore, to use KL information as a model selection criterion, we must change our target to minimizing the estimated expected KL distance. The critical issue is then to estimate

$$E_{Z}\left[E_{Z'}\left[\log\left(g\left(Z' \mid \hat{\theta}(Z)\right)\right)\right]\right],$$

where Z and Z' are independent random samples and the expectations are both with respect to f (Akaike, 1973). By use of Taylor expansions and other approximations (Burnham and Anderson, 2002), it can be shown that an unbiased estimator of our target is

$$\log\left(g\left(Z \mid \hat{\theta}(Z)\right)\right) - \widehat{\mathrm{tr}}\left\{ J(\theta_0)\left[I(\theta_0)\right]^{-1} \right\},$$

where tr stands for the trace,

$$J(\theta_0) = E_f\left[ \left[\frac{\partial}{\partial\theta} \log\big(g(Z \mid \theta)\big)\right] \left[\frac{\partial}{\partial\theta} \log\big(g(Z \mid \theta)\big)\right]^T \right]_{\theta=\theta_0},$$

and

$$I(\theta_0) = -E_f\left[ \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log\big(g(Z \mid \theta)\big) \right]_{\theta=\theta_0}.$$

If f is in the set of models g, then J(θ_0) = I(θ_0) and the trace term is simply the length of the parameter vector, p_n. Although this is not likely to be the case, p_n is still a good estimate of the trace term (Shibata, 1989). Note that θ should include the error variance σ^2 if it is unknown (making p_n from above p_n + 1). By using p_n as the estimate and multiplying by negative two, we get the usual Akaike's information criterion (AIC) (Akaike, 1974),

$$AIC = -2\log\left(g\left(Z \mid \hat{\theta}\right)\right) + 2p_n = -2\log(L) + 2p_n.$$
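For a Gaussian linear model fit by OLS, the AIC above can be computed directly from the residual sum of squares. The sketch below is an illustration added here (with simulated data, not from the dissertation); it counts the error variance as a parameter, as noted above.

```python
import numpy as np

def aic_linear(X, y):
    """AIC for a Gaussian linear model fit by OLS (error variance counted as a parameter)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    # Maximized Gaussian log-likelihood with sigma^2 estimated by RSS / n
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * loglik + 2 * (p + 1)   # -2 log(L) + 2 * (number of parameters incl. sigma^2)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=30)
print(round(aic_linear(X, y), 2))          # AIC for the full three-parameter mean model
print(round(aic_linear(X[:, :2], y), 2))   # AIC after dropping the last predictor
```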

Thus, the use of AIC for model selection has strong theoretical underpinnings. For their weights, Buckland et al. (1997) decided to use the class of information criteria of the form

$$IC = -2\log(L) + Q(n, p_n),$$

where L represents the likelihood for the model and Q(n, p_n) is a penalty function that can depend on the sample size n and the number of parameters p_n. To compare model j to model m using this criterion, one can look at the ratio

$$\frac{L_j \exp(-Q_j/2)}{L_m \exp(-Q_m/2)} = \frac{\exp(-IC_j/2)}{\exp(-IC_m/2)},$$

where L_m, Q_m, and IC_m are the likelihood, penalty function, and IC for model m, respectively. This leads to the model averaging weight for model m given by

$$w_m = \frac{\exp(-IC_m/2)}{\sum_{k=1}^{M} \exp(-IC_k/2)}.$$

Burnham and Anderson (2002) modified these weights using the AIC differences, $\Delta AIC_m = AIC_m - AIC_{min}$, where AIC_m is the AIC for model m and AIC_min is the minimum of the AIC values over the model set M. They advocate using these AIC differences to rank the different candidate models and to form equivalent (when Q corresponds to AIC) but more easily interpreted weights, which they call Akaike weights, given by

$$w_m = \frac{\exp(-\Delta AIC_m/2)}{\sum_{k=1}^{M} \exp(-\Delta AIC_k/2)}.$$

These weights are proportional to the likelihood of the model given the data.

They also provide a numerical value of the chance of model m being the actual KL best model in the candidate model set. If desired, the weights can be redefined with prior probabilities, P(w_m), for the models. The weights are then given by

$$w_m = \frac{\exp(-\Delta_m/2)\,P(w_m)}{\sum_{k=1}^{M} \exp(-\Delta_k/2)\,P(w_k)}.$$

Note that these weights do not fully satisfy the BMA scheme because prior distributions on the parameters are also needed (Burnham and Anderson, 2002). Differences of this form can be found and used for any choice of penalty Q in the IC class, and we shall use this form of the weight. Typical penalty functions used for Q are the BIC and AIC penalties already discussed, as well as the small sample correction to AIC in the linear model with homoskedastic errors called AICc (Hurvich and Tsai (1989) and Hurvich and Tsai (1995)). The AICc criterion is given by

$$AICc = AIC + \frac{2p_n(p_n + 1)}{n - p_n - 1}.$$

When selecting a single best model, AIC tends to perform poorly when the ratio of the sample size to the number of parameters is small (< 40). Therefore, AICc is recommended in this case, as it has been shown to perform better than AIC for small samples even when its assumptions are not met (Burnham and Anderson (2002) and Sakamoto et al. (1986)). The same is likely true when using these values for model averaging.

Neither Buckland et al. (1997) nor Burnham and Anderson (2002) give any theory based on these information criteria weights. However, they give many examples to demonstrate their usefulness, as well as methods for constructing unconditional confidence intervals for the parameters. Further, Leung and Barron (2006) give risk bounds for weights of a different form that are based on information criteria.
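The Akaike weights are straightforward to compute from a vector of AIC (or AICc, BIC) values. A brief sketch, using hypothetical criterion values chosen only for illustration:

```python
import numpy as np

def akaike_weights(aic_values):
    """Model-averaging weights from AIC differences: w_m proportional to exp(-delta_m / 2)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()          # AIC differences relative to the best model
    w = np.exp(-delta / 2)
    return w / w.sum()

# Hypothetical AIC values for four candidate models
print(np.round(akaike_weights([102.3, 100.1, 100.9, 107.5]), 3))
```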

AIC Weights in the BMA Framework

The AIC weights (and AICc weights) can be motivated from the Bayesian point of view (Burnham and Anderson, 2002). If the approximation to the Bayesian posterior model probability utilizing BIC is used, a savvy prior for the models yields the AIC weights. Let the prior weight for model M_m be given by

$$P(M_m) \propto \exp\left(\Delta BIC_m/2\right) \exp\left(-\Delta AIC_m/2\right),$$

where ΔBIC_m is the BIC difference for model m. Thus, using these priors it is easily seen that

$$P(M_m \mid Z) \approx \frac{\exp(-\Delta BIC_m/2)\,P(M_m)}{\sum_{k=1}^{M} \exp(-\Delta BIC_k/2)\,P(M_k)} = \frac{\exp(-\Delta AIC_m/2)}{\sum_{k=1}^{M} \exp(-\Delta AIC_k/2)} = w_m.$$

This prior does not depend on the data, only on the sample size and number of parameters.

2.2.2 Bootstrap Weights

The other type of fixed weights we discuss are weights found using the bootstrap. The bootstrap was first introduced by Efron (1979) and has become a popular method for estimating sampling variances and other quantities that are analytically intractable. The bootstrap can be done in a parametric or nonparametric fashion and has been shown to work well in many situations, although it can fail with smaller sample sizes (Burnham and Anderson, 2002). The basic idea of the bootstrap is to estimate the conceptual probability distribution of the data, call this h, by creating a pseudo-population using the data values at hand (or by assuming a parametric distribution fit to the data).

Resamples with replacement from this pseudo-population, known as bootstrap samples, are used to obtain estimates of uncertainty about some quantity of interest from h. This set of bootstrap samples acts as a set of independent real data samples from h. The behavior of the estimator on truly independent data sets from h can then be deduced from its behavior over the bootstrap samples.

For example, suppose one wished to estimate the variance of an estimator $\hat{\psi}$ given a sample of size n arising from some distribution h. If the estimator is complicated, the theoretical variance may be very difficult to find analytically. Instead, one might use the nonparametric bootstrap method. First, one would create B bootstrap samples, each of size n, by sampling with replacement from the n observed values (randomly sampling from the empirical distribution of the data). The recommended number of bootstrap samples is anywhere from B = 1000 to B = 10,000. The estimator of interest would then be calculated on each bootstrap sample; call these estimators $\hat{\psi}_b$, b = 1, ..., B. Finally, to make inference about the variance of the estimator, one could look at the sample variance of the bootstrap estimators,

$$s^2_{boot} = \frac{\sum_{b=1}^{B} \left(\hat{\psi}_b - \bar{\hat{\psi}}\right)^2}{B - 1},$$

where $\bar{\hat{\psi}}$ is the sample mean of the bootstrap estimates.

The bootstrap has been applied to model selection, although not always successfully (Freedman et al., 1988). In terms of model averaging, the method of bagging (Breiman, 1996) is a bootstrap method that has been used successfully to reduce the variance of trees (although eliminating their interpretability). For our purposes, the bootstrap can be used to calculate the weights for our model averaged estimate. To accomplish this, one must first create B bootstrap samples from the data. Next, each model under consideration is fit to each of the B datasets.

For each bootstrap sample, the models are then ranked in terms of some criterion (such as minimum AIC, AICc, BIC, Mallows C_P, or cross validation value) and one model is selected as the best. The model selection relative frequency for model m, given by

$$w_m^{Boot} = \text{frequency of selection} / B,$$

can be estimated and used as a model weight. These model selection relative frequencies were shown to give positive results that were similar to, although not identical to, those from IC weights (Burnham and Anderson, 2002).

2.2.3 Cross Validation Weights

The first type of adaptive weights we discuss are weights based on cross validation (CV) and the jackknife. CV is a method to assess the performance of a model or to select model parameters. The idea has been around in some form since the 1930s, but k-fold CV was first clearly brought forward in the late 1960s and 1970s. CV is a relatively straightforward procedure that is one of the most widely used (Hastie et al., 2009), and we describe it here.

The hope when selecting a model is that it will perform very well on an independent test set arising from the same population. In a data-rich situation, one could separate the data into three parts: a training set, a validation set, and a test set. The different models would be fit on the training set, the model would be chosen based on its performance on the validation set, and the test set would be used to assess the prediction error of the chosen model. CV attempts to estimate the expected prediction error of a method by reusing folds of the data. For k-fold CV, the data is split into k equally sized and distinct subsets, or folds.

The models are fit on all but one fold of the data, and the remaining fold is used to evaluate the prediction error of that fit. The prediction errors for each fold are then averaged, and this value is used as the estimate of prediction error. If there is a tuning parameter, this process is done for many different values of the tuning parameter. The tuning parameter that yields the smallest estimated prediction error is the value chosen. There is no optimal choice for the number of folds k; however, 5-fold or 10-fold CV is often used. The jackknife is a similar procedure that is also known as leave-one-out CV (LooCV), as it uses n as the number of folds.

Weights using these methods have been created. The adaptive regression by mixing (ARM) procedure (Yang, 2001) uses CV weights to combine different regression models. The ARMS procedure (Yuan and Yang, 2005) adds a screening step that uses AIC or BIC. These procedures were shown to have a desirable risk bound. Hansen and Racine (2009) defined weights based on the jackknife criterion, which they termed Jackknife Model Averaging, that, under suitable conditions, were also shown to have desirable risk and loss properties.

2.2.4 Mallows Criterion Weights

Hansen (2007) proposed minimizing Mallows C_P criterion to select the weight vector for combining subset regressions. Mallows (1973) described a useful statistic for assessing the fit of a model, called the C_P statistic (known commonly as Mallows C_P). Denote a subset of the predictors by P, the number of predictors in that subset by |P|, and the least squares estimate on that subset by $\hat{\beta}_P$. Mallows C_P criterion is given by

$$C_P = \frac{1}{\hat{\sigma}^2} \left\| y - X_P \hat{\beta}_P \right\|^2 + 2|P| - n,$$

where $\hat{\sigma}^2$ is an estimate of the error variance σ^2.

Let our model set M be a (possibly infinite) sequence of candidate models where the m-th model uses any k_m > 0 regressors. The m-th candidate model is given by

$$y_i = \sum_{j=1}^{k_m} \beta_{j(m)} x_{i,j(m)} + v_{i(m)} + \epsilon_i \qquad (2.1)$$

and we define the approximation error v_{i(m)} by

$$v_{i(m)} = \mu_i - \sum_{j=1}^{k_m} \beta_{j(m)} x_{i,j(m)}.$$

Using matrix notation, model (2.1) can be rewritten as

$$y = \mu + \epsilon = X_m \beta_m + V_m + \epsilon.$$

Here X_m is an n × k_m matrix with ij-th element x_{i,j(m)} and V_m = (v_{1(m)}, ..., v_{n(m)})^T. Define $\hat{\beta}_{m,OLS}$ to be the OLS solution for the m-th candidate model. Thus, we have an estimate for μ from the m-th candidate model,

$$\hat{\mu}_m = X_m \hat{\beta}_{m,OLS} = X_m (X_m^T X_m)^{-1} X_m^T y = P_m y,$$

where P_m is the usual hat matrix. The Mallows model averaged estimate of μ is defined as

$$\hat{\mu}(w) = \sum_{m=1}^{M} w_m P_m y \equiv P(w)\, y. \qquad (2.2)$$

The weight vector is then chosen adaptively using a Mallows-type criterion given by

$$C_n(w) = (y - \hat{\mu}(w))^T (y - \hat{\mu}(w)) + 2\sigma^2\, \mathrm{tr}(P(w)).$$

The estimated weight vector is given by

$$\hat{w} = \underset{w}{\operatorname{argmin}}\ C_n(w),$$

the estimate of μ is $\hat{\mu}(\hat{w})$, and the estimate for β is given by

$$\hat{\beta}_{Mallows} = \sum_{m=1}^{M} \hat{w}_m \hat{\beta}_{m,OLS}.$$

Li (1987) demonstrated the asymptotic optimality of the minimum Mallows C_P chosen model in homoskedastic regression. Part of Hansen's motivation for using Mallows criterion as a basis for defining model averaging weights is that it is asymptotically equivalent to the squared error, and so it must also minimize the squared error in large samples (Hansen, 2007). Hansen gave theory, extended by Wan et al. (2010), showing that the method is asymptotically optimal in the sense that the chosen weight vector asymptotically achieves the best possible squared loss.
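To make the Mallows weight choice concrete, the following sketch (illustrative only, not Hansen's implementation) computes C_n(w) for a set of nested OLS candidate models and minimizes it by a naive grid search over the weight simplex. Since C_n(w) is quadratic in w, it is typically minimized with quadratic programming in practice; here σ^2 is replaced by an estimate from the largest candidate model, and the data and model set are made up for the example.

```python
import numpy as np
from itertools import product

def mallows_weights(X, y, models, sigma2, step=0.05):
    """Choose model-averaging weights by minimizing the Mallows criterion
    C_n(w) = ||y - mu_hat(w)||^2 + 2*sigma2*tr(P(w)) over a grid on the simplex."""
    # Hat matrices P_m, fitted values, and traces for each candidate model
    hats = []
    for idx in models:
        Xm = X[:, idx]
        hats.append(Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T))
    fits = [P @ y for P in hats]
    traces = [np.trace(P) for P in hats]

    best_w, best_c = None, np.inf
    grid = np.arange(0, 1 + 1e-9, step)
    for w in product(grid, repeat=len(models)):
        if abs(sum(w) - 1) > 1e-9:
            continue                      # keep only weight vectors on the simplex
        mu = sum(wm * f for wm, f in zip(w, fits))
        c = np.sum((y - mu) ** 2) + 2 * sigma2 * sum(wm * t for wm, t in zip(w, traces))
        if c < best_c:
            best_w, best_c = w, c
    return np.array(best_w)

rng = np.random.default_rng(3)
n = 60
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, 0.5, 0.25, 0.0]) + rng.normal(size=n)

models = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]      # nested candidate models
sigma2_hat = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2) / (n - 4)
print(np.round(mallows_weights(X, y, models, sigma2_hat), 2))
```

The resulting weight vector can then be plugged into the FMA coefficient average shown earlier to obtain the Mallows model averaged estimate of β.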

Chapter 3  Model Averaged Ridge Regression

3.1 Introduction and Motivation

Consider the regression setup with n independent observations, each consisting of a response variable y_i and a countably infinite number of predictors X_i = (x_i1, x_i2, ...)^T, i = 1, ..., n. The true model for the data can be written as

$$y_i = \mu_i + \epsilon_i = \sum_{j=1}^{\infty} \beta_j x_{ij} + \epsilon_i$$

and in vector form as y = μ + ε, where y = (y_1, ..., y_n)^T, μ = (μ_1, ..., μ_n)^T, ε = (ε_1, ..., ε_n)^T, β_j are the regression parameters, E(ε_i | X_i) = 0, and Var(ε_i | X_i) = σ^2. In practice, we approximate μ using a finite number of predictors p_n, which we can write in matrix form as

$$E[y] = \mu \approx X\beta,$$

where μ_i is approximated by $\sum_{j=1}^{p_n} \beta_j x_{ij}$, X is an n × p_n design matrix with (i, j) element x_ij, and β = (β_1, ..., β_{p_n})^T.

The standard way to fit the linear model above is using ordinary least squares (OLS). The OLS model fitting criterion is given by

$$\min_{\beta} \|y - X\beta\|^2$$

and, in the full rank case, the OLS solution is given by

$$\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y.$$

The OLS solution can often be improved upon in terms of prediction accuracy and interpretation (Burnham and Anderson, 2002). Interpretation of the estimates can be aided by eliminating some of the predictors from the model. Prediction accuracy can be improved by trading increased bias for lower variance. The OLS estimates are unbiased, but when many correlated predictors are included in the model the estimates may have high variance. There are many methods for selecting the structure and fit of the model, such as best subset selection, forward/backward stepwise selection, ridge regression (RR), the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), the elastic net (EN) (Zou and Hastie, 2005), and least angle regression (LARS) (Efron et al., 2002), to name a few. In this paper, our goal will be that of improving the prediction accuracy of our estimates.

Ridge regression is a commonly used method when the objective of the analysis is to increase prediction accuracy in the face of correlated predictors. When there exists high correlation among the predictors, the X^T X matrix may become close to singular and the OLS solution will exhibit high sampling variance.

Ridge regression is a penalized regression technique that places a constraint on the L_2 norm of the regression parameter vector. The ridge regression solution in Lagrangian form is given by

$$\hat{\beta}_{RR} = \underset{\beta}{\operatorname{argmin}}\ \|y - X\beta\|^2 + \lambda \beta^T \beta,$$

where λ > 0 is a tuning parameter. The closed form solution is given by

$$\hat{\beta}_{RR} = (X^T X + \lambda I_{p_n})^{-1} X^T y.$$

A nice property of ridge regression is that $\hat{\beta}_{RR}$ exists even when p_n > n. If the true mean is in the linear space spanned by the observed predictors, i.e. μ = Xβ, ridge regression has been shown to have favorable properties. The mean square error (MSE) of the ridge estimator is

$$MSE(\hat{\beta}_{RR}) = (\text{Bias of } \hat{\beta}_{RR})^2 + Var[\hat{\beta}_{RR}] = \lambda^2 \beta^T (X^T X + \lambda I_{p_n})^{-1} (X^T X + \lambda I_{p_n})^{-1} \beta + \sigma^2 (X^T X + \lambda I_{p_n})^{-1} X^T X (X^T X + \lambda I_{p_n})^{-1}.$$

It is well known that the OLS solution is unbiased and has variance $\sigma^2 (X^T X)^{-1}$. We can see that, compared to the OLS estimate, the ridge regression estimate increases the bias but decreases the variance by adding a term to the diagonal of the X^T X matrix. This is commonly referred to as a bias-variance trade-off. In the seminal ridge regression paper by Hoerl and Kennard (1970), it was shown that there exist λ values such that $\hat{\beta}_{RR}$ obtains a lower total MSE than $\hat{\beta}_{OLS}$. This result was extended to a more general MSE-type criterion by Theobald (1974). Lee and Triveld (1982) also showed a similar result when the errors are correlated. When the true mean is not in the linear space spanned by the predictors, ridge regression may still give good results. Uemukai (2011) gives conditions on the coefficients and predictors under which a single parameter ridge regression estimate from a model with omitted variables will achieve better MSE than the OLS estimate. Through numer-


More information

Model selection for good estimation or prediction over a user-specified covariate distribution

Model selection for good estimation or prediction over a user-specified covariate distribution Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2010 Model selection for good estimation or prediction over a user-specified covariate distribution Adam Lee

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Model Combining in Factorial Data Analysis

Model Combining in Factorial Data Analysis Model Combining in Factorial Data Analysis Lihua Chen Department of Mathematics, The University of Toledo Panayotis Giannakouros Department of Economics, University of Missouri Kansas City Yuhong Yang

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors

Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Consistent Group Identification and Variable Selection in Regression with Correlated Predictors Dhruv B. Sharma, Howard D. Bondell and Hao Helen Zhang Abstract Statistical procedures for variable selection

More information

PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

More information

Lecture 14: Shrinkage

Lecture 14: Shrinkage Lecture 14: Shrinkage Reading: Section 6.2 STATS 202: Data mining and analysis October 27, 2017 1 / 19 Shrinkage methods The idea is to perform a linear regression, while regularizing or shrinking the

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Introduction to Machine Learning and Cross-Validation

Introduction to Machine Learning and Cross-Validation Introduction to Machine Learning and Cross-Validation Jonathan Hersh 1 February 27, 2019 J.Hersh (Chapman ) Intro & CV February 27, 2019 1 / 29 Plan 1 Introduction 2 Preliminary Terminology 3 Bias-Variance

More information

Jackknife Model Averaging for Quantile Regressions

Jackknife Model Averaging for Quantile Regressions Singapore Management University Institutional Knowledge at Singapore Management University Research Collection School Of Economics School of Economics -3 Jackknife Model Averaging for Quantile Regressions

More information

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory

Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Why Do Statisticians Treat Predictors as Fixed? A Conspiracy Theory Andreas Buja joint with the PoSI Group: Richard Berk, Lawrence Brown, Linda Zhao, Kai Zhang Ed George, Mikhail Traskin, Emil Pitkin,

More information

PENALIZING YOUR MODELS

PENALIZING YOUR MODELS PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities

Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Inference Conditional on Model Selection with a Focus on Procedures Characterized by Quadratic Inequalities Joshua R. Loftus Outline 1 Intro and background 2 Framework: quadratic model selection events

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Bootstrap, Jackknife and other resampling methods

Bootstrap, Jackknife and other resampling methods Bootstrap, Jackknife and other resampling methods Part VI: Cross-validation Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD) 453 Modern

More information

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference

Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Cross-Validation with Confidence

Cross-Validation with Confidence Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Outline lecture 2 2(30)

Outline lecture 2 2(30) Outline lecture 2 2(3), Lecture 2 Linear Regression it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic Control

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

STK-IN4300 Statistical Learning Methods in Data Science

STK-IN4300 Statistical Learning Methods in Data Science Outline of the lecture Linear Methods for Regression Linear Regression Models and Least Squares Subset selection STK-IN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information