ISyE 691 Data Mining and Analytics: Regression
Instructor: Prof. Kaibo Liu
Department of Industrial and Systems Engineering, UW-Madison
Email: kliu8@wisc.edu
Office: Room 3017 (Mechanical Engineering Building)
General terms
Supervised learning: we have inputs X and the corresponding outputs Y; regression and classification are typical supervised learning tasks.
Unsupervised learning: only the inputs X are given, with no notion of an output during learning; clustering is a typical unsupervised learning task.
Linear regression: the method of (ordinary) least squares, and penalized least squares.
Simple Linear Regression
We observe the data (y_i, x_i) for i = 1, ..., n, and model the linear relationship
y_i = β_0 + β_1 x_i + ε_i,
assuming the ε_i are independent and identically distributed (i.i.d.) normal with mean 0 and variance σ².
How do we estimate β_0, β_1, and σ²?
Method of Least Squares
The method of least squares: find b_0 and b_1 so as to minimize the residual sum of squares
RSS(b_0, b_1) = Σ_{i=1}^n (y_i - b_0 - b_1 x_i)².
The solution is
b_1 = Σ_i (x_i - x̄)(y_i - ȳ) / Σ_i (x_i - x̄)²,   b_0 = ȳ - b_1 x̄.
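As a quick illustration, the closed-form solution above translates directly into NumPy. This is a minimal sketch on synthetic data; the simulated coefficients and sample size are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = 2 + 3x + noise
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, n)

# Closed-form least-squares estimates
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```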
Fitted Values & Residuals
The regression line is ŷ = β̂_0 + β̂_1 x, and the values ŷ_i = β̂_0 + β̂_1 x_i are called fitted or predicted values.
The residuals are the deviations from the estimated line: e_i = y_i - ŷ_i, for i = 1, ..., n.
Properties of Regression Estimators
Since β̂_1 = Σ_i (x_i - x̄) y_i / Σ_i (x_i - x̄)², we have E[β̂_1] = β_1 and Var(β̂_1) = σ² / Σ_i (x_i - x̄)².
Thus β̂_1 is unbiased. Similarly, E[β̂_0] = β_0, and thus β̂_0 is also unbiased, with Var(β̂_0) = σ² (1/n + x̄² / Σ_i (x_i - x̄)²).
How to estimate σ²?
Note that the ε_i are i.i.d. with mean 0 and variance σ², so it would be reasonable to estimate σ² by (1/n) Σ_i ε_i².
Since β_0 and β_1 are unknown (so the ε_i are not observable), we instead estimate σ² by
s² = (1/(n-2)) Σ_i (y_i - ŷ_i)².
Summary
Let S_xx = Σ_i (x_i - x̄)². If the ε_i are i.i.d. N(0, σ²), then
β̂_1 ~ N(β_1, σ²/S_xx),   β̂_0 ~ N(β_0, σ²(1/n + x̄²/S_xx)),   and (n-2)s²/σ² ~ χ²_{n-2}.
Statistical inference
A 100(1-α)% confidence interval for β_1 is: β̂_1 ± t_{α/2, n-2} · s / √S_xx.
A 100(1-α)% confidence interval for the estimated regression at a new point x = x* is:
ŷ* ± t_{α/2, n-2} · s · √(1/n + (x* - x̄)²/S_xx), where ŷ* = β̂_0 + β̂_1 x*.
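Continuing the earlier sketch, the variance estimate and the two confidence intervals can be computed as follows. The t quantile comes from scipy.stats; the function name and the new point x_star are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import t

def simple_ols_inference(x, y, x_star, alpha=0.05):
    """Fit y = b0 + b1*x by least squares and return 100(1-alpha)% CIs
    for beta_1 and for the mean response at x_star."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
    b0 = y_bar - b1 * x_bar

    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)

    ci_b1 = (b1 - t_crit * np.sqrt(s2 / Sxx),
             b1 + t_crit * np.sqrt(s2 / Sxx))
    se_mean = np.sqrt(s2 * (1 / n + (x_star - x_bar) ** 2 / Sxx))
    y_hat_star = b0 + b1 * x_star
    ci_mean = (y_hat_star - t_crit * se_mean, y_hat_star + t_crit * se_mean)
    return ci_b1, ci_mean
```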
Multiple Linear Regression
We observe data (Y_i, X_i1, ..., X_ip) for i = 1, ..., n, and assume the linear model
Y_i = f(X_i) + ε_i = β_0 + β_1 X_i1 + ... + β_p X_ip + ε_i, with ε_i ~ N(0, σ²).
How do we estimate the β_j? The (ordinary) least squares estimator is
β̂_ols = (XᵀX)⁻¹ XᵀY,
which has good properties (Gauss-Markov, maximum likelihood).
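In code, the OLS estimator can be obtained by solving the normal equations; a minimal sketch is below. The textbook formula is shown alongside np.linalg.lstsq, which is the numerically preferred route; the function name is illustrative.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients for y = X beta + eps.
    X is assumed to already contain a column of ones for the intercept."""
    # Textbook formula: beta = (X'X)^{-1} X'y (fine for small, well-conditioned X)
    beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
    # Numerically more stable alternative
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_normal_eq, beta_lstsq
```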
Derivation: Multiple Linear Regression
Unbiased: E[β̂_ols] = (XᵀX)⁻¹ Xᵀ E[Y] = (XᵀX)⁻¹ XᵀX β = β.
Variance: Var(β̂_ols) = (XᵀX)⁻¹ Xᵀ Var(Y) X (XᵀX)⁻¹ = σ² (XᵀX)⁻¹.
Statistical inference
Estimation of σ²: σ̂² = (1/(n-p-1)) Σ_i (Y_i - Ŷ_i)².
t-test of β_j: z_j = β̂_j / (σ̂ √v_j), where v_j is the j-th diagonal element of (XᵀX)⁻¹; under H0: β_j = 0, z_j follows a t distribution with n-p-1 degrees of freedom.
F-test (restricted model with p_0 + 1 parameters vs. full model with p_1 + 1 parameters):
F = [(RSS_0 - RSS_1)/(p_1 - p_0)] / [RSS_1/(n - p_1 - 1)].
Insight of two models
Consider the restricted model Ŷ_restricted = X_1 β̂_1,restricted and the full model Ŷ_full = X_1 β̂_1,full + X_2 β̂_2,full.
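One natural way to compare the two fits is the F-test from the previous slide. The sketch below computes the F statistic and its p-value for a restricted model (columns X1, assumed to include the intercept) nested inside a full model (columns X1 and X2); the function name and column split are illustrative.

```python
import numpy as np
from scipy.stats import f

def nested_f_test(X1, X2, y):
    """F-test of H0: the extra predictors X2 add nothing beyond X1."""
    n = len(y)
    X_full = np.hstack([X1, X2])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r

    rss_restricted, rss_full = rss(X1), rss(X_full)
    q = X2.shape[1]                    # number of extra coefficients tested
    df_full = n - X_full.shape[1]      # residual degrees of freedom, full model
    F = ((rss_restricted - rss_full) / q) / (rss_full / df_full)
    p_value = f.sf(F, q, df_full)
    return F, p_value
```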
What does "better" mean?
Two questions to ask when deciding whether an estimator β̂ is good:
Model identification: is β̂ close to the true β, e.g., is the mean squared error (MSE) small? MSE(β̂) = E[(β̂ - β)²].
Prediction: will the model predict future observations well?
Multiple Linear Regression
Gauss-Markov Theorem: If E[Y] = Xβ and Var(Y) = σ²I, then the best linear unbiased estimator (BLUE) of aᵀβ is aᵀβ̂_ols.
BLUE
Prediction: Overfitting vs. Underfitting
Overfitting: when the statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
Underfitting: when the model is too simple, so both the training and test errors are large.
Variable Selection
Subset selection: Suppose that we have p predictors with coefficients β_0, β_1, ..., β_{p-1}. The subset selection problem is to find, for each k ∈ {0, 1, 2, ..., p}, the subset of k out of p predictors that minimizes the RSS.
Can we search all 2^p candidate models? (e.g., p = 20 gives 2^p = 1,048,576.)
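For small p, the exhaustive search can be written down directly. The sketch below enumerates all subsets with itertools and keeps the best RSS for each size k; it is feasible only for small p, which is exactly the slide's point. The function name and intercept handling are illustrative.

```python
import itertools
import numpy as np

def best_subset(X, y):
    """Exhaustive best-subset selection: for each size k, return the
    subset of columns of X minimizing the residual sum of squares."""
    n, p = X.shape
    best = {}
    for k in range(p + 1):
        best_rss, best_cols = np.inf, ()
        for cols in itertools.combinations(range(p), k):
            # Always include an intercept column of ones
            Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            r = y - Xk @ beta
            rss = r @ r
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (best_cols, best_rss)
    return best
```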
Greedy procedures & sequential search methods
The previous figure shows that adding more than two predictor variables does not gain much. So why don't we select the smallest subset of variables such that the RSS is smaller than a given threshold?
Forward selection: start with β_0 only, and subsequently add to the model the predictor that most improves the model.
Question: where does the F-test come from?
Greedy procedures & sequential search methods
Backward selection: start with the full model (with all p predictors), and sequentially drop one predictor at a time, namely the one whose F-ratio is smallest.
Stepwise selection: a combination of the above, testing at each step for variables to be included or excluded.
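A minimal forward-selection sketch is shown below. For brevity it adds, at each step, the predictor that most reduces the RSS rather than using the F-ratio criterion named above; the function name and stopping rule (a maximum subset size) are illustrative.

```python
import numpy as np

def forward_selection(X, y, k_max):
    """Greedy forward selection: add one predictor at a time, choosing
    the one that most reduces the residual sum of squares."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def rss(cols):
        Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        r = y - Xk @ beta
        return r @ r

    path = [(tuple(selected), rss(selected))]
    for _ in range(min(k_max, p)):
        j_best = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(j_best)
        remaining.remove(j_best)
        path.append((tuple(selected), rss(selected)))
    return path
```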
Classical Subset Selection
Classical subset selection seeks a subset of variables that minimizes some criterion:
Mallows' C_p
Cross-validation (Ch. 7.10) and GCV (Eq. 7.46 on p. 239)
Akaike's information criterion (AIC; Akaike, 1973)
Bayesian information criterion (BIC; Schwarz, 1978)
Information Criterion
Typically, an information criterion is of the form n·log(RSS/n) + λ(n)·k, where k is the number of estimated parameters.
AIC: λ(n) = 2 (better for prediction).
BIC: λ(n) = log n (better for model identification).
Others (with λ(n) = o(n)): the corrected AIC of Hurvich & Tsai (1989), with per-parameter penalty 2n/(n - k - 1); HQ (Hannan & Quinn, 1979), with λ(n) = 2c·log log n for some c > 1.
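Under the Gaussian model, both criteria can be computed from the RSS of a fitted model. The sketch below uses the n·log(RSS/n) form and drops additive constants that are the same for every candidate model, which is a common convention but an assumption here; the function name is illustrative.

```python
import numpy as np

def information_criteria(rss, n, k):
    """AIC and BIC for a Gaussian linear model with k estimated coefficients,
    up to additive constants shared by all candidate models."""
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + np.log(n) * k
    return aic, bic
```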
Lasso regression
Because the ℓ0-norm problem is combinatorial and hard to solve, people have tried to approximate it with convex objective functions that are easier to solve. This tactic is called "convex relaxation." For example, it is natural to relax the ℓ0-norm constraint to an ℓ1-norm constraint:
min_β Σ_i (Y_i - β_0 - Σ_j X_ij β_j)², subject to Σ_j |β_j| ≤ t.
Lasso regression uses this ℓ1-norm convex relaxation to approximate the original subset selection problem.
Usually we center the data before we do the regression.
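In practice the lasso is solved with standard software; a minimal sketch with scikit-learn follows. Note that sklearn's Lasso minimizes ‖y - Xβ‖² / (2n) + α‖β‖₁, so its α plays the role of the tuning parameter up to scaling; the synthetic data and the value α = 0.1 are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: only the first 3 of 10 predictors matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=100)

# Center/scale the predictors before penalized regression (as the slide notes)
Xs = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(Xs, y)   # alpha is the tuning parameter
print(np.round(lasso.coef_, 3))       # many coefficients are exactly zero
```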
How well do greedy algorithms or convex approximations work?
For the greedy algorithm: Tropp, 2004, "Greedy is good: algorithmic results for sparse approximation," IEEE Transactions on Information Theory, 50(10): 2231-2242. Basically, when the size of the optimal subset is small enough, the greedy algorithm can reach the true optimum. Unfortunately, the required bound on "small enough" is typically too restrictive.
For the convex approximation: Tropp, 2006, "Just relax: convex programming methods for subset selection and sparse approximation," IEEE Transactions on Information Theory, 52(3): 1030-1051. The conclusion is similar to that for the greedy algorithms, but the bound is a little more relaxed.
Lasso regression
Define a standardized quantity, the shrinkage factor: s = t / t_0, where t_0 = Σ_j |β̂_j^ols|.
When s = 1, all the coefficients from the lasso are the same as those from the LSE; when s = 0, all the coefficients from the lasso are zero.
For s in between, it indicates how much the coefficients are shrunk on average.
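The shrinkage factor can be traced numerically by fitting the lasso over a grid of tuning parameters and comparing the ℓ1 norm of each fit to that of the OLS fit. A rough sketch is below; the grid of alphas is arbitrary, the function name is illustrative, and X is assumed to be standardized.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def shrinkage_factors(X, y, alphas):
    """For each tuning parameter, report s = ||beta_lasso||_1 / ||beta_ols||_1."""
    beta_ols = LinearRegression().fit(X, y).coef_
    t0 = np.sum(np.abs(beta_ols))
    s = []
    for a in alphas:
        beta = Lasso(alpha=a).fit(X, y).coef_
        s.append(np.sum(np.abs(beta)) / t0)
    return np.array(s)   # near 1 for tiny alpha, near 0 for large alpha
```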
Subset selection vs. coefficient shrinkage
As expected, in the lasso-result figure, the coefficient values are shrunk as s goes from 1 to 0. For this reason, people also use "coefficient shrinkage" to refer to methods for subset selection; oftentimes, coefficient shrinkage is used interchangeably with subset selection.
However, fundamentally, subset selection is a combinatorial problem, while coefficient shrinkage does not have to be. Actually, coefficient shrinkage, as it is usually formulated, is a continuous optimization problem. So we can understand coefficient shrinkage as an approximation to the subset selection problem.
Shrinkage Method Generalization
Assume we observe (Y_i, X_i1, ..., X_ip), and all variables are standardized. A shrinkage method solves the optimization problem
min_β (Y - Xβ)ᵀ(Y - Xβ) + λ Σ_{j=1}^p J(|β_j|),
where J(|β_j|) is the penalty function and λ ≥ 0 is the decay or tuning parameter.
Shrinkage Methods
Shrinkage methods (also called penalized or regularization methods):
Are based on adding a penalty to the risk (rather than to the log-likelihood), where the penalty is a function of a decay parameter.
Can filter out unimportant variables from the candidate variables, and estimate the important ones consistently with high efficiency.
Can make an ill-posed problem well-posed (e.g., when the matrix X is not of full rank).
Once the decay parameter is estimated, the variable (or model) selection is done!
Shrinkage Method Generalization
In general, a shrinkage method solves the optimization problem
min_β (Y - Xβ)ᵀ(Y - Xβ) + λ Σ_{j=1}^p J(|β_j|).
An alternative formulation is to solve
min_β (Y - Xβ)ᵀ(Y - Xβ), subject to Σ_{j=1}^p J(|β_j|) ≤ t.
How to choose the penalty function?
Ridge regression
The ridge regression estimator is the ℓ2-penalty formulation under Lagrangian relaxation:
β̂_ridge = argmin_β (Y - Xβ)ᵀ(Y - Xβ) + λ Σ_{j=1}^p β_j².
The explicit expression is
β̂_ridge = (XᵀX + λI)⁻¹ XᵀY.
It is most useful when X is nonsingular but has high collinearity (i.e., is close to singular).
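The closed form translates directly into code; a minimal sketch is below. The penalty is not applied to an intercept here because X and y are assumed to be centered (and standardized), as noted earlier; the function name is illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator beta = (X'X + lam*I)^{-1} X'y.
    Assumes X and y are centered so that no intercept term is penalized."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```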
Remarks on Ridge regression
β̂_ridge is still a linear function of y, the same as the LSE.
β̂_ridge is biased, with the bias decreasing to 0 as λ goes to 0.
As λ increases, the β̂_ridge coefficients get closer to 0 (though they rarely equal 0). One may force the coefficients to shrink to zero by choosing a cutoff below which coefficients are set to 0. This may help smooth out noisy data and produce robust solutions.
Overall, despite the bias, the variance is usually smaller than that of OLS, and thus the MSE is smaller and the prediction better.
(Introduced in statistics by Hoerl & Kennard in 1970, who were trying to address a singularity problem.)
Another way to see regularization
If we use the singular value decomposition (SVD), X = U D Vᵀ, where U and V are orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space. D = diag(d_1, ..., d_p), and d_1, ..., d_p are called the singular values of X.
Using the singular values of X, people define the effective degrees of freedom:
df(λ) = Σ_{j=1}^p d_j² / (d_j² + λ).
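The effective degrees of freedom is easy to compute from the singular values; a minimal sketch (the function name is illustrative).

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective degrees of freedom of ridge regression:
    df(lam) = sum_j d_j^2 / (d_j^2 + lam), where d_j are singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))
```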
Ridge regression example
The effective degrees of freedom indicates how many effective coefficients result from a ridge regression.
Comparisons
Performance-wise, in terms of selecting the best linear predictive model, ridge and lasso perform quite similarly. The lasso outperforms ridge when there are a moderate number of sizable effects, rather than many small effects. It also produces more interpretable models.
The LASSO does poorly: (i) when the true model is not sparse; (ii) when some variables are highly correlated (the LASSO picks one of them at random, and ridge regression tends to perform better); (iii) when there are outliers in the responses, to which the LASSO may not be robust.
Ridge regression used to be more popular because it has a closed-form analytical solution, but with today's computing power, the lasso is also easy to use.
For each fixed λ, the computation of the LASSO solution can be formulated as a quadratic program (Tibshirani, 1996). Least Angle Regression (LARS) algorithm by Efron et al. (2004): the number of linear pieces in the LASSO path is approximately p, and the complexity of getting the whole LASSO path is O(np²), the same as the cost of computing a single OLS fit.