SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.


1 SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016

2 Outline: Quick Recap of Least Squares; Ridge Regression and Regularization; Lasso (Least Absolute Shrinkage and Selection Operator); Model Selection


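4 Quick Recap of Least Squares
The least squares objective is $L(w) = \frac{1}{2}(y - Xw)^\top (y - Xw)$.
The solution $\hat{w} = \arg\min_w L(w)$ satisfies the normal equations $(X^\top X)\hat{w} = X^\top y$, so $\hat{w} = (X^\top X)^{-1} X^\top y$.
The solution $\hat{w}$ is analytic, and it is unique when $X^\top X$ is invertible, since $L(w)$ is then strictly convex.
It corresponds to the maximum likelihood estimate of the model $Y = w^\top X + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.
One needs to calculate the inverse of $X^\top X$.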
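The normal equations on this slide can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the sizes, the seed, and the weights in `w_true` are illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))            # design matrix, one row per sample
w_true = np.array([1.0, -2.0, 0.5])    # illustrative weights (assumption)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve (X^T X) w = X^T y directly instead of forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares routine.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# At the minimizer the gradient X^T (X w - y) vanishes.
gradient = X.T @ (X @ w_hat - y)
```

Solving the linear system is usually preferred over computing the inverse, since it is cheaper and numerically more stable.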

5 Least Squares with Cholesky Decomposition
Using the Cholesky decomposition $X^\top X = R^\top R$, where $R$ is an upper triangular matrix, we can find $\hat{w}$ using forward-backward substitution:
$$(X^\top X)w = (R^\top R)w = X^\top y$$
1. Forward substitution: find a vector $a$ such that $R^\top a = X^\top y$.
2. Backward substitution: find a vector $\hat{w}$ such that $R\hat{w} = a$.
Since $R$ is upper triangular (and $R^\top$ lower triangular), both steps can be solved efficiently.
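The two substitution steps can be sketched as follows. This is an illustrative implementation, assuming NumPy; the helper names are hypothetical. Note that `np.linalg.cholesky` returns the lower-triangular factor, which plays the role of $R^\top$ here.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L a = b for a lower-triangular matrix L."""
    a = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        a[i] = (b[i] - L[i, :i] @ a[:i]) / L[i, i]
    return a

def backward_substitution(R, a):
    """Solve R w = a for an upper-triangular matrix R."""
    w = np.zeros_like(a, dtype=float)
    for i in reversed(range(len(a))):
        w[i] = (a[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

def least_squares_cholesky(X, y):
    # np.linalg.cholesky gives lower-triangular L with X^T X = L L^T,
    # so R = L^T is the upper-triangular factor.
    L = np.linalg.cholesky(X.T @ X)
    a = forward_substitution(L, X.T @ y)    # step 1: R^T a = X^T y
    return backward_substitution(L.T, a)    # step 2: R w = a

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = rng.normal(size=40)
w_chol = least_squares_cholesky(X, y)
```

Each substitution costs on the order of $d^2$ operations, which is why the triangular structure makes both steps cheap.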

6 Least Squares with the Gradient Descent Method
Problem: when $X$ is huge, it is expensive to compute $(X^\top X)^{-1}$.
Solution: treat least squares as an optimization problem:
$$\hat{w} = \arg\min_w \frac{1}{2}(y - Xw)^\top (y - Xw) = \arg\min_w L(w)$$
Gradient Descent
1. Initialize $w_0$.
2. Update $w_{t+1} = w_t - \alpha \left.\frac{\partial L(w)}{\partial w}\right|_{w = w_t} = w_t - \alpha X^\top (X w_t - y)$.
3. Repeat step 2 until convergence.
The algorithm depends on the learning rate $\alpha$.
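The three steps above can be sketched as follows, assuming NumPy and synthetic data. The stopping tolerance, the iteration cap, and the step-size rule are assumptions made for this illustration; the slide only states that the result depends on $\alpha$.

```python
import numpy as np

def least_squares_gd(X, y, alpha, n_iters=5000, tol=1e-10):
    w = np.zeros(X.shape[1])                   # step 1: initialize w_0
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)               # gradient of L at w_t
        w_next = w - alpha * grad              # step 2: update
        if np.linalg.norm(w_next - w) < tol:   # step 3: stop at convergence
            return w_next
        w = w_next
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# Assumed step size: 1 / (largest eigenvalue of X^T X). For this quadratic
# objective, any alpha below 2 / lambda_max converges.
alpha = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
w_gd = least_squares_gd(X, y, alpha)
```

A step size that is too large makes the iterates diverge, while one that is too small makes convergence slow, which is the dependence on $\alpha$ noted on the slide.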

7 Least Squares Is an Ill-posed Problem
When $X^\top X$ is not invertible, solving least squares can be an ill-posed problem.
The inverse of $X^\top X$ may not exist ($\operatorname{rank} X^\top X < d$):
small d, large n: if $n > d$, then typically $\operatorname{rank} X^\top X = d$ (invertible).
large d, small n: if $n < d$, then $\operatorname{rank} X^\top X < d$ (not invertible!).
collinearity: the features are not linearly independent.
Problem: this leads to large values of $w$ when the magnitude of the data is large (overfitting).
Solution: penalize large values of $w$ (complexity control leads to better generalization).
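The three rank cases can be illustrated numerically. This is a small sketch, assuming NumPy and random Gaussian features; it uses the identity $\operatorname{rank} X^\top X = \operatorname{rank} X$ and computes the rank of $X$ directly, which is numerically more robust.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5

X_tall = rng.normal(size=(20, d))   # small d, large n
X_wide = rng.normal(size=(3, d))    # large d, small n

# rank(X^T X) equals rank(X), so the rank of X is enough.
rank_tall = np.linalg.matrix_rank(X_tall)
rank_wide = np.linalg.matrix_rank(X_wide)

# Collinearity: make the last feature an exact multiple of the first.
X_col = X_tall.copy()
X_col[:, 4] = 2.0 * X_col[:, 0]
rank_col = np.linalg.matrix_rank(X_col)
```

Only the first case gives rank $d$; in the other two, $X^\top X$ is singular and the normal equations have no unique solution.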


9 Ridge Regression (Hoerl and Kennard, 1970)
The ridge regression objective is
$$L(w) = \frac{1}{2}(y - Xw)^\top (y - Xw) + \frac{\lambda}{2} w^\top w$$
where $\lambda > 0$ is a regularization parameter.
$w^\top w = \|w\|_2^2 = \sum_{i=1}^d w_i^2$: penalizes large values of $w_i$.
The regularization parameter $\lambda$ controls model complexity.
$\lambda \to 0$: we obtain the least squares solution.
$\lambda \to \infty$: the vector $w$ approaches the zero vector.
$L(w)$ is a strictly convex function (there exists a unique solution).
Exercise: find the solution $\hat{w}$ of ridge regression.

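10 Ridge Regression (Hoerl and Kennard, 1970)
Finding the derivative:
$$\nabla_w L(w) = -X^\top y + X^\top X w + \lambda w = -X^\top y + (X^\top X + \lambda I)w$$
Setting $\nabla_w L(w)$ to zero yields the normal equation
$$(X^\top X + \lambda I)\hat{w} = X^\top y \quad\Longrightarrow\quad \hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$$
Exercise: show that, for any $\lambda > 0$, $(X^\top X + \lambda I)$ is always invertible (hint: a full-rank square matrix is invertible).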
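The closed-form ridge solution can be sketched as follows, assuming NumPy. The data are deliberately chosen with $n < d$, so that plain least squares would be ill-posed; the function name `ridge_fit` and the value $\lambda = 0.1$ are illustrative. The eigenvalue computed at the end is a numerical illustration only, not a substitute for the proof asked for in the exercise.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 5))    # n < d: X^T X alone is singular
y = rng.normal(size=3)
lam = 0.1

w_ridge = ridge_fit(X, y, lam)

# Gradient of the ridge objective at the solution; it should vanish.
gradient = -X.T @ y + (X.T @ X + lam * np.eye(5)) @ w_ridge

# Smallest eigenvalue of the regularized matrix.
min_eig = np.linalg.eigvalsh(X.T @ X + lam * np.eye(5)).min()
```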

11 Regularization
$$L(w) = \underbrace{\frac{1}{2}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{empirical risk}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{regularization}}$$
It is known as Tikhonov regularization.
The role of regularization:
control the complexity of the solution $\hat{w}$;
allow for the incorporation of prior knowledge about $\hat{w}$.
The regularization parameter $\lambda$ controls the importance of the regularization term.

12 Bias-Variance Tradeoff

13 Occam's Razor: William of Ockham (c )
Occam: entities should not be multiplied unnecessarily (keep it simple).
Aristotle: Nature operates in the shortest way possible.
Einstein: Everything should be made as simple as possible, but not simpler.
Newton's law of motion vs. Kepler's laws of planetary motion
Reading :

14 Occam's Razor


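16 Maximum a Posteriori (MAP) Learning
Model: $Y = w^\top X + \epsilon$, $\epsilon \sim N(0, \sigma^2)$
Prior: $w \sim N(0, \sigma_w^2 I)$, with density $p(w)$
We know that $Y \sim N(w^\top X, \sigma^2)$. Up to a constant $C$ that does not depend on $w$ (its value may change from line to line):
$$
\begin{aligned}
\log P(w \mid D) &= \log P(D \mid w)\,p(w) + C \\
&= \sum_{i=1}^n \log P((x_i, y_i) \mid w) + \log p(w) + C \\
&= \sum_{i=1}^n \log\left(\frac{e^{-(y_i - w^\top x_i)^2 / 2\sigma^2}}{\sqrt{2\pi}\,\sigma}\right) + \log\left(\frac{e^{-\|w\|_2^2 / 2\sigma_w^2}}{(\sqrt{2\pi}\,\sigma_w)^d}\right) + C \\
&= -\sum_{i=1}^n \frac{(y_i - w^\top x_i)^2}{2\sigma^2} - \frac{\|w\|_2^2}{2\sigma_w^2} + C
\end{aligned}
$$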

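17 Maximum a Posteriori (MAP) Learning
Taking the derivative and setting it to zero:
$$
\begin{aligned}
\nabla_w \log P(w \mid D) &= 0 \\
\frac{1}{\sigma^2}\left(X^\top y - X^\top X \hat{w}\right) - \frac{1}{\sigma_w^2}\hat{w} &= 0 \\
\left(X^\top X + \frac{\sigma^2}{\sigma_w^2} I\right)\hat{w} &= X^\top y \\
\hat{w} &= \left(X^\top X + \frac{\sigma^2}{\sigma_w^2} I\right)^{-1} X^\top y
\end{aligned}
$$
Setting $\lambda = \sigma^2 / \sigma_w^2$ yields the same solution as ridge regression:
$$\hat{w} = \left(X^\top X + \lambda(\sigma, \sigma_w) I\right)^{-1} X^\top y$$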
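The equivalence between the MAP estimate and ridge regression can be checked numerically. The sketch below assumes NumPy and synthetic data; the values of `sigma` and `sigma_w` are arbitrary illustrative choices. It evaluates the gradient of the log posterior at the ridge solution with $\lambda = \sigma^2 / \sigma_w^2$.

```python
import numpy as np

def log_posterior_grad(w, X, y, sigma, sigma_w):
    """Gradient of -sum_i (y_i - w^T x_i)^2 / (2 sigma^2) - ||w||^2 / (2 sigma_w^2)."""
    return (X.T @ y - X.T @ X @ w) / sigma**2 - w / sigma_w**2

rng = np.random.default_rng(5)
sigma, sigma_w = 0.5, 2.0      # assumed noise and prior scales
X = rng.normal(size=(30, 4))
y = X @ rng.normal(scale=sigma_w, size=4) + rng.normal(scale=sigma, size=30)

# Ridge solution with lambda = sigma^2 / sigma_w^2.
lam = sigma**2 / sigma_w**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# The log posterior is stationary at the ridge solution.
grad_at_map = log_posterior_grad(w_map, X, y, sigma, sigma_w)
```

A tighter prior (smaller `sigma_w`) gives a larger $\lambda$ and therefore stronger shrinkage toward zero.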

18-23 Least Squares vs. Ridge Regression
Data: $y = wx + b + \varepsilon$, $\varepsilon \sim N(0, 0.64)$, $w = 1.2$, $b = $ …
[Figures: Y against X, showing the data with the least squares and ridge regression fits, then the same fits together with the true model. Reported errors: $E_{ls} = $ …, $E_{ridge} = $ …]


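25 Lasso: Least Absolute Shrinkage and Selection Operator
$$L(w) = \frac{1}{2}(y - Xw)^\top (y - Xw) + \lambda \|w\|_1$$
$\|w\|_1 = \sum_{i=1}^d |w_i|$: prefers a sparse vector $\hat{w}$.
Lasso is suitable for high-dimensional problems ($d \gg n$).
[Figure with panels labeled $\ell_1$ and $\ell_2$.]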
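To illustrate the sparsity that the $\ell_1$ penalty induces, here is a minimal sketch that minimizes the lasso objective by cyclic coordinate descent with soft-thresholding. This is a substitute technique chosen for brevity, not the algorithm used in the lecture; it assumes NumPy, and the function names, data, and value of `lam` are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_objective(X, y, w, lam):
    return 0.5 * np.sum((y - X @ w) ** 2) + lam * np.abs(w).sum()

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    d = X.shape[1]
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                 # x_j^T x_j for each feature
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]      # residual without feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]                          # only two active features
y = X @ w_true + 0.1 * rng.normal(size=50)

w_lasso = lasso_coordinate_descent(X, y, lam=5.0)

# Any lam above max_j |x_j^T y| drives every coefficient to exactly zero.
lam_big = 1.1 * np.abs(X.T @ y).max()
w_zero = lasso_coordinate_descent(X, y, lam=lam_big)
```

Unlike the ridge penalty, the soft-thresholding step can set coefficients to exactly zero, which is what makes the lasso act as a selection operator.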

26 Least Angle Regression (Efron et al. 2004)
Lasso has no closed-form solution. The lars package in R implements the lasso.
LARS Algorithm
1. Initialize all $w_1, w_2, \ldots, w_d$ to zero.
2. Find the predictor $x_j$ most correlated with $y$.
3. Increase $w_j$ in the direction of the sign of its correlation with $y$, taking residuals $r = y - \hat{y}$ along the way. Stop when some other predictor $x_k$ has as much correlation with $r$ as $x_j$ has.
4. Increase $(w_j, w_k)$ in their joint least squares direction, until some other predictor $x_m$ has as much correlation with the residual $r$.
5. Continue until all predictors are in the model.
Similar algorithm: Forward Stagewise Algorithm
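The forward stagewise algorithm mentioned on this slide is simpler to write down than LARS itself, so the sketch below implements that variant rather than LARS. It assumes NumPy, unit-norm feature columns, and a small fixed step `eps`; these choices, the stopping rule, and the function name are illustrative assumptions.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, max_steps=5000):
    w = np.zeros(X.shape[1])           # start with all coefficients at zero
    r = y.copy()                       # residual r = y - X w
    for _ in range(max_steps):
        corr = X.T @ r                 # correlation of each predictor with r
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < eps:         # no predictor is noticeably correlated
            break
        step = eps * np.sign(corr[j])  # tiny move in the sign of the correlation
        w[j] += step
        r -= step * X[:, j]
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
X /= np.linalg.norm(X, axis=0)         # unit-norm columns (assumption)
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

w_fs = forward_stagewise(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Run to completion, the procedure ends near the least squares fit; stopping it early leaves most coefficients at zero, which is how it relates to the lasso path.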

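27 Bayesian Interpretation of Lasso
Model: $Y = w^\top X + \epsilon$, $\epsilon \sim N(0, \sigma^2)$
Prior: $w \sim \mathrm{Laplace}(0, t)$, $p(w) \propto \exp(-\|w\|_1 / t)$
Consider the MAP solution of Lasso:
$$
\begin{aligned}
\hat{w}_{MAP} &= \arg\max_w \log P(w \mid D) = \arg\max_w \log P(D \mid w)\,p(w) \\
&= \arg\min_w \left\{ \sum_{i=1}^n \frac{(y_i - w^\top x_i)^2}{2\sigma^2} + \frac{\|w\|_1}{t} \right\} \\
&= \arg\min_w \left\{ \frac{1}{2}(y - Xw)^\top (y - Xw) + \lambda(\sigma^2, t)\,\|w\|_1 \right\}
\end{aligned}
$$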
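The dependence of $\lambda$ on $\sigma^2$ and $t$ can be made explicit. Multiplying the MAP objective by the positive constant $\sigma^2$ does not change its minimizer, which gives the following worked step (a short derivation from the model and prior above, not taken from the slide):

```latex
\sigma^2 \left( \sum_{i=1}^n \frac{(y_i - w^\top x_i)^2}{2\sigma^2} + \frac{\|w\|_1}{t} \right)
  = \frac{1}{2}(y - Xw)^\top (y - Xw) + \frac{\sigma^2}{t}\,\|w\|_1,
\qquad \text{so} \qquad \lambda(\sigma^2, t) = \frac{\sigma^2}{t}.
```

A narrower Laplace prior (smaller $t$) therefore corresponds to a larger $\lambda$ and a sparser solution.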


29 Model Selection
Plot the coefficients $w$ for different values of $\lambda$.
[Figure: coefficients against $\lambda$ on the Wine Quality dataset, one curve per feature: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.]
$\lambda \to 0$: the solution $\hat{w}$ converges to the least squares solution.
$\lambda \to \infty$: the solution $\hat{w}$ converges to zero.
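A coefficient path of this kind can be computed as follows. This is a sketch assuming NumPy and ridge regression on synthetic data, since the Wine Quality data are not part of this text; the grid of $\lambda$ values and the function name `ridge_path` are illustrative.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Return one row of ridge coefficients per value of lambda."""
    d = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    return np.array([np.linalg.solve(XtX + lam * np.eye(d), Xty)
                     for lam in lambdas])

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, -0.5, 0.0, 2.0]) + 0.3 * rng.normal(size=60)

lambdas = np.logspace(-4, 6, 11)       # from nearly unregularized to very strong
path = ridge_path(X, y, lambdas)       # shape (11, 4)
norms = np.linalg.norm(path, axis=1)   # coefficient norm at each lambda
```

Plotting each column of `path` against `lambdas` on a logarithmic axis reproduces the qualitative picture: the curves start at the least squares coefficients and shrink toward zero.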

30 Model Selection
Most ML models depend on some unknown parameters, e.g., $\lambda$. How do we choose the best parameter values?

31 Model Selection

32 K-Fold Cross Validation

33 K-Fold Cross Validation
1. Partition $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ into $K$ separate sets of equal size: $D = \{D_1, D_2, \ldots, D_K\}$ with $|D_k| \approx n/K$.
2. For each $k = 1, 2, \ldots, K$, fit the model $f_\lambda^{(-k)}$ on $D^{(-k)} = \{D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_K\}$ and compute the cross-validation error
$$CV_\lambda(k) = \frac{1}{|D_k|} \sum_{(x, y) \in D_k} \left(y - f_\lambda^{(-k)}(x)\right)^2$$
3. Compute the overall cross-validation error: $CV_\lambda = \frac{1}{K} \sum_{k=1}^K CV_\lambda(k)$.
4. Pick the $\lambda$ with the smallest cross-validation error.
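The four steps above can be sketched as follows for ridge regression, assuming NumPy and synthetic data. The candidate values of $\lambda$, the shuffling seed, and the helper names are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)                    # step 1: partition D
    errors = []
    for k in range(K):                                # step 2: loop over folds
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], lam)        # fit on D^(-k)
        errors.append(np.mean((y[test] - X[test] @ w) ** 2))
    return float(np.mean(errors))                     # step 3: average over folds

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.5 * rng.normal(size=40)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = [kfold_cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]         # step 4: pick the best
```

Using the same seed for every $\lambda$ keeps the folds identical across candidates, so the errors are compared on the same splits.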

34 Other Variants
Least squares: weighted least squares (generalized least squares), iteratively reweighted least squares, robust regression models, recursive least squares.
Lasso: group lasso, elastic net, fused lasso.

35 Exercise
1. Find the bias and variance terms of the ridge regression estimate $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$.
2. In model selection, we need to evaluate $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$ for different values of $\lambda$. How can this be done efficiently?
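For the second exercise, one common approach (not necessarily the answer intended in the lecture) is to compute the singular value decomposition $X = U \operatorname{diag}(s) V^\top$ once and reuse it for every $\lambda$, since $\hat{w} = V \operatorname{diag}\!\left(\frac{s}{s^2 + \lambda}\right) U^\top y$. A minimal sketch, assuming NumPy; the function name is hypothetical.

```python
import numpy as np

def ridge_via_svd(X, y, lambdas):
    """Ridge solutions for many lambdas from a single SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
    Uty = U.T @ y                                      # computed once
    return np.array([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lambdas = [0.1, 1.0, 10.0]

W_svd = ridge_via_svd(X, y, lambdas)

# Reference: solve the regularized normal equations separately for each lambda.
W_direct = np.array([np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
                     for lam in lambdas])
```

After the one-off decomposition, each additional $\lambda$ costs only a few matrix-vector products instead of a new linear solve.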
