CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu)
October 2, 2014
Outline

1 Review of last lecture
  Linear regression
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting
Review of last lecture: Linear regression

Estimating model parameters

Design matrix and target vector:

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \in \mathbb{R}^{N \times D}, \qquad \tilde{X} = (1, X) \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

Residual sum of squares in matrix form:

$$\mathrm{RSS}(\tilde{w}) = \tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2 \left( \tilde{X}^T y \right)^T \tilde{w} + \mathrm{const}$$
Review of last lecture: Linear regression

Optimal solution

Normal equation: take the derivative with respect to $\tilde{w}$ and set it to zero:

$$\frac{\partial\, \mathrm{RSS}(\tilde{w})}{\partial \tilde{w}} \propto \tilde{X}^T \tilde{X} \tilde{w} - \tilde{X}^T y = 0$$

This leads to the least-mean-square (LMS) solution:

$$\tilde{w}^{\mathrm{LMS}} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T y$$
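As a concrete illustration, a minimal NumPy sketch of the normal equation (the function name is ours, not from the lecture; `solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

def lms_fit(X, y):
    """Closed-form LMS solution via the normal equation (illustrative sketch).

    X: (N, D) feature matrix; y: (N,) targets.
    Prepends a column of ones for the bias, then solves
    (X~^T X~) w~ = X~^T y for w~.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # (N, D+1)
    # Solving the linear system is cheaper and more stable than inverting.
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
```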
Review of last lecture: Linear regression

Practical concerns

The bottleneck in computing the LMS solution

$$\tilde{w} = \left( \tilde{X}^T \tilde{X} \right)^{-1} \tilde{X}^T y$$

is inverting the matrix $\tilde{X}^T \tilde{X} \in \mathbb{R}^{(D+1) \times (D+1)}$.

Scalable methods:
- Batch gradient descent (a sketch follows below)
- Stochastic gradient descent
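A minimal batch gradient descent sketch, assuming a fixed step size; the function name, step size, and iteration count are illustrative choices, not values from the lecture:

```python
import numpy as np

def batch_gd(X_tilde, y, eta=0.1, num_iters=1000):
    """Batch gradient descent for RSS (illustrative sketch).

    Each step uses the full gradient X~^T (X~ w~ - y), so the
    (D+1) x (D+1) matrix inversion is never needed.
    """
    w = np.zeros(X_tilde.shape[1])
    for _ in range(num_iters):
        grad = X_tilde.T @ (X_tilde @ w - y) / len(y)  # averaged gradient
        w -= eta * grad
    return w
```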
Review of last lecture: Linear regression

Stochastic gradient descent

Widrow-Hoff rule: update parameters using one example at a time.

Initialize $\tilde{w}$ to $\tilde{w}^{(0)}$ (anything reasonable is fine); set $t = 0$; choose $\eta > 0$.

Loop until convergence:
1 Randomly choose a training sample $x_t$
2 Compute its contribution to the gradient (ignoring the constant factor):
$$g_t = \left( \tilde{x}_t^T \tilde{w}^{(t)} - y_t \right) \tilde{x}_t$$
3 Update the parameters:
$$\tilde{w}^{(t+1)} = \tilde{w}^{(t)} - \eta\, g_t$$
4 $t \leftarrow t + 1$

Scalable to large problems and often very effective.
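A minimal sketch of this loop, assuming a fixed step size and epoch-wise shuffling in place of a formal convergence test (the step size, epoch count, and seed are illustrative):

```python
import numpy as np

def sgd(X_tilde, y, eta=0.01, num_epochs=10, seed=0):
    """Widrow-Hoff / stochastic gradient descent sketch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_tilde.shape[1])                    # w~(0)
    for _ in range(num_epochs):
        for t in rng.permutation(len(y)):             # randomly chosen samples
            g = (X_tilde[t] @ w - y[t]) * X_tilde[t]  # per-example gradient g_t
            w -= eta * g                              # w~(t+1) = w~(t) - eta g_t
    return w
```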
Review of last lecture: Linear regression

Regularized least squares / ridge regression

For $\tilde{X}^T \tilde{X}$ that is not invertible:

$$\tilde{w} = \left( \tilde{X}^T \tilde{X} + \lambda I \right)^{-1} \tilde{X}^T y$$

This is equivalent to adding an extra term to $\mathrm{RSS}(\tilde{w})$:

$$\underbrace{\tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2 \left( \tilde{X}^T y \right)^T \tilde{w}}_{\mathrm{RSS}(\tilde{w})} + \underbrace{\lambda \|\tilde{w}\|_2^2}_{\text{regularization}}$$
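The closed form changes by one term. A sketch, assuming every coordinate (including the bias) is regularized, which matches the formula above; some treatments leave the bias unregularized:

```python
import numpy as np

def ridge_fit(X_tilde, y, lam=0.1):
    """Ridge regression sketch: lam * I makes the system invertible
    even when X~^T X~ is singular."""
    M = X_tilde.shape[1]
    return np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(M),
                           X_tilde.T @ y)
```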
Review of last lecture: Linear regression

Things you should know about linear regression

- Linear regression is a linear combination of features:
$$f: x \to y, \qquad f(x) = w_0 + \sum_d w_d x_d = w_0 + w^T x$$
- If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters. You can either invert a matrix to obtain the parameters, or use gradient-based methods (batch or stochastic gradient descent).
- Probabilistic interpretation: maximum likelihood, assuming the residual is Gaussian distributed (see the worked equivalence below).
- Ridge regression: regularizing the residual sum of squares.
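The maximum-likelihood equivalence in one line: if $y_n = \tilde{w}^T \tilde{x}_n + \eta$ with Gaussian noise $\eta \sim N(0, \sigma^2)$, the log-likelihood is a negatively scaled RSS plus a constant, so maximizing one minimizes the other:

$$\max_{\tilde{w}} \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_n - \tilde{w}^T \tilde{x}_n)^2}{2\sigma^2} \right) \;\Longleftrightarrow\; \min_{\tilde{w}} \sum_{n=1}^{N} \left( y_n - \tilde{w}^T \tilde{x}_n \right)^2 = \mathrm{RSS}(\tilde{w})$$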
Outline

1 Review of last lecture
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting
Nonlinear basis functions

What if the data is not linearly separable, or does not fit a line?

[Figure: example of nonlinear classification; two classes that no straight line can separate]

[Figure: example of nonlinear regression; target t versus input x follows a curve rather than a line]
Nonlinear basis functions

Nonlinear basis for classification

Transform the input/feature:

$$\phi(x): x \in \mathbb{R}^2 \to z = x_1 x_2$$

Transformed training data: linearly separable!

[Figure: the original two-class data and the transformed data; a single threshold on z separates the classes]
Nonlinear basis functions

Another example

How to transform the input/feature?

$$\phi(x): x \in \mathbb{R}^2 \to z = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix} \in \mathbb{R}^3$$

Transformed training data: linearly separable.

Intuition: with $w = (1, 0, 1)^T$, $w^T z = x_1^2 + x_2^2$, i.e., the squared distance to the origin!

[Figure: two classes arranged in concentric rings around the origin; after the transform, thresholding the squared distance separates them]
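A toy sketch of this map (the points and names are made up for illustration):

```python
import numpy as np

def phi(X):
    """Quadratic feature map: (x1, x2) -> (x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, x1 * x2, x2 ** 2], axis=1)

# With w = (1, 0, 1), w^T z is the squared distance to the origin,
# so a circular decision boundary becomes a linear threshold on z.
w = np.array([1.0, 0.0, 1.0])
X = np.array([[0.1, 0.2], [3.0, -2.0]])   # one inner point, one outer point
print(phi(X) @ w)                          # [0.05, 13.0]
```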
Nonlinear basis functions

General nonlinear basis functions

We can use a nonlinear mapping

$$\phi(x): x \in \mathbb{R}^D \to z \in \mathbb{R}^M$$

where $M$ is the dimensionality of the new feature/input $z$ (or $\phi(x)$). Note that $M$ can be greater than, less than, or equal to $D$.

With the new features, we can apply our learning techniques to minimize our errors on the transformed training data:
- linear methods: prediction is based on $w^T \phi(x)$
- other methods: nearest neighbors, decision trees, etc.
Nonlinear basis functions

Regression with nonlinear basis

Residual sum of squares:

$$\sum_n \left[ w^T \phi(x_n) - y_n \right]^2$$

where $w \in \mathbb{R}^M$ has the same dimensionality as the transformed features $\phi(x)$.

The LMS solution can be formulated with the new design matrix:

$$\Phi = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix} \in \mathbb{R}^{N \times M}, \qquad w^{\mathrm{LMS}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T y$$
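A sketch of this recipe, assuming `phi` maps a single example to its M-dimensional feature vector; `lstsq` stands in for the explicit inverse for numerical stability:

```python
import numpy as np

def fit_with_basis(X, y, phi):
    """LMS regression after a nonlinear basis transform (sketch)."""
    Phi = np.stack([phi(x) for x in X])            # (N, M) design matrix
    # Least-squares solve of Phi w = y, i.e. w = (Phi^T Phi)^{-1} Phi^T y
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

# Example: cubic polynomial basis for scalar inputs.
poly3 = lambda x: np.array([1.0, x, x ** 2, x ** 3])
```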
Nonlinear basis functions

Example with regression

Polynomial basis functions:

$$\phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^M \end{pmatrix}, \qquad f(x) = w_0 + \sum_{m=1}^{M} w_m x^m$$

Fitting samples from a sine function: underfitting, as $f(x)$ is too simple.

[Figure: fits with M = 0 and M = 1; both are far too simple to follow the sine curve]
Nonlinear basis functions

Adding more high-order basis functions

[Figure: the M = 3 fit follows the sine curve closely; the M = 9 fit passes through every training point but oscillates wildly between them, i.e., overfitting]

Being too adaptive leads to better results on the training data, but not so great results on data that has not been seen!
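A sketch reproducing this effect with NumPy's polynomial fitting (sample size, noise level, and seed are made up; at M = 9 with ten points, `polyfit` may warn that the fit is poorly conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)  # noisy sine samples

for M in (1, 3, 9):
    coeffs = np.polyfit(x, y, M)                 # degree-M least-squares fit
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(M, rss)  # training RSS shrinks toward 0 as M grows: overfitting
```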
Nonlinear basis functions

Overfitting

Parameters for higher-order polynomials become very large in magnitude:

       M = 0    M = 1     M = 3          M = 9
w0      0.19     0.82      0.31           0.35
w1              -1.27      7.99         232.37
w2                       -25.43       -5321.83
w3                        17.37       48568.31
w4                                  -231639.30
w5                                   640042.26
w6                                 -1061800.52
w7                                  1042400.18
w8                                  -557682.99
w9                                   125201.43
Nonlinear basis functions

Overfitting can be quite disastrous

Fitting the housing price data with M = 3: the predicted price would go to zero (or even negative) for large enough houses!

This is called poor generalization/overfitting.
Nonlinear basis functions

Detecting overfitting

Plot model complexity versus the objective function. As the model becomes more complex, performance on the training data keeps improving, while performance on the test data improves at first and deteriorates later.

[Figure: E_RMS versus M for training and test data; training error decreases steadily while test error rises sharply at high M]

Horizontal axis: a measure of model complexity. In this example, we use the maximum order of the polynomial basis functions.

Vertical axis (a small experiment computing both curves follows this list):
1 For regression, the vertical axis would be RSS or RMS (the square root of RSS)
2 For classification, the vertical axis would be the classification error rate or the cross-entropy error function
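A sketch of the experiment behind such a plot, assuming the same noisy-sine setup as before (sample sizes, noise level, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

x_tr, y_tr = make_data(10)     # small training set
x_te, y_te = make_data(100)    # held-out test set

def rms(coeffs, x, y):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

for M in range(10):
    c = np.polyfit(x_tr, y_tr, M)
    # Training RMS keeps falling; test RMS improves, then deteriorates.
    print(M, rms(c, x_tr, y_tr), rms(c, x_te, y_te))
```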
Outline

1 Review of last lecture
2 Nonlinear basis functions
3 Basic ideas to overcome overfitting
  Use more training data
  Regularization methods
Basic ideas to overcome overfitting: Use more training data

Use more training data to prevent overfitting

The more, the merrier.

[Figure: M = 9 fits with N = 15 and N = 100 training points; as N grows, the ninth-order fit stops oscillating and tracks the sine curve closely]

What if we do not have a lot of data?
Basic ideas to overcome overfitting: Regularization methods

Regularization methods

Intuition: for a linear model for regression, $w^T x$, what do we mean by being simple?

Assumption:

$$p(w_d) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{w_d^2}{2\sigma^2}}$$

Namely, a priori, we believe $w_d$ is around zero, resulting in a simple model for prediction.

Note that this line of thinking regards $w$ as a random variable, and we will use the observed data $\mathcal{D}$ to update our prior belief on $w$.
Basic ideas to overcome overfitting: Regularization methods

Example: fitting data with polynomials

Our regression model:

$$y = \sum_{m=0}^{M} w_m x^m$$

Thus, smaller $w_m$ will likely lead to a smaller effective order of the polynomial, potentially preventing overfitting.
Basic ideas to overcome overfitting: Regularization methods

Setup for regularized linear regression

Linear regression:

$$y = w^T x + \eta$$

where $\eta$ is Gaussian noise, distributed according to $N(0, \sigma^2)$.

Prior distribution on the parameters:

$$p(w) = \prod_{d=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-\frac{w_d^2}{2\sigma_0^2}}$$

Note that all the dimensions share the same variance $\sigma_0^2$.
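Combining the noise model with this Gaussian prior, the maximum a posteriori estimate recovers exactly the ridge objective from earlier, with $\lambda$ the ratio of the noise variance to the prior variance (a sketch of the correspondence; multiply the negative log-posterior by $2\sigma^2$ and drop constants):

$$\max_{w}\; \log p(w) + \sum_{n=1}^{N} \log p(y_n \mid x_n, w) \;\Longleftrightarrow\; \min_{w}\; \sum_{n=1}^{N} \left( y_n - w^T x_n \right)^2 + \underbrace{\frac{\sigma^2}{\sigma_0^2}}_{\lambda} \|w\|_2^2$$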