Machine Learning - HT 2016
4. Basis Expansion, Regularization, Validation
Varun Kanade
University of Oxford
February 03, 2016
Outline
Introduce basis functions to go beyond linear regression
Understand the tradeoff between bias and variance
Overfitting: what happens when we make models too complex
Regularization as a means to control model complexity
Cross-validation to perform model selection
Basis Expansion
[Figure: a one-dimensional dataset to be fit]
Basis Expansion
φ(x) = [1, x, x^2, x^3, x^4]
w_1 + w_2 x + w_3 x^2 + w_4 x^3 + w_5 x^4 = φ(x) · [w_1, w_2, w_3, w_4, w_5]
[Figure: degree-4 polynomial fit to the data]
Basis Expansion
φ(x) = [1, x, x^2]
[Figure: quadratic fit to the data]
Basis Expansion
φ(x) = [1, x, x^2, x^3, ..., x^9]
[Figure: degree-9 polynomial fit to the data]
Basis Expansion
φ(x) = [1, x, x^2, x^3, ..., x^d]
How can we avoid overfitting?
Does more data help?
[Figure: high-degree polynomial fits on datasets of increasing size]
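As an illustration of fitting such polynomial basis expansions by least squares, here is a minimal Python sketch. The synthetic data, the degrees compared, and all names are my own choices, not the lecture's:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 1-D data standing in for the plotted dataset
    x = rng.uniform(0, 10, size=20)
    y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=20)

    def poly_features(x, degree):
        # phi(x) = [1, x, x^2, ..., x^degree], one row per data point
        return np.vander(x, degree + 1, increasing=True)

    def fit_least_squares(x, y, degree):
        Phi = poly_features(x, degree)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w

    w2 = fit_least_squares(x, y, 2)   # low-degree fit: may underfit a curved target
    w9 = fit_least_squares(x, y, 9)   # degree-9 fit: enough freedom to chase the noise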
Bias Variance Tradeoff
For a linear model, more data would make little difference
[Figure: linear fits on a small and a larger dataset]
Bias results from the model being simpler than the truth
High bias results in underfitting
Bias Variance Tradeoff
What happens when we fit the model on different (randomly drawn) training datasets?
[Figure: high-degree polynomial fits on two different random training sets]
Variance arises when a (complex) model is sensitive to fluctuations in the training dataset
High variance results in overfitting
Bias Variance Tradeoff
When does more data help?
Error = Bias^2 + Variance + Noise (Exercise: verify this for linear regression)
[Figure: fits of models of different complexity]
For more complex models, it is difficult to spot overfitting and underfitting visually
Keep aside some points as a test set
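For reference, the decomposition alluded to above, for a fixed input x with y = f(x) + ε, E[ε] = 0, Var(ε) = σ², and ŷ the prediction of a model trained on a randomly drawn training set (the slide's exercise asks you to verify it for linear regression):

    \[
    \mathbb{E}\big[(y - \hat y)^2\big]
      = \underbrace{\big(f(x) - \mathbb{E}[\hat y]\big)^2}_{\text{Bias}^2}
      + \underbrace{\mathbb{E}\big[(\hat y - \mathbb{E}[\hat y])^2\big]}_{\text{Variance}}
      + \underbrace{\sigma^2}_{\text{Noise}},
    \]

where the cross terms vanish because ε has mean zero and is independent of the training set.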
Learning Curves
Suppose we have a training set and a test set
Train on increasing sizes of the training set, and plot the errors
[Figure: training and test error as a function of training set size]
Once training and test error approach each other, more data won't help!
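A minimal sketch of how such learning curves could be produced; the data, the model degree, and the subset sizes are hypothetical:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical regression data: 240 points, y = 2x + noise
    x_all = rng.uniform(0, 10, 240)
    y_all = 2.0 * x_all + rng.normal(0.0, 1.0, 240)
    x_tr, y_tr, x_te, y_te = x_all[:200], y_all[:200], x_all[200:], y_all[200:]

    def train_test_mse(x_tr, y_tr, x_te, y_te, degree):
        Phi_tr = np.vander(x_tr, degree + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
        Phi_te = np.vander(x_te, degree + 1, increasing=True)
        return np.mean((Phi_tr @ w - y_tr) ** 2), np.mean((Phi_te @ w - y_te) ** 2)

    # Train on increasing subsets of the training set and record both errors
    for m in (10, 20, 40, 80, 160, 200):
        tr_err, te_err = train_test_mse(x_tr[:m], y_tr[:m], x_te, y_te, degree=9)
        print(f"m={m:3d}  train MSE={tr_err:.3f}  test MSE={te_err:.3f}")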
Basis Expansion Using Kernels
We can use kernels as features, e.g., radial basis functions (RBFs)
Feature expansion: φ(x) = [1, κ(x, µ_1, σ), ..., κ(x, µ_d, σ)], e.g., κ(x, µ_i, σ) = exp(-(x - µ_i)^2 / (2σ^2))
Model: y = φ(x)^T w + noise
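A sketch of this RBF feature expansion in Python; the grid of centres µ_i and the 1/(2σ²) normalisation inside the exponent are my assumptions:

    import numpy as np

    def rbf_features(x, centres, sigma):
        # phi(x) = [1, kappa(x, mu_1, sigma), ..., kappa(x, mu_d, sigma)]
        # with kappa(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2))
        diffs = x[:, None] - centres[None, :]                # shape (m, d)
        K = np.exp(-diffs ** 2 / (2.0 * sigma ** 2))
        return np.hstack([np.ones((x.shape[0], 1)), K])      # prepend the bias feature

    x = np.linspace(0, 10, 50)
    centres = np.linspace(0, 10, 10)                         # mu_1, ..., mu_10
    Phi = rbf_features(x, centres, sigma=1.0)
    w, *_ = np.linalg.lstsq(Phi, np.sin(x), rcond=None)      # fit some target, e.g. sin(x)

A small σ makes each basis function very local (risking overfitting), while a large σ makes them nearly flat (risking underfitting).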
Basis Expansion Using Kernels
As in the case of polynomials, the width σ can cause overfitting or underfitting
Image Source: K. Murphy (2012)
Polynomial Basis Expansion in Higher Dimensions
We are basically fitting linear models (at the cost of increasing the dimension)
y = φ(x)^T w + noise
Linear Model: φ(x) = [1, x_1, x_2]
Quadratic Model: φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]
Image Source: K. Murphy (2012)
How many dimensions do you get for degree-d polynomials over n variables?
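A small sketch of the quadratic feature map above for two input variables (plain NumPy; scikit-learn's PolynomialFeatures produces the same expansion, up to the ordering of the columns):

    import numpy as np

    def quadratic_features(X):
        # X has columns x1, x2; returns [1, x1, x2, x1^2, x2^2, x1*x2] per row
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(quadratic_features(X))
    # [[ 1.  1.  2.  1.  4.  2.]
    #  [ 1.  3.  4.  9. 16. 12.]]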
Overfitting!
In high dimensions, we can have many, many parameters!
With 100 variables and degree 10 we have roughly 10^20 parameters!
Enrico Fermi to Freeman Dyson: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." [video]
How do we prevent overfitting?
How does overfitting occur?
Suppose X is 100 × 100 with every entry drawn from N(0, 1)
And let y_i = x_{i,1} + N(0, σ^2), for σ = 0.2
[Figure: the data and the estimated coefficients of the least-squares fit]
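A sketch of this experiment; the seed and the exact noise level are my choices:

    import numpy as np

    rng = np.random.default_rng(0)
    m, N = 100, 100
    X = rng.standard_normal((m, N))               # every entry ~ N(0, 1)
    y = X[:, 0] + 0.2 * rng.standard_normal(m)    # y_i = x_{i,1} + noise

    # Ordinary least squares with as many features as points: it fits the noise exactly
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("largest |w_j| for j > 1:", np.abs(w_ols[1:]).max())   # far from the true value 0
    print("training residual norm:", np.linalg.norm(X @ w_ols - y))  # essentially 0

The true coefficient vector is e_1, but with 100 parameters and 100 points the fit interpolates the training data and the estimated coefficients are all over the place.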
Ridge Regression
Say our data is (x_i, y_i), i = 1, ..., m, where x ∈ R^N and N is really, really large!
We used the squared loss L(w) = (Xw - y)^T (Xw - y) and obtained the estimate ŵ = (X^T X)^{-1} X^T y
Suppose we want w to be small ("weight decay")
L(w) = (Xw - y)^T (Xw - y) + λ w^T w - λ w_1^2
We will not regularize the bias (or constant) term
Exercise: If all x_i (except x_1) have mean 0 and y has mean 0, then w_1 = 0
Deriving Estimate for Ridge Regression
L(w) = (Xw - y)^T (Xw - y) + λ w^T w
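One way to complete the derivation left for the lecture (set the gradient to zero; X^T X + λI is invertible for λ > 0):

    \[
    \nabla_w L(w) = 2 X^\top (Xw - y) + 2\lambda w = 0
    \;\Longrightarrow\; (X^\top X + \lambda I)\, w = X^\top y
    \;\Longrightarrow\; \hat w = (X^\top X + \lambda I)^{-1} X^\top y .
    \]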
Ridge Regression
Objective/Loss: L(w) = (Xw - y)^T (Xw - y) + λ w^T w
Estimate: ŵ = (X^T X + λI)^{-1} X^T y
The estimate depends on the scaling of the inputs
Common practice: normalise all input dimensions
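A minimal NumPy sketch of this closed-form estimate, including the normalisation mentioned above (how the intercept is handled here is my assumption):

    import numpy as np

    def ridge_fit(X, y, lam):
        # Standardise inputs so the penalty treats all dimensions comparably,
        # and centre y so the (unregularized) intercept can be handled separately.
        mu, sd = X.mean(axis=0), X.std(axis=0)
        Xs = (X - mu) / sd
        yc = y - y.mean()
        N = X.shape[1]
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(N), Xs.T @ yc)
        b = y.mean()                        # intercept for the standardised features
        return w, b, mu, sd

    def ridge_predict(X_new, w, b, mu, sd):
        return ((X_new - mu) / sd) @ w + b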
Ridge Regression and MAP Estimation
Objective/Loss: L(w) = (Xw - y)^T (Xw - y) + λ w^T w
Ridge Regression and MAP Estimation
Objective/Loss: L(w) = (Xw - y)^T (Xw - y) + λ w^T w
exp(-L(w) / (2σ^2)) = exp(-(1/2) (y - Xw)^T Σ^{-1} (y - Xw)) · exp(-(1/2) w^T Λ^{-1} w)
where Σ = σ^2 I and Λ = (σ^2/λ) I = τ^2 I
Bayesian View of Machine Learning
General Formulation:
  Prior p(w) on w
  Model p(y | x, w)
Linear Regression:
  p(w) = N(w | 0, τ^2 I_N)
  p(y | x, w) = N(y | x^T w, σ^2)
Compute the posterior p(w | D) given data D = (x_i, y_i), i = 1, ..., m
Bayesian View of Regression
Making a prediction on a new point x_new:
p(y | D, x_new) = ∫ p(y | w, x_new) p(w | D) dw
Prediction: MAP vs Fully Bayesian
MAP Approach: y_new ~ N(x_new^T w_MAP, σ^2)
Bayesian Approach: y_new ~ N(x_new^T w_MAP, σ_X^2), where σ_X^2 = σ^2 (1 + x_new^T (X^T X + (σ^2/τ^2) I)^{-1} x_new)
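A sketch computing both predictions, assuming σ² and τ² are known:

    import numpy as np

    def bayesian_linear_prediction(X, y, x_new, sigma2, tau2):
        N = X.shape[1]
        A = X.T @ X + (sigma2 / tau2) * np.eye(N)       # X^T X + (sigma^2/tau^2) I
        w_map = np.linalg.solve(A, X.T @ y)
        mean = x_new @ w_map                            # same predictive mean in both cases
        var_map = sigma2                                # MAP: only the noise variance
        var_bayes = sigma2 * (1.0 + x_new @ np.linalg.solve(A, x_new))  # adds posterior uncertainty
        return mean, var_map, var_bayes

The fully Bayesian predictive variance is larger far from the training data, where the posterior over w is least certain.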
Ridge Regression
Minimize: (Xw - y)^T (Xw - y) + λ w^T w
Equivalently, minimize (Xw - y)^T (Xw - y) such that w^T w ≤ R
Ridge Regression
[Figure: contours of the squared loss with the circular L2 constraint region]
Image Source: Hastie, Tibshirani, Friedman (2013)
Lasso Regression
Minimize: (Xw - y)^T (Xw - y) + λ Σ_{i=1}^N |w_i|
Equivalently, minimize (Xw - y)^T (Xw - y) such that Σ_{i=1}^N |w_i| ≤ R
Lasso Regression
[Figure: contours of the squared loss with the diamond-shaped L1 constraint region]
Image Source: Hastie, Tibshirani, Friedman (2013)
Regularization: Ridge or Lasso
[Figure: estimated coefficients under no regularization, ridge, and lasso]
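A sketch reproducing this kind of comparison on the simulated data from the overfitting example, using scikit-learn's Ridge and Lasso; the regularization strengths are arbitrary choices:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 100))
    y = X[:, 0] + 0.2 * rng.standard_normal(100)          # true coefficients: e_1

    for name, model in [("no regularization", LinearRegression(fit_intercept=False)),
                        ("ridge", Ridge(alpha=10.0, fit_intercept=False)),
                        ("lasso", Lasso(alpha=0.1, fit_intercept=False))]:
        w = model.fit(X, y).coef_
        print(f"{name:18s}  w_1={w[0]:.2f}  max |w_j| for j>1: {np.abs(w[1:]).max():.3f}")

Ridge shrinks the spurious coefficients towards zero; lasso typically sets most of them exactly to zero.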
Choosing Parameters
Before, we were just trying to find w
Now we also have to worry about how to choose λ for ridge or lasso
If we use kernels, we also have to pick the width σ
If we use higher-degree polynomials, we have to pick d
Cross Validation
Keep a part of the data as a validation or test set
Look at the error on the training and validation sets

λ        Train Error (%)    Test Error (%)
0.01     5                  27
0.1      10                 8
1        12                 12
10       20                 43
100      20                 89
k-fold Cross Validation
What do we do when data is scarce?
Divide the data into k parts
Use k - 1 parts for training and 1 part as validation
When k = m (the number of datapoints), we get LOOCV (leave-one-out cross-validation)
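A minimal sketch of k-fold cross-validation for choosing λ in ridge regression; the candidate grid of λ values is an assumption:

    import numpy as np

    def kfold_cv_ridge(X, y, lambdas, k=5, seed=0):
        m = len(y)
        idx = np.random.default_rng(seed).permutation(m)
        folds = np.array_split(idx, k)                      # split the data into k parts
        avg_err = []
        for lam in lambdas:
            errs = []
            for f in range(k):
                val = folds[f]                              # 1 part as validation
                tr = np.concatenate([folds[j] for j in range(k) if j != f])  # k - 1 parts for training
                A = X[tr].T @ X[tr] + lam * np.eye(X.shape[1])
                w = np.linalg.solve(A, X[tr].T @ y[tr])
                errs.append(np.mean((X[val] @ w - y[val]) ** 2))
            avg_err.append(np.mean(errs))
        return lambdas[int(np.argmin(avg_err))]             # λ with lowest average validation error

Setting k = m turns this into leave-one-out cross-validation.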
Overfitting on the Validation Set
Suppose you do all the right things:
Train on the training set
Choose hyperparameters using proper validation
Then you test on the test set, and your error is high!
What would you do?
Winning Kaggle without reading the data!
Suppose the task is to predict m binary labels
Algorithm (Wacky Boosting):
1. Choose y^1, ..., y^k ∈ {0, 1}^m uniformly at random
2. Set I = {i : accuracy(y^i) > 51%}
3. Output ŷ_j = majority{y^i_j : i ∈ I}
Source: blog.mrtz.org
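A small simulation of why this beats 50% on a fixed leaderboard without ever looking at the data; the sizes m and k are hypothetical, the 51% threshold is from the slide:

    import numpy as np

    rng = np.random.default_rng(0)
    m, k = 2000, 500
    y_true = rng.integers(0, 2, m)                    # the m hidden binary labels

    # Step 1: k uniformly random guesses
    guesses = rng.integers(0, 2, (k, m))

    # Step 2: keep only the guesses the leaderboard says beat 51%
    acc = (guesses == y_true).mean(axis=1)
    I = acc > 0.51

    # Step 3: coordinate-wise majority vote over the selected guesses
    y_hat = (guesses[I].mean(axis=0) > 0.5).astype(int)
    print("selected guesses:", I.sum())
    print("leaderboard accuracy of the majority vote:", (y_hat == y_true).mean())

Each surviving guess is only slightly correlated with the hidden labels, but the majority vote aggregates those tiny correlations, so the reported accuracy climbs well above 50% — the algorithm is simply overfitting the leaderboard.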
Next Time
Optimization Methods
Read up on gradients, multivariate calculus