Machine Learning - MT 2016, 4 & 5. Basis Expansion, Regularization, Validation


Machine Learning - MT 2016, 4 & 5. Basis Expansion, Regularization, Validation
Varun Kanade, University of Oxford
October 19 & 24, 2016

Outline
- Basis function expansion to capture non-linear relationships
- Understanding the bias-variance tradeoff
- Overfitting and regularization
- Bayesian view of machine learning
- Cross-validation to perform model selection

Outline
- Basis Function Expansion
- Overfitting and the Bias-Variance Tradeoff
- Ridge Regression and Lasso
- Bayesian Approach to Machine Learning
- Model Selection

Linear Regression: Polynomial Basis Expansion
[Figure-only slides not transcribed.]
$\phi(x) = [1, x, x^2]$, so that $w_0 + w_1 x + w_2 x^2 = \phi(x) \cdot [w_0, w_1, w_2]$

Linear Regression: Polynomial Basis Expansion
$\phi(x) = [1, x, x^2, \dots, x^d]$
Model: $y = w^T \phi(x) + \epsilon$
Here $w \in \mathbb{R}^M$, where $M$ is the number of expanded features

Linear Regression: Polynomial Basis Expansion
Getting more data can avoid overfitting!

Polynomial Basis Expansion in Higher Dimensions
Basis expansion can be performed in higher dimensions
We're still fitting linear models, but using more features
Linear model: $\phi(x) = [1, x_1, x_2]$, with $y = w \cdot \phi(x) + \epsilon$
Quadratic model: $\phi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
Using degree-$d$ polynomials in $D$ dimensions results in roughly $D^d$ features!
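As an illustration of this idea (not from the slides themselves), here is a minimal NumPy sketch of a degree-2 expansion in two dimensions followed by ordinary least squares; the synthetic data and function names are assumptions for the example.

```python
import numpy as np

def quadratic_features(X):
    """Map [x1, x2] -> [1, x1, x2, x1^2, x2^2, x1*x2], row by row."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                        # toy 2-D inputs
y = 1.0 + X[:, 0] - 2 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)    # nonlinear target

Phi = quadratic_features(X)                      # expanded design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # still a *linear* least-squares fit
print(np.round(w, 2))                            # weights on the expanded features
```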

Basis Expansion Using Kernels
We can use kernels as features
A Radial Basis Function (RBF) kernel with width parameter $\gamma$ is defined as $\kappa(x, x') = \exp(-\gamma \|x - x'\|^2)$
Choose centres $\mu_1, \mu_2, \dots, \mu_M$
Feature map: $\phi(x) = [1, \kappa(\mu_1, x), \dots, \kappa(\mu_M, x)]$
$y = w_0 + w_1 \kappa(\mu_1, x) + \dots + w_M \kappa(\mu_M, x) + \epsilon = w \cdot \phi(x) + \epsilon$
How do we choose the centres?

Basis Expansion Using Kernels
One reasonable choice is to use the data points themselves as centres for the kernels
We still need to choose the width parameter $\gamma$ for the RBF kernel $\kappa(x, x') = \exp(-\gamma \|x - x'\|^2)$
As with the choice of degree in polynomial basis expansion, overfitting or underfitting may occur depending on the width of the kernel
Overfitting occurs if the width is too small, i.e., $\gamma$ very large
Underfitting occurs if the width is too large, i.e., $\gamma$ very small
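A minimal sketch of this RBF feature map with the training points as centres (not part of the original slides; the choice of $\gamma$ and the toy data are illustrative assumptions):

```python
import numpy as np

def rbf_features(X, centres, gamma):
    """phi(x) = [1, k(mu_1, x), ..., k(mu_M, x)] with k(mu, x) = exp(-gamma * ||x - mu||^2)."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    K = np.exp(-gamma * sq_dists)
    return np.column_stack([np.ones(len(X)), K])

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma = 1.0                                    # large gamma -> narrow kernels -> risk of overfitting
Phi = rbf_features(X, centres=X, gamma=gamma)  # data points themselves used as centres
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w                                # fitted values on the training inputs
```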

[Figures: RBF-kernel fits when the kernel width is too large, too small, and chosen suitably; repeated with a larger dataset ("Big Data").]

Basis Expansion Using Kernels
Overfitting occurs if the kernel width is too small, i.e., $\gamma$ very large; having more data can help reduce overfitting!
Underfitting occurs if the width is too large, i.e., $\gamma$ very small; extra data does not help at all in this case!
When the data lies in a high-dimensional space we may encounter the curse of dimensionality:
- If the width is too small then we may overfit
- Might need an exponentially large (in the dimension) sample to use modest-width kernels
Connection to Problem 1 on Sheet 1

Outline
- Basis Function Expansion
- Overfitting and the Bias-Variance Tradeoff
- Ridge Regression and Lasso
- Bayesian Approach to Machine Learning
- Model Selection

The Bias-Variance Tradeoff
[Figures: a sequence of fits illustrating the range from high bias to high variance.]

The Bias-Variance Tradeoff
Having high bias means that we are underfitting
Having high variance means that we are overfitting
The terms bias and variance in this context are precisely defined statistical notions
See Problem Sheet 2, Q3 for precise calculations in one particular context
See Secs. 7.1-7.3 in the HTF book for a much more detailed description
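For reference (not spelled out on the slide), the standard decomposition under squared loss, assuming $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$, $\mathrm{Var}(\epsilon) = \sigma^2$, and expectations taken over the training sets used to produce the estimator $\hat{f}$:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
```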

Learning Curves
Suppose we've trained a model and used it to make predictions
But in reality, the predictions are often poor
How can we know whether we have high bias (underfitting), high variance (overfitting), or neither?
Should we add more features (higher-degree polynomials, lower-width kernels, etc.) to make the model more expressive?
Should we simplify the model (lower-degree polynomials, larger-width kernels, etc.) to reduce the number of parameters?
Should we try to obtain more data? Often there is a computational and monetary cost to using more data

Learning Curves
Split the data into a training set and a testing set
Train on increasing sizes of data
Plot the training error and test error as a function of training-data size
[Figures: a learning curve where more data is not useful, and one where more data would be useful.]
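A minimal sketch of how such learning curves can be produced (not part of the slides; the synthetic data and the plain least-squares model are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = rng.normal(size=(N, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.3 * rng.normal(size=N)

X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def mse(A, b, w):
    return float(np.mean((A @ w - b) ** 2))

for n in [20, 50, 100, 200, 400]:   # increasing training-set sizes
    w, *_ = np.linalg.lstsq(X_train[:n], y_train[:n], rcond=None)
    print(n, round(mse(X_train[:n], y_train[:n], w), 3), round(mse(X_test, y_test, w), 3))
```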

Overfitting: How does it occur?
When dealing with high-dimensional data (which may be caused by basis expansion), even a linear model has many parameters
With $D = 100$ input variables and degree-10 polynomial basis expansion we have about $10^{20}$ parameters!
Enrico Fermi to Freeman Dyson: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." [video]
How can we prevent overfitting?

Overfitting: How does it occur?
Suppose we have $D = 100$ and $N = 100$, so that $X$ is $100 \times 100$
Suppose every entry of $X$ is drawn from $\mathcal{N}(0, 1)$
And let $y_i = x_{i,1} + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $\sigma = 0.2$

Outline
- Basis Function Expansion
- Overfitting and the Bias-Variance Tradeoff
- Ridge Regression and Lasso
- Bayesian Approach to Machine Learning
- Model Selection

Ridge Regression
Suppose we have data $(x_i, y_i)_{i=1}^N$, where $x_i \in \mathbb{R}^D$ with $D \approx N$
One idea to avoid overfitting is to add a penalty term for the weights
Least Squares Objective: $L(w) = (Xw - y)^T (Xw - y)$
Ridge Regression Objective: $L_{\text{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda \sum_{i=1}^D w_i^2$

Ridge Regression
We add a penalty term for weights to control model complexity
Should not penalise the constant term $w_0$ for being large

Ridge Regression
Should translating and scaling inputs contribute to model complexity?
Suppose $\hat{y} = w_0 + w_1 x$, where $x$ is temperature in °C, and let $x'$ be the same temperature in °F
Then $\hat{y} = \left(w_0 - \tfrac{160}{9} w_1\right) + \tfrac{5}{9} w_1 x'$
In one case the model complexity (penalty) is $w_1^2$, in the other it is $\tfrac{25}{81} w_1^2 < \tfrac{w_1^2}{3}$
We should try to avoid dependence on scaling and translation of variables

Ridge Regression
Before optimising the ridge objective, it's a good idea to standardise all inputs (mean 0 and variance 1)
If, in addition, we centre the outputs, i.e., the outputs have mean 0, then the constant term is unnecessary (Exercise on Sheet 2)
Then find $w$ that minimises the objective function $L_{\text{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w$

Deriving the Estimate for Ridge Regression
Suppose we have data $(x_i, y_i)_{i=1}^N$ with inputs standardised and output centred
We want to derive an expression for the $w$ that minimises
$L_{\text{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w = w^T X^T X w - 2 y^T X w + y^T y + \lambda w^T w$
Let's take the gradient of the objective with respect to $w$:
$\nabla_w L_{\text{ridge}} = 2 (X^T X) w - 2 X^T y + 2 \lambda w = 2 \left( (X^T X + \lambda I_D) w - X^T y \right)$
Set the gradient to 0 and solve for $w$:
$(X^T X + \lambda I_D) w = X^T y$
$w_{\text{ridge}} = (X^T X + \lambda I_D)^{-1} X^T y$
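A minimal NumPy sketch of this closed-form solution (an illustration, not the course's reference implementation; in practice the standardisation statistics would be computed on training data only):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: w = (X^T X + lam * I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = X[:, 0] + 0.2 * rng.normal(size=100)

# Standardise inputs and centre the output, as suggested on the slide.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

w_ridge = ridge_fit(Xs, yc, lam=10.0)
print(np.round(w_ridge[:5], 3))   # the first coefficient should dominate
```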

Ridge Regression
Minimise $(Xw - y)^T (Xw - y) + \lambda w^T w$
Equivalently: minimise $(Xw - y)^T (Xw - y)$ subject to $w^T w \leq R$
[Figure-only slide builds not transcribed.]

Ridge Regression
As we decrease $\lambda$, the magnitudes of the weights start increasing

Summary: Ridge Regression
In ridge regression, in addition to the residual sum of squares, we penalise the sum of squares of the weights
Ridge Regression Objective: $L_{\text{ridge}}(w) = (Xw - y)^T (Xw - y) + \lambda w^T w$
This is also called $\ell_2$-regularization or weight decay
Penalising weights encourages fitting signal rather than just noise

The Lasso
The lasso (least absolute shrinkage and selection operator) minimises the following objective function
Lasso Objective: $L_{\text{lasso}}(w) = (Xw - y)^T (Xw - y) + \lambda \sum_{i=1}^D |w_i|$
As with ridge regression, there is a penalty on the weights ($\ell_1$-regularization)
The absolute value function does not allow for a simple closed-form expression
However, there are advantages to using the lasso, as we shall see next

The Lasso: Optimization
Minimise $(Xw - y)^T (Xw - y) + \lambda \sum_{i=1}^D |w_i|$
Equivalently: minimise $(Xw - y)^T (Xw - y)$ subject to $\sum_{i=1}^D |w_i| \leq R$
[Figure-only slide builds not transcribed.]

The Lasso: Paths
As we decrease $\lambda$, the magnitudes of the weights start increasing

Comparing Ridge Regression and the Lasso
When using the lasso, weights are often exactly 0. Thus, the lasso gives sparse models.

Overfitting: How does it occur?
We have $D = 100$ and $N = 100$, so that $X$ is $100 \times 100$
Every entry of $X$ is drawn from $\mathcal{N}(0, 1)$
$y_i = x_{i,1} + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $\sigma = 0.2$
[Figures: estimated weights with no regularization, with ridge regression, and with the lasso.]
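A sketch reproducing this experiment with scikit-learn (an illustration only; the regularization strengths below are arbitrary choices, not the values used in the lecture):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(4)
D = N = 100
X = rng.normal(size=(N, D))             # every entry drawn from N(0, 1)
y = X[:, 0] + 0.2 * rng.normal(size=N)  # y_i = x_{i,1} + noise, sigma = 0.2

models = {
    "no regularization": LinearRegression(fit_intercept=False),
    "ridge": Ridge(alpha=10.0, fit_intercept=False),
    "lasso": Lasso(alpha=0.1, fit_intercept=False),
}
for name, model in models.items():
    w = model.fit(X, y).coef_
    print(f"{name:>18}: |w_1| = {abs(w[0]):.2f}, "
          f"#nonzero = {np.sum(np.abs(w) > 1e-6)}, "
          f"max |w_j| for j > 1 = {np.max(np.abs(w[1:])):.2f}")
```

With no regularization the spurious weights on the 99 irrelevant columns are large; ridge shrinks them, and the lasso sets most of them exactly to zero.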

Outline
- Basis Function Expansion
- Overfitting and the Bias-Variance Tradeoff
- Ridge Regression and Lasso
- Bayesian Approach to Machine Learning
- Model Selection

Least Squares and MLE (Gaussian Noise)
Least Squares Objective Function: $L(w) = \sum_{i=1}^N (y_i - w \cdot x_i)^2$
MLE (Gaussian Noise) Likelihood: $p(y \mid X, w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\sum_{i=1}^N \frac{(y_i - w \cdot x_i)^2}{2\sigma^2} \right)$
For estimating $w$, the negative log-likelihood under Gaussian noise has the same form as the least squares objective
Alternatively, we can model the data (only the $y_i$-s) as being generated from a distribution defined by exponentiating the negative of the objective function

What Data Model Produces the Ridge Objective?
We have the ridge regression objective; let $\mathcal{D} = (x_i, y_i)_{i=1}^N$ denote the data:
$L_{\text{ridge}}(w; \mathcal{D}) = (y - Xw)^T (y - Xw) + \lambda w^T w$
Let's rewrite this objective slightly, scaling by $\frac{1}{2\sigma^2}$ and setting $\lambda = \frac{\sigma^2}{\tau^2}$. To avoid ambiguity, we'll denote this by $\tilde{L}$:
$\tilde{L}_{\text{ridge}}(w; \mathcal{D}) = \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) + \frac{1}{2\tau^2} w^T w$
Let $\Sigma = \sigma^2 I_N$ and $\Lambda = \tau^2 I_D$, where $I_m$ denotes the $m \times m$ identity matrix:
$\tilde{L}_{\text{ridge}}(w) = \frac{1}{2} (y - Xw)^T \Sigma^{-1} (y - Xw) + \frac{1}{2} w^T \Lambda^{-1} w$
Taking the negation of $\tilde{L}_{\text{ridge}}(w; \mathcal{D})$ and exponentiating gives a non-negative function of $w$ and $\mathcal{D}$ which, after normalisation, gives a density function:
$f(w; \mathcal{D}) = \exp\left( -\frac{1}{2} (y - Xw)^T \Sigma^{-1} (y - Xw) \right) \exp\left( -\frac{1}{2} w^T \Lambda^{-1} w \right)$

Bayesian Linear Regression (and connections to Ridge)
Let's start with the form of the density function we had on the previous slide and factor it:
$f(w; \mathcal{D}) = \exp\left( -\frac{1}{2} (y - Xw)^T \Sigma^{-1} (y - Xw) \right) \exp\left( -\frac{1}{2} w^T \Lambda^{-1} w \right)$
We'll treat $\sigma$ as fixed and not treat it as a parameter. Up to a constant factor (which doesn't matter when optimising w.r.t. $w$), we can rewrite this as
$\underbrace{p(w \mid X, y)}_{\text{posterior}} \propto \underbrace{\mathcal{N}(y \mid Xw, \Sigma)}_{\text{likelihood}} \cdot \underbrace{\mathcal{N}(w \mid 0, \Lambda)}_{\text{prior}}$
where $\mathcal{N}(\cdot \mid \mu, \Sigma)$ denotes the density of the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$
What the ridge objective is actually finding is the maximum a posteriori (MAP) estimate, which is the mode of the posterior distribution
The linear model is as described before, with Gaussian noise
The prior distribution on $w$ is assumed to be a spherical Gaussian
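Spelled out (a routine step not shown on the slide), the MAP estimate coincides with the ridge estimate because

```latex
\hat{w}_{\text{MAP}}
  = \arg\max_w \log p(w \mid X, y)
  = \arg\min_w \left[ \tfrac{1}{2\sigma^2} \|y - Xw\|^2 + \tfrac{1}{2\tau^2} \|w\|^2 \right]
  = \arg\min_w \left[ \|y - Xw\|^2 + \tfrac{\sigma^2}{\tau^2} \|w\|^2 \right]
  = \hat{w}_{\text{ridge}} \quad \text{with } \lambda = \sigma^2 / \tau^2 .
```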

Bayesian Machine Learning
In the discriminative framework, we model the output $y$ as a probability distribution given the input $x$ and the parameters $w$, say $p(y \mid w, x)$
In the Bayesian view, we assume a prior on the parameters $w$, say $p(w)$
This prior represents a belief about the model; the uncertainty in our knowledge is expressed mathematically as a probability distribution
When observations $\mathcal{D} = (x_i, y_i)_{i=1}^N$ are made, the belief about the parameters $w$ is updated using Bayes' rule
Bayes' Rule: for events $A$, $B$, $\Pr[A \mid B] = \frac{\Pr[B \mid A] \Pr[A]}{\Pr[B]}$
The posterior distribution on $w$ given the data $\mathcal{D}$ becomes: $p(w \mid \mathcal{D}) \propto p(y \mid w, X) \, p(w)$

Coin Toss Example
Let us consider the Bernoulli model for a coin toss, for $\theta \in [0, 1]$: $p(H \mid \theta) = \theta$
Suppose after three independent coin tosses, you get T, T, T.
What is the maximum likelihood estimate for $\theta$?
What is the posterior distribution over $\theta$, assuming a uniform prior on $\theta$?

Coin Toss Example
Let us consider the Bernoulli model for a coin toss, for $\theta \in [0, 1]$: $p(H \mid \theta) = \theta$
Suppose after three independent coin tosses, you get T, T, T.
What is the maximum likelihood estimate for $\theta$?
What is the posterior distribution over $\theta$, assuming a $\mathrm{Beta}(2, 2)$ prior on $\theta$?
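One way to carry out these updates (worked out here rather than on the slides): with $k$ heads in $n$ tosses, a $\mathrm{Beta}(a, b)$ prior is conjugate to the Bernoulli likelihood, so

```latex
p(\theta \mid \mathcal{D}) \;\propto\; \theta^{k}(1-\theta)^{n-k}\,\theta^{a-1}(1-\theta)^{b-1}
  \;\Longrightarrow\; \theta \mid \mathcal{D} \sim \mathrm{Beta}(a + k,\; b + n - k).
```

For T, T, T ($k = 0$, $n = 3$): the MLE is $\hat{\theta} = 0$, the uniform prior ($\mathrm{Beta}(1,1)$) gives posterior $\mathrm{Beta}(1, 4)$, and the $\mathrm{Beta}(2,2)$ prior gives posterior $\mathrm{Beta}(2, 5)$.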

Full Bayesian Prediction
Let us recall the posterior distribution over the parameters $w$ in the Bayesian approach:
$\underbrace{p(w \mid X, y)}_{\text{posterior}} \propto \underbrace{p(y \mid X, w)}_{\text{likelihood}} \cdot \underbrace{p(w)}_{\text{prior}}$
If we use the MAP estimate, as we get more samples the posterior peaks at the MLE
When data is scarce, rather than picking a single estimator (like MAP), we can sample from the full posterior
For a new point $x_{\text{new}}$, we can output the entire distribution over our prediction $\hat{y}$ as
$p(\hat{y} \mid \mathcal{D}) = \int_w \underbrace{p(\hat{y} \mid w, x_{\text{new}})}_{\text{model}} \, \underbrace{p(w \mid \mathcal{D})}_{\text{posterior}} \, dw$
This integration is often computationally very hard!

Full Bayesian Approach for Linear Regression
For the linear model with Gaussian noise and a Gaussian prior on $w$, the full Bayesian predictive distribution for a new point $x_{\text{new}}$ can be expressed in closed form:
$p(\hat{y} \mid \mathcal{D}, x_{\text{new}}, \sigma^2) = \mathcal{N}\left( w_{\text{MAP}}^T x_{\text{new}}, \, \sigma(x_{\text{new}})^2 \right)$
See Murphy, Sec. 7.6 for the calculations
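A minimal sketch of the corresponding closed-form posterior and predictive variance (an illustration under the spherical-Gaussian-prior assumptions above; the values of $\sigma^2$ and $\tau^2$ are fixed by hand here):

```python
import numpy as np

def bayes_linreg(X, y, sigma2, tau2):
    """Posterior N(m, S) for w under y = Xw + noise, noise ~ N(0, sigma2), prior w ~ N(0, tau2 * I)."""
    D = X.shape[1]
    S = np.linalg.inv(X.T @ X / sigma2 + np.eye(D) / tau2)
    m = S @ X.T @ y / sigma2          # posterior mean = MAP estimate for a Gaussian posterior
    return m, S

def predict(x_new, m, S, sigma2):
    """Predictive mean and variance at x_new: N(m^T x_new, sigma2 + x_new^T S x_new)."""
    return x_new @ m, sigma2 + x_new @ S @ x_new

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=50)

m, S = bayes_linreg(X, y, sigma2=0.04, tau2=1.0)
print(predict(np.array([1.0, 0.0, 0.0]), m, S, sigma2=0.04))
```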

Summary: Bayesian Machine Learning
In the Bayesian view, in addition to modelling the output $y$ as a random variable given the parameters $w$ and input $x$, we also encode a prior belief about the parameters $w$ as a probability distribution $p(w)$
If the prior has a parametric form, its parameters are called hyperparameters
The posterior over the parameters $w$ is updated given data
Either pick point (plug-in) estimates, e.g., maximum a posteriori
Or, as in the full Bayesian approach, use the entire posterior to make predictions (this is often computationally intractable)
How do we choose the prior?

Outline
- Basis Function Expansion
- Overfitting and the Bias-Variance Tradeoff
- Ridge Regression and Lasso
- Bayesian Approach to Machine Learning
- Model Selection

How to Choose Hyperparameters?
So far, we were just trying to estimate the parameters $w$
For ridge regression or the lasso, we need to choose $\lambda$
If we perform basis expansion:
- For kernels, we need to pick the width parameter $\gamma$
- For polynomials, we need to pick the degree $d$
For more complex models there may be more hyperparameters

Using a Validation Set
Divide the data into parts: training, validation (and testing)
Grid search: choose values for the hyperparameters from a finite set
Train the model using the training set and evaluate it on the validation set

  λ       training error (%)   validation error (%)
  0.01     0                    89
  0.1      0                    43
  1        2                    12
  10      10                     8
  100     25                    27

Pick the value of λ that minimises the validation error
Typically, split the data as 80% for training, 20% for validation
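A minimal sketch of this grid-search procedure for the ridge penalty λ (an illustration; the 80/20 split, the grid, and the synthetic data are assumptions):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 50))
y = X[:, 0] + 0.2 * rng.normal(size=200)

# 80% / 20% split into training and validation sets.
X_tr, y_tr = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

best_lam, best_err = None, np.inf
for lam in [0.01, 0.1, 1, 10, 100]:
    w = ridge_fit(X_tr, y_tr, lam)
    err = np.mean((X_val @ w - y_val) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
print("chosen lambda:", best_lam)
```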

Training and Validation Curves
[Figure: training and validation error vs $\lambda$ for the lasso.]
The validation error curve is U-shaped

K-Fold Cross-Validation
When data is scarce, instead of splitting into training and validation sets:
- Divide the data into K parts
- Use K − 1 parts for training and 1 part as validation
Commonly set K = 5 or K = 10
When K = N (the number of datapoints), it is called LOOCV (leave-one-out cross-validation)

  Run 1:  train  train  train  train  valid
  Run 2:  train  train  train  valid  train
  Run 3:  train  train  valid  train  train
  Run 4:  train  valid  train  train  train
  Run 5:  valid  train  train  train  train
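A minimal sketch of K-fold cross-validation for choosing λ (illustrative only; in practice a library routine such as sklearn.model_selection.KFold would typically be used):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, K=5):
    """Average validation MSE over K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
y = X[:, 0] + 0.2 * rng.normal(size=100)

for lam in [0.1, 1, 10, 100]:
    print(lam, round(kfold_cv_error(X, y, lam), 4))
```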

Overfitting on the Validation Set
Suppose you do all the right things:
- Train on the training set
- Choose hyperparameters using proper validation
- Test on the test set (real world), and your error is unacceptably high!
What would you do?

Winning Kaggle Without Reading the Data!
Suppose the task is to predict $N$ binary labels
Algorithm (Wacky Boosting):
1. Choose $y^1, \dots, y^k \in \{0, 1\}^N$ randomly
2. Set $I = \{i : \text{accuracy}(y^i) > 51\%\}$
3. Output $\hat{y}_j = \text{majority}\{y^i_j : i \in I\}$
Source: blog.mrtz.org
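A small simulation of the idea (my own sketch, not code from the source post): accuracy is scored against a held-out "leaderboard" labelling, and aggregating the lucky random guesses overfits that leaderboard while doing nothing on fresh labels.

```python
import numpy as np

rng = np.random.default_rng(8)
N, k = 4000, 500
public = rng.integers(0, 2, size=N)     # 'leaderboard' labels used for scoring submissions
private = rng.integers(0, 2, size=N)    # fresh labels the attacker never sees

guesses = rng.integers(0, 2, size=(k, N))                 # k purely random submissions
acc_public = (guesses == public).mean(axis=1)
I = acc_public > 0.51                                      # keep only the lucky guesses
y_hat = (guesses[I].mean(axis=0) > 0.5).astype(int)        # majority vote of the lucky guesses

print("public  accuracy:", round((y_hat == public).mean(), 3))   # noticeably above 0.5
print("private accuracy:", round((y_hat == private).mean(), 3))  # about 0.5
```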

Next Time: Optimization Algorithms
Read up on gradients and multivariate calculus