Machine learning - HT 2016: Basis Expansion, Regularization, Validation

Machine learning - HT 2016. 4. Basis Expansion, Regularization, Validation. Varun Kanade, University of Oxford. February 03, 2016

Outline
Introduce basis functions to go beyond linear regression
Understand the tradeoff between bias and variance
Overfitting: what happens when we make models too complex
Regularization as a means to control model complexity
Cross-validation to perform model selection

Basis Expansion
[Figure: scatter plot of the training data]

Basis Expansion
φ(x) = [1, x, x^2, x^3, x^4]
w_1 + w_2 x + w_3 x^2 + w_4 x^3 + w_5 x^4 = φ(x)^T [w_1, w_2, w_3, w_4, w_5]
[Figure: degree-4 polynomial fit to the data]
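
As a concrete illustration (not from the slides), the sketch below builds the degree-4 polynomial feature map φ(x) = [1, x, x^2, x^3, x^4] and fits the weights by ordinary least squares; the toy data generator is an assumption made purely for the example.

```python
import numpy as np

def poly_features(x, degree):
    # phi(x) = [1, x, x^2, ..., x^degree] for each scalar input x
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)                              # toy inputs (assumed)
y = 0.5 * x + np.sin(x) + rng.normal(0, 0.3, size=x.shape)   # toy targets (assumed)

Phi = poly_features(x, degree=4)                 # design matrix of basis functions
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares fit of w in phi(x)^T w
print("fitted weights:", w)
```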

Basis Expansion
φ(x) = [1, x, x^2]
[Figure: quadratic fit to the data]

Basis Expansion
φ(x) = [1, x, x^2, x^3, ..., x^9]
[Figure: degree-9 polynomial fit to the data]

Basis Expansion
φ(x) = [1, x, x^2, x^3, ..., x^d]
How can we avoid overfitting? Does more data help?
[Figure: high-degree polynomial fit to the data]

Bias Variance Tradeoff
For a linear model, more data would make little difference.
[Figure: linear fits on a smaller and a larger training set]
Bias results from the model being simpler than the truth. High bias results in underfitting.

Bias Variance Tradeoff
What happens when we fit the model on different (randomly drawn) training datasets?
[Figure: the same complex model fit to two different random training sets]
Variance arises when the (complex) model is sensitive to fluctuations in the training dataset. High variance results in overfitting.

Bias Variance Tradeoff
When does more data help?
Error = Bias^2 + Variance + Noise (Exercise: derive this for linear regression)
[Figure: fits on a smaller and a larger training set]
For more complex models, it is difficult to visually identify overfitting and underfitting. Keep aside some points as a test set.
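
One rough way to see the decomposition numerically is to refit the same model class on many independently drawn training sets and compare the average prediction to the truth. The sketch below is my own illustration with an assumed ground-truth function sin(x); it estimates the bias^2 and variance terms on a grid of test points.

```python
import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.sin(x)              # assumed ground truth for illustration
x_test = np.linspace(0, 10, 50)
degree, n_train, n_repeats, noise = 3, 20, 200, 0.3

preds = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x = rng.uniform(0, 10, n_train)
    y = f_true(x) + rng.normal(0, noise, n_train)
    w, *_ = np.linalg.lstsq(np.vander(x, degree + 1, increasing=True), y, rcond=None)
    preds[r] = np.vander(x_test, degree + 1, increasing=True) @ w

bias2 = np.mean((preds.mean(axis=0) - f_true(x_test)) ** 2)    # squared bias, averaged over test points
variance = np.mean(preds.var(axis=0))                          # variance across training sets
print(f"bias^2 ~ {bias2:.3f}, variance ~ {variance:.3f}, noise ~ {noise**2:.3f}")
```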

Learning Curves
Suppose we have a training set and a test set. Train on increasing sizes of the training set, and plot the errors.
[Figure: training and test error as a function of training set size]
Once training and test error approach each other, more data won't help!
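
A minimal sketch of such a learning curve, again on assumed toy data: train a fixed-degree polynomial model on increasingly large training sets and report training and test mean squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
def make_data(n):
    # assumed toy generator: y = sin(x) + Gaussian noise
    x = rng.uniform(0, 10, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_test, y_test = make_data(200)
degree = 5
for n in range(10, 41, 5):
    x_tr, y_tr = make_data(n)
    Phi_tr = np.vander(x_tr, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    err_tr = np.mean((Phi_tr @ w - y_tr) ** 2)
    err_te = np.mean((np.vander(x_test, degree + 1, increasing=True) @ w - y_test) ** 2)
    print(f"n={n:3d}  train MSE={err_tr:.3f}  test MSE={err_te:.3f}")
```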

Basis Expansion Using Kernels
We can use kernels as features, e.g., radial basis functions (RBFs).
Feature expansion: φ(x) = [1, κ(x, µ_1, σ), ..., κ(x, µ_d, σ)], e.g., κ(x, µ_i, σ) = exp(−(x − µ_i)^2 / (2σ^2))
Model: y = φ(x)^T w + noise
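
The sketch below shows one way such an RBF feature map could be implemented; the grid of centres µ_1, ..., µ_d and the exponent normalisation exp(−(x − µ)^2 / (2σ^2)) are my assumptions for illustration rather than the slides' exact choices.

```python
import numpy as np

def rbf_features(x, centres, sigma):
    # phi(x) = [1, k(x, mu_1, sigma), ..., k(x, mu_d, sigma)]
    # with k(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2))
    k = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * sigma ** 2))
    return np.hstack([np.ones((x.size, 1)), k])

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 40)
y = np.sin(x) + rng.normal(0, 0.2, x.size)       # assumed toy data

centres = np.linspace(0, 10, 10)                 # mu_1, ..., mu_d on a grid (an assumption)
Phi = rbf_features(x, centres, sigma=1.0)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear model in the RBF features
print("number of features:", Phi.shape[1])
```

The width σ plays the same role as the polynomial degree: very narrow bumps can interpolate the noise, very wide ones underfit.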

Basis Expansion Using Kernels
As in the case of polynomials, the width σ can cause overfitting or underfitting.
Image Source: K. Murphy (2012)

Polynomial Basis Expansion in Higher Dimensions
We are basically fitting linear models (at the cost of increasing the dimension): y = φ(x)^T w + noise
Linear Model: φ(x) = [1, x_1, x_2]
Quadratic Model: φ(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]
Image Source: K. Murphy (2012)
How many dimensions do you get for degree d polynomials over n variables? (See the counting sketch below.)
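
The number of such features is the stars-and-bars count C(n + d, d): all monomials of total degree at most d in n variables, including the constant. A small snippet (the helper name num_poly_features is mine) evaluates it.

```python
from math import comb

def num_poly_features(n_vars, degree):
    # Distinct monomials of total degree <= degree in n_vars variables
    # (including the constant term): C(n_vars + degree, degree).
    return comb(n_vars + degree, degree)

print(num_poly_features(2, 2))     # 6, matching [1, x1, x2, x1^2, x2^2, x1*x2]
print(num_poly_features(100, 10))  # ~4.7e13 distinct monomials; the cruder count of
                                   # ordered degree-10 products is 100**10 = 1e20
```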

Overfitting!
In high dimensions, we can have many, many parameters! With 100 variables and degree 10 we have about 10^20 parameters!
Enrico Fermi to Freeman Dyson: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." [video]
How do we prevent overfitting?

How does overfitting occur?
Suppose X ∈ R^{100×100} with every entry drawn from N(0, 1), and let y_i = x_{i,1} + N(0, σ^2), with σ = 0.2.
[Figure: fitted least-squares coefficients over the 100 input dimensions]
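
Here is a small simulation in the spirit of this setup (the seed and the reported summary are my own choices, so the exact numbers will differ from the slide's plot): with as many features as data points, ordinary least squares fits the noise and places non-trivial weight on the 99 irrelevant dimensions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 100, 100, 0.2
X = rng.normal(size=(n, d))
y = X[:, 0] + rng.normal(0, sigma, n)            # only the first feature matters

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # unregularized least squares
print("true signal uses only feature 1, but OLS spreads weight everywhere:")
print("largest |w_j| among the irrelevant features:", np.max(np.abs(w_ols[1:])))
```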

Ridge Regression
Say our data is (x_i, y_i)_{i=1}^m, where x ∈ R^N and N is really, really large!
We used the squared loss L(w) = (Xw − y)^T (Xw − y) and obtained the estimate ŵ = (X^T X)^{-1} X^T y.
Suppose we want w to be small (weight decay):
L(w) = (Xw − y)^T (Xw − y) + λ w^T w − λ w_1^2
We will not regularize the bias (or constant) term.
Exercise: If all x_i (except the constant x_1) have mean 0 and y has mean 0, then ŵ_1 = 0.

Deriving Estimate for Ridge Regression
L(w) = (Xw − y)^T (Xw − y) + λ w^T w
Setting the gradient to zero, ∇L(w) = 2 X^T (Xw − y) + 2λw = 0, gives ŵ = (X^T X + λI)^{-1} X^T y.

Ridge Regression
Objective/Loss: L(w) = (Xw − y)^T (Xw − y) + λ w^T w
Estimate: ŵ = (X^T X + λI)^{-1} X^T y
The estimate depends on the scaling of the inputs. Common practice: normalise all input dimensions.
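
A direct sketch of the closed-form estimate; note this version penalizes every coefficient, whereas the slides suggest leaving the bias term unregularized, which needs a small modification.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate: w_hat = (X^T X + lam * I)^(-1) X^T y.
    # Solving the linear system is preferred to forming the explicit inverse.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 100))
y = X[:, 0] + rng.normal(0, 0.2, 100)
for lam in (0.0, 0.1, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w||_2 = {np.linalg.norm(w):.3f}")
```

Larger λ shrinks the weight vector, trading some bias for lower variance.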

Ridge Regression and MAP Estimation
Objective/Loss: L(w) = (Xw − y)^T (Xw − y) + λ w^T w

Ridge Regression and MAP Estimation
Objective/Loss: L(w) = (Xw − y)^T (Xw − y) + λ w^T w
exp(−L(w) / (2σ^2)) = exp(−(1/2) (y − Xw)^T Σ^{-1} (y − Xw)) · exp(−(1/2) w^T Λ^{-1} w), with Σ = σ^2 I and Λ = (σ^2/λ) I.
So exp(−L(w)/(2σ^2)) is proportional to a Gaussian likelihood times a Gaussian prior on w, and minimizing L(w) is MAP estimation.

Bayesian View of Machine Learning
General Formulation: prior p(w) on w; model p(y | x, w).
Linear Regression: p(w) = N(w | 0, τ^2 I_n), p(y | x, w) = N(y | x^T w, σ^2).
Compute the posterior given data D = (x_i, y_i)_{i=1}^m: p(w | D) ∝ p(w) ∏_{i=1}^m p(y_i | x_i, w)

Bayesian View of Regression
Making a prediction on a new point x_new:
p(y | D, x_new) = ∫_w p(y | w, x_new) p(w | D) dw

Prediction: MAP vs Fully Bayesian
MAP Approach: y_new ∼ N(x_new^T w_MAP, σ^2)
Bayesian Approach: y_new ∼ N(x_new^T w_MAP, σ^2_{x_new}), where σ^2_{x_new} = σ^2 (1 + x_new^T (X^T X + (σ^2/τ^2) I)^{-1} x_new)
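
A sketch of both predictions under the Gaussian prior and likelihood from the previous slides; the data generator and the values of σ^2 and τ^2 are assumptions for illustration.

```python
import numpy as np

def bayes_linreg_predict(X, y, x_new, sigma2, tau2):
    # Posterior for w under prior N(0, tau2 * I) and noise variance sigma2,
    # then the predictive mean and variance at x_new.
    d = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(d)
    w_map = np.linalg.solve(A, X.T @ y)                  # posterior mean = MAP estimate
    mean = x_new @ w_map                                 # MAP prediction
    var = sigma2 * (1.0 + x_new @ np.linalg.solve(A, x_new))   # fully Bayesian predictive variance
    return mean, var

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.3, 50)
m, v = bayes_linreg_predict(X, y, np.array([0.5, 0.1, -1.0]), sigma2=0.09, tau2=1.0)
print(f"predictive mean {m:.3f}, predictive std {np.sqrt(v):.3f}")
```

The two approaches share the same mean; the fully Bayesian predictive variance grows for test points far from the training data.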

Ridge Regression
Minimize: (Xw − y)^T (Xw − y) + λ w^T w
Constrained form: minimize (Xw − y)^T (Xw − y) such that w^T w ≤ R

Ridge Regression
Image Source: Hastie, Tibshirani, Friedman (2013)

Lasso Regression
Minimize: (Xw − y)^T (Xw − y) + λ Σ_{i=1}^N |w_i|
Constrained form: minimize (Xw − y)^T (Xw − y) such that Σ_{i=1}^N |w_i| ≤ R

Lasso Regression
Image Source: Hastie, Tibshirani, Friedman (2013)

Regularization: Ridge or Lasso
[Figure: fitted coefficients over the 100 dimensions under no regularization, ridge, and lasso]
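
To reproduce the qualitative picture, one option is scikit-learn's Ridge and Lasso estimators (the choice of library and of the regularization strengths is mine, not the slides'): ridge shrinks all coefficients, while lasso drives most of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 100))
y = X[:, 0] + rng.normal(0, 0.2, 100)          # same toy setup as before

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1, max_iter=10000))]:
    w = model.fit(X, y).coef_
    nonzero = np.sum(np.abs(w) > 1e-6)
    print(f"{name:5s}  nonzero coefficients: {nonzero:3d}   ||w||_2 = {np.linalg.norm(w):.3f}")
```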

Choosing Parameters
Before, we were just trying to find w. Now, we also have to worry about how to choose λ for ridge or lasso.
If we use kernels, we also have to pick the width σ.
If we use higher-degree polynomials, we have to pick the degree d.

Cross Validation
Keep a part of the data as a validation (or test) set. Look at the error on the training and validation sets.

λ       Train Error (%)   Test Error (%)
0.01    5                 27
0.1     10                28
1       12                22
10      20                43
100     20                89

k-fold Cross Validation
What do we do when data is scarce? Divide the data into k parts.
Use k − 1 parts for training and 1 part as validation; rotate through the k folds and average the validation errors.
When k = m (the number of datapoints), we get LOOCV (leave-one-out cross-validation).
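
A bare-bones sketch of k-fold cross-validation used to compare values of λ for the ridge estimate; the synthetic data and the candidate λ grid are assumptions for illustration.

```python
import numpy as np

def kfold_mse(X, y, lam, k=5):
    # Plain k-fold cross-validation for the closed-form ridge estimate:
    # split the indices into k folds, hold each fold out once, average the validation MSE.
    n = X.shape[0]
    idx = np.random.default_rng(8).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        tr = np.setdiff1d(idx, fold)
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]), X[tr].T @ y[tr])
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 20))
y = X[:, 0] + rng.normal(0, 0.2, 100)
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"lambda={lam:6.2f}  CV MSE={kfold_mse(X, y, lam):.4f}")
```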

Overfitting on the Validation Set
Suppose you do all the right things: train on the training set, choose hyperparameters using proper validation, then test on the test set, and your error is high!
What would you do?

Winning Kaggle without reading the data!
Suppose the task is to predict m binary labels.
Algorithm (Wacky Boosting):
1. Choose y^1, ..., y^k ∈ {0, 1}^m uniformly at random.
2. Set I = {i : accuracy(y^i) > 51%}.
3. Output ŷ_j = majority{y^i_j : i ∈ I}.
Source: blog.mrtz.org
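
A small simulation of this idea (my own, with assumed values of m and k): by keeping only the random label vectors that happen to score above 51% against the fixed evaluation labels and taking a majority vote, the final submission beats 50% on those labels while carrying no information about new data.

```python
import numpy as np

rng = np.random.default_rng(10)
m, k = 2000, 500                       # leaderboard labels and number of random submissions (assumed)
truth = rng.integers(0, 2, m)          # hidden evaluation labels

candidates = rng.integers(0, 2, (k, m))                   # step 1: uniformly random guesses
acc = (candidates == truth).mean(axis=1)                  # leaderboard feedback for each guess
I = acc > 0.51                                            # step 2: keep the "lucky" guesses
y_hat = (candidates[I].mean(axis=0) > 0.5).astype(int)    # step 3: per-label majority vote

print(f"kept {I.sum()} of {k} submissions; accuracy on the leaderboard labels: "
      f"{np.mean(y_hat == truth):.3f}")
```

The gain is pure overfitting of the repeated leaderboard queries; on a fresh test set the same procedure is back at 50%.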

Next Time
Optimization Methods
Read up on gradients and multivariate calculus.