Ridge Regression. Mohammad Emtiyaz Khan EPFL Oct 1, 2015


Motivation

Linear models can be too limited and usually underfit. One way to address this is to use nonlinear basis functions instead.

Nonlinear basis functions

In general, we can use basis functions that are known and fixed nonlinear transformations applied to the input variables. Given a D-dimensional input vector $x$, let us say that we have $M$ basis functions $\phi_j(x)$, indexed by $j$.

Examples of basis functions: polynomials, splines (see Wikipedia), Gaussians, sigmoids, Fourier bases, wavelets.

[Figure: examples of basis functions. Taken from Bishop, Chapter 3.]
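
To make the basis-function idea concrete, here is a minimal sketch (my own illustration, not part of the original notes) of building polynomial and Gaussian (RBF) expansions for a one-dimensional input; the helper names, the number of basis functions, and the RBF width are arbitrary choices.

```python
import numpy as np

def polynomial_basis(x, M):
    """Polynomial basis: phi_j(x) = x**j for j = 1, ..., M (the constant term is added later)."""
    return np.column_stack([x**j for j in range(1, M + 1)])

def gaussian_basis(x, centers, width):
    """Gaussian (RBF) basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))

x = np.linspace(-1, 1, 50)                                 # toy 1-D inputs
Phi_poly = polynomial_basis(x, M=5)                        # shape (50, 5)
Phi_rbf = gaussian_basis(x, np.linspace(-1, 1, 5), 0.3)    # shape (50, 5)
print(Phi_poly.shape, Phi_rbf.shape)
```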

Linear basis function model

The model is given as follows:
$$ y_n = \beta_0 + \sum_{j=1}^{M} \beta_j \phi_j(x_n) = \phi(x_n)^T \beta, $$
where $\phi(x_n)^T = [1, \phi_1(x_n), \phi_2(x_n), \ldots, \phi_M(x_n)]$.

This model is linear in $\beta$ but nonlinear in $x$. Note that the dimensionality is now $M$, not $D$.

Defining the matrix $\Phi$ with the $\phi(x_n)^T$ as rows,
$$ \Phi := \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1) \\ 1 & \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2) \\ 1 & \phi_1(x_3) & \phi_2(x_3) & \cdots & \phi_M(x_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N) \end{bmatrix} $$
the least-squares solution is
$$ \hat{\beta}_{\text{lse}} = (\Phi^T \Phi)^{-1} \Phi^T y. $$

Two issues

1. The model can potentially overfit.
2. The Gram matrix $\Phi^T \Phi$ can be ill-conditioned.
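
As a sketch of how $\Phi$ and the least-squares solution might look in code (my own example, not from the notes), the snippet below fits a degree-5 polynomial basis to toy data; the data-generating function and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5

# Toy data: a noisy sine curve
x = np.linspace(-1, 1, N)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(N)

# Phi: a leading column of ones (for beta_0) plus M polynomial basis functions
Phi = np.column_stack([np.ones(N)] + [x**j for j in range(1, M + 1)])  # shape (N, M + 1)

# Least-squares solution beta_lse = (Phi^T Phi)^{-1} Phi^T y,
# computed with lstsq instead of an explicit matrix inverse
beta_lse, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta_lse)
```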

Regularization and ridge regression

Through regularization, we can penalize complex models and favor simpler ones:
$$ \min_{\beta} \; L(\beta) + \frac{\lambda}{2N} \sum_{j=1}^{M} \beta_j^2 $$
The second term is a regularizer. The main point is that an input variable weighted by a small $\beta_j$ has less influence on the output. Note that $\beta_0$ is not penalized. Setting $\lambda = 0$, we get back to least squares when $L$ is the MSE.

Ridge regression

When $L$ is the MSE, this is called ridge regression:
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n - \phi(x_n)^T \beta \right]^2 + \frac{\lambda}{2N} \sum_{j=1}^{M} \beta_j^2 $$
Differentiating and setting the gradient to zero gives
$$ \hat{\beta}_{\text{ridge}} = \left( \Phi^T \Phi + \lambda I_M \right)^{-1} \Phi^T y. $$
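
Here is a minimal sketch of the closed-form ridge solution (my own code, not from the notes). It assumes the first column of Phi is the constant feature and leaves $\beta_0$ unpenalized, as the notes prescribe.

```python
import numpy as np

def ridge_solution(Phi, y, lam):
    """Closed-form ridge estimate (Phi^T Phi + lam * I)^{-1} Phi^T y,
    with the coefficient of the leading constant column left unpenalized."""
    D = Phi.shape[1]
    reg = lam * np.eye(D)
    reg[0, 0] = 0.0                      # do not penalize beta_0
    # Solve the linear system instead of forming an explicit inverse
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

# Example usage with the Phi and y from the previous sketch:
# beta_ridge = ridge_solution(Phi, y, lam=0.1)
```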

Ridge regression to fight ill-conditioning

The eigenvalues of $\Phi^T \Phi + \lambda I_M$ are at least $\lambda$. This is also referred to as "lifting" the eigenvalues. Proof: write the eigenvalue decomposition of $\Phi^T \Phi$ as $Q S Q^T$ and use the fact that $Q Q^T = I_M$. Therefore, ridge regression improves the condition number of the Gram matrix.

Ridge regression as a MAP estimator

Assume $\beta_0 = 0$ for this discussion. Recall that least squares can be interpreted as the maximum likelihood estimator:
$$ \hat{\beta}_{\text{lse}} = \arg\max_{\beta} \left[ \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^T \beta, \sigma^2) \right] $$
Ridge regression has a very similar interpretation:
$$ \hat{\beta}_{\text{ridge}} = \arg\max_{\beta} \left[ \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^T \beta, \lambda) \right] + \log \mathcal{N}(\beta \mid 0, I) $$
This is called a maximum a-posteriori (MAP) estimate.
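
A quick numerical check of the eigenvalue-lifting argument (my own sketch, not from the notes): with two nearly collinear features, the Gram matrix is badly conditioned, and adding $\lambda I$ raises its smallest eigenvalue to at least $\lambda$ and improves the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features give an ill-conditioned Gram matrix Phi^T Phi
x = rng.standard_normal(100)
Phi = np.column_stack([x, x + 1e-4 * rng.standard_normal(100)])
gram = Phi.T @ Phi

lam = 0.1
lifted = gram + lam * np.eye(2)

print("smallest eigenvalue before:", np.linalg.eigvalsh(gram)[0])
print("smallest eigenvalue after: ", np.linalg.eigvalsh(lifted)[0])   # >= lam
print("condition number before:  ", np.linalg.cond(gram))
print("condition number after:   ", np.linalg.cond(lifted))
```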

Additional Notes

Other types of regularization

Popular methods such as shrinkage, weight decay (in the context of neural networks), and early stopping are all different forms of regularization.

Another view of regularization: the regularized optimization problem is similar to the following constrained optimization problem (for some $\tau > 0$):
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \phi(x_n)^T \beta \right)^2, \quad \text{such that} \quad \beta^T \beta \leq \tau \qquad (1) $$
The following picture illustrates this.

Figure 1: Geometric interpretation of Ridge Regression.
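
The correspondence between the penalty $\lambda$ and the constraint radius $\tau$ can also be seen numerically; the sketch below (my own illustration, not from the notes) shows that growing $\lambda$ shrinks the norm of the ridge solution, i.e. it corresponds to a smaller $\tau$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 5
Phi = rng.standard_normal((N, D))
y = Phi @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

# The norm of the ridge solution shrinks as lambda grows (smaller effective tau)
for lam in [0.0, 0.1, 1.0, 10.0]:
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
    print(f"lambda = {lam:5.1f}   ||beta||_2 = {np.linalg.norm(beta):.3f}")
```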

Another popular regularizer is the L1 regularizer (also known as the lasso):
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \phi(x_n)^T \beta \right)^2, \quad \text{such that} \quad \sum_{i=1}^{M} |\beta_i| \leq \tau \qquad (2) $$
This forces some of the elements of $\beta$ to be exactly 0 and therefore forces sparsity in the model: some features are not used since their coefficients are zero (see the sketch at the end of this section).

Why does the L1 regularizer force sparsity? Hint: draw a picture similar to the one above for the lasso and look at the optimal solution. Why is it good to have sparsity in the model? Is it going to be better than least squares? When and why?

MAP estimate and the Bayes rule

In general, a MAP estimate maximizes the product of the likelihood and the prior, as opposed to an ML estimate, which maximizes only the likelihood:
$$ \hat{\beta}_{\text{lik}} = \arg\max_{\beta} \; p(y \mid X, \beta, \lambda) \qquad (3) $$
$$ \hat{\beta}_{\text{map}} = \arg\max_{\beta} \; p(y \mid X, \beta, \lambda) \, p(\beta \mid \theta) \qquad (4) $$
Here $p(y \mid X, \beta, \lambda)$ defines the likelihood of observing the data $y$ given $X$, $\beta$, and some likelihood parameters $\lambda$. Similarly, $p(\beta \mid \theta)$ is the prior distribution; it incorporates our prior knowledge about $\beta$.

Using the Bayes rule,
$$ p(A, B) = p(A \mid B) \, p(B) = p(B \mid A) \, p(A) \qquad (5) $$
we can rewrite the MAP estimate as follows:
$$ \hat{\beta}_{\text{map}} = \arg\max_{\beta} \; p(\beta \mid y, X, \lambda, \theta) \qquad (6) $$
Therefore, the MAP estimate is the maximum of the posterior distribution, which explains the name.
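
Here is the sparsity sketch referred to above (my addition, not from the notes), comparing ridge and lasso coefficients with scikit-learn; the data, the regularization strengths, and the sparsity pattern are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, D = 100, 10

# Only the first three features actually matter
X = rng.standard_normal((N, D))
true_beta = np.zeros(D)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks coefficients but rarely sets them exactly to zero;
# lasso drives the irrelevant coefficients exactly to zero.
print("nonzero ridge coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("nonzero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```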

To do

1. Read Sections 3.1 and 3.3 of Bishop (3.1.1 might be confusing and can be skipped at the first reading).
2. Implement ridge regression and understand how it solves the two problems.
3. (Difficult) Derive the update rule for ridge regression using the Gaussian formulas (Eqs. 2.113-2.117 in Bishop's book). This is detailed in Section 3.3.1 of Bishop's book.