Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen


Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

This Lecture: Advanced Machine Learning Regression Approaches Linear Regression Regularization (Ridge, Lasso) Support Vector Regression Gaussian Processes Learning with Latent Variables EM and Generalizations Dirichlet Processes Structured Output Learning Large-margin Learning

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 3

Recap: Probabilistic Regression First assumption: Our target function values t are generated by adding noise to the ideal function estimate: t = y(x, w) + ε, where t is the target function value, y the regression function, x the input value, w the weights or parameters, and ε the noise. Second assumption: The noise is Gaussian distributed: p(t | x, w, β) = N(t | y(x, w), β^-1), with mean y(x, w) and variance β^-1 (β is the precision). Slide adapted from Bernt Schiele 4

Recap: Probabilistic Regression Given Training data points: X = [x_1, ..., x_n] ∈ R^(d×n) Associated function values: t = [t_1, ..., t_n]^T Conditional likelihood (assuming i.i.d. data): p(t | X, w, β) = ∏_n N(t_n | y(x_n, w), β^-1), to be maximized w.r.t. w and β, with the generalized linear regression function y(x, w) = w^T φ(x). Slide adapted from Bernt Schiele 5

Recap: Maximum Likelihood Regression Setting the gradient to zero yields the closed-form solution w_ML = (Φ Φ^T)^-1 Φ t with Φ = [φ(x_1), ..., φ(x_n)] (same as in least-squares regression!). Least-squares regression is equivalent to Maximum Likelihood under the assumption of Gaussian noise. Slide adapted from Bernt Schiele 6

Recap: Role of the Precision Parameter Also use ML to determine the precision parameter β: setting the gradient w.r.t. β to zero gives 1/β_ML = (1/N) Σ_n (t_n - w_ML^T φ(x_n))^2. The inverse of the noise precision is given by the residual variance of the target values around the regression function. 7
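
A minimal numerical sketch of these two recap slides (not part of the original lecture): it fits w_ML with the closed-form least-squares solution and estimates the noise variance 1/β from the residuals. The toy data and the polynomial basis are assumed example choices.

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (assumed example, not from the lecture)
rng = np.random.default_rng(0)
N = 25
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

# Polynomial basis functions phi_j(x) = x^j, j = 0..M-1, stacked as columns phi(x_n)
M = 6
Phi = np.vstack([x**j for j in range(M)])          # shape (M, N), Phi = [phi(x_1), ..., phi(x_N)]

# Maximum-likelihood (= least-squares) weights: w_ML = (Phi Phi^T)^{-1} Phi t
w_ml = np.linalg.solve(Phi @ Phi.T, Phi @ t)

# Noise precision: 1/beta_ML is the residual variance around the regression function
residuals = t - Phi.T @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)

print("w_ML =", w_ml)
print("estimated noise std =", beta_ml**-0.5)
```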

Recap: Predictive Distribution Having determined the parameters w_ML and β_ML, we can now make predictions for new values of x. This means evaluating p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^-1). Rather than giving a point estimate, we can now also give an estimate of the estimation uncertainty. Image source: C.M. Bishop, 2006 8

Recap: Maximum-A-Posteriori Estimation Introduce a prior distribution over the coefficients w. For simplicity, assume a zero-mean Gaussian distribution: p(w | α) = N(w | 0, α^-1 I). The new hyperparameter α controls the distribution of model parameters. Express the posterior distribution over w using Bayes' theorem: p(w | X, t) ∝ p(t | X, w) p(w | α). We can now determine w by maximizing the posterior. This technique is called maximum-a-posteriori (MAP). 9

Recap: MAP Solution Minimize the negative logarithm of the posterior: -ln p(w | X, t) = (β/2) Σ_n (t_n - w^T φ(x_n))^2 + (α/2) w^T w + const. The MAP solution is therefore w_MAP = argmin_w { (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) ||w||^2 }. Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error (with λ = α/β). 10

MAP Solution (2) Setting the gradient to zero yields w_MAP = (Φ Φ^T + λ I)^-1 Φ t with Φ = [φ(x_1), ..., φ(x_n)]. Effect of regularization: Keeps the inverse well-conditioned. 11
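
A small sketch of the MAP / ridge closed form above, reusing the Phi and t from the previous snippet; the value of lambda in the usage line is an assumed example.

```python
import numpy as np

def fit_map(Phi: np.ndarray, t: np.ndarray, lam: float) -> np.ndarray:
    """MAP / ridge solution w = (Phi Phi^T + lambda*I)^{-1} Phi t.

    Phi has one basis-function vector phi(x_n) per column; lam = alpha/beta.
    The added lambda*I keeps the matrix well-conditioned even when
    Phi Phi^T is (nearly) singular.
    """
    M = Phi.shape[0]
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(M), Phi @ t)

# Example usage with the Phi, t defined in the previous snippet:
# w_map = fit_map(Phi, t, lam=1e-3)
```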

Recap: Bayesian Curve Fitting Given Training data points: X = [x_1, ..., x_n] ∈ R^(d×n) Associated function values: t = [t_1, ..., t_n]^T Our goal is to predict the value of t for a new point x. Evaluate the predictive distribution p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw, where the second factor is what we just computed for MAP and the first factor is the noise distribution, again assumed to be Gaussian here. Assume that the parameters α and β are fixed and known for now. 12

Bayesian Curve Fitting Under those assumptions, the predictive distribution is a Gaussian and can be evaluated analytically: p(t | x, X, t) = N(t | m(x), s^2(x)), where the mean and variance are given by m(x) = β φ(x)^T S Σ_n φ(x_n) t_n and s^2(x) = β^-1 + φ(x)^T S φ(x), and S is the regularized covariance matrix S^-1 = α I + β Σ_n φ(x_n) φ(x_n)^T. 13 Image source: C.M. Bishop, 2006
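
A compact sketch of these predictive equations (cf. Bishop, Ch. 1.2.6), again assuming the Phi, t, and basis setup from the earlier snippets; alpha and beta are treated as fixed, assumed values.

```python
import numpy as np

def bayesian_predictive(Phi, t, phi_star, alpha, beta):
    """Predictive mean m(x*) and variance s^2(x*) for Bayesian linear regression.

    Phi:      (M, N) matrix with columns phi(x_n)
    t:        (N,) target vector
    phi_star: (M,) basis vector phi(x*) of the test point
    """
    M = Phi.shape[0]
    S_inv = alpha * np.eye(M) + beta * (Phi @ Phi.T)   # S^-1 = alpha*I + beta*sum phi phi^T
    S = np.linalg.inv(S_inv)
    mean = beta * phi_star @ S @ (Phi @ t)             # m(x*) = beta * phi^T S sum phi(x_n) t_n
    var = 1.0 / beta + phi_star @ S @ phi_star         # s^2(x*) = 1/beta + phi^T S phi
    return mean, var

# Example usage (hypothetical hyperparameter values):
# mean, var = bayesian_predictive(Phi, t, Phi[:, 0], alpha=2.0, beta=11.1)
```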

Analyzing the result Analyzing the variance of the predictive distribution, s^2(x) = β^-1 + φ(x)^T S φ(x): the first term represents the uncertainty in the predicted value due to noise on the target variables (expressed already in ML), the second term the uncertainty in the parameters w (consequence of the Bayesian treatment). 14

Bayesian Predictive Distribution Important difference to previous example Uncertainty may vary with test point x! 15 Image source: C.M. Bishop, 2006

Discussion We now have a better understanding of regression Least-squares regression: Assumption of Gaussian noise We can now also plug in different noise models and explore how they affect the error function. L2 regularization as a Gaussian prior on parameters w. We can now also use different regularizers and explore what they mean. This lecture General formulation with basis functions φ(x). We can now also use different basis functions. 16

Discussion General regression formulation In principle, we can perform regression in arbitrary spaces and with many different types of basis functions However, there is a caveat Can you see what it is? Example: Polynomial curve fitting, M = 3 The number of coefficients grows as D^M! The approach quickly becomes impractical for high dimensions. This is known as the curse of dimensionality. We will encounter some ways to deal with this later... 17

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 18

Loss Functions for Regression Given the predictive distribution p(t | x, w, β), how do we actually estimate a function value y_t for a new point x_t? We need a loss function, just as in the classification case Optimal prediction: Minimize the expected loss Slide adapted from Stefan Roth 19

Loss Functions for Regression Simplest case Squared loss: L(t, y(x)) = (y(x) - t)^2. Expected loss: E[L] = ∫∫ (y(x) - t)^2 p(x, t) dx dt. Setting the derivative to zero, ∂E[L]/∂y(x) = 2 ∫ (y(x) - t) p(x, t) dt = 0, gives ∫ t p(x, t) dt = y(x) ∫ p(x, t) dt. Slide adapted from Stefan Roth 20

Loss Functions for Regression From ∫ t p(x, t) dt = y(x) ∫ p(x, t) dt it follows that y(x) = ∫ t p(x, t)/p(x) dt = ∫ t p(t|x) dt, i.e. y(x) = E[t|x]. Important result Under squared loss, the optimal regression function is the mean E[t|x] of the posterior p(t|x). Also called mean prediction. For our generalized linear regression function and squared loss, we obtain as result y(x) = ∫ t N(t | w^T φ(x), β^-1) dt = w^T φ(x). Slide adapted from Stefan Roth 21

Visualization of Mean Prediction [Figure: the regression function y(x) passes through the mean of the conditional distribution p(t|x); the mean prediction at a test point x_0 is marked.] 22 Slide adapted from Stefan Roth Image source: C.M. Bishop, 2006

Loss Functions for Regression Different derivation: Expand the square term as follows: (y(x) - t)^2 = (y(x) - E[t|x] + E[t|x] - t)^2 = (y(x) - E[t|x])^2 + (E[t|x] - t)^2 + 2 (y(x) - E[t|x])(E[t|x] - t). Substituting into the loss function, the cross-term vanishes, and we end up with E[L] = ∫ (y(x) - E[t|x])^2 p(x) dx + ∫ var[t|x] p(x) dx. The first term is minimized by the optimal least-squares predictor, given by the conditional mean; the second term is the intrinsic variability of the target data, the irreducible minimum value of the loss function. 23

Other Loss Functions The squared loss is not the only possible choice Poor choice when the conditional distribution p(t|x) is multimodal. Simple generalization: Minkowski loss L_q = |y(x) - t|^q with expectation E[L_q] = ∫∫ |y(x) - t|^q p(x, t) dx dt. The minimum of E[L_q] is given by the conditional mean for q = 2, the conditional median for q = 1, and the conditional mode for q → 0. 24

Minkowski Loss Functions 25 Image source: C.M. Bishop, 2006
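
A small numerical sketch (not from the lecture) illustrating the statement above: for samples drawn from a skewed distribution of t, minimizing the empirical Minkowski loss over a grid of candidate predictions recovers the mean for q = 2 and the median for q = 1. The gamma distribution and grid are assumed example choices.

```python
import numpy as np

rng = np.random.default_rng(0)
t_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # skewed distribution of targets t

def empirical_minkowski_loss(y: float, q: float) -> float:
    """Monte-Carlo estimate of E[|y - t|^q] for a fixed prediction y."""
    return np.mean(np.abs(y - t_samples) ** q)

candidates = np.linspace(0.0, 6.0, 601)
for q in (2.0, 1.0):
    losses = [empirical_minkowski_loss(y, q) for y in candidates]
    y_best = candidates[int(np.argmin(losses))]
    print(f"q={q}: minimizer ~ {y_best:.2f}")

print("mean   =", t_samples.mean())      # should match the q=2 minimizer
print("median =", np.median(t_samples))  # should match the q=1 minimizer
```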

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 26

Linear Basis Function Models Generally, we consider models of the form y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x), where the φ_j(x) are known as basis functions. Typically, φ_0(x) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: φ_d(x) = x_d. Let's take a look at some other possible basis functions... Slide adapted from C.M. Bishop, 2006 27

Linear Basis Function Models (2) Polynomial basis functions: φ_j(x) = x^j. Properties Global A small change in x affects all basis functions. 28 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Linear Basis Function Models (3) Gaussian basis functions: φ_j(x) = exp(-(x - μ_j)^2 / (2 s^2)). Properties Local A small change in x affects only nearby basis functions. μ_j and s control location and scale (width). 29 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Linear Basis Function Models (4) Sigmoid basis functions: φ_j(x) = σ((x - μ_j)/s), where σ(a) = 1/(1 + exp(-a)). Properties Local A small change in x affects only nearby basis functions. μ_j and s control location and scale (slope). 30 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006
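
For concreteness, a short sketch of the three basis-function families just listed, written as plain functions of a scalar input x; the centres mu_j and scale s in the usage lines are assumed example values.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Global polynomial basis phi_j(x) = x^j, j = 0..degree."""
    x = np.asarray(x)
    return np.stack([x**j for j in range(degree + 1)], axis=-1)

def gaussian_basis(x, centres, s):
    """Local Gaussian bumps phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    x = np.asarray(x)[..., None]
    return np.exp(-(x - centres) ** 2 / (2.0 * s**2))

def sigmoid_basis(x, centres, s):
    """Sigmoidal ramps phi_j(x) = sigma((x - mu_j) / s)."""
    x = np.asarray(x)[..., None]
    return 1.0 / (1.0 + np.exp(-(x - centres) / s))

# Example usage (assumed values):
x = np.linspace(0.0, 1.0, 5)
print(polynomial_basis(x, degree=3).shape)                            # (5, 4)
print(gaussian_basis(x, centres=np.linspace(0, 1, 9), s=0.1).shape)   # (5, 9)
print(sigmoid_basis(x, centres=np.linspace(0, 1, 9), s=0.1).shape)    # (5, 9)
```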

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 31

Multiple Outputs Multiple Output Formulation So far, we have only considered the case of a single target variable t. We may wish to predict K > 1 target variables in a vector t. We can write this in matrix form as y(x, W) = W^T φ(x), where y is a K-dimensional vector, W an M×K matrix of parameters, and φ(x) an M-dimensional vector of basis functions. 32

Multiple Outputs (2) Analogously to the single output case we have: p(t | x, W, β) = N(t | W^T φ(x), β^-1 I). Given observed inputs X = [x_1, ..., x_n] and targets T = [t_1, ..., t_n]^T, we obtain the log likelihood function ln p(T | X, W, β) = Σ_n ln N(t_n | W^T φ(x_n), β^-1 I) = (NK/2) ln(β/(2π)) - (β/2) Σ_n ||t_n - W^T φ(x_n)||^2. Slide adapted from C.M. Bishop, 2006 33

Multiple Outputs (3) Maximizing with respect to W, we obtain W_ML = (Φ Φ^T)^-1 Φ T with Φ = [φ(x_1), ..., φ(x_n)]. If we consider a single target variable t_k, we see that w_k = (Φ Φ^T)^-1 Φ t_k, where t_k = [t_1k, ..., t_nk]^T, which is identical with the single output case. Slide adapted from C.M. Bishop, 2006 34
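
A tiny sketch of the multi-output solution above; the data shapes and random inputs are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 4, 50, 3                      # basis functions, data points, outputs (assumed sizes)
Phi = rng.normal(size=(M, N))           # columns are phi(x_n)
T = rng.normal(size=(N, K))             # rows are target vectors t_n^T

# W_ML = (Phi Phi^T)^{-1} Phi T  -- each column solves an independent single-output problem
W_ml = np.linalg.solve(Phi @ Phi.T, Phi @ T)        # shape (M, K)

# Column k equals the single-output solution for targets T[:, k]
w_0 = np.linalg.solve(Phi @ Phi.T, Phi @ T[:, 0])
print(np.allclose(W_ml[:, 0], w_0))     # True
```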

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 35

Sequential Learning Up to now, we have mainly considered batch methods All data was used at the same time Instead, we can also consider data items one at a time (a.k.a. online learning) Stochastic (sequential) gradient descent: w^(τ+1) = w^(τ) - η ∇E_n = w^(τ) + η (t_n - (w^(τ))^T φ(x_n)) φ(x_n). This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose the learning rate η? Slide adapted from C.M. Bishop, 2006 36
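
A minimal LMS sketch that cycles through the data points one at a time as described above; the learning rate, number of epochs, and the reuse of the earlier Phi and t are assumed example choices.

```python
import numpy as np

def lms(Phi, t, eta=0.05, epochs=50):
    """Least-mean-squares: sequential gradient descent on the squared error.

    Phi: (M, N) matrix with columns phi(x_n); t: (N,) targets; eta: learning rate.
    """
    M, N = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for n in range(N):
            phi_n = Phi[:, n]
            w += eta * (t[n] - w @ phi_n) * phi_n   # w <- w + eta*(t_n - w^T phi_n)*phi_n
    return w

# Example usage with the Phi, t from the earlier maximum-likelihood snippet:
# w_lms = lms(Phi, t)   # approaches w_ML for a small enough eta
```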

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 37

Regularization Revisited Consider the error function E(w) = E_D(w) + λ E_W(w) (data term + regularization term). With the sum-of-squares error function and a quadratic regularizer, we get E(w) = (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) w^T w, which is minimized by w = (Φ Φ^T + λ I)^-1 Φ t. λ is called the regularization coefficient. Slide adapted from C.M. Bishop, 2006 38

Regularized Least-Squares Let's look at more general regularizers (L_q norms): E(w) = (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) Σ_j |w_j|^q. The case q = 1 is the Lasso, q = 2 is Ridge Regression. 39 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Recall: Lagrange Multipliers 40

Regularized Least-Squares We want to minimize E(w) = (1/2) Σ_n (t_n - w^T φ(x_n))^2 + (λ/2) Σ_j |w_j|^q. This is equivalent to minimizing the unregularized sum-of-squares error (1/2) Σ_n (t_n - w^T φ(x_n))^2 subject to the constraint Σ_j |w_j|^q ≤ η (for some suitably chosen η). 41

Regularized Least-Squares Effect: Sparsity for q ≤ 1. Minimization tends to set many coefficients to zero. Why is this good? Why don't we always do it, then? Any problems? 42 Image source: C.M. Bishop, 2006

The Lasso Consider the following regressor: w_lasso = argmin_w (1/2) Σ_n (t_n - w^T φ(x_n))^2 subject to Σ_j |w_j| ≤ η. This formulation is known as the Lasso. Properties L1 regularization The solution will be sparse (only few coefficients will be non-zero) The L1 penalty makes the problem non-linear: there is no closed-form solution. Need to solve a quadratic programming problem. However, efficient algorithms are available with the same computational cost as for ridge regression. 43 Image source: C.M. Bishop, 2006
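
One standard way to solve the penalized form of the Lasso is proximal gradient descent with soft-thresholding (ISTA); the sketch below is an illustrative implementation of that technique, not the specific algorithm referenced in the lecture, and the step size and penalty value are assumed.

```python
import numpy as np

def lasso_ista(Phi, t, lam, n_iter=500):
    """Lasso via ISTA (proximal gradient): minimize 0.5*||t - Phi^T w||^2 + lam*||w||_1.

    Phi: (M, N) matrix with columns phi(x_n); t: (N,) targets; lam: L1 penalty weight.
    """
    M = Phi.shape[0]
    w = np.zeros(M)
    # Step size 1/L, where L is the Lipschitz constant of the squared-error gradient
    L = np.linalg.norm(Phi @ Phi.T, 2)
    step = 1.0 / L
    for _ in range(n_iter):
        grad = Phi @ (Phi.T @ w - t)                               # gradient of the smooth term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding -> sparsity
    return w

# Example usage with the Phi, t from the earlier snippets (hypothetical penalty value):
# w_lasso = lasso_ista(Phi, t, lam=0.1)
# print(np.sum(w_lasso != 0), "non-zero coefficients")
```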

Lasso as Bayes Estimation Interpretation as Bayes Estimation We can think of |w_j|^q as the log-prior density for w_j. Prior for the Lasso (q = 1): Laplacian distribution p(w_j) = (1/(2τ)) exp(-|w_j|/τ) with τ = 1/λ. 44 Image source: Friedman, Hastie, Tibshirani, 2009

Analysis Equicontours of the prior distribution Analysis For q ≤ 1, the prior is not uniform in direction, but concentrates more mass on the coordinate directions. The case q = 1 (lasso) is the smallest q such that the constraint region is convex. Non-convexity makes the optimization problem more difficult. In the limit q = 0, the regularization term Σ_j |w_j|^q simply counts the number of non-zero parameters. This is known as Best Subset Selection. 45 Image source: Friedman, Hastie, Tibshirani, 2009

Discussion Bayesian analysis Lasso, Ridge regression, and Best Subset Selection are Bayes estimates with different priors. However, they are derived as maximizers of the posterior. Should ideally use the posterior mean as the Bayes estimate! The Ridge regression solution is also the posterior mean, but Lasso and Best Subset Selection are not. We might also try using other values of q besides 0, 1, 2. However, experience shows that this is not worth the effort. Values of q ∈ (1, 2) are a compromise between lasso and ridge. However, |w_j|^q with q > 1 is differentiable at 0 and thus loses the ability of the lasso to set coefficients exactly to zero. 46

Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 47

Bias-Variance Decomposition Recall the expected squared loss E[L] = ∫ (y(x) - h(x))^2 p(x) dx + ∫∫ (h(x) - t)^2 p(x, t) dx dt, where h(x) = E[t|x] = ∫ t p(t|x) dt. The second term of E[L] corresponds to the noise inherent in the random variable t. What about the first term? Slide adapted from C.M. Bishop, 2006 48

Bias-Variance Decomposition Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function y(x; D). We then have (y(x; D) - h(x))^2 = (y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x))^2. Taking the expectation over D yields E_D[(y(x; D) - h(x))^2] = (E_D[y(x; D)] - h(x))^2 + E_D[(y(x; D) - E_D[y(x; D)])^2]. Slide adapted from C.M. Bishop, 2006 49

Bias-Variance Decomposition Thus we can write expected loss = (bias)^2 + variance + noise, where (bias)^2 = ∫ (E_D[y(x; D)] - h(x))^2 p(x) dx, variance = ∫ E_D[(y(x; D) - E_D[y(x; D)])^2] p(x) dx, and noise = ∫∫ (h(x) - t)^2 p(x, t) dx dt. Slide adapted from C.M. Bishop, 2006 50

Bias-Variance Decomposition Example 25 data sets from the sinusoidal, varying the degree of regularization, λ. 51 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Bias-Variance Decomposition Example 25 data sets from the sinusoidal, varying the degree of regularization, λ. 52 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

Bias-Variance Decomposition Example 25 data sets from the sinusoidal, varying the degree of regularization, λ. 53 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

The Bias-Variance Trade-Off Result from these plots An over-regularized model (large λ) will have a high bias. An under-regularized model (small λ) will have a high variance. We can compute an estimate for the generalization capability this way (magenta curve)! Can you see where the problem is with this? The computation is based on averages w.r.t. ensembles of data sets and is therefore unfortunately of little practical value. 54 Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006
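
An illustrative sketch of the ensemble experiment behind these plots (loosely following the sinusoidal example in Bishop, Ch. 3.2, with assumed settings): fit 25 data sets drawn from a noisy sinusoid with Gaussian-basis ridge regression and estimate bias^2 and variance on a test grid, for small, medium, and large λ.

```python
import numpy as np

rng = np.random.default_rng(0)
L_sets, N, M, noise_std = 25, 25, 24, 0.3      # assumed experiment sizes
x_test = np.linspace(0.0, 1.0, 100)
h_test = np.sin(2 * np.pi * x_test)            # true function h(x)
centres = np.linspace(0.0, 1.0, M)
s = 0.1

def gauss_design(x):
    """Design matrix with rows phi(x_n)^T built from Gaussian basis functions."""
    return np.exp(-(x[:, None] - centres) ** 2 / (2 * s**2))

for lam in (1e-3, 1.0, 100.0):                 # small / medium / large regularization
    preds = np.empty((L_sets, len(x_test)))
    for l in range(L_sets):
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, N)
        Phi = gauss_design(x)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
        preds[l] = gauss_design(x_test) @ w
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - h_test) ** 2)      # (bias)^2 averaged over x
    variance = np.mean(preds.var(axis=0))           # variance averaged over x
    print(f"lambda={lam:7g}  bias^2={bias2:.4f}  variance={variance:.4f}")
```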

References and Further Reading More information on linear regression, including a discussion of regularization, can be found in Chapters 1.5.5 and 3.1-3.2 of the Bishop book. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd edition, Springer, 2009. Additional information on the Lasso, including efficient algorithms to solve it, can be found in Chapter 3.4 of the Hastie book. 55