Linear Models for Regression


Linear Models for Regression
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin

Outline
Regression and linear models
Batch methods
  Ordinary least squares (OLS)
  Maximum likelihood estimates
Sequential methods
  Least mean squares (LMS)
  Recursive (sequential) least squares (RLS)

What is regression analysis?

Problem Setup

Given a set of $N$ labeled examples, $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \in \mathcal{X} \subseteq \mathbb{R}^D$ and $y_n \in \mathcal{Y} \subseteq \mathbb{R}$, the goal is to learn a mapping $f(x): \mathcal{X} \to \mathcal{Y}$, which associates $x$ with $y$, such that we can make a prediction about $y$ when a new input $x \notin \mathcal{D}$ is provided.

Parametric regression: assume a functional form for $f(x)$ (e.g., linear models).
Nonparametric regression: do not assume a functional form for $f(x)$.

In this lecture we focus on parametric regression.

Regression

Regression aims at modeling the dependence of a response $Y$ on a covariate $X$. In other words, the goal of regression is to predict the value of one or more continuous target variables $y$ given the value of the input vector $x$. The regression model is described by

$$y = f(x) + \epsilon.$$

Terminology:
  $x$: input, independent variable, predictor, regressor, covariate
  $y$: output, dependent variable, response

The dependence of a response on a covariate is captured via a conditional probability distribution, $p(y \mid x)$.

Depending on the form of $f(x)$:
  Linear regression with basis functions: $f(x) = \sum_{j=1}^{M} w_j \phi_j(x) + w_0$.
  Linear regression with kernels: $f(x) = \sum_{i=1}^{N} w_i k(x, x_i) + w_0$.
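As a rough, self-contained illustration of the kernel form (not from the slides; the RBF kernel, its bandwidth, and the toy data are choices made for this example), the following sketch fits $f(x) = \sum_{i=1}^{N} w_i k(x, x_i) + w_0$ by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)                          # training inputs x_1, ..., x_N
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)   # noisy targets

def rbf_kernel(a, b, bandwidth=0.1):
    # k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2)); the bandwidth is an arbitrary example value
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

# Design matrix for f(x) = sum_i w_i k(x, x_i) + w_0: a bias column plus one column per training point.
K = np.hstack([np.ones((x.size, 1)), rbf_kernel(x, x)])
w, *_ = np.linalg.lstsq(K, y, rcond=None)                   # least-squares fit of [w_0, w_1, ..., w_N]

# Predict at new inputs by evaluating the kernel against the training inputs.
x_new = np.linspace(0.0, 1.0, 5)
f_new = np.hstack([np.ones((x_new.size, 1)), rbf_kernel(x_new, x)]) @ w
print(np.round(f_new, 3))                                   # roughly follows sin(2*pi*x) at x_new
```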

Regression Function: Conditional Mean

We consider the mean squared error and find the MMSE estimate:

$$
E(f) = \mathbb{E}\left[ (y - f(x))^2 \right]
     = \int\!\!\int (y - f(x))^2 \, p(x, y) \, dx \, dy
     = \int\!\!\int (y - f(x))^2 \, p(x) \, p(y \mid x) \, dx \, dy
     = \int p(x) \underbrace{\int (y - f(x))^2 \, p(y \mid x) \, dy}_{\text{to be minimized}} \, dx.
$$

Setting

$$\frac{\partial}{\partial f(x)} \int (y - f(x))^2 \, p(y \mid x) \, dy = 0$$

leads to

$$f(x) = \int y \, p(y \mid x) \, dy = \mathbb{E}[y \mid x].$$
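A toy numerical check of this result (a simulation of my own, not from the slides): with asymmetric noise the conditional mean and conditional median differ, and the conditional-mean predictor attains the smaller empirical mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1.0, 1.0, size=n)
# Asymmetric noise, so the conditional mean and conditional median differ.
y = x ** 2 + rng.exponential(scale=1.0, size=n)

pred_mean = x ** 2 + 1.0            # E[y | x]: Exp(1) noise has mean 1
pred_median = x ** 2 + np.log(2.0)  # median of y given x: Exp(1) has median ln 2

mse_mean = np.mean((y - pred_mean) ** 2)
mse_median = np.mean((y - pred_median) ** 2)
print(f"MSE of conditional mean:   {mse_mean:.4f}")    # about 1.0 (the noise variance)
print(f"MSE of conditional median: {mse_median:.4f}")  # strictly larger
```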

Linear Regression

Why Linear Models?

$$y_n = w_1 x_{n,1} + w_2 x_{n,2} + \cdots + w_M x_{n,M} + w_0 + \epsilon_n, \qquad n = 1, \ldots, N.$$

  Built on the well-developed theory of linear transformations.
  Can be solved analytically.
  Yields some interpretability (in contrast to deep learning).

Linear Regression

Linear regression refers to a model in which the conditional mean of $y$ given the value of $x$ is an affine function of $\phi(x)$:

$$f(x) = \sum_{j=1}^{M} w_j \phi_j(x) + w_0 \phi_0(x) = w^\top \phi(x),$$

where the $\phi_j(x)$ are known as basis functions (with $\phi_0(x) = 1$ by convention) and $w = [w_0, w_1, \ldots, w_M]^\top$, $\phi = [\phi_0, \phi_1, \ldots, \phi_M]^\top$.

By using nonlinear basis functions, we allow the function $f(x)$ to be a nonlinear function of the input vector $x$ (but a linear function of $\phi(x)$).

Polynomial Regression:

$$y_n = \sum_{j=0}^{M} w_j \phi_j(x_n) = \sum_{j=0}^{M} w_j x_n^{\,j}.$$

[Figure: polynomial fits of order M = 0, 1, 3, and 9 to data points (t versus x). Figure source: Bishop's PRML.]
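A minimal sketch in the spirit of that figure (toy data of my own, not Bishop's): fitting polynomials of order $M = 0, 1, 3, 9$ to ten noisy samples of $\sin(2\pi x)$ shows the training error falling toward zero as $M$ grows, with the $M = 9$ fit interpolating the data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

for M in (0, 1, 3, 9):
    Phi = np.vander(x, M + 1, increasing=True)       # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares polynomial fit
    rms = np.sqrt(np.mean((Phi @ w - t) ** 2))
    print(f"M = {M}: training RMS error = {rms:.4f}")
```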

Basis Functions

  Polynomial regression: $\phi_j(x) = x^j$.
  Gaussian basis functions: $\phi_j(x) = \exp\left\{ -\dfrac{\|x - \mu_j\|^2}{2\sigma^2} \right\}$.
  Spline basis functions: piecewise polynomials (divide the input space up into regions and fit a different polynomial in each region).
  Many other possible basis functions: sigmoidal basis functions, hyperbolic tangent basis functions, Fourier basis, wavelet basis, and so on.
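As a concrete sketch of these choices (assuming scalar inputs, a grid of nine Gaussian centers, and a width $\sigma = 0.1$ picked arbitrarily), the feature vector $\phi(x) = [\phi_0(x), \ldots, \phi_M(x)]^\top$ can be built as follows:

```python
import numpy as np

def polynomial_features(x, M):
    # phi_j(x) = x**j for j = 0, ..., M; returns an (M+1)-vector
    return np.array([x ** j for j in range(M + 1)])

def gaussian_features(x, centers, sigma):
    # phi_0(x) = 1 and phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)); returns an (M+1)-vector
    rbf = np.exp(-(x - centers) ** 2 / (2.0 * sigma ** 2))
    return np.concatenate(([1.0], rbf))

centers = np.linspace(0.0, 1.0, 9)            # mu_1, ..., mu_9 (example grid)
print(polynomial_features(0.5, M=3))          # [1.0, 0.5, 0.25, 0.125]
print(gaussian_features(0.5, centers, 0.1))   # 10 features, peaked near the center mu_j = 0.5
```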

Ordinary Least Squares
Loss function view

Least Squares Method

Given a set of training data $\{(x_n, y_n)\}_{n=1}^{N}$, we determine the weight vector $w \in \mathbb{R}^{M+1}$ which minimizes

$$J_{LS}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - w^\top \phi(x_n) \right)^2 = \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2,$$

where $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$ and $\Phi \in \mathbb{R}^{(M+1) \times N}$ is known as the design matrix, with $\Phi_{j,n} = \phi_{j-1}(x_n)$ for $j = 1, \ldots, M+1$ and $n = 1, \ldots, N$, i.e.,

$$\Phi^\top = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N) \end{bmatrix}.$$

Note that $\left\| y - \Phi^\top w \right\|_2^2 = \left( y - \Phi^\top w \right)^\top \left( y - \Phi^\top w \right)$.

Find the estimate $w_{LS}$ such that

$$w_{LS} = \arg\min_{w} \; \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2,$$

where both $y$ and $\Phi$ are given.

How do you find the minimizer $w_{LS}$? Solve $\dfrac{\partial}{\partial w} \left( \dfrac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = 0$ for $w$.

Note that

$$\frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 = \frac{1}{2} \left( y - \Phi^\top w \right)^\top \left( y - \Phi^\top w \right) = \frac{1}{2} \left( y^\top y - w^\top \Phi y - y^\top \Phi^\top w + w^\top \Phi \Phi^\top w \right).$$

Then, we have

$$\frac{\partial}{\partial w} \left( \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = \Phi \Phi^\top w - \Phi y.$$

Therefore, $\dfrac{\partial}{\partial w} \left( \dfrac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = 0$ leads to the normal equation

$$\Phi \Phi^\top w = \Phi y.$$

Thus, the LS estimate of $w$ is given by

$$w_{LS} = \left( \Phi \Phi^\top \right)^{-1} \Phi y = \left( \Phi^\top \right)^{\dagger} y,$$

where $\left( \Phi^\top \right)^{\dagger} = \left( \Phi \Phi^\top \right)^{-1} \Phi$ is the Moore-Penrose pseudo-inverse of $\Phi^\top$.
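A minimal numerical sketch of this closed-form solution, using the same convention $\Phi \in \mathbb{R}^{(M+1) \times N}$ with columns $\phi(x_n)$ (the cubic toy data are my own example): solving the normal equation directly and applying the pseudo-inverse give the same estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + 0.1 * rng.standard_normal(N)

# Design matrix Phi in R^{(M+1) x N}: column n is phi(x_n) = [1, x_n, x_n^2, x_n^3]^T.
Phi = np.vstack([x ** j for j in range(M + 1)])

# Normal equation  Phi Phi^T w = Phi y  (prefer solve over forming the inverse explicitly).
w_ls = np.linalg.solve(Phi @ Phi.T, Phi @ y)

# Equivalent pseudo-inverse form  w_ls = (Phi^T)^+ y.
w_pinv = np.linalg.pinv(Phi.T) @ y

print(np.allclose(w_ls, w_pinv))   # True
print(np.round(w_ls, 3))           # close to the generating weights [1, -2, 0, 0.5]
```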

Least Squares
Probabilistic model view with MLE

Maximum Likelihood

We consider a linear model where the target variable $y_n$ is assumed to be generated by a deterministic function $f(x_n; w) = w^\top \phi(x_n)$ with additive Gaussian noise:

$$y_n = w^\top \phi(x_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2), \quad n = 1, \ldots, N.$$

In compact form, we have $y = \Phi^\top w + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. In other words, we model $p(y \mid \Phi, w)$ as

$$p(y \mid \Phi, w) = \mathcal{N}\left( \Phi^\top w, \, \sigma^2 I \right).$$

The log-likelihood is given by

$$\mathcal{L} = \log p(y \mid \Phi, w) = \sum_{n=1}^{N} \log p(y_n \mid \phi(x_n), w) = -\frac{N}{2} \log \sigma^2 - \frac{N}{2} \log 2\pi - \frac{1}{\sigma^2} J_{LS}(w).$$

The MLE is given by

$$w_{ML} = \arg\max_{w} \log p(y \mid \Phi, w),$$

leading to $w_{ML} = w_{LS}$, which we arrived at under the Gaussian noise assumption.
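A toy numerical check of this equivalence (the data-generating weights are arbitrary; the slides state the result without an example): the closed-form LS weights also maximize the log-likelihood for any fixed noise variance, and, as a standard by-product of this Gaussian model not shown on the slide, the ML noise variance is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma_true = 200, 2, 0.3
x = rng.uniform(-1.0, 1.0, size=N)
y = 0.5 + 1.5 * x - 2.0 * x ** 2 + sigma_true * rng.standard_normal(N)

Phi = np.vstack([x ** j for j in range(M + 1)])          # (M+1) x N design matrix
w_ls = np.linalg.solve(Phi @ Phi.T, Phi @ y)             # least-squares weights

def neg_log_likelihood(w, sigma2):
    resid = y - Phi.T @ w
    return 0.5 * N * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

sigma2_ml = np.mean((y - Phi.T @ w_ls) ** 2)             # ML estimate of the noise variance

# Perturbing w_ls in any direction only increases the negative log-likelihood,
# consistent with w_ML = w_LS.
for _ in range(5):
    w_perturbed = w_ls + 0.05 * rng.standard_normal(M + 1)
    assert neg_log_likelihood(w_perturbed, sigma2_ml) >= neg_log_likelihood(w_ls, sigma2_ml)

print(np.round(w_ls, 3), round(float(np.sqrt(sigma2_ml)), 3))   # near [0.5, 1.5, -2] and 0.3
```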

Sequential Methods
LMS and RLS

Online Learning

A method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. [Source: Wikipedia]

Mean Squared Error (MSE)

We are interested in the MMSE estimate, $\arg\min_{w} \text{MSE}$, where

$$\text{MSE} = \mathbb{E}\left[ \left( y - w^\top \phi(x) \right)^2 \right].$$

Sample average: $\text{MSE} \approx \dfrac{1}{N} \sum_{n=1}^{N} \left( y_n - w^\top \phi(x_n) \right)^2$.

Instantaneous squared error: $\left( y_n - w^\top \phi(x_n) \right)^2$.

Least Mean Squares (LMS)

Approximate $\mathbb{E}\left[ \left( y - w^\top \phi(x) \right)^2 \right] \approx \left( y_n - w^\top \phi(x_n) \right)^2$.

LMS is a gradient-descent method which minimizes the instantaneous squared error $J_n = \left( y_n - w^\top \phi(x_n) \right)^2$. Gradient descent leads to the updating rule for $w$:

$$w \leftarrow w - \eta \frac{\partial J_n}{\partial w} = w + \eta \left( y_n - w^\top \phi(x_n) \right) \phi(x_n),$$

where $\eta > 0$ is the learning rate (the constant factor of 2 from differentiating $J_n$ is absorbed into $\eta$). [Widrow and Hoff, 1960]
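A minimal LMS sketch on streaming toy data (the feature map $\phi(x) = [1, x]^\top$, the learning rate, and the data are example choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
x = rng.uniform(-1.0, 1.0, size=N)
y = 2.0 - 3.0 * x + 0.1 * rng.standard_normal(N)

eta = 0.05                         # learning rate (example value)
w = np.zeros(2)                    # weights for phi(x) = [1, x]
for x_n, y_n in zip(x, y):
    phi_n = np.array([1.0, x_n])
    error = y_n - w @ phi_n        # instantaneous prediction error
    w = w + eta * error * phi_n    # LMS update: w <- w + eta * (y_n - w^T phi_n) phi_n

print(np.round(w, 2))              # approaches the generating weights [2, -3]
```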

Recursive (Sequential) LS

We introduce the forgetting factor $\lambda$ to de-emphasize old samples, leading to the following error function:

$$J_{RLS} = \frac{1}{2} \sum_{i=1}^{n} \lambda^{n-i} \left( y_i - \phi_i^\top w_n \right)^2,$$

where $\phi_i = \phi(x_i)$. Solving $\dfrac{\partial J_{RLS}}{\partial w_n} = 0$ for $w_n$ leads to

$$\left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right] w_n = \left[ \sum_{i=1}^{n} \lambda^{n-i} y_i \phi_i \right].$$

We define

$$P_n = \left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right]^{-1}, \qquad r_n = \sum_{i=1}^{n} \lambda^{n-i} y_i \phi_i.$$

With these definitions, we have $w_n = P_n r_n$.

The core idea of RLS is to apply the matrix inversion lemma to develop a sequential algorithm without explicit matrix inversion:

$$(A + BCD)^{-1} = A^{-1} - A^{-1} B \left( C^{-1} + D A^{-1} B \right)^{-1} D A^{-1}.$$
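A quick numerical sanity check of the matrix inversion lemma as stated (random conformable matrices of arbitrary size; invertibility holds almost surely for this construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned n x n matrix
B = rng.standard_normal((n, k))
C = np.diag(rng.uniform(0.5, 2.0, size=k))        # invertible k x k matrix by construction
D = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + B @ C @ D)
Ainv, Cinv = np.linalg.inv(A), np.linalg.inv(C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(Cinv + D @ Ainv @ B) @ D @ Ainv
print(np.allclose(lhs, rhs))                      # True: both sides agree numerically
```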

The recursion for $P_n$ is given by

$$
P_n = \left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right]^{-1}
    = \left[ \lambda \sum_{i=1}^{n-1} \lambda^{n-1-i} \phi_i \phi_i^\top + \phi_n \phi_n^\top \right]^{-1}
    = \frac{1}{\lambda} \left[ P_{n-1} - \frac{P_{n-1} \phi_n \phi_n^\top P_{n-1}}{\lambda + \phi_n^\top P_{n-1} \phi_n} \right]
$$

by the matrix inversion lemma.

Thus, the updating rule for $w$ is given by

$$
w_n = P_n r_n
    = \frac{1}{\lambda} \left[ P_{n-1} - \frac{P_{n-1} \phi_n \phi_n^\top P_{n-1}}{\lambda + \phi_n^\top P_{n-1} \phi_n} \right] \left[ \lambda r_{n-1} + y_n \phi_n \right]
    = w_{n-1} + \underbrace{\frac{P_{n-1} \phi_n}{\lambda + \phi_n^\top P_{n-1} \phi_n}}_{\text{gain}} \underbrace{\left[ y_n - \phi_n^\top w_{n-1} \right]}_{\text{error}}.
$$
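Putting the two recursions together, a compact RLS sketch on toy streaming data (the initialization $P_0 = \delta^{-1} I$ with large $\delta^{-1}$ is a conventional choice not specified on the slides, as are $\lambda$, the feature map, and the data):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(N)

lam = 0.99                                   # forgetting factor lambda
w = np.zeros(2)                              # weights for phi(x) = [1, x]
P = 1e6 * np.eye(2)                          # P_0 = delta^{-1} I, conventional large initialization

for x_n, y_n in zip(x, y):
    phi = np.array([1.0, x_n])
    Pphi = P @ phi
    gain = Pphi / (lam + phi @ Pphi)         # gain = P_{n-1} phi_n / (lambda + phi_n^T P_{n-1} phi_n)
    error = y_n - phi @ w                    # a priori error  y_n - phi_n^T w_{n-1}
    w = w + gain * error                     # w_n = w_{n-1} + gain * error
    P = (P - np.outer(gain, Pphi)) / lam     # P_n = (P_{n-1} - gain phi_n^T P_{n-1}) / lambda

print(np.round(w, 2))                        # approaches the generating weights [1, 2]
```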