Linear Models for Regression

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin

1 / 27
Outline

- Regression and linear models
- Batch methods
  - Ordinary least squares (OLS)
  - Maximum likelihood estimates
- Sequential methods
  - Least mean squares (LMS)
  - Recursive (sequential) least squares (RLS)
What is regression analysis?
Problem Setup

Given a set of $N$ labeled examples, $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \in \mathcal{X} \subseteq \mathbb{R}^D$ and $y_n \in \mathcal{Y} \subseteq \mathbb{R}$, the goal is to learn a mapping $f(x): \mathcal{X} \to \mathcal{Y}$ which associates $x$ with $y$, such that we can make a prediction about $y$ when a new input $x \notin \mathcal{D}$ is provided.

- Parametric regression: assume a functional form for $f(x)$ (e.g., linear models).
- Nonparametric regression: do not assume a functional form for $f(x)$.

In this lecture we focus on parametric regression.
Regression

Regression aims at modeling the dependence of a response $Y$ on a covariate $X$. In other words, the goal of regression is to predict the value of one or more continuous target variables $y$ given the value of the input vector $x$. The regression model is described by

  $y = f(x) + \epsilon.$

Terminology:
- $x$: input, independent variable, predictor, regressor, covariate
- $y$: output, dependent variable, response

The dependence of a response on a covariate is captured via a conditional probability distribution, $p(y \mid x)$.

Depending on the form of $f(x)$:
- Linear regression with basis functions: $f(x) = \sum_{j=1}^{M} w_j \phi_j(x) + w_0$.
- Linear regression with kernels: $f(x) = \sum_{i=1}^{N} w_i k(x, x_i) + w_0$.
Regression Function: Conditional Mean

We consider the mean squared error and find the MMSE estimate:

  $E(f) = \mathbb{E}\left[(y - f(x))^2\right] = \iint (y - f(x))^2\, p(x, y)\, dx\, dy = \int p(x) \underbrace{\int (y - f(x))^2\, p(y \mid x)\, dy}_{\text{to be minimized}}\, dx.$

Setting $\frac{\partial}{\partial f(x)} \int (y - f(x))^2\, p(y \mid x)\, dy = 0$ leads to

  $f(x) = \int y\, p(y \mid x)\, dy = \mathbb{E}[y \mid x].$
Linear Regression
Why Linear Models?

  $y_n = w_1 x_{n,1} + w_2 x_{n,2} + \cdots + w_M x_{n,M} + w_0 + \epsilon_n, \quad n = 1, \ldots, N.$

- Built on well-developed linear algebra.
- Can be solved analytically.
- Yield some interpretability (in contrast to deep learning).
Linear Regression

Linear regression refers to a model in which the conditional mean of $y$ given the value of $x$ is an affine function of $\phi(x)$:

  $f(x) = \sum_{j=1}^{M} w_j \phi_j(x) + w_0 \phi_0(x) = w^\top \phi(x),$

where the $\phi_j(x)$ are known as basis functions (with $\phi_0(x) = 1$ for the bias term) and $w = [w_0, w_1, \ldots, w_M]^\top$, $\phi = [\phi_0, \phi_1, \ldots, \phi_M]^\top$.

By using nonlinear basis functions, we allow the function $f(x)$ to be a nonlinear function of the input vector $x$ (but a linear function of $\phi(x)$).
Polynomial Regression: $y_n = \sum_{j=0}^{M} w_j \phi_j(x_n) = \sum_{j=0}^{M} w_j x_n^j$

[Figure: polynomial fits to data for degrees M = 0, 1, 3, 9. Source: Bishop's PRML]
Basis Functions

- Polynomial regression: $\phi_j(x) = x^j$.
- Gaussian basis functions: $\phi_j(x) = \exp\left\{ -\frac{\|x - \mu_j\|^2}{2\sigma^2} \right\}$.
- Spline basis functions: piecewise polynomials (divide the input space up into regions and fit a different polynomial in each region).
- Many other possible basis functions: sigmoidal basis functions, hyperbolic tangent basis functions, Fourier basis, wavelet basis, and so on.
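The first two families above can be sketched in a few lines of NumPy. This is an illustrative helper, not part of the lecture; the function names `polynomial_design` and `gaussian_design`, the centers, and the width are my own choices. Note that rows here index samples, i.e., each row is $\phi(x_n)^\top$ (the transpose of the slides' $\Phi$).

```python
import numpy as np

def polynomial_design(x, M):
    """Rows are phi(x_n) = [1, x_n, ..., x_n**M], i.e., Phi^T in the slides' notation."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_design(x, centers, sigma):
    """Bias column plus Gaussian bumps phi_j(x) = exp(-(x - mu_j)**2 / (2 sigma**2))."""
    rbf = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), rbf])

x = np.linspace(0.0, 1.0, 5)
print(polynomial_design(x, 2).shape)   # (5, 3)
```

Either matrix can be fed to the least-squares machinery developed next; only the choice of basis changes.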
Ordinary Least Squares
Loss function view
Least Squares Method

Given a set of training data $\{(x_n, y_n)\}_{n=1}^{N}$, we determine the weight vector $w \in \mathbb{R}^{M+1}$ which minimizes

  $J_{LS}(w) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - w^\top \phi(x_n) \right)^2 = \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2,$

where $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$ and $\Phi \in \mathbb{R}^{(M+1) \times N}$ is known as the design matrix, with $\Phi_{j,n} = \phi_{j-1}(x_n)$ for $j = 1, \ldots, M+1$ and $n = 1, \ldots, N$, i.e.,

  $\Phi^\top = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N) \end{bmatrix}.$

Note that $\left\| y - \Phi^\top w \right\|_2^2 = \left( y - \Phi^\top w \right)^\top \left( y - \Phi^\top w \right)$.
Find the estimate $\hat{w}_{LS}$ such that

  $\hat{w}_{LS} = \arg\min_w \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2,$

where both $y$ and $\Phi$ are given.

How do you find the minimizer $\hat{w}_{LS}$? Solve $\frac{\partial}{\partial w} \left( \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = 0$ for $w$.
Note that

  $\frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 = \frac{1}{2} \left( y - \Phi^\top w \right)^\top \left( y - \Phi^\top w \right) = \frac{1}{2} \left( y^\top y - w^\top \Phi y - y^\top \Phi^\top w + w^\top \Phi \Phi^\top w \right).$

Then, we have

  $\frac{\partial}{\partial w} \left( \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = \Phi \Phi^\top w - \Phi y.$
Therefore, $\frac{\partial}{\partial w} \left( \frac{1}{2} \left\| y - \Phi^\top w \right\|_2^2 \right) = 0$ leads to the normal equation

  $\Phi \Phi^\top w = \Phi y.$

Thus, the LS estimate of $w$ is given by

  $\hat{w}_{LS} = \left( \Phi \Phi^\top \right)^{-1} \Phi y = \Phi^\dagger y,$

where $\Phi^\dagger = \left( \Phi \Phi^\top \right)^{-1} \Phi$ is known as the Moore-Penrose pseudo-inverse.
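Both routes to $\hat{w}_{LS}$ (normal equation and pseudo-inverse) can be checked numerically. A minimal sketch with NumPy, on assumed toy data $y = 1 + 2x$ plus small noise, using the slides' $(M+1) \times N$ layout for $\Phi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 1 + 2x plus small noise, linear basis phi(x) = [1, x].
x = rng.uniform(0.0, 1.0, size=50)
y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=50)

Phi = np.vstack([np.ones_like(x), x])             # (M+1) x N, as in the slides
w_normal = np.linalg.solve(Phi @ Phi.T, Phi @ y)  # normal equation: Phi Phi^T w = Phi y
w_pinv = np.linalg.pinv(Phi.T) @ y                # Moore-Penrose pseudo-inverse route
print(w_normal)   # close to [1, 2]
```

The two estimates agree whenever $\Phi \Phi^\top$ is invertible; in practice `np.linalg.lstsq` or the pseudo-inverse is preferred over explicitly forming $\Phi \Phi^\top$ for numerical stability.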
Least Squares
Probabilistic model view with MLE
Maximum Likelihood

We consider a linear model where the target variable $y_n$ is assumed to be generated by a deterministic function $f(x_n; w) = w^\top \phi(x_n)$ with additive Gaussian noise:

  $y_n = w^\top \phi(x_n) + \epsilon_n, \quad n = 1, \ldots, N, \quad \epsilon_n \sim \mathcal{N}(0, \sigma^2).$

In a compact form, we have $y = \Phi^\top w + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. In other words, we model $p(y \mid \Phi, w)$ as

  $p(y \mid \Phi, w) = \mathcal{N}\left( \Phi^\top w, \sigma^2 I \right).$
The log-likelihood is given by

  $\mathcal{L} = \log p(y \mid \Phi, w) = \sum_{n=1}^{N} \log p(y_n \mid \phi(x_n), w) = -\frac{N}{2} \log \sigma^2 - \frac{N}{2} \log 2\pi - \frac{1}{\sigma^2} J_{LS}(w).$

The MLE is given by

  $\hat{w}_{ML} = \arg\max_w \log p(y \mid \Phi, w),$

leading to $\hat{w}_{ML} = \hat{w}_{LS}$, i.e., under the Gaussian noise assumption the maximum likelihood estimate coincides with the least squares estimate.
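The equivalence $\hat{w}_{ML} = \hat{w}_{LS}$ can be probed numerically: the Gaussian log-likelihood, evaluated at the least-squares solution, is at least as large as at any perturbed weight vector. A small sketch (the toy data and the fixed noise variance are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = -0.5 + 1.5 * x + 0.1 * rng.normal(size=100)

Phi = np.vstack([np.ones_like(x), x])   # (M+1) x N design matrix
w_ls = np.linalg.pinv(Phi.T) @ y        # least-squares estimate

def log_likelihood(w, sigma2=0.01):
    """Gaussian log-likelihood log p(y | Phi, w) with fixed noise variance sigma2."""
    r = y - Phi.T @ w
    return -0.5 * len(y) * np.log(2.0 * np.pi * sigma2) - 0.5 * (r @ r) / sigma2

# The log-likelihood is maximal exactly at the least-squares solution.
for _ in range(100):
    w_other = w_ls + 0.1 * rng.normal(size=2)
    assert log_likelihood(w_ls) >= log_likelihood(w_other)
```

Note that the location of the maximizer does not depend on $\sigma^2$, which only scales the $J_{LS}$ term.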
Sequential Methods
LMS and RLS
Online Learning

A method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. [Source: Wikipedia]
Mean Squared Error (MSE)

We are interested in the MMSE estimate:

  $\arg\min_w \text{MSE} = \mathbb{E}\left[ \left( y - w^\top \phi(x) \right)^2 \right].$

- Sample average: $\text{MSE} \approx \frac{1}{N} \sum_{n=1}^{N} \left( y_n - w^\top \phi(x_n) \right)^2$.
- Instantaneous squared error: $\left( y_n - w^\top \phi(x_n) \right)^2$.
Least Mean Squares (LMS)

Approximate $\mathbb{E}\left[ \left( y - w^\top \phi(x) \right)^2 \right] \approx \left( y_n - w^\top \phi(x_n) \right)^2$.

LMS is a gradient-descent method which minimizes the instantaneous squared error $J_n = \left( y_n - w^\top \phi(x_n) \right)^2$. The gradient descent method leads to the updating rule for $w$:

  $w \leftarrow w - \eta \frac{\partial J_n}{\partial w} = w + \eta \left( y_n - w^\top \phi(x_n) \right) \phi(x_n),$

where $\eta > 0$ is the learning rate (the constant factor from the gradient is absorbed into $\eta$).

[Widrow and Hoff, 1960]
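The LMS update above is a one-liner per sample. A minimal NumPy sketch; the function name `lms`, the step size, and the noiseless toy data are illustrative assumptions, and each row of `Phi_rows` holds $\phi(x_n)^\top$:

```python
import numpy as np

def lms(Phi_rows, y, eta=0.1, n_epochs=20):
    """LMS: one gradient step per sample on the instantaneous squared error."""
    w = np.zeros(Phi_rows.shape[1])
    for _ in range(n_epochs):
        for phi_n, y_n in zip(Phi_rows, y):
            err = y_n - w @ phi_n        # instantaneous error y_n - w^T phi(x_n)
            w = w + eta * err * phi_n    # w <- w + eta * err * phi(x_n)
    return w

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
Phi_rows = np.column_stack([np.ones_like(x), x])   # row n is phi(x_n) = [1, x_n]
y = 0.5 + 2.0 * x                                  # noiseless targets
w = lms(Phi_rows, y)
print(w)   # approaches [0.5, 2.0]
```

LMS trades convergence speed for simplicity: each update costs only $O(M)$, but the step size $\eta$ must be small enough relative to the input correlation matrix for the iteration to be stable.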
Recursive (Sequential) LS

We introduce the forgetting factor $\lambda$ to de-emphasize old samples, leading to the following error function:

  $J_{RLS} = \frac{1}{2} \sum_{i=1}^{n} \lambda^{n-i} \left( y_i - \phi_i^\top w_n \right)^2,$

where $\phi_i = \phi(x_i)$. Solving $\frac{\partial J_{RLS}}{\partial w_n} = 0$ for $w_n$ leads to

  $\left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right] w_n = \sum_{i=1}^{n} \lambda^{n-i} y_i \phi_i.$
We define

  $P_n = \left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right]^{-1}, \qquad r_n = \sum_{i=1}^{n} \lambda^{n-i} y_i \phi_i.$

With these definitions, we have $w_n = P_n r_n$.

The core idea of RLS is to apply the matrix inversion lemma to develop the sequential algorithm without matrix inversion:

  $(A + BCD)^{-1} = A^{-1} - A^{-1} B \left( C^{-1} + D A^{-1} B \right)^{-1} D A^{-1}.$
The recursion for $P_n$ is given by

  $P_n = \left[ \sum_{i=1}^{n} \lambda^{n-i} \phi_i \phi_i^\top \right]^{-1} = \left[ \lambda \sum_{i=1}^{n-1} \lambda^{n-1-i} \phi_i \phi_i^\top + \phi_n \phi_n^\top \right]^{-1} = \frac{1}{\lambda} \left[ P_{n-1} - \frac{P_{n-1} \phi_n \phi_n^\top P_{n-1}}{\lambda + \phi_n^\top P_{n-1} \phi_n} \right]$ (matrix inversion lemma).
Thus, the updating rule for $w$ is given by

  $w_n = P_n r_n = \frac{1}{\lambda} \left[ P_{n-1} - \frac{P_{n-1} \phi_n \phi_n^\top P_{n-1}}{\lambda + \phi_n^\top P_{n-1} \phi_n} \right] \left[ \lambda r_{n-1} + y_n \phi_n \right] = w_{n-1} + \underbrace{\frac{P_{n-1} \phi_n}{\lambda + \phi_n^\top P_{n-1} \phi_n}}_{\text{gain}} \underbrace{\left[ y_n - \phi_n^\top w_{n-1} \right]}_{\text{error}}.$
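The gain-times-error update and the $P_n$ recursion fit in a few lines. A minimal NumPy sketch of RLS; the initialization $P_0 = \delta I$ with large $\delta$ (a common convention standing in for a weak prior), the function name `rls`, and the toy data are assumptions for illustration:

```python
import numpy as np

def rls(Phi_rows, y, lam=0.99, delta=100.0):
    """Recursive least squares with forgetting factor lam; P_0 = delta * I."""
    d = Phi_rows.shape[1]
    w = np.zeros(d)
    P = delta * np.eye(d)
    for phi, y_n in zip(Phi_rows, y):
        Pphi = P @ phi
        gain = Pphi / (lam + phi @ Pphi)      # P_{n-1} phi_n / (lam + phi_n^T P_{n-1} phi_n)
        w = w + gain * (y_n - phi @ w)        # gain times error, as in the slide
        P = (P - np.outer(gain, Pphi)) / lam  # matrix-inversion-lemma recursion for P_n
    return w

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=300)
Phi_rows = np.column_stack([np.ones_like(x), x])   # row n is phi(x_n) = [1, x_n]
y = -1.0 + 3.0 * x + 0.01 * rng.normal(size=300)
w = rls(Phi_rows, y)
print(w)   # close to [-1, 3]
```

Each step costs $O(M^2)$ rather than the $O(M^3)$ of re-solving the normal equation, and with $\lambda < 1$ the estimate tracks slowly drifting weights.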