Lecture 2 Linear Regression

"it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones"

Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden. Email: schon@isy.liu.se, Phone: 3-28373, Office: House B, Entrance 27.

Outline of lecture 2
1. Summary of lecture 1
2. Linear basis function models
3. Maximum likelihood and least squares
4. Bias-variance trade-off
5. Shrinkage methods: ridge regression and the LASSO
6. Bayesian linear regression
7. Motivation of kernel methods

Summary of lecture 1 (I/III)
The exponential family of distributions over x, parameterized by η,
p(x | η) = h(x) g(η) exp(η^T u(x)).
One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

Summary of lecture 1 (II/III)
The idea underlying maximum likelihood is that the parameters θ should be chosen in such a way that the measurements {x_i}_{i=1}^N are as likely as possible, i.e.,
θ̂ = arg max_θ p(x_1, ..., x_N | θ).
The three basic steps of Bayesian modeling (all variables are modeled as stochastic):
1. Assign prior distributions p(θ) to all unknown parameters θ.
2. Write down the likelihood p(x_1, ..., x_N | θ) of the data x_1, ..., x_N given the parameters θ.
3. Determine the posterior distribution of the parameters given the data,
p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ).
If the posterior p(θ | x_1, ..., x_N) and the prior p(θ) distributions are of the same functional form, they are conjugate distributions and the prior is said to be a conjugate prior for the likelihood.
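The three Bayesian steps above can be sketched numerically for the simplest conjugate case: a Gaussian likelihood with known variance and a Gaussian prior on its mean. The numbers and variable names below are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(5)

# Step 1: prior on the unknown mean, theta ~ N(m0, s0^2)  (made-up values)
m0, s0 = 0.0, 10.0
# Step 2: likelihood, x_i | theta ~ N(theta, sigma2) with known sigma2
sigma2 = 1.0
theta_true = 1.5                                # made-up ground truth
x = rng.normal(theta_true, np.sqrt(sigma2), 50)

# Step 3: by conjugacy the posterior is Gaussian again, N(m_post, s2_post)
N = x.size
s2_post = 1.0 / (1.0 / s0**2 + N / sigma2)
m_post = s2_post * (m0 / s0**2 + x.sum() / sigma2)
print(m_post, s2_post)
```

Because the Gaussian prior is conjugate to the Gaussian likelihood, the posterior has the same functional form as the prior, which is exactly the conjugacy property stated above.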
Summary of lecture 1 (III/III)
Modeling heavy tails using the Student's t-distribution:
St(x | µ, λ, ν) = ∫ N(x | µ, (ηλ)^{-1}) Gam(η | ν/2, ν/2) dη
= [Γ(ν/2 + 1/2) / Γ(ν/2)] (λ/(πν))^{1/2} (1 + λ(x - µ)²/ν)^{-ν/2 - 1/2},
which, according to the first expression, can be interpreted as an infinite mixture of Gaussians with the same mean but different variances.
[Figure: log of the Student's t and Gaussian densities, illustrating the heavier tails of the Student's t.]
Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Commonly used basis functions
Using nonlinear basis functions, y(x, w) can be a nonlinear function of the input variable x (while still linear in w).
Global basis functions (in the sense that a small change in x affects all basis functions):
1. Polynomial (see the illustrative example in Section 1.1; e.g. the identity φ(x) = x)
Local basis functions (in the sense that a small change in x only affects the nearby basis functions):
1. Gaussian
2. Sigmoidal

Linear regression model on matrix form
It is commonly convenient to write the linear regression model
t_n = w^T φ(x_n) + ɛ_n, n = 1, ..., N,
w = (w_0 w_1 ... w_{M-1})^T, φ(x_n) = (φ_0(x_n) ... φ_{M-1}(x_n))^T,
on matrix form
T = Φw + E,
where T = (t_1 t_2 ... t_N)^T, E = (ɛ_1 ɛ_2 ... ɛ_N)^T and Φ is the N × M matrix with rows
(φ_0(x_n) φ_1(x_n) ... φ_{M-1}(x_n)), n = 1, ..., N.

Maximum likelihood and least squares (I/IV)
In our linear regression model, t_n = w^T φ(x_n) + ɛ_n, assume that ɛ_n ∼ N(0, β^{-1}) (i.i.d.). This results in the following likelihood function:
p(t_n | w, β) = N(t_n | w^T φ(x_n), β^{-1}).
Note that this is a slight abuse of notation; p_{w,β}(t_n) or p(t_n ; w, β) would have been better, since w and β are both considered deterministic parameters in ML.
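The scale-mixture representation of the Student's t shown in the summary above can be sanity-checked by sampling: draw a precision η from the Gamma distribution, then x from the corresponding Gaussian. A small numpy sketch; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_student_t(mu, lam, nu, size):
    # eta ~ Gam(nu/2, nu/2): shape nu/2, rate nu/2, i.e. scale 2/nu
    eta = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=size)
    # x | eta ~ N(mu, (eta*lam)^{-1}): each sample gets its own variance
    return rng.normal(mu, 1.0 / np.sqrt(eta * lam), size=size)

x = sample_student_t(mu=0.0, lam=1.0, nu=5.0, size=200_000)
print(x.mean(), x.var())
```

For ν > 2 the resulting samples have mean µ and variance ν/((ν - 2)λ) (here 5/3), while individual draws exhibit the heavier tails seen in the figure.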
Maximum likelihood and least squares (II/IV)
The available training data consists of input variables X = {x_i}_{i=1}^N and the corresponding target variables T = {t_i}_{i=1}^N. According to our assumption on the noise, the likelihood function is given by
p(T | w, β) = ∏_{n=1}^N p(t_n | w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}),
which results in the following log-likelihood function:
L(w, β) = ln p(t_1, ..., t_N | w, β) = Σ_{n=1}^N ln N(t_n | w^T φ(x_n), β^{-1})
= (N/2) ln β - (N/2) ln(2π) - (β/2) Σ_{n=1}^N (t_n - w^T φ(x_n))².

Maximum likelihood and least squares (III/IV)
The maximum likelihood problem now amounts to solving
arg max_{w,β} L(w, β).
Setting the derivative
∂L/∂w = β Σ_{n=1}^N (t_n - w^T φ(x_n)) φ(x_n)^T
equal to 0 gives the following ML estimate for w:
ŵ_ML = (Φ^T Φ)^{-1} Φ^T T = Φ† T,
where Φ† = (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of the N × M design matrix Φ with rows (φ_0(x_n) ... φ_{M-1}(x_n)).
Note that if Φ^T Φ is singular (or close to singular) we can fix this by adding λI, i.e.,
ŵ_RR = (Φ^T Φ + λI)^{-1} Φ^T T.

Maximum likelihood and least squares (IV/IV)
Maximizing the log-likelihood function L(w, β) w.r.t. β results in the following estimate for β:
1/β̂_ML = (1/N) Σ_{n=1}^N (t_n - ŵ_ML^T φ(x_n))².
Finally, note that if we are only interested in w, the log-likelihood function is proportional to Σ_{n=1}^N (t_n - w^T φ(x_n))², which clearly shows that assuming a Gaussian noise model and making use of Maximum Likelihood (ML) corresponds to a Least Squares (LS) problem.

Interpretation of the Gauss-Markov theorem
The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.
"the restriction to unbiased estimates is not necessarily a wise one" [HTF, page 51]
Two classes of potentially biased estimators: 1. subset selection methods and 2. shrinkage methods. This is intimately connected to the bias-variance trade-off. We will give a system identification example related to ridge regression to illustrate the bias-variance trade-off.
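The closed-form estimates derived above (the ML/LS solution, its ridge-regularized variant, and the noise-variance estimate) are easy to verify numerically. In this sketch the polynomial basis, data sizes, true weights and λ are made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Polynomial basis phi_j(x) = x^j (an illustrative choice of basis)
def design_matrix(x, M):
    return np.vander(x, M, increasing=True)   # columns x^0, ..., x^(M-1)

N, M = 50, 4
x = rng.uniform(-1, 1, N)
w_true = np.array([0.5, -1.0, 0.0, 2.0])      # made-up ground truth
Phi = design_matrix(x, M)
T = Phi @ w_true + rng.normal(0, 0.1, N)      # t_n = w^T phi(x_n) + eps_n

# ML estimate: w = (Phi^T Phi)^{-1} Phi^T T (solve, don't invert explicitly)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Ridge regression: add lambda*I to handle a (near-)singular Phi^T Phi
lam = 1e-3
w_rr = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ T)

# Noise variance estimate: 1/beta_ML = mean squared residual
beta_ml_inv = np.mean((T - Phi @ w_ml) ** 2)
```

With well-conditioned data the ridge estimate stays very close to the ML estimate for small λ, and the residual-based noise estimate recovers the simulated noise level.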
See Section 3.2 for a slightly more abstract (but very informative) account of the bias-variance trade-off. (This is a perfect topic for discussion during the exercise sessions!)
Interpretation of RR using the SVD of Φ
By studying the SVD of Φ it can be shown that ridge regression projects the measurements onto the principal components of Φ and then shrinks the coefficients of low-variance components more than the coefficients of high-variance components. (See Section 3.4.1 in HTF for details.)

Bias-variance tradeoff example (I/IV)
(Ex. 2.3 in Henrik Ohlsson's PhD thesis) Consider a SISO system
y_t = Σ_{k=1}^n g_k u_{t-k} + e_t,   (1)
where u_t denotes the input, y_t denotes the output, e_t denotes white noise (E(e_t) = 0 and E(e_t e_s) = σ² δ(t - s)) and {g_k}_{k=1}^n denotes the impulse response of the system. Recall that the impulse response is the output y_t when u_t = δ(t) is used in (1), which results in
y_t = g_t + e_t for t = 1, ..., n, and y_t = e_t for t > n.

Bias-variance tradeoff example (II/IV)
The task is now to estimate the impulse response using an n-th order FIR model,
y_t = w^T φ_t + e_t, φ_t = (u_{t-1} ... u_{t-n})^T, w ∈ R^n.
Let us use Ridge Regression (RR),
ŵ_RR = arg min_w ||Y - Φw||²_2 + λ w^T w,
to find the parameters w.

Bias-variance tradeoff example (III/IV)
[Figure: squared bias (gray line), variance (dashed line) and MSE (black line) as functions of λ.]
Squared bias: (E_ŵ(ŵ^T φ) - w^T φ)²
Variance: E_ŵ((ŵ^T φ - E_ŵ(ŵ^T φ))²)
MSE = (bias)² + variance
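For a fixed input realization, the squared bias and variance of the ridge FIR estimate can be written in closed form, since ŵ_RR is linear in the measurements: ŵ_RR = A Y with A = (Φ^TΦ + λI)^{-1}Φ^T, so E(ŵ_RR) = AΦg and Cov(ŵ_RR) = σ²AA^T. A numpy sketch; the impulse response, noise level and λ-grid are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

n, N, sigma = 10, 60, 0.5
g = 0.8 ** np.arange(1, n + 1)          # made-up "true" impulse response
u = rng.normal(0, 1, N + n)             # white-noise input
# FIR regressors: row t is (u_{t-1}, ..., u_{t-n})
Phi = np.array([u[t - n:t][::-1] for t in range(n, N + n)])

lams = np.logspace(-2, 2, 20)
biases2, variances = [], []
for lam in lams:
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T)  # w_RR = A @ Y
    biases2.append(np.sum((A @ Phi @ g - g) ** 2))  # ||E(w_RR) - g||^2
    variances.append(sigma**2 * np.trace(A @ A.T))  # tr Cov(w_RR)
biases2, variances = np.array(biases2), np.array(variances)
mses = biases2 + variances
```

As λ grows the variance term shrinks while the squared bias grows, reproducing the trade-off in the figure; the MSE-optimal λ typically lies between the two extremes.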
Bias-variance tradeoff example (IV/IV)
Flexible models will have a low bias and high variance, and more restricted models will have high bias and low variance. The model with the best predictive capabilities is the one which strikes the best tradeoff between bias and variance.
Recent contributions on impulse response identification using regularization, see
Gianluigi Pillonetto and Giuseppe De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81-93, January 2010.
Tianshi Chen, Henrik Ohlsson and Lennart Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48(8):1525-1535, August 2012.

Lasso
The Lasso was introduced during lecture 1 as the MAP estimate when a Laplacian prior is assigned to the parameters. Alternatively we can motivate the Lasso as the solution to
min_w Σ_{n=1}^N (t_n - w^T φ(x_n))²  s.t. Σ_{j=1}^M |w_j| ≤ η,
which using a Lagrange multiplier λ can be stated as
min_w Σ_{n=1}^N (t_n - w^T φ(x_n))² + λ Σ_{j=1}^M |w_j|.
The difference to ridge regression is simply that the Lasso makes use of the l1-norm Σ_{j=1}^M |w_j|, rather than the (squared) l2-norm Σ_{j=1}^M w_j² used in ridge regression, in shrinking the parameters.

Graphical illustration of Lasso and RR
[Figure: contours of the least squares cost together with the constraint regions, Lasso (left) and Ridge Regression (RR, right).]
The circles are contours of the least squares cost function (LS estimate in the middle). The constraint regions are shown in gray: |w_1| + |w_2| ≤ η (Lasso) and w_1² + w_2² ≤ η (RR). The shape of the constraints motivates why the Lasso often leads to sparseness.

Implementing Lasso
The l1-regularized least squares problem (lasso):
min_w ||T - Φw||²_2 + λ||w||_1   (2)
YALMIP code solving (2). Download: http://users.isy.liu.se/johanl/yalmip/
w = sdpvar(M,1);
ops = sdpsettings('verbose',0);
solvesdp([], (T-Phi*w)'*(T-Phi*w) + lambda*norm(w,1), ops);
CVX code solving (2). Download: http://cvxr.com/cvx/
cvx_begin
    variable w(M)
    minimize( (T-Phi*w)'*(T-Phi*w) + lambda*norm(w,1) )
cvx_end
A MATLAB package dedicated to l1-regularized least squares problems is l1_ls. Download: http://www.stanford.edu/~boyd/l1_ls/
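Besides the convex-optimization toolboxes above, problem (2) can also be solved with a few lines of plain numpy via proximal gradient descent (ISTA), where the l1-term enters through soft-thresholding. This is a sketch with made-up data, not code from the lecture.

```python
import numpy as np

def lasso_ista(Phi, T, lam, n_iter=5000):
    # Solve min_w ||T - Phi w||_2^2 + lam*||w||_1 by proximal gradient (ISTA)
    step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)  # 1/L for this loss
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ w - T)
        z = w - step * grad
        # soft-thresholding: the proximal operator of the l1 penalty
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

rng = np.random.default_rng(3)
Phi = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                       # sparse ground truth
T = Phi @ w_true + 0.05 * rng.normal(size=100)
w_hat = lasso_ista(Phi, T, lam=5.0)
```

The soft-threshold step sets small coefficients exactly to zero, which is where the sparsity of the Lasso estimate comes from.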
Bayesian linear regression example (I/VI)
Consider the problem of fitting a straight line to noisy measurements. Let the model be (t_n ∈ R, x_n ∈ R)
t_n = w_0 + w_1 x_n + ɛ_n, n = 1, ..., N,   (3)
where y(x, w) = w_0 + w_1 x and ɛ_n ∼ N(0, 0.2²), i.e. β^{-1} = 0.2² and β = 25.
According to (3), the following identity basis function is used: φ_0(x_n) = 1, φ_1(x_n) = x_n.
The example lives in two dimensions, allowing us to plot the distributions in illustrating the inference.

Bayesian linear regression example (II/VI)
Let the true values for w be w = (-0.3 0.5)^T (plotted using a white circle below). Generate synthetic measurements by
t_n = w_0 + w_1 x_n + ɛ_n, ɛ_n ∼ N(0, 0.2²), x_n ∼ U(-1, 1).
Furthermore, let the prior be
p(w) = N(w | (0 0)^T, α^{-1} I), α = 2.0.

Bayesian linear regression example (III/VI)
Plot of the situation before any data arrives.
Prior: p(w) = N(w | (0 0)^T, α^{-1} I).
Example of a few realizations from the prior.

Bayesian linear regression example (IV/VI)
Plot of the situation after one measurement has arrived.
Likelihood (plotted as a function of w): p(t_1 | w) = N(t_1 | w_0 + w_1 x_1, β^{-1}).
Posterior/prior: p(w | t_1) = N(w | m_1, S_1), m_1 = β S_1 Φ^T t_1, S_1 = (αI + βΦ^T Φ)^{-1}.
Example of a few realizations from the posterior and the first measurement (black circle).
Bayesian linear regression example (V/VI)
Plot of the situation after two measurements have arrived.
Likelihood (plotted as a function of w): p(t_2 | w) = N(t_2 | w_0 + w_1 x_2, β^{-1}).
Posterior/prior: p(w | T) = N(w | m_2, S_2), m_2 = β S_2 Φ^T T, S_2 = (αI + βΦ^T Φ)^{-1}.
Example of a few realizations from the posterior and the measurements (black circles).

Bayesian linear regression example (VI/VI)
Plot of the situation after 3 measurements have arrived.
Likelihood (plotted as a function of w): p(t_3 | w) = N(t_3 | w_0 + w_1 x_3, β^{-1}).
Posterior/prior: p(w | T) = N(w | m_3, S_3), m_3 = β S_3 Φ^T T, S_3 = (αI + βΦ^T Φ)^{-1}.
Example of a few realizations from the posterior and the measurements (black circles).

Empirical Bayes
Important question: how do we decide on suitable values for the hyperparameters η?
Idea: estimate the hyperparameters from the data by selecting them such that they maximize the marginal likelihood function,
p(T | η) = ∫ p(T | w, η) p(w | η) dw,
where η denotes the hyperparameters to be estimated.
Travels under many names; besides empirical Bayes, this is also referred to as type 2 maximum likelihood, generalized maximum likelihood, and evidence approximation.
Empirical Bayes combines the two statistical philosophies: frequentist ideas are used to estimate the hyperparameters, which are then used within the Bayesian inference.

Predictive distribution example
Investigating the predictive distribution for the example above.
[Figure: predictive distribution for N = 2, N = 5 and N = 20 observations, showing:]
True system (y(x) = -0.3 + 0.5x) generating the data (red line)
Mean of the predictive distribution (blue line)
One standard deviation of the predictive distribution (gray shaded area). Note that this is the point-wise predictive standard deviation as a function of x.
Observations (black circles)
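The posterior quantities m_N and S_N used throughout the example, together with the predictive mean and variance, can be reproduced in a few lines of numpy. The data are regenerated here with made-up random draws, following the model t_n = w_0 + w_1 x_n + ɛ_n.

```python
import numpy as np

rng = np.random.default_rng(4)

# Model constants from the example: prior precision alpha, noise precision beta
alpha, beta = 2.0, 1 / 0.2**2
w_true = np.array([-0.3, 0.5])

N = 20
x = rng.uniform(-1, 1, N)
t = w_true[0] + w_true[1] * x + rng.normal(0, 0.2, N)
Phi = np.column_stack([np.ones(N), x])        # phi_0(x) = 1, phi_1(x) = x

# Posterior p(w|T) = N(w | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input x* = 0.5:
# mean m_N^T phi(x*), variance 1/beta + phi(x*)^T S_N phi(x*)
phi_star = np.array([1.0, 0.5])
pred_mean = m_N @ phi_star
pred_var = 1 / beta + phi_star @ S_N @ phi_star
```

The predictive variance 1/β + φ(x*)^T S_N φ(x*) is what produces the shaded band in the figure: it never drops below the noise floor 1/β, and it shrinks toward that floor as more observations arrive.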
Posterior distribution
Recall that the posterior distribution is given by
p(w | T) = N(w | m_N, S_N), m_N = β S_N Φ^T T, S_N = (αI + βΦ^T Φ)^{-1}.
Let us now investigate the posterior mean solution m_N, which has an interpretation that directly leads to the kernel methods (lecture 5), including Gaussian processes.

A few concepts to summarize lecture 2
Linear regression: models the relationship between a continuous target variable t and a possibly nonlinear function φ(x) of the input variables.
Hyperparameter: a parameter of the prior distribution that controls the distribution of the parameters of the model.
Maximum a Posteriori (MAP): a point estimate obtained by maximizing the posterior distribution. Corresponds to a mode of the posterior distribution.
Gauss-Markov theorem: states that in a linear regression model, the best (in the sense of minimum MSE) linear unbiased estimate (BLUE) is given by the least squares estimate.
Ridge regression: an l2-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. A.k.a. Tikhonov regularization.
Lasso: an l1-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. The Lasso typically produces sparse estimates.