
Lecture 2: Linear Regression

"it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones"

Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden. Email: schon@isy.liu.se, Phone: 3-28373, Office: House B, Entrance 27.

Outline lecture 2

1. Summary of lecture 1
2. Linear basis function models
3. Maximum likelihood and least squares
4. Bias-variance trade-off
5. Shrinkage methods (ridge regression, LASSO)
6. Bayesian linear regression
7. Motivation of kernel methods

Summary of lecture 1 (I/III)

The exponential family of distributions over x, parameterized by η,

    p(x | η) = h(x) g(η) exp(η^T u(x)).

One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters θ should be chosen in such a way that the measurements {x_i}_{i=1}^N are as likely as possible, i.e.,

    θ̂ = arg max_θ p(x_1, ..., x_N | θ).

Summary of lecture 1 (II/III)

The three basic steps of Bayesian modeling (all variables are modeled as stochastic):

1. Assign prior distributions p(θ) to all unknown parameters θ.
2. Write down the likelihood p(x_1, ..., x_N | θ) of the data x_1, ..., x_N given the parameters θ.
3. Determine the posterior distribution of the parameters given the data,

    p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ).

If the posterior p(θ | x_1, ..., x_N) and the prior p(θ) distributions are of the same functional form, they are conjugate distributions and the prior is said to be a conjugate prior for the likelihood.
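To make step 3 and the notion of conjugacy concrete, here is a minimal MATLAB sketch (not from the slides; all values are illustrative) for the simplest conjugate pair: a Gaussian prior on the mean of Gaussian data with known noise precision, where the posterior is again Gaussian.

% Minimal sketch (illustrative values, not from the slides): conjugate update
% for the mean theta of Gaussian data with known noise precision beta.
beta = 25;                              % known noise precision
m0 = 0;  s0sq = 1;                      % prior p(theta) = N(theta | m0, s0sq)
x = 0.3 + sqrt(1/beta)*randn(10, 1);    % synthetic measurements
N = numel(x);
% Posterior p(theta | x_1,...,x_N) = N(theta | mN, sNsq), same functional form as the prior
sNsq = 1/(1/s0sq + N*beta);
mN   = sNsq*(m0/s0sq + beta*sum(x));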

Summary of lecture 1 (III/III)

Modeling heavy tails using the Student's t-distribution,

    St(x | µ, λ, ν) = ∫_0^∞ N(x | µ, (ηλ)^{−1}) Gam(η | ν/2, ν/2) dη
                    = Γ(ν/2 + 1/2)/Γ(ν/2) · (λ/(πν))^{1/2} · [1 + λ(x − µ)^2/ν]^{−ν/2 − 1/2},

which according to the first expression can be interpreted as an infinite mix of Gaussians with the same mean, but different variance.

(Figure: log densities of the Student's t-distribution and the Gaussian, illustrating the heavier tails of the former.)

Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Commonly used basis functions

By using nonlinear basis functions, y(x, w) can be a nonlinear function in the input variable x (while still linear in w).

Global basis functions (in the sense that a small change in x affects all basis functions):
1. Polynomial (see the illustrative example in Section 1.1), e.g., the identity φ(x) = x

Local basis functions (in the sense that a small change in x only affects the nearby basis functions):
1. Gaussian
2. Sigmoidal

Linear regression model on matrix form

It is often convenient to write the linear regression model

    t_n = w^T φ(x_n) + ε_n,    n = 1, ..., N,
    w = (w_0  w_1  ...  w_{M−1})^T,    φ(x_n) = (φ_0(x_n)  ...  φ_{M−1}(x_n))^T,

on matrix form

    T = Φw + E,

where

    T = (t_1  t_2  ...  t_N)^T,    E = (ε_1  ε_2  ...  ε_N)^T,

and Φ is the N × M design matrix

    Φ = [ φ_0(x_1)  φ_1(x_1)  ...  φ_{M−1}(x_1)
          φ_0(x_2)  φ_1(x_2)  ...  φ_{M−1}(x_2)
          ...
          φ_0(x_N)  φ_1(x_N)  ...  φ_{M−1}(x_N) ].

Maximum likelihood and least squares (I/IV)

In our linear regression model, t_n = w^T φ(x_n) + ε_n, assume that ε_n ~ N(0, β^{−1}) (i.i.d.). This results in the following likelihood function,

    p(t_n | w, β) = N(t_n | w^T φ(x_n), β^{−1}).

Note that this is a slight abuse of notation; p_{w,β}(t_n) or p(t_n; w, β) would have been better, since w and β are both considered deterministic parameters in ML.
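As a small illustration of the matrix form (a sketch with assumed values, not taken from the slides), the design matrix Φ can be built in MATLAB by stacking the basis functions column-wise; here a constant basis function φ_0(x) = 1 plus Gaussian basis functions are used.

% Minimal sketch (assumed values): design matrix with phi_0(x) = 1 and
% Gaussian basis functions, and data generated as T = Phi*w + E.
x  = linspace(-1, 1, 50)';                           % N = 50 inputs
mu = linspace(-1, 1, 9);                             % centers of the Gaussian basis functions
s  = 0.2;                                            % common width
Phi = [ones(size(x)), exp(-(x - mu).^2/(2*s^2))];    % N x M design matrix (uses implicit expansion)
M  = size(Phi, 2);
w  = randn(M, 1);                                    % some parameter vector
t  = Phi*w + sqrt(1/25)*randn(size(x));              % noisy targets, beta = 25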

Maximum likelihood and least squares (II/IV)

The available training data consists of the input variables X = {x_i}_{i=1}^N and the corresponding target variables T = {t_i}_{i=1}^N. According to our assumption on the noise, the likelihood function is given by

    p(T | w, β) = Π_{n=1}^N p(t_n | w, β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{−1}),

which results in the following log-likelihood function,

    L(w, β) = ln p(t_1, ..., t_N | w, β) = Σ_{n=1}^N ln N(t_n | w^T φ(x_n), β^{−1})
            = (N/2) ln β − (N/2) ln(2π) − (β/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2.

Maximum likelihood and least squares (III/IV)

The maximum likelihood problem now amounts to solving

    arg max_{w, β} L(w, β).

Setting the derivative

    ∂L/∂w = β Σ_{n=1}^N (t_n − w^T φ(x_n)) φ(x_n)^T

equal to 0 gives the following ML estimate for w,

    ŵ_ML = (Φ^T Φ)^{−1} Φ^T T = Φ† T,

where Φ† = (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudoinverse and Φ is the design matrix defined above. Note that if Φ^T Φ is singular (or close to singular) we can fix this by adding λI, i.e.,

    ŵ_RR = (Φ^T Φ + λI)^{−1} Φ^T T.

Maximum likelihood and least squares (IV/IV)

Maximizing the log-likelihood function L(w, β) w.r.t. β results in the following estimate for β,

    1/β̂_ML = (1/N) Σ_{n=1}^N (t_n − ŵ_ML^T φ(x_n))^2.

Finally, note that if we are only interested in w, the log-likelihood function is proportional to Σ_{n=1}^N (t_n − w^T φ(x_n))^2, which clearly shows that assuming a Gaussian noise model and making use of maximum likelihood (ML) corresponds to a least squares (LS) problem.

Interpretation of the Gauss-Markov theorem

The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE: "the restriction to unbiased estimates is not necessarily a wise one." [HTF, page 5]

Two classes of potentially biased estimators:
1. Subset selection methods
2. Shrinkage methods

This is intimately connected to the bias-variance trade-off. We will give a system identification example related to ridge regression to illustrate the bias-variance trade-off. See Section 3.2 for a slightly more abstract (but very informative) account of the bias-variance trade-off. (This is a perfect topic for discussions during the exercise sessions!)
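The estimators above are one-liners in MATLAB. A minimal sketch, reusing Phi and t from the previous sketch (the value of lambda is purely illustrative):

% Minimal sketch: ML/LS estimate, the ridge fix, and the ML noise estimate.
w_ml = (Phi'*Phi) \ (Phi'*t);                             % w_ML = (Phi'*Phi)^{-1}*Phi'*t (or simply Phi\t)
lambda = 0.1;                                             % illustrative regularization parameter
w_rr = (Phi'*Phi + lambda*eye(size(Phi, 2))) \ (Phi'*t);  % ridge regression
beta_inv_ml = mean((t - Phi*w_ml).^2);                    % 1/beta_ML = (1/N)*sum of squared residuals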

Interpretation of RR using the SVD of Φ

By studying the SVD of Φ it can be shown that ridge regression projects the measurements onto the principal components of Φ and then shrinks the coefficients of low-variance components more than the coefficients of high-variance components. (See Section 3.4.1 in HTF for details.)

Bias-variance tradeoff example (I/IV)

(Ex. 2.3 in Henrik Ohlsson's PhD thesis.) Consider a SISO system

    y_t = Σ_{k=1}^n g_k u_{t−k} + e_t,                    (1)

where u_t denotes the input, y_t denotes the output, e_t denotes white noise (E(e_t) = 0 and E(e_t e_s) = σ^2 δ(t − s)) and {g_k}_{k=1}^n denote the impulse response of the system. Recall that the impulse response is the output y_t when u_t = δ(t) is used in (1), which results in

    y_t = g_t + e_t   for t = 1, ..., n,
    y_t = e_t         for t > n.

Bias-variance tradeoff example (II/IV)

The task is now to estimate the impulse response using an n-th order FIR model,

    y_t = w^T φ_t + e_t,    φ_t = (u_{t−1}  ...  u_{t−n})^T,    w ∈ R^n.

Let us use ridge regression (RR),

    ŵ_RR = arg min_w ||Y − Φw||_2^2 + λ w^T w,

to find the parameters w.

Bias-variance tradeoff example (III/IV)

(Figure: squared bias (gray line), variance (dashed line) and MSE (black line) plotted as functions of the regularization parameter λ.)

Squared bias (gray line): (E_ŵ[ŵ^T φ] − w^T φ)^2, where w denotes the true parameters
Variance (dashed line): E_ŵ[(ŵ^T φ − E_ŵ[ŵ^T φ])^2]
MSE (black line): MSE = (bias)^2 + variance
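The curves in the figure can be reproduced numerically by Monte Carlo simulation. The sketch below uses an assumed true impulse response, input and noise level (none of these settings are taken from the slides); for each λ it evaluates the ridge estimate's prediction at a fixed test regressor over many noise realizations.

% Minimal Monte Carlo sketch (assumed system and settings): squared bias,
% variance and MSE of the ridge FIR estimate as functions of lambda.
n = 20;  N = 50;  sigma = 0.5;  MC = 500;
g0 = 0.8.^(1:n)';                               % assumed true impulse response
u  = randn(N, 1);                               % input signal
Phi = toeplitz([0; u(1:N-1)], zeros(1, n));     % row t is (u_{t-1}, ..., u_{t-n}), zeros before t = 1
phi_test = randn(n, 1);                         % fixed test regressor
lambdas = logspace(-2, 2, 30);
yhat = zeros(MC, numel(lambdas));
for i = 1:MC
    y = Phi*g0 + sigma*randn(N, 1);             % new noise realization
    for j = 1:numel(lambdas)
        w = (Phi'*Phi + lambdas(j)*eye(n)) \ (Phi'*y);
        yhat(i, j) = w'*phi_test;
    end
end
bias2 = (mean(yhat, 1) - g0'*phi_test).^2;      % squared bias
vari  = var(yhat, 0, 1);                        % variance
mse   = bias2 + vari;                           % MSE = (bias)^2 + variance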

Bias-variance tradeoff example (IV/IV)

Flexible models will have a low bias and high variance, and more restricted models will have high bias and low variance. The model with the best predictive capabilities is the one which strikes the best tradeoff between bias and variance.

Recent contributions on impulse response identification using regularization:

Gianluigi Pillonetto and Giuseppe De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81-93, January 2010.

Tianshi Chen, Henrik Ohlsson and Lennart Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48(8):1525-1535, August 2012.

Lasso

The Lasso was introduced during lecture 1 as the MAP estimate when a Laplacian prior is assigned to the parameters. Alternatively, we can motivate the Lasso as the solution to

    min_w Σ_{n=1}^N (t_n − w^T φ(x_n))^2    s.t.    Σ_{j=1}^M |w_j| ≤ η,

which using a Lagrange multiplier λ can be stated as

    min_w Σ_{n=1}^N (t_n − w^T φ(x_n))^2 + λ Σ_{j=1}^M |w_j|.

The difference to ridge regression is simply that the Lasso makes use of the l1-norm Σ_{j=1}^M |w_j|, rather than the l2-norm Σ_{j=1}^M w_j^2 used in ridge regression, in shrinking the parameters.

Graphical illustration of Lasso and RR

(Figure: contours of the least squares cost function (LS estimate in the middle) for the Lasso and for ridge regression (RR). The constraint regions are shown in gray: |w_1| + |w_2| ≤ η (Lasso) and w_1^2 + w_2^2 ≤ η (RR).) The shape of the constraints motivates why the Lasso often leads to sparseness.

Implementing Lasso

The l1-regularized least squares problem (Lasso),

    min_w ||T − Φw||_2^2 + λ ||w||_1.                    (2)

YALMIP code solving (2). Download: http://users.isy.liu.se/johanl/yalmip/

w = sdpvar(M, 1);
ops = sdpsettings('verbose', 0);
solvesdp([], (T - Phi*w)'*(T - Phi*w) + lambda*norm(w, 1), ops)

CVX code solving (2). Download: http://cvxr.com/cvx/

cvx_begin
    variable w(M)
    minimize( (T - Phi*w)'*(T - Phi*w) + lambda*norm(w, 1) )
cvx_end

A MATLAB package dedicated to l1-regularized least squares problems is l1_ls. Download: http://www.stanford.edu/~boyd/l1_ls/
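If neither YALMIP nor CVX is installed, (2) can also be solved with a few lines of plain MATLAB using proximal-gradient (ISTA) iterations. This is only a sketch (the step size rule is the standard one, the iteration count is illustrative), not an implementation recommended on the slides.

% Minimal ISTA sketch for (2): min_w ||T - Phi*w||_2^2 + lambda*||w||_1.
% The step size must satisfy step <= 1/L with L = 2*norm(Phi)^2, the Lipschitz
% constant of the gradient of the quadratic term.
soft = @(v, kappa) sign(v).*max(abs(v) - kappa, 0);   % soft-thresholding operator
step = 1/(2*norm(Phi)^2);
w = zeros(size(Phi, 2), 1);
for k = 1:500                                          % iteration count is illustrative
    grad = 2*Phi'*(Phi*w - T);                         % gradient of the quadratic term
    w = soft(w - step*grad, step*lambda);              % proximal step for lambda*||w||_1
end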

Bayesian linear regression example (I/VI)

Consider the problem of fitting a straight line to noisy measurements. Let the model be (t_n ∈ R, x_n ∈ R)

    t_n = w_0 + w_1 x_n + ε_n,    n = 1, ..., N,                    (3)

where ε_n ~ N(0, 0.2^2), i.e., β = 0.2^{−2} = 25. According to (3), the following identity basis functions are used:

    φ_0(x_n) = 1,    φ_1(x_n) = x_n.

The example lives in two dimensions, allowing us to plot the distributions involved when illustrating the inference.

Bayesian linear regression example (II/VI)

Let the true values for w be w = (−0.3  0.5)^T (plotted using a white circle below). Generate synthetic measurements by

    t_n = w_0 + w_1 x_n + ε_n,    ε_n ~ N(0, 0.2^2),    x_n ~ U(−1, 1).

Furthermore, let the prior be

    p(w) = N(w | 0, α^{−1} I),    α = 2.0.

Bayesian linear regression example (III/VI)

Plot of the situation before any data arrives.

(Figure: the prior p(w) = N(w | 0, α^{−1} I) plotted in w-space, together with a few example realizations y(x, w) with w drawn from it.)

Bayesian linear regression example (IV/VI)

Plot of the situation after one measurement has arrived.

(Figure: the likelihood of the first measurement, plotted as a function of w,

    p(t_1 | w) = N(t_1 | w_0 + w_1 x_1, β^{−1}),

the posterior/prior,

    p(w | t_1) = N(w | m_1, S_1),    m_1 = β S_1 Φ^T t_1,    S_1 = (αI + β Φ^T Φ)^{−1},

and a few realizations from the posterior together with the first measurement (black circle).)
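The numbers in this example make it easy to reproduce. Below is a minimal MATLAB sketch (the sequential update form and the number of measurements are illustrative choices, not the slides' code); with a zero prior mean the recursion is algebraically identical to the batch expressions m_N = β S_N Φ^T T and S_N = (αI + β Φ^T Φ)^{−1} used on the slides.

% Minimal sketch of the example: sequential Bayesian updates of p(w | data).
alpha = 2;  beta = 25;                         % prior precision and noise precision
w_true = [-0.3; 0.5];                          % true parameters
m = [0; 0];  Sinv = alpha*eye(2);              % prior: m_0 = 0, S_0^{-1} = alpha*I
for n = 1:20                                   % number of measurements is arbitrary here
    x   = 2*rand - 1;                          % x_n ~ U(-1, 1)
    t   = w_true(1) + w_true(2)*x + 0.2*randn; % t_n = w_0 + w_1*x_n + eps_n
    phi = [1; x];                              % phi_0(x) = 1, phi_1(x) = x
    Sinv_new = Sinv + beta*(phi*phi');         % S_n^{-1} = S_{n-1}^{-1} + beta*phi*phi'
    m = Sinv_new \ (Sinv*m + beta*phi*t);      % m_n = S_n*(S_{n-1}^{-1}*m_{n-1} + beta*phi*t_n)
    Sinv = Sinv_new;
end
S = inv(Sinv);                                 % posterior covariance after all measurements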

Bayesian linear regression example (V/VI)

Plot of the situation after two measurements have arrived.

(Figure: the likelihood of the second measurement, plotted as a function of w,

    p(t_2 | w) = N(t_2 | w_0 + w_1 x_2, β^{−1}),

the posterior/prior,

    p(w | T) = N(w | m_2, S_2),    m_2 = β S_2 Φ^T T,    S_2 = (αI + β Φ^T Φ)^{−1},

and a few realizations from the posterior together with the measurements (black circles).)

Bayesian linear regression example (VI/VI)

Plot of the situation after 3 measurements have arrived.

(Figure: the likelihood of the latest measurement, plotted as a function of w,

    p(t_3 | w) = N(t_3 | w_0 + w_1 x_3, β^{−1}),

the posterior/prior,

    p(w | T) = N(w | m_3, S_3),    m_3 = β S_3 Φ^T T,    S_3 = (αI + β Φ^T Φ)^{−1},

and a few realizations from the posterior together with the measurements (black circles).)

Empirical Bayes

Important question: how do we decide on suitable values for the hyperparameters η?

Idea: estimate the hyperparameters from the data by selecting them such that they maximize the marginal likelihood function,

    p(T | η) = ∫ p(T | w, η) p(w | η) dw,

where η denotes the hyperparameters to be estimated.

This idea travels under many names; besides empirical Bayes it is also referred to as type 2 maximum likelihood, generalized maximum likelihood, and evidence approximation. Empirical Bayes combines the two statistical philosophies; frequentist ideas are used to estimate the hyperparameters, which are then used within the Bayesian inference.

Predictive distribution example

Investigating the predictive distribution for the example above.

(Figure: the predictive distribution after N = 2, N = 5 and N = 20 observations, showing
- the true system (y(x) = −0.3 + 0.5x) generating the data (red line),
- the mean of the predictive distribution (blue line),
- one standard deviation of the predictive distribution (gray shaded area); note that this is the point-wise predictive standard deviation as a function of x,
- the observations (black circles).)
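For completeness (the formula is not spelled out in the transcription, but it is the standard result behind such plots, see Bishop, Section 3.3.2): the plotted predictive distribution is p(t | x, T) = N(t | m_N^T φ(x), σ_N^2(x)) with σ_N^2(x) = 1/β + φ(x)^T S_N φ(x). A minimal sketch continuing the previous code (reusing m, S and beta):

% Minimal sketch: point-wise predictive mean and standard deviation on a grid.
xg = linspace(-1, 1, 100)';
pred_mean = zeros(size(xg));
pred_sd   = zeros(size(xg));
for i = 1:numel(xg)
    phi = [1; xg(i)];
    pred_mean(i) = m'*phi;                     % predictive mean m_N'*phi(x)
    pred_sd(i)   = sqrt(1/beta + phi'*S*phi);  % sqrt(1/beta + phi(x)'*S_N*phi(x))
end
% The gray shaded band in the figure corresponds to pred_mean +/- pred_sd.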

Posterior distribution

Recall that the posterior distribution is given by

    p(w | T) = N(w | m_N, S_N),    m_N = β S_N Φ^T T,    S_N = (αI + β Φ^T Φ)^{−1}.

Let us now investigate the posterior mean solution m_N, which has an interpretation that directly leads to the kernel methods (lecture 5), including Gaussian processes.

A few concepts to summarize lecture 2

Linear regression: models the relationship between a continuous target variable t and a possibly nonlinear function φ(x) of the input variables.

Hyperparameter: a parameter of the prior distribution that controls the distribution of the parameters of the model.

Maximum a posteriori (MAP): a point estimate obtained by maximizing the posterior distribution. Corresponds to a mode of the posterior distribution.

Gauss-Markov theorem: states that in a linear regression model, the best (in the sense of minimum MSE) linear unbiased estimate (BLUE) is given by the least squares estimate.

Ridge regression: an l2-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. A.k.a. Tikhonov regularization.

Lasso: an l1-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. The Lasso typically produces sparse estimates.
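As a closing preview of the kernel interpretation mentioned under "Posterior distribution" above (this derivation follows Bishop, Section 3.3.3, and is not part of the transcription): plugging m_N into the predictive mean gives

    y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T T = Σ_{n=1}^N β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n,

with the equivalent kernel k(x, x') = β φ(x)^T S_N φ(x'). The posterior mean prediction is thus a weighted sum of the training targets, which is exactly the form exploited by the kernel methods and Gaussian processes treated in lecture 5.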