Lecture: Linear Regression and Classification

"it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones"

Thomas Schön
Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 3-8373, Office: House B, Entrance 7.

Outline

1. Summary of Lecture
2. Bayesian linear regression
3. Motivation of kernel methods
4. Linear classification
   1. Problem setup
   2. Discriminant functions (mainly least squares)
   3. Probabilistic generative models
   4. Logistic regression (discriminative model)
   5. Bayesian logistic regression

(Chapter 3.3-4)

Summary of Lecture (I/VI)

The exponential family of distributions over $x$, parameterized by $\eta$,

$$p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right)$$

One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters $\theta$ should be chosen in such a way that the measurements $\{x_i\}_{i=1}^{N}$ are as likely as possible, i.e.,

$$\hat{\theta} = \arg\max_{\theta}\; p(x_1, \dots, x_N \mid \theta).$$

Summary of Lecture (II/VI)

The three basic steps of Bayesian modeling (all variables are modeled as stochastic):

1. Assign prior distributions $p(\theta)$ to all unknown parameters $\theta$.
2. Write down the likelihood $p(x_1, \dots, x_N \mid \theta)$ of the data $x_1, \dots, x_N$ given the parameters $\theta$.
3. Determine the posterior distribution of the parameters given the data

$$p(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto p(x_1, \dots, x_N \mid \theta)\, p(\theta)$$

If the posterior $p(\theta \mid x_1, \dots, x_N)$ and the prior $p(\theta)$ distributions are of the same functional form they are conjugate distributions, and the prior is said to be a conjugate prior for the likelihood.
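As a small illustration of maximum likelihood and of a conjugate prior, the sketch below estimates the mean of a Gaussian with known variance. The function names, the synthetic data, and the prior parameters `m0` and `s02` are assumptions made for this example only; the update formulas are the standard ones for a Gaussian prior on a Gaussian mean.

```python
import numpy as np

def gaussian_mean_ml(x):
    """ML estimate of the mean of a Gaussian with known variance: the sample mean."""
    return float(np.mean(x))

def gaussian_mean_posterior(x, sigma2, m0, s02):
    """Posterior N(mu | mN, sN2) of the mean under the conjugate prior N(mu | m0, s02)."""
    n = len(x)
    sN2 = 1.0 / (1.0 / s02 + n / sigma2)          # posterior variance
    mN = sN2 * (m0 / s02 + np.sum(x) / sigma2)    # posterior mean
    return float(mN), float(sN2)

# Synthetic data (assumed for illustration): true mean 1.5, known variance 1.0.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=50)

mu_ml = gaussian_mean_ml(x)
mN, sN2 = gaussian_mean_posterior(x, sigma2=1.0, m0=0.0, s02=10.0)
```

The posterior is again Gaussian, which is what conjugacy means here, and its mean lies between the prior mean and the ML estimate.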
Summary of Lecture (III/VI)

Modeling heavy tails using the Student's t-distribution

$$\mathrm{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}(\eta \mid \nu/2, \nu/2)\, d\eta = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda (x - \mu)^2}{\nu}\right)^{-\nu/2 - 1/2}$$

which according to the first expression can be interpreted as an infinite mixture of Gaussians with the same mean but different variances.

[Figure: curves labeled "log Student" and "log Gaussian".]

Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Summary of Lecture (IV/VI)

Linear regression models the relationship between a continuous target variable $t$ and a possibly nonlinear function $\phi(x)$ of the input variable $x$,

$$t_n = \underbrace{w^T \phi(x_n)}_{y(x_n, w)} + \epsilon_n.$$

Solved this problem using

1. Maximum Likelihood (ML)
2. Bayesian approach

ML with a Gaussian noise model is equivalent to least squares (LS).

Summary of Lecture (V/VI)

Theorem (Gauss-Markov): In a linear regression model

$$T = \Phi w + E,$$

where $E$ is white noise with zero mean and covariance $R$, the best linear unbiased estimate (BLUE) of $w$ is

$$\hat{w} = (\Phi^T R^{-1} \Phi)^{-1} \Phi^T R^{-1} T, \qquad \mathrm{Cov}(\hat{w}) = (\Phi^T R^{-1} \Phi)^{-1}.$$

Summary of Lecture (VI/VI)

Two potentially biased estimators are ridge regression ($p = 2$) and the Lasso ($p = 1$),

$$\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 \quad \text{s.t.} \quad \sum_{j=0}^{M-1} |w_j|^p \le \eta,$$

which using a Lagrange multiplier $\lambda$ can be stated

$$\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 + \lambda \sum_{j=0}^{M-1} |w_j|^p.$$

Interpretation: The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.

Alternative interpretation: The MAP estimate with the likelihood $\prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$ together with a Gaussian prior leads to ridge regression, and together with a Laplacian prior it leads to the LASSO.
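The Lagrangian form of ridge regression ($p = 2$) has a closed-form minimizer, $w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T T$, which the sketch below compares with ordinary least squares. The polynomial basis, the synthetic data, and the value of `lam` are assumptions chosen for illustration only.

```python
import numpy as np

def least_squares(Phi, t):
    """Unbiased LS estimate: minimizes sum_n (t_n - w^T phi_n)^2."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def ridge(Phi, t, lam):
    """Ridge estimate: minimizes sum_n (t_n - w^T phi_n)^2 + lam * sum_j w_j^2."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Synthetic data (assumed): noisy sine, polynomial basis 1, x, ..., x^5.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
Phi = np.vander(x, 6, increasing=True)
t = np.sin(np.pi * x) + 0.2 * rng.normal(size=30)

w_ls = least_squares(Phi, t)
w_ridge = ridge(Phi, t, lam=1.0)
```

The ridge weights have a smaller norm than the least squares weights at the price of a larger training residual. That is the bias introduced in exchange for a possibly lower MSE.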
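Bayesian Linear Regression - Example (I/VI)

Consider the problem of fitting a straight line to noisy measurements. Let the model be ($t_n \in \mathbb{R}$, $x_n \in \mathbb{R}$)

$$t_n = \underbrace{w_0 + w_1 x_n}_{y(x_n, w)} + \epsilon_n, \qquad n = 1, \dots, N, \tag{1}$$

$$\epsilon_n \sim \mathcal{N}(0, 0.2^2), \qquad \beta = \frac{1}{0.2^2} = 25.$$

According to (1), the following identity basis functions are used:

$$\phi_0(x_n) = 1, \qquad \phi_1(x_n) = x_n.$$

Since the example lives in two dimensions, we can plot distributions to illustrate the inference.

Bayesian Linear Regression - Example (II/VI)

Let the true values for $w$ be $w = \begin{pmatrix} -0.3 & 0.5 \end{pmatrix}^T$ (plotted using a white circle below). Generate synthetic measurements by

$$t_n = w_0 + w_1 x_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, 0.2^2), \qquad x_n \sim \mathcal{U}(-1, 1).$$

Furthermore, let the prior be

$$p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \alpha^{-1} I\right), \qquad \alpha = 2.0.$$

Bayesian Linear Regression - Example (III/VI)

Plot of the situation before any data arrives.

[Figure: the prior over $w$ and a few lines $y(x, w)$ drawn from it, plotted against $x$.]

Prior: $p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \tfrac{1}{2} I\right)$.

Example of a few realizations from the posterior.

Bayesian Linear Regression - Example (IV/VI)

Plot of the situation after one measurement has arrived.

[Figure: likelihood and posterior over $w$, and a few lines $y(x, w)$ drawn from the posterior, plotted against $x$.]

Likelihood (plotted as a function of $w$):

$$p(t_1 \mid w) = \mathcal{N}(t_1 \mid w_0 + w_1 x_1, \beta^{-1})$$

Posterior/prior:

$$p(w \mid t_1) = \mathcal{N}(w \mid m_1, S_1), \qquad m_1 = \beta S_1 \Phi^T t_1, \qquad S_1 = (\alpha I + \beta \Phi^T \Phi)^{-1}.$$

Example of a few realizations from the posterior and the first measurement (black circle).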
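The posterior update in this example can be sketched in a few lines of NumPy. The numerical values below (true weights, noise level, `alpha`, `beta`, number of samples, and the random seed) are assumptions chosen for illustration, and `posterior` is a hypothetical helper name, not code from the lecture.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I):
    S_N = (alpha I + beta Phi^T Phi)^{-1},  m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Assumed setup: straight line with Gaussian noise of standard deviation 0.2.
rng = np.random.default_rng(0)
w_true = np.array([-0.3, 0.5])
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=200)
Phi = np.column_stack([np.ones_like(x), x])      # basis functions 1 and x
t = Phi @ w_true + rng.normal(scale=0.2, size=x.size)

m_0, S_0 = posterior(Phi[:0], t[:0], alpha, beta)    # no data: the prior
m_1, S_1 = posterior(Phi[:1], t[:1], alpha, beta)    # one measurement
m_all, S_all = posterior(Phi, t, alpha, beta)        # all measurements
```

With no data the function returns the prior, and as measurements arrive the posterior covariance shrinks while the mean moves toward the weights that generated the data.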
Bayesian Linear Regression - Example (V/VI)

Plot of the situation after two measurements have arrived.

[Figure: likelihood and posterior over $w$, and a few lines $y(x, w)$ drawn from the posterior, plotted against $x$.]

Likelihood (plotted as a function of $w$):

$$p(t_2 \mid w) = \mathcal{N}(t_2 \mid w_0 + w_1 x_2, \beta^{-1})$$

Posterior/prior:

$$p(w \mid T) = \mathcal{N}(w \mid m_2, S_2), \qquad m_2 = \beta S_2 \Phi^T T, \qquad S_2 = (\alpha I + \beta \Phi^T \Phi)^{-1}.$$

Example of a few realizations from the posterior and the measurements (black circles).

Bayesian Linear Regression - Example (VI/VI)

Plot of the situation after 3 measurements have arrived.

[Figure: likelihood and posterior over $w$, and a few lines $y(x, w)$ drawn from the posterior, plotted against $x$.]

Likelihood (plotted as a function of $w$):

$$p(t_3 \mid w) = \mathcal{N}(t_3 \mid w_0 + w_1 x_3, \beta^{-1})$$

Posterior/prior:

$$p(w \mid T) = \mathcal{N}(w \mid m_3, S_3), \qquad m_3 = \beta S_3 \Phi^T T, \qquad S_3 = (\alpha I + \beta \Phi^T \Phi)^{-1}.$$

Example of a few realizations from the posterior and the measurements (black circles).

Predictive Distribution - Example

Investigating the predictive distribution for the example above.

[Figure: three panels showing the predictive distribution for increasing numbers of observations; the middle panel uses 5 observations.]

- True system ($y(x) = -0.3 + 0.5x$) generating the data (red line)
- Mean of the predictive distribution (blue line)
- One standard deviation of the predictive distribution (gray shaded area). Note that this is the point-wise predictive standard deviation as a function of $x$.
- Observations (black circles)

Posterior Distribution

Recall that the posterior distribution is given by

$$p(w \mid T) = \mathcal{N}(w \mid m_N, S_N), \qquad m_N = \beta S_N \Phi^T T, \qquad S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}.$$

Let us now investigate the posterior mean solution $m_N$ above, which has an interpretation that directly leads to the kernel methods (lecture 4), including Gaussian processes.
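The point-wise predictive mean and standard deviation discussed for this example can be computed as sketched below. It uses the standard result for this Gaussian model: the predictive mean is $m_N^T \phi(x)$ and the predictive variance is $\beta^{-1} + \phi(x)^T S_N \phi(x)$. The data, the seed, and the helper names `features` and `predictive` are assumptions made for illustration.

```python
import numpy as np

def features(x):
    """Basis functions of the straight-line model: phi(x) = (1, x)."""
    return np.column_stack([np.ones_like(x), x])

def predictive(x_new, x, t, alpha, beta):
    """Point-wise predictive mean and standard deviation at the inputs x_new."""
    Phi = features(x)
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    Phi_new = features(x_new)
    mean = Phi_new @ m_N
    # Noise variance 1/beta plus parameter uncertainty phi(x)^T S_N phi(x).
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
    return mean, np.sqrt(var)

# Assumed synthetic data from y(x) = -0.3 + 0.5 x with noise std 0.2.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=20)
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(-1, 1, 5)
mean_2, std_2 = predictive(grid, x[:2], t[:2], alpha, beta)
mean_20, std_20 = predictive(grid, x, t, alpha, beta)
```

The predictive standard deviation never drops below the noise level $\beta^{-1/2}$, and it does not grow when more observations are added.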
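Generative and Discriminative Models

Approaches that model the distributions of both the inputs and the outputs are known as generative models. The reason for the name is the fact that using these models we can generate new samples in the input space.

Approaches that model the posterior probability directly are referred to as discriminative models.

ML for Probabilistic Generative Models (I/IV)

Consider the two-class case, where the class-conditional densities $p(x \mid C_k)$ are Gaussian and the training data is given by $\{x_n, t_n\}_{n=1}^{N}$. Furthermore, assume that $p(C_1) = \alpha$.

The task is now to find the parameters $\alpha, \mu_1, \mu_2, \Sigma$ by maximizing the likelihood function,

$$p(T, X \mid \alpha, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left(p(x_n, C_1)\right)^{t_n} \left(p(x_n, C_2)\right)^{1 - t_n},$$

$$p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma),$$

$$p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma).$$

ML for Probabilistic Generative Models (II/IV)

Let us now maximize the logarithm of the likelihood function,

$$L(\alpha, \mu_1, \mu_2, \Sigma) = \ln \left( \prod_{n=1}^{N} \left(\alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\right)^{t_n} \left((1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\right)^{1 - t_n} \right)$$

The terms that depend on $\alpha$ are

$$\sum_{n=1}^{N} \left(t_n \ln \alpha + (1 - t_n) \ln(1 - \alpha)\right),$$

which is maximized by

$$\alpha = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2}$$

(as expected). $N_k$ denotes the number of data in class $C_k$. Straightforwardly we get

$$\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n, \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n.$$

ML for Probabilistic Generative Models (III/IV)

$$L(\Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n \ln\det\Sigma - \frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) - \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) \ln\det\Sigma - \frac{1}{2} \sum_{n=1}^{N} (1 - t_n)(x_n - \mu_2)^T \Sigma^{-1} (x_n - \mu_2)$$

Using the fact that $x^T A x = \mathrm{Tr}\left(A x x^T\right)$ we have

$$L(\Sigma) = -\frac{N}{2} \ln\det\Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right),$$

$$S = \frac{1}{N} \sum_{n=1}^{N} \left(t_n (x_n - \mu_1)(x_n - \mu_1)^T + (1 - t_n)(x_n - \mu_2)(x_n - \mu_2)^T\right).$$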
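The quantities derived so far ($\alpha$, $\mu_1$, $\mu_2$ and the matrix $S$) are simple weighted averages, as the following sketch shows. The synthetic two-class data and the function names `fit_generative` and `L_sigma` are assumptions made for illustration.

```python
import numpy as np

def fit_generative(X, t):
    """ML estimates alpha, mu_1, mu_2 and the matrix S; t_n = 1 marks class C_1."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    alpha = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    D1, D2 = X - mu1, X - mu2
    S = ((t[:, None] * D1).T @ D1 + ((1 - t)[:, None] * D2).T @ D2) / N
    return alpha, mu1, mu2, S

def L_sigma(Sigma, S, N):
    """Sigma-dependent part of the log-likelihood:
    L(Sigma) = -N/2 ln det Sigma - N/2 Tr(Sigma^{-1} S)."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * logdet - 0.5 * N * np.trace(np.linalg.solve(Sigma, S))

# Assumed synthetic data: two Gaussian classes with a shared covariance.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, size=60),
               rng.multivariate_normal([2.0, 1.0], cov, size=40)])
t = np.concatenate([np.ones(60), np.zeros(40)])

alpha, mu1, mu2, S = fit_generative(X, t)
```

Evaluating `L_sigma` at `S` and at perturbed matrices such as `1.3 * S` gives a quick numerical check of where $L(\Sigma)$ peaks.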
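ML for Probabilistic Generative Models (IV/IV)

Lemma (Useful matrix derivatives)

$$\frac{\partial}{\partial M} \ln\det M = M^{-T}, \qquad \frac{\partial}{\partial M} \mathrm{Tr}\left(M^{-1} N\right) = -M^{-T} N^T M^{-T}.$$

Differentiating $L(\Sigma) = -\frac{N}{2} \ln\det\Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right)$ results in

$$\frac{\partial L}{\partial \Sigma} = -\frac{N}{2} \Sigma^{-T} + \frac{N}{2} \Sigma^{-T} S \Sigma^{-T}, \qquad \frac{\partial L}{\partial \Sigma} = 0.$$

Hence, $\Sigma = S$.

More results on matrix derivatives are available in Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics. 2nd Edition, Chichester, UK: Wiley.

Generalized Linear Models for Classification

In linear regression we made use of a linear model

$$t_n = y(x_n, w) + \epsilon_n = w^T \phi(x_n) + \epsilon_n.$$

For classification problems the target variables are discrete, or slightly more general, posterior probabilities in the range $(0, 1)$. This is achieved using a so-called activation function $f$ ($f^{-1}$ must exist),

$$y(x) = f(w^T x + w_0). \tag{2}$$

Note that the decision surface corresponds to $y(x) = \text{constant}$, implying that $w^T x + w_0 = \text{constant}$. This means that the decision surface is a linear function of $x$, even if $f$ is nonlinear. Hence the name generalized linear model for (2).

Gradient of L(w) for Logistic Regression (I/II)

The negative log-likelihood is

$$L(w) = -\sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right), \qquad y_n = \sigma(a_n) = \frac{1}{1 + \exp(-a_n)}, \qquad a_n = w^T \phi_n.$$

Using the chain rule we have

$$\frac{\partial L}{\partial w} = \sum_{n=1}^{N} \frac{\partial L}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w},$$

$$\frac{\partial L}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n} = \frac{y_n - t_n}{y_n (1 - y_n)}.$$

Gradient of L(w) for Logistic Regression (II/II)

Furthermore,

$$\frac{\partial y_n}{\partial a_n} = \frac{\partial \sigma(a_n)}{\partial a_n} = \sigma(a_n)(1 - \sigma(a_n)) = y_n (1 - y_n), \qquad \frac{\partial a_n}{\partial w} = \phi_n,$$

which results in the following expression for the gradient

$$\frac{\partial L}{\partial w} = \sum_{n=1}^{N} (y_n - t_n) \phi_n = \Phi^T (Y - T),$$

$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad T = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}.$$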
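The gradient expression $\Phi^T (Y - T)$ can be checked numerically against central finite differences of the negative log-likelihood. The sketch below does this on random data; the data, the seed, and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_lik(w, Phi, t):
    """L(w) = -sum_n (t_n ln y_n + (1 - t_n) ln(1 - y_n)), y_n = sigmoid(w^T phi_n)."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    """Analytic gradient Phi^T (Y - T)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

def numerical_gradient(w, Phi, t, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (neg_log_lik(w + e, Phi, t) - neg_log_lik(w - e, Phi, t)) / (2 * eps)
    return g

# Assumed random test problem: 40 samples, a constant feature plus two inputs.
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
t = (rng.uniform(size=40) < 0.5).astype(float)
w = 0.5 * rng.normal(size=3)

g_analytic = gradient(w, Phi, t)
g_numeric = numerical_gradient(w, Phi, t)
```

The two gradients should agree up to finite-difference error.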
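Hessian of L(w) for Logistic Regression

$$H = \frac{\partial^2 L}{\partial w\, \partial w^T} = \frac{\partial}{\partial w^T} \sum_{n=1}^{N} (y_n - t_n) \phi_n = \sum_{n=1}^{N} y_n (1 - y_n) \phi_n \phi_n^T = \Phi^T R \Phi,$$

$$R = \begin{pmatrix} y_1 (1 - y_1) & 0 & \cdots & 0 \\ 0 & y_2 (1 - y_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & y_N (1 - y_N) \end{pmatrix}.$$

Bayesian Logistic Regression

Recall that

$$p(T \mid w) = \prod_{n=1}^{N} \sigma(w^T \phi_n)^{t_n} \left(1 - \sigma(w^T \phi_n)\right)^{1 - t_n}.$$

Hence, computing the posterior density

$$p(w \mid T) = \frac{p(T \mid w)\, p(w)}{p(T)}$$

is intractable. We are forced to use an approximation. Three alternatives:

1. Laplace approximation (this lecture)
2. VB & EP (lecture 5)
3. Sampling methods, e.g., MCMC (lecture 6)

Laplace Approximation (I/III)

The Laplace approximation is a simple approximation that is obtained by fitting a Gaussian centered around the (MAP) mode of the distribution.

Consider the density function $p(z)$ of a scalar stochastic variable $z$, given by

$$p(z) = \frac{1}{Z} f(z),$$

where $Z = \int f(z)\, dz$ is the normalization coefficient.

We start by finding a mode $z_0$ of the density function,

$$\left. \frac{d f(z)}{dz} \right|_{z = z_0} = 0.$$

Laplace Approximation (II/III)

Consider a Taylor expansion of $\ln f(z)$ around the mode $z_0$,

$$\ln f(z) \approx \ln f(z_0) + \underbrace{\left. \frac{d}{dz} \ln f(z) \right|_{z = z_0}}_{= 0} (z - z_0) + \frac{1}{2} \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0} (z - z_0)^2 = \ln f(z_0) - \frac{A}{2} (z - z_0)^2, \tag{3}$$

$$A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.$$

Taking the exponential of both sides in the approximation (3) results in

$$f(z) \approx f(z_0) \exp\left(-\frac{A}{2} (z - z_0)^2\right).$$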
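The scalar Taylor-expansion argument can be tried out numerically. The sketch below finds a mode $z_0$ by Newton's method on $\ln f$, computes $A$ by finite differences, and evaluates the resulting approximation of $f$ near the mode. The example function $f(z) = z^{a-1} e^{-bz}$ (an unnormalized Gamma density), the step size `h`, and the helper name `laplace_scalar` are assumptions made for illustration.

```python
import numpy as np

def laplace_scalar(log_f, z_init, h=1e-4, iters=50):
    """Find a mode z0 of f by Newton's method on ln f and return (z0, A),
    where A = -d^2/dz^2 ln f(z) at z0. Derivatives use central differences."""
    z = z_init
    for _ in range(iters):
        d1 = (log_f(z + h) - log_f(z - h)) / (2 * h)
        d2 = (log_f(z + h) - 2 * log_f(z) + log_f(z - h)) / h**2
        z = z - d1 / d2
    A = -(log_f(z + h) - 2 * log_f(z) + log_f(z - h)) / h**2
    return z, A

# Assumed example: f(z) = z^(a-1) exp(-b z) for z > 0, with mode (a-1)/b.
a, b = 5.0, 2.0

def log_f(z):
    return (a - 1) * np.log(z) - b * z

z0, A = laplace_scalar(log_f, z_init=1.5)

def f_approx(z):
    """Second-order approximation f(z0) exp(-A/2 (z - z0)^2)."""
    return np.exp(log_f(z0) - 0.5 * A * (z - z0) ** 2)
```

For this choice of $f$ the mode and curvature are known in closed form, $z_0 = (a-1)/b$ and $A = b^2/(a-1)$, which makes the numerical result easy to verify. The approximation is accurate close to $z_0$ and degrades further away.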
Laplace Approximation (III/III)

By normalizing this expression we have now obtained a Gaussian approximation

$$q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left(-\frac{A}{2} (z - z_0)^2\right), \qquad A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.$$

The main limitation of the Laplace approximation is that it is a local method that only captures aspects of the true density around a specific value $z_0$.

Bayesian Logistic Regression (I/II)

The posterior is

$$p(w \mid T) \propto p(T \mid w)\, p(w), \tag{4}$$

where we assume a Gaussian prior $p(w) = \mathcal{N}(w \mid m_0, S_0)$ and make use of the Laplace approximation. Taking the logarithm on both sides of (4) gives

$$\ln p(w \mid T) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right) + \text{const.}, \qquad y_n = \sigma(w^T \phi_n).$$

Bayesian Logistic Regression (II/II)

Using the Laplace approximation we can now obtain a Gaussian approximation

$$q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N),$$

where $w_{\text{MAP}}$ is the MAP estimate of $p(w \mid T)$ and the inverse covariance $S_N^{-1}$ is the Hessian of $-\ln p(w \mid T)$, evaluated at $w_{\text{MAP}}$,

$$S_N^{-1} = -\frac{\partial^2}{\partial w\, \partial w^T} \ln p(w \mid T) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n) \phi_n \phi_n^T.$$

Based on this distribution we can now start making predictions for new input data $\phi(x)$, which is typically what we are interested in. Recall that prediction corresponds to marginalization w.r.t. $w$.

A Few Concepts to Summarize the Lecture

Hyperparameter: A parameter of the prior distribution that controls the distribution of the parameters of the model.

Classification: The goal of classification is to assign an input vector $x$ to one of $K$ classes, $C_k$, $k = 1, \dots, K$.

Discriminant: A discriminant is a function that takes an input $x$ and assigns it to one of $K$ classes.

Generative models: Approaches that model the distributions of both the inputs and the outputs are known as generative models. In classification this amounts to modelling the class-conditional densities $p(x \mid C_k)$, as well as the prior densities $p(C_k)$. The reason for the name is the fact that using these models we can generate new samples in the input space.

Discriminative models: Approaches that model the posterior probability directly are referred to as discriminative models.
Logistic Regression: Discriminative model that makes direct use of a generalized linear model in the form of a logistic sigmoid to solve the classification problem.

Laplace approximation: A local approximation method that finds the mode of the posterior distribution and then fits a Gaussian centered at that mode.
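To tie the last two concepts together, the Laplace approximation for Bayesian logistic regression can be sketched as follows. The mode $w_{\text{MAP}}$ is found by Newton iterations on $-\ln p(w \mid T)$, and the covariance is the inverse of the Hessian at that mode. The synthetic data, the prior $\mathcal{N}(w \mid 0, 10 I)$, the fixed number of iterations, and the function name `laplace_logistic` are assumptions made for illustration. Plain Newton steps without a line search are used, which is adequate for this well-conditioned example but not guaranteed in general.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic(Phi, t, m0, S0, iters=25):
    """Return (w_map, S_N) of the Gaussian approximation q(w) = N(w | w_map, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)            # gradient of -ln p(w | T)
        H = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi   # Hessian of -ln p(w | T)
        w = w - np.linalg.solve(H, grad)                      # Newton step
    y = sigmoid(Phi @ w)
    S_N_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(S_N_inv)

# Assumed synthetic data: one input plus a constant feature, overlapping classes.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([-0.5, 1.0])
t = (rng.uniform(size=100) < sigmoid(Phi @ w_true)).astype(float)

m0 = np.zeros(2)
S0 = 10.0 * np.eye(2)
w_map, S_N = laplace_logistic(Phi, t, m0, S0)
```

At `w_map` the gradient of the negative log posterior vanishes, and `S_N` is smaller than the prior covariance `S0`, reflecting the information gained from the data.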