Outline Lecture 2 2(32)

Lecture 2: Linear Regression and Classification

"it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones"

Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden. Email: schon@isy.liu.se, Phone: 3-8373, Office: House B, Entrance 7.

1. Summary of Lecture 1
2. Bayesian linear regression
3. Motivation of Kernel methods
4. Linear Classification
   1. Problem setup
   2. Discriminant functions (mainly least squares)
   3. Probabilistic generative models
   4. Logistic regression (discriminative model)
   5. Bayesian logistic regression

(Chapter 3.3-4)

Summary of Lecture 1 (I/VI) 3(32)

The exponential family of distributions over x, parameterized by η,

p(x | η) = h(x) g(η) exp(η^T u(x)).

One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters θ should be chosen in such a way that the measurements {x_i}, i = 1, ..., N, are as likely as possible, i.e.,

θ̂ = arg max_θ p(x_1, ..., x_N | θ).

Summary of Lecture 1 (II/VI) 4(32)

The three basic steps of Bayesian modeling (all variables are modeled as stochastic):

1. Assign prior distributions p(θ) to all unknown parameters θ.
2. Write down the likelihood p(x_1, ..., x_N | θ) of the data x_1, ..., x_N given the parameters θ.
3. Determine the posterior distribution of the parameters given the data,

p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ).

If the posterior p(θ | x_1, ..., x_N) and the prior p(θ) distributions are of the same functional form they are conjugate distributions, and the prior is said to be a conjugate prior for the likelihood.
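
As a concrete illustration of these three steps, a minimal NumPy sketch of a conjugate Beta-Bernoulli update; the data and hyperparameter values are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

# 1. Prior: theta ~ Beta(a0, b0) (illustrative hyperparameters)
a0, b0 = 2.0, 2.0

# 2. Likelihood: x_i ~ Bernoulli(theta), synthetic observations
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# 3. Posterior: the Beta prior is conjugate to the Bernoulli likelihood,
#    so the posterior is Beta(a0 + #ones, b0 + #zeros) -- the same
#    functional form as the prior.
aN = a0 + x.sum()
bN = b0 + (1 - x).sum()
print(f"posterior: Beta({aN:.1f}, {bN:.1f}), mean = {aN / (aN + bN):.3f}")
```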

Summary of Lecture 1 (III/VI) 5(32)

Modeling heavy tails using the Student's t-distribution,

St(x | µ, λ, ν) = ∫₀^∞ N(x | µ, (ηλ)⁻¹) Gam(η | ν/2, ν/2) dη
               = Γ(ν/2 + 1/2) / Γ(ν/2) (λ/(πν))^(1/2) (1 + λ(x - µ)²/ν)^(-ν/2 - 1/2),

which according to the first expression can be interpreted as an infinite mixture of Gaussians with the same mean but different variances.

[Figure: log Student's t density versus log Gaussian density, illustrating the heavier tails of the Student's t.]

Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Summary of Lecture 1 (IV/VI) 6(32)

Linear regression models the relationship between a continuous target variable t and a possibly nonlinear function φ(x) of the input variable x,

t_n = w^T φ(x_n) + ε_n,   where y(x_n, w) = w^T φ(x_n).

Solved this problem using

1. Maximum likelihood (ML)
2. Bayesian approach

ML with a Gaussian noise model is equivalent to least squares (LS).

Summary of Lecture 1 (V/VI) 7(32)

Theorem (Gauss-Markov). In a linear regression model T = Φw + E, where E is white noise with zero mean and covariance R, the best linear unbiased estimate (BLUE) of w is

ŵ = (Φ^T R⁻¹ Φ)⁻¹ Φ^T R⁻¹ T,   Cov(ŵ) = (Φ^T R⁻¹ Φ)⁻¹.

Summary of Lecture 1 (VI/VI) 8(32)

Two potentially biased estimators are ridge regression (p = 2) and the Lasso (p = 1),

min_w ∑_{n=1}^N (t_n - w^T φ(x_n))²   s.t.   ∑_{j=1}^M |w_j|^p ≤ η,

which using a Lagrange multiplier λ can be stated as

min_w ∑_{n=1}^N (t_n - w^T φ(x_n))² + λ ∑_{j=1}^M |w_j|^p.

Interpretation: the least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.

Alternative interpretation: the MAP estimate with the Gaussian likelihood N(t_n | w^T φ(x_n), β⁻¹) together with a Gaussian prior leads to ridge regression, and together with a Laplacian prior it leads to the Lasso.
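
A small NumPy sketch of the ridge (p = 2) estimate in its Lagrangian form, which coincides with the MAP estimate under a Gaussian prior; the polynomial basis, synthetic data and value of λ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a noisy sine (illustrative only)
N = 20
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

# Polynomial basis phi(x) = (1, x, ..., x^M)
M = 9
Phi = np.vander(x, M + 1, increasing=True)

lam = 1e-3  # regularization weight lambda (illustrative)

# Ridge regression / MAP with a Gaussian prior:
# w_ridge = argmin ||t - Phi w||^2 + lam * ||w||^2
#         = (lam*I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

# Ordinary least squares for comparison (ill-conditioned for large M)
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print("||w_ls|| =", np.linalg.norm(w_ls), " ||w_ridge|| =", np.linalg.norm(w_ridge))
```

The shrinkage shows up directly in the much smaller norm of the regularized weight vector.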

Bayesian Linear Regression - Example (I/VI) 9(32)

Consider the problem of fitting a straight line to noisy measurements. Let the model be (t_n ∈ R, x_n ∈ R)

t_n = w_0 + w_1 x_n + ε_n,   n = 1, ..., N,   (1)

where y(x, w) = w_0 + w_1 x and ε_n ~ N(0, 0.2²), i.e., β = 1/0.2² = 25.

According to (1), the identity basis functions φ(x_n) = (1, x_n)^T are used. Since the example lives in two dimensions, we can plot distributions to illustrate the inference.

Bayesian Linear Regression - Example (II/VI) 10(32)

Let the true values for w be w = (0.3, 0.5)^T (plotted using a white circle below). Generate synthetic measurements by

t_n = w_0 + w_1 x_n + ε_n,   ε_n ~ N(0, 0.2²),   x_n ~ U(-1, 1).

Furthermore, let the prior be

p(w) = N(w | (0, 0)^T, α⁻¹ I)

for a fixed value of α.

Bayesian Linear Regression - Example (III/VI) 11(32)

Plot of the situation before any data arrives.

Prior: p(w) = N(w | (0, 0)^T, α⁻¹ I). Example of a few realizations from the posterior.

Bayesian Linear Regression - Example (IV/VI) 12(32)

Plot of the situation after one measurement has arrived.

Likelihood (plotted as a function of w): p(t_1 | w) = N(t_1 | w_0 + w_1 x_1, β⁻¹).

Posterior/prior: p(w | t_1) = N(w | m_1, S_1), where m_1 = β S_1 Φ^T t_1 and S_1 = (αI + β Φ^T Φ)⁻¹.

Example of a few realizations from the posterior and the first measurement (black circle).
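
A minimal NumPy sketch of the posterior updates m_N = β S_N Φ^T T and S_N = (αI + β Φ^T Φ)⁻¹ for this straight-line example; the value of α and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta = 2.0, 25.0          # prior precision (illustrative) and noise precision
w_true = np.array([0.3, 0.5])    # true weights in the example above

def phi(x):
    """Identity basis phi(x) = (1, x)^T for the straight-line model."""
    return np.column_stack([np.ones_like(x), x])

def posterior(x, t):
    """Posterior N(w | m_N, S_N) given inputs x and targets t."""
    Phi = phi(x)
    S = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ t
    return m, S

# Generate measurements and watch the posterior mean concentrate on w_true
x = rng.uniform(-1.0, 1.0, 30)
t = w_true[0] + w_true[1] * x + rng.normal(0.0, 0.2, size=x.shape)

for n in (1, 2, 30):
    m, S = posterior(x[:n], t[:n])
    print(f"N = {n:2d}: m_N = {np.round(m, 3)}")
```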

Bayesian Linear Regression - Example (V/VI) 13(32)

Plot of the situation after two measurements have arrived.

Likelihood (plotted as a function of w): p(t_2 | w) = N(t_2 | w_0 + w_1 x_2, β⁻¹).

Posterior/prior: p(w | T) = N(w | m_2, S_2), where m_2 = β S_2 Φ^T T and S_2 = (αI + β Φ^T Φ)⁻¹.

Example of a few realizations from the posterior and the measurements (black circles).

Bayesian Linear Regression - Example (VI/VI) 14(32)

Plot of the situation after 3 measurements have arrived.

Likelihood (plotted as a function of w): p(t_3 | w) = N(t_3 | w_0 + w_1 x_3, β⁻¹).

Posterior/prior: p(w | T) = N(w | m_3, S_3), where m_3 = β S_3 Φ^T T and S_3 = (αI + β Φ^T Φ)⁻¹.

Example of a few realizations from the posterior and the measurements (black circles).

Predictive Distribution - Example 15(32)

Investigating the predictive distribution for the example above.

[Figure: predictive distribution for an increasing number of observations (three panels; the middle panel uses 5 observations).]

True system (y(x) = 0.3 + 0.5x) generating the data (red line). Mean of the predictive distribution (blue line). One standard deviation of the predictive distribution (gray shaded area); note that this is the point-wise predictive standard deviation as a function of x. Observations (black circles).

Posterior Distribution 16(32)

Recall that the posterior distribution is given by

p(w | T) = N(w | m_N, S_N),   m_N = β S_N Φ^T T,   S_N = (αI + β Φ^T Φ)⁻¹.

Let us now investigate the posterior mean solution m_N above, which has an interpretation that directly leads to the kernel methods (lecture 4), including Gaussian processes.
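
A short NumPy sketch of the predictive mean and point-wise standard deviation; it assumes the standard Gaussian predictive form p(t | x, T) = N(t | m_N^T φ(x), β⁻¹ + φ(x)^T S_N φ(x)), which is not spelled out on the slides, together with an illustrative value for α:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0                      # illustrative precisions, as above

# Synthetic straight-line data (same setup as the example)
x = rng.uniform(-1.0, 1.0, 20)
t = 0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=x.shape)

Phi = np.column_stack([np.ones_like(x), x])  # identity basis (1, x)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution p(t* | x*, T) = N(t* | m_N^T phi(x*), sigma_N^2(x*)),
# with sigma_N^2(x*) = 1/beta + phi(x*)^T S_N phi(x*)
x_star = np.linspace(-1.0, 1.0, 5)
Phi_star = np.column_stack([np.ones_like(x_star), x_star])
pred_mean = Phi_star @ m_N
pred_std = np.sqrt(1.0 / beta + np.einsum("ij,jk,ik->i", Phi_star, S_N, Phi_star))

for xs, mu, sd in zip(x_star, pred_mean, pred_std):
    print(f"x = {xs:+.2f}: mean = {mu:+.3f}, std = {sd:.3f}")
```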

Generative and Discriminative Models 17(32)

Approaches that model the distributions of both the inputs and the outputs are known as generative models. The reason for the name is the fact that using these models we can generate new samples in the input space.

Approaches that model the posterior probability directly are referred to as discriminative models.

ML for Probabilistic Generative Models (I/IV) 18(32)

Consider the two-class case, where the class-conditional densities p(x | C_k) are Gaussian and the training data is given by {x_n, t_n}. Furthermore, assume that p(C_1) = α. The task is now to find the parameters α, µ_1, µ_2, Σ by maximizing the likelihood function

p(T, X | α, µ_1, µ_2, Σ) = ∏_{n=1}^N (p(x_n, C_1))^(t_n) (p(x_n, C_2))^(1 - t_n),

where

p(x_n, C_1) = p(C_1) p(x_n | C_1) = α N(x_n | µ_1, Σ),
p(x_n, C_2) = p(C_2) p(x_n | C_2) = (1 - α) N(x_n | µ_2, Σ).

ML for Probabilistic Generative Models (II/IV) 19(32)

Let us now maximize the logarithm of the likelihood function,

L(α, µ_1, µ_2, Σ) = ln ∏_{n=1}^N (α N(x_n | µ_1, Σ))^(t_n) ((1 - α) N(x_n | µ_2, Σ))^(1 - t_n).

The terms that depend on α are

∑_{n=1}^N (t_n ln α + (1 - t_n) ln(1 - α)),

which is maximized by α = (1/N) ∑_{n=1}^N t_n = N_1/(N_1 + N_2) (as expected), where N_k denotes the number of data points in class C_k. Straightforwardly we get

µ_1 = (1/N_1) ∑_{n=1}^N t_n x_n,   µ_2 = (1/N_2) ∑_{n=1}^N (1 - t_n) x_n.

ML for Probabilistic Generative Models (III/IV) 20(32)

The terms of the log-likelihood that depend on Σ are

L(Σ) = -(1/2) ∑_n t_n ln det Σ - (1/2) ∑_n t_n (x_n - µ_1)^T Σ⁻¹ (x_n - µ_1)
       - (1/2) ∑_n (1 - t_n) ln det Σ - (1/2) ∑_n (1 - t_n) (x_n - µ_2)^T Σ⁻¹ (x_n - µ_2).

Using the fact that x^T A x = Tr(A x x^T) we have

L(Σ) = -(N/2) ln det Σ - (N/2) Tr(Σ⁻¹ S),

S = (1/N) ∑_n (t_n (x_n - µ_1)(x_n - µ_1)^T + (1 - t_n)(x_n - µ_2)(x_n - µ_2)^T).
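
A compact NumPy sketch of these ML estimates (α, µ_1, µ_2 and Σ = S) computed on synthetic two-class data; the class means, counts and shared covariance used to generate the data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic two-class data with a shared covariance (illustrative)
N1, N2 = 60, 40
X1 = rng.multivariate_normal([1.0, 1.0], 0.3 * np.eye(2), N1)
X2 = rng.multivariate_normal([-1.0, 0.0], 0.3 * np.eye(2), N2)
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(N1), np.zeros(N2)])   # t_n = 1 for class C1

N = len(t)
alpha_hat = t.mean()                              # = N1 / (N1 + N2)
mu1 = (t[:, None] * X).sum(axis=0) / t.sum()
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / (1 - t).sum()

# Shared covariance Sigma = S: weighted sum of the class scatter matrices
D1, D2 = X - mu1, X - mu2
S = (t[:, None, None] * np.einsum("ni,nj->nij", D1, D1)
     + (1 - t)[:, None, None] * np.einsum("ni,nj->nij", D2, D2)).sum(axis=0) / N

print("alpha =", alpha_hat)
print("mu1 =", mu1, " mu2 =", mu2)
print("Sigma =\n", S)
```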

ML for Probabilistic Generative Models (IV/IV) 21(32)

Lemma (useful matrix derivatives). For an invertible matrix M and a constant matrix A,

∂/∂M ln det M = (M⁻¹)^T,
∂/∂M Tr(M⁻¹ A) = -(M⁻¹ A M⁻¹)^T.

Differentiating L(Σ) = -(N/2) ln det Σ - (N/2) Tr(Σ⁻¹ S) results in

∂L/∂Σ = -(N/2) (Σ⁻¹)^T + (N/2) (Σ⁻¹ S Σ⁻¹)^T = 0.

Hence, Σ = S.

More results on matrix derivatives are available in Magnus, J. R. & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd Edition, Chichester, UK: Wiley.

Generalized Linear Models for Classification 22(32)

In linear regression we made use of a linear model

t_n = y(x_n, w) + ε_n = w^T φ(x_n) + ε_n.

For classification problems the target variables are discrete or, slightly more generally, posterior probabilities in the range (0, 1). This is achieved using a so-called activation function f (f⁻¹ must exist),

y(x) = f(w^T x + w_0).   (2)

Note that the decision surface corresponds to y(x) = constant, implying that w^T x + w_0 = constant. This means that the decision surface is a linear function of x, even if f is nonlinear. Hence the name generalized linear model for (2).

Gradient of L(w) for Logistic Regression (I/II) 23(32)

The negative log-likelihood is

L(w) = -∑_{n=1}^N (t_n ln y_n + (1 - t_n) ln(1 - y_n)),

where y_n = σ(a_n) = 1/(1 + exp(-a_n)) and a_n = w^T φ_n. Using the chain rule we have

∂L/∂w = ∑_{n=1}^N (∂L/∂y_n)(∂y_n/∂a_n)(∂a_n/∂w),

∂L/∂y_n = -t_n/y_n + (1 - t_n)/(1 - y_n) = (y_n - t_n)/(y_n(1 - y_n)).

Gradient of L(w) for Logistic Regression (II/II) 24(32)

Furthermore,

∂y_n/∂a_n = ∂σ(a_n)/∂a_n = σ(a_n)(1 - σ(a_n)) = y_n(1 - y_n),   ∂a_n/∂w = φ_n,

which results in the following expression for the gradient,

∇L(w) = ∑_{n=1}^N (y_n - t_n) φ_n = Φ^T (Y - T),

where the rows of Φ are φ(x_n)^T,

Φ = [ φ_1(x_1) φ_2(x_1) ... φ_M(x_1)
      φ_1(x_2) φ_2(x_2) ... φ_M(x_2)
      ...
      φ_1(x_N) φ_2(x_N) ... φ_M(x_N) ],   Y = (y_1, ..., y_N)^T,   T = (t_1, ..., t_N)^T.
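
A minimal NumPy sketch that evaluates the gradient ∇L(w) = Φ^T(Y - T) inside a plain gradient-descent loop; the synthetic data, step size and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic two-class data with a bias feature (illustrative)
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([0.5, 2.0, -1.0])              # (bias, w1, w2)
Phi = np.column_stack([np.ones(N), X])           # design matrix, rows phi(x_n)^T
T = (rng.uniform(size=N) < 1 / (1 + np.exp(-Phi @ w_true))).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
step = 0.5
for _ in range(2000):
    Y = sigmoid(Phi @ w)           # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (Y - T)         # gradient of the negative log-likelihood
    w -= step * grad / N           # simple gradient-descent step

print("w_true =", w_true, " w_ml ≈", np.round(w, 2))
```

A Newton update based on the Hessian derived on the next slide reaches the same solution in far fewer iterations.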

Hessian of L(w) for Logistic Regression 25(32)

H = ∂²L/(∂w ∂w^T) = ∑_{n=1}^N y_n(1 - y_n) φ_n φ_n^T = Φ^T R Φ,

where R is the diagonal matrix

R = diag(y_1(1 - y_1), y_2(1 - y_2), ..., y_N(1 - y_N)).

Bayesian Logistic Regression 26(32)

Recall that

p(T | w) = ∏_{n=1}^N σ(w^T φ_n)^(t_n) (1 - σ(w^T φ_n))^(1 - t_n).

Hence, computing the posterior density

p(w | T) = p(T | w) p(w) / p(T)

is intractable. We are forced to an approximation. Three alternatives:

1. Laplace approximation (this lecture)
2. VB & EP (lecture 5)
3. Sampling methods, e.g., MCMC (lecture 6)

Laplace Approximation (I/III) 27(32)

The Laplace approximation is a simple approximation that is obtained by fitting a Gaussian centered around the (MAP) mode of the distribution.

Consider the density function p(z) of a scalar stochastic variable z, given by

p(z) = (1/Z) f(z),

where Z = ∫ f(z) dz is the normalization coefficient. We start by finding a mode z_0 of the density function,

df(z)/dz |_{z = z_0} = 0.

Laplace Approximation (II/III) 28(32)

Consider a Taylor expansion of ln f(z) around the mode z_0,

ln f(z) ≈ ln f(z_0) + d/dz ln f(z)|_{z = z_0} (z - z_0) + (1/2) d²/dz² ln f(z)|_{z = z_0} (z - z_0)²
        = ln f(z_0) - (A/2)(z - z_0)²,   (3)

where the first-order term vanishes since z_0 is a mode, and

A = -d²/dz² ln f(z)|_{z = z_0}.

Taking the exponential of both sides in the approximation (3) results in

f(z) ≈ f(z_0) exp(-(A/2)(z - z_0)²).
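
A small sketch of the one-dimensional Laplace approximation: locate the mode z_0 by Newton's method on d/dz ln f(z) = 0 and set A = -d²/dz² ln f(z)|_{z=z_0}. The unnormalized density f is an illustrative choice:

```python
import numpy as np

# Unnormalized density f(z) = z^3 * exp(-2 z) on z > 0 (illustrative example)
def ln_f(z):
    return 3.0 * np.log(z) - 2.0 * z

def dln_f(z):        # first derivative of ln f
    return 3.0 / z - 2.0

def d2ln_f(z):       # second derivative of ln f
    return -3.0 / z**2

# Find the mode z0 of f by Newton's method on d/dz ln f(z) = 0
z0 = 1.0
for _ in range(20):
    z0 -= dln_f(z0) / d2ln_f(z0)

A = -d2ln_f(z0)      # curvature of -ln f at the mode
print(f"mode z0 = {z0:.4f}, A = {A:.4f}")
print(f"Laplace approximation: q(z) = N(z | {z0:.3f}, {1 / A:.3f})")
```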

Laplace Approximation (III/III) 9(3) Bayesian Logistic Regression (I/II) 3(3) By normalizing this expression we have now obtained a Gaussian approximation q(z) = ( ) A / exp ( A ) π (z z ) A = d ln f (z) dz z=z The main limitation of the Laplace approximation is that it is a local method that only captures aspects of the true density around a specific value z. The posterior is p(w T) p(t w)p(w), (4) we assume a Gaussian prior p(w) = (w m, S ) and make use of the Laplace approximation. Taking logarithm on both sides of (4) gives ln p(w t) = (w m ) T S (w m ) + y n = σ(w T φ n ). (t n ln y n + ( t n ) ln( y n )) + const. Bayesian Logistic Regression (II/II) 3(3) A Few Concepts to Summarize Lecture 3(3) Using the Laplace approximation we can now obtain a Gaussian approximation q(w) = (w w MAP, S ) w MAP is the MAP estimate of p(w T) and the covariance S is the Hessian of ln p(w T), S = ln p(w T) = S w wt + y n ( y n )φ n φ T n Based on this distribution we can now start making predictions for new input data φ(x), which is typically what we are interested in. Recall that prediction corresponds to marginalization w.r.t. w. Hyperparameter: A parameter of the prior distribution that controls the distribution of the parameters of the model. Classification: The goal of classification is to assign an input vector x to one of K classes, C k, k =,..., K. Discriminant: A discriminant is a function that takes an input x and assigns it to one of K classes. Generative models: Approaches that model the distributions of both the inputs and the outputs are known as generative models. In classification this amounts to modelling the class-conditional densities p(x C k ), as well as the prior densities p(c k ). The reason for the name is the fact that using these models we can generate new samples in the input space. Discriminative models: Approaches that models the posterior probability directly are referred to as discriminative models. Logistic Regression: Discriminative model that makes direct use of a generalized linear model in the form of a logistic sigmoid to solve the classification problem. Laplace approximation: A local approximation method that finds the mode of the posterior distribution and then fits a Gaussian centered at that mode.