Bayesian Logistic Regression


Christiana Dorsey
6 years ago

1 Bayesian Logistic Regression
Sargur N.
University at Buffalo, State University of New York, USA

2 Topics in Linear Models for Classification
Overview
1. Discriminant Functions
2. Probabilistic Generative Models
3. Probabilistic Discriminative Models
4. The Laplace Approximation
5. Bayesian Logistic Regression

3 Topics in Bayesian Logistic Regression
Recap of Logistic Regression
Roadmap of Bayesian Logistic Regression
Laplace Approximation
Evaluation of posterior distribution
Gaussian approximation
Predictive Distribution
Convolution of Sigmoid and Gaussian
Approximate sigmoid with probit
Variational Bayesian Logistic Regression

4 Recap of Logistic Regression
Feature vector $\phi$, two classes $C_1$ and $C_2$.
The posterior probability of class $C_1$ can be written as
$p(C_1|\phi) = y(\phi) = \sigma(w^T\phi)$
where $\phi$ is an $M$-dimensional feature vector and $\sigma(\cdot)$ is the logistic sigmoid function.
The goal is to determine the $M$ parameters $w$.
Known as logistic regression in statistics, although it is a model for classification rather than for regression.

5 Determining Logistic Regression Parameters
Maximum likelihood approach to determine $w$.
Data set $\{\phi_n, t_n\}$, where $t_n \in \{0,1\}$ and $\phi_n = \phi(x_n)$, $n = 1,\dots,N$.
Since $t_n$ is binary we can use a Bernoulli distribution.
Let $y_n$ be the probability that $t_n = 1$, i.e., $y_n = p(C_1|\phi_n)$.
Denote $t = (t_1,\dots,t_N)^T$.
Likelihood function associated with the $N$ observations:
$p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$

6 Simple Sequential Solution
The error function is the negative of the log-likelihood (the cross-entropy error function):
$E(w) = -\ln p(t|w) = -\sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\}$
There is no closed-form maximum likelihood solution for $w$.
Gradient of the error function (error times feature vector):
$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\phi_n$
Solve using an iterative approach:
$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n$, where $\nabla E_n = (y_n - t_n)\phi_n$
The maximum likelihood solution has severe over-fitting problems for linearly separable data.
A more efficient iterative scheme is the IRLS algorithm.
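The sequential update on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch only: the toy data set, the learning rate `eta`, the number of passes `epochs`, and the function names are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    # E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient(w, Phi, t):
    # grad E(w) = sum_n (y_n - t_n) phi_n
    return Phi.T @ (sigmoid(Phi @ w) - t)

def sequential_fit(Phi, t, eta=0.1, epochs=200):
    # One update per data point: w <- w - eta * (y_n - t_n) * phi_n
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            w -= eta * (sigmoid(w @ phi_n) - t_n) * phi_n
    return w

# Hypothetical toy data: a bias feature plus one input, classes overlap
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
                [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w_hat = sequential_fit(Phi, t)
print("weights:", w_hat, "error:", cross_entropy(w_hat, Phi, t))
```

The classes in the toy data overlap on purpose: with linearly separable data the weights would keep growing, which is the over-fitting problem noted on the slide.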

7 IRLS for Logistic Regression
Posterior probability of class $C_1$ is
$p(C_1|\phi) = y(\phi) = \sigma(w^T\phi)$
Likelihood function for the data set $\{\phi_n, t_n\}$, $t_n \in \{0,1\}$, $\phi_n = \phi(x_n)$:
$p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$
1. Error function
The negative log-likelihood yields the cross-entropy:
$E(w) = -\sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\}$

8 IRLS for Logistic Regression
2. Gradient of the error function:
$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\phi_n = \Phi^T(y - t)$
3. Hessian:
$H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n(1-y_n)\phi_n\phi_n^T = \Phi^T R\Phi$
$\Phi$ is the $N \times M$ design matrix whose $n$th row is $\phi_n^T$.
$R$ is the $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1-y_n)$.
The Hessian is not constant and depends on $w$ through $R$.
Since $H$ is positive-definite (i.e., $u^T H u > 0$ for arbitrary nonzero $u$), the error function is a convex function of $w$ and so has a unique minimum.
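The matrix forms of the gradient and Hessian can be checked numerically. The sketch below uses a made-up design matrix and weight vector (both assumptions for illustration). It confirms that the Hessian is symmetric with positive eigenvalues, which holds here because this $\Phi$ has full column rank.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_and_hessian(w, Phi, t):
    y = sigmoid(Phi @ w)
    R = np.diag(y * (1.0 - y))      # R_nn = y_n (1 - y_n)
    grad = Phi.T @ (y - t)          # Phi^T (y - t)
    H = Phi.T @ R @ Phi             # Phi^T R Phi
    return grad, H

# Hypothetical toy data and an arbitrary weight vector
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
                [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = np.array([0.2, -0.4])

grad, H = grad_and_hessian(w, Phi, t)
eigvals = np.linalg.eigvalsh(H)
print("gradient:", grad)
print("Hessian eigenvalues:", eigvals)
```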

9 IRLS for Logistic Regression
4. Newton-Raphson update:
$w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)$
Substituting $H = \Phi^T R\Phi$ and $\nabla E(w) = \Phi^T(y - t)$:
$w^{(new)} = w^{(old)} - (\Phi^T R\Phi)^{-1}\Phi^T(y - t)$
$= (\Phi^T R\Phi)^{-1}\{\Phi^T R\Phi w^{(old)} - \Phi^T(y - t)\}$
$= (\Phi^T R\Phi)^{-1}\Phi^T R z$
where $z$ is an $N$-dimensional vector given by $z = \Phi w^{(old)} - R^{-1}(y - t)$.
The update formula is a set of normal equations for a weighted least-squares problem.
Since the Hessian depends on $w$, apply them iteratively, each time using the new weight vector.
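The iteration above can be sketched as follows. The function name `irls`, the stopping tolerance, and the toy data are assumptions made for this example. The sketch also assumes the data are not linearly separable, so that the entries of $R$ stay away from zero.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=50, tol=1e-10):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = y * (1.0 - y)                 # diagonal of R
        z = Phi @ w - (y - t) / r         # z = Phi w_old - R^{-1}(y - t)
        A = Phi.T @ (r[:, None] * Phi)    # Phi^T R Phi
        b = Phi.T @ (r * z)               # Phi^T R z
        w_new = np.linalg.solve(A, b)     # weighted normal equations
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

# Hypothetical toy data with overlapping classes
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
                [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w_irls = irls(Phi, t)
print("IRLS weights:", w_irls)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $\Phi^T R\Phi$ explicitly, which is usually the more stable choice.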

10 Roadmap of Bayesian Logistic Regression
Logistic regression is a discriminative probabilistic linear classifier: $p(C_1|\phi) = \sigma(w^T\phi)$
Exact Bayesian inference for logistic regression,
$p(C_1|\phi,t) = \int \sigma(w^T\phi)\,p(w|t)\,dw$,
is intractable because:
1. Evaluation of the posterior distribution $p(w|t)$
Needs normalization of the prior $p(w) = N(w|m_0,S_0)$ times the likelihood (a product of sigmoids), $p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$
Solution: use the Laplace approximation to get a Gaussian $q(w)$
2. Evaluation of the predictive distribution $p(C_1|\phi,t) \simeq \int \sigma(w^T\phi)\,q(w)\,dw$
This is a convolution of a sigmoid and a Gaussian
Solution: approximate the sigmoid by a probit

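11 Laplace Approximation (summary)
Need the mode $w_0$ of the posterior distribution $p(w|t)$.
This is found by a numerical optimization algorithm.
Fit a Gaussian centered at the mode:
$q(w) = \frac{|A|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(w - w_0)^T A (w - w_0)\right\} = N(w|w_0, A^{-1})$
This needs the second derivatives of the log posterior, which is equivalent to finding the Hessian matrix
$A = -\nabla\nabla \ln f(w)\big|_{w=w_0}$
where $f(w) = p(w)\,p(t|w)$ is the unnormalized posterior.
For logistic regression:
$S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1-y_n)\phi_n\phi_n^T$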

12 Evaluation of Posterior Distribution
Gaussian prior $p(w) = N(w|m_0,S_0)$, where $m_0$ and $S_0$ are hyper-parameters.
Posterior distribution $p(w|t) \propto p(w)\,p(t|w)$, where $t = (t_1,\dots,t_N)^T$.
Substituting $p(t|w) = \prod_{n=1}^{N} y_n^{t_n}\{1-y_n\}^{1-t_n}$ gives
$\ln p(w|t) = -\frac{1}{2}(w - m_0)^T S_0^{-1}(w - m_0) + \sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\} + \text{const}$
where $y_n = \sigma(w^T\phi_n)$.

13 Gaussian Approximation of Posterior
Maximize the posterior $p(w|t)$ to give the MAP solution $w_{map}$.
This is done by numerical optimization and defines the mean of the Gaussian.
The covariance is given by the inverse of the matrix of second derivatives of the negative log posterior:
$S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n(1-y_n)\phi_n\phi_n^T$
Gaussian approximation to the posterior:
$q(w) = N(w|w_{map}, S_N)$
We need to marginalize with respect to this distribution to make predictions.
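A minimal sketch of this Gaussian approximation follows. The slides only say that $w_{map}$ is found by numerical optimization, so the use of Newton's method here is an assumption, as are the function name `laplace_posterior`, the prior settings, and the toy data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=50, tol=1e-10):
    """Return (w_map, S_N) for the Gaussian approximation q(w) = N(w | w_map, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = np.array(m0, dtype=float)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        # Gradient and Hessian of the negative log posterior
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        H = S0_inv + Phi.T @ ((y * (1.0 - y))[:, None] * Phi)
        step = np.linalg.solve(H, grad)
        w = w - step
        if np.max(np.abs(step)) < tol:
            break
    y = sigmoid(Phi @ w)
    S_N_inv = S0_inv + Phi.T @ ((y * (1.0 - y))[:, None] * Phi)
    return w, np.linalg.inv(S_N_inv)

# Hypothetical toy data and a zero-mean, unit-covariance prior
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
                [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
m0 = np.zeros(2)
S0 = np.eye(2)

w_map, S_N = laplace_posterior(Phi, t, m0, S0)
print("w_map:", w_map)
print("S_N:\n", S_N)
```

Because the prior adds $S_0^{-1}$ to the Hessian, the posterior covariance is never larger than the prior covariance.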

14 Predictive Distribution
Predictive distribution for class $C_1$, given a new feature vector $\phi(x)$.
Obtained by marginalizing with respect to the posterior $p(w|t)$:
$p(C_1|\phi,t) = \int p(C_1, w|\phi,t)\,dw$ (sum rule)
$= \int p(C_1|\phi,t,w)\,p(w|t)\,dw$ (product rule)
$= \int p(C_1|\phi,w)\,p(w|t)\,dw$ (given $\phi$ and $w$, $C_1$ is independent of $t$)
$\simeq \int \sigma(w^T\phi)\,q(w)\,dw$ (approximate $p(w|t)$ by the Gaussian $q(w)$)
The corresponding probability for class $C_2$ is
$p(C_2|\phi,t) = 1 - p(C_1|\phi,t)$

15 Predictive Distribution Is a Convolution
The function $\sigma(w^T\phi)$ depends on $w$ only through its projection onto $\phi$.
Denoting $a = w^T\phi$ we have
$\sigma(w^T\phi) = \int \delta(a - w^T\phi)\,\sigma(a)\,da$
where $\delta(\cdot)$ is the Dirac delta function.
Thus
$p(C_1|\phi,t) \simeq \int \sigma(w^T\phi)\,q(w)\,dw = \int \sigma(a)\,p(a)\,da$, where $p(a) = \int \delta(a - w^T\phi)\,q(w)\,dw$
We can evaluate $p(a)$ because the delta function imposes a linear constraint on $w$.
Since $q(w)$ is Gaussian, its marginal is also Gaussian.
Evaluate its mean and variance:
$\mu_a = E[a] = \int p(a)\,a\,da = \int q(w)\,w^T\phi\,dw = w_{map}^T\phi$
$\sigma_a^2 = \mathrm{var}[a] = \int p(a)\{a^2 - E[a]^2\}\,da = \int q(w)\{(w^T\phi)^2 - (w_{map}^T\phi)^2\}\,dw = \phi^T S_N\phi$
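The closed-form mean and variance of $a = w^T\phi$ can be compared against sample moments. In the sketch below, the values of `w_map`, `S_N`, and `phi` are made-up numbers chosen only for illustration, and the seed and sample size are arbitrary.

```python
import numpy as np

# Hypothetical Gaussian approximation q(w) = N(w | w_map, S_N) and test point phi
w_map = np.array([0.5, -1.0])
S_N = np.array([[0.5, 0.1],
                [0.1, 0.3]])
phi = np.array([1.0, 2.0])

# Closed-form moments of a = w^T phi under q(w)
mu_a = w_map @ phi            # w_map^T phi
var_a = phi @ S_N @ phi       # phi^T S_N phi

# Monte Carlo check: sample w from q(w) and project onto phi
rng = np.random.default_rng(0)
w_samples = rng.multivariate_normal(w_map, S_N, size=200_000)
a_samples = w_samples @ phi

print("mean:    ", mu_a, "vs sampled", a_samples.mean())
print("variance:", var_a, "vs sampled", a_samples.var())
```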

16 Variational Approximation to Predictive Distribution
The predictive distribution is
$p(C_1|\phi,t) \simeq \int \sigma(a)\,p(a)\,da = \int \sigma(a)\,N(a|\mu_a,\sigma_a^2)\,da$
The convolution of a sigmoid with a Gaussian is intractable.
Use the probit function instead of the logistic sigmoid.

17 Approximation Using Probit
$p(C_1|\phi,t) \simeq \int \sigma(a)\,N(a|\mu_a,\sigma_a^2)\,da$
Use the probit function, which is similar to the logistic sigmoid. It is defined as
$\Phi(a) = \int_{-\infty}^{a} N(\theta|0,1)\,d\theta$
Approximate $\sigma(a)$ by $\Phi(\lambda a)$.
Find a suitable value of $\lambda$ by requiring that the two functions have the same slope at the origin, which yields $\lambda^2 = \pi/8$.
The convolution of a probit with a Gaussian is a probit:
$\int \Phi(\lambda a)\,N(a|\mu,\sigma^2)\,da = \Phi\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\right)$
Thus
$p(C_1|\phi,t) \simeq \int \sigma(a)\,N(a|\mu_a,\sigma_a^2)\,da \simeq \sigma(\kappa(\sigma_a^2)\,\mu_a)$
where $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$
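The value of $\lambda$ follows from matching slopes: the sigmoid has slope $1/4$ at the origin and $\Phi(\lambda a)$ has slope $\lambda/\sqrt{2\pi}$, so $\lambda = \sqrt{2\pi}/4$ and $\lambda^2 = \pi/8$. The sketch below compares the resulting closed-form approximation with a numerical evaluation of the integral. Gauss-Hermite quadrature is used as the reference here, which is a choice made for this example and not part of the slides. The function names and the test values of $\mu_a$ and $\sigma_a^2$ are likewise assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_probit(mu_a, var_a):
    # sigma(kappa(var_a) * mu_a), with kappa = (1 + pi * var_a / 8)^(-1/2)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return sigmoid(kappa * mu_a)

def predictive_quadrature(mu_a, var_a, n_nodes=50):
    # Gauss-Hermite rule: int e^{-x^2} f(x) dx ~ sum_i wts_i f(x_i).
    # Substituting a = mu_a + sqrt(2 var_a) x gives E[sigmoid(a)] under N(a | mu_a, var_a).
    x, wts = np.polynomial.hermite.hermgauss(n_nodes)
    a = mu_a + np.sqrt(2.0 * var_a) * x
    return np.sum(wts * sigmoid(a)) / np.sqrt(np.pi)

lam = np.sqrt(np.pi / 8.0)
slope_probit = lam / np.sqrt(2.0 * np.pi)   # slope of Phi(lam * a) at a = 0

mu_a, var_a = 1.0, 2.0                      # hypothetical values
p_approx = predictive_probit(mu_a, var_a)
p_numeric = predictive_quadrature(mu_a, var_a)
print("probit approximation:", p_approx)
print("numerical integral:  ", p_numeric)
```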

18 Probit Classification
Applying the probit approximation to
$p(C_1|\phi,t) \simeq \int \sigma(a)\,N(a|\mu_a,\sigma_a^2)\,da$
we have
$p(C_1|\phi,t) \simeq \sigma(\kappa(\sigma_a^2)\,\mu_a)$
where $\mu_a = w_{map}^T\phi$ and $\sigma_a^2 = \phi^T S_N\phi$.
The decision boundary corresponding to $p(C_1|\phi,t) = 0.5$ is given by $\mu_a = 0$.
This is the same solution as $w_{map}^T\phi = 0$.
Thus marginalization has no effect when minimizing the misclassification rate with equal prior probabilities.
For more complex decision criteria it plays an important role.
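A small numerical illustration of this point, using made-up values of $\mu_a$ and $\sigma_a^2$: the predicted class (which side of 0.5 the probability falls on) depends only on the sign of $\mu_a$, while a larger variance pulls the probability itself toward 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive(mu_a, var_a):
    # sigma(kappa(var_a) * mu_a), with kappa = (1 + pi * var_a / 8)^(-1/2)
    return sigmoid(mu_a / np.sqrt(1.0 + np.pi * var_a / 8.0))

mu = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # hypothetical values of w_map^T phi
variances = [0.0, 1.0, 10.0]                 # hypothetical values of phi^T S_N phi

probs = {v: predictive(mu, v) for v in variances}
for v in variances:
    print(f"var_a = {v:4.1f}:", np.round(probs[v], 3))
```

With `var_a = 0.0` the result reduces to the plug-in prediction $\sigma(w_{map}^T\phi)$. This is why the variance matters once the decision depends on the probability value itself, for example with unequal misclassification costs.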

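19 Summary
Logistic regression is a linear probabilistic discriminative model: $p(C_1|x) = \sigma(w^T\phi)$
Exact Bayesian logistic regression is intractable.
Using the Laplace approximation, the posterior parameter distribution $p(w|t)$ can be approximated as a Gaussian.
The predictive distribution is a convolution of a sigmoid and a Gaussian:
$p(C_1|\phi) \simeq \int \sigma(w^T\phi)\,q(w)\,dw$
Replacing the sigmoid by a probit makes the convolution tractable, since the convolution of a probit with a Gaussian is again a probit.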
