Introduction to Machine Learning
Logistic Regression

Varun Chandola
Computer Science & Engineering
State University of New York at Buffalo
Buffalo, NY, USA
chandola@buffalo.edu
Outline
- Generative vs. Discriminative Classifiers
- Logistic Regression
- Logistic Regression - Training
- Using Gradient Descent for Learning Weights
- Using Newton's Method
- Regularization with Logistic Regression
- Handling Multiple Classes
- Bayesian Logistic Regression
- Laplace Approximation
- Posterior of w for Logistic Regression
- Approximating the Posterior
- Getting Prediction on Unseen Examples
Generative vs. Discriminative Classifiers
- Probabilistic classification task: p(y = benign | X = x), p(y = malicious | X = x)
- How do you estimate p(y | x)?
  p(y | x) = p(y, x) / p(x) = p(x | y) p(y) / p(x)
- Two-step approach: estimate the generative model, then the posterior for y (Naive Bayes)
- This solves a more general problem than is actually needed [2, 1]
- Why not directly model p(y | x)? This is the discriminative approach.
Which is Better?
- The number of training examples needed to learn a PAC-learnable classifier grows with the VC-dimension of the hypothesis space
- The VC-dimension of a probabilistic classifier is roughly the number of parameters [2] (or a small polynomial in the number of parameters)
- Number of parameters for p(y, x) > number of parameters for p(y | x)
- Discriminative classifiers therefore need fewer training examples for PAC learning than generative classifiers
Logistic Regression
- y | x is a Bernoulli distribution with parameter θ = sigmoid(w^T x)
- When a new input x* arrives, we toss a coin which has sigmoid(w^T x*) as the probability of heads
- If the outcome is heads, the predicted class is 1, else 0
- Learns a linear decision boundary

Learning Task for Logistic Regression
- Given training examples {(x_i, y_i)}_{i=1}^D, learn w
Logistic Regression - Recap

Bayesian Interpretation
- Directly model p(y | x), where y ∈ {0, 1}
- p(y | x) ~ Bernoulli(θ = sigmoid(w^T x))

Geometric Interpretation (sketched below)
- Use regression to predict discrete values
- Squash the output to [0, 1] using the sigmoid function
- Output less than 0.5 is one class; greater than 0.5 is the other

[Figure: the sigmoid curve 1 / (1 + e^{-ax}), rising from 0 to 1 and crossing 0.5 at x = 0]
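As a concrete illustration, here is a minimal NumPy sketch of this prediction rule; the function and variable names (sigmoid, predict, w, x) are illustrative choices, not from the slides:

    import numpy as np

    def sigmoid(a):
        # Logistic squashing function: maps any real number into [0, 1]
        return 1.0 / (1.0 + np.exp(-a))

    def predict(w, x):
        # theta = p(y = 1 | x); threshold at 0.5 to pick a class
        theta = sigmoid(w @ x)
        return 1 if theta > 0.5 else 0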
Learning Parameters

MLE Approach
- Assume that y ∈ {0, 1}
- What is the likelihood for a Bernoulli sample?
  - If y_i = 1, p(y_i) = θ_i = 1 / (1 + exp(−w^T x_i))
  - If y_i = 0, p(y_i) = 1 − θ_i = 1 / (1 + exp(w^T x_i))
  - In general, p(y_i) = θ_i^{y_i} (1 − θ_i)^{1 − y_i}
- Log-likelihood (sketched below):
  LL(w) = Σ_{i=1}^N [ y_i log θ_i + (1 − y_i) log(1 − θ_i) ]
- No closed-form solution for maximizing the log-likelihood
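A short sketch of this log-likelihood, building on the sigmoid helper above and assuming X is an N x d design matrix and y a vector of 0/1 labels (these names are my own):

    def log_likelihood(w, X, y):
        # theta_i = p(y_i = 1 | x_i) for every row of the design matrix X
        theta = sigmoid(X @ w)
        # LL(w) = sum_i [ y_i log theta_i + (1 - y_i) log(1 - theta_i) ]
        return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))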
Using Gradient Descent for Learning Weights
- Compute the gradient of LL with respect to w
- LL is a concave function of w with a unique global maximum:
  d/dw LL(w) = Σ_{i=1}^N (y_i − θ_i) x_i
- Update rule (sketched below):
  w^{(k+1)} = w^{(k)} + η d/dw LL(w^{(k)})
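The update rule can be written as a simple gradient-ascent loop; the step size eta and iteration count below are arbitrary illustrative defaults, and sigmoid is the helper from the earlier sketch:

    def gradient_ascent(X, y, eta=0.1, iters=1000):
        # Maximize LL(w) by repeatedly stepping along its gradient
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            theta = sigmoid(X @ w)
            grad = X.T @ (y - theta)   # d LL / dw = sum_i (y_i - theta_i) x_i
            w = w + eta * grad
        return w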
Using Newton's Method
- Setting η is sometimes tricky:
  - Too large: incorrect results
  - Too small: slow convergence
- Another way to speed up convergence is Newton's method:
  w^{(k+1)} = w^{(k)} + η H_k^{−1} d/dw LL(w^{(k)})
What is the Hessian?
- The Hessian H is the matrix of second-order derivatives of the objective function
- Newton's method belongs to the family of second-order optimization algorithms
- For logistic regression (taking H as the negative of the Hessian of LL, so that it is positive semi-definite):
  H = Σ_i θ_i (1 − θ_i) x_i x_i^T
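A hedged sketch of one Newton step using this Hessian, reusing the sigmoid helper from above; np.linalg.solve is used so that H^{-1} is never formed explicitly:

    def newton_step(w, X, y, eta=1.0):
        theta = sigmoid(X @ w)
        grad = X.T @ (y - theta)
        # H = sum_i theta_i (1 - theta_i) x_i x_i^T, built row-wise
        H = (X * (theta * (1 - theta))[:, None]).T @ X
        # Solve H d = grad rather than computing H^{-1} directly
        return w + eta * np.linalg.solve(H, grad)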
Regularization with Logistic Regression
- Overfitting is an issue, especially with a large number of features
- Add a Gaussian prior N(0, τ²) on the weights (equivalently, a regularization penalty with λ = 1/τ²)
- Easy to incorporate in the gradient-descent-based approach (sketched below):
  LL'(w) = LL(w) − (1/2) λ w^T w
  d/dw LL'(w) = d/dw LL(w) − λ w
  H' = H + λ I
  where I is the identity matrix.
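The regularized gradient and Hessian drop into the same Newton step with only two extra terms; a sketch under the assumption lam = λ = 1/τ², again reusing the sigmoid helper:

    def regularized_step(w, X, y, lam=1.0, eta=1.0):
        theta = sigmoid(X @ w)
        grad = X.T @ (y - theta) - lam * w                      # d LL'/dw
        H = (X * (theta * (1 - theta))[:, None]).T @ X \
            + lam * np.eye(X.shape[1])                          # H + lambda * I
        return w + eta * np.linalg.solve(H, grad)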
Handling Multiple Classes
- Reduction strategies: One vs. Rest and One vs. One
- Direct approach: p(y | x) ~ Multinoulli(θ)
- The Multinoulli parameter vector θ is defined as (sketched below):
  θ_j = exp(w_j^T x) / Σ_{k=1}^C exp(w_k^T x)
- Multiclass logistic regression has C weight vectors to learn
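A minimal sketch of the Multinoulli (softmax) parameter computation; W here is an assumed C x d matrix whose rows are the class weight vectors:

    def softmax_probs(W, x):
        # W is C x d: one weight vector w_j per class, stacked as rows
        a = W @ x
        a = a - a.max()            # subtract the max for numerical stability
        e = np.exp(a)
        return e / e.sum()         # theta_j = exp(w_j^T x) / sum_k exp(w_k^T x)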
Bayesian Logistic Regression
- How do we get the posterior for w? Not easy. Why?

Laplace Approximation
- We do not know what the true posterior distribution for w is
- Is there a close-enough (approximate) Gaussian distribution?
Laplace Approximation

Problem Statement
- How do we approximate a posterior with a Gaussian distribution?

When is this needed?
- When direct computation of the posterior is not possible, i.e., there is no conjugate prior
Laplace Approximation using Taylor Series Expansion
- Assume that the posterior is:
  p(w | D) = (1/Z) e^{−E(w)}
- E(w) is the energy function: the negative log of the unnormalized posterior
- Let w_MAP be the mode of the posterior distribution of w
- Taylor series expansion of E(w) around the mode:
  E(w) = E(w_MAP) + (w − w_MAP)^T E'(w_MAP) + (1/2)(w − w_MAP)^T E''(w_MAP)(w − w_MAP) + …
       ≈ E(w_MAP) + (w − w_MAP)^T E'(w_MAP) + (1/2)(w − w_MAP)^T E''(w_MAP)(w − w_MAP)
- E'(w) is the first derivative of E(w) (gradient) and E''(w) = H is the second derivative (Hessian)
Taylor Series Expansion Continued
- Since w_MAP is the mode, the first derivative (gradient) at w_MAP is zero:
  E(w) ≈ E(w_MAP) + (1/2)(w − w_MAP)^T H (w − w_MAP)
- The posterior p(w | D) may then be written as:
  p(w | D) ≈ (1/Z) e^{−E(w_MAP)} exp[ −(1/2)(w − w_MAP)^T H (w − w_MAP) ] = N(w_MAP, H^{−1})
- w_MAP is the mode, obtained by maximizing the log-posterior using gradient ascent
Posterior of w for Logistic Regression
- Prior: p(w) = N(0, τ² I)
- Likelihood of the data:
  p(D | w) = Π_{i=1}^N θ_i^{y_i} (1 − θ_i)^{1 − y_i}, where θ_i = 1 / (1 + e^{−w^T x_i})
- Posterior:
  p(w | D) = N(0, τ² I) Π_{i=1}^N θ_i^{y_i} (1 − θ_i)^{1 − y_i} / ∫ p(D | w) p(w) dw
Approximating the Posterior - Laplace Approximation
- Approximate posterior distribution (sketched below):
  p(w | D) ≈ N(w_MAP, H^{−1})
- H is the Hessian of the negative log-posterior with respect to w
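Putting the pieces together, a sketch of the Laplace approximation for logistic regression, reusing the sigmoid and regularized_step helpers from the earlier sketches; the fixed iteration count is an illustrative simplification (a real implementation would check convergence):

    def laplace_posterior(X, y, tau2=1.0, iters=50):
        # Gaussian prior N(0, tau^2 I) corresponds to lambda = 1 / tau^2
        lam = 1.0 / tau2
        w = np.zeros(X.shape[1])
        for _ in range(iters):                 # find w_MAP by Newton steps
            w = regularized_step(w, X, y, lam=lam)
        theta = sigmoid(X @ w)
        # H = Hessian of the negative log-posterior at w_MAP
        H = (X * (theta * (1 - theta))[:, None]).T @ X + lam * np.eye(X.shape[1])
        return w, np.linalg.inv(H)             # mean and covariance of N(w_MAP, H^{-1})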
Getting Prediction
- p(y | x) = ∫ p(y | x, w) p(w | D) dw
- Three options for evaluating this integral:
  1. Use a point estimate of w (MLE or MAP)
  2. Analytical result (e.g., via the probit approximation to the sigmoid)
  3. Monte Carlo approximation (numerical integration, sketched below):
     - Sample a finite set of w's from p(w | D) ≈ N(w_MAP, H^{−1})
     - Compute p(y | x) under each sample and average
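A sketch of option 3, drawing weight samples from the Gaussian approximation returned by the laplace_posterior helper above and averaging the per-sample predictions; the sample count and fixed seed are illustrative:

    def predict_mc(x, w_map, cov, n_samples=1000):
        # Draw w^(s) ~ N(w_MAP, H^{-1}) and average p(y = 1 | x, w^(s))
        rng = np.random.default_rng(0)
        W = rng.multivariate_normal(w_map, cov, size=n_samples)
        return sigmoid(W @ x).mean()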
References
[1] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages 841-848. MIT Press, 2001.
[2] V. Vapnik. Statistical Learning Theory. Wiley, 1998.