Introduction to Machine Learning

Size: px

Start display at page:

Download "Introduction to Machine Learning"

Leonard Sharp
6 years ago
Views:

1 Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA CSE 474/574 1 / 20

2 Outline Generative vs. Discriminative Classifiers Logistic Regression Logistic Regression - Training Using Gradient Descent for Learning Weights Using Newton s Method Regularization with Logistic Regression Handling Multiple Classes Bayesian Logistic Regression Laplace Approximation Posterior of w for Logistic Regression Approximating the Posterior Getting Prediction on Unseen Examples Chandola@UB CSE 474/574 2 / 20

3 Generative vs. Discriminative Classifiers Probabilistic classification task: p(y = benign X = x), p(y = malicious X = x) How do you estimate p(y x)? p(y x) = p(y, x) p(x) = p(x y)p(y) p(x) Two step approach - Estimate generative model and then posterior for y (Naïve Bayes) Solving a more general problem [2, 1] Why not directly model p(y x)? - Discriminative approach Chandola@UB CSE 474/574 3 / 20

4 Which is Better? Number of training examples needed to learn a PAC-learnable classifier VC-dimension of the hypothesis space VC-dimension of a probabilistic classifier Number of parameters [2] (or a small polynomial in the number of parameters) Number of parameters for p(y, x) > Number of parameters for p(y x) Discriminative classifiers need lesser training examples to for PAC learning than generative classifiers Chandola@UB CSE 474/574 4 / 20

5 Logistic Regression y x is a Bernoulli distribution with parameter θ = sigmoid(w x) When a new input x arrives, we toss a coin which has sigmoid(w x ) as the probability of heads If outcome is heads, the predicted class is 1 else 0 Learns a linear boundary Learning Task for Logistic Regression Given training examples x i, y i D i=1, learn w Chandola@UB CSE 474/574 5 / 20

6 Logistic Regression - Recap Bayesian Interpretation Directly model p(y x) (y {0, 1}) p(y x) Bernoulli(θ = sigmoid(w x)) Geometric Interpretation Use regression to predict discrete values Squash output to [0, 1] using sigmoid function Output less than 0.5 is one class and greater than 0.5 is the other e ax x Chandola@UB CSE 474/574 6 / 20

7 Learning Parameters MLE Approach Assume that y {0, 1} What is the likelihood for a bernoulli sample? If yi = 1, p(y i ) = θ i = If yi = 0, p(y i ) = 1 θ i = Log-likelihood 1 1+exp( w x i ) 1 1+exp(w x i ) In general, p(yi ) = θ y i i (1 θ i ) 1 y i LL(w) = N y i log θ i + (1 y i ) log (1 θ i ) i=1 No closed form solution for maximizing log-likelihood Chandola@UB CSE 474/574 7 / 20

8 Using Gradient Descent for Learning Weights Compute gradient of LL with respect to w A convex function of w with a unique global maximum d N dw LL(w) = (y i θ i )x i i=1 0 Update rule: 0.5 w k+1 = w k + η d LL(w k ) dw k Chandola@UB CSE 474/574 8 / 20

9 Using Newton s Method Setting η is sometimes tricky Too large incorrect results Too small slow convergence Another way to speed up convergence: Newton s Method w k+1 = w k + ηh 1 k d LL(w k ) dw k Chandola@UB CSE 474/574 9 / 20

10 What is the Hessian? Hessian or H is the second order derivative of the objective function Newton s method belong to the family of second order optimization algorithms For logistic regression, the Hessian is: H = i θ i (1 θ i )x i x i Chandola@UB CSE 474/ / 20

11 Regularization with Logistic Regression Overfitting is an issue, especially with large number of features Add a Gaussian prior N (0, τ 2 ) (Or a regularization penalty) Easy to incorporate in the gradient descent based approach where I is the identity matrix. LL (w) = LL(w) 1 2 λw w d dw LL (w) = d LL(w) λw dw H = H λi Chandola@UB CSE 474/ / 20

12 Handling Multiple Classes One vs. Rest and One vs. Other p(y x) Multinoulli(θ) Multinoulli parameter vector θ is defined as: θ j = exp(w j x) C k=1 exp(w k x) Multiclass logistic regression has C weight vectors to learn Chandola@UB CSE 474/ / 20

13 Bayesian Logistic Regression How to get the posterior for w? Not easy - Why? Laplace Approximation We do not know what the true posterior distribution for w is. Is there a close-enough (approximate) Gaussian distribution? Chandola@UB CSE 474/ / 20

14 Laplace Approximation Problem Statement How to approximate a posterior with a Gaussian distribution? When is this needed? When direct computation of posterior is not possible. No conjugate prior Chandola@UB CSE 474/ / 20

15 Laplace Approximation using Taylor Series Expansion CSE 474/ / 20

16 Laplace Approximation using Taylor Series Expansion Assume that posterior is: p(w D) = 1 Z e E(w) E(w) is energy function negative log of unnormalized log posterior Chandola@UB CSE 474/ / 20

17 Laplace Approximation using Taylor Series Expansion Assume that posterior is: p(w D) = 1 Z e E(w) E(w) is energy function negative log of unnormalized log posterior Let w MAP be the mode or expected value of the posterior distribution of w Chandola@UB CSE 474/ / 20

18 Laplace Approximation using Taylor Series Expansion Assume that posterior is: p(w D) = 1 Z e E(w) E(w) is energy function negative log of unnormalized log posterior Let w MAP be the mode or expected value of the posterior distribution of w Taylor series expansion of E(w) around the mode E(w) = E(w MAP ) + (w w ) E (w) + (w w ) E (w)(w w ) +. E(w MAP ) + (w w ) E (w) + (w w MAP ) E (w)(w w ) E (w) = - first derivative of E(w) (gradient) and E (w) = H is the second derivative (Hessian) Chandola@UB CSE 474/ / 20

19 Taylor Series Expansion Continued Since w MAP is the mode, the first derivative or gradient is zero E(w) E(w MAP ) + (w w ) H(w w ) Chandola@UB CSE 474/ / 20

20 Taylor Series Expansion Continued Since w MAP is the mode, the first derivative or gradient is zero E(w) E(w MAP ) + (w w ) H(w w ) Posterior p(w D) may be written as: p(w D) 1 [ Z e E(w MAP ) exp 1 ] 2 (w w ) H(w w ) = N (w MAP, H 1 ) w MAP is the mode obtained by maximizing the posterior using gradient ascent Chandola@UB CSE 474/ / 20

21 Posterior of w for Logistic Regression Prior: Likelihood of data p(w) = N (0, τ 2 I) where θ i = 1 1+e w x i Posterior: p(d w) = N i=1 θ y i i (1 θ i ) 1 y i p(w D) = N (0, τ 2 I) N i=1 θy i i (1 θ i ) 1 y i p(d w)dw Chandola@UB CSE 474/ / 20

22 Approximating the Posterior - Laplace Approximation Approximate posterior distribution p(w D) = N (w MAP, H 1 ) H is the Hessian of the negative log-posterior w.r.t. w Chandola@UB CSE 474/ / 20

23 Getting Prediction p(y x) = p(y x, w)p(w D)dw 1. Use a point estimate of w (MLE or MAP) 2. Analytical Result 3. Monte Carlo Approximation Numerical integration Sample finite versions of w using p(w D) p(w D) = N (w MAP, H 1 ) Compute p(y x) using the samples and add Chandola@UB CSE 474/ / 20

24 References A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages MIT Press, V. Vapnik. Statistical learning theory. Wiley, Chandola@UB CSE 474/ / 20

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression