Introduction to Machine Learning
Logistic Regression

Varun Chandola

April 9, 2017

Contents

1 Generative vs. Discriminative Classifiers
2 Logistic Regression
3 Logistic Regression - Training
  3.1 Using Gradient Descent for Learning Weights
  3.2 Using Newton's Method
  3.3 Regularization with Logistic Regression
  3.4 Handling Multiple Classes
  3.5 Bayesian Logistic Regression
  3.6 Laplace Approximation
  3.7 Posterior of w for Logistic Regression
  3.8 Approximating the Posterior
  3.9 Getting Predictions on Unseen Examples

1 Generative vs. Discriminative Classifiers

- Probabilistic classification task: estimate p(y = benign | X = x) and p(y = malicious | X = x).
- How do you estimate p(y | x)?

\[
p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)}
\]

- Two step approach - estimate the generative model and then the posterior for y (Naïve Bayes). This solves a more general problem than necessary [2, 1].
- Why not directly model p(y | x)? - the discriminative approach.
- The number of training examples needed to learn a PAC-learnable classifier grows with the VC-dimension of the hypothesis space.
- The VC-dimension of a probabilistic classifier grows with the number of parameters [2] (or a small polynomial in the number of parameters).
- The number of parameters for p(y, x) is larger than the number of parameters for p(y | x).
- Discriminative classifiers therefore need fewer training examples for PAC learning than generative classifiers.

2 Logistic Regression

- y | x is a Bernoulli distribution with parameter θ = sigmoid(w^T x).
- When a new input x* arrives, we toss a coin which has sigmoid(w^T x*) as the probability of heads. If the outcome is heads, the predicted class is 1, else 0 (see the sketch at the end of this section).
- Logistic regression learns a linear decision boundary.

Learning Task for Logistic Regression: given training examples ⟨x_i, y_i⟩ ∈ D, learn w.

Bayesian Interpretation: directly model p(y | x), with y ∈ {0, 1}:

\[
p(y \mid x) \sim \mathrm{Bernoulli}\left(\theta = \mathrm{sigmoid}(w^{\top} x)\right)
\]
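The following is a minimal sketch of this coin-toss prediction rule, assuming NumPy; the weight vector w, the input x_new, and the function names are hypothetical illustrations, not part of the original notes.

import numpy as np

def sigmoid(a):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + np.exp(-a))

def predict(w, x_new, rng=None):
    """Predict a label under p(y | x) = Bernoulli(sigmoid(w^T x)).

    If rng is given, the label is sampled (the 'coin toss');
    otherwise the most likely class is returned (threshold at 0.5).
    """
    theta = sigmoid(w @ x_new)                # P(y = 1 | x_new)
    if rng is not None:
        return int(rng.random() < theta)      # coin with P(heads) = theta
    return int(theta >= 0.5)                  # deterministic decision rule

# Hypothetical usage with a made-up weight vector and input
w = np.array([1.5, -2.0, 0.3])
x_new = np.array([0.4, 0.1, 1.0])
print(predict(w, x_new))                                  # thresholded label
print(predict(w, x_new, rng=np.random.default_rng(0)))    # sampled label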
Geometric Interpretation: use regression to predict discrete values, squashing the output into [0, 1] with the sigmoid function. An output less than 0.5 is assigned to one class and an output greater than 0.5 to the other.

3 Logistic Regression - Training

MLE Approach

- Assume that y ∈ {0, 1}. What is the likelihood of a Bernoulli sample?
- If y_i = 1, p(y_i) = θ_i = 1 / (1 + exp(-w^T x_i)).
- If y_i = 0, p(y_i) = 1 - θ_i = 1 / (1 + exp(w^T x_i)).
- In general, p(y_i) = θ_i^{y_i} (1 - θ_i)^{1 - y_i}.

The log-likelihood is

\[
LL(w) = \sum_i \left[ y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \right]
\]

There is no closed form solution for maximizing this log-likelihood. To understand why, we first differentiate LL(w) with respect to w, making use of the useful result for the sigmoid:

\[
\frac{\partial \theta_i}{\partial w} = \theta_i (1 - \theta_i)\, x_i
\]

Using this result we obtain:

\[
\nabla_w LL(w) = \sum_{i=1}^{N} \left[ y_i \frac{1}{\theta_i}\,\theta_i (1 - \theta_i)\, x_i - (1 - y_i) \frac{1}{1 - \theta_i}\,\theta_i (1 - \theta_i)\, x_i \right]
= \sum_{i=1}^{N} \left( y_i (1 - \theta_i) - (1 - y_i)\theta_i \right) x_i
= \sum_{i=1}^{N} (y_i - \theta_i)\, x_i
\]

Given that θ_i is a non-linear function of w, setting this gradient to zero admits no closed form solution.

3.1 Using Gradient Descent for Learning Weights

[Figure: plot of the sigmoid function, which maps w^T x into the interval [0, 1].]

- Compute the gradient of LL with respect to w:

\[
\nabla_w LL(w) = \sum_{i=1}^{N} (y_i - \theta_i)\, x_i
\]

- LL(w) is a concave function of w with a unique global maximum.
- Update rule (gradient ascent, shown in the sketch below):

\[
w^{k+1} = w^{k} + \eta\, \nabla_w LL(w^{k})
\]
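A minimal sketch of this training loop, assuming NumPy; the design matrix X, labels y, the step size eta, and the function name fit_logistic_gd are hypothetical choices made for illustration. It implements the gradient ascent update above, not code from the original notes.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, eta=0.01, n_iters=1000):
    """Learn w by gradient ascent on the log-likelihood LL(w).

    X: (N, d) array of inputs, y: (N,) array of 0/1 labels.
    Uses the update w <- w + eta * sum_i (y_i - theta_i) x_i.
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        theta = sigmoid(X @ w)        # theta_i = sigmoid(w^T x_i) for all i
        grad = X.T @ (y - theta)      # gradient of LL(w)
        w = w + eta * grad            # gradient ascent step
    return w

# Hypothetical usage on a toy dataset sampled from the model itself
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(100) < sigmoid(X @ true_w)).astype(float)
print(fit_logistic_gd(X, y))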
3.2 Using Newton's Method

- Setting η is sometimes tricky: too large and the updates overshoot, giving incorrect results; too small and convergence is slow.
- Another way to speed up convergence is Newton's Method:

\[
w^{k+1} = w^{k} + \eta\, H_k^{-1}\, \nabla_w LL(w^{k})
\]

- The Hessian H collects the second order derivatives of the objective function.
- Newton's method belongs to the family of second order optimization algorithms.
- For logistic regression, the Hessian is:

\[
H = \sum_i \theta_i (1 - \theta_i)\, x_i x_i^{\top}
\]

3.3 Regularization with Logistic Regression

- Overfitting is an issue, especially with a large number of features.
- Add a Gaussian prior N(0, τ²I) on w (or, equivalently, a regularization penalty with λ = 1/τ²).
- This is easy to incorporate in the gradient descent based approach:

\[
LL'(w) = LL(w) - \frac{\lambda}{2} w^{\top} w, \qquad
\nabla_w LL'(w) = \nabla_w LL(w) - \lambda w, \qquad
H' = H + \lambda I
\]

where I is the identity matrix.

3.4 Handling Multiple Classes

- One vs. Rest and One vs. Other approaches.
- Alternatively, model p(y | x) ∼ Multinoulli(θ), where the Multinoulli parameter vector θ is defined as:

\[
\theta_j = \frac{\exp(w_j^{\top} x)}{\sum_{k=1}^{C} \exp(w_k^{\top} x)}
\]

- Multiclass logistic regression has C weight vectors to learn (a sketch follows below).
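A minimal sketch of this Multinoulli (softmax) parameterization, assuming NumPy; the weight matrix W (whose rows are the C class weight vectors), the input x, and the function name softmax_probs are illustrative assumptions, not part of the original notes.

import numpy as np

def softmax_probs(W, x):
    """Multinoulli parameters theta_j = exp(w_j^T x) / sum_k exp(w_k^T x).

    W: (C, d) matrix whose rows are the class weight vectors w_1..w_C.
    x: (d,) input vector. Returns a length-C probability vector.
    """
    scores = W @ x                     # w_j^T x for every class j
    scores = scores - scores.max()     # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical usage with C = 3 classes and d = 2 features
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 0.1]])
x = np.array([0.5, 1.5])
theta = softmax_probs(W, x)
print(theta, theta.argmax())   # class probabilities and predicted class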
3.5 Bayesian Logistic Regression

- How do we get the posterior for w? Not easy - why?
- Laplace Approximation: we do not know what the true posterior distribution for w is. Is there a close-enough (approximate) Gaussian distribution?
- Note that we use a Gaussian prior for w, which is not a conjugate prior for the Bernoulli distribution used in logistic regression. In fact there is no convenient conjugate prior for logistic regression.

3.6 Laplace Approximation

Problem Statement: how to approximate a posterior with a Gaussian distribution? When is this needed? When direct computation of the posterior is not possible, i.e., when there is no conjugate prior.

Assume that the posterior is:

\[
p(w \mid D) = \frac{1}{Z} e^{-E(w)}
\]

where E(w) is the energy function, the negative log of the unnormalized posterior. Let w_MAP be the mode of the posterior distribution of w (which is also the mean of the Gaussian approximation). A Taylor series expansion of E(w) around the mode gives

\[
E(w) = E(w_{MAP}) + (w - w_{MAP})^{\top} E'(w_{MAP}) + \frac{1}{2}(w - w_{MAP})^{\top} E''(w_{MAP})(w - w_{MAP}) + \ldots
\]
\[
\approx E(w_{MAP}) + (w - w_{MAP})^{\top} E'(w_{MAP}) + \frac{1}{2}(w - w_{MAP})^{\top} E''(w_{MAP})(w - w_{MAP})
\]

where E'(w) is the first derivative of E(w) (gradient) and E''(w) = H is the second derivative (Hessian). Since w_MAP is the mode, the first derivative (gradient) is zero, so

\[
E(w) \approx E(w_{MAP}) + \frac{1}{2}(w - w_{MAP})^{\top} H (w - w_{MAP})
\]

The posterior p(w | D) may then be written as:

\[
p(w \mid D) \approx \frac{1}{Z} e^{-E(w_{MAP})} \exp\left[ -\frac{1}{2}(w - w_{MAP})^{\top} H (w - w_{MAP}) \right] = \mathcal{N}(w_{MAP}, H^{-1})
\]

Here w_MAP is the mode, obtained by maximizing the posterior using gradient ascent.

3.7 Posterior of w for Logistic Regression

Prior:

\[
p(w) = \mathcal{N}(0, \tau^2 I)
\]

Likelihood of the data:

\[
p(D \mid w) = \prod_{i=1}^{N} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}, \qquad \theta_i = \frac{1}{1 + e^{-w^{\top} x_i}}
\]

Posterior:

\[
p(w \mid D) = \frac{\mathcal{N}(0, \tau^2 I) \prod_{i=1}^{N} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}}{\int p(D \mid w)\, p(w)\, dw}
\]

3.8 Approximating the Posterior

Approximate the posterior distribution by

\[
p(w \mid D) \approx \mathcal{N}(w_{MAP}, H^{-1})
\]

where H is the Hessian of the negative log-posterior with respect to w. The Hessian (or H) is the matrix obtained by double differentiation of a scalar function of a vector; in this case the scalar function is the negative log-posterior of w:

\[
H = \nabla^2 \left( -\log p(D \mid w) - \log p(w) \right)
\]

3.9 Getting Predictions on Unseen Examples

\[
p(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw
\]

1. Use a point estimate of w (MLE or MAP). Using a point estimate of w means that the integral disappears from the Bayesian averaging equation.
2. Analytical result.
3. Monte Carlo approximation (numerical integration): sample a finite set of w values from p(w | D) = N(w_MAP, H^{-1}), compute p(y* | x*) for each sample, and average (see the sketch below).
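A minimal sketch of this Monte Carlo approximation, assuming NumPy; the already-computed w_map and H (Hessian of the negative log-posterior), the sample count, and the function name mc_predictive are hypothetical values chosen for illustration. It averages sigmoid(w^T x*) over samples from the Laplace approximation.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mc_predictive(x_new, w_map, H, n_samples=1000, rng=None):
    """Monte Carlo estimate of p(y* = 1 | x*) = integral of p(y*|x*, w) p(w|D) dw.

    Draws samples from the Laplace approximation p(w|D) = N(w_map, H^{-1})
    and averages sigmoid(w^T x*) over the samples.
    """
    if rng is None:
        rng = np.random.default_rng()
    cov = np.linalg.inv(H)                                       # H^{-1}
    w_samples = rng.multivariate_normal(w_map, cov, size=n_samples)
    return sigmoid(w_samples @ x_new).mean()

# Hypothetical usage with made-up w_map, H, and test input
w_map = np.array([1.0, -0.5])
H = np.array([[4.0, 0.5],
              [0.5, 3.0]])
x_new = np.array([0.2, 1.0])
print(mc_predictive(x_new, w_map, H, rng=np.random.default_rng(0)))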
References

[1] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages 841-848. MIT Press, 2001.

[2] V. Vapnik. Statistical Learning Theory. Wiley, 1998.