Introduction to Machine Learning

Introduction to Machine Learning: Logistic Regression
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
CSE 474/574

Outline
- Generative vs. Discriminative Classifiers
- Logistic Regression
- Logistic Regression - Training
- Using Gradient Descent for Learning Weights
- Using Newton's Method
- Regularization with Logistic Regression
- Handling Multiple Classes
- Bayesian Logistic Regression
- Laplace Approximation
- Posterior of w for Logistic Regression
- Approximating the Posterior
- Getting Prediction on Unseen Examples

Generative vs. Discriminative Classifiers
- Probabilistic classification task: estimate p(y = benign | X = x) and p(y = malicious | X = x)
- How do you estimate p(y | x)?
    p(y | x) = p(y, x) / p(x) = p(x | y) p(y) / p(x)
- Two-step approach: estimate the generative model, then the posterior for y (Naive Bayes)
- This solves a more general problem than we actually need [2, 1]
- Why not directly model p(y | x)? This is the discriminative approach.

Which is Better?
- The number of training examples needed to learn a PAC-learnable classifier grows with the VC-dimension of the hypothesis space
- The VC-dimension of a probabilistic classifier is roughly the number of parameters [2] (or a small polynomial in the number of parameters)
- Number of parameters for p(y, x) > number of parameters for p(y | x)
- Discriminative classifiers therefore need fewer training examples for PAC learning than generative classifiers

Logistic Regression
- y | x is a Bernoulli distribution with parameter θ = sigmoid(w^T x)
- When a new input x* arrives, we toss a coin that has sigmoid(w^T x*) as the probability of heads
- If the outcome is heads, the predicted class is 1, else 0
- Learns a linear decision boundary

Learning Task for Logistic Regression
- Given training examples {(x_i, y_i)}, i = 1, ..., D, learn w

Logistic Regression - Recap
Bayesian Interpretation
- Directly model p(y | x), where y ∈ {0, 1}
- p(y | x) ~ Bernoulli(θ = sigmoid(w^T x))
Geometric Interpretation
- Use regression to predict discrete values
- Squash the output to [0, 1] using the sigmoid function 1/(1 + e^{-ax})
- Output less than 0.5 is one class, greater than 0.5 is the other
[Figure: sigmoid curve 1/(1 + e^{-ax}) plotted against x, rising from 0 to 1 and crossing 0.5 at x = 0]
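
A minimal NumPy sketch of this squash-and-threshold view, assuming a weight vector w is already available; the helper names sigmoid, predict_proba, and predict_class are illustrative, not part of the lecture:

    import numpy as np

    def sigmoid(a):
        """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    def predict_proba(w, X):
        """Return theta_i = sigmoid(w^T x_i) for each row x_i of X."""
        return sigmoid(X @ w)

    def predict_class(w, X):
        """Threshold the squashed output at 0.5 to get labels in {0, 1}."""
        return (predict_proba(w, X) >= 0.5).astype(int)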

Learning Parameters - MLE Approach
- Assume that y ∈ {0, 1}
- What is the likelihood of a Bernoulli sample?
  - If y_i = 1, p(y_i) = θ_i = 1/(1 + exp(-w^T x_i))
  - If y_i = 0, p(y_i) = 1 - θ_i = 1/(1 + exp(w^T x_i))
  - In general, p(y_i) = θ_i^{y_i} (1 - θ_i)^{1 - y_i}
- Log-likelihood:
    LL(w) = sum_{i=1}^N [ y_i log θ_i + (1 - y_i) log(1 - θ_i) ]
- There is no closed-form solution for maximizing the log-likelihood
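
A short sketch of this log-likelihood in NumPy, assuming a design matrix X of shape (N, d) and labels y in {0, 1}; log_likelihood and the epsilon guard are illustrative choices, not from the slides:

    import numpy as np

    def log_likelihood(w, X, y):
        """LL(w) = sum_i [ y_i log theta_i + (1 - y_i) log(1 - theta_i) ],
        with theta_i = sigmoid(w^T x_i)."""
        theta = 1.0 / (1.0 + np.exp(-X @ w))
        # Small epsilon guards against log(0) when theta_i saturates numerically.
        eps = 1e-12
        return np.sum(y * np.log(theta + eps) + (1 - y) * np.log(1 - theta + eps))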

Using Gradient Descent for Learning Weights
- Compute the gradient of LL with respect to w
- LL(w) is a concave function of w with a unique global maximum
    d LL(w) / dw = sum_{i=1}^N (y_i - θ_i) x_i
- Update rule:
    w^{k+1} = w^k + η d LL(w^k) / dw^k
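
A minimal gradient-ascent sketch of this update rule, assuming features X, labels y in {0, 1}, and an illustrative fixed step size eta; fit_gradient_ascent is not a name from the lecture:

    import numpy as np

    def fit_gradient_ascent(X, y, eta=0.1, n_iters=1000):
        """Maximize LL(w) with the update w <- w + eta * sum_i (y_i - theta_i) x_i."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = 1.0 / (1.0 + np.exp(-X @ w))   # theta_i = sigmoid(w^T x_i)
            grad = X.T @ (y - theta)               # d LL / dw
            w = w + eta * grad
        return w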

Using Newton's Method
- Setting η is sometimes tricky
  - Too large: incorrect results (the iterates can overshoot or diverge)
  - Too small: slow convergence
- Another way to speed up convergence is Newton's method:
    w^{k+1} = w^k + η H_k^{-1} d LL(w^k) / dw^k

What is the Hessian?
- The Hessian H is the second-order derivative of the objective function
- Newton's method belongs to the family of second-order optimization algorithms
- For logistic regression, the Hessian (taken of the negative log-likelihood, so that the update above moves uphill on LL) is:
    H = sum_i θ_i (1 - θ_i) x_i x_i^T
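
A sketch of Newton's method under these definitions, taking H = sum_i θ_i(1 - θ_i) x_i x_i^T as above; fit_newton, the damping factor eta, and the small jitter term are illustrative choices, not from the lecture:

    import numpy as np

    def fit_newton(X, y, eta=1.0, n_iters=20):
        """Newton update w <- w + eta * H^{-1} (d LL / dw), with
        H = sum_i theta_i (1 - theta_i) x_i x_i^T."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = 1.0 / (1.0 + np.exp(-X @ w))
            grad = X.T @ (y - theta)              # d LL / dw
            S = theta * (1.0 - theta)             # theta_i (1 - theta_i)
            H = (X * S[:, None]).T @ X            # sum_i S_i x_i x_i^T
            H += 1e-8 * np.eye(d)                 # tiny jitter in case H is singular
            w = w + eta * np.linalg.solve(H, grad)
        return w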

Regularization with Logistic Regression
- Overfitting is an issue, especially with a large number of features
- Add a Gaussian prior N(0, τ² I) on w (or, equivalently, a regularization penalty):
    LL'(w) = LL(w) - (1/2) λ w^T w
- Easy to incorporate in the gradient-descent-based approach:
    d LL'(w) / dw = d LL(w) / dw - λ w
    H' = H + λ I
  where I is the identity matrix.
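
A sketch of the same Newton loop with the Gaussian-prior penalty folded in, assuming the usual correspondence λ = 1/τ²; fit_newton_regularized and lam are illustrative names, not from the lecture:

    import numpy as np

    def fit_newton_regularized(X, y, lam=1.0, n_iters=20):
        """Maximize LL'(w) = LL(w) - 0.5 * lam * w^T w with Newton steps:
        gradient = d LL / dw - lam * w,  Hessian H' = H + lam * I."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = 1.0 / (1.0 + np.exp(-X @ w))
            grad = X.T @ (y - theta) - lam * w
            S = theta * (1.0 - theta)
            H = (X * S[:, None]).T @ X + lam * np.eye(d)   # lam*I also keeps H invertible
            w = w + np.linalg.solve(H, grad)
        return w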

Handling Multiple Classes
- Alternatives: One vs. Rest and One vs. One
- Direct approach: p(y | x) ~ Multinoulli(θ)
- The Multinoulli parameter vector θ is defined as:
    θ_j = exp(w_j^T x) / sum_{k=1}^C exp(w_k^T x)
- Multiclass logistic regression has C weight vectors to learn
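
A small sketch of the softmax probabilities, assuming a weight matrix W with one column per class; softmax_proba and the max-subtraction stabilization are illustrative choices, not from the lecture:

    import numpy as np

    def softmax_proba(W, X):
        """theta_j = exp(w_j^T x) / sum_k exp(w_k^T x) for each row x of X.
        W has shape (d, C): one weight vector per class."""
        scores = X @ W                                   # (N, C) matrix of w_j^T x
        scores -= scores.max(axis=1, keepdims=True)      # subtract row max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)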

Bayesian Logistic Regression
- How do we get the posterior for w? Not easy - why?
- Laplace Approximation
  - We do not know what the true posterior distribution for w is
  - Is there a close-enough (approximate) Gaussian distribution?

Laplace Approximation
Problem Statement
- How do we approximate a posterior with a Gaussian distribution?
- When is this needed? When direct computation of the posterior is not possible, e.g., when there is no conjugate prior.

Laplace Approximation using Taylor Series Expansion
- Assume that the posterior is:
    p(w | D) = (1/Z) e^{-E(w)}
  where E(w) is the energy function, the negative log of the unnormalized posterior
- Let w_MAP be the mode of the posterior distribution of w
- Taylor series expansion of E(w) around the mode:
    E(w) = E(w_MAP) + (w - w_MAP)^T E'(w_MAP) + (1/2)(w - w_MAP)^T E''(w_MAP)(w - w_MAP) + ...
         ≈ E(w_MAP) + (w - w_MAP)^T E'(w_MAP) + (1/2)(w - w_MAP)^T E''(w_MAP)(w - w_MAP)
- E'(w) is the first derivative (gradient) of E(w), and E''(w) = H is the second derivative (Hessian)

Taylor Series Expansion Continued
- Since w_MAP is the mode, the first derivative (gradient) at w_MAP is zero:
    E(w) ≈ E(w_MAP) + (1/2)(w - w_MAP)^T H (w - w_MAP)
- The posterior p(w | D) may then be written as:
    p(w | D) ≈ (1/Z) e^{-E(w_MAP)} exp[ -(1/2)(w - w_MAP)^T H (w - w_MAP) ]
             = N(w_MAP, H^{-1})
- w_MAP is the mode, obtained by maximizing the posterior (e.g., using gradient ascent)

Posterior of w for Logistic Regression
- Prior: p(w) = N(0, τ² I)
- Likelihood of the data:
    p(D | w) = prod_{i=1}^N θ_i^{y_i} (1 - θ_i)^{1 - y_i},  where θ_i = 1/(1 + e^{-w^T x_i})
- Posterior:
    p(w | D) = N(0, τ² I) prod_{i=1}^N θ_i^{y_i} (1 - θ_i)^{1 - y_i} / ∫ p(D | w) p(w) dw

Approximating the Posterior - Laplace Approximation
- Approximate posterior distribution:
    p(w | D) ≈ N(w_MAP, H^{-1})
- H is the Hessian of the negative log-posterior with respect to w, evaluated at w_MAP
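
A sketch putting these pieces together: find w_MAP by gradient ascent on the log posterior under the prior N(0, τ² I), then form H as the Hessian of the negative log posterior at w_MAP; laplace_posterior, tau, eta, and n_iters are illustrative assumptions, not from the lecture:

    import numpy as np

    def laplace_posterior(X, y, tau=1.0, eta=0.1, n_iters=2000):
        """Return (w_map, H) so that p(w | D) is approximated by N(w_map, H^{-1})."""
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            theta = 1.0 / (1.0 + np.exp(-X @ w))
            # Gradient of the log posterior: likelihood term minus prior term.
            grad = X.T @ (y - theta) - w / tau**2
            w = w + eta * grad
        theta = 1.0 / (1.0 + np.exp(-X @ w))
        S = theta * (1.0 - theta)
        # Hessian of the negative log posterior at w_map.
        H = (X * S[:, None]).T @ X + np.eye(d) / tau**2
        return w, H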

Getting Prediction
- p(y | x) = ∫ p(y | x, w) p(w | D) dw
- Options:
  1. Use a point estimate of w (MLE or MAP)
  2. Analytical result
  3. Monte Carlo approximation (numerical integration):
     - Sample a finite number of w's from p(w | D) ≈ N(w_MAP, H^{-1})
     - Compute p(y | x) using each sample and average
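
A sketch of the Monte Carlo option, drawing samples from the approximate posterior N(w_MAP, H^{-1}) and averaging the per-sample predictive probabilities; predict_monte_carlo and n_samples are illustrative names, and the w_map, H inputs are assumed to come from something like the laplace_posterior sketch above:

    import numpy as np

    def predict_monte_carlo(X_new, w_map, H, n_samples=1000, rng=None):
        """Approximate p(y = 1 | x) = E_{w ~ N(w_map, H^{-1})}[ sigmoid(w^T x) ]
        by averaging over posterior samples."""
        rng = np.random.default_rng() if rng is None else rng
        cov = np.linalg.inv(H)                                    # posterior covariance H^{-1}
        W = rng.multivariate_normal(w_map, cov, size=n_samples)   # (n_samples, d) samples of w
        probs = 1.0 / (1.0 + np.exp(-(X_new @ W.T)))              # (N_new, n_samples) predictions
        return probs.mean(axis=1)                                 # average over the samples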

References
[1] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), pages 841-848. MIT Press, 2001.
[2] V. Vapnik. Statistical Learning Theory. Wiley, 1998.