Logistic Regression. Sargur N. Srihari. University at Buffalo, State University of New York USA

Topics in Linear Classification using Probabilistic Discriminative Models

Generative vs Discriminative
1. Fixed basis functions
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions

Topics in Logistic Regression

- Logistic Sigmoid and Logit Functions
- Parameters in the discriminative approach
- Determining logistic regression parameters
- Error function
- Gradient of the error function
- Simple sequential algorithm
- An example
- Generative vs Discriminative Training
- Naïve Bayes vs Logistic Regression

Logistic Sigmoid and Logit Functions

In the two-class case, the posterior of class C_1 can be written as a logistic sigmoid of a linear function of the feature vector ϕ = [ϕ_1, ..., ϕ_M]^T:

p(C_1|ϕ) = y(ϕ) = σ(w^T ϕ), with p(C_2|ϕ) = 1 − p(C_1|ϕ)

Here σ(·) is the logistic sigmoid function and w is the weight vector. In statistics this model is known as logistic regression, although it is a model for classification rather than regression.

Logit function: the log of the odds ratio; it links the probability to the predictor variables.

[Figure: plot of the logistic sigmoid σ(a) against a]

Properties:
A. Symmetry: σ(−a) = 1 − σ(a)
B. Inverse: a = ln(σ/(1 − σ)), known as the logit; also known as the log odds, since it is the ratio ln[p(C_1|ϕ)/p(C_2|ϕ)]
C. Derivative: dσ/da = σ(1 − σ)
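As a quick illustration (ours, not from the slides), here is a minimal NumPy sketch of the sigmoid with a numerical check of properties A-C:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
s = sigmoid(a)

# A. Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1 - s)

# B. Inverse (logit): a = ln(sigma / (1 - sigma))
assert np.allclose(np.log(s / (1 - s)), a)

# C. Derivative: d sigma / da = sigma * (1 - sigma), checked by central differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, s * (1 - s), atol=1e-8)
```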

Fewer Parameters in the Linear Discriminative Model

Discriminative approach (logistic regression): for an M-dimensional feature space ϕ, M adjustable parameters.

Generative approach based on Gaussians (Bayes/NB):
- 2M parameters for the means
- M(M+1)/2 parameters for the shared covariance matrix
- 1 parameter for the class prior
- Total: M(M+5)/2 + 1 parameters, which grows quadratically with M

If the features are assumed independent (naïve Bayes), the parameter count still grows linearly with M.
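A small worked check of these counts (the closed form is from the slide; the values of M are hypothetical):

```python
# Parameter counts for M-dimensional features (two classes, shared covariance).
for M in (2, 10, 100):
    discriminative = M                             # logistic regression weights
    generative = 2 * M + M * (M + 1) // 2 + 1      # means + covariance + prior
    assert generative == M * (M + 5) // 2 + 1      # closed form from the slide
    print(M, discriminative, generative)
```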

Determining Logistic Regression Parameters

Maximum likelihood approach for two classes. For a data set {ϕ_n, t_n}, where t_n ∈ {0, 1} and ϕ_n = ϕ(x_n), n = 1, ..., N, the likelihood function can be written as

p(t|w) = ∏_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n}

where t = (t_1, ..., t_N)^T and y_n = p(C_1|ϕ_n). Here y_n is the probability that t_n = 1.

Error Function for Logistic Regression

The likelihood function is

p(t|w) = ∏_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n}

Taking the negative logarithm gives the cross-entropy error function

E(w) = −ln p(t|w) = −∑_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }

where y_n = σ(a_n) and a_n = w^T ϕ_n. We need to minimize E(w). At the minimum the gradient of E(w) is zero, so we must solve ∇E(w) = 0 for w.
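A minimal sketch of the cross-entropy error under these definitions (NumPy; the data and names are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ],
    with y_n = sigmoid(w^T phi_n); eps guards against log(0)."""
    y = np.clip(sigmoid(Phi @ w), eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Toy example: N = 4 points, M = 2 features (first column is the dummy feature).
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(cross_entropy(np.zeros(2), Phi, t))  # equals 4*ln(2) at w = 0
```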

Gradient of the Error Function

The error function is

E(w) = −ln p(t|w) = −∑_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }

where y_n = σ(w^T ϕ_n). Using the derivative of the logistic sigmoid, dσ/da = σ(1 − σ), the gradient of the error function is

∇E(w) = ∑_{n=1}^N (y_n − t_n) ϕ_n    (error × feature vector)

The contribution to the gradient from data point n is the error (y_n − t_n) between the target t_n and the prediction y_n = σ(w^T ϕ_n), multiplied by the basis vector ϕ_n.

Proof of the gradient expression. For a single data point, let z = z_1 + z_2, where

z_1 = −t ln σ(w^T ϕ) and z_2 = −(1 − t) ln[1 − σ(w^T ϕ)]

By the chain rule, using dσ/da = σ(1 − σ) and d(ln x)/dx = 1/x:

dz_1/dw = −t σ(w^T ϕ)[1 − σ(w^T ϕ)] ϕ / σ(w^T ϕ) = −t [1 − σ(w^T ϕ)] ϕ
dz_2/dw = (1 − t) σ(w^T ϕ)[1 − σ(w^T ϕ)] ϕ / [1 − σ(w^T ϕ)] = (1 − t) σ(w^T ϕ) ϕ

Therefore dz/dw = (σ(w^T ϕ) − t) ϕ.
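A sketch of the gradient in matrix form, Φ^T(y − t), with a finite-difference check (our names and toy data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_E(w, Phi, t):
    """Gradient of the cross-entropy error: sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

# Finite-difference check on a random toy problem.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w = rng.normal(size=3)

def E(w):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
assert np.allclose(numeric, grad_E(w, Phi, t), atol=1e-6)
```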

Simple Sequential Algorithm

Given the gradient of the error function

∇E(w) = ∑_{n=1}^N (y_n − t_n) ϕ_n, where y_n = σ(w^T ϕ_n)

solve with an iterative approach, updating on one sample at a time:

w^(τ+1) = w^(τ) − η ∇E_n, where ∇E_n = (y_n − t_n) ϕ_n

This takes precisely the same form (error × feature vector) as the gradient of the sum-of-squares error for linear regression. Samples are presented one at a time, and the weight vector is updated after each presentation.
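A minimal sketch of this sequential update on a toy problem (the data, learning rate, and epoch count are our hypothetical choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
# Toy data: dummy feature plus one real feature; targets follow its (noisy) sign.
x = rng.normal(size=50)
Phi = np.column_stack([np.ones(50), x])
t = (x + 0.1 * rng.normal(size=50) > 0).astype(float)

w = np.zeros(2)
eta = 0.1  # learning rate (hypothetical)
for epoch in range(100):
    for n in range(len(t)):
        y_n = sigmoid(w @ Phi[n])
        w -= eta * (y_n - t[n]) * Phi[n]   # w <- w - eta * (y_n - t_n) * phi_n

print("learned w:", w)
print("training accuracy:", np.mean((sigmoid(Phi @ w) > 0.5) == (t == 1)))
```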

The ML Solution Can Over-fit

There is severe over-fitting for linearly separable data, because the ML solution occurs where σ = 0.5, with σ > 0.5 and σ < 0.5 for the two classes. This solution is equivalent to the hyperplane a = w^T ϕ = 0, with the magnitude of w going to infinity, so the logistic sigmoid becomes infinitely steep: a Heaviside step function. Penalizing the weights can avoid this.

[Figure: two linearly separable classes with the separating boundary, and the sigmoid σ(a) approaching a step function]
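One standard way to penalize the weights (not detailed on this slide) is an L2 term; a sketch of the regularized gradient, with λ a hypothetical hyperparameter:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reg_grad(w, Phi, t, lam=0.1):
    """Gradient of E(w) + (lam/2) ||w||^2: Phi^T (y - t) + lam * w.
    The penalty keeps ||w|| finite even when the data are separable."""
    return Phi.T @ (sigmoid(Phi @ w) - t) + lam * w

# Separable toy data: without the penalty, ||w|| diverges under gradient descent.
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(2000):
    w -= 0.05 * reg_grad(w, Phi, t)
print("regularized w stays finite:", w)
```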

An Example of 2-Class Logistic Regression

Input data, with ϕ_0(x) = 1 as a dummy feature.

[Figure: scatter plot of the two-class input data]

Initial Weight Vector, Gradient and Hessian (2-class)

[Figure: numerical values of the initial weight vector, gradient, and Hessian]

Final Weight Vector, Gradient and Hessian (2-class)

[Figure: numerical values of the final weight vector, gradient, and Hessian]

Number of iterations: 10. Error (initial and final): 15.0642 and 1.0000e-09.
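The slides track the gradient and Hessian because the example is solved by Newton's method, i.e. Iterative Reweighted Least Squares (topic 3 of the outline). A minimal sketch using the standard formulas ∇E = Φ^T(y − t) and H = Φ^T R Φ with R = diag(y_n(1 − y_n)); the data below are hypothetical, not the slides' data set:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, iters=10):
    """Iterative Reweighted Least Squares: w <- w - H^{-1} grad."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)          # gradient of the cross-entropy error
        R = np.diag(y * (1 - y))        # weighting matrix
        H = Phi.T @ R @ Phi             # Hessian
        w -= np.linalg.solve(H, grad)   # Newton update
    return w

# Hypothetical non-separable 2-class data with a dummy feature phi_0 = 1.
Phi = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 1.5],
                [1.0, 2.1], [1.0, 3.0], [1.0, 3.8]])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
print(irls(Phi, t, iters=5))
```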

Generative vs Discriminative Training

Variables x = {x_1, ..., x_M} and classifier target y.

1. Generative (Naïve Bayes): estimate the parameters of the variables independently.

[Figure: naïve Bayes graphical model, with y as parent of x_1, x_2, ..., x_M]

For classification, determine the joint distribution

p(y, x) = p(y) ∏_{i=1}^M p(x_i|y)

and from the joint obtain the required conditional p(y|x). Estimation is simple: the M sets of parameters are estimated independently. But the independence assumption is usually false; to capture correlations we could instead estimate an M(M+1)/2-parameter covariance matrix.

2. Discriminative (Logistic Regression): estimate the parameters w_i jointly, via log-linear potential functions:

ϕ_i(x_i, y) = exp{w_i x_i I{y=1}}, ϕ_0(y) = exp{w_0 I{y=1}}

where the indicator I{y=1} has value 1 when y = 1, else 0.

[Figure: naïve Markov graphical model, with y linked to x_1, x_2, ..., x_M]

For classification, the unnormalized posterior is

P̃(y=1|x) = exp{w_0 + ∑_{i=1}^M w_i x_i}, P̃(y=0|x) = exp{0} = 1

and the normalized posterior is

P(y=1|x) = sigmoid(w_0 + ∑_{i=1}^M w_i x_i), where sigmoid(z) = e^z / (1 + e^z)

Jointly optimizing the M parameters makes estimation more complex, but correlations are accounted for, and much richer features can be used (e.g., edges, or image patches sharing the same pixels). For the multiclass case,

p(y_i|ϕ) = y_i(ϕ) = exp(a_i) / ∑_j exp(a_j), where a_j = w_j^T ϕ
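A sketch of the multiclass posterior above (a softmax over the activations a_j = w_j^T ϕ; the weight values here are hypothetical):

```python
import numpy as np

def softmax(a):
    """p(C_k|phi) = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = a - np.max(a)   # shifting by a constant does not change the result
    e = np.exp(a)
    return e / e.sum()

# Three classes, two features plus a dummy feature; rows of W are w_j^T.
W = np.array([[0.5, 1.0, -1.0],
              [0.0, -0.5, 0.5],
              [-0.5, 0.2, 0.8]])
phi = np.array([1.0, 0.3, -0.7])   # phi_0 = 1 is the dummy feature
a = W @ phi                        # activations a_j = w_j^T phi
p = softmax(a)
print(p, p.sum())                  # class posteriors, summing to 1
```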

Logistic regression is a special architecture of a neural network: a single sigmoid output unit with no hidden layer.
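To make the connection concrete (our illustration, not from the slides): a network consisting of one sigmoid output unit computes exactly y = σ(w^T ϕ), the logistic regression model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def single_unit_network(phi, w):
    """One sigmoid output unit, no hidden layer: identical to logistic regression."""
    return sigmoid(w @ phi)

w = np.array([0.5, -1.2, 2.0])      # hypothetical weights (w_0 acts as the bias)
phi = np.array([1.0, 0.4, 0.9])     # phi_0 = 1 dummy feature
print(single_unit_network(phi, w))  # = p(C_1|phi) under logistic regression
```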