Logistic Regression. Vibhav Gogate, The University of Texas at Dallas. Some slides from Carlos Guestrin, Luke Zettlemoyer, and Dan Weld.


Generative vs. Discriminative Classifiers

Want to learn h: X -> Y, where X are the features and Y the target classes.

Generative classifier (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y) and P(Y)
- Estimate the parameters of P(X|Y) and P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X=x): this is a generative model, with indirect computation of P(Y|X) via P(Y|X) ∝ P(X|Y) P(Y)
- As a result, can also generate a sample of the data: $P(X) = \sum_y P(y)\,P(X \mid y)$

Discriminative classifier (e.g., Logistic Regression):
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from training data
- This is the discriminative model: directly learn P(Y|X)
- But cannot obtain a sample of the data, because P(X) is not available

Univariate Linear Regression

$h_w(x) = w_1 x + w_0$

$\mathrm{Loss}(h_w) = \sum_{j=1}^{n} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{n} (y_j - h_w(x_j))^2 = \sum_{j=1}^{n} \big(y_j - (w_1 x_j + w_0)\big)^2$

Understanding Weight Space

$h_w(x) = w_1 x + w_0, \qquad \mathrm{Loss}(h_w) = \sum_{j=1}^{n} \big(y_j - (w_1 x_j + w_0)\big)^2$

(Two slides plot the loss as a surface over the weight space $(w_0, w_1)$.)

Finding Minimum Loss

$\mathrm{argmin}_w \, \mathrm{Loss}(h_w)$, where $h_w(x) = w_1 x + w_0$ and $\mathrm{Loss}(h_w) = \sum_{j=1}^{n} \big(y_j - (w_1 x_j + w_0)\big)^2$

Set both partial derivatives to zero:

$\frac{\partial}{\partial w_0} \mathrm{Loss}(h_w) = 0, \qquad \frac{\partial}{\partial w_1} \mathrm{Loss}(h_w) = 0$

Unique Solution!

$\mathrm{argmin}_w \, \mathrm{Loss}(h_w)$ with $h_w(x) = w_1 x + w_0$:

$w_1 = \frac{N \sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N \sum_j x_j^2 - \big(\sum_j x_j\big)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$
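The closed-form solution is easy to check numerically; here is a minimal NumPy sketch (the function name and the synthetic data are illustrative, not from the slides):

```python
import numpy as np

def univariate_ols(x, y):
    """Closed-form least-squares fit of h_w(x) = w1*x + w0."""
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / \
         (N * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w0, w1

# Example: noisy samples of y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)
print(univariate_ols(x, y))  # approximately (1.0, 2.0)
```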

Could Also Solve Iteratively (Gradient Descent)

$\mathrm{argmin}_w \, \mathrm{Loss}(h_w)$:
- w := any point in weight space
- Loop until convergence:
  - For each $w_i$ in w: $\; w_i := w_i - \eta \, \frac{\partial}{\partial w_i} \mathrm{Loss}(w)$, with learning rate $\eta > 0$
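A minimal sketch of this loop for the univariate squared loss (step size, tolerance, and iteration cap are illustrative choices of mine):

```python
import numpy as np

def fit_gradient_descent(x, y, eta=1e-3, tol=1e-8, max_iters=100_000):
    """Minimize sum_j (y_j - (w1*x_j + w0))^2 by gradient descent."""
    w0, w1 = 0.0, 0.0
    for _ in range(max_iters):
        err = y - (w1 * x + w0)          # residuals
        g0 = -2.0 * np.sum(err)          # dLoss/dw0
        g1 = -2.0 * np.sum(err * x)      # dLoss/dw1
        w0, w1 = w0 - eta * g0, w1 - eta * g1
        if abs(g0) + abs(g1) < tol:      # stop when the gradient is tiny
            break
    return w0, w1

# Usage (same synthetic data as in the previous sketch):
# w0, w1 = fit_gradient_descent(x, y)
```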

Multivariate Linear Regression

$h_w(x_j) = w_0 + \sum_i w_i x_{j,i} = \sum_i w_i x_{j,i} = w^T x_j$ (with a dummy feature $x_{j,0} = 1$)

$\mathrm{argmin}_w \, \mathrm{Loss}(h_w)$ has the unique solution $w^* = (X^T X)^{-1} X^T y$

Problem: overfitting. A dimension that is actually irrelevant may appear, by chance, to be useful.
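A minimal NumPy sketch of the normal-equation solution (using the pseudo-inverse for numerical robustness is my choice, not the slides'):

```python
import numpy as np

def multivariate_ols(X, y):
    """Solve argmin_w ||y - Xw||^2 via the normal equations.

    X is an (m, n) attribute matrix; a column of ones is prepended so
    that w[0] plays the role of the intercept w_0.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(X1.T @ X1) @ X1.T @ y   # (X^T X)^{-1} X^T y
```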

Overfitting? Regularize!!

Penalize high weights (complex hypotheses). Minimize cost = Loss + Complexity:

$\mathrm{Cost}(h_w) = \sum_{j=1}^{n} \Big(y_j - \sum_i w_i x_{j,i}\Big)^2 + \lambda \sum_i |w_i|^p$

- p = 1: L1 regularization (Lasso)
- p = 2: L2 regularization
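For the p = 2 case the penalized cost still has a closed form; the formula below is standard ridge regression rather than something shown on the slides, so treat this as a hedged sketch:

```python
import numpy as np

def ridge_ols(X, y, lam=1.0):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    I = np.eye(X1.shape[1])
    I[0, 0] = 0.0  # conventionally leave the intercept unpenalized
    return np.linalg.solve(X1.T @ X1 + lam * I, X1.T @ y)
```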

Regularization: L1 vs. L2 (figure comparing the two penalties)

Back to Classification

(Figure: a region where P(edible | X) = 1 and a region where P(edible | X) = 0, separated by a decision boundary.)

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form.

First attempt: a hard step that jumps from P(Y)=0 to P(Y)=1 at the boundary, but that form is not differentiable.

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form: the logistic function, a.k.a. the sigmoid,

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data: $\sigma\!\big(w_0 + \sum_i w_i X_i\big)$, with $\sigma(z) = \frac{1}{1+e^{-z}}$.

Features can be discrete or continuous!
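A two-line sketch of this quantity (names are mine; this is the value the deck later uses as Pr(class=1 | x, w) in the MCAP pseudocode):

```python
import numpy as np

def sigmoid_of_linear(w, x):
    """sigma(w_0 + sum_i w_i * x_i) for weight vector w = [w_0, w_1, ..., w_n]."""
    z = w[0] + np.dot(w[1:], x)
    return 1.0 / (1.0 + np.exp(-z))
```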

Understanding Sigmoids

(Plots of the sigmoid of $w_0 + w_1 x$ for $(w_0, w_1) = (-2, -1)$, $(0, -1)$, and $(0, -0.5)$; the y-axis runs from 0 to 1 and the x-axis from -6 to 6.)

Very Convenient!

The functional form of P(Y|X) implies a linear classification rule: compare the linear term $w_0 + \sum_i w_i X_i$ (the RHS) to zero, and predict Y=0 if the RHS > 0.

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood: $\sum_j \ln P(x^j, y^j \mid w)$

Discriminative (Logistic Regression) maximizes the conditional data likelihood: $\sum_j \ln P(y^j \mid x^j, w)$

Discriminative models can't compute $P(x^j \mid w)$! Or rather, they don't waste effort learning P(X): they focus only on P(Y|X), which is all that matters for classification.

Maximizing the Conditional Log Likelihood

$l(w) = \ln \prod_j P(y^j \mid x^j, w)$

- Bad news: there is no closed-form solution that maximizes l(w)
- Good news: l(w) is a concave function of w! No local optima to get stuck in (any local maximum is the global maximum), and concave functions are easy to optimize.

Optimizing a Concave Function: Gradient Ascent

The conditional likelihood for logistic regression is concave!

Gradient: $\nabla_w l(w) = \Big[\frac{\partial l(w)}{\partial w_0}, \ldots, \frac{\partial l(w)}{\partial w_n}\Big]$

Update rule: $w_i^{(t+1)} = w_i^{(t)} + \eta \, \frac{\partial l(w)}{\partial w_i}$, with learning rate $\eta > 0$

Gradient ascent is the simplest of optimization approaches.

Maximize the Conditional Log Likelihood: Gradient Ascent

Taking the derivative of l(w) gives (as used in the MCAP algorithm at the end of the deck)

$\frac{\partial l(w)}{\partial w_i} = \sum_j x_i^j \big(y^j - \hat{P}(Y^j = 1 \mid x^j, w)\big)$

Gradient Ascent for LR

Gradient ascent algorithm (learning rate η > 0):

  repeat:
    For i = 1 to n (iterate over the weights):
      $w_i := w_i + \eta \sum_j x_i^j \big(y^j - \hat{P}(Y^j = 1 \mid x^j, w)\big)$   (loop over the training examples!)
  until the change in w is below a threshold $\epsilon$
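A minimal, self-contained sketch of this loop in NumPy, using the convention $P(Y=1 \mid x, w) = \sigma(w \cdot x)$ that matches the weight update in the MCAP pseudocode at the end of the deck; the step size, stopping threshold, and iteration cap are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, eps=1e-6, max_iters=10_000):
    """Batch gradient ascent on the conditional log likelihood.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    A dummy all-ones column is prepended so that w[0] is the bias w_0.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X1.shape[1])
    for _ in range(max_iters):
        p1 = sigmoid(X1 @ w)               # P(Y=1 | x, w) for every example
        grad = X1.T @ (y - p1)             # dl/dw_i = sum_j x_ij * (y_j - p1_j)
        w += eta * grad
        if np.max(np.abs(eta * grad)) < eps:
            break
    return w
```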

Large Parameters

(Plots of the sigmoid for a = 1, a = 5, and a = 10: larger a gives a steeper transition.)

The maximum likelihood solution prefers higher weights:
- higher likelihood of (properly classified) examples close to the decision boundary
- larger influence of the corresponding features on the decision
- can cause overfitting!!!

Regularization: penalize high weights.

That's all for MCLE. How about MCAP?

One common approach is to define a prior on w:
- Normal distribution, zero mean, identity covariance
- Pushes the parameters towards zero
- This is regularization: it helps avoid very large weights and overfitting

MAP estimate: $w^* = \arg\max_w \ln \Big[ p(w) \prod_j P(y^j \mid x^j, w) \Big]$

MCAP as Regularization

Add log p(w) to the objective. With a zero-mean Gaussian prior this is a quadratic penalty $-\frac{\lambda}{2}\sum_i w_i^2$:
- drives the weights towards zero
- adds a negative linear term ($-\lambda w_i$) to the gradients
- penalizes high weights, just as we did in linear regression
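In the gradient ascent sketch above, the only change is one extra term in the gradient; a small sketch with lam as an assumed regularization strength:

```python
import numpy as np

def mcap_gradient(X1, y, w, lam):
    """Gradient of the penalized objective l(w) - (lam/2) * ||w||^2."""
    p1 = 1.0 / (1.0 + np.exp(-(X1 @ w)))
    return X1.T @ (y - p1) - lam * w   # the quadratic penalty contributes -lam * w
```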

MCLE vs. MCAP

Maximum conditional likelihood estimate: $\; w_i := w_i + \eta \sum_j x_i^j \big(y^j - \hat{P}(Y^j=1 \mid x^j, w)\big)$

Maximum conditional a posteriori estimate: $\; w_i := w_i + \eta \Big[-\lambda w_i + \sum_j x_i^j \big(y^j - \hat{P}(Y^j=1 \mid x^j, w)\big)\Big]$

Naïve Bayes vs. Logistic Regression

Learning h: X -> Y, with X the features and Y the target classes.

Naïve Bayes (generative):
- Assume a functional form for P(X|Y) (assuming conditional independence) and for P(Y)
- Estimate the parameters from training data
- Gaussian NB for continuous features
- Use Bayes rule to calculate P(Y|X=x): P(Y|X) ∝ P(X|Y) P(Y)
- Indirect computation; can also generate a sample of the data

Logistic Regression (discriminative):
- Assume a functional form for P(Y|X); no assumptions on P(X|Y)
- Estimate the parameters from training data
- Handles discrete and continuous features
- Directly calculate P(Y|X=x)
- Can't generate a data sample

Remember Gaussian Naïve Bayes?

$P(X_i = x \mid Y = y_k) = \mathcal{N}(\mu_{ik}, \sigma_{ik}), \qquad \mathcal{N}(\mu_{ik}, \sigma_{ik}) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\!\Big(-\frac{(x-\mu_{ik})^2}{2\sigma_{ik}^2}\Big)$

Sometimes we assume the variance is:
- independent of Y (i.e., $\sigma_i$), or
- independent of $X_i$ (i.e., $\sigma_k$), or
- both (i.e., $\sigma$)

As before, $P(Y \mid X) \propto P(X \mid Y)\, P(Y)$.

Gaussian Naïve Bayes vs. Logistic Regression

Learning h: X -> Y, with real-valued features X and target classes Y.

Generative (Gaussian NB):
- Assume a functional form for P(X|Y), with the $X_i$ conditionally independent given Y, and for P(Y)
- Estimate the parameters from training data
- Model $P(X_i \mid Y = y_k)$ as Gaussian $\mathcal{N}(\mu_{ik}, \sigma_i)$ and P(Y) as Bernoulli(q, 1-q)
- Use Bayes rule to calculate P(Y|X=x): P(Y|X) ∝ P(X|Y) P(Y)

What can we say about the form of $P(Y=1 \mid X_1, \ldots, X_n)$? Cool!!!!

Derive the Form of P(Y|X) for Continuous X_i

Expanding with Bayes rule and the Naïve Bayes assumption (up to now, the arithmetic uses only the Naïve Bayes model):

$P(Y=1 \mid X) = \frac{1}{1 + \exp\!\Big(\ln\frac{P(Y=0)}{P(Y=1)} + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}$

The first term inside the exp looks like a setting for $w_0$. Can we solve for the $w_i$? Yes, but only in the Gaussian case.

Ratio of Class-Conditional Probabilities

For Gaussian $P(X_i \mid Y)$ with class-independent variances, each log ratio $\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}$ is a linear function of $X_i$, with coefficients expressed in terms of the original Gaussian parameters!
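A sketch of that computation under the class-independent variance assumption $\sigma_{i1} = \sigma_{i0} = \sigma_i$ (the notation is mine; the slides show an equivalent derivation):

$$\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \ln\frac{\frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\big(-\frac{(X_i-\mu_{i0})^2}{2\sigma_i^2}\big)}{\frac{1}{\sigma_i\sqrt{2\pi}}\exp\!\big(-\frac{(X_i-\mu_{i1})^2}{2\sigma_i^2}\big)} = \frac{(X_i-\mu_{i1})^2 - (X_i-\mu_{i0})^2}{2\sigma_i^2} = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}$$

So the implied logistic regression weights are $w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}$ and $w_0 = \ln\frac{1-q}{q} + \sum_i \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}$, where $q = P(Y=1)$.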

Derive the Form of P(Y|X) for Continuous X_i (continued)

Putting the pieces together: GNB with class-independent variances implies exactly the logistic regression form $P(Y=1 \mid X) = \frac{1}{1 + \exp\!\big(w_0 + \sum_i w_i X_i\big)}$.

Gaussian Naïve Bayes vs. Logistic Regression

The set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) maps onto a set of Logistic Regression parameters. The mapping can go both ways; we only did one direction.

- Representation equivalence, but only in a special case!!! (GNB with class-independent variances)
- But what's the difference??? LR makes no assumptions about P(X|Y) in learning!!!
- Loss function!!! They optimize different functions and obtain different solutions.

Naïve Bayes vs. Logistic Regression

Consider Y boolean and X = <X_1, ..., X_n> continuous.

Number of parameters:
- Naïve Bayes: 4n + 1 (a mean and a variance for each feature and each class, plus the class prior)
- Logistic Regression: n + 1 (one weight per feature plus the bias $w_0$)

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]

Generative vs. discriminative classifiers: asymptotic comparison (# training examples -> infinity)
- When the model is correct: GNB (with class-independent variances) and LR produce identical classifiers
- When the model is incorrect: LR is less biased, since it does not assume conditional independence, and is therefore expected to outperform GNB

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]

Generative vs. discriminative classifiers: non-asymptotic analysis of the convergence rate of the parameter estimates (n = # of attributes in X). How much training data is needed to get close to the infinite-data solution?
- Naïve Bayes needs O(log n) samples
- Logistic Regression needs O(n) samples

GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Naïve Bayes vs. Logistic Regression: some experiments on UCI data sets.

What You Should Know about Logistic Regression (LR)

- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function
- In general, NB and LR make different assumptions: NB assumes the features are independent given the class (an assumption on P(X|Y)); LR assumes a functional form for P(Y|X) and makes no assumption on P(X|Y)
- LR is a linear classifier: the decision rule is a hyperplane
- LR is optimized by conditional likelihood: no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum
- Maximum conditional a posteriori estimation corresponds to regularization
- Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit

LR: MCAP Algorithm

Given:
- Data matrix of size m x (n+2) (n attributes, m examples)
- Data[i][n+1] gives the class of example i
- Data[i][0] is the dummy threshold attribute, always set to 1
- Arrays Pr[0..m-1] and w[0..n] initialized to random values

Until convergence do:
  For each example i:
    Compute Pr[i] = Pr(class=1 | Data[i], w)
  Array dw[0..n] initialized to zero
  For i = 0 to n:            // go over all the weights
    For j = 0 to m-1:        // go over all the training examples
      dw[i] = dw[i] + Data[j][i] * (Data[j][n+1] - Pr[j])
  For i = 0 to n:
    w[i] = w[i] + η * (dw[i] - λ * w[i])   // η = learning rate, λ = regularization strength
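A direct translation of this pseudocode into NumPy; the symbols η and λ were dropped from the transcription of the final update, so they are restored here as eta and lam with illustrative values:

```python
import numpy as np

def lr_mcap(data, eta=0.01, lam=0.1, iters=5000):
    """MCAP logistic regression on an m x (n+2) data matrix.

    data[:, 0] is the dummy threshold attribute (all ones),
    data[:, 1:n+1] are the n attributes, data[:, n+1] is the class label.
    """
    m, cols = data.shape
    n = cols - 2
    X = data[:, : n + 1]            # columns 0..n (dummy + attributes)
    y = data[:, n + 1]              # class labels in {0, 1}
    w = np.random.default_rng(0).normal(scale=0.01, size=n + 1)
    for _ in range(iters):
        pr = 1.0 / (1.0 + np.exp(-(X @ w)))   # Pr(class=1 | Data[i], w)
        dw = X.T @ (y - pr)                   # dw[i] = sum_j X[j,i] * (y[j] - pr[j])
        w = w + eta * (dw - lam * w)          # MCAP update with the quadratic penalty
    return w
```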