Logistic Regression
Machine Learning, Fall 2018
Where are we?
We have seen the following ideas:
Linear models
Learning as loss minimization
Bayesian learning criteria (MAP and MLE estimation)
The Naïve Bayes classifier
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Logistic Regression: Setup
The setting:
Binary classification
Inputs: feature vectors $\mathbf{x} \in \mathbb{R}^d$
Labels: $y \in \{-1, +1\}$
Training data: $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
Classification, but...
The output $y$ is discrete valued ($-1$ or $+1$).
Instead of predicting the output, let us try to predict $P(y = 1 \mid \mathbf{x})$.
Expand the hypothesis space to functions whose output is in $[0, 1]$:
Original problem: $\mathbb{R}^d \rightarrow \{-1, 1\}$
Modified problem: $\mathbb{R}^d \rightarrow [0, 1]$
This effectively makes the problem a regression problem.
Many hypothesis spaces are possible.
The sigmoid function
The hypothesis space for logistic regression: all functions of the form
$\sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
That is, a linear function composed with the sigmoid function $\sigma$ (the logistic function).
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
The sigmoid function
[Figure: plot of $\sigma(z)$ as a function of $z$.]
The sigmoid function
What is its derivative with respect to $z$?
$\frac{d\sigma}{dz} = \sigma(z)\,(1 - \sigma(z))$
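As a concrete aside (not part of the original slides), here is a minimal Python sketch of the sigmoid and its derivative; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + exp(-z)), with range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0 and vanishes as |z| grows.
print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5, 0.25
```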
Predicting probabilities
According to the logistic regression model, we have
$P(y = +1 \mid \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Or equivalently, for $y \in \{-1, +1\}$,
$P(y \mid \mathbf{x}; \mathbf{w}) = \sigma(y\,\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$
Note that we are directly modeling $P(y \mid \mathbf{x})$ rather than $P(\mathbf{x} \mid y)$ and $P(y)$.
Predicting a label with logistic regression
Compute $P(y = +1 \mid \mathbf{x}; \mathbf{w})$.
If this is greater than half, predict $+1$; else predict $-1$.
What does this correspond to in terms of $\mathbf{w}^T\mathbf{x}$?
Prediction = $\mathrm{sgn}(\mathbf{w}^T\mathbf{x})$
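A minimal sketch of prediction under this model (illustrative names, assuming NumPy arrays for $\mathbf{w}$ and $\mathbf{x}$):

```python
import numpy as np

def predict_proba(w, x):
    """P(y = +1 | x; w) = sigma(w^T x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    """Predict +1 if P(y = +1 | x; w) > 1/2, else -1.

    Since sigma(w^T x) > 1/2 exactly when w^T x > 0, this is sgn(w^T x).
    """
    return 1 if np.dot(w, x) > 0 else -1
```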
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Naïve Bayes and logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log \frac{P(y = -1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} = -\mathbf{w}^T\mathbf{x}$
Here, the $P$'s represent the naïve Bayes posterior distribution, which is calculated from the priors and the likelihoods. That is, $P(y = -1 \mid \mathbf{x}, \mathbf{w})$ is computed using $P(\mathbf{x} \mid y = -1, \mathbf{w})$ and $P(y = -1 \mid \mathbf{w})$.
But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$.
Substituting in the above expression, we get
$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
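Written out, the substitution step is the following short calculation (a worked version of the algebra the slide summarizes):

```latex
\begin{align*}
\log\frac{P(y=-1\mid\mathbf{x},\mathbf{w})}{P(y=+1\mid\mathbf{x},\mathbf{w})}
    &= -\mathbf{w}^T\mathbf{x} \\
\frac{1 - P(y=+1\mid\mathbf{x},\mathbf{w})}{P(y=+1\mid\mathbf{x},\mathbf{w})}
    &= \exp(-\mathbf{w}^T\mathbf{x})
    && \text{exponentiate, use } P(y=-1\mid\mathbf{x},\mathbf{w}) = 1 - P(y=+1\mid\mathbf{x},\mathbf{w}) \\
P(y=+1\mid\mathbf{x},\mathbf{w})
    &= \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}
     = \sigma(\mathbf{w}^T\mathbf{x})
\end{align*}
```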
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
First: maximum likelihood estimation
Then: adding priors → maximum a posteriori estimation
Back to loss minimization
Maximum likelihood estimation
Let's get back to the problem of learning.
Training data: $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
What we want: find a $\mathbf{w}$ such that $P(S \mid \mathbf{w})$ is maximized.
We know that our examples are drawn independently and are identically distributed (i.i.d.).
How do we proceed?
Maximum likelihood estimation
The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) = \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
The usual trick: convert products to sums by taking the log. Recall that this works only because log is an increasing function and the maximizer will not change.
Equivalent to solving
$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
But (by definition) we know that
$P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sigma(y_i\,\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)}$
Equivalent to solving
$\max_{\mathbf{w}} \; -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
Equivalent to: training a linear classifier by minimizing the logistic loss.
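As a sketch, the objective above is the negative of the logistic loss summed over the training set; a possible NumPy implementation (our naming) is:

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """Sum over examples of log(1 + exp(-y_i w^T x_i)).

    X: (m, d) array of feature vectors; y: (m,) array of labels in {-1, +1}.
    Maximizing the log-likelihood is the same as minimizing this quantity.
    """
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))
```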
Maximum a posteriori estimation
We could also add a prior on the weights. Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation $\sigma$:
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j)$
$p(w_j) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
MAP estimation for logistic regression
Let us work through this procedure again to see what changes.
What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)
Here, we maximize the posterior probability of the model given the data, i.e. we find the most probable model given the data:
$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\,P(\mathbf{w})$
Learning by solving
$\operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid S) = \operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w})\,P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
We have already expanded out the first term:
$\log P(S \mid \mathbf{w}) = -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
Expanding the log prior gives
$\log P(\mathbf{w}) = -\sum_{j=1}^{d} \frac{w_j^2}{\sigma^2} + \text{constants}$
So learning solves
$\max_{\mathbf{w}} \; -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) - \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$
Maximizing a negative function is the same as minimizing the function.
Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving
$\min_{\mathbf{w}} \; \sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) + \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$
Where have we seen this before?
The first question in the homework: write down the stochastic gradient descent algorithm for this (a sketch follows below).
Historically, other training algorithms exist; in particular, you might run into L-BFGS.
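One possible stochastic gradient descent sketch for this objective (the hyperparameters, names, and per-example scaling of the regularizer are our choices, not prescribed by the slides):

```python
import numpy as np

def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.01, epochs=100, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w by SGD.

    X: (m, d) feature matrix; y: (m,) labels in {-1, +1}. Returns the learned w.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            # Gradient of log(1 + exp(-margin)) w.r.t. w is -y_i x_i * sigma(-margin)
            grad_loss = -y[i] * X[i] / (1.0 + np.exp(margin))
            # Gradient of (1/sigma2) w^T w, split evenly across the m examples
            grad_reg = (2.0 / (sigma2 * m)) * w
            w -= lr * (grad_loss + grad_reg)
    return w
```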
Logistic regression is...
A classifier that predicts the probability that the label is +1 for a particular input
The discriminative counterpart of the naïve Bayes classifier
A discriminative probabilistic classifier that can be trained via MAP or MLE estimation
A discriminative classifier that minimizes the logistic loss over the training set
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Learning as loss minimization
The setup:
Examples $\mathbf{x}$ drawn from a fixed, unknown distribution $D$
A hidden oracle classifier $f$ labels the examples
We wish to find a hypothesis $h$ that mimics $f$
The ideal situation: define a function $L$ that penalizes bad hypotheses; learning then picks a function $h \in H$ to minimize the expected loss.
But the distribution $D$ is unknown. Instead, minimize the empirical loss on the training set.
Empirical loss minimization
Learning = minimize empirical loss on the training set.
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses. This is achieved using a regularizer, which penalizes complex hypotheses.
Regularized loss minimization
Learning: minimize the regularized empirical loss. With linear classifiers (using $\ell_2$ regularization), this is
$\min_{\mathbf{w}} \; \frac{1}{m}\sum_{i} L(y_i\,\mathbf{w}^T\mathbf{x}_i) + \lambda\,\mathbf{w}^T\mathbf{w}$
What is a loss function? Loss functions should penalize mistakes.
We are minimizing the average loss over the training data.
What is the ideal loss function for classification?
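A minimal sketch of this recipe with a pluggable loss (the names and the choice of averaging are ours):

```python
import numpy as np

def regularized_risk(w, X, y, loss_fn, lam=0.1):
    """Average loss over the training set plus an l2 regularizer.

    loss_fn maps an array of margins y_i * w^T x_i to per-example losses.
    """
    margins = y * (X @ w)
    return np.mean(loss_fn(margins)) + lam * np.dot(w, w)
```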
The 0-1 loss
Penalize classification mistakes between the true label $y$ and the prediction $y'$.
For linear classifiers, the prediction is $y' = \mathrm{sgn}(\mathbf{w}^T\mathbf{x})$.
Mistake if $y\,\mathbf{w}^T\mathbf{x} \le 0$.
Minimizing the 0-1 loss is intractable. We need surrogates.
The loss function zoo
Many loss functions exist:
Perceptron loss
Hinge loss (SVM)
Exponential loss (AdaBoost)
Logistic loss (logistic regression)
The loss function zoo
[Figure: the zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression losses plotted as functions of the margin $y\,\mathbf{w}^T\mathbf{x}$, shown zoomed out at several scales.]
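For reference, here is a sketch of the losses in the zoo, each written as a function of the margin $z = y\,\mathbf{w}^T\mathbf{x}$ (standard functional forms; plotting them reproduces the figure qualitatively):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)

def perceptron_loss(z):
    return np.maximum(0.0, -z)

def hinge_loss(z):           # SVM
    return np.maximum(0.0, 1.0 - z)

def exponential_loss(z):     # AdaBoost
    return np.exp(-z)

def logistic_loss(z):        # logistic regression
    return np.log(1.0 + np.exp(-z))

margins = np.linspace(-2.0, 2.0, 5)
for name, fn in [("zero-one", zero_one_loss), ("perceptron", perceptron_loss),
                 ("hinge", hinge_loss), ("exponential", exponential_loss),
                 ("logistic", logistic_loss)]:
    print(name, fn(margins))
```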