Logistic Regression
Machine Learning, Fall 2018
Where are we?
We have seen the following ideas:
Linear models
Learning as loss minimization
Bayesian learning criteria (MAP and MLE estimation)
The Naïve Bayes classifier
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Logistic Regression: Setup
The setting:
Binary classification
Inputs: feature vectors $\mathbf{x} \in \mathbb{R}^d$
Labels: $y \in \{-1, +1\}$
Training data: $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
Classification, but...
The output $y$ is discrete valued ($-1$ or $+1$).
Instead of predicting the output, let us try to predict $P(y = 1 \mid \mathbf{x})$.
Expand the hypothesis space to functions whose output is in $[0, 1]$:
Original problem: $\mathbb{R}^d \rightarrow \{-1, 1\}$
Modified problem: $\mathbb{R}^d \rightarrow [0, 1]$
This effectively makes the problem a regression problem.
Many hypothesis spaces are possible.
The sigmoid function
The hypothesis space for logistic regression: all functions of the form
$\sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
That is, a linear function composed with the sigmoid function $\sigma$ (the logistic function).
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
The sigmoid function
[Figure: plot of $\sigma(z)$ as a function of $z$.]
The sigmoid function
What is its derivative with respect to $z$?
$\frac{d\sigma}{dz} = \sigma(z)\,(1 - \sigma(z))$
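As a concrete aside (not part of the original slides), here is a minimal Python sketch of the sigmoid and its derivative; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """The logistic function: sigma(z) = 1 / (1 + exp(-z)), with range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0 and vanishes as |z| grows.
print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5, 0.25
```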
Predicting probabilities
According to the logistic regression model, we have
$P(y = +1 \mid \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
Or equivalently, for $y \in \{-1, +1\}$,
$P(y \mid \mathbf{x}; \mathbf{w}) = \sigma(y\,\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$
Note that we are directly modeling $P(y \mid \mathbf{x})$ rather than $P(\mathbf{x} \mid y)$ and $P(y)$.
Predicting a label with logistic regression
Compute $P(y = +1 \mid \mathbf{x}; \mathbf{w})$.
If this is greater than half, predict $+1$; else predict $-1$.
What does this correspond to in terms of $\mathbf{w}^T\mathbf{x}$?
Prediction = $\mathrm{sgn}(\mathbf{w}^T\mathbf{x})$
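A minimal sketch of prediction under this model (illustrative names, assuming NumPy arrays for $\mathbf{w}$ and $\mathbf{x}$):

```python
import numpy as np

def predict_proba(w, x):
    """P(y = +1 | x; w) = sigma(w^T x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    """Predict +1 if P(y = +1 | x; w) > 1/2, else -1.

    Since sigma(w^T x) > 1/2 exactly when w^T x > 0, this is sgn(w^T x).
    """
    return 1 if np.dot(w, x) > 0 else -1
```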
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Naïve Bayes and logistic regression
Remember that the naïve Bayes decision is a linear function:
$\log \frac{P(y = -1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} = -\mathbf{w}^T\mathbf{x}$
Here, the $P$'s represent the naïve Bayes posterior distribution, which is calculated from the priors and the likelihoods. That is, $P(y = -1 \mid \mathbf{x}, \mathbf{w})$ is computed using $P(\mathbf{x} \mid y = -1, \mathbf{w})$ and $P(y = -1 \mid \mathbf{w})$.
But we also know that $P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$.
Substituting in the above expression, we get
$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
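Written out, the substitution step is the following short calculation (a worked version of the algebra the slide summarizes):

```latex
\begin{align*}
\log\frac{P(y=-1\mid\mathbf{x},\mathbf{w})}{P(y=+1\mid\mathbf{x},\mathbf{w})}
    &= -\mathbf{w}^T\mathbf{x} \\
\frac{1 - P(y=+1\mid\mathbf{x},\mathbf{w})}{P(y=+1\mid\mathbf{x},\mathbf{w})}
    &= \exp(-\mathbf{w}^T\mathbf{x})
    && \text{exponentiate, use } P(y=-1\mid\mathbf{x},\mathbf{w}) = 1 - P(y=+1\mid\mathbf{x},\mathbf{w}) \\
P(y=+1\mid\mathbf{x},\mathbf{w})
    &= \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}
     = \sigma(\mathbf{w}^T\mathbf{x})
\end{align*}
```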
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
First: maximum likelihood estimation
Then: adding priors → maximum a posteriori estimation
Back to loss minimization
Maximum likelihood estimation
Let's get back to the problem of learning.
Training data: $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
What we want: find a $\mathbf{w}$ such that $P(S \mid \mathbf{w})$ is maximized.
We know that our examples are drawn independently and are identically distributed (i.i.d.).
How do we proceed?
Maximum likelihood estimation
The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
$\operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w}) = \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$
The usual trick: convert products to sums by taking the log. Recall that this works only because log is an increasing function and the maximizer will not change.
Equivalent to solving
$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$
But (by definition) we know that
$P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sigma(y_i\,\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)}$
Equivalent to solving
$\max_{\mathbf{w}} \; -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
Equivalent to: training a linear classifier by minimizing the logistic loss.
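As a sketch, the objective above is the negative of the logistic loss summed over the training set; a possible NumPy implementation (our naming) is:

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """Sum over examples of log(1 + exp(-y_i w^T x_i)).

    X: (m, d) array of feature vectors; y: (m,) array of labels in {-1, +1}.
    Maximizing the log-likelihood is the same as minimizing this quantity.
    """
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    return np.sum(np.logaddexp(0.0, -margins))
```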
Maximum a posteriori estimation
We could also add a prior on the weights. Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation $\sigma$:
$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j)$
$p(w_j) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$
MAP estimation for logistic regression
Let us work through this procedure again to see what changes.
What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)
Here, we maximize the posterior probability of the model given the data, i.e. we find the most probable model given the data:
$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\,P(\mathbf{w})$
Learning by solving
$\operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid S) = \operatorname*{argmax}_{\mathbf{w}} P(S \mid \mathbf{w})\,P(\mathbf{w})$
Take the log to simplify:
$\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$
We have already expanded out the first term:
$\log P(S \mid \mathbf{w}) = -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right)$
Expanding the log prior gives
$\log P(\mathbf{w}) = -\sum_{j=1}^{d} \frac{w_j^2}{\sigma^2} + \text{constants}$
So learning solves
$\max_{\mathbf{w}} \; -\sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) - \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$
Maximizing a negative function is the same as minimizing the function.
Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving
$\min_{\mathbf{w}} \; \sum_{i} \log\left(1 + \exp(-y_i\,\mathbf{w}^T\mathbf{x}_i)\right) + \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$
Where have we seen this before?
The first question in the homework: write down the stochastic gradient descent algorithm for this (a sketch follows below).
Historically, other training algorithms exist; in particular, you might run into L-BFGS.
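One possible stochastic gradient descent sketch for this objective (the hyperparameters, names, and per-example scaling of the regularizer are our choices, not prescribed by the slides):

```python
import numpy as np

def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.01, epochs=100, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w by SGD.

    X: (m, d) feature matrix; y: (m,) labels in {-1, +1}. Returns the learned w.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            # Gradient of log(1 + exp(-margin)) w.r.t. w is -y_i x_i * sigma(-margin)
            grad_loss = -y[i] * X[i] / (1.0 + np.exp(margin))
            # Gradient of (1/sigma2) w^T w, split evenly across the m examples
            grad_reg = (2.0 / (sigma2 * m)) * w
            w -= lr * (grad_loss + grad_reg)
    return w
```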
Logistic regression is...
A classifier that predicts the probability that the label is +1 for a particular input
The discriminative counterpart of the naïve Bayes classifier
A discriminative probabilistic classifier that can be trained via MAP or MLE estimation
A discriminative classifier that minimizes the logistic loss over the training set
This lecture
Logistic regression
Connection to Naïve Bayes
Training a logistic regression classifier
Back to loss minimization
Learning as loss minimization
The setup:
Examples $\mathbf{x}$ drawn from a fixed, unknown distribution $D$
A hidden oracle classifier $f$ labels the examples
We wish to find a hypothesis $h$ that mimics $f$
The ideal situation: define a function $L$ that penalizes bad hypotheses; learning then picks a function $h \in H$ to minimize the expected loss.
But the distribution $D$ is unknown. Instead, minimize the empirical loss on the training set.
Empirical loss minimization
Learning = minimize empirical loss on the training set.
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses. This is achieved using a regularizer, which penalizes complex hypotheses.
Regularized loss minimization
Learning: minimize the regularized empirical loss. With linear classifiers (using $\ell_2$ regularization), this is
$\min_{\mathbf{w}} \; \frac{1}{m}\sum_{i} L(y_i\,\mathbf{w}^T\mathbf{x}_i) + \lambda\,\mathbf{w}^T\mathbf{w}$
What is a loss function? Loss functions should penalize mistakes.
We are minimizing the average loss over the training data.
What is the ideal loss function for classification?
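A minimal sketch of this recipe with a pluggable loss (the names and the choice of averaging are ours):

```python
import numpy as np

def regularized_risk(w, X, y, loss_fn, lam=0.1):
    """Average loss over the training set plus an l2 regularizer.

    loss_fn maps an array of margins y_i * w^T x_i to per-example losses.
    """
    margins = y * (X @ w)
    return np.mean(loss_fn(margins)) + lam * np.dot(w, w)
```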
The 0-1 loss
Penalize classification mistakes between the true label $y$ and the prediction $y'$.
For linear classifiers, the prediction is $y' = \mathrm{sgn}(\mathbf{w}^T\mathbf{x})$.
Mistake if $y\,\mathbf{w}^T\mathbf{x} \le 0$.
Minimizing the 0-1 loss is intractable. We need surrogates.
The loss function zoo
Many loss functions exist:
Perceptron loss
Hinge loss (SVM)
Exponential loss (AdaBoost)
Logistic loss (logistic regression)
The loss function zoo
[Figure: the zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression losses plotted as functions of the margin $y\,\mathbf{w}^T\mathbf{x}$, shown zoomed out at several scales.]
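For reference, here is a sketch of the losses in the zoo, each written as a function of the margin $z = y\,\mathbf{w}^T\mathbf{x}$ (standard functional forms; plotting them reproduces the figure qualitatively):

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)

def perceptron_loss(z):
    return np.maximum(0.0, -z)

def hinge_loss(z):           # SVM
    return np.maximum(0.0, 1.0 - z)

def exponential_loss(z):     # AdaBoost
    return np.exp(-z)

def logistic_loss(z):        # logistic regression
    return np.log(1.0 + np.exp(-z))

margins = np.linspace(-2.0, 2.0, 5)
for name, fn in [("zero-one", zero_one_loss), ("perceptron", perceptron_loss),
                 ("hinge", hinge_loss), ("exponential", exponential_loss),
                 ("logistic", logistic_loss)]:
    print(name, fn(margins))
```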