Logistic Regression. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824


1 Logistic Regression Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019

2 Administrative Please start HW 1 early! Questions are welcome!

3 Two principles for estimating parameters
Maximum likelihood estimate (MLE): choose θ that maximizes the probability of the observed data.
$\theta_{MLE} = \arg\max_{\theta} P(\text{Data} \mid \theta)$
Maximum a posteriori estimation (MAP): choose θ that is most probable given the prior probability and the data.
$\theta_{MAP} = \arg\max_{\theta} P(\theta \mid \text{Data}) = \arg\max_{\theta} \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})}$
Slide credit: Tom Mitchell

4 Naïve Bayes classifier
Want to learn $P(Y \mid X_1, \dots, X_n)$, but this requires $2^n$ parameters.
How about applying Bayes rule?
$P(Y \mid X_1, \dots, X_n) = \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)} \propto P(X_1, \dots, X_n \mid Y)\, P(Y)$
$P(X_1, \dots, X_n \mid Y)$: need $(2^n - 1) \times 2$ parameters
$P(Y)$: need 1 parameter
Apply the conditional independence assumption:
$P(X_1, \dots, X_n \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)$: need $n \times 2$ parameters

5 Naïve Bayes classifier
Bayes rule:
$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k)\, P(X_1, \dots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \dots, X_n \mid Y = y_j)}$
Assume conditional independence among the $X_i$:
$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$
Pick the most probable Y:
$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell

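6 Example
$P(Y \mid X_1, X_2) \propto P(Y)\, P(X_1, X_2 \mid Y)$ (Bayes rule) $= P(Y)\, P(X_1 \mid Y)\, P(X_2 \mid Y)$ (conditional independence)
Estimated parameters:
$P(Y = 1) = 0.4$, $P(Y = 0) = 0.6$
$P(X_1 = 1 \mid Y = 1) = 0.2$, $P(X_1 = 0 \mid Y = 1) = 0.8$
$P(X_1 = 1 \mid Y = 0) = 0.7$, $P(X_1 = 0 \mid Y = 0) = 0.3$
$P(X_2 = 1 \mid Y = 1) = 0.3$, $P(X_2 = 0 \mid Y = 1) = 0.7$
$P(X_2 = 1 \mid Y = 0) = 0.9$, $P(X_2 = 0 \mid Y = 0) = 0.1$
Test example: $X_1 = 1$, $X_2 = 0$
$Y = 1$: $P(Y = 1)\, P(X_1 = 1 \mid Y = 1)\, P(X_2 = 0 \mid Y = 1) = 0.4 \times 0.2 \times 0.7 = 0.056$
$Y = 0$: $P(Y = 0)\, P(X_1 = 1 \mid Y = 0)\, P(X_2 = 0 \mid Y = 0) = 0.6 \times 0.7 \times 0.1 = 0.042$
Since $0.056 > 0.042$, the classifier predicts $Y = 1$.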
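The test example on this slide can be checked numerically. The following is a minimal sketch that uses the probabilities listed on the slide; the dictionary layout and the function names `nb_scores` and `nb_posterior` are illustrative choices, not part of the lecture.

```python
# Parameters copied from the slide; the nested-dictionary layout is an assumption.
prior = {1: 0.4, 0: 0.6}                            # P(Y = y)
p_x1 = {1: {1: 0.2, 0: 0.8}, 0: {1: 0.7, 0: 0.3}}   # p_x1[y][v] = P(X1 = v | Y = y)
p_x2 = {1: {1: 0.3, 0: 0.7}, 0: {1: 0.9, 0: 0.1}}   # p_x2[y][v] = P(X2 = v | Y = y)


def nb_scores(x1, x2):
    """Unnormalized posterior P(Y=y) * P(X1=x1 | Y=y) * P(X2=x2 | Y=y) per class."""
    return {y: prior[y] * p_x1[y][x1] * p_x2[y][x2] for y in prior}


def nb_posterior(x1, x2):
    """Normalize the scores so that they sum to one over the classes."""
    scores = nb_scores(x1, x2)
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


scores = nb_scores(1, 0)                   # test example X1 = 1, X2 = 0
prediction = max(scores, key=scores.get)   # class with the largest score
```

The scores come out as roughly 0.056 for Y = 1 and 0.042 for Y = 0, so the predicted class is 1. Normalizing is not needed for classification, only when the posterior probability itself is of interest.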

7 Naïve Bayes algorithm (discrete $X_i$)
For each value $y_k$:
  Estimate $\pi_k = P(Y = y_k)$
  For each value $x_{ij}$ of each attribute $X_i$, estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
Classify $X^{test}$:
$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{test} \mid Y = y_k)$
$\hat{Y} \leftarrow \arg\max_{y_k} \pi_k \prod_i \theta_{ijk}$
Slide credit: Tom Mitchell

8 Estimating parameters: discrete $Y$, $X_i$
Maximum likelihood estimates (MLE):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
Slide credit: Tom Mitchell

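9 Exercise
F = 1 iff you live in Fox Ridge
S = 1 iff you watched the Super Bowl last night
D = 1 iff you drive to VT
G = 1 iff you went to the gym in the last month
Parameters to estimate:
$P(F = 1) =$ ___, $P(F = 0) =$ ___
$P(S = 1 \mid F = 1) =$ ___, $P(S = 0 \mid F = 1) =$ ___
$P(S = 1 \mid F = 0) =$ ___, $P(S = 0 \mid F = 0) =$ ___
$P(D = 1 \mid F = 1) =$ ___, $P(D = 0 \mid F = 1) =$ ___
$P(D = 1 \mid F = 0) =$ ___, $P(D = 0 \mid F = 0) =$ ___
$P(G = 1 \mid F = 1) =$ ___, $P(G = 0 \mid F = 1) =$ ___
$P(G = 1 \mid F = 0) =$ ___, $P(G = 0 \mid F = 0) =$ ___
$P(F \mid S, D, G) \propto P(F)\, P(S \mid F)\, P(D \mid F)\, P(G \mid F)$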

10 Naïve Bayes: Subtlety #1
Often the $X_i$ are not really conditionally independent.
Naïve Bayes often works pretty well anyway: it often gives the right classification even when it does not give the right probability [Domingos & Pazzani, 1996].
What is the effect on the estimated $P(Y \mid X)$?
What if we have two copies of the same feature, $X_i = X_k$?
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell

11 Naïve Bayes: Subtlety #2
The MLE estimate for $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i$ = birthdate, $X_i$ = Feb_4_1995).
Why worry about just one parameter out of many?
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
What can we do to address this? Use MAP estimates (adding imaginary examples).
Slide credit: Tom Mitchell

12 Estimating parameters: discrete $Y$, $X_i$
Maximum likelihood estimates (MLE):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij}, Y = y_k\}}{\#D\{Y = y_k\}}$
MAP estimates (Dirichlet priors):
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij}, Y = y_k\} + (\beta_k - 1)}{\#D\{Y = y_k\} + \sum_m (\beta_m - 1)}$
Slide credit: Tom Mitchell
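The counting estimates above can be sketched in a few lines. This is a minimal illustration, assuming every Dirichlet parameter is equal so that `pseudo` stands for the shared value of β − 1 (the number of imaginary examples per value); the function names and the toy data are hypothetical.

```python
from collections import Counter


def estimate_prior(labels, classes, pseudo=0.0):
    """Estimate P(Y = y_k); pseudo = beta - 1 (assumed equal for all classes), 0 gives the MLE."""
    counts = Counter(labels)
    denom = len(labels) + pseudo * len(classes)
    return {k: (counts[k] + pseudo) / denom for k in classes}


def estimate_conditional(values, labels, k, domain, pseudo=0.0):
    """Estimate P(X_i = v | Y = y_k) for every value v in domain."""
    in_class = [v for v, y in zip(values, labels) if y == k]
    counts = Counter(in_class)
    # With pseudo = 0 this fails if class k has no examples; the toy data avoids that.
    denom = len(in_class) + pseudo * len(domain)
    return {v: (counts[v] + pseudo) / denom for v in domain}


# Hypothetical toy data: one binary attribute and binary labels.
x = [1, 1, 1, 0, 0, 0]
y = [1, 1, 1, 0, 0, 1]
mle = estimate_conditional(x, y, k=0, domain=[0, 1])               # contains a zero estimate
map_ = estimate_conditional(x, y, k=0, domain=[0, 1], pseudo=1.0)  # no zero estimate
```

With one imaginary example per value, the estimate for the unseen combination X = 1, Y = 0 moves from 0 to 0.25, which is the effect Subtlety #2 asks for.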

13 What if we have continuous $X_i$?
Gaussian Naïve Bayes (GNB): assume
$P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$
Possible additional assumptions on $\sigma_{ik}$:
  it is independent of $Y$ ($\sigma_i$)
  it is independent of $X_i$ ($\sigma_k$)
  it is independent of both $X_i$ and $Y$ ($\sigma$)
Slide credit: Tom Mitchell

14 Naïve Bayes algorithm (continuous $X_i$)
For each value $y_k$:
  Estimate $\pi_k = P(Y = y_k)$
  For each attribute $X_i$, estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}^2$
Classify $X^{test}$:
$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{test} \mid Y = y_k)$
$\hat{Y} \leftarrow \arg\max_{y_k} \pi_k \prod_i \text{Normal}(X_i^{test}; \mu_{ik}, \sigma_{ik})$
Slide credit: Tom Mitchell
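A minimal sketch of the continuous-feature algorithm follows. It assumes maximum likelihood estimates for the means and variances and compares classes in log space, which gives the same argmax as the product but avoids underflow; the function names and the toy data are hypothetical.

```python
import math


def fit_gnb(X, y):
    """Return {class: (prior, means, variances)} using maximum likelihood estimates."""
    params = {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        params[k] = (n / len(y), means, variances)
    return params


def log_normal(x, mean, var):
    """Log density of Normal(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_gnb(params, x):
    """argmax over classes of log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik^2)."""
    def score(k):
        prior, means, variances = params[k]
        return math.log(prior) + sum(
            log_normal(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(params, key=score)


# Hypothetical toy data: two real-valued features, two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gnb(X, y)
```

Calling `predict_gnb(params, [1.1, 2.0])` classifies a point near the first cluster. A feature with zero variance inside a class would break `log_normal`, so real data usually needs a small variance floor.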

15 Things to remember
Probability basics: conditional probability, joint probability, Bayes rule
Estimating parameters from data:
  Maximum likelihood (ML): maximize $P(\text{Data} \mid \theta)$
  Maximum a posteriori estimation (MAP): maximize $P(\theta \mid \text{Data})$
Naïve Bayes:
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

16 Logistic Regression
Hypothesis representation
Cost function
Logistic regression with gradient descent
Regularization
Multi-class classification


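18 [Figure: Malignant? (1 = Yes, 0 = No) plotted against tumor size, with the linear fit $h_\theta(x) = \theta^\top x$]
Threshold the classifier output $h_\theta(x)$ at 0.5:
  If $h_\theta(x) \geq 0.5$, predict $y = 1$
  If $h_\theta(x) < 0.5$, predict $y = 0$
Slide credit: Andrew Ng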

19 Classification: $y = 1$ or $y = 0$
$h_\theta(x) = \theta^\top x$ (from linear regression) can be $> 1$ or $< 0$.
Logistic regression: $0 \leq h_\theta(x) \leq 1$
Logistic regression is actually for classification.
Slide credit: Andrew Ng

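20 Hypothesis representation
Want $0 \leq h_\theta(x) \leq 1$
$h_\theta(x) = g(\theta^\top x)$, where $g(z) = \frac{1}{1 + e^{-z}}$
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
$g$ is called the sigmoid function or logistic function.
[Figure: $g(z)$ plotted against $z$]
Slide credit: Andrew Ng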
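The hypothesis can be written directly from its definition. The sketch below is one possible implementation; the two-branch form is an added detail (not from the slides) that avoids overflow in `math.exp` for large negative z, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is expected to include the constant feature x_0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))
```

For any real z the output lies between 0 and 1, and `sigmoid(0)` is exactly 0.5, which is why thresholding the hypothesis at 0.5 corresponds to checking the sign of $\theta^\top x$.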

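21 Interpretation of hypothesis output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$
Example: if $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_\theta(x) = 0.7$,
tell the patient that there is a 70% chance of the tumor being malignant.
Slide credit: Andrew Ng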

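22 Logistic regression
$h_\theta(x) = g(\theta^\top x)$, $g(z) = \frac{1}{1 + e^{-z}}$
[Figure: $g(z)$ plotted against $z = \theta^\top x$]
Suppose we predict $y = 1$ if $h_\theta(x) \geq 0.5$, which holds when $z = \theta^\top x \geq 0$,
and predict $y = 0$ if $h_\theta(x) < 0.5$, which holds when $z = \theta^\top x < 0$.
Slide credit: Andrew Ng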

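23 Decision boundary
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
[Figure: scatter plot with axes Tumor Size and Age]
E.g., $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$
Predict $y = 1$ if $-3 + x_1 + x_2 \geq 0$, that is, on or above the line $x_1 + x_2 = 3$.
Slide credit: Andrew Ng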
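The linear decision boundary can be illustrated with a short sketch. It takes the example parameters to be $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$, so the boundary is the line $x_1 + x_2 = 3$; the function name `predict` is an illustrative choice.

```python
theta = (-3.0, 1.0, 1.0)  # (theta_0, theta_1, theta_2), assumed example values


def predict(x1, x2, theta=theta):
    """Predict y = 1 exactly when theta^T x >= 0, which is equivalent to h_theta(x) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0
```

A point such as (2, 2) lies above the line and is labeled 1, while (1, 1) lies below it and is labeled 0. The sigmoid never has to be evaluated to make the decision, because only the sign of $\theta^\top x$ matters.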

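24 Non-linear decision boundaries
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$
E.g., $\theta_0 = -1$, $\theta_1 = 0$, $\theta_2 = 0$, $\theta_3 = 1$, $\theta_4 = 1$
Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \geq 0$
Higher-order polynomial features give more complex boundaries:
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1^2 x_2 + \theta_5 x_1^2 x_2^2 + \theta_6 x_1^3 x_2 + \cdots)$
Slide credit: Andrew Ng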

25 Where does the form come from?
Logistic regression hypothesis representation:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n)}}$
Consider learning $f: X \rightarrow Y$, where
  $X$ is a vector of real-valued features $X_1, \dots, X_n$
  $Y$ is Boolean
  all $X_i$ are assumed conditionally independent given $Y$
  $P(X_i \mid Y = y_k)$ is modeled as Gaussian $N(\mu_{ik}, \sigma_i)$
  $P(Y)$ is modeled as Bernoulli $\pi$
What is $P(Y \mid X_1, X_2, \dots, X_n)$?
Slide credit: Tom Mitchell

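26 With $P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\right)$ and $\pi = P(Y = 1)$:
$P(Y = 1 \mid X) = \frac{P(Y = 1)\, P(X \mid Y = 1)}{P(Y = 1)\, P(X \mid Y = 1) + P(Y = 0)\, P(X \mid Y = 0)}$ (applying Bayes rule)
$= \frac{1}{1 + \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}}$ (divide by $P(Y = 1)\, P(X \mid Y = 1)$)
$= \frac{1}{1 + \exp\left(\ln \frac{P(Y = 0)\, P(X \mid Y = 0)}{P(Y = 1)\, P(X \mid Y = 1)}\right)}$ (apply $\exp(\ln(\cdot))$)
$= \frac{1}{1 + \exp\left(\ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}\right)}$ (conditional independence)
Plugging in the Gaussian $P(X_i \mid Y)$:
$\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} = \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)$
$P(Y = 1 \mid X_1, X_2, \dots, X_n) = \frac{1}{1 + \exp\left(\theta_0 + \sum_i \theta_i X_i\right)}$
with $\theta_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}$ and $\theta_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$. Negating every $\theta$ gives the form $\frac{1}{1 + e^{-\theta^\top x}}$ used for $h_\theta(x)$.
Slide credit: Tom Mitchell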
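The derivation can be checked numerically: under the Gaussian Naïve Bayes assumptions with a per-feature variance shared by both classes, the posterior from Bayes rule should equal $1 / (1 + \exp(\theta_0 + \sum_i \theta_i X_i))$ with $\theta_i = (\mu_{i0} - \mu_{i1}) / \sigma_i^2$ and $\theta_0 = \ln\frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}$. The sketch below uses made-up parameter values; all names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


# Hypothetical parameters for two features.
pi = 0.3                              # P(Y = 1)
mu0, mu1 = [0.0, 2.0], [1.0, -1.0]    # class-conditional means mu_i0 and mu_i1
sigma = [1.5, 0.5]                    # sigma_i, shared by both classes


def posterior_bayes(x):
    """P(Y = 1 | x) computed directly from Bayes rule."""
    p1 = pi * math.prod(normal_pdf(xi, m, s) for xi, m, s in zip(x, mu1, sigma))
    p0 = (1 - pi) * math.prod(normal_pdf(xi, m, s) for xi, m, s in zip(x, mu0, sigma))
    return p1 / (p1 + p0)


# Closed-form parameters from the derivation.
theta = [(m0 - m1) / s ** 2 for m0, m1, s in zip(mu0, mu1, sigma)]
theta0 = math.log((1 - pi) / pi) + sum(
    (m1 ** 2 - m0 ** 2) / (2 * s ** 2) for m0, m1, s in zip(mu0, mu1, sigma))


def posterior_logistic(x):
    """P(Y = 1 | x) = 1 / (1 + exp(theta_0 + sum_i theta_i x_i))."""
    return 1.0 / (1.0 + math.exp(theta0 + sum(t * xi for t, xi in zip(theta, x))))
```

Evaluating both functions at the same point gives the same value up to floating-point error, which illustrates that Gaussian Naïve Bayes with shared variances implies a posterior of logistic form.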

28 Logistic Regression
Hypothesis representation
Cost function
Logistic regression with gradient descent
Regularization
Multi-class classification

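29 Training set with $m$ examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $x_0 = 1$, $y \in \{0, 1\}$
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
How to choose parameters $\theta$?
Slide credit: Andrew Ng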

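30 Cost function for Linear Regression
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$
$\text{Cost}(h_\theta(x), y) = \frac{1}{2} \left(h_\theta(x) - y\right)^2$
Slide credit: Andrew Ng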

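31 Cost function for Logistic Regression
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
[Figure: two panels, "if y = 1" and "if y = 0", each plotting the cost against $h_\theta(x)$ from 0 to 1]
Slide credit: Andrew Ng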

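32 Logistic regression cost function
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$
If $y = 1$: $\text{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$
If $y = 0$: $\text{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$
Slide credit: Andrew Ng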

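33 Logistic regression
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Learning: fit parameter $\theta$ by solving $\min_\theta J(\theta)$
Prediction: given a new $x$, output $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
Slide credit: Andrew Ng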
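The cost $J(\theta)$, the average of $-y \log h_\theta(x) - (1 - y) \log(1 - h_\theta(x))$ over the training set, can be computed with a short loop. This is a minimal sketch on made-up data; it assumes $0 < h_\theta(x) < 1$ for every example and does no clipping, so extreme parameter values could make `math.log` fail.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m


# Hypothetical data; each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 1, 1]
baseline = cost([0.0, 0.0], X, y)    # every h is 0.5, so the cost is log 2
better = cost([-2.5, 2.0], X, y)     # parameters that classify all three points correctly
```

With all parameters at zero every prediction is 0.5 and the cost equals log 2 (about 0.693); parameters that separate the classes give a lower cost.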

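34 Where does the cost come from?
Training set with $m$ examples: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$
Maximum likelihood estimate for parameter $\theta$:
$\theta_{MLE} = \arg\max_\theta P_\theta\left((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\right) = \arg\max_\theta \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$
Maximum conditional likelihood estimate for parameter $\theta$
Slide credit: Tom Mitchell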

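35 Goal: choose $\theta$ to maximize the conditional likelihood of the training data
$P_\theta(Y = 1 \mid X = x) = h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
$P_\theta(Y = 0 \mid X = x) = 1 - h_\theta(x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$
Training data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
Data likelihood $= \prod_{i=1}^{m} P_\theta(x^{(i)}, y^{(i)})$
Data conditional likelihood $= \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
$\theta_{MCLE} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
Slide credit: Tom Mitchell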

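36 Expressing conditional log-likelihood
$L(\theta) = \log \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^{m} \log P_\theta(y^{(i)} \mid x^{(i)})$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log P_\theta(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P_\theta(y^{(i)} = 0 \mid x^{(i)}) \right]$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
So $J(\theta) = -\frac{1}{m} L(\theta)$: minimizing the cost is the same as maximizing the conditional log-likelihood.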

37 Logistic Regression
Hypothesis representation
Cost function
Logistic regression with gradient descent
Regularization
Multi-class classification

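38 Gradient descent
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Goal: $\min_\theta J(\theta)$
Good news: convex function! Bad news: no analytical solution.
Repeat {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
} (simultaneously update all $\theta_j$)
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Slide credit: Andrew Ng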

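39 Gradient descent
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Goal: $\min_\theta J(\theta)$
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
} (simultaneously update all $\theta_j$)
Slide credit: Andrew Ng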
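The update rule, in which each $\theta_j$ moves by $-\alpha \frac{1}{m} \sum_i (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$, can be sketched as batch gradient descent in plain Python. The learning rate, iteration count, and toy data below are assumptions chosen for illustration, not values from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij for every j."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, v in enumerate(x_i):
            grad[j] += err * v / m
    return grad


def gradient_descent(X, y, alpha=0.5, iterations=5000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)                          # uses the old theta for all j
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # simultaneous update
    return theta


# Hypothetical 1-D data; each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(X, y)
predictions = [1 if sigmoid(sum(t * v for t, v in zip(theta, x))) >= 0.5 else 0 for x in X]
```

Computing the full gradient before changing any parameter is what makes the update simultaneous. On this separable toy set the learned boundary $-\theta_0 / \theta_1$ falls between the two groups of points.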

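40 Gradient descent for Linear Regression
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
}
with $h_\theta(x) = \theta^\top x$
Gradient descent for Logistic Regression
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
}
with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
Slide credit: Andrew Ng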

41 Logistic Regression
Hypothesis representation
Cost function
Logistic regression with gradient descent
Regularization
Multi-class classification

42 How about MAP?
Maximum conditional likelihood estimate (MCLE):
$\theta_{MCLE} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
Maximum conditional a posteriori estimate (MCAP):
$\theta_{MCAP} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})\, P(\theta)$

43 Prior $P(\theta)$
Common choice of $P(\theta)$: normal distribution with zero mean and identity covariance
Pushes parameters toward zero
Corresponds to regularization
Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell

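44 MLE vs. MAP
Maximum conditional likelihood estimate (MCLE):
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Maximum conditional a posteriori estimate (MCAP):
$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$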
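The two updates differ only by the shrinkage term $-\alpha\lambda\theta_j$. The sketch below implements one step of each so the difference can be inspected directly; it follows the slide literally and shrinks every $\theta_j$, including $\theta_0$, which many implementations leave unpenalized. The function name `update` and the toy values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update(theta, X, y, alpha, lam=0.0):
    """One simultaneous update; lam = 0 gives the MCLE step, lam > 0 the MCAP step."""
    m = len(y)
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x))) - y_i
              for x, y_i in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        data_term = sum(e * x[j] for e, x in zip(errors, X)) / m
        new_theta.append(t - alpha * lam * t - alpha * data_term)
    return new_theta


# Hypothetical data and starting point; each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 1, 1]
start = [1.0, -2.0]
mcle = update(start, X, y, alpha=0.1)
mcap = update(start, X, y, alpha=0.1, lam=0.5)
```

Each component of `mcap` equals the corresponding component of `mcle` minus `alpha * lam * start[j]`, so the prior pulls every weight a little closer to zero at each step.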

45 Logistic Regression
Hypothesis representation
Cost function
Logistic regression with gradient descent
Regularization
Multi-class classification

46 Multi-class classification
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnosis: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng

47 [Figure: two scatter plots over features $x_1$ and $x_2$, titled "Binary classification" and "Multiclass classification"]

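48 One-vs-all (one-vs-rest)
[Figure: a three-class scatter plot over $x_1$ and $x_2$, and three binary sub-problems (Class 1, Class 2, Class 3 versus the rest) with classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$]
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ $(i = 1, 2, 3)$
Slide credit: Andrew Ng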

49 One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$:
$\hat{y} = \arg\max_i h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
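One-vs-all can be sketched by training one binary logistic regression per class and taking the class whose classifier is most confident. The following is a minimal illustration using plain batch gradient descent; the hyperparameters, function names, and three-cluster toy data are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, alpha=0.5, iterations=3000):
    """Batch gradient descent for one binary logistic regression classifier."""
    theta = [0.0] * len(X[0])
    m = len(y)
    for _ in range(iterations):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, x))) - y_i
                  for x, y_i in zip(X, y)]
        theta = [t - alpha * sum(e * x[j] for e, x in zip(errors, X)) / m
                 for j, t in enumerate(theta)]
    return theta


def train_one_vs_all(X, labels):
    """Train one classifier per class i, relabeling class i as 1 and all others as 0."""
    return {c: train_binary(X, [1 if label == c else 0 for label in labels])
            for c in sorted(set(labels))}


def predict_one_vs_all(models, x):
    """Pick the class i whose classifier outputs the largest h_theta^(i)(x)."""
    return max(models, key=lambda c: sigmoid(sum(t * v for t, v in zip(models[c], x))))


# Hypothetical data: three clusters; each row is [x_0 = 1, x_1, x_2].
X = [[1, 0.0, 0.0], [1, 0.5, 0.5], [1, 0.0, 1.0],
     [1, 4.0, 0.0], [1, 4.5, 0.5], [1, 3.5, 0.0],
     [1, 0.0, 4.0], [1, 0.5, 4.5], [1, 0.0, 3.5]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_all(X, labels)
```

Calling `predict_one_vs_all(models, [1, 4.2, 0.2])` returns the label of the cluster nearest that point. The per-class outputs are not guaranteed to sum to one, so they are used for ranking classes rather than as a joint distribution.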

50 Generative approach (e.g., Naïve Bayes)
  Estimate $P(Y)$ and $P(X \mid Y)$
  Prediction: $\hat{y} = \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$
Discriminative approach (e.g., logistic regression)
  Estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM)
  Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$

51 Further readings
Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression"
Andrew Ng and Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes"

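52 Things to remember
Hypothesis representation: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
Cost function: $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Logistic regression with gradient descent: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Regularization: $\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Multi-class classification: $\max_i h_\theta^{(i)}(x)$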

53 Coming up
Regularization
Support Vector Machine
