Logistic Regression. Some slides adapted from Dan Jurfasky and Brendan O Connor

Size: px

Start display at page:

Download "Logistic Regression. Some slides adapted from Dan Jurfasky and Brendan O Connor"

Shawn Dixon
6 years ago
Views:

1 Logistic Regression Some slides adapted from Dan Jurfasky and Brendan O Connor

2 Naïve Bayes Recap Bag of words (order independent) Features are assumed independent given class P (x 1,...,x n c) =P (x 1 c)...p(x n c) Q: Is this really true?

3 The problem with assuming conditional independence Correlated features -> double counting evidence Parameters are estimated independently This can hurt classifier accuracy and calibration

4 Logistic Regression (Log) Linear Model similar to Naïve Bayes Doesn t assume features are independent Correlated features don t double count

5 What are Features? A feature function, f Input: Document, D (a string) Output: Feature Vector, X

6 What are Features? f(d) = 0 count( boring ) count( not boring ) length of document author of document... 1 C A Doesn t have to be just bag of words

7 Feature Templates Typically feature templates are used to generate many features at once For each word: ${w}_count ${w}_lowercase ${w}_with_not_before_count

8 Logistic Regression: Example Compute Features: 0 f(d i )=x i 1 count( nigerian ) count( prince ) A count( nigerian prince ) Assume we are given some weights: 0 w A 4.0

9 Logistic Regression: Example Compute Features We are given some weights Compute the dot product: z = X X w i x i i=0

10 Logistic Regression: Example Compute the dot product: z = X X i=0 w i x i Compute the logistic function: P (spam x) = ez e z +1 = 1 1+e z

11 The Logistic function P (spam x) = ez e z +1 = 1 1+e z

12 The Dot Product z = X X w i x i i=0 Intuition: weighted sum of features All Linear models have this form

14 Naïve Bayes as a log-linear model Q: what are the features? Q: what are the weights?

15 Naïve Bayes as a Log-Linear Model P (spam D) / P (spam) Y w2d P (w spam) P (spam D) / P (spam) Y w2vocab P (w spam) x i log P (spam D) / log P (spam) + X w2vocab x i log P (w spam)

16 Naïve Bayes as a Log-Linear Model log P (spam D) / log P (spam) + X w2vocab x i log P (w spam) In both Naïve Bayes and Logistic Regression we Compute The Dot Product!

17 NB vs. LR Both compute the dot product NB: sum of log probabilities LR: logistic function

18 Naïve Bayes: NB vs. LR: Parameter Learning Learn conditional probabilities independently by counting Logistic Regression: Learn weights jointly

19 LR: Learning Weights Given: a set of feature vectors and labels Goal: learn the weights

20 Learning Weights Document Feature 2 3 x 11 x 12 x x 1n x 21 x 22 x x 2n x d1 x d2 x d3... x dn 0 Labels y 1 y 2.. y n 1 C A

21 Q: what parameters should we choose? What is the right value for the weights? Maximum Likelihood Principle: Pick the parameters that maximize the probability of the data

22 Maximum Likelihood Estimation MLE = argmax w log P (y 1,...,y d x 1,...,x d ; w) X = argmax w log P (y i x i ; w) i ( X p i, if y i =1 = argmax w log 1 p i, if y i =0 = argmax w X i i log p I(y i=1) i (1 p i ) I(y i=0)

23 Maximum Likelihood Estimation = argmax w X = argmax w X i i log p I(y i=1) i (1 p i ) I(y i=0) y i log p i +(1 y i ) log(1 p i )

24 Maximum Likelihood Estimation Unfortunately there is no closed form solution (like there was with naïve bayes) Solution: Iteratively climb the log-likelihood surface through the derivatives for each weight Luckily, the derivatives turn out to be nice

25 Gradient ascent Loop While not converged: For all features j, compute and add derivatives w new j = w old j L(w): Training j L(w) : Gradient vector

26 Gradient ascent w 2 w 1

27 Gradient ascent w 2 w 1

28 j = X i (y i p i )x j

30 Logistic Regression: Pros and Cons Doesn t assume conditional independence of features Better calibrated probabilities Can handle highly correlated overlapping features NB is faster to train, less likely to overfit

31 NB & LR Both are linear models z = X X i=0 w i x i Training is different: NB: weights are trained independently LR: weights trained jointly

32 Perceptron Algorithm Very similar to logistic regression Not exactly computing gradients

33 Perceptron Algorithm Algorithm is Very similar to logistic regression Not exactly computing gradients Initalize weight vector w = 0 Loop for K iterations Loop For all training examples x_i if sign(w * x_i)!= y_i w += (y_i - sign(w * x_i)) * x_i

34 Differences between LR and Perceptron Online learning vs. Batch Perceptron doesn t always make updates

35 MultiClass Classification Q: what if we have more than 2 categories? Sentiment: Positive, Negative, Neutral Document topics: Sports, Politics, Business, Entertainment, Could train a seperate logistic regression model for each category... Pretty clear what to do with Naive Bayes.

36 Log-Linear Models P (y x) / e w f(d,y) P (y x) = 1 Z(w) ew f(d,y)

37 MultiClass Logistic Regression P (y x) / e w f(d,y) P (y x) = 1 Z(w) ew f(d,y) P (y x) = P e w f(d,y) y 0 2Y ew f(d,y0 )

practice with Multiclass in practice: one giant weight vector and one weight

38 MultiClass Logistic Regression Binary logistic regression: We have one feature vector that matches the size of the vocabulary Can represent this in practice with Multiclass in practice: one giant weight vector and one weight vector for each category repeated features for each category. w pos w neg w neut

39 Q: How to compute posterior class probabilities for multiclass? P (y = j x i )= ew j x i P k ew k x i

40 Maximum Likelihood Estimation MLE = argmax w log P (y 1,...,y n x 1,...,x n ; w) = argmax w X i log P (y i x i ; w) = argmax w X i log P e w f(x i,y i ) y 0 2Y ew f(x i,y i )

41 Multiclass j = DX i=1 f j (y i,d i ) DX i=1 X y2y f j (y, d i )P (y d i )

42 MAP-based j = DX i=1 f j (y i,d i ) DX i=1 f j (arg max y2y P (y d i),d i )

43 Online Learning (perceptron) Rather than making a full pass through the data, compute gradient and update parameters after each training example.

44 MultiClass Perceptron Algorithm Initalize weight vector w = 0 Loop for K iterations Loop For all training examples x_i y_pred = argmax_y w_y * x_i if y_pred!= y_i w_y_gold += x_i w_y_pred -= x_i

45 Q: what if there are only 2 categories? P (y = j x i )= ew j x i P k ew k x i

46 Q: what if there are only 2 categories? P (y =1 x) = e w 1 x e w 0 x+w 1 x w 1 x + e w 1 x

47 Q: what if there are only 2 categories? P (y =1 x) = e w 1 x e w 0 x w 1 x e w 1 x + e w 1 x

48 Q: what if there are only 2 categories? P (y =1 x) = e w 1 x e w 1 x (e w 0 x w 1 x + 1)

49 Q: what if there are only 2 categories? P (y =1 x) = 1 e w 0 x w 1 x +1

50 Q: what if there are only 2 categories? P (y =1 x) = 1 e w0 x +1

51 Regularization Combating over fitting Intuition: don t let the weights get very large w MLE = argmax w log P (y 1,...,y d x 1,...,x d ; w) argmax w log P (y 1,...,y d x 1,...,x d ; w) VX i=1 w 2 i

52 Regularization in the Perceptron Algorithm Can t directly include regularization in gradient # of iterations Parameter averaging

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative