Data Mining 2018 Logistic Regression Text Classification

Size: px

Start display at page:

Download "Data Mining 2018 Logistic Regression Text Classification"

Osborn Hunt
5 years ago
Views:

1 Data Mining 2018 Logistic Regression Text Classification Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 50

2 Two types of approaches to classification In (probabilistic) classification we are interested in the conditional distribution P(Y x), so that, for example, when we observe X = x we can predict the class with the highest probability for that value of X. There are two basic approaches to modeling P(Y x): Generative Models (use Bayes rule): P(Y = j x) = P(x Y = j)p(y = j) P(x) Discriminative Models: model P(Y x) directly. P(x Y = j)p(y = j) = k i=1 P(x Y = i)p(y = i) Ad Feelders ( Universiteit Utrecht ) Data Mining 2 / 50

3 Generative Models Examples of generative classification methods: Naive Bayes classifier (discussed in the previous lecture) Linear/Quadratic Discriminant Analysis (not discussed)... Ad Feelders ( Universiteit Utrecht ) Data Mining 3 / 50

4 Discriminative Models Discriminative methods only model the conditional distribution of Y given X. The probability distribution of X itself is not modeled. For the binary classification problem: P(Y = 1 X ) = f (X, β) where f (X, β) is some deterministic function of X. Ad Feelders ( Universiteit Utrecht ) Data Mining 4 / 50

5 Discriminative Models Examples of discriminative classification methods: Linear probability model Logistic regression Feed-forward neural networks... Ad Feelders ( Universiteit Utrecht ) Data Mining 5 / 50

6 Discriminative Models: linear probability model Consider the linear regression model E[Y x] = β x Y {0, 1}, where But β x = m β j x j, with x 0 1. j=0 E[Y x] = 1 P(Y = 1 x) + 0 P(Y = 0 x) = P(Y = 1 x) So the model assumes that P(Y = 1 x) = β x Ad Feelders ( Universiteit Utrecht ) Data Mining 6 / 50

7 Linear response function P (Y = 1 x) 1 1 β x Ad Feelders ( Universiteit Utrecht ) Data Mining 7 / 50

8 Logistic regression Logistic response function E[Y x] = P(Y = 1 x) = eβ x 1 + e β x or (divide numerator and denominator by e β x ) P(Y = 1 x) = e β x = (1 + e β x ) 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 8 / 50

9 Logistic Response Function P (Y = 1 x) β x Ad Feelders ( Universiteit Utrecht ) Data Mining 9 / 50

10 Linearization: the logit transformation { } P(Y = 1 x) ln 1 P(Y = 1 x) { } (1 + e β x ) 1 = ln 1 (1 + e β x ) 1 { } { } 1 1 = ln = ln (1 + e β x ) 1 e β x = ln e β x = β x In the second step, we divided the numerator and the denominator by (1 + e β x ) 1. The ratio is called the odds. P(Y = 1 x) P(Y = 1 x) = 1 P(Y = 1 x) P(Y = 0 x) Ad Feelders ( Universiteit Utrecht ) Data Mining 10 / 50

11 Linear Separation Assign to class 1 if P(Y = 1 x) > P(Y = 0 x), i.e. if P(Y = 1 x) P(Y = 0 x) > 1 This is true if ln { } P(Y = 1 x) > 0 P(Y = 0 x) So assign to class 1 if β x > 0, and to class 0 otherwise. Ad Feelders ( Universiteit Utrecht ) Data Mining 11 / 50

12 Decision Boundary x 2 Class 1 β 0 + β 1 x 1 + β 2 x 2 = 0 Class 0 x 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 12 / 50

13 Maximum Likelihood Estimation Y = 1 if heads, Y = 0 if tails. p = P(Y = 1). One coin flip P(y) = p y (1 p) 1 y Note that P(1) = p, P(0) = 1 p as required. Sequence of n independent coin flips P(y 1, y 2,..., y n ) = n p y i (1 p) 1 y i which defines the likelihood function when viewed as a function of p. i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 13 / 50

14 Maximum Likelihood Estimation In a sequence of 10 coin flips we observe y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0). The corresponding likelihood function is P(y p) = p (1 p) p p (1 p) p p p p (1 p) = p 7 (1 p) 3 The corresponding log-likelihood function is ln P(y p) = ln(p 7 (1 p) 3 ) = 7 ln p + 3 ln(1 p) Ad Feelders ( Universiteit Utrecht ) Data Mining 14 / 50

15 Computing the maximum To determine the maximum we take the derivative and equate it to zero d ln P(y p) dp = 7 p 3 1 p = 0 which yields maximum likelihood estimate ˆp = 0.7. This is just the relative frequency of heads in the sample. Ad Feelders ( Universiteit Utrecht ) Data Mining 15 / 50

16 Log-likelihood function for y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0) loglikelihood Ad Feelders ( Universiteit Utrecht ) Data Mining mu 16 / 50

17 ML estimation for logistic regression Logistic regression is a bit like the coin tossing example, except that now the probability of success p i depends on x i : p i = P(Y = 1 x i ) = (1 + e β x i ) 1 1 p i = P(Y = 0 x i ) = (1 + e β x i ) 1 we can represent its probability distribution as follows P(y i ) = p y i i (1 p i ) 1 y i y i {0, 1}; i = 1,..., n Ad Feelders ( Universiteit Utrecht ) Data Mining 17 / 50

18 ML estimation for logistic regression Example i x i y i P(y i ) (1 + e β 0+8β 1 ) (1 + e β 0+12β 1 ) (1 + e β 0 15β 1 ) (1 + e β 0 10β 1 ) 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 18 / 50

19 LR: likelihood function Since the y i observations are independent: P(y β) = Or, taking the natural log: ln P(y β) = ln = n P(y i ) = i=1 n i=1 n i=1 p y i i (1 p i ) 1 y i p y i i (1 p i ) 1 y i (1) n {y i ln p i + (1 y i ) ln(1 p i )} i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 19 / 50

20 LR: log-likelihood function For the logistic regression model we have so filling in gives l(β) = n i=1 p i = (1 + e β x i ) 1 1 p i = (1 + e β x i ) 1, { ( 1 y i ln 1 + e β x i Non-linear function of the parameters. ) ( + (1 y i ) ln e β x i No closed form solution (no nice formulas for the parameter estimates). )} Likelihood function globally concave so relatively easy optimization problem. Ad Feelders ( Universiteit Utrecht ) Data Mining 20 / 50

21 First-order conditions l(β) = n {y i ln p i + (1 y i ) ln(1 p i )} i=1 g(β j ) = l(β) β j = n i=1 Filling in (2) in equation (1) gives: y i p i p i β j + 1 y i 1 p i (1 p i) β j (1) p i β j = p i (1 p i )x ij (2) g(β j ) = n (y i p i )x ij i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 21 / 50

22 First-order conditions First order condition for maximum: g(β j ) = For the intercept β 0 we get n (y i p i )x ij = 0, j = 0,..., m. i=1 l(β) β 0 = n (y i p i ) = 0, i=1 since x i0 1. So in the optimal solution we have n y i = i=1 The expected number of cases with Y = 1 is equal to the observed number of cases with Y = 1. n i=1 p i Ad Feelders ( Universiteit Utrecht ) Data Mining 22 / 50

23 First-order conditions First order condition for maximum: Likewise, for β j we get g(β j ) = l(β) β j = So in the optimal solution we have Interpretation? n (y i p i )x ij = 0 i=1 n (y i p i )x ij = 0. i=1 n y i x ij = i=1 n p i x ij i=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 23 / 50

24 Optimization by Gradient Ascent The gradient points in the direction of the steepest ascent of the function. We can perform optimization by the method of gradient ascent as follows. The new estimate of β j based on processing the i-th observation is: where β (t+1) j = β (t) j + η g(β j ) β=β (t)= β (t) j p (t) i = (1 + e β(t) x i ) 1 + η (y i p (t) i )x ij, is the current estimate of P(Y = 1 x i ), that is, using β (t). Ad Feelders ( Universiteit Utrecht ) Data Mining 24 / 50

25 Optimization by Gradient Ascent The basic gradient-ascent algorithm applied to logistic regression: 1 choose an initial value β (0) (e.g. at random); t 0 2 determine the gradient l(β) β j β=β (t) of l(β) at β (t) and update β (t+1) j = β (t) j + η n i=1 3 Repeat the previous step until (y i p (t) i )x ij, j = 0,..., m g(β j ) β=β (t)= 0 for all j = 0,..., m and check if a (local) maximum has been reached. η > 0 is the step size. Ad Feelders ( Universiteit Utrecht ) Data Mining 25 / 50

26 Fitted Response Function Substitute maximum likelihood estimates into the response function to obtain the fitted response function ˆP(Y = 1 x) = e ˆβ x 1 + e ˆβ x Ad Feelders ( Universiteit Utrecht ) Data Mining 26 / 50

27 Example: Programming Assignment Model the probability of successfully completing a programming assignment. Explanatory variable: programming experience. We find ˆβ 0 = and ˆβ 1 = , so ˆP(Y = 1 x) = 14 months of programming experience: e x 1 + e x ˆP(Y = 1 x = 14) = e (14) e (14) Ad Feelders ( Universiteit Utrecht ) Data Mining 27 / 50

28 Interpretation We have { } ˆP(Y = 1 x) ln = x, ˆP(Y = 0 x) so with every additional month of programming experience, the log odds increase with The odds are multiplied by e so with every additional month of programming experience, the odds increase with 17.5%. Note that the effect of an increase in x on the probability of success depends on the value of x: An increase from 14 to 24 months of programming experience leads to an increase of the probability of success from 0.31 to An increase from 34 to 44 months of programming experience leads to an increase of the probability of success from 0.92 to Ad Feelders ( Universiteit Utrecht ) Data Mining 28 / 50

29 Example: Programming Assignment month.exp success fitted month.exp success fitted Ad Feelders ( Universiteit Utrecht ) Data Mining 29 / 50

30 Allocation Rule Probability of the classes is equal when Solving for x we get x Allocation Rule: x 19: predict y = 1 x < 19: predict y = x = 0 If a person has 19 months or more programming experience, predict success, otherwise predict failure. Ad Feelders ( Universiteit Utrecht ) Data Mining 30 / 50

31 Programming Assignment: Confusion Matrix Cross table of observed and predicted class label: Row: observed, Column: predicted Error rate: 6/25= Default (predict majority class): 11/25=0.44 Ad Feelders ( Universiteit Utrecht ) Data Mining 31 / 50

32 How to in R > prog.logreg <- glm(success month.exp, data=prog.dat, family=binomial) > summary(prog.logreg) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) * month.exp * --- Number of Fisher Scoring iterations: 4 > table(prog.dat$success, as.numeric(prog.logreg$fitted > 0.5)) Ad Feelders ( Universiteit Utrecht ) Data Mining 32 / 50

33 Non-binary classes in logistic regression Recall the logistic regression model assumption for binary class variable Y {0, 1}: P(Y = 1 x) = exp(β x) 1 + exp(β x) from which it follows that P(Y = 0 x) = exp(β x) since P(Y = 1 x) + P(Y = 0 x) = 1 Ad Feelders ( Universiteit Utrecht ) Data Mining 33 / 50

34 Non-binary classes in logistic regression We can generalize this model to non-binary class variable Y {0, 1,..., K 1} (where K is the number of classes) as follows P(Y = k x) = exp(β k x) K 1 j=0 exp(β j x) where we now have a weight vector β k for each class. This is called the multinomial logit model or multi-class logistic regression. Ad Feelders ( Universiteit Utrecht ) Data Mining 34 / 50

35 Multi-class logistic regression We can arrive at this model in the following steps: 1 Assume that P(Y = k x) is a function of the linear combination β k x. 2 To ensure that the probabilities are non-negative, take the exponential exp(β k x). 3 To make sure the probabilities sum to 1, we divide exp(β k x) by K 1 j=0 exp(β j x): P(Y = k x) = exp(β k x) K 1 j=0 exp(β j x) Ad Feelders ( Universiteit Utrecht ) Data Mining 35 / 50

36 Identification Restriction Note that exp((β k + d) x) K 1 j=0 exp((β j + d) x) = exp(d x) exp(βk exp(d x) x) K 1 j=0 exp(β j x) so adding a vector d to each of the vectors β j, j = 0,..., K 1 would yield the same fitted probabilities. To identify the model, we put β 0 = 0. Verify that binary logistic regression is a special case of the multinomial logit model, with K = 2, since exp(β0 x) = exp(0) = 1. Ad Feelders ( Universiteit Utrecht ) Data Mining 36 / 50

37 Interpretation We have ln { } P(Y = k x) = (β k β l ) x, P(Y = l x) which allows us to interpret β kj β lj as follows: for a unit increase in x j, the log-odds of class k versus class l is expected to change by β kj β lj units, holding all the other variables constant. Since β 0 = 0, β kj is the effect of x j on the log-odds of class k relative to class 0: for a unit increase in x j, the log-odds of class k versus class 0 is expected to change by β kj units, holding all the other variables constant. Ad Feelders ( Universiteit Utrecht ) Data Mining 37 / 50

38 Regularization If we have a large number of predictors, even a linear model estimated with maximum likelihood can be prone to overfitting. This can be controlled by punishing large (positive or negative) weights. The coefficient estimates are shrunken towards zero. Add a penalty term for the size of the coefficients to the objective function. With LASSO penalty: E(β) = l(β) + λ m β j, where E(β) is the new error function that we want to minimize, and l(β) is the negative log-likelihood function. The value of λ is usually selected by cross-validation. j=1 Ad Feelders ( Universiteit Utrecht ) Data Mining 38 / 50

39 Application of Logistic Regression to Movie Reviews # logistic regression with lasso penalty > reviews.glmnet <- cv.glmnet(as.matrix(train.dtm),labels[index.train], family="binomial",type.measure="class") > plot(reviews.glmnet) > coef(reviews.glmnet,s="lambda.1se") 309 x 1 sparse Matrix of class "dgcmatrix" 1 bad beautiful best better boring excellent fun funny. minutes perfect poor script stupid supposed terrible wonderful worst Ad Feelders ( Universiteit Utrecht ) Data Mining 39 / 50

40 Cross-Validation on lambda Misclassification Error log(lambda) Ad Feelders ( Universiteit Utrecht ) Data Mining 40 / 50

41 Application of Logistic Regression to Movie Reviews # make predictions on the test set > reviews.logreg.pred <- predict(reviews.glmnet, newx=as.matrix(test.dtm),s="lambda.1se",type="class") # show confusion matrix > table(reviews.logreg.pred,labels[-index.train]) reviews.logreg.pred # compute accuracy: about 81% correct > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 41 / 50

42 Using tf-idf weights Currently, we use term frequency as feature value. In information retrieval other term weighting schemes are popular. For example: tf-idf(i, j) = tf(i, j) idf(i), where tf-idf(i, j) is the number of times term i occurs in document j, and N doc idf(i) = log {d : w i d} There are many variations on this theme. Should work well in finding relevant documents to queries, but not obvious why it should work in document classification. Ad Feelders ( Universiteit Utrecht ) Data Mining 42 / 50

43 Logistic Regression: using tf-idf weights # construct training document term matrix with tf-idf weights > train2.dtm <- DocumentTermMatrix(reviews.all[index.train], control=list(weighting=weighttfidf)) # remove sparse terms > train2.dtm <- removesparseterms(train2.dtm,0.95) # perform logistic regression with lasso penalty > reviews2.glmnet <- cv.glmnet(as.matrix(train2.dtm),labels[index.train], family="binomial",type.measure="class") # compute tf-idf scores on test set: we use the document frequency # from the training set! > train3.dtm <- as.matrix(train.dtm) # convert term frequency counts to binary indicator > train3.dtm <- matrix(as.numeric(train3.dtm > 0),nrow=16000,ncol=308) Ad Feelders ( Universiteit Utrecht ) Data Mining 43 / 50

44 Logistic Regression: using tf-idf weights # sum the columns of the training set > train3.idf <- apply(train3.dtm,2,sum) # compute idf for each term (column) > train3.idf <- log2(16000/train3.idf) # term frequencies on the test set > test3.dtm <- as.matrix(test.dtm) # compute tf-idf weights on the test set > for(i in 1:308){test3.dtm[,i] <- test3.dtm[,i]*train3.idf[i]} Ad Feelders ( Universiteit Utrecht ) Data Mining 44 / 50

45 Logistic Regression: using tf-idf weights # make predictions on the test set using lambda=lambda.1se > reviews.logreg2.pred <- predict(reviews2.glmnet, newx=test3.dtm,s="lambda.1se",type="class") # show confusion matrix > table(reviews.logreg2.pred,labels[-index.train]) reviews.logreg2.pred # compute accuracy: still about 81% correct > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 45 / 50

46 Including Bigrams The bigrams in are: the spy who loved me the spy spy who who loved loved me but not for example spy loved The two words need to be next to each other. Ad Feelders ( Universiteit Utrecht ) Data Mining 46 / 50

47 Including Bigrams # extract bigrams > train.dtm2 <- DocumentTermMatrix(reviews.all[index.train], control = list(tokenize = BigramTokenizer)) # more than one million bigrams! > dim(train.dtm2) [1] > train.dtm2 <- removesparseterms(train.dtm2,0.99) # after removing sparse bigrams only 152 left > dim(train.dtm2) [1] > train.dat1 <- as.matrix(train.dtm) > train.dat2 <- as.matrix(train.dtm2) # combine unigrams and bigrams > train.dat <- cbind(train.dat1,train.dat2) > dim(train.dat) [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 47 / 50

48 Including Bigrams # fit regularized logistic regression model # use cross-validation to evaluate different lambda values > reviews3.glmnet <- cv.glmnet(train.dat,labels[index.train], family="binomial",type.measure="class") # show coefficient estimates for lambda-1se # (only a selection of the bigram coefficients is shown here) > coef(reviews3.glmnet,s="lambda.1se") bad movie black white cant believe character development dont want end movie even though felt like first time good film great movie highly recommend just plain make sense much better must see one best one worst read book really good worst movie Ad Feelders ( Universiteit Utrecht ) Data Mining 48 / 50

49 Cross-Validation on lambda Misclassification Error log(lambda) Ad Feelders ( Universiteit Utrecht ) Data Mining 49 / 50

50 Including Bigrams # create document term matrix for the test data, # using the training dictionary > test3.dtm <- DocumentTermMatrix(reviews.all[-index.train], list(dictionary=dimnames(train.dat)[[2]])) # convert to ordinary matrix > test3.dat <- as.matrix(test3.dtm) # get columns in the same order as on the training set > test3.dat <- test3.dat[,dimnames(train.dat)[[2]]] # make predictions using lambda.1se > reviews.logreg3.pred <- predict(reviews3.glmnet,newx=test3.dat, s="lambda.1se",type="class") > table(reviews.logreg3.pred,labels[-index.train]) reviews.logreg3.pred # accuracy does not improve > ( )/9000 [1] Ad Feelders ( Universiteit Utrecht ) Data Mining 50 / 50

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The