Natural Language Processing (CSEP 517): Text Classification

Size: px

Start display at page:

Download "Natural Language Processing (CSEP 517): Text Classification"

Maude Sparks
5 years ago
Views:

1 Natural Language Processing (CSEP 517): Text Classification Noah Smith c 2017 University of Washington nasmith@cs.washington.edu April 10, / 71

2 To-Do List Online quiz: due Sunday Read: Jurafsky and Martin (2016b); Collins (2011); Jurafsky and Martin (2016a) A2 due April 23 (Sunday) Final exam will be on Canvas (posted 5/29, due 6/4) 2 / 71

3 Text Classification Input: a piece of text x V, usually a document (r.v. X) Output: a label from a finite set L (r.v. L) Standard line of attack: 1. Human experts label some data. 2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier classify : V L 3. Apply classify to as much data as you want! Note: we assume the texts are segmented already, even the new ones. 3 / 71

4 Text Classification: Examples Library-like subjects (e.g., the Dewey decimal system) News stories: politics vs. sports vs. business vs. technology... Reviews of films, restaurants, products: postive vs. negative Author attributes: identity, political stance, gender, age,... , arxiv submissions, etc.: spam vs. not What is the reading level of a piece of text? How influential will a scientific paper be? Will a piece of proposed legislation pass? Closely related: relevance to a query. 4 / 71

5 Evaluation Accuracy: A(classify) = p(classify(x) = L) = { 1 if classify(x) = l p(x = x, L = l) 0 otherwise x V,l L = p(x = x, L = l) 1 {classify(x) = l} x V,l L where p is the true distribution over data. Error is 1 A. This is estimated using a test dataset x 1, l 1,... x m, l m : Â(classify) = 1 m m 1 { classify( x i ) = l } i i=1 5 / 71

6 Issues with Test-Set Accuracy 6 / 71

7 Issues with Test-Set Accuracy Class imbalance: if p(l = not spam) = 0.99, then you can get Â 0.99 by always guessing not spam. 7 / 71

8 Issues with Test-Set Accuracy Class imbalance: if p(l = not spam) = 0.99, then you can get Â 0.99 by always guessing not spam. Relative importance of classes or cost of error types. 8 / 71

9 Issues with Test-Set Accuracy Class imbalance: if p(l = not spam) = 0.99, then you can get Â 0.99 by always guessing not spam. Relative importance of classes or cost of error types. Variance due to the test data. 9 / 71

10 Evaluation in the Two-Class Case Suppose we have two classes, and one of them, t L is a target. E.g., given a query, find relevant documents. Precision and recall encode the goals of returning a pure set of targeted instances and capturing all of them. A C B actually in the target class; L = t correctly labeled as t believed to be in the target class; classify(x) = t ˆP(classify) = C A B = B B ˆR(classify) = C A B = A A ˆP ˆF 1 (classify) = 2 ˆR ˆP + ˆR 10 / 71

11 Another View: Contingency Table L = t L t classify(x) = t C (true positives) B \ C (false positives) B classify(x) t A \ C (false negatives) (true negatives) A 11 / 71

12 Evaluation with > 2 Classes Macroaveraged precision and recall: let each class be the target and report the average ˆP and ˆR across all classes. Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, calculate ˆP and ˆR from that. 12 / 71

13 Cross-Validation Remember that Â, ˆP, ˆR, and ˆF 1 are all estimates of the classifier s quality under the true data distribution. Estimates are noisy! K-fold cross-validation: Partition the training set into K non-overlapping folds x 1,..., x K. For i {1,..., K}: Train on x 1:n \ x i, using x i as development data. Estimate quality on the ith development set: Â i Report the average: Â = 1 K K i=1 Â i and perhaps also the standard error. 13 / 71

14 Statistical Significance Suppose we have two classifiers, classify 1 and classify / 71

15 Statistical Significance Suppose we have two classifiers, classify 1 and classify 2. Is classify 1 better? The null hypothesis, denoted H 0, is that it isn t. But if Â 1 Â2, we are tempted to believe otherwise. 15 / 71

16 Statistical Significance Suppose we have two classifiers, classify 1 and classify 2. Is classify 1 better? The null hypothesis, denoted H 0, is that it isn t. But if Â 1 Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H 0? 16 / 71

17 Statistical Significance Suppose we have two classifiers, classify 1 and classify 2. Is classify 1 better? The null hypothesis, denoted H 0, is that it isn t. But if Â 1 Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H 0? Frequentist view: how (im)probable is the observed difference, given H 0 = true? 17 / 71

18 Statistical Significance Suppose we have two classifiers, classify 1 and classify 2. Is classify 1 better? The null hypothesis, denoted H 0, is that it isn t. But if Â 1 Â2, we are tempted to believe otherwise. How much larger must Â1 be than Â2 to reject H 0? Frequentist view: how (im)probable is the observed difference, given H 0 = true? Caution: statistical significance is neither necessary nor sufficient for research significance or practical usefulness! 18 / 71

19 A Hypothesis Test for Text Classifiers McNemar (1947) 1. The null hypothesis: A 1 = A 2 2. Pick significance level α, an acceptably high probability of incorrectly rejecting H Calculate the test statistic, k (explained in the next slide). 4. Calculate the probability of a more extreme value of k, assuming H 0 is true; this is the p-value. 5. Reject the null hypothesis if the p-value is less than α. The p-value is p(this observation H 0 is true), not the other way around! 19 / 71

20 McNemar s Test: Details Assumptions: independent (test) samples and binary measurements. Count test set error patterns: classify 1 is incorrect classify 1 is correct classify 2 is incorrect c 00 c 10 classify 2 is correct c 01 c 11 m Â2 m Â1 If A 1 = A 2, then c 01 and c 10 are each distributed according to Binomial(c 01 + c 10, 1 2 ). test statistic k = min{c 01, c 10 } 1 k ( ) c01 + c 10 p-value = 2 c 01+c 10 1 j j=0 20 / 71

21 Other Tests Different tests make different assumptions. Sometimes we calculate an interval that would be unsurprising under H 0 and test whether a test statistic falls in that interval (e.g., t-test and Wald test). In many cases, there is no closed form for estimating p-values, so we use random approximations (e.g., permutation test and paired bootstrap test). If you do lots of tests, you need to correct for that! Read lots more in Smith (2011), appendix B. 21 / 71

22 Features in Text Classification Running example: x = The vodka was great, but don t touch the hamburgers. A different representation of the text sequence r.v. X: feature r.v.s. For j {1,..., d}, let F j be a discrete random variable taking a value in F j. 22 / 71

23 Features in Text Classification Running example: x = The vodka was great, but don t touch the hamburgers. A different representation of the text sequence r.v. X: feature r.v.s. For j {1,..., d}, let F j be a discrete random variable taking a value in F j. Often, these are term (word and perhaps n-gram) frequencies. E.g., f hamburgers (x) = 1, f the (x) = 2, f delicious (x) = 0, f don t touch (x) = / 71

24 Features in Text Classification Running example: x = The vodka was great, but don t touch the hamburgers. A different representation of the text sequence r.v. X: feature r.v.s. For j {1,..., d}, let F j be a discrete random variable taking a value in F j. Often, these are term (word and perhaps n-gram) frequencies. E.g., f hamburgers (x) = 1, f the (x) = 2, f delicious (x) = 0, f don t touch (x) = 1. Can also be word presence features. E.g., f hamburgers (x) = 1, f the (x) = 1, f delicious (x) = 0, f don t touch (x) = / 71

25 Features in Text Classification Running example: x = The vodka was great, but don t touch the hamburgers. A different representation of the text sequence r.v. X: feature r.v.s. For j {1,..., d}, let F j be a discrete random variable taking a value in F j. Often, these are term (word and perhaps n-gram) frequencies. E.g., f hamburgers (x) = 1, f the (x) = 2, f delicious (x) = 0, f don t touch (x) = 1. Can also be word presence features. E.g., f hamburgers (x) = 1, f the (x) = 1, f delicious (x) = 0, f don t touch (x) = 1. Transformations on word frequencies: logarithm, idf weighting v V, idf(v) = log n i : c xi (v) > 0 25 / 71

26 Features in Text Classification Running example: x = The vodka was great, but don t touch the hamburgers. A different representation of the text sequence r.v. X: feature r.v.s. For j {1,..., d}, let F j be a discrete random variable taking a value in F j. Often, these are term (word and perhaps n-gram) frequencies. E.g., f hamburgers (x) = 1, f the (x) = 2, f delicious (x) = 0, f don t touch (x) = 1. Can also be word presence features. E.g., f hamburgers (x) = 1, f the (x) = 1, f delicious (x) = 0, f don t touch (x) = 1. Transformations on word frequencies: logarithm, idf weighting Disjunctions of terms Clusters Task-specific lexicons v V, idf(v) = log n i : c xi (v) > 0 26 / 71

27 Probabilistic Classification Classification rule: classify(f) = argmax p(l f) l L = argmax l L p(l, f) p(f) = argmax p(l, f) l L 27 / 71

28 Naïve Bayes Classifier Parameters: p(l = l, F 1 = f 1,..., F d = f d ) = p(l) = π l d j=1 d p(f j = f j l) j=1 θ fj j,l π is the class prior (it sums to one) For each feature function j and label l, a distribution over values θ j,l (sums to one for every j, l pair) The bag of words version of naïve Bayes: F j = X j x p(l, x) = π l j=1 θ xj l 28 / 71

29 Naïve Bayes: Remarks Estimation by (smoothed) relative frequency estimation: easy! 29 / 71

30 Naïve Bayes: Remarks Estimation by (smoothed) relative frequency estimation: easy! For continuous or integer-valued features, use different distributions. 30 / 71

31 Naïve Bayes: Remarks Estimation by (smoothed) relative frequency estimation: easy! For continuous or integer-valued features, use different distributions. The bag of words version equates to building a conditional language model for each label. 31 / 71

32 Naïve Bayes: Remarks Estimation by (smoothed) relative frequency estimation: easy! For continuous or integer-valued features, use different distributions. The bag of words version equates to building a conditional language model for each label. The Collins reading assumes a binary version, with F v indicating whether v V occurs in x. 32 / 71

33 Generative vs. Discriminative Classification Naïve Bayes is the prototypical generative classifier. It describes a probabilistic process generative story for X and L. But why model X? It s always observed? Discriminative models instead: seek to optimize a performance measure, like accuracy, or a computationally convenient surrogate; do not worry about p(x); tend to perform better when you have reasonable amounts of data. 33 / 71

34 Discriminative Text Classifiers Multinomial logistic regression (also known as max ent and log-linear ) Support vector machines Neural networks Decision trees I ll briefly touch on three ways to train a classifier with a linear decision rule. 34 / 71

35 Linear Models for Classification Linear decision rule: where φ : V L R d. ˆl = argmax w φ(x, l) l L Parameters: w R d What does this remind you of? 35 / 71

36 Linear Models for Classification Linear decision rule: where φ : V L R d. Parameters: w R d What does this remind you of? Some notational variants define: w l for each l L ˆl = argmax w φ(x, l) l L φ : V R d (similar to what we had for naïve Bayes) 36 / 71

37 The Geometric View of Linear Classifiers Suppose we have instance x, L = {y 1, y 2, y 3, y 4 }, and there are only two features, φ 1 and φ 2. ϕ 2 (x, y 3 ) (x, y 1 ) (x, y 2 ) (x, y 4 ) ϕ 1 As a simple example, let the two features be binary functions. 37 / 71

38 The Geometric View of Linear Classifiers Suppose we have instance x, L = {y 1, y 2, y 3, y 4 }, and there are only two features, φ 1 and φ 2. ϕ 2 (x, y 3 ) (x, y 1 ) (x, y 2 ) (x, y 4 ) ϕ 1 w φ = w 1 φ 1 + w 2 φ 2 = 0 38 / 71

39 The Geometric View of Linear Classifiers Suppose we have instance x, L = {y 1, y 2, y 3, y 4 }, and there are only two features, φ 1 and φ 2. ϕ 2 (x, y 3 ) (x, y 1 ) (x, y 2 ) (x, y 4 ) ϕ 1 distance(w φ = 0, φ 0 ) = w φ 0 w 2 w φ 0 39 / 71

40 The Geometric View of Linear Classifiers Suppose we have instance x, L = {y 1, y 2, y 3, y 4 }, and there are only two features, φ 1 and φ 2. ϕ 2 (x, y 3 ) (x, y 1 ) (x, y 2 ) (x, y 4 ) ϕ 1 w φ(x, y 1 ) > w φ(x, y 3 ) > w φ(x, y 4 ) > 0 w φ(x, y 2 ) 40 / 71

41 The Geometric View of Linear Classifiers Suppose we have instance x, L = {y 1, y 2, y 3, y 4 }, and there are only two features, φ 1 and φ 2. ϕ 2 (x, y 3 ) (x, y 1 ) (x, y 2 ) (x, y 4 ) ϕ 1 41 / 71

42 MLE for Multinomial Logistic Regression When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be: p(l = l x) = exp w φ(x, l) l L exp w φ(x, l ) 42 / 71

43 MLE for Multinomial Logistic Regression When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be: p(l = l x) = exp w φ(x, l) l L exp w φ(x, l ) MLE can be rewritten as a maximization problem: ŵ = argmax w n w φ(x i, l i ) log exp w φ(x }{{} i, l ) l hope L }{{} fear i=1 43 / 71

44 MLE for Multinomial Logistic Regression When we discussed log-linear language models, we transformed the score into a probability distribution. Here, that would be: exp w φ(x, l) p(l = l x) = l L exp w φ(x, l ) MLE can be rewritten as a maximization problem: ŵ = argmax w Recall from language models: Be wise and regularize! n w φ(x i, l i ) log exp w φ(x }{{} i, l ) l hope L }{{} fear i=1 Solve with batch or stochastic gradient methods. w j has an interpretation. 44 / 71

45 Log Loss for (x, l) Another view is to minimize the negated log-likelihood, which is known as log loss : fear {( }}{ log l L exp w φ(x, l ) ) hope {}}{ w φ(x, l) In the binary case, where the x-axis is the difference in scores between correct and incorrect labels: loss score difference In blue is the log loss; in red is the zero-one loss (error). 45 / 71

46 Log Sum Exp Below, y-axis plots the log exp part of the objective function (with two labels), against x, assuming the other score is one of {8, 0, 8} log(e x + e 8 ), log(e x + e 0 ), log(e x + e 8 ) 46 / 71

47 Hard Maximum Why not use a hard max instead? max(x, 8), max(x, 0), max(x, 8) 47 / 71

48 Hinge Loss for (x, l) ( ) max w φ(x, l L l ) w φ(x, l) In the binary case: loss score difference In purple is the hinge loss, in blue is the log loss; in red is the zero-one loss (error). 48 / 71

49 Minimizing Hinge Loss: Perceptron fear {( }} ){ hope {}}{ max w φ(x, l L l ) w φ(x, l) 49 / 71

50 Minimizing Hinge Loss: Perceptron fear {( }} ){ hope {}}{ max w φ(x, l L l ) w φ(x, l) When two labels are tied, the function is not differentiable. 50 / 71

51 Minimizing Hinge Loss: Perceptron fear {( }} ){ hope {}}{ max w φ(x, l L l ) w φ(x, l) When two labels are tied, the function is not differentiable. But it s still sub-differentiable. Solution: (stochastic) subgradient descent! 51 / 71

52 Log Loss and Hinge Loss for (x, l) In the binary case: log loss: hinge loss: ( log l L exp w φ(x, l ) ) w φ(x, l) ( ) max w φ(x, l L l ) w φ(x, l) loss score difference In purple is the hinge loss, in blue is the log loss; in red is the zero-one loss (error). 52 / 71

53 Minimizing Hinge Loss: Perceptron min w n i=1 ( ) max w φ(x i, l ) w φ(x i, l i ) l L Stochastic subgradient descent on the above is called the perceptron algorithm. For t {1,..., T }: Pick i t uniformly at random from {1,..., n}. ˆlit argmax ( l L w φ(x it, l) w w α φ(x it, ˆl ) it ) φ(x it, l it ) 53 / 71

54 Error Costs Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. 54 / 71

55 Error Costs Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. Let cost(l, l ) quantify the badness of substituting l for correct label l. 55 / 71

56 Error Costs Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. Let cost(l, l ) quantify the badness of substituting l for correct label l. Intuition: estimate the scoring function so that score(l i ) score(ˆl) cost(l i, ˆl) 56 / 71

57 General Hinge Loss for (x, l) ( ) max w φ(x, l L l ) + cost(l, l ) w φ(x, l) In the binary case, with cost( 1, 1) = 1: In blue is the general hinge loss; in red is the zero-one loss (error). 57 / 71

58 General Remarks Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance. 58 / 71

59 General Remarks Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance. Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization. You should have a basic understanding of the tradeoffs in choosing among them. 59 / 71

60 General Remarks Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance. Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization. You should have a basic understanding of the tradeoffs in choosing among them. Rumor: random forests are widely used in industry when performance matters more than interpretability. 60 / 71

61 General Remarks Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance. Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization. You should have a basic understanding of the tradeoffs in choosing among them. Rumor: random forests are widely used in industry when performance matters more than interpretability. Lots of papers about neural networks, though with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015). 61 / 71

62 General Remarks Text classification: many problems, all solved with supervised learners. Lexicon features can provide problem-specific guidance. Naïve Bayes, log-linear, and linear SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization. You should have a basic understanding of the tradeoffs in choosing among them. Rumor: random forests are widely used in industry when performance matters more than interpretability. Lots of papers about neural networks, though with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015). Lots of work on feature design; most of the more sophisticated analysis methods we cover starting next week can provide features for text categorization! 62 / 71

63 References I Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm, URL Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5): , Daniel Jurafsky and James H. Martin. Logistic regression (draft chapter), 2016a. URL Daniel Jurafsky and James H. Martin. Naive Bayes and sentiment classification (draft chapter), 2016b. URL Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2): , Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, URL Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, URL 63 / 71

64 Extras 64 / 71

65 (Linear) Support Vector Machines A different motivation for the generalized hinge: ŵ = n α i,l φ(x i, l) i=1 l L where most only a small number of α i,l are nonzero. 65 / 71

66 (Linear) Support Vector Machines A different motivation for the generalized hinge: ŵ = n α i,l φ(x i, l) i=1 l L where most only a small number of α i,l are nonzero. Those φ(x i, l) are called support vectors because they support the decision boundary. ŵ φ(x, l ) = α i,l φ(x i, l) φ(x, l ) (i,l) S See Crammer and Singer (2001) for the multiclass version. 66 / 71

67 (Linear) Support Vector Machines A different motivation for the generalized hinge: ŵ = n α i,l φ(x i, l) i=1 l L where most only a small number of α i,l are nonzero. Those φ(x i, l) are called support vectors because they support the decision boundary. ŵ φ(x, l ) = α i,l φ(x i, l) φ(x, l ) (i,l) S See Crammer and Singer (2001) for the multiclass version. Really good tool: SVM light, 67 / 71

68 Support Vector Machines: Remarks Regularization is critical; squared l 2 is most common, and often used in (yet another) motivation around the idea of maximizing margin around the hyperplane separator. 68 / 71

69 Support Vector Machines: Remarks Regularization is critical; squared l 2 is most common, and often used in (yet another) motivation around the idea of maximizing margin around the hyperplane separator. Often, instead of linear models that explicitly calculate w φ, these methods are kernelized and rearrange all calculations to involve inner-products between φ vectors. Example: K linear (v, w) = v w K polynomial (v, w) = (v w + 1) p Linear kernels are most common in NLP. K Gaussian (v, w) = exp v w 2 2 2σ 2 69 / 71

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes