Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Outline Vector space classification Rocchio Linear classifiers knn 2

Ch. 13 Standing queries The path from IR to text classification: You have an information need to monitor, say: Unrest in the Niger delta region You want to rerun an appropriate query periodically to find new news items on this topic You will be sent new documents that are found I.e., it s not ranking but classification (relevant vs. not relevant) Such queries are called standing queries Long used by information professionals A modern mass instantiation is Google Alerts Standing queries are (hand-written) text classifiers

Sec.14.1 Recall: vector space representation Each doc is a vector One component for each term (= word). Terms are axes Usually normalize vectors to unit length. High-dimensional vector space: 10,000+ dimensions, or even 100,000+ Docs are vectors in this space How can we do classification in this space? 4

Sec.14.1 Classification using vector spaces Training set: a set of docs, each labeled with its class (e.g., topic) This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space Premise 1: Docs in the same class form a contiguous regions of space Premise 2: Docs from different classes don t overlap (much) We define surfaces to delineate classes in the space 5

Sec.14.1 Documents in a vector space Government Science Arts 6

Sec.14.1 Test document of what class? Government Science Arts 7

Sec.14.1 Test document of what class? Government Is this similarity hypothesis true in general? Government Science Arts Our main topic today is how to find good separators 8

Relevance feedback relation to classification In relevance feedback, the user marks docs as relevant/non-relevant. Relevant/non-relevant can be viewed as classes or categories. For each doc, the user decides which of these two classes is correct. Relevance feedback is a form of text classification. 9

Sec.14.2 Rocchino for text classification Relevance feedback methods can be adapted for text categorization Relevance feedback can be viewed as 2-class classification Use standard tf-idf weighted vectors to represent text docs For training docs in each category, compute a prototype as centroid of the vectors of the training docs in the category. Prototype = centroid of members of class Assign test docs to the category with the closest prototype vector based on cosine similarity. 10

Sec.14.2 Definition of centroid μ c = 1 D c d Dc d D c : docs that belong to class c d : vector space representation of d. Centroid will in general not be a unit vector even when the inputs are unit vectors. 11

Rocchino algorithm 12

Rocchio: example We will see that Rocchino finds linear boundaries between classes Government Science Arts 13

Sec.14.2 Illustration of Rocchio: text classification 14

Sec.14.2 Rocchio properties Forms a simple generalization of the examples in each class (a prototype). Prototype vector does not need to be normalized. Classification is based on similarity to class prototypes. Does not guarantee classifications are consistent with the given training data. 15

Sec.14.2 Rocchio anomaly Prototype models have problems with polymorphic (disjunctive) categories. 16

Sec.14.2 Rocchio classification: summary Rocchio forms a simple representation for each class: Centroid/prototype Classification is based on similarity to the prototype It does not guarantee that classifications are consistent with the given training data It is little used outside text classification It has been used quite effectively for text classification But in general worse than many other classifiers Rocchio does not handle nonconvex, multimodal classes correctly. 17

Linear classifiers Assumption:The classes are linearly separable. Classification decision: m i=1 w i x i + w 0 > 0? First, we only consider binary classifiers. Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities) decision boundary. Find the parameters w 0, w 1,, w m based on training set. Methods for finding these parameters: Perceptron, Rocchio, 18

Sec.14.4 Separation by hyperplanes A simplifying assumption is linear separability: in 2 dimensions, can separate classes by a line in higher dimensions, need hyperplanes 19

Sec.14.2 Two-class Rocchio as a linear classifier Line or hyperplane defined by: M w 0 + w i d i = w 0 + w T d 0 i=1 For Rocchio, set: w = μ c 1 μ c 2 w 0 = 1 2 μ c 1 2 μ c 2 2 20

Sec.14.4 Linear classifier: example Class: interest (as in interest rate) Example features of a linear classifier w i t i w i t i 0.70 prime 0.67 rate 0.63 interest 0.60 rates 0.46 discount 0.43 bundesbank 0.71 dlrs 0.35 world 0.33 sees 0.25 year 0.24 group 0.24 dlr To classify, find dot product of feature vector and weights 21

Linear classifier: example w 0 = 0 Class interest in Reuters-21578 d 1 : rate discount dlrs world d 2 : prime dlrs w T d 1 = 0.07 d 1 is assigned to the interest class w T d 2 = 0.01 d 2 is not assigned to this class 22

Naïve Bayes as a linear classifier P C 1 M i=1 P t i C 1 tf i,d > P(C 2 ) M M i=1 log P(C 1 ) + tf i,d log P t i C 1 i=1 M > log P(C 2 ) + tf i,d log P t i C 2 i=1 P t i C 2 tf i,d w i = log P t i C 1 P t i C 2 x i = tf i,d w 0 = log P C 1 P C 1 23

Sec.14.4 Linear programming / Perceptron Find a,b,c, such that ax + by > c for red points ax + by < c for blue points 24

Sec.14.4 Which hyperplane? 25 In general, lots of possible solutions for a,b,c.

Sec.14.4 Which hyperplane? Lots of possible solutions for a,b,c. Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness] Which points should influence optimality? All points E.g., Rocchino Only difficult points close to decision boundary E.g., SupportVector Machine (SVM) 26

Sec. 15.1 Support Vector Machine (SVM) SVMs maximize the margin around the separating hyperplane. A.k.a. large margin classifiers Support vectors Solving SVMs is a quadratic programming problem Seen by many as the most successful current text classification method* Maximizes Narrower margin margin *but other discriminative methods often perform very similarly 27

Sec.14.4 Linear classifiers Many common text classifiers are linear classifiers Classifiers more powerful than linear often don t perform better on text problems.why? Despite the similarity of linear classifiers, noticeable performance differences between them For separable problems, there is an infinite number of separating hyperplanes. Different training methods pick different hyperplanes. Also different strategies for non-separable problems 28

Linear classifiers: binary and multiclass classification Sec.14.4 Consider 2 class problems Deciding between two classes, perhaps, government and nongovernment Multi-class How do we define (and find) the separating surface? How do we decide which region a test doc is in? 29

Sec.14.5 More than two classes One-of classification (multi-class classification) Classes are mutually exclusive. Each doc belongs to exactly one class Any-of classification Classes are not mutually exclusive. A doc can belong to 0, 1, or >1 classes. For simplicity, decompose into K binary problems Quite common for docs 30

Sec.14.5 Set of binary classifiers: any of Build a separator between each class and its complementary set (docs from all other classes). Given test doc, evaluate it for membership in each class. Apply decision criterion of classifiers independently It works although considering dependencies between categories may be more accurate 31

Sec.14.5 Multi-class: set of binary classifiers Build a separator between each class and its complementary set (docs from all other classes). Given test doc, evaluate it for membership in each class. Assign doc to class with: maximum score maximum confidence maximum probability??? 32

Sec.14.3 k Nearest Neighbor Classification knn = k Nearest Neighbor To classify a document d: Define k-neighborhood as the k nearest neighbors of d Pick the majority class label in the k- neighborhood 33

Sec.14.3 Nearest-Neighbor (1NN) classifier Learning phase: Just storing the representations of the training examples in D. Does not explicitly compute category prototypes. Testing instance x (under 1NN): Compute similarity between x and all examples in D. Assign x the category of the most similar example in D. Rationale of knn: contiguity hypothesis We expect a test doc d to have the same label as the training docs located in the local region surrounding d. 34

Sec.14.1 Test Document = Science Government Science Arts 35

Sec.14.3 k Nearest Neighbor (knn) classifier 1NN: subject to errors due to A single atypical example. Noise (i.e., an error) in the category label of a single training example. More robust alternative: find the k most-similar examples return the majority category of these k examples. 36

Sec.14.3 knn example: k=6 P(science )? Government Science Arts 38

Sec.14.3 knn decision boundaries Boundaries are in principle arbitrary surfaces (polyhedral) 39 Government Science Arts knn gives locally defined decision boundaries between classes far away points do not influence each classification decision (unlike Rocchio, etc.)

1NN: Voronoi tessellation The decision boundaries between classes are piecewise linear. 40

knn algorithm 41

Time complexity of knn knn test time proportional to the size of the training set! knn is inefficient for very large training sets. 42

Sec.14.3 Similarity metrics Nearest neighbor method depends on a similarity (or distance) metric. Euclidean distance: Simplest for continuous vector space. Hamming distance: Simplest for binary instance space. number of feature values that differ For text, cosine similarity of tf.idf weighted vectors is typically most effective. 43

Sec.14.3 Illustration of knn (k=3) for text vector space 44

3-NN vs. Rocchio Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB. 45

Sec.14.3 Nearest neighbor with inverted index Naively, finding nearest neighbors requires a linear search through D docs in collection Similar to determining the k best retrievals using the test doc as a query to a database of training docs. Use standard vector space inverted index methods to find the k nearest neighbors. Testing Time: O(B V t ) B is the average number of training docs in which at least one word of test-document appears Typically B << D if a large list of stopwords is used. 46

Sec.14.4 A nonlinear problem Linear classifiers do badly on this task knn will do very well (assuming enough training data) 47

Overfitting example 48

Sec.14.3 knn: summary No training phase necessary Actually: We always preprocess the training set, so in reality training time of knn is linear. May be expensive at test time knn is very accurate if training set is large. In most cases it s more accurate than linear classifiers Optimality result: asymptotically zero error if Bayes rate is zero. But knn can be very inaccurate if training set is small. Scales well with large number of classes Don t need to train C classifiers for C classes Classes can influence each other Small changes to one class can have ripple effect 49

Sec.14.6 Choosing the correct model capacity 50

Linear classifiers for doc classification We typically encounter high-dimensional spaces in text applications. With increased dimensionality, the likelihood of linear separability increases rapidly Many of the best-known text classification algorithms are linear. More powerful nonlinear learning methods are more sensitive to noise in the training data. Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases. 51

Which classifier do I use for a given text classification problem? Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between complexity of the classifier and its performance on new data points. Factors to take into account: How much training data is available? How simple/complex is the problem? How noisy is the data? How stable is the problem over time? For an unstable problem, it s better to use a simple and robust classifier. 52

Reuters collection Only about 10 out of 118 categories are large Common categories (#train, #test) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189) Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) 53

Evaluating classification Evaluation must be done on test data that are independent of the training data training and test sets are disjoint. Measures: Precision, recall, F1, accuracy F1 allows us to trade off precision against recall (harmonic mean of P and R). 54

Precision P and recall R predicted to be in the class Predicted not to be in the class actually in the class tp fn actually in the class fp tn Precision P = tp/(tp + fp) Recall R = tp/(tp + fn) 55

Good practice department: Make a confusion matrix Actual Class Sec. 15.2.4 This (i, j) entry means 53 of the docs actually in class i were put in class j by the classifier. Class assigned by classifier c ij 53 In a perfect classification, only the diagonal has non-zero entries Look at common confusions and how they might be addressed 56

Sec. 15.2.4 Per class evaluation measures Recall: Fraction of docs in class i classified correctly: Precision: Fraction of docs assigned class i that are actually about class i: Accuracy: (1 - error rate) Fraction of docs classified correctly: 57 j j i cii c j cii c i c ij ji ii c ij

Averaging: macro vs. micro We now have an evaluation measure (F1) for one class. But we also want a single number that shows aggregate performance over all classes 58

Sec. 15.2.4 Micro- vs. Macro-Averaging If we have more than one class, how do we combine multiple performance measures into one quantity? Macroaveraging: Compute performance for each class, then average. Compute F1 for each of the C classes Average these C numbers Microaveraging: Collect decisions for all classes, aggregate them and then compute measure. Compute TP, FP, FN for each of the C classes Sum these C numbers (e.g., all TP to get aggregate TP) Compute F1 for aggregate TP, FP, FN 59

Sec. 15.2.4 Micro- vs. Macro-Averaging: Example Class 1 Class 2 Micro Ave. Table Truth: yes Truth: no Truth: yes Truth: no Truth: yes Truth: no Classifier: yes 10 10 Classifier: yes 90 10 Classifier: yes 100 20 Classifier: no 10 970 Classifier: no 10 890 Classifier: no 20 1860 Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 Microaveraged precision: 100/120 =.83 Microaveraged score is dominated by score on common classes 60

61 Evaluation measure: F1

Sec. 15.3.1 Amount of data? Little amount of data stick to less powerful classifiers (i.e. linear ones) Naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS) The practical answer is to get more labeled data as soon as you can Reasonable amount of data We can use all our clever classifiers Huge amount of data Expensive methods like SVMs (train time) or knn (test time) are quite impractical Naïve Bayes can come back into its own again! Or other advanced methods with linear training/test complexity With enough data the choice of classifier may not matter much, and the best choice may be unclear 62

Sec. 15.3.2 How many categories? A few (well separated ones)? Easy! A zillion closely related ones? Think:Yahoo! Directory Quickly gets difficult! May need a hybrid automatic/manual solution 63

Yahoo! Hierarchy www.yahoo.com/science (30) agriculture biology physics CS space...... dairy crops botany cell forestry agronomy evolution magnetism relativity......... AI HCI courses craft missions 64

Sec. 15.3.2 How can one tweak performance? Aim to exploit any domain-specific useful features that give special meanings or that zone the data Aim to collapse things that would be treated as different but shouldn t be. E.g., ISBNs, part numbers, chemical formulas Does putting in hacks help? You bet! Feature design and non-linear weighting is very important in the performance of real-world systems 65

Sec. 15.3.2 Upweighting You can get a lot of value by differentially weighting contributions from different document zones. That is, you count as two instances of a word when you see it in,say, the abstract Upweighting title words helps (Cohen & Singer 1996) Doubling the weighting on the title words is a good rule of thumb Upweighting the first sentence of each paragraph helps (Murata, 1999) Upweighting sentences that contain title words helps (Ko et al, 2002) 66

Sec. 15.3.2 Two techniques for zones 1. Have a completely separate set of features/parameters for different zones like the title 2. Use the same features (pooling/tying their parameters) across zones, but upweight the contribution of different zones Commonly the second method is more successful: it costs you nothing in terms of sparsifying the data, but can give a very useful performance boost Which is best is a contingent fact about the data 67

Sec. 15.3.2 Does stemming/lowercasing/ help? As always, it s hard to tell, and empirical evaluation is normally the gold standard. But note that the role of tools like stemming is rather different for TextCat vs. IR: For IR, you want to improve recall For TextCat, with sufficient training data, stemming does no good. It only helps in compensating for data sparseness 68

Ch. 14 Resources IIR, Chapter 14 69