CS 175, Project in Artificial Intelligence Lecture 3: Document Classification


1 CS 175, Project in Artificial Intelligence. Lecture 3: Document Classification. Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine.

2 Announcements
Assignment 1: completed.
Assignment 2: Text Classification. Due by 5pm Wednesday next week.
Project proposals: due 2 weeks from Friday. Will discuss project ideas and proposals over the next 2 weeks.
Today's lecture: possible project topics; document classification.

3 Assignment 2
Use scikit-learn (Python library) to investigate document classification:
- create bag-of-words representations
- generate training and test data sets
- experiment with different classifiers (naïve Bayes, logistic regression, kNN)
Submit the modified assignment2.py file (Wed next week, 5pm).
Please read the assignment and submit any questions via Piazza.
Office hours over the next week:
- Eric: Thursday 1 to 3, and next Tuesday 1 to 3 (new)
- Instructor: Friday 9:30 to 11:30


5 Background Reading: Useful for Project Ideas
Links on the class Web site: tutorial articles; software and demos for text analysis; data sets.
Reference books on text/natural language: Introduction to Information Retrieval; Speech and Language Processing; Mining the Social Web (2nd Edition), by Matthew Russell, O'Reilly Media. (O'Reilly books are available for free online via the UCI Library's subscription to Safari Books Online.)
Reference books on machine learning: Hands-On Machine Learning with Scikit-Learn and TensorFlow; A Course in Machine Learning; Deep Learning.


7 Text Analysis Techniques
Classification: automatically assign a document to 1 or more categories, e.g., is an email spam or non-spam? is a review positive or negative?
Clustering/Topic Discovery: group a set of documents into clusters, discover themes/topics in documents, e.g., automatically group documents in search results.
Word prediction: e.g., predict the next word for typing on a mobile device.
Chatbots and Text Synthesis: automatically generate new text, e.g., in response to human dialog.
Information Extraction: extract mentions of entities from documents, e.g., tag news articles with mentions of companies and products.

8 Figure 5.1 from the NLTK book, showing the results of matching strings to a geographic dictionary. Illustrates clearly why dictionary look-up is not sufficient for entity recognition!

9 Project Topic: Sentiment or Emotion Prediction from Text
Problem: given text from short documents (e.g., Tweets), investigate classification methods to predict sentiment (e.g., positive/negative) or emotions (e.g., anger).
Possible data sources: labeled Tweet data sets (positive/negative or other emotion).
Evaluation: accuracy, precision/recall, etc., on a test data set or using cross-validation.
Comments: add additional aspects so that the project is not too simple; it needs to be interesting.
Additional reading: see links to tutorial articles, data sets, and software demos on the class Web page.

10 Project Topic: Predicting Review Scores from Text
Problem: given text from a review (e.g., product, movie, restaurant), investigate machine learning methods to predict the numerical score of the review (e.g., 1, 2, 3, 4, 5).
Possible data sources: Yelp Challenge data sets, Amazon product review data sets, etc.
Evaluation: accuracy, precision/recall, etc., on a test data set or using cross-validation.
Comments: could start with binary classification: {1 or 2} versus {4 or 5}.
Additional reading: see links to tutorial articles, data sets, and software demos on the class Web page.


12 Histogram of Review Lengths

13 Project Topic: Summarizing Aspects of Review Text
Problem: given text from a set of reviews (e.g., product, movie, restaurant), investigate information extraction methods to automatically extract and summarize sentiment for different aspects of the reviews, e.g., price, service, quality of food, etc.
Possible data sources: Yelp Challenge data sets, Amazon product review data sets, etc.
Evaluation: user studies (is the output of method A better than method B?); evaluation techniques for text summarization, such as BLEU scores.
Comments: evaluation is difficult for a problem like this.
Additional reading: see links to tutorial articles, data sets, and software demos on the class Web page.


15 Project Topic: Chatbot
Problem: given a sentence, generate an appropriate response sentence.
Possible data sources: transcripts of spoken or written dialog (e.g., the Switchboard or Ubuntu corpus).
Evaluation: user studies, e.g., human judges rate responses of algorithm A vs. algorithm B.
Comments: this is a difficult problem to do well on; partial success would be fine.
Additional reading: chapter on chatbots in the Jurafsky text book (class Web site); see also the Proceedings of the Amazon Alexa Prize competition in 2017 (online).


17 Project Topic: Text Generation/Simulation
Problem: given text from a particular author or source, generate new simulated text with the same style as the author or source.
Possible data sources: fiction from different authors, speeches, songs, poetry, movie scripts.
Evaluation: user studies, e.g., human judges the quality of output of algorithm A vs. algorithm B.
Comments: this is easier than a chatbot (more "open loop"). Evaluation is tricky: how do you avoid generating text very similar to the original?
Additional reading: see links to tutorial articles and software demos on the class Web page.

18 Output from a Model Learned on Source Code. Examples from "The Unreasonable Effectiveness of Recurrent Neural Networks," Andrej Karpathy, blog.

19 Output from a Model Learned on Mathematics Papers. Examples from "The Unreasonable Effectiveness of Recurrent Neural Networks," Andrej Karpathy, blog.

20 Important Components of Projects
Clear definition of the problem: think of inputs, outputs, pipelines.
Data: make sure you will be able to get the data you need, e.g., labeled data for classification.
Self-written components: which parts of the code will you write, and what will be existing code?
Evaluation: how will you evaluate the quality of your system? Think of ways to compare version A versus version B.
Run-time: do you want a system/demo that can run in real-time, or one that operates offline? Different design decisions for each.

21 Project Tips: Plan in Stages
Plan your project in stages so that the overall project is not dependent on the riskier elements working. Example:
PHASE 1: Original Documents → Standard Bag of Words → Standard Logistic Regression → Cross-Validation Experiments

22 Project Tips: Plan in Stages
Plan your project in stages so that the overall project is not dependent on the riskier elements working. Example:
PHASE 1: Original Documents → Standard Bag of Words → Standard Logistic Regression → Cross-Validation Experiments
PHASE 2: Bag of Phrases (n-grams)

23 Project Tips: Plan in Stages
Plan your project in stages so that the overall project is not dependent on the riskier elements working. Example:
PHASE 1: Original Documents → Standard Bag of Words → Standard Logistic Regression → Cross-Validation Experiments
PHASE 2: Bag of Phrases (n-grams)
PHASE 3: Deep Neural Network

24 Overview of Text Classification

25 Text Classification
Text classification has many applications: spam detection; classifying news articles, e.g., Google News; classifying Web pages into categories.

26 Text Classification
Text classification has many applications: spam detection; classifying news articles, e.g., Google News; classifying Web pages into categories.
Data representation: bag of words/terms is most commonly used, with either counts or binary values. Can also use other weightings and additional features (e.g., metadata).

27 Text Classification
Text classification has many applications: spam detection; classifying news articles, e.g., Google News; classifying Web pages into categories.
Data representation: bag of words/terms is most commonly used, with either counts or binary values. Can also use other weightings and additional features (e.g., metadata).
Classification methods (see the sketch below):
- Naïve Bayes: widely used baseline; fast and reasonably accurate.
- Logistic regression: widely used in industry; accurate; an excellent baseline.
- Neural networks and deep learning: can be very accurate, but can require very large amounts of labeled training data and are more complex than the other methods.
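As a concrete illustration of the two baseline methods, here is a minimal, hedged sketch using scikit-learn; the four-document corpus and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Toy corpus (illustrative only): two "finance" and two "sports" documents.
docs = ["stocks fall as finance outlook dims",
        "team wins with a late goal",
        "analysts predict stocks will rally",
        "score tied until the final goal"]
labels = ["finance", "sports", "finance", "sports"]

X = CountVectorizer().fit_transform(docs)  # bag-of-words count features

for clf in (MultinomialNB(), LogisticRegression()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))  # predictions on the training docs
```

Both classifiers share the same bag-of-words input, so swapping methods is a one-line change.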

28 Example of Document-by-Term Matrix
[Table: rows d1–d10 (documents); columns: predict, finance, stocks, goal, score, team, plus a Class Label column. The numeric cell values were not preserved in this transcription.]

29 Example of Document-by-Term Matrix
Note: we use "term" to allow for multi-word terms, i.e., n-grams, e.g., "Santa Barbara" and "New York."
[Same document-by-term matrix as on the previous slide.]

30 Real Example from Yelp Data
Yelp Dataset:
- Number of reviews: 706,693
- Number of reviews w/o neutral rating: 595,468
- Number of tokens: 85,392,376
- Vocabulary size w/o stopwords: 176,114
- Array dimensions: (595468, 176114)
- Number of cells in the array: 104,870,251,352
- Non-zero entries: 28,357,001
- Density: ≈ 0.00027

31 Training and Prediction
Labeled documents (terms plus known class labels) form the training data, used to learn the model.
Unlabeled documents (class labels unknown) are the future data, on which the model is used to make predictions.

32 Example of a Pipeline for Document Classification
Training documents (corpus) → Tokenization → Lists of tokens → Stopword and rare word removal → Vocabulary → Bag-of-words frequency counts → Machine learning algorithm → Document classifier

33 Example of a Pipeline for Document Classification
Training: training documents (corpus) → Tokenization → Lists of tokens → Stopword and rare word removal → Vocabulary → Bag-of-words frequency counts → Machine learning algorithm → Document classifier
Prediction: new document → Tokenization → Lists of tokens → Bag of words → Document classifier → Label prediction

34 Key Steps in Document Analysis Pipelines (for Bag of Words)
Tokenization: various options (e.g., how to handle punctuation, non-alphanumeric symbols, etc).
Vocabulary definition: n-grams, stopword removal, rare word removal, stemming.
Feature definition: binary (term present or not?), counts, or weighted counts, e.g., TF-IDF (see later in the slides).
Classifier selection: naïve Bayes, logistic, SVMs, neural networks, etc.
A sketch wiring these steps together appears below.
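The hedged sketch below chains these steps with scikit-learn's Pipeline. The specific choices (English stopword list, unigrams plus bigrams, a rare-word cutoff via min_df, counts rather than binary features) are one illustrative configuration, not required settings; train_docs/train_labels/new_docs are assumed variables.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    # Tokenization, stopword/rare-word removal, vocabulary, and counts in one step:
    ("bow", CountVectorizer(stop_words="english",  # stopword removal
                            ngram_range=(1, 2),    # unigrams + bigrams
                            min_df=2,              # drop rare terms
                            binary=False)),        # counts, not binary features
    # Classifier trained on the bag-of-words features:
    ("clf", LogisticRegression()),
])

# pipeline.fit(train_docs, train_labels)   # learn the vocabulary and the weights
# predictions = pipeline.predict(new_docs) # label predictions for new documents
```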

35 Example of Document-by-Term Matrix (count version)
[Same matrix layout as before: documents d1–d10 by terms predict, finance, stocks, goal, score, team, plus class labels; cell values not preserved.]

36 Example of Document-by-Term Matrix (binary version)
[Same matrix layout as before, with binary (0/1) entries; cell values not preserved.]

37 TF-IDF Weighting of Features
In practice the inputs can be weighted: it can be helpful to use TF-IDF weights instead of counts.
TF(t,d) = term frequency = count = number of times term t occurs in doc d
IDF(t) = inverse document frequency = log( N / number of docs with term t ), where N = total number of docs in the corpus
TF-IDF(t,d) = TF(t,d) * IDF(t)
The IDF term has the effect of upweighting terms that occur in few docs.

38 TF-IDF Example
N = 1000 in a corpus of news articles.
Term 1: t = "city", appears in 500 documents: IDF(t) = log(1000/500) = log(2) = 1 (log is base 2 here; the base is not important).
Term 2: t = "freeway", appears in 10 documents: IDF(t) = log(1000/10) = log(100) = 6.64.
So occurrences of "freeway" will get upweighted by a factor of 6.64 compared to occurrences of "city".
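A minimal sketch reproducing the arithmetic above (base-2 log, as in the example):

```python
import math

N = 1000  # total documents in the corpus

def idf(n_docs_with_term, n_total=N):
    # Inverse document frequency, log base 2 as in the slide's example.
    return math.log2(n_total / n_docs_with_term)

print(idf(500))  # "city":    log2(1000/500) = 1.0
print(idf(10))   # "freeway": log2(1000/10) ≈ 6.64
```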

39 Comparing True Labels and Predictions
Classification accuracy = percentage of correct predictions = 70% in the example below.
[Document-by-term matrix as before, with two extra columns: the true Class Label and the Algorithm's Predictions; cell values not preserved.]

40 Confusion Matrix and Accuracy
[Confusion matrix: rows = True Class 1 / True Class 2, columns = Predicted Class 1 / Predicted Class 2; cell counts not preserved.]
Accuracy = fraction of examples classified correctly = 280/400 = 70%
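Both quantities can be computed directly with scikit-learn; the short label vectors below are toy stand-ins, not the slide's 400 examples.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy true and predicted labels (stand-ins for a real test set).
y_true = [1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [1, 2, 1, 2, 2, 1, 2, 2]

print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # fraction of examples classified correctly
```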

41 Training and Test Data
Classification accuracy on our training data will tend to be optimistic: the classifier can memorize the training data.
Test set performance: a more accurate estimate of accuracy can be obtained by reserving some of our data as an independent/unseen/holdout test data set. Train the model on the training data, then evaluate the model's true accuracy on the data it did not see (the test set).
Cross-validation: we can repeat the process of splitting our data into train-test sets multiple times to get an even more robust estimate of accuracy. V-fold cross-validation makes V train-test splits of the data: train V models and evaluate on V test sets. Our final accuracy estimate is the average over the V test folds.
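A hedged sketch of V-fold cross-validation in scikit-learn (here V = 5); the synthetic data is a stand-in for a real document-feature matrix and label vector.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for bag-of-words features X and class labels y.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# V = 5 train-test splits: train 5 models, evaluate on 5 held-out folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # test accuracy on each fold
print(scores.mean())  # final accuracy estimate = average over the V test folds
```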

42 Linear Classifiers

43 Notation
Data: N documents, T features (e.g., term counts); an N x T array of features.
Variables: c is a class label, taking one of M possible values (M > 1). x is a T-dimensional vector of features for a document (e.g., x could be a binary vector indicating which terms are in the document or not). P(c | x) is the probability of class label c given x.

44 Linear Classifiers for 2-Class Problems
A linear classifier computes a linear weighted sum of the inputs, e.g., in 2 dimensions, with 2 classes:
f = classifier output = w_0 + w_1 x_1 + w_2 x_2
Why do we need this extra constant weight?

45 Geometric Interpretation of a Linear Classifier
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with Decision Region 1, Decision Region 2, and the Decision Boundary marked.]
Note that the decision boundary corresponds to the points where f = 0, i.e., w_0 + w_1 x_1 + w_2 x_2 = 0, which is the equation of a line in 2 dimensions. The w_0 weight allows us to have lines with non-zero intercept, i.e., lines that don't need to go through the origin.

46 Linear Classifier with Overlapping Class Distributions
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2) with overlapping class distributions; Decision Region 1, Decision Region 2, and the Decision Boundary marked.]

47 A Linear Classifier (with 2 Features)
[Diagram: inputs x_1 and x_2 (plus a constant input 1) feed, via weights w_1 and w_2 (plus the intercept weight w_0), into a weighted sum, followed by a threshold function whose output is the class prediction.]
f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2; is f > 0? output class 1 or class 2.

48 Linear Classifiers for 2-Class Problems
A linear classifier computes a linear weighted sum of the inputs, e.g., in 2 dimensions:
f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2
and more generally in T dimensions:
f(x) = f(x_1, ..., x_T) = Σ_j w_j x_j = w_0 + w_1 x_1 + w_2 x_2 + ... + w_T x_T

49 Linear Classifiers for 2-Class Problems
A linear classifier computes a linear weighted sum of the inputs, e.g., in 2 dimensions:
f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2
and more generally in T dimensions:
f(x) = f(x_1, ..., x_T) = Σ_j w_j x_j = w_0 + w_1 x_1 + w_2 x_2 + ... + w_T x_T
Sidenote: this can also be written as the inner product of a weight vector and the feature vector, i.e., f(x) = Σ_j w_j x_j = w^t x, where w = (w_0, w_1, ..., w_T) and w^t is the transpose of w.
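The inner-product form is one line of NumPy; the weight and feature values below are arbitrary illustrative numbers.

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0, 0.7])  # (w_0, w_1, w_2, w_3): intercept + T = 3 weights
x = np.array([0.0, 1.0, 3.0])        # T = 3 feature values for one document

x_aug = np.concatenate(([1.0], x))   # prepend a constant 1 so w_0 acts as the intercept
f = np.dot(w, x_aug)                 # f(x) = w^t x = w_0 + w_1 x_1 + ... + w_T x_T
print(f, "-> class 1" if f > 0 else "-> class 2")
```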

50 Linear Classifiers for Text Documents
Linear classifiers use a weighted sum of the inputs. With T features we have T + 1 weights (one per feature plus one "intercept").
Examples of linear classifiers:
- Linear classifier (perceptron)
- Logistic regression <- widely used in practice; what we will focus on
- Naïve Bayes (it is effectively linear)
- Support vector machines
Example of a non-linear classifier:
- Neural networks <- will be reviewed in future lectures
For further discussion of linear and non-linear classifiers, see:

51 Possible Weights for a Linear Classifier with Documents
[Document-by-term matrix as before (terms: predict, finance, stocks, goal, score, team; documents d1–d10 with class labels), with an additional row of learned weights, one per term; values not preserved.]

52 Logistic Regression Classification

53 Notation
Data: N documents, T features (e.g., term counts); an N x T array of features.
Variables: c is a class label, taking one of M possible values (M > 1). x is a T-dimensional vector of features for a document (e.g., x could be a binary vector indicating which terms are in the document or not). P(c | x) is the probability of class label c given x.

54 Getting Class Probabilities
Estimates of class probabilities P(c | x) are very useful in practice, e.g., for ranking documents to show to a human user.

55 Getting Class Probabilities
Estimates of class probabilities P(c | x) are very useful in practice, e.g., for ranking documents to show to a human user.
Assume for simplicity we have a 2-class binary classification problem. Say we tried to get a probability of a class with a linear model:
P(c | x) = f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_T x_T
There is a problem: f(x) could be negative, could be > 1, etc.

56 A Better Approach
P(c | x) = f(x) = g( w_0 + w_1 x_1 + w_2 x_2 + ... + w_T x_T ), where g(z) = 1 / (1 + e^{-z})
As z -> positive infinity, g(z) -> 1, so P(class) -> 1. As z -> negative infinity, g(z) -> 0, so P(class) -> 0.
This is the logistic regression model. In effect: a linear (weighted sum) model where the sum is transformed to lie between 0 and 1, so we can interpret f(x) directly as a probability between 0 and 1.

57 What Does the Logistic Function Look Like?
[Figure: shape of the logistic function, an S-shaped curve rising from 0 to 1.]
g(z) = 1 / (1 + e^{-z}), where z = weighted sum = Σ_j w_j x_j
As z -> positive infinity, g(z) -> 1, P(class) -> 1. As z -> negative infinity, g(z) -> 0, P(class) -> 0.
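A minimal sketch of the logistic function and its limiting behavior:

```python
import numpy as np

def g(z):
    # Logistic (sigmoid) function: maps any real z into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, g(z))  # g(-10) ≈ 0, g(0) = 0.5, g(10) ≈ 1
```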

58 Logistic Regression as a Neural Network
Logistic regression can be viewed as a simple artificial neuron.
[Diagram: inputs x_1, x_2, x_3 and a constant +1 feed into a single output unit f(x); each edge in the network has an associated weight or parameter, w_j.]

59 A Neural Network with 1 Hidden Layer
Here the model learns 3 different logistic functions, each one a "hidden unit," and then combines the outputs of the 3 to make a prediction.
[Diagram: inputs x_1, x_2, x_3 and a constant +1 feed into 3 hidden units, whose outputs feed into f(x).]
This model is representationally more powerful than a single logistic function, but it has many more parameters (it can overfit unless we are careful). The model can be trained using gradient methods, but local minima are a problem.

60 Deep Learning: Models with 2 or More Hidden Layers
We can build on this idea to create deep models with many hidden layers.
[Diagram: inputs x_1, x_2, x_3 and a constant +1 feed through multiple hidden layers into f(x).]
The model f(x) is now a very flexible, highly non-linear function. There is significant current interest in deep learning (e.g., 5, 10, 20 layers); a quick way to experiment appears below.
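The slide does not prescribe a library, but scikit-learn's MLPClassifier is one accessible way to try hidden layers; the synthetic data and layer sizes below are illustrative, not a tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for document features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Two hidden layers: more layers/units = more flexible, but easier to overfit.
clf = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy (optimistic; use held-out data in practice)
```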

61 Explaining Decisions by an AI Algorithm

62 Explaining an Algorithm's Decisions
Generating human-interpretable explanations of decisions made by AI systems is very important to the human users of these systems, e.g., in autonomous driving, medical diagnosis, product recommendations, and so on.
For linear classifiers, where we have 1 weight per input, this is straightforward: for each class, look at the most positive weights and the most negative weights. This tells us which features/terms (if present) have the most impact (Assignment 2). For documents, note that some terms might be rare, so we could measure how much impact they have on average rather than only when they are present. We can also tell the user which terms in a particular document contributed most to a decision. A sketch of this weight inspection follows below.
For non-linear classifiers (such as neural networks), explaining decisions is much more complicated.
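A hedged sketch of the "most positive / most negative weights" inspection for a scikit-learn logistic regression (the toy reviews are made up, and get_feature_names_out assumes a recent scikit-learn version):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great food great service", "terrible slow service",
        "great value", "terrible food"]   # toy reviews (illustrative only)
y = [1, 0, 1, 0]                          # 1 = positive, 0 = negative

vect = CountVectorizer()
X = vect.fit_transform(docs)
clf = LogisticRegression().fit(X, y)

terms = np.array(vect.get_feature_names_out())
order = np.argsort(clf.coef_[0])          # one learned weight per term (2-class case)
print("most negative:", terms[order[:3]])   # strongest evidence for the negative class
print("most positive:", terms[order[-3:]])  # strongest evidence for the positive class
```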


67 Assignment 2
Use scikit-learn (Python library) to investigate document classification:
- create bag-of-words representations
- generate training and test data sets
- experiment with different classifiers (naïve Bayes, logistic regression, kNN)
Submit the modified assignment2.py file (Wed next week, 5pm).
Please read the assignment and submit any questions via Piazza.
Office hours over the next week:
- Eric: Thursday 1 to 3, and next Tuesday 1 to 3 (new)
- Instructor: Friday 9:30 to 11:30

68 Course Schedule (Week: Monday / Wednesday)
Jan 8: Lecture: Introduction and course outline / Lecture: Basic concepts in text analysis
Jan 15: No class (university holiday) / Lecture: Text classification, part 1; Assignment 1 due, 5pm
Jan 22: Lecture: Text classification, part 2 / Lecture: Neural networks for text, part 1; Assignment 2 due, 5pm
Jan 29: Lecture: Neural networks for text, part 2 / Lecture: Neural networks for text, part 2; Project proposal due, Friday 6pm
Feb 5: Office hours (no lecture) / Lecture: Algorithm evaluation methods
Feb 12: Office hours (no lecture) / Lecture: Unsupervised learning algorithms
Feb 19: No class (university holiday) / Lecture: Discussion of progress reports; Progress report due, Friday 6pm
Feb 26: Office hours (no lecture) / Office hours (no lecture)
Mar 5: Office hours (no lecture) / Lecture: Discussion of final reports
Mar 12: Project Presentations (in class); upload slides by 4pm / Project Presentations (in class); upload slides by 4pm
Mar 19: Final project reports due (day/time TBD)

69 Example (in Python) of Classifying Yelp Reviews (code from Dimitris Kotzias, PhD student, Computer Science Department, UCI)


71 Real Example from Yelp Data
Simple pipeline for classification of Yelp reviews (a hedged sketch follows below):
- Extract the restaurant reviews
- Convert them to a tf-idf array
- Split the data into training and testing sets
- Train on the training data, and test
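A minimal sketch of those four steps, assuming yelp_reviews and yelp_labels are lists built from the dataset; this is not the original course code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# yelp_reviews, yelp_labels assumed: restaurant review texts and binary ratings
# (neutral reviews removed), e.g., loaded from the Yelp Challenge data.

# vect = TfidfVectorizer(stop_words="english")           # convert text to a tf-idf array
# X = vect.fit_transform(yelp_reviews)
# X_tr, X_te, y_tr, y_te = train_test_split(X, yelp_labels, test_size=0.2)
# clf = LogisticRegression().fit(X_tr, y_tr)             # train on the training data
# print(clf.score(X_te, y_te))                           # accuracy on the test data
```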

72 Real Example from Yelp Data
Yelp Dataset:
- Number of reviews: 706,693
- Number of reviews w/o neutral rating: 595,468
- Number of tokens: 85,392,376
- Vocabulary size w/o stopwords: 176,114
- Array dimensions: (595468, 176114)
- Number of cells in the array: 104,870,251,352
- Non-zero entries: 28,357,001
- Density: ≈ 0.00027

73 Histogram of Review Lengths

74 Real Example from Yelp Data: number of restaurants: 14,308; a total of 706,693 reviews.

75 Real Example from Yelp Data: data shape: (595468, 176114).

76 Real Example from Yelp Data: training size: …; testing size: …

77 Real Example from Yelp Data: training acc: …; testing acc: …; auc: …. Overall takes about … mins to run (may produce some warnings).

78 Other Aspects of Document Classification

79 Examples of Labels/Categories/Classes
Labels for documents or web pages:
- Labels are often general categories, e.g., for news articles: "finance," "sports," "news>world>asia>business"; e.g., for biomedical articles: gene expression, microarray, lung cancer.
- Labels may be genres: "editorials," "movie-reviews," "news."
- Labels may be opinion on a person/product: like, hate, neutral.
- Labels may be domain-specific: "interesting-to-me" vs. "not-interesting-to-me"; "contains adult language" vs. "doesn't"; language identification: English, French, Chinese, ...; "link spam" vs. "not link spam."

80 Where Do Document Labels Come From?
Manually assigned (expensive): a predefined dictionary of labels; human labelers read all or part of the article and assign the most likely label.
Who are the labelers? Domain experts; librarians/editors (e.g., for the New York Times); low-paid labelers, e.g., via Amazon Mechanical Turk.
This is a subjective process: even domain experts will disagree on some labels, and in many cases there is no absolute right or wrong labeling.
Semi-automated process: e.g., domain experts define selected keywords for each label; keyword matching is used to return the documents with the most keyword matches for each label; experts then label these returned documents; a classifier is trained on these labeled documents.

81 Other Aspects of Document Labels
Large numbers of label values: many applications have a very large number of possible class labels (thousands), and the distribution of labels is often highly skewed: some labels are very common, others very rare.
Multi-label versus single-label documents: multi-label means each document can have multiple labels; single-label means each document is assigned a single label. The multi-label problem is more complex to handle, e.g., the model needs to decide how many labels to assign to each document. (We will assume single-label for now and return to multi-label later.)
Hierarchical labels: it is common in real-world applications that labels are related hierarchically in a tree, e.g., "news>world>asia>business". Classifiers that use this hierarchy will generally perform better than classifiers that ignore it.

82 Feature Selection
The performance of text classification algorithms can often be improved by selecting only a subset of the terms.
Greedy search: start from the empty set or the full set and add/delete one term at a time.
Heuristics for adding/deleting: information gain (mutual information of term with class); chi-square; other ideas.
Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance.

83 Feature Selection using Mutual Information
Average mutual information between (a) C, the class label, and (b) F_t, the presence or absence of a term in a document, defined (following McCallum and Nigam, 1998) as:
I(C; F_t) = Σ_c Σ_{f_t ∈ {0,1}} P(c, f_t) log [ P(c, f_t) / ( P(c) P(f_t) ) ]
where c is the class and f_t indicates the presence or absence of term t.
Typical approach: compute this for all terms, include the top K terms in the classifier, and optimize the value of K via cross-validation (next lecture). A sketch appears below.
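scikit-learn's mutual_info_classif is one closely related way to score terms by mutual information with the class; this hedged sketch (toy corpus, arbitrary K) selects the top-K features, with K to be tuned by cross-validation as noted above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["stocks fall", "goal scored", "stocks rally", "team wins the goal"]
y = [0, 1, 0, 1]  # toy class labels

X = CountVectorizer().fit_transform(docs)  # sparse counts (treated as discrete features)

selector = SelectKBest(score_func=mutual_info_classif, k=2)  # keep top-K terms by MI
X_top = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # column indices of the selected terms
# In practice, tune K via cross-validation rather than fixing it by hand.
```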

84 Generating Multi-Word Terms
Consider multi-word terms like "New York": we would rather treat this as one term, "New York", than as "New" and "York". We can extend our vocabulary to include multi-word terms (or n-grams), with n = 1, 2, 3, 4, ..., e.g., "University of California Irvine" (n = 4).
Finding candidate n-grams: the space of possible multi-word combinations is huge. With W word tokens there are W^2 bigrams, W^3 trigrams, etc. (W on the order of 10^5).
General approach: select n-grams that occur frequently. Keep track of all k-frequent n-grams in the corpus (e.g., k = 10), then use feature selection (e.g., mutual information) to select the best; see the sketch below.
We can also use other filters to find good terms, e.g., use a parser to automatically extract noun phrases: "The big dog jumped over the lazy brown cat."
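CountVectorizer can generate and frequency-filter n-gram candidates in one step; in this hedged sketch the min_df cutoff plays the role of the k-frequent filter (corpus and values illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["new york is big", "i love new york",
        "york university", "new day in new york"]

# n-grams with n = 1..2; min_df=2 keeps only n-grams appearing in >= 2 documents.
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
vect.fit(docs)
print(vect.get_feature_names_out())  # includes the bigram "new york"
```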

85 Sentiment Lexicons
Basic analysis of text: overall count and percentage of words in various categories (a minimal sketch follows below).
Example lexicons:
- General Inquirer: words categorized according to positive/negative, strong vs. weak, active vs. passive, etc.
- SentiWordNet: synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness.
- Linguistic Inquiry and Word Count (LIWC): next slide.
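A minimal sketch of the basic count-and-percentage analysis, with a tiny made-up lexicon standing in for a real resource such as the General Inquirer or LIWC:

```python
# Toy lexicon (hypothetical stand-in for General Inquirer / SentiWordNet / LIWC).
LEXICON = {"positive": {"love", "nice", "sweet"},
           "negative": {"bad", "hate", "problem"}}

def category_percentages(text):
    # Percentage of tokens falling in each lexicon category.
    tokens = text.lower().split()
    return {cat: 100.0 * sum(t in words for t in tokens) / len(tokens)
            for cat, words in LEXICON.items()}

print(category_percentages("I love this place but the service was a problem"))
```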

86 LIWC (Linguistic Inquiry and Word Count)
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC. Austin, TX.
A dictionary of words grouped into more than 70 classes, for example:
Affective processes: negative emotion (bad, weird, hate, problem, tough); positive emotion (love, nice, sweet).
Cognitive processes: tentative (maybe, perhaps, guess); inhibition (block, constraint).
Pronouns; negation (no, never); quantifiers (few, many).

87 LIWC Word Categories
Pronouns, 1st person singular, 1st person plural, 2nd person, articles, past tense verbs, present tense verbs, future tense verbs, prepositions, negations, numbers, swear words, social words, family, friends, humans, affect, positive emotions, negative emotions, anxiety, anger, sadness, cognitive mechanisms, insight, causal, discrepancy, tentative, certainty, inhibition, inclusive, exclusive, seeing, hearing, feeling, body, sexual, motion, space, time, occupation, achievement, leisure, home, money, religion, death, assent, nonfluencies.

88 Pros and Cons of Dictionary Approaches such as LIWC
Pros: an effective method for studying the various emotional, cognitive, structural, and process components present in individuals' verbal and written speech; easy to use.
Cons: sentiment lexicons are fixed in their number of categories and the words in those categories; word context is often ignored; not domain-specific.
