Machine Learning for natural language processing
Classification: k nearest neighbors

Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016

Introduction

Classification is a supervised method for assigning an input to one of a finite set of possible classes. Today: vector-based document classification.

Readings: Jurafsky & Martin (2015), chapter 15; Jurafsky & Martin (2009), chapter 20; Manning et al. (2008), chapter 14.

Table of contents

1. Motivation
2. Vectors and documents
3. Measuring vector distance/similarity
4. k nearest neighbors

Motivation

We can characterize documents by large vectors of real-valued features. The features are, for instance, the words of a given vocabulary, and their values reflect how often each word occurs in the document. This gives us a vector-space model of document classes.

Vectors and documents

Term-document matrix: Let $V$ be our vocabulary and $D$ our set of documents. A term-document matrix characterizing $D$ with respect to $V$ is a $|V| \times |D|$ matrix where the cell for $v_i$ and $d_j$ contains the number of occurrences of $v_i$ in $d_j$. Each column in this matrix gives a vector of dimension $|V|$ that represents a document. This is again the bag-of-words representation of documents, except that $V$ need not contain all words from the documents in $D$.

Vectors and documents

Example from Jurafsky & Martin (2015), chapter 19: term-document matrix for four words in four Shakespeare plays:

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle          1                1               8            15
soldier         2                2              12            36
fool           37               58               1             5
clown           5              117               0             0

In real applications, we usually have $10{,}000 \leq |V| \leq 50{,}000$.
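To make the construction concrete, here is a minimal Python sketch of a term-document matrix with raw counts, rows indexed by terms and columns by documents. The toy documents and the helper name `term_document_matrix` are illustrative assumptions, not from the slides; only the four vocabulary terms come from the Shakespeare example above.

```python
from collections import Counter

# Vocabulary V (the four terms from the Shakespeare example); the documents are
# just short illustrative token sequences, not the actual plays.
V = ["battle", "soldier", "fool", "clown"]

def term_document_matrix(docs, vocab):
    """Return a |V| x |D| list of lists with raw term counts."""
    counts = [Counter(doc.split()) for doc in docs]
    return [[c[term] for c in counts] for term in vocab]

docs = ["fool fool clown battle", "battle soldier soldier fool"]
for term, row in zip(V, term_document_matrix(docs, V)):
    print(f"{term:8s} {row}")
```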

Vectors and documents

General idea: two documents are similar if they have similar vectors.

Example from Jurafsky & Martin (2015), chapter 19, continued: take only the dimensions fool and battle from the previous matrix. Plotted in the (fool, battle) plane, the four plays become the points As You Like It (37,1), Twelfth Night (58,1), Julius Caesar (1,8) and Henry V (5,15): As You Like It and Twelfth Night end up close to each other, as do Julius Caesar and Henry V.

Vectors and documents

Raw frequency counts are not the best measure of the association between terms and documents: a highly frequent word like the tells us nothing about the class of a document, since it occurs (more or less) equally frequently in all documents. Rare words, too, might be equally rare in all types of documents. The frequencies of such words, which are independent of the document class, should weigh less than the frequencies of words that are very typical of certain classes. We need a weighting scheme that takes this into account.

Vectors and documents

A standard weighting scheme for term-document matrices in information retrieval is tf-idf:

- term frequency: $tf_{t,d}$ = frequency of term $t$ in document $d$. This can simply be the count of $t$ in $d$.
- document frequency: $df_t$ = number of documents in which term $t$ occurs.
- inverse document frequency: $idf_t = \log\left(\frac{|D|}{df_t}\right)$
- tf-idf: $w_{t,d} = tf_{t,d} \cdot idf_t$

Terms occurring in fewer documents are more discriminative; therefore their counts are weighted with a higher idf factor. Since $|D|$ is usually large, we take the log of the inverse document frequency. Note, however, that terms occurring in all documents get a weight of 0.
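As a rough sketch of how this weighting could be computed, assuming the count-matrix layout from the sketch above (rows = terms, columns = documents); the base-10 logarithm is an assumption, the slide only writes log:

```python
import math

def tf_idf(count_matrix):
    """Convert a |V| x |D| matrix of raw counts into tf-idf weights w_{t,d}."""
    n_docs = len(count_matrix[0])
    weighted = []
    for row in count_matrix:                       # one row per term t
        df = sum(1 for count in row if count > 0)  # df_t: documents containing t
        idf = math.log10(n_docs / df) if df else 0.0
        # Terms occurring in all documents get idf = log(1) = 0, hence weight 0.
        weighted.append([count * idf for count in row])
    return weighted
```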

Vectors and documents

Reminder: [plot of $\log_{10} x$ for $0 \leq x \leq 30$; the curve rises steeply at first and then flattens out, staying below 1.5.]

Measuring vector distance/similarity

Each document $d$ is characterized with respect to a vocabulary $V = \{t_1, \ldots, t_{|V|}\}$ by a vector, for instance

$\langle tf_{t_1,d} \cdot idf_{t_1}, \ldots, tf_{t_{|V|},d} \cdot idf_{t_{|V|}} \rangle$

if we use tf-idf.

In order to compare documents, we need some metric for the distance between two vectors. One possibility is the Euclidean distance. The length of a vector is defined as

$|\vec{v}| = \sqrt{\sum_{i=1}^{n} v_i^2}$

We define the Euclidean distance of two vectors $\vec{v}, \vec{w}$ as

$|\vec{v} - \vec{w}| = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}$

Measuring vector distance/similarity

In order to use the Euclidean distance, we first have to normalize the vectors, i.e., we use

$\left| \frac{\vec{v}}{|\vec{v}|} - \frac{\vec{w}}{|\vec{w}|} \right| = \sqrt{\sum_{i=1}^{n} \left( \frac{v_i}{|\vec{v}|} - \frac{w_i}{|\vec{w}|} \right)^2}$

Alternatively, we can use a metric for the similarity of vectors. The most common similarity metric used in NLP is the cosine of the angle between the vectors.
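A small sketch of these definitions (vector length, Euclidean distance, and length normalization); plain Python lists stand in for document vectors, and the function names are my own:

```python
import math

def length(v):
    """|v| = sqrt(sum_i v_i^2)"""
    return math.sqrt(sum(x * x for x in v))

def euclidean(v, w):
    """|v - w| = sqrt(sum_i (v_i - w_i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def normalize(v):
    """v / |v|: divide every component by the vector length."""
    n = length(v)
    return [x / n for x in v]

# Normalized Euclidean distance of two document vectors v and w:
# euclidean(normalize(v), normalize(w))
```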

Measuring vector distance/similarity

Underlying idea: we use the dot product from linear algebra as a similarity metric. For vectors $\vec{v}, \vec{w}$ of dimension $n$,

$\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i w_i$

Example: dot product. Consider the following term-document matrix:

        d1    d2    d3    d4
t1       1     2     8    50
t2       2     8     3    20
t3      30   120     0     2

$d_1, d_2$: $\vec{v}_{d_1} \cdot \vec{v}_{d_2} = \langle 1, 2, 30 \rangle \cdot \langle 2, 8, 120 \rangle = 3618$
$d_1, d_4$: $\vec{v}_{d_1} \cdot \vec{v}_{d_4} = 150$
But: $\vec{v}_{d_2} \cdot \vec{v}_{d_4} = 500$ and $\vec{v}_{d_3} \cdot \vec{v}_{d_4} = 460$

Measuring vector distance/similarity

The dot product favors long vectors. Therefore, we use the normalized dot product, i.e., we divide by the lengths of the two vectors.

Cosine similarity metric:

$\mathrm{CosSim}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}| \, |\vec{w}|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2} \sqrt{\sum_{i=1}^{n} w_i^2}} = \cos \varphi$

where $\varphi$ is the angle between $\vec{v}$ and $\vec{w}$.
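A minimal sketch of the cosine measure, checked against the columns d1-d4 of the example matrix above; the rounded outputs in the comments match the cosine values reported below:

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def cos_sim(v, w):
    """CosSim(v, w) = (v . w) / (|v| |w|)"""
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

# Document columns d1..d4 of the example term-document matrix above:
d1, d2, d3, d4 = [1, 2, 30], [2, 8, 120], [8, 3, 0], [50, 20, 2]
print(round(cos_sim(d1, d2), 2))  # 1.0  (despite very different lengths)
print(round(cos_sim(d3, d4), 2))  # 1.0
print(round(cos_sim(d1, d4), 2))  # 0.09
```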

Measuring vector distance/similarity

Reminder: this is what the cosine looks like for angles between 0° and 90° (with all vector components $\geq 0$, other angles cannot occur): [plot of $\cos(x)$, falling monotonically from 1 at 0° to 0 at 90°.]

Measuring vector distance/similarity

Example: cosine similarity. Consider again the following term-document matrix:

           d1      d2      d3      d4
t1          1       2       8      50
t2          2       8       3      20
t3         30     120       0       2
|v_d|   30.08  120.28    8.54   53.89

Cosine values:

        d1     d2     d3
d2       1
d3    0.05   0.04
d4    0.09   0.08      1

(The cosine values for d1, d2 and for d3, d4 are rounded.)

Measuring vector distance/similarity

Cosine similarity is not invariant to shifts: adding 1 to all components increases the cosine similarity. The Euclidean distance, however, remains the same (if used without normalization). [Sketch: two vectors before and after adding 1 to all components; the angle between them shrinks.]

An alternative is the Pearson correlation. Let $\bar{v} = \frac{\sum_{i=1}^{n} v_i}{n}$ be the mean of the vector components. Then

$\mathrm{Corr}(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n} (v_i - \bar{v})(w_i - \bar{w})}{|\vec{v} - \bar{v}| \, |\vec{w} - \bar{w}|} = \mathrm{CosSim}(\vec{v} - \bar{v}, \vec{w} - \bar{w})$

where $\langle v_1, \ldots, v_n \rangle - c = \langle v_1 - c, \ldots, v_n - c \rangle$.
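A short sketch contrasting the two measures; the concrete vectors are made up for illustration, and the shift by 1 mirrors the observation above:

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

def pearson(v, w):
    """Corr(v, w) = CosSim(v - mean(v), w - mean(w))"""
    vc = [x - sum(v) / len(v) for x in v]
    wc = [x - sum(w) / len(w) for x in w]
    return cos_sim(vc, wc)

v, w = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
print(cos_sim(v, w))                                    # ~0.993
print(cos_sim([x + 1 for x in v], [x + 1 for x in w]))  # ~0.998: the shift raises the cosine
print(pearson(v, w))                                    # 1.0 (up to rounding): correlation ignores the shift
```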

Measuring vector distance/similarity

Two other widely used similarity measures are Jaccard and Dice. The underlying idea is that we measure the overlap in the features; therefore both metrics include $\sum_{i=1}^{n} \min(v_i, w_i)$.

Jaccard similarity measure:

$\mathrm{sim}_{\mathrm{Jaccard}}(\vec{v}, \vec{w}) = \frac{\sum_{i=1}^{n} \min(v_i, w_i)}{\sum_{i=1}^{n} \max(v_i, w_i)}$

Dice is very similar, except that the denominator does not take the max but the average:

$\mathrm{sim}_{\mathrm{Dice}}(\vec{v}, \vec{w}) = \frac{2 \sum_{i=1}^{n} \min(v_i, w_i)}{\sum_{i=1}^{n} (v_i + w_i)}$
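Both measures translate directly into code; a minimal sketch assuming non-negative count vectors of equal dimension (the example call reuses the d1, d2 columns from above):

```python
def jaccard(v, w):
    """sum_i min(v_i, w_i) / sum_i max(v_i, w_i)"""
    return sum(min(a, b) for a, b in zip(v, w)) / sum(max(a, b) for a, b in zip(v, w))

def dice(v, w):
    """2 * sum_i min(v_i, w_i) / sum_i (v_i + w_i)"""
    return 2 * sum(min(a, b) for a, b in zip(v, w)) / sum(a + b for a, b in zip(v, w))

print(jaccard([1, 2, 30], [2, 8, 120]))  # overlap relative to the larger counts
print(dice([1, 2, 30], [2, 8, 120]))     # overlap relative to the total counts
```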

k nearest neighbors

In the following, we are concerned with the task of classifying a document by assigning it a label drawn from some finite set of labels. Our training data consists of a set $D$ of documents, each of which belongs to a unique class $c \in C$. Documents are represented as vectors, i.e., our task amounts to classifying a new vector, given the classes of the training vectors.

Idea of k nearest neighbors (knn): in order to classify a new document $d$, take the majority class of the $k$ nearest neighbors of $d$ in the training document vector space.

k nearest neighbors

Example: k nearest neighbors. Task: classify fictional texts into the topic classes l (= love) and c (= crime).

Training data (d1-d3 in class l, d4-d5 in class c) and new documents d6, d7:

              d1     d2     d3     d4     d5  |    d6     d7
love          10      8      7      0      1  |     5      1
kiss           5      6      4      1      0  |     6      0
inspector      2      0      0     12      8  |     2     12
murderer       0      1      0     20     56  |     0      4
|v_d|      11.36  10.05   8.06  23.35  56.58  |  8.06  12.69

Before comparing vectors, we should normalize them. (The definition of cosine similarity above already includes normalization.) I.e., every $\vec{v} = \langle v_1, \ldots, v_n \rangle$ gets replaced with

$\frac{\vec{v}}{|\vec{v}|} = \left\langle \frac{v_1}{|\vec{v}|}, \ldots, \frac{v_n}{|\vec{v}|} \right\rangle$ where $|\vec{v}| = \sqrt{\sum_{i=1}^{n} v_i^2}$

k nearest neighbors

Example continued. Normalized data (training classes as before, new documents d6, d7):

              d1     d2     d3     d4     d5  |    d6     d7
love        0.88   0.80   0.87   0.00   0.02  |  0.62   0.08
kiss        0.44   0.60   0.50   0.04   0.00  |  0.74   0.00
inspector   0.18   0.00   0.00   0.51   0.14  |  0.25   0.95
murderer    0.00   0.10   0.00   0.86   0.99  |  0.00   0.32

Euclidean distances:

        d1     d2     d3     d4     d5
d6    0.40   0.35   0.43   1.30   1.38
d7    1.24   1.35   1.37   0.70   1.05

k nearest neighbors

Example continued. [Visualization of the dimensions love and murderer, including the k nearest neighbors of each of the new documents d6, d7 for k = 1 and k = 3: d6 lies among the l documents, d7 among the c documents.]

For both values of k, the majority class of the k nearest neighbors of d6 is l, while that of d7 is c.

k nearest neighbors

Some notation: $S_k(d)$ is the set of the $k$ nearest neighbor documents of a new document $d$; $\vec{v}(d)$ is the vector of a document $d$. We define $I_c : S_k(d) \to \{0, 1\}$ for a class $c$ and a document $d$ as $I_c(d_t) = 1$ if the class of $d_t$ is $c$, otherwise $I_c(d_t) = 0$.

Then knn assigns the following score to each class $c$, given a document $d$ that has to be classified:

$\mathrm{score}(c, d) = \sum_{d_t \in S_k(d)} I_c(d_t)$

The classifier then assigns to $d$ the class $c$ from the set of classes $C$ with the highest score:

$\hat{c} = \arg\max_{c \in C} \mathrm{score}(c, d)$
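Putting the pieces together, here is a minimal knn classifier sketch that reproduces the running example: it normalizes the vectors, ranks the training documents by Euclidean distance to the normalized new document, and takes the majority class of the k nearest ones. The data tuples are the columns d1-d7 of the example term-document matrix (terms: love, kiss, inspector, murderer); all function names are my own.

```python
import math
from collections import Counter

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def knn_classify(train, new_vec, k):
    """train: list of (vector, class) pairs. Returns the majority class of S_k(d)."""
    d = normalize(new_vec)
    neighbors = sorted(train, key=lambda vc: euclidean(normalize(vc[0]), d))[:k]
    score = Counter(c for _, c in neighbors)   # score(c, d) = sum_{d_t in S_k(d)} I_c(d_t)
    return score.most_common(1)[0][0]

# Columns d1..d5 (training) and d6, d7 (new) from the example above:
train = [([10, 5, 2, 0], "l"), ([8, 6, 0, 1], "l"), ([7, 4, 0, 0], "l"),
         ([0, 1, 12, 20], "c"), ([1, 0, 8, 56], "c")]
d6, d7 = [5, 6, 2, 0], [1, 0, 12, 4]
print(knn_classify(train, d6, k=3))  # 'l'
print(knn_classify(train, d7, k=3))  # 'c'
```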

k nearest neighbors

Choice of k: usually an odd number; k = 3 or k = 5 are frequent choices, but sometimes much larger values are used. k can be chosen based on experiments on held-out data during training.

k nearest neighbors

knn can be used as a probabilistic classifier: we can define

$P(c \mid d) = \frac{\sum_{d_t \in S_k(d)} I_c(d_t)}{k}$

Example continued. Euclidean distances:

        d1     d2     d3     d4     d5
d6    0.40   0.35   0.43   1.30   1.38
d7    1.24   1.35   1.37   0.70   1.05

Probabilities:

k = 1: $S_1(d_6) = \{d_2\}$, $P(l \mid d_6) = 1$, $P(c \mid d_6) = 0$; $S_1(d_7) = \{d_4\}$, $P(l \mid d_7) = 0$, $P(c \mid d_7) = 1$

k = 3: $S_3(d_6) = \{d_1, d_2, d_3\}$, $P(l \mid d_6) = 1$, $P(c \mid d_6) = 0$; $S_3(d_7) = \{d_1, d_4, d_5\}$, $P(l \mid d_7) = \frac{1}{3}$, $P(c \mid d_7) = \frac{2}{3}$
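Extending the classifier sketch above (reusing its `normalize`, `euclidean`, `train`, `d6` and `d7`), the class probabilities of the example could be computed roughly like this:

```python
def knn_probabilities(train, new_vec, k):
    """P(c | d) = (number of k nearest neighbors with class c) / k"""
    d = normalize(new_vec)
    neighbors = sorted(train, key=lambda vc: euclidean(normalize(vc[0]), d))[:k]
    classes = {c for _, c in train}
    return {c: sum(1 for _, cl in neighbors if cl == c) / k for c in classes}

print(knn_probabilities(train, d6, k=3))  # {'l': 1.0, 'c': 0.0}
print(knn_probabilities(train, d7, k=3))  # {'l': 0.33..., 'c': 0.66...}
```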

k nearest neighbors

Problem: for k > 1, the majority vote does not take into consideration the individual distances of the k nearest neighbors to $\vec{v}(d)$.

Example: k = 5, classes A and B, document d has to be classified. [Sketch: two A documents lie very close to d, three B documents lie farther away, and a third A document lies outside $S_5(d)$.] The classifier would assign B. But the two A documents in $S_5(d)$ are much nearer to d than the three B documents.

k nearest neighbors

Possible solution: weight the elements in $S_k(d)$ by their similarity to $d$. This gives the following revised definition of the score, using the cosine similarity measure:

$\mathrm{score}(c, d) = \sum_{d_t \in S_k(d)} I_c(d_t) \cdot \cos(\vec{v}(d_t), \vec{v}(d))$

where $\vec{v}(d)$ is the vector of a document $d$.
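A sketch of this cosine-weighted variant. For non-negative vectors, ranking by cosine similarity selects the same neighbors as ranking by normalized Euclidean distance, so the k neighbors are picked by cosine here as well; the function names and the example call are illustrative.

```python
import math

def cos_sim(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

def weighted_knn_classify(train, new_vec, k):
    """score(c, d) = sum_{d_t in S_k(d)} I_c(d_t) * cos(v(d_t), v(d))"""
    neighbors = sorted(train, key=lambda vc: cos_sim(vc[0], new_vec), reverse=True)[:k]
    scores = {}
    for vec, c in neighbors:
        scores[c] = scores.get(c, 0.0) + cos_sim(vec, new_vec)
    return max(scores, key=scores.get)

# With the training data from the knn sketch above:
# weighted_knn_classify(train, [1, 0, 12, 4], k=3)  -> 'c'
```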

References

Jurafsky, Daniel & James H. Martin. 2009. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (Prentice Hall Series in Artificial Intelligence). Pearson Education International, 2nd edn.

Jurafsky, Daniel & James H. Martin. 2015. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Draft of the 3rd edition.

Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press.