Dimensionality reduction

ML for NLP
Lecturer: Kevin Koidl
Assist. Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie
2017

Recapitulating: Evaluating TC systems

Evaluation compares actual performance ($\hat{f}$) to ideal performance ($f$). The most commonly used metrics are:

Recall: how good the system is at finding relevant documents for a given category ($\rho$):

  $\rho = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$   (1)

Precision: the quality of the classified data ($\pi$):

  $\pi = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$   (2)

Recapitulating: Machine Learning and Text Categorisation

- Foundations; Term Clustering
- Life Cycle: the overall approach
- Preliminaries: linguistic jargon
- Development approaches: train and test, k-fold validation, category generality
- Text representation: feature vectors, implementations, indexing
- Defining features: words vs. phrases, n-grams
- Computing weights: TF-IDF, normalisation, DIA, AIR/X
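A minimal sketch of the two evaluation metrics above, assuming the raw true-positive, false-positive and false-negative counts are already available (the example numbers are illustrative):

    def recall(tp: int, fn: int) -> float:
        """rho = TP / (TP + FN)"""
        return tp / (tp + fn)

    def precision(tp: int, fp: int) -> float:
        """pi = TP / (TP + FP)"""
        return tp / (tp + fp)

    # e.g. 40 true positives, 10 false positives, 20 false negatives
    print(recall(40, 20))     # 0.666...
    print(precision(40, 10))  # 0.8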

Dimensionality reduction in IR and TC

Outline:
- Basics of dimensionality reduction: vector representation, vector length, Euclidean distance, cosine similarity, cosine matching
- Types of DR: local vs. global, feature selection vs. feature extraction
- Term Space Reduction (TSR): aggressiveness, filtering by document frequency
- TSR function selection: DIA, information gain, mutual information, chi-square, NGL, relevance score, odds ratio, GSS coefficient
- Refresher: information theory, the entropy function
- Examples: mutual information (PMI and normalised) and information gain
- TF-IDF advanced: from local to global TSR and comparing TSR techniques

Dimensionality reduction in IR and TC

A 3-dimensional term set: T = {football, politics, economy}
- IR: calculate distances between vectors (e.g. via cosine matching)
- TC: high dimensionality may be problematic

[Figure: two document vectors in the (football, politics, economy) space, d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3>]

Cosine similarity between documents d and e is given by:

  $\cos(d, e) = \frac{d \cdot e}{\|d\|\,\|e\|}$

where $\|d\|$ is the Euclidean norm of d. In the case of the example above (normalised vectors) the Euclidean distance can be used instead, as it gives the same rank order as cosine similarity:

  $\mathrm{dist}(d, e) = \sqrt{\sum_{i=1}^{|T|} (d_i - e_i)^2}$   (3)
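A minimal sketch of the two measures just defined, in plain Python, applied to the vectors d1 and d2 from the figure above:

    import math

    def cosine(d, e):
        """cos(d, e) = (d . e) / (||d|| ||e||)"""
        dot = sum(di * ei for di, ei in zip(d, e))
        return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in e)))

    def euclidean(d, e):
        """dist(d, e) = sqrt(sum_i (d_i - e_i)^2)"""
        return math.sqrt(sum((di - ei) ** 2 for di, ei in zip(d, e)))

    d1 = (0.5, 0.5, 0.3)   # <football, politics, economy>
    d2 = (0.5, 0.3, 0.3)
    print(cosine(d1, d2))     # ~0.97: the two documents point in a similar direction
    print(euclidean(d1, d2))  # 0.2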

Basics of Dimensionality Reduction

It starts with an n-dimensional classification (vector) space. In text classification the axes represent features: the classification labels or values (label, binary, etc.). A vector in the n-dimensional classification space represents a document.

Q: Why a vector space? Are the properties of the vectors important? Is it important how the vectors stand to each other?

Basics of Dimensionality Reduction: an example

Document 1: "Economy and Politics is important for Football."
Document 2: "In Economy the Politics of Football counts."

- Classification towards three labels: Football, Economy, Politics
- Typical approach: bag of words and frequency
- Resulting vectors for d1 and d2: (1, 1, 1) and (1, 1, 1), a trivial example

[Figure: the vectors d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> in the (football, politics, economy) space]

Basics of Dimensionality Reduction: an example (continued)

Documents 1 and 2 are now longer documents, with keyword frequencies (5, 5, 3) and (5, 3, 3). How do we compare both vectors?

Normalisation to a unit vector (on the surface of the unit hypersphere):

  $u = \frac{v}{\|v\|}$, with $\|v\| = \sqrt{5^2 + 5^2 + 3^2} \approx 7.68$, giving $\left(\frac{5}{7.68}, \frac{5}{7.68}, \frac{3}{7.68}\right)$
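A minimal sketch of the unit-vector normalisation step, applied to the frequency vectors (5, 5, 3) and (5, 3, 3) from the example:

    import math

    def unit_vector(v):
        """Project v onto the unit hypersphere: u = v / ||v||."""
        norm = math.sqrt(sum(x * x for x in v))
        return tuple(x / norm for x in v)

    d1 = (5, 5, 3)   # keyword frequencies for <football, politics, economy>
    d2 = (5, 3, 3)
    print(unit_vector(d1))  # approx (0.651, 0.651, 0.391), i.e. 5/7.68, 5/7.68, 3/7.68
    print(unit_vector(d2))  # approx (0.762, 0.457, 0.457)

After normalisation, comparing the two documents by Euclidean distance gives the same ranking as cosine similarity, as noted above.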

Cosine similarity (recap)

  $\cos(d, e) = \frac{d \cdot e}{\|d\|\,\|e\|}$

where $\|d\|$ is the Euclidean norm of d. For normalised vectors the Euclidean distance gives the same rank order as cosine similarity:

  $\mathrm{dist}(d, e) = \sqrt{\sum_{i=1}^{|T|} (d_i - e_i)^2}$   (4)

Cosine similarity example with three documents and four terms

Three novels: SaS (Sense and Sensibility), PaP (Pride and Prejudice), WH (Wuthering Heights). Raw term frequencies:

  Term        SaS   PaP   WH
  affection   115    58   20
  jealous      10     7   11
  gossip        2     0    6
  wuthering     0     0   38

Normalisation using log frequency weighting

Log frequency weight = 1 + log10(term frequency), and 0 if the term does not occur.

Log frequency weighted values:

  Term        SaS    PaP    WH
  affection   3.06   2.76   2.30
  jealous     2.00   1.85   2.04
  gossip      1.30   0      1.78
  wuthering   0      0      2.58

Length-normalised values:

  Term        SaS     PaP     WH
  affection   0.789   0.832   0.524
  jealous     0.515   0.555   0.465
  gossip      0.335   0       0.405
  wuthering   0       0       0.588

Cosine similarity between the normalised document vectors:

  cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
  cos(SaS, WH) ≈ 0.79
  cos(PaP, WH) ≈ 0.69
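A minimal sketch reproducing the worked example, starting from the raw counts in the table above (plain Python):

    import math

    counts = {                      # raw term frequencies for the three novels
        "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
        "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
        "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
    }

    def log_weight(tf):
        """1 + log10(tf) for tf > 0, else 0."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def normalised_vector(doc):
        weights = [log_weight(tf) for tf in counts[doc].values()]
        norm = math.sqrt(sum(w * w for w in weights))
        return [w / norm for w in weights]

    def cos(doc_a, doc_b):
        return sum(a * b for a, b in zip(normalised_vector(doc_a), normalised_vector(doc_b)))

    print(round(cos("SaS", "PaP"), 2))  # 0.94
    print(round(cos("SaS", "WH"), 2))   # 0.79
    print(round(cos("PaP", "WH"), 2))   # 0.69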

What is Dimensionality Reduction?

DR: a processing step whose goal is to reduce the size of the vector space from $|T|$ to $|T'| \ll |T|$. $T'$ is called the reduced term set.

Benefits of DR:
- Lower computational cost for ML
- Helps avoid overfitting (training on constitutive features, rather than contingent ones)

A rule of thumb: overfitting is avoided if the number of training examples is proportional to the size of $T$ (for TC, experiments have suggested a ratio of 50-100 texts per feature).

Local vs. Global DR

DR can be done for each category or for the whole set of categories:
- Local DR: for each category $c_i$, a set $T'_i$ of terms ($|T'_i| \ll |T|$, typically $10 \le |T'_i| \le 50$) is chosen for classification under $c_i$. Different term sets are used for different categories.
- Global DR: a set $T'$ of terms ($|T'| \ll |T|$) is chosen for classification under all categories $C = \{c_1, \ldots, c_{|C|}\}$.

N.B.: Most feature selection techniques can be used for local and global DR alike.

DR by feature selection vs. DR by feature extraction

- DR by term selection, or Term Space Reduction (TSR): $T'$ is a subset of $T$. Select a $T'$ from the original feature set which yields the highest effectiveness w.r.t. document indexing.
- DR by term extraction: the terms in $T'$ are not of the same type as the terms in $T$, but are obtained by combinations or transformations of the original ones. E.g. if the terms in $T$ are words, the terms in $T'$ may not be words at all.

Term Space Reduction

There are two ways to reduce the term space:
- TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality.
- TSR by term filtering: terms are ranked according to their importance for the TC task and the highest-scoring ones are chosen.

Performance is measured in terms of aggressiveness, the ratio between the sizes of the original and reduced feature sets: $\frac{|T|}{|T'|}$. Empirical comparisons of TSR techniques can be found in (Yang and Pedersen, 1997) and (Forman, 2003).

Filtering by document frequency

The simplest TSR technique:
1. Remove stop-words, etc. (see pre-processing steps).
2. Order all features $t_k$ in $T$ according to the number of documents in which they occur; call this metric $\#_{Tr}(t_k)$.
3. Choose $T' = \{t_1, \ldots, t_n\}$ s.t. it contains the $n$ highest-scoring $t_k$.

Advantages:
- Low computational cost
- DR up to a factor of 10 with a small reduction in effectiveness (Yang and Pedersen, 1997)
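A minimal sketch of document-frequency filtering, assuming a tokenised corpus and a stop-word list are already available (the toy corpus and the cut-off n are illustrative):

    from collections import Counter

    def df_filter(tokenised_docs, stopwords, n):
        """Keep the n terms that occur in the most documents (#_Tr)."""
        df = Counter()
        for doc in tokenised_docs:
            for term in set(doc) - stopwords:   # count each term once per document
                df[term] += 1
        return [term for term, _ in df.most_common(n)]

    corpus = [["economy", "politics", "football", "the"],
              ["politics", "of", "football"],
              ["football", "scores", "the"]]
    print(df_filter(corpus, {"the", "of"}, n=2))  # ['football', 'politics']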

Information theoretic TSR: preliminaries

Probability distributions: probabilities on an event space of documents.

- $P(\bar{t}_k, c_i)$: for a random document $x$, term $t_k$ does not occur in $x$ and $x$ is classified under category $c_i$.
- Similarly, the probability that $t_k$ does occur in $x$ and $x$ is filed under $c_i$ is written $P(t_k, c_i)$.

N.B.: This notation will be used as shorthand for instantiations of the appropriate random variables. That is, for multivariate Bernoulli models: $P(T_k = 0, C_i = 1)$ and $P(T_k = 1, C_i = 1)$, respectively.

Commonly used TSR functions (from Sebastiani, 2002):

- DIA factor, $z(t_k, c_i)$:  $P(c_i \mid t_k)$
- Information gain (a.k.a. expected mutual information), $IG(T_k, C_i)$ or $I(T_k; C_i)$:  $\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t)P(c)}$
- Mutual information, $MI(T_k, C_i)$:  $P(t_k, c_i) \log \frac{P(t_k, c_i)}{P(t_k)P(c_i)}$
- Chi-square, $\chi^2(T_k, C_i)$:  $\frac{|Tr|\,[P(t_k, c_i)P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)P(\bar{t}_k, c_i)]^2}{P(t_k)P(\bar{t}_k)P(c_i)P(\bar{c}_i)}$
- NGL coefficient, $NGL(T_k, C_i)$:  $\frac{\sqrt{|Tr|}\,[P(t_k, c_i)P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)P(\bar{t}_k, c_i)]}{\sqrt{P(t_k)P(\bar{t}_k)P(c_i)P(\bar{c}_i)}}$
- Relevancy score, $RS(T_k, C_i)$:  $\log \frac{P(t_k \mid c_i) + d}{P(\bar{t}_k \mid \bar{c}_i) + d}$
- Odds ratio, $OR(t_k, c_i)$:  $\frac{P(t_k \mid c_i)\,[1 - P(t_k \mid \bar{c}_i)]}{[1 - P(t_k \mid c_i)]\,P(t_k \mid \bar{c}_i)}$
- GSS coefficient, $GSS(T_k, C_i)$:  $P(t_k, c_i)P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)P(\bar{t}_k, c_i)$

The two more exotic acronyms, GSS and NGL, are the initials of the researchers who first proposed those metrics: the Galavotti-Sebastiani-Simi coefficient (GSS), proposed by (Galavotti et al., 2000), and the Ng-Goh-Low coefficient (NGL), proposed by (Ng et al., 1997).

Some functions in detail

Basic intuition: the best features for a category are those distributed most differently over the sets of positive and negative instances of documents filed under that category.

Pointwise mutual information:

  $PMI(T_i, C_j) = \log \frac{P(t_i, c_j)}{P(t_i)P(c_j)}$   (7)

Calculations to be performed: co-occurrence of terms and categories in the training corpus ($Tr$), and frequency of occurrence of words and categories in $Tr$.
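A minimal sketch of how such functions can be estimated in practice, assuming a simple per-category contingency count over the training corpus; it shows two of the functions from the table, PMI and chi-square (the counts in the example are illustrative):

    import math

    def probabilities(a, b, c, d):
        """a = docs with term & category, b = term only, c = category only, d = neither."""
        n = a + b + c + d
        return a / n, b / n, c / n, d / n, n

    def pmi(a, b, c, d):
        """log P(t, c) / (P(t) P(c))"""
        p_tc, p_tnc, p_ntc, p_ntnc, _ = probabilities(a, b, c, d)
        p_t, p_c = p_tc + p_tnc, p_tc + p_ntc
        return math.log(p_tc / (p_t * p_c))

    def chi_square(a, b, c, d):
        """|Tr| [P(t,c)P(~t,~c) - P(t,~c)P(~t,c)]^2 / (P(t)P(~t)P(c)P(~c))"""
        p_tc, p_tnc, p_ntc, p_ntnc, n = probabilities(a, b, c, d)
        p_t, p_c = p_tc + p_tnc, p_tc + p_ntc
        num = n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2
        return num / (p_t * (1 - p_t) * p_c * (1 - p_c))

    # term occurs in 40 of the 50 documents of the category, and in 60 of the 950 others
    print(pmi(40, 60, 10, 890))         # > 0: term and category co-occur more than chance
    print(chi_square(40, 60, 10, 890))  # large value: strong term-category dependence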

Implementing I(·, ·)

Example: extracting keywords from paragraphs (in Python; word probabilities are relative frequencies over whitespace tokens, and paragraphs are split on blank lines):

    import math
    from collections import Counter

    def split_as_paragraphs(d):
        return [p for p in d.split("\n\n") if p.strip()]

    def word_probabilities(text):
        counts = Counter(text.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def pointwise_mi(d):
        """Score each <word, paragraph> pair by PMI: log(P(t|c) / P(t))."""
        pmi_table = {}
        p_word_doc = word_probabilities(d)                 # P(t) over the whole text
        for p_index, par in enumerate(split_as_paragraphs(d)):
            p_word_par = word_probabilities(par)           # P(t|c), c = paragraph
            for word, p_w_par in p_word_par.items():
                pmi_table[(word, p_index)] = math.log(p_w_par / p_word_doc[word])
        return pmi_table

The keyword-spotting examples in this chapter use a slightly different sample space model from the one we will be using in the TC application. The intention is to illustrate alternative ways of modelling linguistic data in a probabilistic framework, and the fact that the TSR metrics can be used in different contexts. For the algorithm above, term occurrences are taken to be the elementary events. Term occurrences in the whole text give the prior probabilities for terms, $P(t)$. Term occurrences in particular paragraphs give the conditional probabilities $P(t \mid c)$ (i.e. occurrences of terms conditioned on the paragraph, taken in this case to be the "category"). Paragraphs are assumed to have a uniform prior $P(c)$ (i.e. they are all equally likely to occur).

In the case of $PMI(T, C)$ (pointwise mutual information of a term and a paragraph), we can simply work with priors and conditionals for words:

  $PMI(T, C) = \log \frac{P(t, c)}{P(t)P(c)} = \log \frac{P(t \mid c)P(c)}{P(t)P(c)} = \log \frac{P(t \mid c)}{P(t)}$   (8)

The conditional $P(t \mid c)$ can be calculated by dividing the number of times $t$ occurs in documents of category $c$ by the total number of tokens in those documents (the probability space for documents of category $c$). $P(t)$ can be calculated by dividing the frequency of $t$ in the training corpus by the total number of tokens in that corpus.

Normalised mutual information

  $MI(T_i, C_j) = P(t_i, c_j) \log \frac{P(t_i, c_j)}{P(t_i)P(c_j)}$   (9)

The same helpers can be reused for the weighted version, where each word-paragraph pair is weighted by $P(t \mid c)\,P(c)$ with a uniform paragraph prior:

    def mutual_information(d):
        """Score each <word, paragraph> pair by P(t|c) * P(c) * log(P(t|c) / P(t))."""
        mi_table = {}
        paragraphs = split_as_paragraphs(d)
        p_par = 1 / len(paragraphs)                        # uniform prior P(c)
        p_word_doc = word_probabilities(d)                 # P(t)
        for p_index, par in enumerate(paragraphs):
            p_word_par = word_probabilities(par)           # P(t|c)
            for word, p_w_par in p_word_par.items():
                mi_table[(word, p_index)] = p_w_par * p_par * math.log(p_w_par / p_word_doc[word])
        return mi_table

Similarly to (8), we can simplify the computation of $MI(T, C)$ as follows:

  $MI(T, C) = P(t, c) \log \frac{P(t, c)}{P(t)P(c)} = P(t \mid c)P(c) \log \frac{P(t \mid c)}{P(t)}$   (10)

Expected Mutual Information (Information Gain)

A formalisation of how much information about category $c_j$ one gains by knowing term $t_i$ (and vice versa):

  $IG(T_i, C_j) = \sum_{t \in \{t_i, \bar{t}_i\}} \sum_{c \in \{c_j, \bar{c}_j\}} P(t, c) \log \frac{P(t, c)}{P(t)P(c)}$   (11)

The computational cost of calculating $IG(\cdot, \cdot)$ is higher than that of estimating $MI(\cdot, \cdot)$.

IG: a simplified example

(Features = words; categories = paragraphs. Oversimplification: the sum runs only over the paragraphs in which the word occurs, i.e. assuming $T = \{word\}$.)

    def information_gain(d):
        """Sum a word's MI contribution over all paragraphs (simplified IG)."""
        ig_table = {}
        paragraphs = split_as_paragraphs(d)
        p_par = 1 / len(paragraphs)
        p_word_doc = word_probabilities(d)
        par_probs = [word_probabilities(par) for par in paragraphs]
        for p_index, p_word_par in enumerate(par_probs):
            for word in p_word_par:
                ig_table[(word, p_index)] = sum(
                    probs[word] * p_par * math.log(probs[word] / p_word_doc[word])
                    for probs in par_probs if word in probs)   # skip paragraphs where P(t|c) = 0
        return ig_table

From local to global TSR

A locally specified TSR function $tsr(t_k, c_i)$, i.e. ranging over terms $t_k$ with respect to a specific category $c_i$, can be made global by:

Summing over the set of all categories:

  $tsr_{sum}(t_k) = \sum_{i=1}^{|C|} tsr(t_k, c_i)$   (12)

Taking a weighted average:

  $tsr_{wavg}(t_k) = \sum_{i=1}^{|C|} P(c_i)\,tsr(t_k, c_i)$   (13)

Picking the maximum:

  $tsr_{max}(t_k) = \max_{i=1}^{|C|} tsr(t_k, c_i)$   (14)

Comparing TSR techniques

- Effectiveness depends on the chosen task, domain, etc.
- Reduction factors of up to 100 with $IG_{sum}$ and $\chi^2_{max}$.
- Summary of empirical studies on the performance of different information-theoretic measures (Sebastiani, 2002), from best to worst:

  $\{OR_{sum}, NGL_{sum}, GSS_{max}\} > \{IG_{sum}, \chi^2_{max}\} > \{\#_{wavg}, \chi^2_{wavg}\}$
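A minimal sketch of the three globalisation strategies, assuming the local TSR function is available as a callable and the category priors $P(c_i)$ are known (the toy scores are illustrative):

    def tsr_sum(term, categories, tsr):
        return sum(tsr(term, c) for c in categories)

    def tsr_wavg(term, categories, tsr, prior):
        """prior maps each category to P(c_i)."""
        return sum(prior[c] * tsr(term, c) for c in categories)

    def tsr_max(term, categories, tsr):
        return max(tsr(term, c) for c in categories)

    # Example with a toy local score: chi-square values per (term, category)
    scores = {("football", "sport"): 250.0, ("football", "economy"): 3.0}
    chi2 = lambda t, c: scores.get((t, c), 0.0)
    cats = ["sport", "economy"]
    print(tsr_sum("football", cats, chi2))                                   # 253.0
    print(tsr_wavg("football", cats, chi2, {"sport": 0.3, "economy": 0.7}))  # 77.1
    print(tsr_max("football", cats, chi2))                                   # 250.0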

References

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305.

Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experiments on the use of feature selection and negative evidence in automated text categorization. Technical report, Paris, France.

Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), pages 67-73, New York, NY, USA. ACM.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.

Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Fisher, D. H., editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville. Morgan Kaufmann.