Dimensionality reduction

ML for NLP
Lecturer: Kevin Koidl
Assist. Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie
2017

Recapitulating: Evaluating TC systems

Evaluation compares actual performance ($\hat{f}$) to ideal performance ($f$).

The most commonly used metrics:

- Recall: how good the system is at finding relevant documents for a given category ($\rho$):

  $$\rho = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{1}$$

- Precision: the quality of the classified data ($\pi$):

  $$\pi = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{2}$$

Recapitulating: Machine Learning and Text Categorisation

- Foundations Term Clustering
- Life Cycle: The overall approach
- Preliminaries: Linguistic Jargon
- Development Approaches: Train and Test, K-Fold Validation, Category Generality
- Text Representation: Feature Vectors, Implementations, Indexing
- Defining Features: Words vs. Phrases, N-Grams
- Computing Weights: TF-IDF, Normalisation, DIA, AIR/X
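The recall and precision definitions in (1) and (2) reduce to simple ratios of counts. The following is a minimal sketch only: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the notes specify.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Recall (rho): fraction of the relevant documents that were found."""
    denom = true_pos + false_neg
    return true_pos / denom if denom else 0.0


def precision(true_pos: int, false_pos: int) -> float:
    """Precision (pi): fraction of the assigned documents that are correct."""
    denom = true_pos + false_pos
    return true_pos / denom if denom else 0.0


# Hypothetical outcome for one category: 40 correct assignments,
# 10 wrong assignments and 20 relevant documents that were missed.
rho = recall(true_pos=40, false_neg=20)
pi = precision(true_pos=40, false_pos=10)
print(f"recall={rho:.3f} precision={pi:.3f}")
```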
Dimensionality reduction in IR and TC

- Basics of Dimensionality Reduction: Vector Representation, Vector Length, Euclidean Distance, Cosine Similarity, Cosine Matching
- Types of DR: Local vs. Global, Feature Selection vs. Feature Extraction
- Term Space Reduction (TSR): Aggressiveness, Filtering by Document Frequency
- TSR functions selection: DIA, Info. Gain, Mutual Info., Chi-Square, NGL, Relevance Score, Odds Ratio, GSS coef.
- Refresher: Information Theory - Entropy
- Function Examples: Mutual Info (PMI and Normalised) and Information Gain
- TF-IDF
- Advanced: From Local to Global TSR and Comparing TSR techniques

Dimensionality reduction in IR and TC

A 3d term set: T = {football, politics, economy}

- IR: calculate distances between vectors (e.g. via cosine matching)
- TC: high dimensionality may be problematic

[Figure: document vectors d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> plotted on the axes football, politics and economy]

Cosine similarity between documents d and e is given by:

$$\cos(d, e) = \frac{d \cdot e}{\|d\| \, \|e\|}$$

where $\|d\|$ is the Euclidean norm of d. In the case of the example above (normalised vectors) the Euclidean distance can be used instead, as it gives the same rank order as cosine similarity (for unit vectors, $\mathrm{dist}(d, e)^2 = 2 - 2\cos(d, e)$, so the closest vector is also the most similar one):

$$\mathrm{dist}(d, e) = \sqrt{\sum_{i=1}^{|T|} (d_i - e_i)^2} \tag{3}$$

Basics of Dimensionality Reduction

- It starts with an n-dimensional classification (vector) space.
- In text classification each axis represents a feature (here, the terms that name the classification labels), and the coordinate along an axis is the value of that feature (binary, frequency, etc.).
- A vector in the n-dimensional classification space represents a document.

Q: Why a vector space? Are the properties of the vector important? Is it important how the vectors stand in relation to each other?

Basics of Dimensionality Reduction: An example

- Document 1: "Economy and Politics is important for Football."
- Document 2: "In Economy the Politics of Football counts."
- Classification towards three labels: Football, Economy, Politics.
- Typical approach: bag of words and frequency.
- Resulting vectors for d1 and d2: (1,1,1) and (1,1,1). This is a trivial example: each keyword occurs exactly once in each document, so the two vectors coincide.

[Figure: document vectors d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> plotted on the axes football, politics and economy]

Basics of Dimensionality Reduction: An example

- Documents 1 and 2 are now longer documents with keyword frequencies (5,5,3) and (5,3,3).
- How do we compare both vectors?
- Normalisation: convert each vector to a unit vector (on the surface of the unit hypersphere). For v = (5,5,3):

$$u = \frac{v}{\|v\|}, \qquad \|v\| = \sqrt{5^2 + 5^2 + 3^2} \approx 7.68, \qquad u = \left(\frac{5}{7.68}, \frac{5}{7.68}, \frac{3}{7.68}\right)$$
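The normalisation step and the two comparison measures can be checked numerically on the keyword-frequency vectors (5,5,3) and (5,3,3). This is a small sketch, not a reference implementation; the helper names `norm`, `unit`, `cosine` and `euclidean` are chosen here for illustration.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Scale a vector to length 1 (a point on the unit hypersphere)."""
    length = norm(v)
    return [x / length for x in v]


def cosine(d, e):
    """Cosine similarity: dot product divided by the product of the norms."""
    return sum(a * b for a, b in zip(d, e)) / (norm(d) * norm(e))


def euclidean(d, e):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d, e)))


d1 = [5, 5, 3]
d2 = [5, 3, 3]
u1, u2 = unit(d1), unit(d2)

print(round(norm(d1), 2))   # length of d1 before normalisation
print(cosine(d1, d2))       # unchanged by normalisation
print(euclidean(u1, u2))    # distance between the unit vectors
```

For unit vectors the squared Euclidean distance equals 2 - 2 cos(d, e), which is why ranking by increasing distance and ranking by decreasing cosine agree.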
Cosine similarity example with three documents and four terms

Three novels. SaS: Sense and Sensibility, PaP: Pride and Prejudice, WH: Wuthering Heights.

Term frequencies:

Term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38
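One way to compare the three novels from these raw counts is to dampen each count with a log-frequency weight, 1 + log10(tf), length-normalise the resulting vectors, and take dot products. The sketch below follows that recipe under stated assumptions: the dictionary layout and the function names are hypothetical, and a term frequency of 0 is mapped to a weight of 0.

```python
import math

# Raw term frequencies from the table above.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH": {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}


def log_weight(tf):
    """Log-frequency weight: 1 + log10(tf), or 0 for an absent term."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def normalise(vec):
    """Divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / length for term, w in vec.items()}


vectors = {
    doc: normalise({term: log_weight(tf) for term, tf in tfs.items()})
    for doc, tfs in counts.items()
}


def cosine(d, e):
    """For unit-length vectors the cosine is just the dot product."""
    return sum(d[term] * e[term] for term in d)


for a, b in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 2))
```

Rounded to two decimals, the three similarities come out as 0.94, 0.79 and 0.69, so the two Austen novels are closer to each other than either is to Wuthering Heights.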
Normalization using Log Frequency Weighting

Log frequency = 1 + log10(term frequency) for a term frequency greater than 0, and 0 otherwise.

Log Frequency Weighting Values

Term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Length Normalized Values

Term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

Cosine similarity between documents d and e: for length-normalised vectors the denominator of the cosine is 1, so the cosine reduces to the dot product.

$$\cos(d, e) = d \cdot e = \sum_{i=1}^{|T|} d_i e_i$$

$$\mathrm{dist}(d, e) = \sqrt{\sum_{i=1}^{|T|} (d_i - e_i)^2} \tag{6}$$

cos(SaS, PaP) ≈ 0.789 x 0.832 + 0.515 x 0.555 + 0.335 x 0.0 + 0.0 x 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

What is Dimensionality Reduction?

DR: a processing step whose goal is to reduce the size of the vector space from $|T|$ to $|T'| \ll |T|$. $T'$ is called the reduced term set.

Benefits of DR:

- Lower computational cost for ML
- Helps avoid overfitting (training on constitutive features, rather than contingent ones)

A rule of thumb: overfitting is avoided if the number of training examples is proportional to the size of $T'$ (for TC, experiments have suggested a ratio of 50-100 texts per feature).

Local vs. Global DR

DR can be done for each category or for the whole set of categories:

- Local DR: for each category $c_i$, a set $T'_i$ of terms ($|T'_i| \ll |T|$, typically $10 \le |T'_i| \le 50$) is chosen for classification under $c_i$.
  Different term sets are used for different categories.
- Global DR: a set $T'$ of terms ($|T'| \ll |T|$) is chosen for classification under all categories $C = \{c_1, \ldots, c_{|C|}\}$.

N.B.: Most feature selection techniques can be used for local and global DR alike.

DR by feature selection vs. DR by feature extraction

- DR by term selection, or Term Space Reduction (TSR): $T'$ is a subset of $T$. Select a $T'$ from the original feature set which yields the highest effectiveness w.r.t. document indexing.
- DR by term extraction: the terms in $T'$ are not of the same type as the terms in $T$, but are obtained by combinations or transformations of the original ones.

E.g.: in DR by term extraction, if the terms in $T$ are words, the terms in $T'$ may not be words at all.

Term Space Reduction

There are two ways to reduce the term space:

- TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality.
- TSR by term filtering: terms are ranked according to their importance for the TC task and the highest-scoring ones are chosen.

Performance is measured in terms of aggressiveness: the ratio between the sizes of the original and the reduced feature set, $|T| / |T'|$.

Empirical comparisons of TSR techniques can be found in (Yang and Pedersen, 1997) and (Forman, 2003).

Filtering by document frequency

The simplest TSR technique:

1. Remove stop-words, etc. (see pre-processing steps).
2. Order all features $t_k$ in $T$ according to the number of documents in which they occur. Call this metric $\#_{Tr}(t_k)$.
3. Choose $T' = \{t_1, \ldots, t_n\}$ s.t. it contains the $n$ highest scoring $t_k$.

Advantages:

- Low computational cost
- DR up to a factor of 10 with small reduction in effectiveness (Yang and Pedersen, 1997)
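The three filtering steps above can be sketched in a few lines. This is an illustration under assumptions, not the procedure used in the cited studies: the toy documents, the stop-word list, the function names and the alphabetical tie-breaking are all hypothetical choices made here.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for every term, the number of documents it occurs in."""
    df = Counter()
    for doc_tokens in docs:
        df.update(set(doc_tokens))  # a term counts once per document
    return df


def select_by_df(docs, n, stop_words=frozenset()):
    """Return the n terms with the highest document frequency."""
    df = document_frequency(docs)
    candidates = [t for t in df if t not in stop_words]
    # Sort by decreasing frequency; break ties alphabetically (an assumption).
    candidates.sort(key=lambda t: (-df[t], t))
    return candidates[:n]


docs = [
    ["the", "economy", "and", "politics"],
    ["the", "politics", "of", "football"],
    ["football", "and", "the", "economy"],
    ["politics", "today"],
]
stop_words = {"the", "and", "of"}

original_terms = {t for d in docs for t in d} - stop_words
reduced_terms = select_by_df(docs, n=2, stop_words=stop_words)
aggressiveness = len(original_terms) / len(reduced_terms)

print(reduced_terms, aggressiveness)
```

Here the term set shrinks from four terms to two, i.e. an aggressiveness of 2.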
Information theoretic TSR: preliminaries

Probability distributions: probabilities on an event space of documents:

- $P(\bar{t}_k, c_i)$: for a random document x, term $t_k$ does not occur in x and x is classified under category $c_i$.
- Similarly, we represent the probability that $t_k$ does occur in x and x is filed under $c_i$ by $P(t_k, c_i)$.

N.B.: This notation will be used as shorthand for instantiations of the appropriate random variables. That is, for multivariate Bernoulli models: $P(T_k = 0, C_i = 1)$ and $P(T_k = 1, C_i = 1)$, respectively.

Commonly used TSR functions

Function                                                Notation                          Mathematical definition
DIA factor                                              $z(T_k, C_i)$                     $P(c_i \mid t_k)$
Information Gain, AKA Expected Mutual Information       $IG(T_k, C_i)$ or $I(T_k; C_i)$   $\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t) P(c)}$
Mutual information                                      $MI(T_k, C_i)$                    $P(t_k, c_i) \log \frac{P(t_k, c_i)}{P(t_k) P(c_i)}$
Chi-square                                              $\chi^2(T_k, C_i)$                $\frac{|Tr| \, [P(t_k, c_i) P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) P(\bar{t}_k, c_i)]^2}{P(t_k) P(\bar{t}_k) P(c_i) P(\bar{c}_i)}$
NGL coefficient                                         $NGL(T_k, C_i)$                   $\frac{\sqrt{|Tr|} \, [P(t_k, c_i) P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) P(\bar{t}_k, c_i)]}{\sqrt{P(t_k) P(\bar{t}_k) P(c_i) P(\bar{c}_i)}}$
Relevancy score                                         $RS(T_k, C_i)$                    $\log \frac{P(t_k \mid c_i) + d}{P(\bar{t}_k \mid \bar{c}_i) + d}$
Odds ratio                                              $OR(t_k, c_i)$                    $\frac{P(t_k \mid c_i) \, [1 - P(t_k \mid \bar{c}_i)]}{[1 - P(t_k \mid c_i)] \, P(t_k \mid \bar{c}_i)}$
GSS coefficient                                         $GSS(T_k, C_i)$                   $P(t_k, c_i) P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) P(\bar{t}_k, c_i)$

From (Sebastiani, 2002).

The two more exotic acronyms, GSS and NGL, are the initials of the researchers who first proposed those metrics: the Galavotti-Sebastiani-Simi coefficient (GSS), proposed by (Galavotti et al., 2000), and the Ng-Goh-Low coefficient (NGL), proposed by (Ng et al., 1997).
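Several of the functions in the table can be computed from the four document counts of a 2x2 term/category contingency table. The sketch below is a minimal one under stated assumptions: the function name `tsr_scores`, its parameter names and the example counts are hypothetical, probabilities are plain relative-frequency estimates over the training set, and zero counts (which would cause a division by zero) are not handled.

```python
import math


def tsr_scores(n_t_c, n_t_notc, n_nott_c, n_nott_notc):
    """Score one (term, category) pair from document counts.

    n_t_c:       documents containing the term and filed under the category
    n_t_notc:    documents containing the term, not under the category
    n_nott_c:    documents without the term, under the category
    n_nott_notc: documents without the term, not under the category
    """
    n = n_t_c + n_t_notc + n_nott_c + n_nott_notc  # |Tr|
    p_tc, p_t_nc = n_t_c / n, n_t_notc / n
    p_nt_c, p_nt_nc = n_nott_c / n, n_nott_notc / n
    p_t = p_tc + p_t_nc           # P(t_k)
    p_c = p_tc + p_nt_c           # P(c_i)

    gss = p_tc * p_nt_nc - p_t_nc * p_nt_c
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    p_t_given_c = p_tc / p_c
    p_t_given_nc = p_t_nc / (1 - p_c)

    return {
        "dia": p_tc / p_t,                                 # P(c_i | t_k)
        "gss": gss,
        "chi2": n * gss ** 2 / denom,
        "ngl": math.sqrt(n) * gss / math.sqrt(denom),
        "odds_ratio": (p_t_given_c * (1 - p_t_given_nc))
        / ((1 - p_t_given_c) * p_t_given_nc),
    }


# Hypothetical training set of 100 documents.
scores = tsr_scores(n_t_c=30, n_t_notc=10, n_nott_c=20, n_nott_notc=40)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that NGL is the signed square root of chi-square, so it keeps the direction of the association that chi-square discards.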
Some functions in detail

Basic intuition: the best features for a category are those distributed most differently on sets of positive and negative instances of documents filed under that category.

Pointwise mutual information:

$$PMI(T_i, C_j) = \log \frac{P(t_i, c_j)}{P(t_i) P(c_j)} \tag{7}$$

Calculations to be performed: co-occurrence of terms and categories in the training corpus (Tr), and frequency of occurrence of words and categories in Tr.

Implementing I(.,.)

Example: extracting keywords from paragraphs.
    pointwisemi(d): wordpartable
      pmitable = ()
      parlist = split_as_paragraphs(d)
      typelist = gettypelist(d)
      foreach (par in parlist) do
        ptypelist = gettypelist(par)
        pindex = indexof(par, parlist)
        foreach (word in ptypelist) do
          i_w_p = log( getwordprobability(word, par) / getwordprobability(word, d) )
          addtotable(<word,pindex>, i_w_p, pmitable)
        done
      done
      return pmitable

The keyword spotting examples in this chapter use a slightly different sample space model than the one we will be using in the TC application. The intention is to illustrate alternative ways of modelling linguistic data in a probabilistic framework, and the fact that the TSR metrics can be used in different contexts.

For the algorithm above, term occurrences are taken to be the elementary events. Term occurrences in the whole text generate the prior probabilities for terms, $P(t)$. Term occurrences in certain paragraphs give conditional probabilities $P(t \mid c)$ (i.e. occurrences of terms conditioned on the paragraph, taken in this case to be the "category"). Paragraphs are assumed to have a uniform prior $P(c)$ (i.e. they are all equally likely to occur).

In the case of $PMI(T, C)$ (mutual information of a term and a paragraph), we can simply work with priors and conditionals for words:

$$PMI(T, C) = \log \frac{P(t, c)}{P(t) P(c)} = \log \frac{P(t \mid c) P(c)}{P(t) P(c)} = \log \frac{P(t \mid c)}{P(t)} \tag{8}$$

The conditional $P(t \mid c)$ can be calculated by dividing the number of times t occurs in documents of category c by the total number of tokens in those documents (the probability space for documents of category c). $P(t)$ can be calculated by dividing the frequency of t in the training corpus by the total number of tokens in that corpus.

Normalised mutual information

$$MI(T_i, C_j) = P(t_i, c_j) \log \frac{P(t_i, c_j)}{P(t_i) P(c_j)} \tag{9}$$

    mi(d): wordpartable                    /* rank each word in d */
      mitable = ()
      parlist = split_as_paragraphs(d)
      p_par = 1/sizeof(parlist)
      typelist = gettypelist(d)
      foreach (par in parlist) do
        ptypelist = gettypelist(par)
        pindex = indexof(par, parlist)
        foreach (word in ptypelist) do
          mi_w_p = getwordprobability(word, par) * p_par *
                   log( getwordprobability(word, par) / getwordprobability(word, d) )
          addtotable(<word,pindex>, mi_w_p, mitable)
        done
      done
      return mitable

Similarly to (8), we can simplify the computation of $MI(T, C)$ as follows:

$$MI(T, C) = P(t, c) \log \frac{P(t, c)}{P(t) P(c)} = P(t \mid c) P(c) \log \frac{P(t \mid c)}{P(t)} \tag{10}$$

Expected Mutual Information (Information gain)

A formalisation of how much information about category $c_j$ one gains by knowing term $t_i$ (and vice versa).

$$IG(T_i, C_j) = \sum_{t \in \{t_i, \bar{t}_i\}} \sum_{c \in \{c_j, \bar{c}_j\}} P(t, c) \log \frac{P(t, c)}{P(t) P(c)} \tag{11}$$

The computational cost of calculating IG(.,.) is higher than that of estimating MI(.,.).

IG: a simplified example

    ig(d): wordpartable                    /* features = words; categories = paragraphs */
      igtable = ()
      parlist = split_as_paragraphs(d)
      p_par = 1/sizeof(parlist)
      typelist = gettypelist(d)
      foreach (par in parlist) do
        ptypelist = gettypelist(par)
        pindex = indexof(par, parlist)
        foreach (word in ptypelist) do     /* oversimplification: assuming T = {word} */
          ig_w_p = 0                       /* reset the sum for each word */
          foreach (p in parlist) do        /* p, so as not to shadow the outer par */
            /* paragraphs in which word does not occur contribute 0 */
            ig_w_p += getwordprobability(word, p) * p_par *
                      log( getwordprobability(word, p) / getwordprobability(word, d) )
          done
          addtotable(<word,pindex>, ig_w_p, igtable)
        done
      done
      return igtable
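The three pseudocode routines above can be turned into a small runnable sketch. Everything here beyond the pseudocode is an assumption made for illustration: paragraphs are taken to be separated by blank lines, tokens are lower-cased whitespace-separated strings, the helper names are hypothetical, and the IG table is keyed by word alone because the simplified sum does not depend on the paragraph index.

```python
import math


def split_as_paragraphs(text):
    """Split on blank lines (an assumption about the input format)."""
    return [p for p in text.split("\n\n") if p.strip()]


def get_tokens(text):
    """Naive tokeniser: lower-case and split on whitespace."""
    return text.lower().split()


def word_probability(word, text):
    """Relative frequency of word among the tokens of text."""
    toks = get_tokens(text)
    return toks.count(word) / len(toks)


def pointwise_mi(text):
    """log(P(t|c) / P(t)) for every (word, paragraph index) pair, as in (8)."""
    table = {}
    for pindex, par in enumerate(split_as_paragraphs(text)):
        for word in set(get_tokens(par)):
            table[(word, pindex)] = math.log(
                word_probability(word, par) / word_probability(word, text))
    return table


def mi(text):
    """P(t|c) P(c) log(P(t|c) / P(t)), as in (10), with a uniform P(c)."""
    pars = split_as_paragraphs(text)
    p_par = 1 / len(pars)
    table = {}
    for pindex, par in enumerate(pars):
        for word in set(get_tokens(par)):
            p_w_par = word_probability(word, par)
            table[(word, pindex)] = p_w_par * p_par * math.log(
                p_w_par / word_probability(word, text))
    return table


def ig(text):
    """Simplified IG: sum a word's MI terms over all paragraphs."""
    pars = split_as_paragraphs(text)
    p_par = 1 / len(pars)
    table = {}
    for word in set(get_tokens(text)):
        p_w = word_probability(word, text)
        total = 0.0
        for par in pars:
            p_w_par = word_probability(word, par)
            if p_w_par > 0:  # a word absent from a paragraph contributes 0
                total += p_w_par * p_par * math.log(p_w_par / p_w)
        table[word] = total
    return table


text = "football football politics\n\neconomy politics politics"
pmi_table = pointwise_mi(text)
mi_table = mi(text)
ig_table = ig(text)
print(sorted(ig_table.items(), key=lambda kv: -kv[1]))
```

In this toy text "football" is twice as likely in the first paragraph as in the text overall, so its PMI with that paragraph is log 2, whereas "politics" appears in both paragraphs and receives the lowest simplified IG score.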
From local to global TSR

A locally specified TSR function $tsr(t_k, C_i)$, i.e. one ranging over terms $t_k$ with respect to a specific category $c_i$, can be made global by:

- Summing over the set of all categories:

  $$tsr_{sum}(t_k) = \sum_{i=1}^{|C|} tsr(t_k, C_i) \tag{12}$$

- Taking a weighted average:

  $$tsr_{wavg}(t_k) = \sum_{i=1}^{|C|} P(C_i) \, tsr(t_k, C_i) \tag{13}$$

- Picking the maximum:

  $$tsr_{max}(t_k) = \max_{i=1}^{|C|} tsr(t_k, C_i) \tag{14}$$

Comparing TSR techniques

- Effectiveness depends on the chosen task, domain, etc.
- Reduction factor of up to 100 with $IG_{sum}$ and $\chi^2_{max}$
- Summary of empirical studies on the performance of different information theoretic measures (Sebastiani, 2002), where ">" means "performs better than" and # is the document frequency:

  $\{OR_{sum}, NGL_{sum}, GSS_{max}\} > \{IG_{sum}, \chi^2_{max}\} > \{\#_{wavg}, \chi^2_{wavg}\}$

References

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289-1305.

Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experiments on the use of feature selection and negative evidence in automated text categorization. Technical report, Paris, France.

Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 97, pages 67-73, New York, NY, USA. ACM.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.

Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Fisher, D. H., editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville. Morgan Kaufmann Publishers.