Dimensionality reduction

ML for NLP
Lecturer: Kevin Koidl
Assist. Lecturer: Alfredo Maldonado

Recapitulating: Evaluating TC systems

Evaluation compares actual performance (f̂) to ideal performance (f). The most commonly used metrics are:

- Recall: how good the system is at finding relevant documents for a given category (ρ):

      ρ = true positives / (true positives + false negatives)                    (1)

- Precision: the quality of the classified data (π):

      π = true positives / (true positives + false positives)                    (2)

Recapitulating: Machine Learning and Text Categorisation

- Foundations; Term Clustering
- Life Cycle: the overall approach
- Preliminaries: linguistic jargon
- Development Approaches: train and test, k-fold validation, category generality
- Text Representation: feature vectors, implementations, indexing
- Defining Features: words vs. phrases, n-grams
- Computing Weights: TF-IDF, normalisation, DIA, AIR/X
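A minimal sketch of the two evaluation metrics in Python (function and variable names are my own, not from the lecture):

    def recall(tp, fn):
        """How good the system is at finding the relevant documents for a category."""
        return tp / (tp + fn)

    def precision(tp, fp):
        """The quality of the classified data: how many predicted positives are correct."""
        return tp / (tp + fp)

    # e.g. 40 true positives, 10 false positives, 20 false negatives
    print(recall(40, 20), precision(40, 10))   # 0.666..., 0.8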

Dimensionality reduction in IR and TC

- Basics of Dimensionality Reduction: vector representation, vector length, Euclidean distance, cosine similarity, cosine matching
- Types of DR: local vs. global, feature selection vs. feature extraction
- Term Space Reduction (TSR): aggressiveness, filtering by document frequency
- Selecting TSR functions: DIA, information gain, mutual information, chi-square, NGL, relevance score, odds ratio, GSS coefficient
- Refresher: information theory (the entropy function)
- Examples: mutual information (PMI and normalised), information gain, TF-IDF
- Advanced: from local to global TSR; comparing TSR techniques

Dimensionality reduction in IR and TC

A 3-d term set: T = {football, politics, economy}

- IR: calculate distances between vectors (e.g. via cosine matching)
- TC: high dimensionality may be problematic

[Figure: d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> plotted in the 3-d term space spanned by football, politics and economy]

Cosine similarity between documents d and e is given by:

    cos(d, e) = (d . e) / (||d|| ||e||)

where ||d|| is the Euclidean norm of d. In the case of the example above (normalised vectors) the Euclidean distance can be used instead, as it gives the same rank order as cosine similarity:

    dist(d, e) = sqrt( sum_{i=1}^{|T|} (d_i - e_i)^2 )                           (3)

Basics of Dimensionality Reduction

It starts with an n-dimensional classification (vector) space.
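A small sketch (my own illustration, not from the slides) of cosine similarity, and of the fact that for length-normalised vectors Euclidean distance produces the same ranking; the query vector q is an assumption added for the example:

    import math

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def unit(v):
        n = norm(v)
        return tuple(x / n for x in v)

    def cosine(d, e):
        return sum(x * y for x, y in zip(d, e)) / (norm(d) * norm(e))

    def euclidean(d, e):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, e)))

    # documents and a query, all length-normalised as in the slide's example
    d1, d2, q = map(unit, [(0.5, 0.5, 0.3), (0.5, 0.3, 0.3), (0.6, 0.4, 0.2)])

    # For unit vectors the document with the higher cosine also has the
    # smaller Euclidean distance, so both measures give the same ranking.
    print(cosine(q, d1), euclidean(q, d1))
    print(cosine(q, d2), euclidean(q, d2))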

In text classification the axes represent the features (the classification labels or values: label, binary, etc.). A vector in the n-dimensional classification space represents a document.

Q: Why a vector space? Are the properties of the vectors important? Is it important how the vectors stand to each other?

Basics of Dimensionality Reduction: an example

- Document 1: "Economy and Politics is important for Football."
- Document 2: "In Economy the Politics of Football counts."
- Classification towards three labels: Football, Economy, Politics
- Typical approach: bag of words and frequency
- Resulting vectors for d1 and d2: (1, 1, 1) and (1, 1, 1), a trivial example

[Figure: d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> in the 3-d term space spanned by football, politics and economy]

Basics of Dimensionality Reduction: an example

Documents 1 and 2 are now longer documents, with keyword frequencies (5, 5, 3) and (5, 3, 3). How do we compare both vectors?

Normalisation: unit vector (on the surface of the unit hypersphere):

    u = v / ||v||

e.g. v = (5, 5, 3) gives u = (5, 5, 3) / sqrt(59) ≈ (0.65, 0.65, 0.39).
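A quick sketch of the normalisation step (my own code; the numbers printed in the comments are the values computed, not values quoted from the slides):

    import math

    def to_unit(v):
        """Scale a term-frequency vector to length 1 (place it on the unit hypersphere)."""
        length = math.sqrt(sum(x * x for x in v))
        return tuple(x / length for x in v)

    print(to_unit((5, 5, 3)))   # ~ (0.651, 0.651, 0.390)
    print(to_unit((5, 3, 3)))   # ~ (0.762, 0.458, 0.458)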

Dimensionality reduction in IR and TC

A 3-d term set: T = {football, politics, economy}

- IR: calculate distances between vectors (e.g. via cosine matching)
- TC: high dimensionality may be problematic

[Figure: d1 = <0.5, 0.5, 0.3> and d2 = <0.5, 0.3, 0.3> in the 3-d term space, as before]

Cosine similarity between documents d and e is given by:

    cos(d, e) = (d . e) / (||d|| ||e||)                                           (4)

where ||d|| is the Euclidean norm of d. In the case of the example above (normalised vectors) the Euclidean distance can be used instead, as it gives the same rank order as cosine similarity:

    dist(d, e) = sqrt( sum_{i=1}^{|T|} (d_i - e_i)^2 )                            (5)

Cosine similarity example with three documents and four labels

Three novels: SaS (Sense and Sensibility), PaP (Pride and Prejudice), WH (Wuthering Heights).

Term frequencies (counts as in the worked example from Manning, Raghavan & Schütze's Introduction to Information Retrieval, which this example follows):

    Term        SaS    PaP    WH
    affection   115     58    20
    jealous      10      7    11
    gossip        2      0     6
    wuthering     0      0    38
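A sketch (my own code, not the lecture's) that carries these counts through the log-weighting and length-normalisation steps described on the next slide and reproduces the cosine values quoted there:

    import math

    tf = {
        "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
        "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
        "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
    }

    def log_weight(count):
        return 1 + math.log10(count) if count > 0 else 0.0

    def unit_vector(doc):
        w = {t: log_weight(c) for t, c in doc.items()}
        length = math.sqrt(sum(x * x for x in w.values()))
        return {t: x / length for t, x in w.items()}

    vecs = {name: unit_vector(doc) for name, doc in tf.items()}

    def cos(a, b):   # the dot product is enough: the vectors are already unit length
        return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

    print(cos("SaS", "PaP"), cos("SaS", "WH"), cos("PaP", "WH"))  # ~0.94, ~0.79, ~0.69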

Normalisation using log frequency weighting

    log frequency weight = 1 + log10(term frequency), or 0 if the term does not occur

Log frequency weighting values:

    Term        SaS    PaP    WH
    affection   3.06   2.76   2.30
    jealous     2.00   1.85   2.04
    gossip      1.30   0      1.78
    wuthering   0      0      2.58

Length-normalised values:

    Term        SaS    PaP    WH
    affection   0.789  0.832  0.524
    jealous     0.515  0.555  0.465
    gossip      0.335  0      0.405
    wuthering   0      0      0.588

(Values follow the worked example from Manning, Raghavan & Schütze's Introduction to Information Retrieval.)

For length-normalised vectors the cosine similarity between documents d and e reduces to the dot product:

    cos(d, e) = d . e                                                             (6)

    cos(SaS, PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0 + 0×0 ≈ 0.94
    cos(SaS, WH)  ≈ 0.79
    cos(PaP, WH)  ≈ 0.69

What is Dimensionality Reduction?

DR: a processing step whose goal is to reduce the size of the vector space from |T| to |T'| << |T|. T' is called the reduced term set.

Benefits of DR:

- Lower computational cost for ML
- Helps avoid overfitting (training on constitutive features rather than contingent ones)

A rule of thumb: overfitting is avoided if the number of training examples is proportional to the size of T (for TC, experiments have suggested a fixed ratio of training texts per feature).

Local vs. global DR

DR can be done for each category or for the whole set of categories:

- Local DR: for each category c_i, a set T'_i of terms (|T'_i| << |T|, typically 10 <= |T'_i| <= 50) is chosen for classification under c_i.

  Different term sets are used for different categories.

- Global DR: a set T' of terms (|T'| << |T|) is chosen for classification under all categories C = {c_1, ..., c_|C|}.

N.B.: Most feature selection techniques can be used for local and global DR alike.

DR by feature selection vs. DR by feature extraction

- DR by term selection, or Term Space Reduction (TSR): T' is a subset of T. Select from the original feature set the T' which yields the highest effectiveness w.r.t. document indexing.
- DR by term extraction: the terms in T' are not of the same type as the terms in T, but are obtained by combinations or transformations of the original ones. E.g. in DR by term extraction, if the terms in T are words, the terms in T' may not be words at all.

Term Space Reduction

There are two ways to reduce the term space:

- TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality.
- TSR by term filtering: terms are ranked according to their importance for the TC task and the highest-scoring ones are chosen.

Performance is measured in terms of aggressiveness, the ratio between the sizes of the original and the reduced feature set:

    |T| / |T'|

Empirical comparisons of TSR techniques can be found in (Yang and Pedersen, 1997) and (Forman, 2003).

Filtering by document frequency

The simplest TSR technique (a code sketch follows below):

1. Remove stop-words, etc. (see pre-processing steps).
2. Order all features t_k in T according to the number of documents in which they occur. Call this metric #_Tr(t_k).
3. Choose T' = {t_1, ..., t_n} s.t. it contains the n highest-scoring t_k.

Advantages:

- Low computational cost
- DR by up to a factor of 10 with only a small reduction in effectiveness (Yang and Pedersen, 1997)
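A minimal sketch of TSR by document frequency (my own code; `docs` is assumed to be a list of token lists):

    from collections import Counter

    def select_by_doc_frequency(docs, n):
        """Return the n terms occurring in the largest number of documents (the reduced term set T')."""
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))            # count each term at most once per document
        return [term for term, _ in df.most_common(n)]

    docs = [["economy", "politics", "football"],
            ["economy", "football"],
            ["politics", "football"]]
    print(select_by_doc_frequency(docs, 2))   # 'football' comes first: it occurs in all three documents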

Information-theoretic TSR: preliminaries

Probability distributions: probabilities on an event space of documents.

- P(¬t_k, c_i): for a random document x, term t_k does not occur in x and x is classified under category c_i.
- Similarly, we represent the probability that t_k does occur in x and x is filed under c_i by P(t_k, c_i).

N.B.: This notation will be used as shorthand for instantiations of the appropriate random variables. That is, for multivariate Bernoulli models: P(T_k = 0, C_i = 1) and P(T_k = 1, C_i = 1), respectively.

Commonly used TSR functions (from Sebastiani, 2002):

- DIA factor:           z(t_k, c_i) = P(c_i | t_k)
- Information gain (AKA expected mutual information, also written I(T_k; C_i)):
                        IG(T_k, C_i) = sum_{c in {c_i, ¬c_i}} sum_{t in {t_k, ¬t_k}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]
- Mutual information:   MI(T_k, C_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]
- Chi-square:           χ²(T_k, C_i) = |Tr| [ P(t_k, c_i) P(¬t_k, ¬c_i) - P(t_k, ¬c_i) P(¬t_k, c_i) ]² / [ P(t_k) P(¬t_k) P(c_i) P(¬c_i) ]
- NGL coefficient:      NGL(T_k, C_i) = sqrt(|Tr|) [ P(t_k, c_i) P(¬t_k, ¬c_i) - P(t_k, ¬c_i) P(¬t_k, c_i) ] / sqrt( P(t_k) P(¬t_k) P(c_i) P(¬c_i) )
- Relevancy score:      RS(T_k, C_i) = log [ (P(t_k | c_i) + d) / (P(¬t_k | ¬c_i) + d) ]
- Odds ratio:           OR(t_k, c_i) = P(t_k | c_i) [ 1 - P(t_k | ¬c_i) ] / { [ 1 - P(t_k | c_i) ] P(t_k | ¬c_i) }
- GSS coefficient:      GSS(T_k, C_i) = P(t_k, c_i) P(¬t_k, ¬c_i) - P(t_k, ¬c_i) P(¬t_k, c_i)

The two more exotic acronyms, GSS and NGL, stand for the initials of the researchers who first proposed those metrics: the Galavotti-Sebastiani-Simi coefficient (GSS), proposed by (Galavotti et al., 2000), and the Ng-Goh-Low coefficient (NGL), proposed by (Ng et al., 1997).

Some functions in detail

Basic intuition: the best features for a category are those distributed most differently on the sets of positive and negative instances of documents filed under that category.

Pointwise mutual information:

    PMI(T_i, C_j) = log [ P(t_i, c_j) / (P(t_i) P(c_j)) ]                         (7)

Calculations to be performed: co-occurrence of terms and categories in the training corpus (Tr), and frequency of occurrence of words and categories in Tr.
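As an aside, a small sketch (my own Python, not from the lecture) of how one of the tabulated functions, chi-square, can be computed from a 2x2 contingency table of document counts:

    def chi_square(n_tc, n_t_notc, n_nott_c, n_nott_notc):
        """chi^2(t, c) from document counts: term present/absent x category member/non-member."""
        n = n_tc + n_t_notc + n_nott_c + n_nott_notc          # |Tr|
        p_tc, p_t_notc = n_tc / n, n_t_notc / n
        p_nott_c, p_nott_notc = n_nott_c / n, n_nott_notc / n
        p_t, p_nott = p_tc + p_t_notc, p_nott_c + p_nott_notc
        p_c, p_notc = p_tc + p_nott_c, p_t_notc + p_nott_notc
        num = n * (p_tc * p_nott_notc - p_t_notc * p_nott_c) ** 2
        return num / (p_t * p_nott * p_c * p_notc)

    # e.g. a term occurring in 40 of 100 documents of category c and in 10 of 900 other documents
    print(chi_square(40, 10, 60, 890))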

Implementing I(., .)

Example: extracting keywords from paragraphs.

    pointwisemi(d): wordpartable
        pmitable = ()
        parlist = split_as_paragraphs(d)
        typelist = gettypelist(d)
        foreach (par in parlist) do
            ptypelist = gettypelist(par)
            pindex = indexof(par, parlist)
            foreach (word in ptypelist) do
                i_w_p = log( getwordprobability(word, par) / getwordprobability(word, d) )
                addtotable(<word, pindex>, i_w_p, pmitable)
            done
        done
        return pmitable

The keyword-spotting examples in this chapter use a slightly different sample space model from the one we will be using in the TC application. The intention is to illustrate alternative ways of modelling linguistic data in a probabilistic framework, and the fact that the TSR metrics can be used in different contexts.

For the algorithm above, term occurrences are taken to be the elementary events. Term occurrences in the whole text generate the prior probabilities for terms, P(t). Term occurrences in particular paragraphs give the conditional probabilities P(t|c) (i.e. occurrences of terms conditioned on the paragraph, taken in this case to be the "category"). Paragraphs are assumed to have a uniform prior P(c) (i.e. they are all equally likely to occur).

In the case of PMI(T, C) (the pointwise mutual information of a term and a paragraph), we can simply work with priors and conditionals for words:

    PMI(T, C) = log [ P(t, c) / (P(t) P(c)) ]
              = log [ P(t|c) P(c) / (P(t) P(c)) ]
              = log [ P(t|c) / P(t) ]                                             (8)

The conditional P(t|c) can be calculated by dividing the number of times t occurs in documents of category c by the total number of tokens in those documents (the probability space for documents of category c). P(t) can be calculated by dividing the frequency of t in the training corpus by the total number of tokens in that corpus.

Normalised mutual information:

    MI(T_i, C_j) = P(t_i, c_j) log [ P(t_i, c_j) / (P(t_i) P(c_j)) ]              (9)
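A runnable sketch of the pointwisemi procedure above (my own Python; whitespace tokenisation and simple relative frequencies stand in for the lecture's helper functions):

    import math
    from collections import Counter

    def pointwise_mi(document):
        """Score each (word, paragraph) pair with log( P(word|paragraph) / P(word|document) ), as in (8)."""
        paragraphs = [p.split() for p in document.split("\n\n") if p.strip()]
        doc_counts = Counter(w for par in paragraphs for w in par)
        doc_total = sum(doc_counts.values())

        table = {}
        for idx, par in enumerate(paragraphs):
            par_counts, par_total = Counter(par), len(par)
            for word in par_counts:
                p_w_given_par = par_counts[word] / par_total
                p_w = doc_counts[word] / doc_total
                table[(word, idx)] = math.log(p_w_given_par / p_w)
        return table

    text = "economy and politics drive football\n\nfootball fans follow politics rarely"
    print(sorted(pointwise_mi(text).items(), key=lambda kv: -kv[1])[:3])

Multiplying each score by P(word | paragraph) and by the paragraph prior P(paragraph) would give the normalised MI of (9).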

In pseudocode, the normalised-MI ranking looks like this:

    mi(d): wordpartable                 /* rank each word in d */
        mitable = ()
        parlist = split_as_paragraphs(d)
        p_par = 1 / sizeof(parlist)
        typelist = gettypelist(d)
        foreach (par in parlist) do
            ptypelist = gettypelist(par)
            pindex = indexof(par, parlist)
            foreach (word in ptypelist) do
                mi_w_p = getwordprobability(word, par) * p_par *
                         log( getwordprobability(word, par) / getwordprobability(word, d) )
                addtotable(<word, pindex>, mi_w_p, mitable)
            done
        done
        return mitable

Similarly to (8), we can simplify the computation of MI(T, C) as follows:

    MI(T, C) = P(t, c) log [ P(t, c) / (P(t) P(c)) ]
             = P(t|c) P(c) log [ P(t|c) / P(t) ]                                  (10)

Expected Mutual Information (Information Gain)

A formalisation of how much information about category c_j one gains by knowing term t_i (and vice versa):

    IG(T_i, C_j) = sum_{t in {t_i, ¬t_i}} sum_{c in {c_j, ¬c_j}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]    (11)

The computational cost of calculating IG(., .) is higher than that of estimating MI(., .).

IG: a simplified example

    ig(d): wordpartable                 /* features = words; categories = paragraphs */
        igtable = ()
        parlist = split_as_paragraphs(d)
        p_par = 1 / sizeof(parlist)
        typelist = gettypelist(d)
        foreach (par in parlist) do
            ptypelist = gettypelist(par)
            pindex = indexof(par, parlist)
            foreach (word in ptypelist) do      /* oversimplification: assuming T = {word} */
                ig_w_p = 0
                foreach (par2 in parlist) do
                    ig_w_p += getwordprobability(word, par2) * p_par *
                              log( getwordprobability(word, par2) / getwordprobability(word, d) )
                done
                addtotable(<word, pindex>, ig_w_p, igtable)
            done
        done
        return igtable
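A hedged Python counterpart of the simplified IG example (my own code; since the slide's score does not in fact depend on the paragraph index, the sketch keeps one score per word, and whitespace tokenisation again stands in for the lecture's helpers):

    import math
    from collections import Counter

    def information_gain(document):
        """For each word, sum over paragraphs of P(w|par) * P(par) * log( P(w|par) / P(w|document) ),
        skipping paragraphs where P(w|par) = 0 (taking 0 * log 0 as 0)."""
        paragraphs = [p.split() for p in document.split("\n\n") if p.strip()]
        p_par = 1.0 / len(paragraphs)
        par_counts = [(Counter(par), len(par)) for par in paragraphs]
        doc_counts = Counter(w for par in paragraphs for w in par)
        doc_total = sum(doc_counts.values())

        scores = {}
        for word, count in doc_counts.items():
            p_w = count / doc_total
            score = 0.0
            for counts, total in par_counts:
                p_wp = counts[word] / total
                if p_wp > 0:
                    score += p_wp * p_par * math.log(p_wp / p_w)
            scores[word] = score
        return scores

    text = "economy and politics drive football\n\nfootball fans follow politics rarely"
    print(sorted(information_gain(text).items(), key=lambda kv: -kv[1])[:3])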

From local to global TSR

A locally specified TSR function tsr(t_k, c_i), i.e. ranging over terms t_k with respect to a specific category c_i, can be made global by:

- summing over the set of all categories:

      tsr_sum(t_k) = sum_{i=1}^{|C|} tsr(t_k, c_i)                                (12)

- taking a weighted average:

      tsr_wavg(t_k) = sum_{i=1}^{|C|} P(c_i) tsr(t_k, c_i)                        (13)

- picking the maximum:

      tsr_max(t_k) = max_{i=1}^{|C|} tsr(t_k, c_i)                                (14)

Comparing TSR techniques

- Effectiveness depends on the chosen task, domain, etc.
- Reduction factors of up to 100 have been reported with IG_sum and χ²_max.
- Summary of empirical studies on the performance of different information-theoretic measures (Sebastiani, 2002), where ">" reads "tended to outperform":

      {OR_sum, NGL_sum, GSS_max} > {IG_sum, χ²_max} > {#_wavg, χ²_wavg}
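A small sketch of the local-to-global policies in (12)-(14) (my own code; `local_scores` maps (term, category) pairs to a TSR score and `p_cat` gives category priors):

    def globalise(local_scores, p_cat, mode="sum"):
        """Combine per-category TSR scores tsr(t_k, c_i) into one global score per term."""
        terms = {t for t, _ in local_scores}
        out = {}
        for t in terms:
            per_cat = {c: s for (term, c), s in local_scores.items() if term == t}
            if mode == "sum":
                out[t] = sum(per_cat.values())
            elif mode == "wavg":
                out[t] = sum(p_cat[c] * s for c, s in per_cat.items())
            else:                              # "max"
                out[t] = max(per_cat.values())
        return out

    local_scores = {("football", "sport"): 0.9, ("football", "economy"): 0.1,
                    ("bank", "sport"): 0.05, ("bank", "economy"): 0.8}
    p_cat = {"sport": 0.5, "economy": 0.5}
    print(globalise(local_scores, p_cat, "max"))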

References

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3.

Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experiments on the use of feature selection and negative evidence in automated text categorization. Technical report, Paris, France.

Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), pages 67-73, New York, NY, USA. ACM.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.

Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Fisher, D. H., editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville. Morgan Kaufmann Publishers.
