INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 26/26: Feature Selection and Exam Overview
Paul Ginsparg, Cornell University, Ithaca, NY, 3 Dec 2009

Administrativa
Assignment 4 due Fri 4 Dec (extended to Sun 6 Dec).

Combiner in Simulator
A combiner can be added, but it makes less sense in a simulator: combiners speed things up by performing local (in-memory) partial reduces, and in a simulator we are not really concerned about efficiency. From the Hadoop Wiki: when the map operation outputs its pairs they are already available in memory. For efficiency reasons, it sometimes makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used, the map key-value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the resulting key-value pairs as if they had been created by the original map operation. A minimal sketch of this idea appears below.
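The following is a minimal, self-contained Python sketch (not the course's simulator code) of a word-count job in which a combiner performs a local partial reduce on each mapper's in-memory output before the global reduce; the function names (map_phase, combine, reduce_phase) and the toy splits are illustrative assumptions.

    from collections import defaultdict

    def map_phase(document):
        # Emit (term, 1) for every token in one input split.
        return [(term, 1) for term in document.split()]

    def combine(pairs):
        # Local partial reduce: sum counts per key within one mapper's buffer.
        partial = defaultdict(int)
        for key, value in pairs:
            partial[key] += value
        return list(partial.items())

    def reduce_phase(all_pairs):
        # Global reduce over the (already partially combined) pairs.
        totals = defaultdict(int)
        for key, value in all_pairs:
            totals[key] += value
        return dict(totals)

    splits = ["to be or not to be", "to index or to rank"]
    combined = [pair for split in splits for pair in combine(map_phase(split))]
    print(reduce_phase(combined))   # e.g. {'to': 4, 'be': 2, 'or': 2, ...}

The combiner shrinks each mapper's output from one pair per token to one pair per distinct key, which is exactly the efficiency gain that is irrelevant in a simulator.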

Assignment 3
The PageRank r_j of page j is determined self-consistently by the equation

  r_j = \alpha/n + (1 - \alpha) \sum_{i:\, i \to j} r_i / d_i,

where \alpha is a number between 0 and 1 (originally taken to be .15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i. The incidence matrix is A_{ij} = 1 if i points to j, otherwise A_{ij} = 0. The transition probability from page i to page j is

  P_{ij} = (\alpha/n) O_{ij} + (1 - \alpha) (1/d_i) A_{ij},

where n = total number of pages, d_i is the outdegree of node i, and O_{ij} = 1 for all i, j. The matrix eigenvector relation rP = r, or r = P^T r, is equivalent to the equation above (with r normalized as a probability, so that \sum_i r_i O_{ij} = \sum_i r_i = 1).
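As a quick check of this formulation, here is a hedged Python sketch (not the assignment's reference solution): it builds P_{ij} = \alpha/n + (1-\alpha) A_{ij}/d_i for a made-up 3-page graph with no dangling nodes and power-iterates r <- rP until convergence.

    import numpy as np

    alpha, n = 0.15, 3
    # Toy incidence matrix: A[i, j] = 1 if page i links to page j.
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
    d = A.sum(axis=1)                      # outdegrees d_i (all > 0 in this toy graph)

    # Transition matrix P_ij = alpha/n + (1 - alpha) * A_ij / d_i
    P = alpha / n + (1 - alpha) * A / d[:, None]

    r = np.full(n, 1.0 / n)                # start from the uniform distribution
    for _ in range(100):
        r_next = r @ P                     # one step of r <- r P
        if np.abs(r_next - r).sum() < 1e-10:
            break
        r = r_next
    print(r, r.sum())                      # stationary PageRank vector, sums to 1

Dangling pages (d_i = 0) would need extra handling (e.g., redistributing their mass uniformly), which the toy graph avoids.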

Overview
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

More Data
Figure 1: Learning curves for confusion set disambiguation, from "Scaling to Very Very Large Corpora for Natural Language Disambiguation", M. Banko and E. Brill (2001), http://acl.ldc.upenn.edu/p/p01/p01-1005.pdf

Statistical Learning
Spelling with statistical learning
Google Sets
Statistical machine translation
Canonical image selection from the web
Learning people annotation from the web via consistency learning
and others...

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature. Many dimensions correspond to rare words, and rare words can mislead the classifier; rare, misleading features are called noise features. Eliminating noise features from the representation increases both the efficiency and the effectiveness of text classification. Eliminating features is called feature selection.

Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs.
Feature utility measures:
Frequency: select the most frequent terms
Mutual information: select the terms with the highest mutual information (mutual information is also called information gain in this context)
Chi-square

Information
The entropy

  H[p] = -\sum_{i=1}^{n} p_i \log_2 p_i

measures information uncertainty (p. 91 in book) and has its maximum H = \log_2 n when all p_i = 1/n.
Consider two probability distributions: p(x) for x \in X and p(y) for y \in Y.
MI: I[X;Y] = H[p(x)] + H[p(y)] - H[p(x,y)] measures how much information p(x) gives about p(y) (and vice versa).
MI is zero iff p(x,y) = p(x)p(y) for all x \in X and y \in Y, i.e., x and y are independent; it can be as large as H[p(x)] or H[p(y)].

  I[X;Y] = \sum_{x \in X, y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}
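To make the two equivalent forms of MI concrete, here is a small Python sketch (illustrative only, not from the lecture) that computes H[X], H[Y], H[X,Y] and I[X;Y] for a made-up 2x2 joint distribution and checks that the direct sum matches H[X] + H[Y] - H[X,Y].

    import math

    # Made-up joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

    def H(dist):
        # Entropy H[p] = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0).
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # MI via the difference of entropies ...
    mi_entropy = H(p_x) + H(p_y) - H(p_xy)
    # ... and via the direct sum over the joint distribution.
    mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y]))
                    for (x, y), p in p_xy.items() if p > 0)

    print(round(mi_entropy, 6), round(mi_direct, 6))  # the two values agree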

Mutual information
Compute the feature utility A(t,c) as the expected mutual information (MI) of term t and class c. MI tells us how much information the term contains about the class and vice versa. For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
Definition:

  I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U{=}e_t, C{=}e_c) \log_2 \frac{P(U{=}e_t, C{=}e_c)}{P(U{=}e_t)\, P(C{=}e_c)}

How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:

  I(U;C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}

N_{11}: number of documents that contain t (e_t = 1) and are in c (e_c = 1);
N_{10}: number of documents that contain t (e_t = 1) and are not in c (e_c = 0);
N_{01}: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1);
N_{00}: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0);
N = N_{00} + N_{01} + N_{10} + N_{11}.
Here N_{1.} = N_{11} + N_{10} and N_{0.} = N_{01} + N_{00} are the marginal counts of documents that do / do not contain t, and N_{.1} = N_{11} + N_{01} and N_{.0} = N_{10} + N_{00} are the marginal counts of documents in / not in c.

MI example for poultry/export in Reuters

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_11 = 49              N_10 = 27,652
  e_t = e_export = 0    N_01 = 141             N_00 = 774,106

Plug these values into the formula:

  I(U;C) = \frac{49}{801,948} \log_2 \frac{801,948 \cdot 49}{(49+27,652)(49+141)}
         + \frac{141}{801,948} \log_2 \frac{801,948 \cdot 141}{(141+774,106)(49+141)}
         + \frac{27,652}{801,948} \log_2 \frac{801,948 \cdot 27,652}{(49+27,652)(27,652+774,106)}
         + \frac{774,106}{801,948} \log_2 \frac{801,948 \cdot 774,106}{(141+774,106)(27,652+774,106)}
         \approx 0.0001105
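A short Python check of the worked example (an independent sketch, not code from the course) computes I(U;C) directly from the four contingency counts.

    from math import log2

    def mutual_information(n11, n10, n01, n00):
        # Expected MI of term and class from a 2x2 contingency table,
        # using the maximum-likelihood estimates from the slide above.
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00      # marginals over the term
        n_1, n_0 = n11 + n01, n10 + n00      # marginals over the class
        terms = [(n11, n1_ * n_1), (n01, n0_ * n_1),
                 (n10, n1_ * n_0), (n00, n0_ * n_0)]
        return sum(nij / n * log2(n * nij / denom)
                   for nij, denom in terms if nij > 0)

    # poultry/export counts from the Reuters example
    print(mutual_information(49, 27652, 141, 774106))  # ~0.0001105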

MI feature selection on Reuters

  Class: coffee             Class: sports
  coffee      0.0111        soccer     0.0681
  bags        0.0042        cup        0.0515
  growers     0.0025        match      0.0441
  kg          0.0019        matches    0.0408
  colombia    0.0018        played     0.0388
  brazil      0.0016        league     0.0386
  export      0.0014        beat       0.0301
  exporters   0.0013        game       0.0299
  exports     0.0013        games      0.0284
  crop        0.0012        team       0.0264

χ² feature selection
χ² tests the independence of two events, P(A,B) = P(A)P(B) (equivalently P(A|B) = P(A) and P(B|A) = P(B)). Here the two events are occurrence of the term and occurrence of the class; rank terms with respect to:

  X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

where N = observed frequency in D and E = expected frequency (e.g., E_{11} is the expected frequency of t and c occurring together in a document, assuming term and class are independent).
A high value of X² indicates that the independence hypothesis is incorrect, i.e., observed and expected are not similar. If occurrence of the term and occurrence of the class are dependent events, then occurrence of the term makes the class more (or less) likely, hence the term is helpful as a feature.

χ² feature selection, example

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_11 = 49              N_10 = 27,652
  e_t = e_export = 0    N_01 = 141             N_00 = 774,106

  E_{11} = N \cdot P(t) \cdot P(c) = N \cdot \frac{N_{11}+N_{10}}{N} \cdot \frac{N_{11}+N_{01}}{N} = 801,948 \cdot \frac{49+27,652}{801,948} \cdot \frac{49+141}{801,948} \approx 6.6

and similarly for the other cells:

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    E_11 ≈ 6.6             E_10 ≈ 27,694.4
  e_t = e_export = 0    E_01 ≈ 183.4           E_00 ≈ 774,063.6

  X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \approx 284
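A hedged Python sketch (again illustrative, not course code) reproduces the expected counts and the X² value of about 284 from the same contingency table.

    def chi_square(n11, n10, n01, n00):
        # X^2 statistic for a 2x2 term/class contingency table.
        n = n11 + n10 + n01 + n00
        row = {1: n11 + n10, 0: n01 + n00}   # totals over e_t
        col = {1: n11 + n01, 0: n10 + n00}   # totals over e_c
        observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
        x2 = 0.0
        for (et, ec), n_obs in observed.items():
            expected = row[et] * col[ec] / n   # E = N * P(term cell) * P(class cell)
            x2 += (n_obs - expected) ** 2 / expected
        return x2

    print(round(chi_square(49, 27652, 141, 774106), 1))  # ~284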

Naive Bayes: effect of feature selection
[Figure: F1 measure (0.0 to 0.8) versus number of features selected (1 to 10,000, log scale), comparing multinomial Naive Bayes with MI, chi-square, and frequency-based selection against binomial Naive Bayes with MI.]

Feature selection for Naive Bayes
In general, feature selection is necessary for Naive Bayes to get decent performance. This is also true for most other learning methods in text classification: you need feature selection for optimal performance. A small end-to-end sketch follows.
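For concreteness, here is a minimal scikit-learn sketch (assuming scikit-learn is available; the tiny corpus and labels are invented) that keeps the top k features by chi-square before training a multinomial Naive Bayes classifier, in the spirit of the plot above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    docs = ["coffee exports from brazil rose",
            "colombia coffee growers meet",
            "the cup match was played today",
            "the league team won the game"]
    labels = ["coffee", "coffee", "sports", "sports"]

    # Vectorize, keep the k best features by chi-square, then train Naive Bayes.
    model = Pipeline([
        ("bow", CountVectorizer()),
        ("select", SelectKBest(chi2, k=5)),
        ("nb", MultinomialNB()),
    ])
    model.fit(docs, labels)
    print(model.predict(["brazil coffee crop"]))  # expected: ['coffee']

In a real experiment k would be swept (as on the x-axis of the figure) and the resulting F1 plotted per selection method.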

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

XML markup

    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail...</verse>
        </scene>
      </act>
    </play>

XML Doc as DOM object
[Figure: the XML document above represented as a DOM tree of element, attribute, and text nodes.]
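As an illustration of the DOM view (a sketch I am adding, not part of the original slides), the markup above can be parsed into a tree and its elements addressed by path:

    import xml.etree.ElementTree as ET

    xml_doc = """
    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail...</verse>
        </scene>
      </act>
    </play>"""

    root = ET.fromstring(xml_doc)                # the document as a tree of nodes
    print(root.find("author").text)              # Shakespeare
    print(root.find("act").get("number"))        # I
    # Path-style access, e.g. for a structured query over act/scene/title:
    print(root.find("./act/scene/title").text)   # Macbeth's castle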

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Definition of information retrieval (from Lecture 1)
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Three scales: web, enterprise/institutional/domain-specific, personal.

Plan (from Lecture 1)
Search full text: basic concepts
Web search
Probabilistic retrieval
Interfaces
Metadata / semantics
IR NLP ML
Prerequisites: introductory courses in data structures and algorithms, in linear algebra, and in probability theory

1st Half
Searching full text: dictionaries, inverted files, postings, implementation and algorithms, term weighting, vector space model, similarity, ranking
Word statistics
MRS 1: Boolean retrieval
MRS 2: The term vocabulary and postings lists
MRS 3: Dictionaries and tolerant retrieval
MRS 5: Index compression
MRS 6: Scoring, term weighting, and the vector space model
MRS 7: Computing scores in a complete search system

1st Half, cont'd
Evaluation of retrieval effectiveness
MRS 8: Evaluation in information retrieval
Latent semantic indexing
MRS 18: Matrix decompositions and latent semantic indexing
Discussion 2: SMART
Discussion 3: IDF
Discussion 4: Latent semantic indexing

2nd Half
MRS 3: Tolerant retrieval
MRS 9: Relevance feedback and query expansion
MRS 11: Probabilistic information retrieval
Web search: anchor text and links, citation and link analysis, web crawling
MRS 19: Web search basics
MRS 21: Link analysis

2nd Half, cont'd
Classification, categorization, clustering
MRS 13: Text classification and Naive Bayes
MRS 14: Vector space classification
MRS 16: Flat clustering
MRS 17: Hierarchical clustering
(Structured retrieval: MRS 10, XML retrieval)
Discussion 5: Google
Discussion 6: MapReduce
Discussion 7: Statistical spell correction

Midterm
1) term-document matrix, VSM, tf.idf
2) recall/precision
3) LSI
4) word statistics (Heaps, Zipf)

Final Exam: 3 or 4 questions from these topics
CS4300/INFO4300, Tue 15 Dec, 7:00-9:30 PM, Olin Hall 255
Issues in personal / enterprise / web-scale searching; recall/precision, and how they relate to informational/navigational/transactional needs
Issues for modern search engines (e.g., w.r.t. web scale: tf.idf? recall/precision?)
MapReduce
Probabilistic reasoning: Naive Bayes
Web indexing and retrieval: link analysis, adversarial IR
Vector space classification (Rocchio, kNN)
Types of text classification (curated, rule-based, statistical)
Clustering: flat, hierarchical (k-means, agglomerative); evaluation of clustering; measures of cluster similarity (single link, complete link, average, group average)
Classification, clustering (make a dendrogram based on similarity)
Cluster labeling, feature selection