INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 26/26: Feature Selection and Exam Overview
Paul Ginsparg, Cornell University, Ithaca, NY, 3 Dec 2009

Administrativa
Assignment 4 due Fri 4 Dec (extended to Sun 6 Dec).

Combiner in Simulator
A combiner can be added, but it makes less sense in a simulator: combiners speed things up by performing local (in-memory) partial reduces, and in a simulator we are not really concerned about efficiency. From the Hadoop Wiki: when the map operation outputs its pairs they are already available in memory. For efficiency reasons, it sometimes makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used, the map key-value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the resulting key-value pairs as if they had been created by the original map operation. A minimal sketch of this idea appears below.
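The following is a minimal, self-contained Python sketch (not the course's simulator code) of a word-count job in which a combiner performs a local partial reduce on each mapper's in-memory output before the global reduce; the function names (map_phase, combine, reduce_phase) and the toy splits are illustrative assumptions.

    from collections import defaultdict

    def map_phase(document):
        # Emit (term, 1) for every token in one input split.
        return [(term, 1) for term in document.split()]

    def combine(pairs):
        # Local partial reduce: sum counts per key within one mapper's buffer.
        partial = defaultdict(int)
        for key, value in pairs:
            partial[key] += value
        return list(partial.items())

    def reduce_phase(all_pairs):
        # Global reduce over the (already partially combined) pairs.
        totals = defaultdict(int)
        for key, value in all_pairs:
            totals[key] += value
        return dict(totals)

    splits = ["to be or not to be", "to index or to rank"]
    combined = [pair for split in splits for pair in combine(map_phase(split))]
    print(reduce_phase(combined))   # e.g. {'to': 4, 'be': 2, 'or': 2, ...}

The combiner shrinks each mapper's output from one pair per token to one pair per distinct key, which is exactly the efficiency gain that is irrelevant in a simulator.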

Assignment 3
The PageRank r_j of page j is determined self-consistently by the equation

  r_j = \alpha/n + (1 - \alpha) \sum_{i:\, i \to j} r_i / d_i,

where \alpha is a number between 0 and 1 (originally taken to be .15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i. The incidence matrix is A_{ij} = 1 if i points to j, otherwise A_{ij} = 0. The transition probability from page i to page j is

  P_{ij} = (\alpha/n) O_{ij} + (1 - \alpha) (1/d_i) A_{ij},

where n = total number of pages, d_i is the outdegree of node i, and O_{ij} = 1 for all i, j. The matrix eigenvector relation rP = r, or r = P^T r, is equivalent to the equation above (with r normalized as a probability, so that \sum_i r_i O_{ij} = \sum_i r_i = 1).
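As a quick check of this formulation, here is a hedged Python sketch (not the assignment's reference solution): it builds P_{ij} = \alpha/n + (1-\alpha) A_{ij}/d_i for a made-up 3-page graph with no dangling nodes and power-iterates r <- rP until convergence.

    import numpy as np

    alpha, n = 0.15, 3
    # Toy incidence matrix: A[i, j] = 1 if page i links to page j.
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
    d = A.sum(axis=1)                      # outdegrees d_i (all > 0 in this toy graph)

    # Transition matrix P_ij = alpha/n + (1 - alpha) * A_ij / d_i
    P = alpha / n + (1 - alpha) * A / d[:, None]

    r = np.full(n, 1.0 / n)                # start from the uniform distribution
    for _ in range(100):
        r_next = r @ P                     # one step of r <- r P
        if np.abs(r_next - r).sum() < 1e-10:
            break
        r = r_next
    print(r, r.sum())                      # stationary PageRank vector, sums to 1

Dangling pages (d_i = 0) would need extra handling (e.g., redistributing their mass uniformly), which the toy graph avoids.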

Overview
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

More Data
Figure 1: Learning curves for confusion set disambiguation, from "Scaling to Very Very Large Corpora for Natural Language Disambiguation", M. Banko and E. Brill (2001), http://acl.ldc.upenn.edu/p/p01/p01-1005.pdf

Statistical Learning
Spelling with statistical learning
Google Sets
Statistical machine translation
Canonical image selection from the web
Learning people annotation from the web via consistency learning
and others...

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature. Many dimensions correspond to rare words, and rare words can mislead the classifier; rare, misleading features are called noise features. Eliminating noise features from the representation increases both the efficiency and the effectiveness of text classification. Eliminating features is called feature selection.

Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs.
Feature utility measures:
Frequency: select the most frequent terms
Mutual information: select the terms with the highest mutual information (mutual information is also called information gain in this context)
Chi-square

Information
The entropy

  H[p] = -\sum_{i=1}^{n} p_i \log_2 p_i

measures information uncertainty (p. 91 in book) and has its maximum H = \log_2 n when all p_i = 1/n.
Consider two probability distributions: p(x) for x \in X and p(y) for y \in Y.
MI: I[X;Y] = H[p(x)] + H[p(y)] - H[p(x,y)] measures how much information p(x) gives about p(y) (and vice versa).
MI is zero iff p(x,y) = p(x)p(y) for all x \in X and y \in Y, i.e., x and y are independent; it can be as large as H[p(x)] or H[p(y)].

  I[X;Y] = \sum_{x \in X, y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}
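To make the two equivalent forms of MI concrete, here is a small Python sketch (illustrative only, not from the lecture) that computes H[X], H[Y], H[X,Y] and I[X;Y] for a made-up 2x2 joint distribution and checks that the direct sum matches H[X] + H[Y] - H[X,Y].

    import math

    # Made-up joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

    def H(dist):
        # Entropy H[p] = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0).
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # MI via the difference of entropies ...
    mi_entropy = H(p_x) + H(p_y) - H(p_xy)
    # ... and via the direct sum over the joint distribution.
    mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y]))
                    for (x, y), p in p_xy.items() if p > 0)

    print(round(mi_entropy, 6), round(mi_direct, 6))  # the two values agree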

Mutual information
Compute the feature utility A(t,c) as the expected mutual information (MI) of term t and class c. MI tells us how much information the term contains about the class and vice versa. For example, if a term's occurrence is independent of the class (the same proportion of documents inside and outside the class contain the term), then MI is 0.
Definition:

  I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U{=}e_t, C{=}e_c) \log_2 \frac{P(U{=}e_t, C{=}e_c)}{P(U{=}e_t)\, P(C{=}e_c)}

How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:

  I(U;C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}

N_{11}: number of documents that contain t (e_t = 1) and are in c (e_c = 1);
N_{10}: number of documents that contain t (e_t = 1) and are not in c (e_c = 0);
N_{01}: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1);
N_{00}: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0);
N = N_{00} + N_{01} + N_{10} + N_{11}.
Here N_{1.} = N_{11} + N_{10} and N_{0.} = N_{01} + N_{00} are the marginal counts of documents that do / do not contain t, and N_{.1} = N_{11} + N_{01} and N_{.0} = N_{10} + N_{00} are the marginal counts of documents in / not in c.

MI example for poultry/export in Reuters

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_11 = 49              N_10 = 27,652
  e_t = e_export = 0    N_01 = 141             N_00 = 774,106

Plug these values into the formula:

  I(U;C) = \frac{49}{801,948} \log_2 \frac{801,948 \cdot 49}{(49+27,652)(49+141)}
         + \frac{141}{801,948} \log_2 \frac{801,948 \cdot 141}{(141+774,106)(49+141)}
         + \frac{27,652}{801,948} \log_2 \frac{801,948 \cdot 27,652}{(49+27,652)(27,652+774,106)}
         + \frac{774,106}{801,948} \log_2 \frac{801,948 \cdot 774,106}{(141+774,106)(27,652+774,106)}
         \approx 0.0001105
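A short Python check of the worked example (an independent sketch, not code from the course) computes I(U;C) directly from the four contingency counts.

    from math import log2

    def mutual_information(n11, n10, n01, n00):
        # Expected MI of term and class from a 2x2 contingency table,
        # using the maximum-likelihood estimates from the slide above.
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00      # marginals over the term
        n_1, n_0 = n11 + n01, n10 + n00      # marginals over the class
        terms = [(n11, n1_ * n_1), (n01, n0_ * n_1),
                 (n10, n1_ * n_0), (n00, n0_ * n_0)]
        return sum(nij / n * log2(n * nij / denom)
                   for nij, denom in terms if nij > 0)

    # poultry/export counts from the Reuters example
    print(mutual_information(49, 27652, 141, 774106))  # ~0.0001105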

MI feature selection on Reuters

  Class: coffee             Class: sports
  coffee      0.0111        soccer     0.0681
  bags        0.0042        cup        0.0515
  growers     0.0025        match      0.0441
  kg          0.0019        matches    0.0408
  colombia    0.0018        played     0.0388
  brazil      0.0016        league     0.0386
  export      0.0014        beat       0.0301
  exporters   0.0013        game       0.0299
  exports     0.0013        games      0.0284
  crop        0.0012        team       0.0264

χ² feature selection
χ² tests the independence of two events, P(A,B) = P(A)P(B) (equivalently P(A|B) = P(A) and P(B|A) = P(B)). Here the two events are occurrence of the term and occurrence of the class; rank terms with respect to:

  X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

where N = observed frequency in D and E = expected frequency (e.g., E_{11} is the expected frequency of t and c occurring together in a document, assuming term and class are independent).
A high value of X² indicates that the independence hypothesis is incorrect, i.e., observed and expected are not similar. If occurrence of the term and occurrence of the class are dependent events, then occurrence of the term makes the class more (or less) likely, hence the term is helpful as a feature.

χ² feature selection, example

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    N_11 = 49              N_10 = 27,652
  e_t = e_export = 0    N_01 = 141             N_00 = 774,106

  E_{11} = N \cdot P(t) \cdot P(c) = N \cdot \frac{N_{11}+N_{10}}{N} \cdot \frac{N_{11}+N_{01}}{N} = 801,948 \cdot \frac{49+27,652}{801,948} \cdot \frac{49+141}{801,948} \approx 6.6

and similarly for the other cells:

                        e_c = e_poultry = 1    e_c = e_poultry = 0
  e_t = e_export = 1    E_11 ≈ 6.6             E_10 ≈ 27,694.4
  e_t = e_export = 0    E_01 ≈ 183.4           E_00 ≈ 774,063.6

  X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \approx 284
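A hedged Python sketch (again illustrative, not course code) reproduces the expected counts and the X² value of about 284 from the same contingency table.

    def chi_square(n11, n10, n01, n00):
        # X^2 statistic for a 2x2 term/class contingency table.
        n = n11 + n10 + n01 + n00
        row = {1: n11 + n10, 0: n01 + n00}   # totals over e_t
        col = {1: n11 + n01, 0: n10 + n00}   # totals over e_c
        observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
        x2 = 0.0
        for (et, ec), n_obs in observed.items():
            expected = row[et] * col[ec] / n   # E = N * P(term cell) * P(class cell)
            x2 += (n_obs - expected) ** 2 / expected
        return x2

    print(round(chi_square(49, 27652, 141, 774106), 1))  # ~284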

Naive Bayes: effect of feature selection
[Figure: F1 measure (0.0 to 0.8) versus number of features selected (1 to 10,000, log scale), comparing multinomial Naive Bayes with MI, chi-square, and frequency-based selection against binomial Naive Bayes with MI.]

Feature selection for Naive Bayes
In general, feature selection is necessary for Naive Bayes to get decent performance. This is also true for most other learning methods in text classification: you need feature selection for optimal performance. A small end-to-end sketch follows.
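For concreteness, here is a minimal scikit-learn sketch (assuming scikit-learn is available; the tiny corpus and labels are invented) that keeps the top k features by chi-square before training a multinomial Naive Bayes classifier, in the spirit of the plot above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    docs = ["coffee exports from brazil rose",
            "colombia coffee growers meet",
            "the cup match was played today",
            "the league team won the game"]
    labels = ["coffee", "coffee", "sports", "sports"]

    # Vectorize, keep the k best features by chi-square, then train Naive Bayes.
    model = Pipeline([
        ("bow", CountVectorizer()),
        ("select", SelectKBest(chi2, k=5)),
        ("nb", MultinomialNB()),
    ])
    model.fit(docs, labels)
    print(model.predict(["brazil coffee crop"]))  # expected: ['coffee']

In a real experiment k would be swept (as on the x-axis of the figure) and the resulting F1 plotted per selection method.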

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

XML markup

    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail...</verse>
        </scene>
      </act>
    </play>

XML Doc as DOM object
[Figure: the XML document above represented as a DOM tree of element, attribute, and text nodes.]
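As an illustration of the DOM view (a sketch I am adding, not part of the original slides), the markup above can be parsed into a tree and its elements addressed by path:

    import xml.etree.ElementTree as ET

    xml_doc = """
    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail...</verse>
        </scene>
      </act>
    </play>"""

    root = ET.fromstring(xml_doc)                # the document as a tree of nodes
    print(root.find("author").text)              # Shakespeare
    print(root.find("act").get("number"))        # I
    # Path-style access, e.g. for a structured query over act/scene/title:
    print(root.find("./act/scene/title").text)   # Macbeth's castle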

Outline
1 Recap
2 Feature selection
3 Structured Retrieval
4 Exam Overview

Definition of information retrieval (from Lecture 1)
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Three scales: web, enterprise/institutional/domain-specific, personal.

Plan (from Lecture 1)
Search full text: basic concepts
Web search
Probabilistic retrieval
Interfaces
Metadata / semantics
IR NLP ML
Prerequisites: introductory courses in data structures and algorithms, in linear algebra, and in probability theory

1st Half
Searching full text: dictionaries, inverted files, postings, implementation and algorithms, term weighting, vector space model, similarity, ranking
Word statistics
MRS 1: Boolean retrieval
MRS 2: The term vocabulary and postings lists
MRS 3: Dictionaries and tolerant retrieval
MRS 5: Index compression
MRS 6: Scoring, term weighting, and the vector space model
MRS 7: Computing scores in a complete search system

1st Half, cont'd
Evaluation of retrieval effectiveness
MRS 8: Evaluation in information retrieval
Latent semantic indexing
MRS 18: Matrix decompositions and latent semantic indexing
Discussion 2: SMART
Discussion 3: IDF
Discussion 4: Latent semantic indexing

2nd Half
MRS 3: Tolerant retrieval
MRS 9: Relevance feedback and query expansion
MRS 11: Probabilistic information retrieval
Web search: anchor text and links, citation and link analysis, web crawling
MRS 19: Web search basics
MRS 21: Link analysis

2nd Half, cont'd
Classification, categorization, clustering
MRS 13: Text classification and Naive Bayes
MRS 14: Vector space classification
MRS 16: Flat clustering
MRS 17: Hierarchical clustering
(Structured retrieval: MRS 10, XML retrieval)
Discussion 5: Google
Discussion 6: MapReduce
Discussion 7: Statistical spell correction

Midterm
1) term-document matrix, VSM, tf.idf
2) recall/precision
3) LSI
4) word statistics (Heaps, Zipf)

Final Exam: 3 or 4 questions from these topics
CS4300/INFO4300, Tue 15 Dec, 7:00-9:30 PM, Olin Hall 255
Issues in personal / enterprise / web-scale searching; recall/precision, and how they relate to informational/navigational/transactional needs
Issues for modern search engines (e.g., w.r.t. web scale: tf.idf? recall/precision?)
MapReduce
Probabilistic reasoning: Naive Bayes
Web indexing and retrieval: link analysis, adversarial IR
Vector space classification (Rocchio, kNN)
Types of text classification (curated, rule-based, statistical)
Clustering: flat, hierarchical (k-means, agglomerative); evaluation of clustering; measures of cluster similarity (single link, complete link, average, group average)
Classification, clustering (make a dendrogram based on similarity)
Cluster labeling, feature selection