Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn, lwthucs@gmail.com
Presented at ICDM 2010, Sydney, Australia
Outline
- Introduction
- Problem Definition
- Lossless Term Filtering
- Lossy Term Filtering
- Experiments
- Conclusion
Introduction
- Term-document correlation matrix M: a |V| x |D| co-occurrence matrix whose entry (i, j) is the number of times term w_i occurs in document d_j.
- Applied in various text mining tasks: text classification [Dasgupta et al., 2007], text clustering [Blei et al., 2003], information retrieval [Wei and Croft, 2006].
- Disadvantage: high computation cost and intensive memory access. How can we overcome this disadvantage?
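The term-document matrix above can be sketched in a few lines; this is a minimal illustration of the notation, not the paper's implementation:

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a |V| x |D| term-document co-occurrence matrix.

    M[i][j] is the number of times vocabulary term i occurs in document j.
    Illustrative sketch only; real corpora would need sparse storage,
    which is exactly the cost the paper tries to reduce.
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w, n in Counter(doc).items():
            M[index[w]][j] = n
    return vocab, M

vocab, M = term_document_matrix([["a", "a", "b"], ["a", "b"]])
# vocab == ["a", "b"]; M == [[2, 1], [1, 1]]
```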
Introduction (cont'd)
Three general ways to overcome it:
- Improve the algorithm itself
- High-performance platforms: multicore processors and distributed machines [Chu et al., 2006]
- Feature selection approaches [Yi, 2003] (we follow this line)
Introduction (cont'd)
- Basic assumption: each term captures more or less information.
- Our goal: a subspace of features (terms) with minimal information loss.
- Conventional solution:
  - Step 1: measure each individual term, or its dependency to a group of terms (e.g., information gain or mutual information).
  - Step 2: remove features with low scores in importance or high scores in redundancy.
- Limitations: Is there a generic definition of information loss for a single term and for a set of terms, under any metric? What about the information loss of each individual document, and of a set of documents? Why should we consider the information loss on documents?
Introduction (cont'd)
Consider a simple example: Doc 1 = "A A B" and Doc 2 = "A A B", so the term vectors over (Doc 1, Doc 2) are w_A = (2, 2) and w_B = (1, 1).
Question: is any information lost if A and B are substituted by a single superterm S?
- Considering only the information loss for terms: NO, if the bound ε_V is defined by a state-of-the-art pseudometric such as cosine similarity, since w_A and w_B are parallel.
- Considering both terms and documents: YES, because both documents emphasize A more than B, and merging erases that emphasis. It is information loss of documents!
What we have done:
- Information loss for both terms and documents
- The Term Filtering with Bounded Error problem
- Efficient algorithms
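The asymmetry on this slide can be checked numerically; the sketch below assumes average-occurrence merging (S gets count 1.5 in each document) and cosine distance:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (nu * nv)

# Term vectors over (Doc 1, Doc 2): A occurs twice in each doc, B once.
w_A, w_B = (2.0, 2.0), (1.0, 1.0)
s = tuple((a + b) / 2 for a, b in zip(w_A, w_B))  # merged superterm (1.5, 1.5)

# Term-side loss: w_A and w_B are parallel, so cosine sees no loss at all.
assert cosine_distance(w_A, s) < 1e-12
assert cosine_distance(w_B, s) < 1e-12

# Document-side loss: Doc 1 = (2, 1) over (A, B) becomes (1.5, 1.5) after
# merging; the emphasis on A over B is erased, so the document error is > 0.
d1, d1_merged = (2.0, 1.0), (1.5, 1.5)
assert cosine_distance(d1, d1_merged) > 0.01
```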
Problem Definition: Superterm
Want: less information loss and a smaller term space. Group terms into superterms!
Superterm: s = {n(s, d_j)}_j, where n(s, d_j) is the number of occurrences of superterm s in document d_j.
Problem Definition: Information Loss
How to measure information loss during term merging? Through user-specified distance measures d(·,·), e.g., the Euclidean metric or cosine similarity.
A superterm s substitutes a set of terms w_i; its occurrences can be chosen with different methods (winner-take-all, average-occurrence, etc.).
- Information loss of a term: error_V(w) = d(w, s).
- Information loss of a document: error_D(d) = d(d, d'), where d' is the transformed representation of document d after merging.
Example: merging A and B of Doc 1 = "A A B" and Doc 2 = "A A B" by average-occurrence gives the transformed representation (1.5, 1.5) for each document.
Problem Definition: TFBE
Term Filtering with Bounded Error (TFBE): minimize the size of the superterm set S subject to the constraints
(1) S = f(V), where f is the mapping function from terms to superterms;
(2) for all w in V, error_V(w) ≤ ε_V;
(3) for all d in D, error_D(d) ≤ ε_D,
where ε_V and ε_D are user-specified error bounds.
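Constraints (2) and (3) can be made concrete with a small checker. This is a hypothetical helper under assumed choices (Euclidean distance, average-occurrence merging), not the paper's code:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def satisfies_tfbe(term_vectors, grouping, eps_V, eps_D):
    """Check TFBE constraints (2) and (3) for a candidate grouping.

    term_vectors: {term: occurrence vector over documents}
    grouping: list of term lists, each list merged into one superterm
    by average-occurrence.
    """
    n_docs = len(next(iter(term_vectors.values())))
    transformed = {}  # term -> its superterm's vector
    for group in grouping:
        s = [sum(term_vectors[w][j] for w in group) / len(group)
             for j in range(n_docs)]
        for w in group:
            if euclidean(term_vectors[w], s) > eps_V:   # constraint (2)
                return False
            transformed[w] = s
    for j in range(n_docs):
        d = [term_vectors[w][j] for w in term_vectors]
        d2 = [transformed[w][j] for w in term_vectors]
        if euclidean(d, d2) > eps_D:                    # constraint (3)
            return False
    return True
```

On the running example ({A: (2, 2), B: (1, 1)}), merging A and B costs sqrt(0.5) ≈ 0.71 on both sides, so the grouping is valid for ε_V = ε_D = 1 but not for ε_D = 0.5.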
Lossless Term Filtering
Special case: NO information loss of any term or document, i.e., ε_V = 0 and ε_D = 0.
Theorem: the exact optimal solution of the problem is yielded by grouping terms of the same vector representation.
Algorithm:
- Step 1: find local superterms for each document (terms with the same occurrences within the document).
- Step 2: add, split, or remove superterms from the global superterm set.
Is this case applicable in practice? YES! On the Baidu Baike dataset (1,531,215 documents), the vocabulary size shrinks from 1,522,576 to 714,392, a reduction of more than 50%.
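The idea behind the theorem is easy to sketch: with both bounds at zero, terms merge exactly when their occurrence vectors are identical. The snippet below shows that grouping directly, not the paper's two-step local/global algorithm:

```python
def lossless_superterms(term_vectors):
    """Group terms whose document-occurrence vectors are identical.

    With eps_V = eps_D = 0, grouping identical vectors yields the exact
    optimal superterm set (a sketch of the theorem, not the scalable
    per-document algorithm from the paper).
    """
    groups = {}
    for term, vec in term_vectors.items():
        groups.setdefault(tuple(vec), []).append(term)
    return list(groups.values())

# "hot" and "warm" share the vector (1, 0, 2), so they merge losslessly.
groups = lossless_superterms({
    "hot": [1, 0, 2], "warm": [1, 0, 2], "cold": [0, 3, 1],
})
# groups == [["hot", "warm"], ["cold"]]
```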
Lossy Term Filtering
General case: ε_V > 0 and ε_D > 0. NP-hard!
A greedy algorithm:
- Step 1: consider the constraint given by ε_V (similar to the greedy set-cover algorithm). Locality-sensitive hashing (LSH) can further reduce the size of candidate superterms.
- Step 2: verify the validity of the candidates with ε_D.
Computational complexity: O(|V|² + |V|T|D| + T(|V|² + T|D|)), where the number of iterations T << |V|; the distance computations can be parallelized.
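Step 1's greedy covering can be sketched as follows. This is a simplified stand-in (it uses distance to a center term as a proxy for d(w, s), and omits the ε_D verification of Step 2), not the paper's algorithm:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def greedy_superterms(term_vectors, eps_V):
    """Greedy candidate generation, in the spirit of greedy set cover.

    Repeatedly pick the uncovered term whose eps_V-ball covers the most
    remaining terms, and merge that ball into one candidate superterm.
    Candidates would still need verification against eps_D (omitted).
    """
    uncovered = set(term_vectors)
    groups = []
    while uncovered:
        best = max(
            uncovered,
            key=lambda w: sum(
                euclidean(term_vectors[w], term_vectors[u]) <= eps_V
                for u in uncovered),
        )
        ball = [u for u in uncovered
                if euclidean(term_vectors[best], term_vectors[u]) <= eps_V]
        groups.append(ball)
        uncovered -= set(ball)
    return groups
```

For example, with vectors A = (2, 2), B = (1, 1), C = (10, 0) and ε_V = 2, the nearby pair {A, B} is covered by one candidate superterm while C stays on its own.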
Efficient Candidate Superterm Generation: LSH
Basic idea: terms that receive the same hash value (under several randomized hash projections) are close to each other in the original space.
Hash function for the Euclidean distance [Datar et al., 2004]:
h_{r,b}(w) = ⌊(r·w + b) / w(ε_V)⌋,
where r is a projection vector drawn from the Gaussian distribution and w(ε_V) is the width of the buckets (assigned empirically).
We then search for w(ε_V)-near neighbors only within the buckets.
(Figure modified from http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
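The hash family of [Datar et al., 2004] is short enough to write out. The sketch below uses assumed parameters (4 projections, a hand-picked width standing in for w(ε_V)); it only illustrates the bucketing, not the paper's full candidate-generation pipeline:

```python
import math
import random

def lsh_buckets(term_vectors, width, n_hashes=4, seed=0):
    """Bucket terms with h(w) = floor((r.w + b) / width) per projection.

    r is Gaussian, b is uniform in [0, width); terms sharing all bucket
    indices become candidate near neighbors. width plays the role of
    w(eps_V) and is assigned empirically.
    """
    rng = random.Random(seed)
    dim = len(next(iter(term_vectors.values())))
    tables = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
               rng.uniform(0.0, width)) for _ in range(n_hashes)]
    buckets = {}
    for term, vec in term_vectors.items():
        key = tuple(
            math.floor((sum(r_i * x for r_i, x in zip(r, vec)) + b) / width)
            for r, b in tables)
        buckets.setdefault(key, []).append(term)
    return buckets

buckets = lsh_buckets({"A": [2.0, 2.0], "B": [2.1, 1.9], "C": [40.0, -5.0]},
                      width=4.0)
# Nearby terms like A and B are likely (not guaranteed) to share a bucket;
# a distant term like C almost always lands elsewhere.
```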
Experiments: Settings
Datasets:
- Academic: ArnetMiner (10,768 papers and 8,212 terms)
- 20-Newsgroups (18,774 postings and 61,188 terms)
Baselines:
- Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
- Supervised method (only on classification): chi-square statistic (CHIMAX)
Experiments (cont'd)
Filtering results with the Euclidean metric. Findings: the retained term ratio decreases as the error bounds grow; with the same bound for terms, raising the bound for documents lowers the term ratio, and likewise with the same bound for documents when the bound for terms is raised.
Experiments (cont'd)
The information loss in terms of error: average errors and maximum errors (figures omitted).
Experiments (cont'd)
Efficiency: performance of the parallel algorithm and LSH (* optimal result among different choices of the bucket width, with ε_V tuned).
Experiments (cont'd)
Applications: clustering (ArnetMiner), classification (20-Newsgroups), document retrieval (ArnetMiner).
Conclusion
- Formally defined the term filtering problem and performed a theoretical investigation.
- Developed efficient algorithms for the problems.
- Validated the approach through an extensive set of experiments with multiple real-world text mining applications.
Thank you!
References
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature Selection Methods for Text Classification. In KDD '07, pp. 230-239.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG '04, pp. 253-262.
Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
Yi, L. (2003). Web Page Cleaning for Web Mining through Feature Weighting. In IJCAI '03, pp. 43-50.