Term Filtering with Bounded Error

Similar documents
Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing

Benchmarking and Improving Recovery of Number of Topics in Latent Dirichlet Allocation Models

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions

Dimension Reduction (PCA, ICA, CCA, FLD,

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

Iterative Laplacian Score for Feature Selection

A Framework of Detecting Burst Events from Micro-blogging Streams

Improving Topic Models with Latent Feature Word Representations

Randomized Algorithms

Latent Dirichlet Allocation Based Multi-Document Summarization

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Topic Models and Applications to Short Documents

Hash-based Indexing: Application, Impact, and Realization Alternatives

Algorithms for Nearest Neighbors

Lecture 14 October 16, 2014

IPSJ SIG Technical Report Vol. 2014-MPS-100: Time Series Topic Model Considering Dependence to Multiple Topics

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University

Lecture 8 January 30, 2014

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Latent Dirichlet Allocation (LDA)

Prediction of Citations for Academic Papers

Spectral Hashing: Learning to Leverage 80 Million Images

Incorporating Social Context and Domain Knowledge for Entity Recognition

Point-of-Interest Recommendations: Learning Potential Check-ins from Friends

Active Learning for Sparse Bayesian Multilabel Classification

Manifold Coarse Graining for Online Semi-supervised Learning

Online Bayesian Passive-Aggressive Learning

Latent Dirichlet Allocation and Singular Value Decomposition based Multi-Document Summarization

Collaborative Topic Modeling for Recommending Scientific Articles

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Modeling User Rating Profiles For Collaborative Filtering

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Large-Scale Behavioral Targeting


Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Algorithms for Data Science: Lecture on Finding Similar Items

Ruslan Salakhutdinov Joint work with Geoff Hinton. University of Toronto, Machine Learning Group

Comparative Summarization via Latent Dirichlet Allocation

Compressed Fisher vectors for LSVR

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Comparing Relevance Feedback Techniques on German News Articles

Kernel Density Topic Models: Visual Topics Without Visual Words

Evaluation Methods for Topic Models

Piazza Recitation session: Review of linear algebra. Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134

Modern Information Retrieval

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Data Mining Techniques

Scaling Neighbourhood Methods

Collaborative topic models: motivations cont

Comparative Document Analysis for Large Text Corpora

Learning Task Grouping and Overlap in Multi-Task Learning

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Collaborative User Clustering for Short Text Streams

A Probabilistic Model for Canonicalizing Named Entity Mentions. Dani Yogatama Yanchuan Sim Noah A. Smith

Cholesky Decomposition Rectification for Non-negative Matrix Factorization

arxiv: v1 [stat.ml] 5 Dec 2016

Document and Topic Models: plsa and LDA

Unified Modeling of User Activities on Social Networking Sites

Content-based Recommendation

Composite Quantization for Approximate Nearest Neighbor Search

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

Online Bayesian Passive-Aggressive Learning

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

arxiv: v1 [cs.ir] 25 Oct 2015

Introduction of Recruit

Modern Information Retrieval

Locality Sensitive Hashing

Topic Modeling: Beyond Bag-of-Words

Classical Predictive Models

PCA, Kernel PCA, ICA

Linear Spectral Hashing

Examples are not Enough, Learn to Criticize! Criticism for Interpretability

MULTIPLEKERNELLEARNING CSE902

Variable Latent Semantic Indexing

Ad Placement Strategies

ECS289: Scalable Machine Learning

Recent Advances in Bayesian Inference Techniques

ECS289: Scalable Machine Learning

Tailored Bregman Ball Trees for Effective Nearest Neighbors

Gaussian Mixture Model

Chap 2: Classical models for information retrieval

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Manning & Schuetze, FSNLP, (c)

arxiv: v1 [cs.db] 2 Sep 2014

Approximate Spectral Clustering via Randomized Sketching

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

USING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Transcription:

Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn, lwthucs@gmail.com
Presented at ICDM 2010, Sydney, Australia

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Introduction
The term-document correlation matrix M is a |V| × |D| co-occurrence matrix whose entry (i, j) is the number of times term w_i occurs in document d_j. It is used in various text mining tasks:
- text classification [Dasgupta et al., 2007]
- text clustering [Blei et al., 2003]
- information retrieval [Wei and Croft, 2006]
Disadvantage: high computation cost and intensive memory access. How can we overcome this disadvantage?
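
As an aside not in the original slides, here is a minimal Python sketch of how such a term-document count matrix can be assembled; the toy two-document corpus mirrors the A/B example used later, and all names are illustrative.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = {
    "d1": ["a", "a", "b"],
    "d2": ["a", "a", "b"],
}

# Vocabulary V and document ids D.
vocab = sorted({w for tokens in docs.values() for w in tokens})
doc_ids = sorted(docs)

# M[i][j] = number of times term vocab[i] occurs in document doc_ids[j].
counts = {d: Counter(tokens) for d, tokens in docs.items()}
M = [[counts[d][w] for d in doc_ids] for w in vocab]

print(vocab)  # ['a', 'b']
print(M)      # [[2, 2], [1, 1]]
```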

Introduction (cont'd)
Three general methods:
- improve the algorithm itself
- high-performance platforms: multicore processors and distributed machines [Chu et al., 2006]
- feature selection approaches [Yi, 2003] (we follow this line)

Introduction (cont'd)
Basic assumption: each term captures more or less information.
Our goal: a subspace of features (terms) with minimal information loss.
Conventional solution:
- Step 1: measure each individual term, or its dependency to a group of terms (e.g., information gain or mutual information).
- Step 2: remove features with low scores in importance or high scores in redundancy (sketched below).
Limitations:
- Is there a generic definition of information loss for a single term and for a set of terms, under any metric?
- What about the information loss of each individual document, and of a set of documents?
- Why should we consider the information loss on documents?
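
For contrast with the proposed method, a small illustrative sketch (mine, not from the paper) of the conventional score-and-threshold style of filtering, using document frequency as the importance score; the information gain or mutual information scores named on the slide would slot in the same way, and the threshold below is arbitrary.

```python
def df_filter(M, vocab, min_df=2):
    """Keep terms that occur in at least min_df documents; drop the rest."""
    kept = []
    for row, term in zip(M, vocab):
        df = sum(1 for count in row if count > 0)  # document frequency of the term
        if df >= min_df:
            kept.append(term)
    return kept

# 'c' appears in only one document, so it is filtered out.
print(df_filter([[2, 2], [1, 1], [0, 3]], ["a", "b", "c"], min_df=2))  # ['a', 'b']
```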

Introduction (cont'd)
Consider a simple example: Doc 1 = "A A B" and Doc 2 = "A A B", so the term vectors are w_A = (2, 2) and w_B = (1, 1).
Question: is any information lost if A and B are substituted by a single term S?
- Considering only the information loss for terms, with ε_V defined by a state-of-the-art pseudometric such as cosine similarity, A and B look interchangeable, so the substitution appears lossless.
- Considering both terms and documents, it does lose information: both documents emphasize A more than B, so merging them causes information loss of documents!
What we have done:
- defined information loss for both terms and documents
- formulated the Term Filtering with Bounded Error problem
- developed efficient algorithms

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Problem Definition - Superterm
We want less information loss and a smaller term space, so we group terms into superterms.
Superterm: s = {n(s, d_j)}_j, where n(s, d_j) is the number of occurrences of superterm s in document d_j.

Problem Definition - Information loss
How much information is lost during term merging? We measure it with user-specified distance measures d(., .), e.g., the Euclidean metric or cosine similarity.
A superterm s substitutes a set of terms w_i; its occurrences can be chosen with different methods: winner-take-all, average-occurrence, etc.
- Information loss of a term: error_V(w) = d(w, s).
- Information loss of a document: error_D(d) = d(d, d'), where d' is the transformed representation of document d after merging.
[Figure: Doc 1 = Doc 2 = "A A B", so A has counts (2, 2) and B has counts (1, 1); after merging A and B into S with average occurrence, both term slots in each document read 1.5, giving the error_V and error_D shown. A worked version follows below.]
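
A worked version of the Doc 1 / Doc 2 example (my sketch, not the authors' code): A and B are merged into a superterm S by average-occurrence, the term-level loss is measured with cosine distance (as in the slide's "terms only" view) and the document-level loss with the Euclidean metric. The numbers match the figure: S sits at (1.5, 1.5), the term loss under cosine is zero, and the document loss is not.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Term vectors over (Doc 1, Doc 2): A occurs twice in each document, B once in each.
w_A, w_B = (2, 2), (1, 1)

# Average-occurrence superterm S replacing {A, B}.
s = tuple((a + b) / 2 for a, b in zip(w_A, w_B))    # (1.5, 1.5)

# Term-level loss: cosine distance from each merged term to S is (numerically) zero.
print(cosine_distance(w_A, s), cosine_distance(w_B, s))

# Document-level loss: Doc 1 is (2, 1) over (A, B); after merging, both slots read 1.5.
doc1, doc1_merged = (2, 1), (1.5, 1.5)
print(euclidean(doc1, doc1_merged))                 # ~0.707, clearly non-zero
```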

Problem Definition - TFBE
Term Filtering with Bounded Error (TFBE): minimize the number of superterms |S| subject to the following constraints:
(1) S = f(V) = ∪_{s∈S} s, where f is the mapping function from terms to superterms;
(2) for all w ∈ V, error_V(w) ≤ ε_V;
(3) for all d ∈ D, error_D(d) ≤ ε_D;
with the bounds ε_V and ε_D specified by the user.
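
To make the constraints concrete, here is a small feasibility checker, an illustration under my own assumptions rather than the paper's algorithm: given a proposed partition of the vocabulary into superterms, it builds each superterm by average-occurrence, then tests constraint (2) for every term and constraint (3) for every document under the Euclidean metric.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def check_tfbe(term_vectors, groups, eps_V, eps_D):
    """term_vectors: {term: tuple of per-document counts};
    groups: a partition of the vocabulary into superterms (list of sets)."""
    n_docs = len(next(iter(term_vectors.values())))
    # One possible choice: the superterm vector is the average of its members.
    super_vecs = [tuple(sum(term_vectors[w][j] for w in g) / len(g) for j in range(n_docs))
                  for g in groups]
    # Constraint (2): every merged term stays within eps_V of its superterm.
    ok = all(euclidean(term_vectors[w], super_vecs[i]) <= eps_V
             for i, g in enumerate(groups) for w in g)
    # Constraint (3): every document's representation moves by at most eps_D.
    for j in range(n_docs):
        diffs = [term_vectors[w][j] - super_vecs[i][j]
                 for i, g in enumerate(groups) for w in g]
        ok = ok and math.sqrt(sum(x * x for x in diffs)) <= eps_D
    return ok

vectors = {"A": (2, 2), "B": (1, 1)}
print(check_tfbe(vectors, [{"A", "B"}], eps_V=1.0, eps_D=0.5))    # False: document error ~0.71 > 0.5
print(check_tfbe(vectors, [{"A"}, {"B"}], eps_V=0.0, eps_D=0.0))  # True: nothing merged, zero loss
```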

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Lossless Term Filtering
Special case: no information loss is allowed for any term or document, i.e., ε_V = 0 and ε_D = 0.
Theorem: the exact optimal solution is obtained by grouping terms that have the same vector representation.
Algorithm:
- Step 1: find local superterms for each document (terms with the same occurrences within that document).
- Step 2: add, split, or remove superterms from the global superterm set.
Is this case applicable in practice? Yes: on the Baidu Baike dataset (1,531,215 documents), the vocabulary shrinks from 1,522,576 terms to 714,392 superterms, a reduction of more than 50%.
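
The theorem can be realized directly with a dictionary keyed by each term's document-occurrence vector; this one-pass sketch is mine, not the paper's two-step local/global procedure, which I read as an optimization for large sparse corpora.

```python
from collections import defaultdict

def lossless_superterms(term_vectors):
    """Group terms whose document-occurrence vectors are identical (zero loss)."""
    groups = defaultdict(list)
    for term, vec in term_vectors.items():
        groups[tuple(vec)].append(term)
    return list(groups.values())

vectors = {"A": (2, 2), "B": (1, 1), "C": (2, 2)}
print(lossless_superterms(vectors))   # [['A', 'C'], ['B']]
```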

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Lossy Term Filtering
General case: ε_V > 0 and ε_D > 0. The problem is NP-hard, so we use a greedy algorithm:
- Step 1: handle the constraint given by ε_V, similarly to the greedy cover algorithm.
- Step 2: verify the validity of the candidates against ε_D.
Computational complexity: O(|V|^2 + |V| T |D| + T (|V|^2 + T |D|)), where the number of iterations T << |V|; the distance computations can be parallelized.
Can we further reduce the size of the candidate superterm set? Yes, with locality-sensitive hashing (LSH).
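
A rough, self-contained sketch of the greedy step as I read it from the slide (not the authors' implementation): repeatedly pick the term whose ε_V-ball covers the most still-uncovered terms, and accept the merge only if the document bound ε_D still holds, otherwise keep the term by itself. Superterms are again built by average-occurrence; all parameter choices below are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def doc_error_ok(term_vectors, groups, eps_D):
    """Constraint (3): with average-occurrence superterms, every document's
    representation may move by at most eps_D."""
    n_docs = len(next(iter(term_vectors.values())))
    for j in range(n_docs):
        diffs = []
        for g in groups:
            merged = sum(term_vectors[w][j] for w in g) / len(g)
            diffs.extend(term_vectors[w][j] - merged for w in g)
        if math.sqrt(sum(x * x for x in diffs)) > eps_D:
            return False
    return True

def greedy_tfbe(term_vectors, eps_V, eps_D):
    """Greedy cover over eps_V-balls, with every candidate merge checked against eps_D."""
    uncovered = set(term_vectors)
    groups = []
    while uncovered:
        # Pick the term whose eps_V-ball covers the most still-uncovered terms.
        best_term, best_ball = None, set()
        for w in uncovered:
            ball = {u for u in uncovered
                    if euclidean(term_vectors[w], term_vectors[u]) <= eps_V}
            if len(ball) > len(best_ball):
                best_term, best_ball = w, ball
        # Verify eps_D on the tentative grouping (remaining terms kept as singletons);
        # if it fails, leave the chosen term on its own.
        tentative = groups + [best_ball] + [{u} for u in uncovered - best_ball]
        group = best_ball if doc_error_ok(term_vectors, tentative, eps_D) else {best_term}
        groups.append(group)
        uncovered -= group
    return groups

vectors = {"A": (2, 2), "B": (1, 1), "C": (2.1, 2.0)}
print(greedy_tfbe(vectors, eps_V=0.5, eps_D=0.5))   # e.g. [{'A', 'C'}, {'B'}]
```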

Efficient Candidate Superterm Generation - LSH
Basic idea: points that receive the same hash value (under several randomized hash projections) are close to each other in the original space.
Hash function for the Euclidean distance [Datar et al., 2004]:
h_{r,b}(w) = ⌊ (r · w + b) / w(ε_V) ⌋
where r is a projection vector drawn from the Gaussian distribution, b is a random offset, and w(ε_V) is the width of the buckets (assigned empirically).
We then search for w(ε_V)-near neighbors in all the buckets (a small sketch follows below).
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
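
A compact sketch of the p-stable hashing scheme of [Datar et al., 2004] used here to generate candidate superterms (my illustration; the bucket width, the number of hash functions, and the seed are arbitrary). Terms that land in the same bucket under all the random projections become candidate near neighbors, still to be verified against the actual bounds.

```python
import math
import random
from collections import defaultdict

def make_hash(dim, width, rng):
    """One p-stable hash: h(w) = floor((r . w + b) / width), Gaussian r, uniform b."""
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda vec: math.floor((sum(ri * vi for ri, vi in zip(r, vec)) + b) / width)

def lsh_buckets(term_vectors, width=1.0, n_hashes=4, seed=0):
    """Bucket terms by the tuple of their hash values; terms sharing a bucket are
    candidate members of the same superterm."""
    rng = random.Random(seed)
    dim = len(next(iter(term_vectors.values())))
    hashes = [make_hash(dim, width, rng) for _ in range(n_hashes)]
    buckets = defaultdict(list)
    for term, vec in term_vectors.items():
        key = tuple(h(vec) for h in hashes)
        buckets[key].append(term)
    return buckets

vectors = {"A": (2, 2), "B": (1, 1), "C": (2.05, 1.95)}
for key, terms in lsh_buckets(vectors).items():
    print(key, terms)   # A and C will typically share a bucket; B typically will not
```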

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Experiments - Settings
Datasets:
- Academic: ArnetMiner (10,768 papers and 8,212 terms)
- 20-Newsgroups (18,774 postings and 61,188 terms)
Baselines:
- Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
- Supervised method (only for classification): Chi-statistic (CHIMAX)

Experiments (cont'd)
Filtering results with the Euclidean metric. Findings:
- As the error bounds grow, the term ratio drops.
- With the same bound for terms, a larger bound for documents gives a smaller term ratio.
- With the same bound for documents, a larger bound for terms gives a smaller term ratio.

Experiments (cont'd)
The information loss in terms of error. [Figures: average errors and maximum errors.]

Experiments (cont'd)
Efficiency: performance of the parallel algorithm and of LSH*.
* Optimal result over different choices of the bucket width w(ε_V) (tuned).

Experiments (cont'd)
Applications: clustering (Arnet), classification (20NG), document retrieval (Arnet).

Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

Conclusion
- Formally define the problem and perform a theoretical investigation.
- Develop efficient algorithms for the problems.
- Validate the approach through an extensive set of experiments with multiple real-world text mining applications.

Thank you!

References
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD '07, pp. 230-239.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG '04, pp. 253-262.
Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03, pp. 43-50.