JSAI Special Interest Group on Interactive Information Access and Visualization Mining, 3rd Meeting (SIG-AM)


Pseudo Labeled Latent Dirichlet Allocation

Satoko Suzuki (1), Ichiro Kobayashi (2)
(1) Department of Information Science, Faculty of Science, Ochanomizu University
(2) Advanced Sciences, Graduate School of Humanities and Sciences, Ochanomizu University
(Contact: 2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-0012; E-mail: g0920519@is.ocha.ac.jp)

Abstract: In recent years, topic models have been widely used in many applications such as document summarization and document clustering. Labeled latent Dirichlet allocation (L-LDA) was proposed as an extension of latent Dirichlet allocation (LDA): it regards the tags, i.e., labels, that humans attach to documents as expressions of the documents' contents, and uses them as supervised information when estimating the documents' latent topics. Moreover, L-LDA is reported to exceed LDA in topic estimation ability. However, ordinary documents usually do not come with such tags, so the applicability of L-LDA is considerably limited. In this study, we therefore generate pseudo labels from the very documents whose latent topics are to be estimated, instead of relying on human-assigned tags, and aim to make L-LDA available for all documents.

1 Introduction

[The Japanese text of this section was largely lost in extraction. It introduces latent Dirichlet allocation (LDA) [1] and Labeled LDA (L-LDA) [2], observes that L-LDA requires human-assigned labels that ordinary documents lack, and motivates generating pseudo labels so that L-LDA can be applied to unlabeled documents. Section 2 reviews L-LDA, Section 3 describes the pseudo-label generation methods, and Sections 4 and 5 report and discuss experiments comparing pseudo-labeled L-LDA with LDA.]

2 Labeled LDA

[Figure 1: Graphical model of L-LDA. The figure itself was lost in extraction.]

L-LDA [2] is a supervised extension of LDA that ties each topic to a label and restricts the topic distribution θ of each document d to the labels attached to d. Each document carries a binary label vector

    Λ^(d) = (l_1, ..., l_K),  l_k ∈ {0, 1},   (1)

where K is the total number of labels and l_k indicates whether label k is attached to d. The set of labels attached to d is

    λ^(d) = {k | Λ^(d)_k = 1}.   (2)

From λ^(d), a projection matrix L^(d) of size |λ^(d)| × K is defined by

    L^(d)_ij = 1 if λ^(d)_i = j, and 0 otherwise.   (3)

L^(d) restricts the Dirichlet parameter α to the document's own labels, α^(d) = L^(d) α, so that θ is drawn from Dirichlet(α^(d)) over the labels in λ^(d) only; apart from this restriction, the generative process is the same as LDA's.
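To make the label machinery concrete, the following is a minimal Python sketch (ours, not code from the paper; all names are our own) of the label vector of Eq. (1), the label set of Eq. (2), the projection matrix of Eq. (3), and the restricted prior α^(d):

    import numpy as np

    K = 10                      # total number of labels in the corpus
    alpha = np.full(K, 0.1)     # symmetric Dirichlet parameter over the K labels

    def label_vector(doc_labels):
        """Eq. (1): binary indicator vector Lambda^(d) over all K labels."""
        v = np.zeros(K, dtype=int)
        v[list(doc_labels)] = 1
        return v

    def label_set(v):
        """Eq. (2): lambda^(d), the indices of the labels attached to d."""
        return [k for k in range(K) if v[k] == 1]

    def projection_matrix(lam):
        """Eq. (3): |lambda^(d)| x K matrix picking out d's labels."""
        L = np.zeros((len(lam), K), dtype=int)
        for i, k in enumerate(lam):
            L[i, k] = 1
        return L

    Lam_d = label_vector({1, 4, 7})   # document d carries labels 1, 4 and 7
    lam_d = label_set(Lam_d)          # -> [1, 4, 7]
    L_d = projection_matrix(lam_d)    # 3 x 10 projection matrix
    alpha_d = L_d @ alpha             # alpha^(d) = L^(d) alpha, length 3
    # theta^(d) ~ Dirichlet(alpha_d) places mass on labels 1, 4 and 7 only.

The restriction of θ to the document's own labels is the only point at which L-LDA's generative process departs from LDA's.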

3 Generating Pseudo Labels

[Much of the Japanese text of this section was lost in extraction.] To apply L-LDA to documents that carry no human-assigned tags, we generate pseudo labels from the document collection itself by two approaches: method 1 selects label words directly from the documents using TF-IDF and pointwise mutual information (PMI) (Section 3.1), and method 2 clusters the documents using the Leader-Follower method and the method of Crouch [5][6] and treats the resulting clusters as labels (Section 3.2). Code sketches of both approaches are given at the end of this section.

3.1 Method 1: label words selected by PMI

Following the topic-labeling work of Newman et al. [3], candidate label words are first extracted from the documents by TF-IDF, and PMI is then used to decide which candidates to adopt: words whose PMI score exceeds a threshold become pseudo labels. [The exact scoring scheme was lost in extraction.]

3.2 Method 2: labels from document clusters

Each document d_i is represented by the weighted term vector

    w_ij = (log x_ij + 1.0) log(N / n_j),   (4)

where x_ij is the frequency of term t_j in document d_i, N is the number of documents, and n_j is the number of documents containing t_j. A cluster C_h is represented by term weights

    ŵ_hj = log x_ij + 1.0,   (5)

computed from the frequencies of the documents d_i belonging to C_h [the exact aggregation over cluster members was lost in extraction]. The similarity between a document d_i and a cluster C_k is

    s(d_i, C_k) = Σ_{j=1}^{M} min(w_ij, ŵ_kj) / min(Σ_{j=1}^{M} w_ij, Σ_{j=1}^{M} ŵ_kj),   (6)

where M is the number of terms. With this similarity, the documents are clustered by the Leader-Follower method and by Crouch's method [5][6]; the six numbered steps of the procedure were lost in extraction, but a Leader-Follower pass in general scans the documents once, assigns each document to the most similar existing cluster when the similarity (6) clears a threshold, and otherwise starts a new cluster from it. Each resulting cluster is used as one pseudo label; the two clustering variants are referred to as methods 2a and 2b below.
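The concrete PMI scoring of Section 3.1 was lost in extraction, so the following Python sketch is generic rather than the paper's formulation: it assumes document-level co-occurrence, a base-2 logarithm, and that a TF-IDF-selected candidate is kept when its best PMI against the other candidates clears the threshold (the names pmi and select_labels are ours):

    import numpy as np

    def pmi(a, b, docs):
        """Document-level PMI(a, b) = log2 P(a, b) / (P(a) P(b))."""
        n = len(docs)
        p_a = sum(a in d for d in docs) / n
        p_b = sum(b in d for d in docs) / n
        p_ab = sum(a in d and b in d for d in docs) / n
        return np.log2(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

    def select_labels(candidates, docs, threshold):
        """Keep TF-IDF candidates whose best pairwise PMI clears the threshold."""
        keep = []
        for w in candidates:
            best = max(pmi(w, v, docs) for v in candidates if v != w)
            if best >= threshold:
                keep.append(w)
        return keep

    # docs is a list of token sets, e.g. [{"nasa", "orbit"}, {"bat", "inning"}, ...];
    # the paper varies the threshold over roughly [4.5, 6.2] (cf. Table 2).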
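For Section 3.2, since the six clustering steps were lost, here is a hedged single-pass Leader-Follower sketch in Python built on Eqs. (4)-(6); summing member frequencies to form the cluster weights of Eq. (5) is our assumption, not necessarily the paper's aggregation:

    import numpy as np

    def doc_weights(X):
        """Eq. (4): w_ij = (log x_ij + 1) log(N / n_j), zero where x_ij = 0."""
        N, M = X.shape
        n = np.maximum((X > 0).sum(axis=0), 1)   # document frequency of each term
        idf = np.log(N / n)
        return np.where(X > 0, (np.log(np.maximum(X, 1)) + 1.0) * idf, 0.0)

    def cluster_weights(freq):
        """Eq. (5): w-hat_hj = log x_hj + 1 over the cluster's term frequencies."""
        return np.where(freq > 0, np.log(freq) + 1.0, 0.0)

    def similarity(w_i, w_hat):
        """Eq. (6): overlap similarity between a document and a cluster."""
        return np.minimum(w_i, w_hat).sum() / (min(w_i.sum(), w_hat.sum()) + 1e-12)

    def leader_follower(X, threshold):
        """One pass: join the most similar cluster, or found a new one."""
        W = doc_weights(X)
        members, freqs = [], []                  # per-cluster doc ids / term counts
        for i in range(X.shape[0]):
            sims = [similarity(W[i], cluster_weights(f)) for f in freqs]
            if sims and max(sims) >= threshold:
                k = int(np.argmax(sims))
                members[k].append(i)
                freqs[k] += X[i]
            else:
                members.append([i])
                freqs.append(X[i].astype(float).copy())
        return members

Each returned cluster then serves as one pseudo label attached to its member documents before L-LDA is trained; the similarity threshold controls how many clusters, and hence labels, are produced (cf. Tables 3 and 4).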

4 Experiments

We compare L-LDA trained with the generated pseudo labels against plain LDA.

4.1 Experimental settings

We use the 20 Newsgroups corpus (http://qwone.com/~jason/20newsgroups/), which consists of 20 categories. We select 10 categories, take 100 documents from each, and form two data sets of 1,000 documents each, setA and setB, shown in Table 1.

Table 1: Newsgroup categories in setA and setB

    setA                      setB
    alt.atheism               alt.atheism
    comp.graphics             comp.graphics
    comp.sys.mac.hardware     comp.sys.ibm.pc.hardware
    rec.sport.baseball        misc.forsale
    sci.med                   rec.autos
    sci.crypt                 rec.motorcycles
    sci.electronics           sci.electronics
    sci.space                 sci.space
    talk.politics.guns        talk.politics.guns
    soc.religion.christian    talk.politics.misc

For method 1, candidate label words are extracted by TF-IDF [a lost fragment mentions the value 30, presumably the number of candidates], and the PMI threshold is varied over [4.5, 6.2] for setA and [4.8, 6.2] for setB. For method 2, the similarity threshold is varied over [0.1, 0.9] in steps of 0.1 for method 2a and over [0.02, 0.1] for method 2b; the resulting numbers of labels are listed in Tables 2-4. L-LDA is run with α = 0.1 and η = 0.1 [a lost fragment mentions 16 for setA and 28 for setB, presumably numbers of topics]. The LDA baseline is likewise run with α = 0.1 and η = 0.1, and its estimated topic distributions θ are clustered by k-means into 10 clusters, the number of selected categories.

4.2 Evaluation measure

Following [4], clustering quality is evaluated by the mutual information

    MI(L, A) = Σ_{l_i ∈ L, α_j ∈ A} P(l_i, α_j) log_2 ( P(l_i, α_j) / (P(l_i) P(α_j)) ),   (7)

where L = {l_1, ..., l_k} is the cluster assignment obtained by k-means, A = {α_1, ..., α_k} is the true category assignment, P(l_i) is the probability that a document falls in cluster l_i, P(α_j) the probability that it falls in category α_j, and P(l_i, α_j) their joint probability. MI is normalized into [0, 1] by

    MI^ = MI(L, A) / MI(A, A).   (8)

4.3 Results

[Parts of this subsection were lost in extraction.] k-means is run with 10 clusters on the estimated topic distributions, and the normalized MI of (8) is reported. Figures 2-4 plot the normalized MI for setA (methods 1, 2a, and 2b, respectively) and Figures 5-7 the same for setB, each against the LDA baseline. A sketch of this evaluation pipeline follows.
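The evaluation pipeline can be sketched in Python as follows (ours; the use of scikit-learn's KMeans is an assumption about tooling, and theta and gold are placeholders for the estimated topic distributions and the true categories):

    import numpy as np
    from sklearn.cluster import KMeans

    def mutual_information(L, A):
        """Eq. (7): MI between cluster assignment L and category assignment A."""
        mi = 0.0
        for l in np.unique(L):
            for a in np.unique(A):
                p_la = np.mean((L == l) & (A == a))
                p_l, p_a = np.mean(L == l), np.mean(A == a)
                if p_la > 0:
                    mi += p_la * np.log2(p_la / (p_l * p_a))
        return mi

    def normalized_mi(L, A):
        """Eq. (8): MI(L, A) / MI(A, A); a perfect clustering scores 1."""
        return mutual_information(L, A) / mutual_information(A, A)

    # theta: (n_docs, n_topics) topic proportions from (L-)LDA; gold: categories.
    theta = np.random.dirichlet(np.ones(16), size=1000)  # placeholder for theta
    gold = np.random.randint(0, 10, size=1000)           # placeholder for labels
    pred = KMeans(n_clusters=10, n_init=10).fit_predict(theta)
    print(normalized_mi(pred, gold))

Note that MI(A, A) reduces to the entropy of the true categories, so the normalization makes a perfect clustering score exactly 1.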

[Figures 2-7: normalized MI. Figure 2: method 1 (setA); Figure 3: method 2a (setA); Figure 4: method 2b (setA); Figure 5: method 1 (setB); Figure 6: method 2a (setB); Figure 7: method 2b (setB). The plots themselves were lost in extraction.]

[A lost passage discusses Figures 4 and 7 over thresholds in [0.03, 0.1], comparing the Leader-Follower and Crouch variants with the LDA baseline.] Table 2 lists the number of pseudo labels produced by method 1 at each PMI threshold; the meaning of the parenthesized values was lost in extraction.

Table 2: Number of labels, method 1

    Threshold   setA        setB
    4.5         2 (9)       -
    4.6         2 (9)       -
    4.7         3 (10)      -
    4.8         3 (10)      2 (8)
    4.9         5 (12)      2 (8)
    5.0         6 (13)      1 (7)
    5.1         5 (12)      1 (7)
    5.2         22 (29)     22 (28)
    5.3         20 (27)     27 (33)
    5.4         20 (27)     27 (33)
    5.5         16 (23)     24 (30)
    5.6         204 (211)   193 (199)
    5.7         204 (211)   193 (199)
    5.8         204 (211)   193 (199)
    5.9         125 (132)   124 (130)
    6.0         125 (132)   124 (130)
    6.1         125 (132)   124 (130)
    6.2         125 (132)   124 (130)

[A lost passage discusses method 1 at thresholds around 5.2, 5.6, and 5.9, and method 2 around threshold 0.2.] Figures 8 and 9 plot the normalized MI against the number of labels (0 to 200) for the Leader-Follower and Crouch variants together with the LDA baseline.

Table 3: Number of labels, method 2a

    Threshold   setA   setB
    0.1         185    212
    0.2         228    220
    0.3         202    179
    0.4         155    140
    0.5         102    93
    0.6         59     56
    0.7         28     25
    0.8         11     9
    0.9         2      6

Table 4: Number of labels, method 2b

    Threshold   setA   setB
    0.02        -      6
    0.03        19     21
    0.04        42     42
    0.05        69     80
    0.06        101    119
    0.07        118    144
    0.08        151    181
    0.09        169    194
    0.1         185    212

[Figure 8: normalized MI versus number of labels (setA). Figure 9: normalized MI versus number of labels (setB). The plots themselves were lost in extraction.]

5 Discussion

[The Japanese text of this section was largely lost in extraction. Surviving fragments indicate that it compares the Crouch and Leader-Follower variants of method 2 with the LDA baseline, discussing in particular the behavior of method 2b around the 0.03 threshold (for setB, thresholds from 0.03 to 0.7 are mentioned) and the conditions under which the pseudo-labeled models exceed LDA.]

6 Conclusion

[Heading inferred from a surviving "6"; most of the conclusion's text was lost in extraction. Surviving fragments suggest it summarizes the comparison between pseudo-labeled L-LDA and LDA and the role of the two clustering methods.]

[The remainder of the conclusion, which mentions LDA and TF-IDF, presumably as future work on label selection, was lost in extraction.]

References

[1] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[2] D. Ramage, D. Hall, R. Nallapati, C. D. Manning: Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora. In Proceedings of EMNLP 2009, pp. 248-256, 2009.
[3] D. Newman, J. H. Lau, K. Grieser, T. Baldwin: Automatic Evaluation of Topic Coherence. In Human Language Technologies: NAACL 2010, pp. 100-108, Los Angeles, California, 2010.
[4] G. Erkan: Language Model-Based Document Clustering Using Random Walks. In Proceedings of HLT-NAACL 2006, Association for Computational Linguistics, pp. 479-486, 2006.
[5] [Author and title in Japanese, lost in extraction.] Library and Information Science, No. 47, 2002.
[6] [Author and title in Japanese, lost in extraction.] Library and Information Science, No. 49, 2003.