JSAI Special Interest Group on Interactive Information Access and Visualization Mining (3rd Meeting), SIG-AM


Pseudo Labeled Latent Dirichlet Allocation

Satoko Suzuki 1   Ichiro Kobayashi 2
1 Department of Information Science, Faculty of Science, Ochanomizu University
2 Advanced Sciences, Graduate School of Humanities and Sciences, Ochanomizu University

Abstract: In recent years, topic models have been widely used in many applications such as document summarization and document clustering. Labeled latent Dirichlet allocation (L-LDA) was proposed as an extension of latent Dirichlet allocation (LDA): it regards the tags, i.e., labels, attached to documents by humans as expressions of the documents' contents and uses them as supervised information for estimating the documents' latent topics. Moreover, L-LDA has been reported to exceed LDA in topic estimation. However, ordinary documents usually carry no such tags, so the applicability of L-LDA is considerably limited. In this study, we therefore generate pseudo labels from the documents whose latent topics are to be estimated, in place of human-assigned tags, and aim to make L-LDA applicable to all documents.

1 Introduction

Latent Dirichlet allocation (LDA) [1] estimates the latent topics of documents without supervision. Labeled LDA (L-LDA) [2] instead uses the labels attached to documents as supervised information, and has been reported to exceed LDA in topic estimation. Since ordinary documents carry no labels, this study generates pseudo labels from the documents themselves so that L-LDA can be applied to them.

2 Labeled LDA

Figure 1: Graphical model of L-LDA.

In L-LDA, each document d carries a binary label vector

\Lambda^{(d)} = (l_1, \ldots, l_K), \quad l_k \in \{0, 1\}  (1)

where K is the number of labels and l_k indicates whether label k is attached to d. The set of labels attached to d is

\lambda^{(d)} = \{ k \mid \Lambda^{(d)}_k = 1 \}  (2)
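The label representation of Eqs. (1) and (2) can be sketched in a few lines. This is an illustrative Python fragment, not code from the paper; the label inventory `K_LABELS` is a made-up example.

```python
# Binary label vector Lambda^(d) (Eq. 1) and active label set lambda^(d) (Eq. 2).
# K_LABELS is a hypothetical label inventory used only for illustration.
K_LABELS = ["sci.space", "comp.graphics", "rec.autos"]  # K = 3

def label_vector(doc_labels):
    """Lambda^(d) = (l_1, ..., l_K): l_k = 1 iff label k is attached to d."""
    return [1 if lab in doc_labels else 0 for lab in K_LABELS]

def active_labels(Lam):
    """lambda^(d) = {k | Lambda_k^(d) = 1}, as 0-based indices."""
    return [k for k, v in enumerate(Lam) if v == 1]

Lam = label_vector({"sci.space", "rec.autos"})
print(Lam)                 # [1, 0, 1]
print(active_labels(Lam))  # [0, 2]
```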

A projection matrix L^{(d)} of size |\lambda^{(d)}| \times K is then defined for each document:

L^{(d)}_{ij} = \begin{cases} 1 & \text{if } \lambda^{(d)}_i = j \\ 0 & \text{otherwise} \end{cases}  (3)

Through L^{(d)}, the Dirichlet prior is restricted to the document's own labels, \alpha^{(d)} = L^{(d)} \alpha, so that the topic distribution \theta^{(d)} is defined only over \lambda^{(d)}.

3 Pseudo Label Generation

Pseudo labels are generated by clustering the documents; two clustering methods, the Leader-Follower method and Crouch's method [5][6], are used. Candidate label terms are extracted with TF-IDF and with PMI, following Newman et al. [3]. The weight of term t_j in document d_i is

w_{ij} = (\log x_{ij} + 1.0) \log(N / n_j)  (4)

where x_{ij} is the frequency of t_j in d_i, N is the number of documents, and n_j is the number of documents that contain t_j. The weight of t_j in a cluster C_h is

\hat{w}_{hj} = \frac{1}{|C_h|} \sum_{d_i \in C_h} \log x_{ij}  (5)

and the similarity between a document d_i and a cluster C_k is

s(d_i, C_k) = \frac{\sum_{j=1}^{M} \min(w_{ij}, \hat{w}_{kj})}{\min\left( \sum_{j=1}^{M} w_{ij}, \sum_{j=1}^{M} \hat{w}_{kj} \right)}  (6)

where M is the number of terms.
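As a concrete reading of Eqs. (4) and (6), the following Python sketch computes the term weights and the document-cluster similarity. It is a minimal illustration under the definitions above, not the authors' implementation.

```python
import math

def tfidf_weight(x_ij, N, n_j):
    # Eq. (4): w_ij = (log x_ij + 1.0) * log(N / n_j), assuming x_ij >= 1
    return (math.log(x_ij) + 1.0) * math.log(N / n_j)

def similarity(w_i, w_hat_k):
    # Eq. (6): overlap of a document's weights w_i with a cluster's
    # centroid weights w_hat_k, both indexed by the same M terms
    num = sum(min(a, b) for a, b in zip(w_i, w_hat_k))
    den = min(sum(w_i), sum(w_hat_k))
    return num / den

print(round(tfidf_weight(3, N=100, n_j=10), 2))  # 4.83
print(similarity([1.0, 2.0], [2.0, 1.0]))        # 2/3 overlap
```

A document is assigned to the cluster maximizing s(d_i, C_k), so the pseudo label of d_i is the cluster (or its keyword) with the highest similarity.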

4 Experiments

Two document sets, setA and setB, each consisting of 10 newsgroups, are constructed from the 20 Newsgroups corpus 1 (Table 1).

Table 1: Newsgroups used in setA and setB

setA: alt.atheism, comp.graphics, comp.sys.mac.hardware, rec.sport.baseball, sci.med, sci.crypt, sci.electronics, sci.space, talk.politics.guns, soc.religion.christian
setB: alt.atheism, comp.graphics, comp.sys.ibm.pc.hardware, misc.forsale, rec.autos, rec.motorcycles, sci.electronics, sci.space, talk.politics.guns, talk.politics.misc

For method 1 (TF-IDF/PMI keyword extraction) the threshold is varied over [4.5, 6.2] for setA and [4.8, 6.2] for setB; for methods 2a and 2b (Leader-Follower and Crouch clustering) the clustering threshold is varied over [0.1, 0.9] in steps of 0.1.

L-LDA is run with α = 0.1 and η = 0.1. For comparison, LDA (α = 0.1, η = 0.1) is run with 16 topics for setA and 28 topics for setB, and documents are clustered by applying k-means to the estimated topic distributions θ.

Clustering results are evaluated against the true 20 Newsgroups classes by mutual information [4]:

MI(L, A) = \sum_{l_i \in L, \alpha_j \in A} P(l_i, \alpha_j) \log_2 \frac{P(l_i, \alpha_j)}{P(l_i) P(\alpha_j)}  (7)

where L = \{l_1, l_2, \ldots, l_k\} is the set of clusters produced by k-means, A = \{\alpha_1, \ldots, \alpha_k\} is the set of true classes, P(l_i) and P(\alpha_j) are the probabilities that a document falls in l_i and in \alpha_j, and P(l_i, \alpha_j) is their joint probability. The score is normalized into [0, 1] by

\overline{MI} = \frac{MI(L, A)}{MI(A, A)}  (8)

k-means is run 10 times and the average MI is reported. Results for setA are shown in Figures 2-4 and for setB in Figures 5-7.

1 jason/20newsgroups/
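The evaluation of Eqs. (7) and (8) can be sketched as follows. This is a minimal Python version assuming hard cluster assignments, not the paper's code.

```python
import math
from collections import Counter

def mutual_information(labels_l, labels_a):
    # Eq. (7): MI(L, A) over the empirical joint distribution of
    # cluster assignments labels_l and true classes labels_a
    n = len(labels_l)
    p_l = Counter(labels_l)
    p_a = Counter(labels_a)
    p_la = Counter(zip(labels_l, labels_a))
    mi = 0.0
    for (li, aj), c in p_la.items():
        p = c / n
        mi += p * math.log2(p / ((p_l[li] / n) * (p_a[aj] / n)))
    return mi

def normalized_mi(clusters, classes):
    # Eq. (8): MI(L, A) / MI(A, A), so a perfect clustering scores 1.0
    return mutual_information(clusters, classes) / mutual_information(classes, classes)

print(normalized_mi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (perfect, up to relabeling)
```

Normalizing by MI(A, A), the entropy of the true classes, makes scores comparable across setA and setB, which have different class distributions.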

Figure 2: MI (setA)   Figure 3: MI, method 2a (setA)   Figure 4: MI, method 2b (setA)
Figure 5: MI (setB)   Figure 6: MI, method 2a (setB)   Figure 7: MI, method 2b (setB)

For the Leader-Follower and Crouch methods, MI is also examined with the clustering threshold varied over [0.03, 0.1] and compared with LDA (Figures 4-7).

Table 2: Threshold and number of labels for method 1 (setA and setB).

Figures 8 and 9 compare the MI of the Leader-Follower and Crouch methods (clustering threshold 0.2) with that of LDA.

Table 3: Threshold and number of labels for method 2a (setA and setB).
Table 4: Threshold and number of labels for method 2b (setA and setB).

Figure 8: MI (setA)   Figure 9: MI (setB)

5 Discussion

The clustering-based pseudo-labeling methods (the Leader-Follower method and Crouch's method) are compared with LDA on setA and setB.

6 Conclusion

We proposed generating pseudo labels from the documents themselves, using TF-IDF-based keyword extraction and clustering, so that L-LDA can be applied to documents without human-assigned tags.

References

[1] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[2] D. Ramage, D. Hall, R. Nallapati, C. D. Manning: Labeled LDA: A supervised topic model for credit attribution in multi-label corpora. EMNLP 2009, pp. 248-256, 2009.
[3] D. Newman, J. H. Lau, K. Grieser, T. Baldwin: Automatic Evaluation of Topic Coherence. Human Language Technologies: NAACL 2010, pp. 100-108, Los Angeles, California, 2010.
[4] G. Erkan: Language Model-Based Document Clustering Using Random Walks. Association for Computational Linguistics, 2006.
[5] Library and Information Science, No. 47, 2002.
[6] Library and Information Science, No. 49.


Lecture 22 Exploratory Text Analysis & Topic Models Lecture 22 Exploratory Text Analysis & Topic Models Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor [Some slides borrowed from Michael Paul] 1 Text Corpus

More information

Term Filtering with Bounded Error

Term Filtering with Bounded Error Term Filtering with Bounded Error Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn

More information

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank Shoaib Jameel Shoaib Jameel 1, Wai Lam 2, Steven Schockaert 1, and Lidong Bing 3 1 School of Computer Science and Informatics,

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Topic Models Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Low-Dimensional Space for Documents Last time: embedding space

More information

IPSJ SIG Technical Report Vol.2014-MPS-100 No /9/25 1,a) 1 1 SNS / / / / / / Time Series Topic Model Considering Dependence to Multiple Topics S

IPSJ SIG Technical Report Vol.2014-MPS-100 No /9/25 1,a) 1 1 SNS / / / / / / Time Series Topic Model Considering Dependence to Multiple Topics S 1,a) 1 1 SNS /// / // Time Series Topic Model Considering Dependence to Multiple Topics Sasaki Kentaro 1,a) Yoshikawa Tomohiro 1 Furuhashi Takeshi 1 Abstract: This pater proposes a topic model that considers

More information

Discriminative Topic Modeling based on Manifold Learning

Discriminative Topic Modeling based on Manifold Learning Discriminative Topic Modeling based on Manifold Learning Seungil Huh Carnegie Mellon University 00 Forbes Ave. Pittsburgh, PA seungilh@cs.cmu.edu Stephen E. Fienberg Carnegie Mellon University 00 Forbes

More information

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Animashree Anandkumar UC Irvine a.anandkumar@uci.edu Ragupathyraj Valluvan UC Irvine rvalluva@uci.edu Abstract Graphical

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Automated word puzzle generation using topic models and semantic relatedness measures

Automated word puzzle generation using topic models and semantic relatedness measures Automated word puzzle generation using topic models and semantic relatedness measures Balázs Pintér, Gyula Vörös, Zoltán Szabó and András Lőrincz ELTE IK 2012. 02. 11. Table of contents 1 Introduction

More information

Evaluation of Topographic Clustering and its Kernelization

Evaluation of Topographic Clustering and its Kernelization Evaluation of Topographic Clustering and its Kernelization Marie-Jeanne Lesot, Florence d Alché-Buc, and Georges Siolas Laboratoire d Informatique de Paris VI, 8, rue du capitaine Scott, F-75 05 Paris,

More information

Latent Dirichlet Alloca/on

Latent Dirichlet Alloca/on Latent Dirichlet Alloca/on Blei, Ng and Jordan ( 2002 ) Presented by Deepak Santhanam What is Latent Dirichlet Alloca/on? Genera/ve Model for collec/ons of discrete data Data generated by parameters which

More information

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation Ian Porteous iporteou@ics.uci.edu Arthur Asuncion asuncion@ics.uci.edu David Newman newman@uci.edu Padhraic Smyth smyth@ics.uci.edu Alexander

More information