Pseudo Labeled Latent Dirichlet Allocation

Satoko Suzuki (1), Ichiro Kobayashi (2)
(1) Department of Information Science, Faculty of Science, Ochanomizu University
(2) Advanced Science, Graduate School of Humanities and Science, Ochanomizu University
2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-0012. E-mail: g0920519@is.ocha.ac.jp

Abstract: In recent years, topic models have been widely used in many applications such as document summarization and document clustering. Labeled latent Dirichlet allocation (L-LDA) was proposed as an extension of latent Dirichlet allocation (LDA). It regards the tags, i.e., labels, that humans attach to documents as expressions of the documents' contents, and uses them as supervised information to estimate the latent topics of the documents. Moreover, L-LDA is reported to outperform LDA in topic estimation. However, ordinary documents usually do not come with such tags, so the applicability of L-LDA is considerably limited. In this study, we therefore generate pseudo labels from the documents whose latent topics are to be estimated, instead of relying on human-assigned tags, and aim to make L-LDA applicable to all documents.

1 Introduction

Latent Dirichlet allocation (LDA) [1] is a representative topic model. Labeled LDA (L-LDA) [2] extends LDA by using the labels attached to documents as supervised information, and is reported to outperform LDA in topic estimation. Since most documents carry no such labels, we generate pseudo labels from the documents themselves so that L-LDA can also be applied to unlabeled documents.

2 Labeled LDA

[Figure 1: L-LDA]

In L-LDA, the topic distribution \theta of a document d is restricted to the document's labels. The labels of document d are represented as a binary vector

\Lambda^{(d)} = (l_1, \ldots, l_K), \quad l_k \in \{0, 1\} \qquad (1)

where K is the total number of labels and l_k indicates whether label k is assigned to d. The set of labels assigned to d is

\lambda^{(d)} = \{k \mid \Lambda^{(d)}_k = 1\} \qquad (2)
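The label representation in Eqs. (1) and (2) can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
# Sketch of the L-LDA label representation (Eqs. 1-2).
# K is the total number of labels; all names here are illustrative.

def label_vector(active, K):
    """Lambda^(d): binary indicator vector over K labels (Eq. 1)."""
    return [1 if k in active else 0 for k in range(K)]

def label_set(lam):
    """lambda^(d): indices k with Lambda^(d)_k = 1 (Eq. 2)."""
    return [k for k, l_k in enumerate(lam) if l_k == 1]

vec = label_vector({0, 3}, K=5)
print(vec)             # [1, 0, 0, 1, 0]
print(label_set(vec))  # [0, 3]
```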
The label set \lambda^{(d)} defines a projection matrix L^{(d)} whose elements are

L^{(d)}_{ij} = \begin{cases} 1 & \text{if } \lambda^{(d)}_i = j \\ 0 & \text{otherwise} \end{cases} \qquad (3)

Through this matrix the Dirichlet parameter \alpha of document d is restricted to its labels, \alpha^{(d)} = L^{(d)} \alpha, so that \theta is drawn only over the labeled topics, whereas LDA draws \theta over all topics.

3 Generating Pseudo Labels

We generate pseudo labels in two ways: (1) by extracting characteristic words from each document, and (2) by clustering the documents with the Leader-Follower and Crouch methods [5][6] and treating the obtained clusters as labels.

3.1 Pseudo labels from characteristic words

Following Newman et al. [3], candidate words are first extracted from each document by TF-IDF, and each candidate is then scored by pointwise mutual information (PMI); words whose PMI score exceeds a threshold become the document's pseudo labels.

3.2 Pseudo labels from document clusters

Each document d_i is represented by the term weights

w_{ij} = (\log x_{ij} + 1.0)\,\log(N/n_j) \qquad (4)

where x_{ij} is the frequency of term t_j in document d_i, N is the number of documents, and n_j is the number of documents that contain t_j. The weight of term t_j in a cluster C_h is computed from its member documents d_i \in C_h as

\hat{w}_{hj} = \log x_{ij} + 1.0 \qquad (5)

and the similarity between a document d_i and a cluster C_k is

s(d_i, C_k) = \frac{\sum_{j=1}^{M} \min(w_{ij}, \hat{w}_{kj})}{\min\left(\sum_{j=1}^{M} w_{ij}, \sum_{j=1}^{M} \hat{w}_{kj}\right)} \qquad (6)

where M is the number of terms. Documents are clustered with the Leader-Follower and Crouch methods [5][6], and each document takes the clusters it belongs to as its pseudo labels.
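The weighting and similarity in Eqs. (4) and (6) can be sketched as follows, assuming raw term counts as input; all names are illustrative.

```python
import math

def tfidf_weight(x_ij, N, n_j):
    """w_ij = (log x_ij + 1.0) * log(N / n_j)  (Eq. 4).
    x_ij: frequency of term t_j in document d_i;
    N: number of documents; n_j: documents containing t_j."""
    return (math.log(x_ij) + 1.0) * math.log(N / n_j)

def similarity(w, w_hat):
    """s(d_i, C_k): sum of term-wise minimum weights, normalized by
    the smaller of the two total weights (Eq. 6)."""
    numer = sum(min(a, b) for a, b in zip(w, w_hat))
    denom = min(sum(w), sum(w_hat))
    return numer / denom if denom > 0 else 0.0

# A document identical to the cluster profile has similarity 1.0.
print(similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```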
4 Experiments

4.1 Experimental settings

We use the 20 Newsgroups corpus (*), which consists of 20 categories. We select 10 categories with 100 documents each, forming two data sets of 1,000 documents, seta and setb; their categories are listed in Table 1.

Table 1: Categories of seta and setb

  seta                      setb
  alt.atheism               alt.atheism
  comp.graphics             comp.graphics
  comp.sys.mac.hardware     comp.sys.ibm.pc.hardware
  rec.sport.baseball        misc.forsale
  sci.med                   rec.autos
  sci.crypt                 rec.motorcycles
  sci.electronics           sci.electronics
  sci.space                 sci.space
  talk.politics.guns        talk.politics.guns
  soc.religion.christian    talk.politics.misc

For method 1, the top 30 words of each document by TF-IDF are taken as candidates, and the PMI threshold is varied over [4.5, 6.2] for seta and [4.8, 6.2] for setb. For method 2, the clustering threshold of the Leader-Follower and Crouch methods is varied over [0.1, 0.9] in steps of 0.1 (method 2a) and over a finer range (method 2b). L-LDA is run with \alpha = 0.1 and \eta = 0.1. The baseline LDA is also run with \alpha = 0.1 and \eta = 0.1, with 16 topics for seta and 28 for setb. The estimated topic distributions \theta are then clustered by k-means into 10 clusters, matching the 10 categories of 20 Newsgroups.

4.2 Evaluation measure

Following [4], clustering quality is evaluated by the mutual information

MI(L, A) = \sum_{l_i \in L,\, \alpha_j \in A} P(l_i, \alpha_j) \log_2 \frac{P(l_i, \alpha_j)}{P(l_i)\,P(\alpha_j)} \qquad (7)

where L = \{l_1, l_2, \ldots, l_k\} is the set of clusters obtained by k-means, A = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} is the set of true categories, P(l_i) is the probability that a document belongs to cluster l_i, P(\alpha_j) the probability that it belongs to category \alpha_j, and P(l_i, \alpha_j) their joint probability. To bound the measure within [0, 1], we use the normalized mutual information

\widehat{MI} = \frac{MI(L, A)}{MI(A, A)} \qquad (8)

4.3 Results

k-means is run with 10 clusters, and the normalized MI is reported. Figures 2 and 5 show the results of method 1 for seta and setb respectively, Figures 3 and 6 those of method 2a, and Figures 4 and 7 those of method 2b, each compared with the LDA baseline.

(*) http://qwone.com/~jason/20newsgroups/
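Eqs. (7) and (8) can be sketched as follows, with the probabilities estimated from co-assignment counts; the names are illustrative, not from the paper.

```python
import math
from collections import Counter

def mutual_information(L, A):
    """MI(L, A) in bits (Eq. 7), where L and A are the cluster and
    category assignments of the same n documents."""
    n = len(L)
    joint = Counter(zip(L, A))
    count_l = Counter(L)
    count_a = Counter(A)
    mi = 0.0
    for (l, a), c in joint.items():
        # P(l,a)/(P(l)P(a)) = (c/n) / ((count_l/n)(count_a/n)) = c*n/(count_l*count_a)
        mi += (c / n) * math.log2(c * n / (count_l[l] * count_a[a]))
    return mi

def normalized_mi(L, A):
    """MI(L, A) / MI(A, A), bounded within [0, 1]  (Eq. 8)."""
    return mutual_information(L, A) / mutual_information(A, A)

# Identical assignments give the maximum score 1.0.
print(normalized_mi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```

Since MI(A, A) equals the entropy of the category assignment and MI(L, A) never exceeds it, the normalized score lies in [0, 1], as stated above.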
[Figure 2: MI, method 1 (seta)]
[Figure 3: MI, method 2a (seta)]
[Figure 4: MI, method 2b (seta)]
[Figure 5: MI, method 1 (setb)]
[Figure 6: MI, method 2a (setb)]
[Figure 7: MI, method 2b (setb)]

For method 2b (Figures 4 and 7), the thresholds shown range over [0.03, 0.1] for the Leader-Follower and Crouch methods, compared with the LDA baseline. Table 2 lists the number of labels obtained by method 1 at each threshold.

Table 2: Number of labels (method 1)

  Threshold   seta        setb
  4.5         2 (9)       -
  4.6         2 (9)       -
  4.7         3 (10)      -
  4.8         3 (10)      2 (8)
  4.9         5 (12)      2 (8)
  5.0         6 (13)      1 (7)
  5.1         5 (12)      1 (7)
  5.2         22 (29)     22 (28)
  5.3         20 (27)     27 (33)
  5.4         20 (27)     27 (33)
  5.5         16 (23)     24 (30)
  5.6         204 (211)   193 (199)
  5.7         204 (211)   193 (199)
  5.8         204 (211)   193 (199)
  5.9         125 (132)   124 (130)
  6.0         125 (132)   124 (130)
  6.1         125 (132)   124 (130)
  6.2         125 (132)   124 (130)

For method 1, the number of labels changes sharply around the thresholds 5.2, 5.6, and 5.9. For method 2, Figures 8 and 9 plot the normalized MI of the Leader-Follower and Crouch clusterings against the number of labels, together with the LDA baseline.
Table 3: Number of labels (method 2a)

  Threshold   seta   setb
  0.1         185    212
  0.2         228    220
  0.3         202    179
  0.4         155    140
  0.5         102    93
  0.6         59     56
  0.7         28     25
  0.8         11     9
  0.9         2      6

Table 4: Number of labels (method 2b)

  Threshold   seta   setb
  0.02        -      6
  0.03        19     21
  0.04        42     42
  0.05        69     80
  0.06        101    119
  0.07        118    144
  0.08        151    181
  0.09        169    194
  0.1         185    212

[Figure 8: MI vs. number of labels (seta)]
[Figure 9: MI vs. number of labels (setb)]

5 Discussion

For method 2, the Crouch and Leader-Follower clusterings are compared with the LDA baseline; around the threshold 0.03, the Leader-Follower method yields topic estimates whose clustering quality matches or exceeds that of LDA, for setb in particular.

6 Conclusion

We proposed generating pseudo labels from the documents themselves, in place of human-assigned tags, so that L-LDA can be applied to unlabeled documents, and compared the resulting topic estimates with those of LDA on the 20 Newsgroups corpus.
References

[1] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[2] D. Ramage, D. Hall, R. Nallapati, C. D. Manning: Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP 2009, pp. 248-256, 2009.
[3] D. Newman, J. H. Lau, K. Grieser, T. Baldwin: Automatic Evaluation of Topic Coherence. Human Language Technologies: NAACL 2010, pp. 100-108, Los Angeles, California, 2010.
[4] G. Erkan: Language Model-Based Document Clustering Using Random Walks. HLT-NAACL 2006, pp. 479-486, 2006.
[5] Library and Information Science, No. 47, 2002 (in Japanese).
[6] Library and Information Science, No. 49, 2003 (in Japanese).