Analyzing Burst of Topics in News Stream

Yusuke Takahashi,†1 Daisuke Yokomoto,†1 Takehito Utsuro†1 and Masaharu Yoshioka†2

Among the various forms of the recent information explosion, the explosion of news streams is a serious problem. This paper studies two types of modeling of information flow in news streams, namely, burst analysis and topic modeling. First, when one wants to detect topics that attract much more attention than usual, one usually has to carefully watch every article in the news stream at every moment. In such a situation, it is well known in the field of time-series analysis that Kleinberg's modeling of bursts is quite effective in detecting bursts of keywords. Second, topic models such as LDA (latent Dirichlet allocation) and DTM (dynamic topic model) are also quite effective in estimating the distribution of topics over a document collection such as the articles in a news stream. This paper focuses on the fact that Kleinberg's modeling of bursts is usually applied only to bursts of keywords, not to bursts of topics. Based on Kleinberg's modeling of bursts of keywords, we propose how to measure bursts of topics estimated by a topic model such as LDA or DTM.

1. Introduction

This paper combines Kleinberg's modeling of bursts 5) with topic models, in particular DTM (dynamic topic model) 3). We first detect bursts of individual keywords in a news stream with Kleinberg's two-state automaton, and then aggregate the burst weights of keywords into burst weights of the topics estimated by DTM. Section 2 reviews Kleinberg's burst model, Section 3 describes the topic model, Section 4 defines the burst weight of a topic, Section 5 reports an evaluation on one year of Japanese news articles, and Section 6 discusses related work.

†1 Graduate School of Systems and Information Engineering, University of Tsukuba
†2 Graduate School of Information Science and Technology, Hokkaido University

© 2011 Information Processing Society of Japan
2. Kleinberg's Modeling of Bursts 5)

2.1 The Two-State Automaton A²

Kleinberg's "enumerating" burst model takes as input a sequence of m batches B_1, ..., B_m, where batch B_t contains d_t documents in total, r_t of which are relevant (e.g., contain the keyword in question). Let D = Σ_{t=1..m} d_t and R = Σ_{t=1..m} r_t. The model is a two-state automaton A² with a non-burst state q_0, which emits relevant documents at the base rate p_0 = R / D, and a burst state q_1, which emits them at the higher rate p_1 = p_0 · s (s > 1). A state sequence q = (q_{i_1}, ..., q_{i_m}) assigns one of the two states to each batch. The cost of emitting r_t relevant documents out of d_t in state q_i is

    σ(i, r_t, d_t) = −ln [ C(d_t, r_t) p_i^{r_t} (1 − p_i)^{d_t − r_t} ]

and the cost of a transition from state q_i to state q_j is

    τ(i, j) = (j − i) γ   (j > i),      τ(i, j) = 0   (j ≤ i)

so that entering the burst state is penalized while leaving it is free, where γ is a scaling parameter; we set γ = 1.
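As a sketch, the optimal state sequence of the two-state model above can be found with a small Viterbi-style dynamic program. The function names, the assumption that the automaton starts in q_0, and the clipping of p_1 just below 1 are our own choices, not taken from the paper:

```python
import math

def sigma(p, r, d):
    """Emission cost sigma(i, r_t, d_t) = -ln[ C(d_t, r_t) p^r_t (1 - p)^(d_t - r_t) ]."""
    return -(math.log(math.comb(d, r)) + r * math.log(p) + (d - r) * math.log(1.0 - p))

def burst_states(r, d, s=2.0, gamma=1.0):
    """Minimum-cost 0/1 state sequence of the two-state automaton A^2_{s,gamma}.
    r[t], d[t] are the relevant / total document counts of batch B_t."""
    m = len(d)
    p0 = sum(r) / sum(d)            # base rate p0 = R / D
    p1 = min(p0 * s, 1.0 - 1e-6)    # burst rate p1 = p0 * s, clipped below 1 (assumption)
    def tau(i, j):                  # transition cost: pay only when moving up into q1
        return (j - i) * gamma if j > i else 0.0
    # cost[i] = minimum cost of a sequence ending in state q_i at the current batch,
    # assuming the automaton starts in q0 (entering q1 at t = 0 pays tau(0, 1)).
    cost = [sigma(p0, r[0], d[0]), tau(0, 1) + sigma(p1, r[0], d[0])]
    back = []
    for t in range(1, m):
        new_cost, pointers = [], []
        for j, p in ((0, p0), (1, p1)):
            best_i = min((0, 1), key=lambda i: cost[i] + tau(i, j))
            new_cost.append(cost[best_i] + tau(best_i, j) + sigma(p, r[t], d[t]))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)
    states = [min((0, 1), key=lambda i: cost[i])]   # best final state
    for pointers in reversed(back):                 # trace back to t = 0
        states.append(pointers[states[-1]])
    return states[::-1]
```

For instance, with r = [10, 10, 10, 10, 60, 60, 10, 10, 10, 10] and d_t = 100 for every batch, the automaton enters the burst state exactly on the two batches with elevated relevant counts.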
The cost of a state sequence q given the counts (r_t, d_t) is

    c(q | r_t, d_t) = ( Σ_{t=0..m−1} τ(i_t, i_{t+1}) ) + ( Σ_{t=1..m} σ(i_t, r_t, d_t) )

and the automaton outputs the state sequence minimizing this cost. The automaton A² with parameters s and γ is written A²_{s,γ}; throughout this paper we use s = 2 and γ = 1, i.e., A²_{2,1}.

2.2 Burst Weight of a Keyword

Following Kleinberg 5), for a burst interval [t_k, t_l] during which the optimal state sequence for word w stays in the burst state q_1, the weight of the burst is

    bw(t_k, t_l, w) = Σ_{t=t_k..t_l} ( σ(0, r_t, d_t) − σ(1, r_t, d_t) )

that is, the total emission cost saved by being in the burst state. When t_k = t_l (= t), we abbreviate bw(t, w) = bw(t, t, w).

3. Topic Model

3.1 DTM (Dynamic Topic Model) 3)

DTM (dynamic topic model) 3) is a topic model for time-stamped document collections. Given the number of topics K, it estimates, for each topic z_n (n = 1, ..., K), a distribution p(w | z_n) (w ∈ V) of words over the vocabulary V, and, for each document b, a distribution p(z_n | b) (n = 1, ..., K) of topics, where the per-topic word distributions are allowed to evolve over time. DTM is a time-series extension of LDA (latent Dirichlet allocation) 4).
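The per-date burst weight bw(t, w) of Section 2.2 can be computed directly from the state sequence; a minimal sketch (list-based interface is our assumption):

```python
import math

def sigma(p, r, d):
    # emission cost -ln[ C(d, r) p^r (1 - p)^(d - r) ] as in Sec. 2.1
    return -(math.log(math.comb(d, r)) + r * math.log(p) + (d - r) * math.log(1.0 - p))

def burst_weight(r, d, states, s=2.0):
    """Per-date burst weight bw(t, w) = sigma(0, r_t, d_t) - sigma(1, r_t, d_t)
    for dates in the burst state, and 0 elsewhere (Sec. 2.2).
    `states` is the optimal 0/1 state sequence of the automaton for word w."""
    p0 = sum(r) / sum(d)
    p1 = min(p0 * s, 1.0 - 1e-6)
    return [sigma(p0, r[t], d[t]) - sigma(p1, r[t], d[t]) if states[t] else 0.0
            for t in range(len(d))]
```

Note that the binomial coefficient C(d_t, r_t) cancels in the difference σ(0, ·) − σ(1, ·), so bw depends only on the two rates p_0 and p_1.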
We use Blei's implementation of DTM (*1) with hyperparameter α = 0.01 and K = 20 topics.

*1 http://www.cs.princeton.edu/~blei/topicmodeling.html

3.2 Assigning Documents to Topics

Given the document collection B and the K topics z_n (n = 1, ..., K) estimated by DTM, each document b (∈ B) is assigned to its most probable topic. The document set of topic z_n is then

    D(z_n) = { b ∈ B | z_n = argmax_{z_u (u=1,...,K)} p(z_u | b) }

4. Measuring Bursts of Topics

For a date t and a topic z_n, we define the burst weight of the topic as the sum of the burst weights of individual keywords, weighted by their probabilities in the topic:

    bz(t, z_n) = Σ_w bw(t, w) · p(w | z_n)

where bw(t, w) is taken to be 0 on dates t at which w is not bursty, and words w with p(w | z_n) below 0.015 are excluded from the sum.

5. Evaluation

We collected news articles published from June 1, 2009 through May 31, 2010 from three Japanese news sites, Nikkei (http://www.nikkei.com/), Asahi (http://www.asahi.com/) and Yomiuri (http://www.yomiuri.co.jp/), comprising 56,503, 38,758 and 62,684 articles, respectively, or 157,945 articles in total. Each document was assigned to a topic z_n by the procedure of Section 3.2 to form the document sets D(z_n), and burst weights of keywords and topics were computed for each date. Figures 2(a) and 2(b) plot the burst weights of topics over the period from Feb. 12 to Feb. 28, 2010.
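The hard topic assignment of Section 3.2 and the topic burst weight of Section 4 can be sketched as follows. The dictionary-based interfaces and the reading of 0.015 as a cutoff on p(w | z_n) are our assumptions:

```python
def assign_topic(theta):
    """Hard-assign a document b to argmax_{z_u} p(z_u | b), used to build
    the per-topic document set D(z_n) in Sec. 3.2.  theta: topic -> p(z | b)."""
    return max(theta, key=theta.get)

def topic_burst(bw_t, phi, min_prob=0.015):
    """Burst weight of a topic, bz(t, z_n) = sum_w bw(t, w) * p(w | z_n) (Sec. 4).
    bw_t: word -> bw(t, w) at date t (absent or 0 when w is not bursty);
    phi:  word -> p(w | z_n); words with probability below min_prob are ignored."""
    return sum(bw_t.get(w, 0.0) * p for w, p in phi.items() if p >= min_prob)
```

A topic thus gets a large burst weight on a date when its high-probability words are themselves bursting on that date, even if no single keyword's burst is extreme.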
Table 1  For each date from Feb. 25 through Mar. 3, 2010: the top 5 bursty topics with their topic burst weights bz(t, z_n), and the top 5 bursty keywords with their burst weights.

The bursty topics detected by the proposed measure during Feb. 25 – Mar. 3, 2010 (Table 1, Figures 2(a) and 2(b)) are consistent with the bursts of the individual keywords of the DTM topics in the same period. For a quantitative evaluation, Tables 2–4 report results for bursty topics detected in five one-week periods, with the detection threshold set to 0.6, 0.5 and 0.4, respectively.

6. Related Work

Mane and Börner 6) applied Kleinberg's burst model to keywords in the PNAS collection and mapped topics and topic bursts. Topic distributions estimated by topic models such as DTM and LDA over news streams are also studied in 7).
Table 2  Evaluation of detected bursty topics (threshold = 0.6)

  period                     # detected   # correct   precision [%]
  July 5–11, 2009                    11          11           100
  Sep. 20–26, 2009                    8           8           100
  Oct. 5–11, 2009                     6           6           100
  Jan. 9–15, 2010                    18          17          94.4
  Feb. 25 – Mar. 3, 2010             21          21           100

Table 3  Evaluation of detected bursty topics (threshold = 0.5)

  period                     # detected   # correct   precision [%]
  July 5–11, 2009                    15          12          80.0
  Sep. 20–26, 2009                    9           9           100
  Oct. 5–11, 2009                     9           8          88.9
  Jan. 9–15, 2010                    19          18          94.7
  Feb. 25 – Mar. 3, 2010             24          24           100

Table 4  Evaluation of detected bursty topics (threshold = 0.4)

  period                     # detected   # correct   precision [%]
  July 5–11, 2009                    17          12          70.6
  Sep. 20–26, 2009                   15          14          93.3
  Oct. 5–11, 2009                    10           8          80.0
  Jan. 9–15, 2010                    24          18          75.0
  Feb. 25 – Mar. 3, 2010             26          26           100

ALSumait et al. 2) proposed topic significance ranking of LDA generative models, which identifies junk/insignificance (J/I) topics and ranks the remaining topics by their significance.

7. Conclusion

Based on Kleinberg's modeling of bursts of keywords, this paper proposed how to measure bursts of topics estimated by a topic model such as DTM. Future work includes applying adaptive topic models such as On-line LDA 1) to news streams.

References

1) AlSumait, L., Barbará, D. and Domeniconi, C.: On-Line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking, Proc. 8th ICDM, pp. 3–12 (2008).
2) AlSumait, L., Barbará, D., Gentle, J. and Domeniconi, C.: Topic Significance Ranking of LDA Generative Models, Proc. ECML/PKDD, pp. 67–82 (2009).
3) Blei, D. M. and Lafferty, J. D.: Dynamic Topic Models, Proc. 23rd ICML, pp. 113–120 (2006).
4) Blei, D. M., Ng, A. Y. and Jordan, M. I.: Latent Dirichlet Allocation, Journal of Machine Learning Research, Vol. 3, pp. 993–1022 (2003).
5) Kleinberg, J.: Bursty and Hierarchical Structure in Streams, Proc. 8th SIGKDD, pp. 91–101 (2002).
6) Mane, K. and Börner, K.: Mapping Topics and Topic Bursts in PNAS, Proc. Natl. Acad. Sci. USA, Vol. 101, Suppl. 1, pp. 5287–5290 (2004).
7) Proc. 3rd DEIM Forum (2011) (in Japanese).