Analyzing Burst of Topics in News Stream


Yusuke Takahashi,1 Daisuke Yokomoto,1 Takehito Utsuro1 and Masaharu Yoshioka2

1 Graduate School of Systems and Information Engineering, University of Tsukuba
2 Graduate School of Information Science and Technology, Hokkaido University

Among the various forms of the recent information explosion, the explosion in the news stream is a serious problem. This paper studies two ways of modeling information flow in a news stream: burst analysis and topic modeling. First, to detect topics that attract much more attention than usual, one normally has to watch every article in the news stream at every moment. In this situation, it is well known in the field of time series analysis that Kleinberg's burst model is quite effective for detecting bursts of keywords. Second, topic models such as LDA (latent Dirichlet allocation) and DTM (dynamic topic model) are quite effective for estimating the distribution of topics over a document collection such as the articles in a news stream. This paper focuses on the fact that Kleinberg's burst model is usually applied only to bursts of keywords, not to bursts of topics. Based on Kleinberg's model of keyword bursts, we propose how to measure the bursts of topics estimated by a topic model such as LDA or DTM.

1. Introduction

Kleinberg's burst model 5) detects intervals during which a keyword is mentioned far more often than usual, while topic models such as DTM (dynamic topic model) 3) estimate the distribution of topics over the articles in a news stream. This paper combines the two: instead of measuring the bursts of individual keywords, we measure the bursts of the topics estimated by a topic model.

2. Kleinberg's Burst Model 5)

2.1 The Two-State Burst Automaton

Kleinberg 5) models the arrival of relevant documents in a stream with a two-state automaton $A^2$ consisting of a non-burst state $q_0$ and a burst state $q_1$. Let the stream be a sequence of $m$ batches $B_1, \ldots, B_m$, where batch $B_t$ at time $t$ contains $d_t$ documents, $r_t$ of which are relevant to the target keyword, and let

$$D = \sum_{t=1}^{m} d_t, \qquad R = \sum_{t=1}^{m} r_t .$$

In state $q_0$ a document is relevant with probability $p_0 = R/D$; in state $q_1$ this probability is scaled up by a factor $s > 1$ to $p_1 = p_0 s$ (with $p_1 \le 1$). For a state sequence $q = (q_{i_1}, \ldots, q_{i_m})$, the cost of emitting $r_t$ relevant documents out of $d_t$ in state $q_i$ $(i = 0, 1)$ follows the binomial distribution $B(d_t, p_i)$:

$$\sigma(i, r_t, d_t) = -\ln\left[ \binom{d_t}{r_t}\, p_i^{r_t} (1 - p_i)^{d_t - r_t} \right]$$

The cost of a transition from state $q_i$ to state $q_j$ is

$$\tau(i, j) = \begin{cases} (j - i)\,\gamma & (j > i) \\ 0 & (j \le i) \end{cases}$$

so that entering the burst state incurs a cost controlled by the parameter $\gamma$, while returning to the non-burst state is free.
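The minimum-cost state sequence of this automaton can be computed with a simple Viterbi-style dynamic program over the two states. The following is a minimal sketch in Python under the definitions above; the function name `burst_states` and its interface are our own illustration, not part of the paper.

```python
import math

def burst_states(r, d, s=2.0, gamma=1.0):
    """Minimum-cost state sequence of Kleinberg's two-state automaton.

    r[t]: relevant documents in batch t;  d[t]: total documents in batch t.
    Assumes at least one relevant document overall (so p0 > 0).
    Returns a list of states (0 = non-burst q0, 1 = burst q1).
    """
    m = len(r)
    p0 = sum(r) / sum(d)           # baseline probability p0 = R / D
    p1 = min(p0 * s, 1.0 - 1e-9)   # burst probability p1 = p0 * s, capped below 1

    def sigma(i, rt, dt):
        # -ln[ C(dt, rt) * p_i^rt * (1 - p_i)^(dt - rt) ]
        p = p1 if i == 1 else p0
        log_binom = (math.lgamma(dt + 1) - math.lgamma(rt + 1)
                     - math.lgamma(dt - rt + 1))
        return -(log_binom + rt * math.log(p) + (dt - rt) * math.log(1.0 - p))

    def tau(i, j):
        # (j - i) * gamma when moving up into the burst state, 0 otherwise.
        return (j - i) * gamma if j > i else 0.0

    # Viterbi dynamic program; the automaton starts in q0.
    cost = [tau(0, j) + sigma(j, r[0], d[0]) for j in (0, 1)]
    back = [[0, 0]]
    for t in range(1, m):
        step = [min((cost[i] + tau(i, j), i) for i in (0, 1)) for j in (0, 1)]
        back.append([step[0][1], step[1][1]])
        cost = [step[j][0] + sigma(j, r[t], d[t]) for j in (0, 1)]

    # Trace the optimal path backwards.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for t in range(m - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```

Running this on the daily counts $(r_t, d_t)$ of a keyword yields its burst intervals as the maximal runs of state 1.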

The total cost of a state sequence $q$ given the observations $(r_t, d_t)$ combines the transition costs $\tau$ and the emission costs $\sigma$:

$$c(q \mid r_t, d_t) = \left( \sum_{t=0}^{m-1} \tau(i_t, i_{t+1}) \right) + \left( \sum_{t=1}^{m} \sigma(i_t, r_t, d_t) \right)$$

The two-state automaton with parameters $s$ and $\gamma$ is denoted $A^2_{s,\gamma}$. This paper uses $s = 2$ and $\gamma = 1$, i.e., $A^2_{2,1}$, and takes the minimum-cost state sequence as the burst analysis of the stream.

2.2 Burst Weight of a Keyword

For a keyword $w$ that bursts over the interval $t_k, \ldots, t_l$, Kleinberg defines the burst weight

$$bw(t_k, t_l, w) = \sum_{t=t_k}^{t_l} \Big( \sigma(0, r_t, d_t) - \sigma(1, r_t, d_t) \Big)$$

i.e., the reduction in cost obtained by being in the burst state rather than the non-burst state. When $t_k = t_l$ $(= t)$, we abbreviate $bw(t, w) = bw(t, t, w)$.

3. Topic Models

3.1 DTM (Dynamic Topic Model) 3)

DTM estimates, for each of $K$ topics $z_n$ $(n = 1, \ldots, K)$, a distribution $p(w \mid z_n)$ over the words $w$ of the vocabulary $V$, and, for each document $b$, a distribution $p(z_n \mid b)$ over the topics. DTM extends LDA (Latent Dirichlet Allocation) 4) by allowing these distributions to evolve over the time slices of the document collection.
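Given the state sequence, the burst weight of a keyword follows directly from the difference of the two $\sigma$ costs. Below is a minimal sketch, assuming the same count arrays as in the previous block; `burst_weight` is an illustrative name, not from the paper.

```python
import math

def burst_weight(r, d, states, s=2.0):
    """Per-time-point burst weight bw(t, w) for one keyword.

    r, d: relevant / total document counts per batch (as in burst_states);
    states: the 0/1 state sequence returned by burst_states.
    Returns bw[t] = sigma(0, r[t], d[t]) - sigma(1, r[t], d[t]) inside
    burst intervals and 0.0 elsewhere.
    """
    p0 = sum(r) / sum(d)
    p1 = min(p0 * s, 1.0 - 1e-9)

    def sigma(p, rt, dt):
        log_binom = (math.lgamma(dt + 1) - math.lgamma(rt + 1)
                     - math.lgamma(dt - rt + 1))
        return -(log_binom + rt * math.log(p) + (dt - rt) * math.log(1.0 - p))

    return [sigma(p0, rt, dt) - sigma(p1, rt, dt) if st == 1 else 0.0
            for rt, dt, st in zip(r, d, states)]
```

Summing the returned per-day values over a burst interval $t_k, \ldots, t_l$ gives $bw(t_k, t_l, w)$.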

We estimate $p(w \mid z_n)$ $(w \in V)$ and $p(z_n \mid b)$ $(n = 1, \ldots, K)$ with Blei's publicly available implementation of DTM,*1 setting the hyperparameter $\alpha = 0.01$ and the number of topics $K = 20$.

*1 http://www.cs.princeton.edu/~blei/topicmodeling.html

3.2 Assigning Documents to Topics

Given the document collection $B_t$ at time $t$ and the $K$ estimated topics, each document $b$ is assigned to its most probable topic, so that each topic $z_n$ is associated with the document set

$$D(z_n) = \left\{\, b \in B_t \;\middle|\; z_n = \operatorname*{argmax}_{z_u\,(u = 1, \ldots, K)} p(z_u \mid b) \,\right\}$$

4. Measuring the Burst of a Topic

We measure the burst of a topic $z_n$ at time $t$ by weighting the burst weights of the keywords with the word distribution of the topic:

$$bz(t, z_n) = \sum_{w} bw(t, w)\, p(w \mid z_n)$$

where burst weights $bw(t, w)$ smaller than 0.015 are truncated to 0.

5. Evaluation

We collected news articles from the web editions of three Japanese newspapers, Nikkei (http://www.nikkei.com/), Asahi (http://www.asahi.com/) and Yomiuri (http://www.yomiuri.co.jp/), over the period from June 1, 2009 to May 31, 2010: 56,503, 38,758 and 62,684 articles respectively, 157,945 in total. Figures 2(a) and 2(b) (figures omitted) plot example bursts of topics $z_n$ and their document sets $D(z_n)$ of Section 3.2 in the period around February 25 to March 3, 2010.
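The topic burst measure $bz(t, z_n)$ is a dot product between the keyword burst weights at time $t$ and the topic's word distribution. Below is a minimal sketch, assuming both are available as Python dictionaries keyed by keyword; the names and the dictionary representation are our assumptions, not the paper's.

```python
def topic_burst(bw_t, p_w_given_z, cutoff=0.015):
    """Burst of a topic at time t: bz(t, z_n) = sum_w bw(t, w) * p(w | z_n).

    bw_t: dict mapping keyword w -> burst weight bw(t, w) at time t
    p_w_given_z: dict mapping keyword w -> p(w | z_n) for topic z_n
    Burst weights below the cutoff (0.015 in the paper) are treated as 0.
    """
    return sum(weight * p_w_given_z.get(w, 0.0)
               for w, weight in bw_t.items() if weight >= cutoff)
```

Computing this for every topic $z_n$ at every time point $t$ gives the per-topic burst series evaluated below.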

Table 1: Top five bursty keywords (burst weight $bw(t, w)$) and top five frequent keywords (document frequency) per day, Feb. 25 - Mar. 3, 2010. The Japanese keyword strings are lost in this transcription; only the values survive.

Date        Top five burst weights bw(t, w)    Top five document frequencies
2010/2/25   3.71, 2.34, 0.33, 0.33, 0.21       63, 54, 51, 49, 45
2010/2/26   4.12, 1.08, 0.36, 0.33, 0.27       77, 58, 50, 43, 33
2010/2/27   3.80, 1.49, 1.42, 0.90, 0.25       48, 30, 23, 23, 22
2010/2/28   8.07, 4.06, 2.06, 0.25, 0.20       114, 36, 34, 16, 15
2010/3/1    4.67, 3.00, 2.96, 0.52, 0.46       64, 55, 22, 50, 46
2010/3/2    1.20, 1.08, 0.97, 0.78, 0.43       58, 50, 45, 37, 32
2010/3/3    1.91, 0.77, 0.52, 0.52, 0.27       49, 46, 45, 41, 38

Tables 2-4 evaluate the bursty topics detected by the proposed measure $bz(t, z_n)$ over five one-week periods, at thresholds of 0.6, 0.5 and 0.4 respectively.

Table 2: Precision of detected bursty topics (threshold = 0.6)

Period               Detected   Correct   Precision [%]
2009/7/5 - 7/11          11        11        100
2009/9/20 - 9/26          8         8        100
2009/10/5 - 10/11         6         6        100
2010/1/9 - 1/15          18        17        94.4
2010/2/25 - 3/3          21        21        100

Table 3: Precision of detected bursty topics (threshold = 0.5)

Period               Detected   Correct   Precision [%]
2009/7/5 - 7/11          15        12        80.0
2009/9/20 - 9/26          9         9        100
2009/10/5 - 10/11         9         8        88.9
2010/1/9 - 1/15          19        18        94.7
2010/2/25 - 3/3          24        24        100

Table 4: Precision of detected bursty topics (threshold = 0.4)

Period               Detected   Correct   Precision [%]
2009/7/5 - 7/11          17        12        70.6
2009/9/20 - 9/26         15        14        93.3
2009/10/5 - 10/11        10         8        80.0
2010/1/9 - 1/15          24        18        75.0
2010/2/25 - 3/3          26        26        100

6. Related Work

Mane and Börner 6) map topics and topic bursts in the PNAS corpus. AlSumait et al. 2) rank the significance of topics estimated by LDA, distinguishing junk/insignificance (J/I) topics from meaningful ones; such a ranking could complement our measure by filtering out J/I topics before burst analysis. Related work in Japanese is reported in 7). Unlike these, this paper applies Kleinberg's burst model 5) to the topics estimated by topic models such as DTM and LDA.

7. Conclusion

Based on Kleinberg's model of keyword bursts, we proposed a measure of the burst of a topic estimated by a topic model such as LDA or DTM. Future work includes applying the approach to on-line topic models such as On-line LDA 1).

References

1) AlSumait, L., Barbará, D. and Domeniconi, C.: On-Line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking, Proc. 8th ICDM, pp. 3-12 (2008).
2) AlSumait, L., Barbará, D., Gentle, J. and Domeniconi, C.: Topic Significance Ranking of LDA Generative Models, Proc. ECML/PKDD, pp. 67-82 (2009).
3) Blei, D. M. and Lafferty, J. D.: Dynamic Topic Models, Proc. 23rd ICML, pp. 113-120 (2006).
4) Blei, D. M., Ng, A. Y. and Jordan, M. I.: Latent Dirichlet Allocation, Journal of Machine Learning Research, Vol. 3, pp. 993-1022 (2003).
5) Kleinberg, J.: Bursty and Hierarchical Structure in Streams, Proc. 8th SIGKDD, pp. 91-101 (2002).
6) Mane, K. and Börner, K.: Mapping Topics and Topic Bursts in PNAS, Proc. Natl. Acad. Sci. USA, Vol. 101, Suppl. 1, pp. 5287-5290 (2004).
7) Proc. 3rd DEIM Forum (2011) (in Japanese).

(c) 2011 Information Processing Society of Japan