Topic Discovery Project Report

Shunyu Yao and Xingjiang Yu
IIIS, Tsinghua University
{yao-sy15, yu-xj15}@mails.tsinghua.edu.cn

Abstract

In this report we present our implementations of topic discovery models on different datasets. Two classical algorithms, nonnegative matrix factorization (NMF) and latent Dirichlet allocation (LDA), are used to extract latent semantic structures (topics) from text bodies. They are representative of two significant classes of techniques, namely matrix techniques and probabilistic techniques. The text corpora of the datasets range from daily tweets to neural network papers, and based on them we provide qualitative and quantitative comparisons and analysis of the advantages and limits of the different methods.

Index Terms: topic discovery, text mining, NLP, NMF, LDA

1. Introduction

In the age of information, the ever-increasing amount of text calls for efficient methods for document classification and clustering. To do so, it is essential to discover the latent semantic structures of documents in a fast, stable and robust manner. Such a topic model can organize, and lead us to understand, the massive unstructured text bodies we encounter every day.

In this project, we implement several representative topic discovery algorithms and evaluate them on several different datasets. The work we have done in this project can be divided into the following parts.

1. Study the topic discovery problem by reading various papers concerning different approaches and contexts. The basic approach to topic discovery is to model a document as a mixture of topics; in terms of how to model a topic, different methods employ different perspectives. This is illustrated in Section 2.

2. Obtain text corpus datasets for our use. Four datasets are selected and processed from the Internet, covering everyday tweets, news in politics, economics, sports and other sections, and conference papers over decades. This multiplicity allows insight into the pros and cons of different methods.

3. Conduct experiments on multiple datasets using the different methods and analyze the results accordingly. This part consists of our observations and discoveries during the experiments. We also provide reasonable explanations of the results and possible ways to improve. Experimental details can be found in Section 3 and discussion in Section 4.

2. Methodology

2.1. Notations and terminology

We define the following terms formally:

A word is a basic unit of discrete data, indexed by {1, ..., V}. For convenience, we represent the w-th word as a unit vector w ∈ R^V with the w-th component being 1 and all other components being 0.

A document of N words is a sequence of words d = (w_1, ..., w_N). Its vector representation d is the sum of its words' vector representations, w_1 + ... + w_N.

A corpus is a collection of M documents C = {d_1, ..., d_M}. The corresponding document-term matrix X = [d_1 ... d_M] ∈ R^{V×M} describes each document as a column.

For a matrix A, let A(i,:) denote its i-th row and A(:,j) denote its j-th column.
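As a concrete illustration of this notation, the minimal sketch below builds such a document-term matrix with scikit-learn; the toy corpus, the variable names and the transpose to the V × M convention are illustrative assumptions, not the exact code used in our experiments.

# Minimal sketch: building the document-term matrix X (terms x documents)
# with scikit-learn. The toy corpus is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the stock market fell sharply today",
    "the team won the game in overtime",
    "central bank raises interest rates",
]

vectorizer = CountVectorizer()
# scikit-learn returns a (documents x terms) sparse matrix, so we transpose
# to match the convention X in R^{V x M} (one document per column).
X = vectorizer.fit_transform(corpus).T
print(X.shape)                                   # (V, M)
print(vectorizer.get_feature_names_out()[:5])    # vocabulary (scikit-learn >= 1.0)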
2.2. Related Works

A great diversity of topic models has been proposed to reveal the hidden semantic structure of text corpora. Latent Semantic Analysis (LSA) [1] is one of the first topic discovery attempts, applying singular value decomposition (SVD) to the document-term matrix. After LSA, Probabilistic Latent Semantic Analysis (PLSA) [2] and Latent Dirichlet Allocation (LDA) [3] further use a hidden topic variable to model topics as distributions over words.

PLSA, LDA, and their variants, such as author-topic models [4], have succeeded in processing normal-length texts and serve as standard tools for text mining. However, for short texts like tweets or news titles, these conventional topic models show poor performance due to the lack of context. To deal with short texts, different solutions have been proposed, such as aggregating several short texts based on auxiliary information [5], or assuming that one document [6] or one sentence [7] contains only one topic. Equipped with word embeddings, topic models can also exploit external context for short texts [8, 9].

On the other hand, the matrix-decomposition approach of LSA is inherited by nonnegative matrix factorization (NMF) [10] and further tensor decomposition techniques [5]. To our surprise, despite their completely different algorithms and perspectives, NMF and PLSI have been shown to be equivalent in the sense that they optimize the same objective [11].

2.3. Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is perhaps the most common topic model currently in use. It is a generative statistical model that posits each document as a mixture of topics, with each word generated by one of the document's topics.

Figure 1: Graphical model of LDA. Boxes are plates, i.e., levels of replication. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. [3]

To state it formally, each document d in C is generated as follows, with parameters α and β.

1. Choose N ∼ Poisson(ξ) and θ ∼ Dir(α).
2. For each of the N words w_n:
   - Choose a topic z_n ∼ Multinomial(θ).
   - Choose a word w_n according to p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

Here k is the dimensionality of the topic space and β ∈ R^{k×V} denotes the word probabilities, with β(i,j) = p(w_j | z_i). A k-dimensional Dirichlet random variable θ has density

p(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}    (1)

where the parameter α ∈ R^k has strictly positive components. Given the parameters, the marginal distribution of a document is [3]

p(d \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta    (2)

and thus the probability of a corpus is

p(C \mid \alpha, \beta) = \prod_{i=1}^{M} p(d_i \mid \alpha, \beta).    (3)

Inference and parameter estimation are the next steps for LDA. The key inferential problem is to compute the posterior distribution of the hidden variables given a document:

p(\theta, z_{1:N} \mid w_{1:N}, \alpha, \beta) = \frac{p(\theta, z_{1:N}, w_{1:N} \mid \alpha, \beta)}{p(w_{1:N} \mid \alpha, \beta)}.    (4)

We then wish to find α, β that maximize the (marginal) log-likelihood of the observed data:

\max_{\alpha, \beta} \sum_{i=1}^{M} \log p(d_i \mid \alpha, \beta).    (5)

The whole process of LDA, a generative probabilistic model, involves intricate probabilistic calculations; technical details can be found in [3]. In our project we rely on existing Python libraries.
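As an illustration of how such a library can be invoked, the following minimal sketch fits LDA on a document-term matrix with scikit-learn's LatentDirichletAllocation. The report does not pin down the exact library or API, so the class choice, the toy documents and the mapping of our α and β settings onto doc_topic_prior and topic_word_prior are assumptions.

# Sketch: fitting LDA with scikit-learn. The hyperparameter mapping to the
# report's alpha and beta settings is our interpretation, not prescribed code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["oil prices rose", "the team won the final", "bank cuts interest rates"]
vec = CountVectorizer()
X_dm = vec.fit_transform(docs)                 # (documents x terms), as sklearn expects

k = 2                                          # number of topics (toy value)
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=0.1,                       # alpha = 0.1 per component
    topic_word_prior=200.0 / X_dm.shape[1],    # beta = 200 / #words (vocabulary size)
    learning_method="batch",
    random_state=0,
)
doc_topics = lda.fit_transform(X_dm)           # theta: per-document topic mixture
topic_words = lda.components_                  # per-topic word weights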

2.4. Non-negative matrix factorization

The NMF approach is essentially a factorization of the document-term matrix X ∈ R^{V×M} with objective

\min_{W \in \mathbb{R}^{V \times r},\; H \in \mathbb{R}^{r \times M}} \|X - WH\|_F^2 \quad \text{s.t.}\; W, H \ge 0    (6)

where W and H are nonnegative matrices and r is the number of topics. NMF identifies topics and represents each document as a linear combination of topics, which can be interpreted as

X(:,j) \approx \sum_{1 \le i \le r} W(:,i) \, H(i,j)    (7)

where X(:,j) represents the j-th document, W(:,i) represents the i-th topic, and H(i,j) ≥ 0 is the weight (importance) of the i-th topic in the j-th document. The idea of NMF is straightforward and can be generalized to other (linear) feature extraction and classification tasks. Practical algorithms for NMF include alternating least squares (remove the nonnegativity constraint and project back), alternating nonnegative least squares (solve for W and then H iteratively) and hierarchical alternating least squares (locally optimize one column at a time) [12].
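The sketch below solves objective (6) with scikit-learn's NMF on a term-document matrix. The TF-IDF weighting, toy documents and solver settings are illustrative assumptions rather than the configuration used in our experiments.

# Sketch: NMF of a term-document matrix. Passing X with shape (V, M) makes
# W (V x r) hold topics as columns and H (r x M) hold per-document weights,
# matching the notation in Eq. (6) and (7).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["oil prices rose", "the team won the final", "bank cuts interest rates"]
X = TfidfVectorizer().fit_transform(docs).T    # (V, M): one document per column

r = 2                                          # number of topics (toy value)
model = NMF(n_components=r, init="nndsvd", max_iter=400, random_state=0)
W = model.fit_transform(X)                     # (V, r): topics
H = model.components_                          # (r, M): topic weights per document

# Reconstruction error corresponds to ||X - WH||_F in Eq. (6)
print(np.linalg.norm(X.toarray() - W @ H, "fro"))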
2.5. Topic naming scheme

Besides topic identification and document classification, we need to name the topics we generate. In LDA a topic is a distribution over words, while in NMF a topic is a linear combination of word vectors. Hence a natural scheme arises: name a topic by its most frequent (important) words. In this project, we use the 10 top words of each topic as its (graphical) representation.
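A minimal sketch of this naming scheme follows; the function and variable names are ours, and it assumes a (topics × vocabulary) weight array such as the ones produced by the LDA and NMF sketches above.

# Sketch: label each topic by its 10 top words.
import numpy as np

def top_words(components, vocab, n_top=10):
    """Return the n_top highest-weighted words for every topic."""
    names = []
    for topic in components:                     # one row of weights per topic
        best = np.argsort(topic)[::-1][:n_top]   # indices of the largest weights
        names.append([vocab[i] for i in best])
    return names

# Example usage (assuming `lda` and `vec` from the earlier sketches):
# for i, words in enumerate(top_words(lda.components_, vec.get_feature_names_out())):
#     print("Topic %d: %s" % (i, ", ".join(words)))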

3. Experiments

3.1. Datasets

For variety of text corpora, we gathered 4 datasets with different types, text lengths, time ranges and sources.

Sentiment140 contains 160,000 randomly selected tweets, mostly related to daily life (and hence with vague topics).

REUTERS-21578 contains news posted by Reuters in 1987, mostly about economics.

ABCNews consists of the titles and abstracts of the 1,200 most recent documents (up to June 2017) in each of 9 sections: travel, politics, world, US, tech, entertainment, sports, health and money. We crawled it from the ABC News website.

Neural Information Processing Systems (NIPS) is a multi-track machine learning and computational neuroscience conference; our dataset includes the papers of the conference from 1983 to 2016.

Table 1: Diverse datasets. N is the typical document length and M the number of documents.

Name          Type     Time range   N        M
Sentiment140  tweets   2010-2012    short    160000
REUTERS       news     1987         normal   21578
ABCNews       news     2017.3-6     short    10800
NIPS          papers   1983-2016    long     5811

3.2. Setup and preprocessing

In the following experiments, the LDA parameters are set to α = (0.1, ..., 0.1) and β = 200/#words.

Figure 2: Topics of Sentiment140 without preprocessing.

Figure 3: Topics of Sentiment140 via LDA.

Text preprocessing is essential before applying topic models to a corpus. If we fit the raw tweet corpus into the NMF model, as shown in Figure 2, most of the topic keywords are commonly used yet meaningless for our purpose. These so-called stopwords must be dropped in the preprocessing phase. We also drop words shorter than 3 characters. Moreover, to reduce the size of the dictionary, we lemmatize inflected or variant words, e.g., gets → get, apples → apple. This also helps topic naming, as a top word may otherwise appear as several variants among the keywords. In our experiments there are always more than 5 topics, so we discard words with document frequency df > 0.3 > 1/5 (too common) and those with df < 0.001 (too rare).
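A sketch of this preprocessing pipeline is shown below; the report does not name the tools we used, so NLTK's WordNetLemmatizer and scikit-learn's CountVectorizer (with its built-in English stopword list) stand in as assumed choices.

# Sketch: stopword removal, minimum word length 3, lemmatization, and
# document-frequency filtering (drop df > 0.3 and df < 0.001).
# Requires the NLTK 'wordnet' corpus (nltk.download('wordnet')).
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # lowercase, keep alphabetic tokens of length >= 3, lemmatize each token
    tokens = [t.strip(".,!?:;\"'") for t in text.lower().split()]
    tokens = [t for t in tokens if t.isalpha() and len(t) >= 3]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

vectorizer = CountVectorizer(
    preprocessor=preprocess,
    stop_words="english",
    max_df=0.3,      # discard words appearing in more than 30% of documents
    min_df=0.001,    # discard words appearing in fewer than 0.1% of documents
)
# X = vectorizer.fit_transform(raw_documents)   # raw_documents: list of strings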
3.3. Experiment results: Sentiment140 & REUTERS

We run NMF and LDA over the Sentiment140 and REUTERS datasets in order to compare the two algorithms. We visualize the results on Sentiment140 in Figures 3 and 4, and use Tables 2 and 3 to show the topics extracted from REUTERS.

We observe that on the REUTERS dataset LDA produces more reasonable results than NMF. For example, topic 1 of NMF fails, while topics 1 and 4 of LDA relate to raw materials and exported products respectively. However, LDA fails to find acceptable topics among the tweets in Sentiment140, while NMF stays productive. We observe that LDA is good at dealing with coherent topics, whereas NMF is good at incoherent topics. By coherent we mean that topics are clearly different from each other and one document usually possesses only one topic. The reason is intuitive: by their definitions, NMF treats a document as a linear superposition of topics, while LDA tends to find a single topic maximizing the likelihood. Most tweets in Sentiment140 are incoherent, with vague topics and mostly common words, which makes it hard for LDA to discover well-defined topics. The situation is reversed on the REUTERS dataset, where the maximum-likelihood method of LDA surpasses NMF.

Figure 4: Topics of Sentiment140 via NMF. More reasonable.

Table 2: NMF on REUTERS: top 10 words from 5 topics. Topic 1 fails.
Topic 1: blah, zurich, exchanged, executives, executive, execution, executed, exclusively, exclusive, excluding
Topic 2: rev, shr, mths, note, qtr, year, ended, revs, months, full
Topic 3: div, qtly, record, pay, prior, march, payable, qtr, dividend, qtrly
Topic 4: pct, year, company, stock, share, year, corp, common, bank, group
Topic 5: avg, shrs, diluted, primary, shr, tax, periods, note, 1st, split

Table 3: LDA on REUTERS: top 10 words from 5 topics. prod'n is short for production, int'l for international, sh'ders for shareholders.
Topic 1: oil, prices, prod'n, opec, gas, gold, price, crude, bpd, energy
Topic 2: company, year, quarter, earnings, sales, share, reported, expects, corp, operations
Topic 3: pct, billion, year, february, stg, rise, fell, december, compared, march
Topic 4: tonnes, sugar, coffee, prices, export, price, market, producers, int'l, year
Topic 5: shares, company, corp, pct, stock, share, group, offer, common, sh'ders

3.4. Experiment results: ABCNews

To get a quantified evaluation, we use accuracy (AC) to measure the performance of the two algorithms on the ABCNews dataset. In ABCNews each document is naturally labeled by its section, so we can calculate accuracy once the obtained topics are matched to these sections. More precisely, given the corpus C = {d_1, ..., d_M}, let a_i and y_i be the obtained topic label and the natural section of d_i respectively. Then

\mathrm{AC} = \frac{1}{M} \sum_{i=1}^{M} \delta(\mathrm{map}(a_i), y_i)    (8)

where δ(a, b) = 1 if a = b and 0 otherwise, and map(a_i) maps the obtained topic a_i to its corresponding section (done manually).

We measure AC for both NMF and LDA on ABCNews, conditioned on using only titles or using titles and abstracts. We drop the categories health, tech and travel because the abstracts of these news items consist of only simple, meaningless sub-tags. As shown in Table 4, LDA is better than NMF when measured by AC, which validates our arguments in the previous subsection. Neither of the two models reaches an average accuracy of 65%, which is perhaps due to the complexity of the dataset (real and raw news) and implies room for improvement.

Table 4: Comparison of AC. Sections e, m, p, s, u, w are entertainment, money, politics, sports, US and world respectively. (t) means using only titles, not abstracts.

         e     m     p     s     u     w     AVG
LDA(t)  0.25  0.39  0.60  0.36  0.44  0.19  0.37
NMF(t)  0.29  0.31  0.46  0.25  0.25  0.09  0.27
LDA     0.63  0.56  0.83  0.64  0.58  0.37  0.60
NMF     0.46  0.67  0.53  0.24  0.51  0.42  0.47
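For completeness, a minimal sketch of this AC computation is given below; the toy values and the topic_to_section dictionary (our stand-in for the manual map) are illustrative assumptions.

# Sketch of Eq. (8): AC = (1/M) * sum_i delta(map(a_i), y_i).
import numpy as np

def accuracy(assigned, sections, topic_to_section):
    """assigned: topic index a_i per document; sections: true section y_i."""
    hits = sum(1 for a, y in zip(assigned, sections)
               if topic_to_section.get(a) == y)
    return hits / len(sections)

# Toy example (purely illustrative):
assigned = np.array([0, 0, 1, 2, 1])            # obtained topic per document
sections = ["politics", "politics", "sports", "money", "sports"]
topic_to_section = {0: "politics", 1: "sports", 2: "world"}   # manual map(.)
print(accuracy(assigned, sections, topic_to_section))         # 0.8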

3.5. Experiment results: NIPS

So far we have provided quantified results and a comparative analysis of the two algorithms. The NIPS dataset serves as extra fun for us: we set out to find the keywords of the conference during the periods 1987-1992, 1993-1998, 1999-2004, 2005-2010 and 2011-2015 using our topic model. The result is illustrated in Figure 5. It is quite clear that Bayesian methods, posterior probabilities and classifiers were popular from 1993 to 1998, SVMs and boosting became hot around 2000, and neural networks have been rising gradually since 2005. In the last five years the three hotspot topics concern game theory, images and deep neural networks, respectively. It seems that our topic model works quite well.

Figure 5: Keywords of NIPS through the timeline.

4. Discussion

In this project we investigate two representative topic models, NMF and LDA. Compared with NMF, LDA is a more general model, and its behavior changes greatly with the parameters α and β. Our experiments, especially those on the tweets dataset Sentiment140 and the news dataset REUTERS, show the pros and cons of the two topic discovery models. NMF is good in incoherent or vague topic settings, while LDA is good in coherent or specific topic settings. This results from the different designs of the models: matrix factorization tends to mix topics together, while the maximum-likelihood method has difficulty detecting parallel topics within a document.

Though efficient and achieving somewhat acceptable results, neither method reaches an accuracy above 0.6, as shown in Table 4. These two (mostly) linear methods certainly fail to capture the rich non-linear context within documents. To fix this, we propose, independently, that word embeddings can be incorporated into topic models, which has already been discussed in [8, 9]. Yet we believe more can be done to tackle the context problem and polish topic models.

Besides document clustering, there is more that can be done with topic discovery; an interesting instance is the NIPS timeline discovery. In fact, topic models are a basic tool for natural language processing (NLP), text mining, machine learning and other areas. In return, the development of these subjects also boosts the development of topic discovery techniques. We are open to learning more about topic discovery in the future.

In this project we learned not only by reading papers and deriving formulas, but also from failed attempts, failed schedules, wrong choices and debugging code. For instance, we will never forget the importance of data preprocessing. We want to thank all the devoted teachers of this course for such an interesting yet unique experience.

5. References

[1] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, 1990.
[2] T. Hofmann, "Probabilistic latent semantic indexing," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50-57.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003.
[4] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents," pp. 487-494, 2012.
[5] F. Li, Q. Zhu, and X. Lin, "Topic discovery in research literature based on non-negative matrix factorization and testor theory," in Asia-Pacific Conference on Information Processing, 2009, pp. 266-269.
[6] W. X. Zhao, J. Jiang, J. Weng, J. He, E. P. Lim, H. Yan, and X. Li, "Comparing Twitter and traditional media using topic models," in European Conference on Advances in Information Retrieval, 2011, pp. 338-349.
[7] A. Gruber, Y. Weiss, and M. Rosen-Zvi, "Hidden topic Markov models," in Proceedings of Artificial Intelligence and Statistics, 2007, pp. 163-170.
[8] R. Das, M. Zaheer, and C. Dyer, "Gaussian LDA for topic models with word embeddings," in Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2015, pp. 795-804.
[9] J. Qiang, P. Chen, T. Wang, and X. Wu, "Topic modeling over short texts by incorporating word embeddings," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017, pp. 363-374.
[10] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2000, pp. 556-562.
[11] C. Ding, T. Li, and W. Peng, "On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing," Computational Statistics & Data Analysis, vol. 52, no. 8, pp. 3913-3927, 2008.
[12] N. Gillis, "The why and how of nonnegative matrix factorization," 2014.