Topic Discovery Project Report

Shunyu Yao and Xingjiang Yu
IIIS, Tsinghua University {yao-sy15,

Abstract

In this report we present our implementations of topic discovery models on different datasets. Two classical algorithms, nonnegative matrix factorization (NMF) and latent Dirichlet allocation (LDA), are used to extract latent semantic structures (topics) from text bodies. They represent two significant classes of techniques, namely matrix techniques and probabilistic techniques. The text corpora range from daily tweets to neural network papers, and on this basis we provide qualitative and quantitative comparisons and analyze the advantages and limits of the different methods.

Index Terms: topic discovery, text mining, NLP, NMF, LDA

1. Introduction

In the age of information, the ever-increasing amount of text calls for efficient methods for document classification and clustering. To this end, it is essential to discover the latent semantic structures of documents in a fast, stable and robust manner. Such a topic model can organize, and help us understand, the massive unstructured text bodies encountered every day. In this project, we implement several representative topic discovery algorithms and evaluate them on several different datasets. The work we have done in this project can be divided into the following parts.

1. Study the topic discovery problem by reading papers on different approaches and contexts. The basic approach to topic discovery is to model a document as a mixture of topics; methods differ in how they model a topic. This is described in Section 2.
2. Obtain text corpora for our use. Four datasets are selected and processed from the Internet, covering everyday tweets, news in politics, economics, sports and other sections, and conference papers spanning decades. This diversity allows insight into the pros and cons of the different methods.
3.
Conduct experiments on multiple datasets using the different methods and analyze the results. This part consists of our observations and discoveries during the experiments. We also offer explanations of the results and possible ways to improve them. Experimental details can be found in Section 3 and discussions in Section 4.

2. Methodology

2.1. Notations and terminology

We define the following terms formally:

A word is the basic unit of discrete data, indexed by $\{1, \dots, V\}$. For convenience, we represent the $w$th word as a unit vector $\mathbf{w} \in \mathbb{R}^{V}$ whose $w$th component is 1 and whose other components are 0.

A document of $N$ words is a sequence of words $d = (w_1, \dots, w_N)$. Its vector representation $\mathbf{d}$ is the sum of its word vectors, $\mathbf{w}_1 + \dots + \mathbf{w}_N$.

A corpus is a collection of $M$ documents $C = \{d_1, \dots, d_M\}$. The corresponding document-term matrix $X = [\mathbf{d}_1 \cdots \mathbf{d}_M] \in \mathbb{R}^{V \times M}$ describes each document as a column.

For a matrix $A$, let $A(i, :)$ denote its $i$th row and $A(:, j)$ denote its $j$th column.

2.2. Related Works

A great diversity of topic models has been proposed to reveal the hidden semantic structure of text corpora. Latent Semantic Analysis (LSA) [1] is one of the first topic discovery attempts; it applies singular value decomposition (SVD) to the document-term matrix. After LSA, Probabilistic Latent Semantic Analysis (PLSA) [2] and Latent Dirichlet Allocation (LDA) [3] further use a hidden topic variable to model topics as distributions over words. PLSA, LDA and their variants, such as author-topic models [4], have succeeded in processing texts of normal length and serve as standard tools for text mining. However, for short texts like tweets or news titles, these conventional topic models perform poorly due to the lack of context. To deal with short texts, different solutions have been proposed, such as aggregating several short texts based on auxiliary information [5], or assuming that one document [6] or one sentence [7] contains only one topic.
Equipped with word embeddings, topic models can exploit external context for short texts [8, 9]. On the other hand, the matrix-decomposition approach of LSA is inherited by nonnegative matrix factorization (NMF) [10] and further tensor decomposition techniques [5]. Surprisingly, although they are completely different algorithms built from different perspectives, NMF and PLSI (another name for PLSA) have been shown to be equivalent in that they optimize the same objective [11].

2.3. Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is perhaps the most common topic model currently in use. It is a generative statistical model that posits each document as a mixture of topics, with each word generated by one of the document's topics.
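To make this generative view concrete, the following is a minimal NumPy sketch that samples one synthetic document. It assumes, following the formulation in [3], that the document length is Poisson and the topic proportions are Dirichlet. The function name `sample_document`, the toy sizes and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_document(alpha, beta, xi, rng):
    """Sample one document (a list of word indices) from the LDA generative model.

    alpha: shape (k,) Dirichlet parameter.
    beta:  shape (k, V) per-topic word probabilities (each row sums to 1).
    xi:    Poisson mean of the document length.
    """
    k, V = beta.shape
    n_words = rng.poisson(xi)                       # document length N
    theta = rng.dirichlet(alpha)                    # topic proportions of this document
    topics = rng.choice(k, size=n_words, p=theta)   # one topic per word position
    words = [int(rng.choice(V, p=beta[z])) for z in topics]  # word drawn from its topic
    return theta, topics, words

rng = np.random.default_rng(0)
k, V = 3, 8                                   # toy sizes (illustrative assumption)
alpha = np.full(k, 0.1)
beta = rng.dirichlet(np.ones(V), size=k)      # random topics for the demo
theta, topics, words = sample_document(alpha, beta, xi=20, rng=rng)
print("topic proportions:", theta.round(3))
print("sampled word indices:", words)
```

Because every component of `alpha` is below 1, a sampled `theta` typically puts most of its mass on one or two topics, which is what makes a document read as "about" a few topics.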

To state it formally, each document $d$ in $C$ is generated as follows, with parameters $\alpha$, $\beta$.

1. Choose $N \sim \mathrm{Poisson}(\xi)$ and $\theta \sim \mathrm{Dir}(\alpha)$.
2. For each of the $N$ words $w_n$:
   Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   Choose a word $w_n$ according to $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

Here $k$ is the number of topics and $\beta \in \mathbb{R}^{k \times V}$ denotes the word probabilities, with $\beta(i, j) = p(w_j \mid z_i)$. A $k$-dimensional Dirichlet random variable $\theta$ has the distribution

$$ p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1} \tag{1} $$

where the parameter $\alpha \in \mathbb{R}^{k}$ has components $\alpha_i > 0$. Given the parameters, we can deduce [3] that the marginal distribution of a document is

$$ p(d \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta \tag{2} $$

and thus the probability of a corpus is

$$ p(C \mid \alpha, \beta) = \prod_{i=1}^{M} p(d_i \mid \alpha, \beta). \tag{3} $$

Inference and parameter estimation are the next steps for LDA. The key inferential problem is to compute the posterior distribution of the hidden variables given a document:

$$ p(\theta, z_{1:N} \mid \alpha, \beta, w_{1:N}) = \frac{p(\theta, z_{1:N}, w_{1:N} \mid \alpha, \beta)}{p(w_{1:N} \mid \alpha, \beta)}. \tag{4} $$

We then wish to find the $\alpha$, $\beta$ that maximize the (marginal) log-likelihood of the observed data:

$$ \max_{\alpha, \beta} \sum_{i=1}^{M} \log p(d_i \mid \alpha, \beta). \tag{5} $$

The whole process of LDA, a generative probabilistic model, involves complicated probabilistic calculations. Technical details can be found in [3]. In our project, we rely on existing Python libraries.

Figure 1: Graphical model of LDA. Boxes are plates representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words [3].

2.4. Non-negative matrix factorization

The NMF approach is essentially a factorization of the document-term matrix $X \in \mathbb{R}^{V \times M}$ with the objective

$$ \min_{W \in \mathbb{R}^{V \times r},\ H \in \mathbb{R}^{r \times M}} \|X - WH\|_2^2 \quad \text{s.t. } W, H \ge 0 \tag{6} $$

where $W$, $H$ are nonnegative matrices and $r$ is the number of topics.
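As an illustration of objective (6), here is a minimal NumPy sketch that reduces the squared reconstruction error with the multiplicative updates of Lee and Seung [10]. This algorithm is swapped in for brevity and is not necessarily the one used in the experiments of this report. The sketch reads the norm in (6) as the Frobenius norm, and the function name `nmf_multiplicative`, the toy matrix and the iteration count are assumptions.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative X (V x M) by W (V x r) @ H (r x M) with W, H >= 0."""
    rng = np.random.default_rng(seed)
    V, M = X.shape
    W = rng.random((V, r)) + eps
    H = rng.random((r, M)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H) ** 2)
    return W, H, errors

# Toy document-term matrix: 6 words (rows) x 4 documents (columns).
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 4, 2],
    [0, 1, 3, 3],
    [0, 0, 2, 4],
], dtype=float)

W, H, errors = nmf_multiplicative(X, r=2)
print("squared error, first vs last iteration:", errors[0], errors[-1])
print("largest-weight words per topic:", np.argsort(W, axis=0)[::-1][:3].T)
```

Each column of `W` plays the role of a topic over the 6 toy words, and each column of `H` gives the nonnegative topic weights of one document.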
NMF identifies topics and represents each document as a linear sum of topics, which can be interpreted as

$$ X(:, j) \approx \sum_{1 \le i \le r} W(:, i) \, H(i, j) \tag{7} $$

where $X(:, j)$ represents the $j$th document, $W(:, i)$ represents the $i$th topic and $H(i, j) \ge 0$ is the weight (importance) of the $i$th topic in the $j$th document. The idea of NMF is straightforward and can be generalized to other (linear) feature extraction and classification tasks. Practical algorithms for NMF include the alternating least squares method (drop the nonnegativity constraint, then project back), alternating nonnegative least squares (solve for $W$ and then $H$ iteratively) and hierarchical alternating least squares (locally optimize one column at a time) [12].

2.5. Topic naming scheme

Besides topic identification and document classification, we need to name the generated topics. In LDA a topic is a distribution over words, while in NMF a topic is a linear sum of word vectors. Hence a natural scheme is to name a topic by its most frequent (important) words. In the project, we use the 10 most frequent words of a topic as its (graphical) representation.

3. Experiments

3.1. Datasets

To cover a variety of text corpora, we obtain 4 datasets with different types, text lengths, time ranges and sources.

Sentiment140 contains randomly selected tweets, mostly related to daily life (hence with vague topics).

REUTERS contains news posted by Reuters in 1987, mostly about economics.

ABCNews consists of the titles and abstracts of the most recent 1200 documents (till ) of each of the 9 sections: travel, politics, world, US, tech, entertainment, sports, health, money. We crawled it from the ABC News website.

Neural Information Processing Systems (NIPS) is a multi-track machine learning and computational neuroscience conference, and our dataset includes the papers of the conference over two decades.

Name          Type    Time range  N       M
Sentiment140  tweets              short
REUTERS       news    1987        normal
ABCNews       news                short
NIPS          papers              long

Table 1: Diverse datasets.

3.2. Setup and preprocessing
In the following experiments, the LDA parameters are set to $\alpha = (0.1, \dots, 0.1)$ and $\beta = 200/\#\text{words}$.
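To show where these two parameters enter the computation, here is a minimal collapsed Gibbs sampler for LDA in NumPy. This is a swapped-in inference method chosen because it is short; it is not claimed to be what the Python libraries used in this project do. The sketch reads $\beta$ as a symmetric prior on the topic-word distributions, which is an assumption about the setting above, and the function name `lda_gibbs` and the toy corpus are hypothetical.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors alpha and beta.

    docs: list of documents, each a list of word indices in [0, n_words).
    Returns the count matrices (document-topic, topic-word).
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))    # topic counts per document
    n_kw = np.zeros((n_topics, n_words))      # word counts per topic
    n_k = np.zeros(n_topics)                  # total number of tokens per topic
    z = []                                    # current topic of every token
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))   # random initial topics
        z.append(z_d)
        for w, t in zip(doc, z_d):
            n_dk[d, t] += 1
            n_kw[t, w] += 1
            n_k[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token, then resample its topic given all other tokens.
                n_dk[d, t] -= 1
                n_kw[t, w] -= 1
                n_k[t] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                n_dk[d, t] += 1
                n_kw[t, w] += 1
                n_k[t] += 1
    return n_dk, n_kw

# Toy corpus over a 6-word vocabulary (hypothetical data).
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 0], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
# The text sets beta = 200 / #words; for a hypothetical vocabulary of
# 20000 words this gives 0.01, which is the value used here.
n_dk, n_kw = lda_gibbs(docs, n_topics=2, n_words=6, alpha=0.1, beta=0.01)
theta = (n_dk + 0.1) / (n_dk + 0.1).sum(axis=1, keepdims=True)
print(theta.round(2))   # estimated topic proportions, one row per document
```

A small $\alpha$ pushes each document toward few topics and a small $\beta$ pushes each topic toward few words, which is why both appear as additive smoothing terms in the sampling probability.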

Text preprocessing is essential before applying topic models to a corpus. If we feed the raw tweet corpus into the NMF model, as shown in Figure 2, most of the topic keywords are common words that are meaningless for our purpose. These so-called stopwords must be dropped in the preprocessing phase. We also drop words shorter than 3 characters. Moreover, to reduce the size of the dictionary, we lemmatize inflected or variant words, e.g., gets → get, apples → apple. This also helps topic naming, as a top word may otherwise appear in several variants. In our experiments there are always more than 5 topics, so we discard words with document frequency df > 0.3 > 1/5 (too common) and those with df < (too rare).

Figure 2: Topics of Sentiment140 without preprocessing.

3.3. Experiment results: Sentiment140 & REUTERS

We run NMF and LDA on the Sentiment140 and REUTERS datasets in order to compare the two algorithms. We visualize the results on Sentiment140 in Figures 3 and 4, and use Tables 2 and 3 to show the topics extracted from REUTERS.

Figure 3: Topics of Sentiment140 via LDA.

We observe that on the REUTERS dataset, LDA produces more reasonable results than NMF. For example, topic 1 of NMF fails, while topics 1 and 4 of LDA relate to raw materials and exported products respectively. However, LDA fails to find acceptable topics in the tweets of Sentiment140, while NMF stays productive. We observe that LDA is good at dealing with coherent topics, whereas NMF is good at incoherent topics. By coherent we mean that topics are clearly different from each other and one document usually possesses only one topic. The reason is intuitive. By definition, NMF treats a document as a linear superposition of topics, while LDA tends to find a certain topic maximizing the likelihood. Most tweets in Sentiment140 are incoherent, with vague topics and mostly common words, making it hard for LDA to discover well-defined topics.
The situation is reversed on the REUTERS dataset, where the maximum-likelihood method of LDA surpasses NMF.

Figure 4: Topics of Sentiment140 via NMF. More reasonable.

Topic 1      Topic 2  Topic 3   Topic 4  Topic 5
blah         rev      div       pct      avg
zurich       shr      qtly      year     shrs
exchanged    mths     record    company  diluted
executives   note     pay       stock    primary
executive    qtr      prior     share    shr
execution    year     march     year     tax
executed     ended    payable   corp     periods
exclusively  revs     qtr       common   note
exclusive    months   dividend  bank     1st
excluding    full     qtrly     group    split

Table 2: NMF on REUTERS: top 10 words from 5 topics. Topic 1 fails.

3.4. Experiment results: ABCNews

To obtain a quantitative evaluation, we use accuracy (AC) to measure the performance of the two algorithms on the ABCNews dataset. In ABCNews, each document is naturally labeled with its section, and we can compute the accuracy if the obtained topics correspond to these sections. More precisely, given the corpus $C = \{d_1, \dots, d_M\}$, let $a_i$ and $y_i$ be the (obtained) label and the (natural) section of $d_i$ respectively. Then

$$ AC = \frac{1}{M} \sum_{i=1}^{M} \delta(\mathrm{map}(a_i), y_i) \tag{8} $$

where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise, and $\mathrm{map}(a_i)$ maps the obtained topic $a_i$ to the corresponding section (done manually). We measure the AC of both NMF and LDA on ABCNews, using either only titles or titles and abstracts. We drop the categories health, tech and travel because the abstracts of these news items consist only of simple, meaningless sub-tags. As shown in Table 4, LDA is better than NMF when measured by AC, which supports our argument in the last subsection. Neither of the two models reaches an average accuracy of 65%, perhaps due to the complexity of the dataset (real, raw news), which implies room for improvement.
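The accuracy in (8) can be computed in a few lines of plain Python. The following is a minimal sketch: the function name `accuracy`, the manual topic-to-section mapping and the toy labels are all hypothetical, and taking each document's obtained label to be its highest-weight topic is an assumption that the text does not state.

```python
def accuracy(assigned_topics, true_sections, topic_to_section):
    """AC = (1/M) * sum_i delta(map(a_i), y_i), as in equation (8)."""
    if len(assigned_topics) != len(true_sections):
        raise ValueError("need exactly one true section per document")
    if not assigned_topics:
        raise ValueError("need at least one document")
    hits = sum(
        topic_to_section[a] == y
        for a, y in zip(assigned_topics, true_sections)
    )
    return hits / len(assigned_topics)

# Hypothetical example: 3 topics mapped by hand to ABCNews-style sections.
topic_to_section = {0: "sports", 1: "politics", 2: "money"}
# a_i: obtained topic per document (assumed here to be its highest-weight topic).
assigned_topics = [0, 0, 1, 2, 2, 1]
# y_i: the section each document was crawled from.
true_sections = ["sports", "politics", "politics", "money", "money", "sports"]

ac = accuracy(assigned_topics, true_sections, topic_to_section)
print(f"AC = {ac:.3f}")
```

Because the mapping from topics to sections is chosen by hand, two topics may map to the same section, and a section that no topic maps to contributes only misses.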

Topic 1  Topic 2     Topic 3   Topic 4    Topic 5
oil      company     pct       tonnes     shares
prices   year        billion   sugar      company
prod'n   quarter     year      coffee     corp
opec     earnings    february  prices     pct
gas      sales       stg       export     stock
gold     share       rise      price      share
price    reported    fell      market     group
crude    expects     december  producers  offer
bpd      corp        compared  int'l      common
energy   operations  march     year       sh'ders

Table 3: LDA on REUTERS: top 10 words from 5 topics. prod'n is short for production, int'l for international, sh'ders for shareholders.

       e  m  p  s  u  w  AVG
LDAt
NMFt
LDA
NMF

Table 4: Comparison of AC. Sections are entertainment, money, politics, sports, US and world respectively. (t) means using only titles, not abstracts.

3.5. Experiment results: NIPS

Up to now, we have provided quantitative results and a comparative analysis of the two algorithms. The NIPS dataset serves as extra fun for us: we set out to find the keywords of the conference during the periods , , , and using our topic model. The result is illustrated in Figure 5. It is quite clear that Bayesian methods, posterior probability and classifiers were popular from 1993 to 1998, that SVM and boosting became hot from 2000, and that neural networks have been rising gradually since . Currently (in the last 5 years) the three hottest topics are game theory, images and deep neural networks respectively. It seems that our topic model works quite well.

4. Discussion

In this project, we investigate two representative topic models, namely NMF and LDA. Compared with NMF, LDA is a more general model whose results change greatly with the parameters α and β. Our experiments, especially those on the tweet dataset Sentiment140 and the news dataset REUTERS, show the pros and cons of the two topic discovery models. NMF is good in settings with incoherent or vague topics, while LDA is good in settings with coherent or specific topics. This results from the different designs of the models. Matrix factorization tends to mix topics together, while the maximum-likelihood method has difficulty detecting parallel topics in a document.
Though both methods are efficient and achieve somewhat acceptable results, it is worth pointing out that neither has accuracy above 0.6, as shown in Table 4. These two (mostly) linear methods certainly fail to capture the rich non-linear context within the documents. To fix this, we independently arrived at the idea of incorporating word embeddings into topic models, which has already been discussed in [8, 9]. Yet we believe more can be done to tackle the context problem and polish the topic models.

Figure 5: Keywords of NIPS through timeline.

Besides document clustering, there are more things that can be done with topic discovery. An interesting instance is the NIPS timeline discovery. In fact, topic models are a basic tool for natural language processing (NLP), text mining, machine learning and other areas. In return, the development of these subjects also boosts the development of topic discovery techniques. We are open to learning more about topic discovery in the future.

In the project, we learned not only by reading papers and deriving formulas, but also from failed attempts, missed schedules, wrong choices and debugging code. For instance, we will never forget the importance of data preprocessing. We want to thank all the devoted teachers of this course for such an interesting and unique experience.

5. References

[1] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science.
[2] T. Hofmann, "Probabilistic latent semantic indexing," in International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation."
[4] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents."
[5] F. Li, Q. Zhu, and X. Lin, "Topic discovery in research literature based on non-negative matrix factorization and testor theory," in Asia-Pacific Conference on Information Processing, 2009.
[6] W. X. Zhao, J. Jiang, J. Weng, J. He, E. P. Lim, H. Yan, and X. Li, "Comparing Twitter and traditional media using topic models," in European Conference on Advances in Information Retrieval, 2011.
[7] A. Gruber, Y. Weiss, and M. Rosen-Zvi, "Hidden topic Markov models," Proceedings of Artificial Intelligence & Statistics.
[8] R. Das, M. Zaheer, and C. Dyer, "Gaussian LDA for topic models with word embeddings," in Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2015.
[9] J. Qiang, P. Chen, T. Wang, and X. Wu, "Topic modeling over short texts by incorporating word embeddings," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2017.
[10] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2000.
[11] C. Ding, T. Li, and W. Peng, "On the equivalence between nonnegative matrix factorization and probabilistic latent semantic indexing," Computational Statistics & Data Analysis, vol. 52, no. 8.
[12] N. Gillis, "The why and how of nonnegative matrix factorization," Mathematics, 2014.

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing

Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing Note on Algorithm Differences Between Nonnegative Matrix Factorization And Probabilistic Latent Semantic Indexing 1 Zhong-Yuan Zhang, 2 Chris Ding, 3 Jie Tang *1, Corresponding Author School of Statistics,

More information

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Information retrieval LSI, plsi and LDA. Jian-Yun Nie Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: For a square matrix A: Ax = λx where x is a vector (eigenvector), and

More information

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Topic Models Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Low-Dimensional Space for Documents Last time: embedding space

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Document and Topic Models: plsa and LDA

Document and Topic Models: plsa and LDA Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis

More information

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations : DAG-Structured Mixture Models of Topic Correlations Wei Li and Andrew McCallum University of Massachusetts, Dept. of Computer Science {weili,mccallum} Abstract Latent Dirichlet allocation

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Latent Dirichlet Allocation

Latent Dirichlet Allocation Outlines Advanced Artificial Intelligence October 1, 2009 Outlines Part I: Theoretical Background Part II: Application and Results 1 Motive Previous Research Exchangeability 2 Notation and Terminology

More information

Probabilistic Dyadic Data Analysis with Local and Global Consistency

Probabilistic Dyadic Data Analysis with Local and Global Consistency Deng Cai DENGCAI@CAD.ZJU.EDU.CN Xuanhui Wang XWANG20@CS.UIUC.EDU Xiaofei He XIAOFEIHE@CAD.ZJU.EDU.CN State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, 100 Zijinggang Road, 310058,

More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA

More information

Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds Mario Fritz Max Planck Institute for Informatics

More information

Topic Modelling and Latent Dirichlet Allocation

Topic Modelling and Latent Dirichlet Allocation Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Yuriy Sverchkov Intelligent Systems Program University of Pittsburgh October 6, 2011 Outline Latent Semantic Analysis (LSA) A quick review Probabilistic LSA (plsa)

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Dan Oneaţă 1 Introduction Probabilistic Latent Semantic Analysis (plsa) is a technique from the category of topic models. Its main goal is to model cooccurrence information

More information


CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Text Data: Topic Model Instructor: Yizhou Sun December 4, 2017 Methods to be Learnt Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Dirichlet Enhanced Latent Semantic Analysis

Dirichlet Enhanced Latent Semantic Analysis Dirichlet Enhanced Latent Semantic Analysis Kai Yu Siemens Corporate Technology D-81730 Munich, Germany Shipeng Yu Institute for Computer Science University of Munich D-80538 Munich,

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information



More information

Comparative Summarization via Latent Dirichlet Allocation

Comparative Summarization via Latent Dirichlet Allocation Comparative Summarization via Latent Dirichlet Allocation Michal Campr and Karel Jezek Department of Computer Science and Engineering, FAV, University of West Bohemia, 11 February 2013, 301 00, Plzen,

More information

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014 Understanding Comments Submitted to FCC on Net Neutrality Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014 Abstract We aim to understand and summarize themes in the 1.65 million

More information

Latent Dirichlet Allocation Based Multi-Document Summarization

Latent Dirichlet Allocation Based Multi-Document Summarization Latent Dirichlet Allocation Based Multi-Document Summarization Rachit Arora Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai - 600 036, India.

More information

topic modeling hanna m. wallach

topic modeling hanna m. wallach university of massachusetts amherst Ramona Blei-Gantz Helen Moss (Dave's Grandma) The Next 30 Minutes Motivations and a brief history: Latent semantic analysis Probabilistic latent

More information


AN INTRODUCTION TO TOPIC MODELS AN INTRODUCTION TO TOPIC MODELS Michael Paul December 4, 2013 600.465 Natural Language Processing Johns Hopkins University Prof. Jason Eisner Making sense of text Suppose you want to learn something about

More information

Text mining and natural language analysis. Jefrey Lijffijt

Text mining and natural language analysis. Jefrey Lijffijt Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University Jeff Schneider The Robotics Institute

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed ant then heavily modified from: Jonathan Huang

More information

Web Search and Text Mining. Lecture 16: Topics and Communities

Web Search and Text Mining. Lecture 16: Topics and Communities Web Search and Tet Mining Lecture 16: Topics and Communities Outline Latent Dirichlet Allocation (LDA) Graphical models for social netorks Eploration, discovery, and query-ansering in the contet of the

More information

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Replicated Softmax: an Undirected Topic Model. Stephen Turner Replicated Softmax: an Undirected Topic Model Stephen Turner 1. Introduction 2. Replicated Softmax: A Generative Model of Word Counts 3. Evaluating Replicated Softmax as a Generative Model 4. Experimental

More information

Modeling Environment

Modeling Environment Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA

More information

Applying LDA topic model to a corpus of Italian Supreme Court decisions

Applying LDA topic model to a corpus of Italian Supreme Court decisions Applying LDA topic model to a corpus of Italian Supreme Court decisions Paolo Fantini Statistical Service of the Ministry of Justice - Italy CESS Conference - Rome - November 25, 2014 Our goal finding

More information

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition Yu-Seop Kim 1, Jeong-Ho Chang 2, and Byoung-Tak Zhang 2 1 Division of Information and Telecommunication

More information


PROBABILISTIC LATENT SEMANTIC ANALYSIS PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information
