Hybrid Models for Text and Graphs. 10/23/2012 Analysis of Social Media


Newswire Text
  Formal. Primary purpose: inform the typical reader about recent events.
  Broad audience: explicitly establish shared context with the reader.
  Ambiguity often avoided.
Social Media Text
  Informal. Many purposes: entertain, connect, persuade.
  Narrow audience: friends and colleagues; shared context already established.
  Many statements are ambiguous out of social context.

Newswire Text
  Goals of analysis: extract information about events from text.
  Understanding text requires understanding the typical reader and the conventions for communicating with him/her: prior knowledge, background, ...
Social Media Text
  Goals of analysis: very diverse.
  Evaluation is difficult, and requires revisiting often as goals evolve.
  Often understanding social text requires understanding a community.

Outline
  Tools for analysis of text
  Probabilistic models for text, communities, and time
    Mixture models and LDA models for text
    LDA extensions to model hyperlink structure
    LDA extensions to model time

Introduction to Topic Models
Mixture model: unsupervised naïve Bayes model.
Joint probability of words and classes:
  $P(c, w_{1:N}) = \pi_c \prod_{n=1}^{N} \beta_{c, w_n}$
But the classes are not visible (latent).
[Plate diagram: class drawn from $\pi$; words W (plate N) drawn from $\beta$ given the class; M documents.]

Introduction to Topic Models: Latent Dirichlet Allocation [Blei, Ng, Jordan, JMLR 2003]

Introduction to Topic Models
Latent Dirichlet Allocation
  For each document d = 1, ..., M:
    Generate θ_d ~ Dir(α)
    For each position n = 1, ..., N_d:
      generate z_n ~ Mult(θ_d)
      generate w_n ~ Mult(β_{z_n})
[Plate diagram: α → θ → z → w, with word distributions β; plates N and M.]
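A minimal sketch of this generative process in NumPy (not from the slides; the topic count, vocabulary size, document count, and Poisson document lengths below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, M = 5, 1000, 100                  # topics, vocabulary size, documents (assumed)
    alpha = np.full(K, 0.1)                 # Dirichlet prior over topic mixtures
    beta = rng.dirichlet(np.full(V, 0.01), size=K)   # per-topic word distributions

    docs = []
    for d in range(M):
        theta = rng.dirichlet(alpha)                  # theta_d ~ Dir(alpha)
        N_d = rng.poisson(50)                         # document length (assumption)
        z = rng.choice(K, size=N_d, p=theta)          # z_n ~ Mult(theta_d)
        docs.append([rng.choice(V, p=beta[zn]) for zn in z])  # w_n ~ Mult(beta_{z_n})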

Introduction to Topic Models
Latent Dirichlet Allocation overcomes some technical issues with PLSA:
  PLSA only estimates mixing parameters for training docs.
Parameter learning is more complicated:
  Gibbs sampling: easy to program, often slow
  Variational EM
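For concreteness, a compact collapsed Gibbs sampler for LDA, as a sketch rather than production code (the symmetric priors alpha and eta, and the docs list from the snippet above, are assumptions):

    import numpy as np

    def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
        rng = np.random.default_rng(seed)
        ndk = np.zeros((len(docs), K))     # document-topic counts
        nkw = np.zeros((K, V))             # topic-word counts
        nk = np.zeros(K)                   # topic totals
        z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                ndk[d, z[d][n]] += 1; nkw[z[d][n], w] += 1; nk[z[d][n]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k = z[d][n]
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1    # remove token
                    p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                    k = rng.choice(K, p=p / p.sum())              # resample topic
                    z[d][n] = k
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1    # add token back
        return ndk, nkw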

Introduction to Topic Models
[Figure: perplexity comparison of various models (unigram, mixture model, PLSA, LDA); lower is better.]
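Perplexity here is the standard held-out measure, the exponential of the negative average per-token log-likelihood; a minimal sketch (function and argument names are mine):

    import numpy as np

    def perplexity(log_likelihoods, n_tokens):
        # log_likelihoods: per-document held-out log-likelihoods under the model
        return np.exp(-np.sum(log_likelihoods) / n_tokens)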

Introduction to Topic Models
[Figure: prediction accuracy for classification using topic models as features; higher is better.]

Before LDA: LSA and PLSA

Introduction to Topic Models
Probabilistic Latent Semantic Analysis (PLSA) model:
  Select document d ~ Mult(π)
  For each position n = 1, ..., N_d:
    generate z_n ~ Mult(θ_d)   (θ_d = the document's topic distribution)
    generate w_n ~ Mult(β_{z_n})
PLSA model: each word is generated by a single unknown multinomial distribution over words, and each document is a mixture given by θ_d. We need to estimate θ_d for each d, so overfitting is easy.
LDA: integrate out θ_d and only estimate β.
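A sketch of EM for PLSA on a document-term count matrix (the dense arrays and uniform Dirichlet initialization are simplifying assumptions; a real implementation would exploit sparsity):

    import numpy as np

    def plsa_em(counts, K, iters=100, seed=0):
        # counts: (D, V) document-term matrix
        rng = np.random.default_rng(seed)
        D, V = counts.shape
        theta = rng.dirichlet(np.ones(K), size=D)    # theta_d = P(z|d), one per document
        beta = rng.dirichlet(np.ones(V), size=K)     # beta_z = P(w|z), shared
        for _ in range(iters):
            # E-step: responsibilities P(z|d,w)
            post = theta[:, :, None] * beta[None, :, :]        # (D, K, V)
            post /= post.sum(axis=1, keepdims=True) + 1e-12
            # M-step: reweight by observed counts
            weighted = post * counts[:, None, :]
            theta = weighted.sum(axis=2)
            theta /= theta.sum(axis=1, keepdims=True)
            beta = weighted.sum(axis=0)
            beta /= beta.sum(axis=1, keepdims=True)
        return theta, beta

Note that theta has one row per training document, which is exactly the overfitting issue the slide points out.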

Introduction to Topic Models PLSA topics (TDT-1 corpus)

Outline
  Tools for analysis of text
  Probabilistic models for text, communities, and time
    Mixture models and LDA models for text
    LDA extensions to model hyperlink structure
    LDA extensions to model time
  Alternative framework based on graph analysis to model time & community
  Preliminary results & tradeoffs
  Discussion of results & challenges

Hyperlink modeling using PLSA

Hyperlink modeling using PLSA [Cohn and Hoffman, NIPS 2001]
  Select document d ~ Mult(π)
  For each position n = 1, ..., N_d:
    generate z_n ~ Mult(θ_d)
    generate w_n ~ Mult(β_{z_n})
  For each citation j = 1, ..., L_d:
    generate z_j ~ Mult(θ_d)
    generate c_j ~ Mult(γ_{z_j})

Hyperlink modeling using PLSA [Cohn and Hoffman, NIPS 2001]
PLSA likelihood (words only):
  $\prod_d \prod_{n=1}^{N_d} \sum_z \theta_{d,z}\,\beta_{z,w_n}$
New likelihood (words and citations):
  $\prod_d \Big(\prod_{n=1}^{N_d} \sum_z \theta_{d,z}\,\beta_{z,w_n}\Big)\Big(\prod_{j=1}^{L_d} \sum_z \theta_{d,z}\,\gamma_{z,c_j}\Big)$
Learning using EM.

Hyperlink modeling using PLSA [Cohn and Hoffman, NIPS 2001]
Heuristic: weight the two likelihood terms as $\alpha \log P(\text{words}) + (1-\alpha) \log P(\text{links})$, where $0 \le \alpha \le 1$ determines the relative importance of content and hyperlinks.

Hyperlink modeling using PLSA [Cohn and Hoffman, NIPS 2001]
Experiments: text classification.
Datasets:
  WebKB: 6000 CS dept web pages with hyperlinks; 6 classes: faculty, course, student, staff, etc.
  Cora: 2000 machine learning abstracts with citations; 7 classes: sub-areas of machine learning.
Methodology:
  Learn the model on the complete data and obtain θ_d for each document.
  Classify each test document with the label of its nearest neighbor in the training set.
  Distance is measured as cosine similarity in the θ space.
  Measure performance as a function of α.
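The nearest-neighbor step is simple; a sketch (array names are mine; rows of theta_train and theta_test are per-document topic mixtures, y_train is an array of training labels):

    import numpy as np

    def nn_cosine_labels(theta_train, y_train, theta_test):
        # L2-normalize rows so dot products equal cosine similarities
        a = theta_train / (np.linalg.norm(theta_train, axis=1, keepdims=True) + 1e-12)
        b = theta_test / (np.linalg.norm(theta_test, axis=1, keepdims=True) + 1e-12)
        sims = b @ a.T                              # (test x train) cosine similarities
        return y_train[np.argmax(sims, axis=1)]     # label of nearest training doc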

Hyperlink modeling using PLSA [Cohn and Hoffman, NIPS 2001]
[Figure: classification performance as a function of α, comparing link-only, content-only, and combined (link + content) models.]

Hyperlink modeling using LDA

Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]
  For each document d = 1, ..., M:
    Generate θ_d ~ Dir(α)
    For each position n = 1, ..., N_d:
      generate z_n ~ Mult(θ_d)
      generate w_n ~ Mult(β_{z_n})
    For each citation j = 1, ..., L_d:
      generate z_j ~ Mult(θ_d)
      generate c_j ~ Mult(γ_{z_j})
Learning using variational EM.
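A generative sketch in the same style as the earlier LDA snippet; the citation side reuses θ_d but draws from per-topic citation distributions γ (sizes, priors, and Poisson lengths are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, C, M = 5, 1000, 500, 100          # topics, vocab, candidate citations, docs (assumed)
    alpha = np.full(K, 0.1)
    beta = rng.dirichlet(np.full(V, 0.01), size=K)    # topic -> word distributions
    gamma = rng.dirichlet(np.full(C, 0.01), size=K)   # topic -> citation distributions

    for d in range(M):
        theta = rng.dirichlet(alpha)                  # shared topic mixture
        words = [rng.choice(V, p=beta[rng.choice(K, p=theta)])
                 for _ in range(rng.poisson(50))]     # w_n ~ Mult(beta_{z_n})
        cites = [rng.choice(C, p=gamma[rng.choice(K, p=theta)])
                 for _ in range(rng.poisson(3))]      # c_j ~ Mult(gamma_{z_j})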

Hyperlink modeling using LDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

Newswire Text
  Goals of analysis: extract information about events from text.
  Understanding text requires understanding the typical reader and the conventions for communicating with him/her: prior knowledge, background, ...
Social Media Text
  Goals of analysis: very diverse.
  Evaluation is difficult, and requires revisiting often as goals evolve.
  Often understanding social text requires understanding a community.
Science as a testbed for social text: an open community which we understand.

Author-Topic Model for Scientific Literature

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
  For each author a = 1, ..., A:
    Generate θ_a ~ Dir(γ)
  For each topic k = 1, ..., K:
    Generate φ_k ~ Dir(α)
  For each document d = 1, ..., M:
    For each position n = 1, ..., N_d:
      Generate author x ~ Unif(a_d)   (a_d = the document's author set)
      generate z_n ~ Mult(θ_x)
      generate w_n ~ Mult(φ_{z_n})
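A sketch of the author-topic generative process (the sizes, priors, and randomly drawn author sets are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, A, M = 5, 1000, 20, 100            # topics, vocab, authors, docs (assumed)
    theta = rng.dirichlet(np.full(K, 0.1), size=A)   # author -> topic distributions
    phi = rng.dirichlet(np.full(V, 0.01), size=K)    # topic -> word distributions
    doc_authors = [rng.choice(A, size=rng.integers(1, 4), replace=False)
                   for _ in range(M)]                # author set of each document (assumed)

    for d in range(M):
        words = []
        for _ in range(rng.poisson(50)):
            x = rng.choice(doc_authors[d])           # x ~ Unif(a_d)
            z = rng.choice(K, p=theta[x])            # z_n ~ Mult(theta_x)
            words.append(rng.choice(V, p=phi[z]))    # w_n ~ Mult(phi_{z_n})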

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Perplexity results

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Topic-Author visualization

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Application 1: Author similarity

Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004] Application 2: Author entropy

Labeled LDA: [Ramage, Hall, Nallapati, Manning, EMNLP 2009]

Labeled LDA Del.icio.us tags as labels for documents

Labeled LDA

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] Gibbs sampling

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]
Datasets:
  Enron email data: 23,488 messages between 147 users
  McCallum's personal email: 23,488(?) messages with 128 authors

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] Topic visualization: Enron set

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] Topic visualization: McCallum's data

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

Models of hypertext for blogs [ICWSM 2008] Ramesh Nallapati, Amr Ahmed, Eric Xing, me

Link-PLSA-LDA
  LinkLDA model for citing documents
  Variant of the PLSA model for cited documents
  Topics are shared between citing and cited documents
  Links depend on the topics in the two documents

Experiments
  8.4M blog postings in the Nielsen/BuzzMetrics corpus, collected over three weeks in summer 2005
  Selected all postings with >= 2 in-links or >= 2 out-links
  2248 citing documents (2+ out-links), 1777 cited documents (2+ in-links)
  Only 68 documents appear in both sets; these are duplicated
  Fit model using variational EM

Topics in blogs. The model can answer questions like: which blogs are most likely to be cited when discussing topic z?

Topics in blogs. The model can be evaluated by predicting which links an author will include in an article. [Figure: link-prediction performance for Link-LDA vs. Link-PLSA-LDA; lower is better.]

Another model: Pairwise Link-LDA
  LDA for both cited and citing documents
  Generate an indicator for every pair of docs (vs. generating pairs of docs)
  The link depends on the mixing components (the θ's), as in a stochastic block model
[Plate diagram: two LDA halves (θ, z, w) with a link indicator c that depends on both documents' topics via γ.]
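One marginalized reading of the pairwise link step, as a sketch: the probability of a link between two documents depends on their topic mixtures through a topic-by-topic propensity matrix, in the spirit of a stochastic block model (the matrix eta below is my assumption, not the paper's exact parameterization):

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 5, 50
    eta = rng.uniform(0.0, 0.2, size=(K, K))          # topic-pair link propensities (assumed)
    theta = rng.dirichlet(np.full(K, 0.1), size=M)    # per-document topic mixtures

    links = np.zeros((M, M), dtype=int)
    for i in range(M):
        for j in range(M):
            if i != j:
                p = theta[i] @ eta @ theta[j]         # link probability from mixing components
                links[i, j] = rng.random() < p        # indicator for every ordered pair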

Pairwise Link-LDA supports new inferences but doesn't perform better on link prediction.

Outline
  Tools for analysis of text
  Probabilistic models for text, communities, and time
    Mixture models and LDA models for text
    LDA extensions to model hyperlink structure
    Observation: these models can be used for many purposes
    LDA extensions to model time
  Alternative framework based on graph analysis to model time & community
  Discussion of results & challenges

Authors are using a number of clever tricks for inference.

Predicting Response to Political Blog Posts with Topic Models [NAACL 09] Tae Yano, Noah Smith

Political blogs and comments
  Posts are often coupled with comment sections.
  Comment style is casual, creative, less carefully edited.

Political blogs and comments
  Most of the text associated with large A-list community blogs is comments: 5-20x as many words in comments as in post text for the 5 sites considered in Yano et al.
  A large part of socially-created commentary in the blogosphere is comments, not blog-to-blog hyperlinks.
  Comments do not just echo the post.

Modeling political blogs. Our political blog model: CommentLDA.
Notation: z, z' = topic; w = word (in post); w' = word (in comments); u = user.
D = # of documents; N = # of words in the post; M = # of words in the comments.

Modeling political blogs. Our proposed model, CommentLDA: the left-hand side of the plate diagram is vanilla LDA. (D = # of documents; N = # of words in the post; M = # of words in the comments)

Modeling political blogs. CommentLDA's right-hand side captures the generation of the reaction separately from the post body: the two chambers share the same topic mixture but use two separate sets of word distributions. (D = # of documents; N = # of words in the post; M = # of words in the comments)

Modeling political blogs. CommentLDA treats the user IDs of the commenters as part of the comment text, generating them together with the words in the comment section. (D = # of documents; N = # of words in the post; M = # of words in the comments)

Modeling political blogs. Another model we tried: CommentLDA with the words taken out of the comment section. This model is agnostic to the words in the comments, and is structurally equivalent to LinkLDA (Erosheva et al., 2004). (D = # of documents; N = # of words in the post; M = # of words in the comments)

Topic discovery - Matthew Yglesias (MY) site [three slides of example topics discovered]

Comment prediction (user prediction: precision at top 10)
[Figure: results for the MY, RS, and CB sites; best per-site values shown include 20.54% (CommentLDA (R)), and 16.92% and 32.06% (LinkLDA (R) and LinkLDA (C)). Bars from left to right: LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), baselines (Freq, NB).]
LinkLDA and CommentLDA consistently outperform the baseline models; neither consistently outperforms the other.

Document modeling with Latent Dirichlet Allocation (LDA)
  For each document d = 1, ..., M:
    Generate θ_d ~ Dir(α)
    For each position n = 1, ..., N_d:
      generate z_n ~ Mult(θ_d)
      generate w_n ~ Mult(β_{z_n})

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005]

Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI 2005] SNA = Jensen-Shannon divergence for recipients of messages
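For reference, a small sketch of the Jensen-Shannon divergence between two recipient distributions (the function name and smoothing constant are mine):

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        p = np.asarray(p, float) / np.sum(p)          # normalize to distributions
        q = np.asarray(q, float) / np.sum(q)
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)        # symmetric; bounded by log 2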

Modeling Citation Influences

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
Copycat model of citation influence, on a spectrum from innovation to plagiarism:
  c is a cited document
  s is a coin toss to mix γ and ψ

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] In the full citation influence model, s is a coin toss to mix γ and ψ.
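A toy sketch of that coin toss, under the assumption that s chooses between the citing document's own "innovation" topic distribution and topics drawn via attention over the cited documents (the names and the exact roles of γ and ψ here are my reading, not a faithful transcription of the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5
    lam = 0.3                                         # P(s = innovation), assumed
    psi = rng.dirichlet(np.full(K, 0.1))              # citing doc's own ("innovation") topics
    cited_theta = rng.dirichlet(np.full(K, 0.1), size=3)  # topic mixtures of 3 cited docs
    gamma = rng.dirichlet(np.ones(3))                 # attention over the cited documents

    def draw_topic():
        if rng.random() < lam:                        # s: the coin toss
            return rng.choice(K, p=psi)               # innovate
        c = rng.choice(3, p=gamma)                    # pick a cited document
        return rng.choice(K, p=cited_theta[c])        # copy a topic from it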

Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007] Citation influence graph for the LDA paper

Modeling Citation Influences

Modeling Citation Influences
User study: self-reported citation influence on a Likert scale.
  LDA-post is Prob(cited doc | paper)
  LDA-js is the Jensen-Shannon distance in topic space