CS145: INTRODUCTION TO DATA MINING
Text Data: Topic Model
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
December 4, 2017

Methods to be Learnt
- Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
- Prediction: Linear Regression; GLM* (vector data)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Similarity Search: DTW (sequence data)

Text Data: Topic Models (Outline)
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary

Text Data
- Word/term
- Document: a sequence of words
- Corpus: a collection of documents

Represent a Document
- Most common way: Bag-of-Words
- Ignore the order of words; keep the counts
- [Figure: example documents represented as term-count vectors in the vector space model]
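As a quick illustration, here is a minimal sketch (in Python, not from the slides) of turning a small corpus into bag-of-words count vectors; the toy documents and vocabulary are made up for the example.

```python
from collections import Counter

docs = [
    "data mining mining frequent pattern",
    "web information retrieval information",
]

# Build the vocabulary from the corpus (word order is ignored).
vocab = sorted({w for d in docs for w in d.split()})

# Represent each document as a vector of word counts (vector space model).
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bag_of_words(d, vocab))
```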

Topics
- A topic is represented by a word distribution
- Relates to an issue

Topic Models
- Topic modeling: get topics automatically from a corpus; assign documents to topics automatically
- Most frequently used topic models: pLSA and LDA

Revisit of Mixture Model

Mixture Model-Based Clustering
- A set C of k probabilistic clusters $C_1, \ldots, C_k$ with probability density/mass functions $f_1, \ldots, f_k$ and cluster prior probabilities $w_1, \ldots, w_k$, where $\sum_j w_j = 1$
- Joint probability of an object $x_i$ and its cluster $C_j$: $P(x_i, z_i = C_j) = w_j f_j(x_i)$, where $z_i$ is a hidden random variable
- Probability of $x_i$: $P(x_i) = \sum_j w_j f_j(x_i)$
- [Figure: two component densities $f_1(x)$ and $f_2(x)$]

Maximum Likelihood Estimation
Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have:
$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
$\log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
Task: find a set C of k probabilistic clusters such that $P(D)$ is maximized.

Gaussian Mixture Model
Generative model. For each object:
- Pick its cluster, i.e., a distribution component: $Z \sim \text{Multinoulli}(w_1, \ldots, w_k)$
- Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
Overall likelihood function:
$L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, subject to $\sum_j w_j = 1$ and $w_j \geq 0$.
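A minimal sketch of this generative process and of evaluating the mixture log-likelihood above, assuming a 1-D two-component mixture with made-up parameters (not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy parameters: mixture weights, component means and standard deviations.
w = np.array([0.6, 0.4])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])

# Generative model: pick a cluster Z ~ Multinoulli(w), then sample X | Z ~ N(mu_Z, sigma_Z^2).
z = rng.choice(len(w), size=100, p=w)
x = rng.normal(mu[z], sigma[z])

# Log-likelihood: log P(D) = sum_i log sum_j w_j * p(x_i | mu_j, sigma_j^2).
component_densities = w * norm.pdf(x[:, None], loc=mu, scale=sigma)  # shape (n, k)
log_likelihood = np.log(component_densities.sum(axis=1)).sum()
print(log_likelihood)
```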

Multinomial Mixture Model
For documents with bag-of-words representation $x_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the $n$th vocabulary word in document $d$.
Generative model. For each document:
- Sample its cluster label $z \sim \text{Multinoulli}(\pi)$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\pi_k$ is the proportion of the $k$th cluster, i.e., $p(z = k) = \pi_k$
- Sample its word vector $x_d \sim \text{Multinomial}(\beta_z)$, where $\beta_z = (\beta_{z1}, \beta_{z2}, \ldots, \beta_{zN})$ and $\beta_{zn}$ is the parameter associated with the $n$th vocabulary word:
$p(x_d \mid z = k) = \dfrac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}}$
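To make the document likelihood $p(x_d \mid z = k)$ concrete, here is a small sketch that evaluates it with scipy; the vocabulary size, count vector, and $\beta_k$ are toy values assumed for illustration:

```python
import numpy as np
from scipy.stats import multinomial

# Assumed toy setup: vocabulary of 4 words, one cluster's word distribution beta_k.
beta_k = np.array([0.5, 0.25, 0.125, 0.125])

# A document's bag-of-words count vector x_d.
x_d = np.array([3, 1, 0, 2])

# p(x_d | z = k) = (sum_n x_dn)! / prod_n x_dn! * prod_n beta_kn^{x_dn}
likelihood = multinomial.pmf(x_d, n=x_d.sum(), p=beta_k)
print(likelihood)
```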

Likelihood Function
For a set of M documents:
$L = \prod_d p(x_d) = \prod_d \sum_k p(x_d, z = k) = \prod_d \sum_k p(x_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$

Mixture of Unigrams
For documents represented by a sequence of words $w_d = (w_{d1}, w_{d2}, \ldots, w_{dN_d})$, where $N_d$ is the length of document $d$ and $w_{dn}$ is the word at the $n$th position of the document.
Generative model. For each document:
- Sample its cluster label $z \sim \text{Multinoulli}(\pi)$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\pi_k$ is the proportion of the $k$th cluster, i.e., $p(z = k) = \pi_k$
- For each word in the sequence, sample the word $w_{dn} \sim \text{Multinoulli}(\beta_z)$, i.e., $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$
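A minimal sketch of the mixture-of-unigrams generative process; the vocabulary, cluster proportions, and word distributions are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["data", "mining", "web", "retrieval"]
pi = np.array([0.5, 0.5])                 # cluster proportions pi_k
beta = np.array([[0.6, 0.3, 0.05, 0.05],  # word distribution of cluster 1
                 [0.05, 0.05, 0.5, 0.4]]) # word distribution of cluster 2

def generate_document(length):
    # Sample one cluster label z for the whole document ...
    z = rng.choice(len(pi), p=pi)
    # ... then sample every word in the document from beta_z.
    words = rng.choice(vocab, size=length, p=beta[z])
    return z, list(words)

print(generate_document(6))
```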

Likelihood Function
For a set of M documents:
$L = \prod_d p(w_d) = \prod_d \sum_k p(w_d, z = k) = \prod_d \sum_k p(w_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$

Question
Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?

Probabilistic Latent Semantic Analysis (pLSA)

Notations
- Word, document, topic: $w, d, z$
- Word count in document: $c(w, d)$
- Word distribution for each topic ($\beta_z$): $\beta_{zw} = p(w \mid z)$
- Topic distribution for each document ($\theta_d$): $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)

Issues of Mixture of Unigrams
- All the words in the same document are sampled from the same topic
- In practice, people switch topics during their writing

Illustration of pLSA
[Figure: illustration of pLSA (omitted in the transcription)]

Generative Model for pLSA
Describe how a document is generated probabilistically. For each position in document $d$, $n = 1, \ldots, N_d$:
- Generate the topic for the position as $z_n \sim \text{Multinoulli}(\theta_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., a categorical distribution)
- Generate the word for the position as $w_n \sim \text{Multinoulli}(\beta_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
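A minimal sketch of the pLSA generative process for one document, assuming toy values for $\theta_d$ and $\beta$ (made up for illustration); unlike the mixture of unigrams, a new topic is drawn at every position:

```python
import numpy as np

rng = np.random.default_rng(2)

vocab = ["data", "mining", "web", "retrieval"]
theta_d = np.array([0.7, 0.3])            # topic distribution of document d
beta = np.array([[0.6, 0.3, 0.05, 0.05],  # word distribution of topic 1
                 [0.05, 0.05, 0.5, 0.4]]) # word distribution of topic 2

def generate_plsa_document(length):
    words = []
    for _ in range(length):
        z_n = rng.choice(len(theta_d), p=theta_d)  # topic for this position
        w_n = rng.choice(vocab, p=beta[z_n])       # word for this position
        words.append(w_n)
    return words

print(generate_plsa_document(6))
```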

Graphical Model
[Figure: graphical model of pLSA (omitted in the transcription)]
Note: sometimes people also add the parameters $\theta$ and $\beta$ into the graphical model.

The Likelihood Function for a Corpus
Probability of a word:
$p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
Likelihood of a corpus: [equation omitted in the transcription], where $\pi_d$ is usually considered uniform, i.e., $1/M$.

Re-arrange the Likelihood Function
Group the same word from different positions together:
$\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
subject to $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$

Optimization: EM Algorithm
Repeat until convergence:
- E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
$p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \dfrac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
- M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
$\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \dfrac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w'} \sum_d p(z \mid w', d)\, c(w', d)}$
$\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \dfrac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
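A minimal pLSA EM sketch in Python (numpy) following the E-step and M-step above; the toy document-word count matrix, number of topics, and iteration count are assumptions for illustration, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy document-word count matrix c(w, d): rows are documents, columns are vocabulary words.
C = np.array([[5.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0, 4.0]])
M, W = C.shape
K = 2  # assumed number of topics

# Random initialization of theta (documents x topics) and beta (topics x words), rows summing to 1.
theta = rng.random((M, K)); theta /= theta.sum(axis=1, keepdims=True)
beta = rng.random((K, W)); beta /= beta.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: p(z | w, d) proportional to theta_dz * beta_zw, shape (documents, words, topics).
    p_z = theta[:, None, :] * beta.T[None, :, :]
    p_z /= p_z.sum(axis=2, keepdims=True)

    # M-step: beta_zw proportional to sum_d p(z|w,d) c(w,d);
    #         theta_dz proportional to sum_w p(z|w,d) c(w,d).
    weighted = C[:, :, None] * p_z
    beta = weighted.sum(axis=0).T
    beta /= beta.sum(axis=1, keepdims=True)
    theta = weighted.sum(axis=1)
    theta /= theta.sum(axis=1, keepdims=True)

# Objective from the re-arranged likelihood: sum_{d,w} c(w,d) * log sum_z theta_dz beta_zw.
print((C * np.log(theta @ beta)).sum())
```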

Example
- Two documents, two topics
- Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
- At some iteration of the EM algorithm, E-step: [table of word counts and $p(z \mid w, d)$ values omitted in the transcription]

Example (Continued)
M-step:
$\beta_{11} = \dfrac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$
$\beta_{12} = \dfrac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
$\beta_{13} = 3/17.6,\ \beta_{14} = 1.6/17.6,\ \beta_{15} = 1.3/17.6,\ \beta_{16} = 1.2/17.6,\ \beta_{17} = 0.8/17.6$
$\theta_{11} = 11.8/17,\ \theta_{12} = 5.2/17$
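As a quick sanity check on the two fully visible M-step computations above (a tiny sketch; the remaining $\beta$ and $\theta$ entries depend on E-step values that did not survive the transcription):

```python
# M-step numerators for topic 1: E-step weights p(z=1 | w, d) times word counts c(w, d),
# using only the numbers shown on the slide.
numerator_beta_11 = 0.8 * 5 + 0.5 * 2   # = 5.0
numerator_beta_12 = 0.8 * 4 + 0.5 * 3   # = 4.7
denominator = 11.8 + 5.8                # = 17.6, total weight assigned to topic 1
print(numerator_beta_11 / denominator)  # 5/17.6 ≈ 0.284
print(numerator_beta_12 / denominator)  # 4.7/17.6 ≈ 0.267
```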

Summary

- Basic concepts: word/term, document, corpus, topic
- Mixture of unigrams
- pLSA: generative model, likelihood function, EM algorithm

Quiz
- Q1: Is Multinomial Naïve Bayes a linear classifier?
- Q2: In pLSA, for the same word appearing at different positions in a document, do those occurrences have the same conditional probability $p(z \mid w, d)$?