CS6220: DATA MINING TECHNIQUES
Matrix Data: Clustering: Part 2
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
October 19, 2014

Methods to Learn, by data type:
Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network)
Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means (matrix data); SCAN; Spectral Clustering (graph & network)
Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
Prediction: Linear Regression (matrix data); Autoregression (time series)
Similarity Search: DTW (time series); P-PageRank (graph & network)
Ranking: PageRank (graph & network)

Matrix Data: Clustering: Part 2 (outline): Revisit K-means; Mixture Model and EM algorithm; Kernel K-means; Summary

Recall K-Means: Objective Function
Total within-cluster variance:
J = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - c_j\|^2
Re-arranging the objective function:
J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2, where w_{ij} \in \{0,1\}; w_{ij} = 1 if x_i belongs to cluster j, and w_{ij} = 0 otherwise.
Looking for: the best assignment w_{ij} and the best centers c_j.

Solution of K-Means: Iterations
Step 1: Fix the centers c_j, find the assignment w_{ij} that minimizes J: set w_{ij} = 1 if \|x_i - c_j\|^2 is the smallest over j.
Step 2: Fix the assignment w_{ij}, find the centers that minimize J by setting the first derivative of J to zero:
\partial J / \partial c_j = -2 \sum_i w_{ij} (x_i - c_j) = 0 \;\Rightarrow\; c_j = \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}}
Note: \sum_i w_{ij} is the total number of objects in cluster j, and J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2.
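As a concrete illustration of the two alternating steps above, here is a minimal NumPy sketch of the iterations; the function name and the random initialization strategy are illustrative choices, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate hard assignment (Step 1) and center updates (Step 2)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k randomly chosen data points (an assumption)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to the closest center (hard assignment w_ij)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer move, so J has stopped decreasing
        centers = new_centers
    return labels, centers
```

For example, `labels, centers = kmeans(X, k=3)` on an (n, d) array X carries out exactly the alternating minimization of J described above.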

Converges! Why? Each of the two steps can only decrease (or keep) the objective J, and J is bounded below by 0, so the iterations must converge.

Limitations of K-Means: K-means has problems when clusters are of differing sizes, differing densities, or non-spherical shapes.

Limitations of K-Means: Different Density and Size

Limitations of K-Means: Non-Spherical Shapes

Demo: http://webdocs.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html

Connections of K-means to Other Methods: K-means relates to the Gaussian Mixture Model and to Kernel K-means.

Matrix Data: Clustering: Part 2 (outline): Revisit K-means; Mixture Model and EM algorithm; Kernel K-means; Summary

Fuzzy Set and Fuzzy Cluster: In the clustering methods discussed so far, every data object is assigned to exactly one cluster. Some applications may need fuzzy or soft cluster assignment; for example, an e-game could belong to both entertainment and software. Two families of methods allow this: fuzzy clusters and probabilistic model-based clusters. A fuzzy cluster is a fuzzy set S defined by a membership function F_S: X \to [0, 1] (a value between 0 and 1).

Probabilistic Model-Based Clustering: Cluster analysis aims to find hidden categories. A hidden category (i.e., a probabilistic cluster) is a distribution over the data space, which can be represented mathematically by a probability density function (or distribution function). Example: categories of digital cameras sold, consumer line vs. professional line, with density functions f_1, f_2 for clusters C_1, C_2 obtained by probabilistic clustering. A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and that conceptually each observed object is generated independently. Our task: infer the set of k probabilistic clusters that is most likely to have generated D under this data generation process.

Mixture Model-Based Clustering: A set C of k probabilistic clusters C_1, \dots, C_k with probability density functions f_1, \dots, f_k, respectively, and cluster probabilities w_1, \dots, w_k with \sum_j w_j = 1. The probability that an object i is generated by cluster C_j is P(x_i, z_i = C_j) = w_j f_j(x_i). The probability that i is generated by the set of clusters C is P(x_i) = \sum_j w_j f_j(x_i).

Maximum Likelihood Estimation: Since objects are assumed to be generated independently, for a data set D = \{x_1, \dots, x_n\} we have P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i). Task: find a set C of k probabilistic clusters such that P(D) is maximized.
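In practice this product is maximized through its logarithm, which is how the later slides write the objective; stated explicitly (a standard restatement, not shown on this slide):

```latex
\log P(D) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j \, f_j(x_i).
```

The sum inside the logarithm cannot be pulled out, which is why direct maximization is intractable and the EM algorithm of the following slides is used instead.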

The EM (Expectation-Maximization) Algorithm: a framework for approaching maximum likelihood or maximum a posteriori estimates of parameters in statistical models. The E-step assigns objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters: w_{ij}^t = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid C_j^t, \theta_j^t) \, p(C_j^t). The M-step finds the new clustering or parameters that maximize the expected likelihood.

Case 1: Gaussian Mixture Model. Generative model, for each object: pick its distribution component Z \sim \mathrm{Multi}(w_1, \dots, w_k); sample a value from the selected distribution X \sim N(\mu_Z, \sigma_Z^2). Overall likelihood function: L(D \mid \theta) = \prod_i \sum_j w_j \, p(x_i \mid \mu_j, \sigma_j^2). Q: What is \theta here?

Estimating Parameters
l(D; \theta) = \sum_i \log \sum_j w_j \, p(x_i \mid \mu_j, \sigma_j^2)
Direct maximization is intractable! Considering the first derivative with respect to \mu_j:
\frac{\partial l}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'} p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}
= \sum_i \frac{w_j \, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'} p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}
= \sum_i w_{ij} \, \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j},
where w_{ij} = P(Z = j \mid X = x_i, \theta). This looks like weighted likelihood estimation, but the weights are themselves determined by the parameters!

Apply the EM Algorithm: an iterative algorithm (at iteration t+1).
E (expectation)-step: evaluate the weights w_{ij} given \mu_j, \sigma_j, w_j:
w_{ij}^t = \frac{w_j^t \, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t \, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}
M (maximization)-step: evaluate \mu_j, \sigma_j, w_j given the w_{ij}'s so as to maximize the weighted likelihood. This is equivalent to Gaussian parameter estimation in which each point carries a weight of belonging to each distribution:
\mu_j^{t+1} = \frac{\sum_i w_{ij}^t x_i}{\sum_i w_{ij}^t}; \quad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^t}; \quad w_j^{t+1} \propto \sum_i w_{ij}^t
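A compact NumPy sketch of these two updates for a one-dimensional Gaussian mixture follows; the function and variable names are illustrative, and a small floor on the variances is added for numerical stability (an implementation detail not discussed on the slide).

```python
import numpy as np
from scipy.stats import norm

def gmm_em(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: alternate E-step (responsibilities) and M-step."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                      # mixture weights w_j
    mu = rng.choice(x, size=k, replace=False)    # initial means (random data points)
    var = np.full(k, x.var() + 1e-6)             # initial variances
    for _ in range(n_iter):
        # E-step: w_ij proportional to w_j * N(x_i | mu_j, var_j)
        dens = norm.pdf(x[:, None], loc=mu[None, :], scale=np.sqrt(var)[None, :])
        resp = w[None, :] * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted mean, weighted variance, and updated mixture weights
        nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nj
        var = np.maximum(var, 1e-6)              # numerical floor (assumption)
        w = nj / n
    return w, mu, var
```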

K-Means: A Special Case of the Gaussian Mixture Model. When each Gaussian component has covariance matrix \sigma^2 I, we get soft K-means: p(x_i \mid \mu_j, \sigma^2) \propto \exp\{-\|x_i - \mu_j\|^2 / \sigma^2\} (the exponent is a distance!). When \sigma^2 \to 0, the soft assignment becomes a hard assignment: w_{ij} \to 1 if x_i is closest to \mu_j (why?).
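A short derivation, added here as a worked answer to the slide's "why?" (assuming a unique closest center):

```latex
w_{ij}
  = \frac{w_j \exp\{-\|x_i-\mu_j\|^2/\sigma^2\}}
         {\sum_{j'} w_{j'} \exp\{-\|x_i-\mu_{j'}\|^2/\sigma^2\}}
  = \frac{w_j}
         {\sum_{j'} w_{j'} \exp\{(\|x_i-\mu_j\|^2-\|x_i-\mu_{j'}\|^2)/\sigma^2\}}
  \;\xrightarrow{\;\sigma^2 \to 0\;}\;
  \begin{cases}
    1 & \text{if } \mu_j \text{ is the closest center to } x_i,\\
    0 & \text{otherwise.}
  \end{cases}
```

If \mu_j is strictly closest, every other exponential in the denominator vanishes and only the j' = j term (equal to w_j) survives, giving 1; if some other center is closer, one exponential blows up and the ratio goes to 0.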

Case 2: Multinomial Mixture Model. Generative model, for each object: pick its distribution component Z \sim \mathrm{Multi}(w_1, \dots, w_k); sample a value from the selected distribution X \sim \mathrm{Multi}(\beta_{Z1}, \beta_{Z2}, \dots, \beta_{Zm}). Overall likelihood function: L(D \mid \theta) = \prod_i \sum_j w_j \, p(x_i \mid \beta_j), with \sum_j w_j = 1 and \sum_l \beta_{jl} = 1. Q: What is \theta here?

Application: Document Clustering. A vocabulary containing m words; each document i is an m-dimensional vector (c_{i1}, c_{i2}, \dots, c_{im}), where c_{il} is the number of occurrences of word l in document i. Under the unigram assumption,
p(x_i \mid \beta_j) = \frac{(\sum_l c_{il})!}{c_{i1}! \cdots c_{im}!} \, \beta_{j1}^{c_{i1}} \cdots \beta_{jm}^{c_{im}},
where \sum_l c_{il} is the length of the document and the multinomial coefficient is constant with respect to all parameters.
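For concreteness, a small sketch of how the count vectors c_{il} might be built from raw text; the tokenization is deliberately naive and the helper name is made up for illustration.

```python
from collections import Counter

def count_vectors(docs, vocab):
    """Build m-dimensional word-count vectors c_i for a list of documents.

    `docs` is a list of strings; `vocab` is the list of m vocabulary words.
    Tokenization here is just lowercase whitespace splitting (a simplification).
    """
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts.get(word, 0) for word in vocab])
    return vectors

# Example: two tiny documents over a 4-word vocabulary
vocab = ["data", "mining", "cluster", "model"]
docs = ["data mining and data clustering", "model based cluster model"]
print(count_vectors(docs, vocab))  # [[2, 1, 0, 0], [0, 0, 1, 2]]
```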

Example

Estimating Parameters
l(D; \theta) = \sum_i \log \sum_j \omega_j \exp\!\left(\sum_l c_{il} \log \beta_{jl}\right)
Apply the EM algorithm.
E-step: w_{ij} = \frac{\omega_j \, p(x_i \mid \beta_j)}{\sum_{j'} \omega_{j'} \, p(x_i \mid \beta_{j'})}
M-step: maximize the weighted likelihood \sum_i w_{ij} \sum_l c_{il} \log \beta_{jl}, giving
\beta_{jl} = \frac{\sum_i w_{ij} c_{il}}{\sum_{l'} \sum_i w_{ij} c_{il'}} \quad \text{(the weighted percentage of word } l \text{ in cluster } j\text{)}; \qquad \omega_j \propto \sum_i w_{ij}
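The following NumPy sketch implements these E- and M-step updates on a document-by-word count matrix; the log-space computation and the Laplace-style smoothing of \beta are stability measures added here, not part of the slide.

```python
import numpy as np
from scipy.special import logsumexp

def multinomial_mixture_em(C, k, n_iter=50, seed=0):
    """EM for a multinomial mixture over an (n_docs x m_words) count matrix C."""
    rng = np.random.default_rng(seed)
    n, m = C.shape
    omega = np.full(k, 1.0 / k)                  # cluster priors omega_j
    beta = rng.dirichlet(np.ones(m), size=k)     # word distributions beta_j
    for _ in range(n_iter):
        # E-step: log w_ij = log omega_j + sum_l c_il log beta_jl (up to a constant)
        log_resp = np.log(omega)[None, :] + C @ np.log(beta).T
        log_resp -= logsumexp(log_resp, axis=1, keepdims=True)
        resp = np.exp(log_resp)                  # responsibilities w_ij
        # M-step: weighted word percentages and updated priors
        weighted_counts = resp.T @ C             # shape (k, m)
        beta = weighted_counts + 1e-9            # smoothing (assumption)
        beta /= beta.sum(axis=1, keepdims=True)
        omega = resp.sum(axis=0) / n
    return omega, beta, resp
```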

Better Ways for Topic Modeling. A topic is a word distribution. Unigram multinomial mixture model: once the topic of a document is decided, all of its words are generated from that topic. PLSA (probabilistic latent semantic analysis): every word of a document can be sampled from a different topic. LDA (Latent Dirichlet Allocation): additionally assumes priors on the word distributions and/or the document-topic distribution.

Why Does EM Work? E-step: compute a tight lower bound f of the original objective function at \theta^{old}. M-step: find \theta^{new} that maximizes the lower bound. Then l(\theta^{new}) \geq f(\theta^{new}) \geq f(\theta^{old}) = l(\theta^{old}).

*How to Find a Tight Lower Bound? Let q(h) be a distribution over the hidden variables; the lower bound we want follows from Jensen's inequality. When does equality hold, giving a tight lower bound? When q(h) = p(h \mid d, \theta) (why?).
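The bound itself is not legible in the transcription; the standard Jensen's-inequality derivation it refers to is sketched below as a reconstruction, with d the observed data and h the hidden variables:

```latex
\log p(d \mid \theta)
  = \log \sum_h p(d, h \mid \theta)
  = \log \sum_h q(h)\,\frac{p(d, h \mid \theta)}{q(h)}
  \;\ge\; \sum_h q(h)\,\log \frac{p(d, h \mid \theta)}{q(h)},
```

with equality exactly when p(d, h \mid \theta)/q(h) is constant in h, i.e. when q(h) = p(h \mid d, \theta), which answers the slide's "why?".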

Advantages and Disadvantages of Mixture Models. Strengths: mixture models are more general than partitioning; clusters can be characterized by a small number of parameters; the results may satisfy the statistical assumptions of the generative models. Weaknesses: they converge to a local optimum (which can be mitigated by running multiple times with random initialization); they are computationally expensive if the number of distributions is large or the data set contains very few observed data points; they need large data sets; it is hard to estimate the number of clusters.

Matrix Data: Clustering: Part 2 (outline): Revisit K-means; Mixture Model and EM algorithm; Kernel K-means; Summary

Kernel K-Means: How to cluster the following data? Use a non-linear map \varphi: R^n \to F that maps a data point into a higher- or infinite-dimensional space, x \mapsto \varphi(x), together with the dot-product (kernel) matrix K_{ij} = \langle \varphi(x_i), \varphi(x_j) \rangle.

Typical Kernel Functions: recall those used in kernel SVM.
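The kernel formulas themselves did not survive the transcription; the standard choices usually listed in the kernel-SVM context are (a reconstruction):

```latex
\text{Polynomial: } K(x_i, x_j) = (x_i^{\top} x_j + c)^d, \qquad
\text{Gaussian (RBF): } K(x_i, x_j) = \exp\!\Big(-\tfrac{\|x_i - x_j\|^2}{2\sigma^2}\Big), \qquad
\text{Sigmoid: } K(x_i, x_j) = \tanh(\kappa\, x_i^{\top} x_j + c).
```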

Solution of Kernel K-Means. Objective function in the new feature space:
J = \sum_{j=1}^{k} \sum_i w_{ij} \|\varphi(x_i) - c_j\|^2
Algorithm: by fixing the assignment w_{ij}, the centers are c_j = \sum_i w_{ij} \varphi(x_i) / \sum_i w_{ij}. In the assignment step, assign each data point to the closest center, where the squared distance expands as
d(x_i, c_j) = \left\| \varphi(x_i) - \frac{\sum_{i'} w_{i'j} \varphi(x_{i'})}{\sum_{i'} w_{i'j}} \right\|^2
= \varphi(x_i)\cdot\varphi(x_i) - 2\,\frac{\sum_{i'} w_{i'j}\, \varphi(x_i)\cdot\varphi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_{l} w_{i'j} w_{lj}\, \varphi(x_{i'})\cdot\varphi(x_l)}{\left(\sum_{i'} w_{i'j}\right)^2}.
We never need to know \varphi(x) explicitly, only the kernel values K_{ij}.
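A minimal NumPy sketch of this kernel-only assignment step follows; the Gaussian kernel and the random initialization are illustrative assumptions, not prescribed by the slide.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel K-means using only the kernel matrix K (no explicit feature map phi)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(n_iter):
        dist = np.empty((n, k))
        for j in range(k):
            mask = labels == j
            nj = max(mask.sum(), 1)
            # ||phi(x_i) - c_j||^2 = K_ii - 2*sum_{i'} K_{i i'}/n_j + sum_{i',l} K_{i' l}/n_j^2
            dist[:, j] = (np.diag(K)
                          - 2 * K[:, mask].sum(axis=1) / nj
                          + K[np.ix_(mask, mask)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Usage: `labels = kernel_kmeans(rbf_kernel(X), k=2)` clusters X using only pairwise kernel values, matching the expansion of d(x_i, c_j) above.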

Advantages and Disadvantages of Kernel K-Means. Advantages: the algorithm is able to identify non-linear structures. Disadvantages: the number of cluster centers needs to be predefined; the algorithm is more complex and its time complexity is high. References: Kernel k-means and Spectral Clustering by Max Welling; Kernel k-means, Spectral Clustering and Normalized Cut by Inderjit S. Dhillon, Yuqiang Guan and Brian Kulis; An Introduction to Kernel Methods by Colin Campbell.

Matrix Data: Clustering: Part 2 (outline): Revisit K-means; Mixture Model and EM algorithm; Kernel K-means; Summary

Summary. Revisit k-means: objective function and derivation of the solution. Mixture models: Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means. Kernel k-means: objective function; solution; connection to k-means.