CS6220: DATA MINING TECHNIQUES
Matrix Data: Clustering: Part 2
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
October 19, 2014
Methods to Learn (by data type)
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data) | HMM (sequence data) | Label Propagation (graph & network)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means (matrix data) | SCAN; Spectral Clustering (graph & network)
- Frequent Pattern Mining: Apriori; FP-growth (set data) | GSP; PrefixSpan (sequence data)
- Prediction: Linear Regression (matrix data) | Autoregression (time series)
- Similarity Search: DTW (time series) | P-PageRank (graph & network)
- Ranking: PageRank (graph & network)
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Objective Function

Recall K-Means. The total within-cluster variance is
$$J = \sum_{j=1}^{k} \sum_{i:\,C(i)=j} \|x_i - c_j\|^2$$

Re-arrange the objective function:
$$J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2, \quad w_{ij} \in \{0, 1\}$$
where $w_{ij} = 1$ if $x_i$ belongs to cluster $j$, and $w_{ij} = 0$ otherwise.

Looking for:
- The best assignment $w_{ij}$
- The best centers $c_j$
Solution of K-Means: Iterations
- Step 1: Fix the centers $c_j$, find the assignment $w_{ij}$ that minimizes $J$
  $\Rightarrow$ $w_{ij} = 1$ if $\|x_i - c_j\|^2$ is the smallest.
- Step 2: Fix the assignment $w_{ij}$, find the centers that minimize $J$
  $\Rightarrow$ set the first derivative of $J = \sum_{j=1}^{k}\sum_i w_{ij}\|x_i - c_j\|^2$ to zero:
  $$\frac{\partial J}{\partial c_j} = -2 \sum_i w_{ij}(x_i - c_j) = 0 \;\Rightarrow\; c_j = \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}}$$
  Note that $\sum_i w_{ij}$ is the total number of objects in cluster $j$.
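A minimal sketch of these two alternating steps (assuming NumPy; the function and variable names are illustrative, not taken from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment (Step 1) and center update (Step 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize c_j
    for _ in range(n_iter):
        # Step 1: fix centers, assign each x_i to the nearest center (w_ij = 1)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # ||x_i - c_j||^2
        labels = d2.argmin(axis=1)
        # Step 2: fix assignments, set c_j to the mean of the points in cluster j
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment and within-cluster variance (the objective J)
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    J = ((X - centers[labels]) ** 2).sum()
    return labels, centers, J
```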
Converges! Why? Each of the two steps can only decrease (or leave unchanged) $J$, $J$ is bounded below by 0, and there are only finitely many possible assignments, so the iterations must reach a point where neither step changes the solution.
Limitations of K-Means

K-means has problems when clusters differ in:
- Sizes
- Densities
- Non-spherical shapes
Limitations of K-Means: Different Density and Size
Limitations of K-Means: Non-Spherical Shapes
Demo: http://webdocs.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html
Connections of K-means to Other Methods
- K-means and the Gaussian Mixture Model
- K-means and Kernel K-means
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Fuzzy Set and Fuzzy Cluster
- Clustering methods discussed so far: every data object is assigned to exactly one cluster
- Some applications may need fuzzy or soft cluster assignment
  - Ex. an e-game could belong to both entertainment and software
- Methods: fuzzy clusters and probabilistic model-based clusters
- Fuzzy cluster: a fuzzy set $S$, defined by a membership function $F_S: X \to [0, 1]$ (values between 0 and 1)
Probabilistic Model-Based Clustering
- Cluster analysis is to find hidden categories
- A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented by a probability density function (or distribution function)
  - Ex. categories for digital cameras sold: consumer line vs. professional line; density functions $f_1$, $f_2$ for $C_1$, $C_2$ obtained by probabilistic clustering
- A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently
- Our task: infer a set of $k$ probabilistic clusters that is most likely to generate $D$ using the above data generation process
Mixture Model-Based Clustering
- A set $C$ of $k$ probabilistic clusters $C_1, \ldots, C_k$ with probability density functions $f_1, \ldots, f_k$, respectively, and their probabilities $w_1, \ldots, w_k$, where $\sum_j w_j = 1$
- Probability that an object $x_i$ is generated by cluster $C_j$:
  $$P(x_i, z_i = C_j) = w_j f_j(x_i)$$
- Probability that $x_i$ is generated by the set of clusters $C$:
  $$P(x_i) = \sum_j w_j f_j(x_i)$$
Maximum Likelihood Estimation
- Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have
  $$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$$
- Task: find a set $C$ of $k$ probabilistic clusters such that $P(D)$ is maximized
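To make the quantity being maximized concrete, here is a small sketch that evaluates $\log P(D)$ for a one-dimensional Gaussian mixture (assuming NumPy/SciPy; the choice of Gaussian components anticipates Case 1 below and is only for illustration):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(X, weights, means, stds):
    """log P(D) = sum_i log sum_j w_j f_j(x_i) for a 1-D Gaussian mixture."""
    # densities[i, j] = f_j(x_i)
    densities = norm.pdf(X[:, None], loc=means[None, :], scale=stds[None, :])
    return np.log(densities @ weights).sum()
```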
The EM (Expectation-Maximization) Algorithm
- A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
- E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of the probabilistic clusters
  $$w_{ij}^t = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid C_j^t, \theta_j^t)\, p(C_j^t)$$
- M-step: finds the new clustering or parameters that maximize the expected likelihood
Case 1: Gaussian Mixture Model
- Generative model: for each object,
  - Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \ldots, w_k)$
  - Sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$
- Overall likelihood function:
  $$L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$$
- Q: What is $\theta$ here?
Estimating Parameters

$$l(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$$

Consider the first derivative with respect to $\mu_j$:
$$\frac{\partial l}{\partial \mu_j}
= \sum_i \frac{w_j}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}
= \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}
= \sum_i w_{ij}\, \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$$
where $w_{ij} = P(Z = j \mid X = x_i, \theta)$.

Setting this to zero directly is intractable: it looks like weighted likelihood estimation, but the weights $w_{ij}$ are themselves determined by the parameters.
Apply the EM Algorithm

An iterative algorithm (at iteration $t+1$):
- E(expectation)-step: evaluate the weights $w_{ij}$ given the current $\mu_j$, $\sigma_j$, $w_j$:
  $$w_{ij}^t = \frac{w_j^t\, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$$
- M(maximization)-step: evaluate $\mu_j$, $\sigma_j$, $w_j$ given the $w_{ij}$'s so as to maximize the weighted likelihood. This is equivalent to Gaussian parameter estimation where each point carries a weight of belonging to each distribution:
  $$\mu_j^{t+1} = \frac{\sum_i w_{ij}^t x_i}{\sum_i w_{ij}^t}; \quad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^t (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^t}; \quad w_j^{t+1} \propto \sum_i w_{ij}^t$$
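A compact EM sketch for the one-dimensional Gaussian mixture case (assuming NumPy/SciPy; names mirror the slide notation and are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.full(k, 1.0 / k)                       # mixing weights w_j
    mu = rng.choice(X, size=k, replace=False)     # means mu_j
    sigma2 = np.full(k, X.var())                  # variances sigma_j^2
    for _ in range(n_iter):
        # E-step: w_ij = w_j p(x_i|mu_j, sigma_j^2) / sum_j' w_j' p(x_i|mu_j', sigma_j'^2)
        dens = norm.pdf(X[:, None], loc=mu[None, :], scale=np.sqrt(sigma2)[None, :])
        resp = w[None, :] * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted parameter estimates
        Nj = resp.sum(axis=0)
        mu = (resp * X[:, None]).sum(axis=0) / Nj
        sigma2 = (resp * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nj
        w = Nj / n
    return w, mu, sigma2, resp
```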
K-Means: A Special Case of the Gaussian Mixture Model
- When each Gaussian component has covariance matrix $\sigma^2 I$ (soft k-means):
  $$p(x_i \mid \mu_j, \sigma^2) \propto \exp\{-\|x_i - \mu_j\|^2 / 2\sigma^2\}$$
  Note the squared distance in the exponent!
- When $\sigma^2 \to 0$, the soft assignment becomes a hard assignment: $w_{ij} \to 1$ if $x_i$ is closest to $\mu_j$ (why?)
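A quick numerical illustration of this limit (the centers and the test point below are made-up values): as $\sigma^2$ shrinks, the responsibilities collapse onto the nearest mean.

```python
import numpy as np

def soft_assignment(x, mus, sigma2):
    """w_ij proportional to exp(-||x - mu_j||^2 / (2*sigma^2)), equal mixing weights."""
    logits = -((x - mus) ** 2) / (2 * sigma2)
    logits -= logits.max()              # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum()

mus = np.array([0.0, 1.0, 3.0])
for s2 in [1.0, 0.1, 0.001]:
    print(s2, soft_assignment(0.8, mus, s2).round(3))
# As sigma^2 -> 0, the assignment approaches (0, 1, 0): hard assignment to the closest mean.
```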
Case 2: Multinomial Mixture Model
- Generative model: for each object,
  - Pick its distribution component: $Z \sim \mathrm{Multi}(w_1, \ldots, w_k)$
  - Sample a value from the selected distribution: $X \sim \mathrm{Multi}(\beta_{Z1}, \beta_{Z2}, \ldots, \beta_{Zm})$
- Overall likelihood function:
  $$L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \beta_j), \quad \sum_j w_j = 1, \;\; \sum_l \beta_{jl} = 1$$
- Q: What is $\theta$ here?
Application: Document Clustering
- A vocabulary containing $m$ words
- Each document $i$ is an $m$-dimensional vector $(c_{i1}, c_{i2}, \ldots, c_{im})$, where $c_{il}$ is the number of occurrences of word $l$ in document $i$
- Under the unigram assumption:
  $$p(x_i \mid \beta_j) = \frac{(\sum_l c_{il})!}{c_{i1}! \cdots c_{im}!}\, \beta_{j1}^{c_{i1}} \cdots \beta_{jm}^{c_{im}}$$
  The multinomial coefficient depends only on the length of the document, so it is constant with respect to all parameters.
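In log space, the document likelihood (up to the constant multinomial coefficient) is just a dot product between the count vector and the log word probabilities; a one-line sketch assuming NumPy and strictly positive $\beta_{jl}$:

```python
import numpy as np

def doc_log_likelihood(counts, beta_j):
    """log p(x_i | beta_j) up to the multinomial coefficient: sum_l c_il * log(beta_jl)."""
    return counts @ np.log(beta_j)
```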
Estimating Parameters

$$l(D; \theta) = \sum_i \log \sum_j \omega_j \prod_l \beta_{jl}^{c_{il}}$$
(the constant multinomial coefficient is dropped)

Apply the EM algorithm:
- E-step:
  $$w_{ij} = \frac{\omega_j\, p(x_i \mid \beta_j)}{\sum_{j'} \omega_{j'}\, p(x_i \mid \beta_{j'})}$$
- M-step: maximize the weighted likelihood $\sum_i \sum_j w_{ij} \sum_l c_{il} \log \beta_{jl}$:
  $$\beta_{jl} = \frac{\sum_i w_{ij} c_{il}}{\sum_{l'} \sum_i w_{ij} c_{il'}}; \quad \omega_j \propto \sum_i w_{ij}$$
  ($\beta_{jl}$ is the weighted percentage of word $l$ in cluster $j$)
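Putting the E-step and M-step together, a sketch of EM for the multinomial mixture applied to a document-word count matrix (assuming NumPy; the small smoothing constant is an implementation convenience to avoid log(0), not part of the slide's derivation):

```python
import numpy as np

def em_multinomial_mixture(C, k, n_iter=50, seed=0):
    """EM for a multinomial mixture; C is an (n_docs, m_words) count matrix."""
    rng = np.random.default_rng(seed)
    n, m = C.shape
    omega = np.full(k, 1.0 / k)                  # cluster priors omega_j
    beta = rng.dirichlet(np.ones(m), size=k)     # word distributions beta_j, shape (k, m)
    for _ in range(n_iter):
        # E-step: log w_ij (up to a constant) = log omega_j + sum_l c_il log beta_jl
        log_resp = np.log(omega)[None, :] + C @ np.log(beta).T     # shape (n, k)
        log_resp -= log_resp.max(axis=1, keepdims=True)            # stabilize
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: beta_jl = weighted percentage of word l in cluster j
        weighted_counts = resp.T @ C + 1e-9                        # shape (k, m)
        beta = weighted_counts / weighted_counts.sum(axis=1, keepdims=True)
        omega = resp.sum(axis=0) / n
    return omega, beta, resp
```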
Better Ways for Topic Modeling
- Topic: a word distribution
- Unigram multinomial mixture model: once the topic of a document is decided, all its words are generated from that topic
- PLSA (probabilistic latent semantic analysis): every word of a document can be sampled from a different topic
- LDA (Latent Dirichlet Allocation): assumes priors on the word distributions and/or the document cluster distribution
Why Does EM Work?
- E-step: compute a tight lower bound $f$ of the original objective function at $\theta^{old}$
- M-step: find $\theta^{new}$ that maximizes the lower bound
- Therefore:
  $$l(\theta^{new}) \ge f(\theta^{new}) \ge f(\theta^{old}) = l(\theta^{old})$$
*How to Find a Tight Lower Bound?
- $q(h)$: the distribution that defines the lower bound we want
- Apply Jensen's inequality
- When does "=" hold, giving a tight lower bound? $q(h) = p(h \mid d, \theta)$ (why?)
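For reference, the usual derivation runs as follows (reconstructed from the standard EM argument, not from the slide itself; notation follows the slide, with $d$ the observed data and $h$ the hidden variables):

```latex
l(\theta) = \log p(d \mid \theta)
          = \log \sum_h p(d, h \mid \theta)
          = \log \sum_h q(h)\,\frac{p(d, h \mid \theta)}{q(h)}
          \;\ge\; \sum_h q(h)\,\log\frac{p(d, h \mid \theta)}{q(h)}
```

The inequality is Jensen's, since $\log$ is concave. Equality holds when $p(d, h \mid \theta)/q(h)$ is constant over $h$, i.e., $q(h) \propto p(d, h \mid \theta)$; normalizing gives $q(h) = p(h \mid d, \theta)$, which is exactly the E-step posterior.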
Advantages and Disadvantages of Mixture Models
- Strengths
  - Mixture models are more general than partitioning
  - Clusters can be characterized by a small number of parameters
  - The results may satisfy the statistical assumptions of the generative models
- Weaknesses
  - Converges to a local optimum (overcome: run multiple times with random initialization)
  - Computationally expensive if the number of distributions is large or the data set contains very few observed data points
  - Needs large data sets
  - Hard to estimate the number of clusters
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Kernel K-Means
- How can we cluster data whose clusters are not linearly separable?
- Use a non-linear map $\phi: R^n \to F$ to map each data point into a higher- (possibly infinite-) dimensional feature space: $x \mapsto \phi(x)$
- Dot-product (kernel) matrix $K$: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$
Typical Kernel Functions (recall kernel SVM)
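The slide's list of kernels did not survive extraction; two kernels commonly paired with kernel k-means (and kernel SVMs) are the polynomial and Gaussian (RBF) kernels, sketched below with NumPy (parameter values are illustrative defaults, not from the lecture):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel: K(x, y) = (x . y + c)^d"""
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```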
Solution of Kernel K-Means
- Objective function in the new feature space:
  $$J = \sum_{j=1}^{k} \sum_i w_{ij} \|\phi(x_i) - c_j\|^2$$
- Algorithm: by fixing the assignment $w_{ij}$,
  $$c_j = \sum_i w_{ij}\, \phi(x_i) \Big/ \sum_i w_{ij}$$
- In the assignment step, assign each data point to the closest center:
  $$d(x_i, c_j) = \left\|\phi(x_i) - \frac{\sum_{i'} w_{i'j}\, \phi(x_{i'})}{\sum_{i'} w_{i'j}}\right\|^2
  = \phi(x_i)\cdot\phi(x_i) - 2\,\frac{\sum_{i'} w_{i'j}\, \phi(x_i)\cdot\phi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'}\sum_l w_{i'j}\, w_{lj}\, \phi(x_{i'})\cdot\phi(x_l)}{\left(\sum_{i'} w_{i'j}\right)^2}$$
- We never need to know $\phi(x)$ explicitly, only the kernel values $K_{ij}$.
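A sketch of one assignment step using only the kernel matrix $K$, never $\phi(x)$ itself (assuming NumPy; the sketch assumes no cluster becomes empty):

```python
import numpy as np

def kernel_kmeans_assign(K, W):
    """One assignment step of kernel k-means using only the kernel matrix K.

    K: (n, n) kernel matrix, K[i, l] = <phi(x_i), phi(x_l)>
    W: (n, k) 0/1 assignment matrix from the previous iteration
    Returns the new cluster label for each point.
    """
    n_j = W.sum(axis=0)                                   # cluster sizes
    # second term: -2 * sum_{i'} w_{i'j} K[i, i'] / n_j
    cross = -2.0 * (K @ W) / n_j[None, :]
    # third term: sum_{i', l} w_{i'j} w_{lj} K[i', l] / n_j^2 (same for every point i)
    within = np.einsum('ij,il,lj->j', W, K, W) / (n_j ** 2)
    # K[i, i] is constant across clusters, so it can be dropped from the argmin
    d2 = cross + within[None, :]
    return d2.argmin(axis=1)
```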
Advantages and Disadvantages of Kernel K-Means
- Advantages
  - The algorithm can identify non-linear cluster structures.
- Disadvantages
  - The number of cluster centers needs to be predefined.
  - The algorithm is more complex and its time complexity is large (the kernel matrix alone is $n \times n$).
- References
  - Kernel k-means and Spectral Clustering. Max Welling.
  - Kernel k-means, Spectral Clustering and Normalized Cuts. Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis.
  - An Introduction to Kernel Methods. Colin Campbell.
Matrix Data: Clustering: Part 2
- Revisit K-means
- Mixture Model and EM algorithm
- Kernel K-means
- Summary
Summary
- Revisit k-means: derivative-based solution
- Mixture models: Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means
- Kernel k-means: objective function; solution; connection to k-means