CS249: ADVANCED DATA MINING
Vector Data: Clustering: Part II
Instructor: Yizhou Sun, yzsun@cs.ucla.edu
May 3, 2017
Methods to Learn

|                    | Classification                                           | Clustering                                                               | Prediction              | Ranking  | Feature Representation |
| Vector Data        | Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN | K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means | Linear Regression; GLM  |          |                        |
| Text Data          |                                                          | PLSA; LDA                                                                |                         |          | Word embedding         |
| Recommender System |                                                          | Matrix Factorization                                                     | Collaborative Filtering |          |                        |
| Graph & Network    | Label Propagation                                        | SCAN; Spectral Clustering                                                |                         | PageRank | Network embedding      |
Vector Data: Clustering: Part 2
- Revisit K-means
- Kernel K-means
- Mixture Model and EM algorithm
- Summary
Recall K-Means
- Objective function: $J = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - c_j\|^2$, the total within-cluster variance.
- Re-arrange the objective function: $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$, where $w_{ij} \in \{0, 1\}$; $w_{ij} = 1$ if $x_i$ belongs to cluster $j$, and $w_{ij} = 0$ otherwise.
- Looking for: the best assignment $w_{ij}$ and the best centers $c_j$.
Solution of K-Means: iterate over $J = \sum_{j=1}^{k} \sum_{i} w_{ij} \|x_i - c_j\|^2$
- Step 1: Fix the centers $c_j$ and find the assignment $w_{ij}$ that minimizes $J$ => $w_{ij} = 1$ iff $\|x_i - c_j\|^2$ is the smallest over all $j$.
- Step 2: Fix the assignment $w_{ij}$ and find the centers that minimize $J$ => set the first derivative of $J$ to 0: $\frac{\partial J}{\partial c_j} = -2 \sum_i w_{ij} (x_i - c_j) = 0 \Rightarrow c_j = \frac{\sum_i w_{ij} x_i}{\sum_i w_{ij}}$. Note that $\sum_i w_{ij}$ is the total number of objects in cluster $j$, so $c_j$ is the cluster mean (a numpy sketch of the two steps follows below).
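As a concrete illustration (not from the slides), here is a minimal numpy sketch of the two alternating steps; the function name `kmeans` and its defaults are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on data matrix X of shape (n, d); returns (assignments, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: fix centers, set w_ij = 1 for the j minimizing ||x_i - c_j||^2.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        assign = d2.argmin(axis=1)
        # Step 2: fix assignments, move each center to its cluster mean.
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j]              # keep empty clusters in place
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # J can no longer decrease
        centers = new_centers
    return assign, centers
```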
Converges! Why? Each of the two steps can only decrease (or leave unchanged) $J$; $J$ is bounded below by 0, and there are only finitely many possible assignments, so the iterations must stop changing after finitely many rounds.
Limitations of K-Means
K-means has problems when clusters are of:
- different sizes and densities
- non-spherical shapes
Limitations of K-Means: Different Sizes and Density
Limitations of K-Means: Non-Spherical Shapes
Demo: http://webdocs.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html
Connections of K-means to Other Methods
- Kernel K-means
- Gaussian Mixture Model
Vector Data: Clustering: Part 2
- Revisit K-means
- Kernel K-means
- Mixture Model and EM algorithm
- Summary
Kernel K-Means
- How to cluster the following data?
- Use a non-linear map $\phi: \mathbb{R}^n \to F$ to map each data point into a higher- or infinite-dimensional space: $x \mapsto \phi(x)$.
- Dot product matrix $K$: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.
Typical Kernel Functions (recall kernel SVM), e.g.:
- Polynomial kernel of degree $h$: $K(x_i, x_j) = (x_i \cdot x_j + 1)^h$
- Gaussian radial basis function (RBF) kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
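A small sketch of how these Gram matrices could be computed with numpy (the function names and parameter defaults are illustrative assumptions, with data points as rows of `X`):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)): the Gaussian (RBF) kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def poly_kernel_matrix(X, h=2):
    # K_ij = (x_i . x_j + 1)^h: the degree-h polynomial kernel.
    return (X @ X.T + 1.0) ** h
```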
Solution of Kernel K-Means
- Objective function under the new feature space: $J = \sum_{j=1}^{k} \sum_i w_{ij} \|\phi(x_i) - c_j\|^2$
- Algorithm: by fixing the assignment $w_{ij}$, $c_j = \sum_i w_{ij} \phi(x_i) / \sum_i w_{ij}$.
- In the assignment step, assign each data point to the closest center:
  $d(x_i, c_j) = \left\| \phi(x_i) - \frac{\sum_{i'} w_{i'j} \phi(x_{i'})}{\sum_{i'} w_{i'j}} \right\|^2 = \phi(x_i) \cdot \phi(x_i) - 2\,\frac{\sum_{i'} w_{i'j}\, \phi(x_i) \cdot \phi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_{l} w_{i'j} w_{lj}\, \phi(x_{i'}) \cdot \phi(x_l)}{\left(\sum_{i'} w_{i'j}\right)^2}$
  $= K_{ii} - 2\,\frac{\sum_{i'} w_{i'j} K_{i i'}}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_{l} w_{i'j} w_{lj} K_{i' l}}{\left(\sum_{i'} w_{i'j}\right)^2}$
- We never need to know $\phi(x)$ explicitly, only $K_{ij}$ (see the sketch below).
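A minimal sketch of kernel k-means working purely from a precomputed Gram matrix `K`, as the slide suggests; the function name and random-assignment initialization are illustrative assumptions:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel k-means given an (n, n) Gram matrix K; returns cluster assignments."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=n)        # random initial assignment
    for _ in range(n_iter):
        W = np.eye(k)[assign]               # (n, k) one-hot w_ij
        sizes = W.sum(axis=0)               # |C_j| = sum_i w_ij
        sizes[sizes == 0] = 1               # guard against empty clusters
        # Second term of d(x_i, c_j): -2 * sum_l w_lj K_il / |C_j|, per (i, j).
        term2 = -2 * (K @ W) / sizes
        # Third term: sum_{l,m} w_lj w_mj K_lm / |C_j|^2; same for every i.
        term3 = np.einsum('lj,lm,mj->j', W, K, W) / sizes ** 2
        # K_ii is constant across j for each point, so it doesn't affect the argmin.
        new_assign = (term2 + term3).argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign
```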
Advantages and Disadvantages of Kernel K-Means
- Advantages: the algorithm can identify non-linear cluster structures.
- Disadvantages: the number of clusters must be specified in advance; the algorithm is more complex and has a large time complexity (computing and storing the Gram matrix alone is $O(n^2)$).
References
- Kernel k-means and Spectral Clustering, by Max Welling.
- Kernel k-means, Spectral Clustering and Normalized Cuts, by Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis.
- An Introduction to Kernel Methods, by Colin Campbell.
Vector Data: Clustering: Part 2
- Revisit K-means
- Kernel K-means
- Mixture Model and EM algorithm
- Summary
Fuzzy Set and Fuzzy Cluster
- The clustering methods discussed so far assign every data object to exactly one cluster.
- Some applications may call for fuzzy or soft cluster assignment; e.g., an e-game could belong to both entertainment and software.
- Methods: fuzzy clusters and probabilistic model-based clusters.
- Fuzzy cluster: a fuzzy set $S$ with membership function $F_S: X \to [0, 1]$ (a value between 0 and 1).
Mixture Model-Based Clustering
- A set $\mathcal{C}$ of $k$ probabilistic clusters $C_1, \dots, C_k$ with probability density functions $f_1, \dots, f_k$ and cluster prior probabilities $w_1, \dots, w_k$, where $\sum_j w_j = 1$.
- Joint probability of an object $x_i$ and its cluster $C_j$: $P(x_i, z_i = C_j) = w_j f_j(x_i)$
- Probability of $x_i$: $P(x_i) = \sum_j w_j f_j(x_i)$
Maximum Likelihood Estimation
- Since objects are assumed to be generated independently, for a data set $D = \{x_1, \dots, x_n\}$ we have:
  $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  $\log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
- Task: find a set $\mathcal{C}$ of $k$ probabilistic clusters such that $P(D)$ is maximized.
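For a 1-d Gaussian mixture, this log-likelihood could be computed as follows (a sketch assuming scipy is available; the function name is an illustrative assumption):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, w, mu, sigma):
    # log P(D) = sum_i log sum_j w_j * N(x_i | mu_j, sigma_j^2)
    # x: (n,) data; w, mu, sigma: (k,) mixture weights, means, std deviations.
    dens = w[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.log(dens.sum(axis=1)).sum()
```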
The EM (Expectation-Maximization) Algorithm
- The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.
- E-step: assigns objects to clusters according to the current fuzzy clustering or parameters of the probabilistic clusters:
  $w_{ij}^{t+1} = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid z_i = j, \theta_j^t)\, p(z_i = j)$
- M-step: finds the new clustering or parameters that maximize the expected likelihood, with respect to the conditional distribution $p(z_i = j \mid \theta_j^t, x_i)$:
  $\theta^{t+1} = \arg\max_\theta \sum_i \sum_j w_{ij}^{t+1} \log L(x_i, z_i = j \mid \theta)$
Gaussian Mixture Model
- Generative model. For each object:
  - Pick its distribution component: $Z \sim \text{Multinomial}(w_1, \dots, w_k)$
  - Sample a value from the selected distribution: $X \sim N(\mu_Z, \sigma_Z^2)$
- Overall likelihood function: $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, subject to $\sum_j w_j = 1$ and $w_j \ge 0$.
- Q: What is $\theta$ here? (All component parameters and mixture weights: $\theta = \{\mu_j, \sigma_j^2, w_j\}_{j=1}^{k}$.)
Estimating Parameters
$L(D; \theta) = \sum_i \log \sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)$
Consider the first derivative with respect to $\mu_j$:
$\frac{\partial L}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j} = \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j} = \sum_i w_{ij} \cdot \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$
where $w_{ij} = P(Z = j \mid X = x_i, \theta)$.
This looks like weighted likelihood estimation, but the weights are themselves determined by the parameters!
Apply EM Algorithm: 1-d
An iterative algorithm (at iteration $t+1$):
- E(expectation)-step: evaluate the weights $w_{ij}$ given $\mu_j, \sigma_j, w_j$:
  $w_{ij}^{t+1} = \frac{w_j^t\, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$
- M(maximization)-step: evaluate $\mu_j, \sigma_j, w_j$ given the $w_{ij}$'s so as to maximize the weighted likelihood. This is equivalent to Gaussian parameter estimation where each point carries a weight of belonging to each distribution:
  $\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1} x_i}{\sum_i w_{ij}^{t+1}}; \quad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} \left(x_i - \mu_j^{t+1}\right)^2}{\sum_i w_{ij}^{t+1}}; \quad w_j^{t+1} \propto \sum_i w_{ij}^{t+1}$
A runnable sketch of this loop follows.
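A minimal sketch of this E/M loop for 1-d data (assuming numpy/scipy; the helper name `em_gmm_1d` and its initialization scheme are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a 1-d Gaussian mixture; x has shape (n,). Returns (w, mu, var)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                   # mixture weights w_j
    mu = rng.choice(x, size=k, replace=False) # initialize means at random points
    var = np.full(k, x.var())                 # initialize variances at data variance
    for _ in range(n_iter):
        # E-step: responsibilities w_ij = w_j p(x_i|mu_j, var_j) / normalizer.
        dens = w * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))  # (n, k)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE of each component's parameters.
        Nj = r.sum(axis=0)                               # effective cluster sizes
        mu = (r * x[:, None]).sum(axis=0) / Nj
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        w = Nj / n
    return w, mu, var
```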
Example: 1-D GMM
2-d Gaussian
- Bivariate Gaussian distribution for a two-dimensional random variable $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$:
  $X \sim N\!\left(\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{X_1, X_2} \\ \sigma_{X_1, X_2} & \sigma_2^2 \end{pmatrix}\right)$
- $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$; $\sigma_1$ and $\sigma_2$ are their standard deviations.
- $\sigma_{X_1, X_2}$ is the covariance between $X_1$ and $X_2$, i.e., $\sigma_{X_1, X_2} = E[(X_1 - \mu_1)(X_2 - \mu_2)]$.
Apply EM Algorithm: 2-d
An iterative algorithm (at iteration $t+1$):
- E(expectation)-step: evaluate the weights $w_{ij}$ given $\mu_j, \Sigma_j, w_j$:
  $w_{ij}^{t+1} = \frac{w_j^t\, p(x_i \mid \mu_j^t, \Sigma_j^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, \Sigma_{j'}^t)}$
- M(maximization)-step: evaluate $\mu_j, \Sigma_j, w_j$ given the $w_{ij}$'s so as to maximize the weighted likelihood; equivalent to weighted Gaussian parameter estimation:
  $\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1} x_i}{\sum_i w_{ij}^{t+1}}; \quad (\sigma_{j,1}^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} \left(x_{i,1} - \mu_{j,1}^{t+1}\right)^2}{\sum_i w_{ij}^{t+1}}; \quad (\sigma_{j,2}^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} \left(x_{i,2} - \mu_{j,2}^{t+1}\right)^2}{\sum_i w_{ij}^{t+1}};$
  $(\sigma_{X_1, X_2, j})^{t+1} = \frac{\sum_i w_{ij}^{t+1} \left(x_{i,1} - \mu_{j,1}^{t+1}\right)\left(x_{i,2} - \mu_{j,2}^{t+1}\right)}{\sum_i w_{ij}^{t+1}}; \quad w_j^{t+1} \propto \sum_i w_{ij}^{t+1}$
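The same loop sketched for 2-d (or general d-dimensional) data with full covariance matrices, using `scipy.stats.multivariate_normal`; again the function name and initialization are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_2d(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture with full covariances; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]       # (k, d) initial means
    cov = np.stack([np.cov(X.T) for _ in range(k)])    # (k, d, d) initial covariances
    for _ in range(n_iter):
        # E-step: w_ij proportional to w_j N(x_i | mu_j, Sigma_j).
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], cov[j])
                                for j in range(k)])    # (n, k)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted mean and covariance per component.
        Nj = r.sum(axis=0)
        mu = (r.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / Nj[j]
            cov[j] += 1e-6 * np.eye(d)                 # small ridge for stability
        w = Nj / n
    return w, mu, cov
```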
K-Means: A Special Case of Gaussian Mixture Model
- When each Gaussian component has covariance matrix $\sigma^2 I$, we get soft k-means:
  $w_{ij} \propto p(x_i \mid \mu_j, \sigma^2)\, w_j = \exp\!\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right) w_j$ — the exponent is just the (scaled) distance!
- When $\sigma^2 \to 0$, the soft assignment becomes a hard assignment: $w_{ij} \to 1$ iff $x_i$ is closest to $\mu_j$. (Why? The weight ratio of any other component to the closest one is $\exp(-\Delta / 2\sigma^2)$ for some gap $\Delta > 0$, which vanishes as $\sigma^2 \to 0$; a tiny numeric check follows.)
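A tiny numeric check of this limit (illustrative only, assuming equal priors $w_j$): as $\sigma^2$ shrinks, the soft weights collapse onto the closest center.

```python
import numpy as np

x = np.array([0.0])                      # one 1-d point
mu = np.array([1.0, 2.0])                # two centers; mu_0 is closer to x
for sigma2 in [10.0, 1.0, 0.1, 0.01]:
    logits = -((x - mu) ** 2) / (2 * sigma2)
    wij = np.exp(logits - logits.max())  # subtract max for numerical stability
    wij /= wij.sum()
    print(sigma2, wij)                   # wij -> [1, 0]: a hard assignment
```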
Why EM Works
- E-step: compute a tight lower bound $L$ of the original objective $\ell$ at $\theta^{old}$, so that $L(\theta^{old}) = \ell(\theta^{old})$.
- M-step: find $\theta^{new}$ that maximizes the lower bound.
- Then $\ell(\theta^{new}) \ge L(\theta^{new}) \ge L(\theta^{old}) = \ell(\theta^{old})$, so the likelihood never decreases.
How to Find a Tight Lower Bound?
- For observed data $d$ and hidden variables $h$, introduce a distribution $q(h)$ — the key to the tight lower bound we want to get:
  $\ell(\theta) = \log p(d \mid \theta) = \log \sum_h q(h)\, \frac{p(d, h \mid \theta)}{q(h)} \ge \sum_h q(h) \log \frac{p(d, h \mid \theta)}{q(h)}$
  by Jensen's inequality ($\log$ is concave).
- When does "=" hold, giving a tight lower bound? When $q(h) = p(h \mid d, \theta)$ (why? then $p(d, h \mid \theta)/q(h) = p(d \mid \theta)$ is constant in $h$), which is exactly the E-step posterior.
In the GMM Case
$L(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2) \ge \sum_i \sum_j w_{ij} \left( \log\big(w_j\, p(x_i \mid \mu_j, \sigma_j^2)\big) - \log w_{ij} \right)$
The first term inside the parentheses is $\log L(x_i, z_i = j \mid \theta)$; the $-\log w_{ij}$ term does not involve $\theta$ and can be dropped in the M-step.
Advantages and Disadvantages of GMM
- Strengths
  - Mixture models are more general than partitioning: they allow clusters of different densities and sizes.
  - Clusters can be characterized by a small number of parameters.
  - The results may satisfy the statistical assumptions of the generative models.
- Weaknesses
  - Converges to a local optimum (mitigation: run multiple times with random initialization).
  - Computationally expensive if the number of distributions is large or the data set contains very few observed data points.
  - Hard to estimate the number of clusters.
  - Each component is a Gaussian, so clusters are modeled as spherical or ellipsoidal; arbitrary shapes are not captured.
Vector Data: Clustering: Part 2
- Revisit K-means
- Kernel K-means
- Mixture Model and EM algorithm
- Summary
Summary
- Revisit k-means: derivation of the solution
- Kernel k-means: objective function; solution; connection to k-means
- Mixture models: Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means