CS6220: DATA MINING TECHNIQUES

Size: px

Start display at page:

Download "CS6220: DATA MINING TECHNIQUES"

Jessica Higgins
5 years ago
Views:

1 CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015

2 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph & Network Images Classification Decision Tree; Naïve Bayes; Logistic Regression SVM; knn HMM Label Propagation* Neural Network Clustering K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* PLSA SCAN*; Spectral Clustering* Frequent Pattern Mining Apriori; FPgrowth GSP; PrefixSpan Prediction Linear Regression Autoregression Similarity Search Ranking DTW P-PageRank PageRank 2

3 Revisit K-means Matrix Data: Clustering: Part 2 Mixture Model and EM algorithm Kernel K-means Summary 3

4 Objective function Recall K-Means k J = j=1 C i =j x i c j 2 Total within-cluster variance Re-arrange the objective function k J = j=1 i w ij x i c j 2 w ij {0,1} w ij = 1, if x i belongs to cluster j; w ij = 0, otherwise Looking for: The best assignment w ij The best center c j 4

5 Iterations Solution of K-Means Step 1: Fix centers c j, find assignment w ij that minimizes J => w ij = 1, if x i c j 2 is the smallest Step 2: Fix assignment w ij, find centers that minimize J => first derivative of J = 0 => J c j = 2 i w ij (x i c j ) = 0 =>c j = Note i w ijx i i w ij i w ij is the total number of objects in cluster j J = k j=1 i w ij x i c j 2 5

11 Converges! Why?

12 Limitations of K-Means K-means has problems when clusters are of different Sizes Densities Non-Spherical Shapes 12

13 Limitations of K-Means: Different Density and Size 13

14 Limitations of K-Means: Non-Spherical Shapes 14

15 Demo de/cluster.html 15

16 Connections of K-means to Other Methods K-means Gaussian Mixture Model Kernel K- means 16

17 Revisit K-means Matrix Data: Clustering: Part 2 Mixture Model and EM algorithm Kernel K-means Summary 17

18 Fuzzy Set and Fuzzy Cluster Clustering methods discussed so far Every data object is assigned to exactly one cluster Some applications may need for fuzzy or soft cluster assignment Ex. An e-game could belong to both entertainment and software Methods: fuzzy clusters and probabilistic model-based clusters Fuzzy cluster: A fuzzy set S: F S : X [0, 1] (value between 0 and 1) 18

19 Mixture Model-Based Clustering A set C of k probabilistic clusters C 1,,C k probability density functions: 1 f,, f k, Cluster prior probabilities: w 1,, w k, j w j = 1 Probability of an object i generated by cluster C j is: P(x i, z i = C j ) = w j f j (x i ) Probability of i generated by the set of cluster C is: P x i = j w j f j (x i ) 19

20 Maximum Likelihood Estimation Since objects are assumed to be generated independently, for a data set D = {x 1,, x n }, we have, P D = P x i = w j f j (x i ) i i j Task: Find a set C of k probabilistic clusters s.t. P(D) is maximized 20

21 The EM (Expectation Maximization) Algorithm The (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models. E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters w t ij = p z i = j θ t j, x i p x i C t j, θ t j p(c t j ) M-step finds the new clustering or parameters that maximize the expected likelihood 21

22 Generative model For each object: Gaussian Mixture Model Pick its distribution component: Z~Multi w 1,, w k Sample a value from the selected distribution: X~N μ Z, σ Z 2 Overall likelihood function L D θ = i j w j p(x i μ j, σ j 2 ) Q: What is θ here? 22

23 Estimating Parameters L D; θ = i log j w j p(x i μ j, σ j 2 ) Considering the first derivative of μ j : Intractable! L u j = i w j j w jp(x i μ j,σ j 2 ) p(x i μ j,σ j 2 ) μ j = i w j p(x i μ j,σ j 2 ) j w jp(x i μ j,σ j 2 ) 1 p(x i μ j,σ j 2 ) p(x i μ j,σ j 2 ) μ j = i w j p(x i μ j,σ j 2 ) j w jp(x i μ j,σ j 2 ) 2 logp(x i μ j,σ j ) u j w ij = P(Z = j X = x i, θ) l(x i )/ μ j Like weighted likelihood estimation; But the weight is determined by the parameters! 23

24 Apply EM algorithm: 1-d An iterative algorithm (at iteration t+1) E(expectation)-step Evaluate the weight w ij when μ j, σ j, w j are given w ij t = w j t p(x i μ j t,(σj 2 ) t ) j w j t p(x i μ j t,(σ j 2 ) t ) M(maximization)-step Evaluate μ j, σ j, w j when w ij s are given that maximize the weighted likelihood It is equivalent to Gaussian distribution parameter estimation when each point has a weight belonging to each distribution μ j t+1 = i w ij t x i i w ij t ; (σ j 2 ) t+1 = i w ij t x i μ j t 2 i w ij t ; w t+1 t j i w ij 24

25 Example: 1-D GMM 25

1, X 2 σ X 1, X 2 σ 2 2 ) σ 1 and σ 2 are standard deviations of X 1 and X 2 σ X

26 2-d Gaussian Bivariate Gaussian distribution Two dimensional random variable: X = X 1 X 1 X 2 N(μ = μ 1 μ 2, Σ = μ 1 and μ 2 are means of X 1 and X 2 X 2 σ 1 2 σ X 1, X 2 σ X 1, X 2 σ 2 2 ) σ 1 and σ 2 are standard deviations of X 1 and X 2 σ X 1, X 2 is the covariance between X 1 and X 2, i. e., σ X 1, X 2 = E X 1 μ 1 X 2 μ 2 26

27 Apply EM algorithm: 2-d An iterative algorithm (at iteration t+1) E(expectation)-step Evaluate the weight w ij when μ j, Σ j, w j are given w ij t = w j t p(x i μ j t,σj t ) j w j t p(x i μ j t,σ j t ) M(maximization)-step Evaluate μ j, Σ j, w j when w ij s are given that maximize the weighted likelihood It is equivalent to Gaussian distribution parameter estimation when each point has a weight belonging to each distribution μ j t+1 = i w ij t x i i w ij t (σ X 1, X 2 j ) t+1 = ; (σ 2 j,1 ) t+1 = i w ij t i w ij t t (x i,1 μ j,1 i w ij t t x i,1 μ j,1 i w ij t t )(xi,2 μ j,2) 2 ; (σ 2 j,2 ) t+1 = i w ij t ; w t+1 t j i w ij t x i,2 μ j,2 2 ; i w ij t 27

28 K-Means: A Special Case of Gaussian Mixture Model When each Gaussian component with covariance matrix σ 2 I Soft K-means p x i μ j, σ 2 2 exp{ x i μ j /σ 2 } When σ 2 0 Soft assignment becomes hard assignment w ij 1, if x i is closest to μ j (why?) Distance! 28

29 *Why EM Works? E-Step: computing a tight lower bound f of the original objective function at θ old M-Step: find θ new to maximize the lower bound l θ new f θ new f(θ old ) = l(θ old ) 29

30 *How to Find Tight Lower Bound? q h : the tight lower bound we want to get Jensen s inequality When = holds to get a tight lower bound? q h = p(h d, θ) (why?) 30

31 Advantages and Disadvantages of GMM Strength Mixture models are more general than partitioning: different densities and sizes of clusters Clusters can be characterized by a small number of parameters The results may satisfy the statistical assumptions of the generative models Weakness Converge to local optimal (overcome: run multi-times w. random initialization) Computationally expensive if the number of distributions is large, or the data set contains very few observed data points Hard to estimate the number of clusters Can only deal with spherical clusters 31

32 Revisit K-means Matrix Data: Clustering: Part 2 Mixture Model and EM algorithm Kernel K-means Summary 32

33 *Kernel K-Means How to cluster the following data? A non-linear map: φ: R n F Map a data point into a higher/infinite dimensional space x φ x Dot product matrix K ij K ij =< φ x i, φ(x j ) > 33

34 Recall kernel SVM: Typical Kernel Functions 34

35 Solution of Kernel K-Means Objective function under new feature space: k J = j=1 i w ij φ(x i ) c j 2 Algorithm By fixing assignment w ij c j = i w ij φ(x i )/ i w ij In the assignment step, assign the data points to the closest center d x i, c j = φ x i i w i jφ x i i w i j 2 = φ x i φ x i 2 i w i j φ x i φ x i i w i j + i l w i j w lj φ x i φ x l ( i w i j )^2 Do not really need to know φ x, but only K ij 35

36 Advantages and Disadvantages of Kernel K-Means Advantages Algorithm is able to identify the non-linear structures. Disadvantages Number of cluster centers need to be predefined. Algorithm is complex in nature and time complexity is large. References Kernel k-means and Spectral Clustering by Max Welling. Kernel k-means, Spectral Clustering and Normalized Cut by Inderjit S. Dhillon, Yuqiang Guan and Brian Kulis. An Introduction to kernel methods by Colin Campbell. 36

37 Revisit K-means Matrix Data: Clustering: Part 2 Mixture Model and EM algorithm Kernel K-means Summary 37

38 Summary Revisit k-means Derivative Mixture models Gaussian mixture model; multinomial mixture model; EM algorithm; Connection to k-means Kernel k-means* Objective function; solution; connection to k-means 38

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network