Probabilistic clustering

Curtis Silvester Powell
6 years ago

1 Machine Learning (Aprendizagem Automática): Probabilistic clustering. Ludwig Krippahl

2 Probabilistic clustering
Summary
- Fuzzy sets and clustering
- Fuzzy c-means
- Probabilistic clustering: mixture models
- Expectation-Maximization, more formal
- External indexes: Rand index
- Assignment 2

3 Fuzzy sets

4 Fuzzy sets
- In conventional set theory, elements either belong or do not belong to a set
- In fuzzy set theory, each element has a degree of membership $u_S(x) \in [0, 1]$ to set S
- More formally, a fuzzy set S is a set of ordered pairs, each indicating an element and its degree of membership:
$$S = \{(x, u_S(x)) \mid x \in X\}$$

5 Fuzzy sets
- Fuzzy sets allow modelling uncertainty
- Linguistic and conceptual uncertainty (old, large, important, ...)
[Figure: membership functions for "cold", "warm" and "hot" over temperature, with membership between 0 and 1. Source: Wikimedia, CC BY-SA 3.0, fullofstars]

6 Fuzzy sets
- Fuzzy sets allow modelling uncertainty
- Linguistic and conceptual uncertainty (old, large, important, ...)
- Informational uncertainty (credit, security risks, ...)
Note: fuzzy membership is not probability. It is a measure of similarity to some imprecise properties.
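
To make the temperature example concrete, a fuzzy set can be written as a membership function that returns a value in [0, 1]. The sketch below is only an illustration: the piecewise-linear shapes, the breakpoints (5, 15, 20 and 30 degrees) and the function names are assumptions, not values taken from the slides.

```python
def ramp_up(x, lo, hi):
    """Return 0 below lo, 1 above hi, and a linear ramp in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def cold(t):
    # Fully "cold" below 5 degrees, not cold at all above 15 (assumed breakpoints).
    return 1.0 - ramp_up(t, 5.0, 15.0)


def hot(t):
    # Not hot below 20 degrees, fully "hot" above 30 (assumed breakpoints).
    return ramp_up(t, 20.0, 30.0)


def warm(t):
    # Rises where "cold" falls and falls where "hot" rises.
    return min(ramp_up(t, 5.0, 15.0), 1.0 - ramp_up(t, 20.0, 30.0))


for t in (0, 10, 17, 25, 35):
    print(t, cold(t), warm(t), hot(t))
```

A temperature of 10 degrees is then partly cold and partly warm at the same time, which is the kind of linguistic uncertainty a crisp set cannot express.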

7 Fuzzy c-partition
U(X) is a fuzzy c-partition of X if:
$$0 \leq u_k(x_n) \leq 1 \quad \forall k, n$$
$$\sum_{k=1}^{c} u_k(x_n) = 1 \quad \forall n$$
$$0 < \sum_{n=1}^{N} u_k(x_n) < N \quad \forall k$$
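
The three conditions can be checked directly on a membership matrix. This is a minimal sketch, assuming the matrix is stored as a list of c rows (one per cluster) with N entries each; the function name and the tolerance are illustrative choices.

```python
def is_fuzzy_c_partition(u, tol=1e-9):
    """u[k][n] is the membership of point n in cluster k (c rows, N columns)."""
    c = len(u)
    n_points = len(u[0])
    # 1. Every membership lies in [0, 1].
    if any(not (0.0 <= value <= 1.0) for row in u for value in row):
        return False
    # 2. The memberships of each point sum to 1 over the clusters.
    for n in range(n_points):
        if abs(sum(u[k][n] for k in range(c)) - 1.0) > tol:
            return False
    # 3. No cluster is empty and no cluster holds all the membership.
    return all(0.0 < sum(row) < n_points for row in u)


print(is_fuzzy_c_partition([[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]]))
print(is_fuzzy_c_partition([[1.0, 1.0], [0.0, 0.0]]))
```

The second matrix fails the third condition: its first cluster absorbs all N points and the second is empty.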

8 Fuzzy c-means

9 Fuzzy c-means
From a set X of N unlabelled data points:
- Return the c × N membership matrix with entries $u_k(x_n)$, defining a fuzzy c-partition of X
- Return the centroids $C = \{c_1, \ldots, c_c\}$
- Minimizing the squared error function:
$$J_m(X, C) = \sum_{k=1}^{c} \sum_{n=1}^{N} u_k(x_n)^m \, \|x_n - c_k\|^2, \qquad m \geq 1$$

10 Fuzzy c-means
Minimizing the squared error function:
$$J_m(X, C) = \sum_{k=1}^{c} \sum_{n=1}^{N} u_k(x_n)^m \, \|x_n - c_k\|^2, \qquad m \geq 1$$
Subject to:
$$\sum_{k=1}^{c} u_k(x_n) = 1 \quad \forall n$$
where m is the degree of fuzzification; typically, m = 2

11 Fuzzy c-means
The derivatives w.r.t. $u_k(x_n)$ are zero at:
$$u_k(x_n) = \frac{\left(\dfrac{1}{\|x_n - c_k\|}\right)^{\frac{2}{m-1}}}{\sum_{j=1}^{c} \left(\dfrac{1}{\|x_n - c_j\|}\right)^{\frac{2}{m-1}}}$$
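
The slide states the result without the intermediate steps. One way to obtain it, sketched here under the assumptions that m > 1 and that no point coincides with a centroid, is to add one Lagrange multiplier per point for the sum-to-one constraint. The multipliers $\lambda_n$ are notation introduced for this sketch and do not appear in the slides.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{k=1}^{c}\sum_{n=1}^{N} u_k(x_n)^m \,\|x_n - c_k\|^2
  - \sum_{n=1}^{N} \lambda_n \Big( \sum_{k=1}^{c} u_k(x_n) - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial u_k(x_n)} &= m\, u_k(x_n)^{m-1} \|x_n - c_k\|^2 - \lambda_n = 0
  \;\Rightarrow\;
  u_k(x_n) = \Big(\frac{\lambda_n}{m}\Big)^{\frac{1}{m-1}} \Big(\frac{1}{\|x_n - c_k\|}\Big)^{\frac{2}{m-1}} \\
\sum_{j=1}^{c} u_j(x_n) = 1 &\;\Rightarrow\;
  \Big(\frac{\lambda_n}{m}\Big)^{\frac{1}{m-1}}
  = \frac{1}{\sum_{j=1}^{c} \big(\frac{1}{\|x_n - c_j\|}\big)^{\frac{2}{m-1}}}
\end{aligned}
```

Substituting the last line into the second gives the membership update: each membership is the inverse distance raised to 2/(m-1), normalised over all centroids.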

12 Fuzzy c-means
The derivatives w.r.t. $c_k$ are zero at:
$$c_k = \frac{\sum_{n=1}^{N} u_k(x_n)^m \, x_n}{\sum_{n=1}^{N} u_k(x_n)^m}$$
That is, each centroid is the weighted mean of the example vectors using the membership values.

13 Fuzzy c-means algorithm
- Random initial centroids $\{c_1, \ldots, c_c\}$
- Compute u:
$$u_k(x_n) = \frac{\left(\dfrac{1}{\|x_n - c_k\|}\right)^{\frac{2}{m-1}}}{\sum_{j=1}^{c} \left(\dfrac{1}{\|x_n - c_j\|}\right)^{\frac{2}{m-1}}}$$
- Recompute $c_k$:
$$c_k = \frac{\sum_{n=1}^{N} u_k(x_n)^m \, x_n}{\sum_{n=1}^{N} u_k(x_n)^m}$$
- Repeat until t iterations or change < ϵ
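
The loop on this slide can be sketched in plain Python as follows. This is an illustrative implementation, not the course's reference code: the function name, the choice of sampling the initial centroids from the data, the handling of a point that coincides with a centroid, and the use of centroid movement as the "change" are all assumptions. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, eps=1e-6, seed=0):
    """Return (centroids, u), with u[k][n] the membership of point n in cluster k."""
    rng = random.Random(seed)
    # Random initial centroids, here drawn from the data (an assumption).
    centroids = [list(p) for p in rng.sample(points, c)]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)  # requires m > 1
    u = []
    for _ in range(max_iter):
        # Membership update from the current centroids.
        u = [[0.0] * len(points) for _ in range(c)]
        for n, x in enumerate(points):
            dists = [math.dist(x, ck) for ck in centroids]
            if min(dists) == 0.0:
                # Point sits exactly on a centroid: give it full membership there.
                u[dists.index(0.0)][n] = 1.0
                continue
            inv = [(1.0 / d) ** power for d in dists]
            total = sum(inv)
            for k in range(c):
                u[k][n] = inv[k] / total
        # Centroid update: mean of the points weighted by u^m.
        new_centroids = []
        for k in range(c):
            w = [u[k][n] ** m for n in range(len(points))]
            sw = sum(w)
            new_centroids.append(
                [sum(wn * x[d] for wn, x in zip(w, points)) / sw for d in range(dim)]
            )
        change = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if change < eps:
            break
    # Note: u was computed from the centroids of the previous iteration.
    return centroids, u


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centres, memberships = fuzzy_c_means(data, c=2)
print(sorted(centres))
```

The membership matrix is returned alongside the centroids, so each point keeps a degree of membership in every cluster rather than a single label.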

14 Fuzzy c-means algorithm
The result is a clustering similar to k-means, but with membership values.
[Figure: example of a fuzzy c-means clustering. Source: "Simulated Annealing - Advances, Applications and Hybridizations", Ed. Marcos de Sales Guerra Tsuzuki, CC BY]

15 Fuzzy c-means algorithm
Defuzzification: if we want a crisp clustering, we need to convert the fuzzy membership function into {0, 1}. Two options:
- Maximum membership
- Nearest centroid
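
A minimal sketch of the maximum-membership rule, assuming the membership matrix is stored as a list of c rows with one entry per point; the function names are illustrative. Ties are broken in favour of the lowest cluster index, which is an arbitrary choice.

```python
def defuzzify_max_membership(u):
    """Assign each point n to the cluster k with the largest u[k][n]."""
    n_points = len(u[0])
    return [max(range(len(u)), key=lambda k: u[k][n]) for n in range(n_points)]


def to_crisp_matrix(u):
    """Turn the fuzzy matrix into a {0, 1} matrix of the same shape."""
    labels = defuzzify_max_membership(u)
    return [[1 if labels[n] == k else 0 for n in range(len(labels))]
            for k in range(len(u))]


fuzzy = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]]
print(defuzzify_max_membership(fuzzy))
print(to_crisp_matrix(fuzzy))
```

When the memberships were computed from the final centroids with the fuzzy c-means update, a closer centroid always receives a larger membership, so the nearest-centroid rule should give the same labels apart from ties.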

16 Probabilistic Clustering

17 Probabilistic Clustering
Suppose we have a probabilistic model for the distribution of examples X:
$$P(X, Y) = P(X \mid Y)\,P(Y)$$
This would provide not only a way of clustering X over Y but also a way to generate examples from that distribution.
The problem is that we do not know Y.

18 Probabilistic Clustering
Recall, from EM:
- Set X of observed data
- Set Z of missing data or latent variables
- Unknown parameters θ
- Likelihood function $L(\theta; X, Z) = p(X, Z \mid \theta)$

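19 Probabilistic Clustering
Let us assume Gaussian distributions. In 1D:
$$\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
[Figure: Gaussian densities $\varphi_{\mu,\sigma^2}(x)$ for (μ = 0, σ² = 0.2), (μ = 0, σ² = 1.0), (μ = 0, σ² = 5.0) and (μ = −2, σ² = 0.5). Source: Wikimedia, public domain]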

20 Probabilistic Clustering
We can create a mixture of Gaussians by adding the different distributions with weights.
[Figure: the same four Gaussian densities $\varphi_{\mu,\sigma^2}(x)$. Source: Wikimedia, public domain]
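
As a small illustration of "adding the distributions with weights", the sketch below evaluates a 1D mixture density. The component means and variances and the equal weights are illustrative assumptions; the only requirement is that the weights are non-negative and sum to 1, so that the mixture is itself a density.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, mus, variances):
    """Weighted sum of the component densities at x."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, mus, variances))


# Illustrative components and equal weights (assumed values).
mus = [0.0, 0.0, 0.0, -2.0]
variances = [0.2, 1.0, 5.0, 0.5]
weights = [0.25, 0.25, 0.25, 0.25]

for x in (-2.0, 0.0, 2.0):
    print(x, mixture_pdf(x, weights, mus, variances))
```

Because the weights sum to 1, the mixture integrates to 1 just like each of its components.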

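21 Probabilistic Clustering
For multivariate Gaussian distributions:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
where μ is the vector of means and Σ the covariance matrix.
A mixture model of K Gaussians is:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixing coefficient, such that:
$$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \leq \pi_k \leq 1 \quad \forall k$$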

22 Probabilistic Clustering
If we create a variable z for each x so that:
$$z_k \in \{0, 1\}, \qquad \sum_{k} z_k = 1$$
(the $z_k$ are the latent variables assigning each example to each cluster)
And we define the distribution:
$$p(x, z) = p(z)\,p(x \mid z)$$
The marginal probability:
$$p(z_k = 1) = \pi_k$$
And:
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Which brings us back to:
$$p(x) = \sum_{z} p(z)\,p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
but now explicitly with the latent variables.

23 Probabilistic Clustering
Using Bayes' rule and the previous equations, we get the posterior probability that component k is responsible for $x_n$ (we'll call it $\gamma(z_{nk})$):
$$p(z_k = 1 \mid x_n) = \frac{p(z_k = 1)\,p(x_n \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\,p(x_n \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk})$$
We want to maximize the likelihood of our parameters. The log-likelihood is:
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
We want to find the maximum, where the derivative is 0.

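24 Probabilistic Clustering
Setting to zero the derivative w.r.t. $\mu_k$ (after multiplying by $\Sigma_k$, assumed invertible):
$$0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, (x_n - \mu_k)$$
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$
where $N_k$ is the effective number of points in component k:
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$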

25 Probabilistic Clustering
Setting to zero the derivative w.r.t. $\mu_k$:
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$
Doing the same for $\Sigma_k$:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T$$
which is the sum of the covariances weighted by the $\gamma(z_{nk})$.
Finally, the $\pi_k$:
$$\pi_k = \frac{N_k}{N}$$

26 Probabilistic Clustering
Now we do EM. First, expectation, computing the posterior probabilities $\gamma(z_{nk})$ from an initial guess of the parameters π, μ, Σ:
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Then we use the posterior $\gamma(z_{nk})$ in the likelihood function and maximize w.r.t. π, μ, Σ:
$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T$$
And repeat until convergence...
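
The two steps can be sketched for the 1D case, where each covariance matrix reduces to a single variance. This is a minimal illustration under stated assumptions: the function names, the stopping rule (change in log-likelihood below a tolerance) and the toy data are not from the slides, and there is no safeguard against a variance collapsing to zero.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, weights, mus, variances, max_iter=200, tol=1e-9):
    """Fit a 1D Gaussian mixture with EM, starting from the given initial guess."""
    weights, mus, variances = list(weights), list(mus), list(variances)
    n, n_comp = len(xs), len(mus)
    prev_ll = -math.inf
    gamma = []
    for _ in range(max_iter):
        # E-step: responsibilities gamma[n][k] and the current log-likelihood.
        gamma, ll = [], 0.0
        for x in xs:
            joint = [weights[k] * gaussian_pdf(x, mus[k], variances[k])
                     for k in range(n_comp)]
            total = sum(joint)
            gamma.append([j / total for j in joint])
            ll += math.log(total)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # M-step: re-estimate means, variances and mixing coefficients.
        for k in range(n_comp):
            n_k = sum(g[k] for g in gamma)  # effective number of points
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / n_k
            variances[k] = sum(g[k] * (x - mus[k]) ** 2
                               for g, x in zip(gamma, xs)) / n_k
            weights[k] = n_k / n
    return weights, mus, variances, gamma


data = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
w, mu, var, resp = em_gmm_1d(data, [0.5, 0.5], [1.0, 3.0], [1.0, 1.0])
print(w, mu, var)
```

The responsibilities returned alongside the parameters give a soft assignment of every point to every component, the probabilistic counterpart of the fuzzy membership matrix.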

27 Probabilistic Clustering
Example, 2 Gaussian components, after 1 pass (figure)

28 Probabilistic Clustering
Example, 2 Gaussian components, after 2 passes (figure)

29 Probabilistic Clustering
Example, 2 Gaussian components, after 3 passes (figure)

30 Probabilistic Clustering
2 Gaussian components are not a good fit for these data (figure)

31 Probabilistic Clustering
Example, 3 Gaussian components, after 1 pass (figure)

32 Probabilistic Clustering
Example, 3 Gaussian components, after 2 passes (figure)

33 Probabilistic Clustering
Example, 3 Gaussian components, after 3 passes (figure)

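34 Probabilistic Clustering
Gaussian mixtures with Scikit-Learn:
    class sklearn.mixture.GMM:
        # arguments
        n_components=1
        covariance_type='diag'  # constraints on covariance
        n_iter=100
        # attributes
        weights_
        means_
        covars_
Note: GMM has since been removed from scikit-learn. Its replacement is sklearn.mixture.GaussianMixture, which takes max_iter instead of n_iter, defaults to covariance_type='full', and exposes covariances_ instead of covars_.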

35 Probabilistic Clustering
Rand index

36 Rand index
In supervised learning, we saw the confusion matrix:
                      Example Class 1    Example Class 0
Prediction Class 1    True Positive      False Positive
Prediction Class 0    False Negative     True Negative
In unsupervised learning we cannot match examples with classes, but we can consider all N(N−1)/2 pairs of points:
- True Positive: a pair from the same group placed in the same cluster
- True Negative: a pair from different groups placed in different clusters
- False Positive: a pair from different groups placed in the same cluster
- False Negative: a pair from the same group placed in different clusters

37 Rand index
For all N(N−1)/2 pairs of examples:
                     Same Group        Different Group
Same Cluster         True Positive     False Positive
Different Cluster    False Negative    True Negative
Using this analogy, we can compute the Rand index, which is analogous to accuracy, and other scores:
$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$
$$\text{Rand} = \frac{TP + TN}{N(N-1)/2} \qquad F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
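
The pair counting can be sketched directly from these definitions. The function names and the toy labels are illustrative; the loop visits all N(N−1)/2 pairs, which is fine for small N but slow for large data sets. The scores are undefined (division by zero) when no pair shares a cluster or no pair shares a group.

```python
from itertools import combinations


def pair_counts(groups, clusters):
    """Count (TP, TN, FP, FN) over all pairs of examples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(groups)), 2):
        same_group = groups[i] == groups[j]
        same_cluster = clusters[i] == clusters[j]
        if same_group and same_cluster:
            tp += 1
        elif not same_group and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1  # different groups, same cluster
        else:
            fn += 1  # same group, different clusters
    return tp, tn, fp, fn


def external_scores(groups, clusters):
    """Return Rand index, precision, recall and F1 as a dictionary."""
    tp, tn, fp, fn = pair_counts(groups, clusters)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "rand": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


true_groups = [0, 0, 0, 1, 1, 1]
found_clusters = [0, 0, 1, 1, 2, 2]
print(pair_counts(true_groups, found_clusters))
print(external_scores(true_groups, found_clusters))
```

Only pair agreement matters, so the cluster labels themselves are arbitrary: renaming the clusters leaves every score unchanged.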

38 Assignment 2

39 Assignment 2
Clustering of seismic activity

40 Assignment 2
External indexes based on tectonic fault data

41 Assignment 2
Seismic events assigned to nearby relevant fault line

42 Assignment 2
Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)

43 Assignment 2
- Read the data file (use pandas?) and convert (lat,lon) to (x,y,z)
- Compare K-Means, DBSCAN and Gaussian Mixture models
- Use silhouette score as internal index
- Use Rand, Precision, Recall, F1 and Adjusted Rand as external indexes
- Explore the main parameters (k, ϵ and number of components)
- Also implement the ϵ selection method in DBSCAN, described in the paper (requires understanding the method...)
- The report is the most important part. Think about and discuss the three algorithms and the different scores, taking into account the relation between the distribution of earthquakes and the fault lines.
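
For the first step, one possible conversion from latitude and longitude to Cartesian coordinates is sketched below. The slides do not specify the formula, so this assumes a spherical Earth with a mean radius of 6371 km and angles given in degrees; the function and constant names are illustrative, and reading the data file is left out because its format is not described here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical model (assumption)


def latlon_to_xyz(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert latitude and longitude in degrees to (x, y, z) on a sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return x, y, z


print(latlon_to_xyz(0.0, 0.0))
print(latlon_to_xyz(38.7, -9.1))
```

Working in (x, y, z) lets Euclidean distance approximate distance between nearby events and avoids the discontinuity in longitude at ±180 degrees.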

44 Probabilistic Clustering
Summary

45 Probabilistic Clustering
Summary
- Fuzzy sets: model uncertainty
- Fuzzy c-means: clustering with membership values
- Probabilistic clustering with Gaussian mixtures
- EM algorithm: estimate posterior probabilities resulting from latent variables, maximize likelihood of parameters, repeat
- Rand index and Assignment 2
Further reading
- Bishop, Section 9.2
- (Optional) Scikit-Learn demo on mixture models:
