Module Master Recherche Apprentissage et Fouille


1 Module Master Recherche Apprentissage et Fouille
Michèle Sebag, Balázs Kégl, Antoine Cornuéjols
19 November 2008

2 Unsupervised Learning
- Clustering
- Data Streaming
- Application: clustering of EGEE jobs, intrusion detection

3 Clustering
- K-Means
- Expectation Maximization
- Selecting the number of clusters
- Affinity propagation
- Scalability

4 Clustering (figure slide; source: elias.pampalk/music/)

5 Clustering: Questions
- Hard or soft?
  - Hard: find a partition of the data.
  - Soft: estimate the distribution of the data as a mixture of components.
- Parametric or non-parametric?
  - Parametric: the number $K$ of clusters is known.
  - Non-parametric: find $K$ (wrapping a parametric clustering algorithm).
- Caveats: complexity, outliers, validation.

6 Formal Background: Notations
- $E = \{x_1, \dots, x_N\}$: dataset; $N$: number of data points.
- $K$: number of clusters (given, or to be optimized).
- $C_k$: the $k$-th cluster.
- Hard clustering: $\tau(i)$ is the index of the cluster containing $x_i$.
- Soft clustering: $f_k$ is the $k$-th model; $\gamma_k(i)$ is the probability that $x_i$ is generated by $f_k$.
- Solution:
  - Hard clustering: a partition $(C_1, \dots, C_K)$.
  - Soft clustering: responsibilities satisfying $\sum_k \gamma_k(i) = 1$ for every $i$.

7 Formal Background, 2
- Quality / cost function: measures how well the clusters characterize the data.
  - Soft clustering: (log-)likelihood.
  - Hard clustering: dispersion $\sum_{k=1}^{K} \frac{1}{2|C_k|} \sum_{x_i, x_j \in C_k} d(x_i, x_j)^2$.
- Tradeoff: quality increases with $K$; regularization is needed to avoid one cluster per data point.

8 Clustering vs Classification (Marina Meila)

            | Classification            | Clustering
K           | # classes (given)         | # clusters (unknown)
Quality     | generalization error      | many cost functions
Focus on    | test set                  | training set
Goal        | prediction (discriminant) | interpretation, analysis (exploratory)
Field       | mature                    | new

9 Non-Parametric Clustering: Hierarchical Clustering
- Principle:
  - agglomerative: join the nearest clusters;
  - divisive: split the most dispersed cluster.
- CONS: complexity $O(N^3)$.
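For illustration, a minimal agglomerative-clustering sketch using SciPy; the two-blob dataset, the average linkage, and the cut into two clusters are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),        # two well-separated blobs
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average")                 # agglomerative: join nearest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```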

10 Hierarchical Clustering, example

11 Influence of the distance/similarity
- Euclidean distance: $d(x, x') = \sqrt{\sum_i (x_i - x'_i)^2}$
- Cosine: $\cos(x, x') = \frac{\sum_i x_i x'_i}{\|x\| \cdot \|x'\|}$ (as a dissimilarity: $1 - \cos(x, x')$)
- Pearson: $\frac{\sum_i (x_i - \bar{x})(x'_i - \bar{x}')}{\|x - \bar{x}\| \cdot \|x' - \bar{x}'\|}$ (as a dissimilarity: $1 - \mathrm{Pearson}(x, x')$)
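A minimal NumPy sketch of the three (dis)similarities above; the vectors x and y are assumed to be 1-D arrays of equal length.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_sim(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# as dissimilarities: euclidean(x, y), 1 - cosine_sim(x, y), 1 - pearson_sim(x, y)
```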

12 Parametric Clustering ($K$ is known)
- Algorithms based on distances: K-means; graph / cut methods.
- Algorithms based on models: mixture of models, the EM algorithm.

13 Clustering
- K-Means
- Expectation Maximization
- Selecting the number of clusters
- Affinity propagation
- Scalability

14 K-Means Algorithm
1. Init: uniformly draw $K$ points $x_{i_1}, \dots, x_{i_K}$ in $E$; set $C_k = \{x_{i_k}\}$.
2. Repeat
3.   Draw without replacement $x_i$ from $E$.
4.   $\tau(i) = \arg\min_{k=1 \dots K} d(x_i, C_k)$ (find the best cluster for $x_i$).
5.   $C_{\tau(i)} = C_{\tau(i)} \cup \{x_i\}$ (add $x_i$ to $C_{\tau(i)}$).
6. Until all points have been drawn.
7. If the partition $C_1, \dots, C_K$ has changed: define $x_{i_k}$ = best point in $C_k$, set $C_k = \{x_{i_k}\}$, go to 2. Otherwise the partition has stabilized and the algorithm terminates (a batch sketch is given below).
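A minimal batch (Lloyd-style) NumPy sketch of this scheme, with the "best point" of a cluster taken as its average; the initialization, fixed iteration cap, and stopping test are simplistic assumptions.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # step 1: init on K data points
    for _ in range(n_iter):
        # steps 3-5: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        tau = dists.argmin(axis=1)
        # step 7: the "best point" of each cluster is taken as its average
        new_centers = np.array([X[tau == k].mean(axis=0) if np.any(tau == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):               # partition stabilized
            break
        centers = new_centers
    return tau, centers
```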

15 K-Means, Knobs
- Knob 1: define $d(x_i, C_k)$:
  - $\min\{d(x_i, x_j),\ x_j \in C_k\}$: favors elongated clusters;
  - (*) $\mathrm{average}\{d(x_i, x_j),\ x_j \in C_k\}$: favors compact clusters;
  - $\max\{d(x_i, x_j),\ x_j \in C_k\}$: favors spherical clusters.
- Knob 2: define the best point in $C_k$:
  - Medoid: $\arg\min_{x_i} \sum_{x_j \in C_k} d(x_i, x_j)$;
  - (*) Average: $\frac{1}{|C_k|} \sum_{x_j \in C_k} x_j$ (does not belong to $E$).

16 No single best choice

17 K-Means, Discussion
- PROS:
  - Complexity $O(K \cdot N)$.
  - Can incorporate prior knowledge through the initialization.
- CONS:
  - Sensitive to initialization.
  - Sensitive to outliers.
  - Sensitive to irrelevant attributes.

18 K-Means, Convergence
For the cost function
$L = \sum_k \sum_{i,j \,:\, \tau(i)=\tau(j)=k} d(x_i, x_j)$,
with $d(x_i, C_k) = \mathrm{average}\{d(x_i, x_j),\ x_j \in C_k\}$ and the best point of $C_k$ taken as the average of the $x_j \in C_k$, K-means converges toward a (local) minimum of $L$.

19 K-Means, Practicalities
- Initialization:
  - uniform sampling;
  - average of $E$ plus random perturbations;
  - average of $E$ plus orthogonal perturbations;
  - extreme points: select $x_{i_1}$ uniformly in $E$, then select $x_{i_j} = \arg\max_{x_i} \sum_{k=1}^{j-1} d(x_i, x_{i_k})$ (see the sketch below).
- Pre-processing: mean-center the dataset.
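A sketch of the "extreme points" initialization, reading the argmax criterion as the summed distance to the already-selected centers (one plausible reading of the slide's garbled formula).

```python
import numpy as np

def extreme_points_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]        # x_{i_1}: drawn uniformly in E
    for _ in range(1, K):
        # summed distance of every point to the already-selected centers
        score = np.zeros(len(X))
        for j in chosen:
            score += np.linalg.norm(X - X[j], axis=1)
        score[chosen] = -np.inf                 # never re-select a chosen point
        chosen.append(int(score.argmax()))
    return X[np.array(chosen)]
```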

20 Clustering
- K-Means
- Expectation Maximization
- Selecting the number of clusters
- Affinity propagation
- Scalability

21 Model-based clustering
- Mixture of components: density $f = \sum_{k=1}^{K} \pi_k f_k$, with $f_k$ the $k$-th component of the mixture.
- The responsibilities $\gamma_k(i) = \frac{\pi_k f_k(x_i)}{f(x_i)}$ induce the clusters $C_k = \{x_j \,/\, k = \arg\max_{k'} \gamma_{k'}(j)\}$.
- Nature of the components: prior knowledge; most often Gaussian, $f_k = \mathcal{N}(\mu_k, \Sigma_k)$.
- Beware: clusters are not always Gaussian...

22 Model-based clustering, 2
- Search space, solution: $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$.
- Criterion: log-likelihood of the dataset, to be maximized:
$\ell(\theta) = \log \Pr(E) = \sum_{i=1}^{N} \log \Pr(x_i) = \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i) \right)$
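As a practical note, this log-likelihood is usually evaluated with the log-sum-exp trick to avoid underflow; a sketch, assuming Gaussian components via scipy.stats.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mus, Sigmas):
    # log_terms[i, k] = log pi_k + log f_k(x_i)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))
    ])
    # l(theta) = sum_i log sum_k exp(log_terms[i, k])
    return logsumexp(log_terms, axis=1).sum()
```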

23 Model-based clustering with EM
- Formalization: define $z_{i,k} = 1$ iff $x_i$ belongs to $C_k$; then $E[z_{i,k}] = \gamma_k(i)$, the probability that $x_i$ was generated by $\pi_k f_k$.
- Expectation of the (complete) log-likelihood:
$E[\ell_c(\theta)] = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k(i) \log(\pi_k f_k(x_i)) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k(i) \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_k(i) \log f_k(x_i)$
- EM optimization:
  - E step: given $\theta$, compute $\gamma_k(i) = \frac{\pi_k f_k(x_i)}{f(x_i)}$.
  - M step: given the $\gamma_k(i)$, compute $\theta = (\pi_k, \mu_k, \Sigma_k) = \arg\max E[\ell_c(\theta)]$.

24 Maximization step
- $\pi_k$, fraction of points in $C_k$: $\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_k(i)$
- $\mu_k$, mean of $C_k$: $\mu_k = \frac{\sum_{i=1}^{N} \gamma_k(i)\, x_i}{\sum_{i=1}^{N} \gamma_k(i)}$
- $\Sigma_k$, covariance: $\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_k(i)\, (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{N} \gamma_k(i)}$
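Putting the E and M steps together, a compact EM sketch for a Gaussian mixture; the initialization, the fixed iteration count, and the small ridge added to the covariances are assumptions for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E step: gamma[i, k] = pi_k f_k(x_i) / f(x_i)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate (pi_k, mu_k, Sigma_k) from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.array([
            (gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
            + 1e-6 * np.eye(d)                  # small ridge for stability
            for k in range(K)])
    return pi, mus, Sigmas, gamma
```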

25 Clustering K-Means Expectation Maximization Selecting the number of clusters Affinity propagation Scalability

26 Choosing the number of clusters
K-means constructs a partition whatever the value of $K$ is. Selection of $K$:
- Bayesian approaches: tradeoff between accuracy and richness of the model.
- Stability: varying the data should not change the result.
- Gap statistics: compare with the null hypothesis that all the data belong to the same cluster.

27 Bayesian approaches
Bayesian Information Criterion: $\mathrm{BIC}(\theta) = \ell(\theta) - \frac{\#\theta}{2} \log N$; select $K = \arg\max \mathrm{BIC}(\theta)$, where $\#\theta$ is the number of free parameters in $\theta$:
- if all components share the same scalar variance $\sigma$: $\#\theta = K - 1 + Kd + 1$;
- if each component has its own scalar variance $\sigma_k$: $\#\theta = K - 1 + K(d + 1)$;
- if each component has a full covariance matrix $\Sigma_k$: $\#\theta = K - 1 + K\left(d + \frac{d(d+1)}{2}\right)$.
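A sketch of BIC-based selection of $K$ with scikit-learn; note that sklearn's .bic() is defined as $-2\ell(\theta) + \#\theta \log N$ and is therefore minimized, which is equivalent to maximizing the criterion above up to sign and scale.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K_by_bic(X, K_max=10):
    bics = [GaussianMixture(n_components=k, covariance_type="full",
                            random_state=0).fit(X).bic(X)
            for k in range(1, K_max + 1)]
    return int(np.argmin(bics)) + 1             # K with the best (lowest) BIC
```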

28 Gap statistics
Principle: hypothesis testing.
1. Consider the hypothesis $H_0$: there is no cluster in the data; $E$ is generated from a no-cluster distribution $\pi$.
2. Estimate the distribution $f_{0,K}$ of $L(C_1, \dots, C_K)$ for data generated according to $\pi$: analytically if $\pi$ is simple, with Monte-Carlo methods otherwise.
3. Reject $H_0$ with confidence $\alpha$ if the probability of generating the observed value $L(C_1, \dots, C_K)$ under $f_{0,K}$ is less than $\alpha$.
Beware: the test is done for all $K$ values...

29 Gap statistics, 2
Algorithm: assume $E$ is extracted from a no-cluster distribution, e.g. a single Gaussian.
1. Sample a dataset according to this distribution.
2. Apply K-means on this sample.
3. Measure the associated loss function.
Repeat, and compute the average $\bar{L}_0(K)$ and the variance $\sigma_0(K)$. Define the gap:
$\mathrm{Gap}(K) = \bar{L}_0(K) - L(C_1, \dots, C_K)$
Rule: select the smallest $K$ such that $\mathrm{Gap}(K) \ge \mathrm{Gap}(K+1) - \sigma_0(K+1)$.
What is nice: it also tells if there are no clusters in the data...
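A Monte-Carlo sketch of the gap computation; the uniform-over-bounding-box reference is one common no-cluster choice (an assumption), and kmeans() refers to the sketch given with the K-means slide.

```python
import numpy as np

def within_loss(X, tau, centers):
    # sum of squared distances of the points to their cluster center
    return sum(np.sum((X[tau == k] - centers[k]) ** 2) for k in range(len(centers)))

def gap_statistic(X, K, n_ref=20, seed=0):
    rng = np.random.default_rng(seed)
    tau, centers = kmeans(X, K)                       # kmeans(): sketch from slide 14
    L = within_loss(X, tau, centers)
    L0 = []
    for _ in range(n_ref):
        # reference sample: uniform over the bounding box of the data
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        t, c = kmeans(ref, K)
        L0.append(within_loss(ref, t, c))
    return np.mean(L0) - L, np.std(L0)                # Gap(K) and sigma_0(K)
```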

30 Stability
Principle:
- Consider $E'$, a perturbation of $E$.
- Construct $C'_1, \dots, C'_K$ from $E'$.
- Evaluate the distance between $(C_1, \dots, C_K)$ and $(C'_1, \dots, C'_K)$.
- If the distance is small (stability), $K$ is OK.
Distortion $D(\cdot)$: define
- $S$: $S_{ij} = \langle x_i, x_j \rangle$;
- $(\lambda_i, v_i)$: the $i$-th (eigenvalue, eigenvector) pair of $S$;
- $X$: $X_{ij} = 1$ iff $x_i \in C_j$.
Then $D(\mathcal{C}) = \sum_i \|x_i - \mu_{\tau(i)}\|^2 = \mathrm{tr}(S) - \mathrm{tr}(X^\top S X)$, and the minimal distortion is $D^* = \mathrm{tr}(S) - \sum_{k=1}^{K-1} \lambda_k$.
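A numerical check of the trace identity above; the indicator matrix is column-normalized here ($X_{ik} = 1/\sqrt{|C_k|}$ if $x_i \in C_k$), an assumption under which the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 2))
K = 3
tau = rng.permutation(np.repeat(np.arange(K), 10))   # an arbitrary balanced assignment

# direct distortion: sum_i ||x_i - mu_{tau(i)}||^2
mus = np.array([data[tau == k].mean(axis=0) for k in range(K)])
D_direct = np.sum((data - mus[tau]) ** 2)

# trace form: tr(S) - tr(X^T S X) with the column-normalized indicator X
S = data @ data.T                                    # Gram matrix S_ij = <x_i, x_j>
Xind = np.zeros((len(data), K))
for k in range(K):
    Xind[tau == k, k] = 1.0 / np.sqrt(np.sum(tau == k))
D_trace = np.trace(S) - np.trace(Xind.T @ S @ Xind)

print(D_direct, D_trace)                             # the two values coincide
```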

31 Stability, 2
Results:
- If a clustering has low distortion, then $(\mu_1, \dots, \mu_K)$ is close to the space spanned by $(v_1, \dots, v_K)$.
- If two clusterings $\mathcal{C}_1$ and $\mathcal{C}_2$ both have low distortion, then they are close to each other (and close to the optimal clustering).
Counter-example: Meila, ICML 2006.

32 From K-Means to K-Centers
Assumptions behind K-Means:
- a distance or dissimilarity;
- the possibility to create artefacts (barycenters), which is not applicable in some domains: what is the average molecule? the average sentence?
K-Centers, position of the problem: a combinatorial optimization problem. Find $\sigma : \{1, \dots, N\} \to \{1, \dots, N\}$ minimizing
$E[\sigma] = \sum_{i=1}^{N} d(x_i, x_{\sigma(i)})$
(What is missing here?)

33 Clustering
- K-Means
- Expectation Maximization
- Selecting the number of clusters
- Affinity propagation
- Scalability

34 Motivations
Clustering: unsupervised learning. Affinity Propagation and the state of the art:

            | K-means       | K-centers     | AP
exemplar    | artefact      | actual point  | actual point
parameter   | K             | K             | s (penalty)
algorithm   | greedy search | greedy search | message passing
performance | not stable    | not stable    | stable
complexity  | N K           | N K           | N^2 log(N)

Clustering by Passing Messages Between Data Points. B.J. Frey, D. Dueck. Science 2007.

35 Affinity Propagation
Given $E = \{e_1, e_2, \dots, e_N\}$ and a dissimilarity $d(e_i, e_j)$ between elements, find $\sigma : E \to E$ ($\sigma(e_i)$ is the exemplar representing $e_i$) such that
$\sigma = \arg\max \sum_{i=1}^{N} S(e_i, \sigma(e_i))$
where $S(e_i, e_j) = -d^2(e_i, e_j)$ if $i \ne j$, and $S(e_i, e_i) = -s$, with $s$ a penalty parameter.
Particular cases:
- $s = \infty$: only one exemplar (1 cluster);
- $s = 0$: every point is an exemplar ($N$ clusters).

36 Affinity Propagation, Principle
Algorithm: message propagation.
- Responsibility $r(i,k)$ and availability $a(i,k)$: could $x_k$ be the exemplar for $x_i$?

37 Affinity Propagation, 2
Two types of messages:
- $r(i,k)$: responsibility, sent from $e_i$ to $e_k$: how well-suited $e_k$ is as an exemplar for $e_i$;
- $a(i,k)$: availability, sent from $e_k$ to $e_i$: how available $e_k$ is as an exemplar for $e_i$.
Propagation rules (see the sketch below):
$r(i,k) = S(e_i, e_k) - \max_{k' \ne k} \{ a(i,k') + S(e_i, e_{k'}) \}$
$r(k,k) = S(e_k, e_k) - \max_{k' \ne k} \{ S(e_k, e_{k'}) \}$
$a(i,k) = \min\left\{0,\ r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\right\}$
$a(k,k) = \sum_{i' \ne k} \max\{0, r(i',k)\}$
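A minimal NumPy sketch of these propagation rules in their standard vectorized form (the diagonal responsibility is handled by the same max rule), with damping for convergence as Frey and Dueck recommend; the damping factor and iteration count are assumptions.

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    # S: similarity matrix, e.g. S[i, j] = -d(e_i, e_j)**2 off the diagonal
    # and the penalty term on the diagonal.
    N = S.shape[0]
    R = np.zeros((N, N))   # responsibilities r(i, k)
    A = np.zeros((N, N))   # availabilities  a(i, k)
    rows = np.arange(N)
    for _ in range(n_iter):
        # r(i,k) = S(i,k) - max_{k' != k} { a(i,k') + S(i,k') }
        M = A + S
        idx = M.argmax(axis=1)
        first = M[rows, idx]
        M[rows, idx] = -np.inf
        second = M.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min{0, r(k,k) + sum_{i' not in {i,k}} max{0, r(i',k)}}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) itself in the column sums
        Anew = Rp.sum(axis=0)[None, :] - Rp
        diag = Anew.diagonal().copy()          # a(k,k) = sum_{i' != k} max{0, r(i',k)}
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    return (A + R).argmax(axis=1)              # exemplar chosen by each point
```

With $S$ built as on the previous slide, affinity_propagation(S) returns, for each point, the index of its exemplar.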

38 45 Iterations of Message passing (figure slides: successive frames of the message-passing animation)

46 47 Affinity Propagation, cont'd (figure slides)

48 Affinity Propagation in a Nutshell
- WHEN to use it? When averages don't make sense (e.g., molecules, documents).
- PROS vs K-centers: lower distortion $D([\sigma]) = \sum_{i=1}^{N} d^2(e_i, \sigma(e_i))$.
- CONS: computational complexity:
  - similarity computation: $O(N^2)$;
  - message passing: $O(N^2 \log N)$.

49 Clustering
- K-Means
- Expectation Maximization
- Selecting the number of clusters
- Affinity propagation
- Scalability

50 Hierarchical AP
Clustering data streams: Theory and practice. S. Guha, A. Meyerson, N. Mishra, R. Motwani. TKDE 2003.

51 Weighted AP
From AP to WAP:
- $e_i \to (e_i, n_i)$ (each element carries a multiplicity $n_i$);
- $S(e_i, e_j) \to n_i\, S(e_i, e_j)$;
- $S(e_i, e_i) \to S(e_i, e_i) + (n_i - 1)\, \epsilon$;
with $S(e_i, e_j)$ the price for $e_i$ to select $e_j$ as an exemplar, and $\epsilon$ the variance of the $n_i$ points.
Proposition: WAP is equivalent to AP with duplicated elements.

52 Hierarchical WAP
The complexity of Hi-WAP is $O(N^{3/2})$; it can be iteratively reduced to $O(N^{1+\gamma})$.

53 Validation of Hi-WAP on EGEE jobs
EGEE (Enabling Grids for E-sciencE):
- sites in 50 countries;
- 10,000 users, access to 80,000 CPUs;
- 300,000 jobs per day.
Description of the jobs (237,087):
- 4 numeric features (e.g., duration of execution);
- 1 symbolic feature (name of the queue).

54 Validation of Hi-WAP on EGEE jobs
Evaluation: distortion $D([\sigma]) = \sum_{i=1}^{N} d^2(e_i, \sigma(e_i))$.
- 237,087 jobs, each job in $\mathbb{R}^5$;
- CPU 10 (Intel 2.66GHz).
Baseline: K-centers, best out of runs, same overall computational cost.
