Lecture 11: Unsupervised Machine Learning



CSE517A Machine Learning, Spring 2018
Instructor: Marion Neumann
Scribe: Jingyu Xin
Reading: fcml Ch. 6 (Intro), 6.2 (k-means), 6.3 (Mixture Models); [optional]: esl 8.5 (EM)

1 Introduction

Main question: Can we describe the hidden structure in (unlabeled) data?

Problems in unsupervised learning (USL):
- Clustering: partitioning a data set into a finite number of (disjoint or overlapping) groups
- Projection/Dimensionality Reduction: projecting high-dimensional data to a lower-dimensional space

Use cases:
- data analysis, data exploration
- visualization
- feature selection for supervised learning (make $d$ smaller)
- data compression
- data denoising / prototype learning (e.g. to make kernel methods more tractable for big data)
- recommendation systems
- topic modeling

Techniques:
(1) Clustering: k-means, Gaussian Mixture Models (GMMs)
(2) Dimensionality Reduction: PCA/SVD, spectral methods

2 Clustering

Goal: partition the data points $\{x_i\}_{i=1}^n$ into $k$ groups (clusters) such that points within a cluster are close and points in different clusters are far away.

Given: $D = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$

Task: identify the clusters and the cluster assignment for each point.

Figure 1: Illustration of clustering.

2.1 k-means Clustering

k-means is one of the simplest and arguably the most popular clustering algorithms.

Assumptions:
- the number of clusters $k$ is known
- each input can only belong to a single cluster

Notation:
- cluster assignment: $z_i = (0, \dots, 0, 1, 0, \dots, 0)^\top \in \mathbb{R}^k$ is a one-hot encoding, and the index $\alpha$ of the 1 means that the $i$-th point belongs to the $\alpha$-th cluster, with $i = 1, \dots, n$ and $\alpha = 1, \dots, k$
- the cluster representative/prototype is the cluster mean:

$$\mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, x_i}{\sum_{i=1}^n [z_i]_\alpha} \tag{1}$$

Algorithm

We aim to minimize the following objective function:

$$D(\mu_1, \dots, \mu_k, z_1, \dots, z_n) = \sum_{i=1}^n \sum_{\alpha=1}^k [z_i]_\alpha \underbrace{\|x_i - \mu_\alpha\|^2}_{= d_{i\alpha}} \tag{2}$$

Unfortunately, this objective is non-convex and non-continuous and therefore hard to optimize. So we have to resort to an approximate solution, for instance by performing alternating minimization. This algorithm, known as k-means clustering, is stated in Algorithm 1. The first for loop updates the cluster assignments and the second for loop updates the cluster centers.

Algorithm 1 k-means clustering
  Initialize $\mu_1, \dots, \mu_k$ to distinct data points in $D$
  repeat
    for all $i = 1, \dots, n$ do
      $$[z_i]_\alpha = \begin{cases} 1 & \text{if } \alpha = \arg\min_{\alpha'} \underbrace{\|x_i - \mu_{\alpha'}\|^2}_{= d_{i\alpha'}} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
    for all $\alpha = 1, \dots, k$ do
      $$\mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, x_i}{\sum_{i=1}^n [z_i]_\alpha} \tag{4}$$
  until convergence
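The alternating minimization of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` cap, the seeded random initialization, and the handling of empty clusters (keep the old center) are assumptions that the notes do not specify.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Alternating minimization of the k-means objective (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize the centers to k data points with distinct indices.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # Assignment step, Eq. (3): d[i, a] = ||x_i - mu_a||^2.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged: converged
        labels = new_labels
        # Update step, Eq. (4): each center becomes the mean of its points.
        for a in range(k):
            members = X[labels == a]
            if len(members) > 0:  # assumption: keep the old center if a cluster empties
                mu[a] = members.mean(axis=0)
    objective = ((X - mu[labels]) ** 2).sum()  # value of D, Eq. (2)
    return mu, labels, objective
```

The integer array `labels` stores the index of the 1 in each one-hot vector $z_i$, which is more compact than storing the vectors themselves.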

Visualization of the k-means algorithm

Figure 2: Illustration of k-means. Top/left image: 2-dimensional data set. Top/right: randomly chosen initial two cluster centers (they are close together). Bottom/left: after one update the two clusters drift apart. Bottom/right: final assignment (captures the two clusters fairly well).

Choosing the number of clusters k

Typically, the number of clusters $k$ is unknown. We could pick the $k$ that leads to the minimal value of our objective function $D$, cf. Eq. (2). However, $D$ always decreases as $k$ increases.¹ So, in order to find an appropriate $k$, we can use the following strategy: apply k-means repeatedly with an increasing value of $k$ and observe the value of the objective function $D$ (ideally averaged over several different initializations). Then plot the values of $D$ against $k$ and find the kink in the curve.

Figure 3: A 2-dimensional data set with two inherent clusters.

Figure 3 shows a data set with two inherent clusters and how the objective $D$, cf. Eq. (2), varies as $k$ increases. We can see that the objective value $D$ always decreases (on average) as $k$ increases; however, the amount of additional improvement diminishes once the true underlying number of clusters has been reached. This effect leads to a kink in the decreasing curve of $D$ and points us to a good value for $k$.

¹ Note that $D$ is of course minimized in the trivial and uninteresting case of $k = n$, where each input has its own cluster.
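The strategy for choosing $k$ can be sketched as follows. The toy data (two Gaussian blobs), the helper names `kmeans_objective` and `elbow_curve`, and the number of restarts are assumptions chosen for illustration; the code only prints the averaged objective values and leaves locating the kink to the reader.

```python
import numpy as np

def kmeans_objective(X, k, rng, max_iter=100):
    """Run one k-means fit from a random start and return the final D, Eq. (2)."""
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_mu = np.array([X[labels == a].mean(axis=0) if np.any(labels == a)
                           else mu[a] for a in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def elbow_curve(X, ks, n_restarts=10, seed=0):
    """Average D over several random initializations for each k."""
    rng = np.random.default_rng(seed)
    return [float(np.mean([kmeans_objective(X, k, rng) for _ in range(n_restarts)]))
            for k in ks]

# Toy data (an assumption for illustration): two well-separated blobs.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0.0, 0.5, (100, 2)),
               data_rng.normal(6.0, 0.5, (100, 2))])
ks = [1, 2, 3, 4, 5]
curve = elbow_curve(X, ks)
for k, D in zip(ks, curve):
    print(k, round(D, 1))
```

On data like this, the drop in $D$ from $k = 1$ to $k = 2$ should dwarf all later drops, which is the kink the text describes.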

Figure 4 shows another 2-dimensional toy data set, this time with four inherent clusters.

Figure 4: A 2-dimensional data set with four inherent clusters.

Applications of k-means
- use clusters as data representatives/prototypes as illustrated in Fig. 5, e.g. in rbf networks
- use clusters as compressed information as illustrated in Fig. 6 (cf. vector quantization)

Figure 5: The $k = 20$ cluster centers of k-means applied to a data set of handwritten digits. The cluster centers consist of the average image in each cluster and are not actual data points in the original data set. However, they clearly show that the data set has strong cluster structure defined by the different digit types.


Figure 6: An example application of k-means used for color reduction. The image pixels are represented in a 3-dimensional space (red, green, blue) and clustered with k-means. After clustering, all pixels are substituted by their corresponding cluster centers, effectively reducing the number of different colors to only $k$ colors.

Summary
+ easy to implement
+ fast for small k
+ easily parallelizable
- can be dominated by outliers and initialization
- only allows hard cluster assignments
- clusters are spherical!

To overcome this last drawback, we will now discuss two generalizations of the k-means model: kernel k-means and Gaussian mixture model clustering.

2.2 Kernel k-means

To kernelize k-means, we can apply the kernel trick. Hence, let's rephrase the equations of Algorithm 1 such that $x_i$ only appears in inner products. Then, replace all inner products by a kernel function.

This works out nicely for Eq. (3):

$$\begin{aligned}
d_{i\alpha} &= \|x_i - \mu_\alpha\|^2 = (x_i - \mu_\alpha)^\top (x_i - \mu_\alpha) \\
&= \left(x_i - \frac{\sum_{j=1}^n [z_j]_\alpha x_j}{\sum_{j=1}^n [z_j]_\alpha}\right)^{\!\top} \left(x_i - \frac{\sum_{j=1}^n [z_j]_\alpha x_j}{\sum_{j=1}^n [z_j]_\alpha}\right) \\
&= \underbrace{x_i^\top x_i}_{\langle \phi(x_i), \phi(x_i) \rangle} - \frac{2}{\sum_{j=1}^n [z_j]_\alpha} \sum_{j=1}^n [z_j]_\alpha \underbrace{x_i^\top x_j}_{\langle \phi(x_i), \phi(x_j) \rangle} + \frac{1}{\left(\sum_{j=1}^n [z_j]_\alpha\right)^2} \sum_{j=1}^n \sum_{l=1}^n [z_j]_\alpha [z_l]_\alpha \underbrace{x_j^\top x_l}_{\langle \phi(x_j), \phi(x_l) \rangle}
\end{aligned} \tag{5}$$

However, in Eq. (4) $\mu_\alpha$ cannot be computed for infinite-dimensional $\phi(x)$:

$$\mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, \phi(x_i)}{\sum_{i=1}^n [z_i]_\alpha} \tag{6}$$

So we have to come up with an alternative k-means algorithm that does not compute the cluster centers. Instead, it computes the closest cluster $\alpha$ for each data point $x_i$ based on the minimal $d_{i\alpha}$, cf. Algorithm 2.

Algorithm 2 Kernel k-means
  Randomly initialize $z_1, \dots, z_n$ (or use the result of standard k-means)
  repeat
    for all $i$ & $\alpha$ do
      $d_{i\alpha} \leftarrow$ Eq. (5) with $\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)$
    update $z_i$ using $\alpha = \arg\min_\alpha d_{i\alpha}$
    if the $z_i$ do not change then
      STOP
    end if
  until convergence

Figure 7: Illustrations of kernel k-means. (left) clusters found by k-means. (right) clusters found by kernel k-means. Note that we do not visualize the cluster centers for kernel k-means, as they live in the (explicit or implicit) transformed feature space defined by the kernel and not in the original input space shown here.
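A minimal NumPy sketch of Algorithm 2, computing the three terms of Eq. (5) directly from a precomputed Gram matrix. The function names, the RBF kernel with parameter `gamma`, the iteration cap, and the guard for empty clusters are assumptions made for illustration and do not come from the notes.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (one possible kernel)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel k-means (sketch of Algorithm 2) using only the Gram matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)  # random initial assignments
    for _ in range(max_iter):
        Z = np.eye(k)[labels]                 # one-hot assignments, shape (n, k)
        # Assumption: an empty cluster gets size 1, so its terms 2 and 3 vanish.
        sizes = np.maximum(Z.sum(axis=0), 1)
        # The three terms of Eq. (5):
        term1 = np.diag(K)[:, None]                             # k(x_i, x_i)
        term2 = -2.0 * (K @ Z) / sizes                          # cross term
        term3 = np.einsum('ja,jl,la->a', Z, K, Z) / sizes ** 2  # within-cluster term
        d = term1 + term2 + term3                               # d[i, a]
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged: stop
        labels = new_labels
    return labels
```

Passing `K = X @ X.T` (the linear kernel) recovers ordinary k-means distances, which is a convenient sanity check; passing `rbf_kernel(X, gamma)` gives the non-linear variant. Because only `K` is needed, the same function applies to any data type for which a kernel can be evaluated.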

Note that now we can cluster non-vectorial data (text, graphs, molecules, ...). We only need to be able to compute kernel values among the data objects.

2.3 Gaussian Mixture Model Clustering

The Gaussian Mixture Model (gmm) can be thought of as a generalization of the k-means algorithm. Now, define $[z_i]_\alpha \in [0, 1]$ as the probability that $x_i$ belongs to cluster $\alpha$. This gives soft cluster assignments with

$$\sum_{\alpha=1}^k [z_i]_\alpha = 1 \tag{7}$$

Assumption: the data set $D$ comes from a mixture of Gaussians as illustrated in Figs. 8 and 9.
- points in each cluster $c_\alpha$ come from a different multivariate Gaussian:

$$x_i \sim \mathcal{N}(\mu_\alpha, \Sigma_\alpha) = p(x_i \mid c_\alpha) \tag{8}$$

- cluster center = mean of the Gaussian, $\mu_\alpha$
- radius = covariance of the Gaussian, $\Sigma_\alpha$

Figure 8: Mixture of three Gaussians on 1-dimensional input data. The three mixture components are shown in blue and the mixture distribution $p(x)$ in red.

Figure 9: Mixture of three Gaussians on 2-dimensional input data. (a) shows the level-sets of the three mixture components, (b) shows the level-sets of the mixture distribution, and (c) shows the mixture distribution $p(x)$.

Now, we can model how the points in $D$ are generated (generative model):
- pick a cluster according to some probability $\pi$
- pick a data point from $\mathcal{N}(\mu_\alpha, \Sigma_\alpha)$

For gmm clustering, we model the cluster assignment using this generative model:

$$[z_i]_\alpha = p(c_\alpha \mid x_i) = \frac{p(x_i \mid c_\alpha) \, p(c_\alpha)}{\sum_l p(x_i \mid c_l) \, p(c_l)} = \frac{p(x_i \mid \mu_\alpha, \Sigma_\alpha) \, \pi_\alpha}{\sum_l p(x_i \mid \mu_l, \Sigma_l) \, \pi_l}, \tag{9}$$

where $z_i$ is a random variable that is never observed, which is therefore also called a hidden/latent variable.
- to get the cluster assignments $\{z_i\}_i$, we need to estimate the following parameters: $\pi_\alpha, \mu_\alpha, \Sigma_\alpha \;\; \forall \alpha$
- to estimate the parameters $\pi_\alpha, \mu_\alpha, \Sigma_\alpha \;\; \forall \alpha$ using the sample mean, sample covariance, and sample cluster probability estimates, we need the $z_i$'s

To resolve this issue, we start by fixing the parameters and computing the cluster assignments. Then we update the parameters based on the new clustering and alternate these steps until convergence, leading to the gmm clustering algorithm given in Algorithm 3. This algorithm is also called Expectation-Maximization (em) with two steps, an (E) step and an (M) step:
- the (E) step calculates the cluster membership probabilities by using the (E)xpectation of the log-likelihood evaluated using the current estimate for the parameters
- the (M) step estimates the parameters by (M)aximizing the expected log-likelihood

Algorithm 3 gmm Algorithm
  Initialize $\mu_1, \dots, \mu_k$ to be distinct random data points in $D$
  Initialize $\Sigma_1, \dots, \Sigma_k$ to $\sigma^2 I$ for any $\sigma > 0$
  Initialize $\pi_1, \dots, \pi_k$ to $\frac{1}{k}$ (uniform prior)
  repeat
    E-step: for all $i$ & $\alpha$ do
      $$[z_i]_\alpha = \frac{p(x_i \mid \mu_\alpha, \Sigma_\alpha) \, \pi_\alpha}{\sum_l p(x_i \mid \mu_l, \Sigma_l) \, \pi_l} \tag{10}$$
    M-step: for $\alpha = 1, \dots, k$ do
      $$\mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, x_i}{\sum_{i=1}^n [z_i]_\alpha} \tag{11}$$
      $$\Sigma_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, (x_i - \mu_\alpha)(x_i - \mu_\alpha)^\top}{\sum_{i=1}^n [z_i]_\alpha} \tag{12}$$
      $$\pi_\alpha = \frac{1}{n} \sum_{i=1}^n [z_i]_\alpha \tag{13}$$
  until convergence
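Algorithm 3 can be sketched in NumPy as below. Several choices here are assumptions rather than part of the notes: a fixed number of iterations stands in for a convergence test, a small ridge `reg` is added to each covariance for numerical stability, and densities are computed directly rather than in the log domain, so the sketch can underflow on data that lies very far from every component.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Mahalanobis terms via a linear solve instead of an explicit inverse.
    maha = np.einsum('ij,ji->i', diff, np.linalg.solve(Sigma, diff.T))
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def gmm_em(X, k, n_iter=100, sigma=1.0, reg=1e-6, seed=0):
    """EM for a Gaussian mixture (sketch of Algorithm 3)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)   # random data points
    Sigma = np.array([sigma ** 2 * np.eye(d) for _ in range(k)])  # sigma^2 * I
    pi = np.full(k, 1.0 / k)                                      # uniform prior
    for _ in range(n_iter):
        # E-step, Eq. (10): soft assignments R[i, a] = [z_i]_a.
        R = np.column_stack([pi[a] * gaussian_pdf(X, mu[a], Sigma[a])
                             for a in range(k)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step, Eqs. (11)-(13).
        Nk = R.sum(axis=0)
        mu = (R.T @ X) / Nk[:, None]
        for a in range(k):
            diff = X - mu[a]
            # reg * I is an added stabilizer (assumption, not in Eq. (12)).
            Sigma[a] = (R[:, a, None] * diff).T @ diff / Nk[a] + reg * np.eye(d)
        pi = Nk / n
    return pi, mu, Sigma, R  # R holds the assignments from the last E-step
```

Hard cluster labels, if needed, can be read off as `R.argmax(axis=1)`.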

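Notes:
- k-means is a special case with $\Sigma_\alpha = I \;\; \forall \alpha$ (together with uniform $\pi_\alpha$ and hard cluster assignments)
- gmm clustering is a latent variable model (lvm)
- em actually performs maximum likelihood estimation (mle) to estimate the parameters of a model that depends on unobserved/latent variables. The likelihood $\ell$ that is maximized is given as:

$$\ell(D) = p(D \mid \pi, \{\mu_\alpha, \Sigma_\alpha\}_{\alpha=1}^k) = \prod_{i=1}^n \sum_{\alpha=1}^k \pi_\alpha \, p(x_i \mid \mu_\alpha, \Sigma_\alpha) \tag{14}$$

- maximize the log-likelihood w.r.t. $\mu_\alpha$ to get

$$\mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, x_i}{\sum_{i=1}^n [z_i]_\alpha}$$

Note that intuitively $\mu_\alpha$ is the weighted average of the points in cluster $\alpha$. Mathematically, it is the mle solution that can be derived by taking the derivatives of the log-likelihood with respect to the $\mu_\alpha$'s and setting them to zero. $\Sigma_\alpha$ and $\pi_\alpha$ can be derived in the same way. See fcml for the full derivation!
- the probability density function (pdf) of the gmm is given by:

$$p(x \mid \pi, \{\mu_\alpha, \Sigma_\alpha\}_{\alpha=1}^k) = \sum_{\alpha=1}^k \pi_\alpha \, \mathcal{N}(x \mid \mu_\alpha, \Sigma_\alpha) \tag{15}$$

where the $\pi_\alpha$ are the mixing coefficients acting as the prior probability of each cluster, with $\pi_\alpha \in [0, 1]$ and $\sum_\alpha \pi_\alpha = 1$.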
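As a brief sketch of the derivation mentioned in the notes (the full version is in fcml), the update for $\mu_\alpha$ follows from differentiating the logarithm of Eq. (14). The only ingredient used beyond the notes is the standard gradient of a Gaussian density with respect to its mean, $\partial \mathcal{N}(x \mid \mu, \Sigma) / \partial \mu = \mathcal{N}(x \mid \mu, \Sigma) \, \Sigma^{-1} (x - \mu)$.

```latex
\begin{align*}
\log \ell(D) &= \sum_{i=1}^n \log \sum_{l=1}^k \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l) \\
\frac{\partial \log \ell(D)}{\partial \mu_\alpha}
  &= \sum_{i=1}^n
     \frac{\pi_\alpha \, \mathcal{N}(x_i \mid \mu_\alpha, \Sigma_\alpha)}
          {\sum_{l=1}^k \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}
     \, \Sigma_\alpha^{-1} (x_i - \mu_\alpha)
   = \sum_{i=1}^n [z_i]_\alpha \, \Sigma_\alpha^{-1} (x_i - \mu_\alpha) \\
0 &= \Sigma_\alpha^{-1} \sum_{i=1}^n [z_i]_\alpha \, (x_i - \mu_\alpha)
  \quad \Longrightarrow \quad
  \mu_\alpha = \frac{\sum_{i=1}^n [z_i]_\alpha \, x_i}{\sum_{i=1}^n [z_i]_\alpha}
\end{align*}
```

The fraction in the second line is exactly $[z_i]_\alpha$ from Eq. (9). Since $[z_i]_\alpha$ itself depends on $\mu_\alpha$, the last line is a fixed-point condition rather than a closed-form solution, which is why em alternates between recomputing the $[z_i]_\alpha$ and re-estimating the parameters.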
