Statistical Learning Dong Liu Dept. EEIS, USTC
Chapter 6. Unsupervised and Semi-Supervised Learning
1. Unsupervised learning
2. k-means
3. Gaussian mixture model
4. Other approaches to clustering
5. Principal component analysis
6. Other approaches to dimensionality reduction
7. Semi-supervised learning
Section 6.1 Unsupervised Learning
What is unsupervised learning?
Supervised learning aims to identify relations between data, to solve predictive tasks
Unsupervised learning aims to discover patterns in data, to solve descriptive tasks
Which patterns?
  Association analysis
  Clustering
  Anomaly detection
  Dimensionality reduction
  ...
Association analysis
Input: data with multiple attributes
Output: which attributes are associated, i.e. frequently co-occur?
Example (mining shopping carts): a binary table of goods (Coffee, Tea, Milk, Beer, Diaper, Aspirin, ...) by customers (A, B, C, D, E, ...), where an entry of 1 means the customer bought the item
Candidate associations: r(Milk, Coffee), r(Milk, Tea), r(Beer, Diaper), ...
Challenge: computational efficiency
Clustering
Input: data (with no class label)
Output: which data points are clustered together?
Examples of clustering
Market segmentation: divide customers into clusters
Document clustering: divide retrieved documents about Amazon into clusters
Image segmentation: divide pixels into clusters
Anomaly detection
Input: data (with no class label)
Output: which data are normal, and which are abnormal?
Example: detecting credit card fraud. Given credit card transactions, try to detect which are normal and which are fraudulent
Also useful in supervised learning: remove outliers to clean data
Dimensionality reduction
Input: high-dimensional data (with many attributes)
Output: low-dimensional representation (with few attributes)
Often used as a data preprocessing step
Techniques:
  Broad sense: feature extraction, feature selection, etc.
  Narrow sense: a transform that reduces dimension
Benefits:
  Reduce the computational cost
  Alleviate noise or irrelevant attributes
  Avoid the curse of dimensionality
Curse of dimensionality 1/2
Many statistical learning methods depend on a distance measure
Distances lose their discriminative power in high-dimensional space
Experiment: randomly drop several points in a hypercube in a high-dimensional space, measure the Euclidean distances between the points, and compare the maximum and minimum distances; the relative gap between them shrinks as the dimension grows
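The hypercube experiment can be sketched as follows. This is a minimal illustration, assuming uniformly distributed points; the function name `distance_contrast`, the number of points, and the chosen dimensions are illustrative choices, not taken from the slides.

```python
import numpy as np

def distance_contrast(dim, n_points=200, seed=0):
    """Return (min, max) pairwise Euclidean distance for uniform points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, dim))
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped against round-off.
    sq = np.sum(pts ** 2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T, 0.0, None)
    iu = np.triu_indices(n_points, k=1)  # each unordered pair once
    d = np.sqrt(d2[iu])
    return d.min(), d.max()

for dim in (2, 10, 100, 1000):
    d_min, d_max = distance_contrast(dim)
    print(f"D={dim:5d}  min={d_min:.3f}  max={d_max:.3f}  max/min={d_max / d_min:.2f}")
```

The max/min ratio is large in low dimension and moves toward 1 as the dimension grows, which is why nearest-neighbor style reasoning degrades.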
Curse of dimensionality 2/2
Nearest neighbors are all far away
For a point in a $D$-dimensional space, its nearest neighbors (within distance $r$) are distributed in a spherical shell (between distance $\alpha r$ and $r$, $0 < \alpha < 1$) with probability $1 - \alpha^D \to 1$
Why? Because the volume of high-dimensional space is huge, so samples are very sparse in it
Dimensionality reduction can find a low-dimensional structure (manifold) in high-dimensional space, and unfold the inherent structure
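A small numeric check of the shell argument, assuming the neighbors are uniformly distributed in the ball of radius $r$: since volume scales as $r^D$, the fraction of the ball lying between radius $\alpha r$ and $r$ is $1 - \alpha^D$. The function name `shell_fraction` and the value $\alpha = 0.99$ are illustrative.

```python
def shell_fraction(alpha, dim):
    """Fraction of a dim-dimensional ball's volume lying outside radius alpha * r."""
    return 1.0 - alpha ** dim

# Even a very thin shell (alpha = 0.99) holds almost all the volume for large D.
for dim in (1, 2, 10, 100, 1000):
    print(dim, shell_fraction(0.99, dim))
```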
Section 6.2 k-means
Prototype-based clustering
Each cluster has a prototype
Distance to the prototype decides cluster membership
k-means is the best-known representative
k-means algorithm
Input: dataset $\{x_1, \dots, x_N\}$, number of clusters $k$
Output: clusters $q(x_i) \in \{1, \dots, k\}$
1: Initialize $k$ centroids $\{c_1, \dots, c_k\}$
2: repeat
3:   for $i = 1, \dots, N$ do
4:     $q(x_i) \leftarrow \arg\min_j \|x_i - c_j\|$
5:   end for
6:   for $j = 1, \dots, k$ do
7:     $c_j \leftarrow \mathrm{mean}(x_i \mid q(x_i) = j)$
8:   end for
9: until centroids do not change
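The algorithm above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the initialization (random distinct data points), the handling of empty clusters, the iteration cap `n_iter`, and the demo data are assumptions of this sketch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate assignment and centroid update; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids (assumed strategy).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Steps 3-5: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 6-8: move each centroid to the mean of its points;
        # an empty cluster keeps its old centroid (an assumption of this sketch).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 9: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Demo on two synthetic, well-separated blobs.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans(X_demo, k=2)
print(centroids_demo)
```

Because the result depends on the initial centroids, a common practice is to call the function with several seeds and keep the run with the smallest sum of squared distances.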
Illustration of k-means 1/4
Initialize $k$ centroids
Usually we select centroids from the dataset
We will see that initial centroids are crucial for the final results
Is it wise to select scattered initial centroids?
Illustration of k-means 2/4
Assign data points to clusters
Complexity $O(kN)$
Here we measure Euclidean distance; can another distance metric be used?
Illustration of k-means 3/4
Update cluster centroids
Complexity $O(N + k)$
Here we use the (arithmetic) mean; can another method be used?
Illustration of k-means 4/4
Repeat until the centroids do not change
Will k-means converge? Usually it does
Interpretation of k-means
k-means was first proposed for vector quantization, an extension of scalar quantization to Euclidean space
Centroids are known as codewords, which constitute a codebook
k-means solves $\min_{q, \{c_j\}} \sum_i \|x_i - c_{q(x_i)}\|^2$
Heuristically, k-means updates $q$ and $\{c_j\}$ alternately; it is greedy and cannot ensure a (global) optimum
k-means finds a local minimum
(Figure: two runs, Case 1 and Case 2)
In Case 2, we select initial centroids to be more scattered, but the result is worse
In practice, we often run k-means multiple times with different initializations, and choose the best result
Limitation of k-means: Outliers
Limitation of k-means: Different sized clusters
Left: Ideal clusters; Right: k-means results
Limitation of k-means: Clusters with different densities
Left: Ideal clusters; Right: k-means results
Limitation of k-means: Clusters having irregular shapes
Left: Ideal clusters; Right: k-means results
A remedy for k-means: Over-segmentation and post-processing
Left: For clusters with different densities; Right: For clusters having irregular shapes
Section 6.3 Gaussian mixture model
Distribution-based clustering
Each cluster corresponds to a (single-modal) distribution
Calculate posterior probabilities to decide clusters
The Gaussian mixture model is the best-known representative
Gaussian mixture model
A mixture model is a combination of multiple single-modal distributions
In a Gaussian mixture model (GMM), each component is a Gaussian
One-dimensional case: $p(x) = \sum_{j=1}^{k} w_j \mathcal{N}(x \mid \mu_j, \sigma_j^2)$, where $\sum_j w_j = 1$
Multi-dimensional case: $p(x) = \sum_{j=1}^{k} w_j \mathcal{N}(x \mid \mu_j, \Sigma_j)$
GMM for clustering
Assume we know the parameters of the GMM; then we can calculate the posterior (responsibility) as
$p(q(x_i) = j) = \gamma_{ij} = \frac{w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{j'=1}^{k} w_{j'} \mathcal{N}(x_i \mid \mu_{j'}, \Sigma_{j'})}$
Then we have $q(x_i) = \arg\max_j \gamma_{ij}$
Now, the problem is how to estimate the parameters of the GMM, $\vartheta = \{w_j, \mu_j, \Sigma_j \mid j = 1, \dots, k\}$
We consider maximum likelihood estimation: $\hat\vartheta = \arg\max_\vartheta \prod_{i=1}^{N} p(x_i \mid \vartheta)$
Intuitive solution 1/3
We want to maximize $\prod_{i=1}^{N} \sum_{j=1}^{k} w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult
Can we borrow the idea of k-means?
First, we initialize the parameters
Intuitive solution 2/3
We want to maximize $\prod_{i=1}^{N} \sum_{j=1}^{k} w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult
Can we borrow the idea of k-means?
Second, we calculate responsibilities ("assign data to clusters")
Intuitive solution 3/3
We want to maximize $\prod_{i=1}^{N} \sum_{j=1}^{k} w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, which is quite difficult
Can we borrow the idea of k-means?
Third, we update parameters. Note that $p(q(x_i) = j) = \gamma_{ij}$, so it is intuitive that
$w_j = \frac{\sum_i \gamma_{ij}}{N}$
$\mu_j = \frac{\sum_i \gamma_{ij} x_i}{\sum_i \gamma_{ij}}$
$\Sigma_j = \frac{\sum_i \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_i \gamma_{ij}}$
The above two steps (calculating responsibilities and updating parameters) are executed alternately, until convergence
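The three steps (initialize, calculate responsibilities, update parameters) can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: the initial means `mu0` are supplied by the caller, initial covariances are identity matrices, a fixed number of iterations stands in for a convergence check, and the small ridge `reg` added to each covariance is a numerical safeguard that is not part of the slides.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density N(x | mu, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T           # cov^{-1} (x - mu)
    maha = np.sum(diff * sol, axis=1)              # squared Mahalanobis distances
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_em(X, mu0, n_iter=50, reg=1e-6):
    """Alternate responsibilities and parameter updates; returns (w, mu, cov, gamma)."""
    n, d = X.shape
    mu = np.array(mu0, dtype=float)
    k = len(mu)
    # Initialization (assumed): identity covariances and equal weights.
    cov = np.array([np.eye(d) for _ in range(k)])
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # Responsibilities: gamma_ij = w_j N(x_i | mu_j, Sigma_j) / sum over components.
        dens = np.column_stack([w[j] * gaussian_pdf(X, mu[j], cov[j]) for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Parameter updates, following the three formulas on the slide.
        nj = gamma.sum(axis=0)
        w = nj / n
        mu = (gamma.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (gamma[:, j, None] * diff).T @ diff / nj[j] + reg * np.eye(d)
    return w, mu, cov, gamma

# Demo on two synthetic blobs with rough initial means.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
w_demo, mu_demo, cov_demo, gamma_demo = gmm_em(X_demo, mu0=[[1.0, 1.0], [4.0, 4.0]])
print(w_demo)
print(mu_demo)
```

Hard cluster labels follow from `gamma_demo.argmax(axis=1)`. The returned responsibilities were computed from the parameters before the final update, which is harmless once the iterations have settled.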
Example of GMM result
k-means is a special case of GMM
In this special case, we set $w_j \equiv \frac{1}{k}$, $\Sigma_j \equiv \alpha I$, and we quantize $\gamma_{ij}$ to 0 or 1
Initialize parameters = initialize the $\mu_j$'s
Calculate responsibilities = find the nearest $\mu_j$ and set that $\gamma_{ij} = 1$
Update parameters = update the $\mu_j$'s as means
This explains why k-means has limitations for clusters having different sizes, different densities, or irregular shapes
Example of comparison between GMM and k-means
Interpretation: Expectation maximization
Introduce latent variables $z_i \in \{1, \dots, k\}$, representing the true cluster that generates $x_i$
Consider maximizing $\prod_i p(x_i, z_i \mid \vartheta) = \prod_i \prod_j \left(w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)\right)^{I(z_i = j)}$, or equivalently $\sum_i \sum_j I(z_i = j) \log\left(w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)\right)$
Expectation maximization (EM) executes two steps alternately:
1. E-step: Given $\vartheta^t$, calculate the expectation of the objective function, eliminating the latent variables. Note that $\gamma_{ij}$ is the expectation of $I(z_i = j)$, so we have $\sum_i \sum_j \gamma_{ij} \log\left(w_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)\right)$
2. M-step: Maximize the expectation of the objective function to find $\vartheta^{t+1}$. Note that $\sum_j w_j = 1$, so we can derive the update equations of GMM
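EM as an algorithm
Input: $p(X, Z \mid \vartheta)$, $X = \{x_1, \dots, x_N\}$; $Z$ is unobserved
Output: $\hat\vartheta$
1: $t \leftarrow 0$, initialize $\vartheta^0$
2: repeat
3:   Given $\vartheta^t$, calculate the expectation of $\log p(X, Z \mid \vartheta)$, eliminating $Z$; denote the expectation by $Q(\vartheta, \vartheta^t)$
4:   $\vartheta^{t+1} \leftarrow \arg\max_\vartheta Q(\vartheta, \vartheta^t)$
5: until $\vartheta^{t+1}$ is similar to $\vartheta^t$
6: $\hat\vartheta = \vartheta^{t+1}$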
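Is EM correct?
Consider
$\log p(X \mid \vartheta) = \sum_Z p(Z \mid X, \vartheta^t) \log p(X \mid \vartheta)$
$= \sum_Z p(Z \mid X, \vartheta^t) \log p(X, Z \mid \vartheta) - \sum_Z p(Z \mid X, \vartheta^t) \log p(Z \mid X, \vartheta)$
$= Q(\vartheta, \vartheta^t) + H(\vartheta, \vartheta^t)$
The first term is exactly $Q(\vartheta, \vartheta^t)$. Since $\vartheta^{t+1} \leftarrow \arg\max_\vartheta Q(\vartheta, \vartheta^t)$, we know $Q(\vartheta^{t+1}, \vartheta^t) \ge Q(\vartheta^t, \vartheta^t)$
The second term $H(\vartheta, \vartheta^t)$ is the cross-entropy between $p(Z \mid X, \vartheta^t)$ and $p(Z \mid X, \vartheta)$, so for any $\vartheta$, $H(\vartheta, \vartheta^t) \ge H(\vartheta^t, \vartheta^t)$ (the latter is the entropy, and the difference is the K-L divergence)
Thus, EM ensures that $\log p(X \mid \vartheta^{t+1}) \ge \log p(X \mid \vartheta^t)$
This is a greedy algorithm to maximize the likelihood: the likelihood never decreases, so it converges whenever the likelihood is bounded above, but EM cannot ensure a global optimum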
Variants of EM
We may try different initial values to escape a local optimum
We need not maximize $Q(\vartheta, \vartheta^t)$ exactly; having $Q(\vartheta^{t+1}, \vartheta^t) \ge Q(\vartheta^t, \vartheta^t)$ is enough (e.g. by gradient ascent). This is especially helpful if the maximization problem has no closed-form solution
Section 6.4 Other approaches to clustering
Density-based clustering
Clusters correspond to high-density regions, while low-density regions separate clusters
Noisy data and outliers are excluded
Mean-shift and DBSCAN are the best-known representatives
Mean-shift
Mean-shift is an iterative algorithm for locating modes (i.e. local maxima of density), where density is estimated using a Parzen window (non-parametric)
Initialize a mode $x^0$, and iteratively refine it
Given $x^t$, local density is estimated by $\hat p(x) = K(x - x^t)$, where $K(\cdot)$ is a kernel function
Thus the local mean is $m(x^t) = \frac{\sum_{x_i \in N(x^t)} x_i K(x_i - x^t)}{\sum_{x_i \in N(x^t)} K(x_i - x^t)}$; let $x^{t+1} \leftarrow m(x^t)$
Then find the next mode...
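A minimal sketch of the mode-refinement loop, assuming a Gaussian kernel and a neighborhood $N(x^t)$ defined as all points within three bandwidths of the current estimate. The function name `mean_shift_mode`, the `bandwidth` parameter, and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def mean_shift_mode(X, x0, bandwidth=1.0, n_iter=100, tol=1e-6):
    """Refine a starting point x0 toward a density mode using a Gaussian kernel."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Neighborhood N(x^t): points within 3 bandwidths (an assumption of this sketch).
        dist2 = np.sum((X - x) ** 2, axis=1)
        mask = dist2 <= (3.0 * bandwidth) ** 2
        if not mask.any():
            break
        weights = np.exp(-0.5 * dist2[mask] / bandwidth ** 2)  # K(x_i - x^t)
        x_new = weights @ X[mask] / weights.sum()              # local mean m(x^t)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Demo: two synthetic blobs, one starting point near each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mode_a = mean_shift_mode(X_demo, x0=[1.0, 1.0])
mode_b = mean_shift_mode(X_demo, x0=[4.0, 4.0])
print(mode_a, mode_b)
```

To cluster a dataset, one would typically start the refinement from every data point and group points whose refined modes coincide.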
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) seeks clusters as high-density regions
For each point, find its neighbors; if the number of neighboring points is less than a threshold, this point is noise
Otherwise, let this point and its neighbors belong to a cluster, and expand this cluster until reaching a low-density region
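The procedure can be sketched as follows. This is a simplified illustration: the parameter names `eps` (neighborhood radius) and `min_pts` (the threshold) are conventional but are assumptions here, each point counts as its own neighbor, and the full distance matrix is computed, which is only practical for small datasets.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return labels: -1 for noise, 0..C-1 for clusters."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]  # includes i itself
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # provisionally noise; may be claimed later as a border point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # unlabeled point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_pts:  # dense point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return labels

# Demo: two synthetic blobs plus one far-away outlier.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(5.0, 0.5, (50, 2)),
                    [[20.0, 20.0]]])
labels_demo = dbscan(X_demo, eps=1.0, min_pts=5)
print(labels_demo)
```

Unlike k-means, the number of clusters is not an input; it follows from the density threshold.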
Connectivity-based clustering
Use a graph to represent data points and their relations
Find clusters as subgraphs that are highly related
Also called graph-based clustering
Agglomerative clustering
Bottom-up strategy: combine points into clusters progressively
Divisive clustering
Also called highly connected subgraphs (HCS) clustering
Top-down strategy: split a graph into two subgraphs by finding the minimum cut, and split the subgraphs further, until a subgraph is highly connected
To evaluate whether a subgraph is highly connected or not, we use the ratio $\frac{2 n_e}{n_v (n_v - 1)}$, where $n_e$ is the number of edges and $n_v$ the number of vertices
Section 6.5 Principal component analysis (PCA)
Projection for dimensionality reduction
Given $x \in \mathbb{R}^D$
We want to find a projection matrix $P \in \mathbb{R}^{K \times D}$, where $K < D$
Then we have the dimensionality-reduced $y = Px$, $y \in \mathbb{R}^K$
This is linear dimensionality reduction, and the problem is how to find $P$
Principal component analysis (PCA) is the most widely used method
Motivation of PCA
PCA seeks a projection that preserves the data's Euclidean distances as much as possible
Note that $\sum_i \sum_j \|x_i - x_j\|^2$ is related to the data variance: it equals $2N \sum_i \|x_i - \bar{x}\|^2$
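PCA 1/3
Step 1: subtract the mean $\bar{x} = \frac{\sum_i x_i}{N}$
Let $X = \begin{bmatrix} (x_1 - \bar{x})^T \\ \vdots \\ (x_N - \bar{x})^T \end{bmatrix}$, then $X^T X$ is the data covariance matrix (up to a factor of $1/N$)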
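PCA 2/3
Step 2: rotate $x_i - \bar{x}$
Let $C = X^T X$ and calculate its eigenvalues and eigenvectors, $C u_i = \lambda_i u_i$; then we have an orthonormal matrix $U = [u_1, \dots, u_D]$, and $CU = U\Lambda$ where $\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_D\}$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D \ge 0$
Let $\tilde{X} = XU$, so $\tilde{X}^T \tilde{X} = U^T C U = \Lambda$ becomes a diagonal matrix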
PCA 3/3
Step 3: select the $K$ largest entries from $\{\lambda_1, \dots, \lambda_D\}$, and take the corresponding columns of $U$ as the rows of $P$, i.e. $P = [u_1, \dots, u_K]^T$
Let $y_i = P(x_i - \bar{x})$; this completes the dimensionality reduction
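The three steps can be sketched in NumPy as follows. This is a minimal illustration: the function name `pca`, its return values, and the demo data are assumptions; `numpy.linalg.eigh` is used because $C$ is symmetric, and it returns eigenvalues in ascending order, hence the explicit re-sorting.

```python
import numpy as np

def pca(X, K):
    """Return (Y, P, mean): projected data, K x D projection matrix, data mean."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # Step 1: subtract the mean
    C = Xc.T @ Xc                               # scatter matrix (covariance up to 1/N)
    eigvals, eigvecs = np.linalg.eigh(C)        # Step 2: eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:K]       # Step 3: indices of the K largest eigenvalues
    P = eigvecs[:, order].T                     # rows of P are the selected eigenvectors
    Y = Xc @ P.T                                # y_i = P (x_i - mean), one row per sample
    return Y, P, mean

# Demo: 2-D points lying close to the line x1 = x2, reduced to one dimension.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_demo = np.column_stack([t, t]) + 0.05 * rng.normal(size=(200, 2))
Y_demo, P_demo, mean_demo = pca(X_demo, K=1)
print(P_demo)  # close to +/- [0.707, 0.707]
```

An approximate reconstruction of the inputs is `Y_demo @ P_demo + mean_demo`.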
Example: Eigenface 1/2
When PCA is applied to face data, the calculated eigenvectors are termed eigenfaces
Note that we usually reshape an image into a vector; otherwise we need 2D-PCA
Example: Eigenface 2/2
Using eigenfaces, for a new input face, we can perform dimensionality reduction $y_{N+1} = P x_{N+1}$ (we omit the mean subtraction step)
Note that the rows of $P$ are orthonormal ($P P^T = I$), so $P^T P$ is an orthogonal projection and we have $x_{N+1} \approx P^T y_{N+1}$, which is a decomposition over eigenfaces
It is also a commonly used step for feature extraction
Kernel PCA 1/3
PCA finds a linear projection; how do we deal with nonlinearity?
Consider using basis functions $\phi_i = \phi(x_i)$; then the data covariance matrix is $\Phi^T \Phi$ where $\Phi = \begin{bmatrix} \phi_1^T \\ \vdots \\ \phi_N^T \end{bmatrix}$ (here we assume $\sum_i \phi_i = 0$)
We may also use the kernel trick, where we have $k(x, y) = \phi(x) \cdot \phi(y)$ but we do not use $\phi(\cdot)$ explicitly
Note that we can calculate
$K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix} = \Phi \Phi^T$
And we can calculate eigenvalues/eigenvectors for $K$: $K u_i = \lambda_i u_i$
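Kernel PCA 2/3
Note that $\Phi^T \Phi \Phi^T u_i = \Phi^T K u_i = \lambda_i \Phi^T u_i$, which means $\Phi^T u_i$ is an eigenvector of $\Phi^T \Phi$. Here, we cannot ensure $\Phi^T u_i$ is a unit vector, so we normalize it:
$u_i' = \frac{\Phi^T u_i}{\|\Phi^T u_i\|} = \frac{\Phi^T u_i}{\sqrt{u_i^T \Phi \Phi^T u_i}} = \frac{\Phi^T u_i}{\sqrt{\lambda_i}}$
Then let $P = \begin{bmatrix} (u_1')^T \\ \vdots \\ (u_K')^T \end{bmatrix}$, where the corresponding $K$ eigenvalues are the largest
And the projection is
$y_i = P \phi_i = \begin{bmatrix} u_1^T / \sqrt{\lambda_1} \\ \vdots \\ u_K^T / \sqrt{\lambda_K} \end{bmatrix} \begin{bmatrix} k(x_1, x_i) \\ \vdots \\ k(x_N, x_i) \end{bmatrix}$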
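Kernel PCA 3/3
Previously we assumed $\sum_i \phi_i = 0$, but this is not satisfied by an arbitrary kernel, so we have to centralize the kernel
Note that $k_{ij} = \phi_i \cdot \phi_j$; now we want to calculate $\tilde{k}_{ij} = (\phi_i - \bar\phi) \cdot (\phi_j - \bar\phi)$ where $\bar\phi = \frac{\sum_i \phi_i}{N}$
$\tilde{k}_{ij} = k_{ij} - \frac{1}{N} \sum_{l} k_{lj} - \frac{1}{N} \sum_{l} k_{il} + \frac{1}{N^2} \sum_{l} \sum_{m} k_{lm}$
Replace $K$ with $\tilde{K}$ in the previous steps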
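Putting the three kernel PCA slides together, a minimal NumPy sketch might look as follows. The function name `kernel_pca` and the demo data are assumptions; the sketch only projects the training points, and it assumes the selected eigenvalues are strictly positive so that the division by $\sqrt{\lambda_i}$ is defined.

```python
import numpy as np

def kernel_pca(X, K, kernel):
    """Project the training points onto the top-K kernel principal components."""
    N = len(X)
    Kmat = np.array([[kernel(a, b) for b in X] for a in X])   # kernel (Gram) matrix
    # Centralize the kernel: K~ = K - 1K - K1 + 1K1, with 1 the N x N matrix of entries 1/N.
    one = np.full((N, N), 1.0 / N)
    Kc = Kmat - one @ Kmat - Kmat @ one + one @ Kmat @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)                      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:K]                      # keep the K largest
    lam, U = eigvals[order], eigvecs[:, order]
    # Row i holds y_i: component c is u_c^T k_i / sqrt(lambda_c),
    # where k_i is column i of the centralized kernel matrix.
    return (Kc @ U) / np.sqrt(lam)

def poly_kernel(a, b):
    """Polynomial kernel k(x, y) = (1 + x . y)^2."""
    return (1.0 + a @ b) ** 2

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
Y_demo = kernel_pca(X_demo, K=2, kernel=poly_kernel)
print(Y_demo[:3])
```

With the linear kernel $k(x, y) = x \cdot y$, the result coincides with ordinary PCA up to the sign of each component.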
Example of kernel PCA
Left: input points in 2-D plane; Right: projected points using $k(x, y) = (1 + x \cdot y)^2$
Section 6.6 Other approaches to dimensionality reduction
Nonlinear dimensionality reduction
Kernel PCA
ISOMAP
Locally linear embedding (LLE)
Self-organizing map (Chap. 10)
Autoencoder (Chap. 10)
Laplacian eigenmap
t-distributed stochastic neighbor embedding (t-SNE)
...
Manifold learning
A manifold is a topological space that locally resembles Euclidean space near each point
  1-D manifolds include lines and circles, but not the figure-eight curve "8"
  2-D manifolds are also called surfaces, such as spheres
The intrinsic dimension of a manifold can be lower than that of the space it resides in
Manifold learning aims to identify such low-dimensional structure from high-dimensional data
For manifold learning, Euclidean distance is not appropriate, and is replaced by geodesic distance
ISOMAP
While PCA seeks to preserve the data's Euclidean distances as much as possible, ISOMAP seeks to preserve the data's geodesic distances as much as possible
In ISOMAP, geodesic distance is defined as the shortest distance on a graph; the graph is constructed from the nearest neighbors of each point
As we have $d_{ij} = d(x_i, x_j)$, we seek
$\arg\min_{y_1, \dots, y_N} \sum_i \sum_{j \ne i} \left(\|y_i - y_j\| - d_{ij}\right)^2$
which is solved by multi-dimensional scaling (MDS)
Example of MDS
MDS seeks an appropriate coordinate system for a given set of pairwise distances
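As an illustration, here is a sketch of classical (Torgerson) MDS, which recovers coordinates from a distance matrix by an eigendecomposition. Note the swap: classical MDS minimizes a slightly different criterion (strain) than the squared-distance-error objective stated for ISOMAP, but it reproduces the distances exactly when they are Euclidean. The function name `classical_mds` and the demo data are assumptions.

```python
import numpy as np

def classical_mds(D, K):
    """Embed N points in R^K from an N x N matrix of pairwise distances D."""
    N = len(D)
    J = np.eye(N) - np.full((N, N), 1.0 / N)    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # Gram matrix of the centered coordinates
    eigvals, eigvecs = np.linalg.eigh(B)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:K]       # keep the K largest
    lam = np.clip(eigvals[order], 0.0, None)    # guard against small negative values
    return eigvecs[:, order] * np.sqrt(lam)

# Demo: recover a 2-D configuration (up to rotation/reflection) from its distances.
rng = np.random.default_rng(0)
pts = rng.normal(size=(15, 2))
D_demo = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
Y_demo = classical_mds(D_demo, K=2)
D_rec = np.linalg.norm(Y_demo[:, None, :] - Y_demo[None, :, :], axis=2)
print(np.abs(D_rec - D_demo).max())
```

For ISOMAP, `D` would hold graph shortest-path distances instead of Euclidean ones.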
Locally linear embedding
Locally linear embedding (LLE) seeks to preserve locally linear relations as much as possible
In LLE, first, a locally linear relation matrix $W$ is found:
$\arg\min_W \sum_i \left\| x_i - \sum_{x_j \in N(x_i)} w_{ij} x_j \right\|^2$, s.t. $\sum_j w_{ij} = 1$
If $x_j \notin N(x_i)$, then $w_{ij} \equiv 0$
Second, a low-dimensional embedding is found:
$\arg\min_{y_1, \dots, y_N} \sum_i \left\| y_i - \sum_j w_{ij} y_j \right\|^2$
Section 6.7 Semi-supervised learning
Supervised versus unsupervised learning
For example, consider classification versus clustering
Classification
  Is good at predicting the true class
  Requires many labeled data for training
Clustering
  Is able to find clusters, but these are not necessarily equal to classes
  Requires no labels, just data
Can we combine supervised and unsupervised learning?
Motivation of semi-supervised learning
One practical difficulty for supervised learning is the lack of accurate labels
Semi-supervised learning tries to use unlabeled data in addition to labeled data
This includes weakly-supervised learning, where we have labels but the labels are inaccurate or noisy
This also includes transductive learning, where we do not build a model but just want predictions for the unlabeled data
Comparison:
  Supervised: data = labeled data; objective = a model, $\hat{y} = f(x)$ or $q(y \mid x)$
  Semi-supervised (inductive): data = both labeled and unlabeled data; objective = a model, $\hat{y} = f(x)$ or $q(y \mid x)$
  Transductive: data = both labeled and unlabeled data; objective = predictions for unlabeled data
Examples of semi-supervised learning
Image classification with unlabeled images
Image classification from click-through data from a search engine
Image segmentation with only bounding-box labels
Anomaly detection with limited labels
...
Why does semi-supervised learning work?
For generative methods
  Labeled data provide information about $p(x, y)$ and unlabeled data provide additional information about $p(x)$; the latter is helpful for estimating $p(x, y)$
For discriminative methods
  1. Cluster assumption: If two points belong to the same cluster, they are likely to belong to the same class
  2. Density assumption: The decision boundary should lie in low-density regions that separate high-density regions
A generative method for semi-supervised classification
For labeled data, we know $(x_i, y_i)$, $i = 1, \dots, N$. For unlabeled data, we know $x_j$, $j = 1, \dots, M$
We consider a generative model for the labeled data: $p(x_i, y_i) = p(y_i) p(x_i \mid y_i)$, and assume a mixture model for the unlabeled data: $p(x_j) = \sum_{y_j} p(y_j) p(x_j \mid y_j)$
We further parameterize the probabilities as $p(y) = p(y \mid \vartheta_0)$, $p(x \mid y) = p(x \mid y, \vartheta_1)$
Then we can maximize the log-likelihood:
$\sum_i \log\left(p(y_i \mid \vartheta_0) p(x_i \mid y_i, \vartheta_1)\right) + \sum_j \log\left(\sum_{y_j} p(y_j \mid \vartheta_0) p(x_j \mid y_j, \vartheta_1)\right)$
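EM for semi-supervised classification
Consider the $y_j$'s as latent variables; the expectation of $\log p(\{x_i, y_i\}, \{x_j, y_j\} \mid \vartheta)$ with respect to $p(\{y_j\} \mid \{x_j\}, \vartheta^t)$ is
$Q(\vartheta, \vartheta^t) = \sum_i \log\left(p(y_i \mid \vartheta_0) p(x_i \mid y_i, \vartheta_1)\right) + \sum_j \sum_k \gamma_{jk}^t \log\left(p(y_j = k \mid \vartheta_0) p(x_j \mid y_j = k, \vartheta_1)\right)$
where $\gamma_{jk}^t = p(y_j = k \mid x_j, \vartheta^t) \propto p(y_j = k \mid \vartheta_0^t) p(x_j \mid y_j = k, \vartheta_1^t)$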
Semi-supervised SVM 1/2
Recall the (supervised) SVM
$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$
s.t. $\forall i$, $\xi_i \ge 0$, $y_i (w^T x_i + b) \ge 1 - \xi_i$
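Semi-supervised SVM 2/2
For the semi-supervised SVM, also called the transductive SVM
$\min_{w, b, \xi, y, \zeta} \left( \frac{1}{2} \|w\|^2 + C \sum_i \xi_i + C \sum_j \zeta_j \right)$
s.t. $\forall i$, $\xi_i \ge 0$, $y_i (w^T x_i + b) \ge 1 - \xi_i$
$\forall j$, $y_j \in \{+1, -1\}$, $\zeta_j \ge 0$, $y_j (w^T x_j + b) \ge 1 - \zeta_j$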
A graph-based method for transductive classification
Input: $(x_i, y_i)$, $i = 1, \dots, N$; $x_j$, $j = 1, \dots, M$
Output: $\hat{y}_j$
1: Construct a graph whose vertices are the $x_i$ and $x_j$
2: In the graph, if $j' \in N(i')$, construct an edge from $i'$ to $j'$ and let the edge weight be $w_{i'j'} = \frac{1}{d(x_{i'}, x_{j'})}$
3: $t \leftarrow 0$, initialize $p_{i'}^0$ (class probability vectors)
4: repeat
5:   $\forall j$, $p_j^{t+1} \leftarrow 0$
6:   $\forall i' \in \{1, \dots, M, M+1, \dots, M+N\}$, $\forall j$, if $w_{i'j} \ne 0$, $p_j^{t+1} \leftarrow p_j^{t+1} + \frac{w_{i'j}}{\sum_k w_{i'k}} p_{i'}^t$
7:   $\forall j$, normalize $p_j^{t+1}$ to be a unit vector
8: until convergence
9: $\hat{y}_j = \arg\max p_j^{t+1}$
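A minimal NumPy sketch of this propagation scheme is given below, under several assumptions that go beyond the slide: the neighborhood is the set of `n_neighbors` nearest points, the graph is symmetrized so that every vertex can receive information, labeled vertices keep their one-hot vectors throughout, "unit vector" is interpreted as normalizing each vector to sum to one, and a fixed number of iterations replaces the convergence test. The function name `propagate_labels` is hypothetical.

```python
import numpy as np

def propagate_labels(X_lab, y_lab, X_unl, n_classes, n_neighbors=5, n_iter=50):
    """Return predicted class indices for the unlabeled points."""
    X = np.vstack([X_lab, X_unl])
    n_lab, n = len(X_lab), len(X_lab) + len(X_unl)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Edge weight 1 / d(x_i, x_j) if j is among the nearest neighbors of i.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:n_neighbors + 1]   # position 0 is i itself
        W[i, nbrs] = 1.0 / np.maximum(dists[i, nbrs], 1e-12)
    W = np.maximum(W, W.T)                               # symmetrize (assumption)
    T = W / W.sum(axis=1, keepdims=True)                 # w_ij / sum_k w_ik
    # Initial class probability vectors: one-hot for labeled, uniform for unlabeled.
    onehot = np.eye(n_classes)[y_lab]
    P = np.full((n, n_classes), 1.0 / n_classes)
    P[:n_lab] = onehot
    for _ in range(n_iter):
        P_new = T.T @ P                                  # p_j <- sum_i (w_ij / sum_k w_ik) p_i
        P_new[:n_lab] = onehot                           # labeled vertices stay fixed (assumption)
        s = P_new.sum(axis=1, keepdims=True)
        P = np.where(s > 0, P_new / np.where(s > 0, s, 1.0), 1.0 / n_classes)
    return P[n_lab:].argmax(axis=1)

# Demo: two synthetic blobs with three labeled points each.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, (30, 2))
B = rng.normal(5.0, 0.5, (30, 2))
X_lab_demo = np.vstack([A[:3], B[:3]])
y_lab_demo = np.array([0, 0, 0, 1, 1, 1])
X_unl_demo = np.vstack([A[3:], B[3:]])
pred_demo = propagate_labels(X_lab_demo, y_lab_demo, X_unl_demo, n_classes=2, n_neighbors=10)
print(pred_demo)
```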
Random walk and PageRank
The loop is actually a random walk
For example, PageRank is an unsupervised method based on random walk to determine the relative importance of webpages
Input: A graph of webpages and hyperlinks
Output: Relative importance of all webpages
1: $t \leftarrow 0$
2: Initialize $r_i^0$ (importance values)
3: repeat
4:   $\forall j$, $r_j^{t+1} \leftarrow 0$
5:   $\forall i$, $\forall j$, if $w_{ij} \ne 0$, $r_j^{t+1} \leftarrow r_j^{t+1} + \frac{w_{ij}}{\sum_k w_{ik}} r_i^t$
6:   $\forall j$, $r_j^{t+1} \leftarrow \beta r_j^{t+1} + \frac{1 - \beta}{N}$
7: until convergence
$\beta$ is called the damping factor, used to avoid traps
(Figure: an example of a trap)
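The PageRank loop can be sketched in NumPy as follows. This is a minimal illustration: the uniform initialization, the value $\beta = 0.85$, the treatment of pages without outgoing links (they are assumed to link to every page), and the stopping tolerance are assumptions of this sketch, not taken from the slides.

```python
import numpy as np

def pagerank(W, beta=0.85, n_iter=200, tol=1e-12):
    """W[i, j] > 0 means page i links to page j. Returns importance values summing to 1."""
    W = np.asarray(W, dtype=float)
    N = len(W)
    out = W.sum(axis=1, keepdims=True)
    # Pages without outgoing links are treated as linking to every page (assumption).
    T = np.where(out > 0, W / np.where(out > 0, out, 1.0), 1.0 / N)
    r = np.full(N, 1.0 / N)                               # step 2: uniform initial importance
    for _ in range(n_iter):
        # Steps 4-6: push importance along links, then apply the damping factor.
        r_new = beta * (T.T @ r) + (1.0 - beta) / N
        if np.abs(r_new - r).sum() < tol:                 # step 7: convergence check
            r = r_new
            break
        r = r_new
    return r

# Demo: pages 1 and 2 both link to page 0; page 0 links to page 1.
W_demo = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [1, 0, 0]])
r_demo = pagerank(W_demo)
print(r_demo)
```

Page 2 has no incoming links, so its importance comes only from the $(1 - \beta)/N$ term.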
In this chapter
Dictionary
  Clustering
  Curse of dimensionality
  Dimensionality reduction
  Manifold learning
  Semi-supervised learning
  Transductive learning
  Unsupervised learning
Toolbox
  Expectation maximization
  Gaussian mixture model
  k-means
  Principal component analysis, kernel PCA
  Random walk
  Transductive support vector machine