Statistical Learning. Dong Liu. Dept. EEIS, USTC
1 Statistical Learning Dong Liu Dept. EEIS, USTC
2 Chapter 6. Unsupervised and Semi-Supervised Learning
1. Unsupervised learning
2. k-means
3. Gaussian mixture model
4. Other approaches to clustering
5. Principal component analysis
6. Other approaches to dimensionality reduction
7. Semi-supervised learning
3 Section 6.1 Unsupervised Learning
4 What is unsupervised learning? Supervised learning aims to identify relations between data, to solve predictive tasks. Unsupervised learning aims to discover patterns in data, to solve descriptive tasks. Which patterns? Association analysis, clustering, anomaly detection, dimensionality reduction.
5 Association analysis Input: data with multiple attributes. Output: which attributes are associated, i.e. frequently co-occur? Example: mining shopping carts, with goods (coffee, tea, milk, beer, diaper, aspirin) bought by customers A, B, C, D, E, ..., yielding association rules such as r(milk, coffee), r(milk, tea), r(beer, diaper), ... Challenge: computational efficiency
6 Clustering Input: data (with no class label) Output: which data are clustered? 4 68
7 Examples of clustering Market segmentation: Divide customers into clusters Document clustering: Divide retrieved documents about Amazon into clusters Image segmentation: Divide pixels into clusters 5 68
8 Anomaly detection Input: data (with no class label) Output: which data are normal, and which are abnormal? Detecting credit card frauds Given credit card transactions, try to detect which are normal and which are frauds Also useful in supervised learning: remove outliers to clean data 6 68
9 Dimensionality reduction Input: high-dimensional data (with many attributes) Output: low-dimensional representation (with few attributes) Often used as a data preprocessing step Techniques: Broad sense: feature extraction, feature selection, etc. Narrow sense: transform that reduces dimension Benefits: Reduce the computational cost Alleviate noise or irrelevant attributes Avoid curse of dimensionality 7 68
10 Curse of dimensionality 1/2 Many statistical learning methods depend on a distance measure, but distances lose their discriminative power in high-dimensional spaces. Experiment: randomly place several points in a hypercube in a high-dimensional space, measure the Euclidean distances between points, and identify the maximum and minimum distances.
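As a quick check of this effect, here is a minimal simulation sketch (my own illustration, not from the slides): it samples points uniformly in the unit hypercube for several dimensions D and reports how the gap between the maximum and minimum pairwise distances shrinks relative to the distances themselves as D grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200

for D in (2, 10, 100, 1000):
    X = rng.random((n_points, D))                    # uniform points in the unit hypercube
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    dist = np.sqrt(d2)[np.triu_indices(n_points, k=1)]   # distinct pairs only
    print(f"D={D:5d}  min={dist.min():.3f}  max={dist.max():.3f}  "
          f"relative gap={(dist.max() - dist.min()) / dist.max():.3f}")
```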
11 Curse of dimensionality 2/2 Nearest neighbors are all far away: for a point in a D-dimensional space, its nearest neighbors (within distance r) lie in the thin outer spherical shell between αr and r with probability 1 − α^D, which approaches 1 as D grows. Why? Because the volume of a high-dimensional space is so huge that samples are very sparse in it. Dimensionality reduction can find a low-dimensional structure (manifold) in the high-dimensional space, and unfold this inherent structure.
12 Section 6.2 k-means
13 Prototype-based clustering Each cluster has a prototype Distance to prototype decides cluster membership k-means is the most well known representative 10 68
14 k-means algorithm
Input: dataset {x_1, ..., x_N}, number of clusters k
Output: cluster assignments q(x_i) ∈ {1, ..., k}
1: Initialize k centroids {c_1, ..., c_k}
2: repeat
3:   for i = 1, ..., N do
4:     q(x_i) ← arg min_j ||x_i − c_j||
5:   end for
6:   for j = 1, ..., k do
7:     c_j ← mean({x_i : q(x_i) = j})
8:   end for
9: until centroids do not change
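A compact NumPy sketch of this loop (an illustrative implementation, not the lecture's reference code); it picks the initial centroids from the dataset, as the next slides suggest, and stops once the assignments no longer change.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: X is (N, D); returns assignments q in {0..k-1} and centroids c."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]   # pick k data points as initial centroids
    q = np.full(len(X), -1)
    for _ in range(n_iters):
        # assignment step: each point goes to the nearest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=-1)
        q_new = dist.argmin(axis=1)
        if np.array_equal(q_new, q):
            break                                      # assignments (and centroids) have stabilized
        q = q_new
        # update step: each centroid becomes the mean of its cluster
        for j in range(k):
            if np.any(q == j):
                c[j] = X[q == j].mean(axis=0)
    return q, c

# toy usage: three Gaussian blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [3, 3], [0, 3])])
q, c = kmeans(X, k=3)
print(c)
```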
15 Illustration of k-means 1/4 Initialize k centroids Usually we select centroids from the dataset We will see that initial centroids are crucial for the final results Is it wise to select scattered initial centroids? 12 68
16 Illustration of k-means 2/4 Assign data points to clusters. Complexity O(kN). Here we measure Euclidean distance; can another distance metric be used?
17 Illustration of k-means 3/4 Update cluster centroids. Complexity O(N + k). Here we use the (arithmetic) mean; can another method be used?
18 Illustration of k-means 4/4 Until the centroids do not change Will k-means converge? Usually it does 15 68
19 Interpretation of k-means k-means was first proposed for vector quantization, an extension of scalar quantization to Euclidean space. The centroids are known as codewords, which constitute the codebook. k-means solves min_{q, {c_j}} Σ_i ||x_i − c_{q(x_i)}||². Heuristically, k-means updates q and {c_j} alternately; it is greedy and cannot ensure a (global) optimum.
20 k-means finds a local minimum (two cases shown in the figures) In Case 2, we select initial centroids that are more scattered, but the result is worse. In practice, we often run k-means multiple times with different initializations, and choose the best result.
21 Limitation of k-means: Outliers 18 68
22 Limitation of k-means: Different sized clusters Left: Ideal clusters; Right: k-means results 19 68
23 Limitation of k-means: Clusters with different densities Left: Ideal clusters; Right: k-means results 20 68
24 Limitation of k-means: Clusters having irregular shapes Left: Ideal clusters; Right: k-means results 21 68
25 A remedy for k-means: Over-segmentation and post-processing Left: For clusters with different densities; Right: For clusters having irregular shapes 22 68
26 Section 6.3 Gaussian mixture model
27 Distribution-based clustering Each cluster corresponds to a (single-modal) distribution Calculate posterior probabilities to decide clusters Gaussian mixture model is the most well known representative 23 68
28 Gaussian mixture model A mixture model is a combination of multiple single-modal distributions. In a Gaussian mixture model (GMM), each component is a Gaussian. One-dimensional case: p(x) = Σ_{j=1}^k w_j N(x | μ_j, σ_j²), where Σ_j w_j = 1. Multi-dimensional case: p(x) = Σ_{j=1}^k w_j N(x | μ_j, Σ_j).
29 GMM for clustering Assume we know the parameters of the GMM; then we can calculate the posterior (responsibility) as p(q(x_i) = j) = γ_ij = w_j N(x_i | μ_j, Σ_j) / Σ_{l=1}^k w_l N(x_i | μ_l, Σ_l). Then q(x_i) = arg max_j γ_ij. Now the problem is how to estimate the parameters of the GMM, ϑ = {w_j, μ_j, Σ_j | j = 1, ..., k}, and we consider maximum likelihood estimation: ϑ̂ = arg max_ϑ Π_{i=1}^N p(x_i | ϑ).
30 Intuitive solution 1/3 We want to maximize Π_{i=1}^N Σ_{j=1}^k w_j N(x_i | μ_j, Σ_j), which is quite difficult directly. Can we borrow the idea of k-means? First, we initialize the parameters.
31 Intuitive solution 2/3 Second, we calculate the responsibilities ("assign data to clusters").
32 Intuitive solution 3/3 Third, we update the parameters. Since p(q(x_i) = j) = γ_ij, it is intuitive to set w_j = Σ_i γ_ij / N, μ_j = Σ_i γ_ij x_i / Σ_i γ_ij, and Σ_j = Σ_i γ_ij (x_i − μ_j)(x_i − μ_j)^T / Σ_i γ_ij. The last two steps are executed alternately, until convergence.
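The two steps above (computing responsibilities, then updating the parameters) translate into a short NumPy routine; the following is an illustrative sketch (full covariances, a small ridge term for numerical stability, no convergence test), not the lecture's reference implementation.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) evaluated at every row of X."""
    D = len(mu)
    diff = X - mu
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

def gmm_em(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture: returns weights w, means mu, covariances Sigma, responsibilities gamma."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(N, size=k, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    for _ in range(n_iters):
        # E-step: gamma_ij = w_j N(x_i | mu_j, Sigma_j) / sum_l w_l N(x_i | mu_l, Sigma_l)
        dens = np.column_stack([w[j] * gauss_pdf(X, mu[j], Sigma[j]) for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances from the responsibilities
        Nj = gamma.sum(axis=0)
        w = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            Xc = X - mu[j]
            Sigma[j] = (gamma[:, j][:, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(D)
    return w, mu, Sigma, gamma
```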
33 Example of GMM result 29 68
34 k-means is a special case of GMM In this special case, we set w_j ≡ 1/k and Σ_j ≡ αI, and we quantize γ_ij to 0 or 1. Initialize parameters ⇒ initialize the μ_j's. Calculate responsibilities ⇒ find the nearest μ_j and set the corresponding γ_ij = 1. Update parameters ⇒ update the μ_j's as cluster means. So we can interpret why k-means has limitations for clusters having different sizes, densities, or irregular shapes.
35 Example of comparison between GMM and k-means
36 Interpretation: Expectation maximization Introduce latent variables z_i ∈ {1, ..., k}, representing the true cluster that generates x_i. Consider maximizing Π_i p(x_i, z_i | ϑ) = Π_i Π_j (w_j N(x_i | μ_j, Σ_j))^{I(z_i = j)}, or equivalently Σ_i Σ_j I(z_i = j) log(w_j N(x_i | μ_j, Σ_j)). Expectation maximization (EM) executes two steps alternately: 1. E-step: given ϑ^t, calculate the expectation of the objective function with the latent variables eliminated. Since γ_ij is the expectation of I(z_i = j), we obtain Σ_i Σ_j γ_ij log(w_j N(x_i | μ_j, Σ_j)). 2. M-step: maximize this expected objective to find ϑ^{t+1}. Using the constraint Σ_j w_j = 1, we can derive the GMM update equations.
37 EM as an algorithm
Input: p(X, Z | ϑ), X = {x_1, ..., x_N}, Z is unobserved
Output: ϑ̂
1: t ← 0, initialize ϑ^0
2: repeat
3:   Given ϑ^t, calculate the expectation of log p(X, Z | ϑ) with Z eliminated (averaged under p(Z | X, ϑ^t)); denote this expectation by Q(ϑ, ϑ^t)
4:   ϑ^{t+1} ← arg max_ϑ Q(ϑ, ϑ^t); t ← t + 1
5: until ϑ^t is similar to ϑ^{t−1}
6: ϑ̂ = ϑ^t
38 Is EM correct? Consider log p(X | ϑ) = Σ_Z p(Z | X, ϑ^t) log p(X | ϑ) = Σ_Z p(Z | X, ϑ^t) log p(X, Z | ϑ) − Σ_Z p(Z | X, ϑ^t) log p(Z | X, ϑ) = Q(ϑ, ϑ^t) + H(ϑ, ϑ^t). The first term is exactly Q(ϑ, ϑ^t); since ϑ^{t+1} = arg max_ϑ Q(ϑ, ϑ^t), we know Q(ϑ^{t+1}, ϑ^t) ≥ Q(ϑ^t, ϑ^t). The second term, H(ϑ, ϑ^t) = −Σ_Z p(Z | X, ϑ^t) log p(Z | X, ϑ), is the cross-entropy between p(Z | X, ϑ^t) and p(Z | X, ϑ), so for any ϑ, H(ϑ, ϑ^t) ≥ H(ϑ^t, ϑ^t) (the latter is an entropy, and the difference is a K-L divergence). Thus EM ensures that log p(X | ϑ^{t+1}) ≥ log p(X | ϑ^t). This is a greedy algorithm to maximize the likelihood; the likelihood is non-decreasing, so the iteration converges, but it cannot ensure a global optimum.
39 Variants of EM We may try different initial values to escape a local optimum. We do not have to maximize Q(ϑ, ϑ^t) exactly; ensuring Q(ϑ^{t+1}, ϑ^t) ≥ Q(ϑ^t, ϑ^t) is enough (e.g. by gradient ascent). This is especially helpful if the maximization problem has no closed-form solution.
40 Section 6.4 Other approaches to clustering
41 Density-based clustering Clusters correspond to high-density regions, while low-density regions separate clusters Exclude noisy data and outliers Mean-shift and DBSCAN are the most well known representatives 36 68
42 Mean-shift Mean-shift is an iterative algorithm for locating modes (i.e. local maxima of the density), where the density is estimated with a Parzen window (non-parametric). Initialize a mode estimate x^0, and iteratively refine it. Given x^t, the local density around x^t is estimated with kernel weights K(x_i − x^t), where K(·) is a kernel function. The local mean is m(x^t) = Σ_{x_i ∈ N(x^t)} x_i K(x_i − x^t) / Σ_{x_i ∈ N(x^t)} K(x_i − x^t); let x^{t+1} ← m(x^t). After convergence, find the next mode.
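A minimal sketch of one mean-shift run with a Gaussian kernel (my own illustration; the bandwidth h is an assumed parameter):

```python
import numpy as np

def mean_shift_mode(X, x0, h=1.0, n_iters=100, tol=1e-5):
    """Refine one mode estimate x0 by repeatedly moving to the kernel-weighted local mean."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))   # Gaussian kernel weights K(x_i - x)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()             # local mean m(x)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Starting this procedure from every data point (or from a grid of seeds) and merging the modes that coincide yields the clusters.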
43 DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) seeks clusters as high-density regions. For each point, find its neighbors within a given radius; if the number of neighboring points is less than a threshold, this point is treated as noise. Otherwise, let this point and its neighbors belong to a cluster, and expand this cluster until it reaches a low-density region.
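A rough from-scratch sketch of that procedure (the radius eps and the threshold min_pts are illustrative parameter names; a practical implementation would use a spatial index rather than the full distance matrix):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    """Minimal DBSCAN sketch: returns cluster labels, with -1 marking noise."""
    N = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(N)]
    labels = np.full(N, -1)                      # -1 = noise / not yet assigned
    cluster = 0
    for i in range(N):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                             # already assigned, or not a core point
        labels[i] = cluster                      # start a new cluster from core point i
        queue = list(neighbors[i])
        while queue:                             # expand the cluster through core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```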
44 Connectivity-based clustering Use a graph to represent data points and their relations Find clusters as subgraphs that are highly related Also called graph-based clustering 39 68
45 Agglomerative clustering Bottom-up strategy: Combine points into clusters progressively 40 68
46 Divisive clustering Also called highly connected subgraphs (HCS) clustering. Top-down strategy: split a graph into two subgraphs by finding the minimum cut, and split the subgraphs further, until a subgraph is highly connected. To evaluate whether a subgraph is highly connected, we use the edge-density ratio 2n_e / (n_v(n_v − 1)).
47 Section 6.5 Principal component analysis (PCA)
48 Projection for dimensionality reduction Given x ∈ R^D, we want to find a projection matrix P ∈ R^{K×D}, where K < D. Then the dimensionality-reduced representation is y = Px, y ∈ R^K. This is linear dimensionality reduction, and the problem is how to find P. Principal component analysis (PCA) is the most widely used method.
49 Motivation of PCA PCA seeks a projection that keeps the data's Euclidean distances as much as possible. Note that Σ_i Σ_j ||x_i − x_j||² is related to the data variance.
50 PCA 1/3 Step 1: subtract the mean x̄ = Σ_i x_i / N. Let X = [(x_1 − x̄)^T; ...; (x_N − x̄)^T] (rows are the centered data points); then X^T X is the (unnormalized) data covariance matrix.
51 PCA 2/3 Step 2: rotate the centered data x_i − x̄. Let C = X^T X and calculate its eigenvalues and eigenvectors, C u_i = λ_i u_i; then we have an orthonormal matrix U = [u_1, ..., u_D] with CU = UΛ, where Λ = diag{λ_1, ..., λ_D}, λ_1 ≥ λ_2 ≥ ... ≥ λ_D ≥ 0. Let X̃ = XU, so X̃^T X̃ = U^T C U = Λ becomes a diagonal matrix.
52 PCA 3/3 Step 3: select the K largest entries of {λ_1, ..., λ_D}, and stack the corresponding eigenvectors of U (as rows) to constitute P = [u_1, ..., u_K]^T. Then y_i = P(x_i − x̄) completes the dimensionality reduction.
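The three steps map directly onto a few NumPy lines; this is a minimal sketch (not the lecture's reference code):

```python
import numpy as np

def pca(X, K):
    """PCA via the eigendecomposition of the (unnormalized) covariance X^T X; X is (N, D)."""
    x_bar = X.mean(axis=0)                 # Step 1: subtract the mean
    Xc = X - x_bar
    C = Xc.T @ Xc                          # (unnormalized) covariance matrix
    lam, U = np.linalg.eigh(C)             # Step 2: eigenvalues (ascending) and eigenvectors
    order = np.argsort(lam)[::-1][:K]      # Step 3: keep the K largest eigenvalues
    P = U[:, order].T                      # rows of P are the selected eigenvectors
    Y = Xc @ P.T                           # y_i = P (x_i - x_bar)
    return Y, P, x_bar
```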
53 Example: Eigenface 1/2 Consider applying PCA to face data; the calculated eigenvectors are termed eigenfaces. Note that we usually reshape each image into a vector; otherwise we need 2D-PCA.
54 Example: Eigenface 2/2 Using the eigenfaces, for a new input face we can perform dimensionality reduction y_{N+1} = P x_{N+1} (omitting the mean subtraction step). Note that the rows of P are orthonormal (P P^T = I), so x_{N+1} ≈ P^T y_{N+1}, which is a decomposition over eigenfaces. This is also a commonly used step for feature extraction.
55 Kernel PCA 1/3 PCA finds a linear projection; how do we deal with nonlinearity? Consider using basis functions φ_i = φ(x_i); then the data covariance matrix is Φ^T Φ, where Φ = [φ_1^T; ...; φ_N^T] (here we assume Σ_i φ_i = 0). We may also use the kernel trick, where k(x, y) = φ(x) · φ(y) but we do not use φ(·) explicitly. Note that we can calculate the kernel matrix K = [k(x_i, x_j)]_{i,j=1,...,N} = Φ Φ^T, and we can calculate eigenvalues/eigenvectors of K: K u_i = λ_i u_i.
56 Kernel PCA 2/3 Note that (Φ^T Φ)(Φ^T u_i) = Φ^T K u_i = λ_i Φ^T u_i, which means Φ^T u_i is an eigenvector of Φ^T Φ. However, we cannot ensure Φ^T u_i is a unit vector, so we normalize it: ũ_i = Φ^T u_i / ||Φ^T u_i|| = Φ^T u_i / √(u_i^T Φ Φ^T u_i) = Φ^T u_i / √λ_i. Then let P = [ũ_1, ..., ũ_K]^T, where the corresponding K eigenvalues are the largest. The projection is y_i = P φ_i = [u_1^T / √λ_1; ...; u_K^T / √λ_K] [k(x_1, x_i), ..., k(x_N, x_i)]^T.
57 Kernel PCA 3/3 Previously we assumed Σ_i φ_i = 0, but this is not satisfied by an arbitrary kernel, so we have to center the kernel. Note that k_ij = φ_i · φ_j; now we want k̃_ij = (φ_i − φ̄) · (φ_j − φ̄), where φ̄ = Σ_i φ_i / N: k̃_ij = k_ij − (1/N) Σ_l k_lj − (1/N) Σ_l k_il + (1/N²) Σ_l Σ_m k_lm. Replace K with K̃ in the previous steps.
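Putting the three steps together, here is a compact sketch (illustrative only; it assumes an RBF kernel with bandwidth sigma as an example choice):

```python
import numpy as np

def kernel_pca(X, n_components, sigma=1.0):
    """Kernel PCA with an RBF kernel: returns the low-dimensional projections of the training points."""
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / (2 * sigma ** 2))               # kernel matrix K = Phi Phi^T
    one = np.full((N, N), 1.0 / N)
    K_c = K - one @ K - K @ one + one @ K @ one      # centered kernel matrix (tilde K)
    lam, U = np.linalg.eigh(K_c)                     # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]
    lam, U = lam[idx], U[:, idx]
    # projection of training point i: y_i = [u_k^T [k(x_1, x_i), ..., k(x_N, x_i)]^T / sqrt(lambda_k)]_k
    return K_c @ U / np.sqrt(np.maximum(lam, 1e-12))
```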
58 Example of kernel PCA Left: input points in 2-D plane; Right: projected points using k(x, y) = (1 + x y)
59 Section 6.6 Other approaches to dimensionality reduction
60 Nonlinear dimensionality reduction Kernel PCA, ISOMAP, locally linear embedding (LLE), self-organizing map (Chap. 10), autoencoder (Chap. 10), Laplacian eigenmap, t-distributed stochastic neighbor embedding (t-SNE)
61 Manifold learning A manifold is a topological space that locally resembles Euclidean space near each point. 1-D manifolds include lines and circles, but not figure-eight curves. 2-D manifolds are also called surfaces, such as spheres. The intrinsic dimension of a manifold can be lower than that of its ambient space. Manifold learning is to identify such low-dimensional structure from high-dimensional data. For manifold learning, Euclidean distance is not appropriate, and is replaced by geodesic distance.
62 ISOMAP While PCA seeks to preserve the data's Euclidean distances as much as possible, ISOMAP seeks to preserve the data's geodesic distances as much as possible. In ISOMAP, the geodesic distance is defined as the shortest distance on a graph, where the graph is constructed from each point's nearest neighbors. Given d_ij = d(x_i, x_j), we seek arg min_{y_1, ..., y_N} Σ_i Σ_j (||y_i − y_j|| − d_ij)², which is solved by multi-dimensional scaling (MDS).
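A rough end-to-end sketch (my own illustration, under simple assumptions: a k-nearest-neighbor graph, Floyd-Warshall shortest paths, classical MDS via double centering, and a connected graph):

```python
import numpy as np

def isomap(X, n_neighbors=10, n_components=2):
    """ISOMAP sketch: geodesic distances on a kNN graph, then classical MDS."""
    N = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # build the symmetric kNN graph (np.inf marks missing edges)
    W = np.full((N, N), np.inf)
    for i in range(N):
        nn = np.argsort(d[i])[1:n_neighbors + 1]
        W[i, nn] = d[i, nn]
        W[nn, i] = d[i, nn]
    np.fill_diagonal(W, 0.0)
    # geodesic (graph shortest-path) distances via Floyd-Warshall
    D = W
    for k in range(N):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    # classical MDS: double-center the squared distances and take the top eigenvectors
    J = np.eye(N) - np.full((N, N), 1.0 / N)
    B = -0.5 * J @ (D ** 2) @ J
    lam, U = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:n_components]
    return U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))
```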
63 Example of MDS MDS seeks an appropriate coordinate system for a given distance matrix.
64 Locally linear embedding Locally linear embedding (LLE) seeks to preserve locally linear relations as much as possible. In LLE, first, a locally linear relation matrix W is found: arg min_W Σ_i ||x_i − Σ_{x_j ∈ N(x_i)} w_ij x_j||², s.t. Σ_j w_ij = 1; if x_j ∉ N(x_i), then w_ij = 0. Second, a low-dimensional embedding is found: arg min_{y_1, ..., y_N} Σ_i ||y_i − Σ_j w_ij y_j||².
65 Section 6.7 Semi-supervised learning
66 Supervised versus unsupervised learning For example, consider classification versus clustering. Classification is good at predicting the true class, but requires many labeled data for training. Clustering is able to find clusters, but clusters are not equal to classes; it requires no labels, just data. Can we combine supervised and unsupervised learning?
67 Motivation of semi-supervised learning One practical difficulty for supervised learning is the lack of accurate labels. Semi-supervised learning tries to use unlabeled data in addition to labeled data. This includes weakly-supervised learning, where labels are available but inaccurate or noisy. It also includes transductive learning, where we do not build a model but just want predictions for the unlabeled data.
Supervised: data = labeled data; objective = a model ŷ = f(x) or q(y | x)
Semi-supervised (inductive): data = both labeled and unlabeled data; objective = a model
Transductive: data = both labeled and unlabeled data; objective = predictions for the unlabeled data
68 Examples of semi-supervised learning Image classification with unlabeled images; image classification from search-engine click-through data; image segmentation with only bounding-box labels; anomaly detection with limited labels
69 Why does semi-supervised learning work? For generative methods: labeled data provide information about p(x, y), and unlabeled data provide additional information about p(x), which helps to estimate p(x, y). For discriminative methods: 1. Cluster assumption: if two points belong to the same cluster, they are likely to belong to the same class. 2. Density assumption: the decision boundary should lie in low-density regions that separate high-density regions.
70 A generative method for semi-supervised classification For labeled data, we know (x_i, y_i), i = 1, ..., N. For unlabeled data, we know x_j, j = 1, ..., M. We consider a generative model for the labeled data, p(x_i, y_i) = p(y_i) p(x_i | y_i), and assume a mixture model for the unlabeled data, p(x_j) = Σ_{y_j} p(y_j) p(x_j | y_j). We further parameterize the probabilities as p(y) = p(y | ϑ_0), p(x | y) = p(x | y, ϑ_1). Then we can maximize the log-likelihood: Σ_i log(p(y_i | ϑ_0) p(x_i | y_i, ϑ_1)) + Σ_j log(Σ_{y_j} p(y_j | ϑ_0) p(x_j | y_j, ϑ_1)).
71 EM for semi-supervised classification Consider the y_j's as latent variables; the expectation of log p({x_i, y_i}, {x_j, y_j} | ϑ) with respect to p({y_j} | ϑ^t) is Q(ϑ, ϑ^t) = Σ_i log(p(y_i | ϑ_0) p(x_i | y_i, ϑ_1)) + Σ_j Σ_k γ^t_jk log(p(y_j = k | ϑ_0) p(x_j | y_j = k, ϑ_1)), where γ^t_jk = p(y_j = k | x_j, ϑ^t) ∝ p(y_j = k | ϑ^t_0) p(x_j | y_j = k, ϑ^t_1).
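As a concrete instance (my own sketch, assuming Gaussian class-conditional densities p(x | y) and at least two labeled samples per class for the initialization), the EM loop mixes the hard assignments of the labeled points with the soft responsibilities of the unlabeled points:

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at every row of X."""
    D = len(mu)
    diff = X - mu
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

def semi_supervised_em(Xl, yl, Xu, k, n_iters=50):
    """EM over labeled data (Xl, yl) and unlabeled data Xu; classes are 0..k-1."""
    D = Xl.shape[1]
    # initialize the class priors and class-conditional Gaussians from the labeled data alone
    w = np.array([(yl == c).mean() for c in range(k)])
    mu = np.stack([Xl[yl == c].mean(axis=0) for c in range(k)])
    Sigma = np.stack([np.cov(Xl[yl == c].T) + 1e-6 * np.eye(D) for c in range(k)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma_jk for the unlabeled points only
        dens = np.column_stack([w[c] * gauss_pdf(Xu, mu[c], Sigma[c]) for c in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # labeled points keep hard (one-hot) responsibilities
        G = np.vstack([np.eye(k)[yl], gamma])
        X = np.vstack([Xl, Xu])
        # M-step: same updates as for the GMM, now over hard + soft assignments
        Nc = G.sum(axis=0)
        w = Nc / len(X)
        mu = (G.T @ X) / Nc[:, None]
        for c in range(k):
            Xc = X - mu[c]
            Sigma[c] = (G[:, c][:, None] * Xc).T @ Xc / Nc[c] + 1e-6 * np.eye(D)
    return w, mu, Sigma
```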
72 Semi-supervised SVM 1/2 Recall the (supervised) SVM: min_{w,b,ξ} (1/2)||w||² + C Σ_i ξ_i, s.t. ∀i, ξ_i ≥ 0, y_i(w^T x_i + b) ≥ 1 − ξ_i.
73 Semi-supervised SVM 2/2 For the semi-supervised SVM, also called transductive SVM: min_{w,b,ξ,y,ζ} (1/2)||w||² + C Σ_i ξ_i + C′ Σ_j ζ_j, s.t. ∀i, ξ_i ≥ 0, y_i(w^T x_i + b) ≥ 1 − ξ_i; ∀j, y_j ∈ {+1, −1}, ζ_j ≥ 0, y_j(w^T x_j + b) ≥ 1 − ζ_j.
74 A graph-based method for transductive classification
Input: (x_i, y_i), i = 1, ..., N; x_j, j = 1, ..., M
Output: ŷ_j
1: Construct a graph whose vertices are the x_i and x_j
2: In the graph, if j ∈ N(i), construct an edge from i to j with edge weight w_ij = 1 / d(x_i, x_j)
3: t ← 0, initialize the class probability vectors p_i^0
4: repeat
5:   ∀j, p_j^{t+1} ← 0
6:   ∀i ∈ {1, ..., M, M+1, ..., M+N}, ∀j, if w_ij ≠ 0, p_j^{t+1} ← p_j^{t+1} + (w_ij / Σ_k w_ik) p_i^t
7:   ∀j, normalize p_j^{t+1} to be a unit vector
8: until convergence
9: ŷ_j = arg max p_j^{t+1}
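A minimal NumPy sketch of this propagation loop (illustrative; it assumes the weight matrix W over all labeled and unlabeled points is already built, with the first L rows being the labeled points, and it additionally re-clamps the labeled points to their labels every iteration, a common variant not spelled out on the slide):

```python
import numpy as np

def propagate_labels(W, y_labeled, n_classes, n_iters=100):
    """Graph-based transduction: W is the (L+U) x (L+U) weight matrix; the first L nodes carry integer labels."""
    L = len(y_labeled)
    n = W.shape[0]
    P = np.full((n, n_classes), 1.0 / n_classes)       # class probability vectors p_i
    P[:L] = np.eye(n_classes)[y_labeled]               # labeled nodes start from their labels
    T = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # w_ij / sum_k w_ik
    for _ in range(n_iters):
        P = T.T @ P                                    # p_j <- sum_i (w_ij / sum_k w_ik) p_i
        P = P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)  # renormalize each node's vector
        P[:L] = np.eye(n_classes)[y_labeled]           # clamp labeled nodes (assumed variant)
    return P[L:].argmax(axis=1)                        # predictions for the unlabeled nodes
```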
75 Random walk and PageRank The loop above is actually a random walk. For example, PageRank is an unsupervised method based on a random walk to determine the relative importance of webpages.
Input: a graph of webpages and hyperlinks
Output: relative importance of all webpages
1: t ← 0, initialize the importance values r_i^0
2: repeat
3:   ∀j, r_j^{t+1} ← 0
4:   ∀i, ∀j, if w_ij ≠ 0, r_j^{t+1} ← r_j^{t+1} + (w_ij / Σ_k w_ik) r_i^t
5:   ∀j, r_j^{t+1} ← β r_j^{t+1} + (1 − β)/N
6: until convergence
β is called the damping factor, used to avoid traps (an example of a trap is shown in the figure).
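A power-iteration sketch along these lines (illustrative; A is an assumed adjacency matrix where A[i, j] = 1 if page i links to page j, and beta is the damping factor):

```python
import numpy as np

def pagerank(A, beta=0.85, n_iters=100, tol=1e-10):
    """PageRank by power iteration over the row-normalized link matrix."""
    N = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)
    T = A / np.maximum(out_degree, 1e-12)          # w_ij / sum_k w_ik (dangling rows stay zero)
    r = np.full(N, 1.0 / N)
    for _ in range(n_iters):
        r_new = beta * (T.T @ r) + (1 - beta) / N  # damped random-walk update
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# toy usage: page 2 links only to itself (a trap); damping keeps it from absorbing all the mass
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 0, 1]], dtype=float)
print(pagerank(A))
```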
76 In this chapter
Dictionary: clustering, curse of dimensionality, dimensionality reduction, manifold learning, semi-supervised learning, transductive learning, unsupervised learning
Toolbox: expectation maximization, Gaussian mixture model, k-means, principal component analysis (and kernel PCA), random walk, transductive support vector machine