
1 Statistical Learning Dong Liu Dept. EEIS, USTC

2 Chapter 6. Unsupervised and Semi-Supervised Learning
1. Unsupervised learning
2. k-means
3. Gaussian mixture model
4. Other approaches to clustering
5. Principal component analysis
6. Other approaches to dimensionality reduction
7. Semi-supervised learning

3 Section 6.1 Unsupervised Learning

4 What is unsupervised learning? Supervised learning aims to identify relations within data in order to solve predictive tasks. Unsupervised learning aims to discover patterns in data in order to solve descriptive tasks. Which patterns? Association analysis, clustering, anomaly detection, dimensionality reduction.

5 Association analysis Input: data with multiple attributes. Output: which attributes are associated, i.e., frequently co-occur? [Table: goods (Coffee, Tea, Milk, Beer, Diaper, Aspirin) versus customers A, B, C, D, E, ...] Mining shopping carts: r(Milk, Coffee), r(Milk, Tea), r(Beer, Diaper), ... Challenge: computational efficiency.

6 Clustering Input: data (with no class labels). Output: which data points belong to which clusters?

7 Examples of clustering Market segmentation: divide customers into clusters. Document clustering: divide retrieved documents about Amazon into clusters. Image segmentation: divide pixels into clusters.

8 Anomaly detection Input: data (with no class label). Output: which data are normal and which are abnormal? Example: detecting credit card fraud. Given credit card transactions, try to detect which are normal and which are fraudulent. Also useful in supervised learning: remove outliers to clean the data.

9 Dimensionality reduction Input: high-dimensional data (with many attributes). Output: low-dimensional representation (with few attributes). Often used as a data preprocessing step. Techniques, in the broad sense: feature extraction, feature selection, etc.; in the narrow sense: a transform that reduces dimension. Benefits: reduce the computational cost, alleviate noise and irrelevant attributes, avoid the curse of dimensionality.

10 Curse of dimensionality 1/2 Many statistical learning methods depend on a distance measure, but distance loses its discriminative power in high-dimensional space. Experiment: randomly drop several points into a hypercube in a high-dimensional space, measure the Euclidean distances between points, and compare the maximum and minimum distances; as the dimension grows, they become nearly indistinguishable.
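A minimal numerical check of this effect (a sketch, assuming NumPy; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of random points

for d in (2, 10, 100, 1000):
    # Draw points uniformly in the unit hypercube [0, 1]^d
    x = rng.random((n, d))
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (x ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * x @ x.T, 0)
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])  # unique pairs only
    print(f"d={d:5d}  relative gap (d_max - d_min) / d_min = "
          f"{(dist.max() - dist.min()) / dist.min():.3f}")
```

The relative gap shrinks quickly as d increases, which is the sense in which distance stops being informative.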

11 Curse of dimensionality 2/2 Nearest neighbors are all far away: for a point in a D-dimensional space, a neighbor that lies within distance r falls in the outer spherical shell between αr and r with probability 1 − α^D, which approaches 1 as D grows (for any α < 1). Why? Because the volume of a high-dimensional space is huge, so samples are very sparse in it. Dimensionality reduction can find a low-dimensional structure (a manifold) in the high-dimensional space and unfold this inherent structure.

12 Section 6.2 k-means

13 Prototype-based clustering Each cluster has a prototype; distance to the prototype decides cluster membership. k-means is the best-known representative.

14 k-means algorithm
Input: dataset {x_1, ..., x_N}, number of clusters k
Output: cluster assignments q(x_i) ∈ {1, ..., k}
1: Initialize k centroids {c_1, ..., c_k}
2: repeat
3:   for i = 1, ..., N do
4:     q(x_i) ← arg min_j ‖x_i − c_j‖
5:   end for
6:   for j = 1, ..., k do
7:     c_j ← mean({x_i : q(x_i) = j})
8:   end for
9: until the centroids do not change
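A compact NumPy sketch of this loop (illustrative only; the function and variable names are my own, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on data X of shape (N, D); returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points
    c = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid in Euclidean distance
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        q = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_c = np.array([X[q == j].mean(axis=0) if np.any(q == j) else c[j]
                          for j in range(k)])
        if np.allclose(new_c, c):  # stop when centroids no longer change
            break
        c = new_c
    return q, c
```

Running it several times with different seeds and keeping the run with the smallest within-cluster sum of squares mirrors the multiple-initialization advice given later in this section.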

15 Illustration of k-means 1/4 Initialize k centroids. Usually we select the centroids from the dataset. We will see that the initial centroids are crucial for the final result. Is it wise to select scattered initial centroids?

16 Illustration of k-means 2/4 Assign data points to clusters; complexity O(kN). Here we measure Euclidean distance; can another distance metric be used?

17 Illustration of k-means 3/4 Update cluster centroids; complexity O(N + k). Here we use the (arithmetic) mean; can another method be used?

18 Illustration of k-means 4/4 Repeat until the centroids do not change. Will k-means converge? Usually it does.

19 Interpretation of k-means k-means was first proposed for vector quantization, an extension of scalar quantization to Euclidean space; the centroids are known as codewords, which constitute the codebook. k-means solves min_{q, {c_j}} Σ_i ‖x_i − c_{q(x_i)}‖². Heuristically, k-means updates q and {c_j} alternately; it is greedy and cannot guarantee a (global) optimum.

20 k-means finds a local minimum [Figure: two runs, Case 1 and Case 2, with different initial centroids] In Case 2 we select more scattered initial centroids, but the result is worse. In practice, we often run k-means multiple times with different initializations and choose the best result.

21 Limitation of k-means: Outliers

22 Limitation of k-means: Different-sized clusters. Left: ideal clusters; Right: k-means result

23 Limitation of k-means: Clusters with different densities. Left: ideal clusters; Right: k-means result

24 Limitation of k-means: Clusters having irregular shapes. Left: ideal clusters; Right: k-means result

25 A remedy for k-means: over-segmentation followed by post-processing. Left: for clusters with different densities; Right: for clusters having irregular shapes

26 Section 6.3 Gaussian mixture model

27 Distribution-based clustering Each cluster corresponds to a (unimodal) distribution; posterior probabilities decide cluster membership. The Gaussian mixture model is the best-known representative.

28 Gaussian mixture model A mixture model is a combination of multiple unimodal distributions; in a Gaussian mixture model (GMM), each component is a Gaussian. One-dimensional case: p(x) = Σ_{j=1}^k w_j N(x | µ_j, σ_j²), where Σ_j w_j = 1. Multi-dimensional case: p(x) = Σ_{j=1}^k w_j N(x | µ_j, Σ_j).

29 GMM for clustering Assume the GMM parameters are known; then we can calculate the posterior (responsibility) as γ_ij = p(q(x_i) = j) = w_j N(x_i | µ_j, Σ_j) / Σ_{l=1}^k w_l N(x_i | µ_l, Σ_l), and assign q(x_i) = arg max_j γ_ij. Now the problem is how to estimate the GMM parameters ϑ = {w_j, µ_j, Σ_j | j = 1, ..., k}, and we consider maximum likelihood estimation: ϑ̂ = arg max_ϑ Π_{i=1}^N p(x_i | ϑ).

30 Intuitive solution 1/3 We want to maximize Π_{i=1}^N Σ_{j=1}^k w_j N(x_i | µ_j, Σ_j), which is quite difficult to do directly. Can we borrow the idea of k-means? First, we initialize the parameters.

31 Intuitive solution 2/3 We want to maximize Π_{i=1}^N Σ_{j=1}^k w_j N(x_i | µ_j, Σ_j), which is quite difficult to do directly. Can we borrow the idea of k-means? Second, we calculate the responsibilities γ_ij ("assign data to clusters").

32 Intuitive solution 3/3 We want to maximize Π_{i=1}^N Σ_{j=1}^k w_j N(x_i | µ_j, Σ_j), which is quite difficult to do directly. Can we borrow the idea of k-means? Third, we update the parameters. Since p(q(x_i) = j) = γ_ij, it is intuitive to set w_j = (Σ_i γ_ij) / N, µ_j = (Σ_i γ_ij x_i) / (Σ_i γ_ij), Σ_j = (Σ_i γ_ij (x_i − µ_j)(x_i − µ_j)^T) / (Σ_i γ_ij). The responsibility calculation and the parameter update are executed alternately until convergence.
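A NumPy sketch of these two alternating steps for a full-covariance GMM (an illustrative implementation under my own naming, not the course code):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) evaluated at every row of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_em(X, k, n_iter=100, seed=0):
    """EM for a k-component Gaussian mixture; returns (w, mu, Sigma, gamma)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(N, size=k, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij = p(q(x_i) = j)
        dens = np.stack([w[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the three update formulas above
        Nj = gamma.sum(axis=0)
        w = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return w, mu, Sigma, gamma
```

The E-step computes the γ_ij of slide 29 and the M-step applies the three update formulas above.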

33 Example of GMM result

34 k-means is a special case of GMM In this special case, we set w_j ≡ 1/k and Σ_j = αI, and we quantize γ_ij to 0 or 1. Initialize parameters ⇔ initialize the µ_j's; calculate responsibilities ⇔ find the nearest µ_j and set the corresponding γ_ij = 1; update parameters ⇔ update the µ_j's as means. This interprets why k-means has limitations for clusters having different sizes, different densities, or irregular shapes.

35 Example of comparison between GMM and k-means

36 Interpretation: Expectation maximization Introduce latent variables z_i ∈ {1, ..., k}, representing the true cluster that generates x_i. Consider maximizing Π_i p(x_i, z_i | ϑ) = Π_i Π_j (w_j N(x_i | µ_j, Σ_j))^{I(z_i = j)}, or equivalently Σ_i Σ_j I(z_i = j) log(w_j N(x_i | µ_j, Σ_j)). Expectation maximization (EM) executes two steps alternately: 1. E-step: given ϑ^t, calculate the expectation of the objective function with the latent variables eliminated. Since γ_ij is the expectation of I(z_i = j), this gives Σ_i Σ_j γ_ij log(w_j N(x_i | µ_j, Σ_j)). 2. M-step: maximize this expected objective to find ϑ^{t+1}. Using the constraint Σ_j w_j = 1, we can derive the GMM update equations.

37 EM as an algorithm
Input: p(X, Z | ϑ), data X = {x_1, ..., x_N}, Z unobserved
Output: ϑ̂
1: t ← 0, initialize ϑ^0
2: repeat
3:   Given ϑ^t, calculate the expectation of log p(X, Z | ϑ) with Z eliminated; denote the expectation by Q(ϑ, ϑ^t)
4:   ϑ^{t+1} ← arg max_ϑ Q(ϑ, ϑ^t)
5:   t ← t + 1
6: until ϑ^t is similar to ϑ^{t−1}
7: ϑ̂ = ϑ^t

38 Is EM correct? Consider log p(X | ϑ) = Σ_Z p(Z | X, ϑ^t) log p(X | ϑ) = Σ_Z p(Z | X, ϑ^t) log p(X, Z | ϑ) − Σ_Z p(Z | X, ϑ^t) log p(Z | X, ϑ) = Q(ϑ, ϑ^t) + H(ϑ, ϑ^t). The first term is exactly Q(ϑ, ϑ^t); since ϑ^{t+1} = arg max_ϑ Q(ϑ, ϑ^t), we know Q(ϑ^{t+1}, ϑ^t) ≥ Q(ϑ^t, ϑ^t). The second term H(ϑ, ϑ^t) = −Σ_Z p(Z | X, ϑ^t) log p(Z | X, ϑ) is the cross-entropy between p(Z | X, ϑ^t) and p(Z | X, ϑ), so for any ϑ, H(ϑ, ϑ^t) ≥ H(ϑ^t, ϑ^t) (the latter is the entropy, and the difference is the K-L divergence). Thus EM ensures that log p(X | ϑ^{t+1}) ≥ log p(X | ϑ^t). This is a greedy algorithm for maximizing the likelihood; the likelihood is non-decreasing, so the iteration converges, but a global optimum cannot be guaranteed.

39 Variants of EM We may try different initial values to escape a local optimum. We need not fully maximize Q(ϑ, ϑ^t): achieving Q(ϑ^{t+1}, ϑ^t) ≥ Q(ϑ^t, ϑ^t) is enough (e.g., by gradient ascent); this is especially helpful when the maximization problem has no closed-form solution.

40 Section 6.4 Other approaches to clustering

41 Density-based clustering Clusters correspond to high-density regions, while low-density regions separate clusters; noisy data and outliers are excluded. Mean-shift and DBSCAN are the best-known representatives.

42 Mean-shift Mean-shift is an iterative algorithm for locating modes (i.e., local maxima of the density), where the density is estimated with a Parzen window (non-parametric). Initialize a mode estimate x^0 and iteratively refine it. Given x^t, the local density around it is estimated with a kernel function K(·) as p̂(x^t) ∝ Σ_i K(x_i − x^t), and the local mean is m(x^t) = Σ_{x_i ∈ N(x^t)} x_i K(x_i − x^t) / Σ_{x_i ∈ N(x^t)} K(x_i − x^t); let x^{t+1} ← m(x^t). After convergence, restart elsewhere to find the next mode.
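A short sketch of the refinement loop with a Gaussian kernel (hypothetical helper names, assuming NumPy):

```python
import numpy as np

def mean_shift_mode(X, x0, bandwidth=1.0, n_iter=100, tol=1e-5):
    """Refine one mode estimate x0 over data X (N, D) by iterating the mean-shift update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Gaussian (Parzen) kernel weight of every point relative to the current estimate
        w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
        new_x = (w[:, None] * X).sum(axis=0) / w.sum()  # weighted local mean m(x^t)
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x
```

Starting this loop from many (or all) data points and merging the modes that coincide yields the clusters.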

43 DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) treats a cluster as a high-density region. For each point, find its neighbors; if the number of neighboring points is less than a threshold, the point is regarded as noise. Otherwise, the point and its neighbors belong to a cluster, and the cluster is expanded until it reaches a low-density region.
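For experiments, an off-the-shelf implementation is usually enough; a minimal usage sketch, assuming scikit-learn is available (the data here is synthetic and only for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus sparse background points acting as noise
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(4.0, 0.3, (100, 2)),
               rng.uniform(-2.0, 6.0, (20, 2))])

# eps is the neighborhood radius, min_samples the density threshold
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```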

44 Connectivity-based clustering Use a graph to represent data points and their relations; find clusters as subgraphs that are highly connected internally. Also called graph-based clustering.

45 Agglomerative clustering Bottom-up strategy: combine points into clusters progressively.

46 Divisive clustering Also called highly connected subgraphs (HCS) clustering. Top-down strategy: split a graph into two subgraphs by finding the minimum cut, and split the subgraphs further, until every remaining subgraph is highly connected. To evaluate whether a subgraph with n_v vertices and n_e edges is highly connected, we use the ratio 2 n_e / (n_v (n_v − 1)).

47 Section 6.5 Principal component analysis (PCA)

48 Projection for dimensionality reduction Given x ∈ R^D, we want to find a projection matrix P ∈ R^{K×D} with K < D; then the dimensionality-reduced representation is y = Px, y ∈ R^K. This is linear dimensionality reduction, and the problem is how to find P. Principal component analysis (PCA) is the most widely used method.

49 Motivation of PCA PCA seeks a projection that preserves the data's Euclidean distances as much as possible. Note that Σ_i Σ_j ‖x_i − x_j‖² is related to the data variance.

50 PCA 1/3 Step 1: subtract the mean x̄ = (1/N) Σ_i x_i. Let X be the matrix whose rows are the centered samples, X = [(x_1 − x̄)^T; ...; (x_N − x̄)^T]; then X^T X is the (unnormalized) data covariance matrix.

51 PCA 2/3 Step 2: rotate the centered data x_i − x̄. Let C = X^T X and calculate its eigenvalues and eigenvectors, C u_i = λ_i u_i. This gives an orthonormal matrix U = [u_1, ..., u_D] with CU = UΛ, where Λ = diag{λ_1, ..., λ_D} and λ_1 ≥ λ_2 ≥ ... ≥ λ_D ≥ 0. Let X̃ = XU; then X̃^T X̃ = Λ is a diagonal matrix.

52 PCA 3/3 Step 3: select the K largest entries of {λ_1, ..., λ_D} and let the corresponding eigenvectors form the rows of P, i.e., P = [u_1, ..., u_K]^T. Then y_i = P(x_i − x̄) completes the dimensionality reduction.
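The three steps condense into a few lines of NumPy (a sketch with my own function name, not course code):

```python
import numpy as np

def pca(X, K):
    """PCA of data X (N, D) via eigendecomposition; returns (Y, P, mean)."""
    mean = X.mean(axis=0)
    Xc = X - mean                         # Step 1: subtract the mean
    C = Xc.T @ Xc                         # (unnormalized) covariance matrix
    lam, U = np.linalg.eigh(C)            # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:K]     # Step 3: indices of the K largest eigenvalues
    P = U[:, order].T                     # rows of P are the top-K eigenvectors
    return Xc @ P.T, P, mean              # y_i = P (x_i - mean) for every sample
```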

53 Example: Eigenface 1/2 Consider applying PCA to face data; the calculated eigenvectors are termed eigenfaces. Note that we usually reshape each image into a vector; otherwise we would need 2D-PCA.

54 Example: Eigenface 2/2 Using the eigenfaces, for a new input face we can perform dimensionality reduction y_{N+1} = P x_{N+1} (we omit the mean subtraction step). Note that the rows of P are orthonormal (P P^T = I), so x_{N+1} ≈ P^T y_{N+1}, i.e., a decomposition over the eigenfaces. This is also a commonly used step for feature extraction.

55 Kernel PCA 1/3 PCA finds a linear projection; how do we deal with nonlinearity? Consider using basis functions φ_i = φ(x_i); then the data covariance matrix is Φ^T Φ, where Φ = [φ_1^T; ...; φ_N^T] (here we assume Σ_i φ_i = 0). We may also use the kernel trick, where k(x, y) = φ(x) · φ(y) but φ(·) is never used explicitly. Note that we can calculate the kernel matrix K with entries K_{mn} = k(x_m, x_n), so K = ΦΦ^T, and we can calculate its eigenvalues/eigenvectors: K u_i = λ_i u_i.

56 Kernel PCA 2/3 Note that Φ^T Φ Φ^T u_i = Φ^T K u_i = λ_i Φ^T u_i, which means Φ^T u_i is an eigenvector of Φ^T Φ. We cannot ensure that Φ^T u_i is a unit vector, so we normalize it: ū_i = Φ^T u_i / ‖Φ^T u_i‖ = Φ^T u_i / √(u_i^T Φ Φ^T u_i) = Φ^T u_i / √λ_i. Then let P = [ū_1, ..., ū_K]^T, where the corresponding K eigenvalues are the largest. The projection is y_i = P φ_i = [u_1^T / √λ_1; ...; u_K^T / √λ_K] [k(x_1, x_i), ..., k(x_N, x_i)]^T.

57 Kernel PCA 3/3 Previously we assumed Σ_i φ_i = 0, but this is not satisfied by an arbitrary kernel, so we have to centralize the kernel. Note that k_ij = φ_i · φ_j; now we want k̃_ij = (φ_i − φ̄) · (φ_j − φ̄), where φ̄ = (1/N) Σ_i φ_i. This gives k̃_ij = k_ij − (1/N) Σ_m k_mj − (1/N) Σ_n k_in + (1/N²) Σ_m Σ_n k_mn. Replace K with K̃ in the previous steps.
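Putting the three slides together, a small NumPy sketch of kernel PCA with an RBF kernel (the kernel choice and the names are my own; the course example uses a different kernel):

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Kernel PCA with the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    N = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq)                       # Gram matrix K = Phi Phi^T
    # Centralize: K_tilde = (I - 1/N) K (I - 1/N)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, U = np.linalg.eigh(Kc)                   # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:n_components]
    lam = np.clip(lam[idx], 1e-12, None)
    U = U[:, idx]
    # Projections of the training points onto the top components
    return Kc @ (U / np.sqrt(lam))                # shape (N, n_components)
```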

58 Example of kernel PCA Left: input points in the 2-D plane; Right: projected points using the kernel k(x, y) = (1 + x · y)

59 Section 6.6 Other approaches to dimensionality reduction

60 Nonlinear dimensionality reduction
Kernel PCA
ISOMAP
Locally linear embedding (LLE)
Self-organizing map (Chap. 10)
Autoencoder (Chap. 10)
Laplacian eigenmap
t-distributed stochastic neighbor embedding (t-SNE)

61 Manifold learning A manifold is a topological space that locally resembles Euclidean space near each point. 1-D manifolds include lines and circles, but not figure-eight curves; 2-D manifolds are also called surfaces, such as spheres. The intrinsic dimension of a manifold can be lower than that of the space it resides in. Manifold learning aims to identify such low-dimensional structure from high-dimensional data. For manifold learning, Euclidean distance is not appropriate and is replaced by geodesic distance.

62 ISOMAP While PCA seeks to preserve the data's Euclidean distances as much as possible, ISOMAP seeks to preserve the data's geodesic distances as much as possible. In ISOMAP, geodesic distance is defined as the shortest-path distance on a graph, where the graph is constructed by connecting each point to its nearest neighbors. Given d_ij = d(x_i, x_j), we seek arg min_{y_1, ..., y_N} Σ_i Σ_j (‖y_i − y_j‖ − d_ij)², which is solved by multi-dimensional scaling (MDS).

63 Example of MDS MDS seeks an appropriate coordinate system for a given set of pairwise distances.
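A sketch of classical (eigendecomposition-based) MDS, the variant typically used inside ISOMAP; the function name and interface are my own:

```python
import numpy as np

def classical_mds(D, K=2):
    """Embed N points in K dimensions from an (N, N) pairwise distance matrix D."""
    N = len(D)
    J = np.eye(N) - np.full((N, N), 1.0 / N)   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    lam, U = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:K]            # keep the K largest eigenvalues
    lam = np.clip(lam[idx], 0.0, None)
    return U[:, idx] * np.sqrt(lam)            # coordinates whose distances approximate D
```

Feeding it the graph shortest-path distances of the previous slide gives an ISOMAP embedding.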

64 Locally linear embedding Locally linear embedding (LLE) seeks to preserve locally linear relations as much as possible. In LLE, first a locally linear relation matrix W is found: arg min_W Σ_i ‖x_i − Σ_{x_j ∈ N(x_i)} w_ij x_j‖², s.t. Σ_j w_ij = 1, and w_ij = 0 if x_j ∉ N(x_i). Second, a low-dimensional embedding is found: arg min_{y_1, ..., y_N} Σ_i ‖y_i − Σ_j w_ij y_j‖².

65 Section 6.7 Semi-supervised learning

66 Supervised versus unsupervised learning For example, consider classification versus clustering. Classification is good at predicting the true class but requires many labeled data for training. Clustering is able to find clusters without any labels, just data, but the clusters are not necessarily equal to classes. Can we combine supervised and unsupervised learning?

67 Motivation of semi-supervised learning One practical difficulty for supervised learning is the lack of accurate labels. Semi-supervised learning tries to use unlabeled data in addition to labeled data. This includes weakly-supervised learning, where we have labels but they are inaccurate or noisy, and transductive learning, where we do not build a model but just want predictions for the unlabeled data.
Supervised: data = labeled data; objective = a model ŷ = f(x) or q(y | x)
Semi-supervised (inductive): data = both labeled and unlabeled data; objective = a model ŷ = f(x) or q(y | x)
Transductive: data = both labeled and unlabeled data; objective = predictions for the unlabeled data

68 Examples of semi-supervised learning Image classification with additional unlabeled images; image classification from search-engine click-through data; image segmentation with only bounding-box labels; anomaly detection with limited labels.

69 Why does semi-supervised learning work? For generative methods: labeled data provide information about p(x, y) and unlabeled data provide additional information about p(x); the latter is helpful for estimating p(x, y). For discriminative methods: 1. Cluster assumption: if two points belong to the same cluster, they are likely to belong to the same class. 2. Density assumption: the decision boundary should lie in low-density regions that separate high-density regions.

70 A generative method for semi-supervised classification For labeled data we know (x_i, y_i), i = 1, ..., N; for unlabeled data we know x_j, j = 1, ..., M. We consider a generative model for the labeled data, p(x_i, y_i) = p(y_i) p(x_i | y_i), and assume a mixture model for the unlabeled data, p(x_j) = Σ_{y_j} p(y_j) p(x_j | y_j). We further parameterize the probabilities as p(y) = p(y | ϑ_0) and p(x | y) = p(x | y, ϑ_1). Then we maximize the log-likelihood Σ_i log(p(y_i | ϑ_0) p(x_i | y_i, ϑ_1)) + Σ_j log(Σ_{y_j} p(y_j | ϑ_0) p(x_j | y_j, ϑ_1)).

71 EM for semi-supervised classification Consider the y_j's as latent variables; then the expectation of log p({x_i, y_i}, {x_j, y_j} | ϑ) with respect to p({y_j} | ϑ^t) is Q(ϑ, ϑ^t) = Σ_i log(p(y_i | ϑ_0) p(x_i | y_i, ϑ_1)) + Σ_j Σ_k γ^t_jk log(p(y_j = k | ϑ_0) p(x_j | y_j = k, ϑ_1)), where γ^t_jk = p(y_j = k | x_j, ϑ^t) ∝ p(y_j = k | ϑ^t_0) p(x_j | y_j = k, ϑ^t_1).

72 Semi-supervised SVM 1/2 Recall the (supervised) SVM: min_{w,b,ξ} (1/2)‖w‖² + C Σ_i ξ_i, s.t. ∀i: ξ_i ≥ 0, y_i(w^T x_i + b) ≥ 1 − ξ_i.

73 Semi-supervised SVM 2/2 The semi-supervised SVM, also called transductive SVM, additionally treats the unknown labels y_j of the unlabeled data as variables: min_{w,b,ξ,y,ζ} (1/2)‖w‖² + C Σ_i ξ_i + C' Σ_j ζ_j, s.t. ∀i: ξ_i ≥ 0, y_i(w^T x_i + b) ≥ 1 − ξ_i; ∀j: y_j ∈ {+1, −1}, ζ_j ≥ 0, y_j(w^T x_j + b) ≥ 1 − ζ_j.

74 A graph-based method for transductive classification
Input: (x_i, y_i), i = 1, ..., N; x_j, j = 1, ..., M
Output: ŷ_j
1: Construct a graph whose vertices are the x_i and the x_j
2: In the graph, if j′ ∈ N(i′), construct an edge from i′ to j′ with edge weight w_{i′j′} = 1 / d(x_{i′}, x_{j′})
3: t ← 0, initialize the class-probability vectors p^0_{i′}
4: repeat
5:   ∀j′, p^{t+1}_{j′} ← 0
6:   ∀i′ ∈ {1, ..., M, M+1, ..., M+N}, ∀j′: if w_{i′j′} ≠ 0, p^{t+1}_{j′} ← p^{t+1}_{j′} + (w_{i′j′} / Σ_k w_{i′k}) p^t_{i′}
7:   ∀j′, normalize p^{t+1}_{j′} to be a unit vector
8: until convergence
9: ŷ_j = arg max of p^{t+1}_j
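A NumPy sketch in the same spirit (a common label-propagation variant that renormalizes to probabilities and re-clamps the labeled vertices each round; the names and these details are my own, not exactly the pseudocode above):

```python
import numpy as np

def propagate_labels(W, y_labeled, labeled_idx, n_classes, n_iter=200):
    """Spread class probabilities over a graph with weight matrix W (V, V)."""
    V = W.shape[0]
    P = np.full((V, n_classes), 1.0 / n_classes)       # start from uniform class probabilities
    P[labeled_idx] = np.eye(n_classes)[y_labeled]      # labeled vertices are one-hot
    T = W / (W.sum(axis=1, keepdims=True) + 1e-12)     # row-normalized edge weights
    for _ in range(n_iter):
        P = T.T @ P                                    # each vertex gathers from its neighbors
        P /= P.sum(axis=1, keepdims=True) + 1e-12      # renormalize to probabilities
        P[labeled_idx] = np.eye(n_classes)[y_labeled]  # keep the known labels fixed
    return P.argmax(axis=1)                            # predicted class for every vertex
```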

75 Random walk and PageRank The loop above is actually a random walk. For example, PageRank is an unsupervised method based on random walk that determines the relative importance of webpages.
Input: a graph of webpages and hyperlinks
Output: relative importance of all webpages
1: t ← 0
2: Initialize the importance values r^0_i
3: repeat
4:   ∀j, r^{t+1}_j ← 0
5:   ∀i, ∀j: if w_ij ≠ 0, r^{t+1}_j ← r^{t+1}_j + (w_ij / Σ_k w_ik) r^t_i
6:   ∀j, r^{t+1}_j ← β r^{t+1}_j + (1 − β) / N
7: until convergence
β is called the damping factor, used to avoid traps. [Figure: an example of a trap]
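A power-iteration sketch of this loop (illustrative; the variable names are my own, and dangling pages are handled only crudely):

```python
import numpy as np

def pagerank(W, beta=0.85, n_iter=100, tol=1e-8):
    """PageRank on an adjacency/weight matrix W (V, V); returns importance scores."""
    V = W.shape[0]
    deg = W.sum(axis=1)
    T = np.zeros_like(W, dtype=float)
    nz = deg > 0
    T[nz] = W[nz] / deg[nz, None]                   # row-normalized transition matrix
    r = np.full(V, 1.0 / V)                         # start from uniform importance
    for _ in range(n_iter):
        new_r = beta * (T.T @ r) + (1 - beta) / V   # damped random-walk step
        if np.abs(new_r - r).sum() < tol:
            break
        r = new_r
    return r
```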

76 In this chapter
Dictionary: clustering, curse of dimensionality, dimensionality reduction, manifold learning, semi-supervised learning, transductive learning, unsupervised learning
Toolbox: expectation maximization, Gaussian mixture model, k-means, principal component analysis (and kernel PCA), random walk, transductive support vector machine
