Machine Learning. CUNY Graduate Center, Spring 2013. Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring 2013. Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models). Professor Liang Huang, huang@cs.qc.cuny.edu, http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

Roadmap so far: (large-margin) supervised learning: binary, multiclass, and structured classification; online learning (averaged perceptron/MIRA, convergence proof); kernels and the kernelized perceptron in the dual; SVMs (formulation, KKT conditions, dual, convex optimization, slacks); structured perceptron, HMMs, and the Viterbi algorithm (e.g., POS tagging: "the man bit the dog" => DT NN VBD DT NN). What we left out: many classical algorithms such as nearest neighbors (instance-based), decision trees, logistic regression, etc. Next up: unsupervised learning: clustering (k-means, EM, mixture models, hierarchical) and dimensionality reduction (linear: PCA/ICA, MDS; nonlinear: isomap).

Sup => Unsup: Nearest Neighbor => k-means. Let's first look at a supervised learning method: nearest neighbor. SVM, perceptron (in the dual), and NN are all instance-based learning, i.e., they store a subset of the examples for classification. Compression rate: SVM: very high; perceptron: medium-high; NN: 0 (it stores every example).
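To make the instance-based view concrete, here is a minimal 1-nearest-neighbor sketch (illustrative only, not the course's code; the function and variable names are made up):

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """1-nearest neighbor: store all training examples and label each test
    point with the label of its closest stored example (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to every stored example
        preds.append(y_train[np.argmin(dists)])      # copy the nearest example's label
    return np.array(preds)

# toy usage: two clusters of stored points
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(nn_predict(X_train, y_train, np.array([[0.2, 0.1], [4.8, 5.2]])))  # -> [0 1]
```

Note that the "model" is the training set itself, which is why the compression rate is 0.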

k-Nearest Neighbor: using k neighbors instead of one is a way to prevent overfitting => more stable results.

NN Voronoi diagrams in 2D and 3D (figures).

Voronoi diagrams for Euclidean and Manhattan distances (figures).

Unsupervised Learning. The cost of supervised learning: labeled data is expensive to annotate, yet huge amounts of unlabeled data exist. Unsupervised learning can only "hallucinate" the labels; it infers some internal structure of the data. It is still the compression view of learning: too much data => reduce it! Clustering reduces the number of examples; dimensionality reduction reduces the number of dimensions.

Challenges in Unsupervised Learning. How to evaluate the results? There is no gold-standard data; is there an internal metric? How to interpret the results, e.g., how to name the clusters? How to initialize the model/guess? A bad initial guess can lead to very bad results; unsupervised learning is very sensitive to initialization (unlike supervised learning). How to do the optimization? In general it is no longer convex.

k-means: (randomly) pick k points to be the initial centroids, then repeat two steps until convergence: assignment of each point to its nearest centroid (Voronoi regions, as in NN), and recomputation of the centroids based on the new assignment. (The original slides step through panels (a)-(i) of this process on a 2-D data set, one assignment or recomputation per slide.)

How to define convergence for k-means? After a fixed number of iterations; or when the assignments do not change; or when the centroids do not change (are these equivalent?); or when the change in the objective function falls below a threshold.
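A minimal numpy sketch of the two steps (illustrative, not the course's reference implementation); convergence here is tested by "centroids do not change":

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # randomly pick k points
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid (Voronoi)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids do not change => converged
            break
        centroids = new_centroids
    return centroids, assign
```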

k-means objective function: the residual sum of squares (RSS), i.e., the sum of squared distances from the points to their centroids. It is guaranteed to decrease monotonically; convergence proof: monotonic decrease plus a finite number of possible clusterings. (Figure: the objective J dropping sharply over the first few iterations.)
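Written out (notation is assumed, following the usual convention that $\mu_k$ is the centroid of cluster $C_k$):

$$\mathrm{RSS} = J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2.$$

The assignment step can only decrease $J$ (each point moves to a closer centroid), and the recomputation step can only decrease $J$ (the mean minimizes the summed squared distance within a cluster); combined with the finite number of possible clusterings, this gives the convergence proof.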

k-means for image segmentation (figures).

Problems with k-means: it is sensitive to initialization, because the objective function is non-convex and has many local minima. Moreover, k-means works well only if the clusters are spherical, well separated, of similar volumes, and contain similar numbers of examples.

Better ("soft") k-means? Random restarts definitely help; for soft clusters => EM with a Gaussian mixture model. (Figures: the k-means result and a one-dimensional Gaussian mixture density p(x).)

k-means, restated in EM terms: randomize k initial centroids and repeat two steps until convergence. E-step: assign each example to its nearest centroid (Voronoi); M-step: recompute the centroids based on the new assignment.

EM for Gaussian Mixtures: randomize the k means, covariances, and mixing coefficients, then repeat two steps until convergence. E-step: evaluate the responsibilities (fractional assignments) using the current parameters; M-step: re-estimate the parameters using the current responsibilities. (The original slides show the fitted mixture on a 2-D data set after L = 1, 2, 5, and 20 iterations.)
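A compact numpy/scipy sketch of the two steps (illustrative only; it assumes full covariances, a fixed number of iterations instead of a convergence test, and a small ridge on the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]            # random initial means
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)      # start from the data covariance
    pis = np.full(k, 1.0 / k)                                   # uniform mixing coefficients
    for _ in range(n_iter):
        # E-step: responsibilities r[j, i] proportional to pi_i * N(x_j | mu_i, Sigma_i)
        dens = np.column_stack([pis[i] * multivariate_normal.pdf(X, means[i], covs[i])
                                for i in range(k)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the fractional assignments
        Nk = r.sum(axis=0)
        means = (r.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (r[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        pis = Nk / n
    loglik = np.log(dens.sum(axis=1)).sum()    # log-likelihood from the last E-step
    return means, covs, pis, loglik
```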

Convergence: EM converges much more slowly than k-means, and we can't use "the assignments don't change" as the convergence test; instead, use the log-likelihood of the data. Stop if the increase in log-likelihood is smaller than a threshold, or when a maximum number of iterations has been reached:

$$L = \log P(\text{data}) = \log \prod_j P(x_j) = \sum_j \log P(x_j) = \sum_j \log \sum_i P(c_i)\, P(x_j \mid c_i).$$

EM: pros and cons (vs. k-means). Pros: it doesn't need the clusters to be spherical, well separated, or of similar sizes/volumes. Cons: it converges much more slowly than k-means, and the per-iteration computation is also slower (to speed up EM, use k-means as a burn-in); and, same as k-means, it can get stuck in local optima.

k-means is a special case of EM: k-means is "hard" EM in which each covariance matrix is diagonal (i.e., spherical) and the diagonal variances approach 0. (Figure: the k-means result side by side with the EM fit after L = 20 iterations.)
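To make the limit concrete (this is the standard argument, not spelled out on the slide): with shared spherical covariances $\epsilon I$, the responsibilities are

$$\gamma_{nk} = \frac{\pi_k \exp\!\big(-\lVert x_n - \mu_k \rVert^2 / 2\epsilon\big)}{\sum_j \pi_j \exp\!\big(-\lVert x_n - \mu_j \rVert^2 / 2\epsilon\big)} \;\longrightarrow\; \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{as } \epsilon \to 0,$$

so the E-step becomes the hard Voronoi assignment and the M-step reduces to recomputing cluster means, which is exactly k-means.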

Why does EM increase p(data) iteratively?

Why does EM increase p(data) iteratively? It converges to a local maximum: each iteration maximizes a convex auxiliary function (a lower bound on the log-likelihood), and the gap between the bound and the log-likelihood is a KL-divergence.
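For reference, the standard decomposition behind this argument (not written out on the slide; $q(Z)$ is any distribution over the latent assignments $Z$):

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q \,\Vert\, p(Z \mid X, \theta)\big), \qquad \mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}.$$

The E-step sets $q(Z) = p(Z \mid X, \theta_{\text{old}})$, which makes the KL term zero so the bound is tight; the M-step then maximizes $\mathcal{L}(q, \theta)$ over $\theta$, so $\ln p(X \mid \theta_{\text{new}}) \ge \ln p(X \mid \theta_{\text{old}})$.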

How to maximize the auxiliary? (Figure: the lower bound $\mathcal{L}(q, \theta)$ touches $\ln p(x \mid \theta)$ at $\theta_{\text{old}}$; maximizing the bound over $\theta$ yields $\theta_{\text{new}}$.)

Machine Learning: A Geometric Approach. CUNY Graduate Center, Spring 2013. Lecture 13: Unsupervised Learning 2 (dimensionality reduction; linear: PCA, ICA, CCA, LDA 1, LDA 2; non-linear: kernel PCA, isomap, LLE, SDE). Professor Liang Huang, huang@cs.qc.cuny.edu, http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

Dimensionality Reduction (introductory figures).

Dimensionality Reduction: an example from the Isomap paper (Tenenbaum, de Silva, Langford, Science 2000); the recovered low-dimensional axes correspond to wrist rotation and finger extension.

Algorithms. Linear methods: PCA (principal component analysis), ICA (independent component analysis), CCA (canonical correlation analysis), MDS (multidimensional scaling), LDA 1 (linear discriminant analysis), LDA 2 (latent Dirichlet allocation). Non-linear methods: kernelized PCA, isomap, LLE (locally linear embedding), SDE (semidefinite embedding), LEM (Laplacian eigenmaps). All are spectral methods, i.e., they use eigenvalues.

PCA: greedily find d orthogonal axes onto which the variance under projection is maximal; this is the max-variance subspace formulation. The 1st PC is the direction of greatest variability in the data; the 2nd PC is the next uncorrelated max-variance direction (remove all variance along the 1st PC, then redo the max-variance search). Another, equivalent formulation is minimum reconstruction error: find orthogonal vectors onto which projection yields the minimum-MSE reconstruction.

PCA optimization (max-variance projection): first translate the data to zero mean; compute the covariance matrix; find the top d eigenvalues and eigenvectors of the covariance matrix; project the data onto those eigenvectors.
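A numpy sketch of exactly these steps (illustrative; eigh is used because the covariance matrix is symmetric, and it returns eigenvalues in ascending order):

```python
import numpy as np

def pca(X, d):
    """Project X (n x D) onto its top-d principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                           # translate data to zero mean
    cov = Xc.T @ Xc / (len(X) - 1)          # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of the symmetric covariance
    top = eigvecs[:, ::-1][:, :d]           # top-d eigenvectors (largest eigenvalues first)
    return Xc @ top, top, mean              # projected data, components, and the mean
```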

PCA for k-means and whitening: rescaling to zero mean and unit variance is standard preprocessing (we did that in perceptron HW1/2 also!), but PCA can do more: whitening (sphering), which also removes the correlations between dimensions.
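Building on the PCA sketch above, whitening additionally rescales each principal direction by the inverse square root of its eigenvalue, so the transformed data has (approximately) identity covariance (again a sketch; the small eps guards against dividing by a zero eigenvalue):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Whiten (sphere) X: zero mean, unit variance along every principal direction."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)  # rotate onto the PCs, then rescale each axis
```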

Eigendigits (figures).

Eigenfaces (figures).

LDA: Fisher's linear discriminant analysis. PCA finds a small representation of the data; LDA instead performs dimensionality reduction while preserving as much of the class-discrimination information as possible. It is a linear classification method (like the perceptron): find a scalar projection that maximizes separability, i.e., maximize the separation between the projected class means while minimizing the variance within each projected class.
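In symbols (the standard two-class Fisher criterion; the notation is not from the slide): find the projection direction $w$ that maximizes

$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}, \qquad S_B = (m_2 - m_1)(m_2 - m_1)^\top, \qquad S_W = \sum_{c}\sum_{x \in \text{class } c} (x - m_c)(x - m_c)^\top,$$

whose solution is $w \propto S_W^{-1}(m_2 - m_1)$: the numerator measures the separation between the projected class means and the denominator the variance within each projected class.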

PCA vs. LDA (figure).

Linear vs. non-linear (figure-only slides): side-by-side comparisons of PCA against non-linear methods (LLE, isomap) on the same data sets.

PCA vs. kernel PCA (figure).

The brain does non-linear dimensionality reduction.