Statistical Machine Learning
Christoph Lampert, Spring Semester 2015/2016 // Lecture 12

Unsupervised Learning: Dimensionality Reduction

Dimensionality Reduction
Given: data X = {x_1, ..., x_m} ⊂ R^d
Transductive task: find a lower-dimensional representation Y = {y_1, ..., y_m} ⊂ R^n with n < d, such that Y "represents X well".
Inductive task: find a function φ : R^d → R^n and set y_i = φ(x_i) (allows computing φ(x) for x ∉ X: "out-of-sample extension")

Linear Dimensionality Reduction
Choice 1: φ : R^d → R^n is linear or affine.
Choice 2: "Y represents X well" means: there is a ψ : R^n → R^d such that ∑_{i=1}^m ‖x_i − ψ(y_i)‖² is small.

Principal Component Analysis
Given X = {x_1, ..., x_m} ⊂ R^d, find functions φ(x) = Wx and ψ(y) = Uy by solving
min_{U ∈ R^{d×n}, W ∈ R^{n×d}} ∑_{i=1}^m ‖x_i − UWx_i‖²

Principal Component Analysis (PCA)
U, W = argmin_{U ∈ R^{d×n}, W ∈ R^{n×d}} ∑_{i=1}^m ‖x_i − UWx_i‖²    (PCA)
Lemma: If U, W are minimizers of the above PCA problem, then the columns of U are orthogonal and W = Uᵀ.

Theorem: Let A = ∑_{i=1}^m x_i x_iᵀ and let u_1, ..., u_n be n eigenvectors of A that correspond to the n largest eigenvalues of A. Then U = (u_1 u_2 ... u_n) and W = Uᵀ are minimizers of the PCA problem.
A has orthogonal eigenvectors, since it is symmetric (positive semi-definite). U can also be obtained from a singular value decomposition, X = USVᵀ.

Principal Component Analysis: Visualization (sequence of figures; axis ticks omitted)

Principal Component Analysis: Affine
Given X = {x_1, ..., x_m} ⊂ R^d, find functions φ(x) = Wx + w and ψ(y) = Uy + u by solving
U, W, u, w = argmin_{U ∈ R^{d×n}, W ∈ R^{n×d}, u ∈ R^d, w ∈ R^n} ∑_{i=1}^m ‖x_i − (U(Wx_i + w) + u)‖²    (Affine PCA)
Theorem: Let µ = (1/m) ∑_{i=1}^m x_i be the mean and C = (1/m) ∑_{i=1}^m (x_i − µ)(x_i − µ)ᵀ the covariance matrix of X. Let u_1, ..., u_n be n eigenvectors of C that correspond to the n largest eigenvalues. Then U = (u_1 u_2 ... u_n), W = Uᵀ, w = −Wµ and u = µ are minimizers of the affine PCA problem.
Simpler to remember: φ(x) = W(x − µ), ψ(y) = Uy + µ
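
To make the recipe concrete, here is a minimal numpy sketch of the affine PCA solution from the theorem above (the function name affine_pca and the row-wise data layout are illustrative choices, not from the slides):

```python
import numpy as np

def affine_pca(X, n):
    """Affine PCA as in the theorem above: center the data, take the
    top-n eigenvectors of the covariance matrix.
    X: (m, d) array with one sample per row; n: target dimension."""
    mu = X.mean(axis=0)                      # mean mu
    Xc = X - mu                              # centered data
    C = Xc.T @ Xc / X.shape[0]               # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :n]              # top-n eigenvectors as columns
    phi = lambda x: (x - mu) @ U             # phi(x) = W(x - mu), W = U^T
    psi = lambda y: y @ U.T + mu             # psi(y) = U y + mu
    return U, mu, phi, psi

# usage sketch: project 5-dimensional toy data to 2 dimensions
X = np.random.randn(100, 5)
U, mu, phi, psi = affine_pca(X, 2)
Y = phi(X)                                   # low-dimensional representation
X_rec = psi(Y)                               # reconstruction in R^d
```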

Principal Component Analysis: Alternative Views
There is (at least) one more way to interpret the PCA procedure. The following two goals are equivalent:
- find a subspace such that projecting to it orthogonally results in the smallest reconstruction error
- find a subspace such that projecting to it orthogonally preserves most of the data variance

Principal Component Analysis: Applications
Data Visualization: if the original data is high-dimensional, use PCA with n = 2 or n = 3 to obtain a low-dimensional representation that can be visualized.
Data Compression: if the original data is high-dimensional, use PCA to obtain a lower-dimensional representation that requires less RAM/storage. n is typically chosen such that 95% or 99% of the variance is preserved.
Data Denoising: if the original data is noisy, apply PCA and reconstruction to obtain a less noisy representation. n depends on the noise level if known, otherwise chosen as for compression.
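
For the compression rule of thumb, a small sketch of how one might pick n so that 95% of the variance is preserved (the threshold and the helper name are illustrative):

```python
import numpy as np

def n_for_variance(X, target=0.95):
    """Smallest n such that the top-n principal components
    preserve at least `target` of the total variance."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / X.shape[0])[::-1]  # descending
    ratio = np.cumsum(eigvals) / eigvals.sum()                  # cumulative variance
    return int(np.searchsorted(ratio, target) + 1)
```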

Example: Genes mirror geography in Europe [Novembre et al., Nature 2008]

Canonical Correlation Analysis (CCA) [Hotelling, 1936]
Given: paired data X_1 = {x^1_1, ..., x^1_m} ⊂ R^d and X_2 = {x^2_1, ..., x^2_m} ⊂ R^d, for example (after some preprocessing):
- DNA expression and gene expression (Monday's colloquium)
- images and text captions
Canonical Correlation Analysis (CCA): find projections φ_1(x^1) = U_1 x^1 and φ_2(x^2) = U_2 x^2 with U_1, U_2 ∈ R^{n×d}, such that after projection X_1 and X_2 are maximally correlated.

Canonical Correlation Analysis (CCA)
One dimension: find directions u_1 ∈ R^d, u_2 ∈ R^d such that
max_{u_1 ∈ R^d, u_2 ∈ R^d} corr(u_1ᵀ X_1, u_2ᵀ X_2).
With C_11 = cov(X_1, X_1), C_22 = cov(X_2, X_2) and C_12 = cov(X_1, X_2):
max_{u_1, u_2}  u_1ᵀ C_12 u_2 / ( sqrt(u_1ᵀ C_11 u_1) sqrt(u_2ᵀ C_22 u_2) )
Find u_1, u_2 by solving the generalized eigenvalue problem for the maximal λ:
[ 0      C_12 ] [u_1]       [ C_11  0    ] [u_1]
[ C_12ᵀ  0    ] [u_2]  = λ  [ 0     C_22 ] [u_2]
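
As a sketch of how this generalized eigenvalue problem can be solved in practice, assuming paired data matrices X1, X2 and using scipy.linalg.eigh for the symmetric generalized problem (the small ridge added to B is my own regularization choice, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def cca_first_direction(X1, X2):
    """First pair of CCA directions via the generalized eigenvalue
    problem above. X1, X2: (m, d1) and (m, d2) arrays of paired samples."""
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    m, d1 = X1c.shape
    d2 = X2c.shape[1]
    C11 = X1c.T @ X1c / m
    C22 = X2c.T @ X2c / m
    C12 = X1c.T @ X2c / m
    # block matrices A v = lambda B v with v = [u1; u2]
    A = np.block([[np.zeros((d1, d1)), C12],
                  [C12.T, np.zeros((d2, d2))]])
    B = np.block([[C11, np.zeros((d1, d2))],
                  [np.zeros((d2, d1)), C22]])
    B = B + 1e-8 * np.eye(d1 + d2)   # small ridge for numerical stability
    vals, vecs = eigh(A, B)          # generalized symmetric eigenproblem
    v = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    return v[:d1], v[d1:]            # u1, u2
```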

Example: Canonical Correlation Analysis for fMRI Data
Data 1: video sequence; Data 2: fMRI signal while watching

Kernel Principal Component Analysis (Kernel-PCA)
Reminder: given samples x_i ∈ R^d, PCA finds the directions of maximal covariance. Assume ∑_i x_i = 0 (e.g. by first subtracting the mean).
The PCA directions u_1, ..., u_n are the eigenvectors of the covariance matrix
C = (1/m) ∑_{i=1}^m x_i x_iᵀ
sorted by their eigenvalues. We can express x_i in PCA-space by P(x_i) = ∑_{j=1}^n ⟨x_i, u_j⟩ u_j.
Lower-dim. coordinate mapping: x_i ↦ ( ⟨x_i, u_1⟩, ⟨x_i, u_2⟩, ..., ⟨x_i, u_n⟩ ) ∈ R^n

Kernel-PCA
Given samples x_i ∈ X and a kernel k : X × X → R with an implicit feature map φ : X → H, do PCA in the (implicit) feature space H.
The Kernel-PCA directions u_1, ..., u_n are the eigenvectors of the covariance operator
C = (1/m) ∑_{i=1}^m φ(x_i) φ(x_i)ᵀ
sorted by their eigenvalues.
Lower-dim. coordinate mapping: x_i ↦ ( ⟨φ(x_i), u_1⟩, ⟨φ(x_i), u_2⟩, ..., ⟨φ(x_i), u_n⟩ ) ∈ R^n

Kernel-PCA
Equivalently, we can use the eigenvectors u^j and eigenvalues λ_j of the kernel matrix K ∈ R^{m×m}, with K_ij = ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j).
Coordinate mapping: x_i ↦ ( √λ_1 u^1_i, ..., √λ_n u^n_i )
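
A minimal sketch of the kernel-matrix route, assuming a Gaussian kernel and centering K to mimic the zero-mean assumption in feature space (the centering step is standard Kernel-PCA practice but not spelled out on the slide; parameter names are my own):

```python
import numpy as np

def kernel_pca(X, n, gamma=1.0):
    """Kernel-PCA coordinates via the kernel matrix K, as above.
    Uses a Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    # center K (zero-mean assumption in feature space)
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    K = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(K)                 # ascending
    eigvals, eigvecs = eigvals[::-1][:n], eigvecs[:, ::-1][:, :n]
    # coordinate of x_i: sqrt(lambda_j) * u^j_i
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
```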

Example: Canonical Correlation Analysis for fMRI Data (figure)

Application: Image Superresolution
- collect high-res face images
- use Kernel-PCA with a Gaussian kernel to learn non-linear projections
- for a new low-res image: scale to the target high resolution, then project to the closest point in the face subspace (reconstruction in r dimensions)
[Kim, Jung, Kim, "Face recognition using kernel principal component analysis", Signal Processing Letters, 2002]

Random Projections
Recently, random matrices have been used for dimensionality reduction:
- let W ∈ R^{n×d} be a matrix with random entries (i.i.d. Gaussian)
- then one can show that φ : R^d → R^n with φ(x) = Wx does not distort Euclidean distances too much.
Theorem: For fixed x ∈ R^d let W ∈ R^{n×d} be a random matrix as above. Then, for every ε ∈ (0, 3),
P[ | (1/n) ‖Wx‖² / ‖x‖² − 1 | > ε ] ≤ 2 e^{−ε² n / 6}
Note: the dimension of the original data does not show up in the bound!
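
A quick numerical illustration of the theorem, with the 1/√n scaling folded into W so that squared norms are approximately preserved (dimensions and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 10_000, 200, 50                 # original dim, target dim, #points
X = rng.standard_normal((m, d))

# random Gaussian projection, scaled by 1/sqrt(n) so that
# squared distances are preserved in expectation
W = rng.standard_normal((n, d)) / np.sqrt(n)
Y = X @ W.T                               # projected data, shape (m, n)

# compare a pairwise distance before and after projection
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))
```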

Multidimensional Scaling (MDS)
Given: data X = {x_1, ..., x_m} ⊂ R^d
Task: find an embedding y_1, ..., y_m ∈ R^n that preserves the pairwise distances Δ_ij = ‖x_i − x_j‖.
Solve, e.g., by gradient descent on ∑_{i,j} ( ‖y_i − y_j‖ − Δ_ij )²
Multiple extensions:
- non-linear embedding
- take into account geodesic distances (e.g. IsoMap)
- arbitrary distances instead of Euclidean
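
A sketch of the gradient-descent approach to the stress objective above (step size, iteration count and initialization are illustrative choices):

```python
import numpy as np

def mds_gd(X, n=2, steps=500, lr=0.01, seed=0):
    """Metric MDS by gradient descent on the stress
    sum_{i<j} (||y_i - y_j|| - Delta_ij)^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Delta_ij
    Y = rng.standard_normal((m, n))                             # random init
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]
        dist = np.linalg.norm(diff, axis=-1) + np.eye(m)        # avoid div by 0
        coef = 2 * (dist - D) / dist                            # stress gradient weights
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```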

Multidimensional Scaling (MDS) (figure)

Multidimensional Scaling (MDS): 2D embedding of US Senate voting behavior (figure)

Unsupervised Learning: Clustering

Clustering
Given: data X = {x_1, ..., x_m} ⊂ R^d
Transductive task: partition the points in X into clusters S_1, ..., S_K.
Idea: elements within a cluster are similar to each other, elements in different clusters are dissimilar.
Inductive task: define a partitioning function f : R^d → {1, ..., K} and set S_k = { x ∈ X : f(x) = k }. (allows assigning a cluster label also to new points x ∉ X: "out-of-sample extension")

Clustering is fundamentally problematic and subjective (illustrated with several figures)

Clustering: Linkage-Based
General framework to create a hierarchical partitioning:
- initialize: each point x_i is its own cluster, S_i = {i}
- repeat: take the two most similar clusters and merge them into a single new cluster
- until K clusters are left
Open question: how to define similarity between clusters?

Clustering: Linkage-Based
Given: pairwise distances between individual points, d(x_i, x_j)
Single linkage clustering: smallest distance between any cluster elements,
d(S, S') = min_{i ∈ S, j ∈ S'} d(x_i, x_j)
Average linkage clustering: average distance between all cluster elements,
d(S, S') = (1 / (|S| |S'|)) ∑_{i ∈ S, j ∈ S'} d(x_i, x_j)
Max linkage clustering: largest distance between any cluster elements,
d(S, S') = max_{i ∈ S, j ∈ S'} d(x_i, x_j)
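
The merge loop from the previous two slides, written out as a small (unoptimized) sketch with the three linkage rules as options (the function name and full pairwise-distance matrix are my own choices):

```python
import numpy as np

def agglomerative(X, K, linkage="single"):
    """Hierarchical clustering by repeatedly merging the two closest
    clusters until K clusters remain (small, unoptimized sketch)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]      # start: every point alone

    def cluster_dist(S, T):
        block = D[np.ix_(S, T)]                  # all pairwise distances
        if linkage == "single":
            return block.min()                   # smallest distance
        if linkage == "average":
            return block.mean()                  # average distance
        return block.max()                       # "max" / complete linkage

    while len(clusters) > K:
        # find the two most similar clusters
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge
        del clusters[b]
    return clusters
```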

Example: Single Linkage Clustering
Theorem: The edges created by single linkage clustering form a minimum spanning tree.

Clustering: Centroid-Based
Let c_1, ..., c_K ∈ R^d be K cluster centroids. Then a centroid-based clustering function f : X → {1, ..., K} is given by the assignment
f(x) = argmin_{k=1,...,K} ‖x − c_k‖   (arbitrary tie break)
(similar to nearest-neighbor classification with training set {(c_1, 1), ..., (c_K, K)})

Clustering: Centroid-Based
K-means objective: find c_1, ..., c_K ∈ R^d by minimizing the total Euclidean error
∑_{i=1}^m ‖x_i − c_{f(x_i)}‖²

Lloyd's algorithm
- initialize c_1, ..., c_K (random subset of X, or smarter)
- repeat:
    - set S_k = {i : f(x_i) = k}   (current assignment)
    - c_k = (1/|S_k|) ∑_{i ∈ S_k} x_i   (mean of points in each cluster)
- until no more changes to the S_k
Demo: http://shabal.in/visuals/kmeans/6.html
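
A compact sketch of Lloyd's algorithm as stated above (the random-subset initialization and the handling of empty clusters are illustrative choices):

```python
import numpy as np

def lloyd_kmeans(X, K, iters=100, seed=0):
    """Lloyd's algorithm as sketched above: alternate between assigning
    points to the nearest centroid and recomputing centroids as means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random subset of X
    for _ in range(iters):
        # assignment step: f(x_i) = argmin_k ||x_i - c_k||
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)])
        if np.allclose(new_centroids, centroids):   # no more changes
            break
        centroids = new_centroids
    return centroids, labels
```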

Clustering: Centroid-Based
Alternatives:
- k-medoids: like k-means, but the centroids must be data points; the update step chooses the medoid of each cluster instead of the mean
- k-medians: like k-means, but minimize ∑_{i=1}^m ‖x_i − c_{f(x_i)}‖_1; the update step chooses the median of each coordinate within each cluster

Clustering: Graph-Based
For x_1, ..., x_m, form a graph G = (V, E) with vertex set V = {1, ..., m} and edge set E. Each partitioning of the graph defines a clustering of the original dataset.
Choice of edge set:
- ε-nearest neighbor graph: E = {(i, j) ∈ V × V : ‖x_i − x_j‖ < ε}
- k-nearest neighbor graph: E = {(i, j) ∈ V × V : x_i is a k-nearest neighbor of x_j}
- weighted graph: fully connected, with edge weights w_ij = exp(−λ ‖x_i − x_j‖²)
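
A sketch of the three graph constructions, returning a weight matrix W that the spectral clustering slide below can consume (function name and parameter defaults are my own choices):

```python
import numpy as np

def similarity_graph(X, kind="weighted", eps=1.0, k=5, lam=1.0):
    """Build the adjacency/weight matrix W for graph-based clustering,
    following the three constructions above (illustrative sketch)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if kind == "eps":                      # epsilon-nearest neighbor graph
        W = (D < eps).astype(float)
    elif kind == "knn":                    # k-nearest neighbor graph
        W = np.zeros_like(D)
        nn = np.argsort(D, axis=1)[:, 1:k + 1]   # skip self at position 0
        for i, neighbors in enumerate(nn):
            W[i, neighbors] = 1.0
        W = np.maximum(W, W.T)             # symmetrize
    else:                                  # fully connected, Gaussian weights
        W = np.exp(-lam * D**2)
    np.fill_diagonal(W, 0.0)
    return W
```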

Example: Graph-Based Clustering (sequence of figures)
- data set
- neighborhood graph
- min cut: biased towards small clusters
- normalized cut: balances weight of cut edges and volume of clusters

Spectral Clustering (approximate solution to Normalized Cut)
Input: weight matrix W ∈ R^{m×m}
- compute the graph Laplacian L = D − W, with D = diag(d_1, ..., d_m) and d_i = ∑_j w_ij
- let v ∈ R^m be the eigenvector of L corresponding to the second smallest eigenvalue (the smallest is 0, since L is singular)
- assign x_i to cluster 1 if v_i ≥ 0 and to cluster 2 otherwise
To obtain more than 2 clusters, apply recursively, each time splitting the largest remaining cluster.
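
A minimal sketch of the bipartitioning step, assuming the weight matrix W is given (for example from the similarity_graph sketch above):

```python
import numpy as np

def spectral_bipartition(W):
    """Spectral clustering step as above: split into two clusters using
    the eigenvector of L = D - W for the second smallest eigenvalue."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                      # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)    # ascending eigenvalues
    v = eigvecs[:, 1]                       # second smallest ("Fiedler vector")
    return (v >= 0).astype(int)             # cluster 1 vs. cluster 2

# usage sketch, building on the similarity_graph helper above:
# W = similarity_graph(X, kind="weighted", lam=2.0)
# labels = spectral_bipartition(W)
```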

Clustering Axioms [Kleinberg, "An Impossibility Theorem for Clustering", NIPS 2002]
Scale-Invariance: for any distance d and any α > 0, f(d) = f(α·d).
Richness: Range(f) is the set of all partitions of {1, ..., m}.
Consistency: let d and d' be two distance functions. If f(d) = Γ and d' is a Γ-transform of d, then f(d') = Γ.
Definition: d' is a Γ-transform of d iff d'(i, j) ≤ d(i, j) for any i, j in the same cluster, and d'(i, j) ≥ d(i, j) for any i, j in different clusters.

Theorem ("Impossibility of Clustering"): for each m ≥ 2, there is no clustering function f that satisfies all three axioms at the same time.
(But not all hope is lost: "Consistency" is debatable...)

Final Project
Part 1: go to https://kaggle.com/join/ist_sml2016/ and participate in the challenge "Final project for Statistical Machine Learning Course 2016 at IST Austria". Passing criterion: beat the baselines (linear SVM and LogReg).
Part 2: send Alex a short (one to two pages) report that explains what exactly you did to achieve these results, including data preprocessing, classifier, software used, model selection, etc.
Deadline: Thursday, 5th May, midnight MEST