Apprentissage non supervisé
Cours 3: Higher dimensions
Jairo Cugliari, Master ECD 2015-2016
From low to high dimension
Density estimation: histograms and KDE. Calibration can be done automatically. But! Let's look at the mean squared error $\mathbb{E}[\hat{f}_h(x) - f(x)]^2$:
- Histogram with $p = 1$ and optimal $h$: MISE $\approx C\, n^{-2/3}$
- KDE with $p = 1$: MISE $\approx C\, n^{-4/5}$
- KDE in dimension $p$: MISE $\approx C\, n^{-4/(4+p)}$
So as $p$ grows, the estimator becomes less attractive. This is a common behaviour in data analysis: we have just met the curse of dimensionality!
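As a quick numerical illustration (a sketch of ours, not from the course), the snippet below fits scipy's gaussian_kde to $n$ samples from a standard Gaussian and reports the pointwise error at the origin; with $n$ fixed, the error typically grows with $p$.

```python
# Curse of dimensionality for KDE: fixed sample size, growing dimension p.
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

rng = np.random.default_rng(0)
n = 2000  # fixed sample size

for p in (1, 2, 5, 10):
    X = rng.standard_normal((p, n))   # scipy expects shape (p, n)
    kde = gaussian_kde(X)             # bandwidth calibrated automatically (Scott's rule)
    f_hat = kde(np.zeros((p, 1)))[0]  # density estimate at the origin
    f_true = multivariate_normal(mean=np.zeros(p)).pdf(np.zeros(p))
    print(f"p={p:2d}  relative error at 0: {abs(f_hat - f_true) / f_true:.3f}")
```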
Curse of dimensionality (Bellman, 1961)
When $p$ increases, the volume of the space increases so fast that the available data become sparse. The amount of data needed to support a reliable result often grows exponentially with $p$.
High-dimensional spaces
The curse of dimensionality:
- Empty space phenomenon
- Norm concentration phenomenon
And more funny things (two of these are illustrated in the sketch after this list):
- A hypercube looks like a sea urchin (many spiky corners!)
- Hypercube corners collapse towards the center in any projection
- The volume of the unit hypersphere tends to zero
- The sphere's volume concentrates in a thin shell
- The tails of a Gaussian get heavier than the central bell
Hopefully the data convey some information / structure:
- clusters of data
- manifold data
Possible solutions are clustering, dimensionality reduction, ...
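A minimal Python sketch (our illustration, not course material) of two of these phenomena: pairwise distances between random points concentrate around their mean, and the volume of the unit ball vanishes as $p$ grows.

```python
# Two high-dimensional phenomena, with uniform points in the unit hypercube.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

for p in (2, 10, 100, 1000):
    X = rng.random((1000, p))                      # 1000 uniform points in [0, 1]^p
    d = np.linalg.norm(X[:500] - X[500:], axis=1)  # 500 pairwise distances
    spread = d.std() / d.mean()                    # relative spread: shrinks with p
    # Volume of the unit ball in R^p, computed in log-space to avoid overflow
    vol = np.exp((p / 2) * np.log(np.pi) - gammaln(p / 2 + 1))
    print(f"p={p:4d}  std/mean of distances: {spread:.3f}  unit-ball volume: {vol:.2e}")
```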
Dimensionality reduction
Some notation:
- Input data: $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$
- Output data: $f_1, f_2, \ldots, f_n \in \mathbb{R}^d$, with $d \ll p$
We want:
- Observations close in $\mathbb{R}^p$ should be close in $\mathbb{R}^d$
- Observations distant in $\mathbb{R}^p$ should be distant in $\mathbb{R}^d$
We'll try:
- Linear methods (PCA, MDS)
- Nonlinear methods (IsoMap, LLE, EigenMaps)
PCA
Pearson, 1901; Hotelling, 1933; Karhunen, 1946; Loève, 1948.
Idea:
- Decorrelate zero-mean data
- Keep the large-variance axes
- Fit a plane through the data cloud and project onto it
Representation quality
Assume the inputs are centered (i.e. $\sum_i x_i = 0$).
Given a unit vector $u$ and a point $x$, the length of the projection of $x$ onto $u$ is given by $x^T u$.
Maximize the projected variance $\frac{1}{n}\sum_i (x_i^T u)^2 = u^T G u$; the inner matrix $G = \frac{1}{n}\sum_i x_i x_i^T$ is called the Gram matrix.
Maximizing $u^T G u$ s.t. $\|u\| = 1$ gives the principal eigenvector of $G$.
To project the data onto a $d$-dimensional subspace ($d \ll p$) we take $u_1, \ldots, u_d$, the top $d$ eigenvectors of $G$ (which form an orthonormal basis).
The low-dimensional outputs are $y_i = (u_1^T x_i, u_2^T x_i, \ldots, u_d^T x_i)^T$.
How to interpret the PCA:
- Eigenvectors: principal axes of the maximum variance subspace.
- Eigenvalues: variance of the projected inputs along the principal axes.
- Estimated dimensionality: number of significant (nonnegative) eigenvalues.
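A minimal PCA sketch in Python following these steps (the function name pca_project is ours, for illustration):

```python
# PCA as on the slides: center, build G = (1/n) sum_i x_i x_i^T,
# take the top-d eigenvectors of G, and project onto them.
import numpy as np

def pca_project(X, d):
    """X: (n, p) data matrix. Returns the (n, d) projections and the axes."""
    Xc = X - X.mean(axis=0)                # center the inputs: sum_i x_i = 0
    G = Xc.T @ Xc / len(Xc)                # Gram matrix, (p, p)
    eigvals, eigvecs = np.linalg.eigh(G)   # eigenvalues in ascending order
    U = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # top-d eigenvectors, orthonormal
    return Xc @ U, U                       # y_i = (u_1^T x_i, ..., u_d^T x_i)^T

# Example: 3-D inputs that actually live near a 2-D subspace
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 3))
Y, U = pca_project(X, d=2)
print(Y.shape)  # (200, 2)
```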
Multidimensional Scaling (MDS)
Preserve pairwise distances: project $n$ points into a Euclidean space (e.g. $\mathbb{R}^2$) using only the information about their pairwise distances.
Source: http://www.benfrederickson.com
MDS
Input: a distance matrix.
Recall: a square matrix $D$ of order $n$ is a distance matrix if it is symmetric, $d_{ii} = 0$ and $d_{ij} \ge 0$ for $i \ne j$.
Aim: find the $n$ data points $y_1, \ldots, y_n$ in $d$ dimensions such that $\|y_i - y_j\|_2$ is similar to $d_{ij}$.
Let $d^{(X)}_{ij}$ be the original distances and $d^{(Y)}_{ij}$ the new ones; then one wants to
$$\min_{y_1, \ldots, y_n} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(d^{(X)}_{ij} - d^{(Y)}_{ij}\right)^2$$
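For concreteness, a tiny helper (our naming, not part of the slides) that evaluates this objective for a candidate embedding:

```python
# Evaluate the MDS objective: sum of squared discrepancies between
# the original distances and the pairwise distances of the embedding Y.
import numpy as np

def stress(DX, Y):
    """DX: (n, n) original distance matrix; Y: (n, d) candidate embedding."""
    DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return ((DX - DY) ** 2).sum()
```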
Metric MDS
Let $\mathbf{1}$ be a vector of ones and define the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$.
Let $A$ be a square matrix of order $n$ with $a_{ij} = -\frac{d_{ij}^2}{2}$.
Then we define the double-centered matrix $B = H A H^T$.
$B$ is a Gram matrix (positive semidefinite) iff $D$ is a Euclidean distance matrix.
Metric MDS
If $B$ is a Gram matrix we have $B = (HX)(HX)^T$.
Using the SVD of $B$ we have $B = U \Lambda U^T$.
The columns of $Y = U \Lambda^{1/2}$ give the coordinates of the Euclidean representation.
Algorithm:
1. Construct $A$
2. Compute $B = H A H^T$
3. SVD of $B$ to get $B = U \Lambda U^T$
4. Obtain $Y = U \Lambda^{1/2}$
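A compact Python sketch of this algorithm (the name classical_mds is ours; only numpy is assumed):

```python
# Metric MDS as above: A -> B = HAH^T -> eigendecomposition -> Y = U Lambda^{1/2}.
import numpy as np

def classical_mds(D, d=2):
    """D: (n, n) pairwise distance matrix. Returns (n, d) coordinates."""
    n = len(D)
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix H = I - (1/n) 1 1^T
    A = -0.5 * D**2                        # a_ij = -d_ij^2 / 2
    B = H @ A @ H.T                        # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(B)   # B is symmetric
    order = np.argsort(eigvals)[::-1][:d]  # keep the d largest eigenvalues
    lam = np.maximum(eigvals[order], 0.0)  # clip small negatives (numerical noise)
    return eigvecs[:, order] * np.sqrt(lam)  # Y = U Lambda^{1/2}

# Sanity check: recover a 2-D configuration from its distance matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, d=2)
DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(np.allclose(DY, D))  # True: distances reproduced (up to rotation/reflection)
```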
Metric MDS
Interpreting MDS:
- Eigenvectors: ordered, scaled, and truncated to yield the low-dimensional embedding.
- Eigenvalues: measure how each dimension contributes to the dot products.
- Estimated dimensionality: number of significant (nonnegative) eigenvalues.
Nonlinear structure
Graph-Based Methods
- Tenenbaum et al.'s Isomap algorithm: global approach; preserves global pairwise distances.
- Roweis and Saul's Locally Linear Embedding algorithm: local approach; nearby points should map nearby.
- Belkin and Niyogi's Laplacian Eigenmaps algorithm: local approach; minimizes approximately the same value as LLE.
ISOMAP
Algorithm:
1. Compute the k-nearest-neighbour graph
2. Obtain the shortest paths through the graph
3. Run MDS on the geodesic distances
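A toy Python sketch of these three steps (illustrative only; the dataset and the choice of k are ours), using scipy's shortest_path for the graph geodesics and the classical MDS construction from the previous slides:

```python
# Isomap: kNN graph -> shortest-path (geodesic) distances -> metric MDS.
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, d=2, k=6):
    n = len(X)
    # Step 1: k-nearest-neighbour graph with Euclidean edge weights
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.full((n, n), np.inf)                # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:k + 1]     # k neighbours, skipping self
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = D[rows, nn.ravel()]
    # Step 2: geodesic distances = shortest paths through the graph
    G = shortest_path(W, method="D", directed=False)  # Dijkstra, symmetrized
    # Step 3: classical MDS on the geodesic distance matrix
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ (-0.5 * G**2) @ H.T
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Toy example: unroll a 1-D spiral embedded in the plane
t = np.linspace(0, 3 * np.pi, 100)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
Y = isomap(X, d=1, k=5)   # Y orders the points along the spiral's arc length
print(Y.shape)  # (100, 1)
```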