Preprocessing & dimensionality reduction


1 Introduction to Data Mining: Preprocessing & dimensionality reduction. CPSC/AMTH 445a/545a, Guy Wolf, Yale University, Fall 2016.

2 Outline
1. Preprocessing for data simplification: sampling, aggregation, discretization, density estimation, dimensionality reduction
2. Principal component analysis (PCA): autoencoder, variance maximization, singular value decomposition (SVD)
3. Multidimensional scaling (MDS): Gram matrix, double-centering, stress function

3 Preprocessing for data simplification: Sampling. Select a subset of representative data points instead of processing the entire data. A sampled subset is useful only if its analysis yields the same patterns, results, conclusions, etc., as the analysis of the entire data. [Figure: the same point cloud subsampled to 2000 points and to 500 points]

4 Preprocessing for data simplification: Sampling. Select a subset of representative data points instead of processing the entire data. Common sampling approaches: Random - an equal probability of selecting any particular item. Without replacement - iteratively select and remove items. With replacement - selected items remain in the population. Stratified - draw random samples from each partition. Choosing a sufficient sample size is often crucial for effective sampling.
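A minimal sketch of these sampling schemes in Python with numpy (the arrays `X` and `labels` are made-up stand-ins for a data matrix and its strata):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))              # toy data: 10k points, 5 attributes
labels = rng.integers(0, 3, size=len(X))      # toy strata/cluster labels

# Random sampling without replacement: select & remove items.
sample_wo = X[rng.choice(len(X), size=500, replace=False)]

# Random sampling with replacement: selected items remain in the population.
sample_w = X[rng.choice(len(X), size=500, replace=True)]

# Stratified sampling: draw a random sample from each partition (stratum).
per_stratum = 100
strat_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=per_stratum, replace=False)
    for c in np.unique(labels)
])
sample_strat = X[strat_idx]
```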

5 Preprocessing for data simplification: Sampling example. Choose enough samples to guarantee that at least one representative is selected from each distinct group/cluster/profile in the data.

6 Preprocessing for data simplification: Aggregation. Instead of sampling representative data points, we can coarse-grain the data by aggregating attributes or data points together. Aggregation: combining several attributes into a single feature, or several data points into a single observation. Examples: change monthly revenues to annual revenues; analyze neighborhoods instead of houses; provide the average rating of a season (not per episode).
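As a small illustration (assuming pandas is available; the table and column names are hypothetical), aggregating monthly revenues into annual revenues per store is a single groupby:

```python
import pandas as pd

# Hypothetical monthly revenue table (column names are illustrative).
monthly = pd.DataFrame({
    "store":   ["A", "A", "A", "B", "B", "B"],
    "month":   [1, 2, 3, 1, 2, 3],
    "revenue": [100.0, 120.0, 90.0, 200.0, 180.0, 210.0],
})

# Aggregate data points: monthly revenues -> total revenue per store.
annual = monthly.groupby("store", as_index=False)["revenue"].sum()
print(annual)
#   store  revenue
# 0     A    310.0
# 1     B    590.0
```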

7 Preprocessing for data simplification: Discretization. It is sometimes convenient to transform the entire data to nominal (or ordinal) attributes. Discretization: transformation of continuous attributes (or ones with an infinite range) to discrete ones with a finite range. Discretization can be done in a supervised manner (e.g., using class labels) or in an unsupervised manner (e.g., using clustering).

8 Preprocessing for data simplification: Discretization. Supervised discretization based on minimizing impurity. [Figure: the same data discretized with 3 values per axis and with 5 values per axis]

9 Preprocessing for data simplification: Discretization. [Figure: examples of unsupervised discretization]
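A sketch of unsupervised discretization in numpy, using equal-width and equal-frequency binning (the bin count is an arbitrary choice; a clustering-based variant would instead bin by k-means centers on the attribute):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                     # a continuous attribute

# Equal-width discretization into 5 ordinal values.
edges = np.linspace(x.min(), x.max(), num=5 + 1)
x_width = np.digitize(x, edges[1:-1])         # values in {0, 1, 2, 3, 4}

# Equal-frequency (quantile) discretization into 5 ordinal values.
quantiles = np.quantile(x, np.linspace(0, 1, num=5 + 1))
x_freq = np.digitize(x, quantiles[1:-1])      # roughly equal-sized bins
```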

10 Preprocessing for data simplification: Density estimation. Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one.

11 Preprocessing for data simplification: Density estimation. Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one. [Figure: cell-based density estimation]

12 Preprocessing for data simplification: Density estimation. Transforming attributes from raw values to densities can be used to coarse-grain the data and bring its features to comparable scales between zero and one. [Figure: center-based density estimation]
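Both variants can be sketched in a few lines of numpy (grid resolution and radius are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))                # toy 2-D data

# Cell-based density: count points per grid cell, assign each point its cell's count.
bins = 20
hist, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
ix = np.clip(np.digitize(X[:, 0], xedges[1:-1]), 0, bins - 1)
iy = np.clip(np.digitize(X[:, 1], yedges[1:-1]), 0, bins - 1)
cell_density = hist[ix, iy] / hist.max()      # rescaled to the [0, 1] range

# Center-based density: fraction of points within radius r of each point.
r = 0.5
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
center_density = (d2 <= r**2).mean(axis=1)    # already between 0 and 1
```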

13 Preprocessing for data simplification: Dimensionality reduction. The dimensionality of data is generally determined by the number of attributes or features that represent each data point. Curse of dimensionality: a general term for various phenomena that arise when analyzing and processing high-dimensional data. Common theme: statistical significance is difficult, impractical, or even impossible to obtain due to the sparsity of the data in high dimensions, which causes poor performance of classical statistical methods compared to low-dimensional data. Common solution: reduce the dimensionality of the data as part of its (pre)processing.
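A quick numerical illustration of the sparsity issue (an added example, not from the slides): for uniformly random points, the relative contrast between the nearest and farthest pairwise distances shrinks as the dimension grows.

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distances via inner products, avoiding a huge (N, N, d) array.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0))

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    D = pairwise_dist(X)
    off = D[~np.eye(len(X), dtype=bool)]            # off-diagonal distances
    contrast = (off.max() - off.min()) / off.min()  # relative spread of distances
    print(f"dim={d:5d}  relative contrast={contrast:.3f}")
```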

14 Preprocessing for data simplification: Dimensionality reduction. There are several approaches to represent the data in a lower dimension, which can generally be split into two types. Feature selection/weighting - select a subset of existing features and only use them in the analysis, while possibly also assigning them importance weights to eliminate redundant information. Feature extraction/construction - create new features by extracting relevant information from the original features. PCA and MDS are two of the most common dimensionality reduction methods in data analysis, but many others exist as well.

15 Preprocessing for data simplification: Feature subset selection. Ideally, choose the best feature subset out of all possible combinations; this is impractical, since there are $2^n$ choices for $n$ attributes. Feature selection approaches: Embedded methods - choose the best features for a task as part of the data mining algorithm (e.g., decision trees). Filter methods - choose features that optimize a general criterion (e.g., minimal correlation) as part of data preprocessing, using an efficient search algorithm. Wrapper methods - first formulate and handle a data mining task to select features, and then use the resulting subset to solve the real task. Alternatively, expert knowledge can sometimes be used to eliminate redundant and unnecessary features.
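A toy filter-method sketch (the greedy correlation criterion here is one possible choice, not the slides' prescription): keep features whose absolute correlation with already-selected features stays below a threshold.

```python
import numpy as np

def filter_select(X, k, max_corr=0.9):
    """Greedy filter: keep up to k features, skipping any feature whose absolute
    correlation with an already-selected feature exceeds max_corr."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    order = np.argsort(-X.var(axis=0))        # consider high-variance features first
    selected = []
    for j in order:
        if all(C[j, s] <= max_corr for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=300)   # features 0 and 5 nearly identical
print(filter_select(X, k=5))                      # only one of features 0 and 5 is kept
```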

16 Principal Component Analysis

19 [Figure] Assume: the data has zero mean. Find: the best $k$-dimensional projection.

20 Projection on principal components. [Figure: data points and their principal components]

21 Projection on principal components. [Figure: data in 3D space projected onto the first principal component $\lambda_1 \varphi_1$, yielding a 1D representation]

22-25 What is the best projection?
Find a subspace $S \subseteq \mathbb{R}^n$ s.t. $\dim(S) = k$ and the data is well approximated by $\hat{x} = \mathrm{proj}_S\, x$.
Equivalently: find a subspace $S \subseteq \mathbb{R}^n$ s.t. $S = \mathrm{span}\{u_1, \dots, u_k\}$ and $\|x - \hat{x}\|$ is minimal over the data, with $\hat{x} = \mathrm{proj}_S\, x$.
Equivalently: find $k$ vectors $u_1, \dots, u_k$ s.t. $\frac{1}{N}\sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ is minimal, with $\hat{x}_i = \mathrm{proj}_{\mathrm{span}\{u_1, \dots, u_k\}}\, x_i$.
How do we find these vectors $u_1, \dots, u_k$?

26 Autoencoder. Minimize $\frac{1}{N}\sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$ s.t. $\hat{x} = \mathrm{proj}_{\mathrm{span}\{u_1, \dots, u_k\}}\, x$. [Figure: a linear autoencoder with input layer $x[1], \dots, x[5]$, hidden layer $h[1], h[2], h[3]$, and output layer $\hat{x}[1], \dots, \hat{x}[5]$], where $h_i = W x_i$ and $\hat{x}_i = U h_i$. The resulting optimization is $\arg\min_{W \in \mathbb{R}^{k \times n},\, U \in \mathbb{R}^{n \times k}} \sum_{i=1}^{N} \|x_i - U W x_i\|^2$.
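A sketch of this linear autoencoder trained with plain gradient descent in numpy (layer sizes, learning rate, and iteration count are arbitrary choices; with enough iterations the learned subspace should approach the PCA subspace):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 1000
# Toy centered data that lies (approximately) in a 2-D subspace of R^5.
X = rng.normal(size=(N, k)) @ rng.normal(size=(k, n)) + 0.05 * rng.normal(size=(N, n))
X -= X.mean(axis=0)

W = 0.1 * rng.normal(size=(k, n))    # encoder:  h_i = W x_i
U = 0.1 * rng.normal(size=(n, k))    # decoder:  x_hat_i = U h_i
lr = 0.01
for _ in range(3000):
    H = X @ W.T                      # hidden codes, shape (N, k)
    R = H @ U.T - X                  # reconstruction residuals, shape (N, n)
    grad_U = 2 * R.T @ H / N         # gradient of the mean squared error w.r.t. U
    grad_W = 2 * (R @ U).T @ X / N   # gradient of the mean squared error w.r.t. W
    U -= lr * grad_U
    W -= lr * grad_W

err = np.mean(np.sum((X @ W.T @ U.T - X) ** 2, axis=1))
print("mean reconstruction error:", err)   # should approach the 2-D PCA error
```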

27 Reconstruction error minimization. We only need to consider orthonormal vectors $u_1, \dots, u_k$ (i.e., $\|u_i\| = 1$ and $\langle u_i, u_j \rangle = 0$ for $i \neq j$) that form a basis for the subspace. We can then extend this set to a basis $u_1, \dots, u_n$ of the entire $\mathbb{R}^n$. Then we can write $x = \sum_{j=1}^{n} \langle x, u_j \rangle u_j = \sum_{j=1}^{n} u_j u_j^T x$ and $\mathrm{proj}_{\mathrm{span}\{u_1, \dots, u_k\}}\, x = \sum_{j=1}^{k} u_j u_j^T x$. We now consider the reconstruction error $\frac{1}{N}\sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$.

28 Reconstruction error minimization. First, notice that $x - \hat{x} = \sum_{j=1}^{n} u_j u_j^T x - \sum_{j=1}^{k} u_j u_j^T x = \sum_{j=k+1}^{n} u_j u_j^T x$, so
$\|x - \hat{x}\|^2 = \sum_{q=1}^{n} \Big( \sum_{j=k+1}^{n} u_j[q]\, u_j^T x \Big)^2 = \sum_{j=k+1}^{n} \sum_{j'=k+1}^{n} \Big( \sum_{q=1}^{n} u_j[q]\, u_{j'}[q] \Big) (u_j^T x)(u_{j'}^T x) = \sum_{j=k+1}^{n} (u_j^T x)^2 = \sum_{j=1}^{n} (u_j^T x)^2 - \sum_{j=1}^{k} (u_j^T x)^2 = \|x\|^2 - \|\hat{x}\|^2$.
Minimizing the reconstruction error is therefore equivalent to maximizing $\frac{1}{N}\sum_{i=1}^{N} \|\hat{x}_i\|^2 = \sum_{j=1}^{k} \frac{1}{N}\sum_{i=1}^{N} (u_j^T x_i)^2 = \sum_{j=1}^{k} \mathrm{variance}(u_j^T x)$.

29-33 Variance maximization. Find a direction that maximizes the variance in the projected data, i.e., find a unit vector $u \in \mathbb{R}^n$ that maximizes
$\mathrm{variance}(u^T x) = \frac{1}{N}\sum_{i=1}^{N} (u^T x_i)^2 = \frac{1}{N}\sum_{i=1}^{N} (u^T x_i)(x_i^T u) = u^T \Big( \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T \Big) u = u^T \Sigma u$,
where $\Sigma$ is the covariance matrix. This leads to the maximization problem: maximize $u^T \Sigma u$ s.t. $\|u\| = 1$.

34 Variance maximization. Solve the maximization problem: maximize $u^T \Sigma u$ s.t. $\|u\| = 1$. Apply the Lagrange multipliers method: $f(u, \alpha) = u^T \Sigma u + \alpha(1 - u^T u)$, $\nabla_u f(u, \alpha) = 2(\Sigma u - \alpha u)$, and $\nabla_u f(u, \alpha) = 0 \Rightarrow \Sigma u = \alpha u$. Therefore, $u$ is an eigenvector of $\Sigma$ with eigenvalue $\alpha$, which has to be the maximal eigenvalue in order to maximize $u^T \Sigma u = \alpha$.

35 Variance maximization. Similarly, a second direction is found via: maximize $u_2^T \Sigma u_2$ s.t. $\|u_2\| = 1$ and $\langle u_2, u_1 \rangle = 0$. Apply the Lagrange multipliers method: $f(u_2, \alpha, \beta) = u_2^T \Sigma u_2 + \alpha(1 - u_2^T u_2) - \beta u_2^T u_1$ and $\nabla_{u_2} f(u_2, \alpha, \beta) = 2(\Sigma u_2 - \alpha u_2) - \beta u_1$. Requiring $\langle u_1, \nabla_{u_2} f(u_2, \alpha, \beta) \rangle = 0$ gives $\beta = 0$, and then $\nabla_{u_2} f(u_2, \alpha, \beta) = 0$ gives $\Sigma u_2 = \alpha u_2$. Therefore, $u_2$ is an eigenvector of $\Sigma$ with the second largest eigenvalue.

36 Eigendecomposition and SVD. [Figure: the covariance matrix (features by features) computed from the data matrix (data points by features)], with entries $\mathrm{cov}(q_1, q_2) = \sum_i x_i[q_1]\, x_i[q_2]$.

37 Eigendecomposition and SVD. [Figure: the eigendecomposition of the covariance matrix, $\Sigma\, \varphi_i = \lambda_i\, \varphi_i$, with eigenvalue $\lambda_i$ and eigenvector $\varphi_i$]

38 Eigendecomposition and SVD. The spectral theorem applies to covariance matrices. [Figure: $\Sigma\, \varphi_i = \lambda_i\, \varphi_i$; the covariance matrix factored by SVD (singular value decomposition) into singular vectors and singular values.] Spectral theorem: $\mathrm{cov}(q_1, q_2) = \sum_i \lambda_i\, \varphi_i[q_1]\, \varphi_i[q_2]$.

39 Singular value decomposition. Any matrix $M \in \mathbb{R}^{n \times k}$ can be decomposed as $(U, S, V) = \mathrm{SVD}(M)$ with $M = U S V^T$, where $U$ is $n \times n$ orthogonal, $S$ is $n \times k$ diagonal, and $V$ is $k \times k$ orthogonal. The singular values in $S$ are the square roots of the (nonnegative) eigenvalues of both $M M^T$ and $M^T M$. The singular vectors in (the columns of) $U$ are the eigenvectors of $M M^T$, and the singular vectors in (the columns of) $V$ are the eigenvectors of $M^T M$. Proofs and more details about the SVD can be found on Wikipedia.
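These relations can be checked numerically with numpy (note that `np.linalg.svd` returns the singular values as a vector and $V^T$ rather than $V$):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M)               # M = U @ S @ Vt
S = np.zeros_like(M)                      # embed the singular values in an n x k diagonal
np.fill_diagonal(S, s)
assert np.allclose(U @ S @ Vt, M)

# Singular values are the square roots of the eigenvalues of M M^T and M^T M.
eig_MMt = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
eig_MtM = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
assert np.allclose(np.sqrt(eig_MtM), s)
assert np.allclose(np.sqrt(np.clip(eig_MMt[:4], 0, None)), s)
```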

40 Singular value decomposition. [Figure: eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \lambda_4 \geq \lambda_5 \geq \dots$] A decaying covariance spectrum reveals (low) dimensionality.

41 Singular value decomposition. [Figure: the covariance matrix factored into eigenvectors (the principal components) and eigenvalues.] The covariance matrix can be approximated by a truncated SVD.

42-44 Trivial example. Consider the simple case of data points that all lie on the same high-dimensional line. The straight line is defined by a unit vector $\psi$ ($\|\psi\| = 1$), points on the line are defined by multiplying $\psi$ by scalars, and so the points can be formulated as $x_i = c_i \psi$. Covariance:
$\mathrm{cov}(t_1, t_2) = \sum_i x_i[t_1]\, x_i[t_2] = \sum_i c_i \psi[t_1]\, c_i \psi[t_2] = \Big( \sum_i c_i^2 \Big)\, \psi[t_1]\, \psi[t_2] = \|c\|^2\, \psi[t_1]\, \psi[t_2]$, where $c = (c_1, c_2, \dots)$.
[Figure: the resulting covariance matrix is the rank-one matrix $\|c\|^2\, \psi \psi^T$.]
The covariance matrix has a single (nonzero) eigenvalue $\|c\|^2$ and a single eigenvector $\psi$, which defines the principal direction of the data-point vectors.
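The trivial example is easy to verify numerically (a sketch; the unit vector $\psi$ and the scalars $c_i$ are drawn at random):

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=5)
psi /= np.linalg.norm(psi)                # unit vector defining the line
c = rng.normal(size=200)                  # scalar coefficients
X = np.outer(c, psi)                      # points x_i = c_i * psi

Sigma = X.T @ X                           # covariance as on the slide: ||c||^2 psi psi^T
vals, vecs = np.linalg.eigh(Sigma)
print(np.round(vals, 6))                                 # a single nonzero eigenvalue, ~ ||c||^2
print(np.allclose(np.abs(vecs[:, -1]), np.abs(psi)))     # its eigenvector is +/- psi
```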

45-48 Trivial example. [Figures: the data in 3D space lies along a line; its first eigenvector $\varphi_1 = \psi$ recovers the direction of that line. More generally, the principal components $\lambda_1 \varphi_1, \lambda_2 \varphi_2, \dots$ are the maximum-variance directions: their directions are given by the eigenvectors and their lengths by the eigenvalues.]

49 PCA algorithm: 1. Centering; 2. Covariance; 3. SVD (or eigendecomposition); 4. Projection. Alternative method: multidimensional scaling (MDS) - preserve distances/inner products with a minimal set of coordinates.
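A compact sketch of these four steps via the SVD of the centered data (in practice one would usually call a library implementation such as scikit-learn's PCA; the toy data below is an assumption of the example):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Steps: 1) centering, 2) covariance (implicitly, through the SVD of the
    centered data), 3) SVD, 4) projection."""
    Xc = X - X.mean(axis=0)                              # 1) centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # 3) SVD of the centered data
    components = Vt[:k]                                  # eigenvectors of the covariance
    explained_var = s[:k] ** 2 / len(X)                  # eigenvalues of the covariance
    Y = Xc @ components.T                                # 4) projection
    return Y, components, explained_var

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])   # anisotropic toy data
Y, comps, ev = pca(X, k=2)
print(ev)          # two dominant variances; the third direction is discarded
```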

50 Multidimensional Scaling

51 Multidimensional scaling. What if we cannot compute a covariance matrix? Consider a $k$-dimensional rigid body: all we need to know are the distances between its parts. We can ignore its position and orientation and find the most efficient way to place it in $\mathbb{R}^k$.

52 Multidimensional scaling. Given the distance matrix
$D = \begin{pmatrix} 0 & \cdots & d_{1j} & \cdots & d_{1m} \\ \vdots & & \vdots & & \vdots \\ d_{i1} & \cdots & 0 & \cdots & d_{im} \\ \vdots & & \vdots & & \vdots \\ d_{m1} & \cdots & d_{mj} & \cdots & 0 \end{pmatrix}$,
find $\{ y_1, \dots, y_m \in \mathbb{R}^k : \|y_i - y_j\| = d_{ij} = \|x_i - x_j\| \}$. Multidimensional scaling: given an $m \times m$ matrix $D$ of distances between $m$ objects, find $k$-dimensional coordinates that preserve these distances.

53 Multidimensional scaling: Gram matrix. A distance matrix is not convenient to embed directly in $\mathbb{R}^k$, but embedding inner products is a simpler task. Gram matrix: a matrix $G$ that contains the inner products $g_{ij} = \langle x_i, x_j \rangle$ is a Gram matrix. Using the spectral theorem we can decompose $G = \Phi \Lambda \Phi^T$ and get $\langle x_i, x_j \rangle = g_{ij} = \sum_{q=1}^{m} \lambda_q\, \Phi[i, q]\, \Phi[j, q] = \langle \Phi[i, \cdot]\, \Lambda^{1/2},\ \Phi[j, \cdot]\, \Lambda^{1/2} \rangle$. Similar to PCA, we can truncate small eigenvalues and use the $k$ biggest eigenpairs.

54-56 Multidimensional scaling: Spectral embedding. [Figure: $G$ decomposed into eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_k > 0$ and eigenvectors $\varphi_1, \varphi_2, \varphi_3, \dots, \varphi_k$.] The embedding is $x \mapsto \Phi(x) = [\lambda_1^{1/2} \varphi_1(x),\ \lambda_2^{1/2} \varphi_2(x),\ \lambda_3^{1/2} \varphi_3(x),\ \dots,\ \lambda_k^{1/2} \varphi_k(x)]^T$.
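The truncated spectral embedding from a Gram matrix, sketched in numpy (`k` selects the largest positive eigenpairs):

```python
import numpy as np

def embed_from_gram(G, k):
    """Spectral embedding: x -> [sqrt(l_1) phi_1(x), ..., sqrt(l_k) phi_k(x)]."""
    vals, vecs = np.linalg.eigh(G)                # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]              # k largest eigenpairs
    lam = np.clip(vals[idx], 0, None)             # guard against tiny negative values
    return vecs[:, idx] * np.sqrt(lam)            # rows are the embedded coordinates

# Sanity check: the Gram matrix of known 3-D points reproduces their geometry.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = embed_from_gram(X @ X.T, k=3)
assert np.allclose(Y @ Y.T, X @ X.T)              # inner products are preserved
```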

57-60 Multidimensional scaling: Double-centering. Notice that given a distance metric that is equivalent to Euclidean distances, we can write $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y \rangle$. But then:
$\mathrm{mean}_x(\|x - y\|^2) = \overline{\|z\|^2} + \|y\|^2 - 2\langle z, y \rangle$
$\mathrm{mean}_y(\|x - y\|^2) = \overline{\|z\|^2} + \|x\|^2 - 2\langle x, z \rangle$
$\mathrm{mean}_{x,y}(\|x - y\|^2) = 2\,\overline{\|z\|^2} - 2\langle z, z \rangle$
where $z$ and $\overline{\|z\|^2}$ are the mean and the mean squared norm of the data.

61 Multidimensional scaling: Double-centering. Thus, if we set $g(x, y) = -\frac{1}{2}\big( \|x - y\|^2 - \mathrm{mean}_x(\|x - y\|^2) - \mathrm{mean}_y(\|x - y\|^2) + \mathrm{mean}_{x,y}(\|x - y\|^2) \big)$ we get a Gram matrix, since $g(x, y) = \big( \langle x, y \rangle - \langle x, z \rangle \big) - \big( \langle z, y \rangle - \langle z, z \rangle \big) = \langle x - z,\ y - z \rangle$. Therefore, we can compute $G = -\frac{1}{2} J D^{(2)} J$, where $J = \mathrm{Id} - \frac{1}{m} \mathbf{1} \mathbf{1}^T$.
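The double-centering formula takes two lines in numpy; the sanity check below (a sketch) confirms that applying it to Euclidean distances recovers the Gram matrix of the centered points:

```python
import numpy as np

def gram_from_distances(D):
    """Double-centering: G = -1/2 * J D^(2) J, with J = I - (1/m) 1 1^T."""
    m = len(D)
    J = np.eye(m) - np.ones((m, m)) / m
    return -0.5 * J @ (D ** 2) @ J

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Euclidean distances
Xc = X - X.mean(axis=0)
assert np.allclose(gram_from_distances(D), Xc @ Xc.T)        # Gram matrix of centered data
```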

62 Multidimensional scaling: Classic MDS. Classic MDS is computed with the following algorithm: 1. Formulate squared distances; 2. Build the Gram matrix by double-centering; 3. SVD (or eigendecomposition); 4. Assign coordinates based on the eigenvalues and eigenvectors. Exercise: show that for centered data in Euclidean space this embedding is identical to PCA.
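For the exercise, here is a self-contained numerical check (not a proof) that classical MDS on the Euclidean distances of centered data agrees with the PCA projection up to the sign of each coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                                   # centered data in Euclidean space
k = 2

# Classical MDS: squared distances -> double-centering -> eigendecomposition -> coordinates.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
J = np.eye(len(X)) - np.ones((len(X), len(X))) / len(X)
G = -0.5 * J @ D2 @ J
vals, vecs = np.linalg.eigh(G)
idx = np.argsort(vals)[::-1][:k]
Y_mds = vecs[:, idx] * np.sqrt(vals[idx])

# PCA: project onto the top-k right singular vectors of the centered data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y_pca = X @ Vt[:k].T

# The two embeddings agree up to the sign of each coordinate.
for j in range(k):
    assert np.allclose(np.abs(Y_mds[:, j]), np.abs(Y_pca[:, j]), atol=1e-8)
```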

63 Multidimensional scaling: Stress function. What if we are not given a distance metric, but just dissimilarities? Stress function: a function that quantifies the disagreement between given dissimilarities and embedded Euclidean distances. Examples of stress functions:
Metric MDS stress: $\dfrac{\sum_{i<j} (\hat{d}_{ij} - f(d_{ij}))^2}{\sum_{i<j} d_{ij}^2}$, where $f$ is a predetermined monotonically increasing function.
Kruskal's stress-1: $\dfrac{\sum_{i<j} (\hat{d}_{ij} - f(d_{ij}))^2}{\sum_{i<j} \hat{d}_{ij}^2}$, where $f$ is optimized, but still monotonically increasing.
Sammon's stress: $\big( \sum_{i<j} d_{ij} \big)^{-1} \sum_{i<j} \dfrac{(\hat{d}_{ij} - d_{ij})^2}{d_{ij}}$.
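A sketch of evaluating these stress functions for a candidate embedding (here the fitting function `f` defaults to the identity, which is an assumption made for the illustration):

```python
import numpy as np

def pairwise(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def stresses(D, Y, f=lambda d: d):
    """Metric MDS stress, Kruskal-style stress-1 (squared form), and Sammon's stress
    for dissimilarities D and embedded points Y, using a fitting function f."""
    Dhat = pairwise(Y)
    iu = np.triu_indices(len(D), k=1)             # use each pair (i < j) once
    d, dh = f(D[iu]), Dhat[iu]
    metric_stress  = np.sum((dh - d) ** 2) / np.sum(D[iu] ** 2)
    kruskal_stress = np.sum((dh - d) ** 2) / np.sum(dh ** 2)
    sammon_stress  = np.sum((dh - d) ** 2 / d) / np.sum(d)
    return metric_stress, kruskal_stress, sammon_stress

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
D = pairwise(X)
Y = X[:, :2]                                      # a crude 2-D "embedding" for illustration
print(stresses(D, Y))
```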

64 Multidimensional scaling: Non-metric MDS. Non-metric, or non-classical, MDS is computed by the following algorithm: 1. Formulate a dissimilarity matrix $D$. 2. Find an initial configuration (e.g., using classical MDS) with distance matrix $\hat{D}$. 3. Minimize $\mathrm{STRESS}_D(f, \hat{D})$ by optimizing the fitting function $f$. 4. Minimize $\mathrm{STRESS}_D(f, \hat{D})$ by optimizing the configuration and the resulting $\hat{D}$. 5. Iterate the previous two steps until the stress is lower than a stopping threshold.
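If scikit-learn is available, this kind of iterative stress minimization is implemented by `sklearn.manifold.MDS` with `metric=False`; a rough usage sketch (parameter values are arbitrary, and exact defaults vary between versions):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # dissimilarity matrix

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=4, max_iter=300, random_state=0)
Y = nmds.fit_transform(D)
print(Y.shape, nmds.stress_)          # embedded coordinates and the final stress value
```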

65 Summary. Preprocessing steps are crucial in preparing data for meaningful analysis. Linear dimensionality reduction for alleviating the curse of dimensionality: PCA projects the data on the leading eigenvectors of the covariance matrix; MDS embeds the data using the leading eigenvalues of a Gram matrix and the entries of the corresponding eigenvectors. In both cases, SVD is used in practice instead of eigendecomposition. Nonlinear dimensionality reduction will be covered later in the semester.
