Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Size: px

Start display at page:

Download "Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent"

Tyler Shaw
5 years ago
Views:

1 Unsupervised Machine Learning and Data Mining DS 5230 / DS Fall 2018 Lecture 7 Jan-Willem van de Meent

2 DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford)

3 Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Original Data (4 dims) Projection with PCA (2 dims) Objective: projection should preserve relative distances

4 Linear Dimensionality Reduction Idea: Project high-dimensional vector onto a lower dimensional space x 2 R z 2 R 10 z = U > x

5 Problem Setup Given n data points in d dimensions: x 1,...,x n 2 R d X = ( x 1 x n ) 2 Rd n Transpose of X used in regression!

6 Problem Setup Given n data points in d dimensions: x 1,...,x n 2 R d X = ( x 1 x n ) 2 Rd n Want to reduce dimensionality from d to k Choose k directions u 1,...,u k U = ( u 1 u k ) 2 Rd k u = u > x

7 Problem Setup Given n data points in d dimensions: x 1,...,x n 2 R d X = ( x 1 x n ) 2 Rd n Want to reduce dimensionality from d to k Choose k directions u 1,...,u k U = ( u 1 u k ) 2 Rd k For each u j, compute similarity z j = u > j x =( ) > = >

8 Problem Setup Given n data points in d dimensions: x 1,...,x n 2 R d X = ( x 1 x n ) 2 Rd n Want to reduce dimensionality from d to k Choose k directions u 1,...,u k U = ( u 1 u k ) 2 Rd k For each u j, compute similarity z j = u > j x Project x down to z =(z 1,...,z k ) > = U > x How to choose U?

9 Principal Component Analysis Top 2 components Bottom 2 components Data: three varieties of wheat: Kama, Rosa, Canadian Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove

10 Principal Component Analysis x 2 R z 2 R 10 z = U > x Optimize two equivalent objectives 1. Minimize the reconstruction error 2. Maximizes the projected variance

11 PCA Objective 1: Reconstruction Error U serves two functions: Encode: z = U > x, z j = u > j x P

12 PCA Objective 1: Reconstruction Error U serves two functions: Encode: z = U > x, z j = u > j x Decode: x = Uz = P k j=1 z ju j kx xk

13 PCA Objective 1: Reconstruction Error U serves two functions: Encode: z = U > x, z j = u > j x Decode: x = Uz = P k j=1 z ju j Want reconstruction error kx xk to be small

14 PCA Objective 1: Reconstruction Error U serves two functions: Encode: z = U > x, z j = u > j x Decode: x = Uz = P k j=1 z ju j Want reconstruction error kx xk to be small Objective: minimize total squared reconstruction error min U2R d k nx kx i UU > x i k 2 i=1

15 PCA Objective 2: Projected Variance Empirical distribution: uniform over x 1,...,x n Expectation (think sum over data points): Ê[f(x)] = 1 P n n i=1 f(x i) Variance (think sum of squares if centered): cvar[f(x)] + (Ê[f(x)])2 = Ê[f(x)2 ]= 1 P n n i=1 f(x i) 2

16 PCA Objective 2: Projected Variance Empirical distribution: uniform over x 1,...,x n Expectation (think sum over data points): Ê[f(x)] = 1 P n n i=1 f(x i) Variance (think sum of squares if centered): cvar[f(x)] + (Ê[f(x)])2 = Ê[f(x)2 ]= 1 P n n i=1 f(x i) 2 Assume data is centered: Ê[x] =0

17 PCA Objective 2: Projected Variance Empirical distribution: uniform over x 1,...,x n Expectation (think sum over data points): Ê[f(x)] = 1 P n n i=1 f(x i) Variance (think sum of squares if centered): cvar[f(x)] + (Ê[f(x)])2 = Ê[f(x)2 ]= 1 P n n i=1 f(x i) 2 Assume data is centered: Ê[x] =0

18 PCA Objective 2: Projected Variance Empirical distribution: uniform over x 1,...,x n Expectation (think sum over data points): Ê[f(x)] = 1 P n n i=1 f(x i) Variance (think sum of squares if centered): cvar[f(x)] + (Ê[f(x)])2 = Ê[f(x)2 ]= 1 P n n i=1 f(x i) 2 Assume data is centered: EÊ[x] =0 Objective: maximize variance of projected data max U2R d k,u > U=I Ê[kU > xk 2 ]

19 PCA Objective 2: Projected Variance Empirical distribution: uniform over x 1,...,x n Expectation (think sum over data points): Ê[f(x)] = 1 P n n i=1 f(x i) Variance (think sum of squares if centered): cvar[f(x)] + (Ê[f(x)])2 = Ê[f(x)2 ]= 1 P n n i=1 f(x i) 2 Assume data is centered: Ê[x] =0 (what s Ê[U> x]?) Objective: maximize variance of projected data max U2R d k,u > U=I Ê[kU > xk 2 ]

20 Equivalence of two objectives Key intuition: variance of data {z } fixed = captured variance {z } want large + reconstruction error {z } want small > >

21 Equivalence of two objectives Key intuition: variance of data {z } fixed = captured variance {z } want large + reconstruction error {z } want small Pythagorean decomposition: x = UU > x +(I UU > )x kxk k(i UU > )xk kuu > xk Take expectations; note rotation U doesn t a ect length: Ê[kxk 2 ]=Ê[kU> xk 2 ]+Ê[kx UU> xk 2 ] $

22 Equivalence of two objectives Key intuition: variance of data {z } fixed = captured variance {z } want large + reconstruction error {z } want small Pythagorean decomposition: x = UU > x +(I UU > )x kxk k(i UU > )xk kuu > xk Take expectations; note rotation U doesn t a ect length: Ê[kxk 2 ]=Ê[kU> xk 2 ]+Ê[kx UU> xk 2 ] Minimize reconstruction error $ Maximize captured variance

23 Changes of Basis Data Orthonormal Basis X = ( x 1 x n ) 2 Rd n U = ( u 1 u k ) d 2 Rd d =

24 Changes of Basis Data Orthonormal Basis X = ( x 1 x n ) Change of basis 2 2 Rd n z =(z 1,...,z ) > d z j = u > j x U = ( u 1 u k ) 2 Rd = Inverse Change of basis > x = Uz d d d > z = U > x

25 Principal Component Analysis Data Orthonormal Basis X = ( x 1 x n ) 2 Rd n U = ( u 1 u k ) d 2 Rd d Eigenvectors of Covariance = Eigen-decomposition = C A Claim: Eigenvectors of a symmetric matrix are orthogonal d

26 Principal Component Analysis n (from stack exchange)

27 Principal Component Analysis Data Orthonormal Basis X = ( x 1 x n ) 2 Rd n Eigenvectors of Covariance U = ( u 1 u k ) 2 Rd = Eigen-decomposition d d = C A Idea: Take top-k eigenvectors to maximize variance d

28 Principal Component Analysis Data Truncated Basis X = ( x 1 x n ) 2 Rd n 2 Rd k U = ( u 1 u k ) Eigenvectors of Covariance = Truncated decomposition (k) = C A k

29 PCA: Complexity Data Truncated Basis X = ( x 1 x n ) 2 Rd n U = ( u 1 u k ) 2 Rd k = Using eigen-value decomposition Computation of covariance C: O(n d 2 ) Eigen-value decomposition: O(d 3 ) Total complexity: O(n d 2 +d 3 )

30 PCA: Complexity Data Truncated Basis X = ( x 1 x n ) 2 Rd n U = ( u 1 u k ) 2 Rd k = Using singular-value decomposition Full decomposition: O(min{nd 2, n 2 d}) Rank-k decomposition: O(k d n log(n)) (with power method)

31 Singular Value Decomposition Idea: Decompose a d x d matrix M into 1. Change of basis V (unitary matrix) 2. A scaling Σ (diagonal matrix) 3. Change of basis U (unitary matrix)

32 Singular Value Decomposition Idea: Decompose the d x n matrix X into 1. A n x n basis V (unitary matrix) 2. A d x n matrix Σ (diagonal projection) 3. A d x d basis U (unitary matrix) X = U d d d n V > n n U > U = V > V

33 Eigen-faces [Turk & Pentland 1991] d = number of pixels Each x i 2 R d is a face image x ji = intensity of the j-th pixel in image i

34 Eigen-faces [Turk & Pentland 1991] d = number of pixels Each x i 2 R d is a face image x ji = intensity of the j-th pixel in image i X d n u U d k Z k n (... ) u ( ) ( z 1... z n ) z x

35 Eigen-faces [Turk & Pentland 1991] d = number of pixels Each x i 2 R d is a face image x ji = intensity of the j-th pixel in image i X d n u U d k Z k n (... ) u ( ) ( z 1... z n ) Idea: z i more meaningful representation of i-th face than x i Can use z i for nearest-neighbor classification ( + ) ( )

36 Eigen-faces [Turk & Pentland 1991] d = number of pixels Each x i 2 R d is a face image x ji = intensity of the j-th pixel in image i X d n u U d k Z k n (... ) u ( ) ( z 1... z n ) Idea: z i more meaningful representation of i-th face than x i Can use z i for nearest-neighbor classification Much faster: O(dk + nk) time instead of O(dn) when n, d k

than x i Can use z i for nearest-neighbor classification Much

37 Eigen-faces [Turk & Pentland 1991] d = number of pixels Each x i 2 R d is a face image x ji = intensity of the j-th pixel in image i X d n u U d k Z k n (... ) u ( ) ( z 1... z n ) Idea: z i more meaningful representation of i-th face than x i Can use z i for nearest-neighbor classification Much faster: O(dk + nk) time instead of O(dn) when n, d Why no time savings for linear classifier? k

38 Aside: How many components? Magnitude of eigenvalues indicate fraction of variance captured. Eigenvalues on a face image dataset: i i Eigenvalues typically drop o sharply, so don t need that many. Of course variance isn t everything...

39 Latent Semantic Analysis [Deerwater 1990] d = number of words in the vocabulary Each x i 2 R d is a vector of word counts x =

40 Latent Semantic Analysis [Deerwater 1990] d = number of words in the vocabulary Each x i 2 R d is a vector of word counts x ji = frequency of word j in document i ( X d n u U d k Z ) ( k n ) u stocks: 2 0 chairman: 4 1 the: wins: 0 2 game: 1 3 ( z 1... z n )

41 Latent Semantic Analysis [Deerwater 1990] d = number of words in the vocabulary Each x i 2 R d is a vector of word counts x ji = frequency of word j in document i ( X d n u U d k Z ) ( k n ) u stocks: 2 0 chairman: 4 1 the: wins: 0 2 game: 1 3 ( z 1... z n ) How to measure similarity between two documents? z > 1 z 2 is probably better than x > 1 x 2

42 Latent Semantic Analysis [Deerwater 1990] d = number of words in the vocabulary Each x i 2 R d is a vector of word counts x ji = frequency of word j in document i ( X d n u U d k Z ) ( k n ) u stocks: 2 0 chairman: 4 1 the: wins: 0 2 game: 1 3 ( z 1... z n ) How to measure similarity between two documents? z > 1 z 2 is probably better than x > 1 x 2 Applications: information retrieval Note: no computational savings; original x is already sparse

43 PCA Summary Intuition: capture variance of data or minimize reconstruction error Algorithm: find eigendecomposition of covariance matrix or SVD Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity Advantages: simple,fast Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

44 Probabilistic Interpretation Generative Model [Tipping and Bishop, 1999]: For each data point i =1,...,n: Draw the latent vector: z i N (0,I k k ) Create the data point: x i N (Uz i, 2 I d d ) PCA finds the U that maximizes the likelihood of the data max U p(x U)

45 Probabilistic Interpretation Generative Model [Tipping and Bishop, 1999]: For each data point i =1,...,n: Draw the latent vector: z i N (0,I k k ) Create the data point: x i N (Uz i, 2 I d d ) PCA finds the U that maximizes the likelihood of the data max U p(x U) Advantages: Handles missing data (important for collaborative filtering) Extension to factor analysis: allow non-isotropic noise (replace 2 I d d with arbitrary diagonal matrix)

46 Limitations of Linearity PCA is e ective PCA is ine ective

47 Limitations of Linearity PCA is e ective PCA is ine ective Problem is that PCA subspace is linear: In this example: S = {x = Uz : z 2 R k } S = {(x 1,x 2 ):x 2 = u 2 u 1 x 1 }

48 Nonlinear PCA Broken solution Desired solution We want desired solution: S = {(x 1,x 2 ):x 2 = u 2 u 1 x 2 1} = { (x) =Uz} (x) =( ) >

49 Nonlinear PCA Broken solution Desired solution We want desired solution: S = {(x 1,x 2 ):x 2 = u 2 u 1 x 2 1} We can get this: S = { (x) =Uz} with (x) =(x 2 1,x 2 ) >

50 Nonlinear PCA Broken solution Desired solution We want desired solution: S = {(x 1,x 2 ):x 2 = u 2 u 1 x 2 1} We can get this: S = { (x) =Uz} with (x) =(x 2 1,x 2 ) > { } > Linear dimensionality reduction in (x) space, Nonlinear dimensionality reduction in x space (x) =( sin( ) ) >

51 Nonlinear PCA Broken solution Desired solution We want desired solution: S = {(x 1,x 2 ):x 2 = u 2 u 1 x 2 1} We can get this: S = { (x) =Uz} with (x) =(x 2 1,x 2 ) > { } > Linear dimensionality reduction in (x) space, Nonlinear dimensionality reduction in x space (x) =( sin( ) ) > Idea: Use kernels

52 Kernel PCA Representer theorem: XX > u = u x X u = X = P n i=1 ix i

53 Kernel PCA Representer theorem: XX > u = x u Kernel function: k(x 1, x 2 ) such that K, thekernelmatrixformedbyk ij = k(x i, x j ), is positive semi-definite X u = X = P n i=1 ix i

54 Kernel PCA P Representer theorem: XX > u = x u Kernel function: k(x 1, x 2 ) such that K, thekernelmatrixformedbyk ij = k(x i, x j ), is positive semi-definite X u = X = P n i=1 ix i max kuk=1 u> XX > u = max > X > X =1 > (X > X)(X > X) = max > K =1 > K 2

55 Kernel PCA Direct method: Kernel PCA objective: max > K =1 > K 2 ) kernel PCA eigenvalue problem: X > X = 0 Modular method (if you don t want to think about kernels): Find vectors x 0 1,...,x 0 n such that x 0> i x 0 j = K ij = (x i ) > (x j ) Key: use any vectors that preserve inner products One possibility is Cholesky decomposition K = X x 0 n > Xx 0 n

56 Kernel PCA

57 Canonical Correlation Analysis (CCA)

58 Motivation for CCA [Hotelling 1936] Often, each data point consists of two views: Image retrieval: for each image, have the following: x: Pixels (or other visual features) y: Text around the image

59 Motivation for CCA [Hotelling 1936] Often, each data point consists of two views: Image retrieval: for each image, have the following: x: Pixels (or other visual features) y: Text around the image Time series: x: Signal at time t y: Signal at time t +1

60 Motivation for CCA [Hotelling 1936] Often, each data point consists of two views: Image retrieval: for each image, have the following: x: Pixels (or other visual features) y: Text around the image Time series: x: Signal at time t y: Signal at time t +1 Two-view learning: dividefeaturesintotwosets x: Features of a word/object, etc. y: Features of the context in which it appears

61 Motivation for CCA [Hotelling 1936] Often, each data point consists of two views: Image retrieval: for each image, have the following: x: Pixels (or other visual features) y: Text around the image Time series: x: Signal at time t y: Signal at time t +1 Two-view learning: dividefeaturesintotwosets x: Features of a word/object, etc. y: Features of the context in which it appears Goal: reduce the dimensionality of the two views jointly

62 CCA Example Setup: Input data: (x 1, y 1 ),...,(x n, y n ) (matrices X, Y) Goal: find pair of projections (u, v)

63 CCA Example Setup: Input data: (x 1, y 1 ),...,(x n, y n ) (matrices X, Y) Goal: find pair of projections (u, v) Dimensionality reduction solutions: Independent Joint x and y are paired by brightness

64 CCA Definition Definitions: Variance: cvar(u > x)=u > XX > u Covariance: ccov(u > x, v > y)=u > XY > v Correlation: ccov(u > x,v > y) p p cvar(u > x) cvar(v > y) Objective: maximize correlation between projected views Properties: max u,v dcorr(u> x, v > y) Focus on how variables are related, not how much they vary Invariant to any rotation and scaling of data

65 From PCA to CCA PCA on views separately: no covariance term max u,v u > XX > u u > u + v> YY > v v > v PCA on concatenation (X >, Y > ) > : includes covariance term max u,v u > XX > u +2u > XY > v + v > YY > v u > u + v > v

66 From PCA to CCA PCA on views separately: no covariance term max u,v u > XX > u u > u + v> YY > v v > v PCA on concatenation (X >, Y > ) > : includes covariance term max u,v u > XX > u +2u > XY > v + v > YY > v u > u + v > v Maximum covariance: drop variance terms max u,v u > XY > v p u> u p v > v

67 From PCA to CCA PCA on views separately: no covariance term max u,v u > XX > u u > u + v> YY > v v > v PCA on concatenation (X >, Y > ) > : includes covariance term max u,v u > XX > u +2u > XY > v + v > YY > v u > u + v > v Maximum covariance: drop variance terms max u,v u > XY > v p u> u p v > v Maximum correlation (CCA): divide out variance terms max u,v u > XY > v p u> XX > u p v > YY > v

68 Importance of Regularization Extreme examples of degeneracy: If x = Ay, thenany(u, v) with u = Av is optimal (correlation 1) If x and y are independent, then any (u, v) is optimal (correlation 0) X Y (u v)

69 Importance of Regularization Extreme examples of degeneracy: If x = Ay, thenany(u, v) with u = Av is optimal (correlation 1) If x and y are independent, then any (u, v) is optimal (correlation 0) Problem: ifx or Y has rank n, thenany(u, v) is optimal > (correlation 1) with u = X > Yv ) CCA is meaningless!

70 Importance of Regularization Extreme examples of degeneracy: If x = Ay, thenany(u, v) with u = Av is optimal (correlation 1) If x and y are independent, then any (u, v) is optimal (correlation 0) Problem: ifx or Y has rank n, thenany(u, v) is optimal (correlation 1) with u = X > ) > > Yv ) CCA is meaningless! Solution: regularization (interpolate between maximum covariance and maximum correlation) max u,v u > XY > v p u> (XX > + I)u p v > (YY > + I)v

Data Mining Techniques

Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality