Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Camilla Casey
5 years ago
1 Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods

2 Outline: Principal Components Analysis (PCA): example (Bishop, ch. 12); PCA as a mixture model variant with a continuous latent variable. Breaking down PCA: optimization problem; solution; intuitions. General matrix factorization: application to collaborative filtering; algorithms. Wrap-up.

3 A Motivating Example: The MNIST digits problem was simplified because the digits were centered, in a canonical position, and scaled to the same size. What if they weren't?

4 A Motivating Example: Take a single 64×64 digit and create a dataset by repeatedly moving it to a 100×100 image, shifting it by (x, y), and rotating it by θ. The dataset has 10,000 features but really only needs 3.

5 A Motivating Example: A prototype is a vector of the same dimension as the instances. PCA reduces each instance to a linear combination of a few prototypes (blue +, green −). [Figure: the first 5 prototypes.] The principal components are a specific choice of prototypes.

6 A Motivating Example: A prototype is a vector of the same dimension as the instances. PCA reduces each instance to a linear combination of a few prototypes (blue +, green −). [Figure: an instance written as a weighted sum (Σ) of the first 5 prototypes.]

7 PCA as matrices [Figure: the matrix V of 1000 images × 10,000 pixels, where V[i,j] = pixel j in image i, is approximated by the product of a 1000 × 2 matrix of per-image weights (rows x_i, y_i) and a matrix of 2 prototypes × 10,000 pixels (rows PC1 = a_1 ... a_m and PC2 = b_1 ... b_m).]

8 PCA as matrices [Figure: the same factorization, with the weights of the second image set to (1.4, 0.5), so that row 2 of V is approximated by 1.4*PC1 + 0.5*PC2.]
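The matrix picture on slides 7 and 8 can be sketched in a few lines of NumPy. This is only an illustration: the prototypes here are random stand-ins for PC1 and PC2, and the names `U`, `Z`, and `image_2` are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 10_000                      # a 100 x 100 image, flattened

# Two hypothetical prototypes (random stand-ins for PC1 and PC2).
pc1 = rng.normal(size=n_pixels)
pc2 = rng.normal(size=n_pixels)
U = np.stack([pc1, pc2])               # shape (2, 10000): one prototype per row

# Per-image weights; the second row is the (1.4, 0.5) example from slide 8.
Z = np.array([[0.3, -1.0],
              [1.4,  0.5],
              [2.0,  0.1]])            # shape (3, 2): one row of weights per image

V = Z @ U                              # shape (3, 10000): V[i, j] = pixel j in image i

# Row 2 of V is exactly the weighted sum of the two prototypes.
image_2 = 1.4 * pc1 + 0.5 * pc2
print(V.shape, np.allclose(V[1], image_2))
```

In real data the product only approximates V; here V is built from the factors, so the match is exact.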

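9 PCA for movie recommendation [Figure: the matrix V of n users × m movies, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × 2 matrix of per-user weights (rows x_i, y_i) and a 2 × m matrix of prototypes (rows a_1 ... a_m and b_1 ... b_m). Bob is one row of V.]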

10 Bob

11 A Cartoon of PCA Red: the dataset

12 A Cartoon of PCA: Green: the reconstruction of the original data. Magenta: the lower-dimensional model (linear combinations of one prototype). In PCA we find a model that minimizes the reconstruction error (blue lines).

13 A 3D Cartoon of PCA

14 Some more cartoons

15 PCA vs Linear Regression [Figure: a data matrix W of n instances (e.g., 150) × r features (e.g., 4: pl, pw, sl, sw), multiplied by an r × 1 weight vector H = (w1, w2, w3, w4), approximates the n × 1 prediction vector Y (m = 1 regressors), where Y[i,1] = instance i's prediction.]

16 PCA vs Linear Regression: In contrast, in regression we'd minimize squared error on one dimension (x_2) using a linear combination of the other dimensions.
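The contrast can be checked numerically on a toy 2-D dataset. The sketch below assumes synthetic data and lines through the origin (the data is zero-mean); `vertical_error` and `perpendicular_error` are illustrative helper names. Regression should win on vertical error along x_2, and the first principal direction should win on perpendicular (reconstruction) error.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(scale=0.3, size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)                       # zero-mean data

# Regression: predict x2 from x1, minimizing squared vertical error.
slope_reg = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

# PCA: first principal direction, minimizing squared perpendicular error.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
u = Vt[0]
slope_pca = u[1] / u[0]

def vertical_error(slope):
    """Sum of squared errors measured along the x2 axis only."""
    return np.sum((X[:, 1] - slope * X[:, 0]) ** 2)

def perpendicular_error(slope):
    """Sum of squared distances from each point to the line through the origin."""
    d = np.array([1.0, slope]) / np.hypot(1.0, slope)   # unit direction of the line
    residual = X - np.outer(X @ d, d)
    return np.sum(residual ** 2)

print(vertical_error(slope_reg), vertical_error(slope_pca))
print(perpendicular_error(slope_reg), perpendicular_error(slope_pca))
```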

17 PCA vs mixture of Gaussians: Mixture of Gaussians. For each point: pick the index of the (latent) Gaussian, Z = k; then pick the point x from the k-th Gaussian, x ~ N(µ_k, σ_k). [Figure: plate notation, with latent variables z_1, z_2, ..., z_N and observed points x_1, x_2, ..., x_N.]

18 PCA vs mixture of Gaussians: Mixture of Gaussians. Pick the index of the (latent) Gaussian, Z = k; then pick the point x from the k-th Gaussian, x ~ N(µ_k, σ_k).
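The two-step generative story for a mixture of Gaussians can be sketched as follows. The mixture weights, means, and standard deviations are made-up values, and each component is assumed to be spherical; none of this comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical mixture parameters: K = 3 spherical Gaussians in 2-D.
weights = np.array([0.5, 0.3, 0.2])                       # P(Z = k)
means = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]])   # mu_k
sigmas = np.array([1.0, 0.5, 1.5])                        # sigma_k (std-dev per component)

def sample_mixture(n):
    """Draw n points: first the latent index z, then x from the z-th Gaussian."""
    z = rng.choice(len(weights), size=n, p=weights)        # pick Z = k (discrete)
    noise = rng.normal(size=(n, 2))
    x = means[z] + sigmas[z][:, None] * noise              # x ~ N(mu_k, sigma_k^2 I)
    return z, x

z, x = sample_mixture(1000)
print(np.bincount(z, minlength=3) / len(z))               # roughly matches `weights`
```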

19 PCA vs mixture of Gaussians: PCA. Pick a continuous value z, which will be used to combine the prototypes u in the model; then pick the point x from a spherical Gaussian centered on zu. [Figure labels: u, ẑu.]
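For comparison, here is a sketch of the PCA generative story with a single prototype. The slide does not say how z is distributed, so a standard normal is assumed here; the prototype `u` and `noise_std` are also made-up values. The estimate `z_hat` is taken to be the projection of x onto u, which is one natural reading of the ẑ in the figure.

```python
import numpy as np

rng = np.random.default_rng(3)

u = np.array([3.0, 1.0]) / np.sqrt(10.0)   # one hypothetical prototype, unit length
noise_std = 0.2                            # hypothetical spherical noise level

def sample_pca_model(n):
    """Draw n points: a continuous latent z, then x from a spherical Gaussian at z*u."""
    z = rng.normal(size=n)                                   # assumed prior: z ~ N(0, 1)
    x = np.outer(z, u) + noise_std * rng.normal(size=(n, 2)) # x ~ N(z*u, noise_std^2 I)
    return z, x

z, x = sample_pca_model(2000)

# Recover an estimate of each latent value by projecting x onto u.
z_hat = x @ u
print(np.corrcoef(z, z_hat)[0, 1])
```

The only structural change from the mixture sketch is that the latent variable is a real number rather than a component index.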

20 PCA vs mixture of Gaussians: In a mixture of Gaussians, z is discrete; in PCA, z is continuous. Comment: we can preprocess the data so that the mean is 0 to simplify the model.

21 Finding the Principal Components: There are different algorithms that can be used: EM (Roweis, NIPS 1997), or an eigenvector computation (next).

22 Outline: PCA: example (Bishop, ch. 12); PCA as a mixture model variant with a continuous latent variable. Breaking down PCA: optimization problem; solution; intuition.

23 The PCA Problem (vectors): Start with a zero-mean dataset, where x^t is the t-th instance. We want to find a small number of orthogonal prototypes u_1, ..., u_k and k weights z^t_1, ..., z^t_k for each instance x^t, so that when x^t is approximated by the weighted sum of the prototypes, the approximation error is small. That is, we want to find the u's and z's that minimize the total squared approximation error.
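In symbols, the problem on slide 23 can be written roughly as follows. This is a sketch of the standard formulation, assuming n instances and using the slide's u and z notation; it may differ in detail from the formulas on the original slide.

```latex
x^t \;\approx\; \hat{x}^t = \sum_{i=1}^{k} z^t_i \, u_i ,
\qquad
\min_{u_1,\dots,u_k,\; z^1,\dots,z^n} \;
\sum_{t=1}^{n} \Bigl\| x^t - \sum_{i=1}^{k} z^t_i \, u_i \Bigr\|^2
\quad \text{subject to } u_i^\top u_j = 0 \text{ for } i \neq j .
```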

24 The PCA Problem (matrices): Given a zero-mean dataset X, find factors U and Z so that X is approximately their product, specifically minimizing the squared reconstruction error under the constraint that the rows of U are orthogonal.
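The matrix version can be sketched the same way. The shapes are assumptions chosen to match the rest of the deck: X is n × d with one instance per row, Z is n × k, and U is k × d with one prototype per row.

```latex
X \;\approx\; Z\,U ,
\qquad
\min_{Z,\,U} \; \bigl\| X - Z\,U \bigr\|_F^2
\quad \text{subject to the rows of } U \text{ being orthogonal.}
```

Here the squared Frobenius norm is the sum of squared entries, so this is the per-instance squared error summed over all instances.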

25 A PCA Algorithm: Start with a zero-mean dataset, where x^t is the t-th instance and f_i is the column of feature values for the i-th feature. Compute the sample covariance matrix C_X. Find the k eigenvectors of C_X with the largest eigenvalues; these are the prototypes, U. Now find Z given X and U.
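The steps on slide 25 can be sketched with NumPy as below. The function name `pca` and the synthetic data are illustrative assumptions. The covariance is normalized by n (using n − 1 would not change the eigenvectors), and the last step uses the fact that, for orthonormal prototypes, the least-squares weights Z are the projections of X onto U.

```python
import numpy as np

def pca(X, k):
    """Return (U, Z): k prototypes as rows of U, and per-instance weights Z."""
    X = X - X.mean(axis=0)                    # enforce the zero-mean assumption
    n = X.shape[0]
    C = (X.T @ X) / n                         # sample covariance C_X (features x features)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh: for symmetric matrices, ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    U = eigvecs[:, top].T                     # rows of U are the prototypes
    Z = X @ U.T                               # find Z given X and U (projection onto U)
    return U, Z

# Synthetic data: 6 features that really depend on 2 latent values, plus small noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 6))
X = X - X.mean(axis=0)

U, Z = pca(X, k=2)
X_hat = Z @ U                                 # reconstruction from 2 prototypes
relative_error = np.sum((X - X_hat) ** 2) / np.sum(X ** 2)
print(U.shape, Z.shape, relative_error)
```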

26 PCA Algorithm: Intuitions: Start with a zero-mean dataset, where x^t is the t-th instance and f_i is the column of feature values for the i-th feature. Compute the sample covariance matrix C_X. Some intuitions: 1. Suppose you wanted to predict feature i from feature j. Your best guess would be proportional to C_X[i,j] times the value of feature j. 2. If you wanted to predict feature i from all other features j, a plausible guess is the sum over j of C_X[i,j] times the value of feature j. 3. Any eigenvector, e, of C_X leads to an internally consistent* set of predictions (* up to a multiplier).

27 PCA: Eigenfaces Turk and Pentland, 1991

28 PCA: Eigenfaces (Turk and Pentland, 1991): Average face. Six eigenfaces (PCs).

29 PCA: Eigenfaces Turk and Pentland, 1991

30 PCA: Eigenfaces

31 PCA: Eigenfaces: How is this done? Simplest approach: add the image with missing values to the data matrix, then minimize reconstruction error over the non-missing values?

32 for image denoising

33 Outline: Principal Components Analysis (PCA). Other types of/applications of matrix factorization: collaborative filtering/recommendation; matrix factorization for CF using gradient descent.

34 What is collaborative filtering?

35 What is collaborative filtering?

36 What is collaborative filtering?

37 What is collaborative filtering?

39 What is collaborative filtering?

40 Other examples of social filtering.

41 Other examples of social filtering.

42 Everyday Examples of Collaborative Filtering: Bestseller lists; Top 40 music lists; the "recent returns" shelf at the library; unmarked but well-used paths thru the woods; the printer room at work; "Read any good books lately?" ... Common insight: personal tastes are correlated. If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y, especially (perhaps) if Bob knows Alice.

43 Outline: Principal Components Analysis (PCA). Other types of/applications of matrix factorization: collaborative filtering/recommendation. Algorithms: K-NN type methods; classification-based methods; matrix factorization.

44 Recovering latent factors in a matrix [Figure: the matrix V of n users × m movies, with entries v_11 ... v_nm, where V[i,j] = user i's rating of movie j.]

45 Recovering latent factors in a matrix [Figure: the n users × m movies matrix V, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × 2 matrix (rows x_i, y_i) and a 2 × m matrix (rows a_1 ... a_m and b_1 ... b_m).] Minimize squared reconstruction error and force the "prototype" users to be orthogonal ⇒ PCA.

46 Talk pilfered from → ... KDD 2011

48 Recovering latent factors in a matrix [Figure: the n users × m movies matrix V, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × r matrix W (rows x_i, y_i) and an r × m matrix H (rows a_1 ... a_m and b_1 ... b_m).]

50 [Equation annotations: user-specific bias term; movie-specific bias term.]
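The equation on slide 50 is not preserved, but a common way to combine bias terms with a factorization is to add them to the dot product of the user and movie factors. The sketch below assumes that form; `predict_rating`, `user_bias`, `movie_bias`, and all numbers are hypothetical, and the original slide may include further terms.

```python
import numpy as np

def predict_rating(W, H, user_bias, movie_bias, i, j):
    """Predicted rating of movie j by user i: two bias terms plus a latent-factor term."""
    return user_bias[i] + movie_bias[j] + W[i] @ H[:, j]

# Hypothetical factors: 2 users, 3 movies, r = 2 latent factors.
W = np.array([[1.0, 0.5],
              [0.2, -1.0]])                 # n users x r
H = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, -1.0]])            # r x m movies
user_bias = np.array([0.3, -0.2])           # e.g. user 0 rates generously
movie_bias = np.array([0.1, 0.0, -0.4])     # e.g. movie 2 is generally disliked

pred = predict_rating(W, H, user_bias, movie_bias, i=0, j=2)
print(pred)
```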

52 Recovering latent factors in a matrix [Figure: the n users × m movies matrix V, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × r matrix W (rows x_i, y_i) and an r × m matrix H (rows a_1 ... a_m and b_1 ... b_m).]

53 ... is like Linear Regression. [Figure: a data matrix W of n instances (e.g., 150) × r features (e.g., 4: pl, pw, sl, sw), multiplied by an r × 1 weight vector H = (w1, w2, w3, w4), approximates the n × 1 prediction vector Y (m = 1 regressors), where Y[i,1] = instance i's prediction.]

54 ... for many outputs at once. [Figure: the data matrix W of n instances (e.g., 150) × r features (e.g., 4), multiplied by an r × m weight matrix H (entries w11, w12, ..., one column per regressor), approximates the n × m prediction matrix Y, where Y[i,j] = instance i's prediction for regression task j.] ... where we also have to find the dataset!

55 Matrix factorization as SGD [Equation annotation: step size.]

56 Matrix factorization as SGD: why does this work? [Equation annotation: step size.]

57 Matrix factorization as SGD: why does this work? Here's the key claim:

58 Checking the claim: Think of SGD for logistic regression. The LR loss compares y and ŷ = dot(w, x). Matrix factorization is similar, but now we update both w (user weights) and x (movie weights).
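A minimal sketch of this idea, assuming squared error on the observed entries only: each observed rating V[i,j] plays the role of one training example, and a single SGD step updates both the user's row of W and the movie's column of H. The function name `sgd_factorize`, the hyperparameters, and the toy data are assumptions, not taken from the slides.

```python
import numpy as np

def sgd_factorize(V, observed, r=2, step_size=0.02, epochs=200, seed=0):
    """Fit V ~ W @ H by SGD on squared error over the observed cells only."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = 0.1 * rng.normal(size=(n, r))        # user weights, one row per user
    H = 0.1 * rng.normal(size=(r, m))        # movie weights, one column per movie
    cells = np.argwhere(observed)            # (i, j) pairs of observed ratings
    for _ in range(epochs):
        rng.shuffle(cells)                   # visit the ratings in random order
        for i, j in cells:
            err = V[i, j] - W[i] @ H[:, j]   # compare y = V[i, j] with y_hat = dot(w, x)
            w_old = W[i].copy()
            W[i] += step_size * err * H[:, j]     # update w (user weights)
            H[:, j] += step_size * err * w_old    # update x (movie weights)
    return W, H

# Toy rank-2 "ratings" matrix with two entries hidden.
rng = np.random.default_rng(5)
V = rng.uniform(0.5, 1.5, size=(6, 2)) @ rng.uniform(0.5, 1.5, size=(2, 5))
observed = np.ones_like(V, dtype=bool)
observed[0, 0] = observed[3, 2] = False

W, H = sgd_factorize(V, observed)
mse = np.mean((V - W @ H)[observed] ** 2)
print(mse)
```

Because only observed cells are visited, the same loop handles missing ratings without any special treatment.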

59 What loss functions are possible? N1, N2: diagonal matrices, sort of like IDF factors for the users/movies. Generalized KL-divergence.

60 What loss functions are possible?

61 What loss functions are possible?

62 ALS = alternating least squares
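A minimal sketch of alternating least squares, assuming a fully observed matrix and squared error: with H fixed, the best W is an ordinary least-squares solution, and vice versa, so the two solves are alternated. The function name `als_factorize` and the small `ridge` term (added only for numerical stability) are assumptions.

```python
import numpy as np

def als_factorize(V, r=2, iterations=20, ridge=1e-6, seed=0):
    """Fit V ~ W @ H by alternating two least-squares solves."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    H = rng.normal(size=(r, m))
    reg = ridge * np.eye(r)
    for _ in range(iterations):
        # Fix H and solve for W:  W = V H^T (H H^T + ridge I)^-1
        W = np.linalg.solve(H @ H.T + reg, H @ V.T).T
        # Fix W and solve for H:  H = (W^T W + ridge I)^-1 W^T V
        H = np.linalg.solve(W.T @ W + reg, W.T @ V)
    return W, H

# Toy matrix that is exactly rank 2, so a rank-2 factorization can fit it closely.
rng = np.random.default_rng(7)
V = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))

W, H = als_factorize(V, r=2)
print(np.max(np.abs(V - W @ H)))
```

With missing ratings, each row of W and each column of H would need its own solve over the observed entries only; that variant is not shown here.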

63 Wrapup: Matrix Multiplications in Machine Learning

64 Recovering latent factors in a matrix [Figure: the n users × m movies matrix V, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × r matrix W (rows x_i, y_i) and an r × m matrix H (rows a_1 ... a_m and b_1 ... b_m).]

65 ... vs PCA [Figure: the n users × m movies matrix V, where V[i,j] = user i's rating of movie j, is approximated by the product of an n × r matrix W (rows x_i, y_i) and an r × m matrix H (rows a_1 ... a_m and b_1 ... b_m).] Minimize squared reconstruction error and force the "prototype" users to be orthogonal ⇒ PCA.

66 ... vs autoencoders & nonlinear PCA (flashback to the NN lecture): Assume we would like to learn the following (trivial?) output function, using the following network. [Figure: input → output.] With linear hidden units, how do the weights match up to W and H?

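67 ... vs k-means [Figure: the original data set X (n examples, entries v_11 ... v_nm) is approximated by the product of Z, a matrix of indicators for r clusters, and M, the matrix of cluster means (rows a_1 ... a_m and b_1 ... b_m).]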
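The k-means picture can be sketched as a factorization X ≈ Z M, where each row of Z is a one-hot cluster indicator and each row of M is a cluster mean. The function name `kmeans_as_factorization`, the plain Lloyd-style iterations, and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans_as_factorization(X, r, iterations=10, seed=0):
    """Return (Z, M) with X ~ Z @ M: Z holds one-hot cluster indicators, M the means."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=r, replace=False)]    # initialize means from the data
    for _ in range(iterations):
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                   # nearest mean for each example
        Z = np.eye(r)[labels]                           # n x r indicator matrix
        counts = Z.sum(axis=0)
        nonempty = counts > 0                           # leave empty clusters unchanged
        M[nonempty] = (Z.T @ X)[nonempty] / counts[nonempty, None]
    return Z, M

# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])

Z, M = kmeans_as_factorization(X, r=2)
reconstruction = Z @ M                                  # each example replaced by its cluster mean
print(np.sum((X - reconstruction) ** 2))
```

Compared with PCA, the constraint changes: the weights in Z must be indicators instead of arbitrary real numbers.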
