Learning representations

Size: px

Start display at page:

Download "Learning representations"

Amie Miles
5 years ago
Views:

Learning representations Optimization-Based Data Analysis http://www.

1 Learning representations Optimization-Based Data Analysis Carlos Fernandez-Granda 4/11/2016

2 General problem For a dataset of n signals X := [ x 1 x 2 x n ] learn a model such that x j k Φ i A ij, 1 j n, for k n i=1 Φ 1,..., Φ k R d are atoms A 1,..., A n R k are coefficient vectors

3 Matrix factorization problem Equivalent formulation X [ Φ 1 Φ 2 Φ k ] [ A1 A 2 A n ] = Φ A Φ R d k, A R k n Nonconvex problem!

4 Matrix factorization problem

5 Applications Learned representation Φ can be used for compression, denoising, classification, etc. Coefficient patterns in A allow to cluster the data

6 Faces dataset, 10 images of 40 different people

7 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

8 K-means Aim: Divide x 1,... x n into k classes Learn Φ 1,..., Φ k that minimize n x i Φ c(i) 2 2 i=1 c (i) := arg min 1 j k x i Φ j 2

9 Matrix-factorization interpretation Equivalent formulation X [ Φ 1 Φ 2 Φ k ] [ ec(1) e c(2) e c(n) ] = Φ A

10 Lloyd s algorithm Initialize Φ 1,..., Φ k R d randomly Repeat 1. Assignment step 2. Averaging step c (i) := arg min 1 j k x i Φ j 2 Φ j := n i=1 δ (c ( j) = i) x i n i=1 δ (c ( j) = i)

11 Lloyd s algorithm

12 Lloyd s algorithm

13 Lloyd s algorithm

14 Lloyd s algorithm

15 Lloyd s algorithm

16 Spike sorting in neuroscience Electrode data from the retina [Litke et al 2004]

17 Spike sorting in neuroscience k = 2 k = 3 k = 4 Data from Chichilnisky lab at Stanford

18 Original data

19 Original data

20 Faces dataset, k = 5

21 Faces dataset, k = 15

22 Faces dataset, k = 40

23 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

24 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

25 Singular-value decomposition Every real matrix A R d n, d n has a unique singular-value decomposition (SVD) σ v T A = [ 1 ] u 1 u 2 u d 0 σ 2 0 v T σ d vd T = UΣV T The singular values are σ 1 σ 2 σ d 0 The left singular vectors u 1,..., u d R d are a basis of the column space The right singular vectors v 1,..., v d R n are a basis of the row space

26 Variation of the data in a certain direction Variation of dataset in the direction of unit vector u n n P span(u) x i 2 2 = u T x i xi T i=1 i=1 = u T XX T u = X T u 2 2 u

27 Directions of maximum variation σ 1 = max u 2 =1 X T u 2 Nonconvex problem! U 1 = arg max u 2 =1 X T u 2 σ j = max u 2 =1 u U 1,...,U j 1 X T u, 2 2 j d U j = arg max u 2 =1 u U 1,...,U j 1 X T u, 2 2 j d

28 Example: 2D data σ 1 n = σ 2 n = u 1 u 2

29 Example: 2D data σ 1 n = σ 2 n = u 1 u 2

30 Example: 2D data σ 1 n = σ 2 n = u 1 u 2

31 Centering is important! σ 1 n = σ 2 n = u 1 u 2

32 Centering is important! σ 1 n = σ 2 n = u 2 u 1

33 Covariance of a random vector Assume data x 1,..., x n are samples from a random d-dimensional vector ˇx with covariance matrix Σˇx The variance of the projection of ˇx onto span (u) is ( ) Var ˇx T u = u T Σˇx u Eigenvectors of Σˇx are directions of maximum variance

34 Empirical covariance Given samples x 1,..., x n, the empirical mean is x = 1 n n i=1 x i and the empirical covariance is Σ n := 1 n n (x i x) (x i x) T i=1 = 1 n XX T If the mean is known, it s an unbiased estimate of the true covariance

35 Probabilistic interpretation When the estimate converges, PCA finds directions of maximum variance ( ) Var ˇx T u = u T Σˇx u 1 n ut XX T u = 1 X T u n 2 2

36 Example: n = 5 True covariance Empirical covariance

37 Example: n = 20 True covariance Empirical covariance

38 Example: n = 100 True covariance Empirical covariance

39 Best k-rank approximation For any subspace S of dimension k min {m, n} U 1:k U1:k T X 2 F = n P span(u1,u 2,...,U k )x i 2 2 i=1 n P S x i 2 2 i=1 This implies that for any matrix M of rank k X U 1:k Σ 1:k V1:k T X M F F The truncated SVD is the best low-rank approximation

40 Principal component analysis Data: x 1, x 2,..., x n 1. Center the data. Compute x i = x i 1 n n x i, i=1 2. Group the centered data in a data matrix X X = [ x 1 x 2 x n ] Compute the SVD of X and extract the first k left singular vectors

41 Principal component analysis (PCA) First k left singular vectors can be interpreted as atoms X [ U 1 U 2 U k ] A := ΦA σ V T 1 A := 0 σ 2 0 V T σ k Vk T

42 Faces dataset

43 Collaborative filtering A := Bob Molly Mary Larry The Dark Knight Spiderman Love Actually Bridget Jones s Diary Pretty Woman Superman 2

44 Centering µ := 1 n m n A ij, i=1 j=1 Ā := µ µ µ µ µ µ µ µ µ

45 SVD A Ā = UΣV T = U V T

46 First left singular vector U 1 = D. Knight Sp. 3 Love Act. B.J. s Diary P. Woman Sup. 2 ( ) Interpretations: Score atom: Centered scores for each person are proportional to U 1 Coefficients: They cluster movies into action (+) and romantic (-)

47 First right singular vector V 1 = Bob Molly Mary Larry ( ) Interpretations: Score atom: Centered scores for each movie are proportional to V 1 Coefficients: They cluster people into action (-) and romantic (+)

48 Rank 1 model Bob Molly Mary Larry 1.34 (1) 1.19 (1) 4.66 (5) 4.81 (4) The Dark Knight 1.55 (2) 1.42 (1) 4.45 (4) 4.58 (5) Spiderman 3 Ā + σ 1 U 1 V1 T = 4.45 (4) 4.58 (5) 1.55 (2) 1.42 (1) Love Actually 4.43 (5) 4.56 (4) 1.57 (2) 1.44 (1) B.J. s Diary 4.43 (4) 4.56 (5) 1.57 (1) 1.44 (2) Pretty Woman 1.34 (1) 1.19 (2) 4.66 (5) 4.81 (5) Superman 2

49 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

50 Nonnegative matrix factorization Nonnegative atoms/coefficients can make results easier to interpret X Φ A, Φ i,j 0, A i,j 0, for all i, j Nonconvex optimization problem: minimize X Φ Ã 2 2 subject to Φ i,j 0, Ã i,j 0, for all i, j Φ R d k and Ã R k n for a fixed k

51 Faces dataset

52 Topic modeling A := singer GDP senate election vote stock bass market band Articles a b c d e f

53 SVD A Ā = UΣV T = U V T

54 Left singular vectors a b c d e f U 1 = ( ) U 2 = ( ) U 3 = ( )

55 Right singular vectors singer GDP senate election vote stock bass market band V 1 = ( ) V 2 = ( ) V 3 = ( )

56 Nonnegative matrix factorization X W H W i,j 0, H i,j 0, for all i, j

57 Right nonnegative factors singer GDP senate election vote stock bass market band H 1 = ( ) H 2 = ( ) H 3 = ( ) Interpretations: Count atom: Counts for each doc are weighted sum of H 1, H 2, H 3 Coefficients: They cluster words into politics, music and economics

58 Left nonnegative factors a b c d e f W 1 = ( ) W 2 = ( ) W 3 = ( ) Interpretations: Count atom: Counts for each word are weighted sum of W 1, W 2, W 3 Coefficients: They cluster docs into politics, music and economics

59 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

60 Sparse PCA Sparse atoms can make results easier to interpret X Φ A, Φ sparse Nonconvex optimization problem: minimize subject to X Φ Ã 2 k + λ 1 Φ i 2 2 Φ i = 1, i=1 1 i k Φ R d k and Ã R k n for a fixed k

61 Faces dataset

62 K-means Low-rank models Principal component analysis Nonnegative matrix factorization Sparse PCA Dictionary learning

63 Dictionary learning Learn sparsifying dictionary X Φ A, A sparse Nonconvex optimization problem: minimize subject to X Φ Ã 2 k + λ Ãi 2 2 Φ i = 1, i=1 1 i k 1 Φ R d k and Ã R k n for a fixed k

64 Dictionary learning

65 Denoising via dictionary learning

66 Denoising via dictionary learning

67 Denoising via dictionary learning

Overview. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Overview. Optimization-Based Data Analysis. Carlos Fernandez-Granda Overview Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 1/25/2016 Sparsity Denoising Regression Inverse problems Low-rank models Matrix completion