Weighted Low Rank Approximations

Size: px

Start display at page:

Download "Weighted Low Rank Approximations"

Coral Powers
6 years ago
Views:

1 Weighted Low Rank Approximations Nathan Srebro and Tommi Jaakkola Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

2 Weighted Low Rank Approximations What is a weighted low rank approximation? What is a low rank approximation? Why weighted low rank approximations? How do we find a weighted low rank approximation? What can we do with weighted low rank approximations?

3 Low Rank Approximation titles titles A u1 v 1 + u v + u3 v 3 preferences of a specific user (real-valued preference level for each title) characteristics of the user comic value dramatic value violence

4 Low Rank Approximation titles A k u titles V k

5 users Low Rank Approximation titles A A data (target) k u U titles V k

6 users Low Rank Approximation titles A A data (target) u U rank k

7 Low Rank Approximation A data (target) rank k U V Compression (mostly to reduce processing time) Prediction: collaborative filtering Reconstructing latent signal biological processes through gene expression Capturing structure in a corpus documents, images, etc Basic building block, e.g. for non-linear dimensionality reduction

8 Low Rank Approximation A data (target) + rank k Z iid Gaussian noise ( ) 1 ; A logp ( A ) ( A ) + logl σ const Max likelihood low-rank matrix given iid Gaussian noise low-rank minimizing sum-squared error given explicitly in terms of SVD

9 Weighted Low Rank Approximation A data (target) rank k + Z Gaussian noise Z ~ N (0, σ ) logl ( ) ; A logp ( A ) 1 ( A + w σ ) w 1/ σ const Find low-rank matrix minimizing weighted sum-squared-error [Young 1940]

10 Weighted Low Rank Approximation low-rank matrix minimizing weighted sum-squared error External information about noise variance for each measurement e.g. in gene expression analysis Missing data (0/1 weights) collaborative filtering Different number of samples e.g. separating style and content [Tenenbaum Freeman 00] Varying importance of different entries e.g. D filters [Shpak 90, Lu et al 97] Subroutine for further generalizations

11 How? Given A and W, find rank k matrix minimizing the weighted sum-square difference W ( A )

12 (Unweighted) Low Rank Approximation

13 (Unweighted) Low Rank Approximation J(UV ) n A - objective d target rank k Fro

14 (Unweighted) Low Rank Approximation J(UV ) objective n d k d A target - n U V k Fro J U ( UV ' A) V 0 J V ( VU ' A' ) U 0 UV VAV VU UA U

15 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A J U J(UV ) objective ( UV ' A) V n UV VAV d k d A target 0 - n U V J V k Fro orthonormal V VI orthogonal U U G Γ diagonal ( VU ' A' ) U VU UA Γ AV 0

16 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A Global Minimum U,V are correspondingly spanned by leading eigenvectors U,V spanned by eigenvectors of AA and AA J ( UV ') A UV ' unselected eigenvalues

17 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A Global Minimum U,V are correspondingly spanned by leading eigenvectors All other fixed points are saddle points (no non-global local minima)

18 Weighted Low Rank Approximation J ( UV ') W ( A UV ') J U i (( U V ' A ) W ) V i i i 0 U i A WV ( V ' WV ) i i i 1 If rank(w)1, (V W i V) can be simultaneously diagonalized eigen-methods apply [IraniAnandan 000] Otherwise, eigen-methods cannot be used, solutions are not incremental

19 WLRA: Optimization J ( UV ') W ( A UV ') For fixed V, find optimal U For fixed U, find optimal V V J * ( V ) minj ( UV ') U J * ( V ) U *'(( U * V ' Y ) W ) Conjugate gradient descent on J* n k U d V k Optimize kd parameters instead of k(d+n)

20 Local Minima in WLRA A W 1 + α α θ

21 WLRA: An EM Approach 0/1 weights: A A+Z observations N(0,1) noise parameters

22 WLRA: An EM Approach Expectation Step: A 1 A A N missing A r [i,j] [i,j] Maximization Step: A r +Z r LRA 1 N A r r same low-rank for all targets A r independent noise Z r for each target A r

23 A 1 WLRA: An EM Approach A A r +Z r A N integer WLRA(A,W) with W[i,j]w[i,j]/N Expectation Step: missing A r [i,j] [i,j] Maximization Step: LRA 1 N A r r A r [i,j]a[i,j], or missing if w[i,j]<r LRA(W A+(1-W) )

24 WLRA: Optimization J ( UV ') W ( A UV ') For fixed V, find optimal U For fixed U, find optimal V V J * ( V ) minj ( UV ') U J * ( V ') U *'(( U * V ' A) W ) Conjugate gradient descent on J* LRA(W A+(1-W) )

25 Estimations of a planted with Non-Identical Noise Variances Reconst. Err: normalized sum diff to planted unweighted weighted Signal/Noise (weighted sum low-rank signal/weighted sum noise)

26 Collaborative Filtering of Joke Preferences ( Jester ) users jokes ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? -4 [Goldberg 001]

27 Collaborative Filtering of Joke Preferences ( Jester ) Test Error WLRA rank SVD, unobserved0, observed rescaled to preserve mean [Azar+ 001] SVD on always observed columns [Goldberg+ 001] SVD, unobserved0 [BillsusPazzani 98]

28 WLRA as a Subroutine for Other Cost Functions ( ) Low Rank Approximation (PCA) A W ( A Weighted Low Rank Approximation ) Loss ( A, )

29 Logistic Low Rank Regression [Collins+ 00, Gordon 003, Schein+ 003] Pr Y g observed rank k rank k g ( x ) 1 1+ e x ( t ) ( t ) Loss ( Y, ) logg ( Y, ) ( t 1) ( t 1) g ( Y ) g ( Y ) Y t t ( ) ( 1) + ( t g ( Y 1) ) + Const

30 Logistic Low Rank Regression [Collins+ 00, Gordon 003, Schein+ 003] Pr Y g observed rank k rank k g ( x ) 1 1+ e x ( t ) ( t ) Loss ( Y, ) logg ( Y, ) ( t 1) ( t 1) g ( Y ) g ( Y ) Y t t W (t ) ( ) ( 1) + A (t ) ( t g ( Y W ( t ) g ( Y ( t 1) ) g ( Y ( t 1) ) A ( t ) Y 1) ) ( t 1) + ( t 1) g ( Y ) + Const

31 Maximum Likelihood Estimation with Gaussian Mixture Noise Y observed + rank k Z noise C (hidden) Mixture of Gaussians: {p c,µ c,σ c } E step: calculate posteriors of C M step: WLRA with Pr( C c ) W σ c C A Pr( C Y + c σc c ) µ C W

32 Weighted Low Rank Approximations Nathan Srebro Tommi Jaakkola Weights often appropriate in low rank approximation WLRA more complicated than unweighted LRA Eigenmethods (SVD) do not apply Non-incremental Local minima Optimization approaches: Alternate optimization Gradient methods on J*(V) EM method: LRA(W A+(1-W) ) WLRA useful as subroutine for more general loss functions

33 END

34 Local Minima in WLRA A 1 1 W V[,-1] V[1,] J⅓ J⅓ J3

35 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A A U 0 S V 0

36 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A A U 00 S V 0 Q U Q V I U U 0 S Q U V V 0 Q V

37 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A A U* U U 0 S Q U s 1 s s 3 s m V 0 V* 1 V 0 Q V A α s UV ' s α α selected α A UV ' s α unselected α 0/1 permuted diagonal matrix

38 (Unweighted) Low Rank Approximation J ( UV ') U, V 0 U,V are correspondingly spanned by eigenvectors of AA and A A Global Minimum U,V are correspondingly spanned by leading eigenvectors U U 0 S 1 V V

Learning with Matrix Factorizations

Learning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Adviser: Tommi Jaakkola (MIT) Committee: Alan Willsky (MIT),