Weighted Low Rank Approximations
Nathan Srebro and Tommi Jaakkola
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Weighted Low Rank Approximations
- What is a weighted low rank approximation?
- What is a low rank approximation?
- Why weighted low rank approximations?
- How do we find a weighted low rank approximation?
- What can we do with weighted low rank approximations?
Low Rank Approximation
[figure: a column of the users x titles matrix A, the preferences of a specific user (a real-valued preference level for each title), is modeled as u1 v1 + u2 v2 + u3 v3, where the vectors v capture title factors (comic value, dramatic value, violence) and the coefficients u are the characteristics of the user]
Low Rank Approximation
[figure: the users x titles data (target) matrix A is approximated by a rank-k product U V', with U of size users x k and V of size titles x k]
Low Rank Approximation
A (data / target) ≈ U V' of rank k
- Compression (mostly to reduce processing time)
- Prediction: collaborative filtering
- Reconstructing a latent signal, e.g. biological processes through gene expression
- Capturing structure in a corpus: documents, images, etc.
- Basic building block, e.g. for non-linear dimensionality reduction
Low Rank Approximation
A (data / target) = X (rank k) + Z (iid Gaussian noise, Z_ij ~ N(0, σ²))
log L(X; A) = log p(A | X) = -(1/2σ²) Σ_ij (A_ij - X_ij)² + const
The maximum-likelihood low-rank matrix under iid Gaussian noise is the low-rank matrix minimizing the sum-squared error, given explicitly in terms of the SVD.
Weighted Low Rank Approximation
A (data / target) = X (rank k) + Z (Gaussian noise, Z_ij ~ N(0, σ_ij²))
log L(X; A) = log p(A | X) = -(1/2) Σ_ij W_ij (A_ij - X_ij)² + const, with W_ij = 1/σ_ij²
Find the low-rank matrix minimizing the weighted sum-squared error [Young 1940].
Weighted Low Rank Approximation
Find the low-rank matrix minimizing the weighted sum-squared error.
- External information about the noise variance of each measurement, e.g. in gene expression analysis
- Missing data (0/1 weights), e.g. collaborative filtering
- Different numbers of samples, e.g. separating style and content [Tenenbaum & Freeman 00]
- Varying importance of different entries, e.g. 2-D filter design [Shpak 90, Lu et al. 97]
- Subroutine for further generalizations
How?
Given A and W, find the rank-k matrix X minimizing the weighted sum-squared difference Σ_ij W_ij (A_ij - X_ij)².
(Unweighted) Low Rank Approximation
(Unweighted) Low Rank Approximation
Objective: J(U,V) = ||A - UV'||²_Fro, where A is the n x d target, U is n x k, and V is d x k (target rank k).
Setting the gradients to zero:
∂J/∂U = (UV' - A) V = 0   ⇒   U V'V = A V
∂J/∂V = (VU' - A') U = 0   ⇒   V U'U = A' U
(Unweighted) Low Rank Approximation
At critical points (∇_{U,V} J(UV') = 0) we may take V orthonormal (V'V = I) and U orthogonal (U'U = Γ diagonal). The gradient equations then give U = AV and A'A V = V Γ, so U and V are correspondingly spanned by eigenvectors of AA' and A'A.
(Unweighted) Low Rank Approximation
At critical points (∇J(UV') = 0), U and V are correspondingly spanned by eigenvectors of AA' and A'A.
Global minimum: U and V are correspondingly spanned by the leading eigenvectors, and J(UV') = ||A - UV'||² is the sum of the unselected eigenvalues.
All other fixed points are saddle points (there are no non-global local minima).
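To make the unweighted case concrete, here is a minimal numpy sketch (the function name low_rank_approx is an illustrative choice): the global minimum is obtained directly from the truncated SVD, and the residual error is the sum of the squared discarded singular values.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of A minimizing the (unweighted) Frobenius error.

    The global optimum keeps the k leading singular components of A
    and drops the rest.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```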
Weighted Low Rank Approximation
J(U,V) = ||W ⊙ (A - UV')||² = Σ_ij W_ij (A_ij - (UV')_ij)²
∂J/∂U_i = ((U_i V' - A_i) ⊙ W_i) V = 0   ⇒   U_i = A_i diag(W_i) V (V' diag(W_i) V)^{-1}
If rank(W) = 1, the matrices V' diag(W_i) V can be simultaneously diagonalized, so eigen-methods apply [Irani & Anandan 2000]. Otherwise eigen-methods cannot be used, and the solutions are not incremental in the rank.
WLRA: Optimization
J(U,V) = ||W ⊙ (A - UV')||²
- Alternating optimization: for fixed V, find the optimal U; for fixed U, find the optimal V.
- Define J*(V) = min_U J(U,V); then ∇_V J*(V) = ((U* V' - A) ⊙ W)' U*, where U* is the optimal U for the given V, and we can run conjugate gradient descent on J*, optimizing kd parameters instead of k(d+n).
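A minimal numpy sketch of the alternating scheme just described, using the closed-form row updates; the function name, the random initialization, and the small ridge term (guarding against singular systems when a row or column has few observed entries) are illustrative choices, not part of the original formulation. Since WLRA can have local minima (next slide), restarts from different initializations may be needed.

```python
import numpy as np

def wlra_alternating(A, W, k, n_iters=100, ridge=1e-8, seed=0):
    """Alternating minimization of sum_ij W[i,j] * (A[i,j] - (U @ V.T)[i,j])**2.

    For a fixed V, each row of U solves an independent weighted
    least-squares problem with a closed-form solution (and symmetrically
    for the rows of V with U fixed).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((d, k))
    I = np.eye(k)
    for _ in range(n_iters):
        for i in range(n):                      # U_i = A_i W_i V (V' W_i V)^-1
            WV = V * W[i][:, None]              # rows of V scaled by the weights of row i
            U[i] = np.linalg.solve(WV.T @ V + ridge * I, WV.T @ A[i])
        for j in range(d):                      # symmetric update for the rows of V
            WU = U * W[:, j][:, None]
            V[j] = np.linalg.solve(WU.T @ U + ridge * I, WU.T @ A[:, j])
    return U, V
```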
Local Minima in WLRA
[figure: a 2 x 2 example, with target entries 1 and 1.1 and weight entries 1 and 1 + α, plotting the weighted error as a function of the angle θ of the rank-one subspace; for suitable α the objective has more than one local minimum]
WLRA: An EM Approach
0/1 weights: A = X + Z, where A holds the observations (entries with weight 0 are missing), X is the rank-k parameter matrix, and Z is N(0,1) noise.
WLRA: An EM Approach
Think of N replicate targets A_1, ..., A_N, each A_r = X + Z_r, with the same low-rank X for all targets and independent noise Z_r for each target A_r.
Expectation step: fill in each missing A_r[i,j] with the current X[i,j].
Maximization step: X ← LRA_k( (1/N) Σ_r A_r ).
WLRA: An EM Approach
Integer weights: WLRA(A, W) with W[i,j] = w[i,j]/N, where the w[i,j] are integers. Set A_r[i,j] = A[i,j], or missing if w[i,j] < r.
Expectation step: fill in each missing A_r[i,j] with the current X[i,j].
Maximization step: X ← LRA_k( (1/N) Σ_r A_r ) = LRA_k( W ⊙ A + (1 - W) ⊙ X ).
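A sketch of this EM iteration, assuming all weights lie in [0, 1] (as with missing-data weights W = w/N); the initialization, the fixed iteration count, and filling unobserved entries of A with zeros are arbitrary choices.

```python
import numpy as np

def wlra_em(A, W, k, n_iters=200):
    """EM iteration for WLRA with weights in [0, 1]: X <- LRA_k(W*A + (1-W)*X).

    Entries with zero weight may hold any finite value in A; they are
    imputed from the current low-rank estimate X (E-step), and an ordinary
    unweighted rank-k SVD is applied to the filled-in matrix (M-step).
    """
    X = np.where(W > 0, A, 0.0)                    # crude initial fill-in
    for _ in range(n_iters):
        filled = W * A + (1.0 - W) * X             # E-step: impute with current X
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k, :]         # M-step: unweighted rank-k LRA
    return X
```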
WLRA: Optimization
J(U,V) = ||W ⊙ (A - UV')||²
- Alternating optimization: for fixed V, find the optimal U; for fixed U, find the optimal V.
- Gradient methods on J*(V) = min_U J(U,V), with ∇_V J*(V) = ((U* V' - A) ⊙ W)' U*.
- EM update: X ← LRA_k( W ⊙ A + (1 - W) ⊙ X ).
Estimation of a Planted Low-Rank Matrix with Non-Identical Noise Variances
[figure: reconstruction error (normalized sum-squared difference to the planted matrix) versus signal-to-noise ratio (weighted sum of the low-rank signal over weighted sum of the noise), comparing unweighted and weighted low-rank approximation]
Collaborative Filtering of Joke Preferences (Jester) [Goldberg+ 2001]
[figure: a users x jokes matrix of real-valued ratings, with many entries unobserved; the task is to predict the held-out ratings]
Collaborative Filtering of Joke Preferences (Jester)
[figure: test error versus WLRA rank (1 to 16), compared against three baselines: SVD with unobserved entries set to 0 and observed entries rescaled to preserve the mean [Azar+ 2001]; SVD on the always-observed columns [Goldberg+ 2001]; SVD with unobserved entries set to 0 [Billsus & Pazzani 98]]
WLRA as a Subroutine for Other Cost Functions
Low Rank Approximation (PCA): Σ_ij (A_ij - X_ij)²
Weighted Low Rank Approximation: Σ_ij W_ij (A_ij - X_ij)²
General: Σ_ij Loss(A_ij, X_ij)
Logistic Low Rank Regression [Collins+ 01, Gordon 03, Schein+ 03]
Pr(Y_ij | X_ij) = g(Y_ij X_ij), with Y observed, X of rank k, and g(x) = 1/(1 + e^{-x})
Loss(Y, X) = Σ_ij -log g(Y_ij X_ij)
A quadratic (Newton-style) approximation around the current estimate X^(t-1) turns each step into a weighted low-rank problem:
W^(t)_ij = g(Y_ij X^(t-1)_ij) g(-Y_ij X^(t-1)_ij)
A^(t)_ij = X^(t-1)_ij + Y_ij / g(Y_ij X^(t-1)_ij)
Loss(Y, X) ≈ (1/2) Σ_ij W^(t)_ij (A^(t)_ij - X_ij)² + const
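Below is a sketch of the resulting outer loop, delegating each weighted subproblem to a WLRA routine such as the wlra_alternating sketch above (passed in as wlra_solver); the clamp on g(YX) and the fixed number of outer iterations are assumptions added for numerical safety, not part of the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_low_rank(Y, k, wlra_solver, n_outer=20):
    """Rank-k logistic model Pr(Y_ij) = g(Y_ij * X_ij), with Y in {-1, +1}.

    Each outer iteration builds the quadratic approximation of the logistic
    loss around the current X and passes the resulting weights W_t and
    targets A_t to a weighted low-rank solver.
    """
    X = np.zeros(Y.shape)
    for _ in range(n_outer):
        p = sigmoid(Y * X)                     # g(Y .* X) at the current estimate
        W_t = p * (1.0 - p)                    # curvature g(YX) * g(-YX)
        A_t = X + Y / np.maximum(p, 1e-6)      # per-entry Newton target (clamped)
        U, V = wlra_solver(A_t, W_t, k)        # weighted low-rank subproblem
        X = U @ V.T
    return X
```

For example, X = logistic_low_rank(Y, k=2, wlra_solver=wlra_alternating) reuses the alternating sketch above as the inner solver.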
Maximum Likelihood Estimation with Gaussian Mixture Noise
Y (observed) = X (rank k) + Z (noise), with Z_ij drawn from a mixture of Gaussians {p_c, μ_c, σ_c} and C the hidden component indicator.
E step: calculate the posteriors of C.
M step: WLRA with
W_ij = Σ_c Pr(C_ij = c) / σ_c²
A_ij = ( Σ_c Pr(C_ij = c) (Y_ij - μ_c) / σ_c² ) / W_ij
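A sketch of this EM loop, holding the mixture parameters {p_c, μ_c, σ_c} fixed (the slide does not spell out whether they are re-estimated as well); the inner WLRA is again delegated to a solver such as the wlra_alternating sketch, and the function name and iteration count are illustrative.

```python
import numpy as np

def mixture_noise_wlra(Y, k, p, mu, sigma, wlra_solver, n_outer=20):
    """ML rank-k X for Y = X + Z, where Z_ij is mixture-of-Gaussians noise.

    p, mu, sigma are length-C arrays of mixture weights, means and standard
    deviations (held fixed here).  E-step: posterior responsibility of each
    component for each entry.  M-step: a WLRA with the per-entry weights and
    targets from the slide.
    """
    X = np.zeros(Y.shape)
    C = len(p)
    for _ in range(n_outer):
        resid = Y - X
        # E-step: unnormalized component densities, one slice per component
        dens = np.stack([p[c] / sigma[c] *
                         np.exp(-(resid - mu[c]) ** 2 / (2.0 * sigma[c] ** 2))
                         for c in range(C)])
        post = dens / dens.sum(axis=0, keepdims=True)       # Pr(C_ij = c | Y, X)
        # M-step: per-entry weights and targets, then a weighted low-rank fit
        W = sum(post[c] / sigma[c] ** 2 for c in range(C))
        A = sum(post[c] * (Y - mu[c]) / sigma[c] ** 2 for c in range(C)) / W
        U, V = wlra_solver(A, W, k)
        X = U @ V.T
    return X
```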
Weighted Low Rank Approximations
Nathan Srebro, Tommi Jaakkola
- Weights are often appropriate in low rank approximation.
- WLRA is more complicated than unweighted LRA: eigen-methods (SVD) do not apply, solutions are non-incremental, and there can be local minima.
- Optimization approaches: alternating optimization, gradient methods on J*(V), and the EM method X ← LRA_k( W ⊙ A + (1 - W) ⊙ X ).
- WLRA is useful as a subroutine for more general loss functions.
www.csail.mit.edu/~nati/lowrank/
END
Local Minima in WLRA
[figure: contours of the weighted objective over the rank-one direction (V[1,·], V[2,·]) for a small example target A and weight matrix W, showing more than one local minimum]
(Unweighted) Low Rank Approximation: characterizing the critical points
At critical points (∇J(UV') = 0), U and V are correspondingly spanned by eigenvectors of AA' and A'A.
Write the SVD of the target, A = U_0 S V_0', with singular values s_1, ..., s_m. Critical points correspond to selecting a subset of the singular components: U = U_0 S Q_U and V = V_0 Q_V, where the Q's are matched 0/1 permuted diagonal (selection) matrices with Q_U' Q_V = I, so that
UV' = Σ_{α selected} s_α u_α v_α'   and   ||A - UV'||² = Σ_{α unselected} s_α².
Global minimum: the selection picks the k leading singular components, i.e. U and V are spanned by the leading eigenvectors.