A Simple Algorithm for Nuclear Norm Regularized Problems
ICML 2010
Martin Jaggi, Marek Sulovský (ETH Zurich)
Matrix Factorizations for recommender systems

Approximate the (Customer × Movie) rating matrix $Y$ by a low-rank product $UV^T$, with factor vectors $u^{(1)}, \dots, u^{(k)}$ and $v^{(1)}, \dots, v^{(k)}$.

The Netflix challenge: ~17,000 movies, ~500,000 customers, ~100,000,000 ratings (about 1% of entries observed).

Each latent feature pairs a movie attribute with a customer attribute, e.g. $u^{(k)}$ = "Angelina Jolie plays in movie j" and $v^{(k)}$ = "Customer i is male".
Matrix Factorizations in machine learning

Applications:
- Customer i × Product j (Amazon, Netflix, etc.)
- Customer i × Customer j
- Word i × Document j (search engines, Latent Semantic Analysis)
- many other applications (e.g. feature generation, dimensionality reduction, clustering)
Regularization

Approximate $Y \in \mathbb{R}^{n \times m}$ by $X := UV^T$, balancing the error (loss) against model complexity (regularization): low rank, or low norm.

Trade-off variant: $\min_X f(X) + \mu\,\mathrm{rank}(X)$   or   $\min_X f(X) + \mu\,\|X\|_*$
Constrained variant: $\min_X f(X)$ s.t. $\mathrm{rank}(X) \le k$   or   $\min_X f(X)$ s.t. $\|X\|_* \le t$

with the loss over the observed entries $S$: $f(X) := \sum_{(i,j) \in S} (X_{ij} - Y_{ij})^2$

The low-norm variants are the nuclear norm regularized problems.
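For concreteness, a minimal NumPy sketch of this observed-entry loss (the function name and the representation of $S$ as index arrays are my own choices, not from the slides):

```python
import numpy as np

def squared_loss(X, Y, rows, cols):
    """f(X) = sum over observed entries (i, j) in S of (X_ij - Y_ij)^2.
    (rows, cols) are index arrays encoding the observed set S."""
    diff = X[rows, cols] - Y[rows, cols]
    return float(np.sum(diff ** 2))
```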
Existing Methods

Optimization problem ($f(\cdot)$ convex):
  $\min_X f(X)$ s.t. constraint$(X)$

Existing ML methods solve the factorized problem instead:
  $\min_{U,V} f(UV^T)$ s.t. constraint$(U, V)$   (not convex)

Nuclear norm case: the constraint $\|X\|_* \le t$ becomes $\frac{1}{2}\left(\|U\|_{\mathrm{Fro}}^2 + \|V\|_{\mathrm{Fro}}^2\right) \le t$.

Consequence: local minima.
Convex optimization

Stack the two factors and lift:
  $\binom{U}{V}\,(U^T\ V^T) = \begin{pmatrix} UU^T & UV^T \\ VU^T & VV^T \end{pmatrix} =: Z \in \mathrm{Sym}^{(n+m)\times(n+m)}$,
so that $X = UV^T$ is the off-diagonal block of $Z$.

Optimization problem:
  $\min_X f(X)$ s.t. constraint$(X)$
Our method solves:
  $\min_{Z \in \mathrm{Sym}^{n+m},\ Z \succeq 0} f(Z)$ s.t. constraint$(Z)$   (convex)

Nuclear norm case: $\|X\|_* \le t$ becomes $\mathrm{Tr}(Z) = t$.

Consequence: no local minima.
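A small NumPy check of this lifting may help; it is an illustration under my own naming, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 2
U = rng.standard_normal((n, k))
V = rng.standard_normal((m, k))

# Stack the factors and lift: Z is symmetric PSD, and X = U V^T sits
# in its off-diagonal block.
W = np.vstack([U, V])
Z = W @ W.T
X = Z[:n, n:]                                  # equals U @ V.T

# Tr(Z) = ||U||_F^2 + ||V||_F^2 >= 2 ||X||_*, with equality at the
# optimal factorization -- this is why the trace constraint on Z
# encodes the nuclear norm constraint on X.
print(np.trace(Z) >= 2 * np.linalg.norm(X, ord="nuc"))  # True
```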
Sparse Approximation

The Problem ($f(\cdot)$ convex, differentiable):
  $\min_{x \in \mathbb{R}^n} f(x)$ s.t. $x \ge 0,\ \mathbf{1}^T x = 1$   (the simplex)
  $\min_{Z \in \mathrm{Sym}^{n \times n}} f(Z)$ s.t. $Z \succeq 0,\ \mathrm{Tr}(Z) = 1$   (the spectahedron)
The Algorithm [Clarkson SODA '08] [Hazan LATIN '08]

Coordinate descent on the simplex:
  $x^{(k+1)} := (1-\lambda)\,x^{(k)} + \lambda\,e_i$, with $i := \arg\max_i \left(-\nabla f(x^{(k)})\right)_i$
On the spectahedron:
  $Z^{(k+1)} := (1-\lambda)\,Z^{(k)} + \lambda\,vv^T$, with $v := \arg\max_{\|v\|=1} v^T\!\left(-\nabla f(Z^{(k)})\right)v$, the largest eigenvector

Step size $\lambda = 1/k$.

After k steps, $x^{(k)}$ is a combination of k basis vectors (sparsity $\le$ k), and $Z^{(k)}$ is a combination of k rank-1 terms $v^{(1)}v^{(1)T}, \dots, v^{(k)}v^{(k)T}$ (rank $\le$ k).

No projection steps!
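A minimal NumPy sketch of this update on the spectahedron (the function name, the toy objective, and the exact eigensolve standing in for the approximate one are my own choices):

```python
import numpy as np

def hazan_spectahedron(grad_f, n, steps):
    """Sketch of the rank-1 update on the spectahedron {Z PSD, Tr(Z) = 1}:
    move toward v v^T, where v is the top eigenvector of -grad_f(Z).
    The iterate stays feasible, so no projection step is ever needed."""
    Z = np.eye(n) / n                       # feasible start, trace 1
    for k in range(1, steps + 1):
        w, V = np.linalg.eigh(-grad_f(Z))   # exact eigensolve, for the sketch
        v = V[:, -1]                        # eigenvector of largest eigenvalue
        lam = 1.0 / k                       # step size from the slides
        Z = (1 - lam) * Z + lam * np.outer(v, v)
    return Z

# Toy use: project a symmetric matrix A onto the spectahedron,
# i.e. minimize f(Z) = ||Z - A||_F^2 with grad_f(Z) = 2 (Z - A).
A = np.diag([3.0, 1.0, -1.0])
Z = hazan_spectahedron(lambda Z: 2 * (Z - A), n=3, steps=200)
```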
The Convergence

After $O(1/\varepsilon)$ steps, the primal-dual error is $\le \varepsilon$ (for both the simplex and the spectahedron variant).

Approximate eigenvector computation: instead of the exact
  $v := \arg\max_{\|v\|=1} v^T M v$, with $M := -\nabla f(Z^{(k)})$,
it is enough to work with a unit vector $v$ satisfying
  $v^T M v \ge \lambda_{\max}(M) - O(\varepsilon)$.
Such a $v$ can be found by doing $O(1/\sqrt{\varepsilon})$ Lanczos steps. Alternative: the power method.
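A power-method sketch of this approximate eigenvector step (the `matvec` abstraction and the defaults are my own; the slides leave the choice of eigensolver open):

```python
import numpy as np

def approx_top_eigenvector(matvec, n, iters=50, seed=0):
    """Power-method stand-in for the approximate eigenvector step.

    Only matrix-vector products with M := -grad f(Z) are needed.  Plain
    power iteration targets the eigenvalue of largest magnitude, so if M
    can have large negative eigenvalues, shift it first (use M + c*I).
    Lanczos would reach the same accuracy in fewer iterations."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = matvec(v)                 # one product M v per iteration
        v /= np.linalg.norm(v)
    return v

# Usage with a dense M: approx_top_eigenvector(lambda x: M @ x, n=M.shape[0])
```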
Low Norm Matrix Factorization

Loss over the observed entries $S$:
  $f(Z) := \sum_{(i,j) \in S} (Z_{ij} - Y_{ij})^2$

We need the largest eigenvector of $M := -\nabla f(Z^{(k)})$. This gradient is nonzero only on the observed entries, so $M$ is a sparse $(n+m)\times(n+m)$ matrix.

Power method: iterate $v := \frac{Mv}{\|Mv\|}$; each step is one sparse matrix-vector product. These computations correspond to Simon Funk's method. A sketch of one full iteration follows below.
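Putting the pieces together, here is a sketch of one iteration of the method for this objective, with SciPy's sparse matrices playing the role of the sparse gradient (all names, the trace-1 normalization, and the power-method inner loop are my own simplifications, not the authors' reference code):

```python
import numpy as np
from scipy.sparse import csr_matrix

def completion_step(Z, y_obs, rows, cols, n, m, k, pm_iters=30, seed=0):
    """One rank-1 update for f(Z) = sum_{(i,j) in S} (Z_ij - Y_ij)^2 on the
    spectahedron (trace normalized to 1 here; rescale v v^T by t to handle
    the general constraint Tr(Z) = t)."""
    X = Z[:n, n:]
    # -grad f is supported on the observed entries only: a sparse matrix G
    # with G_ij = 2 (Y_ij - X_ij) forms the off-diagonal block of M.
    G = csr_matrix((2.0 * (y_obs - X[rows, cols]), (rows, cols)), shape=(n, m))

    # The top eigenvector of M = [[0, G], [G^T, 0]] is (u; w)/sqrt(2), where
    # (u, w) is the top singular pair of G: power-iterate on G^T G.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)
    for _ in range(pm_iters):
        w = G.T @ (G @ w)             # two sparse mat-vecs per iteration
        w /= np.linalg.norm(w)
    u = G @ w
    u /= np.linalg.norm(u)
    v = np.concatenate([u, w]) / np.sqrt(2.0)

    lam = 1.0 / k                     # step size; k is the iteration counter
    return (1 - lam) * Z + lam * np.outer(v, v)
```

Iterating completion_step for k = 1, 2, ... yields an iterate of rank at most k whose off-diagonal block is the low-norm prediction X.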
Comparison

Method                            | Convergence guarantee | Step complexity
MMMF / alternating grad. descent  | --                    | k(n + m)
Singular Value Thresholding       | O(1/√ε)               | compute exact, full SVD
Our Method                        | O(1/ε)                | compute approx. eigenvector
Simon Funk's / SVD++ *            | --                    | matrix-vector multiplication

Further criteria: convexity, control on the rank.
(* solves a different optimization problem)
Experiments

> 5x faster than existing Singular Value Thresholding methods such as [Toh & Yun '09, Mazumder et al. '09, Ji & Ye ICML '09, ...]

Scales well to larger problems such as the Netflix data.

[Figure: train and test RMSE vs. number of iterations k on MovieLens 10M, comparing the step-size rules 1/k, best on line segment, and gradient interpolation]

Prediction performance is
- comparable to the best non-linear MMMF methods such as [Lawrence & Urtasun ICML '09]
- slightly worse than the custom-engineered methods for Netflix.

[Figure: sensitivity to the regularization parameter: test RMSE (k = 1000) as a function of the trace regularization t]
Conclusions
- Overall computational cost is about the same as a single SVD
- First algorithm for nuclear norm optimization that does not need an SVD as an internal computation
- First Simon-Funk-type algorithm with a convergence guarantee
- Easy to implement and to parallelize; any approximate eigenvector method of choice can be used internally
Thanks