A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
1 A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices. Ryota Tomioka (1), Taiji Suzuki (1), Masashi Sugiyama (2), Hisashi Kashima (1). (1) The University of Tokyo, (2) Tokyo Institute of Technology. ICML 2010.
2 Introduction: Learning low-rank matrices
1 Matrix completion [Srebro et al. 05; Abernethy et al. 09]: collaborative filtering, link prediction. $Y \approx W$ with only part of $Y$ observed; $W = W_{\mathrm{user}} W_{\mathrm{movie}}^\top$.
2 Multi-task learning [Argyriou et al., 07]: $y = W x + b$ with $W = (w_{\mathrm{task}\,1}, w_{\mathrm{task}\,2}, \ldots, w_{\mathrm{task}\,R})^\top$.
3 Predicting over matrices (classification/regression) [Tomioka & Aihara, 07]: $y = \langle W, X \rangle + b$, where $X$ is a space-by-time matrix and $W = W_{\mathrm{task}} W_{\mathrm{feature}}^\top = W_{\mathrm{space}} W_{\mathrm{time}}^\top$.
3 Introduction: Problem formulation
Primal problem:
$\min_{W \in \mathbb{R}^{R \times C}} \; f_\ell(A(W)) + \phi_\lambda(W) \;=:\; f(W)$
$f_\ell : \mathbb{R}^m \to \mathbb{R}$: loss function (differentiable).
$A : \mathbb{R}^{R \times C} \to \mathbb{R}^m$: design matrix (linear operator).
$\phi_\lambda$: regularizer (non-differentiable); for example the trace norm $\phi_\lambda(W) = \lambda \|W\|_* = \lambda \sum_{j=1}^r \sigma_j(W)$, the sum of singular values.
Separating $f_\ell$ and $A$ gives convergence regardless of $A$ (the input data).
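As a concrete illustration (an added sketch, not from the slides; assumes numpy), the trace-norm penalty can be evaluated directly from the singular values:

```python
import numpy as np

def trace_norm_penalty(W, lam):
    """phi_lambda(W) = lam * sum of singular values of W (the trace norm)."""
    return lam * np.linalg.svd(W, compute_uv=False).sum()

# a rank-2 matrix: only two non-zero singular values contribute
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 80))
print(trace_norm_penalty(W, lam=0.1))
```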
4 Introduction: Existing approaches and our goal
Proximal accelerated gradient [Ji & Ye, 09]: can keep the intermediate solutions low-rank; slow for a poorly conditioned design matrix $A$; optimal as a first-order black-box method, $O(1/k^2)$.
Interior point algorithm [Tomioka & Aihara, 07]: obtains a highly precise solution in a small number of iterations; only low-rank in the limit, and each step can be heavy.
Question: can we keep the intermediate solutions low-rank and still converge as rapidly as an IP method?
Dual Augmented Lagrangian (DAL) [Tomioka & Sugiyama, 09] was proposed for sparse estimation; M-DAL generalizes it to low-rank matrix estimation.
7 Introduction: Outline
1 Introduction: why learn low-rank matrices? Existing algorithms.
2 Proposed algorithm: augmented Lagrangian, proximal minimization, super-linear convergence, generalizations.
3 Experiments: synthetic 10,000 x 10,000 matrix completion; BCI classification problem (learning multiple matrices).
4 Summary.
8 Algorithm: The M-DAL algorithm
1 Initialize $W^0$. Choose $\eta_1 \le \eta_2 \le \cdots$.
2 Iterate until the relative duality gap $< \epsilon$:
$\alpha^t := \mathrm{argmin}_{\alpha \in \mathbb{R}^m} \phi_t(\alpha)$ (inner objective),
$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha^t)$.
Choosing $\alpha^t = -\nabla f_\ell(A W^t)$ instead yields the proximal gradient (forward-backward) method.
Here $\mathrm{ST}_\lambda(W) := \mathrm{argmin}_{X \in \mathbb{R}^{R \times C}} \; \phi_\lambda(X) + \tfrac12 \|X - W\|_{\mathrm{fro}}^2$.
For the trace norm $\phi_\lambda(W) = \lambda \|W\|_*$: $\mathrm{ST}_\lambda(W) = U \max(S - \lambda I, 0) V^\top$, where $W = U S V^\top$ is the singular value decomposition.
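For the trace norm, the soft-thresholding operation $\mathrm{ST}_\lambda$ has the closed form above. A minimal numpy sketch (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def svd_soft_threshold(W, lam):
    """ST_lambda(W) = U max(S - lam, 0) V^T: prox of the trace norm lam * ||.||_*."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)
    keep = s_thr > 0                      # only the surviving singular directions
    return (U[:, keep] * s_thr[keep]) @ Vt[keep, :]

# the output is low-rank whenever lam exceeds the small singular values
W = np.random.default_rng(1).standard_normal((50, 40))
print(np.linalg.matrix_rank(svd_soft_threshold(W, lam=5.0)))
```

Only the singular values above the threshold matter, which is what makes partial (structured) SVDs attractive later in the talk.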
9 Algorithm: Proximal minimization [Rockafellar 76a, 76b]
1 Initialize $W^0$.
2 Iterate:
$W^{t+1} = \mathrm{argmin}_W \; f(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2 = \mathrm{argmin}_W \; f_\ell(AW) + \phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2$,
where $\tfrac{1}{2\eta_t}\|W - W^t\|^2$ is the proximal term and we write $\phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2 =: \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}(W; W^t)$.
Fenchel dual (see Rockafellar 1970):
$\min_W \; f_\ell(AW) + \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}(W; W^t) \;=\; \max_\alpha \; -f_\ell^*(-\alpha) - \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}^*(\eta_t A^\top \alpha; W^t) \;=:\; -\min_\alpha \phi_t(\alpha)$.
12 Algorithm: Super-linear convergence
Definition. $W^*$: the unique minimizer of the objective $f(W)$. $W^t$: the sequence generated by the M-DAL algorithm with the inner problems solved to the accuracy $\|\nabla\phi_t(\alpha^t)\| \le \sqrt{\gamma/\eta_t}\,\|W^{t+1} - W^t\|_{\mathrm{fro}}$, where $1/\gamma$ is the Lipschitz constant of $\nabla f_\ell$.
Assumption. There is a constant $\sigma > 0$ such that $f(W^{t+1}) - f(W^*) \ge \sigma \|W^{t+1} - W^*\|_{\mathrm{fro}}^2$ for $t = 0, 1, 2, \ldots$
Theorem (super-linear convergence). $\|W^{t+1} - W^*\|_{\mathrm{fro}} \le \tfrac{1}{\sqrt{1 + \sigma\eta_t}} \|W^t - W^*\|_{\mathrm{fro}}$. I.e., $W^t$ converges super-linearly to $W^*$ if $\eta_t$ is increasing.
13 Algorithm: Why is M-DAL efficient?
1 Proximation with respect to $\phi_\lambda$ is analytic (though non-smooth): $W^{t+1} = \mathrm{ST}_{\eta_t\lambda}(W^t + \eta_t A^\top \alpha^t)$.
2 The inner minimization is smooth:
$\alpha^t = \mathrm{argmin}_{\alpha \in \mathbb{R}^m} \; f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\mathrm{ST}_{\eta_t\lambda}(W^t + \eta_t A^\top \alpha)\|_{\mathrm{fro}}^2$.
The first term is differentiable; the second term (the $\Phi_\lambda^*$ part) costs only linear time in the estimated rank.
(Figure: the non-differentiable penalty $\phi_\lambda(w)$ compared with the differentiable envelope $\Phi_\lambda^*(w)$, with kinks at $\pm\lambda$.)
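To make the smooth inner problem concrete, here is a sketch (an added example with illustrative names) of $\phi_t(\alpha)$ and $\nabla\phi_t(\alpha)$ for matrix completion with the squared loss $f_\ell(z) = \tfrac12\|z - y\|^2$, so that $f_\ell^*(-\alpha) = \tfrac12\|\alpha\|^2 - \alpha^\top y$. For clarity it forms the dense matrix; the scalable variant keeps $W^t$ factorized (see the appendix):

```python
import numpy as np

def svd_soft_threshold(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def inner_objective_and_grad(alpha, W_t, rows, cols, y, lam, eta):
    """phi_t(alpha) and its gradient for matrix completion with squared loss."""
    # A^T(alpha): place the dual variables back on the observed entries
    At_alpha = np.zeros_like(W_t)
    At_alpha[rows, cols] = alpha
    S = svd_soft_threshold(W_t + eta * At_alpha, lam * eta)   # ST_{lam*eta}(W^t + eta A^T alpha)
    phi = 0.5 * alpha @ alpha - alpha @ y + 0.5 / eta * np.sum(S * S)
    grad = alpha - y + S[rows, cols]                          # -grad f_l*(-alpha) + A(ST(...))
    return phi, grad
```

This (value, gradient) pair can be handed to an off-the-shelf quasi-Newton routine such as scipy.optimize.minimize(..., jac=True, method='L-BFGS-B').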
14 Algorithm: Generalizations
1 Learning multiple matrices:
$\phi_\lambda(W) = \lambda \sum_{k=1}^K \|W_k\|_*$ for $W = (W_1, \ldots, W_K)$.
No need to form the big (concatenated) matrix. Can be used to select informative data sources and learn feature extractors simultaneously.
2 General spectral regularization:
$\phi_\lambda(W) = \sum_{j=1}^r g_\lambda(\sigma_j(W))$
for any convex function $g_\lambda$ whose proximal operator $\mathrm{ST}_{g_\lambda}(\sigma_j) = \mathrm{argmin}_{x \in \mathbb{R}} \; g_\lambda(x) + \tfrac12 (x - \sigma_j)^2$ can be computed in closed form.
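A sketch of the second generalization (an added example, illustrative names): the proximal operation for a spectral regularizer is obtained by applying the scalar prox of $g_\lambda$ to each singular value:

```python
import numpy as np

def spectral_prox(W, scalar_prox):
    """Prox of phi_lambda(W) = sum_j g_lambda(sigma_j(W)):
    apply the scalar prox of g_lambda to each singular value."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * scalar_prox(s)) @ Vt

# g_lambda(x) = lam * |x| recovers the trace-norm case (soft thresholding)
soft_threshold = lambda s, lam=1.0: np.maximum(s - lam, 0.0)
W = np.random.default_rng(2).standard_normal((30, 20))
W_prox = spectral_prox(W, soft_threshold)
```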
15 Experiments: Synthetic experiment 1, low-rank matrix completion
Large scale & structured. True matrix $W^*$: 10,000 x 10,000 (100M elements), low rank. Observation: m randomly chosen elements (sparse). A quasi-Newton method is used for the minimization of $\phi_t(\alpha)$; no need to form the full matrix. Two settings: m = 1,200,000 and m = 2,400,000.
16 (Figure) Results for the two settings, m = 1,200,000 (rank 10) and m = 2,400,000 (rank 20): computation time (s) and estimated rank as functions of the regularization constant $\lambda$.
17 Experiments: Synthetic experiment 1, rank = 10, #observations m = 1,200,000
(Table of results for each regularization constant $\lambda$: computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, with standard deviations.)
The number of inner iterations is roughly 2 times the number of outer iterations, because the inner problem does not need to be solved very precisely.
18 Experiments: Synthetic experiment 1, rank = 20, #observations m = 2,400,000
(Table of results for each regularization constant $\lambda$: computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, with standard deviations.)
19 Experiments: Synthetic experiment 2, classifying matrices
True matrix $W^*$: 64 x 64, rank 16. Classification problem (#samples m = 1000) with the logistic loss
$f_\ell(z) = \sum_{i=1}^m \log(1 + \exp(-y_i z_i))$.
Design matrix $A$: dense (each example drawn from a Wishart distribution). $\lambda = 800$ chosen to roughly reproduce rank 16.
Medium scale & dense: a Newton method is used for minimizing $\phi_t(\alpha)$; it works better when the problem is poorly conditioned.
Methods compared: M-DAL (proposed), IP (interior point method [T&A 2007]), PG (projected gradient method [T&S 2008]), AG (accelerated gradient method [Ji & Ye 2009]).
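A small, numerically stable evaluation of this logistic loss and its gradient (an added sketch, assuming numpy/scipy):

```python
import numpy as np
from scipy.special import expit

def logistic_loss_and_grad(z, y):
    """f_l(z) = sum_i log(1 + exp(-y_i z_i)) and its gradient in z."""
    loss = np.sum(np.logaddexp(0.0, -y * z))
    grad = -y * expit(-y * z)      # d/dz_i log(1 + exp(-y_i z_i))
    return loss, grad
```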
20 Experiments: Comparison on synthetic experiment 2
(Figure: relative duality gap versus CPU time (s) for AG, IP, PG, and M-DAL.)
M-DAL (proposed), IP (interior point method [T&A 2007]), PG (projected gradient method [T&S 2008]), AG (accelerated gradient method [Ji & Ye 2009]).
21 Experiments: Super-linear convergence on synthetic experiment 2
(Figure: relative duality gap versus number of iterations and versus time (s).)
22 Experiments: Benchmark experiment, BCI dataset
BCI competition 2003, dataset IV. Task: predict whether the upcoming finger movement is right or left.
Original data: multichannel EEG time series (28 channels, 50 time points, 500 ms long).
Each training example is preprocessed into three matrices and whitened [T&M 2010]:
First-order (<20 Hz) component: a matrix.
Second-order alpha (7-15 Hz) component: a covariance matrix.
Second-order beta (15-30 Hz) component: a covariance matrix.
316 training examples, 100 test samples.
23 Experiments: BCI dataset, regularization path
(Figure, three panels: number of iterations, time (s), and accuracy as functions of the regularization constant, comparing M-DAL (proposed, -o-), PG [T&S 08] (--), and AG [Ji & Ye 09] (-x-).)
Stopping criterion: relative duality gap (RDG). Note: these are the costs for obtaining the entire regularization path.
24 Summary
M-DAL: a dual augmented Lagrangian algorithm for learning low-rank matrices.
Theoretically, M-DAL is shown to converge super-linearly, so it requires a small number of outer steps.
Separable regularizer + differentiable loss function: each step can be solved efficiently.
Empirically, a matrix completion problem of size 10,000 x 10,000 can be solved in roughly 5 minutes.
Training a low-rank classifier that automatically combines multiple data sources (matrices) is sped up substantially.
Future work: other AL algorithms; large-scale unstructured or badly conditioned problems.
25 Appendix: Background, convex conjugate
Convex conjugate of $f$: $f^*(y) = \sup_{x \in \mathbb{R}^n} \big( \langle y, x \rangle - f(x) \big)$.
Convex conjugate of a sum (inf-convolution): $(f + g)^*(y) = (f^* \,\square\, g^*)(y) = \inf_\alpha \big( f^*(\alpha) + g^*(y - \alpha) \big)$.
Fenchel duality: $\inf_{x \in \mathbb{R}^n} \big( f(x) + g(x) \big) = \sup_{\alpha \in \mathbb{R}^n} \big( -f^*(-\alpha) - g^*(\alpha) \big)$.
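As a worked instance of the conjugate (an added example, not from the slides), take the squared loss $f_\ell(z) = \tfrac12 \|z - y\|^2$, a natural choice for matrix completion:
$f_\ell^*(v) = \sup_z \big( \langle v, z \rangle - \tfrac12 \|z - y\|^2 \big) = \langle v, y \rangle + \tfrac12 \|v\|^2$, attained at $z = y + v$,
so that $f_\ell^*(-\alpha) = \tfrac12 \|\alpha\|^2 - \langle \alpha, y \rangle$, the term that enters the inner objective $\phi_t(\alpha)$.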
26 Appendix: Inf-convolution
Envelope function:
$\Phi_\lambda(V; W^t) = (\phi_\lambda \,\square\, \tfrac12\|\cdot\|^2)(W^t + V) = \inf_{Y \in \mathbb{R}^{R \times C}} \big( \phi_\lambda(Y) + \tfrac12 \|W^t + V - Y\|^2 \big) = \Phi_\lambda(W^t + V; 0) =: \Phi_\lambda(W^t + V)$.
(Figure: $\phi_\lambda$ and its envelope $\Phi_\lambda$, and $\phi_\lambda^*$ and $\Phi_\lambda^*$; the non-differentiable penalty is replaced by a differentiable envelope.)
27 Appendix: M-DAL algorithm for the trace norm
$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha^t)$, where
$\alpha^t = \mathrm{argmin}_\alpha \; f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha)\|^2 =: \mathrm{argmin}_\alpha \; \phi_t(\alpha)$
(the second term equals $\tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}^*(W^t + \eta_t A^\top \alpha)$).
$\phi_t(\alpha)$ is differentiable:
$\nabla\phi_t(\alpha) = -\nabla f_\ell^*(-\alpha) + A\big(\mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha)\big)$.
$\nabla^2\phi_t(\alpha)$ is a bit more involved but can be computed (Wright 92).
28 Appendix: Minimizing the inner objective $\phi_t(\alpha)$
$\phi_t(\alpha)$ and $\nabla\phi_t(\alpha)$ can be computed using only the singular values $\sigma_j(W^t(\alpha)) \ge \lambda\eta_t$ and the corresponding singular vectors; there is no need to compute the full SVD of $W^t(\alpha) := W^t + \eta_t A^\top \alpha$. Computation of $\nabla^2\phi_t(\alpha)$, however, requires the full SVD.
Consequence: when $A^\top \alpha$ is structured, computing the required SVD is cheap. E.g., in matrix completion $A^\top \alpha$ is sparse with only m non-zeros and $W^t(\alpha) = W^t_L W^{t\top}_R + \eta_t A^\top \alpha$; a quasi-Newton method that maintains $W^t$ in factorized form is scalable. When $A^\top \alpha$ is not structured, the full Newton method converges faster and is more stable.
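A sketch of the structured case (added example; the names and the use of scipy are assumptions, not the authors' implementation): with $W^t = L R^\top$ kept in factorized form and $A^\top \alpha$ sparse, the leading singular triplets of $W^t(\alpha)$ can be obtained from matrix-vector products alone, without forming the dense matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import LinearOperator, svds

def top_svd_lowrank_plus_sparse(L, R, rows, cols, vals, k):
    """Leading k singular triplets of  L @ R.T + S  (S sparse), without forming it."""
    S = coo_matrix((vals, (rows, cols)), shape=(L.shape[0], R.shape[0])).tocsr()
    matvec = lambda x: L @ (R.T @ x) + S @ x       # (L R^T + S) x
    rmatvec = lambda x: R @ (L.T @ x) + S.T @ x    # (L R^T + S)^T x
    op = LinearOperator(S.shape, matvec=matvec, rmatvec=rmatvec, dtype=float)
    return svds(op, k=k)                           # only the top of the spectrum is needed

# toy usage: a rank-5 factor pair plus a few sparse entries
rng = np.random.default_rng(3)
L, R = rng.standard_normal((200, 5)), rng.standard_normal((150, 5))
rows, cols = rng.integers(0, 200, 1000), rng.integers(0, 150, 1000)
U, s, Vt = top_svd_lowrank_plus_sparse(L, R, rows, cols, rng.standard_normal(1000), k=10)
```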
29 Appendix: Primal and dual proximal minimization
Primal Proximal Minimization (A) (M-DAL):
$W^{t+1} = \mathrm{argmin}_W \; f_\ell(AW) + \phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|_{\mathrm{fro}}^2$
Primal Proximal Minimization (B):
$W^{t+1} = \mathrm{argmin}_W \; \phi_\lambda(W) + f_\ell(AW) + \tfrac{1}{2\eta_t}\|W - W^t\|_{\mathrm{fro}}^2$
Dual Proximal Minimization (A) (split Bregman iteration):
$\alpha^{t+1} = \mathrm{argmin}_\alpha \; f_\ell^*(-\alpha) + \phi_\lambda^*(A^\top \alpha) + \tfrac{1}{2\eta_t}\|\alpha - \alpha^t\|^2$
Dual Proximal Minimization (B):
$\alpha^{t+1} = \mathrm{argmin}_\alpha \; \phi_\lambda^*(A^\top \alpha) + f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\alpha - \alpha^t\|^2$
30 Appendix: References
Abernethy et al. A new approach to collaborative filtering: operator estimation with spectral regularization. JMLR 10, 2009.
Argyriou et al. Multi-task feature learning. In NIPS 19, 2007.
Ji & Ye. An accelerated gradient method for trace norm minimization. In ICML 2009.
Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14, 1976a.
Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research 1, 1976b.
Srebro et al. Maximum-margin matrix factorization. In NIPS 17, 2005.
Tomioka & Aihara. Classifying matrices with a spectral regularization. In ICML 2007.
Tomioka et al. Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. arXiv preprint.
Tomioka & Sugiyama. Sparse learning with duality gap guarantee. In NIPS workshop OPT 2008: Optimization for Machine Learning, 2008.
Tomioka & Müller. A regularized discriminative framework for EEG analysis with application to brain-computer interface. NeuroImage 49, 2010.
Wright. Differential equations for the analytic singular value decomposition of a matrix. Numerische Mathematik 63, 1992.