A Simple Algorithm for Nuclear Norm Regularized Problems


Transcription:

A Simple Algorithm for Nuclear Norm Regularized Problems. ICML 2010. Martin Jaggi, Marek Sulovský, ETH Zurich.

Matrix factorizations for recommender systems: approximate the Customer × Movie rating matrix Y by a low-rank product UV^T. The Netflix challenge: about 17 000 movies, 500 000 customers, and 100 000 000 ratings (roughly 1% of the entries are observed). The latent factors can carry interpretations, e.g. a factor u^(k) such as "Angelina Jolie plays in movie j" and a factor v^(k) such as "Customer i is male".

Matrix factorizations in machine learning. Applications: Customer i × Product j (Amazon, Netflix, etc.), Customer i × Customer j, Word i × Document j (search engines, latent semantic analysis), and many other applications (e.g. feature generation, dimensionality reduction, clustering).

Regularization. Write the n × m approximation as X := UV^T. We trade off the error (loss) against model complexity (regularization), in either a low-rank or a low-norm form:

Trade-off variant: min_X f(X) + μ rank(X), or min_X f(X) + μ ||X||_*
Constrained variant: min_X f(X) s.t. rank(X) ≤ k, or s.t. ||X||_* ≤ t

with the loss taken over the observed entries S, f(X) := Σ_{(i,j)∈S} (X_{ij} − Y_{ij})^2. The low-norm versions are the nuclear norm regularized problems.
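As a concrete reading of the loss, here is a minimal Python sketch of f(X) evaluated only on the observed entries (the names Y_obs, rows, cols are mine, not from the slides):

    import numpy as np

    def squared_loss_observed(X, Y_obs, rows, cols):
        # f(X) = Σ_{(i,j) ∈ S} (X_ij - Y_ij)^2, where S is given by (rows, cols).
        # X: dense n × m estimate; Y_obs[k] is the observed rating at (rows[k], cols[k]).
        resid = X[rows, cols] - Y_obs
        return float(np.sum(resid ** 2))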

Existing methods. With X := UV^T and f(·) convex, the optimization problem is min_X f(X) s.t. constraint(X). Existing ML methods instead solve the factorized problem min_{U,V} f(UV^T) s.t. constraint(U, V), which is not convex. Nuclear norm case: the constraint ||X||_* ≤ t becomes (1/2)(||U||_Fro^2 + ||V||_Fro^2) ≤ t. Consequence: local minima.
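The link between the two formulations is the standard variational characterization of the nuclear norm (stated here for completeness, not taken verbatim from the slide):

    ||X||_* = min_{U, V : UV^T = X} (1/2)(||U||_Fro^2 + ||V||_Fro^2),

so replacing X by UV^T turns the convex constraint ||X||_* ≤ t into the non-convex factorized constraint (1/2)(||U||_Fro^2 + ||V||_Fro^2) ≤ t.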

Convex optimization. With f(·) convex, stack the two factors into a symmetric (n+m) × (n+m) matrix

Z := (U; V)(U^T V^T) = [ UU^T  UV^T ; VU^T  VV^T ],

whose upper-right n × m block is X = UV^T. Optimization problem: min_X f(X) s.t. constraint(X). Our method solves min_{Z ∈ Sym^{(n+m)×(n+m)}, Z ⪰ 0} f(Z) s.t. constraint(Z), which is convex. Nuclear norm case: ||X||_* ≤ t becomes Tr(Z) = t. No local minima.
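Why the trace constraint controls the nuclear norm (a one-line check; the exact scaling between t and the nuclear norm bound is a convention the slide leaves implicit):

    Tr(Z) = Tr(UU^T) + Tr(VV^T) = ||U||_Fro^2 + ||V||_Fro^2 ≥ 2 ||UV^T||_*,

with equality when U and V come from the SVD of X = UV^T. Bounding the trace of the PSD matrix Z therefore bounds the nuclear norm of its off-diagonal block X, which is what makes the reformulation over {Z ⪰ 0, Tr(Z) = t} equivalent to the nuclear-norm-constrained problem.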

Sparse approximation: the problem. For f(·) convex and differentiable, consider optimization over the unit simplex and, analogously, over the spectahedron:

min_{x ∈ R^n} f(x)  s.t. x ≥ 0, 1^T x = 1
min_{Z ∈ Sym^{n×n}} f(Z)  s.t. Z ⪰ 0, Tr(Z) = 1

The algorithm [Clarkson SODA '08], [Hazan LATIN '08], for the problem above. Each step moves the iterate toward the best single-coordinate / rank-one direction indicated by the gradient (a runnable sketch of the matrix case follows below):

Vector case (coordinate descent): i := arg max_i (−∇f(x^(k)))_i,  x^(k+1) := (1 − λ) x^(k) + λ e_i
Matrix case: v := arg max_{||v||=1} v^T (−∇f(Z^(k))) v, i.e. the largest eigenvector of −∇f(Z^(k)),  Z^(k+1) := (1 − λ) Z^(k) + λ vv^T

with step size λ = 1/k. After k steps, x^(k) is a combination of at most k unit vectors (sparsity ≤ k), and Z^(k) is a combination of the rank-one matrices v^(1)v^(1)T, ..., v^(k)v^(k)T (rank ≤ k). No projection steps!
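A minimal, self-contained Python sketch of the matrix-case update (assuming a dense gradient and an exact eigendecomposition; the slides instead advocate approximate eigenvectors, as discussed next):

    import numpy as np

    def hazan_frank_wolfe(grad_f, n, steps=100):
        # Minimize a convex f over the spectahedron {Z symmetric, Z ⪰ 0, Tr(Z) = 1}.
        # grad_f(Z) must return the gradient of f at Z as an (n, n) array.
        Z = np.eye(n) / n                      # feasible start: Tr(Z) = 1
        for k in range(1, steps + 1):
            G = grad_f(Z)
            # Linearized subproblem: the best rank-one direction is v v^T,
            # where v is the largest eigenvector of -G.
            eigvals, eigvecs = np.linalg.eigh(-G)
            v = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
            lam = 1.0 / k                      # decreasing step size, as on the slide
            Z = (1 - lam) * Z + lam * np.outer(v, v)
        return Z

After k iterations Z is a convex combination of k rank-one matrices, which is exactly the rank ≤ k property noted above.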

The convergence: after O(1/ε) steps, the primal-dual error is at most ε (for both the vector and the matrix case). Approximate eigenvector computation: instead of the exact maximizer v := arg max_{||v||=1} v^T M v with M := −∇f(Z^(k)), it is enough to work with an approximate v satisfying v^T M v ≥ λ_max(M) − ε'. Such a v can be found by a modest number of Lanczos steps. Alternative: the power method.
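A sketch of the power-method alternative for the approximate eigenvector (illustrative; it assumes the top eigenvalue of M dominates in magnitude, e.g. after adding a suitable multiple of the identity):

    import numpy as np

    def power_method(matvec, n, iters=50, seed=0):
        # Approximate the largest eigenvector of a symmetric matrix M,
        # given only the matrix-vector product matvec(v) = M @ v.
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = matvec(v)
            v = w / np.linalg.norm(w)
        return v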

Low norm matrix factorization. Here f(Z) := Σ_{(i,j)∈S} (Z_{ij} − Y_{ij})^2, reading the upper-right n × m block of Z as the current estimate. We need the largest eigenvector of M := −∇f(Z^(k)). Power method: v := Mv / ||Mv||. The gradient, and hence M, is nonzero only at the observed entries S, so each matrix-vector product is cheap; these computations correspond to Simon Funk's method.
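A sketch of why this matrix-vector product needs only O(|S|) work in the matrix completion case (the block layout and the dropped constant factor of 2 are simplifications on my part):

    import numpy as np
    from scipy.sparse import coo_matrix

    def neg_grad_matvec(X, Y_obs, rows, cols, n, m):
        # X: current n × m estimate (the upper-right block of Z).
        # The gradient of the squared loss lives only on the observed entries,
        # so the residual is stored as a sparse matrix R with |S| nonzeros.
        resid = X[rows, cols] - Y_obs
        R = coo_matrix((resid, (rows, cols)), shape=(n, m)).tocsr()

        def matvec(v):
            # v has length n + m; M is (up to scaling) -[[0, R], [R^T, 0]].
            v_top, v_bot = v[:n], v[n:]
            return np.concatenate([-(R @ v_bot), -(R.T @ v_top)])

        return matvec

Feeding this matvec into the power-method sketch above gives exactly the multiply-and-normalize iterations that the slide relates to Simon Funk's method.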

Comparison (convergence guarantee, step complexity, convexity, control on the rank):
- MMMF / alternating gradient descent: step complexity k(n + m); not convex.
- Singular value thresholding methods: convergence guarantee O(1/√ε); each step computes an exact, full SVD.
- Our method: convergence guarantee O(1/ε); each step computes only an approximate eigenvector; convex, with rank at most k after k steps.
- Simon Funk's / SVD++ *: each step is a matrix-vector multiplication. (* solves a different optimization problem)

Experiments. More than 5x faster than existing singular value thresholding methods such as [Toh & Yun '09, Mazumder et al. '09, Ji & Ye ICML '09, ...]. Scales well to larger problems such as the Netflix data.
[Plot: train and test RMSE vs. the number of iterations k on MovieLens 10M (rb split), comparing the 1/k step size, the best point on the line segment, and gradient-interpolation step-size rules.]
Prediction performance is
- comparable to the best non-linear MMMF methods such as [Lawrence & Urtasun ICML '09]
- slightly worse than the custom-engineered methods for Netflix.
Sensitivity to the regularization parameter:
[Plot: test RMSE as a function of the trace regularization t, at fixed k.]

Conclusions:
- Overall computational cost is about the same as a single SVD.
- First algorithm for nuclear norm optimization that does not need an SVD as an internal computation.
- First Simon-Funk-type algorithm with a convergence guarantee.
- Easy to implement and to parallelize; any approximate eigenvector method of choice can be used internally.

Thanks