Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent


Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (KDD 2011). Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. Presenter: Jiawen Yao, Dept. CSE, UT Arlington.

Outline: Matrix Factorization, Stochastic Gradient Descent, Distributed SGD, Experiments, Summary.

Matrix completion.

Matrix Factorization. As Web 2.0 and enterprise-cloud applications proliferate, data mining becomes more important. A typical real application has a set of users, a set of items (movies, books, products, ...), and feedback (ratings, purchases, tags, ...); the task is to predict additional items a user may like. Assumption: similar feedback implies similar taste. This leads to matrix factorization.

Matrix Factorization. Example: the Netflix competition, with 500k users, 20k movies, 100M movie ratings, and 3M question marks.

          Avatar   The Matrix   Up
Alice        ?         4         2
Bob          3         2         ?
Charlie      5         ?         3

The goal is to predict the missing entries (denoted by ?).

Matrix Factorization. A general machine learning problem, arising in recommender systems, text indexing, face recognition, and more. Training data: V, an m × n input matrix (e.g., a rating matrix), and Z, the training set of indices into V (e.g., the subset of known ratings). Output: an approximation V ≈ WH that has the smallest loss, argmin_{W,H} L(V, W, H).

The loss function. Example: nonzero squared loss, L_NZSL = Σ_{(i,j): V_ij ≠ 0} (V_ij - [WH]_ij)². In general, L is a sum of local losses over the entries in Z: L = Σ_{(i,j) ∈ Z} l(V_ij, W_i, H_j). The talk focuses on the class of nonzero decompositions, Z = {(i, j) : V_ij ≠ 0}.
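As a concrete illustration (not from the paper), here is a minimal NumPy sketch of the nonzero squared loss, assuming dense arrays in which zeros of V mark unobserved entries; the function name nzsl is illustrative only.

import numpy as np

def nzsl(V, W, H):
    """Nonzero squared loss: sum of (V_ij - [WH]_ij)^2 over the observed (nonzero) entries."""
    pred = W @ H               # m x n reconstruction of V
    mask = V != 0              # Z = {(i, j) : V_ij != 0}
    return np.sum((V[mask] - pred[mask]) ** 2)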

The loss function. Find the best model: argmin_{W,H} Σ_{(i,j) ∈ Z} L_ij(W_i, H_j), where W_i is row i of W and H_j is column j of H. To avoid trivialities, we assume there is at least one training point in every row and in every column.

Prior Work. Specialized algorithms: designed for a small class of loss functions (e.g., GKL loss). Generic algorithms: handle all differentiable loss functions that decompose into summation form, e.g., distributed gradient descent (DGD) and partitioned SGD (PSGD); the proposed DSGD is generic as well.

Successful applications. Movie recommendation: >12M users, >20k movies, 2.4B ratings; 36GB data, 9.2GB model. Website recommendation: 51M users, 15M URLs, 1.2B clicks; 17.8GB data, 161GB metadata, 49GB model. News personalization.

Stochastic Gradient Descent. Find a minimum θ* of a function L: pick a starting point θ_0, approximate the gradient L'(θ_0), and follow the stochastic difference equation θ_{n+1} = θ_n - ε_n L'(θ_n). Under certain conditions, this asymptotically approximates (continuous) gradient descent.

Stochastic Gradient Descent for Matrix Factorization. Set θ = (W, H) and use L(θ) = Σ_{(i,j) ∈ Z} L_ij(W_i, H_j), so that L'(θ) = Σ_{(i,j) ∈ Z} L'_ij(W_i, H_j). The stochastic estimate is L'(θ, z) = N L'_z(W_{i_z}, H_{j_z}), where N = |Z| and the training point z is chosen uniformly at random from the training set Z = {(i, j) : V_ij ≠ 0}.

SGD for Matrix Factorization. Input: a training set Z and initial values W_0 and H_0. Repeat N times: (1) pick a random entry z ∈ Z, (2) compute the approximate gradient L'(θ_n, z), (3) update the parameters: θ_{n+1} = θ_n - ε_n L'(θ_n, z). In practice, an additional projection is used, θ_{n+1} = Π_H[θ_n - ε_n L'(θ_n, z)], to keep the iterates in a given constraint set such as H = {θ : θ ≥ 0}.
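A self-contained Python/NumPy sketch of this loop for the nonzero squared loss, with a fixed step size and the nonnegativity projection mentioned above; the names (sgd_mf, eps, n_steps) are illustrative and not from the paper.

import numpy as np

def sgd_mf(V, rank=10, eps=0.01, n_steps=100_000, seed=0):
    """Plain SGD for V ~ W H under nonzero squared loss, projected onto {theta >= 0}."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W, H = rng.random((m, rank)), rng.random((rank, n))
    Z = np.argwhere(V != 0)                      # training set: indices of observed entries
    for _ in range(n_steps):
        i, j = Z[rng.integers(len(Z))]           # 1. pick a random entry z in Z
        w_i = W[i, :].copy()                     # keep the old row for the H update
        err = V[i, j] - w_i @ H[:, j]            # 2. local gradient of (V_ij - W_i H_j)^2
        W[i, :] += eps * 2.0 * err * H[:, j]     # 3. update only the touched row and column
        H[:, j] += eps * 2.0 * err * w_i
        np.maximum(W[i, :], 0.0, out=W[i, :])    #    projection Pi_H keeps the iterate nonnegative
        np.maximum(H[:, j], 0.0, out=H[:, j])
    return W, H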

Stochastic Gradient Descent for Matrix Factorization. Why is stochastic good? The stochastic gradient is easy to obtain, it may help in escaping local minima, and it exploits repetition within the data.

Distributed SGD (DSGD). SGD steps depend on each other, since θ_{n+1} = θ_n - ε_n L'(θ_n, z), so how can the computation be distributed? Parameter mixing (ISGD): map: run independent instances of SGD on subsets of the data; reduce: average the results once at the end. This does not converge to the correct solution. Iterative parameter mixing (PSGD): map: run independent instances of SGD on subsets of the data (for some time); reduce: average the results after each pass over the data. This converges, but slowly.
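The iterative-parameter-mixing baseline can be sketched as below. This is an illustrative rendering rather than the paper's code, and it assumes a helper sgd_steps(V, W, H, points, eps) that runs SGD over the given (i, j) points and returns the updated factors (e.g., a restriction of the sgd_mf loop above to a fixed list of entries).

import numpy as np

def psgd(V, W, H, n_workers, n_passes, eps, sgd_steps, seed=0):
    """Iterative parameter mixing (PSGD): independent SGD per data subset, average after each pass."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(V != 0)
    for _ in range(n_passes):
        parts = np.array_split(rng.permutation(obs), n_workers)      # map: split the training points
        runs = [sgd_steps(V, W.copy(), H.copy(), part, eps) for part in parts]
        W = np.mean([w for w, _ in runs], axis=0)                    # reduce: average the factors
        H = np.mean([h for _, h in runs], axis=0)
    return W, H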

Distributed SGD (DSGD).

Stratified SGD. The paper proposes stratified SGD (SSGD) to obtain an efficient DSGD for matrix factorization. The loss is decomposed as L(θ) = w_1 L_1(θ) + w_2 L_2(θ) + ... + w_q L_q(θ), a weighted sum of per-stratum losses L_s, where each stratum s is a part (a partition) of the dataset. SSGD runs standard SGD on a single stratum at a time, but switches strata in a way that guarantees correctness.

Stratified SGD algorithm. Suppose a stratum sequence {γ_n}, where each γ_n takes values in {1, ..., q}, and update θ_{n+1} = Π_H[θ_n - ε_n L'_{γ_n}(θ_n)]. Appropriate sufficient conditions for the convergence of SSGD can be obtained from stochastic approximation theory.
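A compact sketch of this stratum-switching schedule, again assuming an sgd_steps(V, W, H, points, eps) helper and a round-robin stratum sequence; both choices are illustrative assumptions, not the paper's schedule.

def ssgd(V, W, H, strata, eps, n_rounds, sgd_steps):
    """Stratified SGD: standard SGD on a single stratum at a time, switching strata between rounds."""
    for r in range(n_rounds):
        gamma_n = strata[r % len(strata)]     # stratum sequence {gamma_n} with values in {1, ..., q}
        W, H = sgd_steps(V, W, H, gamma_n, eps)
    return W, H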

Distributing SSGD: Problem Structure. SGD steps depend on each other: θ_{n+1} = Π_H[θ_n - ε_n L'_{γ_n}(θ_n)]. An SGD step on a training point z ∈ Z (1) reads W_{i_z} and H_{j_z}, (2) performs the gradient computation L'_{i_z j_z}, and (3) updates W_{i_z} and H_{j_z}. Crucially, not all steps are dependent: steps that touch different rows and columns do not interfere.

Interchangeability. Definition 1: two elements z_1, z_2 ∈ Z are interchangeable if they share neither row nor column. When z_n and z_{n+1} are interchangeable, the SGD steps give θ_{n+1} = θ_n - ε L'(θ_n, z_n) and θ_{n+2} = θ_n - ε L'(θ_n, z_n) - ε L'(θ_{n+1}, z_{n+1}) = θ_n - ε L'(θ_n, z_n) - ε L'(θ_n, z_{n+1}), i.e., the two updates commute.
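A tiny numeric check of this property (illustrative, not from the paper): two SGD steps on entries that share neither row nor column touch disjoint parameters, so applying them in either order gives the same result.

import numpy as np

def sgd_step(V, W, H, i, j, eps):
    """One SGD step on entry (i, j) for nonzero squared loss; returns updated copies."""
    W, H = W.copy(), H.copy()
    w_i = W[i, :].copy()
    err = V[i, j] - w_i @ H[:, j]
    W[i, :] += eps * 2.0 * err * H[:, j]
    H[:, j] += eps * 2.0 * err * w_i
    return W, H

rng = np.random.default_rng(0)
V = rng.random((4, 4))
W, H = rng.random((4, 2)), rng.random((2, 4))
z1, z2 = (0, 1), (2, 3)                               # share neither row nor column: interchangeable
Wa, Ha = sgd_step(V, *sgd_step(V, W, H, *z1, 0.1), *z2, 0.1)
Wb, Hb = sgd_step(V, *sgd_step(V, W, H, *z2, 0.1), *z1, 0.1)
assert np.allclose(Wa, Wb) and np.allclose(Ha, Hb)    # the order of interchangeable steps is irrelevant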

A simple case. Denote by Z_b the set of training points in block b. Suppose we run T steps of SSGD on Z, starting from some initial point θ_0 = (W_0, H_0) and using a fixed step size ε. Describe an instance of the SGD process by a training sequence ω = (z_0, z_1, ..., z_{T-1}) of T training points, so that θ_{n+1}(ω) = θ_n(ω) + ε Y_n(ω) and θ_T(ω) = θ_0 + ε Σ_{n=0}^{T-1} Y_n(ω), where Y_n(ω) is the update term of step n.

A simple case (continued). Consider the subsequence σ_b(ω) of training points from block Z_b; its length is T_b(ω) = |σ_b(ω)|. The following theorem suggests that we can run SGD on each block independently and then sum up the results. Theorem 3: using the definitions above, θ_T(ω) = θ_0 + ε Σ_{b=1}^{d} Σ_{k=0}^{T_b(ω)-1} Y_k(σ_b(ω)).

A simple case (continued). We can therefore divide the work into d independent map tasks Γ_1, ..., Γ_d. Task Γ_b is responsible for the subsequence σ_b(ω): it takes Z_b, W^b, and H^b as input and performs the block-local updates of σ_b(ω).

The General Case. Theorem 3 can also be applied in the general case, using strata that are d-monomial, i.e., each stratum decomposes into d blocks that share neither rows nor columns.

The General Case.

Exploiting the structure. Block and distribute the input matrix V. High-level approach (map only): (1) pick a diagonal of blocks, (2) run SGD on the diagonal (in parallel), (3) merge the results, (4) move on to the next diagonal; steps 1-3 form a cycle. Step 2 simulates sequential SGD: the blocks on a diagonal are interchangeable, and the number of SGD iterations per block and the step sizes per block are chosen at random ("throwing dice").
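A sequential sketch of one DSGD epoch built from these ideas (illustrative only; in the actual system each block on a diagonal is a separate map task). It partitions V into d × d blocks and, in each of d subepochs, processes one diagonal of d interchangeable blocks, assuming the same sgd_steps helper as before.

import numpy as np

def dsgd_epoch(V, W, H, d, eps, sgd_steps, rng):
    """One DSGD epoch: d subepochs, each covering d blocks that share no rows or columns."""
    row_blocks = np.array_split(np.arange(V.shape[0]), d)
    col_blocks = np.array_split(np.arange(V.shape[1]), d)
    obs = np.argwhere(V != 0)
    shift = rng.integers(d)                                   # randomize which diagonal comes first
    for s in range(d):                                        # pick a diagonal (one stratum)
        for b in range(d):                                    # the d blocks of this diagonal are
            rows = row_blocks[b]                              # interchangeable, so on a cluster they
            cols = col_blocks[(b + s + shift) % d]            # would run as parallel map tasks
            in_block = np.isin(obs[:, 0], rows) & np.isin(obs[:, 1], cols)
            W, H = sgd_steps(V, W, H, obs[in_block], eps)     # run SGD on the block, merge results
    return W, H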

Experiments. DSGD is compared with the PSGD, DGD, and ALS methods. Data: the Netflix competition dataset (100M ratings from 480k customers on 18k movies) and a synthetic dataset (10M rows, 1M columns, 1B nonzero entries).

Experiments. Two well-known loss functions are tested: L_NZSL = Σ_{(i,j): V_ij ≠ 0} (V_ij - [WH]_ij)² and L_L2 = L_NZSL + λ(||W||_F² + ||H||_F²). The method works well on a variety of loss functions; results for other loss functions can be found in the authors' technical report.
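For concreteness, a small sketch (an assumed helper, not from the paper) of the regularized loss L_L2 used in these experiments:

import numpy as np

def l2_loss(V, W, H, lam):
    """L_L2 = L_NZSL + lambda * (||W||_F^2 + ||H||_F^2)."""
    mask = V != 0
    nzsl = np.sum((V[mask] - (W @ H)[mask]) ** 2)
    return nzsl + lam * (np.linalg.norm(W, "fro") ** 2 + np.linalg.norm(H, "fro") ** 2)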

Experiments. The proposed DSGD converges faster and achieves better results than the other methods.

Experiments. (1) The processing time remains constant as the problem size increases. (2) When scaling to very large datasets on larger clusters, the overall running time increases by a modest 30%.

Summary. Matrix factorization: widely applicable via customized loss functions, and able to handle large instances (millions × millions matrices with billions of entries). Distributed stochastic gradient descent: simple and versatile, achieving fully distributed data and model, fully distributed processing, the same or better loss, faster convergence, and good scalability.

Thank you!