Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Size: px
Start display at page:

Download "Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent"

Transcription

1 Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent KDD 2011 Rainer Gemulla, Peter J. Haas, Erik Nijkamp and Yannis Sismanis Presenter: Jiawen Yao Dept. CSE, UT Arlington 1 1

2 Outline Matrix Factorization Stochastic Gradient Descent Distributed SGD Experiments Summary Dept. CSE, UT Arlington 2

3 Matrix completion Dept. CSE, UT Arlington 3

4 Matrix Factorization As Web 2.0 and enterprise-cloud applications proliferate, data mining becomes more important Real application Set of users Set of items (movies, books, products, ) Feedback (ratings, purchase, tags, ) Predict additional items a user may like Assumption: Similar feedback Similar taste Matrix Factorization Dept. CSE, UT Arlington 4

5 Matrix Factorization Example - Netflix competition 500k users, 20k movies, 100M movie ratings, 3M question marks Avatar The Matrix Up Alice? 4 2 Bob 3 2? Charlie 5? 3 The goal is to predict missing entries (denoted by?) Dept. CSE, UT Arlington 5

6 Matrix Factorization A general machine learning problem Recommender systems, text indexing, face recognition, Training data V: m n input matrix (e.g., rating matrix) Z: training set of indexes in V (e.g., subset of known ratings) Output find a approximation V WH which have the smallest loss argmin W,H L(V, W, H) Dept. CSE, UT Arlington 6

7 The loss function Loss function Nonzero squared loss L NZSL = L NZSL i,j:v ij 0 V ij WH ij 2 A sum of local losses over the entries in V ij L = (i,j) Z l V ij, W i, H j Focus on the class of nonzero decompositions Z = i, j : V ij 0 Dept. CSE, UT Arlington 7

8 The loss function Find best model argmin W,H (i,j) Z L ij (W i, H j ) W i row i of matrix W H j column j of matrix H To avoid trivialities, we assume there is at least one training point in every row and in every column. Dept. CSE, UT Arlington 8

9 Prior Work Specialized algorithms Designed for a small class of loss functions GKL loss Generic algorithms Handle all differentiable loss functions that decompose into summation form Distributed gradient descent (DGD), Partitioned SGD (PSGD) The proposed Dept. CSE, UT Arlington 9

10 Successful applications Movie recommendation >12M users, >20k movies, 2.4B ratings 36GB data, 9.2GB model Website recommendation 51M users, 15M URLs, 1.2B clicks 17.8GB data, 161GB metadata, 49GB model News personalization Dept. CSE, UT Arlington 10

11 Stochastic Gradient Descent Find minimum θ of function L Dept. CSE, UT Arlington 11

12 Stochastic Gradient Descent Find minimum θ of function L Pick a starting point θ 0 Dept. CSE, UT Arlington 12

13 Stochastic Gradient Descent Find minimum θ of function L Pick a starting point θ 0 Dept. CSE, UT Arlington 13

14 Stochastic Gradient Descent Find minimum θ of function L Pick a starting point θ 0 Dept. CSE, UT Arlington 14

15 Stochastic Gradient Descent Find minimum θ of function L Pick a starting point θ 0 Approximate gradient L (θ 0 ) Dept. CSE, UT Arlington 15

16 Stochastic Gradient Descent Find minimum θ of function L Pick a starting point θ 0 Approximate gradient L (θ 0 ) Stochastic difference equation θ n+1 = θ n ε n L (θ n ) Under certain conditions, asymptotically approximates (continuous) gradient descent Dept. CSE, UT Arlington 16

17 Stochastic Gradient Descent for Matrix Factorization Set θ = (W, H) L(θ) = and use (i,j) Z L ij W i, H j L (θ) = (i,j) Z L ij W i, H j L (θ, z) = NL z (W iz, H jz ) Where N = Z and training point z is chosen randomly from the training set. Z = i, j : V ij 0 Dept. CSE, UT Arlington 17

18 Stochastic Gradient Descent for Matrix Factorization SGD for Matrix Factorization Input: A training set Z, initial values W0 and H0 1. Pick a random entry z Z 2. Compute approximate gradient L (θ, z) 3. Update parameters θ n+1 = θ n ε n L (θ n, z) 4. Repeat N times In practice, an additional projection is used θ n+1 = Π H [θ n ε n L θ n, z ] to keep the iterative in a given constraint set H = {θ: θ 0}. Dept. CSE, UT Arlington 18

19 Stochastic Gradient Descent for Matrix Factorization Why stochastic is good? Easy obtain May help in escaping local minima Exploit repetition with the data Dept. CSE, UT Arlington 19

20 Distributed SGD (DSGD) SGD steps depend on each other How to distribute? θ n+1 = θ n ε n L (θ n, z) Parameter mixing (ISGD) Map: Run independent instances of SGD on subsets of the data Reduce: Average results once at the end Does not converge to correct solution Iterative Parameter mixing (PSGD) Map: Run independent instances of SGD on subsets of the data (for some time) Reduce: Average results after each pass over the data Converges slowly Dept. CSE, UT Arlington 20

21 Distributed SGD (DSGD) Dept. CSE, UT Arlington 21

22 Stratified SGD Proposed Stratified SGD to obtain an efficient DSGD for matrix factorization L θ = ω 1 L 1 θ + ω 2 L 2 θ + + ω q L q θ The sum of loss function L s, and s is a stratum. A stratum is a part or partition of dataset SSGD runs standard SGD on a single stratum at a time, but switches strata in a way that guarantees correctness Dept. CSE, UT Arlington 22

23 Stratified SGD algorithm Suppose a stratum sequence {γ n }, each γ n takes values in {1,, q} θ n+1 = Π H [θ n ε n L γn θ n ] Appropriate sufficient conditions for the convergence of SSGD can be from stochastic approximation Dept. CSE, UT Arlington 23

24 Distribute SSGD SGD steps depend on each other θ n+1 = Π H [θ n ε n L γn θ n ] An SGD step on example z Z 1. Reads W iz, H jz 2. Performs gradient computation L ij 3. Updates and W iz H jz W iz, H jz Dept. CSE, UT Arlington 24

25 Problem Structure SGD steps depend on each other θ n+1 = Π H [θ n ε n L γn θ n ] An SGD step on example z Z 1. Reads W iz, H jz 2. Performs gradient computation L ij 3. Updates and W iz H jz W iz, H jz Dept. CSE, UT Arlington 25

26 Problem Structure SGD steps depend on each other θ n+1 = Π H [θ n ε n L γn θ n ] An SGD step on example z Z 1. Reads W iz, H jz 2. Performs gradient computation L ij 3. Updates and W iz H jz W iz, H jz Dept. CSE, UT Arlington 26

27 Problem Structure SGD steps depend on each other θ n+1 = Π H [θ n ε n L γn θ n ] An SGD step on example z Z 1. Reads W iz, H jz 2. Performs gradient computation L ij 3. Updates and W iz H jz W iz, H jz Not all steps are dependent Dept. CSE, UT Arlington 27

28 Interchangeability Definition 1. Two elements neither row nor column z 1, z 2 Z are interchangeable if they share Whenz n, z n+1 are interchangeable, the SGD steps θ n+1 = θ n ε L θ n, z n θ n+2 = θ n ε L θ n, z n ε L (θ n+1, z n+1 ) = θ n ε L θ n, z n ε L (θ n, z n+1 ) Dept. CSE, UT Arlington 28

29 A simple case Denote by the set of training points in block. Suppose we run T steps of SSGD on Z, starting from some initial point θ 0 = (W 0, H 0 ) and using a fixed step size ε. Describe an instance of the SGD process by a training sequence ω = z 0, z 1,, z T 1 of T training points. Z b θ n+1 ω = θ n ω + ε T 1 n=0 Y n (ω) Z b Dept. CSE, UT Arlington 29

30 A simple case θ n+1 ω = θ n ω + ε T 1 n=0 Y n (ω) Consider the subsequence σ b (ω) means training points from block Z b ; the length T b ω = σ b ω. The following theorem suggests we can run SGD on each block independently and then sum up the results. Theorem 3 Using the definitions above θ T ω = θ 0 + ε d b=1 T b ω 1 k=0 Y k (σ b (ω)) Dept. CSE, UT Arlington 30

31 A simple case If we divide into d independent map tasks Γ 1, Γ d. Task Γ b is responsible for subsequence σ b (ω) : It takes Z b, W b and H b as input, performs the block-local updates σ b (ω) Dept. CSE, UT Arlington 31

32 The General Case Theorem 3 can also be applied into a general case. d-monomial Dept. CSE, UT Arlington 32

33 The General Case Dept. CSE, UT Arlington 33

34 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Dept. CSE, UT Arlington 34

35 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Dept. CSE, UT Arlington 35

36 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Dept. CSE, UT Arlington 36

37 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Step 2: Simulate sequential SGD 1. Interchangeable blocks 2. Throw dice of how many iterations per block 3. Throw dice of which step sizes per block Dept. CSE, UT Arlington 37

38 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Step 2: Simulate sequential SGD 1. Interchangeable blocks 2. Throw dice of how many iterations per block 3. Throw dice of which step sizes per block Dept. CSE, UT Arlington 38

39 Exploitation Block and distribute the input matrix V High-level approach (Map only) 1. Pick a diagonal 2. Run SGD on the diagonal (in parallel) 3. Merge the results 4. Move on to next diagonal Steps 1-3 form a cycle Step 2: Simulate sequential SGD 1. Interchangeable blocks 2. Throw dice of how many iterations per block 3. Throw dice of which step sizes per block Dept. CSE, UT Arlington 39

40 Experiments Compare with PSGD, DGD, ALS methods Data Netflix Competition 100M ratings from 480k customers on 18k movies Synthetic dataset 10M rows, 1M columns, 1B nonzero entries Dept. CSE, UT Arlington 40

41 Experiments Test on two well-known loss functions L NZSL = i,j:v ij 0 V ij WH ij 2 L L2 = L NZSL + λ W F 2 + H F 2 It works well on a variety kind of loss functions, results for other loss functions can be found in their technique report. Dept. CSE, UT Arlington 41

42 Experiments The proposed DSGD converges faster and achieves better results than others Dept. CSE, UT Arlington 42

43 Experiments 1. The processing time remains constant as the size increase 2. To very large datasets on larger clusters, the overall running time increases by a modest 30% Dept. CSE, UT Arlington 43

44 Summary Matrix factorization Widely applicable via customized loss functions Large instances (millions * millions with billions of entries) Distributed Stochastic Gradient Descent Simple and versatile Achieves Fully distributed data/model Fully distributed processing Same or better loss Faster Good scalability Dept. CSE, UT Arlington 44

45 Thank you! Dept. CSE, UT Arlington 45

IBM Research Report. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

IBM Research Report. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent RJ10481 (A1103-008) March 16, 2011 Computer Science IBM Research Report Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent Rainer Gemulla Max-Planck-Institut für Informatik Saarbrücken,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at

More information

Large-scale Matrix Factorization. Kijung Shin Ph.D. Student, CSD

Large-scale Matrix Factorization. Kijung Shin Ph.D. Student, CSD Large-scale Matrix Factorization Kijung Shin Ph.D. Student, CSD Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments

More information

FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop Alex Beutel Department of Computer Science abeutel@cs.cmu.edu Evangelos Papalexakis Department of Computer Science epapalex@cs.cmu.edu

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems Patrick Seemann, December 16 th, 2014 16.12.2014 Fachbereich Informatik Recommender Systems Seminar Patrick Seemann Topics Intro New-User / New-Item

More information

Scaling Neighbourhood Methods

Scaling Neighbourhood Methods Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Recommendation Tobias Scheffer Recommendation Engines Recommendation of products, music, contacts,.. Based on user features, item

More information

Using SVD to Recommend Movies

Using SVD to Recommend Movies Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let s put it to use!

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let s put it to use! Data Mining The art of extracting knowledge from large bodies of structured data. Let s put it to use! 1 Recommendations 2 Basic Recommendations with Collaborative Filtering Making Recommendations 4 The

More information

Large-scale Collaborative Ranking in Near-Linear Time

Large-scale Collaborative Ranking in Near-Linear Time Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts of Statistics and Computer Science UC Davis KDD 17, Halifax, Canada August 13-17, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack

More information

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods Outline Principle Compoments Analysis (PCA) Example (Bishop, ch 12) PCA as a mixture model variant With a continuous latent

More information

Matrix Factorization and Collaborative Filtering

Matrix Factorization and Collaborative Filtering 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Matrix Factorization and Collaborative Filtering MF Readings: (Koren et al., 2009)

More information

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon Department of Computer Science University of Texas

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

SQL-Rank: A Listwise Approach to Collaborative Ranking

SQL-Rank: A Listwise Approach to Collaborative Ranking SQL-Rank: A Listwise Approach to Collaborative Ranking Liwei Wu Depts of Statistics and Computer Science UC Davis ICML 18, Stockholm, Sweden July 10-15, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Distributed Methods for High-dimensional and Large-scale Tensor Factorization

Distributed Methods for High-dimensional and Large-scale Tensor Factorization Distributed Methods for High-dimensional and Large-scale Tensor Factorization Kijung Shin Dept. of Computer Science and Engineering Seoul National University Seoul, Republic of Korea koreaskj@snu.ac.kr

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine

More information

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation Yue Ning 1 Yue Shi 2 Liangjie Hong 2 Huzefa Rangwala 3 Naren Ramakrishnan 1 1 Virginia Tech 2 Yahoo Research. Yue Shi

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Pawan Goyal CSE, IITKGP October 21, 2014 Pawan Goyal (IIT Kharagpur) Recommendation Systems October 21, 2014 1 / 52 Recommendation System? Pawan Goyal (IIT Kharagpur) Recommendation

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

PU Learning for Matrix Completion

PU Learning for Matrix Completion Cho-Jui Hsieh Dept of Computer Science UT Austin ICML 2015 Joint work with N. Natarajan and I. S. Dhillon Matrix Completion Example: movie recommendation Given a set Ω and the values M Ω, how to predict

More information

Parallel Matrix Factorization for Recommender Systems

Parallel Matrix Factorization for Recommender Systems Under consideration for publication in Knowledge and Information Systems Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon Department of

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Matrix Factorization and Factorization Machines for Recommender Systems

Matrix Factorization and Factorization Machines for Recommender Systems Talk at SDM workshop on Machine Learning Methods on Recommender Systems, May 2, 215 Chih-Jen Lin (National Taiwan Univ.) 1 / 54 Matrix Factorization and Factorization Machines for Recommender Systems Chih-Jen

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer

More information

Collaborative Filtering

Collaborative Filtering Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

Andriy Mnih and Ruslan Salakhutdinov

Andriy Mnih and Ruslan Salakhutdinov MATRIX FACTORIZATION METHODS FOR COLLABORATIVE FILTERING Andriy Mnih and Ruslan Salakhutdinov University of Toronto, Machine Learning Group 1 What is collaborative filtering? The goal of collaborative

More information

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task Binary Principal Component Analysis in the Netflix Collaborative Filtering Task László Kozma, Alexander Ilin, Tapani Raiko first.last@tkk.fi Helsinki University of Technology Adaptive Informatics Research

More information

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets

More information

Matrix and Tensor Factorization from a Machine Learning Perspective

Matrix and Tensor Factorization from a Machine Learning Perspective Matrix and Tensor Factorization from a Machine Learning Perspective Christoph Freudenthaler Information Systems and Machine Learning Lab, University of Hildesheim Research Seminar, Vienna University of

More information

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Principal Component Analysis (PCA) for Sparse High-Dimensional Data AB Principal Component Analysis (PCA) for Sparse High-Dimensional Data Tapani Raiko, Alexander Ilin, and Juha Karhunen Helsinki University of Technology, Finland Adaptive Informatics Research Center Principal

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering Matrix Factorization Techniques For Recommender Systems Collaborative Filtering Markus Freitag, Jan-Felix Schwarz 28 April 2011 Agenda 2 1. Paper Backgrounds 2. Latent Factor Models 3. Overfitting & Regularization

More information

Distributed Stochastic ADMM for Matrix Factorization

Distributed Stochastic ADMM for Matrix Factorization Distributed Stochastic ADMM for Matrix Factorization Zhi-Qin Yu Shanghai Key Laboratory of Scalable Computing and Systems Department of Computer Science and Engineering Shanghai Jiao Tong University, China

More information

Algorithms for Collaborative Filtering

Algorithms for Collaborative Filtering Algorithms for Collaborative Filtering or How to Get Half Way to Winning $1million from Netflix Todd Lipcon Advisor: Prof. Philip Klein The Real-World Problem E-commerce sites would like to make personalized

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 04 Matrix Completion Rainer Gemulla, Pauli Miettinen May 02, 2013 Recommender systems Problem Set of users Set of items (movies, books, jokes, products, stories,...) Feedback (ratings,

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Pawan Goyal CSE, IITKGP October 29-30, 2015 Pawan Goyal (IIT Kharagpur) Recommendation Systems October 29-30, 2015 1 / 61 Recommendation System? Pawan Goyal (IIT Kharagpur) Recommendation

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016 AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

1 Kernel methods & optimization

1 Kernel methods & optimization Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions

More information

Maximum Margin Matrix Factorization for Collaborative Ranking

Maximum Margin Matrix Factorization for Collaborative Ranking Maximum Margin Matrix Factorization for Collaborative Ranking Joint work with Quoc Le, Alexandros Karatzoglou and Markus Weimer Alexander J. Smola sml.nicta.com.au Statistical Machine Learning Program

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information

J. Sadeghi E. Patelli M. de Angelis

J. Sadeghi E. Patelli M. de Angelis J. Sadeghi E. Patelli Institute for Risk and, Department of Engineering, University of Liverpool, United Kingdom 8th International Workshop on Reliable Computing, Computing with Confidence University of

More information

Collaborative Filtering: A Machine Learning Perspective

Collaborative Filtering: A Machine Learning Perspective Collaborative Filtering: A Machine Learning Perspective Chapter 6: Dimensionality Reduction Benjamin Marlin Presenter: Chaitanya Desai Collaborative Filtering: A Machine Learning Perspective p.1/18 Topics

More information

Data Science Mastery Program

Data Science Mastery Program Data Science Mastery Program Copyright Policy All content included on the Site or third-party platforms as part of the class, such as text, graphics, logos, button icons, images, audio clips, video clips,

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Collaborative Filtering

Collaborative Filtering Collaborative Filtering Nicholas Ruozzi University of Texas at Dallas based on the slides of Alex Smola & Narges Razavian Collaborative Filtering Combining information among collaborating entities to make

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models Chengjie Qin 1, Martin Torres 2, and Florin Rusu 2 1 GraphSQL, Inc. 2 University of California Merced August 31, 2017 Machine

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Stochastic Gradient Descent Stochastic

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org J. Leskovec,

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE 5290 - Artificial Intelligence Grad Project Dr. Debasis Mitra Group 6 Taher Patanwala Zubin Kadva Factor Analysis (FA) 1. Introduction Factor

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

Lecture 11 Linear regression

Lecture 11 Linear regression Advanced Algorithms Floriano Zini Free University of Bozen-Bolzano Faculty of Computer Science Academic Year 2013-2014 Lecture 11 Linear regression These slides are taken from Andrew Ng, Machine Learning

More information

Decoupled Collaborative Ranking

Decoupled Collaborative Ranking Decoupled Collaborative Ranking Jun Hu, Ping Li April 24, 2017 Jun Hu, Ping Li WWW2017 April 24, 2017 1 / 36 Recommender Systems Recommendation system is an information filtering technique, which provides

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

From Non-Negative Matrix Factorization to Deep Learning

From Non-Negative Matrix Factorization to Deep Learning The Math!! From Non-Negative Matrix Factorization to Deep Learning Intuitions and some Math too! luissarmento@gmailcom https://wwwlinkedincom/in/luissarmento/ October 18, 2017 The Math!! Introduction Disclaimer

More information

Bayesian Contextual Multi-armed Bandits

Bayesian Contextual Multi-armed Bandits Bayesian Contextual Multi-armed Bandits Xiaoting Zhao Joint Work with Peter I. Frazier School of Operations Research and Information Engineering Cornell University October 22, 2012 1 / 33 Outline 1 Motivating

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux INRIA 8 Nov 2011 Nicolas Le Roux (INRIA) Neural networks and optimization 8 Nov 2011 1 / 80 1 Introduction 2 Linear classifier 3 Convolutional neural networks

More information

Collaborative Filtering Applied to Educational Data Mining

Collaborative Filtering Applied to Educational Data Mining Journal of Machine Learning Research (200) Submitted ; Published Collaborative Filtering Applied to Educational Data Mining Andreas Töscher commendo research 8580 Köflach, Austria andreas.toescher@commendo.at

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop Alex Beutel abeutel@cs.cmu.edu Partha Pratim Talukdar ppt@cs.cmu.edu Abhimanu Kumar abhimank@cs.cmu.edu Christos Faloutsos christos@cs.cmu.edu

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Predictive Models Predictive Models 1 / 34 Outline

More information

Nonnegative Matrix Factorization

Nonnegative Matrix Factorization Nonnegative Matrix Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 3: Regression I slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 48 In a nutshell

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Large Scale Machine Learning with Stochastic Gradient Descent

Large Scale Machine Learning with Stochastic Gradient Descent Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June) Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning.

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007 Recommender Systems Dipanjan Das Language Technologies Institute Carnegie Mellon University 20 November, 2007 Today s Outline What are Recommender Systems? Two approaches Content Based Methods Collaborative

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 8: Evaluation & SVD Paul Ginsparg Cornell University, Ithaca, NY 20 Sep 2011

More information

A Simple Algorithm for Nuclear Norm Regularized Problems

A Simple Algorithm for Nuclear Norm Regularized Problems A Simple Algorithm for Nuclear Norm Regularized Problems ICML 00 Martin Jaggi, Marek Sulovský ETH Zurich Matrix Factorizations for recommender systems Y = Customer Movie UV T = u () The Netflix challenge:

More information

CME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016

CME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016 GloVe on Spark Alex Adamson SUNet ID: aadamson June 6, 2016 Introduction Pennington et al. proposes a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Low Rank Matrix Completion Formulation and Algorithm

Low Rank Matrix Completion Formulation and Algorithm 1 2 Low Rank Matrix Completion and Algorithm Jian Zhang Department of Computer Science, ETH Zurich zhangjianthu@gmail.com March 25, 2014 Movie Rating 1 2 Critic A 5 5 Critic B 6 5 Jian 9 8 Kind Guy B 9

More information

High Performance Parallel Tucker Decomposition of Sparse Tensors

High Performance Parallel Tucker Decomposition of Sparse Tensors High Performance Parallel Tucker Decomposition of Sparse Tensors Oguz Kaya INRIA and LIP, ENS Lyon, France SIAM PP 16, April 14, 2016, Paris, France Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon,

More information

Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Machine Learning Techniques

Machine Learning Techniques Machine Learning Techniques ( 機器學習技法 ) Lecture 15: Matrix Factorization Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University

More information