Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
1 Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
KDD 2011. Rainer Gemulla, Peter J. Haas, Erik Nijkamp and Yannis Sismanis.
Presenter: Jiawen Yao, Dept. CSE, UT Arlington
2 Outline: Matrix Factorization, Stochastic Gradient Descent, Distributed SGD, Experiments, Summary
3 Matrix completion
4 Matrix Factorization
As Web 2.0 and enterprise-cloud applications proliferate, data mining becomes more important. A typical real application involves a set of users, a set of items (movies, books, products, ...), and feedback (ratings, purchases, tags, ...); the task is to predict additional items a user may like. Assumption: similar feedback implies similar taste. Approach: matrix factorization.
5 Matrix Factorization Example - Netflix competition: 500k users, 20k movies, 100M movie ratings, 3M question marks.

            Avatar   The Matrix   Up
Alice          ?         4         2
Bob            3         2         ?
Charlie        5         ?         3

The goal is to predict the missing entries (denoted by ?).
6 Matrix Factorization
A general machine learning problem: recommender systems, text indexing, face recognition, ...
Training data: $V$, an $m \times n$ input matrix (e.g., a rating matrix); $Z$, a training set of indexes in $V$ (e.g., the subset of known ratings).
Output: find an approximation $V \approx WH$ that has the smallest loss, $\operatorname{argmin}_{W,H} L(V, W, H)$.
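To make the output concrete, here is a minimal sketch (with hypothetical toy numbers) of how a rank-2 factorization $V \approx WH$ yields a prediction for every entry, observed or not:

```python
import numpy as np

# Hypothetical toy example: 3 users x 3 movies, rank-2 factors.
# W (3x2) holds user factors; H (2x3) holds movie factors.
W = np.array([[0.9, 0.2],
              [0.3, 1.1],
              [1.2, 0.4]])
H = np.array([[1.0, 0.5, 2.0],
              [0.2, 1.5, 0.1]])

# Every entry of the approximation WH is defined, so even unobserved
# (user, movie) pairs receive a predicted rating.
V_hat = W @ H
print(V_hat[0, 0])  # predicted rating of user 0 for movie 0
```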
7 The loss function
Nonzero squared loss: $L_{\mathrm{NZSL}} = \sum_{i,j: V_{ij} \neq 0} (V_{ij} - [WH]_{ij})^2$
A sum of local losses over the entries in $V$: $L = \sum_{(i,j) \in Z} l(V_{ij}, W_{i*}, H_{*j})$
Focus on the class of nonzero decompositions: $Z = \{(i,j) : V_{ij} \neq 0\}$
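A minimal sketch of computing $L_{\mathrm{NZSL}}$, assuming the training set Z is stored as a dict of observed entries (toy data, hypothetical values):

```python
import numpy as np

def nzsl_loss(Z, W, H):
    """Nonzero squared loss: sum of (V_ij - [WH]_ij)^2 over the observed
    entries. Z maps (i, j) -> V_ij, i.e. the training set."""
    return sum((v - W[i] @ H[:, j]) ** 2 for (i, j), v in Z.items())

# Hypothetical toy data: a 3x4 problem with two observed ratings.
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 4))
Z = {(0, 1): 4.0, (2, 3): 3.0}
print(nzsl_loss(Z, W, H))
```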
8 The loss function
Find the best model: $\operatorname{argmin}_{W,H} \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})$, where $W_{i*}$ denotes row $i$ of matrix $W$ and $H_{*j}$ denotes column $j$ of matrix $H$.
To avoid trivialities, we assume there is at least one training point in every row and in every column.
9 Prior Work
Specialized algorithms: designed for a small class of loss functions (e.g., GKL loss).
Generic algorithms: handle all differentiable loss functions that decompose into summation form; examples are distributed gradient descent (DGD), partitioned SGD (PSGD), and the proposed DSGD.
10 Successful applications
Movie recommendation: >12M users, >20k movies, 2.4B ratings; 36GB data, 9.2GB model.
Website recommendation: 51M users, 15M URLs, 1.2B clicks; 17.8GB data, 161GB metadata, 49GB model.
News personalization.
11–16 Stochastic Gradient Descent
Find the minimum $\theta^*$ of a function $L$. Pick a starting point $\theta_0$ and approximate the gradient $\hat{L}'(\theta_0)$. Iterate the stochastic difference equation
$\theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n)$
Under certain conditions, this asymptotically approximates (continuous) gradient descent.
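A sketch of this generic update rule in code; the decaying step-size schedule below is one common choice, not taken from the slides:

```python
import numpy as np

def sgd(grad_estimate, theta0, data, eps0=0.1, n_steps=5000, seed=0):
    """theta_{n+1} = theta_n - eps_n * L'(theta_n, z), where z is a random
    training point and eps_n decays over time (an assumed schedule)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for n in range(n_steps):
        z = data[rng.integers(len(data))]   # pick a training point at random
        eps_n = eps0 / (1.0 + n) ** 0.5     # decaying step size
        theta = theta - eps_n * grad_estimate(theta, z)
    return theta

# Example: minimize E[(theta - z)^2]; the minimizer is the data mean.
data = np.array([1.0, 2.0, 3.0, 4.0])
print(sgd(lambda th, z: 2.0 * (th - z), [0.0], data))  # ~2.5
```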
17 Stochastic Gradient Descent for Matrix Factorization
Set $\theta = (W, H)$ and use
$L(\theta) = \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j})$
$L'(\theta) = \sum_{(i,j) \in Z} L'_{ij}(W_{i*}, H_{*j})$
$\hat{L}'(\theta, z) = N \, L'_z(W_{i_z *}, H_{* j_z})$
where $N = |Z|$, training point $z$ is chosen randomly from the training set, and $Z = \{(i,j) : V_{ij} \neq 0\}$.
18 Stochastic Gradient Descent for Matrix Factorization
SGD for Matrix Factorization. Input: a training set Z and initial values $W_0$ and $H_0$.
1. Pick a random entry $z \in Z$
2. Compute the approximate gradient $\hat{L}'(\theta, z)$
3. Update the parameters: $\theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n, z)$
4. Repeat N times
In practice, an additional projection is used, $\theta_{n+1} = \Pi_H[\theta_n - \epsilon_n \hat{L}'(\theta_n, z)]$, to keep the iterates in a given constraint set $H = \{\theta : \theta \geq 0\}$.
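A minimal sketch of this algorithm for nonzero squared loss, including the projection onto $\{\theta \geq 0\}$; the toy ratings and step size are hypothetical:

```python
import numpy as np

def sgd_mf_step(W, H, i, j, v, eps):
    """One SGD step on z = (i, j) for the local loss (v - W[i] @ H[:, j])^2,
    followed by the projection onto {theta >= 0}."""
    err = v - W[i] @ H[:, j]
    grad_w = -2.0 * err * H[:, j]                      # d loss / d W[i]
    grad_h = -2.0 * err * W[i]                         # d loss / d H[:, j]
    W[i]    = np.maximum(W[i]    - eps * grad_w, 0.0)  # projection step
    H[:, j] = np.maximum(H[:, j] - eps * grad_h, 0.0)

# Minimal training loop on hypothetical toy data.
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 4))
Z = [(0, 1, 4.0), (2, 3, 3.0), (0, 0, 5.0)]
for _ in range(20000):
    i, j, v = Z[rng.integers(len(Z))]
    sgd_mf_step(W, H, i, j, v, eps=0.01)
print(W[0] @ H[:, 1])  # approaches the observed rating 4.0
```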
19 Stochastic Gradient Descent for Matrix Factorization
Why is the stochastic approach good? The approximate gradient is easy to obtain; the noise may help in escaping local minima; and it exploits repetition within the data.
20 Distributed SGD (DSGD)
SGD steps depend on each other: $\theta_{n+1} = \theta_n - \epsilon_n \hat{L}'(\theta_n, z)$. How to distribute?
Parameter mixing (ISGD). Map: run independent instances of SGD on subsets of the data. Reduce: average the results once at the end. Does not converge to the correct solution.
Iterative parameter mixing (PSGD). Map: run independent instances of SGD on subsets of the data (for some time). Reduce: average the results after each pass over the data. Converges slowly.
21 Distributed SGD (DSGD)
22 Stratified SGD
Stratified SGD (SSGD) is proposed to obtain an efficient DSGD for matrix factorization:
$L(\theta) = w_1 L_1(\theta) + w_2 L_2(\theta) + \cdots + w_q L_q(\theta)$
a weighted sum of per-stratum losses $L_s$, where each stratum $s$ is a part (partition) of the dataset. SSGD runs standard SGD on a single stratum at a time, but switches strata in a way that guarantees correctness.
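One simple way to realize such strata for matrix factorization, sketched below, is to block V into a d × d grid and take each "diagonal" of blocks as a stratum (the block indexing here is illustrative):

```python
def make_strata(d):
    """Form d strata from a d x d blocking of V: stratum s holds the blocks
    {(b, (b + s) mod d)}. No two blocks in a stratum share a row block or a
    column block, so their SGD steps do not interfere."""
    return [[(b, (b + s) % d) for b in range(d)] for s in range(d)]

for s, stratum in enumerate(make_strata(3)):
    print(s, stratum)
# 0 [(0, 0), (1, 1), (2, 2)]
# 1 [(0, 1), (1, 2), (2, 0)]
# 2 [(0, 2), (1, 0), (2, 1)]
```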
23 Stratified SGD algorithm
Suppose a stratum sequence $\{\gamma_n\}$, where each $\gamma_n$ takes values in $\{1, \ldots, q\}$:
$\theta_{n+1} = \Pi_H[\theta_n - \epsilon_n \hat{L}'_{\gamma_n}(\theta_n)]$
Appropriate sufficient conditions for the convergence of SSGD can be obtained from stochastic approximation theory.
24–27 Distributing SSGD: Problem Structure
SGD steps depend on each other: $\theta_{n+1} = \Pi_H[\theta_n - \epsilon_n \hat{L}'_{\gamma_n}(\theta_n)]$.
An SGD step on example $z \in Z$:
1. Reads $W_{i_z *}$ and $H_{* j_z}$
2. Performs the gradient computation $L'_{i_z j_z}$
3. Updates $W_{i_z *}$ and $H_{* j_z}$
But not all steps are dependent.
28 Interchangeability
Definition 1. Two elements $z_1, z_2 \in Z$ are interchangeable if they share neither row nor column.
When $z_n$ and $z_{n+1}$ are interchangeable, the SGD steps commute:
$\theta_{n+2} = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_{n+1}, z_{n+1}) = \theta_n - \epsilon \hat{L}'(\theta_n, z_n) - \epsilon \hat{L}'(\theta_n, z_{n+1})$
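A quick numerical check of this property on hypothetical toy data: running the two steps in either order gives identical factors, because the steps read and write disjoint rows of W and columns of H:

```python
import numpy as np

def step(W, H, z, eps=0.1):
    """One SGD step for the local squared loss at z = (i, j, v)."""
    i, j, v = z
    err = v - W[i] @ H[:, j]
    W[i], H[:, j] = (W[i] + 2 * eps * err * H[:, j],
                     H[:, j] + 2 * eps * err * W[i])

rng = np.random.default_rng(1)
W0, H0 = rng.random((3, 2)), rng.random((2, 3))
z1, z2 = (0, 1, 4.0), (2, 2, 3.0)   # different row AND different column

Wa, Ha = W0.copy(), H0.copy(); step(Wa, Ha, z1); step(Wa, Ha, z2)
Wb, Hb = W0.copy(), H0.copy(); step(Wb, Hb, z2); step(Wb, Hb, z1)
print(np.allclose(Wa, Wb) and np.allclose(Ha, Hb))  # True
```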
29 A simple case
Denote by $Z_b$ the set of training points in block $b$. Suppose we run $T$ steps of SSGD on $Z$, starting from some initial point $\theta_0 = (W_0, H_0)$ and using a fixed step size $\epsilon$. Describe an instance of the SGD process by a training sequence $\omega = (z_0, z_1, \ldots, z_{T-1})$ of $T$ training points:
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{n=0}^{T-1} Y_n(\omega)$
30 A simple case
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{n=0}^{T-1} Y_n(\omega)$
Consider the subsequence $\sigma_b(\omega)$ of training points from block $Z_b$, with length $T_b(\omega) = |\sigma_b(\omega)|$. The following theorem suggests we can run SGD on each block independently and then sum up the results.
Theorem 3. Using the definitions above,
$\theta_T(\omega) = \theta_0 + \epsilon \sum_{b=1}^{d} \sum_{k=0}^{T_b(\omega)-1} Y_k(\sigma_b(\omega))$
31 A simple case
Divide the work into $d$ independent map tasks $\Gamma_1, \ldots, \Gamma_d$. Task $\Gamma_b$ is responsible for subsequence $\sigma_b(\omega)$: it takes $Z_b$, $W_b$ and $H_b$ as input and performs the block-local updates.
32–33 The General Case
Theorem 3 can also be applied to the general case, using d-monomial training sequences.
34–39 Exploitation
Block and distribute the input matrix V. High-level approach (map only):
1. Pick a diagonal
2. Run SGD on the diagonal (in parallel)
3. Merge the results
4. Move on to the next diagonal
Steps 1–3 form a cycle. Step 2 simulates sequential SGD:
1. Interchangeable blocks
2. Throw dice on how many iterations per block
3. Throw dice on which step sizes per block
A sketch of one such cycle follows.
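A minimal end-to-end sketch of this cycle, with the parallel map tasks simulated sequentially; the blocking scheme, toy data, and step size are assumptions for illustration:

```python
import numpy as np

def dsgd_epoch(blocks, W, H, d, eps, rng):
    """One DSGD cycle over all d diagonals (strata). On a real cluster the
    d blocks of a diagonal would run on d workers in parallel, since they
    touch disjoint row blocks of W and column blocks of H."""
    for s in rng.permutation(d):            # pick diagonals in random order
        for b in range(d):                  # in parallel on a real cluster
            for i, j, v in blocks[(b, (b + s) % d)]:
                err = v - W[i] @ H[:, j]
                W[i], H[:, j] = (W[i] + 2 * eps * err * H[:, j],
                                 H[:, j] + 2 * eps * err * W[i])
        # merge: workers would exchange updated factor blocks here

# Hypothetical setup: assign each training point to a d x d grid of blocks.
d, m, n = 2, 6, 8
rng = np.random.default_rng(0)
W, H = rng.random((m, 2)), rng.random((2, n))
Z = [(0, 1, 4.0), (3, 5, 2.0), (5, 7, 5.0), (2, 2, 3.0)]
blocks = {(bi, bj): [] for bi in range(d) for bj in range(d)}
for i, j, v in Z:
    blocks[(i * d // m, j * d // n)].append((i, j, v))
for _ in range(500):
    dsgd_epoch(blocks, W, H, d, eps=0.02, rng=rng)
print(W[0] @ H[:, 1])  # approaches 4.0
```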
40 Experiments
Compared with the PSGD, DGD, and ALS methods.
Data: Netflix competition (100M ratings from 480k customers on 18k movies); a synthetic dataset (10M rows, 1M columns, 1B nonzero entries).
41 Experiments
Tested on two well-known loss functions:
$L_{\mathrm{NZSL}} = \sum_{i,j: V_{ij} \neq 0} (V_{ij} - [WH]_{ij})^2$
$L_{L2} = L_{\mathrm{NZSL}} + \lambda (\|W\|_F^2 + \|H\|_F^2)$
DSGD works well on a variety of loss functions; results for other loss functions can be found in the authors' technical report.
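A sketch of the L2-regularized loss as defined above; the regularization weight shown is an arbitrary hypothetical value:

```python
import numpy as np

def l2_loss(Z, W, H, lam):
    """L_L2 = L_NZSL + lambda * (||W||_F^2 + ||H||_F^2)."""
    nzsl = sum((v - W[i] @ H[:, j]) ** 2 for (i, j), v in Z.items())
    return nzsl + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Usage with hypothetical toy values; lam = 0.05 is an arbitrary choice.
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 4))
print(l2_loss({(0, 1): 4.0, (2, 3): 3.0}, W, H, lam=0.05))
```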
42 Experiments
The proposed DSGD converges faster and achieves better results than the other methods.
43 Experiments
1. The processing time remains roughly constant as the size increases.
2. When scaling to very large datasets on larger clusters, the overall running time increases by a modest 30%.
44 Summary
Matrix factorization: widely applicable via customized loss functions; handles large instances (millions × millions matrices with billions of entries).
Distributed Stochastic Gradient Descent: simple and versatile. It achieves fully distributed data and model, fully distributed processing, the same or better loss, faster convergence, and good scalability.
45 Thank you!