Machine Learning Techniques

Machine Learning Techniques (機器學習技法)
Lecture 15: Matrix Factorization
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/22

Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 14: Radial Basis Function Network
linear aggregation of distance-based similarities, using k-means clustering for prototype finding

Lecture 15: Matrix Factorization
Linear Network Hypothesis
Basic Matrix Factorization
Stochastic Gradient Descent
Summary of Extraction Models
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/22

Linear Network Hypothesis
Recommender System Revisited

data → ML → skill
data: how many users have rated some movies
skill: predict how a user would rate an unrated movie

A Hot Problem
competition held by Netflix in 2006
100,480,507 ratings that 480,189 users gave to 17,770 movies
10% improvement = 1 million dollar prize

data $\mathcal{D}_m$ for the $m$-th movie: $\{(\mathbf{x}_n = (n),\; y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$
abstract feature $\mathbf{x}_n = (n)$

how to learn our preferences from data?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/22

Linear Network Hypothesis
Binary Vector Encoding of Categorical Feature

$\mathbf{x}_n = (n)$: user IDs, such as 1126, 5566, 6211, ..., called categorical features
categorical features, e.g.
IDs
blood type: A, B, AB, O
programming languages: C, C++, Java, Python, ...

many ML models operate on numerical features
linear models
extended linear models such as NNet
(except for decision trees)

need: encoding (transform) from categorical to numerical
binary vector encoding: A = $[1\;0\;0\;0]^T$, B = $[0\;1\;0\;0]^T$, AB = $[0\;0\;1\;0]^T$, O = $[0\;0\;0\;1]^T$
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/22
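
As a concrete illustration of binary vector encoding, here is a minimal NumPy sketch; the function name and the 1-indexed ID convention are illustrative choices, not part of the lecture.

```python
import numpy as np

def binary_vector_encoding(n, N):
    """Encode categorical ID n (1-indexed, out of N categories) as an N-dimensional binary vector."""
    x = np.zeros(N)
    x[n - 1] = 1.0
    return x

# blood types A, B, AB, O encoded over 4 categories
print(binary_vector_encoding(1, 4))  # A -> [1. 0. 0. 0.]
print(binary_vector_encoding(4, 4))  # O -> [0. 0. 0. 1.]
```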

Linear Network Hypothesis
Feature Extraction from Encoded Vector

encoded data $\mathcal{D}_m$ for the $m$-th movie:
$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\; y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$
or, joint data $\mathcal{D}$:
$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\; \mathbf{y}_n = [r_{n1}\; ?\; ?\; r_{n4}\; r_{n5}\; \ldots\; r_{nM}]^T)\}$

idea: try feature extraction using an $N$-$\tilde{d}$-$M$ NNet without all the bias terms $x_0^{(l)}$
[figure: network with inputs $x_1, \ldots, x_4$, tanh hidden units, weights $w_{ni}^{(1)}$ and $w_{im}^{(2)}$, outputs $y_1, y_2, y_3$]
is tanh necessary? :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/22

Linear Network Hypothesis
Linear Network Hypothesis

[figure: linear network with input $\mathbf{x} = [x_1\; x_2\; x_3\; x_4]^T$, first-layer weights $\mathsf{V}^T$ (the $w_{ni}^{(1)}$), second-layer weights $\mathsf{W}$ (the $w_{im}^{(2)}$), output $\mathbf{y} = [y_1\; y_2\; y_3]^T$]
$\{(\mathbf{x}_n = \text{BinaryVectorEncoding}(n),\; \mathbf{y}_n = [r_{n1}\; ?\; ?\; r_{n4}\; r_{n5}\; \ldots\; r_{nM}]^T)\}$

rename: $\mathsf{V}^T$ for $[w_{ni}^{(1)}]$ and $\mathsf{W}$ for $[w_{im}^{(2)}]$
hypothesis: $h(\mathbf{x}) = \mathsf{W}^T \mathsf{V} \mathbf{x}$
per-user output: $h(\mathbf{x}_n) = \mathsf{W}^T \mathbf{v}_n$, where $\mathbf{v}_n$ is the $n$-th column of $\mathsf{V}$

linear network for recommender system: learn $\mathsf{V}$ and $\mathsf{W}$
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/22
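
To make the per-user output concrete, here is a small NumPy sketch (toy sizes and random weights are assumed purely for illustration) showing that with a binary-encoded input, $h(\mathbf{x}_n) = \mathsf{W}^T \mathsf{V} \mathbf{x}_n$ reduces to $\mathsf{W}^T \mathbf{v}_n$.

```python
import numpy as np

N, M, d_tilde = 5, 3, 2            # users, movies, latent dimension (toy sizes)
rng = np.random.default_rng(0)
V = rng.normal(size=(d_tilde, N))  # column v_n is the feature vector of user n
W = rng.normal(size=(d_tilde, M))  # column w_m is the feature vector of movie m

n = 2                              # user ID (1-indexed)
x_n = np.zeros(N)
x_n[n - 1] = 1.0                   # binary vector encoding of user n

h_full = W.T @ (V @ x_n)           # h(x_n) = W^T V x_n: predicted ratings for all M movies
h_short = W.T @ V[:, n - 1]        # same thing: W^T v_n, since V x_n just selects column n

assert np.allclose(h_full, h_short)
```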

Linear Network Hypothesis
Fun Time
For $N$ users, $M$ movies, and $\tilde{d}$ features, how many variables need to be used to specify a linear network hypothesis $h(\mathbf{x}) = \mathsf{W}^T \mathsf{V} \mathbf{x}$?
1  $N + M + \tilde{d}$
2  $N \cdot M \cdot \tilde{d}$
3  $(N + M) \cdot \tilde{d}$
4  $(N \cdot M) + \tilde{d}$

Linear Network Hypothesis
Fun Time
For $N$ users, $M$ movies, and $\tilde{d}$ features, how many variables need to be used to specify a linear network hypothesis $h(\mathbf{x}) = \mathsf{W}^T \mathsf{V} \mathbf{x}$?
1  $N + M + \tilde{d}$
2  $N \cdot M \cdot \tilde{d}$
3  $(N + M) \cdot \tilde{d}$
4  $(N \cdot M) + \tilde{d}$

Reference Answer: 3
simply $N \cdot \tilde{d}$ for $\mathsf{V}^T$ and $\tilde{d} \cdot M$ for $\mathsf{W}$
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/22

Basic Matrix Factorization
Linear Network: Linear Model Per Movie

linear network: $h(\mathbf{x}) = \mathsf{W}^T \underbrace{\mathsf{V}\mathbf{x}}_{\Phi(\mathbf{x})}$
for the $m$-th movie, just a linear model $h_m(\mathbf{x}) = \mathbf{w}_m^T \Phi(\mathbf{x})$, subject to the shared transform $\Phi$
for every $\mathcal{D}_m$, want $r_{nm} = y_n \approx \mathbf{w}_m^T \mathbf{v}_n$

$E_{\text{in}}$ over all $\mathcal{D}_m$ with squared error measure:
$E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) = \frac{1}{\sum_{m=1}^{M} |\mathcal{D}_m|} \sum_{\text{user } n \text{ rated movie } m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2$

linear network: transform and linear models jointly learned from all $\mathcal{D}_m$
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/22
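
A minimal sketch of evaluating this $E_{\text{in}}$, assuming the known ratings are stored as (n, m, r_nm) triples with 1-indexed IDs and that V, W hold the user/movie vectors as columns (these storage conventions are my own, not from the slides):

```python
import numpy as np

def E_in(ratings, V, W):
    """Squared-error E_in averaged over the known ratings.
    ratings: list of (n, m, r_nm) triples, 1-indexed; V, W: user/movie vectors as columns."""
    errs = [(r - W[:, m - 1] @ V[:, n - 1]) ** 2 for (n, m, r) in ratings]
    return np.mean(errs)
```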

Basic Matrix Factorization
Matrix Factorization

[table: rating matrix $\mathsf{R}$ with rows user 1, user 2, ..., user N and columns movie 1, movie 2, ..., movie M; some entries known (e.g. 100, 80, 70, 90, 60, 0) and others unknown, marked "?"]
$r_{nm} \approx \mathbf{w}_m^T \mathbf{v}_n = \mathbf{v}_n^T \mathbf{w}_m$, i.e. $\mathsf{R} \approx \mathsf{V}^T \mathsf{W}$, with $\mathsf{V}^T$ stacking rows $\mathbf{v}_1^T, \mathbf{v}_2^T, \ldots, \mathbf{v}_N^T$ and $\mathsf{W}$ stacking columns $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M$

[figure: match movie and viewer factors, e.g. viewer "likes comedy?", "likes action?", "prefers blockbusters?", "likes Tom Cruise?" against movie "comedy content", "action content", "blockbuster?", "Tom Cruise in it?"; add contributions from each factor to get the predicted rating]

Matrix Factorization Model
learning: known rating → learned factors $\mathbf{v}_n$ and $\mathbf{w}_m$ → unknown rating prediction
similar modeling can be used for other abstract features
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/22

Basic Matrix Factorization
Matrix Factorization Learning

$\min_{\mathsf{W},\mathsf{V}} E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2 = \sum_{m=1}^{M} \sum_{(\mathbf{x}_n, r_{nm}) \in \mathcal{D}_m} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2$

two sets of variables: can consider alternating minimization, remember? :-)
when $\mathbf{v}_n$ fixed, minimizing $\mathbf{w}_m$ ≡ minimizing $E_{\text{in}}$ within $\mathcal{D}_m$ ≡ simply per-movie (per-$\mathcal{D}_m$) linear regression without $w_0$
when $\mathbf{w}_m$ fixed, minimizing $\mathbf{v}_n$? per-user linear regression without $v_0$, by symmetry between users/movies

called the alternating least squares algorithm
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/22

Basic Matrix Factorization
Alternating Least Squares

Alternating Least Squares
1 initialize $\tilde{d}$-dimensional vectors $\{\mathbf{w}_m\}$, $\{\mathbf{v}_n\}$
2 alternating optimization of $E_{\text{in}}$: repeatedly
   1 optimize $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M$: update $\mathbf{w}_m$ by $m$-th-movie linear regression on $\{(\mathbf{v}_n, r_{nm})\}$
   2 optimize $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N$: update $\mathbf{v}_n$ by $n$-th-user linear regression on $\{(\mathbf{w}_m, r_{nm})\}$
   until converged

initialize: usually just randomly
converge: guaranteed, as $E_{\text{in}}$ decreases during alternating minimization

alternating least squares: the tango dance between users/movies
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/22
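
A minimal sketch of the alternating least squares loop above, under the same (n, m, r_nm) triple convention as before; np.linalg.lstsq plays the role of each per-movie / per-user linear regression without a bias term, and the latent dimension and iteration count are illustrative defaults rather than tuned values.

```python
import numpy as np

def als(ratings, N, M, d_tilde=10, iters=20, seed=0):
    """Alternating least squares for matrix factorization.
    ratings: list of (n, m, r_nm) triples with 1-indexed user/movie IDs."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(d_tilde, N))   # user factors v_n (columns)
    W = rng.normal(scale=0.1, size=(d_tilde, M))   # movie factors w_m (columns)

    by_movie = {m: [(n, r) for (n, mm, r) in ratings if mm == m] for m in range(1, M + 1)}
    by_user  = {n: [(m, r) for (nn, m, r) in ratings if nn == n] for n in range(1, N + 1)}

    for _ in range(iters):
        # fix {v_n}: per-movie linear regression (no bias) to update w_m
        for m, pairs in by_movie.items():
            if pairs:
                A = np.array([V[:, n - 1] for (n, _) in pairs])
                y = np.array([r for (_, r) in pairs])
                W[:, m - 1] = np.linalg.lstsq(A, y, rcond=None)[0]
        # fix {w_m}: per-user linear regression (no bias) to update v_n
        for n, pairs in by_user.items():
            if pairs:
                A = np.array([W[:, m - 1] for (m, _) in pairs])
                y = np.array([r for (_, r) in pairs])
                V[:, n - 1] = np.linalg.lstsq(A, y, rcond=None)[0]
    return V, W
```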

Basic Matrix Factorization
Linear Autoencoder versus Matrix Factorization

Linear Autoencoder: $\mathsf{X} \approx \mathsf{W}\left(\mathsf{W}^T \mathsf{X}\right)$
motivation: special $d$-$\tilde{d}$-$d$ linear NNet
error measure: squared on all $x_{ni}$
solution: global optimal at eigenvectors of $\mathsf{X}^T \mathsf{X}$
usefulness: extract dimension-reduced features

Matrix Factorization: $\mathsf{R} \approx \mathsf{V}^T \mathsf{W}$
motivation: $N$-$\tilde{d}$-$M$ linear NNet
error measure: squared on known $r_{nm}$
solution: local optimal via alternating least squares
usefulness: extract hidden user/movie features

linear autoencoder ≡ special matrix factorization of complete $\mathsf{X}$
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/22

Basic Matrix Factorization
Fun Time
How many least squares problems does the alternating least squares algorithm need to solve in one iteration of alternation?
1  number of movies $M$
2  number of users $N$
3  $M + N$
4  $M \cdot N$

Basic Matrix Factorization
Fun Time
How many least squares problems does the alternating least squares algorithm need to solve in one iteration of alternation?
1  number of movies $M$
2  number of users $N$
3  $M + N$
4  $M \cdot N$

Reference Answer: 3
simply $M$ per-movie problems and $N$ per-user problems
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/22

Stochastic Gradient Descent
Another Possibility: Stochastic Gradient Descent

$E_{\text{in}}(\{\mathbf{w}_m\}, \{\mathbf{v}_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} \underbrace{\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2}_{\text{err(user } n\text{, movie } m\text{, rating } r_{nm})}$

SGD: randomly pick one example within the sum and update with the gradient of the per-example err, remember? :-)
efficient per iteration
simple to implement
easily extends to other err

next: SGD for matrix factorization
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/22

Stochastic Gradient Descent
Gradient of Per-Example Error Function

$\text{err(user } n\text{, movie } m\text{, rating } r_{nm}) = \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2$

$\nabla_{\mathbf{v}_{1126}}\, \text{err(user } n\text{, movie } m\text{, rating } r_{nm}) = \mathbf{0}$ unless $n = 1126$
$\nabla_{\mathbf{w}_{6211}}\, \text{err(user } n\text{, movie } m\text{, rating } r_{nm}) = \mathbf{0}$ unless $m = 6211$
$\nabla_{\mathbf{v}_n}\, \text{err(user } n\text{, movie } m\text{, rating } r_{nm}) = -2\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)\mathbf{w}_m$
$\nabla_{\mathbf{w}_m}\, \text{err(user } n\text{, movie } m\text{, rating } r_{nm}) = -2\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)\mathbf{v}_n$

per-example gradient ∝ −(residual) × (the other feature vector)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/22
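
For completeness, the chain-rule step behind the $\nabla_{\mathbf{v}_n}$ line above (the $\nabla_{\mathbf{w}_m}$ case follows by symmetry):

```latex
\nabla_{\mathbf{v}_n} \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)^2
  = 2\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right) \cdot \nabla_{\mathbf{v}_n}\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)
  = -2\left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right) \mathbf{w}_m
```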

Stochastic Gradient Descent
SGD for Matrix Factorization

SGD for Matrix Factorization
initialize $\tilde{d}$-dimensional vectors $\{\mathbf{w}_m\}$, $\{\mathbf{v}_n\}$ randomly
for $t = 0, 1, \ldots, T-1$
   1 randomly pick $(n, m)$ within all known $r_{nm}$
   2 calculate residual $\tilde{r}_{nm} = \left(r_{nm} - \mathbf{w}_m^T \mathbf{v}_n\right)$
   3 SGD-update:
      $\mathbf{v}_n^{\text{new}} \leftarrow \mathbf{v}_n^{\text{old}} + \eta \cdot \tilde{r}_{nm} \cdot \mathbf{w}_m^{\text{old}}$
      $\mathbf{w}_m^{\text{new}} \leftarrow \mathbf{w}_m^{\text{old}} + \eta \cdot \tilde{r}_{nm} \cdot \mathbf{v}_n^{\text{old}}$

SGD: perhaps the most popular large-scale matrix factorization algorithm
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/22
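
A minimal sketch of this SGD loop, again assuming (n, m, r_nm) triples with 1-indexed IDs; the learning rate, latent dimension, and iteration count are illustrative choices, not tuned values from the lecture.

```python
import numpy as np

def sgd_mf(ratings, N, M, d_tilde=10, eta=0.01, T=100000, seed=0):
    """Stochastic gradient descent for matrix factorization.
    ratings: list of (n, m, r_nm) triples with 1-indexed user/movie IDs."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(d_tilde, N))   # user factors v_n (columns)
    W = rng.normal(scale=0.1, size=(d_tilde, M))   # movie factors w_m (columns)

    for _ in range(T):
        n, m, r = ratings[rng.integers(len(ratings))]   # randomly pick a known rating
        residual = r - W[:, m - 1] @ V[:, n - 1]        # r~_nm = r_nm - w_m^T v_n
        v_old = V[:, n - 1].copy()
        V[:, n - 1] += eta * residual * W[:, m - 1]     # v_n <- v_n + eta * r~_nm * w_m (old)
        W[:, m - 1] += eta * residual * v_old           # w_m <- w_m + eta * r~_nm * v_n (old)
    return V, W
```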

Stochastic Gradient Descent
SGD for Matrix Factorization in Practice

KDDCup 2011 Track 1: World Champion Solution by NTU
specialty of data (application need): per-user training ratings earlier in time than test ratings
training/test mismatch: typical sampling bias, remember? :-)
want: emphasize the later examples
last $T$ iterations of SGD: only those $T$ examples considered, so the learned $\{\mathbf{w}_m\}$, $\{\mathbf{v}_n\}$ favor them
our idea: time-deterministic SGD that visits the later examples last
consistent improvements of test performance

if you understand the behavior of techniques, it is easier to modify them for your real-world use
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/22

Stochastic Gradient Descent
Fun Time
If all $\mathbf{w}_m$ and $\mathbf{v}_n$ are initialized to the $\mathbf{0}$ vector, what will NOT happen in SGD for matrix factorization?
1  all $\mathbf{w}_m$ are always $\mathbf{0}$
2  all $\mathbf{v}_n$ are always $\mathbf{0}$
3  every residual $\tilde{r}_{nm}$ = the original rating $r_{nm}$
4  $E_{\text{in}}$ decreases after each SGD update

Stochastic Gradient Descent
Fun Time
If all $\mathbf{w}_m$ and $\mathbf{v}_n$ are initialized to the $\mathbf{0}$ vector, what will NOT happen in SGD for matrix factorization?
1  all $\mathbf{w}_m$ are always $\mathbf{0}$
2  all $\mathbf{v}_n$ are always $\mathbf{0}$
3  every residual $\tilde{r}_{nm}$ = the original rating $r_{nm}$
4  $E_{\text{in}}$ decreases after each SGD update

Reference Answer: 4
The $\mathbf{0}$ feature vectors give a per-example gradient of $\mathbf{0}$ for every example, so $E_{\text{in}}$ cannot be decreased any further.
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/22

Summary of Extraction Models
Map of Extraction Models

extraction models: feature transform $\Phi$ as hidden variables, in addition to a linear model
Adaptive/Gradient Boosting: hypotheses $g_t$; weights $\alpha_t$
Neural Network/Deep Learning: weights $w_{ij}^{(l)}$; weights $w_{ij}^{(L)}$
RBF Network: RBF centers $\mu_m$; weights $\beta_m$
k Nearest Neighbor: $\mathbf{x}_n$-neighbor RBFs; weights $y_n$
Matrix Factorization: user features $\mathbf{v}_n$; movie features $\mathbf{w}_m$

extraction models: a rich family
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/22

Summary of Extraction Models
Map of Extraction Techniques

Adaptive/Gradient Boosting: functional gradient descent
Neural Network/Deep Learning: SGD (backprop); autoencoder
RBF Network: k-means clustering
k Nearest Neighbor: lazy learning :-)
Matrix Factorization: SGD; alternating least squares

extraction techniques: quite diverse
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/22

Summary of Extraction Models
Pros and Cons of Extraction Models
(Neural Network/Deep Learning, RBF Network, Matrix Factorization)

Pros
easy: reduces human burden in designing features
powerful: if enough hidden variables are considered

Cons
hard: non-convex optimization problems in general
overfitting: needs proper regularization/validation

be careful when applying extraction models
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/22

Summary of Extraction Models
Fun Time
Which of the following extraction models extracts Gaussian centers by k-means and aggregates the Gaussians linearly?
1  RBF Network
2  Deep Learning
3  Adaptive Boosting
4  Matrix Factorization

Summary of Extraction Models
Fun Time
Which of the following extraction models extracts Gaussian centers by k-means and aggregates the Gaussians linearly?
1  RBF Network
2  Deep Learning
3  Adaptive Boosting
4  Matrix Factorization

Reference Answer: 1
Congratulations on being an expert in extraction models! :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/22

Summary of Extraction Models
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 15: Matrix Factorization
Linear Network Hypothesis: feature extraction from binary vector encoding
Basic Matrix Factorization: alternating least squares between users/movies
Stochastic Gradient Descent: efficient and easily modified for practical use
Summary of Extraction Models: powerful, thus needing careful use

next: closing remarks on techniques
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/22