Linear Regression (continued)


Professor Ameet Talwalkar
CS260 Machine Learning Algorithms
February 6, 2017

Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions

Announcements
HW2 will be returned in section on Friday
HW3 due in class next Monday
Midterm is next Wednesday (will review in more detail next class)

Outline
1 Administration
2 Review of last lecture
   Perceptron
   Linear regression introduction
3 Linear regression
4 Nonlinear basis functions

Perceptron

Main idea: consider a linear model for binary classification, w^T x, which we use to distinguish between two classes {-1, +1}.

One goal: minimize errors on the training dataset,

ε = Σ_n I[y_n ≠ sign(w^T x_n)]

Hard, but easy if we have only one training example

How can we change w such that y_n = sign(w^T x_n)?

Two cases:
If y_n = sign(w^T x_n), do nothing.
If y_n ≠ sign(w^T x_n), set w_new ← w_old + y_n x_n.

Guaranteed to make progress, i.e., to get us closer to y_n (w^T x_n) > 0.

What does the update do?

y_n [(w + y_n x_n)^T x_n] = y_n w^T x_n + y_n^2 x_n^T x_n

We are adding a positive number, so it is possible that y_n (w_new^T x_n) > 0.

Perceptron algorithm

Iteratively solving one case at a time:
REPEAT
  Pick a data point x_n (can be a fixed order over the training instances)
  Make a prediction y = sign(w^T x_n) using the current w
  If y = y_n, do nothing. Else, w ← w + y_n x_n
UNTIL converged

Properties:
This is an online algorithm.
If the training data is linearly separable, the algorithm stops in a finite number of steps (we proved this).
The parameter vector is always a linear combination of the training instances (requires initializing w^(0) = 0).
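
To make the update rule concrete, here is a minimal NumPy sketch of the loop described above; the data X, labels y, and the maximum number of passes max_epochs are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron sketch: X is (N, D), y contains labels in {-1, +1}."""
    N, D = X.shape
    w = np.zeros(D)                      # initialize w^(0) = 0
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):               # fixed order over training instances
            if np.sign(w @ X[n]) != y[n]:
                w = w + y[n] * X[n]      # w <- w + y_n x_n
                mistakes += 1
        if mistakes == 0:                # converged: no mistakes in a full pass
            break
    return w
```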

Regression

Predicting a continuous outcome variable:
Predicting shoe size from height, weight, and gender
Predicting song year from audio features

Key difference from classification: we can measure the closeness of predictions and labels.
Predicting shoe size: better to be off by one size than by 5 sizes
Predicting song year: better to be off by one year than by 20 years
As opposed to the 0-1 classification error, we will focus on the squared difference, i.e., (ŷ - y)^2.

1D example: predicting the sale price of a house

Sale price ≈ price per sqft × square footage + fixed expense

Minimize squared errors

Our model: Sale price = price per sqft × square footage + fixed expense + unexplainable stuff

Training data:

sqft    sale price    prediction    error    squared error
2000    810K          720K          90K      8100
2100    907K          800K          107K     107^2
1100    312K          350K          -38K     38^2
5500    2,600K        2,600K        0        0
...
Total: 8100 + 107^2 + 38^2 + 0 + ...

Aim: adjust price per sqft and fixed expense such that the sum of the squared errors is minimized, i.e., the unexplainable stuff is minimized.

Linear regression

Setup:
Input: x ∈ R^D (covariates, predictors, features, etc.)
Output: y ∈ R (responses, targets, outcomes, outputs, etc.)
Model: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x
w = [w_1 w_2 ... w_D]^T: weights, parameters, or parameter vector
w_0 is called the bias
We also sometimes call the augmented vector w̃ = [w_0 w_1 w_2 ... w_D]^T the parameters too
Training data: D = {(x_n, y_n), n = 1, 2, ..., N}

Least Mean Squares (LMS) objective: minimize the squared difference on the training data (or residual sum of squares)

RSS(w̃) = Σ_n [y_n - f(x_n)]^2 = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2

1D solution: identify stationary points by taking the derivative with respect to the parameters and setting it to zero, yielding the normal equations.
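
A minimal sketch of evaluating the RSS objective for a candidate parameter setting; the helper name rss and the toy values of X, y, w_0, and w are illustrative assumptions.

```python
import numpy as np

def rss(w0, w, X, y):
    """Residual sum of squares: sum_n [y_n - (w0 + w^T x_n)]^2."""
    preds = w0 + X @ w          # f(x_n) for every training point
    return np.sum((y - preds) ** 2)

# toy data: N = 4 points with D = 2 features (illustrative values)
X = np.array([[2.0, 1.0], [2.1, 3.0], [1.1, 0.5], [5.5, 2.0]])
y = np.array([8.1, 9.0, 3.1, 26.0])
print(rss(w0=0.5, w=np.array([4.0, 0.3]), X=X, y=y))
```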

Probabilistic interpretation

Noisy observation model: Y = w_0 + w_1 X + η, where η ~ N(0, σ^2) is a Gaussian random variable.

Likelihood of one training sample (x_n, y_n):

p(y_n | x_n) = N(w_0 + w_1 x_n, σ^2) = (1 / (√(2π) σ)) exp( -[y_n - (w_0 + w_1 x_n)]^2 / (2σ^2) )

Maximum likelihood estimation

Maximize over w_0 and w_1:

max log P(D)  ⇔  min Σ_n [y_n - (w_0 + w_1 x_n)]^2

That is RSS(w̃)!

Maximize over s = σ^2:

∂ log P(D) / ∂s = (1/2)(1/s^2) Σ_n [y_n - (w_0 + w_1 x_n)]^2 - N/(2s) = 0

⇒ σ^2* = s* = (1/N) Σ_n [y_n - (w_0 + w_1 x_n)]^2

How does this probabilistic interpretation help us?

It gives a solid footing to our intuition: minimizing RSS(w̃) is a sensible thing based on reasonable modeling assumptions.
Estimating σ tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions.

Outline
1 Administration
2 Review of last lecture
3 Linear regression
   Multivariate solution in matrix form
   Computational and numerical optimization
   Ridge regression
4 Nonlinear basis functions

LMS when x is D-dimensional

RSS(w̃) in matrix form:

RSS(w̃) = Σ_n [y_n - (w_0 + Σ_d w_d x_nd)]^2 = Σ_n [y_n - w̃^T x̃_n]^2

where we have redefined some variables (by augmenting):

x̃ ← [1 x_1 x_2 ... x_D]^T,   w̃ ← [w_0 w_1 w_2 ... w_D]^T

which leads to

RSS(w̃) = Σ_n (y_n - w̃^T x̃_n)(y_n - x̃_n^T w̃)
        = Σ_n ( w̃^T x̃_n x̃_n^T w̃ - 2 y_n x̃_n^T w̃ ) + const.
        = w̃^T ( Σ_n x̃_n x̃_n^T ) w̃ - 2 ( Σ_n y_n x̃_n^T ) w̃ + const.

Matrix Multiplication via Inner Products

Each entry of the output matrix is the inner product of a row of the first input and a column of the second input. For

A = [ 9 3 5 ; 4 1 2 ] (2×3),   B = [ 1 2 ; 3 5 ; 2 3 ] (3×2),

the product is AB = [ 28 48 ; 11 19 ], e.g., the (1,1) entry is 9·1 + 3·3 + 5·2 = 28.

Matrix Multiplication via Outer Products

The output matrix is the sum of outer products between corresponding columns of the first input and rows of the second input:

AB = [ 9 18 ; 4 8 ] + [ 9 15 ; 3 5 ] + [ 10 15 ; 4 6 ] = [ 28 48 ; 11 19 ]
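
A minimal NumPy check of the two views on the same 2×3 and 3×2 matrices used above; the variable names are mine.

```python
import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])          # 2 x 3
B = np.array([[1, 2],
              [3, 5],
              [2, 3]])             # 3 x 2

# Inner-product view: entry (i, j) is row i of A dotted with column j of B.
inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                  for i in range(A.shape[0])])

# Outer-product view: sum over k of (column k of A)(row k of B).
outer = sum(np.outer(A[:, k], B[k]) for k in range(A.shape[1]))

print(inner)            # [[28 48] [11 19]]
print(outer)            # same result
print(A @ B)            # NumPy's built-in product agrees
```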

RSS(w̃) in new notation

From the previous slide:

RSS(w̃) = w̃^T ( Σ_n x̃_n x̃_n^T ) w̃ - 2 ( Σ_n y_n x̃_n^T ) w̃ + const.

Design matrix and target vector:

X̃ = [ x̃_1^T ; x̃_2^T ; ... ; x̃_N^T ] ∈ R^{N×(D+1)},   y = [ y_1 ; y_2 ; ... ; y_N ]

Compact expression:

RSS(w̃) = ||X̃ w̃ - y||_2^2 = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

Solution in matrix form

Compact expression:

RSS(w̃) = ||X̃ w̃ - y||_2^2 = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

Gradients of linear and quadratic functions:

∇_x (b^T x) = b
∇_x (x^T A x) = 2 A x   (symmetric A)

Normal equation:

∇_w̃ RSS(w̃) ∝ X̃^T X̃ w̃ - X̃^T y = 0

This leads to the least-mean-squares (LMS) solution:

w̃_LMS = (X̃^T X̃)^{-1} X̃^T y
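
A minimal sketch of solving the normal equations with NumPy; the synthetic data, noise level, and use of np.linalg.solve (rather than forming the inverse explicitly) are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + 0.1 * rng.normal(size=N)

# Augment with a column of ones so w_tilde = [w_0, w_1, ..., w_D].
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicit matrix inversion.
w_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_lms)   # approximately [4.0, 1.5, -2.0, 0.7]
```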

Mini-Summary

Linear regression is a linear combination of features: f : x → y, with f(x) = w_0 + Σ_d w_d x_d = w_0 + w^T x.
If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters.
Probabilistic interpretation: maximum likelihood estimation under the assumption that the residual is Gaussian distributed.

Computational complexity

Bottleneck of computing the solution w̃ = (X̃^T X̃)^{-1} X̃^T y:
Matrix multiplication to form X̃^T X̃ ∈ R^{(D+1)×(D+1)}
Inverting the matrix X̃^T X̃

How many operations do we need?
O(ND^2) for the matrix multiplication
O(D^3) (e.g., using Gauss-Jordan elimination) or O(D^2.373) (recent theoretical advances) for the matrix inversion
Impractical for very large D or N

Alternative method: an example of using numerical optimization

(Batch) gradient descent:
Initialize w̃ to w̃^(0) (e.g., randomly); set t = 0; choose η > 0
Loop until convergence:
1 Compute the gradient ∇RSS(w̃) = X̃^T X̃ w̃^(t) - X̃^T y
2 Update the parameters w̃^(t+1) = w̃^(t) - η ∇RSS(w̃)
3 t ← t + 1

What is the complexity of each iteration?
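
A minimal sketch of batch gradient descent on the RSS objective; the step size eta, the iteration budget, and the synthetic data are illustrative assumptions (the lecture does not prescribe specific values).

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, num_iters=500):
    """Minimize RSS(w) = ||Xw - y||^2 by batch gradient descent (sketch)."""
    w = np.zeros(X.shape[1])                 # w^(0)
    for _ in range(num_iters):
        grad = X.T @ (X @ w) - X.T @ y       # gradient (constant factor folded into eta)
        w = w - eta * grad                   # w^(t+1) = w^(t) - eta * grad
    return w

# illustrative data with an augmented all-ones column for the bias term
rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=50)
print(batch_gradient_descent(X, y))          # close to [2.0, -1.0, 0.5]
```

Each iteration costs O(ND): one pass over the data to form the gradient.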

Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion. This is because RSS(w̃) is a convex function of its parameters w̃.

Hessian of RSS:

RSS(w̃) = w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ + const.

∂^2 RSS(w̃) / ∂w̃ ∂w̃^T = 2 X̃^T X̃

X̃^T X̃ is positive semidefinite, because for any v,

v^T X̃^T X̃ v = ||X̃ v||_2^2 ≥ 0

Stochastic gradient descent

Widrow-Hoff rule: update parameters using one example at a time.
Initialize w̃ to some w̃^(0); set t = 0; choose η > 0
Loop until convergence:
1 Randomly choose a training sample x̃_t
2 Compute its contribution to the gradient: g_t = (x̃_t^T w̃^(t) - y_t) x̃_t
3 Update the parameters: w̃^(t+1) = w̃^(t) - η g_t
4 t ← t + 1

How does the complexity per iteration compare with gradient descent?
O(ND) for gradient descent versus O(D) for SGD
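
A minimal SGD sketch matching the Widrow-Hoff update above; the constant step size, fixed iteration budget, and synthetic data are illustrative simplifications (in practice a decaying η is common).

```python
import numpy as np

def sgd_least_squares(X, y, eta=0.01, num_iters=5000, seed=0):
    """Stochastic gradient descent for least squares (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        t = rng.integers(N)                  # randomly choose one training sample
        g = (X[t] @ w - y[t]) * X[t]         # g_t = (x_t^T w - y_t) x_t
        w = w - eta * g                      # O(D) work per iteration
    return w

# illustrative data (augmented with a ones column for the bias)
rng = np.random.default_rng(2)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + 0.05 * rng.normal(size=200)
print(sgd_least_squares(X, y))               # roughly [1.0, 3.0, -2.0]
```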

Mini-summary

Batch gradient descent computes the exact gradient.
Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
Mini-batch variant: a trade-off between the accuracy of the gradient estimate and the computational cost.
Similar ideas extend to other ML optimization problems. For large-scale problems, stochastic gradient descent often works well.

What if X̃^T X̃ is not invertible?

Why might this happen?
Answer 1: N < D. Intuitively, there is not enough data to estimate all the parameters.
Answer 2: The columns of X̃ are not linearly independent, e.g., some features are perfectly correlated. In this case, the solution is not unique.

Ridge regression

Intuition: what does a non-invertible X̃^T X̃ mean? Consider the SVD of this matrix:

X̃^T X̃ = U diag(λ_1, λ_2, ..., λ_r, 0, ..., 0) U^T

where λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0 and r < D.

Fix the problem by ensuring all singular values are non-zero:

X̃^T X̃ + λI = U diag(λ_1 + λ, λ_2 + λ, ..., λ_r + λ, λ, ..., λ) U^T

where λ > 0 and I is the identity matrix.
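
A small numerical illustration of this intuition, under made-up data: with a perfectly correlated column the Gram matrix X^T X is rank-deficient, but adding λI makes it well conditioned.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=20)
X = np.column_stack([np.ones(20), x1, 2.0 * x1])   # third column duplicates the second

gram = X.T @ X
print(np.linalg.matrix_rank(gram))                 # 2 < 3: not invertible

lam = 0.1
print(np.linalg.cond(gram + lam * np.eye(3)))      # finite condition number: invertible
```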

Regularized least squares (ridge regression)

Solution:

w̃ = (X̃^T X̃ + λI)^{-1} X̃^T y

This is equivalent to adding an extra term to RSS(w̃) and minimizing

(1/2) { w̃^T X̃^T X̃ w̃ - 2 (X̃^T y)^T w̃ } + (1/2) λ ||w̃||_2^2

where the first term is RSS(w̃)/2 (up to a constant) and the second term is the regularization.

Benefits:
Numerically more stable, invertible matrix
Prevents overfitting (more on this later)
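
A minimal ridge-regression sketch mirroring the closed form above; the value of λ and the toy data (with a duplicated feature that makes plain least squares ill-posed) are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y (sketch)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# duplicated feature makes plain least squares ill-posed; ridge still works
rng = np.random.default_rng(4)
x1 = rng.normal(size=30)
X = np.column_stack([np.ones(30), x1, 2.0 * x1])
y = 5.0 * x1 + 1.0 + 0.1 * rng.normal(size=30)
print(ridge(X, y, lam=0.1))
```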

How to choose λ?

λ is referred to as a hyperparameter (in contrast, w is the parameter vector).
Use a validation set or cross-validation to find a good choice of λ.

Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions

Is a linear modeling assumption always a good idea?

[Figure: example of nonlinear classification (two-class data in the (x_1, x_2) plane)]
[Figure: example of nonlinear regression (x vs. t)]

Nonlinear basis for classification

Transform the input/feature: φ(x) : x ∈ R^2 → z = x_1 x_2

[Figure: original data and transformed training data; after the transformation the data is linearly separable!]

Another example

How to transform the input/feature?

φ(x) : x ∈ R^2 → z = [x_1^2, x_1 x_2, x_2^2]^T ∈ R^3

[Figure: original 2D data and transformed training data in R^3; the transformed data is linearly separable.]
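
A tiny sketch of this quadratic feature map; the two concentric circles of points are an illustrative dataset of my own, chosen so the transformed classes can be separated by a plane.

```python
import numpy as np

def phi(x):
    """Quadratic feature map: (x1, x2) -> (x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])

# points on a small circle (class A) and a large circle (class B), illustrative
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])          # radius 1
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])  # radius 3

# After the map, z1 + z3 = x1^2 + x2^2 equals the squared radius, so the two
# classes are separated by a plane in the transformed space (e.g., z1 + z3 = 4).
print(np.array([phi(x) for x in inner])[:, [0, 2]].sum(axis=1))    # all 1.0
print(np.array([phi(x) for x in outer])[:, [0, 2]].sum(axis=1))    # all 9.0
```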

General nonlinear basis functions

We can use a nonlinear mapping φ(x) : x ∈ R^D → z ∈ R^M, where M is the dimensionality of the new feature/input z (or φ(x)). M could be greater than, less than, or equal to D.

With the new features, we can apply our learning techniques to minimize errors on the transformed training data:
linear methods: prediction is based on w^T φ(x)
other methods: nearest neighbors, decision trees, etc.

Regression with nonlinear basis

Residual sum of squares:

Σ_n [w^T φ(x_n) - y_n]^2

where w ∈ R^M, the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix:

Φ = [ φ(x_1)^T ; φ(x_2)^T ; ... ; φ(x_N)^T ] ∈ R^{N×M},   w_LMS = (Φ^T Φ)^{-1} Φ^T y
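
A minimal sketch of LMS with a polynomial basis; the degree M, the sine-shaped target, and the noise level are illustrative, echoing the sine-fitting example that follows.

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^M] of the design matrix."""
    return np.vander(x, M + 1, increasing=True)   # shape (N, M+1)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # noisy sine samples

M = 3
Phi = poly_features(x, M)
w_lms = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)          # (Phi^T Phi)^{-1} Phi^T t
print(w_lms)                                             # coefficients w_0 ... w_M
```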

Example with regression

Polynomial basis functions:

φ(x) = [1, x, x^2, ..., x^M]^T,   f(x) = w_0 + Σ_{m=1}^M w_m x^m

Fitting samples from a sine function: underfitting, as f(x) is too simple.

[Figure: fits for M = 0 and M = 1 (x vs. t).]

Adding high-order terms

M = 3 fits well; M = 9 overfits.

[Figure: fits for M = 3 and M = 9 (x vs. t).]

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

Overfitting

Parameters for higher-order polynomials are very large:

        M = 0    M = 1    M = 3     M = 9
w_0     0.19     0.82     0.31      0.35
w_1              -1.27    7.99      232.37
w_2                       -25.43    -5321.83
w_3                       17.37     48568.31
w_4                                 -231639.30
w_5                                 640042.26
w_6                                 -1061800.52
w_7                                 1042400.18
w_8                                 -557682.99
w_9                                 125201.43

Overfitting can be quite disastrous

Fitting the housing price data with large M: predicted price goes to zero (and is ultimately negative) if you buy a big enough house! This is called poor generalization/overfitting.

Detecting overfitting

Plot model complexity versus the objective function:
X axis: model complexity, e.g., M
Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

[Figure: E_RMS on the training and test sets as a function of M.]

As a model increases in complexity:
Training error keeps improving
Test error may first improve but eventually will deteriorate
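
A minimal sketch that produces this kind of train/test curve for the polynomial-basis example; the split sizes, noise level, and the RMS definition sqrt(RSS/N) are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

def rms_error(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))          # sqrt(RSS / N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

# As M grows, training error keeps shrinking while test error eventually blows up.
for M in range(10):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)    # least-squares fit of degree M
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```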