Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan


Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in

Outline: univariate regression, multivariate regression, probabilistic view of regression, loss functions, bias-variance analysis, regularization.

Example - Green Chilies Entertainment Company. [Scatter plot: earnings from the film (in crores of Rs) against the cost of making the film (in crores of Rs).]

Notations. Training dataset with N examples; input variable x_i; target variable y_i. Goal: learn a function that predicts y for a new input x. [Example table with two columns - Cost of film (crores of Rs), x: 98.28, 199.69, 40.22, 93.69, 62.07, 100.33; Profit/Loss (crores of Rs), y.]

Linear Regression. Simplest form: f(x) = w_0 + w_1 x. [Scatter plot with fitted line: earnings from the film vs. cost of making the film (in crores of Rs).]

Least Mean Squares - Cost Function. Choose the parameters w_0 and w_1 (collectively w) so that f(x) is as close as possible to y. [Scatter plot: earnings from the film vs. cost of making the film (in crores of Rs).]

Least Mean Squares - Cost Function - Parameter Space (1). Let J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2.

Least Mean Squares - Cost Function - Parameter Space (2). Let J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2.

Least Mean Squares - Cost Function - Parameter Space (3). Let J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2.

Plot of the Error Surface

Contour Plot of the Error Surface

Estimating Optimal Parameters

Gradient Descent - Basic Principle. Minimize J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2. Start with an initial estimate for w. Keep changing w so that J(w) is progressively reduced. Stop when w no longer changes, i.e., when a minimum has been reached.

Gradient Descent - Intuition

Effect of the Learning Parameter α. Too small a value leads to slow convergence; too large a value makes the updates oscillate widely and may prevent convergence.

Gradient Descent - Local Minima. Depending on the function J(w), gradient descent can get stuck at local minima.

Gradient Descent for Regression. The error function J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2 is convex: geometrically, the error surface is bowl shaped, so there is only a global minimum. Exercise: prove that the sum of squared errors is a convex function.

Parameter Update (1). Minimize J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2.

Parameter Update (2). Repeat until convergence: w_0 := w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i), \quad w_1 := w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i) x_i.
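As a concrete illustration of this update rule, here is a minimal NumPy sketch (not from the slides; the data arrays x, y and the learning rate alpha are assumed to be given):

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for f(x) = w0 + w1*x with the 1/(2N) squared-error cost."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_iters):
        residual = (w0 + w1 * x) - y           # f(x_i) - y_i for all i
        w0 -= alpha * residual.mean()          # (1/N) * sum of residuals
        w1 -= alpha * (residual * x).mean()    # (1/N) * sum of residuals * x_i
    return w0, w1
```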

Example - Iterations 0, 1, 2, 4, 7, 9. [Plots of the regression function and the error function after each of these gradient-descent iterations.]

Gradient Descent - Batch Mode. Each update includes the contribution of all data points: w_0 := w_0 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i), \quad w_1 := w_1 - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i) x_i. We will discuss stochastic gradient descent later (neural networks).

Multivariate Linear Regression. Dimension of the input data - D.

Cost of Film (Crores of Rs) | Celebrity status of the protagonist | # of theatres at release | Age of the protagonist | Earnings (Crores of Rs) - y
75.72 | 7.57 | 32 | 52 | 157.39
18.74 | 1.87 | 16 | 68 | 81.93
50.96 | 5.09 | 27 | 35 | 131.95

Multivariate Linear Regression - Formulation. Simplest model: f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D. Parameters to learn: w_0, w_1, \dots, w_D = w. Cost function: J(w) = \frac{1}{2N} \sum_{i=1}^{N} (f(x_i) - y_i)^2. Update equation: w_j := w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i) x_{ij}.

Gradient Descent. Parameter update equation: w_j := w_j - \alpha \frac{1}{N} \sum_{i=1}^{N} (f(x_i) - y_i) x_{ij}.
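The same per-coordinate update can be carried out for all weights at once; a small vectorized sketch, assuming the design matrix X already includes a leading column of ones (names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for f(x) = X @ w; X is N x (D+1) with a bias column of ones."""
    N, D1 = X.shape
    w = np.zeros(D1)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / N    # (1/N) * sum_i (f(x_i) - y_i) * x_i
        w -= alpha * grad
    return w
```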

Feature Scaling for Multivariate Linear Regression (1). The features in the example table above (cost of film, celebrity status, # of theatres at release, age of the protagonist) are measured on very different scales; transform the features to be of the same scale.

Feature Scaling for Multivariate Linear Regression (2). Normalization: rescale each feature x_d so that -1 ≤ x_d ≤ 1 or 0 ≤ x_d ≤ 1. Standardization: rescale each feature to have mean 0 and standard deviation 1.
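A brief sketch of both rescaling options, assuming the features are the columns of a NumPy array (the exact conventions, e.g. population standard deviation, are my choice):

```python
import numpy as np

def standardize(X):
    """Zero mean, unit standard deviation for each feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize(X):
    """Rescale each feature column to the range [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```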

Multivariate Linear Regression - Analytical Solution. Design matrix and target vector for the example data above: X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}.

Least Squares Method. f(X) = Xw = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{pmatrix} \approx \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} = Y, \qquad J(w) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i) - y_i)^2.

Normal Equations. \min_w J(w) = \min_w \frac{1}{2} (Xw - Y)^\top (Xw - Y). Find the gradient with respect to w and equate it to 0.
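Carrying out that step explicitly (the standard derivation, written out here for completeness): \nabla_w J(w) = X^\top (Xw - Y) = 0 \;\Rightarrow\; X^\top X \, w = X^\top Y \;\Rightarrow\; w = (X^\top X)^{-1} X^\top Y.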

Analytical Solution. Advantages: no need for the learning parameter α, and no iterative updates. Disadvantage: requires a matrix inversion - the pseudo-inverse of X, (X^\top X)^{-1} X^\top. Sometimes we have to deal with non-invertible matrices (e.g., redundant features).
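A sketch of the closed-form fit in NumPy; np.linalg.lstsq solves the least-squares problem via the pseudo-inverse, which sidesteps an explicit inversion and copes with the non-invertible case mentioned above (the function name is my own):

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least-squares fit; X is the N x (D+1) design matrix with a bias column."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_w ||Xw - y||^2 via the pseudo-inverse
    return w
```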

Probabilistic View of Linear Regression (1). Let y = f(x) + ε, where ε is an error term that captures unmodeled effects or random noise, and ε ~ N(0, σ^2) (a Gaussian distribution).

Probabilistic View of Linear Regression (2). Let y = f(x) + ε with ε ~ N(0, σ^2). Why a Gaussian? N(0, σ^2) has maximum entropy among all real-valued distributions with a specified variance σ^2. [3-σ rule illustration.]

Probabilistic View of Linear Regression (3). Let y = f(x) + ε with ε ~ N(0, σ^2). Then P(ε) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right) and P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f(x))^2}{2\sigma^2}\right). [Figure: the Gaussian conditional density p(t | x_0) centred on the regression curve y(x) at an input x_0.]

Probabilistic View of Linear Regression (4). Assuming the training examples are drawn independently, P(y_1, \dots, y_N \mid x_1, \dots, x_N; w) = \prod_{i=1}^{N} P(y_i \mid x_i; w).

Maximizing the Likelihood. Maximize L(w) = \prod_{i=1}^{N} P(y_i \mid x_i; w).
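Taking the logarithm turns the product into a sum; under the Gaussian noise model this reduces to least squares (the standard argument, sketched here): \log L(w) = \sum_{i=1}^{N} \log P(y_i \mid x_i; w) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(x_i))^2, so maximizing the likelihood in w is equivalent to minimizing the sum of squared errors.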

Loss Functions. Squared loss: (f(x) - y)^2. Absolute loss: |f(x) - y|. Dead band loss: \max(0, |f(x) - y| - \varepsilon), with \varepsilon \in \mathbb{R}^{+}.

Loss Functions. Problem with the squared loss: large residuals (outliers) are penalized quadratically, so a few outlying points can dominate the fit.

Linear Regression with the Absolute Loss Function. Objective: \min_w \sum_{i=1}^{N} |x_i^\top w - y_i|. This is non-differentiable, so we cannot directly take the gradient descent approach. Solution: frame it as a constrained optimization (linear programming) problem. Introduce new variables v \in \mathbb{R}^{N} with v_i \ge |x_i^\top w - y_i|: \min_{w, v} \sum_{i=1}^{N} v_i, subject to -v_i \le x_i^\top w - y_i \le v_i.
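A sketch of this linear program using scipy.optimize.linprog, with the decision vector stacked as [w, v] (all function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def fit_absolute_loss(X, y):
    """Least-absolute-deviation regression via an LP: min sum_i v_i s.t. -v_i <= x_i.w - y_i <= v_i."""
    N, D1 = X.shape
    c = np.concatenate([np.zeros(D1), np.ones(N)])     # objective: sum of the slack variables v
    A_ub = np.block([[ X, -np.eye(N)],                  #  Xw - v <= y
                     [-X, -np.eye(N)]])                 # -Xw - v <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * D1 + [(0, None)] * N      # w free, v >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:D1]                                   # the weight vector w
```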

Linear Regression with the Absolute Loss Function - Example. [Plot comparing the LMS (squared-loss) fit with the LP (absolute-loss) fit.]

Some Additional Notations. Underlying response function (target concept): C. Actual observed response: y = C(x) + ε, with ε ~ N(0, σ^2), so E[y \mid x] = C(x). Predicted response based on the model learned from a dataset \mathcal{D}: f(x; \mathcal{D}). Expected response averaged over all datasets: \bar{f}(x) = E_{\mathcal{D}}[f(x; \mathcal{D})]. Expected L_2 error on a new test instance x: E_{err} = E_{\mathcal{D}}[(f(x; \mathcal{D}) - y)^2].

Bias-Variance Analysis (1)

Bias-Variance Analysis (2)

Bias-Variance Analysis (3). Root Mean Square Error.

Bias-Variance Analysis (4). 9th-degree polynomial fit with more sample data.

Bias-Variance Analysis (5). Expected squared loss: E[L] = \iint (f(x) - y)^2 \, p(x, y) \, dx \, dy.

Bias-Variance Analysis (6). Expected squared loss: E[L] = \iint (f(x) - y)^2 \, p(x, y) \, dx \, dy.

Bias-Variance Analysis (7). Relevant part of the loss: \int (f(x) - C(x))^2 \, p(x) \, dx.

Bias-Variance Analysis (8). Relevant part of the loss: E_{\mathcal{D}}[(f(x; \mathcal{D}) - C(x))^2].
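Adding and subtracting the average predictor \bar{f}(x) = E_{\mathcal{D}}[f(x; \mathcal{D})] splits this term into the two pieces discussed on the next slides (the cross term vanishes in expectation): E_{\mathcal{D}}[(f(x; \mathcal{D}) - C(x))^2] = (E_{\mathcal{D}}[f(x; \mathcal{D})] - C(x))^2 + E_{\mathcal{D}}[(f(x; \mathcal{D}) - E_{\mathcal{D}}[f(x; \mathcal{D})])^2], i.e., bias^2 + variance.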

Bias-Variance Analysis (9). [Plots of fits obtained on different datasets: degree = 1 vs. degree = 4.]

Bias-Variance Analysis (10). Bias term of the error: (E_{\mathcal{D}}[f(x; \mathcal{D})] - C(x))^2. It measures how well our approximation architecture can fit the data. Weak approximators (e.g., low-degree polynomials) have high bias; strong approximators (e.g., high-degree polynomials) have low bias.

Bias-Variance Analysis (11). Variance term of the error: E_{\mathcal{D}}[(f(x; \mathcal{D}) - E_{\mathcal{D}}[f(x; \mathcal{D})])^2]. It has no direct dependence on the target value. For a fixed-size dataset \mathcal{D}: strong approximators tend to have more variance (small changes in the dataset can result in widely different predictors), while weak approximators tend to have less variance (small changes in the dataset result in similar predictors). The variance disappears as |\mathcal{D}| \to \infty.

Bias-Variance Analysis (12). Measuring bias and variance in practice: bootstrap from the given dataset. Start with a complex approximator and reduce its complexity through regularization (setting more coefficients/parameters to 0) or feature selection. This reduces variance but can increase bias - hopefully the model remains just sufficient for the given data.
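A minimal sketch of the bootstrap recipe, assuming generic fit(X, y) and predict(model, X_test) callables and noise-free targets C_test at the test points (all of these names are hypothetical placeholders):

```python
import numpy as np

def bootstrap_bias_variance(X, y, X_test, C_test, fit, predict, n_boot=100, seed=0):
    """Estimate squared bias and variance of a learner at the test points X_test.

    C_test holds the noise-free target values C(x) at X_test; fit/predict are
    placeholders for whatever regression routine is being analysed."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))      # resample the dataset with replacement
        preds.append(predict(fit(X[idx], y[idx]), X_test))
    preds = np.array(preds)                             # shape: (n_boot, n_test)
    mean_pred = preds.mean(axis=0)                      # estimate of E_D[f(x; D)]
    bias_sq = np.mean((mean_pred - C_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance
```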

Regularization. Central idea: penalize over-complicated solutions. Linear regression minimizes \sum_{i=1}^{N} (x_i^\top w - y_i)^2; regularized regression minimizes \sum_{i=1}^{N} (x_i^\top w - y_i)^2 + \lambda \lVert w \rVert^2.

Modified Solution. Ordinary linear regression: \min_w J(w) = \min_w \frac{1}{2} (Xw - Y)^\top (Xw - Y), with solution w = (X^\top X)^{-1} X^\top Y. The regularized version with the L_2 norm (ridge regression): \min_w J(w) = \min_w \frac{1}{2} (Xw - Y)^\top (Xw - Y) + \lambda \lVert w \rVert^2, with solution w = (X^\top X + \lambda I)^{-1} X^\top Y. Exercise: derive the closed-form solution for ridge regression with the L_2 regularizer.
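A direct NumPy sketch of the ridge solution above (whether to exclude the bias weight w_0 from the penalty is a design choice; this sketch penalizes all weights, exactly as in the formula):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lambda * I)^{-1} X^T y."""
    D1 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D1), X.T @ y)
```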

How to choose λ? There is a tradeoff between complexity and goodness of fit. Solution 1: with lots of data, generate multiple models and use plenty of test data to discard the bad ones. Solution 2: with limited data, use k-fold cross validation (to be discussed later).

General Form of the Regularizer Term. \sum_{i=1}^{N} (x_i^\top w - y_i)^2 + \lambda \sum_{d=1}^{D} |w_d|^q. The quadratic/L_2 regularizer corresponds to q = 2. [Contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4.]

Special scenario q = 1 - LASSO (Least Absolute Shrinkage and Selection Operator). Error function: \sum_{i=1}^{N} (x_i^\top w - y_i)^2 + \lambda \sum_{d=1}^{D} |w_d|. For sufficiently large λ many of the coefficients become 0, resulting in a sparse solution. [Figure: L_1 vs. L_2 constraint regions in the (w_1, w_2) plane intersecting the error contours.]

LASSO. Quadratic programming can be used to solve the optimization problem; see also the Least Angle Regression solution (refer to ESL). http://web.stanford.edu/~hastie/glmnet_matlab/ - MATLAB packages for LASSO.
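For quick experimentation, scikit-learn's coordinate-descent solver shows the same shrinkage-to-zero behaviour; a small sketch on synthetic data (alpha plays the role of λ, up to scikit-learn's scaling of the squared-error term):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data in which only the first two features are truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.5).fit(X, y)
print("weights:", np.round(model.coef_, 2))             # most entries driven to exactly 0
print("non-zero features:", np.flatnonzero(model.coef_))
```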

Linear Regression with Non-Linear Basis Functions. Use a linear combination of fixed non-linear functions of the input variables: f(x) = w_0 + \sum_{d=1}^{D} w_d \phi_d(x). [Figure: examples of fixed non-linear basis functions.]

Linear Regression with Basis Functions - Solution. f(X) = \Phi(X) w \approx Y, where \Phi(X) = \begin{pmatrix} 1 & \phi_1(x_1) & \cdots & \phi_D(x_1) \\ 1 & \phi_1(x_2) & \cdots & \phi_D(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \cdots & \phi_D(x_N) \end{pmatrix}, and the least-squares solution is w = (\Phi(X)^\top \Phi(X))^{-1} \Phi(X)^\top Y.
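A sketch using a polynomial basis as the fixed non-linear functions φ_d (a Gaussian or sigmoidal basis would be slotted into the same design matrix in the same way; function names are illustrative):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a 1-D input array to the basis [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_basis_regression(x, y, degree=4):
    """Least squares on top of the fixed non-linear basis: w = (Phi^T Phi)^{-1} Phi^T y."""
    Phi = polynomial_design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```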

Linear Regression with Multiple Outputs. With several outputs per example, stack the targets into an N × K matrix Y and the parameters into a (D+1) × K matrix W, so that f(X) = XW ≈ Y. The least-squares solution is W = (X^\top X)^{-1} X^\top Y, i.e., each output column is fit independently.

Summary. Linear regression (aka curve fitting): gradient descent and the analytical solution, loss functions, a probabilistic view of linear regression, bias-variance analysis, regularization (ridge regression), regression with basis functions, and locally weighted regression (refer ML 8.3).