Linear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com


Linear Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Regression. Given: data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, and corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$. [Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year (1970-2020), with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]

Linear Regression. Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$, where we assume $x_0 = 1$. Fit the model by minimizing the sum of squared errors. [Figures courtesy of Greg Shakhnarovich]

Least Squares Linear Regression. Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. Fit by solving $\min_\theta J(\theta)$.

Intuition Behind Cost Function. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$. Based on example by Andrew Ng

Intuition Behind Cost Function. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, with $x \in \mathbb{R}$ and $\theta = [\theta_0, \theta_1]$. [Figures: left, $h_\theta(x)$ plotted against the data (for fixed $\theta$, this is a function of $x$), $y$ vs. $x$ over $0 \le x \le 3$; right, the cost as a function of the parameter $\theta$, over $-0.5$ to $2.5$.] Based on example by Andrew Ng

Intuition Behind Cost Function. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, with $x \in \mathbb{R}$ and $\theta = [\theta_0, \theta_1]$. For the three training points $(1,1)$, $(2,2)$, $(3,3)$: $J([0, 0.5]) = \frac{1}{2 \cdot 3} \left[ (0.5-1)^2 + (1-2)^2 + (1.5-3)^2 \right] \approx 0.58$. [Figures: left, the hypothesis for fixed $\theta$ (a function of $x$); right, the cost as a function of the parameter $\theta$.] Based on example by Andrew Ng
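To make the arithmetic above concrete, here is a minimal Python check, assuming (as the numbers suggest) the three training points $(1,1)$, $(2,2)$, $(3,3)$; the function name `J` is just for illustration.

```python
import numpy as np

# Toy dataset inferred from the example above: (1, 1), (2, 2), (3, 3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta0, theta1):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = len(x)
    h = theta0 + theta1 * x          # hypothesis h_theta(x)
    return np.sum((h - y) ** 2) / (2 * n)

print(J(0.0, 0.5))   # ~0.583, matching the slide's ~0.58
print(J(0.0, 0.0))   # ~2.333, matching the next slide
print(J(0.0, 1.0))   # 0.0 (perfect fit)
```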

Intuition Behind Cost Function. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, with $x \in \mathbb{R}$ and $\theta = [\theta_0, \theta_1]$. For the same data, $J([0, 0]) \approx 2.333$. $J(\theta)$ is convex. [Figures: left, data and hypothesis; right, the cost as a function of $\theta_1$ over $-0.5$ to $2.5$.] Based on example by Andrew Ng

Intuition Behind Cost Function. [Sequence of figure-only slides by Andrew Ng, pairing the hypothesis $h_\theta(x)$ for a fixed $\theta$ (a function of $x$) with the corresponding value of $J(\theta_0, \theta_1)$ viewed as a function of the parameters $\theta$.]

Basic Search Procedure. Choose an initial value for $\theta$. Until we reach a minimum: choose a new value for $\theta$ to reduce $J(\theta)$. [Figure by Andrew Ng: surface plot of $J(\theta_0, \theta_1)$ over $(\theta_0, \theta_1)$.] Since the least squares objective function is convex, we don't need to worry about local minima.

Gradient Descent. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneous update for $j = 0 \ldots d$), where $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$. [Figure: $J(\theta)$ plotted over $-0.5$ to $2.5$.]

Gradient Descent. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneous update for $j = 0 \ldots d$). For linear regression:
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$
$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$
$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$

Gradient Descent for Linear Regression. Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (simultaneous update for $j = 0 \ldots d$). To achieve the simultaneous update, at the start of each GD iteration compute $h_\theta(x^{(i)})$ and use this stored value in the update-step loop. Assume convergence when $\| \theta_{\text{new}} - \theta_{\text{old}} \|_2 < \epsilon$, where the $L_2$ norm is $\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2} = \sqrt{\sum_i v_i^2}$.
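The update rule above translates directly into a short NumPy routine. The sketch below is illustrative rather than the slides' code: the function name `gradient_descent`, the toy data, and the tolerance `eps` are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for linear regression.

    X : (n, d+1) design matrix whose first column is all ones (x_0 = 1)
    y : (n,) vector of labels
    """
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        h = X @ theta                       # h_theta(x^(i)) for all i at once
        grad = (X.T @ (h - y)) / n          # dJ/dtheta_j for all j simultaneously
        theta_new = theta - alpha * grad    # simultaneous update
        if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

# Toy usage: data generated from y ≈ 2 + 3x with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2 + 3 * x + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])   # prepend x_0 = 1
print(gradient_descent(X, y))               # roughly [2, 3]
```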

Gradient Descent. [Sequence of slides by Andrew Ng showing gradient descent in action: on the left, the hypothesis $h_\theta(x)$ for the current $\theta$ (for fixed $\theta$, a function of $x$), starting from an initial hypothesis of roughly $h(x) = -900 - 0.1x$; on the right, $J(\theta_0, \theta_1)$ as a function of the parameters $\theta$, with the parameter values moving toward the minimum over the iterations.]

Choosing α. If α is too small: slow convergence. If α is too large: the value of $J(\theta)$ may increase with each iteration; gradient descent may overshoot the minimum, may fail to converge, or may even diverge. To see if gradient descent is working, print out $J(\theta)$ each iteration; the value should decrease at each iteration. If it doesn't, adjust α.
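As a rough diagnostic of whether α is reasonable, one can run a few iterations at several candidate learning rates and check that $J(\theta)$ actually decreases. The sketch below uses assumed toy data, and the specific α values tried are arbitrary.

```python
import numpy as np

# Illustrative diagnostic (not from the slides): try several learning rates and
# check whether J(theta) decreases over a fixed number of iterations.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2 + 3 * x + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

def J(theta):
    r = X @ theta - y
    return (r @ r) / (2 * n)

for alpha in [0.001, 0.05, 1.0, 5.0]:
    theta = np.zeros(2)
    costs = []
    for _ in range(100):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / n
        costs.append(J(theta))
    trend = "decreasing" if costs[-1] < costs[0] else "NOT decreasing -> reduce alpha"
    print(f"alpha={alpha:<6} final J={costs[-1]:.4f}  ({trend})")
```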

Extending Linear Regression to More Complex Models. The inputs X for linear regression can be: original quantitative inputs; transformations of quantitative inputs, e.g., log, exp, square root, square, etc.; polynomial transformations, e.g., $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$; basis expansions; dummy coding of categorical inputs; interactions between variables, e.g., $x_3 = x_1 \cdot x_2$. This allows the use of linear regression techniques to fit non-linear datasets.

Linear Basis Function Models. Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where the $\phi_j(x)$ are basis functions. Typically $\phi_0(x) = 1$, so that $\theta_0$ acts as a bias. In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$. Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models. Polynomial basis functions, $\phi_j(x) = x^j$: these are global; a small change in $x$ affects all basis functions. Gaussian basis functions, $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$: these are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width). Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models. Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$. These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope). Based on slide by Christopher Bishop (PRML)
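As an illustration of basis expansion, the sketch below builds a Gaussian-basis design matrix and fits it by least squares. The helper name `gaussian_basis`, the choice of 9 evenly spaced centers, and the width $s = 0.1$ are assumptions made for this example, not values from the slides.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Map scalar inputs x to Gaussian basis features phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).

    Returns an (n, len(centers) + 1) design matrix with a leading column of ones,
    since phi_0(x) = 1 acts as the bias.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    phi = np.exp(-((x - centers.reshape(1, -1)) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), phi])

# Illustrative usage: fit a nonlinear curve with a model that is linear in the features.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

centers = np.linspace(0, 1, 9)      # mu_j: evenly spaced centers (a design choice)
Phi = gaussian_basis(x, centers, s=0.1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least squares fit in feature space
print(np.round(theta, 2))
```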

Example of Fitting a Polynomial Curve with a Linear Model: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
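A minimal sketch of such a polynomial fit, using made-up data and an explicit design matrix of powers of $x$ (the data-generating coefficients below are illustrative assumptions):

```python
import numpy as np

# Fit y = sum_j theta_j x^j by ordinary least squares on the design matrix [1, x, x^2, ..., x^p].
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 0.5 - 2.0 * x + 1.5 * x**3 + 0.1 * rng.normal(size=x.shape)   # made-up cubic data

p = 3
X_poly = np.column_stack([x**j for j in range(p + 1)])   # columns: x^0, x^1, ..., x^p
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(theta, 2))    # roughly [0.5, -2.0, 0.0, 1.5]
```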

Linear Basis Function Models. Basic linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$. Generalized linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$. Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model, unless we use the kernel trick (more on that when we cover support vector machines). Therefore, there is no point in cluttering the math with basis functions. Based on slide by Geoff Hinton

Linear Algebra Concepts. A vector in $\mathbb{R}^d$ is an ordered set of $d$ real numbers; e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$. $[1, 6, 3, 4]$ written as a column vector is $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$, as opposed to the row vector $(1\ 6\ 3\ 4)$. An $m$-by-$n$ matrix is an object with $m$ rows and $n$ columns, where each entry is a real number, e.g., $\begin{pmatrix} 1 & 2 & 8 \\ 4 & 78 & 6 \\ 9 & 3 & 2 \end{pmatrix}$. Based on slides by Joseph Bradley

Linear Algebra Concepts. Transpose: reflect the vector/matrix across the main diagonal: $\begin{pmatrix} a \\ b \end{pmatrix}^T = (a\ b)$, and $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$. Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon). Vector norms: the $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$. Common norms: $L_1$, $L_2$; $L_\infty = \max_i |v_i|$. The length of a vector $v$ is $L_2(v)$. Based on slides by Joseph Bradley

Linear Algebra Concepts. Vector dot product: $u \cdot v = (u_1\ u_2) \cdot (v_1\ v_2) = u_1 v_1 + u_2 v_2$. Note: the dot product of $u$ with itself is $\text{length}(u)^2 = \|u\|_2^2$. Matrix product: for $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$, $AB = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}$. Based on slides by Joseph Bradley

Linear Algebra Concepts. Vector products. Dot product: $u^T v = (u_1\ u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$. Outer product: $u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1\ v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$. Based on slides by Joseph Bradley
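These identities are easy to verify numerically; the snippet below (not part of the original slides) checks them with NumPy on small example vectors and matrices.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(u @ v)                         # dot product: u1*v1 + u2*v2 = 11
print(u @ u, np.linalg.norm(u)**2)   # u.u equals ||u||_2^2
print(np.outer(u, v))                # outer product u v^T (a 2x2 matrix)
print(A @ B)                         # matrix product
print((A @ u).T, u.T @ A.T)          # (Ax)^T = x^T A^T (same result)
print(A.T)                           # transpose: reflect across the main diagonal
```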

Vectorization. Benefits of vectorization: more compact equations, and faster code (using optimized matrix libraries). Consider our model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$. Let $\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{pmatrix}$ and $x = \begin{pmatrix} 1 & x_1 & \ldots & x_d \end{pmatrix}^T$. We can write the model in vectorized form as $h_\theta(x) = \theta^T x$.

Vectorization. Consider our model for $n$ instances: $h_\theta\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$. Let $\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{pmatrix} \in \mathbb{R}^{(d+1) \times 1}$ and $X = \begin{pmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & & & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & & & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{pmatrix} \in \mathbb{R}^{n \times (d+1)}$. We can write the model in vectorized form as $h_\theta(X) = X\theta$.

Vectorization. Let $y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}$. For the linear regression cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^T (X\theta - y)$, where $(X\theta - y)^T \in \mathbb{R}^{1 \times n}$, $(X\theta - y) \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times (d+1)}$, and $\theta \in \mathbb{R}^{(d+1) \times 1}$.
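As a sanity check on the vectorized form, the sketch below (with randomly generated data, chosen only for illustration) confirms that the loop form and the matrix form of $J(\theta)$ agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # n x (d+1), first column is x_0 = 1
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

# Loop form: J(theta) = 1/(2n) * sum_i (theta^T x^(i) - y^(i))^2
J_loop = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

# Vectorized form: J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_vec = (r @ r) / (2 * n)

print(J_loop, J_vec)   # identical up to floating point
```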

Closed Form Solution. Instead of using GD, solve for the optimal $\theta$ analytically. Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$. Derivation: $J(\theta) = \frac{1}{2n} (X\theta - y)^T (X\theta - y) \propto \theta^T X^T X \theta - y^T X \theta - \theta^T X^T y + y^T y \propto \theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y$. Take the derivative and set it equal to 0, then solve for $\theta$: $\frac{\partial}{\partial \theta} \left( \theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y \right) = 0 \;\Rightarrow\; (X^T X)\theta - X^T y = 0 \;\Rightarrow\; (X^T X)\theta = X^T y$. Closed form solution: $\theta = (X^T X)^{-1} X^T y$.

Closed Form Solution. Can obtain $\theta$ by simply plugging $X$ and $y$ into $\theta = (X^T X)^{-1} X^T y$, with $X = \begin{pmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & & & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & & & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{pmatrix}$ and $y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}$. If $X^T X$ is not invertible (i.e., singular), you may need to: use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a)); remove redundant (not linearly independent) features; or remove extra features to ensure that $d \le n$.
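A minimal sketch of the closed form on toy data: `np.linalg.pinv` is the pseudo-inverse mentioned above, and `np.linalg.solve` is shown as a commonly preferred numerical alternative (the data-generating coefficients are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 1, size=n)
y = 2 + 3 * x + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])        # design matrix with x_0 = 1

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^{-1} X^T y; pinv is safer if X^T X is singular
print(np.round(theta, 2))                   # roughly [2, 3]

# Equivalent and usually preferable numerically: solve the normal equations directly.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(theta_solve, 2))
```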

Gradient Descent vs. Closed Form. Gradient descent: requires multiple iterations; need to choose α; works well when n is large; can support incremental learning. Closed form solution: non-iterative; no need for α; slow if the number of features d is large, since computing $(X^T X)^{-1}$ is roughly $O(d^3)$ (plus $O(n d^2)$ to form $X^T X$).

Improving Learning: Feature Scaling. Idea: ensure that features have similar scales. This makes gradient descent converge much faster. [Figures: "Before Feature Scaling" and "After Feature Scaling" plots, both axes ranging from 0 to 20.]

Feature Standardization. Rescale features to have zero mean and unit variance. Let $\mu_j$ be the mean of feature $j$: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$. Replace each value with $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$ for $j = 1 \ldots d$ (not $x_0$!), where $s_j$ is the standard deviation of feature $j$. Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$. Must apply the same transformation to instances for both training and prediction. Outliers can cause problems.
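A short sketch of standardization on made-up data (the feature means and scales below are arbitrary), emphasizing the point above that the training-set $\mu_j$ and $s_j$ must be reused at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features on very different scales (illustrative values)
X_train = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))
X_test = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(20, 3))

mu = X_train.mean(axis=0)          # mu_j, one per feature (the bias column x_0 is excluded)
s = X_train.std(axis=0)            # s_j: standard deviation of feature j

X_train_std = (X_train - mu) / s   # zero mean, unit variance on the training set
X_test_std = (X_test - mu) / s     # SAME transformation applied at prediction time

print(np.round(X_train_std.mean(axis=0), 3))  # ~[0, 0, 0]
print(np.round(X_train_std.std(axis=0), 3))   # ~[1, 1, 1]
```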

Quality of Fit. [Figures: Price vs. Size, fit three ways: underfitting (high bias), correct fit, and overfitting (high variance).] Overfitting: the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$) but fails to generalize to new examples. Based on example by Andrew Ng

Regularization. A method for automatically controlling the complexity of the learned hypothesis. Idea: penalize large values of $\theta_j$; this can be incorporated into the cost function. Works well when we have a lot of features, each of which contributes a bit to predicting the label. Overfitting can also be addressed by eliminating features (either manually or via model selection).

Regularization. Linear regression objective function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$, where the first term is the model fit to the data and the second term is the regularization. $\lambda$ is the regularization parameter ($\lambda \ge 0$). No regularization on $\theta_0$!

Understanding Regularization. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$; this is the squared magnitude of the feature coefficient vector. We can also think of it as $\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$, so $L_2$ regularization pulls the coefficients toward 0.

Understanding Regularization. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)? The penalty dominates, so $\theta_1, \ldots, \theta_d$ are all driven to approximately 0 and the fit reduces to the constant $h_\theta(x) \approx \theta_0$. [Figure: Price vs. Size, with the fitted curve flattening to a horizontal line.] Based on example by Andrew Ng

Regularized Linear Regression. Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$. Fit by solving $\min_\theta J(\theta)$. Gradient update (from $\frac{\partial}{\partial \theta_0} J(\theta)$ and $\frac{\partial}{\partial \theta_j} J(\theta)$): $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$ and, for $j \ge 1$, $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$, where the last term comes from the regularization.

Regularized Linear Regression. $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$, with updates $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$ and $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$. We can rewrite the gradient step for $j \ge 1$ as: $\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$.
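The rewritten update maps to code as follows. This is an illustrative sketch rather than the slides' implementation: the function name, λ, α, iteration count, and toy data are all assumptions.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, alpha=0.05, iters=5000):
    """Gradient descent for L2-regularized linear regression.

    Implements theta_j <- theta_j*(1 - alpha*lam) - alpha/n * sum_i (h - y) x_j,
    with no shrinkage applied to theta_0 (the bias).
    """
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / n     # unregularized gradient, all j at once
        shrink = np.full(d_plus_1, 1.0 - alpha * lam)
        shrink[0] = 1.0                        # do NOT regularize theta_0
        theta = theta * shrink - alpha * grad
    return theta

# Illustrative usage on toy data generated from y ≈ 2 + 3x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2 + 3 * x + 0.1 * rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
print(np.round(ridge_gradient_descent(X, y, lam=0.1), 2))   # mildly shrunk relative to [2, 3]
```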

Regularized Linear Regression. To incorporate regularization into the closed form solution: $\theta = \left( X^T X + \lambda \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \right)^{-1} X^T y$. Can derive this the same way as before, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$. Can prove that for $\lambda > 0$, the inverse in the equation above always exists.
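A minimal sketch of the regularized closed form on toy data (the data and λ are assumptions for the example); note the 0 in the top-left of the regularization matrix so that $\theta_0$ is left unpenalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 1, size=n)
y = 2 + 3 * x + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # design matrix with x_0 = 1

lam = 0.1
d_plus_1 = X.shape[1]
reg = np.eye(d_plus_1)
reg[0, 0] = 0.0                                # no penalty on the bias term theta_0

theta = np.linalg.solve(X.T @ X + lam * reg, X.T @ y)   # (X^T X + lam*diag(0,1,...,1))^{-1} X^T y
print(np.round(theta, 2))                      # close to [2, 3], mildly shrunk by lambda
```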