Linear Regression. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 5, 2017.

The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. [Figure: scatter plot of Sale Price vs. # square feet.]

The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: Sale Price vs. # square feet with best linear fit.]

The regression problem in matrix notation. Stack the labels and features as $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$ and $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}$, so that $\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2 = \arg\min_w (y - Xw)^T (y - Xw)$.

The regression problem in matrix notation. $\widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w (y - Xw)^T (y - Xw)$.

The regression problem in matrix notation. $\widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. What about an offset? $(\widehat{w}_{LS}, \widehat{b}_{LS}) = \arg\min_{w,b} \sum_{i=1}^n (y_i - (x_i^T w + b))^2 = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$.
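To make the closed form concrete, here is a minimal NumPy sketch (the synthetic data, sizes, and seed are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative synthetic data (sizes, seed, and w_true are made up).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed form: w_LS = (X^T X)^{-1} X^T y. Solve the normal equations
# rather than forming the inverse explicitly (better numerically).
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, and more robust when X^T X is ill-conditioned:
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_ls)       # both are close to w_true
print(w_lstsq)
```

Solving the normal equations with `np.linalg.solve` (or using `np.linalg.lstsq`) is preferable to computing $(X^T X)^{-1}$ explicitly for numerical stability.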

Dealing with an offset. $(\widehat{w}_{LS}, \widehat{b}_{LS}) = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$.

Dealing with an offset. $(\widehat{w}_{LS}, \widehat{b}_{LS}) = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2$. Setting the gradient to zero yields the normal equations $X^T X \widehat{w}_{LS} + \widehat{b}_{LS} X^T \mathbf{1} = X^T y$ and $\mathbf{1}^T X \widehat{w}_{LS} + \widehat{b}_{LS} \mathbf{1}^T \mathbf{1} = \mathbf{1}^T y$. If $X^T \mathbf{1} = 0$ (i.e., if each feature is mean-zero), then $\widehat{w}_{LS} = (X^T X)^{-1} X^T y$ and $\widehat{b}_{LS} = \frac{1}{n} \sum_{i=1}^n y_i$.
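This suggests the usual centering recipe: subtract each feature's mean so that the centered $X^T \mathbf{1} = 0$, solve for $w$, and recover the offset from the means. A minimal sketch under those assumptions (the function and variable names are my own):

```python
import numpy as np

def least_squares_with_offset(X, y):
    """Fit y ~ Xw + b by centering, so that the centered X^T 1 = 0."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                            # now Xc^T 1 = 0
    w = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)   # w_LS on centered features
    b = y.mean() - x_mean @ w                  # offset recovered from the means
    return w, b
```

Centering decouples $w$ from $b$ exactly as on the slide, so the same $d \times d$ solve suffices and no explicit column of ones is needed.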

The regression problem in matrix notation. $\widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y$. But why least squares? Consider $y_i = x_i^T w + \epsilon_i$ where the $\epsilon_i$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Then $P(y \mid x, w, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right)$.

Maximizing log-likelihood. Maximize: $\log P(D \mid w, \sigma) = \log\!\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \prod_{i=1}^n e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}}\right]$.
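Expanding the logarithm (a standard step, not written out on the slide) makes the connection to least squares explicit:

$$\log P(D \mid w, \sigma) = n \log \frac{1}{\sigma\sqrt{2\pi}} \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T w)^2.$$

The first term does not depend on $w$, so maximizing the log-likelihood over $w$ is exactly minimizing $\sum_{i=1}^n (y_i - x_i^T w)^2$.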

MLE is LS under the linear model. $\widehat{w}_{LS} = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$ and $\widehat{w}_{MLE} = \arg\max_w P(D \mid w, \sigma)$. If $y_i = x_i^T w + \epsilon_i$ with $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, then $\widehat{w}_{LS} = \widehat{w}_{MLE} = (X^T X)^{-1} X^T y$.

The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: Sale Price vs. # square feet with best linear fit.]

The regression problem. Given past sales data on zillow.com, predict: y = house sale price from x = {# sq. ft., zip code, date of sale, etc.}. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. [Figure: Sale Price vs. date of sale with best linear fit.]

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. Transformed data: (developed on the next slide).

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Hypothesis: linear, $y_i \approx x_i^T w$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$. Transformed data: $h : \mathbb{R}^d \to \mathbb{R}^p$ maps the original features to a rich, possibly high-dimensional space. In $d = 1$: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}$. For $d > 1$, generate $\{u_j\}_{j=1}^p \subset \mathbb{R}^d$ and use, for example, $h_j(x) = \frac{1}{1 + \exp(u_j^T x)}$, $h_j(x) = (u_j^T x)^2$, or $h_j(x) = \cos(u_j^T x)$.

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. Hypothesis: linear in the transformed features, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$.
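Because the model is still linear in $w$, the same least-squares machinery applies with $h(x_i)$ in place of $x_i$. A minimal sketch using the slide's $h_j(x) = \cos(u_j^T x)$ features (the data, the Gaussian choice for the $u_j$, and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cosine_features(X, p, rng):
    """h_j(x) = cos(u_j^T x) for random directions u_j, per the slide."""
    d = X.shape[1]
    U = rng.normal(size=(d, p))   # the {u_j}; i.i.d. Gaussian is one common choice
    return np.cos(X @ U)          # row i is h(x_i)

# Toy 1-d data with a nonlinear trend (made up for illustration).
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

H = random_cosine_features(X, p=50, rng=rng)
w = np.linalg.lstsq(H, y, rcond=None)[0]    # least squares in feature space
print("train MSE:", np.mean((y - H @ w) ** 2))
```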

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: Sale Price vs. date of sale with best linear fit in the transformed features.]

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: Sale Price vs. date of sale with a small-$p$ fit.]

The regression problem. Training data: $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Transformed data: $h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix}$. Hypothesis: linear, $y_i \approx h(x_i)^T w$ with $w \in \mathbb{R}^p$. Loss: least squares, $\min_w \sum_{i=1}^n (y_i - h(x_i)^T w)^2$. [Figure: Sale Price vs. date of sale with a large-$p$ fit.] What's going on here?
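The next section explains what is going on; empirically, the effect is easy to reproduce (an illustrative sketch with made-up data: small $p$ underfits, while large $p$ drives the training error down as the error on fresh data grows):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)            # training data
x_new = rng.uniform(0, 1, size=1000)                            # fresh draws
y_new = np.sin(2 * np.pi * x_new) + 0.2 * rng.normal(size=1000)

for p in [1, 3, 20]:
    H = np.vander(x, p + 1)                     # features [x^p, ..., x, 1]
    w = np.linalg.lstsq(H, y, rcond=None)[0]
    train_mse = np.mean((y - H @ w) ** 2)
    new_mse = np.mean((y_new - np.vander(x_new, p + 1) @ w) ** 2)
    print(f"p = {p:2d}: train MSE = {train_mse:.3f}, fresh-data MSE = {new_mse:.3f}")
# Small p: both errors high (underfit). Large p: train error keeps
# shrinking while the fresh-data error grows (overfit).
```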

Bias-Variance Tradeoff. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 5, 2017.

Statistical learning. The data $(x, y)$ are modeled as draws from a joint distribution $P_{XY}(X = x, Y = y)$. [Figure: scatter of $y$ vs. $x$.]

Statistical learning. $P_{XY}(X = x, Y = y)$. Conditioning on an input gives a distribution over labels: $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$. [Figure: conditional label distributions at two inputs $x_0$ and $x_1$.]

Statistical learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{XY}[Y \mid X = x]$. [Figure: conditional distributions $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$.]

Statistical learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{XY}[Y \mid X = x]$. [Figure: the regression function $\eta(x)$ overlaid on the data.]

Statistical learning. $P_{XY}(X = x, Y = y)$. Ideally, we want to find: $\eta(x) = E_{XY}[Y \mid X = x]$. But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$. [Figure: sampled points.]

Statistical learning. Ideally, we want to find $\eta(x) = E_{XY}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute: $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$.

Statistical learning. Ideally, we want to find $\eta(x) = E_{XY}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute: $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. We care about future predictions: $E_{XY}[(Y - \hat{f}(X))^2]$.

Statistical learning. Ideally, we want to find $\eta(x) = E_{XY}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\hat{f}$.

Statistical learning. Ideally, we want to find $\eta(x) = E_{XY}[Y \mid X = x]$. But we only have samples $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$, and we are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\hat{f}$; averaging over draws gives $E_D[\hat{f}(x)]$. [Figure: fits from many draws of $D$ and their average $E_D[\hat{f}(x)]$.]

Bias-Variance Tradeoff. With $\eta(x) = E_{XY}[Y \mid X = x]$ and $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$: $E_{Y \mid X = x}[E_D[(Y - \hat{f}_D(x))^2]] = E_{Y \mid X = x}[E_D[(Y - \eta(x) + \eta(x) - \hat{f}_D(x))^2]]$.

Bias-Variance Tradeoff. $\eta(x) = E_{XY}[Y \mid X = x]$, $\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$.
$$\begin{aligned}
E_{XY}[E_D[(Y - \hat{f}_D(x))^2] \mid X = x] &= E_{XY}[E_D[(Y - \eta(x) + \eta(x) - \hat{f}_D(x))^2] \mid X = x] \\
&= E_{XY}[E_D[(Y - \eta(x))^2 + 2(Y - \eta(x))(\eta(x) - \hat{f}_D(x)) + (\eta(x) - \hat{f}_D(x))^2] \mid X = x] \\
&= \underbrace{E_{XY}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{E_D[(\eta(x) - \hat{f}_D(x))^2]}_{\text{learning error}}
\end{aligned}$$
The cross term vanishes because $E_{XY}[Y - \eta(x) \mid X = x] = 0$ and $Y$ is independent of the training draw $D$. The irreducible error is caused by stochastic label noise; the learning error is caused by either using too simple a model or not having enough data to learn the model accurately.

Bias-Variance Tradeoff. $E_D[(\eta(x) - \hat{f}_D(x))^2] = E_D[(\eta(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]$.

Bias-Variance Tradeoff.
$$\begin{aligned}
E_D[(\eta(x) - \hat{f}_D(x))^2] &= E_D[(\eta(x) - E_D[\hat{f}_D(x)] + E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2] \\
&= E_D[(\eta(x) - E_D[\hat{f}_D(x)])^2 + 2(\eta(x) - E_D[\hat{f}_D(x)])(E_D[\hat{f}_D(x)] - \hat{f}_D(x)) + (E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2] \\
&= \underbrace{(\eta(x) - E_D[\hat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]}_{\text{variance}}
\end{aligned}$$
Again the cross term vanishes, since $E_D[E_D[\hat{f}_D(x)] - \hat{f}_D(x)] = 0$.

Bias-Variance Tradeoff. $E_{XY}[E_D[(Y - \hat{f}_D(x))^2] \mid X = x] = \underbrace{E_{XY}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{(\eta(x) - E_D[\hat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]}_{\text{variance}}$. Model too simple: high bias, cannot fit the data well. Model too complex: high variance, small changes in the data change the learned function a lot.

Bias-Variance Tradeoff. $E_{XY}[E_D[(Y - \hat{f}_D(x))^2] \mid X = x] = \underbrace{E_{XY}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{(\eta(x) - E_D[\hat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{E_D[(E_D[\hat{f}_D(x)] - \hat{f}_D(x))^2]}_{\text{variance}}$.
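The decomposition can be sanity-checked numerically by redrawing $D$ many times, refitting, and estimating each term at a fixed query point. A hedged sketch (the true $\eta$, noise level, degree, and sizes are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3                               # stddev of the label noise
x0 = 0.5                                  # fixed query point x
p, n, trials = 3, 20, 2000                # poly degree, |D|, number of redraws

def eta(x):
    """Assumed true regression function eta(x) for this simulation."""
    return np.sin(2 * np.pi * x)

preds = np.empty(trials)
for t in range(trials):
    # Fresh draw of D, then least squares on polynomial features.
    x = rng.uniform(0, 1, size=n)
    y = eta(x) + sigma * rng.normal(size=n)
    w = np.linalg.lstsq(np.vander(x, p + 1), y, rcond=None)[0]
    preds[t] = (np.vander([x0], p + 1) @ w)[0]

bias_sq = (eta(x0) - preds.mean()) ** 2   # (eta(x0) - E_D[f_D(x0)])^2
variance = preds.var()                    # E_D[(E_D[f_D(x0)] - f_D(x0))^2]
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
      f"irreducible = {sigma**2:.4f}")
# The three estimates add up per the decomposition above.
```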

Overfitting. Machine Learning CSE546, Kevin Jamieson, University of Washington. Oct 5, 2017.

Bias-Variance Tradeoff. The choice of hypothesis class introduces learning bias: a more complex class means less bias, but a more complex class also means more variance. But in practice?

Bias-Variance Tradeoff. The choice of hypothesis class introduces learning bias: a more complex class means less bias but more variance. But in practice? Before, we saw how increasing the feature space can increase the complexity of the learned estimator: $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_3 \subseteq \cdots$, with $\hat{f}_D^{(k)} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2$. Complexity grows as $k$ grows.

Training set error as a function of model complexity. $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_3 \subseteq \cdots$, $\hat{f}_D^{(k)} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2$, with $D \overset{\text{i.i.d.}}{\sim} P_{XY}$. TRAIN error: $\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2$. TRUE error: $E_{XY}[(Y - \hat{f}_D^{(k)}(X))^2]$.

Training set error as a function of model complexity. $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$, $\hat{f}_D^{(k)} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2$, with $D \overset{\text{i.i.d.}}{\sim} P_{XY}$. TRAIN error: $\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2$. TRUE error: $E_{XY}[(Y - \hat{f}_D^{(k)}(X))^2]$. TEST error: with $T \overset{\text{i.i.d.}}{\sim} P_{XY}$, $\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2$. Important: $D \cap T = \emptyset$. [Figure: error vs. complexity $k$.]

Training set error as a function of model complexity. $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$, $\hat{f}_D^{(k)} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2$, with $D \overset{\text{i.i.d.}}{\sim} P_{XY}$. TRAIN error: $\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2$. TRUE error: $E_{XY}[(Y - \hat{f}_D^{(k)}(X))^2]$. TEST error: with $T \overset{\text{i.i.d.}}{\sim} P_{XY}$, $\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2$. Important: $D \cap T = \emptyset$. Each line is an i.i.d. draw of $D$ or $T$. [Figure: train and test error curves vs. complexity $k$; plot from Hastie et al.]

Training set error as a function of model complexity. $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$, $\hat{f}_D^{(k)} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - f(x_i))^2$, with $D \overset{\text{i.i.d.}}{\sim} P_{XY}$. TRAIN error: $\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2$. TRUE error: $E_{XY}[(Y - \hat{f}_D^{(k)}(X))^2]$. TEST error: with $T \overset{\text{i.i.d.}}{\sim} P_{XY}$, $\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2$, where $D \cap T = \emptyset$. TRAIN error is optimistically biased because it is evaluated on the data the model was trained on. TEST error is unbiased only if $T$ is never used to train the model or even to pick the complexity $k$.

Test set error. Given a dataset, randomly split it into two parts: training data $D$ and test data $T$, with $D \cap T = \emptyset$. Use the training data to learn the predictor, e.g., via $\frac{1}{|D|} \sum_{(x_i, y_i) \in D} (y_i - \hat{f}_D^{(k)}(x_i))^2$, and use the training data to pick the complexity $k$ (next lecture). Use the test data to report predicted performance: $\frac{1}{|T|} \sum_{(x_i, y_i) \in T} (y_i - \hat{f}_D^{(k)}(x_i))^2$.
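A minimal sketch of this workflow (the split helper and data are illustrative, not course-provided code):

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, rng=None):
    """Randomly split into disjoint training (D) and test (T) sets."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    test, train = perm[:n_test], perm[n_test:]
    return X[train], y[train], X[test], y[test]

# Usage: fit on D only; touch T only once, to report the final error.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)
X_D, y_D, X_T, y_T = train_test_split(X, y, rng=rng)
w = np.linalg.lstsq(X_D, y_D, rcond=None)[0]
print("test MSE:", np.mean((y_T - X_T @ w) ** 2))
```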

Overfitting. A learning algorithm overfits the training data if it outputs a solution $w$ when there exists another solution $w'$ such that the training error of $w$ is lower than that of $w'$, yet the true error of $w'$ is lower than that of $w$.

How many points do I use for training/testing? A very hard question to answer! With too few training points, the learned model is bad; with too few test points, you never know whether you have reached a good solution. Bounds such as Hoeffding's inequality can help (more on this later this quarter), but it is still hard to answer. Typically: if you have a reasonable amount of data, 90/10 splits are common; if you have little data, then you need to get fancy (e.g., bootstrapping).
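For intuition about test-set size (a standard calculation, not from the slides): for a loss bounded in $[0, 1]$, Hoeffding's inequality gives $P(|\widehat{\mathrm{err}}_T - \mathrm{err}| \geq \epsilon) \leq 2 e^{-2|T|\epsilon^2}$, which can be inverted to see what a given number of test points buys:

```python
import numpy as np

# Hoeffding: for a loss in [0, 1], with |T| i.i.d. test points,
# P(|test error - true error| >= eps) <= 2 exp(-2 |T| eps^2).
# Invert the bound at 95% confidence (delta = 0.05).
delta = 0.05
for n_test in [100, 1000, 10000]:
    eps = np.sqrt(np.log(2 / delta) / (2 * n_test))
    print(f"|T| = {n_test:5d}: test error within +/- {eps:.3f} of true error")
```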

Recap. Learning is:
- Collect some data, e.g., housing info and sale price.
- Randomly split the dataset into TRAIN and TEST, e.g., 80% and 20%, respectively.
- Choose a hypothesis class or model, e.g., linear.
- Choose a loss function, e.g., least squares.
- Choose an optimization procedure, e.g., set the derivative to zero to obtain the estimator.
- Justify the accuracy of the estimate, e.g., report TEST error.