COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont'd)

1 COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont'd) Instructor: Herke van Hoof Slides mostly by: Class web page: Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructor's written permission.

2 What we saw last time Definition and characteristics of a supervised learning problem. Linear regression (hypothesis class, cost function, algorithm). Closed-form least-squares solution method (algorithm, computational complexity, stability issues). Alternative optimization methods via gradient descent. 2

3 Predicting recurrence time from tumor size This function looks complicated, and a linear hypothesis does not seem very good. What should we do? 3

4 Predicting recurrence time from tumor size This function looks complicated, and a linear hypothesis does not seem very good. What should we do? Pick a better function? Use more features? Get more data? 4

5 Input variables for linear regression Original quantitative variables X_1, ..., X_m. Transformations of variables, e.g. X_{m+1} = log(X_i). Basis expansions, e.g. X_{m+1} = X_i^2, X_{m+2} = X_i^3, ... Interaction terms, e.g. X_{m+1} = X_i X_j. Numeric coding of qualitative variables, e.g. X_{m+1} = 1 if X_i is true and 0 otherwise. In all cases, we can add X_{m+1}, ..., X_{m+k} to the list of original variables and perform the linear regression. 5
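
As a hedged illustration (the variable names and data below are made up, not from the slides), this is how such derived inputs might be appended to a design matrix before running ordinary linear regression:

```python
import numpy as np

# Illustrative only: x1, x2 are two original quantitative features,
# x3 is a boolean (qualitative) feature.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
x3 = rng.random(n) > 0.5

X = np.column_stack([
    np.ones(n),       # bias term w_0
    x1, x2,           # original variables
    np.log(x1),       # transformation of a variable
    x1**2, x1**3,     # basis expansion (polynomial terms)
    x1 * x2,          # interaction term
    x3.astype(float)  # numeric coding of a qualitative variable
])
# X can now be used as the input matrix of an ordinary linear regression.
```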

6 Example of linear regression with polynomial terms: f_w(x) = w_0 + w_1 x + w_2 x^2, with a design matrix X containing columns x^2 and x (plus the bias column) and a target vector Y.

7 Solving the problem: w = (X^T X)^{-1} X^T Y. So the best order-2 polynomial is y = 0.68 x^2 + ..., compared to y = 1.6 x + ... for the order-1 polynomial. 7
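
Since the slide's numeric values did not survive transcription, here is a hedged sketch of the same computation on made-up data: build the design matrix with a squared column and apply the closed-form least-squares solution.

```python
import numpy as np

# Hypothetical data standing in for the slide's (lost) example values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 8.7, 16.1, 24.8])

# Design matrix with columns [1, x, x^2] for f_w(x) = w0 + w1*x + w2*x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Closed-form least-squares solution w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # [w0, w1, w2]
```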

8 Order-3 fit: Is this better? 8

9 Order-4 fit 9

10 Order-5 fit 10

11 Order-6 fit 11

12 Order-7 fit 12

13 Order-8 fit 13

14 Order-9 fit 14

15 This is overfitting! 15

16 This is overfitting! We can find a hypothesis that explains the training data perfectly, but does not generalize well to new data. In this example: we have a lot of parameters (weights), so the hypothesis matches the data points exactly, but is wild everywhere else. A very important problem in machine learning. 16

17 Overfitting Every hypothesis has a true error measured on all possible data items we could ever encounter (e.g. f_w(x_i) - y_i). Since we don't have all possible data, in order to decide what is a good hypothesis, we measure error over the training set. Formally: Suppose we compare hypotheses f_1 and f_2. Assume f_1 has lower error on the training set. If f_2 has lower true error, then our algorithm is overfitting. 17

18 Overfitting Which hypothesis has the lowest true error? (Plots of the fitted polynomials for d = 1 through d = 8.) 18

19 Cross-Validation Partition your data into a Training Set and a Validation Set. The proportions in each set can vary. Use the Training Set to find the best hypothesis in the class. Use the Validation Set to evaluate the true prediction error. Compare across different hypothesis classes (different order polynomials). (Illustration: the rows of X and Y are split into a Train portion and a Validate portion.)
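
A minimal sketch of this recipe, with made-up data and an arbitrary split proportion, comparing polynomial degrees on a held-out validation set:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 60)
y = 2 * np.sin(0.5 * x) + rng.normal(0, 0.5, x.size)  # noisy toy data

# Hold out the last third as a validation set (the proportion can vary).
n_train = 40
x_tr, y_tr = x[:n_train], y[:n_train]
x_va, y_va = x[n_train:], y[n_train:]

def design(x, d):
    """Design matrix with columns [1, x, x^2, ..., x^d]."""
    return np.vander(x, d + 1, increasing=True)

for d in range(1, 9):
    w = np.linalg.lstsq(design(x_tr, d), y_tr, rcond=None)[0]  # fit on the training set
    err = np.mean((design(x_va, d) @ w - y_va) ** 2)           # evaluate on the validation set
    print(d, err)
```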

20 k-fold Cross-Validation Consider k partitions of the data (usually of equal size). Train with k-1 subsets, validate on the k-th subset. Repeat k times. Average the prediction error over the k rounds/folds. 20

21 k-fold Cross-Validation Consider k partitions of the data (usually of equal size). Train with k-1 subsets, validate on the k-th subset. Repeat k times. Average the prediction error over the k rounds/folds. Computation time is increased by a factor of k. 21

22 Leave-one-out cross-validation Let k = n, the size of the training set. For each order-d hypothesis class, repeat n times: Set aside one instance <x_i, y_i> from the training set. Use all other data points to find w (optimization). Measure prediction error on the held-out <x_i, y_i>. Average the prediction error over all n subsets. Choose the d with lowest estimated true prediction error. 22
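
A minimal sketch of this procedure for polynomial regression, assuming x and y are NumPy arrays holding the training set; the helper name loocv_error is mine, not from the slides.

```python
import numpy as np

def loocv_error(x, y, d):
    """Leave-one-out CV estimate of prediction error for an order-d polynomial."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i                     # all points except <x_i, y_i>
        X_tr = np.vander(x[mask], d + 1, increasing=True)
        w = np.linalg.lstsq(X_tr, y[mask], rcond=None)[0]
        x_i = np.vander(x[i:i+1], d + 1, increasing=True)
        errs.append(((x_i @ w - y[i]) ** 2).item())  # error on the held-out point
    return np.mean(errs)

# Choose the degree with the lowest estimated true error, e.g.:
# best_d = min(range(1, 9), key=lambda d: loocv_error(x, y, d))
```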

23 Estimating true error for d=1. (Plot of the data and the d=1 fit, alongside the cross-validation results.) 23

24 Cross-validation results Optimal choice: d=2. Overfitting for d > 2. 24

25 Evaluation We use cross-validation for model selection. Available labeled data is split into two parts: Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree). Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression). Must be untouched during the process of looking for f within a class F. 25

26 Evaluation After adapting the weights to minimize the error on the training set, the weights could be exploiting particularities of the training set: we have to use the validation set as a proxy for the true error. After choosing the hypothesis class to minimize error on the validation set, the hypothesis class could be adapted to particularities of the validation set: the validation set is no longer a good proxy for the true error! 26

27 Evaluation We use cross-validation for model selection. Available labeled data is split into three parts: Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree). Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression). Must be untouched during the process of looking for f within a class F. Test set: Ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error. Cannot be touched during the process of selecting F. (Often the validation set is called the test set, without distinction.) 27

28 Validation vs Train error [From Hastie et al. textbook] (Figure: test-sample and training-sample prediction error as a function of model complexity, ranging from high bias / low variance at low complexity to low bias / high variance at high complexity.) 28

29 Understanding the error Given a set of examples <X, Y>. Assume that y = f(x) + ε, where ε is Gaussian noise with zero mean and standard deviation σ, i.e. ε ~ N(0, σ^2). 29

30 Understanding the error Consider the standard linear regression solution: Err(w) = Σ_{i=1:n} (y_i - w^T x_i)^2. If we consider only the class of linear hypotheses, we have systematic prediction error, called bias, whenever the data is generated by a non-linear function. Depending on what dataset we observed, we may get different solutions. Thus we can also have error due to this variance. This occurs even if the data is generated from the class of linear functions. 30

31 An example (from Tom Dietterich) The circles are data points. X is drawn uniformly at random. Y is generated by the function y = 2 sin(0.5x) + ε. 31

32 An example (from Tom Dietterich) With different sets of 20 points, we get different lines. 32

33 Bias-variance decomposition Compare f(x) (the true value at x) to the prediction h(x). Error: E[(h(x) - f(x))^2]. Bias: f(x) - h̄(x). How far is the average estimate from the true function? Variance: E[(h(x) - h̄(x))^2]. How far are estimates (on average) from the average estimate? Notes: expectations are over all possible datasets (noise realizations); h̄(x) = E[h(x)]. 33
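
To make these definitions concrete, here is a hedged sketch (sample sizes and noise level are illustrative) that estimates bias and variance at a single point by repeatedly sampling datasets from the y = 2 sin(0.5x) + ε generator above and fitting a 1st-order model:

```python
import numpy as np

rng = np.random.default_rng(2)
x_query = 5.0                        # point at which we study h(x)
f_true = 2 * np.sin(0.5 * x_query)   # true value f(x)

preds = []
for _ in range(1000):                               # expectation over datasets
    x = rng.uniform(0, 10, 20)
    y = 2 * np.sin(0.5 * x) + rng.normal(0, 1, 20)  # noisy samples of f
    X = np.column_stack([np.ones_like(x), x])
    w = np.linalg.lstsq(X, y, rcond=None)[0]        # fit a 1st-order (linear) model
    preds.append(w[0] + w[1] * x_query)             # h(x_query) for this dataset

preds = np.array(preds)
h_bar = preds.mean()                        # average estimate E[h(x)]
bias = f_true - h_bar                       # bias at x_query
variance = np.mean((preds - h_bar) ** 2)    # variance at x_query
print(bias**2, variance)
```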

34 Bias-variance decomposition Can show that: Error = bias^2 + variance. What happens if we apply a 1st-order model to a quadratic dataset? No matter how much data we have, we can't fit the dataset. This means we have bias (underfitting). What happens if we apply a 2nd-order model to a linear dataset? The model can definitely fit the dataset! (no bias) Spurious parameters make the model sensitive to noise: higher variance than necessary (overfitting if the dataset is small). 34

35 Bias-variance decomposition Can show that: Error = bias^2 + variance. What happens if we apply a 1st-order model to a quadratic dataset? No matter how much data we have, we can't fit the dataset. This means we have bias (underfitting). What happens if we apply a 2nd-order model to a linear dataset? The model can definitely fit the dataset! (no bias) Spurious parameters make the model sensitive to noise: higher variance than necessary (overfitting if the dataset is small). 35

36 Validation vs Train error [From Hastie et al. textbook] (Figure: test-sample and training-sample prediction error as a function of model complexity, ranging from high bias / low variance at low complexity to low bias / high variance at high complexity.) 36

37 Bias-variance decomposition Can show that: Error = bias^2 + variance. What happens if we apply a 1st-order model to a quadratic dataset? No matter how much data we have, we can't fit the dataset. This means we have bias (underfitting). What happens if we apply a 2nd-order model to a linear dataset? The model can definitely fit the dataset! (no bias) Spurious parameters make the model sensitive to noise: higher variance than necessary (overfitting if the dataset is small). What happens if we apply a 1st-order model to a linear dataset? 37

38 Gauss-Markov Theorem Main result: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. 38

39 Gauss-Markov Theorem Main result: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Understanding the statement: Real parameters are denoted: w. Estimate of the parameters is denoted: ŵ. Error of the estimator: Err(ŵ) = E[(ŵ - w)^2] = Var(ŵ) + (E[ŵ - w])^2. Unbiased estimator means: E[ŵ - w] = 0. There may exist an estimator that has lower error, but some bias. 39

40 Bias vs Variance Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Insight: Find lower variance solution, at the expense of some bias. 40

41 Bias vs Variance Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Insight: Find lower variance solution, at the expense of some bias. E.g. fix low-relevance weights to 0 41

42 Recall our prostate cancer example The Z-score measures the effect of dropping that feature from the linear regression: z_j = ŵ_j / sqrt(σ^2 v_j), where ŵ_j is the estimated weight of the j-th feature, and v_j is the j-th diagonal element of (X^T X)^{-1}. TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level. (Table columns: Term, Coefficient, Std. Error, Z Score; rows: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg; numeric values not reproduced here.) [From Hastie et al. textbook] 42
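
A minimal sketch of computing these Z-scores with NumPy; it assumes X already contains the intercept column and estimates σ^2 from the residuals with n - p degrees of freedom (a standard but here assumed choice), and the function name is mine.

```python
import numpy as np

def z_scores(X, y):
    """Z score of each coefficient: w_j / sqrt(sigma^2 * v_j),
    where v_j is the j-th diagonal element of (X^T X)^{-1}."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    w = XtX_inv @ X.T @ y                 # least-squares weights
    resid = y - X @ w
    sigma2 = resid @ resid / (n - p)      # noise-variance estimate from residuals
    return w / np.sqrt(sigma2 * np.diag(XtX_inv))
```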

43 Subset selection Idea: Keep only a small set of features with non-zero weights. Goal: Find a lower variance solution, at the expense of some bias. There are many different methods for choosing subsets. (More on this later.) Least-squares regression can be used to estimate the weights of the selected features. This introduces bias, as the true model might rely on the discarded features! 43

44 Bias vs Variance Find a lower variance solution, at the expense of some bias. Force some weights to 0, e.g. by including a penalty for model complexity in the error to reduce overfitting: Err(w) = Σ_{i=1:n} (y_i - w^T x_i)^2 + λ model_size, where λ is a hyper-parameter that controls the penalty size. 44

45 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=0:m} w_j^2 }, where λ can be selected manually, or by cross-validation. 45

46 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=0:m} w_j^2 }, where λ can be selected manually, or by cross-validation. Do a little algebra to get the solution: ŵ_ridge = (X^T X + λI)^{-1} X^T Y. 46

47 Ridge regression (aka L2-regularization) Re-write in matrix notation: f_w(X) = Xw, Err(w) = (Y - Xw)^T (Y - Xw) + λ w^T w. To minimize, take the derivative w.r.t. w: ∂Err(w)/∂w = -2 X^T (Y - Xw) + 2λw = 0. Try a little algebra: 47

48 Ridge regression (aka L2-regularization) Re-write in matrix notation: f_w(X) = Xw, Err(w) = (Y - Xw)^T (Y - Xw) + λ w^T w. To minimize, take the derivative w.r.t. w: ∂Err(w)/∂w = -2 X^T (Y - Xw) + 2λw = 0. Try a little algebra: X^T Y = (X^T X + λI) w, so ŵ = (X^T X + λI)^{-1} X^T Y. 48

49 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=0:m} w_j^2 }, where λ can be selected manually, or by cross-validation. Do a little algebra to get the solution: ŵ_ridge = (X^T X + λI)^{-1} X^T Y. The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first. Ridge gives a smooth solution, effectively shrinking the weights, but drives few weights exactly to 0. 49
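
A minimal sketch of the closed-form ridge solution above in NumPy; the function name is mine, and it assumes X is already normalized and includes any bias column (so all weights, including w_0, are penalized as on the slide).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y.
    Inputs are assumed already normalized (ridge is not scale-equivariant)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

In practice λ (lam) would be chosen manually or by cross-validation, exactly as the slide says.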

50 Lasso regression (aka L1-regularization) Constrains the weights by penalizing the absolute value of their size: ŵ_lasso = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=1:m} |w_j| } 50

51 Lasso regression (aka L1-regularization) Constrains the weights by penalizing the absolute value of their size: ŵ_lasso = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=1:m} |w_j| } Now there is no closed-form solution. Need to solve a quadratic programming problem instead. 51

52 Lasso regression (aka L1-regularization) Constrains the weights by penalizing the absolute value of their size: ŵ_lasso = argmin_w { Σ_{i=1:n} (y_i - w^T x_i)^2 + λ Σ_{j=1:m} |w_j| } Now there is no closed-form solution. Need to solve a quadratic programming problem instead. More computationally expensive than Ridge regression. Effectively sets the weights of less relevant input features to zero. 52
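
A minimal sketch of lasso in practice, assuming scikit-learn is available; its Lasso estimator solves this objective by coordinate descent rather than generic quadratic programming, and the data and alpha value below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data; in practice X, y would be your (normalized) features and targets.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)  # only 2 relevant features

model = Lasso(alpha=0.1)   # alpha plays the role of lambda
model.fit(X, y)
print(model.coef_)         # weights of irrelevant features are driven to (near) zero
```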

53 Comparing Ridge and Lasso Ridge complexity penalty: Σ_{j=0:m} w_j^2. Lasso complexity penalty: λ Σ_{j=1:m} |w_j|. Note that for lasso, reducing any weight by, say, 0.1 reduces the complexity by the same amount, so we'd prefer reducing less relevant features. In ridge regression, reducing high weights reduces complexity more than reducing a low weight, so there is a trade-off between reducing less relevant features and reducing larger weights; ridge tends to not reduce weights that are already small. 53

54 Comparing Ridge and Lasso (Figure: for both ridge (L2) and lasso (L1) regularization, contours of equal regression error and contours of equal model complexity penalty are shown in the (w_1, w_2) plane, along with the resulting solution w*.) 54

55 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1:n} (y_i - w^T x_i)^2. Other loss functions? 55

56 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1:n} (y_i - w^T x_i)^2. Other loss functions? Absolute error loss: L(Y, f_w(X)) = Σ_{i=1:n} |y_i - w^T x_i|. 0-1 loss (for classification): L(Y, f_w(X)) = Σ_{i=1:n} I(y_i ≠ f_w(x_i)). 56

57 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1:n} (y_i - w^T x_i)^2. Other loss functions? Absolute error loss: L(Y, f_w(X)) = Σ_{i=1:n} |y_i - w^T x_i|. 0-1 loss (for classification): L(Y, f_w(X)) = Σ_{i=1:n} I(y_i ≠ f_w(x_i)). Different loss functions make different assumptions. Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise. The loss function is chosen independently of the complexity penalty (L1 or L2). 57
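
A small sketch of these three losses as NumPy functions; the function names are mine, and the 0-1 loss assumes y and the predictions are discrete labels.

```python
import numpy as np

def squared_loss(y, y_hat):
    return np.sum((y - y_hat) ** 2)    # least-squares / MSE-style loss

def absolute_loss(y, y_hat):
    return np.sum(np.abs(y - y_hat))   # absolute-error loss

def zero_one_loss(y, y_hat):
    return np.sum(y != y_hat)          # 0-1 loss (classification)
```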

58 A quick look at evaluation functions Assume data generated as p(y | x; w) = N(x^T w, 1) = (1 / sqrt(2π)) exp(-(y - x^T w)^2 / 2). 58

59 A quick look at evaluation functions Assume data generated as p(y | x; w) = N(x^T w, 1) = (1 / sqrt(2π)) exp(-(y - x^T w)^2 / 2). Reasonable to try to maximize the probability of generating the labels y: argmax_w p(y | x, w) = argmax_w log p(y | x, w) = argmax_w [const - (y - x^T w)^2 / 2], i.e. maximizing the likelihood is equivalent to minimizing the squared error. 59
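
As a hedged numerical check of this equivalence (assuming SciPy is available, using unit noise variance as above, and made-up toy data), minimizing the negative log-likelihood recovers the same weights as the closed-form least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, 50)

def neg_log_likelihood(w):
    # Up to additive constants, -log p(y | x, w) is 0.5 * squared error.
    return 0.5 * np.sum((y - X @ w) ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_mle, w_ls)   # the two solutions agree (up to optimizer tolerance)
```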

60 Tutorial: Programming with Python. This Friday, 6-7 pm, Stewart Biology building S3/3. 60

61 What you should know Overfitting (when it happens, how to avoid it). Cross-validation (how and why we use it). Ridge regression (L2 regularisation), Lasso (L1 regularisation). 61

62 MSURJ announcement 62
