COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont'd)
1 COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont'd) Instructor: Herke van Hoof Slides mostly by: Class web page: Unless otherwise noted, all material posted for this course is copyright of the instructors, and cannot be reused or reposted without the instructor's written permission.
2 What we saw last time Definition and characteristics of a supervised learning problem. Linear regression (hypothesis class, cost function, algorithm). Closed-form least-squares solution method (algorithm, computational complexity, stability issues). Alternative optimization methods via gradient descent. 2
3 Predicting recurrence time from tumor size This function looks complicated, and a linear hypothesis does not seem very good. What should we do? 3
4 Predicting recurrence time from tumor size This function looks complicated, and a linear hypothesis does not seem very good. What should we do? Pick a better function? Use more features? Get more data? 4
5 Input variables for linear regression Original quantitative variables X_1, ..., X_m. Transformations of variables, e.g. X_{m+1} = log(X_i). Basis expansions, e.g. X_{m+1} = X_i^2, X_{m+2} = X_i^3, ... Interaction terms, e.g. X_{m+1} = X_i X_j. Numeric coding of qualitative variables, e.g. X_{m+1} = 1 if X_i is true and 0 otherwise. In all cases, we can add X_{m+1}, ..., X_{m+k} to the list of original variables and perform the linear regression. 5
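These feature constructions can be sketched in a few lines of NumPy. The dataset and the specific transformations below are illustrative assumptions, not the lecture's data:

```python
import numpy as np

# Hypothetical 2-feature dataset (e.g. X1 = tumor size, X2 = patient age).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(5, 2))

# Widen the matrix with a transformation, a basis expansion,
# and an interaction term, then regress on the augmented matrix.
X_aug = np.column_stack([
    X,                  # original variables X1, X2
    np.log(X[:, 0]),    # transformation: log(X1)
    X[:, 0] ** 2,       # basis expansion: X1^2
    X[:, 0] * X[:, 1],  # interaction term: X1 * X2
])
print(X_aug.shape)  # (5, 5)
```

Ordinary least squares applied to `X_aug` is still linear regression: the model stays linear in the weights even though it is non-linear in the original inputs.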
6 Example of linear regression with polynomial terms f_w(x) = w_0 + w_1 x + w_2 x^2. The design matrix X has one row [1, x_i, x_i^2] per example, and Y stacks the targets y_i. (The slide's numeric matrix entries did not survive transcription.) 6
7 Solving the problem w = (X^T X)^{-1} X^T Y. (The slide's numeric matrices did not survive transcription.) The best order-2 polynomial comes out as y = 0.68x^2 + ..., compared to y = 1.6x + ... for the order-1 polynomial. 7
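A minimal sketch of the closed-form fit on synthetic data (the slide's numeric values are not reproduced here; the generating coefficients are assumptions for illustration):

```python
import numpy as np

# Synthetic order-2 data: y = 1 + 0.5 x + 0.25 x^2 + noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 4, 20)
y = 1.0 + 0.5 * x + 0.25 * x**2 + rng.normal(0, 0.05, size=x.shape)

# Design matrix with columns [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x**2])

# w = (X^T X)^{-1} X^T Y; lstsq computes the same least-squares
# solution in a numerically stable way.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to the generating coefficients [1.0, 0.5, 0.25]
```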
8 Order-3 fit: Is this better? 8
9 Order-4 fit 9
10 Order-5 fit 10
11 Order-6 fit 11
12 Order-7 fit 12
13 Order-8 fit 13
14 Order-9 fit 14
15 This is overfitting! 15
16 This is overfitting! We can find a hypothesis that perfectly explains the training data, but does not generalize well to new data. In this example: we have a lot of parameters (weights), so the hypothesis matches the data points exactly, but is wild everywhere else. A very important problem in machine learning. 16
17 Overfitting Every hypothesis has a true error measured on all possible data items we could ever encounter (e.g. f_w(x_i) − y_i). Since we don't have all possible data, in order to decide what is a good hypothesis, we measure error over the training set. Formally: suppose we compare hypotheses f_1 and f_2, and assume f_1 has lower error on the training set. If f_2 has lower true error, then our algorithm is overfitting. 17
18 Overfitting Which hypothesis has the lowest true error? d=1 d=2 d=3 d=4 d=5 d=6 d=7 d=8 18
19 Cross-Validation Partition your data into a Training Set and a Validation Set. The proportions in each set can vary. Use the Training Set to find the best hypothesis in the class. Use the Validation Set to evaluate the true prediction error. Compare across different hypothesis classes (different-order polynomials). (Slide figure: the rows of X and Y split into a Train block and a Validate block.) 19
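A minimal sketch of this selection procedure, picking the polynomial degree with lowest validation error. The data, the 2/3–1/3 split, and the degree range are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 60)
y = 2 * np.sin(0.5 * x) + rng.normal(0, 0.3, size=x.shape)

train, val = slice(0, 40), slice(40, 60)  # 2/3 train, 1/3 validation

def fit_poly(x, y, d):
    X = np.vander(x, d + 1)               # columns [x^d, ..., x, 1]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, x, y, d):
    return np.mean((np.vander(x, d + 1) @ w - y) ** 2)

# Fit each degree on the training set, score on the validation set.
val_err = {d: mse(fit_poly(x[train], y[train], d), x[val], y[val], d)
           for d in range(1, 9)}
best_d = min(val_err, key=val_err.get)    # degree with lowest validation MSE
print(best_d)
```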
20 k-fold Cross-Validation Consider k partitions of the data (usually of equal size). Train with k−1 subsets, validate on the k-th subset. Repeat k times. Average the prediction error over the k rounds/folds. 20
21 k-fold Cross-Validation Consider k partitions of the data (usually of equal size). Train with k−1 subsets, validate on the k-th subset. Repeat k times. Average the prediction error over the k rounds/folds. Computation time is increased by a factor of k. 21
22 Leave-one-out cross-validation Let k = n, the size of the training set For each order-d hypothesis class, Repeat n times: Set aside one instance <x i, y i > from the training set. Use all other data points to find w (optimization). Measure prediction error on the held-out <x i, y i >. Average the prediction error over all n subsets. Choose the d with lowest estimated true prediction error. 22
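The k-fold procedure (with leave-one-out as the special case k = n) can be sketched as below; the data and polynomial hypothesis class are illustrative assumptions:

```python
import numpy as np

def kfold_mse(x, y, d, k):
    """Average validation MSE of an order-d polynomial over k folds."""
    n = len(x)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        X_tr = np.vander(x[train], d + 1)
        w, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)
        pred = np.vander(x[held_out], d + 1) @ w
        errs.append(np.mean((pred - y[held_out]) ** 2))
    return np.mean(errs)               # average over the k rounds/folds

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 30)
y = 2 * np.sin(0.5 * x) + rng.normal(0, 0.3, size=x.shape)

loo = kfold_mse(x, y, d=2, k=len(x))   # leave-one-out: k = n
print(loo)
```

To select d, one would compute `kfold_mse` for each candidate degree and keep the d with the lowest estimate.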
23 Estimating true error for d=1 (Slide figure: the data points and the d=1 fit, x vs. y, alongside cross-validation results.) 23
24 Cross-validation results Optimal choice: d=2. Overfitting for d > 2. 24
25 Evaluation We use cross-validation for model selection. Available labeled data is split into two parts: Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree). Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression). Must be untouched during the process of looking for f within a class F. 25
26 Evaluation After adapting the weights to minimize the error on the train set, the weights could be exploiting particularities of the train set: we have to use the validation set as a proxy for the true error. After choosing the hypothesis class to minimize the error on the validation set, the hypothesis class could likewise be adapted to particularities of the validation set. The validation set is then no longer a good proxy for the true error! 26
27 Evaluation We use cross-validation for model selection. Available labeled data is split into parts: Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree). Validation set is used to compare the best f from each hypothesis class across different classes (e.g. different degree regression). Must be untouched during the process of looking for f within a class F. Test set: ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error. Cannot be touched during the process of selecting F. (Often the validation set is called the test set, without distinction.) 27
28 Validation vs Train error [From Hastie et al. textbook] (Figure: test and training error as a function of model complexity; training error decreases with complexity, while test error is lowest at intermediate complexity. Low complexity: high bias, low variance. High complexity: low bias, high variance.) 28
29 Understanding the error Given a set of examples <X, Y>. Assume that y = f(x) + ε, where ε is Gaussian noise with zero mean and standard deviation σ. (Slide figure: the Gaussian density N(x | µ, σ²) centred at µ = f(x).) 29
30 Understanding the error Consider the standard linear regression solution: Err(w) = Σ_{i=1..n} (y_i − w^T x_i)². If we consider only the class of linear hypotheses, we have a systematic prediction error, called bias, whenever the data is generated by a non-linear function. Depending on what dataset we observed, we may get different solutions; thus we can also have error due to this variance. This occurs even if the data is generated from the class of linear functions. 30
31 An example (from Tom Dietterich) The circles are data points. X is drawn uniformly at random. Y is generated by the function y = 2 sin(0.5x) + ε. 31
32 An example (from Tom Dietterich) With different sets of 20 points, we get different lines. 32
33 Bias-variance decomposition Compare f(x) (the true value at x) to the prediction h(x). Error: E[(h(x) − f(x))²]. Bias: f(x) − h̄(x). How far is the average estimate from the true function? Variance: E[(h(x) − h̄(x))²]. How far are estimates (on average) from the average estimate? Notes: expectations are over all possible datasets (noise realizations), and h̄(x) = E[h(x)]. 33
34 Bias-variance decomposition Can show that: Error = bias² + variance. What happens if we apply a 1st-order model to a quadratic dataset? No matter how much data we have, we can't fit the dataset. This means we have bias (underfitting). What happens if we apply a 2nd-order model to a linear dataset? The model can definitely fit the dataset (no bias), but spurious parameters make the model sensitive to noise: higher variance than necessary (overfitting if the dataset is small). 34
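The decomposition can be checked empirically at a single test point by refitting on many resampled datasets. The setup below (a 1st-order model on quadratic data, so the bias term is nonzero) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x**2                 # true (quadratic) function
x0, sigma, n, trials = 1.5, 0.5, 20, 2000

preds = []
for _ in range(trials):
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    # Fit a 1st-order (linear) model by least squares.
    w, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)
    preds.append(w[0] + w[1] * x0)  # this dataset's prediction h(x0)
preds = np.array(preds)

h_bar = preds.mean()                # average estimate, i.e. E[h(x0)]
bias2 = (h_bar - f(x0)) ** 2        # squared bias at x0
var = preds.var()                   # variance of h(x0) across datasets
err = np.mean((preds - f(x0)) ** 2) # mean squared error at x0

# The empirical decomposition holds exactly: err = bias^2 + var.
print(err, bias2 + var)
```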
36 Validation vs Train error [From Hastie et al. textbook] (Figure repeated: test and training error as a function of model complexity.) 36
37 Bias-variance decomposition Can show that: Error = bias² + variance. What happens if we apply a 1st-order model to a quadratic dataset? No matter how much data we have, we can't fit the dataset. This means we have bias (underfitting). What happens if we apply a 2nd-order model to a linear dataset? The model can definitely fit the dataset (no bias), but spurious parameters make the model sensitive to noise: higher variance than necessary (overfitting if the dataset is small). What happens if we apply a 1st-order model to a linear dataset? 37
38 Gauss-Markov Theorem Main result: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. 38
39 Gauss-Markov Theorem Main result: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Understanding the statement: Real parameters are denoted w. The estimate of the parameters is denoted ŵ. Error of the estimator: Err(ŵ) = E[(ŵ − w)²] = Var(ŵ) + (E[ŵ − w])². Unbiased estimator means E[ŵ − w] = 0. There may exist an estimator that has lower error, but some bias. 39
40 Bias vs Variance Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Insight: Find lower variance solution, at the expense of some bias. 40
41 Bias vs Variance Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Insight: Find lower variance solution, at the expense of some bias. E.g. fix low-relevance weights to 0 41
42 Recall our prostate cancer example The Z-score measures the effect of dropping that feature from the linear regression: z_j = ŵ_j / sqrt(σ² v_j), where ŵ_j is the estimated weight of the j-th feature, and v_j is the j-th diagonal element of (X^T X)^{-1}. TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level. (Table columns Term / Coefficient / Std. Error / Z Score, for the terms Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg; the numeric entries did not survive transcription.) [From Hastie et al. textbook] 42
43 Subset selection Idea: Keep only a small set of features with non-zero weights. Goal: Find a lower-variance solution, at the expense of some bias. There are many different methods for choosing subsets. (More on this later.) Least-squares regression can be used to estimate the weights of the selected features. There is bias, as the true model might rely on the discarded features! 43
44 Bias vs Variance Find a lower-variance solution, at the expense of some bias. Force some weights to 0. E.g. include a penalty for model complexity in the error to reduce overfitting: Err(w) = Σ_{i=1..n} (y_i − w^T x_i)² + λ · model_size. λ is a hyper-parameter that controls the penalty size. 44
45 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=0..m} w_j² }, where λ can be selected manually, or by cross-validation. 45
46 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=0..m} w_j² }, where λ can be selected manually, or by cross-validation. Do a little algebra to get the solution: ŵ_ridge = (X^T X + λI)^{-1} X^T Y. 46
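The closed-form ridge solution is a one-liner; the data and λ below are illustrative assumptions (note this sketch penalizes all weights, including any intercept column, exactly as the formula is written):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

# Ridge: w_hat = (X^T X + lambda * I)^{-1} X^T Y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Ordinary least squares for comparison (lambda = 0).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The penalty shrinks the weight vector toward zero relative to OLS.
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```

Note also that adding λI makes the matrix being inverted better conditioned, which helps the stability issues mentioned in the last lecture.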
47 Ridge regression (aka L2-regularization) Re-write in matrix notation: f_w(X) = Xw, Err(w) = (Y − Xw)^T (Y − Xw) + λ w^T w. To minimize, take the derivative w.r.t. w: ∂Err(w)/∂w = −2 X^T (Y − Xw) + 2λw = 0. 47
48 Ridge regression (aka L2-regularization) Re-write in matrix notation: f_w(X) = Xw, Err(w) = (Y − Xw)^T (Y − Xw) + λ w^T w. To minimize, take the derivative w.r.t. w: ∂Err(w)/∂w = −2 X^T (Y − Xw) + 2λw = 0. Try a little algebra: X^T Y = (X^T X + λI) w, so ŵ = (X^T X + λI)^{-1} X^T Y. 48
49 Ridge regression (aka L2-regularization) Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=0..m} w_j² }, where λ can be selected manually, or by cross-validation. Do a little algebra to get the solution: ŵ_ridge = (X^T X + λI)^{-1} X^T Y. The ridge solution is not equivariant under scaling of the data, so we typically need to normalize the inputs first. Ridge gives a smooth solution, effectively shrinking the weights, but drives few weights to 0. 49
50 Lasso regression (aka L1-regularization) Constrains the weights by penalizing their absolute value: ŵ_lasso = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=1..m} |w_j| } 50
51 Lasso regression (aka L1-regularization) Constrains the weights by penalizing their absolute value: ŵ_lasso = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=1..m} |w_j| } Now there is no closed-form solution. We need to solve a quadratic programming problem instead. 51
52 Lasso regression (aka L1-regularization) Constrains the weights by penalizing their absolute value: ŵ_lasso = argmin_w { Σ_{i=1..n} (y_i − w^T x_i)² + λ Σ_{j=1..m} |w_j| } Now there is no closed-form solution. We need to solve a quadratic programming problem instead. More computationally expensive than ridge regression. Effectively sets the weights of less relevant input features to zero. 52
53 Comparing Ridge and Lasso Ridge complexity penalty: λ Σ_{j=0..m} w_j². Lasso: λ Σ_{j=1..m} |w_j|. Note that for lasso, reducing any weight by, say, 0.1 reduces the complexity by the same amount, so we'd prefer reducing less relevant features. In ridge regression, reducing a high weight reduces complexity more than reducing a low weight, so there is a trade-off between reducing less relevant features and reducing larger weights; ridge tends not to reduce weights that are already small. 53
54 Comparing Ridge and Lasso (Slide figure: contours of equal regression error and contours of equal model-complexity penalty in the (w_1, w_2) plane. The lasso's diamond-shaped penalty contours tend to place the optimum ŵ at a corner, setting one weight exactly to 0; the ridge's circular contours do not.) 54
55 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1..n} (y_i − w^T x_i)². Other loss functions? 55
56 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1..n} (y_i − w^T x_i)². Other loss functions? Absolute error loss: L(Y, f_w(X)) = Σ_{i=1..n} |y_i − w^T x_i|. 0-1 loss (for classification): L(Y, f_w(X)) = Σ_{i=1..n} I(y_i ≠ f_w(x_i)). 56
57 A quick look at evaluation functions We call L(Y, f_w(X)) the loss function. Least-squares / mean squared-error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1..n} (y_i − w^T x_i)². Other loss functions? Absolute error loss: L(Y, f_w(X)) = Σ_{i=1..n} |y_i − w^T x_i|. 0-1 loss (for classification): L(Y, f_w(X)) = Σ_{i=1..n} I(y_i ≠ f_w(x_i)). Different loss functions make different assumptions. Squared-error loss assumes the data can be approximated by a global linear model with Gaussian noise. The loss function is independent of the complexity penalty (L1 or L2). 57
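The three losses above are one-liners; the toy predictions below are illustrative:

```python
import numpy as np

def squared_loss(y, pred):
    """Least-squares loss: sum of squared residuals."""
    return np.sum((y - pred) ** 2)

def absolute_loss(y, pred):
    """Absolute error loss: sum of absolute residuals."""
    return np.sum(np.abs(y - pred))

def zero_one_loss(y, pred):
    """0-1 loss for classification: count of disagreements."""
    return int(np.sum(y != pred))

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.0, 2.5, 2.0])
print(squared_loss(y, pred))   # 0.0 + 0.25 + 1.0 = 1.25
print(absolute_loss(y, pred))  # 0.0 + 0.5  + 1.0 = 1.5
```

Note how the squared loss penalizes the larger residual (1.0) four times more than the smaller one (0.5), while the absolute loss penalizes them proportionally; this is why the absolute loss is more robust to outliers.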
58 A quick look at evaluation functions Assume the data is generated as p(y | x; w) = N(x^T w, σ²) = (1 / sqrt(2πσ²)) exp(−(y − x^T w)² / (2σ²)) 58
59 A quick look at evaluation functions Assume the data is generated as p(y | x; w) = N(x^T w, σ²) = (1 / sqrt(2πσ²)) exp(−(y − x^T w)² / (2σ²)). It is reasonable to try to maximize the probability of generating the labels y: argmax_w p(y | x, w) = argmax_w log p(y | x, w) = argmax_w [const − (y − x^T w)² / (2σ²)], so maximizing the likelihood is equivalent to minimizing the squared error. 59
60 Tutorial Programming with Python This Friday, 6-7 pm Stewart Biology building S3/3 60
61 What you should know Overfitting (when it happens, how to avoid it). Cross-validation (how and why we use it). Ridge regression (l2 regularisation), Lasso (l1 regularisation) 61
62 MSURJ announcement 62
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More informationRegularization Paths
December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationLecture 2: Linear regression
Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationLinear Models for Regression
Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationMachine Learning and Data Mining. Linear regression. Prof. Alexander Ihler
+ Machine Learning and Data Mining Linear regression Prof. Alexander Ihler Supervised learning Notation Features x Targets y Predictions ŷ Parameters θ Learning algorithm Program ( Learner ) Change µ Improve
More informationCOMP 551 Applied Machine Learning Lecture 14: Neural Networks
COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,
More informationCPSC 340: Machine Learning and Data Mining. Regularization Fall 2017
CPSC 340: Machine Learning and Data Mining Regularization Fall 2017 Assignment 2 Admin 2 late days to hand in tonight, answers posted tomorrow morning. Extra office hours Thursday at 4pm (ICICS 246). Midterm
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
More informationMachine Learning Linear Regression. Prof. Matteo Matteucci
Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More information6. Regularized linear regression
Foundations of Machine Learning École Centrale Paris Fall 2015 6. Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr
More informationLinear classifiers: Logistic regression
Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The
More informationLinear Regression. Machine Learning Seyoung Kim. Many of these slides are derived from Tom Mitchell. Thanks!
Linear Regression Machine Learning 10-601 Seyoung Kim Many of these slides are derived from Tom Mitchell. Thanks! Regression So far, we ve been interested in learning P(Y X) where Y has discrete values
More informationHigh-Dimensional Statistical Learning: Introduction
Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/
More informationMachine Learning (COMP-652 and ECSE-608)
Machine Learning (COMP-652 and ECSE-608) Lecturers: Guillaume Rabusseau and Riashat Islam Email: guillaume.rabusseau@mail.mcgill.ca - riashat.islam@mail.mcgill.ca Teaching assistant: Ethan Macdonald Email:
More information15-388/688 - Practical Data Science: Nonlinear modeling, cross-validation, regularization, and evaluation
15-388/688 - Practical Data Science: Nonlinear modeling, cross-validation, regularization, and evaluation J. Zico Kolter Carnegie Mellon University Fall 2016 1 Outline Example: return to peak demand prediction
More informationLinear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Linear Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More informationESS2222. Lecture 4 Linear model
ESS2222 Lecture 4 Linear model Hosein Shahnas University of Toronto, Department of Earth Sciences, 1 Outline Logistic Regression Predicting Continuous Target Variables Support Vector Machine (Some Details)
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationData Mining Techniques. Lecture 2: Regression
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 2: Regression Jan-Willem van de Meent (credit: Yijun Zhao, Marc Toussaint, Bishop) Administrativa Instructor Jan-Willem van de Meent Email:
More informationCSE446: non-parametric methods Spring 2017
CSE446: non-parametric methods Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Linear Regression: What can go wrong? What do we do if the bias is too strong? Might want
More informationDATA MINING AND MACHINE LEARNING
DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems
More informationIntroduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric
CS 189 Introduction to Machine Learning Fall 2017 Note 5 1 Overview Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement of the true underlying response f x):
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationMachine Learning. Ensemble Methods. Manfred Huber
Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected
More informationMachine Learning (COMP-652 and ECSE-608)
Machine Learning (COMP-652 and ECSE-608) Instructors: Doina Precup and Guillaume Rabusseau Email: dprecup@cs.mcgill.ca and guillaume.rabusseau@mail.mcgill.ca Teaching assistants: Tianyu Li and TBA Class
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMachine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)
Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March
More information