Overfitting, Bias / Variance Analysis


1 Overfitting, Bias/Variance Analysis. Professor Ameet Talwalkar, CS260 Machine Learning Algorithms, February 8, 2017.

2 Outline: 1 Administration; 2 Review of last lecture; 3 Basic ideas to overcome overfitting; 4 Bias/Variance Analysis

3 Announcements: HW2 will be returned in section on Friday. HW3 is due in class next Monday. The midterm is next Wednesday.

4 Midterm: Next Wednesday in class, from 10am to 11:50am. Completely closed-book (no notes allowed). Will include roughly 6 short-answer questions and 3 long questions; short questions should take 5 minutes each on average, long questions about 15 minutes each. Covers all material through (and including) today's lecture. The goal is to test conceptual understanding of the course material. Suggestion: carefully review the lecture notes and problem sets. Office hours / section: see timing on the course website.

5 Outline: 1 Administration; 2 Review of last lecture (Linear Regression, Ridge Regression for Numerical Purposes, Non-linear Basis); 3 Basic ideas to overcome overfitting; 4 Bias/Variance Analysis

6-8 Linear regression

Setup. Input: $x \in \mathbb{R}^D$ (covariates, predictors, features, etc.). Output: $y \in \mathbb{R}$ (responses, targets, outcomes, outputs, etc.). Model: $f: x \to y$, with $f(x) = w_0 + \sum_d w_d x_d = w_0 + w^T x$, where $w = [w_1\ w_2\ \cdots\ w_D]^T$ are the weights (parameters, or parameter vector) and $w_0$ is called the bias. We also sometimes call $\tilde{w} = [w_0\ w_1\ w_2\ \cdots\ w_D]^T$ the parameters. Training data: $\mathcal{D} = \{(x_n, y_n),\ n = 1, 2, \ldots, N\}$.

Least Mean Squares (LMS) objective: minimize the squared difference on the training data (the residual sum of squares)
$$\mathrm{RSS}(\tilde{w}) = \sum_n [y_n - f(x_n)]^2 = \sum_n \Big[y_n - \big(w_0 + \sum_d w_d x_{nd}\big)\Big]^2$$

Solution: identify stationary points by taking the derivative with respect to the parameters and setting it to zero, yielding the normal equations.

9-13 LMS when x is D-dimensional

RSS in matrix form: after augmenting the variables as $\tilde{x} \leftarrow [1\ x_1\ x_2\ \ldots\ x_D]^T$ and $\tilde{w} \leftarrow [w_0\ w_1\ w_2\ \ldots\ w_D]^T$,
$$\mathrm{RSS}(\tilde{w}) = \sum_n \Big[y_n - \big(w_0 + \sum_d w_d x_{nd}\big)\Big]^2 = \sum_n [y_n - \tilde{w}^T \tilde{x}_n]^2$$
which leads to
$$\mathrm{RSS}(\tilde{w}) = \sum_n (y_n - \tilde{w}^T \tilde{x}_n)(y_n - \tilde{x}_n^T \tilde{w}) = \sum_n \big(\tilde{w}^T \tilde{x}_n \tilde{x}_n^T \tilde{w} - 2 y_n \tilde{x}_n^T \tilde{w}\big) + \mathrm{const} = \tilde{w}^T \Big(\sum_n \tilde{x}_n \tilde{x}_n^T\Big) \tilde{w} - 2 \Big(\sum_n y_n \tilde{x}_n^T\Big) \tilde{w} + \mathrm{const}$$

14-16 RSS in new notation

From the previous slide,
$$\mathrm{RSS}(\tilde{w}) = \tilde{w}^T \Big(\sum_n \tilde{x}_n \tilde{x}_n^T\Big) \tilde{w} - 2 \Big(\sum_n y_n \tilde{x}_n^T\Big) \tilde{w} + \mathrm{const}$$
Define the design matrix and target vector
$$\tilde{X} = \begin{bmatrix} \tilde{x}_1^T \\ \tilde{x}_2^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix} \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
This yields the compact expression
$$\mathrm{RSS}(\tilde{w}) = \|\tilde{X}\tilde{w} - y\|_2^2 = \tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2\,(\tilde{X}^T y)^T \tilde{w} + \mathrm{const}$$

17-20 Solution in matrix form

Compact expression:
$$\mathrm{RSS}(\tilde{w}) = \|\tilde{X}\tilde{w} - y\|_2^2 = \tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2\,(\tilde{X}^T y)^T \tilde{w} + \mathrm{const}$$
Gradients of linear and quadratic functions: $\nabla_x (b^T x) = b$ and $\nabla_x (x^T A x) = 2Ax$ for symmetric $A$.
Normal equations: setting the gradient to zero, $\nabla_{\tilde{w}} \mathrm{RSS}(\tilde{w}) \propto \tilde{X}^T \tilde{X} \tilde{w} - \tilde{X}^T y = 0$, leads to the least-mean-squares (LMS) solution
$$\tilde{w}^{\mathrm{LMS}} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$
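A minimal numpy sketch of this closed-form solution, using a small synthetic dataset invented here purely for illustration (solving the normal equations with np.linalg.solve rather than forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                        # raw features
w_true = np.array([0.5, -2.0, 1.0])
y = 3.0 + X @ w_true + 0.1 * rng.normal(size=N)    # noisy linear data, bias = 3.0

# Augment with a column of ones so the bias w_0 becomes part of w-tilde
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: (X^T X) w = X^T y
w_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_lms)   # approximately [3.0, 0.5, -2.0, 1.0]
```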

21 Practical concerns

Bottlenecks in computing the LMS solution $\tilde{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$: the matrix multiplication forming $\tilde{X}^T \tilde{X} \in \mathbb{R}^{(D+1)\times(D+1)}$, and inverting the matrix $\tilde{X}^T \tilde{X}$. Scalable alternatives: batch gradient descent and stochastic gradient descent.

22-24 (Batch) gradient descent

Initialize $\tilde{w}$ to $\tilde{w}^{(0)}$ (e.g., randomly); set $t = 0$; choose $\eta > 0$. Loop until convergence:
1. Compute the gradient $\nabla \mathrm{RSS}(\tilde{w}) = \tilde{X}^T \tilde{X} \tilde{w}^{(t)} - \tilde{X}^T y$
2. Update the parameters $\tilde{w}^{(t+1)} = \tilde{w}^{(t)} - \eta\, \nabla \mathrm{RSS}(\tilde{w})$
3. $t \leftarrow t + 1$
What is the complexity of each iteration? $O(ND)$. Why does this work? $\mathrm{RSS}(\tilde{w})$ is convex (its Hessian is PSD).
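A minimal numpy sketch of batch gradient descent on the RSS objective, assuming an augmented design matrix and target vector as in the earlier sketch; the default step size and iteration count are illustrative, and the step size must be small enough relative to the scale of $\tilde{X}^T \tilde{X}$ for the iterates to converge:

```python
import numpy as np

def batch_gradient_descent(X_tilde, y, eta=1e-3, num_iters=500):
    """Minimize RSS(w) = ||X_tilde w - y||^2 by full-batch gradient descent."""
    w = np.zeros(X_tilde.shape[1])                 # w^(0): start at the origin
    for _ in range(num_iters):
        grad = X_tilde.T @ (X_tilde @ w - y)       # X^T X w - X^T y, costs O(ND)
        w = w - eta * grad                         # gradient step
    return w
```

Because RSS is convex, with a small enough step size this converges to the same solution as the normal equations.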

25-28 Stochastic gradient descent

Widrow-Hoff rule: update the parameters using one example at a time. Initialize $\tilde{w}$ to some $\tilde{w}^{(0)}$; set $t = 0$; choose $\eta > 0$. Loop until convergence:
1. Randomly choose a training sample $\tilde{x}_t$
2. Compute its contribution to the gradient: $g_t = (\tilde{x}_t^T \tilde{w}^{(t)} - y_t)\, \tilde{x}_t$
3. Update the parameters: $\tilde{w}^{(t+1)} = \tilde{w}^{(t)} - \eta\, g_t$
4. $t \leftarrow t + 1$
How does the complexity per iteration compare with gradient descent? $O(ND)$ for gradient descent versus $O(D)$ for SGD.
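A minimal numpy sketch of the SGD (Widrow-Hoff) updates, again assuming an augmented design matrix as above; the learning rate and number of passes over the data are illustrative choices:

```python
import numpy as np

def sgd(X_tilde, y, eta=0.01, num_epochs=20, seed=0):
    """LMS via stochastic gradient descent: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    N, D1 = X_tilde.shape
    w = np.zeros(D1)
    for _ in range(num_epochs):
        for t in rng.permutation(N):          # visit training examples in random order
            x_t, y_t = X_tilde[t], y[t]
            g_t = (x_t @ w - y_t) * x_t       # per-example gradient, costs O(D)
            w = w - eta * g_t
    return w
```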

29-31 What if $\tilde{X}^T \tilde{X}$ is not invertible?

Why might this happen? Answer 1: $N < D$; intuitively, there is not enough data to estimate all the parameters. Answer 2: the columns of $\tilde{X}$ are not linearly independent, e.g., some features are perfectly correlated. In this case, the solution is not unique.

32-34 Ridge regression

What can we do when $\tilde{X}^T \tilde{X}$ is not invertible? Add a regularizer so that all singular values are at least $\lambda > 0$. This is equivalent to adding an extra term to $\mathrm{RSS}(\tilde{w})$:
$$\underbrace{\tilde{w}^T \tilde{X}^T \tilde{X} \tilde{w} - 2\,(\tilde{X}^T y)^T \tilde{w}}_{\mathrm{RSS}(\tilde{w})} \;+\; \underbrace{\lambda \|\tilde{w}\|_2^2}_{\text{regularization}}$$
Solution: we can derive normal equations as before; the solution is of the form
$$\tilde{w} = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T y$$
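A minimal numpy sketch of the ridge closed form; lam is an illustrative regularization strength, and the identity matrix matches the dimension of the augmented weight vector:

```python
import numpy as np

def ridge_solution(X_tilde, y, lam=0.1):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y."""
    D1 = X_tilde.shape[1]
    A = X_tilde.T @ X_tilde + lam * np.eye(D1)   # always invertible for lam > 0
    return np.linalg.solve(A, X_tilde.T @ y)
```

Note that many implementations exclude the bias component from the penalty; this sketch follows the slide's formulation and penalizes the full augmented vector.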

35 Is a linear modeling assumption always a good idea?

[Figure: an example of nonlinear classification and an example of nonlinear regression.]

36-37 General nonlinear basis functions

We can use a nonlinear mapping $\phi(x): x \in \mathbb{R}^D \to z \in \mathbb{R}^M$, where $M$ is the dimensionality of the new features $z$ (i.e., $\phi(x)$); $M$ could be greater than, less than, or equal to $D$. We can then apply existing learning methods to the transformed data: linear methods, where the prediction is based on $w^T \phi(x)$, as well as other methods such as nearest neighbors, decision trees, etc.

38 Regression with nonlinear basis

Residual sum of squares: $\sum_n [w^T \phi(x_n) - y_n]^2$, where $w \in \mathbb{R}^M$ has the same dimensionality as the transformed features $\phi(x)$. The LMS solution can be formulated with the new design matrix
$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \in \mathbb{R}^{N \times M}, \qquad w^{\mathrm{LMS}} = (\Phi^T \Phi)^{-1} \Phi^T y$$
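A minimal sketch of this idea in numpy with a polynomial basis; the degree, the noisy-sine data, and the helper name are illustrative stand-ins for the lecture's example:

```python
import numpy as np

def poly_design_matrix(x, M):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^M] (the Phi matrix)."""
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy sine samples

Phi = poly_design_matrix(x, M=3)
w_lms = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # LMS solution on transformed features
```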

39-40 Example with regression

Polynomial basis functions:
$$\phi(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^M \end{bmatrix}, \qquad f(x) = w_0 + \sum_{m=1}^{M} w_m x^m$$
Fitting samples from a sine function: with $M = 0$ and $M = 1$ the model underfits, as $f(x)$ is too simple. [Figure: fits for $M = 0$ and $M = 1$.]

41-42 Adding higher-order terms

[Figure: fits for $M = 3$ and $M = 9$; $M = 3$ fits well, while $M = 9$ overfits.] More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!

43 Overfitting

The parameters of the higher-order polynomials are very large. [Table: fitted coefficients $w_0$ through $w_9$ for $M = 0, 1, 3, 9$; the coefficient magnitudes blow up for $M = 9$.]

44 Overfitting can be quite disastrous

Fitting the housing price data with large $M$: the predicted price goes to zero (and is ultimately negative) if you buy a big enough house! How might we prevent overfitting?

45 Outline: 1 Administration; 2 Review of last lecture; 3 Basic ideas to overcome overfitting (Use more training data, Regularization methods); 4 Bias/Variance Analysis

46-48 Use more training data to prevent overfitting

The more, the merrier. [Figure: $M = 9$ fits with $N = 15$ and $N = 100$ training points; with more data the high-degree fit no longer overfits.] What if we do not have a lot of data?

49-50 Regularization methods

Intuition: give preference to simpler models. How do we define a "simple" linear regression model $w^T x$? Our strategy: place a prior on the weights. Interpret $w$ as a random variable, assume that each $w_d$ is centered around zero, and use the observed data $\mathcal{D}$ to update our prior belief about $w$.

51-56 Review: probabilistic interpretation of LMS

LMS model: $Y = w^T X + \eta$, where $\eta \sim N(0, \sigma_0^2)$ is a Gaussian random variable; thus $Y \sim N(w^T X, \sigma_0^2)$. We assume that $w$ is fixed (the frequentist interpretation). The likelihood function maps parameters to probabilities:
$$L: w, \sigma_0^2 \mapsto p(\mathcal{D} \mid w, \sigma_0^2) = p(y \mid X, w, \sigma_0^2) = \prod_n p(y_n \mid x_n, w, \sigma_0^2)$$
Maximizing the likelihood with respect to $w$ minimizes the RSS and yields the LMS solution: $w^{\mathrm{LMS}} = w^{\mathrm{ML}} = \arg\max_w L(w, \sigma_0^2)$.
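The step "maximizing the likelihood minimizes the RSS" is worth spelling out; it is not on the slide, but it follows directly from the Gaussian density assumed above:
$$\log L(w, \sigma_0^2) = \sum_n \log \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\Big(-\frac{(y_n - w^T x_n)^2}{2\sigma_0^2}\Big) = -\frac{1}{2\sigma_0^2}\sum_n (y_n - w^T x_n)^2 + \mathrm{const}$$
Since the constant does not depend on $w$, maximizing $\log L$ over $w$ is the same as minimizing $\sum_n (y_n - w^T x_n)^2 = \mathrm{RSS}(w)$.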

57-65 Probabilistic interpretation of ridge regression

Ridge regression model: $Y = w^T X + \eta$, where $Y \sim N(w^T X, \sigma_0^2)$ is Gaussian (as before), and the $w_d \sim N(0, \sigma^2)$ are i.i.d. Gaussian random variables (unlike before); note that all $w_d$ share the same variance $\sigma^2$. Here $w$ is a random variable (the Bayesian interpretation). To find $w$ given data $\mathcal{D}$, we can compute the posterior distribution of $w$:
$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$$
Maximum a posteriori (MAP) estimate: $w^{\mathrm{MAP}} = \arg\max_w p(w \mid \mathcal{D}) = \arg\max_w p(\mathcal{D} \mid w)\, p(w)$.
What is the relationship between MAP and MLE? MAP reduces to MLE if we assume a uniform prior for $p(w)$. A fully Bayesian treatment considers the entire posterior, not just its mode.

66-68 Estimating w

Let $X_1, \ldots, X_N$ be i.i.d. with $y \mid w, x \sim N(w^T x, \sigma_0^2)$, and let the $w_d$ be i.i.d. with $w_d \sim N(0, \sigma^2)$. The joint likelihood of data and parameters (given $\sigma_0, \sigma$) is
$$p(\mathcal{D}, w) = p(\mathcal{D} \mid w)\, p(w) = \prod_n p(y_n \mid x_n, w) \prod_d p(w_d)$$
Joint log likelihood: plugging in the Gaussian PDFs, we get
$$\log p(\mathcal{D}, w) = \sum_n \log p(y_n \mid x_n, w) + \sum_d \log p(w_d) = -\sum_n \frac{(w^T x_n - y_n)^2}{2\sigma_0^2} - \sum_d \frac{w_d^2}{2\sigma^2} + \mathrm{const}$$
MAP estimate: $w^{\mathrm{MAP}} = \arg\max_w \log p(\mathcal{D}, w)$. As with LMS, set the gradient equal to zero and solve for $w$.
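The step from this log likelihood to the regularized objective on the next slide is worth making explicit: maximizing $\log p(\mathcal{D}, w)$ is equivalent to minimizing its negative, and multiplying through by the positive constant $2\sigma_0^2$ does not change the minimizer:
$$\arg\max_w \log p(\mathcal{D}, w) = \arg\min_w \sum_n (w^T x_n - y_n)^2 + \frac{\sigma_0^2}{\sigma^2} \sum_d w_d^2 = \arg\min_w \sum_n (w^T x_n - y_n)^2 + \lambda \|w\|_2^2, \qquad \lambda = \frac{\sigma_0^2}{\sigma^2}$$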

69-72 Maximum a posteriori (MAP) estimate

Regularized linear regression: a new error function to minimize,
$$E(w) = \sum_n (w^T x_n - y_n)^2 + \lambda \|w\|_2^2$$
where $\lambda > 0$ denotes $\sigma_0^2 / \sigma^2$. The extra term $\|w\|_2^2$ is called the regularization term (regularizer) and controls the model complexity.

Intuitions: If $\lambda \to +\infty$, then $\sigma_0^2 \gg \sigma^2$; that is, the variance of the noise is far greater than what our prior model allows for $w$. In this case our prior on $w$ is more informative than what the data can tell us, so we get a simple model; numerically, $w^{\mathrm{MAP}} \to 0$. If $\lambda \to 0$, then we trust our data more; numerically, $w^{\mathrm{MAP}} \to w^{\mathrm{LMS}} = \arg\min_w \sum_n (w^T x_n - y_n)^2$.

73-74 Closed-form solution

For regularized linear regression, $\arg\min_w \sum_n (w^T x_n - y_n)^2 + \lambda \|w\|_2^2$, the solution changes very little in form from the LMS solution:
$$w^{\mathrm{MAP}} = (X^T X + \lambda I)^{-1} X^T y$$
and it reduces to the LMS solution when $\lambda = 0$, as expected. The gradient and Hessian change only nominally too:
$$\nabla E(w) = 2(X^T X w - X^T y + \lambda w), \qquad H = 2(X^T X + \lambda I)$$
As long as $\lambda \geq 0$, the optimization problem is convex.
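This gradient drops straight into the batch gradient descent loop from earlier; a minimal numpy sketch, with an illustrative step size and iteration count:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, eta=1e-3, num_iters=1000):
    """Minimize E(w) = ||Xw - y||^2 + lam * ||w||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = 2 * (X.T @ (X @ w - y) + lam * w)   # gradient from the slide
        w = w - eta * grad
    return w
```

Since the Hessian $2(X^T X + \lambda I)$ is positive definite for $\lambda > 0$, this converges to the same $w^{\mathrm{MAP}}$ as the closed form for a small enough step size.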

75 Example: fitting data with polynomials

Our regression model: $f(x) = w_0 + \sum_{m=1}^{M} w_m x^m$. Regularization would discourage the large parameter values we saw with the LMS solution, thus potentially preventing overfitting. [Table: fitted coefficients $w_0$ through $w_9$ for $M = 0, 1, 3, 9$, repeated from the earlier overfitting slide.]

76-79 Overfitting in terms of λ

Overfitting is reduced, from a complex model to a simpler one, with the help of an increasing regularizer. [Figure: $M = 9$ fits with no regularization, with $\ln \lambda = -18$, and with $\ln \lambda = 0$.] Plotting $\lambda$ against the residual error shows the difference in model performance on the training and test datasets. [Figure: $E_{\mathrm{RMS}}$ versus $\ln \lambda$ for the training and test sets.]

80 The effect of λ

Large $\lambda$ attenuates the parameters towards 0. [Table: fitted coefficients $w_0$ through $w_9$ for $\ln \lambda = -\infty$, $\ln \lambda = -18$, and $\ln \lambda = 0$.]

81-82 Regularized methods for classification

Adding a regularizer to the cross-entropy functions used for binary and multinomial logistic regression:
$$E(w) = -\sum_n \big\{y_n \log \sigma(w^T x_n) + (1 - y_n) \log[1 - \sigma(w^T x_n)]\big\} + \lambda \|w\|_2^2$$
$$E(w_1, w_2, \ldots, w_K) = -\sum_n \log P(y_n \mid x_n) + \lambda \sum_k \|w_k\|_2^2$$
Numerical optimization: the objective functions remain convex as long as $\lambda \geq 0$; the gradients and Hessians change marginally and can easily be derived.
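A minimal numpy sketch of the regularized binary cross-entropy and its gradient; the feature matrix X, labels y in {0, 1}, and function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cross_entropy(w, X, y, lam):
    """E(w) = -sum_n [y_n log s_n + (1 - y_n) log(1 - s_n)] + lam * ||w||^2."""
    s = sigmoid(X @ w)
    eps = 1e-12                                   # guard against log(0)
    loss = -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))
    return loss + lam * np.sum(w ** 2)

def gradient(w, X, y, lam):
    """Gradient of the regularized cross-entropy: X^T (sigma(Xw) - y) + 2 * lam * w."""
    return X.T @ (sigmoid(X @ w) - y) + 2 * lam * w
```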

83-84 How to choose the right amount of regularization?

Can we tune λ on the training dataset? No: doing so will always set λ to zero, i.e., no regularization, defeating our intention of controlling model complexity. λ is thus a hyperparameter. To tune it, we can use a validation set or do cross-validation (as previously discussed).
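A minimal sketch of tuning λ on a held-out validation set, reusing the ridge closed form from above; the candidate grid and the 80/20 split are illustrative choices:

```python
import numpy as np

def tune_lambda(X, y, candidates=(1e-3, 1e-2, 1e-1, 1.0, 10.0), seed=0):
    """Pick the lambda whose ridge fit has the lowest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    split = int(0.8 * len(y))
    train, val = idx[:split], idx[split:]
    best_lam, best_err = None, np.inf
    for lam in candidates:
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[train].T @ y[train])
        err = np.mean((X[val] @ w - y[val]) ** 2)   # mean squared validation error
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```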

85 Outline: 1 Administration; 2 Review of last lecture; 3 Basic ideas to overcome overfitting; 4 Bias/Variance Analysis

86-87 Basic and important machine learning concepts

Supervised learning: we aim to build a function $h(x)$ to predict the true value $y$ associated with $x$. If we make a mistake, we incur a loss $\ell(h(x), y)$. Example: the quadratic loss function for regression, where $y$ is continuous, $\ell(h(x), y) = [h(x) - y]^2$. [Figure: the quadratic loss as a function of $h(x)$ when $y = 0$.]

88 Other types of loss functions

For classification: the cross-entropy loss (also called logistic loss), $\ell(h(x), y) = -y \log h(x) - (1 - y) \log[1 - h(x)]$. [Figure: the cross-entropy loss as a function of $h(x)$ when $y = 1$.]

89-91 Measuring how good our predictor is

Risk: given the true distribution of the data $p(x, y)$, the risk is
$$R[h(x)] = \int_{x, y} \ell(h(x), y)\, p(x, y)\, dx\, dy$$
However, we cannot compute $R[h(x)]$, so we use the empirical risk, given a training dataset $\mathcal{D}$:
$$R^{\mathrm{emp}}[h(x)] = \frac{1}{N} \sum_n \ell(h(x_n), y_n)$$
Intuitively, as $N \to +\infty$, $R^{\mathrm{emp}}[h(x)] \to R[h(x)]$.
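A minimal numpy sketch of the empirical risk under the quadratic loss from the earlier slide; the predictor h is passed in as a plain Python function, and all names are illustrative:

```python
import numpy as np

def quadratic_loss(pred, y):
    """l(h(x), y) = (h(x) - y)^2."""
    return (pred - y) ** 2

def empirical_risk(h, X, y, loss=quadratic_loss):
    """R_emp[h] = (1/N) * sum_n l(h(x_n), y_n)."""
    preds = np.array([h(x) for x in X])
    return np.mean(loss(preds, y))
```

For example, empirical_risk(lambda x: x @ w_lms, X_tilde, y) would evaluate the LMS fit from the first sketch on its own training data.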

92-94 How does this relate to what we have learned?

So far, we have been doing empirical risk minimization (ERM): for linear regression, $h(x) = w^T x$ and we use the squared loss; for logistic regression, $h(x) = \sigma(w^T x)$ and we use the cross-entropy loss. ERM can be problematic: if $h(x)$ is complicated enough, $R^{\mathrm{emp}}[h(x)] \to 0$, but then $h(x)$ is unlikely to predict well outside the training dataset $\mathcal{D}$. This is called poor generalization, or overfitting. We have just discussed approaches to address this issue; we will explore why regularization might work in the context of the bias-variance tradeoff, focusing on regression and the squared loss.

95 Bias/variance tradeoff (looking ahead)

The expected error decomposes into three terms:
$$\mathbb{E}_{\mathcal{D}}\, R[h_{\mathcal{D}}(x)] = \text{variance} + \text{bias}^2 + \text{noise}$$
We will prove this result and interpret what it means.
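As a preview (the proof comes later, but this is the standard statement for the squared loss), writing $\bar{h}(x) = \mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)]$ for the average predictor and $\bar{y}(x) = \mathbb{E}[y \mid x]$ for the optimal regression function, the decomposition reads:
$$\mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{x,y}\big[(h_{\mathcal{D}}(x) - y)^2\big] = \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \bar{h}(x))^2\big]}_{\text{variance}} + \underbrace{\mathbb{E}_{x}\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{bias}^2} + \underbrace{\mathbb{E}_{x,y}\big[(y - \bar{y}(x))^2\big]}_{\text{noise}}$$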


More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

MLCC 2017 Regularization Networks I: Linear Models

MLCC 2017 Regularization Networks I: Linear Models MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017 About this class We introduce a class of learning algorithms based on Tikhonov regularization We study computational

More information

Linear and logistic regression

Linear and logistic regression Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

ECE521 Tutorial 2. Regression, GPs, Assignment 1. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel for the slides.

ECE521 Tutorial 2. Regression, GPs, Assignment 1. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel for the slides. ECE521 Tutorial 2 Regression, GPs, Assignment 1 ECE521 Winter 2016 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel for the slides. ECE521 Tutorial 2 ECE521 Winter 2016 Credits to Alireza / 3 Outline

More information

Classification. Sandro Cumani. Politecnico di Torino

Classification. Sandro Cumani. Politecnico di Torino Politecnico di Torino Outline Generative model: Gaussian classifier (Linear) discriminative model: logistic regression (Non linear) discriminative model: neural networks Gaussian Classifier We want to

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam. CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators

More information

Least Squares Regression

Least Squares Regression E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Linear Regression. Machine Learning CSE546 Kevin Jamieson University of Washington. Oct 5, Kevin Jamieson 1

Linear Regression. Machine Learning CSE546 Kevin Jamieson University of Washington. Oct 5, Kevin Jamieson 1 Linear Regression Machine Learning CSE546 Kevin Jamieson University of Washington Oct 5, 2017 1 The regression problem Given past sales data on zillow.com, predict: y = House sale price from x = {# sq.

More information

Machine learning - HT Basis Expansion, Regularization, Validation

Machine learning - HT Basis Expansion, Regularization, Validation Machine learning - HT 016 4. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford Feburary 03, 016 Outline Introduce basis function to go beyond linear regression Understanding

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information