Ridge Regression. Szymon Bobek. Institute of Applied Computer Science, AGH University of Science and Technology


1 Ridge Regression. Szymon Bobek, Institute of Applied Computer Science, AGH University of Science and Technology. Based on Carlos Guestrin and Emily Fox slides from the Coursera Specialization on Machine Learning

2 Outline I 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

3 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

4 Expected value. What is expected value? In probability theory, the expected value of a random variable is, intuitively, the long-run average value of repetitions of the experiment it represents. For instance, the expected value of a dice roll is 3.5. Why? For discrete random variables: $E[X] = \sum_{i} x_i\, p(x_i)$. For equally probable outcomes, it is just an average. For the continuous case: $E[X] = \int x\, p(x)\, dx$, where $p(x)$ is the probability density function. Note that usually $E[XY] \neq E[X]\, E[Y]$.
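A quick way to see the long-run-average interpretation is to simulate it. The short sketch below (plain numpy, my own illustration, not from the slides) compares the analytic $E[X] = \sum_i x_i p(x_i)$ for a fair die with the empirical mean of many simulated rolls.

import numpy as np

outcomes = np.arange(1, 7)              # faces of a fair die
probs = np.full(6, 1.0 / 6.0)           # equally probable outcomes

analytic = np.sum(outcomes * probs)     # E[X] = sum_i x_i p(x_i) = 3.5
rolls = np.random.default_rng(0).integers(1, 7, size=100_000)
empirical = rolls.mean()                # long-run average of repetitions

print(analytic, empirical)              # both are close to 3.5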

5 Expected value. What is expected value? In probability theory, the expected value of a random variable is, intuitively, the long-run average value of repetitions of the experiment it represents. For instance, the expected value of a dice roll is 3.5. Why? Properties (some): $E[c] = c$, $E[E[X]] = E[X]$, $E[X + c] = E[X] + c$, $E[X + Y] = E[X] + E[Y]$, $E[aX] = a\, E[X]$. Note that usually $E[XY] \neq E[X]\, E[Y]$.

6 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

7 Problem formulation (figure: training data points, axes x vs. t)

8 Hypothetical, ideal function and noise (figure: data points and the function f(x), axes x vs. t)

9 But which one is ideal? (figure: data points and several candidate functions f(x), axes x vs. t)

10 But which one is ideal? (figure: data points and several candidate functions f(x), axes x vs. t)

11 Gaussian interpretation (figure: data points and f(x), axes x vs. t)

12 Gaussian interpretation (figure: data points and f(x), axes x vs. t). Prediction is a linear function plus noise: $y = f(x) + \epsilon$. We assume that the noise $\epsilon$ is drawn from a normal distribution: $\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$. We assume that the training samples are i.i.d. We want to learn $P(y \mid \theta, x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - h_\theta(x))^2}{2\sigma^2}}$. So what is $\mu$? The best $\theta$ is the one for which the probability of the training set is maximal: $\arg\max_\theta P(D \mid \theta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \prod_{j}^{N} e^{-\frac{(y_j - h_\theta(x_j))^2}{2\sigma^2}}$
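To make the link between the Gaussian noise model and least squares concrete, here is a small illustration of my own (not from the slides): for a fixed $\sigma^2$, the negative log-likelihood differs from the RSS only by a constant offset and a positive scale, so the $\theta$ that maximizes $P(D \mid \theta, \sigma^2)$ is exactly the $\theta$ that minimizes the RSS.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + rng.normal(0, 0.3, size=50)   # noisy linear data
X = np.column_stack([np.ones_like(x), x])

def rss(theta):
    return np.sum((X @ theta - y) ** 2)

def neg_log_likelihood(theta, sigma2=0.09):
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + rss(theta) / (2 * sigma2)

theta_a, theta_b = np.array([0.4, 1.8]), np.array([0.5, 2.0])
# the orderings agree: lower RSS <=> lower negative log-likelihood
print(rss(theta_a) > rss(theta_b), neg_log_likelihood(theta_a) > neg_log_likelihood(theta_b))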

13 Expected value (figure: data points, axes x vs. t): $f^*(x) = E_Y[y \mid x]$

14 Expected value (figure: data points, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

15 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression 5 Choosing lambda

16 Average of complex function (figure, axes x vs. t)

17 Average of complex function (figure, axes x vs. t)

18 Average of complex function (figure, axes x vs. t)

19 Average of complex function (figure, axes x vs. t)

20 Average of complex function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

21 Average of complex function (figure)

22 Average of simple function (figure, axes x vs. t)

23 Average of simple function (figure, axes x vs. t)

24 Average of simple function (figure, axes x vs. t)

25 Average of simple function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

26 Average of simple function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

27 Bias and variance (two figures, axes x vs. t)

28 Bias and variance (two figures, axes x vs. t). High bias: the difference from the average to the ideal is large. Low variance: the difference between particular models and their average is low.

29 Bias and variance (two figures, axes x vs. t). High bias: the difference from the average to the ideal is large. Low variance: the difference between particular models and their average is low. Low bias: the difference from the average to the ideal is low. High variance: the difference between particular models and their average is large.

30 Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

31 Loss function. Linear regression objective: maximize the log likelihood, or minimize the RSS: $J(\theta) = \sum_{i}^{N}\{y(x_i) - h_\theta(x_i)\}^2$. We also know that the data is noisy, $y(x) = f(x) + \epsilon$, so even if we knew the ideal function we could not do better than approach $f$, which will still be wrong on individual samples. In the perfect case $E_D[h_\theta(x; D)] = f(x)$, but usually this does not hold (see the constant function). So the part of the loss that we have impact on can be expressed as the difference between the ideal function and ours, plus the variance of the data: $L(y, h_\theta(x)) = E_X[\{h_\theta(x) - f(x)\}^2] + \sigma^2$

32 Expected loss. In general: $E[L] = E_D\, E_X[L(y, h_\theta(x; D))]$. For our particular case: $E[L] = E_X\, E_D[\{h_\theta(x; D) - f(x)\}^2] + \sigma^2$. Let us focus on the big picture: $E_D[\{h_\theta(x; D) - f(x)\}^2] = \ldots$

33 Bias, variance, noise decomposition (two figures, axes x vs. t): $E[L] = E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2] + \{E_D[h_\theta(x; D)] - f(x)\}^2 + \sigma^2$. The first term is the variance (the average vs. its components), the second is the squared bias (the average vs. the ideal), the third is the noise.
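The decomposition can be checked numerically. The sketch below is my own illustration under assumed choices (a known $f(x)$, Gaussian noise, degree-3 polynomials fit by least squares): draw many training sets $D$, fit $h_\theta(x; D)$ on each, and estimate the variance (models vs. their average), the squared bias (average vs. ideal) and the noise on a grid of test points.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)           # the "ideal" function, known only in a simulation
sigma = 0.3                                   # noise standard deviation
x_test = np.linspace(0, 1, 100)
degree, n_train, n_datasets = 3, 25, 500

preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)  # a fresh dataset D
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    preds[d] = np.polyval(coefs, x_test)

avg = preds.mean(axis=0)                      # E_D[h(x; D)]
variance = np.mean((preds - avg) ** 2)        # E_X E_D[{h - E_D h}^2]
bias2 = np.mean((avg - f(x_test)) ** 2)       # E_X[{E_D h - f}^2]
print(bias2, variance, sigma ** 2)            # the three terms of the expected loss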

34 How can we reduce the loss? Problems: P1 Noise: $E_D\, E_X[\{f(x) - y(x; D)\}^2]$, the ideal function (unknown) vs. the data (given). Not much we can do. P2 Bias: $\{E_D[h_\theta(x; D)] - f(x)\}^2$, how close we are to the ideal function wrt. the model type (i.e. linear, quadratic, polynomial). P3 Variance: $E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2]$, how sensitive we are to the training data, how robust the algorithm is; if we get a different dataset, will the model be similar? Solutions: P1 - No solution :) P2 - If you are far away from the ideal, change the model to a more complex one; more data will help, but not much. P3 - If you are very sensitive to the data, change the model to a simpler one, or get more data (the average of complex models is close to the ideal).

35 Bias, variance and error (two figures: true error and test error vs. number of training examples)

36 Bias, variance and error (two figures: true error and train error vs. number of training examples, with the gaps labelled variance and bias)

37 OK, so let us get the most complex model ever (figure: true error and train error vs. number of training examples)

38 OK, so let us get the most complex model ever (figure: true error and train error vs. number of training examples)

39 Bias, variance, model complexity (figure: prediction error vs. model complexity; underfitting = high bias, low variance; overfitting = high variance, low bias; the training error keeps decreasing with complexity, while the testing error (private LB) and validation error (local CV) reach the lowest generalization error at intermediate complexity)

40 Bias, variance, model complexity (same figure as above). Hmmm... Can we make the algorithm find the balance automatically?

41 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

42 Expected loss revisited. Problems: P1 Noise: $E_D\, E_X[\{f(x) - y(x; D)\}^2]$, the ideal function (unknown) vs. the data (given). Not much we can do. P2 Bias: $\{E_D[h_\theta(x; D)] - f(x)\}^2$, how close we are to the ideal function wrt. the model type (i.e. linear, quadratic, polynomial). P3 Variance: $E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2]$, how sensitive we are to the training data, how robust the algorithm is; if we get a different dataset, will the model be similar? Perfect situation: have a very complex model (to have a lot of flexibility in reducing bias), and let the algorithm keep the variance low. How to keep the variance low? Well... by reducing the space of possible models :)

43 Reducing the search space for the model (figures: several fits, axes x vs. t). How to do this? Naive way: all subsets and greedy algorithms (reduce the search space in the feature set). Do not let the coefficients θ grow too much (L2 penalty, reduces the search space in coefficient values). Try to figure out which coefficients θ are not useful and set them to 0 (L1 penalty, a smart naive way).

44 Naive approach. All subsets: Start with no features and measure J(θ). Search over all features and pick the one that has the lowest J. Search for the two best features in the set of features that has the lowest J. Continue until there is no significant improvement in J. Forward/backward stepwise: Start with no features and measure J(θ). Search over all features and pick the one that has the lowest J. Keep the previously best feature and select the second best. Continue until there is no significant improvement in J.
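A minimal sketch of the forward stepwise loop (my own code and helper names, with the training RSS of a least-squares fit playing the role of J): at each round, try adding each remaining feature to the ones already selected, keep the one with the lowest J, and stop when the improvement becomes negligible.

import numpy as np

def rss(X, y, cols):
    # residual sum of squares of a least-squares fit on the selected columns (plus intercept)
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((A @ theta - y) ** 2)

def forward_stepwise(X, y, tol=1e-3):
    selected, remaining = [], list(range(X.shape[1]))
    best = rss(X, y, selected)                      # start with no features
    while remaining:
        scores = {j: rss(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < tol:             # no significant improvement
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected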

45 All subsets, forward/backward stepwise (figure: RSS vs. number of features, with feature names #bedrooms, sq. meters, #showers, #floors, year)

46 All subsets, forward/backward stepwise (figure: RSS vs. number of features, with feature names #bedrooms, sq. meters, #showers, #floors, year)

47 Why not go brute force? Question: We have a very simple problem (15 possible features). We use linear regression and want to select features with the all-subsets approach. How many models do we have to evaluate?

48 Why not go brute force? Question: We have a very simple problem (15 possible features). We use linear regression and want to select features with the forward stepwise approach. How many models do we have to evaluate?
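For a sense of scale, the two counts can be worked out directly (a back-of-the-envelope sketch, counting one fitted model as the unit of work): all subsets of d features means $2^d$ candidate models, while forward stepwise fits at most d models in the first round, d - 1 in the second, and so on.

d = 15
all_subsets = 2 ** d                    # every subset of the 15 features: 32768 models
forward_stepwise = d * (d + 1) // 2     # 15 + 14 + ... + 1 = 120 models at most
print(all_subsets, forward_stepwise)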

49 How to measure L1, L2 penalties. Sum of coefficients... no :) (L1) Sum of absolute values of the coefficients: $L1 = |\theta_1| + |\theta_2| + \ldots + |\theta_n|$. (L2) Sum of squared values of the coefficients: $L2 = \theta_1^2 + \theta_2^2 + \ldots + \theta_n^2$. (L3, L4) Is there something like that? How to minimize it? Add it to the cost function! But... will the cost function still be convex? How will I calculate the gradient for such a cost function?

50 Linear regression with L2 penalty (figure: contour plot over $\theta_1$, $\theta_2$). Linear regression function: $h_\theta(x) = \sum_{i}^{N} \theta_i x_i = \Theta x$. Cost function: $J(\theta) = E_x[(h_\theta(x) - y(x))^2] = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Cost function with regularization: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}\theta_i^2$

51 Gradient for regularized regression (figure: contour plot over $\theta_1$, $\theta_2$). For $j = 0$: $\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{N}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$. For $j \geq 1$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{N}\theta_j$
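Putting the two gradient formulas into a batch gradient-descent loop gives something like the sketch below (my own code; the step size alpha and the iteration count are arbitrary choices, and the intercept $\theta_0$ is left out of the penalty as on the slide).

import numpy as np

def ridge_gradient_descent(X, y, lam, alpha=0.1, n_iters=5000):
    # X: (N, D) design matrix whose first column is all ones (the intercept feature)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / N          # 1/N * sum_i (h(x_i) - y_i) x_ij
        grad[1:] += (lam / N) * theta[1:]         # + lambda/N * theta_j for j >= 1 only
        theta -= alpha * grad
    return theta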

52 Contour plot. White circles represent the path of gradient descent for MSE only. Yellow triangles represent the gradient descent path for the L2 penalty only, for $\lambda \to \infty$. White circles represent the gradient descent path for the combined cost function MSE + L2.

53 Coefficients path

54 Why not set small coefficients to 0?

55 Why not set small coefficients to 0?

56 Why not set small coefficients to 0?

57 Why not set small coefficients to 0?

58 Linear regression with L1 penalty (figure: contour plot over $\theta_1$, $\theta_2$). Linear regression function: $h_\theta(x) = \sum_{i}^{N} \theta_i x_i = \Theta x$. Cost function: $J(\theta) = E_x[(h_\theta(x) - y(x))^2] = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Cost function with L1 penalty: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}|\theta_i|$

59 Contour plot. White circles represent the path of gradient descent for MSE only. Yellow triangles represent the gradient descent path for the L1 penalty only, for $\lambda \to \infty$. White circles represent the gradient descent path for the combined cost function MSE + L1.

60 (Sub)Gradient for the L1 penalty. Cost function with L1 penalty: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}|\theta_i|$. Let us focus on the gradient for a single $\theta_j$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$

61 Solution: subgradients (figure: a function f(x) with lines $g(x) = f(x_0) + v(x - x_0)$ and $g(x) = f(x_0) + c(x - x_0)$ through the point $(x_0, f(x_0))$)

62 Solution: subgradients (figure: lines $f(x_0) + v(x - x_0)$ with the subgradient set $V = [-1; 1]$ at $x_0$)

63 Solution: subgradient (figure as on the previous slide). With features normalized as $\bar{x}_j^{(i)} = \frac{x_j^{(i)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$, where $\frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \begin{cases} -\frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{\lambda}{2N}; \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases}$

64 Another trick. Split off the contribution of $\theta_j$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k\neq j}^{D}\theta_k x_k^{(i)} + \theta_j x_j^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \underbrace{\frac{1}{N}\sum_{i}^{N}\left(\sum_{k\neq j}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)}}_{-\frac{1}{N}\rho_j} + \theta_j\,\frac{1}{N}\sum_{i}^{N}\left(x_j^{(i)}\right)^2 + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$, with $\rho_j = \sum_{i}^{N}\left(y^{(i)} - \sum_{k\neq j}^{D}\theta_k x_k^{(i)}\right)x_j^{(i)}$.

65 Sum up. With normalized features ($\sum_{i}^{N}(x_j^{(i)})^2 = 1$): $\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{N}\rho_j + \frac{1}{N}\theta_j + \begin{cases} -\frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{\lambda}{2N}; \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases} = \begin{cases} \frac{1}{N}(\theta_j - \rho_j) - \frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{1}{N}\rho_j - \frac{\lambda}{2N}; -\frac{1}{N}\rho_j + \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{1}{N}(\theta_j - \rho_j) + \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases}$

66 Optimal solution = set the gradient to zero: $\frac{1}{N}(\theta_j - \rho_j) - \frac{\lambda}{2N} = 0$ if $\theta_j < 0$; the interval $\left[-\frac{1}{N}\rho_j - \frac{\lambda}{2N}; -\frac{1}{N}\rho_j + \frac{\lambda}{2N}\right]$ has to contain 0 if $\theta_j = 0$; $\frac{1}{N}(\theta_j - \rho_j) + \frac{\lambda}{2N} = 0$ if $\theta_j > 0$.

67 Optimal solution = set the gradient to zero. Therefore: $\theta_j = \rho_j + \frac{\lambda}{2}$ if $\rho_j < -\frac{\lambda}{2}$; $\theta_j = 0$ if $\rho_j \in \left[-\frac{\lambda}{2}; \frac{\lambda}{2}\right]$; $\theta_j = \rho_j - \frac{\lambda}{2}$ if $\rho_j > \frac{\lambda}{2}$.

68 Coefficients path

69 Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

70 Optimize one coordinate at a time

71 Limitations

72 Coordinate descent for simple, ridge and lasso regression. Algorithm: until not converged, select $\theta_j$ (round robin, random, etc.); calculate $\rho_j = \sum_{i}^{N}\left(y^{(i)} - \sum_{k\neq j}^{D}\theta_k x_k^{(i)}\right)x_j^{(i)}$ (with normalized features, so that $\sum_{i}^{N}(x_j^{(i)})^2 = 1$); update $\theta_j$. For simple regression: $\theta_j = \rho_j$. For ridge: $\theta_j = \frac{\rho_j}{2\lambda + 1}$. For lasso: $\theta_j = \rho_j + \frac{\lambda}{2}$ if $\rho_j < -\frac{\lambda}{2}$; $\theta_j = 0$ if $\rho_j \in \left[-\frac{\lambda}{2}; \frac{\lambda}{2}\right]$; $\theta_j = \rho_j - \frac{\lambda}{2}$ if $\rho_j > \frac{\lambda}{2}$.
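The whole procedure fits in a few lines. The sketch below is my own code (the function names are mine); it assumes the columns of X have been normalized to unit norm as on the earlier normalization slide, uses the $\rho_j$ and the $\pm\lambda/2$ thresholds in the slide's notation, and visits coordinates round robin for a fixed number of passes instead of a proper convergence test.

import numpy as np

def soft_threshold(rho, lam):
    if rho < -lam / 2:
        return rho + lam / 2
    if rho > lam / 2:
        return rho - lam / 2
    return 0.0                                     # rho in [-lam/2, lam/2]

def lasso_coordinate_descent(X, y, lam, n_passes=100):
    # X: (N, D) with unit-norm columns, i.e. sum_i x_ij^2 = 1 for every j
    D = X.shape[1]
    theta = np.zeros(D)
    for _ in range(n_passes):                      # "until not converged", simplified
        for j in range(D):                         # round-robin coordinate choice
            partial_residual = y - X @ theta + theta[j] * X[:, j]
            rho_j = X[:, j] @ partial_residual     # correlation with the partial residual
            theta[j] = soft_threshold(rho_j, lam)
    return theta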

73 Lasso soft thresholding (figure: the updated $\theta_j$ as a function of $\rho_j$ for simple regression, ridge and lasso; the lasso curve is zero between $-\frac{\lambda}{2}$ and $\frac{\lambda}{2}$)

74 Lasso pros and cons. No step size! Converges to the optimum for strongly convex functions. In some cases it will not converge (the case of non-differentiable cost functions). It shrinks coefficients relative to RSS, so it produces more bias and less variance. Run lasso to select features, then run simple/ridge regression with only the selected features.

75 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

76 Training/Test/Cross-Validation (figure: all the data you have, split 60% / 20% / 20%). On the 60% part, fit θ with respect to some λ; on the first 20% part, test different λ; on the last 20% part, test the generalization error for θ.
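In code the split-based procedure looks roughly like the sketch below (my own code; fit_ridge stands for any regularized fitting routine and is passed in as a parameter, mse is a helper of mine): fit $\theta$ on the training part for each candidate $\lambda$, pick the $\lambda$ with the lowest validation error, and report the generalization error once, on the held-out test part.

import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def choose_lambda(X, y, fit_ridge, lambdas, seed=0):
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    tr, va, te = idx[:int(.6 * N)], idx[int(.6 * N):int(.8 * N)], idx[int(.8 * N):]
    scores = {}
    for lam in lambdas:                                   # test different lambdas
        theta = fit_ridge(X[tr], y[tr], lam)              # fit theta for this lambda
        scores[lam] = mse(theta, X[va], y[va])            # validation error
    best = min(scores, key=scores.get)
    theta = fit_ridge(X[tr], y[tr], best)
    return best, mse(theta, X[te], y[te])                 # generalization estimate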

77 Cross-validation techniques: k-fold (figure: all the data split into K parts; each part is used once as the test fold)

78 Cross-validation techniques: leave-one-out (figure: N splits, each leaving a single example out for testing)

79 Cross-validation techniques: leave-p-out (figure: $\binom{n}{p}$ splits, each leaving p examples out for testing)

80 Cross-validation techniques: Monte Carlo (figure: K random train/test splits)

81 Cross-validation techniques: Bootstrap (figure: from the original data $x_1, \ldots, x_7$, K training sets are drawn with replacement, e.g. $x_1\, x_1\, x_2\, x_5\, x_3\, x_4\, x_6$, and the examples not drawn are used for testing)
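A bootstrap round is just sampling row indices with replacement. The sketch below (my own illustration) builds K resampled training sets and, for each, uses the examples that were never drawn as the test set.

import numpy as np

def bootstrap_splits(n_samples, K, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(K):
        train = rng.integers(0, n_samples, size=n_samples)   # draw indices with replacement
        test = np.setdiff1d(np.arange(n_samples), train)      # examples never drawn
        yield train, test

for train_idx, test_idx in bootstrap_splits(7, K=3):
    print(train_idx, test_idx)    # e.g. a multiset like [1 1 2 5 3 4 6] plus the left-out indices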

82 Lasso and ridge vs. normalization. Ridge and lasso put a penalty on the value of θ. The value of θ depends on the magnitude of the gradient. We multiply each gradient by $x_j^{(i)}$, making θ dependent on the magnitude of $x_j^{(i)}$. The penalty is therefore dependent on the magnitude of x... :/

83 Lasso and ridge vs. normalization. Normalization: $\bar{x}_j^{(i)} = \frac{x_j^{(i)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$. When testing and using the model, you need to normalize in the same way: $\bar{x}_j^{(test)} = \frac{x_j^{(test)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$ (note that the denominator still uses the training data).
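As a sketch (my own code), the important detail is that the per-feature norms are computed once on the training matrix and then reused for any test or production data.

import numpy as np

def fit_normalizer(X_train):
    return np.sqrt(np.sum(X_train ** 2, axis=0))   # z_j = sqrt(sum_i (x_j^(i))^2) per feature

def normalize(X, norms):
    return X / norms                                # the same scaling for train and test

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 3)), rng.normal(size=(10, 3))
norms = fit_normalizer(X_train)
X_train_n, X_test_n = normalize(X_train, norms), normalize(X_test, norms)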

84 Demo

85 Thank you! Szymon Bobek, Institute of Applied Computer Science, AGH University of Science and Technology, 21 March
