Ridge Regression. Szymon Bobek. Institute of Applied Computer Science, AGH University of Science and Technology


1 Ridge Regression. Szymon Bobek, Institute of Applied Computer Science, AGH University of Science and Technology. Based on Carlos Guestrin and Emily Fox slides from the Coursera Specialization on Machine Learning

2 Outline I 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

3 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

4 Expected value. What is expected value? In probability theory, the expected value of a random variable is, intuitively, the long-run average value of repetitions of the experiment it represents. For instance, the expected value of a dice roll is 3.5. Why? For discrete random variables: $E[X] = \sum_{i} x_i\, p(x_i)$. For equally probable outcomes, it is just an average. For the continuous case: $E[X] = \int x\, p(x)\, dx$, where $p(x)$ is the probability density function. Note that usually $E[XY] \neq E[X]\, E[Y]$.
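A quick way to see the long-run-average interpretation is to simulate it. The short sketch below (plain numpy, my own illustration, not from the slides) compares the analytic $E[X] = \sum_i x_i p(x_i)$ for a fair die with the empirical mean of many simulated rolls.

import numpy as np

outcomes = np.arange(1, 7)              # faces of a fair die
probs = np.full(6, 1.0 / 6.0)           # equally probable outcomes

analytic = np.sum(outcomes * probs)     # E[X] = sum_i x_i p(x_i) = 3.5
rolls = np.random.default_rng(0).integers(1, 7, size=100_000)
empirical = rolls.mean()                # long-run average of repetitions

print(analytic, empirical)              # both are close to 3.5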

5 Expected value. What is expected value? In probability theory, the expected value of a random variable is, intuitively, the long-run average value of repetitions of the experiment it represents. For instance, the expected value of a dice roll is 3.5. Why? Properties (some): $E[c] = c$, $E[E[X]] = E[X]$, $E[X + c] = E[X] + c$, $E[X + Y] = E[X] + E[Y]$, $E[aX] = a\, E[X]$. Note that usually $E[XY] \neq E[X]\, E[Y]$.

6 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

7 Problem formulation (figure: training data points, axes x vs. t)

8 Hypothetical, ideal function and noise (figure: data points and the function f(x), axes x vs. t)

9 But which one is ideal? (figure: data points and several candidate functions f(x), axes x vs. t)

10 But which one is ideal? (figure: data points and several candidate functions f(x), axes x vs. t)

11 Gaussian interpretation (figure: data points and f(x), axes x vs. t)

12 Gaussian interpretation (figure: data points and f(x), axes x vs. t). Prediction is a linear function plus noise: $y = f(x) + \epsilon$. We assume that the noise $\epsilon$ is drawn from a normal distribution: $\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$. We assume that the training samples are i.i.d. We want to learn $P(y \mid \theta, x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - h_\theta(x))^2}{2\sigma^2}}$. So what is $\mu$? The best $\theta$ is the one for which the probability of the training set is maximal: $\arg\max_\theta P(D \mid \theta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \prod_{j}^{N} e^{-\frac{(y_j - h_\theta(x_j))^2}{2\sigma^2}}$
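To make the link between the Gaussian noise model and least squares concrete, here is a small illustration of my own (not from the slides): for a fixed $\sigma^2$, the negative log-likelihood differs from the RSS only by a constant offset and a positive scale, so the $\theta$ that maximizes $P(D \mid \theta, \sigma^2)$ is exactly the $\theta$ that minimizes the RSS.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + rng.normal(0, 0.3, size=50)   # noisy linear data
X = np.column_stack([np.ones_like(x), x])

def rss(theta):
    return np.sum((X @ theta - y) ** 2)

def neg_log_likelihood(theta, sigma2=0.09):
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + rss(theta) / (2 * sigma2)

theta_a, theta_b = np.array([0.4, 1.8]), np.array([0.5, 2.0])
# the orderings agree: lower RSS <=> lower negative log-likelihood
print(rss(theta_a) > rss(theta_b), neg_log_likelihood(theta_a) > neg_log_likelihood(theta_b))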

13 Expected value (figure: data points, axes x vs. t): $f^*(x) = E_Y[y \mid x]$

14 Expected value (figure: data points, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

15 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression 5 Choosing lambda

16 Average of complex function (figure, axes x vs. t)

17 Average of complex function (figure, axes x vs. t)

18 Average of complex function (figure, axes x vs. t)

19 Average of complex function (figure, axes x vs. t)

20 Average of complex function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

21 Average of complex function (figure)

22 Average of simple function (figure, axes x vs. t)

23 Average of simple function (figure, axes x vs. t)

24 Average of simple function (figure, axes x vs. t)

25 Average of simple function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

26 Average of simple function (figure, axes x vs. t): $f(x) = E_D\, E_Y[y \mid x]$

27 Bias and variance (two figures, axes x vs. t)

28 Bias and variance (two figures, axes x vs. t). High bias: the difference from the average to the ideal is large. Low variance: the difference between particular models and their average is low.

29 Bias and variance (two figures, axes x vs. t). High bias: the difference from the average to the ideal is large. Low variance: the difference between particular models and their average is low. Low bias: the difference from the average to the ideal is low. High variance: the difference between particular models and their average is large.

30 Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

31 Loss function. Linear regression objective: maximize the log likelihood, or minimize the RSS: $J(\theta) = \sum_{i}^{N}\{y(x_i) - h_\theta(x_i)\}^2$. We also know that the data is noisy, $y(x) = f(x) + \epsilon$, so even if we knew the ideal function we could not do better than approach $f$, which will still be wrong on individual samples. In the perfect case $E_D[h_\theta(x; D)] = f(x)$, but usually this does not hold (see the constant function). So the part of the loss that we have impact on can be expressed as the difference between the ideal function and ours, plus the variance of the data: $L(y, h_\theta(x)) = E_X[\{h_\theta(x) - f(x)\}^2] + \sigma^2$

32 Expected loss. In general: $E[L] = E_D\, E_X[L(y, h_\theta(x; D))]$. For our particular case: $E[L] = E_X\, E_D[\{h_\theta(x; D) - f(x)\}^2] + \sigma^2$. Let us focus on the big picture: $E_D[\{h_\theta(x; D) - f(x)\}^2] = \ldots$

33 Bias, variance, noise decomposition (two figures, axes x vs. t): $E[L] = E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2] + \{E_D[h_\theta(x; D)] - f(x)\}^2 + \sigma^2$. The first term is the variance (the average vs. its components), the second is the squared bias (the average vs. the ideal), the third is the noise.
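The decomposition can be checked numerically. The sketch below is my own illustration under assumed choices (a known $f(x)$, Gaussian noise, degree-3 polynomials fit by least squares): draw many training sets $D$, fit $h_\theta(x; D)$ on each, and estimate the variance (models vs. their average), the squared bias (average vs. ideal) and the noise on a grid of test points.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)           # the "ideal" function, known only in a simulation
sigma = 0.3                                   # noise standard deviation
x_test = np.linspace(0, 1, 100)
degree, n_train, n_datasets = 3, 25, 500

preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)  # a fresh dataset D
    coefs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    preds[d] = np.polyval(coefs, x_test)

avg = preds.mean(axis=0)                      # E_D[h(x; D)]
variance = np.mean((preds - avg) ** 2)        # E_X E_D[{h - E_D h}^2]
bias2 = np.mean((avg - f(x_test)) ** 2)       # E_X[{E_D h - f}^2]
print(bias2, variance, sigma ** 2)            # the three terms of the expected loss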

34 How can we reduce the loss? Problems: P1 Noise: $E_D\, E_X[\{f(x) - y(x; D)\}^2]$, the ideal function (unknown) vs. the data (given). Not much we can do. P2 Bias: $\{E_D[h_\theta(x; D)] - f(x)\}^2$, how close we are to the ideal function wrt. the model type (i.e. linear, quadratic, polynomial). P3 Variance: $E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2]$, how sensitive we are to the training data, how robust the algorithm is; if we get a different dataset, will the model be similar? Solutions: P1 - No solution :) P2 - If you are far away from the ideal, change the model to a more complex one; more data will help, but not much. P3 - If you are very sensitive to the data, change the model to a simpler one, or get more data (the average of complex models is close to the ideal).

35 Bias, variance and error (two figures: true error and test error vs. number of training examples)

36 Bias, variance and error (two figures: true error and train error vs. number of training examples, with the gaps labelled variance and bias)

37 OK, so let us get the most complex model ever (figure: true error and train error vs. number of training examples)

38 OK, so let us get the most complex model ever (figure: true error and train error vs. number of training examples)

39 Bias, variance, model complexity (figure: prediction error vs. model complexity; underfitting = high bias, low variance; overfitting = high variance, low bias; the training error keeps decreasing with complexity, while the testing error (private LB) and validation error (local CV) reach the lowest generalization error at intermediate complexity)

40 Bias, variance, model complexity (same figure as above). Hmmm... Can we make the algorithm find the balance automatically?

41 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

42 Expected loss revisited. Problems: P1 Noise: $E_D\, E_X[\{f(x) - y(x; D)\}^2]$, the ideal function (unknown) vs. the data (given). Not much we can do. P2 Bias: $\{E_D[h_\theta(x; D)] - f(x)\}^2$, how close we are to the ideal function wrt. the model type (i.e. linear, quadratic, polynomial). P3 Variance: $E_D[\{h_\theta(x; D) - E_D[h_\theta(x; D)]\}^2]$, how sensitive we are to the training data, how robust the algorithm is; if we get a different dataset, will the model be similar? Perfect situation: have a very complex model (to have a lot of flexibility in reducing bias), and let the algorithm keep the variance low. How to keep the variance low? Well... by reducing the space of possible models :)

43 Reducing the search space for the model (figures: several fits, axes x vs. t). How to do this? Naive way: all subsets and greedy algorithms (reduce the search space in the feature set). Do not let the coefficients θ grow too much (L2 penalty, reduces the search space in coefficient values). Try to figure out which coefficients θ are not useful and set them to 0 (L1 penalty, a smart naive way).

44 Naive approach. All subsets: Start with no features and measure J(θ). Search over all features and pick the one that has the lowest J. Search for the two best features in the set of features that has the lowest J. Continue until there is no significant improvement in J. Forward/backward stepwise: Start with no features and measure J(θ). Search over all features and pick the one that has the lowest J. Keep the previously best feature and select the second best. Continue until there is no significant improvement in J.
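A minimal sketch of the forward stepwise loop (my own code and helper names, with the training RSS of a least-squares fit playing the role of J): at each round, try adding each remaining feature to the ones already selected, keep the one with the lowest J, and stop when the improvement becomes negligible.

import numpy as np

def rss(X, y, cols):
    # residual sum of squares of a least-squares fit on the selected columns (plus intercept)
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((A @ theta - y) ** 2)

def forward_stepwise(X, y, tol=1e-3):
    selected, remaining = [], list(range(X.shape[1]))
    best = rss(X, y, selected)                      # start with no features
    while remaining:
        scores = {j: rss(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < tol:             # no significant improvement
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected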

45 All subsets, forward/backward stepwise (figure: RSS vs. number of features, with feature names #bedrooms, sq. meters, #showers, #floors, year)

46 All subsets, forward/backward stepwise (figure: RSS vs. number of features, with feature names #bedrooms, sq. meters, #showers, #floors, year)

47 Why not go brute force? Question: We have a very simple problem (15 possible features). We use linear regression and want to select features with the all-subsets approach. How many models do we have to evaluate?

48 Why not go brute force? Question: We have a very simple problem (15 possible features). We use linear regression and want to select features with the forward stepwise approach. How many models do we have to evaluate?
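For a sense of scale, the two counts can be worked out directly (a back-of-the-envelope sketch, counting one fitted model as the unit of work): all subsets of d features means $2^d$ candidate models, while forward stepwise fits at most d models in the first round, d - 1 in the second, and so on.

d = 15
all_subsets = 2 ** d                    # every subset of the 15 features: 32768 models
forward_stepwise = d * (d + 1) // 2     # 15 + 14 + ... + 1 = 120 models at most
print(all_subsets, forward_stepwise)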

49 How to measure L1, L2 penalties. Sum of coefficients... no :) (L1) Sum of absolute values of the coefficients: $L1 = |\theta_1| + |\theta_2| + \ldots + |\theta_n|$. (L2) Sum of squared values of the coefficients: $L2 = \theta_1^2 + \theta_2^2 + \ldots + \theta_n^2$. (L3, L4) Is there something like that? How to minimize it? Add it to the cost function! But... will the cost function still be convex? How will I calculate the gradient for such a cost function?

50 Linear regression with L2 penalty (figure: contour plot over $\theta_1$, $\theta_2$). Linear regression function: $h_\theta(x) = \sum_{i}^{N} \theta_i x_i = \Theta x$. Cost function: $J(\theta) = E_x[(h_\theta(x) - y(x))^2] = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Cost function with regularization: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}\theta_i^2$

51 Gradient for regularized regression (figure: contour plot over $\theta_1$, $\theta_2$). For $j = 0$: $\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{N}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$. For $j \geq 1$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i=1}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{N}\theta_j$
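Putting the two gradient formulas into a batch gradient-descent loop gives something like the sketch below (my own code; the step size alpha and the iteration count are arbitrary choices, and the intercept $\theta_0$ is left out of the penalty as on the slide).

import numpy as np

def ridge_gradient_descent(X, y, lam, alpha=0.1, n_iters=5000):
    # X: (N, D) design matrix whose first column is all ones (the intercept feature)
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / N          # 1/N * sum_i (h(x_i) - y_i) x_ij
        grad[1:] += (lam / N) * theta[1:]         # + lambda/N * theta_j for j >= 1 only
        theta -= alpha * grad
    return theta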

52 Contour plot. White circles represent the path of gradient descent for MSE only. Yellow triangles represent the gradient descent path for the L2 penalty only, for $\lambda \to \infty$. White circles represent the gradient descent path for the combined cost function MSE + L2.

53 Coefficients path

54 Why not set small coefficients to 0?

55 Why not set small coefficients to 0?

56 Why not set small coefficients to 0?

57 Why not set small coefficients to 0?

58 Linear regression with L1 penalty (figure: contour plot over $\theta_1$, $\theta_2$). Linear regression function: $h_\theta(x) = \sum_{i}^{N} \theta_i x_i = \Theta x$. Cost function: $J(\theta) = E_x[(h_\theta(x) - y(x))^2] = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Cost function with L1 penalty: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}|\theta_i|$

59 Contour plot. White circles represent the path of gradient descent for MSE only. Yellow triangles represent the gradient descent path for the L1 penalty only, for $\lambda \to \infty$. White circles represent the gradient descent path for the combined cost function MSE + L1.

60 (Sub)Gradient for the L1 penalty. Cost function with L1 penalty: $J(\theta) = \frac{1}{2N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2N}\sum_{i=2}^{N}|\theta_i|$. Let us focus on the gradient for a single $\theta_j$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$

61 Solution: subgradients (figure: a function f(x) with lines $g(x) = f(x_0) + v(x - x_0)$ and $g(x) = f(x_0) + c(x - x_0)$ through the point $(x_0, f(x_0))$)

62 Solution: subgradients (figure: lines $f(x_0) + v(x - x_0)$ with the subgradient set $V = [-1; 1]$ at $x_0$)

63 Solution: subgradient (figure as on the previous slide). With features normalized as $\bar{x}_j^{(i)} = \frac{x_j^{(i)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$, where $\frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \begin{cases} -\frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{\lambda}{2N}; \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases}$

64 Another trick. Split off the contribution of $\theta_j$: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{N}\sum_{i}^{N}\left(\sum_{k\neq j}^{D}\theta_k x_k^{(i)} + \theta_j x_j^{(i)} - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j} = \underbrace{\frac{1}{N}\sum_{i}^{N}\left(\sum_{k\neq j}^{D}\theta_k x_k^{(i)} - y^{(i)}\right)x_j^{(i)}}_{-\frac{1}{N}\rho_j} + \theta_j\,\frac{1}{N}\sum_{i}^{N}\left(x_j^{(i)}\right)^2 + \frac{\lambda}{2N}\frac{\partial|\theta_j|}{\partial \theta_j}$, with $\rho_j = \sum_{i}^{N}\left(y^{(i)} - \sum_{k\neq j}^{D}\theta_k x_k^{(i)}\right)x_j^{(i)}$.

65 Sum up. With normalized features ($\sum_{i}^{N}(x_j^{(i)})^2 = 1$): $\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{N}\rho_j + \frac{1}{N}\theta_j + \begin{cases} -\frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{\lambda}{2N}; \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases} = \begin{cases} \frac{1}{N}(\theta_j - \rho_j) - \frac{\lambda}{2N} & \text{if } \theta_j < 0 \\ \left[-\frac{1}{N}\rho_j - \frac{\lambda}{2N}; -\frac{1}{N}\rho_j + \frac{\lambda}{2N}\right] & \text{if } \theta_j = 0 \\ \frac{1}{N}(\theta_j - \rho_j) + \frac{\lambda}{2N} & \text{if } \theta_j > 0 \end{cases}$

66 Optimal solution = set the gradient to zero: $\frac{1}{N}(\theta_j - \rho_j) - \frac{\lambda}{2N} = 0$ if $\theta_j < 0$; the interval $\left[-\frac{1}{N}\rho_j - \frac{\lambda}{2N}; -\frac{1}{N}\rho_j + \frac{\lambda}{2N}\right]$ has to contain 0 if $\theta_j = 0$; $\frac{1}{N}(\theta_j - \rho_j) + \frac{\lambda}{2N} = 0$ if $\theta_j > 0$.

67 Optimal solution = set the gradient to zero. Therefore: $\theta_j = \rho_j + \frac{\lambda}{2}$ if $\rho_j < -\frac{\lambda}{2}$; $\theta_j = 0$ if $\rho_j \in \left[-\frac{\lambda}{2}; \frac{\lambda}{2}\right]$; $\theta_j = \rho_j - \frac{\lambda}{2}$ if $\rho_j > \frac{\lambda}{2}$.

68 Coefficients path

69 Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise Average function Expected loss 4 Ridge regression Keeping model in check (smart way) L2 penalty regularization L1 penalty lasso Gradient for L1 regularization Coordinate gradient descent 5 Choosing lambda

70 Optimize one coordinate at a time

71 Limitations

72 Coordinate descent for simple, ridge and lasso regression. Algorithm: until not converged, select $\theta_j$ (round robin, random, etc.); calculate $\rho_j = \sum_{i}^{N}\left(y^{(i)} - \sum_{k\neq j}^{D}\theta_k x_k^{(i)}\right)x_j^{(i)}$ (with normalized features, so that $\sum_{i}^{N}(x_j^{(i)})^2 = 1$); update $\theta_j$. For simple regression: $\theta_j = \rho_j$. For ridge: $\theta_j = \frac{\rho_j}{2\lambda + 1}$. For lasso: $\theta_j = \rho_j + \frac{\lambda}{2}$ if $\rho_j < -\frac{\lambda}{2}$; $\theta_j = 0$ if $\rho_j \in \left[-\frac{\lambda}{2}; \frac{\lambda}{2}\right]$; $\theta_j = \rho_j - \frac{\lambda}{2}$ if $\rho_j > \frac{\lambda}{2}$.
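The whole procedure fits in a few lines. The sketch below is my own code (the function names are mine); it assumes the columns of X have been normalized to unit norm as on the earlier normalization slide, uses the $\rho_j$ and the $\pm\lambda/2$ thresholds in the slide's notation, and visits coordinates round robin for a fixed number of passes instead of a proper convergence test.

import numpy as np

def soft_threshold(rho, lam):
    if rho < -lam / 2:
        return rho + lam / 2
    if rho > lam / 2:
        return rho - lam / 2
    return 0.0                                     # rho in [-lam/2, lam/2]

def lasso_coordinate_descent(X, y, lam, n_passes=100):
    # X: (N, D) with unit-norm columns, i.e. sum_i x_ij^2 = 1 for every j
    D = X.shape[1]
    theta = np.zeros(D)
    for _ in range(n_passes):                      # "until not converged", simplified
        for j in range(D):                         # round-robin coordinate choice
            partial_residual = y - X @ theta + theta[j] * X[:, j]
            rho_j = X[:, j] @ partial_residual     # correlation with the partial residual
            theta[j] = soft_threshold(rho_j, lam)
    return theta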

73 Lasso soft thresholding (figure: the updated $\theta_j$ as a function of $\rho_j$ for simple regression, ridge and lasso; the lasso curve is zero between $-\frac{\lambda}{2}$ and $\frac{\lambda}{2}$)

74 Lasso pros and cons. No step size! Converges to the optimum for strongly convex functions. In some cases it will not converge (the case of non-differentiable cost functions). It shrinks coefficients relative to RSS, so it produces more bias and less variance. Run lasso to select features, then run simple/ridge regression with only the selected features.

75 Presentation Outline 1 Expected value 2 Linear regression wrap up 3 Bias, variance, noise 4 Ridge regression 5 Choosing lambda

76 Training/Test/Cross-Validation (figure: all the data you have, split 60% / 20% / 20%). On the 60% part, fit θ with respect to some λ; on the first 20% part, test different λ; on the last 20% part, test the generalization error for θ.
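In code the split-based procedure looks roughly like the sketch below (my own code; fit_ridge stands for any regularized fitting routine and is passed in as a parameter, mse is a helper of mine): fit $\theta$ on the training part for each candidate $\lambda$, pick the $\lambda$ with the lowest validation error, and report the generalization error once, on the held-out test part.

import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def choose_lambda(X, y, fit_ridge, lambdas, seed=0):
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    tr, va, te = idx[:int(.6 * N)], idx[int(.6 * N):int(.8 * N)], idx[int(.8 * N):]
    scores = {}
    for lam in lambdas:                                   # test different lambdas
        theta = fit_ridge(X[tr], y[tr], lam)              # fit theta for this lambda
        scores[lam] = mse(theta, X[va], y[va])            # validation error
    best = min(scores, key=scores.get)
    theta = fit_ridge(X[tr], y[tr], best)
    return best, mse(theta, X[te], y[te])                 # generalization estimate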

77 Cross-validation techniques: k-fold (figure: all the data split into K parts; each part is used once as the test fold)

78 Cross-validation techniques: leave-one-out (figure: N splits, each leaving a single example out for testing)

79 Cross-validation techniques: leave-p-out (figure: $\binom{n}{p}$ splits, each leaving p examples out for testing)

80 Cross-validation techniques: Monte Carlo (figure: K random train/test splits)

81 Cross-validation techniques: Bootstrap (figure: from the original data $x_1, \ldots, x_7$, K training sets are drawn with replacement, e.g. $x_1\, x_1\, x_2\, x_5\, x_3\, x_4\, x_6$, and the examples not drawn are used for testing)
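A bootstrap round is just sampling row indices with replacement. The sketch below (my own illustration) builds K resampled training sets and, for each, uses the examples that were never drawn as the test set.

import numpy as np

def bootstrap_splits(n_samples, K, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(K):
        train = rng.integers(0, n_samples, size=n_samples)   # draw indices with replacement
        test = np.setdiff1d(np.arange(n_samples), train)      # examples never drawn
        yield train, test

for train_idx, test_idx in bootstrap_splits(7, K=3):
    print(train_idx, test_idx)    # e.g. a multiset like [1 1 2 5 3 4 6] plus the left-out indices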

82 Lasso and ridge vs. normalization. Ridge and lasso put a penalty on the value of θ. The value of θ depends on the magnitude of the gradient. We multiply each gradient by $x_j^{(i)}$, making θ dependent on the magnitude of $x_j^{(i)}$. The penalty is therefore dependent on the magnitude of x... :/

83 Lasso and ridge vs. normalization. Normalization: $\bar{x}_j^{(i)} = \frac{x_j^{(i)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$. When testing and using the model, you need to normalize in the same way: $\bar{x}_j^{(test)} = \frac{x_j^{(test)}}{\sqrt{\sum_{i}^{N}\left(x_j^{(i)}\right)^2}}$ (note that the denominator still uses the training data).
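As a sketch (my own code), the important detail is that the per-feature norms are computed once on the training matrix and then reused for any test or production data.

import numpy as np

def fit_normalizer(X_train):
    return np.sqrt(np.sum(X_train ** 2, axis=0))   # z_j = sqrt(sum_i (x_j^(i))^2) per feature

def normalize(X, norms):
    return X / norms                                # the same scaling for train and test

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 3)), rng.normal(size=(10, 3))
norms = fit_normalizer(X_train)
X_train_n, X_test_n = normalize(X_train, norms), normalize(X_test, norms)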

84 Demo

85 Thank you! Szymon Bobek, Institute of Applied Computer Science, AGH University of Science and Technology, 21 March
