Polynomial Regression and Regularization

Size: px

Start display at page:

Download "Polynomial Regression and Regularization"

Beatrix Wright
5 years ago
Views:

1 Polynomial Regression and Regularization

2 Administrivia o If you still haven t enrolled in Moodle yet Enrollment key on Piazza If joined course recently, me to get added to Piazza o Homework 1 posted later tonight. Good Milestones: Problems 1-3 This Week Problems 4 and 5 Next Week 2

3 The RoadMap o Last Time: Regression Refresher (there was nothing fresh about it) o This Time: Polynomial Regression Regularization (wiggles are bad, Man) o Next Time: Bias-Variance Trade-Off (what does it all MEAN?) 3

4 Previously on CSCI 4622 Given training data (x i1,x i2,...,x ip,y i ) for i =1, 2,...,n fit a regression of the form y i = x i1 + 2 x i2 + + p x ip + i Estimates of the parameters are found by minimizing RSS = Construct the design matrix X For the moment, solve via normal equations: where i N(0, 2 ) nx [( x i1 + + p x ip ) y i ] 2 = kx yk 2 i=1 by prepending column of 1s to data matrix ˆ = f X T X 1 X T y LOSS Function 4

5 So, We Can Model Things Like This 5

6 And Things Like This Y X 2 X 1 6

7 But What About Things Like This? 7

8 But What About Things Like This? o So yeah, nonlinearity is a thing o Clearly, the linear models of regression cannot handle this. o Give me Neural Networks or Give me Death! 8

9 But What About Things Like This? o So yeah, nonlinearity is a thing o Clearly, the linear models of regression cannot handle this. o Give me Neural Networks or Give me Death! o Nah. Regression can totally do this 9

10 Start with the Obvious: Polynomials o Suppose, as in the previous pictures, we have a single feature X o We want to go from this: Y = X + o To something like this: Y = X + 2 X p X p + o So what is the difference between these two? 10

11 Start with the Obvious: Polynomials o Suppose, as in the previous pictures, we have a single feature X o We want to go from this: Y = X + o To something like this: Y = X + 2 X p X p + o So what is the difference between these two? o Seems like it s closer to this: Y = X X p X p + 11

12 The One Becomes the Many o Start with a single feature X o And derive new polynomial features: X 1 = X, X 2 = X 2, X p = X p o These two things: Y = X + 2 X p X p + o are exactly the same: Y = X X p X p + 12

13 The Polynomial Regression Pipeline o Start with a single feature X o Derive new polynomial features: X 1 = X, X 2 = X 2, X p = X p o Solve the MLR in the usual way: Y = X X p X p + o Question: What does the Matrix Equation look like? 13

14 The Polynomial Regression Pipeline o Start with a single feature X o Derive new polynomial features: X 1 = X, X 2 = X 2, X p = X p o Solve the MLR in the usual way: Y = X X p X p + o Question: What does the Matrix Equation look like? Before: x 11 x 21 x 1p 1 x 21 x 22 x 2p 1 x 31 x 32 x 3p x n1 x n2 x np p 3 2 = y 1 y 2 y 3. y n

15 The Polynomial Regression Pipeline o Start with a single feature X o Derive new polynomial features: X 1 = X, X 2 = X 2, X p = X p o Solve the MLR in the usual way: Y = X X p X p + o Question: What does the Matrix Equation look like? After: x 1 x 2 1 x p 1 1 x 2 x 2 2 x p 2 1 x 3 x 2 2 x p x n x 2 n x p n p 3 2 = y 1 y 2 y 3. y n

16 The Polynomial Regression Pipeline o Suppose I learn the parameters from the training set o What if I want to make a prediction about data I haven t seen yet? HAVE TEST point Xo NEED to PERFORM SAME Transformation on Xo THAT WE PERFORMED data. xo ( 1,., On X5,...,XoP) THEN PRE Pichon IS just o= Xotp training 16

17 The Polynomial Regression Pipeline o Suppose I learn the parameters from the training set o What if I want to make a prediction about data I haven t seen yet? o Important Theme #1: If you transform your features for training, predicting is always as easy transforming the features of your new data in the same way. Input space - FEATURE SPACE Training X X, Xn,... XP Test Xo Xo MAGIC, Xp,.. Happens, XOP HERE! 17

18 The Polynomial Regression Pipeline o Suppose I learn the parameters from the training set o What if I want to make a prediction about data I haven t seen yet? o Important Theme #1: If you transform your features for training, predicting is always as easy transforming the features of your new data in the same way. o Important Theme #2: If your current features are not flexible enough, moving into a higher dimensional space will make your model more flexible (but BEWARE!) 18

19 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? 19

20 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? o Degree = 1 20

21 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? o Degree = 4 21

22 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? o Degree = 7 22

23 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? o Degree = 11 23

24 Example: Sinusoidal Data True Model is f(x) =sin( x)+ o Question: What degree polynomial should I use? o Degree = 11 o What Questions should we be asking? 24

25 The Real ML Pipeline (Almost) The more flexible (powerful) your model gets, the better it ll do on training set But that s not useful. We want to know how model will do on new data How well it ll Generalize A better pipeline: RANDOM on TRAIN TEST on TH 'S THIS SPLIT ) ) L L All Labeled Data Training Set Validation Set Train on training set. Validate (or evaluate) on validation set. HELD DATA - OUT 25

26 The Real ML Pipeline (Almost) Need a performance measure: For Regression, we use the Mean-Squared Error: MSE = 1 n nx (y i ŷ i ) 2 I=1 - y TRUE RESPONSE predictions All Labeled Data Training Set Validation Set 26

27 Example: Sinusoidal Data What happens to the MSE on training set as p grows? What happens to MSE on validation set as p grows? MSE goes DOWNMSE goes Down At±T, then BACK-Up! 27

28 Example: Sinusoidal Data FLEXIBLE Enough MODEL / OUERFITT HAS STARTED ' ng 28

29 Example: Sinusoidal Data Revisit Polynomial Regression What causes those wiggles that are hurting us so bad? 29

30 Example: Sinusoidal Data True Model is f(x) =sin( x)+ 30

31 Example: Sinusoidal Data True Model is f(x) =sin( x)+ 31

32 Example: Sinusoidal Data True Model is f(x) =sin( x)+ 32

33 Example: Sinusoidal Data True Model is f(x) =sin( x)+ 33

34 Example: Sinusoidal Data Coefficients grow very large. Need a way to control them. Regulate them even 34

35 Regularization Goal: Keep model coefficients from growing too large Idea: Put something in the loss function to dissuade them growing too much minimize RSS = nx [( x i1 + + p x ip ) y i ] 2 + i=1 o Regularization weight of is a tuning parameter px k=1 2 k Tenafly o Have to do regularization study to choose best value 35

36 Regularization Goal: Keep model coefficients from growing too large Idea: Put something in the loss function to dissuade them growing too much RSS = nx [( x i1 + + p x ip ) y i ] 2 + i=1 px k=1 2 k o Why do we not regularize the bias parameter? BIAS Tells Us About Average RESPONSE ( HEIGHT OF DATA ). Regularizing po pulls THE whole model down FROM WHERE Should BE 36

37 Ridge Regression Goal: Keep model coefficients from growing too large Idea: Put something in the loss function to dissuade them growing too much RSS = nx [( x i1 + + p x ip ) y i ] 2 + i=1 px k=1 2 k o This form of penalty term is called Ridge Regression. But there are many more 37

38 Sinusoidal Data with Ridge Regression Watch the coefficients 38

39 Sinusoidal Data with Ridge Regression Watch the coefficients 39

40 Sinusoidal Data with Ridge Regression Watch the coefficients 40

41 Sinusoidal Data with Ridge Regression Watch the coefficients 41

42 Sinusoidal Data with Ridge Regression Watch the coefficients 42

43 Sinusoidal Data with Ridge Regression Watch the coefficients NOTE : COEFFICIENTS STAYED Small! 43

44 Example: Sinusoidal Data No Regularization Ridge Regression BLOW - UP HAPPENS MUCH LATER! 44

45 Example: Sinusoidal Data Regularization Study: o Choose poly degree o Vary over reg strength o Look for best validation MSE 45

46 Example: Sinusoidal Data Regularization Study: o Choose poly degree o Vary over reg strength o Look at size of coefficients COEFFS As SHRINK oo 46

47 Regularization Wrap-Up You should always do regularization. If you choose carefully, it will always help Generalization Next Time: o Talk more about Ridge Regression Details o Learning Curves and what they tell us about the Bias-Variance Trade-Off 47

48 48

49 49

50 50

51 51

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.

TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin