Linear classifiers: Overfitting and regularization


Emily Fox, University of Washington, January 25, 2017

Logistic regression recap

Thus far, we focused on decision boundaries

Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) = w^T h(x_i)

(Figure: decision boundary in the #awesome vs. #awful plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.)

How do we relate Score(x_i) to P(y=+1 | x, w)?

Compare and contrast regression models

Linear regression with Gaussian errors:
y_i = w^T h(x_i) + ε_i,  ε_i ~ N(0, σ^2)
p(y | x, w) = N(y; w^T h(x), σ^2)

Logistic regression:
P(y=+1 | x, w) = 1 / (1 + e^(-w^T h(x)))
P(y=-1 | x, w) = e^(-w^T h(x)) / (1 + e^(-w^T h(x)))

Understanding the logistic regression model

P(y=+1 | x_i, w) = sigmoid(Score(x_i)) = 1 / (1 + e^(-w^T h(x_i)))

(Figure: the sigmoid curve mapping Score(x_i) to P(y=+1 | x_i, w).)

Predict most likely class
Input: x (e.g., a sentence from a review)
P(y | x) = estimate of class probabilities
If P(y=+1 | x) > 0.5: ŷ = +1. Else: ŷ = -1.
Estimating P(y | x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
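A minimal sketch of this prediction rule in NumPy; the feature matrix H (rows h(x_i)) and the weight vector w are hypothetical stand-ins, not names from the slides:

```python
import numpy as np

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)); clipping avoids overflow for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def predict_proba(H, w):
    # P(y=+1 | x_i, w) = sigmoid(w^T h(x_i)) for each row of H.
    return sigmoid(H @ w)

def predict_class(H, w):
    # y_hat = +1 if P(y=+1 | x) > 0.5, else -1.
    return np.where(predict_proba(H, w) > 0.5, 1, -1)
```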

Gradient ascent algorithm, cont'd

Step 5: Gradient over all data points:

∂ℓ(w)/∂w_j = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w))

a sum over data points of the feature value h_j(x_i) times the difference between the truth 1[y_i = +1] and the prediction P(y=+1 | x_i, w).

Summary of gradient ascent for logistic regression

init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w^(t))|| > ε:
    for j = 0, ..., D:
        partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        w_j^(t+1) ← w_j^(t) + η partial[j]
    t ← t + 1
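A runnable version of this loop, as a sketch: H holds the features h(x_i) as rows (first column all ones for h_0), y holds labels in {-1, +1}, and the step size and tolerance are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def logistic_gradient_ascent(H, y, eta=0.1, eps=1e-4, max_iter=10000):
    N, D = H.shape
    w = np.zeros(D)                          # init w = 0
    indicator = (y == 1).astype(float)       # 1[y_i = +1]
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-H @ w))     # P(y=+1 | x_i, w)
        partial = H.T @ (indicator - p)      # gradient: sum over data points
        if np.linalg.norm(partial) < eps:    # stop once the gradient is tiny
            break
        w = w + eta * partial                # ascent step
    return w
```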

Overfitting in classification

Learned decision boundary (linear features in 2d)

Feature | Value | Coefficient learned
h_0(x)  | 1     | …
h_1(x)  | x[1]  | 1.12
h_2(x)  | x[2]  | …

Quadratic features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 1.39
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …

Degree 6 features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 5.3
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …
...     | ...      | ...
h_11(x) | (x[1])^6 | …
h_12(x) | (x[2])^6 | …

Coefficient values are getting large.

Degree 20 features (in 2d; note: we are not including cross terms for simplicity)

Feature | Value     | Coefficient learned
h_0(x)  | 1         | …
h_1(x)  | x[1]      | 5.1
h_2(x)  | x[2]      | 78.7
...     | ...       | ...
h_37(x) | (x[1])^19 | -2×10^-6
h_38(x) | (x[2])^19 | …
h_39(x) | (x[1])^20 | …
h_40(x) | (x[2])^20 | …

Often, overfitting is associated with very large estimated coefficients.

Overfitting in classification

Error = classification error.
Overfitting if there exists w* such that:
- training_error(w*) > training_error(ŵ), and
- true_error(w*) < true_error(ŵ)

(Figure: training error and true error as a function of model complexity; the gap between them grows as complexity increases.)

Overfitting in classifiers leads to overconfident predictions.

Logistic regression model

P(y=+1 | x_i, w) = sigmoid(w^T h(x_i)) = 1 / (1 + e^(-w^T h(x_i)))

The subtle (negative) consequence of overfitting in logistic regression:
overfitting → large coefficient values → w^T h(x_i) is very positive (or very negative) → sigmoid(w^T h(x_i)) goes to 1 (or to 0) → the model becomes extremely overconfident in its predictions.

Effect of coefficients on logistic regression model

Input x: #awesome = 2, #awful = 1. Three coefficient settings:
- w_0 = 0, w_#awesome = +1, w_#awful = -1
- w_0 = 0, w_#awesome = +2, w_#awful = -2
- w_0 = 0, w_#awesome = +6, w_#awful = -6

(Figure: P(y=+1 | x, w) plotted against #awesome − #awful for each setting; as the coefficients scale up, the sigmoid becomes steeper and the predictions more extreme.)

Learned probabilities (linear features)

Feature | Value | Coefficient learned
h_0(x)  | 1     | …
h_1(x)  | x[1]  | 1.12
h_2(x)  | x[2]  | …
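The steepening effect of the three coefficient settings above can be checked numerically; this is a toy computation, not output from the slides:

```python
import numpy as np

score = 2 - 1                 # #awesome - #awful for the input x above
for scale in (1, 2, 6):      # w_0 = 0, w_#awesome = +scale, w_#awful = -scale
    p = 1.0 / (1.0 + np.exp(-scale * score))
    print(f"scale {scale}: P(y=+1 | x, w) = {p:.3f}")
# Prints 0.731, 0.881, 0.998: larger coefficients push the same input toward certainty.
```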

Quadratic features: Learned probabilities

Feature | Value    | Coefficient learned
h_0(x)  | 1        | …
h_1(x)  | x[1]     | 1.39
h_2(x)  | x[2]     | …
h_3(x)  | (x[1])^2 | …
h_4(x)  | (x[2])^2 | …

Overfitting leads to overconfident predictions.

Degree 6: Learned probabilities. Tiny uncertainty regions. Overfitting, and overconfident about it!!! We are sure we are right, when we are surely wrong!

Degree 20: Learned probabilities

(Figures: predicted-probability surfaces; the higher-degree models carve out near-certain regions with vanishing bands of uncertainty.)

Overfitting in logistic regression: Another perspective

Linearly-separable data. Data are linearly separable if there exist coefficients ŵ such that:
- Score(x_i) = ŵ^T h(x_i) > 0 for all positive training data
- Score(x_i) = ŵ^T h(x_i) < 0 for all negative training data

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, so training_error(ŵ) = 0.

Effect of linear separability on coefficients

(Figure: linearly separable data in the x[1] = #awesome, x[2] = #awful plane, with Score(x) > 0 on one side of the boundary and Score(x) < 0 on the other.)

- Data are linearly separable with ŵ_1 = 1.0 and ŵ_2 = -1.5
- Data are also linearly separable with ŵ_1 = 10 and ŵ_2 = -15
- Data are also linearly separable with ŵ_1 = 10^9 and ŵ_2 = -1.5×10^9

Maximum likelihood estimation (MLE) prefers the most certain model, so the coefficients go to infinity for linearly-separable data!!!

(Figure: P(y=+1 | x_i, ŵ) becomes a sharper and sharper step as the coefficients are scaled up.)
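A small experiment (hypothetical 1-d data, not from the slides) showing the blow-up: unregularized gradient ascent on a linearly separable problem keeps increasing the likelihood, so the coefficient never stops growing.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])        # linearly separable in 1d
y = np.array([-1, -1, 1, 1])
w = 0.0
for t in range(1, 100_001):
    p = 1.0 / (1.0 + np.exp(-w * x))        # P(y=+1 | x_i, w)
    w += 0.5 * np.sum(x * ((y == 1) - p))   # full-gradient ascent step
    if t in (10, 100, 1_000, 10_000, 100_000):
        print(t, round(w, 2))               # w keeps growing, roughly like log t
```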

Overfitting in logistic regression is twice as bad

- Learning tries to find a decision boundary that separates the data → overly complex boundary
- If the data are linearly separable → coefficients go to infinity! (e.g., ŵ_1 = 10^9, ŵ_2 = -1.5×10^9)

Penalizing large coefficients to mitigate overfitting

Desired total cost format

Want to balance:
i. How well the function fits the data
ii. The magnitude of the coefficients

Total quality = measure of fit − measure of magnitude of coefficients
- measure of fit (data likelihood): large # = good fit to training data
- measure of magnitude of coefficients: large # = overfit

Consider the resulting objective. Select ŵ to maximize:

ℓ(w) − λ ||w||_2^2

where the tuning parameter λ balances fit and magnitude. This is L2 regularized logistic regression. Pick λ using:
- a validation set (for large datasets)
- cross-validation (for smaller datasets)
(as in ridge/lasso regression)
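Written as code, the quantity being maximized might look like this; a sketch assuming labels in {-1, +1}, with all coefficients penalized (excluding the intercept w_0 from the penalty is a common variation the slides do not address):

```python
import numpy as np

def total_quality(w, H, y, lam):
    # l(w) - lambda * ||w||_2^2
    scores = H @ w
    log_likelihood = -np.sum(np.log1p(np.exp(-y * scores)))  # l(w), y in {-1,+1}
    return log_likelihood - lam * np.sum(w ** 2)
```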

Degree 20 features: effect of the regularization penalty λ

(Table/figure: decision boundaries and coefficient ranges for λ from 0 up through 1 and 10; as λ grows, the range of the learned coefficients shrinks dramatically, with the coefficients reaching at most 0.22 at λ = 10, and the decision boundary becomes smoother.)

Coefficient path

(Figure: each coefficient ŵ_j plotted as a function of λ; all coefficients shrink toward 0 as λ increases.)

Degree 20 features: regularization reduces overconfidence

(Table/figure: learned probabilities for λ from 0 up through 1; as λ grows, the coefficient range shrinks, reaching at most 0.57 at λ = 1, and the predicted probabilities become less extreme, with wider bands of uncertainty.)

Finding the best L2 regularized linear classifier with gradient ascent

Understanding the contribution of the L2 regularization

The term from the L2 penalty is ∂/∂w_j [−λ ||w||_2^2] = −2 λ w_j. Its impact on w_j:
- If w_j > 0, the term is negative, so it decreases w_j.
- If w_j < 0, the term is positive, so it increases w_j.
In both cases, it pulls w_j toward 0.

Summary of gradient ascent for logistic regression + L2 regularization

init w^(1) = 0 (or randomly, or smartly), t = 1
while not converged:
    for j = 0, ..., D:
        partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        w_j^(t+1) ← w_j^(t) + η (partial[j] − 2 λ w_j^(t))
    t ← t + 1
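The same update in runnable NumPy, as a sketch with illustrative hyperparameters; a fixed iteration count stands in for "while not converged":

```python
import numpy as np

def l2_logistic_gradient_ascent(H, y, lam, eta=0.1, n_iter=5000):
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ w))
        partial = H.T @ (indicator - p)          # unregularized gradient
        w = w + eta * (partial - 2 * lam * w)    # -2*lambda*w_j shrinks each w_j
    return w
```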

Summary of overfitting in logistic regression

What you can do now:
- Identify when overfitting is happening
- Relate large learned coefficients to overfitting
- Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
- Motivate the form of the L2 regularized logistic regression quality metric
- Describe what happens to estimated coefficients as the tuning parameter λ is varied
- Interpret a coefficient path plot
- Estimate L2 regularized logistic regression coefficients using gradient ascent
- Describe the use of L1 regularization to obtain sparse logistic regression solutions

Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington, January 25, 2017

Stochastic gradient descent: Learning, one data point at a time

Why gradient ascent is slow

To move from w^(t) to w^(t+1), we must compute the gradient ∇ℓ(w^(t)), which requires a full pass over the data; then another full pass for ∇ℓ(w^(t+1)), and so on. Every update requires a full pass over the data.

More formally: How expensive is gradient ascent?

∇ℓ(w) = Σ_{i=1}^N ∇ℓ_i(w)

a sum over data points of the contribution ∇ℓ_i(w) of data point (x_i, y_i) to the gradient.

Every step requires touching every data point!!!

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000      | 1 second
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days

Stochastic gradient ascent

Instead of a full pass per update, use only small subsets of the data to compute the gradient: update the coefficients after each subset, giving many updates for each pass over the data.
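The totals in the table above are just N times the per-point cost; a few lines of arithmetic reproduce them:

```python
for n in (1_000, 10_000_000, 10_000_000_000):
    seconds = n * 1e-3      # 1 millisecond per data point
    print(f"N = {n:>14,}: {seconds:>12,.0f} s "
          f"= {seconds / 3600:,.1f} h = {seconds / 86400:,.1f} days")
# N = 1,000 -> 1 s; N = 10,000,000 -> 2.8 h; N = 10,000,000,000 -> 115.7 days
```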

Example: Instead of all data points for the gradient, use 1 data point only???

Gradient ascent: partial[j] = Σ_{i=1}^N h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w)), a sum over data points.
Stochastic gradient ascent: partial[j] = h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w)), each time picking a different data point i.

Stochastic gradient ascent for logistic regression

init w^(1) = 0, t = 1
until converged:
    for i = 1, ..., N:                (each time, pick a different data point i)
        for j = 0, ..., D:
            partial[j] = h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
            w_j^(t+1) ← w_j^(t) + η partial[j]
        t ← t + 1
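A runnable sketch of this loop; shuffling the order each pass is a common practical touch not shown in the slide's pseudocode:

```python
import numpy as np

def logistic_sgd(H, y, eta=0.01, n_passes=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):                   # a different data point each time
            p_i = 1.0 / (1.0 + np.exp(-H[i] @ w))      # P(y=+1 | x_i, w)
            w = w + eta * H[i] * (indicator[i] - p_i)  # one-point gradient step
    return w
```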

Stochastic gradient for the L2-regularized objective

What about the regularization term? The total derivative is

partial[j] − 2 λ w_j

In stochastic gradient ascent, each time we pick a different data point i and use

partial_i[j] − (2 λ / N) w_j

i.e., each data point contributes 1/N of the regularization term.

Comparing computational time per step

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000      | 1 second   | 1 millisecond
1 millisecond | 10 million | 2.8 hours  | 1 millisecond
1 millisecond | 10 billion | 115.7 days | 1 millisecond
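Folding the 1/N share of the penalty into the single-point update of the logistic_sgd sketch above gives the regularized variant (again a sketch with hypothetical hyperparameters):

```python
import numpy as np

def l2_logistic_sgd(H, y, lam, eta=0.01, n_passes=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = H.shape
    w = np.zeros(D)
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            p_i = 1.0 / (1.0 + np.exp(-H[i] @ w))
            # Each data point carries 1/N of the -2*lambda*w penalty term:
            w = w + eta * (H[i] * (indicator[i] - p_i) - (2 * lam / N) * w)
    return w
```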

Which one is better??? Depends.

Algorithm | Time per iteration | Total time to convergence for large data (in theory) | (in practice) | Sensitivity to parameters
Gradient            | Slow for large data | Slower | Often slower | Moderate
Stochastic gradient | Always fast         | Faster | Often faster | Very high

Summary of stochastic gradient
- Tiny change to gradient ascent
- Much better scalability
- Huge impact in the real world
- Very tricky to get right in practice

Why would stochastic gradient ever work???

The gradient is the direction of steepest ascent. The gradient is the best direction, but any direction that goes up would be useful.

In ML, the steepest direction is a sum of little directions, one from each data point

∇ℓ(w) = Σ_{i=1}^N ∇ℓ_i(w)   (sum over data points)

For most data points, the contribution points up. Stochastic gradient: pick a data point and move in its direction. Most of the time, the total likelihood will increase.

Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, we make progress.

until converged:
    for i = 1, ..., N:
        for j = 0, ..., D:
            w_j^(t+1) ← w_j^(t) + η h_j(x_i) (1[y_i = +1] − P(y=+1 | x_i, w^(t)))
        t ← t + 1

Convergence path

Convergence paths

(Figure: parameter-space paths; gradient ascent follows a smooth path to the optimum, while stochastic gradient takes a noisy, zig-zagging path.)

Stochastic gradient convergence is noisy

(Figure: average log likelihood vs. total time, which is proportional to # passes over the data; gradient usually increases the likelihood smoothly, while stochastic gradient makes noisy progress but achieves a higher likelihood sooner.)

Eventually, gradient catches up

(Figure: average log likelihood vs. time for gradient and stochastic gradient; note that you should only trust the average quality of stochastic gradient.)

The last coefficients may be really good or really bad!! Stochastic gradient will eventually oscillate around a solution: w^(1000) might be bad while w^(1005) was good. How do we minimize the risk of picking bad coefficients? Minimize the noise: don't return the last learned coefficients. Output the average:

ŵ = (1/T) Σ_{t=1}^T w^(t)
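Averaging is a one-line postprocessing step over saved iterates (w_history is a hypothetical list of the w^(t) vectors; in practice one often averages only the tail of the run):

```python
import numpy as np

def average_iterates(w_history):
    # w_hat = (1/T) * sum_t w^(t): trust the average, not the last iterate.
    return np.mean(np.asarray(w_history), axis=0)
```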

Summary of why stochastic gradient works
- Gradient finds the direction of steepest ascent
- Gradient is a sum of contributions from each data point
- Stochastic gradient uses the direction from 1 data point
- On average it increases the likelihood, though sometimes it decreases it
- Stochastic gradient has noisy convergence

Online learning: Fitting models from streaming data

Batch vs online learning

Batch learning: all data is available at the start of training time.
Online learning: data arrives (streams in) over time; we must train the model as the data arrives!

(Figure: timeline t = 1, 2, 3, 4; at each step new data flows into the ML algorithm, which outputs updated coefficients ŵ^(1), ŵ^(2), ŵ^(3), ŵ^(4).)

Online learning example: Ad targeting

A website must choose among Ad1, Ad2, Ad3. Input x_t: user info, page text. The ML algorithm suggests ads (ŷ); the user clicks on Ad2, so y_t = Ad2, and the coefficients are updated from ŵ^(t) to ŵ^(t+1).

Online learning problem

Data arrives over each time step t:
- Observe input x_t (info of the user, text of the webpage)
- Make a prediction ŷ_t (which ad to show)
- Observe the true output y_t (which ad the user clicked on)

We need an ML algorithm that updates the coefficients each time step! Stochastic gradient ascent can be used for online learning!!!

init w^(1) = 0, t = 1
Each time step t:
- Observe input x_t
- Make a prediction ŷ_t
- Observe the true output y_t
- Update coefficients:
    for j = 0, ..., D:
        w_j^(t+1) ← w_j^(t) + η partial[j]
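An online-learning skeleton built on the stochastic update; a sketch in which stream, featurize, and the {-1, +1} label encoding are all assumptions for illustration:

```python
import numpy as np

def online_logistic_learner(stream, featurize, eta=0.01, D=100):
    # stream yields (x_t, y_t) pairs over time, with y_t in {-1, +1};
    # featurize maps a raw input x_t to a length-D feature vector h(x_t).
    w = np.zeros(D)
    for x_t, y_t in stream:
        h = featurize(x_t)
        p = 1.0 / (1.0 + np.exp(-h @ w))
        yield 1 if p > 0.5 else -1                 # make the prediction y_hat_t
        w = w + eta * h * (float(y_t == 1) - p)    # then update the coefficients
        # Simplification: a real ad system observes y_t only after serving ŷ_t.
```

Using a generator lets the caller act on each prediction before the corresponding update is applied, matching the observe-predict-update order of the slide.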

Summary of online learning
- Data arrives over time
- Must make a prediction every time a new data point arrives
- Observe the true class after the prediction is made
- Want to update the parameters immediately

Summary of stochastic gradient descent

What you can do now:
- Significantly speed up a learning algorithm using stochastic gradient
- Describe the intuition behind why stochastic gradient works
- Apply stochastic gradient in practice
- Describe online learning problems
- Relate stochastic gradient to online learning

Generalized linear models

Models for data

Linear regression with Gaussian errors:
y_i = w^T h(x_i) + ε_i,  ε_i ~ N(0, σ^2)
p(y | x, w) = N(y; w^T h(x), σ^2)

Logistic regression:
P(y=+1 | x, w) = 1 / (1 + e^(-w^T h(x)))

Examples of generalized linear models

A generalized linear model has E[y | x, w] = g(w^T h(x)).

Model | Mean E[y | x, w] | Link function g
Linear regression | w^T h(x) | identity function
Logistic regression (assuming y in {0,1} instead of {-1,1} or some other set of values) | P(y=+1 | x, w) (the mean is different if y is not in {0,1}) | sigmoid function (needs slight modification if y is not in {0,1})
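The two rows of the table differ only in the function g applied to the score; as a sketch:

```python
import numpy as np

def glm_mean(H, w, g):
    # E[y | x, w] = g(w^T h(x)) for a generalized linear model.
    return g(H @ w)

identity = lambda z: z                          # linear regression
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic regression, y in {0, 1}
```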

Stochastic gradient descent, more formally

Learning problems as expectations

Minimizing loss on training data:
- Given a dataset {(x_i, y_i)}_{i=1}^N sampled i.i.d. from some distribution p(x) on the features
- A loss function ℓ(w; x), e.g., hinge loss, logistic loss, ...
- We often minimize the loss on the training data:

min_w (1/N) Σ_{i=1}^N ℓ(w; x_i)

However, we should really minimize the expected loss on all data:

min_w E_x[ℓ(w; x)] = min_w ∫ p(x) ℓ(w; x) dx

So, we are approximating the integral by the average over the training data.

Gradient ascent in terms of expectations

True objective function: ℓ(w) = E_x[ℓ(w; x)]
Taking the gradient: ∇ℓ(w) = E_x[∇ℓ(w; x)]
True gradient ascent rule: w^(t+1) ← w^(t) + η E_x[∇ℓ(w^(t); x)]
How do we estimate the expected gradient?

SGD: Stochastic gradient ascent (or descent)

True gradient: E_x[∇ℓ(w; x)]
Sample-based approximation: (1/N) Σ_{i=1}^N ∇ℓ(w; x_i)
What if we estimate the gradient with just one sample???
- Unbiased estimate of the gradient
- Very noisy!
- Called stochastic gradient ascent (or descent), among many other names
- VERY useful in practice!!!
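A quick numerical check on toy data (an illustration, not from the slides): averaging the N single-point gradients exactly recovers the full-data gradient, which is why a uniformly sampled single point gives an unbiased, if noisy, estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 3))
y = rng.choice([-1, 1], size=50)
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-H @ w))
per_point = H * ((y == 1) - p)[:, None]    # gradient contribution of each point
full_gradient = per_point.sum(axis=0)
print(np.allclose(len(y) * per_point.mean(axis=0), full_gradient))  # True
```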
