Modern Optimization Techniques


1 Modern Optimization Techniques Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Stochastic Gradient Descent Stochastic Gradient Descent 1 / 32

2 Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 1 / 32

3 1. Unconstrained Optimization Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 1 / 32

4 1. Unconstrained Optimization Gradient Descent
1: procedure GradientDescent    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β := β − µ ∇f_0(β)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 1 / 32
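
As an illustration (not from the slides), a minimal Python sketch of this procedure, assuming the caller supplies the gradient ∇f_0 as a function grad_f0 and a fixed step size in place of the step-size search:

import numpy as np

def gradient_descent(grad_f0, beta0, step_size=0.01, max_epochs=1000, tol=1e-8):
    # beta0: initial point; grad_f0(beta) returns the gradient of f_0 at beta
    beta = beta0.copy()
    for _ in range(max_epochs):
        g = grad_f0(beta)
        beta = beta - step_size * g       # beta := beta - mu * grad f_0(beta)
        if np.linalg.norm(g) < tol:       # simple convergence criterion
            break
    return beta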

5 2. Stochastic Gradient Descent Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 2 / 32

6 2. Stochastic Gradient Descent Practical Example: Household Spending Suppose we have the following data about different households: Number of workers in the household (x_1) Household composition (x_2) Region (x_3) Gross normal weekly household income (x_4) Weekly household spending (y) We want to create a model of the weekly household spending Stochastic Gradient Descent 2 / 32

7 2. Stochastic Gradient Descent Practical Example: Household Spending If we have data about m households, we can represent it as a matrix X ∈ R^{m×n} with rows x_i^T = (1, x_{i,2}, ..., x_{i,n}) and a target vector y = (y_1, y_2, ..., y_m)^T. We can model the household consumption as a linear combination of the household features with parameters β: ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 x_{i,3} + β_4 x_{i,4} Stochastic Gradient Descent 3 / 32
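
A concrete prediction with this linear model in NumPy (not from the slides; the parameter and feature values below are made up purely for illustration):

import numpy as np

# beta = (beta_0, beta_1, ..., beta_4), x_i = (1, x_{i,1}, ..., x_{i,4})
beta = np.array([50.0, 30.0, 10.0, 5.0, 0.4])    # illustrative parameters
x_i  = np.array([1.0, 2.0, 3.0, 1.0, 800.0])     # illustrative household features
y_hat = beta @ x_i                               # predicted weekly spending: beta^T x_i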

8 2. Stochastic Gradient Descent Least Squares Problem Revisited The following least squares problem: minimize ‖Xβ − y‖₂² can be rewritten as: minimize ∑_{i=1}^m (β^T x_i − y_i)² with X ∈ R^{m×n} having rows x_i^T = (1, x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}) and y = (y_1, y_2, ..., y_m)^T Stochastic Gradient Descent 4 / 32

9 2. Stochastic Gradient Descent The Gradient Descent update rule For the problem: minimize ∑_{i=1}^m (β^T x_i − y_i)² the gradient of the objective function is: ∇_β f_0(β) = 2 ∑_{i=1}^m (β^T x_i − y_i) x_i The Gradient Descent update rule is then: β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i) Stochastic Gradient Descent 5 / 32
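
A hypothetical NumPy sketch of this full-batch update (not from the slides), assuming the design matrix X already contains the leading column of ones:

import numpy as np

def gd_least_squares(X, y, step_size=1e-4, max_epochs=500):
    # Full-batch gradient descent for f_0(beta) = sum_i (beta^T x_i - y_i)^2
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(max_epochs):
        residuals = X @ beta - y          # beta^T x_i - y_i for all i
        grad = 2.0 * X.T @ residuals      # 2 * sum_i (beta^T x_i - y_i) x_i
        beta = beta - step_size * grad
    return beta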

10 2. Stochastic Gradient Descent The Gradient Descent update rule β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i) We need to see all the data before updating β. Can we make any progress before iterating over all the data? Stochastic Gradient Descent 6 / 32

11 2. Stochastic Gradient Descent Decomposing the objective function The objective function f_0(β) = ∑_{i=1}^m (β^T x_i − y_i)² can be expressed through the objective on each data point (x_i, y_i): g(β, i) = (β^T x_i − y_i)² so that f_0(β) = ∑_{i=1}^m g(β, i) Stochastic Gradient Descent 7 / 32
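
As a small illustration (not from the slides), this decomposition for the least squares case, with hypothetical helper functions g and f0:

import numpy as np

def g(beta, X, y, i):
    # Objective on a single data point: (beta^T x_i - y_i)^2
    return (X[i] @ beta - y[i]) ** 2

def f0(beta, X, y):
    # Full objective as the sum of the per-instance objectives
    return sum(g(beta, X, y, i) for i in range(X.shape[0]))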

12 2. Stochastic Gradient Descent A simpler update rule Now that we have f_0(β) = ∑_{i=1}^m g(β, i) we can define the following update rule: Pick a random instance i ∼ Uniform(1, m) Update β ← β − µ ∇_β g(β, i) Stochastic Gradient Descent 8 / 32

13 2. Stochastic Gradient Descent Stochastic Gradient Descent (SGD)
1: procedure StochasticGradientDescent    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ ∇_β g(β, i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
Stochastic Gradient Descent 9 / 32

14 2. Stochastic Gradient Descent SGD and the least squares We have f_0(β) = ∑_{i=1}^m g(β, i) with g(β, i) = (β^T x_i − y_i)² and ∇_β g(β, i) = 2(β^T x_i − y_i) x_i The update rule is β ← β − µ (2(β^T x_i − y_i) x_i) Stochastic Gradient Descent 10 / 32
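
A hypothetical NumPy sketch of SGD for the least squares problem, following this per-instance update (visiting the instances in a random order each epoch is an implementation choice, not from the slide):

import numpy as np

def sgd_least_squares(X, y, step_size=1e-4, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):                    # one pass over shuffled instances
            grad_i = 2.0 * (X[i] @ beta - y[i]) * X[i]  # gradient of g(beta, i)
            beta = beta - step_size * grad_i
    return beta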

15 2. Stochastic Gradient Descent SGD vs. GD
1: procedure SGD    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ ∇_β g(β, i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
1: procedure GradientDescent    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β := β − µ ∇f_0(β)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 11 / 32

16 2. Stochastic Gradient Descent SGD vs. GD - Least Squares
1: procedure SGD    input: f_0, µ
2:   Get initial point β
3:   repeat
4:     for i ← 1, ..., m do
5:       β ← β − µ (2(β^T x_i − y_i) x_i)
6:     end for
7:   until convergence
8:   return β, f_0(β)
9: end procedure
1: procedure GD    input: f_0
2:   Get initial point β
3:   repeat
4:     Get Step Size µ
5:     β ← β − µ (2 ∑_{i=1}^m (β^T x_i − y_i) x_i)
6:   until convergence
7:   return β, f_0(β)
8: end procedure
Stochastic Gradient Descent 12 / 32

17 3. Choosing the right step size Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 13 / 32

18 3. Choosing the right step size Choosing the step size for SGD The step size µ is a crucial parameter to be tuned Given the low cost of the SGD update, using line search for the step size is a bad choice Possible alternatives: Fixed step size Armijo principle Bold-Driver Adagrad Stochastic Gradient Descent 13 / 32

19 3. Choosing the right step size Real World Dataset: Body Fat prediction We want to estimate the percentage of body fat based on various attributes: Age (years) Weight (lbs) Height (inches) Neck circumference (cm) Chest circumference (cm) Abdomen 2 circumference (cm) Hip circumference (cm) Thigh circumference (cm) Knee circumference (cm)... Stochastic Gradient Descent 14 / 32

20 3. Choosing the right step size Real World Dataset: Body Fat prediction The data is represented as a matrix X ∈ R^{m×n} with rows x_i^T = (1, x_{i,1}, x_{i,2}, ..., x_{i,n}) and a target vector y = (y_1, y_2, ..., y_m)^T, with m = 252, n = 14 We can model the percentage of body fat y as a linear combination of the body measurements with parameters β: ŷ_i = β^T x_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ... + β_n x_{i,n} Stochastic Gradient Descent 15 / 32

21 3. Choosing the right step size SGD - Fixed Step Size on the Body Fat dataset [Figure: MSE over iterations for SGD with different fixed step sizes] Stochastic Gradient Descent 16 / 32

22 3. Choosing the right step size Bold Driver Heuristic The Bold Driver Heuristic makes the assumption that smaller step sizes are needed when closer to the optimum It adjusts the step size based on the value of f_0(β_t) compared to f_0(β_{t−1}) If the value of f_0(β) grows, the step size must decrease If the value of f_0(β) decreases, the step size can be larger for faster convergence Stochastic Gradient Descent 17 / 32

23 3. Choosing the right step size Bold Driver Heuristic - Update Rule We have f_0(β) = ∑_{i=1}^m g(β, i) We need to define an increase factor γ and a decay factor ν For each epoch: Evaluate the objective function f_0(β_{t−1}) Cycle through the whole data and update the parameters Evaluate the objective function f_0(β_t) If f_0(β_t) < f_0(β_{t−1}) then µ ← γµ, else µ ← νµ Widely used values: γ = 1.05 and ν = 0.5 Stochastic Gradient Descent 18 / 32

24 3. Choosing the right step size SGD with Bold Driver
1: procedure BoldDriverSGD    input: f_0, µ, γ and ν
2:   Get initial point β
3:   repeat
4:     ε_{t−1} ← f_0(β)
5:     for i ← 1, ..., m do
6:       β ← β − µ ∇_β g(β, i)
7:     end for
8:     ε_t ← f_0(β)
9:     if ε_t < ε_{t−1} then
10:      µ ← γµ
11:    else
12:      µ ← νµ
13:    end if
14:  until convergence
15:  return β, f_0(β)
16: end procedure
Stochastic Gradient Descent 19 / 32
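
A hypothetical Python sketch of this procedure (not from the slides), assuming the caller supplies f0(beta) and grad_g(beta, i) as functions and beta0 as a NumPy array; the step-size adaptation follows the rule on the previous slide:

def bold_driver_sgd(f0, grad_g, beta0, m, step_size=0.01,
                    gamma=1.05, nu=0.5, epochs=50):
    beta = beta0.copy()
    for _ in range(epochs):
        prev_loss = f0(beta)                      # f_0(beta_{t-1})
        for i in range(m):                        # one pass over the data
            beta = beta - step_size * grad_g(beta, i)
        curr_loss = f0(beta)                      # f_0(beta_t)
        if curr_loss < prev_loss:
            step_size *= gamma                    # objective decreased: allow a larger step
        else:
            step_size *= nu                       # objective increased: shrink the step
    return beta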

25 3. Choosing the right step size Considerations Works well for a range of problems The initial µ just needs to be large enough γ and ν need to be adjusted May lead to faster convergence rates Stochastic Gradient Descent 20 / 32

26 3. Choosing the right step size AdaGrad Adagrad adjusts the step size for each parameter to be optimized It uses information about the past gradients Leads to faster convergence Less sensitive to the choice of the step size Stochastic Gradient Descent 21 / 32

27 3. Choosing the right step size AdaGrad - Update Rule We have f_0(β) = ∑_{i=1}^m g(β, i) Update rule: Pick a random instance i ∼ Uniform(1, m) Compute the gradient ∇_β g(β, i) Update the gradient history h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i), where ⊙ denotes the elementwise product The step size for parameter β_j is µ / √h_j Update β ← β − (µ / √h) ⊙ ∇_β g(β, i) Stochastic Gradient Descent 22 / 32

28 3. Choosing the right step size SGD with AdaGrad
1: procedure AdaGradSGD    input: f_0, µ
2:   Get initial point β
3:   h ← 0
4:   repeat
5:     for i ← 1, ..., m do
6:       h ← h + ∇_β g(β, i) ⊙ ∇_β g(β, i)
7:       β ← β − (µ / √h) ⊙ ∇_β g(β, i)
8:     end for
9:   until convergence
10:  return β, f_0(β)
11: end procedure
Stochastic Gradient Descent 23 / 32
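
A hypothetical NumPy sketch of this procedure (not from the slides), assuming grad_g(beta, i) returns the per-instance gradient; the small constant eps guarding against division by zero is an implementation detail, not part of the slide:

import numpy as np

def adagrad_sgd(grad_g, beta0, m, step_size=0.1, epochs=50, eps=1e-8):
    beta = beta0.copy()
    h = np.zeros_like(beta)                  # accumulated squared gradients
    for _ in range(epochs):
        for i in range(m):
            grad = grad_g(beta, i)
            h = h + grad * grad              # elementwise product
            beta = beta - (step_size / np.sqrt(h + eps)) * grad
    return beta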

29 3. Choosing the right step size AdaGrad Step Size [Figure: MSE over iterations for AdaGrad with different initial step sizes] Stochastic Gradient Descent 24 / 32

30 3. Choosing the right step size AdaGrad vs Fixed Step Size [Figure: MSE over iterations, AdaGrad vs. fixed step size] Stochastic Gradient Descent 25 / 32

31 4. Stochastic Gradient Descent in Practice Outline 1. Unconstrained Optimization 2. Stochastic Gradient Descent 3. Choosing the right step size 4. Stochastic Gradient Descent in Practice Stochastic Gradient Descent 26 / 32

32 4. Stochastic Gradient Descent in Practice GD Step Size [Figure: MSE over iterations for GD with different step sizes] Stochastic Gradient Descent 26 / 32

33 4. Stochastic Gradient Descent in Practice SGD vs GD - Body Fat Dataset [Figure: MSE over iterations, SGD vs. GD] Stochastic Gradient Descent 27 / 32

34 4. Stochastic Gradient Descent in Practice Year Prediction Data Set Least Squares Problem Prediction of the release year of a song from audio features 90 features Experiments done on a subset of 1000 instances of the data Stochastic Gradient Descent 28 / 32

35 4. Stochastic Gradient Descent in Practice GD Step Size - Year Prediction [Figure: MSE over iterations for GD with different step sizes on the Year Prediction data] Stochastic Gradient Descent 29 / 32

36 4. Stochastic Gradient Descent in Practice SGD Step Size - Year Prediction [Figure: MSE over iterations for SGD with different step sizes on the Year Prediction data] Stochastic Gradient Descent 30 / 32

37 4. Stochastic Gradient Descent in Practice AdaGrad Step Size - Year Prediction [Figure: MSE over iterations for AdaGrad with different step sizes on the Year Prediction data] Stochastic Gradient Descent 31 / 32

38 4. Stochastic Gradient Descent in Practice AdaGrad vs SGD vs GD - Year Prediction [Figure: MSE over iterations for AdaGrad, GD, and SGD on the Year Prediction data] Stochastic Gradient Descent 32 / 32
