Forward and Reverse Gradient-Based Hyperparameter Optimization

Size: px

Start display at page:

Download "Forward and Reverse Gradient-Based Hyperparameter Optimization"

Allyson Norman
5 years ago
Views:

1 Forward and Reverse Gradient-Based Hyperparameter Optimization Luca Franceschi 1,2, Michele Donini 1, Paolo Frasconi 3, Massimiliano Pontil 1,2 1 Istituto Italiano di Tecnologia, Genoa, Italy 2 Dept of Computer Science, University College London, UK 3 Dept of Information Engineering, Universit degli Studi di Firenze, Italy ICML 2017/ Presenter: Ji Gao Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 1 /

2 Outline 1 Motivation 2 Method Overview Optimization 3 Complexity analysis 4 Experiment Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 2 /

3 Motivation Choose hyperparameters in optimization are hard Could we automatically select hyperparameters? Hyperparameter optimization: Construct a response function of the hyperparameters and explore the hyperparameter space to seek an optimum Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 3 /

4 Related Work Grid search: List parameters on a grid and train all of them. Problem: Impractical when number of hyperparameters is large. Even outperform by random search. Bayesian optimization: Treat the global process as a random function and place a prior over it. After that, construct an acquisition function (referred to as infill sampling criteria) that determines the next query point. Gradient-based methods: Use the gradient method to optimize hyperparameters. Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 4 /

5 Hyperparameters s: state in R d, including weights (object) and hyperparameters λ. An example in such definition: s t = Φ t (s t 1, λ) Gradient Descent with Momentum w: weights. J: objective function. λ: hyperparameters s t = (v t, w t ) : In this case: λ = (µ, η) v t = µv t 1 + J t (w t 1 ) w t = w t 1 η(µv t 1 J t (w t 1 )) uca Franceschi, Michele Donini, Paolo Frasconi Forward, Massimiliano and ReversePontil Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 5 /

6 Problem formulation Goal of hyperparameter optimization Solve: min f (λ) λ Where a response function f : R m R is defined at λ R m as E: Validation error f (λ) = E(s T (λ)) Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 6 /

7 Problem formulation - Optimization Goal of hyperparameter optimization Solve: min λ,s 1,...s t E(s T ) Subject to: s t = Φ t (s t 1, λ) Lagrangian: T L(s, λ, α) = E(s T ) + α t (Φ t (s t 1, λ) s t ) t=1 Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 7 /

8 Problem formulation - Optimization Lagrangian: L(s, λ, α) = E(s T ) + Derivatives of Lagrangian: T α t (Φ t (s t 1, λ) s t ) t=1 L = Φ t (s t 1, λ) s t, t = 1..T a t L Φ t (s t, λ) = a t+1 a t, t = 1..T s t s t L = E(s T ) a T s T L T λ = Φ t (s t 1, λ) α t λ t=1 Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 8 /

9 Problem formulation - Optimization Solution: Solution can be achieved by setting each derivatives to 0. Let A t = Φt(s t 1,λ) s t 1, B t = Φt(s t 1,λ) λ The the solution is: a t = E(s T )A t+1..a T And we have: L λ = E(s T ) T (A t+1..a T )B t (1) t=1 Gradient-Based ( Istituto Hyperparameter Italiano di ICML Tecnologia, 2017/ Optimization Presenter: Genoa, Italy) Ji Gao 9 /

10 Algorithm ICML 2017/ Presenter: Ji Gao 10 /

11 Another way to Calculate We have: Let Z t = ds T dλ, Lead to a recursive solution f (λ) = E(S T ) ds T dλ Z t = A t Z t 1 + B t ICML 2017/ Presenter: Ji Gao 11 /

12 Algorithm2 Can be real-time updated. ICML 2017/ Presenter: Ji Gao 12 /

13 Background: Algorithmic (Automatic) Differentiation Algorithmic Differentiation: Techniques to numerically evaluate the derivative of a function. Two modes of AD: Forward mode and Reverse mode. ICML 2017/ Presenter: Ji Gao 13 /

14 Complexity of Algorithmic Differentiation Complexity of calculating the Jacobian matrix (the matrix of all first-order partial derivatives): Suppose f : R n R p can be evaluated in time c(n, p) and space s(n, p). We have: For any vector r R n, product of r and Jacobian matrix J F r can be evaluated in time O(c(n, p)) and space O(s(n, p)) using forward mode AD. For any vector q R p, product of q and Jacobian matrix q T J F can be evaluated in time O(c(n, p)) and space O(s(n, p)) using reverse mode AD. Jacobian can be calculated in time O(nc(n, p)) using forward mode, and O(pc(n, p)) using reverse mode. ICML 2017/ Presenter: Ji Gao 14 /

15 Complexity Suppose s t = Φ t (s t 1, λ) can updated in time g(d, m) and space h(d, m). For Algorithm 1: Each step of a t+1 A t+1 and a t B t cost O(g(d, m)) time. So it s totally O(Tg(d, m)) time. For space, we need to store all s t, which requires O(Th(d, m)) space. ICML 2017/ Presenter: Ji Gao 15 /

16 Complexity - Algorithm 2 For Algorithm 2: Each step of A t Z t+1 require m Jacobian vector multiplication, so the cost is O(mg(d, m)) time. So it s totally O(Tmg(d, m)) time. For space, we only need to store the current s t, which requires O(h(d, m)) space. ICML 2017/ Presenter: Ji Gao 16 /

17 Experiment 1 - Data Hyper-cleaning Task: Have a large dataset with corrupted labels. Can only afford to clean a subset. Train a model. Method: Weighting every training sample a hyperparameter in [0,1]. Train with a weighted loss on the cleaned validation set. Train a plain softmax regression model with weight W and bias b Optimization problem: min λ E val (W T, b T ) Subject to: λ [0, 1] Ntr, λ 1 R Experiment design: 5000 examples from MNIST dataset as the training data, corrupt 2500 of them. Have 5000 more as validation data, and as test set. ICML 2017/ Presenter: Ji Gao 17 /

18 Experiment 1 result Oracle: Train with 2500 correct samples together with validation set. Baseline: Train with corrupted data and validation set. DH-R: Optimize and find a cleaned subset D c with a different R value, and finally train with D c and the validation set. ICML 2017/ Presenter: Ji Gao 18 /

19 Experiment 1 result It can successfully discard corrupted samples. ICML 2017/ Presenter: Ji Gao 19 /

20 Experiment 2 - Multiple task learning Task: Find simultaneously the model of several different related tasks. For example, few shot learning. Experiment design: Try both CIFAR-10 and CIFAR samples on CIFAR-10, 300 samples on CIFAR-100 as training set. Same size of validation set, and all rest for testing. Use pretrained Inception-V3 model to fetch the feature. Use a regularizer from [Evgeniou et al., 2005] Ω A,ρ (W ) = K j,k=1 A j,k w j w k ρ K k=1 w k 2 Training error E tr (W ) = i l(wx i + b, y i ) + Ω A,ρ (W ) Optimization problem: min λ E val (W T, b T ) Subject to: ρ 0, A 0 ICML 2017/ Presenter: Ji Gao 20 /

21 Experiment 2 result HMTL-S algorithm find the following relationship graph: ICML 2017/ Presenter: Ji Gao 21 /

22 Experiment 3 - Phone classification Task: Phone state classification over 183 classes. Experiment design: Data: TIMIT phonetic recognition dataset. Model: A previous multi task learning framework [Badino,2016]. Hyperparameters: learning rate η, momentum µ and ρ, a hyperparameter of the algorithm ICML 2017/ Presenter: Ji Gao 22 /

23 Experiment 3 result ICML 2017/ Presenter: Ji Gao 23 /

24 Experiment 3 result ICML 2017/ Presenter: Ji Gao /

ON HYPERPARAMETER OPTIMIZATION IN LEARNING SYSTEMS

ON HYPERPARAMETER OPTIMIZATION IN LEARNING SYSTEMS Luca Franceschi 1,2, Michele Donini 1, Paolo Frasconi 3, Massimiliano Pontil 1,2 (1) Istituto Italiano di Tecnologia, Genoa, 16163 Italy (2) Dept of Computer