Forward and Reverse Gradient-based Hyperparameter Optimization


1 Forward and Reverse Gradient-based Hyperparameter Optimization
Luca Franceschi 1,2, Michele Donini 1, Paolo Frasconi 3 and Massimiliano Pontil 1,2
massimiliano.pontil@iit.it
(1) Istituto Italiano di Tecnologia, (2) University College London, (3) Università degli Studi di Firenze
Optimization and Big Data Summer School, Veroli, July 3-7

2 Overview
[Diagram: a training algorithm fits the machine learning model (parameters) on the training set; hyperparameter optimization uses the validation set to tune both the hyperparameters governing algorithm behavior and those governing model capacity/design.]

3 Gradient-based Hyperparameter Optimization

Setting. Model parameters: $w \in \mathbb{R}^n$; hyperparameters: $\lambda \in \mathbb{R}^m$. Training error: $J$; validation error: $E$.

Classic hyperparameter optimization problem:
$\min_\lambda E(w(\lambda))$, where $w(\lambda) \in \arg\min_w J(w)$.

In practice $J$ is minimized with an iterative scheme
$s_t = \Phi_t(s_{t-1}, \lambda)$, $t = 1, \dots, T$,
where $s_t = (w_t, \dots) \in \mathbb{R}^d$ contains parameters and accessory variables, and $\Phi_t : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is a smooth mapping.

Aim: minimize the response function $f(\lambda) = E(s_T(\lambda))$.

4 Gradient-based Hyperparameter Optimization

Example setting: SGD with momentum.
Model parameters: $w \in \mathbb{R}^n$; accessory variables: $v \in \mathbb{R}^n$; hyperparameters: $\lambda = (\eta, \mu)$; state: $s_t = (w_t, v_t)$.

Dynamical system $\Phi_t(s_{t-1}, \lambda)$ defined as
$w_t = w_{t-1} - \eta\,(\mu v_{t-1} + \nabla J_t(w_{t-1}))$
$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1})$

For $\mu = 0$ this is gradient descent.
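
A minimal NumPy sketch of this dynamical system, assuming a generic gradient callable grad_J (a hypothetical helper standing in for the training-error gradient):

```python
import numpy as np

def phi(state, lam, grad_J):
    """One step of SGD with momentum, written as s_t = Phi_t(s_{t-1}, lambda).

    state  : tuple (w, v) of parameters and accessory (velocity) variables
    lam    : tuple (eta, mu) of hyperparameters (learning rate, momentum)
    grad_J : callable returning the gradient of the training error at w
    """
    w, v = state
    eta, mu = lam
    g = grad_J(w)
    v_new = mu * v + g               # v_t = mu * v_{t-1} + grad J_t(w_{t-1})
    w_new = w - eta * (mu * v + g)   # w_t = w_{t-1} - eta * (mu * v_{t-1} + grad J_t(w_{t-1}))
    return (w_new, v_new)

if __name__ == "__main__":
    # Toy training error J(w) = 0.5 * ||w||^2, so grad J(w) = w.
    grad_J = lambda w: w
    state = (np.ones(3), np.zeros(3))            # s_0 = (w_0, v_0)
    for t in range(100):
        state = phi(state, lam=(0.1, 0.9), grad_J=grad_J)
    print(state[0])                               # w_T, close to the minimizer 0
```

Setting mu = 0 in the call reproduces plain gradient descent, as noted on the slide.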

5 Reverse-mode Computation

Linked to BPTT [Werbos, 1990].

HO as a constrained optimization problem (fixed horizon $T$):
$\min_{\lambda, s_1, \dots, s_T} E(s_T)$ subject to $s_t = \Phi_t(s_{t-1}, \lambda)$, $t \in \{1, \dots, T\}$.

The Lagrangian is
$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^T \alpha_t \big(\Phi_t(s_{t-1}, \lambda) - s_t\big)$.

Define $A_t = \frac{\partial \Phi_t(s_{t-1},\lambda)}{\partial s_{t-1}}$, $B_t = \frac{\partial \Phi_t(s_{t-1},\lambda)}{\partial \lambda}$. Setting $\frac{\partial \mathcal{L}}{\partial s_t} = 0$ gives
$\alpha_t = \begin{cases} \nabla E(s_T) & \text{if } t = T, \\ \alpha_{t+1} A_{t+1} & \text{if } 0 \le t \le T-1, \end{cases}$
and
$\frac{\partial \mathcal{L}}{\partial \lambda} = \nabla E(s_T) \sum_{t=1}^T (A_{t+1} \cdots A_T)\, B_t = \nabla f(\lambda)$.
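
A minimal sketch of this reverse-mode computation on a hypothetical toy setup (not the authors' code): gradient-descent dynamics on $J(w) = \frac{1}{2} w^\top H w - b^\top w$ with the learning rate as the single hyperparameter, so $A_t = I - \lambda H$ and $B_t = -(H w_{t-1} - b)$, and validation error $E(w) = \frac{1}{2}\|w - w_{val}\|^2$:

```python
import numpy as np

def reverse_hg(w0, lam, H, b, w_val, T):
    # Forward pass: store the whole trajectory (memory grows with T).
    ws = [w0]
    for _ in range(T):
        w = ws[-1]
        ws.append(w - lam * (H @ w - b))      # s_t = Phi_t(s_{t-1}, lambda)

    # Backward pass: alpha_T = grad E(s_T), then accumulate alpha_t B_t.
    alpha = ws[-1] - w_val                    # grad E at w_T
    g = 0.0
    for t in range(T, 0, -1):
        B_t = -(H @ ws[t - 1] - b)            # dPhi_t/dlambda at w_{t-1}
        g += alpha @ B_t
        A_t = np.eye(len(w0)) - lam * H       # dPhi_t/dw at w_{t-1}
        alpha = alpha @ A_t                   # alpha_{t-1} = alpha_t A_t
    return g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.diag([1.0, 2.0, 3.0])
    b = rng.normal(size=3)
    w_val = rng.normal(size=3)
    print(reverse_hg(np.zeros(3), lam=0.05, H=H, b=b, w_val=w_val, T=50))
```

The backward sweep implements the sum $\nabla E(s_T) \sum_t (A_{t+1}\cdots A_T) B_t$ without ever forming the products explicitly, at the cost of storing the forward trajectory.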

6 Forward-mode Computation

Linked to RTRL [Williams and Zipser, 1989].

Direct calculation of the hypergradient $\nabla f(\lambda)$ using the chain rule:
$\nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d\lambda}$, with
$\frac{d s_t}{d\lambda} = \frac{\partial \Phi_t(s_{t-1},\lambda)}{\partial s_{t-1}} \frac{d s_{t-1}}{d\lambda} + \frac{\partial \Phi_t(s_{t-1},\lambda)}{\partial \lambda}$, $t \in \{1, \dots, T\}$.

Define $Z_0 = 0$ and $Z_t = \frac{d s_t}{d\lambda}$; this gives the recursive equation
$Z_t = A_t Z_{t-1} + B_t$, $t \in \{1, \dots, T\}$.

Thus $\nabla f(\lambda) = \nabla E(s_T) Z_T = \nabla E(s_T) \sum_{t=1}^T (A_{t+1} \cdots A_T)\, B_t$.
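
A minimal sketch of the forward-mode recursion on the same hypothetical toy setup used above (gradient-descent dynamics with the learning rate as the single hyperparameter); it should return the same hypergradient as the reverse-mode sketch:

```python
import numpy as np

def forward_hg(w0, lam, H, b, w_val, T):
    w = w0
    Z = np.zeros_like(w0)                     # Z_0 = ds_0/dlambda = 0
    for _ in range(T):
        A = np.eye(len(w0)) - lam * H         # dPhi/dw at w_{t-1}
        B = -(H @ w - b)                      # dPhi/dlambda at w_{t-1}
        Z = A @ Z + B                         # Z_t = A_t Z_{t-1} + B_t
        w = w - lam * (H @ w - b)             # s_t = Phi_t(s_{t-1}, lambda)
    grad_E = w - w_val                        # grad E(s_T) for E = 0.5 * ||w - w_val||^2
    return grad_E @ Z                         # hypergradient dE(s_T)/dlambda

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.diag([1.0, 2.0, 3.0])
    b = rng.normal(size=3)
    w_val = rng.normal(size=3)
    print(forward_hg(np.zeros(3), lam=0.05, H=H, b=b, w_val=w_val, T=50))
```

Note that no trajectory is stored: only the current state and the tangent matrix $Z_t$ are kept, which is why the memory cost scales with the number of hyperparameters rather than with $T$.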

7 Computational Complexity

Assuming Algorithmic Differentiation [Griewank and Walther, 2008]:
Time complexity of $\Phi_t$: $C_T = g(d, m)$
Space complexity of $\Phi_t$: $C_S = h(d, m)$

Reverse-HG:
  for t = 1 to T do
    s_t ← Φ_t(s_{t-1}, λ)
  end for
  α_T ← ∇E(s_T)
  g ← 0
  for t = T-1 downto 1 do
    α_t ← α_{t+1} A_{t+1}
    g ← g + α_t B_t
  end for
  return g
  {C_T = C_S = O(T g(d, m))}

Forward-HG:
  Z_0 ← 0
  for t = 1 to T do
    s_t ← Φ_t(s_{t-1}, λ)
    Z_t ← A_t Z_{t-1} + B_t
  end for
  return ∇E(s_T) Z_T
  {C_T = O(T m g(d, m)), C_S = O(m h(d, m))}

8 Real-Time Hyperparameter Optimization (RTHO)

With Forward-HG, partial hypergradients are available at every step $t$:
$\nabla f_t(\lambda) = \frac{d E(s_t)}{d\lambda} = \nabla E(s_t) Z_t$;
there is no need to specify the final horizon $T$.

[Diagram: RTHO adjusts both kinds of hyperparameters (capacity/design and algorithm behavior) during the training dynamics, using the validation set, while the model parameters are fit on the training set.]

9 RTHO pseudo-code

Choose a hyper-batch size Δ and a hyper-learning rate α.

RTHO:
  Z_0 ← 0
  for t = 1 to ... do
    s_t ← Φ_t(s_{t-1}, λ)
    Z_t ← A_t Z_{t-1} + B_t
    if t mod Δ = 0 then
      λ ← λ - α ∇E(s_t) Z_t
    end if
  end for
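
A minimal sketch of this real-time scheme on the same hypothetical toy setup as before (gradient-descent dynamics with the learning rate as the single hyperparameter). The clipping of the learning rate to a fixed range is an added heuristic of this sketch, not part of the slide's pseudo-code:

```python
import numpy as np

def rtho(w0, lam0, H, b, w_val, steps, delta, hyper_lr):
    w, lam = w0.copy(), lam0
    Z = np.zeros_like(w0)                         # Z_0 = 0
    for t in range(1, steps + 1):
        A = np.eye(len(w0)) - lam * H             # dPhi/dw at current state
        B = -(H @ w - b)                          # dPhi/dlambda at current state
        Z = A @ Z + B                             # forward accumulation Z_t = A_t Z_{t-1} + B_t
        w = w - lam * (H @ w - b)                 # training step s_t = Phi_t(s_{t-1}, lambda)
        if t % delta == 0:                        # every Delta steps, update the hyperparameter
            hypergrad = (w - w_val) @ Z           # grad E(s_t) Z_t
            lam = float(np.clip(lam - hyper_lr * hypergrad, 0.0, 0.3))  # heuristic stability clip
    return w, lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.diag([1.0, 2.0, 3.0])
    b = rng.normal(size=3)
    w_val = rng.normal(size=3)
    print(rtho(np.zeros(3), lam0=0.01, H=H, b=b, w_val=w_val,
               steps=500, delta=20, hyper_lr=1e-3))
```

The key design point is that the tangent matrix Z keeps being accumulated across hyperparameter updates, so hyperparameters can be adjusted on the fly during a single training run.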

10 Learning Task Interactions

Many multitask methods require that a task interaction matrix $A$ is given as input to the learning algorithm. We employed the following MTL regularizer:

$\Omega_{A,\rho}(W) = \frac{1}{2} \sum_{t,h=1}^T A_{t,h} \|w_t - w_h\|^2 + \rho \sum_{t=1}^T \|w_t\|^2$    (1)
$\qquad\qquad\; = \sum_{t,h=1}^T \langle w_t, w_h \rangle \underbrace{\big(L_{t,h} + \rho\,\delta_{t,h}\big)}_{G^{-1}_{t,h}}$    (2)

where $L$ is the Laplacian: $L_{t,h} = \delta_{t,h} d_t - A_{t,h}$, $d_t = \sum_{h=1}^T A_{t,h}$.

In real applications the matrix $A$ is often unknown, and it is interesting to learn it from data.
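
A small numeric sketch (illustrative only, with a randomly generated symmetric task-interaction matrix) checking that the pairwise form (1) and the Laplacian form (2) of this regularizer agree:

```python
import numpy as np

def omega_pairwise(W, A, rho):
    # W stores one task vector per row, shape (T, n); A is the task-interaction matrix.
    T = W.shape[0]
    val = 0.5 * sum(A[t, h] * np.sum((W[t] - W[h]) ** 2)
                    for t in range(T) for h in range(T))
    return val + rho * np.sum(W ** 2)

def omega_laplacian(W, A, rho):
    d = A.sum(axis=1)
    L = np.diag(d) - A                        # graph Laplacian of A
    M = L + rho * np.eye(len(d))              # the matrix labelled G^{-1} on the slide
    return np.trace(W.T @ (M @ W))            # sum_{t,h} <w_t, w_h> M_{t,h}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n = 4, 5
    A = rng.random((T, T)); A = 0.5 * (A + A.T); np.fill_diagonal(A, 0.0)
    W = rng.normal(size=(T, n))
    print(omega_pairwise(W, A, 0.1), omega_laplacian(W, A, 0.1))   # equal up to rounding
```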

11 Output Kernel Learning

Dinuzzo et al. [2011] propose a method to learn simultaneously a vector-valued function and a kernel among its components.

$\mathcal{H}$ is the RKHS of vector-valued functions associated with the matrix-valued kernel $H = KG$, where $K$ is a psd scalar kernel and $G$ is a symmetric psd matrix linked to $A$ and $\rho$.

Their optimization problem is
$\min_{G \in S^m_+} \min_{g \in \mathcal{H}} \; \lambda \sum_{i=1}^n \|g(x_i) - y_i\|^2 + \|g\|^2_{\mathcal{H}} + \|G\|^2_F$.

12 Output Kernel Learning

By the representer theorem,
$g^*(x) = G \sum_{i=1}^n c_i\, K(x, x_i)$.

The optimization problem becomes
$\min_{G \in S^m_+} \min_{C \in \mathbb{R}^{l \times m}} \; \lambda \|Y - KCG\|^2_F + \langle C^\top K C, G \rangle_F + \|G\|^2_F$.

The problem is not jointly convex in $G$ and $C$. The authors show that each local minimizer is also a global minimizer (the objective is an invex function).

13 Output Kernel Learning

Jawanpuria et al. [2015] exploit the idea of learning a sparse output kernel. The regularizer enforces sparsity of the output kernel, avoiding spurious relations among different tasks and leading to interpretable solutions:

$\Omega(G) = \|G\|_p^p, \quad p \in (1, 2]$.    (3)

This regularizer can be combined with an arbitrary convex loss function in the resulting minimization problem.
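
An illustrative sketch (not the authors' code) of the output kernel learning objective from the two previous slides, with the p-norm taken here, as an assumption, to be entrywise; p = 2 recovers the Frobenius regularizer of Dinuzzo et al., while p closer to 1 corresponds to the sparsity-inducing choice of Jawanpuria et al.:

```python
import numpy as np

def okl_objective(Y, K, C, G, lam, p=2.0):
    """Y: (l, m) targets, K: (l, l) scalar kernel matrix, C: (l, m) coefficients,
    G: (m, m) symmetric psd output kernel, lam: trade-off, p in (1, 2]."""
    fit = lam * np.sum((Y - K @ C @ G) ** 2)      # lam * ||Y - K C G||_F^2
    rkhs = np.sum((C.T @ K @ C) * G)              # <C' K C, G>_F, the RKHS norm of g
    reg = np.sum(np.abs(G) ** p)                  # entrywise p-norm regularizer ||G||_p^p
    return fit + rkhs + reg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    l, m = 20, 3
    X = rng.normal(size=(l, 2))
    K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))   # Gaussian kernel
    Y = rng.normal(size=(l, m))
    C = rng.normal(size=(l, m))
    G = np.eye(m)
    print(okl_objective(Y, K, C, G, lam=1.0, p=2.0))    # Frobenius regularizer
    print(okl_objective(Y, K, C, G, lam=1.0, p=4/3))    # sparsity-inducing p-norm
```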

14 CIFAR

We use our method to learn the matrix $A$ from the validation set by applying Reverse-HG on the CIFAR-10 dataset.

15 CIFAR

Test accuracy ± standard deviation on CIFAR-10. The third column is the p-norm used in the regularizer for the task interaction matrix A.

Method        CIFAR-10        p
STL           67.47 ± 2.78
Naive MTL     69.41 ± 1.90
Dinuzzo       69.96 ±
Jawanpuria    70.20 ±
Jawanpuria    70.96 ± 1.04    4/3
R-HG          70.85 ± 1.87
R-HG-Sparse   71.62 ±

16 BONUS: Data Hyper-cleaning

Suppose we have a dataset with label noise and, due to time or resource constraints, we can only afford to clean up a subset of the available data. We use the cleaned data as the validation set and assign one hyperparameter to each training example, i.e. the training loss becomes
$\frac{1}{N_{tr}} \sum_i \lambda_i\, \ell(w, x_i)$ with $\lambda_i \in [0, 1]$.

By putting a sparsity constraint on the vector of hyperparameters, $\|\lambda\|_1 < R$, we bring to zero the influence of noisy examples, using Forward-HG on a corrupted MNIST dataset.
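
A minimal sketch of the hyper-cleaning training objective on a hypothetical toy setup: per-example hyperparameters weight a squared loss on a linear model, and a crude clip-and-rescale step (a heuristic of this sketch, not an exact projection) keeps the weights in the box and inside the l1 budget:

```python
import numpy as np

def weighted_training_loss(w, X, y, lam_vec):
    # (1/N_tr) * sum_i lambda_i * l(w, x_i), with a squared loss per example.
    residuals = X @ w - y
    return np.mean(lam_vec * residuals ** 2)

def project_l1_box(lam_vec, R):
    # Heuristic: clip to [0, 1], then rescale if ||lambda||_1 <= R is violated.
    lam_vec = np.clip(lam_vec, 0.0, 1.0)
    total = lam_vec.sum()
    return lam_vec if total <= R else lam_vec * (R / total)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
    lam_vec = np.ones(100)                              # start by trusting every example
    print(weighted_training_loss(np.zeros(5), X, y, lam_vec))
    print(project_l1_box(lam_vec + 0.5, R=60.0).sum())  # respects the l1 budget
```

Examples whose weight is driven to zero by the hypergradient updates no longer contribute to the training loss, which is how noisy labels are discarded.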

17 Reference for this talk

L. Franceschi, M. Donini, P. Frasconi, M. Pontil. Forward and Reverse Gradient-Based Hyperparameter Optimization. Proceedings of the International Conference on Machine Learning (ICML), 2017. Also available as an arXiv preprint.

18 Additional references

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8), 2000.

Francesco Dinuzzo, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler. Learning output kernels with block coordinate descent. In ICML, pages 49-56, 2011.

Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, second edition, 2008.

Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, 2015.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In ICML, volume 48, 2016.

Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1990.

Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 1989.
