BACKPROPAGATION THROUGH THE VOID


1 BACKPROPAGATION THROUGH THE VOID
Optimizing Control Variates for Black-Box Gradient Estimation
27 Nov 2017, University of Cambridge
Speaker: Geoffrey Roeder, University of Toronto

2 OPTIMIZING EXPECTATIONS
$L(\theta) = E_{p(b|\theta)}[f(b)]$
Variational inference: Evidence Lower Bound
Reinforcement learning: Expected Reward Function
Hard attention mechanism
How to choose the parameters $\theta$ to maximize this expectation?

3 GRADIENT-BASED OPTIMIZATION
Reverse-mode automatic differentiation (backpropagation) computes exact gradients of deterministic, differentiable objectives.
Reparameterization trick (Williams, 1992; Kingma & Welling 2014; Rezende 2014): using backprop, gives unbiased, low-variance estimates of gradients of expectations.
This has allowed effective stochastic optimization of large probabilistic continuous latent-variable models.

4 GRADIENT-BASED OPTIMIZATION: LIMITATIONS
There are many relevant objective functions in ML to which backpropagation cannot be applied.
In RL, the reward function is unknown: a black box from the perspective of the agent.
Discrete latent-variable models: discrete sampling creates discontinuities, so the sampled objective has zero gradient w.r.t. its parameters almost everywhere.

5 GRADIENT-BASED OPTIMIZATION: LIMITATIONS
But gradients are appealing: in high dimensions, a gradient provides information on how to adjust each parameter individually.
Moreover, stochastic optimization is essential for scalability.
However, stochastic optimizers are only guaranteed to converge to a fixed point of an objective if the gradient estimator is unbiased.

6 How can we build unbiased stochastic estimators of $\frac{\partial}{\partial\theta} L(\theta)$?

7 SCORE-FUNCTION ESTIMATOR (REINFORCE, WILLIAMS 1992)

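8 SCORE-FUNCTION ESTIMATOR (REINFORCE, WILLIAMS 1992)
$\frac{\partial}{\partial\theta} E_{p(b|\theta)}[f(b)] = \frac{\partial}{\partial\theta} \int p(b|\theta)\, f(b)\, db$
We can estimate this quantity with Monte Carlo integration.
High variance: convergence to a good solution is challenging.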

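9 SCORE-FUNCTION ESTIMATOR (REINFORCE, WILLIAMS 1992)
$\frac{\partial}{\partial\theta} E_{p(b|\theta)}[f(b)] = \frac{\partial}{\partial\theta} \int p(b|\theta)\, f(b)\, db = E_{p(b|\theta)}\left[f(b)\, \frac{\partial}{\partial\theta} \log p(b|\theta)\right]$
The log-derivative trick allows us to rewrite the gradient of an expectation as the expectation of a gradient (under weak regularity conditions).
We can estimate this quantity with Monte Carlo integration.
High variance: convergence to a good solution is challenging.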

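10 SCORE-FUNCTION ESTIMATOR (REINFORCE, WILLIAMS 1992)
$\frac{\partial}{\partial\theta} E_{p(b|\theta)}[f(b)] = \frac{\partial}{\partial\theta} \int p(b|\theta)\, f(b)\, db = E_{p(b|\theta)}\left[f(b)\, \frac{\partial}{\partial\theta} \log p(b|\theta)\right]$
$\hat{g}_{SF}[f] = f(b)\, \frac{\partial}{\partial\theta} \log p(b|\theta), \quad b \sim p(b|\theta)$
The log-derivative trick allows us to rewrite the gradient of an expectation as the expectation of a gradient (under weak regularity conditions).
Yields an unbiased but high-variance estimator.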
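
A minimal sketch of the score-function estimator for a single Bernoulli variable, using only the Python standard library. The objective `f`, the target `t = 0.2`, and the value `theta = 0.3` are assumptions chosen for illustration. The sketch only evaluates `f`, never its derivative, which is the black-box setting.

```python
import random

def f(b, t=0.2):
    # Hypothetical black-box objective: only its value is used, never its gradient.
    return (t - b) ** 2

def score(b, theta):
    # d/dtheta log p(b | theta) for b ~ Bernoulli(theta)
    return b / theta - (1 - b) / (1 - theta)

def g_sf(theta, rng):
    # One-sample score-function (REINFORCE) estimate of d/dtheta E[f(b)]
    b = 1 if rng.random() < theta else 0
    return f(b) * score(b, theta)

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
theta = 0.3
sf_mean, sf_var = mean_var([g_sf(theta, rng) for _ in range(100_000)])

# Exact gradient: d/dtheta [theta * f(1) + (1 - theta) * f(0)] = f(1) - f(0)
exact = f(1) - f(0)
print("Monte Carlo mean:", sf_mean, "exact:", exact, "per-sample variance:", sf_var)
```

The sample mean should sit close to the exact gradient, while the per-sample variance is larger than the squared gradient itself, which is the difficulty the slide refers to.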

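11 REPARAMETERIZATION TRICK
$\hat{g}_{REP}[f] = \frac{\partial}{\partial\theta} f(b) = \frac{\partial f}{\partial b} \frac{\partial b}{\partial\theta}, \quad b = T(\theta, \epsilon), \quad \epsilon \sim p(\epsilon)$
Requires f to be known and differentiable.
Requires the distribution $p(b|\theta)$ to be reparameterizable through a transformation $T(\theta, \epsilon)$.
Unbiased; lower variance empirically.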
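
A small sketch comparing the reparameterization and score-function estimators on the same noise draws. The Gaussian $b = \theta + \epsilon$, the objective $f(b) = b^2$, and $\theta = 1$ are assumptions made for illustration; for this choice the exact gradient is $2\theta$.

```python
import random

def f(b):
    # Hypothetical differentiable objective (an assumption for this sketch)
    return b * b

def df_db(b):
    return 2.0 * b

def g_rep(theta, eps):
    # b = T(theta, eps) = theta + eps, so db/dtheta = 1
    b = theta + eps
    return df_db(b) * 1.0

def g_sf(theta, eps):
    # For b ~ N(theta, 1), d/dtheta log p(b | theta) = b - theta = eps
    b = theta + eps
    return f(b) * eps

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
theta = 1.0
noise = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
rep_mean, rep_var = mean_var([g_rep(theta, e) for e in noise])
sf_mean, sf_var = mean_var([g_sf(theta, e) for e in noise])
print("REP mean/var:", rep_mean, rep_var)
print("SF  mean/var:", sf_mean, sf_var)
```

Both sample means should be near $2\theta = 2$, with the reparameterized estimator showing a much smaller variance on this example.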

12 CONCRETE REPARAMETERIZATION (MADDISON ET AL. 2016)
$\hat{g}_{CON}[f] = \frac{\partial}{\partial\theta} f(\sigma_\lambda(z)), \quad z = T(\theta, \epsilon), \quad \epsilon \sim p(\epsilon)$
Works well with careful hyperparameter choices.
Lower variance than the score-function estimator due to reparameterization.
Biased estimator.
Temperature parameter $\lambda$.
Requires f to be known and differentiable.
Requires $p(b|\theta)$ to be reparameterizable.
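
A sketch of the bias of the relaxed estimator on a single Bernoulli variable. The relaxation $\sigma_\lambda(z) = \sigma(z/\lambda)$ with $z = \log\frac{\theta}{1-\theta} + \log\frac{u}{1-u}$, the objective $f(b) = (t - b)^2$, and the values $t = 0.2$, $\theta = 0.5$, $\lambda = 1$ are assumptions for illustration, not settings taken from the talk.

```python
import math
import random

EPS = 1e-12

def logit(p):
    p = min(max(p, EPS), 1.0 - EPS)  # guard against log(0)
    return math.log(p) - math.log(1.0 - p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(b, t=0.2):
    return (t - b) ** 2

def df_db(b, t=0.2):
    return 2.0 * (b - t)

def g_con(theta, u, lam):
    # Differentiate f(sigmoid(z / lam)) through z = logit(theta) + logit(u)
    z = logit(theta) + logit(u)
    s = sigmoid(z / lam)
    ds_dz = s * (1.0 - s) / lam
    dz_dtheta = 1.0 / (theta * (1.0 - theta))
    return df_db(s) * ds_dz * dz_dtheta

rng = random.Random(0)
theta, lam = 0.5, 1.0
draws = [g_con(theta, rng.random(), lam) for _ in range(50_000)]
con_mean = sum(draws) / len(draws)
exact = f(1) - f(0)  # gradient of the true discrete objective
print("relaxed estimate:", con_mean, "exact discrete gradient:", exact)
```

The relaxed estimate converges to the gradient of the relaxed objective, which differs from the gradient of the discrete objective; lowering `lam` narrows the gap at the cost of higher variance.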

13 REBAR (TUCKER ET AL. 2017)
Improves over the Concrete distribution (rebar is stronger than concrete).
Uses a continuous relaxation of discrete random variables (Concrete) to build an unbiased, lower-variance gradient estimator.
Using the reparameterization from the Concrete distribution, constructs a control variate for the score-function estimator.
Shows how to tune additional parameters of the estimator (e.g., the temperature $\lambda$) online.

14 Digression: control variates for Monte Carlo estimators

15 CONTROL VARIATES: DIGRESSION
$\hat{g}_{new}(b) = \hat{g}(b) - \alpha\,\left(c(b) - E_{p(b)}[c(b)]\right)$, with optimal scaling $\alpha^{\star} = Cov[\hat{g}, c] / Var[c]$
New estimator is equal in expectation to old estimator (bias is unchanged).
Variance is reduced when corr(c, g) > 0.
We exploit the difference between the function c and its known mean during optimization to correct the value of the estimator.
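
A generic Monte Carlo sketch of a control variate, independent of gradient estimation. The integrand $e^u$ with $u \sim$ uniform[0, 1], the control variate $c(u) = u$ with known mean 0.5, and the pilot-sample estimate of the scaling constant are all assumptions for illustration.

```python
import math
import random

def g(u):
    return math.exp(u)  # quantity of interest; E[g] = e - 1

def c(u):
    return u            # control variate with known mean

C_MEAN = 0.5

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)

# Estimate the scaling constant Cov[g, c] / Var[c] on a separate pilot sample
pilot = [rng.random() for _ in range(2_000)]
alpha = cov([g(u) for u in pilot], [c(u) for u in pilot]) / var([c(u) for u in pilot])

us = [rng.random() for _ in range(20_000)]
plain = [g(u) for u in us]
corrected = [g(u) - alpha * (c(u) - C_MEAN) for u in us]
print("plain:    ", mean(plain), var(plain))
print("corrected:", mean(corrected), var(corrected))
```

Both estimators target the same mean; the corrected one has a much smaller per-sample variance because `c` is strongly correlated with `g`.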

16 CONTROL VARIATES: FREE-FORM
$\hat{g}_{new}(b) = \hat{g}(b) - c_\phi(b) + E_{p(b)}[c_\phi(b)]$
If we choose a neural network as our parameterized differentiable function $c_\phi$, then the control-variate formulation simplifies to the form above.
The scaling constant will be absorbed into the weights of the network, and optimality is determined by training.
How should we update the weights of the free-form control variate?

17 What is essential for a control variate?

18 LEARNING FREE-FORM CONTROL VARIATE
For an unbiased estimator, we can form a Monte Carlo estimate of the variance of the overall estimator.
We use this as the training signal for the parameters of the control variate, adapting them during training.
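
One way to write out the training signal is sketched here. It assumes that $\hat{g}$ is unbiased for every value of the control-variate parameters $\phi$, so that $E[\hat{g}]$ does not depend on $\phi$, and that the sampling noise does not depend on $\phi$ either.

```latex
\frac{\partial}{\partial\phi} \mathrm{Var}(\hat{g})
  = \frac{\partial}{\partial\phi} E[\hat{g}^2]
    - \frac{\partial}{\partial\phi} E[\hat{g}]^2
  = \frac{\partial}{\partial\phi} E[\hat{g}^2]
  = E\left[ \frac{\partial}{\partial\phi} \hat{g}^2 \right]
```

Under these assumptions a single sample of $\frac{\partial}{\partial\phi} \hat{g}^2$ gives an unbiased estimate of the gradient of the variance, which can be fed to the same stochastic optimizer that updates $\theta$.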

19 GENERALIZING REBAR
$\hat{g}_{LAX} = g_{SF}[f] - g_{SF}[c_\phi] + g_{REP}[c_\phi] = [f(b) - c_\phi(b)]\, \frac{\partial}{\partial\theta} \log p(b|\theta) + \frac{\partial}{\partial\theta} c_\phi(b), \quad b = T(\theta, \epsilon), \quad \epsilon \sim p(\epsilon)$
Start with the score-function (SF) estimator of the gradient of f.
Introduce a parameterized differentiable function $c_\phi$.
Use the SF estimator of $c_\phi$ as a control variate, subtracting its mean estimated through the lower-variance reparameterization estimator.
This generalizes Tucker et al. to free-form control variates: continuous relaxations are no longer required.
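
A toy sketch of the LAX estimator with a learned control variate. It assumes a Gaussian $b = \theta + \epsilon$, an objective $f(b) = b^2$ that is treated as a black box, and a one-parameter surrogate $c_\phi(b) = \phi b^2$ in place of a neural network. The parameter $\phi$ is trained by stochastic gradient descent on $\hat{g}^2$; the learning rate and batch size are arbitrary choices.

```python
import random

def f(b):
    return b * b  # treated as a black box: only values are used

def c(b, phi):
    return phi * b * b  # hypothetical one-parameter surrogate

def dc_db(b, phi):
    return 2.0 * phi * b

def g_lax(theta, eps, phi):
    b = theta + eps          # b = T(theta, eps), db/dtheta = 1
    score = eps              # d/dtheta log N(b; theta, 1) = b - theta
    return (f(b) - c(b, phi)) * score + dc_db(b, phi) * 1.0

def dg_dphi(theta, eps):
    # Derivative of g_lax with respect to phi for this surrogate
    b = theta + eps
    return -b * b * eps + 2.0 * b

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
theta, phi, lr, batch = 1.0, 0.0, 0.001, 64
for _ in range(3_000):
    grad = 0.0
    for _ in range(batch):
        eps = rng.gauss(0.0, 1.0)
        # d/dphi of g^2 is a single-sample estimate of d/dphi Var(g)
        grad += 2.0 * g_lax(theta, eps, phi) * dg_dphi(theta, eps)
    phi -= lr * grad / batch

noise = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
sf_mean, sf_var = mean_var([g_lax(theta, e, 0.0) for e in noise])   # phi = 0 is plain SF
lax_mean, lax_var = mean_var([g_lax(theta, e, phi) for e in noise])
print("learned phi:", phi)
print("SF  mean/var:", sf_mean, sf_var)
print("LAX mean/var:", lax_mean, lax_var)
```

With `phi = 0` the estimator reduces to the score-function estimator, and with `phi = 1` it reduces to the reparameterization estimator for this `f`. The estimate stays centred on $2\theta$ for any `phi`, and only the variance changes.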

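20 RELAX: EXTENSION TO DISCRETE RANDOM VARIABLES
$\hat{g}_{RELAX} = [f(b) - c_\phi(\tilde{z})]\, \frac{\partial}{\partial\theta} \log p(b|\theta) + \frac{\partial}{\partial\theta} c_\phi(z) - \frac{\partial}{\partial\theta} c_\phi(\tilde{z}), \quad z \sim p(z|\theta), \quad b = H(z), \quad \tilde{z} \sim p(z|b, \theta)$
When b is discrete, we introduce a related distribution $p(z|\theta)$ and a function H where $H(z) = b \sim p(b|\theta)$.
We use a conditional reparameterization scheme developed by Tucker et al. for REBAR.
This estimator is unbiased for all choices of $c_\phi$.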
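
A sketch of the RELAX estimator for one Bernoulli variable, checking numerically that its mean does not depend on the control variate. It assumes the logistic reparameterization $z = \log\frac{\theta}{1-\theta} + \log\frac{u}{1-u}$ with $H(z) = I(z > 0)$, an objective $f(b) = (t - b)^2$ with $t = 0.2$, and a hypothetical surrogate $c_\phi(z) = \phi f(\sigma(z))$. No claim is made that this surrogate reduces variance.

```python
import math
import random

EPS = 1e-12

def clamp(p):
    return min(max(p, EPS), 1.0 - EPS)

def logit(p):
    p = clamp(p)
    return math.log(p) - math.log(1.0 - p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(b, t=0.2):
    return (t - b) ** 2

def c(z, phi, t=0.2):
    return phi * f(sigmoid(z), t)  # hypothetical surrogate

def dc_dz(z, phi, t=0.2):
    s = sigmoid(z)
    return phi * 2.0 * (s - t) * s * (1.0 - s)

def g_relax(theta, u, v, phi):
    z = logit(theta) + logit(u)
    b = 1 if z > 0 else 0
    # Conditional noise v' so that z_tilde ~ p(z | b, theta)
    if b == 1:
        v_prime, dv_prime = v * theta + (1.0 - theta), v - 1.0
    else:
        v_prime, dv_prime = v * (1.0 - theta), -v
    v_prime = clamp(v_prime)
    z_tilde = logit(theta) + logit(v_prime)
    dz = 1.0 / (theta * (1.0 - theta))
    dz_tilde = dz + dv_prime / (v_prime * (1.0 - v_prime))
    score = b / theta - (1 - b) / (1.0 - theta)
    return ((f(b) - c(z_tilde, phi)) * score
            + dc_dz(z, phi) * dz
            - dc_dz(z_tilde, phi) * dz_tilde)

rng = random.Random(0)
theta = 0.3
pairs = [(rng.random(), rng.random()) for _ in range(100_000)]
relax_means = {}
for phi in (0.0, 1.0):
    vals = [g_relax(theta, u, v, phi) for u, v in pairs]
    relax_means[phi] = sum(vals) / len(vals)
exact = f(1) - f(0)
print(relax_means, "exact:", exact)
```

With `phi = 0` this is the plain score-function estimator. The mean for `phi = 1` should agree with it up to Monte Carlo error, illustrating that the control-variate terms cancel in expectation.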


22 EXPERIMENTAL RESULTS

23 SIMPLE EXAMPLE
$E_{p(b|\theta)}[(t - b)^2], \quad b \sim Bernoulli(\theta)$
We validated the idea with the simple function above.
This function was previously used to validate the REBAR estimator, fixing t = 0.45.
We chose t = 0.499.
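
A short exact calculation suggesting why t = 0.499 is the harder setting. It enumerates b in {0, 1} to get the true gradient and the variance of the score-function estimator; evaluating at $\theta = 0.5$ is an assumption made for illustration.

```python
def exact_grad(t):
    # L(theta) = theta * (t - 1)^2 + (1 - theta) * t^2, so dL/dtheta = 1 - 2t
    return (t - 1.0) ** 2 - t ** 2

def reinforce_variance(theta, t):
    # Exact variance of f(b) * d/dtheta log p(b | theta), enumerating b in {0, 1}
    g1 = (t - 1.0) ** 2 / theta          # value when b = 1
    g0 = -(t ** 2) / (1.0 - theta)       # value when b = 0
    mean = theta * g1 + (1.0 - theta) * g0
    second = theta * g1 ** 2 + (1.0 - theta) * g0 ** 2
    return second - mean ** 2

snr = {}
for t in (0.45, 0.499):
    snr[t] = abs(exact_grad(t)) / reinforce_variance(0.5, t) ** 0.5
    print("t =", t, "gradient =", exact_grad(t), "signal-to-noise =", snr[t])
```

The gradient shrinks from 0.1 to 0.002 while the estimator's standard deviation stays near 0.5, so the signal is far harder to pick out of the noise at t = 0.499.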

24 SIMPLE EXAMPLE
$E_{p(b|\theta)}[(t - b)^2]$
(Right) RELAX finds a reasonable solution; REINFORCE and REBAR oscillate.
(Left) Variance is considerably reduced in our estimator.

25 A MORE INTERESTING APPLICATION
$\log p(x) \geq L(\theta) = E_{q(b|x)}[\log p(x|b) + \log p(b) - \log q(b|x)]$
Discrete Variational Autoencoder.
Latent state: 2 layers of 200 Bernoulli variables.
Discrete sampling renders the reparameterization estimator unusable.
$c_\phi(z) = f(\sigma_\lambda(z)) + r_\rho(z)$

26 MNIST RESULTS

27 OMNIGLOT RESULTS

28 QUANTITATIVE RESULTS
Columns: Dataset, Model, Concrete, NVIL, MuProp, REBAR, RELAX
Rows (MNIST): Nonlinear, linear 1 layer, linear 2 layer
Rows (Omniglot): Nonlinear, linear 1 layer, linear 2 layer
Table 1: Best obtained training objective for discrete variational autoencoders.

29 OVERFITTING 1 LAYER: MNIST (LEFT), OMNIGLOT (RIGHT)

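30 REINFORCEMENT LEARNING
Policy gradient methods are effective for finding policy parameters (A2C, A3C, ACKTR).
Goal: $\arg\max_\theta E_{\tau \sim \pi(\tau|\theta)}[R(\tau)]$
Need $\frac{\partial}{\partial\theta} E_{\tau \sim \pi(\tau|\theta)}[R(\tau)]$
True reward function unknown (black box, from the environment).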

31 ADVANTAGE ACTOR CRITIC (SUTTON, 2000)
$\hat{g}_{A2C} = \sum_{t=1}^{\infty} \frac{\partial \log \pi(a_t|s_t, \theta)}{\partial\theta} \left[ \sum_{t'=t}^{\infty} r_{t'} - c_\phi(s_t) \right], \quad a_t \sim \pi(a_t|s_t, \theta)$
$c_\phi$ is an estimate of the value function.
This is exactly the REINFORCE estimator using an estimate of the value function as a control variate.
Why not use the action in the control variate? Dependence on the action would add bias.
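
A minimal sketch of how the bracketed term is formed for one finite episode. It assumes undiscounted rewards, a scalar policy parameter, and made-up numbers for the rewards, value estimates, and score terms.

```python
def a2c_weights(rewards, values):
    # Reward-to-go from step t minus the baseline c_phi(s_t)
    weights = []
    running = 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        running += r
        weights.append(running - v)
    return weights[::-1]

def a2c_gradient(scores, rewards, values):
    # scores[t] stands in for d/dtheta log pi(a_t | s_t, theta)
    return sum(s * w for s, w in zip(scores, a2c_weights(rewards, values)))

rewards = [1.0, 0.0, 2.0]    # hypothetical episode
values = [1.5, 1.0, 2.0]     # hypothetical value estimates c_phi(s_t)
scores = [1.0, -1.0, 0.5]    # hypothetical score terms

weights = a2c_weights(rewards, values)
grad = a2c_gradient(scores, rewards, values)
print(weights, grad)
```

Because the baseline depends only on the state, subtracting it leaves the expected gradient unchanged while shrinking the magnitude of each weight.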

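32 EXTENDING LAX TO RL
$\hat{g}^{RL}_{LAX} = \sum_{t=1}^{\infty} \frac{\partial \log \pi(a_t|s_t, \theta)}{\partial\theta} \left[ \sum_{t'=t}^{\infty} r_{t'} - c_\phi(a_t, s_t) \right] + \frac{\partial}{\partial\theta} c_\phi(a_t, s_t), \quad a_t = a_t(\epsilon_t, s_t, \theta), \quad \epsilon_t \sim p(\epsilon_t)$
Allows for action-dependence in the control variate.
Remains an unbiased estimator.
A similar extension is possible for discrete action spaces; see paper Appendix C.2.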

33 RL BENCHMARK RESULTS
Panels: Cart-pole, Lunar lander, Inverted pendulum, Inverted double pendulum. Vertical axes: Reward, Log-Variance.
Figure 4: Top row: Reward curves. Bottom row: Variance of policy gradients (log scale). In each curve, the center line indicates the mean reward over 5 random seeds. The opaque bars in the top

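34 BERNOULLI REPARAM
When $p(b|\theta)$ is a Bernoulli distribution, we let $H(z) = I(z > 0)$ and we sample from $p(z|\theta)$ with
$z = \log\frac{\theta}{1-\theta} + \log\frac{u}{1-u}, \quad u \sim uniform[0, 1]$.
We can sample from $p(z|b, \theta)$ with
$v' = v\,(1-\theta)$ if $b = 0$, and $v' = v\,\theta + (1-\theta)$ if $b = 1$,
$\tilde{z} = \log\frac{\theta}{1-\theta} + \log\frac{v'}{1-v'}, \quad v \sim uniform[0, 1]$.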
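
A quick numerical check of the two sampling formulas, using only the standard library. The value `theta = 0.3` and the helper names are assumptions for illustration.

```python
import math
import random

EPS = 1e-12

def logit(p):
    p = min(max(p, EPS), 1.0 - EPS)  # guard against log(0)
    return math.log(p) - math.log(1.0 - p)

def H(z):
    return 1 if z > 0 else 0

def sample_z(theta, u):
    # z ~ p(z | theta)
    return logit(theta) + logit(u)

def sample_z_tilde(theta, b, v):
    # z_tilde ~ p(z | b, theta): squeeze v into the part of [0, 1] consistent with b
    v_prime = v * theta + (1.0 - theta) if b == 1 else v * (1.0 - theta)
    return logit(theta) + logit(v_prime)

rng = random.Random(0)
theta, n = 0.3, 20_000
ones = 0
consistent = True
for _ in range(n):
    b = H(sample_z(theta, rng.random()))
    ones += b
    z_tilde = sample_z_tilde(theta, b, rng.random())
    consistent = consistent and (H(z_tilde) == b)
frac_ones = ones / n
print("fraction of b = 1:", frac_ones, "H(z_tilde) == b for all samples:", consistent)
```

The fraction of ones should be close to `theta`, and every conditional sample should map back to the same `b` under `H`.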

Backpropagation Through the Void
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud
Where do we see this guy? $L(\theta) = E_{p(b|\theta)}[f(b)]$
Just about everywhere!