Variational Adaptive-Newton Method

Mohammad Emtiyaz Khan, Wu Lin, Voot Tangkaratt, Zuozhu Liu, Didrik Nielsen
AIP, RIKEN, Tokyo

Abstract

We present a black-box learning method called the Variational Adaptive-Newton (VAN) method. The method is especially suitable for explorative-learning tasks such as active learning and reinforcement learning. Just like Bayesian methods, it estimates a distribution that can be used for exploration, but its computational complexity is similar to that of continuous optimization methods. VAN is a second-order method that unifies existing methods in the distinct fields of continuous optimization, variational inference, and evolution strategies. Our results show that VAN performs well on a wide variety of learning tasks while taking fewer computations than existing approaches. In summary, this work presents a method that has the potential not only to be a general-purpose learning tool, but also to be used to solve difficult problems such as deep reinforcement learning and unsupervised learning.

1 Introduction

Humans learn by sequentially exploring the world. During our lifetimes, we use our past experiences to acquire new ones and then use them to learn even more. How can we design methods that mimic this "explorative" learning? This is an open question in artificial intelligence and machine learning. One such approach is based on the Bayesian posterior distribution, which can be used to summarize past data examples and to choose new ones (Perfors et al., 2011). Bayesian methods have been applied to a wide variety of explorative-learning tasks, e.g., for active learning and Bayesian optimization to select informative data examples (Houlsby et al., 2011; Gal et al., 2017; Brochu et al., 2010; Fishel and Loeb, 2012), and for reinforcement learning to learn through interactions (Wyatt, 1998; Strens, 2000). However, estimating the posterior distribution is computationally demanding, which makes its application to large-scale problems challenging.

On the other hand, non-Bayesian methods, such as those based on continuous optimization, are computationally much cheaper. This is because they only compute a point estimate of the model parameters rather than a distribution over them. However, since there is no distribution associated with the parameters, these methods cannot directly exploit the mechanisms of Bayesian exploration. This raises the following question: how can we design explorative-learning methods that compute a distribution just like Bayesian methods, but at a cost similar to optimization methods?

In this paper, we propose such a method. Our method is a general-purpose black-box optimization method, but it is especially suitable for exploration-based learning. Just like Newton's method, our method is a second-order method, but it also unifies existing methods in the distinct fields of continuous optimization, variational inference, and evolution strategies. Due to this connection, we call our method the Variational Adaptive-Newton (VAN) method. We show results for supervised and unsupervised learning, as well as for active learning and reinforcement learning. Our results suggest that VAN can be used as a general-purpose learning method and has the potential to improve existing approaches for a variety of difficult learning problems.

ZL is from the Singapore University of Technology and Design, Singapore. This work was done during an internship at AIP. A longer version of this paper is available online.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Variational Optimization with Gaussian distributions

We consider learning tasks that can be formulated as a function minimization problem,

    θ* = argmin_θ f(θ),  where θ ∈ R^D.    (1)

A wide variety of problems in supervised, unsupervised, and reinforcement learning fall under this category. However, instead of directly solving problem (1), we take a different approach that follows the Variational Optimization (VO) framework (Staines and Barber, 2012), where we minimize

    min_η E_{q(θ|η)}[f(θ)] := L(η),    (2)

where q is a probability distribution with parameters η. This approach can lead us to the minimizer θ* because L(η) is an upper bound on the minimum value of f, i.e., min_θ f(θ) ≤ E_{q(θ|η)}[f(θ)]. Therefore, minimizing L(η) minimizes f(θ), and when the distribution q puts all its mass on θ*, we recover the minimum. This type of function minimization is commonly used in many areas of stochastic search such as evolution strategies (Hansen and Ostermeier, 2001; Wierstra et al., 2008). In our problem context, this formulation is advantageous because it enables learning via exploration, where exploration is facilitated by the distribution q(θ|η).

In this paper, we use a Gaussian distribution for exploration, i.e., we set q(θ|η) := N(θ|µ, Σ) with η = {µ, Σ}. Problem (2) can then be rewritten as follows:

    min_{µ,Σ} E_{N(θ|µ,Σ)}[f(θ)] := L(µ, Σ).    (3)

A straightforward approach to minimize L(µ, Σ) is to use SGD to optimize µ and Σ, as shown below,

    V-SGD:  µ_{t+1} = µ_t − ρ_t ∇̂_µ L_t,    Σ_{t+1} = Σ_t − ρ_t ∇̂_Σ L_t,    (4)

where ρ_t > 0 is a step size at iteration t, ∇̂ denotes an unbiased stochastic-gradient estimate, and L_t = L(µ_t, Σ_t). We refer to this approach as Variational SGD, or simply V-SGD, to differentiate it from the standard SGD that optimizes f(θ) in the θ space. This approach is simple and can be powerful when used with adaptive-gradient methods that adaptively change the step size, e.g., AdaGrad and RMSprop. However, as pointed out by Wierstra et al. (2008), it has issues, especially when REINFORCE (Williams, 1992) is used to estimate the gradients of f(θ). Wierstra et al. (2008) argue that the V-SGD update becomes increasingly unstable when the covariance is small, while taking very small steps when the covariance is large. To fix these problems, Wierstra et al. (2008) proposed a natural-gradient method. Our method is also a natural-gradient method, but, as we show in the next section, its updates are much simpler and they lead to a second-order method similar to Newton's method.
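As a concrete illustration (not from the paper), the following minimal Python sketch runs the VO objective (3) and the V-SGD update (4) on a one-dimensional sinc function, using the score-function (REINFORCE) estimator mentioned above; the step size, sample count, and initialization are illustrative choices rather than settings used by the authors.

```python
import numpy as np

def f(theta):
    # 1-D non-convex test function, in the spirit of the sinc example in Fig. 1(b)
    return np.sinc(theta)

def v_sgd(mu=3.0, sigma2=1.0, rho=0.05, steps=2000, samples=50, seed=0):
    """V-SGD (Eq. 4): plain SGD on L(mu, sigma^2) = E_q[f(theta)] with q = N(mu, sigma^2),
    using score-function (REINFORCE) estimates of the gradients w.r.t. mu and sigma^2."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        theta = mu + np.sqrt(sigma2) * rng.standard_normal(samples)            # theta ~ q
        fx = f(theta)
        g_mu = np.mean(fx * (theta - mu) / sigma2)                              # E[f * d log q / d mu]
        g_s2 = np.mean(fx * ((theta - mu) ** 2 - sigma2) / (2 * sigma2 ** 2))   # E[f * d log q / d sigma^2]
        mu = mu - rho * g_mu
        sigma2 = max(sigma2 - rho * g_s2, 1e-6)                                 # keep the variance positive
    return mu, sigma2

print(v_sgd())  # in a typical run, mu drifts toward a minimizer of f and sigma2 shrinks as q concentrates
```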

3 Variational Adaptive-Newton Method

We propose a new method to solve (3) by using a mirror-descent algorithm. We show that our algorithm is a second-order method which solves the original problem (1), even though it is designed to solve (3). Due to its similarity to Newton's method, we refer to it as the Variational Adaptive-Newton (VAN) method. Figure 1(b) shows an illustrative example of our method optimizing a one-dimensional non-convex function.

VAN can be obtained by making two small changes to the V-SGD objective. First, note that the V-SGD update in (4) is the solution of the following optimization problem:

    η_{t+1} = argmin_η ⟨η, ∇̂_η L_t⟩ + (1 / (2 ρ_t)) ‖η − η_t‖² = η_t − ρ_t ∇̂_η L_t.    (5)

A simple interpretation of this problem is that, in V-SGD, we choose the next point η along the gradient but contained within a scaled ℓ2-ball centered at the current point η_t. This interpretation enables us to slightly modify V-SGD to obtain VAN.

Our VAN method is based on two modifications to (5). The first modification is to replace the Euclidean distance by a Bregman divergence, which results in the mirror-descent method. Note that for exponential-family distributions, the Bregman divergence corresponds to the Kullback-Leibler (KL) divergence (Raskutti and Mukherjee, 2015). Using the KL divergence as the distance measure has the benefit of yielding natural-gradient updates, which are efficient when optimizing the parameters of a probability distribution (Amari, 1998). The second modification is that we optimize the VO objective with respect to the mean parameterization of the Gaussian distribution, m := {µ, µµᵀ + Σ}, instead of the parameterization η := {µ, Σ}. The two modifications give us the following problem:

    m_{t+1} = argmin_m ⟨m, ∇̂_m L_t⟩ + (1 / β_t) D_KL[q ‖ q_t],    (6)

where q := q(θ|m), q_t := q(θ|m_t), and D_KL[q ‖ q_t] = E_q[log(q / q_t)] denotes the KL divergence. The convergence of this procedure is guaranteed under mild conditions (Ghadimi et al., 2014). The solution to this optimization problem is given by (proof in Appendix A):

    µ_{t+1} = µ_t − β_t Σ_{t+1} ∇̂_µ L_t,    Σ_{t+1}^{-1} = Σ_t^{-1} + 2 β_t ∇̂_Σ L_t.    (7)

The VAN update above corresponds to that of a second-order method. To see this connection, we rewrite the gradients in the update by using the following identities (Opper and Archambeau, 2009):

    ∇_µ E_q[f(θ)] = E_q[∇_θ f(θ)],    ∇_Σ E_q[f(θ)] = (1/2) E_q[∇²_θθ f(θ)].    (8)

Substituting these in (7) and expressing the update in terms of the precision matrix P_t := Σ_t^{-1}, we get the following updates:

    VAN:  µ_{t+1} = µ_t − β_t P_{t+1}^{-1} E_{q_t}[∇_θ f(θ)],    P_{t+1} = P_t + β_t E_{q_t}[∇²_θθ f(θ)],    (9)

where q_t := N(θ|µ_t, Σ_t). Compare this to the update of Newton's method:

    Newton's method:  θ_{t+1} = θ_t − ρ_t [∇²_θθ f(θ_t)]^{-1} ∇_θ f(θ_t).    (10)

While Newton's method scales the gradient by the inverse Hessian to obtain a step direction, VAN scales the averaged gradient by the inverse of the precision matrix P_{t+1}, which contains a weighted sum of past averaged Hessians. Both the averaged gradients and the averaged Hessians are computed at locations sampled from q_t. The VAN update therefore uses second-order information but, unlike Newton's method, this information is averaged over samples from the explorative distribution q and summed over past iterations.

4 Variants of VAN and Their Connections to Existing Methods

A Large-Scale Variant: For problems with a large number of parameters, computing the full Hessian matrix might be infeasible. By using a mean-field approximation q(θ|η) = Π_{d=1}^{D} N(θ_d|µ_d, σ_d²), however, we obtain a method that uses a diagonal approximation of the Hessian instead of the full Hessian matrix in its updates. Denoting the precision parameters by s_d = 1/σ_d², and the vector containing them by s, the diagonal version of VAN, which we call VAN-D, can be written as follows:

    VAN-D:  µ_{t+1} = µ_t − β_t diag(s_{t+1})^{-1} E_{q_t}[∇_θ f(θ)],    s_{t+1} = s_t + β_t E_{q_t}[h(θ)],    (11)

where diag(s) is a diagonal matrix containing the vector s as its diagonal and h(θ) is the diagonal of the Hessian ∇²_θθ f(θ). The diagonal Hessian can be computed using the reparameterization trick.

Connections to AdaGrad: Instead of using the Hessian, we can use a Gauss-Newton variant of VAN-D where we replace the Hessian entries ∇²_{θ_i θ_i} f(θ) by the squared gradients [∇_{θ_i} f(θ)]². The resulting updates are very similar to those of the adaptive-gradient method AdaGrad (Duchi et al., 2011). The main difference is that our method uses averaged gradients instead of the gradient at a single point. VAN can also be seen as a generalization of an adaptive method called AROW (Crammer et al., 2009). AROW uses a mirror-descent algorithm similar to ours, but has been applied only to SVMs.
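To make updates (9) and (11) concrete, here is a minimal Python sketch of the diagonal variant VAN-D on a logistic-regression loss. The loss, step size β, sample count, and initialization are illustrative assumptions rather than the paper's experimental settings, and exact per-sample gradients and diagonal Hessians are used in place of mini-batch stochastic estimates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_and_diag_hess(theta, X, y):
    """Gradient and diagonal Hessian of f(theta) = sum_i log(1 + exp(-y_i x_i^T theta)), y_i in {-1, +1}."""
    p = sigmoid(X @ theta)                    # P(y = +1 | x_i, theta)
    grad = X.T @ (p - (y + 1) / 2)            # gradient of the logistic loss
    diag_hess = (X ** 2).T @ (p * (1 - p))    # diagonal of sum_i p_i (1 - p_i) x_i x_i^T
    return grad, diag_hess

def van_d(X, y, beta=0.1, iters=200, samples=10, seed=0):
    """VAN-D (Eq. 11): mean-field VAN with mean mu and diagonal precision vector s."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    mu, s = np.zeros(D), np.ones(D)           # q_t(theta) = N(mu, diag(1/s))
    for _ in range(iters):
        thetas = mu + rng.standard_normal((samples, D)) / np.sqrt(s)  # draw theta ~ q_t
        g_bar, h_bar = np.zeros(D), np.zeros(D)
        for th in thetas:                     # average gradients and diagonal Hessians over q_t
            g, h = logistic_grad_and_diag_hess(th, X, y)
            g_bar += g / samples
            h_bar += h / samples
        s = s + beta * h_bar                  # s_{t+1} = s_t + beta_t E_{q_t}[h(theta)]
        mu = mu - beta * g_bar / s            # mu_{t+1} = mu_t - beta_t diag(s_{t+1})^{-1} E_{q_t}[grad f]
    return mu, s                              # mu: point estimate; 1/s: per-dimension exploration variance

# Example usage on synthetic data:
# rng = np.random.default_rng(1); X = rng.standard_normal((500, 5))
# w = rng.standard_normal(5); y = np.sign(X @ w + 0.1 * rng.standard_normal(500))
# mu, s = van_d(X, y)
```

Replacing h element-wise by the squared gradient in the precision update turns this sketch into the Gauss-Newton variant whose updates resemble AdaGrad, as noted above.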
Connections to Variational Inference (VI): VI is a special type of VO problem where the function is f(θ) := −log[p(y, θ)/q(θ|η)], with p(y, θ) being the joint distribution and q the variational approximation to the posterior distribution. Our approach is an extension of a recent approach for VI by Khan and Lin (2017) called conjugate-computation variational inference (CVI). VAN is a generalization of CVI to a general function f(θ). A direct consequence of our theoretical result from the previous section is that CVI, just like VAN, is also a second-order method.
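For concreteness (this short derivation is standard and not stated in the extended abstract), plugging this choice of f into the VO objective (2) recovers the usual negative evidence lower bound:

    L(η) = E_{q(θ|η)}[ log q(θ|η) − log p(y, θ) ] = −ELBO(η) = D_KL[ q(θ|η) ‖ p(θ|y) ] − log p(y),

so minimizing L(η) over η is exactly variational inference: the KL term is driven down while log p(y) is a constant.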

Figure 1: (a) Left: Lasso regression on the YearPredictionMSD dataset (N ≈ 515K, D = 90) compared to the Iterative-Ridge method; stochastic VAN (sVAN) with small mini-batches converges faster than the batch methods. Middle: batch VAN versus Newton's method for logistic regression on the USPS 3vs5 dataset (N = 1540, D = 256); VAN converges at the same rate as Newton's method. Right: the same comparison for stochastic VAN and its diagonal version (sVAN-D) with small mini-batches, where we observe convergence similar to AdaGrad. Both of our methods use samples to compute the averaged gradients and Hessians. (b) The top plot shows the function f(θ) = sinc(θ); the second plot shows the VO objective L(µ, σ) = E_q[f(θ)] for a Gaussian q = N(θ|µ, σ²), with red points and arrows marking the iterations of VAN; the bottom plot shows the progression of q, where darker curves indicate later iterations. (c) The top plot shows results for active learning on logistic regression: sVAN with a maximum-entropy acquisition function reaches a good test error much more quickly than sVAN without active learning. The bottom plot shows results for parameter-based exploration in reinforcement learning with the deep deterministic policy gradient method on HalfCheetah (Silver et al., 2014): the sVAN-D variants achieve better rewards than AROW, V-SGD, and SGD.

VAN as a natural-gradient method: VAN performs natural-gradient updates in the space of Gaussian distributions. This can be shown by using a recent result of Raskutti and Mukherjee (2015). Our approach, however, is much simpler than existing methods such as Natural Evolution Strategies (NES) (Wierstra et al., 2008), which apply natural-gradient descent directly in the space of µ and Σ. Unfortunately, this results in an infeasible algorithm since the Fisher information matrix has O(D⁴) parameters. NES requires sophisticated reparameterization to solve this issue. An advantage of VAN over NES is that it naturally has O(D²) parameters and its updates are therefore much simpler.

Results: Our experimental results, shown in Fig. 1(a) and (c), reveal that VAN and its variants perform well for supervised learning with Lasso and logistic regression, active learning for logistic regression, and a reinforcement-learning task with parameter-based exploration. Figure 1(b) shows an example where VAN can avoid local minima through exploration. These results demonstrate the generality of our method and suggest that it can be used to solve difficult problems such as deep reinforcement learning.

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.

Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Crammer, K., Kulesza, A., and Dredze, M. (2009). Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159.

Fishel, J. A. and Loeb, G. E. (2012). Bayesian exploration for intelligent identification of textures. Frontiers in Neurorobotics, 6.

Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep Bayesian active learning with image data. arXiv e-prints.

Ghadimi, S., Lan, G., and Zhang, H. (2014). Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming.

Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159-195.

Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for classification and preference learning. arXiv e-prints.

Khan, M. E. and Lin, W. (2017). Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics (AISTATS).

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792.

Perfors, A., Tenenbaum, J. B., Griffiths, T. L., and Xu, F. (2011). A tutorial introduction to Bayesian models of cognitive development. Cognition, 120(3):302-321.

Raskutti, G. and Mukherjee, S. (2015). The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451-1457.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China.

Staines, J. and Barber, D. (2012). Variational optimization. arXiv e-prints.

Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML).

Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In IEEE Congress on Evolutionary Computation (CEC), IEEE World Congress on Computational Intelligence.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.

Wyatt, J. (1998). Exploration and inference in learning from reinforcement.

A Derivation of VAN

Denote the mean parameters of q_t(θ) by m_t, which are equal to the expected value of the sufficient statistics φ(θ), i.e., m_t := E_{q_t}[φ(θ)]. The mirror-descent update at iteration t is given by the solution to

    m_{t+1} = argmin_m ⟨m, ∇̂_m L_t⟩ + (1 / β_t) D_KL[q ‖ q_t]    (12)
            = argmin_m E_q[ ⟨φ(θ), ∇̂_m L_t⟩ + (1 / β_t) log(q / q_t) ]
            = argmin_m (1 / β_t) E_q[ log( q / ( q_t exp(−β_t ⟨φ(θ), ∇̂_m L_t⟩) ) ) ]
            = argmin_m (1 / β_t) D_KL[ q ‖ q_t exp(−β_t ⟨φ(θ), ∇̂_m L_t⟩) / Z_t ],    (17)

where Z_t is the normalizing constant of the distribution in the denominator, which is a function of the gradient and the step size. Minimizing this KL divergence gives the update

    q_{t+1}(θ) ∝ q_t(θ) exp( −β_t ⟨φ(θ), ∇̂_m L_t⟩ ).    (18)

Rewriting this, we see that we get an update of the natural parameters λ_t of q_t(θ), i.e.,

    λ_{t+1} = λ_t − β_t ∇̂_m L_t.    (19)

Recalling that the mean parameters of a Gaussian q(θ) = N(θ|µ, Σ) are m^(1) = µ and M^(2) = Σ + µµᵀ, and using the chain rule, we can express the gradient ∇̂_m L_t in terms of µ and Σ:

    ∇_{m^(1)} L = ∇_µ L − 2 [∇_Σ L] µ,    (20)
    ∇_{M^(2)} L = ∇_Σ L.    (21)

Finally, recalling that the natural parameters of a Gaussian q(θ) = N(θ|µ, Σ) are λ^(1) = Σ^{-1} µ and λ^(2) = −(1/2) Σ^{-1}, we can rewrite the VAN updates in terms of µ and Σ:

    Σ_{t+1}^{-1} = Σ_t^{-1} + 2 β_t ∇̂_Σ L_t,    (22)
    µ_{t+1} = Σ_{t+1} [ Σ_t^{-1} µ_t − β_t ( ∇̂_µ L_t − 2 [∇̂_Σ L_t] µ_t ) ]    (23)
            = Σ_{t+1} ( Σ_t^{-1} + 2 β_t ∇̂_Σ L_t ) µ_t − β_t Σ_{t+1} ∇̂_µ L_t    (24)
            = µ_t − β_t Σ_{t+1} ∇̂_µ L_t.    (25)
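As a sanity check of the derivation above (not part of the paper), the following short Python script verifies numerically that a single natural-parameter step (19), combined with the chain rule (20)-(21), reproduces the closed-form updates (22) and (25) for an arbitrary Gaussian and arbitrary symmetric gradient stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D, beta = 3, 0.05
mu = rng.standard_normal(D)
A = rng.standard_normal((D, D)); Sigma = A @ A.T + np.eye(D)    # a valid (SPD) covariance
g_mu = rng.standard_normal(D)                                   # stand-in for grad_mu L_t
B = rng.standard_normal((D, D)); g_Sig = 0.5 * (B + B.T)        # stand-in for grad_Sigma L_t (symmetric)

P = np.linalg.inv(Sigma)                                        # precision matrix
lam1, lam2 = P @ mu, -0.5 * P                                   # natural parameters of N(mu, Sigma)
g_m1, g_M2 = g_mu - 2 * g_Sig @ mu, g_Sig                       # mean-parameter gradients, Eqs. (20)-(21)

lam1_new, lam2_new = lam1 - beta * g_m1, lam2 - beta * g_M2     # natural-parameter step, Eq. (19)
P_new = -2 * lam2_new
Sigma_new = np.linalg.inv(P_new)
mu_new = Sigma_new @ lam1_new

assert np.allclose(P_new, P + 2 * beta * g_Sig)                 # matches Eq. (22)
assert np.allclose(mu_new, mu - beta * Sigma_new @ g_mu)        # matches Eq. (25)
print("natural-parameter step reproduces the VAN updates")
```

The check only confirms the algebra; in an actual run, β_t must be small enough that the updated precision matrix stays positive definite.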
