Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

Kavosh Asadi (1), Evan Cater (1), Dipendra Misra (2), Michael L. Littman (1)

(1) Department of Computer Science, Brown University, Providence, USA. (2) Department of Computer Science, Cornell Tech, New York, USA. Correspondence to: Kavosh Asadi <kavosh@brown.edu>.

Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 2018. Copyright 2018 by the authors.

Abstract

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning, as varying the loss function can have a remarkable impact on the effectiveness of planning. Recently, Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement learning.

1 Introduction

The model-based approach to reinforcement learning consists of learning an internal model of the environment and planning with the learned model (Sutton & Barto, 1998). The main promise of the model-based approach is data efficiency: the ability to perform policy improvements with a relatively small number of environmental interactions.

Although the model-based approach is well understood in the tabular case (Kaelbling et al., 1996; Sutton & Barto, 1998), the extension to the approximate setting is difficult. Models usually have non-zero generalization error due to limited training samples. Moreover, the model-learning problem can be unrealizable, leading to an imperfect model with irreducible error (Ross & Bagnell, 2012; Talvitie, 2014). Sometimes referred to as the compounding-error phenomenon, it has been shown that such small modeling errors can compound over multiple steps and degrade the policy learned using the model (Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018).

One way of addressing this problem is to learn a model that is tailored to the specific planning algorithm we intend to use: even though the model is imperfect, it is useful for the planning algorithm that is going to leverage it. To this end, Farahmand et al. (2017) proposed an objective function for model-based RL that captures the structure of the value function during model learning, so as to ensure that the model is useful for Value Iteration. Learning a model using this loss, known as the value-aware model learning (VAML) loss, empirically improved upon a model learned using the maximum-likelihood objective, thus providing a promising direction for learning useful models in the approximate setting.

More specifically, VAML minimizes the maximum Bellman error given the learned model, the MDP dynamics, and an arbitrary space of value functions. As we will show, computing the Wasserstein metric involves a similar maximization problem, but over a space of Lipschitz functions. Under certain assumptions, we prove that the value function of an MDP is Lipschitz. Therefore, minimizing the VAML objective is in fact equivalent to minimizing Wasserstein.

2 Background

2.1 MDPs

We consider the Markov decision process (MDP) setting, in which the RL problem is formulated by the tuple (S, A, R, T, γ). Here, S denotes a state space and A denotes an action set. The functions R : S × A → R and T : S × A → Pr(S) denote the reward and transition dynamics, and γ ∈ [0, 1) is the discount rate.
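To ground the notation, the following is a minimal NumPy sketch of such a tuple for a small tabular MDP; the state and action counts and the particular reward and transition values are illustrative assumptions, not quantities from the paper.

import numpy as np

# A small tabular MDP (S, A, R, T, gamma); sizes and values are arbitrary.
n_states, n_actions = 3, 2
gamma = 0.9

# R[s, a]: expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 0.2]])

# T[s, a, s']: probability of landing in s'; each row T[s, a, :] sums to 1.
T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.0, 1.0]]])

assert np.allclose(T.sum(axis=-1), 1.0)

The sketches that follow reuse arrays of these shapes.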

2.2 Lipschitz Continuity

We make use of the notion of smoothness of a function, as quantified below.

Definition 1. Given two metric spaces (M_1, d_1) and (M_2, d_2), each consisting of a space and a distance metric, a function f : M_1 → M_2 is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

    K_{d_1, d_2}(f) := \sup_{s_1, s_2 \in M_1} \frac{d_2(f(s_1), f(s_2))}{d_1(s_1, s_2)},    (1)

is finite. Equivalently, for a Lipschitz f,

    \forall s_1, s_2 : \quad d_2(f(s_1), f(s_2)) \le K_{d_1, d_2}(f) \, d_1(s_1, s_2).

Note that the input and output of f can generally be scalars, vectors, or probability distributions. A Lipschitz function f is called a non-expansion when K_{d_1, d_2}(f) = 1 and a contraction when K_{d_1, d_2}(f) < 1. We also define Lipschitz continuity over a subset of inputs:

Definition 2. A function f : M_1 × A → M_2 is uniformly Lipschitz continuous in A if

    K^{A}_{d_1, d_2}(f) := \sup_{a \in A} \sup_{s_1, s_2} \frac{d_2(f(s_1, a), f(s_2, a))}{d_1(s_1, s_2)}    (2)

is finite. Note that the metric d_1 is still defined only on M_1.

Below we also present two useful lemmas.

Lemma 1 (Composition Lemma). Define three metric spaces (M_1, d_1), (M_2, d_2), and (M_3, d_3). Define Lipschitz functions f : M_2 → M_3 and g : M_1 → M_2 with constants K_{d_2, d_3}(f) and K_{d_1, d_2}(g). Then h := f ∘ g : M_1 → M_3 is Lipschitz with constant K_{d_1, d_3}(h) ≤ K_{d_2, d_3}(f) K_{d_1, d_2}(g).

Proof.

    K_{d_1, d_3}(h) = \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_1(s_1, s_2)}
                    = \sup_{s_1, s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_2(g(s_1), g(s_2))} \cdot \frac{d_2(g(s_1), g(s_2))}{d_1(s_1, s_2)}
                    \le \sup_{s_1, s_2} \frac{d_3(f(s_1), f(s_2))}{d_2(s_1, s_2)} \cdot \sup_{s_1, s_2} \frac{d_2(g(s_1), g(s_2))}{d_1(s_1, s_2)}
                    = K_{d_2, d_3}(f) \, K_{d_1, d_2}(g).

Lemma 2 (Summation Lemma). Define two vector spaces (M_1, \|\cdot\|) and (M_2, \|\cdot\|). Define Lipschitz functions f : M_1 → M_2 and g : M_1 → M_2 with constants K_{\|\cdot\|,\|\cdot\|}(f) and K_{\|\cdot\|,\|\cdot\|}(g). Then h := f + g : M_1 → M_2 is Lipschitz with constant K_{\|\cdot\|,\|\cdot\|}(h) ≤ K_{\|\cdot\|,\|\cdot\|}(f) + K_{\|\cdot\|,\|\cdot\|}(g).

Proof.

    K_{\|\cdot\|, \|\cdot\|}(h) := \sup_{s_1, s_2} \frac{\| f(s_2) + g(s_2) - f(s_1) - g(s_1) \|}{\| s_2 - s_1 \|}
        \le \sup_{s_1, s_2} \frac{\| f(s_2) - f(s_1) \|}{\| s_2 - s_1 \|} + \sup_{s_1, s_2} \frac{\| g(s_2) - g(s_1) \|}{\| s_2 - s_1 \|}
        = K_{\|\cdot\|, \|\cdot\|}(f) + K_{\|\cdot\|, \|\cdot\|}(g).

2.3 Distance Between Distributions

We require a notion of difference between two distributions, quantified below.

Definition 3. Given a metric space (M, d) and the set P(M) of all probability measures on M, the Wasserstein metric (or the 1st Kantorovich metric) between two probability distributions μ_1 and μ_2 in P(M) is defined as

    W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \int \int j(s_1, s_2) \, d(s_1, s_2) \, ds_2 \, ds_1,    (3)

where Λ denotes the collection of all joint distributions j on M × M with marginals μ_1 and μ_2 (Vaserstein, 1969).

Wasserstein is linked to Lipschitz continuity using duality:

    W(\mu_1, \mu_2) = \sup_{f : K_{d, d_R}(f) \le 1} \int \big( \mu_1(s) - \mu_2(s) \big) f(s) \, ds.    (4)

This equivalence is known as Kantorovich-Rubinstein duality (Kantorovich & Rubinstein, 1958; Villani, 2008). Sometimes referred to as Earth Mover's distance, Wasserstein has recently become popular in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017).

We also define the Kullback-Leibler divergence (simply KL) as an alternative measure of difference between two distributions:

    KL(\mu_1 \,\|\, \mu_2) := \int \mu_1(s) \log \frac{\mu_1(s)}{\mu_2(s)} \, ds.
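For intuition about these two notions of distance, the following sketch compares a pair of discrete distributions supported on points of the real line using both the 1-Wasserstein metric (via scipy.stats.wasserstein_distance) and KL. The particular support and weights are arbitrary illustrations, not quantities from the paper.

import numpy as np
from scipy.stats import wasserstein_distance

# Two discrete distributions mu1, mu2 on the same support points of the real line.
support = np.array([0.0, 1.0, 2.0, 3.0])
mu1 = np.array([0.1, 0.4, 0.4, 0.1])
mu2 = np.array([0.3, 0.2, 0.2, 0.3])

# 1-Wasserstein (Earth Mover's) distance with respect to |x - y| on the line.
w1 = wasserstein_distance(support, support, u_weights=mu1, v_weights=mu2)

# KL divergence KL(mu1 || mu2); well defined here because mu2 has full support.
kl = np.sum(mu1 * np.log(mu1 / mu2))

print(f"W(mu1, mu2)  = {w1:.4f}")
print(f"KL(mu1||mu2) = {kl:.4f}")

Unlike KL, the Wasserstein value depends on the geometry of the support: moving probability mass to a nearby point is cheaper than moving it far away.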

3 Value-Aware Model Learning (VAML) Loss

The basic idea behind VAML (Farahmand et al., 2017) is to learn a model tailored to the planning algorithm that intends to use it. Since Bellman equations (Bellman, 1957) are at the core of many RL algorithms (Sutton & Barto, 1998), we assume that the planner uses the following Bellman equation:

    Q(s, a) = R(s, a) + \gamma \int T(s' \mid s, a) \, f\big( Q(s', \cdot) \big) \, ds',

where f can generally be any arbitrary operator (Littman & Szepesvári, 1996), such as max. We also define:

    v(s) := f\big( Q(s, \cdot) \big).

A good model \hat{T} could then be thought of as one that minimizes the error

    l(T, \hat{T})(s, a) = \Big| R(s, a) + \gamma \int T(s' \mid s, a) v(s') \, ds' - R(s, a) - \gamma \int \hat{T}(s' \mid s, a) v(s') \, ds' \Big|
                        = \gamma \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|.

Note that minimizing this objective requires access to the value function in the first place, but we can obviate this need by leveraging Hölder's inequality:

    l(T, \hat{T})(s, a) = \gamma \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|
                        \le \gamma \, \big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1 \, \| v \|_\infty.

Further, we can use Pinsker's inequality to write:

    \big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1 \le \sqrt{ 2 \, KL\big( T(\cdot \mid s, a) \,\|\, \hat{T}(\cdot \mid s, a) \big) }.

This justifies the use of maximum-likelihood estimation for model learning, a common practice in model-based RL (Bagnell & Schneider, 2001; Abbeel et al., 2006; Agostini & Celaya, 2010), since maximum-likelihood estimation is equivalent to empirical KL minimization.

However, there exists a major drawback with the KL objective, namely that it ignores the structure of the value function during model learning. As a simple example, if the value function is constant throughout the state space, any randomly chosen model \hat{T} will in fact yield zero Bellman error. However, a model-learning algorithm that ignores the structure of the value function can potentially require many samples to provide any guarantee about the performance of the learned policy.

Consider the objective function l(T, \hat{T}), and notice again that v itself is not known, so we cannot directly optimize this objective. Farahmand et al. (2017) proposed to search for a model that results in the lowest error given all possible value functions belonging to a specific class F:

    L(T, \hat{T})(s, a) = \sup_{v \in F} \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|^2.    (5)

Minimizing this objective is shown to be tractable if, for example, F is restricted to the class of exponential functions.

Observe that the VAML objective (5) is similar to the dual form of Wasserstein (4); the main difference is the space of value functions. In the next section we show that even the spaces of value functions are the same under certain conditions.
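As a concrete illustration of objective (5), the sketch below evaluates the per-(s, a) error for a tabular model, approximating the supremum over F by a maximum over a small, hand-picked finite set of candidate value functions. The finite set and the random dynamics are illustrative assumptions; they are a crude stand-in for the function classes (e.g., exponential families) considered by Farahmand et al. (2017).

import numpy as np

def vaml_error(T, T_hat, value_fns, s, a):
    """max over a finite set of v of | sum_s' (T - T_hat)(s'|s,a) v(s') |^2."""
    diff = T[s, a] - T_hat[s, a]                  # (T - T_hat)(.|s,a), shape (n_states,)
    return max((diff @ v) ** 2 for v in value_fns)

# Arbitrary illustrative dynamics with the same shapes as the earlier MDP sketch.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))      # "true" dynamics
T_hat = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # learned model

value_fns = [np.array([0.0, 1.0, 2.0]),   # a finite stand-in for the class F
             np.array([1.0, 1.0, 1.0]),   # a constant v contributes zero error
             np.array([2.0, 0.0, 1.0])]

print(vaml_error(T, T_hat, value_fns, s=0, a=1))

Note how the constant value function contributes nothing to the objective, matching the observation above that a constant v makes any model look perfect under the Bellman error.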

4 Lipschitz Generalized Value Iteration

We show that solving for a class of Bellman equations yields a Lipschitz value function. Our proof is in the context of GVI (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) with arbitrary backup operators. We make use of the following lemmas.

Lemma 3. Given a non-expansion f : S → R:

    K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f(s') \, ds' \Big) \le K^{A}_{d_S, W}(T).

Proof. Starting from the definition, we write:

    K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f(s') \, ds' \Big)
        = \sup_{a} \sup_{s_1, s_2} \frac{ \big| \int \big( T(s' \mid s_1, a) - T(s' \mid s_2, a) \big) f(s') \, ds' \big| }{ d_S(s_1, s_2) }
        \le \sup_{a} \sup_{s_1, s_2} \frac{ \sup_{g : K_{d_S, d_R}(g) \le 1} \big| \int \big( T(s' \mid s_1, a) - T(s' \mid s_2, a) \big) g(s') \, ds' \big| }{ d_S(s_1, s_2) }
        = \sup_{a} \sup_{s_1, s_2} \frac{ W\big( T(\cdot \mid s_1, a), T(\cdot \mid s_2, a) \big) }{ d_S(s_1, s_2) }
        = K^{A}_{d_S, W}(T).

Lemma 4. The following operators are non-expansions (K_{\|\cdot\|_\infty, d_R} = 1):

1. max(x) and mean(x);
2. ε-greedy(x) := ε mean(x) + (1 − ε) max(x);
3. mm_β(x) := log( (1/n) Σ_i e^{β x_i} ) / β.

Proof. (1) is proven by Littman & Szepesvári (1996). (2) follows from (1) (metrics not shown for brevity):

    K\big( \epsilon\text{-greedy}(x) \big) = K\big( \epsilon \, \text{mean}(x) + (1 - \epsilon) \, \text{max}(x) \big)
        \le \epsilon \, K\big( \text{mean}(x) \big) + (1 - \epsilon) \, K\big( \text{max}(x) \big) = 1.

Finally, (3) is proven multiple times in the literature (Asadi & Littman, 2017; Nachum et al., 2017; Neu et al., 2017).

Algorithm 1: GVI algorithm
Input: initial Q(s, a), threshold δ, and a choice of operator f
repeat
    diff ← 0
    for each s ∈ S do
        for each a ∈ A do
            Q_copy ← Q(s, a)
            Q(s, a) ← R(s, a) + γ ∫ T(s' | s, a) f( Q(s', ·) ) ds'
            diff ← max{ diff, |Q_copy − Q(s, a)| }
        end for
    end for
until diff < δ

We now present the main result of this paper.

Theorem 1. For any choice of backup operator f outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by K^{A}_{d_S, d_R}(R) / (1 − γ K^{A}_{d_S, W}(T)), provided that γ K^{A}_{d_S, W}(T) < 1.

Proof. From Algorithm 1, in the nth round of GVI updates we have:

    Q_{n+1}(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a) \, f\big( Q_n(s', \cdot) \big) \, ds'.

First observe that:

    K^{A}_{d_S, d_R}(Q_{n+1})
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, d_R}\Big( \int T(s' \mid s, a) f\big( Q_n(s', \cdot) \big) \, ds' \Big)    (Summation Lemma 2)
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K_{d_S, d_R}\big( f( Q_n(s, \cdot) ) \big)    (Lemma 3)
        \le K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K_{\|\cdot\|_\infty, d_R}(f) \, K^{A}_{d_S, d_R}(Q_n)    (Composition Lemma 1)
        = K^{A}_{d_S, d_R}(R) + \gamma \, K^{A}_{d_S, W}(T) \, K^{A}_{d_S, d_R}(Q_n).    (Lemma 4, the non-expansion property of f)

Equivalently, unrolling the recursion:

    K^{A}_{d_S, d_R}(Q_{n+1}) \le K^{A}_{d_S, d_R}(R) \sum_{i=0}^{n} \big( \gamma K^{A}_{d_S, W}(T) \big)^{i} + \big( \gamma K^{A}_{d_S, W}(T) \big)^{n} \, K^{A}_{d_S, d_R}(Q_0).

By taking the limit of both sides, we get:

    \lim_{n \to \infty} K^{A}_{d_S, d_R}(Q_{n+1})
        \le \lim_{n \to \infty} K^{A}_{d_S, d_R}(R) \sum_{i=0}^{n} \big( \gamma K^{A}_{d_S, W}(T) \big)^{i}
          + \lim_{n \to \infty} \big( \gamma K^{A}_{d_S, W}(T) \big)^{n} K^{A}_{d_S, d_R}(Q_0)
        = \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) } + 0,

where we used the fact that \lim_{n \to \infty} ( \gamma K^{A}_{d_S, W}(T) )^{n} = 0. This concludes the proof.

Now notice that, as defined earlier, V_n(s) := f( Q_n(s, ·) ), so as a relevant corollary of our theorem we get:

    K_{d_S, d_R}(v) = \lim_{n \to \infty} K_{d_S, d_R}(V_n)
                    = \lim_{n \to \infty} K_{d_S, d_R}\big( f( Q_n(\cdot, \cdot) ) \big)
                    \le \lim_{n \to \infty} K^{A}_{d_S, d_R}(Q_n)
                    \le \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) }.

That is, solving for the fixed point of this general class of Bellman equations results in a Lipschitz state-value function.
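Algorithm 1 translates almost directly into NumPy for a tabular MDP with the array shapes used in the earlier sketches. The version below is one possible translation (not the authors' code), using the max backup by default; any of the non-expansions from Lemma 4, such as mellowmax, can be passed in as f.

import numpy as np

def gvi(R, T, gamma, f=lambda q: q.max(axis=-1), delta=1e-8):
    """Generalized Value Iteration (Algorithm 1) for a tabular MDP.

    R: (n_states, n_actions) rewards, T: (n_states, n_actions, n_states) dynamics,
    f: backup operator applied to Q(s', .) for every s'.
    """
    Q = np.zeros_like(R)
    diff = np.inf
    while diff >= delta:
        v = f(Q)                     # v(s') = f(Q(s', .)), shape (n_states,)
        Q_new = R + gamma * T @ v    # Bellman backup for every (s, a) at once
        diff = np.abs(Q_new - Q).max()
        Q = Q_new
    return Q

def mellowmax(q, beta=5.0):
    """The mm_beta operator from Lemma 4, applied along the action axis."""
    n = q.shape[-1]
    return np.log(np.exp(beta * q).sum(axis=-1) / n) / beta

# Usage with the R, T, gamma arrays from the earlier sketch:
# Q_max = gvi(R, T, gamma)
# Q_mm  = gvi(R, T, gamma, f=mellowmax)

For large β or large Q values the exponential inside mellowmax can overflow; scipy.special.logsumexp is the numerically stable alternative.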

5 Equivalence Between VAML and Wasserstein

We now show the main claim of the paper, namely that minimizing the VAML objective is the same as minimizing the Wasserstein metric. Consider again the VAML objective:

    L(T, \hat{T})(s, a) = \sup_{v \in F} \Big| \int \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) v(s') \, ds' \Big|^2,

where F can generally be any class of functions. From our theorem, however, the space of value functions F should be restricted to Lipschitz functions. Moreover, it is easy to design an MDP and a policy such that a desired Lipschitz value function is attained. This space L_C can then be defined as follows:

    L_C = \{ f : K_{d_S, d_R}(f) \le C \}, \quad \text{where} \quad C = \frac{ K^{A}_{d_S, d_R}(R) }{ 1 - \gamma K^{A}_{d_S, W}(T) }.

So we can rewrite the VAML objective L as follows:

    L(T, \hat{T})(s, a) = \sup_{f \in L_C} \Big| \int f(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2
        = \sup_{f \in L_C} \Big| C \int \frac{f(s')}{C} \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2
        = C^2 \sup_{g \in L_1} \Big| \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2.

It is clear that a function g that maximizes the Kantorovich-Rubinstein dual form

    \sup_{g \in L_1} \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' = W\big( T(\cdot \mid s, a), \hat{T}(\cdot \mid s, a) \big)

will also maximize

    C^2 \sup_{g \in L_1} \Big| \int g(s') \big( T(s' \mid s, a) - \hat{T}(s' \mid s, a) \big) \, ds' \Big|^2.

This is due to the fact that g ∈ L_1 implies −g ∈ L_1, so taking the absolute value or squaring the term does not change the arg max in this case. As a result:

    L(T, \hat{T})(s, a) = C^2 \, W\big( T(\cdot \mid s, a), \hat{T}(\cdot \mid s, a) \big)^2.

This highlights a nice property of Wasserstein, namely that minimizing this metric yields a value-aware model.
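The final identity can be checked numerically in a simple one-dimensional case. The sketch below treats T(·|s,a) and T̂(·|s,a) as discrete distributions on points of the real line; in one dimension the supremum over C-Lipschitz functions is attained by a piecewise-linear witness whose slope on each interval is −C · sign(F_T − F_T̂), where F denotes the cumulative distribution function (a standard consequence of Kantorovich-Rubinstein duality, not a construction from the paper). The specific distributions and the value of C are arbitrary.

import numpy as np
from scipy.stats import wasserstein_distance

# Two next-state distributions T(.|s,a) and T_hat(.|s,a) on points of the real line.
support = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.1, 0.4, 0.4, 0.1])
t_hat = np.array([0.3, 0.2, 0.2, 0.3])
C = 2.5   # stand-in for the Lipschitz bound K(R) / (1 - gamma * K_W(T))

# Right-hand side: C^2 * W(T, T_hat)^2.
w = wasserstein_distance(support, support, u_weights=t, v_weights=t_hat)
rhs = C**2 * w**2

# Left-hand side: the VAML objective over C-Lipschitz functions, evaluated at the
# optimal piecewise-linear witness with slopes -C * sign(F_t - F_t_hat).
gaps = np.diff(support)
F_t, F_t_hat = np.cumsum(t)[:-1], np.cumsum(t_hat)[:-1]   # CDFs on interior intervals
slopes = -C * np.sign(F_t - F_t_hat)
f = np.concatenate([[0.0], np.cumsum(slopes * gaps)])     # f evaluated at the support
lhs = (f @ (t - t_hat)) ** 2

print(f"sup over C-Lipschitz f: {lhs:.6f}   C^2 * W^2: {rhs:.6f}")   # the two agree

Adding a constant to f does not change the integral against T − T̂, which is why the witness can be anchored at f(x_0) = 0.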

6 Conclusion and Future Work

We showed that the value function of an MDP is Lipschitz. This result enabled us to draw a connection between value-aware model-based reinforcement learning and the Wasserstein metric. We hypothesize that the value function is Lipschitz in a more general sense, so further investigation of the Lipschitz continuity of value functions should be interesting in its own right. A second interesting direction relates to the design of practical model-learning algorithms that can minimize Wasserstein. Two promising approaches are the use of generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017) and approximations such as entropic regularization (Frogner et al., 2015). We leave these two directions for future work.

References

Abbeel, Pieter, Quigley, Morgan, and Ng, Andrew Y. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1-8. ACM, 2006.

Agostini, Alejandro and Celaya, Enric. Reinforcement learning with a Gaussian mixture model. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2010.

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.

Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243-252, 2017.

Asadi, Kavosh, Misra, Dipendra, and Littman, Michael L. Lipschitz continuity in model-based reinforcement learning. arXiv preprint, 2018.

Bagnell, J. Andrew and Schneider, Jeff G. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA), volume 2. IEEE, 2001.

Bellemare, Marc G., Dabney, Will, and Munos, Rémi. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.

Bellman, Richard. A Markovian decision process. Journal of Mathematics and Mechanics, 1957.

Farahmand, Amir-massoud, Barreto, Andre, and Nikovski, Daniel. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kantorovich, Leonid Vasilevich and Rubinstein, G. Sh. On a space of completely additive functions. Vestnik Leningrad University, 13(7):52-59, 1958.

Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, 1996.

Nachum, Ofir, Norouzi, Mohammad, Xu, Kelvin, and Schuurmans, Dale. Bridging the gap between value and policy based reinforcement learning. arXiv preprint, 2017.

Neu, Gergely, Jonsson, Anders, and Gómez, Vicenç. A unified view of entropy-regularized Markov decision processes. arXiv preprint, 2017.

Ross, Stéphane and Bagnell, Drew. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, UK, 2012.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Talvitie, Erik. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014), Quebec City, Quebec, Canada, 2014.

Vaserstein, Leonid Nisonovich. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64-72, 1969.

Venkatraman, Arun, Hebert, Martial, and Bagnell, J. Andrew. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015.

Villani, Cédric. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
