arxiv: v1 [cs.ai] 27 Oct 2017

Size: px
Start display at page:

Download "arxiv: v1 [cs.ai] 27 Oct 2017"

Transcription

1 Distributional Reinforcement Learning with Quantile Regression Will Dabney DeepMind Mark Rowland University of Cambridge Marc G. Bellemare Google Brain Rémi Munos DeepMind arxiv: v1 [cs.ai] 27 Oct 2017 Abstract In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51. Introduction In reinforcement learning, the value of an action a in state s describes the expected return, or discounted sum of rewards, obtained from beginning in that state, choosing action a, and subsequently following a prescribed policy. Because knowing this value for the optimal policy is sufficient to act optimally, it is the object modelled by classic value-based methods such as SARSA (Rummery and Niranjan 1994) and Q- Learning (Watkins and Dayan 1992), which use Bellman s equation (Bellman 1957) to efficiently reason about value. Recently, Bellemare, Dabney, and Munos (2017) showed that the distribution of the random returns, whose expectation constitutes the aforementioned value, can be described by the distributional analogue of Bellman s equation, echoing previous results in risk-sensitive reinforcement learning (Heger 1994; Morimura et al. 2010; Chow et al. 2015). In this previous work, however, the authors argued for the usefulness in modeling this value distribution in and of itself. Their claim was asserted by exhibiting a distributional reinforcement learning algorithm, C51, which achieved state-of- Contributed during an internship at DeepMind. the-art on the suite of benchmark Atari 2600 games (Bellemare et al. 2013). One of the theoretical contributions of the C51 work was a proof that the distributional Bellman operator is a contraction in a maximal form of the Wasserstein metric between probability distributions. In this context, the Wasserstein metric is particularly interesting because it does not suffer from disjoint-support issues (Arjovsky, Chintala, and Bottou 2017) which arise when performing Bellman updates. Unfortunately, this result does not directly lead to a practical algorithm: as noted by the authors, and further developed by Bellemare et al. (2017), the Wasserstein metric, viewed as a loss, cannot generally be minimized using stochastic gradient methods. This negative result left open the question as to whether it is possible to devise an online distributional reinforcement learning algorithm which takes advantage of the contraction result. Instead, the C51 algorithm first performs a heuristic projection step, followed by the minimization of a KL divergence between projected Bellman update and prediction. The work therefore leaves a theory-practice gap in our understanding of distributional reinforcement learning, which makes it difficult to explain the good performance of C51. Thus, the existence of a distributional algorithm that operates end-to-end on the Wasserstein metric remains an open question. In this paper, we answer this question affirmatively. By appealing to the theory of quantile regression (Koenker 2005), we show that there exists an algorithm, applicable in a stochastic approximation setting, which can perform distributional reinforcement learning over the Wasserstein metric. Our method relies on the following techniques: We transpose the parametrization from C51: whereas the former uses N fixed locations for its approximation distribution and adjusts their probabilities, we assign fixed, uniform probabilities to N adjustable locations; We show that quantile regression may be used to stochastically adjust the distributions locations so as to minimize the Wasserstein distance to a target distribution. We formally prove contraction mapping results for our overall algorithm, and use these results to conclude that our method performs distributional RL end-to-end under the Wasserstein metric, as desired.

2 The main interest of the original distributional algorithm was its state-of-the-art performance, despite still acting by maximizing expectations. One might naturally expect that a direct minimization of the Wasserstein metric, rather than its heuristic approximation, may yield even better results. We derive the Q-Learning analogue for our method (QR-DQN), apply it to the same suite of Atari 2600 games, and find that it achieves even better performance. By using a smoothed version of quantile regression, Huber quantile regression, we gain an impressive 33% median score increment over the already state-of-the-art C51. Distributional RL We model the agent-environment interactions by a Markov decision process (MDP) (X, A, R, P, γ) (Puterman 1994), with X and A the state and action spaces, R the random variable reward function, P (x x, a) the probability of transitioning from state x to state x after taking action a, and γ [0, 1) the discount factor. A policy π( x) maps each state x X to a distribution over A. For a fixed policy π, the return, Z π = t=0 γt R t, is a random variable representing the sum of discounted rewards observed along one trajectory of states while following π. Standard RL algorithms estimate the expected value of Z π, the value function, [ ] V π (x) := E [Z π (x)] = E γ t R(x t, a t ) x 0 = x. t=0 (1) Similarly, many RL algorithms estimate the action-value function, [ ] Q π (x, a) := E [Z π (x, a)] = E γ t R(x t, a t ), (2) t=0 x t P ( x t 1, a t 1 ), a t π( x t ), x 0 = x, a 0 = a. The ɛ-greedy policy on Q π chooses actions uniformly at random with probability ɛ and otherwise according to arg max a Q π (x, a). In distributional RL the distribution over returns (i.e. the probability law of Z π ), plays the central role and replaces the value function. We will refer to the value distribution by its random variable. When we say that the value function is the mean of the value distribution we are saying that the value function is the expected value, taken over all sources of intrinsic randomness (Goldstein, Misra, and Courtage 1981), of the value distribution. This should highlight that the value distribution is not designed to capture the uncertainty in the estimate of the value function (Dearden, Friedman, and Russell 1998; Engel, Mannor, and Meir 2005), that is the parametric uncertainty, but rather the randomness in the returns intrinsic to the MDP. Temporal difference (TD) methods significantly speed up the learning process by incrementally improving an estimate of Q π using dynamic programming through the Bellman operator (Bellman 1957), T π Q(x, a) = E [R(x, a)] + γe P,π [Q(x, a )]. q 1 T Z q 2 T Z z 1 4z z 2 D KL ( T ZkZ) T Z 4z 4z 4z Figure 1: Projection used by C51 assigns mass inversely proportional to distance from nearest support. Update minimizes KL between projected target and estimate. Similarly, the value distribution can be computed through dynamic programming using a distributional Bellman operator (Bellemare, Dabney, and Munos 2017), T π Z(x, a) : D = R(x, a) + γz(x, a ), (3) x P ( x, a), a π( x ), where Y : = D U denotes equality of probability laws, that is the random variable Y is distributed according to the same law as U. The C51 algorithm models Z π (x, a) using a discrete distribution supported on a comb of fixed locations z 1 z N uniformly spaced over a predetermined interval. The parameters of that distribution are the probabilities q i, expressed as logits, associated with each location z i. Given a current value distribution, the C51 algorithm applies a projection step Φ to map the target T π Z onto its finite element support, followed by a Kullback-Leibler (KL) minimization step (see Figure 1). C51 achieved state-of-the-art performance on Atari 2600 games, but did so with a clear disconnect with the theoretical results of Bellemare, Dabney, and Munos (2017). We now review these results before extending them to the case of approximate distributions. The Wasserstein Metric The p-wasserstein metric W p, for p [1, ], also known as the Mallows metric (Bickel and Freedman 1981) or the Earth Mover s Distance (EMD) when p = 1 (Levina and Bickel 2001), is an integral probability metric between distributions. The p-wasserstein distance is characterized as the L p metric on inverse cumulative distribution functions (inverse CDFs) (Müller 1997). That is, the p-wasserstein metric between distributions U and Y is given by, 1 ( 1 1/p W p (U, Y ) = F 1 1 Y (ω) FU dω) (ω) p, (4) 0 where for a random variable Y, the inverse CDF F 1 Y of Y is defined by F 1 Y (ω) := inf{y R : ω F Y (y)}, (5) where F Y (y) = P r(y y) is the CDF of Y. Figure 2 illustrates the 1-Wasserstein distance as the area between two CDFs. 1 For p =, W (Y, U) = sup ω [0,1] F 1 1 (ω) F (ω). Y U

3 Recently, the Wasserstein metric has been the focus of increased research due to its appealing properties of respecting the underlying metric distances between outcomes (Arjovsky, Chintala, and Bottou 2017; Bellemare et al. 2017). Unlike the Kullback-Leibler divergence, the Wasserstein metric is a true probability metric and considers both the probability of and the distance between various outcome events. These properties make the Wasserstein well-suited to domains where an underlying similarity in outcome is more important than exactly matching likelihoods. Convergence of Distributional Bellman Operator In the context of distributional RL, let Z be the space of action-value distributions with finite moments: Z = {Z : X A P(R) E [ Z(x, a) p ] <, (x, a), p 1}. Then, for two action-value distributions Z 1, Z 2 Z, we will use the maximal form of the Wasserstein metric introduced by (Bellemare, Dabney, and Munos 2017), d p (Z 1, Z 2 ) := sup W p (Z 1 (x, a), Z 2 (x, a)). (6) x,a It was shown that d p is a metric over value distributions. Furthermore, the distributional Bellman operator T π is a contraction in d p, a result that we now recall. Lemma 1 (Lemma 3, Bellemare, Dabney, and Munos 2017). T π is a γ-contraction: for any two Z 1, Z 2 Z, d p (T π Z 1, T π Z 2 ) γ d p (Z 1, Z 2 ). Lemma 1 tells us that d p is a useful metric for studying the behaviour of distributional reinforcement learning algorithms, in particular to show their convergence to the fixed point Z π. Moreover, the lemma suggests that an effective way in practice to learn value distributions is to attempt to minimize the Wasserstein distance between a distribution Z and its Bellman update T π Z, analogous to the way that TDlearning attempts to iteratively minimize the L 2 distance between Q and T Q. Unfortunately, another result shows that we cannot in general minimize the Wasserstein metric (viewed as a loss) using stochastic gradient descent. Theorem 1 (Theorem 1, Bellemare et al. 2017). Let Ŷm := 1 m m i=1 δ Y i be the empirical distribution derived from samples Y 1,..., Y m drawn from a Bernoulli distribution B. Let B µ be a Bernoulli distribution parametrized by µ, the probability of the variable taking the value 1. Then the minimum of the expected sample loss is in general different from the minimum of the true Wasserstein loss; that is, arg min µ [ E Wp (Ŷm, B µ ) ] arg min W p (B, B µ ). Y 1:m This issue becomes salient in a practical context, where the value distribution must be approximated. Crucially, the C51 algorithm is not guaranteed to minimize any p- Wasserstein metric. This gap between theory and practice in distributional RL is not restricted to C51. Morimura et µ Probability Space 4 = ˆ 4 ˆ 3 ˆ 2 ˆ 1 0 =0 Z 2 Z W1 Z 2 Z Q z 1 = F 1 Z (ˆ 1) z 2 z 3 z 4 Space of Returns Figure 2: 1-Wasserstein minimizing projection onto N = 4 uniformly weighted Diracs. Shaded regions sum to form the 1-Wasserstein error. al. (2010) parameterize a value distribution with the mean and scale of a Gaussian or Laplace distribution, and minimize the KL divergence between the target T π Z and the prediction Z. They demonstrate that value distributions learned in this way are sufficient to perform risk-sensitive Q- Learning. However, any theoretical guarantees derived from their method can only be asymptotic; the Bellman operator is at best a non-expansion in KL divergence. Approximately Minimizing Wasserstein Recall that C51 approximates the distribution at each state by attaching variable (parametrized) probabilities q 1,..., q N to fixed locations z 1 z N. Our approach is to transpose this parametrization by considering fixed probabilities but variable locations. Specifically, we take uniform weights, so that q i = 1/N for each i = 1,..., N. Effectively, our new approximation aims to estimate quantiles of the target distribution. Accordingly, we will call it a quantile distribution, and let Z Q be the space of quantile distributions for fixed N. We will denote the cumulative probabilities associated with such a distribution (that is, the discrete values taken on by the CDF) by τ 1,..., τ N, so that τ i = i N for i = 1,..., N. We will also write τ 0 = 0 to simplify notation. Formally, let θ : X A R N be some parametric model. A quantile distribution Z θ Z Q maps each state-action pair (x, a) to a uniform probability distribution supported on {θ i (x, a)}. That is, Z θ (x, a) := 1 N q 4 q 3 q 2 q 1 N δ θi(x,a), (7) i=1 where δ z denotes a Dirac at z R. Compared to the original parametrization, the benefits of a parameterized quantile distribution are threefold. First, (1) we are not restricted to prespecified bounds on the support, or a uniform resolution, potentially leading to significantly more accurate predictions when the range of returns vary

4 greatly across states. This also (2) lets us do away with the unwieldy projection step present in C51, as there are no issues of disjoint supports. Together, these obviate the need for domain knowledge about the bounds of the return distribution when applying the algorithm to new tasks. Finally, (3) this reparametrization allows us to minimize the Wasserstein loss, without suffering from biased gradients, specifically, using quantile regression. The Quantile Approximation It is well-known that in reinforcement learning, the use of function approximation may result in instabilities in the learning process (Tsitsiklis and Van Roy 1997). Specifically, the Bellman update projected onto the approximation space may no longer be a contraction. In our case, we analyze the distributional Bellman update, projected onto a parameterized quantile distribution, and prove that the combined operator is a contraction. Quantile Projection We are interested in quantifying the projection of an arbitrary value distribution Z Z onto Z Q, that is Π W1 Z := arg min W 1 (Z, Z θ ), Z θ Z Q Let Y be a distribution with bounded first moment and U a uniform distribution over N Diracs as in (7), with support {θ 1,..., θ N }. Then W 1 (Y, U) = N i=1 τi F 1 Y τ i 1 (ω) θ i dω. Lemma 2. For any τ, τ [0, 1] with τ < τ and cumulative distribution function F with inverse F 1, the set of θ R minimizing is given by τ τ F 1 (ω) θ dω, { ( )} θ R τ + τ F (θ) =. 2 In particular, if F 1 is the inverse CDF, then F 1 ((τ + τ )/2) is always a valid minimizer, and if F 1 is continuous at (τ +τ )/2, then F 1 ((τ +τ )/2) is the unique minimizer. These quantile midpoints will be denoted by ˆτ i = τi 1+τi 2 for 1 i N. Therefore, by Lemma 2, the values for {θ 1, θ 1,..., θ N } that minimize W 1 (Y, U) are given by θ i = F 1 Y (ˆτ i). Figure 2 shows an example of the quantile projection Π W1 Z minimizing the 1-Wasserstein distance to Z. 2 Quantile Regression The original proof of Theorem 1 only states the existence of a distribution whose gradients are biased. As a result, we might hope that our quantile parametrization leads to unbiased gradients. Unfortunately, this is not true. 2 We save proofs for the appendix due to space limitations. Proposition 1. Let Z θ be a quantile distribution, and Ẑm the empirical distribution composed of m samples from Z. Then for all p 1, there exists a Z such that arg min E[W p (Ẑm, Z θ )] arg min W p (Z, Z θ ). However, there is a method, more widely used in economics than machine learning, for unbiased stochastic approximation of the quantile function. Quantile regression, and conditional quantile regression, are methods for approximating the quantile functions of distributions and conditional distributions respectively (Koenker 2005). These methods have been used in a variety of settings where outcomes have intrinsic randomness (Koenker and Hallock 2001); from food expenditure as a function of household income (Engel 1857), to studying value-at-risk in economic models (Taylor 1999). The quantile regression loss, for quantile τ [0, 1], is an asymmetric convex loss function that penalizes overestimation errors with weight τ and underestimation errors with weight 1 τ. For a distribution Z, and a given quantile τ, the value of the quantile function F 1 Z (τ) may be characterized as the minimizer of the quantile regression loss L τ QR(θ) := EẐ Z [ρ τ (Ẑ θ)], where ρ τ (u) = u(τ δ {u<0} ), u R. (8) More generally, by Lemma 2 we have that the minimizing values of {θ 1,..., θ N } for W 1 (Z, Z θ ) are those that minimize the following objective: N EẐ Z [ρˆτi (Ẑ θ i)] i=1 In particular, this loss gives unbiased sample gradients. As a result, we can find the minimizing {θ 1,..., θ N } by stochastic gradient descent. Quantile Huber Loss The quantile regression loss is not smooth at zero; as u 0 +, the gradient of Equation 8 stays constant. We hypothesized that this could limit performance when using non-linear function approximation. To this end, we also consider a modified quantile loss, called the quantile Huber loss. 3 This quantile regression loss acts as an asymmetric squared loss in an interval [ κ, κ] around zero and reverts to a standard quantile loss outside this interval. The Huber loss is given by (Huber 1964), { 1 L κ (u) = 2 u2, if u κ κ( u 1. (9) 2κ), otherwise The quantile Huber loss is then simply the asymmetric variant of the Huber loss, ρ κ τ (u) = τ δ {u<0} L κ (u). (10) For notational simplicity we will denote ρ 0 τ = ρ τ, that is, it will revert to the standard quantile regression loss. 3 Our quantile Huber loss is related to, but distinct from that of Aravkin et al. (2014).

5 Combining Projection and Bellman Update We are now in a position to prove our main result, which states that the combination of the projection implied by quantile regression with the Bellman operator is a contraction. The result is in -Wasserstein metric, i.e. the size of the largest gap between the two CDFs. Proposition 2. Let Π W1 be the quantile projection defined as above, and when applied to value distributions gives the projection for each state-value distribution. For any two value distributions Z 1, Z 2 Z for an MDP with countable state and action spaces, d (Π W1 T π Z 1, Π W1 T π Z 2 ) γ d (Z 1, Z 2 ). (11) We therefore conclude that the combined operator Π W1 T π has a unique fixed point Ẑπ, and the repeated application of this operator, or its stochastic approximation, converges to Ẑπ. Because d p d, we conclude that convergence occurs for all p [1, ]. Interestingly, the contraction property does not directly hold for p < ; see Lemma 5 in the appendix. Distributional RL using Quantile Regression We can now form a complete algorithmic approach to distributional RL consistent with our theoretical results. That is, approximating the value distribution with a parameterized quantile distribution over the set of quantile midpoints, defined by Lemma 2. Then, training the location parameters using quantile regression (Equation 8). Quantile Regression Temporal Difference Learning Recall the standard TD update for evaluating a policy π, V (x) V (x) + α(r + γv (x ) V (x)), a π( x), r R(x, a), x P ( x, a). TD allows us to update the estimated value function with a single unbiased sample following π. Quantile regression also allows us to improve the estimate of the quantile function for some target distribution, Y (x), by observing samples y Y (x) and minimizing Equation 8. Furthermore, we have shown that by estimating the quantile function for well-chosen values of τ (0, 1) we can obtain an approximation with minimal 1-Wasserstein distance from the original (Lemma 2). Finally, we can combine this with the distributional Bellman operator to give a target distribution for quantile regression. This gives us the quantile regression temporal difference learning (QRTD) algorithm, summarized simply by the update, θ i (x) θ i (x) + α(ˆτ i δ {r+γz <θ i(x))}), (12) a π( x), r R(x, a), x P ( x, a), z Z θ (x ), where Z θ is a quantile distribution as in (7), and θ i (x) is the estimated value of F 1 Z π (x) (ˆτ i) in state x. It is important to note that this update is for each value of ˆτ i and is defined for a single sample from the next state value distribution. In general it is better to draw many samples of z Z(x ) and minimize the expected update. A natural approach in this case, which we use in practice, is to compute the update for all pairs of (θ i (x), θ j (x )). Next, we turn to a control algorithm and the use of non-linear function approximation. Quantile Regression DQN Q-Learning is an off-policy reinforcement learning algorithm built around directly learning the optimal action-value function using the Bellman optimality operator (Watkins and Dayan 1992), [ ] T Q(x, a) = E [R(x, a)] + γ E max Q(x, a ). a x P The distributional variant of this is to estimate a stateaction value distribution and apply a distributional Bellman optimality operator, T Z(x, a) = R(x, a) + γz(x, a ), (13) x P ( x, a), a = arg max a E z Z(x,a ) [z]. Notice in particular that the action used for the next state is the greedy action with respect to the mean of the next stateaction value distribution. For a concrete algorithm we will build on the DQN architecture (Mnih et al. 2015). We focus on the minimal changes necessary to form a distributional version of DQN. Specifically, we require three modifications to DQN. First, we use a nearly identical neural network architecture as DQN, only changing the output layer to be of size A N, where N is a hyper-parameter giving the number of quantile targets. Second, we replace the Huber loss used by DQN 4, L κ (r t + γ max a Q(x t+1, a ) Q(x t, a t )) with κ = 1, with a quantile Huber loss (full loss given by Algorithm 1). Finally, we replace RMSProp (Tieleman and Hinton 2012) with Adam (Kingma and Ba 2015). We call this new algorithm quantile regression DQN (QR-DQN). Algorithm 1 Quantile Regression Q-Learning Require: N, κ input x, a, r, x, γ [0, 1) # Compute distributional Bellman target Q(x, a ) := j q jθ j (x, a ) a arg max a Q(x, a ) T θ j r + γθ j (x, a ), j # Compute quantile regression loss (Equation 10) output N i=1 E [ j ρ (T θ κˆτi j θ i (x, a)) ] Unlike C51, QR-DQN does not require projection onto the approximating distribution s support, instead it is able to expand or contract the values arbitrarily to cover the true range of return values. As an additional advantage, this means that QR-DQN does not require the additional hyper-parameter giving the bounds of the support required by C51. The only additional hyper-parameter of QR-DQN not shared by DQN is the number of quantiles N, which controls with what resolution we approximate the value distribution. As we increase N, QR-DQN goes from DQN to increasingly able to estimate the upper and lower quantiles of the value distribution. It becomes increasingly capable of distinguishing low probability events at either end of the cumulative distribution over returns. 4 DQN uses gradient clipping of the squared error that makes it equivalent to a Huber loss with κ = 1.

6 x S x G (a) (b) (c) Z(xS) F Z(xS) Z MC Z Returns Returns [V MC(xS) V (xs)] 2 W1(Z MC(xS),Z(xS)) (d) (e) Episodes Figure 3: (a) Two-room windy gridworld, with wind magnitude shown along bottom row. Policy trajectory shown by blue path, with additional cycles caused by randomness shown by dashed line. (b, c) (Cumulative) Value distribution at start state x S, estimated by MC, Z π MC, and by QRTD, Z θ. (d, e) Value function (distribution) approximation errors for TD(0) and QRTD. Experimental Results In the introduction we claimed that learning the distribution over returns had distinct advantages over learning the value function alone. We have now given theoretically justified algorithms for performing distributional reinforcement learning, QRTD for policy evaluation and QR-DQN for control. In this section we will empirically validate that the proposed distributional reinforcement learning algorithms: (1) learn the true distribution over returns, (2) show increased robustness during training, and (3) significantly improve sample complexity and final performance over baseline algorithms. Value Distribution Approximation Error We begin our experimental results by demonstrating that QRTD actually learns an approximate value distribution that minimizes the 1-Wasserstein to the ground truth distribution over returns. Although our theoretical results already establish convergence of the former to the latter, the empirical performance helps to round out our understanding. We use a variant of the classic windy gridworld domain (Sutton and Barto 1998), modified to have two rooms and randomness in the transitions. Figure 3(a) shows our version of the domain, where we have combined the transition stochasticity, wind, and the doorway to produce a multimodal distribution over returns when anywhere in the first room. Each state transition has probability 0.1 of moving in a random direction, otherwise the transition is affected by wind moving the agent northward. The reward function is zero until reaching the goal state x G, which terminates the episode and gives a reward of 1.0. The discount factor is γ = We compute the ground truth value distribution for optimal policy π, learned by policy iteration, at each state by performing 1K Monte-Carlo (MC) rollouts and recording the observed returns as an empirical distribution, shown in Figure 3(b). Next, we ran both TD(0) and QRTD while following π for 10K episodes. Each episode begins in the designated start state (x S ). Both algorithms started with a learning rate of α = 0.1. For QRTD we used N = 32 and drop α by half every 2K episodes. Let ZMC π (x S) be the MC estimated distribution over returns from the start state x S, similarly VMC π (x S) its mean. In Figure 3 we show the approximation errors at x S for both algorithms with respect to the number of episodes. In (d) we evaluated, for both TD(0) and QRTD, the squared error, (V π MC V (x S)) 2, and in (e) we show the 1-Wasserstein metric for QRTD, W 1 (Z π MC (x S), Z(x S )), where V (x S ) and Z(x S ) are the expected returns and value distribution at state x S estimated by the algorithm. As expected both algorithms converge correctly in mean, and QRTD minimizes the 1-Wasserstein distance to Z π MC. Evaluation on Atari 2600 We now provide experimental results that demonstrate the practical advantages of minimizing the Wasserstein metric end-to-end, in contrast to the C51 approach. We use the 57 Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al. 2013). Both C51 and QR-DQN build on the standard DQN architecture, and we expect both to benefit from recent improvements to DQN such as the dueling architectures (Wang et al. 2016) and prioritized replay (Schaul et al. 2016). However, in our evaluations we compare the pure versions of C51 and QR-DQN without these additions. We present results for both a strict quantile loss, κ = 0 (QR-DQN-0), and with a Huber quantile loss with κ = 1 (QR-DQN-1). We performed hyper-parameter tuning over a set of five training games and evaluated on the full set of 57 games using these best settings (α = , ɛ ADAM = 0.01/32, and N = 200). 5 As with DQN we use a target network when computing the distributional Bellman update. We also allow ɛ to decay at the same rate as in DQN, but to a lower value of 0.01, as is common in recent work (Bellemare, Dabney, and Munos 2017; Wang et al. 2016; van Hasselt, Guez, and Silver 2016). Out training procedure follows that of Mnih et al. (2015) s, and we present results under two evaluation protocols: best agent performance and online performance. In both evaluation protocols we consider performance over all 57 Atari 2600 games, and transform raw scores into humannormalized scores (van Hasselt, Guez, and Silver 2016). 5 We swept over α in (10 3, , 10 4, , 10 5 ); ɛ ADAM in (0.01/32, 0.005/32, 0.001/32); N (10, 50, 100, 200)

7 50% 50% 50% 50% 40% 40% 40% 40% 30% 30% 30% 30% 20% 20% 20% 20% 10% Figure 4: Online evaluation results, in human-normalized scores, over 57 Atari 2600 games for 200 million training samples. (Left) Testing performance for one seed, showing median over games. (Right) Training performance, averaged over three seeds, showing percentiles (10, 20, 30, 40, and 50) over games. Mean Median >human >DQN DQN 228% 79% 24 0 DDQN 307% 118% DUEL. 373% 151% PRIOR. 434% 124% PR. DUEL. 592% 172% C51 701% 178% QR-DQN-0 881% 199% QR-DQN-1 915% 211% Table 1: Mean and median of best scores across 57 Atari 2600 games, measured as percentages of human baseline (Nair et al. 2015). Best agent performance To provide comparable results with existing work we report test evaluation results under the best agent protocol. Every one million training frames, learning is frozen and the agent is evaluated for 500K frames while recording the average return. Evaluation episodes begin with up to 30 random no-ops (Mnih et al. 2015), and the agent uses a lower exploration rate (ɛ = 0.001). As training progresses we keep track of the best agent performance achieved thus far. Table 1 gives the best agent performance, at 200 million frames trained, for QR-DQN, C51, DQN, Double DQN (van Hasselt, Guez, and Silver 2016), Prioritized replay (Schaul et al. 2016), and Dueling architecture (Wang et al. 2016). We see that QR-DQN outperforms all previous agents in mean and median human-normalized score. Online performance In this evaluation protocol (Figure 4) we track the average return attained during each testing (left) and training (right) iteration. For the testing performance we use a single seed for each algorithm, but show online performance with no form of early stopping. For training performance, values are averages over three seeds. Instead of reporting only median performance, we look at the distribution of human-normalized scores over the full set of games. Each bar represents the score distribution at a fixed percentile (10th, 20th, 30th, 40th, and 50th). The upper percentiles show a similar trend but are omitted here for visual clarity, as their scale dwarfs the more informative lower half. From this, we can infer a few interesting results. (1) Early in learning, most algorithms perform worse than random for at least 10% of games. (2) QRTD gives similar improvements to sample complexity as prioritized replay, while also improving final performance. (3) Even at 200 million frames, there are 10% of games where all algorithms reach less than 10% of human. This final point in particular shows us that all of our recent advances continue to be severely limited on a small subset of the Atari 2600 games. Conclusions The importance of the distribution over returns in reinforcement learning has been (re)discovered and highlighted many times by now. In Bellemare, Dabney, and Munos (2017) the idea was taken a step further, and argued to be a central part of approximate reinforcement learning. However, the paper left open the question of whether there exists an algorithm which could bridge the gap between Wasserstein-metric theory and practical concerns. In this paper we have closed this gap with both theoretical contributions and a new algorithm which achieves stateof-the-art performance in Atari There remain many promising directions for future work. Most exciting will be to expand on the promise of a richer policy class, made possible by action-value distributions. We have mentioned a few examples of such policies, often used for risk-sensitive decision making. However, there are many more possible decision policies that consider the action-value distributions as a whole. Additionally, QR-DQN is likely to benefit from the improvements on DQN made in recent years. For instance, due to the similarity in loss functions and Bellman operators we might expect that QR-DQN suffers from similar overestimation biases to those that Double DQN was designed to address (van Hasselt, Guez, and Silver 2016). A natural next step would be to combine QR-DQN with the nondistributional methods found in Table 1.

8 Acknowledgements The authors acknowledge the vital contributions of their colleagues at DeepMind. Special thanks to Tom Schaul, Audrunas Gruslys, Charles Blundell, and Benigno Uria for their early suggestions and discussions on the topic of quantile regression. Additionally, we are grateful for feedback from David Silver, Yee Whye Teh, Georg Ostrovski, Joseph Modayil, Matt Hoffman, Hado van Hasselt, Ian Osband, Mohammad Azar, Tom Stepleton, Olivier Pietquin, Bilal Piot; and a second acknowledgement in particular of Tom Schaul for his detailed review of an previous draft. References Aravkin, A. Y.; Kambadur, A.; Lozano, A. C.; and Luss, R Sparse Quantile Huber Regression for Efficient and Robust Estimation. arxiv. Arjovsky, M.; Chintala, S.; and Bottou, L Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML). Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: Bellemare, M. G.; Danihelka, I.; Dabney, W.; Mohamed, S.; Lakshminarayanan, B.; Hoyer, S.; and Munos, R The Cramer Distance as a Solution to Biased Wasserstein Gradients. arxiv. Bellemare, M. G.; Dabney, W.; and Munos, R A Distributional Perspective on Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning (ICML). Bellman, R. E Dynamic Programming. Princeton, NJ: Princeton University Press. Bickel, P. J., and Freedman, D. A Some Asymptotic Theory for the Bootstrap. The Annals of Statistics Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach. In Advances in Neural Information Processing Systems (NIPS), Dearden, R.; Friedman, N.; and Russell, S Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence. Engel, Y.; Mannor, S.; and Meir, R Reinforcement Learning with Gaussian Processes. In Proceedings of the International Conference on Machine Learning (ICML). Engel, E Die Productions-und Consumtionsverhältnisse des Königreichs Sachsen. Zeitschrift des Statistischen Bureaus des Königlich Sächsischen Ministeriums des Innern 8:1 54. Goldstein, S.; Misra, B.; and Courtage, M On Intrinsic Randomness of Dynamical Systems. Journal of Statistical Physics 25(1): Heger, M Consideration of Risk in Reinforcement Learning. In Proceedings of the 11th International Conference on Machine Learning, Huber, P. J Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35(1): Kingma, D., and Ba, J Adam: A Method for Stochastic Optimization. Proceedings of the International Conference on Learning Representations. Koenker, R., and Hallock, K Quantile Regression: An Introduction. Journal of Economic Perspectives 15(4): Koenker, R Quantile Regression. Cambridge University Press. Levina, E., and Bickel, P The Earth Mover s Distance is the Mallows Distance: Some Insights from Statistics. In The 8th IEEE International Conference on Computer Vision (ICCV). IEEE. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al Human-level Control through Deep Reinforcement Learning. Nature 518(7540): Morimura, T.; Hachiya, H.; Sugiyama, M.; Tanaka, T.; and Kashima, H Parametric Return Density Estimation for Reinforcement Learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). Müller, A Integral Probability Metrics and their Generating Classes of Functions. Advances in Applied Probability 29(2): Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.; Beattie, C.; and Petersen, S. e. a Massively Parallel Methods for Deep Reinforcement Learning. In ICML Workshop on Deep Learning. Puterman, M. L Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc. Rummery, G. A., and Niranjan, M On-line Q- learning using Connectionist Systems. Technical report, Cambridge University Engineering Department. Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D Prioritized Experience Replay. In Proceedings of the International Conference on Learning Representations (ICLR). Sutton, R. S., and Barto, A. G Reinforcement Learning: An Introduction. MIT Press. Taylor, J. W A Quantile Regression Approach to Estimating the Distribution of Multiperiod Returns. The Journal of Derivatives 7(1): Tieleman, T., and Hinton, G Lecture 6.5: Rmsprop. COURSERA: Neural Networks for Machine Learning 4(2). Tsitsiklis, J. N., and Van Roy, B An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control 42(5):

9 van Hasselt, H.; Guez, A.; and Silver, D Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H. v.; Lanctot, M.; and de Freitas, N Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML). Watkins, C. J., and Dayan, P Q-learning. Machine Learning 8(3): Appendix Proofs Lemma 2. For any τ, τ [0, 1] with τ < τ and cumulative distribution function F with inverse F 1, the set of θ R minimizing is given by τ τ F 1 (ω) θ dω, { ( )} θ R τ + τ F (θ) =. 2 In particular, if F 1 is the inverse CDF, then F 1 ((τ + τ )/2) is always a valid minimizer, and if F 1 is continuous at (τ +τ )/2, then F 1 ((τ +τ )/2) is the unique minimizer. Proof. For any ω [0, 1], the function θ F 1 (ω) θ is convex, and has subgradient given by 1 if θ < F 1 (ω) θ [ 1, 1] if θ = F 1 (ω) 1 if θ > F 1 (ω), so the function θ τ F 1 (ω) θ dω is also convex, and τ has subgradient given by F (θ) τ θ 1dω + 1dω. τ F (θ) Setting this subgradient equal to 0 yields (τ + τ ) 2F (θ) = 0, (14) and since F F 1 is the identity map on [0, 1], it is clear that θ = F 1 ((τ +τ )/2) satisfies Equation 14. Note that in fact any θ such that F (θ) = (τ + τ )/2 yields a subgradient of 0, which leads to a multitude of minimizers if F 1 is not continuous at (τ + τ )/2. Proposition 1. Let Z θ be a quantile distribution, and Ẑm the empirical distribution composed of m samples from Z. Then for all p 1, there exists a Z such that arg min E[W p (Ẑm, Z θ )] arg min W p (Z, Z θ ). Proof. Write Z θ = N i=1 1 N δ θ i, with θ 1 θ N. We take Z to be of the same form as Z θ. Specifically, consider Z given by Z = N i=1 1 N δ i, supported on the set {1,..., N}, and take m = N. Then clearly the unique minimizing Z θ for W p (Z, Z θ ) is given by taking Z θ = Z. However, consider the gradient with respect to θ 1 for the objective We have E[W p (ẐN, Z θ )]. θ1 E[W p (ẐN, Z θ )] θ1=1 = E[ θ1 W p (ẐN, Z θ ) θ1=1].

10 In the event that the sample distribution ẐN has an atom at 1, then the optimal transport plan pairs the atom of Z θ at θ 1 = 1 with this atom of ẐN, and gradient with respect to θ 1 of W p (ẐN, Z θ ) is 0. If the sample distribution ẐN does not contain an atom at 1, then the left-most atom of ẐN is greater than 1 (since Z is supported on {1,..., N}. In this case, the gradient on θ 1 is negative. Since this happens with non-zero probability, we conclude that θ1 E[W p (ẐN, Z θ )] θ1=1 < 0, and therefore Z θ = Z cannot be the minimizer of E[W p (ẐN, Z θ )]. Proposition 2. Let Π W1 be the quantile projection defined as above, and when applied to value distributions gives the projection for each state-value distribution. For any two value distributions Z 1, Z 2 Z for an MDP with countable state and action spaces, d (Π W1 T π Z 1, Π W1 T π Z 2 ) γ d (Z 1, Z 2 ). (11) Proof. We assume that instantaneous rewards given a stateaction pair are deterministic; the general case is a straightforward generalization. Further, since the operator T π is a γ-contraction in d, it is sufficient to prove the claim in the case γ = 1. In addition, since Wasserstein distances are invariant under translation of the support of distributions, it is sufficient to deal with the case where r(x, a) 0 for all (x, a) X A. The proof then proceeds by first reducing to the case where every value distribution consists only of single Diracs, and then dealing with this reduced case using Lemma 3. We write Z(x, a) = N k=1 1 N δ θ k (x,a) and Y (x, a) = N k=1 1 N δ ψ k (x,a), for some functions θ, ψ : X A R n. Let (x, a) be a state-action pair, and let ((x i, a i )) i I be all the state-action pairs that are accessible from (x, a ) in a single transition, where I is a (finite or countable) indexing set. Write p i for the probability of transitioning from (x, a ) to (x i, a i ), for each i I. We now construct a new MDP and new value distributions for this MDP in which all distributions are given by single Diracs, with a view to applying Lemma 3. The new MDP is of the following form. We take the state-action pair (x, a ), and define new states, actions, transitions, and a policy π, so that the stateaction pairs accessible from (x, a ) in this new MDP are given by (( x j i, ãj i ) i I) N j=1, and the probability of reaching the state-action pair ( x j i, ãj i ) is p i/n. Further, we define new value distributions Z, Ỹ as follows. For each i I and j = 1,..., N, we set: Z( x j i, ãj i ) = δ θ j(x i,a i) Ỹ ( x j i, ãj i ) = δ ψ j(x i,a i). The construction is illustrated in Figure 5. Since, by Lemma 4, the d distance between the 1- Wasserstein projections of two real-valued distributions is the max over the difference of a certain set of quantiles, we Figure 5: Initial MDP and value distribution Z (top), and transformed MDP and value distribution Z (bottom). may appeal to Lemma 3 to obtain the following: d (Π W1 (T π Z)(x, a ), Π W1 (T π Ỹ )(x, a )) θ j (x i, a i ) ψ j (x i, a i ) sup i=1 I j=1,...,n = sup d (Z(x i, a i ), Y (x i, a i )) (15) i=1 I Now note that by construction, (T π Z)(x, a ) (respectively, (T π Ỹ )(x, a )) has the same distribution as (T π Z)(x, a ) (respectively, (T π Y )(x, a )), and so d (Π W1 (T π Z)(x, a ), Π W1 (T π Ỹ )(x, a )) = d (Π W1 (T π Z)(x, a ), Π W1 (T π Y )(x, a )). Therefore, substituting this into the Inequality 15, we obtain d (Π W1 (T π Z)(x, a ), Π W1 (T π Y )(x, a )) sup d (Z(x i, a i ), Y (x i, a i )). i I Taking suprema over the initial state (x, a ) then yields the result. Supporting results Lemma 3. Consider an MDP with countable state and action spaces. Let Z, Y be value distributions such that each state-action distribution Z(x, a), Y (x, a) is given by a single Dirac. Consider the particular case where rewards are identically 0 and γ = 1, and let τ [0, 1]. Denote by Π τ

11 the projection operator that maps a probability distribution onto a Dirac delta located at its τ th quantile. Then d (Π τ T π Z, Π τ T π Y ) d (Z, Y ) Proof. Let Z(x, a) = δ θ(x,a) and Y (x, a) = δ ψ(x,a) for each state-action pair (x, a) X A, for some functions ψ, θ : X A R. Let (x, a ) be a state-action pair, and let ((x i, a i )) i I be all the state-action pairs that are accessible from (x, a ) in a single transition, with I a (finite or countably infinite) indexing set. To lighten notation, we write θ i for θ(x i, a i ) and ψ i for ψ(x i, a i ). Further, let the probability of transitioning from (x, a ) to (x i, a i ) be p i, for all i I. Then we have (T π Z)(x, a ) = i I (T π Y )(x, a ) = i I p i δ θi (16) p i δ ψi. (17) Now consider the τ th quantile of each of these distributions, for τ [0, 1] arbitrary. Let u I be such that θ u is equal to this quantile of (T π Z)(x, a ), and let v I such that ψ v is equal to this quantile of (T π Y )(x, a ). Now note that d (Π τ T π Z(x, a ), Π τ T π Y (x, a )) = θ u ψ v We now show that θ u ψ v > θ i ψ i i I (18) is impossible, from which it will follow that d (Π τ T π Z(x, a ), Π τ T π Y (x, a )) d (Z, Y ), and the result then follows by taking maxima over stateaction pairs (x, a ). To demonstrate the impossibility of (18), without loss of generality we take θ u ψ v. We now introduce the following partitions of the indexing set I. Define: I θu = {i I θ i θ u }, I >θu = {i I θ i > θ u }, I <ψv = {i I ψ i < ψ v }, I ψv = {i I ψ i ψ v }, and observe that we clearly have the following disjoint unions: I = I θu I >θu, I = I <ψv I ψv. If (18) is to hold, then we must have I θu I ψv =. Therefore, we must have I θu I <ψv. But if this is the case, then since θ u is the τ th quantile of (T π Z)(x, a ), we must have p i τ, i I θu and so consequently p i τ, i I <ψv from which we conclude that the τ th quantile of (T π Y )(x, a ) is less than ψ v, a contradiction. Therefore (18) cannot hold, completing the proof. Lemma 4. For any two probability distributions ν 1, ν 2 over the real numbers, and the Wasserstein projection operator Π W1 that projects distributions onto support of size n, we have that d (Π W1 ν 1, Π W1 ν 2 ) ( ) ( ) = max 2i 1 2i 1 i=1,...,n F ν 1 1 Fν 1 2n 2. 2n Proof. By the discussion surrounding Lemma 2, we have that Π W1 ν k = n i=1 1 n δ Fν 1 k ( 2i 1 2n ) for k = 1, 2. Therefore, the optimal coupling between Π W1 ν 1 and Π W1 ν 2 must be given by Fν 1 1 ( 2i 1 1 2n ) Fν 2 ( 2i 1 2n ) for each i = 1,..., n. This immediately leads to the expression of the lemma. Further theoretical results Lemma 5. The projected Bellman operator Π W1 T π is in general not a non-expansion in d p, for p [1, ). Proof. Consider the case where the number of Dirac deltas in each distribution, N, is equal to 2, and let γ = 1. We consider an MDP with a single initial state, x, and two terminal states, x 1 and x 2. We take the action space of the MDP to be trivial, and therefore omit it in the notation that follows. Let the MDP have a 2/3 probability of transitioning from x to x 1, and 1/3 probability of transitioning from x to x 2. We take all rewards in the MDP to be identically 0. Further, consider two value distributions, Z and Y, given by: Z(x 1 ) = 1 2 δ δ 2, Y (x 1 ) = 1 2 δ δ 2, Z(x 2 ) = 1 2 δ δ 5, Y (x 2 ) = 1 2 δ δ 5, Z(x) = δ 0, Y (x) = δ 0. Then note that we have ( ) 1/p 1 d p (Z(x 1 ), Y (x 1 )) = 1 0 = 1 2 2, 1/p ( ) 1/p 1 d p (Z(x 2 ), Y (x 2 )) = 4 3 = 1 2 2, 1/p d p (Z(x), Y (x)) = 0, and so d p (Z, Y ) = /p We now consider the projected backup for these two value distributions at the state x. We first compute the full backup: (T π Z)(x) = 1 3 δ δ δ δ 5, (T π Y )(x) = 1 3 δ δ δ δ 5. Appealing to Lemma 2, we note that when projected these distributions onto two equally-weighted Diracs, the locations of these Diracs correspond to the 25% and 75% quantiles of the original distributions. We therefore have (Π W1 T π Z)(x) = 1 2 δ δ 3, (Π W1 T π Y )(x) = 1 2 δ δ 4,

12 and we therefore obtain d 1 (Π W1 T π Z, Π W1 T π Y ) = completing the proof. ( ) 1/p 1 2 ( 1 0 p p ) =1 > 1 2 1/p = d 1(Z, Y ), Notation Human-normalized scores are given by (van Hasselt, Guez, and Silver 2016), score = agent random human random, where agent, human and random represent the per-game raw scores for the agent, human baseline, and random agent baseline. Symbol Table 2: Notation used in the paper Description of usage Reinforcement Learning M MDP (X, A, R, P, γ) X State space of MDP A Action space of MDP R, R t Reward function, random variable reward P Transition probabilities, P (x x, a) γ Discount factor, γ [0, 1) x, x t X States a, a, b A Actions r, r t R Rewards π Policy T π (dist.) Bellman operator T (dist.) Bellman optimality operator V π, V Value function, state-value function Q π, Q Action-value function α Step-size parameter, learning rate ɛ Exploration rate, ɛ-greedy ɛ ADAM Adam parameter κ Huber-loss parameter Huber-loss with parameter κ L κ Distributional Reinforcement Learning Z π, Z Random return, value distribution ZMC π Monte-Carlo value distribution under policy π Z Space of value distributions Ẑ π Fixed point of convergence for Π W1 T π z Z Instantiated return sample p Metric order W p p-wasserstein metric L p Metric order p d p maximal form of Wasserstein Φ Projection used by C51 Π W1 1-Wasserstein projection ρ τ Quantile regression loss ρ κ τ Huber quantile loss q 1,..., q N Probabilities, parameterized probabilities τ 0, τ 1,..., τ N Cumulative probabilities with τ 0 := 0 ˆτ 1,..., ˆτ N Midpoint quantile targets ω Sample from unit interval δ z Dirac function at z R θ Parameterized function B Bernoulli distribution B µ Parameterized Bernoulli distribution Z Q Space of quantile (value) distributions Z θ Parameterized quantile (value) distribution Y Random variable over R Y 1,..., Y m Random variable samples Empirical distribution from m-diracs Ŷ m

Distributional Reinforcement Learning with Quantile Regression

Distributional Reinforcement Learning with Quantile Regression The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Distributional Reinforcement Learning with Quantile Regression Will Dabney DeepMind Mark Rowland University of Cambridge Marc G. Bellemare

More information

An Analysis of Categorical Distributional Reinforcement Learning

An Analysis of Categorical Distributional Reinforcement Learning Mark Rowland *1 Marc G. Bellemare Will Dabney Rémi Munos Yee Whye Teh * University of Cambridge, Google Brain, DeepMind Abstract Distributional approaches to value-based reinforcement learning model the

More information

Reinforcement. Function Approximation. Learning with KATJA HOFMANN. Researcher, MSR Cambridge

Reinforcement. Function Approximation. Learning with KATJA HOFMANN. Researcher, MSR Cambridge Reinforcement Learning with Function Approximation KATJA HOFMANN Researcher, MSR Cambridge Representation and Generalization in RL Focus on training stability Learning generalizable value functions Navigating

More information

Deep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department

Deep reinforcement learning. Dialogue Systems Group, Cambridge University Engineering Department Deep reinforcement learning Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department 1 / 25 In this lecture... Introduction to deep reinforcement learning Value-based Deep RL Deep

More information

A Distributional Perspective on Reinforcement Learning

A Distributional Perspective on Reinforcement Learning Marc G. Bellemare * 1 Will Dabney * 1 Rémi Munos 1 Abstract In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement

More information

Driving a car with low dimensional input features

Driving a car with low dimensional input features December 2016 Driving a car with low dimensional input features Jean-Claude Manoli jcma@stanford.edu Abstract The aim of this project is to train a Deep Q-Network to drive a car on a race track using only

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

arxiv: v1 [cs.lg] 21 Jul 2017

arxiv: v1 [cs.lg] 21 Jul 2017 Marc G. Bellemare * 1 Will Dabney * 1 Rémi Munos 1 arxiv:1707.06887v1 [cs.lg] 21 Jul 2017 Abstract In this paper we argue for the fundamental importance of the value distribution: the distribution of the

More information

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms * Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

More information

arxiv: v1 [cs.ai] 5 Nov 2017

arxiv: v1 [cs.ai] 5 Nov 2017 arxiv:1711.01569v1 [cs.ai] 5 Nov 2017 Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de Abstract Temporal-difference (TD) learning is an important

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

(Deep) Reinforcement Learning

(Deep) Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 17 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Variational Deep Q Network

Variational Deep Q Network Variational Deep Q Network Yunhao Tang Department of IEOR Columbia University yt2541@columbia.edu Alp Kucukelbir Department of Computer Science Columbia University alp@cs.columbia.edu Abstract We propose

More information

Notes on Tabular Methods

Notes on Tabular Methods Notes on Tabular ethods Nan Jiang September 28, 208 Overview of the methods. Tabular certainty-equivalence Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an DP model from

More information

Approximation Methods in Reinforcement Learning

Approximation Methods in Reinforcement Learning 2018 CS420, Machine Learning, Lecture 12 Approximation Methods in Reinforcement Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html Reinforcement

More information

Food delivered. Food obtained S 3

Food delivered. Food obtained S 3 Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the

More information

Deep Reinforcement Learning: Policy Gradients and Q-Learning

Deep Reinforcement Learning: Policy Gradients and Q-Learning Deep Reinforcement Learning: Policy Gradients and Q-Learning John Schulman Bay Area Deep Learning School September 24, 2016 Introduction and Overview Aim of This Talk What is deep RL, and should I use

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Daniel Hennes 19.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns Forward and backward view Function

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Temporal difference learning

Temporal difference learning Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).

More information

Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning

Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning Oron Anschel 1 Nir Baram 1 Nahum Shimkin 1 Abstract Instability and variability of Deep Reinforcement Learning (DRL) algorithms

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

arxiv: v1 [cs.ne] 4 Dec 2015

arxiv: v1 [cs.ne] 4 Dec 2015 Q-Networks for Binary Vector Actions arxiv:1512.01332v1 [cs.ne] 4 Dec 2015 Naoto Yoshida Tohoku University Aramaki Aza Aoba 6-6-01 Sendai 980-8579, Miyagi, Japan naotoyoshida@pfsl.mech.tohoku.ac.jp Abstract

More information

Reinforcement Learning and NLP

Reinforcement Learning and NLP 1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value

More information

arxiv: v1 [cs.lg] 30 Dec 2016

arxiv: v1 [cs.lg] 30 Dec 2016 Adaptive Least-Squares Temporal Difference Learning Timothy A. Mann and Hugo Penedones and Todd Hester Google DeepMind London, United Kingdom {kingtim, hugopen, toddhester}@google.com Shie Mannor Electrical

More information

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Michaël Castronovo University of Liège, Institut Montefiore, B28, B-4000 Liège, BELGIUM Damien Ernst

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

arxiv: v4 [cs.ai] 10 Mar 2017

arxiv: v4 [cs.ai] 10 Mar 2017 Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning Oron Anschel 1 Nir Baram 1 Nahum Shimkin 1 arxiv:1611.1929v4 [cs.ai] 1 Mar 217 Abstract Instability and variability of

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Temporal Difference Learning Temporal difference learning, TD prediction, Q-learning, elibigility traces. (many slides from Marc Toussaint) Vien Ngo MLR, University of Stuttgart

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

REINFORCEMENT LEARNING

REINFORCEMENT LEARNING REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning RL in continuous MDPs March April, 2015 Large/Continuous MDPs Large/Continuous state space Tabular representation cannot be used Large/Continuous action space Maximization over action

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Deep Q-Learning with Recurrent Neural Networks

Deep Q-Learning with Recurrent Neural Networks Deep Q-Learning with Recurrent Neural Networks Clare Chen cchen9@stanford.edu Vincent Ying vincenthying@stanford.edu Dillon Laird dalaird@cs.stanford.edu Abstract Deep reinforcement learning models have

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 50 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

SAMPLE-EFFICIENT DEEP REINFORCEMENT LEARN-

SAMPLE-EFFICIENT DEEP REINFORCEMENT LEARN- SAMPLE-EFFICIENT DEEP REINFORCEMENT LEARN- ING VIA EPISODIC BACKWARD UPDATE Anonymous authors Paper under double-blind review ABSTRACT We propose Episodic Backward Update - a new algorithm to boost the

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Lecture 7: Value Function Approximation

Lecture 7: Value Function Approximation Lecture 7: Value Function Approximation Joseph Modayil Outline 1 Introduction 2 3 Batch Methods Introduction Large-Scale Reinforcement Learning Reinforcement learning can be used to solve large problems,

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Optimization of a Sequential Decision Making Problem for a Rare Disease Diagnostic Application.

Optimization of a Sequential Decision Making Problem for a Rare Disease Diagnostic Application. Optimization of a Sequential Decision Making Problem for a Rare Disease Diagnostic Application. Rémi Besson with Erwan Le Pennec and Stéphanie Allassonnière École Polytechnique - 09/07/2018 Prenatal Ultrasound

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

Policy Gradient Methods. February 13, 2017

Policy Gradient Methods. February 13, 2017 Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

Weighted Double Q-learning

Weighted Double Q-learning Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-7) Weighted Double Q-learning Zongzhang Zhang Soochow University Suzhou, Jiangsu 56 China zzzhang@suda.edu.cn

More information

On the Convergence of Optimistic Policy Iteration

On the Convergence of Optimistic Policy Iteration Journal of Machine Learning Research 3 (2002) 59 72 Submitted 10/01; Published 7/02 On the Convergence of Optimistic Policy Iteration John N. Tsitsiklis LIDS, Room 35-209 Massachusetts Institute of Technology

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Reinforcement Learning. Spring 2018 Defining MDPs, Planning

Reinforcement Learning. Spring 2018 Defining MDPs, Planning Reinforcement Learning Spring 2018 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer.

This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. This question has three parts, each of which can be answered concisely, but be prepared to explain and justify your concise answer. 1. Suppose you have a policy and its action-value function, q, then you

More information

Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning

Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning Supplementary Material for Asynchronous Methods for Deep Reinforcement Learning May 2, 216 1 Optimization Details We investigated two different optimization algorithms with our asynchronous framework stochastic

More information

CS230: Lecture 9 Deep Reinforcement Learning

CS230: Lecture 9 Deep Reinforcement Learning CS230: Lecture 9 Deep Reinforcement Learning Kian Katanforoosh Menti code: 21 90 15 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep

More information

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016) Yoonho Lee Department of Computer Science and Engineering Pohang University of Science and Technology October 11, 2016 Outline

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updates Michael Kearns AT&T Labs mkearns@research.att.com Satinder Singh AT&T Labs baveja@research.att.com Abstract We give the first rigorous upper bounds

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Multiagent (Deep) Reinforcement Learning

Multiagent (Deep) Reinforcement Learning Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

arxiv: v1 [cs.lg] 28 Dec 2016

arxiv: v1 [cs.lg] 28 Dec 2016 THE PREDICTRON: END-TO-END LEARNING AND PLANNING David Silver*, Hado van Hasselt*, Matteo Hessel*, Tom Schaul*, Arthur Guez*, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto,

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

arxiv: v1 [cs.lg] 20 Sep 2018

arxiv: v1 [cs.lg] 20 Sep 2018 Predicting Periodicity with Temporal Difference Learning Kristopher De Asis, Brendan Bennett, Richard S. Sutton Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta {kldeasis,

More information

Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen Q-learning Tambet Matiisen (based on chapter 11.3 of online book Artificial Intelligence, foundations of computational agents by David Poole and Alan Mackworth) Stochastic gradient descent Experience

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement

More information

CSC321 Lecture 22: Q-Learning

CSC321 Lecture 22: Q-Learning CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize

More information

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

arxiv: v1 [cs.lg] 5 Dec 2015

arxiv: v1 [cs.lg] 5 Dec 2015 Deep Attention Recurrent Q-Network Ivan Sorokin, Alexey Seleznev, Mikhail Pavlov, Aleksandr Fedorov, Anastasiia Ignateva 5vision 5visionteam@gmail.com arxiv:1512.1693v1 [cs.lg] 5 Dec 215 Abstract A deep

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning From the basics to Deep RL Olivier Sigaud ISIR, UPMC + INRIA http://people.isir.upmc.fr/sigaud September 14, 2017 1 / 54 Introduction Outline Some quick background about discrete

More information

Linear Least-squares Dyna-style Planning

Linear Least-squares Dyna-style Planning Linear Least-squares Dyna-style Planning Hengshuai Yao Department of Computing Science University of Alberta Edmonton, AB, Canada T6G2E8 hengshua@cs.ualberta.ca Abstract World model is very important for

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Source Traces for Temporal Difference Learning

Source Traces for Temporal Difference Learning Source Traces for Temporal Difference Learning Silviu Pitis Georgia Institute of Technology Atlanta, GA, USA 30332 spitis@gatech.edu Abstract This paper motivates and develops source traces for temporal

More information

arxiv: v3 [cs.ai] 19 Jun 2018

arxiv: v3 [cs.ai] 19 Jun 2018 Brendan O Donoghue 1 Ian Osband 1 Remi Munos 1 Volodymyr Mnih 1 arxiv:1709.05380v3 [cs.ai] 19 Jun 2018 Abstract We consider the exploration/exploitation problem in reinforcement learning. For exploitation,

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Approximate Q-Learning. Dan Weld / University of Washington

Approximate Q-Learning. Dan Weld / University of Washington Approximate Q-Learning Dan Weld / University of Washington [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley materials available at http://ai.berkeley.edu.] Q Learning

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function Approximation Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration Vien

More information

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel & Sergey Levine Department of Electrical Engineering and Computer

More information

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control

ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control ECE276B: Planning & Learning in Robotics Lecture 16: Model-free Control Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Tianyu Wang: tiw161@eng.ucsd.edu Yongxi Lu: yol070@eng.ucsd.edu

More information

Count-Based Exploration in Feature Space for Reinforcement Learning

Count-Based Exploration in Feature Space for Reinforcement Learning Count-Based Exploration in Feature Space for Reinforcement Learning Jarryd Martin, Suraj Narayanan S., Tom Everitt, Marcus Hutter Research School of Computer Science, Australian National University, Canberra

More information

Actor-critic methods. Dialogue Systems Group, Cambridge University Engineering Department. February 21, 2017

Actor-critic methods. Dialogue Systems Group, Cambridge University Engineering Department. February 21, 2017 Actor-critic methods Milica Gašić Dialogue Systems Group, Cambridge University Engineering Department February 21, 2017 1 / 21 In this lecture... The actor-critic architecture Least-Squares Policy Iteration

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

Reinforcement Learning as Variational Inference: Two Recent Approaches

Reinforcement Learning as Variational Inference: Two Recent Approaches Reinforcement Learning as Variational Inference: Two Recent Approaches Rohith Kuditipudi Duke University 11 August 2017 Outline 1 Background 2 Stein Variational Policy Gradient 3 Soft Q-Learning 4 Closing

More information

arxiv: v4 [cs.lg] 14 Oct 2018

arxiv: v4 [cs.lg] 14 Oct 2018 Equivalence Between Policy Gradients and Soft -Learning John Schulman 1, Xi Chen 1,2, and Pieter Abbeel 1,2 1 OpenAI 2 UC Berkeley, EECS Dept. {joschu, peter, pieter}@openai.com arxiv:174.644v4 cs.lg 14

More information

arxiv: v2 [cs.lg] 8 Aug 2018

arxiv: v2 [cs.lg] 8 Aug 2018 : Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Tuomas Haarnoja 1 Aurick Zhou 1 Pieter Abbeel 1 Sergey Levine 1 arxiv:181.129v2 [cs.lg] 8 Aug 218 Abstract Model-free deep

More information