Optimal Learning with Non-Gaussian Rewards


Optimal Learning with Non-Gaussian Rewards

Zi Ding, Ilya O. Ryzhov

November 2, 2014

Abstract

We propose a novel theoretical characterization of the optimal Gittins index policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process which probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach enables an easy connection to the convenient time-change properties of Brownian motion. Although no such device is available in general for the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We provide the free-boundary PIDE in explicit form under the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value of a given belief state.

1 Introduction

The field of optimal learning (Powell & Ryzhov, 2012) studies the efficient collection of information in stochastic optimization problems subject to environmental uncertainty, that is, problems where uncertainty is driven by unknown probability distributions. In many applications in business, medicine, and various branches of engineering, the decision-maker is able to formulate a belief about these unknown distributions, and gradually improve it using information collected from expensive simulations or field experiments. In particular, we consider a fundamental class of optimal learning problems known as multi-armed bandit problems (Gittins et al., 2011), in which there is a finite set of competing arms or alternatives (e.g. system designs, pricing strategies, or hiring policies), each with an unknown value. Alternatives are implemented sequentially in an online manner: upon choosing an alternative, we collect a reward in the form of a noisy sample centered around the unknown value. Our objective is to maximize the cumulative discounted expected reward collected over an infinite horizon. Each individual reward thus plays two roles: it contributes immediate economic value, and it also provides information about the alternative with the potential to improve future decisions.

The trade-off between reward and information is known as the exploration vs. exploitation dilemma. This problem arises in applications where decisions are made in real time, such as dynamic pricing or advertising placement in e-commerce (Chhabra & Das, 2011) or clinical drug trials with human patients (Berry & Pearson, 1985). The bandit model is also relevant in simulation, in situations where a single simulation experiment costs money (Chick & Gans, 2009). In the classical multi-armed bandit setting, where the decision-maker's beliefs about the alternatives are mutually independent, the work by Gittins & Jones (1979) shows that the optimal strategy takes the form of an index policy. At each time stage, an index is computed for each alternative independently of our knowledge about the others, and the alternative with the highest index is implemented. The index can be expressed as the solution to an optimal stopping problem (Katehakis & Veinott, 1987). Nonetheless, despite this considerable structure (see e.g. Gittins & Wang, 1992, for additional scaling properties), which continues to inspire new theoretical research on Gittins-like policies (Filliger & Hongler, 2007; Glazebrook & Minty, 2009), even the stopping problem for a single arm is computationally intractable. This challenge has given rise to a large body of work on heuristic methods, which often make additional assumptions on the reward distribution. It is especially common to require the reward distributions to be Gaussian, or to have bounded support (see e.g. Auer et al., 2002, for examples of both). In the simulation literature, Gaussian assumptions are standard due to advantages such as the ability to concisely model correlations between estimated values (Nelson & Matejcik, 1995; Chick & Inoue, 2001; Ryzhov et al., 2012; Qu et al., 2012). More recently, however, numerous applications have emerged where observations are clearly non-Gaussian. The operations management literature has recently studied applications in assortment planning (Caro & Gallien, 2007; Glazebrook et al., 2013) where the observed demand comes from a Poisson distribution with unknown rate. The challenge of learning Poisson distributions also arises in dynamic pricing (Farias & Van Roy, 2010), optimal investment and consumption (Wang & Wang, 2010), models for household purchasing decisions (Zhang et al., 2012), and online advertising and publishing (Agarwal et al., 2009). The work by Lariviere & Porteus (1999) studies a newsvendor problem where a Bayesian gamma prior is used to model beliefs about an exponentially distributed demand. The gamma-exponential model is also used by Jouini & Moy (2012) in the problem of learning signal-to-noise ratios in channel selection, and would also be appropriate for learning service times or network latencies. Motivated by applications like the above, we consider Bayesian bandit problems under

non-Gaussian, infinitely divisible reward distributions, encompassing both exponential and Poisson models. In the Gaussian setting, a recent stream of work by Brezzi & Lai (2002), Yao (2006), and Chick & Gans (2009) has approximated the Gittins index for an arm by formulating an optimal stopping problem on a Brownian motion with unknown drift, a continuous-time process that serves as a probabilistic interpolation of the sequence of Gaussian rewards collected from the arm. By making the connection between Brownian motion and the heat equation (Steele, 2001), one can formulate and numerically solve a free-boundary problem (van Moerbeke, 1976) to approximate the Gittins index. Our approach uses a similar foundation: we interpolate the reward sequences in the non-Gaussian problems with conditional Lévy processes, which are generated by infinitely divisible distributions (Sato, 1999), and then derive the relevant stopping problems for these continuous-time processes. Although there is a stream of research on multi-armed bandit problems driven by Lévy processes (El Karoui & Karatzas, 1994; Kaspi & Mandelbaum, 1995; Mandelbaum, 1986, 1987), these works do not consider the Bayesian perspective, where the reward distributions depend on unknown (random) parameters and our beliefs about them evolve over time. The dependence of our beliefs on the information collected from previously observed rewards requires the interpolation process to have increments that are generally non-stationary and non-independent, which leads to the use of conditional Lévy processes. Such processes have been studied in a learning context by Buonaguidi & Muliere (2013) and Cohen & Solan (2013), but only in the much more restrictive setting of binary prior distributions. The major challenge in studying non-Gaussian reward problems under this interpolation technique is that we cannot exploit the time-change properties of Brownian motion to standardize the problem, as was done in the Gaussian reward case. Therefore, we develop an alternate method based on equating the infinitesimal and characteristic operators (Peskir & Shiryaev, 2006) of value functions in an optimal stopping problem. We then obtain free-boundary problems on partial integro-differential equations (PIDEs), which can potentially be used to approximate the Gittins index. The solutions to these equations are shown to possess intuitive properties, such as continuity and monotonicity, that are known to hold for classic bandit problems in discrete time. This paper makes the following contributions: 1) We propose novel continuous-time approximations to Gittins indices for non-Gaussian problems, in the form of stopping problems on conditional Lévy processes that probabilistically interpolate sequences of infinitely divisible rewards. 2) We derive free-boundary PIDEs whose solutions match the value functions in these stopping problems,

and illustrate how these solutions can be calculated. 3) We apply our method directly to the gamma-exponential and gamma-Poisson problems and derive explicitly the PIDEs corresponding to these models. 4) We prove relevant structural properties on the monotonicity, continuity, and asymptotic behavior of the value functions and Gittins indices. We also derive a scaling property for the gamma-Poisson problem that is, to our knowledge, entirely new.

Section 2 reviews non-Gaussian Bayesian statistical learning models and the Gittins index, and provides additional motivation by discussing the statistical inconsistency of one prominent heuristic (non-Gittins) approach in the non-Gaussian setting. Section 3 presents the general theory of probabilistic interpolation by conditional Lévy processes used in our analysis, and derives the free-boundary PIDEs that can be used to find Gittins indices. Section 4 applies these methods directly to the gamma-exponential and gamma-Poisson learning models, for which we derive the PIDEs in explicit form. Section 5 provides theoretical derivations of the structural properties of the Gittins index in the two problems. Section 6 provides some numerical illustrations of the structural properties we derive.

2 Optimal learning with non-Gaussian rewards

Section 2.1 sets up the notation for our analysis and describes two major classes of conjugate learning models with infinitely divisible rewards, namely gamma-exponential and gamma-Poisson. Section 2.2 reviews the Gittins index policy, known to be optimal for bandit problems. Section 2.3 provides additional motivation for our study by showing that non-Gaussian problems can cause inconsistent behavior in knowledge gradient methods, a prominent class of heuristic learning policies.

2.1 Learning models for non-Gaussian rewards

Consider a bandit problem with $M$ alternatives, with $x_n \in \{1, \dots, M\}$ denoting the alternative chosen for implementation in stage $n = 0, 1, 2, \dots$. Let $W^{x_n}_{n+1}$ be the single-period reward observed after $x_n$ is implemented. In the discrete-time problem, quantities are indexed by the time at which they become known; thus, $x_n$ is chosen at time $n$, but the output $W^{x_n}_{n+1}$ only becomes known one time period later. For fixed $x$, the rewards $W^x_1, W^x_2, \dots$ are drawn from a common sampling distribution with density $f^x(\cdot\,; \lambda^x)$, where $\lambda^x$ is an unknown parameter (or vector of parameters). The rewards are

conditionally independent given $\lambda^x$. Let $\mathcal{F}_n$ be the sigma-algebra generated by the first $n$ decisions $x_0, x_1, \dots, x_{n-1}$, as well as the resulting rewards $W^{x_0}_1, \dots, W^{x_{n-1}}_n$. The unknown parameter $\lambda^x$ is modeled as a random variable, and our beliefs about the possible values of the parameter at time $n$ are represented by the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$. In the problems we consider, this sequence of conditional distributions is characterized by a sequence $(k^x_n)_{n=0}^{\infty}$ of random vectors, where $k^x_n$ is $\mathcal{F}_n$-measurable for all $n$, and we write
\[
E\left(W^x_{n+1} \mid \mathcal{F}_n\right) = m\left(k^x_n\right) \tag{1}
\]
for some appropriately chosen function $m$, so that $m$ is the mean of the reward based on our current belief. For convenience, we also let $m^x = E\left(W^x_{n+1} \mid \lambda^x\right)$ be the true mean of the single-period reward, which is the mean of the reward distribution when the true value of $\lambda^x$ is known. Note that, if $x$ is observed infinitely often, $m(k^x_n) \rightarrow m^x$ a.s. by martingale convergence. The knowledge state $k^x_n$ can also be viewed as a set of sufficient statistics for the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$. We follow the classic multi-armed bandit model, where $\lambda^x$ and $\lambda^y$ are independent for any $x \neq y$, and likewise the single-period rewards are independent across alternatives. Thus, our beliefs about all the alternatives can be completely characterized by $k_n = \{k^1_n, \dots, k^M_n\}$. A policy $\pi$ denotes a sequence $X^\pi_0, X^\pi_1, \dots$ of functions mapping knowledge states $k_0, k_1, \dots$ to elements of $\{1, \dots, M\}$. In other words, a policy is a rule for making decisions, under any possible knowledge state, at any time stage. Our objective can thus be written as
\[
\sup_\pi E^\pi \left[ \sum_{n=0}^{\infty} \gamma^n\, m\!\left(k^{X^\pi_n(k_n)}_n\right) \right], \tag{2}
\]
where $0 < \gamma < 1$ is a pre-specified discount factor. In words, the optimal policy should maximize the expected cumulative, infinite-horizon, discounted reward obtained from the alternatives implemented by the policy. We specifically highlight two classic Bayesian learning models where the sampling distributions are infinitely divisible. In the gamma-exponential model, $f^x$ is (conditionally) exponential with unknown rate $\lambda^x$. Under the assumption that $\lambda^x \sim \mathrm{Gamma}(a^x_0, b^x_0)$, the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$ is also gamma with parameters $a^x_n$ and $b^x_n$. From DeGroot (1970), we can obtain

simple recursive relationships for the parameters, given by
\[
a^x_{n+1} = \begin{cases} a^x_n + 1 & \text{if } x_n = x, \\ a^x_n & \text{if } x_n \neq x, \end{cases} \tag{3}
\]
\[
b^x_{n+1} = \begin{cases} b^x_n + W^x_{n+1} & \text{if } x_n = x, \\ b^x_n & \text{if } x_n \neq x. \end{cases} \tag{4}
\]
In the gamma-exponential model, $k^x_n = (a^x_n, b^x_n)$, and the mean function $m$ is given by $m(k^x_n) = E\left(\frac{1}{\lambda^x} \mid \mathcal{F}_n\right) = \frac{b^x_n}{a^x_n - 1}$. We also consider the gamma-Poisson model, where $f^x$ is conditionally Poisson with unknown rate $\lambda^x$. Again, we start with $\lambda^x \sim \mathrm{Gamma}(a^x_0, b^x_0)$, whence the posterior distribution of $\lambda^x$ at time $n$ is again gamma with parameters $a^x_n$ and $b^x_n$, and the Bayesian updating equations are now given by
\[
a^x_{n+1} = \begin{cases} a^x_n + W^x_{n+1} & \text{if } x_n = x, \\ a^x_n & \text{if } x_n \neq x, \end{cases} \tag{5}
\]
\[
b^x_{n+1} = \begin{cases} b^x_n + 1 & \text{if } x_n = x, \\ b^x_n & \text{if } x_n \neq x. \end{cases} \tag{6}
\]
Again, the decision-maker's knowledge about $\lambda^x$ at time $n$ is represented by $k^x_n = (a^x_n, b^x_n)$, with mean function $m(k^x_n) = E(\lambda^x \mid \mathcal{F}_n) = \frac{a^x_n}{b^x_n}$. Both of these models are conjugate, meaning that the posterior distribution of the unknown parameter is always gamma, at any time stage. Beliefs based on conjugate priors are easy to store and update, making such models useful for practitioners. Our approach in this paper is designed for infinitely divisible sampling distributions, a class that includes both the gamma-exponential and gamma-Poisson cases. Among the standard conjugate models in DeGroot (1970), these two are among the most intuitive, and also the most suitable for applications in assortment planning and revenue management. We briefly mention a third well-known non-Gaussian model, which assumes Bernoulli rewards with beta belief distributions; the sampling distribution in this case is not infinitely divisible and does not fall under our framework, but it has been relatively well studied elsewhere (Berry & Pearson, 1985; Bertsimas & Mersereau, 2007), and we do not explore it here. To our knowledge, the gamma-exponential and gamma-Poisson models have received less theoretical attention in the bandit literature.
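The update rules (3)-(4) and (5)-(6) are simple enough to state as code. The following small sketch is our own illustration (not part of the paper); the function names and the toy data are arbitrary, and the arm superscript is dropped.

```python
def update_gamma_exponential(a, b, w):
    """One Bayesian update (3)-(4): exponential reward w, Gamma(a, b) prior on the rate."""
    return a + 1.0, b + w          # posterior mean of the reward is b / (a - 1)

def update_gamma_poisson(a, b, w):
    """One Bayesian update (5)-(6): Poisson reward w, Gamma(a, b) prior on the rate."""
    return a + w, b + 1.0          # posterior mean of the reward is a / b

# toy usage: three exponential observations from the measured arm
a, b = 2.0, 1.0
for w in (0.4, 1.3, 0.7):
    a, b = update_gamma_exponential(a, b, w)
print(a, b, b / (a - 1.0))         # updated knowledge state and posterior mean
```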

2.2 Review of Gittins indices

We briefly summarize the characterization of the Gittins index policy, known to optimally solve (2). For a more detailed introduction, the reader is referred to Ch. 6 of Powell & Ryzhov (2012). Furthermore, Gittins et al. (2011) provide a deeper theoretical treatment with several equivalent proofs of optimality for the policy. The Gittins method considers each alternative separately from the others. Let $k$ denote our beliefs about an arbitrary alternative, dropping the superscript $x$ for notational convenience. Consider a situation where, in every time stage, we have a choice between implementing this alternative and receiving a known, deterministic retirement reward $r$. The optimal decision (implement vs. retire) can be characterized using Bellman's equation for dynamic programming. We write
\[
V(k, r) = \max\left\{ r + \gamma V(k, r),\; E\left[ W + \gamma V(k', r) \mid k \right] \right\}, \tag{7}
\]
where $W$ is the reward obtained from the implementation, and $k'$ is the future knowledge state arising from the new information provided by $W$. In our setting, $k'$ would be computed using (3)-(4) or (5)-(6). Because we do not update our beliefs when we collect the fixed reward, it follows that, if we prefer the fixed reward given the knowledge state $k$, we will continue to prefer it for all future time periods. Then, (7) becomes
\[
V(k, r) = \max\left\{ \frac{r}{1 - \gamma},\; m(k) + \gamma E\left[ V(k', r) \mid k \right] \right\}. \tag{8}
\]
The Gittins index $R(k)$ is the particular retirement reward value $r$ that makes us indifferent between the two quantities inside the maximum in (8). In the special case where the parameter $\lambda^x$ is known, this retirement reward is equal to the mean single-period reward, as shown in the following lemma.

Lemma 2.1. If the parameter $\lambda^x$ is a known constant, the Gittins index of arm $x$ is $m^x$.

Proof: If $\lambda^x$ is known, there is no knowledge state, and (7) becomes $V(r) = \max\{r + \gamma V(r),\, m^x + \gamma V(r)\}$, whence the result follows immediately.

Once the Gittins indices have been computed, the policy $X^*_n(k_n) = \arg\max_x R(k^x_n)$ can be shown to be optimal for the objective in (2). Thus, the Gittins method decomposes an $M$-dimensional problem into $M$ one-dimensional problems, each of which can be solved independently of the others.
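To make the indifference characterization in (8) concrete, the following sketch (our own illustration, not the paper's method) approximates the Gittins index of a single gamma-Poisson arm by truncating the retirement problem at a finite horizon, solving it by backward induction, and bisecting on $r$. The horizon, the reward truncation, and the bisection bracket are all ad hoc assumptions, so the output is only a rough approximation of $R(a_0, b_0)$.

```python
import numpy as np
from scipy.stats import nbinom

def gittins_gamma_poisson(a0, b0, gamma=0.9, horizon=25, w_max=30, tol=1e-3):
    """Rough approximation of the discrete-time Gittins index R(a0, b0) of a
    gamma-Poisson arm, via the retirement formulation (8). Horizon and reward
    truncation are heuristic choices; accuracy degrades as gamma approaches 1."""

    def value_minus_retirement(r):
        retire = r / (1.0 - gamma)
        V_next = None
        for n in reversed(range(horizon + 1)):
            b = b0 + n
            s = np.arange(n * w_max + 1)            # cumulative observed reward so far
            mean = (a0 + s) / b                     # posterior mean in state (a0 + s, b0 + n)
            if n == horizon:
                # crude terminal condition: retire, or collect the current mean forever
                V = np.maximum(retire, mean / (1.0 - gamma))
            else:
                w = np.arange(w_max + 1)
                cont = np.empty_like(mean)
                for i, a in enumerate(a0 + s):
                    # predictive reward distribution: negative binomial with parameters a and b/(b+1)
                    pw = nbinom.pmf(w, a, b / (b + 1.0))
                    pw /= pw.sum()                  # renormalize after truncating at w_max
                    cont[i] = mean[i] + gamma * np.dot(pw, V_next[i + w])
                V = np.maximum(retire, cont)
            V_next = V
        return V_next[0] - retire                   # > 0 iff continuing beats retiring at (a0, b0)

    lo, hi = a0 / b0, a0 / b0 + 10.0                # the index is at least the current mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if value_minus_retirement(mid) > 1e-9:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(gittins_gamma_poisson(a0=1.0, b0=1.0))        # slightly above the posterior mean 1.0
```

The same backward-induction idea applies to the gamma-exponential model, but the predictive reward distribution is then continuous and the expectation requires numerical integration, which is one motivation for the continuous-time approach developed below.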

Furthermore, in the gamma-exponential version of the problem (that is, where $k = (a, b)$ and (3)-(4) are used to update $k$), it has also been shown (Gittins & Wang, 1992) that
\[
R(a, b) = b\, R(a, 1), \tag{9}
\]
meaning that Gittins indices only have to be computed for a restricted class of knowledge states. Equivalently, if we can find $b^*(a)$ such that $R\left(a, b^*(a)\right) = 1$, we can use (9) to write
\[
b^*(a)\, R(a, 1) = R\left(a, b^*(a)\right) = 1, \tag{10}
\]
whence $R(a, b) = \frac{b}{b^*(a)}$. Yet, even with this structure, it is difficult to compute $R(a, 1)$ or $b^*(a)$ for arbitrary $a$. Computational approaches based on PDEs have previously been developed only for problems with Gaussian rewards.

2.3 Inconsistency of knowledge gradient methods

This section provides additional motivation for our work by showing that non-Gaussian problems create theoretical challenges for a prominent class of sub-optimal heuristics known as knowledge gradient (KG) methods. Instead of separating the arms and computing an individual index for each one, this approach considers all the alternatives together and calculates the expected improvement
\[
R^{KG,x}_n = E\left[ \max_y m\left(k^y_{n+1}\right) - \max_y m\left(k^y_n\right) \,\middle|\, \mathcal{F}_n,\, x_n = x \right] \tag{11}
\]
that a single implementation contributes to an estimate of the value of the best alternative. The knowledge gradient (KG) policy
\[
X^{KG}_n(k_n) = \arg\max_x R^{KG,x}_n \tag{12}
\]
has received attention in the simulation community (see e.g. Chick, 2006, for an overview), because it is computationally efficient and often performs near-optimally in experiments. Typically, the goal in such problems is to identify the alternative with the highest value, rather than to maximize the cumulative reward as in (2). However, these objectives are closely related, and the method can be easily adapted to bandit problems (Ryzhov et al., 2012) with a simple modification of (12) known as online KG. If the rewards are Gaussian, the policy in (12) is statistically consistent (Frazier et al., 2008), meaning that $m(k^x_n) \rightarrow m^x$ a.s. for every $x$. In other words, the policy is guaranteed to discover the best alternative over an infinite horizon, a useful regularity property for simulation optimization

algorithms. This property typically does not hold in bandit problems, even for Gittins indices (Brezzi & Lai, 2000), but the consistency of (12) leads to certain regularity properties for online KG (Ryzhov et al., 2012). In general, the consistency of knowledge gradient policies has been shown in numerous settings where Gaussian rewards appear (Vazquez & Bect, 2010; Frazier & Powell, 2011; Ryzhov & Powell, 2012). However, as we now show, this property does not hold for the gamma-exponential learning model. The work by Ryzhov & Powell (2011) provides a closed-form solution of (11) for the gamma-exponential problem, given by
\[
R^{KG,x}_n = \begin{cases}
\dfrac{\left(b^x_n / a^x_n\right)^{a^x_n}}{\left(a^x_n - 1\right)\left(C^x_n\right)^{a^x_n - 1}} & \text{if } \dfrac{b^x_n}{a^x_n - 1} \leq C^x_n, \\[2ex]
\dfrac{\left(b^x_n / a^x_n\right)^{a^x_n}}{\left(a^x_n - 1\right)\left(C^x_n\right)^{a^x_n - 1}} - \left(\dfrac{b^x_n}{a^x_n - 1} - C^x_n\right) & \text{if } \dfrac{b^x_n}{a^x_n} \leq C^x_n < \dfrac{b^x_n}{a^x_n - 1}, \\[2ex]
0 & \text{if } C^x_n < \dfrac{b^x_n}{a^x_n},
\end{cases} \tag{13}
\]
where $C^x_n = \max_{y \neq x} m(k^y_n)$. We see immediately that it is possible to have $R^{KG,x}_n = 0$ even though $\mathrm{Var}\left(\lambda^x \mid \mathcal{F}_n\right) > 0$. This behavior, which does not arise in the Gaussian setting, is the cause of the inconsistency result. Essentially, if the policy places zero value on alternative $x$, there may be a non-zero probability that the policy will never measure $x$, which means that $m(k^x_n)$ will not converge to the true mean reward. The proof can be found in the Appendix.

Theorem 2.1. There exists a gamma-exponential problem for which (12) has a non-zero probability of never measuring a particular alternative.
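As a quick illustration of the zero-value case in (13), here is a small sketch (ours, not the paper's; the function name and test values are arbitrary) that evaluates the KG factor of one arm following the case structure of (13) and exhibits a state where the factor is exactly zero despite residual uncertainty about $\lambda$.

```python
def kg_factor_gamma_exponential(a, b, C):
    """Knowledge-gradient factor of one gamma-exponential arm, following the
    three cases of (13). a, b: posterior gamma parameters of the arm (a > 1);
    C: best posterior mean among the other arms."""
    mean = b / (a - 1.0)                      # current posterior mean of this arm
    if C < b / a:
        return 0.0                            # measuring the arm cannot change the current maximum
    bulk = (b / a) ** a / ((a - 1.0) * C ** (a - 1.0))
    if C >= mean:                             # some other arm currently looks best
        return bulk
    return bulk - (mean - C)                  # this arm currently looks best

# A state with strictly positive posterior variance of lambda but zero KG value:
print(kg_factor_gamma_exponential(a=3.0, b=10.0, C=2.0))   # -> 0.0, since C < b/a = 10/3
```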

This result provides some insight into the fundamental differences between Gaussian and non-Gaussian learning models. It also adds to the motivation of our study of Gittins indices, a policy that looks over an infinite horizon and thus avoids under-estimating the value of information. We now return to the Gittins policy and develop a new characterization of it for non-Gaussian problems with infinitely divisible rewards.

3 The Gittins index as a stopping boundary

Our analysis is based on the idea of continuous-time interpolation, which has previously only been studied in the setting of Gaussian rewards. The basic idea was first proposed by Brezzi & Lai (2002). For arbitrary $x$ (in the following, we again drop the superscript $x$ for convenience, as in Section 2.2), the discrete-time process $(W_n)_{n=1}^{\infty}$ of single-period rewards with unknown mean $\mu$ and known variance $\sigma^2$ is replaced by a continuous-time process $(X_t)_{t \geq 0}$. The process $(X_t)$ is constructed in such a way that, for integer $t$, the increment $X_{t+1} - X_t$ has the same distribution as the single-period reward $W_{t+1}$. Therefore, $(X_t)$ can be viewed as a probabilistic interpolation of $(W_n)$. In the Gaussian setting, $(X_t)$ is conditionally a Brownian motion with unknown drift $\mu$ and known volatility $\sigma$. The formulation of the Gittins index in (8) is then extended to the continuous-time case and written as the solution to an optimal stopping problem on $(X_t)$. Our approach is inspired by this idea. When the rewards are non-Gaussian and infinitely divisible, we construct a continuous-time process $(X_t)$ such that, for integer $t$, the increment $X_{t+1} - X_t$ has the same distribution as $W_{t+1}$. Since the discrete-time rewards are conditionally i.i.d. given $\lambda$, the interpolation $(X_t)$ is a conditional Lévy process, with increments that are conditionally independent and stationary. For the gamma-exponential problem, $(X_t)$ is conditionally a gamma process given $\lambda$ (see e.g. Cinlar, 2011, for a definition), with exponentially distributed increments over unit intervals. Similarly, in the gamma-Poisson problem, $(X_t)$ is conditionally a Poisson process. Section 3.1 reviews the theory of conditional Lévy processes and presents the continuous-time analog of the Gittins index equation (8) under this interpolation, similar to that used by Brezzi & Lai (2002) and Yao (2006) in studying Gaussian problems. Section 3.2 derives a free-boundary partial integro-differential equation that can be used to solve for the Gittins indices. The Gittins indices obtained from such a continuous interpolation then serve as approximations to the Gittins indices in the original discrete-time problem.

3.1 Continuous-time conditional Lévy interpolation

In this section, we follow the theoretical characterization of conditional Lévy processes introduced in Cinlar (2003). Let $(X_t)$ be a real-valued stochastic process that will later serve, in Section 4, as the continuous-time interpolation of cumulative rewards without discounting. The parameter $\lambda$ is a random variable (or random vector) whose conditional distribution given $\mathcal{F}_t$ characterizes our belief about $\lambda$ at time $t$. Given $\lambda$, the process $(X_t)$ has conditionally stationary and independent increments. In the context of non-Gaussian conjugate models, we further restrict $(X_t)$ to increasing and right-continuous pure jump processes. The method below does apply to general stochastic processes, but this is not as useful for interpolation purposes in Bayesian bandit problems.

The dependence of $X_t$ on $\lambda$ is described as
\[
X_t = X_0 + \int_{[0,t] \times \mathbb{R}_+} z\, \mu(ds, dz), \tag{14}
\]
where $\mu$ is conditionally (given $\lambda$) a random measure on $\mathbb{R}_+ \times \mathbb{R}_+$ with mean measure $\nu(\lambda, dz)\, ds$, satisfying $\int_{\mathbb{R}_+} \nu(\lambda, dz)(z \wedge 1) < \infty$ for all $\lambda$ (for details on random measures and mean measures, see Ch. 6 in Cinlar, 2011). The intensity measure of $\mu$ at time $t$, that is, the intensity given $\mathcal{F}_t$ but not given $\lambda$, is written as $\nu_t(dz)\, ds = E\left[\nu(\lambda, dz) \mid \mathcal{F}_t\right] ds$. Intuitively, this definition states that a conditional Lévy process has two layers of randomness. The conditional mean measure $\nu(\lambda, \cdot)$ characterizes the infinitesimal behavior of the process, given a value of $\lambda$, as that of a Lévy process. The unconditional intensity measure $\nu_t$ further removes the randomness in the belief about $\lambda$ by integrating over its distribution. Thus, $\nu_t$ can be described as the mean of the conditional mean measure. With the reward process $(X_t)$ set up as a continuous-time conditional Lévy process, the Gittins logic can be extended to the continuous-time setting as follows. Let $c > 0$ be a continuous-time discount factor (lower $c$ corresponds to higher $\gamma$ in discrete time). The Gittins index $R$ is the particular value of $r$ such that
\[
r \int_0^{\infty} e^{-cs}\, ds = \sup_\tau E\left[ \int_0^{\tau} e^{-cs}\, dX_s + r \int_{\tau}^{\infty} e^{-cs}\, ds \right], \tag{15}
\]
where $\tau$ denotes a stopping time. This expectation is evaluated given some initial state $k_0$ at time $0$, which is the starting time when the index is calculated. We omit this dependence from the notation and take the starting time to be zero without loss of generality, since the Gittins index only depends on the time through the knowledge state. This formulation is equivalent to the one in (8); see e.g. Katehakis & Veinott (1987) or Yao (2006) for more details. As before, discounted rewards are collected from the process $(X_t)$ until time $\tau$, at which point we collect the fixed retirement reward $r$ until the end of time. If (15) holds, we are indifferent between stopping immediately and running the process until the optimal stopping time.
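To give a concrete feel for the two layers of randomness, the following sketch (our own illustration; the function name, grid, and defaults are arbitrary assumptions) simulates one path of the conditional interpolation for either model: draw $\lambda$ from its gamma prior, then simulate a gamma process (unit shape per unit time) or a Poisson process, and track the posterior mean process $m_t$ alongside.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_interpolation(a0, b0, T, dt=0.01, model="exponential"):
    """One path of the conditional Levy interpolation (X_t) on [0, T], plus the
    posterior mean process m_t. Assumes a Gamma(a0, b0) prior on the unknown rate."""
    lam = rng.gamma(shape=a0, scale=1.0 / b0)        # outer layer: draw the unknown parameter
    t = np.arange(0.0, T + dt, dt)
    if model == "exponential":
        # inner layer: gamma process with shape dt and rate lam per increment,
        # so X_{t+1} - X_t ~ Exponential(lam) over unit intervals
        incr = rng.gamma(shape=dt, scale=1.0 / lam, size=len(t) - 1)
        X = np.concatenate(([0.0], np.cumsum(incr)))
        m = (b0 + X) / (a0 + t - 1.0)                # posterior mean of 1/lam (needs a0 + t > 1)
    else:
        # inner layer: Poisson process with rate lam
        incr = rng.poisson(lam * dt, size=len(t) - 1)
        X = np.concatenate(([0.0], np.cumsum(incr)))
        m = (a0 + X) / (b0 + t)                      # posterior mean of lam
    return t, X, m, lam

t, X, m, lam = simulate_interpolation(a0=2.0, b0=1.0, T=50.0)
print(lam, m[-1])    # with enough data, m_t settles near the true mean 1/lam
```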

3.2 Characterization as a free-boundary problem

In the Gaussian setting, the optimal stopping problem (15) can be solved by performing a time change on the conditional Brownian motion to convert it into a standard Wiener process (Brezzi & Lai, 2002). This standardized problem can be solved by simulation (Brezzi & Lai, 2002; Yao, 2006) or using a free-boundary heat equation (Chick & Gans, 2009; Chick & Frazier, 2012). When the rewards are non-Gaussian, the process can still be embedded in Brownian motion (Monroe, 1978), but the time change is random and computationally intractable. Instead, we apply a new approach based on Peskir & Shiryaev (2006). We manipulate the continuous-time Gittins index equation (15) as follows:
\[
\begin{aligned}
0 &= \sup_\tau E\left[ \int_0^\tau e^{-cs}\, dX_s - \int_0^\tau e^{-cs} r\, ds \right] \\
&= \sup_\tau E\left[ \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z\, \mu(dz, ds) - \int_0^\tau e^{-cs} r\, ds \right] \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu(\lambda, dz) - r \right) ds + \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z \left[ \mu(dz, ds) - \nu(\lambda, dz)\, ds \right] \right] & (16) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu(\lambda, dz) - r \right) ds \right] + E\left[ \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z\, E\left[ \mu(dz, ds) - \nu(\lambda, dz)\, ds \mid \mathcal{F}_s \right] \right] & (17) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, E\left[ \nu(\lambda, dz) \mid \mathcal{F}_s \right] - r \right) ds \right] & (18) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu_s(dz) - r \right) ds \right]. & (19)
\end{aligned}
\]
In (16) we use a compensating technique by adding and subtracting $\nu(\lambda, dz)$. The random measure $\mu$ is then cancelled in (17) by taking iterated conditional expectations (also called the tower property). We then use the tower property again in (18). Notice that $\int_{\mathbb{R}_+} z\, \nu_s(dz)$ is the mean of the infinitesimal increment of the process. For a classic Lévy process, where $\lambda$ is a known constant with no dependency on $\mathcal{F}_s$, the term $\int_{\mathbb{R}_+} z\, \nu(dz)$ is called the compensator of the process (Itô et al., 2004). We denote $\int_{\mathbb{R}_+} z\, \nu_t(dz)$ by $m_t$ to emphasize that this quantity serves the same role as $m(k_n)$ in (1). Then, (15) in continuous time can be written as
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs} (m_s - r)\, ds \right] = 0, \tag{20}
\]
which we will refer to as the calibration equation throughout this paper. We also introduce notation for the left-hand side of (20),
\[
V(t, m) := \sup_\tau E\left[ \int_0^\tau e^{-cs} (m_s - r)\, ds \right]. \tag{21}
\]

Recall that the expectation in (21) is evaluated given some initial state at time $0$. As we observe from models such as (3)-(6), the pair $(t, m)$, representing a time parameter and a mean parameter, is a set of sufficient statistics for the distribution of $\lambda$ given $\mathcal{F}_t$ (in fact, these statistics are sufficient for all the models described in Gittins et al., 2011). Therefore, $V(t, m)$ indicates that our initial knowledge is characterized by $(t, m)$. In this value function, $r$ is a fixed constant, and the Gittins index $R(t, m)$ is the particular value of $r$ that makes $V(t, m) = 0$. On the other hand, if we fix $r$, the set of pairs $(t, m)$ for which $V(t, m) = 0$ is precisely the set of states that have $r$ as their Gittins index. We now construct a free-boundary problem for $V$ by equating the characteristic and infinitesimal operators of $V$. In this approach, we rely on the mild condition that $(m_t)$ is a càdlàg strong Markov process. As we will see from calculating $m_t$ explicitly for the gamma-exponential and gamma-Poisson problems in Section 4, this assumption is reasonable and generally satisfied in the context of Bayesian conjugate models. The characteristic operator of $V$ is defined as
\[
\mathcal{L}_{\mathrm{char}} V(t, m) = \lim_{U \downarrow \{m\}} \frac{E\, V\left(t_{\tau_{U^c}}, m_{\tau_{U^c}}\right) - V(t, m)}{E\left(\tau_{U^c}\right)}, \tag{22}
\]
where $U$ is an open set that contains $m$, and $\tau_{U^c}$ is the hitting time of the set $U^c$ for the process $(m_t)$. That is,
\[
\tau_{U^c} = \inf\{t : m_t \in U^c\}
\]
is the first time at which $(m_t)$ leaves the set $U$. First we consider the value function at the moment when $\tau_{U^c}$ occurs, and then we shrink $U$ down to the singleton $\{m\}$. The concept of the characteristic operator dates back to Dynkin (1965). The value function $V = E\left[\int_0^\tau e^{-cs}(m_s - r)\, ds\right]$ is analogous to a killed version of the Lagrange problem introduced in Ch. 6 of Peskir & Shiryaev (2006), where the value function is of the form $E\left[\int_0^\tau e^{-\Lambda(s)} L(m_s)\, ds\right]$. The characteristic operator in the Lagrange problem can be explicitly calculated, and (22) is given in closed form in the following result.

Lemma 3.1. If $(m_t)$ is a càdlàg strong Markov process, the characteristic operator of $V$ is given by
\[
\mathcal{L}_{\mathrm{char}} V(t, m) = cV(t, m) - (m - r). \tag{23}
\]

Proof: The lemma follows from (7.2.8) in Peskir & Shiryaev (2006). In a killed Lagrange problem with value function $E\left[\int_0^\tau e^{-\Lambda(s)} L(m_s)\, ds\right]$, inserting $\Lambda(s) = cs$ and $L(m_s) = m_s - r$ yields the desired result.

The infinitesimal operator $\mathcal{L}_{\mathrm{inf}}$ (also called the generator of $V$) is derived using Itô's lemma. Our goal is to obtain an expression
\[
V(t, m_t) = V(0, m_0) + \int_0^t \mathcal{L}_{\mathrm{inf}} V(s, m_s)\, ds + Y_t, \tag{24}
\]
where $(Y_t)$ is a martingale formed by adding and subtracting a continuous compensator to the jump component of $V$ (see Itô et al., 2004, for an exposition of this idea). Recall that in discrete-time models such as (3)-(6), the mean process $(m_t)$ is expressed in terms of $t$ and $X_t$ in simple closed form. We now assume that $m_t$ can be written as $g(t, X_t)$ for some continuous function $g$ with first-order derivatives. In Section 4, we will explicitly derive $g$ and show that this assumption is satisfied in the gamma-exponential and gamma-Poisson problems.

Lemma 3.2. If $(m_t)$ can be written in the form $m_t = g(t, X_t)$ for some continuous function $g$ with first-order derivatives, the infinitesimal operator of $V$ is given by
\[
\mathcal{L}_{\mathrm{inf}} V(t, m) = \frac{\partial V}{\partial t}(t, m) + \frac{\partial g}{\partial t}(t, X_t)\, \frac{\partial V}{\partial m}(t, m) + \int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + z)) - V(t, g(t, X_t)) \right] \nu_t(dz). \tag{25}
\]

Proof: The result follows from the calculation
\[
\begin{aligned}
V(t, m_t) &= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, dm^c_s + \sum_{0 < s \leq t} \left[ V(s, m_s) - V(s, m_{s-}) \right] & (26) \\
&= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, \frac{\partial g}{\partial s}\, ds + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \mu(ds, dz) \\
&= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, \frac{\partial g}{\partial s}\, ds + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \nu_s(dz)\, ds \\
&\quad + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \left( \mu(ds, dz) - \nu(\lambda, dz)\, ds + \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \right) & (27) \\
&= V(0, m_0) + \int_0^t \mathcal{L}_{\mathrm{inf}} V(s, m_s)\, ds + Y_t.
\end{aligned}
\]
In (26), we use Itô's lemma for jump-diffusion processes (see Sato, 1999, Ch. 6, Theorem 31.5), and $m^c_s$ denotes the continuous part of $m_s$, after removing all jumps. Since $m_s = g(s, X_s)$ and $X_s$ is a pure jump process, we have $dm^c_s = \frac{\partial g}{\partial s}\, ds$. In (27), we applied a compensator technique and

put the component with respect to the random measure $\mu$ into an $\mathcal{F}_t$-martingale $Y_t$. Note that, for $T \geq t$, the tower property implies that
\[
\begin{aligned}
E\left[Y_T \mid \mathcal{F}_t\right] &= Y_t + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \left( \mu(ds, dz) - \nu(\lambda, dz)\, ds + \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \right) \,\middle|\, \mathcal{F}_t \right] \\
&= Y_t + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] E\left[ \mu(ds, dz) - \nu(\lambda, dz)\, ds \mid \mathcal{F}_s \right] \,\middle|\, \mathcal{F}_t \right] \\
&\quad + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] E\left[ \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \mid \mathcal{F}_s \right] \,\middle|\, \mathcal{F}_t \right] \\
&= Y_t.
\end{aligned}
\]
Therefore, the infinitesimal operator $\mathcal{L}_{\mathrm{inf}} V$ is
\[
\mathcal{L}_{\mathrm{inf}} V(t, m) = \frac{\partial V}{\partial t}(t, m) + \frac{\partial g}{\partial t}(t, X_t)\, \frac{\partial V}{\partial m}(t, m) + \int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + z)) - V(t, m) \right] \nu_t(dz),
\]
which yields the result in the lemma.

Essentially, the characteristic and infinitesimal operators are two different expressions for the derivative of $V$, based on Kolmogorov theory and Itô calculus, respectively. Under general arguments (Peskir & Shiryaev, 2006), the two operators exist and coincide. By matching them, we obtain a free-boundary problem on a PIDE as a consequence of the above derivations.

Theorem 3.1. If $(m_t)$ is a strong Markov càdlàg process and can be written as $g(t, X_t)$ for some continuous function $g$ with first-order derivatives, the value function $V(t, m)$ solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial t}(t, m) + \dfrac{\partial g}{\partial t}(t, X_t)\, \dfrac{\partial V}{\partial m}(t, m) + \displaystyle\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, m) \right] \nu_t(dy) = cV(t, m) - (m - r), \\[1ex]
V(t, m^*(t)) = 0,
\end{cases}
\]
where $m^*(t)$ is an unknown stopping boundary curve. For every point on the stopping boundary, the Gittins index $R(t, m^*(t))$ is equal to $r$.

Proof: Using the characteristic and infinitesimal operators shown in Lemmas 3.1 and 3.2, the theorem follows from Ch. 7.2 in Peskir & Shiryaev (2006).

We have assumed a fixed retirement reward $r$ in this PIDE. Thus, the solution of the PIDE does not immediately yield a Gittins index $R(t, m)$ for an arbitrary knowledge state $(t, m)$. However,

the stopping boundary curve $m^*$ describes the set of all knowledge states whose Gittins index is exactly equal to $r$. In the gamma-exponential problem, it would be sufficient to find the boundary for $r = 1$ and apply (10).

4 Exponential and Poisson reward problems

We now apply the general result of Theorem 3.1 to characterize Gittins indices for exponential and Poisson rewards. Section 4.1 covers the gamma-exponential problem, whereas Section 4.2 covers the gamma-Poisson problem.

4.1 A free-boundary problem for gamma-exponential bandits

In the gamma-exponential problem, our continuous-time interpolation $(X_t)$ is a gamma process with shape parameter 1 and unknown rate parameter $\lambda$. We begin by assuming $\lambda \sim \mathrm{Gamma}(a_0, b_0)$, reflecting the decision-maker's prior beliefs. Letting $\mathcal{F}_t$ be the $\sigma$-algebra generated by the path of $(X_t)$ up to time $t$, we find that the conditional distribution of $\lambda$ given $\mathcal{F}_t$ is still gamma, with posterior parameters
\[
a_t = a_0 + t, \qquad b_t = b_0 + X_t,
\]
as in (3)-(4). For convenience, we may also use the notation $k_t = (a_t, b_t)$. The value function $V(t, m)$ for the gamma-exponential problem can also be written as $V(a, m)$ under the shift of variable $a_t = a_0 + t$.

Theorem 4.1. The value function $V(a, m)$ in the gamma-exponential problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial a}(a, m) - \dfrac{m}{a - 1}\, \dfrac{\partial V}{\partial m}(a, m) + \displaystyle\int_0^{\infty} \left[ V(a, m + z) - V(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz = cV(a, m) - (m - r), \\[1ex]
V(a, m^*(a)) = 0,
\end{cases}
\]
where $m^*(a)$ is an unknown stopping boundary curve. For every point $(a, m)$ on this stopping boundary, the Gittins index $R(a, m)$ is equal to $r$.

Proof: This can be shown through explicit calculation based on the PIDE in Theorem 3.1. In the conditional Lévy process we use to model exponential rewards, the conditional mean measure

given $\lambda$ is $\nu(\lambda, dy) = \frac{e^{-\lambda y}}{y}\, dy$, the same as in a gamma process, and the distribution of $\lambda$ given $\mathcal{F}_t$ is $\mathrm{Gamma}(a_t, b_t)$. Therefore, the unconditional mean measure $\nu_t(dy)$ is calculated as
\[
\nu_t(dy) = \int_0^{\infty} \frac{e^{-\lambda y}}{y}\, \frac{b_t^{a_t} \lambda^{a_t - 1} e^{-b_t \lambda}}{\Gamma(a_t)}\, d\lambda\, dy = \left( \frac{b_t}{b_t + y} \right)^{a_t} \frac{1}{y}\, dy,
\]
and
\[
m_t = \int_0^{\infty} y\, \nu_t(dy) = \int_0^{\infty} \left( \frac{b_t}{b_t + y} \right)^{a_t} dy = \frac{b_t}{a_t - 1}.
\]
Therefore, $m_t = g(t, X_t) = \frac{b_0 + X_t}{a_0 + t - 1}$ and $\frac{\partial g}{\partial t} = -\frac{b_0 + X_t}{(a_0 + t - 1)^2} = -\frac{m_t}{a_t - 1}$. Also,
\[
\begin{aligned}
\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, g(t, X_t)) \right] \nu_t(dy)
&= \int_{\mathbb{R}_+} \left[ V\!\left(t, \frac{b_t + y}{a_t - 1}\right) - V\!\left(t, \frac{b_t}{a_t - 1}\right) \right] \left( \frac{b_t}{b_t + y} \right)^{a_t} \frac{1}{y}\, dy \\
&= \int_{\mathbb{R}_+} \left[ V(t, m_t + z) - V(t, m_t) \right] \left( \frac{m_t}{m_t + z} \right)^{a_t} \frac{1}{z}\, dz,
\end{aligned}
\]
where the last equality is obtained by using the change of variable $z = \frac{y}{a_t - 1}$.
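As a small numerical sanity check of the mixture step above (our own illustration; the function name and parameter values are arbitrary), the unconditional density $(b_t/(b_t+y))^{a_t}/y$ can be compared against a Monte Carlo average of the conditional Lévy density $e^{-\lambda y}/y$ over $\lambda \sim \mathrm{Gamma}(a_t, b_t)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def check_unconditional_measure(a_t, b_t, y, n_samples=200_000):
    """Compare the Monte Carlo average of the conditional Levy density e^{-lam*y}/y,
    with lam ~ Gamma(a_t, b_t), to the closed form (b_t/(b_t+y))^{a_t} / y."""
    lam = rng.gamma(shape=a_t, scale=1.0 / b_t, size=n_samples)
    monte_carlo = np.mean(np.exp(-lam * y) / y)
    closed_form = (b_t / (b_t + y)) ** a_t / y
    return monte_carlo, closed_form

print(check_unconditional_measure(a_t=3.0, b_t=2.0, y=1.5))
# the two values should agree up to Monte Carlo error
```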

In the exponential reward problem, as well as the Poisson reward problem to be discussed in Section 4.2, we use integration by parts to simplify the free-boundary partial integro-differential equation from Theorem 3.1. First, we rewrite the value function as
\[
V(a, m) = \sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right] = \sup_\tau \frac{1}{c}\, E\left[ (m_0 - r) - e^{-c\tau}(m_\tau - r) + \int_0^\tau e^{-cs}\, \frac{d}{ds} m_s\, ds \right].
\]
Observe that
\[
\frac{d}{ds} m_s = \frac{d}{ds}\, \frac{b_0 + X_s}{a_0 + s - 1} = -\frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds + \frac{1}{a_0 + s - 1}\, dX_s.
\]
We take the expectation of this quantity, whence
\[
\begin{aligned}
E\left[ \int_0^\tau e^{-cs}\, \frac{d}{ds} m(a_s, b_s) \right]
&= E\left[ \int_0^\tau -e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] + E\left[ \int_0^\tau e^{-cs}\, \frac{1}{a_0 + s - 1}\, E\left( dX_s \mid \mathcal{F}_s \right) \right] \\
&= E\left[ \int_0^\tau -e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] + E\left[ \int_0^\tau e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] \\
&= 0.
\end{aligned}
\]
Consequently, (20) can be rewritten as
\[
\sup_\tau \frac{1}{c}\, E\left[ e^{-c\tau}(r - m_\tau) + m_0 - r \right] = 0.
\]
We define a new value function $G(a, m) := \sup_\tau E\left[ e^{-c\tau}(r - m_\tau) \right] = cV(a, m) - m + r$ for fixed $r$, and plug it into Theorem 4.1 to obtain the following equivalent free-boundary problem.

Proposition 4.1. The value function $G(a, m)$ in the gamma-exponential problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial G}{\partial a}(a, m) - \dfrac{m}{a - 1}\, \dfrac{\partial G}{\partial m}(a, m) + \displaystyle\int_0^{\infty} \left[ G(a, m + z) - G(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz = cG(a, m), \\[1ex]
G(a, m^*(a)) = r - m^*(a),
\end{cases}
\]
where $m^*(a)$ is an unknown stopping boundary curve. For every point $(a, m)$ on this stopping boundary, the Gittins index $R(a, m)$ is equal to $r$.

Proof: By substituting $V(a, m) = \frac{1}{c}\left[ G(a, m) + m - r \right]$ into Theorem 4.1, we get
\[
\frac{\partial V}{\partial a}(a, m) = \frac{1}{c}\, \frac{\partial G}{\partial a}(a, m), \qquad
-\frac{m}{a - 1}\, \frac{\partial V}{\partial m}(a, m) = -\frac{1}{c}\, \frac{m}{a - 1} \left[ \frac{\partial G}{\partial m}(a, m) + 1 \right],
\]
and
\[
\int_0^{\infty} \left[ V(a, m + z) - V(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz
= \frac{1}{c} \int_0^{\infty} \left[ G(a, m + z) - G(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz + \frac{1}{c}\, \frac{m}{a - 1},
\]
while $cV(a, m) - (m - r) = G(a, m)$ and, on the stopping boundary, $\frac{1}{c}\left[ G(a, m) + m - r \right] = 0$.

The formulation in Proposition 4.1 is equivalent to the more intuitive one in Theorem 4.1, where $\mathcal{L}_{\mathrm{inf}} V = \mathcal{L}_{\mathrm{char}} V$ and the value function equals zero on the stopping boundary. We will use it to prove structural properties in Section 5 and for numerical convenience in Section 6.

4.2 A free-boundary problem for gamma-Poisson bandits

In the gamma-Poisson problem, the continuous-time interpolation $(X_t)$ is a Poisson process with unknown rate $\lambda$. Again, we assume $\lambda \sim \mathrm{Gamma}(a_0, b_0)$, let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the path of $(X_t)$ up to time $t$, and update the posterior parameters using
\[
a_t = a_0 + X_t, \qquad b_t = b_0 + t,
\]
as in (5)-(6). As in Section 4.1, we obtain the following free-boundary PIDE by calculating $m_t$ explicitly.

Theorem 4.2. The value function $V(b, m)$ in the gamma-Poisson problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial b}(b, m) - \dfrac{m}{b}\, \dfrac{\partial V}{\partial m}(b, m) + \left[ V\!\left(b, m + \dfrac{1}{b}\right) - V(b, m) \right] m = cV(b, m) - (m - r), \\[1ex]
V(b, m^*(b)) = 0,
\end{cases}
\]
where $m^*(b)$ is an unknown stopping boundary curve. For every point $(b, m)$ on this stopping boundary, the Gittins index $R(b, m)$ is equal to $r$.

Proof: It suffices to use explicit calculation based on the PIDE in Theorem 3.1. In the conditional Lévy process we use to model Poisson rewards, the conditional mean measure given $\lambda$ is identical to that of a Poisson process, $\nu(\lambda, dy) = \lambda\, \delta_1(dy)$, where $\delta_1$ is the Dirac measure at $1$, and the distribution of $\lambda$ given $\mathcal{F}_t$ is $\mathrm{Gamma}(a_t, b_t)$. Therefore, the mean measure intensity $\nu_t(dy)$ is calculated as
\[
\nu_t(dy) = \int_0^{\infty} \lambda\, \frac{b_t^{a_t} \lambda^{a_t - 1} e^{-b_t \lambda}}{\Gamma(a_t)}\, d\lambda\, \delta_1(dy) = \frac{a_t}{b_t}\, \delta_1(dy),
\]
and
\[
m_t = \int_0^{\infty} y\, \nu_t(dy) = \frac{a_t}{b_t} = \frac{a_0 + X_t}{b_0 + t}.
\]

Therefore, $m_t = g(t, X_t) = \frac{a_0 + X_t}{b_0 + t}$ and $\frac{\partial g}{\partial t} = -\frac{a_0 + X_t}{(b_0 + t)^2} = -\frac{m_t}{b_t}$. Also,
\[
\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, g(t, X_t)) \right] \nu_t(dy)
= \int_{\mathbb{R}_+} \left[ V\!\left(t, \frac{a_t + y}{b_t}\right) - V\!\left(t, \frac{a_t}{b_t}\right) \right] \frac{a_t}{b_t}\, \delta_1(dy)
= \left[ V\!\left(t, m_t + \frac{1}{b_t}\right) - V(t, m_t) \right] m_t,
\]
as required.

Again, we use integration by parts to simplify the value function, which gives
\[
\sup_\tau \frac{1}{c}\, E\left[ e^{-c\tau}(r - m_\tau) + m_0 - r \right] = 0.
\]
By defining $G(b, m) := \sup_\tau E\left[ e^{-c\tau}(r - m_\tau) \right] = cV(b, m) - m + r$ for fixed $r$ and replacing $V$ in Theorem 4.2, we obtain the equivalent free-boundary problem for the gamma-Poisson model.

Proposition 4.2. The value function $G(b, m)$ in the gamma-Poisson problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial G}{\partial b}(b, m) - \dfrac{m}{b}\, \dfrac{\partial G}{\partial m}(b, m) + \left[ G\!\left(b, m + \dfrac{1}{b}\right) - G(b, m) \right] m = cG(b, m), \\[1ex]
G(b, m^*(b)) = r - m^*(b),
\end{cases}
\]
where $m^*(b)$ is an unknown stopping boundary curve. For every point $(b, m)$ on this stopping boundary, the Gittins index $R(b, m)$ is equal to $r$. The proof is the same as that of Proposition 4.1, and we omit it here.

Unfortunately, the scaling properties of the Gittins index are not as straightforward in the gamma-Poisson problem as they are in the gamma-exponential problem. However, in the following section, we derive a new scaling property for the gamma-Poisson problem.

5 Theoretical analysis

In this section, we provide theoretical results on the structure of the Gittins index for non-Gaussian problems in continuous time. We focus on two major aspects. Section 5.1 considers scaling properties, most notably for the gamma-Poisson problem. Section 5.2 investigates the continuity and monotonicity of the Gittins index and value function, and concludes the theoretical analysis

with an asymptotic convergence result. These theoretical properties provide more insight into how the Gittins index, an indifference indexing rule, prices the value of a state. Also, these properties match the discrete-time results shown in Gittins et al. (2011) and Yu (2011), which argues in favor of our continuous-time framework as a generalization of the discrete-time model. Throughout this section, we abuse notation slightly by writing the value functions $V$ and $G$, as well as the Gittins index $R$, as functions of $(t, m)$, $(a, m)$, or $(b, m)$, as is convenient. Most results apply to both the gamma-exponential and gamma-Poisson problems, and therefore we use $(t, m)$ where possible. We specifically use $(a, m)$ or $(b, m)$ in proofs when needed. When we hold a parameter constant, we omit it in the argument; e.g., $V(m)$ denotes the value function $V(t, m)$ while we hold $t$ constant. When needed, we also use the subscript $r$ for value functions, e.g. $V_r(t, m)$, to denote that they are calculated given that fixed value of $r$. This notation facilitates writing our proofs in this section.

5.1 Distributional and scaling properties

We begin with two computational results on the predictive distributions appearing in the gamma-exponential and gamma-Poisson problems. These results are used in the proofs of some structural properties in this section, and later on help us to create initial conditions for PIDE solution procedures. The proofs can be found in the Appendix.

Lemma 5.1. In the gamma-exponential model, the predictive distribution of $\frac{X_t}{b_0}$, given $\mathcal{F}_0$, is the beta-prime distribution with parameters $t$ and $a_0$.

Lemma 5.2. In the gamma-Poisson model, the predictive distribution of $X_t$, given $\mathcal{F}_0$, is the generalized negative binomial distribution with parameters $a_0$ and $\frac{t}{b_0 + t}$.
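Both predictive distributions are easy to confirm by simulation. The following sketch is our own illustration (parameter values are arbitrary; scipy's parameterization conventions are noted in the comments): draw $X_t$ by first sampling $\lambda$ from its prior, then compare with the stated beta-prime and negative binomial laws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def predictive_samples(a0, b0, t, model, n=200_000):
    """Sample X_t given F_0 by drawing lambda ~ Gamma(a0, b0) first."""
    lam = rng.gamma(shape=a0, scale=1.0 / b0, size=n)
    if model == "exponential":
        # X_t | lambda ~ Gamma(shape=t, rate=lambda); Lemma 5.1: X_t / b0 ~ BetaPrime(t, a0)
        return rng.gamma(shape=t, scale=1.0 / lam) / b0
    # X_t | lambda ~ Poisson(lambda * t); Lemma 5.2: negative binomial (see parameterization below)
    return rng.poisson(lam * t)

a0, b0, t = 2.5, 1.7, 3.0

# gamma-exponential: empirical quartiles of X_t / b0 should map to 0.25 / 0.5 / 0.75 under BetaPrime(t, a0)
x = predictive_samples(a0, b0, t, "exponential")
q = np.quantile(x, [0.25, 0.5, 0.75])
print(stats.betaprime.cdf(q, t, a0))

# gamma-Poisson: empirical pmf vs. scipy's nbinom(a0, b0/(b0+t)), which is the same law as the
# "generalized negative binomial with parameters a0 and t/(b0+t)" in Lemma 5.2
y = predictive_samples(a0, b0, t, "poisson")
k = np.arange(6)
print(np.array([(y == i).mean() for i in k]))
print(stats.nbinom.pmf(k, a0, b0 / (b0 + t)))
```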

Next, we establish scaling properties of the Gittins index for both non-Gaussian problems. Theorem 5.1 extends the result of (9) to the continuous-time setting, where the Gittins index is defined to be the value of $r$ that solves (15); we include this proof for completeness. We then derive a new scaling property for the gamma-Poisson problem.

Theorem 5.1. In the gamma-exponential problem, the Gittins index satisfies $R(a, b) = b\, R(a, 1)$.

Proof: We factor $b_0$ out of the calibration equation (20) of the Gittins index to obtain
\[
\begin{aligned}
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{b_0 + X_s}{a_0 + s - 1} - r \right) ds \right] \\
&= b_0\, \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{1 + X_s / b_0}{a_0 + s - 1} - \frac{r}{b_0} \right) ds \right] = 0. & (28)
\end{aligned}
\]
The factor $b_0$ in front of (28) can be dropped, since (28) equals zero. By applying the scaling properties of the gamma process and the gamma distribution, we see that the process $(X_t / b_0)$ has the same law as a conditional gamma process with the prior $\lambda \sim \mathrm{Gamma}(a_0, 1)$. Then, if $R$ balances (20), it follows that the index $R / b_0$ balances the calibration equation for a gamma-exponential problem starting from the knowledge state $(a_0, 1)$. Thus, $R(a, b) = b\, R(a, 1)$, as required.

For the gamma-Poisson problem, we also emphasize the dependence of $R$ on the discount factor $c$, as this plays a role in the scaling property. To our knowledge, Theorem 5.2 is the first known scaling result for problems with Poisson rewards.

Theorem 5.2. In the gamma-Poisson problem, the Gittins index satisfies the scaling property
\[
R(b, m, c) = \frac{1}{\sigma}\, R\!\left( \frac{b}{\sigma},\, \sigma m,\, \sigma c \right)
\]
for all $\sigma > 0$.

Proof: We consider the calibration equation for the gamma-Poisson problem and write
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{a_0 + X_s}{b_0 + s} - r \right) ds \right]
= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{a_0 + X_s}{b_0/\sigma + s/\sigma} - r\sigma \right) \frac{1}{\sigma}\, ds \right]. \tag{29}
\]
Letting $t = s/\sigma$ and $Y_t = X_{\sigma t}$, we rewrite (29) as
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
= \sup_\tau E\left[ \int_0^{\tau/\sigma} e^{-c\sigma t} \left( \frac{a_0 + X_{\sigma t}}{b_0/\sigma + t} - r\sigma \right) dt \right]
= \sup_\tau E\left[ \int_0^{\tau/\sigma} e^{-c\sigma t} \left( \frac{a_0 + Y_t}{b_0/\sigma + t} - r\sigma \right) dt \right].
\]
Observe that if $\tau$ is a stopping time for $X_t$, then $\tau/\sigma$ is a stopping time for $Y_t$, where $Y_t$ is a conditional Poisson process with rate $\sigma\lambda$, which is equivalent to a conditional Poisson process

with the prior $\lambda \sim \mathrm{Gamma}(a_0, b_0/\sigma)$. This suggests a comparison with the calibration equation under discount factor $c\sigma$ and prior $\lambda \sim \mathrm{Gamma}(a_0, b_0/\sigma)$, which yields the desired scaling property $R(b, m, c) = \frac{1}{\sigma} R\!\left(\frac{b}{\sigma}, \sigma m, \sigma c\right)$.

Corollary 5.1. From Theorem 5.2, it follows that $R(b, m, c) = \frac{1}{b}\, R(1, mb, bc) = c\, R\!\left(bc, \frac{m}{c}, 1\right)$.

Thus, we can scale either $b$ or $c$ to $1$, but the other parameter will also be changed. For the gamma-exponential problem, any Gittins index can be obtained by computing a family of stopping boundaries corresponding to $r = 1$ for each value of $c$. In the gamma-Poisson problem, we can standardize the discount factor, but it is necessary to construct a family of curves indexed by $b$ and $m$. Since the value of $c$ is fixed throughout a given bandit problem, while the values of $a$ and $b$ change in each time step, the gamma-exponential problem will be less computationally intensive.

5.2 Continuity and monotonicity

In this section, we study various continuity and monotonicity properties of the value functions $V_r(t, m)$, $G_r(t, m)$, and the Gittins index $R(t, m)$. The continuity of the Gittins index for discrete-time problems has been studied, e.g., by Aalto et al. (2011). Such proofs typically use induction arguments that do not apply in continuous time. We are interested in continuity and monotonicity both to show that our continuous-time framework retains the fundamental structure of discrete-time bandit problems, and also to provide intuition on the Gittins index as an indifference pricing rule. We demonstrate that the index policy assigns higher value to states with higher belief mean $m$, which we describe as higher exploitation value (immediate payoff that can be exploited), and to states with more uncertainty under smaller $t$, which we describe as higher exploration value (potential gain from exploration). Furthermore, even though the non-Gaussian problems we study involve processes with jumps, our continuity results show that these jumps are smoothed out by the optimal retirement strategy. At the end of this section, we prove in Theorem 5.6 that, when $t$ goes to infinity, i.e. we take infinitely many measurements, the exploration value vanishes and the Gittins index approaches the true mean value almost surely. Consequently, the stopping boundary curves from Section 4 are continuous and increase monotonically to the limit $r$. This structure may be useful for initializing PIDE solution procedures, although this issue is outside the scope of the present paper.

First, we recall the predictive distributions of $X_t$ given in Lemmas 5.1 and 5.2. In computing $V_r(t, m)$, $G_r(t, m)$, and $R(t, m)$, only the information in the current knowledge state $(t, m)$ is used, so when we write $m_t$ we are implying the predictive distribution of $(m_t)$ conditioned on $\mathcal{F}_0$. We adopt this viewpoint throughout the following analysis. Starting with the next result, we will repeatedly compare two arbitrary prior knowledge states. Let $(m_t)$ denote the mean process starting with the prior parameters $(t_0, m_0)$, and let $(m'_t)$ denote the process starting with $(t'_0, m'_0)$. Our proofs in this section are heavily based on stochastic dominance theory (Müller & Stoyan, 2002; Shaked & Shanthikumar, 2007), also used by Yu (2011) to establish discrete-time results. We follow the notation used in Müller & Stoyan (2002) and use the usual stochastic order $\leq_{st}$, the convex order $\leq_{cx}$, and the increasing convex order $\leq_{icx}$. For random variables $X$ and $Y$, $X \leq_{st} Y$ if $f_X(c)/f_Y(c)$ is decreasing in $c$. Also, $X \leq_{cx} Y$ (respectively, $X \leq_{icx} Y$) if $E\phi(X) \leq E\phi(Y)$ for all convex (respectively, convex and increasing) functions $\phi$. Useful properties that we will use include the equivalent definition $X \leq_{st} Y$ if $F_X \geq F_Y$ (1.2.1 in Müller & Stoyan, 2002), the implication from $\leq_{icx}$ to $\leq_{cx}$ when $EX = EY$, and the coupling techniques that will be restated as Lemmas 5.4 and 5.5 in this section.

Lemma 5.3. In the gamma-exponential and gamma-Poisson problems, the following two stochastic order properties hold for the predictive mean processes, for every $t$:
\[
m_t \leq_{st} m'_t \quad \text{if } m_0 \leq m'_0 \text{ and } t_0 = t'_0, \tag{30}
\]
\[
m_t \leq_{cx} m'_t \quad \text{if } m_0 = m'_0 \text{ and } t_0 \geq t'_0. \tag{31}
\]

Proof: We prove (30) first. It suffices to show that, when $m_0 \leq m'_0$ and $t_0 = t'_0$, we have $F_{m_t} \geq F_{m'_t}$. For the gamma-exponential problem, $t_0 = t'_0$ implies $a_0 = a'_0$, and we denote this common value by $a_0$. By Lemma 5.1 we have
\[
P(m_t \leq m) = P\left( \frac{b_0 + X_t}{a_0 + t - 1} \leq m \,\middle|\, a_0, m_0 \right) = F\!\left( \frac{m(a_0 + t - 1)}{m_0(a_0 - 1)} - 1 \right)
\]
and
\[
P(m'_t \leq m) = P\left( \frac{b'_0 + X_t}{a_0 + t - 1} \leq m \,\middle|\, a_0, m'_0 \right) = F\!\left( \frac{m(a_0 + t - 1)}{m'_0(a_0 - 1)} - 1 \right),
\]
where $F$ is the cdf of the beta-prime distribution with parameters $t$ and $a_0$. When $m_0 \leq m'_0$,
\[
\frac{m(a_0 + t - 1)}{m_0(a_0 - 1)} - 1 \geq \frac{m(a_0 + t - 1)}{m'_0(a_0 - 1)} - 1,
\]

and therefore $P(m_t \leq m) \geq P(m'_t \leq m)$, i.e., $F_{m_t} \geq F_{m'_t}$. For the gamma-Poisson problem, $t_0 = t'_0$ implies $b_0 = b'_0$, and we denote this common value by $b_0$. By Lemma 5.2 we have
\[
P(m_t \leq m) = P\left( \frac{a_0 + X_t}{b_0 + t} \leq m \,\middle|\, b_0, m_0 \right) = F\big(m(b_0 + t) - m_0 b_0\big)
\]
and
\[
P(m'_t \leq m) = P\left( \frac{a'_0 + X_t}{b_0 + t} \leq m \,\middle|\, b_0, m'_0 \right) = F'\big(m(b_0 + t) - m'_0 b_0\big),
\]
where $F$ is the cdf of the generalized negative binomial (GNB) distribution with parameters $m_0 b_0$ and $\frac{t}{b_0 + t}$, and $F'$ is the cdf of the GNB distribution with parameters $m'_0 b_0$ and $\frac{t}{b_0 + t}$. When $m_0 \leq m'_0$,
\[
F\big(m(b_0 + t) - m_0 b_0\big) \geq F\big(m(b_0 + t) - m'_0 b_0\big) \geq F'\big(m(b_0 + t) - m'_0 b_0\big),
\]
whence $P(m_t \leq m) \geq P(m'_t \leq m)$, as required.

Secondly, we prove (31). We first consider the gamma-exponential case; the gamma-Poisson version can be shown in exactly the same way. In the gamma-exponential case, (31) assumes that $m_0 = \frac{b_0}{a_0 - 1} = \frac{b'_0}{a'_0 - 1} = m'_0$, which we denote by $m_0$, and $a_0 \geq a'_0$. We prove the convex dominance by showing
\[
m_t = \left( \frac{b_0 + X_t}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
\leq_{cx} \left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right) \tag{32}
\]
\[
\leq_{cx} \left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a'_0, b'_0) \right) = m'_t. \tag{33}
\]
We observe that
\[
\left( \frac{b_0 + X_t}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= \left( \frac{b_0 + m_0 t + (X_t - m_0 t)}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= m_0 + \frac{1}{a_0 + t - 1} \big( X_t - m_0 t \mid \lambda \sim \mathrm{Gamma}(a_0, b_0) \big)
\]
and similarly
\[
\left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= m_0 + \frac{1}{a'_0 + t - 1} \big( X_t - m_0 t \mid \lambda \sim \mathrm{Gamma}(a_0, b_0) \big)
\]


More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Stochastic Differential Equations.

Stochastic Differential Equations. Chapter 3 Stochastic Differential Equations. 3.1 Existence and Uniqueness. One of the ways of constructing a Diffusion process is to solve the stochastic differential equation dx(t) = σ(t, x(t)) dβ(t)

More information

Infinitely divisible distributions and the Lévy-Khintchine formula

Infinitely divisible distributions and the Lévy-Khintchine formula Infinitely divisible distributions and the Cornell University May 1, 2015 Some definitions Let X be a real-valued random variable with law µ X. Recall that X is said to be infinitely divisible if for every

More information

An essay on the general theory of stochastic processes

An essay on the general theory of stochastic processes Probability Surveys Vol. 3 (26) 345 412 ISSN: 1549-5787 DOI: 1.1214/1549578614 An essay on the general theory of stochastic processes Ashkan Nikeghbali ETHZ Departement Mathematik, Rämistrasse 11, HG G16

More information

Common Knowledge and Sequential Team Problems

Common Knowledge and Sequential Team Problems Common Knowledge and Sequential Team Problems Authors: Ashutosh Nayyar and Demosthenis Teneketzis Computer Engineering Technical Report Number CENG-2018-02 Ming Hsieh Department of Electrical Engineering

More information

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Gabriel Y. Weintraub, Lanier Benkard, and Benjamin Van Roy Stanford University {gweintra,lanierb,bvr}@stanford.edu Abstract

More information

On Reflecting Brownian Motion with Drift

On Reflecting Brownian Motion with Drift Proc. Symp. Stoch. Syst. Osaka, 25), ISCIE Kyoto, 26, 1-5) On Reflecting Brownian Motion with Drift Goran Peskir This version: 12 June 26 First version: 1 September 25 Research Report No. 3, 25, Probability

More information

INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS

INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS Applied Probability Trust (4 February 2008) INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS SAVAS DAYANIK, Princeton University WARREN POWELL, Princeton University KAZUTOSHI

More information

Multi-armed Bandits and the Gittins Index

Multi-armed Bandits and the Gittins Index Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6 Two-armed Bandit 3, 10, 4, 9, 12, 1,... 5, 6, 2, 15, 2, 7,... Two-armed

More information

An Uncertain Control Model with Application to. Production-Inventory System

An Uncertain Control Model with Application to. Production-Inventory System An Uncertain Control Model with Application to Production-Inventory System Kai Yao 1, Zhongfeng Qin 2 1 Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China 2 School of Economics

More information

Lecture Notes - Dynamic Moral Hazard

Lecture Notes - Dynamic Moral Hazard Lecture Notes - Dynamic Moral Hazard Simon Board and Moritz Meyer-ter-Vehn October 27, 2011 1 Marginal Cost of Providing Utility is Martingale (Rogerson 85) 1.1 Setup Two periods, no discounting Actions

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

Ernesto Mordecki 1. Lecture III. PASI - Guanajuato - June 2010

Ernesto Mordecki 1. Lecture III. PASI - Guanajuato - June 2010 Optimal stopping for Hunt and Lévy processes Ernesto Mordecki 1 Lecture III. PASI - Guanajuato - June 2010 1Joint work with Paavo Salminen (Åbo, Finland) 1 Plan of the talk 1. Motivation: from Finance

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

A new approach for investment performance measurement. 3rd WCMF, Santa Barbara November 2009

A new approach for investment performance measurement. 3rd WCMF, Santa Barbara November 2009 A new approach for investment performance measurement 3rd WCMF, Santa Barbara November 2009 Thaleia Zariphopoulou University of Oxford, Oxford-Man Institute and The University of Texas at Austin 1 Performance

More information

A Structured Multiarmed Bandit Problem and the Greedy Policy

A Structured Multiarmed Bandit Problem and the Greedy Policy A Structured Multiarmed Bandit Problem and the Greedy Policy Adam J. Mersereau Kenan-Flagler Business School, University of North Carolina ajm@unc.edu Paat Rusmevichientong School of Operations Research

More information

On the robustness of a one-period look-ahead policy in multi-armed bandit problems

On the robustness of a one-period look-ahead policy in multi-armed bandit problems Procedia Computer Science 00 (200) (202) 0 635 644 International Conference on Computational Science, ICCS 200 Procedia Computer Science www.elsevier.com/locate/procedia On the robustness of a one-period

More information

arxiv: v7 [cs.lg] 7 Jul 2017

arxiv: v7 [cs.lg] 7 Jul 2017 Learning to Optimize Via Information-Directed Sampling Daniel Russo 1 and Benjamin Van Roy 2 1 Northwestern University, daniel.russo@kellogg.northwestern.edu 2 Stanford University, bvr@stanford.edu arxiv:1403.5556v7

More information

OPTIMAL STOPPING OF A BROWNIAN BRIDGE

OPTIMAL STOPPING OF A BROWNIAN BRIDGE OPTIMAL STOPPING OF A BROWNIAN BRIDGE ERIK EKSTRÖM AND HENRIK WANNTORP Abstract. We study several optimal stopping problems in which the gains process is a Brownian bridge or a functional of a Brownian

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Bayesian exploration for approximate dynamic programming

Bayesian exploration for approximate dynamic programming Bayesian exploration for approximate dynamic programming Ilya O. Ryzhov Martijn R.K. Mes Warren B. Powell Gerald A. van den Berg December 18, 2017 Abstract Approximate dynamic programming (ADP) is a general

More information

A Class of Fractional Stochastic Differential Equations

A Class of Fractional Stochastic Differential Equations Vietnam Journal of Mathematics 36:38) 71 79 Vietnam Journal of MATHEMATICS VAST 8 A Class of Fractional Stochastic Differential Equations Nguyen Tien Dung Department of Mathematics, Vietnam National University,

More information

Solving the Poisson Disorder Problem

Solving the Poisson Disorder Problem Advances in Finance and Stochastics: Essays in Honour of Dieter Sondermann, Springer-Verlag, 22, (295-32) Research Report No. 49, 2, Dept. Theoret. Statist. Aarhus Solving the Poisson Disorder Problem

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Jump-type Levy Processes

Jump-type Levy Processes Jump-type Levy Processes Ernst Eberlein Handbook of Financial Time Series Outline Table of contents Probabilistic Structure of Levy Processes Levy process Levy-Ito decomposition Jump part Probabilistic

More information

Numerical Methods with Lévy Processes

Numerical Methods with Lévy Processes Numerical Methods with Lévy Processes 1 Objective: i) Find models of asset returns, etc ii) Get numbers out of them. Why? VaR and risk management Valuing and hedging derivatives Why not? Usual assumption:

More information

ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES. 1. Introduction

ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES. 1. Introduction ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES ALEKSANDAR MIJATOVIĆ AND MARTIJN PISTORIUS Abstract. In this note we generalise the Phillips theorem [1] on the subordination of Feller processes by Lévy subordinators

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time

Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time David Laibson 9/30/2014 Outline Lectures 9-10: 9.1 Continuous-time Bellman Equation 9.2 Application: Merton s Problem 9.3 Application:

More information

Bayesian quickest detection problems for some diffusion processes

Bayesian quickest detection problems for some diffusion processes Bayesian quickest detection problems for some diffusion processes Pavel V. Gapeev Albert N. Shiryaev We study the Bayesian problems of detecting a change in the drift rate of an observable diffusion process

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago January 2016 Consider a situation where one person, call him Sender,

More information

Bayesian Congestion Control over a Markovian Network Bandwidth Process

Bayesian Congestion Control over a Markovian Network Bandwidth Process Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard 1/30 Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard (USC) Joint work

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Coupled Bisection for Root Ordering

Coupled Bisection for Root Ordering Coupled Bisection for Root Ordering Stephen N. Pallone, Peter I. Frazier, Shane G. Henderson School of Operations Research and Information Engineering Cornell University, Ithaca, NY 14853 Abstract We consider

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

The strictly 1/2-stable example

The strictly 1/2-stable example The strictly 1/2-stable example 1 Direct approach: building a Lévy pure jump process on R Bert Fristedt provided key mathematical facts for this example. A pure jump Lévy process X is a Lévy process such

More information

Week 9 Generators, duality, change of measure

Week 9 Generators, duality, change of measure Week 9 Generators, duality, change of measure Jonathan Goodman November 18, 013 1 Generators This section describes a common abstract way to describe many of the differential equations related to Markov

More information

n E(X t T n = lim X s Tn = X s

n E(X t T n = lim X s Tn = X s Stochastic Calculus Example sheet - Lent 15 Michael Tehranchi Problem 1. Let X be a local martingale. Prove that X is a uniformly integrable martingale if and only X is of class D. Solution 1. If If direction:

More information

Some Aspects of Universal Portfolio

Some Aspects of Universal Portfolio 1 Some Aspects of Universal Portfolio Tomoyuki Ichiba (UC Santa Barbara) joint work with Marcel Brod (ETH Zurich) Conference on Stochastic Asymptotics & Applications Sixth Western Conference on Mathematical

More information

Lecture 21 Representations of Martingales

Lecture 21 Representations of Martingales Lecture 21: Representations of Martingales 1 of 11 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 21 Representations of Martingales Right-continuous inverses Let

More information

Topics in fractional Brownian motion

Topics in fractional Brownian motion Topics in fractional Brownian motion Esko Valkeila Spring School, Jena 25.3. 2011 We plan to discuss the following items during these lectures: Fractional Brownian motion and its properties. Topics in

More information

FE610 Stochastic Calculus for Financial Engineers. Stevens Institute of Technology

FE610 Stochastic Calculus for Financial Engineers. Stevens Institute of Technology FE610 Stochastic Calculus for Financial Engineers Lecture 3. Calculaus in Deterministic and Stochastic Environments Steve Yang Stevens Institute of Technology 01/31/2012 Outline 1 Modeling Random Behavior

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision

More information

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Parisa Mansourifard 1/37 Bayesian Congestion Control over a Markovian Network Bandwidth Process:

More information

The concentration of a drug in blood. Exponential decay. Different realizations. Exponential decay with noise. dc(t) dt.

The concentration of a drug in blood. Exponential decay. Different realizations. Exponential decay with noise. dc(t) dt. The concentration of a drug in blood Exponential decay C12 concentration 2 4 6 8 1 C12 concentration 2 4 6 8 1 dc(t) dt = µc(t) C(t) = C()e µt 2 4 6 8 1 12 time in minutes 2 4 6 8 1 12 time in minutes

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Lecture Notes - Dynamic Moral Hazard

Lecture Notes - Dynamic Moral Hazard Lecture Notes - Dynamic Moral Hazard Simon Board and Moritz Meyer-ter-Vehn October 23, 2012 1 Dynamic Moral Hazard E ects Consumption smoothing Statistical inference More strategies Renegotiation Non-separable

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Adaptive Multilevel Splitting Algorithm

A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Adaptive Multilevel Splitting Algorithm A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Algorithm F. Cérou 1,2 B. Delyon 2 A. Guyader 3 M. Rousset 1,2 1 Inria Rennes Bretagne Atlantique 2 IRMAR, Université de Rennes

More information

Lecture 17 Brownian motion as a Markov process

Lecture 17 Brownian motion as a Markov process Lecture 17: Brownian motion as a Markov process 1 of 14 Course: Theory of Probability II Term: Spring 2015 Instructor: Gordan Zitkovic Lecture 17 Brownian motion as a Markov process Brownian motion is

More information

Erdős-Renyi random graphs basics

Erdős-Renyi random graphs basics Erdős-Renyi random graphs basics Nathanaël Berestycki U.B.C. - class on percolation We take n vertices and a number p = p(n) with < p < 1. Let G(n, p(n)) be the graph such that there is an edge between

More information

Uniformly Uniformly-ergodic Markov chains and BSDEs

Uniformly Uniformly-ergodic Markov chains and BSDEs Uniformly Uniformly-ergodic Markov chains and BSDEs Samuel N. Cohen Mathematical Institute, University of Oxford (Based on joint work with Ying Hu, Robert Elliott, Lukas Szpruch) Centre Henri Lebesgue,

More information

I forgot to mention last time: in the Ito formula for two standard processes, putting

I forgot to mention last time: in the Ito formula for two standard processes, putting I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy

More information

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Noèlia Viles Cuadros BCAM- Basque Center of Applied Mathematics with Prof. Enrico

More information

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

On pathwise stochastic integration

On pathwise stochastic integration On pathwise stochastic integration Rafa l Marcin Lochowski Afican Institute for Mathematical Sciences, Warsaw School of Economics UWC seminar Rafa l Marcin Lochowski (AIMS, WSE) On pathwise stochastic

More information

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME SAUL D. JACKA AND ALEKSANDAR MIJATOVIĆ Abstract. We develop a general approach to the Policy Improvement Algorithm (PIA) for stochastic control problems

More information

Figure 10.1: Recording when the event E occurs

Figure 10.1: Recording when the event E occurs 10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable

More information

WITH SWITCHING ARMS EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT

WITH SWITCHING ARMS EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT lournal of Applied Mathematics and Stochastic Analysis, 12:2 (1999), 151-160. EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT WITH SWITCHING ARMS DONCHO S. DONCHEV

More information

Lecture 4a: Continuous-Time Markov Chain Models

Lecture 4a: Continuous-Time Markov Chain Models Lecture 4a: Continuous-Time Markov Chain Models Continuous-time Markov chains are stochastic processes whose time is continuous, t [0, ), but the random variables are discrete. Prominent examples of continuous-time

More information

(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS

(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS (2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS Svetlana Janković and Miljana Jovanović Faculty of Science, Department of Mathematics, University

More information