Optimal Learning with Non-Gaussian Rewards


Optimal Learning with Non-Gaussian Rewards

Zi Ding, Ilya O. Ryzhov

November 2, 2014

Abstract

We propose a novel theoretical characterization of the optimal Gittins index policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process which probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach enables an easy connection to the convenient time-change properties of Brownian motion. Although no such device is available in general for the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We provide the free-boundary PIDE in explicit form under the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value of a given belief state.

1 Introduction

The field of optimal learning (Powell & Ryzhov, 2012) studies the efficient collection of information in stochastic optimization problems subject to environmental uncertainty, that is, problems where uncertainty is driven by unknown probability distributions. In many applications in business, medicine, and various branches of engineering, the decision-maker is able to formulate a belief about these unknown distributions, and gradually improve it using information collected from expensive simulations or field experiments. In particular, we consider a fundamental class of optimal learning problems known as multi-armed bandit problems (Gittins et al., 2011), in which there is a finite set of competing arms or alternatives (e.g. system designs, pricing strategies, or hiring policies), each with an unknown value. Alternatives are implemented sequentially in an online manner: upon choosing an alternative, we collect a reward in the form of a noisy sample centered around the unknown value. Our objective is to maximize the cumulative discounted expected reward collected over an infinite horizon. Each individual reward thus plays two roles: it contributes immediate economic value, and it also provides information about the alternative with the potential to improve future decisions.

The trade-off between reward and information is known as the exploration vs. exploitation dilemma. This problem arises in applications where decisions are made in real time, such as dynamic pricing or advertising placement in e-commerce (Chhabra & Das, 2011) or clinical drug trials with human patients (Berry & Pearson, 1985). The bandit model is also relevant in simulation, in situations where a single simulation experiment costs money (Chick & Gans, 2009). In the classical multi-armed bandit setting, where the decision-maker's beliefs about the alternatives are mutually independent, the work by Gittins & Jones (1979) shows that the optimal strategy takes the form of an index policy. At each time stage, an index is computed for each alternative independently of our knowledge about the others, and the alternative with the highest index is implemented. The index can be expressed as the solution to an optimal stopping problem (Katehakis & Veinott, 1987). Nonetheless, despite this considerable structure (see e.g. Gittins & Wang, 1992, for additional scaling properties), which continues to inspire new theoretical research on Gittins-like policies (Filliger & Hongler, 2007; Glazebrook & Minty, 2009), even the stopping problem for a single arm is computationally intractable. This challenge has given rise to a large body of work on heuristic methods, which often make additional assumptions on the reward distribution. It is especially common to require the reward distributions to be Gaussian, or to have bounded support (see e.g. Auer et al., 2002, for examples of both). In the simulation literature, Gaussian assumptions are standard due to advantages such as the ability to concisely model correlations between estimated values (Nelson & Matejcik, 1995; Chick & Inoue, 2001; Ryzhov et al., 2012; Qu et al., 2012). More recently, however, numerous applications have emerged where observations are clearly non-Gaussian. The operations management literature has recently studied applications in assortment planning (Caro & Gallien, 2007; Glazebrook et al., 2013) where the observed demand comes from a Poisson distribution with unknown rate. The challenge of learning Poisson distributions also arises in dynamic pricing (Farias & Van Roy, 2010), optimal investment and consumption (Wang & Wang, 2010), models for household purchasing decisions (Zhang et al., 2012), and online advertising and publishing (Agarwal et al., 2009). The work by Lariviere & Porteus (1999) studies a newsvendor problem where a Bayesian gamma prior is used to model beliefs about an exponentially distributed demand. The gamma-exponential model is also used by Jouini & Moy (2012) in the problem of learning signal-to-noise ratios in channel selection, and would also be appropriate for learning service times or network latencies. Motivated by applications like the above, we consider Bayesian bandit problems under

non-Gaussian, infinitely divisible reward distributions, encompassing both exponential and Poisson models. In the Gaussian setting, a recent stream of work by Brezzi & Lai (2002), Yao (2006), and Chick & Gans (2009) has approximated the Gittins index for an arm by formulating an optimal stopping problem on a Brownian motion with unknown drift, a continuous-time process that serves as a probabilistic interpolation of the sequence of Gaussian rewards collected from the arm. By making the connection between Brownian motion and the heat equation (Steele, 2001), one can formulate and numerically solve a free-boundary problem (van Moerbeke, 1976) to approximate the Gittins index. Our approach uses a similar foundation: we interpolate the reward sequences in the non-Gaussian problems with conditional Lévy processes, which are generated by infinitely divisible distributions (Sato, 1999), and then derive the relevant stopping problems for these continuous-time processes. Although there is a stream of research on multi-armed bandit problems driven by Lévy processes (El Karoui & Karatzas, 1994; Kaspi & Mandelbaum, 1995; Mandelbaum, 1986, 1987), these works do not consider the Bayesian perspective, where the reward distributions depend on unknown (random) parameters and our beliefs about them evolve over time. The dependence of our beliefs on the information collected from previously observed rewards requires the interpolation process to have increments that are generally non-stationary and non-independent, which leads to the use of conditional Lévy processes. Such processes have been studied in a learning context by Buonaguidi & Muliere (2013) and Cohen & Solan (2013), but only in the much more restrictive setting of binary prior distributions. The major challenge in studying non-Gaussian reward problems under this interpolation technique is that we cannot exploit the time-change properties of Brownian motion to standardize the problem, as was done in the Gaussian reward case. Therefore, we develop an alternate method based on equating the infinitesimal and characteristic operators (Peskir & Shiryaev, 2006) of value functions in an optimal stopping problem. We then obtain free-boundary problems on partial integro-differential equations (PIDEs), which can potentially be used to approximate the Gittins index. The solutions to these equations are shown to possess intuitive properties, such as continuity and monotonicity, that are known to hold for classic bandit problems in discrete time. This paper makes the following contributions: 1) We propose novel continuous-time approximations to Gittins indices for non-Gaussian problems, in the form of stopping problems on conditional Lévy processes that probabilistically interpolate sequences of infinitely divisible rewards. 2) We derive free-boundary PIDEs whose solutions match the value functions in these stopping problems,

and illustrate how these solutions can be calculated. 3) We apply our method directly to the gamma-exponential and gamma-Poisson problems and derive explicitly the PIDEs corresponding to these models. 4) We prove relevant structural properties on the monotonicity, continuity, and asymptotic behavior of the value functions and Gittins indices. We also derive a scaling property for the gamma-Poisson problem that is, to our knowledge, entirely new.

Section 2 reviews non-Gaussian Bayesian statistical learning models and the Gittins index, and provides additional motivation by discussing the statistical inconsistency of one prominent heuristic (non-Gittins) approach in the non-Gaussian setting. Section 3 presents the general theory of probabilistic interpolation by conditional Lévy processes used in our analysis, and derives the free-boundary PIDEs that can be used to find Gittins indices. Section 4 applies these methods directly to the gamma-exponential and gamma-Poisson learning models, for which we derive the PIDEs in explicit form. Section 5 provides theoretical derivations of the structural properties of the Gittins index in the two problems. Section 6 provides some numerical illustrations of the structural properties we derive.

2 Optimal learning with non-Gaussian rewards

Section 2.1 sets up the notation for our analysis and describes two major classes of conjugate learning models with infinitely divisible rewards, namely gamma-exponential and gamma-Poisson. Section 2.2 reviews the Gittins index policy, known to be optimal for bandit problems. Section 2.3 provides additional motivation for our study by showing that non-Gaussian problems can cause inconsistent behavior in knowledge gradient methods, a prominent class of heuristic learning policies.

2.1 Learning models for non-Gaussian rewards

Consider a bandit problem with $M$ alternatives, with $x_n \in \{1, \dots, M\}$ denoting the alternative chosen for implementation in stage $n = 0, 1, 2, \dots$. Let $W^{x_n}_{n+1}$ be the single-period reward observed after $x_n$ is implemented. In the discrete-time problem, quantities are indexed by the time at which they become known; thus, $x_n$ is chosen at time $n$, but the output $W^{x_n}_{n+1}$ only becomes known one time period later. For fixed $x$, the rewards $W^x_1, W^x_2, \dots$ are drawn from a common sampling distribution with density $f^x(\cdot\,; \lambda^x)$, where $\lambda^x$ is an unknown parameter (or vector of parameters). The rewards are

conditionally independent given $\lambda^x$. Let $\mathcal{F}_n$ be the sigma-algebra generated by the first $n$ decisions $x_0, x_1, \dots, x_{n-1}$, as well as the resulting rewards $W^{x_0}_1, \dots, W^{x_{n-1}}_n$. The unknown parameter $\lambda^x$ is modeled as a random variable, and our beliefs about the possible values of the parameter at time $n$ are represented by the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$. In the problems we consider, this sequence of conditional distributions is characterized by a sequence $(k^x_n)_{n=0}^{\infty}$ of random vectors, where $k^x_n$ is $\mathcal{F}_n$-measurable for all $n$, and we write
\[
E\left(W^x_{n+1} \mid \mathcal{F}_n\right) = m\left(k^x_n\right) \tag{1}
\]
for some appropriately chosen function $m$, so that $m$ is the mean of the reward based on our current belief. For convenience, we also let $m^x = E\left(W^x_{n+1} \mid \lambda^x\right)$ be the true mean of the single-period reward, which is the mean of the reward distribution when the true value of $\lambda^x$ is known. Note that, if $x$ is observed infinitely often, $m(k^x_n) \rightarrow m^x$ a.s. by martingale convergence. The knowledge state $k^x_n$ can also be viewed as a set of sufficient statistics for the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$. We follow the classic multi-armed bandit model, where $\lambda^x$ and $\lambda^y$ are independent for any $x \neq y$, and likewise the single-period rewards are independent across alternatives. Thus, our beliefs about all the alternatives can be completely characterized by $k_n = \{k^1_n, \dots, k^M_n\}$. A policy $\pi$ denotes a sequence $X^\pi_0, X^\pi_1, \dots$ of functions mapping knowledge states $k_0, k_1, \dots$ to elements of $\{1, \dots, M\}$. In other words, a policy is a rule for making decisions, under any possible knowledge state, at any time stage. Our objective can thus be written as
\[
\sup_\pi E^\pi \left[ \sum_{n=0}^{\infty} \gamma^n\, m\!\left(k^{X^\pi_n(k_n)}_n\right) \right], \tag{2}
\]
where $0 < \gamma < 1$ is a pre-specified discount factor. In words, the optimal policy should maximize the expected cumulative, infinite-horizon, discounted reward obtained from the alternatives implemented by the policy. We specifically highlight two classic Bayesian learning models where the sampling distributions are infinitely divisible. In the gamma-exponential model, $f^x$ is (conditionally) exponential with unknown rate $\lambda^x$. Under the assumption that $\lambda^x \sim \mathrm{Gamma}(a^x_0, b^x_0)$, the conditional distribution of $\lambda^x$ given $\mathcal{F}_n$ is also gamma with parameters $a^x_n$ and $b^x_n$. From DeGroot (1970), we can obtain

simple recursive relationships for the parameters, given by
\[
a^x_{n+1} = \begin{cases} a^x_n + 1 & \text{if } x_n = x, \\ a^x_n & \text{if } x_n \neq x, \end{cases} \tag{3}
\]
\[
b^x_{n+1} = \begin{cases} b^x_n + W^x_{n+1} & \text{if } x_n = x, \\ b^x_n & \text{if } x_n \neq x. \end{cases} \tag{4}
\]
In the gamma-exponential model, $k^x_n = (a^x_n, b^x_n)$, and the mean function $m$ is given by $m(k^x_n) = E\left(\frac{1}{\lambda^x} \mid \mathcal{F}_n\right) = \frac{b^x_n}{a^x_n - 1}$. We also consider the gamma-Poisson model, where $f^x$ is conditionally Poisson with unknown rate $\lambda^x$. Again, we start with $\lambda^x \sim \mathrm{Gamma}(a^x_0, b^x_0)$, whence the posterior distribution of $\lambda^x$ at time $n$ is again gamma with parameters $a^x_n$ and $b^x_n$, and the Bayesian updating equations are now given by
\[
a^x_{n+1} = \begin{cases} a^x_n + W^x_{n+1} & \text{if } x_n = x, \\ a^x_n & \text{if } x_n \neq x, \end{cases} \tag{5}
\]
\[
b^x_{n+1} = \begin{cases} b^x_n + 1 & \text{if } x_n = x, \\ b^x_n & \text{if } x_n \neq x. \end{cases} \tag{6}
\]
Again, the decision-maker's knowledge about $\lambda^x$ at time $n$ is represented by $k^x_n = (a^x_n, b^x_n)$, with mean function $m(k^x_n) = E(\lambda^x \mid \mathcal{F}_n) = \frac{a^x_n}{b^x_n}$. Both of these models are conjugate, meaning that the posterior distribution of the unknown parameter is always gamma, at any time stage. Beliefs based on conjugate priors are easy to store and update, making such models useful for practitioners. Our approach in this paper is designed for infinitely divisible sampling distributions, a class that includes both the gamma-exponential and gamma-Poisson cases. Among the standard conjugate models in DeGroot (1970), these two are among the most intuitive, and also the most suitable for applications in assortment planning and revenue management. We briefly mention a third well-known non-Gaussian model, which assumes Bernoulli rewards with beta belief distributions; the sampling distribution in this case is not infinitely divisible and does not fall under our framework, but it has been relatively well studied elsewhere (Berry & Pearson, 1985; Bertsimas & Mersereau, 2007), and we do not explore it here. To our knowledge, the gamma-exponential and gamma-Poisson models have received less theoretical attention in the bandit literature.
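The update rules (3)-(4) and (5)-(6) are simple enough to state as code. The following small sketch is our own illustration (not part of the paper); the function names and the toy data are arbitrary, and the arm superscript is dropped.

```python
def update_gamma_exponential(a, b, w):
    """One Bayesian update (3)-(4): exponential reward w, Gamma(a, b) prior on the rate."""
    return a + 1.0, b + w          # posterior mean of the reward is b / (a - 1)

def update_gamma_poisson(a, b, w):
    """One Bayesian update (5)-(6): Poisson reward w, Gamma(a, b) prior on the rate."""
    return a + w, b + 1.0          # posterior mean of the reward is a / b

# toy usage: three exponential observations from the measured arm
a, b = 2.0, 1.0
for w in (0.4, 1.3, 0.7):
    a, b = update_gamma_exponential(a, b, w)
print(a, b, b / (a - 1.0))         # updated knowledge state and posterior mean
```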

2.2 Review of Gittins indices

We briefly summarize the characterization of the Gittins index policy, known to optimally solve (2). For a more detailed introduction, the reader is referred to Ch. 6 of Powell & Ryzhov (2012). Furthermore, Gittins et al. (2011) provide a deeper theoretical treatment with several equivalent proofs of optimality for the policy. The Gittins method considers each alternative separately from the others. Let $k$ denote our beliefs about an arbitrary alternative, dropping the superscript $x$ for notational convenience. Consider a situation where, in every time stage, we have a choice between implementing this alternative and receiving a known, deterministic retirement reward $r$. The optimal decision (implement vs. retire) can be characterized using Bellman's equation for dynamic programming. We write
\[
V(k, r) = \max\left\{ r + \gamma V(k, r),\; E\left[ W + \gamma V(k', r) \mid k \right] \right\}, \tag{7}
\]
where $W$ is the reward obtained from the implementation, and $k'$ is the future knowledge state arising from the new information provided by $W$. In our setting, $k'$ would be computed using (3)-(4) or (5)-(6). Because we do not update our beliefs when we collect the fixed reward, it follows that, if we prefer the fixed reward given the knowledge state $k$, we will continue to prefer it for all future time periods. Then, (7) becomes
\[
V(k, r) = \max\left\{ \frac{r}{1 - \gamma},\; m(k) + \gamma E\left[ V(k', r) \mid k \right] \right\}. \tag{8}
\]
The Gittins index $R(k)$ is the particular retirement reward value $r$ that makes us indifferent between the two quantities inside the maximum in (8). In the special case where the parameter $\lambda^x$ is known, this retirement reward is equal to the mean single-period reward, as shown in the following lemma.

Lemma 2.1. If the parameter $\lambda^x$ is a known constant, the Gittins index of arm $x$ is $m^x$.

Proof: If $\lambda^x$ is known, there is no knowledge state, and (7) becomes $V(r) = \max\{r + \gamma V(r),\, m^x + \gamma V(r)\}$, whence the result follows immediately.

Once the Gittins indices have been computed, the policy $X^*_n(k_n) = \arg\max_x R(k^x_n)$ can be shown to be optimal for the objective in (2). Thus, the Gittins method decomposes an $M$-dimensional problem into $M$ one-dimensional problems, each of which can be solved independently of the others.
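To make the indifference characterization in (8) concrete, the following sketch (our own illustration, not the paper's method) approximates the Gittins index of a single gamma-Poisson arm by truncating the retirement problem at a finite horizon, solving it by backward induction, and bisecting on $r$. The horizon, the reward truncation, and the bisection bracket are all ad hoc assumptions, so the output is only a rough approximation of $R(a_0, b_0)$.

```python
import numpy as np
from scipy.stats import nbinom

def gittins_gamma_poisson(a0, b0, gamma=0.9, horizon=25, w_max=30, tol=1e-3):
    """Rough approximation of the discrete-time Gittins index R(a0, b0) of a
    gamma-Poisson arm, via the retirement formulation (8). Horizon and reward
    truncation are heuristic choices; accuracy degrades as gamma approaches 1."""

    def value_minus_retirement(r):
        retire = r / (1.0 - gamma)
        V_next = None
        for n in reversed(range(horizon + 1)):
            b = b0 + n
            s = np.arange(n * w_max + 1)            # cumulative observed reward so far
            mean = (a0 + s) / b                     # posterior mean in state (a0 + s, b0 + n)
            if n == horizon:
                # crude terminal condition: retire, or collect the current mean forever
                V = np.maximum(retire, mean / (1.0 - gamma))
            else:
                w = np.arange(w_max + 1)
                cont = np.empty_like(mean)
                for i, a in enumerate(a0 + s):
                    # predictive reward distribution: negative binomial with parameters a and b/(b+1)
                    pw = nbinom.pmf(w, a, b / (b + 1.0))
                    pw /= pw.sum()                  # renormalize after truncating at w_max
                    cont[i] = mean[i] + gamma * np.dot(pw, V_next[i + w])
                V = np.maximum(retire, cont)
            V_next = V
        return V_next[0] - retire                   # > 0 iff continuing beats retiring at (a0, b0)

    lo, hi = a0 / b0, a0 / b0 + 10.0                # the index is at least the current mean
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if value_minus_retirement(mid) > 1e-9:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(gittins_gamma_poisson(a0=1.0, b0=1.0))        # slightly above the posterior mean 1.0
```

The same backward-induction idea applies to the gamma-exponential model, but the predictive reward distribution is then continuous and the expectation requires numerical integration, which is one motivation for the continuous-time approach developed below.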

Furthermore, in the gamma-exponential version of the problem (that is, where $k = (a, b)$ and (3)-(4) are used to update $k$), it has also been shown (Gittins & Wang, 1992) that
\[
R(a, b) = b\, R(a, 1), \tag{9}
\]
meaning that Gittins indices only have to be computed for a restricted class of knowledge states. Equivalently, if we can find $b^*(a)$ such that $R\left(a, b^*(a)\right) = 1$, we can use (9) to write
\[
b^*(a)\, R(a, 1) = R\left(a, b^*(a)\right) = 1, \tag{10}
\]
whence $R(a, b) = \frac{b}{b^*(a)}$. Yet, even with this structure, it is difficult to compute $R(a, 1)$ or $b^*(a)$ for arbitrary $a$. Computational approaches based on PDEs have previously been developed only for problems with Gaussian rewards.

2.3 Inconsistency of knowledge gradient methods

This section provides additional motivation for our work by showing that non-Gaussian problems create theoretical challenges for a prominent class of sub-optimal heuristics known as knowledge gradient (KG) methods. Instead of separating the arms and computing an individual index for each one, this approach considers all the alternatives together and calculates the expected improvement
\[
R^{KG,x}_n = E\left[ \max_y m\left(k^y_{n+1}\right) - \max_y m\left(k^y_n\right) \,\middle|\, \mathcal{F}_n,\, x_n = x \right] \tag{11}
\]
that a single implementation contributes to an estimate of the value of the best alternative. The knowledge gradient (KG) policy
\[
X^{KG}_n(k_n) = \arg\max_x R^{KG,x}_n \tag{12}
\]
has received attention in the simulation community (see e.g. Chick, 2006, for an overview), because it is computationally efficient and often performs near-optimally in experiments. Typically, the goal in such problems is to identify the alternative with the highest value, rather than to maximize the cumulative reward as in (2). However, these objectives are closely related, and the method can be easily adapted to bandit problems (Ryzhov et al., 2012) with a simple modification of (12) known as online KG. If the rewards are Gaussian, the policy in (12) is statistically consistent (Frazier et al., 2008), meaning that $m(k^x_n) \rightarrow m^x$ a.s. for every $x$. In other words, the policy is guaranteed to discover the best alternative over an infinite horizon, a useful regularity property for simulation optimization

algorithms. This property typically does not hold in bandit problems, even for Gittins indices (Brezzi & Lai, 2000), but the consistency of (12) leads to certain regularity properties for online KG (Ryzhov et al., 2012). In general, the consistency of knowledge gradient policies has been shown in numerous settings where Gaussian rewards appear (Vazquez & Bect, 2010; Frazier & Powell, 2011; Ryzhov & Powell, 2012). However, as we now show, this property does not hold for the gamma-exponential learning model. The work by Ryzhov & Powell (2011) provides a closed-form solution of (11) for the gamma-exponential problem, given by
\[
R^{KG,x}_n = \begin{cases}
\dfrac{\left(b^x_n / a^x_n\right)^{a^x_n}}{\left(a^x_n - 1\right)\left(C^x_n\right)^{a^x_n - 1}} & \text{if } \dfrac{b^x_n}{a^x_n - 1} \leq C^x_n, \\[2ex]
\dfrac{\left(b^x_n / a^x_n\right)^{a^x_n}}{\left(a^x_n - 1\right)\left(C^x_n\right)^{a^x_n - 1}} - \left(\dfrac{b^x_n}{a^x_n - 1} - C^x_n\right) & \text{if } \dfrac{b^x_n}{a^x_n} \leq C^x_n < \dfrac{b^x_n}{a^x_n - 1}, \\[2ex]
0 & \text{if } C^x_n < \dfrac{b^x_n}{a^x_n},
\end{cases} \tag{13}
\]
where $C^x_n = \max_{y \neq x} m(k^y_n)$. We see immediately that it is possible to have $R^{KG,x}_n = 0$ even though $\mathrm{Var}\left(\lambda^x \mid \mathcal{F}_n\right) > 0$. This behavior, which does not arise in the Gaussian setting, is the cause of the inconsistency result. Essentially, if the policy places zero value on alternative $x$, there may be a non-zero probability that the policy will never measure $x$, which means that $m(k^x_n)$ will not converge to the true mean reward. The proof can be found in the Appendix.

Theorem 2.1. There exists a gamma-exponential problem for which (12) has a non-zero probability of never measuring a particular alternative.
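As a quick illustration of the zero-value case in (13), here is a small sketch (ours, not the paper's; the function name and test values are arbitrary) that evaluates the KG factor of one arm following the case structure of (13) and exhibits a state where the factor is exactly zero despite residual uncertainty about $\lambda$.

```python
def kg_factor_gamma_exponential(a, b, C):
    """Knowledge-gradient factor of one gamma-exponential arm, following the
    three cases of (13). a, b: posterior gamma parameters of the arm (a > 1);
    C: best posterior mean among the other arms."""
    mean = b / (a - 1.0)                      # current posterior mean of this arm
    if C < b / a:
        return 0.0                            # measuring the arm cannot change the current maximum
    bulk = (b / a) ** a / ((a - 1.0) * C ** (a - 1.0))
    if C >= mean:                             # some other arm currently looks best
        return bulk
    return bulk - (mean - C)                  # this arm currently looks best

# A state with strictly positive posterior variance of lambda but zero KG value:
print(kg_factor_gamma_exponential(a=3.0, b=10.0, C=2.0))   # -> 0.0, since C < b/a = 10/3
```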

This result provides some insight into the fundamental differences between Gaussian and non-Gaussian learning models. It also adds to the motivation of our study of Gittins indices, a policy that looks over an infinite horizon and thus avoids under-estimating the value of information. We now return to the Gittins policy and develop a new characterization of it for non-Gaussian problems with infinitely divisible rewards.

3 The Gittins index as a stopping boundary

Our analysis is based on the idea of continuous-time interpolation, which has previously only been studied in the setting of Gaussian rewards. The basic idea was first proposed by Brezzi & Lai (2002). For arbitrary $x$ (in the following, we again drop the superscript $x$ for convenience, as in Section 2.2), the discrete-time process $(W_n)_{n=1}^{\infty}$ of single-period rewards with unknown mean $\mu$ and known variance $\sigma^2$ is replaced by a continuous-time process $(X_t)_{t \geq 0}$. The process $(X_t)$ is constructed in such a way that, for integer $t$, the increment $X_{t+1} - X_t$ has the same distribution as the single-period reward $W_{t+1}$. Therefore, $(X_t)$ can be viewed as a probabilistic interpolation of $(W_n)$. In the Gaussian setting, $(X_t)$ is conditionally a Brownian motion with unknown drift $\mu$ and known volatility $\sigma$. The formulation of the Gittins index in (8) is then extended to the continuous-time case and written as the solution to an optimal stopping problem on $(X_t)$. Our approach is inspired by this idea. When the rewards are non-Gaussian and infinitely divisible, we construct a continuous-time process $(X_t)$ such that, for integer $t$, the increment $X_{t+1} - X_t$ has the same distribution as $W_{t+1}$. Since the discrete-time rewards are conditionally i.i.d. given $\lambda$, the interpolation $(X_t)$ is a conditional Lévy process, with increments that are conditionally independent and stationary. For the gamma-exponential problem, $(X_t)$ is conditionally a gamma process given $\lambda$ (see e.g. Cinlar, 2011, for a definition), with exponentially distributed increments over unit intervals. Similarly, in the gamma-Poisson problem, $(X_t)$ is conditionally a Poisson process. Section 3.1 reviews the theory of conditional Lévy processes and presents the continuous-time analog of the Gittins index equation (8) under this interpolation, similar to that used by Brezzi & Lai (2002) and Yao (2006) in studying Gaussian problems. Section 3.2 derives a free-boundary partial integro-differential equation that can be used to solve for the Gittins indices. The Gittins indices obtained from such a continuous interpolation then serve as approximations to the Gittins indices in the original discrete-time problem.

3.1 Continuous-time conditional Lévy interpolation

In this section, we follow the theoretical characterization of conditional Lévy processes introduced in Cinlar (2003). Let $(X_t)$ be a real-valued stochastic process that will later serve, in Section 4, as the continuous-time interpolation of cumulative rewards without discounting. The parameter $\lambda$ is a random variable (or random vector) whose conditional distribution given $\mathcal{F}_t$ characterizes our belief about $\lambda$ at time $t$. Given $\lambda$, the process $(X_t)$ has conditionally stationary and independent increments. In the context of non-Gaussian conjugate models, we further restrict $(X_t)$ to increasing and right-continuous pure jump processes. The method below does apply to general stochastic processes, but this is not as useful for interpolation purposes in Bayesian bandit problems.

The dependence of $X_t$ on $\lambda$ is described as
\[
X_t = X_0 + \int_{[0,t] \times \mathbb{R}_+} z\, \mu(ds, dz), \tag{14}
\]
where $\mu$ is conditionally (given $\lambda$) a random measure on $\mathbb{R}_+ \times \mathbb{R}_+$ with mean measure $\nu(\lambda, dz)\, ds$, satisfying $\int_{\mathbb{R}_+} \nu(\lambda, dz)(z \wedge 1) < \infty$ for all $\lambda$ (for details on random measures and mean measures, see Ch. 6 in Cinlar, 2011). The intensity measure of $\mu$ at time $t$, that is, the intensity given $\mathcal{F}_t$ but not given $\lambda$, is written as $\nu_t(dz)\, ds = E\left[\nu(\lambda, dz) \mid \mathcal{F}_t\right] ds$. Intuitively, this definition states that a conditional Lévy process has two layers of randomness. The conditional mean measure $\nu(\lambda, \cdot)$ characterizes the infinitesimal behavior of the process, given a value of $\lambda$, as that of a Lévy process. The unconditional intensity measure $\nu_t$ further removes the randomness in the belief about $\lambda$ by integrating over its distribution. Thus, $\nu_t$ can be described as the mean of the conditional mean measure. With the reward process $(X_t)$ set up as a continuous-time conditional Lévy process, the Gittins logic can be extended to the continuous-time setting as follows. Let $c > 0$ be a continuous-time discount factor (lower $c$ corresponds to higher $\gamma$ in discrete time). The Gittins index $R$ is the particular value of $r$ such that
\[
r \int_0^{\infty} e^{-cs}\, ds = \sup_\tau E\left[ \int_0^{\tau} e^{-cs}\, dX_s + r \int_{\tau}^{\infty} e^{-cs}\, ds \right], \tag{15}
\]
where $\tau$ denotes a stopping time. This expectation is evaluated given some initial state $k_0$ at time $0$, which is the starting time when the index is calculated. We omit this dependence from the notation and take the starting time to be zero without loss of generality, since the Gittins index only depends on the time through the knowledge state. This formulation is equivalent to the one in (8); see e.g. Katehakis & Veinott (1987) or Yao (2006) for more details. As before, discounted rewards are collected from the process $(X_t)$ until time $\tau$, at which point we collect the fixed retirement reward $r$ until the end of time. If (15) holds, we are indifferent between stopping immediately and running the process until the optimal stopping time.
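To give a concrete feel for the two layers of randomness, the following sketch (our own illustration; the function name, grid, and defaults are arbitrary assumptions) simulates one path of the conditional interpolation for either model: draw $\lambda$ from its gamma prior, then simulate a gamma process (unit shape per unit time) or a Poisson process, and track the posterior mean process $m_t$ alongside.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_interpolation(a0, b0, T, dt=0.01, model="exponential"):
    """One path of the conditional Levy interpolation (X_t) on [0, T], plus the
    posterior mean process m_t. Assumes a Gamma(a0, b0) prior on the unknown rate."""
    lam = rng.gamma(shape=a0, scale=1.0 / b0)        # outer layer: draw the unknown parameter
    t = np.arange(0.0, T + dt, dt)
    if model == "exponential":
        # inner layer: gamma process with shape dt and rate lam per increment,
        # so X_{t+1} - X_t ~ Exponential(lam) over unit intervals
        incr = rng.gamma(shape=dt, scale=1.0 / lam, size=len(t) - 1)
        X = np.concatenate(([0.0], np.cumsum(incr)))
        m = (b0 + X) / (a0 + t - 1.0)                # posterior mean of 1/lam (needs a0 + t > 1)
    else:
        # inner layer: Poisson process with rate lam
        incr = rng.poisson(lam * dt, size=len(t) - 1)
        X = np.concatenate(([0.0], np.cumsum(incr)))
        m = (a0 + X) / (b0 + t)                      # posterior mean of lam
    return t, X, m, lam

t, X, m, lam = simulate_interpolation(a0=2.0, b0=1.0, T=50.0)
print(lam, m[-1])    # with enough data, m_t settles near the true mean 1/lam
```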

3.2 Characterization as a free-boundary problem

In the Gaussian setting, the optimal stopping problem (15) can be solved by performing a time change on the conditional Brownian motion to convert it into a standard Wiener process (Brezzi & Lai, 2002). This standardized problem can be solved by simulation (Brezzi & Lai, 2002; Yao, 2006) or using a free-boundary heat equation (Chick & Gans, 2009; Chick & Frazier, 2012). When the rewards are non-Gaussian, the process can still be embedded in Brownian motion (Monroe, 1978), but the time change is random and computationally intractable. Instead, we apply a new approach based on Peskir & Shiryaev (2006). We manipulate the continuous-time Gittins index equation (15) as follows:
\[
\begin{aligned}
0 &= \sup_\tau E\left[ \int_0^\tau e^{-cs}\, dX_s - \int_0^\tau e^{-cs} r\, ds \right] \\
&= \sup_\tau E\left[ \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z\, \mu(dz, ds) - \int_0^\tau e^{-cs} r\, ds \right] \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu(\lambda, dz) - r \right) ds + \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z \left[ \mu(dz, ds) - \nu(\lambda, dz)\, ds \right] \right] & (16) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu(\lambda, dz) - r \right) ds \right] + E\left[ \int_{[0,\tau] \times \mathbb{R}_+} e^{-cs} z\, E\left[ \mu(dz, ds) - \nu(\lambda, dz)\, ds \mid \mathcal{F}_s \right] \right] & (17) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, E\left[ \nu(\lambda, dz) \mid \mathcal{F}_s \right] - r \right) ds \right] & (18) \\
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \int_{\mathbb{R}_+} z\, \nu_s(dz) - r \right) ds \right]. & (19)
\end{aligned}
\]
In (16) we use a compensating technique by adding and subtracting $\nu(\lambda, dz)$. The random measure $\mu$ is then cancelled in (17) by taking iterated conditional expectations (also called the tower property). We then use the tower property again in (18). Notice that $\int_{\mathbb{R}_+} z\, \nu_s(dz)$ is the mean of the infinitesimal increment of the process. For a classic Lévy process, where $\lambda$ is a known constant with no dependency on $\mathcal{F}_s$, the term $\int_{\mathbb{R}_+} z\, \nu(dz)$ is called the compensator of the process (Itô et al., 2004). We denote $\int_{\mathbb{R}_+} z\, \nu_t(dz)$ by $m_t$ to emphasize that this quantity serves the same role as $m(k_n)$ in (1). Then, (15) in continuous time can be written as
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs} (m_s - r)\, ds \right] = 0, \tag{20}
\]
which we will refer to as the calibration equation throughout this paper. We also introduce notation for the left-hand side of (20),
\[
V(t, m) := \sup_\tau E\left[ \int_0^\tau e^{-cs} (m_s - r)\, ds \right]. \tag{21}
\]

Recall that the expectation in (21) is evaluated given some initial state at time $0$. As we observe from models such as (3)-(6), the pair $(t, m)$, representing a time parameter and a mean parameter, is a set of sufficient statistics for the distribution of $\lambda$ given $\mathcal{F}_t$ (in fact, these statistics are sufficient for all the models described in Gittins et al., 2011). Therefore, $V(t, m)$ indicates that our initial knowledge is characterized by $(t, m)$. In this value function, $r$ is a fixed constant, and the Gittins index $R(t, m)$ is the particular value of $r$ that makes $V(t, m) = 0$. On the other hand, if we fix $r$, the set of pairs $(t, m)$ for which $V(t, m) = 0$ is precisely the set of states that have $r$ as their Gittins index. We now construct a free-boundary problem for $V$ by equating the characteristic and infinitesimal operators of $V$. In this approach, we rely on the mild condition that $(m_t)$ is a càdlàg strong Markov process. As we will see from calculating $m_t$ explicitly for the gamma-exponential and gamma-Poisson problems in Section 4, this assumption is reasonable and generally satisfied in the context of Bayesian conjugate models. The characteristic operator of $V$ is defined as
\[
\mathcal{L}_{\mathrm{char}} V(t, m) = \lim_{U \downarrow \{m\}} \frac{E\, V\left(t_{\tau_{U^c}}, m_{\tau_{U^c}}\right) - V(t, m)}{E\left(\tau_{U^c}\right)}, \tag{22}
\]
where $U$ is an open set that contains $m$, and $\tau_{U^c}$ is the hitting time of the set $U^c$ for the process $(m_t)$. That is,
\[
\tau_{U^c} = \inf\{t : m_t \in U^c\}
\]
is the first time at which $(m_t)$ leaves the set $U$. First we consider the value function at the moment when $\tau_{U^c}$ occurs, and then we shrink $U$ down to the singleton $\{m\}$. The concept of the characteristic operator dates back to Dynkin (1965). The value function $V = E\left[\int_0^\tau e^{-cs}(m_s - r)\, ds\right]$ is analogous to a killed version of the Lagrange problem introduced in Ch. 6 of Peskir & Shiryaev (2006), where the value function is of the form $E\left[\int_0^\tau e^{-\Lambda(s)} L(m_s)\, ds\right]$. The characteristic operator in the Lagrange problem can be explicitly calculated, and (22) is given in closed form in the following result.

Lemma 3.1. If $(m_t)$ is a càdlàg strong Markov process, the characteristic operator of $V$ is given by
\[
\mathcal{L}_{\mathrm{char}} V(t, m) = cV(t, m) - (m - r). \tag{23}
\]

Proof: The lemma follows from (7.2.8) in Peskir & Shiryaev (2006). In a killed Lagrange problem with value function $E\left[\int_0^\tau e^{-\Lambda(s)} L(m_s)\, ds\right]$, inserting $\Lambda(s) = cs$ and $L(m_s) = m_s - r$ yields the desired result.

The infinitesimal operator $\mathcal{L}_{\mathrm{inf}}$ (also called the generator of $V$) is derived using Itô's lemma. Our goal is to obtain an expression
\[
V(t, m_t) = V(0, m_0) + \int_0^t \mathcal{L}_{\mathrm{inf}} V(s, m_s)\, ds + Y_t, \tag{24}
\]
where $(Y_t)$ is a martingale formed by adding and subtracting a continuous compensator to the jump component of $V$ (see Itô et al., 2004, for an exposition of this idea). Recall that in discrete-time models such as (3)-(6), the mean process $(m_t)$ is expressed in terms of $t$ and $X_t$ in simple closed form. We now assume that $m_t$ can be written as $g(t, X_t)$ for some continuous function $g$ with first-order derivatives. In Section 4, we will explicitly derive $g$ and show that this assumption is satisfied in the gamma-exponential and gamma-Poisson problems.

Lemma 3.2. If $(m_t)$ can be written in the form $m_t = g(t, X_t)$ for some continuous function $g$ with first-order derivatives, the infinitesimal operator of $V$ is given by
\[
\mathcal{L}_{\mathrm{inf}} V(t, m) = \frac{\partial V}{\partial t}(t, m) + \frac{\partial g}{\partial t}(t, X_t)\, \frac{\partial V}{\partial m}(t, m) + \int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + z)) - V(t, g(t, X_t)) \right] \nu_t(dz). \tag{25}
\]

Proof: The result follows from the calculation
\[
\begin{aligned}
V(t, m_t) &= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, dm^c_s + \sum_{0 < s \leq t} \left[ V(s, m_s) - V(s, m_{s-}) \right] & (26) \\
&= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, \frac{\partial g}{\partial s}\, ds + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \mu(ds, dz) \\
&= V(0, m_0) + \int_0^t \frac{\partial V}{\partial s}(s, m_s)\, ds + \int_0^t \frac{\partial V}{\partial m}(s, m_s)\, \frac{\partial g}{\partial s}\, ds + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \nu_s(dz)\, ds \\
&\quad + \int_{[0,t] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \left( \mu(ds, dz) - \nu(\lambda, dz)\, ds + \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \right) & (27) \\
&= V(0, m_0) + \int_0^t \mathcal{L}_{\mathrm{inf}} V(s, m_s)\, ds + Y_t.
\end{aligned}
\]
In (26), we use Itô's lemma for jump-diffusion processes (see Sato, 1999, Ch. 6, Theorem 31.5), and $m^c_s$ denotes the continuous part of $m_s$, after removing all jumps. Since $m_s = g(s, X_s)$ and $X_s$ is a pure jump process, we have $dm^c_s = \frac{\partial g}{\partial s}\, ds$. In (27), we applied a compensator technique and

put the component with respect to the random measure $\mu$ into an $\mathcal{F}_t$-martingale $Y_t$. Note that, for $T \geq t$, the tower property implies that
\[
\begin{aligned}
E\left[Y_T \mid \mathcal{F}_t\right] &= Y_t + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] \left( \mu(ds, dz) - \nu(\lambda, dz)\, ds + \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \right) \,\middle|\, \mathcal{F}_t \right] \\
&= Y_t + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] E\left[ \mu(ds, dz) - \nu(\lambda, dz)\, ds \mid \mathcal{F}_s \right] \,\middle|\, \mathcal{F}_t \right] \\
&\quad + E\left[ \int_{[t,T] \times \mathbb{R}_+} \left[ V(s, g(s, X_s + z)) - V(s, g(s, X_s)) \right] E\left[ \nu(\lambda, dz)\, ds - \nu_s(dz)\, ds \mid \mathcal{F}_s \right] \,\middle|\, \mathcal{F}_t \right] \\
&= Y_t.
\end{aligned}
\]
Therefore, the infinitesimal operator $\mathcal{L}_{\mathrm{inf}} V$ is
\[
\mathcal{L}_{\mathrm{inf}} V(t, m) = \frac{\partial V}{\partial t}(t, m) + \frac{\partial g}{\partial t}(t, X_t)\, \frac{\partial V}{\partial m}(t, m) + \int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + z)) - V(t, m) \right] \nu_t(dz),
\]
which yields the result in the lemma.

Essentially, the characteristic and infinitesimal operators are two different expressions for the derivative of $V$, based on Kolmogorov theory and Itô calculus, respectively. Under general arguments (Peskir & Shiryaev, 2006), the two operators exist and coincide. By matching them, we obtain a free-boundary problem on a PIDE as a consequence of the above derivations.

Theorem 3.1. If $(m_t)$ is a strong Markov càdlàg process and can be written as $g(t, X_t)$ for some continuous function $g$ with first-order derivatives, the value function $V(t, m)$ solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial t}(t, m) + \dfrac{\partial g}{\partial t}(t, X_t)\, \dfrac{\partial V}{\partial m}(t, m) + \displaystyle\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, m) \right] \nu_t(dy) = cV(t, m) - (m - r), \\[1ex]
V(t, m^*(t)) = 0,
\end{cases}
\]
where $m^*(t)$ is an unknown stopping boundary curve. For every point on the stopping boundary, the Gittins index $R(t, m^*(t))$ is equal to $r$.

Proof: Using the characteristic and infinitesimal operators shown in Lemmas 3.1 and 3.2, the theorem follows from Ch. 7.2 in Peskir & Shiryaev (2006).

We have assumed a fixed retirement reward $r$ in this PIDE. Thus, the solution of the PIDE does not immediately yield a Gittins index $R(t, m)$ for an arbitrary knowledge state $(t, m)$. However,

the stopping boundary curve $m^*$ describes the set of all knowledge states whose Gittins index is exactly equal to $r$. In the gamma-exponential problem, it would be sufficient to find the boundary for $r = 1$ and apply (10).

4 Exponential and Poisson reward problems

We now apply the general result of Theorem 3.1 to characterize Gittins indices for exponential and Poisson rewards. Section 4.1 covers the gamma-exponential problem, whereas Section 4.2 covers the gamma-Poisson problem.

4.1 A free-boundary problem for gamma-exponential bandits

In the gamma-exponential problem, our continuous-time interpolation $(X_t)$ is a gamma process with shape parameter 1 and unknown rate parameter $\lambda$. We begin by assuming $\lambda \sim \mathrm{Gamma}(a_0, b_0)$, reflecting the decision-maker's prior beliefs. Letting $\mathcal{F}_t$ be the $\sigma$-algebra generated by the path of $(X_t)$ up to time $t$, we find that the conditional distribution of $\lambda$ given $\mathcal{F}_t$ is still gamma, with posterior parameters
\[
a_t = a_0 + t, \qquad b_t = b_0 + X_t,
\]
as in (3)-(4). For convenience, we may also use the notation $k_t = (a_t, b_t)$. The value function $V(t, m)$ for the gamma-exponential problem can also be written as $V(a, m)$ under the shift of variable $a_t = a_0 + t$.

Theorem 4.1. The value function $V(a, m)$ in the gamma-exponential problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial a}(a, m) - \dfrac{m}{a - 1}\, \dfrac{\partial V}{\partial m}(a, m) + \displaystyle\int_0^{\infty} \left[ V(a, m + z) - V(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz = cV(a, m) - (m - r), \\[1ex]
V(a, m^*(a)) = 0,
\end{cases}
\]
where $m^*(a)$ is an unknown stopping boundary curve. For every point $(a, m)$ on this stopping boundary, the Gittins index $R(a, m)$ is equal to $r$.

Proof: This can be shown through explicit calculation based on the PIDE in Theorem 3.1. In the conditional Lévy process we use to model exponential rewards, the conditional mean measure

given $\lambda$ is $\nu(\lambda, dy) = \frac{e^{-\lambda y}}{y}\, dy$, the same as in a gamma process, and the distribution of $\lambda$ given $\mathcal{F}_t$ is $\mathrm{Gamma}(a_t, b_t)$. Therefore, the unconditional mean measure $\nu_t(dy)$ is calculated as
\[
\nu_t(dy) = \int_0^{\infty} \frac{e^{-\lambda y}}{y}\, \frac{b_t^{a_t} \lambda^{a_t - 1} e^{-b_t \lambda}}{\Gamma(a_t)}\, d\lambda\, dy = \left( \frac{b_t}{b_t + y} \right)^{a_t} \frac{1}{y}\, dy,
\]
and
\[
m_t = \int_0^{\infty} y\, \nu_t(dy) = \int_0^{\infty} \left( \frac{b_t}{b_t + y} \right)^{a_t} dy = \frac{b_t}{a_t - 1}.
\]
Therefore, $m_t = g(t, X_t) = \frac{b_0 + X_t}{a_0 + t - 1}$ and $\frac{\partial g}{\partial t} = -\frac{b_0 + X_t}{(a_0 + t - 1)^2} = -\frac{m_t}{a_t - 1}$. Also,
\[
\begin{aligned}
\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, g(t, X_t)) \right] \nu_t(dy)
&= \int_{\mathbb{R}_+} \left[ V\!\left(t, \frac{b_t + y}{a_t - 1}\right) - V\!\left(t, \frac{b_t}{a_t - 1}\right) \right] \left( \frac{b_t}{b_t + y} \right)^{a_t} \frac{1}{y}\, dy \\
&= \int_{\mathbb{R}_+} \left[ V(t, m_t + z) - V(t, m_t) \right] \left( \frac{m_t}{m_t + z} \right)^{a_t} \frac{1}{z}\, dz,
\end{aligned}
\]
where the last equality is obtained by using the change of variable $z = \frac{y}{a_t - 1}$.
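As a small numerical sanity check of the mixture step above (our own illustration; the function name and parameter values are arbitrary), the unconditional density $(b_t/(b_t+y))^{a_t}/y$ can be compared against a Monte Carlo average of the conditional Lévy density $e^{-\lambda y}/y$ over $\lambda \sim \mathrm{Gamma}(a_t, b_t)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def check_unconditional_measure(a_t, b_t, y, n_samples=200_000):
    """Compare the Monte Carlo average of the conditional Levy density e^{-lam*y}/y,
    with lam ~ Gamma(a_t, b_t), to the closed form (b_t/(b_t+y))^{a_t} / y."""
    lam = rng.gamma(shape=a_t, scale=1.0 / b_t, size=n_samples)
    monte_carlo = np.mean(np.exp(-lam * y) / y)
    closed_form = (b_t / (b_t + y)) ** a_t / y
    return monte_carlo, closed_form

print(check_unconditional_measure(a_t=3.0, b_t=2.0, y=1.5))
# the two values should agree up to Monte Carlo error
```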

In the exponential reward problem, as well as the Poisson reward problem to be discussed in Section 4.2, we use integration by parts to simplify the free-boundary partial integro-differential equation from Theorem 3.1. First, we rewrite the value function as
\[
V(a, m) = \sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right] = \sup_\tau \frac{1}{c}\, E\left[ (m_0 - r) - e^{-c\tau}(m_\tau - r) + \int_0^\tau e^{-cs}\, \frac{d}{ds} m_s\, ds \right].
\]
Observe that
\[
\frac{d}{ds} m_s = \frac{d}{ds}\, \frac{b_0 + X_s}{a_0 + s - 1} = -\frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds + \frac{1}{a_0 + s - 1}\, dX_s.
\]
We take the expectation of this quantity, whence
\[
\begin{aligned}
E\left[ \int_0^\tau e^{-cs}\, \frac{d}{ds} m(a_s, b_s) \right]
&= E\left[ \int_0^\tau -e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] + E\left[ \int_0^\tau e^{-cs}\, \frac{1}{a_0 + s - 1}\, E\left( dX_s \mid \mathcal{F}_s \right) \right] \\
&= E\left[ \int_0^\tau -e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] + E\left[ \int_0^\tau e^{-cs}\, \frac{b_0 + X_s}{(a_0 + s - 1)^2}\, ds \right] \\
&= 0.
\end{aligned}
\]
Consequently, (20) can be rewritten as
\[
\sup_\tau \frac{1}{c}\, E\left[ e^{-c\tau}(r - m_\tau) + m_0 - r \right] = 0.
\]
We define a new value function $G(a, m) := \sup_\tau E\left[ e^{-c\tau}(r - m_\tau) \right] = cV(a, m) - m + r$ for fixed $r$, and plug it into Theorem 4.1 to obtain the following equivalent free-boundary problem.

Proposition 4.1. The value function $G(a, m)$ in the gamma-exponential problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial G}{\partial a}(a, m) - \dfrac{m}{a - 1}\, \dfrac{\partial G}{\partial m}(a, m) + \displaystyle\int_0^{\infty} \left[ G(a, m + z) - G(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz = cG(a, m), \\[1ex]
G(a, m^*(a)) = r - m^*(a),
\end{cases}
\]
where $m^*(a)$ is an unknown stopping boundary curve. For every point $(a, m)$ on this stopping boundary, the Gittins index $R(a, m)$ is equal to $r$.

Proof: By substituting $V(a, m) = \frac{1}{c}\left[ G(a, m) + m - r \right]$ into Theorem 4.1, we get
\[
\frac{\partial V}{\partial a}(a, m) = \frac{1}{c}\, \frac{\partial G}{\partial a}(a, m), \qquad
-\frac{m}{a - 1}\, \frac{\partial V}{\partial m}(a, m) = -\frac{1}{c}\, \frac{m}{a - 1} \left[ \frac{\partial G}{\partial m}(a, m) + 1 \right],
\]
and
\[
\int_0^{\infty} \left[ V(a, m + z) - V(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz
= \frac{1}{c} \int_0^{\infty} \left[ G(a, m + z) - G(a, m) \right] \frac{1}{z} \left( \frac{m}{m + z} \right)^a dz + \frac{1}{c}\, \frac{m}{a - 1},
\]
while $cV(a, m) - (m - r) = G(a, m)$ and, on the stopping boundary, $\frac{1}{c}\left[ G(a, m) + m - r \right] = 0$.

The formulation in Proposition 4.1 is equivalent to the more intuitive one in Theorem 4.1, where $\mathcal{L}_{\mathrm{inf}} V = \mathcal{L}_{\mathrm{char}} V$ and the value function equals zero on the stopping boundary. We will use it to prove structural properties in Section 5 and for numerical convenience in Section 6.

4.2 A free-boundary problem for gamma-Poisson bandits

In the gamma-Poisson problem, the continuous-time interpolation $(X_t)$ is a Poisson process with unknown rate $\lambda$. Again, we assume $\lambda \sim \mathrm{Gamma}(a_0, b_0)$, let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the path of $(X_t)$ up to time $t$, and update the posterior parameters using
\[
a_t = a_0 + X_t, \qquad b_t = b_0 + t,
\]
as in (5)-(6). As in Section 4.1, we obtain the following free-boundary PIDE by calculating $m_t$ explicitly.

Theorem 4.2. The value function $V(b, m)$ in the gamma-Poisson problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial V}{\partial b}(b, m) - \dfrac{m}{b}\, \dfrac{\partial V}{\partial m}(b, m) + \left[ V\!\left(b, m + \dfrac{1}{b}\right) - V(b, m) \right] m = cV(b, m) - (m - r), \\[1ex]
V(b, m^*(b)) = 0,
\end{cases}
\]
where $m^*(b)$ is an unknown stopping boundary curve. For every point $(b, m)$ on this stopping boundary, the Gittins index $R(b, m)$ is equal to $r$.

Proof: It suffices to use explicit calculation based on the PIDE in Theorem 3.1. In the conditional Lévy process we use to model Poisson rewards, the conditional mean measure given $\lambda$ is identical to that of a Poisson process, $\nu(\lambda, dy) = \lambda\, \delta_1(dy)$, where $\delta_1$ is the Dirac measure at $1$, and the distribution of $\lambda$ given $\mathcal{F}_t$ is $\mathrm{Gamma}(a_t, b_t)$. Therefore, the mean measure intensity $\nu_t(dy)$ is calculated as
\[
\nu_t(dy) = \int_0^{\infty} \lambda\, \frac{b_t^{a_t} \lambda^{a_t - 1} e^{-b_t \lambda}}{\Gamma(a_t)}\, d\lambda\, \delta_1(dy) = \frac{a_t}{b_t}\, \delta_1(dy),
\]
and
\[
m_t = \int_0^{\infty} y\, \nu_t(dy) = \frac{a_t}{b_t} = \frac{a_0 + X_t}{b_0 + t}.
\]

Therefore, $m_t = g(t, X_t) = \frac{a_0 + X_t}{b_0 + t}$ and $\frac{\partial g}{\partial t} = -\frac{a_0 + X_t}{(b_0 + t)^2} = -\frac{m_t}{b_t}$. Also,
\[
\int_{\mathbb{R}_+} \left[ V(t, g(t, X_t + y)) - V(t, g(t, X_t)) \right] \nu_t(dy)
= \int_{\mathbb{R}_+} \left[ V\!\left(t, \frac{a_t + y}{b_t}\right) - V\!\left(t, \frac{a_t}{b_t}\right) \right] \frac{a_t}{b_t}\, \delta_1(dy)
= \left[ V\!\left(t, m_t + \frac{1}{b_t}\right) - V(t, m_t) \right] m_t,
\]
as required.

Again, we use integration by parts to simplify the value function, which gives
\[
\sup_\tau \frac{1}{c}\, E\left[ e^{-c\tau}(r - m_\tau) + m_0 - r \right] = 0.
\]
By defining $G(b, m) := \sup_\tau E\left[ e^{-c\tau}(r - m_\tau) \right] = cV(b, m) - m + r$ for fixed $r$ and replacing $V$ in Theorem 4.2, we obtain the equivalent free-boundary problem for the gamma-Poisson model.

Proposition 4.2. The value function $G(b, m)$ in the gamma-Poisson problem solves the free-boundary problem
\[
\begin{cases}
\dfrac{\partial G}{\partial b}(b, m) - \dfrac{m}{b}\, \dfrac{\partial G}{\partial m}(b, m) + \left[ G\!\left(b, m + \dfrac{1}{b}\right) - G(b, m) \right] m = cG(b, m), \\[1ex]
G(b, m^*(b)) = r - m^*(b),
\end{cases}
\]
where $m^*(b)$ is an unknown stopping boundary curve. For every point $(b, m)$ on this stopping boundary, the Gittins index $R(b, m)$ is equal to $r$. The proof is the same as that of Proposition 4.1, and we omit it here.

Unfortunately, the scaling properties of the Gittins index are not as straightforward in the gamma-Poisson problem as they are in the gamma-exponential problem. However, in the following section, we derive a new scaling property for the gamma-Poisson problem.

5 Theoretical analysis

In this section, we provide theoretical results on the structure of the Gittins index for non-Gaussian problems in continuous time. We focus on two major aspects. Section 5.1 considers scaling properties, most notably for the gamma-Poisson problem. Section 5.2 investigates the continuity and monotonicity of the Gittins index and value function, and concludes the theoretical analysis

with an asymptotic convergence result. These theoretical properties provide more insight into how the Gittins index, an indifference indexing rule, prices the value of a state. Also, these properties match the discrete-time results shown in Gittins et al. (2011) and Yu (2011), which argues in favor of our continuous-time framework as a generalization of the discrete-time model. Throughout this section, we abuse notation slightly by writing the value functions $V$ and $G$, as well as the Gittins index $R$, as functions of $(t, m)$, $(a, m)$, or $(b, m)$, as is convenient. Most results apply to both the gamma-exponential and gamma-Poisson problems, and therefore we use $(t, m)$ where possible. We specifically use $(a, m)$ or $(b, m)$ in proofs when needed. When we hold a parameter constant, we omit it in the argument; e.g., $V(m)$ denotes the value function $V(t, m)$ while we hold $t$ constant. When needed, we also use the subscript $r$ for value functions, e.g. $V_r(t, m)$, to denote that they are calculated given that fixed value of $r$. This notation facilitates writing our proofs in this section.

5.1 Distributional and scaling properties

We begin with two computational results on the predictive distributions appearing in the gamma-exponential and gamma-Poisson problems. These results are used in the proofs of some structural properties in this section, and later on help us to create initial conditions for PIDE solution procedures. The proofs can be found in the Appendix.

Lemma 5.1. In the gamma-exponential model, the predictive distribution of $\frac{X_t}{b_0}$, given $\mathcal{F}_0$, is the beta-prime distribution with parameters $t$ and $a_0$.

Lemma 5.2. In the gamma-Poisson model, the predictive distribution of $X_t$, given $\mathcal{F}_0$, is the generalized negative binomial distribution with parameters $a_0$ and $\frac{t}{b_0 + t}$.
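Both predictive distributions are easy to confirm by simulation. The following sketch is our own illustration (parameter values are arbitrary; scipy's parameterization conventions are noted in the comments): draw $X_t$ by first sampling $\lambda$ from its prior, then compare with the stated beta-prime and negative binomial laws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def predictive_samples(a0, b0, t, model, n=200_000):
    """Sample X_t given F_0 by drawing lambda ~ Gamma(a0, b0) first."""
    lam = rng.gamma(shape=a0, scale=1.0 / b0, size=n)
    if model == "exponential":
        # X_t | lambda ~ Gamma(shape=t, rate=lambda); Lemma 5.1: X_t / b0 ~ BetaPrime(t, a0)
        return rng.gamma(shape=t, scale=1.0 / lam) / b0
    # X_t | lambda ~ Poisson(lambda * t); Lemma 5.2: negative binomial (see parameterization below)
    return rng.poisson(lam * t)

a0, b0, t = 2.5, 1.7, 3.0

# gamma-exponential: empirical quartiles of X_t / b0 should map to 0.25 / 0.5 / 0.75 under BetaPrime(t, a0)
x = predictive_samples(a0, b0, t, "exponential")
q = np.quantile(x, [0.25, 0.5, 0.75])
print(stats.betaprime.cdf(q, t, a0))

# gamma-Poisson: empirical pmf vs. scipy's nbinom(a0, b0/(b0+t)), which is the same law as the
# "generalized negative binomial with parameters a0 and t/(b0+t)" in Lemma 5.2
y = predictive_samples(a0, b0, t, "poisson")
k = np.arange(6)
print(np.array([(y == i).mean() for i in k]))
print(stats.nbinom.pmf(k, a0, b0 / (b0 + t)))
```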

Next, we establish scaling properties of the Gittins index for both non-Gaussian problems. Theorem 5.1 extends the result of (9) to the continuous-time setting, where the Gittins index is defined to be the value of $r$ that solves (15); we include this proof for completeness. We then derive a new scaling property for the gamma-Poisson problem.

Theorem 5.1. In the gamma-exponential problem, the Gittins index satisfies $R(a, b) = b\, R(a, 1)$.

Proof: We factor $b_0$ out of the calibration equation (20) of the Gittins index to obtain
\[
\begin{aligned}
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
&= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{b_0 + X_s}{a_0 + s - 1} - r \right) ds \right] \\
&= b_0\, \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{1 + X_s / b_0}{a_0 + s - 1} - \frac{r}{b_0} \right) ds \right] = 0. & (28)
\end{aligned}
\]
The factor $b_0$ in front of (28) can be dropped, since (28) equals zero. By applying the scaling properties of the gamma process and the gamma distribution, we see that the process $(X_t / b_0)$ has the same law as a conditional gamma process with the prior $\lambda \sim \mathrm{Gamma}(a_0, 1)$. Then, if $R$ balances (20), it follows that the index $R / b_0$ balances the calibration equation for a gamma-exponential problem starting from the knowledge state $(a_0, 1)$. Thus, $R(a, b) = b\, R(a, 1)$, as required.

For the gamma-Poisson problem, we also emphasize the dependence of $R$ on the discount factor $c$, as this plays a role in the scaling property. To our knowledge, Theorem 5.2 is the first known scaling result for problems with Poisson rewards.

Theorem 5.2. In the gamma-Poisson problem, the Gittins index satisfies the scaling property
\[
R(b, m, c) = \frac{1}{\sigma}\, R\!\left( \frac{b}{\sigma},\, \sigma m,\, \sigma c \right)
\]
for all $\sigma > 0$.

Proof: We consider the calibration equation for the gamma-Poisson problem and write
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{a_0 + X_s}{b_0 + s} - r \right) ds \right]
= \sup_\tau E\left[ \int_0^\tau e^{-cs} \left( \frac{a_0 + X_s}{b_0/\sigma + s/\sigma} - r\sigma \right) \frac{1}{\sigma}\, ds \right]. \tag{29}
\]
Letting $t = s/\sigma$ and $Y_t = X_{\sigma t}$, we rewrite (29) as
\[
\sup_\tau E\left[ \int_0^\tau e^{-cs}(m_s - r)\, ds \right]
= \sup_\tau E\left[ \int_0^{\tau/\sigma} e^{-c\sigma t} \left( \frac{a_0 + X_{\sigma t}}{b_0/\sigma + t} - r\sigma \right) dt \right]
= \sup_\tau E\left[ \int_0^{\tau/\sigma} e^{-c\sigma t} \left( \frac{a_0 + Y_t}{b_0/\sigma + t} - r\sigma \right) dt \right].
\]
Observe that if $\tau$ is a stopping time for $X_t$, then $\tau/\sigma$ is a stopping time for $Y_t$, where $Y_t$ is a conditional Poisson process with rate $\sigma\lambda$, which is equivalent to a conditional Poisson process

with the prior $\lambda \sim \mathrm{Gamma}(a_0, b_0/\sigma)$. This suggests a comparison with the calibration equation under discount factor $c\sigma$ and prior $\lambda \sim \mathrm{Gamma}(a_0, b_0/\sigma)$, which yields the desired scaling property $R(b, m, c) = \frac{1}{\sigma} R\!\left(\frac{b}{\sigma}, \sigma m, \sigma c\right)$.

Corollary 5.1. From Theorem 5.2, it follows that $R(b, m, c) = \frac{1}{b}\, R(1, mb, bc) = c\, R\!\left(bc, \frac{m}{c}, 1\right)$.

Thus, we can scale either $b$ or $c$ to $1$, but the other parameter will also be changed. For the gamma-exponential problem, any Gittins index can be obtained by computing a family of stopping boundaries corresponding to $r = 1$ for each value of $c$. In the gamma-Poisson problem, we can standardize the discount factor, but it is necessary to construct a family of curves indexed by $b$ and $m$. Since the value of $c$ is fixed throughout a given bandit problem, while the values of $a$ and $b$ change in each time step, the gamma-exponential problem will be less computationally intensive.

5.2 Continuity and monotonicity

In this section, we study various continuity and monotonicity properties of the value functions $V_r(t, m)$, $G_r(t, m)$, and the Gittins index $R(t, m)$. The continuity of the Gittins index for discrete-time problems has been studied, e.g., by Aalto et al. (2011). Such proofs typically use induction arguments that do not apply in continuous time. We are interested in continuity and monotonicity both to show that our continuous-time framework retains the fundamental structure of discrete-time bandit problems, and also to provide intuition on the Gittins index as an indifference pricing rule. We demonstrate that the index policy assigns higher value to states with higher belief mean $m$, which we describe as higher exploitation value (immediate payoff that can be exploited), and to states with more uncertainty under smaller $t$, which we describe as higher exploration value (potential gain from exploration). Furthermore, even though the non-Gaussian problems we study involve processes with jumps, our continuity results show that these jumps are smoothed out by the optimal retirement strategy. At the end of this section, we prove in Theorem 5.6 that, when $t$ goes to infinity, i.e. we take infinitely many measurements, the exploration value vanishes and the Gittins index approaches the true mean value almost surely. Consequently, the stopping boundary curves from Section 4 are continuous and increase monotonically to the limit $r$. This structure may be useful for initializing PIDE solution procedures, although this issue is outside the scope of the present paper.

First, we recall the predictive distributions of $X_t$ given in Lemmas 5.1 and 5.2. In computing $V_r(t, m)$, $G_r(t, m)$, and $R(t, m)$, only the information in the current knowledge state $(t, m)$ is used, so when we write $m_t$ we are implying the predictive distribution of $(m_t)$ conditioned on $\mathcal{F}_0$. We adopt this viewpoint throughout the following analysis. Starting with the next result, we will repeatedly compare two arbitrary prior knowledge states. Let $(m_t)$ denote the mean process starting with the prior parameters $(t_0, m_0)$, and let $(m'_t)$ denote the process starting with $(t'_0, m'_0)$. Our proofs in this section are heavily based on stochastic dominance theory (Müller & Stoyan, 2002; Shaked & Shanthikumar, 2007), also used by Yu (2011) to establish discrete-time results. We follow the notation used in Müller & Stoyan (2002) and use the usual stochastic order $\leq_{st}$, the convex order $\leq_{cx}$, and the increasing convex order $\leq_{icx}$. For random variables $X$ and $Y$, $X \leq_{st} Y$ if $f_X(c)/f_Y(c)$ is decreasing in $c$. Also, $X \leq_{cx} Y$ (respectively, $X \leq_{icx} Y$) if $E\phi(X) \leq E\phi(Y)$ for all convex (respectively, convex and increasing) functions $\phi$. Useful properties that we will use include the equivalent definition $X \leq_{st} Y$ if $F_X \geq F_Y$ (1.2.1 in Müller & Stoyan, 2002), the implication from $\leq_{icx}$ to $\leq_{cx}$ when $EX = EY$, and the coupling techniques that will be restated as Lemmas 5.4 and 5.5 in this section.

Lemma 5.3. In the gamma-exponential and gamma-Poisson problems, the following two stochastic order properties hold for the predictive mean processes, for every $t$:
\[
m_t \leq_{st} m'_t \quad \text{if } m_0 \leq m'_0 \text{ and } t_0 = t'_0, \tag{30}
\]
\[
m_t \leq_{cx} m'_t \quad \text{if } m_0 = m'_0 \text{ and } t_0 \geq t'_0. \tag{31}
\]

Proof: We prove (30) first. It suffices to show that, when $m_0 \leq m'_0$ and $t_0 = t'_0$, we have $F_{m_t} \geq F_{m'_t}$. For the gamma-exponential problem, $t_0 = t'_0$ implies $a_0 = a'_0$, and we denote this common value by $a_0$. By Lemma 5.1 we have
\[
P(m_t \leq m) = P\left( \frac{b_0 + X_t}{a_0 + t - 1} \leq m \,\middle|\, a_0, m_0 \right) = F\!\left( \frac{m(a_0 + t - 1)}{m_0(a_0 - 1)} - 1 \right)
\]
and
\[
P(m'_t \leq m) = P\left( \frac{b'_0 + X_t}{a_0 + t - 1} \leq m \,\middle|\, a_0, m'_0 \right) = F\!\left( \frac{m(a_0 + t - 1)}{m'_0(a_0 - 1)} - 1 \right),
\]
where $F$ is the cdf of the beta-prime distribution with parameters $t$ and $a_0$. When $m_0 \leq m'_0$,
\[
\frac{m(a_0 + t - 1)}{m_0(a_0 - 1)} - 1 \geq \frac{m(a_0 + t - 1)}{m'_0(a_0 - 1)} - 1,
\]

and therefore $P(m_t \leq m) \geq P(m'_t \leq m)$, i.e., $F_{m_t} \geq F_{m'_t}$. For the gamma-Poisson problem, $t_0 = t'_0$ implies $b_0 = b'_0$, and we denote this common value by $b_0$. By Lemma 5.2 we have
\[
P(m_t \leq m) = P\left( \frac{a_0 + X_t}{b_0 + t} \leq m \,\middle|\, b_0, m_0 \right) = F\big(m(b_0 + t) - m_0 b_0\big)
\]
and
\[
P(m'_t \leq m) = P\left( \frac{a'_0 + X_t}{b_0 + t} \leq m \,\middle|\, b_0, m'_0 \right) = F'\big(m(b_0 + t) - m'_0 b_0\big),
\]
where $F$ is the cdf of the generalized negative binomial (GNB) distribution with parameters $m_0 b_0$ and $\frac{t}{b_0 + t}$, and $F'$ is the cdf of the GNB distribution with parameters $m'_0 b_0$ and $\frac{t}{b_0 + t}$. When $m_0 \leq m'_0$,
\[
F\big(m(b_0 + t) - m_0 b_0\big) \geq F\big(m(b_0 + t) - m'_0 b_0\big) \geq F'\big(m(b_0 + t) - m'_0 b_0\big),
\]
whence $P(m_t \leq m) \geq P(m'_t \leq m)$, as required.

Secondly, we prove (31). We first consider the gamma-exponential case; the gamma-Poisson version can be shown in exactly the same way. In the gamma-exponential case, (31) assumes that $m_0 = \frac{b_0}{a_0 - 1} = \frac{b'_0}{a'_0 - 1} = m'_0$, which we denote by $m_0$, and $a_0 \geq a'_0$. We prove the convex dominance by showing
\[
m_t = \left( \frac{b_0 + X_t}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
\leq_{cx} \left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right) \tag{32}
\]
\[
\leq_{cx} \left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a'_0, b'_0) \right) = m'_t. \tag{33}
\]
We observe that
\[
\left( \frac{b_0 + X_t}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= \left( \frac{b_0 + m_0 t + (X_t - m_0 t)}{a_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= m_0 + \frac{1}{a_0 + t - 1} \big( X_t - m_0 t \mid \lambda \sim \mathrm{Gamma}(a_0, b_0) \big)
\]
and similarly
\[
\left( \frac{b'_0 + X_t}{a'_0 + t - 1} \,\middle|\, \lambda \sim \mathrm{Gamma}(a_0, b_0) \right)
= m_0 + \frac{1}{a'_0 + t - 1} \big( X_t - m_0 t \mid \lambda \sim \mathrm{Gamma}(a_0, b_0) \big)
\]


More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Stochastic Differential Equations.

Stochastic Differential Equations. Chapter 3 Stochastic Differential Equations. 3.1 Existence and Uniqueness. One of the ways of constructing a Diffusion process is to solve the stochastic differential equation dx(t) = σ(t, x(t)) dβ(t)

More information

Infinitely divisible distributions and the Lévy-Khintchine formula

Infinitely divisible distributions and the Lévy-Khintchine formula Infinitely divisible distributions and the Cornell University May 1, 2015 Some definitions Let X be a real-valued random variable with law µ X. Recall that X is said to be infinitely divisible if for every

More information

An essay on the general theory of stochastic processes

An essay on the general theory of stochastic processes Probability Surveys Vol. 3 (26) 345 412 ISSN: 1549-5787 DOI: 1.1214/1549578614 An essay on the general theory of stochastic processes Ashkan Nikeghbali ETHZ Departement Mathematik, Rämistrasse 11, HG G16

More information

Common Knowledge and Sequential Team Problems

Common Knowledge and Sequential Team Problems Common Knowledge and Sequential Team Problems Authors: Ashutosh Nayyar and Demosthenis Teneketzis Computer Engineering Technical Report Number CENG-2018-02 Ming Hsieh Department of Electrical Engineering

More information

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Gabriel Y. Weintraub, Lanier Benkard, and Benjamin Van Roy Stanford University {gweintra,lanierb,bvr}@stanford.edu Abstract

More information

On Reflecting Brownian Motion with Drift

On Reflecting Brownian Motion with Drift Proc. Symp. Stoch. Syst. Osaka, 25), ISCIE Kyoto, 26, 1-5) On Reflecting Brownian Motion with Drift Goran Peskir This version: 12 June 26 First version: 1 September 25 Research Report No. 3, 25, Probability

More information

INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS

INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS Applied Probability Trust (4 February 2008) INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS SAVAS DAYANIK, Princeton University WARREN POWELL, Princeton University KAZUTOSHI

More information

Multi-armed Bandits and the Gittins Index

Multi-armed Bandits and the Gittins Index Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6 Two-armed Bandit 3, 10, 4, 9, 12, 1,... 5, 6, 2, 15, 2, 7,... Two-armed

More information

An Uncertain Control Model with Application to. Production-Inventory System

An Uncertain Control Model with Application to. Production-Inventory System An Uncertain Control Model with Application to Production-Inventory System Kai Yao 1, Zhongfeng Qin 2 1 Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China 2 School of Economics

More information

Lecture Notes - Dynamic Moral Hazard

Lecture Notes - Dynamic Moral Hazard Lecture Notes - Dynamic Moral Hazard Simon Board and Moritz Meyer-ter-Vehn October 27, 2011 1 Marginal Cost of Providing Utility is Martingale (Rogerson 85) 1.1 Setup Two periods, no discounting Actions

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

Ernesto Mordecki 1. Lecture III. PASI - Guanajuato - June 2010

Ernesto Mordecki 1. Lecture III. PASI - Guanajuato - June 2010 Optimal stopping for Hunt and Lévy processes Ernesto Mordecki 1 Lecture III. PASI - Guanajuato - June 2010 1Joint work with Paavo Salminen (Åbo, Finland) 1 Plan of the talk 1. Motivation: from Finance

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

A new approach for investment performance measurement. 3rd WCMF, Santa Barbara November 2009

A new approach for investment performance measurement. 3rd WCMF, Santa Barbara November 2009 A new approach for investment performance measurement 3rd WCMF, Santa Barbara November 2009 Thaleia Zariphopoulou University of Oxford, Oxford-Man Institute and The University of Texas at Austin 1 Performance

More information

A Structured Multiarmed Bandit Problem and the Greedy Policy

A Structured Multiarmed Bandit Problem and the Greedy Policy A Structured Multiarmed Bandit Problem and the Greedy Policy Adam J. Mersereau Kenan-Flagler Business School, University of North Carolina ajm@unc.edu Paat Rusmevichientong School of Operations Research

More information

On the robustness of a one-period look-ahead policy in multi-armed bandit problems

On the robustness of a one-period look-ahead policy in multi-armed bandit problems Procedia Computer Science 00 (200) (202) 0 635 644 International Conference on Computational Science, ICCS 200 Procedia Computer Science www.elsevier.com/locate/procedia On the robustness of a one-period

More information

arxiv: v7 [cs.lg] 7 Jul 2017

arxiv: v7 [cs.lg] 7 Jul 2017 Learning to Optimize Via Information-Directed Sampling Daniel Russo 1 and Benjamin Van Roy 2 1 Northwestern University, daniel.russo@kellogg.northwestern.edu 2 Stanford University, bvr@stanford.edu arxiv:1403.5556v7

More information

OPTIMAL STOPPING OF A BROWNIAN BRIDGE

OPTIMAL STOPPING OF A BROWNIAN BRIDGE OPTIMAL STOPPING OF A BROWNIAN BRIDGE ERIK EKSTRÖM AND HENRIK WANNTORP Abstract. We study several optimal stopping problems in which the gains process is a Brownian bridge or a functional of a Brownian

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Bayesian exploration for approximate dynamic programming

Bayesian exploration for approximate dynamic programming Bayesian exploration for approximate dynamic programming Ilya O. Ryzhov Martijn R.K. Mes Warren B. Powell Gerald A. van den Berg December 18, 2017 Abstract Approximate dynamic programming (ADP) is a general

More information

A Class of Fractional Stochastic Differential Equations

A Class of Fractional Stochastic Differential Equations Vietnam Journal of Mathematics 36:38) 71 79 Vietnam Journal of MATHEMATICS VAST 8 A Class of Fractional Stochastic Differential Equations Nguyen Tien Dung Department of Mathematics, Vietnam National University,

More information

Solving the Poisson Disorder Problem

Solving the Poisson Disorder Problem Advances in Finance and Stochastics: Essays in Honour of Dieter Sondermann, Springer-Verlag, 22, (295-32) Research Report No. 49, 2, Dept. Theoret. Statist. Aarhus Solving the Poisson Disorder Problem

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Jump-type Levy Processes

Jump-type Levy Processes Jump-type Levy Processes Ernst Eberlein Handbook of Financial Time Series Outline Table of contents Probabilistic Structure of Levy Processes Levy process Levy-Ito decomposition Jump part Probabilistic

More information

Numerical Methods with Lévy Processes

Numerical Methods with Lévy Processes Numerical Methods with Lévy Processes 1 Objective: i) Find models of asset returns, etc ii) Get numbers out of them. Why? VaR and risk management Valuing and hedging derivatives Why not? Usual assumption:

More information

ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES. 1. Introduction

ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES. 1. Introduction ON ADDITIVE TIME-CHANGES OF FELLER PROCESSES ALEKSANDAR MIJATOVIĆ AND MARTIJN PISTORIUS Abstract. In this note we generalise the Phillips theorem [1] on the subordination of Feller processes by Lévy subordinators

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time

Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time Economics 2010c: Lectures 9-10 Bellman Equation in Continuous Time David Laibson 9/30/2014 Outline Lectures 9-10: 9.1 Continuous-time Bellman Equation 9.2 Application: Merton s Problem 9.3 Application:

More information

Bayesian quickest detection problems for some diffusion processes

Bayesian quickest detection problems for some diffusion processes Bayesian quickest detection problems for some diffusion processes Pavel V. Gapeev Albert N. Shiryaev We study the Bayesian problems of detecting a change in the drift rate of an observable diffusion process

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago January 2016 Consider a situation where one person, call him Sender,

More information

Bayesian Congestion Control over a Markovian Network Bandwidth Process

Bayesian Congestion Control over a Markovian Network Bandwidth Process Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard 1/30 Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard (USC) Joint work

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Coupled Bisection for Root Ordering

Coupled Bisection for Root Ordering Coupled Bisection for Root Ordering Stephen N. Pallone, Peter I. Frazier, Shane G. Henderson School of Operations Research and Information Engineering Cornell University, Ithaca, NY 14853 Abstract We consider

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

The strictly 1/2-stable example

The strictly 1/2-stable example The strictly 1/2-stable example 1 Direct approach: building a Lévy pure jump process on R Bert Fristedt provided key mathematical facts for this example. A pure jump Lévy process X is a Lévy process such

More information

Week 9 Generators, duality, change of measure

Week 9 Generators, duality, change of measure Week 9 Generators, duality, change of measure Jonathan Goodman November 18, 013 1 Generators This section describes a common abstract way to describe many of the differential equations related to Markov

More information

n E(X t T n = lim X s Tn = X s

n E(X t T n = lim X s Tn = X s Stochastic Calculus Example sheet - Lent 15 Michael Tehranchi Problem 1. Let X be a local martingale. Prove that X is a uniformly integrable martingale if and only X is of class D. Solution 1. If If direction:

More information

Some Aspects of Universal Portfolio

Some Aspects of Universal Portfolio 1 Some Aspects of Universal Portfolio Tomoyuki Ichiba (UC Santa Barbara) joint work with Marcel Brod (ETH Zurich) Conference on Stochastic Asymptotics & Applications Sixth Western Conference on Mathematical

More information

Lecture 21 Representations of Martingales

Lecture 21 Representations of Martingales Lecture 21: Representations of Martingales 1 of 11 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 21 Representations of Martingales Right-continuous inverses Let

More information

Topics in fractional Brownian motion

Topics in fractional Brownian motion Topics in fractional Brownian motion Esko Valkeila Spring School, Jena 25.3. 2011 We plan to discuss the following items during these lectures: Fractional Brownian motion and its properties. Topics in

More information

FE610 Stochastic Calculus for Financial Engineers. Stevens Institute of Technology

FE610 Stochastic Calculus for Financial Engineers. Stevens Institute of Technology FE610 Stochastic Calculus for Financial Engineers Lecture 3. Calculaus in Deterministic and Stochastic Environments Steve Yang Stevens Institute of Technology 01/31/2012 Outline 1 Modeling Random Behavior

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision

More information

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem Parisa Mansourifard 1/37 Bayesian Congestion Control over a Markovian Network Bandwidth Process:

More information

The concentration of a drug in blood. Exponential decay. Different realizations. Exponential decay with noise. dc(t) dt.

The concentration of a drug in blood. Exponential decay. Different realizations. Exponential decay with noise. dc(t) dt. The concentration of a drug in blood Exponential decay C12 concentration 2 4 6 8 1 C12 concentration 2 4 6 8 1 dc(t) dt = µc(t) C(t) = C()e µt 2 4 6 8 1 12 time in minutes 2 4 6 8 1 12 time in minutes

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Value and Policy Iteration

Value and Policy Iteration Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Lecture Notes - Dynamic Moral Hazard

Lecture Notes - Dynamic Moral Hazard Lecture Notes - Dynamic Moral Hazard Simon Board and Moritz Meyer-ter-Vehn October 23, 2012 1 Dynamic Moral Hazard E ects Consumption smoothing Statistical inference More strategies Renegotiation Non-separable

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Adaptive Multilevel Splitting Algorithm

A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Adaptive Multilevel Splitting Algorithm A Central Limit Theorem for Fleming-Viot Particle Systems Application to the Algorithm F. Cérou 1,2 B. Delyon 2 A. Guyader 3 M. Rousset 1,2 1 Inria Rennes Bretagne Atlantique 2 IRMAR, Université de Rennes

More information

Lecture 17 Brownian motion as a Markov process

Lecture 17 Brownian motion as a Markov process Lecture 17: Brownian motion as a Markov process 1 of 14 Course: Theory of Probability II Term: Spring 2015 Instructor: Gordan Zitkovic Lecture 17 Brownian motion as a Markov process Brownian motion is

More information

Erdős-Renyi random graphs basics

Erdős-Renyi random graphs basics Erdős-Renyi random graphs basics Nathanaël Berestycki U.B.C. - class on percolation We take n vertices and a number p = p(n) with < p < 1. Let G(n, p(n)) be the graph such that there is an edge between

More information

Uniformly Uniformly-ergodic Markov chains and BSDEs

Uniformly Uniformly-ergodic Markov chains and BSDEs Uniformly Uniformly-ergodic Markov chains and BSDEs Samuel N. Cohen Mathematical Institute, University of Oxford (Based on joint work with Ying Hu, Robert Elliott, Lukas Szpruch) Centre Henri Lebesgue,

More information

I forgot to mention last time: in the Ito formula for two standard processes, putting

I forgot to mention last time: in the Ito formula for two standard processes, putting I forgot to mention last time: in the Ito formula for two standard processes, putting dx t = a t dt + b t db t dy t = α t dt + β t db t, and taking f(x, y = xy, one has f x = y, f y = x, and f xx = f yy

More information

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals

Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Functional Limit theorems for the quadratic variation of a continuous time random walk and for certain stochastic integrals Noèlia Viles Cuadros BCAM- Basque Center of Applied Mathematics with Prof. Enrico

More information

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

On pathwise stochastic integration

On pathwise stochastic integration On pathwise stochastic integration Rafa l Marcin Lochowski Afican Institute for Mathematical Sciences, Warsaw School of Economics UWC seminar Rafa l Marcin Lochowski (AIMS, WSE) On pathwise stochastic

More information

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME

ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME SAUL D. JACKA AND ALEKSANDAR MIJATOVIĆ Abstract. We develop a general approach to the Policy Improvement Algorithm (PIA) for stochastic control problems

More information

Figure 10.1: Recording when the event E occurs

Figure 10.1: Recording when the event E occurs 10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable

More information

WITH SWITCHING ARMS EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT

WITH SWITCHING ARMS EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT lournal of Applied Mathematics and Stochastic Analysis, 12:2 (1999), 151-160. EXACT SOLUTION OF THE BELLMAN EQUATION FOR A / -DISCOUNTED REWARD IN A TWO-ARMED BANDIT WITH SWITCHING ARMS DONCHO S. DONCHEV

More information

Lecture 4a: Continuous-Time Markov Chain Models

Lecture 4a: Continuous-Time Markov Chain Models Lecture 4a: Continuous-Time Markov Chain Models Continuous-time Markov chains are stochastic processes whose time is continuous, t [0, ), but the random variables are discrete. Prominent examples of continuous-time

More information

(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS

(2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS (2m)-TH MEAN BEHAVIOR OF SOLUTIONS OF STOCHASTIC DIFFERENTIAL EQUATIONS UNDER PARAMETRIC PERTURBATIONS Svetlana Janković and Miljana Jovanović Faculty of Science, Department of Mathematics, University

More information