A Value Iteration Algorithm for Partially Observed Markov Decision Process Multi-armed Bandits


Vikram Krishnamurthy (a), Bo Wahlberg (b), Frank Lingelbach (b)
(a) Dept. of Electrical and Computer Engineering, The University of British Columbia, Vancouver, B.C., Canada
(b) S3-Automatic Control, KTH, S-100 44 Stockholm, Sweden
(This work was supported by an ARC large grant and the Centre of Expertise in Networked Decision Systems.)

Abstract
A value iteration based algorithm is given for computing the Gittins index of a Partially Observed Markov Decision Process (POMDP) multi-armed bandit problem. This problem concerns the dynamic allocation of effort among a number of competing projects, of which only one can be worked on in any time period. The active project evolves according to a finite state Markov chain and generates a reward, while the states of the idle projects remain fixed. In this contribution it is assumed that the state of the active project can only be observed indirectly through noisy observations. The objective is to find the optimal policy, based on this partial information, for deciding which project to work on at each time so as to maximize the total expected reward. The solution is obtained by transforming the problem into a standard POMDP, for which efficient near-optimal algorithms exist. A numerical example from the field of task planning for an autonomous robot illustrates the algorithms.

1. Introduction and Problem Formulation
The multi-armed bandit problem is an example of a dynamic stochastic scheduling problem in which the allocation of effort among a number of competing projects is optimized sequentially. Numerous applications of the finite state Markov chain multi-armed bandit problem appear in the robotics and stochastic control literature; see [7], [4], and [8] for examples in job scheduling, resource allocation for manufacturing systems, and target tracking. The multi-armed bandit structure implies that the optimal policy can be found by a so-called Gittins index rule [7]. The multi-project optimization problem is thereby reduced to a finite number of single-project optimization problems. Several algorithms have been proposed to solve this fully observed finite state bandit problem [9], [4]. This paper considers bandit problems where the underlying finite state Markov chain is not directly observed; instead, the observations, which are assumed to belong to a finite set, are a probabilistic function of the unobserved finite state Markov chain. Such problems are known as Partially Observed Markov Decision Process (POMDP) multi-armed bandits, and are also called Hidden Markov Model (HMM) multi-armed bandits. The POMDP model suits many topics in robotics very well, since it captures uncertainty both in the state transitions and in the observations, which is characteristic of most robot applications. In [6] this approach is used to model robot navigation in a known environment. In recent work [8], a return-to-state argument was used in an attempt to compute the Gittins index of a POMDP multi-armed bandit problem. However, the argument in [8] is incorrect; indeed, there appears to be no obvious way of obtaining a finite dimensional characterization of the value function in the return-to-state argument. In this paper it is shown that, by introducing the retirement option formulation [7] of the multi-armed bandit problem, a finite dimensional value iteration algorithm can be derived for computing the Gittins index of a POMDP bandit.
The key idea is to extend the state vector to include retirement information. The structure of this paper is as follows. In Section 1.1 we introduce the POMDP multi-armed bandit problem, and Hidden Markov Model state estimation is reviewed in Section 1.2. A value iteration algorithm for computing the Gittins index is derived in Section 2. The main result of the paper is given in Section 2.1, where Theorem 1 gives a finite dimensional characterization of the Gittins index. Section 2.2 deals with numerical algorithms and provides a numerical example of the proposed algorithm. Finally, Section 3 concludes the paper.

1.1. The POMDP Multi-armed Bandit Problem
Consider P independent projects p = 1,...,P. Assume each project p has a finite number of states N_p. Let s_k^p denote the state of project p at discrete time k = 0, 1,.... At each time instant only one of these projects can be worked on.
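To make these dynamics concrete, the following sketch (Python/NumPy; the function name and data layout are illustrative assumptions rather than anything specified above) simulates one decision epoch: the active project's state moves according to its transition matrix and emits a noisy observation, the idle projects stay frozen, and the discounted reward is accrued.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(states, u, A, B, R, beta, k):
    """Advance the bandit model by one time step.

    states : list of current (integer) states, one per project
    u      : index of the project worked on at time k
    A[p], B[p], R[p] : transition matrix, observation matrix and
                       reward vector of project p (assumed inputs)
    Returns the discounted reward, the observation of the active
    project, and the updated state list (idle projects are frozen).
    """
    reward = (beta ** k) * R[u][states[u]]               # beta^k R(s_k^u, u)
    s_next = rng.choice(len(R[u]), p=A[u][states[u]])    # row of A^u gives P(s_{k+1} | s_k)
    y_next = rng.choice(B[u].shape[1], p=B[u][s_next])   # noisy observation of the new state
    new_states = list(states)
    new_states[u] = s_next                               # only the active project moves
    return reward, y_next, new_states
```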

If project p is worked on at time k, an instantaneous reward β^k R(s_k^p, p) is accrued; R(s_k^p, p) ≥ 0 is assumed finite. Here 0 < β < 1 denotes the discount factor. The state s_k^p evolves according to an N_p-state homogeneous Markov chain with transition probability matrix A^p = (a_ij^p), i, j ∈ {1,...,N_p}, where

a_ij^p = P( s_{k+1}^p = j | s_k^p = i )   if project p is worked on at time k.   (1)

The states of all the other P−1 idle projects are unaffected, i.e. s_{k+1}^q = s_k^q if project q is idle at time k. All projects are initialized with s_0^p ~ x_0^p, where the x_0^p are specified initial distributions for p = 1,...,P. The state of the active project p is indirectly observed via noisy measurements (observations) y_{k+1}^p of the active project state s_{k+1}^p. Assume that these observations belong to a finite set indexed by m = 1,...,M_p. Let B^p = (b_im^p), i ∈ {1,...,N_p}, m ∈ {1,...,M_p}, denote the observation probability (symbol probability) matrix of the Hidden Markov Model (HMM), where each element b_im^p = P( y_{k+1}^p = m | s_{k+1}^p = i, u_k = p ). Let u_k ∈ {1,...,P} denote which project is worked on at time k. Consequently, s_{k+1}^{u_k} denotes the state of the active project at time k+1. Let the policy µ denote the stationary sequence of controls {u_k, k = 1, 2,...}. The total expected discounted reward over an infinite time horizon is given by

J_µ = E{ Σ_{k=0}^∞ β^k R( s_k^{u_k}, u_k ) },   (2)

where E denotes mathematical expectation. The aim is to determine the optimal stationary policy which yields the maximum reward in (2). Denote the observation history at time k as Y_k = ( y_1^{u_0},..., y_k^{u_{k-1}} ) and let U_k = ( u_0,..., u_k ). Define U as the class of all admissible policies µ. Solving the Markov chain multi-armed bandit problem [9] involves computing the optimal policy µ* = arg max_{µ ∈ U} J_µ, where U denotes the set of sequences u_k which map Y_k to {1,...,P}. Let J* = max_{µ ∈ U} J_µ denote the optimal expected reward. Remark: Because M_p and N_p are finite, and the rewards R( s_k^{u_k}, u_k ) are uniformly bounded from above and below, the optimization (2) is well defined.

1.2. Information State Formulation
The above partially observed multi-armed bandit problem can be re-expressed as a fully observed multi-armed bandit in terms of the information state [9]. For each project p, denote by x_k^p the information state at time k:

x_k^p = ( x_k^p(i), i = 1,...,N_p ),   where x_k^p(i) = P( s_k^p = i | Y_k, U_{k-1} ).   (3)

The HMM multi-armed bandit problem can then be viewed as the following scheduling problem. Consider P parallel HMM state estimation filters, one for each project. If project p is active, an observation y_{k+1}^p is obtained and the information state x_{k+1}^p is computed recursively by the HMM state filter (also known as the forward algorithm or Baum's algorithm [12]) according to

x_{k+1}^p = B^p(y_{k+1}^p) A^p' x_k^p / ( 1_{N_p}' B^p(y_{k+1}^p) A^p' x_k^p )   if project p is worked on at time k,   (4)

where, if y_{k+1}^p = m, then B^p(m) = diag( b_1m^p,..., b_{N_p}m^p ) is the diagonal matrix formed by the m-th column of the observation matrix B^p, and 1_{N_p} is an N_p-dimensional column vector of ones. The state estimates of the other P−1 projects remain unaffected, i.e.

x_{k+1}^q = x_k^q   if project q is not worked on, q ∈ {1,...,P}, q ≠ p.   (5)

Let X^p denote the state space of information states x_k^p, p ∈ {1, 2,...,P}, which is the N_p-dimensional simplex

X^p = { x ∈ R^{N_p} : 1_{N_p}' x = 1, 0 ≤ x(i) ≤ 1 for all i ∈ {1,...,N_p} }.   (6)

Using the smoothing property of conditional expectations, the reward functional (2) can be re-written in terms of the information state as

J_µ = E{ Σ_{k=0}^∞ β^k R(u_k)' x_k^{u_k} },   (7)

where R(u) denotes the N_u-dimensional reward vector ( R(1,u),..., R(N_u,u) )' and R(s,u) is defined as in (2). The aim is to compute the optimal policy µ* = arg max_{µ ∈ U} J_µ.
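The filter update (4)–(5) is a single Bayes-rule step. A minimal sketch, assuming A^p and B^p are stored as NumPy arrays and states and observation symbols are zero-indexed (the helper names are ours):

```python
import numpy as np

def hmm_filter_update(x, y, A, B):
    """One step of the HMM filter (4): x_{k+1} = B(y) A' x / (1' B(y) A' x).

    x : information state of the active project (length N_p, sums to 1)
    y : observed symbol in {0, ..., M_p - 1}
    A : N_p x N_p transition matrix, B : N_p x M_p observation matrix
    """
    unnormalised = np.diag(B[:, y]) @ A.T @ x   # B(y) A' x
    return unnormalised / unnormalised.sum()    # divide by 1' B(y) A' x

def update_all(x_list, y, u, A_list, B_list):
    """Update the bank of P filters: only the active project u changes, as in (5)."""
    x_list = list(x_list)
    x_list[u] = hmm_filter_update(x_list[u], y, A_list[u], B_list[u])
    return x_list
```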

It is well known [14] that the optimal policy has an indexable rule [9], [4]: for each project p there is a function γ^p(x_k^p), called the Gittins index, which is a function only of the project p and the information state x_k^p, and the optimal scheduling policy at time k is to work on the project with the largest Gittins index, i.e. choose project q_k where q_k = arg max_{p ∈ {1,...,P}} γ^p(x_k^p). Thus computing the Gittins index is a key requirement for solving any multi-armed bandit problem. For a formal definition of the Gittins index in terms of stopping times, see [9]. We will work with a more convenient equivalent definition in terms of the parameterized retirement reward M [3], [7]. The fundamental problem in the POMDP case is that the Gittins index γ^p(x) must be evaluated for each x ∈ X^p, an uncountably infinite set. In contrast, for the standard finite state Markov multi-armed bandit problem considered extensively in the literature (e.g. [7]), the Gittins index can be computed straightforwardly. The main contribution of our paper is to present a finite dimensional value iteration algorithm for computing the Gittins index.

2. Value Iteration Algorithm for Computing the Gittins Index
This section presents a value iteration algorithm for computing the Gittins index γ^p(x) for each project p ∈ {1, 2,..., P}. As with any dynamic programming formulation, the computation of the Gittins index for each project p is off-line, independent of the Gittins indices of the other P−1 projects, and can be done a priori. For each project p, let M denote a positive real number such that

0 ≤ M ≤ M̄,   M̄ = max_{i ∈ {1,...,N_p}} R(i, p) / (1 − β).   (8)

To simplify subsequent notation, we omit the superscript p on M and M̄, and the time subscript k on x_k^p. The Gittins index [3], [7] of project p with information state x can be defined as

γ^p(x) = min{ M : V(x, M) = M },   (9)

where V(x, M) satisfies the functional (Bellman's) recursion

V(x, M) = max{ R^p' x + β Σ_{m=1}^{M_p} V( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ) 1_{N_p}' B^p(m) A^p' x ,  M },   (10)

and M denotes the parameterized retirement reward. Observe that P( y_{k+1} = m ) = 1_{N_p}' B^p(m) A^p' x. The N-th order approximation of V(x, M) is obtained from the following value iteration algorithm, k = 0, 1,..., N−1, initialized with V_0(x, M) = max{ R^p' x, M }:

V_{k+1}(x, M) = max{ R^p' x + β Σ_{m=1}^{M_p} V_k( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ) 1_{N_p}' B^p(m) A^p' x ,  M }.   (11)

Here V_k(x, M) is the value function of a k-horizon dynamic programming recursion. Let γ_N^p(x) denote the approximate Gittins index computed via the value iteration algorithm (11), i.e.

γ_N^p(x) = min{ M : V_N(x, M) = M }.   (12)

It is well known that V(x, M) can be uniformly approximated arbitrarily closely by a finite horizon value function V_N(x, M) of (11). A straightforward application of this result shows that the finite horizon Gittins index approximation γ_N^p(x) of (12) can be made arbitrarily accurate by choosing the horizon N sufficiently large. This is summarized in the following corollary.

Corollary 1. The infinite horizon Gittins index γ^p(x) of state x can be uniformly approximated arbitrarily closely by the near-optimal Gittins index γ_N^p(x) computed according to (12) for a finite horizon N. In particular, for any δ > 0 there exists a finite horizon N such that: (i) sup_{x ∈ X^p} | V_N(x, M) − V(x, M) | ≤ δ for all M ∈ [0, M̄]; (ii) for this N, sup_{x ∈ X^p} | γ^p(x) − γ_N^p(x) | ≤ 2βδ / (1 − β).
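To make the definitions (9)–(12) concrete, the brute-force sketch below (our own illustration, not the algorithm proposed in this paper) evaluates the N-horizon recursion (11) by enumerating observation sequences and locates γ_N(x) by bisection over M ∈ [0, M̄]. Its cost grows as M_p^N per belief point, which is exactly the impracticality that the finite dimensional characterization of Section 2.1 removes.

```python
import numpy as np

def sigma(x, m, A, B):
    """sigma(x, m) = 1' B(m) A' x, the probability of observing symbol m."""
    return float(B[:, m] @ (A.T @ x))

def T(x, m, A, B):
    """Filter update T(x, m) = B(m) A' x / sigma(x, m)."""
    v = B[:, m] * (A.T @ x)
    return v / v.sum()

def V(x, M, k, A, B, R, beta):
    """k-horizon value function of recursion (11) with retirement reward M."""
    if k == 0:
        return max(R @ x, M)
    cont = R @ x + beta * sum(
        V(T(x, m, A, B), M, k - 1, A, B, R, beta) * sigma(x, m, A, B)
        for m in range(B.shape[1]))
    return max(cont, M)

def gittins_naive(x, A, B, R, beta, N=8, tol=1e-4):
    """Approximate Gittins index gamma_N(x) = min{M : V_N(x, M) = M} by bisection."""
    lo, hi = 0.0, R.max() / (1.0 - beta)        # M_bar from (8)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if V(x, mid, N, A, B, R, beta) > mid + tol:
            lo = mid                            # continuing still beats retiring: index is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```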

2.1. Finite Dimensional Characterization of the Gittins Index
Unfortunately, the value iteration recursion (11) does not directly translate into a practical solution methodology. The fundamental problem with (11) is that at each iteration one needs to compute V_k(x, M) over the uncountably infinite sets x ∈ X^p and M ∈ [0, M̄]. The main contribution of this section is to construct a finite dimensional characterization of the value functions V_k(x, M), k = 1, 2,..., N, and hence of the near-optimal Gittins index γ_N^p(x). We will show that under a different coordinate basis V_k(x, M) is piecewise linear and convex. Computing γ_N^p(x) in (12) then simply amounts to evaluating V_N(x, M) at the hyper-planes formed by the intersection of the piecewise linear segments. Constructive algorithms based on this finite characterization are given in Sec. 2.2 to compute the Gittins index for the information states of the original bandit process.

Define the (N_p + 1)-dimensional augmented information state x̄_k ∈ { (x', 0)', (0_{N_p}', 1)' }, where x ∈ X^p is as in (3). As described below, x̄_k = (0_{N_p}', 1)' is interpreted as the retirement information state. Define an augmented observation process ȳ_k ∈ {1,..., M_p + 1} and the corresponding augmented transition and observation probability matrices

A_1 = [ A^p  0_{N_p} ; 0_{N_p}'  1 ],    A_2 = [ 0_{N_p × N_p}  1_{N_p} ; 0_{N_p}'  1 ],
B_1 = [ B^p  0_{N_p} ; 0_{M_p}'  1 ],    B_2 = I_{N_p+1},
B_1(m) = diag( column m of B_1 ),   B_2(m) = diag( column m of B_2 ),   m ∈ {1,..., M_p + 1}.   (13)

To construct a finite dimensional representation of V_k(x, M), we present a coordinate transformation under which V_k(x, M) is the value function of a standard POMDP, and (x, M) maps invertibly to the information state of this POMDP. Because 0 ≤ M ≤ M̄, define the pseudo-information state

z = ( M/M̄ , 1 − M/M̄ )',   0 ≤ M ≤ M̄.   (14)

Define the information state π_k and the following coordinate transformation (⊗ denotes the Kronecker/tensor product):

π_k = z ⊗ x̄_k,
Ā_1 = I_{2×2} ⊗ A_1,   Ā_2 = I_{2×2} ⊗ A_2,
B̄_1(m) = I_{2×2} ⊗ B_1(m),   B̄_2(m) = I_{2×2} ⊗ B_2(m),
R̄_1 = ( R^p' , 0 , R^p' , 0 )',   R̄_2 = M̄ ( 1_{N_p}' , 0 , 0_{N_p}' , 0 )'.   (15)

It is easily shown that Ā_1, Ā_2 are transition probability matrices (their rows add to one and each element is non-negative) and that B̄_1(m), B̄_2(m) are observation probability matrices. Also, the 2(N_p + 1)-dimensional vector π_k is an information state, since it belongs to

Π = { π ∈ R^{2(N_p+1)} : 1_{2(N_p+1)}' π = 1, π(i) ≥ 0, i = 1, 2,..., 2(N_p + 1) }.   (16)

Finally, define the control variable ν_k ∈ {1, 2} at each time k, where ν_k maps π_k to {1, 2}. Note that ν_k = 1 means continue and ν_k = 2 means retire. Define the policy sequence ν = (ν_1,..., ν_N). The policy ν is used to compute the Gittins index of project p; it is not to be confused with the policy µ defined in Sec. 1.1, which determines which project to work on. Consider now the following POMDP problem. The parameters Ā_ν, B̄_ν(m), R̄_ν defined in (15) form the transition probabilities, observation probabilities and reward vectors of a POMDP with two-valued control ν ∈ {1, 2} and objective

max_ν E{ Σ_{k=0}^∞ β^k R̄_{ν_k}' π_k }.

Here the vector π_k ∈ Π is an information state for this POMDP and evolves according to

π_{k+1} = B̄_ν(ȳ_{k+1}) Ā_ν' π_k / ( 1_{2(N_p+1)}' B̄_ν(ȳ_{k+1}) Ā_ν' π_k ),   ν ∈ {1, 2},   ȳ_{k+1} ∈ {1,..., M_p + 1},

depending on the control ν chosen at each time instant. Note that ν = 2 results in π_{k+1} attaining the retirement state z ⊗ (0_{N_p}', 1)'. The value iteration recursion for optimizing this POMDP over the finite horizon N is given by

V̄_{k+1}(π) = max{ R̄_1' π + β Σ_{m=1}^{M_p+1} V̄_k( B̄_1(m) Ā_1' π / (1' B̄_1(m) Ā_1' π) ) 1' B̄_1(m) Ā_1' π ,
                 R̄_2' π + β Σ_{m=1}^{M_p+1} V̄_k( B̄_2(m) Ā_2' π / (1' B̄_2(m) Ā_2' π) ) 1' B̄_2(m) Ā_2' π },   k = 0, 1, 2,..., N−1,
V̄_0(π) = max{ R̄_1' π, R̄_2' π }.   (17)

Here V̄_k(π) denotes the value function of the dynamic program,

V̄_k(π) = max_ν E{ Σ_{t=0}^{k} β^t R̄_{ν_t}' π_t | π_0 = π }.
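The construction (13)–(15) is mechanical, and a sketch may make the block structure easier to see. The code below (NumPy; the array layout is our assumption, and B_2 is omitted for brevity) builds the augmented matrices of (13) and their Kronecker-lifted versions of (15) from A^p, B^p, R^p and M̄.

```python
import numpy as np

def lift_project(A, B, R, M_bar):
    """Build the augmented matrices of (13) and the lifted quantities of (15).

    A : N x N transition matrix, B : N x M observation matrix,
    R : length-N reward vector, M_bar : upper bound (8) on the retirement reward.
    The (N+1)-th augmented state is the retirement state; symbol M+1 is the
    extra observation associated with retirement.
    """
    N, M = B.shape
    A1 = np.block([[A,                np.zeros((N, 1))],
                   [np.zeros((1, N)), np.ones((1, 1))]])   # "continue": retirement state absorbing
    A2 = np.block([[np.zeros((N, N)), np.ones((N, 1))],
                   [np.zeros((1, N)), np.ones((1, 1))]])   # "retire": jump to retirement state
    B1 = np.block([[B,                np.zeros((N, 1))],
                   [np.zeros((1, M)), np.ones((1, 1))]])   # observation model under "continue"
    # Lift by the 2-dimensional pseudo information state z of (14) via Kronecker products (15).
    A1_bar = np.kron(np.eye(2), A1)
    A2_bar = np.kron(np.eye(2), A2)
    B1_bar = [np.kron(np.eye(2), np.diag(B1[:, m])) for m in range(M + 1)]
    R1_bar = np.concatenate([R, [0.0], R, [0.0]])                          # continue reward
    R2_bar = M_bar * np.concatenate([np.ones(N), [0.0], np.zeros(N + 1)])  # retire reward
    return A1_bar, A2_bar, B1_bar, R1_bar, R2_bar
```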

The following is the main result of this paper.

Theorem 1. Under the coordinate basis defined in (15), the following three statements hold.
1. The value function V_k(x, M) in (11) for computing the Gittins index is identically equal to the value function V̄_k(π) of the standard POMDP (17).
2. At each iteration k = 0, 1,..., N, the value function V̄_k(π) is piecewise linear and convex and has the finite dimensional representation

V̄_k(π) = max_{λ_{i,k} ∈ Λ_k} λ_{i,k}' π.   (18)

Here the 2(N_p + 1)-dimensional vectors λ_{i,k} belong to a pre-computable finite set of vectors Λ_k. In particular, each vector λ_{i,k} is of the form

λ_{i,k} = ( λ_{i,k,1}' , 0 , λ_{i,k,3}' , 0 )',   where λ_{i,k,1}, λ_{i,k,3} ∈ R^{N_p}.   (19)

If the elements of R^p are not all equal, then there always exists a unique vector in Λ_k, which we denote by λ_{*,k} = M̄ ( 1_{N_p}' , 0_{N_p+2}' )', with optimal control ν = 2. If all the elements of R^p are equal, then Λ_k comprises the single vector λ_{*,k} = M̄ ( 1_{N_p}' , 0 , 1_{N_p}' , 0 )'.
3. For any information state x ∈ X^p of project p, the near-optimal Gittins index γ_N^p(x) is given by the finite dimensional representation

γ_N^p(x) = max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}}  M̄ λ_{i,N,3}' x / ( λ_{i,N,3}' x − λ_{i,N,1}' x + M̄ ).   (20)

Remark: Statement 1 of the above theorem shows that the value iteration algorithm (11) for computing the Gittins index γ_N^p(x) is identical to the dynamic programming recursion (17) for optimizing a standard finite horizon POMDP. Statement 2 says that this finite horizon POMDP has a finite dimensional piecewise linear solution which is characterized by a pre-computable finite set of vectors at each time instant. Statement 2 is well known in the POMDP literature, e.g. see [13], [11]. There are several linear programming based algorithms available for computing the finite set of vectors Λ_k at each iteration k; further details are given in Sec. 2.2. Statement 3 gives an explicit formula for the Gittins index of the HMM multi-armed bandit problem: recall that x_k^p is the information state computed by the p-th HMM filter at time k, so given the set of vectors Λ_N, (20) gives an explicit expression for the Gittins index γ_N^p(x_k^p) at any time k for project p. Note that if all elements of R^p are identical, then γ_N^p(x) = M̄ for all x. Finally, although V̄_N(π) is piecewise linear for any finite N, in general V̄(π) = lim_{N→∞} V̄_N(π) is not piecewise linear. Indeed, V̄(π) is piecewise linear only in exceptional cases, and the resulting policy is then called finitely transient [6].

Proof. The proof of the first statement is by mathematical induction. At iteration k = 0,

V̄_0(π) = max{ R̄_1' π, R̄_2' π } = max{ R^p' x, M } = V_0(x, M).   (21)

Assume that at time k, V̄_k(π) = V_k(x, M), and consider (17). Note that by construction of the rewards in (15), we have for the terminal (retirement) state

V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,   k = 0, 1, 2,..., N.   (22)

We have already shown above that R̄_1' π = R^p' x and R̄_2' π = M. It is easily shown, using standard properties of tensor products, that for m = 1, 2,..., M_p,

V̄_k( B̄_1(m) Ā_1' π / (1' B̄_1(m) Ā_1' π) ) = V_k( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ),   1' B̄_1(m) Ā_1' π = 1_{N_p}' B^p(m) A^p' x.

Because B_1(M_p + 1) = diag( 0_{N_p}, 1 ), see (13), (15), and due to the structure of Ā_2, it follows that

B̄_1(M_p + 1) Ā_1' π / ( 1' B̄_1(M_p + 1) Ā_1' π ) = z ⊗ (0_{N_p}', 1)' ∈ Π,   V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,
B̄_2(M_p + 1) Ā_2' π / ( 1' B̄_2(M_p + 1) Ā_2' π ) = z ⊗ (0_{N_p}', 1)' ∈ Π,   V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,

where Π is defined in (16) and the last equality follows from (22). Thus V̄_{k+1}(π) = V_{k+1}(x, M).
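Once a POMDP solver has produced the vector set Λ_N, evaluating the Gittins index and applying the index rule takes only a few lines. The sketch below assumes each vector is stored in the partitioned form (19), with the λ_1 and λ_3 blocks occupying positions 1,...,N_p and N_p+2,...,2N_p+1 respectively; the helper names are ours.

```python
import numpy as np

def gittins_from_vectors(x, Lambda_N, M_bar):
    """Evaluate gamma_N(x) via (20) from the vector set Lambda_N of Theorem 1.

    x        : information state of the project (length N_p)
    Lambda_N : iterable of vectors of the partitioned form (19),
               each of length 2 * (N_p + 1)
    """
    N = len(x)
    best = 0.0
    for lam in Lambda_N:
        lam1, lam3 = lam[:N], lam[N + 1:2 * N + 1]   # the two nonzero blocks in (19)
        denom = lam3 @ x - lam1 @ x + M_bar
        if denom > 0:                                # skips the retirement vector lambda_*
            best = max(best, M_bar * (lam3 @ x) / denom)
    return best

def index_rule(x_list, Lambda_list, M_bar_list):
    """Gittins index rule: work on the project with the largest index."""
    indices = [gittins_from_vectors(x, L, Mb)
               for x, L, Mb in zip(x_list, Lambda_list, M_bar_list)]
    return int(np.argmax(indices))
```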

The third statement can be shown as follows. Since λ_{*,N}' π = M, to compute the Gittins index γ_N^p(x) we merely need to evaluate the value function V̄_N(π) at the hyper-planes which lie at the intersection of λ_{*,N}' π = M and max_{λ_{i,N}} λ_{i,N}' π, where λ_{i,N} ∈ Λ_N \ {λ_{*,N}}. That is,

γ_N^p(x) = min{ M : V̄_N(π) = M } = { λ_{*,N}' π : max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}} λ_{i,N}' π = λ_{*,N}' π }.

But given the structure of λ_{i,N} in (19), with z = ( z(1), z(2) )' defined in (14),

λ_{i,N}' π = z(1) λ_{i,N,1}' x + z(2) λ_{i,N,3}' x   and   λ_{*,N}' π = M̄ z(1).

Thus γ_N^p(x) = M̄ z(1), where z(1) solves

max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}} [ λ_{i,N,1}' x z(1) + λ_{i,N,3}' x (1 − z(1)) ] = M̄ z(1).

This yields (20).

2.2. Numerical Algorithms and Example
Given the finite dimensional representation of the Gittins index in (20) of Theorem 1, there are several linear programming based algorithms in the POMDP literature, such as Sondik's algorithm [13], Monahan's algorithm, Cheng's algorithm, and the Witness algorithm [6], that can be used to compute the finite set of vectors Λ_k appearing in equation (18). In the numerical examples below we used the incremental-prune algorithm recently developed in the artificial intelligence community by Cassandra et al. in 1997 [1]; the C++ code can be freely downloaded from the website [5]. In general the number of linear segments that characterize V̄_k(π) can grow exponentially with k; indeed, the problem is NP-hard. It is obvious that by considering only a subset of the piecewise linear segments that characterize V̄_k(π), and discarding the other segments, one can reduce the computational complexity. This is the basis of Lovejoy's [10] lower bound approximation. Lovejoy's algorithm [10] operates as follows. Initialize Λ̃_0 = Λ_0, i.e. according to (21).
Step 1: Given the set of vectors Λ̃_k, prune it as follows. Pick any R points π_1, π_2,..., π_R in the information state simplex Π; in the numerical examples below we picked the R points based on a uniform Freudenthal triangulization of Π, see [10] for details. Then keep only the vectors { arg max_{λ ∈ Λ̃_k} λ' π_r : r = 1, 2,..., R }.
Step 2: Given the pruned set, compute the set of vectors Λ̃_{k+1} using a standard POMDP algorithm.
Step 3: Notice that Ṽ_{k+1}(π) = max_{λ} λ' π over the pruned set is represented completely by at most R piecewise linear segments.
Lovejoy [10] shows that for all k, Ṽ_k(π) is a lower bound to the optimal value function V̄_k(π), i.e. Ṽ_k(π) ≤ V̄_k(π) for all π ∈ Π. Lovejoy's algorithm gives a suboptimal scheduling policy at a computational cost of no more than R evaluations per iteration. Lovejoy [10] also provides a constructive procedure for computing an upper bound to lim sup_k sup_{π ∈ Π} ( V̄_k(π) − Ṽ_k(π) ).

Example: Consider the problem of task planning for an autonomous mobile cleaning robot in a domestic environment. Here each room corresponds to a project, and the robot has to choose which room to clean next. For each project p, the state s^p ∈ {0, 1} describes whether the room is clean or not. Higher rewards are obtained for cleaning a dirty room. Future rewards are discounted by a factor of β = 0.8. Let the three rooms be modelled by transition probability matrices A^p and reward vectors R^p, p = 1, 2, 3. The small probability that the state changes from clean to dirty corresponds to the risk that the robot messes up while trying to clean; with this probability set to zero the cleaning progress would be monotone. The observation model depends only on the robot: reflecting the poor sensory information about the cleaning status, the observation matrix B^p is the same for all p = 1, 2, 3. In order to concentrate on the pure multi-armed bandit problem, switching costs [2], which would prevent the robot from changing rooms at every time instant, were omitted. The Gittins indices for the three projects, γ^1(x), γ^2(x), and γ^3(x), computed using (20), are plotted in Fig. 1. Because N_p = 2 and x(1) + x(2) = 1, it suffices to plot γ^p(x) versus x(1). The optimal policy is to clean the room with the highest Gittins index.
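Returning to Lovejoy's lower bound algorithm used for the comparison curves discussed next, its Step 1 is simply a selection of maximizing vectors on a fixed grid. A minimal sketch (our helper names; a plain list of grid points stands in for the Freudenthal triangulization):

```python
import numpy as np

def lovejoy_prune(Lambda, grid):
    """Step 1 of Lovejoy's lower-bound algorithm: keep, for each grid point,
    a vector that attains the maximum there, and discard the rest.

    Lambda : list of value-function vectors (1-D arrays of equal length)
    grid   : list of information-state grid points pi_1, ..., pi_R
    """
    kept = {}
    for pi in grid:
        values = [lam @ pi for lam in Lambda]
        i_best = int(np.argmax(values))   # maximizing vector at this grid point
        kept[i_best] = Lambda[i_best]     # dict keys remove duplicates
    return list(kept.values())

# Feeding the next dynamic programming step this pruned set (at most R vectors)
# yields a piecewise linear lower bound to the optimal value function.
```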
The solid line for each project p is obtained by running the incremental-prune algorithm; a numerical resolution of ε = 10^-3 leads to 890, 740, and 477 vectors in Λ_20. The dotted line and the dashed line for one of the projects show the results obtained by Lovejoy's lower bound algorithm for a uniform triangularization of the information state space with R = 2 and R = 4 points, respectively.

[Figure 1. Gittins indices γ^1(x), γ^2(x), γ^3(x) for the three HMM projects, plotted against the first component x(1) of the information state. Each project is a 2-state HMM.]

For the other two projects those estimates were even closer to the solid lines and are therefore omitted from the plot for clarity. The R = 2 point triangularization yields Λ_20 having 4, 6, and 4 vectors; the R = 4 point triangularization yields Λ_20 having 6, 23, and 2 vectors. It can be seen that Lovejoy's lower bound algorithm provides an accurate estimate at relatively low computational cost. For R = 7, Lovejoy's algorithm yields estimates of γ^p(x) which are virtually indistinguishable from the solid lines. Numerical experiments show that for small bandit problems, i.e. individual projects having up to 10 states, the incremental prune algorithm and Lovejoy's lower bound algorithm can be used satisfactorily. For larger dimensional problems, the computational cost becomes prohibitively large.

3. Conclusion
We have presented an algorithm for computing the Gittins index for the Partially Observed Markov Decision Process multi-armed bandit problem. The key idea was to reformulate this problem, by extending the state vector with retirement information, as a standard POMDP. This allows the use of efficient numerical methods for solving the stochastic dynamic programming problem. The approach can be extended to cover what are called ongoing bandit processes, in which rewards also accrue in the idle mode: as pointed out in [7], there is a corresponding bandit process which produces the same expected total return as the ongoing bandit process if the same policy is applied to both processes. A more difficult case is the restless bandit problem, in which the idle projects also evolve; it would be of great interest to extend our results to this case.

REFERENCES
1. M.L. Littman, A.R. Cassandra and N.L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI-97), Providence, Rhode Island, 1997.
2. M. Asawa and D. Teneketzis. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control, 41(3), March 1996.
3. D.P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, Massachusetts, 1995.
4. D. Bertsimas and J. Niño-Mora. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Mathematics of Operations Research, 21(2):257–306, 1996.
5. A.R. Cassandra. Tony's POMDP page.
6. A.R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, 1998.
7. J.C. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
8. V. Krishnamurthy and R. Evans. Beam scheduling for electronically scanned array tracking systems using multi-armed bandits. In Proceedings of the IEEE Conference on Decision and Control, Phoenix, Arizona, 1999.
9. P.R. Kumar and P. Varaiya. Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice-Hall, New Jersey, 1986.
10. W.S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162–175, January–February 1991.
11. W.S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47–66, 1991.
12. L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
13. R.D. Smallwood and E.J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.
14. P. Whittle. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society, Series B, 42(2):143–149, 1980.
