A Value Iteration Algorithm for Partially Observed Markov Decision Process Multi-armed Bandits


Vikram Krishnamurthy (a), Bo Wahlberg (b), Frank Lingelbach (b)
(a) Dept. of Electrical and Computer Engineering, The University of British Columbia, Vancouver, B.C., Canada
(b) S3-Automatic Control, KTH, S-100 44 Stockholm, Sweden
(This work was supported by an ARC large grant and the Centre of Expertise in Networked Decision Systems.)

Abstract
A value iteration based algorithm is given for computing the Gittins index of a Partially Observed Markov Decision Process (POMDP) multi-armed bandit problem. This problem concerns the dynamic allocation of effort among a number of competing projects, of which only one can be worked on in any time period. The active project evolves according to a finite state Markov chain and generates a reward, while the states of the idle projects remain fixed. In this contribution it is assumed that the state of the active project can only be observed indirectly through noisy observations. The objective is to find the optimal policy, based on this partial information, for deciding which project to work on at each time so as to maximize the total expected reward. The solution is obtained by transforming the problem into a standard POMDP, for which efficient near-optimal algorithms exist. A numerical example from the field of task planning for an autonomous robot illustrates the algorithms.

1. Introduction and Problem Formulation
The multi-armed bandit problem is an example of a dynamic stochastic scheduling problem in which the allocation of effort among a number of competing projects is optimized sequentially. Numerous applications of the finite state Markov chain multi-armed bandit problem appear in the robotics and stochastic control literature; see [7], [4], and [8] for examples in job scheduling, resource allocation for manufacturing systems, and target tracking. The multi-armed bandit structure implies that the optimal policy can be found by a so-called Gittins index rule [7]. The multi-project optimization problem is thereby reduced to a finite number of single-project optimization problems. Several algorithms have been proposed to solve this fully observed finite state bandit problem [9], [4]. This paper considers bandit problems where the underlying finite state Markov chain is not directly observed; instead, the observations, which are assumed to belong to a finite set, are a probabilistic function of the unobserved finite state Markov chain. Such problems are known as Partially Observed Markov Decision Process (POMDP) multi-armed bandits, and are also called Hidden Markov Model (HMM) multi-armed bandits. The POMDP model suits many topics in robotics very well, since it captures uncertainty both in the state transitions and in the observations, which is characteristic of most robot applications. In [6] this approach is used to model robot navigation in a known environment. In recent work [8], a return-to-state argument was used in an attempt to compute the Gittins index of a POMDP multi-armed bandit problem. However, the argument in [8] is incorrect; indeed, there appears to be no obvious way of obtaining a finite dimensional characterization of the value function in the return-to-state argument. In this paper it is shown that, by introducing the retirement option formulation [7] of the multi-armed bandit problem, a finite dimensional value iteration algorithm can be derived for computing the Gittins index of a POMDP bandit.
The key idea is to extend the state vector to include retirement information. The structure of this paper is as follows. In Section 1.1 we introduce the POMDP multi-armed bandit problem, and Hidden Markov Model state estimation is reviewed in Section 1.2. A value iteration algorithm for computing the Gittins index is derived in Section 2. The main result of the paper is given in Section 2.1, where Theorem 1 gives a finite dimensional characterization of the Gittins index. Section 2.2 deals with numerical algorithms and provides a numerical example of the proposed algorithm. Finally, Section 3 concludes the paper.

1.1. The POMDP Multi-armed Bandit Problem
Consider P independent projects p = 1,...,P. Assume each project p has a finite number of states N_p. Let s_k^p denote the state of project p at discrete time k = 0, 1,.... At each time instant only one of these projects can be worked on.
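To make these dynamics concrete, the following sketch (Python/NumPy; the function name and data layout are illustrative assumptions rather than anything specified above) simulates one decision epoch: the active project's state moves according to its transition matrix and emits a noisy observation, the idle projects stay frozen, and the discounted reward is accrued.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(states, u, A, B, R, beta, k):
    """Advance the bandit model by one time step.

    states : list of current (integer) states, one per project
    u      : index of the project worked on at time k
    A[p], B[p], R[p] : transition matrix, observation matrix and
                       reward vector of project p (assumed inputs)
    Returns the discounted reward, the observation of the active
    project, and the updated state list (idle projects are frozen).
    """
    reward = (beta ** k) * R[u][states[u]]               # beta^k R(s_k^u, u)
    s_next = rng.choice(len(R[u]), p=A[u][states[u]])    # row of A^u gives P(s_{k+1} | s_k)
    y_next = rng.choice(B[u].shape[1], p=B[u][s_next])   # noisy observation of the new state
    new_states = list(states)
    new_states[u] = s_next                               # only the active project moves
    return reward, y_next, new_states
```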

If project p is worked on at time k, an instantaneous reward β^k R(s_k^p, p) is accrued; R(s_k^p, p) ≥ 0 is assumed finite. Here 0 < β < 1 denotes the discount factor. The state s_k^p evolves according to an N_p-state homogeneous Markov chain with transition probability matrix A^p = (a_ij^p), i, j ∈ {1,...,N_p}, where

a_ij^p = P( s_{k+1}^p = j | s_k^p = i )   if project p is worked on at time k.   (1)

The states of all the other P−1 idle projects are unaffected, i.e. s_{k+1}^q = s_k^q if project q is idle at time k. All projects are initialized with s_0^p ~ x_0^p, where the x_0^p are specified initial distributions for p = 1,...,P. The state of the active project p is indirectly observed via noisy measurements (observations) y_{k+1}^p of the active project state s_{k+1}^p. Assume that these observations belong to a finite set indexed by m = 1,...,M_p. Let B^p = (b_im^p), i ∈ {1,...,N_p}, m ∈ {1,...,M_p}, denote the observation probability (symbol probability) matrix of the Hidden Markov Model (HMM), where each element b_im^p = P( y_{k+1}^p = m | s_{k+1}^p = i, u_k = p ). Let u_k ∈ {1,...,P} denote which project is worked on at time k. Consequently, s_{k+1}^{u_k} denotes the state of the active project at time k+1. Let the policy µ denote the stationary sequence of controls {u_k, k = 1, 2,...}. The total expected discounted reward over an infinite time horizon is given by

J_µ = E{ Σ_{k=0}^∞ β^k R( s_k^{u_k}, u_k ) },   (2)

where E denotes mathematical expectation. The aim is to determine the optimal stationary policy which yields the maximum reward in (2). Denote the observation history at time k as Y_k = ( y_1^{u_0},..., y_k^{u_{k-1}} ) and let U_k = ( u_0,..., u_k ). Define U as the class of all admissible policies µ. Solving the Markov chain multi-armed bandit problem [9] involves computing the optimal policy µ* = arg max_{µ ∈ U} J_µ, where U denotes the set of sequences u_k which map Y_k to {1,...,P}. Let J* = max_{µ ∈ U} J_µ denote the optimal expected reward. Remark: Because M_p and N_p are finite, and the rewards R( s_k^{u_k}, u_k ) are uniformly bounded from above and below, the optimization (2) is well defined.

1.2. Information State Formulation
The above partially observed multi-armed bandit problem can be re-expressed as a fully observed multi-armed bandit in terms of the information state [9]. For each project p, denote by x_k^p the information state at time k:

x_k^p = ( x_k^p(i), i = 1,...,N_p ),   where x_k^p(i) = P( s_k^p = i | Y_k, U_{k-1} ).   (3)

The HMM multi-armed bandit problem can then be viewed as the following scheduling problem. Consider P parallel HMM state estimation filters, one for each project. If project p is active, an observation y_{k+1}^p is obtained and the information state x_{k+1}^p is computed recursively by the HMM state filter (also known as the forward algorithm or Baum's algorithm [12]) according to

x_{k+1}^p = B^p(y_{k+1}^p) A^p' x_k^p / ( 1_{N_p}' B^p(y_{k+1}^p) A^p' x_k^p )   if project p is worked on at time k,   (4)

where, if y_{k+1}^p = m, then B^p(m) = diag( b_1m^p,..., b_{N_p}m^p ) is the diagonal matrix formed by the m-th column of the observation matrix B^p, and 1_{N_p} is an N_p-dimensional column vector of ones. The state estimates of the other P−1 projects remain unaffected, i.e.

x_{k+1}^q = x_k^q   if project q is not worked on, q ∈ {1,...,P}, q ≠ p.   (5)

Let X^p denote the state space of information states x_k^p, p ∈ {1, 2,...,P}, which is the N_p-dimensional simplex

X^p = { x ∈ R^{N_p} : 1_{N_p}' x = 1, 0 ≤ x(i) ≤ 1 for all i ∈ {1,...,N_p} }.   (6)

Using the smoothing property of conditional expectations, the reward functional (2) can be re-written in terms of the information state as

J_µ = E{ Σ_{k=0}^∞ β^k R(u_k)' x_k^{u_k} },   (7)

where R(u) denotes the N_u-dimensional reward vector ( R(1,u),..., R(N_u,u) )' and R(s,u) is defined as in (2). The aim is to compute the optimal policy µ* = arg max_{µ ∈ U} J_µ.
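The filter update (4)–(5) is a single Bayes-rule step. A minimal sketch, assuming A^p and B^p are stored as NumPy arrays and states and observation symbols are zero-indexed (the helper names are ours):

```python
import numpy as np

def hmm_filter_update(x, y, A, B):
    """One step of the HMM filter (4): x_{k+1} = B(y) A' x / (1' B(y) A' x).

    x : information state of the active project (length N_p, sums to 1)
    y : observed symbol in {0, ..., M_p - 1}
    A : N_p x N_p transition matrix, B : N_p x M_p observation matrix
    """
    unnormalised = np.diag(B[:, y]) @ A.T @ x   # B(y) A' x
    return unnormalised / unnormalised.sum()    # divide by 1' B(y) A' x

def update_all(x_list, y, u, A_list, B_list):
    """Update the bank of P filters: only the active project u changes, as in (5)."""
    x_list = list(x_list)
    x_list[u] = hmm_filter_update(x_list[u], y, A_list[u], B_list[u])
    return x_list
```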

It is well known [14] that the optimal policy has an indexable rule [9], [4]: for each project p there is a function γ^p(x_k^p), called the Gittins index, which is a function only of the project p and the information state x_k^p, and the optimal scheduling policy at time k is to work on the project with the largest Gittins index, i.e. choose project q_k where q_k = arg max_{p ∈ {1,...,P}} γ^p(x_k^p). Thus computing the Gittins index is a key requirement for solving any multi-armed bandit problem. For a formal definition of the Gittins index in terms of stopping times, see [9]. We will work with a more convenient equivalent definition in terms of the parameterized retirement reward M [3], [7]. The fundamental problem in the POMDP case is that the Gittins index γ^p(x) must be evaluated for each x ∈ X^p, an uncountably infinite set. In contrast, for the standard finite state Markov multi-armed bandit problem considered extensively in the literature (e.g. [7]), the Gittins index can be computed straightforwardly. The main contribution of our paper is to present a finite dimensional value iteration algorithm for computing the Gittins index.

2. Value Iteration Algorithm for Computing the Gittins Index
This section presents a value iteration algorithm for computing the Gittins index γ^p(x) for each project p ∈ {1, 2,..., P}. As with any dynamic programming formulation, the computation of the Gittins index for each project p is off-line, independent of the Gittins indices of the other P−1 projects, and can be done a priori. For each project p, let M denote a positive real number such that

0 ≤ M ≤ M̄,   M̄ = max_{i ∈ {1,...,N_p}} R(i, p) / (1 − β).   (8)

To simplify subsequent notation, we omit the superscript p on M and M̄, and the time subscript k on x_k^p. The Gittins index [3], [7] of project p with information state x can be defined as

γ^p(x) = min{ M : V(x, M) = M },   (9)

where V(x, M) satisfies the functional (Bellman's) recursion

V(x, M) = max{ R^p' x + β Σ_{m=1}^{M_p} V( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ) 1_{N_p}' B^p(m) A^p' x ,  M },   (10)

and M denotes the parameterized retirement reward. Observe that P( y_{k+1} = m ) = 1_{N_p}' B^p(m) A^p' x. The N-th order approximation of V(x, M) is obtained from the following value iteration algorithm, k = 0, 1,..., N−1, initialized with V_0(x, M) = max{ R^p' x, M }:

V_{k+1}(x, M) = max{ R^p' x + β Σ_{m=1}^{M_p} V_k( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ) 1_{N_p}' B^p(m) A^p' x ,  M }.   (11)

Here V_k(x, M) is the value function of a k-horizon dynamic programming recursion. Let γ_N^p(x) denote the approximate Gittins index computed via the value iteration algorithm (11), i.e.

γ_N^p(x) = min{ M : V_N(x, M) = M }.   (12)

It is well known that V(x, M) can be uniformly approximated arbitrarily closely by a finite horizon value function V_N(x, M) of (11). A straightforward application of this result shows that the finite horizon Gittins index approximation γ_N^p(x) of (12) can be made arbitrarily accurate by choosing the horizon N sufficiently large. This is summarized in the following corollary.

Corollary 1. The infinite horizon Gittins index γ^p(x) of state x can be uniformly approximated arbitrarily closely by the near-optimal Gittins index γ_N^p(x) computed according to (12) for a finite horizon N. In particular, for any δ > 0 there exists a finite horizon N such that: (i) sup_{x ∈ X^p} | V_N(x, M) − V(x, M) | ≤ δ for all M ∈ [0, M̄]; (ii) for this N, sup_{x ∈ X^p} | γ^p(x) − γ_N^p(x) | ≤ 2βδ / (1 − β).
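To make the definitions (9)–(12) concrete, the brute-force sketch below (our own illustration, not the algorithm proposed in this paper) evaluates the N-horizon recursion (11) by enumerating observation sequences and locates γ_N(x) by bisection over M ∈ [0, M̄]. Its cost grows as M_p^N per belief point, which is exactly the impracticality that the finite dimensional characterization of Section 2.1 removes.

```python
import numpy as np

def sigma(x, m, A, B):
    """sigma(x, m) = 1' B(m) A' x, the probability of observing symbol m."""
    return float(B[:, m] @ (A.T @ x))

def T(x, m, A, B):
    """Filter update T(x, m) = B(m) A' x / sigma(x, m)."""
    v = B[:, m] * (A.T @ x)
    return v / v.sum()

def V(x, M, k, A, B, R, beta):
    """k-horizon value function of recursion (11) with retirement reward M."""
    if k == 0:
        return max(R @ x, M)
    cont = R @ x + beta * sum(
        V(T(x, m, A, B), M, k - 1, A, B, R, beta) * sigma(x, m, A, B)
        for m in range(B.shape[1]))
    return max(cont, M)

def gittins_naive(x, A, B, R, beta, N=8, tol=1e-4):
    """Approximate Gittins index gamma_N(x) = min{M : V_N(x, M) = M} by bisection."""
    lo, hi = 0.0, R.max() / (1.0 - beta)        # M_bar from (8)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if V(x, mid, N, A, B, R, beta) > mid + tol:
            lo = mid                            # continuing still beats retiring: index is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```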

2.1. Finite Dimensional Characterization of the Gittins Index
Unfortunately, the value iteration recursion (11) does not directly translate into a practical solution methodology. The fundamental problem with (11) is that at each iteration one needs to compute V_k(x, M) over the uncountably infinite sets x ∈ X^p and M ∈ [0, M̄]. The main contribution of this section is to construct a finite dimensional characterization of the value functions V_k(x, M), k = 1, 2,..., N, and hence of the near-optimal Gittins index γ_N^p(x). We will show that under a different coordinate basis V_k(x, M) is piecewise linear and convex. Computing γ_N^p(x) in (12) then simply amounts to evaluating V_N(x, M) at the hyper-planes formed by the intersection of the piecewise linear segments. Constructive algorithms based on this finite characterization are given in Sec. 2.2 to compute the Gittins index for the information states of the original bandit process.

Define the (N_p + 1)-dimensional augmented information state x̄_k ∈ { (x', 0)', (0_{N_p}', 1)' }, where x ∈ X^p is as in (3). As described below, x̄_k = (0_{N_p}', 1)' is interpreted as the retirement information state. Define an augmented observation process ȳ_k ∈ {1,..., M_p + 1} and the corresponding augmented transition and observation probability matrices

A_1 = [ A^p  0_{N_p} ; 0_{N_p}'  1 ],    A_2 = [ 0_{N_p × N_p}  1_{N_p} ; 0_{N_p}'  1 ],
B_1 = [ B^p  0_{N_p} ; 0_{M_p}'  1 ],    B_2 = I_{N_p+1},
B_1(m) = diag( column m of B_1 ),   B_2(m) = diag( column m of B_2 ),   m ∈ {1,..., M_p + 1}.   (13)

To construct a finite dimensional representation of V_k(x, M), we present a coordinate transformation under which V_k(x, M) is the value function of a standard POMDP, and (x, M) maps invertibly to the information state of this POMDP. Because 0 ≤ M ≤ M̄, define the pseudo-information state

z = ( M/M̄ , 1 − M/M̄ )',   0 ≤ M ≤ M̄.   (14)

Define the information state π_k and the following coordinate transformation (⊗ denotes the Kronecker/tensor product):

π_k = z ⊗ x̄_k,
Ā_1 = I_{2×2} ⊗ A_1,   Ā_2 = I_{2×2} ⊗ A_2,
B̄_1(m) = I_{2×2} ⊗ B_1(m),   B̄_2(m) = I_{2×2} ⊗ B_2(m),
R̄_1 = ( R^p' , 0 , R^p' , 0 )',   R̄_2 = M̄ ( 1_{N_p}' , 0 , 0_{N_p}' , 0 )'.   (15)

It is easily shown that Ā_1, Ā_2 are transition probability matrices (their rows add to one and each element is non-negative) and that B̄_1(m), B̄_2(m) are observation probability matrices. Also, the 2(N_p + 1)-dimensional vector π_k is an information state, since it belongs to

Π = { π ∈ R^{2(N_p+1)} : 1_{2(N_p+1)}' π = 1, π(i) ≥ 0, i = 1, 2,..., 2(N_p + 1) }.   (16)

Finally, define the control variable ν_k ∈ {1, 2} at each time k, where ν_k maps π_k to {1, 2}. Note that ν_k = 1 means continue and ν_k = 2 means retire. Define the policy sequence ν = (ν_1,..., ν_N). The policy ν is used to compute the Gittins index of project p; it is not to be confused with the policy µ defined in Sec. 1.1, which determines which project to work on. Consider now the following POMDP problem. The parameters Ā_ν, B̄_ν(m), R̄_ν defined in (15) form the transition probabilities, observation probabilities and reward vectors of a POMDP with two-valued control ν ∈ {1, 2} and objective

max_ν E{ Σ_{k=0}^∞ β^k R̄_{ν_k}' π_k }.

Here the vector π_k ∈ Π is an information state for this POMDP and evolves according to

π_{k+1} = B̄_ν(ȳ_{k+1}) Ā_ν' π_k / ( 1_{2(N_p+1)}' B̄_ν(ȳ_{k+1}) Ā_ν' π_k ),   ν ∈ {1, 2},   ȳ_{k+1} ∈ {1,..., M_p + 1},

depending on the control ν chosen at each time instant. Note that ν = 2 results in π_{k+1} attaining the retirement state z ⊗ (0_{N_p}', 1)'. The value iteration recursion for optimizing this POMDP over the finite horizon N is given by

V̄_{k+1}(π) = max{ R̄_1' π + β Σ_{m=1}^{M_p+1} V̄_k( B̄_1(m) Ā_1' π / (1' B̄_1(m) Ā_1' π) ) 1' B̄_1(m) Ā_1' π ,
                 R̄_2' π + β Σ_{m=1}^{M_p+1} V̄_k( B̄_2(m) Ā_2' π / (1' B̄_2(m) Ā_2' π) ) 1' B̄_2(m) Ā_2' π },   k = 0, 1, 2,..., N−1,
V̄_0(π) = max{ R̄_1' π, R̄_2' π }.   (17)

Here V̄_k(π) denotes the value function of the dynamic program,

V̄_k(π) = max_ν E{ Σ_{t=0}^{k} β^t R̄_{ν_t}' π_t | π_0 = π }.
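The construction (13)–(15) is mechanical, and a sketch may make the block structure easier to see. The code below (NumPy; the array layout is our assumption, and B_2 is omitted for brevity) builds the augmented matrices of (13) and their Kronecker-lifted versions of (15) from A^p, B^p, R^p and M̄.

```python
import numpy as np

def lift_project(A, B, R, M_bar):
    """Build the augmented matrices of (13) and the lifted quantities of (15).

    A : N x N transition matrix, B : N x M observation matrix,
    R : length-N reward vector, M_bar : upper bound (8) on the retirement reward.
    The (N+1)-th augmented state is the retirement state; symbol M+1 is the
    extra observation associated with retirement.
    """
    N, M = B.shape
    A1 = np.block([[A,                np.zeros((N, 1))],
                   [np.zeros((1, N)), np.ones((1, 1))]])   # "continue": retirement state absorbing
    A2 = np.block([[np.zeros((N, N)), np.ones((N, 1))],
                   [np.zeros((1, N)), np.ones((1, 1))]])   # "retire": jump to retirement state
    B1 = np.block([[B,                np.zeros((N, 1))],
                   [np.zeros((1, M)), np.ones((1, 1))]])   # observation model under "continue"
    # Lift by the 2-dimensional pseudo information state z of (14) via Kronecker products (15).
    A1_bar = np.kron(np.eye(2), A1)
    A2_bar = np.kron(np.eye(2), A2)
    B1_bar = [np.kron(np.eye(2), np.diag(B1[:, m])) for m in range(M + 1)]
    R1_bar = np.concatenate([R, [0.0], R, [0.0]])                          # continue reward
    R2_bar = M_bar * np.concatenate([np.ones(N), [0.0], np.zeros(N + 1)])  # retire reward
    return A1_bar, A2_bar, B1_bar, R1_bar, R2_bar
```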

The following is the main result of this paper.

Theorem 1. Under the coordinate basis defined in (15), the following three statements hold.
1. The value function V_k(x, M) in (11) for computing the Gittins index is identically equal to the value function V̄_k(π) of the standard POMDP (17).
2. At each iteration k = 0, 1,..., N, the value function V̄_k(π) is piecewise linear and convex and has the finite dimensional representation

V̄_k(π) = max_{λ_{i,k} ∈ Λ_k} λ_{i,k}' π.   (18)

Here the 2(N_p + 1)-dimensional vectors λ_{i,k} belong to a pre-computable finite set of vectors Λ_k. In particular, each vector λ_{i,k} is of the form

λ_{i,k} = ( λ_{i,k,1}' , 0 , λ_{i,k,3}' , 0 )',   where λ_{i,k,1}, λ_{i,k,3} ∈ R^{N_p}.   (19)

If the elements of R^p are not all equal, then there always exists a unique vector in Λ_k, which we denote by λ_{*,k} = M̄ ( 1_{N_p}' , 0_{N_p+2}' )', with optimal control ν = 2. If all the elements of R^p are equal, then Λ_k comprises the single vector λ_{*,k} = M̄ ( 1_{N_p}' , 0 , 1_{N_p}' , 0 )'.
3. For any information state x ∈ X^p of project p, the near-optimal Gittins index γ_N^p(x) is given by the finite dimensional representation

γ_N^p(x) = max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}}  M̄ λ_{i,N,3}' x / ( λ_{i,N,3}' x − λ_{i,N,1}' x + M̄ ).   (20)

Remark: Statement 1 of the above theorem shows that the value iteration algorithm (11) for computing the Gittins index γ_N^p(x) is identical to the dynamic programming recursion (17) for optimizing a standard finite horizon POMDP. Statement 2 says that this finite horizon POMDP has a finite dimensional piecewise linear solution which is characterized by a pre-computable finite set of vectors at each time instant. Statement 2 is well known in the POMDP literature, e.g. see [13], [11]. There are several linear programming based algorithms available for computing the finite set of vectors Λ_k at each iteration k; further details are given in Sec. 2.2. Statement 3 gives an explicit formula for the Gittins index of the HMM multi-armed bandit problem: recall that x_k^p is the information state computed by the p-th HMM filter at time k, so given the set of vectors Λ_N, (20) gives an explicit expression for the Gittins index γ_N^p(x_k^p) at any time k for project p. Note that if all elements of R^p are identical, then γ_N^p(x) = M̄ for all x. Finally, although V̄_N(π) is piecewise linear for any finite N, in general V̄(π) = lim_{N→∞} V̄_N(π) is not piecewise linear. Indeed, V̄(π) is piecewise linear only in exceptional cases, and the resulting policy is then called finitely transient [6].

Proof. The proof of the first statement is by mathematical induction. At iteration k = 0,

V̄_0(π) = max{ R̄_1' π, R̄_2' π } = max{ R^p' x, M } = V_0(x, M).   (21)

Assume that at time k, V̄_k(π) = V_k(x, M), and consider (17). Note that by construction of the rewards in (15), we have for the terminal (retirement) state

V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,   k = 0, 1, 2,..., N.   (22)

We have already shown above that R̄_1' π = R^p' x and R̄_2' π = M. It is easily shown, using standard properties of tensor products, that for m = 1, 2,..., M_p,

V̄_k( B̄_1(m) Ā_1' π / (1' B̄_1(m) Ā_1' π) ) = V_k( B^p(m) A^p' x / (1_{N_p}' B^p(m) A^p' x), M ),   1' B̄_1(m) Ā_1' π = 1_{N_p}' B^p(m) A^p' x.

Because B_1(M_p + 1) = diag( 0_{N_p}, 1 ), see (13), (15), and due to the structure of Ā_2, it follows that

B̄_1(M_p + 1) Ā_1' π / ( 1' B̄_1(M_p + 1) Ā_1' π ) = z ⊗ (0_{N_p}', 1)' ∈ Π,   V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,
B̄_2(M_p + 1) Ā_2' π / ( 1' B̄_2(M_p + 1) Ā_2' π ) = z ⊗ (0_{N_p}', 1)' ∈ Π,   V̄_k( z ⊗ (0_{N_p}', 1)' ) = 0,

where Π is defined in (16) and the last equality follows from (22). Thus V̄_{k+1}(π) = V_{k+1}(x, M).
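Once a POMDP solver has produced the vector set Λ_N, evaluating the Gittins index and applying the index rule takes only a few lines. The sketch below assumes each vector is stored in the partitioned form (19), with the λ_1 and λ_3 blocks occupying positions 1,...,N_p and N_p+2,...,2N_p+1 respectively; the helper names are ours.

```python
import numpy as np

def gittins_from_vectors(x, Lambda_N, M_bar):
    """Evaluate gamma_N(x) via (20) from the vector set Lambda_N of Theorem 1.

    x        : information state of the project (length N_p)
    Lambda_N : iterable of vectors of the partitioned form (19),
               each of length 2 * (N_p + 1)
    """
    N = len(x)
    best = 0.0
    for lam in Lambda_N:
        lam1, lam3 = lam[:N], lam[N + 1:2 * N + 1]   # the two nonzero blocks in (19)
        denom = lam3 @ x - lam1 @ x + M_bar
        if denom > 0:                                # skips the retirement vector lambda_*
            best = max(best, M_bar * (lam3 @ x) / denom)
    return best

def index_rule(x_list, Lambda_list, M_bar_list):
    """Gittins index rule: work on the project with the largest index."""
    indices = [gittins_from_vectors(x, L, Mb)
               for x, L, Mb in zip(x_list, Lambda_list, M_bar_list)]
    return int(np.argmax(indices))
```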

The third statement can be shown as follows. Since λ_{*,N}' π = M, to compute the Gittins index γ_N^p(x) we merely need to evaluate the value function V̄_N(π) at the hyper-planes which lie at the intersection of λ_{*,N}' π = M and max_{λ_{i,N}} λ_{i,N}' π, where λ_{i,N} ∈ Λ_N \ {λ_{*,N}}. That is,

γ_N^p(x) = min{ M : V̄_N(π) = M } = { λ_{*,N}' π : max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}} λ_{i,N}' π = λ_{*,N}' π }.

But given the structure of λ_{i,N} in (19), with z = ( z(1), z(2) )' defined in (14),

λ_{i,N}' π = z(1) λ_{i,N,1}' x + z(2) λ_{i,N,3}' x   and   λ_{*,N}' π = M̄ z(1).

Thus γ_N^p(x) = M̄ z(1), where z(1) solves

max_{λ_{i,N} ∈ Λ_N \ {λ_{*,N}}} [ λ_{i,N,1}' x z(1) + λ_{i,N,3}' x (1 − z(1)) ] = M̄ z(1).

This yields (20).

2.2. Numerical Algorithms and Example
Given the finite dimensional representation of the Gittins index in (20) of Theorem 1, there are several linear programming based algorithms in the POMDP literature, such as Sondik's algorithm [13], Monahan's algorithm, Cheng's algorithm, and the Witness algorithm [6], that can be used to compute the finite set of vectors Λ_k appearing in equation (18). In the numerical examples below we used the incremental-prune algorithm recently developed in the artificial intelligence community by Cassandra et al. in 1997 [1]; the C++ code can be freely downloaded from the website [5]. In general the number of linear segments that characterize V̄_k(π) can grow exponentially with k; indeed, the problem is NP-hard. It is obvious that by considering only a subset of the piecewise linear segments that characterize V̄_k(π), and discarding the other segments, one can reduce the computational complexity. This is the basis of Lovejoy's [10] lower bound approximation. Lovejoy's algorithm [10] operates as follows. Initialize Λ̃_0 = Λ_0, i.e. according to (21).
Step 1: Given the set of vectors Λ̃_k, prune it as follows. Pick any R points π_1, π_2,..., π_R in the information state simplex Π; in the numerical examples below we picked the R points based on a uniform Freudenthal triangulization of Π, see [10] for details. Then keep only the vectors { arg max_{λ ∈ Λ̃_k} λ' π_r : r = 1, 2,..., R }.
Step 2: Given the pruned set, compute the set of vectors Λ̃_{k+1} using a standard POMDP algorithm.
Step 3: Notice that Ṽ_{k+1}(π) = max_{λ} λ' π over the pruned set is represented completely by at most R piecewise linear segments.
Lovejoy [10] shows that for all k, Ṽ_k(π) is a lower bound to the optimal value function V̄_k(π), i.e. Ṽ_k(π) ≤ V̄_k(π) for all π ∈ Π. Lovejoy's algorithm gives a suboptimal scheduling policy at a computational cost of no more than R evaluations per iteration. Lovejoy [10] also provides a constructive procedure for computing an upper bound to lim sup_k sup_{π ∈ Π} ( V̄_k(π) − Ṽ_k(π) ).

Example: Consider the problem of task planning for an autonomous mobile cleaning robot in a domestic environment. Here each room corresponds to a project, and the robot has to choose which room to clean next. For each project p, the state s^p ∈ {0, 1} describes whether the room is clean or not. Higher rewards are obtained for cleaning a dirty room. Future rewards are discounted by a factor of β = 0.8. Let the three rooms be modelled by transition probability matrices A^p and reward vectors R^p, p = 1, 2, 3. The small probability that the state changes from clean to dirty corresponds to the risk that the robot messes up while trying to clean; with this probability set to zero the cleaning progress would be monotone. The observation model depends only on the robot: reflecting the poor sensory information about the cleaning status, the observation matrix B^p is the same for all p = 1, 2, 3. In order to concentrate on the pure multi-armed bandit problem, switching costs [2], which would prevent the robot from changing rooms at every time instant, were omitted. The Gittins indices for the three projects, γ^1(x), γ^2(x), and γ^3(x), computed using (20), are plotted in Fig. 1. Because N_p = 2 and x(1) + x(2) = 1, it suffices to plot γ^p(x) versus x(1). The optimal policy is to clean the room with the highest Gittins index.
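Returning to Lovejoy's lower bound algorithm used for the comparison curves discussed next, its Step 1 is simply a selection of maximizing vectors on a fixed grid. A minimal sketch (our helper names; a plain list of grid points stands in for the Freudenthal triangulization):

```python
import numpy as np

def lovejoy_prune(Lambda, grid):
    """Step 1 of Lovejoy's lower-bound algorithm: keep, for each grid point,
    a vector that attains the maximum there, and discard the rest.

    Lambda : list of value-function vectors (1-D arrays of equal length)
    grid   : list of information-state grid points pi_1, ..., pi_R
    """
    kept = {}
    for pi in grid:
        values = [lam @ pi for lam in Lambda]
        i_best = int(np.argmax(values))   # maximizing vector at this grid point
        kept[i_best] = Lambda[i_best]     # dict keys remove duplicates
    return list(kept.values())

# Feeding the next dynamic programming step this pruned set (at most R vectors)
# yields a piecewise linear lower bound to the optimal value function.
```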
The solid line for each project p is obtained by running the incremental-prune algorithm; a numerical resolution of ε = 10^-3 leads to 890, 740, and 477 vectors in Λ_20. The dotted line and the dashed line for one of the projects show the results obtained by Lovejoy's lower bound algorithm for a uniform triangularization of the information state space with R = 2 and R = 4 points, respectively.

[Figure 1. Gittins indices γ^1(x), γ^2(x), γ^3(x) for the three HMM projects, plotted against the first component x(1) of the information state. Each project is a 2-state HMM.]

For the other two projects those estimates were even closer to the solid lines and are therefore omitted from the plot for clarity. The R = 2 point triangularization yields Λ_20 having 4, 6, and 4 vectors; the R = 4 point triangularization yields Λ_20 having 6, 23, and 2 vectors. It can be seen that Lovejoy's lower bound algorithm provides an accurate estimate at relatively low computational cost. For R = 7, Lovejoy's algorithm yields estimates of γ^p(x) which are virtually indistinguishable from the solid lines. Numerical experiments show that for small bandit problems, i.e. individual projects having up to 10 states, the incremental prune algorithm and Lovejoy's lower bound algorithm can be used satisfactorily. For larger dimensional problems, the computational cost becomes prohibitively large.

3. Conclusion
We have presented an algorithm for computing the Gittins index for the Partially Observed Markov Decision Process multi-armed bandit problem. The key idea was to reformulate this problem, by extending the state vector with retirement information, as a standard POMDP. This allows the use of efficient numerical methods for solving the stochastic dynamic programming problem. The approach can be extended to cover what are called ongoing bandit processes, in which rewards also accrue in the idle mode: as pointed out in [7], there is a corresponding bandit process which produces the same expected total return as the ongoing bandit process if the same policy is applied to both processes. A more difficult case is the restless bandit problem, in which the idle projects also evolve; it would be of great interest to extend our results to this case.

REFERENCES
1. M.L. Littman, A.R. Cassandra and N.L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI-97), Providence, Rhode Island, 1997.
2. M. Asawa and D. Teneketzis. Multi-armed bandits with switching penalties. IEEE Transactions on Automatic Control, 41(3), March 1996.
3. D.P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, Massachusetts, 1995.
4. D. Bertsimas and J. Niño-Mora. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Mathematics of Operations Research, 21(2):257–306, 1996.
5. A.R. Cassandra. Tony's POMDP page.
6. A.R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, 1998.
7. J.C. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
8. V. Krishnamurthy and R. Evans. Beam scheduling for electronically scanned array tracking systems using multi-armed bandits. In Proceedings of the IEEE Conference on Decision and Control, Phoenix, Arizona, 1999.
9. P.R. Kumar and P. Varaiya. Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice-Hall, New Jersey, 1986.
10. W.S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162–175, January–February 1991.
11. W.S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47–66, 1991.
12. L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
13. R.D. Smallwood and E.J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.
14. P. Whittle. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society, Series B, 42(2):143–149, 1980.
