Robust Adaptive Markov Decision Processes in Multivehicle Applications
The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Citation: Bertuccelli, L.F., B. Bethke, and J.P. How. "Robust adaptive Markov Decision Processes in multi-vehicle applications." American Control Conference, 2009 (ACC '09). Copyright 2009 IEEE.
Publisher: Institute of Electrical and Electronics Engineers
Version: Final published version
Accessed: Mon Apr 09 08::06 EDT 2018
Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
2009 American Control Conference, Hyatt Regency Riverfront, St. Louis, MO, USA, June 10-12, 2009. WeB9.4

Robust Adaptive Markov Decision Processes in Multi-vehicle Applications
Luca F. Bertuccelli, Brett Bethke, and Jonathan P. How
Aerospace Controls Laboratory, Massachusetts Institute of Technology
{lucab, bbethke, jhow}@mit.edu

Abstract — This paper presents a new robust and adaptive framework for Markov Decision Processes that accounts for errors in the transition probabilities. Robust policies are typically found off-line, but can be extremely conservative when implemented in the real system. Adaptive policies, on the other hand, are specifically suited for on-line implementation, but may display undesirable transient performance as the model is updated through learning. A new method that exploits the individual strengths of the two approaches is presented in this paper. This robust and adaptive framework protects the adaptation process from exhibiting worst-case performance during the model updating, and is shown to converge to the true, optimal value function in the limit of a large number of state transition observations. The proposed framework is investigated in simulation and actual flight experiments, and shown to improve transient behavior in the adaptation process and overall mission performance.

I. INTRODUCTION

Many decision processes, such as Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), are modeled as a probabilistic process driven by a known Markov Chain. In practice, however, the true parameters of the Markov Chain are frequently unavailable to the modeler, and many researchers have recently addressed the issue of robust performance in these decision systems [1]-[4]. While many authors have studied the problem of MDPs with uncertain transition probabilities [5]-[7], robust counterparts to these MDPs have been obtained only recently. Robust MDP counterparts have been introduced in the work of Bagnell et al. [8], Nilim [1], and Iyengar [2]. Bagnell presented a robust value iteration algorithm for solving the robust MDPs.
The convergence of robust value iteration was formally proved by Nilim [1] and Iyengar [2]. Both Nilim and Iyengar introduced meaningful uncertainty sets for the transition probabilities that could be efficiently solved by adding an additional, inner optimization on the uncertain transition probabilities. One of the methods for finding a robust policy in [1] was to use scenario-based methods, wherein the performance is optimized for different realizations of the transition probabilities. However, it was recently shown that a scenario-based approach may require an extremely large number of realizations to yield a robust policy [4]. This observation motivated the development of a specific scenario selection process using the first two moments of a Bayesian prior to obtain robust policies using far fewer scenarios [4], [21].

Robust methods find robust policies that hedge against errors in the transition probabilities. However, there are many cases when this type of approach is too conservative. For example, it may be possible to identify the transition probabilities by observing state transitions, obtain improved estimates, and resolve the optimization to find a less conservative policy. Model-based learning of MDPs is closely related to indirect adaptive control [9] in that the transition probabilities are estimated in real-time using a maximum likelihood estimator. At each time step, certainty equivalence is assumed on the transition probabilities, and a new policy is found with the new model estimate [10]. Jaulmes et al. [11], [12] study this problem in an active estimation context using POMDPs. Marbach [13] considers this problem when the transition probabilities depend on a parameter vector. Konda and Tsitsiklis [14] consider the problem of slowly-varying Markov Chains in the context of reinforcement learning. Sato [15] considers this problem and shows asymptotic convergence of the probability estimates, also in the context of dual control. Kumar [16] also considered the adaptation problem. Ford and Moore [17] consider the problem of estimating the parameters of a non-stationary Hidden Markov Model.
This paper demonstrates the need to account for both robust planning and adaptation in MDPs with uncertainty in their transition probabilities. Just as in control [18] or in task assignment problems [19], adaptation alone is generally not sufficient to ensure reliable operation of the overall control system. This paper shows that robustness is critical to mitigating worst-case performance, particularly during the transient periods of the adaptation. This paper contributes a new combined robust and adaptive problem formulation for MDPs with errors in the transition probabilities. The key result of this paper shows that robust and adaptive MDPs can converge to the truly optimal objective in the limit of a large number of observations. We demonstrate the robust component of this approach by using a Bayesian prior, and find the robust policy by using scenario-based methods. We then augment the robust approach with an adaptation scheme that is more effective at incorporating new information in the models. The MDP framework is discussed in Section II, the impact of uncertainty is demonstrated in Section III, and then we present the individual components of robustness and adaptation in
Section IV. The combined robust and adaptive MDP is shown to converge to the true, optimal value function in the limit of a large number of observations. The paper concludes in Section VI with a set of demonstrative numerical simulations and actual flight results on our UAV testbed.

II. MARKOV DECISION PROCESS

A. Problem Formulation

The Markov Decision Process (MDP) framework that we consider in this paper consists of a set of states $x_i \in S$ of cardinality N, a set of control actions $u \in U$ of cardinality M with a corresponding policy $\mu : S \to U$, a transition model given by $A^u_{ij} = \Pr(x_{k+1} = j \mid x_k = i, u_k)$, and a reward model $g(x_i, u)$. The time-additive objective function is defined as

$$J_\mu = g_N(x_N) + \sum_{k=0}^{N-1} \phi^k g_k(x_k, u_k) \qquad (1)$$

where $0 < \phi \le 1$ is an appropriate discount factor. The goal is to find an optimal control policy, $\mu^*$, that maximizes an expected objective given some known transition model $A^u$:

$$J^* = \max_\mu \mathbb{E}\big[ J_\mu(x_0) \big] \qquad (2)$$

In an infinite horizon setting ($N \to \infty$), the solution to Eq. 2 can be found by solving the Bellman Equation

$$J^*(i) = \max_u \Big[ g(i) + \phi \sum_j A^u_{ij} J^*(j) \Big] \qquad (3)$$

The optimal control is found by solving

$$u^*(i) \in \arg\max_{u \in U} \mathbb{E}\big[ J_\mu(x_0) \big] \quad \forall i \in S \qquad (4)$$

The optimal policy can be found in many different ways using Value Iteration or Policy Iteration, while Linear Programming can be used for moderately sized problems [20].

III. MODEL UNCERTAINTY

It has been shown that the value function can be biased in the presence of small errors in the transition probabilities [3], and that the optimal policy $\mu^*$ can be extremely sensitive to small errors in the model parameters. For example, in the context of UAV missions, it has been shown that errors in the state transition matrix $\tilde{A}^u$ can result in increased UAV crashes when implemented in real systems [21]. An example of this suboptimal performance is reflected in Figure 1, which shows two summary plots for a 2-UAV persistent surveillance mission formulated as an MDP [22], averaged over 100 Monte Carlo simulations.
The simulations were performed with modeling errors: the policy was found by using an estimated probability shown on the y-axis ("Modeled"), but implemented on the real system that assumed a nominal probability shown on the x-axis ("Actual"). Figure 1(a) shows the mean number of failed vehicles in the mission. Note that in the region labeled "Risky," the failure rate is increased significantly such that all the vehicles in the mission are lost due to the modeling error. Figure 1(b) shows the penalty in total coverage time when the transition probability is underestimated (in the area denoted as "Risky" in the figure).

[Fig. 1. Impact of modeling error on the overall mission effectiveness: (a) total number of failed vehicles; (b) mean coverage time vs. mismatched fuel flow probabilities.]

In this region, the total coverage time decreases from approximately 40 time steps (out of a 50 time step mission) to only 10 time steps. It is of paramount importance to develop precise mathematical descriptions for these errors and use this information to find robust policies. While there are many methods to describe uncertainty sets [1], [2], our approach relies on a Bayesian description of this uncertainty. This choice is primarily motivated by the need to update estimates of these probabilities in real-time in a computationally tractable manner. This approach assumes a prior Dirichlet distribution on each row of the transition matrix, and recursively updates this distribution with observations. The Dirichlet distribution $f_D$ at time k for a row of the N-dimensional transition model, $p_k = [p_1, p_2, \ldots, p_N]^T$ with positive distribution parameters $\alpha(k) = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$, is defined as

$$f_D(p_k \mid \alpha(k)) = K \prod_{i=1}^{N} p_i^{\alpha_i - 1} = K\, p_1^{\alpha_1 - 1} \cdots p_{N-1}^{\alpha_{N-1} - 1} \Big(1 - \sum_{i=1}^{N-1} p_i\Big)^{\alpha_N - 1}, \qquad \sum_{i=1}^{N} p_i = 1 \qquad (5)$$

where K is a normalizing factor that ensures the probability distribution integrates to unity. Each $p_i$ is the i-th entry of
the m-th row, that is, $p_i = A^u_{mi}$, with $0 \le p_i \le 1$ and $\sum_i p_i = 1$. The primary reason for using the Dirichlet distribution is that the mean $\bar{p}_i$ satisfies the requirements of a probability vector ($0 \le \bar{p}_i \le 1$ and $\sum_i \bar{p}_i = 1$) by construction. Furthermore, the parameters $\alpha_i$ can be interpreted as counts, or times that a particular state transition was observed. This enables computationally tractable updates on the distribution based on new observations. The uncertainty set description for the Dirichlet is known as a credibility region, and can be found by Monte Carlo integration.

IV. ADAPTATION AND ROBUSTNESS

This section discusses individual methods for adapting to changes in the transition probabilities, as well as methods for accounting for robustness in the presence of the transition probability uncertainty.

A. Adaptation

It is well known that the Dirichlet distribution is conjugate to the multinomial distribution, implying a measurement update step that can be expressed in closed form using the previously observed counts $\alpha(k)$. The posterior distribution $f_D(p_{k+1} \mid \alpha(k+1))$ is given in terms of the prior $f_D(p_k \mid \alpha(k))$ as

$$f_D(p_{k+1} \mid \alpha(k+1)) \propto f_D(p_k \mid \alpha(k))\, f_M(\beta(k) \mid p_k) = \prod_{i=1}^{N} p_i^{\alpha_i - 1} p_i^{\beta_i} = \prod_{i=1}^{N} p_i^{\alpha_i + \beta_i - 1}$$

where $f_M(\beta(k) \mid p_k)$ is a multinomial distribution with hyperparameters $\beta(k) = [\beta_1, \ldots, \beta_N]$. Each $\beta_i$ is the total number of transitions observed from state j to a new state i: mathematically, $\beta_i = \sum_j \delta_{j,i}$, where

$$\delta_{j,i} = \begin{cases} 1 & \text{if transition } j \to i \text{ observed} \\ 0 & \text{otherwise} \end{cases}$$

indicates how many times transitions were observed from state j to state i. For the next derivations, we assume that only a single transition can occur per time step, so $\beta_i = \delta_{j,i}$. Upon receipt of the observations $\beta(k)$, the parameters $\alpha(k)$ are updated according to

$$\alpha_i(k+1) = \alpha_i(k) + \delta_{j,i} \qquad (6)$$

and the mean can be found by normalizing these parameters, $\bar{p}_i = \alpha_i / \alpha_0$ with $\alpha_0 = \sum_i \alpha_i$. Our recent work [23] has shown that the mean and variance

$$\bar{p}_i = \alpha_i / \alpha_0 \qquad (7)$$

$$\Sigma_{ii} = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)} \qquad (8)$$

can be equivalently expressed recursively in terms of the previous mean and variance:

$$\bar{p}_i(k+1) = \bar{p}_i(k) + \frac{\Sigma_{ii}(k)}{\bar{p}_i(k)\big(1 - \bar{p}_i(k)\big)}\,\big(\delta_{j,i} - \bar{p}_i(k)\big)$$

$$\Sigma_{ii}(k+1) = \frac{\gamma_{k+1}\,\Sigma_{ii}(k)}{\Sigma_{ii}(k) + \bar{p}_i(k)\big(1 - \bar{p}_i(k)\big)}$$

where $\gamma_{k+1} = \bar{p}_i(k+1)\big(1 - \bar{p}_i(k+1)\big)$.
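As a small self-contained check (the counts below are our own toy numbers, not from the paper), the conjugate count update of Eq. (6) and the mean-variance recursion as reconstructed above can be run side by side; the two forms produce the same posterior moments.

```python
import numpy as np

# Toy 3-state example of one Dirichlet row (illustrative counts only).
alpha = np.array([4.0, 3.0, 3.0])
a0 = alpha.sum()
p = alpha / a0                                       # mean, Eq. (7)
Sigma = alpha * (a0 - alpha) / (a0**2 * (a0 + 1.0))  # variances, Eq. (8)

def count_update(alpha, j):
    """Direct conjugate update, Eq. (6): one observed transition into state j."""
    out = alpha.copy()
    out[j] += 1.0
    return out

def moment_update(p, Sigma, j):
    """Equivalent recursion on the mean and variance for the same observation."""
    delta = np.zeros_like(p)
    delta[j] = 1.0
    gain = Sigma / (p * (1.0 - p))        # elementwise; equals 1/(alpha_0 + 1)
    p_new = p + gain * (delta - p)
    gamma = p_new * (1.0 - p_new)
    Sigma_new = gamma * Sigma / (Sigma + p * (1.0 - p))
    return p_new, Sigma_new

# Observe a transition into state 1 and compare the two forms.
alpha_new = count_update(alpha, 1)
p_direct = alpha_new / alpha_new.sum()
p_rec, Sigma_rec = moment_update(p, Sigma, 1)
print(np.allclose(p_direct, p_rec))       # True: the two forms agree
```

The gain $\Sigma_{ii}/(\bar{p}_i(1-\bar{p}_i)) = 1/(\alpha_0+1)$ is the same for every entry, which is why the recursive mean stays a valid probability vector without renormalization.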
Furthermore, it was shown that these mean-variance recursions, just as their count-based counterparts, can be slow in detecting changes if the model is non-stationary. Hence, a modified set of recursions was derived, and it was shown that the following recursions provide a much more effective change-detection mechanism:

$$\bar{p}_i(k+1) = \bar{p}_i(k) + \frac{\Sigma_{ii}(k)}{\lambda_k\,\bar{p}_i(k)\big(1 - \bar{p}_i(k)\big)}\,\big(\delta_{j,i} - \bar{p}_i(k)\big) \qquad (9)$$

$$\Sigma_{ii}(k+1) = \frac{\gamma_{k+1}\,\Sigma_{ii}(k)}{\lambda_k\,\bar{p}_i(k)\big(1 - \bar{p}_i(k)\big) + \Sigma_{ii}(k)} \qquad (10)$$

The key change was the addition of an effective process noise through the use of a discount factor $0 < \lambda_k \le 1$, and this allowed for a much faster estimator response [23].

B. Robustness

While an adaptation mechanism is useful to account for changes in the transition probabilities, the estimates of the transition probabilities are only guaranteed to converge in the limit of an infinite number of observations. While in practice the estimates do not require an unbounded number of observations, simply replacing the uncertain model $\tilde{A}$ with the best estimate $\hat{A}$ may lead to a biased value function [3] and sensitive policies, especially if the estimator has not yet converged to the true parameter A. For the purposes of this paper, the robust counterpart of Eq. (2) is defined as [1], [2]

$$J_R^* = \min_{\tilde{A} \in \mathcal{A}} \max_\mu \mathbb{E}\big[ J_\mu(x_0) \big] \qquad (11)$$

Like the nominal problem, the objective function is maximized with respect to the control policy; however, for the robust counterpart, the objective is minimized with respect to the uncertainty set $\mathcal{A}$. When the uncertainty model $\mathcal{A}$ is described by a Bayesian prior, scenario-based methods can be used to generate realizations of the transition probability model. This gives rise to a scenario-based robust method which can turn out to be computationally intensive, since the total number of scenarios needs to be large [21].
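As a concrete sketch of the scenario idea (the counts and value function below are our own illustrative choices, not from the paper), realizations of an uncertain transition row can be drawn directly from the Dirichlet prior and fed to the inner minimization of the robust counterpart:

```python
import numpy as np

# Monte Carlo scenario generation from the Dirichlet prior for one
# uncertain row of A^u. Counts are illustrative only.
rng = np.random.default_rng(0)
alpha = np.array([8.0, 1.0, 1.0])             # prior counts for the row
scenarios = rng.dirichlet(alpha, size=1000)   # 1000 realizations of p

J = np.array([10.0, 2.0, 0.0])                # some fixed value function
# Worst-case expected value over the sampled scenarios (the inner
# minimization of Eq. (11), restricted to this single row):
worst = (scenarios @ J).min()
nominal = (alpha / alpha.sum()) @ J
print(worst <= nominal)                       # the worst case is more pessimistic
```

The cost of this approach is visible directly: the worst case is only as reliable as the number of sampled scenarios, which is the computational burden the Dirichlet Sigma Points below are designed to avoid.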
This motivated our work [4] which, given a prior Dirichlet distribution on the transition probabilities, deterministically generates samples of each transition probability row $Y_i$ (so-called Dirichlet Sigma Points) using the first two statistical moments of each row of the transition probability matrix, $\bar{p}$ and $\Sigma$:

$$Y_0 = \bar{p}$$
$$Y_i = \bar{p} + \beta\,(\Sigma^{1/2})_i, \qquad i = 1, \ldots, N$$
$$Y_i = \bar{p} - \beta\,(\Sigma^{1/2})_{i-N}, \qquad i = N+1, \ldots, 2N$$

where $\beta$ is a tuning parameter that depends on the level of desired conservatism, which in turn depends on the size of the credibility region. Here, $(\Sigma^{1/2})_i$ denotes the i-th row of the matrix square root of $\Sigma$. The uncertainty set $\mathcal{A}$ contains the deterministic samples $Y_i$, $i \in \{1, 2, \ldots, 2N\}$.
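A minimal sketch of the construction above, for one row of the transition matrix. The paper does not specify how the matrix square root is computed; an eigendecomposition is one standard choice for a symmetric PSD covariance, and all counts here are illustrative.

```python
import numpy as np

def dirichlet_sigma_points(alpha, beta=3.0):
    """Deterministic scenarios Y_0..Y_2N for one uncertain row.
    beta trades off conservatism against the credibility-region size."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    p = alpha / a0                                    # row mean
    # Full Dirichlet covariance: (diag(p) - p p^T) / (a0 + 1); its diagonal
    # matches the per-entry variances of Eq. (8).
    Sigma = (np.diag(p) - np.outer(p, p)) / (a0 + 1.0)
    # Symmetric matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(Sigma)
    S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    points = [p]
    points += [p + beta * S[i] for i in range(len(p))]   # Y_1 .. Y_N
    points += [p - beta * S[i] for i in range(len(p))]   # Y_{N+1} .. Y_2N
    return points

points = dirichlet_sigma_points([8.0, 1.0, 1.0], beta=1.0)
```

Because the all-ones vector is in the null space of the covariance, every sigma point sums to one by construction; individual entries can still leave [0, 1] for large beta, in which case a projection back onto the simplex would be needed (a safeguard of ours, not a step stated in the paper).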
V. ROBUST ADAPTATION

There are many choices for replanning efficiently using model-based methods, such as Real Time Dynamic Programming (RTDP) [24], [25]. RTDP assumes that the transition probabilities are unknown, and are continually updated through an agent's actions in the state space. Due to computational considerations, only a single sweep of the value iteration is performed at each measurement update. The result of Gullapalli [25] shows that if each state and action are executed infinitely often, then the (asynchronous) value iteration algorithm converges to the true value function. An alternative strategy is to perform synchronous value iteration, using a bootstrapping approach where the old policy is used as the initial guess for the new policy [26].

In this section, we consider the full robust replanning problem (see Algorithm 1). The two main steps are an adaptation step, where the Dirichlet distributions (or alternatively, the Dirichlet Sigma Points) for each row and action are updated based on the most recent observations, and a robust replan step. For this paper, we use the Dirichlet Sigma Points to find the robust policy by using scenario-based methods, but we note that the following theoretical results apply to any robust value function. While it is appealing to account for both robustness and adaptation, it is critical to demonstrate that the proposed algorithm in fact converges to the true, optimal solution in the limit. We show this next.

A. Convergence

Gullapalli and Barto [25] showed that in an adaptive (but non-robust) setting, an asynchronous version of the Value Iteration algorithm converges to the optimal value function.
Theorem 1 ([25], Convergence of an adaptive, asynchronous value iteration algorithm): For any finite state, finite action MDP with an infinite-horizon discounted performance measure, an indirect adaptive asynchronous value iteration algorithm converges to the optimal value function with probability one if: 1) the conditions for convergence of the non-adaptive algorithm are met; 2) in the limit, every action is executed from every state infinitely often; 3) the estimates of the state transition probabilities remain bounded and converge in the limit to their true values with probability one.

Proof: See [25].

Using the framework of the above theorem, the robust counterpart to this theorem is stated next.

Theorem 2 (Convergence of a robust adaptive, asynchronous value iteration algorithm): For any finite state, finite action MDP with an infinite-horizon discounted performance measure, the robust, indirect adaptive asynchronous value iteration algorithm

$$J_{k+1}(i) = \begin{cases} \min_{\mu} \max_{A_k \in \mathcal{A}_k} \mathbb{E}[J_\mu] & \text{if } i \in B_k \subseteq S \\ J_k(i) & \text{otherwise} \end{cases} \qquad (14)$$

converges to the optimal value function with probability one if the conditions of Theorem 1 are satisfied, and the uncertainty set $\mathcal{A}_k$ converges to the singleton $\hat{A}_k$; in other words, $\lim_{k \to \infty} \mathcal{A}_k = \{\hat{A}_k\}$. Here $B_k$ denotes the subset of states that are updated at each time step.

Algorithm 1: Robust Replanning
  Initialize uncertainty model: for example, Dirichlet distribution parameters $\alpha_i$
  while not finished do
    Using a statistically efficient estimator, update estimates of the transition probabilities (for each row and action); for example, using the discounted estimator of Eq. 9:
      $\bar{p}_i(k+1) = \bar{p}_i(k) + \frac{\Sigma_{ii}(k)}{\lambda_k\,\bar{p}_i(k)(1 - \bar{p}_i(k))}\,(\delta_{j,i} - \bar{p}_i(k))$
      $\Sigma_{ii}(k+1) = \frac{\gamma_{k+1}\,\Sigma_{ii}(k)}{\lambda_k\,\bar{p}_i(k)(1 - \bar{p}_i(k)) + \Sigma_{ii}(k)}$
    For each uncertain row of the transition probability matrix, find the robust policy using robust DP:
      $\min_{A} \max_{\mu} \mathbb{E}[J_\mu] \qquad (12)$
    For example, update the Dirichlet Sigma Points (for each row and action),
      $Y_0 = \bar{p}$
      $Y_i = \bar{p} + \beta\,(\Sigma^{1/2})_i, \quad i = 1, \ldots, N \qquad (13)$
      $Y_i = \bar{p} - \beta\,(\Sigma^{1/2})_{i-N}, \quad i = N+1, \ldots, 2N$
    and find the new robust policy $\min_{A \in Y} \max_{\mu} \mathbb{E}[J_\mu]$
    Return
  end while

Proof: The key difference between this theorem and Theorem 1 is the maximization over the uncertainty set $\mathcal{A}_k$.
However, as additional observations are incurred, and by virtue of the convergent, unbiased estimator, the size of the uncertainty set will decrease to the singleton unbiased estimate $\hat{A}_k$. Furthermore, the robust operator given by $T \doteq \min_\mu \max_{A_k \in \mathcal{A}_k}$ is a contraction mapping [1], [2]. Using both of these arguments, and since this unbiased estimate will in turn converge to the true value of the transition probability, the robust adaptive asynchronous value iteration algorithm will converge to the true, optimal solution.

Corollary 3 (Convergence of synchronous version): The synchronous version of the robust, adaptive MDP will converge to the true, optimal value function.

Proof: In the event that an entire sweep of the state space occurs at each value iteration, the uncertainty set $\mathcal{A}_k$ will still converge to the singleton $\{\hat{A}_k\}$.

Remark (Convergence of robust adaptation with Dirichlet Sigma Points): For the Dirichlet Sigma Points, the discounted estimator of Eq. 9 converges in the limit of a large number of observations (with an appropriate choice of $\lambda_k$), and the covariance $\Sigma$ is eventually driven to 0; each of the Dirichlet Sigma Points will then collapse to the singleton, the unbiased estimate of the true transition probabilities. This means that the model will have converged, and that the robust solution will in fact have converged to the optimal value function.
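The convergence mechanism above can be illustrated with a small numerical sketch. Everything below (the chain, rewards, counts, and two-scenario uncertainty set) is our own toy construction, not the paper's testbed: as simulated observations accumulate, the Dirichlet variances shrink, the scenario set tightens around the estimate, and robust value iteration approaches the value function of the true model.

```python
import numpy as np

# Toy single-action, 2-state chain; all numbers are illustrative only.
PHI = 0.9
g = np.array([1.0, 0.0])                      # reward g(i)
A_true = np.array([[0.8, 0.2], [0.3, 0.7]])   # "unknown" true model

def robust_vi(scenarios, iters=500):
    """Value iteration with a worst-case (min over scenarios) backup."""
    J = np.zeros(2)
    for _ in range(iters):
        J = np.array([g[i] + PHI * min(A[i] @ J for A in scenarios)
                      for i in range(2)])
    return J

# Accumulate transition counts in a Dirichlet model for each row.
rng = np.random.default_rng(1)
alpha = np.ones((2, 2))
for i in range(2):
    draws = rng.choice(2, size=20000, p=A_true[i])
    alpha[i] += np.bincount(draws, minlength=2)

a0 = alpha.sum(axis=1, keepdims=True)
p = alpha / a0
sig = np.sqrt(p * (1.0 - p) / (a0 + 1.0))     # per-entry std. dev., cf. Eq. (8)

def renorm(A):
    """Project a perturbed model back onto valid probability rows."""
    A = np.clip(A, 1e-9, None)
    return A / A.sum(axis=1, keepdims=True)

# Simple two-scenario uncertainty set (pessimistic / optimistic rows).
BETA = 3.0
scenarios = [renorm(p - BETA * sig), renorm(p + BETA * sig)]

J_robust = robust_vi(scenarios)
J_nominal = robust_vi([A_true])
print(abs(J_robust - J_nominal).max())        # gap shrinks as data accumulate
```

With few observations the scenario set is wide and the robust value function is noticeably pessimistic; after many observations the set has nearly collapsed and the gap to the true-model value function is small, which is the content of Theorem 2. Adding more actions would simply wrap an outer max over actions around the inner min over scenarios.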
VI. NUMERICAL RESULTS

This section presents actual flight demonstrations of the proposed robust and adaptive algorithm on a persistent surveillance mission in the RAVEN testbed [22]. The UAVs are initially located at a base location, which is separated by some (possibly large) distance from the surveillance location. The objective of the problem is to maintain a specified number r of requested UAVs over the surveillance location at all times. The base location is denoted by $Y_b$, the surveillance location is denoted by $Y_s$, and a discretized set of intermediate locations is denoted by $\{Y_0, \ldots, Y_s\}$. Vehicles can move between adjacent locations at a rate of one unit per time step. The UAVs have a specified maximum fuel capacity $F_{max}$, and we assume that the rate $\dot{F}_{burn}$ at which they burn fuel may vary randomly during the mission: the probability of nominal fuel flow is given by $p_{nom}$. This uncertainty in the fuel flow may be attributed to aggressive maneuvering that may be required for short time periods, for example. Thus, the total flight time each vehicle achieves on a given flight is a random variable, and this uncertainty must be accounted for in the problem. If a vehicle runs out of fuel while in flight, it crashes and is lost. The vehicles can refuel (at a rate $\dot{F}_{refuel}$) by returning to the base location.

In this section, the adaptive replanning was implemented by explicitly accounting for the uncertainty in the probability of nominal fuel flow, $p_{nom}$. The replanning architecture updates both the mean and variance of the fuel flow transition probability, which is then passed to the online MDP solver, which computes the robust policy. This robust policy is then passed to the policy executor, which implements the control decision on the system. The Dirichlet Sigma Points were formed using the updated mean and variance,

$$Y_0 = \hat{p}_{nom}, \qquad Y_1 = \hat{p}_{nom} + \beta\,\sigma_p, \qquad Y_2 = \hat{p}_{nom} - \beta\,\sigma_p$$

and used to find the robust policy. Using the results from the earlier sections, appropriate choices of $\beta$ could range from 1 to 5, where $\beta = 3$ corresponds to a 99% certainty region for the Dirichlet (in this case, the Beta density).
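A minimal sketch of this scalar sigma-point construction. The numbers are illustrative placeholders, not values from the flight experiments, and the [0, 1] clipping is a safeguard of ours rather than a step stated in the paper.

```python
# Scalar (Beta-density) version of the Dirichlet Sigma Points for the
# single uncertain parameter p_nom. Illustrative values only.
p_hat, sigma_p, beta = 0.85, 0.05, 4.0
Y = [p_hat, p_hat + beta * sigma_p, p_hat - beta * sigma_p]
Y = [min(1.0, max(0.0, y)) for y in Y]   # keep each scenario a valid probability
# For this problem a lower probability of nominal fuel flow is the
# conservative case, so the robust plan uses the pessimistic scenario:
p_robust = min(Y)
print(p_robust)
```

This is exactly why, as noted next, the robust solution for the scalar problem reduces to planning with $\hat{p}_{nom} - \beta\sigma_p$ in place of $\hat{p}_{nom}$.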
For this scalar problem, the robust solution of the MDP corresponds to using a value of $\hat{p}_{nom} - \beta\sigma_p$ in place of the nominal probability estimate $\hat{p}_{nom}$. Flight experiments were performed for a case when the probability estimate $\hat{p}_{nom}$ was varied in mid-mission, and three different replanning strategies were compared:

- Adaptive only: The first replan strategy involved only an adaptive strategy, with $\lambda = 0.8$, using only the estimate $\hat{p}_{nom}$.
- Robust replan, undiscounted adaptation: This replan strategy used the undiscounted mean-variance estimator ($\lambda = 1$), and set $\beta = 4$ for the Dirichlet Sigma Points.
- Robust replan, discounted adaptation: This replan strategy used the discounted mean-variance estimator ($\lambda = 0.8$), and set $\beta = 4$ for the Dirichlet Sigma Points.

[Fig. 2. Experimental results showing vehicle trajectories (red and blue), and the probability estimate used in the planning (black): (a) fast adaptation (λ = 0.8) with no robustness (β = 0); (b) high robustness (β = 4) but slow adaptation (λ = 1).]

In all cases, the vehicle takes off from base, travels through 2 intermediate areas, and then reaches the surveillance location. In the nominal fuel flow setting (losing 1 unit of fuel per time step), the vehicle can safely remain at the surveillance region for 4 time steps, but in the off-nominal fuel flow setting (losing 2 units), the vehicle can remain on surveillance for only 1 time step. The main results are shown in Figure 2, where the transition in $p_{nom}$ occurred at t = 7 time steps. At this point in time, one of the vehicles is just completing the surveillance and is initiating the return to base to refuel, as the second vehicle is heading to the surveillance area. The key to a successful mission, in the sense of avoiding vehicle crashes, is to ensure that the change is detected sufficiently quickly, and that the planner maintains some level of cautiousness in this estimate by embedding robustness. The successful mission will detect this change rapidly, and leave the UAVs on target for a shorter time.
The result of Figure 2(a) ignores any uncertainty in the estimate but has fast adaptation (since it uses the factor λ = 0.8). However, by not embedding the uncertainty, the estimator, while detecting the change in $p_{nom}$ quickly, nonetheless allocates the second vehicle to remain at the surveillance region. Consequently, one of the vehicles runs out of fuel and crashes. At the second cycle of the mission, the second vehicle remains at the surveillance area for only 1 time step.

The result of Figure 2(b) accounts for uncertainty in the estimate but has slow adaptation (since it uses the factor λ = 1). However, while embedding the uncertainty, the replanning is not done quickly, and for this reason, different from the adaptive, non-robust example, one of the vehicles runs out of fuel and crashes. At the second cycle of the mission, the second vehicle remains at the surveillance area for only 1 time step.

[Fig. 3. Fast adaptation (λ = 0.8) with robustness (β = 4).]

Figure 3 shows the robustness and adaptation acting together to cautiously allocate the vehicles, while responding quickly to changes in $p_{nom}$. The second vehicle is allocated to perform surveillance for only 2 time steps (instead of 3), and safely returns to base with no fuel remaining. At the second cycle, both vehicles only stay at the surveillance area for 1 time step. Hence, the robustness and adaptation together have been able to recover mission efficiency by bringing in their relative strengths: the robustness by accounting for uncertainty in the probability, and the adaptation by quickly responding to the changes in the probability.

VII. CONCLUSIONS

This paper has presented a combined robust and adaptive framework that accounts for errors in the transition probabilities. This framework is shown to converge to the true, optimal value function in the limit of a large number of observations. The proposed framework has been verified both in simulation and actual flight experiments, and shown to improve transient behavior in the adaptation process and overall mission performance. Our current work is addressing a more active learning mechanism for the transition probabilities, through the use of exploratory actions specifically taken to reduce the uncertainty in the transition probabilities. Our future work will consider the problem of decentralization of the robust adaptive framework across multiple vehicles, specifically addressing the issues of model consensus in a multi-agent system, and the impact of any disagreement on the robust solution.

ACKNOWLEDGEMENTS

Research supported by AFOSR grant FA

REFERENCES

[1] A. Nilim and L. El Ghaoui, "Robust Solutions to Markov Decision Problems with Uncertain Transition Matrices," Operations Research, vol. 53, no. 5, 2005.
[2] G. Iyengar, "Robust Dynamic Programming," Math. Oper. Res., vol. 30, no. 2, pp. 257-280, 2005.
[3] S. Mannor, D. Simester, P. Sun, and J. Tsitsiklis, "Bias and Variance Approximation in Value Function Estimates," Management Science, vol. 52, no. 2, 2007.
[4] L. F. Bertuccelli and J. P. How, "Robust Decision-Making for Uncertain Markov Decision Processes Using Sigma Point Sampling," IEEE American Control Conference, 2008.
[5] D. E. Brown and C. C. White, "Methods for reasoning with imprecise probabilities in intelligent decision systems," IEEE Conference on Systems, Man and Cybernetics, 1990.
[6] J. K. Satia and R. E. Lave, "Markovian Decision Processes with Uncertain Transition Probabilities," Operations Research, vol. 21, no. 3, 1973.
[7] C. C. White and H. K. Eldeib, "Markov Decision Processes with Imprecise Transition Probabilities," Operations Research, vol. 42, no. 4, 1994.
[8] A. Bagnell, A. Ng, and J. Schneider, "Solving Uncertain Markov Decision Processes," NIPS, 2001.
[9] K. J. Astrom and B. Wittenmark, Adaptive Control. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1994.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, 1998.
[11] R. Jaulmes, J. Pineau, and D. Precup, "Active Learning in Partially Observable Markov Decision Processes," European Conference on Machine Learning (ECML), 2005.
[12] R. Jaulmes, J. Pineau, and D. Precup, "Learning in Non-Stationary Partially Observable Markov Decision Processes," ECML Workshop on Reinforcement Learning in Non-Stationary Environments, 2005.
[13] P. Marbach, Simulation-based methods for Markov Decision Processes. PhD thesis, MIT, 1998.
[14] V. Konda and J. Tsitsiklis, "Linear stochastic approximation driven by slowly varying Markov chains," Systems and Control Letters, vol. 50, 2003.
[15] M. Sato, K. Abe, and H. Takeda, "Learning Control of Finite Markov Chains with Unknown Transition Probabilities," IEEE Trans. on Automatic Control, vol. AC-27, no. 2, 1982.
[16] P. R. Kumar and W. Lin, "Simultaneous Identification and Adaptive Control of Unknown Systems over Finite Parameter Sets," IEEE Trans. on Automatic Control, vol. AC-28, no. 1, 1983.
[17] J. Ford and J. Moore, "Adaptive Estimation of HMM Transition Probabilities," IEEE Transactions on Signal Processing, vol. 46, no. 5, 1998.
[18] P. A. Ioannou and J. Sun, Robust Adaptive Control. Prentice-Hall, 1996.
[19] M. Alighanbari and J. P. How, "A Robust Approach to the UAV Task Assignment Problem," International Journal of Robust and Nonlinear Control, vol. 18, no. 2, 2008.
[20] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[21] L. F. Bertuccelli, Robust Decision-Making with Model Uncertainty in Aerospace Systems. PhD thesis, MIT, 2008.
[22] B. Bethke, J. How, and J. Vian, "Group Health Management of UAV Teams With Applications to Persistent Surveillance," IEEE American Control Conference, 2008.
[23] L. F. Bertuccelli and J. P. How, "Estimation of Non-Stationary Markov Chain Transition Models," IEEE Conference on Decision and Control, 2008.
[24] A. Barto, S. Bradtke, and S. Singh, "Learning to Act using Real-Time Dynamic Programming," Artificial Intelligence, vol. 72, pp. 81-138, 1993.
[25] V. Gullapalli and A. Barto, "Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms," Advances in NIPS, 1994.
[26] B. Bethke, L. Bertuccelli, and J. P. How, "Experimental Demonstration of MDP-Based Planning with Model Uncertainty," AIAA Guidance, Navigation and Control Conference, Aug. 2008.
Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs
More informationFinite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin
Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of
More informationOn an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1
On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool
More informationReport on Image warping
Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.
More informationCopyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for U Charts. Dr. Wayne A. Taylor
Taylor Enterprses, Inc. Adjusted Control Lmts for U Charts Copyrght 207 by Taylor Enterprses, Inc., All Rghts Reserved. Adjusted Control Lmts for U Charts Dr. Wayne A. Taylor Abstract: U charts are used
More informationCS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements
CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before
More informationTracking with Kalman Filter
Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,
More informationMATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)
1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons
More informationBayesian predictive Configural Frequency Analysis
Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse
More informationFeature Selection: Part 1
CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?
More information8. Modelling Uncertainty
8. Modellng Uncertanty. Introducton. Generatng Values From Known Probablty Dstrbutons. Monte Carlo Smulaton 4. Chance Constraned Models 5 5. Markov Processes and Transton Probabltes 6 6. Stochastc Optmzaton
More informationMMA and GCMMA two methods for nonlinear optimization
MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationHidden Markov Models
CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte
More informationComparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method
Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method
More informationA PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS
HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,
More informationSimultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals
Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,
More informationComputation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models
Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,
More informationOutline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]
DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm
More informationCS : Algorithms and Uncertainty Lecture 17 Date: October 26, 2016
CS 29-128: Algorthms and Uncertanty Lecture 17 Date: October 26, 2016 Instructor: Nkhl Bansal Scrbe: Mchael Denns 1 Introducton In ths lecture we wll be lookng nto the secretary problem, and an nterestng
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationPsychology 282 Lecture #24 Outline Regression Diagnostics: Outliers
Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationChapter Newton s Method
Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationResource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud
Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal
More informationLecture 10 Support Vector Machines II
Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationNUMERICAL DIFFERENTIATION
NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationTime-Varying Systems and Computations Lecture 6
Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy
More informationA linear imaging system with white additive Gaussian noise on the observed data is modeled as follows:
Supplementary Note Mathematcal bacground A lnear magng system wth whte addtve Gaussan nose on the observed data s modeled as follows: X = R ϕ V + G, () where X R are the expermental, two-dmensonal proecton
More informationWeek 5: Neural Networks
Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple
More informationLecture Notes on Linear Regression
Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume
More informationStructure and Drive Paul A. Jensen Copyright July 20, 2003
Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.
More informationANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)
Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of
More informationLecture 14: Bandits with Budget Constraints
IEOR 8100-001: Learnng and Optmzaton for Sequental Decson Makng 03/07/16 Lecture 14: andts wth udget Constrants Instructor: Shpra Agrawal Scrbed by: Zhpeng Lu 1 Problem defnton In the regular Mult-armed
More informationSupporting Information
Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to
More information1 Convex Optimization
Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,
More informationExpectation Maximization Mixture Models HMMs
-755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood
More informationPortfolios with Trading Constraints and Payout Restrictions
Portfolos wth Tradng Constrants and Payout Restrctons John R. Brge Northwestern Unversty (ont wor wth Chrs Donohue Xaodong Xu and Gongyun Zhao) 1 General Problem (Very) long-term nvestor (eample: unversty
More informationP R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /
Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons
More informationModule 9. Lecture 6. Duality in Assignment Problems
Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationProbability Theory (revisited)
Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted
More informationCopyright 2017 by Taylor Enterprises, Inc., All Rights Reserved. Adjusted Control Limits for P Charts. Dr. Wayne A. Taylor
Taylor Enterprses, Inc. Control Lmts for P Charts Copyrght 2017 by Taylor Enterprses, Inc., All Rghts Reserved. Control Lmts for P Charts Dr. Wayne A. Taylor Abstract: P charts are used for count data
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors
Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference
More informationSingular Value Decomposition: Theory and Applications
Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real
More informationDETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH
Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,
More informationU.C. Berkeley CS294: Beyond Worst-Case Analysis Luca Trevisan September 5, 2017
U.C. Berkeley CS94: Beyond Worst-Case Analyss Handout 4s Luca Trevsan September 5, 07 Summary of Lecture 4 In whch we ntroduce semdefnte programmng and apply t to Max Cut. Semdefnte Programmng Recall that
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16
STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus
More informationAppendix B. The Finite Difference Scheme
140 APPENDIXES Appendx B. The Fnte Dfference Scheme In ths appendx we present numercal technques whch are used to approxmate solutons of system 3.1 3.3. A comprehensve treatment of theoretcal and mplementaton
More informationDUE: WEDS FEB 21ST 2018
HOMEWORK # 1: FINITE DIFFERENCES IN ONE DIMENSION DUE: WEDS FEB 21ST 2018 1. Theory Beam bendng s a classcal engneerng analyss. The tradtonal soluton technque makes smplfyng assumptons such as a constant
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More informationReinforcement learning
Renforcement learnng Nathanel Daw Gatsby Computatonal Neuroscence Unt daw @ gatsby.ucl.ac.uk http://www.gatsby.ucl.ac.uk/~daw Mostly adapted from Andrew Moore s tutorals, copyrght 2002, 2004 by Andrew
More informationTransfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system
Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng
More informationLecture 4. Instructor: Haipeng Luo
Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would
More informationCOS 521: Advanced Algorithms Game Theory and Linear Programming
COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton
More informationEstimation: Part 2. Chapter GREG estimation
Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationA New Evolutionary Computation Based Approach for Learning Bayesian Network
Avalable onlne at www.scencedrect.com Proceda Engneerng 15 (2011) 4026 4030 Advanced n Control Engneerng and Informaton Scence A New Evolutonary Computaton Based Approach for Learnng Bayesan Network Yungang
More informationNegative Binomial Regression
STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...
More informationConvergence of random processes
DS-GA 12 Lecture notes 6 Fall 216 Convergence of random processes 1 Introducton In these notes we study convergence of dscrete random processes. Ths allows to characterze phenomena such as the law of large
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models
More informationRyan (2009)- regulating a concentrated industry (cement) Firms play Cournot in the stage. Make lumpy investment decisions
1 Motvaton Next we consder dynamc games where the choce varables are contnuous and/or dscrete. Example 1: Ryan (2009)- regulatng a concentrated ndustry (cement) Frms play Cournot n the stage Make lumpy
More informationAn Integrated Asset Allocation and Path Planning Method to to Search for a Moving Target in in a Dynamic Environment
An Integrated Asset Allocaton and Path Plannng Method to to Search for a Movng Target n n a Dynamc Envronment Woosun An Mansha Mshra Chulwoo Park Prof. Krshna R. Pattpat Dept. of Electrcal and Computer
More informationk t+1 + c t A t k t, t=0
Macro II (UC3M, MA/PhD Econ) Professor: Matthas Kredler Fnal Exam 6 May 208 You have 50 mnutes to complete the exam There are 80 ponts n total The exam has 4 pages If somethng n the queston s unclear,
More informationSTATS 306B: Unsupervised Learning Spring Lecture 10 April 30
STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear
More informationMotion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong
Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal
More informationHongyi Miao, College of Science, Nanjing Forestry University, Nanjing ,China. (Received 20 June 2013, accepted 11 March 2014) I)ϕ (k)
ISSN 1749-3889 (prnt), 1749-3897 (onlne) Internatonal Journal of Nonlnear Scence Vol.17(2014) No.2,pp.188-192 Modfed Block Jacob-Davdson Method for Solvng Large Sparse Egenproblems Hongy Mao, College of
More informationMAXIMUM A POSTERIORI TRANSDUCTION
MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationIntroduction to Hidden Markov Models
Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts
More informationA Hybrid Variational Iteration Method for Blasius Equation
Avalable at http://pvamu.edu/aam Appl. Appl. Math. ISSN: 1932-9466 Vol. 10, Issue 1 (June 2015), pp. 223-229 Applcatons and Appled Mathematcs: An Internatonal Journal (AAM) A Hybrd Varatonal Iteraton Method
More informationSimulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests
Smulated of the Cramér-von Mses Goodness-of-Ft Tests Steele, M., Chaselng, J. and 3 Hurst, C. School of Mathematcal and Physcal Scences, James Cook Unversty, Australan School of Envronmental Studes, Grffth
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More informationGrover s Algorithm + Quantum Zeno Effect + Vaidman
Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the
More informationErrors for Linear Systems
Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch
More information