Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions



This article was downloaded by: [ ] On: 05 January 2016, At: 21:41
Publisher: Institute for Operations Research and the Management Sciences (INFORMS)
INFORMS is located in Maryland, USA

Operations Research
Publication details, including instructions for authors and subscription information: http://pubsonline.informs.org

An Approximate Dynamic Programming Algorithm for Monotone Value Functions
Daniel R. Jiang, Warren B. Powell

To cite this article: Daniel R. Jiang, Warren B. Powell (2015) An Approximate Dynamic Programming Algorithm for Monotone Value Functions. Operations Research 63(6): http://dx.doi.org/ /opre

Full terms and conditions of use: http://pubsonline.informs.org/page/terms-and-conditions

This article may be used only for the purposes of research, teaching, and/or private study. Commercial use or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher approval, unless otherwise noted. The Publisher does not warrant or guarantee the article's accuracy, completeness, merchantability, fitness for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or support of claims made of that product, publication, or service.

Copyright 2015, INFORMS

INFORMS is the largest professional society in the world for professionals in the fields of operations research, management science, and analytics.

OPERATIONS RESEARCH, Vol. 63, No. 6, November-December 2015. http://dx.doi.org/ /opre. INFORMS

An Approximate Dynamic Programming Algorithm for Monotone Value Functions

Daniel R. Jiang, Warren B. Powell
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey {drjiang@princeton.edu, powell@princeton.edu}

Many sequential decision problems can be formulated as Markov decision processes (MDPs) where the optimal value function (or cost-to-go function) can be shown to satisfy a monotone structure in some or all of its dimensions. When the state space becomes large, traditional techniques, such as the backward dynamic programming algorithm (i.e., backward induction or value iteration), may no longer be effective in finding a solution within a reasonable time frame, and thus we are forced to consider other approaches, such as approximate dynamic programming (ADP). We propose a provably convergent ADP algorithm called Monotone-ADP that exploits the monotonicity of the value functions to increase the rate of convergence. In this paper, we describe a general finite-horizon problem setting where the optimal value function is monotone, present a convergence proof for Monotone-ADP under various technical assumptions, and show numerical results for three application domains: optimal stopping, energy storage/allocation, and glycemic control for diabetes patients. The empirical results indicate that by taking advantage of monotonicity, we can attain high quality solutions within a relatively small number of iterations, using up to two orders of magnitude less computation than is needed to compute the optimal solution exactly.

Keywords: approximate dynamic programming; monotonicity; optimal stopping; energy storage; glycemic control.
Subject classifications: dynamic programming/optimal control: Markov, finite state.
Area of review: Optimization.
History: Received July 2014; revisions received May 2015, July 2015; accepted August. Published online in Articles in Advance November 4.

1. Introduction

Sequential decision problems are an important concept in many fields, including operations research, economics, and finance. For a small, tractable problem, the backward dynamic programming (BDP) algorithm (also known as backward induction or finite-horizon value iteration) can be used to compute the optimal value function, from which we get an optimal decision making policy (Puterman 1994). However, the state space for many real-world applications can be immense, making this algorithm very computationally intensive. Hence, we must often turn to the field of approximate dynamic programming, which seeks to solve these problems via approximation techniques. One way to obtain a better approximation is to exploit (problem-dependent) structural properties of the optimal value function, and doing so often accelerates the convergence of ADP algorithms. In this paper, we consider the case where the optimal value function is monotone with respect to a partial order.

Although this paper focuses on the theory behind our ADP algorithm and not a specific application, we first point out that our technique can be broadly utilized. Monotonicity is a very common property because it is true in many situations that "more is better." To be more precise, problems that satisfy free disposal (to borrow a term from economics) or no holding costs are likely to contain monotone structure. There are also less obvious ways that monotonicity can come into play, such as environmental variables that influence the stochastic evolution of a primary state variable (e.g., extreme weather can lead to increased expected travel times; high natural gas prices can lead to higher electricity spot prices). The following list is a small sample of real-world applications spanning the literature of the aforementioned disciplines (and their subfields) that satisfy the special property of monotone value functions.
Operations Research

The problem of optimal replacement of machine parts is well studied in the literature (see, e.g., Feldstein and Rothschild 1974, Pierskalla and Voelker 1976, and Rust 1987) and can be formulated as a regenerative optimal stopping problem in which the value function is monotone in the current health of the part and the state of its environment. Section 7 discusses this model and provides detailed numerical results.

The problem of batch servicing of customers at a service station as discussed in Papadaki and Powell (2002) features a value function that is monotone in the number of customers. Similarly, the related problem of multiproduct batch dispatch studied in Papadaki and Powell (2003b) can be shown to have a monotone value function in the multidimensional state variable that contains the number of products awaiting dispatch.

Energy

In the energy storage and allocation problem, one must optimally control a storage device that interfaces with the spot market and a stochastic energy supply (such as wind or solar). The goal is to reliably satisfy a possibly stochastic demand in the most profitable way. We can show that without holding costs, the value function is monotone in the resource (see Scott and Powell 2012 and Salas and Powell 2013). Once again, refer to §7 for numerical work in this problem class.

The value function from the problem of maximizing revenue using battery storage while bidding hourly in the electricity market can be shown to satisfy monotonicity in the resource, bid, and remaining battery lifetime (see Jiang and Powell 2015).

Healthcare

Hsih (2010) develops a model for optimal dosing applied to glycemic control in diabetes patients. At each decision epoch, one of several treatments (e.g., sensitizers, secretagogues, alpha-glucosidase inhibitors, or peptide analogs) with varying levels of strength (i.e., ability to decrease glucose levels) but also varying side effects, such as weight gain, needs to be administered. The value function in this problem is monotone whenever the utility function of the state of health is monotone. See §7 for the complete model and numerical results.

Statins are often used as treatment against heart disease or stroke in diabetes patients with lipid abnormalities. The optimal time for statin initiation, however, is a difficult medical problem due to the competing forces of health benefits and side effects. Kurt et al. (2011) models the problem as an MDP with a value function monotone in a risk factor known as the lipid ratio.
Finance

The problem of mutual fund cash balancing, described in Nascimento and Powell (2010), is faced by fund managers who must decide on the amount of cash to hold, taking into account various market characteristics and investor demand. The value functions turn out to be monotone in the interest rate and the portfolio's rate of return.

The pricing problem for American options (see Luenberger 1998) uses the theory of optimal stopping and, depending on the model of the price process, monotonicity can be shown in various state variables: for example, the current stock price or the volatility (see Ekström 2004).

Economics

Kaplan and Violante (2014) model the decisions of consumers after receiving fiscal stimulus payments to explain observed consumption behavior. The household has both liquid and illiquid assets (the state variable), in which the value functions are clearly monotone.

A classical model of search unemployment in economics describes a situation where at each period, a worker has a decision of accepting a wage offer or continuing to search for employment. The resulting value functions can be shown to be increasing with wage (see §10.7 of Stokey and Lucas 1989 and McCall 19).

This paper makes the following contributions. We describe and prove the convergence of an algorithm, called Monotone-ADP (M-ADP), for learning monotone value functions by preserving monotonicity after each update. We also provide empirical results for the algorithm in the context of various applications in operations research, energy, and healthcare as experimental evidence that exploiting monotonicity dramatically improves the rate of convergence. The performance of Monotone-ADP is compared to several established algorithms: kernel-based reinforcement learning (Ormoneit and Sen 2002), approximate policy iteration (Bertsekas 2011), asynchronous value iteration (Bertsekas 2007), and Q-learning (Watkins and Dayan 1992).

The paper is organized as follows. Section 2 gives a literature review, followed by the problem formulation and algorithm description in §3 and §4.
Next, §5 provides the assumptions necessary for convergence, and §6 states and proves the convergence theorem, with several proofs of lemmas and propositions postponed until the appendix and online supplement (available as supplemental material at http://dx.doi.org/ /opre ). Section 7 describes numerical experiments over a suite of problems, with the largest one having a seven dimensional state variable and nearly 20 million states per time period. We conclude in the final section.

2. Literature Review

General monotone functions (not necessarily a value function) have been extensively studied in the academic literature. The statistical estimation of monotone functions is known as isotonic or monotone regression and has been studied as early as 1955; see Ayer et al. (1955) or Brunk (1955). The main idea of isotonic regression is to minimize a weighted error under the constraint of monotonicity (see Barlow et al. for a thorough description). The problem can be solved in a variety of ways, including the Pool Adjacent Violators Algorithm (PAVA) described in Ayer et al. (1955). More recently, Mammen (1991) builds upon this previous research by describing an estimator that combines kernel regression and PAVA to produce a smooth regression function. Additional studies from the statistics literature include Mukerjee (1988), Ramsay (1998), and Dette et al. (2006). Although these approaches are outside the context of dynamic programming, that they were developed and well studied highlights the pertinence of monotone functions.

From the operations research literature, monotone value functions and conditions for monotone optimal policies are broadly described in Puterman (1994, §4.7) and some general theory is derived therein. Similar discussions of the topic can be found in Ross (1983), Stokey and Lucas

(1989), Müller (1997), and Smith and McCardle (2002). The algorithm that we describe in this paper is first used in Papadaki and Powell (2002) as a heuristic to solve the stochastic batch service problem, where the value function is monotone. However, the convergence of the algorithm is not analyzed and the state variable is scalar. Finally, in Papadaki and Powell (2003a), the authors prove the convergence of the Discrete Online Monotone Estimation (DOME) algorithm, which takes advantage of a monotonicity preserving step to iteratively estimate a discrete monotone function. DOME, though, was not designed for dynamic programming, and the proof of convergence requires independent observations across iterations, which is an assumption that cannot be made for Monotone-ADP.

Another common property of value functions, especially in resource allocation problems, is convexity/concavity. Rather than using a monotonicity preserving step as Monotone-ADP does, algorithms such as the Successive Projective Approximation Routine (SPAR) of Powell et al. (2004), the Lagged Acquisition ADP Algorithm of Nascimento and Powell (2009), and the Leveling Algorithm of Topaloglu and Powell (2003) use a concavity preserving step, which is the same as maintaining monotonicity in the slopes. The proof of convergence for our algorithm, Monotone-ADP, uses ideas found in Tsitsiklis (1994) (later also used in Bertsekas and Tsitsiklis 1996) and Nascimento and Powell (2009). Convexity has also been exploited successfully in multistage linear stochastic programs (see, e.g., Birge 1985, Pereira and Pinto 1991, and Asamov and Powell 2015). In our work, we take as inspiration the value of convexity demonstrated in the literature and show that monotonicity is another important structural property that can be leveraged in an ADP setting.

3. Mathematical Formulation

We consider a generic problem with a time horizon, $t = 0, 1, 2, \ldots, T$. Let $\mathcal{S}$ be the state space under consideration, where $|\mathcal{S}| < \infty$, and let $\mathcal{A}$ be the set of actions or decisions available at each time step. Let $S_t \in \mathcal{S}$ be the random variable representing the state at time $t$ and $a_t \in \mathcal{A}$ be the action taken at time $t$. For a state $S_t \in \mathcal{S}$ and an action $a_t \in \mathcal{A}$, let $C_t(S_t, a_t)$ be a contribution or reward received in period $t$ and $C_T(S_T)$ be the terminal contribution. Let $A_t^\pi : \mathcal{S} \to \mathcal{A}$ be the decision function at time $t$ for a policy $\pi$ from the class $\Pi$ of all admissible policies. Our goal is to maximize the expected total contribution, giving us the following objective function:

$$\sup_{\pi \in \Pi} \mathbf{E}\left[\sum_{t=0}^{T-1} C_t\bigl(S_t, A_t^\pi(S_t)\bigr) + C_T(S_T)\right],$$

where we seek a policy to choose the actions $a_t$ sequentially based on the states $S_t$ that we visit. Let $(W_t)_{t=0}^{T}$ be a discrete time stochastic process that encapsulates all of the randomness in our problem; we call it the information process. Assume that $W_t \in \mathcal{W}$ for each $t$ and that there exists a state transition function $f : \mathcal{S} \times \mathcal{A} \times \mathcal{W} \to \mathcal{S}$ that describes the evolution of the system. Given a current state $S_t$, an action $a_t$, and an outcome of the information process $W_{t+1}$, the next state is given by

$$S_{t+1} = f(S_t, a_t, W_{t+1}). \tag{1}$$

Let $s \in \mathcal{S}$. The optimal policy can be expressed through a set of optimal value functions using the well-known Bellman's equation:

$$V_t^*(s) = \sup_{a \in \mathcal{A}} \bigl[ C_t(s, a) + \mathbf{E}[V_{t+1}^*(S_{t+1}) \mid S_t = s,\ a_t = a] \bigr] \quad \text{for } t = 0, 1, 2, \ldots, T-1, \qquad V_T^*(s) = C_T(s), \tag{2}$$

with the understanding that $S_{t+1}$ transitions from $S_t$ according to (1). In many cases, the terminal contribution function $C_T(S_T)$ is zero. Suppose that the state space $\mathcal{S}$ is equipped with a partial order, denoted $\preceq$, and the following monotonicity property is satisfied for every $t$:

$$s \preceq s' \implies V_t^*(s) \le V_t^*(s'). \tag{3}$$

In other words, the optimal value function $V_t^*$ is order-preserving over the state space $\mathcal{S}$. In the case where the state space is multidimensional (see §7 for examples), a common example of $\preceq$ is componentwise inequality, which we henceforth denote using the traditional $\le$. A second example that arises very often is the following definition of $\preceq$, which we call the generalized componentwise inequality.
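To make Bellman's equation (2) and the monotonicity property (3) concrete, the following is a minimal sketch of backward dynamic programming on a toy problem. The model is purely illustrative and is not one of the paper's applications: the state is a scalar resource level, the action sells zero or one unit at a fixed price, and the information process adds zero or one unit with equal probability. All names (`contribution`, `transition`, `backward_dp`) and all parameter values are assumptions made for this example.

```python
# Toy finite-horizon MDP (hypothetical): scalar resource level, sell 0 or 1 unit.
T = 5                              # time horizon t = 0, ..., T
STATES = range(5)                  # resource levels 0..4
ACTIONS = (0, 1)                   # hold, or try to sell one unit
OUTCOMES = ((0, 0.5), (1, 0.5))    # (w, probability) for the information process


def contribution(t, s, a):
    """C_t(s, a): revenue from selling min(a, s) units at an assumed price of 2."""
    return 2.0 * min(a, s)


def terminal(s):
    """C_T(s): terminal contribution, taken to be zero here."""
    return 0.0


def transition(s, a, w):
    """f(s, a, w): remove what was sold, add the exogenous arrival, cap at the top level."""
    return min(s - min(a, s) + w, max(STATES))


def backward_dp():
    """Apply Bellman's equation (2) backward from t = T to t = 0."""
    V = [[0.0] * len(STATES) for _ in range(T + 1)]
    for s in STATES:
        V[T][s] = terminal(s)
    for t in range(T - 1, -1, -1):
        for s in STATES:
            V[t][s] = max(
                contribution(t, s, a)
                + sum(p * V[t + 1][transition(s, a, w)] for w, p in OUTCOMES)
                for a in ACTIONS
            )
    return V


V = backward_dp()
# Property (3) for the usual order on a scalar state: V_t(s) <= V_t(s + 1).
is_monotone = all(
    V[t][s] <= V[t][s + 1] for t in range(T + 1) for s in range(len(STATES) - 1)
)
```

In this toy model both the contribution and the transition are nondecreasing in the resource level, which is why the computed value functions come out monotone; the exhaustive loop over states is exactly the cost that becomes prohibitive when the state space is large.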
Assume that each state $s$ can be decomposed into $s = (m, i)$ for some $m \in \mathcal{M}$ and $i \in \mathcal{I}$. For two states $s = (m, i)$ and $s' = (m', i')$, we have

$$s \preceq s' \iff m \le m',\ i = i'. \tag{4}$$

In other words, we know that whenever $i$ is held constant, then the value function is monotone in the "primary" variable $m$. An example of when such a model would be useful is when $m$ represents the amount of some held resource that we are both buying and selling, while $i$ represents additional state-of-the-world information, such as prices of related goods, transport times on a shipping network, or weather information. Depending on the specific model, the relationship between the value of $i$ and the optimal value function may be quite complex and a priori unknown to us. However, it is likely to be obvious that for $i$ held constant, the value function is increasing in $m$, the amount of resource that we own. Hence, the definition (4) is natural for this situation. The following proposition is given in the setting of the generalized componentwise inequality and provides a simple condition that can be used to verify monotonicity in the value function.

Proposition 1. Suppose that every $s \in \mathcal{S}$ can be written as $s = (m, i)$ for some $m \in \mathcal{M}$ and $i \in \mathcal{I}$, and let $S_t = (M_t, I_t)$ be the state at time $t$, with $M_t \in \mathcal{M}$ and $I_t \in \mathcal{I}$. Let the partial order $\preceq$ on the state space $\mathcal{S}$ be described by (4). Assume the following assumptions hold.
(i) For every $s, s' \in \mathcal{S}$ with $s \preceq s'$, $a \in \mathcal{A}$, and $w \in \mathcal{W}$, the state transition function satisfies $f(s, a, w) \preceq f(s', a, w)$.
(ii) For each $t < T$, $s, s' \in \mathcal{S}$ with $s \preceq s'$, and $a \in \mathcal{A}$, $C_t(s, a) \le C_t(s', a)$ and $C_T(s) \le C_T(s')$.

(iii) For each $t < T$, $M_t$ and $W_{t+1}$ are independent.
Then the value functions $V_t^*$ satisfy the monotonicity property of (3).

Proof. See the online supplement.

There are other similar ways to check for monotonicity; for example, see Puterman (1994) or Theorem 9.11 of Stokey and Lucas (1989) for conditions on the transition probabilities. We choose to provide the above proposition because of its relevance to our example applications in §7.

The most traditional form of Bellman's equation has been given in (2), which we refer to as the pre-decision state version. Next, we discuss some alternative formulations from the literature that can be very useful for certain problem classes. A second formulation, called the Q-function (or state-action) form of Bellman's equation, is popular in the field of reinforcement learning, especially in applications of the widely used Q-learning algorithm (see Watkins and Dayan 1992):

$$Q_t^*(s, a) = \mathbf{E}\Bigl[ C_t(s, a) + \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^*(S_{t+1}, a_{t+1}) \Bigm| S_t = s,\ a_t = a \Bigr] \quad \text{for } t = 0, 1, 2, \ldots, T-1, \qquad Q_T^*(s, a) = C_T(s), \tag{5}$$

where we must now impose the additional requirement that $\mathcal{A}$ is a finite set. $Q^*$ is known as the state-action value function and the "state space" in this case is enlarged to be $\mathcal{S} \times \mathcal{A}$.

A third formulation of Bellman's equation is in the context of post-decision states (see Powell 2011 for a detailed treatment of this important technique). Essentially, the post-decision state, which we denote $S_t^a$, represents the state after the decision has been made, but before the random information $W_{t+1}$ has arrived (the state-action pair is also a post-decision state). For example, in the simple problem of purchasing additional inventory $x$ to the current stock $R$ to satisfy a next-period stochastic demand, the post-decision state can be written as $R + x$, and the pre-decision state is $R$.
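The inventory example can be sketched in a few lines. This is only an illustration of how a transition splits into a deterministic "decision" half and a random "information" half; the function names are hypothetical, and the lost-sales rule used to reach the next pre-decision state is an assumption that the text does not specify.

```python
def post_decision(R, x):
    """Post-decision state: stock after ordering x units, before demand is revealed."""
    return R + x


def next_pre_decision(R_post, demand):
    """Next pre-decision state once demand arrives (assumed lost sales: no backorders)."""
    return max(R_post - demand, 0)


def transition(R, x, demand):
    """Full transition f(R, x, demand), composed from the two halves above."""
    return next_pre_decision(post_decision(R, x), demand)


# Different state-action pairs can share one post-decision state,
# so the post-decision state space can be smaller than the state-action space.
same_post_state = post_decision(1, 4) == post_decision(4, 1)
```

Here the pair $(R, x)$ collapses to the single number $R + x$, which is the sense in which a post-decision state can be of lower dimension than the state-action pair.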
It must be the case that $S_t^a$ contains the same information as the state-action pair $(S_t, a_t)$, meaning that regardless of whether we condition on $S_t^a$ or $(S_t, a_t)$, the conditional distribution of $W_{t+1}$ is the same. The attractiveness of this method is that (1) in certain problems, $S_t^a$ is of lower dimension than $(S_t, a_t)$ and (2) when writing Bellman's equation in terms of the post-decision state space (using a redefined value function), the supremum and the expectation are interchanged, giving us some computational advantages. Let $s^a$ be a post-decision state from the post-decision state space $\mathcal{S}^a$. Bellman's equation becomes

$$V_t^{a,*}(s^a) = \mathbf{E}\Bigl[ \sup_{a \in \mathcal{A}} \bigl[ C_{t+1}(S_{t+1}, a) + V_{t+1}^{a,*}(S_{t+1}^a) \bigr] \Bigm| S_t^a = s^a \Bigr] \quad \text{for } t = 0, 1, 2, \ldots, T-2, \qquad V_{T-1}^{a,*}(s^a) = \mathbf{E}[C_T(S_T) \mid S_{T-1}^a = s^a], \tag{6}$$

where $V^{a,*}$ is known as the post-decision value function. In approximate dynamic programming, the original Bellman's equation formulation (2) can be used if the transition probabilities are known. When the transition probabilities are unknown, we must often rely purely on experience or some form of black box simulator. In these situations, formulations (5) and (6) of Bellman's equation, where the optimization is within the expectation, become extremely useful. For the remainder of this paper, rather than distinguishing between the three forms of the value function ($V^*$, $Q^*$, and $V^{a,*}$), we simply use $V^*$ and call it the optimal value function, with the understanding that it may be replaced with any of the definitions. Similarly, to simplify notation, we do not distinguish between the three forms of the state space ($\mathcal{S}$, $\mathcal{S} \times \mathcal{A}$, and $\mathcal{S}^a$) and simply use $\mathcal{S}$ to represent the domain of the value function (for some $t$).

Let $d = |\mathcal{S}|$ and $D = (T + 1)|\mathcal{S}|$. We view the optimal value function as a vector in $\mathbb{R}^D$; that is to say, $V^* \in \mathbb{R}^D$ has a component at $(t, s)$ denoted as $V_t^*(s)$. Moreover, for a fixed $t \le T$, the notation $V_t^* \in \mathbb{R}^d$ is used to describe $V^*$ restricted to $t$; i.e., the components of $V_t^*$ are $V_t^*(s)$ with $s$ varying over $\mathcal{S}$. We adopt this notational system for arbitrary value functions $V \in \mathbb{R}^D$ as well. Finally, we define the generalized dynamic programming operator $H : \mathbb{R}^D \to \mathbb{R}^D$, which applies the right-hand sides of either (2), (5), or (6) to an arbitrary $V \in \mathbb{R}^D$, i.e., replacing $V_t^*$, $Q_t^*$, and $V_t^{a,*}$ with $V_t$. For example, if $H$ is defined in the context of (2), then the component of $HV$ at $(t, s)$ is given by

$$(HV)_t(s) = \begin{cases} \sup_{a \in \mathcal{A}} \bigl[ C_t(s, a) + \mathbf{E}[V_{t+1}(S_{t+1}) \mid S_t = s,\ a_t = a] \bigr] & \text{for } t = 0, 1, 2, \ldots, T-1, \\ C_T(s) & \text{for } t = T. \end{cases} \tag{7}$$

For (5) and (6), $H$ can be defined in an analogous way. We now state a lemma concerning useful properties of $H$. Parts of it are similar to Assumption 4 of Tsitsiklis (1994), but we can show that these statements always hold true for our more specific problem setting, where $H$ is a generalized dynamic programming operator.

Lemma 1. The following statements are true for $H$, when it is defined using (2), (5), or (6).
(i) $H$ is monotone; i.e., for $V, V' \in \mathbb{R}^D$ such that $V \le V'$, we have that $HV \le HV'$ (componentwise).
(ii) For any $t < T$, let $V, V' \in \mathbb{R}^D$, such that $V_{t+1} \le V'_{t+1}$. It then follows that $(HV)_t \le (HV')_t$.
(iii) The optimal value function $V^*$ uniquely satisfies the fixed point equation $HV = V$.
(iv) Let $V \in \mathbb{R}^D$ and $e$ is a vector of ones with dimension $D$. For any $\eta > 0$,
$$HV - \eta e \le H(V - \eta e) \le H(V + \eta e) \le HV + \eta e.$$

Proof. See Appendix A.
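As a numerical sanity check rather than a proof, the sketch below implements $H$ in the sense of (7) for a small, hypothetical MDP and tests parts (i), (iii), and (iv) of Lemma 1 on randomly generated vectors. The toy model, the helper names (`apply_H`, `shift`, `leq`), and the floating-point tolerance are all assumptions introduced for this example.

```python
import random

T, N_STATES = 3, 4
ACTIONS = (0, 1)
OUTCOMES = ((0, 0.5), (1, 0.5))    # (w, probability)


def C(t, s, a):
    return float(min(a, s))         # assumed contribution: one unit of revenue per sale


def C_T(s):
    return 0.0                      # assumed terminal contribution


def f(s, a, w):
    return min(s - min(a, s) + w, N_STATES - 1)


def apply_H(V):
    """(HV)_t(s) as in (7); V is indexed as V[t][s] for t = 0..T."""
    HV = [[0.0] * N_STATES for _ in range(T + 1)]
    for s in range(N_STATES):
        HV[T][s] = C_T(s)
    for t in range(T):
        for s in range(N_STATES):
            HV[t][s] = max(
                C(t, s, a) + sum(p * V[t + 1][f(s, a, w)] for w, p in OUTCOMES)
                for a in ACTIONS
            )
    return HV


def shift(V, c):
    """V + c * e, where e is the vector of ones."""
    return [[v + c for v in row] for row in V]


def leq(U, V, tol=1e-9):
    """Componentwise U <= V, up to a small floating-point tolerance."""
    return all(u <= v + tol for ru, rv in zip(U, V) for u, v in zip(ru, rv))


rng = random.Random(0)
V = [[rng.uniform(0, 5) for _ in range(N_STATES)] for _ in range(T + 1)]
V_big = [[v + rng.uniform(0, 1) for v in row] for row in V]   # V <= V_big
eta = 0.7

HV = apply_H(V)
part_i = leq(HV, apply_H(V_big))
part_iv = (
    leq(shift(HV, -eta), apply_H(shift(V, -eta)))
    and leq(apply_H(shift(V, -eta)), apply_H(shift(V, eta)))
    and leq(apply_H(shift(V, eta)), shift(HV, eta))
)

# In a finite-horizon problem, applying H T + 1 times from any start reaches the fixed point.
V_star = V
for _ in range(T + 1):
    V_star = apply_H(V_star)
part_iii = apply_H(V_star) == V_star
```

The fixed-point check relies on the finite horizon: each application of $H$ makes one more time period exact, starting from $t = T$, so no contraction argument is needed in this setting.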

4. Algorithm

In this section, we formally describe the Monotone-ADP algorithm. Assume a probability space $(\Omega, \mathcal{F}, \mathbf{P})$ and let $\bar{V}^n$ be the approximation of $V^*$ at iteration $n$, with the random variable $S_t^n \in \mathcal{S}$ representing the state that is visited (by the algorithm) at time $t$ in iteration $n$. The observation of the optimal value function at time $t$, iteration $n$, and state $S_t^n$ is denoted $\hat{v}_t^n(S_t^n)$ and is calculated using the estimate of the value function from iteration $n - 1$. The raw observation $\hat{v}_t^n(S_t^n)$ is then smoothed with the previous estimate $\bar{V}_t^{n-1}(S_t^n)$, using a stochastic approximation step, to produce the smoothed observation $z_t^n(S_t^n)$. Before presenting the description of the ADP algorithm, some definitions need to be given. We start with $\Pi_M$, the monotonicity preserving projection operator. Note that the term "projection" is being used loosely here; the space that we "project" onto actually changes with each iteration.

Definition 1. For $s^r \in \mathcal{S}$ and $z^r \in \mathbb{R}$, let $(s^r, z^r)$ be a reference point to which other states are compared. Let $V_t \in \mathbb{R}^d$ and define the projection operator $\Pi_M : \mathcal{S} \times \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$, where the component of the vector $\Pi_M(s^r, z^r, V_t)$ at $s$ is given by

$$\Pi_M(s^r, z^r, V_t)(s) = \begin{cases} z^r & \text{if } s = s^r, \\ z^r \vee V_t(s) & \text{if } s^r \preceq s,\ s \ne s^r, \\ z^r \wedge V_t(s) & \text{if } s^r \succeq s,\ s \ne s^r, \\ V_t(s) & \text{otherwise.} \end{cases} \tag{8}$$

In the context of the Monotone-ADP algorithm, $V_t$ is the current value function approximation, $(s^r, z^r)$ is the latest observation of the value ($s^r$ is the latest visited state), and $\Pi_M(s^r, z^r, V_t)$ is the updated value function approximation. Violations of the monotonicity property of (3) are corrected by $\Pi_M$ in the following ways:
if $z^r \ge V_t(s)$ and $s^r \preceq s$, then $V_t(s)$ is too small and is increased to $z^r = z^r \vee V_t(s)$; and
if $z^r \le V_t(s)$ and $s^r \succeq s$, then $V_t(s)$ is too large and is decreased to $z^r = z^r \wedge V_t(s)$.
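The case analysis in (8) translates almost line for line into code. The sketch below is one possible rendering for a lookup-table approximation over a two-dimensional grid with the componentwise order; the dictionary representation and the function names (`leq`, `monotone_projection`, `is_monotone`) are assumptions made for illustration.

```python
def leq(s, s2):
    """Componentwise partial order on tuple-valued states."""
    return all(x <= y for x, y in zip(s, s2))


def monotone_projection(s_ref, z_ref, V):
    """Pi_M(s_ref, z_ref, V) from (8); V maps each state to its current value."""
    out = {}
    for s, v in V.items():
        if s == s_ref:
            out[s] = z_ref              # keep the latest smoothed observation
        elif leq(s_ref, s):
            out[s] = max(z_ref, v)      # state above the reference: raise if too small
        elif leq(s, s_ref):
            out[s] = min(z_ref, v)      # state below the reference: lower if too large
        else:
            out[s] = v                  # incomparable state: untouched
    return out


def is_monotone(V):
    """Check property (3) over every comparable pair of states."""
    return all(V[s] <= V[s2] for s in V for s2 in V if leq(s, s2))


states = [(i, j) for i in range(3) for j in range(3)]
V0 = {s: 0.0 for s in states}
V1 = monotone_projection((1, 1), 5.0, V0)   # first observation
V2 = monotone_projection((2, 0), 2.0, V1)   # second observation
```

A single observation at $(1, 1)$ immediately lifts every state above it, while incomparable states such as $(0, 2)$ and $(2, 0)$ are left unchanged; this spreading of information to states that were not visited is the source of the speed-up claimed for the algorithm.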
See Figure 1 for an example showing a sequence of two observations and the resulting projections in the Cartesian plane, where $\preceq$ is the componentwise inequality in two dimensions. We now provide some additional motivation for the definition of $\Pi_M$. Because $z_t^n(S_t^n)$ is the latest observed value and it is obtained via stochastic approximation (see Step 2b of Figure 2), our intuition guides us to keep this value, i.e., by setting $\bar{V}_t^n(S_t^n) = z_t^n(S_t^n)$. For $s \in \mathcal{S}$ and $z \in \mathbb{R}$, let us define the set

$$\mathcal{V}_M(s, z) = \{ V \in \mathbb{R}^d : V(s) = z,\ V \text{ monotone over } \mathcal{S} \},$$

which fixes the value at $s$ to be $z$ while restricting to the set of all possible $V$ that satisfy the monotonicity property (3). Now, to get the approximate value function of iteration $n$ and time $t$, we want to find $\bar{V}_t^n$ that is close to $\bar{V}_t^{n-1}$ but also satisfies the monotonicity property:

$$\bar{V}_t^n \in \arg\min \bigl\{ \| V_t - \bar{V}_t^{n-1} \|_2 : V_t \in \mathcal{V}_M(S_t^n, z_t^n(S_t^n)) \bigr\}, \tag{9}$$

where $\| \cdot \|_2$ is the Euclidean norm.

Figure 1. Example illustrating the projection operator $\Pi_M$.

Let us now briefly pause and consider a possible alternative, where we do not require $\bar{V}_t^n(S_t^n) = z_t^n(S_t^n)$. Instead, suppose we introduce a vector $\hat{V}_t^{n-1} \in \mathbb{R}^d$ such that $\hat{V}_t^{n-1}(s) = \bar{V}_t^{n-1}(s)$ for $s \ne S_t^n$ and $\hat{V}_t^{n-1}(S_t^n) = z_t^n(S_t^n)$. Next, project $\hat{V}_t^{n-1}$ onto the space of vectors $V$ that are monotone over $\mathcal{S}$ to produce $\bar{V}_t^n$ (this would be a proper projection, where the space does not change). The problem with this approach arises in the early iterations where we have poor estimates of the value function: for example, if $\bar{V}_t^0(s) = 0$ for all $s$, then $\hat{V}_t^0$ is a vector of mostly zeros and the likely result of the projection, $\bar{V}_t^1$, would be the original vector $\bar{V}_t^0$; hence, no progress is made. A potential explanation for the failure of such a strategy is that it is a naive adaptation of the natural approach for a batch framework to a recursive setting. The next proposition shows that this representation of $\bar{V}_t^n$ is equivalent to one that is obtained using the projection operator $\Pi_M$.

Proposition 2. The solution to the minimization (9) can be characterized using $\Pi_M$.
Specifically,

$$\Pi_M\bigl(S_t^n, z_t^n(S_t^n), \bar{V}_t^{n-1}\bigr) \in \arg\min \bigl\{ \| V_t - \bar{V}_t^{n-1} \|_2 : V_t \in \mathcal{V}_M(S_t^n, z_t^n(S_t^n)) \bigr\},$$

so that we can write $\bar{V}_t^n = \Pi_M(S_t^n, z_t^n(S_t^n), \bar{V}_t^{n-1})$.

Proof. See Appendix B.

We now introduce, for each $t$, a (possibly stochastic) stepsize sequence $\alpha_t^n \le 1$ used for smoothing in new observations. The algorithm only directly updates values (i.e., not including updates from the projection operator) for states that are visited, so for each $s \in \mathcal{S}$, let

$$\alpha_t^n(s) = \alpha_t^{n-1} \mathbf{1}_{\{s = S_t^n\}}.$$

Let $\hat{v}_t^n \in \mathbb{R}^d$ be a noisy observation of the quantity $(H\bar{V}^{n-1})_t$, and let $w_t^n \in \mathbb{R}^d$ represent the additive noise associated with the observation:

$$\hat{v}_t^n = (H\bar{V}^{n-1})_t + w_t^n.$$

Figure 2. Monotone-ADP algorithm.

Step 0a. Initialize $\bar{V}_t^0 \in [0, V_{\max}]$ for each $t \le T - 1$ such that monotonicity is satisfied within $\bar{V}_t^0$, as described in (3).
Step 0b. Set $\bar{V}_T^n(s) = C_T(s)$ for each $s \in \mathcal{S}$ and $n \le N$.
Step 0c. Set $n = 1$.
Step 1. Select an initial state $S_0^n$.
Step 2. For $t = 0, 1, \ldots, (T - 1)$:
Step 2a. Sample a noisy observation of the future value: $\hat{v}_t^n = (H\bar{V}^{n-1})_t + w_t^n$.
Step 2b. Smooth in the new observation with previous value at each $s$: $z_t^n(s) = (1 - \alpha_t^n(s))\, \bar{V}_t^{n-1}(s) + \alpha_t^n(s)\, \hat{v}_t^n(s)$.
Step 2c. Perform monotonicity projection operator: $\bar{V}_t^n = \Pi_M(S_t^n, z_t^n(S_t^n), \bar{V}_t^{n-1})$.
Step 2d. Choose the next state $S_{t+1}^n$ given $\mathcal{F}^{n-1}$.
Step 3. If $n < N$, increment $n$ and return to Step 1.

Although the algorithm is asynchronous and only updates the value for $S_t^n$ (therefore, it only needs $\hat{v}_t^n(S_t^n)$, the component of $\hat{v}_t^n$ at $S_t^n$), it is convenient to assume $\hat{v}_t^n(s)$ and $w_t^n(s)$ are defined for all $s$. We also require a vector $z_t^n \in \mathbb{R}^d$ to represent the smoothed observation of the future value; i.e., $z_t^n(s)$ is $\hat{v}_t^n(s)$ smoothed with the previous value $\bar{V}_t^{n-1}(s)$ via the stepsize $\alpha_t^n(s)$. Let us denote the history of the algorithm up until iteration $n$ by the filtration $\{\mathcal{F}^n\}_{n \ge 1}$, where

$$\mathcal{F}^n = \sigma\{ (S_t^m, w_t^m)_{m \le n,\ t \le T} \}.$$

A precise description of the algorithm is given in Figure 2. Notice from the description that if the monotonicity property (3) is satisfied at iteration $n - 1$, then the fact that the projection operator $\Pi_M$ is applied ensures that the monotonicity property is satisfied again at iteration $n$. Our benchmarking results of §7 show that maintaining monotonicity in such a way is an invaluable aspect of the algorithm that allows it to produce very good policies in a relatively small number of iterations. Traditional approximate (or asynchronous) value iteration, on which Monotone-ADP is based, is asymptotically convergent but extremely slow to converge in practice (once again, see §7).
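The steps of Figure 2 can be sketched as a short loop. This is a minimal illustration under several stated assumptions, not the authors' implementation: the MDP is a hypothetical scalar resource problem, the observation in Step 2a uses the exact expectation (so the noise term is zero), states are sampled uniformly at random at every time step, and the stepsize is the reciprocal of the visit count. All function names and parameter values are invented for the example.

```python
import random

T, N_STATES, N_ITER = 4, 5, 2000
ACTIONS = (0, 1)
OUTCOMES = ((0, 0.5), (1, 0.5))    # (w, probability)


def C(t, s, a):
    return 2.0 * min(a, s)          # assumed contribution: sell at price 2


def f(s, a, w):
    return min(s - min(a, s) + w, N_STATES - 1)


def observe(V_next, t, s):
    """Step 2a with an exact expectation, so the observation noise is zero."""
    return max(
        C(t, s, a) + sum(p * V_next[f(s, a, w)] for w, p in OUTCOMES)
        for a in ACTIONS
    )


def project(s_ref, z_ref, V_t):
    """Step 2c: Pi_M for a scalar state with the usual order."""
    out = []
    for s, v in enumerate(V_t):
        if s == s_ref:
            out.append(z_ref)
        elif s > s_ref:
            out.append(max(z_ref, v))
        else:
            out.append(min(z_ref, v))
    return out


def monotone_adp(seed=0):
    rng = random.Random(seed)
    # Steps 0a and 0b: a monotone initial guess (all zeros) and zero terminal values.
    V = [[0.0] * N_STATES for _ in range(T + 1)]
    visits = [[0] * N_STATES for _ in range(T)]
    for _ in range(N_ITER):
        for t in range(T):
            s = rng.randrange(N_STATES)           # Steps 1 and 2d: uniform state sampling
            v_hat = observe(V[t + 1], t, s)       # V[t + 1] still holds the previous iterate
            visits[t][s] += 1
            alpha = 1.0 / visits[t][s]            # harmonic stepsize per visited state
            z = (1.0 - alpha) * V[t][s] + alpha * v_hat   # Step 2b
            V[t] = project(s, z, V[t])            # Step 2c
    return V


V_bar = monotone_adp()
```

Because the loop runs forward in $t$, the values at $t + 1$ used in the observation have not yet been updated in the current iteration, which matches the use of $\bar{V}^{n-1}$ in Step 2a. Uniform sampling is only one way to visit every state infinitely often, and a noisy simulator could replace `observe` without changing the rest of the loop.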
As we have mentioned, $\Pi_M$ is not a standard projection operator, as it "projects" to a different space on every iteration, depending on the state visited and value observed; therefore, traditional convergence results no longer hold. The remainder of the paper establishes the asymptotic convergence of Monotone-ADP.

4.1. Extensions of Monotone-ADP

We now briefly present two possible extensions of Monotone-ADP. First, consider a discounted, infinite horizon MDP. An extension (or perhaps, simplification) to this case can be obtained by removing the loop over $t$ (and all subscripts of $t$ and $T$) and acquiring one observation per iteration, exactly resembling asynchronous value iteration for infinite horizon problems.

Second, we consider possible extensions when representations of the approximate value function other than lookup table are used; for example, imagine we are using basis functions $\{\phi_g\}_{g \in \mathcal{G}}$ for some feature set $\mathcal{G}$ combined with a coefficient vector $\theta_t^n$ (which has components $\theta_{tg}^n$), giving the approximation

$$\bar{V}_t^n(s) = \sum_{g \in \mathcal{G}} \theta_{tg}^n\, \phi_g(s).$$

Equation (9) is the starting point for adapting Monotone-ADP to handle this case. An analogous version of this update might be given by

$$\theta_t^n \in \arg\min \bigl\{ \| \theta_t - \theta_t^{n-1} \|_2 : \bar{V}_t^n(S_t^n) = z_t^n(S_t^n) \text{ and } \bar{V}_t^n \text{ monotone} \bigr\}, \tag{10}$$

where we have altered the objective to minimize distance in the coefficient space. Unlike (9), there is, in general, no simple and easily computable solution to (10), but special cases may exist. The analysis of this situation is beyond the scope of this paper and left to future work. In this paper, we consider the finite horizon case using a lookup table representation.

5. Assumptions

We begin by providing some technical assumptions that are needed for convergence analysis. The first assumption gives, in more general terms than previously discussed, the monotonicity of the value functions.

Assumption 1. The two monotonicity assumptions are as follows.
(i) The terminal value function $C_T$ is monotone over $\mathcal{S}$ with respect to $\preceq$.
(ii) For any $t < T$ and any vector $V \in \mathbb{R}^D$ such that $V_{t+1}$ is monotone over $\mathcal{S}$ with respect to $\preceq$, it is true that $(HV)_t$ is monotone over the state space as well.
The above assumption implies that for any choice of terminal value function $V_T^* = C_T$ that satisfies monotonicity, the value functions for the previous time periods are monotone as well. Examples of sufficient conditions include monotonicity in the contribution function plus a condition on the transition function, as in (i) of Proposition 1, or a condition on the transition probabilities, as in Puterman (1994). Intuitively speaking, when the statement "starting with more at $t$ implies ending with more at $t + 1$" applies, in expectation, to the problem at hand, Assumption 1 is satisfied. One obvious example that satisfies monotonicity occurs in resource or asset management scenarios; oftentimes in these problems, it is true that for any outcome of the random information $W_{t+1}$ that occurs (e.g., random demand, energy production, or profits), we end with more of the resource at time $t + 1$ whenever we start with more of the resource at time $t$. Mathematically, this property of resource allocation problems translates to the stronger statement:

$$(S_{t+1} \mid S_t = s,\ a_t = a) \preceq (S_{t+1} \mid S_t = s',\ a_t = a) \quad \text{a.s.}$$

for all $a \in \mathcal{A}$ when $s \preceq s'$. This is essentially the situation that Proposition 1 describes.
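For a small model, the pathwise condition above and the contribution condition of Proposition 1 can be checked by brute force. The sketch below does this for a hypothetical resource model whose state is $(m, i)$ under the generalized componentwise order (4); the model, the price levels, and the choice to keep $i$ fixed over time are assumptions made only to keep the example short, and the independence condition (iii) is not tested.

```python
from itertools import product

M_MAX = 3
STATES = [(m, i) for m in range(M_MAX + 1) for i in ("low", "high")]
ACTIONS = (0, 1)
OUTCOMES = (0, 1, 2)               # possible values of the information process


def preceq(s, s2):
    """Generalized componentwise order (4): m <= m' and i == i'."""
    return s[0] <= s2[0] and s[1] == s2[1]


def f(s, a, w):
    """Assumed transition: sell a units if available, add w, cap at M_MAX; i is unchanged."""
    m, i = s
    return (min(max(m - a, 0) + w, M_MAX), i)


def C(s, a):
    """Assumed contribution: revenue at a price determined by the side information i."""
    m, i = s
    price = 1.0 if i == "low" else 3.0
    return price * min(a, m)


ordered_pairs = [(s, s2) for s, s2 in product(STATES, STATES) if preceq(s, s2)]

# Condition (i): the transition preserves the order for every action and outcome.
transition_ok = all(
    preceq(f(s, a, w), f(s2, a, w))
    for s, s2 in ordered_pairs for a in ACTIONS for w in OUTCOMES
)
# Condition (ii): the contribution is order-preserving for every action.
contribution_ok = all(C(s, a) <= C(s2, a) for s, s2 in ordered_pairs for a in ACTIONS)
```

States with different values of $i$ are never compared, so nothing is claimed about how the value depends on the side information; only the dependence on the resource level $m$ is checked.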

Assumption 2. For all $s \in \mathcal{S}$ and $t < T$, the sampling policy satisfies

$$\sum_{n=1}^{\infty} \mathbf{P}(S_t^n = s \mid \mathcal{F}^{n-1}) = \infty \quad \text{a.s.}$$

By the Extended Borel-Cantelli Lemma (see Breiman 1992), any scheme for choosing states that satisfies the above condition will visit every state infinitely often with probability one.

Assumption 3. Suppose that the contribution function $C_t(s, a)$ is bounded: without loss of generality, let us assume that for all $s \in \mathcal{S}$, $t < T$, and $a \in \mathcal{A}$, $0 \le C_t(s, a) \le C_{\max}$, for some $C_{\max} > 0$. Furthermore, suppose that $0 \le C_T(s) \le C_{\max}$ for all $s \in \mathcal{S}$ as well. This naturally implies that there exists $V_{\max} > 0$ such that $0 \le V_t^*(s) \le V_{\max}$.

The next three assumptions are standard ones made on the observations $\hat{v}_t^n$, the noise $w_t^n$, and the stepsize sequence $\alpha_t^n$; see Bertsekas and Tsitsiklis (1996) (e.g., Assumption 4.3 and Proposition 4.6) for additional details.

Assumption 4. The observations that we receive are bounded (by the same constant $V_{\max}$): $0 \le \hat{v}_t^n(s) \le V_{\max}$ almost surely, for all $s \in \mathcal{S}$ and $t < T$.

Note that the lower bounds of zero in Assumptions 3 and 4 are chosen for convenience and can be shifted by a constant to suit the application (as is done in §7).

Assumption 5. The following holds almost surely: $\mathbf{E}[w_t^{n+1}(s) \mid \mathcal{F}^n] = 0$, for any state $s \in \mathcal{S}$ and $t < T$. This property means that $w_t^n$ is a martingale difference noise process.

Assumption 6. For each $s \in \mathcal{S}$ and $t < T$, suppose $\alpha_t^n$ is $\mathcal{F}^n$-measurable and
(i) $\sum_{n=1}^{\infty} \alpha_t^n(s) = \infty$ a.s.,
(ii) $\sum_{n=1}^{\infty} \alpha_t^n(s)^2 < \infty$ a.s.

5.1. Remarks on Simulation

Before proving the theorem, we offer some additional comments regarding the assumptions as they pertain to simulation. If $H$ is defined in the context of (2), then it is not easy to perform Step 2a of Figure 2,

$$\hat{v}_t^n = (H\bar{V}^{n-1})_t + w_t^n,$$

such that Assumption 5 is satisfied.
Because the supremum is outside of the expectation operator, an upward bias would be present in the observation $\hat{v}_t^n(s)$ unless the expectation can be computed exactly, in which case $w_t^n(s) = 0$ and we have

\[ \hat{v}_t^n(s) = \sup_{a \in \mathcal{A}} \Bigl[ C_t(s, a) + \mathbf{E}\bigl[\bar{V}_{t+1}^{n-1}(S_{t+1}) \mid S_t = s,\, a_t = a\bigr] \Bigr]. \tag{11} \]

Thus, any approximation scheme used to calculate the expectation inside of the supremum would cause Assumption 5 to be unsatisfied. When the approximation scheme is a sample mean, the bias disappears asymptotically with the number of samples (see Kleywegt et al. 2002, which discusses the sample average approximation or SAA method). It is therefore possible that although theoretical convergence is not guaranteed, a large enough sample may still achieve decent results in practice.

On the other hand, in the context of (5) and (6), the expectation and the supremum are interchanged. This means that we can trivially obtain an unbiased estimate of $(H\bar{V}^{n-1})_t$ by sampling one outcome of the information process $W_{t+1}^n$ from the distribution $W_{t+1} \mid S_t = s$; computing the next state $S_{t+1}^n$; and solving a deterministic optimization problem (i.e., the optimization within the expectation). In these two cases, we would respectively use

\[ \hat{v}_t^n(s, a) = C_t(s, a) + \max_{a_{t+1} \in \mathcal{A}} \bar{Q}_{t+1}^{n-1}(S_{t+1}^n, a_{t+1}) \tag{12} \]

and

\[ \hat{v}_t^n(s^a) = \sup_{a \in \mathcal{A}} \Bigl[ C_{t+1}(S_{t+1}^n, a) + \bar{V}_{t+1}^{a,\,n-1}(S_{t+1}^{a,n}) \Bigr], \tag{13} \]

where $\bar{Q}_{t+1}^{n-1}$ is the approximation to $Q_{t+1}^*$, $\bar{V}_{t+1}^{a,\,n-1}$ is the approximation to $V_{t+1}^{a,*}$, and $S_{t+1}^{a,n}$ is the post-decision state obtained from $S_{t+1}^n$ and $a$. Notice that (11) contains an expectation whereas (12) and (13) do not, making them particularly well suited for model-free situations, where distributions are unknown and only samples or experience are available. Hence, the best choice of model depends heavily upon the problem domain.

Finally, we give a brief discussion of the choice of stepsize. There are a variety of ways in which we can satisfy Assumption 6, and here we offer the simplest example.
Consider any deterministic sequence $\{a^n\}$ such that the usual stepsize conditions are satisfied:

\[ \sum_{n=0}^{\infty} a^n = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} (a^n)^2 < \infty. \]

Let $N(s, n, t) = \sum_{m=1}^{n} \mathbf{1}_{\{s = S_t^m\}}$ be the random variable representing the total number of visits of state $s$ at time $t$ until iteration $n$. Then $\alpha_t^n = a^{N(S_t^n,\, n,\, t)}$ satisfies Assumption 6.

Convergence Analysis of the Monotone-ADP Algorithm

We are now ready to show the convergence of the algorithm. Note that although there is a significant similarity between this algorithm and the DOME algorithm described in Papadaki and Powell (2003a), the proof technique is very different. The convergence proof for the DOME algorithm cannot be directly extended to our problem because of differences in the assumptions. Our proof draws on proof techniques found in Tsitsiklis (1994) and Nascimento and Powell (2009). In the latter, the authors prove convergence of a purely exploitative ADP algorithm given a concave, piecewise-linear value function

for the lagged asset acquisition problem. We cannot exploit certain properties inherent to that problem, but in our algorithm we assume exploration of all states, a requirement that can be avoided when we are able to assume concavity. Furthermore, a significant difference in this proof is that we consider the case where the ordering $\preceq$ on $\mathcal{S}$ may not be a total ordering. A consequence of this is that we extend to the case where the monotonicity property covers multiple dimensions (e.g., the relation $\preceq$ on $\mathcal{S}$ is the componentwise inequality), which was not allowed in Nascimento and Powell (2009).

Theorem 1. Under Assumptions 1-6, for each $t \le T$ and $s \in \mathcal{S}$, the estimates $\bar{V}_t^n(s)$ produced by the Monotone-ADP Algorithm of Figure 2 converge to the optimal value function $V_t^*(s)$ almost surely.

Before providing the proof for this convergence result, we present some preliminary definitions and results. First, we define two deterministic bounding sequences, $U^k$ and $L^k$. The two sequences $U^k$ and $L^k$ can be thought of, jointly, as a sequence of "shrinking" rectangles, with $U^k$ being the upper bounds and $L^k$ being the lower bounds. The central idea of the proof is to show that the estimates $\bar{V}^n$ enter (and stay) in smaller and smaller rectangles, for a fixed $\omega \in \Omega$ (we assume that the $\omega$ does not lie in a discarded set of probability zero). We can then show that the rectangles converge to the point $V^*$, which in turn implies the convergence of $\bar{V}^n$ to the optimal value function. This idea is attributed to Tsitsiklis (1994) and is illustrated in Figure 3.

The sequences $U^k$ and $L^k$ are written recursively. Let

\[ U^0 = V^* + V_{\max} \cdot e, \qquad L^0 = V^* - V_{\max} \cdot e, \tag{14} \]

and let

\[ U^{k+1} = \frac{U^k + HU^k}{2}, \qquad L^{k+1} = \frac{L^k + HL^k}{2}. \]

Lemma 2. For all $k \ge 0$, we have that

\[ HU^k \le U^{k+1} \le U^k, \qquad HL^k \ge L^{k+1} \ge L^k. \]

Furthermore,

\[ U^k \longrightarrow V^*, \qquad L^k \longrightarrow V^*. \tag{15} \]

Figure 3. Central idea of convergence proof.

Proof. The proof of this lemma is given in Bertsekas and Tsitsiklis (1996) (see Lemmas 4.5 and 4.6). The properties of $H$ given in Proposition 1 are used for this result.

Lemma 3. The bounding sequences satisfy the monotonicity property; that is, for $k \ge 0$, $t \le T$, $s \in \mathcal{S}$, $s' \in \mathcal{S}$ such that $s \preceq s'$, we have

\[ U_t^k(s) \le U_t^k(s'), \qquad L_t^k(s) \le L_t^k(s'). \]

Proof. See Appendix C.

We continue with some definitions pertaining to the projection operator $\Pi_M$. A "$-$" in the superscript signifies "the value of $s$ is too small" and the "$+$" signifies "the value of $s$ is too large."

Definition 2. For $t < T$ and $s \in \mathcal{S}$, let $\mathcal{N}_t^{\Pi-}(s)$ be a random set representing the iterations for which $s$ was increased by the projection operator at time $t$. Similarly, let $\mathcal{N}_t^{\Pi+}(s)$ represent the iterations for which $s$ was decreased:

\[ \mathcal{N}_t^{\Pi-}(s) = \{ n : s \ne S_t^n \text{ and } \bar{V}_t^{n-1}(s) < \bar{V}_t^n(s) \}, \]
\[ \mathcal{N}_t^{\Pi+}(s) = \{ n : s \ne S_t^n \text{ and } \bar{V}_t^{n-1}(s) > \bar{V}_t^n(s) \}. \]

Definition 3. For $t < T$ and $s \in \mathcal{S}$, let $N_t^{\Pi-}(s)$ be the last iteration for which the state $s$ was increased by $\Pi_M$ at time $t$:

\[ N_t^{\Pi-}(s) = \max \mathcal{N}_t^{\Pi-}(s). \]

Similarly, let

\[ N_t^{\Pi+}(s) = \max \mathcal{N}_t^{\Pi+}(s). \]

Note that $N_t^{\Pi-}(s) = \infty$ if $|\mathcal{N}_t^{\Pi-}(s)| = \infty$ and $N_t^{\Pi+}(s) = \infty$ if $|\mathcal{N}_t^{\Pi+}(s)| = \infty$.

Definition 4. Let $N^{\Pi}$ be large enough so that for iterations $n \ge N^{\Pi}$, any state increased (decreased) finitely often by the projection operator $\Pi_M$ is no longer affected by $\Pi_M$. In other words, if some state is increased (decreased) by $\Pi_M$ on an iteration after $N^{\Pi}$, then that state is increased (decreased) by $\Pi_M$ infinitely often. We can write the following:

\[ N^{\Pi} = \max\Bigl( \{ N_t^{\Pi-}(s) : t < T,\ s \in \mathcal{S},\ N_t^{\Pi-}(s) < \infty \} \cup \{ N_t^{\Pi+}(s) : t < T,\ s \in \mathcal{S},\ N_t^{\Pi+}(s) < \infty \} \Bigr) + 1. \]

We now define, for each $t$, two random subsets $\mathcal{S}_t^-$ and $\mathcal{S}_t^+$ of the state space $\mathcal{S}$, where $\mathcal{S}_t^-$ contains states that are increased by the projection operator $\Pi_M$ finitely often and $\mathcal{S}_t^+$ contains states that are decreased by the projection operator finitely often. The role that these two sets play in the proof is as follows: We first show convergence for states that are projected finitely often ($s \in \mathcal{S}_t^-$ or $s \in \mathcal{S}_t^+$).

Next, because convergence already holds for states that are projected finitely often, we use an induction-like argument to extend the property to states that are projected infinitely often ($s \in \mathcal{S} \setminus \mathcal{S}_t^-$ or $s \in \mathcal{S} \setminus \mathcal{S}_t^+$). This step requires the definition of a tree structure that arranges the set of states and its partial ordering in an intuitive way.

Definition 5. For $t < T$, define

\[ \mathcal{S}_t^- = \{ s \in \mathcal{S} : N_t^{\Pi-}(s) < \infty \} \quad \text{and} \quad \mathcal{S}_t^+ = \{ s \in \mathcal{S} : N_t^{\Pi+}(s) < \infty \} \]

to be random subsets of states that are projected finitely often.

Lemma 4. The random sets $\mathcal{S}_t^-$ and $\mathcal{S}_t^+$ are almost surely nonempty.

Proof. See Appendix D.

We now provide several remarks regarding the projection operator $\Pi_M$. The value of a state $s$ can only be increased by $\Pi_M$ if we visit a "smaller" state; i.e., $S_t^n \preceq s$. This statement is obvious from the second condition of (8). Similarly, the value of the state can only be decreased by $\Pi_M$ if the visited state is "larger"; i.e., $S_t^n \succeq s$. Intuitively, it can be useful to imagine that, in some sense, the values of states can be "pushed up" from the "left" and "pushed down" from the "right." Finally, because of our assumption that $\preceq$ is only a partial ordering, the update process (from $\Pi_M$) becomes more difficult to analyze than in the total ordering case. To facilitate the analysis of the process, we introduce the notions of lower (upper) immediate neighbors and lower (upper) update trees.

Definition 6. For $s = (m, i) \in \mathcal{S}$, we define the set of lower immediate neighbors $\mathcal{S}_L(s)$ in the following way:

\[ \mathcal{S}_L(s) = \{ s' \in \mathcal{S} : s' \preceq s,\ s' \ne s,\ \nexists\, s'' \in \mathcal{S},\ s'' \ne s,\ s'' \ne s',\ s' \preceq s'' \preceq s \}. \]

In other words, there does not exist $s''$ in between $s'$ and $s$.
The set of upper immediate neighbors $\mathcal{S}_U(s)$ is defined in a similar way:

\[ \mathcal{S}_U(s) = \{ s' \in \mathcal{S} : s' \succeq s,\ s' \ne s,\ \nexists\, s'' \in \mathcal{S},\ s'' \ne s,\ s'' \ne s',\ s' \succeq s'' \succeq s \}. \]

The intuition for the next lemma is that if some state $s$ is increased by $\Pi_M$, then it must have been caused by visiting a lower state. In particular, either the visited state was one of the lower immediate neighbors or one of the lower immediate neighbors was also increased by $\Pi_M$. In either case, one of the lower immediate neighbors has the same value as $s$. This lemma is crucial later in the proof.

Lemma 5. Suppose the value of $s$ is increased by $\Pi_M$ on some iteration $n$: $s \ne S_t^n$ and $\bar{V}_t^{n-1}(s) < \bar{V}_t^n(s)$. Then there exists another state $s' \in \mathcal{S}_L(s)$ (in the set of lower immediate neighbors) whose value is equal to the newly updated value: $\bar{V}_t^n(s') = \bar{V}_t^n(s)$.

Proof. See Appendix E.

Definition 7. Consider some $\omega \in \Omega$. Let $s \in \mathcal{S} \setminus \mathcal{S}_t^-$, meaning that $s$ is increased by $\Pi_M$ infinitely often: $|\mathcal{N}_t^{\Pi-}(s)| = \infty$. A lower update tree $T_t^-(s)$ is an organization of the states in the set $\mathcal{L} = \{ s' \in \mathcal{S} : s' \preceq s \}$, where the value of each node is an element of $\mathcal{L}$. The tree $T_t^-(s)$ is constructed according to the following rules.

(i) The root node of $T_t^-(s)$ has value $s$.
(ii) Consider an arbitrary node $j$ with value $s_j$.
  (a) If $s_j \in \mathcal{S} \setminus \mathcal{S}_t^-$, then for each $s_{jc} \in \mathcal{S}_L(s_j)$, add a child node with value $s_{jc}$ to the node $j$.
  (b) If $s_j \in \mathcal{S}_t^-$, then $j$ is a leaf node (it does not have any child nodes).

The tree $T_t^-(s)$ is unique and can easily be built by starting with the root node and successively applying the rules. The upper update tree $T_t^+(s)$ is defined in a completely analogous way.

Note that the lower update tree is random, and we now argue that for each $\omega$, it is well defined. We observe that it cannot be the case for some state $s'$ to be an element of $\mathcal{S} \setminus \mathcal{S}_t^-$ while $\mathcal{S}_L(s') = \{\}$, because for it to be increased infinitely often, there must exist at least one "lower" state whose observations cause the monotonicity violations.
Using this fact along with the finiteness of $\mathcal{S}$ and Lemma 4, which states that $\mathcal{S}_t^-$ is nonempty, it is clear that all paths down the tree reach a leaf node (i.e., an element of $\mathcal{S}_t^-$). The reason for discontinuing the tree at states in $\mathcal{S}_t^-$ is that our convergence proof employs an induction-like argument up the tree, starting with states in $\mathcal{S}_t^-$. Lastly, we remark that it is possible for multiple nodes to have the same value. As an illustrative example, consider the case with $\mathcal{S} = \{0, 1, 2\}^2$ with $\preceq$ being the componentwise inequality. Assume that for a particular $\omega \in \Omega$, $s = (s_x, s_y) \in \mathcal{S}_t^-$ if and only if $s_x = 0$ or $s_y = 0$ (lower boundary of the square). Figure 4 shows the realization of the lower update tree evaluated at the state $(2, 2)$.

Figure 4. Illustration of the lower update tree.
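The construction rules of the lower update tree can be made concrete with a short sketch. The code below assumes the state space is the grid $\{0,1,2\}^2$ under the componentwise order and that the finitely projected states are exactly those on the lower boundary, as in the illustrative example. The function names (`leq`, `lower_immediate_neighbors`, `in_s_minus`, `lower_update_tree`, `node_values`) are hypothetical helpers introduced here, not part of the paper.

```python
from itertools import product

# Assumed state space for the illustration: the grid {0, 1, 2}^2.
STATES = list(product(range(3), repeat=2))


def leq(s, t):
    """Componentwise partial order: s precedes t."""
    return all(x <= y for x, y in zip(s, t))


def lower_immediate_neighbors(s):
    """States strictly below s with no other state strictly in between."""
    below = [t for t in STATES if t != s and leq(t, s)]
    return [t for t in below
            if not any(u != t and leq(t, u) and leq(u, s) for u in below)]


def in_s_minus(s):
    """Assumption of the example: s is projected finitely often
    if and only if it lies on the lower boundary of the square."""
    return s[0] == 0 or s[1] == 0


def lower_update_tree(s):
    """Return the tree as nested (value, children) pairs.
    Rule (b): states in S^- are leaves.
    Rule (a): otherwise add one child per lower immediate neighbor."""
    if in_s_minus(s):
        return (s, [])
    return (s, [lower_update_tree(c) for c in lower_immediate_neighbors(s)])


def node_values(tree):
    """List node values in depth-first (preorder) order."""
    value, children = tree
    values = [value]
    for child in children:
        values.extend(node_values(child))
    return values


tree = lower_update_tree((2, 2))
print(node_values(tree))
```

In this sketch the state $(1,1)$ appears twice, once under $(1,2)$ and once under $(2,1)$. This illustrates the remark that multiple nodes can share the same value.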

The next lemma is a useful technical result used in the convergence proof.

Lemma 6. For any $s \in \mathcal{S}$,

\[ \lim_{m \to \infty} \Bigl[ \prod_{n=1}^{m} \bigl(1 - \alpha_t^n(s)\bigr) \Bigr] = 0 \quad \text{a.s.} \]

Proof. See Appendix F.

With these preliminaries in mind (other elements will be defined as they arise), we begin the convergence analysis.

Proof of Theorem 1. As previously mentioned, to show that the sequence $\bar{V}_t^n(s)$ (almost surely) converges to $V_t^*(s)$ for each $t$ and $s$, we need to argue that $\bar{V}_t^n(s)$ eventually enters every rectangle (or "interval," when we discuss a specific component of the vector $\bar{V}^n$) defined by the sequences $L_t^k(s)$ and $U_t^k(s)$. Recall that the estimates of the value function produced by the algorithm are indexed by $n$ and the bounding rectangles are indexed by $k$. Hence, we aim to show that for each $k$, we have that for $n$ sufficiently large, it is true that $\forall\, s \in \mathcal{S}$,

\[ L_t^k(s) \le \bar{V}_t^n(s) \le U_t^k(s). \tag{16} \]

Following this step, an application of (15) in Lemma 2 completes the proof. We show the second inequality of (16) and remark that the first can be shown in a completely symmetric way. The goal is then to show that $\exists\, N_t^k < \infty$ a.s. such that $\forall\, n \ge N_t^k$ and $\forall\, s \in \mathcal{S}$,

\[ \bar{V}_t^n(s) \le U_t^k(s). \tag{17} \]

Choose $\omega \in \Omega$. For ease of presentation, the dependence of the random variables on $\omega$ is omitted. We use backward induction on $t$ to show this result, which is the same technique used in Nascimento and Powell (2009). The inductive step is broken up into two cases, $s \in \mathcal{S}_t^-$ and $s \in \mathcal{S} \setminus \mathcal{S}_t^-$.

Base case, $t = T$. Since for all $s \in \mathcal{S}$, $k$, and $n$, we have that (by definition) $\bar{V}_T^n(s) = U_T^k(s) = 0$, we can arbitrarily select $N_T^k$. Suppose that for each $k$, we choose $N_T^k = N^{\Pi}$, allowing us to use the property of $N^{\Pi}$ that if $s \in \mathcal{S}_t^-$, then the estimate of the value at $s$ is no longer affected by $\Pi_M$ on iterations $n \ge N^{\Pi}$.

Induction hypothesis, $t + 1$. Assume for $t + 1 \le T$ that $\forall\, k \ge 0$, $\exists\, N_{t+1}^k < \infty$ such that $N_{t+1}^k \ge N^{\Pi}$ and $\forall\, n \ge N_{t+1}^k$, we have that $\forall\, s \in \mathcal{S}$, $\bar{V}_{t+1}^n(s) \le U_{t+1}^k(s)$.

Inductive step from $t + 1$ to $t$. The remainder of the proof concerns this inductive step and is broken up into two cases, $s \in \mathcal{S}_t^-$ and $s \in \mathcal{S} \setminus \mathcal{S}_t^-$. For each $s$, we show the existence of a state dependent iteration $\tilde{N}_t^k(s) \ge N^{\Pi}$, such that for $n \ge \tilde{N}_t^k(s)$, (17) holds. The state independent iteration $N_t^k$ is then taken to be the maximum of $\tilde{N}_t^k(s)$ over $s$.

Case 1: $s \in \mathcal{S}_t^-$. To prove this case, we induct forward on $k$. Note that we are still inducting backward on $t$, so the induction hypothesis for $t + 1$ still holds. The inductive step is proved in essentially the same manner as Theorem 2 of Tsitsiklis (1994).

Base case, $k = 0$ (within induction on $t$). By Assumption 3 and (14), we have that $U_t^0(s) \ge V_{\max}$. But by Assumption 4, the updating equation (Step 2b of Figure 2), and the initialization of $\bar{V}_t^0(s) \in [0, V_{\max}]$, we can easily see that $\bar{V}_t^n(s) \in [0, V_{\max}]$ for any $n$ and $s$. Therefore, $\bar{V}_t^n(s) \le U_t^0(s)$, for any $n$ and $s$, so we can choose $N_t^0$ arbitrarily. Let us choose $\tilde{N}_t^0(s) = N_{t+1}^0$, and since $N_{t+1}^0$ came from the induction hypothesis for $t + 1$, it is also true that $\tilde{N}_t^0(s) \ge N^{\Pi}$.

Induction hypothesis, $k$ (within induction on $t$). Assume for $k \ge 0$ that $\exists\, \tilde{N}_t^k(s) < \infty$ such that $\tilde{N}_t^k(s) \ge N_{t+1}^k \ge N^{\Pi}$ and $\forall\, n \ge \tilde{N}_t^k(s)$, we have $\bar{V}_t^n(s) \le U_t^k(s)$.

Before we begin the inductive step from $k$ to $k + 1$, we define some additional sequences and state a few useful lemmas.

Definition 8. The positive incurred noise, since a starting iteration $m$, is represented by the sequence $W_t^{n,m}(s)$. For $s \in \mathcal{S}$, it is defined as follows:

\[ W_t^{m,m}(s) = 0, \qquad W_t^{n+1,m}(s) = \bigl[ (1 - \alpha_t^n(s))\, W_t^{n,m}(s) + \alpha_t^n(s)\, w_t^{n+1}(s) \bigr]^+ \quad \text{for } n \ge m. \]

The term $W_t^{n+1,m}(s)$ is only updated from $W_t^{n,m}(s)$ when $s = S_t^n$, i.e., on iterations where the state is visited by the algorithm, because the stepsize $\alpha_t^n(s) = 0$ whenever $s \ne S_t^n$.

Lemma 7. For any starting iteration $m \ge 0$ and any state $s \in \mathcal{S}$, under Assumptions 4, 5, and 6, $W_t^{n,m}(s)$ asymptotically vanishes:

\[ \lim_{n \to \infty} W_t^{n,m}(s) = 0 \quad \text{a.s.} \]

Proof. The proof is analogous to that of Lemma 6.2 in Nascimento and Powell (2009), which uses a martingale convergence argument.
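The vanishing of the positive incurred noise can be observed numerically. The following is a rough simulation sketch under assumptions that are not from the paper: the state is visited on every iteration, the stepsize is $1/n$ (which satisfies the usual divergence and square-summability conditions), and the noise is i.i.d. uniform on $[-1, 1]$, hence bounded with zero mean. The function name `simulate_positive_noise` is hypothetical.

```python
import random


def simulate_positive_noise(num_iters: int, seed: int = 0) -> list:
    """Simulate W <- max(0, (1 - alpha) * W + alpha * w), starting from W = 0.

    Assumptions (illustrative only): alpha_n = 1/n, noise w ~ Uniform[-1, 1],
    and the state is visited on every iteration.
    """
    rng = random.Random(seed)
    W = 0.0
    path = []
    for n in range(1, num_iters + 1):
        alpha = 1.0 / n               # sum diverges, sum of squares converges
        w = rng.uniform(-1.0, 1.0)    # zero-mean, bounded noise
        W = max(0.0, (1.0 - alpha) * W + alpha * w)  # positive part of smoothed noise
        path.append(W)
    return path


path = simulate_positive_noise(20000)
print(path[9], path[999], path[-1])
```

Because of the positive part, the sequence never becomes negative, and with these stepsizes its late values are typically far smaller than its early values. A single simulated path only illustrates the behavior that Lemma 7 establishes rigorously.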
To reemphasize the presence of $\omega$, we note that the following definition and the subsequent lemma both use the realization $\tilde{N}_t^k(s)(\omega)$ from the $\omega$ chosen at the beginning of the proof.

Definition 9. The other auxiliary sequence that we need is $X_t^n(s)$, which applies the smoothing step to $(HU^k)_t(s)$. For any state $s \in \mathcal{S}$, let

\[ X_t^{\tilde{N}_t^k(s)}(s) = U_t^k(s), \qquad X_t^{n+1}(s) = (1 - \alpha_t^n(s))\, X_t^n(s) + \alpha_t^n(s)\, (HU^k)_t(s) \quad \text{for } n \ge \tilde{N}_t^k(s). \]

Lemma 8. For $n \ge \tilde{N}_t^k(s)$ and state $s \in \mathcal{S}_t^-$,

\[ \bar{V}_t^n(s) \le X_t^n(s) + W_t^{n,\,\tilde{N}_t^k(s)}(s). \]

Proof. See Appendix G.
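The smoothing recursion of Definition 9 is a convex combination at each step, so unrolling it gives a weighted average of the starting value $U_t^k(s)$ and the target $(HU^k)_t(s)$, with the weight on the starting value equal to the running product of $(1 - \alpha)$ terms. The sketch below checks this numerically for scalar stand-ins. The values of `U` and `HU` and the stepsize schedule are hypothetical, and `smooth` and `unrolled` are illustrative helper names.

```python
import math


def smooth(U: float, HU: float, alphas) -> float:
    """Iterate X <- (1 - a) * X + a * HU, starting from X = U."""
    X = U
    for a in alphas:
        X = (1.0 - a) * X + a * HU
    return X


def unrolled(U: float, HU: float, alphas) -> float:
    """Weighted-average form: beta * U + (1 - beta) * HU,
    where beta is the product of (1 - a) over all stepsizes used."""
    beta = math.prod(1.0 - a for a in alphas)
    return beta * U + (1.0 - beta) * HU


U, HU = 10.0, 4.0                              # hypothetical scalars with HU <= U
alphas = [1.0 / (n + 2) for n in range(50)]    # hypothetical stepsizes
print(smooth(U, HU, alphas), unrolled(U, HU, alphas))
```

With these stepsizes the product telescopes to $1/51$, so the smoothed value sits close to `HU`. This is the mechanism by which, together with Lemma 6, the sequence $X_t^n(s)$ is driven toward $(HU^k)_t(s)$.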

Inductive step from $k$ to $k + 1$. If $U_t^k(s) = (HU^k)_t(s)$, then by Lemma 2, we see that $U_t^k(s) = U_t^{k+1}(s)$, so $\bar{V}_t^n(s) \le U_t^k(s) \le U_t^{k+1}(s)$ for any $n \ge \tilde{N}_t^k(s)$ and the proof is complete. Since we know that $(HU^k)_t(s) \le U_t^k(s)$ by Lemma 2, we can now assume that $s \in \mathcal{K}$, where

\[ \mathcal{K} = \{ s' \in \mathcal{S} : (HU^k)_t(s') < U_t^k(s') \}. \]

In this case, we can define

\[ \delta^k = \min_{s \in \mathcal{S}_t^- \cap \mathcal{K}} \frac{U_t^k(s) - (HU^k)_t(s)}{4} > 0. \]

Choose $\tilde{N}_t^{k+1}(s) \ge \tilde{N}_t^k(s)$ such that

\[ \prod_{n = \tilde{N}_t^k(s)}^{\tilde{N}_t^{k+1}(s) - 1} \bigl(1 - \alpha_t^n(s)\bigr) \le \frac{1}{4} \]

and for all $n \ge \tilde{N}_t^{k+1}(s)$,

\[ W_t^{n,\,\tilde{N}_t^k(s)}(s) \le \delta^k. \]

Note that $\tilde{N}_t^{k+1}(s)$ clearly exists because both sequences converge to zero, by Lemmas 6 and 7. Recursively using the definition of $X_t^n(s)$, we get that

\[ X_t^n(s) = \beta_t^n(s)\, U_t^k(s) + (1 - \beta_t^n(s))\, (HU^k)_t(s), \]

where $\beta_t^n(s) = \prod_{l = \tilde{N}_t^k(s)}^{n-1} (1 - \alpha_t^l(s))$. Notice that for $n \ge \tilde{N}_t^{k+1}(s)$, we know that $\beta_t^n(s) \le \tfrac{1}{4}$, so we can write

\begin{align*}
X_t^n(s) &= \beta_t^n(s)\, U_t^k(s) + (1 - \beta_t^n(s))\, (HU^k)_t(s) \\
&= \beta_t^n(s)\, \bigl[ U_t^k(s) - (HU^k)_t(s) \bigr] + (HU^k)_t(s) \\
&\le \tfrac{1}{4}\, U_t^k(s) + \tfrac{3}{4}\, (HU^k)_t(s) \\
&= \tfrac{1}{2} \bigl[ U_t^k(s) + (HU^k)_t(s) \bigr] - \tfrac{1}{4} \bigl[ U_t^k(s) - (HU^k)_t(s) \bigr] \\
&\le U_t^{k+1}(s) - \delta^k. \tag{18}
\end{align*}

We can apply Lemma 8 and (18) to get

\[ \bar{V}_t^n(s) \le X_t^n(s) + W_t^{n,\,\tilde{N}_t^k(s)}(s) \le \bigl( U_t^{k+1}(s) - \delta^k \bigr) + \delta^k = U_t^{k+1}(s), \]

for all $n \ge \tilde{N}_t^{k+1}(s)$. Thus, the inductive step from $k$ to $k + 1$ is complete.

Case 2: $s \in \mathcal{S} \setminus \mathcal{S}_t^-$. Recall that we are still in the inductive step from $t + 1$ to $t$ (where the hypothesis was the existence of $N_{t+1}^k$). As previously mentioned, the proof for this case relies on an induction-like argument over the tree $T_t^-(s)$. The following lemma is the core of our argument, and the proof is provided below.

Lemma 9. Consider some $k \ge 0$ and a node $j$ of $T_t^-(s)$ with value $s_j \in \mathcal{S} \setminus \mathcal{S}_t^-$ and let the $C_j \ge 1$ child nodes of $j$ be denoted by the set $\{ s_{j,1}, s_{j,2}, \ldots, s_{j,C_j} \}$.
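Suppose that for each $s_{j,c}$ where $1 \le c \le C_j$, we have that $\exists\, \tilde{N}_t^k(s_{j,c}) < \infty$ such that $\forall\, n \ge \tilde{N}_t^k(s_{j,c})$,

\[ \bar{V}_t^n(s_{j,c}) \le U_t^k(s_{j,c}). \tag{19} \]

Then $\exists\, \tilde{N}_t^k(s_j) < \infty$ such that $\forall\, n \ge \tilde{N}_t^k(s_j)$,

\[ \bar{V}_t^n(s_j) \le U_t^k(s_j). \]

Proof. First, note that by the induction hypothesis, part (ii) of Lemmas 1, and 2, we have the inequality

\[ (H\bar{V}^n)_t(s) \le (HU^k)_t(s) \le U_t^k(s). \tag{20} \]

We break the proof into several steps.

Step 1. Let us consider the iteration $\tilde{N}$ defined by

\[ \tilde{N} = \min\bigl\{ n \in \mathcal{N}_t^{\Pi-}(s_j) : n \ge \max_c \tilde{N}_t^k(s_{j,c}) \bigr\}, \]

which exists because $s_j \in \mathcal{S} \setminus \mathcal{S}_t^-$ and is increased infinitely often. This means that $\Pi_M$ increased the value of state $s_j$ on iteration $\tilde{N}$. As the first step, we show that $\forall\, n \ge \tilde{N}$,

\[ \bar{V}_t^n(s_j) \le U_t^k(s_j) + W_t^{n,\tilde{N}}(s_j), \tag{21} \]

using an induction argument.

Base case, $n = \tilde{N}$. Using Lemma 5, we know that for some $c \in \{1, 2, \ldots, C_j\}$, we have

\[ \bar{V}_t^n(s_j) = \bar{V}_t^n(s_{j,c}) \le U_t^k(s_{j,c}) \le U_t^k(s_j) + W_t^{n,\tilde{N}}(s_j). \]

The fact that $\tilde{N} \ge \tilde{N}_t^k(s_{j,c})$ for every $c$ justifies the first inequality, and the second inequality above follows from the monotonicity within $U^k$ (see Lemma 3) and the fact that $W_t^{\tilde{N},\tilde{N}}(s_j) = 0$.

Induction hypothesis, $n$. Suppose (21) is true for $n$ where $n \ge \tilde{N}$.

Inductive step from $n$ to $n + 1$. Consider the following two cases:

(I) Suppose $n + 1 \in \mathcal{N}_t^{\Pi-}(s_j)$. The proof for this is exactly the same as for the base case, except we use $W_t^{n+1,\tilde{N}}(s_j) \ge 0$ to show the inequality. Again, this step depends heavily on Lemma 5 and on every child node representing a state that satisfies (19).

(II) Suppose $n + 1 \notin \mathcal{N}_t^{\Pi-}(s_j)$. There are again two cases to consider:

(A) Suppose $S_t^{n+1} = s_j$. Then

\begin{align*}
\bar{V}_t^{n+1}(s_j) &= z_t^{n+1}(s_j) \\
&= (1 - \alpha_t^{n+1}(s_j))\, \bar{V}_t^n(s_j) + \alpha_t^{n+1}(s_j)\, \hat{v}_t^{n+1}(s_j) \\
&\le (1 - \alpha_t^{n+1}(s_j)) \bigl( U_t^k(s_j) + W_t^{n,\tilde{N}}(s_j) \bigr) + \alpha_t^{n+1}(s_j) \bigl[ (H\bar{V}^n)_t(s_j) + w_t^{n+1}(s_j) \bigr] \\
&\le U_t^k(s_j) + W_t^{n+1,\tilde{N}}(s_j),
\end{align*}

where the first inequality follows from the induction hypothesis for $n$ and the second inequality follows by (20).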


AN APPROXIMATE DYNAMIC PROGRAMMING ALGORITHM FOR MONOTONE VALUE FUNCTIONS DANIEL R. JIANG AND WARREN B. POWELL Abstract. Many sequential decision problems can be formulated as Markov Decision Processes (MDPs)