Information Relaxations and Duality in Stochastic Dynamic Programs


OPERATIONS RESEARCH, Vol. 58, No. 4, Part 1 of 2, July–August 2010, INFORMS

Information Relaxations and Duality in Stochastic Dynamic Programs

David B. Brown, James E. Smith, Peng Sun
Fuqua School of Business, Duke University, Durham, North Carolina
{dbbrown@duke.edu, jes9@duke.edu, psun@duke.edu}

We describe a general technique for determining upper bounds on maximal values (or lower bounds on minimal costs) in stochastic dynamic programs. In this approach, we relax the nonanticipativity constraints that require decisions to depend only on the information available at the time a decision is made and impose a penalty that punishes violations of nonanticipativity. In applications, the hope is that this relaxed version of the problem will be simpler to solve than the original dynamic program. The upper bounds provided by this dual approach complement lower bounds on values that may be found by simulating with heuristic policies. We describe the theory underlying this dual approach and establish weak duality, strong duality, and complementary slackness results that are analogous to the duality results of linear programming. We also study properties of good penalties. Finally, we demonstrate the use of this dual approach in an adaptive inventory control problem with an unknown and changing demand distribution and in valuing options with stochastic volatilities and interest rates. These are complex problems of significant practical interest that are quite difficult to solve to optimality. In these examples, our dual approach requires relatively little additional computation and leads to tight bounds on the optimal values.

Subject classifications: dynamic programming; duality; inventory control; option pricing.
Area of review: Stochastic Models.
History: Received November 2007; revisions received November 2008, July 2009, September 2009; accepted October 2009. Published online in Articles in Advance April 9, 2010.

1. Introduction

In principle, dynamic programming provides a powerful framework for determining optimal policies in complex decision problems where uncertainty is resolved and decisions are made over time. However, the widespread use of dynamic programming is hampered by the so-called curse of dimensionality: the size of the state space typically grows exponentially in the number of state variables considered. In contrast, Monte Carlo simulation methods typically scale well with the number of state variables considered and, given a control policy, it is not difficult to simulate a complex dynamic system with many uncertainties. Simulating with a feasible policy provides a lower bound on the expected value (or upper bound on the expected costs) of an optimal policy, but Monte Carlo simulation typically does not provide a good way to identify an optimal policy or provide an upper bound on the value of an optimal policy.

In this paper, we describe a dual approach for studying stochastic dynamic programs (DPs) that focuses on providing an upper bound on the optimal expected value. This dual approach consists of two elements: (1) we relax the nonanticipativity constraints that require decisions to depend only on the information available at the time a decision is made and (2) we impose a penalty that punishes violations of the nonanticipativity constraints. By relaxing the nonanticipativity constraints, we can often greatly simplify the DP. For example, we study an adaptive inventory control problem with an unknown and changing demand distribution and stochastic ordering costs. Here a perfect information relaxation assumes the decision maker (DM) knows all demands and costs before placing any orders. With this information, the problem of choosing an optimal ordering schedule is a deterministic DP that can be solved quite easily.
In another example, we study an option-pricing model with stochastic volatilities and stochastic interest rates and consider an imperfect information relaxation where volatilities and interest rates are known in advance but the stock price is not: with the volatilities and interest rates known, we can value the option using standard lattice methods. Because these relaxations assume the DM has more information than is truly available, they lead to an upper bound on value. Without any penalty for using this additional information, the bound obtained is often quite weak. Informally, we say a penalty is dual feasible if it does not penalize any policy that is nonanticipative; the penalties may, however, punish policies that do not satisfy the nonanticipativity constraints. We will show that in principle we can always find a dual feasible penalty that provides a tight bound, i.e., strong duality holds. We view this dual approach as a complement to the use of simulation methods and modern approximate

dynamic programming methods for studying DPs (see, e.g., Bertsekas and Tsitsiklis 1996, de Farias and Van Roy 2003, Powell 2007, or Adelman and Mersereau 2008). As mentioned earlier, given a candidate policy (perhaps identified using a heuristic approach or using approximate DP techniques), we can use standard simulation techniques to estimate the expected value with this policy and thereby generate a lower bound on the expected value with an optimal policy. Our dual approach can then be used to generate an upper bound on the value of an optimal policy. If the difference between the expected value with this candidate policy and the upper bound on the optimal value is small, we may conclude that the candidate policy is good enough and not continue searching for a better policy. If the difference is large, it may be worthwhile to work harder to find a better policy and/or a tighter upper bound. In our inventory example, we will use the dual bounds to determine whether a simple myopic ordering policy is good enough or whether we need to consider more complex one- or two-period look-ahead policies. In the option-pricing example, we use the dual bounds to study the effectiveness of an exercise policy that ignores uncertainty about volatilities and interest rates. In both examples, we will also demonstrate how we can use the results of the dual problem to identify ways to improve these heuristic policies.

Our interest in this dual approach for DPs was motivated by the need to evaluate the quality of heuristic policies in applications, and inspired by Haugh and Kogan's (2004) dual approach for placing bounds on the value of an American option; Rogers (2002) independently proposed a similar dual approach, also applied to option pricing. Both Haugh and Kogan (2004) and Rogers (2002) consider the use of what we call perfect information relaxations and establish their main results using martingale arguments.
Haugh and Kogan propose a particular method for generating penalties or, in their terminology, dual martingales based on approximate value functions and demonstrate the use of this method in high-dimensional option-pricing problems. Andersen and Broadie (2004) propose an alternative method for generating dual martingales based on approximate policies. Glasserman (2004) provides a nice overview of this work. We generalize the work of Haugh and Kogan (2004), Rogers (2002), and Andersen and Broadie (2004) in several ways. First, rather than focusing exclusively on option-pricing problems, we consider general stochastic DPs. Second, rather than focusing exclusively on perfect information relaxations, we consider general information relaxations. Finally, we present a general method for constructing good penalties that includes and extends the methods proposed by Haugh and Kogan and Andersen and Broadie. These generalizations expand the scope and flexibility of this dual approach.

The idea of relaxing the nonanticipativity constraints has also been studied in the stochastic programming literature (see, e.g., Rockafellar and Wets 1991, Shapiro and Ruszczyński 2003, Shapiro et al. 2009). Rogers (2007) also recently (independently) proposed a dual approach for Markov decision processes. In short, though these alternative approaches have similarities with ours, our formulation is different and leads to results that we believe are both simpler and more general. The stochastic programming formulation requires the reward functions and sets of feasible actions to be convex and the penalties considered are linear functions of the actions; they consider only perfect information relaxations. Rogers focuses on Markov decision processes and considers only perfect information relaxations and penalties that are a function of the state variable only; Rogers does not present any example applications. In contrast, our framework allows general reward functions and action spaces, allows general penalty functions, and considers imperfect as well as perfect information relaxations.
Moreover, our duality proofs are quite simple and direct and do not rely on sophisticated convex duality or martingale arguments. Finally, our inventory control and option-pricing examples demonstrate the power of this dual approach in some complex problems of significant practical interest.

We begin in §2 by defining the basic framework and theory underlying the dual approach; the main results are analogous to the duality results of linear programming. We then illustrate the approach in the inventory control and option-pricing examples in §§3–4. We offer a few concluding remarks in §5. The electronic companion provides supporting information: Appendix A contains most of the proofs; Appendix B compares our results to similar results in stochastic programming and develops the connections to linear programming more fully; and Appendix C provides some details of the adaptive inventory example. The electronic companion to this paper is available as part of the online version that can be found at http://or.journal.informs.org/.

2. The Basic Framework and Results

We begin by describing the general formulation of the primal stochastic DP in §2.1. We then present our main duality results in §2.2 and discuss an approach for generating good penalties in §2.3.

2.1. General Framework

Uncertainty in the DP is described by a probability space (Ω, F, P), where Ω is the set of possible outcomes (with typical element ω), F is a σ-algebra that describes the set of all possible events (an event is a subset of Ω), and P is a probability measure describing the likelihoods of the various events. Time is discrete and indexed by t = 0, ..., T. The DM's state of information evolves over time and is described by a filtration 𝔽 = (F_0, ..., F_T), where the σ-algebra F_t describes the DM's state of information at the beginning of period t, i.e., F_t is the set of events that will be known to be true or false at time t. We will refer to 𝔽 as the natural filtration. We

require all filtrations to satisfy F_t ⊆ F_{t+1} ⊆ F for all t < T, so the DM does not forget what she once knew. We will assume that F_0 = {∅, Ω}, so the DM initially knows nothing about the outcome of the uncertainties. A function (or random variable) f defined on Ω is measurable with respect to a σ-algebra F_t (or F_t-measurable) if, for every Borel set R in the range of f, we have f⁻¹(R) ∈ F_t; we can interpret f being F_t-measurable as meaning the result of f depends only on the information known in period t. A sequence of functions (f_0, ..., f_T) is said to be adapted to a filtration 𝔽 (or 𝔽-adapted) if each function f_t is measurable with respect to F_t.

In the DP model, the DM will choose an action a_t in period t from the set A_t; we let A ≡ A_0 × ··· × A_T denote the set of all feasible action sequences a. The DM's choice of actions is described by a policy α that selects a sequence of actions α(ω) in A for each outcome ω in Ω (i.e., α: Ω → A). We let 𝔸 denote the set of all policies. In the primal DP, we assume that the DM's choices are nonanticipative in that the choice of action a_t in period t depends only on what is known at the beginning of period t. More formally, we require policies to be adapted to the natural filtration 𝔽 in that a policy's selection of the first t+1 actions (a_0, ..., a_t) must be measurable with respect to F_t. We let 𝔸_𝔽 be the set of all nonanticipative policies.¹

The goal of the DP is to select a nonanticipative policy to maximize the expected total reward. The rewards are defined by a sequence of reward functions r_0(a, ω), ..., r_T(a, ω), where the reward in period t depends on the action sequence a selected and the outcome ω. We let r(a, ω) = Σ_{t=0}^T r_t(a, ω) denote the total reward; discounting can be incorporated into the period-t reward function r_t. The primal DP is then:

    sup_{α ∈ 𝔸_𝔽} E[r(α)].    (1)

Here E[r(α)] could be written more explicitly as E[r(α(ω), ω)], where policy α selects an action sequence α(ω) that depends on the random outcome ω, and the rewards depend on the action sequence selected by α and the outcome. We will typically suppress the dependence on ω and interpret r(α) as a random variable representing the reward generated with policy α.
It is instructive to write the primal DP (1) in the standard Bellman-style recursive form. First, we will assume that the period-t rewards r_t are F_t-measurable for each set of actions and depend only on the first t+1 actions (a_0, ..., a_t); we will write r_t(a, ω) as r_t(a_0, ..., a_t, ω) with the understanding that (a_0, ..., a_t) is selected from the full sequence of actions a. For t > 0, let A_t(a_0, ..., a_{t−1}) be the subset of period-t actions in A_t that are feasible given the prior choice of actions (a_0, ..., a_{t−1}). We take the terminal value function V_{T+1}(a_0, ..., a_T) = 0 and, for t = 0, ..., T, we define

    V_t(a_0, ..., a_{t−1}) = sup_{a_t ∈ A_t(a_0, ..., a_{t−1})} { r_t(a_0, ..., a_t) + E[V_{t+1}(a_0, ..., a_t) | F_t] }.    (2)

Here both sides are random variables (and therefore implicitly functions of the outcome ω) and we select an optimal action a_t for each outcome. Because the rewards r_t are assumed to be F_t-measurable and the expected continuation values are conditioned on F_t, and thus F_t-measurable, the objective function on the right is F_t-measurable for each set of actions (a_0, ..., a_t). Thus, the supremum over actions a_t is also F_t-measurable, which implies that V_t is F_t-measurable. There is no loss in restricting the choice of actions a_t to be F_t-measurable; therefore, if the suprema on the right side of (2) are attained, we can construct a nonanticipative optimal policy using this recursion. The final value V_0 is equal to the optimal value of (1).

2.2. The Dual Approach

In our dual approach to the DP (1), we relax the requirement that the policies be nonanticipative and impose penalties that punish violations of the nonanticipativity constraints. We define relaxations of the nonanticipativity requirement by considering alternative information structures. We say that a filtration 𝔾 = (G_0, ..., G_T) is a relaxation of the natural filtration 𝔽 = (F_0, ..., F_T) if, for each t, F_t ⊆ G_t; we abbreviate this by writing 𝔽 ⊆ 𝔾. 𝔾 being a relaxation of 𝔽 means that the DM knows more in every period under 𝔾 than she knows under 𝔽. The perfect information filtration 𝕀 = (I_0, ..., I_T) is given by taking I_t = F for all t. We let 𝔸_𝔾 denote the set of policies that are adapted to 𝔾. For any relaxation 𝔾 of 𝔽, we have 𝔸_𝔽 ⊆ 𝔸_𝔾 ⊆ 𝔸_𝕀 = 𝔸; thus, as we relax the filtration, we expand the set of feasible policies.
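When the problem has Markov structure, so that period-t rewards and feasible transitions depend on a state variable rather than the full action history, recursion (2) reduces to familiar backward induction. A minimal sketch for a finite-state, finite-action problem (all function and parameter names here are hypothetical, not from the paper):

```python
def backward_induction(T, states, actions, reward, trans):
    """Solve V_t(s) = max_a { r_t(s, a) + E[V_{t+1}(s') | s, a] } backwards in
    time; trans(t, s, a) yields (probability, next_state) pairs."""
    V = {(T + 1, s): 0.0 for s in states}   # terminal values, as in V_{T+1} = 0
    policy = {}
    for t in range(T, -1, -1):
        for s in states:
            best, best_a = float("-inf"), None
            for a in actions:
                q = reward(t, s, a) + sum(p * V[(t + 1, s2)]
                                          for p, s2 in trans(t, s, a))
                if q > best:
                    best, best_a = q, a
            V[(t, s)], policy[(t, s)] = best, best_a
    return V, policy

# Trivial sanity check: one state, reward equal to the action chosen.
V, pol = backward_induction(
    T=1, states=["s"], actions=[0, 1],
    reward=lambda t, s, a: float(a),
    trans=lambda t, s, a: [(1.0, "s")])
# V[(0, "s")] == 2.0 (choose action 1 in both periods)
```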
The set of penalties 𝒵 is the set of all functions z(a, ω) that, like the total rewards, depend on the choice of action sequence a and the outcome ω. As with rewards, we will typically write the penalties as an action-dependent random variable z(a) (= z(a, ω)) or a policy-dependent random variable z(α) (= z(α(ω), ω)), suppressing the dependence on the outcome ω. We define the set of dual feasible penalties 𝒵_𝔽 to be those penalties that do not penalize nonanticipative policies (in expectation), that is,

    𝒵_𝔽 = { z ∈ 𝒵 : E[z(α_F)] ≤ 0 for all α_F in 𝔸_𝔽 }.    (3)

Policies that do not satisfy the nonanticipativity constraints (and thus are not feasible to implement) may have positive expected penalties. We can place an upper bound on the expected reward associated with any nonanticipative policy by relaxing the nonanticipativity constraint on policies and imposing a dual feasible penalty. This simple result can be viewed as a version of the weak duality lemma for linear programming:

Lemma 2.1 (Weak Duality). If α_F and z are primal and dual feasible, respectively (i.e., α_F ∈ 𝔸_𝔽 and z ∈ 𝒵_𝔽), and 𝔾 is a relaxation of 𝔽, then

    E[r(α_F)] ≤ sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].    (4)

Proof. With z, α_F, and 𝔾 as defined in the lemma, we have

    E[r(α_F)] ≤ E[r(α_F) − z(α_F)] ≤ sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].

The first inequality holds because z ∈ 𝒵_𝔽 (thus E[z(α_F)] ≤ 0) and the second because α_F ∈ 𝔸_𝔾.

Thus, any information relaxation with any dual feasible penalty provides an upper bound on all DP solutions. With a fixed penalty z, weaker relaxations lead to larger sets of feasible policies and weaker bounds. For example, if we consider the perfect information relaxation, the set of relaxed policies is simply the set 𝔸 of all policies and all actions are selected with full knowledge of the outcome ω. Thus, the weak duality lemma implies that for any α_F in 𝔸_𝔽 and z in 𝒵_𝔽,

    E[r(α_F)] ≤ sup_{α ∈ 𝔸} E[r(α) − z(α)] = E[ sup_{a ∈ A} ( r(a, ω) − z(a, ω) ) ].    (5)

If we take the penalty z = 0, this upper bound is the expected value with perfect information. Note that the upper bound (5) is in a form that is convenient for Monte Carlo simulation: we can estimate the expected value on the right side of (5) by randomly generating outcomes ω and solving a deterministic inner problem of choosing an action sequence a to maximize the penalized objective r(a, ω) − z(a, ω) for each ω. For instance, in our inventory example, the perfect information relaxation assumes the DM has knowledge of all demands and costs before making any ordering decisions. We estimate the dual bound by randomly generating demand/cost scenarios in the outer simulation, and the inner problem is a simple deterministic DP that chooses optimal ordering quantities in each demand/cost scenario. With imperfect information relaxations, we can often still use Monte Carlo simulation to estimate the upper bounds. For instance, in our option-pricing example, we will randomly generate interest rates and volatilities in the outer simulation, and the inner problem is a one-dimensional DP that considers uncertainty in stock prices.
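To make the simulation procedure concrete, here is a minimal sketch for a toy optimal-stopping problem (the price model, threshold heuristic, and all parameter values are illustrative assumptions of ours, not the paper's examples). The outer loop samples outcomes; with zero penalty, the perfect-information inner problem simply exercises at the best time along each path, and estimating both bounds from the same paths guarantees the ordering path by path:

```python
import random

def simulate_path(T=10, s0=100.0, sigma=0.05):
    """One sample path of a (hypothetical) multiplicative random walk."""
    path, s = [s0], s0
    for _ in range(T):
        s *= 1.0 + random.gauss(0.0, sigma)
        path.append(s)
    return path

def payoff(s, strike=100.0):
    """Put-style exercise payoff."""
    return max(strike - s, 0.0)

def bounds(n_paths=2000, threshold=95.0, seed=1):
    """Lower bound: a nonanticipative threshold heuristic.
    Upper bound: perfect information relaxation with zero penalty, as in (5)."""
    random.seed(seed)
    lb = ub = 0.0
    for _ in range(n_paths):
        path = simulate_path()
        # Inner problem with perfect information: exercise at the best time.
        ub += max(payoff(s) for s in path)
        # Heuristic: exercise the first time the price falls below threshold,
        # otherwise at expiration.
        value = payoff(path[-1])
        for s in path:
            if s < threshold:
                value = payoff(s)
                break
        lb += value
    return lb / n_paths, ub / n_paths

lb, ub = bounds()
# Weak duality: the heuristic's value never exceeds the perfect-information bound.
assert lb <= ub
```

The gap between `lb` and `ub` is what a good penalty is meant to close.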
If we minimize over the dual feasible penalties in (4), we obtain the dual of the primal DP (1):

    inf_{z ∈ 𝒵_𝔽} sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].    (6)

By the weak duality lemma, if we identify a policy α_F and penalty z that are primal and dual feasible, respectively, such that equality holds in (4), then α_F and z must be optimal for their respective problems. In such a case, there would be no gap between the values given by these primal and dual solutions. If the primal solution is bounded, there is always a dual feasible penalty that yields no gap. For example, consider the penalty z*(a, ω) = r(a, ω) − v*, where v* is the optimal value of the primal DP (1). This z* is dual feasible (because E[r(α_F)] ≤ v* for all α_F ∈ 𝔸_𝔽) and trivially optimal: no matter what policy is selected, the penalized objective function r(a, ω) − z*(a, ω) is equal to v*. The existence of this trivially optimal penalty is not helpful in practice because it requires knowing the optimal value v* of the primal DP. It does, however, show that there is no gap between the solutions to the primal and dual problems and that, in principle, we could determine the maximal expected reward in the primal DP (1) by solving the dual problem (6). This result is analogous to the strong duality theorem of linear programming.

Theorem 2.1 (Strong Duality). Let 𝔾 be a relaxation of 𝔽. Then

    sup_{α_F ∈ 𝔸_𝔽} E[r(α_F)] = inf_{z ∈ 𝒵_𝔽} sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].    (7)

Furthermore, if the primal problem on the left is bounded, the dual problem on the right has an optimal solution z* that achieves this bound.

The complementary slackness condition further characterizes the relationship between the primal and dual problems, saying that for a primal-dual pair (α_F, z) to be optimal, it is necessary and sufficient for α_F to have zero expected penalty and for α_F to solve the dual problem in the following sense.

Theorem 2.2 (Complementary Slackness). Let α_F and z be feasible solutions for the primal and dual problems, respectively (i.e., α_F ∈ 𝔸_𝔽 and z ∈ 𝒵_𝔽), with information relaxation 𝔾.
A necessary and sufficient condition for these to be optimal solutions for their respective problems is that E[z(α_F)] = 0 and

    E[r(α_F) − z(α_F)] = sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].    (8)

Equation (8) can be interpreted as saying that with an optimal penalty, in the dual problem the DM will be content to choose a policy that is nonanticipative even though she has the option of choosing a policy that is not. In applications, we will compare the heuristic policies α_F used to compute a lower bound with the policies α_G selected in the dual problem to see if we can identify some way to improve the heuristic policy.

Finally, we note a useful property of this dual approach: if we can simplify the primal problem by focusing on some subset of policies, we can restrict the dual problem to focus on policies in this same set. For example, if we know that the optimal policy for the primal problem is myopic or has a threshold structure, we can simplify the dual problem by considering only policies that have the same structure. This leads to dual bounds that are at least as tight and perhaps easier to compute than the dual bounds that do not include this constraint. We summarize this property as follows.

Proposition 2.1 (Structured Policies). If for some 𝔸' ⊆ 𝔸 we have sup_{α_F ∈ 𝔸_𝔽} E[r(α_F)] = sup_{α_F ∈ 𝔸_𝔽 ∩ 𝔸'} E[r(α_F)], then, for any dual feasible z, we have

    sup_{α_F ∈ 𝔸_𝔽} E[r(α_F)] ≤ sup_{α_G ∈ 𝔸_𝔾 ∩ 𝔸'} E[r(α_G) − z(α_G)] ≤ sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z(α_G)].    (9)

Moreover, the inequalities also hold for all z such that E[z(α_F)] ≤ 0 for all α_F in 𝔸_𝔽 ∩ 𝔸'.

For instance, in our option-pricing example, in the primal problem it is never optimal to exercise a call option prior to expiration, except possibly just before a dividend is paid. However, in the dual problem with a relaxed filtration, early exercise may be optimal. In our numerical experiments for this example, we will use this structural result and impose a "no early exercise" constraint in the dual problem for call options. The resulting bounds are both tighter and easier to compute than they would be without this constraint.

2.3. Good Penalties

In our discussion so far, we have considered the set of all dual feasible penalties. We now focus on identifying good penalties that are likely to be useful in practice. The main approach we will use to generate penalties is described in the following proposition. We will show shortly that we can, in principle, generate an optimal dual penalty using this approach, so that strong duality holds even when restricted to these good penalties.

Proposition 2.2 (Constructing Good Penalties). Let 𝔾 be a relaxation of 𝔽 and let w_0(a, ω), ..., w_T(a, ω) be a sequence of generating functions defined on A × Ω, where each w_t depends only on the first t+1 actions (a_0, ..., a_t) of a. Define

    z_t(a) = E[w_t(a) | G_t] − E[w_t(a) | F_t]  and  z(a) = Σ_{t=0}^T z_t(a).

Then: (i) For all α_F in 𝔸_𝔽, we have E[z_t(α_F) | F_t] = 0 for all t, and E[z(α_F)] = 0; and (ii) (z_0(a), ..., z_T(a)) is adapted to 𝔾 and z_t depends only on the first t+1 actions (a_0, ..., a_t) of a.

Property (i) of the proposition implies that the penalties z generated using the proposition will always be dual feasible in that E[z(α_F)] ≤ 0 for α_F in 𝔸_𝔽, but is stronger in that it implies the inequality defining feasibility holds with equality. The complementary slackness condition (Theorem 2.2) shows that an optimal penalty z will assign zero expected penalty to an optimal primal policy.
Penalties generated using Proposition 2.2 will assign zero expected penalty to all nonanticipative policies. Property (ii) of the proposition implies that the penalized objective function can be decomposed into period-t components r_t − z_t that depend only on what is known at period t under 𝔾 and the actions chosen in or before period t. This means we can solve the dual problem using a DP recursion like that of Equation (2), using the penalized rewards and based on filtration 𝔾 rather than 𝔽. Specifically, the terminal dual value function is V*_{T+1}(a_0, ..., a_T) = 0 and, for t = 0, ..., T, we have

    V*_t(a_0, ..., a_{t−1})
      = sup_{a_t ∈ A_t(a_0, ..., a_{t−1})} { r_t(a_0, ..., a_t) − z_t(a_0, ..., a_t) + E[V*_{t+1}(a_0, ..., a_t) | G_t] }
      = sup_{a_t ∈ A_t(a_0, ..., a_{t−1})} { r_t(a_0, ..., a_t) − E[w_t(a_0, ..., a_t) | G_t] + E[w_t(a_0, ..., a_t) | F_t] + E[V*_{t+1}(a_0, ..., a_t) | G_t] }.    (10)

The initial value, V*_0, provides an upper bound on the primal DP (1) or, equivalently, (2).

We can construct an optimal penalty using Proposition 2.2 by taking the generating functions to be based on the optimal DP value functions given by (2). Specifically, if we take w_t(a) = V_{t+1}(a_0, ..., a_t), we arrive at an optimal dual penalty z*(a) that we will refer to as the ideal penalty. It is easy to show by induction that with this choice of generating function, the dual value functions are equal to the corresponding primal value functions, i.e., V*_t = V_t. This is trivially true for the terminal values (both are zero). If we assume that V*_{t+1} = V_{t+1}, terms cancel and (10) reduces to the expression for V_t given in Equation (2). Thus, with this choice of generating function, we obtain an optimal penalty for any information relaxation. The following theorem summarizes this result and adds a bit more.

Theorem 2.3 (The Ideal Penalty). Let 𝔾 be a relaxation of 𝔽 and let z* be defined as in Proposition 2.2 by taking w_t(a) = V_{t+1}(a_0, ..., a_t). Then z* is dual feasible and optimal in that

    sup_{α_F ∈ 𝔸_𝔽} E[r(α_F)] = sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z*(α_G)].    (11)

Moreover, if α_F achieves the supremum for the primal problem on the left side of (11), then α_F is also optimal for the dual problem on the right. Finally, if 𝔾 is the perfect information relaxation and α_F is an optimal policy, then r(α_F) − z*(α_F) = E[r(α_F)] almost always.
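The zero-gap property of the ideal penalty is easy to verify by hand on a toy problem. In the sketch below (an illustrative example of ours, not one from the paper), a single action must be chosen before a fair coin flip is observed; the perfect-information bound with zero penalty is loose, while the ideal penalty built from the value function recovers the primal optimal value exactly, with zero variance:

```python
# Toy DP: choose a0 in {0, 1} before observing omega in {0, 1} (fair coin);
# reward 1 if the action matches the outcome. Primal optimal value = 0.5.
P = 0.5
OUTCOMES = (0, 1)

def reward(a0, omega):
    return 1.0 if a0 == omega else 0.0

# Perfect information, zero penalty: E[max_a r(a, omega)] -- a loose bound.
loose = sum(P * max(reward(a, w) for a in (0, 1)) for w in OUTCOMES)

# Ideal penalty from the value function: w0(a) = V1(a0, omega) = reward(a0, omega),
# so z0(a, omega) = E[w0 | G0] - E[w0 | F0] = reward(a0, omega) - 0.5.
def z(a0, omega):
    return reward(a0, omega) - 0.5

# Penalized perfect-information bound: E[max_a (r - z)] -- tight, zero variance,
# since r - z = 0.5 for every action and every outcome.
tight = sum(P * max(reward(a, w) - z(a, w) for a in (0, 1)) for w in OUTCOMES)

print(loose, tight)  # -> 1.0 0.5
```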
Although the value functions will not be known in applications, the form of z* illustrates the ideal that we would like to approximate with our choice of penalties. Intuitively, we would like to choose penalties that eliminate the benefit of choosing actions based on the information in 𝔾 rather than relying on the information in the natural filtration 𝔽. That is, we want to choose a generating function w_t so that the differences E[w_t(a) | G_t] − E[w_t(a) | F_t] approximate the differences E[V_{t+1}(a) | G_t] − E[V_{t+1}(a) | F_t] and the

conditional expectations (E[w_t(a) | G_t] and E[w_t(a) | F_t]) are not too difficult to compute.

In applications, we can approximate z* in a variety of ways. Haugh and Kogan (2004) and Andersen and Broadie (2004) proposed methods for generating penalties (or dual martingales) in the option-pricing context that can be generalized to our setting. Generalizing Haugh and Kogan's approach, we can approximate the ideal penalty z* by using an approximate value function v̂_t(a, ω) in place of the true value function V_t(a, ω). This leads to period-t penalties of the form z_t(a) = E[v̂_{t+1}(a) | G_t] − E[v̂_{t+1}(a) | F_t]. To use this approach, we must somehow estimate or calculate the conditional expectations E[v̂_{t+1}(a) | G_t] and E[v̂_{t+1}(a) | F_t]. Haugh and Kogan consider the perfect information relaxation (G_t = F), so E[v̂_{t+1}(a) | G_t] = v̂_{t+1}(a) can be evaluated directly for any sample path. They estimate E[v̂_{t+1}(a) | F_t] using a nested simulation procedure: for each sample path and each period t, they estimate E[v̂_{t+1}(a) | F_t] by generating random successors to the period-t state and averaging the next-period values v̂_{t+1}(a) in these successor states. The penalties generated using this approach will lead to valid bounds as long as the nested estimates of these conditional expectations are unbiased; see Proposition 2.3(iv) below.

Andersen and Broadie (2004) also consider a perfect information relaxation, but base their penalty on a given policy rather than an approximate value function. In our framework, their approach can be seen as approximating the value function V_t(a, ω) with v̂_t(a, ω) = E[r(α_t(a)) | F_t], where α_t(a) denotes a policy that takes the first t actions (a_0, ..., a_{t−1}) to match those of a and then continues according to some given rule. The penalty is then z_t(a) = E[v̂_{t+1}(a) | G_t] − E[v̂_{t+1}(a) | F_t]; with a perfect information relaxation, this is equivalent to z_t(a) = E[r(α_{t+1}(a)) | F_{t+1}] − E[r(α_{t+1}(a)) | F_t]. Andersen and Broadie generate sample paths in the outer simulation and estimate the conditional expectations using nested simulation.
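In simulation terms, the Haugh and Kogan construction can be sketched as follows (the function names and the successor sampler are hypothetical placeholders for a model-specific implementation). Along each outer path, the period-t penalty term is v̂_{t+1} evaluated at the realized next state, minus a nested-simulation average of v̂_{t+1} over random successors of the period-t state:

```python
def hk_penalty_terms(path, v_hat, sample_successor, n_inner=100):
    """For a perfect information relaxation, estimate the penalty terms
    z_t = v_hat(t+1, realized successor) - E[v_hat(t+1, successor) | period-t state]
    along one outer sample path. The conditional expectation is replaced by a
    nested-simulation average; unbiasedness of that estimate is what keeps the
    resulting bound valid (Proposition 2.3(iv))."""
    terms = []
    for t in range(len(path) - 1):
        realized = v_hat(t + 1, path[t + 1])
        nested = sum(v_hat(t + 1, sample_successor(t, path[t]))
                     for _ in range(n_inner)) / n_inner
        terms.append(realized - nested)
    return terms

# Sanity check: with a deterministic "successor" the conditional expectation is
# exact and every penalty term is exactly zero.
terms = hk_penalty_terms([0.0, 1.0, 2.0],
                         v_hat=lambda t, s: s,
                         sample_successor=lambda t, s: s + 1.0)
print(terms)  # -> [0.0, 0.0]
```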
Whereas the nested simulations in the Haugh-Kogan approach consider a single period, here each period's nested simulation follows the specified policy through the end of the horizon or until the policy calls for stopping. Because each future period is considered in each nested simulation, the work involved in the Andersen-Broadie approach potentially grows with T², where T is the number of periods considered in the model. Again, these penalties will lead to valid bounds as long as the estimates of these conditional expectations are unbiased.

In practice, there will typically be a trade-off between the quality of the bound and the computational effort required to compute it. We can control this trade-off through our choice of information relaxation and penalty. The following proposition provides some properties of penalties and information relaxations that are useful in understanding these trade-offs.

Proposition 2.3 (Properties of Penalties and Relaxations).
(i) Let 𝔾₁ and 𝔾₂ be filtrations satisfying 𝔽 ⊆ 𝔾₁ ⊆ 𝔾₂ and let z¹ and z² be penalties constructed using Proposition 2.2 with relaxations 𝔾₁ and 𝔾₂ and a common sequence of generating functions w_0, ..., w_T. Then

    sup_{α ∈ 𝔸_𝔾₁} E[r(α) − z¹(α)] ≤ sup_{α ∈ 𝔸_𝔾₂} E[r(α) − z²(α)].    (12)

(ii) For any two dual feasible penalties z¹ and z² and information relaxation 𝔾, we have

    inf_{α_G ∈ 𝔸_𝔾} E[z²(α_G) − z¹(α_G)] ≤ sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z¹(α_G)] − sup_{α_G ∈ 𝔸_𝔾} E[r(α_G) − z²(α_G)] ≤ sup_{α_G ∈ 𝔸_𝔾} E[z²(α_G) − z¹(α_G)].    (13)

(iii) Let 𝔾 and ℍ be filtrations satisfying 𝔽 ⊆ ℍ ⊆ 𝔾 and let w_0, ..., w_T be a sequence of generating functions satisfying the conditions of Proposition 2.2. The penalty z given by z_t(a) = E[w_t(a) | G_t] − E[w_t(a) | H_t] and z(a) = Σ_{t=0}^T z_t(a) satisfies the results of Proposition 2.2.

(iv) Let 𝔾 be a relaxation of 𝔽 and z(a) = Σ_{t=0}^T z_t(a) be a dual feasible penalty such that z_t(a) is G_t-measurable and depends only on the first t+1 actions of a. Suppose ẑ(a) = Σ_{t=0}^T ẑ_t(a), where each ẑ_t(a) depends only on the first t+1 actions of a, and further suppose that each ẑ_t(a) is an unbiased estimate of z_t(a) in that E[ẑ_t(a) | G_t] = z_t(a). Let 𝔾̂ be a relaxation of 𝔾 that assumes that, in addition to what is known under 𝔾, the values of ẑ_t(a) are revealed in period t.
Then

    sup_{α ∈ 𝔸_𝔾} E[r(α) − z(α)] ≤ sup_{α ∈ 𝔸_𝔾̂} E[r(α) − ẑ(α)].    (14)

The first result of the proposition says that if we generate penalties with a common set of generating functions, looser relaxations lead to weaker bounds. For example, we may find that the bounds given by using a simple generating function (say, w = 0) may be good enough with one information relaxation, but not good enough with a looser relaxation. The second result of the proposition can be viewed as a continuity property: if the penalties z¹ and z² are close in that the difference E[z²(α_G) − z¹(α_G)] is small for all α_G, then the bounds provided by the two penalties will also be close. For example, if z² is the ideal penalty z* and therefore yields the optimal upper bound, the bound given by some other penalty z¹ will exceed the optimal bound by no more than sup_{α_G} E[z*(α_G) − z¹(α_G)]. In this sense, penalties that are close to the ideal penalty will lead to bounds that are close to optimal. The third result can be helpful for determining penalties when E[w_t(a) | F_t] is difficult to calculate. For instance, in the option-pricing example, if we assume that under the natural filtration volatility is unobserved, we may be able

to simplify the computation of bounds by calculating penalties using a filtration that assumes that the volatility is observed. The final result of Proposition 2.3 concerns the effects of errors when penalties are estimated, for example, using nested simulations as in Haugh and Kogan (2004) and Andersen and Broadie (2004). Here we can imagine the probability space (Ω, F, P) as including the uncertainties associated with the estimation of penalties as well as the original model uncertainties. These estimation uncertainties are not revealed under filtrations 𝔽 or 𝔾 and do not affect the rewards or penalties and thus are irrelevant to the primal and true dual problem. The estimates are, however, revealed under 𝔾̂, and actions are selected to maximize the estimated penalized reward r(a) − ẑ(a) rather than the true penalized reward r(a) − z(a). Here we see that when these estimated penalties are unbiased, we obtain estimates of the bounds that are valid but weaker than the bounds given by using the penalty z itself. Glasserman (2004) provides some numerical results studying the quality of the bounds in an option-pricing example with varying numbers of trials in the nested simulations. His results (and others') show the importance of estimating penalties accurately. Our results for the option example in §4.7 confirm this finding.

2.4. Summary of Approach

Before turning to our examples, it may be worthwhile to summarize the steps involved in our approach. Given a dynamic programming model:
- Identify a heuristic policy that can be used in a simulation study to estimate a lower bound on the optimal value (or upper bound on the optimal cost) for the problem.
- Choose an information relaxation that makes it easy to determine optimal decisions given the additional information in the relaxation. It is often natural to start by considering a perfect information relaxation, although in some problems there may be other natural starting points.
- Find a penalty that does not greatly complicate the calculation of optimal decisions with the chosen information relaxation.
We can start with zero penalty, but this may lead to weak upper bounds.
- Estimate lower and upper bounds on the optimal value. In our examples, we will typically estimate the upper and lower bounds simultaneously in a single simulation.
- If the gap between bounds is sufficiently small, we may conclude that the heuristic policy is good enough for use in practice, and we are done. If not, we can study the differences between the heuristic policies and the dual policies and see if these suggest some ideas for improving the heuristic policies, relaxations, or penalties.

In the next two sections, we will study two complex examples and discuss issues involved in choosing heuristic policies, information relaxations, and penalties in these applications.

3. Example: Adaptive Inventory Control

Our first example is an adaptive inventory control model where demand is nonstationary and partially observed, meaning the probability distribution for demand changes over time and the true demand distribution is not known. These kinds of models are of significant practical interest, but are quite difficult to solve. Treharne and Sox (2002) consider several heuristic policies and evaluate the performance of these policies in a set of five-period examples that they were able to solve exactly. We illustrate our dual bounding approach by evaluating some of these heuristic policies in larger versions of Treharne and Sox's examples.

3.1. The Model

The goal is to find a policy for ordering goods over T periods (t = 0, ..., T−1) to minimize the expected total costs. The inventory level at the beginning of period t is denoted by x_t and the amount ordered in period t is a_t. The demand in period t is uncertain and denoted by d_t. The inventory level evolves according to x_{t+1} = x_t + a_t − d_t = x_0 + Σ_{τ=0}^{t} (a_τ − d_τ), where x_0 is the initial inventory level. This evolution equation assumes unmet demand is backordered and appears as a negative inventory level entering the next period; the equation also assumes there is no lead time required to fulfill the orders. The order quantities and demands are assumed to be nonnegative integers.
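The evolution equation and cost structure just described are straightforward to simulate under a heuristic policy, which is how the primal-side bound (an upper bound on cost, since this is a minimization problem) is estimated. Below is a minimal sketch using an order-up-to (base-stock) heuristic; the parameter values and the uniform demand model are illustrative assumptions of ours, not Treharne and Sox's calibration:

```python
import random

def simulate_order_up_to(T=10, x0=0, base_stock=5, c=1.0, h=0.2, p=1.0,
                         demand=lambda t: random.randint(0, 9), seed=0):
    """Simulate x_{t+1} = x_t + a_t - d_t under an order-up-to heuristic and
    return the total cost (ordering plus holding/backorder) for one scenario."""
    random.seed(seed)
    x, total = x0, 0.0
    for t in range(T):
        a = max(0, base_stock - x)   # order up to the base-stock level
        d = demand(t)
        x = x + a - d                # unmet demand backordered (x may go negative)
        total += c * a + h * max(0, x) + p * max(0, -x)
    return total
```

Averaging this total over many scenarios sampled from the demand model gives the heuristic's estimated expected cost; in the paper's setting the demand scenarios would be drawn from the belief/Markov demand model rather than this fixed distribution.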
The period-t demand d_t is drawn from a distribution θ_t that changes stochastically, following a Markov process. The demand d_t is observed at the end of period t, but the distribution θ_t is never observed. We begin with a prior distribution π_0 on the initial demand distribution θ_0 and update this over time, with the period-(t + 1) distribution π_{t+1}(π_t, d_t) taking into account the prior beliefs π_t, the observed demand d_t, and the possibility of the distribution changing. In each period, there are ordering costs as well as costs associated with holding inventory or failing to meet demand. The cost of ordering a_t units is c_t a_t, where c_t is the cost of ordering one item or unit. The cost of holding inventory x_{t+1} from period t into period t + 1 is f_t(x_{t+1}) = h_t max(0, x_{t+1}) + p_t max(0, −x_{t+1}), where h_t is the per-unit cost of holding excess inventory in period t and p_t is the per-unit penalty associated with backordering unmet demand in period t. Treharne and Sox assume a terminal cost of −c_T x_T to capture the value (or cost) of holding inventory (or unmet demand) at the end of the planning horizon. We generalize Treharne and Sox's model by allowing the ordering costs c_t to vary following a Markov chain that is independent of the demands d_t and demand distributions θ_t; we assume that the period-t ordering cost c_t is known at the beginning of period t. This generalization will allow us to consider a broader range of information relaxations and makes the problem harder to solve. Placing this model in the general framework of §2.1, the actions a_0, ..., a_{T−1} are the order quantities for each

period and the action sequences a are drawn from the set A of T-vectors of nonnegative integers. An outcome ω is a sample path that includes the demands, demand distributions, and ordering costs for each period and a terminal cost c_T; that is, the outcomes are of the form ω = (d_0, θ_0, c_0, ..., d_{T−1}, θ_{T−1}, c_{T−1}, c_T). The natural filtration corresponds to knowing the demands d_0, ..., d_{t−1} and costs c_0, ..., c_t at the beginning of period t. Because the goal here is to minimize costs, we can either rewrite the primal DP (1) as a minimization problem or else take the rewards in (1) to be the negative costs.

The structure of the adaptive inventory model is perhaps clearer if we view the problem as a partially observable Markov decision process and write it recursively. The period-t state variable is (x_t, c_t, π_t), where x_t is the inventory level at the beginning of period t, c_t is the ordering cost in period t, and π_t is the probability distribution on the period-t demand distribution θ_t. In this recursive formulation, it is convenient to take the decision variables to be the order-up-to levels y_t = x_t + a_t rather than the order quantities a_t. We can then write the period-t cost-to-go function J_t, for t = 0, ..., T − 1, as

J_t(x_t, c_t, π_t) = −c_t x_t + min_{y_t ≥ x_t} { c_t y_t + E[ f_t(y_t − d_t) + J_{t+1}(y_t − d_t, c_{t+1}, π_{t+1}(π_t, d_t)) | π_t, c_t ] }.   (15)

Here d_t and c_{t+1} denote the random period-t demand and period-(t + 1) cost, and the terminal cost function is J_T(x_T, c_T, π_T) = −c_T x_T. What makes this problem difficult to solve is that each demand sequence (d_0, ..., d_{t−1}) leads to a different π_t and, consequently, the number of scenarios that must be considered grows exponentially in the number of periods considered. For instance, the problems that Treharne and Sox solved to optimality had 5 time periods, 19 possible demand levels, and one ordering cost level. To find an optimal policy, they had to solve the optimization problem (15) for approximately 138,000 different (π_t, c_t)-scenarios.
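The belief update π_{t+1}(π_t, d_t) behind (15) is a standard hidden-Markov filtering step: reweight the current belief by the likelihood of the observed demand, normalize, and push the result through the transition matrix of the unobserved distribution process. A minimal sketch, with a made-up two-regime example rather than the paper's three truncated negative binomial distributions:

```python
def update_belief(pi, P, lik, d):
    """One step of pi_{t+1}(pi_t, d_t): pi is the belief over demand
    regimes, P[i][j] the regime transition probability, and lik[i][d]
    the probability of observing demand d under regime i."""
    n = len(pi)
    # Bayes step: condition the belief on the observed demand d.
    post = [pi[i] * lik[i][d] for i in range(n)]
    z = sum(post)
    post = [w / z for w in post]
    # Prediction step: the regime may change before the next period.
    return [sum(post[i] * P[i][j] for i in range(n)) for j in range(n)]

# Regime 0 favors demand 0; regime 1 favors demand 1 and is absorbing.
P = [[0.9, 0.1], [0.0, 1.0]]
lik = [[0.7, 0.3], [0.2, 0.8]]
pi1 = update_belief([0.5, 0.5], P, lik, d=1)

# Counting distinct demand histories reproduces the scenario count
# above: 1 + 19 + 19^2 + 19^3 + 19^4 = 137,561, i.e., about 138,000.
n_scenarios = sum(19 ** t for t in range(5))
```

Because π_t is a deterministic function of the observed demand history, every distinct demand sequence produces a distinct belief state, which is exactly why the scenario tree grows exponentially.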
In our numerical examples, we will consider 10 time periods, 19 demand levels, and three cost levels; we would have to solve on the order of 10^15 such optimization problems to find an optimal policy.

3.2. Heuristic Policies

Because of the complexity of the primal problem, Treharne and Sox propose using simpler limited-look-ahead policies that choose an order quantity that is optimal for a truncated version of the model that looks only zero, one, or two periods into the future. For t = 0, ..., T − 1, the L-period look-ahead cost-to-go function is defined as

J^L_t(x_t, c_t, π_t) = −c_t x_t + min_{y_t ≥ x_t} { c_t y_t + E[ f_t(y_t − d_t) + J^{L−1}_{t+1}(y_t − d_t, c_{t+1}, π_{t+1}(π_t, d_t)) | π_t, c_t ] }.   (16)

In the terminal cases with t = T or L = −1, we take J^L_t(x_t, c_t, π_t) = −c_t x_t. When simulating the inventory system using an L-period look-ahead policy, we determine the order quantity for a particular (π_t, c_t)-scenario by solving (16) for the optimal order-up-to level y_t. We then draw the random demand d_t and next-period cost c_{t+1}, calculate the updated probability distribution π_{t+1}, and repeat the process by finding the order quantity for the next period using the L-period look-ahead value function starting at (x_{t+1}, c_{t+1}, π_{t+1}).

The complexity of these limited-look-ahead policies grows exponentially with the look-ahead horizon L. In our numerical examples, we take L = 0, 1, and 2, and we must solve 1, 58, and 1,141 scenario-specific optimization problems (respectively) to determine the recommended order quantity for each period. If we estimate the expected costs of these policies using a simulation with T periods and K trials, we must solve KT, 58KT, or 1,141KT optimization problems for the 0-, 1-, and 2-period look-ahead policies, respectively.

3.3. Information Relaxations

We will study three different information relaxations in this example, each of which allows us to avoid considering the full tree of all possible cost/demand scenarios. First, we will consider the perfect information relaxation. In this case, in the outer simulation we randomly generate the full sequence of ordering costs c_0, ..., c_T, demand distributions θ_0, ..., θ_{T−1}, and actual demands d_0, ..., d_{T−1}. In the inner problem, we determine optimal order quantities by solving a simple deterministic DP.
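For intuition, the perfect-information inner problem (with zero penalty) can be sketched as a small backward induction over integer inventory levels for one sampled cost/demand path. The inventory grid bound and the parameters in the example are illustrative assumptions, not the paper's settings:

```python
def perfect_info_cost(x0, demands, costs, cT, h, p, y_max):
    """Deterministic DP for one known demand/cost path: choose
    order-up-to levels y_t >= x_t to minimize total cost, with a
    terminal credit -cT * x_T for leftover inventory."""
    T = len(demands)
    grid = range(-y_max, y_max + 1)
    J = {x: -cT * x for x in grid}            # terminal values
    for t in reversed(range(T)):
        Jt = {}
        for x in grid:
            best = float("inf")
            for y in range(max(x, -y_max), y_max + 1):
                xn = y - demands[t]
                if xn < -y_max:
                    continue                  # skip off-grid states
                stage = (costs[t] * (y - x)
                         + h * max(0, xn) + p * max(0, -xn))
                best = min(best, stage + J[xn])
            Jt[x] = best
        J = Jt
    return J[x0]

# With equal costs and no terminal credit, ordering exactly the demand
# in each period is optimal here: total cost 2 + 3 = 5.
best_cost = perfect_info_cost(0, [2, 3], [1.0, 1.0], 0.0, 1.0, 10.0, y_max=10)
```

Averaging this path-wise minimal cost over many sampled paths gives the (zero-penalty) perfect-information lower bound on the optimal expected cost.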
With this relaxation, we will be selecting random samples from the large tree of possible cost/demand scenarios. Second, we will consider a tighter, imperfect information relaxation that assumes the demand distributions θ_0, ..., θ_{T−1} and actual demands d_0, ..., d_{T−1} are known in advance, but assumes the ordering costs c_t are not known until period t. In this case, we randomly generate the demand distributions and demands in the outer simulation. In the inner problem, we solve a small stochastic DP that determines cost-dependent order quantities for each period. The third relaxation is tighter than the first two: it assumes that the actual costs c_t and demands d_t are revealed as in the natural filtration (in period t and period t + 1, respectively), but the demand distribution θ_t is known in period t; the natural filtration assumes θ_t is never observed. In this case, if we assume zero penalty, the dual problem can be formulated as a Markov DP with state variable (x_t, c_t, θ_t); the number of scenarios that must be considered no longer grows exponentially in T, and this DP is easy to solve.

3.4. Penalties

As discussed in §2.3, the ideal penalty takes the generating function w_t to be the optimal continuation value, i.e.,

the period-(t + 1) value function V_{t+1}. Here we will take the generating function w^L_t for the L-period look-ahead penalty to be given by the limited-look-ahead cost-to-go functions defined by Equation (16),

w^L_t = J^{L−1}_{t+1}(y_t − d_t, c_{t+1}, π_{t+1}(π_t, d_t)).   (17)

For example, in the myopic case with L = 0, the generating function is simply −c_{t+1}(y_t − d_t). Although we would not expect the (L − 1)-period look-ahead cost-to-go functions to provide a very good approximation of the actual cost-to-go functions J_{t+1} (they consider the costs over a small fraction of the total time frame), these limited-look-ahead cost functions may provide a reasonable approximation of the change in costs due to having the additional information provided by the relaxation instead of the natural filtration.

In the perfect information relaxation, the full sequence of demands and costs is known in advance and generated in the outer simulation. Let d̂^k_0, ..., d̂^k_{T−1} and ĉ^k_0, ..., ĉ^k_T denote the sequences of these values generated in the kth trial of the simulation, and let π̂^k_t denote the period-t probability distribution on θ_t given by starting with the prior distribution π_0 and updating based on seeing d̂^k_0, ..., d̂^k_{t−1}. Following Equation (10), we can write the inner problem in the kth trial with the L-period look-ahead penalty as

Ĵ^L_{t,k}(x_t) = −ĉ^k_t x_t + min_{y_t ≥ x_t} { ĉ^k_t y_t − J^{L−1}_{t+1}(y_t − d̂^k_t, ĉ^k_{t+1}, π̂^k_{t+1}) + Ĵ^L_{t+1,k}(y_t − d̂^k_t) + E[ f_t(y_t − d_t) + J^{L−1}_{t+1}(y_t − d_t, c_{t+1}, π_{t+1}(π̂^k_t, d_t)) | π̂^k_t, ĉ^k_t ] },   (18)

with terminal value Ĵ^L_{T,k}(x_T) = −ĉ^k_T x_T. Note that the limited-look-ahead cost-to-go function J^{L−1}_{t+1} and the expectation in (18) would be calculated when determining the limited-look-ahead order quantity for this trial (see Equation (16)). Consequently, when simulating to estimate the expected costs with an L-period look-ahead policy, it is not difficult to simultaneously estimate the corresponding dual bound: we need only solve one additional scenario-specific optimization problem for each period.
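A useful sanity check on penalties of this form: because the penalty is a generating function minus its conditional expectation given the natural filtration, any decision that does not depend on the yet-unrevealed demand incurs zero penalty on average. A toy check using the myopic generating function −c_{t+1}(y_t − d_t) and two made-up, equally likely demand scenarios:

```python
c_next = 0.6
scenarios = {3: 0.5, 7: 0.5}              # demand -> probability

def w(y, d):
    """Myopic generating function -c_{t+1} * (y - d)."""
    return -c_next * (y - d)

def penalty(y, d):
    """z = w(y, d) - E[w(y, .)]: charges decisions tailored to d."""
    ew = sum(q * w(y, s) for s, q in scenarios.items())
    return w(y, d) - ew

# A nonanticipative order-up-to level uses the same y in every
# scenario, so its expected penalty is exactly zero:
expected_z = sum(q * penalty(5, d) for d, q in scenarios.items())
```

This zero-conditional-mean property is what keeps the penalized inner problem a valid bound on the optimal expected cost.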
The dual bounds are also easy to calculate with the generating function of Equation (17) for the imperfect information relaxation that assumes that the demands d_0, ..., d_{T−1} and demand distributions θ_0, ..., θ_{T−1} are known in advance, but the ordering costs c_t are revealed over time as in the natural filtration. In this case, the inner problem is a stochastic DP that explicitly considers the uncertainty about the ordering costs; see Appendix A.7 for details. By Proposition 2.2(i), we know that the bounds given by using this imperfect information relaxation will be at least as good as those given by the perfect information relaxation.

The third information relaxation we consider in this problem assumes the demand distributions θ_t are observed in period t, but c_t and d_t are revealed over time as in the natural filtration. As discussed in §3.3, with zero penalty, this dual problem can be formulated as a Markov decision problem that is not difficult to solve. However, with this relaxation, the generating function of Equation (17) leads to an inner problem that is not easy to solve. The difficulty is that the generating functions depend on the probability distributions π_{t+1} that, in turn, depend on the whole history of demands d_0, ..., d_t. This dependence destroys the Markovian structure that makes it easy to solve the inner problem with no penalty. Thus, the generating function (17) works well with the first two relaxations, but not with the third.

3.5. Numerical Results

In this section, we describe numerical results for the adaptive inventory control example. Our choice of parameters closely follows Treharne and Sox (2002). Specifically, following Treharne and Sox, we assume that there are three possible random demand distributions, each of which is a truncated negative binomial distribution that ranges from 0 to 18 units. The three distributions are low, medium, and high, with means of 1, 9, and 16 units, respectively, before truncation. We consider seven different transition probability matrices representing various trends for the demand distributions.
The holding costs h_t are set to $1.00 per unit and the backorder costs p_t are $1.00, $1.86, or $4.00 per unit. Finally, we consider four different priors on the initial demand distribution θ_0: the first, third, and fourth represent cases where the demands are most likely to be high, medium, or low, respectively; the second prior is a uniform distribution across the three different demand distributions. In total, there are 84 different combinations of parameters to consider (7 transition matrices × 3 backorder costs × 4 priors). In each case, we assume the initial ordering costs c_0 are $0.60 per unit and later costs take values $0.00, $0.60, or $1.20, following a Markov chain. (These assumptions are described in detail in Appendix C.) Finally, we take the planning horizon T to be 10 periods and assume zero initial inventory.

In our numerical experiments, we calculate upper and lower bounds on the optimal expected costs using the zero-, one-, or two-step look-ahead policies and penalties. For each combination of model parameters, we estimate the bounds using a simulation of 1,000 trials. Figure 1 summarizes the results for the perfect information relaxation. Appendix C provides the numbers underlying this figure (estimated means and standard errors) as well as results for the imperfect information relaxation described in §3.3. We call the plot of Figure 1 an "aquarium plot." In the figure, there are 84 sets of bars, each appearing (if you have bad eyesight!) like a tropical fish. Each fish represents the results for a particular set of parameters and

[Figure 1. Upper and lower bounds with the perfect information relaxation. Vertical axis: expected inventory cost. Legend: blue (left) bars, myopic upper and lower bounds; black (middle) bars, one-period look-ahead upper and lower bounds; red (right) bars, two-period look-ahead upper and lower bounds; black dotted line, observable demand distribution lower bounds. Cases grouped by transition matrix: stable with positive, negative, or zero correlation; upward, slow; upward, fast; downward, slow; downward, fast.]

consists of three vertical bars with blue, black, and red colors and horizontal markers on each end. The blue bars on the left of each fish represent the myopic (or 0-period look-ahead) upper and lower bounds; the black bars in the middle represent the 1-period look-ahead upper and lower bounds; and the red bars on the right represent the two-period look-ahead upper and lower bounds. The different sets of parameters are grouped first according to the transition matrices (indicated at the bottom), then by backorder costs (with left to right representing high to low costs), and last by the initial priors.

In most cases, the gaps between bounds narrow as we increase the look-ahead horizon, albeit at varying rates. In many cases, the bounds are all quite narrow and the fish look like minnows; in these cases, we could probably assume that the myopic policies are good enough and not consider more complex policies. In the cases with a stable transition matrix with positive correlation (on the left of the figure), the fish have relatively wide tails on the left, but narrow quickly: here we may not be satisfied with the quality of the myopic policy, but may find the one- or two-period look-ahead policies to be good enough. There are, however, a few cases with "downward, slow" and "downward, fast" transitions (on the right side of the figure) where the gaps remain relatively large even with a two-period look-ahead policy. We will return to these cases in §3.6 below.
Appendix C provides results for the imperfect information relaxation where all demands are assumed to be known in advance, but costs are revealed sequentially over time. The estimated bounds with imperfect information are quite similar to those with perfect information, but the imperfect information bounds are more precisely estimated. Across the 84 cases, the mean standard errors for the dual bounds with the imperfect information relaxation average $0.216, $0.172, and $0.137 for the zero-, one-, and two-period look-ahead bounds, respectively. With the perfect information relaxation, the corresponding mean standard errors for the dual bounds average $0.821, $0.554, and $ . Intuitively, the improved precision in the imperfect information bounds comes from eliminating random sampling variations associated with costs by explicitly enumerating the cost scenarios. In the imperfect information case, we also enumerate the cost scenarios when estimating the expected cost of the heuristic policy; this is somewhat more time consuming (for a fixed number of samples) but improves the precision of the estimated bounds.
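The variance-reduction idea in the last paragraph, enumerating the cost scenarios exactly instead of sampling them, can be sketched directly. The chain below uses the three cost levels from the experiments but an invented transition matrix:

```python
# Illustrative transition matrix over cost levels $0.00, $0.60, $1.20.
P = {0.60: {0.00: 0.2, 0.60: 0.6, 1.20: 0.2},
     0.00: {0.00: 0.7, 0.60: 0.3},
     1.20: {0.60: 0.3, 1.20: 0.7}}

def cost_paths(c0, T):
    """Enumerate every cost path of length T with its probability."""
    paths = [((c0,), 1.0)]
    for _ in range(T - 1):
        paths = [(path + (nxt,), pr * q)
                 for path, pr in paths
                 for nxt, q in P[path[-1]].items()]
    return paths

def exact_expectation(g, c0, T):
    """Zero-variance 'estimate' of E[g(cost path)] by enumeration,
    replacing the Monte Carlo sampling of cost scenarios."""
    return sum(pr * g(path) for path, pr in cost_paths(c0, T))

avg_total_cost = exact_expectation(sum, 0.60, T=2)   # 0.60 + E[c_1]
```

Because the costs are enumerated rather than sampled, the only remaining Monte Carlo noise in the bound comes from the demand side of the simulation.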

Table 1. Computation times (in seconds) for calculating bounds in the inventory example. For each look-ahead horizon (L = 0, 1, or 2 periods), the table reports, under both the perfect and the imperfect information relaxations, the time required to evaluate the heuristic policy and the additional time required to compute the dual bound. [Numerical entries omitted.]

The run times required to calculate these bounds are shown in Table 1. We show the time required to evaluate the zero-, one-, or two-period look-ahead heuristic policies using 1,000 trials for one set of model parameters and the additional time required to calculate the dual bounds with these same 1,000 trials. Here we see that once we have calculated the bounds associated with the heuristic policies (and the associated look-ahead value functions), it takes little additional time to compute the dual bounds. The myopic dual bounds are somewhat faster to compute than the one- and two-period look-ahead bounds because in the myopic case we know the objective function in Equation (18) is convex and can simplify the optimization problem. The imperfect information bounds take somewhat longer to compute than the perfect information bounds, because we must solve for dual optimal actions in each of the three possible cost states in each period rather than the one randomly chosen cost state that is considered in the perfect information case.

As discussed in Section 3.3, we can construct an alternative lower bound on expected costs by considering an information relaxation where the demands d_t and costs c_t are revealed over time according to the natural filtration but the demand distributions θ_t are observed in period t (rather than never observed, as assumed in the natural filtration). If we take the penalty to be zero, this problem can be formulated as a Markov DP that takes approximately 0.08 seconds to solve. These "observable demand distribution" bounds are shown as connected dotted lines in Figure 1. These lines are well below the fish representing the limited-look-ahead bounds.
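The zero-penalty, observable-distribution lower bound is just a modest Markov DP on the state (x_t, c_t, θ_t). A compact sketch, with tiny illustrative chains and grids rather than the 19-demand-level, three-cost-level setting of the experiments:

```python
def observable_regime_dp(T, grid, costs, Pc, Ps, dist, h, p, cT):
    """Backward induction on the observed state (inventory x, cost
    index i, demand-regime index j); dist[j][d] = P(demand = d | j)."""
    nc, ns = len(costs), len(dist)
    V = {(x, i, j): -cT * x for x in grid for i in range(nc) for j in range(ns)}
    for _ in range(T):
        Vn = {}
        for x in grid:
            for i in range(nc):
                for j in range(ns):
                    best = float("inf")
                    for y in (yy for yy in grid if yy >= x):   # order up to y
                        val = costs[i] * (y - x)               # ordering cost
                        for d, pd in enumerate(dist[j]):
                            if pd == 0.0:
                                continue
                            xn = max(min(y - d, grid[-1]), grid[0])  # clamp
                            stage = h * max(0, xn) + p * max(0, -xn)
                            cont = sum(Pc[i][i2] * Ps[j][j2] * V[(xn, i2, j2)]
                                       for i2 in range(nc) for j2 in range(ns))
                            val += pd * (stage + cont)
                        best = min(best, val)
                    Vn[(x, i, j)] = best
        V = Vn
    return V

# Degenerate check: one cost level, one regime, demand always 1 unit.
grid = list(range(-3, 4))
V0 = observable_regime_dp(T=1, grid=grid, costs=[1.0], Pc=[[1.0]],
                          Ps=[[1.0]], dist=[[0.0, 1.0]], h=1.0, p=5.0, cT=0.0)
```

One backward pass over this state space yields the observable-demand-distribution lower bound directly, with no simulation, which is consistent with the small solve time reported above.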
Thus, in these examples, observing the demand distribution is quite valuable and, with no penalty, the corresponding bounds are quite weak.

3.6. Improving the Heuristic Policies and Bounds

We now consider the use of the dual results to identify better policies and bounds when the gaps are relatively large. We will focus on the cases with the "downward, slow" and "downward, fast" transition matrices. In these cases, demand may initially be high (with mean 16), but it may drop to medium (with mean 9) or low (with mean 1) this period, and when demand drops, it will not increase again. Comparing the order-up-to quantities (the y_t's) selected by the myopic policy with those selected in the corresponding dual bound, we find that the dual problem takes advantage of the perfect information to reduce the order in the period when demand drops to the low demand state, thereby avoiding the cost of carrying excess inventory when the system enters the low state. It appears that the myopic policies order too much when the system is not in the low demand state and the dual penalties do not appropriately punish the DM in the dual problem for taking advantage of the perfect information about demand.

To understand why this is the case, note that the terminal value used in determining myopic policies and used as the generating function for the myopic dual bound, J^{−1}_t(x_t, c_t, π_t) = −c_t x_t, implicitly assumes that leftover inventory substitutes for future purchases. One way to perhaps improve the policies and bounds is to use terminal values based on a model that assumes the demand distribution θ_t is observable. Specifically, we take the limited-look-ahead terminal value J^{−1}_t(x_t, c_t, π_t) to be E_{π_t}[J^o_t(x_t, c_t, θ_t)], where J^o_t is the value function for a Markov DP that assumes the demand distribution is observed in each period; this model was used to calculate the observable demand distribution bounds described in §3.5.
As is evident in Figure 1, these observable demand value functions are not very good approximations of the true value functions (they greatly underestimate costs), but they are easy to compute and, unlike the original terminal values, they include the holding costs associated with having excess inventory in a low demand state. This modification leads to dramatic improvements for the cases with the "downward, slow" and "downward, fast" transition matrices, with little additional work. For example, in the case with the "downward, slow" transition matrix, high backorder costs, and a high prior distribution, the myopic bounds with the modified terminal values were $107.0 and $107.5, as compared to $92 and $111 for the myopic bounds with the original terminal values; the run times were 7.7 and 7.4 seconds, respectively. (These results are for the perfect information relaxation and a simulation of 1,000 trials.) The myopic bounds for the other cases with "downward, slow" and "downward, fast" transition matrices are also much improved. In these cases, these modified myopic policies not only outperform the original myopic policies, they also outperform the significantly more complex one- and two-period look-ahead policies based on the original terminal values. (See Appendix C for detailed results for all cases.)

Although this modification of the myopic policies greatly improves the results for the cases with the "downward, slow" and "downward, fast" transition matrices, the modified myopic policies perform worse than the original myopic policies in some other cases, where the original myopic policies performed quite well. In all, comparing across the 84 different sets of parameters, we find that we can get within 2% of the optimal costs (and typically closer) using one of these two myopic policies. Thus, by


Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Inventory Control of Perishable Items in a Two-Echelon Supply Chain Journal of Indusrial Engineering, Universiy of ehran, Special Issue,, PP. 69-77 69 Invenory Conrol of Perishable Iems in a wo-echelon Supply Chain Fariborz Jolai *, Elmira Gheisariha and Farnaz Nojavan

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Excel-Based Solution Method For The Optimal Policy Of The Hadley And Whittin s Exact Model With Arma Demand

Excel-Based Solution Method For The Optimal Policy Of The Hadley And Whittin s Exact Model With Arma Demand Excel-Based Soluion Mehod For The Opimal Policy Of The Hadley And Whiin s Exac Model Wih Arma Demand Kal Nami School of Business and Economics Winson Salem Sae Universiy Winson Salem, NC 27110 Phone: (336)750-2338

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs PROC. IEEE CONFERENCE ON DECISION AND CONTROL, 06 A Primal-Dual Type Algorihm wih he O(/) Convergence Rae for Large Scale Consrained Convex Programs Hao Yu and Michael J. Neely Absrac This paper considers

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

Comparing Means: t-tests for One Sample & Two Related Samples

Comparing Means: t-tests for One Sample & Two Related Samples Comparing Means: -Tess for One Sample & Two Relaed Samples Using he z-tes: Assumpions -Tess for One Sample & Two Relaed Samples The z-es (of a sample mean agains a populaion mean) is based on he assumpion

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Stochastic Perishable Inventory Systems: Dual-Balancing and Look-Ahead Approaches

Stochastic Perishable Inventory Systems: Dual-Balancing and Look-Ahead Approaches Sochasic Perishable Invenory Sysems: Dual-Balancing and Look-Ahead Approaches by Yuhe Diao A hesis presened o he Universiy Of Waerloo in fulfilmen of he hesis requiremen for he degree of Maser of Applied

More information

The Arcsine Distribution

The Arcsine Distribution The Arcsine Disribuion Chris H. Rycrof Ocober 6, 006 A common heme of he class has been ha he saisics of single walker are ofen very differen from hose of an ensemble of walkers. On he firs homework, we

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

13.3 Term structure models

13.3 Term structure models 13.3 Term srucure models 13.3.1 Expecaions hypohesis model - Simples "model" a) shor rae b) expecaions o ge oher prices Resul: y () = 1 h +1 δ = φ( δ)+ε +1 f () = E (y +1) (1) =δ + φ( δ) f (3) = E (y +)

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

The expectation value of the field operator.

The expectation value of the field operator. The expecaion value of he field operaor. Dan Solomon Universiy of Illinois Chicago, IL dsolom@uic.edu June, 04 Absrac. Much of he mahemaical developmen of quanum field heory has been in suppor of deermining

More information

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance.

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance. 1 An Inroducion o Backward Sochasic Differenial Equaions (BSDEs) PIMS Summer School 2016 in Mahemaical Finance June 25, 2016 Chrisoph Frei cfrei@ualbera.ca This inroducion is based on Touzi [14], Bouchard

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Cash Flow Valuation Mode Lin Discrete Time

Cash Flow Valuation Mode Lin Discrete Time IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728,p-ISSN: 2319-765X, 6, Issue 6 (May. - Jun. 2013), PP 35-41 Cash Flow Valuaion Mode Lin Discree Time Olayiwola. M. A. and Oni, N. O. Deparmen of Mahemaics

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach 1 Decenralized Sochasic Conrol wih Parial Hisory Sharing: A Common Informaion Approach Ashuosh Nayyar, Adiya Mahajan and Demoshenis Tenekezis arxiv:1209.1695v1 [cs.sy] 8 Sep 2012 Absrac A general model

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS Exam: ECON4325 Moneary Policy Dae of exam: Tuesday, May 24, 206 Grades are given: June 4, 206 Time for exam: 2.30 p.m. 5.30 p.m. The problem se covers 5 pages

More information

Phys1112: DC and RC circuits

Phys1112: DC and RC circuits Name: Group Members: Dae: TA s Name: Phys1112: DC and RC circuis Objecives: 1. To undersand curren and volage characerisics of a DC RC discharging circui. 2. To undersand he effec of he RC ime consan.

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :37 PM, 1/11/018 Copyrigh 018 Richard T. Woodward 1. An inroducion o dynamic opimiaion -- Opimal Conrol and Dynamic Programming AGEC 64-018 I. Overview of opimiaion Opimiaion

More information

Solutions Problem Set 3 Macro II (14.452)

Solutions Problem Set 3 Macro II (14.452) Soluions Problem Se 3 Macro II (14.452) Francisco A. Gallego 04/27/2005 1 Q heory of invesmen in coninuous ime and no uncerainy Consider he in nie horizon model of a rm facing adjusmen coss o invesmen.

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

A Hop Constrained Min-Sum Arborescence with Outage Costs

A Hop Constrained Min-Sum Arborescence with Outage Costs A Hop Consrained Min-Sum Arborescence wih Ouage Coss Rakesh Kawara Minnesoa Sae Universiy, Mankao, MN 56001 Email: Kawara@mnsu.edu Absrac The hop consrained min-sum arborescence wih ouage coss problem

More information

Problem 1 / 25 Problem 2 / 20 Problem 3 / 10 Problem 4 / 15 Problem 5 / 30 TOTAL / 100

Problem 1 / 25 Problem 2 / 20 Problem 3 / 10 Problem 4 / 15 Problem 5 / 30 TOTAL / 100 eparmen of Applied Economics Johns Hopkins Universiy Economics 602 Macroeconomic Theory and Policy Miderm Exam Suggesed Soluions Professor Sanjay hugh Fall 2008 NAME: The Exam has a oal of five (5) problems

More information

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 175 CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 10.1 INTRODUCTION Amongs he research work performed, he bes resuls of experimenal work are validaed wih Arificial Neural Nework. From he

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Economic Growth & Development: Part 4 Vertical Innovation Models. By Kiminori Matsuyama. Updated on , 11:01:54 AM

Economic Growth & Development: Part 4 Vertical Innovation Models. By Kiminori Matsuyama. Updated on , 11:01:54 AM Economic Growh & Developmen: Par 4 Verical Innovaion Models By Kiminori Masuyama Updaed on 20-04-4 :0:54 AM Page of 7 Inroducion In he previous models R&D develops producs ha are new ie imperfec subsiues

More information

Appendix 14.1 The optimal control problem and its solution using

Appendix 14.1 The optimal control problem and its solution using 1 Appendix 14.1 he opimal conrol problem and is soluion using he maximum principle NOE: Many occurrences of f, x, u, and in his file (in equaions or as whole words in ex) are purposefully in bold in order

More information