Optimal approximate dynamic programming algorithms for a general class of storage problems


Juliana M. Nascimento
Warren B. Powell

Department of Operations Research and Financial Engineering, Princeton University

July 24, 2008

Abstract

There are many applications which involve the dynamic control of a scalar quantity (water, money, vaccines) in the presence of a vector of exogenous parameters that influence the right amount of the resource that should be held. These problems can be formulated as dynamic programs, but the resulting vector-valued state variable makes them computationally intractable. Approximate dynamic programming is promising, but optimal algorithms require that states be visited infinitely often, which has the effect of requiring that we explore all the states. We prove convergence of an algorithm that uses pure exploitation, eliminating any need to explicitly visit all the states.

1 Introduction

We consider the problem of managing a single resource class (vaccines, money, water, commodities, inventory of a product) in the presence of multiple parameters that evolve exogenously (disease, prices, interest rates, climate, technology, demands). Let R_t be the scalar quantity of resource on hand, and let W_t be a vector-valued Markov process describing the relevant exogenous parameters, which means that S_t = (R_t, W_t) is the state of our system. x_t is a potentially vector-valued control. x_t might describe the amount of product we purchase from different suppliers, and the amount we sell to different types of customers. An optimal policy is described by Bellman's equation

    V_t(S_t) = max_{x_t ∈ X_t} ( C_t(S_t, x_t) + γ E{ V_{t+1}(S_{t+1}) | S_t } )   (1)

where S_{t+1} = f(S_t, x_t, W_{t+1}). We are going to assume that R_t and W_t are discrete. Although S_t may only have four or five dimensions, this is often enough to render (1) computationally intractable. Aside from the challenge of enumerating the state space, computing the expectation may also be computationally difficult.

There are many applications of this problem class. One nice example involves determining the hourly flows of different types of raw energy resources (coal, nuclear, ethanol, solar, windmills and hydroelectric power) to serve different types of demand (light and heavy industrial, residential and different types of transportation). Assume that the only form of storage is the aggregate amount of water in water reservoirs. The vector of decisions involves determining how much of the different types of energy resources to convert to serve the different types of demand. The only decision that links different time periods is the amount of water held in reservoirs (this is the scalar resource variable). Prices and rainfall evolve exogenously according to a Markov process (this is the vector-valued exogenous state variable).

We propose a Monte Carlo-based algorithm based on the principles of approximate dynamic programming with a lookup-table value function (since the state is discrete).
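To make the intractability concrete, the following is a minimal sketch (all names, sizes, and the deterministic toy transition are invented for illustration, not taken from the paper) of solving equation (1) exactly by backward induction on a small discretized state S_t = (R_t, W_t). Even a three-dimensional exogenous vector multiplies the state count quickly.

```python
# Hypothetical miniature of equation (1): exact backward dynamic programming
# over a discretized state S_t = (R_t, W_t). All names, sizes, and the
# contribution/transition functions are illustrative stand-ins.
import itertools

T = 3                      # planning horizon
R_levels = range(6)        # scalar resource levels R_t
W_levels = list(itertools.product(range(4), repeat=3))  # 3-dim exogenous W_t
gamma = 0.95

# Placeholder one-period contribution and (deterministic, for brevity) transition;
# the real model has W_t evolving as a Markov process and an expectation in (1).
def contribution(R, W, x):
    return min(x, R) * (1 + sum(W)) - 0.1 * x

def next_R(R, x):
    return max(R - x, 0)

V = {(R, W): 0.0 for R in R_levels for W in W_levels}  # V_T(S) = 0
for t in reversed(range(T)):
    V = {
        (R, W): max(
            contribution(R, W, x) + gamma * V[(next_R(R, x), W)]
            for x in range(R + 1)
        )
        for R in R_levels for W in W_levels
    }

# Even this toy sweep touches |R| * |W| = 6 * 64 = 384 states per period;
# each extra exogenous dimension multiplies the count again.
print(len(R_levels) * len(W_levels))  # 384
```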
We exploit the common property that C_t(S_t, x_t) is concave in the resource dimension R_t, reflecting the property that additional resources contribute diminishing marginal returns. Approximate dynamic programming makes decisions by solving problems of the form

    v̂^n_t = max_{x_t ∈ X^n_t} ( C_t(S^n_t, x_t) + γ E{ V̄^{n-1}_{t+1}(S_{t+1}) | S^n_t } )   (2)

where V̄^{n-1}_t(s) is an approximation of the value function at state s, and where X^n_t is the feasible region, which may depend on S^n_t. Let x^n_t be the value of x_t that solves (2). We use a lookup-table representation of the value function, which means that we perform updates using recursions of the form

    V̄^n_t(S^n_t) = (1 − α_{n-1}) V̄^{n-1}_t(S^n_t) + α_{n-1} v̂^n_t.

The power of lookup tables is their generality, since they make no assumptions about the structure of the value function. However, virtually the entire literature that uses lookup tables also assumes small action spaces that make it possible to search over all possible decisions (see Sutton & Barto (1998)). In our work, x_t is a vector, which generally forces us to use some type of math programming algorithm. The structure of our problem makes this possible.

In our algorithm, S^n_{t+1} = f(S^n_t, x^n_t, W_{t+1}(ω^n)), which means that the next state is chosen greedily. We do not require any form of state exploration, a common strategy to obtain a proof of convergence which does not necessarily provide a practical algorithm (see Bertsekas & Tsitsiklis (1996) and Powell (2007) for complete discussions of these issues). But we do introduce a step that maintains concavity in the resource dimension. We prove that the algorithm converges almost surely to an optimal policy. Our proof technique combines the convergence proof for a monotone mapping in Bertsekas & Tsitsiklis (1996) with the proof of the SPAR algorithm in Powell et al. (2004). Current proofs of convergence for approximate dynamic programming algorithms such as Q-learning (Tsitsiklis (1994), Jaakkola et al. (1994)) and optimistic policy iteration (Tsitsiklis (2002)) require that we visit states (and possibly actions) infinitely often. A convergence proof for a Real Time Dynamic Programming (Barto et al. (1995)) algorithm that considers a pure exploitation scheme is provided in Bertsekas & Tsitsiklis (1996) [Prop.
5.3 and 5.4], but it assumes that expected values can be computed and the initial approximations are optimistic.
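The lookup-table smoothing recursion above can be sketched in a few lines. This is a toy illustration with invented numbers: a single fixed state, synthetic noisy observations, and a harmonic stepsize (one of many valid choices satisfying the usual stepsize conditions).

```python
# Minimal sketch of the lookup-table smoothing recursion
# Vbar^n(S) = (1 - alpha_{n-1}) * Vbar^{n-1}(S) + alpha_{n-1} * vhat^n,
# at one fixed state S, using a harmonic stepsize; vhat is synthetic noise
# around a made-up true value.
import random

random.seed(7)
true_value = 10.0
vbar = 0.0           # initial approximation at state S
visits = 0

for n in range(1, 2001):
    visits += 1
    alpha = 1.0 / visits                    # stepsize 1/N(S): plain averaging
    vhat = true_value + random.gauss(0, 1)  # noisy observation of the value
    vbar = (1 - alpha) * vbar + alpha * vhat

# With alpha = 1/N, vbar is exactly the sample mean, so it ends up close to 10.
print(vbar)
```

With a state-dependent stepsize like this one, the recursion is an averaging scheme, which is why conditions such as (11)-(12) later in the paper take the form they do.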

We make no such assumptions, but it is important to emphasize that our result depends on the concavity of the optimal value functions. Other approaches to deal with our problem class would be different flavors of Benders decomposition (Van Slyke & Wets (1969), Higle & Sen (1991), Chen & Powell (1999)) and sample average approximation (Shapiro (2003)) (SAA). However, Benders decomposition will not handle an arbitrary type of exogenous information. On the other hand, SAA relies on generating random samples outside of the optimization problems and then solving the corresponding deterministic problems using an appropriate optimization algorithm. Numerical experiments with the SAA approach applied to problems where an integer solution is required can be found in Ahmed & Shapiro (2002).

We begin in section 2 with an illustration of a storage problem. Section 3 gives the optimality equations and describes some properties of the value function. Section 4 describes the algorithm in detail. Section 5 gives the convergence proof, which is the heart of the paper. Section 6 concludes the paper.

2 A storage model

In this section we illustrate a storage problem for an application where product can be purchased from multiple suppliers, sold to multiple customers, and which must be managed in the presence of multiple types of exogenous parameters (such as weather that might affect the demand for the product, or prices of competing products). This model is only an illustration, since the central result of the paper is a convergence proof for a general problem class.

We assume that the exogenous information process W_t is Markov, and that it includes both the demand for the stored asset as well as any other exogenous information. The demand may be vector-valued, representing different customer types, which we represent using D_t = (D_{t1}, ..., D_{t,B^d}), where B^d is the number of different sources of demand. The demand does not need to be fully satisfied, but no backlogging is allowed. We emphasize that different types of information can also be contained in W_t. Examples include rates of return,

interest rates, weather conditions, temperatures and competitive response. We assume the demand vector is integer valued. We also assume the information vector W_t is an element of a set W, which has finite support.

We denote by R_t the scalar storage level right before a decision is taken. We also define S_t = (W_t, R_t) as the pre-decision state, where we use S_t and (W_t, R_t) interchangeably. The decision x_t = (x^d_t, x^s_t, x^r_t) has three components. The first component, x^d_t, is a vector that represents how much to sell to satisfy the demands D_t. The second component, x^s_t, is also a vector that represents how much to buy from B^s different suppliers to replenish the asset level. Finally, x^r_t is a scalar that denotes the quantity that should be transferred (discarded), in case the storage level is too high, even after demand has been satisfied. The feasible set, which is dependent on the current state (W_t, R_t), is given by

    X_t(W_t, R_t) = { x^d_t ∈ R^{B^d}, x^s_t ∈ R^{B^s}, x^r_t ∈ R :
        Σ_{i=1}^{B^d} x^d_{ti} + x^r_t ≤ R_t,
        0 ≤ x^d_{ti} ≤ D_{ti}, i = 1, ..., B^d,
        0 ≤ x^s_{ti} ≤ M_i, i = 1, ..., B^s,
        0 ≤ x^s_{ti} ≤ u^s_t(W_t), i = 1, ..., B^s,
        0 ≤ x^r_t ≤ u^r_t(W_t) },

where u^s_t(·) and u^r_t(·) are nonnegative integer valued functions and M_i is a deterministic integer bound. The first constraint indicates that backlogging is not allowed, the second indicates that we do not sell more than is demanded and the third indicates that each supplier can only offer a limited amount of assets. The last two constraints impose upper bounds that vary with the information vector W_t. The contribution generated by the decision is given by

    C_t(W_t, R_t, x_t) = − Σ_{i=1}^{B^d} c^u_i(W_t)(D_{ti} − x^d_{ti}) − c^o(W_t)( R_t − Σ_{i=1}^{B^d} x^d_{ti} ) − c^r(W_t) x^r_t − Σ_{i=1}^{B^s} c^s_i(W_t) x^s_{ti} + Σ_{i=1}^{B^d} c^d_i(W_t) x^d_{ti},

where c^u_i(·), c^o(·), c^r(·), c^s_i(·) and c^d_i(·) are nonnegative scalar functions. The first three terms represent underage, overage and transfer costs, respectively. The fourth term represents the buying cost and the last one represents the money that is made satisfying demand.

The decision takes the system to a new storage level, denoted by the scalar R^x_t, given by

    R^x_t = f^x(R_t, x_t) = R_t − Σ_{i=1}^{B^d} x^d_{ti} + Σ_{i=1}^{B^s} x^s_{ti} − x^r_t,

where f^x(R_t, x_t) is our transition function that returns the post-decision resource state. In the remainder of the paper, we use f^x(R_t, x_t) without assuming any specific structure. We define S^x_t = (W_t, R^x_t) as the post-decision state, which is the state immediately after we have made a decision but before any new information has arrived. As with the pre-decision state, we use S^x_t and (W_t, R^x_t) interchangeably. New information W_{t+1} becomes available at time period t+1 and the storage level evolves to R_{t+1} = R^x_t + R̂_{t+1}, where R̂_{t+1} is a nonnegative integer valued function that represents exogenous changes in the storage level (such as rainfall into a reservoir, or hurricane damage to power generation capacity). Note that in this illustration, the decision x_t impacts the asset level in a linear way. Moreover, it does not influence in any way the information vector W_{t+1}. For completeness, we denote by R_0 the initial asset level, which is assumed to be a nonnegative integer. Given our assumptions, we have that the pre/post asset levels are nonnegative and bounded from above. We let B^pre and B^pos be positive integers that are upper bounds for R_t and R^x_t, respectively.

3 The Optimal Value Functions

We define, recursively, the optimal value functions associated with the storage class. We denote by V_t(W_t, R_t) the optimal value function around the pre-decision state (W_t, R_t) and

by V^x_t(W_t, R^x_t) the optimal value function around the post-decision state (W_t, R^x_t). At time t = T, since it is the end of the planning horizon, the value of being in any state (W_T, R^x_T) is zero. Hence, V^x_T(W_T, R^x_T) = 0. At time t−1, for t = T, ..., 1, the value of being in any post-decision state (W_{t−1}, R^x_{t−1}) does not involve the solution of a deterministic optimization problem; it only involves an expectation, since the next pre-decision state (W_t, R_t) only depends on the information that first becomes available at t. On the other hand, the value of being in any pre-decision state (W_t, R_t) does not involve expectations, since the next post-decision state (W_t, R^x_t) is a deterministic function of W_t; it only requires the solution of an optimization problem. Therefore,

    V^x_{t−1}(W_{t−1}, R^x_{t−1}) = E[ V_t(W_t, R_t) | (W_{t−1}, R^x_{t−1}) ],   (3)
    V_t(W_t, R_t) = max_{x_t ∈ X_t(W_t, R_t)} C_t(W_t, R_t, x_t) + γ V^x_t(W_t, R^x_t).   (4)

In the remainder of the paper, we only use the value function V^x_t(S^x_t) defined around the post-decision state variable, since it allows us to make decisions by solving a deterministic problem as in (4). We show that V^x_t(W_t, ·) is concave and piecewise linear with break points R = 1, ..., B^pos. This structural property combined with the optimization/expectation inversion is the foundation of our algorithmic strategy and its proof of convergence.

For W ∈ W, let v_t(W) = (v_t(W, 1), ..., v_t(W, B^pos)) be a vector representing the slopes of a function V_t(W, ·) : [0, ∞) → R that is concave and piecewise linear with breakpoints r = 1, ..., B^pos. It is easy to see that the function F_t(v_t(W), W, ·), where

    F_t(v_t(W), W, R) = max_{x,y} C_t(W, R, x) + γ Σ_{r=1}^{B^pos} v_t(W, r) y_r
        subject to x ∈ X_t(W, R),
        Σ_{r=1}^{B^pos} y_r = f^x(R, x),
        0 ≤ y ≤ 1,

is also concave and piecewise linear with breakpoints r = 1, ..., B^pos, since the demand vector D_t and the upper bounds u^s_t(·), u^r_t(·) and M_i in the constraint set X_t(W, R) are all integer valued. Moreover, the optimal solution (x*, y*) to the linear programming problem that defines F_t(v_t(W), W, R) does not depend on V_t(W, 0) and is an integer vector

whenever the argument R is integer. We also have that F_t(v_t(W), W, R) is bounded for all W ∈ W. We use F_t to prove the following proposition about the optimal value function.

Proposition 1. For t = 0, ..., T and information vector W ∈ W, the optimal value function V^x_t(W, ·) is concave and piecewise linear with breakpoints R = 1, ..., B^pos. We denote its slopes by v_t(W) = (v_t(W, 1), ..., v_t(W, B^pos)), where, for R = 1, ..., B^pos and t < T, v_t(W, R) is given by

    v_t(W, R) = V^x_t(W, R) − V^x_t(W, R−1)
              = E[ F_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R + R̂_{t+1}(W_{t+1})) − F_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R − 1 + R̂_{t+1}(W_{t+1})) | W ].   (5)

Proof. The proof is by backward induction on t. The base case t = T holds as V^x_T(W_T, ·) is equal to zero for all W_T ∈ W_T. For t < T the proof is obtained noting that

    V^x_t(W_t, R^x_t) = E[ γ V^x_{t+1}(W_{t+1}, 0) + F_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R^x_t + R̂_{t+1}(W_{t+1})) | W_t ].

Due to the concavity of V^x_t(W, ·), the slope vector v_t(W) is monotone decreasing, that is, v_t(W, R) ≥ v_t(W, R+1). Moreover, throughout the paper, we work with the translated version V_t(W, ·) of V^x_t(W, ·) given by V_t(W, R+y) = Σ_{r=1}^R v_t(W, r) + y v_t(W, R+1), where R is a nonnegative integer and 0 ≤ y ≤ 1, since the optimal solution (x*, y*) associated with F_t(v_{t+1}(W_{t+1}), W_{t+1}, R) does not depend on V^x_t(W, 0).

We next introduce the dynamic programming operator H associated with the storage class. We define H using the slopes of piecewise linear functions instead of the functions themselves. Let v = {v_t(W) for t = 0, ..., T, W ∈ W} be a set of slope vectors, where v_t(W) = (v_t(W, 1), ..., v_t(W, B^pos)). The dynamic programming operator H associated with the storage class maps a set of slope vectors v into a new set Hv as follows. For t = 0, ..., T−1,

W ∈ W and R = 1, ..., B^pos,

    (Hv)_t(W, R) = E[ F_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R + R̂_{t+1}(W_{t+1})) − F_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R − 1 + R̂_{t+1}(W_{t+1})) | W ].   (6)

A well known property of dynamic programming operators is that the optimal value functions are their unique fixed point (see, for example, Puterman (1994), Theorem 6.1.1). Therefore, the set of slopes v corresponding to the optimal value functions V_t(W, R) for t = 0, ..., T, W ∈ W and R = 1, ..., B^pos, is the unique fixed point of H. Moreover, the dynamic programming operator H defined by (6) is assumed to satisfy the following conditions. Let ṽ = {ṽ_t(W) for t = 0, ..., T, W ∈ W} and ṽ' = {ṽ'_t(W) for t = 0, ..., T, W ∈ W} be sets of slope vectors such that ṽ_t(W) = (ṽ_t(W, 1), ..., ṽ_t(W, B^pos)) and ṽ'_t(W) = (ṽ'_t(W, 1), ..., ṽ'_t(W, B^pos)) are monotone decreasing and ṽ_t(W) ≤ ṽ'_t(W). Then, for W ∈ W and R = 1, ..., B^pos:

    (Hṽ)_t(W) is monotone decreasing,   (7)
    (Hṽ)_t(W, R) ≤ (Hṽ')_t(W, R),   (8)
    (Hṽ)_t(W, R) − ηe ≤ (H(ṽ − ηe))_t(W, R) ≤ (H(ṽ + ηe))_t(W, R) ≤ (Hṽ)_t(W, R) + ηe,   (9)

where η is a positive integer and e is a vector of ones. The conditions in equations (7) and (9) imply that the mapping H is continuous (see the discussion in Bertsekas & Tsitsiklis (1996), page 158). The dynamic programming operator H and the associated conditions (7)-(9) are used later on to construct deterministic sequences that are provably convergent to the optimal slopes.

4 The SPAR-Storage Algorithm

We propose a pure exploitation algorithm, namely the SPAR-Storage Algorithm, that provably learns the optimal decisions to be taken at parts of the state space that can be reached by an optimal policy, which are determined by the algorithm itself. This is accomplished

by learning the slopes of the optimal value functions at important parts of the state space, through the construction of value function approximations V̄^n_t(W, ·) that are concave and piecewise linear with breakpoints R = 1, ..., B^pos. The approximation is represented by its slopes v̄^n_t(W) = (v̄^n_t(W, 1), ..., v̄^n_t(W, B^pos)). Figure 1 illustrates the idea. In order to do so, the algorithm combines Monte Carlo simulation in a pure exploitation scheme and stochastic approximation integrated with a projection operation.

Figure 1: Optimal value function V_t(W, ·) (unknown) and the constructed approximation V̄^n_t(W, ·), shown with their slopes v_t(W, R) and v̄^n_t(W, R) over the asset dimension, highlighting the important region.

Figure 2 describes the SPAR-Storage algorithm. The algorithm requires an initial concave piecewise linear value function approximation V̄^0_t(W, ·), represented by its slopes v̄^0_t(W) = (v̄^0_t(W, 1), ..., v̄^0_t(W, B^pos)), for each information vector W ∈ W. Therefore the initial slope vector v̄^0_t(W) has to be monotone decreasing. For example, it is valid to set all the initial slopes equal to zero. For completeness, since we know that the optimal value function at the end of the horizon is equal to zero, we set v̄^n_T(W_T, R) = 0 for all iterations n, information vectors W_T ∈ W_T and asset levels R = 1, ..., B^pos. The algorithm also requires an initial asset level to be used in all iterations. Thus, for all n ≥ 0, R^{x,n}_{−1} is set to be a nonnegative integer, as in STEP 0b.

At the beginning of each iteration n, the algorithm observes a sample realization of the information sequence W^n_0, ..., W^n_T, as in STEP 1. The sample can be obtained from a sample generator or actual data. After that, the algorithm goes over time periods t = 0, ..., T. First, the pre-decision asset level R^n_t is computed, as in STEP 2. Then, the decision x^n_t, which is optimal with respect to the current pre-decision state (W^n_t, R^n_t) and value function approximation V̄^{n-1}_t(W^n_t, ·), is taken, as stated in STEP 3.
We have that V̄^n_t(W, R+y) = Σ_{r=1}^R v̄^n_t(W, r) + y v̄^n_t(W, R+1), where R is a nonnegative integer and 0 ≤ y ≤ 1.

STEP 0: Algorithm initialization:
  STEP 0a: Initialize v̄^0_t(W) for t = 1, ..., T−1 and W ∈ W, monotone decreasing.
  STEP 0b: Set R^{x,n}_{−1} = r, a nonnegative integer, for all n ≥ 0.
  STEP 0c: Set n = 1.
STEP 1: Sample/observe the information sequence W^n_0, ..., W^n_T. Do for t = 0, ..., T:
  STEP 2: Compute the pre-decision asset level: R^n_t = R^{x,n}_{t−1} + R̂_t(W^n_t).
  STEP 3: Find the optimal solution x^n_t of
      max_{x_t ∈ X_t(W^n_t, R^n_t)} C_t(W^n_t, R^n_t, x_t) + γ V̄^{n-1}_t(W^n_t, f^x(R^n_t, x_t)).
  STEP 4: Compute the post-decision asset level: R^{x,n}_t = f^x(R^n_t, x^n_t).
  STEP 5: Update slopes: if t < T then
    STEP 5a: Observe v̂^n_{t+1}(R^{x,n}_t) and v̂^n_{t+1}(R^{x,n}_t + 1). See (10).
    STEP 5b: For W ∈ W and R = 1, ..., B^pos:
        z^n_t(W, R) = (1 − ᾱ^n_t(W, R)) v̄^{n-1}_t(W, R) + ᾱ^n_t(W, R) v̂^n_{t+1}(R).
    STEP 5c: Perform the projection operation v̄^n_t = Π_C(z^n_t). See (14).
STEP 6: Increase n by one and go to STEP 1.

Figure 2: SPAR-Storage Algorithm

Next, taking into account the decision, the algorithm computes the post-decision asset level R^{x,n}_t, as in STEP 4. Time period t is concluded by updating the slopes of the value function approximation. Steps 5a-5c describe the procedure and figure 3 illustrates it. Sample slopes relative to the post-decision states (W^n_t, R^{x,n}_t) and (W^n_t, R^{x,n}_t + 1) are observed, see STEP 5a and figure 3a.
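The forward pass of one iteration (STEPS 1-4) can be sketched as follows. Everything here is a toy stand-in: the information process, contribution, and exogenous inflow are invented, and STEP 3 is solved by brute-force search over a scalar sell quantity instead of the linear program the paper assumes.

```python
# Schematic single iteration (fixed n) of the SPAR-Storage forward pass.
# All model ingredients below are illustrative, not from the paper.
import random

random.seed(0)
T, B_pos, gamma = 4, 5, 0.9

def sample_information():          # STEP 1: observe W_0^n, ..., W_T^n
    return [random.choice([0, 1]) for _ in range(T + 1)]

def contribution(W, R, x):         # toy C_t(W_t, R_t, x_t): sell x at price 1+W
    return (1 + W) * x

def vbar_eval(slope_vec, R):       # Vbar(W, R) = sum of the first R slopes
    return sum(slope_vec[:R])

# slopes[t][W] holds (vbar(W,1), ..., vbar(W,B_pos)); all zero is a valid
# monotone-decreasing initialization (STEP 0a)
slopes = {t: {W: [0.0] * B_pos for W in (0, 1)} for t in range(T + 1)}
R_post = 2                         # STEP 0b: initial post-decision level

W_seq = sample_information()
for t in range(T):
    Rhat = random.choice([0, 1])               # exogenous inflow R̂_t
    R_pre = min(R_post + Rhat, B_pos)          # STEP 2
    W = W_seq[t]
    # STEP 3: greedy decision by enumeration (x = amount sold, x <= R_pre)
    best_x = max(
        range(R_pre + 1),
        key=lambda x: contribution(W, R_pre, x)
        + gamma * vbar_eval(slopes[t][W], R_pre - x),
    )
    R_post = R_pre - best_x                    # STEP 4
    print(t, W, R_pre, best_x)                 # trace of the forward pass
# (STEP 5, the slope update and projection, is sketched separately below.)
```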

After that, these samples are used to update the approximation slopes v̄^{n-1}_t(W^n_t), through a temporary slope vector z^n_t(W^n_t). This procedure requires the use of a stepsize rule that is state dependent, denoted by ᾱ^n_t(W, R), and it may lead to a violation of the property that the slopes are monotonically decreasing, see STEP 5b and figure 3b. Thus, a projection operation Π_C is performed to restore the property and updated slopes v̄^n_t(W^n_t) are obtained, see STEP 5c and figure 3c. After the end of the planning horizon T is reached, the iteration counter is incremented, as in STEP 6, and a new iteration is started from STEP 1.

We have that x^n_t is easily computed, since it is the first component of the optimal solution to the linear programming problem in the definition of F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t). Moreover, given our assumptions and the properties of F_t discussed in the previous section, it is clear that R^n_t, x^n_t and R^{x,n}_t are all integer valued. We also know that they are bounded. Therefore, the sequences of decisions, pre and post decision states generated by the algorithm, which are denoted by {x^n_t}_{n≥0}, {S^n_t}_{n≥0} = {(W^n_t, R^n_t)}_{n≥0} and {S^{x,n}_t}_{n≥0} = {(W^n_t, R^{x,n}_t)}_{n≥0}, respectively, have at least one accumulation point. Since these are sequences of random variables, their accumulation points, denoted by x_t, S_t and S^x_t, respectively, are also random variables.

The sample slopes used to update the approximation slopes are obtained by replacing the expectation and the slopes v_{t+1}(W_{t+1}) of the optimal value function in (5) by a sample realization of the information W^n_{t+1} and the current slope approximation v̄^{n-1}_{t+1}(W^n_{t+1}), respectively. Thus, for t = 0, ..., T−1, the sample slope is given by

    v̂^n_{t+1}(R) = F_{t+1}(v̄^{n-1}_{t+1}(W^n_{t+1}), W^n_{t+1}, R + R̂_{t+1}(W^n_{t+1})) − F_{t+1}(v̄^{n-1}_{t+1}(W^n_{t+1}), W^n_{t+1}, R − 1 + R̂_{t+1}(W^n_{t+1})).   (10)

The update procedure is then divided into two parts. First, a temporary set of slope vectors z^n_t = {z^n_t(W, R) : W ∈ W, R = 1, ..., B^pos} is produced combining the current approximation and the sample slopes using the stepsize rule ᾱ^n_t(W, R).
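The two-part update (STEPS 5b-5c) can be sketched for one information vector W = W^n_t: smooth the two sampled slopes into the vector, then restore monotonicity with the level projection of (14), which overwrites violating neighbors with the newly updated values. The numbers and the stepsize below are illustrative.

```python
# Sketch of STEPS 5b-5c at one information vector W = W_t^n.
# The level projection forces slopes left of R_post up to the updated value
# and slopes right of R_post + 1 down to it, so the vector is monotone
# decreasing (concavity restored). Inputs are invented for illustration.

def update_and_project(vbar, R_post, vhat_lo, vhat_hi, alpha):
    """vbar: slopes (vbar(W,1), ..., vbar(W,B_pos)); index R-1 holds slope R."""
    z = list(vbar)                                               # temporary z^n
    z[R_post - 1] = (1 - alpha) * z[R_post - 1] + alpha * vhat_lo  # STEP 5b
    z[R_post] = (1 - alpha) * z[R_post] + alpha * vhat_hi
    # STEP 5c, level projection (14): only violating slopes are overwritten
    for R in range(R_post - 1):              # 1-indexed R < R_post
        z[R] = max(z[R], z[R_post - 1])
    for R in range(R_post + 1, len(z)):      # 1-indexed R > R_post + 1
        z[R] = min(z[R], z[R_post])
    return z

v = update_and_project([9.0, 7.0, 5.0, 3.0, 1.0], R_post=2,
                       vhat_lo=2.0, vhat_hi=0.0, alpha=0.5)
print(v)  # [9.0, 4.5, 2.5, 2.5, 1.0] -- monotone decreasing again
assert all(v[i] >= v[i + 1] for i in range(len(v) - 1))
```

Note that the projection touches only slopes that violate monotonicity relative to the two freshly updated entries; all other slopes are left unchanged, which is what makes the operation a "level" projection.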
We have that

    ᾱ^n_t(W, R) = α^n_t 1_{W = W^n_t} ( 1_{R = R^{x,n}_t} + 1_{R = R^{x,n}_t + 1} ),

where α^n_t is a scalar between 0 and 1 and can depend only on information that became available up until iteration n and time t. Moreover, on the event that (W_t, R_t) is an accumulation point of {(W^n_t, R^{x,n}_t)}_{n≥0}, we make the standard assumptions that

    Σ_{n=1}^∞ ᾱ^n_t(W_t, R_t) = ∞ a.s.,   (11)
    Σ_{n=1}^∞ (ᾱ^n_t(W_t, R_t))^2 ≤ B^α < ∞ a.s.,   (12)

where B^α is a constant. Clearly, the rule α^n_t = 1/NV^n_t(W_t, R_t) satisfies all the conditions, where NV^n_t(W_t, R_t) is the number of visits to state (W_t, R_t) up until iteration n.

Figure 3: Update procedure of the approximate slopes. (3a: current approximate function V̄^{n-1}_t(W^n_t, ·), optimal decision x^n_t and sampled slopes v̂^n_{t+1}(R^{x,n}_t), v̂^n_{t+1}(R^{x,n}_t + 1); 3b: temporary approximate function with violation of concavity; 3c: level projection operation, updated approximate function with concavity restored.)

Furthermore,

for all positive integers N,

    Π_{n=N}^∞ (1 − ᾱ^n_t(W_t, R_t)) = 0 a.s.   (13)

The proof for (13) follows directly from the fact that log(1 + x) ≤ x.

The second part is the projection operation, where the temporary slope vector z^n_t(W), which may not be monotone decreasing, is transformed into another slope vector v̄^n_t(W) that has this structural property. The projection operator imposes the desired property by simply forcing the violating slopes to be equal to the newly updated ones. For W ∈ W and R = 1, ..., B^pos, the projection is given by

    Π_C(z^n_t)(W, R) =
        z^n_t(W^n_t, R^{x,n}_t),      if W = W^n_t, R < R^{x,n}_t, z^n_t(W, R) ≤ z^n_t(W^n_t, R^{x,n}_t),
        z^n_t(W^n_t, R^{x,n}_t + 1),  if W = W^n_t, R > R^{x,n}_t + 1, z^n_t(W, R) ≥ z^n_t(W^n_t, R^{x,n}_t + 1),
        z^n_t(W, R),                  otherwise.   (14)

For t = 0, ..., T, information vector W ∈ W and asset level R = 1, ..., B^pos, the sequence of slopes of the value function approximation generated by the algorithm is denoted by {v̄^n_t(W, R)}_{n≥0}. Moreover, as the function F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t) is bounded and the stepsizes α^n_t are between 0 and 1, we can easily see that the sample slopes v̂^n_t(R), the temporary slopes z^n_t(W, R) and, consequently, the approximated slopes v̄^n_t(W, R) are all bounded. Therefore, the slope sequence {v̄^n_t(W, R)}_{n≥0} has at least one accumulation point, as the projection operation guarantees that the updated vectors of slopes are elements of a compact set. The accumulation points are random variables and are denoted by v̄*_t(W, R), as opposed to the deterministic optimal slopes v_t(W, R). We conclude by denoting by B^v the deterministic integer that bounds v̂^n_t(R), z^n_t(W, R), v̄^n_t(W, R) and v̄*_t(W, R) for all W ∈ W and R = 1, ..., B^pos.

5 Convergence Analysis

We start this section presenting the convergence results we want to prove. The major result is the almost sure convergence of the approximation slopes corresponding to states

that are visited infinitely often. On the event that (W_t, R_t) is an accumulation point of {(W^n_t, R^{x,n}_t)}_{n≥0}, we obtain

    v̄^n_t(W_t, R_t) → v_t(W_t, R_t) and v̄^n_t(W_t, R_t + 1) → v_t(W_t, R_t + 1) a.s.

As a byproduct of the previous result, we show that, for t = 0, ..., T, on the event that (W_t, R_t, x_t) is an accumulation point of {(W^n_t, R^n_t, x^n_t)}_{n≥0},

    x_t = arg max_{x ∈ X_t(W_t, R_t)} C_t(W_t, R_t, x) + γ V_t(W_t, f^x(R_t, x)) a.s.,   (15)

where V_t(W_t, ·) is the translated optimal value function.

Equation (15) implies that the algorithm has learned almost surely an optimal decision for all states that can be reached by an optimal policy. This implication can be easily justified as follows. Pick ω in the sample space. We omit the dependence of the random variables on ω for the sake of clarity. For t = 0, since R^{x,n}_{−1} = r, a given constant, for all iterations of the algorithm, we have that R_{−1} = r. Moreover, all the elements in W_0 are accumulation points of {W^n_0}_{n≥0}, as W_0 has finite support. Thus, (15) tells us that the accumulation points x_0 of the sequence {x^n_0}_{n≥0} along the iterations with pre-decision state (W_0, R_{−1} + R̂_0(W_0)) are in fact an optimal policy for period 0 when the information is W_0. This implies that all accumulation points R^x_0 = f^x(R_{−1} + R̂_0(W_0), x_0) of {R^{x,n}_0}_{n≥0} are post-decision asset levels that can be reached by an optimal policy. By the same token, for t = 1, every element in W_1 is an accumulation point of {W^n_1}_{n≥0}. Hence, (15) tells us that the accumulation points x_1 of the sequence {x^n_1}_{n≥0} along iterations with (W^n_1, R^n_1) = (W_1, R^x_0 + R̂_1(W_1)) are indeed an optimal policy for period 1 when the information is W_1 and the pre-decision asset level is R_1 = R^x_0 + R̂_1(W_1). As before, the accumulation points R^x_1 = f^x(R_1, x_1) of {R^{x,n}_1}_{n≥0} are post-decision asset levels that can be reached by an optimal policy. The same reasoning can be applied for t = 2, ..., T.

5.1 Outline of the Convergence Proofs

Our proofs follow the ideas presented in Bertsekas & Tsitsiklis (1996) and in Powell et al. (2004).
The first proves convergence assuming that all states are visited infinitely often.

The authors do not consider a concavity-preserving step, which is the key element that has allowed us to obtain a convergence proof when a pure exploitation scheme is considered. Although the framework in Powell et al. (2004) also considers the concavity of the optimal value functions in the asset dimension, the use of a projection operation to restore concavity and a pure exploitation routine, their proof is restricted to two-stage problems.

The main concept to achieve the convergence of the approximation slopes to the optimal ones is to construct deterministic sequences of slopes, namely, {L^k_t(W, R)}_{k≥0} and {U^k_t(W, R)}_{k≥0}, that are provably convergent to the slopes of the optimal value functions. These sequences are based on the dynamic programming operator H, as introduced in (6). We then use these sequences to prove almost surely that for all k ≥ 0,

    L^k_t(W_t, R_t) ≤ v̄^n_t(W_t, R_t) ≤ U^k_t(W_t, R_t),   (16)
    L^k_t(W_t, R_t + 1) ≤ v̄^n_t(W_t, R_t + 1) ≤ U^k_t(W_t, R_t + 1),   (17)

on the event that the iteration n is sufficiently large and (W_t, R_t) is an accumulation point of {(W^n_t, R^{x,n}_t)}_{n≥0}, which implies the convergence of the approximation slopes to the optimal ones.

Establishing (16) and (17) requires several intermediate steps that need to take into consideration the pure exploitation nature of our algorithm and the concavity preserving operation. We give all the details in the proof of L^k_t(W_t, R_t) ≤ v̄^n_t(W_t, R_t) and L^k_t(W_t, R_t + 1) ≤ v̄^n_t(W_t, R_t + 1). The upper bound inequalities are obtained using a symmetrical argument. First, we define two auxiliary stochastic sequences of slopes, namely, the noise and the bounding sequences, denoted by {s̄^n_t(W, R)}_{n≥0} and {l̄^n_t(W, R)}_{n≥0}, respectively. The first sequence represents the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations and the optimal slopes. The second one is a convex combination of the deterministic sequence L^k_t(W, R) and the transformed sequence (HL^k)_t(W, R). The two stochastic sequences are used to show that, on the event that the

iteration n is big enough and (W_t, ·) is an element of the random set S_t,

    v̄^{n-1}_t(W_t, ·) ≥ l̄^{n-1}_t(W_t, ·) − s̄^{n-1}_t(W_t, ·) a.s.

The set S_t contains the states (W_t, R_t) and (W_t, R_t + 1), such that (W_t, R_t) is an accumulation point of {(W^n_t, R^{x,n}_t)}_{n≥0} and the projection operation decreased or kept the same the corresponding unprojected slopes infinitely often. Then, on {(W_t, ·) ∈ S_t}, convergence to zero of the noise sequence, the convex combination property of the bounding sequence and the monotone decreasing property of the approximate slopes will give us L^k_t(W_t, ·) ≤ v̄^n_t(W_t, ·) a.s. Note that this inequality does not cover all the accumulation points of {(W^n_t, R^{x,n}_t)}_{n≥0}, since they are restricted to states in the set S_t. Nevertheless, this inequality and some properties of the projection operation are used to fulfill the requirements of a bounding technical lemma, which is used repeatedly to obtain the desired lower bound inequalities for all accumulation points.

In order to prove (15), we note that

    F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t, x) = C_t(W^n_t, R^n_t, x) + γ V̄^{n-1}_t(W^n_t, f^x(R^n_t, x))

is a concave function of x and X_t(W^n_t, R^n_t) is a convex set. Therefore,

    0 ∈ ∂F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t) − X_N(W^n_t, R^n_t, x^n_t),

where x^n_t is the optimal decision of the optimization problem in STEP 3 of the algorithm, ∂F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t) is the subdifferential of F_t(v̄^{n-1}_t(W^n_t), W^n_t, R^n_t, ·) at x^n_t and X_N(W^n_t, R^n_t, x^n_t) is the normal cone of X_t(W^n_t, R^n_t) at x^n_t. This inclusion and the first convergence result are then combined to prove, as shown in section 5.4, that

    0 ∈ ∂F_t(v_t(W_t), W_t, R_t, x_t) − X_N(W_t, R_t, x_t).
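The optimality inclusion can be checked numerically in one dimension. The following toy example (all numbers invented) maximizes a concave piecewise-linear function over a box and verifies that zero lies in the slope interval (the superdifferential) at the maximizer, which is the one-dimensional form of the inclusion above at an interior point, where the normal cone reduces to {0}.

```python
# One-dimensional illustration (invented numbers) of the optimality inclusion:
# for a concave piecewise-linear F maximized over the box [0, 4], the
# subdifferential at an interior maximizer must contain 0.
slopes = [3.0, 1.0, -2.0, -4.0]      # slope of F on [0,1], [1,2], [2,3], [3,4]

def F(x):  # piecewise-linear and concave, since the slopes are decreasing
    val = 0.0
    for i, s in enumerate(slopes):
        seg = min(max(x - i, 0.0), 1.0)   # length of segment i covered by x
        val += s * seg
    return val

# A concave piecewise-linear function attains its maximum at a breakpoint.
x_star = max(range(5), key=F)
# At an interior breakpoint x, the subdifferential of F is the interval
# [slopes[x], slopes[x-1]] (right slope, left slope); the normal cone of the
# box at an interior point is {0}, so optimality means 0 lies in that interval.
sub_lo, sub_hi = slopes[x_star], slopes[x_star - 1]
print(x_star, (sub_lo, sub_hi))      # 2 (-2.0, 1.0): 0 is inside
assert sub_lo <= 0.0 <= sub_hi
```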

5.2 Technical Elements

In this section, we set the stage for the convergence proofs by defining some technical elements. We start with the definition of the deterministic sequence {L^k_t(W, R)}_{k≥0}. For t = 0, ..., T−1, W ∈ W and R = 1, ..., B^pos, we have that

    L^0_t(W, R) = v_t(W, R) − 2B^v and L^{k+1}_t(W, R) = ( L^k_t(W, R) + (HL^k)_t(W, R) ) / 2.   (18)

At the end of the planning horizon T, L^k_T(W_T, R) = 0 for all k ≥ 0. The proposition below introduces the required properties of the deterministic sequence {L^k_t(W, R)}_{k≥0}. Its proof is deferred to the appendix.

Proposition 2. For t = 0, ..., T−1, information vector W ∈ W and asset levels R = 1, ..., B^pos,

    L^k_t(W) is monotone decreasing,   (19)
    (HL^k)_t(W, R) ≥ L^{k+1}_t(W, R) ≥ L^k_t(W, R),   (20)
    L^k_t(W, R) < v_t(W, R), and lim_{k→∞} L^k_t(W, R) = v_t(W, R).   (21)

The deterministic sequence {U^k_t(W, R)}_{k≥0} is defined in a symmetrical way. It also has the properties stated in proposition 2, with the reversed inequality signs.

We move on to define the random index N that is used to indicate when an iteration of the algorithm is large enough for convergence analysis purposes. Let N be the smallest integer such that all states (actions) visited (taken) by the algorithm after iteration N are accumulation points of the sequence of states (actions) generated by the algorithm. In fact, N can be required to satisfy other constraints of the type: if an event did not happen infinitely often, then it did not happen after N. Since we need N to be finite almost surely, the additional number of constraints has to be finite.

We introduce the set of iterations, namely N_t(W, R), that keeps track of the effects produced by the projection operation. For W ∈ W and R = 1, ..., B^pos, let N_t(W, R) be the set of iterations in which the unprojected slope corresponding to state (W, R), that is,

z^n_t(W, R), was too large and had to be decreased by the projection operation. Formally,

    N_t(W, R) = { n ∈ N : z^n_t(W, R) > v̄^n_t(W, R) }.

For example, based on figure 3c, n ∈ N_t(W^n_t, R^n_t + 2). A related set is the set of states S_t. A state (W, R) is an element of S_t if (W, R) is equal to an accumulation point (W_t, R_t) of {(W^n_t, R^{x,n}_t)}_{n≥0} or is equal to (W_t, R_t + 1). Its corresponding approximate slope also has to satisfy the condition z^n_t(W, R) ≤ v̄^n_t(W, R) for all n ≥ N, that is, the projection operation increased or kept the same the corresponding unprojected slopes infinitely often.

We close this section dealing with measurability issues. Let (Ω, F, P) be the probability space under consideration. The sigma-algebra F is defined by F = σ{(W^n_t, x^n_t), n ≥ 1, t = 0, ..., T}. Moreover, for n ≥ 1 and t = 0, ..., T, F^n_t = σ{ {(W^m_{t'}, x^m_{t'}), 0 < m < n, t' = 0, ..., T} ∪ {(W^n_{t'}, x^n_{t'}), t' = 0, ..., t} }. Clearly, F^n_t ⊆ F^n_{t+1} and F^n_T ⊆ F^{n+1}_0. Furthermore, given the initial slopes v̄^0_t(W) and the initial asset level r, we have that R^n_t, R^{x,n}_t and α^n_t are in F^n_t, while v̂^n_{t+1}(R^x), z^n_t(W) and v̄^n_t(W) are in F^n_{t+1}.

A pointwise argument is used in all the proofs of almost sure convergence presented in this paper. Thus, zero-measure events are discarded on an as-needed basis.

5.3 Almost sure convergence of the slopes

We prove that the approximation slopes produced by the SPAR-Storage algorithm converge almost surely to the slopes of the optimal value functions of the storage class for states that can be reached by an optimal policy. This result is stated in theorem 1 below. Along with the proof of the theorem, we present the noise and the bounding stochastic sequences and introduce three technical lemmas. Their proofs are given in the appendix so that the main reasoning is not disrupted.

Before the theorem, we introduce a technical assumption that is non-trivial due to the pure exploitation nature of our algorithm. Moreover, its verification is highly dependent on the specific problem within the storage class. An example of such verification can be found in Nascimento & Powell (to appear) for the lagged acquisition problem.
Given k ≥ 0 and t = 0,...,T−1, we assume there exists a positive random variable N_t^k such that, on {n ≥ N_t^k},

(HL^k)_t(W_t^n, R_t^n) ≤ (Hv̄^{n−1})_t(W_t^n, R_t^n) ≤ (HU^k)_t(W_t^n, R_t^n) a.s., (22)
(HL^k)_t(W_t^n, R_t^n + 1) ≤ (Hv̄^{n−1})_t(W_t^n, R_t^n + 1) ≤ (HU^k)_t(W_t^n, R_t^n + 1) a.s. (23)

Theorem 1. Assume the stepsize conditions (11)-(12). Also assume (22) and (23). Then, for all k ≥ 0 and t = 0,...,T, on the event that (W*,R*) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, the sequences of slopes {v̄_t^n(W*,R*)}_{n≥0} and {v̄_t^n(W*,R*+1)}_{n≥0} generated by the SPAR-Storage algorithm for the storage class converge almost surely to the optimal slopes v_t(W*,R*) and v_t(W*,R*+1), respectively.

Proof. As discussed in section 5.1, since the deterministic sequences {L_t^k(W,R^x)}_{k≥0} and {U_t^k(W,R^x)}_{k≥0} do converge to the optimal slopes, the convergence of the approximation sequences is obtained by showing that for each k ≥ 0 there exists a nonnegative random variable N_t^{*,k} such that, on the event that n ≥ N_t^{*,k} and (W*,R*) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, we have

L_t^k(W*,R*) ≤ v̄_t^n(W*,R*) ≤ U_t^k(W*,R*) a.s. and
L_t^k(W*,R*+1) ≤ v̄_t^n(W*,R*+1) ≤ U_t^k(W*,R*+1) a.s.

We concentrate on the inequalities L_t^k(W*,R*) ≤ v̄_t^n(W*,R*) and L_t^k(W*,R*+1) ≤ v̄_t^n(W*,R*+1). The upper bounds are obtained using a symmetrical argument.

The proof is by backward induction on t. The base case t = T is trivial, as L_T^k(W_T,R) = v̄_T^n(W_T,R) = 0 for all W_T ∈ W, R = 1,...,B^pos, k ≥ 0 and iterations n ≥ 0. Thus, we can pick, for example, N_T^{*,k} = N*, where N*, as defined in section 5.2, is a random variable that denotes when an iteration of the algorithm is large enough for convergence analysis purposes. The backward induction proof is completed when we prove, for t = T−1,...,0 and k ≥ 0, that there exists N_t^{*,k} such that, on the event that n ≥ N_t^{*,k} and (W*,R*) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0},

L_t^k(W*,R*) ≤ v̄_t^n(W*,R*) and L_t^k(W*,R*+1) ≤ v̄_t^n(W*,R*+1) a.s. (24)

Given the induction hypothesis for t+1, the proof for time period t is divided into two parts. In the first part, we prove for all k ≥ 0 that there exists a nonnegative random variable N_t^k such that

L_t^k(W*,R′) ≤ v̄_t^{n−1}(W*,R′) a.s. on {n ≥ N_t^k, (W*,R′) ∈ S*_t}. (25)

Its proof is by induction on k. Note that it only applies to states in the random set S*_t. Then, again for t, we take on the second part, which takes care of the states not covered by the first part, proving the existence of a nonnegative random variable N_t^{*,k} such that the lower bound inequalities are true on {n ≥ N_t^{*,k}} for all accumulation points of {(W_t^n, R_t^{x,n})}_{n≥0}.

We start the backward induction on t. Pick ω ∈ Ω. We omit the dependence of the random elements on ω for compactness. Remember that the base case t = T is trivial and we pick N_T^{*,k} = N*. We also pick, for convenience, N_T^k = N*.

Induction Hypothesis: Given t = T−1,...,0, assume, for t+1 and all k ≥ 0, the existence of integers N_{t+1}^k and N_{t+1}^{*,k} such that, for all n ≥ N_{t+1}^k, (25) is true, and, for all n ≥ N_{t+1}^{*,k}, the inequalities in (24) hold true for all accumulation points (W*,R*).

Part 1: For our fixed time period t, we prove, for any k, the existence of an integer N_t^k such that, for n ≥ N_t^k, inequality (25) is true. The proof is by forward induction on k.

We start with k = 0. For every state (W,R), we have that −B_v ≤ v_t(W,R) ≤ B_v, implying, by definition, that L_t^0(W,R) = −B_v. Therefore, (25) is satisfied for all n ≥ 1, since we know that v̄_t^{n−1}(W,R) is bounded between −B_v and B_v for all iterations. Thus, N_t^0 = max(1, N_{t+1}^{*,0}) = N_{t+1}^{*,0}.

The induction hypothesis on k assumes that there exists N_t^k such that, for all n ≥ N_t^k, (25) is true. Note that we can always make N_t^k larger than N_{t+1}^{*,k}, thus we assume that N_t^k ≥ N_{t+1}^{*,k}. The next step is the proof for k+1.

Before we move on, we depart from our pointwise argument in order to define the stochastic noise sequence and state a lemma describing an important property of this sequence.

We start by defining, for R = 1,...,B^pos, the random variable

ŝ_{t+1}^n(R) = (Hv̄^{n−1})_t(W_t^n, R) − v̂_{t+1}^n(R)

that measures the error incurred by observing a sample slope. Using ŝ_{t+1}^n(R), we define for each W ∈ W the stochastic noise sequence {s̄_t^n(W,R)}_{n≥0}. We have that s̄_t^n(W,R) = 0 on {n < N_t^k} and, on {n ≥ N_t^k}, s̄_t^n(W,R) is equal to

max( 0, (1 − ᾱ_t^n(W,R)) s̄_t^{n−1}(W,R) + ᾱ_t^n(W,R) ŝ_{t+1}^n(R_t^{x,n} 1_{{R ≤ R_t^{x,n}}} + (R_t^{x,n}+1) 1_{{R > R_t^{x,n}}}) ).

The sample slopes are defined in a way such that

IE[ ŝ_{t+1}^n(R) | F_t^n ] = 0. (26)

This conditional expectation is called the unbiasedness property. This property, together with the martingale convergence theorem and the boundedness of both the sample slopes and the approximate slopes, is crucial for proving that the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations, goes to zero as the number of iterations of the algorithm goes to infinity, as is stated in the next lemma.

Lemma 1. On the event that (W*,R*) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, we have that

{s̄_t^n(W*,R*)}_{n≥0} → 0 and {s̄_t^n(W*,R*+1)}_{n≥0} → 0 a.s. (27)

Proof of lemma 1. Given in the appendix.

Returning to our pointwise argument, where we have fixed ω ∈ Ω, we use the convention that the minimum of an empty set is +∞. Let

δ_L^k = min{ ((HL^k)_t(W*,R′) − L_t^k(W*,R′))/4 : (W*,R′) ∈ S*_t, (HL^k)_t(W*,R′) > L_t^k(W*,R′) }.

If δ_L^k < +∞, we define an integer N_L ≥ N_t^k to be such that

∏_{m=N_t^k}^{N_L−1} (1 − ᾱ_t^m(W*,R′)) ≤ 1/4 and s̄_t^{n−1}(W*,R′) ≤ δ_L^k (28)

for all n ≥ N_L and states (W*,R′) ∈ S*_t. Such an N_L exists because both (13) and (27) are true. If δ_L^k = +∞, then, for all states (W*,R′) ∈ S*_t, (HL^k)_t(W*,R′) = L_t^k(W*,R′), since (20) tells us that (HL^k)_t(W*,R′) ≥ L_t^k(W*,R′). Thus, L_t^{k+1}(W*,R′) = L_t^k(W*,R′) and

we define the integer N_L to be equal to N_t^k. We let N_t^{k+1} = max(N_L, N_{t+1}^{*,k+1}) and show that (25) holds for n ≥ N_t^{k+1}.

We pick a state (W*,R′) ∈ S*_t. If L_t^{k+1}(W*,R′) = L_t^k(W*,R′), then inequality (25) follows from the induction hypothesis. We therefore concentrate on the case where L_t^{k+1}(W*,R′) > L_t^k(W*,R′).

First, we depart one more time from the pointwise argument to introduce the stochastic bounding sequence. We also state a lemma combining this sequence with the stochastic noise sequence. For W ∈ W and R = 1,...,B^pos, we have, on {n < N_t^k}, that l̄_t^n(W,R) = L_t^k(W,R) and, on {n ≥ N_t^k},

l̄_t^n(W,R) = (1 − ᾱ_t^n(W,R)) l̄_t^{n−1}(W,R) + ᾱ_t^n(W,R) (HL^k)_t(W,R).

The next lemma states that the noise and the stochastic bounding sequences can be used to provide a bound for the approximate slopes, as follows.

Lemma 2. On {n ≥ N_t^k, (W*,R′) ∈ S*_t},

v̄_t^{n−1}(W*,R′) ≥ l̄_t^{n−1}(W*,R′) − s̄_t^{n−1}(W*,R′) a.s. (29)

Proof of lemma 2. Given in the appendix.

Back to our fixed ω, a simple inductive argument proves that l̄_t^n(W,R) is a convex combination of L_t^k(W,R) and (HL^k)_t(W,R). Therefore we can write

l̄_t^{n−1}(W*,R′) = b^{n−1} L_t^k(W*,R′) + (1 − b^{n−1}) (HL^k)_t(W*,R′),

where b^{n−1} = ∏_{m=N_t^k}^{n−1} (1 − ᾱ_t^m(W*,R′)). For n ≥ N_t^{k+1} ≥ N_L, we have b^{n−1} ≤ 1/4. Moreover, L_t^k(W*,R′) ≤ (HL^k)_t(W*,R′). Thus, using (18) and the definition of δ_L^k, we obtain

l̄_t^{n−1}(W*,R′) ≥ (1/4) L_t^k(W*,R′) + (3/4) (HL^k)_t(W*,R′)
            = (1/2) L_t^k(W*,R′) + (1/2) (HL^k)_t(W*,R′) + (1/4) ((HL^k)_t(W*,R′) − L_t^k(W*,R′))
            ≥ L_t^{k+1}(W*,R′) + δ_L^k. (30)
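The convex-combination representation of the bounding sequence is easy to verify numerically. The sketch below is illustrative only (the harmonic stepsize ᾱ_m = 1/m and all constants are our own choices, not taken from the paper): iterating the smoothing recursion from L toward HL^k reproduces b^{n−1} L + (1 − b^{n−1}) HL^k exactly, and with this stepsize the product of the (1 − ᾱ_m) factors telescopes to (N_t^k − 1)/(n − 1), which drops below 1/4 once n is roughly four times N_t^k.

```python
def bounding_sequence(L, H, n_start, n_end):
    """Iterate l <- (1 - a_m) * l + a_m * H for m = n_start, ..., n_end - 1,
    starting from l = L, with harmonic stepsize a_m = 1/m. Also track the
    product b of the (1 - a_m) factors."""
    l, b = L, 1.0
    for m in range(n_start, n_end):
        a = 1.0 / m
        l = (1.0 - a) * l + a * H
        b *= 1.0 - a
    return l, b

L, H = -2.0, 6.0
l, b = bounding_sequence(L, H, n_start=10, n_end=100)
# b telescopes to (10 - 1)/(100 - 1) = 1/11, well below 1/4, and l equals the
# convex combination b * L + (1 - b) * H, so l sits within delta of H.
```

This is exactly the mechanism the proof uses: once b^{n−1} ≤ 1/4, the bounding sequence has moved at least three quarters of the way from L_t^k toward (HL^k)_t.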

Combining (29) and (30), we obtain, for all n ≥ N_t^{k+1} ≥ N_L,

v̄_t^{n−1}(W*,R′) ≥ L_t^{k+1}(W*,R′) + δ_L^k − s̄_t^{n−1}(W*,R′) ≥ L_t^{k+1}(W*,R′) + δ_L^k − δ_L^k = L_t^{k+1}(W*,R′),

where the last inequality follows from (28).

Part 2: We continue to consider the ω picked in the beginning of the proof of the theorem. In this part, we take care of the states (W*,R*) that are accumulation points but are not in S*_t. In contrast to part 1, the proof technique here is not by forward induction on k. We rely entirely on the definition of the projection operation and on the elements defined in section 5.2, as this part of the proof is all about states for which the projection operation decreased the corresponding approximate slopes infinitely often, which might happen when some of the optimal slopes are equal. Of course, this fact is not verifiable in advance, as the optimal slopes are unknown.

Remember that at iteration n, time period t, we observe the sample slopes v̂_{t+1}^n(R_t^{x,n}) and v̂_{t+1}^n(R_t^{x,n}+1), and it is always the case that v̂_{t+1}^n(R_t^{x,n}) ≥ v̂_{t+1}^n(R_t^{x,n}+1), implying that the resulting temporary slope z_t^n(W_t^n, R_t^{x,n}) is bigger than z_t^n(W_t^n, R_t^{x,n}+1). Therefore, according to our projection operator, the updated slopes v̄_t^n(W_t^n, R_t^{x,n}) and v̄_t^n(W_t^n, R_t^{x,n}+1) are always equal to z_t^n(W_t^n, R_t^{x,n}) and z_t^n(W_t^n, R_t^{x,n}+1), respectively. Due to our stepsize rule, as described in section 4, the slopes corresponding to (W_t^n, R_t^{x,n}) and (W_t^n, R_t^{x,n}+1) are the only ones updated due to a direct observation of sample slopes at iteration n, time period t. All the other slopes are modified only if a violation of the monotone decreasing property occurs. Therefore, the slopes corresponding to states with information vector W ∈ W different than W_t^n, no matter the asset level R = 1,...,B^pos, remain the same at iteration n, time period t, that is, v̄_t^{n−1}(W,R) = z_t^n(W,R) = v̄_t^n(W,R). On the other hand, it is always the case that the temporary slopes corresponding to states with information vector W_t^n and asset levels smaller than R_t^{x,n} can only be increased by the projection operation. If

necessary, they are increased to be equal to v̄_t^n(W_t^n, R_t^{x,n}). Similarly, the temporary slopes corresponding to states with information vector W_t^n and asset levels greater than R_t^{x,n}+1 can only be decreased by the projection operation. If necessary, they are decreased to be equal to v̄_t^n(W_t^n, R_t^{x,n}+1); see figure 3c.

Keeping the previous discussion in mind, it is easy to see that, for each W ∈ W, if R^Min is the minimum asset level such that (W, R^Min) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, then the slope corresponding to (W, R^Min) could only be decreased by the projection operation at a finite number of iterations, as a decreasing requirement could only originate from an asset level smaller than R^Min. However, no state with information vector W and asset level smaller than R^Min is visited by the algorithm after iteration N* (as defined in section 5.2), since only accumulation points are visited after N*. We thus have that (W, R^Min) is an element of the set S*_t, showing that S*_t is a proper set. Hence, for all states (W*,R*) that are accumulation points of {(W_t^n, R_t^{x,n})}_{n≥0} and are not elements of S*_t, there exists another state (W*,R′), where R′ is the maximum asset level smaller than R* such that (W*,R′) ∈ S*_t. We argue that for all asset levels R between R′+1 and R* (inclusive), we have that |N_t(W*,R)| = ∞. Figure 4 illustrates the situation.

Figure 4: Illustration of technical elements related to the projection operation

As introduced in section 5.2, we have that N_t(W,R) = {n ∈ IN : z_t^n(W,R) > v̄_t^n(W,R)}. By definition, the sets S*_t and N_t(W,R) share the following relationship. Given

that (W,R) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, then |N_t(W,R)| = ∞ if and only if the state (W,R) is not an element of S*_t. Therefore, |N_t(W*,R*)| = ∞, as otherwise (W*,R*) would be an element of S*_t. If R′ = R*−1, we are done. If R′ < R*−1, we have to consider two cases, namely that (W*,R*−1) is an accumulation point and that (W*,R*−1) is not an accumulation point. For the first case, we have that |N_t(W*,R*−1)| = ∞ from the fact that this state is not an element of S*_t. For the second case, since (W*,R*−1) is not an accumulation point, its corresponding slope is never updated due to a direct observation of sample slopes for n ≥ N*, by the definition of N*. Moreover, every time the slope of (W*,R*) is decreased due to a projection (which is coming from the left), the slope of (W*,R*−1) has to be decreased as well. Therefore, N_t(W*,R*) ∩ {n ≥ N*} ⊆ N_t(W*,R*−1) ∩ {n ≥ N*}, implying that |N_t(W*,R*−1)| = ∞. We then apply the same reasoning to states (W*,R*−2),...,(W*,R′+1), obtaining that the corresponding sets of iterations have an infinite number of elements. The same reasoning applies to states (W*,R*+1) that are not in S*_t.

We state a lemma that is the key element for the proof of Part 2, once again going away from the pointwise argument.

Lemma 3. Given an information vector W ∈ W and an asset level R = 1,...,B^pos−1, if for all k ≥ 0 there exists an integer random variable N^k(W,R) such that L_t^k(W,R) ≤ v̄_t^{n−1}(W,R) almost surely on {n ≥ N^k(W,R), |N_t(W,R+1)| = ∞}, then for all k ≥ 0 there exists another integer random variable N^k(W,R+1) such that L_t^k(W,R+1) ≤ v̄_t^{n−1}(W,R+1) almost surely on {n ≥ N^k(W,R+1)}.

Proof of lemma 3. Given in the appendix.

Using the properties of the projection operator, we return to the proof of Part 2 and to our fixed ω. Pick k ≥ 0 and a state (W*,R*) that is an accumulation point but is not in S*_t. The same applies if (W*,R*+1) ∉ S*_t. Consider the state (W*,R′), where R′ is the maximum asset level smaller than R* such that (W*,R′) ∈ S*_t. This state satisfies the condition of lemma 3 with N^k(W*,R′) = N_t^k (from part 1 of the proof). Thus, we can apply this lemma in order to obtain, for all k ≥ 0, an integer N^k(W*,R′+1) such that

L_t^k(W*,R′+1) ≤ v̄_t^{n−1}(W*,R′+1) for all n ≥ N^k(W*,R′+1). After that, we use lemma 3 again, this time considering the state (W*,R′+1). Note that the first application of lemma 3 gave us the integer N^k(W*,R′+1), necessary to fulfill the conditions of this second usage of the lemma. We repeat the same reasoning, applying lemma 3 successively to the states (W*,R′+2),...,(W*,R*−1). In the end, we obtain, for each k ≥ 0, an integer N^k(W*,R*) such that L_t^k(W*,R*) ≤ v̄_t^{n−1}(W*,R*) for all n ≥ N^k(W*,R*). Figure 5 illustrates this process.

Figure 5: Successive applications of lemma 3

Finally, if we pick N_t^{*,k} to be greater than N_t^k of part 1 and greater than N^k(W*,R*) and N^k(W*,R*+1) for all accumulation points (W*,R*) that are not in S*_t, then (24) is true for all accumulation points and n ≥ N_t^{*,k}.

5.4 Optimality of the Decisions

We finish the convergence analysis proving that, with probability one, the algorithm learns an optimal decision for all states that can be reached by an optimal policy.

Theorem 2. Assume the conditions of Theorem 1 are satisfied. For t = 0,...,T, on the event that (W*,R*,v*,x*) is an accumulation point of the sequence {(W_t^n, R_t^n, v̄_t^{n−1}, x_t^n)}_{n≥1}

generated by the SPAR-Storage algorithm, x* is almost surely an optimal solution of

max_{x_t ∈ X_t(W*,R*)} F_t(v_t(W*), W*, R*, x_t), (31)

where

F_t(v_t(W*), W*, R*, x_t) = C_t(W*, R*, x_t) + γ V_t(W*, R* − Σ_{i=1}^{B^d} x_t^{d,i} + Σ_{i=1}^{B^s} x_t^{s,i} − x_t^r).

Proof. Fix ω ∈ Ω. As before, the dependence on ω is omitted. At each iteration n and time t of the algorithm, the decision x_t^n in STEP 3 of the algorithm is an optimal solution to the optimization problem

max_{x_t ∈ X_t(W_t^n, R_t^n)} F_t(v̄_t^{n−1}(W_t^n), W_t^n, R_t^n, x_t).

Since F_t(v̄_t^{n−1}(W_t^n), W_t^n, R_t^n, ·) is concave and X_t(W_t^n, R_t^n) is convex, we have that

0 ∈ ∂F_t(v̄_t^{n−1}(W_t^n), W_t^n, R_t^n, x_t^n) + N_{X_t}(W_t^n, R_t^n, x_t^n),

where ∂F_t(v̄_t^{n−1}(W_t^n), W_t^n, R_t^n, x_t^n) is the subdifferential of F_t(v̄_t^{n−1}(W_t^n), W_t^n, R_t^n, ·) at x_t^n and N_{X_t}(W_t^n, R_t^n, x_t^n) is the normal cone of X_t(W_t^n, R_t^n) at x_t^n. Then, by passing to the limit, we can conclude that each accumulation point (W*,R*,v*,x*) of the sequence {(W_t^n, R_t^n, v̄_t^{n−1}, x_t^n)}_{n≥1} satisfies the condition 0 ∈ ∂F_t(v*_t(W*), W*, R*, x*) + N_{X_t}(W*, R*, x*).

We now derive an expression for the subdifferential. We have that

∂F_t(v*_t(W*), W*, R*, x_t) = ∇C_t(W*, R*, x_t) + γ ∂V*_t(W*, R* + A·x_t),

where A = (−1,...,−1, 1,...,1, −1), with B^d entries equal to −1 followed by B^s entries equal to 1, that is, A·x_t = −Σ_{i=1}^{B^d} x_t^{d,i} + Σ_{i=1}^{B^s} x_t^{s,i} − x_t^r. From (Bertsekas et al., 2003, Proposition 4.2.5), for x_t ∈ IN^{B^d+B^s+1},

∂V*_t(W*, R* + A·x_t) = {(A_1 y, ..., A_{B^d+B^s+1} y)^T : y ∈ [v*_t(W*, R* + A·x_t + 1), v*_t(W*, R* + A·x_t)]}.

Therefore, as x_t is integer valued,

∂F_t(v*_t(W*), W*, R*, x_t) = {∇C_t(W*, R*, x_t) + γAy : y ∈ [v*_t(W*, R* + A·x_t + 1), v*_t(W*, R* + A·x_t)]}.
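The role of the slopes in STEP 3's optimization can be illustrated with a toy instance. All numbers, names and the single-purchase-decision setup below are our own assumptions, not the paper's model: because the value function is concave and piecewise linear with unit breakpoints, enumerating a small feasible set and the classical greedy rule (keep buying while the discounted marginal slope exceeds the unit cost) select the same decision.

```python
def value(slopes, R):
    """Concave piecewise-linear value function: V(R) = sum of first R slopes."""
    return sum(slopes[:R])

def best_purchase(slopes, R0, cost, gamma=1.0, x_max=3):
    # Enumeration over the feasible set {0, ..., x_max}.
    by_enum = max(range(x_max + 1),
                  key=lambda x: -cost * x + gamma * value(slopes, R0 + x))
    # Greedy marginal rule: buy one more unit while its slope beats the cost.
    x = 0
    while x < x_max and gamma * slopes[R0 + x] > cost:
        x += 1
    return by_enum, x

slopes = [5.0, 3.0, 2.0, 1.0]   # nonincreasing slopes => concave V
print(best_purchase(slopes, R0=1, cost=2.5))  # (1, 1): both rules agree
```

The interval of slopes in the subdifferential above is what makes the greedy stopping test valid: the decision is optimal exactly when the unit cost lands between the slopes on either side of the post-decision asset level.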

Since (W*, R* + A·x*) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, it follows from theorem 1 that v*_t(W*, R* + A·x*) = v_t(W*, R* + A·x*) and v*_t(W*, R* + A·x* + 1) = v_t(W*, R* + A·x* + 1). Hence, ∂F_t(v*_t(W*), W*, R*, x*) = ∂F_t(v_t(W*), W*, R*, x*) and 0 ∈ ∂F_t(v_t(W*), W*, R*, x*) + N_{X_t}(W*, R*, x*), which proves that x* is an optimal solution of (31).

6 Summary

We proposed a pure exploitation approximate dynamic programming algorithm in order to find an optimal policy for problems in the storage class. Problems in this class are of high practical importance, but may suffer from the curse of dimensionality, preventing the use of standard techniques such as backward dynamic programming, real-time dynamic programming (RTDP) and Q-learning. The key property of the storage class is that the optimal value functions associated with its problems are concave and piecewise linear with integer breakpoints in the asset dimension. This feature was used extensively both in the design of our algorithm and in the convergence proofs, allowing for the pure exploitation scheme. The algorithm uses Monte Carlo samples to learn the optimal value function only in important parts of the state space, which are determined by the algorithm itself.

Acknowledgements

The authors would like to acknowledge the valuable comments and suggestions of Andrzej Ruszczyński. This research was supported in part by AFOSR contract FA

References

Ahmed, S. & Shapiro, A. (2002), The sample average approximation method for stochastic programs with integer recourse, E-print available at http://

Barto, A. G., Bradtke, S. J. & Singh, S. P. (1995), Learning to act using real-time dynamic programming, Artificial Intelligence, Special Volume on Computational Research on Interaction and Agency 72.

Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

Bertsekas, D., Nedic, A. & Ozdaglar, A. (2003), Convex Analysis and Optimization, Athena Scientific, Belmont, Massachusetts.

Chen, Z.-L. & Powell, W. B. (1999), A convergent cutting-plane and partial-sampling algorithm for multistage stochastic linear programs with recourse, Journal of Optimization Theory and Applications 102(3).

Higle, J. & Sen, S. (1991), Stochastic decomposition: An algorithm for two stage linear programs with recourse, Mathematics of Operations Research 16(3).

Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994), Convergence of stochastic iterative dynamic programming algorithms, in J. D. Cowan, G. Tesauro & J. Alspector, eds, Advances in Neural Information Processing Systems, Vol. 6, Morgan Kaufmann Publishers, San Francisco.

Nascimento, J. & Powell, W. B. (to appear), An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem, Mathematics of Operations Research.

Powell, W. B. (2007), Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New York.

Powell, W. B., Ruszczyński, A. & Topaloglu, H. (2004), Learning algorithms for separable


More information

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3 Macroeconomic Theory Ph.D. Qualifying Examinaion Fall 2005 Comprehensive Examinaion UCLA Dep. of Economics You have 4 hours o complee he exam. There are hree pars o he exam. Answer all pars. Each par has

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Predator - Prey Model Trajectories and the nonlinear conservation law

Predator - Prey Model Trajectories and the nonlinear conservation law Predaor - Prey Model Trajecories and he nonlinear conservaion law James K. Peerson Deparmen of Biological Sciences and Deparmen of Mahemaical Sciences Clemson Universiy Ocober 28, 213 Ouline Drawing Trajecories

More information

Logic in computer science

Logic in computer science Logic in compuer science Logic plays an imporan role in compuer science Logic is ofen called he calculus of compuer science Logic plays a similar role in compuer science o ha played by calculus in he physical

More information

GENERALIZATION OF THE FORMULA OF FAA DI BRUNO FOR A COMPOSITE FUNCTION WITH A VECTOR ARGUMENT

GENERALIZATION OF THE FORMULA OF FAA DI BRUNO FOR A COMPOSITE FUNCTION WITH A VECTOR ARGUMENT Inerna J Mah & Mah Sci Vol 4, No 7 000) 48 49 S0670000970 Hindawi Publishing Corp GENERALIZATION OF THE FORMULA OF FAA DI BRUNO FOR A COMPOSITE FUNCTION WITH A VECTOR ARGUMENT RUMEN L MISHKOV Received

More information

ODEs II, Lecture 1: Homogeneous Linear Systems - I. Mike Raugh 1. March 8, 2004

ODEs II, Lecture 1: Homogeneous Linear Systems - I. Mike Raugh 1. March 8, 2004 ODEs II, Lecure : Homogeneous Linear Sysems - I Mike Raugh March 8, 4 Inroducion. In he firs lecure we discussed a sysem of linear ODEs for modeling he excreion of lead from he human body, saw how o ransform

More information

Variational Iteration Method for Solving System of Fractional Order Ordinary Differential Equations

Variational Iteration Method for Solving System of Fractional Order Ordinary Differential Equations IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volume 1, Issue 6 Ver. II (Nov - Dec. 214), PP 48-54 Variaional Ieraion Mehod for Solving Sysem of Fracional Order Ordinary Differenial

More information

SMT 2014 Calculus Test Solutions February 15, 2014 = 3 5 = 15.

SMT 2014 Calculus Test Solutions February 15, 2014 = 3 5 = 15. SMT Calculus Tes Soluions February 5,. Le f() = and le g() =. Compue f ()g (). Answer: 5 Soluion: We noe ha f () = and g () = 6. Then f ()g () =. Plugging in = we ge f ()g () = 6 = 3 5 = 5.. There is a

More information

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales Advances in Dynamical Sysems and Applicaions. ISSN 0973-5321 Volume 1 Number 1 (2006, pp. 103 112 c Research India Publicaions hp://www.ripublicaion.com/adsa.hm The Asympoic Behavior of Nonoscillaory Soluions

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach 1 Decenralized Sochasic Conrol wih Parial Hisory Sharing: A Common Informaion Approach Ashuosh Nayyar, Adiya Mahajan and Demoshenis Tenekezis arxiv:1209.1695v1 [cs.sy] 8 Sep 2012 Absrac A general model

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance.

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance. 1 An Inroducion o Backward Sochasic Differenial Equaions (BSDEs) PIMS Summer School 2016 in Mahemaical Finance June 25, 2016 Chrisoph Frei cfrei@ualbera.ca This inroducion is based on Touzi [14], Bouchard

More information

On the Optimal Policy Structure in Serial Inventory Systems with Lost Sales

On the Optimal Policy Structure in Serial Inventory Systems with Lost Sales On he Opimal Policy Srucure in Serial Invenory Sysems wih Los Sales Woonghee Tim Huh, Columbia Universiy Ganesh Janakiraman, New York Universiy May 21, 2008 Revised: July 30, 2008; December 23, 2008 Absrac

More information

Convergence of the Neumann series in higher norms

Convergence of the Neumann series in higher norms Convergence of he Neumann series in higher norms Charles L. Epsein Deparmen of Mahemaics, Universiy of Pennsylvania Version 1.0 Augus 1, 003 Absrac Naural condiions on an operaor A are given so ha he Neumann

More information

Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Capacity Constraints

Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Capacity Constraints IJCSI Inernaional Journal of Compuer Science Issues, Vol 9, Issue 1, No 1, January 2012 wwwijcsiorg 18 Applying Geneic Algorihms for Invenory Lo-Sizing Problem wih Supplier Selecion under Sorage Capaciy

More information

On a Discrete-In-Time Order Level Inventory Model for Items with Random Deterioration

On a Discrete-In-Time Order Level Inventory Model for Items with Random Deterioration Journal of Agriculure and Life Sciences Vol., No. ; June 4 On a Discree-In-Time Order Level Invenory Model for Iems wih Random Deerioraion Dr Biswaranjan Mandal Associae Professor of Mahemaics Acharya

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

CENTRALIZED VERSUS DECENTRALIZED PRODUCTION PLANNING IN SUPPLY CHAINS

CENTRALIZED VERSUS DECENTRALIZED PRODUCTION PLANNING IN SUPPLY CHAINS CENRALIZED VERSUS DECENRALIZED PRODUCION PLANNING IN SUPPLY CHAINS Georges SAHARIDIS* a, Yves DALLERY* a, Fikri KARAESMEN* b * a Ecole Cenrale Paris Deparmen of Indusial Engineering (LGI), +3343388, saharidis,dallery@lgi.ecp.fr

More information

The expectation value of the field operator.

The expectation value of the field operator. The expecaion value of he field operaor. Dan Solomon Universiy of Illinois Chicago, IL dsolom@uic.edu June, 04 Absrac. Much of he mahemaical developmen of quanum field heory has been in suppor of deermining

More information

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems Paricle Swarm Opimizaion Combining Diversificaion and Inensificaion for Nonlinear Ineger Programming Problems Takeshi Masui, Masaoshi Sakawa, Kosuke Kao and Koichi Masumoo Hiroshima Universiy 1-4-1, Kagamiyama,

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

11!Hí MATHEMATICS : ERDŐS AND ULAM PROC. N. A. S. of decomposiion, properly speaking) conradics he possibiliy of defining a counably addiive real-valu

11!Hí MATHEMATICS : ERDŐS AND ULAM PROC. N. A. S. of decomposiion, properly speaking) conradics he possibiliy of defining a counably addiive real-valu ON EQUATIONS WITH SETS AS UNKNOWNS BY PAUL ERDŐS AND S. ULAM DEPARTMENT OF MATHEMATICS, UNIVERSITY OF COLORADO, BOULDER Communicaed May 27, 1968 We shall presen here a number of resuls in se heory concerning

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Chapter 3 Boundary Value Problem

Chapter 3 Boundary Value Problem Chaper 3 Boundary Value Problem A boundary value problem (BVP) is a problem, ypically an ODE or a PDE, which has values assigned on he physical boundary of he domain in which he problem is specified. Le

More information

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions This aricle was downloaded by: [140.1.241.64] On: 05 January 2016, A: 21:41 Publisher: Insiue for Operaions Research and he Managemen Sciences (INFORMS) INFORMS is locaed in Maryland, USA Operaions Research

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite American Journal of Operaions Research, 08, 8, 8-9 hp://wwwscirporg/journal/ajor ISSN Online: 60-8849 ISSN Prin: 60-8830 The Opimal Sopping Time for Selling an Asse When I Is Uncerain Wheher he Price Process

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Chapter 7: Solving Trig Equations

Chapter 7: Solving Trig Equations Haberman MTH Secion I: The Trigonomeric Funcions Chaper 7: Solving Trig Equaions Le s sar by solving a couple of equaions ha involve he sine funcion EXAMPLE a: Solve he equaion sin( ) The inverse funcions

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Subway stations energy and air quality management

Subway stations energy and air quality management Subway saions energy and air qualiy managemen wih sochasic opimizaion Trisan Rigau 1,2,4, Advisors: P. Carpenier 3, J.-Ph. Chancelier 2, M. De Lara 2 EFFICACITY 1 CERMICS, ENPC 2 UMA, ENSTA 3 LISIS, IFSTTAR

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN Inernaional Journal of Scienific & Engineering Research, Volume 4, Issue 10, Ocober-2013 900 FUZZY MEAN RESIDUAL LIFE ORDERING OF FUZZY RANDOM VARIABLES J. EARNEST LAZARUS PIRIYAKUMAR 1, A. YAMUNA 2 1.

More information

Basic Circuit Elements Professor J R Lucas November 2001

Basic Circuit Elements Professor J R Lucas November 2001 Basic Circui Elemens - J ucas An elecrical circui is an inerconnecion of circui elemens. These circui elemens can be caegorised ino wo ypes, namely acive and passive elemens. Some Definiions/explanaions

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F. Trench. SIAM J. Matrix Anal. Appl. 11 (1990),

SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F. Trench. SIAM J. Matrix Anal. Appl. 11 (1990), SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F Trench SIAM J Marix Anal Appl 11 (1990), 601-611 Absrac Le T n = ( i j ) n i,j=1 (n 3) be a real symmeric

More information

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t.

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t. Econ. 5b Spring 999 C. Sims Discree-Time Sochasic Dynamic Programming 995, 996 by Chrisopher Sims. This maerial may be freely reproduced for educaional and research purposes, so long as i is no alered,

More information

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities:

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities: Mah 4 Eam Review Problems Problem. Calculae he 3rd Taylor polynomial for arcsin a =. Soluion. Le f() = arcsin. For his problem, we use he formula f() + f () + f ()! + f () 3! for he 3rd Taylor polynomial

More information

Cash Flow Valuation Mode Lin Discrete Time

Cash Flow Valuation Mode Lin Discrete Time IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728,p-ISSN: 2319-765X, 6, Issue 6 (May. - Jun. 2013), PP 35-41 Cash Flow Valuaion Mode Lin Discree Time Olayiwola. M. A. and Oni, N. O. Deparmen of Mahemaics

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Existence of positive solution for a third-order three-point BVP with sign-changing Green s function

Existence of positive solution for a third-order three-point BVP with sign-changing Green s function Elecronic Journal of Qualiaive Theory of Differenial Equaions 13, No. 3, 1-11; hp://www.mah.u-szeged.hu/ejqde/ Exisence of posiive soluion for a hird-order hree-poin BVP wih sign-changing Green s funcion

More information

not to be republished NCERT MATHEMATICAL MODELLING Appendix 2 A.2.1 Introduction A.2.2 Why Mathematical Modelling?

not to be republished NCERT MATHEMATICAL MODELLING Appendix 2 A.2.1 Introduction A.2.2 Why Mathematical Modelling? 256 MATHEMATICS A.2.1 Inroducion In class XI, we have learn abou mahemaical modelling as an aemp o sudy some par (or form) of some real-life problems in mahemaical erms, i.e., he conversion of a physical

More information