An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem


Juliana M. Nascimento and Warren B. Powell
Department of Operations Research and Financial Engineering, Princeton University
August 21, 2006

Abstract

We consider a multistage asset acquisition problem, where assets are purchased now, at a price that varies randomly over time, to be used to satisfy a random demand at a particular point in time in the future. We provide a rare proof of convergence for an approximate dynamic programming algorithm using pure exploitation, where the states we visit depend on the decisions produced by solving the approximate problem. The resulting algorithm does not require knowing the probability distribution of prices or demands, nor does it require any assumptions about its functional form. The algorithm and its proof rely on the fact that the true value function is a family of piecewise linear concave functions.

1 Introduction

We consider a class of multistage problems called the lagged asset acquisition problem. An integer amount $x_t$ of a single asset is purchased at time $t$, $t = 0, \dots, T-1$, to be used to satisfy a demand that occurs only at a fixed time $T$. The price $P_t$ that we pay to acquire assets at time $t$ is a Markov process. In most practical applications, the price trends upward, but downward fluctuations create buying opportunities. We do not realize the demand $\hat D$ until time $T$, at which point we receive a random revenue $\hat r$ multiplied by the smaller of $\hat D$ and the total we have ordered up to this point. In our problem, $x_t$ is a scalar quantity. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be our probability space. The goal is to find an $\mathcal{F}_t$-measurable sequence $x = (x_t)_{t=0}^{T-1}$ that maximizes

$$\max_x \; \mathbb{E}\left[-\sum_{t=0}^{T-1} P_t x_t + \hat r \min\Big(\hat D, \sum_{t=0}^{T-1} x_t\Big)\right].$$

This problem arises in a number of settings. An energy company may be purchasing futures contracts for oil or gas to lock in a lower price now. Companies purchasing expensive equipment (aircraft, locomotives, power transformers) can often pay less if they place orders further in the future. Shipping companies purchase space on container ships for a year or more in advance to guarantee space. All of these decisions are made before knowing the true demand, the prices and the revenues in the future.

Our problem could be solved using classical backward dynamic programming, but two issues can prevent this. First, we may not know the probability distribution of prices, demands and revenues. There has been increasing interest in solving stochastic optimization problems using a distribution-free, nonparametric approach. Distribution-free revenue management and multiproduct pricing applications can be found in van Ryzin & McGill (2000) and Rusmevichientong et al. (2006), respectively. A single-period newsvendor problem and its multi-period extension, when the demand distribution is unknown, are considered in Levi et al. (2006).
The authors established bounds on the number of samples required to guarantee that, with high probability, the expected cost of the sampling-based policies is arbitrarily close to that of the optimal policy. Second, even though the state variable only has two dimensions (price and quantity, which we assume are discrete), the state space

can still be quite large. In section 7, we report on experiments where the state space has as many as 16 million possible values. If we assume the probability distributions are known, exact solutions using classical methods require up to 6.7 hours to compute. Even with one-dimensional state spaces, the curse of dimensionality might present itself. In the context of a single-item stochastic lot-sizing problem with known distribution, Halman et al. (2006) develop approximation algorithms to deal with it. The authors also prove that finding an optimal policy is NP-hard.

The goal of this paper is to prove convergence of an algorithm which proceeds by solving problems of the form

$$x_t^n = \arg\max_{0 \le x \le M_t} \left(-P_t^n x + \bar V_t^{n-1}(P_t^n, R_{t-1}^n + x)\right),$$

where $R_t^n = R_{t-1}^n + x_t^n$ captures cumulative past purchases, $\bar V_t^{n-1}(P_t^n, R_t^n)$ is an approximation to the dynamic programming optimal value function, and $P_t^n$ is a sample realization of the price we must pay for purchases at time $t$. Our convergence proof requires $P_t^n$ to be discrete, as would occur in any practical application. However, our algorithm allows prices to be continuous even though we discretize the value function approximation. In addition, our numerical experiments show that our algorithm produces very accurate results even when we use a relatively coarse discretization of the value function.

Our algorithm and its convergence proof rely on the fact that both the optimal and the approximated value functions are piecewise linear and concave in the asset dimension with breakpoints on the integers. If we define

$$F_t^{n-1,P_t^n,R_{t-1}^n}(x) = -P_t^n x + \bar V_t^{n-1}(P_t^n, R_{t-1}^n + x), \qquad (1)$$

then the slopes of $F_t^{n-1,P_t^n,R_{t-1}^n} : \mathbb{R} \to \mathbb{R}$ to the left and right of $x_t^n$ (which is an integer breakpoint of $F_t^{n-1,P_t^n,R_{t-1}^n}$) are used to update $\bar V_{t-1}^{n-1}$, obtaining $\bar V_{t-1}^n$. As we can see from (1), the slopes depend both on the sample information given by $P_t^n$ and on $\bar V_t^{n-1}$, which at iteration $n$ is only an approximation of future profits. As a result, the slopes are biased, causing complications in the convergence proof.
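Because $F_t^{n-1,P_t^n,R_{t-1}^n}$ is concave with integer breakpoints, the argmax above can be computed by a marginal scan: keep buying one more unit while the slope of the approximation exceeds the price. A minimal Python sketch (the function name and the slope layout are illustrative choices, not the paper's code):

```python
def best_purchase(price, R_prev, slopes, M):
    """Greedy decision x = argmax_{0<=x<=M} (-price*x + Vbar(R_prev+x)),
    where Vbar is piecewise linear concave with integer breakpoints and
    slopes[R-1] stores the decreasing slope Vbar(R) - Vbar(R-1)."""
    x = 0
    # Concavity: once a marginal unit is not worth its price, no later
    # unit is; a strict comparison returns the smallest maximizer on ties.
    while x < M and R_prev + x < len(slopes) and slopes[R_prev + x] > price:
        x += 1
    return x
```

For example, with slopes (10, 8, 5, 2) and a price of 6, the scan buys two units; lowering the price to 1 buys up to the bound M.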
The dependence on sample information and on the approximation of the value function in the future is common in approximate dynamic programming algorithms (see Bertsekas

& Tsitsiklis (1996), Sutton & Barto (1998)), where an approximation of the future is used to make decisions now, stepping forward in time. The use of separable, piecewise linear approximations has already proven effective on very difficult classes of stochastic resource allocation problems (see Godfrey & Powell (2002) and Topaloglu & Powell (2006)), but as of this writing there are no convergence results for multistage problems. Our proof technique combines ideas from the field of approximate dynamic programming (notably Bertsekas & Tsitsiklis (1996)) as well as the proof of the SPAR algorithm in Powell et al. (2004). Our algorithm is modeled after the SPAR algorithm, which is presented in the context of a two-stage problem. The result is a rare instance of a provably convergent approximate dynamic programming algorithm that uses pure exploitation, which is to say that the decision $x_t^n$ that we make now (based on the value function approximation $\bar V_t^{n-1}$) determines the state we visit at $t+1$. Current proofs of convergence for approximate dynamic programming algorithms such as Q-learning (Tsitsiklis (1994), Jaakkola et al. (1994)) and optimistic policy iteration (Tsitsiklis (2002)) require that we visit states (and possibly actions) infinitely often. A convergence proof for a Real Time Dynamic Programming (Barto et al. (1995)) algorithm that considers a pure exploitation scheme is provided in Bertsekas & Tsitsiklis (1996)[Props. 5.3 and 5.4], but it assumes that the distributions of the random variables are known. We make no such assumptions, but it is important to emphasize that our result depends on the concavity of the objective function.

There are a number of competing approaches to this problem. Since our problem requires integer solutions, we can use any of a vast range of approximate dynamic programming algorithms (Bertsekas & Tsitsiklis (1996)), but these lack provable convergence without computationally expensive steps that require forcing the algorithm to sample states and actions infinitely often.
From the field of stochastic programming, there are several flavors of Benders decomposition that can be used (Van Slyke & Wets (1969), Higle & Sen (1991), Chen & Powell (1999)). However, these methods will not handle the random price issue. Another powerful technique is sample average approximation (SAA) (Shapiro (2003)), which relies on generating random samples outside of the optimization problems and then solving the corresponding deterministic problems using an appropriate optimization algorithm. Numerical experiments with the SAA approach applied to problems where an integer solution is required can be found in Ahmed & Shapiro (2002).

The contributions of the paper are: (a) we propose an approximate dynamic programming algorithm for the lagged asset acquisition problem using pure exploitation; (b) we prove convergence of the algorithm; and (c) we demonstrate experimentally that our algorithm outperforms competing algorithms, and in particular dramatically outperforms standard backward dynamic programming when the distributions are assumed known.

This paper is organized as follows. Section 2 defines the problem and the corresponding dynamic programming model. Section 3 describes the algorithmic strategy. Section 4 introduces notation and assumptions for the convergence analysis. Section 5 presents a sketch of the convergence proofs, while section 6 provides the full proofs. Finally, section 7 provides some experimental comparisons against the optimal policy and other approaches, and section 8 presents the conclusions.

2 Problem Formulation and Model

In this section we give a precise description of the problem considered in this paper as well as the assumptions taken. We also provide the dynamic programming model associated with the problem and identify the structural properties that are exploited in our proof.

The problem is to determine, in each time period $t = 0, \dots, T-1$, how much should be purchased of a given asset to meet a positive discrete integer random demand $\hat D$ at time $T$. A strictly positive price $P_t$ is charged for each unit of asset purchased at $t$ and a strictly positive bounded random reward $\hat r$ is received for each unit of satisfied demand. The demand is independent of the price and reward. We denote by $x_t$ the amount purchased at each period $t$ and we require that $x_t \in \{0, \dots, M_t\}$, where $M_t$ is a natural number. Moreover, $x_t \in \mathcal{F}_t$ and the price process $P = (P_0, \dots, P_{T-1})$ is a Markov process with finite support $\mathcal{P} = \mathcal{P}_0 \times \cdots \times \mathcal{P}_{T-1}$. We can write $P_{t+1} = f(P_t, \hat P_{t+1})$, where $f$ is a deterministic function of the previous price $P_t$ and the exogenous price information $\hat P_{t+1}$, which might be dependent on $P_t$. The objective is to maximize the expected profit. We show in section 7 that the finite support assumption is not that restrictive, as the algorithm works well for arbitrarily fine levels of discretization.

The decision $x_t$ at each period $t$ depends both on the current unit price of the asset and on the amount of assets purchased up until time $t-1$ (inclusive), which is denoted by $R_{t-1}$. We assume that $R_{-1} = 0$. Clearly, $R_t = R_{t-1} + x_t$, for $t = 0, \dots, T-1$. Note that $R_{T-1}$ denotes the total number of assets acquired over all time periods, which is used to satisfy demand $\hat D$ at $T$. The state variable is thus given by $S_t = (P_t, R_t)$. We let $S = (S_0, \dots, S_{T-1})$ be our state vector and $\mathcal{S} = \mathcal{S}_0 \times \cdots \times \mathcal{S}_{T-1}$ be the state space.

The problem can be formulated as a dynamic program. For $t = 0, \dots, T-2$, the optimality equations $V_t : \mathcal{P}_t \times [0, B_t] \to \mathbb{R}$, where $B_t = \sum_{i=0}^{t} M_i$, are given by

$$V_t(P, R) = \mathbb{E}\left[\max_{0 \le x_{t+1} \le M_{t+1}} \left(-P_{t+1} x_{t+1} + V_{t+1}(P_{t+1}, R + x_{t+1})\right) \,\Big|\, P_t = P\right].$$

For $t = T-1$, $V_{T-1} : \mathcal{P}_{T-1} \times [0, B_{T-1}] \to \mathbb{R}$ is given by

$$V_{T-1}(P, R) = \mathbb{E}\left[\hat r \min(\hat D, R) \,\Big|\, P_{T-1} = P\right].$$

Note that we are using a post-decision state variable. This is the state of the system after the decision $x_t$ is taken. See Powell & Van Roy (2004) and Van Roy et al. (1997) for a discussion and an application. Post-decision states lead to an inversion of the optimization/expectation order in the value function formula. This inversion allows for more effective computational strategies.

We can show that the optimal value functions are concave and piecewise linear with integer breakpoints in the asset dimension. Therefore, the value function $V_t(P, \cdot)$, for $t = 0, \dots, T-1$ and $P \in \mathcal{P}_t$, can be identified uniquely by its decreasing slopes $(v_t(P, 1), \dots, v_t(P, B_t))$. Moreover, if $R$ is an integer, the optimal decision

$$x_{t+1} = \arg\max_{0 \le x \le M_{t+1}} \left(-P_{t+1} x + V_{t+1}(P_{t+1}, R + x)\right)$$

is an integer, without having to enforce integrality. We disregard the values at $(P, 0)$ because the optimal decisions $x_{t+1}$ do not change when $V_t(P, \cdot)$ is shifted by a constant. In order

to simplify notation, let $\tilde{\mathcal{S}}_t = \mathcal{P}_t \times \{1, \dots, B_t\}$. Note that $\tilde{\mathcal{S}} = (\tilde{\mathcal{S}}_0, \dots, \tilde{\mathcal{S}}_{T-1})$ is the state space minus all the state pairs $(P, 0)$.

We close the section by summarizing the important properties of the optimal value functions and their slopes that are used throughout the paper. The proof is postponed to the appendix.

Proposition 1. The optimal value functions are piecewise linear, with integer breakpoints, and concave in the asset dimension. Moreover, for $t = 0, \dots, T-1$ and $(P, R) \in \tilde{\mathcal{S}}_t$, the optimal slope $v_t(P, R) = V_t(P, R) - V_t(P, R-1)$ is given by

$$v_t(P, R) = \mathbb{E}\left[\max\left(\min\left(P_{t+1}, v_{t+1}(P_{t+1}, R)\right), v_{t+1}(P_{t+1}, R + M_{t+1})\right) \,\Big|\, P_t = P\right] 1_{\{t < T-1\}} + \bar r\, \mathbb{P}\{\hat D \ge R\}\, 1_{\{t = T-1\}}, \qquad (2)$$

where $\bar r = \mathbb{E}[\hat r \mid P_{T-1} = P]$. Thus, $v_t(P, R)$ is bounded between 0 and $\max \hat r$, which is the maximum of the support for the reward $\hat r$. Furthermore, $(v_t(P, 1), \dots, v_t(P, B_t)) \in \mathcal{C}_t$, where

$$\mathcal{C}_t = \left\{ v \in \mathbb{R}^{B_t} : v_1 \le \max \hat r,\; v_{B_t} \ge 0,\; v_{R+1} \le v_R \text{ for } R = 1, \dots, B_t - 1 \right\}.$$

3 Algorithmic Strategy

Our approach to the problem consists of learning the optimal decision given the time period, the amount of assets already available and the current price. However, the objective is to learn the optimal decision only for asset levels that can be generated by an optimal policy. Figure 1 describes the ADP-Lagged algorithm, a modified version of the SPAR algorithm (Powell et al. (2004)). The algorithm starts with initial piecewise linear value function approximations represented by their slopes $\bar v^0$. As discussed in the previous section, optimal decisions depend only on the slopes of the value functions, thus the algorithm only deals with the slopes instead of the value functions themselves. The initial approximations of the slopes are only required to be decreasing and bounded between 0 and $\max \hat r$.

At each iteration $n$ and time $t$, a decision $x_t^n$ is made. This decision is optimal with respect to the sample realization of the price sequence up to time $t$, the asset level $R_{t-1}^n$ and

the current approximation of the slopes $\bar v^{n-1}$. It will bring the system to the new asset level $R_t^n = R_{t-1}^n + x_t^n$. Just after this transition, a sample realization of the slopes of $\bar V_t^{n-1}$ to the left and right of $(P_t^n, R_t^n)$ is observed. These samples, denoted by $\hat v_{t+1}^n(R_t^n)$ and $\hat v_{t+1}^n(R_t^n + 1)$, are used to update the slope approximations $\bar v_t^{n-1}(P_t^n, R_t^n)$ and $\bar v_t^{n-1}(P_t^n, R_t^n + 1)$. After that, a projection operation $\Pi_C$ is performed in case a violation of the concavity property occurs. For completeness, we assume that $R_{-1}^n = 0$ for all $n$. Also assume $P_0^n = \hat P_0^n$.

We denote by $S_t = (P_t, R_t)$ a general state at time $t$, while $S_t^n = (P_t^n, R_t^n)$ represents the actual state visited by the algorithm at iteration $n$ and time $t$. Moreover, $\{S_t^n\}_{n \ge 0} = \{(P_t^n, R_t^n)\}_{n \ge 0}$ is the sequence of states generated by the algorithm. The same notation holds for the decisions $x_t$, $x_t^n$ and $\{x_t^n\}_{n \ge 0}$. The algorithm also generates the $\{\bar v^n\}_{n \ge 0}$ sequences, that is, the sequences of slopes of the value function approximations. It is important to realize that there is one sequence $\{\bar v_t^n(P_t, R_t)\}_{n \ge 0}$ for each time $t < T$ and state $S_t \in \tilde{\mathcal{S}}_t$. The notation $\{\bar v^n\}_{n \ge 0}$ represents the family of all such sequences.

Remember that, for $(P, R) \in \tilde{\mathcal{S}}_t$, the optimal slope is given by (2). A sample slope is obtained by replacing the expectation by a sample realization and by replacing $v_{t+1}$ by its current approximation. Therefore, the sample slope, for $R = 1, \dots, B_t$, is given by

$$\hat v_{t+1}^n(R) = \max\left(\min\left(P_{t+1}^n, \bar v_{t+1}^{n-1}(P_{t+1}^n, R)\right), \bar v_{t+1}^{n-1}(P_{t+1}^n, R + M_{t+1})\right) 1_{\{t < T-1\}} + \hat r^n 1_{\{R \le \hat D^n\}} 1_{\{t = T-1\}}. \qquad (3)$$

Note that, for all $t$, $\hat v_{t+1}^n(R) \ge \hat v_{t+1}^n(R+1)$. When $t = T-1$, the sample slope does not depend on a current slope approximation, as is the case for $t < T-1$. This fact is important for the convergence analysis of the algorithm, since it implies that $\hat v_{t+1}^n(R)$ is an unbiased estimator of $v_t(P_t, R)$ for $t = T-1$, and it is biased for $t < T-1$.

The projection operator $\Pi_C$ maps a vector $z^n$ that may not be monotone decreasing in the asset dimension (the concavity property) into another vector $\bar v^n$, such that, for $P \in \mathcal{P}_t$, $\bar v_t^n(P) = (\bar v_t^n(P, 1), \dots, \bar v_t^n(P, B_t)) \in \mathcal{C}_t$.
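(Before specifying $\Pi_C$, we note that the sample slope (3) is simple to compute from the stored slopes. The dict-based storage below is an illustrative choice, not the paper's.)

```python
def sample_slope(t, T, R, price_next, vbar_next, M_next, r_hat, D_hat):
    """Sample slope vhat_{t+1}^n(R) from equation (3). For t < T-1 it
    combines next period's sampled price with the current slope
    approximation vbar_next (a dict mapping (price, R) -> slope); for
    t = T-1 it is the realized reward r_hat if the R-th unit falls
    within the realized demand D_hat, and 0 otherwise."""
    if t < T - 1:
        return max(min(price_next, vbar_next[(price_next, R)]),
                   vbar_next[(price_next, R + M_next)])
    return r_hat if R <= D_hat else 0.0
```

As claimed in the text, when the stored slopes are decreasing in R the returned sample slopes are decreasing in R as well.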
In this paper, we consider the Level projection operator introduced in Topaloglu & Powell (2003). It imposes concavity by simply forcing

STEP 0: Initialize $\bar v_t^0(P, R)$ for all $t$ and $(P, R)$ to be monotone decreasing in $R$. Set $n = 1$.
STEP 1: Sample the price sequence $P^n = (P_0^n, \dots, P_{T-1}^n)$, the demand $\hat D^n$ and reward $\hat r^n$.
STEP 2: Do for $t = 0, \dots, T-1$:
STEP 2a: $x_t^n = \arg\max_{0 \le x \le M_t} \left(-P_t^n x + \bar V_t^{n-1}(P_t^n, R_{t-1}^n + x)\right)$.
STEP 2b: $S_t^n = (P_t^n, R_{t-1}^n + x_t^n)$.
STEP 2c: Observe $\hat v_{t+1}^n(R_t^n)$ and $\hat v_{t+1}^n(R_t^n + 1)$ according to (3).
STEP 2d: For $(P, R) \in \tilde{\mathcal{S}}_t$,
$$z_t^n(P, R) = \begin{cases} (1 - \alpha_t^n) \bar v_t^{n-1}(P, R) + \alpha_t^n \hat v_{t+1}^n(R), & \text{if } P = P_t^n,\; R = R_t^n \text{ or } R_t^n + 1, \\ \bar v_t^{n-1}(P, R), & \text{else.} \end{cases}$$
STEP 2e: $\bar v_t^n = \Pi_C(z_t^n)$. See (4) for the details.
STEP 3: Increase $n$ by one and go to step 1.

Figure 1: ADP-Lagged Algorithm

the violating slopes to be equal to the newly updated ones. For $(P, R) \in \tilde{\mathcal{S}}_t$, the operator is given by

$$\Pi_C(z_t^n)(P, R) = \begin{cases} z_t^n(P_t^n, R_t^n), & \text{if } P = P_t^n,\; R \le R_t^n,\; z_t^n(P, R) \le z_t^n(P_t^n, R_t^n), \\ z_t^n(P_t^n, R_t^n + 1), & \text{if } P = P_t^n,\; R \ge R_t^n + 1,\; z_t^n(P, R) \ge z_t^n(P_t^n, R_t^n + 1), \\ z_t^n(P, R), & \text{else.} \end{cases} \qquad (4)$$

Figure 2 helps us visualize one iteration $n$ of the algorithm at time $t$. After the algorithm has sampled the price sequence, demand and reward, and has made the decisions up until time $t-1$, the current price is $P_t^n$ and the total amount of assets purchased so far is $R_{t-1}^n$. Based on the slope approximation $\bar v_t^{n-1}$, the algorithm determines the amount of assets $x_t^n$ to acquire at time $t$ and samples the slopes at $R_t^n = R_{t-1}^n + x_t^n$ and $R_t^n + 1$, as illustrated in figure 2a. The decision $x_t^n$ maximizes the function $F_t^{n-1,P_t^n,R_{t-1}^n}(x) = -P_t^n x + \bar V_t^{n-1}(P_t^n, R_{t-1}^n + x)$, where the value function approximation $\bar V_t^{n-1}$ is completely determined by the slopes $\bar v_t^{n-1}$, assuming $\bar V_t^{n-1}(P_t^n, 0) = 0$. After the current slope approximations are updated using the sampled slopes, a violation of the concavity property may occur, as shown in figure 2b. In this

[Figure 2: Iteration $n$ of the algorithm at time $t$. 2a: current approximate function $F_t^{n-1,P_t^n,R_{t-1}^n}(x)$, optimal decision $x_t^n$ and sampled slopes $\hat v_{t+1}^n(R_t^n)$, $\hat v_{t+1}^n(R_t^n+1)$. 2b: temporary approximate function $z_t^n$ with a violation of concavity. 2c: Level projection operation: updated approximate function with concavity restored.]

case, the projection operation $\Pi_C$ is performed and concavity is restored, as in figure 2c.

4 Theoretical Conditions and Assumptions

We start this section by pointing out that the sequence of states $\{S_t^n\}_{n \ge 0} = \{(P_t^n, R_t^n)\}_{n \ge 0}$ and the sequence of decisions $\{x_t^n\}_{n \ge 0}$ generated by the algorithm have at least one accumulation

point, as the price sequence has finite support and the decisions are integer and bounded, which implies that the resource sequence has finite support as well. Let $\mathcal{S}_t^*$ be the set of all states that are either equal to an accumulation point $(P^*, R^*)$ of $\{(P_t^n, R_t^n)\}_{n \ge 0}$ or are equal to $(P^*, R^* + 1)$. Moreover, we only consider accumulation points $(P^*, R^*)$ such that $R^* > 0$ and $R^* < B_t$. The slope sequences $\{\bar v^n\}_{n \ge 0}$ also have an accumulation point, as the set $\mathcal{C}_t$ (defined in proposition 1) is compact and the projection operation guarantees, for all iterations $n$ and prices $P \in \mathcal{P}_t$, that $\bar v_t^n(P) = (\bar v_t^n(P, 1), \dots, \bar v_t^n(P, B_t)) \in \mathcal{C}_t$.

Let $\mathcal{F}$ be the sigma-algebra generated by the algorithm. We denote by $\mathcal{F}_t^n$, for $t = 0, \dots, T$, the sigma-algebra generated by the algorithm up until iteration $n$ and time period $t$. Moreover, we denote by $\mathcal{F}^n$ the sigma-algebra generated by the algorithm up until the end of iteration $n$. Clearly, for $t = 0, \dots, T-1$, $\mathcal{F}_t^n \subseteq \mathcal{F}_{t+1}^n$ and $\mathcal{F}_T^n \subseteq \mathcal{F}^n \subseteq \mathcal{F}_0^{n+1}$. Furthermore, $\bar v_t^n(P, R)$ and $z_t^n(P, R)$ are $\mathcal{F}^n$-measurable, while, for all $t < T$, $P_t^n$, $x_t^n$ and $\hat v_t^n(R)$ are $\mathcal{F}_t^n$-measurable. We also have that $\hat D^n$, $\hat r^n$ and $\hat v_T^n(R)$ are $\mathcal{F}_T^n$-measurable.

We introduce the integer random variable $N$, which is used to indicate when an iteration of the algorithm is large enough for convergence analysis purposes. Let $N$ be the smallest integer such that all accumulation points $(P^*, R^*, x^*) = ((P_0^*, R_0^*, x_0^*), \dots, (P_{T-1}^*, R_{T-1}^*, x_{T-1}^*))$ of $\{(P^m, R^m, x^m)\}_{m \ge 0}$ have been observed at least once. Moreover, $N$ is also the smallest integer such that, if a mathematical statement regarding the sequences of slopes, states and decisions generated by the algorithm is true only for finitely many iterations, then it is false for all iterations $n \ge N$. For example:

$$\text{If } \sum_{n=1}^{\infty} 1_{\{(R_{t-1}^n, P_t^n, x_t^n) = (R, P, x)\}} < \infty, \text{ then } \sum_{n=N}^{\infty} 1_{\{(R_{t-1}^n, P_t^n, x_t^n) = (R, P, x)\}} = 0;$$

$$\text{If } \sum_{n=1}^{\infty} 1_{\{\bar v_t^n(P, R) < P\}} < \infty, \text{ then } \sum_{n=N}^{\infty} 1_{\{\bar v_t^n(P, R) < P\}} = 0.$$

It is trivial to see that $N$ is finite almost surely.

For $(P, R) \in \tilde{\mathcal{S}}_t$, we present the sets of iterations $\mathcal{N}_t^-(P, R)$ and $\mathcal{N}_t^+(P, R)$. These sets keep track of the effects produced by the projection operation. Let $\mathcal{N}_t^-(P, R)$ ($\mathcal{N}_t^+(P, R)$) be the set of iterations in which the unprojected slope corresponding to state $(P, R)$ was too small (large) and had to be increased (decreased) by the projection operation. Formally,

$$\mathcal{N}_t^-(P, R) = \{n \in \mathbb{N} : z_t^n(P, R) < \bar v_t^n(P, R)\}, \qquad \mathcal{N}_t^+(P, R) = \{n \in \mathbb{N} : z_t^n(P, R) > \bar v_t^n(P, R)\}.$$

For example, based on figure 2c, $n \in \mathcal{N}_t^-(P_t^n, R_t^n - 1)$ and $n \in \mathcal{N}_t^+(P_t^n, R_t^n + 2)$.

We now introduce the sets of states $\mathcal{S}_t^-$ and $\mathcal{S}_t^+$. The states in $\mathcal{S}_t^-$ ($\mathcal{S}_t^+$) are the ones for which the projection operation decreased (increased) or kept the same the corresponding unprojected slopes infinitely often; that is, for $(P, R) \in \tilde{\mathcal{S}}_t$, $\mathcal{N}_t^-(P, R)$ ($\mathcal{N}_t^+(P, R)$) is finite if and only if $(P, R) \in \mathcal{S}_t^-$ ($\mathcal{S}_t^+$). That is,

$$\mathcal{S}_t^- = \{(P, R) \in \tilde{\mathcal{S}}_t : z_t^n(P, R) \ge \bar v_t^n(P, R) \text{ for all } n \ge N\},$$
$$\mathcal{S}_t^+ = \{(P, R) \in \tilde{\mathcal{S}}_t : z_t^n(P, R) \le \bar v_t^n(P, R) \text{ for all } n \ge N\}.$$

Due to the definition of the projection operator, $\mathcal{S}_t^+$ is not empty, since $(P, R^{\mathrm{Min}}) \in \mathcal{S}_t^+$, where $R^{\mathrm{Min}}$ is the minimum asset level such that $(P, R) \in \mathcal{S}_t^*$. We can use a similar argument to show that $\mathcal{S}_t^-$ is not empty.

Finally, we impose the conditions that must be satisfied by the stepsizes $\alpha_t^n$ used to update the value function approximations. For $t < T$, the stepsizes satisfy the following conditions:

$$\alpha_t^n \in (0, 1] \quad \text{and} \quad \alpha_t^n \in \mathcal{F}_t^n, \qquad (5)$$

$$\sum_{n=0}^{\infty} (\alpha_t^n)^2 \le B < \infty \quad \text{a.s.}, \qquad (6)$$

where $B$ is a constant. These are standard conditions for stochastic approximation proofs of convergence. We also require that

$$\sum_{n=0}^{\infty} \alpha_t^n 1_{\{P_t^n = P^*,\, R_t^n = R^*\}} = \infty \quad \text{a.s.}, \qquad (7)$$

where $(P^*, R^*)$ is an accumulation point of the sequence $\{(P_t^n, R_t^n)\}_{n \ge 0}$. For example, the stepsize rule $\alpha_t^n = 1/N(P_t^n, R_t^n)$ satisfies conditions (5)–(7), where $N(P_t^n, R_t^n)$ is the number of visits to state $(P_t^n, R_t^n)$ up until iteration $n$.

For ease of notation in the next sections, we define a new stepsize sequence $\bar\alpha^n$ based on the previous one. For $t < T$ and $S_t \in \tilde{\mathcal{S}}_t$, let

$$\bar\alpha_t^n(P, R) = \alpha_t^n \left(1_{\{P = P_t^n,\, R = R_t^n\}} + 1_{\{P = P_t^n,\, R = R_t^n + 1\}}\right).$$

Note that while $\alpha_t^n$ is a scalar, $\bar\alpha_t^n$ is a vector with arguments $(P, R) \in \tilde{\mathcal{S}}_t$. Based on assumptions (5)–(7), we can trivially prove that $\bar\alpha_t^n(P, R) \in [0, 1]$, is $\mathcal{F}_t^n$-measurable and, for $(P^*, R^*) \in \mathcal{S}_t^*$,

$$\sum_{n=0}^{\infty} \left(\bar\alpha_t^n(P^*, R^*)\right)^2 \le B \quad \text{a.s.} \quad \text{and} \quad \sum_{n=0}^{\infty} \bar\alpha_t^n(P^*, R^*) = \infty \quad \text{a.s.} \qquad (8)$$

Furthermore, for all positive integers $N$,

$$\prod_{n=N}^{\infty} \left(1 - \bar\alpha_t^n(P^*, R^*)\right) = 0 \quad \text{a.s.} \qquad (9)$$

The proof of (9) follows directly from the fact that $\log(1 + x) \le x$. As a final remark, we can easily see that $\hat v_t^n(R)$, $z_t^n(P, R)$ and $\bar v_t^n(P, R)$ are bounded by 0 and $\max \hat r$ for all iterations $n$, because the initial approximations are bounded by 0 and $\max \hat r$ and the stepsizes are between 0 and 1.

5 Sketch of Convergence Analysis

We introduce the convergence results we want to prove and sketch the proofs, summarizing the steps that will be used. The full proofs are given in section 6. We are after two main convergence results. The first one is, for each $t < T$ and $(P^*, R^*) \in \mathcal{S}_t^*$,

$$\bar v_t^n(P^*, R^*) \to v_t(P^*, R^*) \quad \text{a.s.} \qquad (10)$$

The second result says that

$$x_t^* = \arg\max_{0 \le x \le M_t} \left(-P_t^* x + V_t(P_t^*, R_{t-1}^* + x)\right) \quad \text{a.s.}, \qquad (11)$$

where $(R_{t-1}^*, P_t^*, x_t^*)$ is an accumulation point of the sequence $\{(R_{t-1}^n, P_t^n, x_t^n)\}_{n \ge 0}$ generated by the algorithm and $V_t$ is the optimal value function. Equation (11) shows that indeed the algorithm has learned the optimal decision for all states that can be reached by an optimal policy. It is easy to see this implication. Starting with $t = 0$, we have by assumption that $R_{-1}^* = 0$, as $R_{-1}^n = 0$ for all iterations of the algorithm. Moreover, all prices in $\mathcal{P}_0$ are accumulation points of $\{P_0^n\}_{n \ge 0}$. Thus, (11) tells us that the accumulation points $x_0^*$ of the sequence $\{x_0^n\}$ along the iterations with initial price $P_0^*$ are in fact an optimal policy for period 0 when the price is $P_0^*$. This implies that all accumulation points $R_0^* = x_0^*$ of $\{R_0^n\}_{n \ge 0}$ are asset levels that can be reached by an optimal policy. By the same token, for $t = 1$, every price in $\mathcal{P}_1$ is an accumulation point of $\{P_1^n\}_{n \ge 0}$. Hence, the second result tells us that the accumulation points $x_1^*$ of the sequence $\{x_1^n\}$ along iterations with $(R_0^n, P_1^n) = (R_0^*, P_1^*)$ are indeed an optimal policy for period 1 when the asset level is $R_0^*$ and the price is $P_1^*$. As before, the accumulation points $R_1^* = R_0^* + x_1^*$ of $\{R_1^n\}_{n \ge 0}$ are asset levels that can be reached by an optimal policy. The same reasoning can be applied for $t = 2, \dots, T-1$.

The main idea to achieve (10) is to define, for each $t < T$ and $(P, R) \in \tilde{\mathcal{S}}_t$ (introduced in section 2), deterministic sequences $\{L_t^k(P, R)\}_{k \ge 0}$ and $\{U_t^k(P, R)\}_{k \ge 0}$ that are provably convergent to $v_t(P, R)$ and then prove, for all $k \ge 0$ and each $(P^*, R^*) \in \mathcal{S}_t^*$, that

$$L_t^k(P^*, R^*) \le \bar v_t^n(P^*, R^*) \le U_t^k(P^*, R^*) \qquad (12)$$

for all $n$ big enough. Establishing these inequalities is nontrivial, and draws on a proof technique in Bertsekas & Tsitsiklis (1996, Section 4.3.6) (B&T). In our proof, however, we have to handle two significant differences. First, our algorithm uses a pure exploitation strategy, whereas B&T assumed that all states are visited infinitely often. Second, we introduce a projection operator to maintain concavity of the approximation, which is not the case in B&T.
In order to establish (12), we need to present the dynamic programming operator $H$ associated with the asset acquisition problem and the deterministic bounding sequences

$\{L^k\}_{k \ge 0}$ and $\{U^k\}_{k \ge 0}$. It is noteworthy that these sequences are completely independent of the algorithm. We also define four stochastic sequences, $\{\bar s_-^n\}_{n \ge 0}$, $\{\bar s_+^n\}_{n \ge 0}$, $\{\bar l^n\}_{n \ge 0}$ and $\{\bar u^n\}_{n \ge 0}$, which do depend on the iterations of the algorithm. The first two sequences are called stochastic noise sequences and the last two sequences are called stochastic bounding sequences. All these elements are combined to obtain (12), and the concavity of the value functions plays a major role in the proofs. Roughly speaking, using properties of the operator $H$ and concavity, we prove

$$(HL^k)_t(P_t^n, R_t^n) \le (H\bar v^{n-1})_t(P_t^n, R_t^n) \le (HU^k)_t(P_t^n, R_t^n),$$
$$(HL^k)_t(P_t^n, R_t^n + 1) \le (H\bar v^{n-1})_t(P_t^n, R_t^n + 1) \le (HU^k)_t(P_t^n, R_t^n + 1).$$

These inequalities enable us to prove, for $n$ big enough, that

$$\bar v_t^{n-1}(P^*, R^*) \le \bar u_t^{n-1}(P^*, R^*) + \bar s_{t,-}^{n-1}(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^-,$$
$$\bar v_t^{n-1}(P^*, R^*) \ge \bar l_t^{n-1}(P^*, R^*) - \bar s_{t,+}^{n-1}(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^+.$$

Then, convergence to zero of the noise sequences, a convex combination property of the stochastic bounding sequences and concavity will give us

$$\bar v_t^{n-1}(P^*, R^*) \le U_t^k(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^-,$$
$$\bar v_t^{n-1}(P^*, R^*) \ge L_t^k(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^+.$$

Finally, concavity plays a role again and we obtain (12). The second convergence result, the optimality of the decisions with respect to the optimal value functions represented by (11), is a byproduct of the convergence of the approximate slopes. It is discussed in detail in the next section.

6 Convergence Analysis

We present formally the dynamic programming operator $H$ and the deterministic bounding sequences $\{U^k\}_{k \ge 0}$ and $\{L^k\}_{k \ge 0}$ in section 6.1. After that, in section 6.2, we state and prove our major

theorem, the almost sure convergence of the approximate slopes to the optimal slopes. As part of the proof, we define the stochastic sequences and state technical lemmas as they are needed. In order to focus on the main ideas of the theorem proof, the proofs of the lemmas are deferred to the appendix. Finally, in section 6.3 we prove the almost sure convergence to the optimal decisions. Since we deal with almost sure convergence proofs, throughout this section we only consider the sets in the sigma-algebra $\mathcal{F}$ that have strictly positive measure.

6.1 The Operator $H$ and the Bounding Sequences

We start by defining the dynamic programming operator $H$ that maps a vector $v$ into a new vector $Hv$ according to the formula

$$(Hv)_t(P, R) = \mathbb{E}\Big[\max\big(\min(P_{t+1}, v_{t+1}(P_{t+1}, R)), v_{t+1}(P_{t+1}, R + M_{t+1})\big) 1_{\{t < T-1\}} + \hat r 1_{\{R \le \hat D\}} 1_{\{t = T-1\}} \,\Big|\, P_t = P\Big] \qquad (13)$$

for $t = 0, \dots, T-1$ and $(P, R) \in \tilde{\mathcal{S}}_t$. The following properties can be easily proved:

1. $H$ has a unique fixed point $v$, where $v$ is the vector of slopes of the optimal value functions.
2. $H$ is monotone; that is, if $v \le \tilde v$ componentwise, then $Hv \le H\tilde v$.
3. $Hv - \eta e \le H(v - \eta e) \le H(v + \eta e) \le Hv + \eta e$, where $\eta$ is a positive constant and $e$ is a vector with all components equal to 1. The inequalities are considered componentwise.
4. $H$ is continuous.

We introduce the deterministic bounding sequences $\{U^k\}_{k \ge 0}$ and $\{L^k\}_{k \ge 0}$ and establish three important properties. When we refer to the sequence $\{U^k\}_{k \ge 0}$ without mentioning the time index $t$ and the state $(P, R) \in \tilde{\mathcal{S}}_t$, we are referring to the family of sequences

$\{U_t^k(P, R)\}_{k \ge 0}$, one for each time $t < T$ and state $(P, R)$. The same is true for the other deterministic sequence $\{L^k\}_{k \ge 0}$. Let

$$U^0 = v + \max \hat r\, e \quad \text{and} \quad U^{k+1} = \frac{U^k + HU^k}{2}, \quad k \ge 0, \qquad (14)$$
$$L^0 = v - \max \hat r\, e \quad \text{and} \quad L^{k+1} = \frac{L^k + HL^k}{2}, \quad k \ge 0. \qquad (15)$$

Note that, just like the slopes $v$, for all $k \ge 0$, $L^k$ and $U^k$ are both monotone decreasing in the asset dimension.

Lemma 1. The sequences $\{U^k\}_{k \ge 0}$ and $\{L^k\}_{k \ge 0}$ satisfy

$$HU^k \le U^{k+1} \le U^k, \qquad (16)$$
$$HL^k \ge L^{k+1} \ge L^k, \qquad (17)$$

and both converge to $v$. Furthermore, $U^k > v$ and $L^k < v$ for all $k \ge 0$.

Proof. The proof of inequalities (16) and (17), as well as the proof of convergence of the sequences to $v$, is given in Bertsekas & Tsitsiklis (1996, Lemmas 4.5 and 4.6). They just require the first four properties of the operator $H$. In order to show that $L^k < v$ for all $k \ge 0$, we begin by analyzing $L_{T-1}^k$. By definition of $H$, for all $(P, R) \in \tilde{\mathcal{S}}_{T-1}$, $(HL^k)_{T-1}(P, R) = v_{T-1}(P, R)$ for all $k \ge 0$. We also have that $L_{T-1}^0(P, R) = v_{T-1}(P, R) - \max \hat r < v_{T-1}(P, R)$. Thus, $L_{T-1}^1(P, R) < v_{T-1}(P, R)$, and an induction argument on $k$ shows that $L_{T-1}^k(P, R) < v_{T-1}(P, R)$ for all $k \ge 0$. Now, assume that $L_{t+1}^k(P, R) < v_{t+1}(P, R)$ for all $k \ge 0$ and $(P, R) \in \tilde{\mathcal{S}}_{t+1}$. We prove $(HL^k)_t(P, R) \le v_t(P, R)$, for $t = 0, \dots, T-2$. We have, for $(P, R) \in \tilde{\mathcal{S}}_t$,

$$(HL^k)_t(P, R) = \mathbb{E}\left[\max\left(\min(P_{t+1}, L_{t+1}^k(P_{t+1}, R)), L_{t+1}^k(P_{t+1}, R + M_{t+1})\right) \,\Big|\, P_t = P\right]$$
$$\le \mathbb{E}\left[\max\left(\min(P_{t+1}, v_{t+1}(P_{t+1}, R)), v_{t+1}(P_{t+1}, R + M_{t+1})\right) \,\Big|\, P_t = P\right] = v_t(P, R).$$

Furthermore, $L_t^0(P, R) = v_t(P, R) - \max \hat r < v_t(P, R)$, which implies $L_t^1(P, R) < v_t(P, R)$. Again, an induction argument on $k$ shows that $L_t^k(P, R) < v_t(P, R)$ for all $k \ge 0$. The proof for $U^k$ follows by a symmetrical argument.
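The averaging recursions (14)-(15) are easy to exercise numerically. Below is a sketch on a tiny hypothetical two-period instance (deterministic price, point-mass demand; the instance is illustrative and not from the paper's experiments). For $T = 2$ the $t = T-1$ component of $Hv$ is constant in $v$, so applying $H$ twice from any start yields the fixed point, after which the bounding sequences collapse onto it:

```python
import numpy as np

# Hypothetical instance: price p = 5 at t = 1, demand Dhat = 1, reward
# rhat = 10, at most M1 = 1 extra unit at t = 1, asset levels R = 1..B.
p, rhat, Dhat, M1, B = 5.0, 10.0, 1, 1, 3

def H(v):
    """Operator (13) for this instance; v = (v0, v1), slopes indexed by
    R = 1..B stored at positions 0..B-1. The R + M1 index is clamped at
    B, a boundary simplification for illustration."""
    v0, v1 = v
    Hv1 = np.array([rhat if R <= Dhat else 0.0 for R in range(1, B + 1)])
    Hv0 = np.array([max(min(p, v1[R - 1]), v1[min(R + M1, B) - 1])
                    for R in range(1, B + 1)])
    return Hv0, Hv1

vstar = H(H((np.zeros(B), np.zeros(B))))  # fixed point of H when T = 2

# Bounding sequences (14)-(15): start max(rhat) above/below v, then average.
U = tuple(c + rhat for c in vstar)
L = tuple(c - rhat for c in vstar)
for k in range(60):
    HU, HL = H(U), H(L)
    U = tuple((c + h) / 2 for c, h in zip(U, HU))
    L = tuple((c + h) / 2 for c, h in zip(L, HL))
```

Consistent with Lemma 1, each $U^k$ stays above the fixed point and each $L^k$ below it while both converge, and the monotone decrease in the asset dimension is preserved.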

6.2 Convergence of $\bar v_t^n(P^*, R^*)$

In this section, we prove almost sure convergence of the slopes of the approximate functions to the slopes of the optimal ones for states in $\mathcal{S}_t^*$. In the process, we present the noise and the bounding stochastic sequences. We also introduce three technical lemmas. Their proofs are given in the appendix. We assume, for all possible states $(P, R)$, that $v_T(P, R) = U_T^k(P, R) = L_T^k(P, R) = \bar v_T^n(P, R) = 0$ for integers $k \ge 0$ and iterations $n \ge 0$.

Theorem 1. Assume the stepsize conditions (5)-(7). Then, for all $k \ge 0$ and $t = 0, \dots, T$, there exists a positive integer $N_t^{*,k}$ such that, for all $n \ge N_t^{*,k}$ and states $(P^*, R^*) \in \mathcal{S}_t^*$,

$$L_t^k(P^*, R^*) \le \bar v_t^{n-1}(P^*, R^*) \le U_t^k(P^*, R^*). \qquad (18)$$

Therefore,

$$\bar v_t^n(P^*, R^*) \to v_t(P^*, R^*) \quad \text{a.s.} \qquad (19)$$

Proof. The proof of the theorem is by backward induction on $t$. The base case is $t = T$. As $v_T(P, R) = U_T^k(P, R) = L_T^k(P, R) = \bar v_T^n(P, R) = 0$ for all states $(P, R)$ and integers $k \ge 0$ and $n \ge 0$, the inequalities in (18) are trivial for $t = T$. Thus, we can pick, for example, $N_T^{*,k} = N$, where $N$, as defined in section 4, is a random variable that denotes when an iteration of the algorithm is large enough for convergence analysis purposes. The backward induction proof is completed when we prove (19) for a general $t$, $t = 0, \dots, T-1$.

Given the induction hypothesis for $t+1$, the proof for time period $t$ is divided into two parts. We prove for all $k \ge 0$ that there exists an integer $N_t^k$ such that, for all $n \ge N_t^k$,

$$\bar v_t^{n-1}(P^*, R^*) \le U_t^k(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^-, \qquad (20)$$
$$\bar v_t^{n-1}(P^*, R^*) \ge L_t^k(P^*, R^*), \quad \text{if } (P^*, R^*) \in \mathcal{S}_t^+. \qquad (21)$$

This is the first part. Its proof is by induction on $k$. Note that this part only applies to states in the sets $\mathcal{S}_t^-$ and $\mathcal{S}_t^+$. Then, again for $t$, we take on the second part, which proves the existence of an integer $N_t^{*,k}$ such that (18) is true for all states in $\mathcal{S}_t^*$ and iterations $n \ge N_t^{*,k}$. Note that the second part takes care of the states in $\mathcal{S}_t^*$ not covered by the first

part. Consequently, (19) is true for $t$. Figure 3 shows the relationship between the sets of states.

[Figure 3: Relationship between the sets of states, nested as follows: $\mathcal{S}$ (state space) $\supseteq$ $\tilde{\mathcal{S}}$ (state space minus $(P, 0)$ pairs) $\supseteq$ $\mathcal{S}^*$ (accumulation points $(P^*, R^*)$ or $(P^*, R^*+1)$ of $\{(P_t^n, R_t^n)\}$) $\supseteq$ $\mathcal{S}^-$ (corresponding slope is increased finitely often due to the projection operation) and $\mathcal{S}^+$ (corresponding slope is decreased finitely often due to the projection operation).]

We start the backward induction on $t$. Remember that the base case $t = T$ is trivial and we pick $N_T^{*,k} = N$. We also pick, for completeness, $N_T^k = N$.

Induction Hypothesis: Given $t = 0, \dots, T-1$, assume, for $t+1$ and all $k \ge 0$, the existence of integers $N_{t+1}^k$ and $N_{t+1}^{*,k}$ such that, for all $n \ge N_{t+1}^k$, (20) and (21) are true, and, for all $n \ge N_{t+1}^{*,k}$, the inequalities in (18) hold true for all states $(P^*, R^*) \in \mathcal{S}_{t+1}^*$.

Part 1: We now prove, for any $k$, the existence of an integer $N_t^k$ such that, for $n \ge N_t^k$, inequalities (20) and (21) are true. For a particular time $t$, the proof is by forward induction on $k$. We start with $k = 0$. For every $(P, R) \in \tilde{\mathcal{S}}_t$, $0 \le v_t(P, R) \le \max \hat r$ implies that, by definition, $U_t^0(P, R) \ge \max \hat r$ and $L_t^0(P, R) \le 0$. Therefore, (20) and (21) are satisfied for all $n \ge 1$, since we know that $\bar v_t^{n-1}$ is bounded by 0 and $\max \hat r$ for all iterations. Thus, $N_t^0 = \max(1, N_{t+1}^{*,0}) = N_{t+1}^{*,0}$.

The induction hypothesis on $k$ assumes that there exists $N_t^k$ such that, for all $n \ge N_t^k$, (20) and (21) are true. Note that we can always make $N_t^k$ larger than $N_{t+1}^{*,k}$, thus we assume that $N_t^k \ge N_{t+1}^{*,k}$. The next step is the proof for $k+1$. Before we move on, we define the variables $\hat s_{t+1}^{n,-}$ and $\hat s_{t+1}^{n,+}$ to be the error incurred by

observing a sample slope. For R = 1, ..., B,

ŝ^{n-}_{t+1}(R) = ˆv^n_{t+1}(R) - (H v̄^{n-1}_t)(P^n_t, R)  and  ŝ^{n+}_{t+1}(R) = -ŝ^{n-}_{t+1}(R).

Using ŝ^{n-}_{t+1} and ŝ^{n+}_{t+1}, we also define the stochastic noise sequences {s̄^{n-}_t}_{n≥0} and {s̄^{n+}_t}_{n≥0}. For (P, R) ∈ S̃, s̄^{n-}_t(P, R) = 0 and s̄^{n+}_t(P, R) = 0 for n < N^k_t, and, for n ≥ N^k_t,

s̄^{n-}_t(P, R) = max(0, (1 - ᾱ^n_t(P, R)) s̄^{n-1,-}_t(P, R) + ᾱ^n_t(P, R) ŝ^{n-}_{t+1}(R^n_t 1_{R ≤ R^n_t} + (R^n_t + 1) 1_{R > R^n_t}))
s̄^{n+}_t(P, R) = max(0, (1 - ᾱ^n_t(P, R)) s̄^{n-1,+}_t(P, R) + ᾱ^n_t(P, R) ŝ^{n+}_{t+1}(R^n_t 1_{R ≤ R^n_t} + (R^n_t + 1) 1_{R > R^n_t})).

The sample slopes are defined in a way such that

IE[ŝ^{n-}_{t+1}(R) | F^n] = 0. (22)

This conditional expectation is called the unbiasedness property. This property, together with the martingale convergence theorem and the boundedness of both the sample slopes and the approximate slopes, is crucial for proving that the noise introduced by the observation of the sample slopes, which replace the observation of true expectations, goes to zero as the number of iterations of the algorithm goes to infinity, as is stated in the next lemma.

Lemma 2. For (P, R*) ∈ S*_t,

{s̄^{n-}_t(P, R*)}_{n≥0} → 0 and {s̄^{n+}_t(P, R*)}_{n≥0} → 0 a.s. (23)

Proof of lemma 2. Given in the appendix.

Using the convention that the minimum of an empty set is +∞, let

δ^k_L = min{ ((HL^k)_t(P, R*) - L^k_t(P, R*)) / 4 : (P, R*) ∈ S^+_t, (HL^k)_t(P, R*) > L^k_t(P, R*) }.

If δ^k_L < +∞ we define an integer N_L ≥ N^k_t to be such that

∏_{m=N^k_t}^{n-1} (1 - ᾱ^m_t(P, R*)) ≤ 1/4  and  s̄^{n-1,+}_t(P, R*) ≤ δ^k_L, (24)

for all n ≥ N_L and states (P, R*) ∈ S^+_t. Such an N_L exists because both (9) and (23) are true. If δ^k_L = +∞, then, for all states (P, R*) ∈ S^+_t, (HL^k)_t(P, R*) = L^k_t(P, R*), since (17) tells us that HL^k ≥ L^k. Thus, L^{k+1}_t(P, R*) = L^k_t(P, R*) and we define the integer N_L to be equal to N^k_t.

We can apply symmetric reasoning to determine δ^k_U and N_U. We just need to consider the deterministic bounding sequence {U^k}_{k≥0}, the set S^-_t and the noise sequence {s̄^{n-}_t}_{n≥0} instead of {L^k}_{k≥0}, S^+_t and {s̄^{n+}_t}_{n≥0}, respectively. Finally, let N^{k+1}_t = max(N_L, N_U, N^{*,k+1}_{t+1}).

First, pick a state (P, R*) ∈ S^+_t. If L^{k+1}_t(P, R*) = L^k_t(P, R*), then inequality L^{k+1}_t(P, R*) ≤ v̄^{n-1}_t(P, R*) follows from the induction hypothesis. We therefore concentrate on the case where L^{k+1}_t(P, R*) > L^k_t(P, R*).

First, we define the stochastic bounding sequences {l̄^n_t}_{n≥0} and {ū^n_t}_{n≥0}. For each (P, R) ∈ S̃, we have l̄^n_t(P, R) = L^k_t(P, R) and ū^n_t(P, R) = U^k_t(P, R) for n < N^k_t, and, for n ≥ N^k_t,

l̄^n_t(P, R) = (1 - ᾱ^n_t(P, R)) l̄^{n-1}_t(P, R) + ᾱ^n_t(P, R)(HL^k)_t(P, R)
ū^n_t(P, R) = (1 - ᾱ^n_t(P, R)) ū^{n-1}_t(P, R) + ᾱ^n_t(P, R)(HU^k)_t(P, R).

A simple inductive argument proves that ū^n_t(P, R) is a convex combination of U^k_t(P, R) and (HU^k)_t(P, R), while l̄^n_t(P, R) is a convex combination of L^k_t(P, R) and (HL^k)_t(P, R). Therefore we can write, with b^{n-1} = ∏_{m=N^k_t}^{n-1} (1 - ᾱ^m_t(P, R*)),

l̄^{n-1}_t(P, R*) = b^{n-1} L^k_t(P, R*) + (1 - b^{n-1})(HL^k)_t(P, R*).

For n ≥ N^{k+1}_t ≥ N_L, we have b^{n-1} ≤ 1/4. Moreover, L^k_t(P, R*) ≤ (HL^k)_t(P, R*). Thus, using (15) and the definition of δ^k_L, we obtain

l̄^{n-1}_t(P, R*) ≥ (1/4) L^k_t(P, R*) + (3/4)(HL^k)_t(P, R*)
= (1/2) L^k_t(P, R*) + (1/2)(HL^k)_t(P, R*) + (1/4)((HL^k)_t(P, R*) - L^k_t(P, R*))
≥ L^{k+1}_t(P, R*) + δ^k_L. (25)
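The recursions defining l̄^n_t, ū^n_t and the noise sequences are all instances of a single stochastic-approximation smoothing step. The sketch below is an illustration only, with hypothetical names (`prev`, `sample`, `alpha` are not the paper's notation):

```python
def smooth(prev, sample, alpha, clip_at_zero=False):
    """One smoothing step: (1 - alpha) * prev + alpha * sample.

    The bounding sequences l-bar and u-bar use the step as-is; the noise
    sequences additionally clip the result at zero, as in their definition.
    """
    val = (1.0 - alpha) * prev + alpha * sample
    return max(0.0, val) if clip_at_zero else val
```

Iterated with a fixed `sample`, the result is a convex combination of the starting value and that sample, with weight on the start equal to the product of the (1 - alpha) factors; this is exactly the property exploited in the bound above.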

Again for n ≥ N^{k+1}_t ≥ N_L, the following lemma is used to show that

v̄^{n-1}_t(P, R*) ≥ l̄^{n-1}_t(P, R*) - s̄^{n-1,+}_t(P, R*). (26)

Lemma 3. For n ≥ N^k_t,

(HL^k)_t(P^n_t, R^n_t) ≤ (H v̄^{n-1}_t)(P^n_t, R^n_t) ≤ (HU^k)_t(P^n_t, R^n_t), if R^n_t > 0,
(HL^k)_t(P^n_t, R^n_t + 1) ≤ (H v̄^{n-1}_t)(P^n_t, R^n_t + 1) ≤ (HU^k)_t(P^n_t, R^n_t + 1), if R^n_t < M.

Moreover,

v̄^{n-1}_t(P, R*) ≤ ū^{n-1}_t(P, R*) + s̄^{n-1,-}_t(P, R*), if (P, R*) ∈ S^-_t,
v̄^{n-1}_t(P, R*) ≥ l̄^{n-1}_t(P, R*) - s̄^{n-1,+}_t(P, R*), if (P, R*) ∈ S^+_t.

Proof of lemma 3. Given in the appendix.

Combining (25) and (26), we obtain, for all n ≥ N^{k+1}_t ≥ N_L,

v̄^{n-1}_t(P, R*) ≥ L^{k+1}_t(P, R*) + δ^k_L - s̄^{n-1,+}_t(P, R*) ≥ L^{k+1}_t(P, R*) + δ^k_L - δ^k_L = L^{k+1}_t(P, R*),

where the last inequality follows from (24).

To finish the proof of part 1, we pick a state (P, R*) ∈ S^-_t. The reasoning for U^{k+1}_t(P, R*) is symmetrical to that for L^{k+1}_t(P, R*), which completes our induction. Thus, we have proved that, for all k ≥ 0, there exists N^k_t such that (20) and (21) hold for all n ≥ N^k_t. This concludes the first part of the proof.

Part 2: In this part, we take care of the states (P, R*) ∈ S*_t \ (S^+_t ∪ S^-_t), because if (P, R*) ∈ S^+_t (S^-_t), we have already proved in part 1 that, for all k ≥ 0, there exists N^k_t such that if n ≥ N^k_t, then v̄^{n-1}_t(P, R*) ≥ L^k_t(P, R*) (≤ U^k_t(P, R*)). In contrast to part 1, the proof technique here is not by forward induction on k.

A discussion about the projection operation is in order, as this part of the proof is all about states for which the projection operation decreased or increased the corresponding approximate slopes infinitely often. If for all (P, R*) ∈ S*_t the corresponding optimal slopes v_t(P, R*) are distinct, then S*_t = S^+_t = S^-_t and Part 2 is not necessary. However, this fact is not verifiable. Figure 4 illustrates a typical situation where S*_t \ (S^+_t ∪ S^-_t) ≠ ∅.

[Figure 4: Optimal slopes that can lead to S*_t \ (S^+_t ∪ S^-_t) ≠ ∅; in the picture, N^+(P, R* - 2) = N^+(P, R* - 1) = N^+(P, R*) = ∞.]

An important property of the projection operator is that all the slopes to the left of R^n_t changed by the projection operation are increased to be equal to the new slope at R^n_t. Similarly, all the slopes to the right of R^n_t + 1 changed by the projection operation are decreased to be equal to the slope at R^n_t + 1 (see figure 2c).

There is another interesting property that is necessary for the proof of Part 2. Let (P, R*) ∈ S*_t \ S^+_t. We argued in section 4 that the state (P, R^Min) is an element of S^+_t, where R^Min is the minimum asset level of the set {R : (P, R) ∈ S*_t}. Therefore, the state (P, R*+), where R*+ is the maximum asset level smaller than R* such that (P, R*+) ∈ S^+_t, is well defined. We show next that for all asset levels R between R*+ and R* (inclusive), N^+(P, R) is also equal to infinity.

By definition of the set S^+_t, N^+(P, R*) = ∞. If (P, R* - 1) = (P, R*+) we are done. Otherwise, we have to consider two cases. First, if (P, R* - 1) ∈ S*_t, then N^+(P, R* - 1) is infinite either by the definition of the set S^+_t or from the fact that, in this case, (P, R* - 1) ∈ S*_t \ S^+_t. Second, if (P, R* - 1) ∉ S*_t, then the corresponding slope is never updated due to a direct observation of sample slopes

for n ≥ N. Moreover, every time the slope of (P, R*) is decreased due to a projection (which is coming from the left), the slope of (P, R* - 1) is decreased as well. Therefore, N^+(P, R*) ∩ {n ≥ N} ⊆ N^+(P, R* - 1) ∩ {n ≥ N}, implying that N^+(P, R* - 1) is infinite. We then apply the same reasoning for states (P, R* - 2), (P, R* - 3), ..., until we reach state (P, R*+). A symmetrical argument handles the states (P, R*) ∈ S*_t \ S^-_t.

With these properties in mind we go back to the proof of Part 2. We introduce the lemma that is the key element for the proof.

Lemma 4. If, for all k ≥ 0, there exists an integer N^k(P, R) such that v̄^{n-1}_t(P, R) ≥ L^k_t(P, R) for all n ≥ N^k(P, R), and N^+(P, R + 1) is infinite, then, for all k ≥ 0, there exists an integer N^k(P, R + 1) such that v̄^{n-1}_t(P, R + 1) ≥ L^k_t(P, R + 1) for all n ≥ N^k(P, R + 1). Similarly, if, for all k ≥ 0, there exists an integer N^k(P, R) such that v̄^{n-1}_t(P, R) ≤ U^k_t(P, R) for all n ≥ N^k(P, R), and N^-(P, R - 1) is infinite, then, for all k ≥ 0, there exists an integer N^k(P, R - 1) such that v̄^{n-1}_t(P, R - 1) ≤ U^k_t(P, R - 1) for all n ≥ N^k(P, R - 1).

Proof of lemma 4. Given in the appendix.

Pick k ≥ 0, a state (P, R*) ∈ S*_t \ S^+_t, and the state (P, R*+) ∈ S^+_t introduced in the projection discussion. Note that we can apply lemma 4 considering states (P, R*+) and (P, R*+ + 1) in order to obtain, for all k ≥ 0, an integer N^k(P, R*+ + 1) such that v̄^{n-1}_t(P, R*+ + 1) ≥ L^k_t(P, R*+ + 1) for all n ≥ N^k(P, R*+ + 1). After that, we make use of lemma 4 again, this time for states (P, R*+ + 1) and (P, R*+ + 2). Note that the first application of lemma 4 gave us the integer N^k(P, R*+ + 1), necessary to fulfill the conditions of this second usage of the lemma. We repeat the same reasoning, applying lemma 4 successively to the pairs of states (P, R*+ + 2) and (P, R*+ + 3), (P, R*+ + 3) and (P, R*+ + 4), ..., (P, R* - 1) and (P, R*). In the end, we obtain, for each k ≥ 0, an integer N^k(P, R*), such that v̄^{n-1}_t(P, R*) ≥ L^k_t(P, R*) for all n ≥ N^k(P, R*). Figure 5 illustrates this process.
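The two projection properties discussed above can be sketched in code. This is a minimal illustration, not the paper's implementation; `slopes` is a hypothetical list indexed by asset level, and we assume the stochastic update has just changed the entries at positions `r` and `r + 1`:

```python
def project_concave(slopes, r):
    """Restore non-increasing slopes after an update at levels r and r+1.

    Violating slopes to the left of r are raised to the new slope at r;
    violating slopes to the right of r+1 are lowered to the slope at r+1.
    """
    projected = list(slopes)
    # Raise slopes left of r that fell below the new slope at r.
    for i in range(r):
        if projected[i] < projected[r]:
            projected[i] = projected[r]
    # Lower slopes right of r+1 that rose above the slope at r+1.
    for i in range(r + 2, len(projected)):
        if projected[i] > projected[r + 1]:
            projected[i] = projected[r + 1]
    return projected
```

Provided the update leaves slopes[r] ≥ slopes[r + 1], the returned vector is non-increasing, i.e. the piecewise linear value function approximation stays concave.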

[Figure 5: Successive applications of lemma 4, moving from (P, R*+) through (P, R*+ + 1), (P, R*+ + 2), ..., up to (P, R*).]

Similarly, pick (P, R*) ∈ S*_t \ S^-_t. By successive applications of the second part of lemma 4 we obtain, for each k ≥ 0, an integer N^k(P, R*) such that v̄^{n-1}_t(P, R*) ≤ U^k_t(P, R*) for all n ≥ N^k(P, R*).

Finally, if we consider N^{*,k}_t = max(N^k_t, max_{(P, R*) ∈ S*_t \ (S^+_t ∪ S^-_t)} N^k(P, R*)), then (18) is true for all states (P_t, R_t) ∈ S*_t and n ≥ N^{*,k}_t. Consequently, (19) is also true for all states (P_t, R_t) ∈ S*_t.

6.3 Optimality of the Decisions

We are ready to prove (11), the second convergence result.

Theorem 2. For t = 0, ..., T-1, let (v*_t, R*_{t-1}, P*_t, x*_t) be an accumulation point of the sequence {(v̄^{n-1}_t, R^n_{t-1}, P^n_t, x^n_t)}_{n≥1} generated by the algorithm. Assume all conditions of theorem 1 are satisfied. Then, with probability one, x*_t is an optimal solution of

max_{0 ≤ x ≤ M} -P*_t x + V_t(P*_t, R*_{t-1} + x). (27)

Proof. At each iteration n and time t of the algorithm, the decision x^n_t is optimal with respect to the sample price P^n_t, the current asset level R^n_{t-1} and the value function approximation for price P^n_t, which is piecewise linear with integer breakpoints and is represented by its slopes

v̄^{n-1}_t(P^n_t, 1), ..., v̄^{n-1}_t(P^n_t, B). Therefore, it follows that

-P^n_t + v̄^{n-1}_t(P^n_t, R^n_{t-1} + x^n_t) > 0 and -P^n_t + v̄^{n-1}_t(P^n_t, R^n_{t-1} + x^n_t + 1) ≤ 0.

Then, by passing to the limit, we can conclude that each accumulation point (v*_t, R*_{t-1}, P*_t, x*_t) of the sequence {(v̄^{n-1}_t, R^n_{t-1}, P^n_t, x^n_t)}_{n≥1} satisfies

-P*_t + v*_t(P*_t, R*_{t-1} + x*_t) > 0 and -P*_t + v*_t(P*_t, R*_{t-1} + x*_t + 1) ≤ 0. (28)

Since states (P*_t, R*_{t-1} + x*_t) and (P*_t, R*_{t-1} + x*_t + 1) are elements of S*_t, it follows from theorem 1 that

v*_t(P*_t, R*_{t-1} + x*_t) = v_t(P*_t, R*_{t-1} + x*_t) a.s.

and

v*_t(P*_t, R*_{t-1} + x*_t + 1) = v_t(P*_t, R*_{t-1} + x*_t + 1) a.s.

This fact combined with (28) is sufficient to conclude the proof.

7 Experimental Results

The purpose of this section is to establish the computational benefits relative to other Monte Carlo-based algorithms as well as classical backward dynamic programming (where we have to assume the distribution is known). We start by giving a brief description of each approach to which we compare our algorithm.

In a batch-mode Monte-Carlo-based value iteration algorithm (Batch), at each iteration n, once a sample for the price process, reward and demand is gathered, sample slopes at all possible asset levels R are observed and used to update the corresponding slopes for the observed sampled prices P^n = (P^n_0, ..., P^n_{T-1}). That is, steps 2c and 2d of the algorithm described in figure 1 are replaced by

STEP 2c: Observe ˆv^n_{t+1}(R) according to (3) for all R such that (P^n_t, R) ∈ S̃.

STEP 2d: For (P, R) ∈ S̃,

z^n_t(P, R) = [(1 - α^n_t) v̄^{n-1}_t(P, R) + α^n_t ˆv^n_{t+1}(R)] 1_{P = P^n_t} + v̄^{n-1}_t(P, R) 1_{P ≠ P^n_t}.

One can argue that such a batch algorithm would make better use of the information in the sample realizations. Applying this method, which is synchronous in the sense that all the slopes for the observed prices are updated at once, we wish to see how it compares to an asynchronous approach (our algorithm). Our method is asynchronous in the sense that only two slopes are updated at each iteration n and time t (it can be more if a violation of concavity occurs).

A Real Time Dynamic Programming (RTDP) approach (Barto et al. (1995)) assumes that the distribution of the random variables is known, which is not the case for either our algorithm or the Batch one. We could consider an Approximate RTDP (ARTDP) approach (Barto et al. (1995)), which starts with an initial estimate of a distribution and then updates it during the iterations of the algorithm. However, ARTDP is at most as good as RTDP, so we assume the distribution is known and implement an RTDP method. Instead of using the sample slope given by (3), the RTDP algorithm uses

ˆv^n_{t+1}(R) = IE[ max(min(P_{t+1}, v̄^n_{t+1}(P_{t+1}, R)), v̄^n_{t+1}(P_{t+1}, M_{t+1})) 1_{t < T-1} + ˆr^n 1_{R ≤ ˆD^n} 1_{t = T-1} | P_t = P^n_t ]. (29)

That is, step 2c of the algorithm described in figure 1 is replaced by

STEP 2c: Observe ˆv^n_{t+1}(R^n_t) and ˆv^n_{t+1}(R^n_t + 1) according to (29).

Due to its known-distribution feature, Propositions 5.3 and 5.4 in Bertsekas & Tsitsiklis (1996) can be applied to obtain almost sure convergence results similar to ours. When we compare the computational results of this method to the computational results of our approach, we are measuring the tradeoff between the extra information given by the expectation versus the time spent to do this operation.

A very popular approach in the approximate dynamic programming literature is Q-learning (Abounadi et al. (2002), Rummery & Niranjan (1994), Even-Dar & Mansour (2004), Cybenko et al.

(1997), Tsitsiklis (1994), Duff (1995)), which, like our algorithm, is also often used as a model-free algorithmic strategy. A standard Q-learning algorithm (Watkins & Dayan (1992)) stores all possible state-action pairs and a proof of convergence requires all the corresponding Q-values to be sampled infinitely many times. One can argue that Q-values share the same characteristics as the value function around a post-decision state (which is the state considered in this paper), that is, the optimization is inside the expectation. Computationally, however, the two approaches are quite different. Instead of S̃, the state space in Q-learning would be S̃ × X, where X is our action space. Moreover, in Q-learning, first a decision is taken based only on the Q-values and then the realization of the random information is observed. In our case, first the realization of the random information is observed, then a decision is taken, which of course is also dependent on the value function for the future.

The size of the Q-learning state space makes this approach impossible to implement for this problem. The program would have taken an unaffordable amount of time and virtual memory to run. Therefore, instead of implementing a Q-learning approach, we propose and implement an algorithm that only stores the state after the decision is made and samples all possible actions infinitely often in a uniform way. This implies that all states are sampled infinitely often as well. This algorithm should be at least as good as standard Q-learning, due to a smaller state space. Moreover, theorem 1 can be used to show that it converges almost surely to the optimal slopes. Using this approach, step 2a of the algorithm described in figure 1 is replaced by

STEP 2a: Sample x^n_t according to a discrete uniform distribution between 0 and M.

With this algorithm, we try to infer how a pure exploitation scheme (our approach) compares to a pure exploration one. Experimental results (omitted) showed that this algorithm worked terribly, implying that using pure exploitation instead of pure exploration pays off.
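For contrast with the uniform-sampling STEP 2a above, the pure-exploitation decision itself is cheap to compute. The sketch below uses assumed names (`slopes[R]` standing in for the approximate slope v̄^{n-1}_t(P^n_t, R), non-increasing in R): the greedy order quantity is the largest x ≤ M for which the marginal unit still has approximate value above the current price, matching the optimality conditions used in the proof of theorem 2.

```python
def greedy_order(price, r_prev, slopes, m_max):
    """Largest x in {0, ..., m_max} whose last unit has slope above price.

    slopes: non-increasing list; slopes[R] approximates the marginal value
    of holding the R-th unit.  r_prev is the pre-decision asset level.
    """
    x = 0
    # Buy one more unit while its marginal value exceeds its purchase price.
    while (x < m_max and r_prev + x + 1 < len(slopes)
           and slopes[r_prev + x + 1] > price):
        x += 1
    return x
```

Because the approximation is concave, this one-pass rule recovers exactly the x satisfying -P + v̄(P, R + x) > 0 and -P + v̄(P, R + x + 1) ≤ 0.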
The instances considered in the experiments are described in table 1. Problems were randomly generated using different distributions for the rewards ˆr and initial prices P_0.

Moreover, both discrete uniform (DiscU) and Poisson demand distributions with different parameters were used. We also created different price processes, namely random walk (RW), mean reversion (MR) and geometric Brownian motion (GBM), all of which are described below. Even though all price processes are continuous, we use a discretization increment of 0.1 for all instances. It is important to emphasize that when using our algorithm, the discretization only occurs when representing the value functions - the prices in the simulated paths are still continuous. For all instances, the number of time periods considered is 10. That is, the random demand ˆD and the random reward ˆr are observed at T = 10. Furthermore, the upper bound on the decision quantity x_t, for t = 0, ..., T-1, is set to M_t = M = 400. Table 1 also conveys the size of the state space of each instance.

Ins.  State Space  Initial Price  Reward ˆr               Demand ˆD        Price Proc.
1     -            Constant 20    U(50, 60)               DiscU(180, 250)  RW
2     -            Constant 20    U(50, 60)               Poisson(200)     RW
3     -            ·U(1, 12)      P_{T-1}·U(1.03, 1.15)   Poisson(250)     MR
4     -            ·U(1, 12)      P_{T-1}·U(1.03, 1.15)   DiscU(180, 220)  MR
5     -            Constant 40    Constant 25             Poisson(300)     GBM
6     -            Constant 45    Constant 15             DiscU(225, 375)  GBM

Table 1: Instances description - T = 10, M_t = M = 400, discretization = 0.1.

Next, we give the details of the different price processes. The random walk price process is given by P_t = P_{t-1} + ˆP_t, where the price increment ˆP_t has the normal distribution with mean µ = 0.02 and standard deviation σ = 1.5. The mean reversion price process is given by P_t = P_{t-1} + ˆP_t + 0.5(B_t - P_{t-1}), where ˆP_t is uniformly distributed between 0.9 and 1.2, B_0 = 1.7 Ū(1, 12) and B_t = B_{t-1} Ū(0.9, 1.2), where Ū is the mean of the corresponding uniform distribution. Finally, the geometric Brownian motion process is given by P_t = P_{t-1} e^{ˆP_t}, where ˆP_t is normally distributed with mean µ and standard deviation σ. It is easy to see that when the random walk and the geometric Brownian motion are

considered, the slopes v_t(P, R) given by (2) are monotone increasing in the price dimension. Therefore, for all the different methods and instances 1, 2, 5, 6, this property is going to be imposed in order to speed up the rate of convergence.

We now describe how the experiments were conducted, then present and analyze the results. Knowing the underlying distributions, as described in table 1, we computed the optimal policy using a classical backward dynamic programming technique, assuming that prices were discretized to the nearest 0.01. We next randomly generated 50 sets Ω_i, i = 1, ..., 50, where each set Ω_i consisted of sample paths. Each approximation algorithm was trained using the sample paths in Ω_i, generating a policy. This exercise was repeated 50 times, and the quality of the solution is based on an average over the 50 policies obtained using the 50 different training datasets.

In order to evaluate the policies, for each instance, we randomly generated a set Ω̂ of 800 sample paths. For each ω ∈ Ω̂, let X_t(ω) be the decision of the optimal policy (computed exactly using backward dynamic programming) at time t for sample path ω and let ˆX^{n,i}_t(ω) be the decision of the approximate policy i = 1, ..., 50, at time t for sample path ω ∈ Ω̂. The approximate policy i was obtained after n iterations of a given approximation algorithm when set Ω_i was the training dataset.

From this, for the exact policy, we computed, for ω ∈ Ω̂,

F(ω) = Σ_{t=0}^{T-1} -P_t(ω) X_t(ω) + ˆr(ω) min(ˆD_T(ω), R_{T-1}(ω))

and

F = Σ_{ω∈Ω̂} F(ω).
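The sample paths in Ω_i and Ω̂ are draws of the price processes described earlier. A minimal simulator is sketched below; the random-walk parameters (µ = 0.02, σ = 1.5) are the ones given in the text, the mean-reversion step uses the means of the stated uniform ranges, and the GBM drift and volatility are left as arguments (`gbm_mu`, `gbm_sigma`) with placeholder defaults, since their values are not pinned down above:

```python
import math
import random

def simulate_prices(p0, T, process, mu=0.02, sigma=1.5,
                    gbm_mu=0.003, gbm_sigma=0.04):
    """Simulate one price path P_0, ..., P_{T-1} for 'RW', 'MR' or 'GBM'."""
    prices = [p0]
    b = 1.7 * (1 + 12) / 2.0              # B_0 = 1.7 * mean of U(1, 12)
    for _ in range(T - 1):
        p = prices[-1]
        if process == 'RW':               # random walk with normal increments
            p_next = p + random.gauss(mu, sigma)
        elif process == 'MR':             # mean reversion toward B_t
            b *= (0.9 + 1.2) / 2.0        # B_t = B_{t-1} * mean of U(0.9, 1.2)
            p_next = p + random.uniform(0.9, 1.2) + 0.5 * (b - p)
        else:                             # geometric Brownian motion
            p_next = p * math.exp(random.gauss(gbm_mu, gbm_sigma))
        prices.append(p_next)
    return prices
```

Note that the paths are kept continuous, as in the text; discretization to increments of 0.1 would apply only when indexing the value function approximation.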

For the approximate policy, we computed, for ω ∈ Ω̂,

ˆF^{n,i}(ω) = Σ_{t=0}^{T-1} -P_t(ω) ˆX^{n,i}_t(ω) + ˆr(ω) min(ˆD(ω), R_{T-1}(ω)).

Next, these values are averaged to obtain

F^n(ω) = (1/50) Σ_{i=1}^{50} ˆF^{n,i}(ω).

Finally, we computed

F^n = Σ_{ω∈Ω̂} F^n(ω).

Table 2 shows the time (in seconds) that it took each method to be 10%, 1%, ..., 10^{-4}% away from the optimal solution given by the classical dynamic programming (CDP) technique. It also shows the time to compute the CDP solution. The error is measured according to

η^n = 100 (F - F^n) / F.

All methods were limited to 2 million iterations. Note that instances 3 and 4 did not reach the 10^{-2}% level. This is due to the fact that these instances use the mean reversion price process, and the monotone increasing property of the slopes in the price dimension does not apply to this process. Hence, this property could not be imposed in order to speed up convergence.

Table 2 also conveys that the computational time for the Batch approach is much higher than the computational time of the ADP approach. It follows that even though the Batch method makes better use of the information in each sample realization, this does not translate into better solutions in competitive time, showing that our asynchronous algorithm performs better than the synchronous one. The same is true for the RTDP approach. More information, given by the expectation instead of a sample realization, does not result in an improvement in the solutions when the same amount of time is considered for both the ADP and RTDP approaches.
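The final comparison reduces to one number per method and iteration count. A sketch with hypothetical names (`f_by_policy` holding the 50 totals ˆF^{n,i} summed over Ω̂), assuming the percentage scaling implied by the % levels of table 2:

```python
def optimality_gap(f_opt, f_by_policy):
    """Percentage gap eta^n = 100 * (F - F^n) / F.

    f_opt: total objective of the exact (CDP) policy over the test paths.
    f_by_policy: per-trained-policy totals; their mean plays the role of F^n.
    """
    f_bar = sum(f_by_policy) / len(f_by_policy)
    return 100.0 * (f_opt - f_bar) / f_opt
```

For example, an exact total of 200 against trained-policy totals of 180 and 190 gives a gap of 7.5%.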


More information

The Strong Law of Large Numbers

The Strong Law of Large Numbers Lecure 9 The Srong Law of Large Numbers Reading: Grimme-Sirzaker 7.2; David Williams Probabiliy wih Maringales 7.2 Furher reading: Grimme-Sirzaker 7.1, 7.3-7.5 Wih he Convergence Theorem (Theorem 54) and

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t.

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t. Econ. 5b Spring 999 C. Sims Discree-Time Sochasic Dynamic Programming 995, 996 by Chrisopher Sims. This maerial may be freely reproduced for educaional and research purposes, so long as i is no alered,

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

A Hop Constrained Min-Sum Arborescence with Outage Costs

A Hop Constrained Min-Sum Arborescence with Outage Costs A Hop Consrained Min-Sum Arborescence wih Ouage Coss Rakesh Kawara Minnesoa Sae Universiy, Mankao, MN 56001 Email: Kawara@mnsu.edu Absrac The hop consrained min-sum arborescence wih ouage coss problem

More information

Estimation of Poses with Particle Filters

Estimation of Poses with Particle Filters Esimaion of Poses wih Paricle Filers Dr.-Ing. Bernd Ludwig Chair for Arificial Inelligence Deparmen of Compuer Science Friedrich-Alexander-Universiä Erlangen-Nürnberg 12/05/2008 Dr.-Ing. Bernd Ludwig (FAU

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Chapter 3 Boundary Value Problem

Chapter 3 Boundary Value Problem Chaper 3 Boundary Value Problem A boundary value problem (BVP) is a problem, ypically an ODE or a PDE, which has values assigned on he physical boundary of he domain in which he problem is specified. Le

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

arxiv: v1 [math.pr] 19 Feb 2011

arxiv: v1 [math.pr] 19 Feb 2011 A NOTE ON FELLER SEMIGROUPS AND RESOLVENTS VADIM KOSTRYKIN, JÜRGEN POTTHOFF, AND ROBERT SCHRADER ABSTRACT. Various equivalen condiions for a semigroup or a resolven generaed by a Markov process o be of

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales Advances in Dynamical Sysems and Applicaions. ISSN 0973-5321 Volume 1 Number 1 (2006, pp. 103 112 c Research India Publicaions hp://www.ripublicaion.com/adsa.hm The Asympoic Behavior of Nonoscillaory Soluions

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Convergence of the Neumann series in higher norms

Convergence of the Neumann series in higher norms Convergence of he Neumann series in higher norms Charles L. Epsein Deparmen of Mahemaics, Universiy of Pennsylvania Version 1.0 Augus 1, 003 Absrac Naural condiions on an operaor A are given so ha he Neumann

More information

An Introduction to Malliavin calculus and its applications

An Introduction to Malliavin calculus and its applications An Inroducion o Malliavin calculus and is applicaions Lecure 5: Smoohness of he densiy and Hörmander s heorem David Nualar Deparmen of Mahemaics Kansas Universiy Universiy of Wyoming Summer School 214

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions This aricle was downloaded by: [140.1.241.64] On: 05 January 2016, A: 21:41 Publisher: Insiue for Operaions Research and he Managemen Sciences (INFORMS) INFORMS is locaed in Maryland, USA Operaions Research

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

6. Stochastic calculus with jump processes

6. Stochastic calculus with jump processes A) Trading sraegies (1/3) Marke wih d asses S = (S 1,, S d ) A rading sraegy can be modelled wih a vecor φ describing he quaniies invesed in each asse a each insan : φ = (φ 1,, φ d ) The value a of a porfolio

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems Essenial Microeconomics -- 6.5: OPIMAL CONROL Consider he following class of opimizaion problems Max{ U( k, x) + U+ ( k+ ) k+ k F( k, x)}. { x, k+ } = In he language of conrol heory, he vecor k is he vecor

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information