A Comparison of Approximate Dynamic Programming Techniques on Benchmark Energy Storage Problems: Does Anything Work?

Daniel R. Jiang, Thuy V. Pham, Warren B. Powell, Daniel F. Salas, and Warren R. Scott

Abstract: As more renewable, yet volatile, forms of energy like solar and wind are being incorporated into the grid, the problem of finding optimal control policies for energy storage is becoming increasingly important. These sequential decision problems are often modeled as stochastic dynamic programs, but when the state space becomes large, traditional (exact) techniques such as backward induction, policy iteration, or value iteration quickly become computationally intractable. Approximate dynamic programming (ADP) thus becomes a natural solution technique for solving these problems to near optimality using significantly fewer computational resources. In this paper, we compare the performance of the following: various approximation architectures with approximate policy iteration (API), approximate value iteration (AVI) with structured lookup table, and direct policy search on a benchmarked energy storage problem (i.e., the optimal solution is computable).

I. INTRODUCTION

In this paper, we investigate the effectiveness of several techniques that fall under the realm of approximate dynamic programming (ADP) on a simple energy storage and allocation problem (previously described in [1] and [2]): we seek to optimally control (profit maximization) a storage device that interacts with both the grid and an uncertain energy supply (i.e., wind) in order to meet demand. In our benchmarks, we consider a stochastic wind supply, stochastic electricity prices (from the grid), and a deterministic demand. We use this problem class because it can be simplified through discretization (and possibly dimensionality reduction) to obtain benchmark problems that can be solved optimally. The idea is to use these benchmark problems to provide insights into the performance of a variety of ADP strategies (for an overview of traditional methods in ADP, see, e.g., [3], [4], [5]). A precise formulation of this problem is given in Section III.

Algorithmically, we consider solution techniques that are variants of approximate policy iteration (API) and approximate value iteration (AVI). The basis for both of these algorithms is a value function approximation (VFA) (the value function is also known as the cost-to-go function), and thus, by altering the approximation architecture, we arrive at a family of ADP algorithms. For API, we test several methods typically found in the machine learning (ML) literature to approximate the value function: support vector regression (SVR), Gaussian process regression (GPR), local parametric methods (LPR), and a clustering method called Dirichlet clouds with radial basis functions (DCR). In the case of AVI, we consider lookup table techniques that exploit the structural properties of the problem at hand: monotonicity (the use of the natural concavity in this problem was studied previously in [2]). Although lookup table by itself can be a very limited method, we find that the additional knowledge of problem structure makes it an extremely effective solution method, even when compared to more advanced statistical estimation methods.

This paper reports on the performance of a variety of approximation methods that have been developed in the approximate dynamic programming community, tested using a series of optimal benchmark problems drawn from a relatively simple energy storage application. These suggest that methods based on Bellman error minimization, using both approximate value iteration and approximate policy iteration, work surprisingly poorly if we use approximation methods drawn from machine learning. Pure lookup table also works poorly.
By contrast, a simple cost function approximation estimated using policy search works remarkably well, hinting that the problem is not the approximation architecture (though this method does not scale to more complex policies). In addition, lookup table methods that exploit convexity or monotonicity (if applicable) work extremely well, but do not scale to complex state-of-the-world variables. The implications for many current ADP algorithms are not encouraging, which signals the need for further work in this area.

The paper is organized as follows. In Section II, we give a brief literature review. Section III provides the mathematical formulation for the problem and discusses its inherent structure. Next, Sections IV-V give an overview of the algorithmic techniques that we employ, followed by numerical work (including previous work) in Sections VI-X. We conclude in Section XI.

II. LITERATURE REVIEW

The problem of energy storage, and its closely related problems in inventory and asset management, has been widely studied. For example, in [6], the authors derive, under an assumption on the distribution of wind energy, an analytical solution to an energy commitment problem in the presence of storage. The mathematical formulation is similar regardless of the exact application; [7] and [8], for example, present different techniques (including optimal switching and ADP) to study control policies of natural gas storage facilities. Moreover, [9] and [10] study the optimization of a hydroelectric reservoir, with the additional complication of bidding day-ahead. The second paper, [10], uses a method based on stochastic dual dynamic programming (SDDP).

SDDP and its related methods use Benders cuts, but the theoretical work in this area relies on the assumption that random variables only have a finite set of outcomes [11] (and is thus difficult to scale to larger problems). Taking a slightly different point of view, [12] considers the capacity value of energy storage by solving a dynamic program. Broader works include [13], [14], [15], and [16], all of which solve related problems that involve storage and a generic asset or commodity. Simple, scalar storage (or inventory) problems can be easily solved using backward dynamic programming (see [17]), but these methods quickly become intractable as we add additional state-of-the-world variables, leading us to consider the use of approximate dynamic programming. [1] uses approximate policy iteration with a parametric linear model (i.e., basis functions), least squares temporal difference (LSTD) learning, and Bellman error minimization to solve the same energy allocation problem that we consider here. [18] takes an alternative approach to the policy evaluation step and uses neural networks (in this paper, we use nonparametric models). [2] uses the natural concavity of the value functions to speed up the convergence of a TD(1) algorithm (see [19]). [16] takes a similar approach of exploiting concavity for a generic problem with a scalar resource, but within an approximate value iteration framework. Moreover, [20] considers a simple storage problem motivated by mutual fund management and solves it using a lookup table approach exploiting concavity. Also taking advantage of structure, [21] exploits the monotonicity in the value functions in a lookup table approach to solving an optimal bidding and storage problem. In both the cases of [20] and [21], pure lookup table without structure does not work in practice within reasonable time constraints. [7] solves the natural gas storage control problem through the discretization of a continuous time model and the application of a basis function approximation of the value function. One of the few works to consider a nonparametric approximation of the value function, [15] employs Dirichlet process mixture models to cluster states and then uses a convex model within each cluster. As can be seen from the literature, it is generally the case that a specific algorithm is applied to a specific application. The contribution of this paper is to empirically compare the effectiveness of several popular ADP methods on a common set of problems derived from an energy storage application.

III. MATHEMATICAL FORMULATION

We now formulate the energy storage and allocation problem as a Markov decision process (MDP). Let t ∈ N be a discrete time index representing the decision epochs of the MDP (in this problem, t could be measured in hours or days). Over a finite horizon from t = 0 to t = T, our goal is to find a policy that maximizes expected profits. Let R_t ∈ R = [0, R^max] be the level of energy in storage at time t, where the storage device has charge and discharge efficiencies denoted by β^c and β^d, respectively, with both β^c and β^d in (0, 1). Also, let γ^c and γ^d be the maximum amounts of energy that can be charged or discharged, respectively, from the storage device. For example, suppose that our storage device is a 1 MW battery (meaning that it can charge and discharge at a rate of 1 MW) and we make allocation decisions every hour. In this case, we have that γ^c = γ^d = 1 MWh. Let E_t be the amount of energy available from wind at time t and P_t be the spot price of electricity. Finally, suppose D_t is the amount of demand that must be satisfied at time t. To allow for different models (either deterministic or stochastic), we also define E^S_t, P^S_t, and D^S_t to be the state variables associated with the respective processes at time t.
As an example, if E_t is modeled as a Markov process, then E^S_t = E_t, and if D_t is modeled as a deterministic process, then D^S_t = {}. Hence, the state variable for the problem is S_t = (R_t, E^S_t, P^S_t, D^S_t). To abbreviate, let W_t = (E^S_t, P^S_t, D^S_t) ∈ W and S_t = (R_t, W_t). Throughout this paper, we operate under the assumption that the process W_t is independent of R_t. Next, we define the exogenous information, Ŵ_{t+1}, to be the change in W_t:

W_{t+1} = W_t + Ŵ_{t+1},

which of course is model dependent (the specific processes we use for benchmarking are defined in Section VI).

The decision problem is that, while anticipating the future value of storage, we must combine energy from the following three sources in order to fully satisfy the demand: 1) energy currently in storage, constrained by γ^c, γ^d, and R_t (represented by a decision x^rd_t); 2) newly available wind energy, constrained by E_t (represented by a decision x^wd_t); and 3) energy from the grid, at a spot price of P_t (represented by a decision x^gd_t). Additional allocation decisions are x^wr_t, the amount of wind energy to store; x^rg_t, the amount of energy to sell to the grid at price P_t; and x^gr_t, the amount of energy to buy from the grid and store. These allocation decisions are summarized by the six-dimensional, nonnegative decision vector

x_t = (x^wd_t, x^gd_t, x^rd_t, x^wr_t, x^gr_t, x^rg_t)^T ≥ 0,   (1)

and the constraints are as follows:

x^wd_t + β^d x^rd_t + x^gd_t = D_t,   (2)
x^rd_t + x^rg_t ≤ R_t,   (3)
x^wr_t + x^gr_t ≤ R^max − R_t,   (4)
x^wr_t + x^wd_t ≤ E_t,   (5)
x^wr_t + x^gr_t ≤ γ^c,   (6)
x^rd_t + x^rg_t ≤ γ^d.   (7)

The first constraint guarantees that demand is fully satisfied; (3) and (4) are storage capacity constraints; (5) states that the maximum amount of energy used from wind is bounded by E_t; and finally, (6) and (7) constrain the decisions to within the storage transfer rates. Let us denote the feasible set, determined by the constraints (1)-(7), by X_t(S_t). Suppressing the dependence on S_t for ease of notation, define X_t = X_t(S_t). See Figure 1 below for an illustrative summary of the problem, annotated with the components of x_t.

Fig. 1. Illustration of the energy storage/allocation problem: a flow network connecting the wind supply E_t, the grid, the storage device R_t, and the demand D_t, with arcs labeled by the decision components x^wr_t, x^gr_t, x^rd_t, x^wd_t, and x^gd_t.
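To make the single-period decision problem concrete, here is a minimal sketch (our own, not the authors' implementation) of the feasible region defined by (2)-(7), as reconstructed above, encoded as a linear program with scipy.optimize.linprog. The decision ordering and the helper name are assumptions, the default parameters mirror the benchmark settings of Section VI, and the objective uses only the immediate contribution; the algorithms in this paper would add the downstream term V̄^x_t(S^x_t) before maximizing.

```python
import numpy as np
from scipy.optimize import linprog

# Decision ordering (an assumption for this sketch): x = (x_wd, x_gd, x_rd, x_wr, x_gr, x_rg).
def solve_myopic_decision(R, E, P, D, R_max=30.0, beta_c=1.0, beta_d=1.0,
                          gamma_c=5.0, gamma_d=5.0):
    """Maximize the immediate contribution P*(D + beta_d*x_rg - x_gr - x_gd)
    over the feasible set defined by constraints (2)-(7)."""
    # linprog minimizes, so negate the decision-dependent part of the objective.
    c = -P * np.array([0.0, -1.0, 0.0, 0.0, -1.0, beta_d])
    A_eq = [[1.0, 1.0, beta_d, 0.0, 0.0, 0.0]]   # (2) x_wd + beta_d*x_rd + x_gd = D
    b_eq = [D]
    A_ub = [
        [0, 0, 1, 0, 0, 1],   # (3) x_rd + x_rg <= R
        [0, 0, 0, 1, 1, 0],   # (4) x_wr + x_gr <= R_max - R
        [1, 0, 0, 1, 0, 0],   # (5) x_wr + x_wd <= E
        [0, 0, 0, 1, 1, 0],   # (6) x_wr + x_gr <= gamma_c
        [0, 0, 1, 0, 0, 1],   # (7) x_rd + x_rg <= gamma_d
    ]
    b_ub = [R, R_max - R, E, gamma_c, gamma_d]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * 6, method="highs")
    if not res.success:
        raise ValueError("decision problem could not be solved: " + res.message)
    contribution = P * D - float(c @ res.x)   # add back the constant P*D term
    return res.x, contribution
```

Buying from the grid (x_gd) is always available, so the feasible set is never empty; the interesting trade-offs appear once a value-of-storage term is added to the objective.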

Let φ = (0, 0, −1, β^c, β^c, −1)^T be a vector containing the flow coefficients of a decision x_t with respect to the storage device. Then, the transition function is

R_{t+1} = R_t + φ^T x_t.   (8)

Note that there is no dependence on any random information, allowing us to easily take advantage of the post-decision formulation of this problem, to be made clear below. Now, we define the contribution function. For a given state S_t and decision x_t, we define

C(S_t, x_t) = P_t (D_t + β^d x^rg_t − x^gr_t − x^gd_t),

the profit realized at time t (we get paid for satisfying demand and for selling to the grid, but we must pay for any energy that originates from the grid). Using Bellman's optimality equation, we define value functions through the following set of recursive equations. Let V_T(S_T) = 0 and, for t ≤ T − 1,

V_t(S_t) = max_{x_t ∈ X_t} [ C(S_t, x_t) + E( V_{t+1}(S_{t+1}) | S_t ) ],   (9)

where it is understood that S_{t+1} depends on both S_t and x_t. For simulation and computational purposes, it is often troublesome to deal with an expectation operator within a maximum operator. As described in detail in [3], this can be remedied by appealing to the post-decision formulation of Bellman's equation. Essentially, the post-decision state S^x_t ∈ S^x is the state immediately after the decision x_t is made but before any new information has arrived, where S^x is the post-decision state space. The canonical form of the post-decision state is simply S^x_t = (S_t, x_t), but oftentimes it can be written in a more condensed way. Mathematically, it must be the case that S_{t+1} | S^x_t =_d S_{t+1} | (S_t, x_t) (equal in distribution). Let R^x_t = R_{t+1} as defined in (8); for our problem, due to the fact that R_{t+1} depends solely on R_t and x_t, the post-decision state is given by

S^x_t = (R^x_t, E^S_t, P^S_t, D^S_t) = (R^x_t, W_t).

We define the post-decision value function as V^x_t(S^x_t) = E( V_{t+1}(S_{t+1}) | S^x_t ), which gives us the following two relations:

V_t(S_t) = max_{x_t ∈ X_t} [ C(S_t, x_t) + V^x_t(S^x_t) ]   (10)

and

V^x_{t−1}(S^x_{t−1}) = E[ max_{x_t ∈ X_t} [ C(S_t, x_t) + V^x_t(S^x_t) ] | S^x_{t−1} ].   (11)

Equation (10) is useful for simulating a policy induced by a set of value functions, and equation (11) is used in the simulation steps of the various ADP algorithms. In [2], the concavity of the post-decision value functions is exploited as the VFAs are learned by the ADP algorithm. For this paper, in addition to the API variants, we consider for comparison a more recent algorithm that takes advantage of monotonicity, called Monotone ADP (see [22]). To do so, we give the following proposition.

Proposition 1. For each time t ≤ T − 1, the post-decision value function V^x_t(R^x_t, W_t) is nondecreasing in R^x_t.

Proof. We proceed by induction. Since V_T(S_T) = 0, it is clear that V^x_{T−1}(S^x_{T−1}) = 0 by definition and hence satisfies monotonicity. Assume that V^x_t(S^x_t) satisfies the monotonicity property (induction hypothesis) and consider (11). At time t − 1, fix two states S^x_{t−1} = (R^x_{t−1}, W_{t−1}) and S̃^x_{t−1} = (R̃^x_{t−1}, W_{t−1}), with both R^x_{t−1}, R̃^x_{t−1} ∈ R, such that R^x_{t−1} < R̃^x_{t−1}. Let ε = R̃^x_{t−1} − R^x_{t−1}. Denote S_t = (R_t, W_t) = (R^x_{t−1}, W_t) and S̃_t = (R̃_t, W_t) = (R̃^x_{t−1}, W_t), with S^x_t and S̃^x_t being the corresponding post-decision states. As before, let X_t = X_t(S_t), but also let X̃_t = X_t(S̃_t). We aim to show that the following inequality holds for any outcome of W_t | W_{t−1} (note that W_t | S^x_{t−1} =_d W_t | S̃^x_{t−1}, so the distribution of the exogenous information is the same in both situations):

max_{x_t ∈ X_t} [ C(S_t, x_t) + V^x_t(S^x_t) ] ≤ max_{x_t ∈ X̃_t} [ C(S̃_t, x_t) + V^x_t(S̃^x_t) ].

Note the differing feasible sets X_t and X̃_t. Denote the optimal solution to the left-hand side of the inequality by x*_t and the optimal value of the objective by F*_t. Now, there are two cases:
1) x*_t ∈ X̃_t. Using this same decision on the right-hand side as well, we see that since R_t < R̃_t, we have R^x_t < R̃^x_t. Using C(S_t, x*_t) = C(S̃_t, x*_t) and the induction hypothesis, we conclude that C(S̃_t, x*_t) + V^x_t(S̃^x_t) ≥ F*_t. Since there exists a feasible solution, namely x*_t, in the new decision space X̃_t that achieves an objective value greater than or equal to F*_t, the inequality is verified.

2) x*_t ∉ X̃_t. To get from X_t to X̃_t, constraint (3) is relaxed by ε and constraint (4) is tightened by ε. Therefore, it must be the case that constraint (4) is violated by x*_t:

x^wr,*_t + x^gr,*_t > R^max − R̃_t = R^max − R_t − ε.

To construct a feasible solution x̃_t ∈ X̃_t from x*_t, let us simply decrease x^wr_t + x^gr_t until (4) is satisfied. That is, choose x̃^wr_t and x̃^gr_t such that x̃^wr_t + x̃^gr_t = R^max − R̃_t. It is clear that

(x^wr,*_t + x^gr,*_t) − (x̃^wr_t + x̃^gr_t) ≤ ε,

and thus, from the resource transition function (8), we see that R̃^x_t ≥ R^x_t. Also, C(S_t, x*_t) = C(S̃_t, x̃_t), so by the induction hypothesis, we have shown the existence of a feasible solution x̃_t in X̃_t such that C(S̃_t, x̃_t) + V^x_t(S̃^x_t) ≥ F*_t, and the original inequality is verified.

Because this is true for any realization of W_t, monotonicity holds in expectation as well and the proof is complete.
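For reference, a small sketch (ours, not from the paper) of the storage transition (8) and the contribution function, using the flow-coefficient vector φ = (0, 0, −1, β^c, β^c, −1)^T and the decision ordering of (1) as reconstructed above:

```python
import numpy as np

# Decision ordering as in (1): x = (x_wd, x_gd, x_rd, x_wr, x_gr, x_rg).
def storage_transition(R, x, beta_c=1.0):
    """Post-decision resource level R^x_t = R_t + phi^T x_t, equation (8)."""
    phi = np.array([0.0, 0.0, -1.0, beta_c, beta_c, -1.0])
    return R + phi @ x

def contribution(P, D, x, beta_d=1.0):
    """Immediate profit C(S_t, x_t) = P_t*(D_t + beta_d*x_rg - x_gr - x_gd)."""
    x_wd, x_gd, x_rd, x_wr, x_gr, x_rg = x
    return P * (D + beta_d * x_rg - x_gr - x_gd)
```

Both functions are deterministic given the decision, which is what makes the post-decision state (R^x_t, W_t) so compact for this problem.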

Monotonicity often exists in the state-of-the-world dimensions as well, but this depends on the model of the random processes used. We show how to take advantage of this structural property in Section V.

IV. APPROXIMATE POLICY ITERATION

Exact policy iteration involves two main steps: policy evaluation and policy improvement (see, e.g., [5]). The (exact) policy evaluation step can be completed using a matrix inversion (solving Bellman's optimality equations), but this is often intractable. One option for approximating the policy evaluation step is to apply exact value iteration for a large number of iterations, but even this is difficult for complex problems with large state spaces and impossible when the problem admits a continuous state space. In our implementation of the approximate policy evaluation step (for a finite horizon model), we take a simulation approach where we first generate observations of the value function for the fixed policy and then fit a model to the observations. Consider a fixed policy π = (π_1, π_2, ..., π_T) and a general approximation architecture Q, which takes a set of samples Z = {(x_i, y_i)}_{i=1}^M (with x_i ∈ X and y_i ∈ Y) and produces a model Q(Z, ·) that maps from X to Y. Figure 2 provides the precise steps taken to perform approximate policy evaluation, given π, Q, and the number of samples desired, M. The idea is that we simulate the policy π from various initial states, keeping track of both the (post-decision) states S^{x,m}_t that we visit and the contributions C^m_t = C(S^m_t, x^m_t) that we receive. From this, we produce a set of samples Z_t (see Step 5 of Figure 2) that is used by Q to produce an approximation; a condensed code sketch of this loop is given below.

Approximate Policy Evaluation (Inputs: policy π, approximation Q, sample size M)
Step 0. Set m = 1.
Step 1. Select an initial post-decision state S^{x,m}_0.
Step 2. For t = 1, 2, ..., (T − 1):
  Step 2a. Sample W_t and set the pre-decision state: S^m_t = (R^{x,m}_{t−1}, W_t).
  Step 2b. Apply the policy to receive a decision: x^m_t = π_t(S^m_t).
  Step 2c. Compute the contribution: C^m_t = C(S^m_t, x^m_t).
  Step 2d. Compute the next post-decision state S^{x,m}_t using (8).
Step 3. Compute observations of the time-dependent value function. For each t, set v^m_t = Σ_{τ=t}^{T−1} C^m_τ.
Step 4. If m < M, increment m and return to Step 1.
Step 5. Denote the set of samples by Z_t = {(S^{x,m}_t, v^m_t)}_{m=1}^M. Using the approximation model, return V̄^x_t(·) = Q(Z_t, ·).

Fig. 2. Approximate Policy Evaluation Step for API

With the policy evaluation step defined, we define the API algorithm by essentially replacing the exact policy evaluation step in traditional policy iteration with the approximate version. As mentioned above, the algorithm iterates the two steps of policy evaluation and improvement, shown in Figure 3.

Approximate Policy Iteration (Inputs: approximation Q, sample size M, iterations N)
Step 0. Set an initial policy π^0; set n = 1.
Step 1. Use Approximate Policy Evaluation with arguments (π^{n−1}, Q, M) to compute V̄^{x,n−1}_t for each t.
Step 2. Policy improvement step:
  π^n_t(S_t) = arg max_{x_t ∈ X_t} [ C(S_t, x_t) + V̄^{x,n−1}_t(S^x_t) ].
Step 3. If n < N, increment n and return to Step 1.

Fig. 3. Approximate Policy Iteration Algorithm

A. Choices of Approximation Architecture Q

In this section, we give a brief introduction to each of the following approximation architectures Q (for a detailed treatment, see the corresponding literature). The motivation for choosing nonparametric estimators is that not only have they received the most attention in the statistics and machine learning communities, but they also require little to no problem-specific tuning, as opposed to parametric variants. The more traditional technique of LSTD was tested on the same problem in [1].
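Before turning to the individual architectures, the following is a condensed Python sketch of the approximate policy evaluation loop of Figure 2. It is our own illustration, not the paper's code: the callables policy, sample_W, initial_post_state, contribution, and transition are assumed to be supplied by the user and to encode the model of Section III, and scikit-learn's SVR is used as a stand-in for the R-based learners described below (any of them could be substituted in Step 5).

```python
import numpy as np
from sklearn.svm import SVR  # stand-in for the R-based learners used in the paper

def approximate_policy_evaluation(policy, sample_W, initial_post_state,
                                  contribution, transition, T, M):
    """Figure 2 (sketch): simulate the fixed policy M times, then fit one
    regression model per time period mapping visited post-decision states to
    the observed cumulative downstream contributions."""
    samples = {t: ([], []) for t in range(1, T)}            # t -> (states, values)
    for m in range(M):
        R_post, W = initial_post_state()                    # Step 1
        rewards, visited = [], []
        for t in range(1, T):                               # Step 2
            W = sample_W(t, W)                              # 2a: new exogenous information
            S = (R_post, W)                                 # pre-decision state
            x = policy(t, S)                                # 2b: decision from the policy
            rewards.append(contribution(S, x))              # 2c: contribution C^m_t
            R_post = transition(S, x)                       # 2d: post-decision resource, eq. (8)
            visited.append((t, np.r_[R_post, W]))
        # Step 3: v^m_t is the sum of contributions from t onward
        values = np.cumsum(rewards[::-1])[::-1]
        for (t, s_post), v in zip(visited, values):
            samples[t][0].append(s_post)
            samples[t][1].append(v)
    vfas = {}                                               # Step 5: fit Q for each t
    for t, (X, y) in samples.items():
        vfas[t] = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(np.array(X), np.array(y))
    return vfas  # vfas[t].predict(...) approximates the post-decision value at time t
```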
Assume that the notation is self-contained for each of the following subsections; in addition, for purposes of presentation, we have removed the subscript and superscript from the notation V̄^x_t(s) and use V̄(s) instead.

Support vector regression (SVR), originally introduced in [23], is an extension of the well-known support vector machine (SVM) algorithm for classification to the problem of regression. Also see [24] for an overview of SVR and implementations. We are given a linear model

V̄(s) = Σ_{f ∈ F} θ_f φ_f(s) = ⟨θ, φ(s)⟩,

where F is a set of features, the φ_f are basis functions, and the θ_f are weights. Let the training dataset be represented by (s_m, y_m) for m = 1, 2, ..., M. The essential idea of SVR is to choose a hyperplane defined by the weights θ_f, so that most of the training pairs fall within ε of the hyperplane while keeping the hyperplane as flat, or as simple, as possible, by minimizing ||θ|| = √⟨θ, θ⟩ (given two models that explain the training data, we prefer the simpler one that is less affected by noise in s_m). The optimization problem can be written as follows:

minimize    (1/2) ||θ||² + η Σ_{m=1}^M (ξ_m + ξ*_m)
subject to  y_m − ⟨θ, φ(s_m)⟩ ≤ ε + ξ_m,
            ⟨θ, φ(s_m)⟩ − y_m ≤ ε + ξ*_m,
            ξ_m, ξ*_m ≥ 0.

In the numerical work, we leverage the often-used and versatile Gaussian radial basis kernel. SVR is implemented using svm of the R package e1071 (with λ = 10 and ε = 0.01).

Gaussian process regression (GPR) is a Bayesian machine learning technique (see [25] for a thorough description) that allows us to model the unknown value function by a Gaussian process indexed by elements s of the state space S. A Gaussian process V̄ ∼ GP(m, k), specified by a mean function m(s) = E[V̄(s)] and a covariance function k(s, s′) = Cov[V̄(s), V̄(s′)], is a (possibly infinite) collection of random variables such that any finite set of them is jointly Gaussian. For the prior, a typical choice of mean function is m(s) = 0 (note that the posterior mean is not necessarily zero).

In our work, we choose k to be the Gaussian radial basis function

k(s, s′) = (1 / (2πσ)) exp( −||s − s′||² / (2σ²) ),

and we assume that we observe y = V̄(s) + ε, where ε ∼ N(0, σ_ε²). The essential step in GPR is computing the posterior Gaussian process by conditioning on the observed values. We implement GPR using gausspr of the R package kernlab.

Local polynomial regression (LPR), or more specifically, locally weighted scatterplot smoothing (LOESS), is a nonparametric technique used for estimating smooth functions [26]. As before, let (S, y) be the training set and suppose we are interested in estimating the value of V̄(s). Let θ(s) = (V̄(s), V̄′(s), ..., V̄^(l)(s)) and U(u) = (1, u, u²/2!, ..., u^l/l!). For any s_i ∈ S near s, the value V̄(s_i) can be approximated using the Taylor expansion θ(s)^T U(s_i − s). The LOESS estimator of θ is defined by

θ̂(s) = arg min_{θ ∈ R^{l+1}} Σ_{s_i ∈ S} [ y_i − θ^T U(s_i − s) ]² K( (s_i − s) / h ),

where K is a kernel function and h is a bandwidth parameter. In our numerical work, we use a second-order local polynomial fit (l = 2). LOESS is implemented using loess of the R package stats.

Dirichlet clouds with radial basis functions (DCR) is a method, developed in [27], that performs local regressions on clusters of data. As each training point is processed, a cluster for the point is chosen and the local (low-degree) polynomial function is updated recursively. Once again, we use the Gaussian radial basis function; using the notation from [27], let φ(r) = (1 / (2π)) exp(−r²/2). Let N_c be the total number of clusters, c_i be the centroid of the i-th cluster, and p_i be the polynomial fitted to the i-th cluster. The model can be summarized by the following equation; for a new state s,

V̄(s) = Σ_{i=1}^{N_c} p_i(s) φ(||s − c_i||) / Σ_{i=1}^{N_c} φ(||s − c_i||),

a weighted average of the predictions of the individual clusters. For a detailed description of when and how new clusters are created and the precise equations for the fitting of local polynomials, see [27]. This method is implemented in R.

As we can see, LPR and DCR are similarly motivated by local approximations, while SVR and GPR are significantly different: SVR is a more sophisticated version of the basis function technique, while GPR is a Bayesian method of modeling a function as a random process.

V. APPROXIMATE VALUE ITERATION WITH MONOTONICITY PRESERVATION

We now move away from API and consider another main ADP technique, approximate value iteration (AVI). The version of AVI for finite horizon problems that we consider is a forward simulation method that iteratively updates the VFA based on each new observation. A weakness of this method is that it requires a lookup table representation of the state space, something that the API methods do not require. Nevertheless, in this paper, all problems are discretized in order to compare against an optimal benchmark. We present a version of AVI that exploits the monotone structure of the problem (see Proposition 1), called Monotone ADP (MADP) [22].

First, we define some notation. Let V̄^{x,n}_t(s) be the estimate of the (post-decision) value function evaluated at s ∈ S at iteration n of the algorithm. The state that the algorithm visits at iteration n and time t is denoted S^{x,n}_t. Also, let α^n_t be a possibly stochastic stepsize sequence, with α^n_t(s) = α^n_t 1{S^{x,n}_t = s}. Consider two states s = (r, w) ∈ S and s′ = (r′, w′) ∈ S, with r, r′ ∈ R and w, w′ ∈ W. We say that s ⪯ s′ if and only if r ≤ r′ and w = w′, which are the necessary conditions to invoke Proposition 1. Now we define the monotonicity preservation operator Π_M (see Figure 4 for an illustration). In the following definition, suppose that v is the previous estimate of the value of a particular state s and that z^{x,n}_t is a new observation of the value of the currently visited state S^{x,n}_t. We define:

Π_M(S^{x,n}_t, z^{x,n}_t, s, v) =
  z^{x,n}_t            if s = S^{x,n}_t,
  max(z^{x,n}_t, v)    if S^{x,n}_t ⪯ s and s ≠ S^{x,n}_t,
  min(z^{x,n}_t, v)    if s ⪯ S^{x,n}_t and s ≠ S^{x,n}_t,
  v                    otherwise.
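As an illustration of Π_M (our own sketch), assume the lookup table for a fixed time t and a fixed state-of-the-world value w is stored as a numpy array indexed by the discretized resource level. The projection then amounts to raising every entry above the observed resource level to at least the new observation and capping every entry below it:

```python
import numpy as np

def enforce_monotonicity(values, r_idx, z):
    """Apply Pi_M to one slice of the lookup table: values[i] is the current
    estimate for the i-th discretized resource level (with w held fixed),
    r_idx is the index of the state just observed, and z is its smoothed value."""
    values = values.copy()
    values[r_idx] = z                                        # s = S^{x,n}: take z
    values[r_idx + 1:] = np.maximum(values[r_idx + 1:], z)   # more resource: raise to >= z
    values[:r_idx] = np.minimum(values[:r_idx], z)           # less resource: cap at <= z
    return values

# Example: an update at resource index 2 with new value 3.0.
print(enforce_monotonicity(np.array([0.0, 2.0, 1.0, 5.0, 4.0]), r_idx=2, z=3.0))
# -> [0. 2. 3. 5. 4.]
```

Note that Π_M only enforces consistency relative to the newly observed state; it does not sort the entire slice, so remaining violations elsewhere (such as the final two entries above) are left for later iterations to correct.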
The precise description of the algorithm is given in Figure 5.

Fig. 4. Illustration of the monotonicity preservation operator in the resource dimension (i.e., for a fixed t and a fixed outcome of W_t): new observations that introduce monotonicity violations in the discretized VFA are projected back so that the estimate remains nondecreasing in the resource level.

Monotone ADP Algorithm
Step 0a. Initialize V̄^{x,0}_t(s) = 0 for each t ≤ T − 1 and s ∈ S.
Step 0b. Set V̄^{x,n}_T(s) = 0 for each s ∈ S and n ≤ N.
Step 0c. Set n = 1.
Step 1. Select an initial state S^{x,n}_0 = (R^{x,n}_0, W_0).
Step 2. For t = 0, ..., (T − 1):
  Step 2a. Sample S^n_{t+1} and get a noisy observation:
    v̂^{x,n}_t(S^{x,n}_t) = max_{x_{t+1} ∈ X_{t+1}} { C(S^n_{t+1}, x_{t+1}) + V̄^{x,n−1}_{t+1}(S^x_{t+1}) }.
  Step 2b. Smooth the new observation with the previous value:
    z^{x,n}_t(S^{x,n}_t) = (1 − α^{n−1}_t(S^{x,n}_t)) V̄^{x,n−1}_t(S^{x,n}_t) + α^{n−1}_t(S^{x,n}_t) v̂^{x,n}_t(S^{x,n}_t).
  Step 2c. Enforce monotonicity. For each s ∈ S:
    V̄^{x,n}_t(s) = Π_M( S^{x,n}_t, z^{x,n}_t, s, V̄^{x,n−1}_t(s) ).
  Step 2d. Choose the next state S^{x,n}_{t+1}.
Step 3. If n < N, increment n and return to Step 1.

Fig. 5. Monotone ADP Algorithm using Post-Decision States

We remark that Monotone ADP is a provably convergent algorithm under certain technical conditions (see [22]). Although we do not describe the details of the convergence theory in this paper, it can be easily checked that the problem at hand, after discretization, satisfies the conditions for convergence.

VI. BENCHMARK PROBLEMS

The problems that we use as optimal benchmarks for the proposed algorithms originated from [2]. For all of the benchmark problems, we choose R^max = 30, β^c = β^d = 1, γ^c = γ^d = 5, and T = 100. The deterministic demand is assumed to have a seasonal structure:

D_t = max{ 0, 3 − 4 sin(2πt/T) }.

We now define two parameters that determine the support of the price processes, P_min = 30 and P_max = 70. Moreover, [2] defines a discrete distribution called the pseudonormal distribution, characterized by five parameters: µ, σ², a, b, and a discretization increment δ. Let X be pseudonormally distributed (written X ∼ PN(µ, σ², a, b, δ)).

The support of X is defined to be X = {a, a + δ, a + 2δ, ..., b}, and for x_i ∈ X we have P(X = x_i) = f(x_i; µ, σ²) / Σ_{x_j ∈ X} f(x_j; µ, σ²), where f(·; µ, σ²) is the pdf of a normal random variable with mean µ and variance σ².

Three types of price processes are considered. Let ε^P_t ∼ PN(µ_P, σ_P², −8, 8, 1), ε^J_t ∼ PN(0, 50², −40, 40, 1) (for jumps), and u_t ∼ U(0, 1) be i.i.d. random variables.

1) Sinusoidal. P_t = min{ max{ sin(5πt/(2T)) + ε^P_t, P_min }, P_max }.
2) Markov chain. Let P_0 = P_min and P_{t+1} = min{ max{ P_t + ε^P_{t+1}, P_min }, P_max }.
3) Markov chain with jumps. Let P_0 = P_min and P_{t+1} = min{ max{ P_t + ε^P_{t+1} + 1{u_{t+1} ≤ p} ε^J_{t+1}, P_min }, P_max }.

We consider a Markov chain model for the wind process E_t. Define E_min = 1 and E_max = 7. The support of E_t consists of the values between E_min and E_max, discretized at a level given by a parameter δ_E. Let the ε^E_t be i.i.d. random variables that can be either uniformly or pseudonormally distributed, ε^E_t ∼ PN(µ_E, σ_E², −3, 3, δ_E), and set

E_{t+1} = min{ max{ E_t + ε^E_{t+1}, E_min }, E_max }.

Lastly, suppose that R^x_t takes values between 0 and R^max, discretized at a level δ_R. Table I summarizes the stochastic benchmark problems; for ε^E_t and ε^P_t, since a, b, and δ are defined the same way across all problems, we use PN(µ, σ²) as shorthand.

TABLE I. Parameter choices for the stochastic benchmark problems [2]. Each problem S1-S17 specifies the discretization levels δ_R and δ_E, the wind noise ε^E (uniform U(−1, 1) or pseudonormal PN(0, σ²)), the price process, and the price noise ε^P ∼ PN(0, σ²): problems S1-S4 use the sinusoidal price process with ε^P ∼ PN(0, 25²), S5-S15 use the Markov chain with jumps, and S16-S17 use the Markov chain without jumps.

VII. NUMERICAL RESULTS

Due to the more complex nature of the various approximation architectures, there are more computational issues associated with the API algorithms than with the AVI algorithm. The main difficulty arises in the policy improvement step (given in Step 2 of Figure 3, but the computational cost is actually realized in Step 2b of Figure 2). Due to the existence of local optima when solving Step 2 of Figure 3, the maximization problem is solved using grid search, a computationally expensive method. Because of these limitations, we are able to use approximately 12.5% of the state space for policy evaluation purposes and 10 policy improvement steps. On the other hand, the MADP algorithm uses matrix operations to manipulate a simple lookup table representation of states and can finish its iterations within 2 days of computation time. For a given post-decision VFA, V̄^x_t, we define the approximate policy as

X^π_t(S_t) = arg max_{x_t ∈ X_t} [ C(S_t, x_t) + V̄^x_t(S^x_t) ].
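A rough sketch (ours) of this grid-search policy is shown below; feasible_decisions, contribution, and transition are assumed user-supplied helpers consistent with Section III, and vfa is any callable estimate of the post-decision value function, such as one of the fitted models from the earlier policy evaluation sketch or a slice of the MADP lookup table.

```python
import numpy as np

def greedy_policy(S, vfa, feasible_decisions, contribution, transition):
    """Evaluate X^pi_t(S_t) = argmax_x [ C(S_t, x) + Vbar^x_t(S^x_t) ] by
    enumerating a finite grid of candidate decisions (grid search)."""
    _, W = S                                   # pre-decision state (resource, world)
    best_x, best_val = None, -np.inf
    for x in feasible_decisions(S):            # finite grid of feasible decisions
        R_post = transition(S, x)              # post-decision resource level, eq. (8)
        val = contribution(S, x) + vfa(R_post, W)
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```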
To compute the value of the policy, we generate 1000 sample paths of the wind and price processes, and for each sample path, we follow the policy and sum the contributions. The value of the policy is then the average contribution over the 1000 sample paths. The percent of optimality is defined to be the value of the approximate policy divided by the value of the optimal (backward dynamic programming) policy. See [3, Section 4.9.4] for a detailed description of determining a policy's value. The results are given in Figure 6. SVR and MADP generate the highest quality policies, but it is noteworthy that SVR does not use problem-specific information while MADP does. Despite this, when considering the relative simplicity of the energy storage problem when compared to other real-world problems, results of 90% are not necessarily encouraging (GPR, LPR, and DCR often perform significantly worse than 90%). This suggests that care needs to be taken when combining API with a general-purpose approximation architecture: not just any approximation method will work.

Fig. 6. Benchmark results: percent of optimality attained by SVR, GPR, LPR, DCR, and MADP on the stochastic benchmark problems S1-S17.

VIII. EXPLOITING CONCAVITY

It needs to be pointed out that neither SVR nor MADP performs at the level of the ADP algorithm of [2] (98-99% optimality, see Figure 7), which exploits the piecewise linear concave nature of the value functions; the algorithm of [2] also uses a specific backward pass designed with this energy storage application in mind.

Although experiments are not shown in this paper, we want to stress that, while convergence theory exists, unstructured lookup table with AVI does not work for any reasonably large problems (the convergence rate is far too slow to be of any practical use). [20] and [21] show the benefits of taking advantage of structure, respectively. We would also like to note that AVI can also be used with other approximations beyond lookup table (with or without structure), such as basis functions, but it is shown in [28] that there is often a lack of a fixed point. Our results in this paper suggest that structured, problem-specific lookup table techniques also outperform other, more general approaches, such as API paired with a generic approximation technique. At this point, the numerical results suggest that structured lookup table is consistently effective on moderately sized problems, unlike any of the other methods that we tested. The caveat, of course, is that lookup table techniques do not scale to larger state spaces due to the requirement of storing a value estimate for every state. Not only that, it is typically the case that structure exists in only 1 or 2 dimensions of a higher-dimensional state variable (state-of-the-world variables quickly add dimensionality and there is no guarantee that they contain structure).

Fig. 7. Results from the ADP algorithm of [2], which exploits concavity and attains 98-99% optimality on the benchmark problems.

IX. DIRECT POLICY SEARCH

In this section, we review the somewhat surprising result from [1] that direct policy search (over a low-dimensional parametrized space of policies) yields better results than API-based algorithms. Several versions of API are discussed in the original paper [1]; here, we only reproduce the results for the best performing version, API with instrumental variables (IVAPI). The other version considered in [1] is least squares API (LSAPI). Quadratic basis functions are used for the approximation. Direct policy search is implemented using the knowledge gradient for continuous parameters (KGCP, see [29]). The structure of the policy is

X^π_t(S_t | θ) = arg max_{x_t ∈ X_t} [ C(S_t, x_t) + φ(S^x_t)^T θ ],

where φ is the vector of basis functions and θ is a vector of weights (the parameter of the policy). Note that although the second term resembles a VFA, the policy search technique has no notion of minimizing the distance between φ(S^x_t)^T θ and V^x_t(S^x_t). The reproduced results are shown in Figure 8. Although direct policy search seems robust in this application, we emphasize that this type of direct search does not easily scale to higher-dimensional parameter spaces.

Fig. 8. Direct policy search vs. IVAPI: percent of optimality on the benchmark problems of [1].

X. API SAMPLING DISTRIBUTION

In the implementations of API discussed in this paper, the sampling distribution used for Step 1 of the approximate policy evaluation step of Figure 2 is chosen to be a uniform distribution over the state space. One hypothesis to explain API's relatively poor performance is that instead of sampling uniformly, we can sample from the distribution of states visited under the optimal policy (say, given a deterministic initial state, S^x_0).

In most cases, this distribution is unknown; however, we are able to test this hypothesis on a simple problem with a computable optimal policy. We consider a version of our energy storage problem where the state variable is the scalar resource state R^x_t, combined with a quadratic approximation. When sampling uniformly, we consistently achieve policies that are 90%-95% optimal, but when sampling from the optimal policy's state distribution, we obtain policies that are anywhere from 40%-70% optimal. The primary reason that we observed for such low quality policies is that the optimal policy visits some states with very low to zero probability, causing the quadratic approximation to be very accurate for a portion of the state space but at the same time very poor in other portions of the state space. This often leads to policy oscillations (or chattering; see [30] for a discussion of this issue). Besides these preliminary observations, the issue of the correct sampling distribution remains a work in progress.

XI. CONCLUSION

In this paper, we describe a simple finite horizon energy storage and allocation problem that is subject to stochastic prices and wind supply, with the purpose of comparing the performance of several ADP algorithms. We consider API algorithms that take advantage of the following approximation architectures: SVR, GPR, LPR, and DCR. In addition, we test an AVI algorithm that exploits the known monotonicity of the problem, MADP. We draw the following conclusions from this and related papers. API performs decently well with SVR, but poorly with the other approximation architectures that we considered; however, given the simplicity of the problem, even the results from SVR are not too encouraging. Pure lookup table AVI performs poorly in practice, despite the convergence theory (see [21], [20]). Structured lookup table AVI (concavity or monotonicity, but especially concavity) works extremely well, but is limited to a low-dimensional state-of-the-world variable (see [2]). Direct policy search also displays superior performance compared to API-based methods (see [1]), but cannot scale to policies requiring a large number of parameters; in particular, direct policy search is generally not suitable for time-dependent policies. From this, we can conclude that none of these techniques works reliably in a way that would scale to more complex problems. Therefore, we believe that new theory and methodology need to be developed in order to solve real-world sequential decision problems, which are becoming increasingly difficult.

REFERENCES

[1] W. Scott and W. B. Powell, "Approximate dynamic programming for energy storage with new results on instrumental variables and projected Bellman errors," working paper.
[2] D. Salas and W. B. Powell, "Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of multidimensional energy storage problems," working paper.
[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Wiley.
[4] F. L. Lewis and D. Vrabie, "Learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3.
[5] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.
[6] J. H. Kim and W. B. Powell, "Optimal energy commitments with storage and intermittent supply," Operations Research, vol. 59, no. 6.
[7] R. Carmona and M. Ludkovski, "Valuation of energy storage: An optimal switching approach," Quantitative Finance, vol. 10, no. 4.
[8] M. Thompson, M. Davison, and H. Rasmussen, "Natural gas storage valuation and optimization: A real options application," Naval Research Logistics, vol. 56, no. 3.
[9] G. Pritchard, B. Philpott, and J. Neame, "Hydroelectric reservoir optimization in a pool market."
[10] N. Löhndorf, D. Wozabal, and S. Minner, "Optimizing trading decisions for hydro storage systems using approximate dual dynamic programming," Operations Research, vol. 61, no. 4.
[11] A. Philpott and Z. Guan, "On the convergence of stochastic dual dynamic programming and related methods," Operations Research Letters, vol. 36, no. 4.
[12] R. Sioshansi, S. H. Madaeni, and P. Denholm, "A dynamic programming approach to estimate the capacity value of energy storage," IEEE Transactions on Power Systems, vol. 29, no. 1.
[13] J. M. Nascimento and W. B. Powell, "An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem," Mathematics of Operations Research, vol. 34, no. 1.
[14] N. Secomandi, "Optimal commodity trading with a capacitated storage asset," Management Science, vol. 56, no. 3.
[15] L. Hannah and D. Dunson, "Approximate dynamic programming for storage problems," in Proceedings of the 29th International Conference on Machine Learning.
[16] J. M. Nascimento and W. B. Powell, "An optimal approximate dynamic programming algorithm for concave, scalar storage problems with vector-valued controls," IEEE Transactions on Automatic Control, vol. 58, no. 12.
[17] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley.
[18] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," vol. 25, no. 3.
[19] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press.
[20] J. M. Nascimento and W. B. Powell, "Dynamic programming models and algorithms for the mutual fund cash balance problem," Management Science, vol. 56, no. 5.
[21] D. R. Jiang and W. B. Powell, "Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming," arXiv preprint.
[22] D. R. Jiang and W. B. Powell, "An approximate dynamic programming algorithm for monotone value functions," arXiv preprint.
[23] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," in Advances in Neural Information Processing Systems, no. 9.
[24] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression."
[25] C. E. Rasmussen, Gaussian Processes for Machine Learning.
[26] W. Cleveland and S. Devlin, "Locally weighted regression: An approach to regression analysis by local fitting," Journal of the American Statistical Association, vol. 83, no. 403.
[27] A. A. Jamshidi and W. B. Powell, "A recursive local polynomial approximation method using Dirichlet clouds and radial basis functions," working paper.
[28] D. De Farias and B. Van Roy, "On the existence of fixed points for approximate value iteration and temporal-difference learning," Journal of Optimization Theory and Applications, vol. 105, no. 3.
[29] W. Scott, P. I. Frazier, and W. B. Powell, "The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression," SIAM Journal on Optimization, vol. 21, no. 3, p. 996.
[30] D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," Journal of Control Theory and Applications, 2011.


Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Capacity Constraints IJCSI Inernaional Journal of Compuer Science Issues, Vol 9, Issue 1, No 1, January 2012 wwwijcsiorg 18 Applying Geneic Algorihms for Invenory Lo-Sizing Problem wih Supplier Selecion under Sorage Capaciy

More information

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions

Operations Research. An Approximate Dynamic Programming Algorithm for Monotone Value Functions This aricle was downloaded by: [140.1.241.64] On: 05 January 2016, A: 21:41 Publisher: Insiue for Operaions Research and he Managemen Sciences (INFORMS) INFORMS is locaed in Maryland, USA Operaions Research

More information

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach 1 Decenralized Sochasic Conrol wih Parial Hisory Sharing: A Common Informaion Approach Ashuosh Nayyar, Adiya Mahajan and Demoshenis Tenekezis arxiv:1209.1695v1 [cs.sy] 8 Sep 2012 Absrac A general model

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France

Západočeská Univerzita v Plzni, Czech Republic and Groupe ESIEE Paris, France ADAPTIVE SIGNAL PROCESSING USING MAXIMUM ENTROPY ON THE MEAN METHOD AND MONTE CARLO ANALYSIS Pavla Holejšovsá, Ing. *), Z. Peroua, Ing. **), J.-F. Bercher, Prof. Assis. ***) Západočesá Univerzia v Plzni,

More information

Chapter 3 Boundary Value Problem

Chapter 3 Boundary Value Problem Chaper 3 Boundary Value Problem A boundary value problem (BVP) is a problem, ypically an ODE or a PDE, which has values assigned on he physical boundary of he domain in which he problem is specified. Le

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Scheduling of Crude Oil Movements at Refinery Front-end

Scheduling of Crude Oil Movements at Refinery Front-end Scheduling of Crude Oil Movemens a Refinery Fron-end Ramkumar Karuppiah and Ignacio Grossmann Carnegie Mellon Universiy ExxonMobil Case Sudy: Dr. Kevin Furman Enerprise-wide Opimizaion Projec March 15,

More information

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite American Journal of Operaions Research, 08, 8, 8-9 hp://wwwscirporg/journal/ajor ISSN Online: 60-8849 ISSN Prin: 60-8830 The Opimal Sopping Time for Selling an Asse When I Is Uncerain Wheher he Price Process

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t.

O Q L N. Discrete-Time Stochastic Dynamic Programming. I. Notation and basic assumptions. ε t : a px1 random vector of disturbances at time t. Econ. 5b Spring 999 C. Sims Discree-Time Sochasic Dynamic Programming 995, 996 by Chrisopher Sims. This maerial may be freely reproduced for educaional and research purposes, so long as i is no alered,

More information

Pade and Laguerre Approximations Applied. to the Active Queue Management Model. of Internet Protocol

Pade and Laguerre Approximations Applied. to the Active Queue Management Model. of Internet Protocol Applied Mahemaical Sciences, Vol. 7, 013, no. 16, 663-673 HIKARI Ld, www.m-hikari.com hp://dx.doi.org/10.1988/ams.013.39499 Pade and Laguerre Approximaions Applied o he Acive Queue Managemen Model of Inerne

More information

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems Paricle Swarm Opimizaion Combining Diversificaion and Inensificaion for Nonlinear Ineger Programming Problems Takeshi Masui, Masaoshi Sakawa, Kosuke Kao and Koichi Masumoo Hiroshima Universiy 1-4-1, Kagamiyama,

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

CENTRALIZED VERSUS DECENTRALIZED PRODUCTION PLANNING IN SUPPLY CHAINS

CENTRALIZED VERSUS DECENTRALIZED PRODUCTION PLANNING IN SUPPLY CHAINS CENRALIZED VERSUS DECENRALIZED PRODUCION PLANNING IN SUPPLY CHAINS Georges SAHARIDIS* a, Yves DALLERY* a, Fikri KARAESMEN* b * a Ecole Cenrale Paris Deparmen of Indusial Engineering (LGI), +3343388, saharidis,dallery@lgi.ecp.fr

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

On-line Adaptive Optimal Timing Control of Switched Systems

On-line Adaptive Optimal Timing Control of Switched Systems On-line Adapive Opimal Timing Conrol of Swiched Sysems X.C. Ding, Y. Wardi and M. Egersed Absrac In his paper we consider he problem of opimizing over he swiching imes for a muli-modal dynamic sysem when

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

A Hop Constrained Min-Sum Arborescence with Outage Costs

A Hop Constrained Min-Sum Arborescence with Outage Costs A Hop Consrained Min-Sum Arborescence wih Ouage Coss Rakesh Kawara Minnesoa Sae Universiy, Mankao, MN 56001 Email: Kawara@mnsu.edu Absrac The hop consrained min-sum arborescence wih ouage coss problem

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN Inernaional Journal of Scienific & Engineering Research, Volume 4, Issue 10, Ocober-2013 900 FUZZY MEAN RESIDUAL LIFE ORDERING OF FUZZY RANDOM VARIABLES J. EARNEST LAZARUS PIRIYAKUMAR 1, A. YAMUNA 2 1.

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. XI Control of Stochastic Systems - P.R. Kumar CONROL OF SOCHASIC SYSEMS P.R. Kumar Deparmen of Elecrical and Compuer Engineering, and Coordinaed Science Laboraory, Universiy of Illinois, Urbana-Champaign, USA. Keywords: Markov chains, ransiion probabiliies,

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

ONLINE SUPPLEMENT: AN APPROXIMATE DYNAMIC PROGRAMMING ALGORITHM FOR MONOTONE VALUE FUNCTIONS. 1. Preliminaries

ONLINE SUPPLEMENT: AN APPROXIMATE DYNAMIC PROGRAMMING ALGORITHM FOR MONOTONE VALUE FUNCTIONS. 1. Preliminaries ONLINE SUPPLEMENT: AN APPROXIMATE DYNAMIC PROGRAMMING ALGORITHM FOR MONOTONE VALUE FUNCTIONS DANIEL R. JIANG AND WARREN B. POWELL Absrac. In his online supplemen we provide he proofs o a condiion for monooniciy

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Global Optimization for Scheduling Refinery Crude Oil Operations

Global Optimization for Scheduling Refinery Crude Oil Operations Global Opimizaion for Scheduling Refinery Crude Oil Operaions Ramkumar Karuppiah 1, Kevin C. Furman 2 and Ignacio E. Grossmann 1 (1) Deparmen of Chemical Engineering Carnegie Mellon Universiy (2) Corporae

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

Sequential Importance Resampling (SIR) Particle Filter

Sequential Importance Resampling (SIR) Particle Filter Paricle Filers++ Pieer Abbeel UC Berkeley EECS Many slides adaped from Thrun, Burgard and Fox, Probabilisic Roboics 1. Algorihm paricle_filer( S -1, u, z ): 2. Sequenial Imporance Resampling (SIR) Paricle

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

MATHEMATICAL DESCRIPTION OF THEORETICAL METHODS OF RESERVE ECONOMY OF CONSIGNMENT STORES

MATHEMATICAL DESCRIPTION OF THEORETICAL METHODS OF RESERVE ECONOMY OF CONSIGNMENT STORES MAHEMAICAL DESCIPION OF HEOEICAL MEHODS OF ESEVE ECONOMY OF CONSIGNMEN SOES Péer elek, József Cselényi, György Demeer Universiy of Miskolc, Deparmen of Maerials Handling and Logisics Absrac: Opimizaion

More information

DEPARTMENT OF STATISTICS

DEPARTMENT OF STATISTICS A Tes for Mulivariae ARCH Effecs R. Sco Hacker and Abdulnasser Haemi-J 004: DEPARTMENT OF STATISTICS S-0 07 LUND SWEDEN A Tes for Mulivariae ARCH Effecs R. Sco Hacker Jönköping Inernaional Business School

More information

E β t log (C t ) + M t M t 1. = Y t + B t 1 P t. B t 0 (3) v t = P tc t M t Question 1. Find the FOC s for an optimum in the agent s problem.

E β t log (C t ) + M t M t 1. = Y t + B t 1 P t. B t 0 (3) v t = P tc t M t Question 1. Find the FOC s for an optimum in the agent s problem. Noes, M. Krause.. Problem Se 9: Exercise on FTPL Same model as in paper and lecure, only ha one-period govenmen bonds are replaced by consols, which are bonds ha pay one dollar forever. I has curren marke

More information