On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

MATHEMATICS OF OPERATIONS RESEARCH, Vol. 38, No. 2, May 2013, INFORMS

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

Huizhen Yu, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

Dimitri P. Bertsekas, Laboratory for Information and Decision Systems and Department of EECS, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

We consider a totally asynchronous stochastic approximation algorithm, Q-learning, for solving finite space stochastic shortest path (SSP) problems, which are undiscounted, total cost Markov decision processes with an absorbing and cost-free state. For the most commonly used SSP models, existing convergence proofs assume that the sequence of Q-learning iterates is bounded with probability one, or some other condition that guarantees boundedness. We prove that the sequence of iterates is naturally bounded with probability one, thus furnishing the boundedness condition in the convergence proof by Tsitsiklis [Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16] and establishing completely the convergence of Q-learning for these SSP models.

Key words: Markov decision processes; Q-learning; stochastic approximation; dynamic programming; reinforcement learning
MSC2000 subject classification: Primary: 90C40, 93E20, 90C39; secondary: 68W15, 62L20
OR/MS subject classification: Primary: dynamic programming/optimal control, analysis of algorithms; secondary: Markov, finite state
History: Received June 6, 2011; revised April 18; published online in Articles in Advance November 28, 2012.

1. Introduction. Stochastic shortest path (SSP) problems are Markov decision processes (MDP) in which there exists an absorbing and cost-free state, and the goal is to reach that state with minimal expected cost. In this paper we focus on finite state and control models under the undiscounted total cost criterion. We call a policy proper if under that policy the goal state is reached with probability 1 (w.p.1) for every initial state, and improper otherwise. Let SD denote the set of stationary and deterministic policies. We consider a broad class of SSP models, which satisfy the following general assumption introduced in Bertsekas and Tsitsiklis [2]:

Assumption 1.1. (i) There is at least one proper policy in SD, and (ii) any improper policy in SD incurs infinite cost for at least one initial state.

We will analyze a totally asynchronous stochastic approximation algorithm, the Q-learning algorithm (Watkins [9], Tsitsiklis [8]), for solving SSP problems. This algorithm generates a sequence of so-called Q-factors, which represent expected costs associated with initial state-control pairs, and it aims to obtain in the limit the optimal Q-factors of the problem, from which the optimal costs and optimal policies can be determined. Under Assumption 1.1, Tsitsiklis [8, Theorems 2 and 4(c)] proved that if the sequence $\{Q_t\}$ of Q-learning iterates is bounded w.p.1, then $Q_t$ converges to the optimal Q-factors $Q^*$ w.p.1. Regarding the boundedness condition, earlier results given in Tsitsiklis [8, Lemma 9] and the book by Bertsekas and Tsitsiklis [3, §5.6] show that it is satisfied in the special case where both the one-stage costs and the initial values $Q_0$ are nonnegative. Alternative to Tsitsiklis [8], there is also a line of convergence analysis of Q-learning given in Abounadi et al. [1], which does not require the boundedness condition. However, it requires a more restrictive asynchronous computation framework than the totally asynchronous framework treated in Tsitsiklis [8]; in particular, it requires some additional conditions on the timing and frequency of component updates in Q-learning.
In this paper we prove that $\{Q_t\}$ is naturally bounded w.p.1 for SSP models satisfying Assumption 1.1. Our result thus furnishes the boundedness condition in the convergence proof by Tsitsiklis [8] and, together with the latter, establishes completely the convergence of Q-learning for these SSP models. This boundedness result is useful as well in other contexts concerning SSP problems. In particular, it is used in the convergence analysis of a new Q-learning algorithm for SSP, proposed recently by the authors in Yu and Bertsekas [12], where the boundedness of the iterates of the new algorithm was related to that of the classical Q-learning algorithm considered here. The line of analysis developed in this paper has also been applied by Yu in [11] to show the boundedness and convergence of Q-learning for stochastic games of the SSP type.

We organize the paper and the results as follows. In §2 we introduce notation and preliminaries. In §3 we give the boundedness proof. First we show in §3.1 that $\{Q_t\}$ is bounded above w.p.1. We then give in §3.2 a short proof that $\{Q_t\}$ is bounded below w.p.1 for a special case with nonnegative expected one-stage costs.

In §3.3 we prove that $\{Q_t\}$ is bounded below w.p.1 for the general case; the proof is long, so we divide it into several steps given in separate subsections. In §4 we illustrate some of these proof steps using a simple example.

2. Preliminaries.

2.1. Notation and definitions. Let $S_o = \{0, 1, \ldots, n\}$ denote the state space, where state 0 is the absorbing and cost-free goal state. Let $S = S_o \setminus \{0\}$. For each state $i \in S$, let $U(i)$ denote the finite set of feasible controls, and for notational convenience, let $U(0) = \{0\}$. We denote by $U$ the control space, $U = \cup_{i \in S_o} U(i)$. We define $R_o$ to be the set of state and feasible control pairs, i.e., $R_o = \{(i,u) \mid i \in S_o,\ u \in U(i)\}$, and we define $R = R_o \setminus \{(0,0)\}$.

The state transitions and associated one-stage costs are defined as follows. From state $i$ with control $u \in U(i)$, a transition to state $j$ occurs with probability $p_{ij}(u)$ and incurs a one-stage cost $\hat g(i,u,j)$, or more generally, a random one-stage cost $\hat g(i,u,j,\omega)$ where $\omega$ is a random disturbance. In the latter case random one-stage costs are all assumed to have finite variance. Let the expected one-stage cost of applying control $u$ at state $i$ be $g(i,u)$. For state 0, $p_{00}(0) = 1$ and the self transition incurs cost 0.

We denote a general history-dependent, randomized policy by $\pi$. A randomized Markov policy is a policy of the form $\pi = (\nu_0, \nu_1, \ldots)$, where each function $\nu_t$, $t \ge 0$, maps each state $i \in S_o$ to a probability distribution $\nu_t(\cdot \mid i)$ over the set of feasible controls $U(i)$. A randomized Markov policy of the form $\pi = (\nu, \nu, \ldots)$ is said to be a stationary randomized policy and is also denoted by $\nu$. A stationary deterministic policy is a stationary randomized policy that for each state $i$ assigns probability 1 to a single control $\mu(i) \in U(i)$; the policy is also denoted by $\mu$.

The problem is to solve the total cost MDP on $S_o$, where we define the total cost of a policy $\pi$ for initial state $i \in S_o$ to be $J^\pi(i) = \liminf_{k \to \infty} J_k^\pi(i)$, with $J_k^\pi(i)$ being the expected $k$-stage cost of $\pi$ starting from state $i$. Assumption 1.1 is stated for this total cost definition. The optimal cost for initial state $i$ is $J^*(i) = \inf_\pi J^\pi(i)$. Under Assumption 1.1, it is established in Bertsekas and Tsitsiklis [2] that the Bellman equation (or the total cost optimality equation)

$$ J(i) = (TJ)(i) \overset{\text{def}}{=} \min_{u \in U(i)} \Big\{ g(i,u) + \sum_{j \in S} p_{ij}(u)\, J(j) \Big\}, \qquad i \in S, \tag{2.1} $$

has a unique solution, which is the optimal cost function $J^*$, and there exists an optimal policy in SD, which is proper, of course.

The Q-learning algorithm operates on the so-called Q-factors, $Q = \{Q(i,u) \mid (i,u) \in R_o\} \in \Re^{|R_o|}$. They represent costs associated with initial state-control pairs. For each state-control pair $(i,u) \in R_o$, the optimal Q-factor $Q^*(i,u)$ is the cost of starting from state $i$, applying control $u$, and afterwards following an optimal policy. (Here $Q^*(0,0) = 0$, of course.) Then, by the results of Bertsekas and Tsitsiklis [2] mentioned above, under Assumption 1.1, the optimal Q-factors and optimal costs are related by

$$ Q^*(i,u) = g(i,u) + \sum_{j \in S} p_{ij}(u)\, J^*(j), \qquad J^*(i) = \min_{u \in U(i)} Q^*(i,u), \qquad (i,u) \in R, $$

and $Q^*$ restricted to $R$ is the unique solution of the Bellman equation for Q-factors:

$$ Q(i,u) = (FQ)(i,u) \overset{\text{def}}{=} g(i,u) + \sum_{j \in S} p_{ij}(u) \min_{v \in U(j)} Q(j,v), \qquad (i,u) \in R. \tag{2.2} $$

Under Assumption 1.1, the Bellman operators $T$ and $F$ given in Equations (2.1), (2.2) are not necessarily contraction mappings with respect to the sup-norm, but are only nonexpansive. They would be contractions with respect to a weighted sup-norm if all policies were proper (see Bertsekas and Tsitsiklis [3, Proposition 2.2]), and the convergence of Q-learning in that case was established by Tsitsiklis [8, Theorems 3 and 4(b)].
Another basic fact is that for a proper policy $\mu \in$ SD, the associated Bellman operator $F_\mu$ given by

$$ (F_\mu Q)(i,u) = g(i,u) + \sum_{j \in S} p_{ij}(u)\, Q\big(j, \mu(j)\big), \qquad (i,u) \in R, \tag{2.3} $$

is a weighted sup-norm contraction, with the norm and the modulus of contraction depending on $\mu$. This fact also follows from Bertsekas and Tsitsiklis [3, Proposition 2.2].
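To make the operator $F$ of Equation (2.2) concrete, the following minimal Python sketch applies it to a toy SSP by fixed-point iteration. Everything here is illustrative and not from the paper: the toy model data (n_states, controls, P, g) and the function name bellman_F are assumptions, and in this toy model all policies happen to be proper, so $F$ is in fact a weighted sup-norm contraction and the iteration converges to $Q^*$.

import numpy as np

# Toy SSP: state 0 is the absorbing, cost-free goal state.
n_states = 3                                  # states {0, 1, 2}
controls = {0: [0], 1: [0, 1], 2: [0]}        # feasible controls U(i)

# Transition probabilities P[(i, u)][j] = p_ij(u) and expected costs g(i, u).
P = {
    (0, 0): np.array([1.0, 0.0, 0.0]),        # state 0 is absorbing
    (1, 0): np.array([0.7, 0.0, 0.3]),
    (1, 1): np.array([0.2, 0.8, 0.0]),
    (2, 0): np.array([0.5, 0.5, 0.0]),
}
g = {(0, 0): 0.0, (1, 0): 1.0, (1, 1): 0.5, (2, 0): 2.0}

def bellman_F(Q):
    # Apply (FQ)(i, u) of Equation (2.2) componentwise; Q maps (i, u) -> value.
    FQ = {}
    for (i, u), p_iu in P.items():
        if i == 0:
            FQ[(i, u)] = 0.0                  # Q(0, 0) is kept at zero
            continue
        cost_to_go = sum(p_iu[j] * min(Q[(j, v)] for v in controls[j])
                         for j in range(n_states))
        FQ[(i, u)] = g[(i, u)] + cost_to_go
    return FQ

# Fixed-point iteration Q <- F(Q); for this toy model it converges to Q*.
Q = {key: 0.0 for key in P}
for _ in range(200):
    Q = bellman_F(Q)
print(Q)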

2.2. Q-learning algorithm. The Q-learning algorithm is an asynchronous stochastic iterative algorithm for finding $Q^*$. Given an initial $Q_0 \in \Re^{|R_o|}$ with $Q_0(0,0) = 0$, the algorithm generates a sequence $\{Q_t\}$ by updating a subset of Q-factors at each time and keeping the rest unchanged. In particular, $Q_t(0,0) = 0$ for all $t$. For each $(i,u) \in R$ and $t \ge 0$, let $j_t^{iu} \in S_o$ be the successor state of a random transition from state $i$ after applying control $u$, generated at time $t$ according to the transition probability $p_{ij}(u)$. Then, with $s = j_t^{iu}$ as a shorthand to simplify notation, the iterate $Q_{t+1}(i,u)$ is given by

$$ Q_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, Q_t(i,u) + \alpha_t(i,u) \Big( g(i,u) + \omega_t(i,u) + \min_{v \in U(s)} Q_{\tau^{iu}_{sv}(t)}(s, v) \Big). \tag{2.4} $$

The variables in the above iteration need to satisfy certain conditions, which will be specified shortly. First we describe what these variables are.

(i) $\alpha_t(i,u) \ge 0$ is a stepsize parameter, and $\alpha_t(i,u) = 0$ if the $(i,u)$th component is not selected to be updated at time $t$.

(ii) $g(i,u) + \omega_t(i,u)$ is the random one-stage cost of the transition from state $i$ to $j_t^{iu}$ with control $u$; i.e., $\omega_t(i,u)$ is the difference between the transition cost and its expected value.

(iii) $\tau^{iu}_{jv}(t)$, $(j,v) \in R_o$, are nonnegative integers with $\tau^{iu}_{jv}(t) \le t$. We will refer to them as the delayed times. In a distributed asynchronous computation model, if we associate a processor with each component $(i,u)$, whose task is to update the Q-factor for $(i,u)$, then $t - \tau^{iu}_{jv}(t)$ can be viewed as the communication delay between the processors at $(i,u)$ and $(j,v)$ at time $t$.

We now describe the conditions on the variables. We regard all the variables in the Q-learning algorithm as random variables on a common probability space $(\Omega, \mathcal{F}, P)$. This means that the stepsizes and delayed times can be chosen based on the history of the algorithm. To determine the values of these variables, including which components to update at time $t$, the algorithm may use auxiliary variables that do not appear in Equation (2.4). Thus, to describe rigorously the dependence relation between the variables, it is convenient to introduce a family $\{\mathcal{F}_t\}$ of increasing sub-$\sigma$-fields of $\mathcal{F}$. Then the following information structure condition is required:

$Q_0$ is $\mathcal{F}_0$-measurable, and for every $(i,u)$ and $(j,v) \in R$ and $t \ge 0$, $\alpha_t(i,u)$ and $\tau^{iu}_{jv}(t)$ are $\mathcal{F}_t$-measurable, and $\omega_t(i,u)$ and $j_t^{iu}$ are $\mathcal{F}_{t+1}$-measurable.

The condition means that in iteration (2.4), the algorithm either chooses the stepsize $\alpha_t(i,u)$ and the delayed times $\tau^{iu}_{jv}(t)$, $(j,v) \in R$, before generating $j_t^{iu}$, or it chooses the values of the former variables in a way that does not use the information of $j_t^{iu}$. We note that although this condition seems abstract, it is naturally satisfied by the algorithm in practice.

In probabilistic terms and with the notation just introduced, the successor states and random transition costs appearing in the algorithm need to satisfy the following relations: for all $(i,u) \in R$ and $t \ge 0$,

$$ P\big( j_t^{iu} = j \mid \mathcal{F}_t \big) = p_{ij}(u), \qquad j \in S_o, \tag{2.5} $$

$$ E\big[ \omega_t(i,u) \mid \mathcal{F}_t \big] = 0, \qquad E\big[ \omega_t(i,u)^2 \mid \mathcal{F}_t \big] \le C, \tag{2.6} $$

where $C$ is some deterministic constant. There are two more conditions on the algorithm. In the totally asynchronous computation framework, we have the following minimal requirement on the delayed times used in each component update: w.p.1,

$$ \lim_{t \to \infty} \tau^{iu}_{jv}(t) = \infty, \qquad (i,u), (j,v) \in R. \tag{2.7} $$

We require the stepsizes to satisfy a standard condition for stochastic approximation algorithms: w.p.1,

$$ \sum_{t \ge 0} \alpha_t(i,u) = \infty, \qquad \sum_{t \ge 0} \alpha_t(i,u)^2 < \infty, \qquad (i,u) \in R. \tag{2.8} $$

We collect the algorithmic conditions mentioned above in one assumption below. We note that these conditions are natural and fairly mild for the Q-learning algorithm.

Assumption 2.1 (Algorithmic Conditions). The information structure condition holds, and w.p.1, Equations (2.5)-(2.8) are satisfied.

For boundedness of the Q-learning iterates, the condition (2.7) is in fact not needed (which is not surprising intuitively, since bounded delayed times cannot contribute to instability of the iterates).
We therefore also state a weaker version of Assumption 2.1, excluding condition (2.7), and we will use it later in the boundedness results for the algorithm.

Assumption 2.2. The information structure condition holds, and w.p.1, Equations (2.5), (2.6), (2.8) are satisfied.
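The following minimal Python sketch simulates iteration (2.4) on a toy SSP, under simplifying assumptions not made in the paper: every component is updated at every step, all delayed times equal $t$ (no communication delays), and the transition-cost noise is Gaussian. The model data and names used here are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

n_states = 3                                  # states {0, 1, 2}; 0 is the goal state
controls = {0: [0], 1: [0, 1], 2: [0]}
P = {(0, 0): [1.0, 0.0, 0.0],
     (1, 0): [0.7, 0.0, 0.3],
     (1, 1): [0.2, 0.8, 0.0],
     (2, 0): [0.5, 0.5, 0.0]}
g_expected = {(0, 0): 0.0, (1, 0): 1.0, (1, 1): 0.5, (2, 0): 2.0}

Q = {key: 0.0 for key in P}                   # initial Q_0, with Q_0(0,0) = 0
for t in range(1, 20001):
    alpha = 10.0 / (10.0 + t)                 # stepsizes satisfying condition (2.8)
    for (i, u) in P:
        if i == 0:
            continue                          # Q_t(0,0) stays 0
        s = int(rng.choice(n_states, p=P[(i, u)]))   # successor j_t^{iu}, cf. (2.5)
        noise = rng.normal(0.0, 0.1)                 # zero-mean cost noise omega_t, cf. (2.6)
        target = g_expected[(i, u)] + noise + min(Q[(s, v)] for v in controls[s])
        Q[(i, u)] = (1 - alpha) * Q[(i, u)] + alpha * target   # update (2.4)
print(Q)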

2.3. Convergence of Q-learning: Earlier results. The following convergence and boundedness results for Q-learning in SSP problems are established essentially in Tsitsiklis [8]; see also Bertsekas and Tsitsiklis [3, §§4.3 and 5.6].

Theorem 2.1 (Tsitsiklis [8]). Let $\{Q_t\}$ be the sequence generated by the iteration (2.4) with any given initial $Q_0$. Then, under Assumption 2.1, $\{Q_t\}$ converges to $Q^*$ w.p.1 if either of the following holds: (i) all policies of the SSP are proper; (ii) the SSP satisfies Assumption 1.1 and in addition, $\{Q_t\}$ is bounded w.p.1. In case (i), we also have that $\{Q_t\}$ is bounded w.p.1 under Assumption 2.2 (instead of Assumption 2.1).

Note that for a proper policy $\mu \in$ SD, by considering the SSP problem that has $\mu$ as its only policy, the conclusions of Theorem 2.1 in case (i) also apply to the evaluation of policy $\mu$ with Q-learning. In this context, $Q^*$ in the conclusions corresponds to the Q-factor vector $Q_\mu$, which is the unique fixed point of the weighted sup-norm contraction mapping $F_\mu$ (see Equation (2.3)).

The contribution of this paper is to remove the boundedness requirement on $\{Q_t\}$ in case (ii). Our proof arguments will be largely different from those used to establish the preceding theorem. For completeness, however, in the rest of this section, we explain briefly the basis of the analysis that gives Theorem 2.1, and the conditions involved.

In the analytical framework of Tsitsiklis [8], we view iteration (2.4) as a stochastic approximation algorithm and rewrite it equivalently as

$$ Q_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, Q_t(i,u) + \alpha_t(i,u) \Big( \big(F Q_t^{iu}\big)(i,u) + \tilde\omega_t(i,u) \Big), \tag{2.9} $$

where $F$ is the Bellman operator given by Equation (2.2); $Q_t^{iu}$ denotes the vector of Q-factors with components $Q_{\tau^{iu}_{jv}(t)}(j,v)$, $(j,v) \in R_o$ (which involve the delayed times); and $\tilde\omega_t(i,u)$ is a noise term given by

$$ \tilde\omega_t(i,u) = g(i,u) + \omega_t(i,u) + \min_{v \in U(s)} Q_{\tau^{iu}_{sv}(t)}(s,v) - \big(F Q_t^{iu}\big)(i,u) \qquad (\text{with } s = j_t^{iu}). $$

The noise terms $\tilde\omega_t(i,u)$, $(i,u) \in R$, are $\mathcal{F}_{t+1}$-measurable. Conditional on $\mathcal{F}_t$, they can be shown to have zero mean and meet a requirement on the growth of the conditional variance, when the Q-learning algorithm satisfies certain conditions (the same as those in Assumption 2.1 except for a slightly stronger stepsize condition, which will be explained shortly). We then analyze iteration (2.9) as a special case of an asynchronous stochastic approximation algorithm where $F$ is either a contraction or a monotone nonexpansive mapping (with respect to the sup-norm) and $Q^*$ is the unique fixed point of $F$. These two cases of $F$ correspond to the two different SSP model assumptions in Theorem 2.1: when all policies of the SSP are proper, $F$ is a weighted sup-norm contraction, whereas when Assumption 1.1 holds, $F$ is monotone and nonexpansive (see §2.1). The conclusions of Theorem 2.1 for case (i) follow essentially from Tsitsiklis [8, Theorems 1 and 3] for contraction mappings, whereas Theorem 2.1 in case (ii) follows essentially from Tsitsiklis [8, Theorem 2] for monotone nonexpansive mappings.

A specific technical detail relating to the stepsize condition is worth mentioning. To apply the results of Tsitsiklis [8] here, we first consider, without loss of generality, the case where all stepsizes are bounded by some deterministic constant. Theorem 2.1 under this additional condition then follows directly from Tsitsiklis [8]; see also Bertsekas and Tsitsiklis [3, §4.3].¹ (We mention that the technical use of this additional stepsize condition is only to ensure that the noise terms $\tilde\omega_t(i,u)$, $(i,u) \in R$, have well-defined conditional expectations.) We then remove the additional stepsize condition and obtain Theorem 2.1 as the immediate consequence, by using a standard, simple truncation technique as follows.
For each positive integer $m$, define truncated stepsizes

$$ \hat\alpha^m_t(i,u) = \min\big\{ m,\ \alpha_t(i,u) \big\}, \qquad (i,u) \in R, $$

which are by definition bounded by $m$, and consider the sequence $\{\hat Q^m_t\}$ generated by iteration (2.4) with $\hat Q^m_0 = Q_0$ and with $\hat\alpha^m_t(i,u)$ in place of $\alpha_t(i,u)$. This sequence has the following properties.

¹ The stepsize condition appearing in Tsitsiklis [8] is slightly different than condition (2.8); it is $\sum_{t \ge 0} \alpha_t(i,u)^2 \le C$ w.p.1, for some (deterministic) constant $C$, instead of $C$ being $\infty$, and in addition, it is required that $\alpha_t(i,u) \in [0,1]$. However, by strengthening one technical lemma (Lemma 1) in Tsitsiklis [8] so that its conclusions hold under the weaker condition (2.8), the proof of Tsitsiklis [8] is essentially intact under the latter condition. The details of the analysis can be found in Bertsekas and Tsitsiklis [3, Proposition 4.1 and Example 4.3] (see also Corollary 4.1 therein). A reproduction of the proofs in Tsitsiklis [8], Bertsekas and Tsitsiklis [3] with slight modifications is also available in Yu [10].

If the original sequence $\{Q_t\}$ satisfies Assumption 2.1 or 2.2, then so does $\{\hat Q^m_t\}$. Moreover, since the original stepsizes $\alpha_t(i,u)$, $t \ge 0$, $(i,u) \in R$, are bounded w.p.1, we have that for each sample path from a set of probability one, $\{Q_t\}$ coincides with $\{\hat Q^m_t\}$ for some sufficiently large integer $m$. The latter means that if for each $m$, $\hat Q^m_t$ converges to $Q^*$ (or $\{\hat Q^m_t\}$ is bounded) w.p.1, then the same holds for $\{Q_t\}$. Hence the conclusions of Theorem 2.1 for case (i) are direct consequences of applying the weaker version of the theorem mentioned earlier to the sequences $\{\hat Q^m_t\}$ for each $m$. Case (ii) of Theorem 2.1 follows from exactly the same argument, in view of the fact that under Assumption 2.1, if $\{Q_t\}$ is bounded w.p.1, then $\{\hat Q^m_t\}$ is also bounded w.p.1 for each $m$. [To see this, observe that by condition (2.8), the stepsizes in $\{Q_t\}$ and $\{\hat Q^m_t\}$ coincide for $t$ sufficiently large; more precisely, w.p.1, there exists some finite (path-dependent) time $\bar t$ such that for all $t \ge \bar t$ and $(i,u) \in R$, $\hat\alpha^m_t(i,u) = \alpha_t(i,u) \in [0,1]$. It then follows by the definition of $\{\hat Q^m_t\}$ that $\|Q_t - \hat Q^m_t\|_\infty \le \max_{t' \le \bar t} \|Q_{t'} - \hat Q^m_{t'}\|_\infty$ for all $t$.] So, technically speaking, Theorem 2.1 with the general stepsizes is a corollary of its weaker version mentioned earlier.

3. Main results. We will prove in this section the following theorem. It furnishes the boundedness condition required in Tsitsiklis [8, Theorem 2] (see Theorem 2.1(ii)), and together with the latter, establishes completely the convergence of $Q_t$ to $Q^*$ w.p.1.

Theorem 3.1. Under Assumptions 1.1 and 2.2, for any given initial $Q_0$, the sequence $\{Q_t\}$ generated by the Q-learning iteration (2.4) is bounded w.p.1.

Our proof consists of several steps which will be given in separate subsections. First we show that $\{Q_t\}$ is bounded above w.p.1. This proof is short and uses the contraction property of the Bellman operator $F_\mu$ associated with a proper policy $\mu$ in SD. A similar idea has been used in earlier works of Tsitsiklis [8, Lemma 9] and Bertsekas and Tsitsiklis [3, Proposition 5.6, p. 249] to prove the boundedness of iterates for certain nonnegative SSP models.

In the proofs of this section, for brevity, we will partially suppress the word "w.p.1" when the algorithmic conditions are concerned. Whenever a subset of sample paths with a certain property is considered, it will be implicitly assumed to be the intersection of the set of paths with that property and the set of paths that satisfy the assumption on the algorithm currently in effect (e.g., Assumption 2.1 or 2.2). In the proofs, the notation "a.s." stands for almost sure convergence.

3.1. Boundedness from above.

Proposition 3.1. Under Assumptions 1.1(i) and 2.2, for any given initial $Q_0$, the sequence $\{Q_t\}$ generated by the Q-learning iteration (2.4) is bounded above w.p.1.

Proof. Let $\mu$ be any proper policy in SD; such a policy exists by Assumption 1.1(i). First we define iterates (random variables) $\{\hat Q_t\}$ on the same probability space as the Q-learning iterates $\{Q_t\}$. Let $\hat Q_0 = Q_0$ and $\hat Q_t(0,0) = 0$ for $t \ge 0$. For each $(i,u) \in R$ and $t \ge 0$, let

$$ \hat Q_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, \hat Q_t(i,u) + \alpha_t(i,u) \Big( g(i,u) + \omega_t(i,u) + \hat Q_{\tau^{iu}_{sv}(t)}\big(j_t^{iu}, \mu(j_t^{iu})\big) \Big), $$

where in the sub/superscripts of $\tau^{iu}_{sv}(t)$, $s$ is a shorthand for $j_t^{iu}$ and $v$ is a shorthand for $\mu(j_t^{iu})$, introduced to avoid notational clutter; and $\alpha_t(i,u)$, $j_t^{iu}$ and $\omega_t(i,u)$, as well as the delayed times $\tau^{iu}_{jv}(t)$, $(j,v) \in R_o$, are the same random variables that appear in the Q-learning algorithm (2.4). The sequence $\{\hat Q_t\}$ is generated by the Q-learning algorithm (2.4) for the SSP problem that has the proper policy $\mu$ as its only policy, and involves the mapping $F_\mu$, which is a weighted sup-norm contraction (see §2.1 and the discussion following Theorem 2.1). The sequence $\{\hat Q_t\}$ also satisfies Assumption 2.2 (since $\{\hat Q_t\}$ and $\{Q_t\}$ involve the same stepsizes, transition costs and delayed times). Therefore, by Theorem 2.1(i), $\{\hat Q_t\}$ is bounded w.p.1.
Consider now any sample path from the set of probability one on which $\{\hat Q_t\}$ is bounded. In view of the stepsize condition (2.8), there exists a time $\bar t$ such that $\alpha_t(i,u) \le 1$ for all $t \ge \bar t$ and $(i,u) \in R$. Let

$$ \Delta = \max_{(i,u) \in R}\ \max_{t \le \bar t}\ \big( Q_t(i,u) - \hat Q_t(i,u) \big). $$

Then

$$ Q_t(i,u) \le \hat Q_t(i,u) + \Delta, \qquad (i,u) \in R, \quad t \le \bar t. $$

We show by induction that this relation also holds for all $t > \bar t$. To this end, suppose that for some $t \ge \bar t$, the relation holds for all $t' \le t$. Then, for each $(i,u) \in R$, we have that

$$ \begin{aligned} Q_{t+1}(i,u) &\le \big(1 - \alpha_t(i,u)\big)\, Q_t(i,u) + \alpha_t(i,u) \big( g(i,u) + \omega_t(i,u) + Q_{\tau^{iu}_{sv}(t)}(s,v) \big) \\ &\le \big(1 - \alpha_t(i,u)\big) \big( \hat Q_t(i,u) + \Delta \big) + \alpha_t(i,u) \big( g(i,u) + \omega_t(i,u) + \hat Q_{\tau^{iu}_{sv}(t)}(s,v) + \Delta \big) \\ &= \hat Q_{t+1}(i,u) + \Delta, \end{aligned} $$

where the first inequality follows from the definition of $Q_{t+1}$ and the fact $\alpha_t(i,u) \ge 0$, the second inequality follows from the induction hypothesis and the fact $\alpha_t(i,u) \in [0,1]$, and the last equality follows from the definition of $\hat Q_{t+1}$. This completes the induction and shows that $\{Q_t\}$ is bounded above w.p.1.

3.2. Boundedness from below for a special case. The proof that $\{Q_t\}$ is bounded below w.p.1 is long and consists of several steps to be given in the next subsection. For a special case with nonnegative expected one-stage costs, there is a short proof, which we give here. Together with Proposition 3.1, it provides a short proof of the boundedness and hence convergence of the Q-learning iterates for a class of nonnegative SSP models satisfying Assumption 1.1. Earlier works of Tsitsiklis [8, Lemma 9] and Bertsekas and Tsitsiklis [3, Proposition 5.6, p. 249] have also considered nonnegative SSP models and established convergence results for them, but under stronger assumptions than ours. (In particular, it is assumed there that all transitions incur costs $\hat g(i,u,j) \ge 0$, as well as other conditions, so that all iterates are nonnegative.) To keep the proof simple, we will use Assumption 2.1, although Assumption 2.2 would also suffice.

Proposition 3.2. Suppose that $g(i,u) \ge 0$ for all $(i,u) \in R$ and moreover, for those $(i,u)$ with $g(i,u) = 0$, every possible transition from state $i$ under control $u$ incurs cost 0. Then, under Assumption 2.1, for any given initial $Q_0$, the sequence $\{Q_t\}$ generated by the Q-learning iteration (2.4) is bounded below w.p.1.

Proof. We write $Q_t$ as the sum of two processes: for each $(i,u) \in R_o$,

$$ Q_t(i,u) = g_t(i,u) + Y_t(i,u), \qquad t \ge 0, \tag{3.1} $$

where $g_t(0,0) = g(0,0) = 0$ and $Y_t(0,0) = 0$ for all $t$, and for each $(i,u) \in R$,

$$ g_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, g_t(i,u) + \alpha_t(i,u) \big( g(i,u) + \omega_t(i,u) \big), $$

$$ Y_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, Y_t(i,u) + \alpha_t(i,u) \min_{v \in U(s)} Q_{\tau^{iu}_{sv}(t)}(s,v), $$

with $g_0 = 0$, $Y_0 = Q_0$, and $s$ being a shorthand for $j_t^{iu}$ (to avoid notational clutter). Using the conditions (2.6) and (2.8) of the Q-learning algorithm, it follows from the standard theory of stochastic approximation (see, e.g., Bertsekas and Tsitsiklis [3, Proposition 4.1 and Example 4.3] or Kushner and Yin [5], Borkar [4]) that $g_t(i,u) \xrightarrow{a.s.} g(i,u)$ for all $(i,u) \in R$.²

Consider any sample path from the set of probability one on which this convergence takes place. Then by Equation (3.1), on that sample path, $\{Q_t\}$ is bounded below if and only if $\{Y_t\}$ is bounded below. Now from the definition of $Y_t$ and Equation (3.1) we have

$$ Y_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, Y_t(i,u) + \alpha_t(i,u) \min_{v \in U(s)} \Big( g_{\tau^{iu}_{sv}(t)}(s,v) + Y_{\tau^{iu}_{sv}(t)}(s,v) \Big). \tag{3.2} $$

By condition (2.7) of the Q-learning algorithm, and in view also of our assumption on one-stage costs, the convergence $g_t(j,v) \xrightarrow{a.s.} g(j,v)$ for all $(j,v) \in R$ implies that on the sample path under our consideration, for all $t$ sufficiently large,

$$ g_{\tau^{iu}_{jv}(t)}(j,v) \ge 0, \qquad (j,v) \in R_o. $$

Therefore, using Equation (3.2) and the fact that eventually $\alpha_t(i,u) \in [0,1]$ [cf. Equation (2.8)], we have that for all $t$ sufficiently large and for all $(i,u) \in R$,

$$ Y_{t+1}(i,u) \ \ge\ \big(1 - \alpha_t(i,u)\big)\, Y_t(i,u) + \alpha_t(i,u) \min_{v \in U(s)} Y_{\tau^{iu}_{sv}(t)}(s,v) \ \ge\ \min_{t' \le t}\ \min_{(j,v) \in R_o} Y_{t'}(j,v), $$

² This convergence follows from a basic result of stochastic approximation theory (see the aforementioned references) if besides (2.6) and (2.8), it is assumed in addition that the stepsizes are bounded by some (deterministic) constant. The desired result then follows by removing the additional condition with the stepsize truncation proof technique described in §2.3.
More details can also be found in Yu [10]; Lemma 1 therein implies the convergence desired here.

which implies that for all $t$ sufficiently large,

$$ \min_{t' \le t+1}\ \min_{(j,v) \in R_o} Y_{t'}(j,v) \ \ge\ \min_{t' \le t}\ \min_{(j,v) \in R_o} Y_{t'}(j,v). $$

Hence $\{Y_t\}$ is bounded below on that sample path. The proof is complete.

3.3. Boundedness from below in general. In this section, we will prove the following result in several steps. Together with Proposition 3.1 it implies Theorem 3.1.

Proposition 3.3. Under Assumptions 1.1 and 2.2, the sequence $\{Q_t\}$ generated by the Q-learning iteration (2.4) is bounded below w.p.1.

The proof can be outlined roughly as follows. In §3.3.1 we will introduce an auxiliary sequence $\{\bar Q_t\}$ of a certain form such that $\{Q_t\}$ is bounded below w.p.1 if and only if $\{\bar Q_t\}$ is bounded below w.p.1. In §3.3.2 and §3.3.3 we will give, for any given $\delta > 0$, a specific construction of the sequence $\{\bar Q_t\}$ for each sample path from a set of probability 1, such that each $\bar Q_t(i,u)$ can be interpreted as the expected total cost of some randomized Markov policy for a time-inhomogeneous SSP problem that can be viewed as a $\delta$-perturbation of the original problem. Finally, to complete the proof, we will show in §3.3.4 that when $\delta$ is sufficiently small, the expected total costs achievable in any of these perturbed SSP problems can be bounded uniformly from below, so that the auxiliary sequence $\{\bar Q_t\}$ constructed for the corresponding $\delta$ must be bounded below w.p.1. This then implies that the Q-learning iterates $\{Q_t\}$ must be bounded below w.p.1.

In what follows, let $\bar\Omega$ denote the set of sample paths on which the algorithmic conditions in Assumption 2.2 hold. Note that $\bar\Omega$ has probability one under Assumption 2.2.

3.3.1. Auxiliary sequence $\{\bar Q_t\}$. The first step of our proof is a technically important observation. Let us write the Q-learning iterates given in Equation (2.4) equivalently, for all $(i,u) \in R$ and $t \ge 0$, as

$$ Q_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, Q_t(i,u) + \alpha_t(i,u) \Big( g(i,u) + \omega_t(i,u) + Q_{\tau^{iu}_{sv}(t)}\big(j_t^{iu}, v_t^{iu}\big) \Big), \tag{3.3} $$

where $v_t^{iu}$ is a control that satisfies

$$ v_t^{iu} \in \arg\min_{v \in U(s)} Q_{\tau^{iu}_{sv}(t)}\big( j_t^{iu}, v \big), \tag{3.4} $$

and $s$, $v$ in the sub/superscripts of $\tau^{iu}_{sv}(t)$ are shorthand notation: $s$ stands for the state $j_t^{iu}$, and $v$ now stands for the control $v_t^{iu}$. We observe the following. Suppose we define an auxiliary sequence $\{\bar Q_t\}$, where

$$ \bar Q_t(0,0) = 0, \qquad t \ge 0, \tag{3.5} $$

and for some nonnegative integer $t_0$ and for all $(i,u) \in R$,

$$ \bar Q_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, \bar Q_t(i,u) + \alpha_t(i,u) \Big( g(i,u) + \omega_t(i,u) + \bar Q_{\tau^{iu}_{sv}(t)}\big(j_t^{iu}, v_t^{iu}\big) \Big), \qquad t \ge t_0, \tag{3.6} $$

$$ \bar Q_t(i,u) = \bar Q_{t_0}(i,u), \qquad t \le t_0. \tag{3.7} $$

Let us consider each sample path from the set $\bar\Omega$. In view of Equation (2.8), there exists $t_0' \ge t_0$ such that $\alpha_t(i,u) \in [0,1]$ for all $t \ge t_0'$ and $(i,u) \in R$. By Equations (3.3) and (3.6), we then have that for all $t \ge t_0'$ and $(i,u) \in R$,

$$ \big| Q_{t+1}(i,u) - \bar Q_{t+1}(i,u) \big| \le \big(1 - \alpha_t(i,u)\big) \big| Q_t(i,u) - \bar Q_t(i,u) \big| + \alpha_t(i,u) \big| Q_{\tau^{iu}_{sv}(t)}\big(j_t^{iu}, v_t^{iu}\big) - \bar Q_{\tau^{iu}_{sv}(t)}\big(j_t^{iu}, v_t^{iu}\big) \big| \le \max_{t' \le t} \big\| Q_{t'} - \bar Q_{t'} \big\|_\infty, $$

which implies

$$ \max_{t' \le t+1} \big\| Q_{t'} - \bar Q_{t'} \big\|_\infty \ \le\ \max_{t' \le t} \big\| Q_{t'} - \bar Q_{t'} \big\|_\infty. \tag{3.8} $$

Therefore, on that sample path, $\{Q_t\}$ is bounded below if and only if $\{\bar Q_t\}$ is bounded below. We state this as a lemma.

Lemma 3.1. For any sample path from the set $\bar\Omega$, and for any values of $t_0$ and $\bar Q_{t_0}$, the Q-learning sequence $\{Q_t\}$ is bounded below if and only if $\{\bar Q_t\}$ given by Equations (3.5)-(3.7) is bounded below.

This observation is the starting point for the proof of the lower boundedness of $\{Q_t\}$. We will construct a sequence $\{\bar Q_t\}$ that is easier to analyze than $\{Q_t\}$ itself. In particular, we will choose, for each sample path from a set of probability one, the time $t_0$ and the initial $\bar Q_{t_0}$ in such a way that the auxiliary sequence $\{\bar Q_t\}$ is endowed with a special interpretation and structure relating to perturbed versions of the SSP problem.
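The following minimal Python sketch illustrates numerically the coupling idea behind Lemma 3.1, under simplifying assumptions made only for this illustration: a single scalar component, no delayed times, and stepsizes already in [0,1]. Both recursions are driven by the same noise and differ only in which sequence they read their own previous values from, so the gap between them cannot grow, which is the content of (3.8); the names used here are illustrative.

import numpy as np

rng = np.random.default_rng(1)

Q, Qbar = 0.0, 50.0                # different initial values (Qbar_{t_0} is arbitrary)
gap = [abs(Q - Qbar)]
for t in range(1, 5001):
    alpha = 1.0 / t                # stepsizes in [0, 1] satisfying (2.8)
    noise = rng.normal(0.0, 0.1)
    # The same affine update applied to both sequences with shared randomness;
    # the coefficient 0.5 stands in for reading back a (bounded) earlier iterate.
    Q    = (1 - alpha) * Q    + alpha * (1.0 + noise + 0.5 * Q)
    Qbar = (1 - alpha) * Qbar + alpha * (1.0 + noise + 0.5 * Qbar)
    gap.append(abs(Q - Qbar))

# In the spirit of (3.8): the gap never exceeds its initial value, so one
# sequence is bounded (below) if and only if the other is.
assert max(gap) <= gap[0] + 1e-12
print(gap[0], gap[-1])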

3.3.2. Choosing $t_0$ and initial $\bar Q_{t_0}$ for a sample path. First we introduce some notation and definitions to be used throughout the rest of the proof. For a finite set $D$, let $\mathcal{P}(D)$ denote the set of probability distributions on $D$. For $p \in \mathcal{P}(D)$ and $x \in D$, let $p(x)$ denote the probability of $x$ and $\mathrm{supp}(p)$ denote the support of $p$, $\{x \in D \mid p(x) \ne 0\}$. For $p_1, p_2 \in \mathcal{P}(D)$, we write $p_1 \prec p_2$ if $p_1$ is absolutely continuous with respect to $p_2$, that is, $\mathrm{supp}(p_1) \subset \mathrm{supp}(p_2)$. For signed measures $p$ on $D$, we define the notation $p(x)$ and $\mathrm{supp}(p)$ as well as the notion of absolute continuity similarly. We denote by $\mathcal{M}(D)$ the set of signed measures $p$ on $D$ such that $\sum_{x \in D} p(x) = 1$. This set contains the set $\mathcal{P}(D)$.

For each $(i,u) \in R_o$, we define the following. Let $p^{iu}_o \in \mathcal{P}(S_o)$ correspond to the transition probabilities at $(i,u)$: $p^{iu}_o(j) = p_{ij}(u)$, $j \in S_o$. For each $\delta > 0$, let $A_\delta(i,u) \subset \mathcal{P}(S_o)$ denote the set of probability distributions that are both in the $\delta$-neighborhood of $p^{iu}_o$ and absolutely continuous with respect to $p^{iu}_o$, i.e.,

$$ A_\delta(i,u) = \Big\{ d \in \mathcal{P}(S_o) \ \Big|\ \big| d(j) - p_{ij}(u) \big| \le \delta,\ j \in S_o, \ \text{and}\ d \prec p^{iu}_o \Big\}. $$

(In particular, for $(i,u) = (0,0)$, $p^{00}_o(0) = 1$ and $A_\delta(0,0) = \{p^{00}_o\}$.) Let $g$ denote the vector of expected one-stage costs, $\{g(i,u) \mid (i,u) \in R_o\}$. Define $B_\delta$ to be the subset of vectors in the $\delta$-neighborhood of $g$ whose $(0,0)$th component is zero: with $c = \{c(i,u) \mid (i,u) \in R_o\}$,

$$ B_\delta = \Big\{ c \ \Big|\ c(0,0) = 0 \ \text{and}\ \big| c(i,u) - g(i,u) \big| \le \delta,\ (i,u) \in R \Big\}. $$

We now describe how we choose $t_0$ and $\bar Q_{t_0}$ for the auxiliary sequence $\{\bar Q_t\}$ on a certain set of sample paths that has probability one. We start by defining two sequences, a sequence $\{g_t\}$ of one-stage cost vectors³ and a sequence $\{q_t\}$ of collections of signed measures in $\mathcal{M}(S_o)$. They are random sequences defined on the same probability space as the Q-learning iterates, and they can be related to the empirical one-stage costs and empirical transition frequencies on a sample path.

We define the sequence $\{g_t\}$ as follows: for $t \ge 0$,

$$ g_{t+1}(i,u) = \big(1 - \alpha_t(i,u)\big)\, g_t(i,u) + \alpha_t(i,u) \big( g(i,u) + \omega_t(i,u) \big), \qquad (i,u) \in R, \tag{3.9} $$

with $g_0(i,u) = 0$, $(i,u) \in R$, and $g_t(0,0) = 0$, $t \ge 0$.

We define the sequence $\{q_t\}$ as follows. It has as many components as the size of the set $R$ of state-control pairs. For each $(i,u) \in R$, define the component sequence $\{q_t^{iu}\}$ by letting $q_0^{iu}$ be any given distribution in $\mathcal{P}(S_o)$ with $q_0^{iu} \prec p^{iu}_o$, and by letting

$$ q_{t+1}^{iu} = \big(1 - \alpha_t(i,u)\big)\, q_t^{iu} + \alpha_t(i,u)\, e_{j_t^{iu}}, \qquad t \ge 0, \tag{3.10} $$

where $e_j$ denotes the indicator of $j$: $e_j \in \mathcal{P}(S_o)$ with $e_j(j) = 1$, for $j \in S_o$. Since the stepsizes $\alpha_t(i,u)$ may exceed 1, in general $q_t^{iu} \in \mathcal{M}(S_o)$. Since $j_t^{iu}$ is a random successor state of state $i$ after applying control $u$ [cf. condition (2.5)], w.p.1,

$$ q_t^{iu} \prec p^{iu}_o, \qquad t \ge 0. \tag{3.11} $$

By the standard theory of stochastic approximation (see, e.g., Bertsekas and Tsitsiklis [3, Proposition 4.1 and Example 4.3] or Kushner and Yin [5], Borkar [4]; see also Footnote 2), Equations (2.6) and (2.8) imply that

$$ g_t(i,u) \xrightarrow{a.s.} g(i,u), \qquad (i,u) \in R, \tag{3.12} $$

whereas Equations (2.5) and (2.8) imply that

$$ q_t^{iu} \xrightarrow{a.s.} p^{iu}_o, \qquad (i,u) \in R. \tag{3.13} $$

Equations (3.13) and (3.11) together imply that w.p.1, eventually $q_t^{iu}$ lies in the set $\mathcal{P}(S_o)$ of probability distributions. The following is then evident, in view also of the stepsize condition (2.8).

Lemma 3.2. Let Assumption 2.2 hold. Consider any sample path from the set of probability one of paths which lie in $\bar\Omega$ and on which the convergence in Equations (3.12), (3.13) takes place. Then for any $\delta > 0$, there exists a time $t_0$ such that

$$ g_t \in B_\delta, \qquad q_t^{iu} \in A_\delta(i,u), \qquad \alpha_t(i,u) \le 1, \qquad (i,u) \in R, \quad t \ge t_0. \tag{3.14} $$

In the rest of §3.3, let us consider any sample path from the set of probability one given in Lemma 3.2. For any given $\delta > 0$, we choose $t_0$ given in Lemma 3.2 to be the initial time of the auxiliary sequence $\{\bar Q_t\}$. (Note that $t_0$ depends on the entire path and hence so does $\{\bar Q_t\}$.)

³ The sequence $\{g_t\}$ also appeared in the proof of Proposition 3.2; for convenience, we repeat the definition here.
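A minimal Python sketch of the two auxiliary statistics (3.9)-(3.10) for a single state-control pair, under illustrative assumptions (a fixed toy transition vector, Gaussian cost noise, stepsizes 1/t): a running stochastic-approximation average g_t(i,u) of the observed one-stage costs, and a vector q_t^{iu} tracking the empirical successor-state frequencies. By (3.12)-(3.13) both converge to the true parameters, which is what Lemma 3.2 exploits when picking t_0.

import numpy as np

rng = np.random.default_rng(2)

p_iu = np.array([0.7, 0.0, 0.3])     # true transition probabilities p_ij(u), j = 0, 1, 2
g_iu = 1.0                           # true expected one-stage cost g(i, u)

g_t = 0.0                            # g_0(i, u) = 0, cf. (3.9)
q_t = np.array([1.0, 0.0, 0.0])      # q_0^{iu}: any distribution absolutely continuous w.r.t. p_iu

for t in range(1, 50001):
    alpha = 1.0 / t                                  # stepsizes satisfying (2.8)
    j = int(rng.choice(3, p=p_iu))                   # successor state j_t^{iu}
    cost = g_iu + rng.normal(0.0, 0.1)               # g(i, u) + omega_t(i, u)
    g_t = (1 - alpha) * g_t + alpha * cost           # update (3.9)
    q_t = (1 - alpha) * q_t + alpha * np.eye(3)[j]   # update (3.10); e_j is the indicator of j
print(g_t, q_t)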

We now define the initial $\bar Q_{t_0}$. Our definition and the proof that follows will involve a stationary randomized policy. Recall that $\nu(u \mid i)$ denotes the probability of applying control $u$ at state $i$ under $\nu$, for $u \in U(i)$, $i \in S_o$. Recall also that $U = \cup_{i \in S_o} U(i)$ is the control space. We now regard $\nu(\cdot \mid i)$ as a distribution in $\mathcal{P}(U)$ with its support contained in the feasible control set $U(i)$ [that is, $\nu(u \mid i) = 0$ if $u \notin U(i)$].

To define $\bar Q_{t_0}$, let $\nu$ be a proper randomized stationary policy, which exists under Assumption 1.1(i). We define each component $\bar Q_{t_0}(i,u)$ of $\bar Q_{t_0}$ separately, and we associate with $\bar Q_{t_0}(i,u)$ a time-inhomogeneous Markov chain and time-varying one-stage cost functions as follows.

For each $(i,u) \in R$, consider a time-inhomogeneous Markov chain $(i_0, u_0), (i_1, u_1), \ldots$ on the space $S_o \times U$ with initial state $(i_0, u_0) = (i,u)$, whose probability distribution is denoted $P^{iu}_{t_0}$ and whose transition probabilities at time $k \ge 1$ are given by: for all $(\bar i, \bar u), (j,v) \in R_o$,

$$ P^{iu}_{t_0}\big( i_1 = j,\ u_1 = v \mid i_0 = i,\ u_0 = u \big) = q^{iu}_{t_0}(j)\, \nu(v \mid j), \qquad \text{for } k = 1, $$

$$ P^{iu}_{t_0}\big( i_k = j,\ u_k = v \mid i_{k-1} = \bar i,\ u_{k-1} = \bar u \big) = p_{\bar i j}(\bar u)\, \nu(v \mid j), \qquad \text{for } k \ge 2, $$

where $P^{iu}_{t_0}(\cdot \mid \cdot)$ denotes conditional probability. (The transition probabilities at $(\bar i, \bar u) \notin R_o$ can be defined arbitrarily because regardless of their values, w.p.1, the chain will never visit such state-control pairs at any time.) For each $(i,u) \in R$, we also define time-varying one-stage cost functions $g^{iu}_{t_0,k}: R_o \to \Re$, $k \ge 0$, by

$$ g^{iu}_{t_0,0} = g_{t_0} \quad \text{for } k = 0, \qquad g^{iu}_{t_0,k} = g \quad \text{for } k \ge 1. $$

We extend $g^{iu}_{t_0,k}$ to $S_o \times U$ by defining its values outside the domain $R_o$ to be $+\infty$, and we will treat $\infty \cdot 0 = 0$. This convention will be followed throughout. We now define

$$ \bar Q_{t_0}(i,u) = E^{P^{iu}_{t_0}}\Big[ \sum_{k=0}^{\infty} g^{iu}_{t_0,k}(i_k, u_k) \Big], \qquad (i,u) \in R, \tag{3.15} $$

where $E^{P^{iu}_{t_0}}$ denotes expectation under $P^{iu}_{t_0}$. The above expectation is well defined and finite, and furthermore, the order of summation and expectation can be exchanged, i.e.,

$$ \bar Q_{t_0}(i,u) = \sum_{k=0}^{\infty} E^{P^{iu}_{t_0}}\big[ g^{iu}_{t_0,k}(i_k, u_k) \big]. $$

This follows from the fact that under $P^{iu}_{t_0}$, from time 1 onwards, the process $\{(i_k, u_k)\}_{k \ge 1}$ evolves and incurs costs as in the original SSP problem under the stationary proper policy $\nu$. In particular, since $\nu$ is a proper policy, $\sum_{k \ge 0} |g^{iu}_{t_0,k}(i_k, u_k)|$ is finite almost surely with respect to $P^{iu}_{t_0}$, and hence the summation $\sum_{k \ge 0} g^{iu}_{t_0,k}(i_k, u_k)$ is well defined and also finite $P^{iu}_{t_0}$-almost surely. Since $\nu$ is a stationary proper policy for a finite state SSP, we have that under $\nu$, from any state in $S$, the expected time of reaching the state 0 is finite, and consequently, $E^{P^{iu}_{t_0}}\big[ \sum_{k \ge 0} |g^{iu}_{t_0,k}(i_k, u_k)| \big]$ is also finite. It then follows from the dominated convergence theorem that the two expressions given above for $\bar Q_{t_0}(i,u)$ are indeed equal.

3.3.3. Interpreting $\{\bar Q_t\}$ as costs in certain time-inhomogeneous SSP problems. We now show that with the preceding choice of $t_0$ and initial $\bar Q_{t_0}$, each component of the iterates $\bar Q_t$, $t \ge t_0$, is equal to, briefly speaking, the expected total cost of a randomized Markov policy (represented by $\{\nu^{iu}_{t,k}\}$ below) in a time-inhomogeneous SSP problem whose parameters (transition probabilities and one-stage costs, represented by $\{p^{iu}_{t,k}\}$, $\{g^{iu}_{t,k}\}$ below) lie in the $\delta$-neighborhood of those of the original problem. While the proof of this result is lengthy, it is mostly a straightforward verification. In the next, final step of our analysis, given in §3.3.4, we will, for sufficiently small $\delta$, lower-bound the costs of these time-inhomogeneous SSP problems and thereby lower-bound $\{\bar Q_t\}$.

As in the preceding subsection, for any probability distribution $P$, we write $P(\cdot \mid \cdot)$ for conditional probability and $E^P$ for expectation under $P$. Recall also that the sets $A_\delta(i,u)$, where $(i,u) \in R_o$, and the set $B_\delta$, defined in the preceding subsection, are subsets contained in the $\delta$-neighborhood of the transition probability parameters and expected one-stage cost parameters of the original SSP problem, respectively.

Lemma 3.3. Let Assumptions 1.1(i) and 2.2 hold.
Consider any sample path from the set of probability one given in Lemma 3.2. For any $\delta > 0$, with $t_0$ and $\bar Q_{t_0}$ given as in §3.3.2 for the chosen $\delta$, the iterates $\bar Q_t(i,u)$ defined by Equations (3.5)-(3.7) have the following properties for each $(i,u) \in R$ and $t \ge t_0$:

(a) $\bar Q_t(i,u)$ can be expressed as

$$ \bar Q_t(i,u) = E^{P^{iu}_t}\Big[ \sum_{k=0}^{\infty} g^{iu}_{t,k}(i_k, u_k) \Big] = \sum_{k=0}^{\infty} E^{P^{iu}_t}\big[ g^{iu}_{t,k}(i_k, u_k) \big] $$

for some probability distribution $P^{iu}_t$ of a Markov chain $\{(i_k, u_k)\}_{k \ge 0}$ on $S_o \times U$ and one-stage cost functions $g^{iu}_{t,k}: R_o \to \Re$, $k \ge 0$ (with $g^{iu}_{t,k} \equiv +\infty$ on $(S_o \times U) \setminus R_o$).

(b) The Markov chain $\{(i_k, u_k)\}_{k \ge 0}$ in (a) starts from state $(i_0, u_0) = (i,u)$ and is time-inhomogeneous. Its transition probabilities have the following product form: for all $(\bar i, \bar u), (j,v) \in R_o$,

$$ P^{iu}_t\big( i_1 = j,\ u_1 = v \mid i_0 = i,\ u_0 = u \big) = p^{iu}_{t,0}(j \mid i,u)\, \nu^{iu}_{t,1}(v \mid j), \qquad \text{for } k = 1, $$

$$ P^{iu}_t\big( i_k = j,\ u_k = v \mid i_{k-1} = \bar i,\ u_{k-1} = \bar u \big) = p^{iu}_{t,k-1}(j \mid \bar i, \bar u)\, \nu^{iu}_{t,k}(v \mid j), \qquad \text{for } k \ge 2, $$

where for all $k \ge 1$ and $(\bar i, \bar u) \in R_o$, $j \in S_o$,

$$ p^{iu}_{t,k-1}(\cdot \mid \bar i, \bar u) \in A_\delta(\bar i, \bar u), \qquad \mathrm{supp}\big( \nu^{iu}_{t,k}(\cdot \mid j) \big) \subset U(j), $$

and moreover, $p^{iu}_{t,0}(\cdot \mid i,u) = q^{iu}_t$ if $t \ge t_0$.

(c) The one-stage cost functions $g^{iu}_{t,k}$ in (a) satisfy

$$ g^{iu}_{t,k} \in B_\delta, \qquad k \ge 0, $$

and moreover, $g^{iu}_{t,0}(i,u) = g_t(i,u)$ if $t \ge t_0$.

(d) For the Markov chain in (a), there exists an integer $\bar k_t$ such that for $k \ge \bar k_t$, $\{(i_k, u_k)\}$ evolves and incurs costs as in the original SSP problem under the proper policy $\nu$; i.e., for $k \ge \bar k_t$,

$$ \nu^{iu}_{t,k}(\cdot \mid \bar i) = \nu(\cdot \mid \bar i), \qquad p^{iu}_{t,k}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o, \qquad g^{iu}_{t,k}(\bar i, \bar u) = g(\bar i, \bar u), \qquad (\bar i, \bar u) \in R_o. $$

Proof. The proof is by induction on $t$. For $t = t_0$, $\bar Q_{t_0}$ satisfies properties (a)-(d) by its definition and our choice of the sample path and $t_0$ (cf. Lemma 3.2). [In particular, for each $(i,u) \in R$, $p^{iu}_{t_0,k}$ and $\nu^{iu}_{t_0,k}$ in (a) are given by: for $k = 0$, $p^{iu}_{t_0,0}(\cdot \mid i,u) = q^{iu}_{t_0}$, $p^{iu}_{t_0,0}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o$, $(\bar i, \bar u) \in R_o \setminus \{(i,u)\}$; and for all $k \ge 1$, $p^{iu}_{t_0,k}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o$, $(\bar i, \bar u) \in R_o$, $\nu^{iu}_{t_0,k} = \nu$; whereas $\bar k_{t_0} = 1$ in (d).] For $t < t_0$, since $\bar Q_t = \bar Q_{t_0}$ by definition, they also satisfy (a)-(d).

So let us assume that properties (a)-(d) are satisfied by all $\bar Q_{t'}$, $t_0 \le t' \le t$, for some $t \ge t_0$. We will show that $\bar Q_{t+1}$ also has these properties. Consider $\bar Q_{t+1}(i,u)$ for each $(i,u) \in R$. To simplify notation, denote $\alpha = \alpha_t(i,u) \in [0,1]$ (cf. Lemma 3.2). By Equation (3.6),

$$ \bar Q_{t+1}(i,u) = (1-\alpha)\, \bar Q_t(i,u) + \alpha \big( g(i,u) + \omega_t(i,u) + \bar Q_\tau(s,v) \big), $$

where $s = j_t^{iu}$, $v = v_t^{iu}$, and $\tau = \tau^{iu}_{sv}(t)$. By the induction hypothesis, $\bar Q_t$ and $\bar Q_\tau(s,v)$ can be expressed as in (a), so denoting $P^{sv} = P^{sv}_\tau$ for short and noticing $P^{iu}_t(i_0 = i, u_0 = u) = 1$ by property (b), we have

$$ \bar Q_{t+1}(i,u) = \Big[ (1-\alpha)\, g^{iu}_{t,0}(i,u) + \alpha\big( g(i,u) + \omega_t(i,u) \big) \Big] + \sum_{k \ge 1} \Big\{ (1-\alpha)\, E^{P^{iu}_t}\big[ g^{iu}_{t,k}(i_k, u_k) \big] + \alpha\, E^{P^{sv}}\big[ g^{sv}_{\tau,k-1}(i_{k-1}, u_{k-1}) \big] \Big\} = \sum_{k \ge 0} C_k, \tag{3.16} $$

where

$$ C_0 = (1-\alpha)\, g^{iu}_{t,0}(i,u) + \alpha\big( g(i,u) + \omega_t(i,u) \big), \tag{3.17} $$

$$ C_k = (1-\alpha)\, E^{P^{iu}_t}\big[ g^{iu}_{t,k}(i_k, u_k) \big] + \alpha\, E^{P^{sv}}\big[ g^{sv}_{\tau,k-1}(i_{k-1}, u_{k-1}) \big], \qquad k \ge 1. \tag{3.18} $$

Next we will rewrite each term $C_k$ in a desirable form. During this procedure, we will construct the transition probabilities $\{p^{iu}_{t+1,k}\}$ and $\{\nu^{iu}_{t+1,k}\}$ that compose the probability distribution $P^{iu}_{t+1}$ of the time-inhomogeneous Markov chain for $t+1$, as well as the one-stage cost functions $\{g^{iu}_{t+1,k}\}$ required in the lemma. For clarity we divide the rest of the proof into five steps.

(1) We consider the term $C_0$ in Equation (3.17) and define the transition probabilities and one-stage costs for $k = 0$ and $t+1$. By the induction hypothesis and property (c), $g^{iu}_{t,0}(i,u) = g_t(i,u)$. Using this and the definition of $g_{t+1}$ [cf. Equation (3.9)], we have

$$ C_0 = (1-\alpha)\, g_t(i,u) + \alpha\big( g(i,u) + \omega_t(i,u) \big) = g_{t+1}(i,u). \tag{3.19} $$

Let us define the cost function and transition probabilities for $k = 0$ and $t+1$ by

$$ g^{iu}_{t+1,0} = g_{t+1}, \qquad p^{iu}_{t+1,0}(\cdot \mid i,u) = q^{iu}_{t+1}, \qquad p^{iu}_{t+1,0}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o, \quad (\bar i, \bar u) \in R_o \setminus \{(i,u)\}. $$

By Lemma 3.2 and our choice of the sample path, $g_{t+1} \in B_\delta$ and $q^{iu}_{t+1} \in A_\delta(i,u)$, so $g^{iu}_{t+1,0}$ and $p^{iu}_{t+1,0}$ satisfy the requirements in properties (b) and (c).

(2) We now consider the term $C_k$ in Equation (3.18), and we introduce several relations that will define the transition probabilities and one-stage costs for $k \ge 1$ and $t+1$ (the precise definitions will be given in the next two steps). Consider each $k \ge 1$. Let $P_1^k$ denote the law of $(i_k, u_k, i_{k+1})$ under $P^{iu}_t$, and let $P_2^k$ denote the law of $(i_{k-1}, u_{k-1}, i_k)$ under $P^{sv}$. Let $P_3^k$ denote the convex combination of them:

$$ P_3^k = (1-\alpha)\, P_1^k + \alpha\, P_2^k. $$

We regard $P_1^k$, $P_2^k$, $P_3^k$ as probability measures on the sample space $S_o \times U \times S_o$, and we denote by $X$, $Y$ and $Z$ the function that maps a point $(\bar i, \bar u, j)$ to its first, second, and third coordinate, respectively. By property (b) of $P^{iu}_t$ and $P^{sv}$ from the induction hypothesis, it is clear that under either $P_1^k$ or $P_2^k$, the possible values of $(X, Y)$ are from the set $R_o$ of state and feasible control pairs, so the subset $R_o \times S_o$ has probability 1 under $P_3^k$. Thus we can write $C_k$ in Equation (3.18) equivalently as

$$ C_k = \sum_{\bar i \in S_o} \sum_{\bar u \in U(\bar i)} \Big( (1-\alpha)\, P_1^k(X = \bar i, Y = \bar u)\, g^{iu}_{t,k}(\bar i, \bar u) + \alpha\, P_2^k(X = \bar i, Y = \bar u)\, g^{sv}_{\tau,k-1}(\bar i, \bar u) \Big). \tag{3.20} $$

In the next two steps, we will introduce one-stage cost functions $g^{iu}_{t+1,k}$ to rewrite Equation (3.20) equivalently as

$$ C_k = \sum_{\bar i \in S_o} \sum_{\bar u \in U(\bar i)} P_3^k(X = \bar i, Y = \bar u)\, g^{iu}_{t+1,k}(\bar i, \bar u). \tag{3.21} $$

We will also define the transition probabilities $\nu^{iu}_{t+1,k}(\cdot \mid \bar i)$ and $p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u)$ to express $P_3^k$ as

$$ P_3^k(X = \bar i, Y = \bar u) = P_3^k(X = \bar i)\, \nu^{iu}_{t+1,k}(\bar u \mid \bar i), \tag{3.22} $$

$$ P_3^k(X = \bar i, Y = \bar u, Z = j) = P_3^k(X = \bar i, Y = \bar u)\, p^{iu}_{t+1,k}(j \mid \bar i, \bar u), \tag{3.23} $$

for all $(\bar i, \bar u) \in R_o$ and $j \in S_o$. Note that in the above, by the definition of $P_3^k$,

$$ P_3^k(X = \bar i) = (1-\alpha)\, P^{iu}_t(i_k = \bar i) + \alpha\, P^{sv}(i_{k-1} = \bar i), \qquad \bar i \in S_o. \tag{3.24} $$

(3) We now define the one-stage cost functions for $k \ge 1$ and $t+1$. Consider each $k \ge 1$. Define the cost function $g^{iu}_{t+1,k}$ as follows: for each $(\bar i, \bar u) \in R_o$,

$$ g^{iu}_{t+1,k}(\bar i, \bar u) = (1-\alpha)\, \frac{P_1^k(X = \bar i, Y = \bar u)}{P_3^k(X = \bar i, Y = \bar u)}\, g^{iu}_{t,k}(\bar i, \bar u) + \alpha\, \frac{P_2^k(X = \bar i, Y = \bar u)}{P_3^k(X = \bar i, Y = \bar u)}\, g^{sv}_{\tau,k-1}(\bar i, \bar u) \tag{3.25} $$

if $P_3^k(X = \bar i, Y = \bar u) > 0$, and $g^{iu}_{t+1,k}(\bar i, \bar u) = g(\bar i, \bar u)$ otherwise. With this definition, it is clear that $C_k$ can be expressed as in Equation (3.21) and this expression is equivalent to the one given in Equation (3.20).

We verify that $g^{iu}_{t+1,k}$ satisfies the requirement in property (c); that is,

$$ g^{iu}_{t+1,k} \in B_\delta. \tag{3.26} $$

Consider each $(\bar i, \bar u) \in R_o$ and discuss two cases. If $P_3^k(X = \bar i, Y = \bar u) = 0$, then $|g^{iu}_{t+1,k}(\bar i, \bar u) - g(\bar i, \bar u)| = 0$ by definition. Suppose $P_3^k(X = \bar i, Y = \bar u) > 0$. Then by Equation (3.25), $g^{iu}_{t+1,k}(\bar i, \bar u)$ is a convex combination of $g^{iu}_{t,k}(\bar i, \bar u)$ and $g^{sv}_{\tau,k-1}(\bar i, \bar u)$, whereas $g^{iu}_{t,k}, g^{sv}_{\tau,k-1} \in B_\delta$ by the induction hypothesis (property (c)). This implies, by the definition of $B_\delta$, that $|g^{iu}_{t+1,k}(\bar i, \bar u) - g(\bar i, \bar u)| \le \delta$ for $(\bar i, \bar u) \in R$ and $g^{iu}_{t+1,k}(\bar i, \bar u) = 0$ for $(\bar i, \bar u) = (0,0)$. Combining the two cases, and in view also of the definition of $B_\delta$, we have that $g^{iu}_{t+1,k}$ satisfies Equation (3.26).

We verify that $g^{iu}_{t+1,k}$ satisfies the requirement in property (d). By the induction hypothesis $g^{iu}_{t,k} = g$ for $k \ge \bar k_t$ and $g^{sv}_{\tau,k-1} = g$ for $k \ge \bar k_\tau + 1$, whereas each component of $g^{iu}_{t+1,k}$ by definition either equals the corresponding component of $g$ or is a convex combination of the corresponding components of $g^{iu}_{t,k}$ and $g^{sv}_{\tau,k-1}$. Hence

$$ g^{iu}_{t+1,k} = g, \qquad k \ge \bar k_{t+1} \overset{\text{def}}{=} \max\{ \bar k_t,\ \bar k_\tau + 1 \}. \tag{3.27} $$

(4) We now define the transition probabilities for $k \ge 1$ and $t+1$. Consider each $k \ge 1$. Define the transition probability distributions $\nu^{iu}_{t+1,k}$ and $p^{iu}_{t+1,k}$ as follows:

$$ \nu^{iu}_{t+1,k}(\cdot \mid \bar i) = P_3^k(Y = \cdot \mid X = \bar i), \qquad \bar i \in S_o, \tag{3.28} $$

$$ p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u) = P_3^k(Z = \cdot \mid X = \bar i, Y = \bar u), \qquad (\bar i, \bar u) \in R_o. \tag{3.29} $$

If in the right-hand sides of Equations (3.28)-(3.29), an event being conditioned upon has probability zero, then let the corresponding conditional probability (which can be defined arbitrarily) be defined according to the following:

$$ P_3^k(Y = \cdot \mid X = \bar i) = \nu(\cdot \mid \bar i) \quad \text{if } P_3^k(X = \bar i) = 0, \qquad P_3^k(Z = \cdot \mid X = \bar i, Y = \bar u) = p^{\bar i \bar u}_o \quad \text{if } P_3^k(X = \bar i, Y = \bar u) = 0. $$

With the above definitions, the equalities (3.22) and (3.23) desired in step (2) of the proof clearly hold. We now verify that $\nu^{iu}_{t+1,k}$ and $p^{iu}_{t+1,k}$ satisfy the requirements in properties (b) and (d).

First, we show that $p^{iu}_{t+1,k}$ satisfies the requirement in property (b); that is,

$$ p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u) \in A_\delta(\bar i, \bar u), \qquad (\bar i, \bar u) \in R_o. $$

This holds by the definition of $p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u)$ if $P_3^k(X = \bar i, Y = \bar u) = 0$, so let us consider the case $P_3^k(X = \bar i, Y = \bar u) > 0$ for each $(\bar i, \bar u) \in R_o$. By the induction hypothesis, $P^{iu}_t$ and $P^{sv}$ satisfy property (b). Using this and the definition of $P_1^k$ and $P_2^k$, we have that for all $j \in S_o$,

$$ P_1^k(X = \bar i, Y = \bar u, Z = j) = P^{iu}_t(i_k = \bar i, u_k = \bar u)\, p^{iu}_{t,k}(j \mid \bar i, \bar u), \qquad P_2^k(X = \bar i, Y = \bar u, Z = j) = P^{sv}(i_{k-1} = \bar i, u_{k-1} = \bar u)\, p^{sv}_{\tau,k-1}(j \mid \bar i, \bar u), $$

which implies

$$ P_1^k(Z = \cdot \mid X = \bar i, Y = \bar u) = p^{iu}_{t,k}(\cdot \mid \bar i, \bar u), \qquad P_2^k(Z = \cdot \mid X = \bar i, Y = \bar u) = p^{sv}_{\tau,k-1}(\cdot \mid \bar i, \bar u), \tag{3.30} $$

and by property (b) from the induction hypothesis again,

$$ P_1^k(Z = \cdot \mid X = \bar i, Y = \bar u) \in A_\delta(\bar i, \bar u), \qquad P_2^k(Z = \cdot \mid X = \bar i, Y = \bar u) \in A_\delta(\bar i, \bar u). \tag{3.31} $$

Then, since $P_3^k = (1-\alpha) P_1^k + \alpha P_2^k$ with $\alpha \in [0,1]$, we have

$$ P_3^k(Z = \cdot \mid X = \bar i, Y = \bar u) = \frac{P_3^k(X = \bar i, Y = \bar u, Z = \cdot)}{P_3^k(X = \bar i, Y = \bar u)} = \big(1 - \beta(\bar i, \bar u)\big)\, P_1^k(Z = \cdot \mid X = \bar i, Y = \bar u) + \beta(\bar i, \bar u)\, P_2^k(Z = \cdot \mid X = \bar i, Y = \bar u), \tag{3.32} $$

where

$$ \beta(\bar i, \bar u) = \frac{\alpha\, P_2^k(X = \bar i, Y = \bar u)}{(1-\alpha)\, P_1^k(X = \bar i, Y = \bar u) + \alpha\, P_2^k(X = \bar i, Y = \bar u)}. $$

Since the set $A_\delta(\bar i, \bar u)$ is convex, using the fact that $\beta(\bar i, \bar u) \in [0,1]$, Equations (3.31)-(3.32) imply that $P_3^k(Z = \cdot \mid X = \bar i, Y = \bar u) \in A_\delta(\bar i, \bar u)$, and therefore, by definition [cf. Equation (3.29)], $p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u) = P_3^k(Z = \cdot \mid X = \bar i, Y = \bar u) \in A_\delta(\bar i, \bar u)$.

We now verify that $p^{iu}_{t+1,k}$ satisfies the requirement in property (d): for all $(\bar i, \bar u) \in R_o$,

$$ p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o, \qquad k \ge \bar k_{t+1} = \max\{ \bar k_t,\ \bar k_\tau + 1 \}. \tag{3.33} $$

By the induction hypothesis, property (d) is satisfied for $t$ and $\tau$; in particular, for all $(\bar i, \bar u) \in R_o$, $p^{iu}_{t,k}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o$ for $k \ge \bar k_t$ and $p^{sv}_{\tau,k-1}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o$ for $k \ge \bar k_\tau + 1$. In view of Equations (3.30) and (3.32), we have that if $P_3^k(X = \bar i, Y = \bar u) > 0$, then $p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u)$ is a convex combination of $p^{iu}_{t,k}(\cdot \mid \bar i, \bar u)$ and $p^{sv}_{\tau,k-1}(\cdot \mid \bar i, \bar u)$ and hence satisfies Equation (3.33). But if $P_3^k(X = \bar i, Y = \bar u) = 0$, $p^{iu}_{t+1,k}(\cdot \mid \bar i, \bar u) = p^{\bar i \bar u}_o$ by definition. Hence Equation (3.33) holds.

We now verify that $\nu^{iu}_{t+1,k}$ given by Equation (3.28) satisfies the requirements in properties (b) and (d). For each $\bar i \in S_o$, $\nu^{iu}_{t+1,k}(\cdot \mid \bar i) = \nu(\cdot \mid \bar i)$ by definition if $P_3^k(X = \bar i) = 0$; otherwise, similar to the preceding proof, $\nu^{iu}_{t+1,k}(\cdot \mid \bar i)$ can be expressed as a convex combination of $\nu^{iu}_{t,k}(\cdot \mid \bar i)$ and $\nu^{sv}_{\tau,k-1}(\cdot \mid \bar i)$:

$$ \nu^{iu}_{t+1,k}(\cdot \mid \bar i) = (1-\alpha)\, \frac{P_1^k(X = \bar i)}{P_3^k(X = \bar i)}\, \nu^{iu}_{t,k}(\cdot \mid \bar i) + \alpha\, \frac{P_2^k(X = \bar i)}{P_3^k(X = \bar i)}\, \nu^{sv}_{\tau,k-1}(\cdot \mid \bar i), $$

where if $k = 1$ and $\bar i = s$, we let $\nu^{sv}_{\tau,0}(\cdot \mid s)$ denote the distribution in $\mathcal{P}(U)$ that assigns probability 1 to the control $v$ [if $k = 1$ and $\bar i \ne s$, then the second term above is zero because $P^{sv}(i_0 = s, u_0 = v) = 1$ by the induction hypothesis and consequently, $P_2^1(X = \bar i) = P^{sv}(i_0 = \bar i) = 0$]. Combining the two cases, and using properties (b) and (d) of the induction hypothesis, we then have that $\mathrm{supp}\big( \nu^{iu}_{t+1,k}(\cdot \mid \bar i) \big) \subset U(\bar i)$ for $\bar i \in S_o$, and

$$ \nu^{iu}_{t+1,k}(\cdot \mid \bar i) = \nu(\cdot \mid \bar i), \qquad k \ge \bar k_{t+1}, \quad \bar i \in S_o, \tag{3.34} $$

which are the requirements for $\nu^{iu}_{t+1,k}$ in properties (b) and (d).

(5) In this last step of the proof, we define the Markov chain for $t+1$ and verify the expression for $\bar Q_{t+1}(i,u)$ given in property (a). Let the time-inhomogeneous Markov chain $\{(i_k, u_k)\}_{k \ge 0}$ with probability distribution $P^{iu}_{t+1}$, required in property (a) for $t+1$, be as follows. Let the chain start with $(i_0, u_0) = (i,u)$, and let its transition probabilities have the product forms given in property (b) for $t+1$, where $p^{iu}_{t+1,k}$, $k \ge 0$, and $\nu^{iu}_{t+1,k}$, $k \ge 1$, are the functions that we defined in the preceding proof. Also let the time-varying one-stage cost functions $g^{iu}_{t+1,k}$, $k \ge 0$, be as defined earlier. We have shown that these transition probabilities and one-stage cost functions satisfy the requirements in properties (b)-(d). To prove the lemma, what we still need to show is that with our definitions, the expression given in property (a) equals $\bar Q_{t+1}(i,u)$.

First of all, because our definitions of the transition probabilities and one-stage cost functions for $t+1$ satisfy property (d), they ensure that under $P^{iu}_{t+1}$, $\{(i_k, u_k)\}_{k \ge \bar k_{t+1}}$ evolves and incurs costs as in the original SSP problem under the proper stationary policy $\nu$. Consequently, $E^{P^{iu}_{t+1}}\big[ \sum_{k \ge 0} |g^{iu}_{t+1,k}(i_k, u_k)| \big]$ is well defined and finite, and the order of summation and expectation can be exchanged (the reason is the same as the one we gave at the end of §3.3.2 for the expression of $\bar Q_{t_0}$):

$$ E^{P^{iu}_{t+1}}\Big[ \sum_{k=0}^{\infty} g^{iu}_{t+1,k}(i_k, u_k) \Big] = \sum_{k=0}^{\infty} E^{P^{iu}_{t+1}}\big[ g^{iu}_{t+1,k}(i_k, u_k) \big]. \tag{3.35} $$

Hence, to prove property (a) for $t+1$, that is, to show

$$ \bar Q_{t+1}(i,u) = g^{iu}_{t+1,0}(i,u) + \sum_{k \ge 1} E^{P^{iu}_{t+1}}\big[ g^{iu}_{t+1,k}(i_k, u_k) \big], $$

we only need to show, in view of the fact that $\bar Q_{t+1}(i,u) = \sum_{k \ge 0} C_k$ [cf. Equation (3.16)], that

$$ C_0 = g^{iu}_{t+1,0}(i,u), \qquad C_k = E^{P^{iu}_{t+1}}\big[ g^{iu}_{t+1,k}(i_k, u_k) \big], \quad k \ge 1. \tag{3.36} $$

The first relation is true since by definition $g^{iu}_{t+1,0}(i,u) = g_{t+1}(i,u) = C_0$ [cf. Equation (3.19)]. We now prove the second equality for $C_k$, $k \ge 1$. For $k \ge 1$, recall that by Equation (3.21),

$$ C_k = \sum_{\bar i \in S_o} \sum_{\bar u \in U(\bar i)} P_3^k(X = \bar i, Y = \bar u)\, g^{iu}_{t+1,k}(\bar i, \bar u). $$

Hence, to prove the desired equality for $C_k$, it is sufficient to prove that

$$ P^{iu}_{t+1}(i_k = \bar i, u_k = \bar u) = P_3^k(X = \bar i, Y = \bar u), \qquad (\bar i, \bar u) \in R_o. \tag{3.37} $$

By the definition of $P^{iu}_{t+1}$, $P^{iu}_{t+1}(u_k = \bar u \mid i_k = \bar i) = \nu^{iu}_{t+1,k}(\bar u \mid \bar i)$ for all $(\bar i, \bar u) \in R_o$, so in view of Equation (3.22), the equality (3.37) will be implied if we prove

$$ P^{iu}_{t+1}(i_k = \bar i) = P_3^k(X = \bar i), \qquad \bar i \in S_o. \tag{3.38} $$

We verify Equation (3.38) by induction on $k$. For $k = 1$, using Equation (3.24) and property (b) of $P^{iu}_t$, $P^{sv}$, we have that for every $\bar i \in S_o$,

$$ P_3^1(X = \bar i) = (1-\alpha)\, P^{iu}_t(i_1 = \bar i) + \alpha\, P^{sv}(i_0 = \bar i) = (1-\alpha)\, p^{iu}_{t,0}(\bar i \mid i,u) + \alpha\, e_s(\bar i) = (1-\alpha)\, q^{iu}_t(\bar i) + \alpha\, e_{j_t^{iu}}(\bar i) = q^{iu}_{t+1}(\bar i) = p^{iu}_{t+1,0}(\bar i \mid i,u) = P^{iu}_{t+1}(i_1 = \bar i), $$

where the last three equalities follow from the definition of $q^{iu}_{t+1}$ [cf. Equation (3.10)], the definition of $p^{iu}_{t+1,0}$ and the definition of $P^{iu}_{t+1}$, respectively. Hence Equation (3.38) holds for $k = 1$.

Suppose Equation (3.38) holds for some $k \ge 1$. Then, by the definition of $P^{iu}_{t+1}$, we have that for all $j \in S_o$,

$$ P^{iu}_{t+1}(i_{k+1} = j) = \sum_{\bar i \in S_o} \sum_{\bar u \in U(\bar i)} P^{iu}_{t+1}(i_k = \bar i)\, \nu^{iu}_{t+1,k}(\bar u \mid \bar i)\, p^{iu}_{t+1,k}(j \mid \bar i, \bar u) = \sum_{\bar i \in S_o} \sum_{\bar u \in U(\bar i)} P_3^k(X = \bar i)\, \nu^{iu}_{t+1,k}(\bar u \mid \bar i)\, p^{iu}_{t+1,k}(j \mid \bar i, \bar u) = P_3^k(Z = j) = P_3^{k+1}(X = j), $$

where the second equality follows from the induction hypothesis, the third equality follows from Equations (3.22)-(3.23), and the last equality follows from the definition of $P_3^{k+1}$ and $P_3^k$. This completes the induction and proves Equation (3.38) for all $k \ge 1$, which in turn establishes Equation (3.37) for all $k \ge 1$. Consequently, for all $k \ge 1$, the desired equality (3.36) for $C_k$ holds, and we conclude that $\bar Q_{t+1}(i,u)$ equals the expressions given in Equation (3.35). This completes the proof of the lemma.

3.3.4. Lower boundedness of $\{\bar Q_t\}$. In §3.3.2 and §3.3.3, we have shown that for each sample path from a set of probability one, and for each $\delta > 0$, we can construct a sequence $\{\bar Q_t\}$ such that $\bar Q_t(i,u)$ for each $(i,u) \in R$ is the expected total cost of a randomized Markov policy in an MDP that has time-varying transition and one-stage cost parameters lying in the $\delta$-neighborhood of the respective parameters of the original SSP problem. By Lemma 3.1, therefore, to complete the boundedness proof for the Q-learning iterates $\{Q_t\}$, it is sufficient to show that when $\delta$ is sufficiently small, the expected total costs of all policies in all these neighboring MDPs cannot be unbounded from below.

The latter can in turn be addressed by considering the following total cost MDP. It has the same state space $S_o$ with state 0 being absorbing and cost-free. For each state $i \in S$, the set of feasible controls consists of not only the regular controls $U(i)$, but also the transition probabilities and one-stage cost functions. More precisely, the extended control set at state $i$ is defined to be

$$ \tilde U(i) = \Big\{ (u, p^{iu}, \kappa_i) \ \Big|\ u \in U(i),\ p^{iu} \in A_\delta(i,u),\ \kappa_i \in B^i_\delta \Big\}, $$

where $B^i_\delta$ is a set of one-stage cost functions at $i$: with $z = \{z(u) \mid u \in U(i)\}$,

$$ B^i_\delta = \Big\{ z \ \Big|\ \big| z(u) - g(i,u) \big| \le \delta,\ u \in U(i) \Big\}, $$

for the given $\delta > 0$. Applying control $(u, p^{iu}, \kappa_i)$ at $i \in S$, the one-stage cost, denoted by $c\big((u, p^{iu}, \kappa_i), i\big)$, is

$$ c\big( (u, p^{iu}, \kappa_i), i \big) = \kappa_i(u), $$

and the probability of transition from state $i$ to $j$ is $p^{iu}(j)$. We refer to this problem as the extended SSP problem. If we can show that the optimal total costs of this problem for all initial states are finite, then it will imply that $\{\bar Q_t\}$ is bounded below, because by Lemma 3.3, for each $t$ and $(i,u) \in R$, $\bar Q_t(i,u)$ equals the expected total cost of some policy in the extended SSP problem for the initial state $i$.

The extended SSP problem has a finite number of states and a compact control set for each state. Its one-stage cost $c\big((u, p^{iu}, \kappa_i), i\big)$ is a continuous function of the control component $\kappa_i$, whereas its transition probabilities are continuous functions of the control component $(u, p^{iu})$, for each state $i$. With these compactness and continuity properties, the extended SSP problem falls into the set of SSP models analyzed in Bertsekas and Tsitsiklis [2]. Based on the results of Bertsekas and Tsitsiklis [2], the optimal total cost function of the extended SSP problem is finite everywhere if Assumption 1.1 holds in this problem, that is, if the extended SSP problem satisfies the following two conditions: (i) there exists at least one proper deterministic stationary policy, and (ii) any improper deterministic stationary policy incurs infinite cost for some initial state.

Lemma 3.4 (Bertsekas and Tsitsiklis [2]). If the extended SSP problem satisfies Assumption 1.1, then its optimal total cost is finite for every initial state.

The extended SSP problem clearly has at least one proper deterministic stationary policy, which is to apply at a state $i \in S$ the control $\big(\mu(i), p^{i\mu(i)}_o, g_i\big)$, where $\mu$ is a proper policy in the set SD of the original SSP problem (such a policy exists in view of Assumption 1.1(i) on the original SSP problem) and $g_i = \{g(i,u) \mid u \in U(i)\}$. We now show that for sufficiently small $\delta$, any improper deterministic stationary policy of the extended SSP problem incurs infinite cost for some initial state.

To this end, let us restrict $\delta$ to be no greater than some $\delta_0 > 0$, for which $p_{ij}(u) > 0$ implies $p^{iu}(j) > 0$ for all $p^{iu} \in A_\delta(i,u)$ and $(i,u) \in R$; i.e.,

$$ p^{iu}_o \prec p^{iu}, \qquad p^{iu} \in A_\delta(i,u), \quad (i,u) \in R, \quad \delta \le \delta_0. \tag{3.39} $$

[Recall that we also have $p^{iu} \prec p^{iu}_o$ in view of the definition of $A_\delta(i,u)$.] To simplify notation, denote

$$ \mathcal{A}_\delta = \prod_{(i,u) \in R} A_\delta(i,u). $$

Recall the definition of the set $B_\delta$, which is the subset of vectors in the $\delta$-neighborhood of the expected one-stage cost vector $g$ of the original problem: with $c = \{c(i,u) \mid (i,u) \in R_o\}$,

$$ B_\delta = \Big\{ c \ \Big|\ c(0,0) = 0 \ \text{and}\ \big| c(i,u) - g(i,u) \big| \le \delta,\ (i,u) \in R \Big\}. $$

Note that $B_\delta = \prod_{i \in S_o} B^i_\delta$, where $B^0_\delta = \{0\}$ and $B^i_\delta$, $i \in S$, are as defined earlier [for the control sets $\tilde U(i)$ of the extended SSP problem].

For each $\xi \in \mathcal{A}_\delta$ and $c \in B_\delta$, let us call an MDP a perturbed SSP problem with parameters $(\xi, c)$, if it is the same as the original SSP problem except that the transition probabilities and one-stage costs for $(i,u) \in R$ are given by the respective components of $\xi$ and $c$.

Consider now a deterministic and stationary policy $\gamma$ of the extended SSP problem, which applies at each state $i$ some feasible control $\gamma(i) = \big(\mu(i), p^i, \kappa_i\big) \in \tilde U(i)$. The regular controls $\mu(i)$ that $\gamma$ applies at states $i$ correspond to a deterministic stationary policy of the original SSP problem, which we denote by $\mu$. Then, by Equation (3.39), $\gamma$ is proper (or improper) in the extended SSP problem if and only if $\mu$ is proper (or improper) in the original SSP problem. This is because by Equation (3.39), the topology of the transition graph of the Markov chain on $S_o$ that $\gamma$ induces in the extended SSP problem is the same as that of the Markov chain induced by $\mu$ in the original SSP problem, regardless of the two other control components $(p^i, \kappa_i)$ of $\gamma$. Therefore, for Assumption 1.1(ii) to hold in the extended SSP problem, it is sufficient that any improper policy $\mu$ in SD of the original problem has infinite cost for at least one initial state, in all perturbed SSP problems with parameters $\xi \in \mathcal{A}_\delta$ and $c \in B_\delta$ [cf. the relation between $\mathcal{A}_\delta$, $B_\delta$ and the control sets $\tilde U(i)$]. The next lemma shows that the latter is true for sufficiently small $\delta$, thus providing the result we want.

Lemma 3.5. Suppose the original SSP problem satisfies Assumption 1.1(ii).
Then there exists $\delta_1 \in (0, \delta_0]$, where $\delta_0$ is as given in Equation (3.39), such that for all $\delta \le \delta_1$, the following holds: for any improper policy $\mu \in$ SD of the original problem, there exists a state $i$ (depending on $\mu$) with

$$ \liminf_{k \to \infty}\ \inf_{\xi \in \mathcal{A}_\delta,\ c \in B_\delta} J^\mu_{k,(\xi,c)}(i) = +\infty, $$

where $J^\mu_{k,(\xi,c)}$ is the $k$-stage cost function of $\mu$ in the perturbed SSP problem with parameters $(\xi, c)$.

For the proof, we will use a relation between the long-run average cost of a stationary policy and the total cost of that policy, and we will also use a continuity property of the average cost with respect to perturbations of transition probabilities and one-stage costs. The next two lemmas state two facts that will be used in our proof.
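As a concrete illustration of the $\delta$-perturbation sets that parameterize the perturbed SSP problems above, the following minimal Python sketch checks membership of a perturbed transition vector in $A_\delta(i,u)$ (componentwise $\delta$-closeness, still a probability distribution, and absolute continuity with respect to the original) and of a perturbed cost in the corresponding $\delta$-neighborhood at one state-control pair. The data and function names are illustrative only.

import numpy as np

def in_A_delta(d, p, delta, tol=1e-12):
    # Membership test for A_delta(i, u) with base distribution p = p_o^{iu}.
    is_prob = bool(np.all(d >= -tol)) and abs(d.sum() - 1.0) <= tol
    close = bool(np.all(np.abs(d - p) <= delta + tol))
    abs_cont = bool(np.all(d[p == 0.0] <= tol))   # supp(d) contained in supp(p)
    return is_prob and close and abs_cont

p_iu = np.array([0.7, 0.0, 0.3])                  # original transition probabilities p_ij(u)
delta = 0.05

d = np.array([0.72, 0.0, 0.28])                   # mass moved only within supp(p_iu)
print(in_A_delta(d, p_iu, delta))                 # True

g_iu, c_iu = 1.0, 1.04                            # original and perturbed one-stage cost
print(abs(c_iu - g_iu) <= delta)                  # True: within the delta-neighborhood at (i, u)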


An Introduction to Malliavin calculus and its applications

An Introduction to Malliavin calculus and its applications An Inroducion o Malliavin calculus and is applicaions Lecure 5: Smoohness of he densiy and Hörmander s heorem David Nualar Deparmen of Mahemaics Kansas Universiy Universiy of Wyoming Summer School 214

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Martingales Stopping Time Processes

Martingales Stopping Time Processes IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765. Volume 11, Issue 1 Ver. II (Jan - Feb. 2015), PP 59-64 www.iosrjournals.org Maringales Sopping Time Processes I. Fulaan Deparmen

More information

Convergence of the Neumann series in higher norms

Convergence of the Neumann series in higher norms Convergence of he Neumann series in higher norms Charles L. Epsein Deparmen of Mahemaics, Universiy of Pennsylvania Version 1.0 Augus 1, 003 Absrac Naural condiions on an operaor A are given so ha he Neumann

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

arxiv: v1 [math.pr] 19 Feb 2011

arxiv: v1 [math.pr] 19 Feb 2011 A NOTE ON FELLER SEMIGROUPS AND RESOLVENTS VADIM KOSTRYKIN, JÜRGEN POTTHOFF, AND ROBERT SCHRADER ABSTRACT. Various equivalen condiions for a semigroup or a resolven generaed by a Markov process o be of

More information

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales Advances in Dynamical Sysems and Applicaions. ISSN 0973-5321 Volume 1 Number 1 (2006, pp. 103 112 c Research India Publicaions hp://www.ripublicaion.com/adsa.hm The Asympoic Behavior of Nonoscillaory Soluions

More information

The Strong Law of Large Numbers

The Strong Law of Large Numbers Lecure 9 The Srong Law of Large Numbers Reading: Grimme-Sirzaker 7.2; David Williams Probabiliy wih Maringales 7.2 Furher reading: Grimme-Sirzaker 7.1, 7.3-7.5 Wih he Convergence Theorem (Theorem 54) and

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Ann. Funct. Anal. 2 (2011), no. 2, A nnals of F unctional A nalysis ISSN: (electronic) URL:

Ann. Funct. Anal. 2 (2011), no. 2, A nnals of F unctional A nalysis ISSN: (electronic) URL: Ann. Func. Anal. 2 2011, no. 2, 34 41 A nnals of F uncional A nalysis ISSN: 2008-8752 elecronic URL: www.emis.de/journals/afa/ CLASSIFICAION OF POSIIVE SOLUIONS OF NONLINEAR SYSEMS OF VOLERRA INEGRAL EQUAIONS

More information

Stationary Distribution. Design and Analysis of Algorithms Andrei Bulatov

Stationary Distribution. Design and Analysis of Algorithms Andrei Bulatov Saionary Disribuion Design and Analysis of Algorihms Andrei Bulaov Algorihms Markov Chains 34-2 Classificaion of Saes k By P we denoe he (i,j)-enry of i, j Sae is accessible from sae if 0 for some k 0

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Mixing times and hitting times: lecture notes

Mixing times and hitting times: lecture notes Miing imes and hiing imes: lecure noes Yuval Peres Perla Sousi 1 Inroducion Miing imes and hiing imes are among he mos fundamenal noions associaed wih a finie Markov chain. A variey of ools have been developed

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Optimality Conditions for Unconstrained Problems

Optimality Conditions for Unconstrained Problems 62 CHAPTER 6 Opimaliy Condiions for Unconsrained Problems 1 Unconsrained Opimizaion 11 Exisence Consider he problem of minimizing he funcion f : R n R where f is coninuous on all of R n : P min f(x) x

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities:

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities: Mah 4 Eam Review Problems Problem. Calculae he 3rd Taylor polynomial for arcsin a =. Soluion. Le f() = arcsin. For his problem, we use he formula f() + f () + f ()! + f () 3! for he 3rd Taylor polynomial

More information

A Note on Superlinear Ambrosetti-Prodi Type Problem in a Ball

A Note on Superlinear Ambrosetti-Prodi Type Problem in a Ball A Noe on Superlinear Ambrosei-Prodi Type Problem in a Ball by P. N. Srikanh 1, Sanjiban Sanra 2 Absrac Using a careful analysis of he Morse Indices of he soluions obained by using he Mounain Pass Theorem

More information

Existence of positive solution for a third-order three-point BVP with sign-changing Green s function

Existence of positive solution for a third-order three-point BVP with sign-changing Green s function Elecronic Journal of Qualiaive Theory of Differenial Equaions 13, No. 3, 1-11; hp://www.mah.u-szeged.hu/ejqde/ Exisence of posiive soluion for a hird-order hree-poin BVP wih sign-changing Green s funcion

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

arxiv: v1 [math.fa] 9 Dec 2018

arxiv: v1 [math.fa] 9 Dec 2018 AN INVERSE FUNCTION THEOREM CONVERSE arxiv:1812.03561v1 [mah.fa] 9 Dec 2018 JIMMIE LAWSON Absrac. We esablish he following converse of he well-known inverse funcion heorem. Le g : U V and f : V U be inverse

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

Existence Theory of Second Order Random Differential Equations

Existence Theory of Second Order Random Differential Equations Global Journal of Mahemaical Sciences: Theory and Pracical. ISSN 974-32 Volume 4, Number 3 (22), pp. 33-3 Inernaional Research Publicaion House hp://www.irphouse.com Exisence Theory of Second Order Random

More information

6. Stochastic calculus with jump processes

6. Stochastic calculus with jump processes A) Trading sraegies (1/3) Marke wih d asses S = (S 1,, S d ) A rading sraegy can be modelled wih a vecor φ describing he quaniies invesed in each asse a each insan : φ = (φ 1,, φ d ) The value a of a porfolio

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem

An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem An Opimal Approximae Dynamic Programming Algorihm for he Lagged Asse Acquisiion Problem Juliana M. Nascimeno Warren B. Powell Deparmen of Operaions Research and Financial Engineering Princeon Universiy

More information

A New Perturbative Approach in Nonlinear Singularity Analysis

A New Perturbative Approach in Nonlinear Singularity Analysis Journal of Mahemaics and Saisics 7 (: 49-54, ISSN 549-644 Science Publicaions A New Perurbaive Approach in Nonlinear Singulariy Analysis Ta-Leung Yee Deparmen of Mahemaics and Informaion Technology, The

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Stable approximations of optimal filters

Stable approximations of optimal filters Sable approximaions of opimal filers Joaquin Miguez Deparmen of Signal Theory & Communicaions, Universidad Carlos III de Madrid. E-mail: joaquin.miguez@uc3m.es Join work wih Dan Crisan (Imperial College

More information

POSITIVE SOLUTIONS OF NEUTRAL DELAY DIFFERENTIAL EQUATION

POSITIVE SOLUTIONS OF NEUTRAL DELAY DIFFERENTIAL EQUATION Novi Sad J. Mah. Vol. 32, No. 2, 2002, 95-108 95 POSITIVE SOLUTIONS OF NEUTRAL DELAY DIFFERENTIAL EQUATION Hajnalka Péics 1, János Karsai 2 Absrac. We consider he scalar nonauonomous neural delay differenial

More information

Supplementary Material

Supplementary Material Dynamic Global Games of Regime Change: Learning, Mulipliciy and iming of Aacks Supplemenary Maerial George-Marios Angeleos MI and NBER Chrisian Hellwig UCLA Alessandro Pavan Norhwesern Universiy Ocober

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

BOUNDEDNESS OF MAXIMAL FUNCTIONS ON NON-DOUBLING MANIFOLDS WITH ENDS

BOUNDEDNESS OF MAXIMAL FUNCTIONS ON NON-DOUBLING MANIFOLDS WITH ENDS BOUNDEDNESS OF MAXIMAL FUNCTIONS ON NON-DOUBLING MANIFOLDS WITH ENDS XUAN THINH DUONG, JI LI, AND ADAM SIKORA Absrac Le M be a manifold wih ends consruced in [2] and be he Laplace-Belrami operaor on M

More information

Optimal Server Assignment in Multi-Server

Optimal Server Assignment in Multi-Server Opimal Server Assignmen in Muli-Server 1 Queueing Sysems wih Random Conneciviies Hassan Halabian, Suden Member, IEEE, Ioannis Lambadaris, Member, IEEE, arxiv:1112.1178v2 [mah.oc] 21 Jun 2013 Yannis Viniois,

More information

arxiv:math/ v1 [math.nt] 3 Nov 2005

arxiv:math/ v1 [math.nt] 3 Nov 2005 arxiv:mah/0511092v1 [mah.nt] 3 Nov 2005 A NOTE ON S AND THE ZEROS OF THE RIEMANN ZETA-FUNCTION D. A. GOLDSTON AND S. M. GONEK Absrac. Le πs denoe he argumen of he Riemann zea-funcion a he poin 1 + i. Assuming

More information

Clarke s Generalized Gradient and Edalat s L-derivative

Clarke s Generalized Gradient and Edalat s L-derivative 1 21 ISSN 1759-9008 1 Clarke s Generalized Gradien and Edala s L-derivaive PETER HERTLING Absrac: Clarke [2, 3, 4] inroduced a generalized gradien for real-valued Lipschiz coninuous funcions on Banach

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

Families with no matchings of size s

Families with no matchings of size s Families wih no machings of size s Peer Franl Andrey Kupavsii Absrac Le 2, s 2 be posiive inegers. Le be an n-elemen se, n s. Subses of 2 are called families. If F ( ), hen i is called - uniform. Wha is

More information

LECTURE 1: GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS

LECTURE 1: GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS LECTURE : GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS We will work wih a coninuous ime reversible Markov chain X on a finie conneced sae space, wih generaor Lf(x = y q x,yf(y. (Recall ha q

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Optimal approximate dynamic programming algorithms for a general class of storage problems

Optimal approximate dynamic programming algorithms for a general class of storage problems Opimal approximae dynamic programming algorihms for a general class of sorage problems Juliana M. Nascimeno Warren B. Powell Deparmen of Operaions Research and Financial Engineering Princeon Universiy

More information

Boundedness and Exponential Asymptotic Stability in Dynamical Systems with Applications to Nonlinear Differential Equations with Unbounded Terms

Boundedness and Exponential Asymptotic Stability in Dynamical Systems with Applications to Nonlinear Differential Equations with Unbounded Terms Advances in Dynamical Sysems and Applicaions. ISSN 0973-531 Volume Number 1 007, pp. 107 11 Research India Publicaions hp://www.ripublicaion.com/adsa.hm Boundedness and Exponenial Asympoic Sabiliy in Dynamical

More information

On Oscillation of a Generalized Logistic Equation with Several Delays

On Oscillation of a Generalized Logistic Equation with Several Delays Journal of Mahemaical Analysis and Applicaions 253, 389 45 (21) doi:1.16/jmaa.2.714, available online a hp://www.idealibrary.com on On Oscillaion of a Generalized Logisic Equaion wih Several Delays Leonid

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN Inernaional Journal of Scienific & Engineering Research, Volume 4, Issue 10, Ocober-2013 900 FUZZY MEAN RESIDUAL LIFE ORDERING OF FUZZY RANDOM VARIABLES J. EARNEST LAZARUS PIRIYAKUMAR 1, A. YAMUNA 2 1.

More information

Chapter 6. Systems of First Order Linear Differential Equations

Chapter 6. Systems of First Order Linear Differential Equations Chaper 6 Sysems of Firs Order Linear Differenial Equaions We will only discuss firs order sysems However higher order sysems may be made ino firs order sysems by a rick shown below We will have a sligh

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

An Excursion into Set Theory using a Constructivist Approach

An Excursion into Set Theory using a Constructivist Approach An Excursion ino Se Theory using a Consrucivis Approach Miderm Repor Nihil Pail under supervision of Ksenija Simic Fall 2005 Absrac Consrucive logic is an alernaive o he heory of classical logic ha draws

More information

Approximating positive solutions of nonlinear first order ordinary quadratic differential equations

Approximating positive solutions of nonlinear first order ordinary quadratic differential equations Dhage & Dhage, Cogen Mahemaics (25, 2: 2367 hp://dx.doi.org/.8/233835.25.2367 APPLIED & INTERDISCIPLINARY MATHEMATICS RESEARCH ARTICLE Approximaing posiive soluions of nonlinear firs order ordinary quadraic

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs PROC. IEEE CONFERENCE ON DECISION AND CONTROL, 06 A Primal-Dual Type Algorihm wih he O(/) Convergence Rae for Large Scale Consrained Convex Programs Hao Yu and Michael J. Neely Absrac This paper considers

More information

SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F. Trench. SIAM J. Matrix Anal. Appl. 11 (1990),

SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F. Trench. SIAM J. Matrix Anal. Appl. 11 (1990), SPECTRAL EVOLUTION OF A ONE PARAMETER EXTENSION OF A REAL SYMMETRIC TOEPLITZ MATRIX* William F Trench SIAM J Marix Anal Appl 11 (1990), 601-611 Absrac Le T n = ( i j ) n i,j=1 (n 3) be a real symmeric

More information

Math 10B: Mock Mid II. April 13, 2016

Math 10B: Mock Mid II. April 13, 2016 Name: Soluions Mah 10B: Mock Mid II April 13, 016 1. ( poins) Sae, wih jusificaion, wheher he following saemens are rue or false. (a) If a 3 3 marix A saisfies A 3 A = 0, hen i canno be inverible. True.

More information

arxiv: v1 [math.pr] 28 Nov 2016

arxiv: v1 [math.pr] 28 Nov 2016 Backward Sochasic Differenial Equaions wih Nonmarkovian Singular Terminal Values Ali Devin Sezer, Thomas Kruse, Alexandre Popier Ocober 15, 2018 arxiv:1611.09022v1 mah.pr 28 Nov 2016 Absrac We solve a

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Quasi-sure Stochastic Analysis through Aggregation

Quasi-sure Stochastic Analysis through Aggregation E l e c r o n i c J o u r n a l o f P r o b a b i l i y Vol. 16 (211), Paper no. 67, pages 1844 1879. Journal URL hp://www.mah.washingon.edu/~ejpecp/ Quasi-sure Sochasic Analysis hrough Aggregaion H. Mee

More information

4. Advanced Stability Theory

4. Advanced Stability Theory Applied Nonlinear Conrol Nguyen an ien - 4 4 Advanced Sabiliy heory he objecive of his chaper is o presen sabiliy analysis for non-auonomous sysems 41 Conceps of Sabiliy for Non-Auonomous Sysems Equilibrium

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

INDEPENDENT SETS IN GRAPHS WITH GIVEN MINIMUM DEGREE

INDEPENDENT SETS IN GRAPHS WITH GIVEN MINIMUM DEGREE INDEPENDENT SETS IN GRAPHS WITH GIVEN MINIMUM DEGREE JAMES ALEXANDER, JONATHAN CUTLER, AND TIM MINK Absrac The enumeraion of independen ses in graphs wih various resricions has been a opic of much ineres

More information

Oscillation of an Euler Cauchy Dynamic Equation S. Huff, G. Olumolode, N. Pennington, and A. Peterson

Oscillation of an Euler Cauchy Dynamic Equation S. Huff, G. Olumolode, N. Pennington, and A. Peterson PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON DYNAMICAL SYSTEMS AND DIFFERENTIAL EQUATIONS May 4 7, 00, Wilmingon, NC, USA pp 0 Oscillaion of an Euler Cauchy Dynamic Equaion S Huff, G Olumolode,

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

The Arcsine Distribution

The Arcsine Distribution The Arcsine Disribuion Chris H. Rycrof Ocober 6, 006 A common heme of he class has been ha he saisics of single walker are ofen very differen from hose of an ensemble of walkers. On he firs homework, we

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Almost Sure Degrees of Truth and Finite Model Theory of Łukasiewicz Fuzzy Logic

Almost Sure Degrees of Truth and Finite Model Theory of Łukasiewicz Fuzzy Logic Almos Sure Degrees of Truh and Finie odel Theory of Łukasiewicz Fuzzy Logic Rober Kosik Insiue of Informaion Business, Vienna Universiy of Economics and Business Adminisraion, Wirschafsuniversiä Wien,

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information