Bayesian Congestion Control over a Markovian Network Bandwidth Process



Parisa Mansourifard, Bhaskar Krishnamachari, Tara Javidi
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA
Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA
Emails: parisama@usc.edu, bkrishna@usc.edu, tjavidi@ucsd.edu

Abstract: We formulate a Bayesian congestion control problem in which a source must select the transmission rate over a network whose available bandwidth is modeled as a time-homogeneous finite-state Markov chain. The decision to transmit at a rate below the instantaneous available bandwidth results in an under-utilization of the resource, while transmission at rates higher than the available bandwidth results in a linear penalty. The trade-off is further complicated by the asymmetry in the information acquisition process: transmission rates that happen to be larger than the instantaneous available bandwidth result in perfect observation of the state of the bandwidth process. In contrast, when the transmission rate is below the instantaneous available bandwidth, only a (potentially rather loose) lower bound on the available bandwidth is revealed. We show that the problem of maximizing the throughput of the source while avoiding congestion loss can be expressed as a Partially Observable Markov Decision Process (POMDP). We prove structural results providing bounds on the optimal actions. The obtained bounds yield tractable sub-optimal solutions that are shown via simulations to perform well.

I. INTRODUCTION

In many network protocols, a device must set the communication parameters to maximize the utilization of a resource whose availability is a stochastic process. One prominent example is congestion control, in which a transmitter must select the transmission rate to utilize the available bandwidth, which varies randomly due to the dynamic nature of the traffic load imposed by other users on the network.
In our congestion control problem, we consider a user whose available bandwidth varies as a Markovian process. A sender wants to set its transmission rate at each time step. If the sender selects a rate higher than the available bandwidth, it can utilize the whole available bandwidth but has to pay an over-utilization penalty (which is a function of how much the selected rate exceeds the available bandwidth). In this case, we assume that perfect information about the current available bandwidth is revealed (full observation). If the user selects a rate less than the available bandwidth, it does not experience loss (no penalty), but the available bandwidth is under-utilized. Also, in this case, the sender gets only partial information about the available bandwidth, i.e., it can infer that the current available bandwidth is larger than or equal to the selected rate, but not its exact value. In such a setting, there is a trade-off between getting more information about the available bandwidth and paying less penalty. The goal is to find the optimal policy to maximize the total reward (utilization minus penalty) over a finite horizon.¹ We formalize this problem as a Partially Observable Markov Decision Process (POMDP) [1], since the state of the underlying Markov chain, called the resource state, is not fully observed. Since the POMDP problem does not have an efficiently computable solution [2], we instead present upper and lower bounds on the optimal actions. The lower bound corresponds to the myopic policy, which at each time step selects an action maximizing the immediate expected reward, ignoring the future. Intuitively, selecting actions higher than the myopic action results in less immediate reward but also more information about the resource state. Thus, to learn more about the resource state, the optimal action prefers to exceed the myopic action. We prove that the myopic policy has a percentile threshold structure, i.e., it selects an action corresponding to the lowest state above a given percentile of the beliefs.
The upper bound on the optimal action has a similar closed-form structure, based on the belief vector and the system parameters. In order to derive the upper bound, we further assume that the background Markov chain has a particular State-Independent State Change (SISC) structure (see Section III for more details). We consider the effect of different parameters on the bounds via simulation. The remainder of the paper is organized as follows: Related works are presented in Section II. In Section III, the problem formulation is introduced. In Section IV, the main results of the paper are stated. Section V presents some key properties needed for deriving our main results. Simulation results are presented in Section VI. Finally, Section VII concludes the paper.

II. RELATED WORK

Congestion control is a fundamental problem in networking and has a rich literature on algorithms and theoretical analyses (see e.g. [3], [4]). Existing techniques, such as the Transmission Control Protocol (TCP), adopt an Additive Increase Multiplicative Decrease (AIMD) algorithm, introduced in [5], that adjusts the congestion window based on the transmission acknowledgments. The performance of AIMD is analyzed in the literature, e.g. [6], where Altman et al. introduce a general model based on a multi-state Markov chain for the moments at which the congestion is detected.

¹ The results in this paper can be readily extended to the infinite-horizon discounted reward.

Although AIMD often

works well in practice, it is a heuristic that is not guaranteed to be optimal. In this work, we aim to identify the structure of optimal congestion control, albeit under the special case of a Markovian available bandwidth process. Threshold structures of optimal transmission policies have been established for simple related problems with a two-state Gilbert-Elliott channel in [7], [8]. Here, we consider the more general case of a multi-state Markovian channel.

Some recent works in the literature, e.g., [9], [10], formulate the problem of inventory management, where some number of items needs to be stocked (the inventory) and the demand is a stochastic process. The problem of inventory management has a close connection to our problem, as we can map the demand to the resource (bandwidth) and the inventory level to the action (rate allocated). Of these, the most closely related work is that of Bensoussan et al. [9], who consider a POMDP problem where the demand is a Markovian process. They consider the setting where the resources and actions are both continuous, as well as the case where the resources are discrete but the actions remain continuous. They use the un-normalized beliefs to prove the existence of the optimal policy, which is challenging for the continuous setting. For these settings, they also show that the optimal actions exceed the myopic actions. In this paper, in contrast to their work, we consider the case where both the resource states and the actions are discrete and finite. Thus, the existence of the optimal policy is trivial [1]. Further, by investigating a specific form of the transition probability, SISC, we derive additional properties of the optimal policy and the reward functions. Furthermore, this work is the first to present an upper bound for the optimal action. Besides giving insight into the optimal actions, this bound can also be used to speed up the search for the optimal actions in the dynamic programming. In [10], Besbes et al.
consider a similar problem with the assumption that the resource state is an independent and identically distributed process, which is a specific form of Markovian process. They assume that the underlying resource distribution is not available and needs to be estimated from historical data. They show that a percentile threshold policy (similar to our myopic action) is optimal and characterize the implications of partial observations on the performance of the optimal policy in both discrete and continuous settings. We consider the Markovian case, in which the given percentile threshold policy constructs a lower bound for the optimal actions.

III. PROBLEM FORMULATION

We consider a discrete-time finite-state Markov process, whose state is denoted by $B_t$. At each time step, the user selects an action according to the history of observations, thus earning a reward as a function of the selected action and the state $B_t$, and gathers some partial information about the current state. The objective is to find an optimal policy, as a sequence of actions, that maximizes the total expected discounted reward accumulated over the finite horizon. Let us denote the finite horizon by $T$ and let the discrete time steps be indexed by $t = 1, 2, \ldots, T$. We formulate our problem within a POMDP-based framework defined as a tuple $\{\mathcal{M}, P, \mathcal{B}, \mathcal{A}, \mathcal{O}, U, R\}$ where:

State: The state, $B_t$, is one of the elements of a finite set denoted by $\mathcal{M} = \{1, 2, \ldots, M\}$.

State transition: The state $B_t$ varies over time according to a Markov process with a known transition probability matrix, denoted by $P$. This matrix is an $M \times M$ matrix with elements $P_{i,j} = P(B_{t+1} = j \mid B_t = i)$, $\forall i, j \in \mathcal{M}$, which indicates the probability of moving from state $i$ at a time step to state $j$ at the next time step.

Belief vector: The probability distribution of the resource state (assuming a finite state set), given all past observations, is denoted by a belief vector $b_t = [b_t(1), \ldots, b_t(M)]$, with elements $b_t(k) = P(B_t = k)$, $\forall k \in \mathcal{M}$.
In other words, $b_t$ represents the probability distribution of $B_t$ over all possible states of $\mathcal{M}$. The set of all possible belief vectors is denoted by $\mathcal{B}$. The goal is to make a decision at each time step based on the history of observations; but due to the lack of full information, the decision should be based on the belief vector. It can be shown that the belief vector is a sufficient statistic of the complete observation history (see e.g., [1]).

Action: At each time step, according to the current belief, we choose an action $r_t \in \mathcal{A} = \{1, \ldots, M\}$. Note that the set of actions is equal to the set of states, i.e., $\mathcal{A} = \mathcal{M}$.

Observed information: The observed information at time step $t$ is defined by the event $o_t(r_t) \in \mathcal{O}$ as a function of the selected action. The possible events corresponding to the action $r_t$ are as follows:
- $o_t(r_t) = \{B_t = i\}$, $i = 1, \ldots, r_t - 1$, is the event of fully observing $B_t$. This corresponds to the selection of an action higher than $B_t$.
- $o_t(r_t) = \{B_t \ge r_t\}$ is the event that $B_t$ is larger than or equal to the selected action.

Belief updating: The belief updating is a mapping $U : \mathcal{A} \times \mathcal{O} \times \mathcal{B} \to \mathcal{B}$. The belief vector for the next time step, given the selected action and the observation, is governed by:

$$b_{t+1} = \begin{cases} T_{r_t} b_t P & \text{if } r_t \le B_t \\ I_{B_t} P & \text{if } r_t > B_t, \end{cases}$$ (1)

where $I_a$ is the $M$-dimensional unit vector with 1 in the $a$-th position. $T_r$ is a non-linear operation on a belief vector $b$, as follows:

$$T_r b(i) = \begin{cases} 0 & \text{if } i < r \\ \dfrac{b(i)}{\sum_{j=r}^{M} b(j)} & \text{if } i \ge r. \end{cases}$$ (2)

Reward: The immediate reward earned at time step $t$ is a mapping $R : \mathcal{A} \times \mathcal{O} \to \mathbb{R}$, which is given by:

$$R(B_t, r_t) = \begin{cases} B_t - C(r_t - B_t) & \text{if } r_t > B_t \\ r_t & \text{if } r_t \le B_t, \end{cases}$$ (3)

where $C$ is the over-utilization penalty coefficient.

The policy $\pi$ specifies a sequence of functions $\pi_1, \ldots, \pi_T$, where $\pi_t$ is the decision rule and maps a belief vector $b_t$ to an

action at time step $t$, i.e., $\pi_t : \mathcal{B} \to \mathcal{A}$, $r_t = \pi_t(b_t)$. The goal is to maximize the total expected discounted reward over the finite horizon, over all admissible policies $\pi$, given by

$$\max_{\pi} J_T^{\pi}(b_0) = \max_{\pi} E^{\pi}\Big[\sum_{t=0}^{T} \beta^t R(B_t; r_t) \,\Big|\, b_0\Big],$$ (4)

where $0 \le \beta \le 1$ denotes the discount factor and $b_0$ is the initial belief vector. $J_T^{\pi}(b_0)$ is the total expected discounted reward accumulated over the horizon $T$ under policy $\pi$, starting from the initial belief vector $b_0$. The optimal policy, denoted by $\pi^{opt}$, is a policy which maximizes (4); it exists since the number of admissible policies is finite. We may solve this POMDP problem using dynamic programming (DP), as the following recursive equations hold:

$$V_t(b_t) = \max_{r_t} V_t(b_t; r_t),$$ (5a)
$$V_t(b_t; r_t) = R(b_t; r_t) + \beta E\{V_{t+1}(b_{t+1}) \mid r_t\}, \quad t < T,$$ (5b)
$$V_T(b_T; r_T) = R(b_T; r_T),$$ (5c)

where $b_{t+1}$ is the updated belief vector for the next time step given in (1). The value function $V_t(b_t)$ is the maximum remaining expected reward accrued starting from time $t$ when the current belief vector is $b_t$. Note that for all $t = 1, \ldots, T$, $V_t(b_t) = \max_{\pi} J^{\pi}_{T-t+1}(b_t)$. $V_t(b_t; r_t)$ is the remaining expected reward accrued from time $t$ when choosing action $r_t$ at time $t$ and following the optimal policy for times $t+1, \ldots, T$, with the belief vector updated according to the action $r_t$. $V_t(b_t; r_t)$ is the sum of two terms: (i) the immediate expected reward, given by taking the expectation of (3):

$$R(b_t; r_t) = \sum_{i \in \mathcal{M}} b_t(i) R(i, r_t) = r_t \sum_{i=r_t}^{M} b_t(i) + \sum_{i=1}^{r_t - 1} b_t(i)\,[(1 + C)i - C r_t],$$ (6)

and (ii) the discounted future expected reward, which can be computed as follows:

$$V^f_t(b_t; r_t) = E\{V_{t+1}(b_{t+1}) \mid r_t\} = \sum_{i=r_t}^{M} b_t(i)\, V_{t+1}(T_{r_t} b_t P) + \sum_{i=1}^{r_t - 1} b_t(i)\, V_{t+1}(I_i P).$$ (7)

A policy $\pi$ is optimal if and only if, for $t = 1, \ldots, T$, $r_t = \pi_t(b_t)$ achieves the maximum in (5a), denoted by:

$$r^{opt}_t(b) = \arg\max_{r \in \mathcal{A}} V_t(b; r).$$ (8)

We present upper and lower bounds on the optimal actions in two theorems given in the next section, using some lemmas. All the proofs are given in the appendices.
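The recursion (5a)-(5c), together with the reward (3), the expected reward (6), and the belief update (1)-(2), can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`reward`, `expected_reward`, `propagate`, `truncate`, `unit`, `value`) are hypothetical, beliefs are plain tuples with `b[i - 1]` holding the belief of state `i`, and the plain recursion is exponential in the remaining horizon, so it is only practical for small `M` and `T`.

```python
def reward(B, r, C):
    """Immediate reward (3): the rate r if r <= B, else B minus a linear penalty."""
    return B - C * (r - B) if r > B else r


def expected_reward(b, r, C):
    """Expected immediate reward (6); b[i - 1] is the belief of state i."""
    return sum(p * reward(i, r, C) for i, p in enumerate(b, start=1))


def propagate(b, P):
    """One-step prediction b P (row vector times transition matrix)."""
    M = len(b)
    return tuple(sum(b[i] * P[i][j] for i in range(M)) for j in range(M))


def truncate(b, r):
    """Operator T_r in (2): zero the states below r and renormalize the rest."""
    mass = sum(b[r - 1:])
    return tuple(0.0 if i < r - 1 else b[i] / mass for i in range(len(b)))


def unit(i, M):
    """Belief I_i: all probability mass on state i."""
    return tuple(1.0 if k == i - 1 else 0.0 for k in range(M))


def value(b, t, T, P, C, beta):
    """Return (V_t(b), a maximizing action) via the recursion (5a)-(5c)."""
    M = len(b)
    best_v, best_r = None, None
    for r in range(1, M + 1):
        v = expected_reward(b, r, C)
        if t < T:
            tail = sum(b[r - 1:])      # P(B_t >= r): only a lower bound is observed
            if tail > 0:
                nxt = propagate(truncate(b, r), P)
                v += beta * tail * value(nxt, t + 1, T, P, C, beta)[0]
            for i in range(1, r):      # B_t = i < r: the state is fully observed
                if b[i - 1] > 0:
                    nxt = propagate(unit(i, M), P)
                    v += beta * b[i - 1] * value(nxt, t + 1, T, P, C, beta)[0]
        if best_v is None or v > best_v:   # ties resolve to the lowest action
            best_v, best_r = v, r
    return best_v, best_r
```

For instance, with a static three-state chain (identity `P`), a belief concentrated on state 2, `C = 5`, and `beta = 0.5`, calling `value((0.0, 1.0, 0.0), 1, 2, P, 5, 0.5)` evaluates the two-step problem; restricting the loop over `r` to a known interval of candidate actions is one way such a search could be shortened.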
For some of our results we need the following assumptions.

Assumption 1. The $P$ matrix satisfies the State-Independent State Change (SISC) property.

Assumption 2. The $P$ matrix satisfies the State-Independent State Change (SISC) property with edge effects, defined below.

By the SISC property, we mean that $P_{i,i+k} = P_{j,j+k}$. Therefore, we can define our transition matrix $P$ by indicating the probability of moving $k$ steps higher, $p_k$, independent of the current state, where $k < 0$ corresponds to moving $|k|$ steps lower. By edge effect, we mean that the transition matrix will be affected by the limits (edges) of the state set, since the state set $\mathcal{M}$ is limited from both sides. Therefore, the elements of the $P$ matrix for our desired SISC process are equal to $P_{i,j} = p_{j-i}$ for all $i, j$, except the following:

$$P_{1,j} = P_{1,j+1} + P_{2,j+1}, \quad j \le M - 1,$$ (9)
$$P_{M,j} = P_{M,j-1} + P_{M-1,j-1}, \quad j \ge 2,$$ (10)

where (9) and (10) reflect the lower and upper edge effects of $\mathcal{M}$, respectively. Note that for simplicity of notation, from now on we drop the subscript $t$ from the belief vectors and the actions.

IV. MAIN RESULTS

The main results of our paper are stated in the following two theorems.

Theorem 1. The optimal action is bounded from below by an action, denoted by $r^{lb}$, given by

$$r^{lb} = \min\Big\{ r \in \mathcal{M} : \sum_{i=1}^{r} b(i) \ge \frac{1}{1 + C} \Big\}.$$ (11)

This lower bound is equal to the myopic action, which at each time step selects the action maximizing the immediate expected reward and ignores its impact on the future reward. The myopic action for problem (4) under belief vector $b$ is given by:

$$r^{myopic}(b) = \arg\max_{r \in \mathcal{M}} R(b; r).$$ (12)

Theorem 2. Under Assumption 1 or 2, the optimal action is bounded from above by an action, denoted by $r^{ub}$, which is a function of $C$, $\beta$, and the belief vector $b$ as follows:

$$r^{ub} = \min\{ r \in \mathcal{M} : f(\beta)\,\bar{S}_r + [(1 + C) - r f(\beta)]\,S_r - C \le 0 \},$$ (13)

where for simplicity we define the following notation:

$$S_r \triangleq \sum_{i=r+1}^{h} b(i), \qquad \bar{S}_r \triangleq \sum_{i=r+1}^{h} i\, b(i), \qquad f(\beta) \triangleq \ldots$$ (14)

Note that from the above inequalities, $l \le r^{lb} = r^{myopic} \le r^{opt} \le r^{ub} \le h$, where $l$ and $h$ are the lowest and the highest states with non-zero beliefs, respectively.
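To make the SISC structure and the percentile threshold in (11) concrete, the following sketch builds a SISC transition matrix and computes the myopic lower bound. It rests on assumptions that should be read as such: the helper names are hypothetical, the step distribution is passed as a dictionary `{k: p_k}`, and the edge effect is modeled by piling any probability mass that would leave $\{1, \ldots, M\}$ onto the nearest edge state, which is the reading suggested by the proof of Lemma 7 rather than a literal transcription of (9) and (10). The brute-force function is included only to cross-check (11) against the arg max in (12).

```python
def sisc_matrix(p, M):
    """SISC transition matrix on states 1..M from step probabilities p[k].

    Assumed edge handling: a step that would leave {1..M} lands on the edge state.
    """
    P = [[0.0] * M for _ in range(M)]
    for i in range(1, M + 1):
        for k, pk in p.items():
            j = min(max(i + k, 1), M)   # clip the target state at the edges
            P[i - 1][j - 1] += pk
    return P


def myopic_action(b, C):
    """Lower bound (11): lowest r whose cumulative belief reaches 1 / (1 + C)."""
    threshold = 1.0 / (1.0 + C)
    cumulative = 0.0
    for r, prob in enumerate(b, start=1):
        cumulative += prob
        if cumulative >= threshold:
            return r
    return len(b)


def brute_force_myopic(b, C):
    """Arg max of the expected immediate reward (6), lowest action on ties."""
    def expected_reward(r):
        return sum(prob * (r if r <= i else i - C * (r - i))
                   for i, prob in enumerate(b, start=1))
    return max(range(1, len(b) + 1), key=expected_reward)


# Example with the simulation defaults quoted later in the paper.
P = sisc_matrix({-1: 0.3, 0: 0.4, 1: 0.3}, 10)
r_lb = myopic_action((0.1, 0.2, 0.3, 0.4), 5)
```

With `C = 5` the threshold is `1/6`, so for the belief `(0.1, 0.2, 0.3, 0.4)` the cumulative belief first reaches it at the second state, and the brute-force maximizer of (6) agrees.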

V. ANALYSES

To prove Theorem 1, we need the following lemmas.

Lemma 1. The expected immediate reward is a unimodal function of the action: it is increasing when $\sum_{i=1}^{r} b(i) < \frac{1}{1+C}$ and decreasing otherwise. Therefore, the myopic action, which maximizes the expected immediate reward (given in (12)), is equal to the lowest action satisfying the inequality $\sum_{i=1}^{r} b(i) \ge \frac{1}{1+C}$. Interestingly, the myopic action has a percentile threshold structure, i.e., it corresponds to the lowest state above a given percentile of the beliefs.

Lemma 2. The remaining expected discounted reward for a given action, $V_t(b; r)$, and the value function, $V_t(b)$, are convex with respect to the belief vector $b$, i.e., for $b = \lambda b_1 + (1 - \lambda) b_2$,

$$V_t(b; r) \le \lambda V_t(b_1; r) + (1 - \lambda) V_t(b_2; r), \quad \forall r \in \mathcal{M},$$
$$V_t(b) \le \lambda V_t(b_1) + (1 - \lambda) V_t(b_2), \quad 0 \le \lambda \le 1.$$ (15)

Lemma 3. The future expected reward, $V^f_t(b; r)$ defined in (7), is monotonically increasing in the action, i.e.,

$$V^f_t(b; r_1) - V^f_t(b; r_2) \ge 0, \quad \forall r_1 \ge r_2.$$ (16)

Now let us define an ordering for the belief vectors and derive a key property of the value functions: their monotonicity with respect to this ordering. Then we state some lemmas which are needed to prove Theorem 2.

Definition 1. (First-Order Stochastic Dominance, [11]) Let $b_1, b_2 \in \mathcal{B}$ be any two belief vectors. Then $b_1$ first-order stochastically dominates $b_2$ (or $b_1$ is FOSD greater than $b_2$), denoted as $b_1 \ge_s b_2$, if

$$\sum_{j=r}^{M} b_1(j) \ge \sum_{j=r}^{M} b_2(j), \quad \forall r \in \{1, \ldots, M\}.$$ (17)

Lemma 4. Under Assumption 1 or 2, the value function is a FOSD-increasing function of the belief vector, i.e., if $b_1 \ge_s b_2$, then $V(b_1) \ge V(b_2)$.

Now let us consider two belief vectors $b$ and $b^{\alpha}$ such that $b^{\alpha}$ is a shifted version of $b$ by $\alpha$ steps, i.e., $b^{\alpha}(i + \alpha) = b(i)$. Because of the similarity in the shapes of their probability distributions, they have some common properties, given in the following lemmas.

Lemma 5. The immediate expected reward and the myopic action follow the shifting property:

$$R(b^{\alpha}; r) = R(b; r - \alpha) + \alpha,$$ (18)
$$r^{myopic}(b^{\alpha}) = r^{myopic}(b) + \alpha.$$ (19)

Lemma 6.
Under Assumption 1, the value functions and the optimal policies of the belief vector $b$ and its shifted version, $b^{\alpha}$, have the following properties:

$$V_t(b^{\alpha}; r) - V_t(b; r - \alpha) = \alpha\,\frac{1 - \beta^{T-t+1}}{1 - \beta},$$ (20)
$$r^{opt}_t(b^{\alpha}) = r^{opt}_t(b) + \alpha,$$ (21)
$$V_t(b^{\alpha}) = V_t(b) + \alpha\,\frac{1 - \beta^{T-t+1}}{1 - \beta}.$$ (22)

Note that for $\beta = 1$, we need to substitute $\frac{1 - \beta^x}{1 - \beta}$ by $x$.

Lemma 7. Under Assumption 2, the following relation between the value functions of $b$ and $b^{\alpha}$ holds:

$$V_t(b^{\alpha}) \le V_t(b) + \alpha\,\frac{1 - \beta^{T-t+1}}{1 - \beta}.$$ (23)

The proofs of the above lemmas are given in Appendix A, followed by the proofs of Theorems 1 and 2 in Appendix B.

VI. SIMULATION

We present some simulation results to examine the upper and the lower bounds derived in the previous section. The simulation parameters, except in the figures where their effect is studied, are fixed as follows: the number of states $M = 10$, the over-utilization penalty coefficient $C = 5$, the discount factor $\beta = 0.8$, and the transition probabilities $p_{-1} = p_1 = 0.3$, $p_0 = 0.4$. Due to computational complexity, it is hard to find the optimal policy. Instead, we consider the Extended Myopic (EM) policy with a future window of size $\tau$ as an approximation of the optimal policy. This policy, at each time step $t$, solves the dynamic program for the short horizon $t, t+1, \ldots, t+\tau$. Here we use $\tau = 4$ for our simulations. Note that we could limit our dynamic programming searches to the actions between the lower and the upper bounds to speed up the simulations.

A. Upper and Lower Bound on Optimal Actions

Fig. 1 shows an example of a sequence of optimal actions and their corresponding upper and lower bounds. We assume that the selected actions do not exceed $B_t$ until time step 14. Note that the stars in the figure indicate the non-zero beliefs.

[Fig. 1. Selected actions by the optimal policy (EM, $\tau = 4$) and their corresponding lower and upper bounds, for $C = 5$, $\beta = 0.8$, and transition probabilities $p_{-1} = p_1 = 0.3$, $p_0 = 0.4$. Axes: action versus time steps. Curves: myopic action, optimal action, upper bound.]

Next, we consider the effect of $\beta$ on the gap between the upper and the lower bounds of optimal actions in Fig. 2.
As expected, this gap increases with increasing $\beta$. Fig. 3 shows the gap between the upper and the lower bounds versus the variance of the SISC transition, defined as $E[(B_{t+1} - B_t)^2]$. This figure confirms that increasing the variance increases the gap.
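As a quick worked example of the quantity on the horizontal axis of Fig. 3, the default step distribution from the simulation setup ($p_{-1} = p_1 = 0.3$, $p_0 = 0.4$) can be plugged into the definition. The calculation below is a sketch that assumes an interior state, i.e., it ignores the edge effects of Assumption 2:

```latex
E\big[(B_{t+1} - B_t)^2\big] = \sum_{k} k^2\, p_k
  = (-1)^2 (0.3) + 0^2 (0.4) + 1^2 (0.3) = 0.6
```

Because this step distribution is symmetric, the mean step is zero and the second moment coincides with the variance of the step.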

[Fig. 2. The gap between the lower and the upper bounds versus $\beta$, for $C = 5$. Axes: gap versus $\beta$. Curves: $p_{-2} = p_2$ and $p_{-1} = p_1$.]

[Fig. 3. The gap between the lower and the upper bounds versus the variance of the random walk steps, for $\beta = 0.8$, $C = 5$. Axes: gap versus variance.]

B. Myopic and Upper-Bound Policies

Now we compare two sub-optimal policies: (i) the myopic policy and (ii) the upper-bound (UB) policy, which pick the actions given in (11) and (13), respectively, at all time steps and update their belief vectors according to these actions. Fig. 4 shows the total expected discounted reward versus $\beta$ for the myopic and the UB policies. For smaller $\beta$ the reward of the myopic policy is higher than that of the UB policy, but for larger $\beta$ (close to 1) the UB policy outperforms the myopic policy. Fig. 5 shows the total expected discounted reward versus $C$ for the myopic and the UB policies for $\beta = 0.8$. The difference between the rewards of the two policies increases as $C$ grows.

[Fig. 4. The total expected discounted reward for two sub-optimal policies versus $\beta$ for $C = 5$, $\tau = 4$, for horizon $T = 100$. Axes: reward versus $\beta$. Curves: myopic policy and UB policy, each for $p_{-2} = p_2$ and for $p_{-1} = p_1$.]

[Fig. 5. The total expected discounted reward for two sub-optimal policies versus $C$ for $\beta = 0.8$, $\tau = 4$, for horizon $T = 100$ and transition probabilities $p_{-1} = p_1 = 0.3$, $p_0 = 0.4$. Axes: reward versus $C$. Curves: UB policy, myopic policy, EM policy.]

VII. CONCLUSION

We formulated a Bayesian congestion control problem in which a source must select the transmission rate (the action) over a network whose available bandwidth (resource) evolves as a stochastic process. We modeled the problem as a POMDP and derived some key properties for the myopic and the optimal policies. We proved structural results providing bounds on the optimal actions, yielding tractable sub-optimal solutions that have been shown via simulations to perform well.
Since the myopic action has a percentile threshold structure on the state beliefs, and we can simplify our upper bound to get a looser one with a similar form, we conjecture that there may be even better approximations of the optimal policy with a similar percentile threshold structure.

ACKNOWLEDGEMENT

This work was supported in part by the U.S. National Science Foundation under ECCS-EARS awards numbered … and …, and by the Okawa Foundation through an award to support research on Network Protocols that Learn.

REFERENCES

[1] R. D. Smallwood and E. J. Sondik, The optimal control of partially observable Markov processes over a finite horizon, Operations Research, vol. 21, no. 5.
[2] C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of Markov decision processes, Mathematics of Operations Research, vol. 12, no. 3.
[3] R. Srikant, The Mathematics of Internet Congestion Control. Birkhäuser Boston.
[4] C. Jin, D. X. Wei, and S. H. Low, FAST TCP: motivation, architecture, algorithms, performance, in INFOCOM, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 4. IEEE, 2004.
[5] V. Jacobson, Congestion avoidance and control, in ACM SIGCOMM Computer Communication Review, vol. 18, no. 4. ACM, 1988.
[6] E. Altman, K. Avrachenkov, C. Barakat, and P. Dube, Performance analysis of AIMD mechanisms over a multi-state Markovian path, Computer Networks, vol. 47, no. 3.
[7] A. Laourine and L. Tong, Betting on Gilbert-Elliot channels, IEEE Transactions on Wireless Communications, vol. 9, no. 2.
[8] Y. Wu and B. Krishnamachari, Online learning to optimize transmission over an unknown Gilbert-Elliott channel, in Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), International Symposium on. IEEE, 2012.
[9] A. Bensoussan, M. Çakanyıldırım, and S. P. Sethi, A multiperiod newsvendor problem with partially observed demand, Mathematics of Operations Research, vol. 32, no. 2, 2007.

[10] O. Besbes and A. Muharremoglu, On implications of demand censoring in the newsvendor problem, Management Science, forthcoming.
[11] A. Müller and D. Stoyan, Comparison Methods for Stochastic Models and Risks. Hoboken, NJ: Wiley.

APPENDIX A

Proof of Lemma 1: To prove this lemma, let us compute the discrete derivative of the expected immediate reward given in (6):

$$\Delta R(b; r) = R(b; r + 1) - R(b; r) = 1 - (1 + C) \sum_{i=1}^{r} b(i).$$ (24)

It is easily seen that if the inequality in (11) holds, $\Delta R(b; r) \le 0$. Otherwise it is positive. This shows that $R(b; r)$ is unimodal with a unique maximum at the myopic action given in (11).

Proof of Lemma 2: We use induction to prove the convexity of $V_t(b; r)$ with respect to the belief vector $b$ for the finite horizon. Let us assume $b$ is a linear combination of two belief vectors $b_1$ and $b_2$, such that:

$$b = \lambda b_1 + (1 - \lambda) b_2, \quad 0 \le \lambda \le 1.$$ (25)

First, at horizon $T$, the remaining expected reward is equal to the expected immediate reward, which is affine with respect to the belief vector, and using (6) and (25) we have:

$$R(b; r) = \lambda R(b_1; r) + (1 - \lambda) R(b_2; r),$$ (26)

which confirms the convexity of the remaining expected reward at horizon $T$. Now, assuming the convexity holds at $t + 1$, we consider the time step $t$. Using (5b) and (7), and noting that the full-observation term of (7) is linear in $b$ and therefore cancels, we have:

$$V_t(b; r) - \lambda V_t(b_1; r) - (1 - \lambda) V_t(b_2; r) = [R(b; r) - \lambda R(b_1; r) - (1 - \lambda) R(b_2; r)] + \beta\Big[V_{t+1}(T_r b P) \sum_{i=r}^{M} b(i) - \lambda V_{t+1}(T_r b_1 P) \sum_{i=r}^{M} b_1(i) - (1 - \lambda) V_{t+1}(T_r b_2 P) \sum_{i=r}^{M} b_2(i)\Big]$$ (27a)
$$= \beta \sum_{i=r}^{M} b(i)\,[V_{t+1}(T_r b P) - \lambda' V_{t+1}(T_r b_1 P) - (1 - \lambda') V_{t+1}(T_r b_2 P)],$$ (27b)

where $\lambda' = \lambda \frac{\sum_{i=r}^{M} b_1(i)}{\sum_{i=r}^{M} b(i)}$. Note that, using (26), the term inside the first $[\cdot]$ in (27a) is zero. Multiplying by the transition matrix $P$ is a linear operation, and with some manipulation we can show that $\lambda' T_r b_1 + (1 - \lambda') T_r b_2 = T_r b$:

$$A = \lambda' T_r b_1 + (1 - \lambda') T_r b_2 = \frac{\lambda \sum_{i=r}^{M} b_1(i)\, T_r b_1 + (1 - \lambda) \sum_{i=r}^{M} b_2(i)\, T_r b_2}{\sum_{i=r}^{M} b(i)},$$

$$A(j) = 0, \quad \forall j < r,$$
$$A(j) = \frac{1}{\sum_{i=r}^{M} b(i)}\Big[\lambda \sum_{i=r}^{M} b_1(i)\, \frac{b_1(j)}{\sum_{i=r}^{M} b_1(i)} + (1 - \lambda) \sum_{i=r}^{M} b_2(i)\, \frac{b_2(j)}{\sum_{i=r}^{M} b_2(i)}\Big] = \frac{\lambda b_1(j) + (1 - \lambda) b_2(j)}{\sum_{i=r}^{M} b(i)} = \frac{b(j)}{\sum_{i=r}^{M} b(i)}, \quad \forall j \ge r.$$ (28)

So from the definition in (2), we have $A = T_r b$. Now from convexity at $t + 1$ the term inside $[\cdot]$
in (27b) is less than or equal to zero and we have:

$$V_t(b; r) - \lambda V_t(b_1; r) - (1 - \lambda) V_t(b_2; r) \le 0,$$ (29)

which means $V_t(b; r)$ is convex with respect to $b$. To prove the convexity of the value function, $V_t(b)$, with respect to $b$, we use the definition in (5a) and the convexity of $V_t(b; r)$ just established to get:

$$V_t(b) = \max_{r} V_t(b; r) = V_t(b; r^*)$$ (30a)
$$\le \lambda V_t(b_1; r^*) + (1 - \lambda) V_t(b_2; r^*)$$ (30b)
$$\le \lambda \max_{r_1} V_t(b_1; r_1) + (1 - \lambda) \max_{r_2} V_t(b_2; r_2)$$ (30c)
$$= \lambda V_t(b_1) + (1 - \lambda) V_t(b_2),$$ (30d)

where $r^* = \arg\max_{r} V_t(b; r)$. Applying the definition in (5a) one more time in (30d) completes the proof.

Proof of Lemma 3: We can write (16) as follows:

$$V^f_t(b; r_1) - V^f_t(b; r_2) = \sum_{r=r_2}^{r_1 - 1} \{V^f_t(b; r + 1) - V^f_t(b; r)\},$$ (31)

and for each term inside the summation, using (7), we have:

$$\Delta V^f_t(b; r) = V^f_t(b; r + 1) - V^f_t(b; r) = \sum_{i=r+1}^{M} b(i)\, V_{t+1}(T_{r+1} b P) - \sum_{i=r}^{M} b(i)\, V_{t+1}(T_r b P) + b(r)\, V_{t+1}(I_r P) \ge 0.$$ (32)

The inequality in (32) follows from the convexity of the value function in (15) with $b = T_r b P$, $b_1 = T_{r+1} b P$, $b_2 = I_r P$ and $\lambda = \frac{\sum_{i=r+1}^{M} b(i)}{\sum_{i=r}^{M} b(i)}$. Therefore, (31) is greater than or equal to zero and the proof is complete.

Proof of Lemma 4:

If $b_1 \ge_s b_2$, i.e., by recalling Definition 1,

$$\sum_{j=r}^{M} b_1(j) \ge \sum_{j=r}^{M} b_2(j), \quad \forall r \in \{1, \ldots, M\},$$ (33)

we can write $b_1$ as a linear combination of some belief vectors with coefficients equal to the elements of $b_2$, as follows:

$$b_1 = \sum_{i=1}^{M} b_2(i)\, b'_i, \quad b'_i \in \mathcal{B}, \quad b'_i(j) = 0, \ \forall j < i.$$ (34)

We can obtain (33) using (34), as follows:

$$\sum_{j=r}^{M} b_1(j) = \sum_{j=r}^{M} \sum_{i=1}^{j} b_2(i)\, b'_i(j)$$ (35a)
$$\ge \sum_{j=r}^{M} \sum_{i=r}^{j} b_2(i)\, b'_i(j) = \sum_{i=r}^{M} \sum_{j=i}^{M} b'_i(j)\, b_2(i)$$ (35b)
$$= \sum_{i=r}^{M} b_2(i),$$ (35c)

where (35a) and (35b) come from switching the indexes and in (35c) we substitute $\sum_{j=i}^{M} b'_i(j) = 1$, because $b'_i$ is a belief vector. For the proof of the other direction, in a similar way, we can show that if $b_1 \ge_s b_2$, there are valid belief vectors $b'_i$, $i \in \{1, \ldots, M\}$, such that (34) holds. Let us recall the definition of the value function:

$$V_t(b_k) = \max_{\pi_{t:T}} V_t(b_k; \pi_{t:T}), \quad k = 1, 2,$$ (36)

where $\pi_{t:T} = [\pi_t, \pi_{t+1}, \ldots, \pi_T]$ is the sequence of policies such that $\pi_{t'}$ is the selected action at each time step $t' = t, \ldots, T$. Then $V_t(b_k) \ge V_t(b_k; \pi_{t:T})$ for any policy sequence $\pi_{t:T}$. Now we only need to prove that the remaining expected reward achieved by the policy sequence $\pi_{t:T}$ for belief vector $b_1$ is higher than for $b_2$, i.e.,

$$V_t(b_1; \pi_{t:T}) \ge V_t(b_2; \pi_{t:T}).$$ (37)

Let us name the sequence of the belief vectors with initial vector $b_1$ by $b_{1,\tau}$, $t \le \tau \le T - 1$, and the sequence of the belief vectors with initial vector $b_2$ by $b_{2,\tau}$. Considering all possible sample paths of $[B_t, B_{t+1}, \ldots, B_T]$, defined as $B_{t:T}$, the total remaining expected rewards for $k = 1, 2$ will be as follows:

$$V_t(b_k; \pi_{t:T}) = E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, b_k, \pi_{t:T}\Big].$$ (38)

Therefore, we can write $V_t(b_k; \pi_{t:T})$ as follows:

$$V_t(b_k; \pi_{t:T}) = \sum_{i=1}^{M} \Big\{b_k(i)\, E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, B_t = i\Big]\Big\}.$$ (39)

So using (39) and substituting (34) we have:

$$V_t(b_1; \pi_{t:T}) - V_t(b_2; \pi_{t:T}) = \sum_{i=1}^{M} \Big\{b_2(i)\Big[\sum_{j=i}^{M} b'_i(j)\, E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, B_t = j\Big] - E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, B_t = i\Big]\Big]\Big\} = \sum_{i=1}^{M} b_2(i)\Big[\sum_{j=i}^{M} b'_i(j)\, E_{j,i}\Big],$$ (40)

where

$$E_{j,i} = E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, B_t = j\Big] - E_{B_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B_\tau; \pi_\tau) \,\Big|\, B_t = i\Big].$$
(41)

To prove (37), we only need to show $E_{j,i} \ge 0$. Without loss of generality, we can assume $i = j - 1 = l$. The result is easily extended to $j > i + 1$ by using $E_{j,i} = \sum_{l=i}^{j-1} E_{l+1,l}$. For any sample path $B_{t:T}$ starting from $l$ (call it $B^l_{t:T}$), there is a corresponding sample path $B^{l+1}_{t:T}$ starting from $l + 1$, such that $B^{l+1}_\tau = B^l_\tau + 1$. So we can take the expectation over only the possible $B^l_{t:T}$ and write (41) as follows:

$$E_{B^{l+1}_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B^{l+1}_\tau; \pi_\tau)\Big] - E_{B^l_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} R(B^l_\tau; \pi_\tau)\Big] = E_{B^l_{t:T}}\Big[\sum_{\tau=t}^{T} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\}\Big].$$ (42)

Now if we show $\sum_{\tau=t}^{T} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\} \ge 0$ for any sample path $B^l_{t:T}$, then (42), and therefore (40), is greater than or equal to zero and (37) holds. We use induction to prove this inequality. For the base case at horizon $T$, it holds using the fact that (3) is monotonically increasing with respect to $B_t$. Now, assuming it is true for time steps $t + 1, \ldots, T$, we need to show it for time step $t$. Let us define the first time that the selected action exceeds $B^l_\tau$ as $\mu$, i.e.,

$$\mu = \min\{t \le \tau \le T : \pi_\tau > B^l_\tau\}.$$ (43)

There are three different possible cases:

Case I: $\pi_\tau \le B^l_\tau$, $\forall \tau \le T$
Case II: $\pi_\mu > B^l_\mu + 1 = B^{l+1}_\mu$
Case III: $\pi_\mu = B^l_\mu + 1$

Let us consider Case I first. In this case, the actions never exceed $B^l_\tau$. Therefore by (3), both rewards for $B^l_{t:T}$ and $B^l_{t:T} + 1$ will be equal to the selected action. So,

$$\sum_{\tau=t}^{T} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\} = \sum_{\tau=t}^{T} \beta^{\tau - t} (\pi_\tau - \pi_\tau) \ge 0.$$ (44)

In Case II, the rewards accumulated before time $\mu$ are equal for both beliefs. Therefore,

$$\sum_{\tau=t}^{\mu - 1} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\} = 0.$$ (45)

The difference of the rewards earned at time $\mu$, by (3), is equal to:

$$R(B^l_\mu + 1; \pi_\mu) - R(B^l_\mu; \pi_\mu) = 1 + C.$$ (46)

The expected value of (46) is also equal to $1 + C$. After time $\mu$, both beliefs will be reset to $I_{B^l_\mu}$ and $I_{B^l_\mu + 1}$, and from the correctness of (42) at time step $\mu$ as the induction assumption, the remaining expected reward related to $B^{l+1}$ will be higher than the one related to $B^l$:

$$E_{B^l_{\mu+1:T}}\Big[\sum_{\tau=\mu+1}^{T} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\}\Big] = \beta^{\mu + 1 - t}\,[V_{\mu+1}(I_{B^l_\mu + 1} P) - V_{\mu+1}(I_{B^l_\mu} P)] \ge 0.$$ (47)

Therefore, (42), which is the summation of the three terms in (45), (46) multiplied by $\beta^{\mu - t}$, and (47), will be greater than or equal to zero. For Case III, at $\mu$ the action exceeds $B^l_\mu$ and the belief is reset to $I_{B^l_\mu} P$, but the other sequence's belief changes as $b_{\mu+1} = T_{\pi_\mu} b_\mu P$. The differences between the rewards accumulated before time $\mu$ and at time $\mu$ are equal to $0$ and $C + 1$, respectively, similar to (45) and (46). Now for times after $\mu$ we have:

$$E_{B^l_{\mu+1:T}}\Big[\sum_{\tau=\mu+1}^{T} \beta^{\tau - t} \{R(B^l_\tau + 1; \pi_\tau) - R(B^l_\tau; \pi_\tau)\}\Big] = \beta^{\mu + 1 - t}\,[V_{\mu+1}(T_{\pi_\mu} b_\mu P) - V_{\mu+1}(I_{B^l_\mu} P)] \ge 0,$$ (48)

where the last inequality comes from the correctness of (37) at time $\mu + 1$ as the induction assumption, considering the new beliefs $T_{\pi_\mu} b_\mu$ and $I_{B^l_\mu} P$. Note that $B^l_\mu$ is the lowest index of non-zero beliefs in $T_{\pi_\mu} b_\mu$. Therefore, (42) is greater than or equal to zero, so are (41) and (40), and thus (37) is true for time $t$ and the proof is complete. Now assume $\pi^*_{t:T} = \arg\max_{\pi_{t:T}} V_t(b_2; \pi_{t:T})$ is the optimal policy sequence for the initial belief vector $b_2$ at time $t$.
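We have:

$$V_t(b_1) \ge V_t(b_1; \pi^*_{t:T}) \ge V_t(b_2; \pi^*_{t:T}) = V_t(b_2),$$ (49)

and this completes the proof of Lemma 4.

Proof of Lemma 5: To prove the shifting property of the immediate expected reward we have:

$$R(b^{\alpha}; r) = r \sum_{i \ge r} b^{\alpha}(i) + \sum_{i=1}^{r-1} b^{\alpha}(i)\,[(1 + C)i - Cr]$$ (50a)
$$= (r' + \alpha) \sum_{i' \ge r'} b(i') + \sum_{i'=1}^{r'-1} b(i')\,[(1 + C)(i' + \alpha) - C(r' + \alpha)]$$ (50b)
$$= R(b; r') + \alpha\Big[\sum_{i' \ge r'} b(i') + \sum_{i'=1}^{r'-1} b(i')\Big] = R(b; r') + \alpha,$$ (50c)

where (50b) is obtained by substituting $i$ and $r$ with $i' + \alpha$ and $r' + \alpha$, respectively. Also we already know that $b^{\alpha}(i' + \alpha) = b(i')$. For the myopic actions, using (50c) we have:

$$r^{myopic}(b^{\alpha}) = \arg\max_{r} R(b^{\alpha}; r) = \arg\max_{r} \{R(b; r - \alpha) + \alpha\} = r^{myopic}(b) + \alpha.$$ (51)

Proof of Lemma 6: To prove (20), we use induction. For horizon $T$, it is similar to (18) with $t = T$. Now, assuming it is true for $t + 1$, we consider time step $t$:

$$V_t(b^{\alpha}; r) = R(b^{\alpha}; r) + \beta\Big[\sum_{i \ge r} b^{\alpha}(i)\, V_{t+1}(T_r b^{\alpha} P) + \sum_{i=1}^{r-1} b^{\alpha}(i)\, V_{t+1}(I_i P)\Big] = R(b; r - \alpha) + \alpha + \beta\Big[\sum_{i' \ge r - \alpha} b(i')\, V_{t+1}(T_r b^{\alpha} P) + \sum_{i'=1}^{r - \alpha - 1} b(i')\, V_{t+1}(I_{i' + \alpha} P)\Big].$$ (52a)

In (52a), we use (18) and the shifting property of belief vectors and substitute $i - \alpha$ with $i'$. Using (20) at $t + 1$ we have:

$$V_t(b^{\alpha}; r) = R(b; r - \alpha) + \alpha + \beta \sum_{i' \ge r - \alpha} b(i')\Big[V_{t+1}(T_{r - \alpha} b P) + \alpha\,\frac{1 - \beta^{T-t}}{1 - \beta}\Big] + \beta \sum_{i'=1}^{r - \alpha - 1} b(i')\Big[V_{t+1}(I_{i'} P) + \alpha\,\frac{1 - \beta^{T-t}}{1 - \beta}\Big]$$ (53a)
$$= V_t(b; r - \alpha) + \alpha + \alpha \beta\,\frac{1 - \beta^{T-t}}{1 - \beta} \sum_{i'} b(i') = V_t(b; r - \alpha) + \alpha\,\frac{1 - \beta^{T-t+1}}{1 - \beta}.$$ (53b)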

To prove (21), using (20), we have:

$$r^{opt}_t(b^\alpha) = \arg\max_r V_t(b^\alpha; r) = \arg\max_r \Big\{V_t(b; r - \alpha) + \alpha \frac{1 - \beta^{T-t+1}}{1 - \beta}\Big\} = \arg\max_r V_t(b; r - \alpha) = \arg\max_r V_t(b; r) + \alpha = r^{opt}_t(b) + \alpha. \tag{54}$$

Using (20), (21) and maximizing over actions, we obtain (22).

Proof of Lemma 7: Consider a belief vector $b^u$ in the unlimited state set and its corresponding $b^l$ in the limited state set $\mathcal{M}$. If $b^u$ has non-zero elements above $M$, i.e., the upper edge of $\mathcal{M}$, then $b^l(M) = \sum_{i \ge M} b^u(i)$ and $b^l(j) = b^u(j),\ \forall j < M$. Therefore, facing the upper edge effect results in $b^l \le_s b^u$. Similarly, if $b^u$ has non-zero elements below $1$, i.e., the lower edge of $\mathcal{M}$, then $b^l(1) = \sum_{i \le 1} b^u(i)$ and $b^l(j) = b^u(j),\ \forall j > 1$. Thus, facing the lower edge effect results in $b^l \ge_s b^u$.

Now, using Lemma 6, if the belief vector $b_t$ and its shifted version $b^\alpha_t$, $\alpha > 0$, are in the unlimited state set, equation (22) holds between their value functions. Let $b_\tau$ and $b^\alpha_\tau$ denote the sequences of belief vectors constructed by updating $b_t$ and $b^\alpha_t$, respectively, according to their optimal actions at all time steps $\tau = t + 1, \dots, T$. Then $b^\alpha_\tau$, $\tau = t + 1, \dots, T$, are shifted versions of $b_\tau$. When the state set is limited, there are several possibilities. If none of the $b_\tau$, $b^\alpha_\tau$, $\tau = t + 1, \dots, T$, face the edge effects, then (22) will still hold. Otherwise, $b^\alpha_\tau$ will face the upper edge effect sooner than $b_\tau$, or $b_\tau$ will face the lower edge effect sooner than $b^\alpha_\tau$. In both cases, combining the above argument about the FOSD orderings, the results of Lemma 4, and (22), we can conclude the inequality (23).

APPENDIX B

Proof of Theorem 1: The immediate expected reward and the future expected reward achieved by an action $r < r^{myopic}$ are less than those achieved by $r^{myopic}$, as results of Lemma 1 and Lemma 3, respectively. Combining them as in (5b), we get $V_t(b; r) \le V_t(b; r^{myopic})$, which means $r$ cannot be optimal and thus $r^{opt} \ge r^{myopic}$.

Proof of Theorem 2: To achieve the upper bound on the optimal action we will show that:

$$\Delta V_t(b; r) = \Delta R(b; r) + \beta \Delta V^f_t(b; r) \le 0, \quad \forall r \ge r^{ub}. \tag{55}$$

By recalling (24) we know

$$\Delta R(b; r) = 1 - (1 + C) \sum_{i \le r} b(i) \le 0, \quad \forall r \ge r^{myopic}. \tag{56}$$

Moreover, using (32),

$$\Delta V^f_t(b; r) = \sum_{i > r} b(i)[V_{t+1}(T_{r+1} b P) - V_{t+1}(T_r b P)] + b(r)[V_{t+1}(I_r P) - V_{t+1}(T_r b P)], \tag{57}$$

and according to Lemma 3 we have $\Delta V^f_t(b; r) \ge 0$ for all $r$. Hence (55) is equivalent to:

$$\beta \Delta V^f_t(b; r) \le -\Delta R(b; r), \quad \forall r \ge r^{ub}. \tag{58}$$

To this end, it is sufficient to find an upper bound for $\beta \Delta V^f_t(b; r)$ that is less than $-\Delta R(b; r)$ for all $r$ greater than a specific action, i.e., our desired upper bound $r^{ub}$. Applying Lemma 4 on $bP \ge_s I_l P$, we have:

$$V_t(bP) \ge V_t(I_l P). \tag{59}$$

By the above inequality and the fact that $r$ is the lowest non-zero index of the belief vector $T_r b$, the second term in (57) is less than or equal to zero, and we can bound $\Delta V^f_t(b; r)$ as follows:

$$\Delta V^f_t(b; r) \le \sum_{i > r} b(i)[V_{t+1}(T_{r+1} b P) - V_{t+1}(T_r b P)]. \tag{60}$$

From the convexity of the value function proved in Lemma 2 and (59) we get:

$$V_{t+1}(T_{r+1} b P) - V_{t+1}(T_r b P) \le \sum_{i=r+1}^{h} b'(i) V_{t+1}(I_i P) - V_{t+1}(I_r P) = \sum_{i=r+1}^{h} \{b'(i)[V_{t+1}(I_i P) - V_{t+1}(I_r P)]\}, \tag{61}$$

where $b' = T_{r+1} b$. Now from (23), substituting $b = I_i P$ and $\alpha = j - i$, it follows that:

$$V_{t+1}(T_{r+1} b P) - V_{t+1}(T_r b P) \le \sum_{i=r+1}^{h} \Big\{b'(i)(i - r) \frac{1 - \beta^{T-t}}{1 - \beta}\Big\} = \frac{1 - \beta^{T-t}}{1 - \beta}\Big[\sum_{i=r+1}^{h} i\, b'(i) - r\Big] = \frac{1 - \beta^{T-t}}{1 - \beta}[\bar{b}' - r], \tag{62}$$

where

$$\bar{b}' = \sum_{i=r+1}^{h} i\, b'(i) = \frac{\sum_{i=r+1}^{h} i\, b(i)}{\sum_{i=r+1}^{h} b(i)} = \frac{\bar{S}}{S}.$$

Accordingly, we may bound (60) using (62) as follows:

$$\Delta V^f_t(b; r) \le S\, \frac{1 - \beta^{T-t}}{1 - \beta}\Big[\frac{\bar{S}}{S} - r\Big]. \tag{63}$$

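To obtain (58) using (63), it suffices to satisfy the following inequality (with $S$ and $\bar{S}$ as defined after (62)):

$$\beta S\, \frac{1 - \beta^{T-t}}{1 - \beta}\Big[\frac{\bar{S}}{S} - r\Big] \le (1 + C) \sum_{i \le r} b(i) - 1 = C - (1 + C)S, \quad \forall r \ge r^{ub}. \tag{64}$$

Applying the definition of $f(\beta)$ in (14) and doing some straightforward manipulations, we get:

$$f(\beta)\bar{S} + [(1 + C) - r f(\beta)]S - C \le 0, \quad \forall r \ge r^{ub}. \tag{65}$$

$r^{ub}$ is the smallest $r$ satisfying (65), as mentioned in (13), and the proof is complete.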
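The closed form of the reward increment used in (56) and on the right-hand side of (64) can be verified numerically. The sketch below assumes the expected-reward form of (50a) and takes the increment as $R(b; r+1) - R(b; r)$; the names `expected_reward`, `delta_R`, and `tail_mass` and the example belief are hypothetical choices for illustration. It does not evaluate $f(\beta)$ or $r^{ub}$, since (13) and (14) are defined elsewhere in the paper.

```python
from fractions import Fraction


def expected_reward(b, r, C):
    """R(b; r) = r * P(B >= r) + sum_{i < r} b(i) * [(1 + C) i - C r]."""
    reached = sum(p for i, p in b.items() if i >= r)
    overshoot = sum(p * ((1 + C) * i - C * r) for i, p in b.items() if i < r)
    return r * reached + overshoot


def delta_R(b, r, C):
    """Increment of the immediate expected reward when raising r to r + 1."""
    return expected_reward(b, r + 1, C) - expected_reward(b, r, C)


def tail_mass(b, r):
    """S = sum_{i > r} b(i), the belief mass strictly above the action r."""
    return sum(p for i, p in b.items() if i > r)


# Hypothetical example belief and penalty (assumptions for illustration only).
b = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 4)}
C = 2

for r in range(0, 5):
    head = sum(p for i, p in b.items() if i <= r)
    S = tail_mass(b, r)
    # (56): Delta R(b; r) = 1 - (1 + C) * sum_{i <= r} b(i)
    assert delta_R(b, r, C) == 1 - (1 + C) * head
    # Right-hand side of (64): -Delta R(b; r) = C - (1 + C) * S
    assert -delta_R(b, r, C) == C - (1 + C) * S

# The increment turns non-positive from the myopic action onward.
first_nonpositive = next(r for r in range(0, 5) if delta_R(b, r, C) <= 0)
```

For this example belief the increment is first non-positive at the action that maximizes the immediate expected reward, which is consistent with the condition $r \ge r^{myopic}$ in (56).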
