Finite Horizon Risk Sensitive MDP and Linear Programming


Atul Kumar, Veeraruna Kavitha and N. Hemachandra
IEOR, Indian Institute of Technology Bombay, India

Abstract: In the context of standard Markov decision processes (MDPs), the connection between the Dynamic Program (DP) and the Linear Program (LP) is well understood and is established under sufficiently general conditions. The LP based approach facilitates solving constrained MDPs. Multiplicative or risk sensitive MDPs, introduced to control the fluctuations/variations around the expected value, are relatively less studied objects. The DP equations are well understood even in the context of risk MDPs; however, the LP connection is not known. We consider a finite horizon risk MDP problem and establish the connections between the DP and LP approaches. We augment the state space with a suitable component to obtain the optimal policies for constrained risk MDPs. We apply these results to a server selection problem in Ber/M/K/K queues, with a constraint on the utilization of the fast server. We discuss some interesting structural properties of the risk optimal policies.

I. INTRODUCTION

A Markov decision process (MDP) is a mathematical framework used to solve problems of sequential decision making in stochastic situations ([6], [3], [2], [4] etc.). The aim of an MDP is to find an optimal policy for the decision makers. A policy is a sequence of decisions, one for each time slot, possibly depending upon the state of the system (the current state or the history of all states). An MDP considers a running cost at every time step, depending upon the state and the action taken at that time step, and obtains a policy that optimizes the expected value of the sum (integral in the case of continuous time problems) of the running costs over all the time slots under consideration. In the case of finite horizon problems, it also considers a terminal cost. There can be three varieties of MDP problems based on the time horizon that the problem spans. It is a finite time horizon problem if the sum cost is considered for a finite duration.
In the case of infinite time horizon problems, variants like discounted cost, average cost and total cost are considered. The focus of this paper is on finite horizon problems.

In many scenarios, the agents are interested not just in the average cost; some agents would like to reduce the risk on most of the sample paths. Worst case analysis deals with an extreme case in this direction, while the risk sensitive framework offers varying ranges of importance to sample paths and average value, as controlled by a parameter. Depending upon this parameter, called the risk parameter, the framework gives importance to higher moments of the sum cost. In all, while average cost/linear MDPs are concerned with the first moment of the sum cost, risk sensitive MDPs incorporate higher moments of the cost to control the variability/fluctuations about the expected value. The linear MDPs are also viewed as risk neutral MDPs.

The linear MDP is a well studied topic and many solution approaches are known: dynamic programming (DP), linear programming (LP) and value iteration are some of them ([6], [3], [2], [4] etc.). DP obtains the value function, the optimal cost to go till termination from any time and any state, using backward induction. Alternatively, the value function can be obtained as the solution of an appropriate linear program (LP). The dual LP directly provides the optimal policy (e.g., [6], [2]).

Risk sensitive MDPs have been studied to a relatively limited extent. Nevertheless, the dynamic programming approach can be applied even in the context of finite horizon risk MDPs ([5]). To the best of our knowledge, the connection between risk sensitive MDPs and an appropriate linear program is not yet established, and this is the main focus of the paper. This connection does not solve the dimensionality problem. Nevertheless, the advent of fast LP solvers makes it a very attractive alternative. Further, and more importantly, one can incorporate constraints in the MDP framework. Our LP connection thus provides computational methods to solve such constrained risk sensitive MDPs.
Availability of resources can be captured as suitable constraints and hence solutions to constrained MDPs are important.

The work on risk sensitive control is vast and varied and we give a sample of some of the strands. The pioneering work is by Howard and Matheson [7]. The backward recursion dynamic programming equations in the finite horizon setting are of multiplicative type, and algorithms to compute optimal policies in this model are known. In general, the optimal policies in the infinite horizon setting tend to be non-stationary ([8], etc.). Some papers identify suitable sufficient conditions that ensure that the optimal policies are stationary and also develop algorithms to compute the same ([10]) or approximate optimal policies. Other papers explore the relations between robust MDPs and risk sensitive MDPs [9]. We develop an LP based algorithm that computes optimal risk sensitive policies for constrained risk MDPs.

Notation: Bold letters represent vectors, e.g., $\mathbf{y} = \{y(t,x,a)\}_{t,x,a}$ represents a feasible vector of the dual LP (9), given below, while $\mathbf{x}_n^t$ represents the vector $\mathbf{x}_n^t = [x_n, \cdots, x_t]$. Random variables are represented by capital letters and their realizations by the corresponding small letters. When it is required to specify the time index, a subscript is used; otherwise it is avoided. For example, $x$ represents a realization of the random variable $X_t$ for any $t$. If it is required to represent a realization of the pair of random variables $X_t, X_n$, then we use $x_t, x_n$. The realizations of random variables of subsequent time slots, like $X_t, X_{t+1}$, are represented by $(x, x')$.

II. RISK SENSITIVE MDP FRAMEWORK

A risk sensitive MDP, as in the case of a linear MDP, consists of a set $\mathcal{X}$ of all possible states, a set $\mathcal{A}$ of all possible actions and an immediate cost function $r_t : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ for each time slot $t$. The terminal cost $r_T$, the cost in the final time slot $T$, depends only upon the (final) state $x \in \mathcal{X}$. The state and action spaces $\mathcal{X}, \mathcal{A}$ do not depend upon the time slot $t$, that is, we consider the same sets for all time indices. The MDP is further characterized by a transition function $p$ on $\mathcal{X} \times \mathcal{A} \times \mathcal{X}$, which defines the action dependent state transitions: $p(x'|x,a)$ gives the probability of the state transition from $x$ to $x'$ when action $a$ is chosen. We consider a finite horizon problem and let $\{X_t\}_{t \le T}$, $\{A_t\}_{t \le T-1}$ respectively represent the trajectories of the state and the action processes.

A policy $\Pi^t = (\pi_t, \pi_{t+1}, \cdots, \pi_{T-1})$ is a sequence of state dependent and possibly randomized actions, given for the time slots between $t$ and $T$. Given a policy $\Pi^t$, the state, action pairs evolve randomly over the time slots $t < n < T$, with transitions given by the following laws:

$$q_n^{\Pi^t}(x', a' \mid x, a) = P(X_n = x', A_n = a' \mid X_{n-1} = x, A_{n-1} = a) = \pi_n(x', a')\, p(x'|x,a), \tag{1}$$

where $p(x'|x,a) = P(X_n = x' \mid X_{n-1} = x, A_{n-1} = a)$ and $\pi_n(x', a') = P(A_n = a' \mid X_n = x')$. The above evolution further depends upon the initial condition, i.e., the initialization of the starting point $X_t$.
Let $E^{x,\Pi^t}$ represent the expectation operator with initial condition $X_t = x$ when the policy $\Pi^t$ is used. Let $E^{\alpha,\Pi^t}$ represent the same expectation operator when the initial condition is distributed according to $\alpha$, written as $X_t \sim \alpha$. Here $\alpha(x) = P(X_t = x)$. We are interested in optimizing the following risk sensitive objective:

$$\tilde{J}_t(\alpha, \Pi^t) = \frac{1}{\gamma} \log\big( J_t(\alpha, \Pi^t) \big) \quad \text{where} \quad J_t(\alpha, \Pi^t) = E^{\alpha,\Pi^t}\Big[ e^{\gamma \left( \sum_{n=t}^{T-1} r_n(X_n, A_n) + r_T(X_T) \right)} \Big]. \tag{2}$$

The above equation represents the cost to go from time slot $t$ to $T$ under the policy $\Pi^t$, with $X_t \sim \alpha$. The value function, a function of $(x,t)$, is defined as the optimal value of the above risk sensitive objective given the initial condition $X_t = x$:

$$V_t(x) := \min_{\Pi^t \in \mathcal{D}^t} \tilde{J}_t(x, \Pi^t) \quad \text{for any } x \in \mathcal{X}, \tag{3}$$

where $\mathcal{D}^t$ represents the space of policies $\Pi^t$.

Dynamic Programming

We are interested in the optimal policy $\Pi^*$ (we drop the superscript $t$ when the policy starts from $t = 0$) that optimizes the risk cost $\tilde{J}_0(x, \Pi)$, or equivalently a policy that achieves the value function, i.e., such that $V_0(x) = \tilde{J}_0(x, \Pi^*)$ for all $x \in \mathcal{X}$. Dynamic programming (DP) is a well known technique that provides a solution to such control problems. The DP equations are given by backward induction as below, for any $x \in \mathcal{X}$ (see [5]):

$$V_T(x) = r_T(x), \quad \text{and for any } t \le T-1, \quad V_t(x) = \min_{a \in \mathcal{A}} \Big\{ r_t(x,a) + \frac{1}{\gamma} \log \Big[ \sum_{x'} p(x'|x,a)\, e^{\gamma V_{t+1}(x')} \Big] \Big\}.$$

To simplify the above set of equations, we consider the following translation of the value function: $u_t(x) = e^{\gamma V_t(x)}$ for all $t \le T$ and $x \in \mathcal{X}$. Note that, by monotonicity and continuity, $u_t$ for any $t$ is the minimum value of $J_t$ given in (2): $u_t(x) = \min_{\Pi^t} J_t(x, \Pi^t)$. The DP equations can now be rewritten as:

$$u_T(x) = e^{\gamma r_T(x)} \quad \text{for any } x \in \mathcal{X}, \text{ and} \tag{5}$$

$$u_t(x) = \min_{a \in \mathcal{A}} \Big\{ e^{\gamma r_t(x,a)} \sum_{x'} p(x'|x,a)\, u_{t+1}(x') \Big\} \quad \text{for any } t \le T-1 \text{ and } x \in \mathcal{X}. \tag{6}$$

For ease of notation, we absorb $\gamma$ into the cost functions $\{r_t\}$. One needs to solve the above set of equations to obtain the value function $\mathbf{u}^* = \{u_t^*(x);\ t < T,\ x \in \mathcal{X}\}$, and the optimizers in the minimization step provide the optimal policy ([6], [5] etc.).

III. LINEAR PROGRAMMING FORMULATION

The dynamic programming based approach suffers from the curse of dimensionality. As we increase the number of states and/or time epochs, the complexity of the problem increases significantly. This results in limited applicability of dynamic programming. In the context of linear MDPs, it is a well known fact that the DP problem can be reformulated as a Linear Program (LP) under considerable generality (see, e.g., [6] in the context of infinite horizon problems). However, this conversion may not solve the problem of dimension. But recent improvements in LP solvers make it an attractive alternative. Further, and more importantly, the LP based approach extends easily and provides solutions for constrained MDPs.

In the coming sections, as in the case of linear MDPs (see, e.g., [6]), we obtain two relevant LPs (a primal and a dual LP). The solution of the primal LP is the translated value function vector $\mathbf{u}^*$, which is the function value on the left hand side (LHS) of the DP equation (6) at the optimizer(s). On the other hand, the solution of the dual LP directly provides the optimal policy $\Pi^* \in \mathcal{D}$ of the control problem (3).

We begin by introducing some more notation. Let $N$ be the number of elements of the state space and, without loss of generality, let $\mathcal{X} = \{1, \cdots, N\}$. Let $\mathbf{u}_t = [u_t(1), \cdots, u_t(N)]$ represent an $N$ dimensional vector indexed by time $t$, indicative of the possible value function for the different states at time $t$. Let the combined vector that includes the value function for all combinations of time slots and states be written as:

$$\mathbf{u} = \{u_t(x);\ t < T,\ x \in \mathcal{X}\} = [\mathbf{u}_0, \mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_{T-1}].$$

For each time $t$, decision rule $\pi_t$ and action $a$, define the matrices

$$C_{t,\pi_t,a} = \begin{bmatrix} \pi_t(1,a)\, e^{r_t(1,a)} & 0 & \cdots & 0 \\ 0 & \pi_t(2,a)\, e^{r_t(2,a)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \pi_t(N,a)\, e^{r_t(N,a)} \end{bmatrix}, \qquad P_a = \begin{bmatrix} p(1|1,a) & p(2|1,a) & \cdots & p(N|1,a) \\ p(1|2,a) & p(2|2,a) & \cdots & p(N|2,a) \\ \vdots & & & \vdots \\ p(1|N,a) & p(2|N,a) & \cdots & p(N|N,a) \end{bmatrix}. \tag{4}$$

Define the operator that operates on the combined vector $\mathbf{u}$ by:

$$L\mathbf{u} = [L_0 \mathbf{u}, L_1 \mathbf{u}, \cdots, L_{T-1} \mathbf{u}] \quad \text{where} \quad L_t \mathbf{u} := \inf_{\pi_t} \sum_{a} C_{t,\pi_t,a} P_a\, \mathbf{u}_{t+1}, \tag{7}$$

with the matrices $C_{t,\pi_t,a}$, $P_a$ as defined in (4), and where $\mathbf{u}_T = \{u_T(x);\ x \in \mathcal{X}\}$ is given by equation (5). The above operator is constructed using the right hand side (RHS) of the DP equation (6). We now have the following theorems (whose proofs are in the Appendix):

Theorem 1: Any vector $\mathbf{u}$ with $L\mathbf{u} \ge \mathbf{u}$ (component wise inequality) satisfies $\mathbf{u} \le \mathbf{u}^*$.

Theorem 2: Any vector $\mathbf{u}$ with $L\mathbf{u} \le \mathbf{u}$ satisfies $\mathbf{u} \ge \mathbf{u}^*$.

Any vector $\mathbf{u}$ that satisfies the constraint $L\mathbf{u} \ge \mathbf{u}$, i.e.,

$$\mathbf{u}_t \le \sum_{a} C_{t,\pi_t,a} P_a\, \mathbf{u}_{t+1} \quad \text{for all } t \text{ and } \pi_t, \tag{8}$$

is, by Theorem 1, lower than the value function $\mathbf{u}^*$ of the risk MDP. It is trivial to check that $\mathbf{u}^*$ also satisfies (8).
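The backward recursion (5)-(6) and the claim that $\mathbf{u}^*$ satisfies (8) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the 0-based state and action indices, and the toy two-state, two-action instance are all assumptions made for illustration. Constraint (8) is checked action by action, which is equivalent to checking it for every randomized $\pi_t$ because the rows of $\pi_t$ are probability vectors.

```python
import math


def risk_dp(r, r_T, p, gamma=1.0):
    """Backward recursion (5)-(6) for u_t(x) = exp(gamma * V_t(x)).

    r[t][x][a]  : running cost at time t = 0..T-1 (hypothetical layout)
    r_T[x]      : terminal cost
    p[a][x][x2] : probability of moving from x to x2 under action a
    Returns (u, policy): u[t][x] for t = 0..T and a minimizing action policy[t][x].
    """
    T, N, A = len(r), len(r_T), len(p)
    u = [[0.0] * N for _ in range(T + 1)]
    policy = [[0] * N for _ in range(T)]
    u[T] = [math.exp(gamma * r_T[x]) for x in range(N)]          # equation (5)
    for t in range(T - 1, -1, -1):                               # backward in time
        for x in range(N):
            candidates = [
                math.exp(gamma * r[t][x][a])
                * sum(p[a][x][x2] * u[t + 1][x2] for x2 in range(N))
                for a in range(A)
            ]
            policy[t][x] = min(range(A), key=candidates.__getitem__)
            u[t][x] = candidates[policy[t][x]]                   # equation (6)
    return u, policy


def satisfies_constraint_8(u, r, p, gamma=1.0, tol=1e-12):
    """Check u_t(x) <= e^{gamma r_t(x,a)} sum_x2 p(x2|x,a) u_{t+1}(x2) for all t, x, a."""
    T, N, A = len(r), len(u[0]), len(p)
    for t in range(T):
        for x in range(N):
            for a in range(A):
                rhs = math.exp(gamma * r[t][x][a]) * sum(
                    p[a][x][x2] * u[t + 1][x2] for x2 in range(N)
                )
                if u[t][x] > rhs + tol:
                    return False
    return True


# Toy instance (illustrative numbers only): N = 2 states, 2 actions, T = 3.
P = [[[0.9, 0.1], [0.4, 0.6]], [[0.2, 0.8], [0.5, 0.5]]]
R = [[[0.1, 0.3], [0.2, 0.0]], [[0.0, 0.4], [0.5, 0.1]], [[0.2, 0.2], [0.3, 0.6]]]
R_T = [0.0, 1.0]
GAMMA = 0.5

u_star, greedy = risk_dp(R, R_T, P, GAMMA)
value = [[math.log(v) / GAMMA for v in row] for row in u_star]  # V_t(x) recovered from u_t(x)
print(satisfies_constraint_8(u_star, R, P, GAMMA))
```

Because `u_star[t][x]` equals the smallest candidate, any upward perturbation of a single entry violates (8), which is the numerical counterpart of $\mathbf{u}^*$ being the largest vector satisfying the constraint.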
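Thus it is the greatest lower bound among all vectors that satisfy (8). Hence we have the following LP for the primal. Following a procedure similar to that in [6], we choose a nonnegative vector $\alpha(x)$, $x \in \mathcal{X}$, that satisfies $\sum_{x \in \mathcal{X}} \alpha(x) = 1$. Using this one can obtain an equivalent LP, whose solution equals the value function vector $\mathbf{u}^*$.

Primal Linear Program

$$\max_{\{u_t(x)\}_{x \in \mathcal{X},\, t \le T-1}} \ \sum_{x \in \mathcal{X}} \alpha(x)\, u_0(x)$$

subject to:

$$u_{T-1}(x) \le b_{x,a} \quad \text{for all } x, a,$$

$$u_t(x) \le e^{r_t(x,a)} \sum_{x'} p(x'|x,a)\, u_{t+1}(x') \quad \text{for all } a, x \text{ and } t \le T-2,$$

with $b_{x,a} := e^{r_{T-1}(x,a)} \sum_{x'} p(x'|x,a)\, e^{r_T(x')}$.

The vector $\alpha$ can be interpreted as the distribution of the initial state $X_0$. The dual of the above LP is given by:

Dual Linear Program

$$\min_{\mathbf{y} = \{y(t,x,a);\ t \le T-1,\ x \in \mathcal{X},\ a \in \mathcal{A}\}} \ \sum_{x \in \mathcal{X}} \sum_{a \in \mathcal{A}} b_{x,a}\, y(T-1, x, a)$$

subject to:

$$\sum_{a} y(0, x', a) = \alpha(x') \quad \text{for all } x' \in \mathcal{X}, \tag{9}$$

$$\sum_{a} y(t, x', a) = \sum_{x, a} e^{r_{t-1}(x,a)}\, p(x'|x,a)\, y(t-1, x, a) \quad \text{for all } 1 \le t \le T-1 \text{ and } x' \in \mathcal{X}. \tag{10}$$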
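One way to read the dual constraints (9)-(10) is as a forward flow recursion: fixing a randomized policy and splitting the inflow at each state according to that policy generates a feasible $\mathbf{y}$. The sketch below is an illustration under assumptions (hypothetical function names, 0-based indices, made-up numbers, $\gamma$ absorbed into $r$), not the authors' implementation. It builds such a $\mathbf{y}$ and compares the dual objective with the multiplicative cost obtained by enumerating all trajectories of a tiny instance.

```python
import itertools
import math


def dual_point(pi, alpha, r, p):
    """Forward recursion suggested by (9)-(10) for a given randomized policy.

    pi[t][x][a] : probability of action a in state x at time t (hypothetical layout)
    Returns y[t][x][a] for t = 0..T-1.
    """
    T, N, A = len(r), len(alpha), len(p)
    y = [[[0.0] * A for _ in range(N)] for _ in range(T)]
    for x in range(N):
        for a in range(A):
            y[0][x][a] = alpha[x] * pi[0][x][a]                  # makes (9) hold
    for t in range(1, T):
        for x2 in range(N):
            inflow = sum(                                        # RHS of (10)
                math.exp(r[t - 1][x][a]) * p[a][x][x2] * y[t - 1][x][a]
                for x in range(N) for a in range(A)
            )
            for a2 in range(A):
                y[t][x2][a2] = pi[t][x2][a2] * inflow
    return y


def dual_objective(y, r, r_T, p):
    """sum_{x,a} b_{x,a} y(T-1,x,a) with b_{x,a} as in the primal LP."""
    T, N, A = len(r), len(r_T), len(p)
    return sum(
        math.exp(r[T - 1][x][a])
        * sum(p[a][x][x2] * math.exp(r_T[x2]) for x2 in range(N))
        * y[T - 1][x][a]
        for x in range(N) for a in range(A)
    )


def risk_cost_brute_force(pi, alpha, r, r_T, p):
    """E[exp(sum_n r_n(X_n, A_n) + r_T(X_T))] by enumerating every trajectory."""
    T, N, A = len(r), len(alpha), len(p)
    total = 0.0
    for xs in itertools.product(range(N), repeat=T + 1):
        for acts in itertools.product(range(A), repeat=T):
            prob, cost = alpha[xs[0]], r_T[xs[T]]
            for t in range(T):
                prob *= pi[t][xs[t]][acts[t]] * p[acts[t]][xs[t]][xs[t + 1]]
                cost += r[t][xs[t]][acts[t]]
            total += prob * math.exp(cost)
    return total


# Toy instance (illustrative numbers only): 2 states, 2 actions, T = 3.
ALPHA = [0.3, 0.7]
P = [[[0.9, 0.1], [0.4, 0.6]], [[0.2, 0.8], [0.5, 0.5]]]
R = [[[0.1, 0.3], [0.2, 0.0]], [[0.0, 0.4], [0.5, 0.1]], [[0.2, 0.2], [0.3, 0.6]]]
R_T = [0.0, 1.0]
PI = [[[0.5, 0.5], [0.2, 0.8]], [[1.0, 0.0], [0.3, 0.7]], [[0.6, 0.4], [0.0, 1.0]]]

y = dual_point(PI, ALPHA, R, P)
gap = abs(dual_objective(y, R, R_T, P) - risk_cost_brute_force(PI, ALPHA, R, R_T, P))
print(gap < 1e-9)
```

The enumeration grows exponentially in $T$ and is only meant as a cross-check on tiny instances; the recursion itself is linear in $T$.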

Below we give a series of results connecting the dual LP (9) and the translated risk MDP (6). Some of the proofs and results of this section have a structure similar to that given in [6]. However, there are significant changes due to the risk sensitive nature of the cost.

A. Feasible region $\mathcal{F}$ and the set of risk policies $\mathcal{D}$

We say that a vector $\mathbf{y}$ is feasible if it satisfies the dual constraints (9), (10), and let $\mathcal{F}$ represent this feasible region. We first show a one to one correspondence between the two spaces, $\mathcal{F}$ and $\mathcal{D}$.

Theorem 3: (i) For any policy $\Pi \in \mathcal{D}$ of the risk MDP, there exists a vector $\mathbf{y}_\Pi$ which satisfies all the constraints of the dual LP (9), i.e., $\mathbf{y}_\Pi \in \mathcal{F}$. The feasible vector is given by (see (1)):

$$y_\Pi(0, x, a) = \alpha(x)\, \pi_0(x, a) \quad \text{for all } x \in \mathcal{X},\ a \in \mathcal{A},$$

$$y_\Pi(t, x_t, a_t) = \sum_{\mathbf{x}_0^{t-1},\, \mathbf{a}_0^{t-1}} \alpha(x_0)\, e^{\sum_{n=0}^{t-1} r_n(x_n, a_n)} \prod_{n=0}^{t} q_n^\Pi(x_n, a_n \mid x_{n-1}, a_{n-1}) \quad \text{for all } x_t \in \mathcal{X},\ a_t \in \mathcal{A} \text{ and } t < T. \tag{11}$$

In the above we define $q_0^\Pi(x, a \mid x', a') := \pi_0(x, a)$.

(ii) Given a vector $\mathbf{y} \in \mathcal{F}$, define a policy $\Pi_y$ using the following rule:

$$\pi_{y,t}(x, a) := \frac{y(t, x, a)}{\sum_{a'} y(t, x, a')} \quad \text{for all } x \in \mathcal{X} \text{ and } a \in \mathcal{A}. \tag{12}$$

The vector $\mathbf{y}_{\Pi_y}$ defined by equation (11) of point (i) is again in the feasible region and equals $\mathbf{y}$.

Proof: The proof is provided in the Appendix.

B. Expectation at optimal policy

To obtain the connection between the risk MDP problem and the dual LP, one needs to study the connection between the risk sensitive cost for a given policy $\Pi$ and the dual objective function at the feasible point $\mathbf{y}_\Pi$ defined using $\Pi$. We also require a similar connection between a feasible point $\mathbf{y}$ and the corresponding policy $\Pi_y$. Further, we would like to solve constrained risk sensitive MDP problems (in Section IV). The constraints usually bound the expected value of some function of the state, action random trajectories. In all, we require the expression for the expected value of a given function in terms of the dual variable $\mathbf{y} \in \mathcal{F}$. As a first step, we have the following, with proof in the Appendix:

Lemma 1: Let $X_0 \sim \alpha$. For any feasible point $\mathbf{y}$ of the dual LP, integrable function $f$ and $t < T$:

$$\sum_{x,a} y(t, x, a)\, f(x, a) = E^{\alpha, \Pi_y}\big[ \Psi_t\, f(X_t, A_t) \big] \quad \text{with} \quad \Psi_t := e^{\sum_{n=0}^{t-1} r_n(X_n, A_n)}. \tag{13}$$

Further, for any integrable function $f$ of the last two states $X_{T-1}, X_T$ and the final action $A_{T-1}$, we have:

$$\sum_{x,a,x'} y(T-1, x, a)\, p(x'|x,a)\, f(x, a, x') = E^{\alpha, \Pi_y}\big[ \Psi_{T-1}\, f(X_{T-1}, A_{T-1}, X_T) \big]. \tag{14}$$

We have the same result when we replace $\mathbf{y}, \Pi_y$ with $\mathbf{y}_\Pi, \Pi$ respectively, following exactly similar steps as in the previous theorem.

Lemma 2: For any policy $\Pi \in \mathcal{D}$ of the risk MDP and for any integrable function $f$, we have:

$$\sum_{x,a,x'} y_\Pi(T-1, x, a)\, p(x'|x,a)\, f(x, a, x') = E^{\alpha, \Pi}\big[ \Psi_{T-1}\, f(X_{T-1}, A_{T-1}, X_T) \big]. \tag{15}$$

If we use the above lemmas with the function $f(x, a, x') = e^{r_{T-1}(x,a)}\, e^{r_T(x')}$, we obtain the following for any pair $(\mathbf{y}, \Pi_y)$:

$$\sum_{x,a} y(T-1, x, a) \sum_{x'} p(x'|x,a)\, e^{r_{T-1}(x,a)}\, e^{r_T(x')} = E^{\alpha, \Pi_y}\Big[ e^{\sum_{n=0}^{T-1} r_n(X_n, A_n)}\, e^{r_T(X_T)} \Big].$$

Note that the LHS is the dual objective (9) at the point $\mathbf{y}$ and the RHS is the risk cost at the policy $\Pi_y$. This is the basic element in proving the equivalence of optimal policies and optimal dual solutions, given below.

C. Optimal policies and the dual solutions

The following theorem shows the relation between the two optimizers (proof in the Appendix).

Theorem 4: (a) If $\mathbf{y}^*$ is an optimal solution of the dual LP, then $\Pi_{y^*}$ defined by (12) is an optimal policy for the risk MDP. (b) If $\Pi^*$ is an optimal policy for the risk MDP, then $\mathbf{y}_{\Pi^*}$ is an optimal solution of the dual LP.

IV. CONSTRAINED RISK MDP

We now consider a constrained MDP problem, with an additional constraint as given below:

$$\min_{\Pi} J_0(\alpha, \Pi) \quad \text{subject to:} \quad \sum_{t} E^{\alpha, \Pi}\big[ f_t(X_t, A_t) \big] \le B, \tag{16}$$

for some set of integrable functions $\{f_t\}$, initial distribution $\alpha$ and bound $B$. Equation (13) of Lemma 1 could have been useful in obtaining the expectation defining the constraint, but for the extra factor $\Psi_t$ seen on the RHS of (13). We propose to add $\Psi_t$ as an additional state component to the original Markov chain $\{X_t\}$ to tackle this problem. We consider the two component Markov chain $\{(X_t, \Psi_t)\}$; the corresponding probability transition matrix depends explicitly upon the time index, as below:

$$p_{t+1}(x', \psi_{t+1} \mid x, \psi_t, a) = 1_{\{\psi_{t+1} = \psi_t e^{r_t(x,a)}\}}\, p(x'|x,a).$$

With the introduction of the new state component, for any dual LP feasible point $\mathbf{y}$ we have:

$$\sum_{x, \psi_t, a} \frac{y(t, x, \psi_t, a)}{\psi_t}\, f(x, a) = E^{\alpha, \Pi_y}\big[ f(X_t, A_t) \big]. \tag{17}$$

Thus one can obtain the optimal policy of the constrained risk MDP (16) by considering an additional state component and by adding an extra constraint to the dual LP (9), as below:

$$\min \ \sum_{x,a} \Big[ e^{r_{T-1}(x,a)} \sum_{x'} p(x'|x,a)\, e^{r_T(x')} \Big]\, y(T-1, x, a) \tag{18}$$

subject to:

$$y(t, x, a) = \sum_{\psi_t} y(t, x, \psi_t, a) \quad \text{for all } t,$$

$$\sum_{a} y(0, x, \psi_0, a) = \alpha(x)\, 1_{\{\psi_0 = 1\}} \quad \text{for all } x, \psi_0,$$

$$\sum_{a} y(t, x', \psi'_t, a) = \sum_{x, \psi_{t-1}, a} e^{r_{t-1}(x,a)}\, p_t(x', \psi'_t \mid x, \psi_{t-1}, a)\, y(t-1, x, \psi_{t-1}, a) \quad \text{for all } 1 \le t \le T-1 \text{ and } x', \psi'_t,$$

$$\text{and} \quad \sum_{t} \sum_{x, \psi_t, a} \frac{y(t, x, \psi_t, a)}{\psi_t}\, f_t(x, a) \le B.$$

We would like to mention here that $\psi_0 = 1$ is always initialized to one, $\Psi_1$ can take at maximum $|\mathcal{X}||\mathcal{A}|$ values, while $\Psi_t$ for any $t$ can take at maximum $|\mathcal{X}|^t |\mathcal{A}|^t$ possible values. There will also be considerable deletions if the concerned mapping $(\mathbf{a}_0^{t-1}, \mathbf{x}_0^{t-1}) \mapsto e^{\sum_{n=0}^{t-1} r_n(x_n, a_n)}$ is not one-one. One needs to consider this time dependent state space while solving the dual LP given above, and we omit the discussion of these obvious details.

V. APPLICATIONS

In [11], we applied the LP based approach to solve a constrained risk sensitive cost that arises naturally in the context of Delay Tolerant Networks (DTNs). The probability of message delivery failure with exponentially distributed contacts turns out to have a risk sensitive form. The direct solution to the power constrained problem performs significantly better than the solution obtained for a model with soft constraints. In this paper we consider another example, which investigates the effect of the risk sensitive cost on the optimal policy. As the risk factor increases, the optimal policies are no longer monotone.

A. Queueing with losses

We consider a queueing system with two possible service options. The fast service facility offers service at rate $\mu_1$ and is expensive, while the service rate of the slower one is $\mu_2$ with $\mu_2 < \mu_1$.
The system can support at maximum $N$ jobs in parallel, and any job arrival that finds all the $N$ servers busy leaves the system without service. The aim is to utilize the fast service facility in an optimal manner which minimizes the total number of jobs lost in a given time horizon, while maintaining the utilization of the fast service facility within a given limit.

We consider a queueing system with Bernoulli arrivals. In every time slot (of unit duration), a customer arrives with probability $\delta$ and there is no arrival with probability $1 - \delta$. The job demands are exponentially distributed with parameter $\mu_1$ (parameter $\mu_2$) when served by the fast (slow) server. Let $X_t$ represent the number of customers in the system and let $A_t$ be the indicator of the service type used in time slot $t$. The flag $A_t = 1$ implies that the faster service facility is used across all the servers, while $A_t = 2$ implies the use of the slower service facility. A customer leaves the system after service completion, in one time slot, with probability $1 - \Theta_{A_t}$ where

$$\Theta_a := e^{-\mu_a}, \qquad \mu_a := \mu_1 1_{\{a=1\}} + \mu_2 1_{\{a=2\}}.$$

Thus the transition probability matrix of this controlled Markov chain is given by

$$p(x'|x,a) = \begin{cases} \Theta_a^N + \delta N \Theta_a^{N-1} (1 - \Theta_a) & \text{if } x' = x = N, \\ \delta\, \Theta_a^{x}\, 1_{\{x < N\}} & \text{if } x' = x + 1, \\ (1 - \delta)(1 - \Theta_a)^{x} & \text{if } x' = 0, \\ \delta \binom{x}{x'-1} \Theta_a^{x'-1} (1 - \Theta_a)^{x - x' + 1} + (1 - \delta) \binom{x}{x'} \Theta_a^{x'} (1 - \Theta_a)^{x - x'} & \text{if } 0 < x' \le x, \\ 0 & \text{else.} \end{cases}$$

With $G_t$ representing the flag indicating the arrival of a customer in time slot $t$, the total number of customers lost in a total of $T$ time slots is given by $\sum_{t=0}^{T-1} 1_{\{X_t = N\}} G_t$, and we are interested in minimizing the corresponding risk sensitive cost for a given risk parameter $\gamma$:

$$J(x, \Pi) = E^{x, \Pi}\Big[ e^{\gamma \sum_{t=0}^{T-1} 1_{\{X_t = N\}} G_t} \Big].$$

Theorem 5: The required risk sensitive cost has a simpler form as below:

$$J(x, \Pi) = E^{x, \Pi}\Big[ e^{\beta \sum_{t=0}^{T-1} 1_{\{X_t = N\}}} \Big] \quad \text{with} \quad \beta = \ln\big( \delta e^{\gamma} + (1 - \delta) \big). \tag{19}$$
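The transition law above and the constant $\beta$ of Theorem 5 are easy to tabulate. The sketch below is an illustration only: the function names are hypothetical, the parameter values in the usage lines are made up and are not the values used in the paper's experiments, and states are indexed $0, \ldots, N$ by the number of customers. A quick sanity check is that every row of the matrix sums to one.

```python
import math


def transition_row(x, N, delta, theta):
    """Row p(. | x, a) of the queue's transition matrix for a given Theta_a = theta."""
    row = [0.0] * (N + 1)
    for x2 in range(N + 1):
        if x == N and x2 == N:
            # all servers stay busy, or one departure is replaced by an arrival
            row[x2] = theta ** N + delta * N * theta ** (N - 1) * (1 - theta)
        elif x2 == x + 1:
            # arrival and no departure (x < N holds automatically since x2 <= N)
            row[x2] = delta * theta ** x
        elif x2 == 0:
            # no arrival and every customer in service departs
            row[x2] = (1 - delta) * (1 - theta) ** x
        elif 0 < x2 <= x:
            with_arrival = math.comb(x, x2 - 1) * theta ** (x2 - 1) * (1 - theta) ** (x - x2 + 1)
            without_arrival = math.comb(x, x2) * theta ** x2 * (1 - theta) ** (x - x2)
            row[x2] = delta * with_arrival + (1 - delta) * without_arrival
    return row


def transition_matrix(N, delta, mu):
    """Full (N+1) x (N+1) matrix for service rate mu, with Theta = exp(-mu)."""
    theta = math.exp(-mu)
    return [transition_row(x, N, delta, theta) for x in range(N + 1)]


def beta(gamma, delta):
    """beta = ln(delta * e^gamma + 1 - delta), so that e^beta = E[e^{gamma G_t}]."""
    return math.log(delta * math.exp(gamma) + (1 - delta))


# Illustrative parameters (assumptions, not the paper's values).
P_fast = transition_matrix(N=3, delta=0.4, mu=0.3)
row_sums = [sum(row) for row in P_fast]
print(all(abs(s - 1.0) < 1e-12 for s in row_sums))
```

Since $G_t$ is Bernoulli with parameter $\delta$, $\beta$ vanishes at $\gamma = 0$ and never exceeds $\gamma$ for $\gamma \ge 0$, which gives two further cheap checks on the implementation.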

Proof: Note that the arrivals in the time slots with $X_t = N$ are lost, because all the servers are busy. These arrivals do not change the number in the system in the next time slot, $X_{t+1}$, and hence are independent of the system evolution. By conditioning on the Markov chain trajectory $\{X_t\}_{t=0}^{T-1}$ and because of the independence just discussed, we have:

$$J(x, \Pi) = E^{x, \Pi}\Big[ g_\delta^{\sum_{t=0}^{T-1} 1_{\{X_t = N\}}} \Big] \quad \text{with} \quad g_\delta := E\big[ e^{\gamma G_t} \big].$$

Let $\beta := \ln(g_\delta)$, so that $e^{\beta} = g_\delta$.

[Figure 1 plot: probability of fast service under the optimal policy, with curves $\pi_t^*(2,1)$ and $\pi_t^*(3,1)$ for each of the two values of $\gamma$.]

We would like to optimize the above risk sensitive cost under the following constraint, for a given utilization bound $B$:

$$E^{x, \Pi}\Big[ \sum_{t=0}^{T-1} 1_{\{A_t = 1\}} X_t \Big] \le B.$$

Basically, when the fast facility is chosen as the option in any time slot, $X_t$ servers are using the fast facility, hence the above constraint.

Numerical analysis

We obtain the optimal policy for the above queueing based control problem,

$$\min_{\Pi} E^{x, \Pi}\Big[ e^{\beta \sum_{t=0}^{T-1} 1_{\{X_t = N\}}} \Big] \quad \text{such that} \quad E^{x, \Pi}\Big[ \sum_{t=0}^{T-1} 1_{\{A_t = 1\}} X_t \Big] \le B,$$

by solving the corresponding LP (18), where $\beta$ depends upon the risk parameter $\gamma$ as given by Theorem 5. We did most of the coding in Matlab except for the LP part. We used AMPL to model the LP and the Gurobi solver to solve it. The solution $\mathbf{y}^*$ of the LP provides the optimal policy $\Pi_{y^*}$ as given by equation (12).

In Figure 1, we consider a system with 3 servers. We plot the optimal policy for two values of $\gamma$. The optimal policy with $x = 0$ (i.e., with no customers in the system) has no impact, as the server(s) are not utilized. With both values of $\gamma$, the optimal policy with one customer in the system, i.e., with $x = 1$, is to switch off the fast service facility at all time slots. But there is a big difference for the remaining two states $x = 2, 3$, and these policies are plotted in the figure. We plot the probability of fast service, as dictated by the optimal policy, with states $x = 2$ and $x = 3$ across the time slots. With the smaller value of $\gamma$, the optimal policies are of threshold type. With $x = 2$ it switches off the fast facility at 5th time slot, while with $x = 3$ it switches off at the nd time slot.
The risk cost is close to the linear cost for small values of the risk factor $\gamma$, and the optimal policies are well understood to be of threshold type for linear control; this explains the figure for the case with the smaller $\gamma$. With the larger value of $\gamma$, the risk optimal policies are no longer of threshold type. In fact they are not even monotone, as seen from Figure 1.

Fig. 1. Optimal policy with T =, N = 3, $\mu_1$ =.3, $\mu_2$ =. and B =.5 (horizontal axis: time epochs). For $x = 1$ the optimal policy always uses the slow server.

Further, the probability of fast service is higher at smaller states ($x = 2$) with small $\gamma$, while the opposite is true with the larger $\gamma$. With larger importance to the risk cost, the policy is more cautious at the points of high risk, i.e., when $x = 3$. We estimate the average number of customers lost at the optimal policy using Lemma 1, as below:

$$E^{x, \Pi^*}[N_{lost}] = E^{x, \Pi^*}\Big[ \sum_{t} 1_{\{X_t = N\}} G_t \Big] = \sum_{t} \sum_{x, a, \psi_t} \frac{y^*(t, x, \psi_t, a)}{\psi_t}\, 1_{\{x = N\}}\, \delta.$$

The average number lost equals .37 and .45 at the smaller and the larger value of $\gamma$, respectively. This is expected because, with a high risk factor, the importance drifts away from the average number lost. We considered many more numerical examples and found a similar characterization of the optimal policies.

CONCLUSIONS

We consider a finite horizon risk MDP problem and establish the connections between the DP and LP approaches. We show that the solution of the unconstrained risk MDP problem (3) can be obtained via the solution of any one of the two LPs, primal and dual. The primal solution provides the value function while the dual solution directly provides the risk optimal policy. It is not straightforward to extend the solution to the constrained risk MDP problem. We augment the state space with a suitable component that, at any time slot, captures the effect of the risk cost until that slot. We propose a third LP using the augmented state space transitions, which provides the solution to the constrained risk MDP problem.

We apply the results so obtained to study the server selection problem in the context of Bernoulli queues with losses. Our aim is to minimize the number of customers lost, i.e., returned without service.
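We consider minimizing the risk version of the cost and optimize it under a fast server utilization constraint. The optimal policy is a threshold policy when the risk factors are close to zero. It is well known that the risk MDP is close to the linear MDP with small risk factors and hence a threshold policy is anticipated. However, with large risk factors the risk optimal policy is no longer of threshold type. The policies are not even monotone. Further, we notice that the probability of choosing the fast server is higher at larger states. With higher preference to the risk cost, the policy emphasizes utilization of the fast server at the high risk states, the larger states. Thus the proposed LPs are useful in obtaining the solutions of constrained/unconstrained finite horizon risk MDPs.

ACKNOWLEDGEMENTS

The work towards this problem originated with Prof. Eitan Altman's remark about finding the connections between LPs and risk sensitive MDP problems.

APPENDIX: PROOFS

The proofs of this section follow a structure similar to that given in [6]. However, there are significant changes due to the risk sensitive nature of the cost.

Proof of Theorem 1: Consider any vector $\mathbf{u}$ satisfying $\mathbf{u} \le L\mathbf{u}$. Consider any policy $\Pi = [\pi_0, \pi_1, \cdots, \pi_{T-1}]$. By definition of the operator $L$:

$$\mathbf{u}_0 \le \inf_{\pi_0} \Big\{ \sum_{a_0} C_{0,\pi_0,a_0} P_{a_0} \mathbf{u}_1 \Big\} \tag{20}$$

$$\le \sum_{a_0} C_{0,\pi_0,a_0} P_{a_0} \mathbf{u}_1 \le \sum_{a_0, a_1} C_{0,\pi_0,a_0} P_{a_0} C_{1,\pi_1,a_1} P_{a_1} \mathbf{u}_2 \le \cdots \le \sum_{\mathbf{a}_0^{T-1}} C_{0,\pi_0,a_0} P_{a_0} C_{1,\pi_1,a_1} P_{a_1} \cdots C_{T-1,\pi_{T-1},a_{T-1}} P_{a_{T-1}} \mathbf{u}_T = \mathbf{J}_0(\Pi),$$

with $\mathbf{J}_0(\Pi) := [J_0(1, \Pi), J_0(2, \Pi), \cdots, J_0(N, \Pi)]^{\top}$. This is true for any policy $\Pi$. Thus

$$\mathbf{u}_0 \le \inf_{\Pi} \mathbf{J}_0(\Pi) = \mathbf{u}_0^*. \tag{21}$$

Following exactly similar logic one can show, for all $t < T$, that $\mathbf{u}_t \le \mathbf{u}_t^*$.

Proof of Theorem 2: Let us consider any vector $\mathbf{u}$ which satisfies $\mathbf{u} \ge L\mathbf{u}$. By definition of $L$:

$$\mathbf{u}_0 \ge \inf_{\pi_0} \Big\{ \sum_{a_0} C_{0,\pi_0,a_0} P_{a_0} \mathbf{u}_1 \Big\}. \tag{22}$$

Consider any $\epsilon > 0$. By definition of the infimum there exists a decision rule $\pi_0$ such that

$$\mathbf{u}_0 \ge \sum_{a_0} C_{0,\pi_0,a_0} P_{a_0} \mathbf{u}_1 - \epsilon_0.$$

By boundedness of the matrices involved (finite states and actions), and by further choosing the decision rules $\pi_1$, $\pi_2$ etc. inductively, we obtain the following for any increasing sequence $\{\epsilon_i\}$ with $\epsilon = \epsilon_{T-1}$:

$$\mathbf{u}_0 \ge \sum_{a_0, a_1} C_{0,\pi_0,a_0} P_{a_0} C_{1,\pi_1,a_1} P_{a_1} \mathbf{u}_2 - \epsilon_1 \ge \cdots \ge \sum_{\mathbf{a}_0^{T-1}} C_{0,\pi_0,a_0} P_{a_0} C_{1,\pi_1,a_1} P_{a_1} \cdots C_{T-1,\pi_{T-1},a_{T-1}} P_{a_{T-1}} \mathbf{u}_T - \epsilon_{T-1}.$$

Note in the above that, for example, $\epsilon_1$ is chosen such that (for an appropriate choice of $\tilde{\epsilon}_1$):

$$\epsilon_1 \ge \sum_{a_0} C_{0,\pi_0,a_0} P_{a_0} \tilde{\epsilon}_1 + \epsilon_0.$$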
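Thus, as in the proof of the previous theorem,

$$\mathbf{u}_0 \ge \sum_{\mathbf{a}_0^{T-1}} C_{0,\pi_0,a_0} P_{a_0} C_{1,\pi_1,a_1} P_{a_1} \cdots C_{T-1,\pi_{T-1},a_{T-1}} P_{a_{T-1}} \mathbf{u}_T - \epsilon_{T-1} = \mathbf{J}_0(\Pi) - \epsilon \quad \text{with } \Pi = [\pi_0, \pi_1, \cdots, \pi_{T-1}].$$

Thus

$$\mathbf{u}_0 \ge \mathbf{J}_0(\Pi) - \epsilon \ge \inf_{\Pi'} \mathbf{J}_0(\Pi') - \epsilon = \mathbf{u}_0^* - \epsilon.$$

Thus for any $\epsilon > 0$ one can choose an appropriate increasing sequence $\{\epsilon_i\}$ such that $\mathbf{u}_0 \ge \mathbf{u}_0^* - \epsilon$. Since $\epsilon > 0$ is arbitrary, consider the limit $\epsilon \to 0$, and hence $\mathbf{u}_0 \ge \mathbf{u}_0^*$. Following exactly similar logic one can show, for all $t < T$, that $\mathbf{u}_t \ge \mathbf{u}_t^*$.

Proof of Theorem 3: Part (i): It is easy to see that the defined point $\mathbf{y}_\Pi$ satisfies the first constraint (9), as by definition, for all $x$,

$$\sum_{a} y_\Pi(0, x, a) = \sum_{a} \alpha(x)\, \pi_0(x, a) = \alpha(x).$$

Define

$$\Gamma_t := \prod_{n=0}^{t} q_n^\Pi(x_n, a_n \mid x_{n-1}, a_{n-1})\, e^{\sum_{n=0}^{t-1} r_n(x_n, a_n)}.$$

Note that $\Gamma_t$ depends upon the vectors $\mathbf{a}_0^t, \mathbf{x}_0^t$. Using the above definition, we can rewrite

$$y_\Pi(t, x_t, a_t) = \sum_{\mathbf{a}_0^{t-1}} \sum_{\mathbf{x}_0^{t-1}} \alpha(x_0)\, \Gamma_t.$$

To simplify the notation, we represent the state, action pair by $z_t := (x_t, a_t)$ for every $t$. Considering the right hand side (RHS) of the second constraint (10):

$$\sum_{z_{t-1}} e^{r_{t-1}(z_{t-1})}\, p(x_t | z_{t-1})\, y_\Pi(t-1, z_{t-1}) = \sum_{z_{t-1}} e^{r_{t-1}(z_{t-1})}\, p(x_t | z_{t-1}) \sum_{\mathbf{z}_0^{t-2}} \alpha(x_0)\, \Gamma_{t-1}$$

$$= \sum_{a_t} \sum_{\mathbf{z}_0^{t-1}} \alpha(x_0)\, \Gamma_{t-1}\, e^{r_{t-1}(z_{t-1})}\, \pi_t(z_t)\, p(x_t | z_{t-1}) = \sum_{a_t} \sum_{\mathbf{z}_0^{t-1}} \alpha(x_0)\, \Gamma_{t-1}\, e^{r_{t-1}(z_{t-1})}\, q_t^\Pi(z_t | z_{t-1})$$

$$= \sum_{a_t} \sum_{\mathbf{z}_0^{t-1}} \alpha(x_0)\, \Gamma_t = \sum_{a_t} y_\Pi(t, z_t).$$

Hence the point $\mathbf{y}_\Pi$ satisfies the second constraint (10).

Part (ii): Consider any $\mathbf{y} \in \mathcal{F}$, the policy $\Pi_y$ defined as in (12) and then the point $\mathbf{y}_{\Pi_y}$ defined using the policy $\Pi_y$ as in (11). The aim is to prove that $y(t, z_t) = y_{\Pi_y}(t, z_t)$ for all $t \le T-1$, $x_t \in \mathcal{X}$, $a_t \in \mathcal{A}$. Fix $(t, z_t)$. As in the previous proof, define

$$\Gamma_{k,t} := \prod_{n=k}^{t} q_n(z_n | z_{n-1})\, e^{\sum_{n=k}^{t-1} r_n(z_n)},$$

now including the starting time $k$. Since $\mathbf{y} \in \mathcal{F}$, it satisfies (9), and by the definition (11) of $\mathbf{y}_{\Pi_y}$ we have:

$$y_{\Pi_y}(t, z_t) = \sum_{\mathbf{z}_0^{t-1}} \Gamma_{0,t}\, \alpha(x_0) = \sum_{\mathbf{z}_0^{t-1}} \Gamma_{1,t}\, \alpha(x_0)\, e^{r_0(z_0)}\, \pi_{y,0}(z_0)$$

$$= \sum_{\mathbf{z}_0^{t-1}} \Gamma_{1,t}\, \alpha(x_0)\, e^{r_0(z_0)}\, \frac{y(0, z_0)}{\sum_{a} y(0, x_0, a)} \quad \text{(using (12))}$$

$$= \sum_{\mathbf{z}_0^{t-1}} \Gamma_{1,t}\, e^{r_0(z_0)}\, y(0, z_0) \quad \text{(using (9))}.$$

Further expanding $\Gamma_{1,t} = \Gamma_{2,t}\, e^{r_1(z_1)}\, q_1(z_1 | z_0)$ and simplifying as before, we reduce one pair of elements, $z_0$, in the summation:

$$y_{\Pi_y}(t, z_t) = \sum_{\mathbf{z}_1^{t-1}} \Gamma_{2,t}\, e^{r_1(z_1)} \sum_{z_0} e^{r_0(z_0)}\, q_1(z_1 | z_0)\, y(0, z_0)$$

$$= \sum_{\mathbf{z}_1^{t-1}} \Gamma_{2,t}\, e^{r_1(z_1)}\, \pi_{y,1}(z_1) \sum_{z_0} e^{r_0(z_0)}\, p(x_1 | z_0)\, y(0, z_0)$$

$$= \sum_{\mathbf{z}_1^{t-1}} \Gamma_{2,t}\, e^{r_1(z_1)}\, \pi_{y,1}(z_1) \sum_{a} y(1, x_1, a) \quad \text{(using (10))}$$

$$= \sum_{\mathbf{z}_1^{t-1}} \Gamma_{2,t}\, e^{r_1(z_1)}\, \frac{y(1, z_1)}{\sum_{a} y(1, x_1, a)} \sum_{a} y(1, x_1, a) \quad \text{(using (12))}$$

$$= \sum_{\mathbf{z}_1^{t-1}} \Gamma_{2,t}\, e^{r_1(z_1)}\, y(1, z_1).$$

Proceeding in a similar way, we reduce one more pair of elements, $z_1 = (x_1, a_1)$, in the summation, that is:

$$y_{\Pi_y}(t, z_t) = \sum_{\mathbf{z}_2^{t-1}} \Gamma_{3,t}\, e^{r_2(z_2)} \sum_{z_1} e^{r_1(z_1)}\, q_2(z_2 | z_1)\, y(1, z_1)$$

$$= \sum_{\mathbf{z}_2^{t-1}} \Gamma_{3,t}\, e^{r_2(z_2)}\, \pi_{y,2}(z_2) \sum_{z_1} e^{r_1(z_1)}\, p(x_2 | z_1)\, y(1, z_1)$$

$$= \sum_{\mathbf{z}_2^{t-1}} \Gamma_{3,t}\, e^{r_2(z_2)}\, \pi_{y,2}(z_2) \sum_{a} y(2, x_2, a) \quad \text{(using (10))}$$

$$= \sum_{\mathbf{z}_2^{t-1}} \Gamma_{3,t}\, e^{r_2(z_2)}\, \frac{y(2, z_2)}{\sum_{a} y(2, x_2, a)} \sum_{a} y(2, x_2, a) = \sum_{\mathbf{z}_2^{t-1}} \Gamma_{3,t}\, e^{r_2(z_2)}\, y(2, z_2).$$

Repeating exactly the same steps, we eliminate all the terms till and including $z_{t-2}$, to obtain the following (note that $\Gamma_{t,t} = q_t(z_t | z_{t-1})$):

$$y_{\Pi_y}(t, z_t) = \sum_{z_{t-1}} q_t(z_t | z_{t-1})\, e^{r_{t-1}(z_{t-1})}\, y(t-1, z_{t-1}) = \pi_{y,t}(z_t) \sum_{z_{t-1}} e^{r_{t-1}(z_{t-1})}\, p(x_t | z_{t-1})\, y(t-1, z_{t-1})$$

$$= \frac{y(t, z_t)}{\sum_{a} y(t, x_t, a)} \sum_{a} y(t, x_t, a) \quad \text{(using (10) and (12))} \quad = y(t, z_t).$$

This is true for all $t \le T-1$.

Proof of Lemma 1: By Theorem 3, $y(t, z_t) = y_{\Pi_y}(t, z_t)$ for any $t < T$. Further, using the definition of $y_{\Pi_y}(t, z_t)$ from (11), one can rewrite the left hand side (LHS) of (13) as

$$\sum_{z_t} y(t, z_t)\, f(z_t) = \sum_{z_t} y_{\Pi_y}(t, z_t)\, f(z_t) = \sum_{z_t} f(z_t) \sum_{\mathbf{z}_0^{t-1}} \alpha(x_0) \prod_{n=0}^{t} q_n(z_n | z_{n-1})\, e^{\sum_{n=0}^{t-1} r_n(z_n)}$$

$$= \sum_{\mathbf{z}_0^{t}} \Big( f(z_t)\, e^{\sum_{n=0}^{t-1} r_n(z_n)} \Big)\, \alpha(x_0) \prod_{n=0}^{t} q_n(z_n | z_{n-1}) = E^{\alpha, \Pi_y}\Big[ e^{\sum_{n=0}^{t-1} r_n(X_n, A_n)}\, f(X_t, A_t) \Big] \quad \text{for any } t < T.$$

One can get the second result (14) along exactly similar lines.

Proof of Theorem 4: We begin with the proof of (a). Let $g(\mathbf{y})$ represent the dual objective (see (9)) for any $\mathbf{y} \in \mathcal{F}$, i.e.,

$$g(\mathbf{y}) := \sum_{z_{T-1}} e^{r_{T-1}(z_{T-1})} \Big[ \sum_{x_T \in \mathcal{X}} p(x_T | z_{T-1})\, e^{r_T(x_T)} \Big]\, y(T-1, z_{T-1}).$$

Let $\mathbf{y}^*$ be an optimal solution of the dual LP, and let $\Pi_{y^*}$ be the corresponding policy given by (12). By optimality and because of equation (15), for any $\Pi \in \mathcal{D}$:

$$E^{\alpha, \Pi}\Big[ e^{\sum_{n=0}^{T-1} r_n(X_n, A_n) + r_T(X_T)} \Big] = g(\mathbf{y}_\Pi) \ge g(\mathbf{y}^*) = E^{\alpha, \Pi_{y^*}}\Big[ e^{\sum_{n=0}^{T-1} r_n(X_n, A_n) + r_T(X_T)} \Big],$$

establishing the required optimality. Part (b) can be proved in a similar way.

REFERENCES

[1] Atul Kumar, Veeraruna Kavitha and N. Hemachandra. Finite horizon risk sensitive MDP and linear programming. Technical report, 2015.
[2] Altman, Eitan. Constrained Markov decision processes. Vol. 7. CRC Press, 1999.
[3] Bertsekas, Dimitri P. Dynamic programming and optimal control. Vol. 1. No. 2. Belmont, MA: Athena Scientific, 1995.
[4] Feinberg, Eugene A., and Adam Shwartz, eds. Handbook of Markov decision processes: methods and applications. Boston, MA: Kluwer Academic Publishers, 2002.
[5] Coraluppi, Stefano P., and Steven I. Marcus. Risk-sensitive queueing. In Proceedings of the Annual Allerton Conference on Communication Control and Computing, volume 35. Citeseer, 1997.
[6] Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[7] Howard, Ronald A., and James E. Matheson. Risk-sensitive Markov decision processes. Management Science 18.7 (1972).
[8] Coraluppi, Stefano P., and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica 35.2 (1999).
[9] Osogami, Takayuki. Robustness and risk-sensitivity in Markov decision processes. Advances in Neural Information Processing Systems. 2012.
[10] Borkar, Vivek S., and Sean P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27.1 (2002): 192-209.
[11] Atul Kumar, Veeraruna Kavitha and N. Hemachandra. Power Constrained DTNs: Risk MDP-LP Approach. International Workshop on D2D Communications held in conjunction with WiOpt 2015, Mumbai, India.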
