Markovian Decision Process (MDP): theory and applications to wireless networks
1 Markovian Decision Process (MDP): theory and applications to wireless networks
Philippe Ciblat
Joint work with I. Fawaz, N. Ksairi, C. Le Martret, M. Sarkiss
2 Outline
- A few examples
- Markov Decision Process: Markov chain, discounted approach, Bellman's equation and fixed-point theorem, practical algorithms
- Applications to wireless networks: channel exploration, Hybrid ARQ optimization, scheduling with energy harvesting
Philippe Ciblat, Markovian Decision Process (MDP): theory and applications to wireless networks, 2 / 24
3 Part 1: A few examples
4 Example 1: Inventory control
During slot k, the stock goes from $x_k$ to $x_{k+1}$, with ordered input $a_k$ and requested output $w_k$:
- $x_k$: stock at the beginning of slot k
- $a_k$: stock ordered and immediately delivered at the beginning of slot k
- $w_k$: stock requested during slot k (i.i.d. process)
Goal: optimize $\{a_k\}$. If $a_k$ is too large, the ordering cost is high; if $a_k$ is too small, a stock outage occurs ($x_{k+1} = 0$).
5 Example 1: Inventory control (cont'd)
Mathematical model 1:
- Reward: $r(x_k, a_k)$; the cost is additive over time, with $\alpha$-discounted (long-term) reward
  $r = \lim_{N\to\infty} \sum_{n=0}^{N} \alpha^n r(x_n, a_n)$
- Sequential dynamics: $x_{k+1} = f(x_k, a_k, w_k) = (x_k + a_k - w_k)^+$, so $\{x_k\}$ is a Markov chain with transition probability $Q(x_{k+1} \mid x_k, a_k)$
- Policy: $a_k = \mu^{(k)}(x_k)$
Which $\{\mu^{(k)}\}$? Solve
  $\min_{\mu} \; \mathbb{E}_{x_0,\mu}\Big[\lim_{N\to\infty} \sum_{n=0}^{N} \alpha^n r(x_n, \mu^{(n)}(x_n))\Big]$
=> Markov Decision Process (MDP)
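This model is easy to simulate. Below is a minimal Python sketch of the inventory MDP under a simple base-stock policy $a_k = \max(S - x_k, 0)$; the demand law, the cost figures, and all names are illustrative assumptions, not values from the slides.

```python
import random

def simulate_inventory(threshold, horizon=2000, alpha=0.95, order_cost=2.0,
                       outage_cost=10.0, seed=0):
    """Simulate the inventory MDP x_{k+1} = (x_k + a_k - w_k)^+ under a
    base-stock policy a_k = max(threshold - x_k, 0); return the discounted cost."""
    rng = random.Random(seed)
    x = 0
    total = 0.0
    for k in range(horizon):
        a = max(threshold - x, 0)            # order up to the threshold
        w = rng.randint(0, 3)                # i.i.d. demand w_k
        cost = order_cost * a + (outage_cost if x + a - w <= 0 else 0.0)
        total += (alpha ** k) * cost         # alpha-discounted additive cost
        x = max(x + a - w, 0)                # sequential dynamics
    return total
```

With threshold 0 the system never orders and pays the outage cost every slot, so the discounted cost is exactly $10 \sum_k \alpha^k$; a positive threshold trades ordering cost against outage cost.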
6 Example 1: Inventory control (cont'd)
Mathematical model 2:
- Reward/cost: $c(x_k)$
- Outage indicator: $\bar{x}_k = 0$ if $x_k + a_k - w_k > 0$, and $1$ otherwise, with outage cost $o(\bar{x}_k) = \bar{x}_k$ and constraint $\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N} o(\bar{x}_n) < O_t$
- System state: $s_k = [x_k, \bar{x}_k]$
Optimal policy $\{\mu^{(k)}\}$:
  $\min_{\mu} \; \mathbb{E}_{s_0,\mu}\Big[\lim_{N\to\infty} \sum_{n=0}^{N} \alpha^n r(s_n, \mu^{(n)}(s_n))\Big]$ s.t. $\mathbb{E}_{s_0,\mu}\Big[\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N} o(s_n)\Big] < O_t$
=> Constrained Markov Decision Process (CMDP)
7 Example 2: Water reservoir control
Reservoir with finite capacity C: maximize simultaneously the stock and the water release.
- $x_k$: water volume at time k
- $a_k$: water volume released at time k (for electricity or irrigation)
- $w_k$: random inflow during slot k (rain and tributary rivers), i.i.d. process
Dynamics: $x_{k+1} = \min(x_k - a_k + w_k, C)$; reward: $r(x_k, a_k) = x_k + a_k$
$x_k$ is not perfectly known but estimated: we only observe $y_k = g(x_k, \zeta_k)$
=> Partially Observable Markov Decision Process (POMDP)
8 Part 2: Markov Decision Process
9 Markov chain
Let $\{x_k\}$ be a random sequence/process.
Definition: the sequence is a Markov chain iff $P(x_{k+1} \mid x_k, \ldots, x_0) = P(x_{k+1} \mid x_k)$ for all k.
If $x_k$ takes a finite number of values $X = \{x^{(1)}, \ldots, x^{(V)}\}$: transition probability matrix $T = (t_{m,n})_{1 \le m,n \le V}$ with $t_{m,n} = \mathrm{Prob}(x_{k+1} = x^{(m)} \mid x_k = x^{(n)}) \ge 0$ and $\sum_{m=1}^{V} t_{m,n} = 1$.
Tools: graph theory, (stochastic) non-negative matrix theory.
[Figure: transition graph over the four states $x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$]
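A quick numerical sketch of this convention in Python (the 2-state matrix below is a hypothetical example, not from the slides): with $t_{m,n} = \mathrm{Prob}(x_{k+1} = x^{(m)} \mid x_k = x^{(n)})$ every column of T sums to one, and iterating $\pi \mapsto T\pi$ yields the stationary distribution.

```python
# Hypothetical 2-state chain in the slide's column-stochastic convention:
# T[m][n] = Prob(x_{k+1} = x^(m) | x_k = x^(n)), so every column sums to 1.
T = [[0.9, 0.2],
     [0.1, 0.8]]

def column_sums(T):
    """Sanity check of the stochasticity constraint sum_m t_{m,n} = 1."""
    return [sum(T[m][n] for m in range(len(T))) for n in range(len(T[0]))]

def stationary(T, iters=500):
    """Power iteration pi <- T pi; converges to the stationary
    distribution for an ergodic chain."""
    V = len(T)
    pi = [1.0 / V] * V
    for _ in range(iters):
        pi = [sum(T[m][n] * pi[n] for n in range(V)) for m in range(V)]
    return pi
```

For this toy matrix the balance equation $0.1\,\pi_1 = 0.2\,\pi_2$ gives $\pi = [2/3, 1/3]$.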
10 State, Action, and Policy
- State: $x_k$ (system information at time k)
- Action: $a_k$
- Disturbance: $w_k$, i.i.d. process
- Transition kernel: $x_{k+1} = f(x_k, a_k, w_k)$ with
  $Q(x_{k+1} \mid x_k, a_k) = \mathrm{Prob}(x_{k+1} = x^{(m)} \mid x_k = x^{(n)}, a_k = a) = \int t_{m,n}(a, w)\, p_w(w)\, dw$
- Reward: $r(x_k, a_k)$, with criterion $\mathbb{E}_{x_0}\big[\sum_n \alpha^n r(x_n, a_n)\big]$ (discounted) or $\mathbb{E}_{x_0}\big[\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N} r(x_n, a_n)\big]$ (average)
- Policy: $\mu_k(x_k) \in A$, with A the set of actions; deterministic ($\mu_k$ is a function) or random ($\mu_k$ is a pdf); stationary ($\mu_k$ independent of k) or nonstationary
11 Bellman's equation
Let $\mu$ be a deterministic stationary policy, and let $R_\mu(x_0)$ be the discounted reward under policy $\mu$ initialized at $x_0$:
  $R_\mu(x_0) = \mathbb{E}_{x_0}\big[\sum_{n=0}^{\infty} \alpha^n r(x_n, \mu(x_n))\big]$
Bellman's equation:
  $R_\mu(x_0) = r(x_0, \mu(x_0)) + \alpha\, \mathbb{E}_{x_0}\big[\sum_{n=0}^{\infty} \alpha^n r(x_{n+1}, \mu(x_{n+1}))\big]$
  $= r(x_0, \mu(x_0)) + \alpha\, \mathbb{E}_{x_0}[R_\mu(x_1)]$
  $= r(x_0, \mu(x_0)) + \alpha \int R_\mu(x)\, Q(x \mid x_0, \mu(x_0))\, dx, \quad \forall x_0$
Hence $R_\mu = T_\mu(R_\mu)$ (fixed point), where for all $x_0$
  $T_\mu(f)(x_0) = r(x_0, \mu(x_0)) + \alpha \int f(x)\, Q(x \mid x_0, \mu(x_0))\, dx$
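On a finite state space the integral becomes a sum, and the fixed point $R_\mu = T_\mu(R_\mu)$ can be found by simply iterating $T_\mu$ until convergence. A minimal Python sketch (the row-oriented kernel convention, the tolerance, and the toy numbers in the usage note are assumptions, not from the slides):

```python
def policy_evaluation(r_mu, Q_mu, alpha=0.9, tol=1e-12, max_iter=10000):
    """Iterate R <- T_mu(R) = r_mu + alpha * Q_mu R until convergence.
    Since T_mu is an alpha-contraction, the iteration converges to the
    unique fixed point R_mu. Convention here: Q_mu[x][y] is the probability
    of moving to y from x under policy mu (rows sum to 1)."""
    V = len(r_mu)
    R = [0.0] * V
    for _ in range(max_iter):
        R_new = [r_mu[x] + alpha * sum(Q_mu[x][y] * R[y] for y in range(V))
                 for x in range(V)]
        if max(abs(R_new[x] - R[x]) for x in range(V)) < tol:
            return R_new
        R = R_new
    return R
```

For instance, with two absorbing states and rewards $r_\mu = [1, 0]$, the self-looping rewarding state gets value $1/(1-\alpha) = 10$, as predicted by the geometric series.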
12 Fixed-point theorem
We also define
  $T(f)(x_0) = \max_{a \in A} \big\{ r(x_0, a) + \alpha \int f(x)\, Q(x \mid x_0, a)\, dx \big\}$
Lemma: there exists a unique function $R$ (resp. $R_\mu$) such that $R = T(R)$ (resp. $R_\mu = T_\mu(R_\mu)$).
Sketch of proof: let $\|f\|_\infty = \sup_x |f(x)|$ on the set of bounded functions, and assume $r(\cdot,\cdot)$ is bounded for any state and action. Then
  $\|T(f) - T(g)\|_\infty \le \alpha \max_{a \in A} \Big| \int f(x)\, Q(x \mid x_0, a)\, dx - \int g(x)\, Q(x \mid x_0, a)\, dx \Big| \le \alpha\, \|f - g\|_\infty$
so $T$ is an $\alpha$-contraction and Banach's fixed-point theorem applies.
13 Optimal deterministic policy
Theorem: a (deterministic stationary) policy $\mu^*$ is said optimal iff $R_{\mu^*}(x_0) = R^*(x_0)$ for all $x_0$, with $R^*(x_0) = \sup_\mu R_\mu(x_0)$. Equivalently, $R_{\mu^*}$ is the fixed point of $T$.
Sketch of proof: let $\mu^*$ be a policy such that $R_{\mu^*}$ is the fixed point of $T$. Then
  $R_{\mu^*}(x_0) \ge r(x_0, a) + \alpha \int R_{\mu^*}(x)\, Q(x \mid x_0, a)\, dx, \quad \forall (x_0, a) \in X \times A$
so along any trajectory
  $R_{\mu^*}(x_n) \ge r(x_n, a_n) + \alpha \int R_{\mu^*}(x)\, Q(x \mid x_n, a_n)\, dx$
Multiplying by $\alpha^n$ and taking expectations,
  $\mathbb{E}_{x_0,\mu}[\alpha^n R_{\mu^*}(x_n)] - \alpha^{n+1}\, \mathbb{E}_{x_0,\mu}\big[\int R_{\mu^*}(x)\, Q(x \mid x_n, a_n)\, dx\big] \ge \mathbb{E}_{x_0,\mu}[\alpha^n r(x_n, a_n)]$
and, by the Markov property,
  $\mathbb{E}_{x_0,\mu}\big[\alpha^n \int R_{\mu^*}(x)\, Q(x \mid h_{n-1})\, dx\big] - \mathbb{E}_{x_0,\mu}\big[\alpha^{n+1} \int R_{\mu^*}(x)\, Q(x \mid h_n)\, dx\big] \ge \mathbb{E}_{x_0,\mu}[\alpha^n r(x_n, a_n)]$
Summing over $n = 0, \ldots, N$ (the left-hand side telescopes),
  $R_{\mu^*}(x_0) - \alpha^{N+1}\, \mathbb{E}_{x_0,\mu}\big[\int R_{\mu^*}(x)\, Q(x \mid h_N)\, dx\big] \ge \mathbb{E}_{x_0,\mu}\big[\sum_{n=0}^{N} \alpha^n r(x_n, a_n)\big]$
Letting $N \to \infty$,
  $R_{\mu^*}(x_0) \ge \mathbb{E}_{x_0,\mu}\big[\sum_n \alpha^n r(x_n, a_n)\big] = R_\mu(x_0), \quad \forall \mu$ (Q.E.D.)
14 Algorithm 1: Value Iteration (VI)
According to the fixed-point theorem, we have
  $R_{\mu^*} = \lim_{N\to\infty} \underbrace{T \circ \cdots \circ T}_{N \text{ times}}(R_0)$
for any function $R_0$. Let $\{R_n\}$ with $R_{n+1} = T(R_n)$:
  $R_{n+1}(x) = \max_a \big\{ r(x, a) + \alpha \int R_n(y)\, Q(y \mid x, a)\, dy \big\}$
  $\mu_n(x) = \arg\max_a \big\{ r(x, a) + \alpha \int R_n(y)\, Q(y \mid x, a)\, dy \big\}$
Theorem: $\lim_{N\to\infty} \mu_N(x) = \mu^*(x)$ for all $x$.
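For a finite MDP the integral becomes a finite sum and value iteration is a few lines of Python. The tabular encoding (r[x][a] for rewards, Q[x][a][y] for the kernel) and the toy two-state instance below are hypothetical choices for illustration:

```python
def value_iteration(r, Q, alpha=0.9, n_iter=1000):
    """Iterate R <- T(R) with T(R)(x) = max_a { r[x][a] + alpha * sum_y Q[x][a][y] R[y] },
    then return the value function and the greedy policy mu."""
    V, A = len(r), len(r[0])
    R = [0.0] * V
    for _ in range(n_iter):
        R = [max(r[x][a] + alpha * sum(Q[x][a][y] * R[y] for y in range(V))
                 for a in range(A)) for x in range(V)]
    mu = [max(range(A), key=lambda a, x=x: r[x][a] +
              alpha * sum(Q[x][a][y] * R[y] for y in range(V))) for x in range(V)]
    return R, mu

# Toy instance: state 1 pays reward 1 and is absorbing; in state 0,
# action 1 jumps to state 1 while action 0 stays put.
r_toy = [[0.0, 0.0], [1.0, 1.0]]
Q_toy = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [0.0, 1.0]]]
R_star, mu_star = value_iteration(r_toy, Q_toy)
```

The greedy policy correctly moves to the rewarding state, whose value is $1/(1-\alpha) = 10$; the non-rewarding state is worth one discount factor less, $\alpha \cdot 10 = 9$.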
15 Algorithm 2: Linear Programming (LP)
Lemma: Let a finite set of states $X = \{x^{(0)}, \ldots, x^{(V)}\}$ and a finite set of actions $A = \{a^{(0)}, \ldots, a^{(T)}\}$, and let $T$ be Bellman's operator on vectors, $W = T(W)$ with
  $W(m) = \max_a \big\{ r(x^{(m)}, a) + \alpha \sum_{n=0}^{V} W(n)\, Q(x^{(n)} \mid x^{(m)}, a) \big\}$
Let $\succeq$ be the elementwise "greater than".
- If $U \succeq V$, then $T(U) \succeq T(V)$
- If $W^*$ is the fixed point of $T$ and $W \succeq T(W)$, then $W \succeq W^*$
As $W^* \succeq T(W^*)$, we get
  $W^* = \arg\min_W \sum_{m=0}^{V} W(m)$ s.t. $W \succeq T(W)$
16 Algorithm 2: Linear Programming (LP) (cont'd)
Identify the function $R_\mu$ with the vector $R = [R_\mu(x^{(0)}), \ldots, R_\mu(x^{(V)})]$.
Linear programming algorithm:
  $R^* = \arg\min_R \sum_{m=0}^{V} R(m)$
  s.t. $R(m) \ge r(x^{(m)}, a^{(t)}) + \alpha \sum_{n=0}^{V} R(n)\, Q(x^{(n)} \mid x^{(m)}, a^{(t)}), \quad \forall m, t$
i.e. $R^*$ is the fixed point of $T$.
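This LP can be handed to any solver. A sketch with SciPy's linprog (assuming SciPy is available; the two-state instance is a hypothetical toy), where each constraint $R(m) \ge r(x^{(m)}, a^{(t)}) + \alpha \sum_n R(n)\, Q(x^{(n)} \mid x^{(m)}, a^{(t)})$ is rewritten in the solver's form $A_{ub} R \le b_{ub}$:

```python
from scipy.optimize import linprog

alpha = 0.9
# r[m][a] and Q[m][a][n] for a hypothetical 2-state, 2-action MDP:
# state 1 pays reward 1 and is absorbing; action 1 in state 0 jumps to state 1.
r = [[0.0, 0.0], [1.0, 1.0]]
Q = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
V, A = 2, 2

# Constraint R(m) >= r(m,a) + alpha * sum_n Q(n|m,a) R(n), rewritten as
# (alpha * Q(.|m,a) - e_m) @ R <= -r(m,a) for linprog.
A_ub, b_ub = [], []
for m in range(V):
    for a in range(A):
        A_ub.append([alpha * Q[m][a][n] - (1.0 if n == m else 0.0)
                     for n in range(V)])
        b_ub.append(-r[m][a])

# Minimize sum_m R(m) over all feasible R (R is unbounded a priori).
res = linprog(c=[1.0] * V, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * V)
R_star = list(res.x)
```

The solution coincides with the value-iteration fixed point for this instance: $R^* = [9, 10]$.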
17 Extension: CMDP
- A deterministic stationary policy is not optimal anymore
- An optimal random stationary policy exists
- The optimal policy is still computed through linear programming
18 Part 3: Applications to wireless networks
19 Multiband Exploration [Lun15]
A secondary user may use $N_c$ non-contiguous channels (e.g., with carrier aggregation in 4G) but can explore only $N_e$ channels simultaneously. Issue: which channels should be selected at each time? => Partially Observable Markov Decision Process (POMDP)
- States (at time n, not all known): empty/occupied channels
- Action: the $N_e$ channels to explore
- Transition kernel: $Q(s_{n+1} \mid s_n, a_n)$, Markovian process
- Reward: $r(s_n, a_n)$ = number of explored channels found empty
- Policy: $a_n = \pi(s_n)$ maximizing $\mathbb{E}\big[\sum_n \alpha^n r(s_n, a_n)\big]$
Extension: competition between secondary users => stochastic game
20 Hybrid ARQ optimization
Type-I HARQ: let S be a packet composed of N coded symbols. On NACK, the transmitter resends the same packet and the receiver decodes each copy independently.
Type-II HARQ: the receiver keeps past observations in memory and the transmitter sends non-identical packets. With $Y_1 = S_1(1) + N_1$ and $Y_2 = S_1(2) + N_2$, detection is performed on $Y = [Y_1, Y_2]$ (coding gain).
21 Case 1: Power optimization [Taj13]
We would like to adapt the power packet per packet. After the n-th transmission, if a NACK is received:
- New action $A_{n+1}$: choosing the power $P_{n+1}$
- Available information: accumulated mutual information
  $I_n = \sum_{l=1}^{n} \log_2(1 + G_l P_l)$
- $K_n \in \{$ACK and new round, 1 attempt and NACK received, $\ldots$, L attempts and NACK received$\}$
Since $P(K_{n+1}, I_{n+1} \mid K_n, I_n, \ldots) = P(K_{n+1}, I_{n+1} \mid K_n, I_n)$, the state $S_n = (K_n, I_n)$ is a Markov chain.
A policy $\mu$ = how to select $P_{n+1}$ given $S_n$.
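The accumulated-mutual-information recursion is straightforward to compute. A small Python helper (the function names are mine, and Gaussian signaling is assumed so that round l contributes $\log_2(1 + G_l P_l)$):

```python
import math

def accumulated_mi(gains, powers):
    """I_n = sum_{l=1}^{n} log2(1 + G_l P_l): mutual information
    accumulated across HARQ rounds (Gaussian inputs assumed)."""
    return sum(math.log2(1 + g * p) for g, p in zip(gains, powers))

def rounds_to_decode(rate, gains, powers):
    """First round n at which I_n >= rate (packet decodable), else None."""
    I = 0.0
    for n, (g, p) in enumerate(zip(gains, powers), start=1):
        I += math.log2(1 + g * p)
        if I >= rate:
            return n
    return None
```

A packet becomes decodable once $I_n$ exceeds the coding rate, which is exactly what the state component $I_n$ of $S_n = (K_n, I_n)$ tracks.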
22 Optimization problem and numerical results
  $\max_\mu \; \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \mathrm{reward}_\mu(S_n, A_n)$ (number of information bits after ACK)
  s.t. $\lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \mathrm{power}(S_n, A_n) \le P_{\max}$
An optimal random policy exists; the optimal pdf is obtained through linear programming.
[Figure: throughput (in bpcu) versus SNR (dB), optimal random-policy-based power versus constant power; source: Tajan's PhD]
23 Case 2: Best Modulation and Coding Scheme [Ksa15]
- Action: best modulation and coding scheme for the first packet transmission
- Action: best modulation for the retransmission
  - if BPSK: few redundancy bits sent, but well protected
  - if QAM: many redundancy bits sent, but not well protected
- State: current channel impulse response, number of HARQ attempts, effective SNR
[Figure: average throughput (Mbps) versus subframe index (iteration number), LTE set-up with correlated channel (25 km/h); proposed algorithm versus HARQ-aware scheme (not accounting for the future), HARQ-unaware scheme (not accounting for the future), and trivial scheme with random MCS selection]
24 Scheduling with energy harvesting [Faw17]
Device with battery of capacity B, buffer of size P, and latency constraint $K_0$: packets arrive as $a_n$ (rate $\lambda_a$), energy arrives as $e_n$ (rate $\lambda_e$), $u_n$ packets are sent (at an energy cost), and undeliverable packets are dropped.
- Action: send $u_n$ packets (of length L) using the energy
  $C_n = \frac{\sigma^2}{g_n}\big(2^{u_n L} - 1\big)$
- States $s_n$:
  - $k_n$: age of the packet within the buffer
  - $b_n$: battery level, with $b_{n+1} = \min(b_n - C_n + e_{n+1}, B)$
  - $g_n$: channel gain
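The energy cost of an action follows from inverting the capacity-style rate formula above. A small Python sketch (the function names and the battery-feasibility helper are my additions for illustration):

```python
def transmit_energy(u, L, g, sigma2=1.0):
    """Energy C_n = (sigma^2 / g_n) * (2**(u_n * L) - 1) needed to send
    u_n packets of length L over a channel with gain g_n."""
    return (sigma2 / g) * (2 ** (u * L) - 1)

def feasible_actions(b, u_max, L, g, sigma2=1.0):
    """Actions u_n whose energy cost fits in the current battery level b;
    this is the action set the scheduling policy chooses from."""
    return [u for u in range(u_max + 1) if transmit_energy(u, L, g, sigma2) <= b]
```

Note the exponential growth in $u_n L$: sending one extra packet in the same slot can cost far more energy than spreading transmissions over time, which is what makes the scheduling problem non-trivial.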
25 Scheduling with energy harvesting (cont'd)
Reward: packet loss (buffer overflow, delay non-fulfillment)
  $\min \; \lim_{N\to\infty} \mathbb{E}\Big[\frac{1}{N} \sum_{n=0}^{N-1} \big(\varepsilon_o(s_n, u_n) + \varepsilon_d(s_n, u_n)\big)\Big]$
with $\varepsilon_o$ the number of packets dropped due to buffer overflow and $\varepsilon_d$ the number of packets dropped due to delay non-fulfillment.
[Figure: dropped-packet probability versus $\lambda_a$, naive versus optimal policy, for $\lambda_e = 1$ and $\lambda_e = 3$]
26 Issues not treated here
- Curse of dimensionality
- Competition between users (game theory [Lun15])
- Unknown Q: reinforcement learning, Q-learning
- Close links with Dynamic Programming (Viterbi algorithm for channel decoding or ISI management)
27 References
[Her89] O. Hernández-Lerma, Adaptive Markov Control Processes, Springer, 1989.
[Ber95] D. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.
[Taj13] R. Tajan, HARQ retransmission scheme in cognitive radio, PhD thesis, Univ. Cergy-Pontoise, France, 2013.
[Ksa15] N. Ksairi and P. Ciblat, "Modulation and Coding Schemes Selection for Type-II HARQ in Time-Correlated Fading Channels," IEEE Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2015.
[Lun15] J. Lunden, V. Koivunen, and H.V. Poor, "Spectrum Exploration and Exploitation for Cognitive Radio: Recent Advances," IEEE Signal Processing Magazine, March 2015.
[Faw17] I. Fawaz, P. Ciblat, and M. Sarkiss, "Energy minimization based Resource Scheduling for Strict Delay Constrained Wireless Communications," IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017.
More informationApplication-Level Scheduling with Deadline Constraints
Application-Level Scheduling with Deadline Constraints 1 Huasen Wu, Xiaojun Lin, Xin Liu, and Youguang Zhang School of Electronic and Information Engineering, Beihang University, Beijing 100191, China
More informationA Restless Bandit With No Observable States for Recommendation Systems and Communication Link Scheduling
2015 IEEE 54th Annual Conference on Decision and Control (CDC) December 15-18, 2015 Osaka, Japan A Restless Bandit With No Observable States for Recommendation Systems and Communication Link Scheduling
More informationTCP over Cognitive Radio Channels
1/43 TCP over Cognitive Radio Channels Sudheer Poojary Department of ECE, Indian Institute of Science, Bangalore IEEE-IISc I-YES seminar 19 May 2016 2/43 Acknowledgments The work presented here was done
More informationOptimal sequential decision making for complex problems agents. Damien Ernst University of Liège
Optimal sequential decision making for complex problems agents Damien Ernst University of Liège Email: dernst@uliege.be 1 About the class Regular lectures notes about various topics on the subject with
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More information1 Markov decision processes
2.997 Decision-Making in Large-Scale Systems February 4 MI, Spring 2004 Handout #1 Lecture Note 1 1 Markov decision processes In this class we will study discrete-time stochastic systems. We can describe
More informationEnergy Efficient Multiuser Scheduling: Statistical Guarantees on Bursty Packet Loss
Energy Efficient Multiuser Scheduling: Statistical Guarantees on Bursty Packet Loss M. Majid Butt, Eduard A. Jorswieck and Amr Mohamed Department of Computer Science and Engineering, Qatar University,
More informationTODAY S mobile Internet is facing a grand challenge to. Application-Level Scheduling with Probabilistic Deadline Constraints
1 Application-Level Scheduling with Probabilistic Deadline Constraints Huasen Wu, Member, IEEE, Xiaojun Lin, Senior Member, IEEE, Xin Liu, Member, IEEE, and Youguang Zhang Abstract Opportunistic scheduling
More informationApproximate Optimal-Value Functions. Satinder P. Singh Richard C. Yee. University of Massachusetts.
An Upper Bound on the oss from Approximate Optimal-Value Functions Satinder P. Singh Richard C. Yee Department of Computer Science University of Massachusetts Amherst, MA 01003 singh@cs.umass.edu, yee@cs.umass.edu
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationCognitive Spectrum Access Control Based on Intrinsic Primary ARQ Information
Cognitive Spectrum Access Control Based on Intrinsic Primary ARQ Information Fabio E. Lapiccirella, Zhi Ding and Xin Liu Electrical and Computer Engineering University of California, Davis, California
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationInterleave Division Multiple Access. Li Ping, Department of Electronic Engineering City University of Hong Kong
Interleave Division Multiple Access Li Ping, Department of Electronic Engineering City University of Hong Kong 1 Outline! Introduction! IDMA! Chip-by-chip multiuser detection! Analysis and optimization!
More informationAN ABSTRACT OF THE THESIS OF
AN ABSTRACT OF THE THESIS OF Thai Duong for the degree of Master of Science in Computer Science presented on May 20, 2015. Title: Data Collection in Sensor Networks via the Novel Fast Markov Decision Process
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More informationStability Analysis in a Cognitive Radio System with Cooperative Beamforming
Stability Analysis in a Cognitive Radio System with Cooperative Beamforming Mohammed Karmoose 1 Ahmed Sultan 1 Moustafa Youseff 2 1 Electrical Engineering Dept, Alexandria University 2 E-JUST Agenda 1
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationOn the Optimality of Myopic Sensing. in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels
On the Optimality of Myopic Sensing 1 in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels Kehao Wang, Lin Chen arxiv:1103.1784v1 [cs.it] 9 Mar 2011 Abstract Recent works ([1],
More informationCAP Plan, Activity, and Intent Recognition
CAP6938-02 Plan, Activity, and Intent Recognition Lecture 10: Sequential Decision-Making Under Uncertainty (part 1) MDPs and POMDPs Instructor: Dr. Gita Sukthankar Email: gitars@eecs.ucf.edu SP2-1 Reminder
More informationWireless Transmission with Energy Harvesting and Storage. Fatemeh Amirnavaei
Wireless Transmission with Energy Harvesting and Storage by Fatemeh Amirnavaei A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in The Faculty of Engineering
More informationA General Formula for Compound Channel Capacity
A General Formula for Compound Channel Capacity Sergey Loyka, Charalambos D. Charalambous University of Ottawa, University of Cyprus ETH Zurich (May 2015), ISIT-15 1/32 Outline 1 Introduction 2 Channel
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationAmr Rizk TU Darmstadt
Saving Resources on Wireless Uplinks: Models of Queue-aware Scheduling 1 Amr Rizk TU Darmstadt - joint work with Markus Fidler 6. April 2016 KOM TUD Amr Rizk 1 Cellular Uplink Scheduling freq. time 6.
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationOn Stochastic Feedback Control for Multi-antenna Beamforming: Formulation and Low-Complexity Algorithms
1 On Stochastic Feedback Control for Multi-antenna Beamforming: Formulation and Low-Complexity Algorithms Sun Sun, Student Member, IEEE, Min Dong, Senior Member, IEEE, and Ben Liang, Senior Member, IEEE
More information