On Stochastic Feedback Control for Multi-antenna Beamforming: Formulation and Low-Complexity Algorithms


Sun Sun, Min Dong, and Ben Liang

(Sun Sun and Ben Liang are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada, {ssun, liang}@comm.utoronto.ca. Min Dong is with the Department of Electrical, Computer and Software Engineering, University of Ontario Institute of Technology, Toronto, Canada, min.dong@uoit.ca.)

Abstract—We consider transmit beamforming in multiple-input single-output (MISO) systems. Based on a first-order Gauss-Markov channel model, we investigate stochastic feedback control for transmit beamforming to maximize the net system throughput under a given average feedback cost, and design practical implementation algorithms for feedback control using techniques from dynamic programming and reinforcement learning. We first validate the Markov Decision Process (MDP) formulation of the underlying feedback control problem when the system state consists of 4R real variables, with R being the number of transmit antennas. Due to the complexity of finding an optimal feedback policy under such a state space, we consider state-space reduction to a 2-variable (2-V) state. In contrast to a previous study that assumes the feedback control problem under such a 2-V state remains an MDP, our analysis indicates that the underlying problem is no longer an MDP. Nonetheless, the approximation as an MDP is shown to be justifiable and efficient. Based on the quantized 2-V state and the MDP approximation, we study practical implementation algorithms for feedback control. With unknown state transition probabilities, we consider model-based off-line and on-line learning approaches, which first estimate the transition probabilities and then obtain an optimal feedback control policy, as well as a model-free reinforcement learning approach, which directly learns an optimal feedback control policy. We investigate and compare these algorithms through extensive simulations under various feedback cost and power constraint settings, and provide an efficiency analysis of all proposed algorithms. From these results, an application rule for these algorithms is established, taking into account both stationary and nonstationary systems.

Index Terms—Beamforming, stochastic feedback control, implementation algorithms, reinforcement learning.

I. INTRODUCTION

Multi-antenna transmit beamforming is an effective physical-layer technique to improve transmission rate and reliability over a wireless link in a slow fading environment [1]. The actual beamforming gain achieved depends on the feedback quality, which controls the accuracy of the beamforming vector used at the transmitter. Thus, a main challenge in practical systems to maximally realize the beamforming potential is to design feedback strategies that efficiently provide the transmitter with knowledge of the channel or the beamforming vector. As feedback incurs overhead and thus rate loss to the system, the net gain in system throughput is the result of both the rate gain due to beamforming and the rate loss due to feedback. The temporal correlation of the channel adds a time dimension to the feedback strategy design, in order to make a proper trade-off between the beamforming gain and the feedback quality. Beamforming feedback design in the literature typically focuses on channel or beamforming vector quantization techniques, i.e., codebook design [2], [3].
The temporal dimension is addressed either by assuming a block fading channel model, with independent fade changes from block to block, or through simple periodic feedback, without actively exploiting the temporal correlation of channel fades. Recently, a stochastic feedback control approach has been studied for beamforming feedback design by formulating the problem as a Markov Decision Process (MDP) [4]. In this case, the feedback decision takes into account the channel temporal correlation and the feedback cost, with the goal of maximizing the system net throughput. Such an approach is attractive because the feedback decision is dynamic, which improves the feedback efficiency, and because the feedback cost is incorporated in the decision. In this paper, we focus on analyzing the stochastic feedback control approach and designing feedback control algorithms for transmit beamforming. Specifically, we investigate the problem from the perspectives of channel temporal correlation modeling, problem space reduction and its optimality, and low-complexity algorithms for online feedback control.

A. Contributions

We consider a multiple-input single-output (MISO) transmit beamforming system with R transmit antennas and a single receive antenna. The channel dynamics are modeled by a first-order Gauss-Markov process. By modeling the feedback control as an on-off decision process for the beamforming vector update, we aim at designing a feedback control policy to maximize the net system throughput under a given average feedback cost. Our contributions are summarized as follows.

We provide a detailed analysis of the validity of the MDP formulation of the underlying feedback control problem. Specifically, by constructing the system state from the vector channel and the available beamforming vector at the transmitter, which together comprise 4R real variables (the 4R-V state), we prove that the feedback control problem is an MDP problem. Based on the quantized 4R-V state, there exists an optimal stationary feedback control policy which is Markovian (it depends only on the current system state) and deterministic. However, it is well known that, for such an MDP problem, obtaining an optimal policy using dynamic programming (DP) [5] is hindered by the curse of dimensionality due to the exponential complexity with respect to the size of the state space. To make the problem tractable for a practical feedback control design, we investigate state-space reduction to a 2-V state of two variables, consisting of the channel power and the achieved beamforming gain.

In [4], such state-space reduction was claimed to incur no loss of optimality, in the sense that the feedback control problem under this 2-V state remains an MDP problem. However, through rigorous analysis, we show that such reduction is in fact suboptimal, i.e., the underlying feedback control problem is no longer an MDP problem. Nonetheless, the approximation as an MDP is shown to be justifiable and efficient.

Based on the quantized 2-V state space and the MDP approximation, we consider practical implementation algorithms for the feedback control. Since the system state transition probabilities are unknown, we consider model-based learning approaches, i.e., learning the transition probabilities, either off-line or on-line, combined with the policy iteration algorithm, as well as a model-free reinforcement learning approach that directly learns the feedback control policy without knowledge of the system state transition probabilities. To the best of our knowledge, this is the first paper that employs reinforcement learning techniques in feedback control for multi-antenna beamforming.

We give a detailed study and comparison of these algorithms through extensive simulations under various feedback cost and power constraint settings, and provide an efficiency analysis of these algorithms. Based on these results, an application rule for these algorithms is suggested, taking into account both stationary and nonstationary systems.

B. Related Works

There is a large body of literature on beamforming feedback design [2]. Most previous works focus on codebook design for the beamforming vector under limited feedback [2], [3], [6], [7], with emphasis on the quantization performance for a given channel. Approaches such as Grassmannian line packing [3], vector quantization [7], [8], and random vector quantization [6] have been proposed and studied. To further improve the feedback efficiency and reduce feedback overhead, opportunistic feedback is considered in multiuser transmission systems [9]-[11], where a threshold type of feedback is assumed and derived for various design metrics. However, these results assume a block fading channel with independent fades from block to block. Thus, channel temporal correlation is not actively exploited for feedback efficiency.

When exploiting the temporal correlation of the fading channel, the channel dynamics are commonly modeled either by a continuous Gauss-Markov model [12], [13] or by a finite-state Markov model [14]-[16]. Compression methods for the channel state or beamforming vector that exploit channel temporal correlation are investigated in [17], [18]. In these works, a periodic feedback strategy is assumed, and the effect of the feedback cost on the system throughput is not incorporated in the design. The cost of feedback is studied in the form of resource allocation (e.g., bandwidth, power) between feedback and data transmission in [19], [20].

A stochastic feedback control approach is first considered in [4], where the beamforming feedback problem is formulated as an MDP problem and the optimal control policy for deciding when to feed back is studied to maximize the system net throughput. By modeling the channel temporal correlation as a first-order Gauss-Markov process and viewing the feedback control problem as an MDP, the feedback control is shown to be of the threshold type. This work is the most relevant to our study in this paper.
The main differences between our work and [4] are the following: 1) We rigorously examine the validity of the MDP formulation of the beamforming feedback control problem under both the original full state space and the reduced state space, while in [4] the MDP formulation was directly assumed for the feedback control problem under the reduced state-space model; 2) We explore and compare various practical feedback control implementation algorithms, either off-line or on-line, and either model-based or model-free. In [4], the focus was on proving the threshold type of the optimal feedback control policy under the MDP formulation. Although the existence of a threshold was shown, no practical algorithm was provided to determine the threshold for feedback control.

Feedback control policy design for transmit beamforming under the MDP formulation is essentially a problem of sequential on-off decision-making. There is a rich literature on the control policy design of an MDP problem [5]. When the state transition probabilities are known, traditional DP methods such as policy iteration or value iteration can be employed to find the optimal policy [5]. In reality, however, the transition probabilities are typically unavailable. Reinforcement learning provides model-free algorithms in which the decision-maker learns an optimal policy directly, without learning the model first. Excellent tutorials on this subject can be found in [21]-[23].

C. Organization and Notation

The remainder of this paper is organized as follows. Section II provides the system model and problem formulation. In Section III, we consider the transformation of the underlying feedback control problem into an MDP problem based on a 4R-V state. In Section IV, we investigate the optimality of state-space reduction to a 2-V state. In Section V, based on the quantized 2-V state, four learning algorithms for feedback control are described. The performances of these algorithms are then studied and compared in Section VI, followed by the conclusion in Section VII.

Notation: The conjugate, transpose, and Hermitian of a matrix A are denoted by A*, A^T, and A^H, respectively; 0_{m,n} is an m×n matrix with all zero entries; I_n is an identity matrix of dimension n×n; for an n×1 vector g, ‖g‖ denotes its 2-norm, and g ∼ CN(0_{n,1}, σ²I_n) means that g is a circular complex Gaussian random vector with mean zero and covariance σ²I_n; E[·] stands for expectation; log(·) denotes the logarithm with base 2; ℜ(·) and ℑ(·) denote the real and imaginary parts of the argument, respectively; ℝ denotes the real space; 1(·) denotes the indicator function, which equals 1 (resp. 0) when the statement inside the brackets is true (resp. false); we write {h_t} ∼ WN(0, σ²) if {h_t} is a white noise process, i.e., all distinct elements of {h_t} are uncorrelated with mean zero and variance σ².

II. CHANNEL MODEL AND PROBLEM STATEMENT

A. Channel Model

Consider a MISO system where the transmitter is equipped with R (≥ 2) antennas and the receiver is equipped with a single antenna. We assume a discrete-time slow fading channel model. Denote by g_{i,t} the channel gain from the i-th transmit antenna to the receiver at time t, and define the R×1 channel gain vector g_t ≜ [g_{1,t}, ..., g_{R,t}]^T. Assume that all channel gains are spatially independent but temporally correlated, and that the dynamics of the channel gain vector g_t follow the first-order autoregressive Gauss-Markov model

g_{t+1} = ρ g_t + w_{t+1},  t = 0, 1, ...,  (1)

where g_0 ∼ CN(0_{R,1}, I_R), and {w_t} is a sequence of i.i.d. random vectors independent of g_0 with distribution w_t ∼ CN(0_{R,1}, (1−ρ²)I_R); the parameter ρ is the correlation coefficient defined as ρ ≜ J_0(2πf_D T_s), where J_0(·) is the zero-order Bessel function of the first kind, and f_D T_s is the normalized Doppler frequency, with f_D being the maximum Doppler frequency shift and T_s the transmitted symbol period. A higher value of ρ indicates a higher temporal correlation. (For example, when f_D T_s = 0.01, i.e., the channel coherence time is approximately 100 times the symbol period, we have ρ ≈ 0.999, which indicates that successive channel gains are highly correlated.) The result below indicates that {g_t} in (1) is Markovian. Furthermore, it can be proven that the stationary distribution of {g_t} is CN(0_{R,1}, I_R).

Fact 1 ([24]): Let ε_0, ε_1, ε_2, ... be independent random variables, X_0 ≜ ε_0, and X_{n+1} ≜ ρX_n + ε_{n+1}, n = 0, 1, 2, ..., where ρ is a real constant. Then {X_n} forms a Markov chain.

B. Problem Statement

Denote by ĝ_{u,t} the R×1 normalized complex vector, i.e., ‖ĝ_{u,t}‖ = 1, available as the beamforming vector at the transmitter at time t. Then the received signal at the receiver at time t is given by

y_t = x_t ĝ^H_{u,t} g_t + v_t,  (2)

where x_t is the transmitted signal at time t with the power constraint E[|x_t|²] ≤ P, and v_t is the receiver noise with v_t ∼ CN(0,1), independent of {g_t} and {x_t}. The received SNR and the data rate are given by P|ĝ^H_{u,t} g_t|² and log(1 + P|ĝ^H_{u,t} g_t|²), respectively.

In practical systems, the beamforming vector at the transmitter is obtained through limited feedback. The effective beamforming gain at the receiver is directly affected by the accuracy of the beamformer with respect to the current CSI. Let c be the fixed cost of each feedback. Clearly, there is a trade-off between the feedback cost and the data rate provided by the beamforming gain. We define the reward at time t, denoted by r_t, as the net throughput obtained through beamforming:

r_t = log(1 + P‖g_t‖²) − c, if feedback;
r_t = log(1 + P|ĝ^H_{u,t} g_t|²), if no feedback,  (3)

where in (3), upon feedback, the available beamforming vector at the transmitter is updated as ĝ_{u,t} = g_{u,t} ≜ g_t/‖g_t‖. The expected total discounted reward can be expressed as

V ≜ E[Σ_{t=0}^∞ λ^t r_t],  (4)

where λ ∈ (0,1) is the discount factor, and the expectation is taken over the randomness of the channels. By including the discount factor, future net throughput is valued less than current net throughput, which reflects the time value in practice. Our objective in this paper is to design a feedback control policy to maximize the expected total discounted reward. To simplify the analysis, we assume that the receiver knows the CSI perfectly, and that the feedback to the transmitter is delay-free and error-free.
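For concreteness, the following minimal sketch (Python/NumPy) simulates the Gauss-Markov channel in (1) and evaluates the per-slot net throughput in (3) under an arbitrary placeholder feedback rule. All parameter values are chosen only for illustration and are not part of the proposed design; ρ is hard-coded to roughly J_0(2π·0.01).

```python
import numpy as np

rng = np.random.default_rng(0)
R, P, c = 4, 10.0, 4.0                       # antennas, transmit power (linear), feedback cost
rho = 0.999                                  # approx. J_0(2*pi*f_D*T_s) for f_D*T_s = 0.01

def crandn(n):
    """Circular complex Gaussian CN(0, I_n) samples."""
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

g = crandn(R)                                # g_0 ~ CN(0, I_R)
g_hat_u = g / np.linalg.norm(g)              # initial beamforming vector at the transmitter

for t in range(5):
    feedback = (t % 2 == 0)                  # placeholder policy: feed back every other slot
    if feedback:
        g_hat_u = g / np.linalg.norm(g)
        r_t = np.log2(1 + P * np.linalg.norm(g) ** 2) - c       # Eq. (3), feedback branch
    else:
        r_t = np.log2(1 + P * np.abs(np.vdot(g_hat_u, g)) ** 2)  # Eq. (3), no-feedback branch
    print(f"t={t}, feedback={feedback}, r_t={r_t:.3f}")
    # Channel evolution, Eq. (1): g_{t+1} = rho*g_t + w_{t+1}, w ~ CN(0, (1-rho^2) I_R)
    g = rho * g + np.sqrt(1 - rho ** 2) * crandn(R)
```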
III. OPTIMAL FEEDBACK CONTROL: AN MDP FORMULATION

To maximize the expected total discounted reward, the optimal feedback policy may in general be a randomized policy that depends on all previous system states and feedback actions. In this section, we show that, by introducing a 4R-V state, the underlying feedback control problem can be formulated as an MDP problem. With the quantized 4R-V state, the formulated MDP problem admits an optimal stationary policy which is Markovian (it depends only on the current system state) and deterministic. Unfortunately, we also point out that finding an optimal policy based on the quantized 4R-V state is difficult.

A. An MDP Formulation

We first review the definition of an MDP problem.

Definition 1 ([5]): Define a collection of objects {T, S, A, P(·|s_t, a_t), r_t(s_t, a_t)} as an MDP, where T, S, and A are the spaces of the decision epoch t, the state s_t, and the action a_t, respectively, P(·|s_t, a_t) is the transition probability of the next state given the state and action at the current time t, and r_t(s_t, a_t) is the corresponding reward function.

Define z_t ≜ (s_0, a_0, s_1, a_1, ..., s_t) to be the history of the decision process up to time t. The key property of an MDP is that the transition probabilities and the reward function depend on the history only through the current state.

We now show that the underlying feedback control problem can be formulated as an MDP problem. Let the space of decision epochs be T ≜ {0, 1, ...}. At time t ∈ T, the receiver decides whether to feed back g_{u,t} to the transmitter. Denote the space of actions by A ≜ {0, 1}, where 0 and 1 represent no feedback and feedback, respectively. Note that the final available beamforming vector at the transmitter at time t depends on the receiver's decision, and is therefore undetermined when the receiver makes the decision at time t. We therefore define the state at time t as s_t ≜ (g_t, ĝ_{t−1}), where ĝ_{t−1} is the unnormalized version (i.e., including the associated channel power) of the beamforming vector available at the transmitter at time t−1.

In particular, if the most recent feedback update happened q_t ∈ {1, ..., t+1} time slots before time t, then ĝ_{t−1} = g_{t−q_t}, with g_{−1} being the initial unnormalized beamforming vector used at the transmitter at time slot 0. Upon the receiver's decision, the available beamforming vector at the transmitter, the reward, and the state at the next time slot are as follows:

a_t = 1:  ĝ_{u,t} = g_{u,t} ≜ g_t/‖g_t‖,  r_t = log(1 + P‖g_t‖²) − c,  s_{t+1} = (g_{t+1}, g_t);
a_t = 0:  ĝ_{u,t} = ĝ_{u,t−1} = g_{t−q_t}/‖g_{t−q_t}‖,  r_t = log(1 + P|ĝ^H_{u,t} g_t|²),  s_{t+1} = (g_{t+1}, g_{t−q_t}).  (5)

Since g_t and ĝ_{t−1} are complex, rewriting the state as s_t ≜ (ℜ(g_t), ℑ(g_t), ℜ(ĝ_{t−1}), ℑ(ĝ_{t−1})), we have in total 4R real scalar variables in s_t, referred to as the 4R-V state. Thus, the state space S is the Cartesian product of 4R copies of the real space ℝ.

It is easy to see that the reward function depends on z_t only through s_t. We now show that this is the case for the transition probabilities as well.

Lemma 1: Based on the 4R-V state s_t ≜ (g_t, ĝ_{t−1}), we have

P(s_{t+1} | z_t, a_t) = P(s_{t+1} | s_t, a_t).  (6)

Proof: First assume a_t = 1. At time t+1, the receiver will observe s_{t+1} = (g_{t+1}, g_t). The left hand side (LHS) of (6) is given by

P(s_{t+1} | z_t, a_t) = P(g_{t+1}, ĝ_t | g_0, ĝ_{−1}, a_0, g_1, ĝ_0, a_1, ..., g_t, ĝ_{t−1}, a_t)  (7)
= 1(ĝ_t = g_t) P(g_{t+1} | g_0, g_{−1}, g_1, g_{1−q_1}, ..., g_t, g_{t−q_t})  (8)
= 1(ĝ_t = g_t) P(g_{t+1} | g_t),  (9)

where (7) follows from the definitions of s_t and z_t, (8) follows because, once all actions {a_τ}_{τ=0}^t are given, ĝ_{τ−1}, τ ∈ {1, ..., t+1}, can be specified as g_{τ−q_τ} with q_τ ∈ {1, ..., τ+1}, and (9) follows from the Markovian property of {g_t}. The right hand side (RHS) of (6) is given by

P(s_{t+1} | s_t, a_t) = P(g_{t+1}, ĝ_t | g_t, ĝ_{t−1}, a_t) = 1(ĝ_t = g_t) P(g_{t+1} | g_t, ĝ_{t−1}) = 1(ĝ_t = g_t) P(g_{t+1} | g_t).

Therefore, P(s_{t+1} | z_t, a_t = 1) = P(s_{t+1} | s_t, a_t = 1). When a_t = 0, at time t+1 the receiver will observe s_{t+1} = (g_{t+1}, g_{t−q_t}). The LHS of (6) is given by

P(s_{t+1} | z_t, a_t) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_0, g_{−1}, g_1, g_{1−q_1}, ..., g_t, g_{t−q_t}) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t),

and the RHS of (6) is given by

P(s_{t+1} | s_t, a_t) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t, g_{t−q_t}) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t).

This completes the proof.

Based on the MDP objects defined above and Lemma 1, we have now transformed the beamforming feedback control problem of Section II into an MDP problem. As the state space S is continuous and thus difficult to work with, we take the common approach of quantizing it into a finite state space. (For the slow fading channel model, we can approximately treat the quantized version of the problem as an MDP problem.) As a result, due to the finite state and action spaces and the stationarity of the channel model, the underlying feedback control problem admits an optimal stationary policy, which is both Markovian and deterministic [5]. In other words, we are to find an optimal stationary policy π* ≜ (d*)^∞, where d*: S → A denotes the optimal decision rule and the superscript indicates that the decision rule d* is stationary.
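As a quick illustration of the transition in (5), the sketch below (Python/NumPy; the helper name and argument layout are assumptions for illustration) computes the reward and the next 4R-V state for one decision epoch by mirroring the two branches above.

```python
import numpy as np

def step_4RV(g_t, g_unnorm_prev, a_t, g_next, P, c):
    """One decision epoch under the 4R-V state s_t = (g_t, ghat_{t-1}); see Eq. (5).

    g_unnorm_prev is the unnormalized beamformer ghat_{t-1} held at the transmitter;
    the beamformer actually applied is its normalized version.
    """
    if a_t == 1:                                   # feedback: beamformer updated to g_t/||g_t||
        r_t = np.log2(1 + P * np.linalg.norm(g_t) ** 2) - c
        g_unnorm_new = g_t
    else:                                          # no feedback: keep the previous beamformer
        bf = g_unnorm_prev / np.linalg.norm(g_unnorm_prev)
        r_t = np.log2(1 + P * np.abs(np.vdot(bf, g_t)) ** 2)
        g_unnorm_new = g_unnorm_prev
    s_next = (g_next, g_unnorm_new)                # s_{t+1} = (g_{t+1}, ghat_t)
    return s_next, r_t
```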
B. Challenges in Determining the Optimal Policy

Although there exists an optimal policy π* for the quantized 4R-V state space, finding π* is challenging, as it is difficult to find an efficient way to perform the quantization. One possible method is to quantize the 4R variables in s_t individually. For a quantization level L on each variable in s_t, the total number of states is L^{4R}, which is exponential in R. This number is large even for moderate values of L and R (e.g., when L = 4 and R = 4, L^{4R} = 4^{16} ≈ 4.3 × 10^9). Thus, we quickly face the curse of dimensionality as L and R increase. Another approach is to directly quantize the state space S by a vector quantization technique [25], such as a random codebook. The random codebook C contains |C| i.i.d. vectors, each of length 4R×1 and with each element having the same distribution as the corresponding element of s_t. However, to achieve good performance, the size of the random codebook should be large. From the above discussion, we see that, since the number of variables in the state s_t increases with R, it is difficult to find an efficient quantization method that scales well with R. Hence, it is impractical to perform beamforming feedback control directly based on the 4R-V state, especially for large R, and state-space reduction is necessary.

IV. STATE-SPACE REDUCTION: ANALYSIS OF THE 2-V STATE

Our focus now is on finding a suitable reduced state for beamforming feedback control while maintaining the validity of the MDP formulation. An attempt at state-space reduction is given in [4], where a 2-V state s_t ≜ (‖g_t‖², m_t), with m_t ≜ |ĝ^H_{u,t−1} g_{u,t}|² ∈ [0,1], is proposed as an optimal state reduction. (The optimality is claimed in the sense that the reduced 2-V state does not compromise the controller's optimality [4].) The state in this case consists of the current channel power and the beamforming gain at time t if no feedback is eventually performed.

Based on such a state, upon the receiver's decision, the reward and the state at the next time slot are now as follows:

a_t = 1:  r_t = log(1 + P‖g_t‖²) − c,  s_{t+1} = (‖g_{t+1}‖², |g^H_{u,t} g_{u,t+1}|²);
a_t = 0:  r_t = log(1 + P‖g_t‖² m_t),  s_{t+1} = (‖g_{t+1}‖², |g^H_{u,t−q_t} g_{u,t+1}|²),  (10)

where q_t ∈ {1, ..., t+1}. Compared to the 4R-V state, the total number of variables in s_t is reduced to 2, which is independent of R and is highly desirable. However, for the reduced-state system to be a valid MDP, as pointed out in Section III-A, the reward function and the transition probabilities should depend on the history only through the current state. For the proposed state reduction, it is claimed in [4] (without proof) that the processes {‖g_t‖²} and {m_t} both form Markov chains and that they are independent. Verifying these statements, however, is non-trivial. In this section, we take a detailed look at the 2-V state. Specifically, we show that the MDP formulation no longer holds under such a 2-V state, and thus it is a suboptimal reduction. Nonetheless, we also show that the proposed 2-V state provides a good approximation and is reasonably efficient.

A. Analysis of the Process {‖g_t‖²}

Based on the channel gain dynamics in (1), we now show that {‖g_t‖²} asymptotically forms a Markov chain. The following lemma gives the property of the normalized version of {‖g_t‖²}.

Lemma 2: Define d_t ≜ (1/√R)(‖g_t‖² − R). Then {d_t} is a first-order autoregressive process,

d_{t+1} = ρ² d_t + w̃_{d,t+1},  t = 0, 1, ...,

where w̃_{d,t} ≜ (1/√R)[2ρ ℜ(g^H_{t−1} w_t) + ‖w_t‖² − R(1−ρ²)], and {w̃_{d,t}} ∼ WN(0, 1−ρ⁴). Further, d_0 is uncorrelated with {w̃_{d,t}}, i.e., E(d_0 w̃_{d,t}) = 0 for all t ≥ 1.

Proof: See Appendix A.

The following fact indicates that the Markovian property is preserved under a one-to-one mapping.

Fact 2 ([24]): If {x_t} is Markovian and y_t ≜ g(x_t), where g(·) is a one-to-one mapping, then {y_t} is also Markovian.

Combining Facts 1 and 2 and Lemma 2, we have the following result.

Proposition 1: {‖g_t‖²} asymptotically forms a Markov chain as R → ∞.

Proof: By Fact 2, showing that {‖g_t‖²} is Markovian is equivalent to showing that {d_t} is Markovian, due to the one-to-one correspondence between ‖g_t‖² and d_t. By Fact 1 and Lemma 2, to show that {d_t} is Markovian, it suffices to show that {w̃_{d,t}} is an i.i.d. sequence and that d_0 is independent of {w̃_{d,t}}. To this end, by Lemma 2, it suffices to show that (d_0, {w̃_{d,t}}) is jointly Gaussian: if this were true, then {w̃_{d,t}} would be an i.i.d. Gaussian sequence, since {w̃_{d,t}} ∼ WN(0, 1−ρ⁴), and d_0 would be independent of {w̃_{d,t}}, since E(d_0 w̃_{d,t}) = 0 for all t ≥ 1.

We now show that (d_0, {w̃_{d,t}}) is asymptotically jointly Gaussian as R → ∞, by showing that any linear combination of d_0, w̃_{d,t_1}, w̃_{d,t_2}, ..., w̃_{d,t_n}, where 0 < t_1 < t_2 < ... < t_n, is Gaussian for all n > 0. Rewrite

w̃_{d,t} = (1/√R) Σ_{i=1}^R [2ρ ℜ(g*_{i,t−1} w_{i,t}) + |w_{i,t}|² − (1−ρ²)] ≜ (1/√R) Σ_{i=1}^R y_{i,t},

where we have defined y_{i,t} ≜ 2ρ ℜ(g*_{i,t−1} w_{i,t}) + |w_{i,t}|² − (1−ρ²), and g_{i,t} and w_{i,t} are the i-th elements of g_t and w_t, respectively. Also rewrite d_0 = (1/√R) Σ_{i=1}^R (|g_{i,0}|² − 1). Let c_0, c_1, ..., c_n be the coefficients of the linear combination. Then we have

c_0 d_0 + Σ_{l=1}^n c_l w̃_{d,t_l} = (1/√R) Σ_{i=1}^R [c_0(|g_{i,0}|² − 1) + Σ_{l=1}^n c_l y_{i,t_l}] ≜ (1/√R) Σ_{i=1}^R ỹ_i,

where ỹ_i ≜ c_0(|g_{i,0}|² − 1) + Σ_{l=1}^n c_l y_{i,t_l}. By straightforward computation and the fact that all channel gains are spatially independent, it can be shown that the ỹ_i's are i.i.d. with zero mean and variance c_0² + (1−ρ⁴) Σ_{l=1}^n c_l². Applying the Central Limit Theorem, as R → ∞, the distribution of (1/√R) Σ_{i=1}^R ỹ_i converges to a Gaussian distribution with the same mean and variance as ỹ_i. Hence, (d_0, {w̃_{d,t}}) is asymptotically jointly Gaussian.

Therefore, {‖g_t‖²} asymptotically forms a Markov chain as R goes to infinity, in the sense that

lim_{R→∞} [P(‖g_{t+1}‖² | ‖g_0‖², ..., ‖g_t‖²) − P(‖g_{t+1}‖² | ‖g_t‖²)] = 0.

When R is large, we can approximately treat (1/√R) Σ_{i=1}^R ỹ_i as Gaussian, and hence (d_0, {w̃_{d,t}}) as jointly Gaussian. In other words, when R is large, {‖g_t‖²} can be approximately treated as Markovian, and P(‖g_{t+1}‖² | ‖g_0‖², ..., ‖g_t‖²) can be approximated by P(‖g_{t+1}‖² | ‖g_t‖²).
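As an illustrative (not rigorous) check of Lemma 2, the short NumPy sketch below simulates (1), forms d_t = (‖g_t‖² − R)/√R, and compares the sample mean and variance of the innovation d_{t+1} − ρ²d_t against the stated values 0 and 1 − ρ⁴. The values of R, ρ, and the horizon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
R, rho, T = 8, 0.9, 200_000

def crandn(n):
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

g = crandn(R)                                            # start from the stationary CN(0, I_R)
d = np.empty(T)
for t in range(T):
    d[t] = (np.linalg.norm(g) ** 2 - R) / np.sqrt(R)     # d_t = (||g_t||^2 - R)/sqrt(R)
    g = rho * g + np.sqrt(1 - rho ** 2) * crandn(R)      # Eq. (1)

w = d[1:] - rho ** 2 * d[:-1]                            # innovation w_{d,t+1}
print("sample mean of w:", w.mean())                     # should be close to 0
print("sample var  of w:", w.var(), "vs 1 - rho^4 =", 1 - rho ** 4)
```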
B. Validity of the 2-V State Based MDP Formulation

Consider the reduced 2-V state s_t ≜ (‖g_t‖², m_t). The state space is S ≜ ℝ₊ × U, where ℝ₊ denotes the space of non-negative real numbers and U ≜ [0,1]. For the feedback control problem under the new state to be an MDP problem, we need to verify that the reward function and the transition probabilities depend on the history only through the current state. From (10), the reward function is a function of the current state only. To verify (6) under the 2-V state, we make the following assumptions:

A1) The processes {‖g_t‖²} and {m_t} are independent.
A2) R is large.

We will further discuss assumption A1 later. Let z_t be the history of the decision process under the 2-V state, defined similarly to before.

6 6 as before. Using (10, when a t 1, we have m t+1 g H u,t g u,t+1 2, and the LHS of (6 equals P(s t+1 z t,a t P ( g t+1 2 g0,m t+1 2,m 0,a 0,, g t 2,m t,a t m0 P (m t+1,a 0,,m t,a t P ( g t+1 2 gt 2 (11 P( g H u,t g u,t+1 2 P ( g t+1 2 gt 2 (12 where (11 is due to the assumption A1 and { g t 2 } is approximately Markovian for large by Proposition 1, and (12 follows because, when a t 1, the available beamforming gain at time t equals ĝ H u,tg u,t 2 g H u,tg u,t 2 1, so the distribution of the successive beamforming gain m t+1 is independent of the previous values {m τ } t τ0. For the HS of (6, we have P(s t+1 s t,a t P ( g t+1 2,m t+1 gt 2,m t,a t P( g H u,tg u,t+1 2 P ( g t+1 2 g t 2 which is equal to (12. Using (10, when a t 0, we havem t+1 g H u,t q t g u,t+1 2 with q t {1,,t+1}. The LHS of (6 equals P(s t+1 z t,a t P ( g H u,t q t g u,t+1 2 g H u, 1 g u,0 2, g H u,1 q 1 g u,1 2,, gu,t q H t g u,t 2 P ( g t+1 2 gt 2 (13 ( P gu,t q H t g u,t+1 2 g H u,t qt g u,t+1 qt 2, gu,t q H t g u,t+2 qt 2,, g H u,t q t g u,t 2 P ( g t+1 2 g t 2 (14 where (14 follows since as argued for (12, when a τ 1, m τ+1 is independent of {m t } τ t 0. Hence, for the first conditional probability in (13, it suffices to only consider m τ s since the most recent feedback. We consider two such cases: i For q t t+1 (i.e., no feedback since the beginning: we have the first probability in (14 given as ( P gu, 1g H u,t+1 2 g H u, 1 g u,0 2, gu, 1g H u,1 2,, g H u, 1 g u,t 2. (15 ii For q t {1,2,,t}: due to the stationarity of the channel gain process, the first probability in (14 is equal to ( P gu,0 H g u,q t+1 2 g H u,0 g u,1 2, gu,0 H g u,2 2,, g H u,0 g u,q t 2. (16 The HS of (6 is given by P(s t+1 s t,a t P ( g H g u,t qt u,t+1 2 g H u,t qt g u,t 2 P ( g t+1 2 gt 2 (17 where the value of q t is unknown due to a t 1 being unknown. Hence, for (14 and (17 to be equal, we need to show that the first probability in (17 is equal to (15 or (16. In other words, we need to show that {m t } forms a Markov chain when the beamforming vector ĝ u,t is fixed to be g u, 1 or g u,0. Consider ĝ u,t g u, 1, t. By the channel dynamics in (1, the relationship between m t+1 g H u, 1 g u,t+1 2 and m t g H u, 1g u,t 2 is given by g H u, 1g u,t+1 2 ρ 2 g t 2 g t+1 2 gh u, 1g u,t 2 + w m,t+1, 2ρ g t where w m,t+1 g t+1 (gu,tg H 2 u, 1 gu, 1w H t g t+1 gu, 1w H 2 t+1 2. Based on [26, Lemma 2] and [26, Lemma 4], it can be shown that the marginal distribution of m t, t 0,1,, is given by P(m t x 1 (1 x 1 (18 which is non-gaussian. Thus, it is difficult to prove the independence of m 0 and { w m,t }, as in Lemma 2, due to their non-gaussian distributions. As a result, in contrast to the proof in Proposition 1, {m t } cannot be shown as Markovian (or approximately Markovian for large. Similar difficulty occurs when ĝ u,t g u,0, t. To rigorously establish the Markovian property of {m t } is very challenging. Nonetheless, note that when ĝ u,t g u, 1 or g u,0, t, the evolution of {m t } only depends on the evolution of {g u,t }. Since {g t } is Markovian, heuristically, it is reasonable to approximate {m t } as Markovian. Based on the above discussion, under the assumptions A1 and A2, the underlying feedback control process is an approximated MDP. We now discuss the independence assumption A1. According to the definition, the processes { g t 2 } and {m t } are said to be independent if the random vectors g [ g t1 2,, g tk 2 ] and m [m t 1,,m t j ] are independent for all k,j and all distinct choices of t 1,,t k and t 1,,t j. 
Based on the above discussion, under assumptions A1 and A2, the underlying feedback control process is approximately an MDP. We now discuss the independence assumption A1. By definition, the processes {‖g_t‖²} and {m_t} are independent if the random vectors g ≜ [‖g_{t_1}‖², ..., ‖g_{t_k}‖²] and m ≜ [m_{t'_1}, ..., m_{t'_j}] are independent for all k, j and all distinct choices of t_1, ..., t_k and t'_1, ..., t'_j. In other words, we need to show that

P_{g,m}(‖g_{t_1}‖², ..., ‖g_{t_k}‖², m_{t'_1}, ..., m_{t'_j}) = P_g(‖g_{t_1}‖², ..., ‖g_{t_k}‖²) P_m(m_{t'_1}, ..., m_{t'_j}).  (19)

Note that the distribution of m is policy-dependent. For example, when feedback is performed at every time slot, all elements of m are close to 1 with high probability. In contrast, when the receiver never performs feedback, the marginal distribution of m_t is as shown in (18); in particular, when R = 2, m_t is uniformly distributed for all t. Therefore, verification of (19) requires consideration of all feasible policies, which is intractable. Instead, we provide a heuristic explanation of why {‖g_t‖²} and {m_t} can be treated as independent. The two processes capture different effects of the beamformed system, and each provides little information about the other. Precisely, {‖g_t‖²} describes the channel power process, where ‖g_t‖² depends only on the current channel; {m_t} describes the beamforming gain process and is policy-dependent. For the same channel realization, under different feedback policies we can have various realizations of {m_t}, each revealing little information about the channel power. Conversely, from a realization of {‖g_t‖²}, we cannot obtain information about the beamforming gain process {m_t} either. In the next section, where the feedback control implementation algorithms are discussed, we will revisit this independence assumption and discuss how the algorithms are designed with or without A1.

To summarize, the 2-V state is not an optimal reduced state, as it does not rigorously yield an MDP. Nonetheless, this form of reduced state is still very attractive, because the reduction is highly efficient, with only two state variables, and its approximation as an MDP is reasonable, as explained above.

C. Quantized 2-V State

Even with the 2-V state, a finite state space is desired for designing feedback control algorithms for an MDP problem. To obtain a quantized 2-V state, we consider the quantized state space used in [4]. (It should be noted that, to obtain better performance, a joint quantization scheme would need to be considered; this is beyond the scope of this paper.) For ‖g_t‖², we quantize its value into M levels as T_g ≜ {[0, g_1), [g_1, g_2), ..., [g_{M−1}, +∞)}, with each interval having probability 1/M. The representative points are denoted by {ḡ_1, ..., ḡ_M}, with ḡ_k ≜ E[‖g_t‖² | ‖g_t‖² ∈ [g_{k−1}, g_k)]. For m_t, we quantize its value into N levels of equal length 1/N, i.e., T_m ≜ {[0, 1/N), [1/N, 2/N), ..., [(N−1)/N, 1]}. The range of m_t is quantized with equal length instead of equal probability because the distribution of m_t is policy-dependent and cannot be determined in advance. The representative points, denoted by {m̄_1, ..., m̄_N}, are chosen as the midpoints of the associated intervals. By treating the underlying feedback control problem as a finite-state MDP problem, there exists an optimal stationary policy which is Markovian and deterministic.
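The sketch below (NumPy; M, N, and R are arbitrary illustration values, and the helper names are not from the paper) builds such a quantizer: equal-probability thresholds and conditional-mean representatives for ‖g_t‖², obtained by Monte Carlo from its stationary distribution (a sum of R unit-mean exponentials), and equal-length bins with midpoint representatives for m_t.

```python
import numpy as np

rng = np.random.default_rng(3)
R, M, N = 4, 8, 8
samples = rng.standard_exponential((200_000, R)).sum(axis=1)     # ||g||^2 ~ sum of R Exp(1)

# Equal-probability bins for ||g_t||^2: thresholds at quantiles k/M, k = 1..M-1.
g_thresh = np.quantile(samples, np.arange(1, M) / M)
bin_idx = np.digitize(samples, g_thresh)
g_rep = np.array([samples[bin_idx == k].mean() for k in range(M)])  # conditional means

# Equal-length bins for m_t with midpoint representatives.
m_thresh = np.arange(1, N) / N
m_rep = (np.arange(N) + 0.5) / N

def quantize(g_pow, m):
    """Map a 2-V state (||g||^2, m) to a discrete index in {0, ..., M*N - 1}."""
    i = int(np.digitize(g_pow, g_thresh))
    j = int(np.digitize(m, m_thresh))
    return i * N + j
```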
V. LEARNING ALGORITHMS UNDER THE QUANTIZED 2-V STATE

In this section, based on the quantized 2-V state, we consider implementation algorithms for the feedback control by treating the underlying problem as an MDP problem. Four learning algorithms are discussed, including model-based off-line and on-line algorithms and a model-free on-line algorithm.

A. Model-Based Off-Line PI Algorithms

For an MDP problem, the optimal policy can be obtained through Policy Iteration (PI) [5] if the transition probabilities P(s'|s,a) are known at the receiver. In practice, these transition probabilities can be obtained a priori through off-line learning of the system.

1) Learning of Transition Probabilities: To perform PI, we need to obtain the transition probabilities under feedback and no feedback, i.e., P(s'|s,a) for a = 0, 1, where the time indices are suppressed for ease of notation. Depending on whether the processes {‖g_t‖²} and {m_t} are treated jointly or independently in learning the transition probabilities, we consider joint and independent off-line PI algorithms, respectively. Using the sample transitions of ‖g_t‖², m_t, and (‖g_t‖², m_t) under a_t = 0, P(s'|s,a=0) can be learned via maximum likelihood estimation. For the independent off-line PI algorithm, P(s'|s,a=0) is obtained as the product of the average sample transition probabilities of ‖g_t‖² and m_t; for the joint off-line PI algorithm, P(s'|s,a=0) is obtained as the average sample transition probabilities of (‖g_t‖², m_t). The transition probabilities P(s'|s,a=1) can be derived similarly using sampled ‖g_t‖² and m_t, or (‖g_t‖², m_t), under a_t = 1. Note that, for a slow fading channel, we can assume that m_{t+1} jumps to the highest state N (i.e., the highest normalized beamforming gain) upon a_t = 1. With this assumption, the procedure for generating the transition probabilities for both the joint and independent off-line PI algorithms is summarized in Algorithm 1 in Appendix B.

Since the learning phase incurs a data rate loss, it is desirable to make it as short as possible. For the independent off-line PI algorithm, we need to simultaneously learn two average transition probability matrices, P_g and P_m, whose sizes are M×M and N×N, respectively; on the other hand, for the joint off-line PI algorithm, we need to learn the average transition probability matrix P_{gm}, whose size is MN×MN, which is much larger than that of P_g or P_m. Hence, a longer learning time is needed for the joint off-line PI algorithm to reach a learning accuracy similar to that of the independent one. In other words, the independent off-line PI algorithm is more time efficient.

2) PI for the Optimal Control: Once the transition probabilities are available, we can form the expected total discounted reward (value function) under state s and policy π as

V^π(s) = r(s,a) + λ Σ_{s'} P(s'|s,a) V^π(s'),  (20)

where the action a in state s follows the policy π, and the time indices are suppressed for ease of notation. Denote by π* an optimal policy. Then the optimal value function satisfies the Bellman equation [5]

V^{π*}(s) = max_{a∈A} [ r(s,a) + λ Σ_{s'} P(s'|s,a) V^{π*}(s') ].  (21)

Using PI to obtain the solution to (21) involves alternating between policy evaluation and policy improvement iteratively until the algorithm converges. The detailed PI algorithm for finite state and action spaces can be found in [5]. For finite state and action spaces, PI is guaranteed to converge within a finite number of policy updates. In our simulations, with quantization levels M = N = 4, typically 2 or 3 iterations are needed, while with M = N = 32, typically 7 or 8 iterations are needed. The computational complexity of PI lies in the policy evaluation step, which is of the order O((MN)³). The required storage is mainly determined by the size of the transition probability matrix, which is of the order O((MN)²).
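As a reference point, the following is a generic policy iteration sketch for a finite MDP of this kind (Python/NumPy). Here P and r stand for the learned transition probabilities and the quantized reward table; the function and variable names are illustrative and not the paper's Algorithm 1. Policy evaluation solves the linear system exactly, which is the O((MN)³) step mentioned above.

```python
import numpy as np

def policy_iteration(P, r, lam=0.98, max_iter=100):
    """Policy iteration for a finite MDP.

    P: array (A, S, S) with P[a, s, s'] = P(s'|s, a)
    r: array (S, A) with r[s, a] the per-slot reward
    Returns a deterministic policy (length-S action array) and its value function.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - lam * P_pi) V = r_pi, cf. Eq. (20).
        P_pi = P[policy, np.arange(S), :]
        r_pi = r[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
        # Policy improvement: greedy with respect to the Q-factors, cf. Eq. (21).
        Q = r + lam * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```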

B. Model-Free On-Line EQ Algorithm

In model-free learning, the receiver assumes no prior knowledge of P(s'|s,a). Instead, it learns the optimal policy π* directly, without first estimating P(s'|s,a). In this section, we suggest a model-free algorithm, the Exploratory Q-learning (EQ) algorithm, for the feedback control.

1) Background on Q-Learning: Q-learning is a model-free algorithm in which decisions are made through learning of the optimal Q-factors [27]. Denote by Q^π(s,a) the Q-factor of the state-action pair (s,a) under a given policy π. It is defined as

Q^π(s,a) ≜ r(s,a) + λ Σ_{s'} P(s'|s,a) V^π(s'),  (22)

where V^π(s') is the value function of state s' under policy π. From (22), Q^π(s,a) represents the expected total discounted reward of taking action a in state s and following the fixed policy π thereafter. If π* is an optimal policy, we have V^{π*}(s) = max_{a∈A} Q^{π*}(s,a), which means that π* can be derived from the optimal Q-factors Q^{π*}(s,a).

To learn the optimal Q-factors, at each decision epoch t, the decision maker first observes the current state s, then selects an action according to some strategy and observes the next state s'. The update of the Q-factor at decision epoch t, denoted by Q_t(s,a), is given by

Q_t(s,a) = (1 − α_t(s,a)) Q_{t−1}(s,a) + α_t(s,a)[ r_t(s,a) + λ V_{t−1}(s') ],  if s = s_t and a = a_t;
Q_t(s,a) = Q_{t−1}(s,a),  otherwise,  (23)

where V_{t−1}(s') ≜ max_{a'∈A} Q_{t−1}(s',a'), and α_t(s,a) is the learning rate of the state-action pair (s,a) at time t. Note that, by (23), only the Q-factor associated with the visited state-action pair is updated. It is proven in [27] that, if all state-action pairs are visited infinitely often and the learning rate is appropriately designed, then all Q-factors eventually converge to the optimal ones.

2) EQ Algorithm: The EQ algorithm was first introduced in [28], combining Q-learning with a counter-based directed exploration strategy. Below, we briefly describe how the EQ algorithm chooses the action at each decision epoch; the interested reader can refer to [28] for details. Denote by n_t(s,a) the number of times that the state-action pair (s,a) has been visited up to time t, and by n_t(s) the number of times that state s has been visited up to time t. Let c_t(s,a) reflect the number of times that action a has not been performed in state s since the initial time, and initialize c_0(s,a) = 0 for all s, a. In the EQ algorithm, given the current state s_t, a greedy action â_t is first determined, defined as â_t ≜ argmax_{a∈A} Q_t(s_t,a), i.e., the action with the maximum Q-factor at epoch t; the final action a_t is then determined by the following maximization:

max_{a∈A} g_t(s_t,a) ≜ { c_t(s_t,a)/n_t(s_t,a) + 1, if a = â_t;  c_t(s_t,a)/n_t(s_t,a), otherwise },  (24)

where the addition of 1 reflects the decision maker's preference for the greedy action; it is also possible for the decision maker to take a non-greedy action a ≠ â_t if the associated ratio c_t(s_t,a)/n_t(s_t,a) is large, i.e., if a has not been performed in s_t for a long time, or (s_t,a) has not been visited for a long time. The counter c_t(s_t,a) is updated as

c_{t+1}(s_t,a) = { c_t(s_t,a) + Θ(n_{t+1}(s_t)), if a ≠ a_t;  c_t(s_t,a), otherwise },  (25)

where Θ(n) is a positive function satisfying the conditions

lim_{n→∞} Θ(n) = 0,  and  Σ_{n=1}^∞ Θ(n) = ∞.  (26)
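A minimal sketch combining the tabular Q-learning update (23) with a counter-based directed exploration rule in the spirit of (24)-(26) is given below (Python/NumPy). The env_step interface, the parameter defaults, and the exact form of the counter increment are assumptions for illustration; the update in [28] may differ in detail.

```python
import numpy as np

def eq_learning(env_step, s0, S, A=2, lam=0.98, alpha0=0.1, tau=300.0,
                theta=2.0, T=50_000):
    """Q-learning with counter-based directed exploration (EQ-style sketch).

    env_step(s, a) must return (next_state, reward); states are integers in [0, S).
    """
    Q = np.zeros((S, A))
    n_sa = np.zeros((S, A))          # visits of (s, a)
    n_s = np.zeros(S)                # visits of s
    c = np.zeros((S, A))             # counters of actions not performed
    s = s0
    for _ in range(T):
        a_greedy = int(Q[s].argmax())
        # Directed exploration, Eq. (24): prefer the greedy action by +1.
        score = c[s] / np.maximum(n_sa[s], 1.0)
        score[a_greedy] += 1.0
        a = int(score.argmax())
        s_next, r = env_step(s, a)
        n_s[s] += 1
        n_sa[s, a] += 1
        # Counter update in the spirit of Eq. (25)-(26), with Theta(n) = n^(-1/theta).
        for b in range(A):
            if b != a:
                c[s, b] += n_s[s] ** (-1.0 / theta)
        # Q-factor update, Eq. (23), with learning rate alpha0 * tau / (tau + n(s,a)).
        alpha = alpha0 * tau / (tau + n_sa[s, a])
        Q[s, a] += alpha * (r + lam * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```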
a) Complexity: At every decision epoch, given the state, the updates are over the |A| actions of the current state, so the computational complexity of the EQ algorithm is O(|A|). Since there are only two actions in our problem, the computational complexity of the EQ algorithm is much lower than that of PI. The state-action based quantities, such as the Q-factors, need to be stored for each update, so the required storage of the EQ algorithm is O(|A|MN).

b) Design parameters: For the updates in (23) and (25), there are two design parameters in the EQ algorithm: the learning rate α_t(s,a) and the function Θ(n). In [22], the learning rate is suggested as α_t(s,a) = α_0 τ/(τ + n_t(s,a)), where α_0 is the initial learning rate and τ is some positive parameter. For Θ(n), we set Θ(n) = 1/n^{1/θ}, θ ≥ 1, which can be shown to satisfy the constraints in (26). Note that, from (25), a larger θ results in a larger value of c_{t+1}(s_t,a) for a ≠ a_t (usually a non-greedy action). Therefore, the value of θ affects the trade-off between exploration (a non-greedy action) and exploitation (the greedy action). In Section VI, we will study how to choose θ and α_0. As will be seen through simulation, the parameters θ and α_0 turn out to be important and should be carefully tuned.

C. Model-Based On-Line PI Algorithm

For comparison purposes, we also provide a model-based on-line PI algorithm, which combines PI with the counter-based directed exploration strategy in (24) and (25) of the EQ algorithm [28]. We briefly describe this algorithm below; the details can be found in [28]. In this algorithm, at each decision epoch t, first, the estimates of P(s'|s,a) are updated using the observation history of state-action pairs {(‖g_τ‖², m_τ, a_τ)}_{τ=0}^t, where the processes {‖g_t‖²} and {m_t} are jointly considered; second, with the updated estimates of P(s'|s,a), PI is employed to generate an action, which is treated as the greedy action â_t; third, the final action a_t is determined by (24). Note that the difference between this model-based on-line PI algorithm and the EQ algorithm of Section V-B is that the greedy action â_t in the former is generated by PI, while â_t in the latter is generated by Q-learning, which is model-free. Compared to the model-based joint off-line PI algorithm of Section V-A, in the model-based on-line PI algorithm the estimates of P(s'|s,a) are updated at every decision epoch based on the observation history of state-action pairs. Thus, in using these data, the policy seeks a certain balance between exploration and exploitation. Note that the computational complexity of this on-line PI algorithm is high, as PI is performed at every decision epoch.
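The structure of one decision epoch of this model-based on-line variant can be sketched as follows (Python/NumPy), reusing the policy_iteration sketch given in Section V-A above. The counts-based maximum-likelihood update and the function signature are assumptions for illustration only.

```python
import numpy as np

def online_pi_step(counts, r, s, n_sa, c, lam=0.98):
    """One decision epoch of a model-based on-line PI sketch.

    counts: array (A, S, S) of observed transition counts for each action.
    The greedy action comes from policy iteration on the current estimates of
    P(s'|s,a); the final action follows the counter rule of Eq. (24).
    Requires the policy_iteration function from the earlier sketch.
    """
    # Maximum-likelihood estimates of P(s'|s,a); a tiny prior avoids division by zero.
    P_hat = counts + 1e-12
    P_hat = P_hat / P_hat.sum(axis=2, keepdims=True)
    policy, _ = policy_iteration(P_hat, r, lam)          # greedy action from the model
    a_greedy = int(policy[s])
    score = c[s] / np.maximum(n_sa[s], 1.0)              # exploration scores, Eq. (24)
    score[a_greedy] += 1.0
    return int(score.argmax())
```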

Fig. 1. Design of θ: 2-V EQ algorithm with R = 2, M = N = 4, c = 4 and 12, and θ = 1, 2, and 3 (expected total discounted reward vs. time slot).

Fig. 2. Design of α_0: 2-V EQ algorithm with R = 2 and 4, M = N = 4 and 8, and different values of α_0 (expected total discounted reward vs. feedback cost).

VI. PERFORMANCE STUDY OF THE PROPOSED LEARNING ALGORITHMS

In this section, we study and compare the performances of the four learning algorithms proposed in the previous section. In all experiments, we set the discount factor λ = 0.98, the normalized Doppler frequency f_D T_s = 0.01, and the power at the transmitter P = 10 dBW (except where otherwise mentioned). The expected total discounted reward in (4) is approximated by the sample mean of the total discounted rewards over 400 independently generated channel sequences. For the n-th generated channel sequence, the total discounted reward beginning at time t is approximated by V̂_t^{(n)} ≜ Σ_{τ=t}^{t+Δt} λ^{τ−t} r_τ^{(n)}, where Δt = 350. Note that V̂_t^{(n)} is an approximation in the sense that Δt is set to a finite number instead of infinity. The parameter τ in the learning rate α_t(s,a) of the EQ algorithm is set to 300. We consider a range of feedback costs c ∈ {4, 8, 12, 20, 40, 60}.
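The truncated-sum estimator used here can be written compactly; the following sketch (NumPy; reward_seqs is a hypothetical array holding simulated per-slot net throughputs) mirrors that computation.

```python
import numpy as np

def estimate_discounted_reward(reward_seqs, t=0, horizon=350, lam=0.98):
    """Approximate E[sum_t lam^t r_t] by a truncated sum averaged over sample paths.

    reward_seqs: array (num_sequences, T) of per-slot net throughputs r_t.
    """
    window = np.asarray(reward_seqs)[:, t:t + horizon + 1]
    weights = lam ** np.arange(window.shape[1])          # lam^(tau - t)
    return (window * weights).sum(axis=1).mean()         # sample mean over sequences
```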
A. Model-Based versus Model-Free Algorithms

We first provide guidelines on choosing the parameters θ and α_0 in the EQ algorithm, and then compare the performances of the EQ algorithm and the on-line PI algorithm.

1) EQ Algorithm, Design of θ and α_0: Let R = 2 and M = N = 4. The feedback cost is set to c = 4 or 12. In Fig. 1, we show the expected total discounted reward under the EQ algorithm versus the time slot for θ = 1, 2, and 3, respectively. Recall that θ controls the exploration-exploitation trade-off (non-greedy vs. greedy actions) in the learning algorithm. Each curve in Fig. 1 shows the convergence behaviour for a specified value of θ under the EQ algorithm. From Fig. 1, we see that, when the feedback cost is low, i.e., c = 4, a balanced exploration-exploitation setting (θ = 2) results in the best receiver beamforming performance, while choosing the greedy action for exploitation with little exploration (θ = 1) results in the lowest performance. When the feedback cost becomes high, i.e., c = 12, however, θ = 1 results in the best beamforming performance; in this case, choosing the greedy action for more exploitation of the immediate net throughput given in (3) gives better performance. In addition, more exploration prolongs the convergence time of the EQ algorithm. In summary, the choice of θ for the best performance depends on the feedback cost. This is consistent with the intuition that, with a lower feedback cost, the system benefits from more exploration under non-greedy actions to improve the overall net throughput; when the cost increases, such exploration becomes costly, and exploitation of the immediate maximum net throughput under the greedy action is preferred.

In Fig. 2, for R = 2 and 4, we investigate the relationship between the initial learning rate α_0 and the quantization levels M and N. We plot the approximated expected total discounted reward versus the feedback cost c, where θ is chosen based on the guideline in the previous paragraph. When M = N = 4, a particular value of α_0 is found to result in the best performance. However, when M = N = 8, we see that the same α_0 leads to worse performance, especially in the high feedback cost region. For example, when c = 60, the degradation is 3% (R = 2) and 5% (R = 4), respectively. Instead, a different initial learning rate α_0 results in the best performance for M = N = 8. Nonetheless, compared to M = N = 4, the improvement is limited. From Fig. 2, it can be seen that the initial learning rate α_0 should be adjusted according to the quantization levels; specifically, α_0 should be smaller for higher quantization levels M and N. Intuitively, higher quantization levels result in a larger state space and hence fewer observation samples per state, so a smaller learning rate is required.

2) EQ Algorithm versus On-Line PI Algorithm: For R = 2 and 4, in Fig. 3 we compare the performances of the EQ algorithm and the on-line PI algorithm at various feedback costs for quantization levels M = N = 4. Recall that the only difference between the EQ algorithm and the model-based on-line PI algorithm is how the greedy action â_t is generated.

Fig. 3. EQ algorithm versus on-line PI algorithm: R = 2 and 4, M = N = 4 (expected total discounted reward vs. feedback cost).

Fig. 4. Joint off-line PI algorithm versus independent off-line PI algorithm: R = 2, P = 10 dBW, and M = N = 4, 8, 16, and 32 (expected total discounted reward vs. feedback cost).

From Fig. 3, we see that the EQ algorithm outperforms the on-line PI algorithm in the medium-to-high cost range, showing the advantage of the EQ algorithm over the on-line PI algorithm. This advantage can be explained by the fact that the model-free algorithm is essentially reward-driven, and hence avoids constructing transition probabilities that might mismatch those of the actual system.

B. Independent versus Joint Off-Line PI Algorithm

Recall from Section IV that it is challenging to determine the dependency between the two processes {‖g_t‖²} and {m_t}. We have thus proposed two off-line PI algorithms treating the two processes either jointly or independently. In this section, we compare the performances of these two off-line PI algorithms. We consider two scenarios: a fixed transmit power constraint with various feedback costs, and various power constraints with a fixed feedback cost.

1) Effect of Feedback Costs: In Fig. 4, we compare the expected total discounted reward of the joint and independent off-line PI algorithms under various feedback costs when the transmit power is fixed to P = 10 dBW. We set R = 2 and the quantization levels M = N = 4, 8, 16, and 32, respectively. As expected, the joint off-line PI algorithm outperforms the independent one, which confirms that {‖g_t‖²} and {m_t} are indeed correlated. As the quantization levels M and N increase, the performance of both algorithms improves, while the performance gap between the two algorithms shrinks. For M = N = 32, the performance gap becomes negligible, indicating that treating the two processes {‖g_t‖²} and {m_t} as independent is reasonable. For comparison, the performance of the EQ algorithm with M = N = 4 is also shown. Surprisingly, we observe that the EQ algorithm gives the best performance compared with all off-line PI algorithms, especially in the high feedback cost region.

Fig. 5. Joint off-line PI algorithm versus independent off-line PI algorithm: R = 4, P = 10 dBW, and M = N = 4, 8, 16, and 32 (expected total discounted reward vs. feedback cost).

In addition, in Fig. 4 we also include the EQ algorithm based on the original 4R-V state given in Section III, which leads to a valid MDP problem. A random codebook of size |C| = 256 is used, where the average quantization level per real variable is L = 32. Compared with the performance under the 2-V state, the performance of the 4R-V EQ algorithm is uniformly below that of the independent off-line PI algorithm with M = N = 16.
This observation reveals that, although the 4R-V state is optimal, there is a large performance loss due to quantization; in contrast, the 2-V state is more efficient. The same experiments are conducted for R = 4 in Fig. 5, where behaviour similar to the R = 2 case is observed. In particular, the performance gap between the joint and independent off-line PI algorithms is very small for M = N = 16 and 32, which again indicates that the independence of {‖g_t‖²} and {m_t} is a reasonable assumption.


More information

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v2 [cs.ni] 21 Nov 2008

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information 204 IEEE International Symposium on Information Theory Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information Omur Ozel, Kaya Tutuncuoglu 2, Sennur Ulukus, and Aylin Yener

More information

Power Allocation over Two Identical Gilbert-Elliott Channels

Power Allocation over Two Identical Gilbert-Elliott Channels Power Allocation over Two Identical Gilbert-Elliott Channels Junhua Tang School of Electronic Information and Electrical Engineering Shanghai Jiao Tong University, China Email: junhuatang@sjtu.edu.cn Parisa

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading

Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading Yakun Sun and Michael L. Honig Department of ECE orthwestern University Evanston, IL 60208 Abstract We consider

More information

Appendix B Information theory from first principles

Appendix B Information theory from first principles Appendix B Information theory from first principles This appendix discusses the information theory behind the capacity expressions used in the book. Section 8.3.4 is the only part of the book that supposes

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Ather Gattami Ericsson Research Stockholm, Sweden Email: athergattami@ericssoncom arxiv:50502997v [csit] 2 May 205 Abstract

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS

ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS KRISHNA KIRAN MUKKAVILLI ASHUTOSH SABHARWAL ELZA ERKIP BEHNAAM AAZHANG Abstract In this paper, we study a multiple antenna system where

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM? HOW O CHOOSE HE SAE RELEVANCE WEIGH OF HE APPROXIMAE LINEAR PROGRAM? YANN LE ALLEC AND HEOPHANE WEBER Abstract. he linear programming approach to approximate dynamic programming was introduced in [1].

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback

Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback June Chul Roh and Bhaskar D. Rao Department of Electrical and Computer Engineering University of California, San Diego La Jolla,

More information

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Mohammad H. Yarmand and Douglas G. Down Department of Computing and Software, McMaster University, Hamilton, ON, L8S

More information

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v4 [cs.ni] 6 Feb 2010

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Approximately achieving the feedback interference channel capacity with point-to-point codes

Approximately achieving the feedback interference channel capacity with point-to-point codes Approximately achieving the feedback interference channel capacity with point-to-point codes Joyson Sebastian*, Can Karakus*, Suhas Diggavi* Abstract Superposition codes with rate-splitting have been used

More information

Limited Feedback in Wireless Communication Systems

Limited Feedback in Wireless Communication Systems Limited Feedback in Wireless Communication Systems - Summary of An Overview of Limited Feedback in Wireless Communication Systems Gwanmo Ku May 14, 17, and 21, 2013 Outline Transmitter Ant. 1 Channel N

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Pritam Mukherjee Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College Park, MD 074 pritamm@umd.edu

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS. Pratik Patil, Binbin Dai, and Wei Yu

PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS. Pratik Patil, Binbin Dai, and Wei Yu PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS Pratik Patil, Binbin Dai, and Wei Yu Department of Electrical and Computer Engineering University of Toronto,

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance

A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance IEEE TRANSACTIONS ON SIGNAL PROCESSING Volume 58, Issue 3, Pages 1807-1820, March 2010.

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

How Much Training and Feedback are Needed in MIMO Broadcast Channels?

How Much Training and Feedback are Needed in MIMO Broadcast Channels? How uch raining and Feedback are Needed in IO Broadcast Channels? ari Kobayashi, SUPELEC Gif-sur-Yvette, France Giuseppe Caire, University of Southern California Los Angeles CA, 989 USA Nihar Jindal University

More information

Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO

Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO Salah Eddine Hajri, Maialen Larrañaga, Mohamad Assaad *TCL Chair on 5G, Laboratoire des Signaux et Systemes L2S, CNRS, CentraleSupelec,

More information

Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling

Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling Wei Li, Meng-Lin Ku, Yan Chen, and K. J. Ray Liu Department of Electrical and Computer Engineering, University of

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

arxiv:cs/ v1 [cs.it] 11 Sep 2006

arxiv:cs/ v1 [cs.it] 11 Sep 2006 0 High Date-Rate Single-Symbol ML Decodable Distributed STBCs for Cooperative Networks arxiv:cs/0609054v1 [cs.it] 11 Sep 2006 Zhihang Yi and Il-Min Kim Department of Electrical and Computer Engineering

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning. Summer 2017 Defining MDPs, Planning

Reinforcement Learning. Summer 2017 Defining MDPs, Planning Reinforcement Learning Summer 2017 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel

Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel 6464 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 10, OCTOBER 2013 Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel Vasanthan Raghavan, Senior Member, IEEE,

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Alkan Soysal Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland,

More information

Vector Channel Capacity with Quantized Feedback

Vector Channel Capacity with Quantized Feedback Vector Channel Capacity with Quantized Feedback Sudhir Srinivasa and Syed Ali Jafar Electrical Engineering and Computer Science University of California Irvine, Irvine, CA 9697-65 Email: syed@ece.uci.edu,

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems

Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems 1130 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 63, NO. 5, MARCH 1, 2015 Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems Ahmed Medra and Timothy N. Davidson Abstract The communication

More information

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation Karim G. Seddik and Amr A. El-Sherif 2 Electronics and Communications Engineering Department, American University in Cairo, New

More information

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS June Chul Roh and Bhaskar D Rao Department of Electrical and Computer Engineering University of California, San Diego La Jolla, CA 9293 47,

More information

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels 1 Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels arxiv:1410.2861v1 [cs.it] 10 Oct 2014 Zhe Wang, Vaneet Aggarwal, Xiaodong

More information

Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback

Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback Venkata Sreekanta Annapureddy 1 Srikrishna Bhashyam 2 1 Department of Electrical and Computer Engineering University of Illinois

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th,

More information

K User Interference Channel with Backhaul

K User Interference Channel with Backhaul 1 K User Interference Channel with Backhaul Cooperation: DoF vs. Backhaul Load Trade Off Borna Kananian,, Mohammad A. Maddah-Ali,, Babak H. Khalaj, Department of Electrical Engineering, Sharif University

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS

STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS Qing Zhao University of California Davis, CA 95616 qzhao@ece.ucdavis.edu Bhaskar Krishnamachari University of Southern California

More information

On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels

On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels Giuseppe Caire University of Southern California Los Angeles, CA, USA Email: caire@usc.edu Nihar

More information

Communication constraints and latency in Networked Control Systems

Communication constraints and latency in Networked Control Systems Communication constraints and latency in Networked Control Systems João P. Hespanha Center for Control Engineering and Computation University of California Santa Barbara In collaboration with Antonio Ortega

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Low-High SNR Transition in Multiuser MIMO

Low-High SNR Transition in Multiuser MIMO Low-High SNR Transition in Multiuser MIMO Malcolm Egan 1 1 Agent Technology Center, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic. 1 Abstract arxiv:1409.4393v1

More information

Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems

Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems IEEE TRANSACTIONS ON SIGNAL PROCESSING Volume 57, Issue 10, Pages 4027-4041, October 2009.

More information

Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks

Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks 1 Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks Bo Zhou, Ying Cui, Member, IEEE, and Meixia Tao, Senior Member, IEEE Abstract Caching at small base stations

More information