On Stochastic Feedback Control for Multi-antenna Beamforming: Formulation and Low-Complexity Algorithms


Sun Sun, Min Dong, and Ben Liang

(Sun Sun and Ben Liang are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada, {ssun, liang}@comm.utoronto.ca. Min Dong is with the Department of Electrical, Computer and Software Engineering, University of Ontario Institute of Technology, Toronto, Canada, min.dong@uoit.ca.)

Abstract—We consider transmit beamforming in multiple-input single-output (MISO) systems. Based on a first-order Gauss-Markov channel model, we investigate stochastic feedback control for transmit beamforming to maximize the net system throughput under a given average feedback cost, and design practical implementation algorithms for feedback control using techniques from dynamic programming and reinforcement learning. We first validate the Markov Decision Process (MDP) formulation of the underlying feedback control problem when the system state consists of 4R real variables, with R being the number of transmit antennas. Due to the complexity of finding an optimal feedback policy under such a state space, we consider state-space reduction to a 2-variable (2-V) state. In contrast to a previous study that assumes the feedback control problem under such a 2-V state remains an MDP, our analysis indicates that the underlying problem is no longer an MDP. Nonetheless, the approximation as an MDP is shown to be justifiable and efficient. Based on the quantized 2-V state and the MDP approximation, we study practical implementation algorithms for feedback control. With unknown state transition probabilities, we consider model-based off-line and on-line learning approaches, which first estimate the transition probabilities and then obtain an optimal feedback control policy, as well as a model-free reinforcement learning approach, which directly learns an optimal feedback control policy. We investigate and compare these algorithms through extensive simulations under various feedback cost and power constraint settings, and provide an efficiency analysis of all proposed algorithms. From these results, an application rule for these algorithms is established, taking into account both stationary and nonstationary systems.

Index Terms—Beamforming, stochastic feedback control, implementation algorithms, reinforcement learning.

I. INTRODUCTION

Multi-antenna transmit beamforming is an effective physical-layer technique to improve transmission rate and reliability over a wireless link in a slow fading environment [1]. The actual beamforming gain achieved depends on the feedback quality, which controls the accuracy of the beamforming vector used at the transmitter. Thus, a main challenge in practical systems to maximally realize the beamforming potential is to design feedback strategies that efficiently provide the transmitter with knowledge of the channel or the beamforming vector. As feedback incurs overhead and thus rate loss to the system, the net gain in system throughput is the result of both the rate gain due to beamforming and the rate loss due to feedback. The temporal correlation of the channel adds a time dimension to the feedback strategy design, in order to make a proper trade-off between the beamforming gain and the feedback quality. Beamforming feedback design in the literature typically focuses on channel or beamforming vector quantization techniques, i.e., codebook design [2], [3].
The temporal dimension is addressed either by assuming a block fading channel model, with independent fade changes from block to block, or through simple periodic feedback, without actively exploiting the temporal correlation of channel fades. Recently, a stochastic feedback control approach has been studied for beamforming feedback design by formulating the problem as a Markov Decision Process (MDP) [4]. In this case, the feedback decision takes into account the channel temporal correlation and the feedback cost, with the goal of maximizing the system net throughput. Such an approach is attractive because the feedback decision is dynamic, which improves the feedback efficiency, and because the feedback cost is incorporated in the decision. In this paper, we focus on analyzing the stochastic feedback control approach and designing feedback control algorithms for transmit beamforming. Specifically, we investigate the problem from the perspectives of channel temporal correlation modeling, problem space reduction and its optimality, and low-complexity algorithms for online feedback control.

A. Contributions

We consider a multiple-input single-output (MISO) transmit beamforming system with R transmit antennas and a single receive antenna. The channel dynamics are modeled by a first-order Gauss-Markov process. By modeling the feedback control as an on-off decision process for the beamforming vector update, we aim at designing a feedback control policy to maximize the net system throughput under a given average feedback cost. Our contributions are summarized as follows.

We provide a detailed analysis of the validity of the MDP formulation of the underlying feedback control problem. Specifically, by constructing the system state from the vector channel and the available beamforming vector at the transmitter, which together comprise 4R real variables (the 4R-V state), we prove that the feedback control problem is an MDP problem. Based on the quantized 4R-V state, there exists an optimal stationary feedback control policy which is Markovian (it depends only on the current system state) and deterministic. However, it is well known that, for such an MDP problem, obtaining an optimal policy using dynamic programming (DP) [5] is hindered by the curse of dimensionality due to the exponential complexity with respect to the size of the state space. To make the problem tractable for a practical feedback control design, we investigate state-space reduction to a 2-V state of two variables, consisting of the channel power and the achieved beamforming gain.

In [4], such state-space reduction was claimed to incur no loss of optimality, in the sense that the feedback control problem under this 2-V state remains an MDP problem. However, through rigorous analysis, we show that such reduction is in fact suboptimal, i.e., the underlying feedback control problem is no longer an MDP problem. Nonetheless, the approximation as an MDP is shown to be justifiable and efficient.

Based on the quantized 2-V state space and the MDP approximation, we consider practical implementation algorithms for the feedback control. Since the system state transition probabilities are unknown, we consider model-based learning approaches, i.e., learning the transition probabilities, either off-line or on-line, combined with the policy iteration algorithm, as well as a model-free reinforcement learning approach that directly learns the feedback control policy without knowledge of the system state transition probabilities. To the best of our knowledge, this is the first paper that employs reinforcement learning techniques in feedback control for multi-antenna beamforming.

We give a detailed study and comparison of these algorithms through extensive simulations under various feedback cost and power constraint settings, and provide an efficiency analysis of these algorithms. Based on these results, an application rule for these algorithms is suggested, taking into account both stationary and nonstationary systems.

B. Related Works

There is a large body of literature on beamforming feedback design [2]. Most previous works focus on codebook design for the beamforming vector under limited feedback [2], [3], [6], [7], with emphasis on the quantization performance for a given channel. Approaches such as Grassmannian line packing [3], vector quantization [7], [8], and random vector quantization [6] have been proposed and studied. To further improve the feedback efficiency and reduce feedback overhead, opportunistic feedback is considered in multiuser transmission systems [9]-[11], where a threshold type of feedback is assumed and derived for various design metrics. However, these results assume a block fading channel with independent fades from block to block. Thus, channel temporal correlation is not actively exploited for feedback efficiency.

When exploiting the temporal correlation of the fading channel, the channel dynamics are commonly modeled either by a continuous Gauss-Markov model [12], [13] or by a finite-state Markov model [14]-[16]. Compression methods for the channel state or beamforming vector that exploit channel temporal correlation are investigated in [17], [18]. In these works, a periodic feedback strategy is assumed, and the effect of the feedback cost on the system throughput is not incorporated in the design. The cost of feedback is studied in the form of resource allocation (e.g., bandwidth, power) between feedback and data transmission in [19], [20].

A stochastic feedback control approach is first considered in [4], where the beamforming feedback problem is formulated as an MDP problem and the optimal control policy for deciding when to feed back is studied to maximize the system net throughput. By modeling the channel temporal correlation as a first-order Gauss-Markov process and viewing the feedback control problem as an MDP, the feedback control is shown to be of the threshold type. This work is the most relevant to our study in this paper.
The main differences between our work and [4] are the following: 1) We rigorously examine the validity of the MDP formulation of the beamforming feedback control problem under both the original full state space and the reduced state space, while in [4] the MDP formulation was directly assumed for the feedback control problem under the reduced state-space model; 2) We explore and compare various practical feedback control implementation algorithms, either off-line or on-line, and either model-based or model-free. In [4], the focus was on proving the threshold type of the optimal feedback control policy under the MDP formulation. Although the existence of a threshold was shown, no practical algorithm was provided to determine the threshold for feedback control.

Feedback control policy design for transmit beamforming under the MDP formulation is essentially a problem of sequential on-off decision-making. There is a rich literature on the control policy design of an MDP problem [5]. When the state transition probabilities are known, traditional DP methods such as policy iteration or value iteration can be employed to find the optimal policy [5]. In reality, however, the transition probabilities are typically unavailable. Reinforcement learning provides model-free algorithms in which the decision-maker learns an optimal policy directly, without learning the model first. Excellent tutorials on this subject can be found in [21]-[23].

C. Organization and Notation

The remainder of this paper is organized as follows. Section II provides the system model and problem formulation. In Section III, we consider the transformation of the underlying feedback control problem into an MDP problem based on a 4R-V state. In Section IV, we investigate the optimality of state-space reduction to a 2-V state. In Section V, based on the quantized 2-V state, four learning algorithms for feedback control are described. The performances of these algorithms are then studied and compared in Section VI, followed by the conclusion in Section VII.

Notation: The conjugate, transpose, and Hermitian of a matrix A are denoted by A*, A^T, and A^H, respectively; 0_{m,n} is an m×n matrix with all zero entries; I_n is an identity matrix of dimension n×n; for an n×1 vector g, ‖g‖ denotes its 2-norm, and g ∼ CN(0_{n,1}, σ²I_n) means that g is a circular complex Gaussian random vector with mean zero and covariance σ²I_n; E[·] stands for expectation; log(·) denotes the logarithm with base 2; ℜ(·) and ℑ(·) denote the real and imaginary parts of the argument, respectively; ℝ denotes the real space; 1(·) denotes the indicator function, which equals 1 (resp. 0) when the statement inside the brackets is true (resp. false); we write {h_t} ∼ WN(0, σ²) if {h_t} is a white noise process, i.e., all distinct elements of {h_t} are uncorrelated with mean zero and variance σ².

II. CHANNEL MODEL AND PROBLEM STATEMENT

A. Channel Model

Consider a MISO system where the transmitter is equipped with R (≥ 2) antennas and the receiver is equipped with a single antenna. We assume a discrete-time slow fading channel model. Denote by g_{i,t} the channel gain from the i-th transmit antenna to the receiver at time t, and define the R×1 channel gain vector g_t ≜ [g_{1,t}, ..., g_{R,t}]^T. Assume that all channel gains are spatially independent but temporally correlated, and that the dynamics of the channel gain vector g_t follow the first-order autoregressive Gauss-Markov model

g_{t+1} = ρ g_t + w_{t+1},  t = 0, 1, ...,  (1)

where g_0 ∼ CN(0_{R,1}, I_R), and {w_t} is a sequence of i.i.d. random vectors independent of g_0 with distribution w_t ∼ CN(0_{R,1}, (1−ρ²)I_R); the parameter ρ is the correlation coefficient defined as ρ ≜ J_0(2πf_D T_s), where J_0(·) is the zero-order Bessel function of the first kind, and f_D T_s is the normalized Doppler frequency, with f_D being the maximum Doppler frequency shift and T_s the transmitted symbol period. A higher value of ρ indicates a higher temporal correlation. (For example, when f_D T_s = 0.01, i.e., the channel coherence time is approximately 100 times the symbol period, we have ρ ≈ 0.999, which indicates that successive channel gains are highly correlated.) The result below indicates that {g_t} in (1) is Markovian. Furthermore, it can be proven that the stationary distribution of {g_t} is CN(0_{R,1}, I_R).

Fact 1 ([24]): Let ε_0, ε_1, ε_2, ... be independent random variables, X_0 ≜ ε_0, and X_{n+1} ≜ ρX_n + ε_{n+1}, n = 0, 1, 2, ..., where ρ is a real constant. Then {X_n} forms a Markov chain.

B. Problem Statement

Denote by ĝ_{u,t} the R×1 normalized complex vector, i.e., ‖ĝ_{u,t}‖ = 1, available as the beamforming vector at the transmitter at time t. Then the received signal at the receiver at time t is given by

y_t = x_t ĝ^H_{u,t} g_t + v_t,  (2)

where x_t is the transmitted signal at time t with the power constraint E[|x_t|²] ≤ P, and v_t is the receiver noise with v_t ∼ CN(0,1), independent of {g_t} and {x_t}. The received SNR and the data rate are given by P|ĝ^H_{u,t} g_t|² and log(1 + P|ĝ^H_{u,t} g_t|²), respectively.

In practical systems, the beamforming vector at the transmitter is obtained through limited feedback. The effective beamforming gain at the receiver is directly affected by the accuracy of the beamformer with respect to the current CSI. Let c be the fixed cost of each feedback. Clearly, there is a trade-off between the feedback cost and the data rate provided by the beamforming gain. We define the reward at time t, denoted by r_t, as the net throughput obtained through beamforming:

r_t = log(1 + P‖g_t‖²) − c, if feedback;
r_t = log(1 + P|ĝ^H_{u,t} g_t|²), if no feedback,  (3)

where in (3), upon feedback, the available beamforming vector at the transmitter is updated as ĝ_{u,t} = g_{u,t} ≜ g_t/‖g_t‖. The expected total discounted reward can be expressed as

V ≜ E[Σ_{t=0}^∞ λ^t r_t],  (4)

where λ ∈ (0,1) is the discount factor, and the expectation is taken over the randomness of the channels. By including the discount factor, future net throughput is valued less than current net throughput, which reflects the time value in practice. Our objective in this paper is to design a feedback control policy to maximize the expected total discounted reward. To simplify the analysis, we assume that the receiver knows the CSI perfectly, and that the feedback to the transmitter is delay-free and error-free.
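For concreteness, the following minimal sketch (Python/NumPy) simulates the Gauss-Markov channel in (1) and evaluates the per-slot net throughput in (3) under an arbitrary placeholder feedback rule. All parameter values are chosen only for illustration and are not part of the proposed design; ρ is hard-coded to roughly J_0(2π·0.01).

```python
import numpy as np

rng = np.random.default_rng(0)
R, P, c = 4, 10.0, 4.0                       # antennas, transmit power (linear), feedback cost
rho = 0.999                                  # approx. J_0(2*pi*f_D*T_s) for f_D*T_s = 0.01

def crandn(n):
    """Circular complex Gaussian CN(0, I_n) samples."""
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

g = crandn(R)                                # g_0 ~ CN(0, I_R)
g_hat_u = g / np.linalg.norm(g)              # initial beamforming vector at the transmitter

for t in range(5):
    feedback = (t % 2 == 0)                  # placeholder policy: feed back every other slot
    if feedback:
        g_hat_u = g / np.linalg.norm(g)
        r_t = np.log2(1 + P * np.linalg.norm(g) ** 2) - c       # Eq. (3), feedback branch
    else:
        r_t = np.log2(1 + P * np.abs(np.vdot(g_hat_u, g)) ** 2)  # Eq. (3), no-feedback branch
    print(f"t={t}, feedback={feedback}, r_t={r_t:.3f}")
    # Channel evolution, Eq. (1): g_{t+1} = rho*g_t + w_{t+1}, w ~ CN(0, (1-rho^2) I_R)
    g = rho * g + np.sqrt(1 - rho ** 2) * crandn(R)
```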
III. OPTIMAL FEEDBACK CONTROL: AN MDP FORMULATION

To maximize the expected total discounted reward, the optimal feedback policy may in general be a randomized policy that depends on all previous system states and feedback actions. In this section, we show that, by introducing a 4R-V state, the underlying feedback control problem can be formulated as an MDP problem. With the quantized 4R-V state, the formulated MDP problem admits an optimal stationary policy which is Markovian (it depends only on the current system state) and deterministic. Unfortunately, we also point out that finding an optimal policy based on the quantized 4R-V state is difficult.

A. An MDP Formulation

We first review the definition of an MDP problem.

Definition 1 ([5]): Define a collection of objects {T, S, A, P(·|s_t, a_t), r_t(s_t, a_t)} as an MDP, where T, S, and A are the spaces of the decision epoch t, the state s_t, and the action a_t, respectively, P(·|s_t, a_t) is the transition probability of the next state given the state and action at the current time t, and r_t(s_t, a_t) is the corresponding reward function.

Define z_t ≜ (s_0, a_0, s_1, a_1, ..., s_t) to be the history of the decision process up to time t. The key property of an MDP is that the transition probabilities and the reward function depend on the history only through the current state.

We now show that the underlying feedback control problem can be formulated as an MDP problem. Let the space of decision epochs be T ≜ {0, 1, ...}. At time t ∈ T, the receiver decides whether to feed back g_{u,t} to the transmitter. Denote the space of actions by A ≜ {0, 1}, where 0 and 1 represent no feedback and feedback, respectively. Note that the final available beamforming vector at the transmitter at time t depends on the receiver's decision, and is therefore undetermined when the receiver makes the decision at time t. We therefore define the state at time t as s_t ≜ (g_t, ĝ_{t−1}), where ĝ_{t−1} is the unnormalized version (i.e., including the associated channel power) of the beamforming vector available at the transmitter at time t−1.

In particular, if the most recent feedback update happened q_t ∈ {1, ..., t+1} time slots before time t, then ĝ_{t−1} = g_{t−q_t}, with g_{−1} being the initial unnormalized beamforming vector used at the transmitter at time slot 0. Upon the receiver's decision, the available beamforming vector at the transmitter, the reward, and the state at the next time slot are as follows:

a_t = 1:  ĝ_{u,t} = g_{u,t} ≜ g_t/‖g_t‖,  r_t = log(1 + P‖g_t‖²) − c,  s_{t+1} = (g_{t+1}, g_t);
a_t = 0:  ĝ_{u,t} = ĝ_{u,t−1} = g_{t−q_t}/‖g_{t−q_t}‖,  r_t = log(1 + P|ĝ^H_{u,t} g_t|²),  s_{t+1} = (g_{t+1}, g_{t−q_t}).  (5)

Since g_t and ĝ_{t−1} are complex, rewriting the state as s_t ≜ (ℜ(g_t), ℑ(g_t), ℜ(ĝ_{t−1}), ℑ(ĝ_{t−1})), we have in total 4R real scalar variables in s_t, referred to as the 4R-V state. Thus, the state space S is the Cartesian product of 4R copies of the real space ℝ.

It is easy to see that the reward function depends on z_t only through s_t. We now show that this is the case for the transition probabilities as well.

Lemma 1: Based on the 4R-V state s_t ≜ (g_t, ĝ_{t−1}), we have

P(s_{t+1} | z_t, a_t) = P(s_{t+1} | s_t, a_t).  (6)

Proof: First assume a_t = 1. At time t+1, the receiver will observe s_{t+1} = (g_{t+1}, g_t). The left hand side (LHS) of (6) is given by

P(s_{t+1} | z_t, a_t) = P(g_{t+1}, ĝ_t | g_0, ĝ_{−1}, a_0, g_1, ĝ_0, a_1, ..., g_t, ĝ_{t−1}, a_t)  (7)
= 1(ĝ_t = g_t) P(g_{t+1} | g_0, g_{−1}, g_1, g_{1−q_1}, ..., g_t, g_{t−q_t})  (8)
= 1(ĝ_t = g_t) P(g_{t+1} | g_t),  (9)

where (7) follows from the definitions of s_t and z_t, (8) follows because, once all actions {a_τ}_{τ=0}^t are given, ĝ_{τ−1}, τ ∈ {1, ..., t+1}, can be specified as g_{τ−q_τ} with q_τ ∈ {1, ..., τ+1}, and (9) follows from the Markovian property of {g_t}. The right hand side (RHS) of (6) is given by

P(s_{t+1} | s_t, a_t) = P(g_{t+1}, ĝ_t | g_t, ĝ_{t−1}, a_t) = 1(ĝ_t = g_t) P(g_{t+1} | g_t, ĝ_{t−1}) = 1(ĝ_t = g_t) P(g_{t+1} | g_t).

Therefore, P(s_{t+1} | z_t, a_t = 1) = P(s_{t+1} | s_t, a_t = 1). When a_t = 0, at time t+1 the receiver will observe s_{t+1} = (g_{t+1}, g_{t−q_t}). The LHS of (6) is given by

P(s_{t+1} | z_t, a_t) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_0, g_{−1}, g_1, g_{1−q_1}, ..., g_t, g_{t−q_t}) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t),

and the RHS of (6) is given by

P(s_{t+1} | s_t, a_t) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t, g_{t−q_t}) = 1(ĝ_t = g_{t−q_t}) P(g_{t+1} | g_t).

This completes the proof.

Based on the MDP objects defined above and Lemma 1, we have now transformed the beamforming feedback control problem of Section II into an MDP problem. As the state space S is continuous and thus difficult to work with, we take the common approach of quantizing it into a finite state space. (For the slow fading channel model, we can approximately treat the quantized version of the problem as an MDP problem.) As a result, due to the finite state and action spaces and the stationarity of the channel model, the underlying feedback control problem admits an optimal stationary policy, which is both Markovian and deterministic [5]. In other words, we are to find an optimal stationary policy π* ≜ (d*)^∞, where d*: S → A denotes the optimal decision rule and the superscript indicates that the decision rule d* is stationary.
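As a quick illustration of the transition in (5), the sketch below (Python/NumPy; the helper name and argument layout are assumptions for illustration) computes the reward and the next 4R-V state for one decision epoch by mirroring the two branches above.

```python
import numpy as np

def step_4RV(g_t, g_unnorm_prev, a_t, g_next, P, c):
    """One decision epoch under the 4R-V state s_t = (g_t, ghat_{t-1}); see Eq. (5).

    g_unnorm_prev is the unnormalized beamformer ghat_{t-1} held at the transmitter;
    the beamformer actually applied is its normalized version.
    """
    if a_t == 1:                                   # feedback: beamformer updated to g_t/||g_t||
        r_t = np.log2(1 + P * np.linalg.norm(g_t) ** 2) - c
        g_unnorm_new = g_t
    else:                                          # no feedback: keep the previous beamformer
        bf = g_unnorm_prev / np.linalg.norm(g_unnorm_prev)
        r_t = np.log2(1 + P * np.abs(np.vdot(bf, g_t)) ** 2)
        g_unnorm_new = g_unnorm_prev
    s_next = (g_next, g_unnorm_new)                # s_{t+1} = (g_{t+1}, ghat_t)
    return s_next, r_t
```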
B. Challenges in Determining the Optimal Policy

Although there exists an optimal policy π* for the quantized 4R-V state space, finding π* is challenging, as it is difficult to find an efficient way to perform the quantization. One possible method is to quantize the 4R variables in s_t individually. For a quantization level L on each variable in s_t, the total number of states is L^{4R}, which is exponential in R. This number is large even for moderate values of L and R (e.g., when L = 4 and R = 4, L^{4R} = 4^{16} ≈ 4.3 × 10^9). Thus, we quickly face the curse of dimensionality as L and R increase. Another approach is to directly quantize the state space S by a vector quantization technique [25], such as a random codebook. The random codebook C contains |C| i.i.d. vectors, each of length 4R×1 and with each element having the same distribution as the corresponding element of s_t. However, to achieve good performance, the size of the random codebook should be large. From the above discussion, we see that, since the number of variables in the state s_t increases with R, it is difficult to find an efficient quantization method that scales well with R. Hence, it is impractical to perform beamforming feedback control directly based on the 4R-V state, especially for large R, and state-space reduction is necessary.

IV. STATE-SPACE REDUCTION: ANALYSIS OF THE 2-V STATE

Our focus now is on finding a suitable reduced state for beamforming feedback control while maintaining the validity of the MDP formulation. An attempt at state-space reduction is given in [4], where a 2-V state s_t ≜ (‖g_t‖², m_t), with m_t ≜ |ĝ^H_{u,t−1} g_{u,t}|² ∈ [0,1], is proposed as an optimal state reduction. (The optimality is claimed in the sense that the reduced 2-V state does not compromise the controller's optimality [4].) The state in this case consists of the current channel power and the beamforming gain at time t if no feedback is eventually performed.

Based on such a state, upon the receiver's decision, the reward and the state at the next time slot are now as follows:

a_t = 1:  r_t = log(1 + P‖g_t‖²) − c,  s_{t+1} = (‖g_{t+1}‖², |g^H_{u,t} g_{u,t+1}|²);
a_t = 0:  r_t = log(1 + P‖g_t‖² m_t),  s_{t+1} = (‖g_{t+1}‖², |g^H_{u,t−q_t} g_{u,t+1}|²),  (10)

where q_t ∈ {1, ..., t+1}. Compared to the 4R-V state, the total number of variables in s_t is reduced to 2, which is independent of R and is highly desirable. However, for the reduced-state system to be a valid MDP, as pointed out in Section III-A, the reward function and the transition probabilities should depend on the history only through the current state. For the proposed state reduction, it is claimed in [4] (without proof) that the processes {‖g_t‖²} and {m_t} both form Markov chains and that they are independent. Verifying these statements, however, is non-trivial. In this section, we take a detailed look at the 2-V state. Specifically, we show that the MDP formulation no longer holds under such a 2-V state, and thus it is a suboptimal reduction. Nonetheless, we also show that the proposed 2-V state provides a good approximation and is reasonably efficient.

A. Analysis of the Process {‖g_t‖²}

Based on the channel gain dynamics in (1), we now show that {‖g_t‖²} asymptotically forms a Markov chain. The following lemma gives the property of the normalized version of {‖g_t‖²}.

Lemma 2: Define d_t ≜ (1/√R)(‖g_t‖² − R). Then {d_t} is a first-order autoregressive process,

d_{t+1} = ρ² d_t + w̃_{d,t+1},  t = 0, 1, ...,

where w̃_{d,t} ≜ (1/√R)[2ρ ℜ(g^H_{t−1} w_t) + ‖w_t‖² − R(1−ρ²)], and {w̃_{d,t}} ∼ WN(0, 1−ρ⁴). Further, d_0 is uncorrelated with {w̃_{d,t}}, i.e., E(d_0 w̃_{d,t}) = 0 for all t ≥ 1.

Proof: See Appendix A.

The following fact indicates that the Markovian property is preserved under a one-to-one mapping.

Fact 2 ([24]): If {x_t} is Markovian and y_t ≜ g(x_t), where g(·) is a one-to-one mapping, then {y_t} is also Markovian.

Combining Facts 1 and 2 and Lemma 2, we have the following result.

Proposition 1: {‖g_t‖²} asymptotically forms a Markov chain as R → ∞.

Proof: By Fact 2, showing that {‖g_t‖²} is Markovian is equivalent to showing that {d_t} is Markovian, due to the one-to-one correspondence between ‖g_t‖² and d_t. By Fact 1 and Lemma 2, to show that {d_t} is Markovian, it suffices to show that {w̃_{d,t}} is an i.i.d. sequence and that d_0 is independent of {w̃_{d,t}}. To this end, by Lemma 2, it suffices to show that (d_0, {w̃_{d,t}}) is jointly Gaussian: if this were true, then {w̃_{d,t}} would be an i.i.d. Gaussian sequence, since {w̃_{d,t}} ∼ WN(0, 1−ρ⁴), and d_0 would be independent of {w̃_{d,t}}, since E(d_0 w̃_{d,t}) = 0 for all t ≥ 1.

We now show that (d_0, {w̃_{d,t}}) is asymptotically jointly Gaussian as R → ∞, by showing that any linear combination of d_0, w̃_{d,t_1}, w̃_{d,t_2}, ..., w̃_{d,t_n}, where 0 < t_1 < t_2 < ... < t_n, is Gaussian for all n > 0. Rewrite

w̃_{d,t} = (1/√R) Σ_{i=1}^R [2ρ ℜ(g*_{i,t−1} w_{i,t}) + |w_{i,t}|² − (1−ρ²)] ≜ (1/√R) Σ_{i=1}^R y_{i,t},

where we have defined y_{i,t} ≜ 2ρ ℜ(g*_{i,t−1} w_{i,t}) + |w_{i,t}|² − (1−ρ²), and g_{i,t} and w_{i,t} are the i-th elements of g_t and w_t, respectively. Also rewrite d_0 = (1/√R) Σ_{i=1}^R (|g_{i,0}|² − 1). Let c_0, c_1, ..., c_n be the coefficients of the linear combination. Then we have

c_0 d_0 + Σ_{l=1}^n c_l w̃_{d,t_l} = (1/√R) Σ_{i=1}^R [c_0(|g_{i,0}|² − 1) + Σ_{l=1}^n c_l y_{i,t_l}] ≜ (1/√R) Σ_{i=1}^R ỹ_i,

where ỹ_i ≜ c_0(|g_{i,0}|² − 1) + Σ_{l=1}^n c_l y_{i,t_l}. By straightforward computation and the fact that all channel gains are spatially independent, it can be shown that the ỹ_i's are i.i.d. with zero mean and variance c_0² + (1−ρ⁴) Σ_{l=1}^n c_l². Applying the Central Limit Theorem, as R → ∞, the distribution of (1/√R) Σ_{i=1}^R ỹ_i converges to a Gaussian distribution with the same mean and variance as ỹ_i. Hence, (d_0, {w̃_{d,t}}) is asymptotically jointly Gaussian.

Therefore, {‖g_t‖²} asymptotically forms a Markov chain as R goes to infinity, in the sense that

lim_{R→∞} [P(‖g_{t+1}‖² | ‖g_0‖², ..., ‖g_t‖²) − P(‖g_{t+1}‖² | ‖g_t‖²)] = 0.

When R is large, we can approximately treat (1/√R) Σ_{i=1}^R ỹ_i as Gaussian, and hence (d_0, {w̃_{d,t}}) as jointly Gaussian. In other words, when R is large, {‖g_t‖²} can be approximately treated as Markovian, and P(‖g_{t+1}‖² | ‖g_0‖², ..., ‖g_t‖²) can be approximated by P(‖g_{t+1}‖² | ‖g_t‖²).
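As an illustrative (not rigorous) check of Lemma 2, the short NumPy sketch below simulates (1), forms d_t = (‖g_t‖² − R)/√R, and compares the sample mean and variance of the innovation d_{t+1} − ρ²d_t against the stated values 0 and 1 − ρ⁴. The values of R, ρ, and the horizon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
R, rho, T = 8, 0.9, 200_000

def crandn(n):
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

g = crandn(R)                                            # start from the stationary CN(0, I_R)
d = np.empty(T)
for t in range(T):
    d[t] = (np.linalg.norm(g) ** 2 - R) / np.sqrt(R)     # d_t = (||g_t||^2 - R)/sqrt(R)
    g = rho * g + np.sqrt(1 - rho ** 2) * crandn(R)      # Eq. (1)

w = d[1:] - rho ** 2 * d[:-1]                            # innovation w_{d,t+1}
print("sample mean of w:", w.mean())                     # should be close to 0
print("sample var  of w:", w.var(), "vs 1 - rho^4 =", 1 - rho ** 4)
```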
B. Validity of the 2-V State Based MDP Formulation

Consider the reduced 2-V state s_t ≜ (‖g_t‖², m_t). The state space is S ≜ ℝ₊ × U, where ℝ₊ denotes the space of non-negative real numbers and U ≜ [0,1]. For the feedback control problem under the new state to be an MDP problem, we need to verify that the reward function and the transition probabilities depend on the history only through the current state. From (10), the reward function is a function of the current state only. To verify (6) under the 2-V state, we make the following assumptions:

A1) The processes {‖g_t‖²} and {m_t} are independent.
A2) R is large.

We will further discuss assumption A1 later. Let z_t be the history of the decision process under the 2-V state, defined similarly to before.

6 6 as before. Using (10, when a t 1, we have m t+1 g H u,t g u,t+1 2, and the LHS of (6 equals P(s t+1 z t,a t P ( g t+1 2 g0,m t+1 2,m 0,a 0,, g t 2,m t,a t m0 P (m t+1,a 0,,m t,a t P ( g t+1 2 gt 2 (11 P( g H u,t g u,t+1 2 P ( g t+1 2 gt 2 (12 where (11 is due to the assumption A1 and { g t 2 } is approximately Markovian for large by Proposition 1, and (12 follows because, when a t 1, the available beamforming gain at time t equals ĝ H u,tg u,t 2 g H u,tg u,t 2 1, so the distribution of the successive beamforming gain m t+1 is independent of the previous values {m τ } t τ0. For the HS of (6, we have P(s t+1 s t,a t P ( g t+1 2,m t+1 gt 2,m t,a t P( g H u,tg u,t+1 2 P ( g t+1 2 g t 2 which is equal to (12. Using (10, when a t 0, we havem t+1 g H u,t q t g u,t+1 2 with q t {1,,t+1}. The LHS of (6 equals P(s t+1 z t,a t P ( g H u,t q t g u,t+1 2 g H u, 1 g u,0 2, g H u,1 q 1 g u,1 2,, gu,t q H t g u,t 2 P ( g t+1 2 gt 2 (13 ( P gu,t q H t g u,t+1 2 g H u,t qt g u,t+1 qt 2, gu,t q H t g u,t+2 qt 2,, g H u,t q t g u,t 2 P ( g t+1 2 g t 2 (14 where (14 follows since as argued for (12, when a τ 1, m τ+1 is independent of {m t } τ t 0. Hence, for the first conditional probability in (13, it suffices to only consider m τ s since the most recent feedback. We consider two such cases: i For q t t+1 (i.e., no feedback since the beginning: we have the first probability in (14 given as ( P gu, 1g H u,t+1 2 g H u, 1 g u,0 2, gu, 1g H u,1 2,, g H u, 1 g u,t 2. (15 ii For q t {1,2,,t}: due to the stationarity of the channel gain process, the first probability in (14 is equal to ( P gu,0 H g u,q t+1 2 g H u,0 g u,1 2, gu,0 H g u,2 2,, g H u,0 g u,q t 2. (16 The HS of (6 is given by P(s t+1 s t,a t P ( g H g u,t qt u,t+1 2 g H u,t qt g u,t 2 P ( g t+1 2 gt 2 (17 where the value of q t is unknown due to a t 1 being unknown. Hence, for (14 and (17 to be equal, we need to show that the first probability in (17 is equal to (15 or (16. In other words, we need to show that {m t } forms a Markov chain when the beamforming vector ĝ u,t is fixed to be g u, 1 or g u,0. Consider ĝ u,t g u, 1, t. By the channel dynamics in (1, the relationship between m t+1 g H u, 1 g u,t+1 2 and m t g H u, 1g u,t 2 is given by g H u, 1g u,t+1 2 ρ 2 g t 2 g t+1 2 gh u, 1g u,t 2 + w m,t+1, 2ρ g t where w m,t+1 g t+1 (gu,tg H 2 u, 1 gu, 1w H t g t+1 gu, 1w H 2 t+1 2. Based on [26, Lemma 2] and [26, Lemma 4], it can be shown that the marginal distribution of m t, t 0,1,, is given by P(m t x 1 (1 x 1 (18 which is non-gaussian. Thus, it is difficult to prove the independence of m 0 and { w m,t }, as in Lemma 2, due to their non-gaussian distributions. As a result, in contrast to the proof in Proposition 1, {m t } cannot be shown as Markovian (or approximately Markovian for large. Similar difficulty occurs when ĝ u,t g u,0, t. To rigorously establish the Markovian property of {m t } is very challenging. Nonetheless, note that when ĝ u,t g u, 1 or g u,0, t, the evolution of {m t } only depends on the evolution of {g u,t }. Since {g t } is Markovian, heuristically, it is reasonable to approximate {m t } as Markovian. Based on the above discussion, under the assumptions A1 and A2, the underlying feedback control process is an approximated MDP. We now discuss the independence assumption A1. According to the definition, the processes { g t 2 } and {m t } are said to be independent if the random vectors g [ g t1 2,, g tk 2 ] and m [m t 1,,m t j ] are independent for all k,j and all distinct choices of t 1,,t k and t 1,,t j. 
Based on the above discussion, under assumptions A1 and A2, the underlying feedback control process is approximately an MDP. We now discuss the independence assumption A1. By definition, the processes {‖g_t‖²} and {m_t} are independent if the random vectors g ≜ [‖g_{t_1}‖², ..., ‖g_{t_k}‖²] and m ≜ [m_{t'_1}, ..., m_{t'_j}] are independent for all k, j and all distinct choices of t_1, ..., t_k and t'_1, ..., t'_j. In other words, we need to show that

P_{g,m}(‖g_{t_1}‖², ..., ‖g_{t_k}‖², m_{t'_1}, ..., m_{t'_j}) = P_g(‖g_{t_1}‖², ..., ‖g_{t_k}‖²) P_m(m_{t'_1}, ..., m_{t'_j}).  (19)

Note that the distribution of m is policy-dependent. For example, when feedback is performed at every time slot, all elements of m are close to 1 with high probability. In contrast, when the receiver never performs feedback, the marginal distribution of m_t is as shown in (18); in particular, when R = 2, m_t is uniformly distributed for all t. Therefore, verification of (19) requires consideration of all feasible policies, which is intractable. Instead, we provide a heuristic explanation of why {‖g_t‖²} and {m_t} can be treated as independent. The two processes capture different effects of the beamformed system, and each provides little information about the other. Precisely, {‖g_t‖²} describes the channel power process, where ‖g_t‖² depends only on the current channel; {m_t} describes the beamforming gain process and is policy-dependent. For the same channel realization, under different feedback policies we can have various realizations of {m_t}, each revealing little information about the channel power. Conversely, from a realization of {‖g_t‖²}, we cannot obtain information about the beamforming gain process {m_t} either. In the next section, where the feedback control implementation algorithms are discussed, we will revisit this independence assumption and discuss how the algorithms are designed with or without A1.

To summarize, the 2-V state is not an optimal reduced state, as it does not rigorously yield an MDP. Nonetheless, this form of reduced state is still very attractive, because the reduction is highly efficient, with only two state variables, and its approximation as an MDP is reasonable, as explained above.

C. Quantized 2-V State

Even with the 2-V state, a finite state space is desired for designing feedback control algorithms for an MDP problem. To obtain a quantized 2-V state, we consider the quantized state space used in [4]. (It should be noted that, to obtain better performance, a joint quantization scheme would need to be considered; this is beyond the scope of this paper.) For ‖g_t‖², we quantize its value into M levels as T_g ≜ {[0, g_1), [g_1, g_2), ..., [g_{M−1}, +∞)}, with each interval having probability 1/M. The representative points are denoted by {ḡ_1, ..., ḡ_M}, with ḡ_k ≜ E[‖g_t‖² | ‖g_t‖² ∈ [g_{k−1}, g_k)]. For m_t, we quantize its value into N levels of equal length 1/N, i.e., T_m ≜ {[0, 1/N), [1/N, 2/N), ..., [(N−1)/N, 1]}. The range of m_t is quantized with equal length instead of equal probability because the distribution of m_t is policy-dependent and cannot be determined in advance. The representative points, denoted by {m̄_1, ..., m̄_N}, are chosen as the midpoints of the associated intervals. By treating the underlying feedback control problem as a finite-state MDP problem, there exists an optimal stationary policy which is Markovian and deterministic.
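The sketch below (NumPy; M, N, and R are arbitrary illustration values, and the helper names are not from the paper) builds such a quantizer: equal-probability thresholds and conditional-mean representatives for ‖g_t‖², obtained by Monte Carlo from its stationary distribution (a sum of R unit-mean exponentials), and equal-length bins with midpoint representatives for m_t.

```python
import numpy as np

rng = np.random.default_rng(3)
R, M, N = 4, 8, 8
samples = rng.standard_exponential((200_000, R)).sum(axis=1)     # ||g||^2 ~ sum of R Exp(1)

# Equal-probability bins for ||g_t||^2: thresholds at quantiles k/M, k = 1..M-1.
g_thresh = np.quantile(samples, np.arange(1, M) / M)
bin_idx = np.digitize(samples, g_thresh)
g_rep = np.array([samples[bin_idx == k].mean() for k in range(M)])  # conditional means

# Equal-length bins for m_t with midpoint representatives.
m_thresh = np.arange(1, N) / N
m_rep = (np.arange(N) + 0.5) / N

def quantize(g_pow, m):
    """Map a 2-V state (||g||^2, m) to a discrete index in {0, ..., M*N - 1}."""
    i = int(np.digitize(g_pow, g_thresh))
    j = int(np.digitize(m, m_thresh))
    return i * N + j
```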
V. LEARNING ALGORITHMS UNDER THE QUANTIZED 2-V STATE

In this section, based on the quantized 2-V state, we consider implementation algorithms for the feedback control by treating the underlying problem as an MDP problem. Four learning algorithms are discussed, including model-based off-line and on-line algorithms and a model-free on-line algorithm.

A. Model-Based Off-Line PI Algorithms

For an MDP problem, the optimal policy can be obtained through Policy Iteration (PI) [5] if the transition probabilities P(s'|s,a) are known at the receiver. In practice, these transition probabilities can be obtained a priori through off-line learning of the system.

1) Learning of Transition Probabilities: To perform PI, we need to obtain the transition probabilities under feedback and no feedback, i.e., P(s'|s,a) for a = 0, 1, where the time indices are suppressed for ease of notation. Depending on whether the processes {‖g_t‖²} and {m_t} are treated jointly or independently in learning the transition probabilities, we consider joint and independent off-line PI algorithms, respectively. Using the sample transitions of ‖g_t‖², m_t, and (‖g_t‖², m_t) under a_t = 0, P(s'|s,a=0) can be learned via maximum likelihood estimation. For the independent off-line PI algorithm, P(s'|s,a=0) is obtained as the product of the average sample transition probabilities of ‖g_t‖² and m_t; for the joint off-line PI algorithm, P(s'|s,a=0) is obtained as the average sample transition probabilities of (‖g_t‖², m_t). The transition probabilities P(s'|s,a=1) can be derived similarly using sampled ‖g_t‖² and m_t, or (‖g_t‖², m_t), under a_t = 1. Note that, for a slow fading channel, we can assume that m_{t+1} jumps to the highest state N (i.e., the highest normalized beamforming gain) upon a_t = 1. With this assumption, the procedure for generating the transition probabilities for both the joint and independent off-line PI algorithms is summarized in Algorithm 1 in Appendix B.

Since the learning phase incurs a data rate loss, it is desirable to make it as short as possible. For the independent off-line PI algorithm, we need to simultaneously learn two average transition probability matrices, P_g and P_m, whose sizes are M×M and N×N, respectively; on the other hand, for the joint off-line PI algorithm, we need to learn the average transition probability matrix P_{gm}, whose size is MN×MN, which is much larger than that of P_g or P_m. Hence, a longer learning time is needed for the joint off-line PI algorithm to reach a learning accuracy similar to that of the independent one. In other words, the independent off-line PI algorithm is more time efficient.

2) PI for the Optimal Control: Once the transition probabilities are available, we can form the expected total discounted reward (value function) under state s and policy π as

V^π(s) = r(s,a) + λ Σ_{s'} P(s'|s,a) V^π(s'),  (20)

where the action a in state s follows the policy π, and the time indices are suppressed for ease of notation. Denote by π* an optimal policy. Then the optimal value function satisfies the Bellman equation [5]

V^{π*}(s) = max_{a∈A} [ r(s,a) + λ Σ_{s'} P(s'|s,a) V^{π*}(s') ].  (21)

Using PI to obtain the solution to (21) involves alternating between policy evaluation and policy improvement iteratively until the algorithm converges. The detailed PI algorithm for finite state and action spaces can be found in [5]. For finite state and action spaces, PI is guaranteed to converge within a finite number of policy updates. In our simulations, with quantization levels M = N = 4, typically 2 or 3 iterations are needed, while with M = N = 32, typically 7 or 8 iterations are needed. The computational complexity of PI lies in the policy evaluation step, which is of the order O((MN)³). The required storage is mainly determined by the size of the transition probability matrix, which is of the order O((MN)²).
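As a reference point, the following is a generic policy iteration sketch for a finite MDP of this kind (Python/NumPy). Here P and r stand for the learned transition probabilities and the quantized reward table; the function and variable names are illustrative and not the paper's Algorithm 1. Policy evaluation solves the linear system exactly, which is the O((MN)³) step mentioned above.

```python
import numpy as np

def policy_iteration(P, r, lam=0.98, max_iter=100):
    """Policy iteration for a finite MDP.

    P: array (A, S, S) with P[a, s, s'] = P(s'|s, a)
    r: array (S, A) with r[s, a] the per-slot reward
    Returns a deterministic policy (length-S action array) and its value function.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - lam * P_pi) V = r_pi, cf. Eq. (20).
        P_pi = P[policy, np.arange(S), :]
        r_pi = r[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
        # Policy improvement: greedy with respect to the Q-factors, cf. Eq. (21).
        Q = r + lam * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```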

B. Model-Free On-Line EQ Algorithm

In model-free learning, the receiver assumes no prior knowledge of P(s'|s,a). Instead, it learns the optimal policy π* directly, without first estimating P(s'|s,a). In this section, we suggest a model-free algorithm, the Exploratory Q-learning (EQ) algorithm, for the feedback control.

1) Background on Q-Learning: Q-learning is a model-free algorithm in which decisions are made through learning of the optimal Q-factors [27]. Denote by Q^π(s,a) the Q-factor of the state-action pair (s,a) under a given policy π. It is defined as

Q^π(s,a) ≜ r(s,a) + λ Σ_{s'} P(s'|s,a) V^π(s'),  (22)

where V^π(s') is the value function of state s' under policy π. From (22), Q^π(s,a) represents the expected total discounted reward of taking action a in state s and following the fixed policy π thereafter. If π* is an optimal policy, we have V^{π*}(s) = max_{a∈A} Q^{π*}(s,a), which means that π* can be derived from the optimal Q-factors Q^{π*}(s,a).

To learn the optimal Q-factors, at each decision epoch t, the decision maker first observes the current state s, then selects an action according to some strategy and observes the next state s'. The update of the Q-factor at decision epoch t, denoted by Q_t(s,a), is given by

Q_t(s,a) = (1 − α_t(s,a)) Q_{t−1}(s,a) + α_t(s,a)[ r_t(s,a) + λ V_{t−1}(s') ],  if s = s_t and a = a_t;
Q_t(s,a) = Q_{t−1}(s,a),  otherwise,  (23)

where V_{t−1}(s') ≜ max_{a'∈A} Q_{t−1}(s',a'), and α_t(s,a) is the learning rate of the state-action pair (s,a) at time t. Note that, by (23), only the Q-factor associated with the visited state-action pair is updated. It is proven in [27] that, if all state-action pairs are visited infinitely often and the learning rate is appropriately designed, then all Q-factors eventually converge to the optimal ones.

2) EQ Algorithm: The EQ algorithm was first introduced in [28], combining Q-learning with a counter-based directed exploration strategy. Below, we briefly describe how the EQ algorithm chooses the action at each decision epoch; the interested reader can refer to [28] for details. Denote by n_t(s,a) the number of times that the state-action pair (s,a) has been visited up to time t, and by n_t(s) the number of times that state s has been visited up to time t. Let c_t(s,a) reflect the number of times that action a has not been performed in state s since the initial time, and initialize c_0(s,a) = 0 for all s, a. In the EQ algorithm, given the current state s_t, a greedy action â_t is first determined, defined as â_t ≜ argmax_{a∈A} Q_t(s_t,a), i.e., the action with the maximum Q-factor at epoch t; the final action a_t is then determined by the following maximization:

max_{a∈A} g_t(s_t,a) ≜ { c_t(s_t,a)/n_t(s_t,a) + 1, if a = â_t;  c_t(s_t,a)/n_t(s_t,a), otherwise },  (24)

where the addition of 1 reflects the decision maker's preference for the greedy action; it is also possible for the decision maker to take a non-greedy action a ≠ â_t if the associated ratio c_t(s_t,a)/n_t(s_t,a) is large, i.e., if a has not been performed in s_t for a long time, or (s_t,a) has not been visited for a long time. The counter c_t(s_t,a) is updated as

c_{t+1}(s_t,a) = { c_t(s_t,a) + Θ(n_{t+1}(s_t)), if a ≠ a_t;  c_t(s_t,a), otherwise },  (25)

where Θ(n) is a positive function satisfying the conditions

lim_{n→∞} Θ(n) = 0,  and  Σ_{n=1}^∞ Θ(n) = ∞.  (26)
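A minimal sketch combining the tabular Q-learning update (23) with a counter-based directed exploration rule in the spirit of (24)-(26) is given below (Python/NumPy). The env_step interface, the parameter defaults, and the exact form of the counter increment are assumptions for illustration; the update in [28] may differ in detail.

```python
import numpy as np

def eq_learning(env_step, s0, S, A=2, lam=0.98, alpha0=0.1, tau=300.0,
                theta=2.0, T=50_000):
    """Q-learning with counter-based directed exploration (EQ-style sketch).

    env_step(s, a) must return (next_state, reward); states are integers in [0, S).
    """
    Q = np.zeros((S, A))
    n_sa = np.zeros((S, A))          # visits of (s, a)
    n_s = np.zeros(S)                # visits of s
    c = np.zeros((S, A))             # counters of actions not performed
    s = s0
    for _ in range(T):
        a_greedy = int(Q[s].argmax())
        # Directed exploration, Eq. (24): prefer the greedy action by +1.
        score = c[s] / np.maximum(n_sa[s], 1.0)
        score[a_greedy] += 1.0
        a = int(score.argmax())
        s_next, r = env_step(s, a)
        n_s[s] += 1
        n_sa[s, a] += 1
        # Counter update in the spirit of Eq. (25)-(26), with Theta(n) = n^(-1/theta).
        for b in range(A):
            if b != a:
                c[s, b] += n_s[s] ** (-1.0 / theta)
        # Q-factor update, Eq. (23), with learning rate alpha0 * tau / (tau + n(s,a)).
        alpha = alpha0 * tau / (tau + n_sa[s, a])
        Q[s, a] += alpha * (r + lam * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```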
a) Complexity: At every decision epoch, given the state, the updates are over the |A| actions of the current state, so the computational complexity of the EQ algorithm is O(|A|). Since there are only two actions in our problem, the computational complexity of the EQ algorithm is much lower than that of PI. The state-action based quantities, such as the Q-factors, need to be stored for each update, so the required storage of the EQ algorithm is O(|A|MN).

b) Design parameters: For the updates in (23) and (25), there are two design parameters in the EQ algorithm: the learning rate α_t(s,a) and the function Θ(n). In [22], the learning rate is suggested as α_t(s,a) = α_0 τ/(τ + n_t(s,a)), where α_0 is the initial learning rate and τ is some positive parameter. For Θ(n), we set Θ(n) = 1/n^{1/θ}, θ ≥ 1, which can be shown to satisfy the constraints in (26). Note that, from (25), a larger θ results in a larger value of c_{t+1}(s_t,a) for a ≠ a_t (usually a non-greedy action). Therefore, the value of θ affects the trade-off between exploration (a non-greedy action) and exploitation (the greedy action). In Section VI, we will study how to choose θ and α_0. As will be seen through simulation, the parameters θ and α_0 turn out to be important and should be carefully tuned.

C. Model-Based On-Line PI Algorithm

For comparison purposes, we also provide a model-based on-line PI algorithm, which combines PI with the counter-based directed exploration strategy in (24) and (25) of the EQ algorithm [28]. We briefly describe this algorithm below; the details can be found in [28]. In this algorithm, at each decision epoch t, first, the estimates of P(s'|s,a) are updated using the observation history of state-action pairs {(‖g_τ‖², m_τ, a_τ)}_{τ=0}^t, where the processes {‖g_t‖²} and {m_t} are jointly considered; second, with the updated estimates of P(s'|s,a), PI is employed to generate an action, which is treated as the greedy action â_t; third, the final action a_t is determined by (24). Note that the difference between this model-based on-line PI algorithm and the EQ algorithm of Section V-B is that the greedy action â_t in the former is generated by PI, while â_t in the latter is generated by Q-learning, which is model-free. Compared to the model-based joint off-line PI algorithm of Section V-A, in the model-based on-line PI algorithm the estimates of P(s'|s,a) are updated at every decision epoch based on the observation history of state-action pairs. Thus, in using these data, the policy seeks a certain balance between exploration and exploitation. Note that the computational complexity of this on-line PI algorithm is high, as PI is performed at every decision epoch.
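The structure of one decision epoch of this model-based on-line variant can be sketched as follows (Python/NumPy), reusing the policy_iteration sketch given in Section V-A above. The counts-based maximum-likelihood update and the function signature are assumptions for illustration only.

```python
import numpy as np

def online_pi_step(counts, r, s, n_sa, c, lam=0.98):
    """One decision epoch of a model-based on-line PI sketch.

    counts: array (A, S, S) of observed transition counts for each action.
    The greedy action comes from policy iteration on the current estimates of
    P(s'|s,a); the final action follows the counter rule of Eq. (24).
    Requires the policy_iteration function from the earlier sketch.
    """
    # Maximum-likelihood estimates of P(s'|s,a); a tiny prior avoids division by zero.
    P_hat = counts + 1e-12
    P_hat = P_hat / P_hat.sum(axis=2, keepdims=True)
    policy, _ = policy_iteration(P_hat, r, lam)          # greedy action from the model
    a_greedy = int(policy[s])
    score = c[s] / np.maximum(n_sa[s], 1.0)              # exploration scores, Eq. (24)
    score[a_greedy] += 1.0
    return int(score.argmax())
```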

Fig. 1. Design of θ: 2-V EQ algorithm with R = 2, M = N = 4, c = 4 and 12, and θ = 1, 2, and 3 (expected total discounted reward vs. time slot).

Fig. 2. Design of α_0: 2-V EQ algorithm with R = 2 and 4, M = N = 4 and 8, and different values of α_0 (expected total discounted reward vs. feedback cost).

VI. PERFORMANCE STUDY OF THE PROPOSED LEARNING ALGORITHMS

In this section, we study and compare the performances of the four learning algorithms proposed in the previous section. In all experiments, we set the discount factor λ = 0.98, the normalized Doppler frequency f_D T_s = 0.01, and the power at the transmitter P = 10 dBW (except where otherwise mentioned). The expected total discounted reward in (4) is approximated by the sample mean of the total discounted rewards over 400 independently generated channel sequences. For the n-th generated channel sequence, the total discounted reward beginning at time t is approximated by V̂_t^{(n)} ≜ Σ_{τ=t}^{t+Δt} λ^{τ−t} r_τ^{(n)}, where Δt = 350. Note that V̂_t^{(n)} is an approximation in the sense that Δt is set to a finite number instead of infinity. The parameter τ in the learning rate α_t(s,a) of the EQ algorithm is set to 300. We consider a range of feedback costs c ∈ {4, 8, 12, 20, 40, 60}.
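The truncated-sum estimator used here can be written compactly; the following sketch (NumPy; reward_seqs is a hypothetical array holding simulated per-slot net throughputs) mirrors that computation.

```python
import numpy as np

def estimate_discounted_reward(reward_seqs, t=0, horizon=350, lam=0.98):
    """Approximate E[sum_t lam^t r_t] by a truncated sum averaged over sample paths.

    reward_seqs: array (num_sequences, T) of per-slot net throughputs r_t.
    """
    window = np.asarray(reward_seqs)[:, t:t + horizon + 1]
    weights = lam ** np.arange(window.shape[1])          # lam^(tau - t)
    return (window * weights).sum(axis=1).mean()         # sample mean over sequences
```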
A. Model-Based versus Model-Free Algorithms

We first provide guidelines on choosing the parameters θ and α_0 in the EQ algorithm, and then compare the performances of the EQ algorithm and the on-line PI algorithm.

1) EQ Algorithm, Design of θ and α_0: Let R = 2 and M = N = 4. The feedback cost is set to c = 4 or 12. In Fig. 1, we show the expected total discounted reward under the EQ algorithm versus the time slot for θ = 1, 2, and 3, respectively. Recall that θ controls the exploration-exploitation trade-off (non-greedy vs. greedy actions) in the learning algorithm. Each curve in Fig. 1 shows the convergence behaviour for a specified value of θ under the EQ algorithm. From Fig. 1, we see that, when the feedback cost is low, i.e., c = 4, a balanced exploration-exploitation setting (θ = 2) results in the best receiver beamforming performance, while choosing the greedy action for exploitation with little exploration (θ = 1) results in the lowest performance. When the feedback cost becomes high, i.e., c = 12, however, θ = 1 results in the best beamforming performance; in this case, choosing the greedy action for more exploitation of the immediate net throughput given in (3) gives better performance. In addition, more exploration prolongs the convergence time of the EQ algorithm. In summary, the choice of θ for the best performance depends on the feedback cost. This is consistent with the intuition that, with a lower feedback cost, the system benefits from more exploration under non-greedy actions to improve the overall net throughput; when the cost increases, such exploration becomes costly, and exploitation of the immediate maximum net throughput under the greedy action is preferred.

In Fig. 2, for R = 2 and 4, we investigate the relationship between the initial learning rate α_0 and the quantization levels M and N. We plot the approximated expected total discounted reward versus the feedback cost c, where θ is chosen based on the guideline in the previous paragraph. When M = N = 4, a particular value of α_0 is found to result in the best performance. However, when M = N = 8, we see that the same α_0 leads to worse performance, especially in the high feedback cost region. For example, when c = 60, the degradation is 3% (R = 2) and 5% (R = 4), respectively. Instead, a different initial learning rate α_0 results in the best performance for M = N = 8. Nonetheless, compared to M = N = 4, the improvement is limited. From Fig. 2, it can be seen that the initial learning rate α_0 should be adjusted according to the quantization levels; specifically, α_0 should be smaller for higher quantization levels M and N. Intuitively, higher quantization levels result in a larger state space and hence fewer observation samples per state, so a smaller learning rate is required.

2) EQ Algorithm versus On-Line PI Algorithm: For R = 2 and 4, in Fig. 3 we compare the performances of the EQ algorithm and the on-line PI algorithm at various feedback costs for quantization levels M = N = 4. Recall that the only difference between the EQ algorithm and the model-based on-line PI algorithm is how the greedy action â_t is generated.

Fig. 3. EQ algorithm versus on-line PI algorithm: R = 2 and 4, M = N = 4 (expected total discounted reward vs. feedback cost).

Fig. 4. Joint off-line PI algorithm versus independent off-line PI algorithm: R = 2, P = 10 dBW, and M = N = 4, 8, 16, and 32 (expected total discounted reward vs. feedback cost).

From Fig. 3, we see that the EQ algorithm outperforms the on-line PI algorithm in the medium-to-high cost range, showing the advantage of the EQ algorithm over the on-line PI algorithm. This advantage can be explained by the fact that the model-free algorithm is essentially reward-driven, and hence avoids constructing transition probabilities that might mismatch those of the actual system.

B. Independent versus Joint Off-Line PI Algorithm

Recall from Section IV that it is challenging to determine the dependency between the two processes {‖g_t‖²} and {m_t}. We have thus proposed two off-line PI algorithms treating the two processes either jointly or independently. In this section, we compare the performances of these two off-line PI algorithms. We consider two scenarios: a fixed transmit power constraint with various feedback costs, and various power constraints with a fixed feedback cost.

1) Effect of Feedback Costs: In Fig. 4, we compare the expected total discounted reward of the joint and independent off-line PI algorithms under various feedback costs when the transmit power is fixed to P = 10 dBW. We set R = 2 and the quantization levels M = N = 4, 8, 16, and 32, respectively. As expected, the joint off-line PI algorithm outperforms the independent one, which confirms that {‖g_t‖²} and {m_t} are indeed correlated. As the quantization levels M and N increase, the performance of both algorithms improves, while the performance gap between the two algorithms shrinks. For M = N = 32, the performance gap becomes negligible, indicating that treating the two processes {‖g_t‖²} and {m_t} as independent is reasonable. For comparison, the performance of the EQ algorithm with M = N = 4 is also shown. Surprisingly, we observe that the EQ algorithm gives the best performance compared with all off-line PI algorithms, especially in the high feedback cost region.

Fig. 5. Joint off-line PI algorithm versus independent off-line PI algorithm: R = 4, P = 10 dBW, and M = N = 4, 8, 16, and 32 (expected total discounted reward vs. feedback cost).

In addition, in Fig. 4 we also include the EQ algorithm based on the original 4R-V state given in Section III, which leads to a valid MDP problem. A random codebook of size |C| = 256 is used, where the average quantization level per real variable is L = 32. Compared with the performance under the 2-V state, the performance of the 4R-V EQ algorithm is uniformly below that of the independent off-line PI algorithm with M = N = 16.
This observation reveals that, although the 4R-V state is optimal, there is a large performance loss due to quantization; in contrast, the 2-V state is more efficient. The same experiments are conducted for R = 4 in Fig. 5, where behaviour similar to the R = 2 case is observed. In particular, the performance gap between the joint and independent off-line PI algorithms is very small for M = N = 16 and 32, which again indicates that the independence of {‖g_t‖²} and {m_t} is a reasonable assumption.


More information

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v2 [cs.ni] 21 Nov 2008

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information 204 IEEE International Symposium on Information Theory Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information Omur Ozel, Kaya Tutuncuoglu 2, Sennur Ulukus, and Aylin Yener

More information

Power Allocation over Two Identical Gilbert-Elliott Channels

Power Allocation over Two Identical Gilbert-Elliott Channels Power Allocation over Two Identical Gilbert-Elliott Channels Junhua Tang School of Electronic Information and Electrical Engineering Shanghai Jiao Tong University, China Email: junhuatang@sjtu.edu.cn Parisa

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading

Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading Minimum Feedback Rates for Multi-Carrier Transmission With Correlated Frequency Selective Fading Yakun Sun and Michael L. Honig Department of ECE orthwestern University Evanston, IL 60208 Abstract We consider

More information

Appendix B Information theory from first principles

Appendix B Information theory from first principles Appendix B Information theory from first principles This appendix discusses the information theory behind the capacity expressions used in the book. Section 8.3.4 is the only part of the book that supposes

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Ather Gattami Ericsson Research Stockholm, Sweden Email: athergattami@ericssoncom arxiv:50502997v [csit] 2 May 205 Abstract

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS

ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS ON BEAMFORMING WITH FINITE RATE FEEDBACK IN MULTIPLE ANTENNA SYSTEMS KRISHNA KIRAN MUKKAVILLI ASHUTOSH SABHARWAL ELZA ERKIP BEHNAAM AAZHANG Abstract In this paper, we study a multiple antenna system where

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM?

HOW TO CHOOSE THE STATE RELEVANCE WEIGHT OF THE APPROXIMATE LINEAR PROGRAM? HOW O CHOOSE HE SAE RELEVANCE WEIGH OF HE APPROXIMAE LINEAR PROGRAM? YANN LE ALLEC AND HEOPHANE WEBER Abstract. he linear programming approach to approximate dynamic programming was introduced in [1].

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback

Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback Performance Analysis of Multiple Antenna Systems with VQ-Based Feedback June Chul Roh and Bhaskar D. Rao Department of Electrical and Computer Engineering University of California, San Diego La Jolla,

More information

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Mohammad H. Yarmand and Douglas G. Down Department of Computing and Software, McMaster University, Hamilton, ON, L8S

More information

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio

Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v4 [cs.ni] 6 Feb 2010

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Approximately achieving the feedback interference channel capacity with point-to-point codes

Approximately achieving the feedback interference channel capacity with point-to-point codes Approximately achieving the feedback interference channel capacity with point-to-point codes Joyson Sebastian*, Can Karakus*, Suhas Diggavi* Abstract Superposition codes with rate-splitting have been used

More information

Limited Feedback in Wireless Communication Systems

Limited Feedback in Wireless Communication Systems Limited Feedback in Wireless Communication Systems - Summary of An Overview of Limited Feedback in Wireless Communication Systems Gwanmo Ku May 14, 17, and 21, 2013 Outline Transmitter Ant. 1 Channel N

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Pritam Mukherjee Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College Park, MD 074 pritamm@umd.edu

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS. Pratik Patil, Binbin Dai, and Wei Yu

PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS. Pratik Patil, Binbin Dai, and Wei Yu PERFORMANCE COMPARISON OF DATA-SHARING AND COMPRESSION STRATEGIES FOR CLOUD RADIO ACCESS NETWORKS Pratik Patil, Binbin Dai, and Wei Yu Department of Electrical and Computer Engineering University of Toronto,

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance

A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance A Framework for Training-Based Estimation in Arbitrarily Correlated Rician MIMO Channels with Rician Disturbance IEEE TRANSACTIONS ON SIGNAL PROCESSING Volume 58, Issue 3, Pages 1807-1820, March 2010.

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

How Much Training and Feedback are Needed in MIMO Broadcast Channels?

How Much Training and Feedback are Needed in MIMO Broadcast Channels? How uch raining and Feedback are Needed in IO Broadcast Channels? ari Kobayashi, SUPELEC Gif-sur-Yvette, France Giuseppe Caire, University of Southern California Los Angeles CA, 989 USA Nihar Jindal University

More information

Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO

Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO Heterogeneous Doppler Spread-based CSI Estimation Planning for TDD Massive MIMO Salah Eddine Hajri, Maialen Larrañaga, Mohamad Assaad *TCL Chair on 5G, Laboratoire des Signaux et Systemes L2S, CNRS, CentraleSupelec,

More information

Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling

Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling Outage Probability for Two-Way Solar-Powered Relay Networks with Stochastic Scheduling Wei Li, Meng-Lin Ku, Yan Chen, and K. J. Ray Liu Department of Electrical and Computer Engineering, University of

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

arxiv:cs/ v1 [cs.it] 11 Sep 2006

arxiv:cs/ v1 [cs.it] 11 Sep 2006 0 High Date-Rate Single-Symbol ML Decodable Distributed STBCs for Cooperative Networks arxiv:cs/0609054v1 [cs.it] 11 Sep 2006 Zhihang Yi and Il-Min Kim Department of Electrical and Computer Engineering

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Reinforcement Learning. Summer 2017 Defining MDPs, Planning

Reinforcement Learning. Summer 2017 Defining MDPs, Planning Reinforcement Learning Summer 2017 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation CS234: RL Emma Brunskill Winter 2018 Material builds on structure from David SIlver s Lecture 4: Model-Free

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel

Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel 6464 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 10, OCTOBER 2013 Statistical Beamforming on the Grassmann Manifold for the Two-User Broadcast Channel Vasanthan Raghavan, Senior Member, IEEE,

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Alkan Soysal Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland,

More information

Vector Channel Capacity with Quantized Feedback

Vector Channel Capacity with Quantized Feedback Vector Channel Capacity with Quantized Feedback Sudhir Srinivasa and Syed Ali Jafar Electrical Engineering and Computer Science University of California Irvine, Irvine, CA 9697-65 Email: syed@ece.uci.edu,

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems

Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems 1130 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 63, NO. 5, MARCH 1, 2015 Incremental Grassmannian Feedback Schemes for Multi-User MIMO Systems Ahmed Medra and Timothy N. Davidson Abstract The communication

More information

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation

A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation A POMDP Framework for Cognitive MAC Based on Primary Feedback Exploitation Karim G. Seddik and Amr A. El-Sherif 2 Electronics and Communications Engineering Department, American University in Cairo, New

More information

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS June Chul Roh and Bhaskar D Rao Department of Electrical and Computer Engineering University of California, San Diego La Jolla, CA 9293 47,

More information

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels 1 Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels arxiv:1410.2861v1 [cs.it] 10 Oct 2014 Zhe Wang, Vaneet Aggarwal, Xiaodong

More information

Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback

Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback Spatial and Temporal Power Allocation for MISO Systems with Delayed Feedback Venkata Sreekanta Annapureddy 1 Srikrishna Bhashyam 2 1 Department of Electrical and Computer Engineering University of Illinois

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems

Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th,

More information

K User Interference Channel with Backhaul

K User Interference Channel with Backhaul 1 K User Interference Channel with Backhaul Cooperation: DoF vs. Backhaul Load Trade Off Borna Kananian,, Mohammad A. Maddah-Ali,, Babak H. Khalaj, Department of Electrical Engineering, Sharif University

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS

STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS STRUCTURE AND OPTIMALITY OF MYOPIC SENSING FOR OPPORTUNISTIC SPECTRUM ACCESS Qing Zhao University of California Davis, CA 95616 qzhao@ece.ucdavis.edu Bhaskar Krishnamachari University of Southern California

More information

On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels

On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels On the Required Accuracy of Transmitter Channel State Information in Multiple Antenna Broadcast Channels Giuseppe Caire University of Southern California Los Angeles, CA, USA Email: caire@usc.edu Nihar

More information

Communication constraints and latency in Networked Control Systems

Communication constraints and latency in Networked Control Systems Communication constraints and latency in Networked Control Systems João P. Hespanha Center for Control Engineering and Computation University of California Santa Barbara In collaboration with Antonio Ortega

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Low-High SNR Transition in Multiuser MIMO

Low-High SNR Transition in Multiuser MIMO Low-High SNR Transition in Multiuser MIMO Malcolm Egan 1 1 Agent Technology Center, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic. 1 Abstract arxiv:1409.4393v1

More information

Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems

Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems Exploiting Quantized Channel Norm Feedback Through Conditional Statistics in Arbitrarily Correlated MIMO Systems IEEE TRANSACTIONS ON SIGNAL PROCESSING Volume 57, Issue 10, Pages 4027-4041, October 2009.

More information

Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks

Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks 1 Stochastic Content-Centric Multicast Scheduling for Cache-Enabled Heterogeneous Cellular Networks Bo Zhou, Ying Cui, Member, IEEE, and Meixia Tao, Senior Member, IEEE Abstract Caching at small base stations

More information