Power Allocation in Multi-user Cellular Networks With Deep Q Learning Approach


Fan Meng, Peng Chen and Lenan Wu

arXiv: v1 [cs.IT] 7 Dec 2018

Abstract

The model-driven power allocation (PA) algorithms in wireless cellular networks with interfering multiple-access channel (IMAC) have been investigated for decades. Nowadays, data-driven model-free machine learning-based approaches are being rapidly developed in this field, and among them deep reinforcement learning (DRL) has shown great potential. Different from supervised learning, DRL takes advantage of exploration and exploitation to maximize the objective function under certain constraints. In this paper, we propose a two-step training framework. First, with off-line learning in a simulated environment, a deep Q network (DQN) is trained with the deep Q learning (DQL) algorithm, which is designed to be consistent with this PA problem. Second, the DQN is further fine-tuned with real data in an on-line training procedure. The simulation results show that the proposed DQN achieves the highest averaged sum-rate, compared with DQNs obtained from the present DQL training. With different user densities, our DQN outperforms benchmark algorithms, and thus a good generalization ability is verified.

Index Terms

Deep reinforcement learning, deep Q learning, interfering multiple-access channel, power allocation.

I. INTRODUCTION

Data transmission in wireless communication networks has experienced explosive growth in recent decades and will keep rising in the future. The user density is greatly increasing, resulting in a critical demand for more capacity and spectral efficiency. Therefore, both intra-cell and inter-cell interference management are significant for improving the overall capacity of a cellular network system. The problem of maximizing a generic sum-rate is studied in this paper; it is non-convex and NP-hard, and cannot be solved efficiently.
Various model-driven algorithms have been proposed in the literature for PA problems, such as fractional programming (FP) [1], weighted MMSE (WMMSE) [2] and some others [3], [4]. Excellent performance can be observed through theoretical analysis and numerical simulations, but serious obstacles are faced in practical deployments [5]. First, these techniques rely heavily on tractable mathematical models, which are imperfect in real communication scenarios with specific user distributions, geographical environments, etc. Second, the computational complexities of these algorithms are high.

In recent years, machine learning (ML)-based approaches have been rapidly developed in wireless communications [6]. These algorithms are usually model-free, and are compliant with optimizations in practical communication scenarios. Additionally, with the development of graphics processing units (GPUs) and specialized chips, execution can be both fast and energy-efficient, which lays a solid foundation for massive applications.

Two main branches of ML, supervised learning and reinforcement learning (RL) [7], are briefly introduced here. With supervised learning, a deep neural network (DNN) is trained to approximate some given optimal (or suboptimal) objective algorithm, and it has been realized in some applications [8]-[10]. However, the target algorithm is usually unavailable and the performance of the DNN is bounded by the supervisor. Therefore, RL has received widespread attention, due to its nature of interacting with an unknown environment by exploration and exploitation. Q learning is the most well-studied RL algorithm, and it is exploited to cope with power allocation (PA) in [11]-[13], and some others [14].

Fan Meng and Lenan Wu are with the School of Information Science and Engineering, Southeast University, Nanjing, China (mengxiaomaomao@outlook.com, wuln@seu.edu.cn). Peng Chen is with the State Key Laboratory of Millimeter Waves, Southeast University, Nanjing, China (chenpengseu@seu.edu.cn).
The DNN trained with Q learning is called a deep Q network (DQN), and it has been proposed to address the distributed downlink single-user PA problem [15]. In this paper, we extend the work in [15] and investigate the PA problem in cellular cells with multiple users. The design of the DQN model is discussed and introduced. Simulation results show that our DQN outperforms the present DQNs and the benchmark algorithms. The contributions of this work are summarized as follows:

- A model-free two-step training framework is proposed. The DQN is first trained off-line with the DRL algorithm in simulated scenarios. Second, the learned DQN can be further dynamically optimized in real communication scenarios, with the aid of transfer learning.
- The PA problem using deep Q learning (DQL) is discussed, then a DQN-enabled approach is proposed to be trained with the current sum-rate as the reward function, including no future reward. The input features are well designed to help the DQN get closer to the optimal solution.
- After centralized training, the proposed DQN is tested by distributed execution. The averaged sum-rate of the DQN outperforms the model-driven algorithms, and it also shows good generalization ability in a series of benchmark simulation tests.

The remainder of this paper is organized as follows. Section II outlines the PA problem in the wireless cellular network with IMAC. In Section III our proposed DQN is introduced in detail. Then, this DQN is tested in distinct scenarios, along

with benchmark algorithms, and the simulation results are analyzed in Section IV. Conclusions and discussion are given in Section V.

II. SYSTEM MODEL

The problem of PA in the cellular network with interfering multiple-access channel (IMAC) is considered. In a system with $N$ cells, at the center of each cell a base station (BS) simultaneously serves $K$ users with shared frequency bands. A simple network example is shown in Fig. 1. At time slot $t$, the independent channel coefficient between the $n$-th BS and the user $k$ in cell $j$ is denoted by $g^t_{n,j,k}$, and can be expressed as

$$g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta^t_{n,j,k}, \tag{1}$$

where $h^t_{n,j,k}$ is the small-scale complex flat fading element, and $\beta^t_{n,j,k}$ is the large-scale fading component taking account of both the geometric attenuation and the shadow fading. Therefore, the signal-to-interference-plus-noise ratio (SINR) of this link can be described by

$$\mathrm{sinr}^t_{n,k} = \frac{g^t_{n,n,k}\, p^t_{n,k}}{\sum_{k' \neq k} g^t_{n,n,k}\, p^t_{n,k'} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} + \sigma^2}, \tag{2}$$

where $D_n$ is the set of interference cells around the $n$-th cell, $p$ is the emitting power of the BS, and $\sigma^2$ denotes the additional noise power. With normalized bandwidth, the downlink rate of this link is given as

$$C^t_{n,k} = \log_2\left(1 + \mathrm{sinr}^t_{n,k}\right). \tag{3}$$

The optimization target is to maximize this generic sum-rate objective function under the maximum power constraint, and it is formulated as

$$\max_{\mathbf{p}^t} \sum_{n} \sum_{k} C^t_{n,k} \quad \text{s.t.} \quad 0 \le p^t_{n,k} \le P_{\max}, \ \forall n, k, \tag{4}$$

where $\mathbf{p}^t = \{p^t_{n,k} \mid \forall n, k\}$, and $P_{\max}$ denotes the maximum emitting power. We also define the sum-rate $C^t = \sum_n \sum_k C^t_{n,k}$, the rate set $\mathbf{C}^t = \{C^t_{n,k} \mid \forall n, k\}$, and the channel state information (CSI) $\mathbf{g}^t = \{g^t_{n,j,k} \mid \forall n, j, k\}$. This problem is non-convex and NP-hard, so we propose a data-driven learning algorithm based on the DQN model in the following section.

III. DEEP Q NETWORK

A. Background

Q learning is one of the most popular RL algorithms aiming to deal with Markov decision process (MDP) problems [16]. At time instant $t$, by observing the state $s^t \in \mathcal{S}$, the agent takes action $a^t \in \mathcal{A}$ and interacts with the environment, and then obtains the reward $r^t$ and the next state $s^{t+1}$. The notations $\mathcal{A}$ and $\mathcal{S}$ are the action set and the state set, respectively.
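To make the system model of Section II concrete, the following NumPy sketch evaluates the SINR in (2) and the rates in (3) for given channel gains and powers. It assumes that the interference consists of the intra-cell sum over the other users of BS n plus the inter-cell sum over the BSs in D_n. The array layout, the function name `downlink_rates` and the `neighbors` list standing in for D_n are illustrative assumptions, not the authors' simulation code.

```python
import numpy as np

def downlink_rates(g, p, noise_power, neighbors):
    """Per-link rates C[n, k] following equations (2) and (3).

    g[m, n, k]   : gain from BS m to user k of cell n, shape (N, N, K)
    p[n, k]      : power BS n assigns to its user k, shape (N, K)
    neighbors[n] : indices of the interfering cells D_n (assumed layout)
    """
    num_cells, num_users = p.shape
    rates = np.zeros((num_cells, num_users))
    total_power = p.sum(axis=1)  # sum_j p[m, j] for every BS m
    for n in range(num_cells):
        for k in range(num_users):
            signal = g[n, n, k] * p[n, k]
            # Other users of the same BS share the channel of the desired link.
            intra = g[n, n, k] * (total_power[n] - p[n, k])
            # Neighbouring BSs interfere with their total transmit power.
            inter = sum(g[m, n, k] * total_power[m] for m in neighbors[n])
            sinr = signal / (intra + inter + noise_power)
            rates[n, k] = np.log2(1.0 + sinr)
    return rates

# Toy example: two cells, one user each, symmetric cross gains.
g_toy = np.array([[[1.0], [0.5]],
                  [[0.5], [1.0]]])
p_toy = np.array([[1.0], [1.0]])
toy_rates = downlink_rates(g_toy, p_toy, noise_power=0.5, neighbors=[[1], [0]])
sum_rate = toy_rates.sum()  # objective of (4) for this power choice
```

Summing the returned array over all links gives the sum-rate that the optimization problem (4) maximizes.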
Since $\mathcal{S}$ can be continuous, the DQN is proposed to combine Q learning with a flexible DNN to settle the infinite state space. The cumulative discounted reward function is given as

$$R^t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r^{t+\tau+1}, \tag{5}$$

where $\gamma \in [0, 1)$ is a discount factor that trades off the importance of immediate and future rewards, and $r$ denotes the reward. Under a certain policy $\pi$, the Q function of the agent with an action $a$ in state $s$ is given as

$$Q_{\pi}(s, a; \theta) = E_{\pi}\left[ R^t \mid s^t = s, a^t = a \right], \tag{6}$$

where $\theta$ denotes the DQN parameters, and $E[\cdot]$ is the expectation operator. Q learning is concerned with how agents ought to interact with an unknown environment so as to maximize the Q function. The maximization of (6) is equivalent to the Bellman optimality equation [17], which is described as

$$y^t = r^t + \gamma \max_{a'} Q(s^{t+1}, a'; \theta^t), \tag{7}$$

where $y^t$ is the optimal Q value. The DQN is trained to approximate the Q function, and the standard Q learning update of the parameters $\theta$ is described as

$$\theta^{t+1} = \theta^t + \eta \left( y^t - Q(s^t, a^t; \theta^t) \right) \nabla Q(s^t, a^t; \theta^t), \tag{8}$$

where $\eta$ is the learning rate. This update resembles stochastic gradient descent, gradually updating the current value $Q(s^t, a^t; \theta^t)$ towards the target $y^t$. The experience data of the agent is stored as $(s^t, a^t, r^t, s^{t+1})$. The DQN is trained with batches of recorded data randomly sampled from the experience replay memory, which is a first-in first-out queue.

Fig. 1. An illustrative example of a multi-user cellular network with 9 cells. In each cell, a BS serves 2 users simultaneously. (Axes: X axis (km), Y axis (km); markers: BS, User.)

B. Discussion on DRL

In many applications such as playing video games [16], where the current strategy has a long-term impact on the cumulative reward, the DQN achieves remarkable results and beats humans. However, the discount factor is suggested to be zero in this

PA problem. The DQL aims to maximize the Q function. Let $\gamma = 0$; from (6) we have

$$\max Q^t = \max_{a^t \in \mathcal{A}} E_{\pi}\left[ r^t \mid s = s^t, a = a^t \right]. \tag{9}$$

For a PA problem, clearly $s^t = \mathbf{g}^t$ and $a^t = \mathbf{p}^t$. Then we let $r^t = C^t$ and get

$$\max Q^t = \max_{0 \le \mathbf{p}^t \le P_{\max}} E_{\pi}\left[ C^t \mid \mathbf{g}^t, \mathbf{p}^t \right]. \tag{10}$$

In the execution period the policy is deterministic, and thus (10) can be written as

$$\max Q^t = \max_{0 \le \mathbf{p}^t \le P_{\max}} C^t(\mathbf{g}^t, \mathbf{p}^t), \tag{11}$$

which is an equivalent form of (4). In this inference process we assume that $\gamma = 0$ and $r^t = C^t$, indicating that the optimal solution to (4) is identical to that of (6) under these two conditions.

As shown in Fig. 2, it is well known that the optimal solution $\mathbf{p}^t$ of (4) is only determined by the current CSI $\mathbf{g}^t$, and the sum-rate $C^t$ is calculated with $(\mathbf{g}^t, \mathbf{p}^t)$. Theoretically, the optimal power $\mathbf{p}^t$ can be obtained using a DQN whose input is just $\mathbf{g}^t$. In fact, the performance of a DQN designed this way is poor, since the problem is non-convex and the optimal point is hard to find. Therefore, we propose to utilize two more auxiliary features: $\mathbf{C}^{t-1}$ and $\mathbf{p}^{t-1}$. Since the channel can be modeled as a first-order Markov process, the solution of the last time period can help the DQN get closer to the optimum, and (11) can be rewritten as

$$\max Q^t = \max_{0 \le \mathbf{p}^t \le P_{\max}} C^t(\mathbf{g}^t, \mathbf{p}^t, \mathbf{C}^{t-1}, \mathbf{p}^{t-1}). \tag{12}$$

Once $\gamma = 0$ and $r^t = C^t$, (7) is simplified to $y^t = C^t$, and the replay memory is also reduced to $(s^t, a^t, r^t)$. The DQN works as an estimator to predict the current sum-rate of the corresponding power levels with a certain CSI. These discussions provide good guidance for the following DQN design.

Fig. 2. The solution of DQN is determined by CSI $\mathbf{g}^t$, along with downlink rate $\mathbf{C}^{t-1}$ and transmitting power $\mathbf{p}^{t-1}$.

C. DQN Design in Cellular Network

In our proposed model-free two-step training framework, the DQN is first pre-trained off-line with the DRL algorithm in a simulated wireless communication system. This procedure reduces the on-line training stress, which stems from the large data requirement that data-driven algorithms have by nature. Second, with the aid of transfer learning, the learned DQN can be further dynamically fine-tuned in real scenarios.
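The effect of setting γ = 0, discussed in Section III-B, can be illustrated with a short sketch: the Bellman target (7) collapses to the immediate reward, so the network only has to regress the current sum-rate, and execution picks the power level with the largest predicted value. The helper names `q_target` and `greedy_power` are hypothetical and only serve this illustration.

```python
import numpy as np

def q_target(reward, next_q_values, gamma):
    """Bellman target of equation (7): y = r + gamma * max_a' Q(s', a')."""
    return reward + gamma * np.max(next_q_values)

def greedy_power(q_values, power_levels):
    """Return the power level whose predicted sum-rate is the highest."""
    return power_levels[int(np.argmax(q_values))]

reward = 3.2                       # current sum-rate used as reward
next_q = np.array([1.0, 4.0])      # predictions for the next state
y_zero = q_target(reward, next_q, gamma=0.0)   # equals the reward itself
y_half = q_target(reward, next_q, gamma=0.5)   # also counts the future term

chosen = greedy_power(np.array([0.1, 0.9, 0.4]), np.array([0.0, 1.0, 2.0]))
```

With γ = 0 the next state never enters the target, which is why the replay memory can drop it and store only the state, action and reward.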
Since the practical wireless communication system is dynamic and influenced by unknown issues, the data-driven algorithm is believed to be a promising technique. We only outline the two-step framework here; the rest of this manuscript focuses mainly on the first training step.

In a certain cellular network, each BS-user link is regarded as an agent and thus a multi-agent system is studied. However, multi-agent training is difficult since it needs much more learning data, training time and DNN parameters. Therefore, centralized training is considered, and only one agent is trained by using all agents' experience replay memory. Then, this agent's learned policy is shared in the distributed execution period. For our designed DQN, the components of the replay memory are introduced as follows.

1) State: The state design for a certain agent $(n, k)$ is important, since the full environment information is redundant and irrelevant elements must be removed. The agent is assumed to have the corresponding perfect instant CSI information in (2), and we define the logarithmic normalized interferer set $\Gamma^t_{n,k}$ as

$$\Gamma^t_{n,k} = \Big\{ \underbrace{1, \dots, 1}_{K-1} \Big\} \cup \Big\{ \log_2\Big( 1 + \frac{g^t_{n',n,k}}{g^t_{n,n,k}} \Big) \ \Big|\ n' \in D_n, \forall j \Big\}. \tag{13}$$

The channel amplitudes of the interferers are normalized by that of the desired link, and the logarithmic representation is preferred since channel amplitudes often vary by orders of magnitude. The $K - 1$ unit entries correspond to the intra-cell interferers, which share the channel of the desired link, so their normalized gain is 1 and $\log_2(1 + 1) = 1$. The cardinality of $\Gamma^t_{n,k}$ is $(|D_n| + 1)K - 1$. To further decrease the input dimension and reduce the computational complexity, the elements in $\Gamma^t_{n,k}$ are sorted in decreasing order and only the first $C$ elements are kept (here $C$ denotes the number of retained interferers). As discussed in Section III-B, the downlink rates $\mathbf{C}^{t-1}_{n,k}$ and transmitting powers $\mathbf{p}^{t-1}_{n,k}$ at the last time slot, of both these retained components and this link, are the two additional parts of the input to our DQN. Therefore, the state is composed of three features: $s^t_{n,k} = \{\Gamma^t_{n,k}, \mathbf{C}^{t-1}_{n,k}, \mathbf{p}^{t-1}_{n,k}\}$. The cardinality of the state, i.e., the input dimension of the DQN, is $|\mathcal{S}| = 3C + 2$.

2) Action: In (4) the downlink power is a continuous variable, and is only constrained by the maximum power constraint.
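Since the action space of the DQN must be finite, the possible emitting power is quantized into $|\mathcal{A}|$ levels. The allowed power set is given as

$$\mathcal{A} = \Big\{ 0,\ P_{\min},\ P_{\min} \Big( \frac{P_{\max}}{P_{\min}} \Big)^{\frac{1}{|\mathcal{A}| - 2}},\ \dots,\ P_{\max} \Big\}, \tag{14}$$

where $P_{\min}$ is the non-zero minimum emitting power.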
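As a small illustration of (14), the sketch below builds the allowed power set as one zero level followed by geometrically spaced levels from P_min to P_max. The function name `power_levels_watt` and the conversion from dBm to watts are assumptions made for this example; the numbers used (5 dBm, 38 dBm and 10 levels) are the values quoted later in Section IV-A.

```python
import numpy as np

def power_levels_watt(p_min_dbm, p_max_dbm, num_levels):
    """Allowed power set of equation (14), returned in watts.

    One level is zero; the remaining num_levels - 1 levels are spaced
    geometrically between P_min and P_max (requires num_levels >= 3).
    """
    p_min = 10.0 ** ((p_min_dbm - 30.0) / 10.0)  # dBm -> W
    p_max = 10.0 ** ((p_max_dbm - 30.0) / 10.0)
    # Exponents 0, 1/(|A|-2), ..., 1 applied to the ratio P_max / P_min.
    exponents = np.arange(num_levels - 1) / (num_levels - 2)
    nonzero = p_min * (p_max / p_min) ** exponents
    return np.concatenate(([0.0], nonzero))

levels = power_levels_watt(5.0, 38.0, 10)
print(levels)
```

Geometric spacing keeps the ratio between neighbouring non-zero levels constant, which corresponds to equal steps on the dB scale.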

3) Reward: In some manuscripts the reward function is elaborately designed to improve the agent's transmitting rate and also mitigate the interference influence. However, most of these reward functions are suboptimal approaches to the target function of (4). In our paper, $C^t$ is directly used as the reward function, and it is shared by all agents. In the training simulations with small or medium scale cellular networks, this simple method proves to be feasible.

TABLE I
HYPER-PARAMETERS SETUP OF DQN TRAINING

Parameter                  Value     Parameter     Value
Number of T per episode    50        Initial η     10^-3
Observe episode number     100       Final η       10^-4
Explore episode number     9900      Initial ε     0.2
Train interval             10        Final ε       10^-4
Memory size                          Batch size    256

IV. SIMULATION RESULTS

A. Simulation Configuration

A cellular network with $N = 25$ cells is simulated. At the center of each cell, a BS is deployed to synchronously serve $K = 4$ users, which are located uniformly and randomly within the cell range $r \in [R_{\min}, R_{\max}]$, where $R_{\min} = 0.01$ km and $R_{\max} = 1$ km are the inner space and the half cell-to-cell distance, respectively. The small-scale fading is simulated to be Rayleigh distributed, and the Jakes model is adopted with Doppler frequency $f_d = 10$ Hz and time period $T = 20$ ms. According to the LTE standard, the large-scale fading is modeled as $\beta = \log_{10}(d) + 10 \log_{10}(z)$ dB, where $z$ is a log-normal random variable with standard deviation being 8 dB, and $d$ is the transmitter-to-receiver distance (km). The AWGN power $\sigma^2$ is $-114$ dBm, and the emitting power constraints $P_{\min}$ and $P_{\max}$ are 5 and 38 dBm, respectively.

A four-layer feed-forward neural network (FNN) is chosen as the DQN, and the neuron numbers of the two hidden layers are 128 and 64, respectively. The activation function of the output layer is linear, and ReLU is adopted in the hidden layers. The cardinality of adjacent cells is $|D_n| = 18, \forall n$, the first $C = 16$ interferers are kept, and the number of power levels is $|\mathcal{A}| = 10$. Therefore, the input and output dimensions are 50 and 10, respectively.
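The network shape described above can be sketched as a plain NumPy forward pass: 50 inputs, hidden layers of 128 and 64 ReLU units, and 10 linear outputs, one predicted value per power level. The He-style random initialization, the function names `init_fnn` and `q_values`, and the random test state are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def init_fnn(sizes, rng):
    """Random parameters for a fully connected net, sizes = [in, h1, h2, out].

    He-style scaling is an assumption; biases start at zero.
    """
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, state):
    """Forward pass: ReLU hidden layers followed by a linear output layer."""
    x = state
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)  # ReLU
    weights, bias = params[-1]
    return x @ weights + bias                    # one value per power level

num_interferers = 16                      # retained interferers
input_dim = 3 * num_interferers + 2       # matches the stated input size 50
num_power_levels = 10

rng = np.random.default_rng(0)
params = init_fnn([input_dim, 128, 64, num_power_levels], rng)
q = q_values(params, rng.random(input_dim))
action = int(np.argmax(q))                # index of the chosen power level
```

Because the forward pass uses matrix products only, the same function also accepts a batch of states of shape (batch, 50), for instance the 256 samples drawn from the replay memory.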
In the off-line training period, the DQN is first randomly initialized and then trained epoch by epoch. In the first 100 episodes, the agents only take actions stochastically; they then follow the adaptive ε-greedy learning strategy [17] in the subsequent exploring period. In each episode, the large-scale fading is invariant, and thus the number of training episodes must be large enough to overcome the generalization problem. There are 50 time slots per episode, and the DQN is trained with 256 random samples from the experience replay memory every 10 time slots. The Adam algorithm [18] is adopted as the optimizer, and the learning rate η exponentially decays from 10^-3 to 10^-4. All training hyper-parameters are listed in Table I for better illustration. In the following simulations, any change to these default hyper-parameters will be stated explicitly.

The FP algorithm, WMMSE algorithm, maximum PA and random PA schemes are treated as benchmarks to evaluate our proposed DQN-based algorithm. The perfect CSI of the current moment is assumed to be known for all schemes. The simulation code will be available after formal publication.

Fig. 3. With different γ values, the recorded average rate during the training period (curves smoothed by averaged window). (Axes: average rate (bps) versus training epoch; curves: γ = 0.0, 0.1, 0.3, 0.7, 0.9.)

B. Discount Factor

In this subsection, the performance of different discount factors γ is studied. We set γ ∈ {0.0, 0.1, 0.3, 0.7, 0.9}, and the average rate C over the training period is shown in Fig. 3. At the same time slot, the values of C with higher γ ∈ {0.7, 0.9} are clearly lower than those with lower γ values. The trained DQNs are then tested in three cellular networks with different cell numbers. Fig. 4 shows that the DQN with γ = 0.0 achieves the highest C score, while the lowest value is obtained by the one with the highest γ value. The simulation result shows that a non-zero γ has a negative influence on the performance of the DQN, which is consistent with the analysis in Section III-B. Therefore, a zero or low discount factor value is recommended.

C. Algorithm Comparison

The DQN trained with zero γ is used, and the four benchmark algorithms stated before are tested as comparisons. In a real cellular network, the user density changes over time, and the DQN must have good generalization ability against this issue. The user number per cell K is assumed to be in the set {1, 2, 4, 6}. The averaged simulation results are obtained after 500 repeats. As shown in Fig. 5, the DQN achieves the highest C in all testing scenarios. Although it is trained with K = 4, the DQN still outperforms the other algorithms in the other cases. We also note that the gap between the random/maximum PA schemes and the remaining optimization algorithms is increased

when K becomes larger. This can be mainly attributed to the fact that the intra-cell interference gets stronger with increased user density, which indicates that the optimization of PA is more significant in cellular networks with denser users.

Fig. 4. The average rate C versus cellular network scalability for trained DQNs with different γ values. (Axes: average rate (bps) versus number of cells, N = 25, 49, 100; curves: γ = 0.0, 0.1, 0.3, 0.7, 0.9.)

Fig. 5. The average rate C versus user number per cell. Five power allocation schemes are tested. (Axes: average rate (bps/Hz) versus user number per cell, K = 1, 2, 4, 6; curves: DQN, FP, WMMSE, random power, maximal power.)

We also give an example result of one testing episode here (K = 4). In comparison with the averaged sum-rate values in Fig. 5, the performance of the three PA algorithms (DQN, FP, WMMSE) in Fig. 6 is not stable, and depends in particular on the specific large-scale fading effects. Additionally, in some episodes the DQN is not better than the other algorithms over time (not shown in this paper), which means that there is still potential to improve the DQN performance.

In terms of computational complexity, the time cost of the DQN grows linearly with the number of layers when a GPU is used. Meanwhile, both FP and WMMSE are iterative algorithms, and thus their time cost is not constant, depending on the stopping criterion, initialization and CSI.

V. CONCLUSIONS

The PA problem in the cellular network with IMAC has been investigated, and the data-driven model-free DQL has been applied to solve it. To be consistent with the PA optimization target, the current sum-rate is used as the reward function, including no future reward. With this DQL design, the DQN simply works as an estimator to predict the current sum-rate under all power levels with a certain CSI. Simulation results show that the DQN trained with zero γ achieves the highest average sum-rate. Then, in a series of different scenarios, the proposed DQN outperforms the benchmark algorithms, indicating that the designed DQN has good generalization abilities.
In our two-step training framework, we have realized the off-line centralized learning with simulated communication networks, and the learned DQN is tested by distributed executions. In our future work, the on-line learning will be further studied to accommodate real scenarios with specific user distributions and geographical environments.

Fig. 6. Comparisons of all five power allocation schemes over 1000 time slots (curves smoothed by averaged window). (Axes: average rate (bps) versus time slot; curves: DQN, FP, WMMSE, random power, maximal power.)

VI. ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (Grant No. , , ), the Natural Science Foundation of Jiangsu Province (Grant No. BK ), and the Open Program of State Key Laboratory of Millimeter Waves (Southeast University, Grant No. Z201804).

REFERENCES

[1] K. Shen and W. Yu, "Fractional programming for communication systems, Part I: Power control and beamforming," IEEE Transactions on Signal Processing, vol. 66, no. 10, 2018.

[2] Q. Shi, M. Razaviyayn, Z. Q. Luo, and C. He, "An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2011.
[3] M. Chiang, P. Hande, T. Lan, and C. W. Tan, "Power control in wireless cellular networks," Foundations and Trends in Networking, vol. 2, no. 4.
[4] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, "Weighted sum-rate maximization in multi-cell networks via coordinated scheduling and discrete power control," IEEE Journal on Selected Areas in Communications, vol. 29, no. 6, June.
[5] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, "Deep learning in physical layer communications," CoRR, vol. abs/. [Online]. Available: http://arxiv.org/abs/
[6] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, Dec.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436.
[8] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Transactions on Signal Processing, vol. 66, no. 20, Oct.
[9] F. Meng, P. Chen, L. Wu, and X. Wang, "Automatic modulation classification: A deep learning enabled approach," IEEE Transactions on Vehicular Technology, pp. 1-1.
[10] H. Ye, G. Y. Li, and B. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Communications Letters, vol. 7, no. 1, Feb.
[11] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, "A machine learning approach for power allocation in HetNets considering QoS," CoRR, vol. abs/. [Online]. Available: http://arxiv.org/abs/
[12] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in 2017 IEEE International Conference on Communications (ICC), May 2017.
[13] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, "Learning radio resource management in RANs: Framework, opportunities, and challenges," IEEE Communications Magazine, vol. 56, no. 9, Sep.
[14] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and V. Poor, "Two-dimensional anti-jamming mobile communication based on reinforcement learning," IEEE Transactions on Vehicular Technology, pp. 1-1.
[15] Y. S. Nasir and D. Guo, "Deep reinforcement learning for distributed dynamic power allocation in wireless networks," CoRR, vol. abs/. [Online]. Available: http://arxiv.org/abs/
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/. [Online]. Available: http://arxiv.org/abs/
