Upper confidence bound based decision making strategies and dynamic spectrum access


Upper confidence bound based decision making strategies and dynamic spectrum access Wassim Jouini wassim.jouini@supelec.fr Damien Ernst University of Liège dernst@ulg.ac.be Christophe Moy christophe.


Wassim Jouini, Damien Ernst (University of Liège), Christophe Moy, Jacques Palicot

Abstract: In this paper, we consider the problem of exploiting spectrum resources for a secondary user (SU) of a wireless communication network. We suggest that Upper Confidence Bound (UCB) algorithms could be useful to design decision making strategies for SUs to exploit intelligently the spectrum resources based on their past observations. The algorithms use an index that provides an optimistic estimation of the availability of the resources to the SU. The suggestion is supported by some experimental results carried out on a specific dynamic spectrum access (DSA) framework.

Index Terms: Cognitive Radio, Dynamic Spectrum Access, Upper Confidence Bound Algorithm.

I. INTRODUCTION

A. Dynamic spectrum access

During the last century, most of the meaningful frequency bands were licensed to the emerging wireless applications. Because of the static model of frequency allocation, the growing number of spectrum demanding services led to a spectrum scarcity. However, a recent series of measurements on the spectrum utilization [1] showed that the different frequency bands were underutilized (sometimes even unoccupied) and thus that the scarcity of the spectrum resource is virtual and only due to the static allocation of the different bands to specific wireless services. Moreover, the underutilization of the spectrum resource varies on different scales in time and space, offering many opportunities to an unlicensed user or network to access the spectrum. Dynamic Spectrum Access (DSA, also known as Opportunistic Spectrum Access: OSA) was introduced as a possible solution that could alleviate the spectrum scarcity issue.

In general, DSA related issues consider a pool of users referred to as primary users (PUs). PUs access spectrum resources dedicated to the services provided (or available) to them. Consequently they have an unconstrained access to these resources.
The primary users communicate in a primary network (PN) which is characterized by its environment, i.e., its geographical position as well as the resources provided during a certain amount of time. The concept of DSA allows new users to access their surrounding PUs' licensed bands even though they do not belong to the primary network. These users are referred to as secondary users (SUs). The main goal of a SU is to find in its surrounding environment new communication opportunities compared to the usual and current spectrum allocation scheme.

Fig. 1. Cognitive Radio context.

Usually an opportunity, in DSA related issues, is defined as "a band of frequencies that are not being used by the primary users of that band at a particular time in a particular geographic area" [2]. However, a SU usually has no a priori information on the available opportunities surrounding it. To address that issue, the Federal Communications Commission (USA) suggested the concept of Cognitive Radio, introduced by J. Mitola [3] in 1999, as a possible solution.

B. Decision making engine of a cognitive radio equipment

A Cognitive Radio (CR) device is a communication system aware of its environment as well as of its operational abilities and capable of using them intelligently. Thus it is a device that has the ability to collect information through its sensors and that can use the past observations on its surrounding environment to improve its behavior accordingly. A simplified cognitive radio behavior in DSA is illustrated in Figure 1: the CR equipment observes its surrounding environment looking for opportunities. As illustrated by the magnifying glass, usually, a CR cannot see (or sense) the entire environment altogether. The results of these observations are taken into account by the decision making engine that decides on the next action to take (e.g. which part of the environment to sense? transmit or not transmit?). In some cases a numerical signal (reward or acknowledgment) is computed and helps the CR equipment to evaluate its performance at that specific time.
The design of such CR equipment to tackle OSA issues has recently been the center of a lot of attention (e.g. [3] [4] [5]). We refer to the decision making engine of the CR equipment as the Cognitive Agent (CA); it can be seen as the brain of the CR device.

/10/$ IEEE

At the level of the CA, the challenges are twofold: on the one hand, the SU must not compromise the efficiency of the primary network. Thus, a proper sensing of the environment must be done to avoid interfering with PUs. On the other hand, the SU has to find an allocation policy to select, and if possible, access the available resources. A simple representation of the different interactions between the environment and the cognitive agent is described in Figure 2.

Fig. 2. Cognitive radio resource selection and access.

In this paper, we assume that the CA can only take actions (e.g. select and access a channel if possible) at discrete time instants $t = 0, 1, 2, \ldots$. At every instant $t$, the CA observes its radio frequency environment and can collect different kinds of information (e.g., available frequency bands, noise level, position, throughput, etc.). All the information collected by the CA up to instant $t$ is supposed to be gathered in a vector $i_t$. We assume that the CA has to select at every instant $t$ an action $a_t$ in a discrete set $A$. Without loss of generality, the behavior of the CA can be seen as a policy (decision strategy) $\pi$ that maps the information vector $i_t$ into the action $a_t \in A$, that is:

$$a_t = \pi(i_t) \tag{1}$$

The purpose of this paper is to study the performance of a particular policy on an academic DSA problem. The academic DSA problem is described in Section II. The policy, which is based on the computation of upper confidence bound indexes, is described in Section III. Section IV reports the simulation results and, finally, Section V concludes.

II. DYNAMIC SPECTRUM ACCESS: NETWORK MODEL

We consider a single secondary user (SU) operating in a primary network composed of $K$ channels referenced by the integers $\{1, 2, \ldots, K\}$. The CR equipment of the SU can only sense (then access if possible) one channel at a time. As illustrated in Figure 3, we address the particular case where the time is divided into slots $t = 0, 1, 2, \ldots$, and where PUs are synchronous.

Fig. 3. Occupancy of the different channels considered by the SU.
The temporal occupancy pattern of every channel $k$ is supposed to follow an unknown Bernoulli distribution $\theta_k$. Moreover, the distributions $\theta_1, \theta_2, \ldots, \theta_K$ are assumed to be stationary. When the SU senses a channel $k$ at the slot number $t$, the cognitive agent computes a binary signal $X_t$ that provides information on the availability of the sensed slot at that particular instant. $X_t$ is an independent realization of the distribution $\theta_k$ at the slot $t$. Let us define $\mu_k$ as follows:

$$\forall k, \quad \mu_k \triangleq E[\theta_k] = P(\text{channel } k \text{ is free})$$

Without loss of generality, we assume that $\mu_1 \leq \mu_2 \leq \ldots \leq \mu_{K-1} < \mu_K$. Moreover, we assume in this paper that the outcome of the sensing process is error free. However, the distributions $\theta_1, \theta_2, \ldots, \theta_K$ are assumed to be unknown to the CA. At every instant $t$, and for every channel $k$, the state of the channel observed by the SU can be either free or busy. If the channel is free, the CR equipment can transmit a certain number of bits $B_t$. Otherwise, the CR equipment waits until the next slot and selects a new channel to sense.

Fig. 4. Slot representation for a radio equipment controlled by a CA. It is assumed here that $T_d + T_a$ are small with respect to $T_s$ and $T_t$.

A slot is divided into 4 periods (cf. Figure 4). During the first period, the CA chooses the next channel to access. During the second period the CA senses the selected channel before communicating if it is possible (channel free during the slot). At the end of every slot, the CA computes a numerical signal referred to as reward $r_t$ that depends on the occupancy state of the selected channel and evaluates the CA's performance (e.g., throughput in this paper) during the communication process. The added information at the end of every slot is used to improve the decision making behavior of the CA, which is characterized by the policy $\pi$. As mentioned earlier, this policy takes an information vector $i_t$ as input and outputs the action to be selected at time $t$. The action is here the channel to select, $A = \{1, 2, \ldots, K\}$, and the information vector is $i_t = [a_0, r_0, a_1, r_1, \ldots, a_{t-1}, r_{t-1}]$. The throughput achieved by the CR equipment at the slot number $t$ can be defined as:

$$r_t \triangleq B_t \cdot X_t \tag{2}$$

which is the reward considered in this particular framework. For the sake of simplicity we assume here that if the channel is free the CR can always transmit $B_t = B$ bits. Thus, the cumulated throughput after $t$ slots can be written:

$$W_t^{\pi} = \sum_{m=0}^{t-1} r_m = B \sum_{m=0}^{t-1} X_m$$

where the suffix $\pi$ is used to emphasize that the CR equipment uses the policy $\pi$ to select the channels. The purpose of the CA is to maximize the expected cumulated throughput of the CR equipment:

$$E[W_t^{\pi}] = B \sum_{m=0}^{t-1} E[X_m] \tag{3}$$

Let $R_t^{\pi}$ denote the regret of the CA at the slot number $t$, using a policy $\pi$. The regret $R_t^{\pi}$ is defined as:

$$R_t^{\pi} = B \cdot \mu_K \cdot t - W_t^{\pi} \tag{4}$$

The general idea behind the notion of regret can be explained as follows: if the CA knew a priori the values of $\{\mu_k\}_{k \in A}$, the best choice would be to always select the channel with the highest expected availability, i.e., $\mu_K$. Unfortunately, the CA usually lacks that information and has to learn it. For that purpose, the CA has to explore the channels in order to have better estimations of their temporal occupancy pattern. While exploring it should also exploit the already collected information to minimize the regret during the learning process. This leads to an exploration-exploitation tradeoff. The regret represents the loss due to suboptimal channel selections during the learning process. Maximizing the expected throughput is equivalent to minimizing the cumulated expected regret. The expected cumulated regret can be written as follows:

$$E[R_t^{\pi}] = B \sum_{k=1}^{K} \Delta_k \, E[T_k(t)] = B \cdot E[\tilde{R}_t^{\pi}] \tag{5}$$

where $\tilde{R}_t^{\pi} = R_t^{\pi} / B$, $\Delta_k = \mu_K - \mu_k$ and $T_k(t)$ refers to the number of times the channel $k$ has been selected from instant $0$ to instant $t-1$. We propose in the next section policies $\pi$ that upper bound the expected cumulated regret of the CR equipment by a logarithmic function of the slot number $t$.

III. UPPER CONFIDENCE BOUND INDEX

A. UCB index

Building a cognitive agent to tackle the DSA issue requires finding a policy $\pi$ for this agent that offers a good solution to the exploration-exploitation tradeoff behind the notion of regret minimization.
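To make the regret of Section II concrete, the following Python sketch evaluates Equation (4) from an observed reward sequence and the right-hand side of Equation (5) from selection counts. The function names, the argument layout and the example values are assumptions made for illustration; they do not come from the paper.

```python
def realized_regret(mu, rewards, bits=1.0):
    """Equation (4): R_t = B * mu_K * t - W_t, where W_t is the sum of rewards.

    mu      -- availability probabilities of the channels (hypothetical input)
    rewards -- rewards r_0, ..., r_{t-1} actually collected by the agent
    """
    t = len(rewards)
    return bits * max(mu) * t - sum(rewards)


def regret_from_counts(mu, counts, bits=1.0):
    """Equation (5) with E[T_k(t)] replaced by observed counts T_k(t).

    Each selection of channel k costs Delta_k = mu_K - mu_k on average.
    """
    best = max(mu)
    return bits * sum((best - m) * n for m, n in zip(mu, counts))


if __name__ == "__main__":
    mu = [0.1, 0.5, 0.9]      # illustrative values, not the paper's setup
    counts = [10, 20, 70]     # how often each channel was selected
    print(regret_from_counts(mu, counts))
```

Only selections of suboptimal channels contribute to the second quantity, which is why a good policy has to keep the counts $T_k(t)$ of those channels small.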
The general approach suggested in this section aims at selecting actions based on indexes that provide upper confidence bounds (UCB) on the rewards associated to the channels the secondary user can potentially exploit. Policies based on the computation of UCB indexes were initially introduced in the machine learning community to solve the so-called multi-armed bandit problem (see [6] and [7]). A usual approach to evaluate the average reward provided by a resource $k$ is to consider a confidence bound for its sample mean. Let $\bar{X}_{k,T_k(t)}$ be the sample mean of the resource $k \in A$ after being selected $T_k(t)$ times at the step $t$:

$$\bar{X}_{k,T_k(t)} = \frac{\sum_{m=0}^{t-1} r_m \cdot \mathbb{1}_{\{a_m = k\}}}{T_k(t)} \tag{6}$$

where $\mathbb{1}_{\{\text{logical expression}\}}$ is the indicator function, equal to 1 if the logical expression is true and 0 if it is false.

For every $k \in A$ and at every step $t = 0, 1, 2, \ldots$, an upper confidence bound index (UCB index), $B_{k,t,T_k(t)}$, is a numerical value computed from $i_t$. For all $k$, $B_{k,t,T_k(t)}$ gives an optimistic estimation of the expected reward obtained when the CA selects the resource $k$ at a time $t$ after being tested $T_k(t)$ times. The UCB indexes we use in this paper have the following general expression:

$$B_{k,t,T_k(t)} = \bar{X}_{k,T_k(t)} + A_{k,t,T_k(t)} \tag{7}$$

where $A_{k,t,T_k(t)}$ is an upper confidence bias added to the sample mean. An upper confidence bound (UCB) based cognitive agent uses a policy $\pi$ to compute from $i_t$ these indexes, from which it selects a resource $a_t$ as follows:

$$a_t = \pi(i_t) = \arg\max_k \left( B_{k,t,T_k(t)} \right) \tag{8}$$

1) $UCB_1$ [8] [9]: When using the following upper confidence bias:

$$A_{k,t,T_k(t)} = \sqrt{\frac{\alpha \cdot \ln(t)}{T_k(t)}} \tag{9}$$

with $\alpha > 1$, we obtain an upper confidence bound index referred to as $UCB_1$ in the literature. A fully detailed version of the policy using $UCB_1$ indexes is given in Figure 5.

Parameters: $K$, exploration coefficient $\alpha$
Input: $i_t = [a_0, r_0, a_1, r_1, \ldots, a_{t-1}, r_{t-1}]$
Output: $a_t$
Algorithm:
    If $t < K$: return $a_t = t + 1$
    Else:
        $T_k(t) \leftarrow \sum_{m=0}^{t-1} \mathbb{1}_{\{a_m = k\}}, \ \forall k$
        $A_{k,t,T_k(t)} \leftarrow \sqrt{\alpha \cdot \ln(t) / T_k(t)}, \ \forall k$
        $B_{k,t,T_k(t)} \leftarrow \bar{X}_{k,T_k(t)} + A_{k,t,T_k(t)}, \ \forall k$
        return $a_t = \arg\max_k \left( B_{k,t,T_k(t)} \right)$

Fig. 5. A tabular version of a policy $\pi(i_t)$ using a $UCB_1$ algorithm for computing actions $a_t$.
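The tabular policy of Figure 5 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `ucb1_select` and the representation of $i_t$ as a list of `(channel, reward)` pairs are assumptions, and channels are numbered from 0 instead of 1.

```python
import math


def ucb1_select(history, num_channels, alpha=1.2):
    """Return the 0-based channel chosen at slot t = len(history).

    history is the information vector i_t, given as (channel, reward) pairs.
    alpha is the exploration coefficient of Equation (9); it must exceed 1.
    """
    t = len(history)
    if t < num_channels:
        # Initialization: sense every channel once, in order.
        return t

    counts = [0] * num_channels   # T_k(t)
    sums = [0.0] * num_channels   # cumulated reward per channel
    for channel, reward in history:
        counts[channel] += 1
        sums[channel] += reward

    def index(k):
        if counts[k] == 0:
            # Defensive guard for histories not produced by this policy.
            return float("inf")
        mean = sums[k] / counts[k]                         # Equation (6)
        bias = math.sqrt(alpha * math.log(t) / counts[k])  # Equation (9)
        return mean + bias                                 # Equation (7)

    return max(range(num_channels), key=index)             # Equation (8)


if __name__ == "__main__":
    # Three channels, each sensed once; only channel 1 was found free.
    print(ucb1_select([(0, 0.0), (1, 1.0), (2, 0.0)], num_channels=3))
```

Recomputing the counts from the full history keeps the sketch close to Figure 5; a practical agent would more likely update the counts and sums incrementally at the end of every slot.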

2) $UCB_V$ [8]: The $UCB_1$ index uses only first order statistic information (empirical mean). It was suggested in [9] that adding the second order statistic information (empirical variance) to the UCB indexes could lead to better performances. The $UCB_V$ index exploits the empirical variance of the estimated rewards. More specifically it uses the following upper confidence bias:

$$A_{k,t,T_k(t)} = \sqrt{\frac{2\xi \cdot V_k(t) \cdot \ln(t)}{T_k(t)}} + \frac{3 \cdot c \cdot \xi \cdot \ln(t)}{T_k(t)} \tag{10}$$

with $c$ and $\xi$ such that $3 \cdot \xi \cdot c > 1$, and where $V_k(t)$ refers to the empirical variance of the channel $k$. In Section IV we will compare the performances of $UCB_1$ and $UCB_V$ policies on the dynamic spectrum access problem introduced in Section II.

B. Performance evaluation

When using a policy $\pi$, an interesting way to analyze its behavior is to consider the notion of consistency. This notion gives information on the growth rate of the regret. A policy $\pi$ is said to be $\beta$-consistent, $0 < \beta \leq 1$, if it satisfies:

$$\lim_{t \to \infty} \frac{E[R_t^{\pi}]}{t^{\beta}} = 0 \tag{11}$$

We expect a good policy to be at least 1-consistent. As a matter of fact, this property ensures that asymptotically the mean expected reward is optimal, i.e.:

$$\lim_{t \to \infty} \frac{1}{t} \sum_{m=0}^{t-1} r_m = B \cdot \mu_K \tag{12}$$

Theorem 1: (cf. [8] for proofs) For all $K \geq 2$, if policy $UCB_1$ ($\alpha > 1$) is run on $K$ channels having arbitrary reward distributions $\theta_1, \ldots, \theta_K$ with support in $[0,1]$, then:

$$E[\tilde{R}_t^{\pi = UCB_1}] \leq \sum_{k: \Delta_k > 0} \frac{4 \cdot \alpha}{\Delta_k} \cdot \ln(t) \tag{13}$$

Notice that a similar theorem could be written if the reward distributions had a bounded support rather than a support in $[0,1]$. An equivalent theorem also exists for the index $UCB_V$:

Theorem 2: (cf. [8] for proofs) For all $K \geq 2$, if policy $UCB_V$ ($\xi \geq 1$, $c = 1$) is run on $K$ channels having arbitrary reward distributions $\theta_1, \ldots, \theta_K$ with support in $[0,1]$, then $\exists\, C_{\xi} > 0$ s.t.

$$E[\tilde{R}_t^{\pi = UCB_V}] \leq C_{\xi} \sum_{k: \Delta_k > 0} \left( \frac{\sigma_k^2}{\Delta_k} + 2 \right) \cdot \ln(t) \tag{14}$$

Actually a similar result would still hold if $c \neq 1$ but satisfies nonetheless $3 \cdot \xi \cdot c > 1$.

These results are of a particular interest for many reasons: They bound the expected regret of the UCB policies by a logarithmic function for all $t$. This guarantees that the suggested policies are $\beta$-consistent for all $0 < \beta \leq 1$.
Thus these policies converge quickly to the optimal channel $K$. Moreover, the indexes these policies rely on to select actions can be computed incrementally [10]. Thus, their complexity, in terms of memory usage and computational needs, is low. Last but not least, it has been proven in [6] that when having no a priori information on the temporal occupancy pattern of the different channels $\theta_1, \theta_2, \ldots, \theta_K$, a logarithmic upper bound is the best we can expect.

IV. SIMULATIONS

In our simulations, we consider that the CA can choose between 10 channels. The parameters of the Bernoulli distributions which characterize the temporal occupancy of these channels are: $[\mu_1, \mu_2, \ldots, \mu_{10}] = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]$. We consider that the number of bits a SU can transmit on a free channel is $B = 1$ bit. Every numerical result reported hereafter is the average of the values obtained over 100 experiments. In this section, the parameter $\alpha$ of the $UCB_1$ algorithm is chosen equal to 1.2. The parameters $\xi$ and $c$ of the $UCB_V$ algorithm are equal to 1 and 0.4, respectively. With such values for $c$ and $\xi$, the condition $3 \cdot \xi \cdot c > 1$ is satisfied and the bound on the expected cumulated regret given by Equation (14) still holds. The simulation results depend on the parameter values; we chose these values to be close to the critical ones ($\alpha = 1$, $\xi = 1$ and $c = 1/3$) without being too conservative.

Figure 6-top shows the evolution of the average cumulated regret for the different UCB policies. For both policies, the cumulated regret first increases rather rapidly with the slot number and then more and more slowly. This shows that UCB policies are able to process the past information in an appropriate way such that the most available resources are favored with time. This is further illustrated by the 3 graphics on the bottom of Figure 6. These graphics show the average throughput achieved by the UCB policies. As we observe, the throughput increases with time. Actually, one has the theoretical guarantee that it will converge to 0.9, which is the largest probability of availability of a channel.
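A setup of this kind can be reproduced with a short simulation. The Python sketch below uses the channel availabilities and the parameter values quoted above ($\alpha = 1.2$, $\xi = 1$, $c = 0.4$, $B = 1$); everything else is an assumption made for illustration: the function names, the horizon of 2000 slots, the single seeded run instead of an average over 100 experiments, and the use of Equation (5) with observed counts as the reported regret. For binary rewards the empirical variance $V_k(t)$ reduces to $\bar{X}(1 - \bar{X})$, which the sketch exploits.

```python
import math
import random

# Channel availabilities listed in Section IV.
MU = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]


def ucb1_bias(t, n, var, alpha=1.2):
    """Equation (9); the variance argument is ignored."""
    return math.sqrt(alpha * math.log(t) / n)


def ucbv_bias(t, n, var, xi=1.0, c=0.4):
    """Equation (10): variance-driven term plus a term in ln(t) / T_k(t)."""
    log_t = math.log(t)
    return math.sqrt(2.0 * xi * var * log_t / n) + 3.0 * c * xi * log_t / n


def run(bias, mu=MU, horizon=2000, seed=0):
    """Simulate one run; return (regret from Equation (5), selection counts)."""
    rng = random.Random(seed)
    k = len(mu)
    counts = [0] * k   # T_k(t), updated incrementally
    sums = [0.0] * k   # cumulated reward per channel, with B = 1

    def index(j, t):
        mean = sums[j] / counts[j]
        var = mean * (1.0 - mean)  # empirical variance of 0/1 samples
        return mean + bias(t, counts[j], var)

    for t in range(horizon):
        if t < k:
            a = t  # sense every channel once
        else:
            a = max(range(k), key=lambda j: index(j, t))
        x = 1.0 if rng.random() < mu[a] else 0.0  # 1 means the channel is free
        counts[a] += 1
        sums[a] += x

    best = max(mu)
    regret = sum((best - m) * n for m, n in zip(mu, counts))
    return regret, counts


if __name__ == "__main__":
    for name, bias in (("UCB_1", ucb1_bias), ("UCB_V", ucbv_bias)):
        regret, counts = run(bias)
        print(name, round(regret, 1), counts[-1])
```

A single run is noisy, so comparing the two policies the way the paper does would require averaging over many seeds.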
Figure 7 shows the percentage $p$ of times a UCB policy selects the optimal channel until the slot number $t$ ($p = \frac{100}{t} \sum_{m=0}^{t-1} \mathbb{1}_{\{a_m = K\}}$). As one can observe, this percentage tends to get closer and closer to 100 as the slot number increases. In our simulation results, we have always found that $UCB_1$ seems to outperform $UCB_V$ at the beginning of the learning process and that, afterwards, $UCB_V$ outperforms $UCB_1$. This may be explained by the fact that, at the beginning of the learning, $UCB_V$ spends more time collecting information on the different channels than $UCB_1$, since it also depends on the variances of the different channels and not only on their empirical mean. During this phase, it mainly has a pure exploration strategy while $UCB_1$ already starts exploiting the information that has been gathered. However, once it starts having good estimates of these variances, it can address the exploration-exploitation tradeoff in a more efficient way than $UCB_1$.

Fig. 6. UCB based policies and dynamic spectrum access problem: simulation results. Figure on top plots the average cumulated regret as a function of the number of slots for the different UCB based policies. The figures on the bottom represent the evolution of the normalized average throughput achieved by these policies.

Fig. 7. Percentage of time a UCB-based policy selects the optimal channel.

V. CONCLUSION

We presented in this paper a new approach to tackle the resource selection and access problem in dynamic spectrum access in the case of one secondary user in a primary network. This approach exploits some upper confidence based algorithms introduced in the machine learning community for solving the multi-armed bandit problems. Although this research is still in its infancy, we believe that this approach can lead to efficient CAs to address DSA problems. However, many questions still need to be answered, especially when the temporal occupancy patterns of the channels do not follow Bernoulli distributions or when many SUs use these UCB based policies to access the same primary network.

ACKNOWLEDGMENT

Damien Ernst is a Research Associate of the Belgian FRS-FNRS of which he acknowledges the financial support.

REFERENCES

[1] Federal Communications Commission. Spectrum policy task force report. November.
[2] P. Kolodzy et al. Next generation communications: Kickoff meeting. In Proc. DARPA, October.
[3] J. Mitola and G.Q. Maguire. Cognitive radio: making software radios more personal. Personal Communications, IEEE, 6:13-18, August.
[4] S. Haykin. Cognitive radio: brain-empowered wireless communications. IEEE Journal on Selected Areas in Communications, 23, no. 2, Feb.
[5] T. Yucek and H. Arslan. A survey of spectrum sensing algorithms for cognitive radio applications. In IEEE Communications Surveys and Tutorials, 11, no. 1.
[6] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
[7] R. Agrawal. Sample mean based index policies with o(log(n)) regret for the multi-armed bandit problem. Advances in Applied Probability, 27.
[8] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th international conference on Algorithmic Learning Theory.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of multiarmed bandit problems. Machine learning, 47(2/3).
[10] W. Jouini, D. Ernst, C. Moy, and J. Palicot. Multi-armed bandit based policies for cognitive radio's decision making issues. In Proceedings of the 3rd international conference on Signals, Circuits and Systems (SCS), November 2009.
