Joint Channel Selection and Power Control in Infrastructureless Wireless Networks: A Multi-Player Multi-Armed Bandit Framework


Setareh Maghsudi and Sławomir Stańczak, Senior Member, IEEE

arXiv: v [cs.GT] 2 Jul 24

Abstract: This paper deals with the problem of efficient resource allocation in dynamic infrastructureless wireless networks. Assuming a reactive interference-limited scenario, each transmitter is allowed to select one frequency channel (from a common pool) together with a power level at each transmission trial; hence, for all transmitters, not only the fading gain, but also the number of interfering transmissions and their transmit powers vary over time. Due to the absence of a central controller and the time-varying network characteristics, it is highly inefficient for transmitters to acquire global channel and network knowledge. Therefore, a reasonable assumption is that transmitters have no knowledge of fading gains, interference, and network topology. Each transmitting node selfishly aims at maximizing its average reward (or minimizing its average cost), which is a function of the action of that specific transmitter as well as those of all other transmitters. This scenario is modeled as a multi-player multi-armed adversarial bandit game, in which multiple players receive an a priori unknown reward with an arbitrarily time-varying distribution by sequentially pulling an arm, selected from a known and finite set of arms. Since players do not know the arm with the highest average reward in advance, they attempt to minimize their so-called regret, determined by the set of players' actions, while attempting to achieve equilibrium in some sense. To this end, we design in this paper two joint power level and channel selection strategies. We prove that the gap between the average reward achieved by our approaches and that based on the best fixed strategy converges to zero asymptotically. Moreover, the empirical joint frequencies of the game converge to the set of correlated equilibria. We also characterize this set for two special cases of our designed game. We further discuss the experimental regret-testing procedure as another potential solution, which converges to Nash equilibrium. Finally, all approaches are compared through extensive numerical analysis.

Index Terms: Adversarial bandits, channel selection, equilibrium, infrastructureless wireless network, power control.

Parts of the material in this paper were presented at the IEEE Wireless Communications and Networking Conference, Shanghai, April 2013. The work was supported by the German Research Foundation (DFG) under grant STA 864/3-3. The authors are with the Fachgebiet für Informationstheorie und theoretische Informationstechnik, Technische Universität Berlin. The second author is also with the Fraunhofer Institute for Telecommunications Heinrich Hertz Institute, Berlin, Germany (setareh.maghsudi@tu-berlin.de, slawomir.stanczak@hhi.fraunhofer.de).

I. INTRODUCTION

A. Bandit Theory and Wireless Communication

Multi-armed bandit (MAB) is a class of sequential optimization problems, to the best of our knowledge originally introduced in [1]. In the most traditional form of MAB, given a set of arms (actions), a player pulls an arm at each trial of the game to receive a reward. The rewards of arms are not known to the player in advance; however, upon pulling an arm, its instantaneous reward is revealed. In such an unknown setting, after playing an arm, the player may lose some reward (or incur additional cost) due to not playing another arm instead of the currently played arm. This can be quantified by the difference between the reward that would have been achieved had the player selected another arm, and the reward of the played arm. This quantity is called regret. The player decides which arm to pull in a sequence of trials so that its accumulated regret over the game horizon is minimized. Such problems exhibit the intrinsic trade-off between exploitation (control), i.e. playing the arm which has exhibited the best performance in the past, and exploration (learning), i.e. playing other arms to guarantee the optimal payoff in the future. An important class of bandit games is adversarial bandits, where the series of rewards generated by an arm cannot be attributed to any specific distribution function.

In recent years, bandit theory has been used in communication theory. For instance, [2] and [3] utilize the classical bandit game to model spectrum sharing in cognitive radio networks. In [4], the authors propose a cooperative spectrum sensing scheme based on bandit theory. Further, References [5], [6], and [7] use bandit theory to model relay selection, sensor scheduling and object tracking, respectively.
Channel monitoring using a bandit model is investigated in [8] and [9]. Bandit models have also been used to solve the distributed resource allocation problem, as discussed in the following.

B. Distributed Resource Allocation in Infrastructureless Wireless Networks

In recent years, game theory and reinforcement learning have been widely used to solve the distributed resource allocation problem. The vast majority of game-theoretic approaches are based on either cooperation (e.g. coalition formation), mechanism design (e.g. auction theory), or exchange economy (e.g. supply-demand markets). Although these approaches can be implemented in a distributed manner, such an implementation in a real network environment requires that each player at least knows its own utility function a priori. On the other hand, these approaches are in general inefficient as players have to exchange information for coordination, which increases signaling and feedback overhead. For example, most models from cooperative game theory require coordination and/or communication among players to construct coalitions [10], [11]. In wireless resource allocation using auction games, bids must be submitted to some central controller that performs necessary computations and makes decisions [12], [13]. Finally, in supply-demand market models, prices and demands are exchanged among buyers and sellers [14], [15].

When the utility functions are not known in advance, the resource allocation problem is often solved by using learning approaches, including bandit models. A large body of literature, such as [16], [17] and [18], analyzes single-agent stochastic learning problems. Another example is [19]. In this work, network optimization is modeled as a stochastic bandit game, where at each trial multiple arms are selected by a single player and the reward is some linear combination of the rewards of selected arms. An application of this formulation might be a downlink user selection, performed by the base station. In single-agent settings, the agent learns from its previous experiences, and no information flow is required. However, this type of learning cannot generally be used in wireless networks, where multiple players act selfishly by responding to each other and their utilities are influenced by the actions of other players.
Moreover, similar to games with complete information, it is desired that players achieve equilibrium in some sense.

As for multi-agent settings, most studies assume that players are able to observe the actions of each other. This assumption, despite being realistic for some spectrum sharing problems, is not always applicable to general resource allocation problems, especially in power control games, where it is difficult to identify the transmit power level of players. In addition, the assumption that each player announces its actions (e.g. its transmit power) is not incentive compatible. As a result, a great majority of previous works focus on spectrum sharing and/or sensing, as well as channel monitoring. On the other hand, most previous studies assume that the rewards achieved by each action can be attributed to a single density distribution. However, this assumption is highly restrictive, especially for dynamic networks. In [20], a multi-agent bandit problem is investigated. This study assumes that in case of interference, no reward is paid to interfering users, thereby eliminating interference, which degrades the overall performance depending on utility functions. In addition, communication among players is necessary. Finally, no equilibrium analysis is performed. Another example is Reference [21], where opportunistic spectrum access is formulated as a multi-agent learning game. In this work, upon availability, each channel pays the same reward to all users, so that this scenario is strictly restrictive as it neglects different channel qualities. Moreover, if a channel is selected by multiple users, an orthogonal spectrum access scheme is used, which is known to be sub-optimal in general. References [22] and [23] consider graphical games for an interference minimization problem with partially overlapping channels, where the interference is present only between neighboring users. These works establish the convergence of the proposed learning approaches for the special case of exact potential games; nonetheless, the analysis does not hold for more general games. The authors of [24] model the cooperative rate maximization in cognitive radio networks as a bandit game, and propose two approaches, depending on the availability of information. The stability of the solution is however not investigated. Reference [25] proposes two approaches that achieve Nash equilibrium in a multi-player cognitive environment. System verification, however, is only based on numerical approaches. References [26], [27] and [28] propose various selection schemes to achieve logarithmic regret; however, no equilibrium analysis is performed.
All of the works named above assume that the generated rewards of any given action are independent and identically distributed.

C. Our Contribution

As discussed in Section I-B, the resource allocation problem using machine learning theory has been subject to extensive research in recent years. In short, our focus is on a resource allocation problem in an infrastructureless network. First, we model this problem as an adversarial multi-player multi-armed bandit game. With the aim of an efficient management of network resources and co-channel interference mitigation, we follow an approach suggested in [29] to design two joint power control and channel selection (PC-CS, hereafter) strategies, which are adapted versions of the exponential-based weighted average [30] and follow the leader [31] strategies. Both PC-CS strategies not only result in small (that is, with sublinear growth rate in time) regret for each individual player, but also guarantee the convergence of empirical frequencies of play to the set of correlated equilibria. We further characterize this set for two special cases of our designed game. Moreover, we implement the experimental regret-testing procedure [32], which is shown to converge to the set of Nash equilibria of the game. Our work extends the state-of-the-art in this area significantly since it differs from the existing studies in the following crucial aspects:

• We analyze the multi-agent bandit problem and take into account the selfishness of players.
• We do not assume that the reward generating process of any given action is time-invariant. In fact, the reward functions are allowed to vary arbitrarily, which enables us to accommodate the dynamic nature of wireless channels and distributed networks.
• We do not allow any communication among players, thereby minimizing the overhead. Moreover, players do not observe the actions of each other, so that the developed model can be applied to a large body of resource allocation problems. An example is a power control problem with unknown power levels used by other players.
• We study a two-dimensional problem, namely the joint channel and power level selection problem, by modeling it as a multi-player multi-armed bandit game.
• In our model, channel qualities are taken into account so that channels pay different rewards to different users. In addition, we impose no limitations on the interference pattern.
• Our convergence analysis is valid for a wide range of games. This is in contrast to many previous works where the game should necessarily be a potential game for the convergence analysis to hold.
• We characterize the set of correlated equilibria for two special cases of our formulated game model.

D. Paper Structure

Section II briefly reviews some concepts and results of bandit theory.
In Section III the resource allocation game is formulated. Section IV presents a PC-CS strategy based on the exponential-based weighted average rule [30]. In Section V, another PC-CS strategy, derived from the follow the leader rule [31], is discussed. Section VI is devoted to the experimental regret-testing procedure [32]. Numerical analysis is presented in Section VII. Section VIII concludes the paper.

II. MULTI-PLAYER MULTI-ARMED BANDIT GAMES

A. Notions of Regret

The multi-player multi-armed bandit problem (MP-MAB, hereafter) is a class of sequential decision making problems with limited information. In this game, each player $k \in \{1, \dots, K\}$ is assigned an action set including $N_k$ actions (arms), $N_k \in \mathbb{N}$. Every player selects an action at successive trials in order to receive an initially unknown reward, which is determined not only by its own actions, but also by those of other players. The action set, the played action and the reward achieved by each player are regarded as private information. The reward generating processes of arms are independent. Let $\mathcal{I}$ and $\mathcal{I}^{(k)}$ be the joint action space and the action space of player $k$, respectively. Accordingly, $\mathbf{I}_t = (I_t^{(1)}, \dots, I_t^{(k)}, \dots, I_t^{(K)}) \in \mathcal{I}$ denotes the joint action profile of players at time $t$, with $I_t^{(k)} \in \mathcal{I}^{(k)}$ being the action of player $k$. Moreover, let $g_t^{(k)}(\mathbf{I}_t) \in [0, 1]$ be the reward achieved by some player $k$ at time $t$.¹ The instantaneous regret of any player $k$ is defined as the difference between the reward of the optimal action² and that of the played action. Based on this definition, the cumulative regret of player $k$ is formally defined in the following.

Definition 1. The cumulative regret of player $k$ up to time $n$ is defined as

$$R_n^{(k)} = \max_{i=1,\dots,N_k} \sum_{t=1}^{n} g_t^{(k)}\big(i, \mathbf{I}_t^{(-k)}\big) - \sum_{t=1}^{n} g_t^{(k)}\big(I_t^{(k)}, \mathbf{I}_t^{(-k)}\big), \tag{1}$$

where $\mathbf{I}_t^{(-k)}$ is defined to be the joint action profile of all players except for $k$ at time $t$.

Each player aims at minimizing its accumulated regret, which is an instance of the well-known exploitation-exploration dilemma: find a desired balance between exploiting actions that have performed well in the past (control) on the one hand, and exploring actions which might lead to a better performance in the future (learning) on the other hand. Now, suppose that players use mixed strategies.
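This means that, at each trial $t$, player $k$ selects a probability distribution $\mathbf{P}_t^{(k)} = (p_{1,t}^{(k)}, \dots, p_{i,t}^{(k)}, \dots, p_{N_k,t}^{(k)})$ over arms, and plays arm $i$ with probability $p_{i,t}^{(k)}$. In this case, we resort to expected regret, also called external regret [33], defined as follows.

¹ Note that all results can also be expressed in terms of loss $d$, provided that the loss is related to the gain by $d = 1 - g$, $g \in [0, 1]$.
² Optimality is defined in the sense of the highest instantaneous reward.

Definition 2. The external cumulative regret of player $k$ is defined as

$$R_{\mathrm{Ex}}^{(k)} := R_{\mathrm{Ex},n}^{(k)} = \max_{i=1,\dots,N_k} \sum_{t=1}^{n} g_t^{(k)}\big(i, \mathbf{I}_t^{(-k)}\big) - \sum_{t=1}^{n} \bar{g}_t^{(k)}\big(\mathbf{P}_t^{(k)}, \mathbf{I}_t^{(-k)}\big) = \max_{i=1,\dots,N_k} \sum_{t=1}^{n} \sum_{j=1}^{N_k} p_{j,t}^{(k)} \Big( g_t^{(k)}\big(i, \mathbf{I}_t^{(-k)}\big) - g_t^{(k)}\big(j, \mathbf{I}_t^{(-k)}\big) \Big), \tag{2}$$

where $\bar{g}_t^{(k)}(\cdot)$ denotes the expected reward at round $t$ by using mixed strategy $\mathbf{P}_t^{(k)}$, defined as $\bar{g}_t^{(k)} = \sum_{j=1}^{N_k} g_t^{(k)}\big(j, \mathbf{I}_t^{(-k)}\big)\, p_{j,t}^{(k)}$.

By definition, external regret compares the expected reward of the current mixed strategy with that of the best fixed action in hindsight, but fails to compare the rewards achieved by changing actions in a pair-wise manner. In order to compare actions in pairs, internal regret [33] is introduced, which is closely related to the concept of equilibrium in games.

Definition 3. The internal cumulative regret of player $k$ is defined as

$$R_{\mathrm{In}}^{(k)} := R_{\mathrm{In},n}^{(k)} = \max_{i,j=1,\dots,N_k} R_{(i \to j),n}^{(k)} = \max_{i,j=1,\dots,N_k} \sum_{t=1}^{n} p_{i,t}^{(k)} \Big( g_t^{(k)}\big(j, \mathbf{I}_t^{(-k)}\big) - g_t^{(k)}\big(i, \mathbf{I}_t^{(-k)}\big) \Big). \tag{3}$$

Notice that on the right-hand side of (3), $r_{(i \to j),t}^{(k)} = p_{i,t}^{(k)} \big( g_t^{(k)}(j, \mathbf{I}_t^{(-k)}) - g_t^{(k)}(i, \mathbf{I}_t^{(-k)}) \big)$ denotes the expected regret caused by pulling arm $i$ instead of arm $j$. By comparing (2) and (3), external regret can be bounded above by internal regret as [34]

$$R_{\mathrm{Ex}}^{(k)} = \max_{i=1,\dots,N_k} \sum_{j=1}^{N_k} R_{(j \to i),n}^{(k)} \le N_k \max_{i,j=1,\dots,N_k} R_{(i \to j),n}^{(k)} = N_k R_{\mathrm{In}}^{(k)}. \tag{4}$$

Remark 1. Throughout the paper, vanishing (zero-average) external and internal regret means that $\lim_{n \to \infty} \frac{1}{n} R_{\mathrm{Ex}} = 0$ and $\lim_{n \to \infty} \frac{1}{n} R_{\mathrm{In}} = 0$, respectively. In other words, we have $R_{\mathrm{Ex}} \in o(n)$ and $R_{\mathrm{In}} \in o(n)$. Note that by (4), $R_{\mathrm{In}} \in o(n)$ yields $R_{\mathrm{Ex}} \in o(n)$. Throughout the paper, we call any strategy with $R_{\mathrm{In}} \in o(n)$ a no-regret strategy.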
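As a small illustration of Definitions 2 and 3, the following sketch computes both regrets from a full table of rewards. It is an analysis aid only: the function names and the toy data are assumptions made here, and a bandit player never sees the full table `gains`, only the reward of the arm it played.

```python
def external_regret(gains, probs):
    """External regret as in (2): best fixed arm in hindsight vs. expected reward.

    gains[t][i] is the reward arm i would have paid at round t (given the
    opponents' actions); probs[t][i] is the probability of arm i at round t.
    """
    n_arms = len(gains[0])
    best_fixed = max(sum(g[i] for g in gains) for i in range(n_arms))
    expected = sum(
        sum(p[j] * g[j] for j in range(n_arms)) for g, p in zip(gains, probs)
    )
    return best_fixed - expected


def internal_regret(gains, probs):
    """Internal regret as in (3): largest regret over ordered pairs (i -> j)."""
    n_arms = len(gains[0])
    return max(
        sum(p[i] * (g[j] - g[i]) for g, p in zip(gains, probs))
        for i in range(n_arms)
        for j in range(n_arms)
        if i != j
    )


# Hypothetical data: 3 arms, 2 rounds, uniform mixed strategy in both rounds.
gains = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.5]]
probs = [[1 / 3] * 3, [1 / 3] * 3]
r_ex = external_regret(gains, probs)
r_in = internal_regret(gains, probs)
```

For this toy table the external regret is 7/6 and the internal regret is 2/3, so the bound in (4) holds with $N_k = 3$.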

B. Equilibrium

From the viewpoint of each player $k$, an MP-MAB is seen as a game with two agents: player $k$ itself, and the set of all other $K - 1$ players (referred to as the opponent), whose joint action profile affects the reward achieved by player $k$. We consider here the most general framework, where the opponent is non-oblivious, i.e. its series of actions depends on the actions of player $k$. It is known that a game against a non-oblivious opponent can be modeled only by adversarial bandit games [35], while similar to other game-theoretic formulations, the solution is considered to be equilibrium, most importantly Nash and correlated equilibria.³ In the context of game-theoretic bandits, an important result is the following theorem.

Theorem 1 ([33]). Consider a $K$-player bandit game, where each player $k$ is provided with an action set of cardinality $N_k$. Denote the internal regret of player $k$ by $R_{\mathrm{In}}^{(k)}$, and the set of correlated equilibria by $\mathcal{C}$. At time $n$, define the empirical joint distribution of the game as

$$\hat{\pi}_n(\mathbf{i}) = \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}_{\{\mathbf{I}_t = \mathbf{i}\}}, \quad \mathbf{i} = (i_1, \dots, i_K) \in \bigotimes_{k=1}^{K} \{1, \dots, N_k\}. \tag{5}$$

Then, if all players $k \in \{1, \dots, K\}$ play according to any strategy so that

$$\lim_{n \to \infty} \frac{1}{n} R_{\mathrm{In}}^{(k)} = 0, \tag{6}$$

the distance $\inf_{\pi \in \mathcal{C}} \sum_{\mathbf{i}} |\hat{\pi}_n(\mathbf{i}) - \pi(\mathbf{i})|$ between the empirical joint distribution of plays and the set of correlated equilibria converges to 0 almost surely.

Theorem 1 simply states that in an MP-MAB game, if all players play according to a strategy with vanishing internal regret (no-regret), then the empirical joint distribution of plays converges to the set of correlated equilibria. Note that the strategies used by players are not required to be identical. Since a rational player is always interested in minimizing its regret, the assumption that every player plays according to a no-regret strategy is reasonable.

³ These definitions are quite standard (see e.g. [36]), and thus we do not restate them here.

C. From Vanishing External Regret to Vanishing Internal Regret

In [34], an approach is proposed for converting any selection strategy with vanishing external regret to another version with vanishing internal regret. We describe this approach briefly.

Consider a selection strategy (O-strategy, hereafter) which at each time $t$ assigns a probability distribution $\mathbf{P}_t$ to the set of $N$ actions, and selects an action according to this distribution. Assume that the player starts using the O-strategy with the uniform distribution over $N$ actions. At each time $t > 1$, the O-strategy has already selected $\mathbf{P}_{t-1} = (p_{1,t-1}, \dots, p_{i,t-1}, \dots, p_{j,t-1}, \dots, p_{N,t-1})$. Now, the O-strategy constructs a meta-strategy (M-strategy, hereafter) with $N(N-1)$ virtual strategies based on $\mathbf{P}_{t-1}$. Each virtual strategy corresponds to a pair of actions $(i \to j)$, $i, j \in \{1, \dots, N\}$, $i \ne j$, and constructs a distribution over $N$ actions by assigning the probability mass of action $i$ to action $j$. That is, it defines $\mathbf{P}_{t-1}^{(i \to j)} = (p_{1,t-1}, \dots, 0, \dots, p_{j,t-1} + p_{i,t-1}, \dots, p_{N,t-1})$, which has $0$ and $p_{j,t-1} + p_{i,t-1}$ at the place of $p_{i,t-1}$ and $p_{j,t-1}$, respectively, and all other elements remain unchanged. Assume that the M-strategy treats these virtual strategies as actions. That is, at each time $t$, it defines a probability vector $\boldsymbol{\delta}_t$ over $N(N-1)$ virtual actions, where the probability of action $(i \to j)$, i.e. $\delta_{(i \to j),t}$, depends on its past performance.⁴ Now, at time $t$, the O-strategy assigns a distribution $\mathbf{P}_t$ to $N$ actions, where $\mathbf{P}_t = \sum_{(i,j): i \ne j} \mathbf{P}_t^{(i \to j)} \delta_{(i \to j),t}$. The constructed O-strategy has the characteristic that its internal regret is upper-bounded by the external regret of the M-strategy over $N(N-1)$ virtual actions according to probability $\boldsymbol{\delta}_t$. Thus, if the M-strategy exhibits vanishing external regret, the O-strategy results in vanishing internal regret. In Sections IV and V, we use this property to design no-regret selection strategies.

III. BANDIT-THEORETICAL MODEL OF INFRASTRUCTURELESS WIRELESS NETWORKS

We consider a network consisting of $K$ transmitter-receiver pairs, denoted by $(k, k')$, where $k, k' \in \{1, \dots, K\}$. The transmitter-receiver pair $(k, k')$ is referred to as user or player $k$. Each user $k$ can access $C_k$ mutually orthogonal channels at $L_k$ quantized power levels. This implies that its strategy set includes $N_k = C_k L_k$ actions, where at time $t$ each action $I_t^{(k)} = (c_t^{(k)}, l_t^{(k)})$ consists of one channel index (which corresponds to some channel quality) and one power level.
Therefore, the joint action profile of users, $\mathbf{I}_t$, is to be understood here as the pair $(\mathbf{c}_t, \mathbf{l}_t)$, where $\mathbf{c}_t = (c_t^{(1)}, \dots, c_t^{(K)})$ and $\mathbf{l}_t = (l_t^{(1)}, \dots, l_t^{(K)})$. As each channel might be accessible by multiple users, co-channel interference (collision, interchangeably) is likely to arise. Since users are allowed to select a new channel and to adapt their power levels at each transmission trial, the interference pattern in general changes over time. In addition, the distribution of fading coefficients might be time-varying, so that acquiring channel and/or network information at the level of autonomous transmitters would be extremely challenging and inefficient. Therefore, we assume that

(A1) transmitters have no channel knowledge or any other side information such as the number of users or their selected actions;
(A2) in addition, users do not coordinate their actions, which can be chosen completely asynchronously by each user.

Note that as users do not observe the actions of each other, it might be in their interest to select their actions at the beginning of trials, thereby using the remaining time for data transmission.

⁴ Note that the gains of virtual actions cannot be calculated explicitly. Later we will see that the gain achieved by any virtual action $(i \to j)$ is calculated based on the gain achieved by playing true actions $i$ and $j$.

In this paper, we model the joint channel and power level selection problem as a $K$-player adversarial bandit game, where player $k$ decides for one of the $N_k$ actions. We define the expected utility function (reward) of player $k$ to be⁵

$$G_t^{(k)}(\mathbf{I}_t) = \log\left( \frac{l_t^{(k)} \, |h_{kk',t,c^{(k)}}|^2}{\sum_{q=1}^{Q_k} l_t^{(q)} \, |h_{qk',t,c^{(k)}}|^2 + N_0} \right) - \alpha\, l_t^{(k)}, \tag{7}$$

for some given joint action profile $\mathbf{I}_t = (\mathbf{c}_t, \mathbf{l}_t)$. In (7), $Q_k < K$ is the number of players that interfere with user $k$ in channel $c^{(k)}$. Throughout the paper, $|h_{uv,t,c}|^2 \in \mathbb{R}^+$ is used to denote the average gain of channel $c$ between $u$ and $v$ at time $t$. $N_0$ is the variance of zero-mean additive white Gaussian noise, and $\alpha$ is the constant power price factor. The last term in (7) is used to penalize the use of excessive power.

⁵ Throughout the paper, logarithms are base 2 unless otherwise stated.

According to Section II, let $g_t^{(k)}(\mathbf{I}_t) \in [0, 1]$ denote the achieved reward of player $k$ at time $t$, as a function of the joint action profile $\mathbf{I}_t$. We consider a game with noisy rewards where $g_t^{(k)}(\mathbf{I}_t) = G_t^{(k)}(\mathbf{I}_t) + \epsilon_t$, with $\epsilon_t$ being some zero-mean random variable with bounded variance, which is independent and identically distributed over time. As is well known, in a non-cooperative game, the primary goal of each selfish player is to maximize its own accumulated reward. Formally, this can be written as

$$\underset{(c_t^{(k)},\, l_t^{(k)})}{\text{maximize}} \; \sum_{t=1}^{n} g_t^{(k)}(\mathbf{c}_t, \mathbf{l}_t), \tag{8}$$

where $c_t^{(k)} \in \{1, \dots, C_k\}$ and $l_t^{(k)} \in \{1, \dots, L_k\}$. By Assumptions (A1) and (A2), however, it is clear that the objective function in (8) is not available. For this reason, we argue for a less ambitious goal, which is known as regret minimization. More precisely, each player $k$ attempts to achieve vanishing external regret in the sense that

$$\lim_{n \to \infty} \frac{1}{n} R_{\mathrm{Ex}}^{(k)} = \lim_{n \to \infty} \frac{1}{n} \left( \max_{i=1,\dots,N_k} \sum_{t=1}^{n} g_t^{(k)}\big(i, \mathbf{I}_t^{(-k)}\big) - \sum_{t=1}^{n} \bar{g}_t^{(k)}\big(\mathbf{P}_t^{(k)}, \mathbf{I}_t^{(-k)}\big) \right) = 0. \tag{9}$$

In addition to the individual strategy of each user aiming at satisfying (9), all players should achieve some steady state, i.e. equilibrium. Therefore, in the remainder of this paper, we develop algorithmic solutions to the resource allocation problem with a twofold objective in mind: (i) the external regret of each user should vanish asymptotically according to (9), and (ii) the actions of all players should converge to equilibrium. By (4), the external regret of each user is upper-bounded by its internal regret. As a result, if all users select their actions according to some no-regret strategy, not only is (9) achieved by all of them (see also Remark 1), but the corresponding game also converges to equilibrium in some sense, which immediately follows from Theorem 1. In Sections IV and V, we present two internal-regret minimizing strategies that are shown to solve the game and, with it, to achieve the two objectives mentioned above. Both algorithms can be applied in a fully decentralized manner by each player, since at each time, they only require the set of past rewards of the respective player.

Finally, it is worth noting that the set of correlated equilibria for the general time-varying repeated game defined by (7) cannot be characterized. Nevertheless, in what follows, we characterize this set for two games defined by some relaxed versions of (7). First, consider a game similar to the one defined above, with the difference that unlike (7), the reward process is assumed to be stationary, i.e.

$$G^{(k)}(\mathbf{I}) = \log\left( \frac{l^{(k)} \, |h_{kk',c^{(k)}}|^2}{\sum_{q=1}^{Q_k} l^{(q)} \, |h_{qk',c^{(k)}}|^2 + N_0} \right) - \alpha\, l^{(k)}, \tag{10}$$

which implies that the average channel gains are time-invariant. By the following proposition, this game has a unique correlated equilibrium.

Proposition 1. Consider a $K$-player game where the expected reward function of each player $k$ is defined by (10). This game has a unique correlated equilibrium which places probability one on its unique pure-strategy Nash equilibrium.
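To make the stationary reward in (10) concrete, the following minimal sketch evaluates it for one player. The function name, the indexing convention `gain[c][q][k]` (average gain on channel `c` from transmitter `q` to the receiver of pair `k`) and all numbers in the usage comment are assumptions for illustration only. The raw value is not normalized to $[0, 1]$ here.

```python
import math


def expected_utility(k, channels, powers, gain, noise, alpha):
    """Expected reward of player k in the form of (10), with base-2 logarithm.

    channels[q] and powers[q] are the channel index and transmit power of
    player q; only players sharing player k's channel contribute interference.
    """
    c = channels[k]
    interference = sum(
        powers[q] * gain[c][q][k]
        for q in range(len(channels))
        if q != k and channels[q] == c
    )
    sinr = powers[k] * gain[c][k][k] / (interference + noise)
    return math.log2(sinr) - alpha * powers[k]
```

For instance, with two users at power 2, a direct gain of 1.0, a cross gain of 0.5, unit noise and a price factor of 0.1 (all hypothetical), sharing a channel gives an SINR of 1 and a utility of -0.2, whereas moving the interferer to another channel raises the SINR to 2 and the utility to 0.8.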

Proof: See Section IX-B.

Now let the expected reward function be defined as follows:

$$G^{(k)}(\mathbf{I}) = \log\left( \frac{l^{(k)} \, |h_{kk',c^{(k)}}|^2}{N_0} \right) - \alpha\, l^{(k)}, \tag{11}$$

which is more restricted, but simpler than (10). With this choice of expected reward function, the game can be shown to have a unique correlated equilibrium that maximizes the aggregate utility of all players, i.e. the social welfare. This result is stated formally in the following proposition.

Proposition 2. Consider a $K$-player game where each player $k$ has the expected reward function $G^{(k)}$ given by (11). This game has a unique correlated equilibrium which places probability one on a unique pure-strategy Nash equilibrium that maximizes $\sum_{k=1}^{K} G^{(k)}$.

Proof: See Section IX-C.

IV. NO-REGRET BANDIT EXPONENTIAL-BASED WEIGHTED AVERAGE STRATEGY

The basic idea of an exponential-based weighted strategy is to assign each action, at every trial, some selection probability which is inversely proportional to the exponentially-weighted accumulated regret (or directly proportional to the exponentially-weighted accumulated reward) caused by that action in the past [37]. Roughly speaking, if playing an action has resulted in large regret in the past, its future selection probability is small, and vice versa. As described in Section II-A, in the bandit formulation, players only observe the reward of the played action, and not those of others. Therefore the reward of each action $i$ is estimated as [33]

$$\tilde{g}_t^{(k)}(i) = \begin{cases} \dfrac{g_t^{(k)}\big(I_t^{(k)}\big)}{p_{i,t}^{(k)}} & \text{if } I_t^{(k)} = i, \\[2mm] 0 & \text{otherwise,} \end{cases} \tag{12}$$

which is an unbiased estimate of the true reward of action $i$; that is, $\mathbb{E}\big[\tilde{g}_t^{(k)}(i)\big] = g_t^{(k)}(i)$. Estimated rewards are afterwards used to calculate regrets. For example, the regret of not playing action $j$ instead of action $i$ yields

$$\tilde{R}_{(i \to j),t}^{(k)} = \sum_{s=1}^{t} \tilde{r}_{(i \to j),s}^{(k)} = \sum_{s=1}^{t} p_{i,s}^{(k)} \Big( \tilde{g}_s^{(k)}(j) - \tilde{g}_s^{(k)}(i) \Big). \tag{13}$$

Despite exhibiting vanishing external regret, weighted average strategies yield in general large internal regret; as a result, even if all players play according to such strategies, the game does not converge to equilibrium. In the following, we utilize the bandit version of the exponentially weighted average strategy [38], and convert it to an improved version that yields small internal regret, using the approach of Section II-C. The strategy is called no-regret bandit exponentially-weighted average strategy (NR-BEWAS), and is described in Algorithm 1.

Algorithm 1 No-Regret Bandit Exponential-Based Weighted Average Strategy (NR-BEWAS)
1: If the game horizon, $n$, is known, define $\gamma_t$ and $\eta_t$ as given in Proposition 3, otherwise as those given in Proposition 4.
2: Define $\Phi(\mathbf{U}) = \frac{1}{\eta_t} \ln\big( \sum_{i=1}^{N_k} \exp(\eta_t u_i) \big)$, where $\mathbf{U} = (u_1, \dots, u_{N_k}) \in \mathbb{R}^{N_k}$.
3: Let $\mathbf{P}_1^{(k)} = \big(\frac{1}{N_k}, \dots, \frac{1}{N_k}\big)$ (uniform distribution).
4: Select an action using $\mathbf{P}_1^{(k)}$.
5: Play and observe the reward.
6: for $t = 2, \dots, n$ do
7: Let $\mathbf{P}_{t-1}^{(k)}$ be the mixed strategy at time $t-1$, i.e. $\mathbf{P}_{t-1}^{(k)} = \big(p_{1,t-1}^{(k)}, \dots, p_{i,t-1}^{(k)}, \dots, p_{j,t-1}^{(k)}, \dots, p_{N_k,t-1}^{(k)}\big)$.
8: Construct $\mathbf{P}_{t-1}^{(k),(i \to j)}$ as follows: replace $p_{i,t-1}^{(k)}$ in $\mathbf{P}_{t-1}^{(k)}$ by zero, and instead increase $p_{j,t-1}^{(k)}$ to $p_{j,t-1}^{(k)} + p_{i,t-1}^{(k)}$. Other elements remain unchanged. We obtain $\mathbf{P}_{t-1}^{(k),(i \to j)} = \big(p_{1,t-1}^{(k)}, \dots, 0, \dots, p_{j,t-1}^{(k)} + p_{i,t-1}^{(k)}, \dots, p_{N_k,t-1}^{(k)}\big)$.
9: Define
$$\delta_{(i \to j),t}^{(k)} = \frac{\exp\big(\eta_t \tilde{R}_{(i \to j),t-1}^{(k)}\big)}{\sum_{(m \to l): m \ne l} \exp\big(\eta_t \tilde{R}_{(m \to l),t-1}^{(k)}\big)}, \tag{14}$$
where $\tilde{R}_{(i \to j),t-1}^{(k)}$ is calculated by using (12) and (13).
10: Given $\delta_{(i \to j),t}^{(k)}$, solve the following fixed point equation to find $\mathbf{P}_t^{(k)}$:
$$\mathbf{P}_t^{(k)} = \sum_{(i \to j): i \ne j} \mathbf{P}_t^{(k),(i \to j)} \, \delta_{(i \to j),t}^{(k)}. \tag{15}$$
11: Final probability distribution yields
$$\mathbf{P}_t^{(k)} = (1 - \gamma_t)\, \mathbf{P}_t^{(k)} + \frac{\gamma_t}{N_k}. \tag{16}$$
12: Using the final $\mathbf{P}_t^{(k)}$, given by (16), select an action.
13: Play and observe the reward.
14: end for

From Algorithm 1, NR-BEWAS has two parameters, namely $\gamma_t$ and $\eta_t$. In the event that the game horizon, $n$, is known in advance, these two parameters are constant over time ($\eta_t = \eta$ and $\gamma_t = \gamma$), and the growth rate of regret can be bounded precisely, mainly based on the results of [33]. Otherwise, they vary with time. In this case, vanishing (sub-linear in time) internal regret can be guaranteed; nevertheless, this bound might be loose. This discussion is formalized by the following propositions.
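Before turning to the formal statements, one round of Algorithm 1 (steps 7 to 13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the fixed point (15) is approximated by power iteration (one of several possible solvers), and `eta` and `gamma` are fixed illustrative constants rather than the values prescribed by the propositions.

```python
import math
import random


def virtual_weights(regret, eta):
    """Weights (14): softmax of eta * estimated regret over ordered pairs i != j."""
    n = len(regret)
    top = max(regret[i][j] for i in range(n) for j in range(n) if i != j)
    w = [
        [math.exp(eta * (regret[i][j] - top)) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, w))
    return [[x / total for x in row] for row in w]


def fixed_point(delta, iters=200):
    """Approximate the solution of (15) by power iteration.

    Moving the mass of arm i to arm j with weight delta[i][j] leaves P unchanged
    exactly when P is stationary for the chain with those transition weights.
    """
    n = len(delta)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [
            p[m] * (1.0 - sum(delta[m])) + sum(p[i] * delta[i][m] for i in range(n))
            for m in range(n)
        ]
    return p


def nr_bewas_step(regret, eta, gamma, reward_fn, rng=random):
    """One round of steps 7-13 of Algorithm 1; updates `regret` in place."""
    n = len(regret)
    p = fixed_point(virtual_weights(regret, eta))
    p = [(1.0 - gamma) * x + gamma / n for x in p]  # mixing step (16)
    arm = rng.choices(range(n), weights=p)[0]
    reward = reward_fn(arm)
    est = [0.0] * n
    est[arm] = reward / p[arm]  # importance-weighted estimate (12)
    for i in range(n):
        for j in range(n):
            if i != j:
                regret[i][j] += p[i] * (est[j] - est[i])  # increment of (13)
    return arm, p


# Hypothetical usage: 3 arms, only arm 2 pays a reward.
rng = random.Random(1)
regret = [[0.0] * 3 for _ in range(3)]
for _ in range(300):
    arm, p = nr_bewas_step(
        regret, eta=0.05, gamma=0.05,
        reward_fn=lambda a: 1.0 if a == 2 else 0.0, rng=rng,
    )
```

In this toy run the estimated regrets of the pairs $(0 \to 2)$ and $(1 \to 2)$ only grow, so the probability mass drifts toward arm 2, while the $\gamma$-mixing in (16) keeps every arm at a probability of at least $\gamma / N_k$.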

Proposition 3. Let $\eta_t = \eta = \left( \frac{\ln N_k}{2 N_k n} \right)^{2/3}$ and $\gamma_t = \gamma = \left( \frac{N_k^2 \ln N_k}{4n} \right)^{1/3}$. Then Algorithm 1 (NR-BEWAS) yields vanishing internal regret and we have $R_{\mathrm{In}}^{(k)} \in O\big( (n N_k^2 \ln N_k)^{2/3} \big)$.

Proof: See Appendix IX-D.

Proposition 4. Let $\eta_t = \frac{\gamma_t^3}{N_k^2}$ and $\gamma_t = t^{-1/3}$. Then Algorithm 1 (NR-BEWAS) yields vanishing internal regret; that is, we have $R_{\mathrm{In}}^{(k)} \in o(n)$.

Proof: See Appendix IX-E.

The following corollaries follow from the above propositions and Theorem 1.

Corollary 1. If all players play according to NR-BEWAS, then the empirical joint frequencies of play converge to the set of correlated equilibria.

Proof: The proof is a direct consequence of Theorem 1 and Proposition 3 (or Proposition 4).

Corollary 2. Let the $\epsilon$-correlated equilibrium approximate the correlated equilibrium in the sense that $\bigcap_{\epsilon > 0} \mathcal{C}_\epsilon = \mathcal{C}$. Assuming that the game horizon is known and all players play according to NR-BEWAS, the minimum required number of trials to achieve $\epsilon$-correlated equilibrium yields $\max_{k=1,\dots,K} \epsilon^{-3/2}\, O\big( N_k K N_k^2 \ln N_k + K^2 \ln K \big)$, which is proportional to $\epsilon^{-3/2}$ and increases polynomially in the number of actions as well as in the number of players.

Proof: The proof follows from the bound of Proposition 3 and Remark 7.6 of [33].⁶

⁶ Details are omitted to avoid unnecessary restatement of existing analysis.

V. NO-REGRET BANDIT FOLLOW THE PERTURBED LEADER STRATEGY

Similar to the weighted-average strategy presented in the previous section, the strategy follow the perturbed leader is an approach to solve online decision-making problems. In the basic version of this approach, called follow the leader [39], the action with the minimum regret in the past is selected at each trial. However, this method is deterministic and therefore does not achieve vanishing regret against non-oblivious opponents. Therefore, in follow the perturbed leader, the player adds a random perturbation to the vector of accumulated regrets, and the action with the minimum perturbed regret in the past is selected [33]. In [40], a bandit version of this
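algorithm is constructed, where unobserved rewards are estimated. The authors show that the developed algorithm exhibits vanishing external regret. Similar to NR-BEWAS, we here modify the algorithm of [40] to ensure vanishing internal regret. The approach is called no-regret bandit follow the perturbed leader strategy (NR-BFPLS).

Algorithm 2 No-Regret Bandit Follow the Perturbed Leader Strategy (NR-BFPLS)
1: Define $\epsilon_t = \epsilon_n = \sqrt{\frac{\ln n}{3 N_k n}}$ and $\gamma_t = \gamma_n = \min(1, N_k \epsilon_n)$. Note that unlike NR-BEWAS, here we know the game horizon $n$ in advance.
2: Let $\mathbf{P}_1^{(k)} = \big(\frac{1}{N_k}, \dots, \frac{1}{N_k}\big)$ (uniform distribution).
3: Select an action using $\mathbf{P}_1^{(k)}$.
4: Play and observe the reward.
5: for $t = 2, \dots, n$ do
6: Let $\mathbf{P}_{t-1}^{(k)}$ be the mixed strategy at time $t-1$, i.e. $\mathbf{P}_{t-1}^{(k)} = \big(p_{1,t-1}^{(k)}, \dots, p_{i,t-1}^{(k)}, \dots, p_{j,t-1}^{(k)}, \dots, p_{N_k,t-1}^{(k)}\big)$.
7: Construct $\mathbf{P}_{t-1}^{(k),(i \to j)}$ as follows: replace $p_{i,t-1}^{(k)}$ in $\mathbf{P}_{t-1}^{(k)}$ by zero, and instead increase $p_{j,t-1}^{(k)}$ to $p_{j,t-1}^{(k)} + p_{i,t-1}^{(k)}$. Other elements remain unchanged. We obtain $\mathbf{P}_{t-1}^{(k),(i \to j)} = \big(p_{1,t-1}^{(k)}, \dots, 0, \dots, p_{j,t-1}^{(k)} + p_{i,t-1}^{(k)}, \dots, p_{N_k,t-1}^{(k)}\big)$.
8: Calculate $\tilde{R}_{(i \to j),t-1}^{(k)}$ using (12) and (13).
9: Define $\sigma_{(i \to j),t-1} = \Big( \sum_{\tau=1}^{t-1} \frac{1}{\delta_{(i \to j),\tau}^{(k)}} \Big)^{1/2}$, which is the upper bound of the conditional variances of the random variables $\tilde{R}_{(i \to j),t-1}^{(k)}$ [40].
10: Let $\hat{R}_{(i \to j),t-1}^{(k)} = \tilde{R}_{(i \to j),t-1}^{(k)} + \frac{2}{N_k}\, \sigma_{(i \to j),t-1} \ln t$ [40].
11: Randomly select a perturbation vector $\boldsymbol{\mu}_t$ with $N_k(N_k - 1)$ elements from a two-sided exponential distribution with width $\epsilon_t$.
12: Consider a selection rule which selects the action $(i \to j)$ given by
$$\operatorname*{argmax}_{(i \to j) \in \{1, \dots, N_k(N_k-1)\}} \Big\{ \hat{R}_{(i \to j),t-1}^{(k)} + \mu_{(i \to j),t} \Big\}. \tag{17}$$
Note that in our setting $\hat{R}_{(i \to j)}$ denotes the estimated regret of not playing action $(i \to j)$, hence we find the action with the largest $\hat{R}$.
13: From (17), calculate the probability $\delta_{(i \to j),t}^{(k)}$ assigned to each pair $(i \to j)$.
14: Given $\delta_{(i \to j),t}^{(k)}$, solve the following fixed point equation to find $\mathbf{P}_t^{(k)}$:
$$\mathbf{P}_t^{(k)} = \sum_{(i \to j): i \ne j} \mathbf{P}_t^{(k),(i \to j)} \, \delta_{(i \to j),t}^{(k)}. \tag{18}$$
15: Final probability distribution yields
$$\mathbf{P}_t^{(k)} = (1 - \gamma_t)\, \mathbf{P}_t^{(k)} + \frac{\gamma_t}{N_k}. \tag{19}$$
16: Using the final $\mathbf{P}_t^{(k)}$, given by (19), select an action.
17: Play and observe the reward.
18: end for

Algorithm 2 requires the knowledge of the probability assigned to each action by the follow the perturbed leader strategy at every trial. However, in contrast to NR-BEWAS, these probabilities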
16 6 are no assigned explicily; herefore we explain how o calculae hese values. From 7, he selecion probabiliy of virual acion i j {,..., N k N k } is he probabiliy ha Ri j, plus perurbaion µ i j, is larger han hose of oher acions, i.e. Pr[I = i j] = Pr[ R i j, + µ i j, R i j, + µ i j, i j i j ] = = Pr[ R i j, + µ i j, = m R i j, + µ i j, m i j i j ]dm Pr[ R i j, + µ i j, = m] i j i j Pr[ R i j, + µ i j, m]dm. Since µ is disribued according o a wo-sided exponenial disribuion wih widh ɛ n, he erms under inegral can be calculaed easily see [4], for example. Now we are in a posiion o show some properies of NR-BFPLS Algorihm 2. Proposiion 5. Le ɛ = ɛ = ln n 3 N k n and γ = γ = min, N k ɛ. Then Algorihm 2 NR-BFPL yields vanishing inernal regre wih R k In OnN 2 k ln N k 2. Proof: By [4], we know ha if he BPFL algorihm is applied o N k acions, hen R k Ex OnN k ln N k 2. Using his, he proof proceeds along similar lines as he proof of Proposiion 3 and is herefore omied here. Corollary 3. Assuming ha he game horizon is known and all players play according o NR- BFPLS, hen he minimum required number of rials o achieve ɛ-correlaed equilibrium yields max k=,...,k ɛ 2 O N k KN 2 k ln N k + K 2 ln K, which is proporional o ɛ 2 polynomially in he number of acions as well as in he number of players. 2 and increases Proof: The proof is a resul of he bound of Proposiion 5 and Remark 7.6 of [33]. VI. BANDIT EXPERIMENTAL REGRET-TESTING STRATEGY Experimenal regre-esing belongs o he large family of exhausive search algorihms, and is comprehensively discussed in [32] and [33] for bandi games. In his secion, we briefly review his approach, and invesigae is performance laer in Secion VII-A. Firs, he ime is divided ino periods m =, 2,... of lengh T so ha for each m we have

17 7 [m T +, mt ]. A he beginning of period m, any player k randomly selecs a mixed sraegy, denoed by P k m. Moreover, some random variable U m k, {,..., n k,..., N k } is defined as follows. For [m T +, mt ], and for each n k, here are exacly s values of such ha U m k, = n k, and U m k, is seleced o be [38] = for he remaining = T sn k rials. A ime, he acion I k I k : is disribued as P k m equals n k if U m k, =. 2 if U m k, = n k A he end of period m, player k calculaes he experimenal regre of playing each acion n k as [38] ˆr k m,n k = T sn k mt =m T + g k I I { } U m k, = s mt =m T + g k n k, I,k I{ }. 22 U m k, =n k If he regre is smaller han an accepable hreshold ρ, he player coninues o play is curren mixed sraegy. Oherwise, anoher mixed sraegy is seleced. The procedure is summarized in Algorihm 3. I is known ha if he parameers of BERTS e.g. T and ρ are chosen appropriaely, hen, in a long run, he played mixed sraegy profile is an approximae Nash equilibrium for almos all he ime. Deails can be found in [33], and hence are omied. Algorihm 3 Bandi Experimenal Regre Tesing Sraegy [33] BERTS : Se T period lengh, ρ accepable regre hreshold, ξ exploraion parameer, m = period index. Noice ha for each period m =,..., M, we have [m T +, mt ]. according o he uniform disribuion, from he probabiliy simplex wih N k dimensions. 3: For each n k {,.., N k } selec s exploring rials a random. Exploraion rials which are dedicaed o differen acions should no overlap. 4: for = m T + y, where y < T do 5: if is an exploring rial dedicaed o acion i hen 6: play acion i and observe he reward. 7: else 2: Selec a mixed sraegy, P k m 8: selec an acion using P k m. Play and observe he reward. 9: end if : end for : Calculae he experimenal regre of period m, ˆr k 2: if max ˆr m,n k k > ρ, hen n k =,...,N k m,n k, using 22; 3: se m = m +, 2 go o line 2. 4: else 5: wih probabiliy ξ: se m = m +, 2 go o line 2; wih probabiliy ξ: le P k m+ = Pk m, 2 se m = m +, 3 go o line 3. 6: end if

VII. NUMERICAL ANALYSIS

Numerical analysis consists of two parts. In Section VII-A, we consider a simple network, and clarify the work flow of algorithms. In Section VII-B, we consider a larger network, and study the performance of the proposed game model and algorithmic solutions in comparison with some other selection strategies.

A. Part One

1) Network model: The network consists of two transmitter-receiver pairs (users). There exist two orthogonal channels, $C_1$ and $C_2$, and two power levels, $P_1$ and $P_2$. Hence, the action set of each user yields $\{a_1 : (C_1, P_1),\; a_2 : (C_1, P_2),\; a_3 : (C_2, P_1),\; a_4 : (C_2, P_2)\}$. The distribution of channel gains changes at each trial. We assume that the variance of mean values of these distributions is relatively small, which corresponds to low dynamicity.$^{7}$ Channel matrices are
$$H_1 = \begin{bmatrix} [.5,.8] & [.5,.2] \\ [.,.5] & [.,.9] \end{bmatrix} \quad \text{and} \quad H_2 = \begin{bmatrix} [.2,.5] & [.2,.6] \\ [.5,.5] & [.75,.95] \end{bmatrix},$$
where $H_{l,u,v}$, $u, v, l \in \{1, 2\}$, corresponds to the link $u \to v$ through channel $l$, and presents the interval from which the mean value of the distribution of channel gain is selected at each trial. Moreover, we assume $P_1$ =, $P_2$ = 5 and $\alpha$ = 3. Except for their instantaneous rewards, no other information is revealed to users. This information can be provided by the receiver feedback to the transmitter. With these settings, it is easy to see that $((C_1, P_2), (C_2, P_2))$ is the unique pure strategy Nash equilibrium of this game, i.e. the theoretical convergence point.

2) Results and Discussion: We investigate the performance of selection strategies NR-BEWAS, NR-BFPLS and BERTS. The following strategies are also considered as benchmark:
• optimal centralized action (channel and power level) assignment that is based on global statistical channel knowledge and is performed by a central unit.
• uniformly random selection.

Figure 1 compares the average reward achieved by NR-BEWAS and NR-BFPLS with those of random and optimal selections. From the figure, despite being provided with only strictly limited information, both NR-BFPLS and NR-BEWAS exhibit vanishing regret, in the sense that the achieved average reward converges to that of the centralized scenario.
7 Note that this assumption is made in order to simplify the implementation; as established theoretically, all proposed procedures converge to equilibrium for arbitrarily varying distributions.

[Figure 1: two panels, one per user. Vertical axis: average reward. Horizontal axis: trials. Curves: Optimal, NR-BFPLS, NR-BEWAS, Random.]

Fig. 1. Performance of four selection strategies. Both NR-BEWAS and NR-BFPLS exhibit vanishing regret; that is, their average rewards converge to that of optimal centralized selection.

[Figure 2: four panels showing the mixed strategy at four successive values of T.]

Fig. 2. Evolution of the mixed strategy of User 1, applying NR-BEWAS. Horizontal axis denotes the action indices, where index $i$, $i \in \{1, 2, 3, 4\}$, stands for action $a_i$. Vertical axis shows the weight of each action in the mixed strategy, i.e. its probability of being selected. The mixed strategy of User 1 converges to $\pi_1 = (0, 1, 0, 0)$.

Figures 2 and 3 illustrate the evolution of mixed strategies of the two users when NR-BEWAS is used. Figures 4 and 5, on the other hand, show the same variable when actions are selected by using NR-BFPLS. For both cases, the first and second users respectively converge to $a_2 : (C_1, P_2)$ and $a_4 : (C_2, P_2)$, as suggested by the theory.
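The NR-BFPLS mixed strategies shown in Figures 4 and 5 result from steps 11 to 15 of Algorithm 2. The following Python sketch illustrates one way those steps could be carried out, under assumptions that are not taken from the paper: the "width $\epsilon$" of the two-sided exponential perturbation is read as a rate (scale $1/\epsilon$), the selection probabilities of (20) are approximated by Monte Carlo sampling instead of evaluating the integral, and the fixed point (18) is found by simple iteration. All function names are hypothetical, and the estimated regrets `r_hat` are taken as given inputs.

```python
import random

def selection_probs(r_hat, eps, rng, draws=20000):
    """Monte Carlo estimate of delta[(i, j)]: how often the pair (i -> j)
    attains the argmax in (17) under two-sided exponential perturbations."""
    pairs = list(r_hat)
    wins = dict.fromkeys(pairs, 0)
    for _ in range(draws):
        # The difference of two Exp(eps) samples is two-sided exponential (assumed rate eps).
        best = max(pairs, key=lambda p: r_hat[p]
                   + rng.expovariate(eps) - rng.expovariate(eps))
        wins[best] += 1
    return {p: wins[p] / draws for p in pairs}

def fixed_point(delta, n_actions, iters=2000):
    """Iterate P <- sum over (i -> j) of delta[(i, j)] * P^(i -> j), cf. (18)."""
    p = [1.0 / n_actions] * n_actions
    for _ in range(iters):
        new = [0.0] * n_actions
        for (i, j), d in delta.items():
            for a in range(n_actions):
                if a == i:
                    continue                  # entry i of P^(i -> j) is zero
                # entry j receives the mass of i, all other entries are unchanged
                new[a] += d * (p[a] + (p[i] if a == j else 0.0))
        p = new
    return p

def mix_with_uniform(p_bar, gamma):
    """Equation (19): blend the fixed point with uniform exploration."""
    n = len(p_bar)
    return [(1.0 - gamma) * q + gamma / n for q in p_bar]
```

The iteration in `fixed_point` is a power iteration on a stochastic matrix; it is a convenient choice for a sketch, whereas a direct linear solve would also work.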

[Figure 3: four panels showing the mixed strategy at four successive values of T.]

Fig. 3. Evolution of the mixed strategy of User 2, applying NR-BEWAS. The horizontal and vertical axes respectively depict the indices of actions and their selection probabilities. The mixed strategy of User 2 converges to $\pi_2 = (0, 0, 0, 1)$.

[Figure 4: four panels showing the mixed strategy at four successive values of T.]

Fig. 4. Evolution of the mixed strategy of User 1, applying NR-BFPLS. The horizontal and vertical axes respectively depict the indices of actions and their selection probabilities. The mixed strategy of User 1 converges to $\pi_1 = (0, 1, 0, 0)$.

The performance of BERTS, however, is not an explicit function of game duration. As described before, the procedure continues to search mixed strategies until a suitable one, which yields a regret less than the selected threshold, is captured. Then this strategy is played for the rest of the game. Theorem 7.8 of [33] specifies the minimum game duration to guarantee the convergence of BERTS, which is relatively long even for a small number of users and actions. Nevertheless, similar to other search-based algorithms, there also exists the possibility of finding some acceptable strategy at early stages of the game. As a result, for relatively short games, the

performance of BERTS is rather unpredictable. The other issue is the effect of regret threshold. On the one hand, a larger threshold reduces the search time, since the set of acceptable strategies is large. On the other hand, a large regret threshold might lead to performance loss, since there is the possibility that the user gets locked at some sub-optimal strategy at early stages, thereby incurring large accumulated regret. It is worth noting that due to its simplicity, and despite unpredictable performance, BERTS is an appealing approach in cases where computational effort should be minimized, and convergence to Nash equilibrium is desired.

Figure 6 summarizes the results of a few exemplary performances of BERTS. The parameters are selected as $T$ = 8, $M$ = 5 and $\rho$ = .6 (see Section VI). Simulation is performed for six independent rounds. The curve on the left side of Figure 6 depicts the period $m \leq 5$ at which the algorithm finds an acceptable strategy. As expected, the results exhibit no specific pattern. The four sub-figures on the right depict the mixed strategies selected by BERTS at rounds 1 and 2, together with average rewards. From this figure, at round 2, acceptable strategies are found earlier than round 1 by both users, leading to better average performance. It is also worth noting that for User 2, the strategy of round 1 is in essence better than that of round 2; nevertheless, it is found later. As a result, the average performance of round 2 is superior to that of round 1.

[Figure 5: four panels showing the mixed strategy at four successive values of T.]

Fig. 5. Evolution of the mixed strategy of User 2, applying NR-BFPLS. The horizontal and vertical axes respectively depict the indices of actions and their selection probabilities. The mixed strategy of User 2 converges to $\pi_2 = (0, 0, 0, 1)$.

[Figure 6: left panel shows the period at which an acceptable strategy is found versus the simulation round, for both users. Right panels: MS: User 1, Round 1; MS: User 2, Round 1; MS: User 1, Round 2; MS: User 2, Round 2, each annotated with its average reward (A.R.).]

Fig. 6. Performance of BERTS. On the left, the vertical and horizontal axes show the periods and round number, respectively. The two curves depict the period at which a suitable mixed strategy (MS) is found at each of the 6 rounds. On the right, these mixed strategies are shown for both users at rounds 1 and 2, together with average rewards. The horizontal and vertical axes respectively depict the indices of actions and their selection probabilities.

B. Part Two

In this section we consider a wireless network consisting of 5 users (transmitter-receiver pairs), that compete for access to three orthogonal channels at two possible power levels (hence six actions). We compare BFPLS and BEWAS with the following selection approaches.$^{8}$

8 As mentioned before, observing the joint action profile and/or communication among users is not required for implementing BEWAS, BFPLS and BERTS. Therefore, they cannot be compared with strategies that include mutual observation and/or communication. A good example of such algorithms is the widely-used best-response dynamics, where the strategy of each player is to play with the best-response to either the historical [] or the predicted [5] joint action profile of opponents. Another example is the strategy suggested in [2], which is a combination of learning and auction algorithms where users communicate with each other.

• Optimal centralized action assignment as described in Section VII-A2.
• Centralized no-collision action selection, where no reward is assigned to users that access the same channel. Thus, users are encouraged to avoid collisions: a collision-avoidance

strategy. This curve can be considered as an upper-bound for the performance of learning algorithms that select actions based on collision avoidance, such as [2].
• $\epsilon$-greedy algorithm, where at each trial, with probability $\epsilon$ (exploration parameter), an action is selected uniformly at random, while with probability $1 - \epsilon$ the best action so far is played. The average reward of the selected action is updated after each play [42]. For stationary environments, $\epsilon$ is usually time-varying and converges to zero in the limit, while in adversarial cases, $\epsilon$ is preferred to remain fixed. Here we let $\epsilon$ =..
• Greedy approach, where at the beginning of the game, some trials are reserved for exploration, in which actions are selected at random (exploration period). The length of this period is a pre-defined fraction of the entire game duration. Based on the rewards of the exploration period, the best possible action is selected, and is played for the rest of the game (exploitation period) [33]. This approach is extremely simple to implement; however, to the best of our knowledge, there is no analysis on the optimal length of the exploration period.
• Uniformly random selection.

[Figure 7: vertical axis: aggregate average reward. Horizontal axis: Ln(Trials). Curves: Optimal, NR-BFPLS, NR-BEWAS, Greedy, No Collision (upper bound), Epsilon Greedy, Random.]

Fig. 7. Aggregate average reward of BFPLS and BEWAS compared to some other selection strategies.

The numerical results are depicted in Figure 7. From this figure, we can conclude the following.

• The performance of interference-avoidance strategies is strongly influenced by channel matrices and tends to be poor specifically when the number of channels is less than that of users. The reason is that the sum reward of multiple interfering users with limited transmit power might be larger than the maximum achievable reward of any single user.
• The performance of both BFPLS and BEWAS converges to that of the centralized approach. As expected, BFPLS converges faster than BEWAS. We point out that the convergence speed of both algorithms would be dramatically enhanced if some side information was available to players, e.g. if users observed the actions of each other, or if communication was allowed among players. It is also worth noting that although BFPLS converges faster than BEWAS, the computation of integral (20) might be involved, especially for a large number of actions [4].
• In general, $\epsilon$-greedy and greedy approaches can be implemented easily with low computational cost; nevertheless, it can be seen that the greedy approaches are inferior to BEWAS and BFPLS in terms of asymptotic performance. Basically, these approaches are more suitable for stationary environments.

VIII. CONCLUSION AND REMARKS

This paper deals with resource allocation in multi-user infrastructureless wireless networks. The problem of utility maximization has been formulated using the multi-player multi-armed bandit theory framework. More precisely, given no side information, the users aim at minimizing some regret expressed in terms of the loss of reward by selecting appropriate actions on a given space of transmit power levels and orthogonal frequency channels. Based on some recent mathematical results, we have designed two selection strategies, which not only provide vanishing regret for each player, but also guarantee the asymptotic convergence of the game to the set of correlated equilibria. We have also studied the experimental regret testing strategy that asymptotically converges to the set of Nash equilibria.
Numerical results confirm the applicability of the game model and proposed strategies to wireless channel selection and power control.

IX. APPENDIX

A. Some Auxiliary Results

In this section, we state some auxiliary results and materials from game theory as well as bandit theory that are necessary for proofs.

1) Game Theory: Throughout this part, we consider a game $G$ consisting of a set of $K$ players where the strategy set of each player $k \in \{1, \dots, K\}$ is denoted by $\mathcal{I}_k$ with a generic element $i_k = (i_{k,1}, \dots, i_{k,M})$. Similarly, the set of joint strategy profiles of players is denoted by $\mathcal{I}$ with a generic element $\mathbf{i} = (i_1, \dots, i_K)$, and $\mathbf{i}_{-k}$ stands for the joint action profile of all players except for player $k$. Moreover, $g_k(\mathbf{i})$ stands for the utility function of some player $k$.$^{9}$

Definition 4. A game $G$ is smooth if, for each $k \in \{1, \dots, K\}$, $g_k(\mathbf{i})$ has continuous partial derivatives with respect to the components of $i_k$.

Definition 5. Let $\nabla g_k = \left(\frac{\partial g_k}{\partial i_{k,1}}, \dots, \frac{\partial g_k}{\partial i_{k,M}}\right)$, and call $(\nabla g_k)_{k \in \{1, \dots, K\}}$ the payoff gradient of a smooth game $G$. We say that the payoff gradient is strictly monotone if
$$\sum_{k=1}^{K} \left(\nabla g_k(\mathbf{i}) - \nabla g_k(\mathbf{j})\right)^{T} (i_k - j_k) < 0, \tag{23}$$
holds for all $\mathbf{i}, \mathbf{j} \in \mathcal{I}$ with $\mathbf{i} \neq \mathbf{j}$.

Theorem 2 [43]. Consider a smooth game $G$ with compact strategy sets. If the payoff gradient of $G$ is strictly monotone then it has a unique correlated equilibrium, which places probability one on a unique pure-strategy Nash equilibrium.

Definition 6. A game $G$ is potential if there exists a potential function $f : \mathcal{I} \to \mathbb{R}$ such that
$$g_k(i, \mathbf{i}_{-k}) - g_k(j, \mathbf{i}_{-k}) = f(i, \mathbf{i}_{-k}) - f(j, \mathbf{i}_{-k}), \tag{24}$$
for all $i, j \in \mathcal{I}_k$ and $k \in \{1, \dots, K\}$.

Theorem 3 [44]. Let $G$ be a smooth potential game with a strictly concave potential function. Then a strategy profile is the unique pure strategy Nash equilibrium if and only if it is the potential maximizer.

9 Note that compared to the system model some notation has been changed slightly.
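Condition (24) of Definition 6 can be checked mechanically on a small finite game. The Python sketch below is only an illustration: the appendix treats smooth games with continuous strategy sets, whereas the sketch uses finite action sets, and the function name `is_exact_potential` as well as the example payoffs in the usage note are hypothetical.

```python
from itertools import product

def is_exact_potential(payoffs, potential, action_sets, tol=1e-12):
    """Return True if `potential` satisfies (24) for every player k,
    every joint profile, and every unilateral deviation j of player k.

    payoffs[k] and potential map joint profiles (tuples) to real numbers."""
    for k, actions_k in enumerate(action_sets):
        for profile in product(*action_sets):
            for j in actions_k:
                # Replace only player k's action; opponents stay fixed.
                deviated = profile[:k] + (j,) + profile[k + 1:]
                lhs = payoffs[k][profile] - payoffs[k][deviated]
                rhs = potential[profile] - potential[deviated]
                if abs(lhs - rhs) > tol:
                    return False
    return True
```

For instance, if two users each pick one of two channels and both receive the same payoff (say 1.0 on different channels and 0.4 on the same channel, values chosen purely for illustration), that common payoff is itself an exact potential, while a zero-sum matching-pennies game with the first player's payoff as candidate potential fails the check.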

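Lemma 1 [43]. Let $G$ be a smooth potential game. A potential of $G$ is strictly concave if and only if the payoff gradient of $G$ is strictly monotone.

2) Bandit Theory:$^{10}$

Lemma 2. Let $R_n$ and $R_{Ex}$ be given by (1) and (2), respectively. Then, for any $\delta \in (0, \frac{1}{2}]$, we have
$$\Pr\left( |R_n - R_{Ex}| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \leq 2\delta, \tag{25}$$
from which it follows that if $R_n \in o(n)$, then we have $R_{Ex} \in o(n)$, with arbitrarily high probability.$^{11}$

Proof: By comparing (1) and (2), it suffices to show that $\Pr\left( \left| \sum_{t=1}^{n} g_t(I_t) - \sum_{t=1}^{n} \bar{g}_t(P_t) \right| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \leq 2\delta$. To this end, define $S := \sum_{t=1}^{n} g_t(I_t)$, where $g_t(I_t) \in [0, 1]$, $1 \leq t \leq n$, are independent random variables (see also Section II-A). Further note that $\bar{S} = E[S] = \sum_{t=1}^{n} \bar{g}_t(P_t)$. Therefore, by Hoeffding's inequality [33],
$$\Pr\left( |R_n - R_{Ex}| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) = \Pr\left( |S - \bar{S}| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \leq 2 \exp\left( -\frac{2}{n} \cdot \frac{n}{2} \ln \frac{1}{\delta} \right) = 2\delta. \tag{26}$$
Hence the lemma follows with $\Pr\left( |R_n - R_{Ex}| < \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) = 1 - \Pr\left( |R_n - R_{Ex}| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \geq 1 - 2\delta$.

Lemma 3. Let $R_{Ex}$ be given by (2). Moreover, define $\tilde{R}_n = \max_{i=1,\dots,N} \sum_{t=1}^{n} \tilde{g}_t(i) - \sum_{t=1}^{n} \tilde{g}_t(P_t)$, where $\tilde{g}_t(P_t) = \sum_{i=1}^{N} p_{i,t}\, \tilde{g}_t(i)$ and $\tilde{g}_t(i)$ is given by (12). Then we have
$$\Pr\left( |\tilde{R}_n - R_{Ex}| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \leq 2\delta. \tag{27}$$
Hence, for sufficiently small $\delta > 0$, $R_{Ex} \in o(n)$ implies that $\tilde{R}_n \in o(n)$, with arbitrarily high probability.

10 Throughout this section and in order to simplify the notation, the player index $k$ is omitted unless ambiguity arises.

11 Here and hereafter, the statement $X_n \in o(n)$ with arbitrarily high probability for some nonnegative random sequence $X_n \in \mathbb{R}$ means that the probability of $X_n \notin o(n)$ can be made arbitrarily small, provided that some parameter is chosen sufficiently small.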
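The Hoeffding step behind (25) to (27) can be sanity-checked numerically. The sketch below is an illustration under simplifying assumptions that are not part of the paper: the per-trial rewards are modeled as independent Bernoulli variables with given means, and all function names are hypothetical. It confirms that plugging the threshold $\sqrt{\frac{n}{2} \ln \frac{1}{\delta}}$ into the two-sided Hoeffding bound gives $2\delta$, and estimates how often the deviation actually exceeds that threshold.

```python
import math
import random

def deviation_threshold(n, delta):
    """Threshold used in (25): sqrt((n / 2) * ln(1 / delta))."""
    return math.sqrt(0.5 * n * math.log(1.0 / delta))

def hoeffding_bound(n, threshold):
    """Two-sided Hoeffding bound for a sum of n independent variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * threshold ** 2 / n)

def empirical_violation_rate(means, delta, runs, rng):
    """Fraction of simulated runs in which |S - E[S]| reaches the threshold,
    with S a sum of independent Bernoulli rewards (an assumed reward model)."""
    n = len(means)
    a = deviation_threshold(n, delta)
    expected = sum(means)
    hits = 0
    for _ in range(runs):
        s = sum(1.0 if rng.random() < m else 0.0 for m in means)
        if abs(s - expected) >= a:
            hits += 1
    return hits / runs
```

Since Hoeffding's inequality is not tight, the empirical rate is typically well below $2\delta$.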

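Proof: Similar to the proof of Lemma 2, it follows from (2) and the definition of $\tilde{R}_n$ that it is sufficient to show that $\Pr\left( \left| \sum_{t=1}^{n} \tilde{g}_t(P_t) - \sum_{t=1}^{n} \bar{g}_t(P_t) \right| \geq \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right) \leq 2\delta$ for $\delta \in (0, \frac{1}{2}]$. To this end, note that $\tilde{g}_t(P_t) \in [0, 1]$, $1 \leq t \leq n$, are independent random variables. Moreover, since $\tilde{g}_t(i)$ is an unbiased estimate of $g_t(i)$, we have $E\left[\sum_{t=1}^{n} \tilde{g}_t(P_t)\right] = \sum_{t=1}^{n} \bar{g}_t(P_t)$. Hence, defining $S = \sum_{t=1}^{n} \left(\bar{g}_t(P_t) - \tilde{g}_t(P_t)\right)$ and proceeding as in the proof of Lemma 2 with Hoeffding's inequality in hand proves the lemma.

Proposition 6. Let $R_n$ be given by (1) and $\tilde{R}_n$ be defined as in Lemma 3. Then, $R_n \in o(n)$ implies that $\tilde{R}_n \in o(n)$.

Proof: Lemma 2 implies that $R_n \in o(n) \Rightarrow R_{Ex} \in o(n)$ with arbitrarily high probability, while by Lemma 3, we have $R_{Ex} \in o(n) \Rightarrow \tilde{R}_n \in o(n)$. Therefore, if $R_n \in o(n)$, then $\tilde{R}_n \in o(n)$ with arbitrarily high probability.

Theorem 4 [33]. Let $\Phi(U) = \psi\left(\sum_{i=1}^{N} \phi(u_i)\right)$, where $U = (u_1, \dots, u_N)$. Consider a selection strategy, which at time $t$ selects action $I_t$ according to distribution $P_t$, whose elements $p_{i,t}$ are defined as
$$p_{i,t} = (1 - \gamma_t) \frac{\phi'(\tilde{R}_{i,t-1})}{\sum_{k=1}^{N} \phi'(\tilde{R}_{k,t-1})} + \frac{\gamma_t}{N}, \tag{28}$$
where $\tilde{R}_{i,t-1} = \sum_{s=1}^{t-1} \left(\tilde{g}_s(i) - \tilde{g}_s(I_s)\right)$. Assume that:

A1. $\sum_{t=1}^{n} \frac{1}{\gamma_t^{2}} = o\left(\frac{n^{2}}{\ln n}\right)$,

A2. For all vectors $V_t = (v_{1,t}, \dots, v_{N,t})$, with $|v_{i,t}| \leq \frac{N}{\gamma_t}$, we have
$$\lim_{n \to \infty} \frac{1}{\psi(\phi(n))} \sum_{t=1}^{n} C(V_t) = 0, \tag{29}$$
where $C(V_t) = \sup_{U \in \mathbb{R}^{N}} \psi'\left(\sum_{i=1}^{N} \phi(u_i)\right) \sum_{i=1}^{N} \phi''(u_i)\, v_{i,t}^{2}$.

A3. For all vectors $U_t = (u_{1,t}, \dots, u_{N,t})$, with $u_{i,t} \leq t$,
$$\lim_{n \to \infty} \frac{1}{\psi(\phi(n))} \sum_{t=1}^{n} \gamma_t \sum_{i=1}^{N} \nabla_i \Phi(U_t) = 0. \tag{30}$$

A4. For all vectors $U_t = (u_{1,t}, \dots, u_{N,t})$, with $u_{i,t} \leq t$,
$$\lim_{n \to \infty} \frac{\ln n}{\psi(\phi(n))} \sqrt{\sum_{t=1}^{n} \frac{1}{\gamma_t^{2}} \left(\sum_{i=1}^{N} \nabla_i \Phi(U_t)\right)^{2}} = 0. \tag{31}$$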
