1904 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 58, NO. 4, MAY 2009


Learning to Compete for Resources in Wireless Stochastic Games

Fangwen Fu, Student Member, IEEE, and Mihaela van der Schaar, Senior Member, IEEE

Abstract: In this paper, we model the various users in a wireless network (e.g., cognitive radio network) as a collection of selfish autonomous agents that strategically interact to acquire dynamically available spectrum opportunities. Our main focus is on developing solutions for wireless users to successfully compete with each other for the limited and time-varying spectrum opportunities, given the experienced dynamics in the wireless network. To analyze the interactions among users given the environment disturbance, we propose a stochastic game framework for modeling how the competition among users for spectrum opportunities evolves over time. At each stage of the stochastic game, a central spectrum moderator (CSM) auctions the available resources, and the users strategically bid for the required resources. The joint bid actions affect the resource allocation and, hence, the rewards and future strategies of all users. Based on the observed resource allocations and corresponding rewards, we propose a best-response learning algorithm that can be deployed by wireless users to improve their bidding policy at each stage. The simulation results show that by deploying the proposed best-response learning algorithm, the wireless users can significantly improve their own bidding strategies and, hence, their performance in terms of both the application quality and the incurred cost for the used resources.

Index Terms: Delay-sensitive transmission, interactive learning, multiuser resource management, reinforcement learning, stochastic games, wireless networks.

I. INTRODUCTION

DYNAMIC resource management in heterogeneous wireless networks is a challenging problem [3]. The wireless stations and radio systems that must coexist in such a network differ in their individual utility functions, transmission actions, resource demands, and capabilities.
Thus, various levels of strategic(1) interaction and adaptation are necessary to cope with the widely varying dynamics. In this paper, we focus on synthesizing new, dynamic, and informationally decentralized resource-management mechanisms to achieve high utility in competitive and heterogeneous wireless networks (including cognitive radio networks [1]-[3]). Specifically, our focus is on designing associated communication algorithms that enable self-interested autonomous wireless stations to strategically compete for the available spectrum resources in either ISM bands [1] or bands shared with licensed users, according to a priori mandated or negotiated rules. This paper is primarily concerned with the tensions and relationships among autonomous adaptation by secondary (unlicensed) users (SUs), the competition among these users, and the interaction of these users with spectrum moderators having their own goals (e.g., making money, imposing fairness rules, ensuring compliance with the Federal Communications Commission (FCC) [1] and local regulations with respect to primary (licensed) users (PUs), etc.). Unlike previous works on resource management [6], [21], [26], our main focus is on discussing how users can adapt, predict, learn, and determine how they compete for the time-varying resources, as well as how they select the associated transmission strategies, given the experienced dynamics.

Manuscript received August 28, 2007; revised April 17, 2008 and July 1, 2008. First published July 29, 2008; current version published April 22, 2009. This work was supported by the National Science Foundation under CAREER Award CCF. The review of this paper was coordinated by Prof. O. B. Akan. The authors are with the Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA USA (fwfu@ee.ucla.edu; mihaela@ee.ucla.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TVT

(1) By strategic users, we mean users that are not price takers and do not have an a priori consensus on resource allocation.
In wireless networks, these dynamics can be categorized into two types: One is the disturbance due to the environment, and the other is the impact caused by competing users. The disturbance due to the environment results from variations (uncertainties) of the wireless channels or source (e.g., multimedia) characteristics. For example, the stochastic behavior of the PUs, the time-varying channel conditions experienced by the SUs, and the time-varying source traffic that needs to be transmitted by the SUs can be considered as environmental disturbances. These types of dynamics are generally modeled as stationary processes. For instance, the use of each channel by the PUs can be modeled as a two-state Markov chain with an ON-state (the channel is used by PUs) and an OFF-state (the channel is available for the SUs) [7]. The channel conditions can be modeled using a finite-state Markov model [24]. The packet arrival of the source traffic can be modeled as a Poisson process(2) [11]. Conventionally, wireless stations have only considered these environment disturbances when adapting their cross-layer strategies [12] for delay-sensitive transmission. The other type of dynamics, the impact from competing users, which is due to the noncollaborative, autonomous, and strategic SUs in the network transmitting their traffic, is less well studied in wireless communication networks. The goal of this paper is to provide solutions and associated metrics that can be used by an autonomous SU to analyze and predict the outcome of various dynamic interactions among competing SUs in dynamic multiuser communication systems

(2) Other packet arrival models can also be used in our proposed framework.

FU AND VAN DER SCHAAR: LEARNING TO COMPETE FOR RESOURCES IN WIRELESS STOCHASTIC GAMES 1905

and, based on this forecast, adapt and optimize its transmission strategy. In our considered wireless networks, the SUs are modeled as rational and strategic. We model the spectrum management as a stochastic game [22] in which the SUs simultaneously and repeatedly make their own resource bids. The competition for dynamic resources is assisted by a central coordinator (similar to that in existing wireless LAN (WLAN) standards such as the IEEE 802.11e hybrid coordination function (HCF) [13]). We refer to this coordinator as the central spectrum moderator (CSM). The role of the CSM is to allocate resources to the SUs based on the predetermined utility maximization rule.(3) In this paper, to explicitly consider the strategic behavior of autonomous SUs and the informationally decentralized nature of the competition for wireless resources, we assume that the CSM deploys an auction mechanism for dynamically allocating resources. Auction theory has extensively been studied in economics [19], and it has also recently been applied to network resource allocation [4]-[6]. Note that the role of the CSM(4) in our resource management game for our considered wireless networks will be kept to a minimum. Unlike alternative existing solutions [21], the CSM will not require knowledge of the private information of the users and will not perform complex computations for deciding the resource allocation. Its only role will be the implementation of the spectrum etiquette rules, as in [8], and ensuring that the available spectrum holes are auctioned among users. To capture the network dynamics, we allow the CSM to repeatedly auction the available spectrum opportunities based on the PUs' behaviors. Meanwhile, each SU is allowed to strategically adapt its bidding strategy based on information about the available spectrum opportunities, its source and channel characteristics, and the impact of the other SUs' bidding actions.
Using this stochastic wireless allocation framework, we develop a learning methodology for SUs to improve their policies for playing the auction game, i.e., the policies for generating the bids to compete for available resources. Specifically, during the repeated multiuser interaction, the SUs can observe partial historic information of the outcome of the auction game, through which the SUs can estimate the impact on their future rewards and then adopt their best response to effectively compete for channel opportunities. The estimation of the impact on the expected future reward can be performed using different types of interactive learning [18]. In this paper, we focus on reinforcement learning [17], [27], because this allows the SUs to improve their bidding strategy based only on the knowledge of their own past received payoffs, without knowing the bids or payoffs of the other SUs. Our proposed best-response learning algorithm is inspired by Q-learning for a single agent interacting with the environment. Unlike Q-learning, the proposed best-response learning explicitly considers the interactions and coupling among SUs in the wireless network. By deploying the best-response learning algorithm, the SUs can strategically predict the impact of current actions on future performance and then optimally make their resource bids.

This paper is organized as follows. In Section II, we introduce a stochastic game formulation for multiuser interaction in wireless networks. In Section III, we show how a one-stage auction mechanism can be used to divide the spectrum allocation among strategic SUs. In Section IV, we present the state definition, state transition model, and stage reward function for the SUs in the stochastic game.

(3) Other fairness rules can also be deployed in the CSM, such as air-time fairness, utility-based fairness, etc. [12].
(4) It should be noted that this approach can also allow for multiple CSMs to manage the spectrum by fairly dividing their responsibilities, e.g., based on their geolocation or the frequency band in which they are operating, or by competing against each other for the number of SUs that will associate with them.
In Section V, we discuss the bidding strategies of the SUs for playing the stochastic game. In Section VI, we propose a best-response learning approach for the SUs to predict their future rewards based on the observed historic information. In Section VII, we present the simulation results, followed by conclusions and future research in Section VIII.

II. STOCHASTIC GAME FORMULATION FOR DYNAMIC MULTIUSER INTERACTION

We consider a spectrum consisting of N channels, each indexed by j ∈ {1, ..., N}. The N wireless channels are originally licensed to a primary network (PN) whose users (i.e., PUs) exclusively access the channels. In the secondary network (SN), the M (M ≥ N) autonomous SUs, each indexed by i ∈ {1, ..., M} and transmitting delay-sensitive data, compete for the spectrum opportunities released by the PUs in these N channels. Although the available transmission opportunities (TxOps) for SUs depend on the access patterns of the PUs and the detection systems [2], we do not discuss the detection methods in this paper but rather rely on the existing literature for this purpose [3]. Instead, we assume that the available TxOps in each channel change over time due to the PUs joining or leaving the network and can be modeled as a two-state Markov chain, as in [7] and [10]. Our goal is to develop a general framework for multiuser interaction in the SN, where users can compete for dynamically available TxOps. Moreover, we also aim to provide solutions for SUs to improve their strategies for playing the repeated resource-management game by considering their past interactions with other SUs. The communications of the PUs are assumed to follow a synchronous slot structure. The time slot has a length of ΔT seconds. We assume that during each time slot, each channel is either exclusively occupied by PUs or there is no PU accessing the channel [7], [10]. Hence, during each time slot, the channel is in one of the following two states: the ON-state (the channel is currently used by the PUs) or the OFF-state (the channel is not used by the PUs, and hence, the SUs can use this channel).
Note that if this is an unlicensed band, the channel will always be in the OFF-state and can be utilized by the SUs at all times. The TxOp of channel j at time slot t ∈ N is denoted by y_j^t ∈ {0, 1}, where y_j^t is 0 if the channel is in the ON-state and 1 if it is in the OFF-state. In this paper, the TxOp y_j^t of channel j is modeled by a two-state Markov chain with transition probabilities p_j^FN = p(y_j^{t+1} = 0 | y_j^t = 1) and p_j^NF = p(y_j^{t+1} = 1 | y_j^t = 0). The TxOp profile of the N channels is represented by y^t = [y_1^t, ..., y_N^t]. As in [13], we assume that a polling-based medium-access protocol is deployed in the SN, which is arbitrated by a CSM.
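The two-state TxOp chain above is easy to simulate. Below is a minimal sketch (our own illustration; the function names and the seed are assumptions, not from the paper) that samples one channel's ON/OFF trajectory from the transition probabilities p_j^FN and p_j^NF, and computes the chain's long-run fraction of OFF slots.

```python
import random

def simulate_txop(p_fn, p_nf, steps, y0=1, seed=0):
    """Sample one channel's TxOp trajectory y^0, ..., y^steps.

    y = 1 is the OFF-state (free for SUs); y = 0 is the ON-state (used by
    PUs). p_fn = p(y^{t+1}=0 | y^t=1) and p_nf = p(y^{t+1}=1 | y^t=0).
    """
    rng = random.Random(seed)
    y, trace = y0, [y0]
    for _ in range(steps):
        if y == 1:
            y = 0 if rng.random() < p_fn else 1   # free -> occupied by PUs
        else:
            y = 1 if rng.random() < p_nf else 0   # occupied -> released
        trace.append(y)
    return trace

def stationary_off_fraction(p_fn, p_nf):
    """Long-run fraction of OFF slots of the two-state chain."""
    return p_nf / (p_fn + p_nf)
```

For example, with p_fn = p_nf the channel is, in the long run, free for the SUs in half of the time slots.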

Fig. 1. Conceptual overview of the multi-SU interaction in the SN.

The polling policy is only changed at the start of every time slot. For simplicity, we assume that each SU can access a single channel and that each channel can be accessed by a single SU within the time slot. The SUs can switch channels only when crossing time slots. Note that this simple medium-access model, used for illustration in this paper, can easily be extended to more sophisticated models [10], where each SU can simultaneously access multiple channels, the channels are shared by multiple SUs, etc. When using this time-division channel access, we assume that the wireless users deploy constant transmission power and experience no interference. Furthermore, we assume that the wireless users move slowly, and thus, their experienced channel conditions change slowly. During each time slot, an SU needs to first determine how to compete with the other SUs for the time-varying TxOps. This represents its external actions, since they determine the interaction between this SU and the other SUs, and the amount of resources allocated to that SU. The external action at time slot t is denoted by a_i^t ∈ A_i, where A_i is the set of possible external actions available to SU i. Based on the allocated resources, the SU determines how to transmit its traffic (application-layer data) by selecting the various strategies at different layers of the open systems interconnection (OSI) stack (e.g., through cross-layer adaptation [12]). These actions are referred to as internal actions, since they only determine the SU's utility at the current time. The internal action at time slot t is denoted by b_i^t ∈ B_i, where B_i is the set of possible internal actions available to SU i. In this paper, we propose an auction mechanism deployed in the CSM. Hence, the external action a_i^t of SU i is the bid it submits to the CSM. The auction mechanism will be detailed in Section III. The environment experienced by an SU i can be characterized by its current state s_i^t ∈ S_i, which will be discussed in Section IV.
At each time slot t, SU i generates the external action a_i^t to compete for the TxOps y^t. The competition result is ϑ_i^t, based on which SU i performs its internal action b_i^t and obtains the reward r_i^t at this time slot. After packet transmission, SU i transits to the next state s_i^{t+1} ∈ S_i. A conceptual overview of the multi-SU interactions in the repeated auctions is illustrated in Fig. 1. The repeated competition among the SUs can be modeled as a stochastic game [16], [22]. The time slot corresponds to the term "stage," which is commonly used in stochastic games. In the remainder of this paper, we interchangeably use the terms "time slot" and "stage." We define the stochastic game for SN resource allocation as ({S_i, A_i, B_i, O_i, q_i, r_i}_{i=1}^M, Y), where each SU i is associated with a tuple (S_i, A_i, B_i, O_i, q_i, r_i). Specifically, we have the following.

1) Y is a finite set of possible TxOps available for the SUs. In this paper, Y = {0, 1}^N, and y^t ∈ Y is the available TxOp at stage t, which is common information for the SUs.

2) S_i is a finite local state space of SU i. We let S := S_1 × ... × S_M be the global state space of all SUs and S_{-i} := Π_{k≠i} S_k be the global state space of the SUs other than i. At stage t, the global state is denoted by s^t = (s_1^t, ..., s_M^t) = (s_i^t, s_{-i}^t), where -i represents all the SUs other than i.

3) A_i is a finite set of external actions performed by SU i to compete for the available TxOps. The external action vector at stage t for all SUs is a^t = (a_1^t, ..., a_M^t).

4) B_i is a finite set of internal actions performed by SU i to determine the packet transmission.

5) O_i is a finite set of possible outputs from the multi-SU competition. In this paper, the output ϑ_i^t ∈ O_i is the auction result computed by the CSM for SU i at stage t. We will give the specific form of the output in Section III.

6) q_i is the state transition probability for SU i. Thus, q_i(s_i^{t+1}, y^{t+1} | s_i^t, y^t, ϑ_i^t, b_i^t) is the probability that the state of SU i transits from s_i^t to s_i^{t+1} and the TxOp transits from y^t to y^{t+1} if the competition output is ϑ_i^t and the internal action is b_i^t. The reason that the transition probability includes the common TxOp y is that the channel condition transition of SU i depends on the available TxOp.

7) r_i is the stage reward (immediate reward) received by SU i, where r_i : S_i × O_i × B_i → R. It should be noted that the reward function r_i depends on the competition output ϑ_i^t

and, hence, indirectly depends on the other SUs' external actions.

To design a stochastic game for the SN with strategic SUs, we have to consider the following: 1) what auction mechanism can be deployed to resolve the competition among SUs; 2) how the dynamic environment experienced by each SU can be modeled; and 3) how the SUs can forecast the impact of their bids made at the current time on their future performance.

Fig. 2. Information exchange between the CSM and SU i.

III. AUCTION MECHANISM: ONE-STAGE RESOURCE ALLOCATION

In this paper, we assume that the CSM is aware of the TxOp y^t and allocates (through polling the SUs) those channels with y_j^t = 1 to the SUs. To efficiently allocate the available resources (opportunities), the CSM needs to collect information about the SUs [21]. However, as mentioned in Section I, in a wireless network, the information is decentralized, and thus, the information exchange between the SUs and the CSM needs to be kept limited due to the incurred communication cost. On the other hand, the SUs competing with each other are selfish and strategic, and hence, the information they hold is private, and they may not desire to reveal this information to the CSM. Therefore, one of our key interests in this paper is to determine what information should be exchanged between the SUs and the CSM and how this information should be exchanged. In the following, we present an auction mechanism for dynamically coordinating the interactions among the SUs and discuss the computational complexity in the CSM and the communication cost between the SUs and the CSM. First, the CSM announces the auction by broadcasting the TxOp y^t. The SUs receive the announcement and determine the external action (i.e., the bid vector) a_i^t = [a_i1^t, ..., a_iN^t] ∈ R^N based on the announced information and their own private information about the environment they experience, which is discussed in detail in Section IV. Subsequently, each SU submits its bid vector to the CSM.
After receiving the bid vectors from the SUs, the CSM computes the channel allocation z_i^t = [z_i1^t, ..., z_iN^t] ∈ {0, 1}^N for each SU i based on the submitted bids. To compel the SUs to truthfully declare their bids [23], the CSM also computes the payment τ_i^t ∈ R that the SUs have to pay for the use of the resources during the current stage of the game. A negative value of the payment means the absolute value that SU i has to pay the CSM for the used resources. Hence, the competition output ϑ_i^t in this auction mechanism includes the channel allocation z_i^t and the payment τ_i^t, i.e., ϑ_i^t = (z_i^t, τ_i^t). The competition output is then transmitted back to the SUs. The computation of the channel allocation z_i^t and payment τ_i^t is described as follows. After each SU submits its bid vector, the CSM performs two computations, i.e., the channel allocation and the payment computation. Note that most existing multiuser wireless resource allocation solutions can be modeled as such repeated auctions for resources. If the resources are priced or the users may lie about their resource needs, taxes associated with the resource usage will need to be imposed [14]. Otherwise, these taxes can be considered to be zero throughout the paper. We denote the channel allocation matrix Z^t = [z_ij^t]_{M×N}, with z_ij^t being 1 if channel j is assigned to SU i, and 0 otherwise. The feasible set of channel assignments is denoted as

Z = {Z | Σ_{i=1}^M z_ij = y_j, ∀j; Σ_{j=1}^N z_ij ≤ 1, ∀i; z_ij ∈ {0, 1}}.

The channel allocation matrix without the presence of SU i is denoted Z_{-i} = [z_kj]_{(M-1)×N}, and the corresponding feasible set is

Z_{-i} = {Z_{-i} | Σ_{k=1,k≠i}^M z_kj = y_j, ∀j; Σ_{j=1}^N z_kj ≤ 1, ∀k ≠ i; z_kj ∈ {0, 1}}

where -i = {1, ..., i-1, i+1, ..., M}. During the first phase, the CSM allocates the channels to the SUs based on its adopted fairness rule, e.g., maximizing the total social welfare,(5) as

    Z^{t,opt} = arg max_{Z ∈ Z} Σ_{i=1}^M Σ_{j=1}^N z_ij a_ij^t.    (1)
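For small M and N, the allocation rule and the payment rule of this auction can be checked by brute force. The sketch below is our own illustration (the paper solves these problems via linear programming rather than enumeration, and all function names here are ours): it enumerates injective assignments of the available channels to SUs and computes each SU's payment as the welfare the others obtain with SU i present minus the best welfare they could obtain without it.

```python
from itertools import permutations

def best_assignment(bids, free_channels, users):
    """Brute-force the welfare-maximizing injective assignment of each
    available channel to one of `users`. bids[i][j] is SU i's bid on
    channel j. Returns (welfare, {channel: user})."""
    best_w, best_map = float("-inf"), None
    for perm in permutations(users, len(free_channels)):
        w = sum(bids[i][j] for i, j in zip(perm, free_channels))
        if w > best_w:
            best_w, best_map = w, dict(zip(free_channels, perm))
    return best_w, best_map

def auction(bids, y):
    """One stage of the CSM auction: welfare-maximizing allocation and a
    generalized second-price tax, assuming at least as many SUs as there
    are free channels."""
    users = list(range(len(bids)))
    free = [j for j, yj in enumerate(y) if yj == 1]
    _, alloc = best_assignment(bids, free, users)
    payments = []
    for i in users:
        # welfare the other SUs obtain under the chosen assignment
        with_i = sum(bids[k][j] for j, k in alloc.items() if k != i)
        # best welfare the other SUs could obtain if SU i were absent
        without_i, _ = best_assignment(bids, free, [k for k in users if k != i])
        payments.append(with_i - without_i)   # nonpositive: SU i pays the CSM
    return alloc, payments
```

With a single free channel this reduces to the classic second-price auction: the highest bidder wins and pays the second-highest bid.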
If the resources are priced, we will consider in this paper, for illustration, a second-price auction mechanism [19], [23] for determining the tax that needs to be paid by SU i, based on the above optimal channel assignment Z^{t,opt} = [z_ij^{t,opt}]_{M×N}. This tax is equal to

    τ_i^t = Σ_{k=1,k≠i}^M Σ_{j=1}^N z_kj^{t,opt} a_kj^t - max_{Z_{-i} ∈ Z_{-i}} Σ_{k=1,k≠i}^M Σ_{j=1}^N z_kj a_kj^t.    (2)

Note that when N = 1, the generalized auction mechanism presented above becomes the well-known second-price auction [19]. Although the optimization problems in (1) and (2) are discrete optimizations, they can efficiently be solved using linear programming. As argued in [20], the linear optimization problem can be solved in polynomial time, and hence, the CSM only requires limited computational complexity. The information exchange between the CSM and the SUs is illustrated in Fig. 2. From Fig. 2, we note that, at each stage, the CSM first broadcasts the available TxOps to all the SUs for the auction, and then each SU submits its own bid vector over all the available TxOps. After receiving the bids, the CSM computes the auction results and sends back to the users the channel allocations and the corresponding payments. The signaling required for the auction is most often implemented at the application layer. In the worst case, the amount of

(5) Note that fairness solutions other than maximizing the social welfare could be adopted, and this will not influence our proposed solution.

data communicated between the CSM and the SUs is equal to (M + 1)N + nN bits, where n is the number of bits representing the payment for each SU. The amount of data communicated by each SU to the CSM is n'N bits, where n' is the number of bits representing the bid submitted to the CSM on each channel. Compared with traditional one-stage resource allocation methods, our proposed auction mechanism has the following advantages.

1) Unlike traditional centralized resource allocation methods [30], our proposed auction mechanism is not required to know the SUs' utility functions or preferences, which are often the private information of the users and are not common knowledge. In fact, our auction mechanism only requires the SUs to submit their bid vectors for the available TxOps. The bid vector computation is performed by the SUs, not the CSM, based on their utilities, preferences, action sets, experienced environment characteristics, etc.

2) Unlike traditional decentralized resource allocation methods [28], where multiple iterations are required before convergence, our proposed auction mechanism only requires the SUs to submit the bid vectors once. Hence, our proposed auction mechanism is suitable for online resource management. Moreover, we do not assume, as in [29], that users are price takers and that there is consensus about what is a fair distribution of the resources. Instead, in the proposed framework, users are strategic and are able to determine their own bid vectors for resources based on their knowledge, utilities, preferences, etc.

IV. USER MODELING IN THE STOCHASTIC GAME FRAMEWORK

A. Definition of SU States

As discussed in Section I, each SU needs to cope with two types of uncertainties, i.e., disturbances from the environment and interactions with other SUs. The environment is characterized by the packet arrivals from the source (i.e., source/traffic characterization) connected with the transmitter and the channel conditions. In this section, we will illustrate how these disturbances can be modeled.
However, note that other models of the environment existing in the literature can be adopted. The use of a specific model will only affect the performance of the proposed solution and not the general framework for multiuser interaction proposed in this paper. For illustration, we assume that each SU i maintains a buffer with limited size X_i, which can be interpreted as a time window that specifies which packets are considered for transmission at each time based on their delay deadlines. Expired packets are dropped from the buffer. This model has extensively been used for delay-sensitive data transmission, e.g., the leaky-bucket model for video transmission [25]. The number of packets in the buffer at time slot t is denoted as v_i^t (0 ≤ v_i^t ≤ X_i). We assume that the packets arrive from the source at the beginning of each time slot, i.e., v_i^t is only updated at the beginning of a time slot. The number of packets arriving into the buffer during one time slot is a random variable independent of the time t and denoted as χ_i^t. χ_i^t follows the Poisson distribution with the average arrival rate μ_i packets per second [11]. However, note that the Poisson process is simply used for illustration purposes, and other traffic models (e.g., renewal processes, etc.) can also be used in our framework. The average number of packets arriving during one time slot is equal to μ_i ΔT [11]. The condition of channel j experienced by SU i is represented by the signal-to-noise ratio (SNR) and denoted as ρ_ij^t (in decibels). When y_j^t = 1, we assume that the channel condition of each channel can be represented by a set of discrete SNR values, i.e., ρ_ij^t ∈ {σ_ij^1, ..., σ_ij^K}. Note that the number of discrete SNR values K can be determined by SU i by trading off the complexity (a larger K leads to a larger state space) and the resulting impact on performance. When y_j^t = 0, we set ρ_ij^t equal to ∅, which means that the channel is unavailable to the SUs at that time.
As shown in [24], when y_j^{t+1} = 1, the channel condition (in terms of the SNR) can also be modeled as a finite-state Markov chain, where the transition from channel condition σ_ij^l at time t to channel condition σ_ij^k at time t+1 takes place with probability p_ij^{lk}. These transition probabilities can easily be estimated by SU i by repeatedly interacting with the channel. We denote by p_ij^k the probability that the channel condition is σ_ij^k at time t+1, knowing that y_j^t = 0 and y_j^{t+1} = 1. The probability that the channel condition transitions to ∅, knowing that y_j^{t+1} = 0, is 1, no matter in what condition channel j is at time t. Then, the combination (y_j^t, ρ_ij^t) is still a Markov chain, with the state transition probability as in (3), shown at the bottom of the page. To model the dynamics experienced by SU i at time t in the SN, we define a state s_i^t = (v_i^t, ρ_i^t) ∈ S_i, where ρ_i^t = (ρ_i1^t, ..., ρ_iN^t). The state encapsulates the current buffer state as well as the state of each channel. S_i is the set of possible states.(6) The total number of possible states for SU i is equal to |S_i| = (X_i + 1)(K + 1)^N. We will show later in this paper that the state information is sufficient for SU i to compete for resources (make the bid vector) at the current time.

(6) We assume that the channel state and the transmission buffer evolve independently as time goes by.

    p(y_j^{t+1}, ρ_ij^{t+1} | y_j^t, ρ_ij^t) =
        (1 - p_j^FN) p_ij^{lk},  if y_j^t = 1, ρ_ij^t = σ_ij^l, y_j^{t+1} = 1, ρ_ij^{t+1} = σ_ij^k
        p_j^NF p_ij^k,           if y_j^t = 0, y_j^{t+1} = 1, ρ_ij^{t+1} = σ_ij^k
        p_j^FN,                  if y_j^t = 1, ρ_ij^t = σ_ij^l, y_j^{t+1} = 0
        1 - p_j^NF,              otherwise.    (3)

B. State Transition and Stage Reward

We will now discuss the state transition process. Remember that the state of SU i includes the buffer state v_i^t and the channel state ρ_i^t. In this paper, we assume that the channel state transition is independent of the buffer state transition. In the above, we described the transition of the channel state ρ_i^t and the TxOp y^t. The buffer state transition is determined by the number of packets arriving and the channel allocation z_i^t, as well as the internal action b_i^t during that time slot. The number of packets transmitted at stage t is denoted by N_i(s_i^t, z_i^t, b_i^t). Given the channel allocation, SU i can adapt its own internal action to maximize the number of transmitted packets, i.e.,

    n_i(s_i^t, z_i^t) = max_{b_i^t ∈ B_i} N_i(s_i^t, z_i^t, b_i^t).    (4)

The optimization can be performed by a cross-layer adaptation algorithm, as in [5], [12], and [21]. Since our focus is on the multi-SU interaction, we assume that the internal action will always be performed to maximize the number of transmitted packets. We simply use n_i(s_i^t, z_i^t) to represent the number of transmitted packets and omit the internal actions in the following notations. The evolution of the buffer state is captured by v_i^{t+1} = min{(v_i^t - n_i(s_i^t, z_i^t))^+ + χ_i^t, X_i}. We define h = v_i^{t+1} - (v_i^t - n_i(s_i^t, z_i^t))^+. Based on the packet arrival model, the buffer state transition probability is computed as in (5), shown at the bottom of the page. The state transition combined with the TxOps, given the current resource allocation z_i^t, can be computed as

    q_i(s_i^{t+1}, y^{t+1} | s_i^t, y^t, z_i^t) = p_i^buf(v_i^{t+1} | v_i^t, z_i^t) Π_{j=1}^N p(y_j^{t+1}, ρ_ij^{t+1} | y_j^t, ρ_ij^t)    (6)

where the first term represents the buffer state transition, which is independent of the second term, the channel state transition. Based on the channel allocation z_i^t, the SU transmits the available packets in the buffer. In the next time slot, new packets arrive into the buffer. Newly incoming packets may lead to packets already existing in the buffer being dropped whenever the buffer is full or their delay deadline has passed.
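The truncated-Poisson buffer transition described above can be tabulated directly: all arrival mass that would overflow the buffer is lumped onto the full state X_i. A minimal sketch (the helper names are ours, and `lam` stands for the mean number of arrivals per slot, μ_i ΔT):

```python
import math

def poisson_pmf(k, lam):
    """P(chi = k) for chi ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def buffer_transition(v, n, X, lam):
    """Distribution of the next buffer state v' = min((v - n)^+ + chi, X)
    with chi ~ Poisson(lam). Returns {v': probability}."""
    base = max(v - n, 0)                   # packets left after transmission
    probs = {vp: poisson_pmf(vp - base, lam) for vp in range(base, X)}
    probs[X] = 1.0 - sum(probs.values())   # overflow mass lumped at v' = X
    return probs
```

Note that the returned distribution sums to one by construction: the full-buffer state receives the entire Poisson tail.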
Clearly, the performance of the application (e.g., video quality) improves when fewer packets are lost. Hence, we can interpret the negative value of the number of lost packets as the stage gain, which is denoted by g_i^t, i.e., g_i^t(s_i^t, y^t, z_i^t) = -((v_i^t - n_i(s_i^t, z_i^t))^+ + χ_i^t - X_i)^+. The reward at time t for SU i is expressed using the quasi-linear form r_i^t(s_i^t, ϑ_i^t) = g_i^t + τ_i^t. Note that the gain g_i^t and the payment τ_i^t depend on the states and bids of all the competing SUs in the SN. Hence, the reward is also rewritten as r_i^t(s^t, y^t, a^t).

V. BIDDING STRATEGY FOR PLAYING THE STOCHASTIC GAME

A. Best-Response Bidding Policy

In the SN, we assume that the stochastic game is played by all the SUs for an infinite number of stages. This assumption is reasonable for applications having a long duration, such as video streaming. In our network setting, we define a history of the stochastic game up to time t as h^t = {s^0, y^0, a^0, z^0, τ^0, ..., s^{t-1}, y^{t-1}, a^{t-1}, z^{t-1}, τ^{t-1}, s^t, y^t} ∈ H^t, which summarizes all the previous states, available TxOps, and the actions taken by the SUs, as well as the outcomes at each stage of the auction game, and H^t is the set of all possible histories up to time t. However, during the stochastic game, each SU i cannot observe the entire history but rather part of the history h^t. The observation of SU i is denoted as o_i^t ∈ O_i^t, and o_i^t ⊆ h^t. Note that the current state s_i^t can always be observed, i.e., s_i^t ∈ o_i^t. In this paper, we focus on the external action selection for the SUs. The external action selection for SU i to play the stochastic game is also referred to as a bidding policy π_i^t : O_i^t → A_i for SU i at time t and is defined as a mapping from the observations up to the time t into a specific action, i.e., a_i^t = π_i^t(o_i^t). Furthermore, a policy profile π_i for SU i aggregates the bidding policies about how to play the game over the entire course of the stochastic game, i.e., π_i = (π_i^0, ..., π_i^t, ...). The policy profile for all the SUs at time slot t is denoted as π = (π_1, ..., π_M) = (π_i, π_{-i}).
The policy π_i is said to be Markov if the bidding policy π_i^t is, given the current state s_i^t and the current available TxOp y^t, independent of the states, TxOps, and actions prior to time t, i.e., π_i^t(o_i^t) = π_i^t(s_i^t, y^t). The policy π_i is said to be stationary if π_i^t = π_i for all t. The reward r_i(s^t, y^t, a^t) of the stage t is discounted by the factor (α_i)^t at time t. The factor α_i (0 ≤ α_i < 1) is the discount factor determined by a specific application (for instance, for video streaming applications, this factor can be set based on the tolerable delay). The total discounted sum of rewards Q_i(s, y, π) for SU i can be calculated at time t, starting from the state profile s, assuming that all the SUs deploy stationary and Markov policies π = (π_i, π_{-i}), as in (7), shown at the bottom of the next page. The total discounted sum of rewards in (7) consists of two parts: 1) the current stage reward and 2) the expected future reward discounted by α_i. Note that SU i cannot independently determine the above value without explicitly knowing the policies and states of the other SUs. The SU maximizes the total discounted sum of future rewards to select the bidding policy, which explicitly considers

    p_i^buf(v_i^{t+1} | v_i^t, z_i^t) =
        (μ_i ΔT)^h e^{-μ_i ΔT} / h!,             if 0 ≤ h < X_i - (v_i^t - n_i(s_i^t, z_i^t))^+
        Σ_{k=h}^∞ (μ_i ΔT)^k e^{-μ_i ΔT} / k!,   if h = X_i - (v_i^t - n_i(s_i^t, z_i^t))^+    (5)

the impact of the current bid vector on the expected future rewards. We define the best response β_i for SU i to the other SUs' policies π_{-i} as

    β_i(π_{-i}) = arg max_{π_i} Q_i(s, y, (π_i, π_{-i})).    (8)

The central issue in our stochastic game is how the best-response policies can be determined by the SUs. In the repeated auction mechanism discussed in Section III, the procedure that each SU i follows to compete for the channel opportunities is illustrated in Fig. 3. In this procedure, the bidding strategy π_i^t is continuously improved by the bidding strategy improvement module. In Section V-B, we discuss the challenges involved in building such a module, and in Section VI, we develop a best-response learning algorithm that can be used to improve the bidding strategy.

B. Challenges in Selecting the Best-Response Bidding Policy

Recall that during each time slot, the CSM announces an auction based on the available TxOps, and then the SUs bid for the resources. To enable the successful deployment of this resource auction mechanism, we can prove (similar to our prior work in [21]) that the SUs have no incentive to misrepresent their information, i.e., they adhere to the truth-telling policy. We assume that at each time slot, SU i has a preference u_ij^t over channel j, which captures the benefit derived when using that channel. The preference u_ij^t is interpreted as the benefit obtained by SU i when using channel j compared to the benefit when this channel is not used. Note that this benefit also includes the expected future rewards. The optimal bid a_ij^{t,opt} that SU i can make on channel j at time t is the bid maximizing the net benefit u_ij^t + τ_i^t. In the auction discussed in Section III, the optimal bid that SU i can make is a_ij^{t,opt} = u_ij^t, i.e., the optimal bid for SU i is to announce its true preference to the CSM [21]. The proof is omitted here due to space limitations, since it is similar to that in [21]. The payment made by SU i is computed by the CSM based on the inconvenience incurred by the other SUs due to SU i during that time slot [23].
Next, we define the preference u_ij^t in the context of the stochastic game model. Using channel j, SU i obtains the immediate gain g_i(s_i^t, y^t, e_j) by transmitting the packets in its buffer, where e_j indicates that channel j is allocated to SU i during the current time slot. SU i then moves into the next state s_i', from which it may obtain the future reward Q_i(s', y', π). On the other hand, if no channel is assigned to SU i, it receives the immediate gain g_i(s_i^t, y^t, 0) and then moves into the next state s_i', from which it may obtain the future reward Q_i(s', y', π). We define the feasible set of channel assignments to SU i's opponents, given SU i's channel allocation z_i, as Z_{−i}(z_i), with

$$Z_{-i}(z_i) = \Bigl\{ Z_{-i} \,\Big|\, \textstyle\sum_{k=1, k \neq i}^{M} z_{kj} = y_j - z_{ij}\ \forall j,\ \ \sum_{j=1}^{N} z_{kj} \le 1\ \forall k \neq i,\ \ z_{kj} \in \{0, 1\} \Bigr\}.$$

The preference over the current state can then be computed as

$$u_{ij}^t(s^t, y^t) = \Bigl[ g_i(s_i^t, y^t, e_j) + \alpha_i \sum_{s' \in S} \sum_{y' \in \{0,1\}^N} q_i(s_i', y' \mid s_i^t, y^t, e_j) \sum_{Z_{-i} \in Z_{-i}(e_j)} \Bigl[ \prod_{k=1, k \neq i}^{M} q_k(s_k', y' \mid s_k^t, y^t, z_k) \Bigr] Q_i(s', y', \pi) \Bigr] - \Bigl[ g_i(s_i^t, y^t, 0) + \alpha_i \sum_{s' \in S} \sum_{y' \in \{0,1\}^N} q_i(s_i', y' \mid s_i^t, y^t, 0) \sum_{Z_{-i} \in Z_{-i}(0)} \Bigl[ \prod_{k=1, k \neq i}^{M} q_k(s_k', y' \mid s_k^t, y^t, z_k) \Bigr] Q_i(s', y', \pi) \Bigr]. \quad (9)$$

$$Q_i(s, y, \pi) = \sum_{t=0}^{\infty} (\alpha_i)^t\, r_i\bigl(s^t, y^t, \pi(s^t, y^t)\bigr) = \underbrace{r_i\bigl(s, y, \pi(s, y)\bigr)}_{\text{stage reward at time } t} + \underbrace{\alpha_i \sum_{s' \in S} \sum_{y' \in \{0,1\}^N} \Bigl[ \prod_{k=1}^{M} q_k\bigl(s_k', y' \mid s_k, y, z_k(\pi(s, y))\bigr) \Bigr] Q_i(s', y', \pi)}_{\text{expected future reward}} = \underbrace{g_i\bigl(s_i, y, z_i(\pi(s, y))\bigr) + \tau_i\bigl(\pi(s, y)\bigr)}_{\text{stage reward at time } t} + \underbrace{\alpha_i \sum_{s' \in S} \sum_{y' \in \{0,1\}^N} \Bigl[ \prod_{k=1}^{M} q_k\bigl(s_k', y' \mid s_k, y, z_k(\pi(s, y))\bigr) \Bigr] Q_i(s', y', \pi)}_{\text{expected future reward}} \quad (7)$$

Fig. 3. Procedure for SU i to play the auction game at time slot t.

From this equation, it is clear that the true value u_ij^t depends not only on SU i's own current state s_i^t but also on the other SUs' states s_{−i}^t, the channel allocations Z_{−i}(e_j) to the other users when channel j is assigned to SU i, the allocations Z_{−i}(0) when SU i is not assigned any channel, and the state transition models q_k(s_k', y' | s_k^t, y^t, z_k). However, the other SUs' states, the channel allocations, and the state transition models of the other SUs are not known to SU i, and it is thus impossible for each SU to determine its preference u_ij^t(s^t, y^t). Without knowing the other SUs' states and state transition models, SU i cannot derive its optimal bidding strategy a_ij^{t,opt} = u_ij^t(s^t, y^t). Suppose instead that SU i chooses the bid vector by only maximizing the immediate reward g_i^t + τ_i^t, i.e., the total discounted sum of rewards degenerates into Q_i(s, y, π) = g_i(s_i, y, z_i(π(s, y))) + τ_i(π(s, y)) by setting α_i = 0. Then, the preference over channel j becomes u_ij^t(s^t, y^t) = g_i(s_i^t, y^t, e_j) − g_i(s_i^t, y^t, 0). Now, since u_ij^t only depends on the state s_i^t, SU i can compute both the optimal bid vector and the optimal bidding policy. We refer to this optimal bidding policy as the myopic policy, denoted π_i^myopic, since it only takes the immediate reward into consideration and ignores the future impact. To solve the difficult problem of optimal bidding policy selection when α_i ≠ 0, an SU needs to forecast the impact of its current bidding actions on the expected future rewards discounted by α_i. This forecast can be performed by learning from past experience.

VI. INTERACTIVE LEARNING FOR PLAYING THE RESOURCE MANAGEMENT GAME

A. How to Evaluate Learning Algorithms?

Section V-B shows that an SU needs to know the other SUs' states and state transition models to derive its own optimal bidding policy. This coupling among the SUs is due to the shared nature of wireless resources. However, an SU cannot exactly know the other SUs' models and private information in wireless networks.
Thus, to improve the bidding policy, an SU can only predict the impacts of the dynamics (uncertainties) caused by the competing SUs based on its observations from past auctions. In this paper, we propose a learning algorithm for predicting these impacts. We define a learning algorithm L_i for SU i as a function taking the observation o_i^t as input and producing the bidding policy π_i^t as output.
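In code, such a learning algorithm is simply a map from the observation history to a bidding policy. The sketch below (all class and function names are ours, not from the paper) shows this interface, with the myopic policy of Section V-B as a trivial instance that learns nothing and bids the immediate gain difference g_i(e_j) − g_i(0):

```python
class LearningAlgorithm:
    """Interface for L_i: observation history o_i^t -> bidding policy pi_i^t."""

    def update(self, observation):
        """Incorporate one slot's observation (state, TxOps, bid, allocation, tax)."""
        raise NotImplementedError

    def policy(self, state, txops):
        """Return a bid vector over the N channels."""
        raise NotImplementedError


class MyopicBidder(LearningAlgorithm):
    """Baseline: bid the immediate gain difference, ignoring future rewards."""

    def __init__(self, immediate_gain):
        # immediate_gain(state, txops, j) -> gain when channel j is used
        # (j is None for "no channel assigned")
        self.g = immediate_gain

    def update(self, observation):
        pass  # the myopic policy does not learn

    def policy(self, state, txops):
        # u_ij = g_i(s_i, y, e_j) - g_i(s_i, y, 0) for each channel j
        return [self.g(state, txops, j) - self.g(state, txops, None)
                for j in range(len(txops))]


# Toy gain: a channel is worth 1 exactly when it is available.
bidder = MyopicBidder(lambda s, y, j: 0.0 if j is None else float(y[j]))
print(bidder.policy(0, [1, 0, 1]))  # -> [1.0, 0.0, 1.0]
```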

Before developing a learning algorithm, we first discuss how to evaluate the performance of a learning algorithm in terms of its impact on the SU's reward. Unlike existing multiagent learning research, which aims at achieving convergence to an equilibrium point for the interacting agents, we develop learning algorithms based on the performance of the bidding strategy in terms of the SU's reward. We denote a bidding policy generated by the learning algorithm L_i as π_i^{L_i}. An SU will learn to improve its bidding policy and its rewards from participating in the auction game. The performance of the bidding strategy π_i is defined as the time-average reward that SU i obtains in a time window of length T when it adopts π_i, i.e.,

$$V^{\pi_i}(T) = \frac{1}{T} \sum_{t=1}^{T} r_i^t. \quad (10)$$

Using this definition, the performance of two learning algorithms can easily be compared. For instance, given two algorithms L_i and L_i', if V^{π_i^{L_i}} > V^{π_i^{L_i'}}, then we say that the learning algorithm L_i is better than L_i'.

B. What Information to Learn From?

First, let us consider what information the SU can observe while playing the stochastic game in our SN. As shown in Fig. 1, at the beginning of time slot t, the SUs submit the bids a_i^t. Then, the CSM returns the channel allocations z_i^t and taxes τ_i^t. If SU i is not allowed to observe the bids, channel allocations, and payments of the other SUs, then the observation of SU i becomes o_i^t = {s_i^0, y^0, a_i^0, z_i^0, τ_i^0, ..., s_i^{t−1}, y^{t−1}, a_i^{t−1}, z_i^{t−1}, τ_i^{t−1}, s_i^t, y^t}. If the information is exchanged among the SUs or broadcast and overheard by all SUs, the information observed by SU i becomes o_i^t = {s_i^0, y^0, a^0, z^0, τ^0, ..., s_i^{t−1}, y^{t−1}, a^{t−1}, z^{t−1}, τ^{t−1}, s_i^t, y^t}. Now, the problem that needs to be solved by SU i is how it can improve its own policy for playing the game by learning from the observation o_i^t. In this paper, we assume that SU i observes the information o_i^t = {s_i^0, y^0, a_i^0, z_i^0, τ_i^0, ..., s_i^{t−1}, y^{t−1}, a_i^{t−1}, z_i^{t−1}, τ_i^{t−1}, s_i^t, y^t}.

C. What to Learn?
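The comparison rule above is a direct computation on the reward trace; a minimal sketch (function names are ours):

```python
def average_reward(rewards):
    """Time-average reward V^{pi_i}(T) = (1/T) * sum_t r_i^t, as in Eq. (10)."""
    if not rewards:
        raise ValueError("empty reward trace")
    return sum(rewards) / len(rewards)


def better_algorithm(rewards_a, rewards_b):
    """Compare two learning algorithms by the time-average reward of their policies."""
    return "A" if average_reward(rewards_a) > average_reward(rewards_b) else "B"


# Example reward traces over a window of T = 4 time slots.
print(average_reward([1.0, 0.5, 0.75, 0.75]))        # -> 0.75
print(better_algorithm([1.0, 1.0], [0.0, 0.5]))      # -> A
```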
In Section VI-A, we introduced learning as a tool to predict the impacts of the dynamics and, hence, improve the bidding policy. However, a key question is what needs to be learned. Recall that the optimal bidding policy for SU i is to generate a bid vector that represents its preferences for using the different channels. From (9), we can see that SU i needs to learn the following: 1) the state space of the other SUs, i.e., S_{−i}; 2) the current state of the other SUs, i.e., s_{−i}^t; 3) the transition probabilities of the other SUs, i.e., Π_{k≠i} q_k(s_k', y' | s_k^t, y^t, z_k); 4) the resource allocations Z_{−i}(e_j) and Z_{−i}(0); and 5) the discounted sum of rewards Q_i(s, y, π). However, SU i can only observe the information o_i^t = {s_i^0, y^0, a_i^0, z_i^0, τ_i^0, ..., s_i^{t−1}, y^{t−1}, a_i^{t−1}, z_i^{t−1}, τ_i^{t−1}, s_i^t, y^t}, from which it cannot accurately infer the other SUs' state space and transition probabilities. Moreover, capturing the exact information about the other SUs requires heavy computational and storage complexity. Instead, we allow SU i to classify the space S_{−i} into H_i classes, each of which is represented by a representative state s̃_{−i,h}, h ∈ {1, ..., H_i}. We discuss how the space S_{−i} is decomposed in Section VI-D. By dividing the state space S_{−i}, the transition probability Π_{k≠i} q_k(s_k', y' | s_k^t, y^t, z_k) is approximated by q̃_{−i}(s̃_{−i}', y' | s̃_{−i}, y, z_i), where s̃_{−i} and s̃_{−i}' are the representative states of the classes to which s_{−i} and s_{−i}' belong. This approximation is performed by aggregating all the other SUs' states into one representative state and assuming that the transition depends on the resource allocation z_i. The transition probability approximation is also discussed in Section VI-D. The discounted sum of rewards Q_i(s, y, π) is approximated by V_i((s_i, s̃_{−i}), y). Note that the classification of the state space S_{−i} and the approximation of the transition probability and the discounted sum of rewards affect the learning performance. Hence, a user can trade off increased complexity for increased performance.
After the classification, the preference computation can be approximated as

$$u_{ij}^t\bigl((s_i, \tilde{s}_{-i}), y\bigr) = \Bigl[ g_i(s_i, y, e_j) + \alpha_i \sum_{(s_i', \tilde{s}_{-i}') \in (S_i, \tilde{S}_{-i})} \sum_{y' \in \{0,1\}^N} q_i(s_i', y' \mid s_i, y, e_j)\, \tilde{q}_{-i}(\tilde{s}_{-i}', y' \mid \tilde{s}_{-i}, y, e_j)\, V_i\bigl((s_i', \tilde{s}_{-i}'), y'\bigr) \Bigr] - \Bigl[ g_i(s_i, y, 0) + \alpha_i \sum_{(s_i', \tilde{s}_{-i}') \in (S_i, \tilde{S}_{-i})} \sum_{y' \in \{0,1\}^N} q_i(s_i', y' \mid s_i, y, 0)\, \tilde{q}_{-i}(\tilde{s}_{-i}', y' \mid \tilde{s}_{-i}, y, 0)\, V_i\bigl((s_i', \tilde{s}_{-i}'), y'\bigr) \Bigr]. \quad (11)$$

In this setting, to find the approximated preference and, thus, the approximated optimal bidding policy, we need to learn the following from past observations: 1) how the space S_{−i} is classified; 2) the transition probability q̃_{−i}(s̃_{−i}', y' | s̃_{−i}, y, z_i); and 3) the approximated future rewards V_i((s_i, s̃_{−i}), y).

D. How to Learn?

In this section, we develop a learning algorithm to estimate the terms listed in Section VI-C.

1) Decomposition of the Space S_{−i}: As discussed in Section VI-B, only o_i^t = {s_i^0, y^0, a_i^0, z_i^0, τ_i^0, ..., s_i^{t−1}, y^{t−1}, a_i^{t−1}, z_i^{t−1}, τ_i^{t−1}, s_i^t, y^t} is observed. From the auction mechanism presented in Section III, we know that the value of

the tax τ_i^t is computed based on the inconvenience that SU i causes to the other SUs. In other words, a higher value of τ_i^t indicates that the network is more congested.^7 Based on the bid vector a_i^t, the channel allocation z_i^t, and the tax τ_i^t, SU i can infer the network congestion and, thus, indirectly, the resource requirements of the competing SUs. Instead of knowing the exact state space of the other SUs, SU i can classify the space S_{−i} as follows. We assume that the maximum absolute tax is Γ. We split the range [0, Γ] into [Γ_0, Γ_1), [Γ_1, Γ_2), ..., [Γ_{H_i−1}, Γ_{H_i}], with 0 = Γ_0 ≤ Γ_1 ≤ ... ≤ Γ_{H_i} = Γ. Here, we assume that the values of {Γ_1, ..., Γ_{H_i−1}} are equally spaced in the range [0, Γ]. (Note that more sophisticated selections of these values can be deployed, and this forms an interesting area for future research.) We need to consider three cases to determine the representative state s̃_{−i}^t at time t.

1) If the resource allocation z_i^t ≠ 0, then the representative state of the other SUs is chosen as

$$\tilde{s}_{-i}^t = h, \quad \text{if } |\tau_i^t| \in [\Gamma_{h-1}, \Gamma_h). \quad (12)$$

2) If the resource allocation z_i^t = 0 but y^t ≠ 0, the tax is 0. In this case, we cannot use the tax to predict the network congestion. However, we can infer that the congestion is more severe than the minimum bid for the available channels, i.e., min_{j∈{l: y_l^t ≠ 0}} {a_ij^t}. This is because, in the current stage of the auction game, only an SU i' with a_{i'j}^t ≥ a_ij^t can obtain channel j, which indicates that |τ_{i'}^t| ≥ min_{j∈{l: y_l^t ≠ 0}} {a_ij^t} if SU i' is allocated any channel. Then, the representative state of the other SUs is chosen as

$$\tilde{s}_{-i}^t = h, \quad \text{if } \min_{j \in \{l:\, y_l^t \neq 0\}} \{a_{ij}^t\} \in [\Gamma_{h-1}, \Gamma_h). \quad (13)$$

3) If the resource allocation z_i^t = 0 and y^t = 0, there is no interaction among the SUs in this time slot. Hence, s̃_{−i}^t = s̃_{−i}^{t−1}.

^7 When the CSM deploys a mechanism without a tax for resource management, the space classification for the other SUs can still be performed based on the announced information and the corresponding resource allocation.

2) Estimating the Transition Probability: To estimate the transition probability, SU i maintains a table F of size H_i × H_i × (N + 1).
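The three-case classification above can be sketched as follows. This is a toy illustration (all names and the clamping behavior are ours): the interval endpoints Γ_h are equally spaced, and the observed tax, allocation, TxOp vector, and bids are mapped to a representative state h ∈ {1, ..., H}:

```python
def representative_state(tax, z_i, y, bids, prev_state, Gamma=10.0, H=5):
    """Map one slot's observation to a representative opponent state h in {1,...,H}.

    Cases follow (12)-(13): if the SU won a channel, classify by |tax|;
    if channels were available but none was won, classify by the minimum
    bid on the available channels; otherwise keep the previous state.
    """
    width = Gamma / H  # equally spaced interval endpoints Gamma_0 ... Gamma_H

    def interval(v):
        # Clamp v into [0, Gamma] and return its 1-based interval index.
        v = min(max(v, 0.0), Gamma)
        return min(int(v // width) + 1, H)

    if any(z_i):                        # case 1: some channel was allocated
        return interval(abs(tax))
    if any(y):                          # case 2: channels available, none won
        min_bid = min(b for b, avail in zip(bids, y) if avail)
        return interval(min_bid)
    return prev_state                   # case 3: no interaction this slot


# Case 1: tax 3.5 with Gamma=10, H=5 falls in [2, 4) -> state 2.
print(representative_state(3.5, [1, 0], [1, 1], [2.0, 4.0], prev_state=1))  # -> 2
```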
Each entry f_{h,h',j} in the table F represents the number of transitions from state s̃_{−i} = h to state s̃_{−i}' = h' when the resource allocation is z_i = e_j (or 0 if j = 0). It is clear that H_i will significantly influence the complexity, memory requirements, etc., of SU i. The update of F is simply based on the observation o_i^t and the state classification in the above section. Then, we use the frequency to approximate the transition probability [15], i.e.,

$$\tilde{q}_{-i}\bigl(\tilde{s}_{-i}' = h' \mid \tilde{s}_{-i} = h, e_j\bigr) = \frac{f_{h,h',j}}{\sum_{h''} f_{h,h'',j}}. \quad (14)$$

3) Learning the Future Reward: By classifying the state space S_{−i} and estimating the transition probability, SU i can now forecast the value of the average future reward V_i((s_i, s̃_{−i}), y) using learning. Equation (7) can be approximated by (15), shown at the bottom of the page. Similar to the Q-learning established in [17], we also use the received rewards to update the estimate of the future rewards. However, the main difference between our proposed algorithm and Q-learning is that our solution explicitly considers the impacts of the other SUs' bidding actions through the state classification and the transition probability approximation. We use a 3-D table to store the values V_i((s_i, s̃_{−i}), y), with s_i ∈ S_i and s̃_{−i} ∈ S̃_{−i}. The total number of entries in V_i is |S_i| · H_i · 2^N. SU i updates the value of V_i((s_i, s̃_{−i}), y) at time t according to the rule in (16), shown at the bottom of the page, where γ_i^t ∈ [0, 1) is a learning rate factor satisfying Σ_{t=1}^∞ γ_i^t = ∞ and Σ_{t=1}^∞ (γ_i^t)^2 < ∞ [17]. In summary, the learning procedure developed for an SU is shown in Table I.

E. Complexity of Learning

In Section III, we discussed the computational complexity incurred by the CSM and the communication cost between the CSM and the SUs. In this section, we further quantify the complexity of learning in terms of the computational and storage burden. We use a floating-point operation (flop) as a measure of complexity, which will provide us an estimate of
$$Q_i\bigl((s_i^t, \tilde{s}_{-i}^t), y^t, \pi\bigr) = g_i\bigl(s_i^t, y^t, z_i(\pi(s^t, y^t))\bigr) + \tau_i\bigl(\pi(s^t, y^t)\bigr) + \alpha_i \sum_{(s_i', \tilde{s}_{-i}') \in (S_i, \tilde{S}_{-i})} \sum_{y' \in \{0,1\}^N} q_i\bigl(s_i', y' \mid s_i^t, y^t, z_i(\pi(s^t, y^t))\bigr)\, \tilde{q}_{-i}\bigl(\tilde{s}_{-i}', y' \mid \tilde{s}_{-i}^t, y^t, z_i(\pi(s^t, y^t))\bigr)\, V_i\bigl((s_i', \tilde{s}_{-i}'), y'\bigr) \quad (15)$$

$$V_i^t\bigl((s_i, \tilde{s}_{-i}), y\bigr) = \begin{cases} (1 - \gamma_i^t)\, V_i^{t-1}\bigl((s_i, \tilde{s}_{-i}), y\bigr) + \gamma_i^t\, Q_i\bigl((s_i, \tilde{s}_{-i}), y, \pi\bigr), & \text{if } (s_i, \tilde{s}_{-i}) = (s_i^t, \tilde{s}_{-i}^t),\ y = y^t \\ V_i^{t-1}\bigl((s_i, \tilde{s}_{-i}), y\bigr), & \text{otherwise} \end{cases} \quad (16)$$
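The frequency-based transition estimate (14) and the stochastic-approximation value update (16) can be sketched together. This is a simplified illustration (names are ours; the paper's V_i table is indexed by the richer tuple ((s_i, s̃_{−i}), y), collapsed here into an opaque key):

```python
class OpponentModel:
    """Frequency table f[h][h'][j] and the transition estimate of Eq. (14)."""

    def __init__(self, H, N):
        # H representative states, N channels plus the "no channel" index j = 0.
        self.f = [[[0] * (N + 1) for _ in range(H)] for _ in range(H)]

    def observe(self, h, h_next, j):
        """Count one observed transition h -> h_next under allocation e_j."""
        self.f[h][h_next][j] += 1

    def prob(self, h, h_next, j):
        """q(h' | h, e_j) = f[h][h'][j] / sum_h'' f[h][h''][j] (0 if unseen)."""
        total = sum(self.f[h][hp][j] for hp in range(len(self.f)))
        return self.f[h][h_next][j] / total if total else 0.0


def update_value(V, key, q_value, gamma):
    """One step of the update in Eq. (16): V <- (1 - gamma) V + gamma Q."""
    V[key] = (1 - gamma) * V.get(key, 0.0) + gamma * q_value
    return V[key]


m = OpponentModel(H=3, N=2)
m.observe(0, 1, 1)
m.observe(0, 1, 1)
m.observe(0, 2, 1)
print(m.prob(0, 1, 1))                 # -> 2/3, about 0.667
V = {}
print(update_value(V, ("s", "y"), 10.0, 0.5))  # -> 5.0
```

Only the visited entry is updated each slot; with a decaying learning rate γ_i^t, repeated updates average the sampled Q values, mirroring (16).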

TABLE I: LEARNING PROCEDURE

Fig. 4. Bidding strategies based on the required information.

the computational complexity required to perform the learning algorithm. In addition, based on this, we can determine how the complexity grows with an increasing number of SUs [20]. At each stage, the SU performs the classification of the other SUs' states, which, in the worst case, requires approximately N flops. The number of flops to estimate the transition probability of the other SUs' states as in (14) is approximately H_i + 1. The number of flops to learn the future reward is approximately 2|S_i|H_i + 6. Therefore, the total number of flops incurred by the SU is N + H_i + 2|S_i|H_i + 7, from which we note that the complexity of learning for each SU is proportional to the number of possible states of that SU and the number of classes into which the other SUs' state space is decomposed. To perform the learning algorithm, the SU needs to store two tables (i.e., the transition probability table and the state value table), which, in total, have H_i^2(N + 1) + 2^N |S_i| H_i entries. The storage complexity is thus also proportional to the number of possible states of that SU and the number of classes into which the other SUs' state space is decomposed.

VII. SIMULATION RESULTS

In this section, we aim at quantifying the performance of our proposed stochastic interaction and learning framework. We assume that the SUs compete for available spectrum opportunities to transmit delay-sensitive multimedia data. First, we compare the performance of various bidding strategies. Next, we quantify the performance of our proposed learning algorithm in various network environments. We only present several illustrative examples here; the same observations can be obtained using a larger number of SUs or channels.

A. Various Bidding Strategies for Dynamic Multiuser Interaction

In this section, we highlight the merits of the stochastic game framework proposed in Section II by comparing the performance of different SUs deploying different bidding strategies.
The SUs are required to submit a bid vector for the available channels and can deploy different bidding strategies to generate it.

1) Fixed bidding strategy π_i^fixed: This strategy generates a constant bid vector during each stage of the auction game, irrespective of the state that SU i is currently in and of the states the other SUs are in. In other words, π_i^fixed does not consider any of the dynamics defined in Section IV.

2) Source-aware bidding strategy π_i^source: This strategy generates various bid vectors by considering the dynamics in the source characteristics (based on the current buffer state) but not the channel dynamics.

3) Myopic bidding strategy π_i^myopic: This strategy takes into account the disturbance due to the environment as well as the impact caused by the other SUs, as discussed in Section V-B. However, it does not consider the impact on future rewards.

4) Bidding strategy based on best-response learning π_i^{L_i}: This strategy is produced using the learning algorithm proposed in Section VI. π_i^{L_i} considers the two types of dynamics defined in Section IV and the impact of the interaction on future rewards.

In terms of the required information, the above bidding strategies are illustrated in Fig. 4. For instance, the fixed bidding strategy π_i^fixed does not require information about SU i's state or the other SUs' states. The source-aware bidding strategy π_i^source considers

TABLE II: PERFORMANCE OF SU 1 AND SU 2 WITH VARIOUS BIDDING STRATEGIES IN THE TWO-SU NETWORK

Fig. 5. Accumulated packet loss and cost of SU 1 in the five scenarios. (a) Accumulated packet loss over the time slots. (b) Accumulated cost over the time slots.

the source characteristics based on the current buffer state. In contrast, the myopic bidding strategy π_i^myopic requires full information about SU i's state, and the bidding strategy based on best-response learning π_i^{L_i} additionally requires information about the states of the other SUs. In this simulation, we consider an SN that is an extension of WLANs with spectrum-agile capability [9]. In the following, we first simulate the case in which two SUs compete for the channel opportunities and then extend it to the case with multiple (five) SUs.

1) Competition Among Two SUs for Channel Opportunities: We first consider a simple illustrative network with two SUs competing for the available TxOps. The packet arrivals of the SUs are modeled using a Poisson process with the same average arrival rate of 1 Mb/s. For simplicity of illustration, the channel condition of SU 1 (SU 2) on each channel takes only three values (K = 3), which are 18, 23, and 26 dB. The transition probabilities are p_ij^{0→1} = p_ij^{0→2} = 0.4, p_ij^{0→3} = 0.2, p_{1j}^{l→1} = p_{1j}^{l→2} = 0.4, and p_{1j}^{l→3} = 0.2, ∀i, j, l. The transition probability of the availability of the channels to the SUs is p_j^{N→F} = p_j^{F→N} = 0.5. For simplicity of illustration, the environment parameters experienced by the two SUs are the same. The length of the time slot ΔT is 10^{−2} s. In this simulation, we consider five scenarios. In scenario 1, both SU 1 and SU 2 deploy the fixed bidding strategy π_i^fixed. In scenarios 2–5, SU 1 deploys the fixed bidding strategy π_1^fixed, the source-aware bidding strategy π_1^source, the myopic bidding strategy π_1^myopic, and the best-response learning-based bidding strategy π_1^{L_1}, respectively, and SU 2 always deploys the myopic bidding strategy π_2^myopic. The discount factor for the best-response learning algorithm is set to 0.8.
As discussed in Section IV-B, the stage reward is defined as r_i^t = −(g_i^t + τ_i^t), with g_i^t (τ_i^t) being the number of packets lost (the tax charged by the CSM); note that τ_i^t ≥ 0 here. This can be interpreted as the cost incurred at each stage. Similar to (10), we use the average cost over a time window T = 1000 to evaluate the performance of the bidding strategies; hence, the lower the average cost, the better the performance of the bidding strategy. The packet loss rate, average tax, and cost per time slot are presented in Table II. The accumulated packet loss and cost of SU 1 for the five scenarios are plotted in Fig. 5(a) and (b), respectively. From this simulation, comparing scenario 2 with scenario 1, we observe that when SU 2 deploys the myopic strategy against SU 1, which adopted the fixed bidding strategy, SU 2 reduces its average cost by around 42% and its average packet loss rate by around 16.6%. This significant improvement arises because SU 2 can more accurately value the channel opportunities by modeling and considering the dynamics it experiences, i.e., the source characteristics, channel conditions, and channel availability. In scenario 3, SU 1 improves its bidding strategy (i.e., it now deploys a source-aware bidding strategy) by partially considering its experienced environment, i.e., SU 1 generates its bid vector by considering only the source dynamics through

TABLE III: PERFORMANCE OF SU 1–5 WITH VARIOUS BIDDING STRATEGIES IN THE FIVE-SU NETWORK

its current buffer state. Compared with scenario 2, if SU 1 considers more information about its own state, it can further reduce its packet loss rate by an average of 4.5% and its average cost by around 5.4%. This observation verifies that information about the SU's own state improves the bidding strategy. In scenario 4, SU 1 deploys a myopic bidding strategy, which is more advanced than the source-aware bidding strategy since it considers both types of dynamics defined in Section IV (including the dynamics of the source characteristics, channel conditions, and channel availability, and the interaction with the other SUs in the auction mechanism). The significant improvement in terms of packet loss rate (reduced by 13%) and average cost (reduced by 25%), compared with scenario 2, indicates that the myopic bidding strategy provides the optimal bid vector when only current benefits are considered, as shown in Section V-B. In scenario 5, SU 1 further improves its bidding strategy using the best-response learning algorithm developed in Section VI. Using learning, SU 1 reduces the packet loss rate to 15.14% and the average cost to …% lower compared with scenario 4. This significant improvement is due to the ability of the SU to learn and forecast the future impact of its current actions. It is also worth noting that the reduction in SU 1's packet loss rate in scenarios 2–5 comes from two sources: one is the more advanced bidding strategies, which allow the SU to take into consideration more information about its own states and the other SUs' states and, based on this, better forecast the impact of various actions; the other is the increase in the amount of resources consumed by SU 1, which corresponds to a higher tax charged by the CSM, as shown in Table II. We further note that the bidding strategy deployed by SU 1 affects the performance of SU 2.
For example, comparing scenario 2 with scenario 4, the fixed bidding strategy of SU 1 in scenario 2 leads to a lower average cost (reduced by 15%) for SU 2. This is because SU 1 uses a fixed bidding strategy, which does not account for the dynamic changes in its environment, while SU 2 minimizes its current cost (the number of packets lost plus the tax) based on its current state. However, when comparing scenario 5 with scenario 4, SU 1, using learning, not only improves its prediction of the current environment dynamics but also better predicts the impact on the future cost based on its observations. This improvement leads to a higher resource allocation (and, hence, a higher tax, see Table II) for SU 1, thereby resulting in worse performance for SU 2 (i.e., its average cost is increased by 22.2%).

2) Multiple SUs Competing for Channel Opportunities: In this simulation, we consider five SUs competing for the available TxOps in the WLAN-like SN. The packet arrivals of all five SUs are modeled using a Poisson process with the same average arrival rate of 1 Mb/s. The number of channels is 3, and the channel condition of each of the five SUs on each channel takes only three values (K = 3), which are 18, 23, and 26 dB. The transition probabilities are p_ij^{0→1} = p_ij^{0→2} = 0.4, p_ij^{0→3} = 0.2, p_{1j}^{l→1} = p_{1j}^{l→2} = 0.4, and p_{1j}^{l→3} = 0.2, ∀i, j, l. The parameters of the model of the availability of the channels to the SUs are p_j^{N→F} = 0.7 and p_j^{F→N} = 0.3. The length of the time slot ΔT is also 10^{−2} s. Similar parameters are used for the five SUs to clearly illustrate the performance differences obtained with the different strategies. In this simulation, we consider only two scenarios. In scenario 1, all SUs deploy the myopic bidding strategy π_i^myopic, i = 1, 2, ..., 5, whereas in scenario 2, SU 5 deploys the multiuser learning-based bidding strategy π_5^{L_5} with a discount factor of 0.5, and the other SUs deploy the myopic bidding strategy π_i^myopic, i = 1, ..., 4. The packet loss rate and cost per time slot incurred by the SUs are presented in Table III.
The accumulated packet loss and cost of SU 5 for the two scenarios are plotted in Fig. 6(a) and (b), respectively. Similar to the two-SU network, SU 5 significantly reduces its packet loss rate (by 14.6%) and its average cost (by 16.1%) by adopting the best-response learning-based bidding strategy. Fig. 6(a) and (b) further verifies the improvement in performance for SU 5. However, the other SUs' performances decrease, since they now need to compete against a learning SU (i.e., SU 5), which is able to make better bids for the available resources.

B. Multiuser Learning and Delay Impact in a Wireless Test Bed

To validate the performance of multiuser learning and the impact of various delays in a realistic network setting, we considered two SUs competing for the available TxOps in our 802.11a-enabled wireless test bed [31]. The channel condition experienced by the SUs varied between 10 and 30 dB, and we represented this variation using ten states (K = 10). The parameters of the TxOp model are p_j^{N→F} = 0.6 and p_j^{F→N} = 0.4. The length of the time slot ΔT is also 10^{−2} s. The SUs stream delay-sensitive video traffic (e.g., the Mobile sequence encoded using an H.264 video encoder) to their own destinations with an average data rate of 1.5 Mb/s. We compare three scenarios. In scenario 1, both SUs deploy the myopic bidding strategy π_i^myopic, i = 1, 2. In scenario 2, SU 1 deploys the learning-based bidding strategy π_1^{L_1} with a discount factor of 0.5, and SU 2 deploys the myopic strategy π_2^myopic. In scenario 3, both SUs deploy the learning-based bidding strategy π_i^{L_i}, i = 1, 2. In these three scenarios, the video applications are

Fig. 6. Accumulated packet loss and cost of SU 5 in the two scenarios. (a) Accumulated packet loss over the time slots. (b) Accumulated cost over the time slots.

TABLE IV: PERFORMANCE OF SU 1 AND SU 2 WITH VARIOUS BIDDING STRATEGIES IN THE MORE REALISTIC NETWORK

considered to tolerate a delay^8 of 533 ms, which is used in some real-time video streaming applications. In scenario 4, SU 1 deploys the learning-based bidding strategy π_1^{L_1} with a discount factor of 0.5, and SU 2 deploys the myopic strategy π_2^myopic. However, in this scenario, SU 1 streams a video sequence that can only tolerate a delay of 266 ms, which is typical for video conferencing applications. Table IV shows the average video quality in terms of peak SNR (PSNR)^9 and the incurred cost for both SUs under the various scenarios. Comparing scenario 2 with scenario 1, we observe that the SU using the learning-based bidding strategy improves the received video quality by 2.2 dB and reduces the incurred cost by 9.3%. However, as the performance of SU 1 improves, this also results in worse performance for SU 2. This observation is similar to the results in Section VII-A1 and has the same explanation. In scenario 3, both SUs deploy the learning-based bidding strategies and are able to better predict the impact of their current bidding actions on the future cost based on their observations. Thus, compared with scenario 1, the performance of both SUs improves: SU 1 (SU 2) gains 1 dB (1.2 dB) in terms of PSNR and reduces its cost by 4.3% (4.0%). Compared with scenario 2, if SU 2 also deploys the learning-based approach, then SU 2 also observes its estimated future reward and will increase its bid, thereby reducing the performance of SU 1.

^8 During the simulations, for simplicity, we assume that the packets within one Group of Pictures (GOP) have the same delay deadline.

^9 PSNR is a widely adopted metric to objectively measure video quality. A PSNR difference of 1 dB is significant and can be seen by an untrained human observer.
From Table IV, we note that the PSNR of SU 1 decreases by 1.2 dB, whereas the PSNR of SU 2 increases by 2 dB. We also observe that the cost of SU 1 increases by around 5.6%, whereas the cost of SU 2 decreases by 9.1%. In scenario 4, since SU 1 streams a video application with a lower delay deadline, it has to bid more to ensure that packets with a stringent delay deadline are transmitted to the destination; hence, SU 1 incurs a higher transmission cost (increased by 41%) compared with scenario 2. Although SU 1 bids more for the limited available resources, the video quality of SU 1 is reduced by 1.8 dB due to its stringent delay deadline. Interestingly, the stringent delay deadline of SU 1's application also increases the transmission cost of SU 2 and reduces its video quality. This is because the higher bid of SU 1 on the limited resources automatically increases the bid of SU 2.

C. Learning With Imperfect Information

In this section, we consider that SU 1 deploys the learning-based bidding strategy and SU 2 deploys the myopic strategy. The environment parameters are the same as in Section VII-B. To quantify the impact of imperfect information about the environment on the SUs' performance, we assume that SU 1 uses TxOp transition probabilities (p_j^{N→F} = 0.55 and p_j^{F→N} = 0.45) that are slightly different from the true ones (i.e., p_j^{N→F} = 0.6 and p_j^{F→N} = 0.4). Table V shows the PSNRs and corresponding costs of both SUs when SU 1 has perfect or imperfect information about the TxOps. From Table V, we observe that an inaccurate model of the TxOps reduces the performance of SU 1 (i.e., the PSNR decreases by

Vehicle Arrival Models : Headway

Vehicle Arrival Models : Headway Chaper 12 Vehicle Arrival Models : Headway 12.1 Inroducion Modelling arrival of vehicle a secion of road is an imporan sep in raffic flow modelling. I has imporan applicaion in raffic flow simulaion where

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information

Resource Allocation in Visible Light Communication Networks NOMA vs. OFDMA Transmission Techniques

Resource Allocation in Visible Light Communication Networks NOMA vs. OFDMA Transmission Techniques Resource Allocaion in Visible Ligh Communicaion Neworks NOMA vs. OFDMA Transmission Techniques Eirini Eleni Tsiropoulou, Iakovos Gialagkolidis, Panagiois Vamvakas, and Symeon Papavassiliou Insiue of Communicaions

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

J. Cara, J. Juan, E. Alarcón. Abstract: The usual way to perform a forced vibration test is to fix

Lecture 2 October ε-approximation of 2-player zero-sum games

Optimization II, Winter 2009/10. Lecturer: Khaled Elbassioni. Lecture 2, October 19. 1 ε-approximation of 2-player zero-sum games. In this lecture we give a randomized fictitious play algorithm for obtaining an approximate solution
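As a sketch of the plain (deterministic) variant of the fictitious play idea mentioned in this excerpt: on matching pennies, each player best-responds to the opponent's empirical mixture of past play, and the empirical frequencies approach the (1/2, 1/2) mixed equilibrium. The payoff matrix and round count are illustrative choices:

```python
# Row player's payoffs in matching pennies (zero-sum): +1 on a match, -1 otherwise.
A = [[1, -1],
     [-1, 1]]

def fictitious_play(rounds):
    row_counts = [1, 1]  # empirical counts of each player's past actions
    col_counts = [1, 1]  # (initialized at 1 to avoid empty histories)
    for _ in range(rounds):
        # Each side best-responds to the opponent's empirical mixture so far.
        row_br = max(range(2), key=lambda i: sum(A[i][j] * col_counts[j] for j in range(2)))
        col_br = min(range(2), key=lambda j: sum(A[i][j] * row_counts[i] for i in range(2)))
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    n = sum(row_counts)
    return [c / n for c in row_counts]

freqs = fictitious_play(20000)  # empirical row-player frequencies, near [0.5, 0.5]
```

Play itself cycles, but for zero-sum games the time-averaged strategies converge to equilibrium (Robinson's theorem), which is what the assertion below checks.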

Solutions for Assignment 2

Faculty of Arts and Science, University of Toronto. CSC 358 - Introduction to Computer Networks, Winter 2018. Solutions for Assignment 2. Question 1 (2 Points): Go-Back-N ARQ. In this question, we review how Go-Back-N ARQ can be

Notes on Kalman Filtering

Brian Borchers and Rick Aster, November 7. Introduction: Data assimilation is the problem of merging model predictions with actual measurements of a system to produce an optimal estimate of the current
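The data-assimilation idea in this excerpt reduces, in the scalar case, to a few lines of code. This is a minimal sketch of a one-dimensional Kalman filter tracking a nearly constant state; the noise variances and the true value are illustrative, not from the cited notes:

```python
import random

def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant state observed in noise.
    q: process-noise variance, r: measurement-noise variance."""
    x, p = x0, p0
    out = []
    for z in measurements:
        p = p + q            # predict: state model is x_t = x_{t-1} + process noise
        k = p / (p + r)      # Kalman gain: how much to trust the new measurement
        x = x + k * (z - x)  # correct with the measurement residual
        p = (1 - k) * p
        out.append(x)
    return out

rng = random.Random(7)
truth = 1.0
zs = [truth + rng.gauss(0.0, 0.2) for _ in range(300)]  # noisy measurements
est = kalman_1d(zs, q=1e-4, r=0.04)
```

With a small process variance the filter effectively averages many recent measurements, so the final estimate sits much closer to the true value than any single measurement does.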

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Consider a setting similar to the N-stage newsvendor problem except that now there is a fixed re-ordering cost (> 0) for each (re-)order.

. Now define y j = log x j, and solve the iteration.

Problem 1 (Distributed Resource Allocation (ALOHA!)) (Adapted from M&U, Problem 5.11). In this problem, we study a simple distributed protocol for allocating agents to shared resources, wherein agents contend for resources

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Ashutosh Nayyar, Aditya Mahajan and Demosthenis Teneketzis. arXiv:1209.1695v1 [cs.SY] 8 Sep 2012. Abstract: A general model

5. Stochastic processes (1)

Lec05.ppt. S-38.145 - Introduction to Teletraffic Theory, Spring 2005. Contents: basic concepts, Poisson process. Stochastic processes (1): Consider some quantity in a teletraffic (or any) system. It typically evolves in time randomly

Simulating models with heterogeneous agents

Wouter J. Den Haan, London School of Economics. © by Wouter J. Den Haan. Individual agent subject to employment shocks (ε_{i,t} ∈ {0, 1}). Incomplete markets: the only way to save is through

STATE-SPACE MODELLING. A mass balance across the tank gives:

B. Lennox and N.F. Thornhill, 2009, State Space Modelling, IChemE Process Management and Control Subject Group Newsletter. Introduction: Over the past decade or so there has been an ever increasing

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011 Mainenance Models Prof Rober C Leachman IEOR 3, Mehods of Manufacuring Improvemen Spring, Inroducion The mainenance of complex equipmen ofen accouns for a large porion of he coss associaed wih ha equipmen

More information

Random Walk with Anti-Correlated Steps

John Noga, Dirk Wagner. Abstract: We conjecture the expected value of random walks with anti-correlated steps to be exactly . We support this conjecture with 2 plausibility arguments and

SUPPLEMENTARY INFORMATION

DOI: 10.1038/NCLIMATE893. Temporal resolution and DICE: Supplemental Information. Alex L. Marten and Stephen C. Newbold, National Center for Environmental Economics, US Environmental Protection

Energy Storage Benchmark Problems

Daniel F. Salas 1,3, Warren B. Powell 2,3. 1 Department of Chemical & Biological Engineering; 2 Department of Operations Research & Financial Engineering; 3 Princeton Laboratory

Block Diagram of a DCS in 411

Block-diagram labels: information source, format, A/D, from other sources, pulse modulation, multiplex, bandpass modulation, channel impulse response h, digital input, digital output, timing and synchronization, digital baseband/bandpass

Robust estimation based on the first- and third-moment restrictions of the power transformation model

International Congress on Modelling and Simulation, Adelaide, Australia, December 2013, www.mssanz.org.au/modsim2013. Robust estimation based on the first- and third-moment restrictions of the power transformation model. Nawata,

Article from Predictive Analytics and Futurism, July 2016, Issue 13

An Introduction to Incremental Learning, by Qiang Wu and Dave Snell. Machine learning provides useful tools for predictive analytics. The typical machine learning

CENTRALIZED VERSUS DECENTRALIZED PRODUCTION PLANNING IN SUPPLY CHAINS

Georges SAHARIDIS, Yves DALLERY, Fikri KARAESMEN. Ecole Centrale Paris, Department of Industrial Engineering (LGI). saharidis,dallery@lgi.ecp.fr

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICIAL NEURAL NETWORK

10.1 INTRODUCTION. Amongst the research work performed, the best results of experimental work are validated with an Artificial Neural Network. From the

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3 Macroeconomic Theory Ph.D. Qualifying Examinaion Fall 2005 Comprehensive Examinaion UCLA Dep. of Economics You have 4 hours o complee he exam. There are hree pars o he exam. Answer all pars. Each par has

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Jesús Fernández-Villaverde and Oren Levintal. In this Online Appendix, we present the Euler conditions of the model, we develop the pricing Calvo block,

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

AA WALASZEK-BABISZEWSKA, Department of Computer Engineering, Opole University of Technology

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Alpaydin Chapter, Mitchell Chapter 7. Alpaydin slides are in turquoise. Ethem Alpaydin, copyright: The MIT Press, 2010. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e. All other slides are based on Mitchell.

A Dynamic Model of Economic Fluctuations

CHAPTER 15: A Dynamic Model of Economic Fluctuations. Modified for ECON 2204 by Bob Murphy. © 2016 Worth Publishers, all rights reserved. IN THIS CHAPTER, YOU WILL LEARN: how to incorporate dynamics into the AD-AS model

Lecture 4 Notes (Little's Theorem)

This lecture concerns one of the most important (and simplest) theorems in queuing theory, Little's Theorem. More information can be found in the course book, Bertsekas & Gallager,
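Little's theorem itself is one line, L = λW: the mean number in the system equals the arrival rate times the mean time spent in the system. A minimal sketch, cross-checked against the closed-form M/M/1 formulas (the rates chosen are illustrative):

```python
def little_L(arrival_rate, mean_time_in_system):
    """Little's theorem: mean number in system L = lambda * W."""
    return arrival_rate * mean_time_in_system

# For a stable M/M/1 queue (lam < mu): W = 1/(mu - lam) and L = lam/(mu - lam),
# so the direct formula and Little's theorem must agree.
lam, mu = 3.0, 5.0
W = 1.0 / (mu - lam)          # mean time in system
L_direct = lam / (mu - lam)   # mean number in system
L_little = little_L(lam, W)
```

The agreement is exact because the M/M/1 expressions are themselves derived consistently with L = λW; the theorem holds far more generally, for any stationary queueing system.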

Cash Flow Valuation Model in Discrete Time

IOSR Journal of Mathematics (IOSR-JM), e-ISSN: 2278-5728, p-ISSN: 2319-765X, Vol. 6, Issue 6 (May-Jun. 2013), PP 35-41. Cash Flow Valuation Model in Discrete Time. Olayiwola M. A. and Oni, N. O., Department of Mathematics

Learning Objectives: Practice designing and simulating digital circuits including flip flops Experience state machine design procedure

Lab 4: Synchronous State Machine Design. Summary: Design and implement synchronous state machine circuits and test them with simulations in Cadence Virtuoso. Learning Objectives: Practice designing and simulating digital

On-line Adaptive Optimal Timing Control of Switched Systems

X.C. Ding, Y. Wardi and M. Egerstedt. Abstract: In this paper we consider the problem of optimizing over the switching times for a multi-modal dynamic system when

Air Traffic Forecast Empirical Research Based on the MCMC Method

Computer and Information Science, Vol. 5, No. 5; 2012. ISSN 1913-8989, E-ISSN 1913-8997. Published by Canadian Center of Science and Education. Air Traffic Forecast Empirical Research Based on the MCMC Method. Jian-bo Wang,

Competitive and Cooperative Inventory Policies in a Two-Stage Supply-Chain

(G. P. Cachon and P. H. Zipkin.) Presented by Shrutivandana Sharma. IOE 64, Supply Chain Management, Winter 2009, University of Michigan, Ann

Christos Papadimitriou & Luca Trevisan November 22, 2016

U.C. Berkeley CS170: Algorithms. Handout LN-11-22. Christos Papadimitriou & Luca Trevisan, November 22, 2016. Streaming algorithms: In this lecture and the next one we study memory-efficient algorithms that process a stream

20. Applications of the Genetic-Drift Model

1) Determining the probability of forming any particular combination of genotypes in the next generation. Example: If the parental allele frequencies are p_0 = 0.35 and q_0

Final Spring 2007

.615 Final, Spring 2007. Overview: The purpose of the final exam is to calculate the MHD β limit in a high-beta toroidal tokamak against the dangerous n = 1 external ballooning-kink mode. Effectively, this corresponds to

Stability and Bifurcation in a Neural Network Model with Two Delays

International Mathematical Forum, Vol. 6, 2011, no. 35, 1725-1731. GuangPing Hu and XiaoLing Li, School of Mathematics and Physics, Nanjing University

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562, Fall 2005. Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators, by Professor Scott H. Irwin. Required Reading: Griffiths, Hill and Judge, "Inference in the Simple

10. State Space Methods

10.1 Introduction: State space modelling was briefly introduced in chapter 1. Here more coverage is provided of state space methods before some of their uses in control system design are covered in the

Games Against Nature

Advanced Course in Machine Learning, Spring 2010. Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz. In the previous lectures we talked about experts in different setups and analyzed

Chapter 2. First Order Scalar Equations

We start our study of differential equations in the same way the pioneers in this field did. We show particular techniques to solve particular types of first order differential equations.

Linear Response Theory: The connection between QFT and experiments

Phys540.nb, p. 39. 3 Linear Response Theory: The connection between QFT and experiments. 3.1 Basic concepts and ideas. Q: How do we measure the conductivity of a metal? A: We first introduce a weak electric field E, and

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Displacement, Velocity and Acceleration. Everything in the world is moving. Nothing stays still. Motion occurs at all scales of the universe, starting from the motion of electrons in

Mean-square Stability Control for Networked Systems with Stochastic Time Delay

JOURNAL OF SIMULATION, VOL. 5, May 2017. YAO Hejun, YUAN Fushun, School of Mathematics and Statistics, Anyang Normal University, Anyang, Henan

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Journal of Industrial Engineering, University of Tehran, Special Issue, PP. 69-77. Fariborz Jolai, Elmira Gheisariha and Farnaz Nojavan

Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Capacity Constraints

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No. 1, January 2012, www.ijcsi.org. Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Capacity

Testing for a Single Factor Model in the Multivariate State Space Framework

Chen C.-Y., M. Chiba and M. Kobayashi, International Graduate School of Social Sciences, Yokohama National University, Japan; Faculty of Economics

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED. Maximum likelihood estimation is a best-fit statistical method for the estimation of the values of the parameters of a system, based on a set of observations of a random variable

The electromagnetic interference in case of onboard navy ships computers - a new approach

The electromagnetic interference in case of onboard navy ships computers - a new approach The elecromagneic inerference in case of onboard navy ships compuers - a new approach Prof. dr. ing. Alexandru SOTIR Naval Academy Mircea cel Bărân, Fulgerului Sree, Consanţa, soiralexandru@yahoo.com Absrac.

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game Sliding Mode Exremum Seeking Conrol for Linear Quadraic Dynamic Game Yaodong Pan and Ümi Özgüner ITS Research Group, AIST Tsukuba Eas Namiki --, Tsukuba-shi,Ibaraki-ken 5-856, Japan e-mail: pan.yaodong@ais.go.jp

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

Lecture Notes 5: Investment

Zhiwei Xu (xuzhiwei@sjtu.edu.cn). Investment decisions made by firms are among the most important behaviors in the economy. As investment determines how capital accumulates over time,

Lecture 3: Exponential Smoothing

NATCOR: Forecasting & Predictive Analytics. Lecture 3: Exponential Smoothing. John Boylan, Lancaster Centre for Forecasting, Department of Management Science. Methods and Models. Forecasting Method: a (numerical) procedure
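Simple exponential smoothing, the subject of this lecture excerpt, is a one-line recursion: the smoothed level is a weighted average of the newest observation and the previous level. A minimal sketch; the demand series and α = 0.5 are illustrative:

```python
def ses(series, alpha):
    """Simple exponential smoothing: level_t = alpha*y_t + (1-alpha)*level_{t-1}.
    Returns the smoothed levels; the last one is the next-period forecast."""
    level = float(series[0])   # initialize the level at the first observation
    levels = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels

demand = [10, 12, 11, 13, 12, 14]
levels = ses(demand, alpha=0.5)
forecast = levels[-1]  # one-step-ahead forecast for the next period
```

Larger α reacts faster to recent demand; smaller α smooths more aggressively. With α = 0.5 each step is simply the midpoint of the new observation and the old level.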

Introduction: Gordon Model (1962)

Gordon Model (1962): D/P = r − g, where r = constant discount rate and g = constant dividend growth rate. If rational expectations of future discount rates and dividend growth vary over time, so should the D/P ratio. Since
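The Gordon relation above follows from the constant-growth pricing formula P = D/(r − g), which rearranges to the dividend yield D/P = r − g. A minimal sketch with illustrative numbers (not from the excerpt):

```python
def gordon_price(next_dividend, r, g):
    """Gordon (1962) constant-growth model: P = D / (r - g), valid only for r > g."""
    if r <= g:
        raise ValueError("Gordon model requires r > g")
    return next_dividend / (r - g)

D, r, g = 2.0, 0.08, 0.03
P = gordon_price(D, r, g)   # 2.0 / 0.05
dividend_yield = D / P      # the model implies D/P = r - g
```

The model breaks down as g approaches r (the price diverges), which is why the guard clause rejects r ≤ g.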

Orientation. Connections between network coding and stochastic network theory. Outline. Bruce Hajek. Multicast with lost packets

Connections between network coding and stochastic network theory. Bruce Hajek. Orientation: On Thursday, Ralf Koetter discussed network coding: coding within the network. Abstract: Randomly generated coded information blocks

Expert Advice for Amateurs

Ernest K. Lai. Online Appendix - Existence of Equilibria. The analysis in this section is performed under more general payoff functions. Without taking an explicit form, the payoffs of the

Object tracking: Using HMMs to estimate the geographical location of fish

02433 - Hidden Markov Models. Martin Wæver Pedersen, Henrik Madsen. Course week 13. MWP, compiled June 8, 2011. Objective: Locate fish from tagging

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

ETHEM ALPAYDIN, The MIT Press, 2014. Lecture Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml3e. CHAPTER 2: SUPERVISED LEARNING. Learning a Class

A Reinforcement Learning Approach for Collaborative Filtering

Jungkyu Lee 1, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2. 1 Cyram Inc, Seoul, Korea, jklee@cyram.com. 2 Sogang University, Seoul, Korea, {mrfive,yangjh,parksy}@sogang.ac.kr

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

(Modified from Cox, General Ecology Lab Manual, McGraw Hill.) Last week we estimated population size through several methods. One assumption of all these

Matlab and Python programming: how to get started

Equipping readers with the skills to write programs to explore complex systems and discover interesting patterns from big data is one of the main goals of this book. In this chapter,

Economics 8105 Macroeconomic Theory Recitation 6

Economics 8105 Macroeconomic Theory Recitation 6 Economics 8105 Macroeconomic Theory Reciaion 6 Conor Ryan Ocober 11h, 2016 Ouline: Opimal Taxaion wih Governmen Invesmen 1 Governmen Expendiure in Producion In hese noes we will examine a model in which

More information

Policy regimes Theory

Policy regimes Theory Advanced Moneary Theory and Policy EPOS 2012/13 Policy regimes Theory Giovanni Di Barolomeo giovanni.dibarolomeo@uniroma1.i The moneary policy regime The simple model: x = - s (i - p e ) + x e + e D p

More information

Nature Neuroscience: doi:10.1038/nn. Supplementary Figure 1: Spike-count autocorrelations in time.

Normalized autocorrelation matrices are shown for each area in a dataset. The matrix shows the mean correlation of the spike count in each time bin with the spike

Numerical Dispersion

Review of Linear Numerical Stability: Numerical Dispersion. In the previous lecture, we considered the linear numerical stability of both advection and diffusion terms when approximated with several spatial and temporal

Lecture Notes 3: Quantitative Analysis in DSGE Models: New Keynesian Model

Zhiwei Xu, Email: xuzhiwei@sjtu.edu.cn. Monetary policy plays little role in the basic monetary model without price stickiness. We now turn

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME

Moving Averages. Recall that a white noise process is a series {ε_t} having variance σ². The white noise process has spectral density f(λ) =

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Ulrich Kohli, University of Geneva, December 2015. Needed: A Theory of Total Factor Productivity, Edward C. Prescott (1998). 1. Introduction: Total Factor Productivity (TFP) has become

3.1 More on model selection

3. More on model selection. 3. Comparing models: AIC, BIC, Adjusted R squared. 3. Overfitting problem. 3.3 Sample splitting. 3. More on model selection criteria: Often after model fitting you are left with a handful of

Presentation Overview

Action Refinement in Reinforcement Learning by Probability Smoothing, by Thomas G. Dietterich & Didac Busquets. Speaker: Kai Xu. Presentation Overview: Background; The Probability Smoothing Method; Experimental Study of Action

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Appendix to Creating Work Breaks From Available Idleness

Xu Sun and Ward Whitt, Department of Industrial Engineering and Operations Research, Columbia University, New York, NY; {xs2235,ww2040}@columbia.edu. September

Solutions to the Exam Digital Communications I given on the 11th of June 2007

Question 1 (14p). a) (2p) If X and Y are independent Gaussian variables, then E[XY] = 0 always. (Answer with TRUE or FALSE.) ANSWER: False.

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Ensemble methods: Bagging and Boosting

Lecture 21. Ensemble methods: Bagging and Boosting. Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square. Ensemble methods: mixture of experts. Multiple base models (classifiers, regressors), each covers a different part

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562, Fall 2005. Lecture 4: Simple Linear Regression Model: Specification and Estimation, by Professor Scott H. Irwin. Required Reading: Griffiths, Hill and Judge, "Simple Regression: Economic and Statistical Model

OBJECTIVES OF TIME SERIES ANALYSIS

Understanding the dynamic or time-dependent structure of the observations of a single series (univariate analysis). Forecasting of future observations. Ascertaining the leading, lagging

Demodulation of Digitally Modulated Signals

Additional material for TSKS1 Digital Communication and TSKS2 Telecommunication. Mikael Olofsson, Institutionen för systemteknik, Linköpings universitet, 581 83 Linköping. November

Cooperative Ph.D. Program in School of Economic Sciences and Finance QUALIFYING EXAMINATION IN MACROECONOMICS. August 8, 2013, 8:45 a.m. to 1:00 p.m.

THERE ARE FIVE QUESTIONS. ANSWER ANY FOUR OUT OF FIVE PROBLEMS.

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Paul L. Mosquin, Jeremy Aldworth, Wenlin Chen. Supplemental Material. Number

More Digital Logic

EECS 4, Spring 2003, Lecture 2. More Digital Logic: gate delay and signal propagation; clocked circuit elements (flip-flop); writing a word to memory; simplifying digital circuits: Karnaugh maps. Low-to-high and high-to-low transitions could have different propagation delay t_p.

Two Coupled Oscillators / Normal Modes

Lecture 3, Phys 3750. Overview and Motivation: Today we take a small, but significant, step towards wave motion. We will not yet observe waves, but this step is important in its own

If x_1(t), x_2(t), ..., x_n(t) is a basis for the solution space to this system, then the matrix having these solutions as columns, X(t) = [x_1(t) x_2(t) ... x_n(t)]

Math 228, Fri Mar 24. 5.6 Matrix exponentials and linear systems: The analogy between first order systems of linear differential equations (Chapter 5) and scalar linear differential equations (Chapter 1) is much stronger

Stable Scheduling Policies for Maximizing Throughput in Generalized Constrained Queueing Systems

Prasanna Chaporkar, Student Member, IEEE, and Saswati Sarkar, Member, IEEE. Abstract: We consider a class of queueing networks

Transmitting important bits and sailing high radio waves: a decentralized cross-layer approach to cooperative video transmission

Nicholas Mastronarde, Francesco Verde, Donatella Darsena, Anna Scaglione, and Mihaela

Learning Relaying Strategies in Cellular D2D Networks with Token-Based Incentives

Nicholas Mastronarde, Viral Patel, Department of Electrical Engineering, State University of New York at Buffalo, Buffalo, NY, USA. Jie Xu,

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Constantinos Boulis. Outline of the Presentation: Introduction to the speaker adaptation problem; Maximum Likelihood Stochastic Transformations

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD HAN XIAO 1. Penalized Leas Squares Lasso solves he following opimizaion problem, ˆβ lasso = arg max β R p+1 1 N y i β 0 N x ij β j β j (1.1) for some 0.

More information
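The lasso problem in (1.1) has a closed-form solution in the simplest case of a single centered predictor with no intercept, via the soft-thresholding operator that underlies coordinate-descent lasso solvers. A minimal sketch; the objective scaling 1/(2N) and the toy data are illustrative assumptions, not taken from the excerpt:

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: the building block of lasso solvers."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_univariate(x, y, lam):
    """Minimize (1/(2N)) * sum (y_i - x_i*b)^2 + lam*|b| for a single
    centered predictor with no intercept: b = S(rho, lam) / z,
    where rho = (1/N) sum x_i*y_i and z = (1/N) sum x_i^2."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    z = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, lam) / z

x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]
b_ols = lasso_univariate(x, y, 0.0)    # no penalty: reduces to least squares
b_dead = lasso_univariate(x, y, 2.0)   # heavy penalty drives b to exactly 0
```

The exact zero at large λ is the point of the penalty: unlike ridge regression, the lasso's soft threshold sets small coefficients to zero rather than merely shrinking them.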