CHANNEL SELECTION WITH RAYLEIGH FADING: A MULTI-ARMED BANDIT FRAMEWORK. Wassim Jouini and Christophe Moy


SUPELEC, IETR, SCEE, Avenue de la Boulaie, CS 47601, 5576 Cesson Sévigné, France. INSERM U96 - IFR140 - Faculté de Médecine, Université de Rennes1-50 Rennes, France.

ABSTRACT

Channel selection in fading environments with no prior information on the channels' quality is a challenging issue. In the case of Rayleigh channels, the measured Signal-to-Noise Ratio follows an exponential distribution. Thus, we suggest in this paper a simple algorithm that deals with resource selection when the measured samples are drawn from exponential distributions. This strategy, referred to as the Multiplicative Upper Confidence Bound (MUCB) algorithm, associates a utility index to every available arm, and then selects the arm with the highest index. For every arm, the associated index is equal to the product of a multiplicative factor and the sample mean of the rewards collected by this arm. We show that MUCB policies are order optimal. Moreover, simulations illustrate and validate the stated theoretical results.

1. INTRODUCTION

Several sequential decision making problems face a dilemma between the exploration of a space of choices, or solutions, and the exploitation of the information available to the decision maker. The problem described herein is known as sequential decision making under uncertainty. In this paper we focus on a sub-class of this problem, where the decision maker has a discrete set of stateless choices and the added information is a real-valued sequence of feedbacks, or rewards, that quantifies how well the decision maker behaved in the previous time steps. This particular instance of sequential decision making problems is generally known as the multi-armed bandit (MAB) problem [1, 2]. A common approach to solving the exploration versus exploitation dilemma within MAB problems consists in assigning a utility value to every arm.
An arm's utility aggregates all the past information about the lever and quantifies the gambler's interest in pulling it. Such utilities are called indexes. Agrawal et al. [2] emphasized the family of indexes minimizing the expected cumulated loss and called them Upper Confidence Bound (UCB) indexes. UCB indexes provide an optimistic estimation of the arms' performances while ensuring a rapidly decreasing probability of selecting a suboptimal arm. The decision maker builds its policy by greedily selecting the largest index. Recently, Auer et al. [3] proved that a simple additive form, combining the rewards' sample mean and a bias, known as UCB1, can achieve order optimality over time when dealing with rewards drawn from bounded distributions.

Tackling exponentially distributed rewards, as usually occurs when measuring Signal-to-Noise Ratios (SNR) in fading environments, remains a challenge. Asymptotically optimal, or order optimal, algorithms exist [1, 2]. These algorithms are known to be complex to compute. Recently, an algorithm for locally sub-Gaussian distributions has been suggested.¹ Unfortunately, the algorithm needs prior knowledge on the distributions' parameters. Such knowledge, on the fading channels, is not available prior to the learning process.

This paper is inspired by the aforementioned work [1, 2] and is motivated by the problem of channel selection when the channels are subject to Rayleigh fading. However, we suggest the analysis of a multiplicative rather than an additive expression for the index. The main contribution of this paper is to design and analyze a simple, deterministic, multiplicative index-based policy. The decision making strategy computes an index associated to every available arm, and then selects the arm with the highest index.

Acknowledgment: The authors would like to thank Damien Ernst, Raphael Fonteneau and Emmanuel Rachelson for their many helpful comments and answers regarding this work.
Every index associated to an arm is equal to the product of the sample mean of the rewards collected by this arm and a scaling factor. The scaling factor is chosen so as to provide an optimistic estimation of the considered arm's performance. We show that our decision policy has a low computational complexity and can lead to a logarithmic loss over time under some non-restrictive conditions. For the rest of this paper we will refer to our suggested policy as the Multiplicative Upper Confidence Bound (MUCB) index.

The outline of this paper is the following: we start by presenting some general notions on the multi-armed bandit framework with exponentially distributed rewards in Section 2. Then, Section 3 introduces our index policy and Section 4 analyzes its behavior, proving the order optimality of the suggested algorithm. Finally, Section 6 concludes.

¹ Found online on arXiv:

2. MULTI-ARMED BANDITS

A K-armed bandit ($K \in \mathbb{N}$) is a machine learning problem based on an analogy with the traditional slot machine (one-armed bandit) but with more than one lever. Such a problem is defined by the K-tuple $(\theta_1, \theta_2, \ldots, \theta_K) \in \Theta^K$, $\Theta$ being the set of all positive reward distributions. When pulled at a time $t \in \mathbb{N}$, each lever² $k \in [\![1,K]\!]$, where $[\![1,K]\!] = \{1, \ldots, K\}$, provides a reward $r_t$ drawn from a distribution $\theta_k$ associated to that specific lever. The objective of the gambler is to maximize the cumulated sum of rewards through iterative pulls. It is generally assumed that the gambler has no or partial initial knowledge about the levers. The crucial tradeoff the gambler faces at each trial is between exploitation of the lever that has the highest expected payoff and exploration to get more information about the expected payoffs of the other levers.

In this paper, we assume that the different exponentially distributed payoffs drawn from a machine are independent and identically distributed (i.i.d.) and that the independence of the rewards holds between the machines. However, the different machines' reward distributions $(\theta_1, \theta_2, \ldots, \theta_K)$ are not supposed to be the same. Let $I_t \in [\![1,K]\!]$ denote the machine selected at a time $t$, and let $H_t$ be the history vector available to the gambler at instant $t$, i.e.,

$$H_t = [I_0, r_0, I_1, r_1, \ldots, I_{t-1}, r_{t-1}]$$

We assume that the gambler uses a policy $\pi$ to select arm $I_t$ at instant $t$, such that $I_t = \pi(H_t)$. We shall also write, $\forall k \in [\![1,K]\!]$, $\mu_k \triangleq 1/\lambda_k = E[\theta_k]$, where $\lambda_k$ refers to the parameter of the considered exponential distribution with pdf $f_{\theta_k}(x) = \lambda_k e^{-\lambda_k x}$, $x \ge 0$, and we assume that $\mu_k > 0$, $\forall k \in [\![1,K]\!]$.

The cumulated regret of a policy $\pi$ at time $t$ (after $t$ pulls) is defined as follows:

$$R_t = t\mu^* - \sum_{m=0}^{t-1} r_m,$$

where $\mu^* \triangleq \max_{k \in [\![1,K]\!]} \{\mu_k\}$ refers to the expected reward of the optimal arm.
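As an illustration of the definitions above, the following minimal Python sketch draws exponential rewards with a given mean and evaluates the regret. The function names (`draw_reward`, `realized_regret`, `pseudo_regret`) are hypothetical helpers introduced here for illustration and are not part of the paper. The sketch assumes the per-arm decomposition of the regret, $\sum_k (\mu^* - \mu_k) T_{k,t}$, evaluated for fixed pull counts.

```python
import random


def draw_reward(mu, rng):
    """Draw one exponential reward with mean mu, i.e. rate lambda = 1 / mu."""
    return rng.expovariate(1.0 / mu)


def realized_regret(mus, rewards):
    """R_t = t * mu_star - (sum of the t collected rewards)."""
    return len(rewards) * max(mus) - sum(rewards)


def pseudo_regret(mus, pull_counts):
    """Sum over arms of (mu_star - mu_k) * T_k, for given pull counts T_k."""
    mu_star = max(mus)
    return sum((mu_star - mu) * n for mu, n in zip(mus, pull_counts))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    mus = [0.5, 1.0]
    rewards = [draw_reward(mus[0], rng) for _ in range(3)]
    print(realized_regret(mus, rewards), pseudo_regret(mus, [3, 0]))
```

Note that the realized regret can be negative on a given run, since individual exponential rewards may exceed $\mu^*$; only its expectation is guaranteed to be non-negative.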
We seek to find a policy that minimizes the expected cumulated regret (Equation 1),

$$E[R_t] = \sum_{k=1}^{K} \Delta_k E[T_{k,t}], \tag{1}$$

where $\Delta_k = \mu^* - \mu_k$ is the expected loss of playing arm $k$, and $T_{k,t}$ refers to the number of times the machine $k$ has been played from instant 0 to instant $t-1$.

3. MULTIPLICATIVE UPPER CONFIDENCE BOUND ALGORITHMS

This section presents our main contribution, the introduction of a new multiplicative index. Let $B_{k,t}(T_{k,t})$ denote the index of arm $k$ at time $t$ after being pulled $T_{k,t}$ times. We refer to as Multiplicative Upper Confidence Bound (MUCB) algorithms the family of indexes that can be written in the form:

$$B_{k,t}(T_{k,t}) = \bar{X}_{k,t}(T_{k,t}) \cdot M_{k,t}(T_{k,t}), \tag{2}$$

where $\bar{X}_{k,t}(T_{k,t})$ is the sample mean of machine $k$ at step $t$ after $T_{k,t}$ pulls, i.e.,

$$\bar{X}_{k,t}(T_{k,t}) = \frac{1}{T_{k,t}} \sum_{i=0}^{t-1} \mathbf{1}_{\{I_i = k\}}\, r_i,$$

and $M_{k,t}(\cdot)$ is an upper confidence scaling factor chosen to ensure that the index $B_{k,t}(T_{k,t})$ is an increasing function of the number of rounds $t$. This last property ensures that the index of an arm that has not been pulled for a long time will increase, thus eventually leading to the sampling of this arm. We introduce a particular parametric class of MUCB indexes, which we call MUCB($\alpha$), given as follows: $\forall \alpha \ge 0$,

$$M_{k,t}(T_{k,t}) = \frac{1}{\max\left\{0;\ 1 - \sqrt{\dfrac{\alpha \ln(t)}{T_{k,t}}}\right\}} \tag{3}$$

We adopt the convention that $1/0 = +\infty$.³ Given a history $H_t$, one can compute the values of $T_{k,t}$ and $M_{k,t}$ and derive an index-based policy $\pi$ as follows:

$$I_t = \pi(H_t) \in \arg\max_{k \in [\![1,K]\!]} \{B_{k,t}(T_{k,t})\}.$$

² We use the words lever, arm, and machine interchangeably.

4. ANALYSIS OF MUCB($\alpha$) POLICIES

This section analyzes the theoretical properties of MUCB($\alpha$) algorithms. More specifically, it focuses on determining how fast the optimal arm is identified and how likely anomalies, that is, suboptimal pulls, are.

4.1. Consistency and order optimality of MUCB indexes

Definition 1 ($\beta$-consistency). Consider the set $\Theta^K$ of K-armed bandit problems. A policy $\pi$ is said to be $\beta$-consistent, $0 < \beta \le 1$, with respect to $\Theta^K$, if and only if:

$$\forall (\theta_1, \ldots, \theta_K) \in \Theta^K, \quad \lim_{t \to \infty} \frac{E[R_t]}{t^{\beta}} = 0 \tag{4}$$

We expect good policies to be at least 1-consistent.
As a matter of fact, 1-consistency ensures that, asymptotically, the average expected reward is optimal. From the expression of Equation 1 one can remark that it is sufficient to upper bound the expected number of times $E[T_{k,t}]$ one plays a suboptimal machine $k$ after $t$ rounds to obtain an upper bound on the expected cumulated regret. This leads to the main result of this paper in the form of the following theorem.

Theorem 1 (Order optimality of MUCB($\alpha$) policies). Let $\rho_k = \mu_k/\mu^*$, $k \in [\![1,K]\!]$. For all $K \ge 2$, if policy MUCB($\alpha > 4$) is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then:

$$E[R_t] \le \sum_{k:\Delta_k>0} \frac{4\mu^*\alpha}{1-\rho_k} \ln(t) + o(\ln(t)) \tag{5}$$

³ This form offers a compact mathematical formula. However, practically speaking, a machine $k$ is played when $T_{k,t} \le \alpha \ln(t)$. Otherwise the machine with the largest finite index is played.
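The index policy of Section 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes the square-root form of the scaling factor, $M = 1/\max\{0;\ 1 - \sqrt{\alpha \ln(t)/T}\}$, with the convention $1/0 = +\infty$. It also assumes $\alpha \ge 0$ and $t \ge 1$ (the call `max(t, 1)` guards the logarithm), and that ties are broken in favor of the lowest arm number. The names `mucb_factor`, `mucb_index` and `select_arm` are hypothetical.

```python
import math


def mucb_factor(alpha, t, pulls):
    """Scaling factor M = 1 / max(0, 1 - sqrt(alpha * ln(t) / pulls)).

    Returns +inf when the arm was never pulled or when the max is 0,
    following the convention 1 / 0 = +inf.
    """
    if pulls == 0:
        return math.inf
    gap = 1.0 - math.sqrt(alpha * math.log(max(t, 1)) / pulls)
    return math.inf if gap <= 0.0 else 1.0 / gap


def mucb_index(sample_mean, alpha, t, pulls):
    """Index B = sample_mean * M; infinite whenever M is infinite."""
    factor = mucb_factor(alpha, t, pulls)
    return math.inf if math.isinf(factor) else sample_mean * factor


def select_arm(reward_sums, pull_counts, alpha, t):
    """Return the arm with the largest index (lowest arm number on ties)."""
    indexes = [
        mucb_index(s / n if n > 0 else 0.0, alpha, t, n)
        for s, n in zip(reward_sums, pull_counts)
    ]
    return max(range(len(indexes)), key=indexes.__getitem__)
```

An arm with $T_{k,t} \le \alpha \ln(t)$ receives an infinite index and is therefore played before any arm with a finite index, which matches the practical rule stated in footnote 3.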

Proving Theorem 1 relies on three lemmas that we analyze and prove in the next subsection.

4.2. Learning Anomalies and Consistency of MUCB policies

Let us introduce the set $\mathcal{S} = \mathbb{N} \times \mathbb{R}$; then, one can write $S_{k,t} = (T_{k,t}, B_{k,t}) \in \mathcal{S}$ for the decision state of arm $k$ at time $t$. We associate the product order to the set $\mathcal{S}$: for a pair of states $S = (T, B) \in \mathcal{S}$ and $S' = (T', B') \in \mathcal{S}$, we write $S \ge S'$ if and only if $T \ge T'$ and $B \ge B'$.

Definition 2 (Anomaly of type 1). We assume that there exists at least one suboptimal machine, i.e., $[\![1,K]\!] \setminus \{k^*\} \ne \emptyset$, where $k^*$ denotes the optimal machine. We call anomaly of type 1, denoted by $\{\phi_1\}_{\pi,t}$, for a suboptimal machine $k \in [\![1,K]\!] \setminus \{k^*\}$, and with parameter $u_k \in \mathbb{N}$, the following event:

$$\{\phi_1\}_{\pi,t} = \{S_{k,t} \ge (u_k, \mu^*)\}.$$

Definition 3 (Anomaly of type 2). We refer to as anomaly of type 2, denoted by $\{\phi_2\}_{\pi}(t)$, associated to the optimal machine $k^*$, the following event:

$$\{\phi_2\}_{\pi}(t) = \{S^*_{t} < (\infty, \mu^*),\ T^*_{t} \ge 1\}.$$

Lemma 1 (Expected cumulated regret; proof in 8.2). Given a policy $\pi$ and a MAB problem, let $u = [u_1, \ldots, u_K]$ represent a set of integers. Then the expected cumulated regret is upper bounded by:

$$E[R_t] \le \sum_{k:\Delta_k>0} \Delta_k u_k + \sum_{k:\Delta_k>0} \Delta_k P_k(t)$$

with

$$P_k(t) = \sum_{m=u_k+1}^{t} \left[ P(\{\phi_2\}_{\pi}(m)) + P(\{\phi_1\}_{\pi,m}) \right]$$

We consider the following values for the set $u$: for all suboptimal arms $k$,

$$u_k(t) = \frac{4\alpha}{(1-\rho_k)^2} \ln(t).$$

We show in the two following lemmas that, for the defined set $u$, the probabilities of the anomalies are upper bounded by polynomially decreasing functions of the number of iterations.

Lemma 2 (Upper bound of Anomaly 1; proof in 8.3). For all $K \ge 2$, if policy MUCB($\alpha$) is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then $\forall k \in [\![1,K]\!] \setminus \{k^*\}$:

$$P(\{\phi_1\}_{\pi,t}) \le t^{-\alpha/2+1} \tag{6}$$

Lemma 3 (Upper bound of Anomaly 2; proof in 8.4). For all $K \ge 2$, if policy MUCB($\alpha$) is run on $K$ machines having rewards drawn from exponential distributions $\theta_1, \ldots, \theta_K$, then:

$$P(\{\phi_2\}_{\pi}(t)) \le t^{-\alpha/2+1} \tag{7}$$

We end this section with the proof of Theorem 1.

Proof of Theorem 1. For $\alpha > 4$, relying on Lemmas 1, 2 and 3 we can write:

$$E[R_t] \le \sum_{k:\Delta_k>0} \Delta_k \frac{4\alpha}{(1-\rho_k)^2} \ln(t) + o(\ln(t))$$

with $\sum_{k:\Delta_k>0} \Delta_k P_k(t) = o(\ln(t))$. Indeed, by Lemmas 2 and 3 each term of $P_k(t)$ is at most $2m^{-\alpha/2+1}$, and the corresponding series converges when $\alpha > 4$. Finally, since $\Delta_k = \mu^*(1-\rho_k)$ and $u_k(t) = \frac{4\alpha}{(1-\rho_k)^2}\ln(t) + o(\ln(t))$, we find the stated result in Theorem 1.

5. SIMULATION RESULTS

For illustration purposes we consider a SU willing to evaluate the quality of $K = 10$ channels. The SU relies on the measurement of the channels' SNR to identify the best channel. We assume that the SU suffers Rayleigh fading. Consequently, for every channel, the measured SNR follows an exponential distribution. The presented simulations consider the following parameters $\mu = \{\mu_1, \ldots, \mu_{10}\}$ for the channels, where $\mu_1 \le \ldots \le \mu_{10}$ without loss of generality: $\mu = \{0.1; 0.2; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9; 1\}$.

The simulations compare three MUCB policies for $\alpha$ equal to, respectively, $\{1; 2; 4.01\}$. These algorithms are referred to as MUCB(1), MUCB(2) and MUCB(4) respectively. Notice that MUCB(4) is chosen so as to respect the condition imposed in Theorem 1, i.e., $\alpha > 4$. MUCB(1) and MUCB(2), on the contrary, are not covered by the guarantee of Theorem 1 and are therefore considered possibly risky. The simulations consider a time horizon of $10^6$ iterations.

Figure 1 plots the cumulated averaged regret of the MUCB policies. In order to obtain relevant results, the curves were averaged over 100 experiments. All curves show a similar behavior: first, an exploration phase where the regret grows quickly. Then the curves tend to confirm that the regret of MUCB policies grows as a logarithmic function of the number of iterations. As a matter of fact, we notice that after the first exploration phase, on a logarithmic scale, the regret grows as a linear function. Moreover, since MUCB(1) and MUCB(2) seem to respect this trend, these curves suggest that the condition imposed in Theorem 1, $\alpha > 4$, might be improvable.

6. CONCLUSION

A new low complexity algorithm for MAB problems is suggested and analyzed in this paper: MUCB. The analysis of its regret proves that the algorithm is order optimal over time. In order to quantify its performance compared to optimal algorithms, further empirical evaluations are needed and are currently under investigation.
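The simulation setup of Section 5 can be approximated with the following self-contained Python sketch. It is a hedged illustration, not the authors' simulation code. It assumes the square-root form of the MUCB($\alpha$) scaling factor and starts the round counter at $t = 1$. It reports the pseudo-regret $\sum_k \Delta_k T_{k,t}$ of a single run, rather than an average over 100 experiments, and uses a much shorter horizon than $10^6$ to keep the run time small. The function name `run_mucb` and its parameters are hypothetical.

```python
import math
import random


def run_mucb(mus, alpha, horizon, seed=0):
    """Run one MUCB(alpha) experiment on exponential arms with means `mus`.

    Returns (pull_counts, pseudo_regret) after `horizon` rounds.
    """
    rng = random.Random(seed)
    n_arms = len(mus)
    sums = [0.0] * n_arms    # cumulated reward per arm
    counts = [0] * n_arms    # number of pulls per arm
    for t in range(1, horizon + 1):
        log_t = math.log(t)
        best_arm, best_index = 0, -1.0
        for arm in range(n_arms):
            n = counts[arm]
            # gap <= 0 encodes the convention 1 / 0 = +inf (forced play).
            gap = 1.0 - math.sqrt(alpha * log_t / n) if n > 0 else 0.0
            index = math.inf if gap <= 0.0 else (sums[arm] / n) / gap
            if index > best_index:
                best_arm, best_index = arm, index
        # Exponential reward with mean mus[best_arm] (rate 1 / mean).
        sums[best_arm] += rng.expovariate(1.0 / mus[best_arm])
        counts[best_arm] += 1
    mu_star = max(mus)
    regret = sum((mu_star - mu) * n for mu, n in zip(mus, counts))
    return counts, regret


if __name__ == "__main__":
    channel_means = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
    for alpha in (1.0, 2.0, 4.01):
        counts, regret = run_mucb(channel_means, alpha, horizon=20000)
        print(f"alpha={alpha}: pseudo-regret={regret:.1f}, pulls={counts}")
```

Averaging the returned pseudo-regret over several seeds and plotting it against the horizon on a logarithmic scale would give curves comparable in spirit to Figure 1.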

[Figure 1: averaged regret over 100 experiments (vertical axis: Regret) plotted against the number of iterations (horizontal axis: Iterations) for MUCB1, MUCB2 and MUCB4.]

Fig. 1. Average regret over 100 experiments: illustration of Theorem 1.

7. REFERENCES

[1] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.

[2] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of multi-armed bandit problems. Machine Learning, 47(2/3):235-256.

[4] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics.

8. APPENDIX

8.1. Large deviations inequalities

Assumption 1 (Cramér condition). Let $X$ be a real random variable. $X$ satisfies the Cramér condition if and only if

$$\exists \gamma > 0:\ \forall \eta \in (0, \gamma),\ E\left[e^{\eta X}\right] < \infty.$$

Lemma 4 (Cramér-Chernoff Lemma for the sample mean). Let $X_1, \ldots, X_n$ ($n \in \mathbb{N}$) be a sequence of i.i.d. real random variables satisfying the Cramér condition with expected value $E[X]$. We denote by $\bar{X}_n$ the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, there exist two functions $l_1(\cdot)$ and $l_2(\cdot)$ such that:

$$\forall \beta_1 > E[X],\quad P(\bar{X}_n \ge \beta_1) \le e^{-l_1(\beta_1) n},$$

$$\forall \beta_2 < E[X],\quad P(\bar{X}_n \le \beta_2) \le e^{-l_2(\beta_2) n}.$$

Functions $l_1(\cdot)$ and $l_2(\cdot)$ do not depend on the sample size $n$ and are continuous, non-negative, strictly increasing (respectively strictly decreasing) for all $\beta_1 > E[X]$ (respectively $\beta_2 < E[X]$), both null for $\beta_1 = \beta_2 = E[X]$.

This result was initially proposed and proved in [4]. The bounds provided by this lemma are called Large Deviations Inequalities (LDIs) in this paper. In the case of exponential distributions this theorem can be applied and the LDI functions have the following expression:

$$l_1(\beta) = l_2(\beta) = \frac{\beta}{E[X]} - 1 - \ln\left(\frac{\beta}{E[X]}\right)$$

8.2. Proof of Lemma 1

According to Equation 1:

$$E[R_t^{\pi}] = \sum_{k=1}^{K} \Delta_k E[T_{k,t}].$$

By definition, $T_{k,t} = \sum_{m=0}^{t-1} \mathbf{1}_{\{I_m = k\}}$. Then,

$$E[T_{k,t}] = \sum_{m=0}^{t-1} E\left[\mathbf{1}_{\{I_m = k\}}\right].$$

After playing an arm $u_k$ times, bounding the first $u_k$ terms by 1 yields:

$$E[T_{k,t}] \le u_k + \sum_{m=u_k+1}^{t} P\left(\{I_m = k\} \cap \{T_{k,m} > u_k\}\right) \tag{8}$$

Then we can notice that the following events are equivalent:

$$\{I_m = k\} = \left\{B_{k,m} > \max_{k' \ne k} B_{k',m}\right\}$$

Moreover we can notice that:

$$\left\{B_{k,m} > \max_{k' \ne k} B_{k',m}\right\} \subseteq \{B_{k,m} > B^*_{m}\}$$

which can be further included in the following union of events:

$$\{B_{k,m} > B^*_{m}\} \subseteq \{B_{k,m} > \mu^*\} \cup \{\mu^* > B^*_{m}\}$$

Consequently we can write:

$$\{I_m = k\} \cap \{T_{k,m} > u_k\} \subseteq \{\phi_1\}_{\pi,m} \cup \{\phi_2\}_{\pi}(m) \tag{9}$$

Finally, we apply the probability operator:

$$E[T_{k,t}] \le u_k + \sum_{m=u_k+1}^{t} \left[ P(\{\phi_1\}_{\pi,m}) + P(\{\phi_2\}_{\pi}(m)) \right] \tag{10}$$

The combination of Equation 1, given at the beginning of this proof, and Equation 10 concludes this proof.

8.3. Proof of Lemma 2

From the definition of $\{\phi_1\}_{\pi,t}$ we can write that:

$$P(\{\phi_1\}_{\pi,t}) = P\left(S_{k,t} \ge (u_k, \mu^*),\ S_{k,t} \in \mathcal{S}\right) \le \sum_{u=u_k}^{t-1} P\left(B_{k,t}(u) \ge \mu^*\right).$$

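In the case of MUCB policies, we have:

$$\forall u \le t,\quad P\left(B_{k,t}(u) \ge \mu^*\right) = P\left(\bar{X}_{k,t}(u) \ge \frac{\mu^*}{M_{k,t}(u)}\right)$$

Consequently, we can upper bound the probability of occurrence of type 1 anomalies by:

$$P(\{\phi_1\}_{\pi,t}) \le \sum_{u=u_k}^{t-1} P\left(\bar{X}_{k,t}(u) \ge \frac{\mu^*}{M_{k,t}(u)}\right) \tag{11}$$

Let us define $\beta_{k,t}(T_{k,t}) = \mu^*/M_{k,t}(T_{k,t})$. Since we are dealing with exponential distributions, the rewards provided by the arm $k$ satisfy the Cramér condition. As a matter of fact, since $u \ge u_k \ge \frac{\alpha \ln(t)}{(1-\rho_k)^2}$, then:

$$\beta_{k,t}(u)\,\lambda_k = \rho_k^{-1}\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right) \ge 1$$

So, according to the large deviation inequality for $\bar{X}_{k,t}(T_{k,t})$ given by Lemma 4, with $T_{k,t} = u \ge u_k$, there exists a continuous, non-decreasing, non-negative function $l_{1,k}$ such that:

$$P\left(\bar{X}_{k,t}(T_{k,t}) \ge \beta_{k,t}(T_{k,t}) \mid T_{k,t} = u\right) \le e^{-l_{1,k}(\beta_{k,t}(u))\,u}$$

Finally:

$$P(\{\phi_1\}_{\pi,t}) \le \sum_{u=u_k}^{t-1} e^{-l_{1,k}(\beta_{k,t}(u))\,u}$$

The end of this proof aims at proving that, for $u \ge u_k$:

$$l_{1,k}(\beta_{k,t}(u)) \ge \frac{\alpha \ln(t)}{2u}.$$

Note that since we are dealing with exponential distributions we can write:

$$l_{1,k}(\beta_{k,t}(u)) \ge \frac{\left(1 - \beta_{k,t}(u)\lambda_k\right)^2}{2\,\beta_{k,t}(u)\lambda_k}.$$

Moreover, since $u \ge u_k$, then:

$$\beta_{k,t}(u)\,\lambda_k = \rho_k^{-1}\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right) \le \rho_k^{-1}$$

Consequently it is sufficient to prove that:

$$\rho_k \left(1 - \beta_{k,t}(u)\lambda_k\right)^2 \ge \frac{\alpha \ln(t)}{u}$$

Let us define $h(t)$ as a function of time: $h(t) = \sqrt{\alpha \ln(t)/u} \in [0, 1]$. We analyze the sign of the function:

$$g(t) = \frac{\left(1 - h(t) - \rho_k\right)^2}{\rho_k} - h(t)^2 \tag{12}$$

Consequently we need to prove that, for $u \ge u_k$, $g$ has non-negative values. Factorizing the last equation leads to the following two terms:

$$g(t) = \left(\frac{1 - h(t) - \rho_k}{\sqrt{\rho_k}} + h(t)\right)\left(\frac{1 - h(t) - \rho_k}{\sqrt{\rho_k}} - h(t)\right) \tag{13}$$

Since, for $u \ge u_k$, $h(t) \in [0, 1]$ and $h(t) \le (1-\rho_k)/2$, the first term of Equation 13 is positive. Consequently, $g$ is non-negative if and only if the second term of Equation 13 is non-negative, i.e.,

$$\sqrt{\frac{\alpha \ln(t)}{u}} \le \frac{1-\rho_k}{1+\sqrt{\rho_k}} = 1 - \sqrt{\rho_k}.$$

Since $u \ge u_k$ implies $h(t) \le (1-\rho_k)/2 \le 1 - \sqrt{\rho_k}$, the last inequality is verified. Finally, upper bounding Equation 11 for $u \ge u_k$:

$$P(\{\phi_1\}_{\pi,t}) \le \sum_{u=u_k}^{t-1} e^{-\alpha \ln(t)/2} = \sum_{u=u_k}^{t-1} \frac{1}{t^{\alpha/2}} \le \frac{1}{t^{\alpha/2-1}}$$

8.4. Proof of Lemma 3

This proof follows the same steps as the proof in Subsection 8.3. From the definition of $\{\phi_2\}_{\pi}(t)$ we can write that:

$$P(\{\phi_2\}_{\pi}(t)) \le \sum_{u=1}^{t-1} P\left(B^*_{t}(u) \le \mu^*\right)$$

In the case of MUCB policies, we have:

$$\forall u \le t,\quad P\left(B^*_{t}(u) \le \mu^*\right) = P\left(\bar{X}^*_{t}(u) \le \frac{\mu^*}{M^*_{t}(u)}\right)$$

Consequently, we can upper bound the probability of occurrence of type 2 anomalies by:

$$P(\{\phi_2\}_{\pi}(t)) \le \sum_{u=1}^{t-1} P\left(\bar{X}^*_{t}(u) \le \mu^* \max\left\{0;\ 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\}\right)$$

Since $\mu^* \max\left\{0;\ 1 - \sqrt{\alpha \ln(t)/u}\right\} \le \mu^*$, Cramér's condition is verified.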
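Moreover, since the machine is played when the maximum in the previous term is equal to 0, we can consider that $u \ge \alpha \ln(t)$ and that:

$$\mu^* \max\left\{0;\ 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\} = \mu^*\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right)$$

Consequently, we can upper bound the probability of occurrence of Anomaly 2:

$$P(\{\phi_2\}_{\pi}(t)) \le \sum_{u=\alpha \ln(t)}^{t-1} e^{-l_2(\beta^*_{t}(u))\,u} \tag{14}$$

where $\beta^*_{t}(u) = \mu^*/M^*_{t}(u)$ and $l_2(\beta^*_{t}(u))$ verifies the LDI as defined in Appendix 8.1. Thus, after mild simplifications we can write:

$$l_2(\beta^*_{t}(u)) \ge \frac{\alpha \ln(t)}{2u}$$

Consequently, including this last inequality into Equation 14 ends the proof.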
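The "mild simplifications" invoked at the end of this proof can be spelled out as follows. This is a sketch under two assumptions: the scaling factor has the square-root form, and the LDI function of the exponential distribution is $l_2(\beta) = \beta/\mu^* - 1 - \ln(\beta/\mu^*)$. The shorthand $h = \sqrt{\alpha\ln(t)/u}$ is borrowed from the proof of Lemma 2.

```latex
% Assumption: 0 \le h < 1, i.e. u > \alpha \ln(t), and \beta^*_t(u) = \mu^* (1 - h).
\begin{align*}
l_2\bigl(\beta^*_t(u)\bigr)
  &= (1 - h) - 1 - \ln(1 - h)
   = -h - \ln(1 - h) \\
  &= \sum_{n \ge 2} \frac{h^n}{n}
   \;\ge\; \frac{h^2}{2}
   = \frac{\alpha \ln(t)}{2u}, \\[4pt]
e^{-l_2(\beta^*_t(u))\,u}
  &\le e^{-\alpha \ln(t)/2} = t^{-\alpha/2}, \\[4pt]
P\bigl(\{\phi_2\}_{\pi}(t)\bigr)
  &\le \sum_{u = \alpha \ln(t)}^{t-1} t^{-\alpha/2}
   \;\le\; t \cdot t^{-\alpha/2} = t^{-\alpha/2 + 1}.
\end{align*}
```

The last line uses the fact that the sum in Equation 14 contains at most $t$ terms, which recovers the bound stated in Lemma 3.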
