Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel
Yanting Wu, Dept. of Electrical Engineering, University of Southern California
Bhaskar Krishnamachari, Dept. of Electrical Engineering, University of Southern California

Abstract — This paper studies the optimal transmission policy for a Gilbert-Elliott channel. The transmitter has two actions, sending aggressively or sending conservatively, with rewards that depend on the action chosen and the underlying channel state. The aim is to compute a scheduling policy that determines which action to choose in each time slot so as to maximize the expected total discounted reward. We first establish the threshold structure of the optimal policy when the underlying channel statistics are known. We then consider the more challenging case when the statistics are unknown. For this problem, we map different threshold policies to arms of a suitably defined multi-armed bandit problem. To tractably handle the complexity introduced by the countably infinite arms and the infinite time horizon, we weaken our objective slightly: finding an (OPT − (ɛ + δ))-approximate policy instead. We present the UCB-P algorithm, which achieves this objective with regret that grows logarithmically in time.

I. INTRODUCTION

Communication over wireless channels is affected by fading conditions, interference, path loss, etc. To better utilize a wireless channel, a transmitter needs to adapt transmission parameters such as data rate and transmission power to the channel state. In this paper, we mathematically analyze a communication system operating over a two-state Markov channel (known as the Gilbert-Elliott channel) in a time-slotted fashion. The objective is for the transmitter to decide at each time, based on prior observations, whether to send data at an aggressive or conservative rate.
The former incurs the risk of failure but reveals the channel's true state, while the latter is a safe but unrevealing choice. When the channel transition probabilities are known, this problem can be modeled as a Partially Observable Markov Decision Process (POMDP). This formulation is closely related to a recent work on betting on Gilbert-Elliott channels [3], which considers three choices and shows that a threshold-type policy consisting of one, two, or three thresholds, depending on the parameters, is optimal. In our setting, we show that the optimal policy always has a single threshold, which corresponds to a K-conservative policy: the transmitter adopts the conservative approach for K steps after each failure before reattempting the aggressive strategy. Unlike [3], however, our focus is on the case when the underlying state transition matrix is unknown. In this case, finding the optimal strategy is equivalent to finding the optimal choice of K. We map the problem to a multi-armed bandit in which each possible K-conservative policy corresponds to an arm. To deal with the difficulties of optimizing the discounted reward over an infinite horizon, and the countably infinite arms that result from this mapping, we introduce approximation parameters ɛ and δ, and show that a modification of the well-known UCB1 policy guarantees that the number of times arms that are more than (ɛ + δ) away from the optimal are played is bounded by a logarithmic function of time. In other words, we show that the time-averaged regret with respect to an (OPT − (ɛ + δ)) policy tends to zero. We briefly review some other recent papers in the literature that have treated similar problems.
Johnston and Krishnamurthy [5] consider the problem of minimizing the transmission energy and latency of transmitting a file across a Gilbert-Elliott fading channel, formulate it as a POMDP, identify a threshold policy for it, and analyze it for various parameter settings. Karmokar et al. [6] consider optimizing multiple objectives (transmission power, delay, and drop probability) for packet scheduling across a more general finite-state Markov channel. Motivated by opportunistic spectrum sensing, several recent studies have explored optimizing sensing and access decisions over multiple independent but stochastically identical parallel Gilbert-Elliott channels, where the objective is to select one channel at each time, and have shown that a simple myopic policy is optimal [7], [8]. In [9], Dai et al. consider the non-Bayesian version of the sensing problem where the channel parameters are unknown, and show that when the problem has a finite-option structure, online learning algorithms can be designed to achieve near-logarithmic regret. In [10], Nayyar et al. consider the same sensing problem over two non-identical channels and derive the optimal threshold-structure policy for it. For the non-Bayesian setting, they show a mapping to a countably infinite-armed multi-armed bandit and also prove logarithmic regret with respect to an (OPT − δ) policy, similar to the approach adopted in this work for a different formulation.

The paper is organized as follows: Section II introduces the model; Section III gives the structure of the optimal policy when the underlying channel transition probabilities are known; Section IV discusses K-conservative policies and proves that the optimal policy corresponds to one; Section V discusses using multi-armed bandits to find out
the optimal K; to address the two challenges there, an infinite number of arms and an infinite time horizon, we weaken our objective to finding policies that are at most (ɛ + δ) away from the optimal, and we design the UCB-P algorithm to learn such policies. We present simulation results in Section VI. Finally, Section VII concludes the paper.

II. MODEL

For our problem setting, we consider the Gilbert-Elliott channel, a Markov chain with two states: good (denoted by 1) and bad (denoted by 0). If the channel is good, it allows the transmitter to send data successfully at a high rate. If the channel is bad, it only allows the transmitter to send data successfully at a low rate. The transition probability matrix is

P = [ P_00  P_01 ] = [ 1 − λ_0   λ_0 ]
    [ P_10  P_11 ]   [ 1 − λ_1   λ_1 ].    (1)

Define α = λ_1 − λ_0. We assume that the channel is positively correlated, i.e., α ≥ 0. At the beginning of each time slot, the transmitter chooses one of the following two actions:

Sending Conservatively (SC): the transmitter sends data at a low rate. No matter what the channel state is, it successfully transmits a small number of bits. We assign this action a reward R_1. Since the transmission always succeeds, the transmitter cannot learn the channel state when this action is chosen.

Sending Aggressively (SA): the transmitter sends data at a high rate. If the channel is in the good state, the transmission succeeds and the transmitter gets a high reward R_2 (> R_1). If the channel is in the bad state, sending at a high rate causes high error and drop rates; we consider the transmission a failure and the transmitter incurs a constant penalty C. We assume that if the transmitter sends aggressively, it learns the state of the channel; in other words, when the channel is in a bad state, an aggressive transmission strategy will encounter and detect failure.
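To make the model concrete, here is a minimal Python sketch of the channel and the two actions. The numerical values of λ_0, λ_1, R_1, R_2, and C are illustrative assumptions, not values from the paper.

```python
import random

# Minimal simulator of the two-state Gilbert-Elliott channel and the two
# transmitter actions described above. The numerical values of LAM0, LAM1,
# R1, R2, and C are illustrative assumptions, not values from the paper.
LAM0, LAM1 = 0.1, 0.8          # P(good next | bad), P(good next | good)
R1, R2, C = 1.0, 2.0, 0.5      # SC reward, SA reward, SA failure penalty

def step_channel(state):
    """Advance the Markov channel one slot (1 = good, 0 = bad)."""
    p_good = LAM1 if state == 1 else LAM0
    return 1 if random.random() < p_good else 0

def reward(state, action):
    """SC always earns R1; SA earns R2 in the good state and -C in the bad state."""
    if action == "SC":
        return R1
    return R2 if state == 1 else -C

def update_belief(belief, action, observed_state=None):
    """Belief update: SA reveals the state (so the next-slot belief is LAM1 or
    LAM0); SC applies the one-step map T(p) = alpha*p + lam0."""
    if action == "SA":
        return LAM1 if observed_state == 1 else LAM0
    return (LAM1 - LAM0) * belief + LAM0
```

Note that `update_belief` encodes the belief dynamics discussed next: aggressive sending resets the belief to λ_1 or λ_0, while conservative sending leaves the state unobserved.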
Because the channel state is not directly observable when sending conservatively, the problem considered in this paper is a Partially Observable Markov Decision Process (POMDP). In [1], it is shown that a sufficient statistic for making optimal decisions in such a POMDP is the conditional probability that the channel is in state 1 given all past actions and observations. We call this conditional probability the belief, represented by b_t = Pr[S_t = 1 | H_t], where H_t is the history of all actions and observations before the t-th time slot. When sending aggressively, the transmitter learns the state of the channel, so the belief becomes λ_0 if the channel was bad or λ_1 if the channel was good. The expected reward is

R(b_t, A_t) = { R_1                      if A_t = SC,
              { b_t R_2 − (1 − b_t) C    if A_t = SA,    (2)

where b_t is the belief that the channel is in the good state and A_t is the action taken by the transmitter at time t. In this paper, decisions are made to maximize the expected total discounted reward, defined as

E[ Σ_{t=0}^∞ β^t R(b_t, A_t) | b_0 = p ],    (3)

where R(b_t, A_t) is the expected reward at time t, β (< 1) is a constant discount factor, and b_0 is the initial belief.

III. THRESHOLD STRUCTURE OF THE OPTIMAL POLICY

In this section, we discuss the optimal policy when the transition probabilities are known. We prove that the optimal policy has a single threshold and give a closed-form expression for the threshold. A policy, denoted by π, is a rule that maps beliefs to actions. We use V_π(p) to represent the expected total discounted reward the transmitter obtains when the initial belief is p and the policy is π:

V_π(p) = E_π[ Σ_{t=0}^∞ β^t R(b_t, A_t) | b_0 = p ].    (4)

The aim is to find a policy with the greatest expected total discounted reward, denoted V(p):

V(p) = max_π { V_π(p) }.    (5)

According to [2, Thm.
6.3], there exists a stationary policy π achieving V(p) = V_π(p); thus V(p) can be calculated from

V(p) = max_{A ∈ {SA, SC}} { V_A(p) },    (6)

where V_A(p) is the greatest expected total discounted reward obtained by taking action A first when the initial belief is p. V_A(p) can be expressed as

V_A(p) = R(p, A) + β E[ V(p') | b_0 = p, A_0 = A ],    (7)

where b_0 is the initial belief, A is the action taken by the transmitter, and p' is the new belief after taking action A.

Sending conservatively: under this action the belief changes from p to T(p) = λ_0 (1 − p) + λ_1 p = αp + λ_0; hence

V_SC(p) = R_1 + β V(T(p)).    (8)

Sending aggressively: under this action the channel state becomes known, so

V_SA(p) = p R_2 − (1 − p) C + β [ p V(λ_1) + (1 − p) V(λ_0) ].    (9)

This formulation is similar to the problem formulation in [3], with two main differences: (1) we do not have a separate action for sensing; (2) we introduce a penalty when sending fails.

Theorem 3.1: The optimal policy has a single-threshold structure:

π*(p) = { SC  if 0 ≤ p ≤ ρ,
        { SA  if ρ < p ≤ 1,    (10)
where p is the belief and ρ is the threshold. We omit the proof because it follows in a straightforward manner from the results in [3]. Since the penalty does not change the linearity of V_SA(p), by [3, Thm. 1, Thm. 2], V(p) is convex and nondecreasing, and thus the optimal policy has a threshold structure. However, unlike [3], where multiple thresholds can exist, there is a single threshold in our setting.

A. Closed-Form Expression of the Threshold ρ

The threshold ρ is the solution of

R_1 + β V(T(ρ)) = V_SA(ρ).    (11)

There are two possible scenarios for T(ρ). If T(ρ) ≤ ρ, then V(T(ρ)) = V_SC(T(ρ)), and substituting into Eq. (11) gives

ρ = (R_1 + C) / ( R_2 + C + β V(λ_1) − β R_1/(1 − β) ).    (12)

Otherwise V(T(ρ)) = V_SA(T(ρ)), and substituting into Eq. (11) gives Eq. (13), displayed at the start of Section V.

B. V(λ_0) and V(λ_1)

To calculate V(λ_1) and V(λ_0), there are again two possible scenarios. If λ_1 ≤ ρ, then since λ_0 ≤ λ_1 ≤ ρ,

V(λ_1) = V(λ_0) = R_1/(1 − β).    (14)

Otherwise λ_1 > ρ, so V(λ_1) = V_SA(λ_1), and using Eq. (9),

V(λ_1) = [ λ_1 R_2 − (1 − λ_1) C + β (1 − λ_1) V(λ_0) ] / (1 − βλ_1).    (15)

To obtain V(λ_0), we adapt [3, Thm. 4]:

V(λ_0) = max_{A ∈ {SA, SC}} { V_A(λ_0) }    (16)
       = max{ R_1 + β V(T(λ_0)), V_SA(λ_0) }
       = max{ R_1 (1 − β^N)/(1 − β), max_{0 ≤ n ≤ N−1} { R_1 (1 − β^n)/(1 − β) + β^n V_SA(T^n(λ_0)) } }.

Since N is arbitrary and 0 ≤ β < 1, letting N → ∞ gives

V(λ_0) = max_{n ≥ 0} { R_1 (1 − β^n)/(1 − β) + β^n V_SA(T^n(λ_0)) }.    (17)

Using Eqs. (9), (15), and (17), we get

V(λ_0) = max_{n ≥ 0} { [ R_1 (1 − β^n)/(1 − β) + β^n ( κ_n R_2 + κ_n C (1 − β) − C ) ] / [ 1 − β^{n+1} ( 1 − (1 − β) κ_n ) ] },    (18)

where κ_n = T^n(λ_0)/(1 − βλ_1) = (1 − α^{n+1}) λ_S/(1 − βλ_1), and λ_S = λ_0/(1 − α) denotes the stationary probability that the channel is in the good state.

IV. K-CONSERVATIVE POLICIES

In this section, we discuss the K-conservative policy, where K is the number of time slots to send conservatively after a failure, before sending aggressively again. The Markov chain corresponding to a K-conservative policy is shown in Fig. 1; its states are the number of time slots the transmitter has sent conservatively since the last failure.

Fig. 1. K-conservative policy Markov chain.

There are K + 2 states in this Markov chain. State 0 corresponds to the moment that sending aggressively fails and the transmitter returns to the conservative stage. State K corresponds to the transmitter having already sent conservatively for K time slots; it will send aggressively in the next time slot. If the transmitter sends aggressively and succeeds, it moves to state SA and continues to send aggressively in the next time slot; otherwise it goes back to state 0. The probability that the transmitter stays in state SA is λ_1. The transmitter has to wait K time slots before sending aggressively again, so the transition probability from state i to state i + 1 is always 1 for 0 ≤ i < K. Each of the K + 2 states corresponds to a belief and an action, and the belief and action determine the expected total discounted reward; thus, given K, there are K + 2 different expected total discounted rewards.

Theorem 4.1: The threshold-ρ policy structure is equivalent to a K_opt-conservative policy structure, where K_opt is the number of time slots that the transmitter sends conservatively after a failure before sending aggressively again. If λ_S > ρ, then K_opt = ⌈log_α(1 − ρ(1 − α)/λ_0)⌉ − 1; otherwise K_opt = ∞.

Proof: Whenever the transmitter sends aggressively and fails, it returns to the conservative stage. The belief after a failure is λ_0, and it then evolves according to T^n(λ_0) = T(T^{n−1}(λ_0)) = λ_0 (1 − α^{n+1})/(1 − α), where n is the number of time slots the transmitter has sent conservatively since the last failure. If λ_S ≤ ρ, then T^n(λ_0) increases with n and T^n(λ_0) → λ_S as n → ∞, so the belief never reaches the threshold; the optimal policy is to always send conservatively, and K_opt = ∞.
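Both cases of Theorem 4.1, together with the threshold itself, can be checked numerically. The sketch below runs value iteration on a discretized belief grid (following the Bellman recursion of Section III) to estimate ρ, then finds K_opt by direct search over T^n(λ_0). All parameter values are illustrative assumptions, and the search is capped so that it terminates even in the K_opt = ∞ case (λ_S ≤ ρ).

```python
# Value-iteration sketch for the known-statistics case (Section III):
# discretize the belief, iterate the Bellman recursion, read off the
# threshold rho, then obtain K_opt by direct search over T^n(lam0).
# All numerical parameter values here are illustrative assumptions.
LAM0, LAM1 = 0.3, 0.9              # assumed transition probabilities
R1, R2, C, BETA = 1.0, 2.0, 0.5, 0.75
ALPHA = LAM1 - LAM0                # alpha = lam1 - lam0

def T(p):
    """One-step belief update under conservative sending: T(p) = alpha*p + lam0."""
    return ALPHA * p + LAM0

N = 500                            # belief-grid resolution
grid = [i / N for i in range(N + 1)]
V = [0.0] * (N + 1)

def interp(V, p):
    """Linearly interpolate the value function at belief p."""
    x = min(max(p, 0.0), 1.0) * N
    i = min(int(x), N - 1)
    return V[i] + (x - i) * (V[i + 1] - V[i])

def q_values(V, p):
    """Values of the two actions at belief p, per Eqs. (8) and (9)."""
    q_sc = R1 + BETA * interp(V, T(p))
    q_sa = p * R2 - (1 - p) * C + BETA * (p * interp(V, LAM1)
                                          + (1 - p) * interp(V, LAM0))
    return q_sc, q_sa

for _ in range(200):               # value iteration; beta^200 is negligible
    V = [max(q_values(V, p)) for p in grid]

# Threshold: smallest grid belief at which sending aggressively is optimal.
rho = next(p for p in grid if q_values(V, p)[1] >= q_values(V, p)[0] - 1e-9)

# K_opt: conservative slots after a failure until T^n(lam0) reaches the
# threshold (capped, since K_opt is infinite when lam_S <= rho).
belief, k_opt = LAM0, 0
while belief < rho and k_opt < 1000:
    belief, k_opt = T(belief), k_opt + 1
```

When λ_S > ρ, the k_opt found by this direct search should match the closed-form expression in Theorem 4.1, up to belief-grid discretization error.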
If λ_S > ρ, then there exists a finite integer K_opt such that T^{K_opt − 1}(λ_0) < ρ and T^{K_opt}(λ_0) ≥ ρ, namely

K_opt = ⌈ log_α( 1 − ρ(1 − α)/λ_0 ) ⌉ − 1.    (19)

V. ONLINE LEARNING FOR AN UNKNOWN CHANNEL

In this section, we discuss how to find the optimal policy when the underlying channel's transition probabilities are
unknown.

Eq. (13), referenced in Section III-A, is:

ρ = [ (1 − βλ_1) R_1 + λ_0 β R_2 + (1 − αβ)(1 − β) C + β(β − 1)(1 − αβ) V(λ_0) ] / [ (1 − αβ)( R_2 + (1 − β) C + β(β − 1) V(λ_0) ) ].    (13)

To find K_opt, we map each K-conservative policy to an arm of a multi-armed bandit with countably many arms. Two challenges arise: (1) the number of arms can be infinite; (2) obtaining the true total discounted reward of an arm would require playing it as time goes to infinity. To address these two challenges, we weaken our objective to finding a suboptimal arm that is an (OPT − (ɛ + δ))-approximation of the optimal arm. Theorems 5.1 and 5.2 address the two challenges, respectively. We define an (OPT − ɛ) arm as an arm that gives an (OPT − ɛ)-approximation of the optimal arm regardless of the initial belief, and we let arm SC correspond to the always-send-conservatively policy, i.e., K_opt = ∞.

Theorem 5.1: Given ɛ and a bound B on α, there exists a K_max such that for all K ≥ K_max, the best arm in the arm set C = {0, 1, ..., K, SC} is an (OPT − ɛ) arm.

Proof: If K ≥ K_opt or K_opt = ∞, the optimal arm is already included in the arm set. If K < K_opt < ∞, suppose the transmitter has already sent conservatively for n time slots; let k_opt = K_opt − n, k = K − n, and C' = R_2 + C + (β/(1 − β))(R_2 − R_1). Then

V_{π_{k_opt}}(p) − V_{π_k}(p)    (20)
= [ R_1 (1 − β^{k_opt})/(1 − β) + β^{k_opt} V_SA(T^{k_opt}(p)) ] − [ R_1 (1 − β^k)/(1 − β) + β^k V_SA(T^k(p)) ]
= β^k [ (R_1/(1 − β))(1 − β^{k_opt − k}) + β^{k_opt − k} V_SA(T^{k_opt}(p)) − V_SA(T^k(p)) ]
< β^k [ V_SA(T^{k_opt}(p)) − V_SA(T^k(p)) ]    (since R_1/(1 − β) ≤ V_SA(T^{k_opt}(p)))
= β^k ( T^{k_opt}(p) − T^k(p) ) ( R_2 + C + β( V(λ_1) − V(λ_0) ) )
< β^k ( T^{K_opt}(λ_0) − T^K(λ_0) ) ( R_2 + C + β( R_2/(1 − β) − R_1/(1 − β) ) )
= β^k α^{K+1} ( 1 − α^{K_opt − K} ) λ_S C'
< α^{K+1} C' ≤ B^{K+1} C'.

Let K_max = ⌈ log_B( ɛ/C' ) ⌉ − 1; then for all K ≥ K_max, V_{π_{k_opt}}(p) − V_{π_k}(p) < ɛ.

Theorem 5.2: Given δ, there exists a T_max such that for all T ≥ T_max and any arm i, the finite-horizon total discounted reward up to time T is at most δ away from the infinite-horizon total discounted reward.

Proof:

E[ Σ_{t=0}^∞ β^t R_i(t) | b_0 = p ] − E[ Σ_{t=0}^{T_max} β^t R_i(t) | b_0 = p ] = E[ Σ_{t=T_max+1}^∞ β^t R_i(t) | b_0 = p ] < β^{T_max+1} R_2/(1 − β).    (21)

When T ≥ log_β( δ(1 − β)/R_2 ) − 1, we have E[ Σ_{t=0}^∞ β^t R_i(t) | b_0 = p ] − E[ Σ_{t=0}^{T} β^t R_i(t) | b_0 = p ] < δ.

We define a period as the time interval between arm switches, the A-reward as the average (OPT − δ) finite-horizon approximation of the total discounted reward over one period, and the regret as the number of time slots during which the transmitter uses policies that are more than (ɛ + δ) away from the optimal policy.

We design the UCB-Period (UCB-P) algorithm shown in Algorithm 1. It is similar to the UCB1 algorithm, but the time unit is the period and the A-rewards are accumulated per period.

Algorithm 1 Deterministic policy: UCB-P
  Initialization: L (≥ K_max + 2) arms: arm 0, ..., arm (L − 2), and the SC arm. Play each arm for T (≥ T_max) time slots, then keep playing that arm until it hits arm state 0; record the initial A-reward of state 0 for each arm. Let Ā_i = A(i), for i = 0, 1, ..., L − 2, SC, be the initial average A-reward for state 0 of each arm; let n_i = 1 be the initial number of periods arm i has been played; and let n = L be the initial number of periods played so far.
  for each subsequent period do
    Select the arm with the highest value of ((1 − β) Ā_i + C)/(R_2 + C) + sqrt( 2 ln(n)/n_i ).
    Play the selected arm for one period.
    Update the average A-reward for state 0 and n_i of the selected arm, and update n.
  end for

Theorem 5.3: The regret of Algorithm UCB-P is bounded by O(L(L + T) ln(t)), where t is the number of time slots elapsed so far.

Proof: The procedure is similar to that of UCB1, and [4, Thm. 1] can be adapted. The A-reward lies in the range [−C/(1 − β), R_2/(1 − β)], where the left endpoint corresponds to the transmitter sending aggressively and failing in every time slot, and the right endpoint corresponds to the transmitter sending aggressively and succeeding in every time slot. We normalize the A-reward to the range [0, 1]. The UCB1 analysis then shows that the number of plays selecting a non-optimal arm is bounded by O((L − 1) ln(n)), where L is the number of arms and n is the overall number of plays made so far.
The best arm is an (OPT − (ɛ + δ)) arm. If a non-best arm K hits the SA state at the T-th time slot, the transmitter keeps playing that arm until sending fails, so the time spent playing the arm can exceed T; since R_2 is the best reward the transmitter can obtain, these extra time slots do not count towards the regret. However, if such an arm hits state 0 just before the T-th time slot, the transmitter needs to send conservatively for K more time slots. Since K ≤ L, an arm can contribute regret for at most (L + T) time slots, so we use (L + T) to bound the number of regret-generating time slots in one period. The number of periods playing a non-best arm is bounded by O((L − 1) ln(n)), where n is the number of periods and n ≤ t/T. In total, the number of time slots playing non-best arms is bounded by O((L − 1)(L + T) ln(t/T)) = O(L(L + T) ln(t)). Note that besides the best arm, some non-best arms in the arm set may also give an (OPT − (ɛ + δ))-approximation of the true expected total discounted reward; this only makes the regret smaller, and O(L(L + T) ln(t)) remains an upper bound. More specifically, taking L = K_max + 2 and T = T_max, the regret is bounded by O(K_max (K_max + T_max) ln(t)).

VI. SIMULATIONS

We start by analyzing the case of a known underlying transition matrix. We select 5 groups of transition probabilities, each corresponding to a different threshold, or equivalently a different K_opt-conservative policy. The first corresponds to a scenario where the optimal policy is to always send conservatively; the other 4 correspond to different K_opt-conservative policies. In Fig. 2, the x-axis represents different K-conservative policies, the y-axis represents the expected total discounted reward, and the 5 curves correspond to the different transition probabilities. The expected total discounted reward is maximized at K = K_opt. Taking the K_opt = 4 curve as an example, when K < 4 the total discounted reward increases with K, and when K > 4 it decreases with K; the maximum is attained at K = 4.

Fig. 2. Expected total discounted reward for K-conservative policies.

TABLE I. The optimal strategy table: columns λ_0, λ_1, ρ, and K_opt, computed with R_1 = 1, R_2 = 2, C = 0.5, and β = 0.75.

Next, we consider the case of an unknown transition probability matrix. For these scenarios, if α is bounded by B = 0.8, then taking ɛ = 0.02 and δ = 0.02 gives K_max = 26 and T_max = 20. For the simulations, we take L = 30 and T = 100. We run the simulations with different λ_0 and λ_1 and measure the percentage of time playing the (OPT − (ɛ + δ)) arm. Fig. 3 shows the simulation results for the UCB-P algorithm: as time goes to infinity, the percentage of time playing (OPT − (ɛ + δ)) arms approaches 100%.

Fig. 3. Percentage of time that (OPT − (ɛ + δ)) arms are selected by the UCB-P algorithm.

Although the UCB-P algorithm is mathematically proven to have logarithmic regret as t → ∞, it does not work that well in practice, since it converges slowly when the differences between arms are small. We therefore use the UCBP-TUNED algorithm, in which the confidence bound of UCB-P is tuned more finely; UCB1-TUNED [4] is adapted in our UCBP-TUNED algorithm. We use

V_k(s) := (1/s) Σ_{τ=1}^{s} ( ((1 − β) A_{k,τ} + C)/(R_2 + C) )^2 − ( ((1 − β) Ā_k + C)/(R_2 + C) )^2 + sqrt( 2 ln(n)/s )    (22)
as an upper confidence bound on the variance of arm k, where k is the arm index, s is the number of periods that arm k has been played during the first n periods, and A_{k,τ} is the A-reward obtained the τ-th time arm k was played. We replace the upper confidence bound sqrt(2 ln(n)/n_k) of policy UCB-P with

sqrt( (ln(n)/n_k) min{ 1/4, V_k(n_k) } ),    (23)

and rerun the simulations. From Fig. 4, we can see that the UCBP-TUNED algorithm converges much faster than UCB-P.

Fig. 4. Percentage of time that (OPT − (ɛ + δ)) arms are selected by the UCBP-TUNED algorithm.

VII. CONCLUSION

This paper discusses the optimal policy for transmitting over a Gilbert-Elliott channel so as to maximize the expected total discounted reward. When the underlying channel transition probabilities are known, the optimal policy has a single threshold, and the threshold determines a K-conservative policy. When the transition probabilities are unknown but a bound on α is known, we relax the requirement and show how to learn (OPT − (ɛ + δ)) policies. We designed the UCB-P algorithm, and the simulation results show that the percentage of time the (OPT − (ɛ + δ)) arms are selected approaches 100% as n → ∞. For future work, we plan to relax the assumption of a known bound on α and to consider ways to further speed up the learning.

ACKNOWLEDGMENT

This research was sponsored in part by the U.S. Army Research Laboratory under the Network Science Collaborative Technology Alliance, Agreement Number W911NF, and by the Okawa Foundation, under an Award to support research on Network Protocols that Learn.

REFERENCES

[1] R. Smallwood and E. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, 1971.
[2] S. M. Ross, Applied Probability Models with Optimization Applications, San Francisco: Holden-Day.
[3] A. Laourine and L. Tong, "Betting on Gilbert-Elliott channels," IEEE Transactions on Wireless Communications, vol. 9, no. 2, Feb. 2010.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235-256, 2002.
[5] L. A. Johnston and V. Krishnamurthy, "Opportunistic file transfer over a fading channel: A POMDP search theory formulation with optimal threshold policies," IEEE Transactions on Wireless Communications, vol. 5, 2006.
[6] A. K. Karmokar, D. V. Djonin, and V. K. Bhargava, "Optimal and suboptimal packet scheduling over correlated time varying flat fading channels," IEEE Transactions on Wireless Communications, vol. 5, no. 2, Feb. 2006.
[7] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Transactions on Wireless Communications, vol. 7, no. 12, 2008.
[8] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multi-channel opportunistic access," IEEE Transactions on Information Theory, vol. 55, no. 9, Sep. 2009.
[9] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret," in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011.
[10] N. Nayyar, Y. Gai, and B. Krishnamachari, "On a restless multi-armed bandit problem with non-identical arms," in Proc. Allerton Conference.
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationLearning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Cem Tekin, Member, IEEE, Mingyan Liu, Senior Member, IEEE 1 Abstract In this paper we consider the problem of learning the optimal
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationCommunication constraints and latency in Networked Control Systems
Communication constraints and latency in Networked Control Systems João P. Hespanha Center for Control Engineering and Computation University of California Santa Barbara In collaboration with Antonio Ortega
More informationEvaluation of multi armed bandit algorithms and empirical algorithm
Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationOptimal and Suboptimal Policies for Opportunistic Spectrum Access: A Resource Allocation Approach
Optimal and Suboptimal Policies for Opportunistic Spectrum Access: A Resource Allocation Approach by Sahand Haji Ali Ahmad A dissertation submitted in partial fulfillment of the requirements for the degree
More informationCombinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations
IEEE/ACM TRANSACTIONS ON NETWORKING 1 Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations Yi Gai, Student Member, IEEE, Member,
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationOn the Throughput, Capacity and Stability Regions of Random Multiple Access over Standard Multi-Packet Reception Channels
On the Throughput, Capacity and Stability Regions of Random Multiple Access over Standard Multi-Packet Reception Channels Jie Luo, Anthony Ephremides ECE Dept. Univ. of Maryland College Park, MD 20742
More informationOPPORTUNISTIC Spectrum Access (OSA) is emerging
Optimal and Low-complexity Algorithms for Dynamic Spectrum Access in Centralized Cognitive Radio Networks with Fading Channels Mario Bkassiny, Sudharman K. Jayaweera, Yang Li Dept. of Electrical and Computer
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationAn Introduction to Markov Decision Processes. MDP Tutorial - 1
An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal
More informationAdaptive Shortest-Path Routing under Unknown and Stochastically Varying Link States
Adaptive Shortest-Path Routing under Unknown and Stochastically Varying Link States Keqin Liu, Qing Zhao To cite this version: Keqin Liu, Qing Zhao. Adaptive Shortest-Path Routing under Unknown and Stochastically
More informationExploration vs. Exploitation with Partially Observable Gaussian Autoregressive Arms
Exploration vs. Exploitation with Partially Observable Gaussian Autoregressive Arms Julia Kuhn The University of Queensland, University of Amsterdam j.kuhn@uq.edu.au Michel Mandjes University of Amsterdam
More informationExploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks
1 Exploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks Wenzhuo Ouyang, Sugumar Murugesan, Atilla Eryilmaz, Ness B. Shroff arxiv:1009.3959v6 [cs.ni] 7 Dec 2011 Abstract We
More informationPower Aware Wireless File Downloading: A Lyapunov Indexing Approach to A Constrained Restless Bandit Problem
IEEE/ACM RANSACIONS ON NEWORKING, VOL. 24, NO. 4, PP. 2264-2277, AUG. 206 Power Aware Wireless File Downloading: A Lyapunov Indexing Approach to A Constrained Restless Bandit Problem Xiaohan Wei and Michael
More informationRestless Bandit Index Policies for Solving Constrained Sequential Estimation Problems
Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th,
More informationThe Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret
1 The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Strict Regret arxiv:19.1533v1 [math.oc] 7 Sep 11 Wenhan Dai, Yi Gai, Bhaskar Krishnamachari and Qing Zhao Department of Aeronautics
More informationarxiv: v3 [cs.lg] 4 Oct 2011
Reinforcement learning based sensing policy optimization for energy efficient cognitive radio networks Jan Oksanen a,, Jarmo Lundén a,b,1, Visa Koivunen a a Aalto University School of Electrical Engineering,
More informationCOS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement
More informationDistributed Reinforcement Learning Based MAC Protocols for Autonomous Cognitive Secondary Users
Distributed Reinforcement Learning Based MAC Protocols for Autonomous Cognitive Secondary Users Mario Bkassiny and Sudharman K. Jayaweera Dept. of Electrical and Computer Engineering University of New
More informationIntroduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16
600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an
More informationApproximation Algorithms for Partial-information based Stochastic Control with Markovian Rewards
Approximation Algorithms for Partial-information based Stochastic Control with Markovian Rewards Sudipto Guha Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia PA
More informationCombinatorial Multi-Armed Bandit: General Framework, Results and Applications
Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Wei Chen Microsoft Research Asia, Beijing, China Yajun Wang Microsoft Research Asia, Beijing, China Yang Yuan Computer Science
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability
More informationPerformance and Convergence of Multi-user Online Learning
Performance and Convergence of Multi-user Online Learning Cem Tekin, Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, Michigan, 4809-222 Email: {cmtkn,
More informationarxiv:cs/ v1 [cs.ni] 27 Feb 2007
Joint Design and Separation Principle for Opportunistic Spectrum Access in the Presence of Sensing Errors Yunxia Chen, Qing Zhao, and Ananthram Swami Abstract arxiv:cs/0702158v1 [cs.ni] 27 Feb 2007 We
More informationThe Multi-Arm Bandit Framework
The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationAlgorithms for Dynamic Spectrum Access with Learning for Cognitive Radio
Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v4 [cs.ni] 6 Feb 2010
More informationExperts in a Markov Decision Process
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow
More informationAnalysis of Thompson Sampling for the multi-armed bandit problem
Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationMulti-armed bandit models: a tutorial
Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)
More informationOptimal Channel Probing and Transmission Scheduling for Opportunistic Spectrum Access
Optimal Channel Probing and Transmission Scheduling for Opportunistic Spectrum Access ABSTRACT Nicholas B. Chang Department of Electrical Engineering and Computer Science University of Michigan Ann Arbor,
More informationEE 550: Notes on Markov chains, Travel Times, and Opportunistic Routing
EE 550: Notes on Markov chains, Travel Times, and Opportunistic Routing Michael J. Neely University of Southern California http://www-bcf.usc.edu/ mjneely 1 Abstract This collection of notes provides a
More informationIntroducing strategic measure actions in multi-armed bandits
213 IEEE 24th International Symposium on Personal, Indoor and Mobile Radio Communications: Workshop on Cognitive Radio Medium Access Control and Network Solutions Introducing strategic measure actions
More informationP e = 0.1. P e = 0.01
23 10 0 10-2 P e = 0.1 Deadline Failure Probability 10-4 10-6 10-8 P e = 0.01 10-10 P e = 0.001 10-12 10 11 12 13 14 15 16 Number of Slots in a Frame Fig. 10. The deadline failure probability as a function
More informationCan Decentralized Status Update Achieve Universally Near-Optimal Age-of-Information in Wireless Multiaccess Channels?
Can Decentralized Status Update Achieve Universally Near-Optimal Age-of-Information in Wireless Multiaccess Channels? Zhiyuan Jiang, Bhaskar Krishnamachari, Sheng Zhou, Zhisheng Niu, Fellow, IEEE {zhiyuan,
More informationDistributed Stochastic Online Learning Policies for Opportunistic Spectrum Access
1 Distributed Stochastic Online Learning Policies for Opportunistic Spectrum Access Yi Gai, Member, IEEE, and Bhaskar Krishnamachari, Member, IEEE, ACM Abstract The fundamental problem of multiple secondary
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationAn Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits
An Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits Peter Jacko YEQT III November 20, 2009 Basque Center for Applied Mathematics (BCAM), Bilbao, Spain Example: Congestion
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationAverage Age of Information with Hybrid ARQ under a Resource Constraint
1 Average Age of Information with Hybrid ARQ under a Resource Constraint arxiv:1710.04971v3 [cs.it] 5 Dec 2018 Elif Tuğçe Ceran, Deniz Gündüz, and András György Department of Electrical and Electronic
More informationTuning the TCP Timeout Mechanism in Wireless Networks to Maximize Throughput via Stochastic Stopping Time Methods
Tuning the TCP Timeout Mechanism in Wireless Networks to Maximize Throughput via Stochastic Stopping Time Methods George Papageorgiou and John S. Baras Abstract We present an optimization problem that
More informationAlgorithms for Dynamic Spectrum Access with Learning for Cognitive Radio
Algorithms for Dynamic Spectrum Access with Learning for Cognitive Radio Jayakrishnan Unnikrishnan, Student Member, IEEE, and Venugopal V. Veeravalli, Fellow, IEEE 1 arxiv:0807.2677v2 [cs.ni] 21 Nov 2008
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationOptimal Sleeping Mechanism for Multiple Servers with MMPP-Based Bursty Traffic Arrival
1 Optimal Sleeping Mechanism for Multiple Servers with MMPP-Based Bursty Traffic Arrival Zhiyuan Jiang, Bhaskar Krishnamachari, Sheng Zhou, arxiv:1711.07912v1 [cs.it] 21 Nov 2017 Zhisheng Niu, Fellow,
More informationOptimal Channel Probing and Transmission Scheduling in a Multichannel System
Optimal Channel Probing and Transmission Scheduling in a Multichannel System Nicholas B. Chang and Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor,
More informationTowards control over fading channels
Towards control over fading channels Paolo Minero, Massimo Franceschetti Advanced Network Science University of California San Diego, CA, USA mail: {minero,massimo}@ucsd.edu Invited Paper) Subhrakanti
More informationOnline regret in reinforcement learning
University of Leoben, Austria Tübingen, 31 July 2007 Undiscounted online regret I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner: T T
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationNear-optimal policies for broadcasting files with unequal sizes
Near-optimal policies for broadcasting files with unequal sizes Majid Raissi-Dehkordi and John S. Baras Institute for Systems Research University of Maryland College Park, MD 20742 majid@isr.umd.edu, baras@isr.umd.edu
More informationA Structured Multiarmed Bandit Problem and the Greedy Policy
A Structured Multiarmed Bandit Problem and the Greedy Policy Adam J. Mersereau Kenan-Flagler Business School, University of North Carolina ajm@unc.edu Paat Rusmevichientong School of Operations Research
More informationCognitive Spectrum Access Control Based on Intrinsic Primary ARQ Information
Cognitive Spectrum Access Control Based on Intrinsic Primary ARQ Information Fabio E. Lapiccirella, Zhi Ding and Xin Liu Electrical and Computer Engineering University of California, Davis, California
More informationReinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies
Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationThroughput-Delay Analysis of Random Linear Network Coding for Wireless Broadcasting
Throughput-Delay Analysis of Random Linear Network Coding for Wireless Broadcasting Swapna B.T., Atilla Eryilmaz, and Ness B. Shroff Departments of ECE and CSE The Ohio State University Columbus, OH 43210
More informationCsaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008
LEARNING THEORY OF OPTIMAL DECISION MAKING PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS Csaba Szepesvári 1 1 Department of Computing Science University of Alberta Machine Learning Summer School,
More informationScheduling in parallel queues with randomly varying connectivity and switchover delay
Scheduling in parallel queues with randomly varying connectivity and switchover delay The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.
More informationLecture 3: Lower Bounds for Bandit Algorithms
CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and
More informationCooperative HARQ with Poisson Interference and Opportunistic Routing
Cooperative HARQ with Poisson Interference and Opportunistic Routing Amogh Rajanna & Mostafa Kaveh Department of Electrical and Computer Engineering University of Minnesota, Minneapolis, MN USA. Outline
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationWIRELESS systems often operate in dynamic environments
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 61, NO. 9, NOVEMBER 2012 3931 Structure-Aware Stochastic Control for Transmission Scheduling Fangwen Fu and Mihaela van der Schaar, Fellow, IEEE Abstract
More informationOpportunistic Scheduling Using Channel Memory in Markov-modeled Wireless Networks
Opportunistic Scheduling Using Channel Memory in Markov-modeled Wireless Networks Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School
More informationNotes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed
CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More information