Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel


Yanting Wu, Dept. of Electrical Engineering, University of Southern California
Bhaskar Krishnamachari, Dept. of Electrical Engineering, University of Southern California

Abstract
This paper studies the optimal transmission policy for a Gilbert-Elliott channel. The transmitter has two actions: sending aggressively or sending conservatively, with rewards depending on the action chosen and the underlying channel state. The aim is to compute a scheduling policy that determines which action to choose at each time slot in order to maximize the expected total discounted reward. We first establish the threshold structure of the optimal policy when the underlying channel statistics are known. We then consider the more challenging case when the statistics are unknown. For this problem, we map different threshold policies to arms of a suitably defined multi-armed bandit problem. To tractably handle the complexity introduced by countably infinite arms and the infinite time horizon, we weaken the objective slightly, seeking an (OPT − (ε + δ))-approximate policy instead. We present the UCB-P algorithm, which achieves this objective with regret that grows only logarithmically in time.

I. INTRODUCTION

Communication over wireless channels is affected by fading, interference, path loss, and similar impairments. To make better use of a wireless channel, a transmitter needs to adapt transmission parameters such as data rate and transmission power to the state of the channel. In this paper, we analyze mathematically a communication system operating over a two-state Markov channel (known as the Gilbert-Elliott channel) in a time-slotted fashion. The objective is for the transmitter to decide at each time, based on prior observations, whether to send data at an aggressive or a conservative rate. The former incurs the risk of failure but reveals the channel's true state, while the latter is a safe but unrevealing choice.

When the channel transition probabilities are known, this problem can be modelled as a Partially Observable Markov Decision Process (POMDP). This formulation is closely related to recent work on betting on Gilbert-Elliott channels [3], which considers three choices and shows that a threshold-type policy consisting of one, two, or three thresholds, depending on the parameters, is optimal. In our setting, we show that the optimal policy always has a single threshold that corresponds to a K-conservative policy, in which the transmitter adopts the conservative approach for K steps after each failure before reattempting the aggressive strategy.

Unlike [3], however, our focus is on the case when the underlying state transition matrix is unknown. In this case, finding the optimal strategy is equivalent to finding the optimal choice of K. We map the problem to a multi-armed bandit in which each possible K-conservative policy corresponds to an arm. To deal with the difficulties of optimizing the discounted reward over an infinite horizon, and with the countably infinite arms that result from this mapping, we introduce approximation parameters δ and ε, and show that a modification of the well-known UCB1 policy guarantees that the number of times arms that are more than (ε + δ) away from the optimal are played is bounded by a logarithmic function of time. In other words, the time-averaged regret with respect to an (OPT − (ε + δ)) policy tends to zero.
We briefly review other recent papers that have treated similar problems. Johnston and Krishnamurthy [5] consider the problem of minimizing the transmission energy and latency of transmitting a file across a Gilbert-Elliott fading channel; they formulate it as a POMDP, identify a threshold policy for it, and analyze it for various parameter settings. Karmokar et al. [6] consider optimizing multiple objectives (transmission power, delay, and drop probability) for packet scheduling across a more general finite-state Markov channel. Motivated by opportunistic spectrum sensing, several recent studies have explored optimizing sensing and access decisions over multiple independent but stochastically identical parallel Gilbert-Elliott channels, in which the objective is to select one channel at each time; these works show that a simple myopic policy is optimal [7], [8]. In [9], Dai et al. consider the non-Bayesian version of the sensing problem where the channel parameters are unknown, and show that when the problem has a finite-option structure, online learning algorithms can be designed to achieve near-logarithmic regret. In [10], Nayyar et al. consider the same sensing problem over two non-identical channels and derive the optimal threshold-structured policy for it; for the non-Bayesian setting, they show a mapping to a countably infinite-armed multi-armed bandit and prove logarithmic regret with respect to an (OPT − δ) policy, an approach similar to the one adopted in this work for a different formulation.

The paper is organized as follows. Section II introduces the model. Section III gives the structure of the optimal policy when the underlying channel transition probabilities are known. Section IV discusses K-conservative policies and proves that the optimal policy corresponds to a K-conservative policy. Section V discusses using multi-armed bandits to find the optimal K; to address the two challenges of an infinite number of arms and an infinite time horizon, we weaken the objective to finding policies that are at most (ε + δ) away from the optimal, and we design the UCB-P algorithm to learn such policies. Section VI presents simulation results, and Section VII concludes the paper.

II. MODEL

We consider the Gilbert-Elliott channel, a Markov chain with two states: good (denoted by 1) and bad (denoted by 0). If the channel is good, it allows the transmitter to send data successfully at a high rate. If the channel is bad, it only allows the transmitter to send data successfully at a low rate. The transition probability matrix is

  P = [ P₀₀  P₀₁ ]   [ 1 − λ₀   λ₀ ]
      [ P₁₀  P₁₁ ] = [ 1 − λ₁   λ₁ ].   (1)

Define α = λ₁ − λ₀. We assume that the channel is positively correlated, i.e., α ≥ 0.

At the beginning of each time slot, the transmitter chooses one of the following two actions.

Sending conservatively (SC): the transmitter sends data at a low rate. Regardless of the channel state, it successfully transmits a small number of bits, and we assign a reward R₁ to this action. Since the transmission is always successful, the transmitter cannot learn the channel state when this action is chosen.

Sending aggressively (SA): the transmitter sends data at a high rate. If the channel is in the good state, the transmission is successful and the transmitter receives a high reward R₂ (> R₁). If the channel is in the bad state, sending at a high rate causes high error and drop rates; we consider the transmission a failure and the transmitter incurs a constant penalty C. We assume that when the transmitter sends aggressively it learns the state of the channel; in other words, when the channel is in the bad state an aggressive transmission will encounter and detect failure.

Because the channel state is not directly observable when sending conservatively, the problem considered in this paper is a Partially Observable Markov Decision Process (POMDP). It has been shown in [1] that a sufficient statistic for making optimal decisions in this POMDP is the conditional probability that the channel is in state 1 given all past actions and observations. We call this conditional probability the belief, b_t = Pr[S_t = 1 | H_t], where H_t is the history of all actions and observations before the t-th time slot. When the transmitter sends aggressively, it learns the channel state, so the belief becomes λ₀ if the channel was bad and λ₁ if it was good. The expected reward is

  R(b_t, A_t) = { R₁                       if A_t = SC,
                { b_t R₂ − (1 − b_t) C     if A_t = SA,   (2)

where b_t is the belief that the channel is in the good state and A_t is the action taken by the transmitter at time t. Decisions are made to maximize the expected total discounted reward

  E[ Σ_{t=0}^∞ β^t R(b_t, A_t) | b₀ = p ],   (3)

where R(b_t, A_t) is the expected reward at time t, β (< 1) is a constant discount factor, and b₀ is the initial belief.
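The short Python sketch below is our own illustration (not part of the original paper; the function names and parameter values are assumptions chosen for the example). It simulates the two-state channel of Eq. (1), computes the expected one-slot reward of Eq. (2), and tracks the belief: after SC the belief is simply propagated through the chain, while after SA it resets to λ₁ on success and λ₀ on failure.

```python
import random

# Example channel and reward parameters (the paper's simulations use
# R1 = 1, R2 = 2, C = 0.5, beta = 0.75).
LAM0, LAM1 = 0.1, 0.8        # P(good | bad) and P(good | good)
R1, R2, C = 1.0, 2.0, 0.5

def step_channel(state):
    """Advance the two-state Markov channel by one slot (1 = good, 0 = bad)."""
    p_good = LAM1 if state == 1 else LAM0
    return 1 if random.random() < p_good else 0

def belief_update(b, action, observed_state=None):
    """One-step belief update.
    SC reveals nothing, so the belief is propagated through the chain;
    SA reveals the state, so the next belief is LAM1 (success) or LAM0 (failure)."""
    if action == "SC":
        return LAM0 * (1.0 - b) + LAM1 * b
    return LAM1 if observed_state == 1 else LAM0

def expected_reward(b, action):
    """Expected one-slot reward of Eq. (2) as a function of the belief."""
    return R1 if action == "SC" else b * R2 - (1.0 - b) * C
```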
III. THRESHOLD STRUCTURE OF THE OPTIMAL POLICY

In this section, we discuss the optimal policy when the transition probabilities are known. We prove that the optimal policy has a single threshold and give a closed-form expression for the threshold. A policy, denoted by π, is a rule that maps belief probabilities to actions.

We use V^π(p) to denote the expected total discounted reward the transmitter can obtain given that the initial belief is p and the policy is π:

  V^π(p) = E_π[ Σ_{t=0}^∞ β^t R(b_t, A_t) | b₀ = p ].   (4)

The aim is to find a policy achieving the greatest expected total discounted reward, denoted by V(p):

  V(p) = max_π { V^π(p) }.   (5)

According to [2, Thm. 6.3], there exists a stationary policy π for which V(p) = V^π(p), so V(p) can be calculated from

  V(p) = max_{A ∈ {SA, SC}} { V_A(p) },   (6)

where V_A(p) is the greatest expected total discounted reward attainable by taking action A when the initial belief is p. V_A(p) can be expressed as

  V_A(p) = R(p, A) + β E[ V(p′) | b₀ = p, A₀ = A ],   (7)

where b₀ is the initial belief, A is the action taken by the transmitter, and p′ is the new belief after taking action A.

Sending conservatively: under this action the belief changes from p to T(p) = λ₀(1 − p) + λ₁ p = αp + λ₀, hence

  V_SC(p) = R₁ + β V(T(p)).   (8)

Sending aggressively: under this action the channel state becomes known, so

  V_SA(p) = (p R₂ − (1 − p) C) + β[ p V(λ₁) + (1 − p) V(λ₀) ].   (9)

This formulation is similar to the one in [3], with two main differences: (1) we do not have a separate action for sensing, and (2) we introduce a penalty when an aggressive transmission fails.

Theorem 3.1: The optimal policy has a single-threshold structure:

  π*(p) = { SC if 0 ≤ p ≤ ρ,
          { SA if ρ < p ≤ 1,   (10)

where p is the belief and ρ is the threshold.

We omit the proof because it follows in a straightforward manner from the results in [3]. Since the penalty does not change the linearity of V_SA(p), by [3, Thm. 1, Thm. 2] the value function V(p) is convex and nondecreasing, and thus the optimal policy has a threshold structure. However, unlike [3], where multiple thresholds may exist, in our setting there is a single threshold.
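As an illustration of Theorem 3.1 (our own sketch, not from the paper; the belief grid, iteration count, and parameter values are assumptions), the value function defined by Eqs. (6)-(9) can be computed by value iteration on a discretized belief grid; the greedy action should then switch from SC to SA once as the belief grows, and the switch point approximates the threshold ρ.

```python
import numpy as np

lam0, lam1 = 0.1, 0.8
alpha = lam1 - lam0
R1, R2, C, beta = 1.0, 2.0, 0.5, 0.75

grid = np.linspace(0.0, 1.0, 1001)   # discretized belief space
V = np.zeros_like(grid)              # value-function estimate

for _ in range(2000):                # value iteration over Eqs. (6)-(9)
    V_lam0 = np.interp(lam0, grid, V)
    V_lam1 = np.interp(lam1, grid, V)
    V_sc = R1 + beta * np.interp(alpha * grid + lam0, grid, V)                        # Eq. (8)
    V_sa = grid * R2 - (1 - grid) * C + beta * (grid * V_lam1 + (1 - grid) * V_lam0)  # Eq. (9)
    V_new = np.maximum(V_sc, V_sa)                                                    # Eq. (6)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# The belief at which SA first beats SC approximates the threshold rho of Theorem 3.1.
aggressive = V_sa >= V_sc
rho_hat = grid[np.argmax(aggressive)] if aggressive.any() else 1.0
print("estimated threshold rho ≈", rho_hat)
```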

A. Closed-Form Expression of the Threshold ρ

The threshold ρ is the solution of

  R₁ + β V(T(ρ)) = V_SA(ρ).   (11)

There are two possible scenarios for T(ρ). If T(ρ) ≤ ρ, then V(T(ρ)) = V_SC(T(ρ)), and substituting into Eq. (11) gives

  ρ = (R₁ + C) / ( (R₂ + C) + β V(λ₁) − β R₁/(1 − β) ).   (12)

Otherwise, V(T(ρ)) = V_SA(T(ρ)), and substituting into Eq. (11) gives

  ρ = ( (1 − βλ₁) R₁ + λ₀ β R₂ + (1 − αβ)(1 − β) C + β(β − 1)(1 − αβ) V(λ₀) ) / ( (1 − αβ)( R₂ + (1 − β) C + β(β − 1) V(λ₀) ) ).   (13)

B. V(λ₀) and V(λ₁)

To calculate V(λ₁) and V(λ₀), there are again two possible scenarios. If λ₁ ≤ ρ, then since λ₀ ≤ λ₁ ≤ ρ,

  V(λ₁) = V(λ₀) = R₁ / (1 − β).   (14)

Otherwise, λ₁ > ρ and V(λ₁) = V_SA(λ₁); using Eq. (9),

  V(λ₁) = ( λ₁ R₂ − (1 − λ₁) C + β(1 − λ₁) V(λ₀) ) / (1 − βλ₁).   (15)

To get V(λ₀), we adapt [3, Thm. 4]:

  V(λ₀) = max_{A ∈ {SA, SC}} { V_A(λ₀) }
        = max{ R₁ + β V(T(λ₀)), V_SA(λ₀) }
        = max{ (1 − β^N)/(1 − β) · R₁,  max_{0 ≤ n ≤ N−1} [ (1 − β^n)/(1 − β) · R₁ + β^n V_SA(T^n(λ₀)) ] }.   (16)

Since N is arbitrary and 0 ≤ β < 1, letting N → ∞ yields

  V(λ₀) = max_{n ≥ 0} { (1 − β^n)/(1 − β) · R₁ + β^n V_SA(T^n(λ₀)) }.   (17)

Using Eqs. (9), (15), and (17), we obtain

  V(λ₀) = max_{n ≥ 0} { [ (1 − β^n)/(1 − β) · R₁ + β^n ( κ_n R₂ + κ_n C(1 − β) − C ) ] / [ 1 − β^{n+1}( 1 − (1 − β) κ_n ) ] },   (18)

where κ_n = T^n(λ₀)/(1 − βλ₁) = (1 − α^{n+1}) λ_S/(1 − βλ₁), and λ_S = λ₀/(1 − α) denotes the stationary probability that the channel is in the good state.

IV. K-CONSERVATIVE POLICIES

In this section, we discuss the K-conservative policy, where K is the number of time slots to send conservatively after a failure before sending aggressively again. The Markov chain corresponding to a K-conservative policy is shown in Fig. 1. The states are the number of time slots the transmitter has sent conservatively since the last failure, and there are K + 2 states in this Markov chain. State 0 corresponds to the moment that sending aggressively fails and the transmitter returns to the conservative stage. State K − 1 corresponds to the transmitter having already sent conservatively for K time slots; it will send aggressively in the next time slot. If the transmitter sends aggressively and succeeds, it moves to state SA and continues to send aggressively in the next time slot; otherwise it goes back to state 0. The probability that the transmitter stays in state SA is λ₁. The transmitter must wait K time slots before sending aggressively again, so the transition probability from state i to state i + 1 is always 1 for 0 ≤ i < K. Each of the K + 2 states corresponds to a belief and an action, and the belief and action determine the expected total discounted reward; thus, given K, there are K + 2 different expected total discounted rewards.

[Fig. 1. K-Conservative Policy Markov Chain.]

Theorem 4.1: The threshold-ρ policy structure is equivalent to a K_opt-conservative policy structure, where K_opt is the number of time slots that the transmitter sends conservatively after a failure before sending aggressively again. If λ_S > ρ, then K_opt = ⌈ log_α( 1 − ρ(1 − α)/λ₀ ) ⌉ − 1; otherwise K_opt = ∞.

Proof: Whenever the transmitter sends aggressively and fails, it goes back to the conservative stage.
The belief after a failure is λ₀, and it then evolves as

  T^n(λ₀) = T(T^{n−1}(λ₀)) = λ₀ (1 − α^{n+1}) / (1 − α),

where n is the number of time slots the transmitter has sent conservatively since the last failure. If λ_S ≤ ρ, then T^n(λ₀) increases with n and T^n(λ₀) → λ_S as n → ∞, so the optimal policy always sends conservatively and K_opt = ∞. If λ_S > ρ, then there exists a finite integer K_opt such that T^{K_opt − 1}(λ₀) < ρ and T^{K_opt}(λ₀) ≥ ρ, namely

  K_opt = ⌈ log_α( 1 − ρ(1 − α)/λ₀ ) ⌉ − 1.   (19)
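The closed form in Eq. (19) is easy to check numerically. The following small sketch is ours (the parameter values are illustrative); it computes K_opt from ρ, λ₀, and λ₁ and cross-checks the result by iterating the belief update T(p) = αp + λ₀ directly.

```python
import math

def k_opt(rho, lam0, lam1):
    """K_opt of Theorem 4.1: slots to stay conservative after a failure."""
    alpha = lam1 - lam0
    lam_s = lam0 / (1.0 - alpha)          # stationary probability of the good state
    if lam_s <= rho:
        return math.inf                   # always send conservatively
    return math.ceil(math.log(1.0 - rho * (1.0 - alpha) / lam0, alpha)) - 1

def k_opt_by_iteration(rho, lam0, lam1, max_iter=10_000):
    """Smallest K with T^K(lam0) >= rho, found by iterating T(p) = alpha*p + lam0."""
    alpha = lam1 - lam0
    b = lam0                              # T^0(lam0)
    for k in range(max_iter):
        if b >= rho:
            return k
        b = alpha * b + lam0
    return math.inf

# Example with illustrative values: both computations agree (here, K_opt = 6).
print(k_opt(0.3, 0.1, 0.8), k_opt_by_iteration(0.3, 0.1, 0.8))
```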

V. ONLINE LEARNING FOR THE UNKNOWN CHANNEL

In this section, we discuss how to find the optimal policy when the underlying channel's transition probabilities are unknown. To find K_opt, we map each K-conservative policy to an arm of a multi-armed bandit played over an infinite time horizon. Two challenges arise: (1) the number of arms can be infinite, and (2) to obtain the true total discounted reward, each arm would have to be played continuously as time goes to infinity. To address these two challenges, we weaken our objective to finding a suboptimal arm that is an (OPT − (ε + δ))-approximation of the optimal arm; Theorems 5.1 and 5.2 address the two challenges respectively. We define an (OPT − ε) arm as an arm that gives an (OPT − ε)-approximation of the optimal arm regardless of the initial belief, and we let arm SC correspond to the always-send-conservatively policy, i.e., K_opt = ∞.

Theorem 5.1: Given ε and a bound B on α, there exists a K_max such that for all K ≥ K_max, the best arm in the arm set {0, 1, ..., K, SC} is an (OPT − ε) arm.

Proof: If K ≥ K_opt or K_opt = ∞, the optimal arm is already included in the arm set. If K < K_opt < ∞, suppose the transmitter has already sent conservatively for n time slots; let k_opt = K_opt − n, k = K − n, and C′ = R₂ + C + (β/(1 − β))(R₂ − R₁). Then

  V^{π_{k_opt}}(p) − V^{π_k}(p)
    = [ (1 − β^{k_opt})/(1 − β) · R₁ + β^{k_opt} V_SA(T^{k_opt}(p)) ] − [ (1 − β^k)/(1 − β) · R₁ + β^k V_SA(T^k(p)) ]
    = β^k [ (1 − β^{k_opt − k})/(1 − β) · R₁ + β^{k_opt − k} V_SA(T^{k_opt}(p)) − V_SA(T^k(p)) ]
    ≤ β^k [ V_SA(T^{k_opt}(p)) − V_SA(T^k(p)) ]
    = β^k ( T^{k_opt}(p) − T^k(p) ) ( R₂ + C + β(V(λ₁) − V(λ₀)) )
    ≤ β^k ( T^{K_opt}(λ₀) − T^K(λ₀) ) ( R₂ + C + β( R₂/(1 − β) − R₁/(1 − β) ) )
    = β^k α^{K+1} (1 − α^{K_opt − K}) λ_S C′
    < α^{K+1} C′ ≤ B^{K+1} C′.   (20)

Let K_max = log_B(ε/C′) − 1; then for K ≥ K_max, V^{π_{k_opt}}(p) − V^{π_k}(p) < ε.

Theorem 5.2: Given δ, there exists a T_max such that for all T ≥ T_max and for every arm, the finite-horizon total discounted reward up to time T is at most δ away from the infinite-horizon total discounted reward.

Proof:

  E[ Σ_{t=0}^∞ β^t R_i(t) | b₀ = p ] − E[ Σ_{t=0}^{T_max} β^t R_i(t) | b₀ = p ]
    = E[ Σ_{t=T_max+1}^∞ β^t R_i(t) | b₀ = p ]
    < β^{T_max+1} R₂/(1 − β).   (21)

When T ≥ log_β( δ(1 − β)/R₂ ) − 1, we have E[ Σ_{t=0}^∞ β^t R_i(t) | b₀ = p ] − E[ Σ_{t=0}^{T} β^t R_i(t) | b₀ = p ] < δ.

We define a period as the time interval between arm switches, the A-reward as the (OPT − δ) finite-horizon approximation of the total discounted reward accumulated in one period, and the regret as the number of time slots during which the transmitter uses policies that are more than (ε + δ) away from the optimal policy. We design the UCB-Period (UCB-P) algorithm shown in Algorithm 1. It is similar to the UCB1 algorithm, but the time unit is a period and the A-rewards are accumulated per period.

Algorithm 1 Deterministic policy: UCB-P
Initialization: L (≥ K_max + 2) arms: arm 0, ..., arm (L − 2) and the SC arm. Play each arm for T (≥ T_max) time slots, then keep playing the arm until it hits arm state 0. Record the initial A-reward of state 0 for each arm. Set Ā_i = A(i), i ∈ {0, 1, ..., L − 2, SC}, as the initial average A-reward for state 0 of each arm; set n_i = 1 as the initial number of periods arm i has been played; and set n = L as the number of periods played so far.
for each subsequent period do
  Select the arm i with the highest value of ((1 − β) Ā_i + C)/(R₂ + C) + sqrt( 2 ln(n)/n_i ).
  Play the selected arm for one period.
  Update the average A-reward for state 0 and n_i of the selected arm, and update n.
end for
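A minimal Python sketch of the UCB-P index computation and arm selection follows (our own illustration, not the paper's implementation; the routine play_one_period used in the commented usage example is a hypothetical placeholder that would play an arm for one period and return its A-reward).

```python
import math

class UCBP:
    """Period-based UCB index over the arm set {0, 1, ..., L-2, SC} (Algorithm 1 sketch)."""

    def __init__(self, num_arms, R2, C, beta):
        self.R2, self.C, self.beta = R2, C, beta
        self.avg_A = [0.0] * num_arms     # average A-reward of state 0 per arm
        self.plays = [0] * num_arms       # n_i: periods each arm has been played
        self.total = 0                    # n: periods played so far

    def record(self, arm, a_reward):
        """Update the running average A-reward of an arm after one period."""
        self.plays[arm] += 1
        self.total += 1
        self.avg_A[arm] += (a_reward - self.avg_A[arm]) / self.plays[arm]

    def index(self, arm):
        """Normalized mean A-reward plus the UCB1 exploration bonus."""
        mean = ((1 - self.beta) * self.avg_A[arm] + self.C) / (self.R2 + self.C)
        return mean + math.sqrt(2 * math.log(self.total) / self.plays[arm])

    def select(self):
        """Pick the arm with the highest index (assumes every arm was played once during initialization)."""
        return max(range(len(self.plays)), key=self.index)

# Usage sketch (play_one_period(arm) is a hypothetical helper returning the period's A-reward):
# bandit = UCBP(num_arms=30, R2=2.0, C=0.5, beta=0.75)
# for arm in range(30):                      # initialization: one period per arm
#     bandit.record(arm, play_one_period(arm))
# while True:
#     arm = bandit.select()
#     bandit.record(arm, play_one_period(arm))
```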

Theorem 5.3: The regret of Algorithm UCB-P is bounded by O(L(L + T) ln(t)), where t is the number of time slots that have elapsed.

Proof: The procedure is similar to UCB1, and [4, Thm. 1] can be adapted. The A-reward lies in the range [−C/(1 − β), R₂/(1 − β)], where the left endpoint corresponds to the transmitter sending aggressively and failing in every time slot, and the right endpoint corresponds to the transmitter sending aggressively and succeeding in every time slot. We normalize the A-reward to the range [0, 1]; the UCB1 analysis then shows that the number of plays selecting a non-optimal arm is bounded by O((L − 1) ln(n)), where L is the number of arms and n is the overall number of plays so far.

The best arm is an (OPT − (ε + δ)) arm. If another, non-best arm K is in the SA state at the T-th time slot of a period, the transmitter keeps playing that arm until a transmission fails, so the time spent playing the arm can exceed T; since R₂ is the best reward the transmitter can get, these extra time slots do not count towards the regret. However, if such an arm K hits state 0 just before the T-th time slot, the transmitter needs to send conservatively for K more time slots. Since K < L, an arm can contribute regret for at most (L + T) time slots, so we use (L + T) to bound the number of time slots generating regret in one period. The number of plays selecting non-optimal arms in UCB1 is bounded by O((L − 1) ln(n)); in our problem, the number of periods playing a non-best arm is therefore bounded by O((L − 1) ln(n)), where n is the number of periods and n < t/T. In total, the number of time slots playing non-best arms is bounded by O((L − 1)(L + T) ln(t/T)) = O(L(L + T) ln(t)).

Note that besides the best arm, some non-best arms in the arm set may also give an (OPT − (ε + δ))-approximation of the true expected total discounted reward; this only makes the regret smaller, so O(L(L + T) ln(t)) remains an upper bound on the regret. In particular, taking L = K_max + 2 and T = T_max, the regret is bounded by O(K_max(K_max + T_max) ln(t)).

VI. SIMULATIONS

We start by analyzing the case of a known transition matrix. We select five sets of transition probabilities, each corresponding to a different threshold, or equivalently, a different K_opt-conservative policy. The first corresponds to a scenario in which the optimal policy always sends conservatively; the other four correspond to different K_opt-conservative policies. In Fig. 2, the x-axis represents different K-conservative policies and the y-axis represents the expected total discounted reward; the five curves correspond to the five sets of transition probabilities. The expected total discounted reward is maximized at K = K_opt. Taking the K_opt = 4 curve as an example, the total discounted reward increases with K for K < 4 and decreases with K for K > 4, so it is maximized at K = 4.

[Fig. 2. Expected total discounted reward for K-conservative policies.]

[TABLE I. The optimal strategy table: λ₀, λ₁, ρ, and K_opt for R₁ = 1, R₂ = 2, C = 0.5, β = 0.75.]

Next, we consider the case of an unknown transition probability matrix. If α is bounded by B = 0.8, then taking ε = 0.02 and δ = 0.02 gives K_max = 26 and T_max = 20. For the simulations we take L = 30 and T = 100. We run the simulations with different λ₀ and λ₁ and measure the percentage of time that the (OPT − (ε + δ)) arms are played. Fig. 3 shows the results of running the UCB-P algorithm: as time goes to infinity, the percentage of time playing (OPT − (ε + δ)) arms approaches 100%.

[Fig. 3. Percentage of time that (OPT − (ε + δ)) arms are selected by the UCB-P algorithm.]

Although the UCB-P algorithm is mathematically proven to have logarithmic regret as t → ∞, it does not work that well in practice because it converges slowly when the differences between arms are small. We therefore use a UCBP-TUNED algorithm, in which the confidence bound of UCB-P is tuned more finely by adapting UCB1-TUNED [4]. We use

  V_k(s) := (1/s) Σ_{τ=1}^{s} ( ((1 − β) A_{k,τ} + C)/(R₂ + C) )² − ( ((1 − β) Ā_k + C)/(R₂ + C) )² + sqrt( 2 ln(n)/s )   (22)

as an upper confidence bound on the variance of arm k, where k is the arm index, s is the number of periods that arm k has been played during the first n periods, and A_{k,τ} is the A-reward obtained the τ-th time arm k was played. We then replace the exploration term sqrt(2 ln(n)/n_i) of UCB-P with

  sqrt( (ln(n)/n_i) · min{ 1/4, V_i(n_i) } ).   (23)
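As an illustration (ours, not from the paper), the tuned exploration term of Eqs. (22)-(23) can be computed from the per-period A-rewards of one arm as follows; the normalization constants are the same ones used by the UCB-P index.

```python
import math

def tuned_bonus(a_rewards, n_total, R2, C, beta):
    """Exploration term of Eq. (23) for one arm.

    a_rewards: A-rewards collected each period this arm was played (length s).
    n_total:   total number of periods played so far across all arms (n).
    """
    s = len(a_rewards)
    # Normalize each A-reward to [0, 1] as in UCB-P.
    x = [((1 - beta) * a + C) / (R2 + C) for a in a_rewards]
    mean = sum(x) / s
    # Eq. (22): empirical variance plus its own confidence term.
    v = sum(xi * xi for xi in x) / s - mean * mean + math.sqrt(2 * math.log(n_total) / s)
    # Eq. (23): replaces sqrt(2 ln n / n_i) of UCB-P with the tuned bonus.
    return math.sqrt((math.log(n_total) / s) * min(0.25, v))

# The tuned arm index becomes: normalized mean + tuned_bonus(...), e.g.
# idx = ((1 - 0.75) * avg_A + 0.5) / (2.0 + 0.5) + tuned_bonus(history, n, 2.0, 0.5, 0.75)
```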

Rerunning the simulations with this tuned index, Fig. 4 shows that the UCBP-TUNED algorithm converges much faster than the UCB-P algorithm.

[Fig. 4. Percentage of time that (OPT − (ε + δ)) arms are selected by the UCBP-TUNED algorithm.]

VII. CONCLUSION

This paper discusses the optimal policy for transmitting over a Gilbert-Elliott channel so as to maximize the expected total discounted reward. If the underlying channel transition probabilities are known, the optimal policy has a single threshold, and the threshold determines a K-conservative policy. If the underlying channel transition probabilities are unknown but a bound on α is known, we relax the requirement and show how to learn (OPT − (ε + δ)) policies. We designed the UCB-P algorithm, and the simulation results show that the percentage of time the (OPT − (ε + δ)) arms are selected approaches 100% as n → ∞. For future work, we plan to relax the assumption of a known bound on α and to consider ways of speeding up the learning further.

ACKNOWLEDGMENT

This research was sponsored in part by the U.S. Army Research Laboratory under the Network Science Collaborative Technology Alliance, Agreement Number W911NF, and by the Okawa Foundation, under an award to support research on Network Protocols that Learn.

REFERENCES

[1] R. Smallwood and E. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, 1971.
[2] S. M. Ross, Applied Probability Models with Optimization Applications. San Francisco: Holden-Day.
[3] A. Laourine and L. Tong, "Betting on Gilbert-Elliott channels," IEEE Transactions on Wireless Communications, vol. 9, no. 2, Feb. 2010.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235-256, 2002.
[5] L. A. Johnston and V. Krishnamurthy, "Opportunistic file transfer over a fading channel: A POMDP search theory formulation with optimal threshold policies," IEEE Transactions on Wireless Communications, vol. 5.
[6] A. K. Karmokar, D. V. Djonin, and V. K. Bhargava, "Optimal and suboptimal packet scheduling over correlated time varying flat fading channels," IEEE Transactions on Wireless Communications, vol. 5, no. 2.
[7] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Transactions on Wireless Communications, vol. 7, no. 12.
[8] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multi-channel opportunistic access," IEEE Transactions on Information Theory, vol. 55, no. 9, Sep. 2009.
[9] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.
[10] N. Nayyar, Y. Gai, and B. Krishnamachari, "On a restless multi-armed bandit problem with non-identical arms," in Proc. Allerton Conference on Communication, Control, and Computing.
