Bayesian Congestion Control over a Markovian Network Bandwidth Process

Slide 1: Bayesian Congestion Control over a Markovian Network Bandwidth Process
Parisa Mansourifard (USC), joint work with Bhaskar Krishnamachari (USC) and Tara Javidi (UCSD). November 4, 2013.

Slide 2: Outline
- Introduction: problem formulation, main results
- Analysis: some key properties
- Simulation results
- Summary

Slide 3: Motivation
In many network protocols, a device must set its communication parameters to maximize the utilization of a resource whose availability is a stochastic process. One prominent example is congestion control, in which a transmitter must select its transmission rate to utilize the available bandwidth, which varies randomly due to the dynamic traffic load imposed by other users on the network. The goal is to find the optimal policy that maximizes the total reward (utilization minus penalty).

Slide 4: Introduction (continued)
- Assume the available bandwidth varies as a Markovian process; a sender must set its transmission rate at each time step.
- If the sender selects a rate higher than the available bandwidth:
  - it can utilize the whole available bandwidth,
  - but it has to pay an over-utilization penalty,
  - and perfect information about the current available bandwidth is revealed.
- If the sender selects a rate lower than the available bandwidth:
  - it experiences no loss (no penalty),
  - but the available bandwidth is under-utilized,
  - and the sender gets only partial information about the available bandwidth.
- Hence there is a trade-off between getting more information about the available bandwidth and paying less penalty.

Slide 5: Problem Formulation — Assumptions
- The available bandwidth B_t follows a discrete-time, finite-state Markov process.
- The horizon is finite, denoted T, with discrete time steps t = 1, 2, ..., T.
- The transition matrix is known.
- The state of the process is not fully observable, so the problem is a Partially Observable Markov Decision Process (POMDP).

Slide 6: Problem Formulation
- At each time step, the decision-maker selects an action based on the history of observations.
- It earns a reward that is a function of the selected action and the state (belief vector).
- Objective: select the sequential actions that maximize the total expected discounted reward.

Slide 7: POMDP tuple {M, P, B, A, O, U, R}
- State: B_t, an element of the finite state set M = {1, 2, ..., M}.
- State transition: the transition probabilities of the state B_t over time are assumed known and stationary, given by an M × M transition probability matrix P with elements P_{i,j} = Pr(B_{t+1} = j | B_t = i) for i, j ∈ M and all t.
- Belief vector: the probability distribution of the resource state given all past observations, denoted b_t = [b_t(1), ..., b_t(M)] with elements b_t(k) = Pr(B_t = k), k ∈ M.

Slide 8: POMDP tuple {M, P, B, A, O, U, R}
- Action: at each time step, according to the current belief, we choose a rate r_t ∈ A = {1, ..., M}.
- Observed information: defined by the events o_t(r_t) ∈ O as follows:
  - o_t(r_t) = {B_t = i}, i = 1, ..., r_t − 1: the event of fully observing B_t,
  - o_t(r_t) = {B_t ≥ r_t}: the event of partial observation.

Slide 9: Problem Formulation
[Figure: an example of belief updating with M = 6, p_1 ≠ 0, p_{−1} ≠ 0, B_t = 5, r_t = 3 (partial observation, since r_t ≤ B_t).]

Slide 10: Problem Formulation
[Figure: an example of belief updating with M = 6, p_1 ≠ 0, p_{−1} ≠ 0, B_t = 2, r_t = 3 (full observation, since r_t > B_t).]

Slide 11: POMDP tuple {M, P, B, A, O, U, R}
- Reward: the immediate reward earned is a mapping R : A × O → R:
  R(B_t, r_t) = B_t − C (r_t − B_t)   if r_t > B_t,
  R(B_t, r_t) = r_t                   if r_t ≤ B_t.        (1)
- C is the over-utilization penalty coefficient, B_t is the available bandwidth, and r_t is the selected rate.
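
As a concrete illustration of the reward in (1), here is a minimal Python sketch of the immediate reward and its expectation under a belief vector; the function names and the 1-indexed state convention are illustrative choices, not from the slides.

```python
import numpy as np

def reward(B, r, C):
    """Immediate reward R(B_t, r_t): the selected rate if it fits under the
    bandwidth, otherwise the full bandwidth minus an over-utilization penalty."""
    return B - C * (r - B) if r > B else r

def expected_reward(b, r, C):
    """Expected immediate reward R(b; r) under a belief b over states 1..M,
    with b[k-1] = Pr(B_t = k)."""
    b = np.asarray(b, dtype=float)
    states = np.arange(1, len(b) + 1)
    over = states < r   # states in which the rate r over-shoots the bandwidth
    return (np.sum(b[over] * (states[over] - C * (r - states[over])))
            + r * np.sum(b[~over]))
```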

Slide 12: Optimal Policy and Value Function
- A policy π specifies a sequence of functions π_1, ..., π_T, with π_t : B → A and r_t = π_t(b_t).
- Goal: maximize the total expected discounted reward over the finite horizon T, over all admissible policies π:
  max_π J^π_T(b_0) = max_π E^π[ Σ_{t=0}^{T} β^t R(B_t; r_t) | b_0 ],        (2)
  where 0 ≤ β ≤ 1 is the discount factor and b_0 is the initial belief vector.
- The optimal policy π^opt is a policy that maximizes (2); it exists since the number of admissible policies is finite.

Slide 13: Optimal Policy and Value Function
Dynamic programming (DP):
  V_t(b_t) = max_{r_t} V_t(b_t; r_t),
  V_t(b_t; r_t) = R(b_t; r_t) + β E{ V_{t+1}(b_{t+1}) | b_t, r_t },   t < T,
  V_T(b_T; r_T) = R_T(b_T; r_T).
- The value function V_t(b_t) is the maximum remaining expected reward accrued starting from time t when the current belief vector is b_t.
- V_t(b_t; r_t) is the remaining expected reward accrued from time t when action r_t is chosen at time t and the optimal policy is followed for times t + 1, ..., T.
- Optimal policy: r_t^opt(b) = arg max_{r ∈ A} V_t(b; r).
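
The backward induction above can be sketched directly on beliefs, since the beliefs reachable from a given b_0 over a finite horizon form a finite set. The memoized recursion below is a minimal, illustrative sketch (tractable only for very small M and T); it assumes the belief-update rule spelled out on the backup slide (full observation when r_t > B_t, truncation and renormalization otherwise), and none of the names come from the authors' code.

```python
from functools import lru_cache
import numpy as np

def make_value_function(P, C, beta, T):
    """Finite-horizon backward induction on beliefs. States are 1..M and
    P[i-1, j-1] = Pr(B_{t+1} = j | B_t = i). Returns a function b0 -> V_0(b0)."""
    M = P.shape[0]
    states = np.arange(1, M + 1)

    @lru_cache(maxsize=None)
    def V(t, b_key):
        if t > T:
            return 0.0
        b = np.array(b_key)
        best = -np.inf
        for r in range(1, M + 1):
            # immediate expected reward R(b; r)
            over = states < r
            val = (np.sum(b[over] * (states[over] - C * (r - states[over])))
                   + r * np.sum(b[~over]))
            # full observation (r > B_t = i): next belief is row i of P
            for i in range(1, r):
                if b[i - 1] > 0:
                    val += beta * b[i - 1] * V(t + 1, tuple(np.round(P[i - 1], 12)))
            # partial observation (B_t >= r): next belief is (T_r b) P
            p_ge = b[r - 1:].sum()
            if p_ge > 0:
                trunc = np.where(states >= r, b, 0.0) / p_ge
                val += beta * p_ge * V(t + 1, tuple(np.round(trunc @ P, 12)))
            best = max(best, val)
        return best

    return lambda b0: V(0, tuple(np.round(b0, 12)))
```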

Slide 14: Main Results
Assumption 1: The matrix P satisfies the State-Independent State Change (SISC) property.
- SISC property: P_{i,i+k} = P_{j,j+k} ≜ p_k for all i, j ∈ M, i.e. the probability p_k of moving k steps higher is independent of the current state; k < 0 corresponds to moving |k| steps lower.

Slide 15: Main Results
Assumption 2: The matrix P satisfies the SISC property with edge effects.
- Edge effect: since the state set M is bounded on both sides, the transition probabilities are modified at the limits (edges) of the state set, with probability mass that would leave the set folded back onto the edge states.
- For example, for M = 4 with p_1 = p and p_{−1} = 1 − p:
  P = [ 1−p   p    0    0
        1−p   0    p    0
         0   1−p   0    p
         0    0   1−p   p ]
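
A small sketch of building a transition matrix with the SISC property and edge effects, covering both the M = 4 example above and the p_1 = p_{−1} = 0.3, p_0 = 0.4 setting used later in the simulations; the function name and the one-step-only restriction are illustrative simplifications.

```python
import numpy as np

def sisc_matrix(M, p_down, p_up):
    """SISC transition matrix restricted to one-step moves: the state moves one
    step down with probability p_down, one step up with probability p_up, and
    stays put otherwise. Probability mass that would leave {1, ..., M} is folded
    back onto the edge states (the 'edge effect')."""
    assert 0 <= p_down + p_up <= 1
    P = np.zeros((M, M))
    for i in range(M):
        P[i, i] += 1.0 - p_down - p_up
        P[i, max(i - 1, 0)] += p_down    # folds onto state 1 at the lower edge
        P[i, min(i + 1, M - 1)] += p_up  # folds onto state M at the upper edge
    return P

# The M = 4 example above:            sisc_matrix(4, p_down=1 - p, p_up=p)
# The simulation setting later on:    sisc_matrix(10, 0.3, 0.3)
```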

Slide 16: Main Results
Theorem 1 (lower bound):
  r_lb = min{ r ∈ M : Σ_{i=1}^{r} b(i) ≥ 1/(1 + C) }.
- This lower bound equals the myopic action, which at each time step selects the action maximizing the immediate expected reward and ignores its impact on the future reward: r_myopic(b) = arg max_{r ∈ M} R(b; r).
- Percentile threshold structure: the lowest action whose cumulative distribution (the sum of the beliefs up to that action) exceeds a threshold.
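
A short sketch of the percentile-threshold (myopic) action, using the critical ratio 1/(1 + C) as reconstructed above; the helper name is illustrative.

```python
import numpy as np

def myopic_action(b, C):
    """Lower bound / myopic action: the smallest rate r whose cumulative belief
    sum_{i<=r} b(i) reaches 1/(1+C). States are 1..M, with b[k-1] = Pr(B_t = k)."""
    cdf = np.cumsum(b)
    return int(np.searchsorted(cdf, 1.0 / (1.0 + C)) + 1)  # +1: index -> state
```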

Slide 17: Main Results
Theorem 2 (upper bound): Under Assumption 1 or 2,
  r_ub = min{ r ∈ M : f(β) S̄_r + [(1 + C) − f(β) r] S_r − C ≤ 0 },
  where S_r ≜ Σ_{i=r+1}^{r_h} b(i),  S̄_r ≜ Σ_{i=r+1}^{r_h} i b(i),  and f(β) ≜ β (1 − β^{T−t}) / (1 − β).
- Ordering of the bounds: r_l ≤ r_lb = r_myopic ≤ r_opt ≤ r_ub ≤ r_h, where r_l and r_h are the lowest and the highest states with non-zero beliefs.
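
A sketch of searching for the Theorem 2 threshold; it assumes the reconstruction of f(β) given above, with the remaining horizon T − t passed in explicitly, so treat the exact horizon factor as an assumption rather than the authors' formula.

```python
import numpy as np

def upper_bound_action(b, C, beta, horizon_left):
    """Smallest r satisfying f(beta)*Sbar_r + [(1+C) - f(beta)*r]*S_r - C <= 0,
    with S_r = sum_{i>r} b(i) and Sbar_r = sum_{i>r} i*b(i). States are 1..M."""
    b = np.asarray(b, dtype=float)
    M = len(b)
    states = np.arange(1, M + 1)
    f = beta * (1 - beta**horizon_left) / (1 - beta) if beta < 1 else float(horizon_left)
    for r in range(1, M + 1):
        S = b[r:].sum()                    # total belief strictly above r
        Sbar = (states[r:] * b[r:]).sum()  # belief-weighted states strictly above r
        if f * Sbar + ((1 + C) - f * r) * S - C <= 0:
            return r
    return M
```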

Slide 18: Analysis
Lemma 1: The expected immediate reward is a uni-modal function of the action r.
Lemma 2: V_t(b; r) and V_t(b) are convex with respect to the belief vector b: for b = λ b_1 + (1 − λ) b_2 with 0 ≤ λ ≤ 1,
  V_t(b; r) ≤ λ V_t(b_1; r) + (1 − λ) V_t(b_2; r)   for all r ∈ M,
  V_t(b) ≤ λ V_t(b_1) + (1 − λ) V_t(b_2).
Lemma 3: The future expected reward, V^f_t(b; r), is monotonically increasing in the action:
  V^f_t(b; r_1) − V^f_t(b; r_2) ≥ 0   for r_1 ≥ r_2.

Slide 19: Analysis
First-order stochastic dominance (FOSD): Let b_1, b_2 ∈ B be any two belief vectors. b_1 first-order stochastically dominates b_2 (b_1 ≥_s b_2) if
  Σ_{j=r}^{M} b_1(j) ≥ Σ_{j=r}^{M} b_2(j)   for all r ∈ {1, ..., M}.
Lemma 4: Under Assumption 1 or 2, the value function is an FOSD-increasing function of the belief vector, i.e., if b_1 ≥_s b_2, then V(b_1) ≥ V(b_2).
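
A short check of the FOSD relation between two belief vectors, exactly as defined above; purely illustrative.

```python
import numpy as np

def fosd_geq(b1, b2, tol=1e-12):
    """True if b1 first-order stochastically dominates b2, i.e. the tail sums
    sum_{j>=r} b1(j) >= sum_{j>=r} b2(j) hold for every r."""
    tail1 = np.cumsum(np.asarray(b1, dtype=float)[::-1])[::-1]  # tail sums per state
    tail2 = np.cumsum(np.asarray(b2, dtype=float)[::-1])[::-1]
    return bool(np.all(tail1 >= tail2 - tol))
```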

Slide 20: Analysis
- b^α: the version of b shifted up by α steps, i.e. b^α(i) = b(i − α).
Lemma 5: R(b^α; r) = R(b; r − α) + α,  and  r_myopic(b^α) = r_myopic(b) + α.
Lemma 7: Under Assumption 2,
  V_t(b^α) ≤ V_t(b) + α (1 − β^{T−t}) / (1 − β).

Slide 21: Simulation Results — Upper and Lower Bounds on Optimal Actions
[Figure: the gap between the lower and the upper bounds versus β, and versus the variance (at β = 0.8), for C = 5.]
- Except in the figures where their effect is varied, the simulation parameters are fixed as follows: M = 10, C = 5, β = 0.8, and transition probabilities p_1 = p_{−1} = 0.3, p_0 = 0.4.

Slide 22: Simulation Results — Myopic and Upper-Bound Policies
[Figure: the total expected discounted reward of the two sub-optimal policies versus β, for C = 5 and horizon T = 100.]
- We compare two sub-optimal policies: (i) the myopic policy and (ii) the upper-bound (UB) policy.
- These policies pick the myopic and UB actions, respectively, at every time step and update their belief vectors according to those actions (see the rollout sketch below).
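
The rollout sketch below simulates one run of the myopic percentile-threshold policy against a sampled bandwidth path, with belief updates as on the backup slide. The parameters mirror the setting quoted above (M = 10, C = 5, β = 0.8, p_1 = p_{−1} = 0.3, p_0 = 0.4, T = 100), but the loop itself is an illustrative reconstruction, not the authors' simulator.

```python
import numpy as np

rng = np.random.default_rng(0)
M, C, beta, T = 10, 5.0, 0.8, 100
states = np.arange(1, M + 1)

# SISC transition matrix with edge effects: p_1 = p_{-1} = 0.3, p_0 = 0.4
P = np.zeros((M, M))
for i in range(M):
    P[i, i] += 0.4
    P[i, max(i - 1, 0)] += 0.3
    P[i, min(i + 1, M - 1)] += 0.3

b = np.full(M, 1.0 / M)            # initial belief: uniform over 1..M
B = int(rng.integers(1, M + 1))    # initial (hidden) bandwidth state
total = 0.0
for t in range(T):
    r = int(np.searchsorted(np.cumsum(b), 1.0 / (1.0 + C)) + 1)  # myopic action
    total += beta**t * ((B - C * (r - B)) if r > B else r)       # discounted reward
    if r > B:                          # over-shoot: B_t is fully observed
        b = P[B - 1].copy()
    else:                              # partial observation: B_t >= r
        trunc = np.where(states >= r, b, 0.0)
        b = (trunc / trunc.sum()) @ P
    B = int(rng.choice(states, p=P[B - 1]))                      # bandwidth evolves
print("myopic policy, total discounted reward:", total)
```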

Slide 23: Conclusions and Future Work
Summary:
- We formulated a Bayesian congestion control problem in which a source must select its transmission rate (the action) over a network whose available bandwidth (the resource) evolves as a stochastic process.
- We modeled the problem as a POMDP and derived key properties of the myopic and the optimal policies.
- We proved structural results providing bounds on the optimal actions, yielding tractable sub-optimal solutions that simulations show perform well.
Future work:
- We conjecture that there may be even better approximations of the optimal policy with the same percentile threshold structure.
- Finding an upper bound for a general form of the transition matrix.
- Heuristic policies that outperform the two sub-optimal policies.

Slide 24: Thank you for your attention.

Slide 25: Simulation Results — Upper and Lower Bounds on Optimal Actions
[Figure: actions selected by EM (τ = 4) and their corresponding lower and upper bounds, for C = 5, β = 0.8, M = 10, and transition probabilities p_1 = p_{−1} = 0.3, p_0 = 0.4. The stars in the figure indicate the non-zero beliefs.]
- The figure shows a policy sequence in which the selected actions do not exceed B_t.

Slide 26: Simulation Results — Myopic and Upper-Bound Policies
[Figure: the total expected discounted reward of the two sub-optimal policies versus C, for β = 0.8, τ = 4, horizon T = 100, and transition probabilities p_1 = p_{−1} = 0.3, p_0 = 0.4.]

Slide 27: POMDP tuple {M, P, B, A, O, U, R}
- Belief updating: a mapping U : A × O × B → B:
  b_{t+1} = (T_{r_t} b_t) P   if r_t ≤ B_t,
  b_{t+1} = I_{B_t} P         if r_t > B_t,
  where I_{B_t} is the unit (indicator) row vector of the observed state B_t, and T_r is a non-linear operation on a belief vector b:
  T_r b(i) = 0                           if i < r,
  T_r b(i) = b(i) / Σ_{j=r}^{M} b(j)     if i ≥ r.
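
A direct transcription of this update rule into code (the same rule that was inlined in the DP and rollout sketches earlier); the function name is illustrative.

```python
import numpy as np

def update_belief(b, r, B, P):
    """One-step belief update. If the rate over-shoots (r > B), the bandwidth B
    is fully observed and the next belief is the B-th row of P (I_{B_t} P).
    Otherwise the belief is truncated to {B_t >= r}, renormalized (T_r b), and
    pushed through P."""
    if r > B:
        return P[B - 1].copy()
    states = np.arange(1, len(b) + 1)
    trunc = np.where(states >= r, b, 0.0)
    return (trunc / trunc.sum()) @ P
```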

Slide 28: Main Results
Proposition 1 (a looser upper bound): r_ub2 ≥ r_ub, with
  r_ub2 = min{ r ∈ M : Σ_{i=1}^{r} b(i) ≥ (1 + U) / (1 + C + U) },   where U ≜ f(β) (r_h − r_l) and f(β) is as in Theorem 2.
- This is a percentile threshold structure with an extra term U in the numerator and the denominator of the threshold.
- r_ub2 is an increasing function of U.
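
A sketch of the looser percentile threshold of Proposition 1, again under the reconstructed horizon factor f(β); the names and the explicit remaining-horizon argument are assumptions.

```python
import numpy as np

def loose_upper_bound(b, C, beta, horizon_left):
    """r_ub2: the smallest r whose cumulative belief reaches (1+U)/(1+C+U),
    where U = f(beta) * (r_h - r_l) and r_l, r_h bracket the belief's support."""
    support = np.nonzero(b)[0] + 1                 # states with non-zero belief
    r_l, r_h = int(support.min()), int(support.max())
    f = beta * (1 - beta**horizon_left) / (1 - beta) if beta < 1 else float(horizon_left)
    U = f * (r_h - r_l)
    threshold = (1 + U) / (1 + C + U)
    return int(np.searchsorted(np.cumsum(b), threshold) + 1)
```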

Slide 29: Analysis
Lemma 6: Under Assumption 1,
  V_t(b^α; r) − V_t(b; r − α) = α (1 − β^{T−t+1}) / (1 − β),
  r_t^opt(b^α) = r_t^opt(b) + α,
  V_t(b^α) = V_t(b) + α (1 − β^{T−t}) / (1 − β).
- Note: for β = 1, substitute (1 − β^x)/(1 − β) by x.
Lemma 7: Under Assumption 2,
  V_t(b^α) ≤ V_t(b) + α (1 − β^{T−t}) / (1 − β).

Slide 30: Optimal Policy and Value Function
- V_t(b_t; r_t) is the sum of two terms: (i) the immediate expected reward and (ii) the discounted future expected reward.
- There is no efficiently computable exact solution for this POMDP, so we present upper and lower bounds on the optimal actions.
