Bayesian Congestion Control over a Markovian Network Bandwidth Process

Similar documents
Temporal-Difference Learning

Bayesian Congestion Control over a Markovian Network Bandwidth Process: A multiperiod Newsvendor Problem

Bayesian Congestion Control over a Markovian Network Bandwidth Process

Do Managers Do Good With Other People s Money? Online Appendix

Central Coverage Bayes Prediction Intervals for the Generalized Pareto Distribution

International Journal of Mathematical Archive-3(12), 2012, Available online through ISSN

Multiple Criteria Secretary Problem: A New Approach

Absorption Rate into a Small Sphere for a Diffusing Particle Confined in a Large Sphere

EM Boundary Value Problems

COMPUTATIONS OF ELECTROMAGNETIC FIELDS RADIATED FROM COMPLEX LIGHTNING CHANNELS

10/04/18. P [P(x)] 1 negl(n).

Research Article On Alzer and Qiu s Conjecture for Complete Elliptic Integral and Inverse Hyperbolic Tangent Function

ON INDEPENDENT SETS IN PURELY ATOMIC PROBABILITY SPACES WITH GEOMETRIC DISTRIBUTION. 1. Introduction. 1 r r. r k for every set E A, E \ {0},

4/18/2005. Statistical Learning Theory

Surveillance Points in High Dimensional Spaces

On Computing Optimal (Q, r) Replenishment Policies under Quantity Discounts

Gradient-based Neural Network for Online Solution of Lyapunov Matrix Equation with Li Activation Function

Web-based Supplementary Materials for. Controlling False Discoveries in Multidimensional Directional Decisions, with

The Substring Search Problem

Multihop MIMO Relay Networks with ARQ

Rate Splitting is Approximately Optimal for Fading Gaussian Interference Channels

Solution to HW 3, Ma 1a Fall 2016

ON THE INVERSE SIGNED TOTAL DOMINATION NUMBER IN GRAPHS. D.A. Mojdeh and B. Samadi

Fractional Zero Forcing via Three-color Forcing Games

A Bijective Approach to the Permutational Power of a Priority Queue

LET a random variable x follows the two - parameter

A Converse to Low-Rank Matrix Completion

IEEE/ACM Transactions on Networking. Markov-modeled Downlink Environment: Opportunistic Multiuser Scheduling and the Stability Region

An Application of Fuzzy Linear System of Equations in Economic Sciences

Linear Program for Partially Observable Markov Decision Processes. MS&E 339B June 9th, 2004 Erick Delage

Journal of Inequalities in Pure and Applied Mathematics

NOTE. Some New Bounds for Cover-Free Families

CSCE 478/878 Lecture 4: Experimental Design and Analysis. Stephen Scott. 3 Building a tree on the training set Introduction. Outline.

3.1 Random variables

Method for Approximating Irrational Numbers

Internet Appendix for A Bayesian Approach to Real Options: The Case of Distinguishing Between Temporary and Permanent Shocks

Interaction of Feedforward and Feedback Streams in Visual Cortex in a Firing-Rate Model of Columnar Computations. ( r)

Energy Savings Achievable in Connection Preserving Energy Saving Algorithms

The Congestion of n-cube Layout on a Rectangular Grid S.L. Bezrukov J.D. Chavez y L.H. Harper z M. Rottger U.-P. Schroeder Abstract We consider the pr

Pearson s Chi-Square Test Modifications for Comparison of Unweighted and Weighted Histograms and Two Weighted Histograms

A NEW VARIABLE STIFFNESS SPRING USING A PRESTRESSED MECHANISM

New problems in universal algebraic geometry illustrated by boolean equations

Stanford University CS259Q: Quantum Computing Handout 8 Luca Trevisan October 18, 2012

Unobserved Correlation in Ascending Auctions: Example And Extensions

Efficiency Loss in a Network Resource Allocation Game

Duality between Statical and Kinematical Engineering Systems

FUSE Fusion Utility Sequence Estimator

1 Explicit Explore or Exploit (E 3 ) Algorithm

PAPER 39 STOCHASTIC NETWORKS

Goodness-of-fit for composite hypotheses.

A Multivariate Normal Law for Turing s Formulae

Lecture 28: Convergence of Random Variables and Related Theorems

COLLAPSING WALLS THEOREM

Lifting Private Information Retrieval from Two to any Number of Messages

Suggested Solutions to Homework #4 Econ 511b (Part I), Spring 2004

PROBLEM SET #1 SOLUTIONS by Robert A. DiStasio Jr.

A STUDY OF HAMMING CODES AS ERROR CORRECTING CODES

Secret Exponent Attacks on RSA-type Schemes with Moduli N = p r q

Convergence Dynamics of Resource-Homogeneous Congestion Games: Technical Report

6 Matrix Concentration Bounds

QIP Course 10: Quantum Factorization Algorithm (Part 3)

arxiv: v1 [math.nt] 12 May 2017

Computers and Mathematics with Applications

Relating Branching Program Size and. Formula Size over the Full Binary Basis. FB Informatik, LS II, Univ. Dortmund, Dortmund, Germany

ANA BERRIZBEITIA, LUIS A. MEDINA, ALEXANDER C. MOLL, VICTOR H. MOLL, AND LAINE NOBLE

Chapter 3: Theory of Modular Arithmetic 38

16 Modeling a Language by a Markov Process

An Exact Solution of Navier Stokes Equation

State tracking control for Takagi-Sugeno models

arxiv: v2 [physics.data-an] 15 Jul 2015

An upper bound on the number of high-dimensional permutations

Revision of Lecture Eight

Macro Theory B. The Permanent Income Hypothesis

APPLICATION OF MAC IN THE FREQUENCY DOMAIN

arxiv: v1 [math.co] 1 Apr 2011

Encapsulation theory: the transformation equations of absolute information hiding.

CENTER FOR MULTIMODAL SOLUTIONS FOR CONGESTION MITIGATION (CMS)

Value of Traveler Information for Adaptive Routing in Stochastic Time-Dependent Networks

Efficiency Loss in a Network Resource Allocation Game: The Case of Elastic Supply

Parameter identification in Markov chain choice models

arxiv: v1 [math.co] 4 May 2017

JENSEN S INEQUALITY FOR DISTRIBUTIONS POSSESSING HIGHER MOMENTS, WITH APPLICATION TO SHARP BOUNDS FOR LAPLACE-STIELTJES TRANSFORMS

Math 301: The Erdős-Stone-Simonovitz Theorem and Extremal Numbers for Bipartite Graphs

School of Electrical and Computer Engineering, Cornell University. ECE 303: Electromagnetic Fields and Waves. Fall 2007

Hypothesis Test and Confidence Interval for the Negative Binomial Distribution via Coincidence: A Case for Rare Events

Alternative Tests for the Poisson Distribution

Using Laplace Transform to Evaluate Improper Integrals Chii-Huei Yu

On a quantity that is analogous to potential and a theorem that relates to it

Psychometric Methods: Theory into Practice Larry R. Price

arxiv: v2 [astro-ph] 16 May 2008

Notes on McCall s Model of Job Search. Timothy J. Kehoe March if job offer has been accepted. b if searching

On the Poisson Approximation to the Negative Hypergeometric Distribution

Compactly Supported Radial Basis Functions

Application of Parseval s Theorem on Evaluating Some Definite Integrals

Markscheme May 2017 Calculus Higher level Paper 3

AQI: Advanced Quantum Information Lecture 2 (Module 4): Order finding and factoring algorithms February 20, 2013

Hydroelastic Analysis of a 1900 TEU Container Ship Using Finite Element and Boundary Element Methods

Steady State and Transient Performance Analysis of Three Phase Induction Machine using MATLAB Simulations

Supplementary information Efficient Enumeration of Monocyclic Chemical Graphs with Given Path Frequencies

Fall 2014 Randomized Algorithms Oct 8, Lecture 3

Transcription:

Bayesian Congestion Contol ove a Makovian Netwok Bandwidth Pocess Paisa Mansouifad,, Bhaska Kishnamachai, Taa Javidi Ming Hsieh Depatment of Electical Engineeing, Univesity of Southen Califonia, Los Angeles, CA Depatment of Electical and Compute Engineeing, Univesity of Califonia, San Diego, La Jolla, CA emails: paisama@usc.edu, bkishna@usc.edu, tjavidi@ucsd.edu Abstact We fomulate a Bayesian congestion contol poblem in which a souce must select the tansmission ate ove a netwok whose available bandwidth is modeled as a timehomogeneous finite-state Makov Chain. The decision to tansmit at a ate below the instantaneous available bandwidth esults in an unde-utilization of the esouce while tansmission at ates highe than the available bandwidth esults in a linea penalty. The tade-off is futhe complicated by the asymmety in the infomation acquisition pocess: tansmission ates that happen to be lage than the instantaneous available bandwidth esult in pefect obsevation of the state of the bandwidth pocess. In contast, when tansmission ate is below the instantaneous available bandwidth, only a (potentially athe loose) lowe bound on the available bandwidth is evealed. We show that the poblem of maximizing the thoughput of the souce while avoiding congestion loss can be expessed as a Patially Obsevable Makov Decision Pocess (POMDP). We pove stuctual esults poviding bounds on the optimal actions. The obtained bounds yield tactable sub-optimal solutions that ae shown via simulations to pefom well. I. INTRODUCTION In many netwok potocols, a device must set the communication paametes to maximize the utilization of the esouce whose availability is a stochastic pocess. One pominent example is congestion contol, in which a tansmitte must select the tansmission ate to utilize the available bandwidth, which vaies andomly due to the dynamic natue of taffic load imposed by othe uses on the netwok. In ou congestion contol poblem, we conside a use whose available bandwidth vaies as a Makovian pocess. A sende wants to set its tansmission ate at each time step. If the sende selects a ate highe than the available bandwidth, it can utilize the whole available bandwidth but has to pay an ove-utilization penalty (which is a function of how much the selected ate exceeds the available bandwidth). In this case, we assume that pefect infomation about the cuent available bandwidth is evealed (full obsevation). If the use selects a ate less than the available bandwidth, it does not expeience loss (no penalty), but the available bandwidth is unde-utilized. Also, in this case, the sende gets patial infomation about the available bandwidth, i.e. it can infe that the cuent available bandwidth is lage than o equal to the selected ate, but not its exact value. In such a setting, thee is a tade-off between getting moe infomation about the available bandwidth and paying less penalty. The goal is to find the optimal policy to maximize the total ewad (utilization minus penalty) ove a finite hoizon. 1 We fomalize this poblem as a Patially Obsevable Makov Decision Pocess (POMDP) [1], since the state of undelying Makov chain, called the esouce state, is not fully obseved. Since the POMDP poblem does not have an efficiently computable solution [2], we instead pesent uppe and lowe bounds on the optimal actions. The lowe bounds coespond to the myopic policy which at each time step selects an action maximizing the immediate expected ewad, ignoing the futue. Intuitively, selecting actions highe than the myopic action esults in less immediate ewad but also moe infomation about the esouce state. Thus, to lean moe about the esouce state the optimal action pefes to exceed the myopic action. We pove that the myopic policy has a pecentile theshold stuctue, i.e. it selects an action coesponding to the lowest state above a given pecentile of the beliefs. The uppe bound on the optimal action has a simila closed-fom stuctue, based on the belief vecto and the system paametes. In ode to deive the uppe bound, we futhe assume that the backgound Makov chain has a paticula State-Independent State Change (SISC) stuctue (see Section III fo moe details). We conside the effect of diffeent paametes on the bounds via simulation. The emainde of the pape is oganized as follows: Related woks ae pesented in Section II. In Section III, the poblem fomulation is intoduced. In Section IV, the main esults of the pape ae deived. Section V pesents some key popeties needed fo deiving ou main esults. Simulation esults ae pesented in Section VI. Finally, Section VII concludes the pape. II. RELATED WORK Congestion contol is a fundamental poblem in netwoking and has a ich liteatue on algoithms and theoetical analyses (see e.g. [3], [4]). Existing techniques, such as Tansmission Contol Potocol (TCP), adopt an Additive Incease Multiplicative Decease (AIMD) algoithm, intoduced in [5], that adjusts the congestion window based on the tansmission acknowledgments. The pefomance of AIMD is analyzed in liteatue, e.g. [6] whee Altman et al. intoduce a geneal model based on a multi-state Makov chain fo the moments at which the congestion is detected. Although AIMD often 1 The esults in this pape can be eadily extended to infinite hoizon discounted ewad

woks well in pactice, it is a heuistic that is not guaanteed to be optimal. In this wok, we aim to identify the stuctue of optimal congestion contol, albeit unde the special case of a Makovian available bandwidth pocess. Theshold stuctues of optimal tansmission policies have been established fo simple elated poblems with a two-state Gilbet-Elliott channel in [7], [8]. But hee, we conside a moe geneal case of multi-state Makovian channel Some ecent woks in liteatue, e.g., [9], [10], fomulate the poblem of inventoy management whee some numbe of items needs to be stock (inventoy) and the demand is a stochastic pocess. The poblem of inventoy management has a close connection to ou poblem as we can map the demand to the esouce (bandwidth) and the inventoy level to the action (ate allocated). Of these, the most closely elated wok is that of Bensoussan et al. [9], who conside a POMDP poblem whee the demand is a Makovian pocess. They conside the setting whee the esouces and actions ae both continuous, as well as the case whee the esouces ae discete but the actions emain continuous. They use the un-nomalized beliefs to pove the existence of the optimal policy which is challenging fo the continuous setting. Fo these settings, they also show that the optimal actions exceed the myopic actions. In this pape, in contast to thei wok, we conside the case whee both the esouce states and the actions ae discete and finite. Thus, the existence of the optimal policy is tivial [1]. Futhe, by investigating a specific fom of the tansition pobability, SISC, we deive additional popeties of the optimal policy and the ewad functions. Futhemoe, this wok is the fist to pesent an uppe bound fo the optimal action. Besides giving insight into the optimal actions, this bound can also be used to speed up the seach fo the optimal actions in the dynamic pogamming. In [10], Besbes et al. conside a simila poblem with the assumption that the esouce state is an independent-identical distibution pocess which is a specific fom of Makovian pocesses. They assume that the undelying esouce distibution is not available and need to be estimated fom histoical data. They show that a pecentile theshold policy (simila to ou myopic action) is optimal and chaacteize the implications of patial obsevations on the pefomance of the optimal policy in both discete and continuous settings. We conside the Makovian case in which the given pecentile theshold policy constucts a lowe bound fo the optimal actions. III. PROBLEM FORMULATION We conside a discete-time finite-state Makov pocess, whose state is denoted by B t. At each time step, the use selects an action accoding to the histoy of obsevations- thus eaning a ewad as a function of the selected action and the state B t and gathes some patial infomation about the cuent state. The objective is to find an optimal policy as a sequence of actions to select in ode to maximize the total expected discounted ewad accumulated ove the finite hoizon. Let us denote the finite hoizon by T and let the discete time steps be indexed by t = 1, 2,..., T. We fomulate ou poblem within a POMDP-based famewok defined as a tuple {M, P, B, A, O, U, R} whee: State: The state, B t, is one of the elements of a finite set denoted by M = {1, 2,..., M} State tansition: The state B t vaies ove time accoding to a Makov pocess with a known tansition pobability matix, denoted by P. This matix is an M M matix with elements P i,j = P (B t+1 = j B t = i), i, j M which indicates the pobability of moving fom state i at a time step to the state j at the next time step. Belief vecto: The pobability distibution of the esouce state (assuming a finite state set), given all past obsevations, is denoted by a belief vecto b t = [b t (1),..., b t (M)], with elements of b t (k) = P (B t = k), k M. In othe wods, b t epesents the pobability distibution of B t ove all possible states of M. The set of all possible belief vectos is denoted by B. The goal is to make a decision at each time step based on the histoy of obsevations; but due to the lack of full infomation, the decision should be based on the belief vecto. It can be shown that the belief vecto is a sufficient statistic of the complete obsevation histoy (see e.g., [1]). Action: At each time step, accoding to the cuent belief, we choose an action t A = {1,..., M}. Note the set of actions ae equal to the set of states, i.e. A = M. Obseved infomation: The obseved infomation at time step t is defined by the event o t ( t ) O as a function of selected action. The possible events coesponding to the action t is as follows: - o t ( t ) = {B t = i}, i = 1,..., t 1 is the event of fully obseving B t. This coesponds to the selection of the action highe than B t. - o t ( t ) = {B t t } is the event that B t is lage than o equal to the selected action. Belief updating: The belief updating is a mapping U : A O B B. The belief vecto fo the next time step, upon the selected action and the obsevation, is govened by: b t+1 = { T t b t P if t B t I Bt P if t > B t, whee I a is the M-dimensional unit vecto with 1 in the a-th position. T is a non-linea opeation on a belief vecto b, as follows: { 0 if i < T b(i) = b(i) (2) M if i. j= b(j) Rewad: The immediate ewad eaned at time step t is a mapping R : A O R, which is given by: { B t C( t B t ) if t > B t R(B t, t ) = (3) t if t B t, whee C is the ove-utilization penalty coefficient. The policy π specifies a sequence of functions π 1,..., π T, whee π t is the decision ule and maps a belief vecto b t to an (1)

action at time step t, i.e., π t : B A, t = π t (b t ). The goal is to maximize the total expected discounted ewad in finite hoizon, ove all admissible policies π, given by max J T π (b 0 ) = max π π Eπ [ β t R(B t ; t ) b 0 ], (4) t=0 whee 0 β 1 denotes the discount facto and b 0 is the initial belief vecto. J π T (b 0) is the total expected discounted ewad accumulated ove the hoizon T unde policy π and stating in the initial belief vecto b 0. The optimal policy denoted by π opt is a policy which maximizes (4) and it exists since the numbe of admissible policies ae finite. We may solve this POMDP poblem using Dynamic pogamming (DP), as the following ecusive equations holds: V t (b t ) = max V t (b t ; t ), t (5a) V t (b t ; t ) = R(b t ; t ) + βe{v t+1 (b t+1 ) t }, t T (5b) V T (b T ; T ) = R T (b T ; T ), (5c) whee b t+1 is the updated belief vecto fo the next time step shown in (1). The value functions V t (b t ) is the maximum emaining expected ewad accued stating fom time t when the cuent belief vecto is b t. Note fo all t = 1,..., T, V t (b t ) = max JT π t+1 (b t). V t (b t ; t ) is the emaining expected ewad accued afte time t with choosing action t at time t and following the optimal policy fo time t + 1,..., T with updated belief vecto accoding to the action t. V t (b t ; t ) is the summation of two tems: (i) the immediate expected ewad, given by taking expectation of (3): R(b t ; t ) = b t (i)r(i, t ) = t M i M t b t (i) + t 1 b t (i)[(1 + C)i C t ], (6) and (ii) the discounted futue expected ewad which can be computed as follows: V f t (b t ; t ) = E{V t+1 (b t+1 ) t } = b t (i)v t+1 (T t b t P ) + t t 1 b t (i)v t+1 (I i P ). (7) A policy π is optimal if and only if fo t = 1,..., T, t = π t (b t ) achieves the maximum in (5a), denoted by: opt t (b) = ag max V t(b; ). (8) A We pesent uppe and lowe bounds on the optimal actions in two theoems given in the next section using some lemmas. All the poofs ae given in the appendices. Fo some of ou esults we need the following assumptions. Assumption 1. The P matix satisfies the State-Independent State Change (SISC) popety. Assumption 2. The P matix satisfies the State-Independent State Change (SISC) popety with edge effects, defined below. By SISC popety, we mean that P i,i+k = p j,j+k. Theefoe, we can define ou tansition matix P by indicating the pobability of moving k step highe, p k, independent of which state we ae, such that k < 0 coesponds to moving k steps lowe. By edge effect, we mean that the tansition matix will be affected by the limits (edges) of the state set, since the state set M is limited fom both sides. Theefoe, the elements of the P matix fo ou desied SISC pocess ae equal to P i,j = p j i, fo all i, j except the following: P 1,j = P 1,j+1 + P 2,j+1, j M 1 (9) P M,j = P M,j 1 + P M 1,j 1, j 2 (10) whee (9) and (10) eflect the lowe and uppe edge effects of M, espectively. Note that fo simplicity of notations, fom now on we dop the subscipts of t fom the belief vectos and the actions. IV. MAIN RESULTS The main esult of ou pape is stated in the following two theoems. Theoem 1. The optimal action is bounded by an action fom below, denoted by lb, given by lb = min{ M : b(i) 1 }. (11) 1 + C This lowe bound is equal to the myopic action which at each time step selects the action maximizing the immediate expected ewad and ignoes its impact on the futue ewad. The myopic action, fo poblem (4), unde belief vecto b is given by: myopic (b) = ag max R(b; ). (12) M Theoem 2. Unde Assumption 1 o 2, the optimal action is bounded fom above by an action, denoted by ub, which is a function of C, β, and belief vecto b as follows: ub = min{ M : f(β) S + [(1 + C) f(β)]s C 0}, (13) whee fo simplicity we define the following notations: S h +1 b(i), S h +1 ib(i), 1 f(β) β. (14) Note that fom the above inequalities, l lb = myopic opt ub h, whee l and h ae the lowest and the highest states with non-zeo beliefs, espectively.

V. ANALYSES To pove Theoem 1, we need the following lemmas. Lemma 1. The expected immediate ewad is a uni-modal function of the action, which is inceasing when b(i) < 1 1+C and it is deceasing othewise. Theefoe, the myopic action which maximizes the expected immediate ewad (given in (12)) is equal to the lowest action satisfying the inequality b(i) 1 1+C. Inteestingly, the myopic action has a pecentile theshold stuctue, i.e. it coesponds to the lowest state above a given pecentile of the beliefs. Lemma 2. The emaining expected discounted ewad in the action, V t (b; ), and the value function, V t (b), ae convex with espect to the belief vecto b, i.e. V t (b; ) λv t (b 1 ; ) + (1 λ)v t (b 2 ; ), M, V t (b) λv t (b 1 ) + (1 λ)v t (b 2 ), 0 λ 1. (15) Lemma 3. The futue expected ewad, V f t (b; ) defined in (7), is monotonically inceasing in the action, i.e., V f t (b; 1 ) V f t (b; 2 ) 0, 1 2. (16) Now let define an odeing fo the belief vectos and deive a key popety of value functions, thei monotonicity with espect to an odeing of the belief vectos. Then we state some lemmas which ae needed to pove Theoem 2. Definition 1. (Fist Ode Stochastically Dominance, [11]) Let b 1, b 2 B be any two belief vectos. Then b 1 fist ode stochastically dominates b 2 (o b 1 is FOSD geate than b 2 ), denoted as b 1 s b 2, if b 1 (j) b 2 (j), {1,..., M}. (17) j= j= Lemma 4. Unde Assumption 1 o 2, the value function is a FOSD-inceasing function of the belief vecto. i.e., if b 1 s b 2, then V (b 1 ) V (b 2 ). Now let conside two belief vectos b and b α such that b α is shifted vesion of b by α steps, i.e. b α (i) = b(i + α). Because of similaity in thei pobability distibution shapes, they have some common popeties, given in the following lemmas. Lemma 5. The immediate expected ewad and the myopic action follow the shifting popety: R(b α ; ) = R(b; α) + α, (18) myopic (b α ) = myopic (b) + α. (19) Lemma 6. Unde Assumption 1, the value functions and the optimal policies of the belief vecto b and its shifted vesion, b α, have the following popeties: V t (b α t+1 ; ) V t (b; α) = α, (20) opt t (b α ) = opt t (b) + α, (21) V t (b α t ) = V t (b) + α. (22) Note fo β = 1, we need to substitute 1 βx 1 β by x. Lemma 7. Unde Assumption 2, the following elation between the value functions of b and b α holds: V t (b α t ) V t (b) + α. (23) The poofs of the above lemmas ae given in Appendix A, followed by the poofs of Theoems 1 and 2 in Appendix B. VI. SIMULATION We pesent some simulation esults to conside the uppe and the lowe bounds achieved in the pevious section. The simulation paametes, except in the figues that thei effect is consideed, ae fixed as follows: the numbe of states M = 10, the ove-utilization penalty coefficient C = 5, the discount facto β = 0.8, and the tansition pobabilities p 1 = p 1 = 0.3, p 0 = 0.4. Due to computational complexity it is had to find the optimal policy. Instead, we could conside the Extended Myopic (EM) policy with futue window of size τ as an appoximation fo the optimal policy. This policy, at each time step t, solves the dynamic pogamming fo shot hoizon of t, t + 1,..., t + τ. Hee we use τ = 4 fo ou simulations. Note that we could limit ou dynamic pogamming seaches between the uppe and the lowe bound to speed up the simulations. A. Uppe and Lowe Bound on Optimal Actions Fig. 1 shows an example of a sequence of optimal actions and thei coesponding uppe and lowe bounds. We assume that the selected actions does not exceed B t till time step 14. Note that the stas in the figue indicate the non-zeo beliefs. action 10 8 6 4 2 Myopic action Optimal action uppe bound 2 4 6 8 10 12 14 time steps Fig. 1. Selected actions by optimal policy (EM, τ = 4) and thei coesponding lowe and uppe bounds, fo C = 5, β = 0.8, and tansition of p 1 = p 1 = 0.3, p 0 = 0.4 Next, we conside the effect of β on the gap between the uppe and the lowe bounds of optimal actions in Fig. 2. As expected, this gap inceases with inceasing β. Fig. 3 shows the gap between the uppe and the lowe bounds vesus the vaiance of SISC tansition, defined as E[(B t+1 B t ) 2 ]. This figue confims that by inceasing the vaiance, the gap will incease.

gap 6 5 4 3 2 p 2 =p 2 p 1 =p 1 ewad 24 22 20 18 UB policy Myopic policy EM policy, τ=4 1 16 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 β 14 1 2 3 4 5 6 7 8 9 10 C Fig. 2. C = 5. gap 6 5 4 3 2 1 0 The gap between the lowe and the uppe bounds vesus β, fo 0.5 1 1.5 2 2.5 3 3.5 4 vaiance Fig. 3. The gap between the lowe and the uppe bounds vesus the vaiance of andom walk steps, fo β = 0.8, C = 5. B. Myopic and Uppe-Bound policies Now we compae two sub-optimal policies: (i) the myopic policy, (ii) the uppe-bound (UB) policy, which pick the actions given in (11) and (13), espectively, at all time steps and update thei belief vectos accoding to these actions. Fig. 4 shows the total expected discounted ewad vesus β fo the myopic and the UB policies. Fo smalle β the ewad of the myopic policy is highe than that of UB policy, but fo lage β (close to 1) the UB policy outpefoms the myopic policy. Fig. 5 shows the total expected discounted ewad vesus C fo ewad 10 2 Myopic policy,p 2 =p 2 UB policy,p 2 =p 2 Myopic policy,p 1 =p 1 UB policy,p 1 =p 1 10 1 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 β Fig. 4. The total expected discounted ewad fo two sub-optimal policies vesus β fo C = 5, τ = 4, fo hoizon T = 100. the myopic and the UB policies fo β = 0.8. The diffeence between the ewad of two policies will incease as C gow. VII. CONCLUSION We fomulated a Bayesian congestion contol poblem in which a souce must select the tansmission ate (the action) Fig. 5. The total expected discounted ewad fo two sub-optimal policies vesus C fo β = 0.8, τ = 4, fo hoizon T = 100 and tansition of p 1 = p 1 = 0.3, p 0 = 0.4. ove a netwok whose available bandwidth (esouce) evolves as a stochastic pocess. We modeled the poblem as a POMDP and deived some key popeties fo the myopic and the optimal policies. We poved stuctual esults poviding bounds on the optimal actions, yielding tactable sub-optimal solutions that have been shown via simulations to pefom well. Since the myopic action has a pecentile theshold stuctue on the state beliefs, and we can simplify ou uppe bound to get a loose one with a simila fom, we conjectue that thee may be even bette appoximation fo the optimal policy with the simila pecentile theshold stuctue. ACKNOWLEDGEMENT This wok was suppoted in pat by the U.S. National Science foundation unde ECCS-EARS awads numbeed 1247995 and 1248017 and by the Okawa foundation though an awad to suppot eseach on Netwok Potocols that Lean. REFERENCES [1] R. D. Smallwood and E. J. Sondik, The optimal contol of patially obsevable makov pocesses ove a finite hoizon, Opeations Reseach, vol. 21, no. 5, pp. 1071 1088, 1973. [2] C. H. Papadimitiou and J. N. Tsitsiklis, The complexity of makov decision pocesses, Mathematics of opeations eseach, vol. 12, no. 3, pp. 441 450, 1987. [3] R. Sikant, The mathematics of Intenet congestion contol. Bikhause Boston, 2004. [4] C. Jin, D. X. Wei, and S. H. Low, Fast tcp: motivation, achitectue, algoithms, pefomance, in INFOCOM 2004. Twenty-thid AnnualJoint Confeence of the IEEE Compute and Communications Societies, vol. 4. IEEE, 2004, pp. 2490 2501. [5] V. Jacobson, Congestion avoidance and contol, in ACM SIGCOMM Compute Communication Review, vol. 18, no. 4. ACM, 1988, pp. 314 329. [6] E. Altman, K. Avachenkov, C. Baakat, and P. Dube, Pefomance analysis of aimd mechanisms ove a multi-state makovian path, Compute Netwoks, vol. 47, no. 3, pp. 307 326, 2005. [7] A. Laouine and L. Tong, Betting on gilbet-elliot channels, Wieless Communications, IEEE Tansactions on, vol. 9, no. 2, pp. 723 733, 2010. [8] Y. Wu and B. Kishnamachai, Online leaning to optimize tansmission ove an unknown gilbet-elliott channel, in Modeling and Optimization in Mobile, Ad Hoc and Wieless Netwoks (WiOpt), 2012 10th Intenational Symposium on. IEEE, 2012, pp. 27 32. [9] A. Bensoussan, M. Çakanyıldıım, and S. P. Sethi, A multipeiod newsvendo poblem with patially obseved demand, Mathematics of Opeations Reseach, vol. 32, no. 2, pp. 322 344, 2007.

[10] O. Besbes and A. Muhaemoglu, On implications of demand censoing in the newsvendo poblem, Management Science, Fothcoming, pp. 12 7, 2010. [11] A. Mulle and D. Stoyan, Compaison Methods fo Stochastic Models and Risks. Hoboken, NJ: Wiley, 2002. that λ T b 1 + (1 λ )T b 2 = T b: A = λ T b 1 + (1 λ )T b 2 = λ M b 1(i)T b 1 + (1 λ) M b 2(i)T b 2 M b(i) APPENDIX A Poof of Lemma 1: To pove this lemma, let us compute the deivative of the expected immediate ewad given in (6), R(b; ) = R(b; + 1) R(b; ) = 1 (1 + C) b(i). (24) 1 M b 1 (j) A(j) = M b(i)[λ b 1 (i) M b 1(i) A(j) = 0, + (1 λ) b 2 (j) b 2 (i) M b 2(i) ] = λb 1(j) + (1 λ)b 2 (j) b(j) M b(i) = M b(i), j < j (28) It s easily seen that if the inequality in (11) holds, R(b; ) 0. Othewise it is positive. This concludes that R(b; ) is unimodal with unique maximum at the myopic action given in (11). Poof of Lemma 2: We use induction to pove the convexity of V t (b; ) with espect to the belief vecto, b, fo finite hoizon. Let s assume b is a linea combination of two belief vectos b 1 and b 2, such that: b = λb 1 + (1 λ)b 2, 0 λ 1. (25) Fist at hoizon T, the emaining expected ewad is equal to the expected immediate ewad which is affine linea with espect to the belief vecto and using (6) and (25) we have: R(b; ) = λ R(b 1 ; ) + (1 λ) R(b 2 ; ), (26) which confims the convexity of the emaining expected ewad at hoizon T. Now assuming the convexity holds at t+1, we will conside the time step t. Using (5b) and (7) we have: V t (b; ) λv t (b 1 ; ) (1 λ)v t (b 2 ; ) = [R(b; ) λ R(b 1 ; ) (1 λ) R(b 2 ; )] + β[v t+1 (T bp ) b(i) λv t+1 (T b 1 P ) b 1 (i) (1 λ)v t+1 (T b 2 P ) = β b 2 (i)] b(i)[v t+1 (T bp ) λ V t+1 (T b 1 P ) (1 λ )V t+1 (T b 2 P )] (27a) (27b) whee λ M = λ b1(i) M. Note using (26), the tem inside b(i) [.] in (27a) is zeo. Multiplying the tansition matix P is a linea opeation, and with some manipulation, we can show So fom definition of (2), we have A = T b. Now fom convexity at t + 1 the tem inside [.] in (27b) is less than o equal to zeo and we have: V t (b; ) λv t (b 1 ; ) (1 λ)v t (b 2 ; ) 0 (29) which means the convexity of V t (b; ) with espect to b. To pove the convexity of value function, V t (b), with espect to b we use the definition of (5a) and the esult of Lemma 2 to get: V t (b) = max V t (b; ) = V t (b; ) λv t (b 1 ; ) + (1 λ)v t (b 2 ; ) (30a) (30b) λ max V t (b 1 ; 1 ) + (1 λ) max V t (b 2 ; 2 ) 1 2 (30c) = λv t (b 1 ) + (1 λ)v t (b 2 ), (30d) whee = ag max V t (b; ). Applying the definition of (5a) one moe time in (30d) completes the poof. Poof of Lemma 3: We can wite (16) as follows: V f t (b; 1 ) V f t (b; 2 ) = = 1 1 = 2 {V f t (b; + 1) V f t (b; )} (31) and fo each tem inside the summation, using (7), we have: V f t (b; ) = V f t (b; + 1) V ( t b; ) = b(i)v t+1 (T +1 bp ) b(i)v t+1 (T bp ) +1 + b()v t+1 (I P ) 0 (32) The inequality (32) is achieved fom the convexity of value function in (15) fo b = T bp, b 1 = T +1 bp, b 2 = I P and M +1 b(i) M b(i) λ = zeo and poof is complete. Poof of Lemma 4:. Theefoe, (31) is geate than o equal to

If b 1 s b 2, i.e., by ecalling Definition 1, b 1 (j) b 2 (j), {1,..., M}, (33) j= j= we can wite b 1 as a linea combination of some belief vectos with coefficients equal to the elements of b 2, as follows: b 1 = j b 2 (i)b i, b i B, b i(j) = 0, j < i. (34) We can obtain (33) using (34), as follows: b 1 (j) = j= j= j= j=i j b 2 (i)b i(j) j b 2 (i)b i(j) b i(j)b 2 (i) b 2 (i), (35a) (35b) (35c) whee (35a) and (35b) come fom switching the indexes and in (35c) we substitute M j=i b i (j) = 1, because b i is a belief vecto. Fo the poof of the othe diection, in a simila way, we can show that if b 1 s b 2, thee ae valid belief vectos b i, i {1,..., M} such that (34) holds. Let ecall the definition of value function. V t (b k ) = max π t:t V t (b k ; π t:t ), k = 1, 2 (36) whee π t:t = [π t, π t+1,..., πt ] is the sequence of policies such that π t is the selected action at each time step t = t,..., T. Then V t (b k ) V t (b k ; π t:t ) fo any policy sequence π t:t. Now we only need to pove that the emaining expected ewad achieved by policy sequence π t:t, fo belief vecto b 1 is highe than b 2, i.e., V t (b 1 ; π t:t ) V t (b 2 ; π t:t ) (37) Let s name the sequence of the belief vectos with initial vecto of b 1 by b 1,τ, t τ T 1, and the sequence of the belief vectos with initial vecto of b 2 by b 2,τ. Consideing all possible sample paths of [B t, B t+1,..., B T ], defined as B t:t, the total emaining expected ewads fo k = 1, 2 will be as follows: V t (b k ; π t:t ) = E Bt:T [ β τ t R(B τ ; π τ ) b k, π t:t ]. (38) Theefoe, we can wite V t (b k ; π t:t ) as follows: V t (b k ; π t:t ) = {b k (i) E Bt:T [ β τ t R(B τ ; π τ ) B t = i]} (39) So using (39) and substituting (34) we have: whee, V t (b 1 ; π t:t ) V t (b 2 ; π t:t ) = {b 2 (i) [ b i(j)e Bt:T [ β τ t R(B τ ; π τ ) B t = j] j=i E Bt:T [ β τ t R(B τ ; π τ ) B t = i]]} = b 2 (i)[ b i(j) E j,i ], j=i E j,i = E Bt:T [ β τ t R(B τ ; π τ ) B t = j] (40) E Bt:T [ β τ t R(B τ ; π τ ) B t = i]]. (41) To pove (37), we only need to show E j,i 0. Without loss of geneality, we can assume i = j 1 = l. The esult will be easily extendable to j > i + 1, by using E j,i = j 1 l=i E l+1,l. Fo any sample path B t:t stating fom l (call it Bt:T l ), thee is a coesponding sample path B l+1 t:t stating fom l + 1, such that, Bτ l+1 = Bτ l + 1. So we can have the expectation ove only all possible Bt:T l and wite (41) as follows: [ β τ t R(Bτ l+1 ; π τ )] E B l t:t [ β τ t R(Bτ l ; π τ )] E B l+1 t:t = E B l t:t [ β τ t {R(Bτ l + 1; π τ ) R(Bτ l ; π τ )}] (42) Now if we show T βτ t {R(Bτ l +1; π τ )] R(Bτ l ; π τ )} 0 fo any sample path Bt:T l, (42), theefoe (40) ae geate than o equal zeo and (37) holds. We could use induction to pove this inequality. Fo the base case at hoizon T, it holds using the fact that (3) is monotonically inceasing with espect to B t. Now assuming it is tue fo time steps t + 1,..., T, we need to show it fo time step t. Now let s define the fist time that the selected action will exceed the Bτ l, with µ, i.e., µ = min{t τ T : π τ > B l τ } (43) Thee ae thee diffeent possible cases to happen:

Case I: π τ B l τ, τ T Case II: π µ > B l µ + 1 = B l+1 µ Case III: π µ = B l µ + 1 Let s conside Case I, fist. In this case, the actions ae always lowe than Bτ l. Theefoe by (3), both ewads fo Bt:T l and Bt:T l + 1 will be equal to the selected action. So, β τ t {R(Bτ l + 1; π τ ) R(Bτ l ; π τ )} = β τ t π τ 0 (44) In Case II, the ewads accumulated befoe time µ ae equal fo both beliefs. Theefoe, µ 1 β τ t {R(Bτ l + 1; π τ ) R(Bτ l ; π τ )} = 0 (45) The ewads eaned at time µ, by (3) is equal to: R(B l µ + 1; π µ ) R(B l µ; π µ ) = 1 + C (46) The expected value of (46) is also equal to 1+C. Afte time µ, both beliefs will be estated to I B l µ and I B l µ +1, and fom the coectness of (42) at time step µ as the induction assumption, the emaining expected ewad elated to B l+1 will be highe than the one elated to B l. E B l µ+1:t [ β τ t {R(Bτ l + 1; π τ ) R(Bτ l ; π τ )}] τ=µ+1 = β µ+1 t [V µ+1 (I B l τ +1P ) V µ+1 (I B l τ P )] 0 (47) Theefoe, (42) which is the summation of thee tems in (45), (46) multiplied by β µ t, and (47), will be geate than o equal to zeo. Fo Case III, at µ the action will exceed B l µ and the belief will be estated to I B l µ P, but the othe sequence s belief will change as b µ+1 = T πµ b µ P. The diffeence between ewads accumulated befoe time µ and at time µ ae equal to 0 and C + 1, espectively, simila to (45) and (46). Now fo times afte µ we have: E B l µ+1:t [ τ=µ+1 β τ t {R(B l τ + 1; π τ ) R(B l τ ; π τ )}] = β µ+1 t [V µ+1 (T πµ b µ P ) V µ+1 (I B l τ P )] 0 (48) whee the last inequality comes fom the coectness of (37) at time µ + 1 as the assumption of the induction, consideing the new belief T πµ b µ and I B l τ P. Note B l µ is the lowest index of non-zeo beliefs in T πµ b µ. Theefoe, (42) is geate than o equal to zeo, so ae (41) and (40), and thus (37) is tue fo time t and the poof is complete. Now assume π t:t = ag max π t:t V t (b 2 ; π t:t ) is the optimal policy sequence fo the initial belief vecto b 2 at time t. We have: V t (b 1 ) V t (b 1 ; π t:t ) V t (b 2 ; π t:t ) = V t (b 2 ), (49) and this completes the poof of Lemma 4. Poof of Lemma 5: To pove the shifting popety of immediate expected ewad we have: R(b α ; ) = 1 b α (i) + b α (i)[(1 + C)i C] = ( + α) b(i ) i = 1 + b(i )[(1 + C)(i + α) C ] i =1 = R(b; ) + α[ b(i ) + i = 1 i =1 b(i )] = R(b; ) + α, (50a) (50b) (50c) whee, (50b) is achieved by substituting i and with i +α and + α, espectively. Also we aleady know that b α (i + α) = b(i ). Fo the myopic actions, using (50c) we have: myopic (b α ) = ag max R(b α ; ) = ag max{ R(b; α) + α} = myopic (b) + α. (51) Poof of Lemma 6: To pove (20), we use induction. Fo hoizon T, it s simila to (18) with t = T. Now assuming it s tue fo t + 1, we will conside time step t: V t (b α ; ) = R(b α ; ) + β[ b α (i)v t+1 (T b α P ) b α (i)v t+1 (I i P )] 1 + = R(b; ) + α + β[ α 1 + i =1 i = α b(i )V t+1 (I i +αp )] b(i )V t+1 (T b α P ) (52a) In (52a), we use (18) and the shifting popety of belief vectos and substitute i α with i. Using (20) at t + 1 we have: V t (b α ; ) = R(b; ) + α + β b(i t 1 )[V t+1 (T α bp ) + α ] i = α α 1 + β b(i t 1 )[V t+1 (I i P ) + α ] (53a) i =1 = V t (b; α) + α + αβ b(i ) t 1 = V t (b; α) + α t (53b)

To pove (21), using (20), we have: opt t (b α ) = ag max V t (b α ; ) t = ag max{v t (b; α) + α } = ag max V t (b; α) = ag max V t (b; ) + α = opt t (b) + α. (54) Using (20), (22) and maximizing ove actions, we obtain (22). Poof of Lemma 7: Conside a belief vecto b u in the unlimited state set and its coespond b l in the limited state set M. If b u has nonzeo elements above M, i.e., the uppe edge of M, b l (M) = i M bu (i), and b l (j) = b u (j), j < M. Theefoe facing the uppe edge effect esults in b l s b u. Similaly, is b u has non-zeo elements below 1, i.e., the lowe edge of M, b l (1) = i 1 bu (i), and b l (j) = b u (j), j > 1. Thus facing with the lowe edge effect esults in b l s b u. Now using Lemma 6, if the belief vecto b t and its shifted vesion b α t, α > 0 ae in the unlimited state set, between thei value functions the equation (22) holds. Lets b τ and b α τ denote the sequence of belief vectos constucted by updating b t and b α t, espectively, accoding to thei optimal actions at all time steps τ = t + 1,..., T. Then b α τ, τ = t + 1,..., T ae shifted vesion of b τ. Now with limiting the state set,thee ae some possibilities. If none of the b τ, b α τ, τ = t + 1,..., T face with the edge effects and (22) will still hold. Othewise, b α τ will face with the uppe edge effect soone than b τ, o b τ will face with the lowe edge effect soone than b α τ. In both cases, combining the above agument about the FOSD odeings, the esults of Lemma 4 and (22), we can conclude the inequality (23). APPENDIX B Poof of Theoem 1: The immediate expected ewad and the futue expected ewad achieved by the action < myopic ae less than those achieved by myopic, as esults of Lemma 1 and Lemma 3, espectively. Combining them as (5b), we get V t (b; ) V t (b; myopic ) which means cannot be optimal and thus opt myopic. Poof of Theoem 2: To achieve the uppe bound on the optimal action we will show that: V t (b; ) = R(b; ) + β V f t (b; ) 0, ub. (55) By ecalling (24) we know R(b; ) = 1 (1 + C) b(i) 0, myopic. (56) Moeove, using (32), V f t (b; ) = +1 b(i)[v t+1 (T +1 bp ) V t+1 (T bp )] + b()[v t+1 (I P ) V t+1 (T bp )], (57) and accoding to lemma 3 we have V f t (b; ) 0 fo all. Hence (55) is equivalent to: β V f t (b; ) R(b; ), ub. (58) To this end, it is sufficient to find an uppe bound fo β V f t (b; ) such that it is less than R(b; ) fo all geate than a specific action, i.e., ou desied uppe bound, ub. Applying Lemma 4 one bp s I lp, we have: V t (bp ) V t (I lp ), (59) Now by above inequality and the fact that is the lowest nonzeo index of the belief vecto T b, the second tem in (57) is less than zeo and we can bound V f t (b; ) as follows: V f t (b; ) +1 b(i)[v t+1 (T +1 bp ) V t+1 (T bp )]. (60) Fom the convexity of value function poved in lemma 2 and (59) we get: V t+1 (T +1 bp ) V t+1 (T bp ) h +1 h +1 b (i)v t+1 (I i P ) V t+1 (I P ) {b (i)[v t+1 (I i P ) V t+1 (I P )]}, (61) whee b = T +1 b. Now fom (23) and substituting b = I i P and α = j i, it follows that: V t+1 (T +1 bp ) V t+1 (T bp ) = = h +1 {b (i) (i )( 1 ) } h 1 [ +1 ib (i) ] 1 [ b ], (62) h +1 ib(i) h +1 b(i) whee b = h +1 ib (i) = = S S. Accodingly, we may bound (60) using (62) as follows: V f t (b; ) S T 1 [ S S ]. (63)

To obtain (58) using (63) it suffices to make the following inequality satisfied: T 1 βs [ S ] (1 + C) b(i) 1 S = C (1 + C)S, ub. (64) Applying the definition of f(β) in (14) and doing some staightfowad manipulations, we can get: f(β) S + [(1 + C) f(β)]s C 0, ub. (65) ub is the smallest satisfying (65), as mentioned in (13), and the poof is complete.