Learning Automata in Games with Memory with Application to Circuit-Switched Routing


Murat Alanyali

Abstract

A general setting is considered in which autonomous users interact by means of a finite-state controlled Markov process. This process is driven by the collective actions of all users, and individual users receive separate rewards according to its state. It is assumed that each user chooses its actions via a reinforcement learning algorithm based on its local information. The dynamic behavior of user strategies is characterized for small values of a step-size parameter adopted in learning. The general form of equilibria is obtained and is shown to be analogous to Wardrop equilibria if users update their strategies on a faster time-scale compared to the underlying process. The results are illustrated in the context of routing in circuit-switched communication networks.

I. INTRODUCTION

We consider a dynamic setting in which a collection of users interact by means of an underlying Markov process that is controlled by the collective actions of all users. The state space of this process is assumed to be finite but it is arbitrary otherwise. The process may, for example, represent a suitable state descriptor for a shared resource. Individual users receive possibly different rewards depending on the state of this process. The users are not required to be aware of the underlying process, or the presence of other users. In particular it is assumed that each user autonomously follows a dynamic randomized strategy that amounts to reinforcement learning based on its local information, which is the history of its own actions and own rewards. The outlined scenario is motivated by distributed models of routing and flow control in communication networks, where an individual network user may gain by deviating from socially oriented principles.

The framework adopted in this paper differs from that of classical game theory in two aspects.
Namely, it is assumed here that rewards are determined by the whole history of actions due to the Markovian nature of the underlying process, and that users operate with limited information about the nature of their interaction. In view of the latter assumption we restrict attention to user strategies that involve estimating consequences of available actions, and model individual user behavior by reinforcement learning, which arises as a plausible model in a variety of related contexts [5], [11].

The main contribution of the paper is an approximation of user strategies that is asymptotically exact for small values of a step-size parameter adopted in learning. In particular the transient behavior of user strategies is approximated via a system of ordinary differential equations. The limit dynamics admits a more intuitive interpretation when users update their strategies on a fast time scale. In particular it is shown that equilibria of the limit dynamics are then analogous to Wardrop equilibria [14] that arise in the context of transportation networks.

Interpretation of the adopted model and the obtained results is illustrated on an adaptation of a distributed circuit-switched network architecture due to Kelly [7]. We show that when autonomous source-destination pairs employ reinforcement learning to adaptively route incoming calls, the limit dynamics locally maximizes a social welfare function, namely the rate of network-wide revenue generation due to admitted calls.

The outline of the paper is as follows. Section II describes the general model, and the main results are given in Section III. Application to adaptive routing in circuit-switched networks is discussed in Section IV. The paper concludes with final remarks in Section V.

This work was supported by the NSF CAREER Program under grant ANI. Department of Electrical and Computer Engineering, and The Center for Information and Systems Engineering, Boston University, 8 Saint Mary's Street, Boston, MA 02215, USA. alanyali@bu.edu

II. MODEL

Consider a set $U$ of users that share a collection of resources. We refer to the totality of resources as the network. Suppose that each user has a finite number of alternative ways to access the network. We refer to each alternative way as an action and denote by $A(u)$ the set of actions available to user $u \in U$. A $U$-tuple $(\alpha(u) : u \in U)$ that specifies one action $\alpha(u) \in A(u)$ from the action set of each user $u$ is called an action profile. Let $A = \prod_{u \in U} A(u)$ denote the set of action profiles.

Consider further a dynamic situation in which the state of the network evolves according to the actions taken by the users. Let $\Xi$ denote the state space of the network and let $X_t \in \Xi$ denote the state of the network at time $t \ge 0$. We assume that $\Xi$ is finite but arbitrary otherwise, and that the network process $(X_t : t \ge 0)$ is a controlled Markov process. To provide a complete specification of $(X_t : t \ge 0)$ let $\alpha_t(u)$ denote the action exercised by user $u \in U$ at time $t$ and let $\alpha_t = (\alpha_t(u) : u \in U)$. For each action profile $\alpha \in A$ there is a generator matrix $G(\alpha) = [q_{\xi,\xi'}(\alpha)]_{\Xi \times \Xi}$ such that

$$\lim_{h \to 0} \frac{1}{h} P\big(X_{t+h} = \xi' \mid X_t = \xi,\ \alpha_t = \alpha\big) = q_{\xi,\xi'}(\alpha), \qquad \xi' \ne \xi.$$

We shall assume that $G(\alpha)$ is irreducible for each $\alpha \in A$.

Suppose that each user receives a random reward whose distribution depends on the network state and the action exercised by that user. More precisely, for each user $u \in U$ and $t \ge 0$ let $\theta_t(u)$ denote the total reward received by user $u$ over the time interval $[0, t)$. Let $g_u^+, g_u^- : \Xi \times A(u) \to \mathbb{R}_+$ be positive valued functions and set $g_u(\xi, a) = g_u^+(\xi, a) - g_u^-(\xi, a)$ for $\xi \in \Xi$, $a \in A(u)$. We shall assume that for each $\xi \in \Xi$, $a \in A(u)$ and integer $k$,

$$\lim_{h \to 0} \frac{1}{h} P\big(\theta_{t+h}(u) = k + j \mid \theta_t(u) = k,\ X_t = \xi,\ \alpha_t(u) = a\big) = \begin{cases} g_u^+(\xi, a) & \text{if } j = 1 \\ g_u^-(\xi, a) & \text{if } j = -1 \\ 0 & \text{if } j \notin \{-1, 0, 1\}. \end{cases}$$

In particular $g_u(\xi, a)$ is the mean reward received by user $u$ per unit of time during which the network is at state $\xi$ and the user exercises action $a$.

We concentrate here on the case when users follow randomized strategies in exercising their actions. Some definitions are needed before giving a formal description of user strategies. For each $u \in U$ let $S_u$ be the space of probability vectors on the set $A(u)$. Set $S = \prod_{u \in U} S_u$, so that

$$S = \Big\{ (p(u,a) : u \in U,\ a \in A(u)) : p(u,a) \ge 0 \text{ and } \sum_{a \in A(u)} p(u,a) = 1 \text{ for } u \in U \Big\}.$$

For $p \in S$ let $p(u)$ denote the probability vector $(p(u,a) : a \in A(u)) \in S_u$.

Suppose that each user $u \in U$ maintains a data structure $(z(u), Q(u))$ where $z(u) = (z(u,a) : a \in A(u)) \in \mathbb{R}^{A(u)}$ and $Q(u) = (Q(u,a) : a \in A(u)) \in S_u$. This data structure is dynamically maintained by updating at discrete time instants. The update procedure is made precise in the following paragraphs. At each update instant the user also selects an action randomly with respect to the distribution $Q(u)$, and it exercises this action until the next update instant. Hence the data structure $Q(u)$ bears a randomized action policy for user $u$. The data structure $z(u)$ will be seen to represent estimates of mean rewards due to different actions. Let $(z_t(u), Q_t(u))$ denote the content of $(z(u), Q(u))$ at time $t \ge 0$.

Each user $u \in U$ updates its data structure at the jump instants of a Poisson process with rate $\gamma(u) > 0$. The Poisson clocks of different users are mutually independent; so users operate asynchronously. Let $t^u_k$ denote the time of the $k$th update of user $u$ with $t^u_0 = 0$. For each integer $k \ge 1$ let $I_k(u) = (I_k(u,a) : a \in A(u))$ be a binary random vector with exactly one nonzero entry.
Given $Q_{t^u_{k-1}}(u)$ the vector $I_k(u)$ is drawn independently of the prior history before time $t^u_{k-1}$, so that

$$P\big(I_k(u,a) = 1 \mid Q_{t^u_{k-1}}(u)\big) = Q_{t^u_{k-1}}(u,a), \qquad a \in A(u).$$

The vector $I_k(u)$ identifies the action exercised by user $u$ over the interval $[t^u_k, t^u_{k+1})$, that is, $I_k(u,a) = 1$ if and only if $\alpha_t(u) = a$ for $t \in [t^u_k, t^u_{k+1})$. It is assumed that an arbitrary action is exercised in $[0, t^u_1)$ and for completeness let $I_0(u)$ identify that action.

For $t \ge 0$ let $\Theta_t(u)$ denote the total reward received by user $u$ since its last policy update before time $t$. In particular for $k \ge 0$

$$\Theta_{t^u_{k+1}}(u) = \theta_{t^u_{k+1}}(u) - \theta_{t^u_k}(u).$$

Let $\beta \in (0, 1)$ be a fixed constant. The data structure $z(u)$ is updated at time $t^u_{k+1}$ by setting $z_0(u,a) = 0$ and

$$z_{t^u_{k+1}}(u,a) = z_{t^u_k}(u,a) + \beta \big(\Theta_{t^u_{k+1}}(u) - z_{t^u_k}(u,a)\big) I_k(u,a) \tag{1}$$

for each $a \in A(u)$. Note that $z(u,a)$ changes its value at time $t^u_{k+1}$ only if $a$ is the action exercised over the interval $[t^u_k, t^u_{k+1})$. As $I_k(u,a) = 1$ and the action yields reward $\Theta_{t^u_{k+1}}(u)$ in that case, the update rule is indeed exponential averaging of the associated reward. In turn $z(u,a)$ aims to track the expected reward due to action $a$, which is typically non-stationary owing to the dynamic strategies followed by other users.

Updating the randomized strategy $Q(u)$ involves two entries. For positive integer $k$ let $J_k(u) = (J_k(u,a) : a \in A(u))$ be a random binary vector such that given $Q_{t^u_{k-1}}(u)$, $J_k(u)$ is conditionally independent of, and identically distributed as, $I_k(u)$. Let $n > 0$ be a real parameter common to all users. The probability vector $Q(u)$ is updated at time $t^u_{k+1}$ by setting

$$Q_{t^u_{k+1}}(u,a) = \begin{cases} \hat{Q}_{t^u_{k+1}}(u,a) & \text{if } \hat{Q}_{t^u_{k+1}}(u) \in S_u \\ Q_{t^u_k}(u,a) & \text{otherwise,} \end{cases} \tag{2}$$

where $\hat{Q}(u,a)$ is an auxiliary variable that satisfies

$$\hat{Q}_{t^u_{k+1}}(u,a) = Q_{t^u_k}(u,a) + \frac{1}{n}\big(z_{t^u_k}(u,a) - z_{t^u_k}(u,a')\big) \tag{3}$$

if $I_{k+1}(u,a) = J_{k+1}(u,a') = 1$ or $I_{k+1}(u,a') = J_{k+1}(u,a) = 1$. In words, the vectors $I_{k+1}(u)$ and $J_{k+1}(u)$ identify the two entries of $Q(u)$ that are updated at time $t^u_{k+1}$.

If $a, a' \in A(u)$ are such that $I_{k+1}(u,a) = 1$ and $J_{k+1}(u,a') = 1$ then $Q(u,a)$ is increased by an amount $n^{-1}\big(z_{t^u_k}(u,a) - z_{t^u_k}(u,a')\big)$ and $Q(u,a')$ is decreased by the same amount, provided that all entries of the resulting probability vector are strictly positive. Otherwise $Q(u)$ is not updated. The decision expressed by equality (2) assures that $Q(u)$ represents a probability vector at all times.

The algorithm identified by equalities (1)-(2) amounts to reinforcement learning on the part of each user: If the value of $Q(u,a)$ changes due to either $I_{k+1}(u,a) = 1$ or $J_{k+1}(u,a) = 1$ then the expected change, conditioned on $Q_{t^u_k}(u)$, is given by

$$n^{-1}\Big(z_{t^u_k}(u,a) - \sum_{a' \in A(u)} z_{t^u_k}(u,a')\, Q_{t^u_k}(u,a')\Big).$$

Hence if $z_{t^u_k}(u)$ is a reliable estimate of the expected rewards of individual actions in $A(u)$, then the update rule (3) tends to increase the probabilities of actions that yield larger-than-average expected rewards.

Reliability of the estimates in $(z(u) : u \in U)$ is achieved by choosing the parameter $n > 0$ large. In fact large values of $n$ lead to separation in the time scales of the estimates $(z(u) : u \in U)$ and the policies $(Q(u) : u \in U)$. Namely, each probability distribution in $(Q(u) : u \in U)$ changes by an amount $O(n^{-1})$ per update; therefore the estimates $(z(u) : u \in U)$ are updated many times before the policies $(Q(u) : u \in U)$ change their value significantly, and in turn $(Q(u) : u \in U)$ are updated based on reliable estimates.
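The update rules (1)-(3) for a single user can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names, the list-based representation of $z(u)$ and $Q(u)$, and in particular the toy reward model in `run_single_user` (Gaussian rewards around fixed means, with no underlying network process) are assumptions introduced here.

```python
import random


def sample_action(q):
    """Draw an action index from the probability vector q."""
    return random.choices(range(len(q)), weights=q)[0]


def update_estimate(z, action, reward, beta):
    """Rule (1): exponentially average the reward of the action just exercised."""
    z_new = list(z)
    z_new[action] += beta * (reward - z_new[action])
    return z_new


def update_policy(q, z, i, j, n):
    """Rules (2)-(3): move mass (z[i] - z[j]) / n from entry j to entry i.

    The tentative vector is accepted only if every entry stays strictly
    positive; otherwise the policy is left unchanged.
    """
    if i == j:
        return list(q)
    delta = (z[i] - z[j]) / n
    q_hat = list(q)
    q_hat[i] += delta
    q_hat[j] -= delta
    return q_hat if all(x > 0 for x in q_hat) else list(q)


def run_single_user(mean_rewards, beta=0.1, n=200.0, updates=20000, seed=0):
    """Toy loop for one user; the reward model is hypothetical."""
    random.seed(seed)
    num_actions = len(mean_rewards)
    q = [1.0 / num_actions] * num_actions
    z = [0.0] * num_actions
    action = sample_action(q)
    for _ in range(updates):
        # Assumed stand-in for the reward accumulated since the last update.
        reward = random.gauss(mean_rewards[action], 0.1)
        z_next = update_estimate(z, action, reward, beta)
        # The two indices play the roles of I_{k+1} and J_{k+1}: both are
        # drawn independently from the policy held before this update.
        i, j = sample_action(q), sample_action(q)
        # Rule (3) uses the estimates held before this update.
        q = update_policy(q, z, i, j, n)
        z = z_next
        action = i  # the action identified by I_{k+1} is exercised next
    return q, z
```

Calling `run_single_user([0.0, 1.0])` runs the toy loop with two actions; larger values of `n` slow the policy relative to the estimates, which is the time-scale separation described above.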

The following section quantifies this interplay between the estimates and the randomized policies.

III. MAIN RESULTS

For $t \ge 0$ define the vector-valued variable $Q_t = (Q_t(u) : u \in U)$. We next characterize the time-scaled process $(Q_{nt} : t \ge 0)$ in the large $n$ limit. For $u \in U$ and $a \in A(u)$ define the mapping $T_{u,a} : A \to A$ so that the image $\alpha' = T_{u,a}(\alpha)$ of $\alpha \in A$ satisfies $\alpha'(u) = a$ and $\alpha'(u') = \alpha(u')$ for $u' \ne u$. For $p \in S$ let $\varphi_p$ denote an equilibrium distribution for the Markov process that takes values in $\Xi \times A$, and has generator $\hat{G}(p) = [\hat{q}_{(\xi,\alpha),(\xi',\alpha')}]_{(\Xi \times A) \times (\Xi \times A)}$ with off-diagonal entries

$$\hat{q}_{(\xi,\alpha),(\xi',\alpha')} = \begin{cases} q_{\xi,\xi'}(\alpha) & \text{if } \xi' \ne \xi,\ \alpha' = \alpha \\ \gamma(u)\, p(u,a) & \text{if } \xi' = \xi,\ \alpha' = T_{u,a}(\alpha) \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

Note that $\varphi_p$ is uniquely identified due to the irreducibility of generators $(G(\alpha) : \alpha \in A)$. Let $1\{\cdot\}$ be the indicator function whose value is 1 if its argument is correct, and 0 otherwise. Define the probability distribution $\nu^u_p$ on $\Xi \times A(u)$ by setting

$$\nu^u_p(\xi, a) = \sum_{\alpha \in A} 1\{\alpha(u) = a\}\, \varphi_p(\xi, \alpha), \qquad \xi \in \Xi,\ a \in A(u).$$

Given $\pi^o \in S$, let $(\pi_t : t \ge 0)$ be the trajectory in $S$ such that for each $u \in U$ and $a \in A(u)$

$$\frac{d}{dt} \pi_t(u,a) = 2\Big(\mu_{\pi_t}(u,a) - \pi_t(u,a) \sum_{a' \in A(u)} \mu_{\pi_t}(u,a')\Big), \tag{5}$$

with $\pi_0(u,a) = \pi^o(u,a)$ and

$$\mu_{\pi_t}(u,a) = \sum_{\xi \in \Xi} g_u(\xi, a)\, \nu^u_{\pi_t}(\xi, a).$$

The right hand side of equality (5) is well-defined and Lipschitz continuous, and the trajectory $(\pi_t : t \ge 0)$ is uniquely identified. Let $S$ be endowed with metric $d$ such that $d(p, p') = \max\{|p(u,a) - p'(u,a)| : u \in U,\ a \in A(u)\}$ for $p, p' \in S$. Define $Q^n_t = Q_{nt}$ for $t \ge 0$. The following theorem states that, for large values of $n$, the trajectories of $(Q^n_t : t \ge 0)$ are well-approximated by the solution of (5).

Theorem 3.1: Let $\pi^o \in S$ and let $(\pi_t : t \ge 0)$ solve equations (5). If $\lim_{n \to \infty} Q^n_0 = \pi^o$ in probability, then for each $\varepsilon, T > 0$

$$\lim_{n \to \infty} P\Big(\sup_{0 \le t \le T} d(Q^n_t, \pi_t) > \varepsilon\Big) = 0.$$

It appears difficult to obtain an explicit expression for the solutions of equations (5), or to make general qualitative observations about their dynamic behavior. However the following example suggests the possibility of sophisticated behavior, by illustrating that for certain choices of parameters (5) reduces to the replicator dynamics, which may possess unstable equilibria and limit cycles [4], [12].

Example 3.1: Replicator dynamics. Suppose that the state of the network is identical to the current action profile. That is, $\Xi = A$ and for $\alpha \in A$ the generator matrix $G(\alpha)$ satisfies $q_{\alpha,\xi}(\alpha) = 0$ and $q_{\xi,\alpha}(\alpha) \gg 1$ for $\xi \in \Xi$, $\xi \ne \alpha$, so that, for the purposes of this example, we may take $X_t = \alpha_t$ for $t \ge 0$. Then for $p \in S$, $\xi \in \Xi$, $\alpha \in A$, $u \in U$ and $a \in A(u)$

$$\varphi_p(\xi, \alpha) = 1\{\xi = \alpha\} \prod_{u' \in U} p(u', \alpha(u'))$$

$$\nu^u_p(\xi, a) = p(u,a)\, 1\{\xi(u) = a\} \prod_{u' \in U - \{u\}} p(u', \xi(u'))$$

$$\mu_p(u,a) = p(u,a) \sum_{\alpha \in A : \alpha(u) = a} g_u(\alpha, a) \prod_{u' \in U - \{u\}} p(u', \alpha(u')).$$

Let the mapping $\bar{g}_u : A \to \mathbb{R}$ be such that $\bar{g}_u(\alpha) = g_u(\alpha, \alpha(u))$, $\alpha \in A$. Let $\bar{\mu}_p(u,a)$ be the expected value of $\bar{g}_u$ when user $u$ exercises action $a$ and each user $u' \ne u$ independently randomizes its action according to the probability distribution $p(u')$. That is,

$$\bar{\mu}_p(u,a) = \sum_{\alpha \in A : \alpha(u) = a} \bar{g}_u(\alpha) \prod_{u' \in U - \{u\}} p(u', \alpha(u')).$$

Direct substitution in (5) yields

$$\frac{d}{dt} \pi_t(u,a) = 2\pi_t(u,a) \Big(\bar{\mu}_{\pi_t}(u,a) - \sum_{a' \in A(u)} \pi_t(u,a')\, \bar{\mu}_{\pi_t}(u,a')\Big), \tag{6}$$

which is the replicator dynamics associated with the normal-form game specified by action sets $(A(u) : u \in U)$ and payoff functions $(\bar{g}_u : u \in U)$.

A. Special case: Agile users

We next consider the limit dynamics (5) when the users are agile in the sense that they update their strategies on a faster time-scale than the network process evolves. Namely in this section it is assumed that

$$\gamma(u) \gg \max_{\xi, \xi' \in \Xi,\ \alpha \in A} q_{\xi,\xi'}(\alpha) \quad \text{for all users } u \in U. \tag{7}$$

This assumption leads to a more intuitive interpretation of (5), and it appears suitable when, for example, the network process is comprised of dynamically maintained estimates that are obtained via averaging. Such estimates are employed in typical rate-based control mechanisms in packet data networks, as well as in the circuit-switched network application examined in Section IV.
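Returning briefly to Example 3.1, the replicator dynamics (6) is easy to explore numerically. The following Python sketch integrates (6) for the special case of two users with a forward Euler scheme. The two-user restriction, the payoff-matrix representation (`payoff0[a][b]` and `payoff1[a][b]` stand for $\bar{g}_u$ when user 0 plays `a` and user 1 plays `b`), the function names, and the step size are all assumptions made for illustration.

```python
def replicator_rhs(p0, p1, payoff0, payoff1):
    """Right-hand side of (6) for two users with mixed strategies p0, p1."""
    # Expected payoff of each pure action against the other user's mix.
    mu0 = [sum(payoff0[a][b] * p1[b] for b in range(len(p1)))
           for a in range(len(p0))]
    mu1 = [sum(payoff1[a][b] * p0[a] for a in range(len(p0)))
           for b in range(len(p1))]
    # Average payoff of each user under its own mixed strategy.
    avg0 = sum(p * m for p, m in zip(p0, mu0))
    avg1 = sum(p * m for p, m in zip(p1, mu1))
    d0 = [2.0 * p * (m - avg0) for p, m in zip(p0, mu0)]
    d1 = [2.0 * p * (m - avg1) for p, m in zip(p1, mu1)]
    return d0, d1


def euler_path(p0, p1, payoff0, payoff1, dt=0.01, steps=1000):
    """Forward Euler integration of (6); returns the final strategies."""
    for _ in range(steps):
        d0, d1 = replicator_rhs(p0, p1, payoff0, payoff1)
        p0 = [p + dt * d for p, d in zip(p0, d0)]
        p1 = [p + dt * d for p, d in zip(p1, d1)]
    return p0, p1
```

For a zero-sum game such as matching pennies (`payoff0 = [[1, -1], [-1, 1]]` with `payoff1` its negative) the uniform strategies are a rest point of the right-hand side, while nearby trajectories need not converge to it, which is in line with the remark on unstable equilibria and limit cycles.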

Suppose that condition (7) holds, so that for $p \in S$ the generator matrix $\hat{G}(p)$ defined by equality (4) is nearly decomposable [3]. That is, $\hat{G}(p)$ has a roughly block-diagonal structure in which each diagonal block is associated with a distinct $\xi \in \Xi$, and the entries of non-diagonal blocks are small in magnitude relative to those of the diagonal blocks. For $p \in S$ the probability distribution $\varphi_p$ is then well-approximated by

$$\varphi_p(\xi, \alpha) \approx \nu_p(\xi) \prod_{u \in U} p(u, \alpha(u)), \tag{8}$$

where $\nu_p$ is the unique equilibrium distribution of the Markov process that takes values in $\Xi$ and is generated by $\sum_{\alpha \in A} G(\alpha) \prod_{u \in U} p(u, \alpha(u))$. In other words $\nu_p$ is the equilibrium distribution of the network state in the case when each user $u \in U$ independently randomizes its actions according to the static distribution $p(u)$. The approximation (8) reflects a separation of time scales where one component of the process fluctuates much faster than the other, so that transition probabilities of the slow component are determined by the equilibrium distribution of the fast component. The reader is referred to [3] for a detailed discussion and explicit error bounds that apply to the approximation (8). Here we appeal in particular to [3, Section 2.1] to point out that the approximation error is vanishingly small for large values of $\min\{\gamma(u) : u \in U\}$.

Let $g_u(\nu_p, a)$ denote the expected value of $g_u(\cdot, a)$ with respect to the distribution $\nu_p$. That is,

$$g_u(\nu_p, a) = \sum_{\xi \in \Xi} g_u(\xi, a)\, \nu_p(\xi). \tag{9}$$

Let the trajectory $(\hat{\pi}_t : t \ge 0)$ in $S$ be defined by

$$\frac{d}{dt} \hat{\pi}_t(u,a) = 2\hat{\pi}_t(u,a) \Big(g_u(\nu_{\hat{\pi}_t}, a) - \sum_{a' \in A(u)} \hat{\pi}_t(u,a')\, g_u(\nu_{\hat{\pi}_t}, a')\Big), \tag{10}$$

with $\hat{\pi}_0(u,a) = \pi^o(u,a)$, for $u \in U$, $a \in A(u)$. The approximation (8) leads via direct substitution to

$$\nu^u_p(\xi, a) \approx p(u,a)\, \nu_p(\xi), \qquad \mu_p(u,a) \approx p(u,a) \sum_{\xi \in \Xi} g_u(\xi, a)\, \nu_p(\xi),$$

and equation (5) reduces to equation (10) provided that the approximations above are exact. This observation is formalized in the following proposition.

Proposition 3.1: Given $\varepsilon, T > 0$ there exists $\bar{\gamma} > 0$ such that if $\min_{u \in U} \gamma(u) > \bar{\gamma}$ then

$$\sup_{0 \le t \le T} d(\hat{\pi}_t, \pi_t) \le \varepsilon.$$

Proposition 3.1, together with Theorem 3.1, leads to the following conclusion:

Corollary 3.1: Let $\hat{\pi}^o \in S$ and let $(\hat{\pi}_t : t \ge 0)$ solve equations (10). If $\lim_{n \to \infty} Q^n_0 = \hat{\pi}^o$ in probability, then for each $\varepsilon, T > 0$ there exists $\bar{\gamma} > 0$ such that if $\min_{u \in U} \gamma(u) > \bar{\gamma}$ then

$$\lim_{n \to \infty} P\Big(\sup_{0 \le t \le T} d(Q^n_t, \hat{\pi}_t) > \varepsilon\Big) = 0.$$

In interpreting equations (10) note that $g_u(\nu_{\hat{\pi}_t}, a)$ is the instantaneous rate that user $u$ accumulates rewards per unit of time during which it exercises action $a$ and the network process is in statistical equilibrium under static strategies $(\hat{\pi}_t(u) : u \in U)$. Hence the probability of exercising an action tends to increase if and only if the action has a better-than-average reward rate, in the sense described above, for the user.

Although Corollary 3.1 establishes a rigorous connection between $(Q^n_t : t \ge 0)$ and $(\hat{\pi}_t : t \ge 0)$ only over finite intervals, we continue by examining asymptotic properties of $(\hat{\pi}_t : t \ge 0)$ in order to gain insight on the equilibrium regime of $(Q^n_t : t \ge 0)$ (see footnote 1). The following lemma provides a necessary condition for Lyapunov stability.

Lemma 3.1: If $p \in S$ is Lyapunov stable under (10) then for each $u \in U$

$$g_u(\nu_p, a) = \max_{a' \in A(u)} g_u(\nu_p, a') \quad \text{whenever } p(u,a) > 0. \tag{11}$$

Condition (11) is reminiscent of Wardrop equilibrium [14], which, in informal terms, refers to an assignment of flows to routes on a graph so that all flows between a given pair of nodes experience the same delay and no other route for that pair has smaller delay.

We next appeal to the literature on potential games and adapt [13, Lemma 4.1] to obtain a sufficient condition for stability. The following proposition identifies reward rates $(g_u : u \in U)$ so that the dynamics (10) admits a prescribed Lyapunov function.

Proposition 3.2: Suppose that there exists a continuously differentiable function $V : S \to \mathbb{R}$ such that for each $p \in S$,

$$\frac{\partial V(p)}{\partial p(u,a)} = g_u(\nu_p, a), \qquad u \in U,\ a \in A(u). \tag{12}$$

Then $\frac{d}{dt} V(\hat{\pi}_t) \ge 0$, with equality if and only if $\hat{\pi}_t$ is an equilibrium for (10). Isolated local maxima of $V$ are asymptotically stable, and $p \in S$ is a local maximum of $V$ if it satisfies condition (11).

Footnote 1. A rigorous connection between the equilibrium distribution of $(Q^n_t : t \ge 0)$ and stable equilibria of (5) entails more conservative reflection at the boundary of $S$. This direction is not pursued here in order to keep the exposition simple.

IV. APPLICATION: ADAPTIVE ROUTING

This section aims to illustrate interpretation of the results in a circuit-switched communication network comprised of finite-capacity links (see, for example, [1] for another illustration in the context of packet routing). The set of links is denoted by $L$, and users $U$ of the network are source-destination pairs that provide data traffic. For each user $u \in U$, the action set $A(u)$ denotes a set of alternate routes for user $u$; in particular each $a \in A(u)$ is a subset of $L$. The capacity $\kappa(l)$ of link $l \in L$ denotes the number of channels available at that link.

Suppose that each user $u \in U$ receives a Poisson stream of connection requests at rate $\lambda(u) > 0$ requests per unit time, and each request is to be routed along one of the alternate routes in $A(u)$. A request that arrives at time $t$ is assigned to the alternate route identified by $\alpha_t(u)$. If each link on that route has at least one free channel, then the request is routed along the route; otherwise the request is blocked (hence a request may be blocked even though there are free alternate routes at the time of arrival) and it is lost. Once assigned to a route, a connection remains in the system for the duration of its holding time, during which it simultaneously reserves one channel from each link on its original route. Connection holding times are exponentially distributed with mean 1, independently of the history prior to the arrival time.

The traffic model described above has been rigorously studied in the context of telephone networks. In particular Kelly [6] considers the case when each $A(u)$ is a singleton, in an asymptotic regime where for positive numbers $(\lambda^o(u) : u \in U)$ and $(\kappa^o(l) : l \in L)$

$$\lambda(u) = m\lambda^o(u), \qquad \kappa(l) = m\kappa^o(l), \qquad m \gg 1, \tag{13}$$

and provides an approximation for the connection-blocking probabilities that is asymptotically exact in the large $m$ limit. This approximation is referred to as the reduced load approximation and it will be restated in the following paragraphs.

When alternative routes are present, it is known that intuitively appealing routing rules may lead to instability in the network [8], as well as to instances of Braess' paradox [2]. Reinforcement-learning based decentralized routing schemes were considered for alternate routing in telephone networks by [9], [10]. The empirical evidence therein suggests that employing learning automata leads to equalization of blocking probabilities on alternate routes.
A connection between network-wide optimality and user behavior was established by [7] in terms of certain parameters, referred to as shadow prices, which were envisioned to be computed by the network and then conveyed to the users who execute decentralized hill-descent algorithms based on these parameters. The distributed architecture outlined in this section is also based on shadow prices. The reader may find the review [8] of circuit-switching networks helpful in following the development of ideas in this section.

We first recite two relevant results from the literature of capacity-limited circuit-switching networks, in the notation of the present section. The first result concerns asymptotic exactness of an approximate expression for the connection-blocking probabilities: Let $p \in S$ and suppose for the moment that each user $u \in U$ adopts the static randomized routing policy $p(u)$. That is, each user $u \in U$ routes an arriving request on route $a \in A(u)$ with probability $p(u,a)$, independently of the prior history. The connection-blocking probability on each route can then be determined in the framework of [6] in the following fashion. Define

$$B(\rho, \kappa) = \frac{\rho^\kappa}{\kappa!} \Big(\sum_{i=0}^{\kappa} \frac{\rho^i}{i!}\Big)^{-1}, \qquad \rho > 0,\ \kappa \in \mathbb{Z}_+,$$

and identify a real valued vector $(b_p(l), \rho_p(l) : l \in L)$ that solves the equalities

$$b_p(l) = B(\rho_p(l), \kappa(l))$$

$$\rho_p(l) = \sum_{u \in U} \sum_{a \in A(u)} 1\{l \in a\}\, \lambda(u)\, p(u,a) \prod_{l' \in a - \{l\}} (1 - b_p(l')).$$

Brouwer's fixed point theorem guarantees existence of solutions; furthermore the vector $(b_p(l) : l \in L)$ is unique [6]. The reduced load approximation refers to approximating the connection acceptance probability on route $a$ by the product $\prod_{l \in a} (1 - b_p(l))$. In essence, this approximation is based on the assumption that link loads are statistically independent in equilibrium. Although this assumption is generally inaccurate, the reduced load approximation is asymptotically exact in the large $m$ limit in (13) [6]. For $u \in U$, $a \in A(u)$ set

$$\lambda_p(u,a) = \lambda(u)\, p(u,a) \prod_{l \in a} (1 - b_p(l)),$$

which is interpreted as the rate of accepted requests on route $a$ under the reduced load approximation.
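The reduced load equalities can be solved numerically for small networks. The sketch below is illustrative only: it evaluates $B(\rho, \kappa)$ with the standard Erlang recursion and then applies damped repeated substitution to the fixed-point equations. The data layout (`offered` as a list of `(rate, links)` pairs with `rate` standing for $\lambda(u)p(u,a)$, `capacity` as a dict from link to $\kappa(l)$), the function names, and the damping scheme are assumptions introduced here; repeated substitution is a common heuristic and is not claimed to converge in general.

```python
def erlang_b(rho, kappa):
    """Erlang blocking B(rho, kappa) via the recursion
    B(rho, 0) = 1, B(rho, k) = rho * B(rho, k-1) / (k + rho * B(rho, k-1))."""
    b = 1.0
    for k in range(1, kappa + 1):
        b = rho * b / (k + rho * b)
    return b


def reduced_load(offered, capacity, iters=200, damping=0.5):
    """Damped repeated substitution for the link blocking values b_p(l).

    offered  : list of (rate, links), rate = lambda(u) * p(u, a), links = route a
    capacity : dict mapping each link l to its number of channels kappa(l)
    """
    b = {l: 0.0 for l in capacity}
    for _ in range(iters):
        new_b = {}
        for l in capacity:
            rho = 0.0
            for rate, links in offered:
                if l in links:
                    # Thin the offered rate by blocking on the other links.
                    thin = 1.0
                    for other in links:
                        if other != l:
                            thin *= 1.0 - b[other]
                    rho += rate * thin
            new_b[l] = erlang_b(rho, capacity[l])
        b = {l: damping * new_b[l] + (1.0 - damping) * b[l] for l in capacity}
    return b


def accepted_rate(rate, links, b):
    """lambda_p(u, a): offered rate times the product of (1 - b_p(l)) on the route."""
    out = rate
    for l in links:
        out *= 1.0 - b[l]
    return out
```

For example, `reduced_load([(1.0, {"x", "y"})], {"x": 1, "y": 1})` treats a single route over two unit-capacity links, and `accepted_rate` then gives the corresponding $\lambda_p(u,a)$.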
The second recited result pertains to the sensitivity of the approximate blocking probabilities arising from the reduced load approximation: Let $(w(u,a) : u \in U,\ a \in A(u))$ be arbitrary but fixed nonnegative numbers, and let $(s_p(l) : l \in L)$ solve

$$s_p(l) = h_p(l) \sum_{u \in U} \sum_{a \in A(u)} 1\{l \in a\}\, \lambda(u)\, p(u,a) \Big(w(u,a) - \sum_{l' \in a - \{l\}} s_p(l')\Big), \qquad l \in L,$$

where

$$h_p(l) = \frac{B(\rho_p(l), \kappa(l) - 1) - B(\rho_p(l), \kappa(l))}{1 - b_p(l)}.$$

The quantities $(s_p(l) : l \in L)$ are coined as shadow prices by Kelly [7]. We refer the reader to [7] for an intuitive interpretation of shadow prices. For the purposes of the present discussion it is enough to note that [7, Equation 1.6] translates to

$$\frac{\partial}{\partial p(u,a)} \sum_{u' \in U} \sum_{a' \in A(u')} w(u',a')\, \lambda_p(u',a') = \Big(w(u,a) - \sum_{l \in a} s_p(l)\Big) \lambda(u) \prod_{l \in a} (1 - b_p(l)) \tag{14}$$

for $u \in U$, $a \in A(u)$.

We now turn to adaptive routing and describe the network architecture. Suppose that each link $l \in L$ dynamically estimates the shadow price $(s_{Q^n_t}(l) : t \ge 0)$. See for example [7, Sections 4-5] for a recursive procedure for locally estimating the shadow price at each link, and a discussion of related implementation issues. Let $\hat{S}_t(l)$ denote the estimate of the shadow price $s_{Q^n_t}(l)$ for link $l$ at time $t$. We shall assume that each estimator is consistent, in the sense that it eventually identifies the associated shadow price correctly provided that the strategies of all users are kept constant. That is, for each fixed strategy profile $p \in S$,

$$\sum_{\xi \in \Xi} 1\{\hat{s}(l) = s_p(l)\}\, \nu_p(\xi) = 1, \qquad l \in L. \tag{15}$$

Let $\hat{N}_t(u,a)$ be the number of connections on route $a$ of user $u$ that have been established but not expired by time $t$. Adopt the network process defined by $X_t = (\hat{S}_t(l), \hat{N}_t(u,a) : l \in L,\ u \in U,\ a \in A(u))$ for $t \ge 0$ (see footnote 2).

Suppose that for each connection that user $u$ successfully routes along route $a$, the user charges that connection an amount $w(u,a)$, and it is charged by each link on route $a$ an amount that equals the current estimate of the shadow price of that link. Users do not receive any reward for blocked requests. The instantaneous reward rate of user $u$ due to choosing route $a$ when the network is at state $\xi = (\hat{s}, \hat{n})$, with $\hat{s} = (\hat{s}(l) : l \in L)$ and $\hat{n} = (\hat{n}(u,a) : u \in U,\ a \in A(u))$, is given by

$$g_u(\xi, a) = \Big(w(u,a) - \sum_{l \in a} \hat{s}(l)\Big) \lambda(u)\, \psi(u, a, \xi),$$

where $\psi(u, a, \xi)$ is the conditional probability that each link on route $a$ of user $u$ has at least one free channel given that the network state is $\xi$. In particular $\psi(u, a, \xi) = 1$ if each link on route $a$ has free capacity under route assignment $\hat{n}$, and $0$ otherwise.

To make further progress, appeal to condition (13) to assume that the reduced load approximation is exact, so that the acceptance probability on route $a$ under the static strategy profile $p$ is given by

$$\sum_{\xi \in \Xi} \psi(u, a, \xi)\, \nu_p(\xi) = \prod_{l \in a} (1 - b_p(l)), \qquad p \in S. \tag{16}$$

Equality (9), together with conditions (15)-(16), implies that

$$g_u(\nu_p, a) = \Big(w(u,a) - \sum_{l \in a} s_p(l)\Big) \lambda(u) \prod_{l \in a} (1 - b_p(l));$$

in turn by equality (14) the mapping

$$V(p) = \sum_{u \in U} \sum_{a \in A(u)} w(u,a)\, \lambda_p(u,a), \qquad p \in S$$

satisfies condition (12).
Note that under the reduced load approximation the expression for $V(p)$ denotes the total rate of revenue generation by the users when they adopt the static policies $(p(u) : u \in U)$, and by Proposition 3.2 the dynamics (10) locally maximizes the total revenue generated by the users in the long-term. We appeal to [7, Section 6] to point out that, in the large $m$ limit in (13), local maxima of the mapping $V$ are global maximizers.

Footnote 2. If the estimator of [7] is employed then a number of auxiliary variables should be included in the state descriptor of the network in order that the network process complies with the conditions set forth in Section II. Such variables are omitted here for notational convenience.

V. CONCLUSION

This paper provides a dynamic description of user strategies when each user employs a tractable reinforcement learning algorithm to pursue its own interests subject to local information. The present setting is motivated by large resource sharing systems, such as data networks, where control is distributed to users under the assumption that they are socially responsible. The results of the paper thereby provide a framework to assess robustness of such architectures against selfish behavior of partially informed users. The results may also be interpreted from a control perspective to assess efficiency of autonomous reinforcement learning as a decentralized control mechanism. In this respect condition (12) provides a guideline for engineering the network process so that system dynamics admits a prescribed Lyapunov function.

REFERENCES

[1] M. Alanyali, On non-cooperative interaction via reinforcement learning and its control, 41st Allerton Conference on Communications, Control and Computing, Champaign.
[2] N. G. Bean, F. P. Kelly, and P. G. Taylor, Braess' paradox in a loss network, Journal of Applied Probability, vol. 34.
[3] P. J. Courtois, Decomposability: Queueing and computer system applications, Academic Press, New York.
[4] J. Hofbauer and K. Sigmund, Evolutionary games and replicator dynamics, Cambridge University Press.
[5] E. Hopkins, Two competing models of how people learn in games, Econometrica, vol. 70, no. 6.
[6] F. P. Kelly, Blocking probabilities in large circuit-switched networks, Advances in Applied Probability, vol. 18.
[7] F. P. Kelly, Routing in circuit-switched networks: optimization, shadow prices and decentralization, Advances in Applied Probability, vol. 20.
[8] F. P. Kelly, Loss networks, The Annals of Applied Probability, vol. 1, no. 3.
[9] K. S. Narendra, E. A. Wright, and L. G. Mason, Application of learning automata to telephone traffic routing and control, IEEE Transactions on Systems, Man, and Cybernetics, vol. 7, no. 11.
[10] K. S. Narendra and P. Mars, The use of learning algorithms in telephone traffic routing: a methodology, Automatica, vol. 19, no. 5.
[11] K. S. Narendra and M. A. L. Thathachar, Learning automata: An introduction, Prentice-Hall, New Jersey.
[12] K. Ritzberger and J. W. Weibull, Evolutionary selection in normal-form games, Econometrica, vol. 63, no. 6.
[13] W. B. Sandholm, Potential games with continuous player sets, Journal of Economic Theory, vol. 97.
[14] J. G. Wardrop, Some theoretical aspects of road traffic research, Proceedings of the Institute of Civil Engineering, Part 2, vol. 1.
[15] J. Weibull, Evolutionary game theory, MIT Press, Cambridge.
