Mean Field Competitive Binary MDPs and Structured Solutions
1 Mean Field Competitive Binary MDPs and Structured Solutions. Minyi Huang, School of Mathematics and Statistics, Carleton University, Ottawa, Canada. MFG 2017, UCLA, Aug 28 - Sept 1, 2017
2 Outline of talk: the competitive binary MDP model (Huang and Ma '16, '17); the stationary MFG equation (with discount); model features: monotone state dynamics, resetting action, positive externality; checks: ergodicity, existence, uniqueness, solution structure, pure strategies, stable/unstable equilibria; exploiting positive externality to prove uniqueness; extensions.
3 Motivation and background. Dynamics and cost. Finite horizon. Model: individual controlled state x_t^i ∈ [0, 1], action a_t^i ∈ {0, 1}. Motivation for this class of competitive MDP models: except for LQ cases, general nonlinear MFGs rarely have closed-form solutions. We introduce this class of MDP models, which have a simple solution structure: the threshold policy (see next page). i) There is a long tradition of looking for threshold policies in various decision problems in the operations research community. ii) A threshold policy is also studied in R. Johari's ride-sharing problem. Sometimes simple models tell more; the binary MDP modeling is of interest in its own right.
4 Figure: the threshold policy (action a_t^i vs. state x_t^i); the threshold is 0.6.
5 Dynamic games of MDPs were initially introduced by L. Shapley (1953) and called stochastic games. Literature on large population games with discrete states and/or actions (or discrete-time MDPs): Weintraub et al. (2005, 2008), Huang (2012), Gomes et al. (2013), Adlakha, Johari, and Weintraub (2015), Kolokoltsov and Bensoussan (2016), Saldi, Basar and Raginsky (2017). Also see anonymous games, e.g. Jovanovic and Rosenthal (1988).
6 The MF MDP model: N players (or agents A_i, 1 ≤ i ≤ N) with states x_t^i, t ∈ Z_+, as controlled Markov processes (no coupling in the dynamics). State space: S = [0, 1]; interpret the state as unfitness. Action space: A = {a_0, a_1}; a_0: inaction, a_1: active control. Agents are coupled via the costs (Huang and Ma, CDC '16, CDC '17). Examples of binary action space (action or inaction): maintenance of equipment; network security games; wireless medium access control (channel shared by users; transmission or back-off; collisions possible); (flu) vaccination games, etc.
7 Binary choice models are widely used in various decision problems: Schelling, T. (1973), Hockey helmets, concealed weapons, and daylight saving: a study of binary choices with externalities, J. Conflict Resol.; Brock, W. and Durlauf, S. (2001), Discrete choice with social interactions, Rev. Econ. Studies; Schelling, T. (2006), Micromotives and Macrobehavior; Babichenko, Y. (2013), Best-reply dynamics in large binary-choice anonymous games, Games Econ. Behavior; Lee, L., Li, J. and Lin, X. (2014), Binary choice models with social network under heterogeneous rational expectations, Rev. Econ. Statistics.
8 Dynamics: the controlled transition kernel for x_t^i. For t ≥ 0 and x ∈ S, P(x_{t+1}^i ∈ B | x_t^i = x, a_t^i = a_0) = Q_0(B|x), P(x_{t+1}^i = 0 | x_t^i = x, a_t^i = a_1) = 1, where Q_0(·|x) is a stochastic kernel defined for B ∈ B(S) (Borel sets) with Q_0([x, 1]|x) = 1. Under inaction a_0, the state gets worse. The transition of x_t^i is not affected by the other actions a_t^j, j ≠ i. The stochastic deterioration is similar to hazard rate modelling in the maintenance literature (Bensoussan and Sethi, 2007; Grall et al. 2002).
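A minimal simulation sketch of these dynamics under a threshold policy (introduced on the following slides). The kernel is the concrete example from a later slide, Q_0(·|x) = law of x + (1 − x)ξ with ξ ~ Uniform(0, 1); the threshold value 0.6 is just an illustration.

```python
import random

def q0_sample(x, rng):
    # One draw from Q0(.|x); here the talk's example kernel:
    # x_{t+1} = x + (1 - x)*xi with xi ~ Uniform(0, 1),
    # so Q0([x, 1] | x) = 1 and the state only deteriorates under inaction.
    return x + (1.0 - x) * rng.random()

def simulate(threshold, x0=0.0, steps=50, seed=0):
    """Sample path of x_t^i under a threshold policy: take the
    resetting action a1 when x >= threshold, else inaction a0."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        x = 0.0 if x >= threshold else q0_sample(x, rng)
        path.append(x)
    return path

path = simulate(threshold=0.6)
assert all(0.0 <= v <= 1.0 for v in path)  # the path stays in S = [0, 1]
```

Between resets the path is nondecreasing, matching the monotone-deterioration picture on the next slide.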
9 Figure: a sample path of x_t^i under inaction; the state keeps getting worse.
10 The cost of A_i: J_i = E Σ_{t=0}^T ρ^t c(x_t^i, x_t^{(N)}, a_t^i), 1 ≤ i ≤ N, where 0 < T < ∞ and ρ ∈ (0, 1) is the discount factor. Population average state: x_t^{(N)} = (1/N) Σ_{i=1}^N x_t^i. The one-stage cost: c(x_t^i, x_t^{(N)}, a_t^i) = R(x_t^i, x_t^{(N)}) + γ 1_{a_t^i = a_1}, where R is the unfitness-related cost and γ > 0 is the effort cost.
11 The MFG equation system (MFG Eq_[0,T]): V(t, x) = min[ ρ ∫_0^1 V(t+1, y) Q_0(dy|x) + R(x, z_t), ρ V(t+1, 0) + R(x, z_t) + γ ], 0 ≤ t < T; V(T, x) = R(x, z_T); z_t = E x_t^i, 0 ≤ t ≤ T (consistency condition in MFG). Notation: b_{s,t} = (b_s, b_{s+1}, ..., b_t). Find a solution (ẑ_{0,T}, â_{0,T}^i) such that {x_t^i, 0 ≤ t ≤ T} is generated by {â_t^i(x), 0 ≤ t ≤ T} as the best response. Fact: given z_{0,T}, the best response is a threshold policy. Theorem: i) under technical conditions, (MFG Eq_[0,T]) has a solution; ii) the resulting set of decentralized strategies for the N players is an ϵ-Nash equilibrium. Below we focus on its stationary version.
12 Threshold policy (action space {0, 1}): a_t^i = 1 if x_t^i ≥ r, and a_t^i = 0 if x_t^i < r. Figure: the threshold policy (action a_t^i vs. state x_t^i); r = 0.6.
13 Stationary equation system for the MFG: V(x) = min[ ρ ∫_0^1 V(y) Q_0(dy|x) + R(x, z), ρ V(0) + R(x, z) + γ ], z = ∫_0^1 x π(dx) for a probability measure π. (V, ẑ, π̂, â^i) is called a stationary equilibrium with discount if i) the feedback policy â^i is the best response with respect to ẑ (in the associated infinite horizon control problem); ii) the distribution of {x_t^i, t ≥ 0} under â^i converges to the stationary distribution π̂; iii) (ẑ, π̂) satisfies the second equation. Note: this solution definition is different from that in anonymous games, which only deal with a fixed point equation for a measure (describing the continuum population) without considering ergodicity.
14 What do we want to study? Q1 Existence (based on ergodicity). Q2 Does a threshold policy exist? Q3 Uniqueness. Q4 Under what condition can a pure equilibrium strategy fail to exist? Q5 How do the model parameters affect the solution? (related to comparative statics) Q6 Does the equilibrium have stability, or can this fail? Can the model show oscillatory behaviors (as observed in vaccination situations)?
15 Assumptions: (A1) {x_0^i, i ≥ 1} are independent random variables taking values in S. (A2) R(x, z) is a continuous function on S × S; for each fixed z, R(·, z) is strictly increasing. (A3) i) Q_0(·|x) satisfies Q_0([x, 1]|x) = 1 for any x, and is strictly stochastically increasing; ii) Q_0(·|x) has a positive density for all x < 1. (A4) R(x, ·) is increasing for fixed x (i.e., positive externality). (A5) γ > β ∫_0^1 max_z [R(y, z) − R(0, z)] Q_0(dy|0). Remarks: the monotonicity in (A2) says the cost increases when the state is poorer; (A3)-i) means stochastic dominance of distributions; (A5) says the effort cost should not be too low, which prevents a zero action threshold; this condition can be refined if more information is known on R (such as submodularity).
16 Theorem (existence; Q1, Q2). Assume (A1)-(A5). Then V(x) = min[ ρ ∫_0^1 V(y) Q_0(dy|x) + R(x, z), ρ V(0) + R(x, z) + γ ], z = ∫_0^1 x π(dx), has a solution (V, z, π, a^i) for which the best response a^i is a threshold policy. Note: a^i: x ∈ [0, 1] → {a_0, a_1} is implicitly specified by the first equation: on the right-hand side, if the first term is smaller, then a^i(x) = a_0 (otherwise, a_1).
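The threshold structure of the best response can be seen numerically by value iteration at a fixed mean field z. The sketch below is illustrative only: it discretizes S and uses the example specification from a later slide, R(x, z) = x(c + z) with Q_0(·|x) uniform on [x, 1]; the values of ρ, γ, c are arbitrary demo choices, not taken from the talk's figures.

```python
import numpy as np

def best_response(z, rho=0.95, gamma=0.3, c=1.0, n=201, tol=1e-10):
    """Value iteration for V at fixed mean field z. Returns V on a grid
    and a boolean mask marking where the active branch a1 is strictly
    better (the active set of the best-response policy)."""
    x = np.linspace(0.0, 1.0, n)
    R = x * (c + z)                        # example cost R(x, z) = x(c + z)
    V = np.zeros(n)
    while True:
        # E[V(y)] for y ~ Q0(.|x) = Uniform[x, 1]: suffix averages of V
        suffix = np.cumsum(V[::-1])[::-1] / np.arange(n, 0, -1)
        inact = rho * suffix + R           # branch a0 (inaction)
        act = rho * V[0] + R + gamma       # branch a1 (reset to 0)
        V_new = np.minimum(inact, act)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, act < inact
        V = V_new

V, active = best_response(z=0.5)
# the active region is an upper interval of [0, 1]: a threshold policy
assert np.all(np.diff(active.astype(int)) >= 0)
```

Since V and its suffix averages are nondecreasing in x, the set where the active branch wins is automatically an interval [r, 1], which is exactly the threshold structure in the theorem.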
17 Theorem (uniqueness; Q3). In the previous theorem, if we further assume R(x, z) = R_1(x) R_2(z) and R_2 > 0 is strictly increasing on S, then V(x) = min[ ρ ∫_0^1 V(y) Q_0(dy|x) + R(x, z), ρ V(0) + R(x, z) + γ ], z = ∫_0^1 x π(dx), has a unique solution (V, z, π, a^i). Remark: the monotonicity of R_2 indicates positive externalities (crucial in the uniqueness analysis), i.e., a person benefits from the improving behaviour of the population.
18 How to prove the existence theorem? Show ergodicity of the controlled Markov chain {x_t^i} under an interior threshold policy (so that π in the equation makes sense); the case of threshold 1 is handled separately. Then use Brouwer's fixed point theorem to show existence. How to prove uniqueness? Use two comparison theorems.
19 Theorem (ergodicity; Q1). For a threshold θ ∈ (0, 1), {x_t^{i,θ}, t ≥ 0} is uniformly ergodic with stationary probability distribution π_θ, i.e., sup_{x∈S} ||P^t(x, ·) − π_θ|| ≤ K r^t for some constants K > 0 and r ∈ (0, 1), where ||·|| is the total variation norm of signed measures. Proof: show Doeblin's condition by checking inf_{x∈S} P(x_4 = 0 | x_0 = x) ≥ η > 0, together with aperiodicity. Here 4 is the magic number.
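The minorization bound can be checked empirically. A rough Monte Carlo sketch under the same illustrative kernel x + (1 − x)·Uniform(0, 1), with θ = 0.85 and the convention that a reset lands the chain exactly at 0:

```python
import random

def prob_zero_in_4(x0, theta=0.85, trials=4000, seed=1):
    """Estimate P(x_4 = 0 | x_0 = x0) under the threshold-theta policy.
    Inaction strictly increases the state almost surely, so the chain can
    be at 0 only immediately after a reset; checking x == 0.0 is exact."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = x0
        for _ in range(4):
            x = 0.0 if x >= theta else x + (1.0 - x) * rng.random()
        hits += (x == 0.0)
    return hits / trials

# the estimate stays bounded away from 0 uniformly over starting points,
# consistent with Doeblin's condition
eta_hat = min(prob_zero_in_4(i / 10) for i in range(11))
assert eta_hat > 0.0
```

Fewer than 4 steps would not work uniformly: from some starting points the first reset falls at the wrong time, which is why 4 is "the magic number."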
20 Figure: the regenerative process x_t^i under threshold θ = 0.85.
21 Numerics: recall the MFG solution equation system V(x) = min[ ρ ∫_0^1 V(y) Q_0(dy|x) + R(x, z), ρ V(0) + R(x, z) + γ ], z = ∫_0^1 x π(dx). Algorithm: Step 1. Initialize z^(0). Step 2. Given z^(k), solve V^(k)(x) and the associated threshold policy a^(k) from the DP equation. Step 3. Use a^(k) to find the mean field z̃^(k+1) via π^(k). Step 4. z^(k+1) = 0.98 z^(k) + 0.02 z̃^(k+1) (cautious update!). Step 5. Go back to Step 2 (until the desired accuracy is attained).
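The five steps can be sketched as a runnable loop, under the same illustrative assumptions as before (R(x, z) = x(c + z), Q_0(·|x) uniform on [x, 1]); ρ, γ, c, the grid size and iteration counts are arbitrary demo values, not the ones behind the talk's figures:

```python
import random
import numpy as np

RHO, GAMMA, C, N = 0.95, 0.3, 1.0, 201
GRID = np.linspace(0.0, 1.0, N)

def dp_threshold(z, iters=500):
    """Step 2: value iteration at fixed z; return the threshold of the
    best-response policy (1.0 if the active branch is never better)."""
    R = GRID * (C + z)
    V = np.zeros(N)
    for _ in range(iters):
        suffix = np.cumsum(V[::-1])[::-1] / np.arange(N, 0, -1)
        V = np.minimum(RHO * suffix + R, RHO * V[0] + R + GAMMA)
    suffix = np.cumsum(V[::-1])[::-1] / np.arange(N, 0, -1)
    active = (RHO * V[0] + R + GAMMA) < (RHO * suffix + R)
    return GRID[active][0] if active.any() else 1.0

def mean_field(theta, steps=10000, seed=0):
    """Step 3: long-run average of x_t under the threshold policy,
    estimated from one ergodic sample path."""
    rng, x, total = random.Random(seed), 0.0, 0.0
    for _ in range(steps):
        x = 0.0 if x >= theta else x + (1.0 - x) * rng.random()
        total += x
    return total / steps

z = 0.5                                      # Step 1: initialize z^(0)
for _ in range(50):                          # Steps 2-5, repeated
    theta = dp_threshold(z)
    z = 0.98 * z + 0.02 * mean_field(theta)  # Step 4: cautious update
```

The damping factor 0.02 makes progress slow but keeps the iteration from overshooting, which matters in the unstable regime discussed on a later slide.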
22 Figure: left, V(x) defined on [0, 1] (threshold 0.49); right, search of the threshold over the iterates.
23 The comparison theorems use very intuitive ideas and are used to answer Q3. First comparison theorem (with R(x, z) = R_1(x) R_2(z)): if z decreases (i.e., decreasing population threat), the best response's threshold increases. Second comparison theorem: if the threshold θ increases within (0, 1), the resulting regenerative process x_t^{i,θ}'s long term mean increases (since there is more chance to grow; think of a lawn under a biweekly cut).
24 Method to prove the second comparison theorem: take the threshold θ ∈ (0, 1). Set x_0^{i,θ} = 0, and let the Markov chain x_t^{i,θ} evolve under inaction until τ_θ := inf{t | x_t^{i,θ} ≥ θ}. Reset x_{τ_θ+1}^{i,θ} = 0, and the Markov chain restarts at τ_θ + 1. Denote S_k = Σ_{t=0}^k x_t^{i,θ}, and let S'_k be defined similarly with θ' ∈ (θ, 1). Then λ_θ := lim_{k→∞} (1/k) Σ_{t=0}^{k−1} x_t^{i,θ} = E S_{τ_θ} / (1 + E τ_θ); similarly define λ_{θ'}. The second comparison theorem: for 0 < θ < θ' < 1, i) E S_{τ_θ} / (1 + E τ_θ) ≤ E S'_{τ_{θ'}} / (1 + E τ_{θ'}); ii) so λ_θ ≤ λ_{θ'}.
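The cycle quantities E S_{τ_θ} and E τ_θ are easy to estimate by Monte Carlo, which gives an empirical check of the comparison; a sketch under the same illustrative kernel x + (1 − x)·Uniform(0, 1), with arbitrary thresholds 0.5 and 0.85:

```python
import random

def cycle_stats(theta, cycles=20000, seed=2):
    """Estimate (E S_tau, E tau) over one regenerative cycle: start at
    x_0 = 0, run inaction until tau = inf{t : x_t >= theta}, and
    accumulate S_tau = x_0 + x_1 + ... + x_tau."""
    rng = random.Random(seed)
    s_total = tau_total = 0.0
    for _ in range(cycles):
        x, s, t = 0.0, 0.0, 0
        while x < theta:
            s += x
            x += (1.0 - x) * rng.random()
            t += 1
        s += x                      # x_tau itself (>= theta) is included
        s_total += s
        tau_total += t
    return s_total / cycles, tau_total / cycles

es_lo, etau_lo = cycle_stats(0.5)
es_hi, etau_hi = cycle_stats(0.85)
lam_lo = es_lo / (1.0 + etau_lo)
lam_hi = es_hi / (1.0 + etau_hi)
assert lam_lo < lam_hi   # raising the threshold raises the long-run mean
```

This matches the lawn intuition: a higher cutting threshold lets each cycle spend more time at larger states, raising the ratio E S_τ / (1 + E τ).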
25 A useful lemma for proving the second comparison theorem. Let 0 < r < r' < 1. Consider a Markov process {Y_t, t ≥ 0} with state space [0, 1] and transition kernel Q_Y(·|y) which satisfies Q_Y([y, 1]|y) = 1 for any y ∈ [0, 1] and is stochastically monotone. Suppose Y_0 = y_0 < r. Define the stopping time τ = inf{t | Y_t ≥ r}. Lemma: if Eτ < ∞, then E Σ_{t=0}^τ Y_t < ∞ and E Σ_{t=0}^τ Y_t / (1 + Eτ) = [E Y_0 + E Y_1 + Σ_{k=1}^∞ E(Y_{k+1} 1_{Y_k < r})] / [2 + Σ_{k=1}^∞ P(Y_k < r)]. Facts: apply the lemma to the regenerative process x_t^{i,θ} for one cycle; show that increasing r = θ to r = θ' increases the ratio in the lemma.
26 Example (to answer Q4): assume R(x, z) = x(c + z) where c > 0, and Q_0(·|x) has the same distribution as x + (1 − x)ξ where ξ has the uniform distribution on [0, 1]. Consider all γ > 0. Denote c∗ = Eξ/2 = 1/4. Then: if γ ∈ (0, ρc/2] ∪ (ρ(c + c∗)/2, ∞), there exists a unique pure strategy as a stationary equilibrium with discount; if γ ∈ (ρc/2, ρ(c + c∗)/2], there exists no pure strategy as a stationary equilibrium with discount. This result can be extended to the general model for R(x, z) and to the long run average cost case.
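For concreteness, the two γ intervals can be evaluated at sample values; ρ = 0.9 and c = 1 are arbitrary here, and c∗ = 1/4 follows the slide's definition as reconstructed above:

```python
rho, c = 0.9, 1.0
c_star = 0.25  # E[xi]/2 = 1/4 for xi ~ Uniform(0, 1), per the slide

lo = rho * c / 2.0             # unique pure equilibrium for gamma <= lo
hi = rho * (c + c_star) / 2.0  # ... or for gamma > hi
print(f"no pure stationary equilibrium for gamma in ({lo}, {hi}]")
```

So for these numbers the gap where no pure equilibrium exists is the interval (0.45, 0.5625].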
27 Q5: check the dependence of the equilibrium solution (z, θ) on γ. This is actually a question of comparative statics, formalized by J. R. Hicks (1939) and P. A. Samuelson (1947): it studies how a change in the model parameters affects the equilibrium. It is important in the economics literature, game theory and optimization; however, most of the literature considers static models. An example with dynamic models: D. Acemoglu and M. K. Jensen (2015), J. Political Econ. (on large dynamic economies).
28 Theorem (monotone comparative statics; Q5). Suppose both γ_1 and γ_2 (effort costs) satisfy (A5) with γ_1 < γ_2, and (γ_i, z_i, θ_i) is solved from the MFG equation system. Then θ_1 < θ_2 and z_1 < z_2. Interpretation: effort being more expensive means agents are less willing to act, leading to a worse average state of the population.
29 Q6: instability of the equilibrium solution can occur. Take R(x, z) = x(c + R_0(z)) where R_0(z) has a quick leap from a value near 0 to nearly 1 around some value z_c. Such an R is not purely artificial; similar phenomena appear in vaccination models, where the group risk decreases sharply toward 0 when the vaccination coverage exceeds a certain critical level. Figure: iteration of the threshold in the best response w.r.t. the mean field; it swings back and forth.
30 Game with long-run average costs: define J_i(x^1, ..., x^N, π^1, ..., π^N) = limsup_{T→∞} (1/T) E Σ_{t=0}^{T−1} c(x_t^i, x_t^{(N)}, a_t^i), where c(x_t^i, x_t^{(N)}, a_t^i) = R(x_t^i, x_t^{(N)}) + γ 1_{a_t^i = a_1}. The solution equation system of the MFG: λ + h(x) = min{ ∫_0^1 h(y) Q_0(dy|x) + R(x, z), R(x, z) + γ }, z = ∫_0^1 x ν(dx), where ν is the limiting distribution of the closed-loop state x_t^i. Idea: consider the discount factor ρ and denote the value function V_ρ(x). Define h_ρ(x) = V_ρ(x) − V_ρ(0). Try to get a subsequence of {h_ρ} converging to h as ρ → 1, under an additional concavity condition on R(·, z) for each fixed z.
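The vanishing discount idea can be sketched numerically: compute V_ρ by value iteration for a few ρ near 1 and watch the relative value h_ρ(x) = V_ρ(x) − V_ρ(0) stabilize. The model below reuses the earlier illustrative specification (R(x, z) = x(c + z), Q_0(·|x) uniform on [x, 1]); the grid size and the particular ρ values are arbitrary:

```python
import numpy as np

def h_rho(rho, z=0.5, gamma=0.3, c=1.0, n=201, tol=1e-9):
    """Relative value h_rho(x) = V_rho(x) - V_rho(0) at fixed mean field z,
    computed by value iteration on a uniform grid of S = [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    R = x * (c + z)
    V = np.zeros(n)
    while True:
        # E[V(y)] under Q0(.|x) = Uniform[x, 1]: suffix averages of V
        suffix = np.cumsum(V[::-1])[::-1] / np.arange(n, 0, -1)
        V_new = np.minimum(rho * suffix + R, rho * V[0] + R + gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new - V_new[0]
        V = V_new

h95, h99 = h_rho(0.95), h_rho(0.99)
assert h95[0] == 0.0 and h99[0] == 0.0  # normalized at x = 0 by construction
```

As ρ → 1 the curves h_ρ change only slightly, consistent with extracting a convergent subsequence, while (1 − ρ)V_ρ(0) approximates the average cost λ.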
31 Further possible extensions: unbounded state space (such as the half line); pairwise coupling in the cost; continuous time modeling (for example, a compound Poisson process with impulse control; CDC '17, with Zhou); a Lévy process with one-sided jumps (such as a subordinator) also fits the monotone state modeling.
32 Thank you!
More informationLinear Programming Methods
Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution
More informationApproximate Dynamic Programming
Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course Value Iteration: the Idea 1. Let V 0 be any vector in R N A. LAZARIC Reinforcement
More informationValue and Policy Iteration
Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite
More informationPopulation Games and Evolutionary Dynamics
Population Games and Evolutionary Dynamics William H. Sandholm The MIT Press Cambridge, Massachusetts London, England in Brief Series Foreword Preface xvii xix 1 Introduction 1 1 Population Games 2 Population
More informationPrice and Capacity Competition
Price and Capacity Competition Daron Acemoglu a, Kostas Bimpikis b Asuman Ozdaglar c a Department of Economics, MIT, Cambridge, MA b Operations Research Center, MIT, Cambridge, MA c Department of Electrical
More informationMean Field Games on networks
Mean Field Games on networks Claudio Marchi Università di Padova joint works with: S. Cacace (Rome) and F. Camilli (Rome) C. Marchi (Univ. of Padova) Mean Field Games on networks Roma, June 14 th, 2017
More informationUniversity of Toronto Department of Statistics
Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704
More informationInfinite-Horizon Discounted Markov Decision Processes
Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected
More informationPolitical Economy of Institutions and Development. Lecture 8. Institutional Change and Democratization
14.773 Political Economy of Institutions and Development. Lecture 8. Institutional Change and Democratization Daron Acemoglu MIT March 5, 2013. Daron Acemoglu (MIT) Political Economy Lecture 8 March 5,
More informationFirst Prev Next Last Go Back Full Screen Close Quit. Game Theory. Giorgio Fagiolo
Game Theory Giorgio Fagiolo giorgio.fagiolo@univr.it https://mail.sssup.it/ fagiolo/welcome.html Academic Year 2005-2006 University of Verona Summary 1. Why Game Theory? 2. Cooperative vs. Noncooperative
More informationBayesian Persuasion Online Appendix
Bayesian Persuasion Online Appendix Emir Kamenica and Matthew Gentzkow University of Chicago June 2010 1 Persuasion mechanisms In this paper we study a particular game where Sender chooses a signal π whose
More informationStability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games
Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,
More informationChapter Introduction. A. Bensoussan
Chapter 2 EXPLICIT SOLUTIONS OFLINEAR QUADRATIC DIFFERENTIAL GAMES A. Bensoussan International Center for Decision and Risk Analysis University of Texas at Dallas School of Management, P.O. Box 830688
More informationOblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Gabriel Y. Weintraub, Lanier Benkard, and Benjamin Van Roy Stanford University {gweintra,lanierb,bvr}@stanford.edu Abstract
More informationMarkov Chain BSDEs and risk averse networks
Markov Chain BSDEs and risk averse networks Samuel N. Cohen Mathematical Institute, University of Oxford (Based on joint work with Ying Hu, Robert Elliott, Lukas Szpruch) 2nd Young Researchers in BSDEs
More informationRobust Comparative Statics of Non-monotone Shocks in Large Aggregative Games
Robust Comparative Statics of Non-monotone Shocks in Large Aggregative Games Carmen Camacho Takashi Kamihigashi Çağrı Sağlam June 9, 2015 Abstract A policy change that involves redistribution of income
More informationCommunication constraints and latency in Networked Control Systems
Communication constraints and latency in Networked Control Systems João P. Hespanha Center for Control Engineering and Computation University of California Santa Barbara In collaboration with Antonio Ortega
More informationConsistency of the maximum likelihood estimator for general hidden Markov models
Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models
More informationAPPLICATIONS DU TRANSPORT OPTIMAL DANS LES JEUX À POTENTIEL
APPLICATIONS DU TRANSPORT OPTIMAL DANS LES JEUX À POTENTIEL Adrien Blanchet IAST/TSE/Université Toulouse 1 Capitole Journées SMAI-MODE 216 En collaboration avec G. Carlier & P. Mossay & F. Santambrogio
More informationPolitical Economy of Institutions and Development: Problem Set 1. Due Date: Thursday, February 23, in class.
Political Economy of Institutions and Development: 14.773 Problem Set 1 Due Date: Thursday, February 23, in class. Answer Questions 1-3. handed in. The other two questions are for practice and are not
More informationRobust Predictions in Games with Incomplete Information
Robust Predictions in Games with Incomplete Information joint with Stephen Morris (Princeton University) November 2010 Payoff Environment in games with incomplete information, the agents are uncertain
More informationRandom Access Game. Medium Access Control Design for Wireless Networks 1. Sandip Chakraborty. Department of Computer Science and Engineering,
Random Access Game Medium Access Control Design for Wireless Networks 1 Sandip Chakraborty Department of Computer Science and Engineering, INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR October 22, 2016 1 Chen
More informationLearning in Mean Field Games
Learning in Mean Field Games P. Cardaliaguet Paris-Dauphine Joint work with S. Hadikhanloo (Dauphine) Nonlinear PDEs: Optimal Control, Asymptotic Problems and Mean Field Games Padova, February 25-26, 2016.
More informationLEARNING IN CONCAVE GAMES
LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries
More informationSome Background Math Notes on Limsups, Sets, and Convexity
EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges
More informationErnesto Mordecki 1. Lecture III. PASI - Guanajuato - June 2010
Optimal stopping for Hunt and Lévy processes Ernesto Mordecki 1 Lecture III. PASI - Guanajuato - June 2010 1Joint work with Paavo Salminen (Åbo, Finland) 1 Plan of the talk 1. Motivation: from Finance
More information