arxiv: v1 [cs.lo] 17 Dec 2015


Non-Zero Sum Games for Reactive Synthesis

Romain Brenguier 1, Lorenzo Clemente 2, Paul Hunter 3, Guillermo A. Pérez 3, Mickael Randour 3, Jean-François Raskin 3, Ocan Sankur 4, Mathieu Sassolas 5

1 University of Oxford, UK
2 University of Warsaw, Poland
3 Université Libre de Bruxelles, Belgium
4 CNRS, Irisa, France
5 Université Paris-Est Créteil, LACL, France

Abstract. In this invited contribution [7], we summarize new solution concepts useful for the synthesis of reactive systems that we have introduced in several recent publications. These solution concepts are developed in the context of non-zero sum games played on graphs. They are part of the contributions obtained in the inVEST project funded by the European Research Council.

1 Introduction

Reactive systems are computer systems that maintain a continuous interaction with the environment in which they operate. They usually exhibit characteristics, such as real-time constraints, concurrency, and parallelism, that make them difficult to develop correctly. Therefore, formal techniques using mathematical models have been advocated to help in their systematic design.

One well-studied formal technique is model checking [21,40,2], which compares a model of a system with its specification. The main objective of this technique is to find design errors early in the development cycle, so model checking can be considered a sophisticated debugging method. A scientifically more challenging goal, called synthesis, is to design algorithms that, given a specification for a reactive system and a model of its environment, directly synthesize a correct system, i.e., a system that enforces the specification no matter how the environment behaves. Synthesis can take different forms: from computing optimal values of parameters to the full-blown automatic synthesis of finite-state machine descriptions for components of the reactive system.
The main mathematical models proposed for the synthesis problem are based on two-player zero-sum games played on graphs, and the main solution concept for those games is the notion of winning strategy. This model encompasses the situation where a monolithic controller has to be designed to interact with a monolithic environment that is supposed to be fully antagonistic. In the sequel, we call the two players Eve and Adam: Eve plays the role of the system and Adam plays the role of the environment. A fully antagonistic environment is most often a bold abstraction of reality: the environment usually has its own goal which, in general, does not correspond to that of falsifying the specification of the reactive system. Nevertheless, this abstraction is popular because it is simple and sound: a winning strategy against an antagonistic environment is winning against any environment that pursues its own objective. However, this approach may fail to find a winning strategy even if solutions exist when the objectives of the environment are taken into account, or it may produce sub-optimal solutions because they are overcautious and do not exploit the fact that the environment has its own objective.

(Work supported by the ERC starting grant inVEST (FP ). G.A. Pérez is supported by an F.R.S.-FNRS ASP fellowship. M. Randour is an F.R.S.-FNRS Postdoctoral Researcher.)

In several recent works, we have introduced new solution concepts for the synthesis of reactive systems that take the objective of the environment into account or relax the fully adversarial assumption.

Assume admissible synthesis. In [8], we proposed a novel notion of synthesis where the objective of the environment can be captured using the concept of admissible strategies [5,3,9]. For a player with objective $\varphi$, a strategy $\sigma$ is dominated by $\sigma'$ if $\sigma'$ does as well as $\sigma$ w.r.t. $\varphi$ against all strategies of the other players, and better for some of those strategies. A strategy $\sigma$ is admissible if it is not dominated by another strategy. We use this notion to derive a meaningful way to synthesize systems with several players, with the following idea. Only admissible strategies should be played by rational players, as dominated strategies are clearly sub-optimal options. In assume-admissible synthesis, we make the assumption that both players play admissible strategies. Then, when synthesizing a controller, we search for an admissible strategy that is winning against all admissible strategies of the environment.
Assume admissible synthesis is sound: if both players choose strategies that are winning against admissible strategies of the other player, the objectives of both players will be satisfied.

Regret minimization: best-responses as yardstick. In [33], we studied strategies for Eve which minimize her regret. The regret of a strategy $\sigma$ of Eve corresponds to the difference between the value Eve achieves by playing $\sigma$ against Adam and the value she could have ensured if she had known the strategy of Adam in advance. Regret is not a novel concept in game theory (see, e.g., [31]), but it was not explicitly used for games played on graphs before [29]. The complexity of deciding whether a regret-minimizing strategy for Eve exists, and the memory requirements for such strategies, change depending on what type of behavior Adam can use. We have focused on three particular cases: arbitrary behaviors, positional behaviors, and time-dependent behaviors (otherwise known as oblivious environments). The latter class of regret games was shown in [33] to be related to the problem of determining whether an automaton has a certain form of determinism.

Games with an expected adversary. In [13,12,22], we combined the classical formalism of two-player zero-sum games (where the environment is considered to be completely antagonistic) with Markov decision processes (MDPs), a well-known model for decision-making inside a stochastic environment. The motivation is that one often has a good idea of the expected (i.e., average-case) behavior of the environment, represented as a stochastic model based on statistical data such as the frequency of requests for a computer server, the average traffic in a town, etc. In this case, it makes sense to look for strategies that maximize the expected performance of the system. This is the traditional approach for MDPs, but it gives no guarantee at all if the environment deviates from its expected behavior, which can happen, for example, if events with small probability occur, or if the statistical data upon which probabilities are estimated is noisy or unreliable. On the other hand, two-player zero-sum games lead to strategies guaranteeing a worst-case performance no matter how the environment behaves; however, such strategies may be far from optimal against the expected behavior of the environment. With our new framework of beyond worst-case synthesis, we provide formal grounds to synthesize strategies that both guarantee some minimal performance against any adversary and provide a higher expected performance against a given expected behavior of the environment, thus essentially combining the two traditional standpoints from games and MDPs.

Structure of the paper. Section 2 recalls preliminaries about games played on graphs, while Section 3 recalls the classical setting of zero-sum two-player games. Section 4 summarizes our recent works on the use of the notion of admissibility for the synthesis of reactive systems. Section 5 summarizes our recent results on regret minimization for reactive synthesis. Section 6 summarizes our recent contributions on the synthesis of strategies that ensure good expected performance together with guarantees against their worst-case behaviors.
2 Preliminaries

We consider two-player turn-based games played on finite (weighted) graphs. Such games are played on so-called weighted game arenas.

Definition 1 (Weighted Game Arena). A (turn-based) two-player weighted game arena is a tuple $\mathcal{A} = \langle S_\exists, S_\forall, E, s_{\mathsf{init}}, w \rangle$ where:
- $S_\exists$ is the finite set of states owned by Eve, $S_\forall$ is the finite set of states owned by Adam, $S_\exists \cap S_\forall = \emptyset$, and we denote $S_\exists \cup S_\forall$ by $S$.
- $E \subseteq S \times S$ is a set of edges; we say that $E$ is total whenever for all states $s \in S$, there exists $s' \in S$ such that $(s,s') \in E$ (we often assume this w.l.o.g.).
- $s_{\mathsf{init}} \in S$ is the initial state.
- $w : E \to \mathbb{Z}$ is the weight function that assigns an integer weight to each edge.

We do not always use the weight function defined on the edges of the weighted game arena, and in these cases we simply omit it.
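To make Definition 1 concrete, here is a minimal Python sketch of one possible encoding of a weighted game arena. The class name `Arena`, its field names, and the three-state arena `toy` are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Arena:
    """Turn-based two-player weighted game arena (illustrative encoding)."""
    eve: set                # states owned by Eve
    adam: set               # states owned by Adam
    edges: set              # pairs (s, t) of states
    init: object            # initial state
    weight: dict = field(default_factory=dict)  # (s, t) -> int, may be omitted

    def __post_init__(self):
        if self.eve & self.adam:
            raise ValueError("Eve and Adam must own disjoint sets of states")
        states = self.states()
        if self.init not in states:
            raise ValueError("the initial state must be a state of the arena")
        if any(s not in states or t not in states for (s, t) in self.edges):
            raise ValueError("every edge must connect two states of the arena")
        if any(e not in self.edges for e in self.weight):
            raise ValueError("weights may only be assigned to edges")

    def states(self):
        return self.eve | self.adam

    def successors(self, s):
        return {t for (u, t) in self.edges if u == s}

    def is_total(self):
        # Totality: every state has at least one outgoing edge.
        return all(self.successors(s) for s in self.states())


# Hypothetical three-state arena, not taken from the paper's figures.
toy = Arena(
    eve={1, 3},
    adam={2},
    edges={(1, 2), (2, 1), (2, 3), (3, 3)},
    init=1,
    weight={(1, 2): 4, (2, 1): 0, (2, 3): 1, (3, 3): 2},
)
```

Leaving `weight` empty corresponds to the unweighted case in which the weight function is omitted.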

Unless otherwise stated, we consider for the rest of the paper a fixed weighted game arena $\mathcal{A} = \langle S_\exists, S_\forall, E, s_{\mathsf{init}}, w \rangle$.

A play in the arena $\mathcal{A}$ is an infinite sequence of states $\pi = s_0 s_1 \dots s_n \dots$ such that for all $i \geq 0$, $(s_i, s_{i+1}) \in E$. A play $\pi = s_0 s_1 \dots$ is initial when $s_0 = s_{\mathsf{init}}$. We denote by $\mathsf{Plays}(\mathcal{A})$ the set of plays in the arena $\mathcal{A}$, and by $\mathsf{InitPlays}(\mathcal{A})$ its subset of initial plays.

A history $\rho$ is a finite sequence of states which is a prefix of a play in $\mathcal{A}$. We denote by $\mathsf{Pref}(\mathcal{A})$ the set of histories in $\mathcal{A}$, and the set of prefixes of initial plays is denoted by $\mathsf{InitPref}(\mathcal{A})$. Given an infinite sequence of states $\pi$ and two finite sequences of states $\rho_1, \rho_2$, we write $\rho_1 < \pi$ if $\rho_1$ is a prefix of $\pi$, and $\rho_2 \leq \rho_1$ if $\rho_2$ is a prefix of $\rho_1$. For a history $\rho = s_0 s_1 \dots s_n$, we denote by $\mathsf{last}(\rho)$ its last state $s_n$, for all $i, j$ with $0 \leq i \leq j \leq n$, by $\rho(i..j)$ the infix of $\rho$ between position $i$ and position $j$, i.e., $\rho(i..j) = s_i s_{i+1} \dots s_j$, and by $\rho(i)$ the state at position $i$ of $\rho$, i.e., $\rho(i) = s_i$. The set of histories that belong to Eve, noted $\mathsf{Pref}_\exists(\mathcal{A})$, is the subset of histories $\rho \in \mathsf{Pref}(\mathcal{A})$ such that $\mathsf{last}(\rho) \in S_\exists$, and the set of histories that belong to Adam, noted $\mathsf{Pref}_\forall(\mathcal{A})$, is the subset of histories $\rho \in \mathsf{Pref}(\mathcal{A})$ such that $\mathsf{last}(\rho) \in S_\forall$.

Definition 2 (Strategy). A strategy for Eve in the arena $\mathcal{A}$ is a function $\sigma_\exists : \mathsf{Pref}_\exists(\mathcal{A}) \to S$ such that for all $\rho \in \mathsf{Pref}_\exists(\mathcal{A})$, $(\mathsf{last}(\rho), \sigma_\exists(\rho)) \in E$, i.e., it assigns to each history of $\mathcal{A}$ that belongs to Eve a state which is an $E$-successor of the last state of the history. Symmetrically, a strategy for Adam in the arena $\mathcal{A}$ is a function $\sigma_\forall : \mathsf{Pref}_\forall(\mathcal{A}) \to S$ such that for all $\rho \in \mathsf{Pref}_\forall(\mathcal{A})$, $(\mathsf{last}(\rho), \sigma_\forall(\rho)) \in E$. The set of strategies for Eve is denoted by $\Sigma_\exists$ and the set of strategies of Adam by $\Sigma_\forall$.

When we want to refer to a strategy of Eve or Adam, we write it $\sigma$. We denote by $\mathsf{Dom}(\sigma)$ the domain of definition of the strategy $\sigma$, i.e., for all strategies $\sigma$ of Eve (resp. Adam), $\mathsf{Dom}(\sigma) = \mathsf{Pref}_\exists(\mathcal{A})$ (resp. $\mathsf{Dom}(\sigma) = \mathsf{Pref}_\forall(\mathcal{A})$). A play $\pi = s_0 s_1 \dots s_n \dots$ is compatible with a strategy $\sigma$ if for all $i \geq 0$ such that $\pi(0..i) \in \mathsf{Dom}(\sigma)$, we have that $s_{i+1} = \sigma(\pi(0..i))$. We denote by $\mathsf{Outcome}_s(\sigma)$ the set of plays that start in $s$ and are compatible with the strategy $\sigma$. Given a strategy $\sigma_\exists$ for Eve, a strategy $\sigma_\forall$ for Adam, and a state $s$, we write $\mathsf{Outcome}_s(\sigma_\exists, \sigma_\forall)$ for the unique play that starts in $s$ and is compatible with both $\sigma_\exists$ and $\sigma_\forall$.

A strategy $\sigma$ is memoryless when for all histories $\rho_1, \rho_2 \in \mathsf{Dom}(\sigma)$, if $\mathsf{last}(\rho_1) = \mathsf{last}(\rho_2)$ then $\sigma(\rho_1) = \sigma(\rho_2)$, i.e., memoryless strategies only depend on the last state of the history and so they can be seen as (partial) functions from $S$ to $S$. $\Sigma^{\mathsf{ML}}_\exists$ and $\Sigma^{\mathsf{ML}}_\forall$ denote the sets of memoryless strategies of Eve and of Adam, respectively. A strategy $\sigma$ is finite-memory if there exists an equivalence relation $\sim \subseteq \mathsf{Dom}(\sigma) \times \mathsf{Dom}(\sigma)$ of finite index such that for all histories $\rho_1, \rho_2$ with $\rho_1 \sim \rho_2$, we have that $\sigma(\rho_1) = \sigma(\rho_2)$. If the relation $\sim$ is regular (computable by a finite state machine), then the finite-memory strategy can be modeled by a finite state transducer (a so-called Moore or Mealy machine). If a strategy is encoded by a machine with $m$ states, we say that it has memory size $m$.
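As an illustration of memoryless strategies and outcomes, the following Python sketch represents a memoryless strategy as a dictionary from states to successors and computes the unique outcome of a pair of such strategies. Because the arena is finite, this outcome is ultimately periodic, so it can be returned as a finite prefix followed by a cycle repeated forever. The function names and the small arena are assumptions made for the example.

```python
def next_state(s, eve_states, sigma_eve, sigma_adam):
    """Successor chosen by the owner of state s (memoryless strategies as dicts)."""
    sigma = sigma_eve if s in eve_states else sigma_adam
    return sigma[s]


def outcome_lasso(start, eve_states, edges, sigma_eve, sigma_adam):
    """Return (prefix, cycle): the outcome is prefix followed by cycle repeated forever."""
    first_visit = {}
    play = []
    s = start
    while s not in first_visit:
        first_visit[s] = len(play)
        play.append(s)
        t = next_state(s, eve_states, sigma_eve, sigma_adam)
        if (s, t) not in edges:
            raise ValueError("a strategy must choose an E-successor")
        s = t
    k = first_visit[s]          # position where the cycle starts
    return play[:k], play[k:]


# Hypothetical arena: Eve owns 1 and 3, Adam owns 2.
eve_states = {1, 3}
edges = {(1, 2), (2, 1), (2, 3), (3, 3)}
sigma_eve = {1: 2, 3: 3}
adam_leaves = {2: 3}   # Adam moves on to state 3
adam_returns = {2: 1}  # Adam sends the play back to state 1

lasso_leave = outcome_lasso(1, eve_states, edges, sigma_eve, adam_leaves)
lasso_return = outcome_lasso(1, eve_states, edges, sigma_eve, adam_returns)
```

Strategies with memory would need the whole history as input rather than only its last state, which this dictionary encoding deliberately does not support.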

An objective $\mathsf{Win} \subseteq \mathsf{Plays}(\mathcal{A})$ is a subset of plays. A strategy $\sigma$ is winning from state $s$ if $\mathsf{Outcome}_s(\sigma) \subseteq \mathsf{Win}$. We will consider both qualitative objectives, which do not depend on the weight function of the game arena, and quantitative objectives, which do depend on it.

Our qualitative objectives are defined with Muller conditions (which are a canonical way to represent all the regular sets of plays). Let $\pi \in S^\omega$ be a play; then $\mathsf{inf}(\pi) = \{ s \in S \mid \forall i \geq 0 \cdot \exists j \geq i : \pi(j) = s \}$ is the subset of elements of $S$ that occur infinitely often along $\pi$. A Muller objective for a game arena $\mathcal{A}$ is defined by a set of sets of states $\mathcal{F}$ and contains the plays $\{ \pi \in S^\omega \mid \mathsf{inf}(\pi) \in \mathcal{F} \}$. We sometimes take the liberty to define such regular sets using standard LTL syntax. For a formal definition of the syntax and semantics of LTL, we refer the interested reader to [2].

We associate, to each play $\pi$, an infinite sequence of weights, denoted $w(\pi)$, and defined as follows:

$$w(\pi) = w(\pi(0),\pi(1))\, w(\pi(1),\pi(2)) \dots w(\pi(i),\pi(i+1)) \dots \in \mathbb{Z}^\omega.$$

To assign a value $\mathsf{Val}(\pi)$ to a play $\pi$, we classically use functions like sup (which returns the supremum of the values along the play), inf (which returns the infimum), limsup (which returns the limit superior), liminf (which returns the limit inferior), MP (which returns the limit of the average of the weights along the play), or dsum (which returns the discounted sum of the weights along the play). We only define the mean-payoff measure formally. Let $\rho = s_0 s_1 \dots s_n$ be such that $(s_i, s_{i+1}) \in E$ for all $i$, $0 \leq i < n$. The mean-payoff of this sequence of edges is

$$\mathsf{MP}(\rho) = \frac{1}{n} \sum_{i=0}^{n-1} w(\rho(i),\rho(i+1)),$$

i.e., the mean value of the weights of the edges traversed by the finite sequence $\rho$. The mean-payoff of an (infinite) play $\pi$, denoted $\mathsf{MP}(\pi)$, is a real number defined from the sequence of weights $w(\pi)$ as follows:

$$\mathsf{MP}(\pi) = \liminf_{n \to +\infty} \frac{1}{n} \sum_{i=0}^{n-1} w(\pi(i),\pi(i+1)),$$

i.e., $\mathsf{MP}(\pi)$ is the limit inferior of the running averages of the weights seen along the play $\pi$. Note that we need to use liminf because the value of the running averages of weights may oscillate along $\pi$, and so the limit is not guaranteed to exist.

A game is defined by a (weighted) game arena, and objectives for Eve and Adam.

Definition 3 (Game). A game $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$ is defined by a game arena $\mathcal{A}$, an objective $\mathsf{Win}_\exists$ for Eve, and an objective $\mathsf{Win}_\forall$ for Adam.
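The mean-payoff definitions can be checked on small inputs with the Python sketch below. It computes $\mathsf{MP}(\rho)$ for a finite history exactly as defined above. For an ultimately periodic play the running averages converge, so the liminf is simply the average weight of the repeated cycle, which is what `mp_cycle` returns. The weights and function names are hypothetical.

```python
from fractions import Fraction


def mp_history(weight, rho):
    """Mean-payoff of a finite history rho = s_0 ... s_n with n >= 1 edges."""
    n = len(rho) - 1
    if n < 1:
        raise ValueError("a history needs at least one edge")
    total = sum(weight[(rho[i], rho[i + 1])] for i in range(n))
    return Fraction(total, n)


def mp_cycle(weight, cycle):
    """Mean-payoff of the play that repeats `cycle` forever (after any finite prefix)."""
    k = len(cycle)
    total = sum(weight[(cycle[i], cycle[(i + 1) % k])] for i in range(k))
    return Fraction(total, k)


# Hypothetical integer weights on the edges of a small arena.
weight = {(1, 2): 4, (2, 1): 0, (2, 3): 1, (3, 3): 2}

short_avg = mp_history(weight, [1, 2, 1, 2])        # (4 + 0 + 4) / 3
long_avg = mp_history(weight, [1, 2] * 50 + [1])    # 100 edges alternating 4 and 0
cycle_value = mp_cycle(weight, [1, 2])              # value of the play (1 2)^omega
```

Exact rationals from `fractions.Fraction` avoid floating-point noise when comparing values against a rational threshold.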

3 Classical Zero-Sum Setting

In zero-sum games, players have antagonistic objectives.

Definition 4. A game $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$ is zero-sum if $\mathsf{Win}_\forall = \mathsf{Plays} \setminus \mathsf{Win}_\exists$.

Fig. 1. An example of a two-player game arena. Rounded positions belong to Eve, and squared positions belong to Adam.

Example 1. Let us consider the example of Fig. 1. Assume that the objective of Eve is to visit 4 infinitely often, i.e., $\mathsf{Win}_\exists = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 4 \}$, and that the objective of Adam is $\mathsf{Win}_\forall = \mathsf{Plays} \setminus \mathsf{Win}_\exists$. Then it should be clear that Eve does not have a strategy that enforces a play in $\mathsf{Win}_\exists$ no matter what Adam plays. Indeed, if Adam always chooses to stay at state 2, there is no way for Eve to visit 4 at all.

As we already said, zero-sum games are usually a bold abstraction of reality. This is because the system to synthesize usually interacts with an environment that has its own objective, and this objective is not necessarily the complement of the objective of the system. A classical way to handle this situation (see, e.g., [4]) is to ask the system to win only when the environment meets its own objective.

Definition 5 (Win-Hyp). Let $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$ be a game. Eve achieves $\mathsf{Win}_\exists$ from state $s$ under hypothesis $\mathsf{Win}_\forall$ if there exists $\sigma_\exists$ such that $\mathsf{Outcome}_s(\sigma_\exists) \subseteq \mathsf{Win}_\exists \cup \overline{\mathsf{Win}_\forall}$, where $\overline{\mathsf{Win}_\forall} = \mathsf{Plays} \setminus \mathsf{Win}_\forall$.

The synthesis rule in the definition above is called winning under hypothesis, Win-Hyp for short.

Example 2. Let us consider the example of Fig. 1 again, but now assume that the objective of Adam is to visit 3 infinitely often, i.e., $\mathsf{Win}_\forall = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 3 \}$. In this case, it should be clear that the strategy $1 \to 2$ and $3 \to 4$ for Eve is winning for the Win-Hyp objective

$$\{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 4 \} \cup \overline{\{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 3 \}},$$

i.e., under the hypothesis that the outcome satisfies the objective of Adam.

Unfortunately, there are strategies of Eve which are winning for the rule Win-Hyp but which are not desirable. As an example, consider the strategy that in 1 chooses to go to 5. In that case, the objective of Adam is unmet and so this strategy of Eve is winning for the Win-Hyp objective above, but clearly such a strategy is not interesting as it excludes the possibility to meet the objective of Eve.

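4 Assume Admissible Synthesis

To define the notion of admissible strategy, we first need to define when a strategy $\sigma$ is dominated by a strategy $\sigma'$. We will define the notion for Eve; the definition for Adam is symmetric.

Let $\sigma_\exists$ and $\sigma'_\exists$ be two strategies of Eve in the game arena $\mathcal{A}$. We say that $\sigma'_\exists$ dominates $\sigma_\exists$ if the following two conditions hold:
1. $\forall \sigma_\forall \in \Sigma_\forall \cdot \mathsf{Outcome}_{s_{\mathsf{init}}}(\sigma_\exists, \sigma_\forall) \in \mathsf{Win}_\exists \implies \mathsf{Outcome}_{s_{\mathsf{init}}}(\sigma'_\exists, \sigma_\forall) \in \mathsf{Win}_\exists$
2. $\exists \sigma_\forall \in \Sigma_\forall \cdot \mathsf{Outcome}_{s_{\mathsf{init}}}(\sigma_\exists, \sigma_\forall) \notin \mathsf{Win}_\exists \land \mathsf{Outcome}_{s_{\mathsf{init}}}(\sigma'_\exists, \sigma_\forall) \in \mathsf{Win}_\exists$

So a strategy $\sigma_\exists$ is dominated by $\sigma'_\exists$ if $\sigma'_\exists$ does as well as $\sigma_\exists$ against any strategy of Adam (condition 1), and there exists a strategy of Adam against which $\sigma'_\exists$ does better than $\sigma_\exists$ (condition 2).

Definition 6 (Admissible Strategy). A strategy is admissible if there does not exist a strategy that dominates it.

Let $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$ be a game. The set of admissible strategies for Eve is denoted $\mathsf{Adm}_\exists$, and the set of admissible strategies for Adam is denoted $\mathsf{Adm}_\forall$. Clearly, a rational player should not play a dominated strategy, as there always exists some strategy that behaves strictly better than the dominated strategy. So, a rational player only plays admissible strategies.

Example 3. Let us consider again the example of Fig. 1 with $\mathsf{Win}_\exists = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 4 \}$ and $\mathsf{Win}_\forall = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 3 \}$. We claim that the strategy $\sigma_\exists$ that plays $1 \to 5$ is not admissible in $\mathcal{A}$ from state 1. This is because the strategy $\sigma'_\exists$ that plays $1 \to 2$ and $4 \to 3$ dominates this strategy. Indeed, while $\sigma_\exists$ is always losing for the objective of Eve, the strategy $\sigma'_\exists$ wins for this objective whenever Adam eventually plays $2 \to 3$.

Definition 7 (AA). Let $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$ be a game. Eve achieves $\mathsf{Win}_\exists$ from $s$ under the hypothesis that Adam plays admissible strategies if

$$\exists \sigma_\exists \in \mathsf{Adm}_\exists \cdot \forall \sigma_\forall \in \mathsf{Adm}_\forall \cdot \mathsf{Outcome}_s(\sigma_\exists, \sigma_\forall) \in \mathsf{Win}_\exists.$$

Example 4. Let us consider again the example of Fig. 1 with $\mathsf{Win}_\exists = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 4 \}$ and $\mathsf{Win}_\forall = \{ \pi \in \mathsf{Plays} \mid \pi \models \Box\Diamond 3 \}$. We claim that the strategy $\sigma_\exists$ of Eve that plays $1 \to 2$ and $4 \to 3$ is admissible (see the previous example) and winning against all the admissible strategies of Adam. This is a consequence of the fact that the strategy of Adam that always plays $2 \to 2$, and which is the only counter-strategy of Adam against $\sigma_\exists$, is not admissible. Indeed, this strategy falsifies $\mathsf{Win}_\forall$, while a strategy that always chooses $2 \to 3$ enforces the objective of Adam.

Theorem 1 ([3,9,8]). For all games $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$, if $\mathsf{Win}_\exists$ and $\mathsf{Win}_\forall$ are omega-regular sets of plays, then $\mathsf{Adm}_\exists$ and $\mathsf{Adm}_\forall$ are both non-empty sets. The problem of deciding if a game $G = (\mathcal{A}, \mathsf{Win}_\exists, \mathsf{Win}_\forall)$, where $\mathsf{Win}_\exists$ and $\mathsf{Win}_\forall$ are omega-regular sets of plays expressed as Muller objectives, satisfies

$$\exists \sigma_\exists \in \mathsf{Adm}_\exists \cdot \forall \sigma_\forall \in \mathsf{Adm}_\forall \cdot \mathsf{Outcome}_s(\sigma_\exists, \sigma_\forall) \in \mathsf{Win}_\exists$$

is PSpace-complete.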
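The dominance and assume-admissible definitions can be illustrated on a finite abstraction. The Python sketch below assumes that each player has only finitely many candidate strategies and that the Boolean outcome of every pair has already been computed; the actual games have infinitely many strategies and the algorithms of [8,9] do not work this way. The tables are a hand-made simplification in the spirit of Examples 3 and 4, with two strategies per player (`to5` and `to2` for Eve, `stay` and `leave` for Adam), and all names are hypothetical.

```python
def dominates(table, s2, s1):
    """True if strategy s2 dominates s1 in table[strategy][opponent] -> bool."""
    opponents = table[s1].keys()
    # Condition 1: s2 wins whenever s1 wins.
    as_good = all(table[s2][o] for o in opponents if table[s1][o])
    # Condition 2: some opponent strategy is beaten by s2 but not by s1.
    strictly_better = any(table[s2][o] and not table[s1][o] for o in opponents)
    return as_good and strictly_better


def admissible(table):
    """Strategies of the row player that no other row dominates."""
    return {s for s in table
            if not any(dominates(table, t, s) for t in table if t != s)}


def transpose(table):
    columns = next(iter(table.values())).keys()
    return {c: {r: table[r][c] for r in table} for c in columns}


def assume_admissible(win_eve, win_adam):
    """Admissible strategies of Eve that win against every admissible strategy of Adam."""
    adm_eve = admissible(win_eve)
    adm_adam = admissible(transpose(win_adam))
    return {e for e in adm_eve if all(win_eve[e][a] for a in adm_adam)}


# win_eve[e][a]: Eve's objective holds; win_adam[e][a]: Adam's objective holds.
win_eve = {"to5": {"stay": False, "leave": False},
           "to2": {"stay": False, "leave": True}}
win_adam = {"to5": {"stay": False, "leave": False},
            "to2": {"stay": False, "leave": True}}

witnesses = assume_admissible(win_eve, win_adam)
```

In this toy table, `to2` dominates `to5` for Eve and `leave` dominates `stay` for Adam, so the only witness of the rule is `to2`, matching the informal argument of Example 4.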

Additional Results. The assume-admissible setting we present here relies on procedures for the iterative elimination of dominated strategies for multiple players, which were studied in [3] on games played on graphs. In this context, dominated strategies are repeatedly eliminated for each player. Thus, with respect to the new set of strategies of its opponent, new strategies may become dominated, and will therefore be eliminated, and so on until the process stabilizes. In [9], we studied the algorithmic complexity of this problem and proved that, for games with Muller objectives, deciding whether all outcomes compatible with iteratively admissible strategy profiles satisfy an omega-regular objective defined by a Muller condition is PSpace-complete, and in UP ∩ coUP for the special case of Büchi objectives.

The assume-admissible rule introduced in [8] is also defined for multiple players and corresponds, roughly, to the first iteration of the elimination procedure. We additionally prove that if players have Büchi objectives, then the rule can be decided in polynomial time. One advantage of the assume-admissible rule is the rectangularity of the solution set: the set of strategy profiles that witness the rule can be written as a product of sets of strategies for each player. In particular, this means that a strategy witnessing the rule can be chosen separately for each player. Thus, the rule is robust in the sense that the players do not need to agree on a strategy profile, but only on the admissibility assumption on each other. In addition, we show in [8] that the rule is amenable to abstraction techniques: we show how state-space abstractions can be used to check a sufficient condition for assume-admissible, only doing computations on the abstract state space.

Related Works. The rule winning under hypothesis (Win-Hyp) and its weaknesses are discussed in [4]. We have illustrated the limitations of this rule in Example 2.
There are related works in the literature which propose concepts to model systems composed of several parts, each having their own objectives. The solutions that are proposed are based on n-player non-zero sum games. This is the case both for assume-guarantee synthesis [18] (AG), and for rational synthesis [30] (RS). For the case of two-player games, AG is based on the concept of secure equilibria [19] (SE), a refinement of Nash equilibria [38] (NE). In SE, objectives of the players are lexicographic: each player first tries to force his own objective, and then tries to falsify the objectives of the other players. It was shown in [19] that SE are the NE that form enforceable contracts between the two players. When the AG rule is extended to several players, as in [18], it no longer corresponds to secure equilibria. We gave a direct algorithm for multiple players in [8]. The difference between AG and SE is that AG strategies have to be resilient to deviations of all the other players, while SE profiles have to be resilient to deviations by only one player. A variant of the rule AG, called Doomsday equilibria, has been proposed in [15]. We have also studied quantitative extensions of the notion of secure equilibria in [14].

In the context of infinite games played on graphs, one well-known limitation of NE is the existence of non-credible threats. Refinements of the notion of NE, like sub-game perfect equilibria (SPE), have been proposed to overcome this limitation. SPE for games played on graphs have been studied in, e.g., [43,10]. Admissibility does not suffer from this limitation.

In RS, the system is assumed to be monolithic and the environment is made of several components that are only partially controllable. In RS, we search for a profile of strategies in which the system forces its objective and the players that model the environment are given an acceptable strategy profile, from which it is assumed that they will not deviate. "Acceptable" can be formalized by any solution concept, e.g., by NE, dominant strategies, or sub-game perfect equilibria. This is the existential flavor of RS. More recently, Kupferman et al. have proposed in [35] a universal variant of this rule. In this variant, we search for a strategy of the system such that in all strategy profiles that extend this strategy for the system and that are NE, the outcome of the game satisfies the specification of the system.

In [26], Faella studies several alternatives to the notion of winning strategy, including the notion of admissible strategy. His work is for two players, but only the objective of one player is taken into account; the objective of the other player is left unspecified. In that work, the notion of admissibility is used to define a notion of best-effort in synthesis.

The notion of admissible strategy is definable in strategy logics [20,37], and decision problems related to the assume-admissible rule can be reduced to satisfiability queries in such logics. This reduction does not lead to worst-case optimal algorithms; we presented worst-case optimal algorithms in [8] based on our previous work [9].

5 Regret Minimization

In the previous section, we have shown how the notion of admissible strategy can be used to relax the classical worst-case hypothesis made on the environment. In this section, we review another way to relax this worst-case hypothesis.
The idea is simple and intuitive. When looking for a strategy, instead of trying to find a strategy which is worst-case optimal, we search for a strategy that takes best-responses (against the behavior of the environment) as a yardstick. That is, we would like to find a strategy that behaves not far from an optimal response to the strategy of the environment when the latter is fixed. The notion of regret minimization is naturally defined in a quantitative setting (although it also makes sense in a Boolean setting).

Let us now formally define the notion of regret associated to a strategy of Eve. This definition is parameterized by a set of strategies for Adam.

Definition 8 (Relative Regret). Let $\mathcal{A} = \langle S_\exists, S_\forall, E, s_{\mathsf{init}}, w \rangle$ be a weighted game arena and let $\sigma_\exists$ be a strategy of Eve. The regret of this strategy relative to a set of strategies $\mathsf{Str}_\forall \subseteq \Sigma_\forall$ is defined as follows:

$$\mathsf{Reg}(\sigma_\exists, \mathsf{Str}_\forall) = \sup_{\sigma_\forall \in \mathsf{Str}_\forall} \; \sup_{\sigma'_\exists \in \Sigma_\exists} \Big( \mathsf{Val}(\sigma'_\exists, \sigma_\forall) - \mathsf{Val}(\sigma_\exists, \sigma_\forall) \Big).$$

We interpret the sub-expression $\sup_{\sigma'_\exists \in \Sigma_\exists} \mathsf{Val}(\sigma'_\exists, \sigma_\forall)$ as the best-response of Eve against $\sigma_\forall$. Then, the relative regret of a strategy of Eve can be seen as the supremum of the differences between the value achieved by $\sigma_\exists$ against a strategy of Adam and the value achieved by the corresponding best-response.

We are now equipped to formally define the problem under study, which is parameterized by a payoff function $\mathsf{Val}(\cdot)$ and a set $\mathsf{Str}_\forall$ of strategies of Adam.

Definition 9 (Regret Minimization). Given a weighted game arena $\mathcal{A}$ and a rational threshold $r$, decide if there exists a strategy $\sigma_\exists$ for Eve such that $\mathsf{Reg}(\sigma_\exists, \mathsf{Str}_\forall) \leq r$, and synthesize such a strategy if one exists.

In [33], we have considered several types of strategies for Adam: the set $\Sigma_\forall$, i.e., any strategy; the set $\Sigma^{\mathsf{ML}}_\forall$, i.e., memoryless strategies for Adam; and the set $\Sigma^{\mathsf{W}}_\forall$, i.e., word strategies for Adam (see footnote 6). We will illustrate each of these cases on examples below.

Example 5. Let us consider the weighted game arena of Fig. 2, and let us assume that we want to synthesize a strategy for Eve that minimizes her mean-payoff regret against Adam playing a memoryless strategy. The memoryless restriction is useful when designing a system that needs to perform well in an environment which is only partially known. In practice, a controller may discover the environment with which it is interacting during run-time. Such a situation can be modeled by an arena in which choices in nodes of the environment model an entire family of environments and each memoryless strategy models a specific environment of the family. In such cases, if we want to design a controller that performs reasonably well against all the possible environments, we can consider each best-response of Eve for each environment and then try to choose one unique strategy for Eve that minimizes the difference in performance w.r.t. those best-responses: a regret-minimizing strategy.

Fig. 2. An example of a two-player game arena with MP objective for Eve. Rounded positions belong to Eve, and squared positions belong to Adam.

Footnote 6: To define word strategies, it is convenient to consider game arenas where edges have labels called letters. In that case, when playing a word strategy, Adam commits to a sequence of letters (i.e., a word) and plays that word regardless of the exact state of the game. Word strategies are formally defined in [33] and below.

In our example, prior to a first visit to state 3, we do not know if the edge $3 \to 2$ or the edge $3 \to 1$ will be activated by Adam. But as Adam is bound to play a memoryless strategy, once he has chosen one of the two edges, we know that he will stick to this choice. A regret-minimizing strategy in this example is as follows: play $1 \to 2$, then $2 \to 3$; if Adam plays $3 \to 2$, then play $2 \to 1$ and then $1 \to 1$ forever; otherwise Adam plays $3 \to 1$, and then Eve should continue to play $1 \to 2$ and $2 \to 3$ forever. This strategy has regret 0. Note that this strategy uses memory and that there is no memoryless strategy of Eve with regret 0 in this game.

Let us now illustrate the interest of the notion of regret minimization when Adam plays word strategies. When considering this restriction, it is convenient to consider letters that label the edges of the graph (Fig. 3). A word strategy for Adam is a function $w : \mathbb{N} \to \{a,b\}$. In this setting, Adam plays a sequence of letters and this sequence is independent of the current state of the game. We have shown in [33] that the notion of regret minimization relative to word strategies is a generalization of the notion of good-for-games automata introduced by Henzinger and Piterman in [32].

Fig. 3. An example of a two-player game arena with MP objective for Eve. Edges are annotated by letters: Adam chooses a word $w$ and Eve resolves the non-determinism on edges.

Example 6. In this example, a strategy of Eve determines how to resolve non-determinism in state 1. The best strategy of Eve for mean-payoff regret minimization is to always take the edge $1 \to 3$. Indeed, let us consider all the sequences of two letters that Adam can choose and compute the regret of choosing $1 \to 2$ (left) and the regret of choosing $1 \to 3$ (right):
- $a\ast$ with $\ast \in \{a,b\}$: the regret of left is equal to 0, and the regret of right is equal to 1.
- $b\ast$ with $\ast \in \{a,b\}$: the regret of left is equal to 3, and the regret of right is 0.

So the strategy that minimizes the regret of Eve is to always take the arrow $1 \to 3$ (right); the regret is then equal to 1.

In [33], we have studied the complexity of deciding the existence of strategies for Eve whose regret does not exceed a given threshold. The results that we have obtained are summarized in the theorem below.
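The regret computation of Definition 8 and the threshold question of Definition 9 can be sketched in Python for the special case where both players have finitely many candidate strategies and all values are tabulated. This is only an illustration of the definitions, not the algorithm of [33]. The payoffs in `val` are hypothetical numbers, chosen so that the resulting regrets agree with those reported in Example 6 (left: 0 and 3, right: 1 and 0); they are not the weights of Fig. 3.

```python
def regret(val, sigma, adam_strategies):
    """Regret of Eve's strategy sigma relative to a finite set of Adam strategies.

    val[s][tau] is the value obtained when Eve plays s and Adam plays tau.
    """
    return max(
        max(val[s][tau] for s in val) - val[sigma][tau]  # best response minus actual value
        for tau in adam_strategies
    )


def minimal_regret(val, adam_strategies):
    """Pair (regret, strategy) for a regret-minimizing strategy of Eve."""
    return min((regret(val, s, adam_strategies), s) for s in val)


def exists_with_regret_at_most(val, adam_strategies, r):
    """Decision version: is there a strategy of Eve with regret at most r?"""
    return minimal_regret(val, adam_strategies)[0] <= r


# Hypothetical values; columns stand for Adam's first letter.
val = {
    "left": {"a": 5, "b": 1},
    "right": {"a": 4, "b": 4},
}
words = ["a", "b"]

regret_left = regret(val, "left", words)
regret_right = regret(val, "right", words)
best = minimal_regret(val, words)
```

The inner `max` is the best-response value against a fixed strategy of Adam, and the outer `max` plays the role of the supremum over Adam's strategies.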

Theorem 2 ([33]). Let $\mathcal{A} = \langle S_\exists, S_\forall, E, s_{\mathsf{init}}, w \rangle$ be a weighted game arena. The complexity of deciding if Eve has a strategy with regret less than or equal to a threshold $r \in \mathbb{Q}$ against Adam playing:
- a strategy in $\Sigma_\forall$, is PTime-complete for payoff functions inf, sup, liminf, limsup, and in NP ∩ coNP for MP;
- a strategy in $\Sigma^{\mathsf{ML}}_\forall$, is in PSpace for payoff functions inf, sup, liminf, limsup, and MP, and is coNP-hard for inf, sup, limsup, and PSpace-hard for liminf and MP;
- a strategy in $\Sigma^{\mathsf{W}}_\forall$, is ExpTime-complete for payoff functions inf, sup, liminf, limsup, and undecidable for MP.

The above results are obtained by reducing the synthesis of regret-minimizing strategies to finding winning strategies in classical games. For instance, a strategy for Eve that minimizes regret against $\Sigma^{\mathsf{ML}}_\forall$ for the mean-payoff measure corresponds to a winning strategy in a mean-payoff game played on a larger game arena which encodes the witnessed choices of Adam and forces him to play positionally. When minimizing regret against word strategies, for the decidable cases the reduction is done to parity games and is based on the quantitative simulation games defined in [16].

Additional Results. Since synthesis of regret-minimizing strategies against word strategies of Adam is undecidable with measure MP, we have considered the sub-case which limits the amount of memory the desired controller can use (as in [1]). That is, we ask whether there exists a strategy of Eve which uses memory at most $m$ and ensures regret at most $r$. In [33], we showed that this problem is in $\mathsf{NTime}(m^2 |\mathcal{A}|^2)$ for MP.

Theorem 3 ([33]). Let $\mathcal{A} = \langle S_\exists, S_\forall, E, s_{\mathsf{init}}, w \rangle$ be a weighted game arena. The complexity of deciding if Eve has a strategy using memory of at most $m$ with regret less than or equal to a threshold $\lambda \in \mathbb{Q}$ against Adam playing a strategy in $\Sigma^{\mathsf{W}}_\forall$ is in non-deterministic polynomial time w.r.t. $m$ and $|\mathcal{A}|$ for inf, sup, liminf, limsup, and MP.
Finally, we have established the equivalence of a quantitative extension of the notion of good-for-games automata [32] with determinization-by-pruning of the refinement of an automaton [1] and our regret games against word strategies of Adam. Before we can formally state these results, some definitions are needed. Definition 10 (Weighted Automata). A finite weighted automaton is a tuple Q,q init,a,,w where: Q is a finite set of states, q init Q is the initial state, A is a finite alphabet of actions or symbols, Q A Q is the transition relation, and w : Z is the weight function. Arun ofanautomatononaworda A ω is aninfinite sequenceoftransitions ρ = (q 0,a 0,q 1 )(q 1,a 1,q 2 ) ω such that q 0 = q init and a i = a(i) for all i 0. As with plays in a game, each run is assigned a value with a payoff function Val( ). A weighted automaton M defines a function A ω R by assigning to 12

$a \in A^\omega$ the supremum over all the values of its runs on $a$. The automaton is said to be deterministic if for all $q \in Q$ and $x \in A$ the set $\{q' \in Q \mid (q, x, q') \in \Delta\}$ is a singleton.

In [32], Henzinger and Piterman introduced the notion of good-for-games automata. A non-deterministic automaton is good for solving games if it fairly simulates the equivalent deterministic automaton.

Definition 11 (α-good-for-games). A finite weighted automaton $M$ is α-good-for-games if a player (Simulator), against any word $x \in A^\omega$ spelled by Spoiler, can resolve non-determinism in $M$ so that the resulting run has value $v$ and $M(x) - v \le \alpha$.

The above definition is a quantitative generalization of the notion proposed in [32]. We link their class of automata with our regret games in the sequel.

Proposition 1 ([33]). A weighted automaton $M = (Q, q_{\mathrm{init}}, A, \Delta, w)$ is α-good-for-games if and only if there exists a strategy $\sigma$ for Eve with relative regret of at most $\alpha$ against strategies $\Sigma^{\mathrm{W}}_\forall$ of Adam.

Our definitions also suggest a natural notion of approximate determinization for weighted automata on infinite words. This is related to recent work by Aminof et al.: in [1], they introduce the notion of approximate-determinization-by-pruning for weighted sum automata over finite words. For $\alpha \in (0, 1]$, a weighted sum automaton is α-determinizable-by-pruning if there exists a finite-state strategy that resolves non-determinism and constructs a run whose value is at least $\alpha$ times the value of the maximal run on the given word. So, they consider a notion of approximation which is a ratio.

Let us introduce some additional definitions required to formalize the notion of determinizable-by-pruning. Consider two weighted automata $M = (Q, q_{\mathrm{init}}, A, \Delta, w)$ and $M' = (Q', q_{\mathrm{init}}, A, \Delta', w')$. We say that $M'$ α-approximates $M$ if $|M(x) - M'(x)| \le \alpha$ for all $x \in A^\omega$. We say that $M$ embodies $M'$ if $Q' \subseteq Q$, $\Delta' \subseteq \Delta$, and $w'$ agrees with $w$ on $\Delta'$. For an integer $k \ge 0$, the $k$-refinement of $M$ is the automaton obtained by refining the state-space of $M$ using $k$ boolean variables.
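The two structural conditions above (determinism and "embodies") are purely syntactic, so they can be checked directly on a finite representation of the automaton. The following is a minimal sketch under stated assumptions: the class `WeightedAutomaton`, the function names, and the three-state example are hypothetical and not taken from [33] or [1]. The transition relation is stored as the key set of the weight map. The sketch does not compute the value of runs and does not perform any k-refinement.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Transition = Tuple[str, str, str]  # (source state, letter, target state)


@dataclass
class WeightedAutomaton:
    """Hypothetical encoding: the keys of `weights` form the transition relation."""
    states: FrozenSet[str]
    init: str
    alphabet: FrozenSet[str]
    weights: Dict[Transition, int]


def is_deterministic(m: WeightedAutomaton) -> bool:
    """True iff every (state, letter) pair has exactly one successor."""
    for q in m.states:
        for x in m.alphabet:
            successors = {t for (p, a, t) in m.weights if p == q and a == x}
            if len(successors) != 1:
                return False
    return True


def embodies(m: WeightedAutomaton, m_sub: WeightedAutomaton) -> bool:
    """True iff m_sub's states and transitions are contained in m's
    and both weight functions agree on m_sub's transitions."""
    return m_sub.states <= m.states and all(
        t in m.weights and m.weights[t] == w for t, w in m_sub.weights.items()
    )


# Hypothetical example: q0 has two a-successors, so M is non-deterministic.
STATES = frozenset({"q0", "q1", "q2"})
M = WeightedAutomaton(STATES, "q0", frozenset({"a"}), {
    ("q0", "a", "q1"): 2,
    ("q0", "a", "q2"): 0,
    ("q1", "a", "q1"): 1,
    ("q2", "a", "q2"): 0,
})
# Pruning one of the two choices at q0 yields a deterministic automaton.
M_PRUNED = WeightedAutomaton(STATES, "q0", frozenset({"a"}), {
    t: w for t, w in M.weights.items() if t != ("q0", "a", "q2")
})
```

In this example `M` embodies `M_PRUNED`, and `M_PRUNED` is deterministic. Whether it also α-approximates `M` depends on the payoff function, which the sketch leaves out.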
Definition 12 ((α, k)-determinizable-by-pruning). A finite weighted automaton $M$ is (α, k)-determinizable-by-pruning if the $k$-refinement of $M$ embodies a deterministic automaton which α-approximates $M$.

We show in [33] that when Adam plays word strategies only, our notion of regret defines a notion of approximation with respect to the difference metric for weighted automata (as defined above).

Proposition 2 ([33]). A weighted automaton $M = (Q, q_{\mathrm{init}}, A, \Delta, w)$ is α-determinizable-by-pruning if and only if there exists a strategy $\sigma$ for Eve using memory at most $2^m$ with relative regret of at most $\alpha$ against strategies $\Sigma^{\mathrm{W}}_\forall$ of Adam.

Related Works

The notion of regret minimization is important in game and decision theory; see, e.g., [46] and the additional bibliographical pointers there. The concept of iterated regret minimization has recently been proposed by Halpern et al. for non-zero sum games [31]. In [29], the concept is applied to games played on weighted graphs with shortest path objectives. Variants on the different sets of strategies considered for Adam were not considered there.

In [24], Damm and Finkbeiner introduce the notion of remorse-free strategies. The notion is introduced in order to define a notion of best-effort strategy when winning strategies do not exist. Remorse-free strategies are exactly the strategies which minimize regret in games with ω-regular objectives in which the environment (Adam) is playing word strategies only. The authors of [24] do not establish lower bounds on the complexity of the realizability and synthesis problems for remorse-free strategies.

A concept equivalent to good-for-games automata is that of history-determinism [23]. Proposition 1 thus allows us to generalize history-determinism to a quantitative setting via this relationship with good-for-games automata.

Finally, we would like to highlight some differences between our work and the study of Aminof et al. in [1] on determinization-by-pruning. First, we consider infinite words while they consider finite words. Second, we study a general regret minimization problem in which Eve can use any strategy, while they restrict their study to fixed-memory strategies only and leave the problem open when the memory is not fixed a priori.

6 Game Arenas with Expected Adversary

In the two previous sections we have relaxed the worst-case hypothesis on the environment (modeled by the behavior of Adam) by either considering an explicit objective for the environment or by considering as yardsticks the best-responses to the strategies of Adam.
Here, we introduce another model where the environment is modeled as a stochastic process (i.e., Adam is expected to play according to some known randomized strategy), and we look for strategies for Eve that ensure a good expectation against this stochastic process while guaranteeing acceptable worst-case performance even if Adam deviates from his expected behavior.

To formally define this new framework, we need game arenas in which an expected behavior for Adam is given as a memoryless randomized strategy.$^7$ We first introduce some notation. Given a set $A$, let $\mathcal{D}(A)$ denote the set of rational probability distributions over $A$, and, for $d \in \mathcal{D}(A)$, we denote its support by $\mathrm{Supp}(d) = \{a \in A \mid d(a) > 0\} \subseteq A$.

$^7$ It should be noted that we can easily consider finite-memory randomized strategies for Adam, instead of memoryless randomized strategies. This is because we can always take the synchronized product of a finite-memory randomized strategy with the game arena to obtain a new game arena in which the finite-memory strategy on the original game arena is now equivalent to a memoryless strategy.

Fig. 4. A game arena associated with a memoryless randomized strategy for Adam can be seen as an MDP: the fractions represent the respective probability to take each outgoing edge when leaving state 3.

Definition 13. Fix a weighted game arena $\mathcal{A} = (S_\exists, S_\forall, E, s_{\mathrm{init}}, w)$. A memoryless randomized strategy for Adam is a function $\sigma^{\mathrm{rnd}} : S_\forall \to \mathcal{D}(S)$ such that for all $s \in S_\forall$, $\mathrm{Supp}(\sigma^{\mathrm{rnd}}(s)) \subseteq \{s' \in S \mid (s, s') \in E\}$.

For the rest of this section, we model the expected behavior of Adam with a strategy $\sigma^{\mathrm{rnd}}$, given as part of the input for the problem we will consider. Given a weighted game arena $\mathcal{A}$ and a memoryless randomized strategy $\sigma^{\mathrm{rnd}}$ for Adam, we are left with a model with both non-deterministic choices (for Eve) and stochastic transitions (due to the randomized strategy of Adam). This is essentially what is known in the literature as a 1½-player game or, more commonly, a Markov Decision Process (MDP); see for example [39,27]. One can talk about plays, strategies and other notions in MDPs as introduced for games.

Consider the game in Fig. 4. We can see it as a classical two-player game if we forget about the fractions around state 3. Now assume that we fix the memoryless randomized strategy $\sigma^{\mathrm{rnd}}$ for Adam to be the one that, from 3, goes to 1 with probability $\frac{9}{10}$ and to 2 with the remaining probability, $\frac{1}{10}$. This is represented by the fractions on the corresponding outgoing edges. In the remaining model, only Eve still has to pick a strategy: it is an MDP. We denote this MDP by $\mathcal{A}[\sigma^{\mathrm{rnd}}]$.

Let us go one step further. Assume now that Eve also picks a strategy $\sigma$ in this MDP. Now we obtain a fully stochastic process called a Markov Chain (MC). We denote it by $\mathcal{A}[\sigma, \sigma^{\mathrm{rnd}}]$. In an MC, an event is a measurable set of plays. It is well-known from the literature [44] that every event has a uniquely defined probability (Carathéodory's extension theorem induces a unique probability measure on the Borel σ-algebra over plays in the MC).
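To make the passage from arena to MDP to Markov chain concrete, here is a minimal sketch of how the transition probabilities of $\mathcal{A}[\sigma, \sigma^{\mathrm{rnd}}]$ could be assembled when Eve's strategy $\sigma$ is memoryless. It rests on several assumptions: the function name and the data encoding are hypothetical, and the edge set is only the one implied by the edges mentioned in the text for Fig. 4 (the figure itself is not reproduced here). Weights are ignored, since they do not affect the transition structure.

```python
from fractions import Fraction


def induced_markov_chain(edges, eve_states, sigma, sigma_rnd):
    """Return {state: {successor: probability}} for A[sigma, sigma_rnd].

    edges      -- set of pairs (s, t)
    eve_states -- states controlled by Eve; all other states belong to Adam
    sigma      -- memoryless strategy of Eve: state -> successor
    sigma_rnd  -- memoryless randomized strategy of Adam: state -> distribution
    """
    successors = {}
    for s, t in edges:
        successors.setdefault(s, set()).add(t)

    chain = {}
    for s, succ in successors.items():
        if s in eve_states:
            if sigma[s] not in succ:
                raise ValueError(f"Eve's choice at {s} is not an edge")
            chain[s] = {sigma[s]: Fraction(1)}
        else:
            dist = sigma_rnd[s]
            support = {t for t, p in dist.items() if p > 0}
            # Definition 13: the support must consist of successors of s.
            if not support <= succ or sum(dist.values()) != 1:
                raise ValueError(f"invalid distribution at {s}")
            chain[s] = {t: Fraction(dist[t]) for t in support}
    return chain


# Assumed shape of the arena of Fig. 4: Eve owns states 1 and 2, Adam owns 3.
EDGES = {(1, 1), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2)}
EVE_STATES = {1, 2}
SIGMA_RND = {3: {1: Fraction(9, 10), 2: Fraction(1, 10)}}
```

Calling `induced_markov_chain(EDGES, EVE_STATES, {1: 2, 2: 3}, SIGMA_RND)` gives the chain in which Eve always moves 1 → 2 and 2 → 3, while state 3 keeps Adam's probabilities 9/10 and 1/10.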
Given a set $E$ of plays in $M = \mathcal{A}[\sigma, \sigma^{\mathrm{rnd}}]$, we denote by $\mathbb{P}_M(E)$ the probability that a play belongs to $E$ when $M$ is executed for an infinite number of steps. Given a measurable value function $\mathrm{Val}$, we denote by $\mathbb{E}_M(\mathrm{Val})$ the expected value or expectation of $\mathrm{Val}$ over plays in $M$. In this paper, we focus on the mean-payoff function MP.

We are now finally equipped to formally define the problem under study.

Definition 14 (Beyond Worst-Case Synthesis). Given a weighted game arena $\mathcal{A}$, a stochastic model of Adam given as a memoryless randomized strategy

$\sigma^{\mathrm{rnd}}$, and two rational thresholds $\lambda_{wc}, \lambda_{exp}$, decide if there exists a strategy $\sigma$ for Eve such that

\[
\begin{cases}
\forall \pi \in \mathrm{Outcome}^{s_{\mathrm{init}}}_{\mathcal{A}}(\sigma),\ \mathrm{Val}(\pi) > \lambda_{wc} \\
\mathbb{E}_{\mathcal{A}[\sigma, \sigma^{\mathrm{rnd}}]}(\mathrm{Val}) > \lambda_{exp}
\end{cases}
\]

and synthesize such a strategy if one exists.

Intuitively, we are looking for strategies that can simultaneously guarantee a worst-case performance higher than $\lambda_{wc}$, i.e., against any behavior of Adam in the game $\mathcal{A}$, and guarantee an expectation higher than $\lambda_{exp}$ when faced with the expected behavior of Adam, i.e., when played in the MDP $\mathcal{A}[\sigma^{\mathrm{rnd}}]$. We can of course assume w.l.o.g. that $\lambda_{wc} < \lambda_{exp}$; otherwise the problem reduces trivially to just a worst-case requirement: any lower bound on the worst-case value is also a lower bound on the expected value.

Example 7. Consider the arena depicted in Fig. 4. As mentioned before, the probability distribution models the expected behavior of Adam. Assume that we now want to synthesize a strategy for Eve which ensures that ($C_1$) the mean-payoff will be at least $\frac{1}{3}$ no matter how Adam behaves (worst-case guarantee), and ($C_2$) at least $\frac{3}{2}$ if Adam plays according to his expected behavior (good expectation).

First, let us study whether this can be achieved through the two classical solution concepts used in games and MDPs respectively. We start by considering the arena as a traditional two-player zero-sum game: in this case, it is known that an optimal memoryless strategy exists [25]. Let $\sigma^{wc}$ be the strategy of Eve that always plays $1 \to 1$ and $2 \to 1$. That strategy maximizes the worst-case mean-payoff, as it enforces a mean-payoff of 1 no matter how Adam behaves. Thus, ($C_1$) is satisfied. Observe that if we consider the arena as an MDP (i.e., taking the probabilities into account), this strategy yields an expected value of 1, as the unique possible play from state 1 is to take the self-loop forever. Hence this strategy does not satisfy ($C_2$).

Now, consider the arena as an MDP.
Again, it is known that the expected value can be maximized by a memoryless strategy [39,27]. Let $\sigma^{exp}$ be the strategy of Eve that always chooses the following edges: $1 \to 2$ and $2 \to 3$. Its expected mean-payoff can be calculated in two steps: first, computing the probability vector that represents the limiting stationary distribution of the irreducible MC induced by this strategy; second, multiplying it by the vector containing the expected weights over outgoing edges for each state. In this case, it can be shown that the expected value is equal to $\frac{54}{29}$, hence the strategy does satisfy ($C_2$). Unfortunately, it is clearly not acceptable for ($C_1$) as, if Adam does not behave according to the stochastic model and always chooses to play $3 \to 2$, the mean-payoff will be equal to zero.

Hence this shows that the classical solution concepts do not suffice if one wants to go beyond the worst-case and mix guarantees on the worst-case and the expected performance of strategies. In contrast, with the framework developed in [13,12], it is indeed possible for the considered arena (Fig. 4) to build a strategy for Eve that ensures the worst-case constraint ($C_1$) and, at the same time,

yields an expected value arbitrarily close to the optimal expectation achieved by strategy $\sigma^{exp}$. In particular, one can build a finite-memory strategy that guarantees both ($C_1$) and ($C_2$). The general form of such strategies is a combination of $\sigma^{exp}$ and $\sigma^{wc}$ in a well-chosen pattern. Let $\sigma^{cmb(K,L)}$ be a combined strategy parameterized by two integers $K, L \in \mathbb{N}$. The strategy is as follows.

1. Play according to $\sigma^{exp}$ for $K$ steps.
2. If the mean-payoff over the last $K$ steps is larger than the worst-case threshold $\lambda_{wc}$ (here $\frac{1}{3}$), then go to phase 1.
3. Otherwise, play according to $\sigma^{wc}$ for $L$ steps, and then go to phase 1.

Intuitively, the strategy starts by mimicking $\sigma^{exp}$ for a long time, and the witnessed mean-payoff over the $K$ steps will be close to the optimal expectation with high probability. Thus, with high probability it will be higher than $\lambda_{exp}$, and therefore higher than $\lambda_{wc}$ (recall that we assumed $\lambda_{wc} < \lambda_{exp}$). If this is not the case, then Eve has to switch to $\sigma^{wc}$ for sufficiently many steps $L$ in order to make sure that the worst-case constraint ($C_1$) is satisfied before switching back to $\sigma^{exp}$.

One of the key results of [13] is to show that for any $\lambda_{wc} < \mu$, where $\mu$ denotes the optimal worst-case value guaranteed by $\sigma^{wc}$, and for any expected value threshold $\lambda_{exp} < \nu$, where $\nu$ denotes the optimal expected value guaranteed by $\sigma^{exp}$, it is possible to compute values for $K$ and $L$ such that $\sigma^{cmb(K,L)}$ satisfies the beyond worst-case constraint for thresholds $\lambda_{wc}$ and $\lambda_{exp}$. For instance, in the example, where $\lambda_{wc} = \frac{1}{3} < 1$ and $\lambda_{exp} = \frac{3}{2} < \frac{54}{29}$, one can compute appropriate values of the parameters following the technique presented in [13, Theorem 5].
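Two ingredients of Example 7 lend themselves to a short illustration: the two-step computation of the expected mean-payoff of $\sigma^{exp}$, and the phase structure of the combined strategy. The sketch below is illustrative only. All function and class names are hypothetical. The transition matrix follows the text (Eve plays 1 → 2 and 2 → 3, Adam leaves state 3 with probabilities 9/10 and 1/10), but the edge weights in `W_HYP` are an assumption: the weights of Fig. 4 are not recoverable from the text, so they were chosen only to be consistent with the two values it quotes (expectation 54/29, and mean-payoff 0 on the cycle between states 2 and 3). The sketch does not compute suitable values of $K$ and $L$; for that, see [13, Theorem 5].

```python
from fractions import Fraction


def stationary_distribution(P):
    """Solve pi * P = pi, sum(pi) = 1 exactly, for an irreducible chain."""
    n = len(P)
    # Row j encodes the balance equation sum_i pi_i * P[i][j] - pi_j = 0.
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(n)]
         for j in range(n)]
    b = [Fraction(0)] * n
    # The normalisation constraint replaces one (redundant) balance equation.
    A[-1], b[-1] = [Fraction(1)] * n, Fraction(1)
    for col in range(n):  # Gauss-Jordan elimination over the rationals
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] *= inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b


def expected_mean_payoff(P, W):
    """Stationary distribution times the expected outgoing weight per state."""
    pi, n = stationary_distribution(P), len(P)
    return sum(pi[i] * sum(Fraction(P[i][j]) * W[i][j] for j in range(n))
               for i in range(n))


# States 1, 2, 3 are stored at indices 0, 1, 2.
P_EXP = [[0, 1, 0], [0, 0, 1], [Fraction(9, 10), Fraction(1, 10), 0]]
W_HYP = [[0, 1, 0], [0, 0, 0], [5, 0, 0]]  # hypothetical weights (assumption)


class CombinedStrategy:
    """Finite-memory controller alternating between sigma_exp and sigma_wc."""

    def __init__(self, sigma_exp, sigma_wc, K, L, lambda_wc):
        self.sigma_exp, self.sigma_wc = sigma_exp, sigma_wc
        self.K, self.L, self.lambda_wc = K, L, Fraction(lambda_wc)
        self.mode, self.steps, self.window_sum = "exp", 0, Fraction(0)

    def choose(self, state):
        """Eve's move at one of her states, according to the current phase."""
        active = self.sigma_exp if self.mode == "exp" else self.sigma_wc
        return active[state]

    def observe(self, weight):
        """Record the weight of the edge just taken and update the phase."""
        self.steps += 1
        if self.mode == "exp":
            self.window_sum += weight
            if self.steps == self.K:
                good = self.window_sum / self.K > self.lambda_wc
                self.mode = "exp" if good else "wc"
                self.steps, self.window_sum = 0, Fraction(0)
        elif self.steps == self.L:
            self.mode, self.steps = "exp", 0
```

With these assumed weights, `expected_mean_payoff(P_EXP, W_HYP)` evaluates to `Fraction(54, 29)`, matching the value stated in the example. The controller uses memory proportional to $K + L$ plus the running sum, which is one simple way to realize the finite-memory pattern described above.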
The crux is proving that, for large enough values of $K$ and $L$, the contribution to the expectation of the phases when $\sigma^{cmb(K,L)}$ mimics $\sigma^{wc}$ is negligible, and thus the expected value yielded by $\sigma^{cmb(K,L)}$ tends to the optimal one given by $\sigma^{exp}$, while at the same time the strategy ensures that the worst-case constraint is met.

In the next theorem, we sum up some of the main results that we have obtained for the beyond worst-case synthesis problem applied to the mean-payoff value function.

Theorem 4 ([13,12,22]). The beyond worst-case synthesis problem for the mean-payoff is in NP $\cap$ coNP, and at least as hard as deciding the winner in two-player zero-sum mean-payoff games, both when looking for finite-memory and for infinite-memory strategies of Eve. When restricted to finite-memory strategies, pseudo-polynomial memory is both sufficient and necessary.

The NP $\cap$ coNP-membership is good news as it matches the long-standing complexity barrier for two-player zero-sum mean-payoff games [25,47,11,17]: the beyond worst-case framework offers additional modeling power for free in terms of decision complexity. It is also interesting to note that in general, infinite-memory strategies are more powerful than finite-memory ones in the beyond

worst-case setting, which is not the case for the classical problems in games and MDPs.

Looking carefully at the techniques from [13,12], it can be seen that the main bottleneck in complexity is solving mean-payoff games in order to check whether the worst-case constraint can be met. Therefore, a natural relaxation of the problem is to consider the beyond almost-sure threshold problem, where the worst-case constraint is softened by only asking that a threshold is satisfied with probability one against the stochastic model given as the strategy $\sigma^{\mathrm{rnd}}$ of Adam. In this case, the complexity is reduced.

Theorem 5 ([22]). The beyond almost-sure threshold problem for the mean-payoff is in PTime and finite-memory strategies are sufficient.

Related Works

We originally introduced the beyond worst-case framework in [13], where we studied both mean-payoff and shortest path objectives. This framework generalizes classical problems for two-player zero-sum games and MDPs. In mean-payoff games, optimal memoryless strategies exist and deciding the winner lies in NP $\cap$ coNP, while no polynomial algorithm is known [25,47,11,17]. For shortest path games, where we consider game graphs with strictly positive weights and try to minimize the accumulated cost to target, it can be shown that memoryless strategies also suffice, and the problem is in PTime [34]. In MDPs, optimal strategies for the expectation are studied in [39,27] for the mean-payoff and the shortest path: in both cases, memoryless strategies suffice and they can be computed in PTime.

While we saw that the beyond worst-case synthesis problem does not cost more than solving games for the mean-payoff, it is not the case anymore for the shortest path: we jump from PTime to a pseudo-polynomial-time algorithm. We proved in [13, Theorem 11] that the problem is inherently harder as it is NP-hard.

The beyond worst-case framework was extended to the multi-dimensional setting, where edges are fitted with vectors of integer weights, in [22].
The general case is proved to be coNP-complete.

Our strategies can be considered as strongly risk averse: they avoid at all cost outcomes that are below a given threshold (no matter what their probability is), and inside the set of those safe strategies, we maximize the expectation. Other notions of risk have been studied for MDPs: in [45], the authors want to find policies which minimize the probability (risk) that the total discounted rewards do not exceed a specified value (target); in [28], the authors want policies that achieve a specified value of the long-run limiting average reward at a specified probability level (percentile). The latter problem was recently extended significantly in the framework of percentile queries, which provide elaborate guarantees on the performance profile of strategies in multi-dimensional MDPs [41]. While all those strategies limit risk, they only ensure low probability for bad behaviors but do not ensure their absence; furthermore, they do not ensure good expectation either.

Another body of work is the study of strategies in MDPs that achieve a trade-off between the expectation and the variance over the outcomes (e.g., [6] for the

mean-payoff, [36] for the cumulative reward), giving a statistical measure of the stability of the performance. In our setting, we strengthen this requirement by asking for strict guarantees on individual outcomes, while maintaining an appropriate expected payoff.

A survey of rich behavioral models extending the classical approaches for MDPs, including the beyond worst-case framework presented here, was published in [42], with a focus on the shortest path problem.

References

1. B. Aminof, O. Kupferman, and R. Lampert. Reasoning about online algorithms with weighted automata. ACM Transactions on Algorithms,
2. C. Baier and J.-P. Katoen. Principles of model checking. MIT Press,
3. D. Berwanger. Admissibility in infinite games. In Proc. of STACS, LNCS 4393, pages Springer,
4. R. Bloem, R. Ehlers, S. Jacobs, and R. Könighofer. How to handle assumptions in synthesis. In Proc. of SYNT, EPTCS 157, pages 34-50,
5. A. Brandenburger, A. Friedenberg, and H. J. Keisler. Admissibility in games. Econometrica, 76(2),
6. T. Brázdil, K. Chatterjee, V. Forejt, and A. Kucera. Trading performance for stability in Markov decision processes. In Proc. of LICS, pages IEEE,
7. R. Brenguier, L. Clemente, P. Hunter, G. A. Pérez, M. Randour, J.-F. Raskin, O. Sankur, and M. Sassolas. Non-zero sum games for reactive synthesis. In Proc. of LATA, LNCS. Springer, To appear.
8. R. Brenguier, J.-F. Raskin, and O. Sankur. Assume-admissible synthesis. In Proc. of CONCUR, LIPIcs 42, pages Schloss Dagstuhl - LZI,
9. R. Brenguier, J.-F. Raskin, and M. Sassolas. The complexity of admissibility in omega-regular games. In Proc. of CSL-LICS, pages 23:1-23:10. ACM,
10. T. Brihaye, V. Bruyère, N. Meunier, and J.-F. Raskin. Weak subgame perfect equilibria and their application to quantitative reachability. In Proc. of CSL, LIPIcs 41, pages Schloss Dagstuhl - LZI,
11. L. Brim, J. Chaloupka, L. Doyen, R. Gentilini, and J.-F. Raskin. Faster algorithms for mean-payoff games. Formal Methods in System Design, 38(2):97-118,
12. V. Bruyère, E. Filiot, M. Randour, and J.-F. Raskin. Expectations or guarantees? I want it all! A crossroad between games and MDPs. In Proc. of SR, EPTCS 146, pages 1-8,
13. V. Bruyère, E. Filiot, M. Randour, and J.-F. Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. In Proc. of STACS, LIPIcs 25, pages Schloss Dagstuhl - LZI,
14. V. Bruyère, N. Meunier, and J.-F. Raskin. Secure equilibria in weighted games. In Proc. of CSL-LICS, pages 26:1-26:26. ACM,
15. K. Chatterjee, L. Doyen, E. Filiot, and J.-F. Raskin. Doomsday equilibria for omega-regular games. In Proc. of VMCAI, LNCS 8318, pages Springer,
16. K. Chatterjee, L. Doyen, and T. A. Henzinger. Quantitative languages. ACM Transactions on Computational Logic, 11(4),
17. K. Chatterjee, L. Doyen, M. Randour, and J.-F. Raskin. Looking at mean-payoff and total-payoff through windows. Information and Computation, 242:25-52,
