Multi-agent learning


1 Multi-agent learning: Reinforcement Learning
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on February 13th, 2012 at 21:42.

2 Reinforcement learning: motivation
Nash equilibria in repeated games are a static analysis. Dynamic analysis: how do (or should) players develop their strategies and behaviour in a repeated game? "Do": descriptive / economics; "should": normative / agent theory.
Reinforcement learning (RL) is a rudimentary learning technique:
1. RL is stimulus-response: it plays actions with the highest past payoff.
2. It is myopic: it is only interested in immediate success.
Reinforcement learning can be applied to learning in games. When computer scientists mention RL, they usually mean multi-state RL. Single-state RL already has interesting and theoretically important properties, especially when it is coupled to games.

3 Plan for today
Part I: Single-state RL. Parts of Ch. 2 of Sutton & Barto (1998), "Evaluative Feedback": ǫ-greedy, optimistic, value-based, proportional.
Part II: Single-state RL in games. First half of Ch. 2 of Peyton Young (2004), "Reinforcement and Regret". Past payoffs are summarised:
1. By average: $\frac{1}{n}(r_1 + \cdots + r_n)$.
2. With a discounted past: $\gamma^{n-1} r_1 + \gamma^{n-2} r_2 + \cdots + \gamma r_{n-1} + r_n$.
3. With an aspiration level (Sutton & Barto: "reference reward").
Part III: Convergence to dominant strategies. Beginning of Beggs (2005), "On the Convergence of Reinforcement Learning":

                #Players   #Actions   Result
    Theorem 1:     1          2       Pr(dominant action) → 1
    Theorem 2:     1         ≥ 2      Pr(sub-dominant actions) → 0
    Theorem 3:    ≥ 2        ≥ 2      Pr(dom) → 1, Pr(sub-dom) → 0

4 Part I: Single-state reinforcement learning

5 Exploration vs. exploitation
Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness? Strategies:
A. You make friends whenever (or wherever) possible. You could be called an explorer.
B. You stick to the nearest fellow student. You could be called an exploiter.
C. What most people do: first explore, then exploit.
We ignore:
1. How the quality of friendships is measured.
2. How changing personalities of friends (so-called "moving targets") are dealt with.

6 An array of N slot machines

7 Exploitation vs. exploration
Given: an array of N slot machines. Suppose the yield of every machine is normally distributed, with mean and variance unknown to us. Random questions:
1. How long do you stick with your first slot machine?
2. When do you leave the second?
3. If machine A so far yields more than machine B, would you ever explore B again?
4. Try many machines, or opt for security?

8 Experiment
[Figure omitted: sample yields of the slot machines, panels labelled "Yield Machine 1", etc.]

9 The N-armed bandit problem
Barto & Sutton: the N-armed bandit.

10 Computation of the quality (offline version)
A reasonable measure for the quality of a slot machine after n tries would be the average profit. Formula for the quality of a slot machine after n tries:
$$Q_n =_{\text{Def}} \frac{r_1 + \cdots + r_n}{n}$$
A simple formula, but: every time $Q_n$ is computed, all values $r_1, \ldots, r_n$ must be retrieved. The idea is to draw conclusions only once you have all the data. The data is processed in batch. Learning proceeds offline.

11 Computation of the quality (online version)
$$Q_n = \frac{r_1 + \cdots + r_n}{n} = \frac{r_1 + \cdots + r_{n-1}}{n} + \frac{r_n}{n} = \frac{n-1}{n}\cdot\frac{r_1 + \cdots + r_{n-1}}{n-1} + \frac{r_n}{n} = \frac{n-1}{n}\, Q_{n-1} + \frac{r_n}{n} = Q_{n-1} + \frac{1}{n}\,(r_n - Q_{n-1}).$$
In words: new value = old value + learning rate × (goal value − old value), where the bracketed difference is the error and the whole last term is the correction.
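The last identity is the standard incremental-mean update. Below is a minimal Python sketch (my own illustration, not part of the original slides; the function names are made up) contrasting the offline and online computations of $Q_n$:

    # Minimal sketch (not from the slides) contrasting offline and online
    # computation of the quality estimate Q_n of a single slot machine.

    def quality_offline(rewards):
        """Batch estimate: retrieve all stored rewards r_1..r_n and average them."""
        return sum(rewards) / len(rewards)

    def quality_online(q_prev, r_n, n):
        """Incremental estimate: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
        return q_prev + (r_n - q_prev) / n

    if __name__ == "__main__":
        rewards = [1.0, 0.0, 2.0, 1.5]
        q = 0.0                      # Q_0; overwritten entirely at n = 1
        for n, r in enumerate(rewards, start=1):
            q = quality_online(q, r, n)
        assert abs(q - quality_offline(rewards)) < 1e-12
        print(q)                     # 1.125

Only the running estimate and the counter n need to be stored, which is why learning can proceed online.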

12 Progress of the quality $Q_n$
The amplitude of the correction is determined by the learning rate. Here, the learning rate is $1/n$ and decreases through time.

13 Exploration: ǫ-greedy exploration
ǫ-greedy exploration. Let $0 < \epsilon \leq 1$ be close to 0.
1. Choose an optimal action a fraction $(1 - \epsilon)$ of the time.
2. At other times, choose a random action.
Item 1: exploitation. Item 2: exploration. With probability one, every action is explored infinitely many times. (Why?) Is it guaranteed that every action is explored infinitely many times?
It would be an idea to explore sub-optimal actions with a higher relative reward more often. However, that is not how ǫ-greedy exploration works, and we may lose convergence to optimal actions...
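A small Python sketch (my own, with illustrative names and a Gaussian bandit as the assumed payoff model) of ǫ-greedy action selection combined with the online quality update from slide 11:

    import random

    def epsilon_greedy(q_values, epsilon):
        """With probability 1 - epsilon exploit a currently optimal action,
        otherwise explore a uniformly random action."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))                              # exploration
        best = max(q_values)
        return random.choice([i for i, q in enumerate(q_values) if q == best])  # exploitation

    def run_bandit(true_means, steps=10_000, epsilon=0.1):
        """Sample-average epsilon-greedy on a stationary N-armed (Gaussian) bandit."""
        q = [0.0] * len(true_means)
        counts = [0] * len(true_means)
        for _ in range(steps):
            a = epsilon_greedy(q, epsilon)
            r = random.gauss(true_means[a], 1.0)    # yield with unknown mean and variance
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]          # online update of the quality
        return q, counts

    if __name__ == "__main__":
        print(run_bandit([0.2, 0.5, 1.0]))          # most pulls should go to the last machine

Because ǫ stays fixed, every action keeps a constant probability of being tried, which is why each action is explored infinitely often with probability one.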

14 Optimistic initial values
An alternative to ǫ-greedy is to work with optimistic initial values.
1. At the outset, an unrealistically high quality is attributed to every slot machine: $Q^k_0 = \text{high}$, for $1 \leq k \leq N$.
2. As usual, for every slot machine its average profit is maintained.
3. Without exception, always exploit machines with the highest Q-values.
Random questions:
q1: Initially, many actions are tried; are all actions tried?
q2: How high should "high" be?
q3: What to do in case of ties (more than one optimal machine)?
q4: Can we speak of exploration?
q5: Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?
q6: ǫ-greedy: Pr(every action is explored infinitely many times) = 1. Also with optimism?
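For comparison, a sketch (again my own, not from the slides) of purely greedy selection with optimistic initial values; exploration here comes only from the optimism being worn down by real samples:

    import random

    def run_optimistic(true_means, steps=10_000, q_init=10.0):
        """Always exploit, but start every machine at an unrealistically high quality."""
        q = [q_init] * len(true_means)          # optimistic initial values Q_0^k = high
        counts = [0] * len(true_means)
        for _ in range(steps):
            best = max(q)
            a = random.choice([i for i, v in enumerate(q) if v == best])   # ties at random (q3)
            r = random.gauss(true_means[a], 1.0)
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]      # each sample pulls the estimate towards reality
        return q, counts

    if __name__ == "__main__":
        print(run_optimistic([0.2, 0.5, 1.0]))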

15 Optimistic initial values vs. ǫ-greedy
From: (...), Sutton and Barto, Sec. 2.8, p. 41.

16 Maintaining and exploring friendships: strategies
ǫ-greedy: Spend most of your time with your best friends (greedy). Occasionally spend a fraction ǫ of your time exploring random friendships.
Optimistic: In the beginning, foster (unreasonably) high expectations of everyone. You will be disappointed many times. Adapt your expectations based on experience. Always spend time with your best friends.
Values (cf. Sutton & Barto): Let $0 < \alpha \ll 1$. In the beginning rate everyone with a 6, say. If a friendship rated r involves a new experience $e \in [0, 10]$, then for example $r_{\text{new}} = r_{\text{old}} + \text{sign}(e)\,\alpha$. (Watch the boundaries!) Another method: $r_{\text{new}} = (1-\alpha)\, r_{\text{old}} + \alpha\, e$.
Proportions: Give everyone equal attention in the beginning. If there is a positive experience, then give that person a little more attention in the future. (Similarly with negative experiences.)
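A sketch of the two value-update rules in Python (my own reading: the sign in the first rule is taken relative to the current rating, which is one plausible interpretation of the slide's sign(e)):

    def clip(x, lo=0.0, hi=10.0):
        """Watch the boundaries of the 0..10 rating scale."""
        return max(lo, min(hi, x))

    def update_fixed_step(r_old, e, alpha=0.1):
        """Move the rating a fixed step alpha towards the new experience e."""
        if e > r_old:
            return clip(r_old + alpha)
        if e < r_old:
            return clip(r_old - alpha)
        return r_old

    def update_ema(r_old, e, alpha=0.1):
        """Exponential moving average: r_new = (1 - alpha) * r_old + alpha * e."""
        return (1 - alpha) * r_old + alpha * e

    print(update_fixed_step(6.0, 9.0), update_ema(6.0, 9.0))   # 6.1  6.3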

17 Part II: Single-state reinforcement learning in games

18 Proportional techniques: basic setup
There are two players: A (the protagonist) and B (the antagonist, sometimes "nature"). Play proceeds in (possibly an infinite number of) rounds $1, \ldots, t, \ldots$. Identifiers X and Y denote finite sets of possible actions. Each round t, players A and B choose actions $x \in X$ and $y \in Y$, respectively, which produces a history of play
$$(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t), \ldots$$
A's payoff is given by a fixed function $u : X \times Y \to \mathbb{R}$. In other words, A's payoff matrix is known. It follows that payoffs over time are homogeneous, i.e., $(x_s, y_s) = (x_t, y_t) \Rightarrow u(x_s, y_s) = u(x_t, y_t)$.

19 Propensity, and mixed strategy of play
Let $t \geq 0$. The propensity of A to play x at t is denoted by $\theta^t_x$. A simple model of propensity is cumulative payoff matching (CPM):
$$\theta^{t+1}_x = \begin{cases} \theta^t_x + u(x, y) & \text{if } x \text{ is played at round } t, \\ \theta^t_x & \text{else.} \end{cases}$$
The vector of initial propensities, $\theta^0$, is not the result of play. As a vector: $\theta^{t+1} = \theta^t + u^t e^t$, where $e^t_x =_{\text{Def}}$ (x is played at t ? 1 : 0).
A plausible mixed strategy is to play x at round t with its normalised propensity at t: $(q^t_x)_{x \in X}$, where
$$q^t_x =_{\text{Def}} \frac{\theta^t_x}{\sum_{x' \in X} \theta^t_{x'}}.$$
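A minimal Python sketch (my own; the function names are illustrative) of cumulative payoff matching for player A:

    import random

    def cpm_choose(theta):
        """Play x with probability theta_x / sum(theta), i.e. the normalised propensity q_x."""
        total = sum(theta)
        r, acc = random.uniform(0.0, total), 0.0
        for x, t in enumerate(theta):
            acc += t
            if r <= acc:
                return x
        return len(theta) - 1            # guard against floating-point rounding

    def cpm_update(theta, x, payoff):
        """theta_x^{t+1} = theta_x^t + u(x, y); all other propensities stay the same."""
        theta[x] += payoff
        return theta

    # The initial propensities theta^0 are not the result of play; they are chosen
    # by the modeller (strictly positive values keep every action playable).
    theta = [1.0, 1.0, 1.0]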

20 An example
The total cumulative payoff at round t, the sum $\sum_{x \in X} \theta^t_x$, is abbreviated by $v^t$.
[Table omitted: the propensities $\theta^{15}_x$ for each action x and the total $v^{15}$ after 15 rounds.]
Remarks: It is the cumulative payoff from each action that matters, not the average payoff. (There is a difference!) In this example, it is assumed that the initial propensities $\theta^0_x$ are one. In general, they could be anything. (But $\theta^0 = 0$ is not very useful.) Alternatively, as a scalar: $v^t = \sum_{x \in X} \theta^0_x + \sum_{s \leq t} u^s$.

21 Dynamics of the mixed strategy
We can obtain further insight into the dynamics of the process through the change of the mixed strategy:
$$\Delta q^t_x = q^t_x - q^{t-1}_x = \frac{\theta^t_x}{v^t} - \frac{\theta^{t-1}_x}{v^{t-1}} = \frac{v^{t-1}\theta^t_x - v^t\theta^{t-1}_x}{v^{t-1} v^t} = \frac{v^{t-1}(\theta^{t-1}_x + e^t_x u^t) - (v^{t-1} + u^t)\theta^{t-1}_x}{v^{t-1} v^t} = \frac{v^{t-1} e^t_x u^t - u^t \theta^{t-1}_x}{v^{t-1} v^t} = \frac{u^t}{v^t}\left(e^t_x - \frac{\theta^{t-1}_x}{v^{t-1}}\right) = \frac{u^t}{v^t}\left(e^t_x - q^{t-1}_x\right).$$

22 Dynamics of the mixed strategy: convergence
The dynamics of the mixed strategy in round t is given by $\Delta q^t = \frac{u^t}{v^t}(e^t - q^{t-1})$; on coordinate x: $\Delta q^t_x = \frac{u^t}{v^t}(e^t_x - q^{t-1}_x)$. We have:
$$\|\Delta q^t\| = \left\|\frac{u^t}{v^t}(e^t - q^{t-1})\right\| = \frac{u^t}{v^t}\,\|e^t - q^{t-1}\| \leq \frac{u^t}{v^t}\cdot 2 \leq \frac{2\,\max\{u^s \mid s \leq t\}}{t\,\min\{u^s \mid s \leq t\}} = C\,\frac{1}{t}.$$
Since all terms except $v^t$ are bounded, $\lim_{t\to\infty} \Delta q^t = 0$. (Note: a vanishing change per round does not by itself imply convergence of $q^t$.) Does $q^t$ converge? If so, to the right (e.g., a Pareto-optimal) strategy? Beggs (2005) provides more clarity in certain circumstances.

23 Abstraction of past payoffs: $t^p$
In 1991 and 1993, B. Arthur proposed an update formula in which the propensities are rescaled after every round so that their total grows as $C\,t^p$ (for constants C > 0 and p). Consequently,
$$\Delta q^t = \frac{u^t}{C\,t^p + u^t}\,(e^t - q^{t-1}).$$
Remarks: Arthur's notation differs considerably from that of Peyton Young (2004). If the parameter p is set to, e.g., 2, then there is convergence. However... in related research, where the value of p is determined through psychological experiments, it is estimated that p < 1.
B. Arthur (1993): "On Designing Economic Agents that Behave Like Human Agents". In: Journal of Evolutionary Economics 3, pp.

24 Past payoffs at discount rate λ < 1
In 1995, Erev and Roth proposed the following update formula:
$$\theta^{t+1} = \lambda\,\theta^t + u^t e^t.$$
Consequently (for simplicity, we assume $\theta^0 = 0$),
$$\Delta q^t = \frac{u^t}{\sum_{s \leq t} \lambda^{t-s} u^s}\,(e^t - q^{t-1}).$$
Since
$$\Big(\sum_{s \leq t} \lambda^{t-s}\Big)\,\min\{u^s \mid s \leq t\} \;\leq\; \sum_{s \leq t} \lambda^{t-s} u^s \;\leq\; \Big(\sum_{s \leq t} \lambda^{t-s}\Big)\,\max\{u^s \mid s \leq t\}$$
and since $1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1} = \frac{1-\lambda^t}{1-\lambda}$ for $\lambda \neq 1$, the mixed strategy tends to change at a rate $1-\lambda$.
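A sketch (my own) of the Erev-Roth update; the only change with respect to CPM is that all propensities are discounted by λ before the played action is reinforced:

    def erev_roth_update(theta, x, payoff, lam=0.95):
        """theta^{t+1} = lam * theta^t + u^t * e^t  (discount everything, reinforce x)."""
        theta = [lam * t for t in theta]     # discounted past payoffs
        theta[x] += payoff                   # only the action actually played is reinforced
        return theta

    def mixed_strategy(theta):
        """q_x^t = theta_x^t / sum(theta^t), as before."""
        total = sum(theta)
        return [t / total for t in theta]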

25 Past payoffs represented by an aspiration level
Assume an aspiration level $a^t \in \mathbb{R}$ at every round. (Intuition: a payoff with which one would be satisfied.) Idea:
$u^t_x > a^t$: positively reinforce action x;
$u^t_x < a^t$: negatively reinforce action x.
Correspondingly, the mixed strategy evolves according to
$$\Delta q^t = (u^t - a^t)\,(e^t - q^{t-1}).$$
Typical definitions for aspiration:
Average past payoffs: $a^t =_{\text{Def}} v^t / t$. A.k.a. satisficing play (Crandall, 2005).
Discounted past payoffs: $a^t =_{\text{Def}} \sum_{s \leq t} \lambda^{t-s} u^s$. (Erev & Roth, 1995.)
Börgers and Sarin (2000): "Naive Reinforcement Learning with Endogenous Aspirations". In: International Economic Review 41, pp.
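A sketch of the aspiration-based rule (my own transcription of the slide's update; the small step size and the renormalisation guard are my additions, needed to keep q a probability vector in a straightforward simulation):

    def aspiration_update(q, x_played, payoff, aspiration, step=0.01):
        """Delta q^t = (u^t - a^t) (e^t - q^{t-1}), scaled by a small step size."""
        e = [1.0 if x == x_played else 0.0 for x in range(len(q))]
        q_new = [qx + step * (payoff - aspiration) * (ex - qx) for qx, ex in zip(q, e)]
        q_new = [max(qx, 0.0) for qx in q_new]                 # guard: stay non-negative
        total = sum(q_new)
        return [qx / total for qx in q_new] if total > 0 else [1.0 / len(q)] * len(q)

    # Aspiration as average past payoff: a^t = v^t / t, with v^t the cumulative payoff so far.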

26 Adequacy of reinforcement learning
Does reinforcement learning lead to optimal behaviour against B? If A and B were both to converge to optimal behaviour, i.e., to a best response, this would yield a Nash equilibrium. Less demanding: does reinforcement learning converge to optimal behaviour in a stationary (and, perhaps, stochastic) environment?
A history is a finite sequence of actions $\xi^t$: $(x_1, y_1), \ldots, (x_t, y_t)$. A strategy is a function $g : H \to \Delta(X)$ that maps histories to probability distributions over X. Write $q^{t+1} =_{\text{Def}} g(\xi^t)$.

27 Optimality against stationary opponents
Assume that B plays a fixed probability distribution $q^* \in \Delta(Y)$. The combination of $\theta^0$, g and $q^*$ yields a realisation $\omega = (x_1, y_1), \ldots, (x_t, y_t), \ldots$. Define $B(q^*) =_{\text{Def}} \{x \in X \mid x \text{ is a best response to } q^*\}$.
Definition. A strategy g is called optimal against $q^*$ if, with probability one, for all $x \notin B(q^*)$:
$$\lim_{t\to\infty} q^t_x = 0 \qquad (1)$$
In this case, the phrase "with probability one" means that almost all realisations (all but a set of measure zero) satisfy (1).
Theorem. Given finite action sets X and Y, cumulative payoff matching on X is optimal against every stationary distribution on Y. Peyton Young (2004, p. 17): "Its proof is actually quite involved (...)."
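The theorem can be illustrated numerically. A sketch (my own; the payoff matrix and the opponent's distribution are made-up examples with payoffs bounded away from zero) of CPM against a stationary opponent in a 2x2 game:

    import random

    U = [[3.0, 1.0],        # A's payoffs u(x, y): rows are A's actions, columns B's actions
         [2.0, 2.0]]
    q_star = [0.7, 0.3]     # B's fixed mixed strategy; A's best response is action 0 (2.4 > 2.0)

    def sample(p):
        return 0 if random.random() < p[0] else 1

    theta = [1.0, 1.0]      # strictly positive initial propensities
    for t in range(200_000):
        total = sum(theta)
        x = sample([theta[0] / total, theta[1] / total])   # cumulative payoff matching
        y = sample(q_star)
        theta[x] += U[x][y]

    print(theta[0] / sum(theta))   # weight on the best response; tends towards 1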

28 Part III: Beggs, 2005

29 The learning model
Single-state proportional reinforcement learning (Erev & Roth, 1995). As usual:
$$A_i(n+1) = \begin{cases} A_i(n) + \pi_i(n+1) & \text{if action } i \text{ is chosen,} \\ A_i(n) & \text{else,} \end{cases} \qquad \Pr_i(n+1) = \frac{A_i(n)}{\sum_{j=1}^m A_j(n)}.$$
The following two assumptions are crucial:
1. All past, current and future payoffs $\pi_i(n)$ are bounded away from zero and bounded from above. More precisely, there are $0 < k_1 \leq k_2$ such that all payoffs are in $[k_1, k_2]$.
2. Initial propensities $A_i(0)$ are strictly positive.
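A sketch (my own) of this scheme with the two assumptions made explicit; the payoff distributions are invented for illustration and respect the bounds $[k_1, k_2]$:

    import random

    K1, K2 = 0.5, 2.0       # all payoffs lie in [k1, k2] with 0 < k1 <= k2

    def choose(A):
        """Pr_i(n+1) = A_i(n) / sum_j A_j(n)."""
        total = sum(A)
        r, acc = random.uniform(0.0, total), 0.0
        for i, a in enumerate(A):
            acc += a
            if r <= acc:
                return i
        return len(A) - 1

    def step(A, payoff_of):
        """One round: choose an action, observe its payoff, add it to the propensity."""
        i = choose(A)
        pi = payoff_of(i)
        assert K1 <= pi <= K2
        A[i] += pi

    if __name__ == "__main__":
        A = [1.0, 1.0]                                       # strictly positive A_i(0)
        payoff = lambda i: random.uniform(1.5, 2.0) if i == 0 else random.uniform(0.5, 1.0)
        for n in range(100_000):
            step(A, payoff)
        print(A[0] / sum(A))   # action 0 dominates, so this tends towards 1 (cf. Theorem 1 below)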

30 Choice of actions
Lemma 1. Each action is chosen infinitely often with probability one.
Proof. From the above assumptions it follows that
$$\Pr_i(n+1) = \frac{A_i(n)}{\sum_{j=1}^m A_j(n)} \geq \frac{A_i(0)}{A_i(0) + n k_2}.$$
(This is like a worst case for i: as if i was never chosen and in all previous n rounds the actions ≠ i received the maximum possible payoff.) Apply the so-called conditional Borel-Cantelli lemma:* if $\{E_n\}_n$ are events and $\sum_{n=1}^{\infty} \Pr(E_n \mid X_1, \ldots, X_{n-1})$ is unbounded, then the probability that an infinite number of the $E_n$ occur is one.
* A.k.a. the second Borel-Cantelli lemma, or the Borel-Cantelli-Lévy lemma (Shiryaev, p. 518).

31 Unboundedness of propensities, and convergence
Lemma 2. For each i, $A_i$ tends to infinity with probability one.
Proof. By Lemma 1, each action i is chosen infinitely often with probability one. Since the payoff per round is bounded from below by $k_1$, we have $A_i \geq \sum_{j} k_1 \to \infty$, where j runs over the rounds in which i is chosen.
Now Lemma 1 + Lemma 2 + martingale theory suffice to prove convergence. Suppose there are only two possible actions: $a_1$ and $a_2$. The expression $E[\pi(a_i) \mid \text{history}]$ denotes the expected payoff of action $a_i$, given the history of play up to and including the choice to play $a_i$ itself.
Theorem 1. If
$$E[\pi(a_1) \mid \text{history}] > \gamma\, E[\pi(a_2) \mid \text{history}] \qquad (2)$$
for some fixed $\gamma > 1$, then the probability that $a_1$ will be played converges to one.

32 Convergence to the dominant action: proof in a nutshell
If $a_1$ is dominant (as in Eq. 2), the objective is to show that
$$\frac{A_2}{A_1}(n) \to 0, \quad \text{a.s.}$$
To this end, it suffices to show that
$$\frac{A_2^{\epsilon}}{A_1}(n) \to C, \quad \text{a.s.} \qquad (3)$$
for some C and for some $1 < \epsilon < \gamma$ (which is possible, since $\gamma > 1$). Then
$$\lim_{n} \frac{A_2}{A_1}(n) = \lim_{n} \frac{A_2^{\epsilon}}{A_1} \cdot \frac{1}{A_2^{\epsilon-1}} = \lim_{n} \frac{A_2^{\epsilon}}{A_1} \cdot \lim_{n} \frac{1}{A_2^{\epsilon-1}} = C \cdot 0 = 0.$$
To this end, Beggs shows that, for some $n \in \mathbb{N}$ and for all $1 < \epsilon < \gamma$, the process $A_2^{\epsilon}/A_1$ in (3) is a so-called non-negative super-martingale. (Explained in a moment.) It is known that every non-negative super-martingale converges to a finite limit with probability one. (Explained in a moment.)

33 Super-martingale
A super-martingale is a stochastic process in which the conditional expectation of the next value, given the current and preceding values, is less than or equal to the current value:
$$E[Z_{n+1} \mid Z_1, \ldots, Z_n] \leq Z_n.$$
Think of an unfair gambling game that proceeds in rounds.
1. Expectations decrease. Taking expectations on both sides yields $E[Z_{n+1}] \leq E[Z_n]$.
2. Expectations converge. From (1) and the monotone convergence theorem* it follows that the expectations of a non-negative super-martingale converge to a limit L somewhere in $[0, E[Z_1]]$.
3. Values converge a.s. Doob's Martingale Convergence Theorem: let $\{Z_n\}_n$ be a martingale (or sub-martingale, or super-martingale) such that $E[|Z_n|]$ is bounded. Then $\lim_n Z_n$ exists a.s. and is finite.
* Ordinary mathematics (for monotone, bounded real sequences).

34 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
$$E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, \text{history}\right] = \Pr(1 \mid \text{history})\, E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, 1, \text{history}\right] + \Pr(2 \mid \text{history})\, E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, 2, \text{history}\right]$$
$$= \frac{A_1(n)}{A_1(n) + A_2(n)}\, E\!\left[\frac{A_2^{\epsilon}(n)}{A_1(n) + \pi_1(n+1)} - \frac{A_2^{\epsilon}(n)}{A_1(n)}\right] + \frac{A_2(n)}{A_1(n) + A_2(n)}\, E\!\left[\frac{(A_2(n) + \pi_2(n+1))^{\epsilon}}{A_1(n)} - \frac{A_2^{\epsilon}(n)}{A_1(n)}\right].$$

35 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Taylor expansion:
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x + \theta h),$$
where the last term is the Lagrange remainder, for some $\theta \in (0, 1)$. (Of course, there is nothing special about n = 4.) Applied to $f(x) = x^{-1}$ and n = 2 we obtain
$$(x+h)^{-1} = x^{-1} + h(-x^{-2}) + \frac{h^2}{2!}\left(2(x + \theta h)^{-3}\right) = x^{-1} - h x^{-2} + h^2 (x + \theta h)^{-3} = \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{(x + \theta h)^3}.$$

36 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
The Taylor expansion applied to $f(x) = x^{-1}$ with n = 2 yields
$$(x+h)^{-1} = x^{-1} - h x^{-2} + h^2 (x + \theta h)^{-3} = \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{(x + \theta h)^3}.$$
For non-negative x and h we have $(x + \theta h)^{-3} \leq x^{-3}$, so that
$$(x+h)^{-1} \leq \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{x^3}.$$
This first inequality puts an upper bound, with pure x and h, on
$$\frac{1}{A_1(n) + \pi_1(n+1)} \leq \frac{1}{A_1(n)} - \frac{\pi_1(n+1)}{A_1^2(n)} + \frac{\pi_1^2(n+1)}{A_1^3(n)}.$$
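A quick numerical sanity check (my own) of this first inequality over randomly sampled non-negative x and h:

    import random

    # Check (x + h)^(-1) <= 1/x - h/x**2 + h**2/x**3 for x > 0 and h >= 0.
    for _ in range(10_000):
        x = random.uniform(0.1, 100.0)
        h = random.uniform(0.0, 100.0)
        lhs = 1.0 / (x + h)
        rhs = 1.0 / x - h / x**2 + h**2 / x**3
        assert lhs <= rhs + 1e-12, (x, h)
    print("bound holds on all sampled points")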

37 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Similarly, applying the Taylor expansion to $f(x) = x^{\epsilon}$ with n = 2 yields
$$(x+h)^{\epsilon} = x^{\epsilon} + h\epsilon x^{\epsilon-1} + \frac{h^2}{2}\,\epsilon(\epsilon-1)(x + \theta h)^{\epsilon-2}.$$
For non-negative x and h and $\epsilon > 1$, we have $\tfrac{1}{2}(\epsilon-1)(x + \theta h)^{\epsilon-2} \leq C x^{\epsilon-2}$ for some constant C, so that
$$(x+h)^{\epsilon} \leq x^{\epsilon} + h\epsilon x^{\epsilon-1} + h^2 C \epsilon x^{\epsilon-2}.$$
This second inequality puts an upper bound, with pure x and h, on $(A_2(n) + \pi_2(n+1))^{\epsilon} \leq A_2^{\epsilon}(n) + \ldots$

38 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Using $E[aX + b] = aE[X] + b$ and factoring out common terms, Beggs obtains the upper bound
$$\frac{A_1}{A_1 + A_2}\,\frac{A_2^{\epsilon}}{A_1}\left[-\frac{E[\pi_1(n+1)]}{A_1(n)} + c_1 \frac{E[\pi_1(n+1)^2]}{A_1^2(n)}\right] + \frac{A_2}{A_1 + A_2}\,\frac{\epsilon A_2^{\epsilon}}{A_1 A_2}\left[E[\pi_2(n+1)] + c_2 \frac{E[\pi_2(n+1)^2]}{A_2(n)}\right].$$
Because payoffs are bounded, $E[\pi_1(\ldots)] > \gamma E[\pi_2(\ldots)]$ and $1 - \gamma < \epsilon - \gamma < 0$, constants $K_1, K_2, K_3 > 0$ can be found such that this is at most
$$\frac{A_2^{\epsilon}}{A_1 (A_1 + A_2)}\left(K_1(\epsilon - \gamma) + \frac{K_2}{A_1} + \frac{K_3}{A_2}\right)(n).$$
For $\epsilon \in (1, \gamma)$ and for n large enough, this expression is non-positive.

39 Generalisation of Beggs' Theorem 1, and application to games
Let there be $m \geq 2$ alternative actions, $a_1, \ldots, a_m$ (rather than m = 2).
Theorem 2. If the expected payoff (conditional on the history) of $a_i$ dominates the expected payoff (conditional on the history) of $a_j$ for all $j \neq i$, then the probability that $a_j$ will be played converges to zero, for all $j \neq i$.
Applied to games:
Theorem 3. In a game with finitely many actions and players, if a player learns according to the ER scheme, then:
a. With probability 1, the probability and empirical frequency with which he plays any action that is strictly dominated by another pure strategy converge to zero.
b. Hence, if he has a strictly dominant strategy, then with probability 1 the probability and empirical frequency with which he plays that action converge to 1. (Beggs, 2005.)

40 Summary
There are several rules for reinforcement learning on single states. Sheer convergence is often easy to prove. To prove convergence to best actions in stationary environments is much more difficult. Convergence to best actions in non-stationary environments, e.g., convergence to dominant actions, or to best responses in self-play, is state-of-the-art research.

41 What next?
No-regret learning: this is a generalisation of reinforcement learning. No-regret =_Def play those actions that would have been successful in the past.
Similarities with reinforcement learning:
1. Driven by past payoffs.
2. Not interested in (the behaviour of) the opponent.
3. Myopic.
Differences:
a) Keeping accounts of hypothetical actions rests on the assumption that a player is able to estimate payoffs of actions that were not actually played. [Knowledge of the payoff matrix definitely helps, but is an even more severe assumption.]
b) It is a bit easier to obtain results regarding performance.
