Multi-agent learning


1 Multi-agent learning: Reinforcement Learning
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on February 13th, 2012 at 21:42.

2 Reinforcement learning: motivation
Nash equilibria in repeated games are a static analysis. Dynamic analysis: how do (or should) players develop their strategies and behaviour in a repeated game? "Do": descriptive / economics; "should": normative / agent theory.
Reinforcement learning (RL) is a rudimentary learning technique:
1. RL is stimulus-response: it plays actions with the highest past payoff.
2. It is myopic: it is only interested in immediate success.
Reinforcement learning can be applied to learning in games. When computer scientists mention RL, they usually mean multi-state RL. Single-state RL already has interesting and theoretically important properties, especially when it is coupled to games.

3 Plan for today
Part I: Single-state RL. Parts of Ch. 2 of Sutton & Barto (1998), "Evaluative Feedback": ǫ-greedy, optimistic, value-based, proportional.
Part II: Single-state RL in games. First half of Ch. 2 of Peyton Young (2004), "Reinforcement and Regret". Past payoffs are summarised:
1. By average: $\frac{1}{n}(r_1 + \cdots + r_n)$.
2. With a discounted past: $\gamma^{n-1} r_1 + \gamma^{n-2} r_2 + \cdots + \gamma r_{n-1} + r_n$.
3. With an aspiration level (Sutton & Barto: "reference reward").
Part III: Convergence to dominant strategies. Beginning of Beggs (2005), "On the Convergence of Reinforcement Learning":

                #Players   #Actions   Result
    Theorem 1:     1          2       Pr(dominant action) → 1
    Theorem 2:     1         ≥ 2      Pr(sub-dominant actions) → 0
    Theorem 3:    ≥ 2        ≥ 2      Pr(dom) → 1, Pr(sub-dom) → 0

4 Part I: Single-state reinforcement learning

5 Exploration vs. exploitation
Problem. You are at the beginning of a new study year. Every fellow student is interesting as a possible new friend. How do you divide your time between your classmates to optimise your happiness? Strategies:
A. You make friends whenever (or wherever) possible. You could be called an explorer.
B. You stick to the nearest fellow student. You could be called an exploiter.
C. What most people do: first explore, then exploit.
We ignore:
1. How the quality of friendships is measured.
2. How changing personalities of friends (so-called "moving targets") are dealt with.

6 An array of N slot machines

7 Exploitation vs. exploration
Given: an array of N slot machines. Suppose the yield of every machine is normally distributed, with mean and variance unknown to us. Random questions:
1. How long do you stick with your first slot machine?
2. When do you leave the second?
3. If machine A so far yields more than machine B, would you ever explore B again?
4. Try many machines, or opt for security?

8 Experiment
[Figure omitted: sample yields of the slot machines, panels labelled "Yield Machine 1", etc.]

9 The N-armed bandit problem
Barto & Sutton: the N-armed bandit.

10 Computation of the quality (offline version)
A reasonable measure for the quality of a slot machine after n tries would be the average profit. Formula for the quality of a slot machine after n tries:
$$Q_n =_{\text{Def}} \frac{r_1 + \cdots + r_n}{n}$$
A simple formula, but: every time $Q_n$ is computed, all values $r_1, \ldots, r_n$ must be retrieved. The idea is to draw conclusions only once you have all the data. The data is processed in batch. Learning proceeds offline.

11 Computation of the quality (online version)
$$Q_n = \frac{r_1 + \cdots + r_n}{n} = \frac{r_1 + \cdots + r_{n-1}}{n} + \frac{r_n}{n} = \frac{n-1}{n}\cdot\frac{r_1 + \cdots + r_{n-1}}{n-1} + \frac{r_n}{n} = \frac{n-1}{n}\, Q_{n-1} + \frac{r_n}{n} = Q_{n-1} + \frac{1}{n}\,(r_n - Q_{n-1}).$$
In words: new value = old value + learning rate × (goal value − old value), where the bracketed difference is the error and the whole last term is the correction.
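The last identity is the standard incremental-mean update. Below is a minimal Python sketch (my own illustration, not part of the original slides; the function names are made up) contrasting the offline and online computations of $Q_n$:

    # Minimal sketch (not from the slides) contrasting offline and online
    # computation of the quality estimate Q_n of a single slot machine.

    def quality_offline(rewards):
        """Batch estimate: retrieve all stored rewards r_1..r_n and average them."""
        return sum(rewards) / len(rewards)

    def quality_online(q_prev, r_n, n):
        """Incremental estimate: Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})."""
        return q_prev + (r_n - q_prev) / n

    if __name__ == "__main__":
        rewards = [1.0, 0.0, 2.0, 1.5]
        q = 0.0                      # Q_0; overwritten entirely at n = 1
        for n, r in enumerate(rewards, start=1):
            q = quality_online(q, r, n)
        assert abs(q - quality_offline(rewards)) < 1e-12
        print(q)                     # 1.125

Only the running estimate and the counter n need to be stored, which is why learning can proceed online.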

12 Progress of the quality $Q_n$
The amplitude of the correction is determined by the learning rate. Here, the learning rate is $1/n$ and decreases through time.

13 Exploration: ǫ-greedy exploration
ǫ-greedy exploration. Let $0 < \epsilon \leq 1$ be close to 0.
1. Choose an optimal action a fraction $(1 - \epsilon)$ of the time.
2. At other times, choose a random action.
Item 1: exploitation. Item 2: exploration. With probability one, every action is explored infinitely many times. (Why?) Is it guaranteed that every action is explored infinitely many times?
It would be an idea to explore sub-optimal actions with a higher relative reward more often. However, that is not how ǫ-greedy exploration works, and we may lose convergence to optimal actions...
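A small Python sketch (my own, with illustrative names and a Gaussian bandit as the assumed payoff model) of ǫ-greedy action selection combined with the online quality update from slide 11:

    import random

    def epsilon_greedy(q_values, epsilon):
        """With probability 1 - epsilon exploit a currently optimal action,
        otherwise explore a uniformly random action."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))                              # exploration
        best = max(q_values)
        return random.choice([i for i, q in enumerate(q_values) if q == best])  # exploitation

    def run_bandit(true_means, steps=10_000, epsilon=0.1):
        """Sample-average epsilon-greedy on a stationary N-armed (Gaussian) bandit."""
        q = [0.0] * len(true_means)
        counts = [0] * len(true_means)
        for _ in range(steps):
            a = epsilon_greedy(q, epsilon)
            r = random.gauss(true_means[a], 1.0)    # yield with unknown mean and variance
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]          # online update of the quality
        return q, counts

    if __name__ == "__main__":
        print(run_bandit([0.2, 0.5, 1.0]))          # most pulls should go to the last machine

Because ǫ stays fixed, every action keeps a constant probability of being tried, which is why each action is explored infinitely often with probability one.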

14 Optimistic initial values
An alternative to ǫ-greedy is to work with optimistic initial values.
1. At the outset, an unrealistically high quality is attributed to every slot machine: $Q^k_0 = \text{high}$, for $1 \leq k \leq N$.
2. As usual, for every slot machine its average profit is maintained.
3. Without exception, always exploit machines with the highest Q-values.
Random questions:
q1: Initially, many actions are tried; are all actions tried?
q2: How high should "high" be?
q3: What to do in case of ties (more than one optimal machine)?
q4: Can we speak of exploration?
q5: Is optimism (as a method) suitable to explore an array of (possibly) infinitely many slot machines? Why (not)?
q6: ǫ-greedy: Pr(every action is explored infinitely many times) = 1. Also with optimism?
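For comparison, a sketch (again my own, not from the slides) of purely greedy selection with optimistic initial values; exploration here comes only from the optimism being worn down by real samples:

    import random

    def run_optimistic(true_means, steps=10_000, q_init=10.0):
        """Always exploit, but start every machine at an unrealistically high quality."""
        q = [q_init] * len(true_means)          # optimistic initial values Q_0^k = high
        counts = [0] * len(true_means)
        for _ in range(steps):
            best = max(q)
            a = random.choice([i for i, v in enumerate(q) if v == best])   # ties at random (q3)
            r = random.gauss(true_means[a], 1.0)
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]      # each sample pulls the estimate towards reality
        return q, counts

    if __name__ == "__main__":
        print(run_optimistic([0.2, 0.5, 1.0]))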

15 Optimistic initial values vs. ǫ-greedy
From: (...), Sutton and Barto, Sec. 2.8, p. 41.

16 Maintaining and exploring friendships: strategies
ǫ-greedy: Spend most of your time with your best friends (greedy). Occasionally spend a fraction ǫ of your time exploring random friendships.
Optimistic: In the beginning, foster (unreasonably) high expectations of everyone. You will be disappointed many times. Adapt your expectations based on experience. Always spend time with your best friends.
Values (cf. Sutton & Barto): Let $0 < \alpha \ll 1$. In the beginning rate everyone with a 6, say. If a friendship rated r involves a new experience $e \in [0, 10]$, then for example $r_{\text{new}} = r_{\text{old}} + \text{sign}(e)\,\alpha$. (Watch the boundaries!) Another method: $r_{\text{new}} = (1-\alpha)\, r_{\text{old}} + \alpha\, e$.
Proportions: Give everyone equal attention in the beginning. If there is a positive experience, then give that person a little more attention in the future. (Similarly with negative experiences.)
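A sketch of the two value-update rules in Python (my own reading: the sign in the first rule is taken relative to the current rating, which is one plausible interpretation of the slide's sign(e)):

    def clip(x, lo=0.0, hi=10.0):
        """Watch the boundaries of the 0..10 rating scale."""
        return max(lo, min(hi, x))

    def update_fixed_step(r_old, e, alpha=0.1):
        """Move the rating a fixed step alpha towards the new experience e."""
        if e > r_old:
            return clip(r_old + alpha)
        if e < r_old:
            return clip(r_old - alpha)
        return r_old

    def update_ema(r_old, e, alpha=0.1):
        """Exponential moving average: r_new = (1 - alpha) * r_old + alpha * e."""
        return (1 - alpha) * r_old + alpha * e

    print(update_fixed_step(6.0, 9.0), update_ema(6.0, 9.0))   # 6.1  6.3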

17 Part II: Single-state reinforcement learning in games

18 Proportional techniques: basic setup
There are two players: A (the protagonist) and B (the antagonist, sometimes "nature"). Play proceeds in (possibly an infinite number of) rounds $1, \ldots, t, \ldots$. Identifiers X and Y denote finite sets of possible actions. Each round t, players A and B choose actions $x \in X$ and $y \in Y$, respectively, which produces a history of play
$$(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t), \ldots$$
A's payoff is given by a fixed function $u : X \times Y \to \mathbb{R}$. In other words, A's payoff matrix is known. It follows that payoffs over time are homogeneous, i.e., $(x_s, y_s) = (x_t, y_t) \Rightarrow u(x_s, y_s) = u(x_t, y_t)$.

19 Propensity, and mixed strategy of play
Let $t \geq 0$. The propensity of A to play x at t is denoted by $\theta^t_x$. A simple model of propensity is cumulative payoff matching (CPM):
$$\theta^{t+1}_x = \begin{cases} \theta^t_x + u(x, y) & \text{if } x \text{ is played at round } t, \\ \theta^t_x & \text{else.} \end{cases}$$
The vector of initial propensities, $\theta^0$, is not the result of play. As a vector: $\theta^{t+1} = \theta^t + u^t e^t$, where $e^t_x =_{\text{Def}}$ (x is played at t ? 1 : 0).
A plausible mixed strategy is to play x at round t with its normalised propensity at t: $(q^t_x)_{x \in X}$, where
$$q^t_x =_{\text{Def}} \frac{\theta^t_x}{\sum_{x' \in X} \theta^t_{x'}}.$$
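A minimal Python sketch (my own; the function names are illustrative) of cumulative payoff matching for player A:

    import random

    def cpm_choose(theta):
        """Play x with probability theta_x / sum(theta), i.e. the normalised propensity q_x."""
        total = sum(theta)
        r, acc = random.uniform(0.0, total), 0.0
        for x, t in enumerate(theta):
            acc += t
            if r <= acc:
                return x
        return len(theta) - 1            # guard against floating-point rounding

    def cpm_update(theta, x, payoff):
        """theta_x^{t+1} = theta_x^t + u(x, y); all other propensities stay the same."""
        theta[x] += payoff
        return theta

    # The initial propensities theta^0 are not the result of play; they are chosen
    # by the modeller (strictly positive values keep every action playable).
    theta = [1.0, 1.0, 1.0]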

20 An example
The total cumulative payoff at round t, the sum $\sum_{x \in X} \theta^t_x$, is abbreviated by $v^t$.
[Table omitted: the propensities $\theta^{15}_x$ for each action x and the total $v^{15}$ after 15 rounds.]
Remarks: It is the cumulative payoff from each action that matters, not the average payoff. (There is a difference!) In this example, it is assumed that the initial propensities $\theta^0_x$ are one. In general, they could be anything. (But $\theta^0 = 0$ is not very useful.) Alternatively, as a scalar: $v^t = \sum_{x \in X} \theta^0_x + \sum_{s \leq t} u^s$.

21 Dynamics of the mixed strategy
We can obtain further insight into the dynamics of the process through the change of the mixed strategy:
$$\Delta q^t_x = q^t_x - q^{t-1}_x = \frac{\theta^t_x}{v^t} - \frac{\theta^{t-1}_x}{v^{t-1}} = \frac{v^{t-1}\theta^t_x - v^t\theta^{t-1}_x}{v^{t-1} v^t} = \frac{v^{t-1}(\theta^{t-1}_x + e^t_x u^t) - (v^{t-1} + u^t)\theta^{t-1}_x}{v^{t-1} v^t} = \frac{v^{t-1} e^t_x u^t - u^t \theta^{t-1}_x}{v^{t-1} v^t} = \frac{u^t}{v^t}\left(e^t_x - \frac{\theta^{t-1}_x}{v^{t-1}}\right) = \frac{u^t}{v^t}\left(e^t_x - q^{t-1}_x\right).$$

22 Dynamics of the mixed strategy: convergence
The dynamics of the mixed strategy in round t is given by $\Delta q^t = \frac{u^t}{v^t}(e^t - q^{t-1})$; on coordinate x: $\Delta q^t_x = \frac{u^t}{v^t}(e^t_x - q^{t-1}_x)$. We have:
$$\|\Delta q^t\| = \left\|\frac{u^t}{v^t}(e^t - q^{t-1})\right\| = \frac{u^t}{v^t}\,\|e^t - q^{t-1}\| \leq \frac{u^t}{v^t}\cdot 2 \leq \frac{2\,\max\{u^s \mid s \leq t\}}{t\,\min\{u^s \mid s \leq t\}} = C\,\frac{1}{t}.$$
Since all terms except $v^t$ are bounded, $\lim_{t\to\infty} \Delta q^t = 0$. (Note: a vanishing change per round does not by itself imply convergence of $q^t$.) Does $q^t$ converge? If so, to the right (e.g., a Pareto-optimal) strategy? Beggs (2005) provides more clarity in certain circumstances.

23 Abstraction of past payoffs: $t^p$
In 1991 and 1993, B. Arthur proposed an update formula in which the propensities are rescaled after every round so that their total grows as $C\,t^p$ (for constants C > 0 and p). Consequently,
$$\Delta q^t = \frac{u^t}{C\,t^p + u^t}\,(e^t - q^{t-1}).$$
Remarks: Arthur's notation differs considerably from that of Peyton Young (2004). If the parameter p is set to, e.g., 2, then there is convergence. However... in related research, where the value of p is determined through psychological experiments, it is estimated that p < 1.
B. Arthur (1993): "On Designing Economic Agents that Behave Like Human Agents". In: Journal of Evolutionary Economics 3, pp.

24 Past payoffs at discount rate λ < 1
In 1995, Erev and Roth proposed the following update formula:
$$\theta^{t+1} = \lambda\,\theta^t + u^t e^t.$$
Consequently (for simplicity, we assume $\theta^0 = 0$),
$$\Delta q^t = \frac{u^t}{\sum_{s \leq t} \lambda^{t-s} u^s}\,(e^t - q^{t-1}).$$
Since
$$\Big(\sum_{s \leq t} \lambda^{t-s}\Big)\,\min\{u^s \mid s \leq t\} \;\leq\; \sum_{s \leq t} \lambda^{t-s} u^s \;\leq\; \Big(\sum_{s \leq t} \lambda^{t-s}\Big)\,\max\{u^s \mid s \leq t\}$$
and since $1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1} = \frac{1-\lambda^t}{1-\lambda}$ for $\lambda \neq 1$, the mixed strategy tends to change at a rate $1-\lambda$.
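A sketch (my own) of the Erev-Roth update; the only change with respect to CPM is that all propensities are discounted by λ before the played action is reinforced:

    def erev_roth_update(theta, x, payoff, lam=0.95):
        """theta^{t+1} = lam * theta^t + u^t * e^t  (discount everything, reinforce x)."""
        theta = [lam * t for t in theta]     # discounted past payoffs
        theta[x] += payoff                   # only the action actually played is reinforced
        return theta

    def mixed_strategy(theta):
        """q_x^t = theta_x^t / sum(theta^t), as before."""
        total = sum(theta)
        return [t / total for t in theta]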

25 Past payoffs represented by an aspiration level
Assume an aspiration level $a^t \in \mathbb{R}$ at every round. (Intuition: a payoff with which one would be satisfied.) Idea:
$u^t_x > a^t$: positively reinforce action x;
$u^t_x < a^t$: negatively reinforce action x.
Correspondingly, the mixed strategy evolves according to
$$\Delta q^t = (u^t - a^t)\,(e^t - q^{t-1}).$$
Typical definitions for aspiration:
Average past payoffs: $a^t =_{\text{Def}} v^t / t$. A.k.a. satisficing play (Crandall, 2005).
Discounted past payoffs: $a^t =_{\text{Def}} \sum_{s \leq t} \lambda^{t-s} u^s$. (Erev & Roth, 1995.)
Börgers and Sarin (2000): "Naive Reinforcement Learning with Endogenous Aspirations". In: International Economic Review 41, pp.
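A sketch of the aspiration-based rule (my own transcription of the slide's update; the small step size and the renormalisation guard are my additions, needed to keep q a probability vector in a straightforward simulation):

    def aspiration_update(q, x_played, payoff, aspiration, step=0.01):
        """Delta q^t = (u^t - a^t) (e^t - q^{t-1}), scaled by a small step size."""
        e = [1.0 if x == x_played else 0.0 for x in range(len(q))]
        q_new = [qx + step * (payoff - aspiration) * (ex - qx) for qx, ex in zip(q, e)]
        q_new = [max(qx, 0.0) for qx in q_new]                 # guard: stay non-negative
        total = sum(q_new)
        return [qx / total for qx in q_new] if total > 0 else [1.0 / len(q)] * len(q)

    # Aspiration as average past payoff: a^t = v^t / t, with v^t the cumulative payoff so far.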

26 Adequacy of reinforcement learning
Does reinforcement learning lead to optimal behaviour against B? If A and B were both to converge to optimal behaviour, i.e., to a best response, this would yield a Nash equilibrium. Less demanding: does reinforcement learning converge to optimal behaviour in a stationary (and, perhaps, stochastic) environment?
A history is a finite sequence of actions $\xi^t$: $(x_1, y_1), \ldots, (x_t, y_t)$. A strategy is a function $g : H \to \Delta(X)$ that maps histories to probability distributions over X. Write $q^{t+1} =_{\text{Def}} g(\xi^t)$.

27 Optimality against stationary opponents
Assume that B plays a fixed probability distribution $q^* \in \Delta(Y)$. The combination of $\theta^0$, g and $q^*$ yields a realisation $\omega = (x_1, y_1), \ldots, (x_t, y_t), \ldots$. Define $B(q^*) =_{\text{Def}} \{x \in X \mid x \text{ is a best response to } q^*\}$.
Definition. A strategy g is called optimal against $q^*$ if, with probability one, for all $x \notin B(q^*)$:
$$\lim_{t\to\infty} q^t_x = 0 \qquad (1)$$
In this case, the phrase "with probability one" means that almost all realisations (all but a set of measure zero) satisfy (1).
Theorem. Given finite action sets X and Y, cumulative payoff matching on X is optimal against every stationary distribution on Y. Peyton Young (2004, p. 17): "Its proof is actually quite involved (...)."
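The theorem can be illustrated numerically. A sketch (my own; the payoff matrix and the opponent's distribution are made-up examples with payoffs bounded away from zero) of CPM against a stationary opponent in a 2x2 game:

    import random

    U = [[3.0, 1.0],        # A's payoffs u(x, y): rows are A's actions, columns B's actions
         [2.0, 2.0]]
    q_star = [0.7, 0.3]     # B's fixed mixed strategy; A's best response is action 0 (2.4 > 2.0)

    def sample(p):
        return 0 if random.random() < p[0] else 1

    theta = [1.0, 1.0]      # strictly positive initial propensities
    for t in range(200_000):
        total = sum(theta)
        x = sample([theta[0] / total, theta[1] / total])   # cumulative payoff matching
        y = sample(q_star)
        theta[x] += U[x][y]

    print(theta[0] / sum(theta))   # weight on the best response; tends towards 1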

28 Part III: Beggs, 2005

29 The learning model
Single-state proportional reinforcement learning (Erev & Roth, 1995). As usual:
$$A_i(n+1) = \begin{cases} A_i(n) + \pi_i(n+1) & \text{if action } i \text{ is chosen,} \\ A_i(n) & \text{else,} \end{cases} \qquad \Pr_i(n+1) = \frac{A_i(n)}{\sum_{j=1}^m A_j(n)}.$$
The following two assumptions are crucial:
1. All past, current and future payoffs $\pi_i(n)$ are bounded away from zero and bounded from above. More precisely, there are $0 < k_1 \leq k_2$ such that all payoffs are in $[k_1, k_2]$.
2. Initial propensities $A_i(0)$ are strictly positive.
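A sketch (my own) of this scheme with the two assumptions made explicit; the payoff distributions are invented for illustration and respect the bounds $[k_1, k_2]$:

    import random

    K1, K2 = 0.5, 2.0       # all payoffs lie in [k1, k2] with 0 < k1 <= k2

    def choose(A):
        """Pr_i(n+1) = A_i(n) / sum_j A_j(n)."""
        total = sum(A)
        r, acc = random.uniform(0.0, total), 0.0
        for i, a in enumerate(A):
            acc += a
            if r <= acc:
                return i
        return len(A) - 1

    def step(A, payoff_of):
        """One round: choose an action, observe its payoff, add it to the propensity."""
        i = choose(A)
        pi = payoff_of(i)
        assert K1 <= pi <= K2
        A[i] += pi

    if __name__ == "__main__":
        A = [1.0, 1.0]                                       # strictly positive A_i(0)
        payoff = lambda i: random.uniform(1.5, 2.0) if i == 0 else random.uniform(0.5, 1.0)
        for n in range(100_000):
            step(A, payoff)
        print(A[0] / sum(A))   # action 0 dominates, so this tends towards 1 (cf. Theorem 1 below)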

30 Choice of actions
Lemma 1. Each action is chosen infinitely often with probability one.
Proof. From the above assumptions it follows that
$$\Pr_i(n+1) = \frac{A_i(n)}{\sum_{j=1}^m A_j(n)} \geq \frac{A_i(0)}{A_i(0) + n k_2}.$$
(This is like a worst case for i: as if i was never chosen and in all previous n rounds the actions ≠ i received the maximum possible payoff.) Apply the so-called conditional Borel-Cantelli lemma:* if $\{E_n\}_n$ are events and $\sum_{n=1}^{\infty} \Pr(E_n \mid X_1, \ldots, X_{n-1})$ is unbounded, then the probability that an infinite number of the $E_n$ occur is one.
* A.k.a. the second Borel-Cantelli lemma, or the Borel-Cantelli-Lévy lemma (Shiryaev, p. 518).

31 Unboundedness of propensities, and convergence
Lemma 2. For each i, $A_i$ tends to infinity with probability one.
Proof. By Lemma 1, each action i is chosen infinitely often with probability one. Since the payoff per round is bounded from below by $k_1$, we have $A_i \geq \sum_{j} k_1 \to \infty$, where j runs over the rounds in which i is chosen.
Now Lemma 1 + Lemma 2 + martingale theory suffice to prove convergence. Suppose there are only two possible actions: $a_1$ and $a_2$. The expression $E[\pi(a_i) \mid \text{history}]$ denotes the expected payoff of action $a_i$, given the history of play up to and including the choice to play $a_i$ itself.
Theorem 1. If
$$E[\pi(a_1) \mid \text{history}] > \gamma\, E[\pi(a_2) \mid \text{history}] \qquad (2)$$
for some fixed $\gamma > 1$, then the probability that $a_1$ will be played converges to one.

32 Convergence to the dominant action: proof in a nutshell
If $a_1$ is dominant (as in Eq. 2), the objective is to show that
$$\frac{A_2}{A_1}(n) \to 0, \quad \text{a.s.}$$
To this end, it suffices to show that
$$\frac{A_2^{\epsilon}}{A_1}(n) \to C, \quad \text{a.s.} \qquad (3)$$
for some C and for some $1 < \epsilon < \gamma$ (which is possible, since $\gamma > 1$). Then
$$\lim_{n} \frac{A_2}{A_1}(n) = \lim_{n} \frac{A_2^{\epsilon}}{A_1} \cdot \frac{1}{A_2^{\epsilon-1}} = \lim_{n} \frac{A_2^{\epsilon}}{A_1} \cdot \lim_{n} \frac{1}{A_2^{\epsilon-1}} = C \cdot 0 = 0.$$
To this end, Beggs shows that, for some $n \in \mathbb{N}$ and for all $1 < \epsilon < \gamma$, the process $A_2^{\epsilon}/A_1$ in (3) is a so-called non-negative super-martingale. (Explained in a moment.) It is known that every non-negative super-martingale converges to a finite limit with probability one. (Explained in a moment.)

33 Super-martingale
A super-martingale is a stochastic process in which the conditional expectation of the next value, given the current and preceding values, is less than or equal to the current value:
$$E[Z_{n+1} \mid Z_1, \ldots, Z_n] \leq Z_n.$$
Think of an unfair gambling game that proceeds in rounds.
1. Expectations decrease. Taking expectations on both sides yields $E[Z_{n+1}] \leq E[Z_n]$.
2. Expectations converge. From (1) and the monotone convergence theorem* it follows that the expectations of a non-negative super-martingale converge to a limit L somewhere in $[0, E[Z_1]]$.
3. Values converge a.s. Doob's Martingale Convergence Theorem: let $\{Z_n\}_n$ be a martingale (or sub-martingale, or super-martingale) such that $E[|Z_n|]$ is bounded. Then $\lim_n Z_n$ exists a.s. and is finite.
* Ordinary mathematics (for monotone, bounded real sequences).

34 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
$$E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, \text{history}\right] = \Pr(1 \mid \text{history})\, E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, 1, \text{history}\right] + \Pr(2 \mid \text{history})\, E\!\left[\frac{A_2^{\epsilon}}{A_1}(n+1) - \frac{A_2^{\epsilon}}{A_1}(n) \,\Big|\, 2, \text{history}\right]$$
$$= \frac{A_1(n)}{A_1(n) + A_2(n)}\, E\!\left[\frac{A_2^{\epsilon}(n)}{A_1(n) + \pi_1(n+1)} - \frac{A_2^{\epsilon}(n)}{A_1(n)}\right] + \frac{A_2(n)}{A_1(n) + A_2(n)}\, E\!\left[\frac{(A_2(n) + \pi_2(n+1))^{\epsilon}}{A_1(n)} - \frac{A_2^{\epsilon}(n)}{A_1(n)}\right].$$

35 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Taylor expansion:
$$f(x+h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x + \theta h),$$
where the last term is the Lagrange remainder, for some $\theta \in (0, 1)$. (Of course, there is nothing special about n = 4.) Applied to $f(x) = x^{-1}$ and n = 2 we obtain
$$(x+h)^{-1} = x^{-1} + h(-x^{-2}) + \frac{h^2}{2!}\left(2(x + \theta h)^{-3}\right) = x^{-1} - h x^{-2} + h^2 (x + \theta h)^{-3} = \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{(x + \theta h)^3}.$$

36 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
The Taylor expansion applied to $f(x) = x^{-1}$ with n = 2 yields
$$(x+h)^{-1} = x^{-1} - h x^{-2} + h^2 (x + \theta h)^{-3} = \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{(x + \theta h)^3}.$$
For non-negative x and h we have $(x + \theta h)^{-3} \leq x^{-3}$, so that
$$(x+h)^{-1} \leq \frac{1}{x} - \frac{h}{x^2} + \frac{h^2}{x^3}.$$
This first inequality puts an upper bound, with pure x and h, on
$$\frac{1}{A_1(n) + \pi_1(n+1)} \leq \frac{1}{A_1(n)} - \frac{\pi_1(n+1)}{A_1^2(n)} + \frac{\pi_1^2(n+1)}{A_1^3(n)}.$$
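A quick numerical sanity check (my own) of this first inequality over randomly sampled non-negative x and h:

    import random

    # Check (x + h)^(-1) <= 1/x - h/x**2 + h**2/x**3 for x > 0 and h >= 0.
    for _ in range(10_000):
        x = random.uniform(0.1, 100.0)
        h = random.uniform(0.0, 100.0)
        lhs = 1.0 / (x + h)
        rhs = 1.0 / x - h / x**2 + h**2 / x**3
        assert lhs <= rhs + 1e-12, (x, h)
    print("bound holds on all sampled points")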

37 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Similarly, applying the Taylor expansion to $f(x) = x^{\epsilon}$ with n = 2 yields
$$(x+h)^{\epsilon} = x^{\epsilon} + h\epsilon x^{\epsilon-1} + \frac{h^2}{2}\,\epsilon(\epsilon-1)(x + \theta h)^{\epsilon-2}.$$
For non-negative x and h and $\epsilon > 1$, we have $\tfrac{1}{2}(\epsilon-1)(x + \theta h)^{\epsilon-2} \leq C x^{\epsilon-2}$ for some constant C, so that
$$(x+h)^{\epsilon} \leq x^{\epsilon} + h\epsilon x^{\epsilon-1} + h^2 C \epsilon x^{\epsilon-2}.$$
This second inequality puts an upper bound, with pure x and h, on $(A_2(n) + \pi_2(n+1))^{\epsilon} \leq A_2^{\epsilon}(n) + \ldots$

38 To show that $A_2^{\epsilon}/A_1$ is a non-negative super-martingale
Using $E[aX + b] = aE[X] + b$ and factoring out common terms, Beggs obtains the upper bound
$$\frac{A_1}{A_1 + A_2}\,\frac{A_2^{\epsilon}}{A_1}\left[-\frac{E[\pi_1(n+1)]}{A_1(n)} + c_1 \frac{E[\pi_1(n+1)^2]}{A_1^2(n)}\right] + \frac{A_2}{A_1 + A_2}\,\frac{\epsilon A_2^{\epsilon}}{A_1 A_2}\left[E[\pi_2(n+1)] + c_2 \frac{E[\pi_2(n+1)^2]}{A_2(n)}\right].$$
Because payoffs are bounded, $E[\pi_1(\ldots)] > \gamma E[\pi_2(\ldots)]$ and $1 - \gamma < \epsilon - \gamma < 0$, constants $K_1, K_2, K_3 > 0$ can be found such that this is at most
$$\frac{A_2^{\epsilon}}{A_1 (A_1 + A_2)}\left(K_1(\epsilon - \gamma) + \frac{K_2}{A_1} + \frac{K_3}{A_2}\right)(n).$$
For $\epsilon \in (1, \gamma)$ and for n large enough, this expression is non-positive.

39 Generalisation of Beggs' Theorem 1, and application to games
Let there be $m \geq 2$ alternative actions, $a_1, \ldots, a_m$ (rather than m = 2).
Theorem 2. If the expected payoff (conditional on the history) of $a_i$ dominates the expected payoff (conditional on the history) of $a_j$ for all $j \neq i$, then the probability that $a_j$ will be played converges to zero, for all $j \neq i$.
Applied to games:
Theorem 3. In a game with finitely many actions and players, if a player learns according to the ER scheme, then:
a. With probability 1, the probability and empirical frequency with which he plays any action that is strictly dominated by another pure strategy converge to zero.
b. Hence, if he has a strictly dominant strategy, then with probability 1 the probability and empirical frequency with which he plays that action converge to 1. (Beggs, 2005.)

40 Summary
There are several rules for reinforcement learning on single states. Sheer convergence is often easy to prove. To prove convergence to best actions in stationary environments is much more difficult. Convergence to best actions in non-stationary environments, e.g., convergence to dominant actions, or to best responses in self-play, is state-of-the-art research.

41 What next?
No-regret learning: this is a generalisation of reinforcement learning. No-regret =_Def play those actions that would have been successful in the past.
Similarities with reinforcement learning:
1. Driven by past payoffs.
2. Not interested in (the behaviour of) the opponent.
3. Myopic.
Differences:
a) Keeping accounts of hypothetical actions rests on the assumption that a player is able to estimate payoffs of actions that were not actually played. [Knowledge of the payoff matrix definitely helps, but is an even more severe assumption.]
b) It is a bit easier to obtain results regarding performance.
