Scalable Learning in Stochastic Games


Michael Bowling and Manuel Veloso
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA

(Copyright 2002, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.)

Abstract

Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are landmark results of learning being successfully applied to specific large and complex games such as Checkers and Backgammon. In this paper we describe a scalable learning algorithm for stochastic games that combines three separate ideas from reinforcement learning into a single algorithm. These ideas are tile coding for generalization, policy gradient ascent as the basic learning method, and our previous work on the WoLF ("Win or Learn Fast") variable learning rate to encourage convergence. We apply this algorithm to the intractably sized game-theoretic card game Goofspiel, showing preliminary results of learning in self-play. We demonstrate that policy gradient ascent can learn even in this highly non-stationary problem with simultaneous learning. We also show that the WoLF principle continues to have a converging effect even in large problems with approximation and generalization.

Introduction

We are interested in the problem of learning in multiagent environments. One of the main challenges with these environments is that other agents in the environment may be learning and adapting as well. These environments are, therefore, no longer stationary. They violate the Markov property that traditional single-agent behavior learning relies upon. The model of stochastic games captures these problems very well through explicit models of the reward functions of the other agents and their effects on transitions. Stochastic games are also a natural extension of Markov decision processes (MDPs) to multiple agents and so have attracted interest from the reinforcement learning community.

The problem of simultaneously finding optimal policies for stochastic games has been well studied in the field of game theory. The traditional solution concept is that of Nash equilibria, a policy for all the players where each is playing optimally with respect to the others. This concept is a powerful solution for these games even in a learning context, since no agent could learn a better policy when all the agents are playing an equilibrium. It is this foundation that has driven much of the recent work in applying reinforcement learning to stochastic games (Littman 1994; Hu & Wellman 1998; Singh, Kearns, & Mansour 2000; Littman 2001; Bowling & Veloso 2002; Greenwald & Hall 2002). This work has thus far only been applied to small games with enumerable state and action spaces.

Historically, though, a number of landmark results in reinforcement learning have looked at learning in particular stochastic games that are neither small nor easily enumerated. Samuel's Checkers-playing program (Samuel 1967) and Tesauro's TD-Gammon (Tesauro 1995) are successful applications of learning in games with very large state spaces. Both of these results made generous use of generalization and approximation, which have not been used in the more recent work. On the other hand, both TD-Gammon and Samuel's Checkers player only used deterministic strategies to play competitively, while Nash equilibria often require stochastic strategies. We are interested in scaling some of the recent techniques based on the Nash equilibrium concept to games with intractable state spaces. Such a goal is not new.
Singh and colleagues also described future work of applying their simple gradient techniques to problems with large or infinite state and action spaces (Singh, Kearns, & Mansour 2000). This paper examines some initial results in this direction. We first describe the formal definition of a stochastic game and the notion of equilibria. We then describe one particular very large, two-player, zero-sum stochastic game, Goofspiel. Our learning algorithm is described as the combination of three ideas from reinforcement learning: tile coding, policy gradients, and the WoLF principle. We then show results of our algorithm learning to play Goofspiel in self-play. Finally, we conclude with some future directions for this work.

Stochastic Games

A stochastic game is a tuple (n, S, A_1..n, T, R_1..n), where n is the number of agents, S is a set of states, A_i is the set of actions available to agent i (and A is the joint action space A_1 × ... × A_n), T is a transition function S × A × S → [0, 1], and R_i is a reward function for the ith agent, S × A → R.
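To make the definition concrete, the following is a minimal sketch of this tuple as a small Python interface. It is our own illustration, not from the paper; names such as StochasticGame and JointAction are hypothetical, and it assumes finite, enumerable state and action sets. A Goofspiel-sized game would never be enumerated this way explicitly; the point is only to fix the types of T and R_i.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A joint action is one action index per agent.
JointAction = Tuple[int, ...]

@dataclass
class StochasticGame:
    """Minimal container mirroring the tuple (n, S, A_1..n, T, R_1..n)."""
    n_agents: int                                            # n
    states: Sequence[int]                                    # S (enumerated)
    action_sets: Sequence[Sequence[int]]                     # A_i for each agent i
    transition: Callable[[int, JointAction, int], float]     # T(s, a, s') in [0, 1]
    rewards: Sequence[Callable[[int, JointAction], float]]   # R_i(s, a)

    def expected_reward(self, agent: int, s: int, a: JointAction) -> float:
        """Immediate reward to one agent for joint action a taken in state s."""
        return self.rewards[agent](s, a)
```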

This looks very similar to the MDP framework, except that we have multiple agents selecting actions and the next state and rewards depend on the joint action of the agents. Another important difference is that each agent has its own separate reward function. The goal for each agent is to select actions in order to maximize its discounted future rewards with discount factor γ.

Stochastic games are a very natural extension of MDPs to multiple agents. They are also an extension of matrix games to multiple states. Two example matrix games are in Figure 1. In these games there are two players; one selects a row and the other selects a column of the matrix. The entry of the matrix they jointly select determines the payoffs. The games in Figure 1 are zero-sum games, so the row player would receive the payoff in the matrix, and the column player would receive the negative of that payoff. In the general case (general-sum games), each player would have a separate matrix that determines their payoffs.

Figure 1: Matching Pennies and Rock-Paper-Scissors matrix games.

Each state in a stochastic game can be viewed as a matrix game with the payoffs for each joint action determined by the matrices R_i(s, a). After playing the matrix game and receiving their payoffs, the players are transitioned to another state (or matrix game) determined by their joint action. We can see that stochastic games then contain both MDPs and matrix games as subsets of the framework.

Stochastic Policies. Unlike in single-agent settings, deterministic policies in multiagent settings can often be exploited by the other agents. Consider the Matching Pennies matrix game as shown in Figure 1. If the column player were to play either action deterministically, the row player could win every time. This requires us to consider mixed strategies and stochastic policies. A stochastic policy, ρ : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions.

Nash Equilibria. Even with the concept of mixed strategies, there are still no optimal strategies that are independent of the other players' strategies. We can, though, define a notion of best response. A strategy is a best response to the other players' strategies if it is optimal given their strategies. The major advancement that has driven much of the development of matrix games, game theory, and even stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr. 1950). A Nash equilibrium is a collection of strategies, one for each of the players, such that each player's strategy is a best response to the other players' strategies. So, no player can do better by changing strategies given that the other players also don't change strategies. What makes the notion of equilibrium compelling is that all matrix games have such an equilibrium, possibly having multiple equilibria. Zero-sum, two-player games, where one player's payoffs are the negative of the other's, have a single Nash equilibrium.¹ In the zero-sum examples in Figure 1, both games have an equilibrium consisting of each player playing the mixed strategy where all the actions have equal probability. The concept of equilibria also extends to stochastic games. This is a non-trivial result, proven by Shapley (Shapley 1953) for zero-sum stochastic games and by Fink (Fink 1964) for general-sum stochastic games.

¹ There can actually be multiple equilibria, but they will all have equal payoffs and are interchangeable (Osborne & Rubinstein 1994).
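As a quick check of the claim above that the uniform mixed strategy is an equilibrium of both games in Figure 1, the short sketch below (our own illustration, not from the paper) shows that in Rock-Paper-Scissors no pure row action gains anything against a uniform column strategy, so the game's equilibrium value is zero.

```python
import numpy as np

# Row player's payoffs in Rock-Paper-Scissors (zero-sum; the column player gets the negative).
rps = np.array([[ 0, -1,  1],
                [ 1,  0, -1],
                [-1,  1,  0]], dtype=float)

uniform = np.ones(3) / 3.0

# Expected payoff of each pure row action against the uniform column strategy.
payoffs_vs_uniform = rps @ uniform
print(payoffs_vs_uniform)        # [0. 0. 0.] -- no pure deviation gains anything
print(uniform @ rps @ uniform)   # 0.0        -- the value of the game at equilibrium
```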
Learning in Stochastic Games. Stochastic games have been the focus of recent research in the area of reinforcement learning, where two different approaches are being explored. The first is that of algorithms that explicitly learn equilibria through experience, independent of the other players' policies (Littman 1994; Hu & Wellman 1998; Greenwald & Hall 2002). These algorithms iteratively estimate value functions and use them to compute an equilibrium for the game. A second approach is that of best-response learners (Claus & Boutilier 1998; Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002). These learners explicitly optimize their reward with respect to the other players' (changing) policies. This approach, too, has a strong connection to equilibria: if these algorithms converge when playing each other, then they must do so to an equilibrium (Bowling & Veloso 2001).

Neither of these approaches, though, has been scaled beyond games with a few hundred states. Games with a very large number of states, or games with continuous state spaces, make state enumeration intractable. Since previous algorithms in their stated form require the enumeration of states either for policies or for value functions, this is a major limitation. In this paper we examine learning in a very large stochastic game, using approximation and generalization techniques. Specifically, we will build on the idea of best-response learners using gradient techniques (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002). We first describe an interesting game with an intractably large state space.

Goofspiel

Goofspiel (or The Game of Pure Strategy) was invented by Merrill Flood while at Princeton (Flood 1985). The game has numerous variations, but here we focus on the simple two-player, n-card version. Each player receives a suit of cards numbered 1 through n; a third suit of cards is shuffled and placed face down as the deck. Each round the next card is flipped over from the deck, and the two players each select a card, placing it face down. They are revealed simultaneously, and the player with the highest card wins the card from the deck, which is worth its number in points.

If the players choose the same valued card, then neither player gets any points. Regardless of the winner, both players discard their chosen card. This is repeated until the deck and the players' hands are exhausted. The winner is the player with the most points.

This game has numerous interesting properties, making it a very interesting step between toy problems and more realistic problems. First, notice that this game is zero-sum, and as with many zero-sum games, any deterministic strategy can be soundly defeated. In this game, it's by simply playing the card one higher than the other player's deterministically chosen card. Second, notice that the number of states and state-action pairs grows exponentially with the number of cards. The standard size of the game, n = 13, is so large that just storing one player's policy or Q-table would require approximately 2.5 terabytes of space. Just gathering data on all the state-action transitions would require an impractically large number of playings of the game. Table 1 shows the number of states and state-action pairs as well as the policy size for three different values of n. This game obviously requires some form of generalization to make learning possible. Another interesting property is that randomly selecting actions is a reasonably good policy. The worst-case values of the random policy, along with the worst-case values of the best deterministic policy, are also shown in Table 1.

This game can be described using the stochastic game model. The state is the current cards in the players' hands and the deck, along with the upturned card. The actions for a player are the cards in the player's hand. The transitions follow the rules as described, with an immediate reward going to the player who won the upturned card. Since the game has a finite end and we are interested in maximizing total reward, we can set the discount factor γ to be 1.

Although equilibrium learning techniques such as Minimax-Q (Littman 1994) are guaranteed to find the game's equilibrium, they require maintaining a state-joint-action table of values. This table would require 20.1 terabytes to store for the n = 13 card game. We will now describe a best-response learning algorithm using approximation techniques to handle the enormous state space.
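To make the state, action, and transition structure of Goofspiel concrete, here is a minimal sketch of one round under the rules above. It is our own illustration; the function and variable names, such as play_round, are hypothetical and not from the paper. Running it with two random bidders simulates the random-versus-random baseline discussed above.

```python
import random

def play_round(my_hand, opp_hand, deck, my_policy, opp_policy, scores):
    """Resolve one Goofspiel round: flip a prize card, both players bid, update scores.

    Hands and the deck are sets of card values; a policy maps (hand, prize) to a chosen card.
    """
    prize = random.choice(sorted(deck))   # flipping from a shuffled deck
    deck.remove(prize)

    my_bid = my_policy(my_hand, prize)
    opp_bid = opp_policy(opp_hand, prize)
    my_hand.remove(my_bid)
    opp_hand.remove(opp_bid)

    if my_bid > opp_bid:                  # higher bid wins the prize card's value in points
        scores[0] += prize
    elif opp_bid > my_bid:
        scores[1] += prize
    # equal bids: the prize card is simply discarded, nobody scores

n = 13
my_hand, opp_hand, deck = set(range(1, n + 1)), set(range(1, n + 1)), set(range(1, n + 1))
scores = [0, 0]
random_bid = lambda hand, prize: random.choice(sorted(hand))
while deck:
    play_round(my_hand, opp_hand, deck, random_bid, random_bid, scores)
print(scores)
```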
Three Ideas, One Algorithm

The algorithm we will use combines three separate ideas from reinforcement learning. The first is the idea of tile coding as generalization for linear function approximation. The second is the use of a parameterized policy and learning as gradient ascent in the policy's parameter space. The final component is the use of a WoLF variable learning rate to adjust the gradient ascent step size. We will briefly overview these three techniques and then describe how they are combined into a reinforcement learning algorithm for Goofspiel.

Tile Coding. Tile coding (Sutton & Barto 1998), also known as CMACs, is a popular technique for creating a set of boolean features from a set of continuous features. In reinforcement learning, tile coding has been used extensively to create linear approximators of state-action values (e.g., (Stone & Sutton 2001)).

Figure 2: An example of tile coding a two-dimensional space with two overlapping tilings.

The basic idea is to lay offset grids or tilings over the multidimensional continuous feature space. A point in the continuous feature space will be in exactly one tile for each of the offset tilings. Each tile has an associated boolean variable, so the continuous feature vector gets mapped into a very high-dimensional boolean vector. In addition, nearby points will fall into the same tile for many of the offset grids, and so share many of the same boolean variables in their resulting vector. This provides the important feature of generalization. An example of tile coding in a two-dimensional continuous space is shown in Figure 2. This example shows two overlapping tilings, and so any given point falls into two different tiles.

Another common trick with tile coding is the use of hashing to keep the number of parameters manageable. Each tile is hashed into a table of fixed size. Collisions are simply ignored, meaning that two unrelated tiles may share the same parameter. Hashing reduces the memory requirements with little loss in performance. This is because only a small fraction of the continuous space is actually needed or visited while learning, and so independent parameters for every tile are often not necessary. Hashing provides a means for using only the number of parameters the problem requires while not knowing in advance which state-action pairs need parameters.
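A minimal sketch of the tile coding and hashing just described, assuming integer-valued features (as in the Goofspiel representation later). This is our own simplified illustration, not the authors' implementation; real tile-coding code differs in detail.

```python
def tile_indices(features, num_tilings, tile_size, table_size):
    """Map an integer feature vector to one hashed tile index per tiling.

    Each tiling is offset by a different fraction of the tile size, so nearby
    feature vectors share most of their tiles, which provides generalization.
    Hash collisions are simply ignored, as described in the text.
    """
    indices = []
    for t in range(num_tilings):
        # Quantize each feature into a tile coordinate for this offset tiling.
        coords = tuple((f * num_tilings - t * tile_size) // (tile_size * num_tilings)
                       for f in features)
        indices.append(hash((t,) + coords) % table_size)
    return indices

# Two nearby state-action feature vectors share most of their hashed tiles.
a = tile_indices((1, 4, 6, 8, 13, 11, 3), num_tilings=8, tile_size=6, table_size=1_000_000)
b = tile_indices((1, 4, 5, 8, 13, 11, 3), num_tilings=8, tile_size=6, table_size=1_000_000)
print(len(set(a) & set(b)), "of", len(a), "tiles shared")
```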

Table 1: The approximate number of states and state-actions, and the size of a stochastic policy or Q-table for Goofspiel, depending on the number of cards, n. The VALUE columns list the worst-case value of the best deterministic policy and the random policy, respectively. (Columns: n, |S|, |S × A|, SIZEOF(π or Q), VALUE(det), VALUE(random); the stored policy sizes range from kilobytes through megabytes to terabytes as n grows.)

Policy Gradient Ascent

Policy gradient techniques (Sutton et al. 2000; Baxter & Bartlett 2000) are a method of reinforcement learning with function approximation. Traditional approaches approximate a state-action value function, and result in a deterministic policy that selects the action with the maximum learned value. Alternatively, policy gradient approaches approximate a policy directly, and then use gradient ascent to adjust the parameters to maximize the policy's value. There are three good reasons for the latter approach. First, there is a whole body of theoretical work describing convergence problems when using a variety of value-based learning techniques with a variety of function approximation techniques (see (Gordon 2000) for a summary of these results). Second, value-based approaches learn deterministic policies, and as we mentioned earlier, deterministic policies in multiagent settings are often easily exploitable. Third, gradient techniques have been shown to be successful for simultaneous learning in matrix games (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002).

We use the policy gradient technique presented by Sutton and colleagues (Sutton et al. 2000). Specifically, we will define a policy as a Gibbs distribution over a linear combination of features, such as those taken from a tile coding representation of state-actions. Let θ be a vector of the policy's parameters and φ_sa be a feature vector for state s and action a; then this defines a stochastic policy according to,

    π(s, a) = e^(θ·φ_sa) / Σ_b e^(θ·φ_sb).

Their main result was a convergence proof for the following policy iteration rule that updates the policy's parameters,

    θ_{k+1} = θ_k + α_k Σ_s d^{π_k}(s) Σ_a (∂π_k(s, a)/∂θ) f_{w_k}(s, a).    (1)

For the Gibbs distribution this is just,

    θ_{k+1} = θ_k + α_k Σ_s d^{π_k}(s) Σ_a φ_sa π(s, a) f_{w_k}(s, a).    (2)

Here α_k is an appropriately decayed learning rate, and d^{π_k}(s) is state s's contribution to the policy's overall value. This contribution is defined differently depending on whether the average or discounted start-state reward criterion is used. f_{w_k}(s, a) is an independent approximation, with parameters w, of Q^{π_k}(s, a), which is the expected value of taking action a from state s and then following the policy π_k. For a Gibbs distribution, Sutton and colleagues showed that for convergence this approximation should have the following form,

    f_w(s, a) = w · [ φ_sa − Σ_b π(s, b) φ_sb ].

As they point out, this amounts to f_w being an approximation of the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), where V^π(s) is the value of following policy π from state s. It is this advantage function that we will estimate and use for gradient ascent.

Using this basic formulation we derive an on-line version of the learning rule, where the policy's weights are updated with each state visited. The total reward criterion for Goofspiel is identical to having γ = 1 in the discounted setting. So, d^π(s) is just the probability of visiting state s when following policy π. Since we will be visiting states on-policy, this amounts to updating weights in proportion to how often the state is visited. By doing updates on-line as states are visited, we can simply drop this term from equation 2, resulting in,

    θ_{k+1} = θ_k + α_k Σ_a φ_sa π(s, a) f_{w_k}(s, a).    (3)

Lastly, we will do the policy improvement step (updating θ) simultaneously with the value estimation step (updating w). We will do value estimation using gradient-descent Sarsa(0) (Sutton & Barto 1998) over the same feature space as the policy. Specifically, if at time k the system is in state s and takes action a, transitioning to state s' and then taking action a', we update the weight vector,

    w_{k+1} = w_k + β_k ( r + γ Q_{w_k}(s', a') − Q_{w_k}(s, a) ) φ_sa.    (4)

The policy improvement step uses equation 3, where s is the state of the system at time k and the action-value estimates from Sarsa, Q_{w_k}, are used to compute the advantage term,

    f_{w_k}(s, a) = Q_{w_k}(s, a) − Σ_b π(s, b; θ_k) Q_{w_k}(s, b).
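The on-line updates in equations 3 and 4 can be sketched as a single actor-critic step as follows. This is our own illustrative rendering, not the authors' code; it assumes a sparse feature function phi(s, a) that returns the indices of the active tiles, as produced by tile coding, and numpy parameter vectors theta and w. The WoLF principle, described next, only changes which α is used in the final loop.

```python
import numpy as np

def gibbs_policy(theta, phi, s, actions):
    """pi(s, a) proportional to exp(theta . phi_sa); phi returns active tile indices."""
    prefs = np.array([theta[phi(s, a)].sum() for a in actions])
    e = np.exp(prefs - prefs.max())          # subtract max for numerical stability
    return e / e.sum()

def actor_critic_step(theta, w, phi, s, a, r, s_next, a_next, actions, alpha, beta, gamma=1.0):
    """One on-line update: Sarsa(0) critic step (equation 4), then policy step (equation 3)."""
    pi = gibbs_policy(theta, phi, s, actions)
    q = np.array([w[phi(s, b)].sum() for b in actions])          # Q_w(s, .)

    # Critic (equation 4): gradient-descent Sarsa(0) on the linear action-value estimate.
    delta = r + gamma * w[phi(s_next, a_next)].sum() - q[actions.index(a)]
    w[phi(s, a)] += beta * delta

    # Actor (equation 3): advantage f_w(s, b) = Q_w(s, b) - sum_b' pi(s, b') Q_w(s, b').
    baseline = pi @ q
    for i, b in enumerate(actions):
        theta[phi(s, b)] += alpha * pi[i] * (q[i] - baseline)
```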
Win or Learn Fast

WoLF ("Win or Learn Fast") is a method for changing the learning rate to encourage convergence in a multiagent reinforcement learning scenario (Bowling & Veloso 2002). Notice that the gradient ascent algorithm described does not account for the non-stationary environment that arises with simultaneous learning in stochastic games. All of the other agents' actions are simply assumed to be part of the environment and unchanging. WoLF provides a simple way to account for other agents by adjusting how quickly or slowly the agent changes its policy. Since only the rate of learning is changed, algorithms that are guaranteed to find (locally) optimal policies in nonstationary environments retain this property even when using WoLF. In stochastic games with simultaneous learning, WoLF has both theoretical evidence (limited to two-player, two-action matrix games) and empirical evidence (experiments in matrix games, as well as smaller zero-sum and general-sum stochastic games) that it encourages convergence in algorithms that don't otherwise converge (Bowling & Veloso 2002). The intuition for this technique is that a learner should adapt quickly when it is doing more poorly than expected. When it is doing better than expected, it should be cautious, since the other players are likely to change their policy. This implicitly accounts for other players that are learning, rather than trying to explicitly reason about their action choices as other techniques do.

The WoLF principle naturally lends itself to policy gradient techniques where there is a well-defined learning rate, α_k. With WoLF we replace the original learning rate with two learning rates, α^w_k < α^l_k, to be used when winning or losing, respectively. One determination of winning and losing that has been successful is to compare the value of the current policy, V^π(s), to the value of the average policy over time, V^π̄(s). With the policy gradient technique above we can define a similar rule that examines the approximate value, using Q_w, of the current weight vector θ against the average weight vector over time, θ̄. Specifically, we are winning if and only if,

    Σ_a π(s, a; θ) Q_w(s, a) > Σ_a π(s, a; θ̄) Q_w(s, a).    (5)

When winning in a particular state, we update the parameters for that state using α^w_k, otherwise α^l_k.

Learning in Goofspiel

We combine these three techniques in the obvious way. Tile coding provides a large boolean feature vector for any state-action pair. This is used both for the parameterization of the policy and for the approximation of the policy's value, which is used to compute the policy's gradient. Gradient updates are then performed on both the policy, using equation 3, and the value estimate, using equation 4. WoLF is used to vary the learning rate α_k in the policy update according to the rule in inequality 5. This composition can essentially be thought of as an actor-critic method (Sutton & Barto 1998). Here the Gibbs distribution over the set of parameters is the actor, and the gradient-descent Sarsa(0) is the critic. Tile coding provides the necessary parameterization of the state. The WoLF principle adjusts how the actor changes policies based on the response from the critic.

The main detail yet to be explained, and where the algorithm is specifically adapted to Goofspiel, is the tile coding. The method of tiling is extremely important to the overall performance of learning, as it is a powerful bias on what policies can and will be learned. The major decision to be made is how to represent the state as a vector of numbers and which of these numbers are tiled together. The first decision determines what states are distinguishable, and the second determines how generalization works across distinguishable states. Despite the importance of the tiling, we essentially selected what seemed like a reasonable tiling and used it throughout our results.

We represent a set of cards, either a player's hand or the deck, by five numbers, corresponding to the value of the card that is the minimum, lower quartile, median, upper quartile, and maximum. This provides information as to the general shape of the set, which is what is important in Goofspiel. The other values used in the tiling are the value of the card that is being bid on and the card corresponding to the agent's action. An example of this process in the 13-card game is shown in Table 2. These values are combined together into three tilings. The first tiles together the quartiles describing the players' hands. The second tiles together the quartiles of the deck with the card available and the player's action. The last tiles together the quartiles of the opponent's hand with the card available and the player's action. The tilings use tile sizes equal to roughly half the number of cards in the game, with the number of tilings greater than the tile size so as to distinguish between any integer state values. Finally, these tiles were all then hashed into a table of size one million in order to keep the parameter space manageable. We don't suggest that this is a perfect or even good tiling for this domain, but as we will show, the results are still interesting.
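The winning test of inequality 5 and the resulting choice of step size can be sketched as follows. This is again our own illustration, not the authors' code; theta_avg, a running average of the policy parameters maintained alongside theta, is an assumed bookkeeping variable, and phi is the tile-coding feature function as before.

```python
import numpy as np

def gibbs(theta, phi, s, actions):
    """pi(s, .) proportional to exp(theta . phi_sa) over the available actions."""
    prefs = np.array([theta[phi(s, a)].sum() for a in actions])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def wolf_step_size(theta, theta_avg, w, phi, s, actions, alpha_win, alpha_lose):
    """Inequality 5: winning iff the current policy looks better than the average policy under Q_w."""
    q = np.array([w[phi(s, a)].sum() for a in actions])
    winning = gibbs(theta, phi, s, actions) @ q > gibbs(theta_avg, phi, s, actions) @ q
    return alpha_win if winning else alpha_lose

# theta_avg can be maintained incrementally, e.g. theta_avg += (theta - theta_avg) / k.
```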
Results

One of the difficult and open issues in multiagent reinforcement learning is that of evaluation. Before presenting learning results, we first need to look at how one evaluates learning success.

Evaluation

One straightforward evaluation technique is to have two learning algorithms learn against each other and simply examine the expected reward over time. This technique is not useful if one is interested in learning in self-play, where both players use an identical algorithm. In this case, with a symmetric zero-sum game like Goofspiel, the expected reward of the two agents is necessarily zero, providing no information.

Another common evaluation criterion is that of convergence. This is true in single-agent learning as well as multiagent learning. One strong motivation for considering this criterion in multiagent domains is the connection of convergence to Nash equilibria. If algorithms that are guaranteed to converge to optimal policies in stationary environments converge in a multiagent learning environment, then the resulting joint policy must be a Nash equilibrium of the stochastic game (Bowling & Veloso 2002). Although convergence to an equilibrium is an ideal criterion for small problems, there are a number of reasons why this is unlikely to be possible for large problems. First, optimality in large (even stationary) environments is not generally feasible. This is exactly the motivation for exploring function approximation and policy parameterizations. Second, when we account for the limitations that approximation imposes on a player's policy, equilibria may cease to exist, making convergence of policies impossible (Bowling & Veloso 2002b). Third, policy gradient techniques learn only locally optimal policies. They may converge to policies that are not globally optimal and therefore necessarily not equilibria.

Although convergence to equilibria, and therefore convergence in general, is not a reasonable criterion, we would still expect self-play learning agents to learn something. In this paper we use the evaluation technique used by Littman with Minimax-Q (Littman 1994). We train an agent in self-play, then freeze its policy and train a challenger to find that policy's worst-case performance. This challenger is trained using just gradient-descent Sarsa and chooses the action with maximum estimated value, with ε-greedy exploration. Notice that the possible policies playable by the challenger are the deterministic policies (modulo exploration) playable by the learning algorithm being evaluated.

Table 2: An example state-action representation using quartiles to describe the players' hands and the deck. These numbers are then tiled and hashed, with the resulting tiles representing a boolean vector of size 10^6.

    My Hand Quartiles:   1, 4, 6, 8, 13
    Opp Hand Quartiles:  4, 8, 10, 11, 13
    Deck Quartiles:      1, 3, 9, 10, 12
    Card:                11
    Action:              3
    -> (Tile Coding) -> TILES ∈ {0, 1}^(10^6)

Since Goofspiel is a symmetric zero-sum game, we know that the equilibrium policy, if one exists, would have value zero against its challenger. So, this provides some measure of how close the policy is to the equilibrium, by examining its value against its challenger.

A second, related criterion will also help to understand the performance of the algorithm. Although policy convergence might not be possible, convergence of the expected value of the agents' policies may be possible. Since the real desirability of policy convergence is the convergence of the policy's value, this is in fact often just as good. This is also one of the strengths of the WoLF variable learning rate, as it has been shown to make learning algorithms with cycling policies and expected values converge both in expected value and policy.

Experiments

Throughout our experiments, we examined three different learning algorithms in self-play. The first two did not use the WoLF variable learning rate, and instead followed a static step size: Fast used a large step size of α_k = 0.16, and Slow used a small step size of α_k = 0.008. WoLF switched between these learning rates based on inequality 5. In all experiments, the value estimation update used a fixed learning rate of β = 0.2. These rates were not decayed, in order to better isolate the effectiveness apart from appropriate selection of decay schedules. In addition, throughout training and evaluation runs, all agents followed an ε-greedy exploration strategy with a fixed ε. The initial policies and values all begin with zero weight vectors, which with a Gibbs distribution corresponds to the random policy, which as we have noted is reasonably good.

In our first experiment we trained the learner in self-play for 40,000 games. After every 5,000 games we stopped the training and trained a challenger against the agent's current policy. The challenger was trained on 10,000 games using Sarsa(0) gradient ascent with the learning rate parameters described above. The two policies, the agent's and its challenger's, were then evaluated on 1,000 games to estimate the policy's worst-case expected value. This experiment was repeated thirty times for each algorithm.

The learning results, averaged over the thirty runs, are shown in Figure 3 for card sizes of 4, 8, and 13. The baseline comparison is with the random policy, a very competitive policy for this game. All three learners improve on this policy while training in self-play. The initial dips in the 8 and 13 card games are due to the fact that value estimates are initially very poor, making the initial policy gradients point away from increasing the overall value of the policy. It takes a number of training games for the delayed reward of winning cards later to overcome the initial immediate reward of winning cards now. Lastly, notice the effect of the WoLF principle. It consistently outperforms the two static step size learners. This is identical to effects shown in non-approximated stochastic games (Bowling & Veloso 2002).

The second experiment was to further examine the issue of convergence and the effect of the WoLF principle on the learning process. Instead of examining worst-case performance against some fictitious challenger, we now examine the expected value of the player's policy while learning in self-play. Again the algorithm was trained in self-play for 40,000 games.
After every 50 games, both players' policies were frozen and evaluated over 1,000 games to find the expected value to the players at that moment. We ran each algorithm once on just the 13-card game and plotted its expected value over time while learning. The results are shown in Figure 4. Notice that the expected values of all the learning algorithms show some oscillation around zero. We would expect this with identical learners in a symmetric zero-sum game. The point of interest, though, is how close these oscillations stay to zero over time. The WoLF principle causes the policies to have a more constant expected value with lower amplitude oscillations. This again shows that the WoLF principle continues to have converging effects even in stochastic games with approximation techniques.

Conclusion

We have described a scalable learning algorithm for stochastic games, composed of three reinforcement learning ideas. We showed preliminary results of this algorithm learning in the game Goofspiel.

Figure 3: Worst-case expected value of the policy learned in self-play. The panels (for n = 4, 8, and 13) plot the value against a worst-case opponent versus the number of training games for the WoLF, Fast, and Slow learners and the random-policy baseline.

Figure 4: Expected value of the game while learning, plotted against the number of games.

These results demonstrate that the policy gradient approach using an actor-critic model can learn in this domain. In addition, the WoLF principle for encouraging convergence also seems to hold even when using approximation and generalization techniques.

There are a number of directions for future work. Within the game of Goofspiel, it would be interesting to explore alternative ways of tiling the state-action space. This could likely increase the overall performance of the learned policy, but would also examine how generalization might affect the convergence of learning. Might certain generalization techniques retain the existence of an equilibrium, and is the equilibrium learnable? Another important direction is to examine these techniques on more domains, with possibly continuous state and action spaces. Also, it would be interesting to vary some of the components of the system. Can we use a different approximator than tile coding? Do we achieve similar results with different policy gradient techniques (e.g., GPOMDP (Baxter & Bartlett 2000))? These initial results, though, show promise that gradient ascent and the WoLF principle can scale to large state spaces.

References

Baxter, J., and Bartlett, P. L. 2000. Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning. Stanford University: Morgan Kaufmann.

Bowling, M., and Veloso, M. 2001. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence.

Bowling, M., and Veloso, M. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence. In press.

Bowling, M., and Veloso, M. M. 2002b. Existence of multiagent equilibria with limited agents. Technical Report CMU-CS, Computer Science Department, Carnegie Mellon University.

Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Fink, A. M. 1964. Equilibrium in a stochastic n-person game. Journal of Science in Hiroshima University, Series A-I 28.

Flood, M. 1985. Interview by Albert Tucker. The Princeton Mathematics Community in the 1930s, Transcript Number 11.

Gordon, G. 2000. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 12. MIT Press.

Greenwald, A., and Hall, K. 2002. Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. In press.

Hu, J., and Wellman, M. P. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann.

Kuhn, H. W., ed. 1997. Classics in Game Theory. Princeton University Press.

Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

Littman, M. 2001. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning. Williams College: Morgan Kaufmann.

Nash, Jr., J. F. 1950. Equilibrium points in n-person games. PNAS 36. Reprinted in (Kuhn 1997).

Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. The MIT Press.

Samuel, A. L. 1967. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development 11.

Shapley, L. S. 1953. Stochastic games. PNAS 39. Reprinted in (Kuhn 1997).

Singh, S.; Kearns, M.; and Mansour, Y. 2000. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.
Stone, P., and Sutton, R. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning. Williams College: Morgan Kaufmann.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12. MIT Press.

Tesauro, G. J. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38.


More information

Chapter 3 Polynomials

Chapter 3 Polynomials Dr M DRAIEF As described in the introduction of Chpter 1, pplictions of solving liner equtions rise in number of different settings In prticulr, we will in this chpter focus on the problem of modelling

More information

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

CHM Physical Chemistry I Chapter 1 - Supplementary Material

CHM Physical Chemistry I Chapter 1 - Supplementary Material CHM 3410 - Physicl Chemistry I Chpter 1 - Supplementry Mteril For review of some bsic concepts in mth, see Atkins "Mthemticl Bckground 1 (pp 59-6), nd "Mthemticl Bckground " (pp 109-111). 1. Derivtion

More information

Mapping the delta function and other Radon measures

Mapping the delta function and other Radon measures Mpping the delt function nd other Rdon mesures Notes for Mth583A, Fll 2008 November 25, 2008 Rdon mesures Consider continuous function f on the rel line with sclr vlues. It is sid to hve bounded support

More information

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations.

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations. Lecture 3 3 Solving liner equtions In this lecture we will discuss lgorithms for solving systems of liner equtions Multiplictive identity Let us restrict ourselves to considering squre mtrices since one

More information

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms Mchine Lerning, 39, 287 308, 2000. c 2000 Kluwer Acdemic Publishers. Printed in The Netherlnds. Convergence Results for Single-Step On-Policy Reinforcement-Lerning Algorithms SATINDER SINGH AT&T Lbs-Reserch,

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

The Wave Equation I. MA 436 Kurt Bryan

The Wave Equation I. MA 436 Kurt Bryan 1 Introduction The Wve Eqution I MA 436 Kurt Bryn Consider string stretching long the x xis, of indeterminte (or even infinite!) length. We wnt to derive n eqution which models the motion of the string

More information

Density of Energy Stored in the Electric Field

Density of Energy Stored in the Electric Field Density of Energy Stored in the Electric Field Deprtment of Physics, Cornell University c Tomás A. Aris October 14, 01 Figure 1: Digrm of Crtesin vortices from René Descrtes Principi philosophie, published

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

Applying Q-Learning to Flappy Bird

Applying Q-Learning to Flappy Bird Applying Q-Lerning to Flppy Bird Moritz Ebeling-Rump, Mnfred Ko, Zchry Hervieux-Moore Abstrct The field of mchine lerning is n interesting nd reltively new re of reserch in rtificil intelligence. In this

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

fractions Let s Learn to

fractions Let s Learn to 5 simple lgebric frctions corne lens pupil retin Norml vision light focused on the retin concve lens Shortsightedness (myopi) light focused in front of the retin Corrected myopi light focused on the retin

More information

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation Strong Bisimultion Overview Actions Lbeled trnsition system Trnsition semntics Simultion Bisimultion References Robin Milner, Communiction nd Concurrency Robin Milner, Communicting nd Mobil Systems 32

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Reinforcement Learning and Policy Reuse

Reinforcement Learning and Policy Reuse Reinforcement Lerning nd Policy Reue Mnuel M. Veloo PEL Fll 206 Reding: Reinforcement Lerning: An Introduction R. Sutton nd A. Brto Probbilitic policy reue in reinforcement lerning gent Fernndo Fernndez

More information

This lecture covers Chapter 8 of HMU: Properties of CFLs

This lecture covers Chapter 8 of HMU: Properties of CFLs This lecture covers Chpter 8 of HMU: Properties of CFLs Turing Mchine Extensions of Turing Mchines Restrictions of Turing Mchines Additionl Reding: Chpter 8 of HMU. Turing Mchine: Informl Definition B

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information