Scalable Learning in Stochastic Games


Michael Bowling and Manuel Veloso
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA

(Copyright 2002, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.)

Abstract

Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are landmark results of learning being successfully applied to specific large and complex games such as Checkers and Backgammon. In this paper we describe a scalable learning algorithm for stochastic games that combines three separate ideas from reinforcement learning into a single algorithm. These ideas are tile coding for generalization, policy gradient ascent as the basic learning method, and our previous work on the WoLF ("Win or Learn Fast") variable learning rate to encourage convergence. We apply this algorithm to the intractably sized game-theoretic card game Goofspiel, showing preliminary results of learning in self-play. We demonstrate that policy gradient ascent can learn even in this highly non-stationary problem with simultaneous learning. We also show that the WoLF principle continues to have a converging effect even in large problems with approximation and generalization.

Introduction

We are interested in the problem of learning in multiagent environments. One of the main challenges with these environments is that other agents in the environment may be learning and adapting as well. These environments are, therefore, no longer stationary. They violate the Markov property that traditional single-agent behavior learning relies upon. The model of stochastic games captures these problems very well through explicit models of the reward functions of the other agents and their effects on transitions. Stochastic games are also a natural extension of Markov decision processes (MDPs) to multiple agents and so have attracted interest from the reinforcement learning community.

The problem of simultaneously finding optimal policies for stochastic games has been well studied in the field of game theory. The traditional solution concept is that of Nash equilibria, a policy for all the players where each is playing optimally with respect to the others. This concept is a powerful solution for these games even in a learning context, since no agent could learn a better policy when all the agents are playing an equilibrium. It is this foundation that has driven much of the recent work in applying reinforcement learning to stochastic games (Littman 1994; Hu & Wellman 1998; Singh, Kearns, & Mansour 2000; Littman 2001; Bowling & Veloso 2002; Greenwald & Hall 2002). This work has thus far only been applied to small games with enumerable state and action spaces.

Historically, though, a number of landmark results in reinforcement learning have looked at learning in particular stochastic games that are neither small nor easily enumerated. Samuel's Checkers-playing program (Samuel 1967) and Tesauro's TD-Gammon (Tesauro 1995) are successful applications of learning in games with very large state spaces. Both of these results made generous use of generalization and approximation, which have not been used in the more recent work. On the other hand, both TD-Gammon and Samuel's Checkers player only used deterministic strategies to play competitively, while Nash equilibria often require stochastic strategies. We are interested in scaling some of the recent techniques based on the Nash equilibrium concept to games with intractable state spaces. Such a goal is not new.
Singh and colleagues also described future work of applying their simple gradient techniques to problems with large or infinite state and action spaces (Singh, Kearns, & Mansour 2000). This paper examines some initial results in this direction. We first describe the formal definition of a stochastic game and the notion of equilibria. We then describe one particular very large, two-player, zero-sum stochastic game, Goofspiel. Our learning algorithm is described as the combination of three ideas from reinforcement learning: tile coding, policy gradients, and the WoLF principle. We then show results of our algorithm learning to play Goofspiel in self-play. Finally, we conclude with some future directions for this work.

Stochastic Games

A stochastic game is a tuple (n, S, A_1..n, T, R_1..n), where n is the number of agents, S is a set of states, A_i is the set of actions available to agent i (and A is the joint action space A_1 × ... × A_n), T is a transition function S × A × S → [0, 1], and R_i is a reward function for the ith agent, S × A → R.
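To make the definition concrete, the following is a minimal sketch of this tuple as a small Python interface. It is our own illustration, not from the paper; names such as StochasticGame and JointAction are hypothetical, and it assumes finite, enumerable state and action sets. A Goofspiel-sized game would never be enumerated this way explicitly; the point is only to fix the types of T and R_i.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A joint action is one action index per agent.
JointAction = Tuple[int, ...]

@dataclass
class StochasticGame:
    """Minimal container mirroring the tuple (n, S, A_1..n, T, R_1..n)."""
    n_agents: int                                            # n
    states: Sequence[int]                                    # S (enumerated)
    action_sets: Sequence[Sequence[int]]                     # A_i for each agent i
    transition: Callable[[int, JointAction, int], float]     # T(s, a, s') in [0, 1]
    rewards: Sequence[Callable[[int, JointAction], float]]   # R_i(s, a)

    def expected_reward(self, agent: int, s: int, a: JointAction) -> float:
        """Immediate reward to one agent for joint action a taken in state s."""
        return self.rewards[agent](s, a)
```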

This looks very similar to the MDP framework, except that we have multiple agents selecting actions and the next state and rewards depend on the joint action of the agents. Another important difference is that each agent has its own separate reward function. The goal for each agent is to select actions in order to maximize its discounted future rewards with discount factor γ.

Stochastic games are a very natural extension of MDPs to multiple agents. They are also an extension of matrix games to multiple states. Two example matrix games are in Figure 1. In these games there are two players; one selects a row and the other selects a column of the matrix. The entry of the matrix they jointly select determines the payoffs. The games in Figure 1 are zero-sum games, so the row player would receive the payoff in the matrix, and the column player would receive the negative of that payoff. In the general case (general-sum games), each player would have a separate matrix that determines their payoffs.

Figure 1: Matching Pennies and Rock-Paper-Scissors matrix games.

Each state in a stochastic game can be viewed as a matrix game with the payoffs for each joint action determined by the matrices R_i(s, a). After playing the matrix game and receiving their payoffs, the players are transitioned to another state (or matrix game) determined by their joint action. We can see that stochastic games then contain both MDPs and matrix games as subsets of the framework.

Stochastic Policies. Unlike in single-agent settings, deterministic policies in multiagent settings can often be exploited by the other agents. Consider the Matching Pennies matrix game as shown in Figure 1. If the column player were to play either action deterministically, the row player could win every time. This requires us to consider mixed strategies and stochastic policies. A stochastic policy, ρ : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions.

Nash Equilibria. Even with the concept of mixed strategies, there are still no optimal strategies that are independent of the other players' strategies. We can, though, define a notion of best response. A strategy is a best response to the other players' strategies if it is optimal given their strategies. The major advancement that has driven much of the development of matrix games, game theory, and even stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr. 1950). A Nash equilibrium is a collection of strategies, one for each of the players, such that each player's strategy is a best response to the other players' strategies. So, no player can do better by changing strategies given that the other players also don't change strategies. What makes the notion of equilibrium compelling is that all matrix games have such an equilibrium, possibly having multiple equilibria. Zero-sum, two-player games, where one player's payoffs are the negative of the other's, have a single Nash equilibrium.¹ In the zero-sum examples in Figure 1, both games have an equilibrium consisting of each player playing the mixed strategy where all the actions have equal probability. The concept of equilibria also extends to stochastic games. This is a non-trivial result, proven by Shapley (Shapley 1953) for zero-sum stochastic games and by Fink (Fink 1964) for general-sum stochastic games.

¹ There can actually be multiple equilibria, but they will all have equal payoffs and are interchangeable (Osborne & Rubinstein 1994).
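As a quick check of the claim above that the uniform mixed strategy is an equilibrium of both games in Figure 1, the short sketch below (our own illustration, not from the paper) shows that in Rock-Paper-Scissors no pure row action gains anything against a uniform column strategy, so the game's equilibrium value is zero.

```python
import numpy as np

# Row player's payoffs in Rock-Paper-Scissors (zero-sum; the column player gets the negative).
rps = np.array([[ 0, -1,  1],
                [ 1,  0, -1],
                [-1,  1,  0]], dtype=float)

uniform = np.ones(3) / 3.0

# Expected payoff of each pure row action against the uniform column strategy.
payoffs_vs_uniform = rps @ uniform
print(payoffs_vs_uniform)        # [0. 0. 0.] -- no pure deviation gains anything
print(uniform @ rps @ uniform)   # 0.0        -- the value of the game at equilibrium
```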
Learning in Stochastic Games. Stochastic games have been the focus of recent research in the area of reinforcement learning, where two different approaches are being explored. The first is that of algorithms that explicitly learn equilibria through experience, independent of the other players' policies (Littman 1994; Hu & Wellman 1998; Greenwald & Hall 2002). These algorithms iteratively estimate value functions and use them to compute an equilibrium for the game. A second approach is that of best-response learners (Claus & Boutilier 1998; Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002). These learners explicitly optimize their reward with respect to the other players' (changing) policies. This approach, too, has a strong connection to equilibria: if these algorithms converge when playing each other, then they must do so to an equilibrium (Bowling & Veloso 2001).

Neither of these approaches, though, has been scaled beyond games with a few hundred states. Games with a very large number of states, or games with continuous state spaces, make state enumeration intractable. Since previous algorithms in their stated form require the enumeration of states either for policies or for value functions, this is a major limitation. In this paper we examine learning in a very large stochastic game, using approximation and generalization techniques. Specifically, we will build on the idea of best-response learners using gradient techniques (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002). We first describe an interesting game with an intractably large state space.

Goofspiel

Goofspiel (or The Game of Pure Strategy) was invented by Merrill Flood while at Princeton (Flood 1985). The game has numerous variations, but here we focus on the simple two-player, n-card version. Each player receives a suit of cards numbered 1 through n; a third suit of cards is shuffled and placed face down as the deck. Each round the next card is flipped over from the deck, and the two players each select a card, placing it face down. They are revealed simultaneously, and the player with the highest card wins the card from the deck, which is worth its number in points.

If the players choose the same valued card, then neither player gets any points. Regardless of the winner, both players discard their chosen card. This is repeated until the deck and the players' hands are exhausted. The winner is the player with the most points.

This game has numerous interesting properties, making it a very interesting step between toy problems and more realistic problems. First, notice that this game is zero-sum, and as with many zero-sum games, any deterministic strategy can be soundly defeated. In this game, it's by simply playing the card one higher than the other player's deterministically chosen card. Second, notice that the number of states and state-action pairs grows exponentially with the number of cards. The standard size of the game, n = 13, is so large that just storing one player's policy or Q-table would require approximately 2.5 terabytes of space. Just gathering data on all the state-action transitions would require an impractically large number of playings of the game. Table 1 shows the number of states and state-action pairs as well as the policy size for three different values of n. This game obviously requires some form of generalization to make learning possible. Another interesting property is that randomly selecting actions is a reasonably good policy. The worst-case values of the random policy, along with the worst-case values of the best deterministic policy, are also shown in Table 1.

This game can be described using the stochastic game model. The state is the current cards in the players' hands and the deck, along with the upturned card. The actions for a player are the cards in the player's hand. The transitions follow the rules as described, with an immediate reward going to the player who won the upturned card. Since the game has a finite end and we are interested in maximizing total reward, we can set the discount factor γ to be 1.

Although equilibrium learning techniques such as Minimax-Q (Littman 1994) are guaranteed to find the game's equilibrium, they require maintaining a state-joint-action table of values. This table would require 20.1 terabytes to store for the n = 13 card game. We will now describe a best-response learning algorithm using approximation techniques to handle the enormous state space.
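To make the state, action, and transition structure of Goofspiel concrete, here is a minimal sketch of one round under the rules above. It is our own illustration; the function and variable names, such as play_round, are hypothetical and not from the paper. Running it with two random bidders simulates the random-versus-random baseline discussed above.

```python
import random

def play_round(my_hand, opp_hand, deck, my_policy, opp_policy, scores):
    """Resolve one Goofspiel round: flip a prize card, both players bid, update scores.

    Hands and the deck are sets of card values; a policy maps (hand, prize) to a chosen card.
    """
    prize = random.choice(sorted(deck))   # flipping from a shuffled deck
    deck.remove(prize)

    my_bid = my_policy(my_hand, prize)
    opp_bid = opp_policy(opp_hand, prize)
    my_hand.remove(my_bid)
    opp_hand.remove(opp_bid)

    if my_bid > opp_bid:                  # higher bid wins the prize card's value in points
        scores[0] += prize
    elif opp_bid > my_bid:
        scores[1] += prize
    # equal bids: the prize card is simply discarded, nobody scores

n = 13
my_hand, opp_hand, deck = set(range(1, n + 1)), set(range(1, n + 1)), set(range(1, n + 1))
scores = [0, 0]
random_bid = lambda hand, prize: random.choice(sorted(hand))
while deck:
    play_round(my_hand, opp_hand, deck, random_bid, random_bid, scores)
print(scores)
```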
Three Ideas, One Algorithm

The algorithm we will use combines three separate ideas from reinforcement learning. The first is the idea of tile coding as generalization for linear function approximation. The second is the use of a parameterized policy and learning as gradient ascent in the policy's parameter space. The final component is the use of a WoLF variable learning rate to adjust the gradient ascent step size. We will briefly overview these three techniques and then describe how they are combined into a reinforcement learning algorithm for Goofspiel.

Tile Coding. Tile coding (Sutton & Barto 1998), also known as CMACs, is a popular technique for creating a set of boolean features from a set of continuous features. In reinforcement learning, tile coding has been used extensively to create linear approximators of state-action values (e.g., (Stone & Sutton 2001)).

Figure 2: An example of tile coding a two-dimensional space with two overlapping tilings.

The basic idea is to lay offset grids or tilings over the multidimensional continuous feature space. A point in the continuous feature space will be in exactly one tile for each of the offset tilings. Each tile has an associated boolean variable, so the continuous feature vector gets mapped into a very high-dimensional boolean vector. In addition, nearby points will fall into the same tile for many of the offset grids, and so share many of the same boolean variables in their resulting vector. This provides the important feature of generalization. An example of tile coding in a two-dimensional continuous space is shown in Figure 2. This example shows two overlapping tilings, and so any given point falls into two different tiles.

Another common trick with tile coding is the use of hashing to keep the number of parameters manageable. Each tile is hashed into a table of fixed size. Collisions are simply ignored, meaning that two unrelated tiles may share the same parameter. Hashing reduces the memory requirements with little loss in performance. This is because only a small fraction of the continuous space is actually needed or visited while learning, and so independent parameters for every tile are often not necessary. Hashing provides a means for using only the number of parameters the problem requires while not knowing in advance which state-action pairs need parameters.
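A minimal sketch of the tile coding and hashing just described, assuming integer-valued features (as in the Goofspiel representation later). This is our own simplified illustration, not the authors' implementation; real tile-coding code differs in detail.

```python
def tile_indices(features, num_tilings, tile_size, table_size):
    """Map an integer feature vector to one hashed tile index per tiling.

    Each tiling is offset by a different fraction of the tile size, so nearby
    feature vectors share most of their tiles, which provides generalization.
    Hash collisions are simply ignored, as described in the text.
    """
    indices = []
    for t in range(num_tilings):
        # Quantize each feature into a tile coordinate for this offset tiling.
        coords = tuple((f * num_tilings - t * tile_size) // (tile_size * num_tilings)
                       for f in features)
        indices.append(hash((t,) + coords) % table_size)
    return indices

# Two nearby state-action feature vectors share most of their hashed tiles.
a = tile_indices((1, 4, 6, 8, 13, 11, 3), num_tilings=8, tile_size=6, table_size=1_000_000)
b = tile_indices((1, 4, 5, 8, 13, 11, 3), num_tilings=8, tile_size=6, table_size=1_000_000)
print(len(set(a) & set(b)), "of", len(a), "tiles shared")
```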

Table 1: The approximate number of states and state-actions, and the size of a stochastic policy or Q-table for Goofspiel, depending on the number of cards, n. The VALUE columns list the worst-case value of the best deterministic policy and the random policy, respectively. (Columns: n, |S|, |S × A|, SIZEOF(π or Q), VALUE(det), VALUE(random); the stored policy sizes range from kilobytes through megabytes to terabytes as n grows.)

Policy Gradient Ascent

Policy gradient techniques (Sutton et al. 2000; Baxter & Bartlett 2000) are a method of reinforcement learning with function approximation. Traditional approaches approximate a state-action value function, and result in a deterministic policy that selects the action with the maximum learned value. Alternatively, policy gradient approaches approximate a policy directly, and then use gradient ascent to adjust the parameters to maximize the policy's value. There are three good reasons for the latter approach. First, there is a whole body of theoretical work describing convergence problems when using a variety of value-based learning techniques with a variety of function approximation techniques (see (Gordon 2000) for a summary of these results). Second, value-based approaches learn deterministic policies, and as we mentioned earlier, deterministic policies in multiagent settings are often easily exploitable. Third, gradient techniques have been shown to be successful for simultaneous learning in matrix games (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002).

We use the policy gradient technique presented by Sutton and colleagues (Sutton et al. 2000). Specifically, we will define a policy as a Gibbs distribution over a linear combination of features, such as those taken from a tile coding representation of state-actions. Let θ be a vector of the policy's parameters and φ_sa be a feature vector for state s and action a; then this defines a stochastic policy according to,

    π(s, a) = e^(θ·φ_sa) / Σ_b e^(θ·φ_sb).

Their main result was a convergence proof for the following policy iteration rule that updates the policy's parameters,

    θ_{k+1} = θ_k + α_k Σ_s d^{π_k}(s) Σ_a (∂π_k(s, a)/∂θ) f_{w_k}(s, a).    (1)

For the Gibbs distribution this is just,

    θ_{k+1} = θ_k + α_k Σ_s d^{π_k}(s) Σ_a φ_sa π(s, a) f_{w_k}(s, a).    (2)

Here α_k is an appropriately decayed learning rate, and d^{π_k}(s) is state s's contribution to the policy's overall value. This contribution is defined differently depending on whether the average or discounted start-state reward criterion is used. f_{w_k}(s, a) is an independent approximation, with parameters w, of Q^{π_k}(s, a), which is the expected value of taking action a from state s and then following the policy π_k. For a Gibbs distribution, Sutton and colleagues showed that for convergence this approximation should have the following form,

    f_w(s, a) = w · [ φ_sa − Σ_b π(s, b) φ_sb ].

As they point out, this amounts to f_w being an approximation of the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), where V^π(s) is the value of following policy π from state s. It is this advantage function that we will estimate and use for gradient ascent.

Using this basic formulation we derive an on-line version of the learning rule, where the policy's weights are updated with each state visited. The total reward criterion for Goofspiel is identical to having γ = 1 in the discounted setting. So, d^π(s) is just the probability of visiting state s when following policy π. Since we will be visiting states on-policy, this amounts to updating weights in proportion to how often the state is visited. By doing updates on-line as states are visited, we can simply drop this term from equation 2, resulting in,

    θ_{k+1} = θ_k + α_k Σ_a φ_sa π(s, a) f_{w_k}(s, a).    (3)

Lastly, we will do the policy improvement step (updating θ) simultaneously with the value estimation step (updating w). We will do value estimation using gradient-descent Sarsa(0) (Sutton & Barto 1998) over the same feature space as the policy. Specifically, if at time k the system is in state s and takes action a, transitioning to state s' and then taking action a', we update the weight vector,

    w_{k+1} = w_k + β_k ( r + γ Q_{w_k}(s', a') − Q_{w_k}(s, a) ) φ_sa.    (4)

The policy improvement step uses equation 3, where s is the state of the system at time k and the action-value estimates from Sarsa, Q_{w_k}, are used to compute the advantage term,

    f_{w_k}(s, a) = Q_{w_k}(s, a) − Σ_b π(s, b; θ_k) Q_{w_k}(s, b).
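The on-line updates in equations 3 and 4 can be sketched as a single actor-critic step as follows. This is our own illustrative rendering, not the authors' code; it assumes a sparse feature function phi(s, a) that returns the indices of the active tiles, as produced by tile coding, and numpy parameter vectors theta and w. The WoLF principle, described next, only changes which α is used in the final loop.

```python
import numpy as np

def gibbs_policy(theta, phi, s, actions):
    """pi(s, a) proportional to exp(theta . phi_sa); phi returns active tile indices."""
    prefs = np.array([theta[phi(s, a)].sum() for a in actions])
    e = np.exp(prefs - prefs.max())          # subtract max for numerical stability
    return e / e.sum()

def actor_critic_step(theta, w, phi, s, a, r, s_next, a_next, actions, alpha, beta, gamma=1.0):
    """One on-line update: Sarsa(0) critic step (equation 4), then policy step (equation 3)."""
    pi = gibbs_policy(theta, phi, s, actions)
    q = np.array([w[phi(s, b)].sum() for b in actions])          # Q_w(s, .)

    # Critic (equation 4): gradient-descent Sarsa(0) on the linear action-value estimate.
    delta = r + gamma * w[phi(s_next, a_next)].sum() - q[actions.index(a)]
    w[phi(s, a)] += beta * delta

    # Actor (equation 3): advantage f_w(s, b) = Q_w(s, b) - sum_b' pi(s, b') Q_w(s, b').
    baseline = pi @ q
    for i, b in enumerate(actions):
        theta[phi(s, b)] += alpha * pi[i] * (q[i] - baseline)
```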
Win or Learn Fast

WoLF ("Win or Learn Fast") is a method for changing the learning rate to encourage convergence in a multiagent reinforcement learning scenario (Bowling & Veloso 2002). Notice that the gradient ascent algorithm described does not account for the non-stationary environment that arises with simultaneous learning in stochastic games. All of the other agents' actions are simply assumed to be part of the environment and unchanging. WoLF provides a simple way to account for other agents by adjusting how quickly or slowly the agent changes its policy. Since only the rate of learning is changed, algorithms that are guaranteed to find (locally) optimal policies in nonstationary environments retain this property even when using WoLF. In stochastic games with simultaneous learning, WoLF has both theoretical evidence (limited to two-player, two-action matrix games) and empirical evidence (experiments in matrix games, as well as smaller zero-sum and general-sum stochastic games) that it encourages convergence in algorithms that don't otherwise converge (Bowling & Veloso 2002). The intuition for this technique is that a learner should adapt quickly when it is doing more poorly than expected. When it is doing better than expected, it should be cautious, since the other players are likely to change their policy. This implicitly accounts for other players that are learning, rather than trying to explicitly reason about their action choices as other techniques do.

The WoLF principle naturally lends itself to policy gradient techniques where there is a well-defined learning rate, α_k. With WoLF we replace the original learning rate with two learning rates, α^w_k < α^l_k, to be used when winning or losing, respectively. One determination of winning and losing that has been successful is to compare the value of the current policy, V^π(s), to the value of the average policy over time, V^π̄(s). With the policy gradient technique above we can define a similar rule that examines the approximate value, using Q_w, of the current weight vector θ against the average weight vector over time, θ̄. Specifically, we are winning if and only if,

    Σ_a π(s, a; θ) Q_w(s, a) > Σ_a π(s, a; θ̄) Q_w(s, a).    (5)

When winning in a particular state, we update the parameters for that state using α^w_k, otherwise α^l_k.

Learning in Goofspiel

We combine these three techniques in the obvious way. Tile coding provides a large boolean feature vector for any state-action pair. This is used both for the parameterization of the policy and for the approximation of the policy's value, which is used to compute the policy's gradient. Gradient updates are then performed on both the policy, using equation 3, and the value estimate, using equation 4. WoLF is used to vary the learning rate α_k in the policy update according to the rule in inequality 5. This composition can essentially be thought of as an actor-critic method (Sutton & Barto 1998). Here the Gibbs distribution over the set of parameters is the actor, and the gradient-descent Sarsa(0) is the critic. Tile coding provides the necessary parameterization of the state. The WoLF principle adjusts how the actor changes policies based on the response from the critic.

The main detail yet to be explained, and where the algorithm is specifically adapted to Goofspiel, is the tile coding. The method of tiling is extremely important to the overall performance of learning, as it is a powerful bias on what policies can and will be learned. The major decision to be made is how to represent the state as a vector of numbers and which of these numbers are tiled together. The first decision determines what states are distinguishable, and the second determines how generalization works across distinguishable states. Despite the importance of the tiling, we essentially selected what seemed like a reasonable tiling and used it throughout our results.

We represent a set of cards, either a player's hand or the deck, by five numbers, corresponding to the value of the card that is the minimum, lower quartile, median, upper quartile, and maximum. This provides information as to the general shape of the set, which is what is important in Goofspiel. The other values used in the tiling are the value of the card that is being bid on and the card corresponding to the agent's action. An example of this process in the 13-card game is shown in Table 2. These values are combined together into three tilings. The first tiles together the quartiles describing the players' hands. The second tiles together the quartiles of the deck with the card available and the player's action. The last tiles together the quartiles of the opponent's hand with the card available and the player's action. The tilings use tile sizes equal to roughly half the number of cards in the game, with the number of tilings greater than the tile size so as to distinguish between any integer state values. Finally, these tiles were all then hashed into a table of size one million in order to keep the parameter space manageable. We don't suggest that this is a perfect or even good tiling for this domain, but as we will show, the results are still interesting.
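The winning test of inequality 5 and the resulting choice of step size can be sketched as follows. This is again our own illustration, not the authors' code; theta_avg, a running average of the policy parameters maintained alongside theta, is an assumed bookkeeping variable, and phi is the tile-coding feature function as before.

```python
import numpy as np

def gibbs(theta, phi, s, actions):
    """pi(s, .) proportional to exp(theta . phi_sa) over the available actions."""
    prefs = np.array([theta[phi(s, a)].sum() for a in actions])
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def wolf_step_size(theta, theta_avg, w, phi, s, actions, alpha_win, alpha_lose):
    """Inequality 5: winning iff the current policy looks better than the average policy under Q_w."""
    q = np.array([w[phi(s, a)].sum() for a in actions])
    winning = gibbs(theta, phi, s, actions) @ q > gibbs(theta_avg, phi, s, actions) @ q
    return alpha_win if winning else alpha_lose

# theta_avg can be maintained incrementally, e.g. theta_avg += (theta - theta_avg) / k.
```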
Results

One of the difficult and open issues in multiagent reinforcement learning is that of evaluation. Before presenting learning results, we first need to look at how one evaluates learning success.

Evaluation

One straightforward evaluation technique is to have two learning algorithms learn against each other and simply examine the expected reward over time. This technique is not useful if one is interested in learning in self-play, where both players use an identical algorithm. In this case, with a symmetric zero-sum game like Goofspiel, the expected reward of the two agents is necessarily zero, providing no information.

Another common evaluation criterion is that of convergence. This is true in single-agent learning as well as multiagent learning. One strong motivation for considering this criterion in multiagent domains is the connection of convergence to Nash equilibria. If algorithms that are guaranteed to converge to optimal policies in stationary environments converge in a multiagent learning environment, then the resulting joint policy must be a Nash equilibrium of the stochastic game (Bowling & Veloso 2002). Although convergence to an equilibrium is an ideal criterion for small problems, there are a number of reasons why this is unlikely to be possible for large problems. First, optimality in large (even stationary) environments is not generally feasible. This is exactly the motivation for exploring function approximation and policy parameterizations. Second, when we account for the limitations that approximation imposes on a player's policy, equilibria may cease to exist, making convergence of policies impossible (Bowling & Veloso 2002b). Third, policy gradient techniques learn only locally optimal policies. They may converge to policies that are not globally optimal and therefore necessarily not equilibria.

Although convergence to equilibria, and therefore convergence in general, is not a reasonable criterion, we would still expect self-play learning agents to learn something. In this paper we use the evaluation technique used by Littman with Minimax-Q (Littman 1994). We train an agent in self-play, then freeze its policy and train a challenger to find that policy's worst-case performance. This challenger is trained using just gradient-descent Sarsa and chooses the action with maximum estimated value, with ε-greedy exploration. Notice that the possible policies playable by the challenger are the deterministic policies (modulo exploration) playable by the learning algorithm being evaluated.

Table 2: An example state-action representation using quartiles to describe the players' hands and the deck. These numbers are then tiled and hashed, with the resulting tiles representing a boolean vector of size 10^6.

    My Hand Quartiles:   1, 4, 6, 8, 13
    Opp Hand Quartiles:  4, 8, 10, 11, 13
    Deck Quartiles:      1, 3, 9, 10, 12
    Card:                11
    Action:              3
    -> (Tile Coding) -> TILES ∈ {0, 1}^(10^6)

Since Goofspiel is a symmetric zero-sum game, we know that the equilibrium policy, if one exists, would have value zero against its challenger. So, this provides some measure of how close the policy is to the equilibrium, by examining its value against its challenger.

A second, related criterion will also help to understand the performance of the algorithm. Although policy convergence might not be possible, convergence of the expected value of the agents' policies may be possible. Since the real desirability of policy convergence is the convergence of the policy's value, this is in fact often just as good. This is also one of the strengths of the WoLF variable learning rate, as it has been shown to make learning algorithms with cycling policies and expected values converge both in expected value and policy.

Experiments

Throughout our experiments, we examined three different learning algorithms in self-play. The first two did not use the WoLF variable learning rate, and instead followed a static step size: Fast used a large step size of α_k = 0.16, and Slow used a small step size of α_k = 0.008. WoLF switched between these learning rates based on inequality 5. In all experiments, the value estimation update used a fixed learning rate of β = 0.2. These rates were not decayed, in order to better isolate the effectiveness apart from appropriate selection of decay schedules. In addition, throughout training and evaluation runs, all agents followed an ε-greedy exploration strategy with a fixed ε. The initial policies and values all begin with zero weight vectors, which with a Gibbs distribution corresponds to the random policy, which as we have noted is reasonably good.

In our first experiment we trained the learner in self-play for 40,000 games. After every 5,000 games we stopped the training and trained a challenger against the agent's current policy. The challenger was trained on 10,000 games using Sarsa(0) gradient ascent with the learning rate parameters described above. The two policies, the agent's and its challenger's, were then evaluated on 1,000 games to estimate the policy's worst-case expected value. This experiment was repeated thirty times for each algorithm.

The learning results, averaged over the thirty runs, are shown in Figure 3 for card sizes of 4, 8, and 13. The baseline comparison is with the random policy, a very competitive policy for this game. All three learners improve on this policy while training in self-play. The initial dips in the 8 and 13 card games are due to the fact that value estimates are initially very poor, making the initial policy gradients point away from increasing the overall value of the policy. It takes a number of training games for the delayed reward of winning cards later to overcome the initial immediate reward of winning cards now. Lastly, notice the effect of the WoLF principle. It consistently outperforms the two static step size learners. This is identical to effects shown in non-approximated stochastic games (Bowling & Veloso 2002).

The second experiment was to further examine the issue of convergence and the effect of the WoLF principle on the learning process. Instead of examining worst-case performance against some fictitious challenger, we now examine the expected value of the player's policy while learning in self-play. Again the algorithm was trained in self-play for 40,000 games.
After every 50 games, both players' policies were frozen and evaluated over 1,000 games to find the expected value to the players at that moment. We ran each algorithm once on just the 13-card game and plotted its expected value over time while learning. The results are shown in Figure 4. Notice that the expected values of all the learning algorithms show some oscillation around zero. We would expect this with identical learners in a symmetric zero-sum game. The point of interest, though, is how close these oscillations stay to zero over time. The WoLF principle causes the policies to have a more constant expected value with lower amplitude oscillations. This again shows that the WoLF principle continues to have converging effects even in stochastic games with approximation techniques.

Conclusion

We have described a scalable learning algorithm for stochastic games, composed of three reinforcement learning ideas. We showed preliminary results of this algorithm learning in the game Goofspiel.

Figure 3: Worst-case expected value of the policy learned in self-play. The panels (for n = 4, 8, and 13) plot the value against a worst-case opponent versus the number of training games for the WoLF, Fast, and Slow learners and the random-policy baseline.

Figure 4: Expected value of the game while learning, plotted against the number of games.

These results demonstrate that the policy gradient approach using an actor-critic model can learn in this domain. In addition, the WoLF principle for encouraging convergence also seems to hold even when using approximation and generalization techniques.

There are a number of directions for future work. Within the game of Goofspiel, it would be interesting to explore alternative ways of tiling the state-action space. This could likely increase the overall performance of the learned policy, but would also examine how generalization might affect the convergence of learning. Might certain generalization techniques retain the existence of an equilibrium, and is the equilibrium learnable? Another important direction is to examine these techniques on more domains, with possibly continuous state and action spaces. Also, it would be interesting to vary some of the components of the system. Can we use a different approximator than tile coding? Do we achieve similar results with different policy gradient techniques (e.g., GPOMDP (Baxter & Bartlett 2000))? These initial results, though, show promise that gradient ascent and the WoLF principle can scale to large state spaces.

References

Baxter, J., and Bartlett, P. L. 2000. Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning. Stanford University: Morgan Kaufmann.

Bowling, M., and Veloso, M. 2001. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence.

Bowling, M., and Veloso, M. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence. In press.

Bowling, M., and Veloso, M. M. 2002b. Existence of multiagent equilibria with limited agents. Technical Report CMU-CS, Computer Science Department, Carnegie Mellon University.

Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Fink, A. M. 1964. Equilibrium in a stochastic n-person game. Journal of Science in Hiroshima University, Series A-I 28.

Flood, M. 1985. Interview by Albert Tucker. The Princeton Mathematics Community in the 1930s, Transcript Number 11.

Gordon, G. 2000. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 12. MIT Press.

Greenwald, A., and Hall, K. 2002. Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. In press.

Hu, J., and Wellman, M. P. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann.

Kuhn, H. W., ed. 1997. Classics in Game Theory. Princeton University Press.

Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

Littman, M. 2001. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning. Williams College: Morgan Kaufmann.

Nash, Jr., J. F. 1950. Equilibrium points in n-person games. PNAS 36. Reprinted in (Kuhn 1997).

Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. The MIT Press.

Samuel, A. L. 1967. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development 11.

Shapley, L. S. 1953. Stochastic games. PNAS 39. Reprinted in (Kuhn 1997).

Singh, S.; Kearns, M.; and Mansour, Y. 2000. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.
Stone, P., and Sutton, R. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning. Williams College: Morgan Kaufmann.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12. MIT Press.

Tesauro, G. J. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38.


More information

Chapter 3 Polynomials

Chapter 3 Polynomials Dr M DRAIEF As described in the introduction of Chpter 1, pplictions of solving liner equtions rise in number of different settings In prticulr, we will in this chpter focus on the problem of modelling

More information

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

CHM Physical Chemistry I Chapter 1 - Supplementary Material

CHM Physical Chemistry I Chapter 1 - Supplementary Material CHM 3410 - Physicl Chemistry I Chpter 1 - Supplementry Mteril For review of some bsic concepts in mth, see Atkins "Mthemticl Bckground 1 (pp 59-6), nd "Mthemticl Bckground " (pp 109-111). 1. Derivtion

More information

Mapping the delta function and other Radon measures

Mapping the delta function and other Radon measures Mpping the delt function nd other Rdon mesures Notes for Mth583A, Fll 2008 November 25, 2008 Rdon mesures Consider continuous function f on the rel line with sclr vlues. It is sid to hve bounded support

More information

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations.

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations. Lecture 3 3 Solving liner equtions In this lecture we will discuss lgorithms for solving systems of liner equtions Multiplictive identity Let us restrict ourselves to considering squre mtrices since one

More information

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms Mchine Lerning, 39, 287 308, 2000. c 2000 Kluwer Acdemic Publishers. Printed in The Netherlnds. Convergence Results for Single-Step On-Policy Reinforcement-Lerning Algorithms SATINDER SINGH AT&T Lbs-Reserch,

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

The Wave Equation I. MA 436 Kurt Bryan

The Wave Equation I. MA 436 Kurt Bryan 1 Introduction The Wve Eqution I MA 436 Kurt Bryn Consider string stretching long the x xis, of indeterminte (or even infinite!) length. We wnt to derive n eqution which models the motion of the string

More information

Density of Energy Stored in the Electric Field

Density of Energy Stored in the Electric Field Density of Energy Stored in the Electric Field Deprtment of Physics, Cornell University c Tomás A. Aris October 14, 01 Figure 1: Digrm of Crtesin vortices from René Descrtes Principi philosophie, published

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

Applying Q-Learning to Flappy Bird

Applying Q-Learning to Flappy Bird Applying Q-Lerning to Flppy Bird Moritz Ebeling-Rump, Mnfred Ko, Zchry Hervieux-Moore Abstrct The field of mchine lerning is n interesting nd reltively new re of reserch in rtificil intelligence. In this

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

fractions Let s Learn to

fractions Let s Learn to 5 simple lgebric frctions corne lens pupil retin Norml vision light focused on the retin concve lens Shortsightedness (myopi) light focused in front of the retin Corrected myopi light focused on the retin

More information

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation

Strong Bisimulation. Overview. References. Actions Labeled transition system Transition semantics Simulation Bisimulation Strong Bisimultion Overview Actions Lbeled trnsition system Trnsition semntics Simultion Bisimultion References Robin Milner, Communiction nd Concurrency Robin Milner, Communicting nd Mobil Systems 32

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Reinforcement Learning and Policy Reuse

Reinforcement Learning and Policy Reuse Reinforcement Lerning nd Policy Reue Mnuel M. Veloo PEL Fll 206 Reding: Reinforcement Lerning: An Introduction R. Sutton nd A. Brto Probbilitic policy reue in reinforcement lerning gent Fernndo Fernndez

More information

This lecture covers Chapter 8 of HMU: Properties of CFLs

This lecture covers Chapter 8 of HMU: Properties of CFLs This lecture covers Chpter 8 of HMU: Properties of CFLs Turing Mchine Extensions of Turing Mchines Restrictions of Turing Mchines Additionl Reding: Chpter 8 of HMU. Turing Mchine: Informl Definition B

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information