Near-Bayesian Exploration in Polynomial Time

J. Zico Kolter    Andrew Y. Ng
Computer Science Department, Stanford University, CA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.

1. Introduction

An agent acting in an unknown environment must consider the well-known exploration/exploitation trade-off: the trade-off between maximizing rewards based on the current knowledge of the system (exploiting) and acting so as to gain additional information about the system (exploring). The Bayesian approach to model-based reinforcement learning (RL) offers a very elegant solution to this dilemma. Under the Bayesian approach, one maintains a distribution over possible models and simply acts to maximize the expected future reward; this objective trades off very naturally between exploration and exploitation. Unfortunately, except for very restricted cases, computing the full Bayesian policy is intractable. This has led to numerous approximation methods, but, to the best of our knowledge, there has been little work on providing any formal guarantees for such algorithms.

In this paper we present a simple, greedy approximation algorithm, and show that it is able to perform nearly as well as the intractable optimal Bayesian policy after executing a small (polynomial in quantities describing the system) number of steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, typified by algorithms such as E^3 and R-max, but extend this paradigm to the setting of Bayesian RL. We show that by considering optimality with respect to the optimal Bayesian policy, we can both achieve lower sample complexity than existing algorithms, and use an exploration approach that is far greedier than the extremely cautious exploration required by any PAC-MDP algorithm. Indeed, our analysis also shows that both our greedy algorithm and the true Bayesian policy are not PAC-MDP.

The remainder of the paper is organized as follows. In Section 2 we describe our setting formally and review the Bayesian and PAC-MDP approaches to exploration. In Section 3 we present our greedy approximation algorithm, and then state and discuss the theoretical guarantees for this method. In Section 4 we prove these results, present brief simulation results in Section 5, and conclude in Section 6.

2. Preliminaries

A Markov Decision Process (MDP) is a tuple (S, A, P, R, H), where S is a set of states, A is a set of actions, P : S x A x S -> R+ is a state transition probability function, R : S x A -> [0,1] is a bounded reward function, and H is a time horizon. (We use the finite-horizon setting for simplicity, but all results presented here easily extend to the case of infinite-horizon discounted rewards.)

We consider an agent interacting with an MDP via a single continuous thread of experience, and we assume that the transition probabilities P are unknown to the agent. For simplicity of notation, we will assume that the reward is known, but this does not sacrifice generality, since an MDP with unknown bounded rewards and unknown transitions can be represented as an MDP with known rewards and unknown transitions by adding additional states to the system. In this work we will focus on the case of discrete states and actions.

A policy π is a mapping from states to actions. The value of a policy for a given state is defined as the sum of rewards over the next H time steps,

    V^π_H(s) = E[ Σ_{t=1}^H R(s_t, π(s_t)) | s_1 = s, π ].

The value function can also be written in terms of Bellman's equation,

    V^π_H(s) = R(s, π(s)) + Σ_{s'} P(s' | s, π(s)) V^π_{H-1}(s').

Finally, when the transitions of the MDP are known, we can find the optimal policy π* and optimal value function V* by solving Bellman's optimality equation

    V*_H(s) = max_a [ R(s, a) + Σ_{s'} P(s' | s, a) V*_{H-1}(s') ],

where π*(s) is simply the action that maximizes the right-hand side; we can apply classical algorithms such as Value Iteration or Policy Iteration to find a solution to this equation (Puterman, 1994).

2.1. Bayesian Reinforcement Learning

Of course, in the setting we consider, the transitions of the MDP are not known, so we must adopt different methods. The Bayesian approach to model-based RL, which has its roots in the topic of Dual Control (Fel'dbaum, 1961; Filatov & Unbehauen, 2004), explicitly represents the uncertainty over MDPs by maintaining a belief state b. Since we are concerned with the setting of discrete states and actions, a natural means of representing the belief is to let b consist of a set of Dirichlet distributions that describe our uncertainty over the state transitions:

    b = { α(s, a, s') },    P(s' | b, s, a) = α(s, a, s') / α_0(s, a),

where α_0(s, a) = Σ_{s'} α(s, a, s').

In this setting, the state now consists of both the system state and the belief state; policies and value functions now depend on both these elements. The value of a policy π for a belief and state is again given by Bellman's equation

    V^π_H(b, s) = R(s, a) + Σ_{b', s'} P(b', s' | b, s, a) V^π_{H-1}(b', s')
                = R(s, a) + Σ_{s'} P(s' | b, s, a) V^π_{H-1}(b', s'),

where a = π(s, b), and where on the second line the belief b' is equal to b except that we increment α(s, a, s'). The simplified form on this second line results from the fact that P(b' | s, b, s', a) is deterministic: if we are in state s, take action a, and end up in state s', then we know how to update b (by incrementing α(s, a, s'), as described above). This allows us to remove the integral over b', and since the states are discrete we only need to sum over s'. We can also use the same logic to derive a Bayesian version of Bellman's optimality equation, which gives us the optimal Bayesian value function and policy,

    V*_H(b, s) = max_a [ R(s, a) + Σ_{s'} P(s' | b, s, a) V*_{H-1}(b', s') ],

where the optimal Bayesian policy π*(b, s) is just the action that maximizes the right-hand side.

Since this Bayesian policy plays a crucial role in the remainder of this paper, it is worth understanding the intuition behind this equation. The optimal Bayesian policy chooses actions based not only on how they will affect the next state of the system, but also based on how they will affect the next belief state; and, since better knowledge of the MDP will typically lead to greater future reward, the Bayesian policy will very naturally trade off between exploring the system to gain more knowledge, and exploiting its current knowledge of the system.
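To make the representation above concrete, the following sketch (not from the paper; the array layout, the uniform prior of one count per outcome, and the helper names are illustrative assumptions) shows a Dirichlet-count belief state, its deterministic update, and finite-horizon value iteration on the mean MDP of a belief.

```python
import numpy as np

S, A, H = 5, 2, 6                     # toy sizes, chosen arbitrarily

# alpha[s, a, s'] holds the Dirichlet counts; start from a uniform prior.
alpha = np.ones((S, A, S))

def mean_transitions(alpha):
    """Mean of the belief: P(s' | b, s, a) = alpha(s,a,s') / alpha_0(s,a)."""
    return alpha / alpha.sum(axis=2, keepdims=True)

def update_belief(alpha, s, a, s_next):
    """Deterministic belief update: increment the count of the observed transition."""
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1
    return alpha

def value_iteration(P, R, H):
    """Finite-horizon value iteration for a known (here: mean) MDP.
    P has shape (S, A, S); R has shape (S, A)."""
    V = np.zeros(P.shape[0])
    pi = np.zeros(P.shape[0], dtype=int)
    for _ in range(H):
        Q = R + P @ V                 # Q[s, a] = R(s,a) + sum_s' P(s'|s,a) V(s')
        V, pi = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi
```

Solving the mean MDP of the current belief in exactly this way, with an exploration bonus added to the reward, is the computation the BEB algorithm of Section 3 performs at each time step.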
Unfortunately, while the Bayesian approach provides a very elegant solution to the exploration/exploitation problem, it is typically not possible to compute the Bayesian policy exactly. Since the dimension of the belief state grows polynomially in the number of states and actions, computing the Bayesian value function using value iteration or other methods is typically not tractable. (One exception, where the Bayesian approach is tractable, is the domain of k-armed bandits, i.e., an MDP with one state and k actions, where the rewards are unknown. In this case, the Bayesian approach leads to the well-known Gittins indices (Gittins, 1989). However, the approach does not scale analytically to multi-state MDPs.) This has led to numerous methods that approximate the Bayesian exploration policy (Dearden et al., 1999; Strens, 2000; Wang et al., 2005; Poupart et al., 2006), typically by computing an approximation to the optimal value function, either by sampling or other methods. However, little is known about these algorithms from a theoretical perspective, and it is unclear what (if any) formal guarantees can be made for such approaches.

2.2. PAC-MDP Reinforcement Learning

An alternative approach to exploration in RL is the so-called PAC-MDP approach, exemplified by algorithms such as E^3, R-max, and others (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Kakade, 2003; Strehl & Littman, 2008). Such algorithms also address the exploration/exploitation problem, but do so in a different manner. The algorithms are based on the following intuition: if an agent has observed a certain state-action pair enough times, then we can use large deviation inequalities, such as the Hoeffding bound, to ensure that the true dynamics are close to the empirical estimates. On the other hand, if we have not observed a state-action pair enough times, then we assume it has very high value; this will drive the agent to try out state-action pairs that we haven't observed enough times, until eventually we have a suitably accurate model of the system (this general technique is known as "optimism in the face of uncertainty").

Although the precise formulation of the learning guarantees varies from algorithm to algorithm, using these strategies one can prove theoretical guarantees of the following form, or similar: with high probability, the algorithm performs near-optimally for all but a small number of time steps, where "small" here means polynomial in various quantities describing the MDP. Slightly more formally, if A_t denotes the policy followed by the algorithm at time t, then with probability greater than 1 - δ,

    V^{A_t}(s_t) ≥ V*(s_t) - ε

for all but m = O(poly(S, A, H, 1/ε, 1/δ)) time steps. This statement does not indicate when these suboptimal steps will occur (the algorithm could act near-optimally for a long period of time before returning to sub-optimal behavior for some number of steps), which allows us to avoid issues of mixing times or ergodicity of the MDP; this precise formulation is due to Kakade (2003).

Many variations and extensions of these results exist: to the case of metric MDPs (Kakade et al., 2003), factored MDPs (Kearns & Koller, 1999), continuous (linear in state features) domains (Strehl & Littman, 2008b), a certain class of switching linear systems (Brunskill et al., 2008), and others. However, the overall intuition behind these approaches is similar: in order to perform well, we want to explore enough that we learn an accurate model of the system. (A slightly different approach is taken by Strehl et al. (2006), where they do not build an explicit model of the system. However, the overall idea is the same, only here they want to explore enough until they obtain an accurate estimate of the state-action value function.) While this results in very powerful guarantees, the algorithms typically require a very large amount of exploration in practice. This contrasts with the Bayesian approach, where we just want to obtain high expected reward over some finite horizon (or alternatively, an infinite discounted horizon). Intuitively, we might then expect that the Bayesian approach could act in a greedier fashion than the PAC-MDP approaches, and we will confirm and formalize this intuition in the next section. Furthermore, many issues that present challenges in the PAC-MDP framework, such as incorporating prior knowledge or dealing with correlated transitions, seemingly can be handled very naturally in the Bayesian framework.

3. A Greedy Approximation Algorithm and Theoretical Results

From the discussion above, it should be apparent that both the Bayesian and PAC-MDP approaches have advantages and drawbacks, and in this section we present an algorithm and analysis that combines elements from both frameworks. In particular, we present a simple greedy algorithm that we show to perform nearly as well as the full Bayesian policy, in a sense that we will formalize shortly; this is a PAC-MDP-type result, but we consider optimality with respect to the Bayesian policy for a given belief state, rather than the optimal policy for some fixed MDP.
As we will show, this alternative definition of optimality allows us to both achieve lower sample complexity than existing PAC-MDP algorithms and use a greedier exploration method.

The algorithm we propose is itself very straightforward and similar to many previously proposed exploration heuristics. We call the algorithm Bayesian Exploration Bonus (BEB), since it chooses actions according to the current mean estimate of the state transitions plus an additional reward bonus for state-action pairs that have been observed relatively little. Specifically, the BEB algorithm, at each time step, chooses actions greedily with respect to the value function

    Ṽ_H(b, s) = max_a [ R(s, a) + β / (1 + α_0(s, a)) + Σ_{s'} P(s' | b, s, a) Ṽ_{H-1}(b, s') ]        (1)

where β is a parameter of the algorithm that we will discuss shortly. In other words, the algorithm acts by solving an MDP using the mean of the current belief state for the transition probabilities, and an additional exploration bonus of β / (1 + α_0(s, a)) at each state. Note that the belief b is not updated in this equation, meaning we can solve the equation using the standard Value Iteration or Policy Iteration algorithms. To simplify the analysis that follows, we also take a common approach and cease updating the belief states after a certain number of observations, which we will describe more fully below.
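The following is a minimal sketch of this computation (illustrative only; the array shapes and helper names are assumptions, not code from the paper). It solves the mean MDP of the current belief with the bonus β / (1 + α_0(s, a)) added to the reward; replacing that bonus with β / √(n(s, a)) would give an MBIE-EB-style bonus, which is discussed below.

```python
import numpy as np

def beb_action(alpha, R, H, s, beta=None):
    """Choose an action greedily with respect to the BEB value function of Eq. (1).

    alpha: Dirichlet counts with shape (S, A, S); R: rewards with shape (S, A).
    A sketch only; beta defaults to the 2*H^2 used in Theorem 1 below.
    """
    if beta is None:
        beta = 2.0 * H ** 2
    alpha0 = alpha.sum(axis=2)              # alpha_0(s, a)
    P_mean = alpha / alpha0[:, :, None]     # mean transition probabilities of the belief
    R_bonus = R + beta / (1.0 + alpha0)     # reward plus the BEB exploration bonus
    # (An MBIE-EB-style bonus would instead use beta / np.sqrt(n) for observation counts n.)

    V = np.zeros(alpha.shape[0])
    for _ in range(H):                      # finite-horizon value iteration on the mean MDP
        Q = R_bonus + P_mean @ V
        V = Q.max(axis=1)
    return int(Q[s].argmax())
```

In an interaction loop one would call this at every step, execute the chosen action, observe s', and increment alpha[s, a, s'], except for state-action pairs whose counts already exceed the cap introduced in Theorem 1.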

The following theorem gives a performance guarantee for the BEB algorithm for a suitably large value of β.

Theorem 1. Let A_t denote the policy followed by the BEB algorithm (with β = 2H^2) at time t, and let s_t and b_t be the corresponding state and belief. Also suppose we stop updating the belief for a state-action pair when α_0(s, a) > 4H^3/ε. Then with probability at least 1 - δ,

    V^{A_t}_H(b_t, s_t) ≥ V*_H(b_t, s_t) - ε,

i.e., the algorithm is ε-close to the optimal Bayesian policy, for all but

    m = O( (S A H^6 / ε^2) log (S A / δ) )

time steps.

In other words, BEB acts sub-optimally, where optimality is defined in the Bayesian sense, for only a polynomial number of time steps. Like the PAC-MDP results mentioned above, the theorem makes no claims about when these sub-optimal steps occur, and thus avoids issues of mixing times, etc.

In terms of the polynomial degree on the various quantities, this bound is tighter than the standard PAC-MDP bounds, which to the best of our knowledge have sample complexity of

    m = Õ( S^2 A H^6 / ε^3 )

time steps. (The Õ notation suppresses logarithmic factors. In addition, the model-free algorithm of Strehl et al. (2006) obtains a bound of Õ(S A H^8 / ε^4), also larger than our bound.) Intuitively, this smaller bound results from the fact that in order to approximate the Bayesian policy, we don't need to learn the true dynamics of an MDP; we just need to ensure that the posterior beliefs are sufficiently peaked so that further updates cannot lead to very much additional reward. However, as mentioned above, the two theorems are not directly comparable, since we define optimality with respect to the Bayesian policy for a belief state, whereas the standard PAC-MDP framework defines optimality with respect to the optimal policy for some given MDP. Indeed, one of the chief insights of this work is that by considering the Bayesian definition of optimality, we can achieve these smaller bounds.

To gain further insight into the nature of Bayesian exploration, BEB, and the PAC-MDP approach, we compare our method to a very similar PAC-MDP algorithm known as Model Based Interval Estimation with Exploration Bonus (MBIE-EB) (Strehl & Littman, 2008). Like BEB, this algorithm at each time step solves an MDP according to the mean estimate of the transitions, plus an exploration bonus. However, MBIE-EB uses an exploration bonus of the form

    β / √(n(s, a)),

where n(s, a) denotes the number of times that the state-action pair (s, a) has been observed; this contrasts with the BEB algorithm, which has an exploration bonus of

    β / (1 + n(s, a)),

where here n(s, a) also includes the counts implied by the prior. Since 1/√n decays much more slowly than 1/n, MBIE-EB consequently explores a great deal more than the BEB algorithm. Furthermore, this is not an artifact of the MBIE-EB algorithm alone: as we formalize in the next theorem, any algorithm with an exploration bonus that decays faster than 1/√n cannot be PAC-MDP.

Theorem 2. Let A_t denote the policy followed by an algorithm using any (arbitrarily complex) exploration bonus that is upper bounded by

    β / n(s, a)^p

for some constant β and p > 1/2. Then there exists some MDP M and ε_0(β, p) such that, with probability greater than δ_0 = 0.15,

    V^{A_t}_H(s_t) < V*_H(s_t) - ε_0

will hold for an unbounded number of time steps.
In other words, the BEB algorithm, and Bayesian exploration itself, are not PAC-MDP, and may in fact never find the optimal policy for some given MDP. This result is fairly intuitive: since the Bayesian algorithms are trying to maximize the reward over some finite horizon, there would be no benefit to excessive exploration if it cannot help over this horizon time.

To summarize, by considering optimality with respect to the Bayesian policy, we can obtain algorithms with lower sample complexity and greedier exploration policies than PAC-MDP approaches. Although the resulting algorithms may fail to find the optimal policy for certain MDPs, they are still close to optimal in the Bayesian sense.

4. Proofs of the Main Results

Before presenting the proofs in this section, we want to briefly describe their intuition. Due to space constraints, the proofs of the technical lemmas are deferred to the appendix, available in the full version of the paper (Kolter & Ng, 2009).

The key condition that allows us to prove that BEB quickly performs ε-close to the Bayesian policy is that at every time step, BEB is optimistic with respect to the Bayesian policy, and this optimism decays to zero given enough samples; that is, BEB acts according to an internal model that always overestimates the values of state-action pairs, but which approaches the true Bayesian value estimate at a rate of O(1/n(s, a)). The O(1/n) term itself arises from the L_1 divergence between particular Dirichlet distributions. Given these results, Theorem 1 follows by adapting standard arguments from previous PAC-MDP results. In particular, we define the "known" state-action pairs to be all those state-action pairs that have been observed more than some number of times, and use the above insights to show, similar to the PAC-MDP proofs, that V^{A_t}_H(b, s) is close to the value of acting according to the optimal Bayesian policy, assuming the probability of leaving the known state-action set is small. Finally, we use the Hoeffding bound to show that this escape probability can be large only for a polynomial number of steps.

To prove Theorem 2, we use Slud's inequality (Slud, 1977) to show that any algorithm with exploration rate O(1/n^p) for p > 1/2 may not be optimistic with respect to the optimal policy for a given MDP. The domain to consider here is a simple two-armed bandit, where one of the arms results in a random Bernoulli payoff, and the other results in a fixed known payoff with slightly lower mean value; we can show that with significant probability, any such exploration bonus algorithm may prefer the suboptimal arm at some point, resulting in a policy that is never near-optimal.

4.1. Proof of Theorem 1

We begin with a series of lemmas used in the proof. The first lemma states that if one has a sufficient number of counts for a Dirichlet distribution, then incrementing one of the counts won't change the probabilities very much. The proof just involves algebraic manipulation.

Lemma 3. Consider two Dirichlet distributions with parameters α, α' ∈ R^k. Further, suppose α_i = α'_i for all i, except α'_j = α_j + 1. Then

    Σ_i | P(x_i | α) - P(x_i | α') | ≤ 2 / α_0.
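As a quick numerical illustration of the Lemma 3 bound (not from the paper; the toy counts below are arbitrary), one can compare the mean of a Dirichlet distribution before and after incrementing a single count:

```python
import numpy as np

# Arbitrary toy counts; increment one of them and compare the Dirichlet means.
alpha = np.array([3.0, 1.0, 2.0])
alpha_prime = alpha.copy()
alpha_prime[1] += 1.0

p = alpha / alpha.sum()
p_prime = alpha_prime / alpha_prime.sum()
l1_change = np.abs(p - p_prime).sum()
print(l1_change, "<=", 2.0 / alpha.sum())   # 0.238... <= 0.333...
```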
Next we use this lemma to show that if we solve the MDP using the current mean of the belief state, with an additional exploration bonus of 2H^2 / (1 + α_0(s, a)), this will lead to a value function that is optimistic with respect to the Bayesian policy. The proof involves showing that the potential benefit of the true Bayesian policy (i.e., how much extra reward we could obtain by updating the beliefs) is upper bounded by the exploration bonus of BEB. The proof is deferred due to space constraints, but since this result is the key to proving Theorem 1, this lemma is one of the key technical results of the paper.

Lemma 4. Let Ṽ_H(b, s) be the value function used by BEB, defined as in (1), with β = 2H^2; that is, it is the optimal value function for the mean MDP of belief b, plus the additional reward term. Then for all s,

    Ṽ_H(b, s) ≥ V*_H(b, s).

Our final lemma is a trivial modification of the "induced inequality" used by previous PAC-MDP bounds, which extends this inequality to the Bayesian setting. The lemma states that if we evaluate a policy using two different rewards and belief states, (R, b) and (R', b'), where b = b' and R = R' on a known set of state-action pairs K, then following the policy π will obtain similar rewards for both belief states, provided the probability of escaping from K is small. The proof mirrors that in Strehl & Littman (2008).

Lemma 5. Let (b, R) and (b', R') be two belief states over transition probabilities and reward functions that are identical on some set of state-action pairs K (i.e., α_b(s, a, s') = α_{b'}(s, a, s') and R(s, a) = R'(s, a) for all (s, a) ∈ K). Let A_K be the event that a state-action pair not in K is generated when starting from state s and following policy π for H steps. Assuming the rewards R' are bounded in [0, R_max], then

    | V^π_H(R, b, s) - V^π_H(R', b', s) | ≤ H R_max P(A_K),

where we now make explicit the dependence of the value function on the reward.

We are now ready to prove Theorem 1.

Proof (of Theorem 1). Define R' as the reward function used by the BEB algorithm, i.e., the reward plus the exploration bonus. Let K be the set of all state-action pairs that have posterior counts α_0(s, a) ≥ m ≡ 4H^3/ε. Let R'' be a reward function equal to R on K and equal to R' elsewhere. Furthermore, let π be the policy followed by BEB at time t, i.e., the greedy policy with respect to the current belief b_t and the reward R'. Letting A_K be the event that π escapes from K when starting in s_t and acting for H steps, we have

    V^{A_t}_H(R, b_t, s_t) ≥ V^π_H(R'', b_t, s_t) - H^2 P(A_K)        (2)

by Lemma 5, where we note that we can limit the exploration bonus to H (i.e., use a bonus of min{ 2H^2 / (1 + α_0(s, a)), H }) and still maintain optimism, and by noticing that A_t equals π unless A_K occurs. In addition, note that since R'' and R' differ by at most 2H^2/m = ε/(2H) at each state,

    V^π_H(R'', b_t, s_t) ≥ V^π_H(R', b_t, s_t) - ε/2.        (3)

Finally, we consider two cases. First, suppose that P(A_K) > ε/(2H^2). By the Hoeffding inequality, with probability 1 - δ this will occur no more than

    O( m S A H^3 / ε ) = O( S A H^6 / ε^2 )

times before all the state-action pairs become known. Now suppose P(A_K) ≤ ε/(2H^2). Then

    V^{A_t}_H(R, b_t, s_t) ≥ V^π_H(R'', b_t, s_t) - H^2 P(A_K)
                           ≥ V^π_H(R'', b_t, s_t) - ε/2
                           ≥ V^π_H(R', b_t, s_t) - ε
                           = Ṽ_H(b_t, s_t) - ε
                           ≥ V*_H(b_t, s_t) - ε,

i.e., the policy is ε-optimal. In this derivation the first line follows from (2), the second line follows from our assumption that P(A_K) ≤ ε/(2H^2), the third line follows from (3), the fourth line follows from the fact that π is precisely the optimal policy for (R', b_t), and the last line follows from Lemma 4.

4.2. Proof of Theorem 2

We make use of the following inequality, due to Slud (1977), which gives a lower bound on the probability of large deviations from the mean in a binomial distribution.

Lemma 6 (Slud's inequality). Let X_1, ..., X_n be i.i.d. Bernoulli random variables with mean μ ≤ 3/4. Then

    P( μ - (1/n) Σ_{i=1}^n X_i > ε ) ≥ 1 - Φ( ε √( n / (μ(1-μ)) ) ),

where Φ(x) is the cumulative distribution function of a standard Gaussian random variable, Φ(x) = ∫_{-∞}^x (1/√(2π)) e^{-t^2/2} dt.

Using this lemma, we now prove the theorem.

Proof (of Theorem 2). As mentioned in the preceding text, the scenario we consider for this proof is a two-armed bandit: action 1 gives a Bernoulli random reward, with true payoff probability (unknown to the algorithm) of 3/4; action 2 gives a fixed known payoff of 3/4 - ε_0 (we will define ε_0 shortly). Therefore, the optimal policy is to always pick action 1. Since this is a setting with only one state (and therefore known transition dynamics) but unknown rewards, later in the proof we will transform this domain into one with unknown transitions but known rewards; however, the bandit formulation is more intuitive for the time being.

Since the reward for action 2 is known, the only exploratory action in this case is action 1, and we let n denote the number of times that the agent has chosen action 1, where at each trial it receives a reward r_i ∈ {0, 1}. Let f(n) be the exploration bonus for some algorithm attempting to learn this domain, and suppose it is upper bounded by f(n) ≤ β/n^p for some p > 1/2. Then we can lower bound the probability that the algorithm's estimate of the reward, plus its exploration bonus, is pessimistic by more than β/n^p:

    P( 3/4 - (1/n) Σ_{i=1}^n r_i - f(n) ≥ β/n^p )
      ≥ P( 3/4 - (1/n) Σ_{i=1}^n r_i ≥ 2β/n^p )
      ≥ 1 - Φ( 8β / (√3 n^{p - 1/2}) ),

where the last line follows by applying Slud's inequality. We can easily verify numerically that 1 - Φ(1) > 0.15, so for

    n ≥ ( 8β/√3 )^{2/(2p-1)}

we have that with probability greater than δ_0 = 0.15, the algorithm is pessimistic by more than β/n^p. Therefore, after this many steps, if we let

    ε_0(β, p) = β / ( (8β/√3)^{2/(2p-1)} )^p,

then with probability at least δ_0 = 0.15, action 2 will be preferred by the algorithm over action 1. Once this occurs, the algorithm will never opt to select action 1 (since action 2 is known, and already has no exploration bonus), so for any ε ≤ ε_0, the algorithm will be more than ε sub-optimal for an infinite number of steps.

Finally, we also note that we can easily transform this domain into an MDP with known rewards but unknown transitions by considering a three-state MDP with transition probabilities and rewards

    P(s_2 | s_1, a_1) = 3/4,   P(s_3 | s_1, a_1) = 1/4,   P(s_1 | s_1, a_2) = 1,   P(s_1 | s_{2:3}, a_{1:2}) = 1,
    R(s_2, a_{1:2}) = 1,   R(s_3, a_{1:2}) = 0,   R(s_1, a_2) = 3/4 - ε_0.

5. Simulated Domain

In this section we present empirical results for BEB and other algorithms on a simple chain domain from the Bayesian exploration literature (Strens, 2000; Poupart et al., 2006), shown in Figure 1. We stress that the results here are not intended as a rigorous evaluation of the different methods, since the domain is extremely small-scale. Nonetheless, the results illustrate that the characteristics suggested by the theory do manifest themselves in practice, at least in this small-scale setting.

Figure 2 shows the average total reward versus time step for several different algorithms. These results illustrate several points. First, the results show, as suggested by the theory, that BEB can outperform PAC-MDP algorithms (in this case, MBIE-EB), due to its greedier exploration strategy. Second, the value of β required by Theorem 1 is typically much larger than what is best in practice. This is a common trend for such algorithms, so for both BEB and MBIE-EB we evaluated a wide range of values for β and chose the best for each (the same evaluation strategy was used by the authors of MBIE-EB (Strehl & Littman, 2008)). Thus, while the constant factors in the theoretical results for both BEB and MBIE-EB are less important from a practical standpoint, the rates implied by these results (i.e., the 1/n vs. 1/√n exploration rates) do result in empirical differences. Finally, for this domain, the possibility that BEB converges to a sub-optimal policy is not a large concern. This is to be expected, as Theorem 2 analyzes a fairly extreme setting, and indeed implies only relatively little suboptimality, even in the worst case.

Figure 1. Simple chain domain, consisting of five states and two actions. Arrows indicate transitions and rewards for the actions, but at each time step the agent performs the opposite action as intended with probability 0.2. The agent always starts in state 1, and the horizon is H = 6.

Figure 2. Performance versus time for different algorithms on the chain domain, averaged over 500 trials and shown with 95% confidence intervals. (Curves shown: optimal policy, BEB with β = 2.5, MBIE-EB with its best β, BEB with β = 2H^2, and an exploit-only strategy; axes: average total reward versus time step.)

Figure 3. Performance versus time for BEB with different priors on the chain domain, averaged over 500 trials and shown with 95% confidence intervals. (Curves shown: a small prior, uniform priors with α_0 = |S| and α_0 = 5|S|, and informative priors with α_0 = |S| and α_0 = 5|S|; axes: average total reward versus time step.)

We also evaluated the significance of the prior distribution on BEB. In Figure 3 we show performance for BEB with a very small prior, for uniform priors of varying strength, and for informative priors consisting of the true transition probabilities. As can be seen, BEB is fairly insensitive to small priors, but can be negatively impacted by a large misspecified prior. These results are quite intuitive, as such a prior will greatly decrease the exploration bonus, while providing a poor model of the environment.
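For reference, a sketch of this chain domain as a tabular MDP follows (not code from the paper). The slip probability and horizon are as stated in the caption of Figure 1; the reward values (0 for advancing, 2 for returning to state 1, 10 for remaining in the last state) are the values commonly used for this benchmark (Strens, 2000) and should be treated as assumptions here, since the paper specifies them only graphically.

```python
import numpy as np

S, A, H, slip = 5, 2, 6, 0.2

# Reward for the action that is actually executed: advancing pays 0 except in
# the last state (which pays 10); returning to state 1 pays 2 (assumed values).
R_exec = np.zeros((S, A))
R_exec[:, 1] = 2.0
R_exec[S - 1, 0] = 10.0

P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    nxt = {0: min(s + 1, S - 1), 1: 0}      # deterministic effect of each executed action
    for a in range(A):
        other = 1 - a                        # with prob. 0.2 the opposite action is executed
        P[s, a, nxt[a]] += 1 - slip
        P[s, a, nxt[other]] += slip
        R[s, a] = (1 - slip) * R_exec[s, a] + slip * R_exec[s, other]
```

Running finite-horizon value iteration on (P, R) with H = 6 gives an "optimal policy" baseline of the kind plotted in Figure 2, while running BEB with a Dirichlet belief over these (hidden) transitions gives the corresponding learning curves.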

6. Conclusion

In this paper we presented a novel algorithm and mode of analysis that allows an agent acting in an MDP to perform ε-close to the (intractable) optimal Bayesian policy after a polynomial number of time steps. We bring PAC-MDP-type results to the setting of Bayesian RL, and we show that by doing so, we can both obtain lower sample complexity bounds and use exploration techniques that are greedier than those required by any PAC-MDP algorithm. Looking forward, the same mode of analysis that we use to derive the bounds in this paper, which involves bounding divergences between updates of the belief distributions, can also be applied to more structured domains, such as finite MDPs with correlated transitions or continuous-state MDPs with smooth dynamics; it will be interesting to see how the resulting algorithms perform in such domains. An alternative means of analyzing the efficiency of reinforcement learning algorithms is the notion of regret in infinite-horizon settings (Auer & Ortner, 2007), and it remains an open question whether the ideas we present here can be extended to this infinite-horizon case. Finally, very recently Asmuth et al. (2009) have independently developed an algorithm that also combines Bayesian and PAC-MDP approaches. The actual approach is quite different (they use Bayesian sampling to achieve a PAC-MDP algorithm), but it would be very interesting to compare the algorithms.

Acknowledgments

This work was supported by the DARPA Learning Locomotion program under contract number FA C. We thank the anonymous reviewers and Lihong Li for helpful comments. Zico Kolter is supported by an NSF Graduate Research Fellowship.

References

Asmuth, J., Li, L., Littman, M. L., Nouri, A., & Wingate, D. (2009). A Bayesian sampling approach to exploration in reinforcement learning. Preprint.

Auer, P., & Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. Neural Information Processing Systems.

Brafman, R. I., & Tennenholtz, M. (2002). R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3.

Brunskill, E., Leffler, B. R., Li, L., Littman, M. L., & Roy, N. (2008). CORL: A continuous-state offset-dynamics reinforcement learner. Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Dearden, R., Friedman, N., & Andre, D. (1999). Model based Bayesian exploration. Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Fel'dbaum, A. A. (1961). Dual control theory, parts I-IV. Automation and Remote Control.

Filatov, N., & Unbehauen, H. (2004). Adaptive dual control: Theory and applications. Springer.

Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley.

Kakade, S., Kearns, M., & Langford, J. (2003). Exploration in metric state spaces. Proceedings of the International Conference on Machine Learning.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Kearns, M., & Koller, D. (1999). Efficient reinforcement learning in factored MDPs. Proceedings of the International Joint Conference on Artificial Intelligence.

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49.

Kolter, J. Z., & Ng, A. Y. (2009). Near-Bayesian exploration in polynomial time (full version). Available at kolter.

Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. Proceedings of the International Conference on Machine Learning.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

Slud, E. V. (1977). Distribution inequalities for the binomial law. The Annals of Probability, 5.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., & Littman, M. L. (2006). PAC model-free reinforcement learning. Proceedings of the International Conference on Machine Learning.

Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74.

Strehl, A. L., & Littman, M. L. (2008b). Online linear regression and its application to model-based reinforcement learning. Neural Information Processing Systems.

Strens, M. J. (2000). A Bayesian framework for reinforcement learning. Proceedings of the International Conference on Machine Learning.

Wang, T., Lizotte, D., Bowling, M., & Schuurmans, D. (2005). Bayesian sparse sampling for on-line reward optimization. Proceedings of the International Conference on Machine Learning.

A. Technical Proofs

A.1. Proof of Lemma 3

Proof. Using the definition of the Dirichlet distribution,

    P(x_i | α) = ∫ P(x_i | π) p(π | α) dπ = α_i / α_0,

where α_0 ≡ Σ_i α_i. By the assumptions on α and α',

    Σ_i | P(x_i | α) - P(x_i | α') |
      = Σ_{i ≠ j} | α_i/α_0 - α_i/(α_0 + 1) | + | α_j/α_0 - (α_j + 1)/(α_0 + 1) |
      = Σ_{i ≠ j} α_i / (α_0(α_0 + 1)) + (α_0 - α_j) / (α_0(α_0 + 1))
      = 2(α_0 - α_j) / (α_0(α_0 + 1))
      ≤ 2/α_0.

A.2. Proof of Lemma 4

Proof. Consider some belief and state b and s, and let b_t be the new belief formed by updating b after taking t ≤ H steps. Then

    Ṽ_H(b, s) - V*_H(b_t, s)
      = max_a [ R(s, a) + 2H^2/(1 + α_0(s, a)) + Σ_{s'} P(s' | b, s, a) Ṽ_{H-1}(b, s') ]
        - max_a [ R(s, a) + Σ_{s'} P(s' | b_t, s, a) V*_{H-1}(b_{t+1}, s') ]
      ≥ min_a [ 2H^2/(1 + α_0(s, a)) + Σ_{s'} ( P(s' | b, s, a) Ṽ_{H-1}(b, s') - P(s' | b_t, s, a) V*_{H-1}(b_{t+1}, s') ) ]
      ≥ min_a [ 2H^2/(1 + α_0(s, a)) - (H-1) Σ_{s'} | P(s' | b, s, a) - P(s' | b_t, s, a) |
                + Σ_{s'} P(s' | b, s, a) ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') ) ]
      ≥ min_a Σ_{s'} P(s' | b, s, a) ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') )
      ≥ min_{s'} ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') ).

The first line just substitutes the definitions of Ṽ_H and V*_H. In the second line we use the fact that max_x f(x) - max_x g(x) ≥ min_x ( f(x) - g(x) ). In the third line we use the fact that Σ_x p(x) f(x) - Σ_x q(x) g(x) ≥ Σ_x p(x) ( f(x) - g(x) ) - Σ_x | p(x) - q(x) | g(x), and note that V*_{H-1}(b, s) ≤ H - 1 for any b and s. In the fourth line we apply Lemma 3 to show that

    2H^2/(1 + α_0(s, a)) ≥ (H-1) Σ_{s'} | P(s' | b, s, a) - P(s' | b_t, s, a) |,

which lets us remove these terms. In greater detail, using the triangle inequality, Lemma 3, and the fact that t ≤ H,

    Σ_{s'} | P(s' | b_t, s, a) - P(s' | b, s, a) |
      ≤ Σ_{i=1}^t Σ_{s'} | P(s' | b_i, s, a) - P(s' | b_{i-1}, s, a) |
      ≤ Σ_{i=1}^t 2 / (α_0(s, a) + i)
      ≤ 2H / (1 + α_0(s, a)).

Since s is arbitrary in the above derivation, we have that for any t ≤ H,

    min_s ( Ṽ_H(b, s) - V*_H(b_t, s) ) ≥ min_s ( Ṽ_{H-1}(b, s) - V*_{H-1}(b_{t+1}, s) ).

Applying this equation repeatedly proves the desired lemma.

A.3. Proof of Lemma 5

Proof. Consider a sequence of beliefs, states, actions, and rewards of length t, p_t = (s_1, a_1, r_1, ..., s_t, a_t, r_t). Let P(p_t) be the probability of this sequence under belief b with reward function R when starting in state s, and let P'(p_t) be the probability of the sequence under belief b' with reward function R'. Let K_t be the set of sequences where all of (s_1, a_1), ..., (s_t, a_t) are in K. Then

    V^π_H(R', b', s) - V^π_H(R, b, s)
      = Σ_{t=1}^H Σ_{p_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) )
      = Σ_{t=1}^H [ Σ_{p_t ∈ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) ) + Σ_{p_t ∉ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) ) ]
      = Σ_{t=1}^H Σ_{p_t ∉ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) )
      ≤ Σ_{t=1}^H Σ_{p_t ∉ K_t} P'(p_t) r_t(p_t)
      ≤ H R_max P(A_K),

where we can eliminate the terms for p_t ∈ K_t because (R, b) and (R', b') are identical on this set, and where the last line follows since the rewards are bounded in [0, R_max].


More information

For the percentage of full time students at RCC the symbols would be:

For the percentage of full time students at RCC the symbols would be: Mth 17/171 Chpter 7- ypothesis Testing with One Smple This chpter is s simple s the previous one, except it is more interesting In this chpter we will test clims concerning the sme prmeters tht we worked

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230 Polynomil Approimtions for the Nturl Logrithm nd Arctngent Functions Mth 23 You recll from first semester clculus how one cn use the derivtive to find n eqution for the tngent line to function t given

More information

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac REVIEW OF ALGEBRA Here we review the bsic rules nd procedures of lgebr tht you need to know in order to be successful in clculus. ARITHMETIC OPERATIONS The rel numbers hve the following properties: b b

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Lecture 20: Numerical Integration III

Lecture 20: Numerical Integration III cs4: introduction to numericl nlysis /8/0 Lecture 0: Numericl Integrtion III Instructor: Professor Amos Ron Scribes: Mrk Cowlishw, Yunpeng Li, Nthnel Fillmore For the lst few lectures we hve discussed

More information

The practical version

The practical version Roerto s Notes on Integrl Clculus Chpter 4: Definite integrls nd the FTC Section 7 The Fundmentl Theorem of Clculus: The prcticl version Wht you need to know lredy: The theoreticl version of the FTC. Wht

More information

Riemann Integrals and the Fundamental Theorem of Calculus

Riemann Integrals and the Fundamental Theorem of Calculus Riemnn Integrls nd the Fundmentl Theorem of Clculus Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University September 16, 2013 Outline Grphing Riemnn Sums

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Reinforcement Learning and Policy Reuse

Reinforcement Learning and Policy Reuse Reinforcement Lerning nd Policy Reue Mnuel M. Veloo PEL Fll 206 Reding: Reinforcement Lerning: An Introduction R. Sutton nd A. Brto Probbilitic policy reue in reinforcement lerning gent Fernndo Fernndez

More information

1 Linear Least Squares

1 Linear Least Squares Lest Squres Pge 1 1 Liner Lest Squres I will try to be consistent in nottion, with n being the number of dt points, nd m < n being the number of prmeters in model function. We re interested in solving

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below . Eponentil nd rithmic functions.1 Eponentil Functions A function of the form f() =, > 0, 1 is clled n eponentil function. Its domin is the set of ll rel f ( 1) numbers. For n eponentil function f we hve.

More information

Quantum Physics II (8.05) Fall 2013 Assignment 2

Quantum Physics II (8.05) Fall 2013 Assignment 2 Quntum Physics II (8.05) Fll 2013 Assignment 2 Msschusetts Institute of Technology Physics Deprtment Due Fridy September 20, 2013 September 13, 2013 3:00 pm Suggested Reding Continued from lst week: 1.

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information