Bias in Natural Actor-Critic Algorithms


Philip S. Thomas
Department of Computer Science, University of Massachusetts, Amherst, MA, USA
Technical Report UM-CS

Abstract

We show that two popular discounted reward natural actor-critics, NAC-LSTD and eNAC, follow biased estimates of the natural policy gradient. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation, and prove their convergence to globally optimal policies for discrete problems and locally optimal policies for continuous problems. Finally, we argue that the bias makes the existing algorithms more appropriate for the average reward setting.

1. Introduction

We show that two popular discounted reward natural actor-critics, NAC-LSTD and eNAC (Peters & Schaal, 2008), do not produce unbiased estimates of the natural policy gradient as purported. We prove that, for a set of Markov decision processes, these biased discounted reward natural actor-critics are actually unbiased average reward natural actor-critics, even though they use estimates of discounted reward value functions. Another algorithm, INAC (Degris et al., 2012), which is a variant of the NTD algorithm (Morimura et al., 2005), was originally presented as a biased discounted reward algorithm. We suggest that it is more appropriate to think of it as an average reward algorithm. We derive the unbiased discounted reward NAC-LSTD, eNAC, and NAC-S algorithms, where NAC-S is a linear-time algorithm similar to NTD and INAC. We prove that unbiased policy gradient and natural policy gradient algorithms, like those presented, are convergent to globally optimal policies for discrete problems. However, the unbiased discounted reward algorithms suffer from updates that rapidly decay to zero, which causes poor data efficiency.

2. Problem

We are interested in the problem of finding optimal decision rules, or policies, for sequential decision tasks formulated as Markov decision processes (MDPs). An MDP is a tuple, M = (S, A, P, R, d_0, γ). S and A denote the sets of possible states and actions, which may be countable (discrete) or uncountable (continuous).[1] P is called the transition function, where P_{ss'}^a = Pr(s_{t+1} = s' | s_t = s, a_t = a), where t ∈ N_0 denotes the time step, s, s' ∈ S, and a ∈ A. R is the reward function, where R_{s_t}^{a_t} = r_t, where s_t ∈ S, a_t ∈ A, and r_t ∈ [−r_max, r_max] for some uniformly bounding constant r_max. The initial state distribution is d_0, where d_0(s) = Pr(s_0 = s), and γ is a discount factor.

A policy, or stochastic policy, π ∈ Π, is a distribution over actions given a state: π(s, a) = Pr(a_t = a | s_t = s), where Π is the set of all possible policies. A parameterized policy µ with parameters θ ∈ R^n is a function that maps its parameters to policies, i.e., µ : R^n → Π and µ(θ)(s, a) = Pr(a_t = a | s_t = s, θ_t = θ). For brevity, we write µ_θ for µ(θ). We assume that, for all s, a, and θ, µ_θ(s, a) is differentiable with respect to θ.

The state value function, V^π, for a policy π, is a function mapping states to the expected sum of discounted rewards (or expected return) that would be accrued therefrom if π were executed on M. That is, V^π(s) = E[∑_{t=0}^∞ γ^t r_t | s_0 = s, π, M].[2] Similarly, the state-action value function is Q^π(s, a) = E[∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π, M]. The discounted state distribution, d^π, gives the probability of each state under policy π, with a discount applied to states that occur at later times: d^π(s) = (1 − γ) ∑_{t=0}^∞ γ^t Pr(s_t = s | s_0, π, M). The objective functional, J, gives the expected discounted return for running the provided policy on M for one episode: J(π) = E[∑_{t=0}^∞ γ^t r_t | π, M], where an episode is one sequence of states, actions, and rewards, starting from a state sampled from d_0 and following the dynamics specified by P and R.

[1] We abuse notation by writing summations and probabilities over S and A. If these sets are continuous, the summations and probabilities should be replaced with integrals and probability densities.
[2] To avoid clutter, we may suppress a function's dependencies on M. For example, V^π is a function of M.
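As a concrete illustration of the objective functional J, the following sketch estimates J(π) by averaging sampled discounted returns over episodes. The env and policy interfaces (reset, step, sample) are illustrative assumptions standing in for M = (S, A, P, R, d_0, γ) and π; they are not part of the paper.

    import numpy as np

    def estimate_J(env, policy, gamma, n_episodes=1000, max_steps=1000):
        """Monte Carlo estimate of J(pi) = E[sum_t gamma^t r_t | pi, M]."""
        returns = []
        for _ in range(n_episodes):
            s = env.reset()                   # s_0 ~ d_0
            G, discount = 0.0, 1.0
            for _ in range(max_steps):
                a = policy.sample(s)          # a_t ~ pi(s_t, .)
                s, r, terminal = env.step(a)  # dynamics P and reward R
                G += discount * r
                discount *= gamma
                if terminal:
                    break
            returns.append(G)
        return float(np.mean(returns))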

We call an MDP episodic if there are one or more states in which the process terminates, and, for all policies, every episode reaches a terminal state within a finite number of steps. To model episodic MDPs in a unified manner with non-episodic MDPs, we follow the formulation specified by Sutton & Barto (1998), in which only one action is admissible in a terminal state, and it causes a transition to an absorbing state with zero reward, which we call the post-terminal absorbing state. This absorbing state also has only one admissible action, which causes a self-transition with zero reward. We allow γ ∈ [0, 1], where γ = 1 only when the MDP is episodic.[3]

If S and A are countable, then the goal is to find an optimal policy, π*, which maximizes the objective functional: π* ∈ argmax_{π ∈ Π} J(π). If S or A is continuous, we search for locally optimal policy parameters, θ*, that is, parameters satisfying ∇J(θ*) = 0, where J = J ∘ µ, and where we assume ∇J is Lipschitz.

3. Policy Gradient

Gradient ascent algorithms for maximizing the objective functional are called policy gradient algorithms. Their basic update is θ_{t+1} ← θ_t + α_t ∇J(θ_t), where {α_t} is a scalar step size schedule. Policy gradient methods may also use unbiased estimates of the gradient, making them stochastic gradient ascent algorithms. Stochastic gradient ascent is guaranteed to converge to a local maximum if ∇J is Lipschitz, ∑_t α_t = ∞, and ∑_t α_t² < ∞ (Bertsekas & Tsitsiklis, 2000). We assume that all step size schedules hereafter satisfy these constraints.

The policy gradient, ∇J(θ), is the direction Δθ that maximizes J(θ + Δθ) under the constraint that ‖Δθ‖² = ε², for sufficiently small ε, where ‖·‖ denotes the Euclidean (L2) norm. Amari (1998) suggested that Riemannian distance may be a more appropriate metric than Euclidean distance for parameter space. He calls the direction satisfying this modified constraint the natural gradient. Kakade (2002) suggested the application of natural gradients to policy gradients to get the natural policy gradient. Bagnell & Schneider (2003) then derived a proper Riemannian distance metric,[4] based on Amari's and Kakade's work, and showed that the natural policy gradient is covariant. Bhatnagar et al. (2009) built on this foundation to create several provably convergent policy gradient and natural policy gradient algorithms for the average reward setting.

At this point, it was known that if

    ∑_s d^π(s) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0,    (1)

where f_ϖ(s, a) is a linear function approximator with parameter vector ϖ = [w^T, v^T]^T, |w| = |θ|, feature vector ψ_{s,a} = [(∂ log µ_θ(s, a)/∂θ)^T, φ(s)^T]^T for arbitrary uniformly bounded φ(s), and f_ϖ(s, a) = ϖ^T ψ_{s,a}, then the natural policy gradient is ∇̃J(θ) = w (Sutton et al., 2000; Kakade, 2002).[5] The challenge was then to devise methods for finding w satisfying Equation 1.

[3] If γ = 1, every episode reaches a terminal state within some finite time, T, so d^π(s) sums to T.
[4] Recent work has proposed the use of a different metric that accounts not only for how the distribution over actions (the policy) changes as the parameters change, but also for how the state distribution changes as the parameters change (Morimura et al., 2009).
[5] Notice that if φ(s) = 0, we can drop v from Equation 1 to get the exact constraint specified by Sutton et al. (2000). Equation 1 follows immediately since ∑_a (∂µ_θ(s, a)/∂θ) v^T φ(s) = 0 for all s, µ_θ, φ, ϖ, and M. Also, for simplicity later, we assume that φ(s) = 0 for the post-terminal absorbing state.
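To make Equation 1 concrete, the sketch below builds the compatible feature vector ψ_{s,a} = [(∂ log µ_θ(s, a)/∂θ)^T, φ(s)^T]^T for a tabular Gibbs (softmax) policy over discrete states and actions. The softmax parameterization and the indicator choice of φ are illustrative assumptions, not prescribed by the paper.

    import numpy as np

    def softmax_policy(theta, s):
        """mu_theta(s, a) proportional to exp(theta[s, a])."""
        prefs = theta[s] - np.max(theta[s])
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_mu(theta, s, a):
        """d log mu_theta(s, a) / d theta for the tabular softmax policy."""
        g = np.zeros_like(theta)
        g[s] = -softmax_policy(theta, s)
        g[s, a] += 1.0
        return g.ravel()

    def compatible_features(theta, s, a, n_states):
        """psi_{s,a} = [(d log mu / d theta)^T, phi(s)^T]^T with indicator phi."""
        phi = np.zeros(n_states)
        phi[s] = 1.0
        return np.concatenate([grad_log_mu(theta, s, a), phi])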
4. Finding w

To satisfy Equation 1, Sutton et al. (2000), working in the v = 0 setting, suggest letting f_ϖ : S × A → R be an approximation to Q^{µ_θ} with parameter vector ϖ = w. They claim that learning f_ϖ by following µ_θ and updating ϖ by a rule such as Δϖ_t ∝ ∂/∂ϖ [Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t)]², where Q̂^{µ_θ}(s, a) is some unbiased estimate of Q^{µ_θ}(s, a), will result in a satisfactory w. However, this is only true for the average reward setting, or the discounted setting when γ = 1, because, in the discounted setting, d^π in Equation 1 is the discounted weighting of states encountered, whereas the states observed when merely following µ_θ come from the undiscounted state distribution.

Peters & Schaal (2006; 2008) observed that the scheme proposed by Sutton et al. (2000) is a forward TD(1) algorithm. Because forward and backward TD(λ) are approximately equivalent, they suggest using least squares temporal difference learning (LSTD), a backward TD(λ) method, to approximate Q^{µ_θ} with f_ϖ, where λ = 1. They call the resulting algorithms the natural actor-critic using LSTD (NAC-LSTD) and the episodic natural actor-critic (eNAC). Because the scheme proposed by Sutton et al., and thus TD(1), does not incorporate the γ^t weighting in the discounted state distribution, this results in w that do not satisfy Equation 1, and thus a bias in the natural policy gradient estimates.
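The discrepancy described above can be checked empirically: counting visits while following µ_θ estimates the undiscounted state distribution, whereas d^π weights a visit at time t by (1 − γ)γ^t. A minimal sketch, using the same hypothetical env and policy interfaces as before:

    import numpy as np

    def state_distributions(env, policy, gamma, n_states, n_episodes=1000, max_steps=200):
        """Empirical undiscounted visitation vs. discounted state distribution d^pi."""
        undiscounted = np.zeros(n_states)
        discounted = np.zeros(n_states)
        for _ in range(n_episodes):
            s = env.reset()
            for t in range(max_steps):
                undiscounted[s] += 1.0                      # plain visit count
                discounted[s] += (1.0 - gamma) * gamma**t   # gamma^t weighting in d^pi
                a = policy.sample(s)
                s, r, terminal = env.step(a)
                if terminal:
                    break
        return undiscounted / undiscounted.sum(), discounted / discounted.sum()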

One solution would be to convert the discounted MDP into an equivalent undiscounted MDP, as described in Section 2.3 of Bertsekas & Tsitsiklis (1996). To do this, each observed trajectory must be truncated after each transition with probability 1 − γ. Notice that NAC-LSTD is not biased when γ = 1 because then the discounted and undiscounted state distributions are identical.[6] So, after the trajectories are truncated, the existing NAC-LSTD algorithm could be used with γ = 1 to find a policy for the original MDP. However, this approach may discard a significant amount of data when truncating episodes.

[6] It is unclear whether eNAC would be unbiased in this situation, as described in Section 7.

Instead, we propose the use of all of the observed data with proper discounting in order to produce unbiased gradient estimates. We present a new objective functional, H, and prove that the local minima of this objective give w satisfying Equation 1. We then provide the stochastic gradient descent update for this objective. When following µ_θ, the discounting from the discounted state distribution can be shifted into the objective functional in order to properly satisfy Equation 1. We select the w that is a component of a local minimum of the objective functional H:

    H(ϖ) = ∑_{t=0}^∞ ∑_s Pr(s_t = s | M, µ_θ) ∑_a µ_θ(s, a) [γ^t (Q^{µ_θ}(s, a) − f_ϖ(s, a))²]    (2)
         = ∑_{t=0}^∞ E_{s_t, a_t} [γ^t (Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t))²].

The objective functional is always finite because either γ < 1 or the MDP is episodic. If the MDP is episodic, it must enter the post-terminal absorbing state within a finite number of steps. In this state, ψ_{s,a} = 0 and Q^π(s, a) = 0 for all π and the one admissible a, so ∑_a µ_θ(s, a) γ^t (Q^{µ_θ}(s, a) − f_ϖ(s, a))² = 0 for all ϖ. Hence, if the MDP is episodic, only a finite number of terms in the infinite sum will be non-zero. We propose performing stochastic gradient descent on H to obtain a local minimum where ∇H(ϖ) = 0, so

    ∑_{t=0}^∞ γ^t ∑_s Pr(s_t = s | M, µ_θ) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0.    (3)

By the definition of d^π, this is equivalent to Equation 1. Hence, when gradient descent on H has converged, the resulting w component of ϖ satisfies Equation 1. Notice that the expectation in Equation 2 is over the observed probabilities of states and actions at time t if executing µ_θ on M. Hence, we can update ϖ via stochastic gradient descent:

    ϖ ← ϖ + η γ^t [Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t)] ∂f_ϖ(s_t, a_t)/∂ϖ,    (4)

where Q̂^{µ_θ} is an unbiased estimate of Q^{µ_θ} and η is a step size satisfying the typical decay constraints. The substitution of Q̂^{µ_θ} for Q^{µ_θ} does not influence convergence (Bertsekas & Tsitsiklis, 2000). Because ∂f_ϖ(s, a)/∂ϖ is zero for terminal states and the post-terminal absorbing state, the above update need only be performed for the pre-terminal states. With v = 0, this differs from the method proposed by Sutton et al. (2000) only by the sum over time and the γ^t term.

5. Algorithms

A simple algorithm to find w would be to execute episodes and then perform the updates in Equation 4 using the Monte Carlo return, Q̂^{µ_θ}(s_t, a_t) = ∑_{τ=0}^∞ γ^τ r_{t+τ}, as the unbiased estimate of Q^{µ_θ}(s_t, a_t). This is a forward TD(1) algorithm, with an additional discount applied to updates based on the time at which they occur. However, this algorithm requires that entire trajectories be stored in memory. To overcome this, we can derive the equivalent backward update by following Sutton and Barto's derivation of backward TD(λ) (Sutton & Barto, 1998). The resulting on-policy backward algorithm for estimating Q^{µ_θ} for a fixed µ_θ is:

    e_{t+1} = γλ e_t + γ^t ∂f_ϖ(s_t, a_t)/∂ϖ    (5)
    δ_t = r_t + γ f_ϖ(s_{t+1}, a_{t+1}) − f_ϖ(s_t, a_t)    (6)
    ϖ_{t+1} = ϖ_t + η_t δ_t e_{t+1},    (7)

where λ is the decay parameter for eligibility traces, as in TD(λ), and s_t, a_t, and r_t come from running µ_θ on M.
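A minimal sketch of the backward update in Equations 5–7 for a linear f_ϖ(s, a) = ϖ^T ψ_{s,a}; the psi feature function, the step size handling, and the env/policy interfaces are illustrative assumptions.

    import numpy as np

    def discounted_critic_episode(env, policy, psi, varpi, gamma, lam, eta):
        """One episode of the gamma^t-discounted backward TD(lambda) critic (Eqs. 5-7)."""
        s = env.reset()
        a = policy.sample(s)
        e = np.zeros_like(varpi)                                    # eligibility trace e_t
        t, terminal = 0, False
        while not terminal:
            s_next, r, terminal = env.step(a)
            a_next = policy.sample(s_next)
            e = gamma * lam * e + gamma**t * psi(s, a)              # Eq. 5 (df/dvarpi = psi)
            f_next = 0.0 if terminal else varpi @ psi(s_next, a_next)
            delta = r + gamma * f_next - varpi @ psi(s, a)          # Eq. 6
            varpi = varpi + eta * delta * e                         # Eq. 7
            s, a, t = s_next, a_next, t + 1
        return varpi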
Although the backward and forward algorithms are only approximately equivalent (Sutton & Barto, 1998), their convergence guarantees are the same (Bertsekas & Tsitsiklis, 1996). Hence, if λ = 1 and η_t is decayed appropriately, the modified backward TD(λ) algorithm above will produce w satisfying Equation 1. The only difference between this algorithm and Sarsa(λ) is the γ^t in Equation 5. One can then reproduce the work of Bradtke & Barto (1996) to create LSTD in this new setting, which approximates V^{µ_θ} in a least squares manner. This can be extended along the lines of Lagoudakis & Parr (2001) to create LSQ, which approximates Q^{µ_θ} in a least squares manner. The resulting LSQ algorithm in Peters and Schaal's NAC-LSTD changes only by the introduction of a γ^t term: z_{t+1} = λ z_t + γ^t φ̂_t. We omit the complete pseudocode for NAC-LSTD due to space constraints.

To create an episodic algorithm, we convert Equation 1 into a system of linear equations using the assumption that all episodes terminate within T steps, for some finite number T. We rewrite Equation 1 by replacing the infinite sum in d^π with a finite one, because ∂f_ϖ(s, a)/∂ϖ is zero for absorbing states:

    ∑_s ∑_a µ_θ(s, a) ∑_{t=0}^T γ^t Pr(s_t = s) (Q^{µ_θ}(s, a) − ϖ^T ψ_{s,a}) ψ_{s,a} = 0.    (8)

By collecting the terms with ϖ on the left and the others on the right, we get

    [∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t ψ_{s,a} ψ_{s,a}^T] ϖ = b,    (9)

where b = ∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t Q^{µ_θ}(s, a) ψ_{s,a}. If we let A = ∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t ψ_{s,a} ψ_{s,a}^T, then we get the system of linear equations Aϖ = b, where A is a |ψ| by |ψ| square matrix. We can then generate unbiased estimates of A and b from sample trajectories. As the number of observed trajectories grows, our estimates of A and b converge to their true values, giving an unbiased estimate of the natural gradient. The resulting episodic natural actor-critic algorithm, eNAC2, is presented in Algorithm 1.

Algorithm 1: episodic Natural Actor-Critic 2 (eNAC2)
1: Input: MDP M, parameterized policy µ_θ(s, a) with initial parameters θ, basis function φ(s) for the state-value estimation, update frequency parameter k, discount parameter γ, decay constant β, learning rate schedule {η_t}, and maximum episode duration T.
2: A ← 0; b ← 0; τ ← 0
3: for ep = 0, 1, 2, ... do
4:   Run an episode and remember the trajectory, {s_t, a_t, s_{t+1}, r_t}, t ∈ [0, T − 1].
5:   Update Statistics:
6:   A ← A + ∑_{t=0}^T f(t) ψ_{s_t, a_t} ψ_{s_t, a_t}^T
7:   b ← b + ∑_{t=0}^T f(t) ψ_{s_t, a_t} ∑_{t̂=t}^T γ^{t̂−t} r_{t̂}
8:   [w_ep^T, v_ep^T]^T = (A^T A)^{−1} A^T b   // If Type2, this need only be done every k episodes.
9:   Update Actor (Natural Policy Gradient):
10:  if (Type1, ep − k ≥ 0, and the angle between w_ep and w_{ep−k} is at most ε) or
11:     (Type2 and (ep + 1) mod k = 0) then
12:    θ ← θ + η_τ w_ep / ‖w_ep‖_2
13:    τ ← τ + 1; A ← βA; b ← βb

For both algorithms presented, the user must select either Type1 or Type2 updates. In the former, which emulates the update scheme proposed by Peters & Schaal (2008), the policy is updated when the gradient estimate has converged, while in the latter, which emulates the two-timescale update scheme proposed by Bhatnagar et al. (2009), the policy is updated after a constant number of time steps. The user must also select f(t) = γ^t to get the unbiased algorithms or f(t) = 1 to get the biased algorithms. The unbiased algorithms are only truly unbiased when λ = 1, β = 0 (if β is present), and ε → 0 (Type1) or k → ∞ (Type2), in which case they compute and ascend the exact natural policy gradient.
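The statistics update in Algorithm 1 amounts to accumulating sample estimates of A and b (Equation 9) from each trajectory and solving the resulting least squares system for [w^T, v^T]^T. A minimal batch sketch; the trajectory format and the psi feature function are illustrative assumptions.

    import numpy as np

    def enac2_gradient_estimate(trajectories, psi, gamma, f):
        """Estimate A and b from trajectories and solve A varpi = b (Eqs. 8-9).
        Each trajectory is a list of (s_t, a_t, r_t); f(t) = gamma**t (unbiased) or 1 (biased)."""
        dim = len(psi(*trajectories[0][0][:2]))
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for traj in trajectories:
            rewards = [r for (_, _, r) in traj]
            for t, (s, a, _) in enumerate(traj):
                ps = psi(s, a)
                G_t = sum(gamma**k * rewards[t + k] for k in range(len(traj) - t))
                A += f(t) * np.outer(ps, ps)
                b += f(t) * G_t * ps
        varpi = np.linalg.lstsq(A, b, rcond=None)[0]   # least squares solve, as in line 8
        return varpi                                   # the first |theta| entries are w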
NAC-LSTD and eNAC2 have computational complexity proportional to |ϖ|² per time step just to update statistics, and |ϖ|³ to compute the natural policy gradient estimate for a policy improvement step. This complexity can be improved to linear by using the modified Sarsa(λ) algorithm in place of LSTD to find w satisfying Equation 1. We call the resulting algorithm the Natural Actor-Critic using Sarsa(λ), or NAC-S. Notice that some mean-zero terms can be removed from the Sarsa(λ) update, and the resulting algorithm, provided in Algorithm 2, can be viewed as the discounted reward and eligibility trace extension of the Natural-Gradient Actor-Critic with Advantage Parameters (Bhatnagar et al., 2009).[7] NAC-S can also be viewed as INAC (Degris et al., 2012) or NTD (Morimura et al., 2005) corrected to include the γ^t term, and with the option of computing exact gradient estimates or using two-timescales.

[7] To get Bhatnagar's algorithm, select Type2 updates with k = 1, f(t) = 1, and replace the discounted TD error with the average reward TD error.

Algorithm 2: Natural Actor-Critic using Sarsa(λ) — NAC-S(λ)
1: Input: MDP M, parameterized policy µ_θ(s, a) with initial parameters θ, basis function φ(s) for the state-value estimation, update frequency parameter k, discount parameter γ, eligibility decay rate λ, and learning rate schedules {α_t^w}, {α_t^v}, and {η_t}.
2: w_0 ← 0; v_0 ← 0; count ← 0
3: for episode = 0, 1, 2, ... do
4:   Draw initial state s_0 ~ d_0(·)
5:   e^w_{−1} = 0; e^v_{−1} = 0; τ_1 = 0; τ_2 = 0
6:   for t = 0, 1, 2, ... do
7:     a_t ~ µ_θ(s_t, ·); s_{t+1} ~ P(s_t, a_t, ·); r_t = R_{s_t}^{a_t}
8:     count ← count + 1
9:     Update Critic (Sarsa):
10:    δ_t = r_t + γ v_t^T φ(s_{t+1}) − v_t^T φ(s_t)
11:    e^w_t = γλ e^w_{t−1} + f(t) [∂ log µ_θ(s_t, a_t)/∂θ]
12:    e^v_t = γλ e^v_{t−1} + f(t) φ(s_t)
13:    w_{t+1} = w_t + α^w_{t−τ_1} [δ_t − w_t^T (∂ log µ_θ(s_t, a_t)/∂θ)] e^w_t
14:    v_{t+1} = v_t + α^v_{t−τ_1} δ_t e^v_t
15:    Update Actor (Natural Policy Gradient):
16:    if (Type1, t − k ≥ 0, and the angle between w_t and w_{t−k} is at most ε) or
17:       (Type2 and (count mod k = 0)) then
18:      θ ← θ + η_{τ_2} w_{t+1} / ‖w_{t+1}‖_2; τ_1 = t; τ_2 = τ_2 + 1
19:    if s_{t+1} is terminal then break out of the loop over t

Notice that in all algorithms presented in this paper, the natural gradient is normalized. This normalization is optional. It may void convergence guarantees, and it often makes it difficult to achieve empirical convergence. However, in practice we find it easier to find a fixed step size that works on difficult problems when using normalized updates to θ. Amari defined the natural gradient only as a direction, and even discarded a scaling constant in his derivation of a closed form for the natural gradient (Amari, 1998).
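A compact sketch of the NAC-S(λ) inner loop of Algorithm 2 with Type2 updates: the Sarsa-style critic with f(t)-weighted traces, followed by the normalized natural gradient actor step. The step size handling is simplified relative to the {α_t^w}, {α_t^v}, {η_t} schedules, and the env/policy/phi interfaces are illustrative assumptions.

    import numpy as np

    def nac_s_episode(env, policy, phi, theta, w, v, gamma, lam, f,
                      alpha_w, alpha_v, eta, k, count):
        """One episode of NAC-S(lambda) with Type2 updates (actor step every k samples)."""
        s = env.reset()
        e_w, e_v = np.zeros_like(w), np.zeros_like(v)
        t, terminal = 0, False
        while not terminal:
            a = policy.sample(s, theta)
            s_next, r, terminal = env.step(a)
            count += 1
            glm = policy.grad_log(s, a, theta)                # d log mu_theta(s, a) / d theta
            v_next = 0.0 if terminal else v @ phi(s_next)
            delta = r + gamma * v_next - v @ phi(s)           # critic TD error
            e_w = gamma * lam * e_w + f(t) * glm
            e_v = gamma * lam * e_v + f(t) * phi(s)
            w = w + alpha_w * (delta - w @ glm) * e_w
            v = v + alpha_v * delta * e_v
            if count % k == 0:                                # actor: normalized natural gradient step
                theta = theta + eta * w / (np.linalg.norm(w) + 1e-12)
            s, t = s_next, t + 1
        return theta, w, v, count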

6. Convergence

The natural actor-critics compute and ascend the natural gradient of J, and thus will converge to a locally optimal policy, at which point ∇J(θ) = 0, assuming the step size schedules are properly decayed and that the natural actor-critics' estimates of the natural gradient are unbiased (Amari, 1998). As stated previously, when λ = 1, β = 0 (if β is present), and ε → 0 (Type1) or k → ∞ (Type2), the natural gradient estimates will be exact. In practice, a large k or small ε and a small fixed step size usually result in convergence.

Policy gradient approaches are typically purported to have one significant drawback: whereas Q-based methods converge to globally optimal policies for problems with discrete states and actions, policy gradient algorithms can become stuck in arbitrarily bad local optima (e.g., Peters & Bagnell, 2010; Peters, 2010). We argue that with assumptions similar to those required by Q-learning and Sarsa, ascending the policy gradient results in convergence to a globally optimal policy as well.[8] First, we assume that S and A are countable and that every state-action pair is observed infinitely often. Second, we assume that for all θ, all states s, and all actions a and â, where a ≠ â, there is a direction dθ of change to θ that causes the probability of a in state s to increase while that of â decreases, while all other action probabilities remain unchanged. These two assumptions are satisfied by policy parameterizations such as tabular Gibbs softmax action selection (Sutton & Barto, 1998).

We argue that at all suboptimal θ, the policy gradient will be non-zero. For any policy that is not globally optimal, there exists a reachable state s for which increasing the probability of a specific action a while decreasing the probability of â would increase J (see Section 4.2 of Sutton & Barto (1998)). By our first assumption, this state-action pair is reached by the policy, and by our second assumption, there is a direction, dθ, of change to θ that can make exactly this change. So, the directional derivative of J at θ in the direction dθ is non-zero, and therefore the gradient of J at θ must also be non-zero. Hence, θ cannot be a local optimum.

Policy gradient is typically applied to problems with continuous state or action sets, in which case the assumptions above cannot be satisfied, so convergence to only a local optimum can be guaranteed. However, the above argument suggests that, in practice and on continuous problems, local optima can be avoided by increasing exploration and the representational power of the policy parameterization. However, if one desires a specific low-dimensional policy parameterization, such as a proportional-derivative controller with limited exploration, then increasing the exploration and representational power of the policy may not be an acceptable option, in which case local optima may be unavoidable.

[8] Notice that this applies to all algorithms that ascend the policy gradient or natural policy gradient.

7. Analysis of Biased Algorithms

In this section we analyze how the bias changes performance.
Recall that, without the correct discounting, w are the weights that minimize the squared error in the Q^{µ_θ} estimates, with states sampled from actual episodes. With the proper discounting, states that are visited at later times factor less into w. Because w will be the change to the policy parameters, this means that in the biased algorithms the change to the policy parameters considers states that are visited at later times just as much as states that are visited earlier. This suggests that the biased algorithms may be optimizing a different objective functional, similar to

    J̄(θ) = (1 − γ) ∑_s d̄^{µ_θ}(s) V^{µ_θ}(s),    (10)

where d̄^π is the stationary distribution of the Markov chain induced by the policy π. More formally, we assume d̄^π(s) = lim_{t→∞} Pr(s_t = s | s_0, π, M) exists and is independent of s_0 for all policies. Notice that J̄ is not interesting for episodic MDPs since, for all policies, d̄^π(s) is non-zero only for the post-terminal absorbing state. So, henceforth, our discussion is limited to the non-episodic setting. For comparison, we can write J in the same form: J(θ) = ∑_s d_0(s) V^{µ_θ}(s).

The original objective functional, J, gives the expected return from an episode. This means that for small γ, it barely considers the quality of the policy at states that are visited late in a trajectory. On the other hand, J̄ considers states based on their visitation frequency, regardless of when they are visited. Kakade (2001) showed that J̄, which includes discounting in V^{µ_θ}, is the typical average reward objective functional.

To see that the biased algorithms appear to optimize something closer to this average reward objective, consider an MDP with S = [0, 1], where s_0 = 0, s = 1 is terminal, s_{t+1} = s_t + c for a small constant c, and R_s^a = −(s − a)². The optimal policy is to select a_t = s_t. We parameterize the policy with one parameter, such that µ_θ selects actions a_t ~ N(θ, σ²) for all states, where N is a normal distribution with a small constant variance, σ². If γ = 1, the optimal parameter, θ*, is θ* = 0.5. Both the biased and unbiased algorithms converge to this θ*. However, for smaller γ (e.g., γ = 0.5), the optimal θ decreases in order to receive more reward initially. We found that the unbiased natural actor-critics properly converge to the new optimal θ, as does a simple hill-climbing algorithm that we implemented as a control.[9] However, the biased algorithms still converge to approximately θ = 0.5. We found that eNAC converges to θ that differ from those of all other algorithms when γ ≠ 1, which suggests that eNAC, but not eNAC2, may have additional bias. These results are presented in Figure 1.

[Figure 1: plot of the selected action as a function of the state, with reference lines Action = State and Action = 0.5.] Figure 1. The optimal policy (optimal), the actions selected by the biased NAC-LSTD, eNAC2, and INAC (biased), the actions selected by the unbiased NAC-LSTD, eNAC2, and NAC-S, as well as a random restart hill-climbing algorithm (unbiased), and the actions selected by eNAC (eNAC).

[9] We used random restarts for all methods and observed no local optima.
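For the one-parameter example above, the discounted objective can be evaluated in closed form: with a_t ~ N(θ, σ²) the expected per-step reward is −((s_t − θ)² + σ²), so the maximizer of J(θ) is the γ^t-weighted mean of the visited states. The sketch below assumes a concrete state increment of 0.05 purely for illustration; that constant is an assumption, not specified here.

    import numpy as np

    def optimal_theta(gamma, increment=0.05):
        """Maximizer of J(theta) = -sum_t gamma^t ((s_t - theta)^2 + sigma^2) for the
        1-D example MDP: the discounted-visitation-weighted mean of the states."""
        states = np.arange(0.0, 1.0 + 1e-9, increment)   # s_0 = 0 up to the terminal state
        weights = gamma ** np.arange(len(states))
        return float(np.sum(weights * states) / np.sum(weights))

    print(optimal_theta(gamma=1.0))   # 0.5: here the biased and unbiased algorithms agree
    print(optimal_theta(gamma=0.5))   # < 0.5: the unbiased algorithms track this smaller optimum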

This difference raises the question of whether the biased algorithms actually compute the natural policy gradient in the average reward setting. In the remainder of this section, we prove that they do whenever

    ∑_s V^{µ_θ}(s) ∂d̄^{µ_θ}(s)/∂θ = 0.    (11)

To derive Equation 11, we first review results concerning the average reward natural policy gradient. The typical objective for average reward learning is

    J̄(θ) = lim_{n→∞} (1/n) ∑_{t=0}^{n−1} E[r_t | µ_θ, M].    (12)

As mentioned previously, Kakade (2001) showed that this is equivalent to the definition in Equation 10. The state-action value function is defined as

    Q̄^{µ_θ}(s, a) = ∑_{t=0}^∞ E[r_t − J̄(θ) | s_0 = s, a_0 = a, µ_θ, M].    (13)

Kakade (2002) stated that if

    ∑_s d̄^π(s) ∑_a µ_θ(s, a) [Q̄^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0,    (14)

then the natural gradient of J̄ is

    ∇̃J̄(θ) = w.    (15)

Thus, the unbiased average reward natural policy gradient is given by the w satisfying Equation 14. The biased algorithms perform stochastic gradient descent according to the scheme proposed by Sutton et al. (2000). They sample states, s, from d̄^{µ_θ} and actions, a, from µ_θ, and perform gradient descent on the squared difference between Q^{µ_θ}(s, a) and f_ϖ(s, a). Thus, they select w satisfying

    ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0.    (16)

Notice that Equation 16 uses the discounted state-action value function while Equation 14 uses the average reward state-action value function. To determine if and when the biased algorithms compute ∇̃J̄(θ), we must determine when a constant multiple of the solution to Equation 16 satisfies Equation 14. To do this, we solve Equation 16 for w and substitute a constant, k > 0, times these w into Equation 14 to generate a constraint that, when satisfied, results in the biased algorithms producing the same direction (but not necessarily magnitude) as the average reward natural policy gradient. When doing so, we assume that v = 0, since it does not influence the solution to either equation. First, we must establish a lemma that relates the policy gradient theorem using the average reward state distribution but the discounted reward state-action value function (the left hand side of Lemma 1) to the derivative of J̄ without proper application of the chain rule:

Lemma 1.  ∑_s d̄^{µ_θ}(s) ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ, for all θ, µ, and M.

For a proof of Lemma 1, see the appendix.
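Lemma 1 can be verified numerically on a small random MDP by solving for V^{µ_θ} and Q^{µ_θ} exactly, estimating the stationary distribution, and taking finite-difference derivatives with respect to θ. Everything in the sketch below (the random MDP, the tabular softmax policy, the finite-difference step) is an illustrative assumption used only to check the identity.

    import numpy as np

    def lemma1_check(n_s=4, n_a=3, gamma=0.9, seed=0, eps=1e-6):
        rng = np.random.default_rng(seed)
        P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
        R = rng.normal(size=(n_s, n_a))                    # R[s, a]
        theta = rng.normal(size=(n_s, n_a))                # tabular softmax parameters

        def pi(th):                                        # mu_theta(s, a)
            e = np.exp(th - th.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def V(th):                                         # solve V = sum_a mu (R + gamma P V)
            p = pi(th)
            P_pi = np.einsum('sa,sat->st', p, P)
            r_pi = np.einsum('sa,sa->s', p, R)
            return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

        p = pi(theta)
        d_bar = np.linalg.matrix_power(np.einsum('sa,sat->st', p, P), 1000)[0]  # stationary dist.
        Q = R + gamma * P @ V(theta)                       # Q[s, a]

        lhs = np.zeros_like(theta)                         # sum_s d_bar sum_a dmu/dtheta Q
        rhs = np.zeros_like(theta)                         # (1 - gamma) sum_s d_bar dV/dtheta
        for i in range(n_s):
            for j in range(n_a):
                dth = np.zeros_like(theta)
                dth[i, j] = eps
                dmu = (pi(theta + dth) - pi(theta - dth)) / (2 * eps)
                dV = (V(theta + dth) - V(theta - dth)) / (2 * eps)
                lhs[i, j] = np.sum(d_bar[:, None] * dmu * Q)
                rhs[i, j] = (1 - gamma) * np.sum(d_bar * dV)
        assert np.allclose(lhs, rhs, atol=1e-4)
        return lhs, rhs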

Solving Equation 16 for w, which gives the direction of the biased algorithms, we get

    w = (∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) ψ_{s,a} ψ_{s,a}^T)^{−1} (∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q^{µ_θ}(s, a) ψ_{s,a}).    (17)

Notice that the first factor is the inverse of the (average) Fisher information matrix (Bagnell & Schneider, 2003). Substituting k times this w into Equation 14 for w and canceling the product of the Fisher information matrix and its inverse gives

    0 = ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q̄^{µ_θ}(s, a) ψ_{s,a} − k ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q^{µ_θ}(s, a) ψ_{s,a}    (18)
      = ∇J̄(θ) − k (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,

by substitution of the policy gradient theorem (Sutton et al., 2000) and Lemma 1. Thus, when, for some k,

    ∇J̄(θ) = k (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,    (19)

the biased algorithms produce the direction of the unbiased average reward natural policy gradient. If we let k = 1, we will still get a constraint that results in the two directions being the same, although if the constraint is not satisfied, it does not mean the two are different (since a different k may result in Equation 19 being satisfied). Setting k = 1 and substituting Equation 10 for ∇J̄(θ), we get:

    (1 − γ) ∂/∂θ ∑_s d̄^{µ_θ}(s) V^{µ_θ}(s) = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ
    ⟹ ∑_s (∂d̄^{µ_θ}(s)/∂θ) V^{µ_θ}(s) + ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ
    ⟹ ∑_s (∂d̄^{µ_θ}(s)/∂θ) V^{µ_θ}(s) = 0.    (20)

We have shown that when Equation 11 holds, the biased algorithms compute the average reward natural policy gradient.

8. Discussion and Conclusion

We have shown that NAC-LSTD and eNAC produce biased estimates of the natural gradient. We argued that they, and INAC and NTD, act more like average reward natural actor-critics that do not properly account for how changes to θ change the expected return via d^{µ_θ}. We proved that in certain situations the biased algorithms produce unbiased estimates of the natural policy gradient for the average reward setting. The bias stems from improper discounting when approximating the state-action value function using compatible function approximation. We derived the properly discounted algorithms to produce the unbiased NAC-LSTD and eNAC2, as well as the biased and unbiased NAC-S, a linear time complexity alternative to the squared to cubic time complexity NAC-LSTD and eNAC2.

However, the unbiased algorithms have a critical drawback that limits their practicality. The unbiased algorithms discount their updates by γ^t. For small γ, the updates will decay to zero rapidly, causing the unbiased algorithms to ignore data collected after a short burn-in period. Consider an MDP like the one presented earlier, where the set of states that occur early and those that occur later are disjoint. In this setting, the discounted reward objective mandates that data recorded late in trajectories must be ignored. In this situation, the rapid decay of updates is a curse of the choice of objective function. However, if the states that are visited early in a trajectory are also visited later in the trajectory, off-policy methods may be able to take advantage of data from late in an episode to provide meaningful updates even for the discounted reward setting. They may also be able to properly use data from previous policies to improve the estimates of the natural policy gradient in a principled manner. These are possible avenues for future research.

Another interesting extension would be to determine how γ should be selected in the biased algorithms. Recall that Equation 10 is the average reward objective, for all γ. This suggests that in the biased algorithms γ may be selected by the researcher. Smaller values of γ are known to result in faster convergence of value function estimates (Szepesvári, 1997); however, larger γ typically result in smoother value functions that may be easier to approximate accurately with few features.
Lastly, we argued that, with certain policy parameterizations, policy gradient methods converge to globally optimal policies for discrete problems, and suggested that local optima may be avoided in continuous problems by increasing exploration and the policy's representational power. Future work may attempt to provide global convergence guarantees for a subset of the continuous-action setting by intelligently increasing the representational power of the policy when it becomes stuck in a local optimum.

References

Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10, 1998.
Bagnell, J. A. and Schneider, J. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods. SIAM Journal on Optimization, 10, 2000.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45(11), 2009.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
Kakade, S. Optimizing average reward using discounted rewards. In Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001.
Kakade, S. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, 2002.
Lagoudakis, M. and Parr, R. Model-free least-squares policy iteration. In Neural Information Processing Systems: Natural and Synthetic, 2001.
Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Applications, 2005.
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. A generalized natural actor-critic algorithm. In Neural Information Processing Systems: Natural and Synthetic, 2009.
Peters, J. Policy gradient methods. Scholarpedia, 5(11):3698, 2010.
Peters, J. and Bagnell, J. A. Policy gradient methods. Encyclopedia of Machine Learning, 2010.
Peters, J. and Schaal, S. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71, 2008.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 2000.
Szepesvári, C. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, volume 10, 1997.

Appendix: Proof of Lemma 1

    ∂V^{µ_θ}(s)/∂θ = ∂/∂θ ∑_a µ_θ(s, a) Q^{µ_θ}(s, a)    (21)
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) ∂Q^{µ_θ}(s, a)/∂θ ]
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) ∂/∂θ ( R_s^a + γ ∑_{s'} P_{ss'}^a V^{µ_θ}(s') ) ]
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) γ ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ ].

Solving for ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) yields

    ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) = ∂V^{µ_θ}(s)/∂θ − γ ∑_a µ_θ(s, a) ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ.    (22)

Summing both sides over all states, weighted by d̄^{µ_θ}(s), gives

    ∑_s d̄^{µ_θ}(s) ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a)    (23)
    = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ − γ ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ
    = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ − γ ∑_{s'} d̄^{µ_θ}(s') ∂V^{µ_θ}(s')/∂θ
    = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,

where the third line uses the fact that d̄^{µ_θ} is the stationary distribution, so ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) P_{ss'}^a = d̄^{µ_θ}(s').


More information

Calculus I-II Review Sheet

Calculus I-II Review Sheet Clculus I-II Review Sheet 1 Definitions 1.1 Functions A function is f is incresing on n intervl if x y implies f(x) f(y), nd decresing if x y implies f(x) f(y). It is clled monotonic if it is either incresing

More information

Uncertain Dynamic Systems on Time Scales

Uncertain Dynamic Systems on Time Scales Journl of Uncertin Sytem Vol.9, No.1, pp.17-30, 2015 Online t: www.ju.org.uk Uncertin Dynmic Sytem on Time Scle Umber Abb Hhmi, Vile Lupulecu, Ghu ur Rhmn Abdu Slm School of Mthemticl Science, GCU Lhore

More information

Conservation Law. Chapter Goal. 5.2 Theory

Conservation Law. Chapter Goal. 5.2 Theory Chpter 5 Conservtion Lw 5.1 Gol Our long term gol is to understnd how mny mthemticl models re derived. We study how certin quntity chnges with time in given region (sptil domin). We first derive the very

More information

Lecture 19: Continuous Least Squares Approximation

Lecture 19: Continuous Least Squares Approximation Lecture 19: Continuous Lest Squres Approximtion 33 Continuous lest squres pproximtion We begn 31 with the problem of pproximting some f C[, b] with polynomil p P n t the discrete points x, x 1,, x m for

More information

Transfer Functions. Chapter 5. Transfer Functions. Derivation of a Transfer Function. Transfer Functions

Transfer Functions. Chapter 5. Transfer Functions. Derivation of a Transfer Function. Transfer Functions 5/4/6 PM : Trnfer Function Chpter 5 Trnfer Function Defined G() = Y()/U() preent normlized model of proce, i.e., cn be ued with n input. Y() nd U() re both written in devition vrible form. The form of

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

EE Control Systems LECTURE 8

EE Control Systems LECTURE 8 Coyright F.L. Lewi 999 All right reerved Udted: Sundy, Ferury, 999 EE 44 - Control Sytem LECTURE 8 REALIZATION AND CANONICAL FORMS A liner time-invrint (LTI) ytem cn e rereented in mny wy, including: differentil

More information

Scalable Learning in Stochastic Games

Scalable Learning in Stochastic Games Sclble Lerning in Stochstic Gmes Michel Bowling nd Mnuel Veloso Computer Science Deprtment Crnegie Mellon University Pittsburgh PA, 15213-3891 Abstrct Stochstic gmes re generl model of interction between

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues)

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues) Chter Liner Qudrtic Regultor Problem inimize the cot function J given by J x' Qx u' Ru dt R > Q oitive definite ymmetric with oitive eigenvlue oitive emi definite ymmetric with nonnegtive eigenvlue ubject

More information

USA Mathematical Talent Search Round 1 Solutions Year 21 Academic Year

USA Mathematical Talent Search Round 1 Solutions Year 21 Academic Year 1/1/21. Fill in the circles in the picture t right with the digits 1-8, one digit in ech circle with no digit repeted, so tht no two circles tht re connected by line segment contin consecutive digits.

More information

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS Communiction on Stochtic Anlyi Vol. 7, No. 2 213 245-253 Seril Publiction www.erilpubliction.com LINEA STOCHASTIC DIFFEENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS NAJESS KHALIFA, HUI-HSIUNG KUO,

More information

Near-Bayesian Exploration in Polynomial Time

Near-Bayesian Exploration in Polynomial Time J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning

More information