An Analysis of Actor/Critic Algorithms using Eligibility Traces: Reinforcement Learning with Imperfect Value Functions


Hajime Kimura*
Tokyo Institute of Technology
gen@fe.dis.titech.ac.jp

Shigenobu Kobayashi
Tokyo Institute of Technology
kobayasi@dis.titech.ac.jp

* Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-852, Japan.

Abstract

We present an analysis of actor/critic algorithms in which the actor updates its policy using eligibility traces of the policy parameters. Most of the theoretical results for eligibility traces have been obtained only for the critic's value-iteration algorithms. This paper investigates what the actor's eligibility trace does. The results show that the algorithm is an extension of Williams' REINFORCE algorithms to infinite-horizon reinforcement tasks, and that the critic provides an appropriate reinforcement baseline for the actor. Thanks to the actor's eligibility trace, the actor improves its policy by using a gradient of the actual return, not a gradient of the estimated return in the critic. This enables the agent to learn a fairly good policy even under the condition that the approximated value function in the critic is hopelessly inaccurate for conventional actor/critic algorithms. Also, if an accurate value function is estimated by the critic, the actor's learning is dramatically accelerated in our test cases. The behavior of the algorithm is demonstrated through simulations of a linear quadratic control problem and a pole balancing problem.

1 Introduction

Actor/critic architecture is an adaptive version of policy iteration [Kaelbling et al. 96]. In general, policy iteration alternates two phases: a policy evaluation phase and a policy improvement phase. The actor implements a stochastic policy that maps from a representation of a state to a probability distribution over actions. The critic attempts to estimate the evaluation function for the current policy. The actor improves its control policy using the critic's temporal difference (TD) as an effective reinforcement. In many cases, the policy improvement is executed concurrently with the policy evaluation, because it is not feasible to wait for the policy evaluation to converge. Actor/critic algorithms have been successfully applied to a variety of delayed reinforcement tasks: the ASE/ACE architecture for pole balancing [Barto et al. 83] [Gullapalli 92], RFALCON for pole balancing and for control of a ball-beam system [Lin et al. 96], and a cart-pole swing-up task [Doya 96]. Although convergence results for actor/critic algorithms (e.g. [Williams et al. 90] and [Gullapalli 92]) are weaker than those for value-iteration based algorithms such as Q-learning [Watkins et al. 92], actor/critic algorithms have the following practical advantages.

- It is easy to implement multidimensional continuous action, which is often mixed with discrete action [Gullapalli 92]. Because the actor selects actions by its stochastic policy, the action-selection problems that arise in Q-learning do not exist.
- Q-learning needs to estimate returns for all state-action pairs, whereas the critic needs to estimate only the return of each state.
- Memoryless stochastic policies can be considerably better than memoryless deterministic policies in partially observable Markov decision processes (POMDPs) [Singh 94] [Jaakkola 94] or multi-player games [Littman 94].
- It is easy to incorporate an expert's knowledge into the learning system by applying conventional supervised learning techniques to the actor [Clouse et al. 92].

Eligibility traces are a fundamental mechanism that has been widely used to handle delayed reward [Singh 96].

The traces are also often used to overcome non-Markovian effects [Sutton 95] [Pendrith et al. 96]. In Barto, Sutton and Anderson's ASE/ACE architecture, both the critic and the actor make use of the eligibility trace. Theoretical results for eligibility traces in the context of TD(λ) [Sutton 88] have been obtained, but in actor/critic algorithms the effect of the actor's trace has not been investigated. This paper presents an analysis of an actor/critic algorithm in which the actor improves its policy using eligibility traces of the policy parameters. This may be the first analysis of the actor's eligibility traces.

2 Discounted Reward Criteria

At each discrete time t, the agent observes x_t containing information about its current state, selects action a_t, and then receives an instantaneous reward r_t resulting from the state transition in the environment. In general, the reward and the next state may be random, but their probability distributions are assumed to depend only on x_t and a_t in Markov decision processes (MDPs), in which many reinforcement learning algorithms are studied. The objective of reinforcement learning is to construct a policy that maximizes the agent's performance. A natural performance measure for infinite-horizon tasks is the cumulative discounted reward:

V_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},   (1)

where the discount factor \gamma (0 \le \gamma < 1) specifies the importance of future rewards. V_t is called the actual return; it specifies how good the reward sequence after time t is. With this notation, the goal of learning is to maximize the expected return. In MDPs, the expected return can be defined for all states as:

V_{\pi}(x) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k \mid x_0 = x \right],   (2)

where E_{\pi} denotes the expectation assuming the agent always uses the stationary policy \pi. V_{\pi}(x) is called the value function; it specifies how good the given state x is. In MDPs, the goal of learning is to find an optimal policy that maximizes the value of each state x defined by Equation 2. Although similar value functions can be given in POMDPs, the difficulties in defining them have been pointed out in [Singh 94].

3 Actor/Critic Algorithms

Figures 1 and 2 give an overview of actor/critic algorithms [Sutton 90] [Crites et al. 94]. There are many ways to implement the policy and its updating scheme in the actor. The algorithms for the critic are mostly TD methods. We should notice the following two points: one is that the actor implements a stochastic policy; the other is that the actor improves its policy using the TD-error. This paper especially investigates an algorithm for the actor.

Figure 1: A generic actor/critic framework. The agent consists of an actor (stochastic policy) and a critic (estimated value function \hat{V}(x)); the agent exchanges observations x, actions a and rewards r with the environment, and the critic feeds the TD-error r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) to the actor as a reinforcement.

1. The agent observes x_t in the environment, and the actor executes action a_t according to the current stochastic policy.
2. The critic receives the immediate reward r_t, and then observes the resulting next state x_{t+1}. The critic provides the TD-error as a useful reinforcement feedback to the actor, according to (TD-error) = r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t), where \gamma (0 \le \gamma < 1) is the discount factor and \hat{V}(x) is the value function estimated by the critic.
3. The actor updates the stochastic policy using the TD-error. If (TD-error) > 0, action a_t performed relatively well and its probability should be increased. If (TD-error) < 0, action a_t performed relatively poorly and its probability should be decreased.
4. The critic updates the estimated value function \hat{V}(x) according to TD methods; e.g., the TD(0) algorithm adjusts \hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha (TD-error), where \alpha is the learning rate.
5. Go to step 1.

Figure 2: Main loop of the generic actor/critic algorithm.
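The main loop of Figure 2 is compact enough to sketch directly. The following is a minimal illustrative sketch, not the authors' implementation: it assumes a tabular critic over an already-discretized observation, a hypothetical policy object exposing sample(x) and update(x, a, td_error), and a hypothetical environment with reset() and step(a) returning (next_state, reward, done).

```python
import numpy as np

def actor_critic_loop(env, policy, n_states, gamma=0.9, alpha=0.5, n_steps=10000):
    """Generic actor/critic main loop (Figure 2) with a tabular TD(0) critic.

    `policy` and `env` are assumed interfaces, not objects from the paper.
    """
    V = np.zeros(n_states)                       # critic's estimated value function
    x = env.reset()                              # step 1: observe the initial state
    for t in range(n_steps):
        a = policy.sample(x)                     # actor draws an action from its stochastic policy
        x_next, r, done = env.step(a)            # step 2: immediate reward and next state
        td_error = r + gamma * V[x_next] - V[x]  # TD-error = r_t + gamma*V(x_{t+1}) - V(x_t)
        policy.update(x, a, td_error)            # step 3: actor update driven by the TD-error
        V[x] += alpha * td_error                 # step 4: TD(0) critic update
        x = env.reset() if done else x_next      # step 5: continue (restart on failure)
    return V
```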

4 Adding an Eligibility Trace to the Actor

4.1 Function Approximation for Stochastic Policies

In this paper, \pi(a; W, x) denotes the probability of selecting action a under the policy \pi given the observation x. \pi(a; W, x) is taken to be a probability density function when the set of possible actions is continuous. The policy is represented by a parametric function approximator using the internal variable vector W. The agent can improve the policy by modifying W. For example, W corresponds to synaptic weights when the action-selection probability is represented by a neural network, or W means the weights of rules in classifier systems. The advantage of using the notation of the parametric function \pi() is that the computational restrictions and mechanisms of the agent can be specified simply by the form of the function, so that we can provide a sound theory of learning algorithms for arbitrary types of the actor.

4.2 Details of the Algorithm

Figure 3 specifies the actor/critic algorithm that uses the eligibility trace in the actor. The ASE/ACE system configured for pole balancing [Barto et al. 83] is just an instance of this algorithm. The actor's eligibility in step 3 is the same variable defined in Williams' REINFORCE algorithms [Williams 92]. The eligibility e_i(t) specifies a correlation between the associated policy parameter w_i and the executed action a_t. The eligibility trace D_i(t) is a discounted running average of the eligibility; it accumulates the agent's history. When a positive reinforcement is given, the actor updates W so that the probability of the actions recorded in the history is increased. It means the TD-error at time t affects not only the action a_t but also a_{t-1}, a_{t-2}, and so on. At first glance this idea seems senseless for improving the policy, but it has very interesting features, given in detail later. Note that the algorithm shown in Figure 3 is identical to a stochastic gradient ascent for discounted reward [Kimura et al. 97] when the actor's discount factor \beta equals \gamma and the \hat{V}(x) in the critic equals a constant b for all observations. The actor requires memory to implement W for the policy and to implement D_i for the eligibility trace; the amount of memory for D_i is equal to that for W.

1. The agent observes x_t, and the actor executes action a_t with probability \pi(a_t; W, x_t).
2. The critic receives the immediate reward r_t, and then observes the resulting next state x_{t+1}. The critic provides the TD-error to the actor according to

   (TD-error) = r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t),   (3)

   where \gamma (0 \le \gamma < 1) is the discount factor and \hat{V}(x) is the value function estimated by the critic.
3. The actor updates the stochastic policy using the TD-error according to:

   Eligibility:       e_i(t) = \frac{\partial}{\partial w_i} \ln \pi(a_t; W, x_t)
   Eligibility trace: D_i(t) = e_i(t) + \beta D_i(t-1)
   \Delta w_i(t) = (TD-error) \, D_i(t)
   W \leftarrow W + \alpha_p \Delta W(t),

   where w_i denotes the i-th component of W, e_i and D_i are the associated eligibility and eligibility trace respectively, \beta (0 \le \beta < 1) is a discount factor for the eligibility trace, and \alpha_p is the learning rate for the actor.
4. The critic updates the estimated value function \hat{V}(x) according to TD methods; e.g., the TD(0) algorithm adjusts \hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha (TD-error), where \alpha is the learning rate.
5. Go to step 1.

Figure 3: The actor/critic algorithm adding the eligibility trace to the actor.
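Step 3 of Figure 3 is the only part that differs from the generic loop. Below is a minimal sketch of such an actor update (sampling from \pi is left to the particular policy family); grad_log_pi is a hypothetical user-supplied function returning the eligibility vector e(t) = \partial/\partial W \ln \pi(a_t; W, x_t), and the TD-error is assumed to come from the critic as before.

```python
import numpy as np

class TraceActor:
    """Actor of Figure 3: policy parameters W with an eligibility trace D.

    grad_log_pi is an assumed helper returning d/dW ln pi(a | W, x);
    it is not from the paper.
    """
    def __init__(self, n_params, grad_log_pi, beta=0.9, alpha_p=0.01):
        self.W = np.zeros(n_params)      # policy parameters
        self.D = np.zeros(n_params)      # eligibility trace D_i(t), initially zero
        self.grad_log_pi = grad_log_pi
        self.beta = beta                 # discount factor for the trace
        self.alpha_p = alpha_p           # actor learning rate

    def update(self, x, a, td_error):
        e = self.grad_log_pi(self.W, x, a)           # eligibility e_i(t)
        self.D = e + self.beta * self.D              # D_i(t) = e_i(t) + beta * D_i(t-1)
        self.W += self.alpha_p * td_error * self.D   # Delta w_i(t) = (TD-error) * D_i(t)
```

Setting beta = 0 recovers the conventional actor/critic update of Figure 2, while beta = gamma gives the behavior analyzed in Section 4.3.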

4.3 An Analysis of the Algorithm

Assume that the actor's discount factor \beta equals \gamma, and that D_i(t) = 0 for all t < 0. Then the algorithm shown in Figure 3 updates the policy parameters as:

\sum_{t=0}^{\infty} \Delta w_i(t)
 = \sum_{t=0}^{\infty} \left( r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) \right) \sum_{k=0}^{t} \gamma^{t-k} e_i(k)   (4)
 = \sum_{t=0}^{\infty} e_i(t) \sum_{k=0}^{\infty} \gamma^k \left( r_{t+k} + \gamma \hat{V}(x_{t+k+1}) - \hat{V}(x_{t+k}) \right)
 = \sum_{t=0}^{\infty} e_i(t) \left( V_t - \hat{V}(x_t) \right).   (5)

Equation 5 is given by Equations 1 and 4: exchanging the order of summation attaches to each eligibility e_i(t) the discounted sum of all subsequent TD-errors, in which the \hat{V} terms telescope, leaving the actual return V_t minus \hat{V}(x_t). Here we assume that the statistics of the random variable V_t depend only on the current policy parameters; that is, E\{V_t\} is a deterministic function of W, where E denotes the expectation operator. This assumption may be right if the policy has converged to an equilibrium point. The critic's estimate \hat{V}(x_t) is obviously independent of the action at time t. From the theory of Williams' REINFORCE algorithms [Williams 92], the value V_t and \hat{V}(x_t) in Equation 5 can be seen as a reinforcement signal and a reinforcement baseline respectively, so that we have

E\{ e_i(t) ( V_t - \hat{V}(x_t) ) \} = \frac{\partial}{\partial w_i} E\{ V_t \}.

It says that the algorithm updates the policy parameters statistically in a direction that increases the actual return V_t, not in the direction of a gradient of the estimated value function in the critic. It can also be seen as an extension of reinforcement comparison methods [Sutton et al. 98], where \hat{V}(x_t) corresponds to the reference reward.

From the above analysis and Figure 3, we can explain what the actor's eligibility trace does. At time t, the algorithm reinforces a_t using the TD-error r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) as a temporary expedient; thereafter the actor's eligibility trace replaces \gamma \hat{V}(x_{t+1}) with the actual return (\gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...) in order. The critic does not affect the direction of the average update vector, because the critic works as a reinforcement baseline. Therefore, the actor can improve its policy whether or not the critic is able to learn the value function. If the critic approximates the value function well, the actor's learning is accelerated.

The above results hold under the special condition \beta = \gamma. If \beta = 0, the actor updates W in the direction of the gradient of the approximated value function in the critic. A \beta with 0 < \beta < \gamma interpolates between these two limiting cases. The characteristics of \beta are similar to those of \lambda in TD(\lambda) [Sutton 88] and Q(\lambda)-learning [Peng et al. 94].

5 Preliminary Experiments

This section demonstrates the performance of the algorithm applied to a simple linear control problem.

5.1 A Linear Quadratic Regulator (LQR)

The following linear control problem can serve as a benchmark for delayed reinforcement tasks [Baird 94]. At a given discrete time t, the state of the environment is the real value x_t. The agent chooses a control action a_t that is also a real value. The dynamics of the environment are:

x_{t+1} = x_t + a_t + noise_t,   (6)

where noise_t is drawn from the normal distribution with standard deviation \sigma_{noise} = 0.5. The immediate reward is given by

r_t = - x_t^2 - a_t^2.   (7)

The goal is to maximize the total discounted reward, defined by Equation 1 or 2, for all x. Because the task is a linear quadratic regulator (LQR) problem, it is possible to calculate the optimal control rule. From the discrete-time Riccati equation, the regulator is given by

a_t = - k_1 x_t, where k_1 = \frac{\sqrt{4\gamma^2 + 1} - 1}{2\gamma},   (8)

and the value function is given by V^*(x_t) = - k_2 x_t^2, where k_2 is some positive constant. In this experiment, the set of possible states is constrained to lie in the range [-4, 4]. When the state transition given by Equation 6 does not result in that range, x is truncated. When the agent chooses an action that does not lie in the range [-4, 4], the action executed in the environment is also truncated.
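As a quick sanity check of Equation 8 (a check of ours, not an experiment from the paper), the same gain can be obtained by iterating the scalar discounted Riccati recursion for the dynamics of Equation 6 and the cost of Equation 7; both routes give k_1 ≈ 0.5884 for γ = 0.9, the gain the learned policies are compared against in Section 5.3.

```python
import math

def lqr_optimal_gain(gamma, iters=1000):
    """Optimal gain for x_{t+1} = x_t + a_t, r_t = -(x_t^2 + a_t^2),
    obtained by iterating the scalar discounted Riccati equation."""
    k2 = 0.0                                    # value-function coefficient: V*(x) = -k2 * x^2
    for _ in range(iters):
        k1 = gamma * k2 / (1.0 + gamma * k2)    # greedy feedback gain for the current k2
        k2 = 1.0 + k1 ** 2 + gamma * k2 * (1.0 - k1) ** 2
    return gamma * k2 / (1.0 + gamma * k2)

closed_form = (math.sqrt(4 * 0.9 ** 2 + 1) - 1) / (2 * 0.9)   # Equation 8 with gamma = 0.9
print(lqr_optimal_gain(0.9), closed_form)                     # both approximately 0.5884
```
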
5.2 Implementation for the LQR Problem

The Actor: Remember that the policy \pi(a; W, x) is a probability density function when the set of possible actions is continuous. The normal distribution is a simple multi-parameter distribution for a continuous random variable; it has two parameters, the mean \mu and the standard deviation \sigma. When the policy function is given by Equation 9, the eligibilities of \mu and \sigma are

\pi(a; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( - \frac{(a - \mu)^2}{2\sigma^2} \right),   (9)

e_{\mu} = \frac{a - \mu}{\sigma^2},   (10)

e_{\sigma} = \frac{(a - \mu)^2 - \sigma^2}{\sigma^3}.   (11)

One useful feature of such a Gaussian unit [Williams 92] is that the agent has the potential to control its degree of exploratory behavior. We must draw attention to the fact that the eligibility is apt to diverge when \sigma goes close to 0, because the parameter \sigma occupies the denominators of Equations 10 and 11. The divergence of the eligibility has a bad influence on the algorithm. One way to overcome this problem is to control the step size of the parameter update using \sigma. It is obtained by setting the learning rate proportional to \sigma^2; then the eligibilities can be seen as

e_{\mu} = a - \mu, \qquad e_{\sigma} = \frac{(a - \mu)^2 - \sigma^2}{\sigma}.   (12)

The actor first computes \mu and \sigma deterministically and then draws its output from the normal distribution with mean \mu and standard deviation \sigma. The actor has two internal variables, w_1 and w_2, and computes the values of \mu and \sigma according to

\mu = w_1 x_t, \qquad \sigma = \frac{1}{1 + \exp(-w_2)}.   (13)

Then w_1 can be seen as a feedback gain. The reason for this calculation of \sigma is to guarantee that \sigma stays positive. e_1 and e_2 are the characteristic eligibilities of w_1 and w_2 respectively. From Equation 12, e_1 and e_2 are given by

e_1 = (a_t - \mu) x_t,   (14)
e_2 = \left( (a_t - \mu)^2 - \sigma^2 \right)(1 - \sigma).   (15)

The parameter w_1 is initialized to -0.35 \pm 0.15, and w_2 to 0, i.e., \sigma = 0.5. The learning rate \alpha_p for the actor is held fixed.

The Critic: The critic quantizes the continuous state space (-4 \le x \le 4) into an array of boxes. We have tried two types of quantizing: one discretizes x evenly into 3 boxes, the other into 100 boxes. The critic attempts to store in each box a prediction of the value \hat{V} by using TD(0) [Sutton 88]. The learning rate \alpha for TD(0) is also held fixed.
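Putting Equations 13-15 together with the box critic gives the complete agent for the LQR task. The sketch below is our illustration of that agent, not the authors' code; the step sizes alpha and alpha_p are assumed values (the paper's exact settings are not legible here), and the initialization follows the description above.

```python
import numpy as np

def run_lqr(gamma=0.9, beta=0.9, n_boxes=3, alpha=0.5, alpha_p=0.01,
            n_steps=100_000, seed=0):
    """Actor/critic with an actor-side eligibility trace on the LQR task.

    alpha and alpha_p are assumed step sizes; the rest follows Eqs. 6, 7, 13-15.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-0.5, -0.2)        # feedback-gain parameter, -0.35 +/- 0.15
    w2 = 0.0                            # sigma parameter, so sigma starts at 0.5
    D = np.zeros(2)                     # actor eligibility trace
    V = np.zeros(n_boxes)               # box critic
    box = lambda s: min(n_boxes - 1, int((s + 4.0) / 8.0 * n_boxes))
    x = 0.0
    for _ in range(n_steps):
        mu, sigma = w1 * x, 1.0 / (1.0 + np.exp(-w2))                     # Equation 13
        a = float(np.clip(rng.normal(mu, sigma), -4.0, 4.0))              # truncated action
        x_next = float(np.clip(x + a + rng.normal(0.0, 0.5), -4.0, 4.0))  # Equation 6
        r = -x * x - a * a                                                # Equation 7
        delta = r + gamma * V[box(x_next)] - V[box(x)]                    # TD-error
        e = np.array([(a - mu) * x,                                       # Equation 14
                      ((a - mu) ** 2 - sigma ** 2) * (1.0 - sigma)])      # Equation 15
        D = e + beta * D                                                  # eligibility trace
        w1 += alpha_p * delta * D[0]                                      # actor update
        w2 += alpha_p * delta * D[1]
        V[box(x)] += alpha * delta                                        # TD(0) critic update
        x = x_next
    return w1   # with suitable step sizes this drifts toward -k1 ~ -0.588
```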

5.3 Simulation Results

Figures 4 through 8 show the performance of the trials in the LQR problem with the discount rate \gamma = 0.9. Figure 4 shows the performance of the algorithm in which the critic uses 3 boxes and the actor does not use eligibility traces, i.e., \beta = 0. Figure 6 shows the performance where the critic uses 100 boxes and the actor does not use the traces. The algorithm in Figure 6 converged close to the optimal feedback gain; in contrast, the one in Figure 4 did not. The reason is that the ability of the function approximation (3 boxes) is insufficient for learning the policy without the trace. Figure 5 shows the performance where the critic uses 3 boxes and the actor uses the trace with \beta = 0.9. It achieved much better results, in terms of both learning efficiency and the quality of the mean of the converged policy, than the algorithm in Figure 4 or 6. Obviously, the actor's eligibility trace is responsible for these two advantages. The reason for the learning efficiency in this case may be that the actor's trace accelerates the propagation of information. The better quality of the policy is clearly owing to the property that the actor improves its policy by using a gradient of the actual return, shown in Section 4.3. Therefore, the algorithm using the trace was not influenced by the critic's ability in terms of the quality of the mean of the policy. We can also see this property in Figure 8, but its deviation is considerably large. Figure 9 shows the value function defined by Equations 1 and 7 over the actor's parameter space (the feedback gain and \sigma). The performance is fairly flat around the optimal solution; this is the reason that the deviation of the policy is large in Figure 8. This example makes it clear that the critic controls the step size of the actor's backups so that the step size becomes smaller around the local maximum. The algorithm in Figure 7 achieved the best results in terms of both the mean and the deviation of the policy; the reason may be the critic's nearly perfect value estimation.

In this preliminary experiment, we can see that the algorithm using the actor's eligibility trace performed better than the algorithm without the trace, with the same computational resources. Here we presented the results of the actor/critic that uses only TD(0) in the critic, but we have also experimented with TD(\lambda) where \lambda > 0. Roughly speaking, performance becomes poor when \lambda approaches 1. It follows that the eligibility trace in the critic cannot make up for the critic's poor ability of function approximation. The details of the experiments using TD(\lambda) will appear in other papers.

Figure 4: The average performance without the actor's eligibility trace (\beta = 0). The critic uses 3 boxes.

Figure 5: The average performance using the actor's trace, \beta = 0.9. The critic uses 3 boxes.

Figure 6: The average performance without the actor's trace (\beta = 0). The critic uses 100 boxes.

Figure 7: The average performance using the actor's trace, \beta = 0.9. The critic uses 100 boxes.

Figure 8: The average performance, \beta = 0.9. The agent learns without the critic, i.e., the critic provides \hat{V}(x) = 0 for all x.

Figure 9: Value function over the actor's parameter space in the LQR problem, where \gamma = 0.9. It is fairly flat around the optimum point, k_1 = 0.5884.

6 Applying to a Cart-Pole Problem

The behavior of this algorithm is demonstrated through a computer simulation of a cart-pole control task, which is a multi-dimensional, nonlinear, non-quadratic problem. We modified the cart-pole problem described in [Barto et al. 83] so that the action is taken to be continuous.

Figure 10: The cart-pole problem. A force F is applied to the cart at position x, and the pole makes an angle \theta with the vertical.

6.1 Problem Formulation

The dynamics of the cart-pole system are modeled by

\ddot{\theta} = \frac{ g \sin\theta + \cos\theta \left( \dfrac{ -F - m \ell \dot{\theta}^2 \sin\theta + \mu_c \,\mathrm{sgn}(\dot{x}) }{ M + m } \right) - \dfrac{ \mu_p \dot{\theta} }{ m \ell } }{ \ell \left( \dfrac{4}{3} - \dfrac{ m \cos^2\theta }{ M + m } \right) },

\ddot{x} = \frac{ F + m \ell \left( \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right) - \mu_c \,\mathrm{sgn}(\dot{x}) }{ M + m },

where M = 1.0 (kg) denotes the mass of the cart, m = 0.1 (kg) is the mass of the pole, 2\ell = 1.0 (m) is the length of the pole, g = 9.8 (m/sec^2) is the acceleration of gravity, F (N) denotes the force applied to the cart's center of mass, \mu_c = 0.0005 is the coefficient of friction of the cart, and \mu_p = 0.000002 is the coefficient of friction of the pole. In this simulation, we use a discrete-time system to approximate these equations, where \Delta t = 0.02 sec. At each discrete time step, the agent observes (x, \dot{x}, \theta, \dot{\theta}) and controls the force F. The agent can execute actions in an arbitrary range, but the possible action in the cart-pole system is constrained to lie in the range [-20, 20] (N). When the agent chooses an action that does not lie in that range, the action executed in the system is truncated. The system begins with (x, \dot{x}, \theta, \dot{\theta}) = (0, 0, 0, 0). The system fails and receives a reward (penalty) signal of -1 when the pole falls over \pm 12 degrees or the cart runs over the bounds of its track (-2.4 \le x \le 2.4); the cart-pole system is then reset to the initial state.

6.2 Details of the Agent

In this experiment, the actor adopts an implementation similar to that shown in Equations 9 and 12. The state space is constrained to the range (x, \dot{x}, \theta, \dot{\theta}) \in (\pm 2.4 m, \pm 2 m/sec, \pm 0.21 rad, \pm 1.5 rad/sec). The actor has five internal variables w_1, ..., w_5, and computes \mu and \sigma according to

\mu = w_1 \frac{x}{2.4} + w_2 \frac{\dot{x}}{2} + w_3 \frac{\theta}{0.21} + w_4 \frac{\dot{\theta}}{1.5}, \qquad \sigma = \frac{1}{1 + \exp(-w_5)}.   (16)

Similarly to Equations 14 and 15, the eligibilities e_1, ..., e_5 are given by

e_1 = (a_t - \mu) \frac{x}{2.4}, \quad e_2 = (a_t - \mu) \frac{\dot{x}}{2}, \quad e_3 = (a_t - \mu) \frac{\theta}{0.21}, \quad e_4 = (a_t - \mu) \frac{\dot{\theta}}{1.5}, \quad e_5 = \left( (a_t - \mu)^2 - \sigma^2 \right)(1 - \sigma).

The critic discretizes the normalized state space evenly into boxes, and attempts to store \hat{V} in each box by using the TD(0) algorithm [Sutton 88]. The parameters are set to \gamma = 0.95 and \alpha = 0.5; the actor's learning rate \alpha_p is held fixed.
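For reference, the discrete-time approximation of the dynamics above can be written directly from the equations; the sketch below is our Euler-integration reading with \Delta t = 0.02 sec and the \pm 12 degree / \pm 2.4 m failure test, not code from the paper.

```python
import math

# Physical constants from Section 6.1
M, m, L = 1.0, 0.1, 0.5            # cart mass (kg), pole mass (kg), pole half-length (m)
G, MU_C, MU_P = 9.8, 0.0005, 0.000002
DT = 0.02                          # integration time step (sec)

def step(state, F):
    """One Euler step of the cart-pole dynamics; F is the applied force in N."""
    x, x_dot, th, th_dot = state
    F = max(-20.0, min(20.0, F))   # truncate the executed action
    sin, cos = math.sin(th), math.cos(th)
    temp = (-F - m * L * th_dot ** 2 * sin + MU_C * math.copysign(1.0, x_dot)) / (M + m)
    th_acc = (G * sin + cos * temp - MU_P * th_dot / (m * L)) / \
             (L * (4.0 / 3.0 - m * cos ** 2 / (M + m)))
    x_acc = (F + m * L * (th_dot ** 2 * sin - th_acc * cos)
             - MU_C * math.copysign(1.0, x_dot)) / (M + m)
    x, x_dot = x + DT * x_dot, x_dot + DT * x_acc
    th, th_dot = th + DT * th_dot, th_dot + DT * th_acc
    failed = abs(th) > math.radians(12.0) or abs(x) > 2.4
    reward = -1.0 if failed else 0.0
    return (x, x_dot, th, th_dot), reward, failed
```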

6.3 Simulation Results

Figure 11 shows the performance of three learning algorithms in which the policy representation is the same. The actor/critic algorithm using the actor's trace achieved the best results. In contrast, the algorithm without the trace could not learn the control policy because of the poor ability of function approximation in the critic.

Figure 11: The average performance of three algorithms: actor/critic using the actor's eligibility trace, actor only using the actor's eligibility trace, and actor/critic without the actor's eligibility trace. The vertical axis is time steps until failure; the horizontal axis is trials. The critic uses a grid of boxes. A trial means an attempt from the initial state to a failure.

7 Discussion

Representation of Policies: First of all, actor/critic algorithms should have sufficient ability to approximate policies. If that is satisfied, use of the actor's eligibility trace (\beta = \gamma) makes it possible to learn an acceptable policy at less cost than increasing the critic's ability of function approximation, in our test cases. The reason is that the policy function representation would require less memory than the representation of the state-action value function in many cases.

Controlling the Step Size of Backups: It is analytically shown in Section 4.3 that the critic provides an appropriate reinforcement baseline to the actor. The adaptive baseline controls the step size of the actor's backups so that the step size becomes smaller around the local maximum. This property contributes to the better learning efficiency and the suppression of harmful drift of the policy shown in the experiments.

Overcoming Non-Markovian Effects: There are many ways to implement the critic's learning scheme. [Peng et al. 94] and [Sutton 95] pointed out that increasing \lambda makes TD(\lambda) less sensitive to non-Markovian effects. The actor's eligibility traces are also useful in getting over non-Markovian problems [Kimura et al. 97]. Therefore, the combination of TD(\lambda) and the actor's eligibility trace will be more robust in non-Markovian problems.

Combining with Efficient DP-based Methods: If the hidden state is relatively small in the state space, the agent may perform well when efficient DP-based algorithms are adopted for the critic. The DP-based algorithms accelerate the actor's learning in completely observable states, and the actor's stochastic policy and its trace (\beta = \gamma) would make up for the non-Markovian effects owing to the hidden state or the function approximation.

8 Conclusions

This paper presented an analysis of actor/critic algorithms in which the actor updates its policy using the eligibility trace of the policy parameters. The results show that when the discount rate of the value function equals the discount factor of the actor's trace, the actor improves its policy by using a gradient of the actual return, not a gradient of the estimated return in the critic. The critic then provides an adaptive reinforcement baseline to the actor, controlling the step size of the actor's backups. This enables the agent to learn a fairly good policy even under the condition that the approximated value function in the critic is hopelessly imperfect. The behavior is demonstrated through simulations showing that the trace contributes to the learning efficiency and the suppression of undesirable drifts of the policy. Analysis of the algorithm in non-Markovian environments is future work.

Acknowledgements

We would like to thank Andrew Barto, Jing Peng, Jeff Schneider, Satinder Singh, Richard Sutton, and the reviewers for many helpful comments and suggestions.

References

[Baird 94] Baird, L. C.: Reinforcement Learning in Continuous Time: Advantage Updating, Proceedings of the IEEE International Conference on Neural Networks, Vol. IV (1994).

[Barto et al. 83] Barto, A. G., Sutton, R. S. and Anderson, C. W.: Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, September/October 1983.

[Clouse et al. 92] Clouse, J. A. & Utgoff, P. E.: A Teaching Method for Reinforcement Learning, Proceedings of the 9th International Conference on Machine Learning (1992).

[Crites et al. 94] Crites, R. H. and Barto, A. G.: An Actor/Critic Algorithm that is Equivalent to Q-Learning, Advances in Neural Information Processing Systems 7 (1994).

[Doya 96] Doya, K.: Efficient Nonlinear Control with Actor-Tutor Architecture, Advances in Neural Information Processing Systems 9 (1996).

[Gullapalli 92] Gullapalli, V.: Reinforcement Learning and Its Application to Control, PhD Thesis, University of Massachusetts, Amherst, COINS Technical Report 92 (1992).

[Jaakkola 94] Jaakkola, T., Singh, S. P., & Jordan, M. I.: Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, Advances in Neural Information Processing Systems 7 (1994).

[Kaelbling et al. 96] Kaelbling, L. P., Littman, M. L., & Moore, A. W.: Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, Vol. 4, pp. 237-277 (1996).

[Kimura et al. 95] Kimura, H., Yamamura, M., & Kobayashi, S.: Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward, Proceedings of the 12th International Conference on Machine Learning (1995).

[Kimura et al. 97] Kimura, H., Miyazaki, K. and Kobayashi, S.: Reinforcement Learning in POMDPs with Function Approximation, Proceedings of the 14th International Conference on Machine Learning, pp. 152-160 (1997).

[Lin et al. 96] Lin, C. J. and Lin, C. T.: Reinforcement Learning for an ART-Based Fuzzy Adaptive Learning Control Network, IEEE Transactions on Neural Networks, Vol. 7, No. 3 (1996).

[Littman 94] Littman, M. L.: Markov Games as a Framework for Multi-Agent Reinforcement Learning, Proceedings of the 11th International Conference on Machine Learning (1994).

[Pendrith et al. 96] Pendrith, M. D. & Ryan, M. R. K.: Actual Return Reinforcement Learning versus Temporal Differences: Some Theoretical and Experimental Results, Proceedings of the 13th International Conference on Machine Learning, pp. 373-381 (1996).

[Peng et al. 94] Peng, J. and Williams, R. J.: Incremental Multi-Step Q-Learning, Proceedings of the 11th International Conference on Machine Learning (1994).

[Singh 94] Singh, S. P., Jaakkola, T., & Jordan, M. I.: Learning Without State-Estimation in Partially Observable Markovian Decision Processes, Proceedings of the 11th International Conference on Machine Learning (1994).

[Singh 96] Singh, S. P., & Sutton, R. S.: Reinforcement Learning with Replacing Eligibility Traces, Machine Learning 22 (1996).

[Sutton 88] Sutton, R. S.: Learning to Predict by the Methods of Temporal Differences, Machine Learning 3 (1988).

[Sutton 90] Sutton, R. S.: Reinforcement Learning Architectures for Animats, Proceedings of the 1st International Conference on Simulation of Adaptive Behavior (1990).

[Sutton 95] Sutton, R. S.: TD Models: Modeling the World at a Mixture of Time Scales, Proceedings of the 12th International Conference on Machine Learning (1995).

[Sutton et al. 98] Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press (1998).

[Watkins et al. 92] Watkins, C. J. C. H., & Dayan, P.: Technical Note: Q-Learning, Machine Learning 8 (1992).

[Williams et al. 90] Williams, R. J. & Baird, L. C.: A Mathematical Analysis of Actor-Critic Architectures for Learning Optimal Controls through Incremental Dynamic Programming, Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, Center for Systems Science, Dunham Laboratory, Yale University, New Haven (1990).
[Williams 92] Williams, R. J.: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning 8 (1992).


More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

On-line Adaptive Optimal Timing Control of Switched Systems

On-line Adaptive Optimal Timing Control of Switched Systems On-line Adapive Opimal Timing Conrol of Swiched Sysems X.C. Ding, Y. Wardi and M. Egersed Absrac In his paper we consider he problem of opimizing over he swiching imes for a muli-modal dynamic sysem when

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach 1 Decenralized Sochasic Conrol wih Parial Hisory Sharing: A Common Informaion Approach Ashuosh Nayyar, Adiya Mahajan and Demoshenis Tenekezis arxiv:1209.1695v1 [cs.sy] 8 Sep 2012 Absrac A general model

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Tracking. Announcements

Tracking. Announcements Tracking Tuesday, Nov 24 Krisen Grauman UT Ausin Announcemens Pse 5 ou onigh, due 12/4 Shorer assignmen Auo exension il 12/8 I will no hold office hours omorrow 5 6 pm due o Thanksgiving 1 Las ime: Moion

More information

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits EEE25 ircui Analysis I Se 4: apaciors, Inducors, and Firs-Order inear ircuis Shahriar Mirabbasi Deparmen of Elecrical and ompuer Engineering Universiy of Briish olumbia shahriar@ece.ubc.ca Overview Passive

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information