A Generalized Path Integral Control Approach to Reinforcement Learning


Journal of Machine Learning Research 11 (2010). Submitted 1/10; Revised 7/10; Published 11/10

Evangelos A. Theodorou (ETHEODOR@USC.EDU), Jonas Buchli (JONAS@BUCHLI.ORG), Stefan Schaal* (SSCHAAL@USC.EDU)
Department of Computer Science, University of Southern California, Los Angeles, CA, USA

Editor: Daniel Lee

Abstract

With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI²) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs.

Keywords: stochastic optimal control, reinforcement learning, parameterized policies

1. Introduction

While reinforcement learning (RL) is among the most general frameworks of learning control to create truly autonomous learning systems, its scalability to high-dimensional continuous state-action systems, for example, humanoid robots, remains problematic. Classical value-function based methods with function approximation offer one possible approach, but function approximation under the non-stationary iterative learning process of the value-function remains difficult when one exceeds about 5-10 dimensions. Alternatively, direct policy learning from trajectory roll-outs has recently made significant progress (Peters, 2007), but can still become numerically brittle and full of open tuning parameters in complex learning problems.

(* Also at ATR Computational Neuroscience Laboratories, Kyoto, Japan.)

© 2010 Evangelos Theodorou, Jonas Buchli and Stefan Schaal.

In new developments, RL researchers have started to combine the well-developed methods from statistical learning and empirical inference with classical RL approaches in order to minimize tuning parameters and numerical problems, such that ultimately more efficient algorithms can be developed that scale to significantly more complex learning systems (Dayan and Hinton, 1997; Kober and Peters, 2008; Peters and Schaal, 2008c; Toussaint and Storkey, 2006; Ghavamzadeh and Yaakov, 2007; Deisenroth et al., 2009; Vlassis et al., 2009; Jetchev and Toussaint, 2009).

In the spirit of these latter ideas, this paper addresses a new method of probabilistic reinforcement learning derived from the framework of stochastic optimal control and path integrals, based on the original work of Kappen (2007) and Broek et al. (2008). As will be detailed in the sections below, this approach makes an appealing theoretical connection between value function approximation using the stochastic HJB equations and direct policy learning by approximating a path integral, that is, by solving a statistical inference problem from sample roll-outs. The resulting algorithm, called Policy Improvement with Path Integrals (PI²), takes on a surprisingly simple form, has no open algorithmic tuning parameters besides the exploration noise, and has numerically robust performance in high dimensional learning problems. It also makes an interesting connection to previous work on RL based on probability matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c; Kober and Peters, 2008) and motivates why probability matching algorithms can be successful.

This paper is structured into several major sections. Section 2 addresses the theoretical development of stochastic optimal control with path integrals. This is a fairly theoretical section; for a quick reading, we would recommend Section 2.1 for our basic notation, and Table 1 for the final results. Exposing the reader to a sketch of the details of the derivations opens the possibility to derive path integral optimal control solutions for other dynamical systems than the one we address in Section 2.1. The main steps of the theoretical development include:

- Problem formulation of stochastic optimal control with the stochastic Hamilton-Jacobi-Bellman (HJB) equation
- The transformation of the HJB into a linear PDE
- The generalized path integral formulation for control systems with controlled and uncontrolled differential equations
- General derivation of optimal controls for the path integral formalism
- Path integral optimal control applied to special cases of control systems

Section 3 relates path integral optimal control to reinforcement learning. Several main issues are addressed:

- Reinforcement learning with parameterized policies
- Dynamic Movement Primitives (DMPs) as a special case of parameterized policies, which matches the problem formulation of path integral optimal control
- Derivation of Policy Improvement with Path Integrals (PI²), which is an application of path integral optimal control to DMPs

Section 4 discusses related work.

Section 5 illustrates several applications of PI² to control problems in robotics. Section 6 addresses several important issues and characteristics of RL with PI².

2. Stochastic Optimal Control with Path Integrals

The goal in the stochastic optimal control framework is to control a stochastic dynamical system while minimizing a performance criterion. Therefore, stochastic optimal control can be thought of as a constrained optimization problem in which the constraints correspond to the stochastic dynamical system. The analysis and derivations of stochastic optimal control and path integrals in the next sections rely on the Bellman Principle of optimality (Bellman and Kalaba, 1964) and the HJB equation.

2.1 Stochastic Optimal Control Definition and Notation

For our technical developments, we will largely use a control theoretic notation from trajectory-based optimal control, however, with an attempt to have as much overlap as possible with the standard RL notation (Sutton and Barto, 1998). Let us define a finite horizon cost function for a trajectory τ_i (which can also be a piece of a trajectory) starting at time t_i in state x_{t_i} and ending at time t_N:

R(τ_i) = φ_{t_N} + ∫_{t_i}^{t_N} r_t dt,    (1)

with φ_{t_N} = φ(x_{t_N}) denoting a terminal reward at time t_N and r_t denoting the immediate cost at time t. (If we need to emphasize a particular time, we denote it by t_i, which also simplifies a transition to discrete time notation later. We use t without subscript when no emphasis is needed on when this time slice occurs, t_0 for the start of a trajectory, and t_N for the end of a trajectory.) In stochastic optimal control (Stengel, 1994), the goal is to find the controls u_t that minimize the value function:

V(x_{t_i}) = V_{t_i} = min_{u_{t_i:t_N}} E_{τ_i}[ R(τ_i) ],    (2)

where the expectation E_{τ_i}[·] is taken over all trajectories starting at x_{t_i}. We consider the rather general class of control systems:

ẋ_t = f(x_t, t) + G(x_t)( u_t + ε_t ) = f_t + G_t ( u_t + ε_t ),    (3)

with x_t ∈ R^n denoting the state of the system, G_t = G(x_t) ∈ R^{n×p} the control matrix, f_t = f(x_t) ∈ R^n the passive dynamics, u_t ∈ R^p the control vector and ε_t ∈ R^p Gaussian noise with variance Σ_ε. As immediate cost we consider

r_t = r(x_t, u_t, t) = q_t + (1/2) u_t^T R u_t,    (4)

where q_t = q(x_t, t) is an arbitrary state-dependent cost function, and R is the positive semi-definite weight matrix of the quadratic control cost. The stochastic HJB equation (Stengel, 1994; Fleming and Soner, 2006) associated with this stochastic optimal control problem is expressed as follows:

−∂_t V_t = min_{u} ( r_t + (∇_x V_t)^T F_t + (1/2) trace( (∇_{xx} V_t) G_t Σ_ε G_t^T ) ),    (5)

where F_t is defined as F_t = f(x_t, t) + G(x_t) u_t.
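For concreteness, the finite horizon cost (1) with the immediate cost (4) can be evaluated on a discretized roll-out. The following is a minimal sketch under an Euler discretization; the helper name and array shapes are our own illustration, not part of the paper:

```python
import numpy as np

def trajectory_cost(q_t, u_t, R, phi_tN, dt):
    """Discretized version of R(tau) = phi_tN + int r_t dt with
    r_t = q_t + 0.5 * u_t^T R u_t, cf. Equations (1) and (4)."""
    control_cost = 0.5 * np.einsum('ti,ij,tj->t', u_t, R, u_t)
    return phi_tN + np.sum((q_t + control_cost) * dt)

# Example: 100 time steps, 2-D controls, identity control cost weight R
q = np.zeros(100)                    # state-dependent cost q_t
u = 0.1 * np.ones((100, 2))          # control sequence u_t
print(trajectory_cost(q, u, np.eye(2), phi_tN=0.0, dt=0.01))
```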

To find the minimum, the cost function (4) is inserted into (5) and the gradient of the expression inside the parenthesis is taken with respect to the controls u and set to zero. The corresponding optimal control is given by the equation:

u(x_t) = u_t = −R^{−1} G_t^T (∇_{x_t} V_t).

Substitution of the optimal control above into the stochastic HJB (5) results in the following nonlinear and second order Partial Differential Equation (PDE):

−∂_t V_t = q_t + (∇_x V_t)^T f_t − (1/2) (∇_x V_t)^T G_t R^{−1} G_t^T (∇_x V_t) + (1/2) trace( (∇_{xx} V_t) G_t Σ_ε G_t^T ).

The ∇_x and ∇_{xx} symbols refer to the Jacobian and Hessian, respectively, of the value function with respect to the state x, while ∂_t is the partial derivative with respect to time. For notational compactness, we will mostly use subscripted symbols to denote time and state dependencies, as introduced in the equations above.

2.2 Transformation of HJB into a Linear PDE

In order to find a solution to the PDE above, we use an exponential transformation of the value function:

V_t = −λ log Ψ_t.

Given this logarithmic transformation, the partial derivatives of the value function with respect to time and state are expressed as follows:

∂_t V_t = −λ (1/Ψ_t) ∂_t Ψ_t,
∇_x V_t = −λ (1/Ψ_t) ∇_x Ψ_t,
∇_{xx} V_t = λ (1/Ψ_t²) ∇_x Ψ_t (∇_x Ψ_t)^T − λ (1/Ψ_t) ∇_{xx} Ψ_t.

Inserting the logarithmic transformation and the derivatives of the value function, we obtain:

(λ/Ψ_t) ∂_t Ψ_t = q_t − (λ/Ψ_t) (∇_x Ψ_t)^T f_t − (λ²/(2Ψ_t²)) (∇_x Ψ_t)^T G_t R^{−1} G_t^T (∇_x Ψ_t) + (1/2) trace(Γ),    (6)

where the term Γ is expressed as:

Γ = ( (λ/Ψ_t²) ∇_x Ψ_t (∇_x Ψ_t)^T − (λ/Ψ_t) ∇_{xx} Ψ_t ) G_t Σ_ε G_t^T.

The trace of Γ is therefore:

trace(Γ) = (λ/Ψ_t²) trace( (∇_x Ψ_t)^T G_t Σ_ε G_t^T (∇_x Ψ_t) ) − (λ/Ψ_t) trace( (∇_{xx} Ψ_t) G_t Σ_ε G_t^T ).    (7)

Comparing the quadratic ∇_x Ψ_t terms in (6) and (7), one can recognize that these terms will cancel under the assumption λ R^{−1} = Σ_ε, which implies the simplification:

λ G_t R^{−1} G_t^T = G_t Σ_ε G_t^T = Σ(x_t) = Σ_t.    (8)

The intuition behind this assumption (cf. also Kappen, 2007; Broek et al., 2008) is that, since the weight control matrix R is inverse proportional to the variance of the noise, a high variance control input implies cheap control cost, while small variance control inputs have high control cost. From a control theoretic standpoint such a relationship makes sense due to the fact that under a large disturbance (= high variance) significant control authority is required to bring the system back to a desirable state. This control authority can be achieved with a correspondingly low control cost in R.

With this simplification, (6) reduces to the following form:

−∂_t Ψ_t = −(1/λ) q_t Ψ_t + f_t^T (∇_x Ψ_t) + (1/2) trace( (∇_{xx} Ψ_t) G_t Σ_ε G_t^T ),    (9)

with boundary condition Ψ_{t_N} = exp( −(1/λ) φ_{t_N} ).
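The cancellation that leads from the nonlinear PDE to (9) can be checked symbolically. Below is a small sketch for the scalar (1-D) case using sympy; the symbol names are ours, and the check confirms that, under λR⁻¹ = Σ_ε, the transformed nonlinear PDE and the linear PDE (9) agree up to the factor −λ/Ψ:

```python
import sympy as sp

x, t = sp.symbols('x t')
lam, q, f, G, R = sp.symbols('lambda q f G R', positive=True)
Psi = sp.Function('Psi')(x, t)
V = -lam * sp.log(Psi)          # exponential transformation of the value function
Sigma_eps = lam / R             # assumption (8), scalar case

# Residual of the nonlinear PDE after substituting V = -lambda*log(Psi)
res_nonlinear = -sp.diff(V, t) - (q + sp.diff(V, x) * f
                                  - sp.Rational(1, 2) * sp.diff(V, x)**2 * G**2 / R
                                  + sp.Rational(1, 2) * sp.diff(V, x, 2) * G**2 * Sigma_eps)

# Residual of the linear Chapman-Kolmogorov PDE (9)
res_linear = -sp.diff(Psi, t) - (-(q / lam) * Psi + f * sp.diff(Psi, x)
                                 + sp.Rational(1, 2) * G**2 * Sigma_eps * sp.diff(Psi, x, 2))

# The two residuals agree up to the factor -lambda/Psi; this prints 0
print(sp.simplify(res_nonlinear + (lam / Psi) * res_linear))
```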

The partial differential equation (PDE) in (9) corresponds to the so-called Chapman-Kolmogorov PDE, which is of second order and linear. Analytical solutions of (9) cannot be found in general for arbitrary nonlinear systems and cost functions. However, there is a connection between solutions of PDEs and their representation as stochastic differential equations (SDEs), which is mathematically expressed by the Feynman-Kac formula (Øksendal, 2003; Yong, 1997). The Feynman-Kac formula (see Appendix B) can be used to find distributions of random processes which solve certain SDEs, as well as to propose numerical methods for solving certain PDEs. Applying the Feynman-Kac theorem, the solution of (9) is:

Ψ_{t_i} = E_{τ_i}( Ψ_{t_N} e^{ −∫_{t_i}^{t_N} (1/λ) q_t dt } ) = E_{τ_i}[ exp( −(1/λ) φ_{t_N} − (1/λ) ∫_{t_i}^{t_N} q_t dt ) ].    (10)

Thus, we have transformed our stochastic optimal control problem into the approximation problem of a path integral. With a view towards a discrete time approximation, which will be needed for numerical implementations, the solution (10) can be formulated as:

Ψ_{t_i} = lim_{dt→0} ∫ p(τ_i | x_{t_i}) exp( −(1/λ) [ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt ] ) dτ_i,    (11)

where τ_i = (x_{t_i}, ..., x_{t_N}) is a sample path (or trajectory piece) starting at state x_{t_i} and the term p(τ_i | x_{t_i}) is the probability of sample path τ_i conditioned on the start state x_{t_i}. Since Equation (11) provides the exponential cost to go Ψ_{t_i} in state x_{t_i}, the integration above is taken with respect to sample paths τ_i = (x_{t_i}, x_{t_{i+1}}, ..., x_{t_N}). The differential term dτ_i is defined as dτ_i = (dx_{t_i}, ..., dx_{t_N}). Evaluation of the stochastic integral in (11) requires the specification of p(τ_i | x_{t_i}), which is the topic of our analysis in the next section.
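Equation (10) suggests a direct Monte Carlo approximation of Ψ: simulate the uncontrolled diffusion forward and average the exponentiated costs. A minimal sketch for a scalar system is shown below; the functions f, q, and phi and all numerical values are illustrative assumptions:

```python
import numpy as np

def psi_feynman_kac(x0, f, q, phi, lam, sigma_eps, dt, N, n_samples=10000,
                    rng=np.random.default_rng(0)):
    """Monte Carlo estimate of Psi_{t_i} in (10) for a scalar system
    dx = f(x) dt + sqrt(dt) * eps, with eps ~ N(0, sigma_eps^2)."""
    x = np.full(n_samples, x0, dtype=float)
    cost = np.zeros(n_samples)
    for _ in range(N):
        cost += q(x) * dt            # accumulate int q_t dt along each path
        x += f(x) * dt + np.sqrt(dt) * sigma_eps * rng.standard_normal(n_samples)
    return np.mean(np.exp(-(phi(x) + cost) / lam))

# Example: passive dynamics f(x) = -x with quadratic state and terminal costs
psi = psi_feynman_kac(x0=1.0, f=lambda x: -x, q=lambda x: 0.5 * x**2,
                      phi=lambda x: 0.5 * x**2, lam=1.0, sigma_eps=1.0,
                      dt=0.01, N=100)
print(psi)
```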

2.3 Generalized Path Integral Formulation

To develop our algorithms, we will need to consider a more general development of the path integral approach to stochastic optimal control than presented in Kappen (2007) and Broek et al. (2008). In particular, we have to address that in many stochastic dynamical systems, the control transition matrix G_t is state dependent and its structure depends on the partition of the state into directly and non-directly actuated parts. Since only some of the states are directly controlled, the state vector is partitioned into x = [x^{(m)T} x^{(c)T}]^T with x^{(m)} ∈ R^k the non-directly actuated part and x^{(c)} ∈ R^l the directly actuated part. Subsequently, the passive dynamics term and the control transition matrix can be partitioned as f_t = [f_t^{(m)T} f_t^{(c)T}]^T with f^{(m)} ∈ R^k, f^{(c)} ∈ R^l, and G_t = [0_{k×p} G_t^{(c)T}]^T with G_t^{(c)} ∈ R^{l×p}. The discretized state space representation of such systems is given as:

x_{t_{i+1}} = x_{t_i} + f_{t_i} dt + G_{t_i} ( u_{t_i} dt + √dt ε_{t_i} ),

or, in partitioned vector form:

[ x^{(m)}_{t_{i+1}} ; x^{(c)}_{t_{i+1}} ] = [ x^{(m)}_{t_i} ; x^{(c)}_{t_i} ] + [ f^{(m)}_{t_i} ; f^{(c)}_{t_i} ] dt + [ 0_{k×p} ; G^{(c)}_{t_i} ] ( u_{t_i} dt + √dt ε_{t_i} ).    (12)

Essentially the stochastic dynamics are partitioned into controlled equations in which the state x^{(c)}_{t_{i+1}} is directly actuated and uncontrolled equations in which the state x^{(m)}_{t_{i+1}} is not directly actuated. Since stochasticity is only added in the directly actuated terms (c) of (12), we can develop p(τ_i | x_{t_i}) as follows:

p(τ_i | x_{t_i}) = p(τ_{i+1} | x_{t_i}) = p( x_{t_N}, ..., x_{t_{i+1}} | x_{t_i} ) = Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j} ),

where we exploited the fact that the start state x_{t_i} of a trajectory is given and does not contribute to its probability. For systems where the control has lower dimensionality than the state (12), the transition probabilities p( x_{t_{j+1}} | x_{t_j} ) are factorized as follows:

p( x_{t_{j+1}} | x_{t_j} ) = p( x^{(m)}_{t_{j+1}} | x_{t_j} ) p( x^{(c)}_{t_{j+1}} | x_{t_j} )
  = p( x^{(m)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} ) p( x^{(c)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} )
  ∝ p( x^{(c)}_{t_{j+1}} | x_{t_j} ),    (13)

where we have used the fact that p( x^{(m)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} ) is the Dirac delta function, since x^{(m)}_{t_{j+1}} can be computed deterministically from x^{(m)}_{t_j}, x^{(c)}_{t_j}. For all practical purposes (the delta functions will all integrate to 1 in the path integral), the transition probability of the stochastic dynamics is reduced to the transition probability of the directly actuated part of the state:

p(τ_i | x_{t_i}) = Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j} ) ∝ Π_{j=i}^{N−1} p( x^{(c)}_{t_{j+1}} | x_{t_j} ).    (14)

Since we assume that the noise ε is zero mean Gaussian distributed with variance Σ_ε, where Σ_ε ∈ R^{l×l}, the transition probability of the directly actuated part of the state is defined as (for notational simplicity, we write weighted square norms, or Mahalanobis distances, as v^T M v = ‖v‖²_M):

p( x^{(c)}_{t_{j+1}} | x_{t_j} ) = ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/2) ‖ x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j} − f^{(c)}_{t_j} dt ‖²_{Σ_{t_j}^{−1}} ),    (15)

where the covariance Σ_{t_j} ∈ R^{l×l} is expressed as Σ_{t_j} = G^{(c)}_{t_j} Σ_ε G^{(c)T}_{t_j} dt. Combining (15) and (14) results in the probability of a path expressed as:

p(τ_i | x_{t_i}) ∝ Π_{j=i}^{N−1} ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/2) Σ_{j=i}^{N−1} ‖ x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j} − f^{(c)}_{t_j} dt ‖²_{Σ_{t_j}^{−1}} ).

Finally, we incorporate the assumption (8) about the relation between the control cost and the variance of the noise, which needs to be adjusted to the controlled space as Σ_{t_j} = G^{(c)}_{t_j} Σ_ε G^{(c)T}_{t_j} dt = λ G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j} dt = λ H_{t_j} dt, with H_{t_j} = G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j}. Thus, we obtain:

p(τ_i | x_{t_i}) ∝ Π_{j=i}^{N−1} ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/(2λ)) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt ).

With this formulation of the probability of a trajectory, we can rewrite the path integral (11) as:

Ψ_{t_i} = lim_{dt→0} ∫ ( exp( −(1/λ) [ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt ] ) / Π_{j=i}^{N−1} (2π)^{l/2} |Σ_{t_j}|^{1/2} ) dτ^{(c)}_i
  = lim_{dt→0} ∫ ( 1 / D(τ_i) ) exp( −(1/λ) S(τ_i) ) dτ^{(c)}_i,    (16)

where we defined

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt,

and D(τ_i) = Π_{j=i}^{N−1} (2π)^{l/2} |Σ_{t_j}|^{1/2}. Note that the integration is over dτ^{(c)}_i = (dx^{(c)}_{t_i}, ..., dx^{(c)}_{t_N}), as the non-directly actuated states can be integrated out due to the fact that the state transition of the non-directly actuated states is deterministic, and just adds Dirac delta functions in the integral (cf. Equation 13). Equation (16) is written in a more compact form as:

Ψ_{t_i} = lim_{dt→0} ∫ exp( −(1/λ) S(τ_i) − log D(τ_i) ) dτ^{(c)}_i = lim_{dt→0} ∫ exp( −(1/λ) Z(τ_i) ) dτ^{(c)}_i,    (17)

where Z(τ_i) = S(τ_i) + λ log D(τ_i). It can be shown that this term is factorized in path dependent and path independent terms of the form:

Z(τ_i) = S̃(τ_i) + ( λ (N−i) l / 2 ) log( 2π dt λ ),

where S̃(τ_i) = S(τ_i) + (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}|. This formula is a required step for the derivation of optimal controls in the next section. The constant term ( λ (N−i) l / 2 ) log( 2π dt λ ) can be the source of numerical instabilities, especially in cases where a fine discretization dt of the stochastic dynamics is required. However, in the next section, and in great detail in Appendix A, Lemma 1, we show how this term drops out of the equations.

2.4 Optimal Controls

For every moment of time, the optimal controls are given as u_{t_i} = −R^{−1} G_{t_i}^T (∇_{x_{t_i}} V_{t_i}). Due to the exponential transformation of the value function, the equation of the optimal controls can be written as

u_{t_i} = λ R^{−1} G_{t_i} ( ∇_{x_{t_i}} Ψ_{t_i} / Ψ_{t_i} ).

After substituting Ψ_{t_i} with (17) and canceling the state independent terms of the cost we have:

u_{t_i} = lim_{dt→0} λ R^{−1} G_{t_i}^{(c)T} ( ∇_{x^{(c)}_{t_i}} ∫ e^{ −(1/λ) S̃(τ_i) } dτ^{(c)}_i / ∫ e^{ −(1/λ) S̃(τ_i) } dτ^{(c)}_i ).

Further analysis of the equation above leads to a simplified version for the optimal controls as

u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i,    (18)

with the probability P(τ_i) and local controls u_L(τ_i) defined as

P(τ_i) = e^{ −(1/λ) S̃(τ_i) } / ∫ e^{ −(1/λ) S̃(τ_i) } dτ_i,    (19)

u_L(τ_i) = −R^{−1} G^{(c)T}_{t_i} lim_{dt→0} ( ∇_{x^{(c)}_{t_i}} S̃(τ_i) ).

The path cost S̃(τ_i) is a generalized version of the path cost in Kappen (2005a) and Kappen (2007), which only considered systems with state independent control transition G_{t_i}. (More precisely, if G^{(c)}_{t_i} = G^{(c)}, then the term (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}| disappears, since it is state independent and appears in both numerator and denominator in (19); in this case, the path cost reduces to S̃(τ_i) = S(τ_i).) To find the local controls u_L(τ_i) we have to calculate lim_{dt→0} ∇_{x^{(c)}_{t_i}} S̃(τ_i). Appendix A, and more precisely Lemma 2, shows in detail the derivation of the final result:

lim_{dt→0} ( ∇_{x^{(c)}_{t_i}} S̃(τ_i) ) = −H_{t_i}^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ),

where the new term b_{t_i} is expressed as b_{t_i} = λ H_{t_i} Φ_{t_i} and Φ_{t_i} ∈ R^l is a vector with the j-th element defined as:

[Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x^{(c)}_{t_i}]_j} H_{t_i} ) ).

The local control can now be expressed as:

u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} H_{t_i}^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ).

By substituting H_{t_i} = G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} in the equation above, we get our main result for the local controls of the sampled path for the generalized path integral formulation:

u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} ( G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} )^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ).    (20)

The equations (18), (19) and (20) form the solution for the generalized path integral stochastic optimal control problem. Given that this result is of general value and constitutes the foundation for deriving our reinforcement learning algorithm in the next section, but also since many other special cases can be derived from it, we summarize all relevant equations in Table 1:

Given:
- The system dynamics ẋ_t = f_t + G_t ( u_t + ε_t ) (cf. 3)
- The immediate cost r_t = q_t + (1/2) u_t^T R u_t (cf. 4)
- A terminal cost term φ_{t_N} (cf. 1)
- The variance Σ_ε of the mean-zero noise ε_t
- A trajectory starting at t_i and ending at t_N: τ_i = (x_{t_i}, ..., x_{t_N})
- A partitioning of the system dynamics into (c) controlled and (m) uncontrolled equations, where n = c + m is the dimensionality of the state x_t (cf. Section 2.3)

Optimal Controls:
- Optimal controls at every time step t_i: u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i
- Probability of a trajectory: P(τ_i) = e^{−(1/λ) S̃(τ_i)} / ∫ e^{−(1/λ) S̃(τ_i)} dτ_i
- Generalized trajectory cost: S̃(τ_i) = S(τ_i) + (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}|, where
  S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt and H_{t_j} = G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j}
- Local controls: u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} ( G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} )^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ), where b_{t_i} = λ H_{t_i} Φ_{t_i} and [Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x^{(c)}_{t_i}]_j} H_{t_i} ) )

Table 1: Summary of optimal control derived from the path integral formalism.

The Given components of Table 1 include a model of the system dynamics, the cost function, knowledge of the system's noise process, and a mechanism to generate trajectories τ_i. It is important to realize that this is a model-based approach, as the computation of the optimal controls requires knowledge of ε_i. ε_i can be obtained in two ways. First, the trajectories τ_i can be generated purely in simulation, where the noise is generated from a random number generator. Second, trajectories could be generated by a real system, and the noise ε would be computed from the difference between the actual and the predicted system behavior, that is, G^{(c)}_{t_i} ε_{t_i} = ẋ_{t_i} − x̂̇_{t_i} = ẋ_{t_i} − ( f_{t_i} + G_{t_i} u_{t_i} ). Computing the prediction x̂̇_{t_i} also requires a model of the system dynamics.
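In a sampling-based implementation, the integrals (18) and (19) are approximated with a finite set of K roll-outs: each sampled path receives a softmax weight computed from its generalized cost S̃, and the optimal control is the weighted average of the local controls (20). A minimal sketch, with array shapes and example values of our own choosing:

```python
import numpy as np

def optimal_control_estimate(S_tilde, u_local, lam):
    """Monte Carlo version of (18)-(19): S_tilde is a (K,) vector of path
    costs for K roll-outs, u_local is (K, p) local controls from (20)."""
    S = S_tilde - np.min(S_tilde)        # subtracting a constant cancels in (19)
    P = np.exp(-S / lam)
    P /= np.sum(P)                       # discrete probabilities P(tau_i)
    return P @ u_local                   # u_{t_i} = sum_k P_k u_L_k, cf. (18)

# Example with K = 4 sampled paths and 2-D controls
S = np.array([10.0, 12.0, 9.5, 11.0])
uL = np.random.default_rng(1).normal(size=(4, 2))
print(optimal_control_estimate(S, uL, lam=1.0))
```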

Previous results in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek et al. (2008) are special cases of our generalized formulation. In the next section we show how our generalized formulation is specialized to different classes of stochastic dynamical systems, and we provide the corresponding formula of local controls for each class.

2.5 Special Cases

The purpose of this section is twofold. First, it demonstrates how to apply the path integral approach to specialized forms of dynamical systems, and how the local controls in (20) simplify for these cases. Second, this section prepares the special case which we will need for our reinforcement learning algorithm in Section 3.

2.5.1 Systems with One Dimensional Directly Actuated State

The generalized formulation of stochastic optimal control with path integrals in Table 1 can be applied to a variety of stochastic dynamical systems with different types of control transition matrices. One case of particular interest is where the dimensionality of the directly actuated part of the state is 1D, while the dimensionality of the control vector is 1D or higher dimensional. As will be seen below, this situation arises when the controls are generated by a linearly parameterized function approximator. The control transition matrix thus becomes a row vector, G^{(c)}_{t_i} = g^{(c)T}_{t_i} ∈ R^{1×p}. According to (20), the local controls for such systems are expressed as follows:

u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ( g^{(c)T}_{t_i} ε_{t_i} − b_{t_i} ).

Since the directly actuated part of the state is 1D, the vector x^{(c)}_{t_i} collapses into a scalar, which appears in the partial differentiation above. In the case that g^{(c)}_{t_i} does not depend on x^{(c)}_{t_i}, the differentiation with respect to x^{(c)}_{t_i} results in zero and the local controls simplify to:

u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} g^{(c)T}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ε_{t_i}.
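For this 1-D directly actuated case, the simplified local control is just a projection of the sampled noise onto the basis vector. A minimal sketch, assuming a generic invertible R (the numerical values are illustrative):

```python
import numpy as np

def local_control_1d(g, eps, R_inv):
    """u_L = R^{-1} g (g^T eps) / (g^T R^{-1} g) for a row-vector control
    transition g (shape (p,)) and a noise sample eps (shape (p,))."""
    Rg = R_inv @ g
    return Rg * (g @ eps) / (g @ Rg)

g = np.array([0.2, 0.7, 0.1])                # basis vector g_t^{(c)}
eps = np.array([0.05, -0.1, 0.3])            # exploration noise sample
print(local_control_1d(g, eps, np.eye(3)))   # R = I for this example
```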

2.5.2 Systems with Partially Actuated State

The generalized formula of the local controls (20) was derived for the case where the control transition matrix is state dependent and its dimensionality is G^{(c)}_t ∈ R^{l×p} with l < n and p the dimensionality of the control. There are many special cases of stochastic dynamical systems in optimal control and robotic applications that belong to this general class. More precisely, for systems having a state dependent control transition matrix that is square (G^{(c)}_{t_i} ∈ R^{l×l} with l = p) the local controls based on (20) are reformulated as:

u_L(τ_i) = ε_{t_i} − G^{(c)−1}_{t_i} b_{t_i}.    (21)

Interestingly, a rather general class of mechanical systems such as rigid-body and multi-body dynamics falls into this category. When these mechanical systems are expressed in state space formulation, the control transition matrix is equal to the inverse of the rigid body inertia matrix, G^{(c)}_{t_i} = M(θ_{t_i})^{−1} (Sciavicco and Siciliano, 2000). Future work will address this special topic of path integral control for multi-body dynamics.

Another special case of systems with partially actuated state is when the control transition matrix is state independent and has dimensionality G^{(c)}_t = G^{(c)} ∈ R^{l×p}. The local controls, according to (20), become:

u_L(τ_i) = R^{−1} G^{(c)T} ( G^{(c)} R^{−1} G^{(c)T} )^{−1} G^{(c)} ε_{t_i}.    (22)

If G^{(c)} is square and state independent, G^{(c)} ∈ R^{l×l}, we will have:

u_L(τ_i) = ε_{t_i}.    (23)

This special case was explored in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek et al. (2008). Our generalized formulation allows a broader application of path integral control in areas like robotics and other control systems, where the control transition matrix is typically partitioned into directly and non-directly actuated states, and typically also state dependent.

2.5.3 Systems with Fully Actuated State Space

In this class of stochastic systems, the control transition matrix is not partitioned and, therefore, the control u directly affects all the states. The local controls for such systems are provided by simply substituting G^{(c)}_{t_i} ∈ R^{l×p} in (20) with G_{t_i} ∈ R^{n×n}. Since G_{t_i} is a square matrix we obtain:

u_L(τ_i) = ε_{t_i} − G_{t_i}^{−1} b_{t_i},

with b_{t_i} = λ H_{t_i} Φ_{t_i} and [Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x_{t_i}]_j} H_{t_i} ) ), where the differentiation is not taken with respect to [x^{(c)}_{t_i}]_j but with respect to the full state [x_{t_i}]_j. For this fully actuated state space, there are subclasses of dynamical systems with square and/or state independent control transition matrix. The local controls for these cases are found by just substituting G^{(c)}_{t_i} with G_{t_i} in (21), (22) and (23).

3. Reinforcement Learning with Parameterized Policies

Equipped with the theoretical framework of stochastic optimal control with path integrals, we can now turn to its application to reinforcement learning with parameterized policies. Since the beginning of actor-critic algorithms (Barto et al., 1983), one goal of reinforcement learning has been to learn compact policy representations, for example, with neural networks as in the early days of machine learning (Miller et al., 1990), or with general parameterizations (Peters, 2007; Deisenroth et al., 2009).

Parameterized policies have much fewer parameters than the classical time-indexed approach of optimal control, where every time step has its own set of parameters, that is, the optimal controls at this time step. Usually, function approximation techniques are used to represent the optimal controls, and the open parameters of the function approximator become the policy parameters. Function approximators use a state representation as input, and not an explicit time dependent representation. This allows generalization across states and promises better generalization of the control policy to a larger state space, such that policies become re-usable and do not have to be recomputed in every new situation.

The path integral approach from the previous sections also follows the classical time-based optimal control strategy, as can be seen from the time dependent solution for optimal controls in (33) below. However, a minor re-interpretation of the approach and some small mathematical adjustments allow us to carry it over to parameterized policies and reinforcement learning, which results in a new algorithm called Policy Improvement with Path Integrals (PI²).

3.1 Parameterized Policies

We are focusing on direct policy learning, where the parameters of the policy are adjusted by a learning rule directly, and not indirectly as in value function approaches of classical reinforcement learning (Sutton and Barto, 1998); see Peters (2007) for a discussion of pros and cons of direct vs. indirect policy learning. Direct policy learning usually assumes a general cost function (Sutton et al., 2000; Peters, 2007) in the form of

J(x_0) = ∫ p(τ_0) R(τ_0) dτ_0,    (24)

which is optimized over state-action trajectories τ_0 = (x_{t_0}, a_{t_0}, ..., x_{t_N}). (We use a_t to denote actions here in order to avoid using the symbol u in a conflicting way in the equations below, and to emphasize that an action does not necessarily coincide with the control command to a physical system.) Under the first order Markov property, the probability of a trajectory is

p(τ_i) = p(x_{t_i}) Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j}, a_{t_j} ) p( a_{t_j} | x_{t_j} ).

Both the state transition and the policy are assumed to be stochastic. The particular formulation of the stochastic policy is a design parameter, motivated by the application domain, analytical convenience, and the need to inject exploration during learning. For continuous state-action domains, Gaussian distributions are most commonly chosen (Gullapalli, 1990; Williams, 1992; Peters, 2007). An interesting generalized stochastic policy was suggested in Rueckstiess et al. (2008) and applied in Kober and Peters (2008), where the stochastic policy p(a_t | x_t) is linearly parameterized as:

a_t = g_t^T ( θ + ε_t ),    (25)

with g_t denoting a vector of basis functions and θ the parameter vector. This policy has state dependent noise, which can contribute to faster learning as the signal-to-noise ratio becomes adaptive, since it is a function of g_t. It should be noted that a standard additive-noise policy can be expressed in this formulation, too, by choosing one constant basis function [g_t]_j = 1. For Gaussian noise ε the probability of an action is p(a_t | x_t) = N(θ^T g_t, Σ_t) with Σ_t = g_t^T Σ_ε g_t. Comparing the policy formulation in (25) with the control term in (3), one recognizes that the control policy formulation (25) should fit into the framework of path integral optimal control.
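For illustration, sampling an action from the linearly parameterized policy (25) looks as follows; the basis activations, parameters, and noise covariance are made-up example values:

```python
import numpy as np

def sample_action(g_t, theta, Sigma_eps, rng):
    """Sample a_t = g_t^T (theta + eps_t) with eps_t ~ N(0, Sigma_eps), cf. (25).
    The resulting action is Gaussian with mean g_t^T theta and the state
    dependent variance g_t^T Sigma_eps g_t."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma_eps)
    return g_t @ (theta + eps)

rng = np.random.default_rng(0)
g_t = np.array([0.1, 0.5, 0.4])      # basis function activations at state x_t
theta = np.array([1.0, -2.0, 0.5])   # policy parameters
print(sample_action(g_t, theta, 0.1 * np.eye(3), rng))
```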

3.2 Generalized Parameterized Policies

Before going into more detail of our proposed reinforcement learning algorithm, it is worthwhile contemplating what the action a_t actually represents. In many applications of stochastic optimal control there are three main problems to be considered: trajectory planning, feedforward control, and feedback control. The result of optimization could thus be an optimal kinematic trajectory, the corresponding feedforward commands to track the desired trajectory accurately in the face of the system's nonlinearities, and/or time varying linear feedback gains (gain scheduling) for a negative feedback controller that compensates for perturbations from accurate trajectory tracking.

There are very few optimal control algorithms which compute all three issues simultaneously, such as Differential Dynamic Programming (DDP) (Jacobson and Mayne, 1970), or its simpler version, the Iterative Linear Quadratic Regulator (iLQR) (Todorov, 2005). However, these are model based methods which require rather accurate knowledge of the dynamics and make restrictive assumptions concerning the differentiability of the system dynamics and the cost function.

Path integral optimal control allows more flexibility than these related methods. The concept of an action can be viewed in a broader sense. Essentially, we consider any input to the control system as an action, not unlike the inputs to a transfer function in classical linear control theory. The input can be a motor command, but it can also be anything else, for instance, a desired state that is subsequently converted to a motor command by some tracking controller, or a control gain (Buchli et al., 2010). As an example, consider a robotic system with rigid body dynamics (RBD) equations (Sciavicco and Siciliano, 2000) using a parameterized policy:

q̈ = M(q)^{−1} ( −C(q, q̇) − v(q) ) + M(q)^{−1} u,    (26)
u = G(q) ( θ + ε_t ),    (27)

where M is the RBD inertia matrix, C are Coriolis and centripetal forces, and v denotes gravity forces. The state of the robot is described by the joint angles q and joint velocities q̇. The policy (27) is linearly parameterized by θ, with basis function matrix G; one would assume that the dimensionality of θ is significantly larger than that of q to assure sufficient expressive power of this parameterized policy. Inserting (27) into (26) results in a differential equation that is compatible with the system equations (3) for path integral optimal control:

q̈ = f(q, q̇) + G̃(q) ( θ + ε_t ),    (28)

where f(q, q̇) = M(q)^{−1} ( −C(q, q̇) − v(q) ) and G̃(q) = M(q)^{−1} G(q). This is a typical example where the policy directly represents motor commands. Alternatively, we could create another form of control structure for the RBD system:

q̈ = M(q)^{−1} ( −C(q, q̇) − v(q) ) + M(q)^{−1} u,
u = K_P ( q_d − q ) + K_D ( q̇_d − q̇ ),
q̇_d = G(q_d, q̇_d) ( θ + ε_t ).    (29)

Here, a Proportional-Derivative (PD) controller with positive definite gain matrices K_P and K_D converts a desired trajectory (q_d, q̇_d) into a motor command u. In contrast to the previous example, the parameterized policy generates the desired trajectory in (29), and the differential equation for the desired trajectory is compatible with the path integral formalism.

What we would like to emphasize is that the control system's structure is left to the creativity of its designer, and that path integral optimal control can be applied on various levels. Importantly, as developed in Section 2.3, only the controlled differential equations of the entire control system contribute to the path integral formalism, that is, (28) in the first example, or (29) in the second example. And only these controlled differential equations need to be known for applying path integral optimal control; none of the variables of the uncontrolled equations is ever used.

At this point, we make a very important transition from model-based to model-free learning. In the example of (28), the dynamics model of the control system needs to be known to apply path integral optimal control, as this is a controlled differential equation. In contrast, in (29), the system dynamics are in an uncontrolled differential equation, and are thus irrelevant for applying path integral optimal control. In this case, only knowledge of the desired trajectory dynamics is needed, which is usually created by the system designer. Thus, we obtain a model-free learning system.

3.3 Dynamic Movement Primitives as Generalized Policies

As we are interested in model-free learning, we follow the control structure of the 2nd example of the previous section, that is, we optimize control policies which represent desired trajectories. We use Dynamic Movement Primitives (DMPs) (Ijspeert et al., 2003) as a special case of parameterized policies, which are expressed by the differential equations:

τ ż_t = f_t + g_t^T ( θ + ε_t ),    (30)
τ ẏ_t = z_t,
τ ẋ_t = −α x_t,
f_t = α_z ( β_z ( g − y_t ) − z_t ).

Essentially, these policies code a learnable point attractor for a movement from y_{t_0} to the goal g, where θ determines the shape of the attractor; y_t, ẏ_t denote the position and velocity of the trajectory, while z_t, x_t are internal states; α_z, β_z, τ are time constants. The basis functions g_t ∈ R^p are defined by a piecewise linear function approximator with Gaussian weighting kernels, as suggested in Schaal and Atkeson (1998):

[g_t]_j = ( w_j x_t / Σ_{k=1}^{p} w_k ) ( g − y_0 ),    (31)
w_j = exp( −0.5 h_j ( x_t − c_j )² ),

with bandwidth h_j and center c_j of the Gaussian kernels; for more details see Ijspeert et al. (2003). The DMP representation is advantageous as it guarantees attractor properties towards the goal while remaining linear in the parameters θ of the function approximator. By varying the parameter θ the shape of the trajectory changes while the goal state g and initial state y_{t_0} remain fixed. These properties facilitate learning (Peters and Schaal, 2008a).
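A minimal sketch of integrating the DMP (30) with the basis functions (31) is given below. The gains, time constants, and kernel placements are illustrative choices of ours, not values prescribed by the paper:

```python
import numpy as np

def integrate_dmp(theta, centers, bandwidths, g, y0, tau=0.5, dt=0.002,
                  alpha=8.0, alpha_z=12.0, beta_z=3.0, eps=None):
    """Euler integration of the DMP (30)-(31); eps is optional per-step noise."""
    N = int(tau / dt)
    x, z, y = 1.0, 0.0, y0                        # phase, scaled velocity, position
    Y = np.zeros(N)
    for i in range(N):
        w = np.exp(-0.5 * bandwidths * (x - centers) ** 2)
        g_t = w * x / np.sum(w) * (g - y0)        # basis functions, cf. (31)
        e = np.zeros_like(theta) if eps is None else eps[i]
        f = alpha_z * (beta_z * (g - y) - z)
        z += dt * (f + g_t @ (theta + e)) / tau   # tau*zdot = f + g^T(theta+eps)
        y += dt * z / tau                         # tau*ydot = z
        x += dt * (-alpha * x) / tau              # tau*xdot = -alpha*x
        Y[i] = y
    return Y

# Example: 10 basis functions, zero parameters -> plain point attractor to g
traj = integrate_dmp(theta=np.zeros(10), centers=np.linspace(1, 0.01, 10),
                     bandwidths=np.full(10, 50.0), g=1.0, y0=0.0)
print(traj[-1])   # approaches the goal g = 1
```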

3.4 Policy Improvements with Path Integrals: The PI² Algorithm

As can be easily recognized, the DMP equations are of the form of our control system (3), with only one controlled equation and a one dimensional actuated state. This case has been treated in Section 2.5.1. The motor commands are replaced with the parameters θ (the issue of time dependent vs. constant parameters will be addressed below). More precisely, the DMP equations can be written as:

( ẋ_t ; ẏ_t ; ż_t ) = ( −α x_t ; z_t ; α_z ( β_z ( g − y_t ) − z_t ) ) + ( 0_{1×p} ; 0_{1×p} ; g_t^{(c)T} ) ( θ_t + ε_t ).

The state of the DMP is partitioned into the controlled part x^{(c)}_t = z_t and the uncontrolled part x^{(m)}_t = (x_t, y_t)^T. The control transition matrix depends on the state; however, it depends only on one of the state variables of the uncontrolled part of the state, that is, x_t. The path cost for the stochastic dynamics of the DMPs is given by:

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt
  ∝ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ‖ g^{(c)T}_{t_j} ( θ + ε_{t_j} ) ‖²_{H_{t_j}^{−1}}
  = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T g^{(c)}_{t_j} H_{t_j}^{−1} g^{(c)T}_{t_j} ( θ + ε_{t_j} )
  = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T M^T_{t_j} R M_{t_j} ( θ + ε_{t_j} ),    (32)

with M_{t_j} = ( R^{−1} g_{t_j} g^T_{t_j} ) / ( g^T_{t_j} R^{−1} g_{t_j} ); H_{t_j} becomes a scalar given by H_{t_j} = g^{(c)T}_{t_j} R^{−1} g^{(c)}_{t_j}. Interestingly, the term (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}| for the case of DMPs depends only on x_t, which is a deterministic variable and can therefore be ignored, since it is the same for all sampled paths. We also absorbed, without loss of generality, the time step dt in the cost terms. Consequently, the fundamental result of the path integral stochastic optimal control problem for the case of DMPs is expressed as:

u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i,    (33)

where the probability P(τ_i) and local controls u_L(τ_i) are defined as

P(τ_i) = e^{ −(1/λ) S(τ_i) } / ∫ e^{ −(1/λ) S(τ_i) } dτ_i,
u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} g^{(c)T}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ε_{t_i},

and the path cost given as

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ε^T_{t_j} M^T_{t_j} R M_{t_j} ε_{t_j}.

Note that θ = 0 in these equations, that is, the parameters are initialized to zero. These equations correspond to the case where the stochastic optimal control problem is solved with one evaluation of the optimal controls (33) using dense sampling of the whole state space under the passive dynamics (i.e., θ = 0), which requires a significant amount of exploration noise. Such an approach was pursued in the original work by Kappen (2007) and Broek et al. (2008), where a potentially large number of sample trajectories was needed to achieve good results. Extending this sampling approach to high dimensional spaces, however, is daunting, as with very high probability we would sample primarily rather useless trajectories. Thus, biasing sampling towards good initial conditions seems to be mandatory for high dimensional applications.

Thus, we consider only local sampling and an iterative update procedure. Given a current guess of θ, we generate sample roll-outs using stochastic parameters θ + ε_t at every time step. To see how the generalized path integral formulation is modified for the case of iterative updating, we start with the equations of the update of the parameter vector θ, which can be written as:

θ^{(new)}_{t_i} = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ( θ + ε_{t_i} ) / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i
  = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ε_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i + ( R^{−1} g_{t_i} g^T_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) θ
  = δθ_{t_i} + M_{t_i} θ.    (34)

The correction parameter vector δθ_{t_i} is defined as δθ_{t_i} = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ε_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i. It is important to note that θ^{(new)}_{t_i} is now time dependent, that is, for every time step t_i, a different optimal parameter vector is computed. In order to return to one single time independent parameter vector θ^{(new)}, the vectors θ^{(new)}_{t_i} need to be averaged over time. We start with a first tentative suggestion of averaging over time, and then explain why it is inappropriate and what the correct way of time averaging has to look like. The tentative and most intuitive time average is:

θ^{(new)} = (1/N) Σ_{i=0}^{N−1} θ^{(new)}_{t_i} = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + (1/N) Σ_{i=0}^{N−1} M_{t_i} θ.

Thus, we would update θ based on two terms. The first term is the average of δθ_{t_i}, which is reasonable as it reflects the knowledge we gained from the exploration noise. However, there would be a second update term due to the average over projected mean parameters θ from every time step. It should be noted that M_{t_i} is a projection matrix onto the range space of g_{t_i} under the metric R^{−1}, such that a multiplication with M_{t_i} can only shrink the norm of θ. From the viewpoint of having optimal parameters for every time step, this update component is reasonable as it trivially eliminates the part of the parameter vector that lies in the null space of g_{t_i} and which contributes to the command cost of a trajectory in a useless way.

From the viewpoint of a parameter vector that is constant and time independent and that is updated iteratively, however, this second update is undesirable, as the multiplication of the parameter vector θ with M_{t_i} in (34) and the averaging operation over the time horizon reduce the L2 norm of the parameters at every iteration, potentially in an uncontrolled way. (To be precise, θ would be projected and continue shrinking until it lies in the intersection of all null spaces of the g_{t_i} basis functions; this null space can easily be of measure zero.) What we rather want is to achieve convergence when the average of δθ_{t_i} becomes zero, and we do not want to continue updating due to the second term.

The problem is avoided by eliminating the projection matrix in the second term of the averaging, such that it becomes:

θ^{(new)} = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + (1/N) Σ_{i=0}^{N−1} θ = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + θ.

The meaning of this reduced update is simply that we keep a component in θ that is irrelevant and contributes to our trajectory cost in a useless way. However, this irrelevant component will not prevent us from reaching the optimal effective solution, that is, the solution that lies in the range space of g_{t_i}. Given this modified update, it is, however, also necessary to derive a compatible cost function. As mentioned before, in the unmodified scenario, the last term of (32) is:

(1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T M^T_{t_j} R M_{t_j} ( θ + ε_{t_j} ).

To avoid a projection of θ, we modify this cost term to be:

(1/2) Σ_{j=i}^{N−1} ( θ + M_{t_j} ε_{t_j} )^T R ( θ + M_{t_j} ε_{t_j} ).

With this modified cost term, the path integral formalism results in the desired θ^{(new)}_{t_i} without the M_{t_i} projection of θ. The main equations of the iterative version of the generalized path integral formulation, called Policy Improvement with Path Integrals (PI²), can be summarized as:

P(τ_i) = e^{ −(1/λ) S(τ_i) } / ∫ e^{ −(1/λ) S(τ_i) } dτ_i,    (35)

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ( θ + M_{t_j} ε_{t_j} )^T R ( θ + M_{t_j} ε_{t_j} ) dt,    (36)

δθ_{t_i} = ∫ P(τ_i) M_{t_i} ε_{t_i} dτ_i,    (37)

[δθ]_j = ( Σ_{i=0}^{N−1} (N − i) w_{j,t_i} [δθ_{t_i}]_j ) / ( Σ_{i=0}^{N−1} w_{j,t_i} (N − i) ),    (38)

θ^{(new)} = θ^{(old)} + δθ.

Essentially, (35) computes a discrete probability at time t_i of each trajectory roll-out with the help of the cost (36). For every time step of the trajectory, a parameter update is computed in (37) based on a probability weighted average over trajectories.

The parameter updates at every time step are finally averaged in (38). Note that we chose a weighted average by giving every parameter update a weight according to the time steps left in the trajectory and the activation of the kernel in (31). (The use of the kernel weights in the basis functions (31) for the purpose of time averaging has shown better performance with respect to other weighting approaches across all of our experiments; therefore this is the weighting that we suggest, and users may develop other weighting schemes more suitable to their needs.) This average can be interpreted as using a function approximator with only a constant offset parameter vector to approximate the time dependent parameters. Giving early points in the trajectory a higher weight is useful since their parameters affect a larger time horizon and thus higher trajectory costs. Other function approximation (or averaging) schemes could be used to arrive at a final parameter update; we preferred this simple approach as it gave very good learning results. The final parameter update is θ^{(new)} = θ^{(old)} + δθ.

The parameter λ regulates the sensitivity of the exponentiated cost and can automatically be optimized for every time step i to maximally discriminate between the experienced trajectories. More precisely, a constant term can be subtracted from (36) as long as all S(τ_i) remain positive; this constant term cancels in (35). Thus, for a given number of roll-outs, we compute the exponential term in (35) as

exp( −(1/λ) S(τ_i) ) = exp( −h ( S(τ_i) − min S(τ_i) ) / ( max S(τ_i) − min S(τ_i) ) ),

with h set to a constant, which we chose to be h = 10 in all our evaluations; the max and min operators are over all sample roll-outs. (In fact, the term inside the exponent results from adding h min S(τ_i) / ( max S(τ_i) − min S(τ_i) ), which cancels in (35), to the term −h S(τ_i) / ( max S(τ_i) − min S(τ_i) ), which is equal to −(1/λ) S(τ_i).) This procedure eliminates λ and leaves the variance of the exploration noise ε as the only open algorithmic parameter for PI². It should be noted that the equations for PI² have no numerical pitfalls: no matrix inversions and no learning rates (R is a user design parameter and usually chosen to be diagonal and invertible), rendering PI² very easy to use in practice.

The pseudocode for the final PI² algorithm for a one dimensional control system with function approximation is given in Table 2. A tutorial Matlab example of applying PI² can be found online.
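The equations (35)-(38), together with the λ-free exponentiation just described, translate almost line by line into code. The following is a minimal sketch of one PI² parameter update; for brevity it assumes R = I (so that M_t ε_t reduces to g (gᵀε)/(gᵀg)) and it omits the kernel weights w_{j,t_i} in the time averaging, keeping only the (N − i) weighting:

```python
import numpy as np

def pi2_update(theta, S, eps, g, h=10.0):
    """One PI^2 parameter update, cf. (35)-(38) and Table 2, assuming R = I.
    S:   (K, N) cost-to-go S(tau_i) for K roll-outs and N time steps
    eps: (K, N, p) exploration noise; g: (K, N, p) basis activations."""
    K, N, p = eps.shape
    # (35) with the lambda-free exponentiation, h = 10
    expS = np.exp(-h * (S - S.min(0)) / (S.max(0) - S.min(0) + 1e-10))
    P = expS / expS.sum(0)                                   # P(tau_i), shape (K, N)
    # (37): delta_theta_{t_i} = sum_k P(tau_{i,k}) M_{t_i,k} eps_{t_i,k},
    # with M eps = g (g^T eps) / (g^T g) for R = I
    Meps = g * (np.einsum('knp,knp->kn', g, eps) /
                (np.einsum('knp,knp->kn', g, g) + 1e-10))[..., None]
    dtheta_t = np.einsum('kn,knp->np', P, Meps)              # shape (N, p)
    # (38): time average weighted by the steps left in the trajectory
    w = (N - np.arange(N)).astype(float)
    return theta + (w[:, None] * dtheta_t).sum(0) / w.sum()

# Toy usage with K = 8 roll-outs, N = 20 steps, p = 5 parameters
rng = np.random.default_rng(0)
print(pi2_update(np.zeros(5), rng.random((8, 20)),
                 0.1 * rng.standard_normal((8, 20, 5)), rng.random((8, 20, 5))))
```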

Given:
- An immediate cost function r_t = q_t + θ_t^T R θ_t (cf. 1)
- A terminal cost term φ_{t_N} (cf. 1)
- A stochastic parameterized policy a_t = g_t^T ( θ + ε_t ) (cf. 25)
- The basis function g_{t_i} from the system dynamics (cf. 3 and Section 2.5.1)
- The variance Σ_ε of the mean-zero noise ε_t
- The initial parameter vector θ

Repeat until convergence of the trajectory cost R:
- Create K roll-outs of the system from the same start state x_0 using stochastic parameters θ + ε_t at every time step
- For k = 1...K, compute:
  P(τ_{i,k}) = e^{ −(1/λ) S(τ_{i,k}) } / Σ_{k=1}^{K} e^{ −(1/λ) S(τ_{i,k}) }
  S(τ_{i,k}) = φ_{t_N,k} + Σ_{j=i}^{N−1} q_{t_j,k} + (1/2) Σ_{j=i+1}^{N−1} ( θ + M_{t_j,k} ε_{t_j,k} )^T R ( θ + M_{t_j,k} ε_{t_j,k} )
  M_{t_j,k} = ( R^{−1} g_{t_j,k} g^T_{t_j,k} ) / ( g^T_{t_j,k} R^{−1} g_{t_j,k} )
- For i = 1...(N−1), compute:
  δθ_{t_i} = Σ_{k=1}^{K} [ P(τ_{i,k}) M_{t_i,k} ε_{t_i,k} ]
- Compute [δθ]_j = ( Σ_{i=0}^{N−1} (N − i) w_{j,t_i} [δθ_{t_i}]_j ) / ( Σ_{i=0}^{N−1} w_{j,t_i} (N − i) )
- Update θ ← θ + δθ
- Create one noiseless roll-out to check the trajectory cost R = φ_{t_N} + Σ_{i=0}^{N−1} r_{t_i} (in case the noise cannot be turned off, that is, a stochastic system, multiple roll-outs need to be averaged)

Table 2: Pseudocode of the PI² algorithm for a 1D parameterized policy. Note that the discrete time step dt was absorbed as a constant multiplier in the cost terms.

4. Related Work

In the next sections we discuss related work in the areas of stochastic optimal control and reinforcement learning, and analyze the connections and differences with the PI² algorithm and the generalized path integral control formulation.

4.1 Stochastic Optimal Control and Path Integrals

The path integral formalism for optimal control was introduced in Kappen (2005a,b). In this work, the role of noise in symmetry breaking phenomena was investigated in the context of stochastic optimal control. In Kappen et al. (2007), Wiegerinck et al. (2006), and Broek et al. (2008), the path integral formalism is extended to the stochastic optimal control of multi-agent systems.

Recent work on stochastic optimal control by Todorov (2008), Todorov (2007) and Todorov (2009b) shows that for a class of discrete stochastic optimal control problems, the Bellman equation can be written as the KL divergence between the probability distributions of the controlled and uncontrolled dynamics. Furthermore it is shown that the class of discrete KL divergence control problems is equivalent to the continuous stochastic optimal control formalism with quadratic control cost and under the presence of Gaussian noise. In Kappen et al. (2009), the KL divergence control formalism is considered and is transformed into a probabilistic inference problem. In all this aforementioned work, both in the path integral formalism as well as in KL divergence control, the class of stochastic dynamical systems under consideration is rather restrictive since the control transition matrix is state independent.

20 THEODOROU, BUCHLI AND SCHAAL control transton matrx s state ndependent Moreover, the connecton to drect polcy learnng n RL and model-free learnng was not made n any of the prevous projects Our PI 2 algorthm dffers wth respect to the aforementoned work n the followng ponts In Todorov 2009b the stochastc optmal control problem s nvestgated for dscrete acton - state spaces and therefore s treated as Markov Decson Process MDP To apply our PI 2 algorthm, we do not dscretze the state space and we do not treat the problem as an MDP Instead we work n contnuous state - acton spaces whch are sutable for performng RL n hgh dmensonal robotc systems To the best of our knowledge, our results present RL n one of the most hgh dmensonal contnuous state acton spaces In our dervatons, the probablstc nterpretaton of control comes drectly from the Feynman- Kac Lemma Thus we do not have to mpose any artfcal pseudo-probablty treatment of the cost as n Todorov 2009b In addton, for the contnuous state - acton spaces we do not have to learn the value functon as s suggested n Todorov 2009b va Z-learnng Instead we drectly fnd the controls based on our generalzaton of optmal controls In the prevous work, the problem of how to sample trajectores s not addressed Samplng s performed at once wth the hope to cover the all state space We follow a rather dfferent approach that allows to attack robotc learnng problems of the complexty and dmensonalty of the lttle dog robot The work n Todorov 2009a consders stochastc dynamcs wth state dependent control matrx However, the way of how the stochastc optmal control problem s solved s by mposng strong assumptons on the structure of the cost functon and, therefore, restrctons of the proposed soluton to specal cases of optmal control problems The use of ths specfc cost functon allows transformng the stochastc optmal control problem to a determnstc optmal control problem Under ths transformaton, the stochastc optmal control problem can be solved by usng determnstc algorthms Wth respect to the work n Broek et al 2008, Wegernck et al 2006 and Kappen et al 2009 our PI 2 algorthm has been derved for a rather general class of systems wth control transton matrx thas state dependent In ths general class, Rgd body and mult-body dynamcs as well as the DMPs are ncluded Furthermore we have shown how our results generalze prevous work 42 Renforcement Learnng of Parameterzed Polces There are two man classes of related algorthms: Polcy Gradent algorthms and probablstc algorthms Polcy Gradent algorthms Peters and Schaal, 2006a,b compute the gradent of the cost functon 24 at every teraton and the polcy parameters are updated accordng to θ new = θ old + α θ J Some well-establshed algorthms, whch we wll also use for comparsons, are as follows see also Peters and Schaal, 2006a,b 42 REINFORCE Wllams 992 ntroduced the epsodc REINFORCE algorthm, whch s derved from takng the dervatve of 24 wth respect to the polcy parameters Ths algorthm has rather slow convergence 356

It is also very sensitive to a reward baseline parameter b_k (see below). Recent work derived the optimal baseline for REINFORCE (cf. Peters and Schaal, 2008a), which improved the performance significantly. The episodic REINFORCE update equations are:

∇_{θ_k} J = E_{τ_0}[ ( R(τ_0) − b_k ) Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) ],
b_k = E_{τ_0}[ ( Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) )² R(τ_0) ] / E_{τ_0}[ ( Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) )² ],

where k denotes the k-th coefficient of the parameter vector and R(τ_0) = (1/N) Σ_{i=0}^{N−1} r_{t_i}.

4.2.2 GPOMDP and the Policy Gradient Theorem Algorithm

In their GPOMDP algorithm, Baxter and Bartlett (2001) introduced several improvements over REINFORCE that made the gradient estimates more efficient. GPOMDP can also be derived from the policy gradient theorem (Sutton et al., 2000; Peters and Schaal, 2008a), and an optimal reward baseline can be added (cf. Peters and Schaal, 2008a). In our context, the GPOMDP learning algorithm can be written as:

∇_{θ_k} J = E_{τ_0}[ Σ_{j=0}^{N−1} ( r_{t_j} − b_k^{t_j} ) Σ_{i=0}^{j} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) ],
b_k^{t_j} = E_{τ_0}[ ( ∇_{θ_k} ln p( a_{t_j} | x_{t_j} ) )² r_{t_j} ] / E_{τ_0}[ ( ∇_{θ_k} ln p( a_{t_j} | x_{t_j} ) )² ].

4.2.3 The Episodic Natural Actor Critic

One of the most efficient policy gradient algorithms was introduced in Peters and Schaal (2008b), called the Episodic Natural Actor Critic (eNAC). In essence, the method uses the Fisher Information Matrix to project the REINFORCE gradient onto a more effective update direction, which is motivated by the theory of natural gradients by Amari (1999). The eNAC algorithm takes the form of:

ξ_{t_i} = [ ∇_θ ln p( a_{t_i} | x_{t_i} )^T, 1 ]^T,
[ ∇_θ J^T, J_0 ]^T = E_{τ_0}[ Σ_{i=0}^{N−1} ξ_{t_i} ξ_{t_i}^T ]^{−1} E_{τ_0}[ R(τ_0) Σ_{i=0}^{N−1} ξ_{t_i} ],

where J_0 is a constant offset term.
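For comparison with the PI² update, the episodic REINFORCE gradient above can be estimated from sampled roll-outs as follows. The sketch assumes the policy (25) with a fixed scalar action variance and omits the baseline b_k; all names and values are our own illustration:

```python
import numpy as np

def reinforce_gradient(returns, actions, means, g, sigma2):
    """Episodic REINFORCE estimate of grad_theta J (baseline omitted).
    returns: (K,) return R(tau_0) per roll-out; actions, means: (K, N);
    g: (K, N, p) basis activations; sigma2: scalar action variance."""
    # grad_theta ln p(a_t|x_t) = g_t (a_t - g_t^T theta) / sigma^2 for a Gaussian policy
    grad_logp = g * ((actions - means) / sigma2)[..., None]      # (K, N, p)
    return np.mean(returns[:, None] * grad_logp.sum(axis=1), axis=0)

rng = np.random.default_rng(1)
K, N, p = 10, 50, 3
g = rng.random((K, N, p)); theta = np.zeros(p)
means = np.einsum('knp,p->kn', g, theta)
actions = means + 0.3 * rng.standard_normal((K, N))
returns = -np.sum(actions ** 2, axis=1)                          # toy return
print(reinforce_gradient(returns, actions, means, g, sigma2=0.09))
```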

4.2.4 PoWER

The PoWER algorithm (Kober and Peters, 2008) is a probabilistic policy improvement method, not a gradient algorithm. It is derived from an Expectation-Maximization framework using probability matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c). Using the notation of this paper, the parameter update of PoWER becomes:

δθ = E_{τ_0}[ Σ_{i=0}^{N−1} ( g_{t_i} g^T_{t_i} / ( g^T_{t_i} g_{t_i} ) ) R_{t_i} ]^{−1} E_{τ_0}[ Σ_{t_i=t_0}^{t_N} ( g_{t_i} g^T_{t_i} / ( g^T_{t_i} g_{t_i} ) ) R_{t_i} ε_{t_i} ],    (39)

where R_{t_i} = Σ_{j=i}^{N−1} r_{t_j}. If we set R = c I in the update (37) of PI², and set g_{t_i} g^T_{t_i} = I in the matrix inversion term of (39), the two algorithms look essentially identical. But it should be noted that the rewards r_{t_i} in PoWER need to behave like an improper probability, that is, be strictly positive and integrate to a constant number; this property can make the design of suitable cost functions more complicated. PI², in contrast, uses the exponentiated sum of reward terms, where the immediate reward can be arbitrary, and only the cost on the motor commands needs to be quadratic. Our empirical evaluations revealed that, for cost functions that share the same optimum in the PoWER pseudo-probability formulation and the PI² notation, both algorithms perform essentially identically, indicating that the matrix inversion term in PoWER may be unimportant for many systems. It should be noted that in Vlassis et al. (2009), PoWER was extended to the discounted infinite horizon case, where PoWER is the special case of a non-discounted finite horizon problem.

5. Evaluations

We evaluated PI² on several synthetic examples in comparison with REINFORCE, GPOMDP, eNAC, and, when possible, PoWER. Except for PoWER, all algorithms are suitable for optimizing immediate reward functions of the kind r_t = q_t + u_t^T R u_t. As mentioned above, PoWER requires that the immediate reward behaves like an improper probability. This property is incompatible with r_t = q_t + u_t^T R u_t and requires special nonlinear transformations, which usually change the nature of the optimization problem, such that PoWER optimizes a different cost function. Thus, only one of the examples below has a cost function compatible with all algorithms, including PoWER. In all examples below, exploration noise and, when applicable, learning rates were tuned for every individual algorithm to achieve the best possible numerically stable performance. Exploration noise was only added to the maximally activated basis function in a motor primitive (that is, the noise vector in (25) has only one non-zero component), and the noise was kept constant for the entire time that this basis function had the highest activation (empirically, this trick helped improve the learning speed of all algorithms).

5.1 Learning Optimal Performance of a 1 DOF Reaching Task

The first evaluation considers learning optimal parameters for a 1 DOF DMP (cf. Equation 30). The immediate cost and terminal cost are, respectively:

r_t = 0.5 f_t² + 0.5 θ^T θ,    φ_{t_N} = 10000 ẏ²_{t_N} + 10 ( g − y_{t_N} )²,

with y_{t_0} = 0 and g = 1 (we use radians as units, motivated by our interest in robotics applications, but we could also avoid units entirely). The interpretation of this cost is that we would like to reach the goal g with high accuracy while minimizing the acceleration of the movement and while keeping the parameter vector short. Each algorithm was run for 15 trials to compute a parameter update, and a total of 1000 updates were performed.

Note that 15 trials per update were chosen as the DMP had 10 basis functions, and the eNAC requires at least 11 trials to perform a numerically stable update due to its matrix inversion. The motor primitives were initialized to approximate a 5th order polynomial as point-to-point movement (cf. Figure 1a,b), called a minimum-jerk trajectory in the motor control literature; the movement duration was 0.5 seconds, which is similar to normal human reaching movements. Gaussian noise of N(0, 0.1) was added to the initial parameters of the movement primitives in order to have different initial conditions for every run of the algorithms. The results are given in Figure 1. Figures 1a,b show the initial (before learning) trajectory generated by the DMP together with the learning results of the four different algorithms after learning; essentially, all algorithms achieve the same result, such that all trajectories lie on top of each other. In Figure 1c, however, it can be seen that PI² outperforms the gradient algorithms by an order of magnitude. Figure 1d illustrates learning curves for the same task as in Figure 1c, just that parameter updates are computed already after two roll-outs (the eNAC was excluded from this evaluation as it would be too heuristic to stabilize its ill-conditioned matrix inversion that results from such few roll-outs). PI² continues to converge much faster than the other algorithms even in this special scenario. However, there are some noticeable fluctuations after convergence. This noise around the convergence baseline is caused by using only two noisy roll-outs to continue updating the parameters, which causes continuous parameter fluctuations around the optimal parameters. Annealing the exploration noise, or just adding the optimal trajectory from the previous parameter update as one of the roll-outs for the next parameter update, can alleviate this issue (we do not illustrate such little tricks in this paper as they really only affect the fine tuning of the algorithm).

5.2 Learning Optimal Performance of a 1 DOF Via-Point Task

The second evaluation was identical to the first evaluation, just that the cost function now forced the movement to pass through an intermediate via-point at t = 300ms. This evaluation is an abstract approximation of hitting a target, for example, as in playing tennis, and requires a significant change in how the movement is performed relative to the initial trajectory (Figure 2a). The cost function was

r_{300ms} = ( G − y_{t_{300ms}} )²,    φ_{t_N} = 0,

with G = 0.25. Only this single reward was given. For this cost function, the PoWER algorithm can be applied, too, with cost function r̃_{300ms} = exp( −(1/λ) r_{300ms} ) and r̃_t = 0 otherwise. This transformed cost function has the same optimum as r_{300ms}. The resulting learning curves are given in Figure 2 and resemble the previous evaluation: PI² outperforms the gradient algorithms by roughly an order of magnitude, while all the gradient algorithms have almost identical learning curves. As was expected from the similarity of the update equations, PoWER and PI² have in this special case the same performance and are hardly distinguishable in Figure 2. Figure 2a demonstrates that all algorithms pass through the desired target G, but that there are remaining differences between the algorithms in how they approach the target G; these differences have a small numerical effect on the final cost (where PI² and PoWER have the lowest cost), but the differences are hardly task relevant.

5.3 Learning Optimal Performance of a Multi-DOF Via-Point Task

A third evaluation examined the scalability of our algorithms to a high-dimensional and highly redundant learning problem. Again, the learning task was to pass through an intermediate target G, just that a d = 2, 10, or 50 dimensional motor primitive was employed.

Figure 1: Comparison of reinforcement learning of an optimized movement with motor primitives. (a) Position trajectories [rad] over time [s] of the initial trajectory (before learning) and the results of all algorithms (PI², REINFORCE, PG, NAC) after learning; the different algorithms are essentially indistinguishable. (b) The same as (a), just using the velocity trajectories [rad/s]. (c) Average learning curves (cost vs. number of roll-outs) for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. (d) Learning curves for the different algorithms when only two roll-outs are used per update (note that the eNAC cannot work in this case and is omitted).

We assume that the multi-DOF systems model planar robot arms, where d links of equal length l = 1/d are connected in an open chain with revolute joints. Essentially, these robots look like a multi-segment snake in a plane, where the tail of the snake is fixed at the origin of the 2D coordinate system, and the head of the snake can be moved in the 2D plane by changing the joint angles between all the links. Figures 3b,d,f illustrate the movement over time of these robots: the initial position of the robots is when all joint angles are zero and the robot arm completely coincides with the x-axis of the coordinate frame. The goal states of the motor primitives command each DOF to move to a joint angle such that the entire robot configuration afterwards looks like a semi-circle where the most distal link of the robot (the end-effector) touches the y-axis.

Figure 2: Comparison of reinforcement learning of an optimized movement with motor primitives for passing through an intermediate target G. (a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning. (b) Average learning curves for the different algorithms with std error bars from averaging 10 runs for each of the algorithms.

The higher priority task, however, is to move the end-effector through a via-point G = (0.5, 0.5). To formalize this task as a reinforcement learning problem, we denote the joint angles of the robots as $\xi_i$, with $i = 1, 2, \ldots, d$, such that the first line of (30) now reads as $\dot{\xi}_{i,t} = f_{i,t} + g_{i,t}^T(\theta_i + \epsilon_{i,t})$ (this small change of notation is to avoid a clash of variables with the x, y task space of the robot). The end-effector position is computed as:

$x_t = \frac{1}{d} \sum_{i=1}^{d} \cos\Big(\sum_{j=1}^{i} \xi_{j,t}\Big), \qquad y_t = \frac{1}{d} \sum_{i=1}^{d} \sin\Big(\sum_{j=1}^{i} \xi_{j,t}\Big).$

The immediate reward function for this problem is defined as

$r_t = \frac{\sum_{i=1}^{d} (d+1-i)\,\big(0.1\, f_{i,t}^2 + 0.5\, \theta_i^T \theta_i\big)}{\sum_{i=1}^{d} (d+1-i)},$    (39)

$r_{300ms} = \big(0.5 - x_{t_{300ms}}\big)^2 + \big(0.5 - y_{t_{300ms}}\big)^2, \qquad \phi_{t_N} = 0,$

where $r_{300ms}$ is added to $r_t$ at time t = 300 ms, that is, we would like to pass through the via-point at this time. The individual DOFs of the motor primitive were initialized as in the 1 DOF examples above. The cost term in (39) penalizes each DOF for using high accelerations and large parameter vectors, which is a critical component for achieving a good resolution of redundancy in the arm. Equation (39) also has a weighting term (d+1-i) that penalizes DOFs proximal to the origin more than those that are distal to the origin; intuitively, applied to human arm movements, this would mean that wrist movements are cheaper than shoulder movements, which is motivated by the fact that the wrist has much lower mass and inertia and is thus energetically more efficient to move.
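For concreteness, here is a small Python sketch (ours, not from the paper) of the planar-arm forward kinematics and the cost terms above; `f_t` stands for the per-DOF acceleration-like forcing term, `theta` holds the d stacked parameter vectors, and the via-point term is simply the squared end-effector distance to G = (0.5, 0.5), as written above.

```python
import numpy as np

def end_effector(xi_t):
    """End-effector (x, y) of a d-link planar arm with link lengths 1/d."""
    d = len(xi_t)
    cum = np.cumsum(xi_t)                # inner sums  sum_{j=1}^{i} xi_{j,t}
    return np.cos(cum).sum() / d, np.sin(cum).sum() / d

def immediate_cost(f_t, theta):
    """Weighted cost of Eq. (39): proximal DOFs (small i) cost more."""
    d = len(f_t)
    w = d + 1 - np.arange(1, d + 1)      # weighting term (d + 1 - i)
    per_dof = 0.1 * f_t**2 + 0.5 * np.einsum('ij,ij->i', theta, theta)
    return (w * per_dof).sum() / w.sum()

def viapoint_cost(xi_300ms, G=(0.5, 0.5)):
    """Squared end-effector distance to the via-point, added at t = 300 ms."""
    x, y = end_effector(xi_300ms)
    return (G[0] - x) ** 2 + (G[1] - y) ** 2

# Example with a 10-DOF arm in a semi-circle-like configuration:
d = 10
xi = np.full(d, np.pi / d)               # equal relative joint angles
print(end_effector(xi))                  # distal link ends up near the y-axis
print(viapoint_cost(xi))
print(immediate_cost(np.zeros(d), np.zeros((d, 10))))
```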

The results of this experiment are summarized in Figure 3. The learning curves in the left column demonstrate again that PI² has an order of magnitude faster learning performance than the other algorithms, irrespective of the dimensionality. PI² also converges to the lowest cost in all examples. (Table: final costs, mean ± std, for PI², REINFORCE, PG, and NAC on the 2-, 10-, and 50-DOF tasks; the numeric entries were not recoverable from this transcription.)

Figure 3 also illustrates the path taken by the end-effector before and after learning. All algorithms manage to pass through the via-point G appropriately, although the path, particularly before reaching the via-point, can be quite different across the algorithms. Given that PI² reached the lowest cost with low variance in all examples, it appears to have found the best solution. We also added a stroboscopic sketch of the robot arm for the PI² solution, which proceeds from the very right to the left as a function of time. It should be emphasized that there was absolutely no parameter tuning needed to achieve the PI² results, while all gradient algorithms required readjusting of learning rates for every example to achieve their best performance.

5.4 Application to Robot Learning

Figure 4 illustrates our application to a robot learning problem. The robot dog is to jump across a gap. The jump should make as much forward progress as possible, as this is a maneuver in a legged locomotion competition which scores the speed of the robot (note that we only used a physical simulator of the robot for this experiment, as the actual robot was not available). The robot has three DOFs per leg, and thus a total of d = 12 DOFs. Each DOF was represented as a DMP with 50 basis functions. An initial seed behavior (Figure 5, top) was taught by learning from demonstration, which allowed the robot to barely reach the other side of the gap without falling into it (the demonstration was generated from a manual adjustment of spline nodes in a spline-based trajectory plan for each leg). PI² learning used primarily the forward progress as a reward, and slightly penalized the squared acceleration of each DOF and the length of the parameter vector. Additionally, a penalty was incurred if the yaw or the roll exceeded a threshold value; these penalties encouraged the robot to jump straight forward and not to the side, and not to fall over.
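As a rough illustration of this policy representation, the sketch below (ours; a simplified single-DOF forcing term with normalized Gaussian bases, not the paper's full DMP equations, which also include a phase variable and transformation-system dynamics) shows how a 50-basis-function DMP parameterization is evaluated as $g_t^T(\theta + \epsilon_t)$ with parameter-space exploration noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis = 50                                  # as used per DOF for the dog

centers = np.linspace(0.0, 1.0, n_basis)      # basis centers over a phase s
bandwidth = 2.0 * n_basis                     # shared width (our choice)

def basis(s):
    """Normalized Gaussian basis activations g_t at phase s in [0, 1]."""
    psi = np.exp(-bandwidth * (s - centers) ** 2)
    return psi / psi.sum()

def policy(s, theta, sigma=0.1):
    """Noisy policy evaluation g_t^T (theta + eps) for one DOF.

    eps is parameter-space exploration noise, the only open parameter of
    PI2-style learning.
    """
    eps = sigma * rng.standard_normal(n_basis)
    return basis(s) @ (theta + eps), eps

theta = np.zeros(n_basis)
value, eps = policy(0.3, theta)
```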

Figure 3: Comparison of learning multi-DOF movements (2, 10, and 50 DOFs) with planar robot arms passing through a via-point G. (a, c, e) illustrate the learning curves for the different RL algorithms, while (b, d, f) illustrate the end-effector movement after learning for all algorithms. Additionally, (b, d, f) also show the initial end-effector movement before learning to pass through G, and a stroboscopic visualization of the arm movement for the final result of PI² (the movements proceed in time, starting at the very right and ending by almost touching the y-axis).

Figure 4: Reinforcement learning of optimizing to jump over a gap with a robot dog. (a) Real and simulated robot dog. (b) Learning curve for the dog jump with PI² (±std). The improvement in cost corresponds to about a 15 cm improvement in jump distance, which changed the robot's behavior from an initial barely successful jump to a jump that completely traversed the gap with the entire body. This learned behavior allowed the robot to traverse a gap at much higher speed in a competition on learning locomotion. The experiments for this paper were conducted only on the robot simulator.

The exact cost function is:

$r_t = r_{roll} + r_{yaw} + \sum_{i=1}^{d} \big(a_1 f_{i,t}^2 + 0.5\, a_2\, \theta_i^T \theta_i\big), \qquad a_1 = 10^{-6}, \; a_2 = 10^{-8},$

$r_{roll} = \begin{cases} 100\,(|roll_t| - 0.3)^2, & \text{if } |roll_t| > 0.3 \\ 0, & \text{otherwise,} \end{cases}$

$r_{yaw} = \begin{cases} 100\,(|yaw_t| - 0.1)^2, & \text{if } |yaw_t| > 0.1 \\ 0, & \text{otherwise,} \end{cases}$

$\phi_{t_N} = 50000\,(goal - x_{nose})^2,$

where $roll_t$ and $yaw_t$ are the roll and yaw angles of the robot's body, and $x_{nose}$ is the position of the front tip (the "nose") of the robot in the forward direction, which is the direction towards the goal. The multipliers for each reward component were tuned to have a balanced influence of all terms. Ten learning trials were performed initially for the first parameter update. The best 5 trials were kept, and five additional new trials were performed for the second and all subsequent updates. Essentially, this method performs importance sampling, as the rewards for the 5 trials in memory were re-computed with the latest parameter vectors.
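The re-use of previous roll-outs can be sketched as follows (Python; a schematic of the keep-the-best-5 scheme described above, with `evaluate_cost` standing in for a full simulator roll-out and the weighting kept deliberately generic):

```python
import numpy as np

rng = np.random.default_rng(1)

def update(theta, memory, evaluate_cost,
           n_new=5, n_keep=5, sigma=0.05, lam=0.1):
    """One parameter update with roll-out re-use.

    memory holds (eps, cost) pairs from earlier updates; their costs are
    re-computed under the current theta, which is what makes the re-use
    an importance-sampling scheme.
    """
    # Re-evaluate the kept roll-outs with the latest parameter vector.
    memory = [(eps, evaluate_cost(theta + eps)) for eps, _ in memory]
    # Add fresh exploration roll-outs.
    for _ in range(n_new):
        eps = sigma * rng.standard_normal(theta.shape)
        memory.append((eps, evaluate_cost(theta + eps)))
    # Keep only the best n_keep roll-outs for the next update.
    memory = sorted(memory, key=lambda m: m[1])[:n_keep]
    # Cost-weighted averaging of the perturbations.
    S = np.array([c for _, c in memory])
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    theta = theta + sum(wi * eps for wi, (eps, _) in zip(w, memory))
    return theta, memory
```

A first update would seed `memory` with the ten initial roll-outs described above.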

Figure 5: Sequence of images from the simulated robot dog jumping over a 14 cm gap. Top: before learning. Bottom: after learning. While the two sequences look quite similar at first glance, it is apparent that in the 4th frame the robot's body is significantly higher in the air, such that after landing, the body of the dog made about 15 cm more forward progress than before. In particular, the entire robot's body comes to rest on the other side of the gap, which allows for an easy transition to walking. In contrast, before learning, the robot's body and its hind legs are still on the right side of the gap, which does not allow for a successful continuation of walking.

A total of 100 trials was performed per run, and ten runs were collected for computing the means and standard deviations of the learning curves. Figure 4 illustrates that after about 30 trials (i.e., 5 updates), the performance of the robot had converged and significantly improved, such that after the jump, almost the entire body was lying on the other side of the gap. Figure 5 captures the temporal performance in a sequence of snapshots of the robot. It should be noted that applying PI² was algorithmically very simple, and that manual tuning only focused on generating a good cost function, which is a different research topic beyond the scope of this paper.

6 Discussion

This paper derived a more general version of stochastic optimal control with path integrals, based on the original work by Kappen (2007) and Broek et al. (2008). The key results were presented in Table 1 and Section 2.5, which considered how to compute the optimal controls for a general class of stochastic control systems with state-dependent control transition matrix. One important class of these systems can be interpreted in the framework of reinforcement learning with parameterized policies. For this class, we derived Policy Improvement with Path Integrals (PI²) as a novel algorithm for learning a parameterized policy. PI² inherits its sound foundation in first-order principles of stochastic optimal control from the path integral formalism. It is a probabilistic learning method without open algorithmic tuning parameters, except for the exploration noise. In our evaluations, PI² outperformed gradient algorithms significantly. It is also numerically simpler, and allows easier cost function design, than previous probabilistic RL methods that require that immediate rewards are pseudo-probabilities. The similarity of PI² to algorithms based on probability matching indicates that the principle of probability matching seems to approximate a stochastic optimal control framework. Our evaluations demonstrated that PI² can scale to high-dimensional control systems, unlike many other reinforcement learning systems. Some issues, however, deserve more detailed discussion in the following paragraphs.
