Reinforcement Learning for Robotic Locomotion
Bo Liu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA
Huanzhong Xu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA
Songze Li, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA

Abstract

In Reinforcement Learning, it is usually more convenient to optimize in the policy space π(a|s) than to form a policy indirectly by accurately evaluating the state-action function Q(s, a) or the value function V(s), because the value function does not prescribe actions and the state-action function can be very hard to estimate in continuous or large action spaces. Therefore, in environments that suffer from a high cost of evaluation, such as robotic locomotion systems, policy optimization methods are a better option due to the significantly smaller policy space. In this project, we investigate a particular policy optimization method, the Trust Region Policy Optimization algorithm, explore possible variants of it, and propose a new method based on our understanding of this algorithm.

1 Introduction

The original motivation of this project comes from the desire to understand human motor control units in the field of bioengineering. Controlling a simulated human body to achieve complex motion using an artificial brain would provide not only a better understanding of how the human motor system works but also a theoretical foundation for surgeries for people with physical disabilities. Therefore, we focus on finding efficient Reinforcement Learning algorithms for environments with a high evaluation cost. We eventually decided to explore policy gradient methods because searching in policy space can be much more efficient. In particular, we investigate recent work called Trust Region Policy Optimization (Schulman et al., 2015), in which the authors formulate the reinforcement learning objective as an optimization problem subject to a trust-region constraint. This algorithm works well in practice and has a theoretical guarantee of improvement in each episode if the objective and constraint are evaluated exactly. In the original paper, the trust region corresponds to the region that is close to the old policy in the policy space.
The authors define closeness in terms of the Kullback-Leibler (KL) divergence between the old and new policy distributions, e.g. D_KL(π_old(·|s) || π_new(·|s)). In fact, during the optimization step, it uses a second-order approximation of this KL constraint, which involves the Hessian of the KL divergence. Although the conjugate gradient method suggested by the authors allows us to avoid computing the Hessian exactly, in general this is still computationally expensive. Therefore, we ask to what extent the KL constraint can outperform other easy-to-compute constraints, or even no constraint at all, to compensate for its cost. Based on our understanding of this model, we also view the problem from another perspective and propose a new method that estimates advantage values using a neural net and updates the policy directly.

2 Related Work

There are mainly two approaches to the proposed problem in reinforcement learning: policy gradient methods and Q-function methods. A policy gradient method directly optimizes a parametrized control policy by gradient descent. It belongs to the class of policy search techniques
that maximize the expected return of a policy within a fixed policy class, while traditional value-function approximation approaches derive policies from a value function. Policy gradient methods allow the straightforward incorporation of domain knowledge into the policy parametrization and require significantly fewer parameters to represent the optimal policy than the corresponding value function. They are guaranteed to converge to at least a locally optimal policy. Furthermore, they can handle continuous states and actions, even with imperfect state information. Besides the vanilla policy gradient method, there exist variants such as the natural policy gradient (Kakade, 2002) and algorithms that use a trust region (Schulman et al., 2015).

Instead of parameterizing the policy, Q-function methods focus on the state-action function Q^π(s_t, a_t) and update it with the Bellman equation. The optimal Q is obtained via value iteration, in which we repeatedly apply the Bellman operator until it converges. It has been shown that under some mild assumptions, for any initial Q, this value iteration algorithm is guaranteed to converge to Q^{π*}, where π* is the optimal policy (Baird et al., 1995). Variants of this basic value iteration algorithm include neural fitted Q-iteration, which parameterizes the Q-function with a neural network and replaces the Bellman operator with minimizing the MSE between two Q-functions. Q-function methods do not work as generally as policy gradient methods, although they are more sample efficient when they do work.

Algorithm                | Simple & Scalable | Data Efficient
Vanilla Policy Gradient  | Good              | Bad
Natural Policy Gradient  | Bad               | OK
Q-learning               | Good              | OK

Table 1: Comparison of different algorithms

3 Notation

To make the further derivations and descriptions clear, we provide our notation for a classic Markov Decision Process (MDP) and review the conventional reinforcement learning objective. An MDP is a 6-tuple (S, A, T, r, ρ_0, γ), where S is the set of possible states, A is the set of possible actions, T : S × A × S → R is the transition probability, r : S → R is the reward function, ρ_0 : S → R is the distribution of the initial state s_0, and γ is the discount factor. Let π denote a stochastic policy π : S × A → [0, 1].
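As a concrete illustration of the value iteration described in Section 2, the sketch below repeatedly applies the Bellman operator to a toy tabular MDP until it reaches a fixed point. The 2-state, 2-action MDP (transition tensor T, reward r, discount γ) is entirely made up for illustration and is unrelated to our experiments.

```python
import numpy as np

# A minimal tabular Q-value-iteration sketch on a made-up 2-state, 2-action MDP.
n_states, n_actions, gamma = 2, 2, 0.9
T = np.array([  # T[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
r = np.array([1.0, 0.0])  # reward depends only on the state, r(s)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman operator: Q(s,a) <- r(s) + gamma * sum_s' T(s,a,s') max_a' Q(s',a')
    Q_new = r[:, None] + gamma * (T @ Q.max(axis=1))
    if np.abs(Q_new - Q).max() < 1e-8:  # stop once the fixed point is reached
        Q = Q_new
        break
    Q = Q_new

policy = Q.argmax(axis=1)  # greedy policy derived from the converged Q
```

The greedy policy read off the converged Q is optimal for this toy MDP, which is exactly the property that the value-iteration convergence result guarantees.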
The objective of classic RL is to maximize the expected future reward:

η(π) = E_{s_0, a_0, ...} [ Σ_{t=0}^∞ γ^t r(s_t) ]

where s_0 ~ ρ_0(s_0), a_t ~ π(a_t | s_t), and s_{t+1} ~ T(s_{t+1} | s_t, a_t). Let Q^π(s, a) and V^π(s) denote the standard state-action function and value function:

Q^π(s_t, a_t) = E_{s_{t+1}, a_{t+1}, ...} [ Σ_{l=t}^∞ γ^{l-t} r(s_l) ]

V^π(s_t) = E_{a_t} [ Q^π(s_t, a_t) ]

Define the advantage as:

A^π(s, a) = Q^π(s, a) − V^π(s)

In addition, we define ρ_π : S → R as the discounted visitation frequency function:

ρ_π(s) = P(s_0 = s) + γ P(s_1 = s) + γ^2 P(s_2 = s) + ...

4 Method

4.1 Policy Gradient Methods

Policy gradient methods, as the name suggests, try to estimate the derivative of the objective with respect to the policy parameters directly. The most commonly used gradient estimator has the form

ĝ = E_t [ ∇_θ log π_θ(a_t | s_t) Â_t ]

The TRPO algorithm is a specific member of this class of algorithms.

4.2 Trust Region Policy Optimization

The trust region policy optimization (TRPO) method combines the ideas of Minorize-Maximization and policy gradient methods. It defines a surrogate function which is easier to optimize and provides a strict lower bound for the original objective function. But
in order to use this surrogate, the optimization has to be done in a space near the previous policy. Let π_0 denote the baseline policy and π denote any policy. The following identity expresses the expected return of π (Kakade and Langford, 2002):

η(π) = η(π_0) + E_{s_0, a_0, ... ~ π} [ Σ_t γ^t A^{π_0}(s_t, a_t) ]
     = η(π_0) + Σ_t Σ_s P(s_t = s | π) Σ_a π(a|s) γ^t A^{π_0}(s, a)
     = η(π_0) + Σ_s Σ_t γ^t P(s_t = s | π) Σ_a π(a|s) A^{π_0}(s, a)
     = η(π_0) + Σ_s ρ_π(s) Σ_a π(a|s) A^{π_0}(s, a)

By substituting ρ_π with ρ_{π_0}, we get an approximation of the discounted future reward under the new policy π. The new objective is:

L(π) = η(π_0) + Σ_s ρ_{π_0}(s) Σ_a π(a|s) A^{π_0}(s, a)    (1)

It has been pointed out that this approximation matches the original objective η(π) to first order (Schulman et al., 2015). Therefore, a small enough step π_{θ_old} → π that improves L(π_{θ_old}) also improves η(π). But the problem is what "small" means here. The authors suggest using the KL divergence measure and provide a rigorous theoretical proof. For simplicity, we do not include the full derivation here but only the eventual surrogate objective:

max_θ  L_{θ_old}(θ) − C D_KL(θ_old, θ)

where C is some properly chosen constant. However, in practice, if we used the penalty coefficient C recommended by the theory, the step sizes would be too small. To obtain a faster and more robust optimization, a trust region is introduced and the objective becomes maximizing L subject to D_KL lying inside the trust region:

max_θ  L_{θ_old}(θ)
subject to  D_KL(θ_old, θ) ≤ δ

We can further estimate the objective using importance sampling based on the π_old distribution. The constraint can be estimated in the same way using Monte-Carlo sampling and a second-order approximation.

4.3 Model for TRPO

For TRPO and its variants, we use neural networks for both the policy and value models. For both networks, the input is the state observation. The output of the policy model is a multivariate normal distribution over actions, parameterized by its mean and standard deviation. The output of the value model is a single number that is the estimate of the value at this particular state observation.
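The importance-sampled surrogate and the KL trust-region check of Section 4.2 can be sketched for a one-dimensional Gaussian policy as follows. All concrete numbers (the old and new policy parameters, the advantage function, and the threshold δ) are illustrative assumptions, and the closed-form Gaussian KL stands in for the Monte-Carlo plus second-order estimate used in the full algorithm.

```python
import numpy as np

# Sketch: importance-sampled surrogate L and KL trust-region check for a
# 1-D Gaussian policy. All numbers below are made up for illustration.
rng = np.random.default_rng(0)

mu_old, std_old = 0.0, 1.0   # old policy pi_old(a|s) = N(mu_old, std_old^2)
mu_new, std_new = 0.2, 0.9   # candidate new policy parameters

def log_prob(a, mu, std):
    return -0.5 * ((a - mu) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

# Actions sampled from the OLD policy during roll-outs, with made-up advantages.
actions = rng.normal(mu_old, std_old, size=1000)
advantages = -(actions - 0.3) ** 2   # pretend actions near 0.3 are good

# L(theta) ≈ E_{a ~ pi_old}[ pi_new(a|s) / pi_old(a|s) * A(s, a) ]
ratio = np.exp(log_prob(actions, mu_new, std_new) - log_prob(actions, mu_old, std_old))
surrogate = np.mean(ratio * advantages)

# Closed-form KL(pi_old || pi_new) between two univariate Gaussians.
kl = (np.log(std_new / std_old)
      + (std_old ** 2 + (mu_old - mu_new) ** 2) / (2 * std_new ** 2) - 0.5)

delta = 0.01
step_accepted = kl <= delta          # trust-region check
```

Here the candidate step raises the surrogate (the new mean moves toward the good actions) but violates the KL constraint, so a line search in the actual algorithm would shrink the step before accepting it.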
4.4 Substitution for KL Divergence

The basic idea of TRPO is to optimize the surrogate loss function under the restriction that the new policy stays close enough to the old one. However, one may wonder why we have to use the KL divergence to measure the closeness between the two policies. A natural substitution for the KL divergence is the mean-square error (MSE) ||θ_old − θ||_2^2. Replacing the KL constraint with the MSE significantly speeds up the training procedure, since the MSE is easier to estimate. Moreover, we recognize that using the MSE corresponds to the original policy gradient method, because when the MSE is sufficiently small, the direction that maximizes the objective corresponds to its gradient. We experiment with TRPO under both the KL and MSE constraints. As a baseline, we also implement a method that purely optimizes the objective without any constraint. We analyze the comparison, with figures, in the following sections.

4.5 Neural Network Advantage Estimation

Back to equation (1):

L(π) = η(π_0) + Σ_s ρ_{π_0}(s) Σ_a π(a|s) A^{π_0}(s, a)

Since η(π_0) is a constant, maximizing L(π) is equivalent to making L(π) − η(π_0) as positive as possible, for example:

Σ_s ρ_{π_0}(s) Σ_a π(a|s) A^{π_0}(s, a) > 0

The intuition here is that the advantage value A^{π_0}(s, a) is an indication of how good a certain
action is, and we would like to increase the probabilities of better actions, i.e. those with large positive advantage values. In other words, if advantage values can be estimated accurately, we can optimize the problem by directly maximizing the probabilities of good sampled actions. In practice, our actions are sampled from a multivariate Gaussian in a continuous space. We estimate the advantages of sampled actions directly from the Monte-Carlo sampling trajectories using generalized advantage estimation (GAE) (Schulman et al., 2015b), but we also want an estimate of the advantage value if the mean action were taken at each time step. Then, by calculating the difference, whenever the mean action's advantage is smaller (in practice, we require the difference to be larger than a threshold to remove some noise), we mask out the probabilities of those sampled actions and maximize them. Hence, the only remaining problem is how to estimate advantage values for mean actions, the actions that we did not take during roll-out. We solve this problem using a 3-layer feed-forward neural network. The input of the network is the concatenation of the state observation s and the action a; the output is the estimated advantage value. During implementation, we use TRPO for the first few episodes until the MSE between calculated and estimated advantages is smaller than a threshold; the above method starts after this.

5 Experiments

We test the different methods in the MuJoCo simulator, a physics engine from OpenAI Gym. The 3 environments we use have increasing complexity:

- Swimmer: 10-dimensional state space, linear reward
- Hopper: 12-dimensional state space, same reward as Swimmer, with a positive bonus for being in a non-terminal state
- Walker: 18-dimensional state space, same reward as Hopper, with an added penalty for strong impacts of the feet against the ground

6 Results and Analysis

Figure 1: Swimmer: Average reward vs. Episode
Figure 2: Hopper: Average reward vs. Episode
Figure 3: Walker: Average reward vs. Episode
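The GAE estimator used in Section 4.5 can be sketched as the standard backward recursion over TD residuals, δ_t = r_t + γ V(s_{t+1}) − V(s_t), with A_t = Σ_l (γλ)^l δ_{t+l}. The roll-out below (three unit rewards, an all-zero value baseline, and the γ and λ settings) is a made-up example, not data from our runs.

```python
import numpy as np

# Sketch of generalized advantage estimation (GAE).
# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = sum_l (gamma*lam)^l * delta_{t+l}
def gae(rewards, values, gamma=0.99, lam=0.97):
    # `values` has one extra entry, V(s_T), to bootstrap the last step.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):  # backward recursion over the roll-out
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Made-up roll-out: three steps of reward 1 and an all-zero value baseline.
rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(4)           # V(s_0), ..., V(s_3), all zero for illustration
adv = gae(rewards, values)
```

With a zero baseline each δ_t equals the reward, so earlier steps accumulate more discounted residuals and receive larger advantage estimates, as expected.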
As shown above, the KL divergence does outperform its variants. In spite of its high cost, the KL constraint is better at measuring distance between policies. In the MSE case, although intuitively closer parameters correspond to similar policies, in a larger policy space even a small update in parameters can result in a very different policy. In contrast, the KL constraint is estimated directly on the different π(θ), so roll-outs from different policies are guaranteed to be similar. However, the MSE constraint can significantly shorten training time. Our suggestion is therefore a combination: apply TRPO with the MSE constraint to let the model quickly climb to a certain point, then switch to the original TRPO for faster convergence.

Our advantage estimation method did not work at first, and we spent a lot of time debugging. One thing we noticed is that the algorithm tended to show a large MSE loss for advantage estimation after some consecutive epochs of improvement; in other words, this provides evidence that our model works when the estimation is accurate. We conjecture that the unexpected drops result from over-fitting of our advantage neural net: whenever it sees new observations and actions that differ from previous experience, it gives wrong estimates of the advantage values, and the policy updates along the wrong gradient direction. By carefully tuning the L2-regularization constant and dropout for the neural network, the model indeed improves, as shown in Figure 4.

Figure 4: TRPO and advantage estimation

In addition to the analysis above, we also notice an interesting fact about TRPO: despite the average reward increasing in each episode, the algorithm is highly sensitive to initialization, as even with different random seeds the model's performance varies greatly.

Figure 5: TRPO with different random seeds

7 Conclusion and Future Work

In this project, we explored the cutting-edge TRPO algorithm in the class of policy gradient methods and tried out some possible variants of it. We also experimented with our own modification using an advantage estimation neural network. We may continue our investigation into other possible substitutions for the KL divergence. In addition, we would also like to do more error analysis on why even the TRPO model is not robust enough, and find methods to lower the variance.

8 Contributions

Bo Liu: responsible for coding of TRPO, its variants, and the advantage estimation method; participated in the write-up of the project proposal, milestone, poster, and final report; presented the poster.

Huanzhong Xu: responsible for analyzing the cost of conjugate gradient in KL and estimating the feasibility of its variants; helped with debugging; collected data and plotted all figures; participated in the milestone, poster, and final report write-up.
Songze Li: responsible for the project idea and running experiments; participated in the milestone and final report write-up.

References

Leemon Baird et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.

Sham Kakade and John Langford. 2002. Approximately optimal approximate reinforcement learning.

Sham M. Kakade. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint.
More informationarxiv: v1 [stat.ml] 9 Aug 2016
On Lower Bounds for Regret in Reinforcement Lerning In Osbnd Stnford University, Google DeepMind iosbnd@stnford.edu Benjmin Vn Roy Stnford University bvr@stnford.edu rxiv:1608.02732v1 [stt.ml 9 Aug 2016
More informationpositive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues)
Chter Liner Qudrtic Regultor Problem inimize the cot function J given by J x' Qx u' Ru dt R > Q oitive definite ymmetric with oitive eigenvlue oitive emi definite ymmetric with nonnegtive eigenvlue ubject
More informationAn Application of the Generalized Shrunken Least Squares Estimator on Principal Component Regression
An Appliction of the Generlized Shrunken Let Squre Etimtor on Principl Component Regreion. Introduction Profeor Jnn-Huei Jinn Deprtment of Sttitic Grnd Vlley Stte Univerity Allendle, MI 0 USA Profeor Chwn-Chin
More informationBernoulli Numbers Jeff Morton
Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f
More informationImproper Integrals, and Differential Equations
Improper Integrls, nd Differentil Equtions October 22, 204 5.3 Improper Integrls Previously, we discussed how integrls correspond to res. More specificlly, we sid tht for function f(x), the region creted
More informationSection 14.3 Arc Length and Curvature
Section 4.3 Arc Length nd Curvture Clculus on Curves in Spce In this section, we ly the foundtions for describing the movement of n object in spce.. Vector Function Bsics In Clc, formul for rc length in
More informationImproper Integrals. Type I Improper Integrals How do we evaluate an integral such as
Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph
More information1 Linear Least Squares
Lest Squres Pge 1 1 Liner Lest Squres I will try to be consistent in nottion, with n being the number of dt points, nd m < n being the number of prmeters in model function. We re interested in solving
More information7.2 The Definite Integral
7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where
More informationSIMULATION OF TRANSIENT EQUILIBRIUM DECAY USING ANALOGUE CIRCUIT
Bjop ol. o. Decemer 008 Byero Journl of Pure nd Applied Science, ():70 75 Received: Octoer, 008 Accepted: Decemer, 008 SIMULATIO OF TRASIET EQUILIBRIUM DECAY USIG AALOGUE CIRCUIT *Adullhi,.., Ango U.S.
More informationMath 3B Final Review
Mth 3B Finl Review Written by Victori Kl vtkl@mth.ucsb.edu SH 6432u Office Hours: R 9:45-10:45m SH 1607 Mth Lb Hours: TR 1-2pm Lst updted: 12/06/14 This is continution of the midterm review. Prctice problems
More informationReview of basic calculus
Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below
More informationJack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah
1. Born-Oppenheimer pprox.- energy surfces 2. Men-field (Hrtree-Fock) theory- orbitls 3. Pros nd cons of HF- RHF, UHF 4. Beyond HF- why? 5. First, one usully does HF-how? 6. Bsis sets nd nottions 7. MPn,
More informationChapters 4 & 5 Integrals & Applications
Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions
More informationA Fast and Reliable Policy Improvement Algorithm
A Fst nd Relible Policy Improvement Algorithm Ysin Abbsi-Ydkori Peter L. Brtlett Stephen J. Wright Queenslnd University of Technology UC Berkeley nd QUT University of Wisconsin-Mdison Abstrct We introduce
More informationp-adic Egyptian Fractions
p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction
More informationThe Regulated and Riemann Integrals
Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue
More informationExcerpted Section. Consider the stochastic diffusion without Poisson jumps governed by the stochastic differential equation (SDE)
? > ) 1 Technique in Computtionl Stochtic Dynmic Progrmming Floyd B. Hnon niverity of Illinoi t Chicgo Chicgo, Illinoi 60607-705 Excerpted Section A. MARKOV CHAI APPROXIMATIO Another pproch to finite difference
More informationWeek 10: Line Integrals
Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.
More informationThe steps of the hypothesis test
ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of
More informationRecitation 3: More Applications of the Derivative
Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech
More informationContinuous Random Variables
STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 217 Néhémy Lim Continuous Rndom Vribles Nottion. The indictor function of set S is rel-vlued function defined by : { 1 if x S 1 S (x) if x S Suppose tht
More informationMath& 152 Section Integration by Parts
Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible
More informationNear-Bayesian Exploration in Polynomial Time
J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning
More informationMath 270A: Numerical Linear Algebra
Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner
More informationMulti-Armed Bandits: Non-adaptive and Adaptive Sampling
CSE 547/Stt 548: Mchine Lerning for Big Dt Lecture Multi-Armed Bndits: Non-dptive nd Adptive Smpling Instructor: Shm Kkde 1 The (stochstic) multi-rmed bndit problem The bsic prdigm is s follows: K Independent
More informationQuadratic Forms. Quadratic Forms
Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte
More informationNUMERICAL INTEGRATION
NUMERICAL INTEGRATION How do we evlute I = f (x) dx By the fundmentl theorem of clculus, if F (x) is n ntiderivtive of f (x), then I = f (x) dx = F (x) b = F (b) F () However, in prctice most integrls
More informationNumerical Analysis: Trapezoidal and Simpson s Rule
nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =
More information