Barycentric Interpolators for Continuous Space & Time Reinforcement Learning. Robotics Institute, Carnegie Mellon University


Remi Munos & Andrew Moore
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
E-mail: {munos, awm}@cs.cmu.edu
Category: Reinforcement Learning and Control
Preference: oral presentation

Abstract

In order to find the optimal control of continuous state-space and time reinforcement learning (RL) problems, we approximate the value function (VF) with a particular class of functions called the barycentric interpolators. We establish sufficient conditions under which an RL algorithm converges to the optimal VF, even when we use approximate models of the state dynamics and the reinforcement functions.

1 INTRODUCTION

In order to approximate the value function (VF) of a continuous state-space and time reinforcement learning (RL) problem, we define a particular class of functions called the barycentric interpolators, which use some interpolation process based on finite sets of points. This class of functions, including continuous or discontinuous piecewise linear and multi-linear functions, provides us with a general method for designing RL algorithms that converge to the optimal value function. Indeed, these functions permit us to discretize the HJB equation of the continuous control problem by a consistent (and thus convergent) approximation scheme, which is solved by using some model of the state dynamics and the reinforcement functions.

Section 2 defines the barycentric interpolators. Section 3 describes the optimal control problem in the deterministic continuous case. Section 4 states the convergence result for RL algorithms by giving sufficient conditions on the applied model. Section 5 gives some computational issues for this method, and Section 6 describes the approximation scheme used here and proves the convergence result.

2 DEFINITION OF BARYCENTRIC INTERPOLATORS

Let $\Sigma^\delta = \{\xi_i\}_i$ be a set of points distributed at some resolution $\delta$ (see (4) below) on the state space of dimension $d$.

For any state $x$ inside some simplex $(\xi_1, \ldots, \xi_n)$, we say that $x$ is the barycenter of the $\{\xi_i\}_{i=1..n}$ inside this simplex with positive coefficients $p(x|\xi_i)$ of sum 1, called the barycentric coordinates, if $x = \sum_{i=1..n} p(x|\xi_i)\,\xi_i$.

Let $V^\delta(\xi_i)$ be the value of the function at the points $\xi_i$. $V^\delta$ is a barycentric interpolator if, for any state $x$ which is the barycenter of the points $\{\xi_i\}_{i=1..n}$ for some simplex $(\xi_1, \ldots, \xi_n)$, with the barycentric coordinates $p(x|\xi_i)$, we have:

$$V^\delta(x) = \sum_{i=1..n} p(x|\xi_i)\, V^\delta(\xi_i) \tag{1}$$

Moreover, we assume that the simplex $(\xi_1, \ldots, \xi_n)$ is of diameter $O(\delta)$. Let us describe some simple barycentric interpolators:

- Piecewise linear functions defined by some triangulation on the state space (thus defining continuous functions), see figure 1.a, or defined at any $x$ by a linear combination of $(d+1)$ values at any points $(\xi_1, \ldots, \xi_{d+1}) \ni x$ (such functions may be discontinuous at some boundaries), see figure 1.b.
- Piecewise multi-linear functions defined by a multi-linear combination of the $2^d$ values at the vertices of $d$-dimensional rectangles, see figure 1.c. In this case as well, we can build continuous interpolations or allow discontinuities at the boundaries of the rectangles.

An important point is that the convergence result stated in Section 4 does not require the continuity of the function. This permits us to build variable resolution triangulations (see figure 1.b) or grids (figure 1.c) easily.

Figure 1: Some examples of barycentric approximators. These are piecewise continuous (a) or discontinuous (b) linear or multi-linear (c) interpolators.

Remark 1 In the general case, for a given $x$, the choice of a simplex $(\xi_1, \ldots, \xi_n) \ni x$ is not unique (see the two sets of grey and black points in figures 1.b and 1.c), and once the simplex $(\xi_1, \ldots, \xi_n) \ni x$ is defined, if $n > d+1$ (for example in figure 1.c), then the choice of the barycentric coordinates $p(x|\xi_i)$ is also not unique.

Remark 2 Depending on the interpolation method we use, the time needed for computing the values will vary. Following [Dav96], the continuous multi-linear interpolation must process $2^d$ values, whereas the linear continuous interpolation inside a simplex processes $(d+1)$ values in $O(d \log d)$ time.
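The interpolation property (1) can be illustrated with a minimal sketch for the piecewise linear case in dimension $d = 2$, where the simplex is a triangle and the barycentric coordinates are unique. The function names `barycentric_coordinates` and `barycentric_interpolate` are hypothetical and chosen only for this illustration; they do not come from the paper.

```python
def barycentric_coordinates(x, simplex):
    """Barycentric coordinates p(x|xi_i) of a 2-D point x in a triangle.

    `simplex` is a sequence of three vertices (xi_1, xi_2, xi_3), each a
    pair of floats. The three returned coefficients sum to 1 and are all
    non-negative exactly when x lies inside the triangle.
    """
    (x1, y1), (x2, y2), (x3, y3) = simplex
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    p1 = ((y2 - y3) * (x[0] - x3) + (x3 - x2) * (x[1] - y3)) / det
    p2 = ((y3 - y1) * (x[0] - x3) + (x1 - x3) * (x[1] - y3)) / det
    return (p1, p2, 1.0 - p1 - p2)


def barycentric_interpolate(x, simplex, values):
    """Property (1): V(x) = sum_i p(x|xi_i) * V(xi_i)."""
    coords = barycentric_coordinates(x, simplex)
    return sum(p * v for p, v in zip(coords, values))


# Illustrative data (assumed): a unit right triangle and values at its vertices.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
vertex_values = [1.0, 3.0, 5.0]
coords = barycentric_coordinates((0.25, 0.25), triangle)
value = barycentric_interpolate((0.25, 0.25), triangle, vertex_values)
```

Because the vertex values here are samples of the linear function $1 + 2x_1 + 4x_2$, the interpolated value coincides with that function at the query point, which is the expected behaviour of a linear interpolator inside one simplex.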

In comparison to [Gor95], the functions used here are averagers that satisfy the barycentric interpolation property (1). This additional geometric constraint permits us to prove the consistency (see (15) below) of the approximation scheme and thus the convergence to the optimal value in the continuous time case.

3 THE OPTIMAL CONTROL PROBLEM

Let us describe the optimal control problem in the deterministic and discounted case for continuous state-space and time variables, and define the value function that we intend to approximate. We consider a dynamical system whose state dynamics depends on the current state $x(t) \in O$ (the state space, with $O$ an open subset of $\mathbb{R}^d$) and control $u(t) \in U$ (compact subset) by a differential equation:

$$\frac{dx}{dt} = f(x(t), u(t)) \tag{2}$$

From equation (2), the choice of an initial state $x$ and a control function $u(t)$ leads to a unique trajectory $x(t)$ (see figure 2). Let $\tau$ be the exit time from $O$ (with the convention that if $x(t)$ always stays in $O$, then $\tau = \infty$). Then, we define the functional $J$ as the discounted cumulative reinforcement:

$$J(x; u(.)) = \int_0^\tau \gamma^t\, r(x(t), u(t))\, dt + \gamma^\tau R(x(\tau))$$

where $r(x, u)$ is the running reinforcement and $R(x)$ the boundary reinforcement. $\gamma$ is the discount factor ($0 \le \gamma < 1$). We assume that $f$, $r$ and $R$ are bounded and Lipschitzian, and that the boundary $\partial O$ is $C^2$.

RL uses the method of Dynamic Programming (DP), which introduces the value function (VF): the maximal value of $J$ as a function of the initial state $x$:

$$V(x) = \sup_{u(.)} J(x; u(.)).$$

From the DP principle, we deduce that $V$ satisfies a first-order differential equation, called the Hamilton-Jacobi-Bellman (HJB) equation (see [FS93] for a survey):

Theorem 1 If $V$ is differentiable at $x \in O$, let $DV(x)$ be the gradient of $V$ at $x$; then the following HJB equation holds at $x$:

$$H(V, DV, x) \stackrel{\mathrm{def}}{=} V(x) \ln \gamma + \sup_{u \in U} \left[ DV(x) . f(x, u) + r(x, u) \right] = 0 \tag{3}$$

The challenge of RL is to get a good approximation of the VF, because from $V$ we can deduce the optimal control: for state $x$, the control $u^*(x)$ that realizes the supremum in the HJB equation provides an optimal (feed-back) control law.

The following hypothesis is a sufficient condition for $V$ to be continuous within $O$ (see [Bar94]) and is required for proving the convergence result of the next section.

Hyp 1: For $x \in \partial O$, let $\vec{n}(x)$ be the outward normal of $O$ at $x$. We assume that:
- If $\exists u \in U$ s.t. $f(x, u) . \vec{n}(x) \le 0$ then $\exists v \in U$ s.t. $f(x, v) . \vec{n}(x) < 0$.
- If $\exists u \in U$ s.t. $f(x, u) . \vec{n}(x) \ge 0$ then $\exists v \in U$ s.t. $f(x, v) . \vec{n}(x) > 0$.

This means that at the states (if there exist any) where some trajectory is tangent to the boundary, there exists, for some control, a trajectory strictly coming inside and one strictly leaving the state space.
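To make the functional $J$ concrete, here is a minimal sketch that approximates it by explicit Euler integration on a one-dimensional toy problem. Everything specific to the toy problem is an assumption made for illustration and does not come from the paper: the state space $O = (0, 1)$, the dynamics $f(x, u) = u$, a zero running reinforcement, a boundary reinforcement equal to 1 at the right exit and 0 at the left exit, and $\gamma = 0.5$.

```python
GAMMA = 0.5  # discount factor (illustrative value)


def f(x, u):
    """Toy state dynamics dx/dt = u (assumption for this sketch)."""
    return u


def r(x, u):
    """Toy running reinforcement: zero everywhere."""
    return 0.0


def R(x):
    """Toy boundary reinforcement: 1 at the right exit, 0 at the left exit."""
    return 1.0 if x >= 1.0 else 0.0


def discounted_return(x0, control, dt=1e-3, t_max=50.0):
    """Euler approximation of J(x0; u(.)) on O = (0, 1).

    `control` maps a state to a control value. The integral is truncated
    at t_max when the trajectory never leaves O.
    """
    x, t, total = x0, 0.0, 0.0
    while t < t_max:
        if x <= 0.0 or x >= 1.0:
            # Exit from O at time t: add the discounted boundary term.
            return total + GAMMA ** t * R(x)
        u = control(x)
        total += GAMMA ** t * r(x, u) * dt
        x += f(x, u) * dt
        t += dt
    return total


def go_right(x):
    return 1.0


def go_left(x):
    return -1.0


j_right = discounted_return(0.25, go_right)
j_left = discounted_return(0.25, go_left)
```

Under these assumptions, always moving right is optimal, so $V(x) = \gamma^{1-x}$ on $(0, 1)$, and the HJB equation (3) can be checked by hand: $V'(x) = -\gamma^{1-x} \ln \gamma > 0$, hence $V(x) \ln \gamma + \sup_{u \in \{-1, +1\}} V'(x)\, u = \gamma^{1-x} \ln \gamma - \gamma^{1-x} \ln \gamma = 0$.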

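Figure 2: The state space and the set of points $\Sigma^\delta$ (the black dots belong to the interior and the white ones to the boundary). The value at some point $\xi$ is updated, at step $n$, by the discounted value at point $\eta_n \in (\xi_1, \xi_2, \xi_3)$. The main requirement for convergence is that the points $\eta_n$ approximate $\eta$ in the sense: $p(\eta_n|\xi_i) = p(\eta|\xi_i) + O(\delta)$ (i.e. the $\eta_n$ belong to the grey area).

4 THE CONVERGENCE RESULT

Let us introduce the set of points $\Sigma^\delta = \{\xi_i\}$, composed of the interior ($\Sigma^\delta \cap O$) and the boundary ($\partial\Sigma^\delta = \Sigma^\delta \setminus O$), such that its convex hull covers the state space $O$, and performing a discretization at some resolution $\delta$:

$$\forall x \in O, \ \inf_{\xi_i \in \Sigma^\delta \cap O} \|x - \xi_i\| \le \delta \quad \text{and} \quad \forall x \in \partial O, \ \inf_{\xi_j \in \partial\Sigma^\delta} \|x - \xi_j\| \le \delta \tag{4}$$

Moreover, we approximate the control space $U$ by some finite control spaces $U^\delta \subset U$ such that for $\delta \le \delta'$, $U^{\delta'} \subset U^\delta$ and $\lim_{\delta \to 0} U^\delta = U$.

We would like to update the value of any:
- interior point $\xi \in \Sigma^\delta \cap O$ with the discounted values at state $\eta_n(\xi, u)$ (figure 2):

$$V^\delta_{n+1}(\xi) \leftarrow \sup_{u \in U^\delta} \left[ \gamma^{\tau_n(\xi, u)}\, V^\delta_n(\eta_n(\xi, u)) + \tau_n(\xi, u) . r_n(\xi, u) \right] \tag{5}$$

for some state $\eta_n(\xi, u)$, some time delay $\tau_n(\xi, u)$ and some reinforcement $r_n(\xi, u)$.
- boundary point $\xi \in \partial\Sigma^\delta$ with some terminal reinforcement $R_n(\xi)$:

$$V^\delta_{n+1}(\xi) \leftarrow R_n(\xi) \tag{6}$$

The following theorem states that the values $V^\delta_n$ computed by an RL algorithm using the model $\eta_n(\xi, u)$, $\tau_n(\xi, u)$, $r_n(\xi, u)$ and $R_n(\xi)$ (a model is needed because of some a priori partial uncertainty of the state dynamics and the reinforcement functions) converge to the optimal value function as the number of iterations $n \to \infty$ and the resolution $\delta \to 0$.

Let us define the state $\eta(\xi, u)$ (see figure 2):

$$\eta(\xi, u) = \xi + \tau(\xi, u) . f(\xi, u) \tag{7}$$

for some time delay $\tau(\xi, u)$ (with $k_1 \delta \le \tau(\xi, u) \le k_2 \delta$ for some constants $k_1 > 0$ and $k_2 > 0$), and let $p(\eta|\xi_i)$ (resp. $p(\eta_n|\xi_i)$) be the barycentric coordinate of $\eta$ inside a simplex containing it (resp. of $\eta_n$ inside the same simplex). We will write $\eta$, $\eta_n$, $\tau$, $r$, ..., instead of $\eta(\xi, u)$, $\eta_n(\xi, u)$, $\tau(\xi, u)$, $r(\xi, u)$, ... when no confusion is possible.

Theorem 2 Assume that the hypotheses of the previous sections hold, and that for any resolution $\delta$, we use barycentric interpolators $V^\delta_n$ defined on state spaces $\Sigma^\delta$ (satisfying (4)) such that all points of $\Sigma^\delta \cap O$ are regularly updated with rule (5) and all points of $\partial\Sigma^\delta$ are updated with rule (6) at least once. Suppose that $\eta_n$, $\tau_n$, $r_n$ and $R_n$ approximate $\eta$, $\tau$, $r$ and $R$ in the sense:

$$\forall \xi_i, \ p(\eta_n|\xi_i) = p(\eta|\xi_i) + O(\delta) \tag{8}$$
$$\tau_n = \tau + O(\delta^2) \tag{9}$$
$$r_n = r + O(\delta) \tag{10}$$
$$R_n = R + O(\delta) \tag{11}$$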

Then we have $\lim_{n \to \infty,\ \delta \to 0} V^\delta_n = V$ uniformly on any compact $\Omega \subset O$ (i.e. $\forall \varepsilon > 0$, $\forall \Omega$ compact $\subset O$, $\exists \Delta$, $\exists N$, such that $\forall \delta \le \Delta$, $\forall n \ge N$, $\sup_{\Sigma^\delta \cap \Omega} |V^\delta_n - V| \le \varepsilon$).

Remark 3 For a given value of $\delta$, the rule (5) is not a DP updating rule for some Markov Decision Problem (MDP), since the values $\eta_n$, $\tau_n$, $r_n$ depend on $n$. This point is important in the RL framework since it allows on-line improvement of the model of the state dynamics and the reinforcement functions.

Remark 4 This result extends the previous results of convergence obtained by Finite-Element or Finite-Difference methods (see [Mun97]).

This theoretical result can be applied by starting from a rough $\Sigma^\delta$ (high $\delta$) and by combining with the iteration process ($n \to \infty$) some learning process of the model ($\eta_n \to \eta$) and an increasing process of the number of points ($\delta \to 0$).

5 COMPUTATIONAL ISSUES

From (8) we deduce that the method will also converge if we use an approximate barycentric interpolator, defined at any state $x \in (\xi_1, \ldots, \xi_n)$ by the value of the barycentric interpolator at some state $x' \in (\xi_1, \ldots, \xi_n)$ such that $p(x'|\xi_i) = p(x|\xi_i) + O(\delta)$ (see figure 3).

Figure 3: The linear function and the approximation error around it (the grey area). The value of the approximate linear function plotted here at some state $x$ is equal to the value of the linear one at $x'$. Any such approximate barycentric interpolator can be used in (5).

The fact that we need not be completely accurate can be used to our advantage. First, the computation of barycentric coordinates can use very fast approximate matrix methods. Second, the model we use to integrate the dynamics need not be perfect. We can make an $O(\delta^2)$ error, which is useful if we are learning a model from data: we need simply arrange not to gather more data than is necessary for the current $\delta$. For example, if we use nearest neighbor for our dynamics learning, we need to ensure enough data so that every observation is $O(\delta^2)$ from its nearest neighbor. If we use local regression, then a mere $O(\delta)$ density is all that is required [Omo87, AMS97].

6 PROOF OF THE CONVERGENCE RESULT

6.1 Description of the approximation scheme

We use a convergent scheme derived from Kushner (see [Kus90]) in order to approximate the continuous control problem by a finite MDP. The HJB equation is discretized, at some resolution $\delta$, into the following DP equation: for $\xi \in \Sigma^\delta \cap O$,

$$V^\delta(\xi) = F^\delta[V^\delta(.)](\xi) \stackrel{\mathrm{def}}{=} \sup_{u \in U^\delta} \left\{ \gamma^\tau \sum_i p(\eta|\xi_i) . V^\delta(\xi_i) + \tau . r \right\} \tag{12}$$

and for $\xi \in \partial\Sigma^\delta$, $V^\delta(\xi) = R(\xi)$. This is a fixed-point equation and we can prove that, thanks to the discount factor $\gamma$, it satisfies the "strong" contraction property:

$$\sup_{\Sigma^\delta} \left| V^\delta_{n+1} - V^\delta \right| \le \lambda . \sup_{\Sigma^\delta} \left| V^\delta_n - V^\delta \right| \quad \text{for some } \lambda < 1 \tag{13}$$
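The DP equation (12) and its solution by repeated application of $F^\delta$ can be sketched on a one-dimensional toy problem with an exact model. All problem data below are assumptions made for illustration and are not taken from the paper: $O = (0, 1)$, $f(x, u) = u$ with $u \in \{-1, +1\}$, zero running reinforcement, boundary reinforcement 0 at $x = 0$ and 1 at $x = 1$, $\gamma = 0.5$, a uniform grid of resolution $\delta$, and a time delay $\tau = \delta/2$ so that $\eta$ falls strictly between two grid points and the interpolation is actually exercised.

```python
def interpolate(values, x, delta):
    """1-D piecewise linear (barycentric) interpolation on a uniform grid."""
    n = len(values) - 1
    x = min(max(x, 0.0), n * delta)      # clamp to the grid's convex hull
    i = min(int(x / delta), n - 1)       # index of the left vertex
    p = (x - i * delta) / delta          # barycentric coordinate of the right vertex
    return (1.0 - p) * values[i] + p * values[i + 1]


def running_reinforcement(x, u):
    """Toy running reinforcement r(x, u): zero everywhere (assumption)."""
    return 0.0


def value_iteration(n_cells=20, gamma=0.5, tol=1e-10, max_iter=100000):
    """Iterate the discretized DP operator until the sup-norm change is below tol."""
    delta = 1.0 / n_cells
    tau = delta / 2.0                    # satisfies k1*delta <= tau <= k2*delta
    controls = (-1.0, 1.0)
    values = [0.0] * (n_cells + 1)
    values[0], values[-1] = 0.0, 1.0     # boundary points hold R(xi)
    for _ in range(max_iter):
        new_values = values[:]
        for i in range(1, n_cells):      # interior points only
            xi = i * delta
            new_values[i] = max(
                gamma ** tau * interpolate(values, xi + tau * u, delta)
                + tau * running_reinforcement(xi, u)
                for u in controls
            )
        change = max(abs(a - b) for a, b in zip(new_values, values))
        values = new_values
        if change < tol:
            break
    return delta, values


delta, values = value_iteration()
```

For this toy problem the exact value function is $\gamma^{1-x}$ on $(0, 1)$, so the computed grid values can be compared with it directly; the remaining gap is the discretization error of the scheme at resolution $\delta$, not an iteration error.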

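From the contraction property (13) we deduce that there exists exactly one solution $V^\delta$ to the DP equation, which can be computed by some value iteration process: for any initial $V^\delta_0$, we iterate $V^\delta_{n+1} \leftarrow F^\delta[V^\delta_n]$. Thus for any resolution $\delta$, the values $V^\delta_n \to V^\delta$ as $n \to \infty$.

Moreover, as $V^\delta$ is a barycentric interpolator and from the definition (7) of $\eta$,

$$F^\delta[V^\delta(.)](\xi) = \sup_{u \in U^\delta} \left\{ \gamma^\tau\, V^\delta(\xi + \tau . f(\xi, u)) + \tau . r \right\} \tag{14}$$

from which we deduce that the scheme $F^\delta$ is consistent: in a formal sense,

$$\limsup_{\delta \to 0} \frac{1}{\tau} \left| F^\delta[W](x) - W(x) \right| \sim H(W, DW, x) \tag{15}$$

and obtain, from the general convergence theorem of [BS91] (and a result of strong uniqueness obtained from Hyp 1), the convergence of the scheme: $V^\delta \to V$ as $\delta \to 0$.

6.2 Use of the "weak contraction" result of convergence

Since in the RL approach used here we only have an approximation $\eta_n$, $\tau_n$, ... of the true values $\eta$, $\tau$, ..., the strong contraction property (13) does not hold any more. However, in previous work ([Mun98]), we have proven the convergence for some weakened conditions, recalled here:

If the values $V^\delta_n$ updated by some algorithm satisfy the "weak" contraction property with respect to a solution $V^\delta$ of a convergent approximation scheme (such as the previous one (12)):

$$\sup_{\Sigma^\delta \cap O} \left| V^\delta_{n+1} - V^\delta \right| \le (1 - k . \delta) . \sup_{\Sigma^\delta} \left| V^\delta_n - V^\delta \right| + o(\delta) \tag{16}$$
$$\sup_{\partial\Sigma^\delta} \left| V^\delta_{n+1} - V^\delta \right| = O(\delta) \tag{17}$$

for some positive constant $k$ (with the notation $f(\delta) \le o(\delta)$ iff $\exists g(\delta) = o(\delta)$ with $f(\delta) \le g(\delta)$), then we have $\lim_{n \to \infty,\ \delta \to 0} V^\delta_n = V$ uniformly on any compact $\Omega \subset O$ (i.e. $\forall \varepsilon > 0$, $\forall \Omega$ compact $\subset O$, $\exists \Delta$ and $N$ such that $\forall \delta \le \Delta$, $\forall n \ge N$, $\sup_{\Sigma^\delta \cap \Omega} |V^\delta_n - V| \le \varepsilon$).

6.3 Proof of theorem 2

We are going to use the approximations (8), (9), (10) and (11) to deduce that the weak contraction property holds, and then use the result of the previous section to prove theorem 2.

The proof of (17) is immediate since, from (6) and (11), we have: $\forall \xi \in \partial\Sigma^\delta$, $|V^\delta_{n+1}(\xi) - V^\delta(\xi)| = |R_n(\xi) - R(\xi)| = O(\delta)$.

Now we need to prove (16). Let us estimate the error $E_n(\xi) = V^\delta(\xi) - V^\delta_n(\xi)$ between the value $V^\delta$ of the DP equation (12) and the values $V^\delta_n$ computed by rule (5) after one iteration:

$$E_{n+1}(\xi) = \sup_{u \in U^\delta} \left\{ \gamma^\tau \sum_i p(\eta|\xi_i) . V^\delta(\xi_i) - \gamma^{\tau_n} \sum_i p(\eta_n|\xi_i) . V^\delta_n(\xi_i) + \tau . r - \tau_n . r_n \right\}$$

$$E_{n+1}(\xi) = \sup_{u \in U^\delta} \Big\{ \gamma^\tau \sum_i \left[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \right] V^\delta(\xi_i) + \left[ \gamma^\tau - \gamma^{\tau_n} \right] \sum_i p(\eta_n|\xi_i) . V^\delta(\xi_i) + \gamma^{\tau_n} \sum_i p(\eta_n|\xi_i) . \left[ V^\delta(\xi_i) - V^\delta_n(\xi_i) \right] + \tau . \left[ r - r_n \right] + \left[ \tau - \tau_n \right] r_n \Big\}$$

By using (9) (from which we deduce: $\gamma^\tau = \gamma^{\tau_n} + O(\delta^2)$) and (10), we deduce:

$$|E_{n+1}(\xi)| \le \sup_{u \in U^\delta} \left\{ \gamma^\tau . \Big| \sum_i \left[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \right] V^\delta(\xi_i) \Big| + \gamma^{\tau_n} . \Big| \sum_i p(\eta_n|\xi_i) . \left[ V^\delta(\xi_i) - V^\delta_n(\xi_i) \right] \Big| \right\} + O(\delta^2) \tag{18}$$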
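As a reading aid, the three terms absorbed into the $O(\delta^2)$ remainder of (18) can be bounded one by one. This is a sketch of the intermediate step under the stated assumptions (9) and (10), together with the assumptions that $V^\delta$ and $r$ are bounded, that $0 < \gamma < 1$, and that $\tau = O(\delta)$; it is not quoted from the paper.

$$\Big| \left[ \gamma^\tau - \gamma^{\tau_n} \right] \sum_i p(\eta_n|\xi_i)\, V^\delta(\xi_i) \Big| \le \ln\tfrac{1}{\gamma}\, |\tau - \tau_n| \, \sup |V^\delta| = O(\delta^2),$$

because the coordinates $p(\eta_n|\xi_i)$ are positive with sum 1 and $|\gamma^a - \gamma^b| \le \ln\tfrac{1}{\gamma}\, |a - b|$ for $a, b \ge 0$;

$$\big| \tau . \left[ r - r_n \right] \big| = O(\delta) . O(\delta) = O(\delta^2), \qquad \big| \left[ \tau - \tau_n \right] r_n \big| = O(\delta^2) . O(1) = O(\delta^2).$$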

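From the basic properties of the coefficients $p(\eta|\xi_i)$ and $p(\eta_n|\xi_i)$ (both sets of coefficients sum to 1), we have:

$$\sum_i \left[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \right] V^\delta(\xi_i) = \sum_i \left[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \right] \left[ V^\delta(\xi_i) - V^\delta(\xi) \right] \tag{19}$$

Moreover, $|V^\delta(\xi_i) - V^\delta(\xi)| \le |V^\delta(\xi_i) - V(\xi_i)| + |V(\xi_i) - V(\xi)| + |V(\xi) - V^\delta(\xi)|$. From the convergence of the scheme $V^\delta$, we have $\sup_{\Sigma^\delta \cap \Omega} |V^\delta - V| \to 0$ as $\delta \downarrow 0$ for any compact $\Omega \subset O$, and from the continuity of $V$ and the fact that the support of the simplex $\{\xi_i\} \ni \eta$ is $O(\delta)$, we have $\sup_{\Sigma^\delta \cap \Omega} |V(\xi_i) - V(\xi)| \to 0$ as $\delta \downarrow 0$, and deduce that $\sup_{\Sigma^\delta \cap \Omega} |V^\delta(\xi_i) - V^\delta(\xi)| \to 0$ as $\delta \downarrow 0$. Thus, from (19) and (8), we obtain:

$$\Big| \sum_i \left[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \right] V^\delta(\xi_i) \Big| = o(\delta) \tag{20}$$

The "weak" contraction property (16) holds: from the property of the exponential function $\gamma^{\tau_n} \le 1 - \frac{\tau_n}{2} \ln\frac{1}{\gamma}$ for small values of $\tau_n$, from (9) and the fact that $\tau \ge k_1 \delta$, we deduce that $\gamma^{\tau_n} \le 1 - \frac{k_1 \delta}{2} \ln\frac{1}{\gamma} + O(\delta^2)$, and from (18) and (20) we deduce that:

$$\left| V^\delta_{n+1}(\xi) - V^\delta(\xi) \right| \le (1 - k . \delta) \sup_{\Sigma^\delta} \left| V^\delta_n - V^\delta \right| + o(\delta)$$

with $k = \frac{k_1}{2} \ln\frac{1}{\gamma}$, and the property (16) holds. Thus the "weak contraction" result of convergence (described in section 6.2) applies and convergence occurs.

FUTURE WORK

This work proves the convergence to the optimal value as the resolution tends to the limit, but does not provide us with the rate of convergence. Our future work will focus on defining upper bounds of the approximation error, especially for variable resolution discretizations, and we will also consider the stochastic case.

References

[AMS97] C. G. Atkeson, A. W. Moore, and S. A. Schaal. Locally Weighted Learning. AI Review, 11:11-73, April 1997.
[Bar94] Guy Barles. Solutions de viscosité des équations de Hamilton-Jacobi, volume 17 of Mathématiques et Applications. Springer-Verlag, 1994.
[BS91] Guy Barles and P.E. Souganidis. Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic Analysis, 4:271-283, 1991.
[Dav96] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. Advances in Neural Information Processing Systems, 8, 1996.
[FS93] Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer-Verlag, 1993.
[Gor95] G. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
[Kus90] Harold J. Kushner. Numerical methods for stochastic control problems in continuous time. SIAM J. Control and Optimization, 28:999-1048, 1990.
[Mun97] Remi Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. International Joint Conference on Artificial Intelligence, 1997.
[Mun98] Remi Munos. A general convergence theorem for reinforcement learning in the continuous case. European Conference on Machine Learning, 1998.
[Omo87] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273-347, 1987.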