A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots


International Conference on Control, Automation and Systems 2008
Oct. 14-17, 2008 in COEX, Seoul, Korea

Masanao Obayashi, Kenichiro Narita, Takashi Kuremoto and Kunikazu Kobayashi
Division of Computer Science & Design Engineering, Yamaguchi University, Ube, Japan
(Tel: +81-836-85-958; E-mail: {m.obayas, wu, koba}@yamaguchi-u.ac.jp)

Abstract: A human learns incidents through his own actions and reflects them in subsequent actions as his own experiences. These experiences are memorized in his brain and recollected when necessary. This research incorporates such an intelligent information processing mechanism and applies it to an autonomous agent that has three main functions: learning, memorization and associative recollection. In the proposed system, an actor-critic type reinforcement learning method is used for learning. An auto-associative chaotic neural network is also used as a mutual associative memory system. Moreover, the memory part has an adaptive hierarchical layered structure of memory modules consisting of chaotic neural networks, in consideration of the adjustment to non-MDP (Markov Decision Process) environments. Finally, the effectiveness of the proposed method is verified through simulation applied to the maze-searching problem.

Keywords: Reinforcement learning, chaotic neural network, hierarchical memory structure, autonomous robot.

1. INTRODUCTION

Reinforcement learning (R.L.) is a framework for an agent to learn the choice of an optimal action based on a reinforcement signal [1]. It has been applied to a variety of problems such as autonomous robot navigation, non-linear control and so on. However, most systems with R.L. so far have been built for use on only one task. Little research has been done on systems that memorize the results of learning many tasks and apply them to other tasks without relearning. In this study, we use the associative chaotic neural network (ACNN) proposed by Aihara et al. [2] as a storage mechanism for the results of R.L. However, since the storage capacity of an ACNN is small, it is not suitable for working alone.
So, to resolve this problem, we build a hierarchical memory structure making use of ACNNs: a short-term memory for the present learning result and a long-term memory for many useful learning results. Another characteristic of the proposed system is that it is capable of dealing with non-MDP problems to some degree, because of the chaotic association ability of the ACNN. Finally, it is verified through computer simulation of maze-searching problems that the proposed method is useful.

2. PROPOSED SYSTEM STRUCTURE

The proposed system consists of two parts: memory and learning. The memory consists of a short-term memory (S.T.M.) and a long-term memory (L.T.M.). Fig. 1 shows the overall structure.

Learning sector: an actor-critic system is adopted. It learns the choice of action to maximize the total predicted reward obtained over the future, considering the environmental information (s) and the reward (r) resulting from the action (a).

S.T.M. sector: it memorizes the learning path of the information (environmental information and action) obtained in the learning part. Unnecessary information is forgotten and useful information is stored.

L.T.M. sector: it memorizes only the sufficiently sophisticated and useful experience in the S.T.M.

[Fig. 1: Proposed system. The autonomous agent contains the learning sector (actor-critic system) and the memory (S.T.M. and L.T.M.); the environment supplies the input s(t) and the reward r(t), the agent outputs the action a(t), and the contents of the S.T.M. (pairs of action and environmental information) are passed to the L.T.M.]

3. ACTOR-CRITIC REINFORCEMENT LEARNING SYSTEM

The actor-critic reinforcement learning system is shown in Fig. 2.

[Fig. 2: The construction of the actor-critic system.]

3.1 Structure and learning of the critic

3.1.1 Structure

The function of the critic is the calculation of P(t), the predicted value of the sum of the discounted rewards to be obtained over the future, and of its prediction error. These are briefly explained as follows. The sum of the discounted rewards that will be obtained over the future is defined as V(t):

V(t) = Σ_{n=0}^{∞} γ^n r(t+n),   (1)

where γ (0 ≤ γ < 1) is a constant called the discount rate. Eq. (1) is rewritten as

V(t) = r(t) + γ V(t+1).   (2)

Here the predicted value of V(t) is defined as P(t). The prediction error r̂(t) is expressed as follows:

r̂(t) = r(t) + γ P(t+1) − P(t).   (3)

The parameters of the critic are adjusted to reduce this prediction error r̂(t). The predicted value P(t) is calculated as follows:

P(t) = Σ_{j=0}^{J} ω_j^c y_j(t),   (4)

y_j(t) = exp( −Σ_{i=1}^{n} (x_i(t) − m_{ij})^2 / σ_{ij}^2 ).   (5)

Here ω_j^c: weight of the jth output; y_j: jth output of the middle layer of the critic; x_i: ith input; m_{ij}, σ_{ij}: center and dispersion, respectively, of the jth basis function for the ith input; J: the number of nodes in the middle layer of the critic. The critic is thus constructed as a radial basis function network (RBFN), as shown in Fig. 3.

3.1.2 Learning

Learning of the critic is done by the commonly used back-propagation method, which drives the prediction error r̂(t) toward zero. The updating rule of the parameters is as follows:

Δω_j^c = η_c r̂(t) ∂P(t)/∂ω_j^c,  (j = 1, …, J).   (6)

3.2 Structure and learning of the actor

3.2.1 Structure

Fig. 4 shows the construction of the actor. The actor is also basically a radial basis function network. The jth basis function of the middle-layer nodes is as follows:

y_j = exp( −Σ_{i=1}^{n} (x_i − m_{ij})^2 / σ_{ij}^2 ),   (7)

u_k(t) = Σ_{j=1}^{J} ω_{jk} y_j(t) + n_k(t),  (k = 1, …, K).   (8)

Here y_j: jth output of the middle layer of the actor; m_{ij}, σ_{ij}: center and dispersion, respectively, of the jth basis function for the ith input; K: the number of actions; n_k: additive noise; u_k: representative value of the kth action; ω_{jk}: connection weight from the jth node of the middle layer to the kth output.

3.2.2 Noise generator

The noise generator gives the output of the actor diversity by adding noise to it, which realizes learning by trial and error.
Calculation of the noise n_k(t) is as follows:

n_k(t) = noise_k(t) min{1, exp(−P(t))},   (9)

where noise_k(t) is a uniformly random number in [−1, 1]. As P(t) becomes bigger, the noise becomes smaller. This leads to stable learning of the actor.

3.2.3 Learning

The parameters of the actor, ω_{jk} (j = 1, …, J, k = 1, …, K), are adjusted by using the output u_k of the actor and the noise n_k:

Δω_{jk} = η_a n_k(t) r̂(t) ∂u_k(t)/∂ω_{jk},   (10)

where η_a (> 0) is the learning coefficient. Eq. (10) means that (n_k(t) r̂(t)) is considered as an error, and ω_{jk} is adjusted opposite to the sign of (n_k(t) r̂(t)).

[Fig. 3: Structure of the critic. Fig. 4: Structure of the actor.]
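Sections 3.1 and 3.2 can be condensed into a short sketch. The sizes n = 5, J = 3, K = 4 and γ = 0.5 are taken to match the maze simulation described later; the RBF centers, dispersions, and the learning rates η_c, η_a are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

n, J, K = 5, 3, 4                 # inputs, RBF nodes, actions
gamma, eta_c, eta_a = 0.5, 0.3, 0.3  # discount rate; learning rates (illustrative)

m = rng.uniform(0.0, 1.0, (J, n))  # RBF centers m_ij (illustrative)
sigma = np.full((J, n), 0.5)       # RBF dispersions sigma_ij (illustrative)
w_c = np.zeros(J)                  # critic weights omega_j^c
w_a = np.zeros((J, K))             # actor weights omega_jk

def rbf(x):
    """Middle-layer outputs y_j, Eqs. (5) and (7)."""
    return np.exp(-np.sum((x - m) ** 2 / sigma ** 2, axis=1))

def P(x):
    """Critic's predicted value P(t), Eq. (4)."""
    return w_c @ rbf(x)

def actor_output(x):
    """u_k(t) with value-dependent exploration noise, Eqs. (8)-(9)."""
    noise = rng.uniform(-1.0, 1.0, K) * min(1.0, np.exp(-P(x)))
    return rbf(x) @ w_a + noise, noise

def learn(x_t, r_t, x_next, noise):
    """TD error, Eq. (3); critic update, Eq. (6); actor update, Eq. (10)."""
    global w_c, w_a
    r_hat = r_t + gamma * P(x_next) - P(x_t)
    w_c += eta_c * r_hat * rbf(x_t)                    # dP/dw_j^c = y_j
    w_a += eta_a * np.outer(rbf(x_t), noise * r_hat)   # du_k/dw_jk = y_j
    return r_hat
```

One interaction step computes u_k and the noise, selects an action, observes the reward and the next state, and calls `learn`; the representative values u_k then feed the Gibbs action selection of Sec. 3.3.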

3.3 Action selection

The action a_b at time t is selected stochastically using the Gibbs distribution, Eq. (11):

P(a_b | x(t)) = exp(u_b(t)/T) / Σ_{k=1}^{K} exp(u_k(t)/T).   (11)

Here P(a_b | x(t)): selection probability of the bth action a_b; T: a positive constant called the temperature constant.

4. A HIERARCHICAL MEMORY SYSTEM

4.1 Associative Chaotic Neural Network (ACNN)

A CNN is constructed with chaotic neuron models that have refractoriness and continuous output values. Its most useful usage is as an associative memory network, named the ACNN. The dynamics of the ACNN are:

x_i(t+1) = f( y_i(t+1) + z_i(t+1) ),   (12)
y_i(t+1) = k_r y_i(t) − α x_i(t) + a_i,   (13)
z_i(t+1) = k_f z_i(t) + Σ_j ϖ_{ij} x_j(t),   (14)
ϖ_{ij} = (1/P) Σ_{p=1}^{P} x_i^p x_j^p,   (15)

where x_i(t): output of the ith neuron at time t; y_i(t): internal state with respect to the refractoriness of the ith neuron at time t; z_i(t): internal state with respect to the mutual interaction of the ith neuron at time t; f(·): sigmoid function; ϖ_{ij}: connection weight from the jth neuron to the ith neuron; x_i^p: ith element of the pth stored pattern.

4.2 Network control

Here, network control is defined as control which makes the network transit from a chaotic state to a non-chaotic one and vice versa. The network control algorithm of the ACNN is shown in Fig. 5. The state of the ACNN is evaluated by Δx(t), the total temporal change of the internal state x(t); when Δx(t) is less than a threshold value θ, the chaotic retrieval of the ACNN is stopped by changing the value of the parameter k_r into a small one. As a result, the network converges to a stored pattern near the present network state.

[Fig. 5: Network control algorithm. Fig. 6: Memory configuration of the ACNN. Fig. 7: Adaptive hierarchical memory structure.]

4.3 Mutual associative type ACNN

4.3.1 Short-term memory (S.T.M.)

We make use of the ACNN as a mutual associative memory system; namely, an auto-associative matrix is constructed from the environmental inputs s(t) and their corresponding actions a(t). When s(t) is set as the initial state of the ACNN, the ACNN retrieves a(t) from s(t) (refer to Fig. 6). l is a random vector that weakens the correlation between s(t) and a(t). The memory matrix W_s is described by Eq. (16), where λ_s is a forgetting coefficient and η_s is a learning coefficient.
λ_s is set small, because at the initial learning stage s(t) does not yet correspond to the optimal a(t).

W_s^{new} = λ_s W_s^{old} + η_s [s l a]^T [s l a].   (16)

[In Fig. 6, the stored pattern is [s(t) l(t) a(t)], where l consists of additional random memory units for weakening the correlations between s and a; given the input s(t), the ACNN works as a mutual retrieval system and outputs a(t). In Fig. 7, the actor-critic system and the S.T.M. (a unit-type memory structure of ACNNs) interact with the environment, and refined units are passed to the unit-type memory structures (0), (1), … of the L.T.M.]
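The chaotic neuron dynamics, Eqs. (12)-(14), the Hebbian-type storage of concatenated [s l a] patterns, Eqs. (15)-(16), and the retrieval of an action from an environmental input can be sketched as follows. All parameter values, the pattern sizes, and the simple one-shot linear readout are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

class ACNN:
    """Minimal sketch of the associative chaotic neural network of Sec. 4."""

    def __init__(self, n, k_r=0.95, k_f=0.2, alpha=2.0, a=0.0, eps=0.05):
        self.k_r, self.k_f, self.alpha, self.a, self.eps = k_r, k_f, alpha, a, eps
        self.W = np.zeros((n, n))
        self.x = np.zeros(n)   # outputs x_i
        self.y = np.zeros(n)   # refractoriness states y_i
        self.z = np.zeros(n)   # mutual-interaction states z_i

    def store(self, v, lam=0.3, eta=1.0):
        """Eq. (16): W^new = lam * W^old + eta * v^T v, with v = [s l a]."""
        self.W = lam * self.W + eta * np.outer(v, v)

    def step(self):
        self.y = self.k_r * self.y - self.alpha * self.x + self.a   # Eq. (13)
        self.z = self.k_f * self.z + self.W @ self.x                # Eq. (14)
        u = np.clip((self.y + self.z) / self.eps, -500.0, 500.0)
        self.x = 1.0 / (1.0 + np.exp(-u))                           # Eq. (12), sigmoid
        return self.x

# Mutual association: environmental input s, random units l, action code a.
dim_s, dim_l, dim_a = 5, 3, 4
net = ACNN(dim_s + dim_l + dim_a)
s = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
l = rng.integers(0, 2, dim_l).astype(float)   # weakens the s-a correlation
a = np.eye(dim_a)[2]                          # one-hot code of the chosen action
net.store(np.concatenate([s, l, a]))

# One-shot linear retrieval of the action part from s alone (readout sketch).
cue = np.concatenate([s, np.zeros(dim_l + dim_a)])
recalled = int(np.argmax((net.W @ cue)[-dim_a:]))
```

Setting k_r small (the network control of Sec. 4.2) makes repeated `step` calls settle toward a stored pattern near the current state, while a large k_r keeps the retrieval chaotic.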

The S.T.M. is one unit consisting of plural ACNNs, and each ACNN memorizes the information for one environmental input pattern (refer to Fig. 7). The S.T.M. holds the path information from start to goal of only one maze-searching problem.

4.3.2 Long-term memory (L.T.M.)

The L.T.M. consists of plural units. The L.T.M. memorizes sufficiently refined information from the S.T.M. as one unit (refer to Fig. 7). Namely, when actor-critic learning has been accomplished for a certain maze problem, the information in the L.T.M. is updated as follows. In case the present maze problem has not been experienced, the stored matrix W_L is set by Eq. (17):

W_L = W_s.   (17)

In case the present maze has been experienced and the present learning is additive learning, the stored matrix is updated by Eq. (18):

W_L^{new} = λ_L W_L^{old} + η_L W_s.   (18)

λ_L is a forgetting coefficient, and η_L is a learning coefficient. λ_L is set to a large value, the same as that of η_L, so as not to forget previously stored patterns.

4.4 Adaptive hierarchical memory structure

Fig. 7 shows the whole configuration of the adaptive hierarchical memory structure. When an environmental state is input to the agent, it is first sent to the L.T.M. to confirm whether it is stored information or not. If it is stored information, the corresponding obtained action is executed; otherwise, it is used to train the actor-critic system. The pair of sufficiently refined and trained environmental state s and action a in the S.T.M. is sent to the L.T.M. to be stored. If it is similar to a stored pattern, the information of the L.T.M. is used for relearning by the actor-critic system in the S.T.M.

5. COMPUTER SIMULATION

5.1 Simulation conditions

The agent can perceive whether there is an aisle or not at the forward, right-forward, left-forward, right, and left positions as the environment s (refer to Fig. 8). The agent can move one lattice forward, back, left, or right as an action (refer to Table 1). Therefore, in the actor-critic, the state s of the environment consists of 5 inputs (= n), the number of kinds of actions is 4 (= K), the number of hidden nodes of the R.B.F. is 3 (= J) in Figs. 3 and 4, and the number of units of l is as in Fig. 6. When the agent reaches the goal, the agent is given a reward of 1.0. For the case of a collision with a wall, the reward is −1.0, and for each action except a collision it is −0.1.
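The L.T.M. bookkeeping of Eqs. (17)-(18) reduces to copy-on-first-experience and additive merging on re-experience. A minimal sketch, where the unit list and its indexing are assumptions for illustration and λ_L = η_L = 1.00 follows the simulation parameters:

```python
import numpy as np

lam_L, eta_L = 1.00, 1.00   # forgetting / learning coefficients of the L.T.M.

ltm_units = []              # one stored matrix W_L per unit (layer)

def consolidate(W_s, unit=None):
    """Store a refined S.T.M. matrix W_s into the L.T.M.

    unit=None: the maze has not been experienced, Eq. (17): W_L = W_s.
    otherwise: additive learning on an experienced unit, Eq. (18):
               W_L^new = lam_L * W_L^old + eta_L * W_s.
    Returns the index of the unit that was written."""
    if unit is None:
        ltm_units.append(W_s.copy())
        return len(ltm_units) - 1
    ltm_units[unit] = lam_L * ltm_units[unit] + eta_L * W_s
    return unit
```

With λ_L = η_L = 1.00, re-experienced mazes accumulate their refined matrices without forgetting, which matches the text's requirement that previously stored patterns be preserved.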
Other parameters used in this simulation are shown in Table 2.

5.2 Simulation and results

5.2.1 In the case of a simple maze

[Fig. 8: Perceptible area of the agent (shaded area). Table 1: Action code of the agent.]

Table 2: Parameters of the simulation
  Actor-critic: σ = 0.1, ξ = 0.7, η = 0.3, γ = 0.5, T = 0.3
  Forgetting and learning coefficients: λ_s = 0.89, η_s = 1.00, λ_L = 1.00, η_L = 1.00
  Chaos control parameters of the ACNN (chaos / non-chaos): α = 0.0/.00, k_r = 0.98/0.0, ε = 0.05/0.05, k_f = 0.0/00, T = 0.3/0.3

[Fig. 9: Experimental maze and results — (a) the maze with the learned path; (b) the number of stored patterns in each layer (layers 0-9) of the L.T.M.]

At first, there is no data in the L.T.M. The agent learns the shortest path of the maze of Fig. 9(a) by using the actor-critic system and stores the result of learning in the ACNN of the S.T.M. corresponding to the state s in the form of Eq. (16). The final refined result for the maze

is sent to be stored in the first layer (= unit(0)) of the L.T.M. After learning, the agent restarted from the initial position, obtained the information from each layer of the L.T.M., and reached the goal along the arrow line in Fig. 9(a). Fig. 9(b) shows that the number of stored patterns concentrates at layer 0. This is because, when the agent goes through this maze again, the agent uses the information in unit(0) of the L.T.M.; but when retrieval in the ACNN fails on the way, all the information W_L of unit(0) is moved to the S.T.M., additional actor-critic learning is done, the learning results are written additively in the form of Eq. (18), and all the information W_L^{new} is sent to unit(1) of the L.T.M. as newly experienced information. In Fig. 9(b), storing by Eq. (16) happened twice and the failure happened at layer 0, so the number of stored patterns concentrated at layer 0.

5.2.2 In the case of aliasing

The agent moves while keeping the posture such that the front of the agent always faces the top of the page. In Fig. 10, the optimal path at state A is right; however, the optimal path at B is left. The agent perceives states A and B as the same state, but their optimal actions are different; this is called aliasing. In our case, both patterns are stored as different patterns. Our method solves this problem by using the chaos control of the ACNN. Namely, in the case of the same state, i.e., the same input to the ACNN, the ACNN outputs either left or right as the agent's action; consequently the agent can move right at A.

[Fig. 10: Experimental maze — aliasing happens at the areas A and B.]

[Fig. 11: Experimental mazes and results — (a) non-stored path, (b) stored path 1, (c) stored path 2. Fig. 12: Experimental large-scale maze — (a) stored path 1, (b) stored path 2, (c) stored path 3, (d) non-stored path.]

5.2.3 In the case of use of stored path information

At first, there is no data in the L.T.M. The agent learns the shortest path of the maze of Fig. 11(b) by using the actor-critic system and stores the result of learning in the S.T.M. in the form of Eq. (15) for each action. The final refined result is sent to be stored in the first layer (= unit(0)) of the L.T.M. shown in Fig. 7. Second, for the maze of Fig. 11(c), since there is stored path information, the agent tries to get the action using the ACNN of unit(0) in the L.T.M., but fails because of the lack of information corresponding to this environment. The agent then also learns this maze,

and the final refined result is sent to the second layer (= unit(1)) of the L.T.M. Fig. 11(a) shows the result: the agent moves along the optimal path by making use of its experiences (memory), that is, those of (b) and (c). The colored path in Fig. 11(a) corresponds to those of (b) and (c).

5.2.4 In the case of a large-scale maze

After learning the plural small-size mazes of Fig. 12(a) to (c), the agent tried to reach the goal of the maze in Fig. 12(d). The agent could not associate its action at the top of the arrow in Fig. 12(d). To reach the goal in such a large-scale maze, many experienced mazes are needed.

6. CONCLUSION

We proposed a reinforcement learning system with a chaotic neural networks-based adaptive hierarchical memory structure for autonomous robots and showed its effectiveness through goal-searching problems in plural mazes. In future work, we would like to extend this method to the case of continuous environments.

Acknowledgements

A part of this study was supported by JSPS KAKENHI (No. 850030, No. 050077 and No. 050007).

REFERENCES

[1] R. S. Sutton, A. G. Barto: "Reinforcement Learning", The MIT Press, 1998.
[2] M. Adachi, K. Aihara: "Associative Dynamics in a Chaotic Neural Network", Neural Networks, Vol. 10, No. 1, pp. 83-98, 1997.