
Tuning Fuzzy Inference Systems by Q-Learning

Mohamed Boumehraz*, Khier Benmahammed**
*Laboratoire MSE, Département Electronique, Université de Biskra; **Département Electronique, Université de Setif

Keywords: reinforcement learning, fuzzy inference systems, Q-learning

Abstract: Fuzzy rules for control can be effectively tuned via reinforcement learning. Reinforcement learning is a weak learning method which only requires information on the success or failure of the control application. In this paper a reinforcement learning method is used to tune online the conclusion part of the rules of a fuzzy inference system. The fuzzy rules are tuned in order to maximize the return function. To illustrate its effectiveness, the learning method is applied to the well-known cart-pole balancing problem. The results obtained show significant improvements in the speed of learning.

1. Introduction

Reinforcement learning (RL) refers to a class of learning tasks and algorithms in which the learning system learns an associative mapping by maximizing a scalar evaluation (reinforcement) of its performance received from the environment [1,2,3]. Compared to supervised learning, RL is more difficult since it has to work with much less information. Fuzzy inference systems have been shown to provide excellent control in a number of practical applications. However, the difficulty with fuzzy systems is how to define the appropriate fuzzy rules. Several approaches have been proposed to automatically extract rules from data: gradient descent [4], fuzzy clustering, genetic algorithms [5,6] and reinforcement learning [7,8,9,10,11]. In this paper we use Q-learning to determine the appropriate conclusions for a Mamdani fuzzy inference system. We assume that the structure of the fuzzy system and the membership functions are specified a priori.

2. Reinforcement Learning

2.1 Reinforcement learning model

In reinforcement learning an agent learns to optimize its interaction with a dynamic environment through trial and error. The agent receives a scalar value, or reward, with every action it executes.
The goal of the agent is to learn a strategy for selecting actions such that the expected sum of discounted rewards is maximized [1]. In the standard reinforcement learning model, an agent is connected to its environment via perception and action, as depicted in figure 1. At any given time step $t$, the agent perceives the state $s_t$ of the environment and selects an action $a_t$. The environment responds by giving the agent a scalar reinforcement signal $r(s_t)$ and changing into state $s_{t+1}$. The agent should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this over time by systematic trial and error, guided by a wide variety of algorithms. The agent's goal is to find an optimal policy $\pi : S \to A$, which maps states to actions, that maximizes some long-run measure of reinforcement. In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate rewards but also the next state of the environment. As a result, when taking actions, the agent has to take the future into account.

Reinforcement learning can be summarized in the following steps:

Initialize the learning system
repeat
1. With the system in state $s$, choose an action $a$ according to an exploration policy and apply it to the system.
2. The system returns a reward $r$ and also yields the next state $s'$.
3. Use the experience $(s, a, r, s')$ to update the learning system.
4. $s \leftarrow s'$
until $s$ is terminal

Figure 1. Reinforcement learning scheme (the agent sends an action to the environment, which returns a state and a reinforcement)

2.2 The return function

The agent's goal is to maximize the accumulated future rewards. The return function, or the return, $R(t)$, is a long-term measure of rewards. We have to specify how the agent should take the future into account in the decisions it makes about how to select an action now. Three models have been the subject of the majority of work in this area.

The finite-horizon model. In this case, the horizon corresponds to a finite number of steps in the future. There exists a terminal state, and the sequence of actions between the initial state and the terminal one is called a period. The return is given by:

$$R(t) = r_{t+1} + r_{t+2} + \dots + r_{t+K} \tag{1}$$

where $K$ is the number of steps before the terminal state.

The discounted return (infinite-horizon model). In this case the long-run reward is taken into account, but rewards received in the future are geometrically discounted according to a discount factor $\gamma$, $0 < \gamma < 1$, and the criterion becomes:

$$R(t) = \sum_{k=1}^{\infty} \gamma^{k-1}\, r_{t+k} \tag{2}$$

The average-reward model. A third criterion, in which the agent is supposed to take actions that optimize its long-run average reward, is also used:

$$R(t) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} r_{t+k} \tag{3}$$

2.3 The state value function or value function

The value function is a mapping from states to state values. The value function $V^{\pi}(s)$ of state $s$, associated with a given policy $\pi(s)$, is defined as [1]:

$$V^{\pi}(s_t) = E\left\{ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \right\} \tag{4}$$

where $s_t$ is the state at time $t$, $r_{t+k+1}$ is the reward received for performing action

$$a_{t+k} = \pi(s_{t+k}) \tag{5}$$

at time $t+k$, and $\gamma$ is the discount factor ($0 < \gamma < 1$).

2.4 Action-value function or Q-function

The action-value function measures the expected return of executing action $a_t$ in state $s_t$ and then following the policy $\pi$ for selecting actions in subsequent states. The Q-function corresponding to policy $\pi(s)$ is defined as [1]:

$$Q^{\pi}(s_t, a_t) = E\left\{ r_{t+1} + \gamma\, Q^{\pi}\big(s_{t+1}, \pi(s_{t+1})\big) \right\} \tag{6}$$

The advantage of using the Q-function is that the agent is able to perform a one-step lookahead search without knowing the one-step reward and dynamics functions. The disadvantage is that the domain of the Q-function grows from the domain of states $S$ to the domain of state-action pairs $(s, a)$.

3. Reinforcement Learning Methods

3.1 Q-Learning

There exist several approaches for reinforcement learning without models. Some are based on policy iteration, such as actor-critic learning, and others on value iteration, such as Q-learning or SARSA. Q-learning, proposed by Watkins [12], is perhaps the most popular of these algorithms because of its simplicity.

One-step Q-learning. The first version of Q-learning is based on temporal differences of order 0, TD(0), considering only the following step (one-step Q-learning). The agent observes the present state $s_t$ and executes an action $a_t$ according to the evaluation of the return that it makes at this stage.
It updates its evaluation of the value of the action, taking into account (a) the immediate reinforcement $r_{t+1}$ and (b) the estimated value of the new state, $V_t(s_{t+1})$, which is defined by:

$$V_t(s_{t+1}) = \max_{b \in A} Q_t(s_{t+1}, b) \tag{7}$$

The update corresponds to the equation:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, V_t(s_{t+1}) - Q_t(s_t, a_t) \right] \tag{8}$$

where $\alpha$ is a learning rate such that $\alpha \to 0$ as $t \to \infty$.

In addition to its simplicity, Q-learning has several interesting characteristics:
- The evaluations of Q, the Q-values, are independent of the policy followed by the agent. The agent can follow any policy while continuing to construct correct evaluations of the value of actions.
- Q-values are exploitable long before formal convergence, which can sometimes be very slow.
- Lastly, there are proofs of convergence toward the optimal policy [12].

4. Optimization of fuzzy inference systems by Q-Learning

Reinforcement learning has been used for the optimization of fuzzy inference systems by two types of methods: methods based on policy iteration, leading to actor-critic architectures [7,8], and methods based on value iteration, which generalize Q-learning [9,10,11]. In [11], Glorennec uses Q-learning for the optimization of a zero-order Takagi-Sugeno FIS with constant conclusions. If the action space is continuous, the conclusions are equally distributed between the lower and upper bounds of the action. In this paper, we consider a Mamdani FIS with continuous state and action spaces. The FIS structure is fixed a priori by the user, and the fuzzy sets for the inputs and output are assumed fixed. Our approach consists in determining the optimal conclusions of the fuzzy inference system.

4.1 Mamdani fuzzy inference system

A Mamdani inference system is described by a set of fuzzy rules of the form [13]:

Rule $i$: if $s$ is $A_i$ then $a$ is $B_i$

where $s$ is the fuzzy system input, $A_i$ is the fuzzy label for the input in the $i$-th rule, $a$ is the output of the fuzzy system and $B_i$ is the fuzzy label for the output in the $i$-th rule. The problem is how to choose the appropriate rules in order to optimize system performance (in RL, to maximize the accumulated future rewards) [13]. In this paper we use Q-learning to optimize rule conclusions.
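For readers who want to see the one-step update of equations (7) and (8) in executable form, here is a minimal tabular sketch. It is an illustration only, not the authors' implementation: the dictionary-based table, the function name `q_learning_step`, the action labels and the default values of `alpha` and `gamma` are all assumptions, and the zero value assigned to a terminal next state is a common convention that is assumed here.

```python
from collections import defaultdict


def q_learning_step(Q, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-learning update in place and return the new Q(s, a).

    Q is a mapping from (state, action) pairs to Q-values.
    alpha and gamma are illustrative defaults, not values from the paper.
    """
    # Equation (7): the next state is valued by its best available action.
    # Assumption: a terminal next state is worth zero.
    v_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    # Equation (8): move Q(s, a) toward the target r + gamma * V(s_next).
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical two-action problem with all Q-values starting at zero.
actions = ("left", "right")
Q = defaultdict(float)
first = q_learning_step(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)

# Pretend state 1 has since been found to be valuable, then update again.
Q[(1, "left")] = 2.0
second = q_learning_step(Q, s=0, a="right", r=0.0, s_next=1, actions=actions)
```

Because the target uses the maximum over actions in the next state rather than the action actually taken there, the update does not depend on the policy being followed, which is the first property listed above.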
Several competing conclusions are associated with each rule, and a quality value is assigned to each conclusion. The conclusion with the highest quality is used by the system to generate actions. The fuzzy rule becomes:

Rule $i$: if $s$ is $A_i$ then $a$ is $\arg\max_{b \in B} Q^i(s, b)$
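The selection of one conclusion per rule can be sketched as follows. This is a hedged illustration rather than the paper's procedure: Q-values are stored as one list per rule, the function name `select_conclusions` is hypothetical, and the optional epsilon-greedy exploration is an assumption added so that the same routine can be used during learning (the paper only states that the highest-quality conclusion is used to generate actions).

```python
import random


def select_conclusions(q_values, epsilon=0.0, rng=random):
    """Return, for each rule, the index of the conclusion to fire.

    q_values[i][j] is the Q-value of conclusion j of rule i.
    With epsilon = 0 the choice is purely greedy (argmax per rule).
    """
    chosen = []
    for q_row in q_values:
        if rng.random() < epsilon:
            # Assumed exploration step: pick a random competing conclusion.
            chosen.append(rng.randrange(len(q_row)))
        else:
            # Greedy step: the conclusion with the highest quality value.
            chosen.append(max(range(len(q_row)), key=q_row.__getitem__))
    return chosen


# Two hypothetical rules with three competing conclusions each.
q_values = [[0.0, 0.4, 0.1],
            [0.3, 0.2, 0.9]]
greedy = select_conclusions(q_values)  # rule 0 uses conclusion 1, rule 1 uses conclusion 2
```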

4.2 Learning process

Several conclusions are associated with each rule, and each conclusion has a Q-value. The fuzzy rule is of the form:

Rule $i$: if $s$ is $A_i$ then $a$ is $B_1$ with $Q^i(s, B_1)$
or $a$ is $B_2$ with $Q^i(s, B_2)$
or $a$ is $B_3$ with $Q^i(s, B_3)$
...
or $a$ is $B_m$ with $Q^i(s, B_m)$

where $B_1, B_2, \dots, B_m$ are the fuzzy sets of the output and $Q^i(s, B_j)$ is the Q-value of the conclusion "$a$ is $B_j$" of rule $i$. During learning the Q-value of each conclusion is updated using Q-learning (equation 8):

$$Q^i_{t+1}(s_t, B_j) = Q^i_t(s_t, B_j) + \alpha\, \mu_i(s_t) \left[ r_{t+1} + \gamma\, V_t(s_{t+1}) - Q^i_t(s_t, B_j) \right] \tag{9}$$

where $\mu_i(s_t)$ is the truth value of the $i$-th rule and $B_j$ is the $j$-th conclusion of the $i$-th rule. The value of the new state is given by:

$$V_t(s_{t+1}) = \frac{\sum_{i=1}^{N} \mu_i(s_{t+1}) \max_{b \in B} Q^i_t(s_{t+1}, b)}{\sum_{i=1}^{N} \mu_i(s_{t+1})} \tag{10}$$

where $N$ is the number of rules. If $s_{t+1}$ is a final state, then:

$$V_t(s_{t+1}) = 0 \tag{11}$$

5. Results

The proposed method is applied to a classic problem: the pole balancing problem, or inverted pendulum problem. In this problem a pole is hinged to a motor-driven cart which moves on rail tracks to its right or its left. The primary control task is to keep the pole vertically balanced.

Figure 2. The cart-pole system

The dynamics of the cart-pole system are modeled by the following nonlinear differential equation [7,13]:

$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left( \dfrac{-f - m l \dot{\theta}^{2} \sin\theta}{m_c + m} \right) - \dfrac{\mu_p \dot{\theta}}{m l}}{l \left( \dfrac{4}{3} - \dfrac{m \cos^{2}\theta}{m_c + m} \right)} \tag{12}$$

where $g$ is the acceleration due to gravity, $m_c$ is the mass of the cart, $m$ is the mass of the pole, $l$ is the half-pole length and $\mu_p$ is the coefficient of friction of the pole on the cart. The sample period is 2 ms. We assume that a failure happens when θ > 4. Also, we assume that the equation of motion is not known to the controller and that only a vector describing the state of the cart-pole system at each time step is known. The inputs of the fuzzy controller are the error $e$ and the error change $\Delta e$:

$$e(k) = \theta(k) \tag{13}$$

$$\Delta e(k) = e(k) - e(k-1) \tag{14}$$

The outputs are the force $f$ and the Q-values of the conclusions. The fuzzy partitions of the inputs and output are described in figure 3.

Figure 3. Membership functions ($\mu_e$, $\mu_{\Delta e}$ and $\mu_f$ over $e$, $\Delta e$ and $f$)

The rule base is chosen arbitrarily and the Q-values of the conclusions are initially set to zero.
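To make the pole dynamics of equation (12) concrete, the following sketch evaluates the angular acceleration and advances the pole state by one explicit Euler step. The parameter defaults (`g`, `m_c`, `m`, `l`, `mu_p`), the step size used in the example and the choice of Euler integration are all assumptions taken from common cart-pole benchmarks, not values stated in this paper; the function names are hypothetical.

```python
import math


def pole_angular_acceleration(theta, theta_dot, f,
                              g=9.8, m_c=1.0, m=0.1, l=0.5, mu_p=2e-6):
    """Angular acceleration of the pole (rad/s^2) following equation (12).

    theta is in radians, f is the force applied to the cart in newtons.
    All default parameter values are assumed, not taken from the paper.
    """
    total_mass = m_c + m
    # Contribution of the applied force and of the centrifugal term.
    inner = (-f - m * l * theta_dot ** 2 * math.sin(theta)) / total_mass
    numerator = (g * math.sin(theta)
                 + math.cos(theta) * inner
                 - mu_p * theta_dot / (m * l))
    denominator = l * (4.0 / 3.0 - m * math.cos(theta) ** 2 / total_mass)
    return numerator / denominator


def euler_step(theta, theta_dot, f, dt):
    """Advance (theta, theta_dot) by one explicit Euler step of length dt."""
    theta_ddot = pole_angular_acceleration(theta, theta_dot, f)
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot


# Hypothetical use: a pole tilted by 0.1 rad, no force, assumed step of 0.02 s.
theta, theta_dot = euler_step(0.1, 0.0, 0.0, dt=0.02)
```

With no applied force a tilted pole accelerates away from the vertical, and a positive force on an upright pole produces a negative angular acceleration, which is the behaviour the controller has to exploit.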
We use center-of-area defuzzification and the min operator to implement the premise and the implication. A trial in our experiments refers to starting with the cart-pole system set to an initial state and ending with the appearance of a failure signal or with successful control of the system for an extended period (1[…] time steps or 2 seconds). Q-learning was applied to tune the fuzzy rule conclusions. The free constants were $\gamma = 0.9$, with $\alpha$ initially set to 0.1 and then decreased. Figure 4 shows the average return per trial of the controller during the learning process, and figures 5 and 6 show the response of the system, after learning, for initial angles equal to […] and 28 respectively. It is clear that the average return increases during learning until it reaches a sub-optimal value. The obtained fuzzy controller is able to stabilize the pole for angles inferior to […].
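The learning step of equations (9) to (11) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: Q-values are held as one list per rule, truth values are passed in as plain lists, the update is applied to the conclusion that each rule actually fired (the `chosen` indices), and a next state in which no rule fires is given value zero. All names and the numbers in the example are hypothetical.

```python
def next_state_value(mu_next, Q, terminal=False):
    """Value of the next state, equations (10) and (11).

    mu_next[i] is the truth value of rule i in the next state,
    Q[i][j] is the Q-value of conclusion j of rule i.
    """
    if terminal:
        return 0.0  # equation (11)
    total = sum(mu_next)
    if total == 0.0:
        return 0.0  # assumption: no rule fires, so no value can be estimated
    # Equation (10): truth-weighted average of each rule's best Q-value.
    return sum(mu * max(q_row) for mu, q_row in zip(mu_next, Q)) / total


def fuzzy_q_update(Q, mu, chosen, r, v_next, alpha, gamma):
    """Equation (9), applied in place to the conclusion fired by each rule."""
    for i, j in enumerate(chosen):
        Q[i][j] += alpha * mu[i] * (r + gamma * v_next - Q[i][j])
    return Q


# Two hypothetical rules with two conclusions each.
Q = [[0.0, 0.0],
     [0.0, 1.0]]
v_next = next_state_value(mu_next=[0.5, 0.5], Q=Q)
fuzzy_q_update(Q, mu=[0.25, 0.75], chosen=[0, 1],
               r=1.0, v_next=v_next, alpha=0.1, gamma=0.9)
```

Weighting the correction by the truth value means that rules which contributed little to the action receive a proportionally small share of the temporal-difference error.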

Figure 4. The average return in the inverted pendulum problem (average return versus trial)

Figure 5. Angle, velocity and force [N] versus time [s] for initial angle equal to […]

Figure 6. Angle, velocity and force [N] versus time [s] for initial angle equal to […]

6. Conclusions

In this work we have proposed a new method for optimizing a fuzzy inference system based on Q-learning. This method was applied to the cart-pole system. After learning, the controller is able to stabilize the pendulum. We assume that the structure of the fuzzy system is fixed a priori. Optimizing the membership function parameters and the number of rules is expected to improve the performance of the proposed method.

References

[1] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, MIT Press/Bradford Books, Cambridge, MA.
[2] V. Gullapalli, Reinforcement learning and its application to control, Ph.D. Thesis, University of Massachusetts, Amherst, MA, USA, 1992.
[3] L. P. Kaelbling, M. L. Littman, A. W. Moore, Reinforcement learning: a survey, Journal of Artificial Intelligence Research 4.

[4] J. R. Jang, Self-Learning Fuzzy Controllers Based on Temporal Back Propagation, IEEE Transactions on Neural Networks, Vol. 3, No. […], September […].
[5] M. G. Cooper, J. J. Vidal, Genetic Design of Fuzzy Controller, Proceedings of the Second International Conference on Fuzzy Theory and Technology, Durham, NC, October […].
[6] A. Bonarini, Evolutionary learning of fuzzy rules: competition and cooperation, in Fuzzy Modeling: Paradigms and Practice, Kluwer Academic Publishers, Norwell, MA, 199[…].
[7] H. R. Berenji, P. Khedkar, Learning and Tuning Fuzzy Logic Controllers Through Reinforcement, IEEE Transactions on Neural Networks, Vol. 3, No. […], September […].
[8] M. V. Buijtenen, G. Schram, R. Babuska, B. Verbruggen, Adaptive Fuzzy Control of Satellite Attitude by Reinforcement Learning, IEEE Transactions on Fuzzy Systems, Vol. 6, No. 2, May […].
[9] H. R. Berenji, Fuzzy Q-Learning: a new approach for fuzzy dynamic programming, Proceedings of the IEEE International Conference on Fuzzy Systems, NJ, […].
[10] P. Y. Glorennec, L. Jouffe, Fuzzy Q-Learning, Proceedings of FUZZ-IEEE 97, Barcelona, Spain, July […].
[11] P. Y. Glorennec, Reinforcement Learning: an Overview, ESIT 2000, Aachen, Germany, 14-15 September 2000.
[12] C. Watkins, Learning from Delayed Rewards, PhD Thesis, University of Cambridge, England, […].
[13] K. Passino, S. Yurkovich, Fuzzy Control, Addison Wesley, California, 1998.
