A Scalable Recurrent Neural Network Framework for Model-free POMDPs
April 3, 2007
Zhenzhen Liu, Itamar Elhanany
Machine Intelligence Lab
Department of Electrical and Computer Engineering
The University of Tennessee
http://mil.engr.utk.edu
Outline
- Introduction
- Background and motivation
- TRTRL/SMD
- Simulation results
- Summary and future work
Ingredients for Building Intelligent Machines
- Implementation platform? Must scale
- Mammal brain as a reference model?
  - Massively parallel architecture
  - Operates at (relatively) low speeds
  - Fault-tolerant
- Software vs. hardware: if hardware, what technology? (FPGA, VLSI, analog VLSI)
- (Nonlinear) function approximation
  - Dealing with high-dimensional problems: the optimal policy is unattainable
  - Capturing spatiotemporal dependencies: RNNs, Bayesian networks, fuzzy systems?
  - Biologically-inspired schemes
Scaling ADP
- Goals
  - Address high-dimensional state and/or action spaces
  - Support online learning
  - Deal with partially observable scenarios (e.g. POMDPs)
  - Be hardware realizable
- Approach taken
  - Employ recurrent neural networks (RNNs)
  - Improve the learning algorithm so that it scales
  - Devise a hardware-efficient architecture
  - Embed within an approximate Q-Learning framework
The Real-Time Recurrent Learning (RTRL) Algorithm
- Originally proposed in 1989 for arbitrary RNN topology
- Stochastic gradient-based online algorithm
- The activation of neuron k is defined by

  $y_k(t+1) = f_k(s_k(t))$,

  where $s_k$ is the weighted sum of all activations leading to neuron k, taken over

  $z_k(t) = x_k(t)$ if $k$ is an input, $y_k(t)$ if $k$ is a neuron.

- The network error at time t is defined by

  $J(t) = \frac{1}{2}\sum_{m \in \text{outputs}} \left[d_m(t) - y_m(t)\right]^2 = \frac{1}{2}\sum_{m \in \text{outputs}} \left[e_m(t)\right]^2$,

  where $d_m(t)$ denotes the desired target value for output neuron m.
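The forward pass and error above can be sketched as follows (a minimal sketch; the tanh activation and the array shapes are assumptions, not fixed by the slides):

```python
import numpy as np

def rnn_forward(w, y_prev, x):
    """One time step of a fully recurrent network.

    w      : (N, N + M) weight matrix for N neurons and M external inputs
    y_prev : (N,) previous activations y(t)
    x      : (M,) current external inputs
    Returns y(t+1) = f(s(t)), here with f = tanh (an assumed choice).
    """
    z = np.concatenate([y_prev, x])   # z_k: activations for neurons, inputs for the rest
    s = w @ z                         # s_k(t): weighted sum into neuron k
    return np.tanh(s)                 # y_k(t+1) = f_k(s_k(t))

def network_error(d, y, outputs):
    """J(t) = 0.5 * sum over output neurons m of (d_m(t) - y_m(t))^2."""
    e = d - y[outputs]
    return 0.5 * float(np.sum(e ** 2))
```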
Updating the Weights
- The error is minimized by moving along a positive multiple of the negative gradient of the performance measure:

  $w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t), \quad \Delta w_{ij}(t) = -\alpha \frac{\partial J(t)}{\partial w_{ij}} = \alpha \sum_{k \in \text{outputs}} e_k(t) \frac{\partial y_k(t)}{\partial w_{ij}}$

- The partial derivatives of the activations with respect to the weights are identified as sensitivity elements, denoted by

  $p^k_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}}$
Updating the Sensitivities in RTRL
- The sensitivity of node k with respect to a change in weight $w_{ij}$ is updated using the recursive expression

  $p^k_{ij}(t+1) = f'_k(s_k(t))\left[\sum_{l \in N} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, z_j(t)\right]$

- Each neuron performs O(N^3) multiplications, yielding a total computational complexity of O(N^4)
- Storage is dominated by the weights and the sensitivities, resulting in O(N^3) storage requirements
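As a concrete sketch of this recursion (illustrative Python with assumed shapes; `p[k, i, j]` stores $p^k_{ij}$ and is not the authors' code):

```python
import numpy as np

def rtrl_sensitivity_step(w, z, p, fprime):
    """Full RTRL sensitivity update: p[k, i, j] = dy_k / dw_ij.

    w      : (N, N + M) weights; the first N columns connect neuron to neuron
    z      : (N + M,) concatenated activations and inputs at time t
    p      : (N, N, N + M) sensitivity tensor at time t
    fprime : (N,) derivatives f'_k(s_k(t))
    Every neuron touches all N * (N + M) sensitivities, which is the
    O(N^3) work per neuron and O(N^4) total noted on the slide.
    """
    N = p.shape[0]
    # recurrent term: sum over l of w_kl * p[l, i, j]
    rec = np.tensordot(w[:, :N], p, axes=(1, 0))      # shape (N, N, N + M)
    p_next = fprime[:, None, None] * rec
    for k in range(N):
        p_next[k, k, :] += fprime[k] * z              # delta_ik * z_j(t) term
    return p_next
```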
Truncated RTRL (TRTRL)
- Motivation: obtain a scalable version of the RTRL algorithm while minimizing performance degradation
- How? A biologically-inspired approach: limit the sensitivities of each neuron to its ingress (incoming) and egress (outgoing) links
Revising Sensitivity Updates for TRTRL
- For all nodes not in the output set, the ingress sensitivity function for node i is given by

  $p^i_{ij}(t+1) = f'_i(s_i(t))\left[w_{ii}\, p^i_{ij}(t) + z_j(t)\right]$

- The egress sensitivities for node i are updated by

  $p^l_{li}(t+1) = f'_l(s_l(t))\left[w_{ll}\, p^l_{li}(t) + y_i(t)\right]$

- For the output neurons, a nonzero sensitivity element must exist in order to update the weights, yielding

  $p^k_{ij}(t+1) = f'_k(s_k(t))\left[w_{ki}\, p^i_{ij}(t) + w_{kk}\, p^k_{ij}(t) + \delta_{ki}\, z_j(t)\right]$
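The ingress recursion vectorizes over all weights at once; the sketch below (assumed shapes and names) keeps only the O(N^2) ingress entries $p^i_{ij}$ rather than RTRL's full tensor:

```python
import numpy as np

def trtrl_ingress_step(w, z, p_in, fprime):
    """TRTRL ingress sensitivity update: p_in[i, j] ~ dy_i / dw_ij.

    Implements p_in(t+1)[i, j] = f'_i(s_i(t)) * (w_ii * p_in[i, j] + z_j(t)),
    so only an (N, N + M) matrix is stored instead of RTRL's
    (N, N, N + M) tensor -- the source of the O(N^3) -> O(N^2)
    storage reduction claimed on the next slide.
    """
    N = p_in.shape[0]
    w_self = np.diag(w[:, :N])        # self-recurrent weights w_ii
    return fprime[:, None] * (w_self[:, None] * p_in + z[None, :])
```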
Storage and Computational Complexity of TRTRL
- The network architecture remains the same with TRTRL (there is a weight between every two neurons); only the calculation of sensitivities is reduced
- The computational load for each neuron becomes O(KN), where K denotes the number of output neurons
- Computational complexity is reduced from O(N^4) to O(KN^2)
- Storage requirements are reduced from O(N^3) to O(N^2)
Stochastic Meta-Descent (SMD, N. Schraudolph et al.)
- Gradient-descent techniques often suffer from slow convergence, particularly on ill-conditioned problems
- Mainstream approach: utilize second-order information, e.g. Levenberg-Marquardt and Newton methods (all utilize the Hessian matrix); however, these are computationally heavy
- Stochastic meta-descent (SMD) has recently been proposed as a cheap second-order gradient technique
  - Employs an independent learning rate for each weight
  - Utilizes Hessian information in the local step sizes

  $w_{ij}(t+1) = w_{ij}(t) + \lambda_{ij}(t)\, \delta_{ij}(t)$
SMD Adapted for TRTRL
- We adapted SMD to TRTRL (the first work applying SMD to RNNs)
- Approach: adapt each learning rate along the exponentiated gradient-descent direction

  $\ln\lambda_{ij}(t) = \ln\lambda_{ij}(t-1) - \mu \frac{\partial J(t)}{\partial \ln\lambda_{ij}} = \ln\lambda_{ij}(t-1) - \mu \frac{\partial J(t)}{\partial w_{ij}(t)} \frac{\partial w_{ij}(t)}{\partial \ln\lambda_{ij}} = \ln\lambda_{ij}(t-1) + \mu\, \delta_{ij}(t)\, v_{ij}(t)$

- Using the relationship $e^x \approx 1 + x$ and a safeguard factor against unreasonably small, or negative, values:

  $\lambda_{ij}(t) = \lambda_{ij}(t-1)\, e^{\mu\, \delta_{ij}(t)\, v_{ij}(t)} \approx \lambda_{ij}(t-1)\, \max\!\left(\rho,\; 1 + \mu\, \delta_{ij}(t)\, v_{ij}(t)\right)$
SMD Adapted for TRTRL (cont.)
- Adapt the gradient trace $v_{ij}(t) = \partial w_{ij}(t)/\partial \ln\lambda_{ij}$:

  $v(t+1) = \beta\, v(t) + \lambda(t)\left(\delta(t) - \beta\, H(t)\, v(t)\right)$

- $H(t)$ is the instantaneous Hessian (the matrix of second derivatives $\partial^2 J / \partial w_{ij}\, \partial w_{kl}$ of the error J with respect to each pair of weights) at time t
- The product of the Hessian and an arbitrary vector v is computed with the R-operator,

  $H_t v = R_v\{\nabla_w J\}, \quad R_v\{g(w)\} = \left.\frac{\partial}{\partial r}\, g(w + r v)\right|_{r=0}$

  which for TRTRL yields

  $(H_t v)_{ij} = R_v\Big\{\sum_{k \in \text{outputs}} e_k(t)\, p^k_{ij}(t)\Big\} = \sum_{k \in \text{outputs}} \left[e_k(t)\, R_v\{p^k_{ij}(t)\} - R_v\{y_k(t)\}\, p^k_{ij}(t)\right]$
SMD in TRTRL
- To complete the analysis, the R-operator is applied to s, y, and p:

  $R_v\{y_k(t)\} = f'_k(s_k(t))\, R_v\{s_k(t)\}, \quad R_v\{s_k(t)\} = \sum_{l} v_{kl}\, z_l(t)$

  $R_v\{p^k_{ij}(t)\} = f''_k(s_k(t))\, R_v\{s_k(t)\}\left[w_{ki}\, p^i_{ij}(t) + w_{kk}\, p^k_{ij}(t) + \delta_{ki}\, z_j(t)\right] + f'_k(s_k(t))\left[v_{ki}\, p^i_{ij}(t) + v_{kk}\, p^k_{ij}(t)\right]$

- We also added an adaptive global meta-learning rate, defining

  $\phi(t) = \delta(t) \cdot v(t)$

  to yield

  $\mu(t) = \mu(t-1)\left(1 + \eta\, \phi(t)\, \phi(t-1)\right)$
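Putting the SMD updates together, one step might look like the sketch below (variable names and defaults are assumptions, not the authors' code; in TRTRL these arrays run over all weights and `hv` would come from the R-operator expressions above):

```python
import numpy as np

def smd_step(w, lam, v, delta, hv, mu=1e-3, beta=0.9, rho=0.1):
    """One stochastic meta-descent step over a flat array of weights.

    delta : descent direction, -dJ/dw
    hv    : instantaneous Hessian-vector product H(t) v(t)
    lam   : per-weight learning rates
    v     : gradient trace dw/d(ln lam)
    """
    # exponentiated-gradient step on the rates, using e^x ~ 1 + x,
    # safeguarded by max(rho, .) against small or negative factors
    lam = lam * np.maximum(rho, 1.0 + mu * delta * v)
    # per-weight update: w <- w + lam * delta
    w = w + lam * delta
    # trace update: v <- beta*v + lam*(delta - beta*Hv)
    v = beta * v + lam * (delta - beta * hv)
    return w, lam, v
```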
Using TRTRL/RNN for Solving POMDPs
- Recall that the motivation for using RNNs was to solve POMDPs
- [Figure: the RNN receives the observation O(t) and a candidate action a(t) and outputs the value J(t); a soft-max over the values selects the action applied to the environment, and the reward r(t) together with J(t-1) forms the error signal used for training]
- In each step: (1) feed forward all actions, (2) find the one with maximal (soft-max) value J, (3) apply the corresponding action to the environment, (4) get the next reward and update the weights
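Steps (1)-(3) can be sketched as follows; `value_fn(obs, a)` stands in for the RNN's feedforward evaluation and is an assumed interface, not the authors' code:

```python
import numpy as np

def select_action(value_fn, obs, n_actions, temperature=1.0, rng=None):
    """Soft-max action selection over an RNN value function.

    (1) feed forward every candidate action, (2) form a soft-max
    distribution over the resulting values J, (3) sample the action
    to apply to the environment.
    """
    rng = rng or np.random.default_rng()
    j = np.array([value_fn(obs, a) for a in range(n_actions)])
    prob = np.exp((j - j.max()) / temperature)   # numerically stabilized soft-max
    prob /= prob.sum()
    return int(rng.choice(n_actions, p=prob))
```

After step (4) the reward r(t) and the previous value J(t-1) form the error signal that drives the TRTRL weight update.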
Example 2 - Four-State POMDP
- [Figure: 4-state transition diagram; the legible labels show transition values of ±12/18 and rewards r = 0 and r = 8]
- 4-state POMDP with identical (confusing) observations
- The agent needs to remember the prior observation to infer the state
- 15 internal neurons and 1 output neuron
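Since the transition diagram did not survive extraction, the sketch below is only an illustrative stand-in for this kind of environment: every state emits the same observation, so a memoryless learner cannot tell states apart, while the rewards 0 and 8 mirror the values legible on the slide. The dynamics (a simple cycle) are assumed, not the slide's exact example.

```python
class AliasedFourStatePOMDP:
    """Illustrative 4-state POMDP with identical (aliased) observations.

    The observation carries no state information, so inferring the
    current state requires remembering earlier steps -- the property
    the RNN's recurrence is meant to exploit.
    """
    def __init__(self):
        self.n_states = 4
        self.state = 0

    def step(self, action):
        # action 1 moves forward on the cycle, anything else moves back
        move = 1 if action == 1 else -1
        self.state = (self.state + move) % self.n_states
        reward = 8.0 if self.state == self.n_states - 1 else 0.0
        observation = 0   # identical observation in every state
        return observation, reward
```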
Summary
- Scalable, efficient RNNs are a vital tool for addressing high-dimensional POMDPs
- Introduced a fast, hardware-efficient learning algorithm and architecture
- Slightly improved the SMD technique (adaptive global learning rate)
- Successfully applied TRTRL-SMD to solving POMDPs
- Pathway for addressing practical problems: a scalable framework for ADP with RNNs