arxiv: v2 [math.oc] 19 Jun 2016

Using Deep Q-Learning to Control Optimization Hyperparameters

Samantha Hansen
IBM T.J. Watson Research Center

Abstract

We present a novel definition of the reinforcement learning state, actions and reward function that allows a deep Q-network (DQN) to learn to control an optimization hyperparameter. Using Q-learning with experience replay, we train two DQNs to accept a state representation of an objective function as input and output the expected discounted return of rewards, or q-values, connected to the actions of either adjusting the learning rate or leaving it unchanged. The two DQNs learn a policy similar to a line search, but differ in the number of allowed actions. The trained DQNs in combination with a gradient-based update routine form the basis of the Q-gradient descent algorithms. To demonstrate the viability of this framework, we show that the DQN's q-values associated with optimal action converge and that the Q-gradient descent algorithms outperform gradient descent with an Armijo or nonmonotone line search. Unlike traditional optimization methods, Q-gradient descent can incorporate any objective statistic and by varying the actions we gain insight into the type of learning rate adjustment strategies that are successful for neural network optimization.

1 Introduction

This paper demonstrates how to train a deep Q-network (DQN) to control an optimization hyperparameter. Our goal is to minimize an objective function through gradient-based updates of the form

$$x_{t+1} = x_t - \alpha_t g_t \tag{1}$$

where $\alpha_t$ is the learning rate. At each iterate $x_t$, we extract information about the objective derived from Taylor's theorem and line search methods to form a state feature vector. The state feature vector is the input to a DQN and the output is the expected discounted return of rewards, or q-value, connected to the action of increasing, decreasing, or preserving the learning rate. We present a novel definition of the reinforcement learning problem that allows us to train two DQNs using Q-learning with experience replay [10, 12] to successfully control the learning rate and learn the q-values associated with the optimal actions.
The motivation for this work is founded on the observation that gradient-based algorithms are effective for neural network optimization, but are highly sensitive to the choice of learning rate [9]. Using a DQN in combination with a gradient-based optimization routine to iteratively adjust the learning rate eliminates the need for a line search or hyperparameter tuning, and is the concept for the Q-gradient descent algorithm. Although we restrict this paper to deterministic optimization, this framework can extend to the stochastic regime where only gradient estimates are available.

We train two DQNs to minimize a feedforward neural network that performs phone classification in two separate environments. The first environment conforms to an Armijo line search procedure [1, 14]: either the learning rate is decreased by a constant factor or an iterate is accepted and the learning rate is reset to an initial value. The second environment differs in that the learning rate can also increase and is never reset. The trained DQNs are the input to the Q-gradient descent (Q-GD) versions 1 & 2, and we test them against gradient descent with an Armijo or nonmonotone [5] line search to show that these new algorithms are able to find better solutions on the original neural network, as well as on a neural network that is doubled in size and with three times the amount of data. We also compare how each algorithm adjusts the learning rate during the course of the optimization procedure in order to extract characteristics that explain Q-GD's superior performance.

The paper is organized as follows: in Section 2 we review reinforcement learning (RL) theory and in Section 3 we define the RL actions, state, and reward function for the purpose of optimization. Section 4 describes the Q-learning with experience replay procedure used to train the DQNs. In Section 5, we test the Q-GD algorithms against gradient descent with an Armijo or nonmonotone line search on two neural networks that perform phone classification. Section 6 reviews relevant literature and finally, in Section 7 we provide concluding remarks and discuss future areas of research.

Notation: We use brackets indexed by either location or description to denote accessing an element from a vector. For example, $[s]_i$ denotes the $i$th element and $[s]_{\text{encoding}}$ denotes the element corresponding to description encoding for vector $s$.

2 Review of Reinforcement Learning

Reinforcement learning is the presiding methodology for training an agent to perform a task within an environment. These tasks are characterized by a clear underlying goal and require the agent to sequentially select an action based on the state of the environment and the current policy. The agent learns by receiving feedback from the environment in the form of a reward. At each time step $t$, the agent receives a representation of the environment's state $s_t \in \mathcal{S}$ and based on the policy $\pi : \mathcal{S} \to \mathcal{A}$ chooses an action $a_t \in \mathcal{A}$. The agent receives a reward $r_{t+1}$ for taking action $a_t$ and arriving in state $s_{t+1}$. We assume that the environment is a Markov Decision Process (MDP), i.e. given the current state $s_t$ and action $a_t$, the probability of arriving in next state $s_{t+1}$ and receiving reward $r_{t+1}$ does not depend on any of the previous states or actions.

A successful policy must balance the immediate reward with the agent's overall goal. RL achieves this via the action-value function $Q^\pi(s, a) : (\mathcal{S}, \mathcal{A}) \to \mathbb{R}$, which is the discounted expected return of rewards given the state, action, and policy,

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} \mid s_t = s,\ a_t = a \right] \tag{2}$$

where

$$R_{t+1} = r_{t+1} + \sum_{k=1}^{T-1} \gamma^k r_{t+1+k}, \quad 0 < \gamma \le 1, \tag{3}$$

$T$ is the maximum number of time steps and the expectation is taken given that the agent is following policy $\pi$. The optimal action-value function, $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, satisfies the Bellman equation,

$$Q^*(s, a) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right] \tag{4}$$

which provides a natural update rule for learning. At each time step the effective estimate $\hat{y}_t$ and target $y_t$ are given by

$$\hat{y}_t = Q^*(s_t, a_t), \qquad y_t = r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \tag{5}$$

and the update is based on their difference; this method is referred to as Q-learning. Notice that the estimate/target come from the LHS/RHS of (4) and will both continue to change until $Q^*$ converges.

For a finite number of states and actions, $Q^*$ is a look-up table. When the number of states is too large or even infinite, the table is approximated by a function. In particular, when the action-value function is a neural network it is referred to as a deep Q-network (DQN). A practical choice is to choose a network architecture such that the inputs are the states and the outputs are the expected discounted return of rewards, or q-value, for each action. We only consider the case of using a DQN and henceforth use the notation

$$Q(s; \theta) : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{A}|} \tag{6}$$

to denote that the DQN is parameterized by weights $\theta$. The weights are updated by minimizing the $\ell_2$ norm between the estimate and target, $\|\hat{y}_t - y_t\|_2^2$, yielding iterations of the form

$$\theta \leftarrow \theta - \beta (\hat{y}_t - y_t) \nabla_\theta Q(s_t; \theta) \tag{7}$$

where $\beta$ is the learning rate and

$$\hat{y}_t = Q(s_t; \theta), \qquad [y_t]_a = \begin{cases} r_{t+1} + \mathbb{1}_{\{t \neq T\}} \left( \gamma \max_{a' \in \mathcal{A}} [Q(s_{t+1}; \theta)]_{a'} \right) & a = a_t \\ [Q(s_t; \theta)]_a & a \neq a_t. \end{cases} \tag{8}$$

For the last action, only the reward is present in the target definition and for the non-chosen actions, the targets are set to force the error to be zero.

The action at each time step is chosen based on the principle of exploration versus exploitation. Exploitation takes advantage of the information already garnered by the DQN while exploration encourages random actions to be taken in prospect of finding a better policy. We employ an $\epsilon$-greedy policy which chooses the optimal action w.r.t. the DQN's q-values with probability $1 - \epsilon$ and randomly otherwise:

$$a_t = \begin{cases} \arg\max_a [Q(s_t; \theta)]_a & r \ge \epsilon \\ \text{randomly chosen action} & r < \epsilon \end{cases} \tag{9}$$

where $r \sim U[0, 1]$. Equation (9) is the effective policy since it maps states to actions. Q-learning is an off-policy procedure because it follows a non-optimal policy (with probability $\epsilon$ a random action is taken) yet makes updates to the optimal policy, as illustrated by the max term in (8). For a comprehensive introduction to RL, see [18].
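As an illustration of the quantities reviewed in this section, the following is a minimal sketch of the discounted return (3), the target vector (8), and the ε-greedy rule (9). It uses plain Python lists in place of a neural network; the function names and the `terminal` flag (standing in for the last-time-step case of the target) are assumptions made for this example, not part of the paper.

```python
import random


def discounted_returns(rewards, gamma):
    """Discounted return for every start index, in the spirit of Eq. (3).

    rewards[k] is the reward received k steps after the start of the list;
    the return is accumulated backwards so each entry costs O(1).
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def q_target(q_s, q_next, action, reward, gamma, terminal):
    """Target vector in the spirit of Eq. (8).

    q_s / q_next are the q-value lists for s_t and s_{t+1} (hypothetical
    stand-ins for DQN outputs). Non-chosen actions copy the estimate so
    their error is zero; at the last step only the reward is used.
    """
    target = list(q_s)
    bootstrap = 0.0 if terminal else gamma * max(q_next)
    target[action] = reward + bootstrap
    return target


def epsilon_greedy(q_values, epsilon, rng=random):
    """Eq. (9): greedy action if r >= epsilon, random action otherwise."""
    r = rng.random()  # r drawn uniformly from [0, 1)
    if r >= epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))
```

With `epsilon = 0` the rule is purely greedy, which corresponds to how a trained DQN would be queried once exploration is switched off.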
3 Reinforcement Learning for Optimization

In this section, we outline the environment, state, actions, and reward function that define the reinforcement learning problem for the purpose of optimization.

3.1 Actions

We present two procedures for adjusting the learning rate and show how they are implemented in practice. The first strategy mimics an Armijo line search [1, 14] in that the learning rate is reset to an initial value after accepting an iterate and can only henceforth be decreased. The second strategy permits the learning rate to increase or decrease and is never reset. The two methods are outlined in Algorithm 1 and are referred to as Q-gradient descent (Q-GD) versions 1 & 2, respectively.
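The two learning-rate strategies described in this subsection can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the author's implementation: the trained DQN is replaced by a hypothetical `policy` callable that returns one of the strings `"half"`, `"double"`, or `"accept"`, the state passed to it is reduced to `(t, alpha, x)`, and vectors are plain lists.

```python
def q_gradient_descent(grad, x1, alpha_c, policy, T, version=1):
    """Sketch of Q-gradient descent with a stand-in policy.

    grad    : callable returning the gradient of f as a list
    x1      : initial iterate (list of floats)
    alpha_c : initial learning rate
    policy  : hypothetical stand-in for arg max_a [Q(s_t; theta)]_a
    T       : number of time steps
    version : 1 resets alpha to alpha_c on accept; 2 never resets and
              additionally allows doubling
    """
    x_bar = list(x1)                     # accepted iterate
    d = [-g for g in grad(x_bar)]        # descent direction at x_bar
    x, alpha = list(x1), alpha_c         # candidate iterate and learning rate
    for t in range(1, T + 1):
        action = policy(t, alpha, x)
        if action == "half":
            alpha = 0.5 * alpha
        elif action == "double" and version == 2:
            alpha = 2.0 * alpha
        elif action == "accept":
            x_bar = list(x)              # accept the candidate
            d = [-g for g in grad(x_bar)]
            if version == 1:
                alpha = alpha_c          # version 1 resets the learning rate
        # the candidate is always rebuilt from the accepted iterate
        x = [xb + alpha * di for xb, di in zip(x_bar, d)]
    return x_bar
```

A policy that always returns `"accept"` reduces version 1 to plain gradient descent with a fixed learning rate, which is a convenient sanity check.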

Q-GD is a gradient descent optimization procedure that uses a trained DQN to determine the learning rate. The Q-GD inputs are an initial iterate and learning rate $x_1$ and $\alpha_c$, trained DQN $Q(s; \theta)$, and maximum number of time steps $T$. We use the notation $x_t$ to denote the candidate iterate, which changes at every time step, and $\bar{x}$ to represent an accepted iterate with associated descent direction $d(\bar{x})$. In steps 3 and 4, a state feature vector representative of the objective (discussed in the next section) is formed and passed through the DQN to determine the action. After the action is taken, the candidate iterate is updated in step 12. When a good initial learning rate is known then the first version is preferable, e.g. $d(\bar{x})$ is the Newton direction and $\alpha_c = 1$ for convex $f$. For non scale-invariant search directions, such as the gradient direction, the second version is advantageous.

Algorithm 1 Q-gradient descent versions 1 & 2
Input: initial iterate $x_1$, initial learning rate $\alpha_c$, trained DQN $Q(s; \theta)$, number of time steps $T$
 1: Set $\bar{x} = x_1$, $d(\bar{x}) = -\nabla f(x_1)$, $\alpha_1 = \alpha_c$
 2: for $t = 1, \dots, T$ do
 3:     Compute state feature vector $s_t$
 4:     $a_t = \arg\max_a [Q(s_t; \theta)]_a$
 5:     if $a_t = a_{\text{half}}$ then
 6:         $\alpha_{t+1} = \frac{1}{2} \alpha_t$
 7:     else if $a_t = a_{\text{double}}$ then    ▷ Only for version 2
 8:         $\alpha_{t+1} = 2 \alpha_t$
 9:     else if $a_t = a_{\text{accept}}$ then
10:         $\bar{x} = x_t$, $d(\bar{x}) = -\nabla f(\bar{x})$, $\alpha_{t+1} = \alpha_c$ (version 1) or $\alpha_t$ (version 2)    ▷ Update accepted iterate
11:     end if
12:     $x_{t+1} = \bar{x} + \alpha_{t+1} d(\bar{x})$    ▷ Update candidate iterate
13: end for
14: return $\bar{x} = x_T$

3.2 Environment and State

The environment is a combination of the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and set of allowed actions and needs to be formulated as a MDP in order for the Q-learning algorithm to operate. The Markov condition could be satisfied by including the initial iterate, and the current, as well as all preceding learning rates and descent directions into the state definition. However, for objective functions with a large number of variables such an approach is computationally prohibitive and would severely limit the trained DQN's ability to generalize to a broader family of functions.
We seek to define the state such that it characterizes the objective function at a given iterate, contains some history, and is universal to all functions. We use a nonmonotone line search as a starting point since it provides an effective criterion for determining the learning rate that is independent of function variable size or type. A nonmonotone line search chooses the learning rate such that the new iterate is sufficiently less than the maximum objective value of the past $M$ iterates,

$$f(x_t + \alpha_t d_t) \le \max_{i = t, \dots, t-M+1} f(x_i) + c\, \alpha_t\, d_t^T \nabla f(x_t), \quad c > 0. \tag{10}$$

This suggests that the state features needed in order to determine the learning rate are the current learning rate, candidate iterate objective value, max objective from the past $M$ steps, and the dot product between the descent direction $d_t$ and gradient $\nabla f(x_t)$. Although this feature set would neither satisfy the Markov property nor completely capture the objective, updates based on (10) work well in practice and we use these statistics as motivation for the state features.

We employ an encoding that indicates whether the candidate iterate is higher/lower than the $M$ lowest achieved objective values. Let $F^M_{t-1}$ be a list of the $M$ lowest objective values obtained up to time $t-1$; the state encoding is given by

$$[s_t]_{\text{encoding}} = \begin{cases} 1 & f(x_t) \le \min(F^M_{t-1}) \\ 0 & \min(F^M_{t-1}) < f(x_t) \le \max(F^M_{t-1}) \\ -1 & \text{otherwise.} \end{cases} \tag{11}$$

The number of function evaluations must also be a state feature since the states wouldn't otherwise be stationary and the maximum number of time steps $T$ designates an absorbing state. Based on RPROP [16], the final state feature is a measure of alignment between successive descent directions

$$[s_t]_{\text{alignment}} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{sign}([d_t]_i [d_{t-1}]_i). \tag{12}$$

In summary there are six features: current learning rate, objective value, dot product between the search direction and gradient, min/max encoding (11), number of function evaluations, and alignment measure (12).

For the purpose of making the state features independent of the specific objective function, all of the features are transformed to be in the interval $[-1, 1]$. For each feature $[s]_i$, a maximum and minimum value is estimated so that

$$[\hat{s}]_i = 1 - 2\,\frac{[s]_i - [s_{\min}]_i}{[s_{\max}]_i - [s_{\min}]_i}. \tag{13}$$

Additionally, since the objective values and gradient norms both converge towards a lower bound $c_i$, these features are transformed twice: first via $[s]_i \leftarrow 1/([s]_i - c_i)$ and then by (13), where $c_i$ is set to 0 for the gradient norm and an objective lower bound $f_{lb}$ for the function values. In general, $f_{lb}$ can be set to zero for objectives that are a sum of loss functions.

3.3 Reward Function

The reward function is crucial in ensuring that the DQN learns a policy consistent with the goal of finding the lowest objective value in the fewest number of steps, and we define it as the inverse distance from the objective lower bound,

$$r_{id}(f, x_t) = \frac{c}{f(x_t) - f_{lb}}, \quad c > 0, \quad f_{lb} < f(x)\ \forall x. \tag{14}$$
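To make the state features and the reward concrete, here is a small sketch of the min/max encoding (11), the alignment measure (12), the scaling to [−1, 1] in (13), and the inverse-distance reward (14). The helper names are hypothetical, and two sign conventions are assumptions of this sketch because the extracted equations lost their minus signs: the encoding returns 1 for a new lowest value and −1 for a value above the stored maximum, and the scaling maps the minimum to 1 and the maximum to −1.

```python
def encoding(f_x, lowest_values):
    """Min/max encoding in the spirit of Eq. (11).

    lowest_values: the M lowest objective values seen so far.
    """
    if f_x <= min(lowest_values):
        return 1
    if f_x <= max(lowest_values):
        return 0
    return -1


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def alignment(d, d_prev):
    """Eq. (12): mean sign agreement between successive directions."""
    return sum(sign(a * b) for a, b in zip(d, d_prev)) / len(d)


def scale(s, s_min, s_max):
    """Affine map of a feature onto [-1, 1] (orientation assumed, see lead-in)."""
    return 1 - 2 * (s - s_min) / (s_max - s_min)


def inverse_distance_reward(f_x, f_lb, c):
    """Eq. (14): reward grows as f(x) approaches the lower bound f_lb."""
    return c / (f_x - f_lb)
```

Features that approach a lower bound would first be passed through `1 / (s - c_i)` and then through `scale`, following the two-stage transform described in the text.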
The reward function (14) is strictly positive and asymptotes as $f$ approaches the lower bound. We tested reward functions based on a sufficient decrease condition or change in objective value between successive iterates,

$$r_{sd}(f, x_t) = \mathbb{1}_{\{f(x_{t-1}) \ge 1.1 f(x_t)\}}, \qquad r_{oc}(f, x_t) = f(x_{t-1}) - f(x_t) \tag{15}$$

and found that they did not adequately capture the optimization goal. To compare the different reward functions we plotted $f(x_T)$ against $R_{\max} = \max_t R_t$; for each training episode of DQN v1 we recorded the sequence of objective values ($f(x_T)$ being the objective value at the last time step) and used this information to calculate $R_{\max}$ for each reward function. Figure 1 shows that reward functions based on sufficient decrease or objective change yield high $R_{\max}$ values for suboptimal final solutions. The main difference between the reward functions is that (14) is based on degree of difficulty in decreasing the objective and will generate the highest rewards during the final time steps.

[Figure 1: three panels titled "Inverse of Distance from Objective Lower Bound", "Objective Change", and "Sufficient Decrease", each plotting $f(x_T)$ against $R_{\max}$.]

Figure 1: Comparison of reward functions. The images plot $f(x_T)$ versus $R_{\max} = \max_t R_t$ for reward functions defined by $r_{id}$ (inverse distance from objective lower bound), $r_{oc}$ (objective change) and $r_{sd}$ (sufficient decrease) given by equations 14 and 15. Only $r_{id}$, shown in the leftmost graph, has the highest $R_{\max}$ values concentrated towards lowest final objective values.

4 Training

This section outlines the Q-learning with experience replay method used to train DQN versions 1 & 2 [10, 12]. Algorithm 2 exhibits the overall procedure, but omits some of the specific details, which are discussed in the subsections for the sake of clarity. Note that updates w.r.t. $f(x)$ are explicitly shown and are indexed by the time step $t$ while the DQN update in step 25 is referenced via equation (16) and is implicitly indexed by the time step and episode.

The DQN learns how to minimize the function $f(x)$ through repeated attempts, called learning episodes. For each learning episode, the $x$ iterate is set to an initial value and the DQN then has $T$ time steps to find the lowest objective value. An alternative approach for limiting the number of time steps is to end the episode once the objective has decreased past a certain threshold. Both approaches force the DQN to learn a trade off between finding a good learning rate and exploring the space.
Restricting the number of time steps reflects real world applications where there are computational and time constraints and also does not require a-priori knowledge of the objective function.

4.1 Experience Replay

An experience consists of a $(s_i, a_i, r_{i+1}, s_{i+1})_j$ tuple for some episode $j \in [1, e]$ at time step $i \in [M-1, T]$, where $M$ and $e$ are in Algorithm 2 steps 2 and 3. These tuples are stored in a memory of experiences $\mathcal{E}$. Instead of updating the DQN with only the most recent experience, a subset $S \subset \mathcal{E}$ of experiences are drawn from memory and used as a mini-batch to update the DQN:

$$\theta \leftarrow \theta - \frac{\beta}{|S|} \sum_{(s_i, a_i, r_{i+1}, s_{i+1})_j \in S} (\hat{y}_i - y_i) \nabla_\theta Q(s_i; \theta) \tag{16}$$

where the estimate $\hat{y}_i$ and target $y_i$ are given via (8).
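The mini-batch update (16) can be illustrated with a small replay memory. This is a simplified sketch under stated assumptions: the memory is bounded by a number of experiences (`capacity`) and sampled uniformly, and the per-experience term $(\hat{y}_i - y_i) \nabla_\theta Q(s_i; \theta)$ is supplied by a hypothetical `error_gradient` callable rather than computed from a real network.

```python
import random
from collections import deque


class ReplayMemory:
    """Bounded store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a uniform mini-batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def minibatch_update(theta, batch, error_gradient, beta):
    """Averaged update in the spirit of Eq. (16).

    error_gradient(experience, theta) is a hypothetical callable returning
    the list (y_hat - y) * grad_theta Q(s; theta) for one experience.
    """
    total = [0.0] * len(theta)
    for experience in batch:
        g = error_gradient(experience, theta)
        total = [acc + gi for acc, gi in zip(total, g)]
    step = beta / len(batch)
    return [th - step * ti for th, ti in zip(theta, total)]
```

Averaging over the batch keeps the step size comparable across different mini-batch sizes, which is why the factor $\beta / |S|$ appears in (16).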

The $A$ most recent episodes along with the top $B$ best games (in terms of $R_{\max}$ value) are stored in memory. At each DQN update (step 25) the subsample $S$ is formed by randomly drawing experiences from $\mathcal{E}$ and an experience from each of the top $B$ best games. Adding randomly drawn experiences to the mini-batch helps prevent the DQN from over learning during a particular time and episode.

4.2 Training Specifications

The Q-learning input parameters in Algorithm 2 for both DQN versions 1 & 2 were fixed as follows: the discount factor was set to $\gamma = .99$ and the exploration probability $\epsilon$ was initially set to 1 then uniformly decayed to .1 over the first 1 episodes. For experience replay, $A = 5$, $B = 5$, and the mini-batch size was set to $|S| = 3$. Additionally, for the first 5 episodes the top $B$ best games were not used in the mini-batch sample. The constants $c_1$ and $c_2$ used to calculate the reward (see steps 20 and 22) were fixed as .1 and .1, respectively. The total number of episodes $E$ is 15K for version 1 and K for version 2.

The objective input parameters in Algorithm 2 consist of the objective function $f(x)$ with lower bound $f_{lb}$, initial weights $x_1$, initial learning rate $\alpha_c$, encoding memory $M$, and the total number of time steps $T$. The objective function has the form

$$\frac{1}{N} \sum_{i=1}^{N} l(h(z_i; x), t_i) \tag{17}$$

where $z_i$ is an acoustic feature vector with phonetic label $t_i$, $l(\cdot)$ is a cross entropy loss, and $h(z; x)$ is a feedforward neural network parameterized by $x$ with sigmoid activations and a softmax function at the output layer. We set the input objective function to $f_{\text{train}}$, which has a neural network architecture and $N = 5$ data points. The number of time steps is $T = 1$ and $M = 3$.

At the start of each episode, the $x$ iterate is reset to $x_1$ and is updated for the first $M$ time steps using the initial learning rate (step 4) in order to form the first state feature vector. In steps 8 and 9, the six state features form the input to the DQN and the resulting action is determined by an $\epsilon$-greedy policy. Based on the action, the learning rate is either modified, step 11 or 13, or the current iterate is accepted and a new gradient direction is calculated, step 15.
The iterate $x$ is updated in step 17 and this causes the environment to change to the next state (step 18). The reward for arriving at state $s_{t+1}$ is calculated using either the objective value at the new iterate (step 20) or the previous iterate (step 22) for when the action is to accept. As an aside, we found it beneficial to calculate the reward for each action at the last time step since the targets associated with absorbing states do not change during training and thus play a vital role for propagating back information. The tuple $(s_t, a_t, r_{t+1}, s_{t+1})_e$ forms an experience and is added to memory $\mathcal{E}$ (step 24). In addition to the current experience, a random subset of experiences are drawn and used to form a mini-batch update for the DQN (step 25).

Special modifications were needed for training DQN v2 since one of its actions permits the learning rate to increase. Too large of a learning rate resulted in updates that caused the objective function to diverge and consequently produce state vectors with infinite features. To prevent this from happening, we used a maximum and minimum learning rate as part of the training procedure. If DQN v2 attempted to increase/decrease the learning rate above/below these values then it would receive a reward of -1 and the episode would terminate early. In addition, we employed an rmsprop update procedure for training DQN v2 [20].

DQN versions 1 & 2 have an architecture of A with sigmoid activations for the hidden layers and an identity activation for the last layer. The initial learning rate was set to $\alpha_c =$ for version 1 and $\alpha_c =$ for version 2. Additionally, for version 2 only learning rates in the range [.1, 8] were allowed.

Algorithm 2 Q-Learning with Experience Replay
Objective Parameters: $f$, $f_{lb}$, $x_1$, $\alpha_c$, $M$, $T$
Q-Learning Parameters: $E$, $\theta_0$, $\gamma$, $\epsilon$, $c_1$, $c_2$, $\beta$
 1: $\theta \leftarrow \theta_0$
 2: for $e = 1, \dots, E$ do    ▷ For each learning episode
 3:     for $t = 1, \dots, M-1$ do
 4:         $x_{t+1} = x_t - \alpha_c \nabla f(x_t)$
 5:     end for
 6:     set $\bar{x} = x_M$, $d(\bar{x}) = -\nabla f(x_M)$, $\alpha_M = \alpha_c$
 7:     for $t = M, \dots, T$ do
 8:         Generate state feature vector $s_t$
 9:         Choose action $a_t$ according to $\epsilon$-greedy policy (9)
10:         if $a_t = a_{\text{half}}$ then
11:             $\alpha_{t+1} = \frac{1}{2} \alpha_t$
12:         else if $a_t = a_{\text{double}}$ then    ▷ Only for version 2
13:             $\alpha_{t+1} = 2 \alpha_t$
14:         else if $a_t = a_{\text{accept}}$ then
15:             $\bar{x} = x_t$, $d(\bar{x}) = -\nabla f(\bar{x})$, $\alpha_{t+1} = \alpha_c$ (version 1) or $\alpha_t$ (version 2)
16:         end if
17:         $x_{t+1} = \bar{x} + \alpha_{t+1} d(\bar{x})$
18:         Generate state feature vector $s_{t+1}$
19:         if $a_t \neq a_{\text{accept}}$ then
20:             $r_{t+1} = c_1 / (f(x_{t+1}) - f_{lb})$
21:         else if $a_t = a_{\text{accept}}$ then
22:             $r_{t+1} = c_2 / (f(\bar{x}) - f_{lb})$
23:         end if
24:         Add experience $(s_t, a_t, r_{t+1}, s_{t+1})_e$ to memory $\mathcal{E}$
25:         Sample $S \subset \mathcal{E}$ and update $\theta$ via (16)
26:     end for
27: end for
28: return $\theta$

5 Experiments

The trained DQNs along with the initial learning rates $\alpha_c$ are the input to the Q-gradient descent algorithms versions 1 & 2 outlined in Algorithm 1. Since there are no theoretical guarantees that the DQNs would find a good policy or converge, we demonstrate that Q-GD versions 1 & 2 are effective algorithms by comparing them against gradient descent with an Armijo or nonmonotone line search and show that the DQN q-values associated with the optimal actions converge to the discounted return of rewards at each time step. The line search algorithms operate under the same rules as Q-GD v1, but an iterate is accepted only if (10) is satisfied. We set $c = 1$ and $M = 3$ for nonmonotone and, by definition, $M = 1$ for Armijo.

5.1 Results on Train Function

[Figure 2: (a) Objective Value versus Time Step on the train function $f_{\text{train}}$ for Q-GD v1, Q-GD v2, nonmonotone, and Armijo; (b) Learning Rate versus Time Step, one panel each for Q-GD v1, Q-GD v2, Nonmonotone LS, and Armijo LS.]

Figure 2: Comparison of Q-GD versions 1 & 2 and gradient descent with a nonmonotone or Armijo line search on train function.

We first compare Q-GD versions 1 & 2 and gradient descent with an Armijo or nonmonotone line search on the function used to train DQN versions 1 & 2; $f_{\text{train}}$ has the form (17) with $N = 5$ and feedforward neural network architecture. Figure 2a demonstrates their performance in minimizing $f_{\text{train}}$ and Figure 2b plots the learning rate at each time step. After 1 time steps, the final objective values are 1.86, 1.91, 1.98, and . for Q-GD v2, Q-GD v1, nonmonotone, and Armijo, respectively.

The plots of the learning rates illuminate why the Q-GD algorithms are superior. Q-GD v2 has the advantage that it can increase the learning rate and its policy for minimizing the train function was very simple: it increased the learning rate from to 8 during the first initial time steps and then left the learning rate unchanged until decreasing it at each of the last seven time steps. Q-GD v1 offers a fairer comparison to the Armijo and nonmonotone line searches since the algorithms all follow the same structure: every time an iterate is accepted the learning rate is reset to and can only then be decreased by a factor of two. The notable difference between Q-GD v1 and the line search algorithms is the frequency with which the learning rate is decreased. Q-GD v1 decreased the learning rate 5.1% of the time while the Armijo and nonmonotone line searches decreased the learning rate 36.% and 7.3% of the time. Q-GD v1 also only decreased the learning rate during the final quarter of the optimization procedure. The learned policies illustrate that a good initial learning rate is more important than a line search procedure for fast initial objective decrease.
Also, it is beneficial to decrease the learning rate more aggressively during the final time steps. Unlike the line searches, the Q-GD algorithms have knowledge of when the optimization procedure is going to end (since the number of time steps is an input parameter) and can adjust the learning rate accordingly.
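For reference, the acceptance test used by the two baseline line searches, condition (10), can be sketched as below. The function and argument names are hypothetical; `slope` stands for the dot product $d_t^T \nabla f(x_t)$, which is negative for a descent direction, and the Armijo test is written as the $M = 1$ special case, as stated in the text.

```python
def nonmonotone_accept(f_candidate, past_f, alpha, slope, c, M):
    """Return True if the candidate satisfies condition (10).

    f_candidate : f(x_t + alpha_t * d_t)
    past_f      : objective values of previous iterates, most recent last
    alpha       : current learning rate alpha_t
    slope       : d_t^T grad f(x_t), negative for a descent direction
    c           : positive sufficient-decrease constant
    M           : number of past iterates entering the max
    """
    reference = max(past_f[-M:])
    return f_candidate <= reference + c * alpha * slope


def armijo_accept(f_candidate, f_current, alpha, slope, c):
    """Armijo condition as the M = 1 case of the nonmonotone test."""
    return nonmonotone_accept(f_candidate, [f_current], alpha, slope, c, M=1)
```

Because the nonmonotone reference value is a maximum over several past iterates, it accepts some candidates that the Armijo test would reject, which is one way to read the lower frequency of learning-rate decreases reported for the nonmonotone search.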

5.2 Generalization Ability

[Figure 3: (a) Objective Value versus Time Step on the test function $f_{\text{test}}$ for Q-GD v1, Q-GD v2, nonmonotone, and Armijo; (b) Learning Rate versus Time Step, one panel each for Q-GD v1, Q-GD v2, Nonmonotone LS, and Armijo LS.]

Figure 3: Comparison of Q-GD versions 1 & 2 and gradient descent with a nonmonotone or Armijo line search on test function.

We next test to determine if the strategy learned by DQN versions 1 & 2 on the train function also works for a new, but related function. The test function has the same form as the train function, but with three times the amount of data and double the number of variables: (17) with $N = 15$ and architecture. The purpose of this configuration is to show that we can train the DQN using a small problem and later implement it on larger problems in terms of both variable size and data. We also increased the number of time steps from 1 to .

Figure 3 exhibits how Q-GD versions 1 & 2 and the nonmonotone and Armijo line search algorithms measure on the test function. In Figure 3a, we observe that the algorithms retain their relative ordering regarding objective decrease in a fixed number of time steps; the final values are 1.73, 1.7, 1.8 and 1.89 for Q-GD v2, Q-GD v1, nonmonotone and Armijo, respectively. The gap in performance between Q-GD versions 1 & 2 reduced, showing that Q-GD v1 was more adept at generalizing to a new function.

As with the train function, both Q-GD versions 1 & 2 decreased the learning rate less frequently than either the nonmonotone or Armijo line searches. However, both Q-GD versions were more cautious using a higher learning rate at the start of the optimization procedure. Q-GD versions 1 & 2 maintained their underlying strategies, except version 1 chose to decrease the learning rate during the first quarter and version 2 only initially increased the learning rate to (as opposed to 8). Overall, these results show that Q-GD versions 1 & 2 were robust when given a new, larger function and used over a longer number of time steps.
5.3 Convergence of DQN Q-values

The purpose of this section is to show that the six state features detailed in Section 3.2 are rich enough for the DQN to discriminate states in order to learn the q-values associated with the optimal actions. We also demonstrate the effect of individually zeroing out the state features for Q-GD version 1 on the train function.

For the final episode, we recorded the q-value associated with the selected action (no longer using an $\epsilon$-greedy procedure) and resulting reward at each time step in order to compare the DQN predicted q-values against the discounted return of rewards, defined by (3). Figure 4 shows that DQN v1's q-values converged to the discounted return of rewards while DQN v2 found the overall shape of the distribution. Even though DQN v2 was trained with more episodes (K versus 15K), the addition of one extra action exponentially increases the search space, creating a much more difficult problem.

[Figure 4: "Discounted Return of Rewards versus DQN Q value for Optimal Action", one panel each for version 1 and version 2, showing $\max_a [Q(s_t)]_a$ and $R_t$.]

Figure 4: Plot of DQN versions 1 & 2 predicted q-value for optimal action versus the discounted return of rewards (3) at each time step on train function.

To investigate how the state features influence the Q-GD algorithms, we ran Q-GD v1 with either the objective value, gradient norm, or alignment measure set to zero; since the features are transformed to lie in the interval $[-1, 1]$ this corresponds to fixing a given feature at its median value. We left the learning rate, objective encoding, and number of time steps unchanged as they are arguably the bare minimum inputs needed to satisfy the Markov property. Table 1 reports the final objective value and the ratio of halving the learning rate or accepting an iterate obtained for setting a given state feature to zero during a run of Q-GD v1 on the train function. The baseline (none of the features are set to zero) is a final objective of 1.91 and 51/96 half/accept ratio. As a result of zeroing out a state feature, DQN v1 chooses to halve the learning rate more frequently and ends up with a worse solution. This experiment shows that DQN v1 depends on each feature to determine the appropriate action.

Table 1: Effect of setting a state feature to zero. Baseline (none of the features are set to zero) is a final objective value of 1.91 and a 51/96 half/accept ratio.

Feature              Objective    Half/Accept
objective value                   /677
gradient norm                     /7
alignment measure    .            36/633

6 Related Work

Neural network models yield state of the art performance in speech recognition, natural language processing, and computer vision [6, 8, 3].
Tesauro popularized neural networks as an approximation to the value function [19], which Riedmiller later extended to the action-value function with the advent of the Neural Fitted Q Iteration [15]. Applications of using neural networks in RL appear in settings ranging from playing games to robotics [12, 10].

Using reinforcement learning to replace an optimization heuristic or be embedded within the optimization algorithm has been explored in a variety of domains [2, 4, 11, 13, 17]. However, none of the previous approaches use deep Q-learning or our proposed RL formulation. Our work is most similar to [17]; the authors use RL to replace a Levenberg-Marquardt heuristic for controlling a damping parameter used in a Gauss-Newton update routine. Unlike our work, they approximate the action-value function by a linear combination of basis functions, which they train using Least Square Policy Iteration. To our knowledge, our work is the first to successfully apply deep Q-learning to controlling an optimization hyperparameter.

7 Conclusions

This paper lays the foundation for using deep Q-learning to control an optimization hyperparameter. We defined the state, reward function, and actions such that a DQN could learn how to control the learning rate used in a gradient-based optimization routine, resulting in two Q-gradient descent algorithms. Given that there are no theoretical guarantees that the DQN would find the optimal policy or that its q-values would converge, we presented numerical evidence that the Q-GD algorithms performed better than gradient descent with either an Armijo or nonmonotone line search and that the DQN's q-values for the optimal action converged to the discounted return of rewards at each time step. Additionally, we demonstrated that the Q-GD algorithms were able to generalize when the train function was replaced with a larger test function.

A main advantage of the Q-gradient descent method is that it can easily incorporate any objective statistic by adding it to the state feature vector. Future areas of work involve using this framework to explore additional state features that can facilitate optimization decisions. We trained the DQNs in a simple environment in order to demonstrate feasibility.
To make this method practical for large scale optimization it is necessary to extend Q-GD to the stochastic regime, that is, create Q-stochastic gradient descent. A final area of work involves expanding the actions to include controlling additional hyperparameters, such as a momentum term. Overall, the presented framework allows us to develop new optimization algorithms and gain intuition into the type of strategies that are successful for minimizing neural networks.

References

[1] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1 3.
[2] Justin A Boyan and Andrew W Moore. Learning evaluation functions for global optimization and boolean satisfiability. In AAAI/IAAI, pages 3 1.
[3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 5th international conference on Machine learning. ACM, 8.
[4] Marco Dorigo and LM Gambardella. Ant-Q: A reinforcement learning approach to the traveling salesman problem. In International Conference on Machine Learning, pages 5 6.
[5] Luigi Grippo, Francesco Lampariello, and Stefano Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 3():77 716.
[6] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, and Tara N Sainath. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 9(6):8 97, 1.
[7] Nitish Shirish Keskar and George Saon. A nonmonotone learning rate strategy for SGD training of deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 15 IEEE International Conference on. IEEE, 15.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1.
[9] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9 8. Springer, 1.
[10] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document.
[11] Victor V Miagkikh and William F Punch III. Global search in combinatorial optimization using reinforcement learning algorithms. In Proceedings of the Congress on Evolutionary Computation, volume 1. IEEE.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(75):59 533, 15.
[13] Robert Moll, Theodore J Perkins, and Andrew G Barto. Machine learning for subproblem selection. In ICML: Proceedings of the Seventeenth International Conference on Machine Learning, pages 615 6.
[14] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 6.
[15] Martin Riedmiller. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 5. Springer, 5.
[16] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Neural Networks, 1993., IEEE International Conference on. IEEE.
[17] Paul L Ruvolo, Ian Fasel, and Javier R Movellan. Optimization on a budget: A reinforcement learning approach. In Advances in Neural Information Processing Systems, 9.
[18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge.
[19] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58 68.
[20] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop. COURSERA: Neural networks for machine learning, 1.
[21] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-):79 9.
