arxiv: v2 [math.oc] 19 Jun 2016


Using Deep Q-Learning to Control Optimization Hyperparameters

Samantha Hansen
IBM T.J. Watson Research Center

Abstract

We present a novel definition of the reinforcement learning state, actions and reward function that allows a deep Q-network (DQN) to learn to control an optimization hyperparameter. Using Q-learning with experience replay, we train two DQNs to accept a state representation of an objective function as input and output the expected discounted return of rewards, or q-values, connected to the actions of either adjusting the learning rate or leaving it unchanged. The two DQNs learn a policy similar to a line search, but differ in the number of allowed actions. The trained DQNs in combination with a gradient-based update routine form the basis of the Q-gradient descent algorithms. To demonstrate the viability of this framework, we show that the DQN's q-values associated with the optimal action converge and that the Q-gradient descent algorithms outperform gradient descent with an Armijo or nonmonotone line search. Unlike traditional optimization methods, Q-gradient descent can incorporate any objective statistic, and by varying the actions we gain insight into the type of learning rate adjustment strategies that are successful for neural network optimization.

1 Introduction

This paper demonstrates how to train a deep Q-network (DQN) to control an optimization hyperparameter. Our goal is to minimize an objective function through gradient-based updates of the form

$$x_{t+1} = x_t - \alpha_t g_t \tag{1}$$

where $\alpha_t$ is the learning rate. At each iterate $x_t$, we extract information about the objective derived from Taylor's theorem and line search methods to form a state feature vector. The state feature vector is the input to a DQN and the output is the expected discounted return of rewards, or q-value, connected to the action of increasing, decreasing, or preserving the learning rate. We present a novel definition of the reinforcement learning problem that allows us to train two DQNs using Q-learning with experience replay [10, 12] to successfully control the learning rate and learn the q-values associated with the optimal actions.
The motivation for this work is founded on the observation that gradient-based algorithms are effective for neural network optimization, but are highly sensitive to the choice of learning rate [9]. Using a DQN in combination with a gradient-based optimization routine to iteratively adjust the learning rate eliminates the need for a line search or hyperparameter tuning, and is the concept for the Q-gradient descent algorithm. Although we restrict this paper to deterministic optimization, this framework can extend to the stochastic regime where only gradient estimates are available.

We train two DQNs to minimize a feedforward neural network that performs phone classification in two separate environments. The first environment conforms to an Armijo line search procedure [1, 14]: either the learning rate is decreased by a constant factor or an iterate is accepted and the learning rate is reset to an initial value. The second environment differs in that the learning rate can also increase and is never reset. The trained DQNs are the input to the Q-gradient descent

(Q-GD) versions 1 & 2, and we test them against gradient descent with an Armijo or nonmonotone [5] line search to show that these new algorithms are able to find better solutions on the original neural network, as well as on a neural network that is doubled in size and with three times the amount of data. We also compare how each algorithm adjusts the learning rate during the course of the optimization procedure in order to extract characteristics that explain Q-GD's superior performance.

The paper is organized as follows: in Section 2 we review reinforcement learning (RL) theory and in Section 3 we define the RL actions, state, and reward function for the purpose of optimization. Section 4 describes the Q-learning with experience replay procedure used to train the DQNs. In Section 5, we test the Q-GD algorithms against gradient descent with an Armijo or nonmonotone line search on two neural networks that perform phone classification. Section 6 reviews relevant literature and finally, in Section 7 we provide concluding remarks and discuss future areas of research.

Notation: We use brackets indexed by either location or description to denote accessing an element from a vector. For example, $[s]_i$ denotes the $i$th element and $[s]_{\text{encoding}}$ denotes the element corresponding to the description "encoding" for vector $s$.

2 Review of Reinforcement Learning

Reinforcement learning is the presiding methodology for training an agent to perform a task within an environment. These tasks are characterized by a clear underlying goal and require the agent to sequentially select an action based on the state of the environment and the current policy. The agent learns by receiving feedback from the environment in the form of a reward. At each time step $t$, the agent receives a representation of the environment's state $s_t \in \mathcal{S}$ and based on the policy $\pi : \mathcal{S} \to \mathcal{A}$ chooses an action $a_t \in \mathcal{A}$. The agent receives a reward $r_{t+1}$ for taking action $a_t$ and arriving in state $s_{t+1}$. We assume that the environment is a Markov Decision Process (MDP), i.e.
given the current state $s_t$ and action $a_t$, the probability of arriving in next state $s_{t+1}$ and receiving reward $r_{t+1}$ does not depend on any of the previous states or actions.

A successful policy must balance the immediate reward with the agent's overall goal. RL achieves this via the action-value function $Q^\pi(s, a) : (\mathcal{S}, \mathcal{A}) \to \mathbb{R}$, which is the discounted expected return of rewards given the state, action, and policy,

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} \mid s_t = s, a_t = a \right] \tag{2}$$

where

$$R_{t+1} = r_{t+1} + \sum_{k=1}^{T-t-1} \gamma^k r_{t+1+k}, \qquad 0 < \gamma \le 1, \tag{3}$$

$T$ is the maximum number of time steps and the expectation is taken given that the agent is following policy $\pi$. The optimal action-value function, $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, satisfies the Bellman equation,

$$Q^*(s, a) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right] \tag{4}$$

which provides a natural update rule for learning. At each time step the effective estimate $\hat{y}_t$ and target $y_t$ are given by

$$\hat{y}_t = Q^*(s_t, a_t), \qquad y_t = r_{t+1} + \gamma \max_{a' \in \mathcal{A}} Q^*(s_{t+1}, a') \tag{5}$$

and the update is based on their difference; this method is referred to as Q-learning. Notice that the estimate/target come from the LHS/RHS of (4) and will both continue to change until $Q$ converges.

For a finite number of states and actions, $Q$ is a look-up table. When the number of states is too large or even infinite, the table is approximated by a function. In particular, when the action-value function is a neural network it is referred to as a deep Q-network (DQN). A practical choice is to choose a network architecture such that the inputs are the states and the outputs are the expected discounted return of rewards, or q-value, for each action. We only consider the case of using a DQN and henceforth use the notation

$$Q(s; \theta) : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{A}|} \tag{6}$$

to denote that the DQN is parameterized by weights $\theta$. The weights are updated by minimizing the $\ell_2$ norm between the estimate and target, $\|\hat{y}_t - y_t\|_2$, yielding iterations of the form

$$\theta \leftarrow \theta - \beta (\hat{y}_t - y_t) \nabla_\theta Q(s_t; \theta) \tag{7}$$

where $\beta$ is the learning rate and

$$\hat{y}_t = Q(s_t; \theta), \qquad [y_t]_a = \begin{cases} r_{t+1} + \mathbb{1}_{\{t \text{ is not the final step}\}} \left( \gamma \max_{a' \in \mathcal{A}} [Q(s_{t+1}; \theta)]_{a'} \right) & a = a_t \\ [Q(s_t; \theta)]_a & a \ne a_t. \end{cases} \tag{8}$$

For the last action, only the reward is present in the target definition, and for the non-chosen actions the targets are set to force the error to be zero.

The action at each time step is chosen based on the principle of exploration versus exploitation. Exploitation takes advantage of the information already garnered by the DQN while exploration encourages random actions to be taken in prospect of finding a better policy. We employ an $\epsilon$-greedy policy which chooses the optimal action w.r.t. the DQN's q-values with probability $1 - \epsilon$ and randomly otherwise:

$$a_t = \begin{cases} \arg\max_a [Q(s_t; \theta)]_a & r \ge \epsilon \\ \text{randomly chosen action} & r < \epsilon \end{cases} \tag{9}$$

where $r \sim U[0, 1]$. Equation (9) is the effective policy since it maps states to actions. Q-learning is an off-policy procedure because it follows a non-optimal policy (with probability $\epsilon$ a random action is taken) yet makes updates to the optimal policy, as illustrated by the max term in (8). For a comprehensive introduction to RL, see [18].
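As an illustration of the two rules just described, the following minimal Python sketch selects an action with the ε-greedy rule (9) and builds the target vector of (8) for a single transition. The function names, the plain-list representation of q-values, and the `is_last_step` flag are assumptions made for this example and are not taken from the paper's implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index following the epsilon-greedy rule (9)."""
    r = rng.random()  # r is drawn uniformly from [0, 1)
    if r < epsilon:
        # Explore: uniformly random action.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(q_current, q_next, action, reward, gamma, is_last_step):
    """Build the target vector of (8) for one transition.

    Non-chosen actions copy the current estimate, so their error is zero.
    The chosen action gets the reward plus, unless this is the last step,
    the discounted maximum q-value of the next state.
    """
    target = list(q_current)
    bootstrap = 0.0 if is_last_step else gamma * max(q_next)
    target[action] = reward + bootstrap
    return target
```

Because the non-chosen entries of the target equal the estimate, only the chosen action contributes to the error term in the weight update (7).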
3 Reinforcement Learning for Optimization

In this section, we outline the environment, state, actions, and reward function that define the reinforcement learning problem for the purpose of optimization.

3.1 Actions

We present two procedures for adjusting the learning rate and show how they are implemented in practice. The first strategy mimics an Armijo line search [1, 14] in that the learning rate is reset to an initial value after accepting an iterate and can only henceforth be decreased. The second strategy permits the learning rate to increase or decrease and is never reset. The two methods are outlined in Algorithm 1 and are referred to as Q-gradient descent (Q-GD) versions 1 & 2, respectively.

Q-GD is a gradient descent optimization procedure that uses a trained DQN to determine the learning rate. The Q-GD inputs are an initial iterate $x_1$ and learning rate $\alpha_c$, a trained DQN $Q(s; \theta)$, and the maximum number of time steps $T$. We use the notation $x_t$ to denote the candidate iterate, which changes at every time step, and $\bar{x}$ to represent an accepted iterate with associated descent direction $d(\bar{x})$. In steps 3 and 4, a state feature vector representative of the objective (discussed in the next section) is formed and passed through the DQN to determine the action. After the action is taken, the candidate iterate is updated in step 12. When a good initial learning rate is known then the first version is preferable, e.g. $d(\bar{x})$ is the Newton direction and $\alpha_c = 1$ for convex $f$. For non scale-invariant search directions, such as the gradient direction, the second version is advantageous.

Algorithm 1 Q-gradient descent versions 1 & 2
Input: initial iterate $x_1$, initial learning rate $\alpha_c$, trained DQN $Q(s; \theta)$, number of time steps $T$
 1: Set $\bar{x} = x_1$, $d(\bar{x}) = -\nabla f(x_1)$, $\alpha_1 = \alpha_c$
 2: for $t = 1, \dots, T$ do
 3:   Compute state feature vector $s_t$
 4:   $a_t = \arg\max_a [Q(s_t; \theta)]_a$
 5:   if $a_t = a_{\text{half}}$ then
 6:     $\alpha_{t+1} = \frac{1}{2} \alpha_t$
 7:   else if $a_t = a_{\text{double}}$ then    (only for version 2)
 8:     $\alpha_{t+1} = 2 \alpha_t$
 9:   else if $a_t = a_{\text{accept}}$ then    (update accepted iterate)
10:     $\bar{x} = x_t$, $d(\bar{x}) = -\nabla f(\bar{x})$, $\alpha_{t+1} = \alpha_c$ (version 1) or $\alpha_{t+1} = \alpha_t$ (version 2)
11:   end if
12:   $x_{t+1} = \bar{x} + \alpha_{t+1} d(\bar{x})$    (update candidate iterate)
13: end for
14: return $x_T$

3.2 Environment and State

The environment is a combination of the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the set of allowed actions, and needs to be formulated as an MDP in order for the Q-learning algorithm to operate. The Markov condition could be satisfied by including the initial iterate and the current, as well as all preceding, learning rates and descent directions in the state definition. However, for objective functions with a large number of variables such an approach is computationally prohibitive and would severely limit the trained DQN's ability to generalize to a broader family of functions.
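The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the trained DQN is replaced by a hypothetical `policy` callable that returns one of the strings `"half"`, `"double"`, or `"accept"`, the state passed to it is a placeholder tuple rather than the six state features, and iterates are plain lists of floats.

```python
def q_gradient_descent(grad, x1, alpha_c, policy, T, version=1):
    """Sketch of Algorithm 1 (Q-GD versions 1 and 2).

    grad    -- function returning the gradient of f at a point (list of floats)
    policy  -- hypothetical stand-in for the trained DQN's greedy action choice
    """
    x_bar = list(x1)                       # accepted iterate
    d = [-g for g in grad(x_bar)]          # descent direction at the accepted iterate
    x = list(x1)                           # candidate iterate
    alpha = alpha_c
    for t in range(1, T + 1):
        state = (t, alpha)                 # placeholder for the real state features
        action = policy(state)
        if action == "half":
            alpha = 0.5 * alpha
        elif action == "double" and version == 2:
            alpha = 2.0 * alpha            # doubling exists only in version 2
        elif action == "accept":
            x_bar = list(x)                # accept the candidate
            d = [-g for g in grad(x_bar)]  # recompute the descent direction
            if version == 1:
                alpha = alpha_c            # version 1 resets the learning rate
        # New candidate from the accepted iterate and current learning rate.
        x = [xb + alpha * di for xb, di in zip(x_bar, d)]
    return x
```

For example, with $f(x) = \frac{1}{2}\|x\|^2$ (so `grad = lambda x: list(x)`), `alpha_c = 0.5`, and a policy that always accepts, each step halves the iterate, which is ordinary gradient descent with a fixed learning rate.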
We seek to define the state such that it characterizes the objective function at a given iterate, contains some history, and is universal to all functions. We use a nonmonotone line search as a starting point since it provides an effective criterion for determining the learning rate that is independent of function variable size or type. A nonmonotone line search chooses the learning rate such that the objective value at the new iterate is sufficiently less than the maximum objective value of the past $M$ iterates,

$$f(x_t + \alpha_t d_t) \le \max_{i = t, \dots, t-M+1} f(x_i) + c \, \alpha_t d_t^T \nabla f(x_t), \qquad c > 0. \tag{10}$$

This suggests that the state features needed in order to determine the learning rate are the current learning rate, candidate iterate objective value, max objective from the past $M$ steps, and the dot

product between the descent direction $d_t$ and gradient $\nabla f(x_t)$. Although this feature set would neither satisfy the Markov property nor completely capture the objective, updates based on (10) work well in practice and we use these statistics as motivation for the state features.

We employ an encoding that indicates whether the candidate iterate is higher/lower than the $M$ lowest achieved objective values. Let $F_M^{t-1}$ be a list of the $M$ lowest objective values obtained up to time $t-1$; the state encoding is given by

$$[s_t]_{\text{encoding}} = \begin{cases} 1 & f(x_t) \le \min(F_M^{t-1}) \\ 0 & \min(F_M^{t-1}) < f(x_t) \le \max(F_M^{t-1}) \\ -1 & \text{otherwise.} \end{cases} \tag{11}$$

The number of function evaluations must also be a state feature since the states wouldn't otherwise be stationary and the maximum number of time steps $T$ designates an absorbing state. Based on RPROP [16], the final state feature is a measure of alignment between successive descent directions

$$[s_t]_{\text{alignment}} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{sign}([d_t]_i [d_{t-1}]_i). \tag{12}$$

In summary there are six features: current learning rate, objective value, dot product between the search direction and gradient, min/max encoding (11), number of function evaluations, and alignment measure (12). For the purpose of making the state features independent of the specific objective function, all of the features are transformed to be in the interval $[-1, 1]$. For each feature $[s]_i$, a maximum and minimum value is estimated so that

$$[\hat{s}]_i = 1 - 2 \, ([s]_i - [s_{\min}]_i) / ([s_{\max}]_i - [s_{\min}]_i). \tag{13}$$

Additionally, since the objective values and gradient norms both converge towards a lower bound $c_i$, these features are transformed twice: first via $[s]_i \leftarrow 1/([s]_i - c_i)$ and then by (13), where $c_i$ is set to 0 for the gradient norm and to an objective lower bound $f_{lb}$ for the function values. In general, $f_{lb}$ can be set to zero for objectives that are a sum of loss functions.

3.3 Reward Function

The reward function is crucial in ensuring that the DQN learns a policy consistent with the goal of finding the lowest objective value in the fewest number of steps, and we define it as the inverse distance from the objective lower bound,

$$r_{id}(f, x_t) = \frac{c}{f(x_t) - f_{lb}}, \qquad c > 0, \quad f_{lb} < f(x) \ \forall x. \tag{14}$$

The reward function (14) is strictly positive and asymptotes as $f$ approaches the lower bound. We tested reward functions based on a sufficient decrease condition or change in objective value between successive iterates,

$$r_{sd}(f, x_t) = \mathbb{1}_{\{f(x_{t-1}) \ge 1.1 f(x_t)\}}, \qquad r_{oc}(f, x_t) = f(x_{t-1}) - f(x_t) \tag{15}$$

and found that they did not adequately capture the optimization goal. To compare the different reward functions we plotted $f(x_T)$ against $R_{\max} = \max_t R_t$; for each training episode of DQN v1 we recorded the sequence of objective values ($f(x_T)$ being the objective value at the last time step)

and used this information to calculate $R_{\max}$ for each reward function. Figure 1 shows that reward functions based on sufficient decrease or objective change yield high $R_{\max}$ values for suboptimal final solutions. The main difference between the reward functions is that (14) is based on the degree of difficulty in decreasing the objective and will generate the highest rewards during the final time steps.

[Figure 1: three panels plotting $f(x_T)$ against $R_{\max}$, titled "Inverse of Distance from Objective Lower Bound", "Objective Change", and "Sufficient Decrease".]

Figure 1: Comparison of reward functions. The images plot $f(x_T)$ versus $R_{\max} = \max_t R_t$ for reward functions defined by $r_{id}$ (inverse distance from objective lower bound), $r_{oc}$ (objective change) and $r_{sd}$ (sufficient decrease) given by equations (14) and (15). Only $r_{id}$, shown in the leftmost graph, has the highest $R_{\max}$ values concentrated towards the lowest final objective values.

4 Training

This section outlines the Q-learning with experience replay method used to train DQN versions 1 & 2 [10, 12]. Algorithm 2 exhibits the overall procedure, but omits some of the specific details, which are discussed in the subsections for the sake of clarity. Note that updates w.r.t. $f(x)$ are explicitly shown and are indexed by the time step, while the DQN update in step 25 is referenced via equation (16) and is implicitly indexed by the time step and episode.

The DQN learns how to minimize the function $f(x)$ through repeated attempts, called learning episodes. For each learning episode, the $x$ iterate is set to an initial value and the DQN then has $T$ time steps to find the lowest objective value. An alternative approach for limiting the number of time steps is to end the episode once the objective has decreased past a certain threshold. Both approaches force the DQN to learn a trade-off between finding a good learning rate and exploring the space.
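To make the comparison quantity concrete, the sketch below computes the inverse-distance reward, the discounted return of rewards at every time step, and $R_{\max}$ for one recorded sequence of objective values. It is a minimal illustration: the function names and the default constant `c=1.0` are assumptions for the example, not values from the paper.

```python
def inverse_distance_reward(f_value, f_lb, c=1.0):
    """Reward c / (f(x) - f_lb); requires f(x) > f_lb and c > 0."""
    if f_value <= f_lb:
        raise ValueError("objective value must exceed the lower bound")
    return c / (f_value - f_lb)


def discounted_returns(rewards, gamma):
    """Return R at every step: rewards[i] + gamma*rewards[i+1] + gamma^2*... .

    Computed backwards so each return reuses the one after it.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def r_max(objective_values, f_lb, gamma, c=1.0):
    """Largest discounted return along one episode of objective values."""
    rewards = [inverse_distance_reward(f, f_lb, c) for f in objective_values]
    return max(discounted_returns(rewards, gamma))
```

Since the reward grows as the objective approaches its lower bound, episodes that end at low objective values accumulate their largest rewards in the final steps, which is the behaviour the text attributes to the inverse-distance reward.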
Restricting the number of time steps reflects real world applications where there are computational and time constraints, and also does not require a priori knowledge of the objective function.

4.1 Experience Replay

An experience consists of a $(s_i, a_i, r_{i+1}, s_{i+1})_j$ tuple for some episode $j \in [1, e]$ at time step $i \in [M-1, T]$, where $M$ and $e$ are in Algorithm 2 steps 2 and 3. These tuples are stored in a memory of experiences $\mathcal{E}$. Instead of updating the DQN with only the most recent experience, a subset $S \subset \mathcal{E}$ of experiences is drawn from memory and used as a mini-batch to update the DQN:

$$\theta \leftarrow \theta - \frac{\beta}{|S|} \sum_{(s_i, a_i, r_{i+1}, s_{i+1})_j \in S} (\hat{y}_i - y_i) \nabla_\theta Q(s_i; \theta) \tag{16}$$

where the estimate $\hat{y}_i$ and target $y_i$ are given via (8).
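The mechanics of storing experiences and applying an averaged mini-batch step of the form (16) can be sketched as follows. This is a simplified illustration, not the paper's memory scheme: it assumes a plain first-in-first-out buffer with uniform sampling, and `minibatch_step` treats each sampled experience as contributing one scalar error and one gradient vector. All names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """FIFO store of (state, action, reward, next_state) tuples (a simplification)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def minibatch_step(theta, errors, grads, beta):
    """theta <- theta - (beta / |S|) * sum_i errors[i] * grads[i]."""
    n = len(errors)
    total = [0.0] * len(theta)
    for err, g in zip(errors, grads):
        for k in range(len(theta)):
            total[k] += err * g[k]
    return [th - beta * s / n for th, s in zip(theta, total)]
```

Averaging over the mini-batch keeps the step size comparable across batch sizes, so the learning rate $\beta$ does not need to be retuned when $|S|$ changes.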

The $A$ most recent episodes along with the top $B$ best games (in terms of $R_{\max}$ value) are stored in memory. At each DQN update (step 25) the subsample $S$ is formed by randomly drawing experiences from $\mathcal{E}$ and an experience from each of the top $B$ best games. Adding randomly drawn experiences to the mini-batch helps prevent the DQN from over learning during a particular time and episode.

4.2 Training Specifications

The Q-learning input parameters in Algorithm 2 for both DQN versions 1 & 2 were fixed as follows: the discount factor was set to $\gamma = 0.99$ and the exploration probability $\epsilon$ was initially set to 1 then uniformly decayed to .1 over the first 1 episodes. For experience replay, $A = 5$, $B = 5$, and the mini-batch size was set to $|S| = 3$. Additionally, for the first 5 episodes the top $B$ best games were not used in the mini-batch sample. The constants $c_1$ and $c_2$ used to calculate the reward (see steps 20 and 22) were fixed as .1 and .1, respectively. The total number of episodes $E$ is 15K for version 1 and K for version 2.

The objective input parameters in Algorithm 2 consist of the objective function $f(x)$ with lower bound $f_{lb}$, initial weights $x_1$, initial learning rate $\alpha_c$, encoding memory $M$, and the total number of time steps $T$. The objective function has the form

$$\frac{1}{N} \sum_{i=1}^{N} l(h(z_i; x), t_i) \tag{17}$$

where $z_i$ is an acoustic feature vector with phonetic label $t_i$, $l(\cdot)$ is a cross entropy loss, and $h(z; x)$ is a feedforward neural network parameterized by $x$ with sigmoid activations and a softmax function at the output layer. We set the input objective function to $f_{\text{train}}$, which has a neural network architecture and $N = 5$ data points. The number of time steps is $T = 1$ and $M = 3$.

At the start of each episode, the $x$ iterate is reset to $x_1$ and is updated for the first $M$ time steps using the initial learning rate (step 4) in order to form the first state feature vector. In steps 8 and 9, the six state features form the input to the DQN and the resulting action is determined by an $\epsilon$-greedy policy. Based on the action, the learning rate is either modified (step 11 or 13), or the current iterate is accepted and a new gradient direction is calculated (step 15).
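Three of the state features that feed the DQN in steps 8 and 9 lend themselves to a short sketch: the min/max encoding (11), the alignment measure (12), and the rescaling of a raw feature into $[-1, 1]$. The Python below is illustrative only. The function names are hypothetical, the orientation of the rescaling (minimum mapped to 1, maximum to -1) is an assumption read off the form of (13), and the clipping of out-of-range values is an addition for the example.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def encoding_feature(f_candidate, lowest_values):
    """1 if the candidate beats all stored lowest values, 0 if within them, else -1."""
    lo, hi = min(lowest_values), max(lowest_values)
    if f_candidate <= lo:
        return 1
    if f_candidate <= hi:
        return 0
    return -1


def alignment_feature(d_curr, d_prev):
    """Average sign agreement between successive descent directions."""
    n = len(d_curr)
    return sum(sign(a * b) for a, b in zip(d_curr, d_prev)) / n


def inverse_gap(value, lower_bound):
    """First transform for features converging to a lower bound: 1 / (s - c)."""
    return 1.0 / (value - lower_bound)


def rescale(value, v_min, v_max):
    """Linear map of [v_min, v_max] onto [-1, 1] (assumed orientation: v_min -> 1)."""
    clipped = min(max(value, v_min), v_max)  # clipping is an assumption of this sketch
    return 1.0 - 2.0 * (clipped - v_min) / (v_max - v_min)
```

An alignment value near 1 means successive directions agree in almost every coordinate, while a value near -1 signals oscillation, which is the kind of information RPROP-style schemes use to shrink a step size.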
The iterate $x$ is updated in step 17 and this causes the environment to change to the next state (step 18). The reward for arriving at state $s_{t+1}$ is calculated using either the objective value at the new iterate (step 20) or the previous iterate (step 22) for when the action is to accept. As an aside, we found it beneficial to calculate the reward for each action at the last time step since the targets associated with absorbing states do not change during training and thus play a vital role in propagating back information. The tuple $(s_t, a_t, r_{t+1}, s_{t+1})_e$ forms an experience and is added to memory $\mathcal{E}$ (step 24). In addition to the current experience, a random subset of experiences is drawn and used to form a mini-batch update for the DQN (step 25).

Special modifications were needed for training DQN v2 since one of its actions permits the learning rate to increase. Too large a learning rate resulted in updates that caused the objective function to diverge and consequently produce state vectors with infinite features. To prevent this from happening, we used a maximum and minimum learning rate as part of the training procedure. If DQN v2 attempted to increase/decrease the learning rate above/below these values then it would receive a reward of -1 and the episode would terminate early. In addition, we employed an rmsprop update procedure for training DQN v2 [20].

DQN versions 1 & 2 have an architecture of A with sigmoid activations for the hidden layers and an identity activation for the last layer. The initial learning rate was set to

$\alpha_c =$ for version 1 and $\alpha_c =$ for version 2. Additionally, for version 2 only learning rates in the range [.1, 8] were allowed.

Algorithm 2 Q-Learning with Experience Replay
Objective Parameters: $f$, $f_{lb}$, $x_1$, $\alpha_c$, $M$, $T$
Q-Learning Parameters: $E$, $\theta_0$, $\gamma$, $\epsilon$, $c_1$, $c_2$, $\beta$
 1: $\theta \leftarrow \theta_0$
 2: for $e = 1, \dots, E$ do    (for each learning episode)
 3:   for $t = 1, \dots, M-1$ do
 4:     $x_{t+1} = x_t - \alpha_c \nabla f(x_t)$
 5:   end for
 6:   Set $\bar{x} = x_M$, $d(\bar{x}) = -\nabla f(x_M)$, $\alpha_M = \alpha_c$
 7:   for $t = M, \dots, T$ do
 8:     Generate state feature vector $s_t$
 9:     Choose action $a_t$ according to $\epsilon$-greedy policy (9)
10:     if $a_t = a_{\text{half}}$ then
11:       $\alpha_{t+1} = \frac{1}{2} \alpha_t$
12:     else if $a_t = a_{\text{double}}$ then    (only for version 2)
13:       $\alpha_{t+1} = 2 \alpha_t$
14:     else if $a_t = a_{\text{accept}}$ then
15:       $\bar{x} = x_t$, $d(\bar{x}) = -\nabla f(\bar{x})$, $\alpha_{t+1} = \alpha_c$ (version 1) or $\alpha_{t+1} = \alpha_t$ (version 2)
16:     end if
17:     $x_{t+1} = \bar{x} + \alpha_{t+1} d(\bar{x})$
18:     Generate state feature vector $s_{t+1}$
19:     if $a_t \ne a_{\text{accept}}$ then
20:       $r_{t+1} = c_1 / (f(x_{t+1}) - f_{lb})$
21:     else if $a_t = a_{\text{accept}}$ then
22:       $r_{t+1} = c_2 / (f(\bar{x}) - f_{lb})$
23:     end if
24:     Add experience $(s_t, a_t, r_{t+1}, s_{t+1})_e$ to memory $\mathcal{E}$
25:     Sample $S \subset \mathcal{E}$ and update $\theta$ via (16)
26:   end for
27: end for
28: return $\theta$

5 Experiments

The trained DQNs along with the initial learning rates $\alpha_c$ are the input to the Q-gradient descent algorithms versions 1 & 2 outlined in Algorithm 1. Since there are no theoretical guarantees that the DQNs would find a good policy or converge, we demonstrate that Q-GD versions 1 & 2 are effective algorithms by comparing them against gradient descent with an Armijo or nonmonotone line search and show that the DQN q-values associated with the optimal actions converge to the discounted return of rewards at each time step. The line search algorithms operate under the same rules as Q-GD v1, but an iterate is accepted only if (10) is satisfied. We set $c = 1$ and $M = 3$ for nonmonotone and, by definition, $M = 1$

for Armijo.

5.1 Results on Train Function

[Figure 2: (a) objective value $f_{\text{train}}$ versus time step for Q-GD v1, Q-GD v2, nonmonotone, and Armijo; (b) learning rate versus time step, one panel each for Q-GD v1, Q-GD v2, nonmonotone LS, and Armijo LS.]

Figure 2: Comparison of Q-GD versions 1 & 2 and gradient descent with a nonmonotone or Armijo line search on train function.

We first compare Q-GD versions 1 & 2 and gradient descent with an Armijo or nonmonotone line search on the function used to train DQN versions 1 & 2; $f_{\text{train}}$ has the form (17) with $N = 5$ and a feedforward neural network architecture. Figure 2a demonstrates their performance in minimizing $f_{\text{train}}$ and Figure 2b plots the learning rate at each time step. After $T$ time steps, the final objective values are 1.86, 1.91, 1.98, and . for Q-GD v2, Q-GD v1, nonmonotone, and Armijo, respectively.

The plots of the learning rates illuminate why the Q-GD algorithms are superior. Q-GD v2 has the advantage that it can increase the learning rate, and its policy for minimizing the train function was very simple: it increased the learning rate to 8 during the initial time steps and then left the learning rate unchanged until decreasing it at each of the last seven time steps. Q-GD v1 offers a fairer comparison to the Armijo and nonmonotone line searches since the algorithms all follow the same structure: every time an iterate is accepted the learning rate is reset to $\alpha_c$ and can only then be decreased by a factor of two. The notable difference between Q-GD v1 and the line search algorithms is the frequency with which the learning rate is decreased. Q-GD v1 decreased the learning rate 5.1% of the time while the Armijo and nonmonotone line searches decreased the learning rate 36.% and 7.3% of the time. Q-GD v1 also only decreased the learning rate during the final quarter of the optimization procedure. The learned policies illustrate that a good initial learning rate is more important than a line search procedure for fast initial objective decrease.
Also, it is beneficial to decrease the learning rate more aggressively during the final time steps. Unlike the line searches, the Q-GD algorithms have knowledge of when the optimization procedure is going to end (since the number of time steps is an input parameter) and can adjust the learning rate accordingly.
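For reference, the acceptance rule that the baseline line searches apply, condition (10), can be sketched as a backtracking loop. This is a generic illustration rather than the experimental code: the function names, the halving cap `max_halvings`, and the constant `c = 0.1` used in the example below are assumptions. Setting `M = 1` gives the Armijo test, and larger `M` gives the nonmonotone variant.

```python
def accepts(f_new, recent_values, alpha, slope, c, M):
    """Condition (10): f_new <= max(last M objective values) + c * alpha * slope.

    slope is the dot product of the descent direction and the gradient
    (negative for a descent direction).
    """
    reference = max(recent_values[-M:])
    return f_new <= reference + c * alpha * slope


def backtrack(f, x, d, grad_x, history, alpha0, c, M, max_halvings=30):
    """Halve the learning rate until the candidate passes the test.

    history holds past objective values, ending with f(x).
    Returns the accepted learning rate and the new iterate.
    """
    slope = sum(di * gi for di, gi in zip(d, grad_x))
    alpha = alpha0
    x_new = list(x)
    for _ in range(max_halvings):
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        if accepts(f(x_new), history, alpha, slope, c, M):
            break
        alpha *= 0.5
    return alpha, x_new
```

With $f(x) = x^2$ at $x = 1$, direction $d = -2$, and `c = 0.1`, the Armijo test rejects the unit step and accepts after one halving, whereas a nonmonotone test with a larger earlier objective value in its window accepts the unit step immediately. This shows how the nonmonotone rule decreases the learning rate less often.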

5.2 Generalization Ability

[Figure 3: (a) objective value $f_{\text{test}}$ versus time step for Q-GD v1, Q-GD v2, nonmonotone, and Armijo; (b) learning rate versus time step, one panel each for Q-GD v1, Q-GD v2, nonmonotone LS, and Armijo LS.]

Figure 3: Comparison of Q-GD versions 1 & 2 and gradient descent with a nonmonotone or Armijo line search on test function.

We next test to determine if the strategy learned by DQN versions 1 & 2 on the train function also works for a new, but related function. The test function has the same form as the train function, but with three times the amount of data and double the number of variables: it has the form (17) with $N = 15$. The purpose of this configuration is to show that we can train the DQN using a small problem and later apply it to larger problems in terms of both variable size and data. We also increased the number of time steps.

Figure 3 exhibits how Q-GD versions 1 & 2 and the nonmonotone and Armijo line search algorithms perform on the test function. In Figure 3a, we observe that the algorithms retain their relative ordering regarding objective decrease in a fixed number of time steps; the final values are 1.73, 1.7, 1.8 and 1.89 for Q-GD v2, Q-GD v1, nonmonotone and Armijo, respectively. The gap in performance between Q-GD versions 1 & 2 reduced, showing that Q-GD v1 was more adept at generalizing to a new function. As with the train function, both Q-GD versions 1 & 2 decreased the learning rate less frequently than either the nonmonotone or Armijo line searches. However, both Q-GD versions were more cautious about using a higher learning rate at the start of the optimization procedure. Q-GD versions 1 & 2 maintained their underlying strategies, except that version 1 chose to decrease the learning rate during the first quarter and version 2 initially increased the learning rate to a smaller value than the 8 it reached on the train function. Overall, these results show that Q-GD versions 1 & 2 were robust when given a new, larger function and used over a longer number of time steps.

5.3 Convergence of DQN Q-values

The purpose of this section is to show that the six state features detailed in Section 3.2 are rich enough for the DQN to discriminate states in order to learn the q-values associated with the optimal actions. We also demonstrate the effect of individually zeroing out the state features for Q-GD version 1 on the train function. For the final episode, we recorded the q-value associated with the selected action (no longer using an $\epsilon$-greedy procedure) and resulting reward at each time step in order to compare the DQN

predicted q-values against the discounted return of rewards, defined by (3). Figure 4 shows that DQN v1's q-values converged to the discounted return of rewards while DQN v2 found the overall shape of the distribution. Even though DQN v2 was trained with more episodes (K versus 15K), the addition of one extra action exponentially increases the search space, creating a much more difficult problem.

[Figure 4: discounted return of rewards $R_t$ versus DQN q-value for the optimal action, $\max_a [Q(s_t)]_a$, with one panel each for version 1 and version 2.]

Figure 4: Plot of DQN versions 1 & 2 predicted q-value for optimal action versus the discounted return of rewards (3) at each time step on train function.

To investigate how the state features influence the Q-GD algorithms, we ran Q-GD v1 with either the objective value, gradient norm, or alignment measure set to zero; since the features are transformed to lie in the interval $[-1, 1]$ this corresponds to fixing a given feature at its median value. We left the learning rate, objective encoding, and number of time steps unchanged as they are arguably the bare minimum inputs needed to satisfy the Markov property. Table 1 reports the final objective value and the ratio of halving the learning rate to accepting an iterate obtained when a given state feature is set to zero during a run of Q-GD v1 on the train function. The baseline (none of the features are set to zero) is a final objective of 1.91 and a 51/96 half/accept ratio. As a result of zeroing out a state feature, DQN v1 chooses to halve the learning rate more frequently and ends up with a worse solution. This experiment shows that DQN v1 depends on each feature to determine the appropriate action.

Table 1: Effect of setting a state feature to zero. Baseline (none of the features are set to zero) is a final objective value of 1.91 and a 51/96 half/accept ratio.

Feature            | Objective | Half/Accept
objective value    |           | /677
gradient norm      |           | /7
alignment measure  | .         | 36/633

6 Related Work

Neural network models yield state of the art performance in speech recognition, natural language processing, and computer vision [6, 8, 3].
Tesauro popularized neural networks as an approximation to the value function [19], which Riedmiller later extended to the action-value function with the

advent of the Neural Fitted Q Iteration [15]. Applications of using neural networks in RL appear in settings ranging from playing games to robotics [10, 12].

Using reinforcement learning to replace an optimization heuristic or be embedded within the optimization algorithm has been explored in a variety of domains [2, 4, 11, 13, 17]. However, none of the previous approaches use deep Q-learning or our proposed RL formulation. Our work is most similar to [17]; the authors use RL to replace a Levenberg-Marquardt heuristic for controlling a damping parameter used in a Gauss-Newton update routine. Unlike our work, they approximate the action-value function by a linear combination of basis functions, which they train using Least Squares Policy Iteration. To our knowledge, our work is the first to successfully apply deep Q-learning to controlling an optimization hyperparameter.

7 Conclusions

This paper lays the foundation for using deep Q-learning to control an optimization hyperparameter. We defined the state, reward function, and actions such that a DQN could learn how to control the learning rate used in a gradient-based optimization routine, resulting in two Q-gradient descent algorithms. Given that there are no theoretical guarantees that the DQN would find the optimal policy or that its q-values would converge, we presented numerical evidence that the Q-GD algorithms performed better than gradient descent with either an Armijo or nonmonotone line search and that the DQNs' q-values for the optimal action converged to the discounted return of rewards at each time step. Additionally, we demonstrated that the Q-GD algorithms were able to generalize when the train function was replaced with a larger test function.

A main advantage of the Q-gradient descent method is that it can easily incorporate any objective statistic by adding it to the state feature vector. Future areas of work involve using this framework to explore additional state features that can facilitate optimization decisions. We trained the DQNs in a simple environment in order to demonstrate feasibility.
To make this method practical for large scale optimization it is necessary to extend Q-GD to the stochastic regime, that is, to create Q-stochastic gradient descent. A final area of work involves expanding the actions to include controlling additional hyperparameters, such as a momentum term. Overall, the presented framework allows us to develop new optimization algorithms and gain intuition into the type of strategies that are successful for minimizing neural networks.

References

[1] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1-3, 1966.
[2] Justin A Boyan and Andrew W Moore. Learning evaluation functions for global optimization and boolean satisfiability. In AAAI/IAAI, pages 3-10, 1998.
[3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[4] Marco Dorigo and LM Gambardella. Ant-Q: A reinforcement learning approach to the traveling salesman problem. In International Conference on Machine Learning, pages 252-260, 1995.
[5] Luigi Grippo, Francesco Lampariello, and Stephano Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23(4):707-716, 1986.
[6] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, and Tara N Sainath. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82-97, 2012.
[7] Nitish Shirish Keskar and George Saon. A nonmonotone learning rate strategy for SGD training of deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[9] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer, 2012.
[10] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
[11] Victor V Miagkikh and William F Punch III. Global search in combinatorial optimization using reinforcement learning algorithms. In Proceedings of the Congress on Evolutionary Computation, volume 1. IEEE, 1999.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[13] Robert Moll, Theodore J Perkins, and Andrew G Barto. Machine learning for subproblem selection. In ICML: Proceedings of the Seventeenth International Conference on Machine Learning, pages 615-622, 2000.
[14] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[15] Martin Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005. Springer, 2005.
[16] Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Neural Networks, 1993., IEEE International Conference on, pages 586-591. IEEE, 1993.
[17] Paul L Ruvolo, Ian Fasel, and Javier R Movellan. Optimization on a budget: A reinforcement learning approach. In Advances in Neural Information Processing Systems, 2009.
[18] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[19] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
[20] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop. COURSERA: Neural Networks for Machine Learning, 2012.
[21] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK

CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 175 CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 10.1 INTRODUCTION Amongs he research work performed, he bes resuls of experimenal work are validaed wih Arificial Neural Nework. From he

More information

Vehicle Arrival Models : Headway

Vehicle Arrival Models : Headway Chaper 12 Vehicle Arrival Models : Headway 12.1 Inroducion Modelling arrival of vehicle a secion of road is an imporan sep in raffic flow modelling. I has imporan applicaion in raffic flow simulaion where

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

A Reinforcement Learning Approach for Collaborative Filtering

A Reinforcement Learning Approach for Collaborative Filtering A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9) CSE/NB 528 Lecure 14: Reinforcemen Learning Chaper 9 Image from hp://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg Lecure figures are from Dayan & Abbo s book hp://people.brandeis.edu/~abbo/book/index.hml

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear In The name of God Lecure4: Percepron and AALIE r. Majid MjidGhoshunih Inroducion The Rosenbla s LMS algorihm for Percepron 958 is buil around a linear neuron a neuron ih a linear acivaion funcion. Hoever,

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD HAN XIAO 1. Penalized Leas Squares Lasso solves he following opimizaion problem, ˆβ lasso = arg max β R p+1 1 N y i β 0 N x ij β j β j (1.1) for some 0.

More information

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14 CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Single-Pass-Based Heuristic Algorithms for Group Flexible Flow-shop Scheduling Problems

Single-Pass-Based Heuristic Algorithms for Group Flexible Flow-shop Scheduling Problems Single-Pass-Based Heurisic Algorihms for Group Flexible Flow-shop Scheduling Problems PEI-YING HUANG, TZUNG-PEI HONG 2 and CHENG-YAN KAO, 3 Deparmen of Compuer Science and Informaion Engineering Naional

More information

Lab 10: RC, RL, and RLC Circuits

Lab 10: RC, RL, and RLC Circuits Lab 10: RC, RL, and RLC Circuis In his experimen, we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors. We will sudy he way volages and currens change in

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Dimitri Solomatine. D.P. Solomatine. Data-driven modelling (part 2). 2

Dimitri Solomatine. D.P. Solomatine. Data-driven modelling (part 2). 2 Daa-driven modelling. Par. Daa-driven Arificial di Neural modelling. Newors Par Dimiri Solomaine Arificial neural newors D.P. Solomaine. Daa-driven modelling par. 1 Arificial neural newors ANN: main pes

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Isolated-word speech recognition using hidden Markov models

Isolated-word speech recognition using hidden Markov models Isolaed-word speech recogniion using hidden Markov models Håkon Sandsmark December 18, 21 1 Inroducion Speech recogniion is a challenging problem on which much work has been done he las decades. Some of

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs

A Primal-Dual Type Algorithm with the O(1/t) Convergence Rate for Large Scale Constrained Convex Programs PROC. IEEE CONFERENCE ON DECISION AND CONTROL, 06 A Primal-Dual Type Algorihm wih he O(/) Convergence Rae for Large Scale Consrained Convex Programs Hao Yu and Michael J. Neely Absrac This paper considers

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Learning to Take Concurrent Actions

Learning to Take Concurrent Actions Learning o Take Concurren Acions Khashayar Rohanimanesh Deparmen of Compuer Science Universiy of Massachuses Amhers, MA 0003 khash@cs.umass.edu Sridhar Mahadevan Deparmen of Compuer Science Universiy of

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

A Hop Constrained Min-Sum Arborescence with Outage Costs

A Hop Constrained Min-Sum Arborescence with Outage Costs A Hop Consrained Min-Sum Arborescence wih Ouage Coss Rakesh Kawara Minnesoa Sae Universiy, Mankao, MN 56001 Email: Kawara@mnsu.edu Absrac The hop consrained min-sum arborescence wih ouage coss problem

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl

Time series model fitting via Kalman smoothing and EM estimation in TimeModels.jl Time series model fiing via Kalman smoohing and EM esimaion in TimeModels.jl Gord Sephen Las updaed: January 206 Conens Inroducion 2. Moivaion and Acknowledgemens....................... 2.2 Noaion......................................

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Learning Objectives: Practice designing and simulating digital circuits including flip flops Experience state machine design procedure

Learning Objectives: Practice designing and simulating digital circuits including flip flops Experience state machine design procedure Lab 4: Synchronous Sae Machine Design Summary: Design and implemen synchronous sae machine circuis and es hem wih simulaions in Cadence Viruoso. Learning Objecives: Pracice designing and simulaing digial

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

5 The fitting methods used in the normalization of DSD

5 The fitting methods used in the normalization of DSD The fiing mehods used in he normalizaion of DSD.1 Inroducion Sempere-Torres e al. 1994 presened a general formulaion for he DSD ha was able o reproduce and inerpre all previous sudies of DSD. The mehodology

More information

Particle Swarm Optimization

Particle Swarm Optimization Paricle Swarm Opimizaion Speaker: Jeng-Shyang Pan Deparmen of Elecronic Engineering, Kaohsiung Universiy of Applied Science, Taiwan Email: jspan@cc.kuas.edu.w 7/26/2004 ppso 1 Wha is he Paricle Swarm Opimizaion

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Licenciatura de ADE y Licenciatura conjunta Derecho y ADE. Hoja de ejercicios 2 PARTE A

Licenciatura de ADE y Licenciatura conjunta Derecho y ADE. Hoja de ejercicios 2 PARTE A Licenciaura de ADE y Licenciaura conjuna Derecho y ADE Hoja de ejercicios PARTE A 1. Consider he following models Δy = 0.8 + ε (1 + 0.8L) Δ 1 y = ε where ε and ε are independen whie noise processes. In

More information

Conservative Contextual Linear Bandits

Conservative Contextual Linear Bandits Conservaive Conexual Linear Bandis Abbas Kazerouni Sanford Universiy abbask@sanford.edu Yasin Abbasi-Yadkori Adobe Research abbasiya@adobe.com Mohammad Ghavamzadeh DeepMind ghavamza@google.com Benjamin

More information

Introduction to Probability and Statistics Slides 4 Chapter 4

Introduction to Probability and Statistics Slides 4 Chapter 4 Inroducion o Probabiliy and Saisics Slides 4 Chaper 4 Ammar M. Sarhan, asarhan@mahsa.dal.ca Deparmen of Mahemaics and Saisics, Dalhousie Universiy Fall Semeser 8 Dr. Ammar Sarhan Chaper 4 Coninuous Random

More information

Testing for a Single Factor Model in the Multivariate State Space Framework

Testing for a Single Factor Model in the Multivariate State Space Framework esing for a Single Facor Model in he Mulivariae Sae Space Framework Chen C.-Y. M. Chiba and M. Kobayashi Inernaional Graduae School of Social Sciences Yokohama Naional Universiy Japan Faculy of Economics

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information
