Reinforcement Learning: A Tutorial

Mance E. Harmon
WL/AACF
2241 Avionics Circle
Wright Laboratory
Wright-Patterson AFB, OH
mharmon@acm.org

Stephanie S. Harmon
Wright State University
56-8 Mallard Glen Drive
Centerville, OH

Scope of Tutorial

The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented, as are the most popular RL algorithms. Section 1 presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section 2 the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section 3 gives a description of the most widely used reinforcement learning algorithms. These include TD(λ) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section 4 some of the ancillary issues in RL are briefly discussed, such as choosing an exploration strategy and an appropriate discount factor. The conclusion is given in Section 5. Finally, Section 6 is a glossary of commonly used terms, followed by references in Section 7 and a bibliography of RL applications in Section 8. The tutorial structure is such that each section builds on the information provided in previous sections. It is assumed that the reader has some knowledge of learning algorithms that rely on gradient descent (such as the backpropagation of errors algorithm).

1 Introduction

There are many unsolved problems that computers could solve if the appropriate software existed. Flight control systems for aircraft, automated manufacturing systems, and sophisticated avionics systems all present difficult, nonlinear control problems. Many of these problems are currently unsolvable, not because current computers are too slow or have too little memory, but simply because it is too difficult to determine what the program should do. If a computer could learn to solve the problems through trial and error, that would be of great practical value.

Reinforcement learning is an approach to machine intelligence that combines two disciplines to successfully solve problems that neither discipline can address individually. Dynamic programming is a field of mathematics that has traditionally been used to solve problems of optimization and control. However, traditional dynamic programming is limited in the size and complexity of the problems it can address. Supervised learning is a general method for training a parameterized function approximator, such as a neural network, to represent functions. However, supervised learning requires sample input-output pairs from the function to be learned. In other words, supervised learning requires a set of questions with the right answers. For example, we might not know the best way to program a computer to recognize an infrared picture of a tank, but we do have a large collection of infrared pictures, and we do know whether each picture contains a tank or not.

Supervised learning could look at all the examples with answers, and learn how to recognize tanks in general. Unfortunately, there are many situations where we don't know the correct answers that supervised learning requires. For example, in a flight control system, the question would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. Simple neural networks can't learn to fly the plane unless there is a set of known answers, so if we don't know how to build a controller in the first place, simple supervised learning won't help.

For these reasons there has been much interest recently in a different approach known as reinforcement learning (RL). Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach that addresses a different, more difficult question. Reinforcement learning combines the fields of dynamic programming and supervised learning to yield powerful machine-learning systems. Reinforcement learning appeals to many researchers because of its generality. In RL, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial-and-error interactions with its environment. Thus, many researchers are pursuing this form of machine intelligence and are excited about the possibility of solving problems that have been previously unsolvable.

To provide the intuition behind reinforcement learning, consider the problem of learning to ride a bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two actions possible: turn the handlebars left or turn them right. The RL system turns the handlebars to the left and immediately crashes to the ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handlebars left when tilted 45 degrees to the right. In the next trial the RL system performs a series of actions that again result in the bicycle being tilted 45 degrees to the right. The RL system knows not to turn the handlebars to the left, so it performs the only other possible action: turn right. It immediately crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has not only learned that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad. Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. Two actions are possible: turn right or turn left. The RL system turns the handlebars left, which results in the bicycle being tilted 45 degrees to the right, and ultimately results in a strong negative reinforcement. The RL system has just learned not to turn the handlebars to the left when tilted 40 degrees to the right. By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.

2 The Parts Of A Reinforcement Learning Problem

In the standard reinforcement learning model an agent interacts with its environment. This interaction takes the form of the agent sensing the environment, and based on this sensory input choosing an action to perform in the environment. The action changes the environment in some manner and this change is communicated to the agent through a scalar reinforcement signal.
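The interaction just described can be captured in a few lines of code. Below is a minimal Python sketch of the sense-act-learn loop; the Environment and Agent interfaces are assumptions introduced here for illustration and are not part of the tutorial itself.

    # Minimal sketch of the agent-environment interaction loop (assumed interfaces).
    class Environment:
        def observe(self):
            """Return the current state (e.g., a vector of sensor readings)."""
            raise NotImplementedError

        def step(self, action):
            """Apply an action; return (next_state, reinforcement, done)."""
            raise NotImplementedError

    class Agent:
        def choose_action(self, state):
            """Select an action to perform in the given state."""
            raise NotImplementedError

        def learn(self, state, action, reinforcement, next_state):
            """Update the agent from one observed transition."""
            raise NotImplementedError

    def run_trial(agent, env, max_steps=1000):
        """One trial of trial-and-error interaction with the environment."""
        state = env.observe()
        for _ in range(max_steps):
            action = agent.choose_action(state)
            next_state, reinforcement, done = env.step(action)
            agent.learn(state, action, reinforcement, next_state)
            state = next_state
            if done:
                break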
There are three fundamental parts of a reinforcement learning problem: the environment, the reinforcement function, and the value function.

The Environment

Every RL system learns a mapping from situations to actions by trial-and-error interactions with a dynamic environment. This environment must at least be partially observable by the reinforcement learning system, and the observations may come in the form of sensor readings, symbolic descriptions, or possibly mental situations (e.g., the situation of being lost). The actions may be low level (e.g., voltage to motors), high level (e.g., accept job offer), or even mental (e.g., shift in focus of attention). If the RL system can observe perfectly all the information in the environment that might influence the choice of action to perform, then the RL system chooses actions based on true states of the environment. This ideal case is the best possible basis for reinforcement learning and, in fact, is a necessary condition for much of the associated theory.

The Reinforcement Function

As stated previously, RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. The goal of the RL system is defined using the concept of a reinforcement function, which is the exact function of future reinforcements the agent seeks to maximize. In other words, there exists a mapping from state/action pairs to reinforcements; after performing an action in a given state the RL agent will receive some reinforcement (reward) in the form of a scalar value. The RL agent learns to perform actions that will maximize the sum of the reinforcements received when starting from some initial state and proceeding to a terminal state. It is the job of the RL system designer to define a reinforcement function that properly defines the goals of the RL agent. Although complex reinforcement functions can be defined, there are at least three noteworthy classes often used to construct reinforcement functions that properly define the desired goals.

Pure Delayed Reward and Avoidance Problems

In the Pure Delayed Reward class of functions the reinforcements are all zero except at the terminal state. The sign of the scalar reinforcement at the terminal state indicates whether the terminal state is a goal state (a reward) or a state that should be avoided (a penalty). For example, if one wanted an RL agent to learn to play the game of backgammon, the system could be defined as follows. The situation (state) would be the configuration of the playing board (the location of each player's pieces). In this case there are approximately 10^20 different possible states. The actions available to the agent are the set of legal moves. The reinforcement function is defined to be zero after every turn except when an action results in a win or a loss, in which case the agent receives a +1 reinforcement for a win, and a -1 reinforcement for a loss. Because the agent is trying to maximize the reinforcement, it will learn that the states corresponding to a win are goal states and states resulting in a loss are to be avoided.

Another example of a pure delayed reward reinforcement function can be found in the standard cart-pole or inverted pendulum problem. A cart supporting a hinged, inverted pendulum is placed on a finite track. The goal of the RL agent is to learn to balance the pendulum in an upright position without hitting the end of the track. The situation (state) is the dynamic state of the cart-pole system. Two actions are available to the agent in each state: move the cart left, or move the cart right. The reinforcement function is zero everywhere except for the states in which the pole falls or the cart hits the end of the track, in which case the agent receives a -1 reinforcement. Again, because the agent is trying to maximize total reinforcement, the agent will learn the sequence of actions necessary to balance the pole and avoid the -1 reinforcement.
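A pure delayed reward function such as the cart-pole one can be written down directly. The following Python sketch is illustrative only; the state variables (pole angle, cart position) and the failure thresholds are assumptions, not values given in the tutorial.

    # Pure delayed reward for the cart-pole example: 0 everywhere except failure.
    # The thresholds below are illustrative assumptions.
    POLE_LIMIT_RADIANS = 0.21    # pole considered fallen beyond roughly 12 degrees
    TRACK_LIMIT_METERS = 2.4     # distance at which the cart hits the end of the track

    def reinforcement(pole_angle, cart_position):
        """Return -1 when the pole falls or the cart hits the end of the track,
        and 0 on every other transition."""
        failed = (abs(pole_angle) > POLE_LIMIT_RADIANS
                  or abs(cart_position) > TRACK_LIMIT_METERS)
        return -1.0 if failed else 0.0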

Minimum Time to Goal

Reinforcement functions in this class cause an agent to perform actions that generate the shortest path or trajectory to a goal state. An example is an experiment commonly known as the "car on the hill" problem. The problem is defined as that of a stationary car being positioned between two steep inclines. The goal of the driver (RL agent) is to successfully drive up the incline on the right to reach a goal state at the top of the hill. The state of the environment is the car's position and velocity. Three actions are available to the agent in each state: forward thrust, backward thrust, or no thrust at all. The dynamics of the system are such that the car does not have enough thrust to simply drive up the hill. Rather, the driver must learn to use momentum to its advantage to gain enough velocity to successfully climb the hill. The reinforcement function is -1 for ALL state transitions except the transition to the goal state, in which case a zero reinforcement is returned. Because the agent wishes to maximize reinforcement, it learns to choose actions that minimize the time it takes to reach the goal state, and in so doing learns the optimal strategy for driving the car up the hill.

Games

Thus far it has been assumed that the learning agent always attempts to maximize the reinforcement function. This need not be the case. The learning agent could just as easily learn to minimize the reinforcement function. This might be the case when the reinforcement is a function of limited resources and the agent must learn to conserve these resources while achieving a goal (e.g., an airplane executing a maneuver while conserving as much fuel as possible). An alternative reinforcement function would be used in the context of a game environment, when there are two or more players with opposing goals. In a game scenario, the RL system can learn to generate optimal behavior for the players involved by finding the maximin, minimax, or saddlepoint of the reinforcement function. For example, a missile might be given the goal of minimizing the distance to a given target (in this case an airplane). The airplane would be given the opposing goal of maximizing the distance to the missile. The agent would evaluate the state for each player and would choose an action independent of the other player's action. These actions would then be executed in parallel. Because the actions are chosen independently and executed simultaneously, the RL agent learns to choose actions for each player that would generate the best outcome for the given player in a worst-case scenario. The agent will perform actions for the missile that will minimize the maximum distance to the airplane, assuming the airplane will choose the action that maximizes the same distance. The agent will perform actions for the airplane that will maximize the minimum distance to the missile, assuming the missile will perform the action that will minimize the same distance. A more detailed discussion of this alternative can be found in Harmon, Baird, and Klopf (1994), and Littman (1994).

The Value Function

In previous sections the environment and the reinforcement function were discussed. However, the issue of how the agent learns to choose good actions, or even how we might measure the utility of an action, has not been explained. First, two terms are defined. A policy determines which action should be performed in each state; a policy is a mapping from states to actions. The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state. The optimal policy would therefore be the mapping from states to actions that maximizes the sum of the reinforcements when starting in an arbitrary state and performing actions until a terminal state is reached. Under this definition the value of a state is dependent upon the policy. The value function is a mapping from states to state values and can be approximated using any type of function approximator (e.g., multilayered perceptron, memory-based system, radial basis functions, look-up table, etc.).

An example of a value function can be seen using a simple Markov decision process with 16 states. The state space can be visualized using a 4x4 grid. Each square represents a state.
The reinforcement function is -1 everywhere (i.e., the agent receives a reinforcement of -1 on each transition). There are 4 actions possible in each state: north, south, east, west. The goal states are the upper left corner and the lower right corner. The value function for the random policy is shown in Figure 1. For each state the random policy randomly chooses one of the four possible actions. The numbers in the states represent the expected values of the states. For example, when starting in the lower left corner and following a random policy, on average there will be 22 transitions to other states before the terminal state is reached.

[Figure 1: state values of the 4x4 grid under the random policy.]
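The random-policy values in Figure 1 can be reproduced by repeatedly applying the expected update for the equiprobable random policy. The Python sketch below is one possible implementation under the grid conventions described above; the boundary behavior (moves off the grid leave the state unchanged) is an assumption about a detail the text does not state.

    # Iterative evaluation of the equiprobable random policy on the 4x4 grid.
    import numpy as np

    N = 4
    GOALS = {(0, 0), (N - 1, N - 1)}                 # upper-left and lower-right corners
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

    def step(state, action):
        """Deterministic grid transition; moves off the grid leave the state unchanged."""
        r, c = state
        nr, nc = r + action[0], c + action[1]
        if not (0 <= nr < N and 0 <= nc < N):
            nr, nc = r, c
        return (nr, nc)

    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                # expected value under the random policy: -1 plus the successor value
                new_v = np.mean([-1.0 + V[step((r, c), a)] for a in ACTIONS])
                delta = max(delta, abs(new_v - V[r, c]))
                V[r, c] = new_v
        if delta < 1e-6:
            break

    print(np.round(V))   # the lower-left corner converges to about -22, as in Figure 1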

The optimal value function is shown in Figure 2. Again, starting in the lower left corner, calculating the sum of the reinforcements when performing the optimal policy (the policy that will maximize the sum of the reinforcements), the value of that state is -3 because it takes only 3 transitions to reach a terminal state. If we are given the optimal value function, then it becomes a trivial task to extract the optimal policy. For example, one can start in any state in Figure 2 and simply choose the action that maximizes the sum of the immediate reinforcement received and the value of the successor state. In other words, one can perform a one-level-deep breadth-first search over actions to find the action that will maximize the immediate reward. The optimal policy for the value function shown in Figure 2 is given in Figure 3. This leads us to the fundamental question of almost all of reinforcement learning research: how do we devise an algorithm that will efficiently find the optimal value function?

[Figure 2: the optimal value function for the 4x4 grid. Figure 3: the corresponding optimal policy.]
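The one-level-deep search described above can be sketched as follows; this reuses N, GOALS, ACTIONS, step(), and the value table V from the previous sketch, which is purely an assumption made for illustration.

    # Extract a policy from a value function by one-step lookahead over actions.
    def greedy_policy(V):
        policy = {}
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                # one-level-deep breadth-first search over the four actions
                policy[(r, c)] = max(ACTIONS,
                                     key=lambda a: -1.0 + V[step((r, c), a)])
        return policy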

3 Approximating the Value Function

Reinforcement learning is a difficult problem because the learning system may perform an action and not be told whether that action was good or bad. For example, a learning auto-pilot program might be given control of a simulator and told not to crash. It will have to make many decisions each second and then, after acting on thousands of decisions, the aircraft might crash. What should the system learn from this experience? Which of its many actions were responsible for the crash? Assigning blame to individual actions is the problem that makes reinforcement learning difficult. Surprisingly, there is a solution to this problem. It is based on a field of mathematics called dynamic programming, and it involves just two basic principles. First, if an action causes something bad to happen immediately, such as crashing the plane, then the system learns not to do that action in that situation again. So whatever action the system performed one millisecond before the crash, it will avoid doing in the future. But that principle doesn't help for all the earlier actions which didn't lead to immediate disaster. The second principle is that if all the actions in a certain situation lead to bad results, then that situation should be avoided. So if the system has experienced a certain combination of altitude and airspeed many different times, trying a different action each time, and all actions led to something bad, then it will learn that the situation itself is bad. This is a powerful principle, because the learning system can now learn without crashing. In the future, any time it chooses an action that leads to this particular situation, it will immediately learn that the particular action is bad, without having to wait for the crash. By using these two principles, a learning system can learn to fly a plane, control a robot, or do any number of tasks. It can first learn on a simulator, then fine-tune on the actual system. This technique is generally referred to as dynamic programming, and a slightly closer analysis will reveal how dynamic programming can generate the optimal value function.

The Essence of Dynamic Programming

Initially, the approximation of the optimal value function is poor. In other words, the mapping from states to state values is not valid. The primary objective of learning is to find the correct mapping. Once this is completed, the optimal policy can easily be extracted. At this point some notation needs to be introduced: V*(x_t) is the optimal value function, where x_t is the state vector at time t; V(x_t) is the approximation of the value function; γ is a discount factor in the range [0, 1] that causes immediate reinforcement to have more importance (to be weighted more heavily) than future reinforcement. (A more complete discussion of γ is presented in Section 4.) In general, V(x_t) will be initialized to random values and will contain no information about the optimal value function V*(x_t). This means that the approximation of the optimal value function in a given state is equal to the true value of that state V*(x_t) plus some error in the approximation, as expressed in equation (1),

    V(x_t) = e(x_t) + V*(x_t)    (1)

where e(x_t) is the error in the approximation of the value of the state occupied at time t. Likewise, the approximation of the value of the state reached after performing some action at time t is the true value of the state occupied at time t+1 plus some error in the approximation, as expressed in equation (2).

    V(x_{t+1}) = e(x_{t+1}) + V*(x_{t+1})    (2)

As stated previously, the value of state x_t for the optimal policy is the sum of the reinforcements when starting from state x_t and performing optimal actions until a terminal state is reached. By this definition, a simple relationship exists between the values of successive states, x_t and x_{t+1}.
This relationship is defined by the Bellman equation and is expressed in equation (3). The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future. (A discussion of the function of γ in this equation can be found in Section 4.)

    V*(x_t) = r(x_t) + γ V*(x_{t+1})    (3)

The approximation V(x_t) also has the same relationship, as shown in equation (4). By substituting the right-hand side of equations (1) and (2) into equation (4) we get equation (5), and expanding yields equation (6).

    V(x_t) = r(x_t) + γ V(x_{t+1})    (4)

    e(x_t) + V*(x_t) = r(x_t) + γ ( e(x_{t+1}) + V*(x_{t+1}) )    (5)

    e(x_t) + V*(x_t) = r(x_t) + γ e(x_{t+1}) + γ V*(x_{t+1})    (6)

Using equation (3), V*(x_t) is subtracted from both sides of equation (6) to reveal the relationship in the errors of successive states. This relationship is expressed in equation (7).

    e(x_t) = γ e(x_{t+1})    (7)

The significance of this relationship can be seen by using the simple Markov chain shown in Figure 4. In Figure 4 the state labeled T is the terminal state. The true value of this state is known a priori. In other words, the error in the approximation of the state labeled T, e(T), is 0 by definition. An analogy might be the state in which a missile ultimately hits or misses its target. The true value of this state is known: +1 for a hit, -1 for a miss. The process of learning is the process of finding an approximation V(x_t) that makes equations (3) and (7) true for all states x. If the approximation error in state 3 is a factor of γ smaller than the error in state T, which is by definition 0, then the approximation error in state 3 must also be 0. If equation (7) is true for all x, then the approximation error in each state x is necessarily 0, ergo V(x_t) = V*(x_t) for all x.

[Figure 4: a Markov chain of states 1, 2, 3 leading to a terminal state T, with a reinforcement r on each transition.]

The importance of the discount factor can be seen by using another simple Markov chain that has no terminal state (Figure 5). Using a similar argument as that used for Figure 4, one can see that when equation (7) is satisfied the approximation error must be 0 for all states. The only difference in the prior example and in this case is that the error in any given state must be a factor of γ^6 smaller than itself (because there are 6 states in the cycle). Therefore, equation (7) can only be satisfied if the approximation error is 0 in every state. Therefore, as stated earlier, the process of learning is the process of finding a solution to equation (4) for all states x (which is also a solution to equation (7)). Several learning algorithms have been developed for precisely this task.

[Figure 5: a cycle of six states with no terminal state; around the cycle, e(1) = γe(2), e(2) = γe(3), ..., e(6) = γe(1).]

Value Iteration

If it is assumed that the function approximator used to represent V* is a lookup table (each state has a corresponding element in the table whose entry is the approximated state value), then one can find the optimal value function by performing sweeps through state space, updating the value of each state according to equation (8) until a sweep through state space is performed in which there are no changes to state values (the state values have converged).

    Δw = max_u [ r(x_t, u) + γ V(x_{t+1}) ] − V(x_t)    (8)

In equation (8), u is the action performed in state x_t that causes a transition to state x_{t+1}, and r(x_t, u) is the reinforcement received when performing action u in state x_t. Figure 6 illustrates the update.

[Figure 6: the scope of a single value iteration update: state x_t, its two candidate actions, and their successor states.]

Figure 6 depicts the scope of a single update to the approximation of the value of x_t. Specific to this example, there are two actions possible in state x_t, and each of these actions leads to a different successor state x_{t+1}. In a value iteration update, one must first find the action that returns the maximum value. The only way to accomplish this is to actually perform an action and calculate the sum of the reinforcement received and the (possibly discounted) approximated value of the successor state V(x_{t+1}). This must be done for all actions u in a given state x_t, and is not possible without a model of the dynamics of the system. For example, in the case of a robot deciding to choose between paths to follow, it is not possible to choose one path, observe the successor state, and then return to the starting state to explore the results of the next available action. Instead, the robot must in simulation perform these actions and observe the results. Then, based on the simulation results, the robot may choose the action that results in the maximum value.

One should note that the right side of equation (8) is simply the difference in the two sides of the Bellman equation defined in equation (4), with the exception that we have generalized the equation to allow for Markov decision processes (multiple actions possible in a given state) rather than Markov chains (a single action possible in every state). This expression is the Bellman residual, and is formally defined by equation (9).

    e(x_t) = max_u [ r(x_t, u) + γ V(x_{t+1}) ] − V(x_t)    (9)

E(x_t) is the error function defined by the Bellman residual over all of state space. Each update (equation (8)) reduces the value of E(x_t), and in the limit as the number of updates goes to infinity, E(x_t) = 0. When E(x_t) = 0, equation (4) is satisfied and V(x_t) = V*(x_t). Learning is accomplished.
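For a finite MDP represented by a lookup table, the sweep of equation (8) can be sketched as below. The model interface (states(), actions(x), successor(x, u), reinforcement(x, u), is_terminal(x)) is an assumption introduced here; as the text notes, such a model (or a simulator) is required in order to evaluate every action in a state.

    # Tabular value iteration: sweep the state space with the update of equation (8).
    GAMMA = 0.9   # discount factor (assumed value)

    def value_iteration(model, tolerance=1e-6):
        V = {x: 0.0 for x in model.states()}
        while True:
            max_change = 0.0
            for x in model.states():
                if model.is_terminal(x):
                    continue
                # max over actions of the reinforcement plus discounted successor value
                best = max(model.reinforcement(x, u) + GAMMA * V[model.successor(x, u)]
                           for u in model.actions(x))
                max_change = max(max_change, abs(best - V[x]))
                V[x] = best
            if max_change < tolerance:   # no (significant) change in any state value
                return V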

Residual Gradient Algorithms

Thus far it has been assumed that our function approximator is a lookup table. This is normally the case in classical dynamic programming. However, this assumption severely limits the size and complexity of the problems solvable. Many real-world problems have extremely large or even continuous state spaces. In practice it is not possible to represent the value function for such problems using a lookup table. Hence, an extension to classical value iteration is to use a function approximator that can generalize and interpolate values of states never before seen. For example, one might use a neural network for the approximation V(x_t, w_t) of V*(x_t), where w_t is the parameter vector. The resulting network parameter update is given in equation (10).

    Δw = α [ max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) − V(x_t, w_t) ] ∂V(x_t, w_t)/∂w    (10)

It is useful to draw an analogy to the update equation used in supervised learning algorithms when first examining equation (10). In this context, α is the learning rate, max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) is the desired output of the network, V(x_t, w_t) is the actual output of the network, and ∂V(x_t, w_t)/∂w is the gradient of the output of the network with respect to the parameter vector. It appears that we are performing updates that will minimize the Bellman residual, but this is not necessarily the case. The target value max_u ( r(x_t, u) + γ V(x_{t+1}, w_t) ) is a function of the parameter vector w at time t. Once the update to w is performed, the target has changed because it is now a function of a different parameter vector (the vector at time t+1). It is possible that the Bellman residual has actually been increased rather than decreased. The error function on which gradient descent is being performed changes with every update to the parameter vector. This can result in the values of the network parameter vector oscillating or even growing to infinity. One solution to this problem is to perform gradient descent on the mean squared Bellman residual. Because this defines an unchanging error function, convergence to a local minimum is guaranteed. This means that we can get the benefit of the generality of neural networks while still guaranteeing convergence. The resulting parameter update is given in equation (11).

    Δw = −α [ r(x_t) + γ V(x_{t+1}, w_t) − V(x_t, w_t) ] [ γ ∂V(x_{t+1}, w_t)/∂w − ∂V(x_t, w_t)/∂w ]    (11)

The resulting method is referred to as a residual gradient algorithm because gradient descent is performed on the mean squared Bellman residual. Therefore, equation (11) is the update equation for residual value iteration, and equation (10) is the update equation for direct value iteration. It is important to note that if the MDP is non-deterministic then it becomes necessary to generate independent successor states to guarantee convergence to the correct answer. For a more detailed discussion see Baird (1995); Harmon, Baird, and Klopf (1995); and Harmon and Baird (1996).
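The contrast between the direct update (equation (10)) and the residual gradient update (equation (11)) is easiest to see for a linear approximator V(x, w) = w·φ(x), for which the gradient with respect to w is simply φ(x). The sketch below is an illustration under assumptions: the feature function φ is a placeholder, and the successor of the maximizing action is assumed to have been selected already.

    # Direct (equation (10)) versus residual gradient (equation (11)) updates
    # for a linear approximator V(x, w) = w . phi(x).
    import numpy as np

    GAMMA = 0.9
    ALPHA = 0.01

    def v(w, phi_x):
        return float(np.dot(w, phi_x))

    def direct_update(w, phi_x, phi_x_next, r):
        # the target is treated as a constant, so only the gradient of V(x_t, w) appears
        error = r + GAMMA * v(w, phi_x_next) - v(w, phi_x)
        return w + ALPHA * error * phi_x

    def residual_gradient_update(w, phi_x, phi_x_next, r):
        # gradient descent on the squared Bellman residual: both gradients appear
        error = r + GAMMA * v(w, phi_x_next) - v(w, phi_x)
        return w - ALPHA * error * (GAMMA * phi_x_next - phi_x)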

Q-Learning

Q-learning (Watkins, 1989; Watkins and Dayan, 1992) is another extension to traditional dynamic programming (value iteration) that solves the following problem. A deterministic Markov decision process is one in which the state transitions are deterministic (an action performed in state x_t always transitions to the same successor state x_{t+1}). Alternatively, in a nondeterministic Markov decision process, a probability distribution function defines a set of potential successor states for a given action in a given state. If the MDP is non-deterministic, then value iteration requires that we find the action that returns the maximum expected value (the sum of the reinforcement and the integral over all possible successor states for the given action). For example, to find the expected value of the successor state associated with a given action, one must perform that action an infinite number of times, taking the integral over the values of all possible successor states for that action. The reason this is necessary is demonstrated in Figure 7.

[Figure 7: a state x_t with two actions; each action leads with equal probability to one of two successor states.]

In Figure 7 there are two possible actions in state x_t. Each action returns a reinforcement of 0. Action u_1 causes a transition to one of two possible successor states with equal probability. The same is true for action u_2. The values of the successor states are 0 and 1 for both actions. Value iteration requires that the value of state x_t be equal to the maximum over actions of the sum of reinforcement and the expected value of the successor state. By taking an infinite number of samples of successor states for action u_1, one would be able to calculate that the actual expected value is 0.5. The same is true for action u_2. Therefore, the value of state x_t is 0.5. However, if one were to naively perform value iteration on this MDP by taking a single sample of the successor state associated with each action instead of the integral, then x_t would converge to a value of 0.75 (the expected value of the maximum over the two sampled successor values). Clearly the wrong answer. Theoretically, value iteration is possible in the context of non-deterministic MDPs. However, in practice it is computationally impossible to calculate the necessary integrals without added knowledge or some degree of modification.

Q-learning solves the problem of having to take the max over a set of integrals. Rather than finding a mapping from states to state values (as in value iteration), Q-learning finds a mapping from state/action pairs to values (called Q-values). Instead of having an associated value function, Q-learning makes use of the Q-function. In each state, there is a Q-value associated with each action. The definition of a Q-value is the sum of the (possibly discounted) reinforcements received when performing the associated action and then following the given policy thereafter. Likewise, the definition of an optimal Q-value is the sum of the reinforcements received when performing the associated action and then following the optimal policy thereafter. In the context of Q-learning, the value of a state is defined to be the maximum Q-value in the given state. Given this definition it is easy to derive the equivalent of the Bellman equation (equation (4)) for Q-learning.

    Q(x_t, u_t) = r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1})    (12)

Q-learning differs from value iteration in that it doesn't require that in a given state each action be performed and the expected values of the successor states be calculated. While value iteration performs an update that is analogous to a one-level breadth-first search, Q-learning takes a single-step sample of a Monte-Carlo roll-out. This process is demonstrated in Figure 8.

[Figure 8: the sampled one-step Q-learning update for a lookup table.]

The update equation in Figure 8 is valid when using a lookup table to represent the Q-function. The Q-value is a prediction of the sum of the reinforcements one will receive when performing the associated action and then following the given policy. To update that prediction Q(x_t, u_t), one must perform the associated action u_t, causing a transition to the next state x_{t+1} and returning a scalar reinforcement r(x_t, u_t). Then one need only find the maximum Q-value in the new state to have all the necessary information for revising the prediction (Q-value) associated with the action just performed. Q-learning does not require one to calculate the integral over all possible successor states in the case that the state transitions are nondeterministic. The reason is that a single sample of a successor state for a given action is an unbiased estimate of the expected value of the successor state. In other words, after many updates the Q-value associated with a particular action will converge to the expected sum of all reinforcements received when performing that action and following the optimal policy thereafter.
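In tabular form, the sampled update described above (and depicted in Figure 8) looks like the following sketch. The dictionary-based table and the calling convention are assumptions made for illustration.

    # Tabular Q-learning from a single observed transition (x_t, u_t, r, x_t+1).
    from collections import defaultdict

    GAMMA = 0.9
    ALPHA = 0.1

    Q = defaultdict(lambda: defaultdict(float))    # Q[state][action] -> Q-value

    def q_update(state, action, reinforcement, next_state, next_actions):
        """Move Q(x_t, u_t) toward the sampled target r + gamma * max_u Q(x_t+1, u)."""
        best_next = max((Q[next_state][u] for u in next_actions), default=0.0)
        target = reinforcement + GAMMA * best_next
        Q[state][action] += ALPHA * (target - Q[state][action])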

Residual Gradient and Direct Q-learning

As it is possible to represent the value function with a neural network in the context of value iteration, so it is possible to represent the Q-function with a neural network in the context of Q-learning. The information presented in the discussion of value iteration concerning convergence to a stable value function is also applicable to guaranteeing convergence to a stable Q-function. Equation (13) is the update equation for direct Q-learning, where α is the learning rate, and equation (14) is the update equation for residual gradient Q-learning.

    Δw = α [ r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1}, w_t) − Q(x_t, u_t, w_t) ] ∂Q(x_t, u_t, w_t)/∂w    (13)

    Δw = −α [ r(x_t, u_t) + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1}, w_t) − Q(x_t, u_t, w_t) ] [ γ ∂Q(x_{t+1}, u_{t+1}, w_t)/∂w − ∂Q(x_t, u_t, w_t)/∂w ]    (14)

Advantage Learning

Although Q-learning is a significant improvement over value iteration, it is still limited in scope in at least one important way. The number of training iterations necessary to sufficiently represent the optimal Q-function when using function approximators that generalize scales poorly with the size of the time interval between states. The greater the number of actions per unit time (the smaller the increment in time between actions), the greater the number of training iterations required to adequately represent the optimal Q-function. The explanation for this is demonstrated with a simple example. Figure 9 depicts a Markov decision process with 1000 states. State 0 is the initial state and has a single action available: transition to state 1. State 999 is an absorbing state.

[Figure 9: a chain of 1000 states, 0 through 999, with the optimal Q-values written next to each state.]

In states 1..998 there are two actions available: transition to either the state immediately to the right or immediately to the left. For example, in state 1, the action of going left will transition to state 0, and the action of going right will transition to state 2. Each transition incurs a cost (reinforcement) of 1. The objective is to minimize the total cost accumulated in transitioning from state to state until the absorbing state is reached. The optimal Q-value for each action is represented by the numbers next to each state. For example, in state 2 the optimal Q-value for the action of going left is 1000, and the optimal Q-value for the action of going right is 998. The optimal policy can easily be found in each state by choosing to perform the action with the minimum Q-value.

When using a function approximator that generalizes over state/action pairs (any function approximator other than a lookup table or equivalent), it is possible to encounter practical limitations in the number of training iterations required to accurately approximate the optimal Q-function. As the time interval between states decreases in size, the required precision in the approximation of the optimal Q-function increases exponentially. For example, the optimal Q-function associated with the MDP in Figure 9 is linear and can be represented by a simple linear function approximator. However, it requires an unreasonably large number of training iterations to achieve the level of precision necessary to generate the optimal policy. The reason for the large number of training iterations is simple. The difference in the Q-values in a given state is small relative to the difference in the Q-values across states (a ratio of approximately 1:1000). For example, the difference in the Q-values in state 1 is 2 (1001 − 999 = 2).

The difference in the minimum Q-values in states 1 and 998 is 998 (999 − 1 = 998). The approximation of the optimal Q-function must achieve a degree of precision such that the tiny differences in Q-values in a single state are represented. Because the differences in Q-values across states have a greater impact on the mean squared error, during training the network learns to represent these differences first. The differences in the Q-values in a given state have only a tiny effect on the mean squared error and therefore get lost in the noise. To represent the differences in Q-values in a given state requires much greater precision than to represent the Q-values across states. As the ratio of the time interval to the number of states decreases it becomes necessary to approximate the optimal Q-function with increasing precision. In the limit, infinite precision is necessary.

Advantage learning does not share the scaling problem of Q-learning. Similar to Q-learning, advantage learning learns a function of state/action pairs. However, in advantage learning the value associated with each action is called an advantage. Therefore, advantage learning finds an advantage function rather than a Q-function or value function. The value of a state is defined to be the value of the maximum advantage in that state. For the state/action pair (x, u), an advantage is defined as the sum of the value of the state and the utility (advantage) of performing action u rather than the action currently considered best. For optimal actions this utility is zero, meaning the value of the action is also the value of the state; for sub-optimal actions the utility is negative, representing the degree of sub-optimality relative to the optimal action. The equivalent of the Bellman equation for advantage learning is given in equation (15),

    A(x_t, u_t) = max_u A(x_t, u) + (1/K) < r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}) − max_u A(x_t, u) >    (15)

where γ is the discount factor per time step, K is a time unit scaling factor, and <·> represents the expected value over all possible results of performing action u_t in state x_t to receive immediate reinforcement r and to transition to a new state x_{t+1}.

Residual Gradient and Direct Advantage Learning

The number of training iterations required in Q-learning scales poorly as the ratio of the time interval between states to the number of states grows small. Advantage learning can find a sufficiently accurate approximation to the advantage function in a number of training iterations that is independent of this ratio. The update equations for direct advantage learning and residual advantage learning are given in equations (16) and (17) respectively. Again, the reader is referred to the subsection in the discussion of value iteration devoted to residual gradient algorithms. For a further discussion of advantage learning see Harmon and Baird (1996).

    Δw = α [ (1/K)( r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t) ) + (1 − 1/K) max_u A(x_t, u, w_t) − A(x_t, u_t, w_t) ] ∂A(x_t, u_t, w_t)/∂w    (16)

    Δw = −α [ (1/K)( r(x_t, u_t) + γ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t) ) + (1 − 1/K) max_u A(x_t, u, w_t) − A(x_t, u_t, w_t) ] [ (γ/K) ∂ max_{u_{t+1}} A(x_{t+1}, u_{t+1}, w_t)/∂w + (1 − 1/K) ∂ max_u A(x_t, u, w_t)/∂w − ∂A(x_t, u_t, w_t)/∂w ]    (17)
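A tabular sketch of the advantage-learning target of equation (15) is given below, with the expectation replaced by a single sampled transition. The storage layout, the calling convention, and the choice of K are assumptions made here for illustration; see Harmon and Baird (1996) for the full algorithm with function approximation.

    # Tabular advantage-learning update based on the target of equation (15),
    # with the expectation replaced by one sampled transition.
    from collections import defaultdict

    GAMMA = 0.99
    ALPHA = 0.1
    K = 0.1       # time-unit scaling factor (assumed value)

    A = defaultdict(lambda: defaultdict(float))    # A[state][action] -> advantage

    def advantage_update(state, action, r, next_state, actions, next_actions):
        value = max(A[state][u] for u in actions)                        # max_u A(x_t, u)
        next_value = max((A[next_state][u] for u in next_actions), default=0.0)
        target = value + (r + GAMMA * next_value - value) / K
        A[state][action] += ALPHA * (target - A[state][action])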

TD(λ)

Consider the Markov chain in Figure 10. The initial state is 0 and the terminal state is 999. Each state transition returns a cost (reinforcement) of 1, and the value of state 999 is defined to be 0. Because this is a Markov chain, it is not sensible to suggest that the RL system learn to minimize or maximize reinforcement. Instead, we are concerned exclusively with predicting the total reinforcement received when starting from state n, where n is a state in the range [1..998].

[Figure 10: a Markov chain of 1000 states, 0 through 999, with state 999 terminal.]

Value iteration, Q-learning, and advantage learning can all solve this problem. However, TD(λ) can solve it faster. In the context of Markov chains, TD(λ) is identical to value iteration with the exception that TD(λ) updates the value of the current state based on a weighted combination of the values of future states, as opposed to using only the value of the immediate successor state. Recall that in value iteration the target value of the current state is the sum of the reinforcement and the value of the successor state, in other words, the right side of the Bellman equation (equation (18)).

    V(x_t, w_t) = r(x_t) + γ V(x_{t+1}, w_t)    (18)

Notice that the target is also based on an estimate, V(x_{t+1}, w_t), and this estimate can be based on zero information. Indeed, this is the case much of the time and can be demonstrated using Figure 10. Assume that the value function for this Markov chain is represented using a lookup table. In this case, our lookup table has 1000 elements, each corresponding to a state, and the entry in each element is the value of the corresponding state. Before learning begins, entries are initialized to random values. The process of learning starts by updating the value of state 0 to be the sum of the reinforcement received on transition from state 0 to state 1 and the value of state 1. Remember, at this point the value of state 1 is arbitrary. This is true for all states except the terminal state (999) which, by definition, has a value of 0. Because the initial values of states are arbitrary (with the exception of the terminal state), the entire first sweep through the Markov chain (epoch) of training results in the improvement of the approximation of the value function only in state 998. In the first epoch, only in state 998 is the update to the approximation based on something other than an arbitrary value. This is terribly inefficient. In fact, not until 999 epochs of training have been performed will the approximation of the value of state 0 contain any degree of truth (the approximation is based on something other than an arbitrary value). In epoch 2 of training, the approximation of the value of state 997 is updated based on an approximation of the value of state 998 that has as its basis the true value of state 999, rather than an arbitrary value. In epoch 3, the approximation of the value of state 996 will be updated based on truth rather than an arbitrary value. Each epoch moves truth back one step in the chain. The approximation of the value of state x_t is updated based on the approximation of the value of the state one step into the future, x_{t+1}. If the value of a state were based on a weighted average of the values of future states, then truth would be propagated back in time much more efficiently.

In our example above, if instead of updating the value of a state based exclusively on the value of the immediate successor state one used the next 2 successor states as the basis of the update, then the number of epochs performed before the value of state 0 is no longer based on an arbitrary value is reduced from 1000 to 500. If the value approximation of state 0 is based on a weighted combination of values of the succeeding 500 states, then only 2 epochs are required before the value approximation of state 0 is based on something other than an arbitrary value. This is precisely the function of TD(λ) (Sutton, 1988) for 0 < λ < 1. Instead of updating a value approximation based solely on the approximated value of the immediate successor state, TD(λ) bases the update on an exponential weighting of values of future states; λ is the weighting factor. TD(0), the case of λ = 0, is identical to value iteration for the example problem stated above. TD(1) updates the value approximation of state n based solely on the value of the terminal state. The parameter update for TD(λ) is given in equation (19).

    Δw = α [ r(x_t) + γ V(x_{t+1}, w_t) − V(x_t, w_t) ] Σ_{k=1}^{t} λ^{t−k} ∂V(x_k, w_t)/∂w    (19)

An incremental form of this equation can be derived as follows. Given that g_t is the value of the sum in (19) at time t, we can compute g_{t+1}, using only current information, as

    g_{t+1} = Σ_{k=1}^{t+1} λ^{t+1−k} ∂V(x_k, w)/∂w = ∂V(x_{t+1}, w)/∂w + λ Σ_{k=1}^{t} λ^{t−k} ∂V(x_k, w)/∂w = ∂V(x_{t+1}, w)/∂w + λ g_t    (20)

Notice that equation (19) does not have a max or min term. This suggests that TD(λ) is used exclusively in the context of prediction (Markov chains). One way to extend the use of TD(λ) to the domain of Markov decision processes is to perform updates according to equation (19), while calculating the sum according to equation (20) when following the current policy. When a step of exploration is performed (choosing an action that is not currently considered "best"), the sum of past gradients g in equation (20) should be set to 0. The intuition behind this method follows. The value of a state x is defined as the sum of the reinforcements received when starting in x and following the current policy until a terminal state is reached. During training, the current policy is the best approximation to the optimal policy generated thus far. On occasion one must perform actions that don't agree with the current policy so that better approximations to the optimal policy can be realized. However, one might not want the value of the resulting state propagated through the chain of past states. This would corrupt the value approximations for these states by introducing information that is not consistent with the definition of a state value.

One further note: TD(λ) for λ = 0 is equivalent to value iteration. Likewise, the discussion of residual gradient algorithms is applicable to TD(λ) when λ = 0. However, this is not the case for 0 < λ < 1. No algorithms exist that guarantee convergence for TD(λ) for 0 < λ < 1 when using a general function approximator.
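Equations (19) and (20) can be combined into a simple per-step procedure for a linear approximator V(x, w) = w·φ(x), since the accumulated gradient g_t is then just a λ-weighted sum of past feature vectors. The chain interface and feature function φ in the sketch below are assumptions; φ of the terminal state is assumed to be the zero vector so that its value is fixed at 0.

    # TD(lambda) for a linear approximator V(x, w) = w . phi(x) on a Markov chain,
    # using the incremental accumulated gradient of equations (19) and (20).
    import numpy as np

    GAMMA = 1.0      # no discounting on the finite chain
    ALPHA = 0.1
    LAMBDA = 0.9

    def td_lambda_epoch(chain, phi, w):
        """One sweep from the initial state to the terminal state, updating w in place."""
        g = np.zeros_like(w)                   # g_t: lambda-weighted sum of past gradients
        x = chain.initial_state()
        while not chain.is_terminal(x):
            x_next, r = chain.step(x)          # a chain, so there is no action choice
            g = phi(x) + LAMBDA * g            # equation (20)
            delta = r + GAMMA * np.dot(w, phi(x_next)) - np.dot(w, phi(x))
            w += ALPHA * delta * g             # equation (19)
            x = x_next
        return w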

4 Miscellaneous Issues

Exploration

As stated earlier, the fundamental question in reinforcement learning research is: how do we devise an algorithm that will efficiently find the optimal value function? It was shown that the optimal value function is a solution to the set of equations defined by the Bellman equation (equation (4)). The process of learning was subsequently described as the process of improving an approximation of the optimal value function by incrementally finding a solution to this set of equations. One should notice that the Bellman equation is defined over all of state space. The optimal value function satisfies this equation for ALL x in state space. This requirement introduces the need for exploration. Exploration is defined as intentionally choosing to perform an action that is not considered best for the express purpose of acquiring knowledge of unseen (or little seen) states. In order to identify a (sub-)optimal approximation, state space must be sufficiently explored. For example, a robot facing an unknown environment has to spend some time acquiring knowledge of its environment. Alternatively, experience acquired during exploration must also be considered during action selection to minimize the costs (negative reinforcements) associated with learning. Although the robot must explore its environment, it should avoid collisions with obstacles. However, the robot does not know which actions will result in collision until all of state space has been explored. On the other hand, it is possible that a policy that is sufficiently good will be recognized without having to explore all of state space. There is a fundamental trade-off between exploration and exploitation (using previously acquired knowledge to direct the choice of action). Therefore, it is important to use exploration techniques that will maximize the knowledge gained during learning while minimizing the costs of exploration and learning time. For a good introduction to the issues of efficient exploration see Thrun (1992).
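One common, simple way to implement the exploration-exploitation trade-off discussed above is ε-greedy action selection: with a small probability take a random action, otherwise take the action currently considered best. This is only one of many schemes (see Thrun, 1992); the Q-table layout in the sketch below is an assumption made for illustration.

    # Epsilon-greedy action selection over a Q-table.
    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon explore (random action); otherwise exploit the
        action currently considered best according to Q."""
        if random.random() < epsilon:
            return random.choice(list(actions))
        return max(actions, key=lambda u: Q[state][u])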

Discounted vs. Non-Discounted

The discount factor γ is a number in the range [0, 1] and is used to weight near-term reinforcement more heavily than distant future reinforcement. For the purpose of discussion, the update equation for value iteration is shown again as equation (21). The closer γ is to 1, the greater the weight of future reinforcements. The weighting of future reinforcements has a half-life of σ = log 0.5 / log γ. For γ = 0, the value of a state is based exclusively on the immediate reinforcement received for performing the associated action. For finite horizon Markov decision processes (an MDP that terminates) it is not strictly necessary to use a discount factor. In this case (γ = 1), the value of state x_t is based on the total reinforcement received when starting in state x_t and following the given policy.

    Δw = max_u [ r(x_t, u) + γ V(x_{t+1}, w_t) ] − V(x_t, w_t)    (21)

In the case of infinite horizon Markov decision processes (an MDP that never terminates), a discount factor is required. Without the use of a discount factor, the sum of the reinforcements received would be infinite for every state. The use of a discount factor limits the maximum value of a state to be on the order of R/(1 − γ), where R is the magnitude of the largest reinforcement.

5 Conclusion

Reinforcement learning appeals to many researchers because of its generality. Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique. In fact, many researchers view reinforcement learning not as a technique, but rather as a particular type of problem that is amenable to solution by the algorithms described above. Reinforcement learning is an extension of classical dynamic programming in that it greatly enlarges the set of problems that can practically be solved. Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training. By combining dynamic programming with neural networks, many are optimistic that classes of problems previously unsolvable will finally be solved.

Acknowledgments

The development of this tutorial was supported under Task 2312R1 by the United States Air Force Office of Scientific Research. We would also like to thank Leemon Baird, Harry Klopf, Eric Blasche, Jim Morgan, and Scott Weaver for useful comments.

6 Glossary

policy - a mapping from states to actions.

reinforcement - a scalar variable that communicates the change in the environment to the reinforcement learning system. For example, if an RL system is a controller for a missile, the reinforcement signal might be the distance between the missile and the target (in which case, the RL system would learn to minimize reinforcement).

Markov decision process - An MDP consists of a set of states X; a set of start states S that is a subset of X; a set of actions A; a reinforcement function R, where R(x,a) is the expected immediate reinforcement for taking action a in state x; and an action model P, where P(x'|x,a) gives the probability that executing action a in state x will lead to state x'. Note: it is a requirement that the choice of action be dependent solely on the current state observation x. If knowledge of prior actions or states affects the current choice of action, then the decision process is not Markov.

deterministic - In the context of states, there is a one-to-one mapping from state/action pairs to successor states. In other words, with probability one, the transition from state x after performing action a will always result in state x'.

non-deterministic - In the context of states, there exists a probability distribution function P(x'|x,a) that gives the probability that executing action a in state x will lead to state x'.

state - The condition of a physical system as specified by a set of appropriate variables.

unbiased estimate - The expected (mean) error in the estimate is zero.

7 References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis & Stuart Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.

Baird, L. C. (1993). Advantage Updating. (Technical Report WL-TR-93-1146). Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA.)

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1995). Reinforcement learning applied to a differential game. Adaptive Behavior, MIT Press, (4).

Harmon, M. E., Baird, L. C., and Klopf, A. H. (1994). Advantage updating applied to a differential game. In Tesauro, Touretzky & Leen, eds., Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, MIT Press, Cambridge, Massachusetts.

Harmon, M. E., and Baird, L. C. (1996). Multi-player residual advantage learning with general function approximation. (Technical Report WL-TR). Wright-Patterson Air Force Base, Ohio: Wright Laboratory. (Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA.)

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, Volume 4.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3.

Thrun, S. (1992). The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, Van Nostrand Reinhold, Florence, Kentucky.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral thesis, Cambridge University, Cambridge, England.

Watkins, C. J. C. H., and Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8.

8 Bibliography: Applications of Reinforcement Learning

Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38.

Boyan, J. A., and Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Cowan, J. D., Tesauro, G., and Alspector, J., eds., Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, San Francisco, CA: Morgan Kaufmann.

Crites, R. H., and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In Touretzky, Mozer & Hasselmo, eds., Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, MIT Press, Cambridge, Massachusetts.

Singh, S., and Bertsekas, D. P. (1996). Reinforcement learning for dynamic channel allocation in cellular telephone systems. To appear in Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, MIT Press, Cambridge, Massachusetts.

Zhang, W., and Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In Touretzky, Mozer & Hasselmo, eds., Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, MIT Press, Cambridge, Massachusetts.


More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Chapter 7: Solving Trig Equations

Chapter 7: Solving Trig Equations Haberman MTH Secion I: The Trigonomeric Funcions Chaper 7: Solving Trig Equaions Le s sar by solving a couple of equaions ha involve he sine funcion EXAMPLE a: Solve he equaion sin( ) The inverse funcions

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Principle of Least Action

Principle of Least Action The Based on par of Chaper 19, Volume II of The Feynman Lecures on Physics Addison-Wesley, 1964: pages 19-1 hru 19-3 & 19-8 hru 19-9. Edwin F. Taylor July. The Acion Sofware The se of exercises on Acion

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is

( ) ( ) if t = t. It must satisfy the identity. So, bulkiness of the unit impulse (hyper)function is equal to 1. The defining characteristic is UNIT IMPULSE RESPONSE, UNIT STEP RESPONSE, STABILITY. Uni impulse funcion (Dirac dela funcion, dela funcion) rigorously defined is no sricly a funcion, bu disribuion (or measure), precise reamen requires

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Lab #2: Kinematics in 1-Dimension

Lab #2: Kinematics in 1-Dimension Reading Assignmen: Chaper 2, Secions 2-1 hrough 2-8 Lab #2: Kinemaics in 1-Dimension Inroducion: The sudy of moion is broken ino wo main areas of sudy kinemaics and dynamics. Kinemaics is he descripion

More information

Math 333 Problem Set #2 Solution 14 February 2003

Math 333 Problem Set #2 Solution 14 February 2003 Mah 333 Problem Se #2 Soluion 14 February 2003 A1. Solve he iniial value problem dy dx = x2 + e 3x ; 2y 4 y(0) = 1. Soluion: This is separable; we wrie 2y 4 dy = x 2 + e x dx and inegrae o ge The iniial

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Predator - Prey Model Trajectories and the nonlinear conservation law

Predator - Prey Model Trajectories and the nonlinear conservation law Predaor - Prey Model Trajecories and he nonlinear conservaion law James K. Peerson Deparmen of Biological Sciences and Deparmen of Mahemaical Sciences Clemson Universiy Ocober 28, 213 Ouline Drawing Trajecories

More information

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

A Reinforcement Learning Approach for Collaborative Filtering

A Reinforcement Learning Approach for Collaborative Filtering A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Announcements. Recap: Filtering. Recap: Reasoning Over Time. Example: State Representations for Robot Localization. Particle Filtering

Announcements. Recap: Filtering. Recap: Reasoning Over Time. Example: State Representations for Robot Localization. Particle Filtering Inroducion o Arificial Inelligence V22.0472-001 Fall 2009 Lecure 18: aricle & Kalman Filering Announcemens Final exam will be a 7pm on Wednesday December 14 h Dae of las class 1.5 hrs long I won ask anyhing

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

INDEX. Transient analysis 1 Initial Conditions 1

INDEX. Transient analysis 1 Initial Conditions 1 INDEX Secion Page Transien analysis 1 Iniial Condiions 1 Please inform me of your opinion of he relaive emphasis of he review maerial by simply making commens on his page and sending i o me a: Frank Mera

More information

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent CSE 47 Chaper Reinforcemen Learning The Reinforcemen Learning Agen Agen Sae u Reward r Acion a Enironmen CSE AI Faculy Why reinforcemen learning Programming an agen o drie a car or fly a helicoper is ery

More information

Solutions Problem Set 3 Macro II (14.452)

Solutions Problem Set 3 Macro II (14.452) Soluions Problem Se 3 Macro II (14.452) Francisco A. Gallego 04/27/2005 1 Q heory of invesmen in coninuous ime and no uncerainy Consider he in nie horizon model of a rm facing adjusmen coss o invesmen.

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

Linear Time-invariant systems, Convolution, and Cross-correlation

Linear Time-invariant systems, Convolution, and Cross-correlation Linear Time-invarian sysems, Convoluion, and Cross-correlaion (1) Linear Time-invarian (LTI) sysem A sysem akes in an inpu funcion and reurns an oupu funcion. x() T y() Inpu Sysem Oupu y() = T[x()] An

More information

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems Essenial Microeconomics -- 6.5: OPIMAL CONROL Consider he following class of opimizaion problems Max{ U( k, x) + U+ ( k+ ) k+ k F( k, x)}. { x, k+ } = In he language of conrol heory, he vecor k is he vecor

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

A First Course on Kinetics and Reaction Engineering. Class 19 on Unit 18

A First Course on Kinetics and Reaction Engineering. Class 19 on Unit 18 A Firs ourse on Kineics and Reacion Engineering lass 19 on Uni 18 Par I - hemical Reacions Par II - hemical Reacion Kineics Where We re Going Par III - hemical Reacion Engineering A. Ideal Reacors B. Perfecly

More information

15. Vector Valued Functions

15. Vector Valued Functions 1. Vecor Valued Funcions Up o his poin, we have presened vecors wih consan componens, for example, 1, and,,4. However, we can allow he componens of a vecor o be funcions of a common variable. For example,

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

18 Biological models with discrete time

18 Biological models with discrete time 8 Biological models wih discree ime The mos imporan applicaions, however, may be pedagogical. The elegan body of mahemaical heory peraining o linear sysems (Fourier analysis, orhogonal funcions, and so

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n Module Fick s laws of diffusion Fick s laws of diffusion and hin film soluion Adolf Fick (1855) proposed: d J α d d d J (mole/m s) flu (m /s) diffusion coefficien and (mole/m 3 ) concenraion of ions, aoms

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

2016 Possible Examination Questions. Robotics CSCE 574

2016 Possible Examination Questions. Robotics CSCE 574 206 Possible Examinaion Quesions Roboics CSCE 574 ) Wha are he differences beween Hydraulic drive and Shape Memory Alloy drive? Name one applicaion in which each one of hem is appropriae. 2) Wha are he

More information

SPH3U: Projectiles. Recorder: Manager: Speaker:

SPH3U: Projectiles. Recorder: Manager: Speaker: SPH3U: Projeciles Now i s ime o use our new skills o analyze he moion of a golf ball ha was ossed hrough he air. Le s find ou wha is special abou he moion of a projecile. Recorder: Manager: Speaker: 0

More information

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Inventory Control of Perishable Items in a Two-Echelon Supply Chain Journal of Indusrial Engineering, Universiy of ehran, Special Issue,, PP. 69-77 69 Invenory Conrol of Perishable Iems in a wo-echelon Supply Chain Fariborz Jolai *, Elmira Gheisariha and Farnaz Nojavan

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j =

12: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME. Σ j = 1: AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES IN DISCRETE TIME Moving Averages Recall ha a whie noise process is a series { } = having variance σ. The whie noise process has specral densiy f (λ) = of

More information

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H.

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H. ACE 564 Spring 2006 Lecure 7 Exensions of The Muliple Regression Model: Dumm Independen Variables b Professor Sco H. Irwin Readings: Griffihs, Hill and Judge. "Dumm Variables and Varing Coefficien Models

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

!!"#"$%&#'()!"#&'(*%)+,&',-)./0)1-*23)

!!#$%&#'()!#&'(*%)+,&',-)./0)1-*23) "#"$%&#'()"#&'(*%)+,&',-)./)1-*) #$%&'()*+,&',-.%,/)*+,-&1*#$)()5*6$+$%*,7&*-'-&1*(,-&*6&,7.$%$+*&%'(*8$&',-,%'-&1*(,-&*6&,79*(&,%: ;..,*&1$&$.$%&'()*1$$.,'&',-9*(&,%)?%*,('&5

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

PHYSICS 220 Lecture 02 Motion, Forces, and Newton s Laws Textbook Sections

PHYSICS 220 Lecture 02 Motion, Forces, and Newton s Laws Textbook Sections PHYSICS 220 Lecure 02 Moion, Forces, and Newon s Laws Texbook Secions 2.2-2.4 Lecure 2 Purdue Universiy, Physics 220 1 Overview Las Lecure Unis Scienific Noaion Significan Figures Moion Displacemen: Δx

More information