18 IMITATION LEARNING


"Programming is a skill best acquired by practice and example rather than from books." -- Alan Turing

Learning Objectives:
- Be able to formulate imitation learning problems.
- Understand the failure cases of simple classification approaches to imitation learning.
- Implement solutions to those problems based on either classification or dataset aggregation.
- Relate structured prediction and imitation learning.

So far, we have largely considered machine learning problems in which the goal of the learning algorithm is to make a single prediction. In many real world problems, however, an algorithm must make a sequence of decisions, with the world possibly changing during that sequence. Such problems are often called sequential decision making problems. A straightforward example, which will be the running example for this chapter, is that of self-driving cars. We want to train a machine learning algorithm to drive a car. But driving a car is not a single prediction: it's a sequence of predictions over time. And as the machine is driving the car, the world around it is changing, often based on its own behavior. This creates complicated feedback loops, and one of the major challenges we will face is how to deal with these feedback loops.

To make this discussion more concrete, let's consider the case of a self-driving car. And let's consider a very simplistic car, in which the only decision that has to be made is how to steer, and that's between one of three options: {left, right, none}. In the imitation learning setting, we assume that we have access to an expert or oracle that already knows how to drive. We want to watch the expert driving, and learn to imitate their behavior. Hence: imitation learning (sometimes called learning by demonstration or programming by example, in the sense that programs are learned, and not implemented).

At each point in time t = 1 ... T, the car receives sensor information x_t (for instance, a camera photo ahead of the car, or radar readings). It then has to take an action, a_t; in the case of the car, this is one of the three available steering actions. The car then suffers some loss ℓ_t; this might be zero in the case that it's driving well, or large in the case that it crashes. The world then changes, moves to time step t+1, sensor readings x_{t+1} are observed, action a_{t+1} is taken, loss ℓ_{t+1} is suffered, and the process continues.

The goal is to learn a function f that maps from sensor readings x_t to actions. Because of close connections to the field of reinforcement learning, this function is typically called a policy.
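To make this loop concrete in code, here is a minimal sketch of running a policy for T steps. The environment object `env` and its `observe()` and `step()` methods are illustrative assumptions, not part of any particular library:

```python
def run_policy(env, f, T):
    """Run policy f for T steps; return the trajectory and its total loss.
    `env` is a hypothetical environment with observe() and step() methods."""
    trajectory, total_loss = [], 0.0
    for t in range(T):
        x = env.observe()        # sensor reading x_t (camera, radar, ...)
        a = f(x)                 # policy picks an action a_t = f(x_t)
        loss = env.step(a)       # world changes; instantaneous loss l_t
        trajectory.append((x, a, loss))
        total_loss += loss
    return trajectory, total_loss
```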

The measure of success of a policy is: if we were to run this policy, how much total loss would be suffered? In particular, suppose that the trajectory (denoted τ) of observation/action/reward triples encountered by your policy is:

    \tau = \langle x_1, a_1, \ell_1,\; x_2, a_2, \ell_2,\; \ldots,\; x_T, a_T, \ell_T \rangle, \quad \text{where } a_t = f(x_t)    (18.1)

The losses ℓ_t depend implicitly on the state of the world and the actions of the policy. The goal of f is to minimize the expected loss \mathbb{E}_{\tau \sim f} \sum_{t=1}^{T} \ell_t, where the expectation is taken over all randomness in the world, and the sequence of actions taken is according to f. (It's completely okay for f to look at more than just x_t when making predictions; for instance, it might want to look at x_{t-1}, or a_{t-1} and a_{t-2}. As long as it only references information from the past, this is fine. For notational simplicity, we will assume that all of this relevant information is summarized in x_t.)

18.1 Imitation Learning by Classification

We will begin with a straightforward, but brittle, approach to imitation learning. We assume access to a set of training trajectories taken by an expert. For example, consider a self-driving car, like that in Figure 18.1. A single trajectory τ consists of a sequence of observations (what is seen from the car's sensors) and a sequence of actions (what action the expert took at that point in time). [Figure 18.1: A single expert trajectory in a self-driving car.]

The idea in imitation learning by classification is to learn a classifier that attempts to mimic the expert's actions based on the observations at that time. In particular, we have τ_1, τ_2, ..., τ_N. Each of the N trajectories comprises a sequence of T-many observation/action/loss triples, where the action is the action taken by the expert. T, the length of the trajectory, is typically called the time horizon (or just horizon). For instance, we may ask an expert human driver to drive N = 20 different routes, and record the observations and actions that driver saw and took during those routes. These are our training trajectories. We assume for simplicity that each of these trajectories is of fixed length T, though this is mostly for notational convenience.

The most straightforward thing we can do is convert this expert data into a big multiclass classification problem. We take our favorite multiclass classification algorithm, and use it to learn a mapping from x to a. The data on which it is trained is the set of all observation/action pairs visited during any of the N trajectories. In total, this would be NT examples. This approach is summarized in Algorithm 18.1 for training and Algorithm 18.2 for prediction.

How well does this approach work? The first question is: how good is the expert? If we learn to mimic an expert, but the expert is not good, we lose. In general, it also seems unrealistic to believe this algorithm should be able to improve on the expert. Similarly, if our multiclass classification algorithm A is crummy, we also lose. So part of the question "how well does this work?" is the more basic question of: what are we even trying to measure?

Algorithm 43 SupervisedImitationTrain(A, τ_1, τ_2, ..., τ_N)
1: D ← {(x, a) : ∀n, (x, a, ℓ) ∈ τ_n}  // collect all observation/action pairs
2: return A(D)  // train multiclass classifier on D

Algorithm 44 SupervisedImitationTest(f)
1: for t = 1 ... T do
2:   x_t ← current observation
3:   a_t ← f(x_t)  // ask policy to choose an action
4:   take action a_t
5:   ℓ_t ← observe instantaneous loss
6: end for
7: return Σ_{t=1}^T ℓ_t  // return total loss

There is a nice theorem (Ross et al.) that gives an upper bound on the loss suffered by the SupervisedIL algorithm (Algorithm 18.1) as a function of (a) the quality of the expert, and (b) the error rate of the learned classifier. To be clear, we need to distinguish between the loss of the policy when run for T steps to form a full trajectory, and the error rate of the learned classifier, which is just its average multiclass classification error. The theorem states, roughly, that the loss of the learned policy is at most the loss of the expert plus T² times the error rate of the classifier.

Theorem 18 (Loss of SupervisedIL). Suppose that one runs Algorithm 18.1 using a multiclass classifier that optimizes the 0-1 loss (or an upper bound thereof). Let ε be the error rate of the underlying classifier (in expectation) and assume that all instantaneous losses are in the range [0, ℓ^{(max)}]. Let f be the learned policy; then:

    \underbrace{\mathbb{E}_{\tau \sim f} \sum_t \ell_t}_{\text{loss of learned policy}} \;\le\; \underbrace{\mathbb{E}_{\tau \sim \text{expert}} \sum_t \ell_t}_{\text{loss of expert}} \;+\; \ell^{(\max)} T^2 \epsilon    (18.2)

Intuitively, this bound on the loss is about a factor of T away from what we might hope for. In particular, the multiclass classifier makes errors on an ε fraction of its actions, measured by zero/one loss. In the worst case, this will lead to a loss of ℓ^{(max)} ε for a single step. Summing all these errors over the entire trajectory would lead to a loss on the order of ℓ^{(max)} T ε, which is a factor T better than this theorem provides. A natural question (addressed in the next section) is whether this analysis is tight. A related question (addressed in the section after that) is whether we can do better. Before getting there, though, it's worth highlighting that an extra factor of T is really bad.
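These two algorithms are short enough to render directly in Python. The sketch below assumes a scikit-learn-style multiclass learner with fit/predict methods, and reuses the hypothetical `env` and `run_policy` sketched earlier:

```python
def supervised_imitation_train(A, trajectories):
    """Algorithm 43: pool every expert (observation, action) pair and
    fit a multiclass classifier; returns the learned policy f."""
    X = [x for tau in trajectories for (x, a, loss) in tau]
    y = [a for tau in trajectories for (x, a, loss) in tau]
    A.fit(X, y)                          # N*T training examples in total
    return lambda x: A.predict([x])[0]   # f maps one observation to an action

def supervised_imitation_test(env, f, T):
    """Algorithm 44: roll out the learned policy; return the total loss."""
    _, total_loss = run_policy(env, f, T)
    return total_loss
```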

It can cause even very small multiclass error rates to blow up; in particular, if ε ≥ 1/T, we lose, and T can be in the hundreds or more.

18.2 Failure Analysis

The biggest single issue with the supervised learning approach to imitation learning is that it cannot learn to recover from failures. That is: it has only been trained based on expert trajectories. This means that the only training data it has seen is that of an expert driver. If it ever veers from that state distribution, it may have no idea how to recover. As a concrete example, perhaps the expert driver never ever gets themselves into a state where they are directly facing a wall. Moreover, the expert driver probably tends to drive forward more than backward. If the imperfect learner manages to make a few errors and get stuck next to a wall, it's likely to resort to the general "drive forward" rule and stay there forever. This is the problem of compounding error; and yes, it does happen in practice.

It turns out that it's possible to construct an imitation learning problem on which the T² compounding error is unavoidable. Consider the following somewhat artificial problem. At time t = 1 you're shown a picture of either a zero or a one. You have two possible actions: press a button marked "zero" or press a button marked "one". The correct thing to do at t = 1 is to press the button that corresponds to the image you've been shown. Pressing the correct button leads to ℓ_1 = 0; the incorrect leads to ℓ_1 = 1. Now, at time t = 2 you are shown another image, again of a zero or one. The correct thing to do in this time step is the xor of (a) the number written on the picture you see right now, and (b) the correct answer from the previous time step. This holds in general for t > 1.

There are two important things about this construction. The first is that the expert can easily get zero loss. The second is that once the learned policy makes a single mistake, this can cause it to make all future decisions incorrectly. (At least until it luckily makes another mistake to get it back on track.) Based on this construction, you can show the following theorem (Kääriäinen, 2006).

Theorem 19 (Lower Bound for SupervisedIL). There exist imitation learning problems on which Algorithm 18.1 is able to achieve small classification error ε ∈ [0, 1/T] under an optimal expert, but for which the test loss of the learned policy is lower bounded as:

    \underbrace{\mathbb{E}_{\tau \sim f} \sum_t \ell_t}_{\text{loss of learned policy}} \;\ge\; \frac{T+1}{2} - \frac{1}{4\epsilon}\left[1 - (1 - 2\epsilon)^{T+1}\right]    (18.3)

which is bounded by T²ε and, for small ε, grows like T²ε.
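This construction is easy to simulate, which makes the compounding vivid. The sketch below (an illustration, not a proof) models a policy that applies the correct xor rule to its own previous answer, but independently errs with probability ε at each step; its average loss tracks the T²ε behavior of Theorem 19 rather than the Tε we might naively hope for.

```python
import random

def avg_xor_loss(T, eps, n_rollouts=20000):
    """Average total loss on the xor construction for a policy that
    applies the right rule but errs with probability eps per step."""
    total = 0
    for _ in range(n_rollouts):
        correct_prev, pred_prev = 0, 0
        for t in range(T):
            image = random.randint(0, 1)                  # picture shown
            correct = image ^ correct_prev if t > 0 else image
            pred = image ^ pred_prev if t > 0 else image  # xors its OWN answer
            if random.random() < eps:                     # classification error
                pred ^= 1
            total += int(pred != correct)                 # instantaneous loss
            correct_prev, pred_prev = correct, pred
    return total / n_rollouts

# e.g. avg_xor_loss(100, 0.01) comes out around 29, far above
# T*eps = 1: one early flip corrupts all the later targets.
```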

Up to constants, this gives matching upper and lower bounds for the loss of a policy learned by supervised imitation learning, bounds that are pretty far (a factor of T) from what we might hope for.

18.3 Dataset Aggregation

Supervised imitation learning fails because once it gets off the expert path, things can go really badly. Ideally, we might want to train our policy to deal with any possible situation it could encounter. Unfortunately, this is unrealistic: we cannot hope to be able to train on every possible configuration of the world; and if we could, we wouldn't really need to learn anyway, we could just memorize. So we want to train f on a subset of world configurations, but using configurations visited by the expert fails because f cannot learn to recover from its own errors. Somehow what we'd like to do is train f to do well on the configurations that it, itself, encounters!

This is a classic chicken-and-egg problem. We want a policy f that does well in a bunch of world configurations. What set of configurations? The configurations that f encounters! A very classic approach to solving chicken-and-egg problems is iteration. Start with some policy f. Run f and see what configurations it visits. Train a new f to do well there. Repeat.

This is exactly what the Dataset Aggregation algorithm ("DAgger") does. Continuing with the self-driving car analogy, we first let a human expert drive a car for a while, and learn an initial policy f_0 by running standard supervised imitation learning (Algorithm 18.1) on the trajectories visited by the human. We then do something unusual. We put the human expert in the car, and record their actions, but the car behaves not according to the expert's behavior, but according to f_0. That is, f_0 is in control of the car, and the expert is trying to steer, but the car is ignoring them (this is possibly terrifying for the expert!) and simply recording their actions as training data. This is shown in Figure 18.2. [Figure 18.2: In DAgger, the trajectory (red) is generated according to the previously learned policy, f_0, but the gold standard actions are given by the expert.]

Based on trajectories generated by f_0 but actions given by the expert, we generate a new dataset that contains information about how to recover from the errors of f_0. We now will train a new policy, f_1. Because we don't want f_1 to forget what f_0 already knows, f_1 is trained on the union of the initial expert-only trajectories together with the new trajectories generated by f_0. We repeat this process a number of times MaxIter, yielding Algorithm 18.3.

This algorithm returns the list of all policies generated during its run. A very practical question is: which one should you use? There are essentially two choices. The first choice is just to use the final policy learned. The problem with this approach is that DAgger can be somewhat unstable in practice, and policies do not monotonically improve.

Algorithm 45 DaggerTrain(A, MaxIter, N, expert)
1: ⟨τ_n^{(0)}⟩_{n=1}^N ← run the expert N many times
2: D_0 ← {(x, a) : ∀n, (x, a, ℓ) ∈ τ_n^{(0)}}  // collect all pairs (same as supervised)
3: f_0 ← A(D_0)  // train initial policy (multiclass classifier) on D_0
4: for i = 1 ... MaxIter do
5:   ⟨τ_n^{(i)}⟩_{n=1}^N ← run policy f_{i-1} N-many times  // trajectories generated by f_{i-1}
6:   D_i ← {(x, expert(x)) : ∀n, (x, a, ℓ) ∈ τ_n^{(i)}}  // observations x visited by f_{i-1}, but actions according to the expert
7:   f_i ← A(∪_{j=0}^{i} D_j)  // train policy f_i on union of all data so far
8: end for
9: return ⟨f_0, f_1, ..., f_{MaxIter}⟩  // return collection of all learned policies

A safer alternative (as we'll see by theory below) is to test all of them on some held-out development tasks, and pick the one that does best there. This requires a bit more computation, but is a much better approach in general.

One major difference in requirements between DAgger (Algorithm 18.3) and SupervisedIL (Algorithm 18.1) is the requirement of interaction with the expert. In SupervisedIL, you only need access to a bunch of trajectories taken by the expert, passively. In DAgger, you need access to the expert themselves, so you can ask questions like "if you saw configuration x, what would you do?" This puts much more demand on the expert.

Another question that arises is: what should N, the number of trajectories generated in each round, be? In practice, the initial N should probably be reasonably large, so that the initial policy f_0 is pretty good. The number of trajectories generated in each subsequent iteration can be much smaller, perhaps even just one.

Intuitively, DAgger should be less sensitive to compounding error than SupervisedIL, precisely because it gets trained on observations that it is likely to see at test time. This is formalized in the following theorem:

Theorem 20 (Loss of DAgger). Suppose that one runs Algorithm 18.3 using a multiclass classifier that optimizes the 0-1 loss (or an upper bound thereof). Let ε be the error rate of the underlying classifier (in expectation) and assume that all instantaneous losses are in the range [0, ℓ^{(max)}]. Let f be the learned policy; then:

    \underbrace{\mathbb{E}_{\tau \sim f} \sum_t \ell_t}_{\text{loss of learned policy}} \;\le\; \underbrace{\mathbb{E}_{\tau \sim \text{expert}} \sum_t \ell_t}_{\text{loss of expert}} \;+\; \ell^{(\max)} T \epsilon \;+\; O\!\left(\frac{\ell^{(\max)} T \log T}{\text{MaxIter}}\right)    (18.4)

Furthermore, if the loss function is strongly convex in f, and MaxIter is Õ(T/ε), then:

    \mathbb{E}_{\tau \sim f} \sum_t \ell_t \;\le\; \mathbb{E}_{\tau \sim \text{expert}} \sum_t \ell_t \;+\; \ell^{(\max)} T \epsilon \;+\; O(\epsilon)    (18.5)
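In code, the aggregation loop is short. This sketch reuses the hypothetical `env`, `run_policy`, and scikit-learn-style learner from earlier, plus a queryable `expert(x)` function:

```python
def dagger_train(make_learner, expert, env, T, max_iter, N):
    """Algorithm 45 (DAgger) as a sketch; returns all learned policies."""
    X, y = [], []
    for _ in range(N):                        # D_0: expert demonstrations
        tau, _ = run_policy(env, expert, T)
        for (x, a, loss) in tau:
            X.append(x); y.append(a)
    learner = make_learner().fit(X, y)        # initial policy f_0
    policies = [learner]
    for i in range(1, max_iter + 1):
        f = lambda x: learner.predict([x])[0]
        for _ in range(N):                    # run the CURRENT policy...
            tau, _ = run_policy(env, f, T)
            for (x, a, loss) in tau:          # ...but label the states it
                X.append(x)                   # visits with the expert's
                y.append(expert(x))           # actions, not its own
        learner = make_learner().fit(X, y)    # retrain on union of all data
        policies.append(learner)
    return policies   # in practice: pick the best on held-out dev tasks
```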

Both of these results show that, assuming MaxIter is large enough, the loss of the learned policy f (here, taken to be the best one of all the MaxIter policies learned) grows like Tε, which is what we hope for. Note that the final term in the first bound gets small so long as MaxIter is at least T log T.

18.4 Expensive Algorithms as Experts

Because of the strong requirement on the expert in DAgger (i.e., that you need to be able to query it many times during training), one of the most substantial use cases for DAgger is to learn to (quickly) imitate otherwise slow algorithms. Here are two prototypical examples:

1. Game playing. When a game (like chess or minecraft) can be run in simulation, you can often explicitly compute a semi-optimal expert behavior with brute-force search. But this search might be too computationally expensive to play in real time, so you can use it during training time, learning a fast policy that attempts to mimic the expensive search. This learned policy can then be applied at test time.

2. Discrete optimizers. Many discrete optimization problems can be computationally expensive to run in real time; for instance, even shortest path search on a large graph can be too slow for real time use. We can compute shortest paths offline as training data and then use imitation learning to try to build shortest path optimizers that will run sufficiently efficiently in real time.

Consider the game playing example, and for concreteness, suppose you are trying to learn to play solitaire (this is an easier example because it's a single player game). When running DaggerTrain (Algorithm 18.3) to learn a solitaire-playing policy, the algorithm will repeatedly ask for expert(x), where x is the current state of the game. What should this function return? Ideally, it should return the/an action a such that, if a is taken, and then the rest of the game is played optimally, the player wins. Computing this exactly is going to be very difficult for anything except the simplest games, so we need to resort to an approximation.

A common strategy is to run a depth-limited depth first search, starting at state x, and terminating after at most three or four moves (see Figure 18.3). This will generate a search tree. Unless you are very near the end of the game, none of the leaves of this tree will correspond to the end of the game. So you'll need some heuristic, h, for evaluating states that are non-terminals. You can propagate this heuristic score up to the root, and choose the action that looks best with this depth four search. This is not necessarily going to be the optimal action, and there's a speed/accuracy trade-off for searching deeper, but this is typically effective. This approach is summarized in Algorithm 18.4. [Figure 18.3: Depth limited depth-first search.]

Algorithm 46 DepthLimitedDFS(x, h, MaxDepth)
1: if x is a terminal state or MaxDepth ≤ 0 then
2:   return (⊥, h(x))  // if we cannot search deeper, return no action (⊥) and the current heuristic score
3: else
4:   BestAction, BestScore ← ⊥, -∞  // keep track of best action & its score
5:   for all actions a from x do
6:     (_, score) ← DepthLimitedDFS(x ∘ a, h, MaxDepth - 1)  // get score for appending action a to x, depth reduced by one
7:     if score > BestScore then
8:       BestAction, BestScore ← a, score  // update tracked best action & score
9:     end if
10:   end for
11: end if
12: return (BestAction, BestScore)  // return best found action and its score
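Here is a minimal sketch of that search as runnable Python. The game hooks `is_terminal(x)`, `actions(x)`, and `result(x, a)` are assumptions for illustration, not part of any given library:

```python
import math

def depth_limited_dfs(x, h, max_depth, game):
    """Algorithm 46 as a sketch. `game` is a hypothetical object with
    is_terminal(x), actions(x), and result(x, a) methods; h scores
    non-terminal frontier states. Returns (best_action, best_score)."""
    if game.is_terminal(x) or max_depth <= 0:
        return None, h(x)          # can't search deeper: no action,
                                   # just the heuristic score of x
    best_action, best_score = None, -math.inf
    for a in game.actions(x):
        # score the state reached by taking action a, one level deeper
        _, score = depth_limited_dfs(game.result(x, a), h,
                                     max_depth - 1, game)
        if score > best_score:
            best_action, best_score = a, score
    return best_action, best_score

# A DAgger expert for a game could then be, e.g.:
# expert = lambda x: depth_limited_dfs(x, h, 4, game)[0]
```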

18.5 Structured Prediction via Imitation Learning

A final case where an expert can often be computed algorithmically arises when one solves structured prediction (see Chapter 17) via imitation learning. It is clearest how this can work in the case of sequence labeling. Recall that there, predicted outputs should be sequences of labels. The running example from the earlier chapter was:

    x = "monsters eat tasty bunnies"    (18.6)
    y =  noun     verb  adj   noun      (18.7)

One can easily cast the prediction of y as a sequential decision making problem, by treating the production of y in a left-to-right manner. In this case, we have a time horizon T = 4. We want to learn a policy f that first predicts "noun", then "verb", then "adj", then "noun" on this input.

Let's suppose that the input to f consists of features extracted both from the input (x) and the current predicted output prefix ŷ, denoted φ(x, ŷ). For instance, φ(x, ŷ) might represent a similar set of features to those used in Chapter 17. It is perhaps easiest to think of f as just a classifier: given some features of the input sentence x ("monsters eat tasty bunnies"), and some features about previous predictions in the output prefix (so far, "noun verb" has been produced), the goal of f is to predict the tag for the next word ("tasty") in this context.

An important question is: what is the expert in this case? Intuitively, the expert should provide the correct next label, but what does this mean? That depends on the loss function being optimized. Under Hamming loss (sum of zero/one losses over each individual prediction), the expert is straightforward. When the expert is asked to produce an action for the third word, the expert's response is always "adj" (or whatever happens to be the correct label for the third word in the sentence it is currently training on).

More generally, the expert gets to look at x, y and a prefix ŷ of the output. Note, importantly, that the prefix might be wrong! In particular, after the first iteration of DAgger, the prefix will be predicted by the learned policy, which may make mistakes! The expert also has some structured loss function ℓ that it is trying to minimize. Like in the previous section, the expert's goal is to choose the action that minimizes the long-term loss according to ℓ on this example.

To be more formal, we need a bit of notation. Let best(ℓ, y, ŷ) denote the loss (according to ℓ and the ground truth y) of the best reachable output starting at ŷ. For instance, if y is "noun verb adj noun" and ŷ is "noun noun", and the loss is Hamming loss, then the best achievable output (predicting left-to-right) is "noun noun adj noun", which has a loss of 1. Thus, best for this situation is 1.

Given that notion of best, the expert is easy to define:

    \text{expert}(\ell, y, \hat{y}) = \arg\min_a \; \text{best}(\ell, y, \hat{y} \circ a)    (18.8)

Namely, it is the action that leads to the best possible completion after taking that action. So in the example above, the expert action is "adj".

For some problems and some loss functions, computing the expert is easy. In particular, for sequence labeling under Hamming loss, it's trivial. In the case that you can compute the expert exactly, it is often called an oracle. (Some literature calls it a dynamic oracle, though the extra word is unnecessary.) For some other problems, exactly computing an oracle is computationally expensive or intractable. In those cases, one can often resort to depth-limited depth-first search (Algorithm 18.4) to compute an approximate oracle as an expert.
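Under Hamming loss, best and the expert of Equation (18.8) reduce to a few lines, as in this sketch:

```python
def best_hamming(y, y_hat):
    """Hamming loss of the best output reachable from prefix y_hat:
    mistakes already in the prefix are sunk; the rest can match y."""
    return sum(int(a != b) for a, b in zip(y_hat, y))

def expert_hamming(y, y_hat):
    """Equation (18.8) under Hamming loss: the action minimizing
    best(l, y, y_hat + [a]) is just the next ground-truth label."""
    return y[len(y_hat)]

y = ["noun", "verb", "adj", "noun"]        # monsters eat tasty bunnies
print(best_hamming(y, ["noun", "noun"]))   # 1 (one sunk mistake)
print(expert_hamming(y, ["noun", "noun"])) # adj
```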

To be very concrete, a typical implementation of DAgger applied to sequence labeling would go as follows. Each structured training example (a pair of sentence and tag-sequence) gives rise to one trajectory. At training time, a predicted tag sequence is generated left-to-right, starting with the empty sequence. At any given time step t, you are attempting to predict the label of the t-th word in the input. You define a feature vector φ(x, ŷ), which will typically consist of: (a) the t-th word, (b) left and right neighbors of the t-th word, (c) the last few predictions in ŷ, and (d) anything else you can think of; a sketch of such a feature function appears at the end of this section. In particular, the features are not limited to Markov style features, because we're no longer trying to do dynamic programming. The expert label for the t-th word is just the corresponding label in the ground truth y. Given all this, one can run DAgger (Algorithm 18.3) exactly as specified.

Moving to structured prediction problems other than sequence labeling is beyond the scope of this book. The general framework is to cast your structured prediction problem as a sequential decision making problem. Once you've done that, you need to decide on features (this is the easy part) and an expert (this is often the harder part). However, once you've done so, there are generic libraries for "compiling" your specification down to code.
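Such a feature function might look like the following sketch; the particular feature templates are illustrative choices, not prescribed by the algorithm:

```python
def phi(x, y_hat):
    """Sketch of features for tagging the t-th word: x is the list of
    words, y_hat the tags predicted so far (templates are illustrative)."""
    t = len(y_hat)
    return {
        "word=" + x[t]: 1.0,                                   # (a) t-th word
        "left=" + (x[t - 1] if t > 0 else "<s>"): 1.0,         # (b) neighbors
        "right=" + (x[t + 1] if t + 1 < len(x) else "</s>"): 1.0,
        "prev_tag=" + (y_hat[-1] if y_hat else "<s>"): 1.0,    # (c) history
        "prev2_tags=" + "_".join(y_hat[-2:]): 1.0,             # non-Markov is fine
    }
```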

18.6 Further Reading

TODO further reading