The Bellman Equation
Reza Shadmehr

In this document I will provide an explanation of the Bellman equation, which is a method for optimizing a cost function and arriving at a control policy.

1. Example of a game

Suppose that our states refer to the position on a grid, as shown below. If we are at the goal state, then the state cost per time step is zero. If we are at any other state, the state cost per time step is 5. Let's use the term $J_x$ to refer to this state cost per time step:

[Eq. (1): grid of state costs $J_x$, with 0 at the goal state and 5 at every other state]

The goal state is at row 2, col. 2, which means that if we are at this state, we incur no state costs. The double lines refer to a wall, preventing a move from one state to the neighboring state. That is, there is a wall between the top left and top middle states. If we perform some action (say, move from one box to the neighboring box), there will be a motor cost per time step, which we refer to with symbol $J_u$. The motor cost $J_u$ is one if we move, and zero otherwise. So the total cost per time step is:

$$\alpha^{(n)} = J_x\big(x^{(n)}\big) + J_u\big(\pi^{(n)}(x^{(n)})\big) \tag{2}$$

The term $\pi^{(n)}(x^{(n)})$ refers to the policy that we have. This policy specifies the action that we will perform for each state at time point $n$. For example, if we pick a random policy, then we might have actions that look like this:

[Eq. (3): grid showing a randomly chosen action for each state]

Suppose our final time step is $p$. If we are now at time point $k$, our objective is to find the policy that minimizes the total cost to go, $\sum_{n=k}^{p} \alpha^{(n)}$. Let's define the goodness of each policy via a value function:
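$$V_{\pi}^{(k)}\big(x^{(k)}\big) = \sum_{n=k}^{p} \alpha^{(n)} \tag{4}$$

If we are at the last time step, then the value of our policy is simply the cost per step at this last time step:

$$V_{\pi}^{(p)}\big(x^{(p)}\big) = J_x\big(x^{(p)}\big) + J_u\big(\pi^{(p)}(x^{(p)})\big) \tag{5}$$

The optimum policy is one that minimizes Eq. (4), which at the last time step is simply to do nothing:

[Eq. (6): grid of optimal actions $\pi^{*(p)}$, with "stay still" at every state]

In that case, we have:

$$V_{\pi^*}^{(p)}\big(x^{(p)}\big) = J_x\big(x^{(p)}\big) \tag{7}$$

that is, a value of 0 at the goal state and 5 at every other state.

The Bellman optimality principle states that an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This means that in order to find the optimal policy for time point $p-1$, for each state we must find the command that minimizes the following:

$$\pi^{*(p-1)}\big(x^{(p-1)}\big) = \arg\min_{u^{(p-1)}} \left\{ \alpha^{(p-1)} + V_{\pi^*}^{(p)}\big(x^{(p)}\big) \right\} \tag{8}$$

Once we have the optimal command for each state, the value of that state is:

$$V_{\pi^*}^{(p-1)}\big(x^{(p-1)}\big) = \min_{u^{(p-1)}} \left\{ \alpha^{(p-1)} + V_{\pi^*}^{(p)}\big(x^{(p)}\big) \right\}$$

Consider the top middle state. If we were to produce an action that moves us down, we would have the state cost of 5 and motor cost of 1, and so $\alpha^{(p-1)} = 6$. The value of the state we get to is zero. So the total value of this action is 6. If we were to stay and not move, $\alpha^{(p-1)} = 5$, plus the value of the state that we get to (the current state), which is 5, for a sum total of 10. So the value of the action of moving down is 6, whereas the action of doing nothing has the value of 10. The value of the action of moving to the right is 11. So the best action that we can do for the top middle state is to move down. Similarly, the best action that we can do for the bottom middle state is to stay still. In this way, we can define the action that minimizes Eq. (8) for each state, resulting in our policy for time step $p-1$: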
[Eq. (9): grid of optimal actions $\pi^{*(p-1)}$]

The value of this optimal policy is:

[Eq. (10): grid of state values $V_{\pi^*}^{(p-1)}$]

Before we proceed to the next step, it is worthwhile to take another look at Eq. (10). We have a value associated with each state. This value is the total cost that will be incurred if, starting at a given state, we were to perform the best sequence of actions possible. The sequence of actions will produce a sequence of costs per step that together sum to the value assigned to each state. In a sense, when we find ourselves at a given state, the value of that state represents the lowest cost to go that we can hope to incur if we were to produce the best actions possible.

Now we repeat this process for time step $p-2$. Let's reconsider the top middle state. The value of staying still is $J_x + J_u + V_{\pi^*}^{(p-1)} = 5 + 0 + 6 = 11$. The value of moving down is $5 + 1 + 0 = 6$. The value of moving to the right is 12. The best action remains to move down. Consider the bottom middle state. The value of moving is 12. The value of staying still is 15. The best action is to move up (or move to the right to the neighbor). The optimal policy at time step $p-2$ is:

[Eq. (11): grid of optimal actions $\pi^{*(p-2)}$]

The value of this optimal policy is:

[Eq. (12): grid of state values $V_{\pi^*}^{(p-2)}$]

We compute the optimal policy for time step $p-3$:
[Eq. (13): grid of optimal actions $\pi^{*(p-3)}$]

The value of this optimal policy is:

[Eq. (14): grid of state values $V_{\pi^*}^{(p-3)}$]

Similarly, we compute the optimal policy for time step $p-4$:

[Eq. (15): grid of optimal actions $\pi^{*(p-4)}$]

The value of this optimal policy is:

[Eq. (16): grid of state values $V_{\pi^*}^{(p-4)}$]

Now the interesting result comes when we consider time step $p-5$. Consider the top left box. If we were to move down, the cost is 30. If we were to stay, the cost is also 30. So the optimal policy is either to stay still or move down, as both give us the same cost. The reason for this is that the effort cost of moving is so large for this state (it is so far from the goal state) that the reward of getting to the goal only barely compensates for the cost of moving. Suppose we decide to stay, and so we have:

$$\pi^{*(p-5)} = \pi^{*(p-4)} \tag{17}$$

The value of this optimal policy is:

[Eq. (18): grid of state values $V_{\pi^*}^{(p-5)}$, in which the top left state has value 30]

However, at time step $p-6$, for the top left state our optimal action is to move down, and so we get:
[Eq. (19): grid of optimal actions $\pi^{*(p-6)}$]

For each time step, we have used the Bellman equation (Eq. 8) to find the optimal feedback control policy.

2. Example of a linear system without noise

In conditions for which we are dealing with a linear dynamical system and a quadratic cost, the value function will turn out to be a quadratic function of state, and the control policy will become a linear function of state. This class of control problems is also called Linear Quadratic Regulators. Let's start with a linear system without noise.

$$x^{(n+1)} = A x^{(n)} + B u^{(n)}, \qquad y^{(n)} = C x^{(n)} \tag{20}$$

We have the following cost per step:

$$\alpha^{(n)} = y^{(n)T} T^{(n)} y^{(n)} + u^{(n)T} L u^{(n)} = x^{(n)T} C^T T^{(n)} C x^{(n)} + u^{(n)T} L u^{(n)} \tag{21}$$

Let's begin at the final time step, $n = p$. At this time, the best action that we can perform is one that minimizes the cost $\alpha^{(p)}$. That action is $u^{(p)} = 0$. That is, regardless of state, the optimal policy of action at the final time point is:

$$\pi^{*}\big(x^{(p)}\big) = 0 \tag{22}$$

If we perform this optimal action, the value of the state we are at is:

$$V_{\pi^*}\big(x^{(p)}\big) = x^{(p)T} C^T T^{(p)} C x^{(p)} \tag{23}$$

We see that at the final time point, the value function is a quadratic function of state. Let's define matrix $W^{(p)}$ as follows:

$$W^{(p)} \equiv C^T T^{(p)} C \tag{24}$$

And so the value for the optimal policy can be written as:

$$V_{\pi^*}\big(x^{(p)}\big) = x^{(p)T} W^{(p)} x^{(p)} \tag{25}$$
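In order to find the optimal policy for time point $p-1$, for each state we must find the command that minimizes the sum of the cost at the current time step, plus the value of the state that we arrive at after we produce the command:

$$\pi^{*}\big(x^{(p-1)}\big) = \arg\min_{u^{(p-1)}} \left\{ \alpha^{(p-1)} + V_{\pi^*}\big(x^{(p)}\big) \right\} \tag{26}$$

We can write the expression inside the brackets as:

$$\begin{aligned}
\alpha^{(p-1)} + V_{\pi^*}\big(x^{(p)}\big)
&= x^{(p-1)T} C^T T^{(p-1)} C x^{(p-1)} + u^{(p-1)T} L u^{(p-1)} \\
&\quad + \big(A x^{(p-1)} + B u^{(p-1)}\big)^T W^{(p)} \big(A x^{(p-1)} + B u^{(p-1)}\big) \\
&= x^{(p-1)T} C^T T^{(p-1)} C x^{(p-1)} + u^{(p-1)T} L u^{(p-1)} + x^{(p-1)T} A^T W^{(p)} A x^{(p-1)} \\
&\quad + u^{(p-1)T} B^T W^{(p)} B u^{(p-1)} + 2 u^{(p-1)T} B^T W^{(p)} A x^{(p-1)}
\end{aligned} \tag{27}$$

To minimize the sum in Eq. (26), we find its derivative with respect to $u^{(p-1)}$ and set it equal to zero:

$$2 L u^{(p-1)} + 2 B^T W^{(p)} B u^{(p-1)} + 2 B^T W^{(p)} A x^{(p-1)} = 0 \tag{28}$$

This gives us the optimal commands:

$$u^{*(p-1)} = -\big(L + B^T W^{(p)} B\big)^{-1} B^T W^{(p)} A x^{(p-1)} \tag{29}$$

Let's define the following matrix:

$$G^{(p-1)} \equiv \big(L + B^T W^{(p)} B\big)^{-1} B^T W^{(p)} A \tag{30}$$

We now can write the optimal policy as follows:

$$\pi^{*}\big(x^{(p-1)}\big) = -G^{(p-1)} x^{(p-1)} \tag{31}$$

The value of each state under the optimal policy can be written as

$$V_{\pi^*}\big(x^{(p-1)}\big) = x^{(p-1)T} \Big( C^T T^{(p-1)} C + G^{(p-1)T} L G^{(p-1)} + A^T W^{(p)} A + G^{(p-1)T} B^T W^{(p)} B G^{(p-1)} - 2 G^{(p-1)T} B^T W^{(p)} A \Big) x^{(p-1)} \tag{32}$$

Notice that the value function is a quadratic function of state. We can simplify it a little using the definition of $G^{(p-1)}$:

$$G^{(p-1)T} L G^{(p-1)} + G^{(p-1)T} B^T W^{(p)} B G^{(p-1)} = G^{(p-1)T} \big(L + B^T W^{(p)} B\big) G^{(p-1)} = G^{(p-1)T} B^T W^{(p)} A \tag{33}$$

Let's define $W^{(p-1)}$ as follows:

$$\begin{aligned}
W^{(p-1)} &\equiv C^T T^{(p-1)} C + A^T W^{(p)} A + G^{(p-1)T} B^T W^{(p)} A - 2 G^{(p-1)T} B^T W^{(p)} A \\
&= C^T T^{(p-1)} C + A^T W^{(p)} A - G^{(p-1)T} B^T W^{(p)} A
\end{aligned} \tag{34}$$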
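The backward step from $W^{(p)}$ to $G^{(p-1)}$ and $W^{(p-1)}$ in Eqs. (24), (30), and (34) can be sketched in code. The following is a minimal illustration in plain Python rather than a reference implementation. It assumes a single motor command (so that $L + B^T W B$ is a scalar and no matrix inverse is needed), and the helper names `matmul`, `transpose`, and `lqr_backward` are hypothetical, introduced only for this example.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def lqr_backward(A, B, C, T, L):
    """Backward pass for a single-input system (assumption of this sketch).

    A: n x n, B: n x 1, C: m x n, T: list of p+1 matrices (m x m), L: float.
    Returns (G, W), where G[step] is a 1 x n gain row (u = -G x) and
    W[step] is the n x n matrix of the quadratic value function.
    """
    p = len(T) - 1
    n = len(A)
    At, Bt, Ct = transpose(A), transpose(B), transpose(C)
    G = [None] * (p + 1)
    W = [None] * (p + 1)
    G[p] = [[0.0] * n]                      # no command at the final step
    W[p] = matmul(matmul(Ct, T[p]), C)      # Eq. (24)
    for step in range(p - 1, -1, -1):
        BtW = matmul(Bt, W[step + 1])       # B^T W, a 1 x n row
        s = L + matmul(BtW, B)[0][0]        # scalar L + B^T W B
        BtWA = matmul(BtW, A)               # B^T W A, a 1 x n row
        G[step] = [[v / s for v in BtWA[0]]]            # Eq. (30)
        state_cost = matmul(matmul(Ct, T[step]), C)     # C^T T C
        AtWA = matmul(matmul(At, W[step + 1]), A)       # A^T W A
        corr = matmul(transpose(G[step]), BtWA)         # G^T B^T W A
        W[step] = [[state_cost[i][j] + AtWA[i][j] - corr[i][j]
                    for j in range(n)] for i in range(n)]   # Eq. (34)
    return G, W


# Hypothetical one-dimensional check: A = B = C = 1, T = 1 at every step, L = 1.
G, W = lqr_backward([[1.0]], [[1.0]], [[1.0]], [[[1.0]]] * 3, 1.0)
```

In this scalar check the gain at the step before last is $1/(1+1) = 0.5$, and the value weight grows from 1 at the final step to 1.5 one step earlier. For more than one motor command, the division by `s` would have to be replaced by a proper linear solve.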
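We can write the value function as:

$$V_{\pi^*}\big(x^{(p-1)}\big) = x^{(p-1)T} W^{(p-1)} x^{(p-1)} \tag{35}$$

We now have a recipe. For step $p-2$ we have:

$$G^{(p-2)} = \big(L + B^T W^{(p-1)} B\big)^{-1} B^T W^{(p-1)} A \tag{36}$$

And the following value function:

$$V_{\pi^*}\big(x^{(p-2)}\big) = x^{(p-2)T} W^{(p-2)} x^{(p-2)}, \qquad W^{(p-2)} = C^T T^{(p-2)} C + A^T W^{(p-1)} A - G^{(p-2)T} B^T W^{(p-1)} A \tag{37}$$

And so the procedure is as follows: starting from the last time point $p$, we compute $G^{(p)}$ (which is zero) and $W^{(p)}$ (Eq. 24). We next move to time point $p-1$ and compute $G^{(p-1)}$ (Eq. 30) and $W^{(p-1)}$ (Eq. 34). We then use Eq. (36) to compute $G^{(p-2)}$, and Eq. (37) to compute $W^{(p-2)}$. And so on, until we reach time point 0. For each time step, we will have a policy that transforms our current state into a motor command.

As an example, let's consider moving a single joint model of the elbow. The state of the system is defined by its position and velocity (referring to angular position and velocity). The dynamics of the system are described as follows:

$$\dot{x} = A_c x + B_c u, \qquad y = C_c x, \qquad A_c = \begin{bmatrix} 0 & 1 \\ -k/m & -b/m \end{bmatrix}, \qquad B_c = \begin{bmatrix} 0 \\ 1/m \end{bmatrix}$$

The above equations are written in continuous time. To represent the system in discrete time (with a time step of $\Delta t$), we can write the discrete equations as follows:

$$x^{(k+1)} = A x^{(k)} + B u^{(k)}, \qquad y^{(k)} = C x^{(k)}$$

$$A = I + A_c \Delta t, \qquad B = B_c \Delta t, \qquad C = C_c$$

I wanted the elbow to make a movement that ended at a goal state of $x(t = 300\ \text{ms}) = 0.5$, with zero velocity, and was then held there for an additional period. I used the following parameter values for the arm: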
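$k = 3$ N·m/rad
$b = 0.45$ N·m·s/rad
$m = 0.3$ kg·m²/rad

I set the state cost matrix $T$ to have the following values as a function of time:

[Values of $T$ as a function of time]

3. Example of a linear system with signal dependent noise

Let's consider a simple scalar system of the form:

$$x^{(t+1)} = a x^{(t)} + b\big(u^{(t)} + \varepsilon_u^{(t)}\big), \qquad \varepsilon_u^{(t)} \sim N\big(0,\ c^2 u^{(t)2}\big)$$

$$y^{(t)} = x^{(t)} + \varepsilon_y^{(t)}, \qquad \varepsilon_y^{(t)} \sim N\big(0,\ \sigma_y^2\big)$$

In this system, the state $x$ is a scalar, and so is the observation $y$. However, notice that in this system the noise is signal dependent. That is, the variance of the noise depends on the size of the motor commands. We begin by expressing the random variable $\varepsilon_u^{(t)}$ in terms of the random variable $\phi^{(t)} \sim N(0, 1)$ and $u^{(t)}$:

$$\varepsilon_u^{(t)} = c\, u^{(t)} \phi^{(t)}$$

Let's suppose that the cost per step is:

$$\alpha^{(t)} = \alpha_x x^{(t)2} + \alpha_u u^{(t)2}$$

This implies that at the last time point the optimal policy is $\pi^{*}\big(x^{(p)}\big) = 0$, and the value of the states achieved under this policy is $V_{\pi^*}\big(x^{(p)}\big) = \alpha_x x^{(p)2}$.

We now find the optimal policy for time step $p-1$. We begin by computing the term $E\big[V_{\pi^*}(x^{(p)}) \mid x^{(p-1)}, u^{(p-1)}\big]$:

$$E\big[x^{(p)2}\big] = \mathrm{var}\big[x^{(p)}\big] + E\big[x^{(p)}\big]^2$$

$$\begin{aligned}
E\big[V_{\pi^*}(x^{(p)}) \mid x^{(p-1)}, u^{(p-1)}\big] &= \alpha_x E\big[x^{(p)2}\big] = \alpha_x \mathrm{var}\big[x^{(p)}\big] + \alpha_x E\big[x^{(p)}\big]^2 \\
&= \alpha_x b^2 c^2 u^{(p-1)2} + \alpha_x \big(a x^{(p-1)} + b u^{(p-1)}\big)^2
\end{aligned}$$

The cost that we need to minimize at time step $p-1$ is:

$$\alpha^{(p-1)} + E\big[V_{\pi^*}(x^{(p)}) \mid x^{(p-1)}, u^{(p-1)}\big] = \alpha_x x^{(p-1)2} + \alpha_u u^{(p-1)2} + \alpha_x b^2 c^2 u^{(p-1)2} + \alpha_x \big(a x^{(p-1)} + b u^{(p-1)}\big)^2$$

We find the $u^{(p-1)}$ that minimizes this cost: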
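$$\frac{d}{du^{(p-1)}}\Big(\alpha^{(p-1)} + E\big[V_{\pi^*}(x^{(p)})\big]\Big) = 2\alpha_u u^{(p-1)} + 2\alpha_x b^2 c^2 u^{(p-1)} + 2\alpha_x b^2 u^{(p-1)} + 2\alpha_x a b\, x^{(p-1)} = 0$$

$$u^{*(p-1)} = -\frac{\alpha_x a b}{\alpha_u + \alpha_x b^2 c^2 + \alpha_x b^2}\, x^{(p-1)} = -g^{(p-1)} x^{(p-1)}$$

However, because $x$ is a random variable, at any time point we will only have an estimate of it, $\hat{x}$. And so our optimal policy at time point $p-1$ is $\pi^{*}\big(\hat{x}^{(p-1)}\big) = -g^{(p-1)} \hat{x}^{(p-1)}$. Using our policy for time step $p-1$, we can compute the value function $V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big)$ and demonstrate that it is a quadratic function of $x^{(p-1)}$ and the error in the estimate of that state, $x^{(p-1)} - \hat{x}^{(p-1)}$. Writing $x$, $\hat{x}$, and $g$ for the quantities at time step $p-1$:

$$\begin{aligned}
V_{\pi^*}(x, \hat{x}) &= \alpha_x x^2 + \alpha_u g^2 \hat{x}^2 + \alpha_x b^2 c^2 g^2 \hat{x}^2 + \alpha_x \big(a x - b g \hat{x}\big)^2 \\
&= \big(\alpha_x + \alpha_x a^2\big) x^2 + \big(\alpha_u + \alpha_x b^2 c^2 + \alpha_x b^2\big) g^2 \hat{x}^2 - 2 \alpha_x a b g\, x \hat{x} \\
&= \big(\alpha_x + \alpha_x a^2\big) x^2 + \alpha_x a b g\, \hat{x}^2 - 2 \alpha_x a b g\, x \hat{x}
\end{aligned}$$

where the last line uses $\big(\alpha_u + \alpha_x b^2 c^2 + \alpha_x b^2\big) g = \alpha_x a b$. If we note that $\hat{z}^2 - 2 z \hat{z} = (z - \hat{z})^2 - z^2$, then

$$V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big) = w_1^{(p-1)} x^{(p-1)2} + w_2^{(p-1)} \big(x^{(p-1)} - \hat{x}^{(p-1)}\big)^2$$

with

$$w_1^{(p-1)} = \alpha_x + \alpha_x a^2 - \alpha_x a b g^{(p-1)}, \qquad w_2^{(p-1)} = \alpha_x a b g^{(p-1)}$$

Now let's consider the time step $p-2$. We can observe $y^{(t)}$ and write the equation for the Kalman gain. As we will see, the Kalman gain $k^{(t)}$ will not depend on $u^{(t)}$.

$$\hat{x}^{(t|t)} = \hat{x}^{(t|t-1)} + k^{(t)} \big(y^{(t)} - \hat{x}^{(t|t-1)}\big)$$

$$x^{(t)} - \hat{x}^{(t|t)} = \big(1 - k^{(t)}\big)\big(x^{(t)} - \hat{x}^{(t|t-1)}\big) - k^{(t)} \varepsilon_y^{(t)}$$

$$P^{(t|t)} = \mathrm{var}\big(x^{(t)} - \hat{x}^{(t|t)}\big) = \big(1 - k^{(t)}\big)^2 P^{(t|t-1)} + k^{(t)2} \sigma_y^2$$

$$\frac{dP^{(t|t)}}{dk^{(t)}} = -2\big(1 - k^{(t)}\big) P^{(t|t-1)} + 2 k^{(t)} \sigma_y^2 = 0$$

$$k^{(t)} = \frac{P^{(t|t-1)}}{P^{(t|t-1)} + \sigma_y^2}$$

$$P^{(t|t)} = \big(1 - k^{(t)}\big) P^{(t|t-1)} = \frac{\sigma_y^2 P^{(t|t-1)}}{P^{(t|t-1)} + \sigma_y^2}$$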
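As a quick numerical sanity check on the Kalman gain, the sketch below compares the closed-form gain $k = P/(P + \sigma_y^2)$ against a brute-force search over candidate gains that minimizes the posterior variance $(1-k)^2 P + k^2 \sigma_y^2$. The prior variance and noise level are hypothetical values chosen only for illustration, and the function names are not from the text.

```python
def posterior_variance(k, prior_var, sigma_y):
    """Variance of the estimation error after an update with gain k."""
    return (1.0 - k) ** 2 * prior_var + k ** 2 * sigma_y ** 2


def kalman_gain(prior_var, sigma_y):
    """Closed-form gain obtained by setting dP/dk to zero."""
    return prior_var / (prior_var + sigma_y ** 2)


# Hypothetical values for illustration only.
prior_var, sigma_y = 2.0, 1.0

k_star = kalman_gain(prior_var, sigma_y)

# Brute-force search over gains between 0 and 1 in steps of 0.001.
candidates = [i / 1000.0 for i in range(1001)]
k_grid = min(candidates,
             key=lambda k: posterior_variance(k, prior_var, sigma_y))

post_var = posterior_variance(k_star, prior_var, sigma_y)
```

With these values the closed form gives $k = 2/3$, the grid search lands on the nearest candidate, and the posterior variance equals $\sigma_y^2 P/(P + \sigma_y^2) = 2/3$. Note that neither function takes the state or the current motor command as an argument.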
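At time point $t$, our estimate of $x^{(t)}$ is built from the prior estimate $\hat{x}^{(t|t-1)}$ and the Kalman gain:

$$\hat{x}^{(t|t)} = \hat{x}^{(t|t-1)} + k^{(t)} \big(y^{(t)} - \hat{x}^{(t|t-1)}\big)$$

At time point $t+1$:

$$\hat{x}^{(t+1|t)} = a \hat{x}^{(t|t)} + b u^{(t)} = a \hat{x}^{(t|t-1)} + a k^{(t)} \big(y^{(t)} - \hat{x}^{(t|t-1)}\big) + b u^{(t)}$$

Let's write $\hat{x}^{(t)}$ as shorthand for the prior estimate $\hat{x}^{(t|t-1)}$, so that:

$$\hat{x}^{(t+1)} = a \hat{x}^{(t)} + a k^{(t)} \big(y^{(t)} - \hat{x}^{(t)}\big) + b u^{(t)}$$

At time point $p-1$ we showed that the value function under the optimal policy, $V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big)$, is a quadratic function of $x^{(p-1)}$ and the error in the estimate of that state, $x^{(p-1)} - \hat{x}^{(p-1)}$. Let's write that relationship as

$$V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big) = w_1^{(p-1)} x^{(p-1)2} + w_2^{(p-1)} \big(x^{(p-1)} - \hat{x}^{(p-1)}\big)^2$$

and then find the optimal policy for time point $p-2$. The cost to minimize is:

$$\alpha^{(p-2)} + E\big[V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big) \mid x^{(p-2)}, \hat{x}^{(p-2)}, u^{(p-2)}\big]$$

$$E\big[V_{\pi^*}\big(x^{(p-1)}, \hat{x}^{(p-1)}\big)\big] = w_1^{(p-1)} E\big[x^{(p-1)2}\big] + w_2^{(p-1)} E\big[\big(x^{(p-1)} - \hat{x}^{(p-1)}\big)^2\big]$$

For the estimation error at $p-1$ we have (all quantities on the right are at time step $p-2$):

$$\begin{aligned}
x^{(p-1)} - \hat{x}^{(p-1)} &= a x + b u + b \varepsilon_u - a \hat{x} - a k \big(y - \hat{x}\big) - b u \\
&= a \big(x - \hat{x}\big) - a k \big(x + \varepsilon_y - \hat{x}\big) + b \varepsilon_u \\
&= (a - a k)\big(x - \hat{x}\big) - a k \varepsilon_y + b \varepsilon_u
\end{aligned}$$

Defining $d \equiv a - a k^{(p-2)}$:

$$\begin{aligned}
E\big[\big(x^{(p-1)} - \hat{x}^{(p-1)}\big)^2\big] &= E\big[d^2 (x - \hat{x})^2 - 2 a d k (x - \hat{x}) \varepsilon_y + 2 b d (x - \hat{x}) \varepsilon_u + a^2 k^2 \varepsilon_y^2 - 2 a b k\, \varepsilon_y \varepsilon_u + b^2 \varepsilon_u^2\big] \\
&= d^2 (x - \hat{x})^2 + a^2 k^2 \sigma_y^2 + b^2 c^2 u^2
\end{aligned}$$

The total cost is therefore:

$$\alpha_x x^2 + \alpha_u u^2 + w_1^{(p-1)} \big(b^2 c^2 u^2 + (a x + b u)^2\big) + w_2^{(p-1)} \big(d^2 (x - \hat{x})^2 + a^2 k^2 \sigma_y^2 + b^2 c^2 u^2\big)$$

Setting its derivative with respect to $u^{(p-2)}$ to zero:

$$2 \alpha_u u + 2 w_1^{(p-1)} b^2 c^2 u + 2 w_2^{(p-1)} b^2 c^2 u + 2 w_1^{(p-1)} b^2 u + 2 a b w_1^{(p-1)} x = 0$$

$$u^{*(p-2)} = -\frac{a b w_1^{(p-1)}}{\alpha_u + w_1^{(p-1)} b^2 c^2 + w_2^{(p-1)} b^2 c^2 + w_1^{(p-1)} b^2}\, x^{(p-2)} = -g^{(p-2)} x^{(p-2)}$$

The best that we can do is implement the policy $\pi^{*}\big(\hat{x}^{(p-2)}\big) = -g^{(p-2)} \hat{x}^{(p-2)}$. Let's show that under this policy, the value function remains quadratic in terms of $x^{(p-2)}$ and the error in the estimate of that state, $x^{(p-2)} - \hat{x}^{(p-2)}$.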
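Writing $x$, $\hat{x}$, $g$, $k$, and $d$ for the quantities at time step $p-2$, and $w_1$, $w_2$ for $w_1^{(p-1)}$, $w_2^{(p-1)}$:

$$\begin{aligned}
V_{\pi^*}\big(x^{(p-2)}, \hat{x}^{(p-2)}\big) &= \alpha_x x^2 + \alpha_u g^2 \hat{x}^2 + w_1 \big(b^2 c^2 g^2 \hat{x}^2 + (a x - b g \hat{x})^2\big) + w_2 \big(d^2 (x - \hat{x})^2 + a^2 k^2 \sigma_y^2 + b^2 c^2 g^2 \hat{x}^2\big) \\
&= \big(\alpha_x + w_1 a^2\big) x^2 + \big(\alpha_u + w_1 b^2 c^2 + w_2 b^2 c^2 + w_1 b^2\big) g^2 \hat{x}^2 - 2 a b w_1 g\, x \hat{x} + w_2 d^2 (x - \hat{x})^2 + w_2 a^2 k^2 \sigma_y^2 \\
&= \big(\alpha_x + w_1 a^2\big) x^2 + a b w_1 g\, \hat{x}^2 - 2 a b w_1 g\, x \hat{x} + w_2 d^2 (x - \hat{x})^2 + w_2 a^2 k^2 \sigma_y^2
\end{aligned}$$

If we note that $\hat{z}^2 - 2 z \hat{z} = (z - \hat{z})^2 - z^2$, then we can write the value function under the optimal policy as:

$$V_{\pi^*}\big(x^{(p-2)}, \hat{x}^{(p-2)}\big) = \big(\alpha_x + w_1 a^2 - a b w_1 g\big) x^2 + \big(a b w_1 g + w_2 d^2\big) (x - \hat{x})^2 + w_2 a^2 k^2 \sigma_y^2$$

$$= w_1^{(p-2)} x^{(p-2)2} + w_2^{(p-2)} \big(x^{(p-2)} - \hat{x}^{(p-2)}\big)^2 + w_3^{(p-2)}$$

Now we can summarize the algorithm. At any time point $t$, the optimal policy and the value of that policy are:

$$\pi^{*}\big(\hat{x}^{(t)}\big) = -g^{(t)} \hat{x}^{(t)}$$

$$V_{\pi^*}\big(x^{(t)}, \hat{x}^{(t)}\big) = w_1^{(t)} x^{(t)2} + w_2^{(t)} \big(x^{(t)} - \hat{x}^{(t)}\big)^2 + w_3^{(t)}$$

At the last time point we have:

$$g^{(p)} = 0, \qquad w_1^{(p)} = \alpha_x, \qquad w_2^{(p)} = 0, \qquad w_3^{(p)} = 0$$

At any other time point we have:

$$g^{(t)} = \frac{w_1^{(t+1)} a b}{\alpha_u + w_1^{(t+1)} b^2 c^2 + w_1^{(t+1)} b^2 + w_2^{(t+1)} b^2 c^2}$$

$$w_1^{(t)} = \alpha_x + w_1^{(t+1)} a^2 - w_1^{(t+1)} a b g^{(t)}$$

$$w_2^{(t)} = w_1^{(t+1)} a b g^{(t)} + w_2^{(t+1)} d^{(t)2}, \qquad d^{(t)} = a - a k^{(t)}$$

$$w_3^{(t)} = w_3^{(t+1)} + w_2^{(t+1)} a^2 k^{(t)2} \sigma_y^2$$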
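The summary above amounts to a short backward loop. The sketch below is one way to write it in plain Python, under stated assumptions: the recursion is taken in the form $g^{(t)} = w_1^{(t+1)} a b / (\alpha_u + w_1^{(t+1)} b^2 c^2 + w_1^{(t+1)} b^2 + w_2^{(t+1)} b^2 c^2)$ with $u = -g \hat{x}$, the Kalman gains $k^{(t)}$ are supplied as a precomputed list, and the function name `backward_pass` and all numerical values are hypothetical.

```python
def backward_pass(a, b, c, alpha_x, alpha_u, sigma_y, k):
    """Backward recursion for the scalar system with signal dependent noise.

    k[t] is the Kalman gain at time t for t = 0..p-1 (assumed given here).
    Returns lists g, w1, w2, w3 of length p+1, indexed by time step.
    """
    p = len(k)
    g = [0.0] * (p + 1)
    w1 = [0.0] * (p + 1)
    w2 = [0.0] * (p + 1)
    w3 = [0.0] * (p + 1)
    w1[p] = alpha_x                      # value at the last step is alpha_x * x^2
    for t in range(p - 1, -1, -1):
        denom = (alpha_u + w1[t + 1] * b * b * c * c
                 + w1[t + 1] * b * b + w2[t + 1] * b * b * c * c)
        g[t] = w1[t + 1] * a * b / denom             # feedback gain, u = -g * x_hat
        d = a - a * k[t]                             # error propagation factor
        w1[t] = alpha_x + w1[t + 1] * a * a - w1[t + 1] * a * b * g[t]
        w2[t] = w1[t + 1] * a * b * g[t] + w2[t + 1] * d * d
        w3[t] = w3[t + 1] + w2[t + 1] * a * a * k[t] ** 2 * sigma_y ** 2
    return g, w1, w2, w3


# Hypothetical parameters: two steps before the final time point.
g0, w1_0, w2_0, w3_0 = backward_pass(1.0, 1.0, 0.0, 1.0, 1.0, 1.0, [0.5, 0.5])
g_noisy, _, _, _ = backward_pass(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, [0.5, 0.5])
```

With no signal dependent noise ($c = 0$) the gain one step before the end is 0.5. Setting $c = 1$ lowers it to $1/3$, which illustrates how noise that grows with the motor command makes the policy less vigorous. In a full simulation the gains $k^{(t)}$ would come from a forward pass of the Kalman filter rather than being fixed by hand.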