The Bellman Equation


Reza Shadmehr

In this document I will provide an explanation of the Bellman equation, which is a method for optimizing a cost function and arriving at a control policy.

1. Example of a game

Suppose that our states refer to positions on a grid, as shown below. If we are at the goal state, then the state cost per time step is zero. If we are at any other state, the state cost per time step is 5. Let us use the term $J_x$ to refer to this state cost per time step:

$$J_x = \begin{bmatrix} 5 & 5 & 5 \\ 5 & 0 & 5 \\ 5 & 5 & 5 \end{bmatrix} \qquad (1)$$

The goal state is at row 2, col. 2, which means that if we are at this state, we incur no state costs. The double lines refer to walls, preventing one from moving from one state to the neighboring state. For example, there is a wall between the top left and top middle states; the goal state itself can be entered only from the state above it and from the state to its right, not from the states to its left or below it. If we perform some action (say, move from one box to a neighboring box), there will be a motor cost per time step, which we refer to with the symbol $J_u$. The motor cost is one if we move, and zero otherwise. So the total cost per time step is:

$$\alpha^{(n)} = J_x^{(n)} + J_u^{(n)} \qquad (2)$$

The term $\pi^{(n)}$ refers to the policy that we have. This policy specifies the action that we will perform for each state at time point $n$. For example, if we pick a random policy, then we might have actions that look like this:

$$\pi^{(n)} = \text{[an arrow, chosen at random, for each state of the grid]} \qquad (3)$$

Suppose our final time step is $p$. If we are now at time point $k$, our objective is to find the policy that minimizes the total cost to go $\sum_{n=k}^{p} \alpha^{(n)}$. Let us define the goodness of each policy via a value function:

$$V^{\pi}\!\left(x^{(k)}\right) = \sum_{n=k}^{p} \alpha^{(n)} \qquad (4)$$

If we are at the last time step, then the value of our policy is simply the cost per step at this last time step:

$$V^{(p)} = J_x^{(p)} + J_u^{(p)} \qquad (5)$$

The optimum policy is one that minimizes Eq. (4), which at the last time step is simply to do nothing:

$$\pi^{*(p)} = \text{[stay still in every state]} \qquad (6)$$

In that case, we have:

$$V_{\pi^*}^{(p)} = \begin{bmatrix} 5 & 5 & 5 \\ 5 & 0 & 5 \\ 5 & 5 & 5 \end{bmatrix} \qquad (7)$$

The Bellman optimality principle states that an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This means that in order to find the optimal policy for time point $p-1$, for each state we must find the command that minimizes the following:

$$\pi^{*(p-1)} = \arg\min_{u}\left\{\alpha^{(p-1)} + V_{\pi^*}^{(p)}\!\left(x^{(p)}\right)\right\} \qquad (8)$$

Once we have the optimal command for each state, the value of that state is:

$$V_{\pi^*}^{(p-1)} = \min_{u}\left\{\alpha^{(p-1)} + V_{\pi^*}^{(p)}\!\left(x^{(p)}\right)\right\}$$

Consider the top middle state. If we were to produce an action that moves us down, we would have the state cost of 5 and the motor cost of 1, and so $\alpha^{(p-1)} = 6$. The value of the state we get to (the goal) is zero. So the total value of this action is 6. If we were to stay and not move, $\alpha^{(p-1)} = 5$, plus the value of the state that we get to (the current state), which is 5, for a sum total of 10. So the value of the action of moving down is 6, whereas the action of doing nothing has the value of 10. The value of the action of moving to the right is 11. So the best action that we can do for the top middle state is to move down. Similarly, the best action that we can do for the bottom middle state is to stay still (the wall below the goal prevents it from entering the goal directly). In this way, we can define the action that minimizes Eq. (8) for each state, resulting in our policy for time step $p-1$:

$$\pi^{*(p-1)} = \begin{bmatrix} \circ & \downarrow & \circ \\ \circ & \cdot & \leftarrow \\ \circ & \circ & \circ \end{bmatrix} \qquad (9)$$

(Here $\circ$ means stay still, $\cdot$ marks the goal, and arrows mark the states whose best action is to move.) The value of this optimal policy is:

$$V_{\pi^*}^{(p-1)} = \begin{bmatrix} 10 & 6 & 10 \\ 10 & 0 & 6 \\ 10 & 10 & 10 \end{bmatrix} \qquad (10)$$

Before we proceed to the next step, it is worthwhile to take another look at Eq. (10). We have a value associated with each state. This value is the total cost that will be incurred if, starting at a given state, we were to perform the best sequence of actions possible. The sequence of actions will produce a sequence of costs per step that together sum to the value assigned to each state. In a sense, when we find ourselves at a given state, the value of that state represents the lowest cost to go that we can hope to incur if we were to produce the best actions possible.

Now we repeat this process for time step $p-2$. Let us reconsider the top middle state. The value of staying still is $J_x + J_u + V_{\pi^*}^{(p-1)} = 5 + 0 + 6 = 11$. The value of moving down is $5 + 1 + 0 = 6$. The value of moving to the right is $5 + 1 + 10 = 16$. The best action remains to move down. Consider the bottom right state. The value of moving is $5 + 1 + 6 = 12$. The value of staying still is $5 + 0 + 10 = 15$. The best action is to move (move up, to the neighbor on the right side of the goal). The optimal policy at time step $p-2$ is:

$$\pi^{*(p-2)} = \begin{bmatrix} \circ & \downarrow & \leftarrow \\ \circ & \cdot & \leftarrow \\ \circ & \circ & \uparrow \end{bmatrix} \qquad (11)$$

The value of this optimal policy is:

$$V_{\pi^*}^{(p-2)} = \begin{bmatrix} 15 & 6 & 12 \\ 15 & 0 & 6 \\ 15 & 15 & 12 \end{bmatrix} \qquad (12)$$

We compute the optimal policy for time step $p-3$:

$$\pi^{*(p-3)} = \begin{bmatrix} \circ & \downarrow & \leftarrow \\ \circ & \cdot & \leftarrow \\ \circ & \rightarrow & \uparrow \end{bmatrix} \qquad (13)$$

The value of this optimal policy is:

$$V_{\pi^*}^{(p-3)} = \begin{bmatrix} 20 & 6 & 12 \\ 20 & 0 & 6 \\ 20 & 18 & 12 \end{bmatrix} \qquad (14)$$

Similarly, we compute the optimal policy for time step $p-4$:

$$\pi^{*(p-4)} = \begin{bmatrix} \circ & \downarrow & \leftarrow \\ \circ & \cdot & \leftarrow \\ \rightarrow & \rightarrow & \uparrow \end{bmatrix} \qquad (15)$$

The value of this optimal policy is:

$$V_{\pi^*}^{(p-4)} = \begin{bmatrix} 25 & 6 & 12 \\ 25 & 0 & 6 \\ 24 & 18 & 12 \end{bmatrix} \qquad (16)$$

Now the interesting result comes when we consider time step $p-5$. Consider the middle left box. If we were to move down, the cost is $5 + 1 + 24 = 30$. If we were to stay, the cost is also $5 + 25 = 30$. So the optimal policy is either to stay still or to move down, as both give us the same cost. The reason for this is that the effort cost of moving is so large for this state (it is so far from the goal state, since it must travel all the way around the walls) that the reward of getting to the goal only barely compensates for the cost of moving. Suppose we decide to stay, and so we have:

$$\pi^{*(p-5)} = \pi^{*(p-4)} \qquad (17)$$

The value of this optimal policy is:

$$V_{\pi^*}^{(p-5)} = \begin{bmatrix} 30 & 6 & 12 \\ 30 & 0 & 6 \\ 24 & 18 & 12 \end{bmatrix} \qquad (18)$$

However, at time step $p-6$, for the middle left state the optimal action becomes to move down (cost $5 + 1 + 24 = 30$, versus $5 + 30 = 35$ for staying).
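The backward recursion in this example is mechanical enough to check in a few lines of code. The sketch below is a minimal implementation of the Bellman backup, assuming the grid layout used in this section (3×3, goal at the center, walls between the top-left/top-middle pair and on the left and bottom sides of the goal); the state coordinates and the `backup` helper are names introduced here for illustration.

```python
# Value iteration (backward induction) on the 3x3 grid example.
# Assumed layout: goal at row 2, col 2 (index (1,1)); walls between
# (0,0)-(0,1), (1,0)-(1,1), and (2,1)-(1,1).
STATE_COST = 5   # J_x per time step in any non-goal state
MOVE_COST = 1    # J_u per time step if we move
GOAL = (1, 1)

WALLS = {frozenset({(0, 0), (0, 1)}),   # top left   | top middle
         frozenset({(1, 0), (1, 1)}),   # middle left | goal
         frozenset({(2, 1), (1, 1)})}   # bottom middle | goal

STATES = [(r, c) for r in range(3) for c in range(3)]

def neighbors(s):
    r, c = s
    for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= n[0] < 3 and 0 <= n[1] < 3 and frozenset({s, n}) not in WALLS:
            yield n

def backup(V):
    """One Bellman backup: V'(s) = min_u { J_x + J_u + V(next state) }."""
    new = {}
    for s in STATES:
        jx = 0 if s == GOAL else STATE_COST
        best = jx + V[s]                             # stay still
        for n in neighbors(s):
            best = min(best, jx + MOVE_COST + V[n])  # move to a neighbor
        new[s] = best
    return new

# Value at the final time step p is just the state cost (Eq. 7).
V = {s: (0 if s == GOAL else STATE_COST) for s in STATES}
for step in range(1, 7):                 # back up to time step p-6
    V = backup(V)
    if step == 1:
        assert V[(0, 1)] == 6 and V[(0, 0)] == 10   # matches Eq. (10)
print(V[(1, 0)])   # cost to go of the middle left state at p-6
```

Running this reproduces the numbers in the text: the middle-left state settles at a cost to go of 30, reached equally well by staying (until the horizon is close) or by moving down around the walls.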

And so we get:

$$\pi^{*(p-6)} = \begin{bmatrix} \circ & \downarrow & \leftarrow \\ \downarrow & \cdot & \leftarrow \\ \rightarrow & \rightarrow & \uparrow \end{bmatrix} \qquad (19)$$

For each time step, we have used the Bellman equation (Eq. 8) to find the optimal feedback control policy.

2. Example of a linear system without noise

In conditions for which we are dealing with a linear dynamical system, the value function will turn out to be a quadratic function of state, and the control policy will become a linear function of state. This class of control problems is also called Linear Quadratic Regulators. Let us start with a linear system without noise:

$$x^{(n+1)} = A x^{(n)} + B u^{(n)}, \qquad y^{(n)} = C x^{(n)} \qquad (20)$$

We have the following cost per step:

$$\alpha^{(n)} = y^{(n)T} T y^{(n)} + u^{(n)T} L u^{(n)} = x^{(n)T} C^T T C x^{(n)} + u^{(n)T} L u^{(n)} \qquad (21)$$

Let us begin at the final time step $p$. At this time, the best action that we can perform is one that minimizes the cost $\alpha^{(p)}$. That action is:

$$u^{*(p)} = 0 \qquad (22)$$

That is, regardless of state, the optimal policy at the final time point is to do nothing. If we perform this optimal action, the value of the state we are at is:

$$V_{\pi^*}\!\left(x^{(p)}\right) = x^{(p)T} C^T T C x^{(p)} \qquad (23)$$

We see that at the final time point, the value function is a quadratic function of state. Let us define the matrix $W^{(p)}$ as follows:

$$W^{(p)} = C^T T C \qquad (24)$$

And so the value for the optimal policy can be written as:

$$V_{\pi^*}\!\left(x^{(p)}\right) = x^{(p)T} W^{(p)} x^{(p)} \qquad (25)$$

In order to find the optimal policy for time point $p-1$, for each state we must find the command that minimizes the sum of the cost at the current time step, plus the value of the state that we arrive at after we produce the command:

$$u^{*(p-1)} = \arg\min_{u^{(p-1)}}\left\{\alpha^{(p-1)} + V_{\pi^*}\!\left(x^{(p)}\right)\right\} \qquad (26)$$

We can write the expression inside the brackets as:

$$\alpha^{(p-1)} + V_{\pi^*}\!\left(x^{(p)}\right) = x^{(p-1)T} C^T T C x^{(p-1)} + u^{(p-1)T} L u^{(p-1)} + \left(A x^{(p-1)} + B u^{(p-1)}\right)^T W^{(p)} \left(A x^{(p-1)} + B u^{(p-1)}\right)$$
$$= x^{(p-1)T}\left(C^T T C + A^T W^{(p)} A\right) x^{(p-1)} + u^{(p-1)T}\left(L + B^T W^{(p)} B\right) u^{(p-1)} + 2\, u^{(p-1)T} B^T W^{(p)} A\, x^{(p-1)} \qquad (27)$$

To minimize the sum in Eq. (26), we find its derivative with respect to $u^{(p-1)}$ and set it equal to zero. This gives us the optimal commands:

$$\left(L + B^T W^{(p)} B\right) u^{(p-1)} + B^T W^{(p)} A\, x^{(p-1)} = 0 \qquad (28)$$

$$u^{*(p-1)} = -\left(L + B^T W^{(p)} B\right)^{-1} B^T W^{(p)} A\, x^{(p-1)} \qquad (29)$$

Let us define the following matrix:

$$G^{(p-1)} = \left(L + B^T W^{(p)} B\right)^{-1} B^T W^{(p)} A \qquad (30)$$

We now can write the optimal policy as follows:

$$u^{*(p-1)} = -G^{(p-1)} x^{(p-1)} \qquad (31)$$

The value of each state under the optimal policy can be written as:

$$V_{\pi^*}\!\left(x^{(p-1)}\right) = \alpha^{(p-1)}\!\left(x^{(p-1)}, u^{*(p-1)}\right) + V_{\pi^*}\!\left(x^{(p)}\right)$$
$$= x^{(p-1)T}\left[C^T T C + G^{(p-1)T} L\, G^{(p-1)} + \left(A - B G^{(p-1)}\right)^T W^{(p)} \left(A - B G^{(p-1)}\right)\right] x^{(p-1)} \qquad (32)$$

Notice that the value function is a quadratic function of state. We can simplify it a little using the definition of $G^{(p-1)}$, which implies $\left(L + B^T W^{(p)} B\right) G^{(p-1)} = B^T W^{(p)} A$, and therefore:

$$G^{(p-1)T} L\, G^{(p-1)} + G^{(p-1)T} B^T W^{(p)} B\, G^{(p-1)} = G^{(p-1)T} B^T W^{(p)} A \qquad (33)$$

Expanding Eq. (32) and substituting Eq. (33), the cross terms cancel, so let us define $W^{(p-1)}$ as follows:

$$W^{(p-1)} = C^T T C + A^T W^{(p)} A - A^T W^{(p)} B\, G^{(p-1)} \qquad (34)$$
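The claim that $u^* = -\left(L + B^T W B\right)^{-1} B^T W A\, x$ minimizes the quadratic cost above is easy to spot-check numerically. The sketch below builds random matrices of arbitrary size (the sizes, seed, and the `cost` helper are choices made for this illustration, not part of the original text) and verifies that no small perturbation of $u^*$ lowers the cost:

```python
import numpy as np

# Numeric check of the minimizing command u* = -(L + B'WB)^{-1} B'WA x.
# The x'C'TCx part of the per-step cost does not depend on u, so it is
# omitted from the comparison.
rng = np.random.default_rng(0)
n, m = 3, 2                                   # state and command dimensions (arbitrary)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Ws = rng.standard_normal((n, n)); W = Ws @ Ws.T          # W is positive semidefinite
Ls = rng.standard_normal((m, m)); L = Ls @ Ls.T + np.eye(m)  # L is positive definite
x = rng.standard_normal((n, 1))

def cost(u):
    nxt = A @ x + B @ u                       # x(p) = A x + B u
    return float(u.T @ L @ u + nxt.T @ W @ nxt)

u_star = -np.linalg.solve(L + B.T @ W @ B, B.T @ W @ A @ x)

# Any perturbation of u* should not lower the cost (convex quadratic).
worse = all(cost(u_star + du) >= cost(u_star) - 1e-9
            for du in 0.1 * rng.standard_normal((20, m, 1)))
print(worse)
```

Because the cost is a convex quadratic in $u$ (the Hessian $L + B^T W B$ is positive definite), the stationary point found by setting the derivative to zero is the global minimum, which is what the perturbation test confirms.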

We can write the value function as:

$$V_{\pi^*}\!\left(x^{(p-1)}\right) = x^{(p-1)T} W^{(p-1)} x^{(p-1)} \qquad (35)$$

We now have a recipe. For step $p-2$ we have:

$$G^{(p-2)} = \left(L + B^T W^{(p-1)} B\right)^{-1} B^T W^{(p-1)} A \qquad (36)$$

And the following value function:

$$V_{\pi^*}\!\left(x^{(p-2)}\right) = x^{(p-2)T} W^{(p-2)} x^{(p-2)}, \qquad W^{(p-2)} = C^T T C + A^T W^{(p-1)} A - A^T W^{(p-1)} B\, G^{(p-2)} \qquad (37)$$

And so the procedure is as follows: starting from the last time point, we compute $G^{(p)}$ (which is zero) and $W^{(p)}$ (Eq. 24). We next move back one time point and compute $G^{(p-1)}$ (Eq. 30) and $W^{(p-1)}$ (Eq. 34). We then use Eq. (36) to compute $G^{(p-2)}$, and Eq. (37) to compute $W^{(p-2)}$. And so on, until we reach time point 0. For each time step, we will have a policy that transforms our current state into a motor command.

As an example, let us consider moving a single-joint model of the elbow. The state of the system is defined by its position and velocity (referring to angular position and velocity). With inertia $m$, viscosity $b$, and stiffness $k$, the dynamics of the system are described as follows:

$$\dot{x} = A_c x + B_c u, \qquad A_c = \begin{bmatrix} 0 & 1 \\ -k/m & -b/m \end{bmatrix}, \qquad B_c = \begin{bmatrix} 0 \\ 1/m \end{bmatrix}, \qquad y = C_c x$$

The above equations are written in continuous time. To represent them in discrete time (with a time step of $\Delta t$), we can write the discrete equations as follows:

$$x^{(k+1)} = A x^{(k)} + B u^{(k)}, \qquad y^{(k)} = C x^{(k)}$$
$$A = I + A_c \Delta t, \qquad B = B_c \Delta t, \qquad C = C_c$$

I wanted the elbow to make a movement that ended at a goal state of $x(300\ \text{ms}) = 0.5$ rad, with zero velocity, and held there for an additional 100 ms.

The parameter values for the arm were:

$$k = 3\ \text{N·m/rad}, \qquad b = 0.45\ \text{N·m·s/rad}, \qquad m = 0.3\ \text{kg·m}^2/\text{rad}$$

I set the state cost matrix $T$ to have the following values as a function of time: [figure: state cost weights plotted as a function of time].

3. Example of a linear system with signal-dependent noise

Let us consider a simple scalar system of the form:

$$x^{(t+1)} = a x^{(t)} + b\left(u^{(t)} + \varepsilon^{(t)}\right), \qquad \varepsilon^{(t)} \sim N\!\left(0,\ c^2 u^{(t)2}\right)$$
$$y^{(t)} = x^{(t)} + \varepsilon_y^{(t)}, \qquad \varepsilon_y^{(t)} \sim N\!\left(0,\ \sigma_y^2\right)$$

In this system, the state $x$ is a scalar, and so is the observation $y$. However, notice that in this system the noise is signal dependent. That is, the variance of the noise depends on the size of the motor commands $u^{(t)}$. We begin by expressing the random variable $\varepsilon^{(t)}$ in terms of a random variable $\phi^{(t)} \sim N(0, 1)$ and $u^{(t)}$:

$$\varepsilon^{(t)} = c\, u^{(t)} \phi^{(t)}$$

Let us suppose that the cost per step is:

$$\alpha^{(t)} = T^{(t)} x^{(t)2} + L u^{(t)2}$$

This implies that at the last time point the optimal policy is $u^{*(p)} = 0$, and the value of the states achieved under this policy is $V^{(p)} = T^{(p)} x^{(p)2}$. We now find the optimal policy for time step $p-1$. We begin by computing the term $E\!\left[V^{(p)}\right]$:

$$E\!\left[x^{(p)}\right] = a x^{(p-1)} + b u^{(p-1)}, \qquad \operatorname{var}\!\left[x^{(p)}\right] = b^2 c^2 u^{(p-1)2}$$
$$E\!\left[x^{(p)2}\right] = \operatorname{var}\!\left[x^{(p)}\right] + E\!\left[x^{(p)}\right]^2 = b^2 c^2 u^{(p-1)2} + \left(a x^{(p-1)} + b u^{(p-1)}\right)^2$$

The cost that we need to minimize at time step $p-1$ is:

$$\alpha^{(p-1)} + E\!\left[V^{(p)}\right] = T^{(p-1)} x^{(p-1)2} + L u^{(p-1)2} + T^{(p)}\left[b^2 c^2 u^{(p-1)2} + \left(a x^{(p-1)} + b u^{(p-1)}\right)^2\right]$$

We find the $u^{(p-1)}$ that minimizes this cost:

$$\frac{d}{du^{(p-1)}} = 2 L u^{(p-1)} + 2 T^{(p)} b^2 c^2 u^{(p-1)} + 2 T^{(p)} b\left(a x^{(p-1)} + b u^{(p-1)}\right) = 0$$

$$u^{*(p-1)} = -g^{(p-1)} x^{(p-1)}, \qquad g^{(p-1)} = \frac{a b\, T^{(p)}}{L + T^{(p)} b^2\left(1 + c^2\right)}$$

However, because $x^{(p-1)}$ is a random variable that we do not observe directly, at any time point we will have an estimate of it, $\hat{x}^{(p-1)}$. And so our optimal policy at time point $p-1$ is:

$$u^{*(p-1)} = -g^{(p-1)} \hat{x}^{(p-1)}$$

Using our policy for time step $p-1$, we can compute the value function $V^{(p-1)}\!\left(x^{(p-1)}, \hat{x}^{(p-1)}\right)$ and demonstrate that it is a quadratic function of the state $x^{(p-1)}$ and of the error in our estimate of that state, $x^{(p-1)} - \hat{x}^{(p-1)}$. Carrying out the expansion (using $\hat{x}^2 = x^2 - 2x\left(x - \hat{x}\right) + \left(x - \hat{x}\right)^2$), the value function takes the form:

$$V^{(p-1)}\!\left(x^{(p-1)}, \hat{x}^{(p-1)}\right) = w_1^{(p-1)} x^{(p-1)2} + w_2^{(p-1)}\left(x^{(p-1)} - \hat{x}^{(p-1)}\right)^2 + w_3^{(p-1)}$$

Now let us consider the time step before. We can observe $y^{(t)}$ and write the equation for the Kalman gain. As we will see, the Kalman gain will not depend on $u$. Writing the posterior estimate as a gain-weighted correction of the prior, and choosing the gain $k^{(t)}$ that minimizes the variance $P^{(t|t)}$ of the estimation error:

$$\hat{x}^{(t|t)} = \hat{x}^{(t|t-1)} + k^{(t)}\left(y^{(t)} - \hat{x}^{(t|t-1)}\right), \qquad P^{(t|t)} = \operatorname{var}\!\left[x^{(t)} - \hat{x}^{(t|t)}\right]$$

$$\frac{dP^{(t|t)}}{dk^{(t)}} = 0 \;\Rightarrow\; k^{(t)} = \frac{P^{(t|t-1)}}{P^{(t|t-1)} + \sigma_y^2}$$

$$P^{(t|t)} = P^{(t|t-1)} - k^{(t)} P^{(t|t-1)} = \frac{P^{(t|t-1)}\, \sigma_y^2}{P^{(t|t-1)} + \sigma_y^2}$$
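The scalar Kalman-gain recursion above converges quickly in practice. A minimal sketch, with assumed numbers and with the state-noise propagation simplified to a constant process-noise term `q` (an assumption for this sketch; in this section the true process noise is the signal-dependent term $b^2 c^2 u^{(t)2}$):

```python
# Scalar Kalman gain recursion:
#   k = P / (P + sigma_y^2)
#   P(t|t) = P(t|t-1) * sigma_y^2 / (P(t|t-1) + sigma_y^2)
a, sigma_y2, q = 0.9, 0.2, 0.05   # dynamics, observation noise, process noise (assumed)
P = 1.0                           # prior variance P(1|0) (assumed)
gains = []
for _ in range(10):
    kk = P / (P + sigma_y2)                  # Kalman gain k(t)
    P_post = P * sigma_y2 / (P + sigma_y2)   # posterior variance P(t|t)
    P = a**2 * P_post + q                    # propagate to the next prior
    gains.append(kk)
print(round(gains[-1], 3))
```

Note that, as stated in the text, nothing in this recursion depends on the motor commands when the process-noise term is fixed, which is why the gain sequence can be computed ahead of time.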

At time point $t$, our estimate $\hat{x}^{(t|t)}$ combines the prior estimate $\hat{x}^{(t|t-1)}$ and the observation, weighted by the Kalman gain:

$$\hat{x}^{(t|t)} = \hat{x}^{(t|t-1)} + k^{(t)}\left(y^{(t)} - \hat{x}^{(t|t-1)}\right)$$

At time point $t+1$:

$$\hat{x}^{(t+1|t)} = a \hat{x}^{(t|t)} + b u^{(t)} = a \hat{x}^{(t|t-1)} + a k^{(t)}\left(y^{(t)} - \hat{x}^{(t|t-1)}\right) + b u^{(t)}$$

Earlier we showed that the value function under the optimal policy is a quadratic function of the state and of the error in our estimate of that state:

$$V^{(t+1)}\!\left(x^{(t+1)}, \hat{x}^{(t+1)}\right) = w_1^{(t+1)} x^{(t+1)2} + w_2^{(t+1)}\left(x^{(t+1)} - \hat{x}^{(t+1)}\right)^2 + w_3^{(t+1)}$$

Let us write the estimation error at time $t+1$ in terms of quantities at time $t$, and then find the optimal policy for time point $t$:

$$x^{(t+1)} - \hat{x}^{(t+1|t)} = a x^{(t)} + b u^{(t)} + b \varepsilon^{(t)} - a \hat{x}^{(t|t-1)} - a k^{(t)}\left(y^{(t)} - \hat{x}^{(t|t-1)}\right) - b u^{(t)}$$
$$= \left(a - a k^{(t)}\right)\left(x^{(t)} - \hat{x}^{(t|t-1)}\right) - a k^{(t)} \varepsilon_y^{(t)} + b \varepsilon^{(t)}$$

Because the three terms are independent and the noise terms are zero mean, the expected squared error is:

$$E\!\left[\left(x^{(t+1)} - \hat{x}^{(t+1|t)}\right)^2\right] = \left(a - a k^{(t)}\right)^2\left(x^{(t)} - \hat{x}^{(t|t-1)}\right)^2 + a^2 k^{(t)2} \sigma_y^2 + b^2 c^2 u^{(t)2}$$

The cost that we need to minimize at time step $t$ is then:

$$\alpha^{(t)} + E\!\left[V^{(t+1)}\right] = T^{(t)} x^{(t)2} + L u^{(t)2} + w_1^{(t+1)} E\!\left[x^{(t+1)2}\right] + w_2^{(t+1)} E\!\left[\left(x^{(t+1)} - \hat{x}^{(t+1)}\right)^2\right] + w_3^{(t+1)}$$

Setting the derivative with respect to $u^{(t)}$ equal to zero, and replacing the unobserved state by its estimate, gives:

$$u^{*(t)} = -g^{(t)} \hat{x}^{(t)}, \qquad g^{(t)} = \frac{a b\, w_1^{(t+1)}}{L + w_1^{(t+1)} b^2\left(1 + c^2\right) + w_2^{(t+1)} b^2 c^2}$$

The best that we can do is to implement the policy $u^{*(t)} = -g^{(t)} \hat{x}^{(t)}$. Let us show that under this policy, the value function remains quadratic in terms of the state and the error in our estimate of that state:

$$V^{(t)}\!\left(x^{(t)}, \hat{x}^{(t)}\right) = w_1^{(t)} x^{(t)2} + w_2^{(t)}\left(x^{(t)} - \hat{x}^{(t)}\right)^2 + w_3^{(t)}$$

If we note that $\hat{x}^2 = x^2 - 2x\left(x - \hat{x}\right) + \left(x - \hat{x}\right)^2$, then substituting the policy into the cost and collecting terms in $x^{(t)2}$ and $\left(x^{(t)} - \hat{x}^{(t)}\right)^2$ shows that the cross terms cancel, so the quadratic form is preserved from one step to the next.

Now we can summarize the algorithm. At any time point $t$, the optimal policy and the value of that policy are:

$$u^{*(t)} = -g^{(t)} \hat{x}^{(t)}$$
$$V^{(t)}\!\left(x^{(t)}, \hat{x}^{(t)}\right) = w_1^{(t)} x^{(t)2} + w_2^{(t)}\left(x^{(t)} - \hat{x}^{(t)}\right)^2 + w_3^{(t)}$$

At the last time point we have:

$$g^{(p)} = 0, \qquad w_1^{(p)} = T^{(p)}, \qquad w_2^{(p)} = 0, \qquad w_3^{(p)} = 0$$

At any other time point we have:

$$g^{(t)} = \frac{a b\, w_1^{(t+1)}}{L + w_1^{(t+1)} b^2\left(1 + c^2\right) + w_2^{(t+1)} b^2 c^2}$$
$$w_1^{(t)} = T^{(t)} + a^2 w_1^{(t+1)} - a b\, g^{(t)} w_1^{(t+1)}$$
$$w_2^{(t)} = a b\, g^{(t)} w_1^{(t+1)} + a^2\left(1 - k^{(t)}\right)^2 w_2^{(t+1)}$$
$$w_3^{(t)} = w_3^{(t+1)} + w_2^{(t+1)} a^2 k^{(t)2} \sigma_y^2$$
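The backward pass for $g$, $w_1$, $w_2$, and $w_3$ summarized above can be sketched in a few lines. Here the Kalman gain is taken as a fixed, pre-computed constant, and all numeric values are assumptions chosen for the illustration; in the text the gain sequence comes from the forward recursion for $P$:

```python
# Backward recursion for the scalar signal-dependent-noise controller.
a, b, c = 0.9, 0.5, 0.4         # dynamics and noise scaling (assumed values)
L_cost, T_cost = 0.1, 1.0       # motor and state cost weights (assumed)
sigma_y2 = 0.2                  # observation-noise variance (assumed)
k_gain = 0.6                    # fixed Kalman gain (assumption for this sketch)
p = 20                          # horizon (assumed)

g = [0.0]                             # g(p) = 0
w1, w2, w3 = [T_cost], [0.0], [0.0]   # w1(p) = T, w2(p) = 0, w3(p) = 0
for _ in range(p):
    w1n, w2n, w3n = w1[0], w2[0], w3[0]
    gt = a * b * w1n / (L_cost + w1n * b**2 * (1 + c**2) + w2n * b**2 * c**2)
    g.insert(0, gt)
    w1.insert(0, T_cost + a**2 * w1n - a * b * gt * w1n)
    w2.insert(0, a * b * gt * w1n + a**2 * (1 - k_gain)**2 * w2n)
    w3.insert(0, w3n + w2n * a**2 * k_gain**2 * sigma_y2)

print(round(g[0], 3))
```

One design point worth noticing: because the signal-dependent noise contributes $b^2 c^2 u^2$ both to the state variance and to the estimation-error variance, both $w_1$ and $w_2$ appear in the denominator of $g^{(t)}$, so uncertainty about the state directly shrinks the optimal feedback gain.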