Lecture Notes: (Stochastic) Optimal Control


Marc Toussaint
Machine Learning & Robotics group, TU Berlin
Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany

July 2010

Disclaimer: These notes are not meant to be a complete or comprehensive survey on Stochastic Optimal Control. This is more of a personal script which I use to keep an overview over control methods and their derivations. One point of these notes is to fix a consistent notation and provide a coherent overview for these specific methods.

Contents

1 Notation
2 Stochastic optimal control (discrete time)
  2.1 The Linear-Quadratic-Gaussian (LQG) case
  2.2 Message passing in LQG
  2.3 Special case: kinematic control
  2.4 Special case: multiple kinematic task variables
  2.5 Special case: pseudo-dynamic process
  2.6 Special case: dynamic process
  2.7 Special case: multiple dynamic/kinematic task variables
3 The optimization view on classical control laws
  3.1 General quadratic loss function and constraints
  3.2 Ridge regression
  3.3 Motion rate control (pseudo-inverse kinematics)
  3.4 Regularized inverse kinematics (singularity robust motion rate control)
  3.5 Multiple regularized task variables
  3.6 Multiple prioritized task variables (prioritized inverse kinematics)
  3.7 Optimal dynamic control (incl. operational space control)

1 Notation

    x                          system state (can be q or (q, q̇))
    q ∈ R^n                    robot posture (vector of joint angles)
    y ∈ R^d                    a task variable (e.g., endeffector position)
    φ : q ↦ y                  differentiable kinematic function
    J(q) = ∂φ/∂q ∈ R^{d×n}     task Jacobian in posture q

We define a Gaussian over x with mean a and covariance matrix A as the function

    N(x | a, A) = |2πA|^{-1/2} exp{ -(1/2) (x-a)ᵀ A⁻¹ (x-a) }    (1)

with the property N(x | a, A) = N(a | x, A). We also define the canonical representation

    N[x | a, A] = exp{ -(1/2) aᵀ A⁻¹ a } |2πA⁻¹|^{-1/2} exp{ -(1/2) xᵀ A x + xᵀ a }    (2)

with the properties N[x | a, A] = N(x | A⁻¹a, A⁻¹) and N(x | a, A) = N[x | A⁻¹a, A⁻¹].
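A minimal numerical sketch (Python/numpy, not part of the derivations) of the correspondence between the moment parameterization (1) and the canonical parameterization (2); the helper names to_canonical and to_moments are ad hoc:

    import numpy as np

    def to_canonical(a, A):
        # moment parameters (a, A) -> canonical parameters (A^-1 a, A^-1)
        P = np.linalg.inv(A)
        return P @ a, P

    def to_moments(b, B):
        # canonical parameters (b, B) -> moment parameters (B^-1 b, B^-1)
        A = np.linalg.inv(B)
        return A @ b, A

    # check the stated property N[x | a, A] = N(x | A^-1 a, A^-1) on random parameters
    rng = np.random.default_rng(0)
    a = rng.normal(size=3)
    L = rng.normal(size=(3, 3))
    A = L @ L.T + np.eye(3)          # a random s.p.d. covariance matrix
    b, B = to_canonical(a, A)
    a2, A2 = to_moments(b, B)
    assert np.allclose(a, a2) and np.allclose(A, A2)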

The product of two Gaussians can be expressed as

    N[x | a, A] N[x | b, B] = N[x | a + b, A + B] N(A⁻¹a | B⁻¹b, A⁻¹ + B⁻¹),    (3)
    N(x | a, A) N(x | b, B) = N[x | A⁻¹a + B⁻¹b, A⁻¹ + B⁻¹] N(a | b, A + B),    (4)
    N(x | a, A) N[x | b, B] = N[x | A⁻¹a + b, A⁻¹ + B] N(a | B⁻¹b, A + B⁻¹).    (5)

Linear transformations in x imply the following identities,

    N(Fx + f | a, A) = (1/|F|) N(x | F⁻¹(a - f), F⁻¹ A F⁻ᵀ)    (6)
                     = (1/|F|) N[x | Fᵀ A⁻¹ (a - f), Fᵀ A⁻¹ F],    (7)
    N[Fx + f | a, A] = (1/|F|) N[x | Fᵀ (a - A f), Fᵀ A F].    (8)

The joint Gaussian of two linearly dependent Gaussian variables reads

    N(x | a, A) N(y | b + Fx, B) = N( (x, y) | (a, b + Fa), [[A, A Fᵀ], [F A, B + F A Fᵀ]] ).    (9)

(See the lecture notes on Gaussian identities for more identities.)

Let us collect some matrix identities which we will need throughout. The Woodbury identity

    (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ = W⁻¹ Jᵀ (J W⁻¹ Jᵀ + C)⁻¹    (10)

holds for any positive definite C and W. Further we have the identity

    I_n - (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ J = (Jᵀ C⁻¹ J + W)⁻¹ W.

We define the pseudo-inverse of J w.r.t. W as

    J♯_W = W⁻¹ Jᵀ (J W⁻¹ Jᵀ)⁻¹    (11)

and the similar quantity

    J♯_C = (Jᵀ C⁻¹ J)⁻¹ Jᵀ C⁻¹.    (12)

2 Stochastic optimal control (discrete time)

We assume a framework that is basically the same as for Markov Decision Processes, but with a slight change in notation: instead of maximizing rewards we minimize costs; instead of an action a_t we refer to the control u_t; instead of V(x) we denote the optimal value function by J_t(x). Cost and dynamics are generally non-stationary. Consider a discrete time stochastic controlled system

    x_{t+1} = f_t(x_t, u_t) + ξ_t,    ξ_t ∼ N(0, Q_t)    (13)

with the state x_t ∈ R^n, the control signal u_t ∈ R^m, and Gaussian noise ξ of covariance Q. An alternative notation for the same process is

    P(x_{t+1} | u_t, x_t) = N(x_{t+1} | f_t(x_t, u_t), Q_t).    (14)

For a given state-control sequence x_{0:T}, u_{0:T} we define the cost

    C(x_{0:T}, u_{0:T}) = Σ_{t=0}^T c_t(x_t, u_t).    (15)

Unlike the reward function in stationary MDPs, this cost function is typically not stationary. For instance, the cost might focus on the final state x_T relative to a goal state. In consequence, optimal policies are also non-stationary. Just as in the MDP case, the value function obeys the Bellman optimality equation

    J_t(x) = min_u [ c_t(x, u) + Σ_{x'} P(x' | u, x) J_{t+1}(x') ].    (16)

There are two versions of stochastic optimal control problems: The open-loop control problem is to find a control sequence u_{0:T} that minimizes the expected cost. The closed-loop (feedback) control problem is to find a control policy π*_t : x_t ↦ u_t (that exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost.
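To make the matrix identities concrete, here is a small numerical sanity check of the Woodbury identity (10) and the pseudo-inverse (11); a throwaway sketch, not part of the derivations:

    import numpy as np

    def random_spd(rng, k):
        # random symmetric positive definite k x k matrix
        L = rng.normal(size=(k, k))
        return L @ L.T + k * np.eye(k)

    rng = np.random.default_rng(1)
    n, d = 5, 2
    J = rng.normal(size=(d, n))
    W, C = random_spd(rng, n), random_spd(rng, d)
    inv = np.linalg.inv

    lhs = inv(J.T @ inv(C) @ J + W) @ J.T @ inv(C)     # (J' C^-1 J + W)^-1 J' C^-1
    rhs = inv(W) @ J.T @ inv(J @ inv(W) @ J.T + C)     # W^-1 J' (J W^-1 J' + C)^-1
    assert np.allclose(lhs, rhs)                        # Woodbury identity (10)

    J_sharp_W = inv(W) @ J.T @ inv(J @ inv(W) @ J.T)    # pseudo-inverse (11)
    assert np.allclose(J @ J_sharp_W, np.eye(d))        # it is a right inverse of J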

2.1 The Linear-Quadratic-Gaussian (LQG) case

Consider a linear control process with Gaussian noise,

    P(x_{t+1} | x_t, u_t) = N(x_{t+1} | A_t x_t + a_t + B_t u_t, Q_t),    (17)

and quadratic costs,

    c_t(x_t, u_t) = xᵀ_t R_t x_t - 2 rᵀ_t x_t + uᵀ_t H_t u_t.    (18)

The general LQG case is specified by the matrices and vectors A_{0:T}, a_{0:T}, B_{0:T}, Q_{0:T}, R_{0:T}, r_{0:T}, H_{0:T}. With a proper choice of R_t and r_t this corresponds to the problem of tracking an arbitrary desired trajectory x*_t, where the cost is quadratic in (x_t - x*_t).

The LQG case allows us to derive an exact backward recursion, called the Riccati equation, for the computation of the value function. The value function will always be a quadratic form of the state. Let us assume that we know the value function J_{t+1}(x) at time t+1 and that it has the form

    J_{t+1}(x) = xᵀ V_{t+1} x - 2 vᵀ_{t+1} x.    (19)

Then

    J_t(x) = min_u [ xᵀ R_t x - 2 rᵀ_t x + uᵀ H_t u + ∫_y N(y | A_t x + a_t + B_t u, Q_t) (yᵀ V_{t+1} y - 2 vᵀ_{t+1} y) dy ].    (20)

CONVENTION: For the remainder of this section we drop the subscript t for A, a, B, Q, R, r, H; wherever it is missing we refer to time t.

The expectation of a quadratic form under a Gaussian is

    E_{N(y|a,A)}{ yᵀ V y - 2 vᵀ y } = aᵀ V a - 2 vᵀ a + tr(V A).    (21)

So we have

    J_t(x) = min_u [ xᵀ R x - 2 rᵀ x + uᵀ H u + (Ax + a + Bu)ᵀ V_{t+1} (Ax + a + Bu) - 2 vᵀ_{t+1} (Ax + a + Bu) + tr(V_{t+1} Q) ]    (22)
           = min_u [ xᵀ (R + Aᵀ V_{t+1} A) x - 2 (r + Aᵀ(v_{t+1} - V_{t+1} a))ᵀ x + uᵀ (H + Bᵀ V_{t+1} B) u
                     + 2 uᵀ Bᵀ (V_{t+1}(Ax + a) - v_{t+1}) + aᵀ V_{t+1} a - 2 vᵀ_{t+1} a + tr(V_{t+1} Q) ].

Minimizing w.r.t. u by setting the gradient to zero we have

    0 = 2 (H + Bᵀ V_{t+1} B) u + 2 Bᵀ (V_{t+1}(Ax + a) - v_{t+1})    (23)
    u*_t(x) = -(H + Bᵀ V_{t+1} B)⁻¹ Bᵀ (V_{t+1}(Ax + a) - v_{t+1}),    (24)

and plugging this back in,

    J_t(x) = xᵀ (R + Aᵀ V_{t+1} A) x - 2 (r + Aᵀ(v_{t+1} - V_{t+1} a))ᵀ x
             - (V_{t+1}(Ax + a) - v_{t+1})ᵀ B (H + Bᵀ V_{t+1} B)⁻¹ Bᵀ (V_{t+1}(Ax + a) - v_{t+1})
             + aᵀ V_{t+1} a - 2 vᵀ_{t+1} a + tr(V_{t+1} Q),    (25)

which is again a quadratic form,

    J_t(x) = xᵀ V_t x - 2 vᵀ_t x + terms independent of x,
    V_t = R + Aᵀ V_{t+1} A - K V_{t+1} A    (26)
    v_t = r + Aᵀ (v_{t+1} - V_{t+1} a) - K (v_{t+1} - V_{t+1} a)    (27)
    K := Aᵀ V_{t+1} B (H + Bᵀ V_{t+1} B)⁻¹ Bᵀ = Aᵀ V_{t+1} [ V_{t+1} + (B H⁻¹ Bᵀ)⁻¹ ]⁻¹,

where the second form of K holds whenever B H⁻¹ Bᵀ is invertible. The equation for V_t is called the Riccati equation. Initialized with V_T = R_T and v_T = r_T, this gives a backward recursion to compute the value function J_t at each time step. Equation (24) also gives the optimal control policy. Note that the optimal control and path are independent of the process noise Q.
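The backward recursion (26), (27) and the control law (24) translate directly into code. The following Python/numpy sketch is a plain transcription of the equations above, assuming all time-indexed quantities are given as lists; it is not an optimized implementation:

    import numpy as np

    def lqg_backward(A, a, B, Q, R, r, H):
        # Backward Riccati recursion (26), (27); R, r have length T+1.
        # Q is accepted but unused: the optimal control is independent of the
        # process noise, cf. the remark above.
        T = len(R) - 1
        V, v = [None] * (T + 1), [None] * (T + 1)
        V[T], v[T] = R[T], r[T]
        inv = np.linalg.inv
        for t in range(T - 1, -1, -1):
            K = A[t].T @ V[t+1] @ B[t] @ inv(H[t] + B[t].T @ V[t+1] @ B[t]) @ B[t].T
            V[t] = R[t] + A[t].T @ V[t+1] @ A[t] - K @ V[t+1] @ A[t]
            w = v[t+1] - V[t+1] @ a[t]
            v[t] = r[t] + A[t].T @ w - K @ w
        def u_star(t, x):
            # optimal control (24) at time t and state x
            return -inv(H[t] + B[t].T @ V[t+1] @ B[t]) @ B[t].T @ (
                V[t+1] @ (A[t] @ x + a[t]) - v[t+1])
        return V, v, u_star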

2.2 Message passing in LQG

We may translate costs to probabilities by introducing a binary random variable z_t dependent on x_t and u_t,

    P(z_t = 1 | u_t, x_t) = exp{ -c_t(x_t, u_t) }.    (28)

In the LQG case we can simplify this to

    P(z_t = 1 | x_t) ∝ N[x_t | r_t, R_t],    (29)
    P(u_t) = N(u_t | 0, H⁻¹).    (30)

What is the posterior over the state trajectory x_{0:T}, conditioned on that we permanently observe z_t = 1? Since we can integrate out the control u_t, this is a simple Markov process with continuous state and Gaussian observation,

    P(x_{t+1} | x_t) = ∫_u N(x_{t+1} | A x_t + a + B u, Q) N(u | 0, H⁻¹) du    (31)
                     = N(x_{t+1} | A x_t + a, Q + B H⁻¹ Bᵀ).    (32)

Inference is a standard forward-backward process, just as in Kalman smoothing. The messages read

    μ_{x_{t-1}→x_t}(x_t) = N(x_t | s_t, S_t),
        s_t = a_{t-1} + A_{t-1} (S⁻¹_{t-1} + R_{t-1})⁻¹ (S⁻¹_{t-1} s_{t-1} + r_{t-1})
        S_t = Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1} + A_{t-1} (S⁻¹_{t-1} + R_{t-1})⁻¹ Aᵀ_{t-1}    (33)

    μ_{x_{t+1}→x_t}(x_t) = N(x_t | v_t, V_t),
        v_t = -A⁻¹_t a_t + A⁻¹_t (V⁻¹_{t+1} + R_{t+1})⁻¹ (V⁻¹_{t+1} v_{t+1} + r_{t+1})
        V_t = A⁻¹_t [ Q_t + B_t H⁻¹_t Bᵀ_t + (V⁻¹_{t+1} + R_{t+1})⁻¹ ] A⁻ᵀ_t    (34)

    μ_{z_t→x_t}(x_t) = N[x_t | r_t, R_t].    (35)

The potentials (v_t, V_t) which define the backward message can also be expressed in a different way: let us define

    V̄_t = V⁻¹_t + R_t    (36)
    v̄_t = V⁻¹_t v_t + r_t,    (37)

which corresponds to a backward message (in canonical representation) which has the cost message already absorbed. Using a special case of the Woodbury identity,

    (A⁻¹ + B⁻¹)⁻¹ = A - A (A + B)⁻¹ A,    (38)

the backward messages can be rewritten as

    V⁻¹_t = Aᵀ [ V̄⁻¹_{t+1} + Q + B H⁻¹ Bᵀ ]⁻¹ A = Aᵀ V̄_{t+1} A - K̄ V̄_{t+1} A,
        K̄ := Aᵀ V̄_{t+1} [ V̄_{t+1} + (Q + B H⁻¹ Bᵀ)⁻¹ ]⁻¹    (39)
    V⁻¹_t v_t = Aᵀ (v̄_{t+1} - V̄_{t+1} a_t) - K̄ (v̄_{t+1} - V̄_{t+1} a_t)    (40)
    V̄_t = R_t + (Aᵀ - K̄) V̄_{t+1} A    (41)
    v̄_t = r_t + (Aᵀ - K̄) (v̄_{t+1} - V̄_{t+1} a_t).    (42)

They correspond exactly to the Riccati equations (26), (27), except for the dependence on Q, which here interacts directly with the control cost metric H. Yet another way to write them is

    V̄_t = R_t + Aᵀ [ I - V̄_{t+1} (V̄_{t+1} + (Q + B H⁻¹ Bᵀ)⁻¹)⁻¹ ] V̄_{t+1} A    (43)
    v̄_t = r_t + Aᵀ [ I - V̄_{t+1} (V̄_{t+1} + (Q + B H⁻¹ Bᵀ)⁻¹)⁻¹ ] (v̄_{t+1} - V̄_{t+1} a_t).    (44)

Proof. Since all factors are pairwise we can use the expression (??) for the messages. We have

    μ_{x_{t-1}→x_t}(x_t) = ∫ dx_{t-1} P(x_t | x_{t-1}) μ_{x_{t-2}→x_{t-1}}(x_{t-1}) μ_{z_{t-1}→x_{t-1}}(x_{t-1})
        = ∫ dx_{t-1} N(x_t | A_{t-1} x_{t-1} + a_{t-1}, Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1}) N(x_{t-1} | s_{t-1}, S_{t-1}) N[x_{t-1} | r_{t-1}, R_{t-1}].

Using the product rule (5) on the last two terms gives a Gaussian N(s_{t-1} | R⁻¹_{t-1} r_{t-1}, S_{t-1} + R⁻¹_{t-1}), independent of x_t, which we can subsume in the normalization. What remains is

    μ_{x_{t-1}→x_t}(x_t) ∝ ∫ dx_{t-1} N(x_t | A_{t-1} x_{t-1} + a_{t-1}, Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1}) N[x_{t-1} | S⁻¹_{t-1} s_{t-1} + r_{t-1}, S⁻¹_{t-1} + R_{t-1}]
        = ∫ dx_{t-1} N(x_t | A_{t-1} x_{t-1} + a_{t-1}, Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1}) N(x_{t-1} | (S⁻¹_{t-1} + R_{t-1})⁻¹ (S⁻¹_{t-1} s_{t-1} + r_{t-1}), (S⁻¹_{t-1} + R_{t-1})⁻¹)
        = N(x_t | A_{t-1} (S⁻¹_{t-1} + R_{t-1})⁻¹ (S⁻¹_{t-1} s_{t-1} + r_{t-1}) + a_{t-1}, Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1} + A_{t-1} (S⁻¹_{t-1} + R_{t-1})⁻¹ Aᵀ_{t-1}),

which gives the messages as in (33). For comparison we also give the canonical representation. Let S̄_{t-1} = S⁻¹_{t-1} + R_{t-1} and s̄_{t-1} = S⁻¹_{t-1} s_{t-1} + r_{t-1}; then

    S_t = Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1} + A_{t-1} S̄⁻¹_{t-1} Aᵀ_{t-1}
        = A_{t-1} { A⁻¹_{t-1} (Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1}) A⁻ᵀ_{t-1} + S̄⁻¹_{t-1} } Aᵀ_{t-1}
    S⁻¹_t = A⁻ᵀ_{t-1} { S̄_{t-1} - S̄_{t-1} [ S̄_{t-1} + Aᵀ_{t-1} (Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1})⁻¹ A_{t-1} ]⁻¹ S̄_{t-1} } A⁻¹_{t-1}
    s_t = a_{t-1} + A_{t-1} S̄⁻¹_{t-1} s̄_{t-1}
    S⁻¹_t s_t = A⁻ᵀ_{t-1} (s̄_{t-1} + S̄_{t-1} A⁻¹_{t-1} a_{t-1}) - A⁻ᵀ_{t-1} S̄_{t-1} [ S̄_{t-1} + Aᵀ_{t-1} (Q_{t-1} + B_{t-1} H⁻¹_{t-1} Bᵀ_{t-1})⁻¹ A_{t-1} ]⁻¹ (s̄_{t-1} + S̄_{t-1} A⁻¹_{t-1} a_{t-1}).

We repeat the derivation for μ_{x_{t+1}→x_t}(x_t):

    μ_{x_{t+1}→x_t}(x_t) = ∫ dx_{t+1} P(x_{t+1} | x_t) μ_{x_{t+2}→x_{t+1}}(x_{t+1}) μ_{z_{t+1}→x_{t+1}}(x_{t+1})
        = ∫ dx_{t+1} N(x_{t+1} | A_t x_t + a_t, Q_t + B_t H⁻¹_t Bᵀ_t) N(x_{t+1} | v_{t+1}, V_{t+1}) N[x_{t+1} | r_{t+1}, R_{t+1}]
        ∝ ∫ dx_{t+1} N(x_{t+1} | A_t x_t + a_t, Q_t + B_t H⁻¹_t Bᵀ_t) N[x_{t+1} | V⁻¹_{t+1} v_{t+1} + r_{t+1}, V⁻¹_{t+1} + R_{t+1}]
        = N(A_t x_t + a_t | (V⁻¹_{t+1} + R_{t+1})⁻¹ (V⁻¹_{t+1} v_{t+1} + r_{t+1}), Q_t + B_t H⁻¹_t Bᵀ_t + (V⁻¹_{t+1} + R_{t+1})⁻¹)
        = N(x_t | -A⁻¹_t a_t + A⁻¹_t (V⁻¹_{t+1} + R_{t+1})⁻¹ (V⁻¹_{t+1} v_{t+1} + r_{t+1}), A⁻¹_t [ Q_t + B_t H⁻¹_t Bᵀ_t + (V⁻¹_{t+1} + R_{t+1})⁻¹ ] A⁻ᵀ_t).

For this backward message it is instructive to derive the canonical representation. With V̄_{t+1} = V⁻¹_{t+1} + R_{t+1} and v̄_{t+1} = V⁻¹_{t+1} v_{t+1} + r_{t+1},

    V⁻¹_t = Aᵀ_t [ V̄⁻¹_{t+1} + Q_t + B_t H⁻¹_t Bᵀ_t ]⁻¹ A_t
          = Aᵀ_t { V̄_{t+1} - V̄_{t+1} [ V̄_{t+1} + (Q_t + B_t H⁻¹_t Bᵀ_t)⁻¹ ]⁻¹ V̄_{t+1} } A_t
    V⁻¹_t v_t = Aᵀ_t (v̄_{t+1} - V̄_{t+1} a_t) - Aᵀ_t V̄_{t+1} [ V̄_{t+1} + (Q_t + B_t H⁻¹_t Bᵀ_t)⁻¹ ]⁻¹ (v̄_{t+1} - V̄_{t+1} a_t).
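For illustration, the forward sweep (33) in code; a minimal sketch assuming the same list-of-matrices convention as above, and assuming the trajectory starts with the prior N(x_0 | s0, S0):

    import numpy as np

    def forward_messages(A, a, B, Q, H, R, r, s0, S0, T):
        # mu_{x_{t-1} -> x_t}(x_t) = N(x_t | s[t], S[t]) as in (33)
        inv = np.linalg.inv
        s, S = [s0], [S0]
        for t in range(1, T + 1):
            P = inv(inv(S[t-1]) + R[t-1])      # (S^-1 + R)^-1, task message absorbed
            s.append(a[t-1] + A[t-1] @ P @ (inv(S[t-1]) @ s[t-1] + r[t-1]))
            S.append(Q[t-1] + B[t-1] @ inv(H[t-1]) @ B[t-1].T + A[t-1] @ P @ A[t-1].T)
        return s, S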

2.3 Special case: kinematic control

We assume a kinematic control problem (we write q instead of x). The process is simply

    q_{t+1} = q_t + u_t + ξ.    (45)

This means we have

    A_t = B_t = I,    a_t = 0.    (46)

2.4 Special case: multiple kinematic task variables

Let φ_i : q ↦ y_i be a kinematic mapping to a task variable y_i and J_i(q) its Jacobian. We assume we are given targets y*_{i,0:T} in the task space and (time-dependent) error metrics C⁻¹_i with which we want to follow the task targets. We have

    c_t(q_t, u_t) = uᵀ_t H u_t + Σ_i ‖y*_{i,t} - φ_i(q_t)‖²_{C⁻¹_i}    (47)
                  ≈ uᵀ_t H u_t + Σ_i ‖y*_{i,t} - φ_i(q̂_t) + J_i q̂_t - J_i q_t‖²_{C⁻¹_i},    J_i = J_i(q̂_t)    (48)
                  = uᵀ_t H u_t + Σ_i [ qᵀ_t Jᵀ_i C⁻¹_i J_i q_t - 2 (y*_{i,t} - φ_i(q̂_t) + J_i q̂_t)ᵀ C⁻¹_i J_i q_t ] + const,

where q̂_t is the point of linearization. Hence this cost is of the LQG form (18) with

    R_t = Σ_i Jᵀ_i C⁻¹_i J_i,    r_t = Σ_i Jᵀ_i C⁻¹_i (y*_{i,t} - φ_i(q̂_t) + J_i q̂_t).    (49)

Note: the product of a forward message with a task message corresponds to the classical optimal control for multiple regularized task variables (92),

    N(q_t | s_t, S_t) N[q_t | r_t, R_t] ∝ N(q_t | b, B)    (50)
    B⁻¹ = R_t + S⁻¹_t,    B⁻¹ b = r_t + S⁻¹_t s_t.    (51)
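The accumulation (49) is easy to code up. A sketch, where phis and jacs are stand-in callables for the kinematic maps φ_i and Jacobians J_i, and C_invs are the precision matrices C⁻¹_i:

    import numpy as np

    def task_potentials(q_hat, targets, phis, jacs, C_invs):
        # R_t = sum_i J_i' C_i^-1 J_i
        # r_t = sum_i J_i' C_i^-1 (y*_i - phi_i(q_hat) + J_i q_hat)
        n = q_hat.shape[0]
        R = np.zeros((n, n))
        r = np.zeros(n)
        for y_star, phi_i, jac_i, Ci in zip(targets, phis, jacs, C_invs):
            J = jac_i(q_hat)
            R += J.T @ Ci @ J
            r += J.T @ Ci @ (y_star - phi_i(q_hat) + J @ q_hat)
        return R, r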

2.5 Special case: pseudo-dynamic process

We replace x by the state pair (q_t, q̇_t) and assume u_t corresponds directly to accelerations:

    P(q_{t+1} | q_t, q̇_t) = N(q_{t+1} | q_t + τ q̇_t, W⁻¹),    P(q̇_{t+1} | q̇_t, u_t) = N(q̇_{t+1} | q̇_t + τ u_t, Q)    (52)

    (q_{t+1}, q̇_{t+1}) = A (q_t, q̇_t) + B u_t + a + ξ,    ⟨ξξᵀ⟩ = diag(W⁻¹, Q)    (53)

    A = [[I, τI], [0, I]],    B = [[0], [τI]],    a = 0.    (54)

Partition the value function of the combined state as V_{t+1} = [[V¹, V²], [V³, V⁴]] and v_{t+1} = (v¹, v²). The following equations might be computationally more efficient than the general Riccati recursion, but I am not sure:

    u*_t(x) = -(H + Bᵀ V_{t+1} B)⁻¹ Bᵀ (V_{t+1}(Ax + a) - v_{t+1})    (55)

    V_t = R + [[V¹, τV¹ + V²], [τV¹ + V³, τ²V¹ + τV² + τV³ + V⁴]]
            - τ² [[V²], [τV² + V⁴]] (H + τ²V⁴)⁻¹ [V³, τV³ + V⁴]    (56)

    v_t = r + (v¹, τv¹ + v²) - τ² [[V²], [τV² + V⁴]] (H + τ²V⁴)⁻¹ v²    (57)

    u*_t(x) = -τ (H + τ²V⁴)⁻¹ ( [V³, τV³ + V⁴] x - v² ).    (58)

2.6 Special case: dynamic process

We replace x by (q_t, q̇_t), now with dynamics q̈ = M⁻¹(u - F):

    P(q_{t+1} | q_t, q̇_t) = N(q_{t+1} | q_t + τ q̇_t, W⁻¹),    P(q̇_{t+1} | q̇_t, u_t) = N(q̇_{t+1} | q̇_t + τ M⁻¹(u_t - F), Q)    (59)

    (q_{t+1}, q̇_{t+1}) = A (q_t, q̇_t) + B u_t + a + ξ,    ⟨ξξᵀ⟩ = diag(W⁻¹, Q)    (60)

    A = [[I, τI], [0, I]],    B = [[0], [τM⁻¹]],    a = (0, -τM⁻¹F).    (61)

With the same partitioning of V_{t+1} and v_{t+1} as above, and the abbreviation (w¹, w²) := v_{t+1} - V_{t+1} a = (v¹ + τV²M⁻¹F, v² + τV⁴M⁻¹F), the following equations might again be computationally more efficient than the general Riccati recursion, but I am not sure:

    u*_t(x) = -(H + Bᵀ V_{t+1} B)⁻¹ Bᵀ (V_{t+1}(Ax + a) - v_{t+1})    (62)

    V_t = R + [[V¹, τV¹ + V²], [τV¹ + V³, τ²V¹ + τV² + τV³ + V⁴]]
            - τ² [[V²], [τV² + V⁴]] M⁻¹ (H + τ²M⁻¹V⁴M⁻¹)⁻¹ M⁻¹ [V³, τV³ + V⁴]    (63)

    v_t = r + (w¹, τw¹ + w²) - τ² [[V²], [τV² + V⁴]] M⁻¹ (H + τ²M⁻¹V⁴M⁻¹)⁻¹ M⁻¹ w²    (64)

    u*_t(x) = -τ (H + τ²M⁻¹V⁴M⁻¹)⁻¹ M⁻¹ ( [V³, τV³ + V⁴] x - τV⁴M⁻¹F - v² ).    (65)

2.7 Special case: multiple dynamic/kinematic task variables

As before we have access to kinematic functions φ_i(q) and Jacobians J_i(q). We are given task targets x*_{i,0:T} and ẋ*_{i,0:T} and want to follow them with (time-dependent) precisions ϱ_{i,t} and ν_{i,t}. We have

    c(q_t, q̇_t, u_t) = Σ_i ϱ_{i,t} ‖x*_{i,t} - φ_i(q_t)‖² + Σ_i ν_{i,t} ‖ẋ*_{i,t} - J_i q̇_t‖² + uᵀ_t H_t u_t    (66)
                     ≈ Σ_i ϱ_{i,t} [ qᵀ_t Jᵀ_i J_i q_t - 2 (x*_{i,t} - φ_i(q̂_t) + J_i q̂_t)ᵀ J_i q_t ]
                       + Σ_i ν_{i,t} [ q̇ᵀ_t Jᵀ_i J_i q̇_t - 2 (ẋ*_{i,t})ᵀ J_i q̇_t ] + uᵀ_t H_t u_t + const,

where J_i = J_i(q̂_t) and q̂_t is the point of linearization, so that

    R_t = [[Σ_i ϱ_{i,t} Jᵀ_i J_i, 0], [0, Σ_i ν_{i,t} Jᵀ_i J_i]]    (67)

    r_t = ( Σ_i ϱ_{i,t} Jᵀ_i (x*_{i,t} - φ_i(q̂_t) + J_i q̂_t),  Σ_i ν_{i,t} Jᵀ_i ẋ*_{i,t} ).    (68)
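Constructing the block matrices (54) and (61) in code makes the reduction to the general LQG case explicit; this sketch produces inputs that can be fed directly into the generic backward recursion sketched in section 2.1:

    import numpy as np

    def pseudo_dynamic_blocks(n, tau):
        # A, B, a of (54) for the combined state (q, qdot)
        I = np.eye(n)
        A = np.block([[I, tau * I], [np.zeros((n, n)), I]])
        B = np.vstack([np.zeros((n, n)), tau * I])
        a = np.zeros(2 * n)
        return A, B, a

    def dynamic_blocks(M, F, tau):
        # A, B, a of (61) with mass matrix M and force vector F
        n = M.shape[0]
        I, Minv = np.eye(n), np.linalg.inv(M)
        A = np.block([[I, tau * I], [np.zeros((n, n)), I]])
        B = np.vstack([np.zeros((n, n)), tau * Minv])
        a = np.concatenate([np.zeros(n), -tau * Minv @ F])
        return A, B, a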

3 The optimization view on classical control laws

In this section we first review classical control laws as the minimization of a basic loss function. Since this loss function has a Bayesian interpretation, it will be straight-forward to also develop a Bayesian view on these control laws. The Bayesian inference approach can then be generalized to what we actually aim for: motion planning in temporal probabilistic models.

3.1 General quadratic loss function and constraints

Let y ∈ R^d and q ∈ R^n. Given y, consider the problem of finding q that minimizes

    L = ‖y - Jq‖²_{C⁻¹} + ‖q‖²_W - 2 hᵀ W q,    (69)

where ‖q‖²_W = qᵀWq denotes a squared norm, and C and W are symmetric positive definite matrices. This loss function can be interpreted as follows: the first term measures how well a constraint y = Jq is fulfilled relative to a covariance matrix C, the second term measures the magnitude of q with metric W, the third term measures the scalar product between q and h w.r.t. W. The solution can easily be found by taking the derivative

    ∂L/∂q = 2 qᵀ W - 2 (y - Jq)ᵀ C⁻¹ J - 2 hᵀ W
    ∂L/∂q = 0  ⟹  q = (Jᵀ C⁻¹ J + W)⁻¹ (Jᵀ C⁻¹ y + W h).

Using the Woodbury and related identities and definitions as given in section 1, we can rewrite the solution in several forms:

    q = (Jᵀ C⁻¹ J + W)⁻¹ (Jᵀ C⁻¹ y + W h)    (70)
      = (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ y + [I_n - (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ J] h    (71)
      = (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ (y - Jh) + h    (72)
      = W⁻¹ Jᵀ (J W⁻¹ Jᵀ + C)⁻¹ y + [I_n - W⁻¹ Jᵀ (J W⁻¹ Jᵀ + C)⁻¹ J] h    (73)
      = W⁻¹ Jᵀ (J W⁻¹ Jᵀ + C)⁻¹ (y - Jh) + h.    (74)

This also allows us to properly derive the following limits:

    C → 0:      q = J♯_W y + (I_n - J♯_W J) h = J♯_W (y - Jh) + h    (75)
    W → 0:      q = J♯_C y + (I_n - J♯_C J) h = J♯_C (y - Jh) + h
    W = λ I_n:  q = Jᵀ (J Jᵀ + λC)⁻¹ y + [I_n - Jᵀ (J Jᵀ + λC)⁻¹ J] h    (76)
    C = σ I_d:  q = (Jᵀ J + σW)⁻¹ Jᵀ y + [I_n - (Jᵀ J + σW)⁻¹ Jᵀ J] h.

These limits can be interpreted as follows. C → 0: we need to fulfill the constraint y = Jq exactly. C = σ I_d: we use a standard squared error measure for y - Jq. W → 0: we do not care about the norm ‖q‖_W (i.e., no regularization); but interestingly, the cost term -2hᵀWq has a nullspace effect also in this limit. W = λ I_n: we use a standard ridge as regularizer.

The first of these limits is perhaps the most important. It corresponds to a hard constraint; that is, (75) is the solution to

    argmin_q ‖q‖²_W - 2 hᵀ W q    such that  y = Jq.    (77)

The loss function (69) has many applications, as we discuss in the following.
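A quick numerical check, with random positive definite C and W, that the equivalent forms of the solution indeed coincide (shown here for (70), (72), (74); a throwaway sketch):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 6, 3
    J, y, h = rng.normal(size=(d, n)), rng.normal(size=d), rng.normal(size=n)
    Lc, Lw = rng.normal(size=(d, d)), rng.normal(size=(n, n))
    C, W = Lc @ Lc.T + np.eye(d), Lw @ Lw.T + np.eye(n)   # random s.p.d. matrices
    inv = np.linalg.inv

    q70 = inv(J.T @ inv(C) @ J + W) @ (J.T @ inv(C) @ y + W @ h)
    q72 = inv(J.T @ inv(C) @ J + W) @ J.T @ inv(C) @ (y - J @ h) + h
    q74 = inv(W) @ J.T @ inv(J @ inv(W) @ J.T + C) @ (y - J @ h) + h
    assert np.allclose(q70, q72) and np.allclose(q70, q74)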

3.2 Ridge regression

Let us first give an off-topic example from machine learning: In ridge regression, when we have d samples of n-dimensional inputs and 1D outputs, we have the minimization problem

    L = ‖y - Xβ‖² + λ ‖β‖²

with an input data matrix X ∈ R^{d×n}, an output data vector y ∈ R^d, and a regressor β ∈ R^n. The first term measures the standard squared error (with uniform output covariance C = I_d), the second is a regularizer (or stabilizer) as introduced by Tikhonov. The special form λ‖β‖² of the regularizer is called ridge. The solution is given by equation (76) when replacing the notation according to q → β, y → y, J → X, C → I_d, W → λI_n, h → 0:

    β = Xᵀ (X Xᵀ + λ I_d)⁻¹ y.

In the Bayesian interpretation of ridge regression, the ridge λ‖β‖² defines a prior ∝ exp{ -(λ/2) ‖β‖² } over the regressor β, and the above equation gives the MAP β. Since ridge regression has a Bayesian interpretation, standard motion rate control, as discussed shortly, will also have a Bayesian interpretation.

3.3 Motion rate control (pseudo-inverse kinematics)

Consider a robot with n DoFs and a d-dimensional task space with d < n (e.g., an endeffector state). The current joint state is q ∈ R^n. In a given state we can compute the end-effector Jacobian J, and we are given a joint space potential H(q). We would like to compute joint velocities q̇ which fulfill the task constraint ẏ = J q̇ while minimizing the absolute joint velocity ‖q̇‖_W and following the negative gradient h = -W⁻¹ (∂H/∂q)ᵀ. In summary, the problem and its solution are

    (problem)   q̇ = argmin_{q̇} ‖q̇‖²_W + 2 (∂H/∂q) q̇    such that  J q̇ = ẏ    (78)
    (solution)  q̇ = J♯_W ẏ - (I_n - J♯_W J) W⁻¹ (∂H/∂q)ᵀ.    (79)

The solution was taken from (75) by replacing the notation according to q → q̇, y → ẏ. Note that we have derived pseudo-inverse kinematics from a basic constrained quadratic optimization problem.

Let us repeat this briefly for the case when time is discretized. We can formulate the problem as

    q_t = argmin_{q_t} ‖q_t - q_{t-1}‖²_W - 2 hᵀ W q_t    such that  φ(q_t) = y*_t.    (80)

Generally, the constraint φ(q_t) = y*_t is non-linear. We linearize it at q_{t-1} and get the simpler problem and its solution

    (problem)   q_t = argmin_{q_t} ‖q_t - q_{t-1}‖²_W - 2 hᵀ W q_t    such that  J (q_t - q_{t-1}) = y*_t - φ(q_{t-1})    (81)
    (solution)  q_t = q_{t-1} + J♯_W [y*_t - φ(q_{t-1})] + (I_n - J♯_W J) h.    (82)

The solution was taken from (75) by replacing the notation according to q → (q_t - q_{t-1}), y → (y*_t - φ(q_{t-1})).

3.4 Regularized inverse kinematics (singularity robust motion rate control)

Under some conditions motion rate control is infeasible, for instance when the arm cannot be further stretched to reach a desired endeffector position. In this case the computation of the pseudo-inverse J♯_W becomes singular. Classical control developed the singularity robust pseudo-inverse (Nakamura & Hanafusa, 1986), which can be interpreted as regularizing the computation of the pseudo-inverse, or as relaxing the hard task constraint. In our framework this corresponds to not taking the limit C → 0. This regularized inverse kinematics is given as

    (problem)   q̇ = argmin_{q̇} ‖q̇‖²_W + ‖ẏ - J q̇‖²_{C⁻¹} - 2 hᵀ W q̇    (83)
    (solution)  q̇ = Ĵ♯_{W,C} ẏ + (I_n - Ĵ♯_{W,C} J) h,    (84)
                Ĵ♯_{W,C} := (Jᵀ C⁻¹ J + W)⁻¹ Jᵀ C⁻¹ = W⁻¹ Jᵀ (J W⁻¹ Jᵀ + C)⁻¹.    (85)

The solution was taken from (71) by replacing the notation according to q → q̇, y → ẏ. Note that Ĵ♯_{W,C} is a regularization of J♯_W (defined in (11)). Equations (71-74) give many interesting alternatives to write this control law. The linearized (φ̂ is the linearization of φ at q_{t-1}, as above) time-discretized version is

    (problem)   q_t = argmin_{q_t} ‖q_t - q_{t-1}‖²_W + ‖y*_t - φ̂(q_t)‖²_{C⁻¹} - 2 hᵀ W q_t    (86)
    (solution)  q_t = q_{t-1} + Ĵ♯_{W,C} [y*_t - φ(q_{t-1})] + (I_n - Ĵ♯_{W,C} J) h.    (87)
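One discretized motion rate control step (82) in code; a minimal sketch in which phi and jac are stand-ins for the task map and its Jacobian:

    import numpy as np

    def mrc_step(q_prev, y_target, phi, jac, W, h):
        # q_t = q_{t-1} + J#_W (y*_t - phi(q_{t-1})) + (I - J#_W J) h
        J = jac(q_prev)
        Winv = np.linalg.inv(W)
        J_sharp = Winv @ J.T @ np.linalg.inv(J @ Winv @ J.T)   # pseudo-inverse (11)
        n = q_prev.shape[0]
        return (q_prev + J_sharp @ (y_target - phi(q_prev))
                + (np.eye(n) - J_sharp @ J) @ h)

Replacing J_sharp by the regularized Ĵ♯_{W,C} of (85) turns the same step into the singularity robust version (87).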

3.5 Multiple regularized task variables

Assume we have m task variables y_1, .., y_m, where the i-th variable y_i ∈ R^{d_i} is d_i-dimensional. Also assume that we regularize w.r.t. each task variable, that is, we have different error metrics C⁻¹_i in each task space. We want to follow all of the tasks and express this as the optimization problem and its solution

    (problem)   q̇ = argmin_{q̇} ‖q̇‖²_W + Σ_i ‖ẏ_i - J_i q̇‖²_{C⁻¹_i} - 2 hᵀ W q̇    (88)
    (solution)  q̇ = (Σ_i Jᵀ_i C⁻¹_i J_i + W)⁻¹ [ Σ_i Jᵀ_i C⁻¹_i ẏ_i + W h ].    (89)

The solution was taken from (70) in the following way: we can collect all task variables into one big task vector

    y = (y_1, .., y_m),    J = (J_1; ..; J_m),    C = diag(C_1, .., C_m)    (90)
    Jᵀ C⁻¹ J = Σ_i Jᵀ_i C⁻¹_i J_i,    Jᵀ C⁻¹ ẏ = Σ_i Jᵀ_i C⁻¹_i ẏ_i.

And the linearized time-discretized version:

    (problem)   q_t = argmin_{q_t} ‖q_t - q_{t-1}‖²_W + Σ_i ‖y*_{i,t} - φ̂_i(q_t)‖²_{C⁻¹_i} - 2 hᵀ W q_t    (91)
    (solution)  q_t = q_{t-1} + (Σ_i Jᵀ_i C⁻¹_i J_i + W)⁻¹ [ Σ_i Jᵀ_i C⁻¹_i (y*_{i,t} - φ_i(q_{t-1})) + W h ].    (92)

3.6 Multiple prioritized task variables (prioritized inverse kinematics)

The case of multiple task variables is classically not addressed by regularizing all of them, but by imposing a hierarchy on them (Nakamura et al., 1987; Baerlocher & Boulic, 2004). Let us first explain the classical prioritized inverse kinematics: the control law is based on standard motion rate control, but iteratively projects each desired task rate ẏ_i into the remaining nullspace of all higher-level control signals. Initializing the nullspace projection with N_0 = I and q̇_0 = 0, the control law is defined by iterating, for i = 1, .., m,

    Ĵ_i = J_i N_{i-1},    q̇_i = q̇_{i-1} + Ĵ♯_i (ẏ_i - J_i q̇_{i-1}),    N_i = N_{i-1} - Ĵ♯_i Ĵ_i.    (93)

We call Ĵ_i a nullspace Jacobian; it has the property that Ĵ♯_i projects to changes in q that do not change the control variables y_{1,..,i-1} with higher priority. An additional nullspace movement h in the remaining nullspace of all control signals can also be included when defining the final control law as

    q̇ = q̇_m + N_m h.    (94)

In effect, the first task rate ẏ_1 is guaranteed to be fulfilled exactly. The second, ẏ_2, is guaranteed to be fulfilled as well as possible given that ẏ_1 must be fulfilled, et cetera.

This hierarchical projection of tasks can also be derived by starting with the regularized task variables as in problem (88) and then iteratively taking the limit C_i → 0, starting with i = 1 up to i = m. More formally, the iterative limit corresponds to C_i = ε^{m-i} I_{d_i} and ε → 0. For m = 2 task variables one can prove the equivalence between prioritized inverse kinematics and this hierarchical limit of the MAP motion exactly (by directly applying the Woodbury identity). For m > 2 we could not find an elegant proof, but we numerically confirmed this limit for up to m = 4.

Non-zero task variances C_i can again be interpreted as regularizers. Note that without regularizers the standard prioritized inverse kinematics is numerically brittle: handling many control signals (e.g., the over-determined case Σ_i d_i > n) is problematic, since the nullspace-projected Jacobians will become singular (with rank < d_i). For non-zero regularizations C_i the computations in equation (92) are numerically robust.
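The iteration (93), (94) as code. This sketch uses the Moore-Penrose pseudo-inverse, i.e. the special case W = I of J♯_W; as noted above, the unregularized iteration is numerically brittle when the projected Jacobians lose rank:

    import numpy as np

    def prioritized_ik(ydot_targets, jacobians, n, h):
        N = np.eye(n)                       # nullspace projection N_0 = I
        qdot = np.zeros(n)                  # qdot_0 = 0
        for ydot_i, J_i in zip(ydot_targets, jacobians):
            J_hat = J_i @ N                               # nullspace Jacobian (93)
            J_hat_sharp = np.linalg.pinv(J_hat)           # SVD-based pseudo-inverse
            qdot = qdot + J_hat_sharp @ (ydot_i - J_i @ qdot)
            N = N - J_hat_sharp @ J_hat
        return qdot + N @ h                 # final control law (94)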

3.7 Optimal dynamic control (incl. operational space control)

Consider a robot with dynamics q̈ = M⁻¹(u - F), where M is some generalized mass matrix, F subsumes external (also Coriolis and gravitational) forces, and u is the n-dimensional torque control signal. We want to compute a control signal u which generates an acceleration q̈ such that a general task constraint ÿ = J q̈ + J̇ q̇ remains fulfilled, while also minimizing the absolute norm ‖u‖_H of the control. The problem and its solution can be written as

    (problem)   u = argmin_u ‖u‖²_H - 2 hᵀ H u    such that  ÿ - J̇ q̇ - J M⁻¹ (u - F) = 0    (95)
    (solution)  u = T♯_H (ÿ - J̇ q̇ + T F) + (I_n - T♯_H T) h,    with T = J M⁻¹.    (96)

The solution was taken from (75) by replacing the notation according to q → u, y → (ÿ - J̇ q̇ + J M⁻¹ F), J → J M⁻¹. For h = 0 this solution is identical to Theorem 1 in (Peters et al., 2005). Peters et al. discuss in detail stability issues and important special cases of this control scheme. A common special case is H = M⁻¹, which is called operational space control.

References

Baerlocher, P., & Boulic, R. (2004). An inverse kinematic architecture enforcing an arbitrary number of strict priority levels. The Visual Computer.

Nakamura, Y., & Hanafusa, H. (1986). Inverse kinematic solutions with singularity robustness for robot manipulator control. Journal of Dynamic Systems, Measurement and Control, 108.

Nakamura, Y., Hanafusa, H., & Yoshikawa, T. (1987). Task-priority based redundancy control of robot manipulators. Int. Journal of Robotics Research, 6.

Peters, J., Mistry, M., Udwadia, F. E., Cory, R., Nakanishi, J., & Schaal, S. (2005). A unifying framework for the control of robotics systems. IEEE Int. Conf. on Intelligent Robots and Systems (IROS 2005).
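To close, the control law (96) as a small numerical sketch under the conventions of section 3.7 (all quantities at the current state are inputs); passing H = inv(M) gives operational space control:

    import numpy as np

    def dynamic_control(M, F, J, Jdot, qdot, y_ddot_target, H, h):
        # u = T#_H (yddot* - Jdot qdot + T F) + (I - T#_H T) h,  T = J M^-1
        inv = np.linalg.inv
        T = J @ inv(M)
        T_sharp = inv(H) @ T.T @ inv(T @ inv(H) @ T.T)    # T#_H, the C -> 0 limit
        n = M.shape[0]
        u = T_sharp @ (y_ddot_target - Jdot @ qdot + T @ F)
        return u + (np.eye(n) - T_sharp @ T) @ h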
