Optimal Control of Partially Observable Piecewise Deterministic Markov Processes
1 Optimal Control of Partially Observable Piecewise Deterministic Markov Processes
Nicole Bäuerle, based on joint work with D. Lange. Wien, April 2018
2 Outline
- Motivation
- PDMP with Partial Observation
- Controlled PDMP with Partial Observation
- Reduction to a Discrete-Time Problem
- The Filter
- Existence Results
- Application
3 Motivation
[Figure: a typical path of a PDMP — the process state drifts deterministically between the 1st and 2nd jump times while a running cost accumulates.]
4-7 Ingredients of a PDMP
In order to define a PDMP we need the following data:
- The drift $\Phi : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ is continuous and satisfies for all $y \in \mathbb{R}^d$ and $s, t > 0$: $\Phi(y, t+s) = \Phi(\Phi(y, s), t)$.
- The jump times $0 =: T_0 < T_1 < \dots$ are $\mathbb{R}_+$-valued random variables with sojourn times $S_n := T_n - T_{n-1}$, $n \in \mathbb{N}$, $S_0 := 0$. They are generated by an intensity $\lambda : \mathbb{R}^d \to (0, \infty)$.
- A transition kernel $Q$ from $\mathbb{R}^d$ to $\mathbb{R}^d$ gives the probability $Q(B \mid y)$ that the process jumps into $B$ given the pre-jump state $y$.
The PDMP is then defined by $Y_t := \Phi(Y_{T_n}, t - T_n)$ for $T_n \le t < T_{n+1}$, $n \in \mathbb{N}_0$.
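The three ingredients above can be exercised in a short simulation sketch. Everything model-specific here is an illustrative assumption, not the model from the talk: a linear flow $\Phi(y, t) = y + vt$, a constant jump intensity $\lambda$ (so sojourn times are exponential), and a jump kernel that halves the pre-jump state.

```python
import random

def simulate_pdmp(y0, T, v=1.0, lam=0.5, seed=0):
    """Simulate one PDMP path on [0, T] with linear flow Phi(y, t) = y + v*t,
    constant jump intensity lam (hence Exp(lam) sojourn times), and an
    illustrative jump kernel Q that halves the pre-jump state."""
    rng = random.Random(seed)
    t, y = 0.0, y0
    jump_times, post_jump_states = [0.0], [y0]
    while True:
        s = rng.expovariate(lam)      # next sojourn time S_n
        if t + s > T:
            break                     # no further jump before the horizon
        t += s
        y = 0.5 * (y + v * s)         # flow to Phi(y, s), then jump: y' = Phi(y, s) / 2
        jump_times.append(t)
        post_jump_states.append(y)
    return jump_times, post_jump_states

times, states = simulate_pdmp(y0=1.0, T=10.0)
```

Between the recorded jump times the path is recovered deterministically from the flow, which is exactly the defining identity $Y_t = \Phi(Y_{T_n}, t - T_n)$.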
8 A PDMP with Partial Observation
Let $(\varepsilon_n)_{n \in \mathbb{N}}$ be iid $\mathbb{R}^d$-valued random variables. We assume that the controller is only able to observe $X_n := Y_{T_n} + \varepsilon_n$. Define $\Lambda(y, t) := \int_0^t \lambda(\Phi(y, s))\, ds$. Then
$$\mathbb{P}_{x,y}\big(S_n \le t, Y_{T_n} \in C, X_n \in D \mid S_0, Y_{T_0}, X_0, \dots, S_{n-1}, Y_{T_{n-1}}, X_{n-1}\big) = \int_0^t e^{-\Lambda(Y_{T_{n-1}}, s)}\, \lambda\big(\Phi(Y_{T_{n-1}}, s)\big) \int_C Q^\varepsilon(D - y')\, Q\big(dy' \mid \Phi(Y_{T_{n-1}}, s)\big)\, ds.$$
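The observation model can be sketched in two lines. Gaussian noise is only one admissible choice for the noise distribution $Q^\varepsilon$; the slides leave it general.

```python
import random

def observe(post_jump_states, sigma=0.3, seed=1):
    """The controller never sees Y_{T_n} itself, only X_n = Y_{T_n} + eps_n.
    Here eps_n is iid Gaussian noise -- an illustrative choice of Q^eps."""
    rng = random.Random(seed)
    return [y + rng.gauss(0.0, sigma) for y in post_jump_states]

observations = observe([1.0, -0.5, 2.0])
```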
9-13 Policies for POPDMP
- Consider the marked point process $(T_n, \hat Y_n := Y_{T_n}, X_n)$.
- Relaxed controls on the action space $A$: $\mathcal{R} := \{r : [0, \infty) \to \mathcal{P}(A) \mid r \text{ measurable}\}$.
- Observable histories: $H_0 := \mathbb{R}^d$ and for $n \in \mathbb{N}$, $H_n := H_{n-1} \times \mathcal{R} \times \mathbb{R}_+ \times \mathbb{R}^d$, with elements $h_n = (x_0, r_0, s_1, x_1, \dots, r_{n-1}, s_n, x_n) \in H_n$.
- A discrete-time history-dependent relaxed control policy is $\pi^D := (\pi^D_0, \pi^D_1, \dots)$ where $\pi^D_n : H_n \to \mathcal{R}$ is measurable.
- Continuous control: $\pi_t := \sum_{n=0}^\infty 1_{[T_n, T_{n+1})}(t)\, \pi^D_n(H_n)(t - T_n)$.
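A relaxed control attaches to every time point a probability distribution over actions rather than a single action. A minimal sketch on a two-point action set (the concrete control and weights are illustrative):

```python
def relaxed_control(t):
    """A relaxed control r on A = {-1, +1}: for t <= 0.5 all mass sits on
    action -1, afterwards the two actions are mixed evenly (illustrative)."""
    return {-1.0: 1.0, 1.0: 0.0} if t <= 0.5 else {-1.0: 0.5, 1.0: 0.5}

def integrate_action(f, r_t):
    """Compute  int_A f(a) r_t(da)  for a finite-support measure r_t,
    the kind of integral that appears in all controlled quantities below."""
    return sum(f(a) * w for a, w in r_t.items())

mean_action = integrate_action(lambda a: a, relaxed_control(1.0))
```

A deterministic control is the special case where every $r_t$ is a Dirac measure.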
14-17 Controlled Data
We assume that the policy influences the drift, the intensity and the transition kernel:
- Drift $\Phi : \mathcal{R} \times \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$, notation $\Phi^r(y, t)$.
- Jump rate $\lambda^A : \mathbb{R}^d \times A \to (0, \infty)$.
- Jump kernel $Q^A$ from $\mathbb{R}^d \times A$ to $\mathbb{R}^d$.
18 Controlled Process
A controlled partially observable piecewise deterministic Markov process with local characteristics $(\Phi^r, \lambda^A, Q^A, Q^\varepsilon)$ satisfies
$$\mathbb{P}^{\pi^D}_{x,y}\big(S_n \le t, \hat Y_n \in C, X_n \in D \mid S_0, \dots, S_{n-1}, \hat Y_{n-1}, X_{n-1}, \pi^D_{n-1}\big) = \int_0^t e^{-\Lambda^{\pi^D_{n-1}}(\hat Y_{n-1}, s)} \int_A \lambda^A\big(\Phi^{\pi^D_{n-1}}(\hat Y_{n-1}, s), a\big) \int_C Q^\varepsilon(D - y')\, Q^A\big(dy' \mid \Phi^{\pi^D_{n-1}}(\hat Y_{n-1}, s), a\big)\, \pi^D_{n-1}(H_{n-1}, s)(da)\, ds,$$
where $\Lambda^r(y, t) := \int_0^t \int_A \lambda^A(\Phi^r(y, s), a)\, r_s(da)\, ds$.
19-21 The Optimization Problem
For a fixed policy $\pi^D$ we define the cost as
$$J(x, \pi^D) := \int \mathbb{E}^{\pi^D}_{x,y}\left[\int_0^\infty e^{-\beta t} \int_A c(Y_t, a)\, \pi_t(da)\, dt\right] Q_0(dy \mid x).$$
The value function is defined as $J(x) := \inf_{\pi^D} J(x, \pi^D)$ for all $x \in \mathbb{R}^d$. A policy $\pi^*$ is optimal if $J(x) = J(x, \pi^*)$ for all $x \in \mathbb{R}^d$.
22-26 A discrete-time POMDP
We define a POMDP as follows:
- State space $\mathbb{R}_+ \times \mathbb{R}^{2d} \ni (s, y, x)$.
- Action space $\mathcal{R} \ni r$.
- Transition law
$$\tilde Q\big([0, t] \times C \times D \mid y, r\big) = \int_0^t e^{-\Gamma^r(y, u)} \int_A \lambda^A\big(\Phi^r(y, u), a\big) \int_C Q^\varepsilon(D - y')\, Q^A\big(dy' \mid \Phi^r(y, u), a\big)\, r_u(da)\, du,$$
with $\Gamma^r(y, t) := \beta t + \int_0^t \int_A \lambda^A(\Phi^r(y, u), a)\, r_u(da)\, du$.
- The one-stage cost $g(y, r) = \int_0^\infty e^{-\Gamma^r(y, t)} \int_A c(\Phi^r(y, t), a)\, r_t(da)\, dt$.
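The one-stage cost $g$ is an ordinary improper integral once a control is fixed, so it can be approximated by quadrature. A sketch for the Dirac control $r_t \equiv \delta_a$ under illustrative model choices (linear flow, constant rate $\lambda$, running cost $c(y, a) = y^2$), where $\Gamma(y, t) = (\beta + \lambda)t$:

```python
import math

def one_stage_cost(y, a, beta=1.0, lam=1.0, v=1.0, T=20.0, n=20000):
    """Trapezoid-rule sketch of g(y, delta_a) for the constant control
    r_t = delta_a, with flow Phi^a(y, t) = y + a*v*t, constant rate lam
    and cost c(y, a) = y**2 (all model choices illustrative).  The exact
    value is  int_0^inf exp(-(beta+lam)*t) * (y + a*v*t)**2 dt,
    truncated here at T where the discount factor is negligible."""
    h = T / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        w = 0.5 if i in (0, n) else 1.0    # trapezoid end-point weights
        total += w * math.exp(-(beta + lam) * t) * (y + a * v * t) ** 2
    return total * h

g_const = one_stage_cost(1.0, 0.0)   # exact value: 1 / (beta + lam) = 0.5
```

For a genuinely relaxed control one would additionally average the integrand over $r_t(da)$ at each quadrature node.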
27-29 The Problem for the discrete-time POMDP
For a fixed policy $\pi^D$ we define the cost as
$$\tilde J(x, \pi^D) := \tilde{\mathbb{E}}^{\pi^D}_x\left[\sum_{k=0}^\infty g\big(\hat Y_k, \pi^D_k(H_k)\big)\right].$$
The value function is defined as $\tilde J(x) := \inf_{\pi^D} \tilde J(x, \pi^D)$, $x \in \mathbb{R}^d$. A policy $\pi^*$ is optimal if $\tilde J(x) = \tilde J(x, \pi^*)$, $x \in \mathbb{R}^d$.
30 Equivalence of the two Problems
Theorem. Let $x \in \mathbb{R}^d$ be an initial observation and $\pi^D$ a policy. Then it holds that $J(x, \pi^D) = \tilde J(x, \pi^D)$.
31-35 Assumptions
(C1) The action space $A$ is a compact metric space.
(C2) $\lambda^A : \mathbb{R}^d \times A \to (0, \infty)$ is continuous and $0 < \underline{\lambda} \le \lambda^A \le \overline{\lambda}$.
(C3) $Q^A$ is weakly continuous, i.e. $(x, a) \mapsto \int v(z)\, Q^A(dz \mid x, a)$ is continuous and bounded for all $v : \mathbb{R}^d \to \mathbb{R}$ continuous and bounded.
(B1) There exists a finite set $E_0 = \{y_1, \dots, y_d\} \subset \mathbb{R}^d$ such that for all $y \in \mathbb{R}^d$ and $a \in A$ we have $Q^A(E_0 \mid y, a) = 1$ and $Q_0(E_0) = 1$.
(B2) $Q^\varepsilon$ has a bounded density $f^\varepsilon$ with respect to some $\sigma$-finite measure $\nu$.
36-37 The Filter
Assumption (B2) implies that the transition law has the density
$$q(s, y', x \mid y, r) = e^{-\Gamma^r(y, s)}\, f^\varepsilon(x - y') \int_A \lambda^A\big(\Phi^r(y, s), a\big)\, Q^A\big(y' \mid \Phi^r(y, s), a\big)\, r_s(da).$$
Define for $y' \in E_0$ the updating operator
$$\Psi(\rho, r, s, x)(y') := \frac{\sum_{y \in E_0} q(s, y', x \mid y, r)\, \rho(y)}{\sum_{\hat y \in E_0} \sum_{y \in E_0} q(s, \hat y, x \mid y, r)\, \rho(y)}$$
and for $h_n = (x_0, r_0, s_1, x_1, \dots, r_{n-1}, s_n, x_n)$:
$$\mu_0(\cdot \mid x_0) := Q_0(\cdot \mid x_0), \qquad \mu_n(\cdot \mid h_n) = \mu_n(\cdot \mid h_{n-1}, r_{n-1}, s_n, x_n) := \Psi\big(\mu_{n-1}(\cdot \mid h_{n-1}), r_{n-1}, s_n, x_n\big).$$
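Because $E_0$ is finite, the updating operator $\Psi$ is a finite Bayes step: weight each candidate post-jump state by its likelihood under the observed sojourn time and signal, then normalize. A sketch, where the concrete density `toy_q` is an illustrative stand-in for the model-specific $q$ above:

```python
import math

def filter_update(rho, s, x, q):
    """One step of the updating operator Psi: given the prior rho over the
    finite set E_0, an observed sojourn time s and noisy signal x, return
    the posterior over the post-jump state.  q(s, y_new, x, y_old) plays
    the role of the transition density from the slide."""
    unnorm = {y_new: sum(q(s, y_new, x, y_old) * p for y_old, p in rho.items())
              for y_new in rho}
    z = sum(unnorm.values())            # normalizing constant (denominator of Psi)
    return {y: u / z for y, u in unnorm.items()}

def toy_q(s, y_new, x, y_old):
    """Illustrative density: Exp(1) sojourn, uniform jump target on three
    points, Gaussian-shaped signal likelihood.  The real q also depends on
    y_old through the flow; this toy version does not."""
    return math.exp(-s) * (1.0 / 3.0) * math.exp(-0.5 * (x - y_new) ** 2)

prior = {-2.0: 1 / 3, 0.0: 1 / 3, 2.0: 1 / 3}
post = filter_update(prior, s=1.0, x=1.9, q=toy_q)
```

With an observation near 2 the posterior concentrates on the state 2, exactly the behavior the filter theorem on the next slide formalizes.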
38 The Filter
Theorem. It holds that
$$\mathbb{P}^{\pi^D}_x\big(\hat Y_n \in C \mid X_0, R_0, S_1, X_1, \dots, R_{n-1}, S_n, X_n\big) = \mu_n(C \mid X_0, R_0, S_1, X_1, \dots, R_{n-1}, S_n, X_n).$$
39-43 Filtered MDP
We define an MDP as follows:
- State space $\mathcal{P}(E_0) \ni \rho$.
- Action space $\mathcal{R} \ni r$.
- Transition law
$$\hat Q(B \mid \rho, r) = \sum_{y \in E_0} \int_0^\infty \int 1_B\big(\Psi(\rho, r, s, x)\big)\, q^{SX}(s, x \mid y, r)\, \nu(dx)\, ds\, \rho(y),$$
where $q^{SX}(s, x \mid y, r) := \sum_{y' \in E_0} q(s, y', x \mid y, r)$.
- The one-stage cost $\hat g(\rho, r) := \sum_{y \in E_0} g(y, r)\, \rho(y)$.
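The one-stage cost of the filtered MDP is just the belief-weighted average of the underlying costs. A minimal sketch (the belief and the cost function $g(y, \cdot)$ used here are illustrative):

```python
def g_hat(rho, g, r):
    """One-stage cost of the filtered MDP: the belief-weighted average
    g_hat(rho, r) = sum_{y in E_0} g(y, r) * rho(y)."""
    return sum(g(y, r) * p for y, p in rho.items())

belief = {-2.0: 0.25, 0.0: 0.5, 2.0: 0.25}
val = g_hat(belief, lambda y, r: y * y, r=None)   # 0.25*4 + 0.5*0 + 0.25*4 = 2.0
```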
44-46 Filtered MDP
A policy $\pi = (f_0, f_1, \dots)$ with $f_n : \mathcal{P}(E_0) \to \mathcal{R}$ can be seen as a special policy $\pi^D \in \Pi^D$ by setting $\pi^D_n(h_n) := f_n(\mu_n(\cdot \mid h_n))$. We denote the cost of policy $\pi$ by
$$V(\rho, \pi) := \hat{\mathbb{E}}^\pi_\rho\left[\sum_{n=0}^\infty \hat g\big(\mu_n, f_n(\mu_n)\big)\right].$$
The value function is defined as $V(\rho) := \inf_{\pi \in \Pi} V(\rho, \pi)$ for all $\rho \in \mathcal{P}(E_0)$.
47 Equivalence of the two Problems
Theorem. Let $x \in \mathbb{R}^d$ be an initial observation and let $\pi$ and $\pi^D$ be given as on the previous slide. Then it holds that $V(Q_0(\cdot \mid x), \pi) = J(x, \pi^D)$.
48-49 Further Assumptions
(C4) $r \mapsto \Phi^r(y, t)$ is continuous for all $y \in E_0$, $t \ge 0$.
(C5) $c : \mathbb{R}^d \times A \to \mathbb{R}_+$ is lower semicontinuous with respect to the product topology.
50-51 Regularization of the Filter
Let $h_\sigma : \mathbb{R} \to \mathbb{R}$, $\sigma > 0$, be a regularization kernel, i.e.
(i) $h_\sigma(t) \ge 0$ for all $t \in \mathbb{R}$,
(ii) $\int_{\mathbb{R}} h_\sigma(t)\, dt = 1$,
(iii) $\lim_{\sigma \to 0} \int_{-a}^{a} h_\sigma(t)\, dt = 1$ for all $a > 0$.
We use a regularized filter of the form
$$\hat\Psi(\rho, r, s, x)(y') := \frac{\int_{\mathbb{R}} \sum_{y \in E_0} q(u, y', x \mid y, r)\, \rho(y)\, h_\sigma(s - u)\, du}{\int_{\mathbb{R}} \sum_{\hat y \in E_0} \sum_{y \in E_0} q(u, \hat y, x \mid y, r)\, \rho(y)\, h_\sigma(s - u)\, du}.$$
Note that we have $\lim_{\sigma \to 0} \hat\Psi = \Psi$.
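The three kernel properties are easy to check numerically for a concrete choice. A sketch using a box kernel, uniform on $[-\sigma, \sigma]$ (one valid regularization kernel among many):

```python
def h_box(t, sigma):
    """Box regularization kernel: the uniform density on [-sigma, sigma].
    It is nonnegative (i) and has unit mass (ii) by construction."""
    return 1.0 / (2.0 * sigma) if -sigma <= t <= sigma else 0.0

def mass_on(a, sigma, n=100000):
    """Midpoint-rule approximation of  int_{-a}^{a} h_sigma(t) dt,
    used to check property (iii): the mass in a fixed window [-a, a]
    tends to 1 as sigma -> 0."""
    h = 2.0 * a / n
    return sum(h_box(-a + (i + 0.5) * h, sigma) for i in range(n)) * h

# For the window a = 1 the captured mass grows to 1 as sigma shrinks.
masses = [mass_on(1.0, s) for s in (2.0, 1.0, 0.5)]
```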
52 Solution of the Problem
Theorem. Under all assumptions we have for the regularized version of the problem:
a) $V_\sigma(\rho) = \inf_{r \in \mathcal{R}} \left\{\hat g(\rho, r) + \int_{\mathcal{P}(E_0)} V_\sigma(\rho')\, \hat Q(d\rho' \mid \rho, r)\right\}$.
b) There exists an $f^*$ which attains the minimum in a), and $(f^*, f^*, \dots)$ is optimal for the filtered problem. The optimal policy for the original problem is $(\pi^*_0, \pi^*_1, \dots)$ with
$$\pi^*_0(x)(t) = f^*\big(Q_0(\cdot \mid x)\big)(t), \; x \in \mathbb{R}^d, \qquad \pi^*_n(h_n)(t) = f^*\big(\mu_n(\cdot \mid h_n)\big)(t), \; h_n \in H_n.$$
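The fixed-point equation in part a) is solved numerically by value iteration once the belief space and the control set are discretized. A generic sketch on a tiny hand-made instance (states, actions, costs and kernel are all illustrative; on the slides the discounting is folded into the transition law, here an explicit factor is kept for clarity):

```python
def value_iteration(states, actions, cost, trans, beta=0.9, tol=1e-10, max_iter=10_000):
    """Value iteration for a finite approximation of the filtered MDP:
    iterate  V(s) <- min_a { cost(s, a) + beta * sum_s' P(s'|s, a) V(s') }
    until the sup-norm change falls below tol."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new = {s: min(cost[(s, a)] +
                        beta * sum(p * V[s2] for s2, p in trans[(s, a)].items())
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
    return V

# Tiny illustrative instance: two belief points, two controls.
states, actions = ["rho1", "rho2"], ["stay", "switch"]
cost = {("rho1", "stay"): 1.0, ("rho1", "switch"): 2.0,
        ("rho2", "stay"): 0.0, ("rho2", "switch"): 0.5}
trans = {("rho1", "stay"): {"rho1": 1.0}, ("rho1", "switch"): {"rho2": 1.0},
         ("rho2", "stay"): {"rho2": 1.0}, ("rho2", "switch"): {"rho1": 1.0}}
V = value_iteration(states, actions, cost, trans)
```

Here the minimizer pays the one-off switching cost from "rho1" to reach the cost-free state "rho2", so $V(\text{rho1}) = 2$ and $V(\text{rho2}) = 0$.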
53 Application: Toy Example
(i) The state space is $\mathbb{R}$, $E_0 := \{-2, 0, 2\}$, the action space is $A := [-1, 1]$.
(ii) The controlled drift is given by $\frac{d}{dt}\Phi^r(y, t) = \int_A a\, r_t(da)$, $\Phi(y, 0) = y$.
(iii) We set $\lambda^A \equiv 1$ and $\beta := 1$.
(iv) The jump transition kernel $Q^A$ and the cost function are given in the picture on the next slide.
(vii) Signal density: $f^\varepsilon(-1) = f^\varepsilon(0) = f^\varepsilon(1) = \frac{1}{3}$.
54 Application: Transition Kernel and Cost Function
[Figure: the cost function $c(y)$ plotted against $y$; the jump kernel is $Q(y; \cdot) = \delta_{-2}(\cdot)$, $\delta_0(\cdot)$ or $\delta_2(\cdot)$, depending on the region in which $y$ lies.]
55 Solution of the Problem: Optimal Policy
Theorem. In this POPDMP all assumptions which we previously made are satisfied, and there exists an optimal policy.
In this example we have also computed the value function and the optimal policy numerically by value iteration. For the optimal policy we obtained
$$\pi^*_n(h_n, t) := \begin{cases} -1_{\{t \le \frac{1}{2}\}} & \text{if } \mu^1_n \ge \mu^3_n, \\ \phantom{-}1_{\{t \le \frac{1}{2}\}} & \text{if } \mu^1_n \le \mu^3_n. \end{cases}$$
56 Solution of the Problem: Value Function
[Figure: the numerically computed value function.]
57 References
- N.B. and Lange, D.: Optimal control of partially observable piecewise deterministic Markov processes. SIAM J. Control Optim., 56.
- N.B. and Rieder, U.: Markov Decision Processes with Applications to Finance. Springer.
- Brandejsky, A., de Saporta, B. and Dufour, F.: Optimal stopping for partially observed piecewise-deterministic Markov processes. Stochastic Process. Appl., 123.
- Davis, M.H.A.: Markov Models and Optimization. Chapman and Hall.
- Yushkevich, A.A.: On reducing a jump controllable Markov model to a model with discrete time. Theory Probab. Appl., 25, 58-69, 1980.
58 Thank you very much for your attention!
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationA Dynamic Contagion Process with Applications to Finance & Insurance
A Dynamic Contagion Process with Applications to Finance & Insurance Angelos Dassios Department of Statistics London School of Economics Angelos Dassios, Hongbiao Zhao (LSE) A Dynamic Contagion Process
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationRectifiability of sets and measures
Rectifiability of sets and measures Tatiana Toro University of Washington IMPA February 7, 206 Tatiana Toro (University of Washington) Structure of n-uniform measure in R m February 7, 206 / 23 State of
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationStatistics 581, Problem Set 1 Solutions Wellner; 10/3/ (a) The case r = 1 of Chebychev s Inequality is known as Markov s Inequality
Statistics 581, Problem Set 1 Solutions Wellner; 1/3/218 1. (a) The case r = 1 of Chebychev s Inequality is known as Markov s Inequality and is usually written P ( X ɛ) E( X )/ɛ for an arbitrary random
More informationNested Uncertain Differential Equations and Its Application to Multi-factor Term Structure Model
Nested Uncertain Differential Equations and Its Application to Multi-factor Term Structure Model Xiaowei Chen International Business School, Nankai University, Tianjin 371, China School of Finance, Nankai
More informationStochastic Volatility and Correction to the Heat Equation
Stochastic Volatility and Correction to the Heat Equation Jean-Pierre Fouque, George Papanicolaou and Ronnie Sircar Abstract. From a probabilist s point of view the Twentieth Century has been a century
More informationZero-sum semi-markov games in Borel spaces: discounted and average payoff
Zero-sum semi-markov games in Borel spaces: discounted and average payoff Fernando Luque-Vásquez Departamento de Matemáticas Universidad de Sonora México May 2002 Abstract We study two-person zero-sum
More informationA SKOROHOD REPRESENTATION THEOREM WITHOUT SEPARABILITY PATRIZIA BERTI, LUCA PRATELLI, AND PIETRO RIGO
A SKOROHOD REPRESENTATION THEOREM WITHOUT SEPARABILITY PATRIZIA BERTI, LUCA PRATELLI, AND PIETRO RIGO Abstract. Let (S, d) be a metric space, G a σ-field on S and (µ n : n 0) a sequence of probabilities
More information{σ x >t}p x. (σ x >t)=e at.
3.11. EXERCISES 121 3.11 Exercises Exercise 3.1 Consider the Ornstein Uhlenbeck process in example 3.1.7(B). Show that the defined process is a Markov process which converges in distribution to an N(0,σ
More informationA NEW PROOF OF THE WIENER HOPF FACTORIZATION VIA BASU S THEOREM
J. Appl. Prob. 49, 876 882 (2012 Printed in England Applied Probability Trust 2012 A NEW PROOF OF THE WIENER HOPF FACTORIZATION VIA BASU S THEOREM BRIAN FRALIX and COLIN GALLAGHER, Clemson University Abstract
More informationLarge Deviations Principles for McKean-Vlasov SDEs, Skeletons, Supports and the law of iterated logarithm
Large Deviations Principles for McKean-Vlasov SDEs, Skeletons, Supports and the law of iterated logarithm Gonçalo dos Reis University of Edinburgh (UK) & CMA/FCT/UNL (PT) jointly with: W. Salkeld, U. of
More informationRobust Optimal Control Using Conditional Risk Mappings in Infinite Horizon
Robust Optimal Control Using Conditional Risk Mappings in Infinite Horizon Kerem Uğurlu Monday 9 th April, 2018 Department of Applied Mathematics, University of Washington, Seattle, WA 98195 e-mail: keremu@uw.edu
More informationHomogenization for chaotic dynamical systems
Homogenization for chaotic dynamical systems David Kelly Ian Melbourne Department of Mathematics / Renci UNC Chapel Hill Mathematics Institute University of Warwick November 3, 2013 Duke/UNC Probability
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationThe Contraction Method on C([0, 1]) and Donsker s Theorem
The Contraction Method on C([0, 1]) and Donsker s Theorem Henning Sulzbach J. W. Goethe-Universität Frankfurt a. M. YEP VII Probability, random trees and algorithms Eindhoven, March 12, 2010 joint work
More informationHarnack Inequalities and Applications for Stochastic Equations
p. 1/32 Harnack Inequalities and Applications for Stochastic Equations PhD Thesis Defense Shun-Xiang Ouyang Under the Supervision of Prof. Michael Röckner & Prof. Feng-Yu Wang March 6, 29 p. 2/32 Outline
More informationZ. Zhou On the classical limit of a time-dependent self-consistent field system: analysis. computation
On the classical limit of a time-dependent self-consistent field system: analysis and computation Zhennan Zhou 1 joint work with Prof. Shi Jin and Prof. Christof Sparber. 1 Department of Mathematics Duke
More informationYaglom-type limit theorems for branching Brownian motion with absorption. by Jason Schweinsberg University of California San Diego
Yaglom-type limit theorems for branching Brownian motion with absorption by Jason Schweinsberg University of California San Diego (with Julien Berestycki, Nathanaël Berestycki, Pascal Maillard) Outline
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationOn the split equality common fixed point problem for quasi-nonexpansive multi-valued mappings in Banach spaces
Available online at www.tjnsa.com J. Nonlinear Sci. Appl. 9 (06), 5536 5543 Research Article On the split equality common fixed point problem for quasi-nonexpansive multi-valued mappings in Banach spaces
More informationMean field simulation for Monte Carlo integration. Part I : Intro. nonlinear Markov models (+links integro-diff eq.) P. Del Moral
Mean field simulation for Monte Carlo integration Part I : Intro. nonlinear Markov models (+links integro-diff eq.) P. Del Moral INRIA Bordeaux & Inst. Maths. Bordeaux & CMAP Polytechnique Lectures, INLN
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationChain Rule. MATH 311, Calculus III. J. Robert Buchanan. Spring Department of Mathematics
3.33pt Chain Rule MATH 311, Calculus III J. Robert Buchanan Department of Mathematics Spring 2019 Single Variable Chain Rule Suppose y = g(x) and z = f (y) then dz dx = d (f (g(x))) dx = f (g(x))g (x)
More information4th Preparation Sheet - Solutions
Prof. Dr. Rainer Dahlhaus Probability Theory Summer term 017 4th Preparation Sheet - Solutions Remark: Throughout the exercise sheet we use the two equivalent definitions of separability of a metric space
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationLecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.
1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if
More information