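A. Proofs

Proof of Lemma 3. Consider the Bellman equation
$$\lambda + V_{\pi,\ell}(x, a) = \ell(x, a) + V_{\pi,\ell}\big(Ax + Ba,\ \pi(Ax + Ba)\big).$$
We prove the lemma by showing that the given quadratic form is the unique solution of the Bellman equation. Let $z^\top = (x^\top\ \ a^\top)$ and $z'^\top = \big((Ax + Ba)^\top\ \ (-K(Ax + Ba) + c)^\top\big)$. We have that
$$z' = \begin{pmatrix} I \\ -K \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix}\begin{pmatrix} x \\ a \end{pmatrix} + \begin{pmatrix} 0 \\ c \end{pmatrix}.$$
We guess a quadratic form for the value function and write
$$\lambda + z^\top P z + L^\top z = (x - g)^\top Q (x - g) + a^\top a + z'^\top P z' + L^\top z'.$$
The above equation has a solution if
$$\begin{aligned}
\begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} &= \begin{pmatrix} A^\top \\ B^\top \end{pmatrix}\begin{pmatrix} I & -K^\top \end{pmatrix} P \begin{pmatrix} I \\ -K \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} + \begin{pmatrix} Q & 0 \\ 0 & I \end{pmatrix}, \\
\begin{pmatrix} L_1^\top & L_2^\top \end{pmatrix} &= L^\top \begin{pmatrix} I \\ -K \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} + 2\begin{pmatrix} 0 & c^\top \end{pmatrix} P \begin{pmatrix} I \\ -K \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} - \begin{pmatrix} 2 g^\top Q & 0 \end{pmatrix}, \\
\lambda &= g^\top Q g + c^\top P_{22}\, c + L_2^\top c.
\end{aligned}$$
The nonzero eigenvalues of $\begin{pmatrix} I \\ -K \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix}$ coincide with those of $\begin{pmatrix} A & B \end{pmatrix}\begin{pmatrix} I \\ -K \end{pmatrix} = A - BK$, so its spectral radius is less than one. This implies that the iterative equations (2) have a unique solution. Thus, the quadratic form is the solution of the Bellman equation.

Proof of Lemma 4. From Lemma 3, we have that
$$\begin{aligned}
P_t &= \begin{pmatrix} A^\top \\ B^\top \end{pmatrix}\begin{pmatrix} I & -K_t^\top \end{pmatrix} P_t \begin{pmatrix} I \\ -K_t \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} + \begin{pmatrix} Q & 0 \\ 0 & I \end{pmatrix}, \\
L_t^\top &= L_t^\top \begin{pmatrix} I \\ -K_t \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} + 2\begin{pmatrix} 0 & c_t^\top \end{pmatrix} P_t \begin{pmatrix} I \\ -K_t \end{pmatrix}\begin{pmatrix} A & B \end{pmatrix} - \begin{pmatrix} 2 g_t^\top Q & 0 \end{pmatrix}.
\end{aligned}$$
Notice that the value of $P_t$ depends only on the values of $A$, $B$, and $K_t$, which in turn, by Lemma 2, depend only on $\{K_1, P_1, \dots, P_{t-1}\}$. Thus, the matrix $P_t$ is determined by $K_1$ independently of the adversarial choices $\{g_1, \dots, g_t\}$.

In the absence of adversarial vectors, the optimal policy has the form $\pi(x) = -K_* x$, where $K_* = (I + B^\top S B)^{-1} B^\top S A$ and $S$ is the solution of the Riccati equation. Consider a problem where $g_1 = g_2 = \dots = 0$, $c_1 = c_2 = \dots = 0$, and $K_1 = K_*$ is the gain matrix of the optimal policy. Then $V_1$ is the value function of the optimal policy. Because $\pi_2$ is the greedy policy with respect to $V_1$, it is the optimal policy, and thus $K_2$ is also the gain matrix of the optimal policy, so $K_2 = K_*$. Repeating the same argument shows that all gain matrices are the same. Thus, if we choose $K_1$ to be the optimal gain matrix in the non-adversarial problem, we get $K_t = K_*$ for all $t$ and hence $P_1 = P_2 = \dots = P_*$.

Proof of Lemma 7. First we prove (i). Under policy $\pi_t(x) = -Kx + c_t$, the limiting state $x_{\pi_t}$ is a fixed point of the closed-loop dynamics, so we have that
$$\big(x_{\pi_t},\ \pi_t(x_{\pi_t})\big) = \big(A x_{\pi_t} + B\pi_t(x_{\pi_t}),\ \pi_t(A x_{\pi_t} + B\pi_t(x_{\pi_t}))\big).$$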
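Solving the fixed-point relation $x_{\pi_t} = A x_{\pi_t} + B(-K x_{\pi_t} + c_t)$ gives $x_{\pi_t} = (I - A + BK)^{-1} B c_t$. Thus, by (7),
$$\begin{aligned}
\lambda_t &= (x_{\pi_t} - g_t)^\top Q (x_{\pi_t} - g_t) + (-K x_{\pi_t} + c_t)^\top (-K x_{\pi_t} + c_t) \\
&= g_t^\top Q g_t + c_t^\top \Big(I + B^\top (I - A + BK)^{-\top}(Q + K^\top K)(I - A + BK)^{-1} B\Big) c_t \\
&\quad - 2\big(g_t^\top Q + c_t^\top K\big)(I - A + BK)^{-1} B c_t.
\end{aligned}$$
Then (5) implies that
$$L_{t,2}^\top = -2\big(g_t^\top Q + c_t^\top K\big)(I - A + BK)^{-1} B, \qquad P_{t,22} = I + B^\top (I - A + BK)^{-\top}(Q + K^\top K)(I - A + BK)^{-1} B.$$
By Lemmas 2 and 4,
$$c_{t+1} = -\frac{1}{2t}\, P_{t,22}^{-1} \sum_{s=1}^{t} L_{s,2}.$$
Thus,
$$c_{t+1} = -\frac{1}{2t}\, P_{t,22}^{-1} \sum_{s=1}^{t} L_{s,2} = \frac{1}{t}\, P_{t,22}^{-1} B^\top (I - A + BK)^{-\top} \sum_{s=1}^{t} \big(Q g_s + K^\top c_s\big) = \frac{1}{t}\sum_{s=1}^{t} \big(D g_s + H c_s\big), \qquad (3)$$
where $D = P_{t,22}^{-1} B^\top (I - A + BK)^{-\top} Q$ and $H = P_{t,22}^{-1} B^\top (I - A + BK)^{-\top} K^\top$.

To obtain a bound on $\max_t \|c_t\|$ from the above equation, we need to show that $\|H\|$ is sufficiently smaller than one. Let $N = (I - A + BK)^{-1}$, $M = K N B$, and $L = (I + M^\top M)^{-1} M^\top$. We have that
$$\|H\| = \big\|\big(I + B^\top N^\top (Q + K^\top K) N B\big)^{-1} M^\top\big\| \le \big\|\big(I + B^\top N^\top K^\top K N B\big)^{-1} M^\top\big\| = \big\|(I + M^\top M)^{-1} M^\top\big\| = \|L\|, \qquad (4)$$
and
$$\begin{aligned}
L L^\top &= (I + M^\top M)^{-1} M^\top M (I + M^\top M)^{-1} \\
&= (I + M^\top M)^{-1} \big(M^\top M + I - I\big)(I + M^\top M)^{-1} \\
&= (I + M^\top M)^{-1}\big(I - (I + M^\top M)^{-1}\big).
\end{aligned}$$
Because $\|M^\top M\| = \lambda_{\max}(M^\top M)$, $\|N\| \le 1/(1-\rho)$, and $\|M^\top M\| \le \|K\|^2 \|B\|^2/(1-\rho)^2$, and because $(I + M^\top M)^{-1} \preceq I$, we get that
$$\lambda_{\max}(L L^\top) \le 1 - \frac{1}{1 + \|M^\top M\|} \le 1 - \frac{1}{1 + \|K\|^2\|B\|^2/(1-\rho)^2} = \frac{\|K\|^2\|B\|^2/(1-\rho)^2}{1 + \|K\|^2\|B\|^2/(1-\rho)^2}.$$
By (4) and the above inequality, we get that
$$\|H\| \le \|L\| = \sqrt{\lambda_{\max}(L L^\top)} \le \frac{\|K\|\|B\|/(1-\rho)}{\sqrt{1 + \|K\|^2\|B\|^2/(1-\rho)^2}}.$$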
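As a quick sanity check on the norm bound for $L = (I + M^\top M)^{-1} M^\top$, the following is a minimal sketch restricted to the scalar case, where $A$, $B$, and $K$ are real numbers `a`, `b`, `k`. The function names `scalar_H_bound` and `max_violation` are hypothetical and not part of the original analysis; the sketch only assumes that the closed loop satisfies $|a - bk| = \rho < 1$.

```python
import math
import random


def scalar_H_bound(a, b, k):
    """Return (|L|, bound) for the scalar system x' = a*x + b*u with u = -k*x + c.

    Assumption (not from the paper): everything is one-dimensional, so the
    matrices N, M, L of the proof reduce to plain numbers.
    """
    rho = abs(a - b * k)              # closed-loop contraction factor
    if rho >= 1.0:
        raise ValueError("closed loop is not contractive")
    n = 1.0 / (1.0 - a + b * k)       # N = (I - A + BK)^{-1}
    m = k * n * b                     # M = K N B
    l = m / (1.0 + m * m)             # L = (I + M^T M)^{-1} M^T
    kappa = abs(k) * abs(b) / (1.0 - rho)
    return abs(l), kappa / math.sqrt(1.0 + kappa * kappa)


def max_violation(trials=1000, seed=0):
    """Largest value of |L| - bound over random contractive scalar systems."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        b = rng.uniform(-3.0, 3.0)
        k = rng.uniform(-3.0, 3.0)
        closed = rng.uniform(-0.95, 0.95)   # desired value of a - b*k
        a = closed + b * k
        l_abs, bound = scalar_H_bound(a, b, k)
        worst = max(worst, l_abs - bound)
    return worst
```

A non-positive return value from `max_violation` (up to rounding) indicates that the bound held on every sampled system. This does not replace the matrix argument; it only exercises the inequality in the simplest setting.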
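Let $v = 1/(1 - \|H\|)$. We get that
$$\begin{aligned}
v &\le \frac{1}{1 - \dfrac{\|K\|\|B\|/(1-\rho)}{\sqrt{1 + \|K\|^2\|B\|^2/(1-\rho)^2}}}
= \frac{\sqrt{1 + \|K\|^2\|B\|^2/(1-\rho)^2}}{\sqrt{1 + \|K\|^2\|B\|^2/(1-\rho)^2} - \|K\|\|B\|/(1-\rho)} \\
&= \sqrt{1 + \frac{\|K\|^2\|B\|^2}{(1-\rho)^2}}\left(\sqrt{1 + \frac{\|K\|^2\|B\|^2}{(1-\rho)^2}} + \frac{\|K\|\|B\|}{1-\rho}\right) = H'.
\end{aligned}$$
Now we are ready to bound $\|c_t\|$. By (3), we get that for any $t$,
$$\|c_{t+1}\| \le \|D\| G + \frac{\|H\|}{t}\sum_{s=1}^{t}\|c_s\| \le \|D\| G + \|H\| \max_{s \le t}\|c_s\|.$$
Therefore
$$\max_t \|c_t\| \le \|D\| G + \|H\| \max_t \|c_t\|,$$
and thus
$$\max_t \|c_t\| \le \frac{\|D\| G}{1 - \|H\|} \le \|D\| G H' = C.$$

Proof of (ii). First we write $c_t$ in terms of $c_{t-1}$:
$$\begin{aligned}
c_t &= \frac{1}{t-1}\sum_{s=1}^{t-1}\big(D g_s + H c_s\big) \\
&= \frac{1}{t-1}\big(D g_{t-1} + H c_{t-1}\big) + \frac{t-2}{t-1}\cdot\frac{1}{t-2}\sum_{s=1}^{t-2}\big(D g_s + H c_s\big) \\
&= \frac{1}{t-1}\big(D g_{t-1} + H c_{t-1}\big) + \frac{t-2}{t-1}\, c_{t-1} \\
&= \frac{1}{t-1}\Big(D g_{t-1} + \big((t-2) I + H\big) c_{t-1}\Big).
\end{aligned}$$
This implies that
$$c_t - c_{t-1} = \frac{1}{t-1}\big(D g_{t-1} - (I - H) c_{t-1}\big).$$
Then we use the facts that $\|c_t\| \le C$ and $\|H\| < 1$ to obtain
$$\|c_t - c_{t-1}\| \le \frac{\|D\| G + 2C}{t-1}.$$

Proof of Lemma 8. Let $f_\pi : \mathcal{X} \to \mathcal{X}$ be the transition function under policy $\pi = (K, c)$, i.e. $f_\pi(x) = (A - BK)x + Bc$. Let $\epsilon_{k,t} = \|x_k - x_{\pi_t}\|$ and $\epsilon_t = \|x_t - x_{\pi_t}\|$ denote the difference between the state variable and the limiting state under the chosen policy. We write$^{4}$
$$\begin{aligned}
\epsilon_{k,t} &= \big\|f_{\pi_{k-1}}(x_{k-1}) - f_{\pi_t}(x_{k-1}) + f_{\pi_t}(x_{k-1}) - x_{\pi_t}\big\| \\
&= \big\|f_{\pi_{k-1}}(x_{k-1}) - f_{\pi_t}(x_{k-1}) + f_{\pi_t}(x_{k-1}) - f_{\pi_t}(x_{\pi_t})\big\|.
\end{aligned}$$
From this decomposition, we get that
$$\begin{aligned}
\epsilon_{k,t} &\le \big\|B(c_{k-1} - c_t)\big\| + \big\|f_{\pi_t}(x_{k-1}) - f_{\pi_t}(x_{\pi_t})\big\| \\
&\le \|B\|\,\|c_{k-1} - c_t\| + \rho\,\|x_{k-1} - x_{\pi_t}\| \\
&\le \|B\|\big(\|D\| G + 2C\big)\sum_{s=k}^{t}\frac{1}{s-1} + \rho\,\|x_{k-1} - x_{\pi_t}\|.
\end{aligned}$$

$^{4}$ A similar decomposition, but with a different norm, was used in Even-Dar et al. (2009, proof of Lemma 5.2) to bound the difference between the stationary distribution of the chosen policy and the distribution of the state variable in a finite MDP problem.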
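Applying this inequality recursively for $k = t, t-1, \dots, 2$ gives
$$\begin{aligned}
\epsilon_t &\le \|B\|\big(\|D\| G + 2C\big)\sum_{k=2}^{t}\rho^{t-k}\sum_{s=k}^{t}\frac{1}{s-1} + \rho^{t-1}\,\|x_1 - x_{\pi_t}\| \\
&\le \|B\|\big(\|D\| G + 2C\big)\sum_{s=2}^{t}\frac{1}{s-1}\sum_{k=2}^{s}\rho^{t-k} + \rho^{t-1}\,\frac{\|B\| C}{1-\rho} \\
&\le \frac{\|B\|\big(\|D\| G + 2C\big)}{1-\rho}\sum_{s=2}^{t}\frac{\rho^{t-s}}{s-1} + \rho^{t-1}\,\frac{\|B\| C}{1-\rho},
\end{aligned}$$
where the second step follows from Equation (7), Lemma 7, and the fact that $x_1 = 0$. If $t - 1 > \log t/\log(1/\rho)$, we get that
$$\begin{aligned}
\sum_{s=2}^{t}\frac{\rho^{t-s}}{s-1} &= \sum_{s:\ \rho^{t-s} \le 1/t}\frac{\rho^{t-s}}{s-1} + \sum_{s:\ \rho^{t-s} > 1/t}\frac{\rho^{t-s}}{s-1} \\
&\le \frac{1}{t}\sum_{s=2}^{t}\frac{1}{s-1} + \sum_{s > t - \log t/\log(1/\rho)}\frac{1}{s-1} \\
&\le \frac{1 + \log t}{t} + \frac{\log t/\log(1/\rho) + 1}{t - 1 - \log t/\log(1/\rho)}.
\end{aligned}$$
Thus,
$$\epsilon_t \le \frac{\|B\|\big(\|D\| G + 2C\big)}{1-\rho}\left(\frac{1 + \log t}{t} + \frac{\log t/\log(1/\rho) + 1}{t - 1 - \log t/\log(1/\rho)}\right) + \rho^{t-1}\,\frac{\|B\| C}{1-\rho}.$$

To prove the second part of the lemma, let $u_T = \lceil \log T/\log(1/\rho) \rceil + 1$. For $u_T < t \le T$ we have $t - 1 - \log t/\log(1/\rho) \ge t - u_T \ge 1$, so
$$\sum_{t > u_T}\frac{1}{t - 1 - \log t/\log(1/\rho)} \le \sum_{t > u_T}\frac{1}{t - u_T} \le \sum_{t=1}^{T}\frac{1}{t} \le 1 + \log T. \qquad (5)$$
For $t \le u_T$ we use the crude bound $\epsilon_t \le \|x_t\| + \|x_{\pi_t}\| \le 2\|B\| C/(1-\rho)$. Thus, by (8) and (5),
$$\begin{aligned}
\sum_{t=1}^{T}\epsilon_t &= \sum_{t \le u_T}\epsilon_t + \sum_{t > u_T}\epsilon_t \\
&\le \frac{2\|B\| C}{1-\rho}\, u_T + \frac{\|B\| C}{(1-\rho)^2} + \frac{\|B\|\big(\|D\| G + 2C\big)}{1-\rho}\,(1 + \log T)\left(2 + \log T + \frac{\log T}{\log(1/\rho)}\right).
\end{aligned}$$

The fact that all gain matrices are identical greatly simplifies the boundedness proof.

Proof of Lemma 11. First, it is easy to verify that $P_{t,22} \succeq I$, and thus the Hessian of $V_t$ with respect to the action is $2P_{t,22} \succeq 2I$. The gradient of the value function can be written as
$$\nabla_a V_t(x_t^{\pi}, a) = 2P_{t,22}\, a + 2P_{t,21}\, x_t^{\pi} + L_{t,2}.$$
Thus, $\|\nabla_a V_t(x_t^{\pi}, a)\| \le F$ for any $\|a\| \le U$.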
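Proof of (i). By (8), $\|x_t\| \le X$, and by Lemma 7, $\|c_t\| \le C$. Thus, all actions are bounded by
$$\|a_t\| = \|-K x_t + c_t\| \le \|K\| X + C \le U.$$

Proof of (ii) and (iii). By Lemma 6,
$$\|-K x_t^{\pi} + c\| \le \|K\| X + C \le U.$$
Similarly,
$$\|-K x_{\pi} + c\| \le \|K\| X + C \le U.$$

Proof of (iv). By (4) and the fact that $K_t = K_*$ and $P_t = P_*$, we get that
$$\|L_t\| \le \frac{2}{1-\rho}\big(G\|Q\| + \rho\, C\, \|P_*\|\big).$$
Further, by (2), for any policy $\pi \in \Pi$ and any action satisfying $\|a\| \le U$, the value functions are bounded by
$$\big|V_t(x_t^{\pi}, a)\big| = \left|\begin{pmatrix} x_t^{\pi} \\ a \end{pmatrix}^{\!\top} P_* \begin{pmatrix} x_t^{\pi} \\ a \end{pmatrix} + L_t^\top \begin{pmatrix} x_t^{\pi} \\ a \end{pmatrix}\right| \le \|P_*\|(X + U)^2 + \frac{2}{1-\rho}\big(G\|Q\| + \rho\, C\, \|P_*\|\big)(X + U) \le V.$$

Proof of Lemma 13. For policy $\pi = (K, c)$, we have
$$\ell_t(x, \pi(x)) = x^\top (Q + K^\top K)\, x - 2\big(c^\top K + g_t^\top Q\big) x + c^\top c + g_t^\top Q g_t.$$
Define $S = Q + K^\top K$ and $d_t^\top = 2\big(c^\top K + g_t^\top Q\big)$. The constant terms cancel, and we write
$$\begin{aligned}
\gamma_T &= \sum_{t=1}^{T}\Big(x_t^{\pi\top} S\, x_t^{\pi} - d_t^\top x_t^{\pi}\Big) - \sum_{t=1}^{T}\Big(x_{\pi}^\top S\, x_{\pi} - d_t^\top x_{\pi}\Big) \\
&= \sum_{t=1}^{T}\Big(-d_t^\top (x_t^{\pi} - x_{\pi}) + \big(\|S^{1/2} x_t^{\pi}\| - \|S^{1/2} x_{\pi}\|\big)\big(\|S^{1/2} x_t^{\pi}\| + \|S^{1/2} x_{\pi}\|\big)\Big).
\end{aligned}$$
Since $\big|\|S^{1/2} x_t^{\pi}\| - \|S^{1/2} x_{\pi}\|\big| \le \|S^{1/2}(x_t^{\pi} - x_{\pi})\|$, it follows that
$$\begin{aligned}
|\gamma_T| &\le \sum_{t=1}^{T}\Big(\|d_t\|\,\|x_t^{\pi} - x_{\pi}\| + \|S^{1/2}(x_t^{\pi} - x_{\pi})\|\big(\|S^{1/2} x_t^{\pi}\| + \|S^{1/2} x_{\pi}\|\big)\Big) \\
&\le \sum_{t=1}^{T}\Big(\|d_t\| + \|S^{1/2}\|\big(\|S^{1/2} x_t^{\pi}\| + \|S^{1/2} x_{\pi}\|\big)\Big)\|x_t^{\pi} - x_{\pi}\| \\
&\le Z \sum_{t=1}^{T}\|x_t^{\pi} - x_{\pi}\|.
\end{aligned}$$
We get the desired result by Lemma 6.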
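The per-step inequality used in the proof of Lemma 13 can be checked numerically. The sketch below is illustrative only and makes a simplifying assumption that is not in the paper: $S$ is taken to be diagonal with nonnegative entries, so that $S^{1/2}$ is an elementwise square root and $\|S^{1/2}\|$ is the square root of the largest entry. The helper names `loss_gap_and_bound` and `worst_slack` are hypothetical.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def loss_gap_and_bound(s, d, x, y):
    """Gap (x'Sx - d'x) - (y'Sy - d'y) and the upper bound from the proof.

    s holds the diagonal of S (assumed diagonal and nonnegative here);
    d, x, y are vectors of the same length as s.
    """
    quad_x = sum(si * xi * xi for si, xi in zip(s, x))
    quad_y = sum(si * yi * yi for si, yi in zip(s, y))
    lin_x = sum(di * xi for di, xi in zip(d, x))
    lin_y = sum(di * yi for di, yi in zip(d, y))
    gap = (quad_x - lin_x) - (quad_y - lin_y)

    sx = norm([math.sqrt(si) * xi for si, xi in zip(s, x)])  # ||S^{1/2} x||
    sy = norm([math.sqrt(si) * yi for si, yi in zip(s, y)])  # ||S^{1/2} y||
    s_half = math.sqrt(max(s))                               # ||S^{1/2}|| for diagonal S
    diff = norm([xi - yi for xi, yi in zip(x, y)])           # ||x - y||
    bound = (norm(d) + s_half * (sx + sy)) * diff
    return gap, bound


def worst_slack(trials=1000, dim=3, seed=0):
    """Largest value of |gap| - bound over random instances."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        s = [rng.uniform(0.0, 4.0) for _ in range(dim)]
        d, x, y = ([rng.uniform(-2.0, 2.0) for _ in range(dim)] for _ in range(3))
        gap, bound = loss_gap_and_bound(s, d, x, y)
        worst = max(worst, abs(gap) - bound)
    return worst
```

Here `x` plays the role of the state $x_t^{\pi}$ and `y` the role of the limiting state $x_{\pi}$. With $S = I$, $d = 0$, $x = (3, 4)$ and $y = 0$ both the gap and the bound equal 25, which shows that the inequality can be tight.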