Optimal Transport Methods in Operations Research and Statistics


1 Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and Department of IEOR). Blanchet (Columbia U. and Stanford U.) 1 / 57

2 Goal: Introduce optimal transport techniques and applications in OR & Statistics. Optimal transport is a useful tool in model robustness, equilibrium, and machine learning! Blanchet (Columbia U. and Stanford U.) 2 / 57

3 Agenda Introduction to Optimal Transport Blanchet (Columbia U. and Stanford U.) 3 / 57

4 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Blanchet (Columbia U. and Stanford U.) 3 / 57

5 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Blanchet (Columbia U. and Stanford U.) 3 / 57

6 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Applications in Distributionally Robust Optimization Blanchet (Columbia U. and Stanford U.) 3 / 57

7 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Applications in Distributionally Robust Optimization Applications in Statistics Blanchet (Columbia U. and Stanford U.) 3 / 57

8 Introduction to Optimal Transport Monge-Kantorovich Problem & Duality (see e.g. C. Villani's 2008 textbook) Blanchet (Columbia U. and Stanford U.) 4 / 57

9 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? Blanchet (Columbia U. and Stanford U.) 5 / 57

10 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, Blanchet (Columbia U. and Stanford U.) 6 / 57

11 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. Blanchet (Columbia U. and Stanford U.) 6 / 57

12 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. T(X) ∼ ν means T(X) follows distribution ν(·). Blanchet (Columbia U. and Stanford U.) 6 / 57

13 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. T(X) ∼ ν means T(X) follows distribution ν(·). Problem is highly non-linear; not much progress for about 150 yrs! Blanchet (Columbia U. and Stanford U.) 6 / 57

14 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Blanchet (Columbia U. and Stanford U.) 7 / 57

15 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Blanchet (Columbia U. and Stanford U.) 7 / 57

16 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Linear programming (infinite dimensional): D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy) s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). Blanchet (Columbia U. and Stanford U.) 7 / 57

17 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Linear programming (infinite dimensional): D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy) s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). If c(x, y) = d^p(x, y) (d a metric) then D_c^{1/p}(µ, ν) is a p-Wasserstein metric. Blanchet (Columbia U. and Stanford U.) 7 / 57
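As a quick illustration (an editorial sketch, not part of the talk), the finite-support version of this linear program can be handed to an off-the-shelf LP solver; the supports, marginals and cost matrix below are made-up toy data.

```python
# A minimal sketch of the Kantorovich primal LP on finite supports, solved
# with scipy.optimize.linprog.  The supports and marginals are illustrative.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0])          # support of mu
y = np.array([0.5, 1.5])               # support of nu
mu = np.array([0.2, 0.5, 0.3])         # source marginal
nu = np.array([0.6, 0.4])              # target marginal
C = np.abs(x[:, None] - y[None, :])    # cost c(x, y) = |x - y|

m, n = C.shape
# Equality constraints: row sums of pi equal mu, column sums equal nu.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
for j in range(n):
    A_eq[m + j, j::n] = 1.0
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(m, n)               # optimal coupling pi(x_i, y_j)
print("D_c(mu, nu) =", res.fun)
print(pi)
```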

18 Illustration of Optimal Transport Costs Monge's solution would take the form π(dx, dy) = δ_{T(x)}(dy) µ(dx). Blanchet (Columbia U. and Stanford U.) 8 / 57

19 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Blanchet (Columbia U. and Stanford U.) 9 / 57

20 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Blanchet (Columbia U. and Stanford U.) 9 / 57

21 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Dual α and β can be taken over continuous functions. Blanchet (Columbia U. and Stanford U.) 9 / 57

22 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Dual α and β can be taken over continuous functions. Complementary slackness: equality holds on the support of π (the primal optimizer). Blanchet (Columbia U. and Stanford U.) 9 / 57

23 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Blanchet (Columbia U. and Stanford U.) 10 / 57

24 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Blanchet (Columbia U. and Stanford U.) 10 / 57

25 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Blanchet (Columbia U. and Stanford U.) 10 / 57

26 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Now comes Maria, who has a business... Blanchet (Columbia U. and Stanford U.) 10 / 57

27 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Now comes Maria, who has a business... Maria promises to transport the whole amount on behalf of John and Peter. Blanchet (Columbia U. and Stanford U.) 10 / 57

28 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). Blanchet (Columbia U. and Stanford U.) 11 / 57

29 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Blanchet (Columbia U. and Stanford U.) 11 / 57

30 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Maria wishes to maximize her profit ∫ α(x) µ(dx) + ∫ β(y) ν(dy). Blanchet (Columbia U. and Stanford U.) 11 / 57

31 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Maria wishes to maximize her profit ∫ α(x) µ(dx) + ∫ β(y) ν(dy). Kantorovich duality says primal and dual optimal values coincide and (under mild regularity) α(x) = inf_y {c(x, y) − β(y)}, β(y) = inf_x {c(x, y) − α(x)}. Blanchet (Columbia U. and Stanford U.) 11 / 57

32 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Blanchet (Columbia U. and Stanford U.) 12 / 57

33 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Swap sup and inf using Sion's min-max theorem by a compactness argument and conclude. Blanchet (Columbia U. and Stanford U.) 12 / 57

34 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Swap sup and inf using Sion's min-max theorem by a compactness argument and conclude. Significant amount of work needed to extend to general Polish spaces and construct the dual optimizers (primal a bit easier). Blanchet (Columbia U. and Stanford U.) 12 / 57

35 Application of Optimal Transport in Economics Economic Interpretations (see e.g. A. Galichon's 2016 textbook & McCann's 2013 notes). Blanchet (Columbia U. and Stanford U.) 13 / 57

36 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). Blanchet (Columbia U. and Stanford U.) 14 / 57

37 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). Blanchet (Columbia U. and Stanford U.) 14 / 57

38 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). Blanchet (Columbia U. and Stanford U.) 14 / 57

39 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). The salary of worker x is α(x) & the cost of technology y is β(y), with α(x) + β(y) ≥ Ψ(x, y). Blanchet (Columbia U. and Stanford U.) 14 / 57

40 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). The salary of worker x is α(x) & the cost of technology y is β(y), with α(x) + β(y) ≥ Ψ(x, y). Companies want to minimize total production cost ∫ α(x) µ(x) dx + ∫ β(y) ν(y) dy Blanchet (Columbia U. and Stanford U.) 14 / 57

41 Applications in Labor Markets Letting a central planner organize the labor market: Blanchet (Columbia U. and Stanford U.) 15 / 57

42 Applications in Labor Markets Letting a central planner organize the labor market: The planner wishes to maximize total surplus ∫ Ψ(x, y) π(dx, dy) Blanchet (Columbia U. and Stanford U.) 15 / 57

43 Applications in Labor Markets Letting a central planner organize the labor market: The planner wishes to maximize total surplus ∫ Ψ(x, y) π(dx, dy) over assignments π(·) which satisfy market clearing ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). Blanchet (Columbia U. and Stanford U.) 15 / 57

44 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Blanchet (Columbia U. and Stanford U.) 16 / 57

45 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Blanchet (Columbia U. and Stanford U.) 16 / 57

46 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Consider max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n) s.t. Σ_j π(x_i^n, y_j^n) = 1/n for each x_i^n, Σ_i π(x_i^n, y_j^n) = 1/n for each y_j^n. Blanchet (Columbia U. and Stanford U.) 16 / 57

47 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Consider max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n) s.t. Σ_j π(x_i^n, y_j^n) = 1/n for each x_i^n, Σ_i π(x_i^n, y_j^n) = 1/n for each y_j^n. Clearly, simply sort and match is the solution! Blanchet (Columbia U. and Stanford U.) 16 / 57
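A small numerical sanity check (an editorial sketch, not from the slides): for the supermodular surplus Ψ(x, y) = xy, matching sorted samples beats an arbitrary matching, in line with the sort-and-match claim. The Uniform/Exponential choices mirror the example above; the sample size is arbitrary.

```python
# Sketch: for the supermodular surplus Psi(x, y) = x*y, matching sorted
# samples (sort-and-match) dominates an arbitrary matching.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0.0, 1.0, n)      # X ~ Uniform(0, 1)
y = rng.exponential(1.0, n)       # Y ~ Exp(1)

sorted_match = np.sort(x) @ np.sort(y) / n     # comonotone assignment
random_match = x @ rng.permutation(y) / n      # arbitrary assignment

print("average surplus, sort-and-match:", sorted_match)
print("average surplus, random match  :", random_match)
# The rearrangement inequality guarantees sorted_match >= random_match.
```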

48 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). Blanchet (Columbia U. and Stanford U.) 17 / 57

49 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. Blanchet (Columbia U. and Stanford U.) 17 / 57

50 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. As n → ∞, X_{(⌈nt⌉)}^n → t, so Y_{(⌈nt⌉)}^n → −log(1 − t). Blanchet (Columbia U. and Stanford U.) 17 / 57

51 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. As n → ∞, X_{(⌈nt⌉)}^n → t, so Y_{(⌈nt⌉)}^n → −log(1 − t). Thus, the optimal coupling as n → ∞ is X = U and Y = −log(1 − U) (comonotonic coupling). Blanchet (Columbia U. and Stanford U.) 17 / 57

52 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Blanchet (Columbia U. and Stanford U.) 18 / 57

53 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Blanchet (Columbia U. and Stanford U.) 18 / 57

54 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Corollary: Suppose c(x, y) = |x − y|; then X = F_µ^{-1}(U) and Y = F_ν^{-1}(U), thus D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{-1}(u) − F_ν^{-1}(u)| du. Blanchet (Columbia U. and Stanford U.) 18 / 57

55 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Corollary: Suppose c(x, y) = |x − y|; then X = F_µ^{-1}(U) and Y = F_ν^{-1}(U), thus D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{-1}(u) − F_ν^{-1}(u)| du. Similar identities are common for Wasserstein distances... Blanchet (Columbia U. and Stanford U.) 18 / 57
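The one-dimensional identity lends itself to a direct empirical check (an editorial sketch with illustrative samples): the comonotone coupling just matches order statistics, and SciPy's wasserstein_distance returns the same number.

```python
# Sketch of D_c(F_mu, F_nu) = int_0^1 |F_mu^{-1}(u) - F_nu^{-1}(u)| du for
# c(x, y) = |x - y|, on empirical samples: match order statistics and average.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0.0, 1.0, n)            # sample from mu
y = rng.exponential(1.0, n)             # sample from nu

w1_quantile = np.mean(np.abs(np.sort(x) - np.sort(y)))   # quantile-coupling formula
w1_scipy = wasserstein_distance(x, y)

print(w1_quantile, w1_scipy)            # the two numbers should agree closely
```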

56 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. Blanchet (Columbia U. and Stanford U.) 19 / 57

57 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). Blanchet (Columbia U. and Stanford U.) 19 / 57

58 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). What if Ψ(x, y) → Ψ(x, y) + f(x)? (i.e. productivity grows). Blanchet (Columbia U. and Stanford U.) 19 / 57

59 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). What if Ψ(x, y) → Ψ(x, y) + f(x)? (i.e. productivity grows). Answer: salaries grow if f(·) is increasing. Blanchet (Columbia U. and Stanford U.) 19 / 57
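A quick numeric verification of the displayed equilibrium prices (an editorial sketch that normalizes β(0) = 0): with y(x) = −log(1 − x), the pair α(x) = x·y(x) − β(y(x)) and β(y) = y + exp(−y) − 1 is dual feasible, α(x) + β(y) ≥ xy, with equality along the comonotone matching.

```python
# Numeric check (a sketch, taking beta(0) = 0): with y(x) = -log(1 - x),
# beta(y) = y + exp(-y) - 1 and alpha(x) = x*y(x) - beta(y(x)), the pair
# (alpha, beta) satisfies alpha(x) + beta(y) >= x*y, with equality on the
# matched pairs y = y(x).
import numpy as np

xs = np.linspace(0.0, 0.999, 400)
ys = np.linspace(0.0, 8.0, 400)

def beta(y):
    return y + np.exp(-y) - 1.0

y_of_x = -np.log(1.0 - xs)
alpha = xs * y_of_x - beta(y_of_x)

gap = alpha[:, None] + beta(ys)[None, :] - xs[:, None] * ys[None, :]
print("min gap over the grid:", gap.min())    # ~0 up to floating point, never negative
print("gap on matched pairs :", np.abs(alpha + beta(y_of_x) - xs * y_of_x).max())
```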

60 Applications of Optimal Transport in Stochastic OR Application of Optimal Transport in Stochastic OR - Blanchet and Murthy (2016). Insight: Diffusion approximations and optimal transport Blanchet (Columbia U. and Stanford U.) 20 / 57

61 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Blanchet (Columbia U. and Stanford U.) 21 / 57

62 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Blanchet (Columbia U. and Stanford U.) 21 / 57

63 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Model P_true might be unknown or too difficult to work with. Blanchet (Columbia U. and Stanford U.) 21 / 57

64 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Model P_true might be unknown or too difficult to work with. So, we introduce a proxy P_0 which provides a good trade-off between tractability and model fidelity (e.g. Brownian motion for heavy-traffic approximations). Blanchet (Columbia U. and Stanford U.) 21 / 57

65 A Distributionally Robust Performance Analysis For f(·) upper semicontinuous with E_{P_0} f(X) < ∞, consider sup_{D_c(P, P_0) ≤ δ} E_P(f(Y)), where X takes values on a Polish space and c(·) is lower semicontinuous. Blanchet (Columbia U. and Stanford U.) 22 / 57

66 A Distributionally Robust Performance Analysis For f(·) upper semicontinuous with E_{P_0} f(X) < ∞, consider sup_{D_c(P, P_0) ≤ δ} E_P(f(Y)), where X takes values on a Polish space and c(·) is lower semicontinuous. Also an infinite dimensional linear program: sup ∫_{X×Y} f(y) π(dx, dy) s.t. ∫_{X×Y} c(x, y) π(dx, dy) ≤ δ, ∫_Y π(dx, dy) = P_0(dx). Blanchet (Columbia U. and Stanford U.) 22 / 57
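For intuition (an editorial sketch, not from the talk), here is a finite, discretized version of that linear program: P_0 is a small discrete distribution, the destinations y are restricted to a grid, and the transport budget enters as a single inequality constraint. All numbers and the choice of f are illustrative.

```python
# Sketch of the worst-case expectation as a finite LP: maximize E_P f(Y)
# over couplings pi(x_i, y_j) >= 0 with X-marginal P_0 and transport budget
# delta.  The destination grid is a discretization of the true problem.
import numpy as np
from scipy.optimize import linprog

x = np.array([-1.0, 0.0, 1.0]); p0 = np.array([0.25, 0.5, 0.25])   # proxy P_0
y = np.linspace(-3.0, 3.0, 61)                                      # destination grid
f = (y >= 1.5).astype(float)                                        # f(y) = I(y >= 1.5)
delta = 0.2
C = (x[:, None] - y[None, :]) ** 2                                  # c(x, y) = (x - y)^2

m, n = len(x), len(y)
A_eq = np.zeros((m, m * n)); b_eq = p0
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0                                # marginal of X is P_0
A_ub = C.ravel()[None, :]; b_ub = [delta]                           # transport budget
res = linprog(-np.tile(f, m), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
print("nominal E_0 f(X)   :", float(p0 @ (x >= 1.5)))
print("worst-case E_P f(Y):", -res.fun)
```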

67 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). Blanchet (Columbia U. and Stanford U.) 23 / 57

68 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. Blanchet (Columbia U. and Stanford U.) 23 / 57

69 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. We refer to this as RoPA Duality in this talk. Blanchet (Columbia U. and Stanford U.) 23 / 57

70 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. We refer to this as RoPA Duality in this talk. Let us consider the important case f(y) = I(y ∈ A) & c(x, x) = 0. Blanchet (Columbia U. and Stanford U.) 23 / 57

71 A Distributionally Robust Performance Analysis So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then Dual = inf_{λ ≥ 0} [ λδ + E_0(1 − λ c_A(X))^+ ] = P_0(c_A(X) ≤ 1/λ*). Blanchet (Columbia U. and Stanford U.) 24 / 57

72 A Distributionally Robust Performance Analysis So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then Dual = inf_{λ ≥ 0} [ λδ + E_0(1 − λ c_A(X))^+ ] = P_0(c_A(X) ≤ 1/λ*). If c_A(X) has a continuous distribution under P_0 & E_0(c_A(X)) ≥ δ, then λ* solves δ = E_0[c_A(X) I(c_A(X) ≤ 1/λ*)]. Blanchet (Columbia U. and Stanford U.) 24 / 57
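A sketch of the indicator-case formula (illustrative parameters, not from the talk): take P_0 = N(0, 1), A = [a, ∞) and c(x, y) = |x − y|, so that c_A(x) = (a − x)^+, and evaluate the one-dimensional dual by Monte Carlo.

```python
# Sketch of the RoPA dual for f(y) = I(y in A): with P_0 = N(0,1),
# A = [a, infinity) and c(x, y) = |x - y|, c_A(x) = max(a - x, 0), and the
# worst case over the Wasserstein ball is
#   inf_{lambda >= 0} [ lambda*delta + E_0 (1 - lambda*c_A(X))^+ ].
# The values a = 2.0 and delta = 0.05 are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
a, delta = 2.0, 0.05
x = rng.normal(size=200_000)
c_A = np.maximum(a - x, 0.0)

def dual(lmbda):
    return lmbda * delta + np.mean(np.maximum(1.0 - lmbda * c_A, 0.0))

res = minimize_scalar(dual, bounds=(0.0, 1e3), method="bounded")
print("nominal  P_0(X >= a)      :", 1.0 - norm.cdf(a))
print("worst-case P(Y in A) bound:", res.fun)
```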

73 Example: Model Uncertainty in Bankruptcy Calculations R (t) = the reserve (perhaps multiple lines) at time t. Blanchet (Columbia U. and Stanford U.) 25 / 57

74 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). Blanchet (Columbia U. and Stanford U.) 25 / 57

75 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). B is a set which models bankruptcy. Blanchet (Columbia U. and Stanford U.) 25 / 57

76 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). B is a set which models bankruptcy. Problem: Model (P_true) may be complex, intractable or simply unknown... Blanchet (Columbia U. and Stanford U.) 25 / 57

77 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. Blanchet (Columbia U. and Stanford U.) 26 / 57

78 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. Blanchet (Columbia U. and Stanford U.) 26 / 57

79 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. Blanchet (Columbia U. and Stanford U.) 26 / 57

80 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. δ is the distributional uncertainty size. Blanchet (Columbia U. and Stanford U.) 26 / 57

81 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. δ is the distributional uncertainty size. D_c(·) defines the distributional uncertainty region. Blanchet (Columbia U. and Stanford U.) 26 / 57

82 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Blanchet (Columbia U. and Stanford U.) 27 / 57

83 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Blanchet (Columbia U. and Stanford U.) 27 / 57

84 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Want to preserve advantages of using P_0. Blanchet (Columbia U. and Stanford U.) 27 / 57

85 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Want to preserve advantages of using P_0. Want a way to estimate δ. Blanchet (Columbia U. and Stanford U.) 27 / 57

86 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Blanchet (Columbia U. and Stanford U.) 28 / 57

87 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Blanchet (Columbia U. and Stanford U.) 28 / 57

88 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Blanchet (Columbia U. and Stanford U.) 28 / 57

89 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Think of using Brownian motion as a proxy model for R(t)... Blanchet (Columbia U. and Stanford U.) 28 / 57

90 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Think of using Brownian motion as a proxy model for R(t)... Optimal transport is a natural option! Blanchet (Columbia U. and Stanford U.) 28 / 57

91 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. Blanchet (Columbia U. and Stanford U.) 29 / 57

92 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. If R(t) = b − Z(t), then ruin during the time interval [0, 1] is B_b = {R(·) : inf_{t ∈ [0,1]} R(t) ≤ 0} = {Z(·) : sup_{t ∈ [0,1]} Z(t) ≥ b}. Blanchet (Columbia U. and Stanford U.) 29 / 57

93 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. If R(t) = b − Z(t), then ruin during the time interval [0, 1] is B_b = {R(·) : inf_{t ∈ [0,1]} R(t) ≤ 0} = {Z(·) : sup_{t ∈ [0,1]} Z(t) ≥ b}. Let P_0(·) be the Wiener measure; we want to compute sup_{D_c(P_0,P) ≤ δ} P(Z ∈ B_b). Blanchet (Columbia U. and Stanford U.) 29 / 57

94 Application 1: Computing Distance to Bankruptcy So: {c_{B_b}(Z) ≤ 1/λ} = {sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ}, and sup_{D_c(P_0,P) ≤ δ} P(Z ∈ B_b) = P_0( sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ* ). Blanchet (Columbia U. and Stanford U.) 30 / 57
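An illustrative computation of this worst-case ruin probability (an editorial sketch; b and δ are made up): for a standard Brownian proxy on [0, 1], sup_t Z(t) has the law of |N(0, 1)|, c_{B_b}(Z) = (b − sup_t Z(t))^+, and λ* follows from the calibration equation of the previous slide.

```python
# Sketch of the worst-case ruin probability when P_0 is the Wiener measure.
# For standard Brownian motion on [0,1], sup_{t<=1} Z(t) =d |N(0,1)| and
# c_{B_b}(Z) = (b - sup_t Z(t))^+.  We locate u = 1/lambda* from
#   delta = E_0[ c_{B_b} I(c_{B_b} <= u) ]
# and report P_0(sup Z >= b - u).  b and delta are illustrative.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rng = np.random.default_rng(3)
b, delta = 3.0, 0.05
M = np.abs(rng.normal(size=500_000))          # law of sup of BM on [0, 1]
c_B = np.maximum(b - M, 0.0)

def budget(u):                                # E_0[c_B ; c_B <= u] - delta
    return np.mean(c_B * (c_B <= u)) - delta

u_star = brentq(budget, 0.0, b)               # valid because E_0 c_B >= delta here
worst = np.mean(c_B <= u_star)                # = P_0(sup Z >= b - u*)
print("nominal ruin prob P_0(sup Z >= b):", 2 * (1 - norm.cdf(b)))
print("worst-case ruin prob             :", worst)
```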

95 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. Blanchet (Columbia U. and Stanford U.) 31 / 57

96 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. So use any coupling between evidence and P_0, or expert knowledge. Blanchet (Columbia U. and Stanford U.) 31 / 57

97 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. So use any coupling between evidence and P_0, or expert knowledge. We discuss choosing δ non-parametrically momentarily. Blanchet (Columbia U. and Stanford U.) 31 / 57

98 Application 1: Illustration of Coupling Given arrivals and claim sizes let Z(t) = m_2^{−1/2} Σ_{k=1}^{N(t)} (X_k − m_1) Blanchet (Columbia U. and Stanford U.) 32 / 57

99 Application 1: Coupling in Action Blanchet (Columbia U. and Stanford U.) 33 / 57

100 Application 1: Numerical Example Assume Poisson arrivals. Pareto claim sizes with index 2.2 (P(V > t) = 1/(1 + t)^{2.2}). Cost c(x, y) = d_J(x, y)^2 (note the power of 2). Used Algorithm 1 to calibrate (estimating means and variances from data). Table: b, P_0(Ruin) vs P_true(Ruin), and P_robust(Ruin) vs P_true(Ruin); numerical entries not reproduced. Blanchet (Columbia U. and Stanford U.) 34 / 57

101 Additional Applications: Multidimensional Ruin Problems contains more applications. Blanchet (Columbia U. and Stanford U.) 35 / 57

102 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Blanchet (Columbia U. and Stanford U.) 35 / 57

103 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric). Blanchet (Columbia U. and Stanford U.) 35 / 57

104 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric). Key insight: Geometry of target set often remains largely the same! Blanchet (Columbia U. and Stanford U.) 35 / 57

105 Connections to Distributionally Robust Optimization Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang 16) Highlight: Additional insights into why optimal transport... Blanchet (Columbia U. and Stanford U.) 36 / 57

106 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. Blanchet (Columbia U. and Stanford U.) 37 / 57

107 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. The least squares approach consists in estimating β via min_β E_{P_n}[(Y − β^T X)^2] = min_β (1/n) Σ_{i=1}^n (Y_i − β^T X_i)^2. Blanchet (Columbia U. and Stanford U.) 37 / 57

108 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. The least squares approach consists in estimating β via min_β E_{P_n}[(Y − β^T X)^2] = min_β (1/n) Σ_{i=1}^n (Y_i − β^T X_i)^2. Apply the distributionally robust estimator based on optimal transport. Blanchet (Columbia U. and Stanford U.) 37 / 57

109 Connection to Sqrt-Lasso Theorem (B., Kang, Murthy (2016)) Suppose that c((x, y), (x', y')) = ‖x − x'‖_q^2 if y = y', and ∞ if y ≠ y'. Then, if 1/p + 1/q = 1, max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p. Remark 1: This is sqrt-Lasso (Belloni et al. (2011)). Remark 2: Uses RoPA duality theorem & a judicious choice of c(·). Blanchet (Columbia U. and Stanford U.) 38 / 57
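The right-hand side is an ordinary finite-dimensional objective, so it can be fitted directly (an editorial sketch with p = 1, synthetic data and an illustrative δ; a generic optimizer stands in for a dedicated sqrt-Lasso solver).

```python
# Sketch: the worst-case objective from the theorem with p = 1 is the
# sqrt-Lasso criterion  sqrt(mean (Y - X beta)^2) + sqrt(delta) * ||beta||_1.
# Synthetic data and delta are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
Y = X @ beta_true + 0.5 * rng.normal(size=n)
delta = 1.0 / n

def dro_objective(beta):
    rmse = np.sqrt(np.mean((Y - X @ beta) ** 2))
    return rmse + np.sqrt(delta) * np.sum(np.abs(beta))

res = minimize(dro_objective, x0=np.zeros(d), method="Nelder-Mead",
               options={"maxiter": 20_000, "xatol": 1e-8, "fatol": 1e-8})
print("DRO / sqrt-Lasso estimate:", np.round(res.x, 3))
print("ordinary least squares   :", np.round(np.linalg.lstsq(X, Y, rcond=None)[0], 3))
```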

110 Connection to Regularized Logistic Regression Theorem (B., Kang, Murthy (2016)) Suppose that c((x, y), (x', y')) = ‖x − x'‖_q if y = y', and ∞ if y ≠ y'. Then, sup_{P: D_c(P, P_n) ≤ δ} E_P[log(1 + e^{−Y β^T X})] = E_{P_n}[log(1 + e^{−Y β^T X})] + δ ‖β‖_p. Remark 1: Approximate connection studied in Esfahani and Kuhn (2015). Blanchet (Columbia U. and Stanford U.) 39 / 57

111 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Blanchet (Columbia U. and Stanford U.) 40 / 57

112 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Blanchet (Columbia U. and Stanford U.) 40 / 57

113 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Blanchet (Columbia U. and Stanford U.) 40 / 57

114 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017): Blanchet (Columbia U. and Stanford U.) 40 / 57

115 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017): Semisupervised learning: B., and Kang (2016): Blanchet (Columbia U. and Stanford U.) 40 / 57

116 How Regularization and Dual Norms Arise? Let us work out a simple example... Blanchet (Columbia U. and Stanford U.) 41 / 57

117 How Regularization and Dual Norms Arise? Let us work out a simple example... Recall RoPA Duality: Pick c((x, y), (x', y')) = ‖(x, y) − (x', y')‖_q^2. Then max_{P: D_c(P, P_n) ≤ δ} E_P[((X, Y) · (β, −1))^2] = min_{λ ≥ 0} { λδ + E_{P_n}[ sup_{(x', y')} { ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 } ] } Blanchet (Columbia U. and Stanford U.) 41 / 57

118 How Regularization and Dual Norms Arise? Let us work out a simple example... Recall RoPA Duality: Pick c((x, y), (x', y')) = ‖(x, y) − (x', y')‖_q^2. Then max_{P: D_c(P, P_n) ≤ δ} E_P[((X, Y) · (β, −1))^2] = min_{λ ≥ 0} { λδ + E_{P_n}[ sup_{(x', y')} { ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 } ] } Let's focus on the inside E_{P_n}... Blanchet (Columbia U. and Stanford U.) 41 / 57

119 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Blanchet (Columbia U. and Stanford U.) 42 / 57

120 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Last equality uses that z → z^2 is symmetric around the origin and |a · b| ≤ ‖a‖_p ‖b‖_q. Blanchet (Columbia U. and Stanford U.) 42 / 57

121 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Last equality uses that z → z^2 is symmetric around the origin and |a · b| ≤ ‖a‖_p ‖b‖_q. Note the problem is now one-dimensional (easily computable). Blanchet (Columbia U. and Stanford U.) 42 / 57
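A small numeric check of this dimension reduction (an editorial sketch with p = q = 2 and illustrative values of e, (β, −1) and λ): the supremum over Δ is attained along the Hölder-aligned direction and agrees with the one-dimensional formula, while random directions never do better.

```python
# Numeric check of the dimension reduction (p = q = 2, so the Holder-attaining
# direction is simply b/||b||_2):
#   sup_Delta (e - Delta.b)^2 - lambda*||Delta||_2^2
#     = sup_{t >= 0} (|e| + t*||b||_2)^2 - lambda*t^2.
# The values of b (playing (beta, -1)), e and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(5)
b = np.array([0.7, -1.3, -1.0])                  # plays the role of (beta, -1)
e = 0.9                                          # plays the role of (X, Y).(beta, -1)
lam = 2.0 * np.dot(b, b)                         # lambda large enough for a finite sup

def obj(delta):
    return (e - delta @ b) ** 2 - lam * np.dot(delta, delta)

ts = np.linspace(0.0, 5.0, 2001)
one_dim = np.max((np.abs(e) + ts * np.linalg.norm(b)) ** 2 - lam * ts ** 2)
aligned = max(obj(-np.sign(e) * t * b / np.linalg.norm(b)) for t in ts)
random_best = max(obj(rng.normal(size=3) * rng.uniform(0.0, 5.0)) for _ in range(20_000))

print(one_dim, aligned, random_best)             # one_dim == aligned >= random_best
```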

122 On Role of Transport Cost... Data-driven choice of c(·). Blanchet (Columbia U. and Stanford U.) 43 / 57

123 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Blanchet (Columbia U. and Stanford U.) 43 / 57

124 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Blanchet (Columbia U. and Stanford U.) 43 / 57

125 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Intuition: Think of A diagonal, encoding the inverse variability of the X_i's... Blanchet (Columbia U. and Stanford U.) 43 / 57

126 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Intuition: Think of A diagonal, encoding the inverse variability of the X_i's... High variability → cheap transportation → high impact in risk estimation. Blanchet (Columbia U. and Stanford U.) 43 / 57

127 On Role of Transport Cost... Data-driven choice of c(·). Blanchet (Columbia U. and Stanford U.) 44 / 57

128 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Blanchet (Columbia U. and Stanford U.) 44 / 57

129 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Blanchet (Columbia U. and Stanford U.) 44 / 57

130 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Intuition: Think of Λ diagonal, encoding the inverse variability of the X_i's... Blanchet (Columbia U. and Stanford U.) 44 / 57

131 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Intuition: Think of Λ diagonal, encoding the inverse variability of the X_i's... High variability → cheap transportation → high impact in risk estimation. Blanchet (Columbia U. and Stanford U.) 44 / 57

132 On Role of Transport Cost... Comparing L_1 regularization vs data-driven cost regularization on real data sets (BC, BN, QSAR, Magic): training error, testing error, and accuracy for LRL1 and DRO-NL, together with the number of predictors and the train/test sizes. Table: Numerical results for real data sets (most numerical entries not reproduced). Blanchet (Columbia U. and Stanford U.) 45 / 57

133 Connections to Statistical Analysis Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang 16) Highlight: How to choose size of uncertainty? Blanchet (Columbia U. and Stanford U.) 46 / 57

134 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Blanchet (Columbia U. and Stanford U.) 47 / 57

135 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Blanchet (Columbia U. and Stanford U.) 47 / 57

136 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Use the left hand side to define a statistical principle to choose δ. Blanchet (Columbia U. and Stanford U.) 47 / 57

137 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Use the left hand side to define a statistical principle to choose δ. Important: Optimizing δ is equivalent to optimizing regularization! Blanchet (Columbia U. and Stanford U.) 47 / 57

138 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Blanchet (Columbia U. and Stanford U.) 48 / 57

139 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Blanchet (Columbia U. and Stanford U.) 48 / 57

140 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Not a good idea: rate of convergence of the form O(1/n^{1/d}) (d is the data dimension). Blanchet (Columbia U. and Stanford U.) 48 / 57

141 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Not a good idea: rate of convergence of the form O(1/n^{1/d}) (d is the data dimension). Instead we seek an optimal approach. Blanchet (Columbia U. and Stanford U.) 48 / 57

142 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. Blanchet (Columbia U. and Stanford U.) 49 / 57

143 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Blanchet (Columbia U. and Stanford U.) 49 / 57

144 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Given P ∈ U_δ(n), define β(P) = arg min_β E_P[(Y − β^T X)^2]. Blanchet (Columbia U. and Stanford U.) 49 / 57

145 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Given P ∈ U_δ(n), define β(P) = arg min_β E_P[(Y − β^T X)^2]. It is natural to say that Λ_δ(n) = {β(P) : P ∈ U_δ(n)} are plausible estimates of β_*. Blanchet (Columbia U. and Stanford U.) 49 / 57

146 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

147 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Equivalently: Find the smallest confidence region Λ_δ(n) at level 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

148 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Equivalently: Find the smallest confidence region Λ_δ(n) at level 1 − α. In simple words: Find the smallest δ so that β_* is plausible with confidence level 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

149 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Blanchet (Columbia U. and Stanford U.) 51 / 57

150 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Blanchet (Columbia U. and Stanford U.) 51 / 57

151 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Note that R_n(β_*) ≤ δ ⟺ β_* ∈ Λ_δ(n) = {β(P) : D_c(P, P_n) ≤ δ}. Blanchet (Columbia U. and Stanford U.) 51 / 57

152 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Note that R_n(β_*) ≤ δ ⟺ β_* ∈ Λ_δ(n) = {β(P) : D_c(P, P_n) ≤ δ}. So δ is the 1 − α quantile of R_n(β_*)! Blanchet (Columbia U. and Stanford U.) 51 / 57

153 The Robust Wasserstein Profile Function Blanchet (Columbia U. and Stanford U.) 52 / 57

154 Computing Optimal Regularization Parameter Theorem (B., Murthy, Kang (2016)) Suppose that {(Y_i, X_i)}_{i=1}^n is an i.i.d. sample with finite variance and that c((x, y), (x', y')) = ‖x − x'‖_q^2 if y = y', ∞ if y ≠ y'. Then n R_n(β_*) ⇒ L_1, where L_1 is explicit and is stochastically dominated by L_2, an expression involving E[e^2], (E|e|)^2 and ‖N(0, Cov(X))‖_q^2. Remark: We recover the same order of regularization (but L_1 gives the optimal constant!) Blanchet (Columbia U. and Stanford U.) 53 / 57

155 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. Blanchet (Columbia U. and Stanford U.) 54 / 57

156 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. Blanchet (Columbia U. and Stanford U.) 54 / 57

157 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. R_n(β_*) is inspired by Empirical Likelihood, Owen (1988). Blanchet (Columbia U. and Stanford U.) 54 / 57

158 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. R_n(β_*) is inspired by Empirical Likelihood, Owen (1988). Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus on divergence. Blanchet (Columbia U. and Stanford U.) 54 / 57

159 A Toy Example Illustrating Proof Techniques Consider min_β max_{P: D_c(P, P_n) ≤ δ} E[(Y − β)^2] with c(y, y') = |y − y'|^ρ and define R_n(β) = min{ ∫ |y − u|^ρ π(dy, du) : π(dy, du) ≥ 0, ∫_{u ∈ R} π(dy, du) = (1/n) Σ_i δ_{Y_i}(dy), ∫ 2(u − β) π(dy, du) = 0 }. Blanchet (Columbia U. and Stanford U.) 55 / 57

160 A Toy Example Illustrating Proof Techniques Dual linear programming problem: Plug in β = β_*: R_n(β_*) = sup_{λ ∈ R} { (1/n) Σ_{i=1}^n ( − sup_{u ∈ R} [ λ(u − β_*) − |Y_i − u|^ρ ] ) } = sup_{λ ∈ R} { −(λ/n) Σ_{i=1}^n (Y_i − β_*) − (1/n) Σ_{i=1}^n sup_{u ∈ R} [ λ(u − Y_i) − |Y_i − u|^ρ ] } = sup_λ { −(λ/n) Σ_{i=1}^n (Y_i − β_*) − (ρ − 1) |λ/ρ|^{ρ/(ρ−1)} } = | (1/n) Σ_{i=1}^n (Y_i − β_*) |^ρ ≈ | N(0, σ^2)/n^{1/2} |^ρ. Blanchet (Columbia U. and Stanford U.) 56 / 57
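A Monte Carlo check of this limit (an editorial sketch with ρ = 2 and illustrative n, σ): n · R_n(β_*) should then be approximately σ² times a chi-squared variable with one degree of freedom.

```python
# Monte Carlo check of the toy example with rho = 2:
# R_n(beta*) = | (1/n) sum (Y_i - beta*) |^2, so n * R_n(beta*) / sigma^2
# should be approximately chi^2 with one degree of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, sigma, beta_star, reps = 500, 2.0, 1.0, 20_000
Y = beta_star + sigma * rng.normal(size=(reps, n))

R_n = np.abs(Y.mean(axis=1) - beta_star) ** 2          # rho = 2
scaled = n * R_n / sigma ** 2

# Compare a few quantiles with the chi^2_1 limit.
for q in (0.5, 0.9, 0.95):
    print(q, np.quantile(scaled, q), chi2.ppf(q, df=1))
```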

161 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. Blanchet (Columbia U. and Stanford U.) 57 / 57

162 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. Blanchet (Columbia U. and Stanford U.) 57 / 57

163 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. Blanchet (Columbia U. and Stanford U.) 57 / 57

164 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Blanchet (Columbia U. and Stanford U.) 57 / 57

165 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Cost function in OT can be used to improve out-of-sample performance. Blanchet (Columbia U. and Stanford U.) 57 / 57

166 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Cost function in OT can be used to improve out-of-sample performance. OT can be used for statistical inference using RWP function. Blanchet (Columbia U. and Stanford U.) 57 / 57


More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood

Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood 1 / 29 Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood Yuan Liao Wenxin Jiang Northwestern University Presented at: Department of Statistics and Biostatistics Rutgers University

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Maxim Raginsky and Igal Sason ISIT 2013, Istanbul, Turkey Capacity-Achieving Channel Codes The set-up DMC

More information

Ambiguity in portfolio optimization

Ambiguity in portfolio optimization May/June 2006 Introduction: Risk and Ambiguity Frank Knight Risk, Uncertainty and Profit (1920) Risk: the decision-maker can assign mathematical probabilities to random phenomena Uncertainty: randomness

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Machine learning, shrinkage estimation, and economic theory

Machine learning, shrinkage estimation, and economic theory Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds.

Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds. Proceedings of the 204 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds. ROBUST RARE-EVENT PERFORMANCE ANALYSIS WITH NATURAL NON-CONVEX CONSTRAINTS

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs

More information

Optimal transportation and optimal control in a finite horizon framework

Optimal transportation and optimal control in a finite horizon framework Optimal transportation and optimal control in a finite horizon framework Guillaume Carlier and Aimé Lachapelle Université Paris-Dauphine, CEREMADE July 2008 1 MOTIVATIONS - A commitment problem (1) Micro

More information

Wasserstein GAN. Juho Lee. Jan 23, 2017

Wasserstein GAN. Juho Lee. Jan 23, 2017 Wasserstein GAN Juho Lee Jan 23, 2017 Wasserstein GAN (WGAN) Arxiv submission Martin Arjovsky, Soumith Chintala, and Léon Bottou A new GAN model minimizing the Earth-Mover s distance (Wasserstein-1 distance)

More information

Summary of the simplex method

Summary of the simplex method MVE165/MMG630, The simplex method; degeneracy; unbounded solutions; infeasibility; starting solutions; duality; interpretation Ann-Brith Strömberg 2012 03 16 Summary of the simplex method Optimality condition:

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Random Convex Approximations of Ambiguous Chance Constrained Programs

Random Convex Approximations of Ambiguous Chance Constrained Programs Random Convex Approximations of Ambiguous Chance Constrained Programs Shih-Hao Tseng Eilyan Bitar Ao Tang Abstract We investigate an approach to the approximation of ambiguous chance constrained programs

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning

Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning Short Course Robust Optimization and Machine Machine Lecture 6: Robust Optimization in Machine Laurent El Ghaoui EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Joint distribution optimal transportation for domain adaptation

Joint distribution optimal transportation for domain adaptation Joint distribution optimal transportation for domain adaptation Changhuang Wan Mechanical and Aerospace Engineering Department The Ohio State University March 8 th, 2018 Joint distribution optimal transportation

More information

Multivariate Topological Data Analysis

Multivariate Topological Data Analysis Cleveland State University November 20, 2008, Duke University joint work with Gunnar Carlsson (Stanford), Peter Kim and Zhiming Luo (Guelph), and Moo Chung (Wisconsin Madison) Framework Ideal Truth (Parameter)

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Robust linear optimization under general norms

Robust linear optimization under general norms Operations Research Letters 3 (004) 50 56 Operations Research Letters www.elsevier.com/locate/dsw Robust linear optimization under general norms Dimitris Bertsimas a; ;, Dessislava Pachamanova b, Melvyn

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Are You a Bayesian or a Frequentist?

Are You a Bayesian or a Frequentist? Are You a Bayesian or a Frequentist? Michael I. Jordan Department of EECS Department of Statistics University of California, Berkeley http://www.cs.berkeley.edu/ jordan 1 Statistical Inference Bayesian

More information

Generalized quantiles as risk measures

Generalized quantiles as risk measures Generalized quantiles as risk measures Bellini, Klar, Muller, Rosazza Gianin December 1, 2014 Vorisek Jan Introduction Quantiles q α of a random variable X can be defined as the minimizers of a piecewise

More information

Lecture 7: Dynamic panel models 2

Lecture 7: Dynamic panel models 2 Lecture 7: Dynamic panel models 2 Ragnar Nymoen Department of Economics, UiO 25 February 2010 Main issues and references The Arellano and Bond method for GMM estimation of dynamic panel data models A stepwise

More information

Distributionally Robust Convex Optimization

Distributionally Robust Convex Optimization Distributionally Robust Convex Optimization Wolfram Wiesemann 1, Daniel Kuhn 1, and Melvyn Sim 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Decision Sciences, National

More information

How much is information worth? A geometric insight using duality between payoffs and beliefs

How much is information worth? A geometric insight using duality between payoffs and beliefs How much is information worth? A geometric insight using duality between payoffs and beliefs Michel De Lara and Olivier Gossner CERMICS, École des Ponts ParisTech CNRS, X Paris Saclay and LSE Séminaire

More information

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality. CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Data-Driven Distributionally Robust Chance-Constrained Optimization with Wasserstein Metric

Data-Driven Distributionally Robust Chance-Constrained Optimization with Wasserstein Metric Data-Driven Distributionally Robust Chance-Constrained Optimization with asserstein Metric Ran Ji Department of System Engineering and Operations Research, George Mason University, rji2@gmu.edu; Miguel

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Cupid s Invisible Hand

Cupid s Invisible Hand Alfred Galichon Bernard Salanié Chicago, June 2012 Matching involves trade-offs Marriage partners vary across several dimensions, their preferences over partners also vary and the analyst can only observe

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

Topic 2: Logistic Regression

Topic 2: Logistic Regression CS 4850/6850: Introduction to Machine Learning Fall 208 Topic 2: Logistic Regression Instructor: Daniel L. Pimentel-Alarcón c Copyright 208 2. Introduction Arguably the simplest task that we can teach

More information

21.1 Lower bounds on minimax risk for functional estimation

21.1 Lower bounds on minimax risk for functional estimation ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will

More information

Optimal Transport and Wasserstein Distance

Optimal Transport and Wasserstein Distance Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information