Optimal Transport Methods in Operations Research and Statistics


1 Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and Department of IEOR). Blanchet (Columbia U. and Stanford U.) 1 / 57

2 Goal: Introduce optimal transport techniques and applications in OR & Statistics. Optimal transport is a useful tool in model robustness, equilibrium, and machine learning! Blanchet (Columbia U. and Stanford U.) 2 / 57

3 Agenda Introduction to Optimal Transport Blanchet (Columbia U. and Stanford U.) 3 / 57

4 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Blanchet (Columbia U. and Stanford U.) 3 / 57

5 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Blanchet (Columbia U. and Stanford U.) 3 / 57

6 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Applications in Distributionally Robust Optimization Blanchet (Columbia U. and Stanford U.) 3 / 57

7 Agenda Introduction to Optimal Transport Economic Interpretations and Wasserstein Distances Applications in Stochastic Operations Research Applications in Distributionally Robust Optimization Applications in Statistics Blanchet (Columbia U. and Stanford U.) 3 / 57

8 Introduction to Optimal Transport Monge-Kantorovich Problem & Duality (see e.g. C. Villani's 2008 textbook) Blanchet (Columbia U. and Stanford U.) 4 / 57

9 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? Blanchet (Columbia U. and Stanford U.) 5 / 57

10 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, Blanchet (Columbia U. and Stanford U.) 6 / 57

11 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. Blanchet (Columbia U. and Stanford U.) 6 / 57

12 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. T(X) ∼ ν means T(X) follows distribution ν(·). Blanchet (Columbia U. and Stanford U.) 6 / 57

13 Monge Problem What's the cheapest way to transport a pile of sand to cover a sinkhole? min_{T(·): T(X) ∼ ν} E_µ{c(X, T(X))}, where c(x, y) ≥ 0 is the cost of transporting x to y. T(X) ∼ ν means T(X) follows distribution ν(·). Problem is highly non-linear; not much progress for about 150 yrs! Blanchet (Columbia U. and Stanford U.) 6 / 57

14 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Blanchet (Columbia U. and Stanford U.) 7 / 57

15 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Blanchet (Columbia U. and Stanford U.) 7 / 57

16 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Linear programming (infinite dimensional): D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy) s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). Blanchet (Columbia U. and Stanford U.) 7 / 57

17 Kantorovich Relaxation: Primal Problem Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ, π_Y = marginal of Y = ν. Solve min{E_π[c(X, Y)] : π ∈ Π(µ, ν)} Linear programming (infinite dimensional): D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy) s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). If c(x, y) = d^p(x, y) (d a metric) then D_c^{1/p}(µ, ν) is a p-Wasserstein metric. Blanchet (Columbia U. and Stanford U.) 7 / 57
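As a quick illustration (an editorial sketch, not part of the talk), the finite-support version of this linear program can be handed to an off-the-shelf LP solver; the supports, marginals and cost matrix below are made-up toy data.

```python
# A minimal sketch of the Kantorovich primal LP on finite supports, solved
# with scipy.optimize.linprog.  The supports and marginals are illustrative.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0])          # support of mu
y = np.array([0.5, 1.5])               # support of nu
mu = np.array([0.2, 0.5, 0.3])         # source marginal
nu = np.array([0.6, 0.4])              # target marginal
C = np.abs(x[:, None] - y[None, :])    # cost c(x, y) = |x - y|

m, n = C.shape
# Equality constraints: row sums of pi equal mu, column sums equal nu.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0
for j in range(n):
    A_eq[m + j, j::n] = 1.0
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
pi = res.x.reshape(m, n)               # optimal coupling pi(x_i, y_j)
print("D_c(mu, nu) =", res.fun)
print(pi)
```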

18 Illustration of Optimal Transport Costs Monge's solution would take the form π(dx, dy) = δ_{T(x)}(dy) µ(dx). Blanchet (Columbia U. and Stanford U.) 8 / 57

19 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Blanchet (Columbia U. and Stanford U.) 9 / 57

20 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Blanchet (Columbia U. and Stanford U.) 9 / 57

21 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Dual α and β can be taken over continuous functions. Blanchet (Columbia U. and Stanford U.) 9 / 57

22 Kantorovich Relaxation: Dual Problem The primal always has a solution for c(·) ≥ 0 lower semicontinuous. Linear programming (Dual): sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y. Dual α and β can be taken over continuous functions. Complementary slackness: equality holds on the support of π (the primal optimizer). Blanchet (Columbia U. and Stanford U.) 9 / 57

23 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Blanchet (Columbia U. and Stanford U.) 10 / 57

24 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Blanchet (Columbia U. and Stanford U.) 10 / 57

25 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Blanchet (Columbia U. and Stanford U.) 10 / 57

26 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Now comes Maria, who has a business... Blanchet (Columbia U. and Stanford U.) 10 / 57

27 Kantorovich Relaxation: Primal Interpretation John wants to remove a pile of sand, µ(·). Peter wants to cover a sinkhole, ν(·). Cost for John and Peter to transport the sand to cover the sinkhole is D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy). Now comes Maria, who has a business... Maria promises to transport the whole amount on behalf of John and Peter. Blanchet (Columbia U. and Stanford U.) 10 / 57

28 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). Blanchet (Columbia U. and Stanford U.) 11 / 57

29 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Blanchet (Columbia U. and Stanford U.) 11 / 57

30 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Maria wishes to maximize her profit ∫ α(x) µ(dx) + ∫ β(y) ν(dy). Blanchet (Columbia U. and Stanford U.) 11 / 57

31 Kantorovich Relaxation: Primal Interpretation Maria charges John α(x) per unit of mass at x (similarly β(y) to Peter). For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y). Maria wishes to maximize her profit ∫ α(x) µ(dx) + ∫ β(y) ν(dy). Kantorovich duality says primal and dual optimal values coincide and (under mild regularity) α(x) = inf_y {c(x, y) − β(y)}, β(y) = inf_x {c(x, y) − α(x)}. Blanchet (Columbia U. and Stanford U.) 11 / 57

32 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Blanchet (Columbia U. and Stanford U.) 12 / 57

33 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Swap sup and inf using Sion's min-max theorem by a compactness argument and conclude. Blanchet (Columbia U. and Stanford U.) 12 / 57

34 Proof Techniques Suppose X and Y are compact. Write the primal as inf_{π ≥ 0} sup_{α,β} { ∫_{X×Y} c(x, y) π(dx, dy) − ∫_{X×Y} α(x) π(dx, dy) − ∫_{X×Y} β(y) π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) } Swap sup and inf using Sion's min-max theorem by a compactness argument and conclude. Significant amount of work needed to extend to general Polish spaces and construct the dual optimizers (primal a bit easier). Blanchet (Columbia U. and Stanford U.) 12 / 57

35 Application of Optimal Transport in Economics Economic Interpretations (see e.g. A. Galichon's 2016 textbook & McCann's 2013 notes). Blanchet (Columbia U. and Stanford U.) 13 / 57

36 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). Blanchet (Columbia U. and Stanford U.) 14 / 57

37 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). Blanchet (Columbia U. and Stanford U.) 14 / 57

38 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). Blanchet (Columbia U. and Stanford U.) 14 / 57

39 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). The salary of worker x is α(x) & the cost of technology y is β(y), with α(x) + β(y) ≥ Ψ(x, y). Blanchet (Columbia U. and Stanford U.) 14 / 57

40 Applications in Labor Markets Worker with skill x & company with technology y have surplus Ψ(x, y). The population of workers is given by µ(x). The population of companies is given by ν(y). The salary of worker x is α(x) & the cost of technology y is β(y), with α(x) + β(y) ≥ Ψ(x, y). Companies want to minimize total production cost ∫ α(x) µ(x) dx + ∫ β(y) ν(y) dy Blanchet (Columbia U. and Stanford U.) 14 / 57

41 Applications in Labor Markets Letting a central planner organize the labor market: Blanchet (Columbia U. and Stanford U.) 15 / 57

42 Applications in Labor Markets Letting a central planner organize the labor market: The planner wishes to maximize total surplus ∫ Ψ(x, y) π(dx, dy) Blanchet (Columbia U. and Stanford U.) 15 / 57

43 Applications in Labor Markets Letting a central planner organize the labor market: The planner wishes to maximize total surplus ∫ Ψ(x, y) π(dx, dy) over assignments π(·) which satisfy market clearing ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy). Blanchet (Columbia U. and Stanford U.) 15 / 57

44 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Blanchet (Columbia U. and Stanford U.) 16 / 57

45 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Blanchet (Columbia U. and Stanford U.) 16 / 57

46 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Consider max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n) s.t. Σ_j π(x_i^n, y_j^n) = 1/n for each x_i^n, Σ_i π(x_i^n, y_j^n) = 1/n for each y_j^n. Blanchet (Columbia U. and Stanford U.) 16 / 57

47 Solving for Optimal Transport Coupling Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0). Solve primal by sampling: Let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, and F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y) Consider max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n) s.t. Σ_j π(x_i^n, y_j^n) = 1/n for each x_i^n, Σ_i π(x_i^n, y_j^n) = 1/n for each y_j^n. Clearly, simply sort and match is the solution! Blanchet (Columbia U. and Stanford U.) 16 / 57
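A small numerical sanity check (an editorial sketch, not from the slides): for the supermodular surplus Ψ(x, y) = xy, matching sorted samples beats an arbitrary matching, in line with the sort-and-match claim. The Uniform/Exponential choices mirror the example above; the sample size is arbitrary.

```python
# Sketch: for the supermodular surplus Psi(x, y) = x*y, matching sorted
# samples (sort-and-match) dominates an arbitrary matching.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0.0, 1.0, n)      # X ~ Uniform(0, 1)
y = rng.exponential(1.0, n)       # Y ~ Exp(1)

sorted_match = np.sort(x) @ np.sort(y) / n     # comonotone assignment
random_match = x @ rng.permutation(y) / n      # arbitrary assignment

print("average surplus, sort-and-match:", sorted_match)
print("average surplus, random match  :", random_match)
# The rearrangement inequality guarantees sorted_match >= random_match.
```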

48 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). Blanchet (Columbia U. and Stanford U.) 17 / 57

49 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. Blanchet (Columbia U. and Stanford U.) 17 / 57

50 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. As n → ∞, X_{(⌈nt⌉)}^n → t, so Y_{(⌈nt⌉)}^n → −log(1 − t). Blanchet (Columbia U. and Stanford U.) 17 / 57

51 Solving for Optimal Transport Coupling Think of Y_j^n = −log(1 − U_j^n) for U_j^n's i.i.d. uniform(0, 1). The j-th order statistic X_{(j)}^n is matched to Y_{(j)}^n. As n → ∞, X_{(⌈nt⌉)}^n → t, so Y_{(⌈nt⌉)}^n → −log(1 − t). Thus, the optimal coupling as n → ∞ is X = U and Y = −log(1 − U) (comonotonic coupling). Blanchet (Columbia U. and Stanford U.) 17 / 57

52 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Blanchet (Columbia U. and Stanford U.) 18 / 57

53 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Blanchet (Columbia U. and Stanford U.) 18 / 57

54 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Corollary: Suppose c(x, y) = |x − y|; then X = F_µ^{-1}(U) and Y = F_ν^{-1}(U), thus D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{-1}(u) − F_ν^{-1}(u)| du. Blanchet (Columbia U. and Stanford U.) 18 / 57

55 Identities for Wasserstein Distances Comonotonic coupling is the solution if Ψ(x + h, y + k) + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + k) for h, k ≥ 0 (equivalently ∂²_{xy} Ψ(x, y) ≥ 0) - supermodularity. Or for costs c(x, y) = −Ψ(x, y), i.e. if ∂²_{xy} c(x, y) ≤ 0 (submodularity). Corollary: Suppose c(x, y) = |x − y|; then X = F_µ^{-1}(U) and Y = F_ν^{-1}(U), thus D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{-1}(u) − F_ν^{-1}(u)| du. Similar identities are common for Wasserstein distances... Blanchet (Columbia U. and Stanford U.) 18 / 57
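The one-dimensional identity lends itself to a direct empirical check (an editorial sketch with illustrative samples): the comonotone coupling just matches order statistics, and SciPy's wasserstein_distance returns the same number.

```python
# Sketch of D_c(F_mu, F_nu) = int_0^1 |F_mu^{-1}(u) - F_nu^{-1}(u)| du for
# c(x, y) = |x - y|, on empirical samples: match order statistics and average.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0.0, 1.0, n)            # sample from mu
y = rng.exponential(1.0, n)             # sample from nu

w1_quantile = np.mean(np.abs(np.sort(x) - np.sort(y)))   # quantile-coupling formula
w1_scipy = wasserstein_distance(x, y)

print(w1_quantile, w1_scipy)            # the two numbers should agree closely
```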

56 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. Blanchet (Columbia U. and Stanford U.) 19 / 57

57 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). Blanchet (Columbia U. and Stanford U.) 19 / 57

58 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). What if Ψ(x, y) → Ψ(x, y) + f(x)? (i.e. productivity grows). Blanchet (Columbia U. and Stanford U.) 19 / 57

59 Interesting Insight on Salary Effects In equilibrium, by the envelope theorem β'(y) = d/dy sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y. We also know that y = −log(1 − x), or x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy, so β(y) = y + exp(−y) − 1 + β(0). What if Ψ(x, y) → Ψ(x, y) + f(x)? (i.e. productivity grows). Answer: salaries grow if f(·) is increasing. Blanchet (Columbia U. and Stanford U.) 19 / 57
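A quick numeric verification of the displayed equilibrium prices (an editorial sketch that normalizes β(0) = 0): with y(x) = −log(1 − x), the pair α(x) = x·y(x) − β(y(x)) and β(y) = y + exp(−y) − 1 is dual feasible, α(x) + β(y) ≥ xy, with equality along the comonotone matching.

```python
# Numeric check (a sketch, taking beta(0) = 0): with y(x) = -log(1 - x),
# beta(y) = y + exp(-y) - 1 and alpha(x) = x*y(x) - beta(y(x)), the pair
# (alpha, beta) satisfies alpha(x) + beta(y) >= x*y, with equality on the
# matched pairs y = y(x).
import numpy as np

xs = np.linspace(0.0, 0.999, 400)
ys = np.linspace(0.0, 8.0, 400)

def beta(y):
    return y + np.exp(-y) - 1.0

y_of_x = -np.log(1.0 - xs)
alpha = xs * y_of_x - beta(y_of_x)

gap = alpha[:, None] + beta(ys)[None, :] - xs[:, None] * ys[None, :]
print("min gap over the grid:", gap.min())    # ~0 up to floating point, never negative
print("gap on matched pairs :", np.abs(alpha + beta(y_of_x) - xs * y_of_x).max())
```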

60 Applications of Optimal Transport in Stochastic OR Application of Optimal Transport in Stochastic OR - Blanchet and Murthy (2016). Insight: Diffusion approximations and optimal transport Blanchet (Columbia U. and Stanford U.) 20 / 57

61 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Blanchet (Columbia U. and Stanford U.) 21 / 57

62 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Blanchet (Columbia U. and Stanford U.) 21 / 57

63 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Model P_true might be unknown or too difficult to work with. Blanchet (Columbia U. and Stanford U.) 21 / 57

64 A Distributionally Robust Performance Analysis In Stochastic OR we are often interested in evaluating E_{P_true}(f(X)) for a complex model P_true. Moreover, we wish to control / optimize it: min_θ E_{P_true}(h(X, θ)). Model P_true might be unknown or too difficult to work with. So, we introduce a proxy P_0 which provides a good trade-off between tractability and model fidelity (e.g. Brownian motion for heavy-traffic approximations). Blanchet (Columbia U. and Stanford U.) 21 / 57

65 A Distributionally Robust Performance Analysis For f(·) upper semicontinuous with E_{P_0} f(X) < ∞, consider sup_{D_c(P, P_0) ≤ δ} E_P(f(Y)), where X takes values on a Polish space and c(·) is lower semicontinuous. Blanchet (Columbia U. and Stanford U.) 22 / 57

66 A Distributionally Robust Performance Analysis For f(·) upper semicontinuous with E_{P_0} f(X) < ∞, consider sup_{D_c(P, P_0) ≤ δ} E_P(f(Y)), where X takes values on a Polish space and c(·) is lower semicontinuous. Also an infinite dimensional linear program: sup ∫_{X×Y} f(y) π(dx, dy) s.t. ∫_{X×Y} c(x, y) π(dx, dy) ≤ δ, ∫_Y π(dx, dy) = P_0(dx). Blanchet (Columbia U. and Stanford U.) 22 / 57
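For intuition (an editorial sketch, not from the talk), here is a finite, discretized version of that linear program: P_0 is a small discrete distribution, the destinations y are restricted to a grid, and the transport budget enters as a single inequality constraint. All numbers and the choice of f are illustrative.

```python
# Sketch of the worst-case expectation as a finite LP: maximize E_P f(Y)
# over couplings pi(x_i, y_j) >= 0 with X-marginal P_0 and transport budget
# delta.  The destination grid is a discretization of the true problem.
import numpy as np
from scipy.optimize import linprog

x = np.array([-1.0, 0.0, 1.0]); p0 = np.array([0.25, 0.5, 0.25])   # proxy P_0
y = np.linspace(-3.0, 3.0, 61)                                      # destination grid
f = (y >= 1.5).astype(float)                                        # f(y) = I(y >= 1.5)
delta = 0.2
C = (x[:, None] - y[None, :]) ** 2                                  # c(x, y) = (x - y)^2

m, n = len(x), len(y)
A_eq = np.zeros((m, m * n)); b_eq = p0
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0                                # marginal of X is P_0
A_ub = C.ravel()[None, :]; b_ub = [delta]                           # transport budget
res = linprog(-np.tile(f, m), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
print("nominal E_0 f(X)   :", float(p0 @ (x >= 1.5)))
print("worst-case E_P f(Y):", -res.fun)
```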

67 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). Blanchet (Columbia U. and Stanford U.) 23 / 57

68 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. Blanchet (Columbia U. and Stanford U.) 23 / 57

69 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. We refer to this as RoPA Duality in this talk. Blanchet (Columbia U. and Stanford U.) 23 / 57

70 A Distributionally Robust Performance Analysis Formal duality: Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y). B. & Murthy (2016) - No duality gap: Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y { f(y) − λc(X, y) } ) ]. We refer to this as RoPA Duality in this talk. Let us consider the important case f(y) = I(y ∈ A) & c(x, x) = 0. Blanchet (Columbia U. and Stanford U.) 23 / 57

71 A Distributionally Robust Performance Analysis So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then Dual = inf_{λ ≥ 0} [ λδ + E_0(1 − λ c_A(X))^+ ] = P_0(c_A(X) ≤ 1/λ*). Blanchet (Columbia U. and Stanford U.) 24 / 57

72 A Distributionally Robust Performance Analysis So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then Dual = inf_{λ ≥ 0} [ λδ + E_0(1 − λ c_A(X))^+ ] = P_0(c_A(X) ≤ 1/λ*). If c_A(X) has a continuous distribution under P_0 & E_0(c_A(X)) ≥ δ, then λ* solves δ = E_0[c_A(X) I(c_A(X) ≤ 1/λ*)]. Blanchet (Columbia U. and Stanford U.) 24 / 57
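A sketch of the indicator-case formula (illustrative parameters, not from the talk): take P_0 = N(0, 1), A = [a, ∞) and c(x, y) = |x − y|, so that c_A(x) = (a − x)^+, and evaluate the one-dimensional dual by Monte Carlo.

```python
# Sketch of the RoPA dual for f(y) = I(y in A): with P_0 = N(0,1),
# A = [a, infinity) and c(x, y) = |x - y|, c_A(x) = max(a - x, 0), and the
# worst case over the Wasserstein ball is
#   inf_{lambda >= 0} [ lambda*delta + E_0 (1 - lambda*c_A(X))^+ ].
# The values a = 2.0 and delta = 0.05 are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
a, delta = 2.0, 0.05
x = rng.normal(size=200_000)
c_A = np.maximum(a - x, 0.0)

def dual(lmbda):
    return lmbda * delta + np.mean(np.maximum(1.0 - lmbda * c_A, 0.0))

res = minimize_scalar(dual, bounds=(0.0, 1e3), method="bounded")
print("nominal  P_0(X >= a)      :", 1.0 - norm.cdf(a))
print("worst-case P(Y in A) bound:", res.fun)
```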

73 Example: Model Uncertainty in Bankruptcy Calculations R (t) = the reserve (perhaps multiple lines) at time t. Blanchet (Columbia U. and Stanford U.) 25 / 57

74 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). Blanchet (Columbia U. and Stanford U.) 25 / 57

75 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). B is a set which models bankruptcy. Blanchet (Columbia U. and Stanford U.) 25 / 57

76 Example: Model Uncertainty in Bankruptcy Calculations R(t) = the reserve (perhaps multiple lines) at time t. Bankruptcy probability (in finite time horizon T): u_T = P_true(R(t) ∈ B for some t ∈ [0, T]). B is a set which models bankruptcy. Problem: Model (P_true) may be complex, intractable or simply unknown... Blanchet (Columbia U. and Stanford U.) 25 / 57

77 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. Blanchet (Columbia U. and Stanford U.) 26 / 57

78 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. Blanchet (Columbia U. and Stanford U.) 26 / 57

79 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. Blanchet (Columbia U. and Stanford U.) 26 / 57

80 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. δ is the distributional uncertainty size. Blanchet (Columbia U. and Stanford U.) 26 / 57

81 A Distributionally Robust Risk Analysis Formulation Our solution: Estimate u_T by solving sup_{D_c(P_0,P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]), where P_0 is a suitable model. P_0 = proxy for P_true. P_0 = right trade-off between fidelity and tractability. δ is the distributional uncertainty size. D_c(·) defines the distributional uncertainty region. Blanchet (Columbia U. and Stanford U.) 26 / 57

82 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Blanchet (Columbia U. and Stanford U.) 27 / 57

83 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Blanchet (Columbia U. and Stanford U.) 27 / 57

84 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Want to preserve advantages of using P_0. Blanchet (Columbia U. and Stanford U.) 27 / 57

85 Desirable Elements of Distributionally Robust Formulation Would like D_c(·) to have wide flexibility (even non-parametric). Want optimization to be tractable. Want to preserve advantages of using P_0. Want a way to estimate δ. Blanchet (Columbia U. and Stanford U.) 27 / 57

86 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Blanchet (Columbia U. and Stanford U.) 28 / 57

87 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Blanchet (Columbia U. and Stanford U.) 28 / 57

88 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Blanchet (Columbia U. and Stanford U.) 28 / 57

89 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Think of using Brownian motion as a proxy model for R(t)... Blanchet (Columbia U. and Stanford U.) 28 / 57

90 Connections to Distributionally Robust Optimization Standard choices based on divergence (such as Kullback-Leibler) - Hansen & Sargent (2016): D(ν ∥ µ) = E_ν(log(dν/dµ)). Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009). Big problem: Absolute continuity may typically be violated... Think of using Brownian motion as a proxy model for R(t)... Optimal transport is a natural option! Blanchet (Columbia U. and Stanford U.) 28 / 57

91 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. Blanchet (Columbia U. and Stanford U.) 29 / 57

92 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. If R(t) = b − Z(t), then ruin during the time interval [0, 1] is B_b = {R(·) : inf_{t ∈ [0,1]} R(t) ≤ 0} = {Z(·) : sup_{t ∈ [0,1]} Z(t) ≥ b}. Blanchet (Columbia U. and Stanford U.) 29 / 57

93 Application 1: Back to Classical Risk Problem Suppose that c(x, y) = d_J(x(·), y(·)) = Skorokhod J_1 metric = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }. If R(t) = b − Z(t), then ruin during the time interval [0, 1] is B_b = {R(·) : inf_{t ∈ [0,1]} R(t) ≤ 0} = {Z(·) : sup_{t ∈ [0,1]} Z(t) ≥ b}. Let P_0(·) be the Wiener measure; we want to compute sup_{D_c(P_0,P) ≤ δ} P(Z ∈ B_b). Blanchet (Columbia U. and Stanford U.) 29 / 57

94 Application 1: Computing Distance to Bankruptcy So: {c_{B_b}(Z) ≤ 1/λ} = {sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ}, and sup_{D_c(P_0,P) ≤ δ} P(Z ∈ B_b) = P_0( sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ* ). Blanchet (Columbia U. and Stanford U.) 30 / 57
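An illustrative computation of this worst-case ruin probability (an editorial sketch; b and δ are made up): for a standard Brownian proxy on [0, 1], sup_t Z(t) has the law of |N(0, 1)|, c_{B_b}(Z) = (b − sup_t Z(t))^+, and λ* follows from the calibration equation of the previous slide.

```python
# Sketch of the worst-case ruin probability when P_0 is the Wiener measure.
# For standard Brownian motion on [0,1], sup_{t<=1} Z(t) =d |N(0,1)| and
# c_{B_b}(Z) = (b - sup_t Z(t))^+.  We locate u = 1/lambda* from
#   delta = E_0[ c_{B_b} I(c_{B_b} <= u) ]
# and report P_0(sup Z >= b - u).  b and delta are illustrative.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rng = np.random.default_rng(3)
b, delta = 3.0, 0.05
M = np.abs(rng.normal(size=500_000))          # law of sup of BM on [0, 1]
c_B = np.maximum(b - M, 0.0)

def budget(u):                                # E_0[c_B ; c_B <= u] - delta
    return np.mean(c_B * (c_B <= u)) - delta

u_star = brentq(budget, 0.0, b)               # valid because E_0 c_B >= delta here
worst = np.mean(c_B <= u_star)                # = P_0(sup Z >= b - u*)
print("nominal ruin prob P_0(sup Z >= b):", 2 * (1 - norm.cdf(b)))
print("worst-case ruin prob             :", worst)
```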

95 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. Blanchet (Columbia U. and Stanford U.) 31 / 57

96 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. So use any coupling between evidence and P_0, or expert knowledge. Blanchet (Columbia U. and Stanford U.) 31 / 57

97 Application 1: Computing Uncertainty Size Note any coupling π such that π_X = P_0 and π_Y = P satisfies D_c(P_0, P) ≤ E_π[c(X, Y)]; choosing δ ≥ E_π[c(X, Y)] guarantees D_c(P_0, P) ≤ δ. So use any coupling between evidence and P_0, or expert knowledge. We discuss choosing δ non-parametrically momentarily. Blanchet (Columbia U. and Stanford U.) 31 / 57

98 Application 1: Illustration of Coupling Given arrivals and claim sizes let Z(t) = m_2^{−1/2} Σ_{k=1}^{N(t)} (X_k − m_1) Blanchet (Columbia U. and Stanford U.) 32 / 57

99 Application 1: Coupling in Action Blanchet (Columbia U. and Stanford U.) 33 / 57

100 Application 1: Numerical Example Assume Poisson arrivals. Pareto claim sizes with index 2.2 (P(V > t) = 1/(1 + t)^{2.2}). Cost c(x, y) = d_J(x, y)^2 (note the power of 2). Used Algorithm 1 to calibrate (estimating means and variances from data). Table: b, P_0(Ruin) vs P_true(Ruin), and P_robust(Ruin) vs P_true(Ruin); numerical entries not reproduced. Blanchet (Columbia U. and Stanford U.) 34 / 57

101 Additional Applications: Multidimensional Ruin Problems contains more applications. Blanchet (Columbia U. and Stanford U.) 35 / 57

102 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Blanchet (Columbia U. and Stanford U.) 35 / 57

103 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric). Blanchet (Columbia U. and Stanford U.) 35 / 57

104 Additional Applications: Multidimensional Ruin Problems contains more applications. Control: min_θ sup_{P: D(P,P_0) ≤ δ} E[L(θ, Z)] → robust optimal reinsurance. Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric). Key insight: Geometry of target set often remains largely the same! Blanchet (Columbia U. and Stanford U.) 35 / 57

105 Connections to Distributionally Robust Optimization Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang 16) Highlight: Additional insights into why optimal transport... Blanchet (Columbia U. and Stanford U.) 36 / 57

106 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. Blanchet (Columbia U. and Stanford U.) 37 / 57

107 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. The least squares approach consists in estimating β via min_β E_{P_n}[(Y − β^T X)^2] = min_β (1/n) Σ_{i=1}^n (Y_i − β^T X_i)^2. Blanchet (Columbia U. and Stanford U.) 37 / 57

108 Distributionally Robust Optimization in Machine Learning Consider estimating β ∈ R^m in linear regression Y_i = β^T X_i + e_i, where {(Y_i, X_i)}_{i=1}^n are data points. The least squares approach consists in estimating β via min_β E_{P_n}[(Y − β^T X)^2] = min_β (1/n) Σ_{i=1}^n (Y_i − β^T X_i)^2. Apply the distributionally robust estimator based on optimal transport. Blanchet (Columbia U. and Stanford U.) 37 / 57

109 Connection to Sqrt-Lasso Theorem (B., Kang, Murthy (2016)) Suppose that c((x, y), (x', y')) = ‖x − x'‖_q^2 if y = y', and ∞ if y ≠ y'. Then, if 1/p + 1/q = 1, max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p. Remark 1: This is sqrt-Lasso (Belloni et al. (2011)). Remark 2: Uses RoPA duality theorem & a judicious choice of c(·). Blanchet (Columbia U. and Stanford U.) 38 / 57
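The right-hand side is an ordinary finite-dimensional objective, so it can be fitted directly (an editorial sketch with p = 1, synthetic data and an illustrative δ; a generic optimizer stands in for a dedicated sqrt-Lasso solver).

```python
# Sketch: the worst-case objective from the theorem with p = 1 is the
# sqrt-Lasso criterion  sqrt(mean (Y - X beta)^2) + sqrt(delta) * ||beta||_1.
# Synthetic data and delta are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
Y = X @ beta_true + 0.5 * rng.normal(size=n)
delta = 1.0 / n

def dro_objective(beta):
    rmse = np.sqrt(np.mean((Y - X @ beta) ** 2))
    return rmse + np.sqrt(delta) * np.sum(np.abs(beta))

res = minimize(dro_objective, x0=np.zeros(d), method="Nelder-Mead",
               options={"maxiter": 20_000, "xatol": 1e-8, "fatol": 1e-8})
print("DRO / sqrt-Lasso estimate:", np.round(res.x, 3))
print("ordinary least squares   :", np.round(np.linalg.lstsq(X, Y, rcond=None)[0], 3))
```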

110 Connection to Regularized Logistic Regression Theorem (B., Kang, Murthy (2016)) Suppose that c((x, y), (x', y')) = ‖x − x'‖_q if y = y', and ∞ if y ≠ y'. Then, sup_{P: D_c(P, P_n) ≤ δ} E_P[log(1 + e^{−Y β^T X})] = E_{P_n}[log(1 + e^{−Y β^T X})] + δ ‖β‖_p. Remark 1: Approximate connection studied in Esfahani and Kuhn (2015). Blanchet (Columbia U. and Stanford U.) 39 / 57

111 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Blanchet (Columbia U. and Stanford U.) 40 / 57

112 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Blanchet (Columbia U. and Stanford U.) 40 / 57

113 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Blanchet (Columbia U. and Stanford U.) 40 / 57

114 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017): Blanchet (Columbia U. and Stanford U.) 40 / 57

115 Unification and Extensions of Regularized Estimators Distributionally Robust Optimization using Optimal Transport recovers many other estimators... Support Vector Machines: B., Kang, Murthy (2016) - Group Lasso: B., & Kang (2016): Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017): Semisupervised learning: B., and Kang (2016): Blanchet (Columbia U. and Stanford U.) 40 / 57

116 How Regularization and Dual Norms Arise? Let us work out a simple example... Blanchet (Columbia U. and Stanford U.) 41 / 57

117 How Regularization and Dual Norms Arise? Let us work out a simple example... Recall RoPA Duality: Pick c((x, y), (x', y')) = ‖(x, y) − (x', y')‖_q^2. Then max_{P: D_c(P, P_n) ≤ δ} E_P[((X, Y) · (β, −1))^2] = min_{λ ≥ 0} { λδ + E_{P_n}[ sup_{(x', y')} { ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 } ] } Blanchet (Columbia U. and Stanford U.) 41 / 57

118 How Regularization and Dual Norms Arise? Let us work out a simple example... Recall RoPA Duality: Pick c((x, y), (x', y')) = ‖(x, y) − (x', y')‖_q^2. Then max_{P: D_c(P, P_n) ≤ δ} E_P[((X, Y) · (β, −1))^2] = min_{λ ≥ 0} { λδ + E_{P_n}[ sup_{(x', y')} { ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 } ] } Let's focus on the inside E_{P_n}... Blanchet (Columbia U. and Stanford U.) 41 / 57

119 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Blanchet (Columbia U. and Stanford U.) 42 / 57

120 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Last equality uses that z → z^2 is symmetric around the origin and |a · b| ≤ ‖a‖_p ‖b‖_q. Blanchet (Columbia U. and Stanford U.) 42 / 57

121 How Regularization and Dual Norms Arise? Let Δ = (X, Y) − (x', y'). Then sup_{(x', y')} [ ((x', y') · (β, −1))^2 − λ ‖(X, Y) − (x', y')‖_q^2 ] = sup_Δ [ ((X, Y) · (β, −1) − Δ · (β, −1))^2 − λ ‖Δ‖_q^2 ] = sup_{‖Δ‖_q} [ ( |(X, Y) · (β, −1)| + ‖Δ‖_q ‖(β, −1)‖_p )^2 − λ ‖Δ‖_q^2 ] Last equality uses that z → z^2 is symmetric around the origin and |a · b| ≤ ‖a‖_p ‖b‖_q. Note the problem is now one-dimensional (easily computable). Blanchet (Columbia U. and Stanford U.) 42 / 57
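A small numeric check of this dimension reduction (an editorial sketch with p = q = 2 and illustrative values of e, (β, −1) and λ): the supremum over Δ is attained along the Hölder-aligned direction and agrees with the one-dimensional formula, while random directions never do better.

```python
# Numeric check of the dimension reduction (p = q = 2, so the Holder-attaining
# direction is simply b/||b||_2):
#   sup_Delta (e - Delta.b)^2 - lambda*||Delta||_2^2
#     = sup_{t >= 0} (|e| + t*||b||_2)^2 - lambda*t^2.
# The values of b (playing (beta, -1)), e and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(5)
b = np.array([0.7, -1.3, -1.0])                  # plays the role of (beta, -1)
e = 0.9                                          # plays the role of (X, Y).(beta, -1)
lam = 2.0 * np.dot(b, b)                         # lambda large enough for a finite sup

def obj(delta):
    return (e - delta @ b) ** 2 - lam * np.dot(delta, delta)

ts = np.linspace(0.0, 5.0, 2001)
one_dim = np.max((np.abs(e) + ts * np.linalg.norm(b)) ** 2 - lam * ts ** 2)
aligned = max(obj(-np.sign(e) * t * b / np.linalg.norm(b)) for t in ts)
random_best = max(obj(rng.normal(size=3) * rng.uniform(0.0, 5.0)) for _ in range(20_000))

print(one_dim, aligned, random_best)             # one_dim == aligned >= random_best
```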

122 On Role of Transport Cost... Data-driven choice of c(·). Blanchet (Columbia U. and Stanford U.) 43 / 57

123 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Blanchet (Columbia U. and Stanford U.) 43 / 57

124 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Blanchet (Columbia U. and Stanford U.) 43 / 57

125 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Intuition: Think of A diagonal, encoding the inverse variability of the X_i's... Blanchet (Columbia U. and Stanford U.) 43 / 57

126 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_A^2 = (x − x')^T A (x − x') with A positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{A^{-1}} }. Intuition: Think of A diagonal, encoding the inverse variability of the X_i's... High variability → cheap transportation → high impact in risk estimation. Blanchet (Columbia U. and Stanford U.) 43 / 57

127 On Role of Transport Cost... Data-driven choice of c(·). Blanchet (Columbia U. and Stanford U.) 44 / 57

128 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Blanchet (Columbia U. and Stanford U.) 44 / 57

129 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Blanchet (Columbia U. and Stanford U.) 44 / 57

130 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Intuition: Think of Λ diagonal, encoding the inverse variability of the X_i's... Blanchet (Columbia U. and Stanford U.) 44 / 57

131 On Role of Transport Cost... Data-driven choice of c(·). Suppose that ‖x − x'‖_Λ^2 = (x − x')^T Λ (x − x') with Λ positive definite (Mahalanobis distance). Then, min_β max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_{Λ^{-1}} }. Intuition: Think of Λ diagonal, encoding the inverse variability of the X_i's... High variability → cheap transportation → high impact in risk estimation. Blanchet (Columbia U. and Stanford U.) 44 / 57

132 On Role of Transport Cost... Comparing L_1 regularization vs data-driven cost regularization on real data sets (BC, BN, QSAR, Magic): training error, testing error, and accuracy for LRL1 and DRO-NL, together with the number of predictors and the train/test sizes. Table: Numerical results for real data sets (most numerical entries not reproduced). Blanchet (Columbia U. and Stanford U.) 45 / 57

133 Connections to Statistical Analysis Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang 16) Highlight: How to choose size of uncertainty? Blanchet (Columbia U. and Stanford U.) 46 / 57

134 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Blanchet (Columbia U. and Stanford U.) 47 / 57

135 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Blanchet (Columbia U. and Stanford U.) 47 / 57

136 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Use the left hand side to define a statistical principle to choose δ. Blanchet (Columbia U. and Stanford U.) 47 / 57

137 Towards an Optimal Choice of Uncertainty Size How to choose uncertainty size in a data-driven way? Once again, consider Lasso as example: min_β max_{P: D_c(P, P_n) ≤ δ} E_P[(Y − β^T X)^2] = min_β { E_{P_n}^{1/2}[(Y − β^T X)^2] + √δ ‖β‖_p }^2. Use the left hand side to define a statistical principle to choose δ. Important: Optimizing δ is equivalent to optimizing regularization! Blanchet (Columbia U. and Stanford U.) 47 / 57

138 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Blanchet (Columbia U. and Stanford U.) 48 / 57

139 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Blanchet (Columbia U. and Stanford U.) 48 / 57

140 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Not a good idea: rate of convergence of the form O(1/n^{1/d}) (d is the data dimension). Blanchet (Columbia U. and Stanford U.) 48 / 57

141 Towards an Optimal Choice of Uncertainty Size Standard way to pick δ (Esfahani and Kuhn (2015)). Estimate D(P_true, P_n) using concentration of measure results. Not a good idea: rate of convergence of the form O(1/n^{1/d}) (d is the data dimension). Instead we seek an optimal approach. Blanchet (Columbia U. and Stanford U.) 48 / 57

142 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. Blanchet (Columbia U. and Stanford U.) 49 / 57

143 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Blanchet (Columbia U. and Stanford U.) 49 / 57

144 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Given P ∈ U_δ(n), define β(P) = arg min_β E_P[(Y − β^T X)^2]. Blanchet (Columbia U. and Stanford U.) 49 / 57

145 Towards an Optimal Choice of Uncertainty Size Keep in mind the linear regression problem Y_i = β_*^T X_i + ε_i. The plausible model variations of P_n are given by the set U_δ(n) = {P : D_c(P, P_n) ≤ δ}. Given P ∈ U_δ(n), define β(P) = arg min_β E_P[(Y − β^T X)^2]. It is natural to say that Λ_δ(n) = {β(P) : P ∈ U_δ(n)} are plausible estimates of β_*. Blanchet (Columbia U. and Stanford U.) 49 / 57

146 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

147 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Equivalently: Find the smallest confidence region Λ_δ(n) at level 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

148 Optimal Choice of Uncertainty Size Given a confidence level 1 − α we advocate choosing δ via min δ s.t. P(β_* ∈ Λ_δ(n)) ≥ 1 − α. Equivalently: Find the smallest confidence region Λ_δ(n) at level 1 − α. In simple words: Find the smallest δ so that β_* is plausible with confidence level 1 − α. Blanchet (Columbia U. and Stanford U.) 50 / 57

149 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Blanchet (Columbia U. and Stanford U.) 51 / 57

150 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Blanchet (Columbia U. and Stanford U.) 51 / 57

151 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Note that R_n(β_*) ≤ δ ⟺ β_* ∈ Λ_δ(n) = {β(P) : D_c(P, P_n) ≤ δ}. Blanchet (Columbia U. and Stanford U.) 51 / 57

152 The Robust Wasserstein Profile Function The value β(P) is characterized by E_P[∇_β (Y − β^T X)^2] = −2 E_P[(Y − β^T X) X] = 0. Define the Robust Wasserstein Profile (RWP) Function: R_n(β) = min{ D_c(P, P_n) : E_P[(Y − β^T X) X] = 0 }. Note that R_n(β_*) ≤ δ ⟺ β_* ∈ Λ_δ(n) = {β(P) : D_c(P, P_n) ≤ δ}. So δ is the 1 − α quantile of R_n(β_*)! Blanchet (Columbia U. and Stanford U.) 51 / 57

153 The Robust Wasserstein Profile Function Blanchet (Columbia U. and Stanford U.) 52 / 57

154 Computing Optimal Regularization Parameter Theorem (B., Murthy, Kang (2016)) Suppose that {(Y_i, X_i)}_{i=1}^n is an i.i.d. sample with finite variance and that c((x, y), (x', y')) = ‖x − x'‖_q^2 if y = y', ∞ if y ≠ y'. Then n R_n(β_*) ⇒ L_1, where L_1 is explicit and is stochastically dominated by L_2, an expression involving E[e^2], (E|e|)^2 and ‖N(0, Cov(X))‖_q^2. Remark: We recover the same order of regularization (but L_1 gives the optimal constant!) Blanchet (Columbia U. and Stanford U.) 53 / 57

155 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. Blanchet (Columbia U. and Stanford U.) 54 / 57

156 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. Blanchet (Columbia U. and Stanford U.) 54 / 57

157 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. R_n(β_*) is inspired by Empirical Likelihood, Owen (1988). Blanchet (Columbia U. and Stanford U.) 54 / 57

158 Discussion on Optimal Uncertainty Size Optimal δ is of order O(1/n) as opposed to O(1/n^{1/d}) as advocated in the standard approach. We characterize the asymptotic constant (not only the order) in the optimal regularization: P(L_1 ≤ η_{1−α}) = 1 − α. R_n(β_*) is inspired by Empirical Likelihood, Owen (1988). Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus on divergence. Blanchet (Columbia U. and Stanford U.) 54 / 57

159 A Toy Example Illustrating Proof Techniques Consider min_β max_{P: D_c(P, P_n) ≤ δ} E[(Y − β)^2] with c(y, y') = |y − y'|^ρ and define R_n(β) = min{ ∫ |y − u|^ρ π(dy, du) : π(dy, du) ≥ 0, ∫_{u ∈ R} π(dy, du) = (1/n) Σ_i δ_{Y_i}(dy), ∫ 2(u − β) π(dy, du) = 0 }. Blanchet (Columbia U. and Stanford U.) 55 / 57

160 A Toy Example Illustrating Proof Techniques Dual linear programming problem: Plug in β = β_*: R_n(β_*) = sup_{λ ∈ R} { (1/n) Σ_{i=1}^n ( − sup_{u ∈ R} [ λ(u − β_*) − |Y_i − u|^ρ ] ) } = sup_{λ ∈ R} { −(λ/n) Σ_{i=1}^n (Y_i − β_*) − (1/n) Σ_{i=1}^n sup_{u ∈ R} [ λ(u − Y_i) − |Y_i − u|^ρ ] } = sup_λ { −(λ/n) Σ_{i=1}^n (Y_i − β_*) − (ρ − 1) |λ/ρ|^{ρ/(ρ−1)} } = | (1/n) Σ_{i=1}^n (Y_i − β_*) |^ρ ≈ | N(0, σ^2)/n^{1/2} |^ρ. Blanchet (Columbia U. and Stanford U.) 56 / 57
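A Monte Carlo check of this limit (an editorial sketch with ρ = 2 and illustrative n, σ): n · R_n(β_*) should then be approximately σ² times a chi-squared variable with one degree of freedom.

```python
# Monte Carlo check of the toy example with rho = 2:
# R_n(beta*) = | (1/n) sum (Y_i - beta*) |^2, so n * R_n(beta*) / sigma^2
# should be approximately chi^2 with one degree of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, sigma, beta_star, reps = 500, 2.0, 1.0, 20_000
Y = beta_star + sigma * rng.normal(size=(reps, n))

R_n = np.abs(Y.mean(axis=1) - beta_star) ** 2          # rho = 2
scaled = n * R_n / sigma ** 2

# Compare a few quantiles with the chi^2_1 limit.
for q in (0.5, 0.9, 0.95):
    print(q, np.quantile(scaled, q), chi2.ppf(q, df=1))
```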

161 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. Blanchet (Columbia U. and Stanford U.) 57 / 57

162 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. Blanchet (Columbia U. and Stanford U.) 57 / 57

163 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. Blanchet (Columbia U. and Stanford U.) 57 / 57

164 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Blanchet (Columbia U. and Stanford U.) 57 / 57

165 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Cost function in OT can be used to improve out-of-sample performance. Blanchet (Columbia U. and Stanford U.) 57 / 57

166 Conclusions Optimal transport (OT) is a powerful tool based on linear programming. OT costs are natural for computing model uncertainty. OT can be used in path-space to quantify error in diffusion approximations. OT can be used for data-driven distributionally robust optimization. Cost function in OT can be used to improve out-of-sample performance. OT can be used for statistical inference using RWP function. Blanchet (Columbia U. and Stanford U.) 57 / 57


More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood

Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood 1 / 29 Bayesian Analysis of Risk for Data Mining Based on Empirical Likelihood Yuan Liao Wenxin Jiang Northwestern University Presented at: Department of Statistics and Biostatistics Rutgers University

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Maxim Raginsky and Igal Sason ISIT 2013, Istanbul, Turkey Capacity-Achieving Channel Codes The set-up DMC

More information

Ambiguity in portfolio optimization

Ambiguity in portfolio optimization May/June 2006 Introduction: Risk and Ambiguity Frank Knight Risk, Uncertainty and Profit (1920) Risk: the decision-maker can assign mathematical probabilities to random phenomena Uncertainty: randomness

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Machine learning, shrinkage estimation, and economic theory

Machine learning, shrinkage estimation, and economic theory Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds.

Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds. Proceedings of the 204 Winter Simulation Conference A. Tolk, S. D. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds. ROBUST RARE-EVENT PERFORMANCE ANALYSIS WITH NATURAL NON-CONVEX CONSTRAINTS

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs

More information

Optimal transportation and optimal control in a finite horizon framework

Optimal transportation and optimal control in a finite horizon framework Optimal transportation and optimal control in a finite horizon framework Guillaume Carlier and Aimé Lachapelle Université Paris-Dauphine, CEREMADE July 2008 1 MOTIVATIONS - A commitment problem (1) Micro

More information

Wasserstein GAN. Juho Lee. Jan 23, 2017

Wasserstein GAN. Juho Lee. Jan 23, 2017 Wasserstein GAN Juho Lee Jan 23, 2017 Wasserstein GAN (WGAN) Arxiv submission Martin Arjovsky, Soumith Chintala, and Léon Bottou A new GAN model minimizing the Earth-Mover s distance (Wasserstein-1 distance)

More information

Summary of the simplex method

Summary of the simplex method MVE165/MMG630, The simplex method; degeneracy; unbounded solutions; infeasibility; starting solutions; duality; interpretation Ann-Brith Strömberg 2012 03 16 Summary of the simplex method Optimality condition:

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Random Convex Approximations of Ambiguous Chance Constrained Programs

Random Convex Approximations of Ambiguous Chance Constrained Programs Random Convex Approximations of Ambiguous Chance Constrained Programs Shih-Hao Tseng Eilyan Bitar Ao Tang Abstract We investigate an approach to the approximation of ambiguous chance constrained programs

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning

Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning Short Course Robust Optimization and Machine Machine Lecture 6: Robust Optimization in Machine Laurent El Ghaoui EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Joint distribution optimal transportation for domain adaptation

Joint distribution optimal transportation for domain adaptation Joint distribution optimal transportation for domain adaptation Changhuang Wan Mechanical and Aerospace Engineering Department The Ohio State University March 8 th, 2018 Joint distribution optimal transportation

More information

Multivariate Topological Data Analysis

Multivariate Topological Data Analysis Cleveland State University November 20, 2008, Duke University joint work with Gunnar Carlsson (Stanford), Peter Kim and Zhiming Luo (Guelph), and Moo Chung (Wisconsin Madison) Framework Ideal Truth (Parameter)

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Robust linear optimization under general norms

Robust linear optimization under general norms Operations Research Letters 3 (004) 50 56 Operations Research Letters www.elsevier.com/locate/dsw Robust linear optimization under general norms Dimitris Bertsimas a; ;, Dessislava Pachamanova b, Melvyn

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Are You a Bayesian or a Frequentist?

Are You a Bayesian or a Frequentist? Are You a Bayesian or a Frequentist? Michael I. Jordan Department of EECS Department of Statistics University of California, Berkeley http://www.cs.berkeley.edu/ jordan 1 Statistical Inference Bayesian

More information

Generalized quantiles as risk measures

Generalized quantiles as risk measures Generalized quantiles as risk measures Bellini, Klar, Muller, Rosazza Gianin December 1, 2014 Vorisek Jan Introduction Quantiles q α of a random variable X can be defined as the minimizers of a piecewise

More information

Lecture 7: Dynamic panel models 2

Lecture 7: Dynamic panel models 2 Lecture 7: Dynamic panel models 2 Ragnar Nymoen Department of Economics, UiO 25 February 2010 Main issues and references The Arellano and Bond method for GMM estimation of dynamic panel data models A stepwise

More information

Distributionally Robust Convex Optimization

Distributionally Robust Convex Optimization Distributionally Robust Convex Optimization Wolfram Wiesemann 1, Daniel Kuhn 1, and Melvyn Sim 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Decision Sciences, National

More information

How much is information worth? A geometric insight using duality between payoffs and beliefs

How much is information worth? A geometric insight using duality between payoffs and beliefs How much is information worth? A geometric insight using duality between payoffs and beliefs Michel De Lara and Olivier Gossner CERMICS, École des Ponts ParisTech CNRS, X Paris Saclay and LSE Séminaire

More information

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality. CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Data-Driven Distributionally Robust Chance-Constrained Optimization with Wasserstein Metric

Data-Driven Distributionally Robust Chance-Constrained Optimization with Wasserstein Metric Data-Driven Distributionally Robust Chance-Constrained Optimization with asserstein Metric Ran Ji Department of System Engineering and Operations Research, George Mason University, rji2@gmu.edu; Miguel

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Cupid s Invisible Hand

Cupid s Invisible Hand Alfred Galichon Bernard Salanié Chicago, June 2012 Matching involves trade-offs Marriage partners vary across several dimensions, their preferences over partners also vary and the analyst can only observe

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

Topic 2: Logistic Regression

Topic 2: Logistic Regression CS 4850/6850: Introduction to Machine Learning Fall 208 Topic 2: Logistic Regression Instructor: Daniel L. Pimentel-Alarcón c Copyright 208 2. Introduction Arguably the simplest task that we can teach

More information

21.1 Lower bounds on minimax risk for functional estimation

21.1 Lower bounds on minimax risk for functional estimation ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will

More information

Optimal Transport and Wasserstein Distance

Optimal Transport and Wasserstein Distance Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information