Optimal Transport Methods in Operations Research and Statistics
Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, and F. Zhang)
Stanford University (Management Science and Engineering) and Columbia University (Department of Statistics and Department of IEOR)
Blanchet (Columbia U. and Stanford U.) 1 / 57
Goal
Introduce optimal transport techniques and applications in OR & Statistics.
Optimal transport is a useful tool in model robustness, equilibrium, and machine learning!
Agenda
- Introduction to Optimal Transport
- Economic Interpretations and Wasserstein Distances
- Applications in Stochastic Operations Research
- Applications in Distributionally Robust Optimization
- Applications in Statistics
Introduction to Optimal Transport
Monge-Kantorovich Problem & Duality (see e.g. C. Villani's 2008 textbook).
Monge Problem
What's the cheapest way to transport a pile of sand to cover a sinkhole?

min_{T(·): T(X) ~ ν} E_µ[c(X, T(X))],

where c(x, y) ≥ 0 is the cost of transporting x to y, and T(X) ~ ν means T(X) follows the distribution ν(·).
The problem is highly non-linear; not much progress for about 150 years!
Kantorovich Relaxation: Primal Problem
Let Π(µ, ν) be the class of joint distributions π of random variables (X, Y) such that π_X = marginal of X = µ and π_Y = marginal of Y = ν.
Solve

min{ E_π[c(X, Y)] : π ∈ Π(µ, ν) }.

Linear programming (infinite dimensional):

D_c(µ, ν) := min_{π(dx,dy) ≥ 0} ∫_{X×Y} c(x, y) π(dx, dy)
s.t. ∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy).

If c(x, y) = d^p(x, y) (d a metric), then D_c^{1/p}(µ, ν) is a p-Wasserstein metric.
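On finite supports the Kantorovich relaxation is an ordinary linear program, and the structure above can be checked directly. A minimal sketch (not from the talk; the grid points, uniform weights, and quadratic cost are illustrative choices), using scipy's LP solver:

```python
# Sketch: a small discrete Kantorovich problem solved as a linear program.
# Supports, weights, and the cost c(x, y) = (x - y)^2 are illustrative.
import numpy as np
from scipy.optimize import linprog

n, m = 4, 5
x = np.linspace(0.0, 1.0, n)          # support of mu
y = np.linspace(0.0, 2.0, m)          # support of nu
mu = np.full(n, 1.0 / n)              # uniform source weights
nu = np.full(m, 1.0 / m)              # uniform target weights

# Cost matrix, flattened so the coupling pi is a vector of length n*m.
C = (x[:, None] - y[None, :]) ** 2

# Marginal constraints: sum_j pi_ij = mu_i and sum_i pi_ij = nu_j.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
print("D_c(mu, nu) =", res.fun)
```

The optimal `res.x`, reshaped to an n-by-m matrix, is the discrete coupling π.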
Illustration of Optimal Transport Costs
Monge's solution would take the form π(dx, dy) = δ_{T(x)}(dy) µ(dx).
Kantorovich Relaxation: Dual Problem
The primal always has a solution for c(·) ≥ 0 lower semicontinuous.
Linear programming (dual):

sup_{α,β} ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy)
s.t. α(x) + β(y) ≤ c(x, y) for all (x, y) ∈ X × Y.

The dual α and β can be taken over continuous functions.
Complementary slackness: equality holds on the support of π* (the primal optimizer).
Kantorovich Relaxation: Primal Interpretation
John wants to remove a pile of sand, µ(·).
Peter wants to cover a sinkhole, ν(·).
The cost for John and Peter to transport the sand to cover the sinkhole is

D_c(µ, ν) = ∫_{X×Y} c(x, y) π*(dx, dy).

Now comes Maria, who has a business...
Maria promises to transport the whole amount on behalf of John and Peter.
Kantorovich Relaxation: Primal Interpretation
Maria charges John α(x) per unit of mass at x (and similarly charges Peter β(y)).
For Peter and John to agree we must have α(x) + β(y) ≤ c(x, y).
Maria wishes to maximize her profit ∫ α(x) µ(dx) + ∫ β(y) ν(dy).
Kantorovich duality says the primal and dual optimal values coincide and (under mild regularity)

α(x) = inf_y {c(x, y) − β(y)}, β(y) = inf_x {c(x, y) − α(x)}.
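In the discrete case, Kantorovich duality (primal value = Maria's optimal profit) can be verified by solving both linear programs. A minimal sketch (the 3x3 cost matrix and the weights are made up for illustration, not from the talk):

```python
# Sketch: checking Kantorovich duality on a tiny discrete instance.
# Cost matrix and marginal weights below are illustrative.
import numpy as np
from scipy.optimize import linprog

mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.4, 0.4, 0.2])
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
n, m = C.shape

# Primal: min <C, pi> over couplings pi >= 0 with the two marginals fixed.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                 bounds=(0, None), method="highs")

# Dual: max <mu, alpha> + <nu, beta> s.t. alpha_i + beta_j <= c_ij.
# linprog minimizes, so negate the objective; alpha, beta are free.
A_ub = np.zeros((n * m, n + m))
for i in range(n):
    for j in range(m):
        A_ub[i * m + j, i] = 1.0
        A_ub[i * m + j, n + j] = 1.0
dual = linprog(-np.concatenate([mu, nu]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=(None, None), method="highs")

print("primal:", primal.fun, "dual:", -dual.fun)  # the two values coincide
```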
Proof Techniques
Suppose X and Y are compact. Write the Lagrangian

sup_{α,β} inf_{π ≥ 0} { ∫_{X×Y} [c(x, y) − α(x) − β(y)] π(dx, dy) + ∫_X α(x) µ(dx) + ∫_Y β(y) ν(dy) }.

Swap sup and inf using Sion's min-max theorem by a compactness argument and conclude.
A significant amount of work is needed to extend to general Polish spaces and to construct the dual optimizers (the primal is a bit easier).
Application of Optimal Transport in Economics
Economic Interpretations (see e.g. A. Galichon's 2016 textbook & McCann's 2013 notes).
Applications in Labor Markets
A worker with skill x and a company with technology y have surplus Ψ(x, y).
The population of workers is given by µ(x).
The population of companies is given by ν(y).
The salary of worker x is α(x) and the cost of technology y is β(y), with

α(x) + β(y) ≥ Ψ(x, y).

Companies want to minimize total production cost

∫ α(x) µ(x) dx + ∫ β(y) ν(y) dy.
Applications in Labor Markets
Let a central planner organize the labor market.
The planner wishes to maximize total surplus ∫ Ψ(x, y) π(dx, dy)
over assignments π(·) which satisfy market clearing:

∫_Y π(dx, dy) = µ(dx), ∫_X π(dx, dy) = ν(dy).
Solving for the Optimal Transport Coupling
Suppose that Ψ(x, y) = xy, µ(x) = I(x ∈ [0, 1]), ν(y) = e^{−y} I(y > 0).
Solve the primal by sampling: let {X_i^n}_{i=1}^n and {Y_j^n}_{j=1}^n be i.i.d. from µ and ν, respectively, with empirical distributions

F_{µ_n}(x) = (1/n) Σ_{i=1}^n I(X_i^n ≤ x), F_{ν_n}(y) = (1/n) Σ_{j=1}^n I(Y_j^n ≤ y).

Consider

max_{π(x_i^n, y_j^n) ≥ 0} Σ_{i,j} Ψ(x_i^n, y_j^n) π(x_i^n, y_j^n)
s.t. Σ_j π(x_i^n, y_j^n) = 1/n for all i, Σ_i π(x_i^n, y_j^n) = 1/n for all j.

Clearly, simply sort and match is the solution!
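The sort-and-match solution can be sketched in a few lines: sample the two populations, match order statistics, and compare the resulting total surplus against an arbitrary (here random) matching. The sample size and seed are illustrative choices:

```python
# Sketch: sort-and-match for the supermodular surplus Psi(x, y) = x*y,
# with X ~ uniform(0, 1) and Y ~ exponential(1) samples.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = rng.exponential(1.0, n)

# Comonotone matching: j-th order statistic of X with j-th of Y.
sorted_surplus = np.sort(x) @ np.sort(y) / n

# Any other permutation does no better, e.g. a random matching.
random_surplus = x @ rng.permutation(y) / n
print(sorted_surplus >= random_surplus)  # True (rearrangement inequality)
```

As n grows, `sorted_surplus` approaches E[U · (−log(1 − U))] = 3/4, the surplus of the comonotonic coupling.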
Solving for the Optimal Transport Coupling
Think of Y_j^n = −log(1 − U_j^n) for U_j^n i.i.d. uniform(0, 1).
The j-th order statistic X_(j)^n is matched to Y_(j)^n.
As n → ∞, X_(⌈nt⌉)^n → t, so Y_(⌈nt⌉)^n → −log(1 − t).
Thus, the optimal coupling as n → ∞ is X = U and Y = −log(1 − U) (the comonotonic coupling).
Identities for Wasserstein Distances
The comonotonic coupling is the solution if

Ψ(x + h, y + h') + Ψ(x, y) ≥ Ψ(x + h, y) + Ψ(x, y + h') for h, h' ≥ 0 (supermodularity; equivalently ∂²_{xy} Ψ(x, y) ≥ 0).

Or, for costs c(x, y) = −Ψ(x, y), if ∂²_{xy} c(x, y) ≤ 0 (submodularity).
Corollary: suppose c(x, y) = |x − y|; then X = F_µ^{−1}(U) and Y = F_ν^{−1}(U), thus

D_c(F_µ, F_ν) = ∫_0^1 |F_µ^{−1}(u) − F_ν^{−1}(u)| du.

Similar identities are common for Wasserstein distances...
Interesting Insight on Salary Effects
In equilibrium, by the envelope theorem,

β'(y) = (d/dy) sup_x [Ψ(x, y) − α(x)] = ∂_y Ψ(x_y, y) = x_y,

where x_y is the skill matched to technology y.
We also know that y = −log(1 − x), i.e. x = 1 − exp(−y), and α(x) + β(−log(1 − x)) = xy. Hence

β(y) = y + exp(−y) − 1 + β(0).

What if Ψ(x, y) → Ψ(x, y) + f(x)? (i.e. productivity grows).
Answer: salaries grow if f(·) is increasing.
Applications of Optimal Transport in Stochastic OR
Blanchet and Murthy (2016).
Insight: diffusion approximations and optimal transport.
A Distributionally Robust Performance Analysis
In Stochastic OR we are often interested in evaluating

E_{P_true}(f(X))

for a complex model P_true. Moreover, we wish to control / optimize it:

min_θ E_{P_true}(h(X, θ)).

The model P_true might be unknown or too difficult to work with.
So, we introduce a proxy P_0 which provides a good trade-off between tractability and model fidelity (e.g. Brownian motion for heavy-traffic approximations).
A Distributionally Robust Performance Analysis
For f(·) upper semicontinuous with E_{P_0} f(X) < ∞, consider

sup { E_P(f(Y)) : D_c(P, P_0) ≤ δ },

where X takes values in a Polish space and c(·) is lower semicontinuous.
This is also an infinite-dimensional linear program:

sup ∫_{X×Y} f(y) π(dx, dy)
s.t. ∫_{X×Y} c(x, y) π(dx, dy) ≤ δ, ∫_Y π(dx, dy) = P_0(dx).
A Distributionally Robust Performance Analysis
Formal duality:

Dual = inf_{λ ≥ 0, α} { λδ + ∫ α(x) P_0(dx) } s.t. λc(x, y) + α(x) ≥ f(y).

B. & Murthy (2016), no duality gap:

Dual = inf_{λ ≥ 0} [ λδ + E_0( sup_y {f(y) − λc(X, y)} ) ].

We refer to this as RoPA duality in this talk.
Let us consider the important case f(y) = I(y ∈ A) & c(x, x) = 0.
A Distributionally Robust Performance Analysis
So, if f(y) = I(y ∈ A) and c_A(x) = inf{c(x, y) : y ∈ A}, then

Dual = inf_{λ ≥ 0} [ λδ + E_0(1 − λ c_A(X))^+ ] = P_0(c_A(X) ≤ 1/λ*).

If c_A(X) is continuous under P_0 and E_0(c_A(X)) ≥ δ, then λ* solves

δ = E_0[ c_A(X) I(c_A(X) ≤ 1/λ*) ].
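The dual is a one-dimensional problem in λ, so the worst-case probability can be evaluated by Monte Carlo plus scalar optimization. A minimal sketch (P_0 = N(0, 1), A = [a, ∞), c(x, y) = (x − y)², and the values of a and δ are all illustrative choices, not from the talk):

```python
# Sketch: worst-case probability sup {P(X in A) : D_c(P, P_0) <= delta}
# via the scalar dual inf_{lam >= 0} [lam*delta + E_0 (1 - lam*c_A(X))^+].
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 100_000)   # Monte Carlo sample from the proxy P_0
a, delta = 2.0, 0.1                 # illustrative set boundary and budget

c_A = np.maximum(a - x, 0.0) ** 2   # squared distance from x to A = [a, inf)

def dual(lam):
    return lam * delta + np.mean(np.maximum(1.0 - lam * c_A, 0.0))

res = minimize_scalar(dual, bounds=(1e-6, 100.0), method="bounded")
baseline = np.mean(x >= a)          # plain probability under P_0
print("P_0(A) ~", baseline, " robust bound ~", res.fun)
```

The robust bound always dominates the baseline probability, as it should.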
Example: Model Uncertainty in Bankruptcy Calculations
R(t) = the reserve (perhaps multiple lines) at time t.
Bankruptcy probability (in finite time horizon T):

u_T = P_true(R(t) ∈ B for some t ∈ [0, T]).

B is a set which models bankruptcy.
Problem: the model P_true may be complex, intractable, or simply unknown...
A Distributionally Robust Risk Analysis Formulation
Our solution: estimate u_T by solving

sup_{D_c(P_0, P) ≤ δ} P(R(t) ∈ B for some t ∈ [0, T]),

where P_0 is a suitable model:
- P_0 = proxy for P_true.
- P_0 strikes the right trade-off between fidelity and tractability.
- δ is the distributional uncertainty size.
- D_c(·) defines the distributional uncertainty region.
Desirable Elements of a Distributionally Robust Formulation
- Would like D_c(·) to have wide flexibility (even non-parametric).
- Want the optimization to be tractable.
- Want to preserve the advantages of using P_0.
- Want a way to estimate δ.
Connections to Distributionally Robust Optimization
Standard choices are based on divergences (such as Kullback-Leibler), see Hansen & Sargent (2016):

D(ν ‖ µ) = E_ν( log(dν/dµ) ).

Robust Optimization: Ben-Tal, El Ghaoui, Nemirovski (2009).
Big problem: absolute continuity may typically be violated...
Think of using Brownian motion as a proxy model for R(t)...
Optimal transport is a natural option!
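The absolute-continuity issue is easy to see on a toy example: if ν puts mass where µ does not, the KL divergence is infinite, while a transport cost stays finite. The two discrete distributions below are made up for illustration:

```python
# Sketch: divergence neighborhoods fail without absolute continuity,
# while optimal transport assigns a finite cost. Toy distributions.
import math

mu = {0: 0.5, 1: 0.5}            # proxy model supported on {0, 1}
nu = {0: 0.5, 1: 0.4, 2: 0.1}    # alternative with an extra support point

def kl(nu, mu):
    total = 0.0
    for z, p in nu.items():
        if p == 0.0:
            continue
        q = mu.get(z, 0.0)
        if q == 0.0:
            return math.inf      # absolute continuity fails
        total += p * math.log(p / q)
    return total

# With c(x, y) = |x - y|, the 0.1 mass mismatch at the point 2 can be
# moved to the point 1 at finite cost 0.1 * |2 - 1| = 0.1.
print(kl(nu, mu), 0.1 * abs(2 - 1))
```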
Application 1: Back to the Classical Risk Problem
Suppose that c(x, y) = d_J(x(·), y(·)), the Skorokhod J_1 metric:

d_J(x, y) = inf_{φ(·) bijection} max{ sup_{t ∈ [0,1]} |x(t) − y(φ(t))|, sup_{t ∈ [0,1]} |φ(t) − t| }.

If R(t) = b − Z(t), then ruin during the time interval [0, 1] is

B_b = {R(·) : inf_{t ∈ [0,1]} R(t) ≤ 0} = {Z(·) : sup_{t ∈ [0,1]} Z(t) ≥ b}.

Let P_0(·) be the Wiener measure; we want to compute

sup_{D_c(P_0, P) ≤ δ} P(Z ∈ B_b).
Application 1: Computing the Distance to Bankruptcy
So {c_{B_b}(Z) ≤ 1/λ} = {sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ}, and

sup_{D_c(P_0, P) ≤ δ} P(Z ∈ B_b) = P_0( sup_{t ∈ [0,1]} Z(t) ≥ b − 1/λ* ).
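With P_0 the Wiener measure, the robust bound just lowers the ruin barrier from b to b − 1/λ*, and the resulting boundary-crossing probability is explicit via the reflection principle, P_0(sup_{t ≤ 1} B(t) ≥ a) = 2Φ(−a). A sketch with illustrative values of b and λ (not calibrated as in the talk):

```python
# Sketch: robust vs. plain ruin probability for a Brownian proxy model,
# using the reflection principle. Barrier b and multiplier lam illustrative.
from scipy.stats import norm

def level_crossing_prob(a):
    """P(sup_{0 <= t <= 1} B(t) >= a) for standard Brownian motion, a >= 0."""
    return 2.0 * norm.cdf(-a)

b, lam = 3.0, 2.0
baseline = level_crossing_prob(b)            # plain ruin probability
robust = level_crossing_prob(b - 1.0 / lam)  # barrier lowered by 1/lam
print(baseline, robust)
```

Lowering the barrier strictly increases the crossing probability, so the robust estimate dominates the baseline.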
Application 1: Computing the Uncertainty Size
Note that any coupling π with π_X = P_0 and π_Y = P satisfies

D_c(P_0, P) ≤ E_π[c(X, Y)],

so taking δ ≥ E_π[c(X, Y)] guarantees P lies in the uncertainty region.
Use any coupling between the evidence and P_0, or expert knowledge.
We discuss choosing δ non-parametrically momentarily.
Application 1: Illustration of the Coupling
Given arrivals and claim sizes, let

Z(t) = m_2^{−1/2} Σ_{k=1}^{N(t)} (X_k − m_1).
Application 1: Coupling in Action
Application 1: Numerical Example
Assume Poisson arrivals and Pareto claim sizes with index 2.2 (P(V > t) = 1/(1 + t)^{2.2}).
Cost c(x, y) = d_J(x, y)^2 (note the power of 2).
Used Algorithm 1 to calibrate δ (estimating means and variances from data).
[Table of b, P_0(Ruin), P_true(Ruin), and P_robust(Ruin) values omitted.]
Additional Applications: Multidimensional Ruin Problems
Blanchet and Murthy (2016) contains more applications:
- Control: min_θ sup_{P: D_c(P, P_0) ≤ δ} E[L(θ, Z)] (robust optimal reinsurance).
- Multidimensional risk processes (explicit evaluation of c_B(x) for the d_J metric).
Key insight: the geometry of the target set often remains largely the same!
Connections to Distributionally Robust Optimization
Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang 2016).
Highlight: additional insights into why optimal transport...
Distributionally Robust Optimization in Machine Learning
Consider estimating β ∈ R^m in the linear regression

Y_i = β^T X_i + e_i,

where {(Y_i, X_i)}_{i=1}^n are data points.
The ordinary least squares approach consists in estimating β via

min_β E_{P_n}[ (Y − β^T X)^2 ] = min_β (1/n) Σ_{i=1}^n (Y_i − β^T X_i)^2.

Instead, apply the distributionally robust estimator based on optimal transport.
Connection to Sqrt-Lasso
Theorem (B., Kang, Murthy (2016)). Suppose that

c((x, y), (x', y')) = ‖x − x'‖_q^2 if y = y', and ∞ if y ≠ y'.

Then, if 1/p + 1/q = 1,

max_{P: D_c(P, P_n) ≤ δ} E_P^{1/2}[ (Y − β^T X)^2 ] = E_{P_n}^{1/2}[ (Y − β^T X)^2 ] + √δ ‖β‖_p.

Remark 1: This is sqrt-lasso (Belloni et al. (2011)).
Remark 2: Uses the RoPA duality theorem and a judicious choice of c(·).
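The right-hand side of the theorem is the sqrt-lasso objective, which is straightforward to minimize directly. A minimal sketch (taking q = ∞, hence p = 1; the synthetic data, the value of δ, and the use of a generic Nelder-Mead solver are all illustrative choices, not from the talk):

```python
# Sketch: the robust regression estimator reduces to sqrt-lasso:
# minimize the RMS residual plus sqrt(delta) * ||beta||_1. Synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)
delta = 0.01                                  # illustrative uncertainty size

def sqrt_lasso_objective(beta):
    rmse = np.sqrt(np.mean((Y - X @ beta) ** 2))
    return rmse + np.sqrt(delta) * np.sum(np.abs(beta))

res = minimize(sqrt_lasso_objective, np.zeros(d), method="Nelder-Mead",
               options={"maxiter": 20_000, "xatol": 1e-8, "fatol": 1e-10})
print(res.x.round(2))  # roughly recovers beta_true, with shrinkage
```

The objective is convex but nonsmooth at zero coordinates, which is why a derivative-free method is used in this sketch.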
Connection to Regularized Logistic Regression
Theorem (B., Kang, Murthy (2016)). Suppose that

c((x, y), (x', y')) = ‖x − x'‖_q if y = y', and ∞ if y ≠ y'.

Then

sup_{P: D_c(P, P_n) ≤ δ} E_P[ log(1 + e^{−Y β^T X}) ] = E_{P_n}[ log(1 + e^{−Y β^T X}) ] + δ ‖β‖_p.

Remark 1: An approximate connection is studied in Esfahani and Kuhn (2015).
Unification and Extensions of Regularized Estimators
Distributionally robust optimization using optimal transport recovers many other estimators...
- Support Vector Machines: B., Kang, Murthy (2016).
- Group Lasso: B. & Kang (2016).
- Generalized adaptive ridge: B., Kang, Murthy, Zhang (2017).
- Semisupervised learning: B. and Kang (2016).
118 How Do Regularization and Dual Norms Arise?
Let us work out a simple example...
Recall RoPA Duality. Pick $c((x,y),(x',y')) = \|(x,y)-(x',y')\|_q^2$:
$$ \max_{P:\, D_c(P,P_n)\le\delta} E_P\big[\big((X,Y)\cdot(\beta,-1)\big)^2\big] \;=\; \min_{\lambda\ge 0}\Big\{\lambda\delta + E_{P_n}\Big[\sup_{(x',y')}\big\{\big((x',y')\cdot(\beta,-1)\big)^2 - \lambda\,\|(X,Y)-(x',y')\|_q^2\big\}\Big]\Big\}. $$
Let's focus on the inside $E_{P_n}$...
Blanchet (Columbia U. and Stanford U.) 41 / 57
121 How Do Regularization and Dual Norms Arise?
Let $\Delta = (X,Y) - (x',y')$. Then
$$ \sup_{(x',y')}\big\{\big((x',y')\cdot(\beta,-1)\big)^2 - \lambda\,\|(X,Y)-(x',y')\|_q^2\big\} $$
$$ = \sup_{\Delta}\big\{\big((X,Y)\cdot(\beta,-1) - \Delta\cdot(\beta,-1)\big)^2 - \lambda\,\|\Delta\|_q^2\big\} $$
$$ = \sup_{\|\Delta\|_q}\big\{\big(|(X,Y)\cdot(\beta,-1)| + \|\Delta\|_q\,\|(\beta,-1)\|_p\big)^2 - \lambda\,\|\Delta\|_q^2\big\}. $$
The last equality uses that $z\mapsto z^2$ is symmetric around the origin and that $|a\cdot b| \le \|a\|_p\|b\|_q$ (Hölder).
Note the problem is now one-dimensional (easily computable).
Blanchet (Columbia U. and Stanford U.) 42 / 57
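The one-dimensional reduction is easy to sanity-check numerically. A minimal sketch (my own illustration, not from the talk), taking $q = p = 2$ and letting `a`, `b` stand in for $(X,Y)\cdot(\beta,-1)$ and $(\beta,-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a = (X,Y)·(β,−1), b = (β,−1); take q = p = 2 for concreteness.
b = rng.normal(size=4)
a = rng.normal()
c = np.linalg.norm(b)            # ||b||_p with p = 2
lam = c**2 + 1.0                 # λ > ||b||_p^2 keeps the sup finite

def inner(delta):
    """Objective of the inner sup: (a − Δ·b)^2 − λ ||Δ||_2^2."""
    return (a - delta @ b) ** 2 - lam * np.dot(delta, delta)

# Closed form of the 1-D problem sup_t (|a| + c t)^2 − λ t^2 for λ > c^2:
# maximizer t* = c|a|/(λ − c^2), value λ a^2 / (λ − c^2).
t_star = c * np.abs(a) / (lam - c**2)
closed = lam * a**2 / (lam - c**2)

# Check the 1-D reformulation on a fine grid over t = ||Δ||_2 ...
t = np.linspace(0.0, 2 * t_star + 1.0, 400_001)
one_dim = np.max((np.abs(a) + c * t) ** 2 - lam * t**2)

# ... and that the analytic maximizer in Δ-space attains the same value.
delta_star = -np.sign(a) * t_star * b / c

print(one_dim, closed, inner(delta_star))
```

The grid maximum, the closed form, and the value at the explicit maximizer $\Delta^* = -\mathrm{sign}(a)\,t^*\,b/\|b\|_2$ all agree, confirming that the multi-dimensional inner supremum collapses to a scalar problem.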
122 On the Role of the Transport Cost...
Data-driven choice of $c(\cdot)$.
Suppose that $\|x-x'\|_A^2 = (x-x')^T A\,(x-x')$ with $A$ positive definite (Mahalanobis distance).
Then,
$$ \min_\beta \max_{P:\, D_c(P,P_n)\le\delta} E_P^{1/2}\big[(Y-\beta^T X)^2\big] \;=\; \min_\beta\Big\{E_{P_n}^{1/2}\big[(Y-\beta^T X)^2\big] + \sqrt{\delta}\,\|\beta\|_{A^{-1}}\Big\}. $$
Intuition: Think of $A$ diagonal, encoding the inverse variability of the $X_i$'s...
High variability → cheap transportation → high impact in risk estimation.
Blanchet (Columbia U. and Stanford U.) 43 / 57
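The norm $\|\beta\|_{A^{-1}}$ shows up because it is the dual norm of the Mahalanobis norm $\|\cdot\|_A$: $\sup\{\beta\cdot x : x^T A x \le 1\} = \sqrt{\beta^T A^{-1}\beta}$. A small numerical check of this duality, with a hypothetical positive-definite $A$ (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical positive-definite A and a direction β (illustrative only).
M = rng.normal(size=(3, 3))
A = M @ M.T + 3 * np.eye(3)
beta = rng.normal(size=3)

A_inv = np.linalg.inv(A)
dual_norm = np.sqrt(beta @ A_inv @ beta)     # ||β||_{A^{-1}}

# Dual-norm identity: sup { β·x : x^T A x <= 1 } = ||β||_{A^{-1}},
# attained at x* = A^{-1} β / ||β||_{A^{-1}}.
x_star = A_inv @ beta / dual_norm
assert abs(x_star @ A @ x_star - 1) < 1e-9   # x* lies on the unit ||·||_A sphere

# Random feasible points never beat the closed form.
Z = rng.normal(size=(100_000, 3))
norms_A = np.sqrt(np.einsum('ij,jk,ik->i', Z, A, Z))
feasible = Z / norms_A[:, None]              # projected onto the unit ||·||_A sphere
best_random = np.max(feasible @ beta)

print(dual_norm, beta @ x_star, best_random)
```

This is exactly the mechanism by which a diagonal $A$ (cheap transport along high-variance coordinates) translates into coordinate-dependent regularization weights in $\|\beta\|_{A^{-1}}$.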
132 On the Role of the Transport Cost...
Comparing $L_1$ regularization vs. data-driven cost regularization: real data.
Table: training loss, test loss, and classification accuracy (each ± a standard deviation) for $L_1$-regularized logistic regression (LRL1) and DRO with a data-driven cost (DRO-NL) on four real data sets (BC, BN, QSAR, Magic), together with the number of predictors and the train/test sizes. Most numeric entries were lost in transcription; on the entries that survive for the first data set, DRO-NL outperforms LRL1 (e.g. test loss .119 vs. .428, accuracy .955 vs. .929).
Blanchet (Columbia U. and Stanford U.) 45 / 57
133 Connections to Statistical Analysis
Based on: Robust Wasserstein Profile Inference (B., Murthy & Kang '16)
Highlight: How to choose the size of the uncertainty set?
Blanchet (Columbia U. and Stanford U.) 46 / 57
137 Towards an Optimal Choice of Uncertainty Size
How to choose the uncertainty size in a data-driven way?
Once again, consider Lasso as the example:
$$ \min_\beta \max_{P:\, D_c(P,P_n)\le\delta} E_P\big[(Y-\beta^T X)^2\big] \;=\; \min_\beta\Big\{E_{P_n}^{1/2}\big[(Y-\beta^T X)^2\big] + \sqrt{\delta}\,\|\beta\|_p\Big\}^2. $$
Use the left-hand side to define a statistical principle to choose $\delta$.
Important: Optimizing $\delta$ is equivalent to optimizing the regularization!
Blanchet (Columbia U. and Stanford U.) 47 / 57
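Since the right-hand side is a finite-dimensional problem, the equivalence can be exercised directly. A sketch on synthetic data (my own illustration, using a generic scipy optimizer rather than a dedicated lasso solver), showing that $\delta$ plays exactly the role of a regularization parameter — larger $\delta$ shrinks the $\ell_1$ norm of the fitted coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Synthetic sparse linear model (illustrative only, not data from the talk).
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

def dro_objective(beta, delta):
    # RHS of the duality: ( sqrt(E_{P_n}[(Y - β^T X)^2]) + sqrt(δ)·||β||_1 )^2
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return (rmse + np.sqrt(delta) * np.abs(beta).sum()) ** 2

def fit(delta):
    return minimize(dro_objective, x0=np.zeros(d), args=(delta,),
                    method="Powell", options={"maxiter": 20_000}).x

norms = [np.abs(fit(delta)).sum() for delta in (0.0, 0.01, 0.1)]
print(norms)  # larger δ ⇒ heavier implicit ℓ1 penalty ⇒ smaller ||β||_1
```

With $\delta = 0$ the fit is plain least squares; as $\delta$ grows the implicit penalty $\sqrt{\delta}\,\|\beta\|_1$ drives the small coefficients toward zero, which is why tuning $\delta$ and tuning the regularization strength are the same problem.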
141 Towards an Optimal Choice of Uncertainty Size
Standard way to pick $\delta$ (Esfahani and Kuhn (2015)):
Estimate $D(P_{\text{true}}, P_n)$ using concentration-of-measure results.
Not a good idea: rates of convergence of the form $O(n^{-1/d})$ ($d$ is the data dimension).
Instead we seek an optimal approach.
Blanchet (Columbia U. and Stanford U.) 48 / 57
145 Towards an Optimal Choice of Uncertainty Size
Keep in mind the linear regression problem $Y_i = \beta_*^T X_i + \epsilon_i$.
The plausible model variations of $P_n$ are given by the set
$$ U_\delta(n) = \{P : D_c(P,P_n) \le \delta\}. $$
Given $P \in U_\delta(n)$, define $\beta(P) = \arg\min_\beta E_P\big[(Y-\beta^T X)^2\big]$.
It is natural to say that
$$ \Lambda_\delta(n) = \{\beta(P) : P \in U_\delta(n)\} $$
are the plausible estimates of $\beta_*$.
Blanchet (Columbia U. and Stanford U.) 49 / 57
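One way to get a feel for $\Lambda_\delta(n)$: exhibit explicit couplings that transport $P_n$ at cost at most $\delta$ and recompute the least-squares coefficient under each perturbed measure. A sketch (my own illustration; assumptions: squared-Euclidean cost, one-dimensional regression through the origin, and shifts of the responses only — this traces out just a subset of $\Lambda_\delta(n)$):

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D least squares through the origin: β(P) = E_P[XY] / E_P[X^2].
n = 100
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def beta_of(xs, ys):
    return np.sum(xs * ys) / np.sum(xs * xs)

beta_n = beta_of(x, y)   # β(P_n), the empirical estimate

# Explicit couplings: move each (X_i, Y_i) to (X_i, Y_i + s).  With squared-
# Euclidean cost this coupling transports P_n at cost s^2, so the shifted
# measure lies in U_δ(n) whenever s^2 <= δ.
delta = 0.01
betas = [beta_of(x, y + s)
         for s in np.linspace(-np.sqrt(delta), np.sqrt(delta), 21)]

print(beta_n, min(betas), max(betas))   # a sliver of the plausible set Λ_δ(n)
```

Even these crude vertical shifts show that $\Lambda_\delta(n)$ is a nondegenerate neighborhood of $\beta(P_n)$, which is exactly what makes it usable as a confidence region.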
148 Optimal Choice of Uncertainty Size
Given a confidence level $1-\alpha$, we advocate choosing $\delta$ via
$$ \min \delta \quad \text{s.t.} \quad P\big(\beta_* \in \Lambda_\delta(n)\big) \ge 1-\alpha. $$
Equivalently: Find the smallest confidence region $\Lambda_\delta(n)$ at level $1-\alpha$.
In simple words: Find the smallest $\delta$ so that $\beta_*$ is plausible with confidence level $1-\alpha$.
Blanchet (Columbia U. and Stanford U.) 50 / 57
152 The Robust Wasserstein Profile Function
The value $\beta(P)$ is characterized by the first-order condition
$$ D_\beta\, E_P\big[(Y-\beta^T X)^2\big] = -2\,E_P\big[(Y-\beta^T X)\,X\big] = 0. $$
Define the Robust Wasserstein Profile (RWP) function:
$$ R_n(\beta) = \min\big\{D_c(P,P_n) : E_P\big[(Y-\beta^T X)\,X\big] = 0\big\}. $$
Note that
$$ R_n(\beta_*) \le \delta \iff \beta_* \in \Lambda_\delta(n) = \{\beta(P) : D_c(P,P_n) \le \delta\}. $$
So $\delta$ is the $1-\alpha$ quantile of $R_n(\beta_*)$!
Blanchet (Columbia U. and Stanford U.) 51 / 57
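In the simplest instance — mean estimation with cost $c(y,y') = (y-y')^2$, where one can show $R_n(\beta) = (\bar{Y}_n - \beta)^2$ (this is the $\rho = 2$ case of the toy example worked out at the end of the talk) — the quantile prescription for $\delta$ can be simulated directly and compared against the $\chi^2$ quantile that the asymptotics predict. A sketch with the 95% $\chi^2_1$ quantile hard-coded to avoid a scipy dependency:

```python
import numpy as np

rng = np.random.default_rng(4)

# Mean estimation with c(y, y') = (y - y')^2: here R_n(β) = (Ȳ_n − β)^2, so
# n R_n(β*) = (√n (Ȳ_n − β*))^2 ~ σ² χ²_1 and the advocated δ is the
# (1 − α) quantile of R_n(β*).
n, sigma, beta_star, alpha = 50, 1.0, 0.0, 0.05
reps = 100_000

Y = rng.normal(beta_star, sigma, size=(reps, n))
R = (Y.mean(axis=1) - beta_star) ** 2        # R_n(β*) in each replication

delta_mc = np.quantile(R, 1 - alpha)         # Monte Carlo choice of δ
delta_formula = sigma**2 / n * 3.841458820694124   # (σ²/n) · χ²_{1, 0.95}
print(delta_mc, delta_formula)
```

The two values agree, and both are of order $O(1/n)$ — the scaling emphasized on the discussion slide below.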
153 The Robust Wasserstein Profile Function Blanchet (Columbia U. and Stanford U.) 52 / 57
154 Computing the Optimal Regularization Parameter
Theorem (B., Murthy, Kang (2016)). Suppose that $\{(Y_i,X_i)\}_{i=1}^n$ is an i.i.d. sample with finite variance, with
$$ c\big((x,y),(x',y')\big) = \begin{cases} \|x-x'\|_q^2 & \text{if } y=y',\\ \infty & \text{if } y\neq y'. \end{cases} $$
Then
$$ n\,R_n(\beta_*) \Rightarrow L_1, $$
where $L_1$ is explicit and
$$ L_1 \overset{D}{\le} L_2 := \frac{E[e^2]}{E[e^2] - (E|e|)^2}\,\big\|N(0,\mathrm{Cov}(X))\big\|_q^2. $$
Remark: We recover the same order of regularization (but $L_1$ gives the optimal constant!)
Blanchet (Columbia U. and Stanford U.) 53 / 57
158 Discussion on Optimal Uncertainty Size
The optimal $\delta$ is of order $O(1/n)$, as opposed to the $O(n^{-1/d})$ advocated in the standard approach.
We characterize the asymptotic constant (not only the order) of the optimal regularization: $P(L_1 \le \eta_{1-\alpha}) = 1-\alpha$.
$R_n(\beta_*)$ is inspired by Empirical Likelihood (Owen (1988)).
Lam & Zhou (2015) use Empirical Likelihood in DRO, but focus on divergence.
Blanchet (Columbia U. and Stanford U.) 54 / 57
159 A Toy Example Illustrating Proof Techniques
Consider
$$ \min_\beta \max_{P:\, D_c(P,P_n)\le\delta} E_P\big[(Y-\beta)^2\big] \quad \text{with } c(y,y') = |y-y'|^\rho, $$
and define
$$ R_n(\beta) = \min_{\pi\ge 0}\Big\{\int |y-u|^\rho\,\pi(dy,du) \;:\; \int_{u\in\mathbb{R}}\pi(dy,du) = \frac{1}{n}\sum_{i=1}^n \delta_{\{Y_i\}}(dy),\ \int 2(u-\beta)\,\pi(dy,du) = 0\Big\}. $$
Blanchet (Columbia U. and Stanford U.) 55 / 57
160 A Toy Example Illustrating Proof Techniques
Dual linear programming problem: plug in $\beta = \beta_*$,
$$ R_n(\beta_*) = \sup_{\lambda\in\mathbb{R}}\Big\{\frac{1}{n}\sum_{i=1}^n \sup_{u\in\mathbb{R}}\big\{\lambda(u-\beta_*) - |Y_i-u|^\rho\big\}\Big\} $$
$$ = \sup_{\lambda\in\mathbb{R}}\Big\{\frac{\lambda}{n}\sum_{i=1}^n (Y_i-\beta_*) + \frac{1}{n}\sum_{i=1}^n \sup_{u\in\mathbb{R}}\big\{\lambda(u-Y_i) - |Y_i-u|^\rho\big\}\Big\} $$
$$ = \sup_{\lambda\in\mathbb{R}}\Big\{\frac{\lambda}{n}\sum_{i=1}^n (Y_i-\beta_*) - (\rho-1)\Big|\frac{\lambda}{\rho}\Big|^{\rho/(\rho-1)}\Big\} $$
$$ = \Big|\frac{1}{n}\sum_{i=1}^n (Y_i-\beta_*)\Big|^\rho \approx \frac{1}{n^{\rho/2}}\,\big|N(0,\sigma^2)\big|^\rho. $$
Blanchet (Columbia U. and Stanford U.) 56 / 57
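The dual computation can be cross-checked by discretizing the primal transport problem and solving it as a linear program. A sketch for $\rho = 2$ (my own illustration, using scipy's `linprog`; the grid of destinations $u$ is an approximation, so the LP value sits slightly above the exact value $(\bar{Y}_n - \beta)^2$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)

# Toy RWP function, ρ = 2: ship mass from the atoms Y_i to destinations u on
# a grid so that the transported mean equals β, at minimal quadratic cost.
n = 5
Y = rng.normal(size=n)
beta = Y.mean() + 0.3

u = np.linspace(Y.min() - 1.0, Y.max() + 1.0, 401)   # discretized destinations
m = len(u)

cost = ((Y[:, None] - u[None, :]) ** 2).ravel()      # c(Y_i, u_j), row-major

# Marginal constraints: each atom Y_i ships exactly mass 1/n.
A_eq = np.zeros((n + 1, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
# Moment constraint: E_π[u − β] = 0.
A_eq[n, :] = np.tile(u - beta, n)
b_eq = np.append(np.full(n, 1.0 / n), 0.0)

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
closed_form = (Y.mean() - beta) ** 2                 # dual computation above
print(res.fun, closed_form)
```

For $\rho = 2$ the continuous optimum shifts every atom by the same amount $\beta - \bar{Y}_n$, so the LP value matches $(\bar{Y}_n - \beta)^2$ up to the grid-discretization error.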
166 Conclusions
Optimal transport (OT) is a powerful tool based on linear programming.
OT costs are natural for computing model uncertainty.
OT can be used in path space to quantify the error in diffusion approximations.
OT can be used for data-driven distributionally robust optimization.
The cost function in OT can be used to improve out-of-sample performance.
OT can be used for statistical inference via the RWP function.
Blanchet (Columbia U. and Stanford U.) 57 / 57
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationSignal Recovery from Permuted Observations
EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,
More informationJoint distribution optimal transportation for domain adaptation
Joint distribution optimal transportation for domain adaptation Changhuang Wan Mechanical and Aerospace Engineering Department The Ohio State University March 8 th, 2018 Joint distribution optimal transportation
More informationMultivariate Topological Data Analysis
Cleveland State University November 20, 2008, Duke University joint work with Gunnar Carlsson (Stanford), Peter Kim and Zhiming Luo (Guelph), and Moo Chung (Wisconsin Madison) Framework Ideal Truth (Parameter)
More informationInformation geometry for bivariate distribution control
Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic
More informationRobust linear optimization under general norms
Operations Research Letters 3 (004) 50 56 Operations Research Letters www.elsevier.com/locate/dsw Robust linear optimization under general norms Dimitris Bertsimas a; ;, Dessislava Pachamanova b, Melvyn
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationAre You a Bayesian or a Frequentist?
Are You a Bayesian or a Frequentist? Michael I. Jordan Department of EECS Department of Statistics University of California, Berkeley http://www.cs.berkeley.edu/ jordan 1 Statistical Inference Bayesian
More informationGeneralized quantiles as risk measures
Generalized quantiles as risk measures Bellini, Klar, Muller, Rosazza Gianin December 1, 2014 Vorisek Jan Introduction Quantiles q α of a random variable X can be defined as the minimizers of a piecewise
More informationLecture 7: Dynamic panel models 2
Lecture 7: Dynamic panel models 2 Ragnar Nymoen Department of Economics, UiO 25 February 2010 Main issues and references The Arellano and Bond method for GMM estimation of dynamic panel data models A stepwise
More informationDistributionally Robust Convex Optimization
Distributionally Robust Convex Optimization Wolfram Wiesemann 1, Daniel Kuhn 1, and Melvyn Sim 2 1 Department of Computing, Imperial College London, United Kingdom 2 Department of Decision Sciences, National
More informationHow much is information worth? A geometric insight using duality between payoffs and beliefs
How much is information worth? A geometric insight using duality between payoffs and beliefs Michel De Lara and Olivier Gossner CERMICS, École des Ponts ParisTech CNRS, X Paris Saclay and LSE Séminaire
More informationTopic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.
CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationData-Driven Distributionally Robust Chance-Constrained Optimization with Wasserstein Metric
Data-Driven Distributionally Robust Chance-Constrained Optimization with asserstein Metric Ran Ji Department of System Engineering and Operations Research, George Mason University, rji2@gmu.edu; Miguel
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationCupid s Invisible Hand
Alfred Galichon Bernard Salanié Chicago, June 2012 Matching involves trade-offs Marriage partners vary across several dimensions, their preferences over partners also vary and the analyst can only observe
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationLecture 6: September 19
36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationTopic 2: Logistic Regression
CS 4850/6850: Introduction to Machine Learning Fall 208 Topic 2: Logistic Regression Instructor: Daniel L. Pimentel-Alarcón c Copyright 208 2. Introduction Arguably the simplest task that we can teach
More information21.1 Lower bounds on minimax risk for functional estimation
ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will
More informationOptimal Transport and Wasserstein Distance
Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationMath for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han
Math for Machine Learning Open Doors to Data Science and Artificial Intelligence Richard Han Copyright 05 Richard Han All rights reserved. CONTENTS PREFACE... - INTRODUCTION... LINEAR REGRESSION... 4 LINEAR
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More information