A Minimax Theorem with Applications to Machine Learning, Signal Processing, and Finance

Seung Jean Kim, Stephen Boyd, Alessandro Magnani

MIT ORC Seminar, 12/8/05
Outline
- A minimax theorem
- Robust Fisher discriminant analysis
- Robust matched filtering
- An extension
- Worst-case Sharpe ratio maximization
A fractional function

f(a, B, x) = \frac{a^T x}{\sqrt{x^T B x}}, \quad a, x \in \mathbf{R}^n, \; B \in \mathbf{S}^n_{++}

- f is homogeneous, quasiconcave in x provided a^T x \geq 0: for \gamma \geq 0,
  \{ x : f(a, B, x) \geq \gamma \} = \{ x : \gamma \sqrt{x^T B x} \leq a^T x \}
  is convex (a second-order cone)
- f is quasiconvex in (a, B) provided a^T x \geq 0: for \gamma \geq 0,
  \{ (a, B) : f(a, B, x) \leq \gamma \} = \{ (a, B) : \gamma \sqrt{x^T B x} \geq a^T x \}
  is convex (since \sqrt{x^T B x} is concave in B)
- f is maximized over x by x^\star = B^{-1} a; optimal value \sqrt{a^T B^{-1} a}
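The closed-form maximizer is easy to check numerically. A minimal numpy sketch (not from the slides; a and B are arbitrary test values):

```python
import numpy as np

def f(a, B, x):
    """Fractional function f(a, B, x) = a^T x / sqrt(x^T B x)."""
    return a @ x / np.sqrt(x @ B @ x)

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                   # B in S^n_++

x_star = np.linalg.solve(B, a)                # maximizer x* = B^{-1} a
opt_val = np.sqrt(a @ x_star)                 # optimal value sqrt(a^T B^{-1} a)

assert np.isclose(f(a, B, x_star), opt_val)
# random directions never do better (Cauchy-Schwarz in the B-inner product)
for _ in range(100):
    x = rng.standard_normal(n)
    assert f(a, B, x) <= opt_val + 1e-9
```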
A Rayleigh quotient

f(a, B, x)^2 is the Rayleigh quotient of (a a^T, B) evaluated at x:

f(a, B, x)^2 = \frac{(x^T a)^2}{x^T B x}

- convex in (a, B)
- not quasiconcave in x
Interpretation via Gaussian
suppose z \sim N(a, B) and x \in \mathbf{R}^n; then z^T x \sim N(a^T x, x^T B x), so

\mathrm{Prob}(z^T x \geq 0) = \Phi\!\left( \frac{a^T x}{\sqrt{x^T B x}} \right)

- x^\star = B^{-1} a gives the hyperplane through the origin that maximizes the probability of z being on one side
- maximum probability is \Phi\!\left( \sqrt{a^T B^{-1} a} \right)
- \sqrt{a^T B^{-1} a} measures how well a linear function can discriminate z from zero
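The Gaussian formula is easy to sanity-check by Monte Carlo. A small numpy sketch with assumed test values for a, B, x (not the talk's data):

```python
import math
import numpy as np

def Phi(u):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

a = np.array([1.0, 2.0])
B = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, 0.5])

# closed form: Prob(z^T x >= 0) = Phi(a^T x / sqrt(x^T B x))
p_exact = Phi(a @ x / np.sqrt(x @ B @ x))

# Monte Carlo check with z ~ N(a, B)
rng = np.random.default_rng(1)
z = rng.multivariate_normal(a, B, size=200_000)
p_mc = np.mean(z @ x >= 0)
assert abs(p_exact - p_mc) < 5e-3

# no direction beats the maximum probability Phi(sqrt(a^T B^{-1} a))
assert p_exact <= Phi(math.sqrt(a @ np.linalg.solve(B, a)))
```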
A picture
expand the confidence ellipsoid until it touches the origin; the tangent is the plane that maximizes \mathrm{Prob}(z^T x \geq 0)
[figure: confidence ellipsoid touching the origin, with its tangent hyperplane]
Interpretation via Chebyshev bound
suppose \mathbf{E} z = a, \mathbf{E}(z - a)(z - a)^T = B (otherwise arbitrary)
\mathbf{E} z^T x = a^T x, \mathbf{E}(z^T x - a^T x)^2 = x^T B x, so by the Chebyshev bound

\mathrm{Prob}(z^T x \geq 0) \geq \Psi\!\left( \frac{a^T x}{\sqrt{x^T B x}} \right), \quad \Psi(u) = \frac{u_+^2}{1 + u_+^2}

- \Psi is increasing (bound is sharp)
- x^\star = B^{-1} a gives the hyperplane through the origin that maximizes the Chebyshev lower bound on \mathrm{Prob}(z^T x \geq 0)
- maximum value of the Chebyshev lower bound is \frac{a^T B^{-1} a}{1 + a^T B^{-1} a}
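Since Ψ is distribution-free, for Gaussian z it can never exceed the exact probability Φ. A tiny sketch of that comparison (the test points are illustrative):

```python
import math

def Phi(u):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def Psi(u):
    """Sharp Chebyshev lower bound Psi(u) = u_+^2 / (1 + u_+^2)."""
    up = max(u, 0.0)
    return up**2 / (1.0 + up**2)

# the distribution-free bound never exceeds the Gaussian value
for u in [-2.0, -0.5, 0.0, 0.3, 1.0, 3.0]:
    assert Psi(u) <= Phi(u)
```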
Worst-case discrimination probability analysis
uncertain statistics: (a, B) \in U \subseteq (\mathbf{R}^n \setminus \{0\}) \times \mathbf{S}^n_{++} (convex and compact)
for fixed x, find the worst-case statistics:

minimize \mathrm{Prob}(z^T x \geq 0) subject to (a, B) \in U

where z \sim N(a, B)
Worst-case discrimination probability analysis
worst-case discrimination probability analysis is equivalent to

minimize \frac{a^T x}{\sqrt{x^T B x}} subject to (a, B) \in U

- if the optimal value is positive, i.e., a^T x > 0 for all (a, B) \in U, this is equivalent to the convex problem

  minimize \frac{(a^T x)^2}{x^T B x} subject to (a, B) \in U

- if the optimal value is negative, can solve via bisection and convex optimization
Worst-case discrimination probability maximization
find x that maximizes the worst-case discrimination probability

\min_{(a,B) \in U} \mathrm{Prob}(z^T x \geq 0)

equivalent to finding x that maximizes

\min_{(a,B) \in U} \frac{a^T x}{\sqrt{x^T B x}}

- not concave in x; not clear how to solve
- studied in the context of robust signal processing in the 1980s (Poor, Verdú, ...) for some specific U's
Weak minimax property

\max_{x \neq 0} \min_{(a,B) \in U} \frac{x^T a}{\sqrt{x^T B x}} \leq \min_{(a,B) \in U} \max_{x \neq 0} \frac{x^T a}{\sqrt{x^T B x}}

- LHS: the worst-case discrimination probability maximization problem
- RHS can be evaluated via convex optimization:

\min_{(a,B) \in U} \max_{x \neq 0} \frac{x^T a}{\sqrt{x^T B x}} = \min_{(a,B) \in U} \sqrt{a^T B^{-1} a} = \left[ \min_{(a,B) \in U} a^T B^{-1} a \right]^{1/2}

finds the statistics hardest to discriminate from zero
Strong minimax property
in fact, we have equality:

\max_{x \neq 0} \min_{(a,B) \in U} \frac{x^T a}{\sqrt{x^T B x}} = \min_{(a,B) \in U} \max_{x \neq 0} \frac{x^T a}{\sqrt{x^T B x}}

(and the common value is positive, since a \neq 0 and B \succ 0 on U)
equivalently, with z \sim N(a, B),

\max_{x \neq 0} \min_{(a,B) \in U} \mathrm{Prob}(z^T x \geq 0) = \min_{(a,B) \in U} \max_{x \neq 0} \mathrm{Prob}(z^T x \geq 0)

(and the common value exceeds 1/2)
Computing saddle point via convex optimization
find the least favorable statistics (a^\star, B^\star), i.e., minimize a^T B^{-1} a over (a, B) \in U
(x^\star, a^\star, B^\star) with x^\star = B^{\star -1} a^\star is a saddle point:

\frac{x^T a^\star}{\sqrt{x^T B^\star x}} \leq \frac{x^{\star T} a^\star}{\sqrt{x^{\star T} B^\star x^\star}} \leq \frac{x^{\star T} a}{\sqrt{x^{\star T} B x^\star}}, \quad \forall x \neq 0, \; (a, B) \in U

- x^\star solves the worst-case discrimination probability maximization problem
- (a^\star, B^\star) is the worst-case statistics for x^\star
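For a toy uncertainty set (mean a in a box, B fixed — an assumed instance, not the talk's), the least favorable mean and the saddle point can be computed with a simple projected-gradient sketch in numpy:

```python
import numpy as np

# least favorable mean over U = {a : l <= a <= u}, B fixed:
# a toy instance of min_{(a,B) in U} a^T B^{-1} a, via projected gradient
B = np.array([[2.0, 0.3], [0.3, 1.0]])
Binv = np.linalg.inv(B)
l = np.array([0.5, 1.0])
u = np.array([2.0, 3.0])

a = u.copy()                                 # feasible starting point
for _ in range(500):
    grad = 2 * Binv @ a                      # gradient of a^T B^{-1} a
    a = np.clip(a - 0.2 * grad, l, u)        # projected gradient step

a_star = a
x_star = Binv @ a_star                       # saddle-point x* = B^{-1} a*

def f(a, B, x):
    return a @ x / np.sqrt(x @ B @ x)

# saddle-point check on samples: f(a*, B, x) <= f(a*, B, x*) <= f(a, B, x*)
rng = np.random.default_rng(2)
v = f(a_star, B, x_star)
for _ in range(200):
    x = rng.standard_normal(2)
    assert f(a_star, B, x) <= v + 1e-9
    a_rand = rng.uniform(l, u)
    assert f(a_rand, B, x_star) >= v - 1e-9
```

For this box, the minimum sits at the lower corner, so the least favorable mean is (0.5, 1.0).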
Interpretation of least favorable statistics
find the statistics with minimum Mahalanobis distance between the mean and zero:

\sqrt{a^{\star T} B^{\star -1} a^\star} = \min_{(a,B) \in U} \sqrt{a^T B^{-1} a}

[figure: family of confidence ellipsoids; the least favorable one is closest to the origin]
Proof via convex analysis
let (a^\star, B^\star) be least favorable: \min_{(a,B) \in U} a^T B^{-1} a = a^{\star T} B^{\star -1} a^\star
by the minimax inequality

\max_{x \neq 0} \min_{(a,B) \in U} \frac{x^T a}{\sqrt{x^T B x}} \leq \min_{(a,B) \in U} \max_{x \neq 0} \frac{x^T a}{\sqrt{x^T B x}} = \sqrt{a^{\star T} B^{\star -1} a^\star}

x^\star = B^{\star -1} a^\star gives a lower bound:

\min_{(a,B) \in U} \frac{x^{\star T} a}{\sqrt{x^{\star T} B x^\star}} \leq \max_{x \neq 0} \min_{(a,B) \in U} \frac{x^T a}{\sqrt{x^T B x}} \leq \sqrt{a^{\star T} B^{\star -1} a^\star}

if \min_{(a,B) \in U} \frac{x^{\star T} a}{\sqrt{x^{\star T} B x^\star}} = \frac{x^{\star T} a^\star}{\sqrt{x^{\star T} B^\star x^\star}} = \sqrt{a^{\star T} B^{\star -1} a^\star}, then minimax equality must hold
(a^\star, B^\star) is optimal for \min_{(a,B) \in U} a^T B^{-1} a, so

2 x^{\star T} (a - a^\star) - x^{\star T} (B - B^\star) x^\star \geq 0, \quad \forall (a, B) \in U

with x^\star = B^{\star -1} a^\star (since z^\star minimizes a convex differentiable fct. f over a convex set C iff \nabla f(z^\star)^T (z - z^\star) \geq 0 \; \forall z \in C)
we conclude (a^\star, B^\star) is optimal for \min_{(a,B) \in U} (x^{\star T} a)^2 / x^{\star T} B x^\star, since its optimality condition is also

2 x^{\star T} (a - a^\star) - x^{\star T} (B - B^\star) x^\star \geq 0, \quad \forall (a, B) \in U

(a^\star, B^\star) must then be optimal for \min_{(a,B) \in U} x^{\star T} a / (x^{\star T} B x^\star)^{1/2}, since \min_{(a,B) \in U} x^{\star T} a / (x^{\star T} B x^\star)^{1/2} > 0
Fisher linear discriminant analysis
x \sim N(\mu_x, \Sigma_x), y \sim N(\mu_y, \Sigma_y), independent
find w that maximizes \mathrm{Prob}(w^T x > w^T y)
x - y \sim N(\mu_x - \mu_y, \Sigma_x + \Sigma_y), so

\mathrm{Prob}(w^T x > w^T y) = \Phi\!\left( \frac{w^T (\mu_x - \mu_y)}{\sqrt{w^T (\Sigma_x + \Sigma_y) w}} \right)

equivalent to maximizing

\frac{w^T (\mu_x - \mu_y)}{\sqrt{w^T (\Sigma_x + \Sigma_y) w}} = f(\mu_x - \mu_y, \Sigma_x + \Sigma_y, w)

proposed by Fisher in the 1930s
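A minimal numpy sketch of the nominal Fisher discriminant, using assumed statistics (not the talk's data):

```python
import numpy as np

# Fisher discriminant via the fractional-function maximizer:
# w* = (Sigma_x + Sigma_y)^{-1} (mu_x - mu_y)
mu_x = np.array([1.0, 2.0])
mu_y = np.array([-1.0, 0.0])
Sig_x = np.array([[1.0, 0.2], [0.2, 1.0]])
Sig_y = np.array([[2.0, -0.3], [-0.3, 1.5]])

a = mu_x - mu_y
B = Sig_x + Sig_y
w_star = np.linalg.solve(B, a)

def disc(w):
    """Discrimination statistic w^T(mu_x - mu_y) / sqrt(w^T(Sig_x + Sig_y)w)."""
    return a @ w / np.sqrt(w @ B @ w)

# w* attains the optimal value sqrt(a^T B^{-1} a)
assert np.isclose(disc(w_star), np.sqrt(a @ w_star))
```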
Worst-case Fisher linear discrimination
uncertain statistics: (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V \subseteq \mathbf{R}^n \times \mathbf{R}^n \times \mathbf{S}^n_{++} \times \mathbf{S}^n_{++}
V convex and compact; \mu_x \neq \mu_y for each (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V
worst-case discrimination probability (for fixed w):

\min_{(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V} \mathrm{Prob}(w^T x > w^T y)

can compute via convex optimization
Robust Fisher linear discriminant analysis
worst-case discrimination probability maximization: find w that maximizes the worst-case discrimination probability

\min_{(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V} \mathrm{Prob}(w^T x > w^T y)

equivalent to maximizing (over w)

\min_{(\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V} \frac{w^T (\mu_x - \mu_y)}{\sqrt{w^T (\Sigma_x + \Sigma_y) w}}

our max-min problem, with a = \mu_x - \mu_y, B = \Sigma_x + \Sigma_y, and
U = \{ (\mu_x - \mu_y, \Sigma_x + \Sigma_y) : (\mu_x, \mu_y, \Sigma_x, \Sigma_y) \in V \}
Example
x, y \in \mathbf{R}^2; only \mu_y is uncertain, \mu_y \in E
[figure: nominal optimal and robust optimal discriminating lines, with \mu_x, the uncertainty ellipsoid E around \mu_y, and the origin]
Results

                    P_nom   P_wc
  nominal optimal    0.99    0.87
  robust optimal     0.98    0.91

P_nom: nominal discrimination probability
P_wc: worst-case discrimination probability
Matched filtering

y(t) = s(t) a + v(t) \in \mathbf{R}^n

- s(t) \in \mathbf{R} is the desired signal; y(t) \in \mathbf{R}^n is the received signal
- v(t) \sim N(0, \Sigma) is noise
filtered output with weight vector w \in \mathbf{R}^n:

z(t) = w^T y(t) = s(t) w^T a + w^T v(t)

standard matched filtering: choose w to maximize the (amplitude) signal-to-noise ratio (SNR)

\frac{w^T a}{\sqrt{w^T \Sigma w}}
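A minimal numpy sketch of the standard matched filter; the signature a is taken from the later example, while Σ here is an assumed diagonal covariance for illustration:

```python
import numpy as np

# matched filter: w* = Sigma^{-1} a maximizes SNR w^T a / sqrt(w^T Sigma w),
# with optimal SNR sqrt(a^T Sigma^{-1} a)
a = np.array([2.0, 3.0, 2.0, 2.0])        # signal signature (from the example)
Sigma = np.diag([1.0, 2.0, 1.5, 1.0])     # assumed noise covariance

def snr(w):
    return w @ a / np.sqrt(w @ Sigma @ w)

w_mf = np.linalg.solve(Sigma, a)          # matched filter weights
snr_opt = np.sqrt(a @ w_mf)               # = sqrt(a^T Sigma^{-1} a)

assert np.isclose(snr(w_mf), snr_opt)
# naive equal weights do no better
assert snr(np.ones(4)) <= snr_opt
```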
Robust matched filtering
uncertain data: (a, \Sigma) \in U \subseteq (\mathbf{R}^n \setminus \{0\}) \times \mathbf{S}^n_{++} (convex and compact)
worst-case SNR for fixed w:

\min_{(a, \Sigma) \in U} \frac{w^T a}{\sqrt{w^T \Sigma w}}

robust matched filtering: find w that maximizes the worst-case SNR
can be solved via convex optimization & the minimax theorem for x^T a / \sqrt{x^T B x}
Example
a = (2, 3, 2, 2) is fixed (no uncertainty)
uncertain \Sigma has the sign/interval pattern (upper triangle)

  [ 1  +  -  +
       1  ?  -
          1  ?
             1 ]

- + means \Sigma_{ij} \in [0, 1]
- - means \Sigma_{ij} \in [-1, 0]
- ? means \Sigma_{ij} \in [-1, 1]
(and of course \Sigma \succeq 0)
we take the nominal noise covariance as (upper triangle)

  \bar\Sigma = [ 1  .5  -.5   .5
                     1    0  -.5
                          1    0
                               1 ]
Results

                    nominal SNR   worst-case SNR
  nominal optimal       5.5            3.0
  robust optimal        4.9            3.6

least favorable covariance (upper triangle):

  [ 1   0  -.38   .12
        1   .41  -.74
            1     .23
                  1 ]
Extension of minimax theorem
U convex, compact; X a convex cone

\max_{x \in X} \min_{(a,B) \in U} \frac{a^T x}{\sqrt{x^T B x}} = \min_{(a,B) \in U} \max_{x \in X} \frac{a^T x}{\sqrt{x^T B x}}

provided there exists x \in X s.t. a^T x > 0 for all a with (a, B) \in U, i.e., LHS > 0
Computing saddle point via convex optimization
convexity of the min-max problem:

\min_{(a,B) \in U} \max_{x \in X} \frac{a^T x}{\sqrt{x^T B x}} = \min_{(a,B) \in U} \min_{\lambda \in X^*} \left[ (a + \lambda)^T B^{-1} (a + \lambda) \right]^{1/2} = \left[ \min_{(a,B) \in U, \; \lambda \in X^*} (a + \lambda)^T B^{-1} (a + \lambda) \right]^{1/2}

where X^* is the dual cone of X
if (a^\star, B^\star, \lambda^\star) is optimal for the min-max problem, (x^\star, a^\star, B^\star) with x^\star = B^{\star -1} (a^\star + \lambda^\star) is a saddle point:

\frac{x^T a^\star}{\sqrt{x^T B^\star x}} \leq \frac{x^{\star T} a^\star}{\sqrt{x^{\star T} B^\star x^\star}} \leq \frac{x^{\star T} a}{\sqrt{x^{\star T} B x^\star}}, \quad \forall x \in X, \; (a, B) \in U
A key lemma

\max_{x \in X} \frac{(a^T x)^2}{x^T B x} = \min_{\lambda \in X^*} (a + \lambda)^T B^{-1} (a + \lambda)

using homogeneity of (a^T x)^2 / x^T B x, write the LHS as

\max_{x \in X, \; a^T x = 1} \frac{1}{x^T B x} = \left[ \min_{x \in X, \; a^T x = 1} x^T B x \right]^{-1}

by convex duality

\min_{x \in X, \; a^T x = 1} x^T B x = \max_{\lambda \in X^*, \; \mu \in \mathbf{R}} \left[ -(1/4)(\lambda + \mu a)^T B^{-1} (\lambda + \mu a) + \mu \right]

defining \tilde\lambda = \lambda / \mu and optimizing over \mu, the RHS becomes

\min_{x \in X, \; a^T x = 1} x^T B x = \left[ \min_{\tilde\lambda \in X^*} (a + \tilde\lambda)^T B^{-1} (a + \tilde\lambda) \right]^{-1}
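The lemma can be checked numerically for X = R^n_+, which is self-dual (X* = R^n_+). A projected-gradient sketch with assumed data (a is taken componentwise positive so a^T x ≥ 0 on the cone):

```python
import numpy as np

# key lemma for X = R^n_+ (self-dual):
# max_{x >= 0} (a^T x)^2 / x^T B x = min_{lam >= 0} (a+lam)^T B^{-1} (a+lam)
rng = np.random.default_rng(3)
n = 3
a = np.array([1.0, 2.0, 0.5])
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)
Binv = np.linalg.inv(B)

lam = np.ones(n)
for _ in range(5000):
    # projected gradient on the convex dual problem
    lam = np.clip(lam - 0.05 * 2 * Binv @ (a + lam), 0.0, None)

rhs = (a + lam) @ Binv @ (a + lam)
x_star = Binv @ (a + lam)                 # maximizer on the cone (up to scale)
assert np.all(x_star >= -1e-6)

def g(x):
    return (a @ x) ** 2 / (x @ B @ x)

assert np.isclose(g(np.clip(x_star, 0.0, None)), rhs, rtol=1e-4)
for _ in range(300):
    x = rng.random(n)                     # random x >= 0
    assert g(x) <= rhs + 1e-8
```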
Asset allocation
- n risky assets with (single-period) returns a \sim N(\mu, \Sigma)
- portfolio w \in W (convex); \mathbf{1}^T w = 1 for all w \in W
w^T a \sim N(w^T \mu, w^T \Sigma w), so the probability of beating the risk-free asset with return \mu_{rf} is

\mathrm{Prob}(a^T w > \mu_{rf}) = \Phi\!\left( \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} \right)

maximized by the w \in W that maximizes the Sharpe ratio

\frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}}

(called the tangency portfolio w_{tp})
we're only interested in the case when the maximum Sharpe ratio is \geq 0
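For the unconstrained case W = {w : 1^T w = 1}, the tangency portfolio has the familiar closed form w_tp ∝ Σ^{-1}(μ − μ_rf 1). A numpy sketch with assumed market data:

```python
import numpy as np

# tangency portfolio for W = {w : 1^T w = 1} (no other constraints)
mu = np.array([0.10, 0.14, 0.08])
mu_rf = 0.04
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.06]])

b = mu - mu_rf                       # excess returns, a = mu - mu_rf * 1
y = np.linalg.solve(Sigma, b)        # Sigma^{-1} (mu - mu_rf 1)
w_tp = y / y.sum()                   # normalize so 1^T w = 1

def sharpe(w):
    return (mu @ w - mu_rf) / np.sqrt(w @ Sigma @ w)

# tangency portfolio attains the maximum Sharpe ratio sqrt(b^T Sigma^{-1} b)
assert np.isclose(sharpe(w_tp), np.sqrt(b @ y))
```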
Interpretation via Chebyshev bound
suppose \mathbf{E} a = \mu, \mathbf{E}(a - \mu)(a - \mu)^T = \Sigma (otherwise arbitrary)
\mathbf{E} a^T w = \mu^T w, \mathbf{E}(a^T w - \mu^T w)^2 = w^T \Sigma w, so by the Chebyshev bound

\mathrm{Prob}(a^T w \geq \mu_{rf}) \geq \Psi\!\left( \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} \right), \quad \Psi(u) = \frac{u_+^2}{1 + u_+^2}

\Psi increasing (bound is tight)
maximized by the tangency portfolio
[figure: efficient frontier and capital market line in the risk-return plane; the CML passes through (0, \mu_{rf}) with slope equal to the maximum Sharpe ratio and touches the EF at the tangency portfolio]
EF: efficient frontier; CML: capital market line
Asset allocation with a risk-free asset
affine combination of the risk-free asset and a risky portfolio w \in W:

x = ((1 - \theta) w, \theta)

\theta is the amount of the risk-free asset
Tobin's two-fund separation theorem: the risk s and return r of any x = ((1 - \theta) w, \theta) with w \in W cannot lie above the capital market line (CML)

r = \mu_{rf} + S s, \quad S = \max_{w \in W} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}}

the CML is formed by affine combinations of two portfolios: the tangency portfolio (w_{tp}, 0) and the risk-free portfolio (0, 1)
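A small numpy check of the separation argument: mixing any risky portfolio w with the risk-free asset traces a line through (0, μ_rf) with slope equal to w's Sharpe ratio, which never exceeds S (market data assumed, as before):

```python
import numpy as np

mu = np.array([0.10, 0.14, 0.08])
mu_rf = 0.04
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.06]])

w = np.array([0.2, 0.5, 0.3])                 # any risky portfolio, 1^T w = 1
sr_w = (mu @ w - mu_rf) / np.sqrt(w @ Sigma @ w)

b = mu - mu_rf
S = np.sqrt(b @ np.linalg.solve(Sigma, b))    # max Sharpe ratio = CML slope

for theta in [0.0, 0.25, 0.5, 0.75]:
    risk = (1 - theta) * np.sqrt(w @ Sigma @ w)
    ret = (1 - theta) * (mu @ w) + theta * mu_rf
    # the mix lies on the line through (0, mu_rf) with slope sr_w ...
    assert np.isclose(ret, mu_rf + sr_w * risk)
    # ... hence on or below the CML
    assert ret <= mu_rf + S * risk + 1e-12
```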
Worst-case Sharpe ratio
uncertain statistics: (\mu, \Sigma) \in V \subseteq \mathbf{R}^n \times \mathbf{S}^n_{++} (convex and compact)
for fixed w \in W, find the worst-case statistics by computing the worst-case SR

\min_{(\mu, \Sigma) \in V} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} = \min_{(\mu, \Sigma) \in V} \frac{(\mu - \mu_{rf} \mathbf{1})^T w}{\sqrt{w^T \Sigma w}}

... can be computed via convex optimization
related to the worst-case probability of beating the risk-free asset:

\min_{(\mu, \Sigma) \in V} \mathrm{Prob}(a^T w > \mu_{rf}) = \Phi\!\left( \min_{(\mu, \Sigma) \in V} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} \right)
Worst-case Sharpe ratio maximization
find the portfolio that maximizes the worst-case SR:

maximize \min_{(\mu, \Sigma) \in V} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} subject to w \in W

this portfolio (called the robust tangency portfolio w_{rtp}) maximizes
- the worst-case probability of beating the risk-free asset

  \min_{(\mu, \Sigma) \in V} \mathrm{Prob}(w^T a > \mu_{rf}), \quad a \sim N(\mu, \Sigma)

- the worst-case Chebyshev lower bound on \mathrm{Prob}(a^T w \geq \mu_{rf})

  \min_{(\mu, \Sigma) \in V} \inf \{ \mathrm{Prob}(a^T w \geq \mu_{rf}) : \mathbf{E} a = \mu, \; \mathbf{E}(a - \mu)(a - \mu)^T = \Sigma \}
Solution via minimax
\mathbf{1}^T w = 1 for all w \in W, so

\frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} = \frac{(\mu - \mu_{rf} \mathbf{1})^T w}{\sqrt{w^T \Sigma w}}, \quad w \in W

can use the minimax theorem for a^T x / \sqrt{x^T B x}, with
a = \mu - \mu_{rf} \mathbf{1}, B = \Sigma, X = \mathbf{R}_+ W (the cone generated by W),
U = \{ (\mu - \mu_{rf} \mathbf{1}, \Sigma) : (\mu, \Sigma) \in V \}
Minimax result for Sharpe ratio
suppose there exists a portfolio w s.t. \mu^T w > \mu_{rf} for all (\mu, \Sigma) \in V
strong minimax property:

0 < \max_{w \in W} \min_{(\mu, \Sigma) \in V} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} = \min_{(\mu, \Sigma) \in V} \max_{w \in W} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}}

a saddle point (w^\star, \mu^\star, \Sigma^\star) can be computed via convex optimization:

\frac{\mu^{\star T} w - \mu_{rf}}{\sqrt{w^T \Sigma^\star w}} \leq \frac{\mu^{\star T} w^\star - \mu_{rf}}{\sqrt{w^{\star T} \Sigma^\star w^\star}} \leq \frac{\mu^T w^\star - \mu_{rf}}{\sqrt{w^{\star T} \Sigma w^\star}}, \quad \forall w \in W, \; (\mu, \Sigma) \in V
Saddle-point property
[figure: efficient frontier for (\mu^\star, \Sigma^\star) and the risk-return set P(w^\star), tangent to the robust CML at (\sqrt{w^{\star T} \Sigma^\star w^\star}, \mu^{\star T} w^\star)]

P(w^\star) = \{ (\sqrt{w^{\star T} \Sigma w^\star}, \mu^T w^\star) : (\mu, \Sigma) \in V \} is the risk-return set of w^\star
Example
7 risky assets with nominal returns \bar\mu_i, variances \bar\sigma_i^2, correlation matrix \bar\Omega; \mu_{rf} = 3%
[figure: bar charts of the nominal returns \bar\mu_i (scale to 10%) and volatilities \bar\sigma_i (scale to 30%) for assets 1-7]

\bar\Omega (upper triangle):

  1.00  .07  .12  .43  .11  .24  .25
        1.00  .73  .14  .09  .28  .10
              1.00  .14  .5   .52  .13
                    1.00  .04  .35  .38
                          1.00  .7   .04
                                1.00  .09
                                      1.00
Example
total short position is limited to 30% of total long position
mean uncertainty model:

|\mathbf{1}^T \mu - \mathbf{1}^T \bar\mu| \leq 0.1 \, |\mathbf{1}^T \bar\mu|, \quad |\mu_i - \bar\mu_i| \leq 0.2 \, |\bar\mu_i|, \quad i = 1, \ldots, 7

covariance uncertainty model:

\|\Sigma - \bar\Sigma\|_F \leq 0.1 \, \|\bar\Sigma\|_F, \quad |\Sigma_{ij} - \bar\Sigma_{ij}| \leq 0.2 \, |\bar\Sigma_{ij}|, \quad i, j = 1, \ldots, 7

\bar\Sigma is the nominal covariance
Results

               nominal SR   worst-case SR
  nominal MP      1.56          0.56
  robust MP       1.23          0.77

               P_nom   P_wc
  nominal MP    0.92    0.71
  robust MP     0.89    0.77

P_nom: probability of beating the risk-free asset under the nominal statistics (\bar\mu, \bar\Sigma)
P_wc: probability of beating the risk-free asset under the worst-case statistics
Two-fund separation under model uncertainty
robust two-fund separation theorem: the risk-return set of any x = ((1 - \theta) w, \theta) with w \in W cannot lie above the robust capital market line (RCML)

r = \mu_{rf} + S s

with slope

S = \max_{w \in W} \min_{(\mu, \Sigma) \in V} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}} = \min_{(\mu, \Sigma) \in V} \max_{w \in W} \frac{\mu^T w - \mu_{rf}}{\sqrt{w^T \Sigma w}}

the RCML is formed by affine combinations of two portfolios: the robust tangency portfolio (w_{rtp}, 0) and the risk-free portfolio (0, 1)
for V = \{(\bar\mu, \bar\Sigma)\} (a singleton), this reduces to Tobin's two-fund separation theorem
Two-fund separation under model uncertainty
[figure: risk-return sets P(0, 1) = \{(0, \mu_{rf})\}, P(0.5 w, 0.5), P(w, 0), and P(1.5 w, -0.5), all lying on or below the RCML with slope S]

P((1 - \theta) w, \theta) is the risk-return set of ((1 - \theta) w, \theta)
Summary & comments
- the function f(a, B, x) = a^T x / \sqrt{x^T B x} comes up a lot
- what was known (Verdú & Poor): simple minimax result, no constraints on x, product form for U
- what is new (Kim, Magnani & Boyd): general minimax theorem; convex optimization method for computing the saddle point
- a minimax theorem for \mathrm{Tr}(AX) / \mathrm{Tr}(BX) holds (with conditions)
- a minimax theorem for x^T A x / x^T B x is generally false
References
- Kim and Boyd, "Two-fund separation under model uncertainty," in preparation.
- Kim, Magnani, and Boyd, "Robust Fisher discriminant analysis," available at www.stanford.edu/~boyd/research.html.
- Verdú and Poor, "On minimax robustness: A general approach and applications," IEEE Transactions on Information Theory, 30(2):328-340, 1984.
- Sion, "On general minimax theorems," Pacific Journal of Mathematics, 8(1):171-176, 1958.