Bounding Probability of Small Deviation: A Fourth Moment Approach. December 13, 2007

Size: px
Start display at page:

Download "Bounding Probability of Small Deviation: A Fourth Moment Approach. December 13, 2007"

Transcription

1 Bounding Probability of Small Deviation: A Fourth Moment Approach Simai He, Jiawei Zhang, and Shuzhong Zhang December, 007 Abstract In this paper we study the problem of bounding the value of the probability distribution function of a random variable X at E[X] + a where a is a small quantity in comparison with E[X], by means of the second and the fourth moments of X In this particular context, many classical inequalities yield only trivial bounds By studying the primal-dual moments-generating conic optimization problems, we obtain upper bounds for Prob {X E[X] + a}, Prob {X 0}, and Prob {X a} respectively, where we assume the knowledge of the first, second and fourth moments of X These bounds are proved to be tightest possible As application, we demonstrate that the new probability bounds lead to a substantial sharpening and simplification of a recent result and its analysis by Feige ([7], 006); also, they lead to new properties of the distribution of the cut values for the max-cut problem We expect the new probability bounds to be useful in many other applications Keywords: probability of small deviation, fourth moment of a random variable, sum of random variables MSC subject classification: 60E5, 78M05, 60G50 Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong smhe@secuhkeduhk Department of Information, Operations, and Management Sciences, Stern School of Business, New York University, New York, USA jzhang@sternnyuedu Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong zhang@secuhkeduhk Research supported by Hong Kong RGC Earmarked Grants CUHK8505 and CUHK806

2 Introduction For a random variable X R, we consider the problem of upper bounding Prob {X E[X] + a} () for a given real a This problem has been studied extensively in the literature Based on available information about the distribution of X, various inequalities have been developed, including, the well-known Markov inequality and Chebyshev inequality Such inequalities () and () have been extremely useful However, these two inequalities by themselves could sometimes be too weak to yield useful results especially when a is small or zero This motivates us to develop stronger probability inequalities that can handle small deviations Our results In Section, we start our investigation by developing upper bounds for Prob {X 0} that are relatively simple functions of the first, second, and fourth moments of X In particular, we prove that, for any v > 0 Prob {X 0} ( 9 ( ) M + M v v M ) v () Here and throughout the paper, we denote M m = E[X m ] The above result is in Theorem of the current paper The bound provided by () has a relatively simple closed-form expression, and we have the freedom to choose any v > 0 in the bound Therefore, it is quite convenient to use this bound as long as the information about M, M, and M is available The study of this type of probability bounds is motivated by a lemma used in He et al [9], which is a special case of () when M = 0 with a specific choice of v The assumptions on the inequality () is minimal In fact, we do not even require the assumption that E[X] 0 That is to say, we can estimate the probability that X 0 even when E[X] 0 In fact, inequality () is non-trivial ie, the right hand side is less than, as long as E[X] < 0 or E[X] E[X ] E[X ] This is in contrast to many other probability inequalities in the literature, as we shall see in the next subsection The bound provided by () however, is not necessarily tight It is of interest to know whether or not the bound can be further improved We settle this issue by presenting in Theorem 8 a tight upper bound, which is thus the best possible bound, given the moments information As it turns out, the bound () is a very good one, in view of the tight bound; it is even tight under a certain condition After settling the issue of the probability bound for Prob {X 0}, it is natural to consider the bound for Prob {X a}, using the same moments information This extension is useful and is nontrivial to establish When E[X] = 0, we are able to provide a tight bound for Prob {X a}, using the information of M and M The result is presented in Theorem

3 Of course, inequality () may not be immediately applicable if the second and the fourth moments of X is not directly available Fortunately, in many applications, as we shall demonstrate in this paper, it is relatively straightforward to compute or bound the second and the fourth moments In Section and Section, we provide several examples to demonstrate the applicability of Theorem Our first example regards the sum of n independent random variables In particular, given n independent random variables, X, X,, X n, we provide upper bounds on { n [ n ] } Prob X i E X i + a, i =,, m i= i= The bounds are particularly useful when a is a relatively small non-negative real Here the random variables X i could be bounded from both sides, or from below only As a special case of this result, we obtain the following bound If X i is non-negative with expectation, then { n } Prob X i n i= This strengthens the main result of a recent paper by Feige [7] In [7], a weaker upper bound of / is proved by using a completely different approach, and the proof is considerably more involved and lengthy In Section we also apply Theorem to the well-known weighted maximum-cut problem Given an undirected graph G = (V, E) where each edge (u, v) has a weight w uv, we wish to partition the vertices of G into two sets S and S so as to maximize the total weight of the edges (u, v) such that u S and v S A simple solution to this problem is to independently and equiprobably assign each vertex of G to either S or S We denote the total weight of edges with end-points in different sets by W It is clear that the expected value of the W is exactly (u,v) E w u,v By applying Theorem, we can show that Prob W (u,v) E w u,v > 5 and Prob W ( ) V (u,v) E w u,v > % Both bounds seem to be new Furthermore, the second bound implies ( that) for any graph, there exists a cut so that the total weight of edges in the cut is at least V (u,v) E w u,v

4 Related Literature In the literature, there are several probability inequalities based on moment information example, if X assumes only non-negative values, then For Prob {X a} E[X] a () This is the well-known Markov inequality and gives the tightest possible bound when we know only that X is non-negative and has a given expectation If the standard deviation of X, denoted by σ, is also available, and t > 0, then we have Prob {X E[X] + tσ} + t () This inequality is often referred to as the (one-sided) Chebyshev inequality Both inequalities () and () have been extremely useful If we know the first three moments of X, it is shown in Bertsimas and Popescu [] that, Prob {X > ( + δ)e[x]} min ( C M C M +δ, +δ +δ ) DM, if δ > C DM +(C M δ) M, D M +(+δ)(c M δ) D M +(+C M )(C M δ), if δ C M, where CM = M M and D M M = M M M, and this bound is tight Tight bounds on Prob {X < M ( δ)m } and Prob { X M > δm } are also provided in [] These inequalities are potentially useful to bound small deviation probability as well, ie, when δ is small However, we noticed that in several applications that we consider in this paper, it is harder to estimate M than M Furthermore, the bound provided by inequality (5) could be as weak as Markov s and Chebyshev s bounds, for instance, for the problem considered by Feige [7] We shall discuss this in more details later Zelen [7] showed that, if the first four moments of X are known, then (5) Prob {X E[X] + tσ} ( + t + (t tκ ) ) κ κ for t κ + κ +, (6) where κ m = M m σ m Let There are also probability inequalities that uses absolute moments of the random variable X Cantelli [] showed that for m > 0, Prob { X E[X] a} ν m = E [(X E[X])] m ν m ν m ν m ν m + (a m ν m ) for a ( νm ν m ) /m

5 When m =, the above inequality is reduced to the well-known (two-sided) Chebyshev inequality Von Mises [] proved that, for m > k > 0, Prob { X a} J m ν m J m a m for a where J is the root, different from a, of the equation ( νm ν k ) /(m k), (J m a m )/(J k a k ) = (ν m a m )/(ν k a k ) Unfortunately, it is clear from the conditions provided in the above inequalities proved by Zelen [7], Cantelli [], and Von Mises [], that none of them is applicable for bounding probabilities when the deviation is very small In a recent paper, He et al [9] studied SDP relaxations for certain quadratic optimization problems The main results are to establish the gap between the SDP relaxations and the quadratic optimization problems As a key to their main results, they established the following inequality: Prob {X E[X]} 9 0 σ E[(X E[X]) ] This inequality is a special case of Theorem The current paper is partly motivated by [9] Our paper is also related to Berger [], which uses the fourth moment information to bound the absolute value of a random variable More specifically, it is shown in [] that, for all q > 0, E[ X ] ( ) E[X ] E[X ] q q This result has been used by Berger to bound the absolute value of a weighted sum of {+, } unbiased random variables, and achieve tight bounds for the total discrepancy of a set system Our results can be viewed as solutions to a special class of moment problems Moment problems concern about deriving bounds on the probability that a certain random variable belongs in a given set, given information on some of its moments The study on moment problems has a long history; see Bertsimas and Popescu [] for a brief review of this area The tight bounds derived in our paper use an optimization method and duality theory This duality approach was proposed independently and simultaneously by Isii [0] and Karlin [] Bertsimas and Popescu [] show that, for univariate random variables, the dual of the moment problem can be formulated as a semidefinite program (SDP) This result is important because SDP problem can be solved in polynomial time within any prescribed accuracy They also discuss the complexity of solving the dual moment problem for multivariate random variables Recent results on moment problems can also be found in [5], [5], and [] The work by Bertsimas and Popescu [] seems to have settled the moment problems for univariate random variables, ie, for the given information of the moments, one can compute the desired probability bound efficiently by solving an SDP However, such bounds may not be conveniently used because of the lack of simple closed-form expressions 5

6 The Moment Problem: Duality Approach Let us start our discussion by considering the problem: or equivalently ZP = max Prob {X 0} st E[X] = M E[X ] = M E[X ] = M, ZP = max F ( ) x 0 df (x) st x R df (x) = x R x df (x) = M x R x df (x) = M x R x df (x) = M, where the variable of this infinite dimensional optimization problem is the probability measure F ( ) The dual problem of (7) is given as follows: (7) Z D = min y 0 + M y + M y + M y st g(x) := y 0 + y x + y x + y x {x 0}, x R (8) We first define a feasible solution to the dual problem (8) Lemma For any u > v > 0, let c = (u+v) (u v) > 0 and d = v (u+v) > 0 Define y 0 = cu + du, y = du, y = d cu, y = c (9) Then (y 0, y, y, y ) is feasible to problem (8) if u + v Proof If (y 0, y, y, y ) is defined as in (9), then g(x) = cx + (d cu )x + dux + cu + du = c(x u ) + d(x + u) It is clear that g(x) 0 for all x R It is left to verify that g(x) for all x 0 We first observe that g(0) = u + u v u v (u + v) (u v) Thus, g(0) is reduced to v + uv u 0, which is true by the assumption that u + v Therefore, g(0) Notice that, g(x) = (x + u) (c(x u) + d ) 6

7 Since c(x u) + d > 0, we have g(x) = 0 if and only if x = u < 0 Thus x = u < 0 is the only global minimum solution of g(x), and thus one of the local minimum solutions Since g(x) is a polynomial with order four, it has at most two local minimum solutions, including x = u We denote the other local minimum solution by z If z < 0, then we must have that g(x) is increasing for x 0, and thus g(x) g(0) Therefore, we assume that z > 0 > u If follows that z must be the largest root to g (x) = 0 But g (x) = cx(x u ) + d(x + u) and the largest root to g (x) = 0 is u + u d c Therefore, z = u u + d c = u u + v(u v) = u v + u = v The last equality holds since u + v Now it is straightforward to verify that g(z) = g(v) = Finally, we observe that the global minimum solution to g(x) in [0, ) is either x = 0 or x = z Therefore, g(x) g(0) = g(z) = for all x 0 This completes the proof In Lemma, if we choose u = + v, then we have Corollary For any v > 0, define y 0 =, y = 8 9 ( )v, y = ( )v, y = 9 ( )v (0) Then (y 0, y, y, y ) is feasible to problem (8) with an objective value ( 9 ( ) M + M v v M ) v Corollary immediately leads to our first main result Theorem For any v > 0, Prob {X 0} 9 ( ) ( E[X] ) + E[X ] v v E[X ] v () In Theorem, we have the freedom to choose any v > 0 In particular, we could choose v that maximizes the function M + M v v M v Such a v can be obtained by solving the equation M v + M v M = 0, 7

8 if a solution exists Notice that, even if we choose the best v, the bound provided in Theorem is not necessarily tight In what follows, we develop a tight bound for Prob {X 0}, given the first, second, and the fourth moments of X We begin with the case where the bound in Theorem is tight Since when M = M or M = M the distribution X can be easily identified by the first two moments, and the result in Theorem 8 easily follows, therefore we assume M > M and M > M for the remaining part of this section and Let V min = α = M M M M ( )M + 7 > 0 Lemma If M /M M /M and α V min, then And the bound is tight Prob {X 0} 9 ( ) sup v>0 M + M 0 () ( E[X] + E[X ] v v E[X ] v Proof The inequality has been established in Theorem We need only to show that the bound is tight In view of Lemma, it is sufficient to find a feasible solution to problem (7) with an objective value that is equal to the right hand side of the bound If M < 0, define f(x) = M x + M x M Then it can be verified that f(x) is strictly increasing when x 0 By assumption, M /M M /M, and thus, ( f ( ) M ) = (M /M M ) 0 M On the other hand, by (), V min is a solution of the equation x ( )M x ( )M = 0 Thus ) f(v min ) = ( )M Vmin ( 7)M M V min + ( )M M = ( )M Vmin + ( )M (( ) )M Vmin + ( )M M = M M + ( )(D M )V min 0 where the last inequality holds because of the assumption that V min ( + )α By the monotonicity of f(x) when x 0, we must have V min ( ) M M Furthermore, there must exist a unique v [V min, ( ) M M ] such that f(v) = 0 For simplicity, in what follows, we assume v satisfy such a condition Also, let u = + v 8

9 We now define a random variable X = 0, with probability p q; v, with probability p := 6 u, with probability q := 9 ( )( M v + M v M v ); M 6 v 9 M v + M 9 v We show X defines a feasible solution to problem (7) First of all, by the fact f(v) = 0, or M = ( M v + M v ), we have q = ( ( ) M v + M ) v and p = M v + M v Therefore q 0 and p 0 since v ( ) M M Furthermore, p + q = ( ) M v + ( ) M v since v V min Therefore, X is indeed a well-defined random variable It is easy to check that ( E[X] = qu + pv = ( )( M v + M ) ( + v ) M v + v + = M, ( E[X ] = ( )( M v + M ) v ) = M, and ( E[X ] = ( )( M v + M ) v ) = M v + M v = M Therefore, X is feasible to problem (7) Finally, since u v > 0, we have ( + ( ) v + ( + ( ) v + Prob {X 0} = Prob {X u} = q = ( ) 9 This completes the proof of the lemma for the case M < 0 M v + M v + ( M v M v M v ) ) v v M v + M v M ) v For the case M 0 the proof is completely parallel, except that the solution for f(v) = 0 exists in range v [V min, M M ] The details are omitted here In order to get a tight bound for cases that are not covered in Lemma, we need to define different primal and dual variables, which are summarized in the following three lemmas ) v 9

10 Lemma 5 If M /M M /M and α V min, then And the bound is tight Prob {X 0} + α + M M + α + M α Proof Define From the assumption α s = M + α + αm z = α + M u = s+α v = s α V min, we have (5 + )α M α M 0 It follows that s = M + α + αm ( + )α and thus u = s + α + s α = + v, which also implies that u > v > 0 Thus, by Lemma, the function g(x) = c(x u ) + d(x + u) = cx + (d cu )x + dux + cu + du, with c = (u+v) (u v) > 0 and d = v > 0, (implicitly) defines a feasible solution to problem (u+v) (8) The corresponding dual objective value is where we use the fact that On the other hand, we define cm + (d cu )M + dum + cu + du = s + z s, X = { M = s α M α M = (M M )α + M c = s α d = s s α u (< 0), v (> 0), with probability q := s z with probability p := s+z We shall show that X is a feasible solution to problem (7) s ; s It is obvious that p + q = Also, by the fact that M M, we have s = M + α + αm α + M = z 0

11 Therefore, p, q 0 Thus, X is a well-defined random variable Furthermore, E[X] = uq + vp = s + α = M E[X ] = u q + v (s + α) p = s α M s s α M s + s α s + α + M s + (s α) s + α + M s = M, and E[X ] = u q + v (s + α) s α M (s α) s + α + M p = + 6 s 6 s = (M + αm )(M + α + αm ) αm (M + α + αm ) = (M M )α + M = M = s α M α Finally, Prob {X 0} = Prob {X = v} = s + z s, which is equal to the dual objective value This completes the proof of the Lemma Lemma 6 If M /M M /M and M < 0, then And the bound is tight Prob {X 0} M M Proof The inequality is the well-known Chebyshev inequality, which is known to be tight; see, for example, Bertsimas and Popescu [] Lemma 7 If M /M M /M and M > 0, then the trivial bound Prob {X 0} is actually tight Proof The primal solution X with objective value can be constructed this way: For any t M > 0, define random variable X t as a two point distribution as follows: { 0, with probability M X t = t t, with probability M t We have that EX t = M, EX t = tm and EX t = t M Consider function f(x) = x /M, which is convex when x > 0 Notice that M M, f(m ) = M and f(m ) M, the line passing through (M, M ) and (M, M ) intersect with function f(x) at some t M Thus there exists a p [0, ], such that p(m, M ) + ( p)(t, f(t)) = (M, M )

12 Let Y be the Bernoulli trial which takes the value with probability p, and let X = Y X M + ( Y )X t/m, where Y is independent to X t/m and X M, then (EX, EX, EX ) = p(ex M, EX M, EX M ) + ( p)(ex t/m, EX t/m, EX t/m ) = p(m, M, M ) + ( p)(m, t, f(t)) = (M, M, M ) Because X 0, this gives a feasible solution of the primal problem 7 with objective value Since is an upper bound for ZP, we conclude that Z P = and that X is an optimal primal solution For the dual problem (8), y 0 = y = y = y = 0 is obviously a feasible solution with objective value Because ZD Z P =, this is an optimal dual solution Lemma 6 and Lemma 7 indicate that when the fourth moment of X, ie, M, becomes sufficiently large, then the information will not be useful anymore in bounding the probability that X 0 The following theorem summarizes the results we obtained above Theorem 8 Prob {X 0} M M, if M M, if M M ( ) ( 9 sup v>0 M v + M M ) v v, if M M α+m, if M M +α +M α M M M M and M < 0; and M > 0; () < M and α M V min ; + < M M and α M V min, M where α M, V M M min ( )M + 7 M + M Furthermore, the bound is tight, ie, there exists an X such that the inequality () holds as an equality Now we consider a special case where E[X] = 0 As we have mentioned in the introduction, He et al [9] established the following inequality Prob {X E[X]} 9 0 σ E[(X E[X]) ], which has been a key to study an SDP relaxation for certain class of quadratic optimization problems Here we show that this inequlality can be strengthened by using Theorem Corollary 9 If E[X] = 0, then sup {Prob (X 0)} = X ( ) M M, if M M +, if M +M /M M ;

13 Proof Notice that if M = E[X] = 0, then V min = ( )M and α = M M M condition M M Therefore, The is equivalent to α V min The corollary follows by noting that ( M max v>0 v M ) v = 9 M M By applying Corollary 9, we can obtain a non-trivial bound for the probability X a when E[X] = 0, given the information on M and M Corollary 0 If E[X] = 0 and a 0, then Prob {X a} ( (M + a ) ) M + 6a M + a Proof Let Y be a random random variable independent to X, and X takes only one of the two values, a or a, each with probability half Let Z = X + Y Then Then, by Corollary 9, However, E[Z] = 0 E[Z ] = E[X ] + E[Y ] = M + a E[Z ] = E[X ] + 6E[X ]a + a = M + 6a M + a Prob {Z 0} = The desired inequality follows Prob {Z 0} ( (M + a ) ) M + 6a M + a Prob {X a} + Prob {X a} Prob {X a} The bound proved in Corollary 0 is not tight in general A tight bound is summarized in the following theorem Its proof, which is quite technical and similar to the proof of Theorem 8, is provided in the appendix Theorem Let K = M /M and L = M /a If E[X] = 0 and a 0, Then M M, if K L + +a L ; Prob {X a} M M M M, if K L + a +a L and L < ; + +M /M, if K L + L, L and L K + K + K K+ K ; min{p (v) v a}, otherwise

14 where P (v) = M + M (v + av + a ) + a v + av a + a v + a v + 6av + 9 v + (v + av + a v v) + (a+v) And the bounds are tight Small Deviation Bound for Sum of Independent Random Variables In this section, we consider the problem of bounding the probability of small deviations for sum of independent random variables In particular, consider n independent random variables X, X,, X n each with a mean of zero Let S = n i= X i We are interested in the probability that S < for some given constant For this purpose, we may directly apply Theorem Then we need to estimate E[S ] and E[S ] We may also apply Theorem 8 In this case, we need to estimate E[(S ) ] and E[(S ) ] We demonstrate below how this could be done We consider two cases In the first case, the random variables X i are uniformly bounded from both sides In the second case, we assume that the random variables are uniformly bounded only from below Given two nonnegative constants c and c, define s(c, c ) := max{c + c, c c, c + c c c (c c )} Our first result is summarized below Theorem Consider n independent random variables X, X,, X n Assume that > 0 is a given constant Also assume that E[X i ] = 0 and there exists two nonnegative constants c and c such that c X i c Let S = n i= X i, then where F (, c, c ) = ( ) 9 inf D>0 Prob {S < } F (, c, c ) F (c, c ), () ( 6(D + 9 ) D + (6 + s(c, c ))D + + (D + ) ) D + (6 + s(c, c ))D + and F (c, c ) = ( s(c, c ) + ) s(c, c ) + s(c, c ) +

15 Proof First of all, we can assume without loss of generality that X i follows a two point distribution for every i =,,, n In particular, given that E[X i ] = 0, we assume that there exists a i, b i 0, such that { b a i, with probability i X i = b i, with probability a i +b i a i a i +b i It follows that E[Xi ] = a ib i and E[Xi ] = a ib i (a i a ib i + b i ) Let denote the variance of S by D, ie, D = n i= E[X i ] = n i= a ib i Therefore, E[(S ) ] = D + Furthermore, E[(S ) ] = E[S ] E[S ] + 6 E[S ] + n = E[Xi ]E[Xj ] = n E[Xi ] + 6 E[Xi ] + 6 i= i<j i= i= n ( E[X i ] E[Xi ] (E[Xi ]) ) + D + 6 D + M i= = D + 6 D + + n E[Xi ] + n a i b i (a i + b i a i b i (b i a i )) i= Notice that a i + b i a ib i (b i a i ) is a convex function of a i when b i is fixed, and is convex in b i when a i is fixed Therefore, an optimal solution to the optimization problem max 0 a i c,0 b i c (a i + b i a i b i (b i a i )) is in the set {(0, 0), (0, c ), (c, 0), (c, c )} Thus, we conclude that a i + b i a i b i (b i a i ) s(c, c ) It then follows that E[(S ) ] D + 6 D + + s(c, c )D Thus by Theorem 8, we have for any v > 0 that Prob {S < 0} ( 9 ( ) v + (D + ) v D + 6 D + + s(c, c )D ) v In particular, we choose v such that Then we must have v = ( D + ) D + (6 + s(c, c ))D + Prob {S < 0} ( 9 ( 6(D ) + 9 ) D + (6 + s(c, c ))D + + (D + ) ) D + (6 + s(c, c ))D + F (, c, c ) 5

16 Furthermore, it is clear that F (, c, c ) ( inf ( (D + ) ) ) D>0 D + (6 + s(c, c ))D + ( (s(c, c ) + ) ) s(c, c ) + s(c, c ) + = F (c, c ), where the last inequality uses the fact that if we let t = D + (6 + x)d +, then D+ (D + ) = + xt( t) t + for any x > This completes the proof of the theorem x (x + ), Now we consider the case where the random variables X i are bounded from below only We obtain a similar result as Theorem Theorem Consider n independent random variables X, X,, X n Assume that > 0 is a given constant Also assume that E[X i ] = 0 and there exists a constant c > 0 such that X i c for every i Let S = n i= X i, then for any τ > 0, Prob {X < } e /τ F (, c, τ max(, c)) e /τ F (c, τ max(, c)) (5) Proof Once again, we assume without loss of generality that there exist a i, b i 0, such that { b a i, with probability i X i = a i +b i b i, with probability a i a i +b i By assumption a i c We also assume that without loss of generality that b b b n We consider a fixed τ > 0 and define N = max{0, max{k b k τ(a + a + + a k ), k n}} Let a = N i= a i; if N = 0, then let a = 0 If N < n, then for every i > N, b i b N+ τ N+ i= a i τ(a + a N+ ) τ(a + c ) For any i N, b i b N τa Thus, if N > 0, then { N } N Prob X i = a = Prob {X i = a i } i= = i= N i= ( a ) i a i + b i N e ai/(τa) = e /τ i= 6 N i= ( a ) i a i + τa

17 Let Y = n i=n+ X i Because for each i > N, a i c c(a + ) and b i max(, c)τ(a + ), by Theorem, we know that The proof is completed Prob {S < } { } Prob X i = a Prob {Y < a + } i<n e /τ F (, c, τ max{, c}) e /τ F (c, τ max{, c}) Theorem generalizes an inequality that was proved by Feige [7] In particular, if every X i is non-negative with expectation, then Feige proved that { n } Prob X i n + i= For this special case, Theorem implies a stronger result than the above inequality Corollary Consider n independent random variables X, X,, X n each with mean zero If X i for all i =,,, n, then we have that { n } Prob X i < e /5 ( ) 8 i= Proof We can apply Theorem with c = and = We choose τ = 5 and thus s(c, τ) = 5 In this case, ( ) 6(D + ) 9 F (, c, τ) = inf D>0 9 ( ) D + D + + (D + ) D + D + However, 6(D + ) 9 D + D + + (D + ) D + D + is a decreasing function of D when D 0 By letting D go to infinity, we have that By Theorem, we have F (, c, c ) 9 ( ) = Prob {X < } e /c F (, c, c ) = e /5 ( ), which completes the proof for the corollary 7

18 It would be interesting to see how strong a bound can be obtained if we apply Markov, Chebyshev, and Bertsimas-Popescu s (three-moments) inequality to the problem considered in Corollary Consider the following example, where all the X i s are iid distribution which take value 0 and with probability / of each Then for the random variable X = n i= X i, and δ = n, we have M = n, M = n + n, and M = n + n Since CM = n = δ, when n, the value f (CM, D M, δ) = n n+ approaches Therefore the three moments inequality alone is not good enough to yield a good bound for the problem Applications In many applications and rounding algorithms, the Chernoff type bounds or other similar inequalities can be applied to yield claims of the following spirit: If n > N(δ, ɛ), then for n independent samples X i ( i n) of a random variable X it follows that Prob {max i n X i ( δ)ex} ɛ However it follows from our analysis, the δ can be dropped and we can claim the following: Lemma If a random variable X has kurtosis κ = ) log( ɛ ) many samples, by Theorem, { Prob max X i EX i n E(X EX), then with n = + (E(X EX) ) (κ + } ɛ and { } Prob min X i EX ɛ i n This is to say that, when a distribution s Kurtosis κ can be estimated or upper bounded, then Θ(κ log(/ɛ)) many samples would guarantee that we are able to draw one whose value is at least as good as the expected value of the distribution, with high probability ɛ Proof Because for each i, Prob {X i EX} ( ) κ+, we have { } ( Prob max X i EX ( ) n ( ) exp n( ) i n κ + κ + Thus if n + (κ + ) log( ɛ ), we have Prob {max i n X i EX} ɛ The other inequality is symmetric Also, for sums of independent random variables we have the following: Lemma If X i are independent random variables with EX i = 0, EXi = D, EXi (κ + )(EXi ), then we have that { n } + κ Prob X i 0 n + κ n i= ) 8

19 Proof Let X = n i= X i, D X = Var(X) = nvar(x i ) = nd, τ = (κ + )D Then τ X = EX = n i= EX i + 6 i<j EX i EX j nτ + D X nd = (n n + n(κ + ))D Thus The other inequality follows by symmetry Prob {X EX} ( ) D X τ X + κ n Now we consider the weighted maximum cut problem In this problem, we are given an undirected graph G = (V, E) where each edge (u, v) has a weight w u,v, and the goal is to partition the vertices of G into two sets S and S so as to maximize the total weight of the edges (u, v) such that u S and v S This problem is NP-hard, but admits a polynomial time 0878-approximation algorithm; see Goemans and Williamson [6] Prior to the celebrated result of Goemans and Williamson, the best known approximation ratio for the maximum cut problem was / for the weighted version, and + δ for the unweighted version, where δ denotes the maximum degree of a vertex It is well-known that, a simple /-approximation algorithm can be obtained by independently and equiprobably assigning each vertex of G to either S or S Indeed, if we denote the total weight of edges with end-points in different sets by W, then it is clear that E[W ] = (u,v) E which of course is no less than half of the maximum weight w u,v := W tot (6) Equation (6) has a stronger implication That is, for any graph, there exists a cut so that weight of the cut is at least half of the total weight of the edges of the graph However, two interesting questions remain: There are O( V ) many cuts for a graph Among all the possible cuts, how many of them have a weight larger than W tot /? Is it possible to show that there always exists a cut with a weight higher than αw tot for some α > /? When the graph is unweighted, the answer to the second question is yes with an α = + n and this bound is the best possible; see Haglin and Venkatesan [8] The result is obtained by proving the existence of a matching of certain size, which also gives a linear time algorithm to find a cut with a weight larger than ( + n )W tot Now we shall answer the above two questions for a general weighted graph by using the simple randomized algorithm described earlier, together with the moment bound developed in this paper 9

20 We slightly formalizes the randomized algorithm as follows We define V independent random binary variables X,, X V, so that for each node i V, X i takes value or with probability half Thus, X i = indicate node i is assigned to the set S, and vice versa Then we have For convenience, we also define W = i<j X i X j w i,j Y = W W tot so that E[Y ] = 0 We now estimate the second and the fourth moments of random variable Y It can also be shown that E[Y ] = E w i,j X i X j = wi,j V W tot (7) i<j i<j E[Y ] 5(E[Y ]) (8) Therefore, it follows immediately from Corollary 9 that { Prob W } W tot = Prob { Y 0} ( ) (E[Y ]) E[Y ] 5 Denote = ( E[Y ]/ 5 ) / > E[Y ]/ and let Z = t Y with t 0 Then we have E[Z] = t, E[Z ] = E[Y ] + t, and E[Z ] = E[Y ] t E[Y ] + 6t E[Y ] + t E[Y ] + t (E[Y ]) / (E[Y ]) / + 6t E[Y ] + t E[Y ] + 5t (E[Y ]) / + 6t E[Y ] + t 5E[Y ] + 5t (E[Y ]) / + 6t E[Y ] + t ( = 5 + (5) / t + 6 t + ) 5 5 t (E[Y ]) Thus, by Theorem 8, we have, for any v > 0, Prob {Y t } = Prob {Z 0} 9 ( ) In particular, if we choose v = 0 and t = 00, then Prob {Y t } > % ( E[Z] ) + E[Z ] v v E[Z ] v It follows that Prob { ( W ) } W tot > % V To summarize, we have proved the following: 0

21 Theorem For any weighted graph, the following two statements are true ) Among all possible cuts of the graph, at least 5 > % of them will have a cut value larger than half of the total weight of the edges of the graph ( ) ) There exists a cut whose weight is at least V times the total weight of the edges of the graph References [] A Ben-Tal and A Nemirovski Lectures on Modern Convex Optimization: Analysis, Algorithm, and Engineering Applications MPS/SIAM Ser Optim, SIAM, Philadelphia, 00 [] B Berger The Fourth Moment Method SIAM Journal on Computing 6, pp 88 07, 999 [] FP Cantelli Intorno ad un teorema fundamentale della teoria del rischio Boll Assoc Attuar Ital (Milan), pp, 90 [] D Bertsimas and I Popescu Optimal Inequality in Probability Theory: A Convex Optimization Approach SIAM Journal on Optimzation, 005 [5] D Bertsimas and I Popescu On the Relation Between Option and Stock prices: A Convex Optimization Approach Operation Research 50 No, pp 58 7, 00 [6] MX Goemans and DP Williamson Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming Journal of the ACM, pp 5 5, 995 [7] U Feige On Sums of Independent Random Variables with Unbounded Variances, and Estimating the Average Degree in a Graph SICOMP, (006) [8] DJ Haglin and SM Venkatesan Approximation and Intractability Results for the Maximum Cut Problem and Its Variants IEEE Transactions on Computers 0, pp 0, 99 [9] S He, ZQ Luo, J Nie, and S Zhang Semidefnite Relaxation Bounds for Indefinite Homogeneous Quadratic Optimization Working Paper, 007 [0] K Isii The extrema of probaiblity determined by generalized moments I Bounded random variables Ann Insti Statist Math,, pp 6 68, 960 [] S Karlin and WJ Studden Tchebysheff Systems: With Applications in Analysis and Statistics Pure Appl Math 5, Interscience, John Wiley and Sons, New York, 966 [] JB Lasserre Bounds on measures satisfying moment conditions Ann Appl Probab, pp 7, 00

22 [] R Von Mises The Limits of a Distribution Function if Two Expected Values Are Given Ann Math Statist 0, pp 99 0, 99 [] Yu Nesterov Structure of Non-Negative Polynomial and Optimization Problems Preprint DP 979, Louvain-la-Neuve, Belgium, 997 [5] I Popescu A Semidefinite Programming Approach to Optimal Moment Bounds for Convex Classes of Distributions Mathematics of Operation Research, 005 [6] J Smith Generalized Chebyshev Inequalities: Theory and Applications in Decision Analysis Operations Research,, , 995 [7] M Zelen Bounds on a Distribution Function That Are Functions of Moments to Order Four J Res Nat Bur Stand, 5, pp 77 8, 95 A Proof of Theorem For Theorem, the primal problem is ZP = max Prob {X a} st E[X] = 0 E[X ] = M E[X ] = M, or equivalently ZP = max F ( ) x a df (x) st x R df (x) = 0 x R x df (x) = M x R x df (x)µ = M x R x df (x) = M (A9) Its dual problem in this setting can be written as Z D = min y y + M y + M y st g(x) := y 0 + y x + y x + y x {x a}, x R (A0) Lemma A For any u > v > a, let c = (u+v) (u v) > 0 and d = v (u+v) > 0 Define y 0 = cu + du, y = du, y = d cu, y = c Then (y 0, y, y, y ) is feasible to problem (A0) if u v + v + (a+v) (A)

23 Proof If (y 0, y, y, y ) is defined by (A), then g(x) = cx + (d cu )x + dux + cu + du = c(x u ) + d(x + u) It is clear that g(x) 0 for all x R It is left to verify that g(x) for all x a The condition u v + v + (a+v) implies that We first observe that u uv (v + a) g(a) c = (a u ) + (a + u) (uv v ) (u + v) (u v) = a a (u uv + v ) + av(u uv) + v (v + uv u ) = (a v ) (u uv)(v a) (a v ) (v + a) (v a) = 0 Therefore g(a) Notice that, g(x) = (x + u) (c(x u) + d ) Since c(x u) + d > 0, we have g(x) = 0 if and only if x = u < 0 Thus x = u < 0 is the only global minimum solution of g(x), and thus one of the local minimum solutions Since g(x) is a polynomial with order four, it has at most two local minimum solutions, including x = u We denote the other local minimum solution by z If z < a, then we must have that g(x) is increasing for x a, and thus g(x) g(a) Therefore, we assume that z > 0 > u, and z must be the largest root to g (x) = 0 But g (x) = cx(x u ) + d(x + u) and the largest root to g (x) = 0 is u + u d c Therefore, The last equality holds since z = u u + d c = u u + v(u v) = u v + u = v u v + v + (a + v) Now it is straightforward to verify that g(z) = g(v) = v Finally, we observe that the global minimum solution to g(x) in [0, ) is either x = 0 or x = z Therefore, g(x) g(0) = g(z) = for all x 0 This completes the proof

24 Proof of Theorem When M = M, the distribution has to be { M, with probability X = ; M, with probability, Since Prob {X a} = Theorem holds under this condition { 0, if L < ;, if L, If M > M, the given condition of the moment information implies that the strong duality holds Thus problem (A9) is equivalent to ZP = max F ( ) x a df (x) st x R df (x) = 0 x R x df (x) = M x R x df (x) = M x R x df (x) M (A) Case : When K L + L, we define { M X = a, with probability a, with probability L which is always well defined Furthermore, E[X] = M a L+ + a L L+ = al L+ + a L L+ = 0, L+ ; L+, E[X ] = M a L+ + a L L+ = a L L+ + a L L+ = a L = M, E[X ] = a L L+ + a L L+ = a L(L L + ) = M L (L L + ) M K = M Therefore X is feasible to problem (A) with objective value Prob {X a} = L/(L + ) We now define ( ) ax + M g(x) = a, + M which is feasible to problem (A0) The corresponding dual objective value is ( a a + M ) ( ) M M + a = + M L (L + ) + L (L + ) = L L + Therefore, when K L + L the inequality in Theorem holds and is tight Case : When L < and K L + L, define [ a ] M g(x) = a a (x a ) +, M + M

25 which is a feasible solution of problem (A0) The corresponding dual objective value is = = = ( a ) M a a M + (a M )(M a M ) (M a M ) M + M (a a M + M ) M + (a a M + M ) ( a ) ( M a a (M M a M M a M ) + M + M a a M + M + M a a M + M M M (a a M + M ) ( (M M + (a M ) ) M M a a M + M Now we define v = a M M a M, p = M M M M a +a and v, with probability q = p(+a/v) ; X = v, with probability r = p( a/v) ; a, with probability p ) Since a M M = L M KM ( L)M > 0, the value v is well defined We observe that p = M M M M a + a = M M M M + (M a ), thus 0 < p < The condition L < and K L + L implies that which is equivalent to a (M M ) + (a M ) M ( ) (K ) M 5 ( L) + K L L ( ) ( L) M 5 ( L) L + L (L L + ) = M 5 ( L) L = a M (a M ), a v = a (a M ) a M M ( (a M ) ) = M M ( ) p Since a v + a v p, 5

26 we have q, r 0 Notice that p + q + r =, the distribution X is well defined Furthermore, E[X] = (q r)v + pa = pa + pa = 0, E[X ] = (q + r)v + pa = ( p)v + pa = (a M )(a M M ) + a (M M ) a a M + M = M, E[X ] = (q + r)v + pa = (a M M ) + a (M M ) a a M + M = M Thus X is feasible to problem (A9) Since v = the corresponding dual objective value is define a a + M a M a M < a, Prob {X a} = p = M M M M a + a Therefore, the inequality in Theorem is tight when L < and K L + L Case : When L, K L + L and L K + K + K u = v = M M +M M + M M M M +M M M M = M K++ K = M K+ K p = + +K = K++ K K+ q = +K = K+ K K+ K+ K, Because M M for any distribution, these values are well defined It follows from definition that u > v > 0 From the assumption K L + L and L, we have v = = K + K M M K + + K M L+ L + L L M = L = a The assumption L K + K + K K+ K implies that ( K + K + L ) K + K + K = ( K + + K ) K 6

27 Therefore u(u v) (a + v), which implies that u v + v + (a+v) It follows from Lemma A that the function with and g(x) = c(x u ) + d(x + u) = cx + (d cu )x + dux + cu + du c = (u + v) (u v) = > 0 (K + )M (K + )(K ) d = v (u + v) = K + K M (K + ) K + > 0 defines a feasible solution to problem (A0) Denote t = K + K, then d = cm (t K + ) and u = K++t The corresponding dual objective value is cm + (d cu )M + cu + du = cm (K (K + + t) + t K + (K + + t) + ( = cm t + K + ) t K + K = cm t t + K + = + t (K + ) = p + K + + t ) (t K + ) Now we define X = { u, with probability q; v, with probability p, which is always feasible since K > 0 Furthermore, E[X] = pv qu = 0, E[X ] = pv + qu = pv(u + v) = M, E[X ] = pv + qu = pv(u + v)(u uv + v ) = M K = M Therefore, the inequality in Theorem is tight when L, K L + L and K + K + K K + K L Case : Now we consider the case when L, K L + L and < K + K + K K + K L 7

28 Our main goal is to prove there exists a ˆv a and the corresponding û, so they can generate a feasible dual solution, and also satisfies the following conditions (which are the crucial conditions for the feasibility of the primal solution, and for the primal objective to match the dual objective value): aû M ˆvû; M + M (ˆv + aˆv + a ) + a ˆv + aˆv = (û + ˆv) (û ˆv)(M + aˆv) a + û Define W (K) = S(K) = V (K) = K + K + K K + K + K ; K+ K ; K++ K U(K) = ; u(v) = v + v v t(v) = + (a+v) + (a+v) ; K+ K ; Because K, there exists a unique b such that K = b +, and b K + K + K = (b ) K + + K b, which is a monotonically increasing function of b Since L+ L K = b + L, and L, b, we can conclude that L b If b, then KL b (b + b ) If b <, then KL b + b ( (b ) b ) Therefore, the assumptions L, K L + L and L < W (K) guarantees that KL, and L Also, since b is monotonically increasing to K, we have that W (K) is a monotonically increasing function of K Since W (K) is a continuous function with W () =, there exists a K 0 < K such that W (K 0 ) = L From Lemma A, for any v a > 0, by abusing the notation, let u = u(v) and t = t(v), the function g v (x) = c(x u ) + d(x + u) with d = v and c = is feasible for (u+v) (u+v) (u v) problem (A0) Notice that d = cv(u v) and u uv = (a+v), the corresponding dual objective value is: P (v) = M + M (v(u v) u ) + u + v(u v)u (u + v) (u v) = M + M (v + av + a ) + a v + av (u + v) (u v) 8

29 Let and then we have and f(v) = M + M (v + av + a ) + a v + av h(v) = (u + v) (u v), f (v) = (v + a)(m + av), h (v) = (u + v) (u ) + (u + v) (u v)(u + ) Therefore, = (u + v) ((u v)u + u v) v = (u + v) (a + v) + + = (u + v) (a + t(v)) a + v v + (a+v) + v + v + (a + v) v P (a) = h (a)f(a) h(a)f (a) h (a) = 7a ( M + 6a M + a ) 7a 8a(M + a ) h (a) = 7a (a M M ) h (a) = 7a5 M ( KL) h (a) 0 Also for all v, (u + v)(a + t) (v + a)(a + u) = ( v + t)(a + t) (v + a)(a + v + t) = t a va v = 0 Let { Then because we have Since u 0 v 0, we have that v 0 = V (K 0 ) M ; u 0 = U(K 0 ) M a = L M = W (K 0 ) M, K0 + + K 0 u 0 (u 0 v 0 ) = M K0 u 0 = v 0 v (a + v 0) 9 = u(v 0 ) = (a + v 0)

30 Notice that a + v 0 = S(K 0 ) M and a + t(v 0 ) = a + u 0 v 0 = (S(K 0 ) + K 0 ) M, we have (f(v 0 ) + M K 0 M )h (v 0 ) h(v 0 )f (v 0 ) = (u 0 + v 0 ) M 5/ [(S(K 0 ) + K 0 ) ( K 0 + V (K 0 ) + S(K 0 ) + V (K 0 ) (S(K 0 ) V (K 0 ) ) ) (U(K 0 ) V (K 0 ) )(V (K 0 ) + S(K 0 ))V (K 0 )(U(K 0 ) + W (K 0 )) ] [ = (u 0 + v 0 ) M 5/ (S(K 0 ) + K 0 ) K0 + K 0 K 0 + V (K 0 ) K0 + K 0 V (K 0 )(S(K 0 ) + K 0 )(S(K 0 ) + K 0 + ] K 0 ) [ = (u 0 + v 0 ) M 5/ V (K 0 ) K0 + K 0 (S(K 0 ) + K 0 ) K 0 + (S(K 0 ) + K 0 )(S(K 0 ) + K 0 + ] K 0 ) = 0 Therefore, P (v 0 ) = h (v 0 )f(v 0 ) h(v 0 )f (v 0 ) h (v 0 ) = (K 0 K) h (v 0 )M h (v 0 ) < 0 Define v = ( L + L L )a, and the corresponding u = u(v ) = La Let L + L = r, we have that f(v ) + M (L + L )M = a [ (L L + L) + L ( L + L + 6(L + )r + r (L + ) + ) + ( L + L + (L + )r ) (r L ) ] = a ( (6L + 0L + )r (9L + L + L + ) ) Therefore, (f(v ) + M (L + L )M )h (v ) h(v )f (v ) = (u + v ) a 5 [( (6L + 0L + )r (9L + L + L + ) ) (L + r) (r )(L + r)( + r L )(L + r L )] = (u + v ) a 5 [ (7L + 6L + 5L + 9)r (5L + L + L + 7L + ) (7L + 6L + 5L + 9)r + (5L + L + L + 7L + ) ] = 0, which implies that P (v ) = h (v )f(v ) h(v )f (v ) h (v ) = (L + L L KL )a h (v ) h (v ) 0 0

31 Since L, K 0 and K 0 L, we have that v LV (max(, /L)a LV (K 0 )a = v 0 Since L, we have v a Because P (v) is a continuous function, there exists a ˆv [max(a, v 0 ), v ] such that P (ˆv) = 0 Let û = u(ˆv) and ˆt = t(ˆv) Because u(v) is monotonically increasing function of v, we have that Since Since f (ˆv)h(ˆv) = h (ˆv)f(ˆv), it follows that f(ˆv) M + aˆv aû au = M = u 0 v 0 ûˆv f(ˆv) h(ˆv) = (ˆv + a) f = (ˆv + a) (ˆv) h (ˆv) = (û ˆv )(ˆv + a) a + ˆt (û + ˆv)(a + ˆt) = (ˆv + a)(a + û), we have f(ˆv) M + aˆv = (û ˆv )(ˆv + a) a + ˆt Therefore the corresponding dual objective is ˆP = P (ˆv) = = (û ˆv )(û + ˆv) a + û M + aˆv (a + û)(û + ˆv) By the definition of t and u, it s straightforward to prove that Therefore, (v + au a )(a + u) = ( v + 5av + 5 a v) + (a + av + v )t = (u + v) (u v) M = f(ˆv) + (aˆv + a ˆv + M (ˆv + aˆv + a ) = aˆv + a ˆv + M (ˆv + aˆv + a ) (M + aˆv) (û + ˆv) (û ˆv) a + û = aˆv + a ˆv + M (ˆv + aˆv + a ) (M + aˆv) v + au a = M (ˆv + a (ˆv + a) + aû + aˆv) + aˆv(û ûˆv aû) = M (ˆv + a + û ûˆv aû + aˆv) + aˆv(û ûˆv aû) = M (ˆv + a ) a ˆv + (M + aˆv)(û a)(û ˆv) = M (ˆv + a ) a ˆv + ( ˆP )(û a )(û ˆv ) = (M ( ˆP )û )(ˆv + a ) ˆP a ˆv + ( ˆP )û Now we define distribution X as X = ˆv, with probability q = M ˆP a ( ˆP )û ˆv a ; û, with probability p = ˆP ; a, with probability r = ˆP ˆv +( ˆP )û M ˆv a

32 Because it follows that Therefore, q, r 0 Since aû M ûˆv, û M û a ˆP û M û ˆv f(ˆv) = (û ˆv )(û + ˆv)(M + av) a + û 0, and u > v, we have that ˆP, therefore p 0 and the distribution X is well defined Furthermore, E[X] = qˆv pû + ra = M a ˆP ˆv ( ˆP )û ( ˆv + a ˆP )û = 0, E[X ] = qˆv + pû + râ = M, E[X ] = qˆv + pû + ra = (M ( ˆP )û )(ˆv + a ) ˆP a ˆv + ( ˆP )û = M Therefore, X is feasible to problem (A9) Finally, since ˆv a, the primal objective value is Prob {X a} = q + r = ˆP, which matches the dual objective value for dual feasible solution corresponds to ˆv Therefore, we have that Z P = Z D = ˆP By Lemma A, any v a corresponds to a dual feasible solution with objective value P (v), therefore P (v) ˆP, and ˆP = P (ˆv) = min{p (v) v a}

MIT Algebraic techniques and semidefinite optimization February 14, Lecture 3

MIT Algebraic techniques and semidefinite optimization February 14, Lecture 3 MI 6.97 Algebraic techniques and semidefinite optimization February 4, 6 Lecture 3 Lecturer: Pablo A. Parrilo Scribe: Pablo A. Parrilo In this lecture, we will discuss one of the most important applications

More information

Agenda. Applications of semidefinite programming. 1 Control and system theory. 2 Combinatorial and nonconvex optimization

Agenda. Applications of semidefinite programming. 1 Control and system theory. 2 Combinatorial and nonconvex optimization Agenda Applications of semidefinite programming 1 Control and system theory 2 Combinatorial and nonconvex optimization 3 Spectral estimation & super-resolution Control and system theory SDP in wide use

More information

1 Introduction Semidenite programming (SDP) has been an active research area following the seminal work of Nesterov and Nemirovski [9] see also Alizad

1 Introduction Semidenite programming (SDP) has been an active research area following the seminal work of Nesterov and Nemirovski [9] see also Alizad Quadratic Maximization and Semidenite Relaxation Shuzhong Zhang Econometric Institute Erasmus University P.O. Box 1738 3000 DR Rotterdam The Netherlands email: zhang@few.eur.nl fax: +31-10-408916 August,

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.85J / 8.5J Advanced Algorithms Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 8.5/6.85 Advanced Algorithms

More information

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs

More information

Lecture 8: The Goemans-Williamson MAXCUT algorithm

Lecture 8: The Goemans-Williamson MAXCUT algorithm IU Summer School Lecture 8: The Goemans-Williamson MAXCUT algorithm Lecturer: Igor Gorodezky The Goemans-Williamson algorithm is an approximation algorithm for MAX-CUT based on semidefinite programming.

More information

Nonnegative k-sums, fractional covers, and probability of small deviations

Nonnegative k-sums, fractional covers, and probability of small deviations Nonnegative k-sums, fractional covers, and probability of small deviations Noga Alon Hao Huang Benny Sudakov Abstract More than twenty years ago, Manickam, Miklós, and Singhi conjectured that for any integers

More information

POLYNOMIAL OPTIMIZATION WITH SUMS-OF-SQUARES INTERPOLANTS

POLYNOMIAL OPTIMIZATION WITH SUMS-OF-SQUARES INTERPOLANTS POLYNOMIAL OPTIMIZATION WITH SUMS-OF-SQUARES INTERPOLANTS Sercan Yıldız syildiz@samsi.info in collaboration with Dávid Papp (NCSU) OPT Transition Workshop May 02, 2017 OUTLINE Polynomial optimization and

More information

A Semidefinite Relaxation Scheme for Multivariate Quartic Polynomial Optimization With Quadratic Constraints

A Semidefinite Relaxation Scheme for Multivariate Quartic Polynomial Optimization With Quadratic Constraints A Semidefinite Relaxation Scheme for Multivariate Quartic Polynomial Optimization With Quadratic Constraints Zhi-Quan Luo and Shuzhong Zhang April 3, 009 Abstract We present a general semidefinite relaxation

More information

Handout 6: Some Applications of Conic Linear Programming

Handout 6: Some Applications of Conic Linear Programming ENGG 550: Foundations of Optimization 08 9 First Term Handout 6: Some Applications of Conic Linear Programming Instructor: Anthony Man Cho So November, 08 Introduction Conic linear programming CLP, and

More information

A Note on KKT Points of Homogeneous Programs 1

A Note on KKT Points of Homogeneous Programs 1 A Note on KKT Points of Homogeneous Programs 1 Y. B. Zhao 2 and D. Li 3 Abstract. Homogeneous programming is an important class of optimization problems. The purpose of this note is to give a truly equivalent

More information

Lecture 5. Theorems of Alternatives and Self-Dual Embedding

Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 1 Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 2 A system of linear equations may not have a solution. It is well known that either Ax = c has a solution, or A T y = 0, c

More information

BBM402-Lecture 20: LP Duality

BBM402-Lecture 20: LP Duality BBM402-Lecture 20: LP Duality Lecturer: Lale Özkahya Resources for the presentation: https://courses.engr.illinois.edu/cs473/fa2016/lectures.html An easy LP? which is compact form for max cx subject to

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 3 A. d Aspremont. Convex Optimization M2. 1/49 Duality A. d Aspremont. Convex Optimization M2. 2/49 DMs DM par email: dm.daspremont@gmail.com A. d Aspremont. Convex Optimization

More information

Convex Optimization 1

Convex Optimization 1 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.245: MULTIVARIABLE CONTROL SYSTEMS by A. Megretski Convex Optimization 1 Many optimization objectives generated

More information

Complex Quadratic Optimization and Semidefinite Programming

Complex Quadratic Optimization and Semidefinite Programming Complex Quadratic Optimization and Semidefinite Programming Shuzhong Zhang Yongwei Huang August 4; revised April 5 Abstract In this paper we study the approximation algorithms for a class of discrete quadratic

More information

A General Framework for Convex Relaxation of Polynomial Optimization Problems over Cones

A General Framework for Convex Relaxation of Polynomial Optimization Problems over Cones Research Reports on Mathematical and Computing Sciences Series B : Operations Research Department of Mathematical and Computing Sciences Tokyo Institute of Technology 2-12-1 Oh-Okayama, Meguro-ku, Tokyo

More information

A Unified Theorem on SDP Rank Reduction. yyye

A Unified Theorem on SDP Rank Reduction.   yyye SDP Rank Reduction Yinyu Ye, EURO XXII 1 A Unified Theorem on SDP Rank Reduction Yinyu Ye Department of Management Science and Engineering and Institute of Computational and Mathematical Engineering Stanford

More information

5. Duality. Lagrangian

5. Duality. Lagrangian 5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized

More information

Lecture 5. The Dual Cone and Dual Problem

Lecture 5. The Dual Cone and Dual Problem IE 8534 1 Lecture 5. The Dual Cone and Dual Problem IE 8534 2 For a convex cone K, its dual cone is defined as K = {y x, y 0, x K}. The inner-product can be replaced by x T y if the coordinates of the

More information

HW1 solutions. 1. α Ef(x) β, where Ef(x) is the expected value of f(x), i.e., Ef(x) = n. i=1 p if(a i ). (The function f : R R is given.

HW1 solutions. 1. α Ef(x) β, where Ef(x) is the expected value of f(x), i.e., Ef(x) = n. i=1 p if(a i ). (The function f : R R is given. HW1 solutions Exercise 1 (Some sets of probability distributions.) Let x be a real-valued random variable with Prob(x = a i ) = p i, i = 1,..., n, where a 1 < a 2 < < a n. Of course p R n lies in the standard

More information

An 0.5-Approximation Algorithm for MAX DICUT with Given Sizes of Parts

An 0.5-Approximation Algorithm for MAX DICUT with Given Sizes of Parts An 0.5-Approximation Algorithm for MAX DICUT with Given Sizes of Parts Alexander Ageev Refael Hassin Maxim Sviridenko Abstract Given a directed graph G and an edge weight function w : E(G) R +, themaximumdirectedcutproblem(max

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems
