Sublinear Time Algorithms for Approximate Semidefinite Programming


Noname manuscript No. (will be inserted by the editor)

Sublinear Time Algorithms for Approximate Semidefinite Programming

Dan Garber · Elad Hazan

Received: date / Accepted: date

Abstract We consider semidefinite optimization in a saddle point formulation in which the primal solution lies in the spectrahedron and the dual solution is a distribution over affine functions. We present an approximation algorithm for this problem that runs in time sublinear in the size of the data. To the best of our knowledge, this is the first algorithm to achieve this. Our algorithm is also guaranteed to produce low-rank solutions. We further prove lower bounds on the running time of any algorithm for this problem, showing that certain terms in the running time of our algorithm cannot be further improved. Finally, we consider a non-affine version of the saddle point problem and give an algorithm that under certain assumptions runs in sublinear time.

Keywords semidefinite programming · sublinear algorithms · online algorithms · large scale optimization

Mathematics Subject Classification (2000) 90C22 · 90C06 · 68W27 · 68W20

1 Introduction

Semidefinite programming has become a tool of great importance in optimization in recent years. In the field of combinatorial optimization, for example, numerous approximation algorithms have been discovered, starting with the seminal work of Goemans and Williamson on the MAX-CUT problem [12], which was followed

Dan Garber
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: dangar@tx.technion.ac.il

Elad Hazan
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: ehazan@ie.technion.ac.il

by [4,1,3]. In the field of machine learning, solving semidefinite programs is at the heart of many learning tasks, such as learning a distance metric [26,24], sparse PCA [10], multiple kernel learning [19], matrix completion [6], and more. In machine learning applications in particular, the data is often assumed to be noisy, so there is little point in solving the underlying optimization problem to very high precision. Moreover, the extremely large scale of the data in such applications makes the underlying optimization problem impractical to solve to relatively high precision, even for fast first-order methods. Indeed, in recent years there has been much research on improving the running time of first-order methods for semidefinite programming, with the aim of continually increasing the scale of problems that can be tackled with sufficient accuracy, and there now exist solvers that, for certain problems and certain parameter regimes, run in nearly linear time [9,5]. In an attempt to accelerate first-order methods for semidefinite programming even further, we explore in this paper the possibility of designing algorithms for semidefinite optimization that, in certain parameter regimes of interest, require running time that is only sublinear in the size of the problem.¹ The main semidefinite optimization problem under investigation in this paper is the following saddle point problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet X - b_i), \qquad (1)$$

where $A_1, \ldots, A_m \in \mathbb{R}^{n \times n}$ are all symmetric, $b_1, \ldots, b_m \in \mathbb{R}$, $\Delta_m = \{x \in \mathbb{R}^m \mid x \geq 0, \|x\|_1 = 1\}$ is the $(m-1)$-dimensional unit simplex, $\mathcal{S} = \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0, \mathrm{Tr}(X) = 1\}$ is the spectrahedron, that is, the set of unit-trace $n \times n$ positive semidefinite matrices, and $\bullet$ denotes the standard matrix dot product, that is, $A \bullet B = \sum_{i,j=1}^{n} A_{ij} B_{ij}$.
In case the optimal value of (1) is non-negative, solving (1) up to an additive error of $\epsilon$ is equivalent to finding a solution in the set $\mathcal{S}$ that satisfies all constraints $A_i \bullet X - b_i \geq 0$ up to an additive error of $\epsilon$. Note that the requirement that the solution have unit trace is equivalent, via standard reductions, to the requirement that the trace of the solution be bounded from above. Moreover, the formulation in (1) captures the notion of maximizing the margin from feasibility, that is, maximizing the scalar $\sigma$ that still satisfies $A_i \bullet X \geq b_i + \sigma$ for all $i \in [m]$, for some matrix $X \in \mathcal{S}$. This objective is of direct interest in certain machine learning tasks such as the problem of learning a distance metric [26,24]. The dual of Problem (1) is the following eigenvalue optimization problem:

$$\min_{p \in \Delta_m} \lambda_{\max}\left( \sum_{i=1}^{m} p_i (A_i - b_i I) \right), \qquad (2)$$

where $I$ denotes the $n \times n$ identity matrix. Note that duality here is strong, i.e., there is no duality gap between problems (1) and (2).

¹ The results presented in this paper are a continuation of preliminary results on sublinear semidefinite optimization presented in [11].
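To make the dual concrete, the following minimal numpy sketch (on a hypothetical toy instance; the function name `dual_objective` is ours) evaluates the dual objective (2) for a given distribution $p$:

```python
import numpy as np

def dual_objective(p, As, bs):
    """Evaluate the dual objective (2): lambda_max(sum_i p_i (A_i - b_i I))."""
    n = As[0].shape[0]
    M = sum(p_i * (A_i - b_i * np.eye(n)) for p_i, A_i, b_i in zip(p, As, bs))
    return np.linalg.eigvalsh(M)[-1]  # eigvalsh sorts ascending; take the largest

# Toy instance: one constraint A_1 = diag(2, 0), b_1 = 1.
A1 = np.diag([2.0, 0.0])
print(dual_objective([1.0], [A1], [1.0]))  # lambda_max(diag(1, -1)) = 1
```

By weak duality, the value returned for any $p \in \Delta_m$ upper bounds the saddle value of (1).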

Our main contribution is a randomized algorithm that solves Problem (1) up to a desired additive error $\epsilon$ in a total running time that, for large enough instances, is sublinear in the number of entries in the instance. To the best of our knowledge this is the first result to achieve this kind of bound on the running time. Our algorithm has an additional appealing feature: it always produces low-rank solutions. More precisely, we prove the following theorem, which is the main theorem of this paper.

Theorem 1 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (1) satisfying for all $i \in [m]$:
– $\|A_i\|_2 \leq 1$, i.e., all matrices are bounded in spectral (operator) norm,
– $|b_i| \leq 1$,
– the number of non-zero entries of each matrix $A_i$ is upper bounded by $S$,
– $\|A_i\|_F \leq F$, i.e., all matrices have Frobenius norm at most $F$,
finds after

$$O\left( \epsilon^{-2} \log m \left( F^2 (m + \log n) + S \epsilon^{-2} \log n + \min\{S \epsilon^{-2} \log n,\, n^2\}\, \epsilon^{-1/2} \log n \right) \right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i (A_i \bullet X - b_i) - \epsilon,$$

and the rank of $\bar{X}$ is upper bounded by $O(\epsilon^{-2} \log m)$.

Note that in the worst case $S = \Theta(n^2)$ and $F^2 = \Theta(n)$ (under the assumption of unit spectral norm), and the bound in the theorem becomes $\tilde{O}\left(\epsilon^{-2}(mn + \epsilon^{-2} n^2)\right)$ (suppressing poly-logarithmic factors), which is sublinear in the size of the problem (which is $\Theta(mn^2)$) for $\epsilon \geq \max\{n^{-1/2}, m^{-1/4}\}$. In case $F^2 = O(1)$ and $S = \Theta(n^2)$, the bound becomes $\tilde{O}\left(\epsilon^{-2}(m + n^2\epsilon^{-2})\right)$, which actually splits the dependency between $m$ and $n$. In this case the algorithm exhibits a sublinear running time for $\epsilon \geq \max\{n^{-1}, m^{-1/4}\}$. We also prove the following lower bounds on the running time of any $\epsilon$-additive approximation algorithm for Problem (1).
Theorem 2 Given a choice of parameters $m, n$, let $\Gamma^F_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_F \leq 1$, $b_i = 0$, and let $\Gamma^S_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_2 \leq 1$, $b_i = 0$. The following claims hold.
1. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $m$, and let $n \geq \frac{1}{4\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega\left(\frac{m}{\epsilon^2}\right)$ on instances from $\Gamma^F_{m,n}$.

2. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $n$, and let $m \geq \frac{1}{8\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega\left(\frac{n^2}{\epsilon}\right)$ on instances from $\Gamma^F_{m,n}$.
3. Fix the parameters $m, n$ and an error tolerance $\epsilon$ satisfying $\frac{1}{8} \geq \epsilon \geq \frac{1}{4n}$. Any algorithm that, given an instance from $\Gamma^S_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega(mn)$ on instances from $\Gamma^S_{m,n}$.

Part 1 of Theorem 2 shows that the dependency of the running time in Theorem 1 on the term $\epsilon^{-2} m$ cannot be improved in general. Part 2 of the theorem shows that the dependency of the running time on the sparsity parameter $S$ cannot be sublinear in general. Finally, part 3 of the theorem shows that in general (under the unit spectral norm assumption) the dependency on $m, n$ in the running time cannot be separated, as happens for instance in Theorem 1 in the case $F = \Theta(1)$.

Finally, we also consider the following non-affine saddle-point optimization problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(X), \qquad (3)$$

where $c_1, \ldots, c_m : \mathcal{S} \to \mathbb{R}$ are all concave and differentiable everywhere in $\mathcal{S}$.² For this problem we have the following theorem.

Theorem 3 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (3) satisfying for all $i \in [m]$ and all $X \in \mathcal{S}$:
– $|c_i(X)| \leq 1$,
– $\|\nabla c_i(X)\|_F \leq 1$,
– $\nabla c_i(X)$ can be computed in $O(n^2)$ time,
– there exists a procedure $\mathrm{Sample}(X, c_1, \ldots, c_m)$ that returns in $O(m + n^2)$ time unbiased estimators of the values $c_1(X), \ldots, c_m(X)$, each with variance at most $1$,
finds after

$$O\left( \epsilon^{-2} (m + \epsilon^{-4} n^2) \log^2 m \right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i c_i(X) - \epsilon.$$

² Our results hold also under the weaker assumption that every $c_i$ has a supergradient everywhere in $\mathcal{S}$.

Our algorithm for the non-affine case may be viewed as a sublinear-time algorithm in the sense that, under our assumptions, the total running time of the algorithm for $\epsilon \geq \max\{n^{-1}, m^{-1/6}\}$ is less than the worst-case time required to evaluate the gradient of each function $c_i$ at least once. Although the conditions stated in Theorem 3 are somewhat restrictive, we show that, for instance, for the important case of the square-loss function $c(X) = -(A \bullet X - b)^2$ they indeed hold. This example is spelled out in Section 6.

1.1 Related Work

The problems (1), (2) under consideration in this paper are solvable up to an accuracy of $\epsilon$ via Interior Point methods in time $O\left( \sqrt{\max\{m, n\}} \left( n^3 + m^3 + m^2 n^2 \right) \log(1/\epsilon) \right)$. In recent years there have been extensive efforts to speed up the running time of first-order methods for these problems; see [9,20,21,17,5,15] and references therein. Current state-of-the-art first-order methods include the deterministic smoothing technique of Nesterov [21], the Mirror Descent method [9] and the recent randomized Mirror-Prox method of Baes, Bürgisser and Nemirovski [5], all three considering the eigenvalue optimization problem (2). The current best running time of first-order methods for solving Problem (2) (up to poly-logarithmic factors), under the assumption that all matrices $A_i$ have spectral norm at most $1$, is

$$\min\left\{ \frac{mS}{\epsilon} + \frac{n^2}{\epsilon^3} \text{ [5]},\quad \frac{mS}{\epsilon} + \frac{n^3}{\epsilon} \text{ [21]},\quad \frac{mS}{\epsilon^2} + \frac{n^2}{\epsilon^{2.5}} \text{ [9]} \right\}. \qquad (4)$$

In order to demonstrate the potential appeal of the algorithm proposed in this paper over previous art, we present in Table 1 different regimes of $m$ (the number of constraints), $S$ (the maximum number of non-zero entries in any of the matrices) and $F$ (an upper bound on the Frobenius norm of all matrices) for which our method improves over the bound (4), for a meaningful range of the error tolerance parameter $\epsilon$.
Please note that Table 1 does not give a full characterization of the settings in which our method may be preferable, but only presents a partial list. Our algorithms are based on applying algorithms from the paradigm of online convex optimization to solve saddle point optimization problems. For recent surveys on online convex optimization see [16,23]. We use the generic schemes presented in [14] for converting these online algorithms into algorithms for saddle point optimization, and we especially build on the well-known Multiplicative Weights algorithm for online linear optimization over the simplex [2]. In order to accelerate these schemes and derive a sublinear running time, we apply, among others, ideas introduced in [8], in which the authors presented several sublinear time algorithms for machine learning optimization tasks such as the supervised learning of separating hyperplanes. In [8] the authors accelerated the schemes from [14] using novel random estimation techniques instead of exact computations. This approach was also taken in the pioneering work of Grigoriadis and Khachiyan [13], who presented a randomized sublinear time algorithm for a saddle point optimization problem in which both the primal and the dual solutions lie in the simplex. The sublinear running time of their algorithm is also achieved by trading exact computations for random sampling.

regime                                                              | previous FOM                                      | IPM        | this paper ($\epsilon \geq n^{-1}$)            | improvement
$S = \Theta(n)$, $m = \Theta(n)$, $F^2 = \Theta(n)$                 | $\min\{\epsilon^{-1} n^3, \epsilon^{-5/2} n^2\}$  | $n^{9/2}$  | $\epsilon^{-2} n^2 + \epsilon^{-9/2} n$        | $\epsilon > n^{-4/7}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(1)$               | $\epsilon^{-1} n^4$                               | $n^{21/2}$ | $\epsilon^{-2} n^3$                            | $\epsilon > n^{-1}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$        | $\epsilon^{-1} n^4$                               | $n^{21/2}$ | $\epsilon^{-2} n^{7/2}$                        | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n)$, $F^2 = \Theta(n)$           | $\epsilon^{-1} n^{5/2}$                           | $n^{9/2}$  | $\epsilon^{-4} n^{3/2} + \epsilon^{-5/2} n^2$  | $\epsilon > n^{-1/3}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$         | $\epsilon^{-1} n^{7/2}$                           | $n^7$      | $\epsilon^{-2} n^3 + \epsilon^{-4} n^{3/2}$    | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$  | $\epsilon^{-1} n^{9/2}$                           | $n^{21/2}$ | $\epsilon^{-2} n^{7/2}$                        | $\epsilon > n^{-1}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$         | $\epsilon^{-1} n^{9/2}$                           | $n^{21/2}$ | $\epsilon^{-2} n^4$                            | $\epsilon > n^{-1/2}$
$S = \Theta(n^2)$, $m = \Theta(n)$, $F^2 = \Theta(n)$               | $\epsilon^{-1} n^3$                               | $n^{9/2}$  | $\epsilon^{-4} n^2$                            | $\epsilon > n^{-1/3}$
$S = \Theta(n^2)$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$             | $\epsilon^{-1} n^4$                               | $n^7$      | $\epsilon^{-2} n^3 + \epsilon^{-4} n^2$        | $\epsilon > n^{-2/3}$
$S = \Theta(n^2)$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$             | $\epsilon^{-1} n^5$                               | $n^{21/2}$ | $\epsilon^{-2} n^4$                            | $\epsilon > n^{-1}$

Table 1 Comparison of running times between the bound in (4), the running time of Interior Point Methods, and the bound stated in Theorem 1, for the eigenvalue optimization problem (2), in various regimes of $S$, $m$, $F^2$. In all settings the spectral norm of all matrices is assumed to be upper bounded by $1$. All bounds are stated without constants and without poly-logarithmic factors. The column titled "previous FOM" gives a lower bound on the expression in Equation (4). The column "IPM" gives the best running time of Interior Point methods for the chosen parameters. The column titled "this paper" gives an evaluation of the bound stated in Theorem 1 for the chosen $S$, $m$ and $F^2$; in order to further simplify the expressions in this column we assume that $\epsilon \geq n^{-1}$. Finally, the column titled "improvement" gives the range of the error tolerance $\epsilon$ for which the bound in Theorem 1 improves over the bound in (4).

Two more results that we apply in order to derive sublinear running times are a Bernstein-like concentration result for sums of random matrices presented in [22] and the Frank-Wolfe algorithm for smooth convex optimization over the spectrahedron presented in [15].
The rest of this paper is organized as follows. In section 2 we present our main algorithmic technique, algorithms for online convex optimization, and overview their application to saddle point optimization. In section 3 we detail further algorithmic tools that we apply in our sublinear time algorithms. In section 4 we describe our randomized sublinear algorithm for Problem (1) and prove Theorem 1. In section 5 we describe how to make the algorithm work with arbitrarily high probability while keeping the running time sublinear. In section 6 we present a sublinear algorithm for the non-affine problem (3) and prove Theorem 3. Finally, in section 7 we prove Theorem 2.

2 An Online Convex Optimization Approach to Saddle-Point Optimization

In this section we overview our main algorithmic machinery for solving problems (1) and (3): algorithms for online convex optimization and their application to saddle point optimization.

2.1 Online Convex Optimization

In the setting of online convex optimization (OCO) [27,16,23,7], a decision maker is iteratively required to choose a point $x_t$ in a fixed convex set $\mathcal{K}$. After the decision maker chooses $x_t$, a convex loss function $f_t : \mathcal{K} \to \mathbb{R}$ is revealed, the decision maker incurs a loss equal to the value of the function at the chosen point, $f_t(x_t)$, and the game proceeds to the next iteration. The emphasis in this setting is that the function $f_t$ revealed on each iteration of the game may be chosen completely arbitrarily, and even in an adversarial manner, given the previous choices of the decision maker and the previous loss functions. In this adversarial model the decision maker cannot hope to minimize his cumulative loss, $\sum_{t=1}^{T} f_t(x_t)$, as a function of his choices $x_1, \ldots, x_T$. Instead, the standard goal in this setting is to minimize the difference between the cumulative loss of the decision maker and that of an algorithm that has the benefit of hindsight but is only able to play a single fixed point in $\mathcal{K}$ on every iteration. This difference is known as the regret, and is formally given by

$$\mathrm{regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$

Usually one cannot guarantee zero (or negative) regret [7], and the goal is to design an efficient algorithm for the decision maker that guarantees a regret that grows sublinearly in the length of the game $T$, that is, $\mathrm{regret}_T / T \to 0$ as $T \to \infty$. Sometimes one wishes to maximize a cumulative gain instead of minimizing a cumulative loss. In this case we require that all functions $f_t$ are concave, and the regret is given by the difference between the cumulative gain of the best fixed point in $\mathcal{K}$ in hindsight and that of the decision maker. We conclude this subsection by presenting two algorithms for online convex optimization and their regret guarantees. In the next subsection we describe how these algorithms may be applied to saddle point optimization problems such as (1) and (3). The first algorithm we describe, known as the Multiplicative Weights algorithm [2], is an algorithm for online optimization over the simplex $\Delta_m$ with linear loss functions ($f_t(p) = q_t^\top p$).
Definition 1 (The Multiplicative Weights algorithm) Consider a sequence of linear losses $q_1, \ldots, q_T \in \mathbb{R}^m$ such that $\|q_t\|_\infty \leq 1$ for all $t$. The Multiplicative Weights (MW) algorithm is as follows. Let $0 < \eta \in \mathbb{R}$, $w_1 \leftarrow \mathbf{1}_m$, and for $t \geq 1$,

$$p_t \leftarrow w_t / \|w_t\|_1, \qquad w_{t+1}(i) \leftarrow w_t(i)(1 - \eta q_t(i)).$$

The following lemma bounds the regret of the above described MW algorithm. For a proof see [2].

Lemma 1 Assuming $T > \log m$, for the choice $\eta = \sqrt{\frac{\log m}{T}}$ the Multiplicative Weights algorithm satisfies

$$\mathrm{regret}_T(\mathrm{MW}) = \sum_{t=1}^{T} p_t^\top q_t - \min_{i \in [m]} \sum_{t=1}^{T} q_t(i) \leq 2\sqrt{T \log m}.$$

The second algorithm we present is an algorithm for online optimization over the set $\mathcal{S}$, that is, the set of unit-trace positive semidefinite matrices, with arbitrary concave functions. The algorithm is a variant of the more general meta-algorithm known as Regularized Follow The Leader (RFTL) [16,23].
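The update of Definition 1 can be sketched in a few lines of numpy (a minimal illustration; the helper name `multiplicative_weights` is ours, and losses are assumed to lie in $[-1, 1]$):

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Play the MW updates of Definition 1 on a given loss sequence.

    losses: list of loss vectors q_t with entries in [-1, 1].
    Returns the list of distributions p_t played."""
    m = len(losses[0])
    w = np.ones(m)                  # w_1 <- all-ones vector
    plays = []
    for q in losses:
        p = w / w.sum()             # p_t <- w_t / ||w_t||_1
        plays.append(p)
        w = w * (1.0 - eta * np.asarray(q))  # w_{t+1}(i) <- w_t(i)(1 - eta q_t(i))
    return plays

# Ten rounds in which coordinate 0 always suffers loss 1 and coordinate 1 loss 0:
plays = multiplicative_weights([[1.0, 0.0]] * 10, eta=0.1)
```

With $\eta = \sqrt{\log m / T}$, Lemma 1 bounds the regret by $2\sqrt{T \log m}$; in the toy run above the weight shifts toward the lossless coordinate.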

Definition 2 (The RFTL algorithm) Consider a sequence of concave functions $f_1, \ldots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t$ and all $X \in \mathcal{S}$. The RFTL algorithm is as follows. Let $X_1$ be an arbitrary point in $\mathcal{S}$, and for $t \geq 1$,

$$X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \left\{ \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2 \right\}. \qquad (5)$$

The following lemma bounds the regret of the above described RFTL algorithm. For a proof see [23].

Lemma 2 The Regularized Follow The Leader algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} f_t(X) - \sum_{t=1}^{T} f_t(X_t) \leq \sqrt{2T}.$$

2.2 Applying OCO algorithms to saddle point optimization

In this subsection we overview how online convex optimization algorithms may be applied to solve problems (1) and (3). We concentrate on presenting convergent algorithms without focusing on the running times; when we present our sublinear algorithms, we show how to derive sublinear time algorithms from the basic algorithms presented in this subsection. We approximate problems (1) and (3) by applying algorithms for OCO in two ways, which we term the dual method and the primal-dual method (see [14]).

2.2.1 The dual method

We apply the dual method to Problem (1), that is, we assume that all constraint functions $c_i(X)$ are affine: $c_i(X) = A_i \bullet X - b_i$, where $A_i \in \mathbb{R}^{n \times n}$ is symmetric and $b_i \in \mathbb{R}$. The basic algorithm is given below.

Algorithm 1 Dual Method
1: Input: $T$
2: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
3: for $t = 1$ to $T$ do
4:   $p_t(i) \leftarrow w_t(i) / \|w_t\|_1$
5:   $X_t \leftarrow \arg\max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X)$
6:   for $i = 1$ to $m$ do
7:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta c_i(X_t))$
8:   end for
9: end for
10: return $\bar{X} = \frac{1}{T}\sum_{t=1}^{T} X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^{T} p_t$

Since all functions $c_i$ are affine, $c_i(X) = A_i \bullet X - b_i$, the solution to the optimization step in line 5 of the algorithm is without loss of generality a rank-one matrix of the form $vv^\top$, where $v$ is an eigenvector corresponding to the largest eigenvalue of the matrix $\sum_{i=1}^{m} p_t(i) A_i$. Notice that Algorithm 1 applies the MW algorithm with respect to the sequence of linear losses $c_1, \ldots, c_T \in \mathbb{R}^m$, where $c_t(i) = c_i(X_t) = A_i \bullet X_t - b_i$.

Lemma 3 Given a tolerance $\epsilon > 0$, running Algorithm 1 for $T = 4\epsilon^{-2}\log m$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i (A_i \bullet X - b_i) - \epsilon.$$

Proof Denote by $c_t$ the vector in $\mathbb{R}^m$ such that $c_t(i) = c_i(X_t)$ for all $i \in [m]$. By the regret guarantee of the MW algorithm (Lemma 1) and the value of $T$ we have that

$$\frac{1}{T}\sum_{t=1}^{T} p_t^\top c_t - \min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^{T} p^\top c_t \leq \epsilon.$$

By the definition of $X_t$ we have that for all $t$, $\sum_{i=1}^{m} p_t(i) c_i(X_t) = \max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X)$, and thus

$$\min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^{T} p^\top c_t \geq \frac{1}{T}\sum_{t=1}^{T} \max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X) - \epsilon \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \epsilon.$$

Since $c_t(i) = c_i(X_t)$ for all $i$ and each $c_i(X)$ is affine, we have that

$$\frac{1}{T}\sum_{t=1}^{T} p^\top c_t = \sum_{i=1}^{m} p_i \cdot \frac{1}{T}\sum_{t=1}^{T} c_i(X_t) = \sum_{i=1}^{m} p_i c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right),$$

and thus we conclude that

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \epsilon,$$

which yields the lemma.
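For intuition, here is a minimal, exact (non-sublinear) numpy rendering of Algorithm 1 on a hypothetical toy instance; the best response in line 5 is computed as a top eigenvector, exactly as in the rank-one observation above (the function name `dual_method` is ours):

```python
import numpy as np

def dual_method(As, bs, T):
    """Exact version of Algorithm 1 (no sampling): MW on the dual distribution,
    rank-one best responses X_t = v v^T on the primal side."""
    m, n = len(As), As[0].shape[0]
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)
    X_bar, p_bar = np.zeros((n, n)), np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        S_t = sum(p_i * A_i for p_i, A_i in zip(p, As))
        v = np.linalg.eigh(S_t)[1][:, -1]   # leading eigenvector of sum_i p_t(i) A_i
        X_t = np.outer(v, v)
        c = np.array([np.trace(A_i @ X_t) - b_i for A_i, b_i in zip(As, bs)])
        w = w * (1.0 - eta * c)             # MW update with losses c_t(i) = c_i(X_t)
        X_bar += X_t / T
        p_bar += p / T
    return X_bar, p_bar

# Toy instance with saddle value 0, attained at X = I/2:
As = [np.diag([1.0, -1.0]), np.diag([-1.0, 1.0])]
X_bar, p_bar = dual_method(As, [0.0, 0.0], T=400)
```

On this instance the primal iterates oscillate between the two rank-one vertices, and the average $\bar{X}$ approaches $I/2$, as Lemma 3 predicts.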

2.2.2 The primal-dual method

We apply the primal-dual method to Problem (3), that is, we assume that all constraint functions $c_i(X)$ are concave and differentiable everywhere in $\mathcal{S}$.³ The basic algorithm is given below.

Algorithm 2 Primal-Dual Method
1: Input: $T$
2: Let $X_1$ be an arbitrary point in $\mathcal{S}$
3: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
4: for $t = 1$ to $T$ do
5:   $p_t(i) \leftarrow w_t(i) / \|w_t\|_1$
6:   $X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \left\{ \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \sum_{i=1}^{m} p_\tau(i) \nabla c_i(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2 \right\}$
7:   for $i = 1$ to $m$ do
8:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta c_i(X_t))$
9:   end for
10: end for
11: return $\bar{X} = \frac{1}{T}\sum_{t=1}^{T} X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^{T} p_t$

Notice that, as in the dual method, Algorithm 2 applies the MW algorithm with respect to the sequence of linear losses $c_1, \ldots, c_T$, where $c_t(i) = c_i(X_t)$, and the RFTL algorithm with respect to the sequence of concave functions $f_1(X), \ldots, f_T(X)$, where $f_t(X) = \sum_{i=1}^{m} p_t(i) c_i(X)$.

Lemma 4 Given a tolerance $\epsilon > 0$, running Algorithm 2 for $T = \epsilon^{-2}(\sqrt{2} + 2\sqrt{\log m})^2$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i c_i(X) - \epsilon.$$

Proof By the regret guarantee of the MW algorithm (Lemma 1) we have that

$$\sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X_t) - \min_{p \in \Delta_m} \sum_{t=1}^{T} \sum_{i=1}^{m} p_i c_i(X_t) \leq 2\sqrt{T \log m}.$$

By the regret guarantee of the RFTL algorithm (Lemma 2) we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X) - \sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X_t) \leq \sqrt{2T}.$$

Adding the above two inequalities and rearranging, we have that

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i \left( \frac{1}{T}\sum_{t=1}^{T} c_i(X_t) \right) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \frac{\sqrt{2T} + 2\sqrt{T \log m}}{T}.$$

The lemma follows by plugging in our choice of $T$ and since, by the concavity of the functions $c_1, \ldots, c_m$, we have for all $i \in [m]$ that $\frac{1}{T}\sum_{t=1}^{T} c_i(X_t) \leq c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right)$.

³ As stated before, it suffices to assume that each $c_i$ has a supergradient everywhere in $\mathcal{S}$.

3 More Algorithmic Techniques

In this section we detail some algorithmic techniques that we apply in order to derive sublinear time algorithms from the high-level algorithms presented in section 2. These techniques include fast random estimation procedures and the Frank-Wolfe method for smooth optimization over the set of unit-trace positive semidefinite matrices.

3.1 Random Estimation Techniques

A key step in deriving our sublinear time algorithms is replacing exact computations with random estimations. We use random sampling for the following two problems.

3.1.1 Estimating matrix dot products with constant variance

Let $A, X \in \mathbb{R}^{n \times n}$ be such that $\|A\|_F \leq F$ (for ease of presentation, and without losing any generality, we assume that $F \in \mathbb{N}$) and $\|X\|_F \leq 1$, and assume we are interested in estimating the dot product $A \bullet X$ with constant variance in $O(F^2 \log n)$ time (not including a preprocessing step to construct the sampling distribution). A way to achieve this goal is by sampling entries from the matrix $A$ according to the $\ell_2$ norm of $X$. Consider picking a random entry $(i,j) \in [n] \times [n]$ with probability $\frac{X(i,j)^2}{\|X\|_F^2}$ and estimating the dot product by

$$\tilde{v} \leftarrow A(i,j)\, \frac{\|X\|_F^2}{X(i,j)}. \qquad (6)$$

Observe that

$$\mathbb{E}_{i,j}[\tilde{v}] = \sum_{i,j} \frac{X(i,j)^2}{\|X\|_F^2} \cdot A(i,j)\, \frac{\|X\|_F^2}{X(i,j)} = \sum_{i,j} A(i,j) X(i,j) = A \bullet X.$$

Thus $\tilde{v}$ is an unbiased estimator of the dot product $A \bullet X$. Moreover,

$$\mathrm{Var}[\tilde{v}] \leq \mathbb{E}_{i,j}[\tilde{v}^2] = \sum_{i,j} \frac{X(i,j)^2}{\|X\|_F^2} \cdot A(i,j)^2\, \frac{\|X\|_F^4}{X(i,j)^2} = \|X\|_F^2 \sum_{i,j} A(i,j)^2 \leq \sum_{i,j} A(i,j)^2 = \|A\|_F^2 \leq F^2.$$

In order to obtain an unbiased estimator with constant variance (say, unit variance) one can simply average $F^2$ i.i.d. estimators as in (6); that is, sample $F^2$ entries $(i_1, j_1), \ldots, (i_{F^2}, j_{F^2})$ i.i.d. according to the $\ell_2$ norm of $X$ and take the estimator

$$\tilde{v} \leftarrow \frac{1}{F^2} \sum_{r=1}^{F^2} A(i_r, j_r)\, \frac{\|X\|_F^2}{X(i_r, j_r)}. \qquad (7)$$

Setting up the $\ell_2$ sampling distribution takes preprocessing time that is linear in the number of non-zeros in $X$. Thus, given $m$ matrices $A_1, \ldots, A_m$ with Frobenius norm at most $F$, it is possible to estimate all dot products $A_i \bullet X$, $i \in [m]$, with unit variance in overall $O(\|X\|_0 + F^2(\log n + m))$ time (in particular, note that we need to sample only one set of $F^2$ entries, and not one such set per matrix $A_i$).

Although the estimator $\tilde{v}$ defined in (7) has unit variance, it may be unbounded, which does not allow us to apply standard concentration results. To tackle this problem we use a clipping procedure: we define $v = \mathrm{clip}(\tilde{v}, C) = \max\{-C, \min\{\tilde{v}, C\}\}$ and use $v$ instead of $\tilde{v}$. The new estimator $v$ is no longer unbiased (but it is bounded, and its variance is still at most $1$). The following lemma, which we prove in the appendix, ties the resulting bias to the variance.

Lemma 5 Let $X$ be a random variable with finite variance and let $\bar{X} = \mathrm{clip}(X, C) = \min\{C, \max\{-C, X\}\}$, where $C$ is a constant satisfying $|\mathbb{E}[X]| \leq C/2$. Then,

$$\left| \mathbb{E}[\bar{X}] - \mathbb{E}[X] \right| \leq \frac{2}{C}\, \mathrm{Var}[X].$$

Note that when we apply Lemma 5 to the dot-product estimator $\tilde{v}$ described above, choosing $C$ to be roughly $1/\epsilon$, we get from Lemma 5 that the resulting bias of the estimator $v$ is only $O(\epsilon)$, while $v$ is bounded by $O(1/\epsilon)$. There is a slight issue: choosing $C = \Theta(1/\epsilon)$ imposes an upper bound on $\epsilon$ because of the requirement in Lemma 5 that $|\mathbb{E}[X]| \leq C/2$.
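The estimator (7) and the clipping step can be sketched as follows (a minimal numpy illustration; the helper names are ours):

```python
import numpy as np

def clip(v, C):
    """clip(v, C) = max{-C, min{v, C}}, the bounding step used on the estimator."""
    return max(-C, min(v, C))

def l2_sample_dot(A, X, num_samples, rng):
    """Estimator (7): sample entries (i, j) w.p. X(i,j)^2 / ||X||_F^2 and average
    A(i,j) * ||X||_F^2 / X(i,j). Unbiased for A . X, with variance at most
    ||A||_F^2 when ||X||_F <= 1."""
    fro2 = (X ** 2).sum()
    probs = (X ** 2).ravel() / fro2
    idx = rng.choice(X.size, size=num_samples, p=probs)
    i, j = np.unravel_index(idx, X.shape)
    return np.mean(A[i, j] * fro2 / X[i, j])
```

As a quick sanity check, when $X$ has a single non-zero entry the sampling distribution is degenerate and the estimator returns $A \bullet X$ exactly.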
However, in all of the results presented in this paper, this requirement merely translates to the restriction $\epsilon \leq 1$, which is the only case of interest under our assumption of bounded affine/non-affine functions.

3.1.2 Spectral approximation of matrices

Recall that in the dual method (Algorithm 1) we need, on each iteration, to compute the leading eigenvector of a matrix $A$ given by $A = \sum_{i=1}^{m} p_t(i) A_i$, where $p_t \in \Delta_m$ and each $A_i \in \mathbb{R}^{n \times n}$ is symmetric. Computing $A$ takes $O(mS)$ time, where $S$ is an upper bound on the number of non-zeros in a single matrix. Instead of computing $A$, we will be interested in finding a matrix $\tilde{A}$ which is the average of only a few of the matrices $A_1, \ldots, A_m$, such that with probability at least $1 - \delta$,

$$\|\tilde{A} - A\|_2 \leq \epsilon, \qquad (8)$$

for some error tolerance $\epsilon$ and failure probability $\delta$. Note that if $\tilde{X}$ satisfies $\tilde{A} \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} \tilde{A} \bullet X - \epsilon$, then $A \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} A \bullet X - O(\epsilon)$, and thus we can find an approximation of the leading eigenvector of $A$ by approximating the leading eigenvector of $\tilde{A}$. In order to derive an approximation of $A$ in the sense of (8) we rely on the following Bernstein-like inequality for matrices (see [25], Theorem 1.4).

Lemma 6 Let $M_1, \ldots, M_k$ be independent zero-mean random symmetric matrices of dimension $n \times n$. Suppose that $\|M_i\|_2 \leq \gamma$ for all $i \in [k]$, and denote $\rho^2 = \left\| \sum_{i=1}^{k} \mathbb{E}[M_i^2] \right\|_2$. Then for any $\epsilon > 0$ it holds that

$$\mathbb{P}\left( \left\| \sum_{i=1}^{k} M_i \right\|_2 > \epsilon \right) \leq n \exp\left( \frac{-\epsilon^2/2}{\rho^2 + \gamma\epsilon/3} \right).$$

The next lemma follows directly from Lemma 6 and is tailored to our needs.

Lemma 7 Fix $\epsilon \in (0,1]$. Let $p \in \Delta_m$ be a distribution over $m$ elements and let $A_1, \ldots, A_m$ be symmetric matrices in $\mathbb{R}^{n \times n}$ such that $\|A_i\|_2 \leq 1$ for all $i \in [m]$. Let $j_1, \ldots, j_k$ be $k$ i.i.d. samples of indices from $[m]$ according to the distribution $p$ (that is, $j_l = i$ w.p. $p_i$). Then it holds that

$$\mathbb{P}\left( \left\| \frac{1}{k} \sum_{l=1}^{k} A_{j_l} - \sum_{i=1}^{m} p_i A_i \right\|_2 > \epsilon \right) \leq n \exp\left( -\frac{k\epsilon^2}{10} \right).$$
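The sampling scheme of Lemma 7 amounts to a few lines (a sketch; the name `sample_average` is ours):

```python
import numpy as np

def sample_average(p, As, k, rng):
    """Approximate sum_i p_i A_i in spectral norm by averaging k i.i.d. draws
    from the distribution p, as in Lemma 7."""
    idx = rng.choice(len(As), size=k, p=np.asarray(p))
    return sum(As[i] for i in idx) / k
```

By Lemma 7, $k = O(\epsilon^{-2} \log(n/\delta))$ samples already drive the failure probability below $\delta$, independently of $m$; this is what replaces the $O(mS)$-time exact computation of $A$.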
The algorithm performs iterations, where each iteration $t$ is comprised of solving the linear problem

$$V_t \leftarrow \arg\max_{V \in \mathcal{S}} V \bullet \nabla f(X_t), \qquad (9)$$

where $X_t$ is the current iterate. The algorithm then produces the next iterate by taking the convex combination $X_{t+1} \leftarrow X_t + \alpha_t (V_t - X_t)$, where $\alpha_t$ is roughly $1/t$. Note that the solution to (9) is without loss of generality just a leading eigenvector computation, which can be approximated in nearly linear time using the Lanczos method [18]. The Frank-Wolfe algorithm guarantees that after roughly $O(\epsilon^{-1})$ iterations, each of which is comprised of a single eigenvector computation, the final iterate $X_{sol}$ will satisfy

$$\max_{X \in \mathcal{S}} f(X) - f(X_{sol}) \leq \epsilon.$$

We now define the approximated RFTL algorithm and tie its regret bound to the approximation quality of Problem (5).

Definition 3 (Approximated RFTL algorithm) Consider a sequence of concave functions $f_1, \ldots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t \in [T]$ and $X \in \mathcal{S}$. The approximated RFTL algorithm is as follows. Let $\epsilon > 0$ and let $X_1$ be an arbitrary point in $\mathcal{S}$. For each time $t \geq 1$ define the function

$$F_t(X) = \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$$

and choose a point $X_{t+1} \in \mathcal{S}$ that satisfies:

$$\max_{X \in \mathcal{S}} F_t(X) - F_t(X_{t+1}) \leq \epsilon.$$

Lemma 8 The approximated RFTL algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} f_t(X) - \sum_{t=1}^{T} f_t(X_t) \leq \sqrt{2T} + T\sqrt{\epsilon}.$$

Proof Consider the sequence of linear functions $\hat{f}_t(X) = \nabla f_t(X_t) \bullet X$, $t \in [T]$, and denote by $\hat{X}_t$, $t \in [T]$, the sequence of points played by the original RFTL algorithm, as formulated in Equation (5), with respect to the sequence of functions $\hat{f}_1, \ldots, \hat{f}_T$. By Lemma 2 we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X - \sum_{t=1}^{T} \nabla f_t(X_t) \bullet \hat{X}_t \leq \sqrt{2T}. \qquad (10)$$

Note that $F_t(X)$ is 1-strongly-concave with respect to the norm $\|\cdot\|_F$. Thus, since by the definition of $X_{t+1}$ we have $F_t(\hat{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon$, it holds for all $t$ that

$$\|\hat{X}_{t+1} - X_{t+1}\|_F^2 \leq F_t(\hat{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon.$$

Using the Cauchy-Schwarz inequality we have that

$$\sum_{t=1}^{T} \nabla f_t(X_t) \bullet \hat{X}_t = \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + \sum_{t=1}^{T} \nabla f_t(X_t) \bullet (\hat{X}_t - X_t) \leq \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + \sum_{t=1}^{T} \|\nabla f_t(X_t)\|_F \|\hat{X}_t - X_t\|_F \leq \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + T\sqrt{\epsilon}. \qquad (11)$$
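The Frank-Wolfe iteration just described can be sketched as follows, on a hypothetical smooth concave test function $f(X) = -\frac{1}{2}\|X - M\|_F^2$ (whose gradient is $M - X$); the exact `eigh` call stands in for the Lanczos step, and the helper name is ours:

```python
import numpy as np

def frank_wolfe_spectrahedron(grad, X0, num_iters):
    """Frank-Wolfe over S = {X >= 0, Tr X = 1}: the linear problem (9) is solved
    by a leading-eigenvector computation, and iterates are convex combinations."""
    X = X0
    for t in range(1, num_iters + 1):
        G = grad(X)
        v = np.linalg.eigh(G)[1][:, -1]    # V_t = v v^T for a top eigenvector of grad
        V = np.outer(v, v)
        X = X + (2.0 / (t + 1)) * (V - X)  # step size alpha_t, roughly 1/t
    return X

# Target M = I/2 lies in S; Frank-Wolfe drives X toward it at rate O(1/t).
M = np.eye(2) / 2
X_sol = frank_wolfe_spectrahedron(lambda X: M - X, np.diag([1.0, 0.0]), 200)
```

Note that each iterate is a convex combination of rank-one vertices of $\mathcal{S}$, so the trace constraint is preserved exactly throughout.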

15 Sublinear Time Algorithms for Approximate Semidefinite Programming 15 Plugging (11) into (10) we have that max T f t (X t ) X T f t (X t ) X t 2T + T ɛ. Finally using the concavity of the functions f t we have that max T f t (X) T f t (X t ) 2T + T ɛ. Note that the objective in (5) is indeed concave and smooth as required by the Frank Wolfe algorithm in [15]. The following lemma states the computation time of the update step of the approximated RFTL algorithm described in definition 3 using the Frank-Wolfe algorithm [15]. For a proof see [15]. Lemma 9 Fix ɛ > 0, t [T ] and let F t (X) be as in definition 3. The Frank-Wolfe algorithm( in [15] computes a matrix X S such that max Y S F t (Y ) F t (X) ɛ n in time O 2 t ɛ ) Sublinear Time Algorithm for the Affine Saddle-Point Problem In this section we present our main algorithmic result, a sublinear time algorithm for approximately solving (1). As stated in Theorem 1, throughout this section we assume that there exists parameters F, S such that for any constraint i [m] it holds that A i 2 1, b i 1, A i F F and the matrix A i contains at most S non-zero entries. The algorithm is based on Algorithm 1 and the random acceleration techniques described in section 3. We replace the MW algorithm applied in Algorithm 1, with the following variant from [8] which is suitable for the case in which the linear losses are random variables with bounded variance but do not necessarily have constant l norm (line 13 of the algorithm). Definition 4 (Variance MW) Consider a sequence of linear losses q 1,..., q T R m. The MW algorithm is as follows. Let 0 < η R, w 1 1 m, and for t 1, p t w t / w t 1, w t+1 w t (i)(1 ηq t (i) + η 2 q t (i) 2 ). The following lemma bounds the regret of the above described MW algorithm. For a proof see [8]. Lemma 10 Let η be such that η < min, 1/q t (i). 
Then the multiplicative updates algorithm satisfies:
$$\sum_{t=1}^T p_t^\top q_t \le \min_{i\in[m]} \sum_{t=1}^T \max\left\{q_t(i), -\frac{1}{\eta}\right\} + \frac{\log m}{\eta} + \eta \sum_{t=1}^T p_t^\top q_t^2,$$
where we have denoted by $q_t^2$ the vector whose $i$th entry is $q_t(i)^2$.
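The Variance MW update of Definition 4 can be sketched in a few lines; the function name and interface below are illustrative, not from the paper:

```python
import numpy as np

def variance_mw(losses, eta):
    """Sketch of the Variance MW update from Definition 4:
    p_t = w_t / ||w_t||_1,  w_{t+1}(i) = w_t(i) * (1 - eta*q_t(i) + eta^2*q_t(i)^2).
    `losses` is a (T, m) array holding the loss vectors q_1, ..., q_T."""
    m = losses.shape[1]
    w = np.ones(m)
    plays = []
    for q in losses:
        p = w / w.sum()          # current distribution over the m coordinates
        plays.append(p.copy())
        w = w * (1.0 - eta * q + (eta * q) ** 2)
    return np.array(plays)

# Toy run: coordinate 0 always suffers zero loss, coordinate 1 loss 1;
# the distribution should concentrate on coordinate 0.
q = np.zeros((200, 2))
q[:, 1] = 1.0
plays = variance_mw(q, eta=0.1)
```

Note the second-order term $\eta^2 q_t(i)^2$: unlike the standard exponential update, this multiplier stays positive and well behaved when the losses are unbounded random estimates with bounded variance, which is exactly the setting of Lemma 10.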

Instead of feeding the MW algorithm with the exact values $c_i(X_t) = A_i \bullet X_t - b_i$ as done in Algorithm 1, we only use estimates of these values, obtained via the $\ell_2$-sampling technique described in Section 3 (lines 9-14). Also, instead of performing exact leading eigenvector computations with respect to the matrices $S_t = \sum_{i=1}^m p_t(i) A_i$ as suggested in Algorithm 1, we approximate the matrix $S_t$ by the matrix $M_t$ using roughly $O(\epsilon^{-2})$ samples from $A_1,\dots,A_m$ (lines 5-7), as explained in Subsection 3.1.2, and then compute an approximated leading eigenvector of $M_t$ using the randomized Lanczos method, up to additive error $\epsilon$ and with failure probability $\delta$ (line 8).

Algorithm 3
1: Input: error tolerance $\epsilon \in (0,1]$, an instance of Problem (1) and a parameter $F \in \mathbb{N}$ that satisfy the requirements of Theorem 1.
2: Let $\epsilon' \leftarrow \epsilon/6$, $T \leftarrow 60^2\epsilon^{-2}\log m$, $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{4\log m}{3T}}$, $k \leftarrow 10\,\epsilon'^{-2}\log(20nT)$, $\delta \leftarrow \frac{1}{20T}$
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
5:   Let $i_1,\dots,i_k$ be $k$ indices from $[m]$ sampled i.i.d. according to $p_t$
6:   $\tilde{p}_t \leftarrow \frac{1}{k}\sum_{l=1}^k e_{i_l}$
7:   $M_t \leftarrow \sum_{i=1}^m \tilde{p}_t(i) A_i$
8:   $x_t \leftarrow \mathrm{EV}(M_t, \epsilon', \delta)$
9:   Sample $F^2$ entries $(j_t^1, l_t^1),\dots,(j_t^{F^2}, l_t^{F^2}) \in [n]\times[n]$ by $(j_t^r, l_t^r) \leftarrow (j,l)$ w.p. $x_t(j)^2 x_t(l)^2$
10:  for $i \in [m]$ do
11:    $\tilde{v}_t(i) \leftarrow \frac{1}{F^2}\sum_{r=1}^{F^2} A_i(j_t^r, l_t^r)/\left(x_t(j_t^r)\, x_t(l_t^r)\right) - b_i$
12:    $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
13:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
14:  end for
15: end for
16: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T x_t x_t^\top$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$

We turn to analyze Algorithm 3. We begin by introducing some auxiliary lemmas that will be used in the proof of Theorem 1. The following lemma gives guarantees on the eigenvector computations carried out in line 8 of the algorithm. The proof relies on the Lanczos method for computing the largest eigenvalue of a positive semidefinite matrix, and is deferred to the appendix.
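Before stating the lemmas, here is a sketch of the $\ell_2$-sampling estimate used in lines 9-14 (the function name and interface are illustrative): for a unit vector $x$, drawing $(j,l)$ with probability $x(j)^2 x(l)^2$ and averaging $A(j,l)/(x(j)x(l))$ yields an unbiased estimate of $A \bullet xx^\top = x^\top A x$.

```python
import numpy as np

def l2_sample_estimate(A, x, b, num_samples, rng):
    """l2-sampling estimator sketch: draw (j, l) with probability
    x[j]^2 * x[l]^2 and average A[j, l] / (x[j] * x[l]).
    The average is an unbiased estimate of x^T A x (the raw estimator has
    second moment ||A||_F^2); subtracting b estimates A . xx^T - b."""
    n = len(x)
    probs = x ** 2  # a valid distribution since ||x||_2 = 1
    js = rng.choice(n, size=num_samples, p=probs)
    ls = rng.choice(n, size=num_samples, p=probs)
    return np.mean(A[js, ls] / (x[js] * x[ls])) - b

# Toy check against the exact value x^T A x
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = np.ones(4) / 2.0
est = l2_sample_estimate(A, x, b=0.0, num_samples=200000, rng=rng)
```

This is why the bounds throughout this section depend on $F \ge \|A_i\|_F$: the second moment of a single sample scales with $\|A_i\|_F^2$, so $F^2$ samples suffice for a constant-accuracy estimate.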
Lemma 11 Given a matrix $M$ with $N$ non-zero entries such that $\|M\|_2 = O(1)$, and parameters $\epsilon, \delta > 0$, the eigenvector computation in line 8 can be implemented using the Lanczos method to run in time $O\!\left(\frac{N}{\sqrt{\epsilon}}\log\frac{n}{\delta}\right)$, such that the returned vector $x$ satisfies with probability at least $1-\delta$: $x^\top M x \ge \lambda_{\max}(M) - \epsilon$.

The following lemma gives concentration bounds on our random variables from their expectations. The proof is deferred to the appendix.

Lemma 12 For $1/4 \ge \eta \ge \sqrt{\frac{4\log m}{3T}}$, it holds with probability at least $1 - O(1/m)$ that

1. $\max_{i\in[m]} \sum_{t=1}^T \left[v_t(i) - (A_i \bullet x_t x_t^\top - b_i)\right] \le 5\eta T$,
2. $\left|\sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet x_t x_t^\top - b_i) - \sum_{t=1}^T p_t^\top v_t\right| \le 4\eta T$.

The following lemma is a simple application of Markov's inequality. The proof is deferred to the appendix.

Lemma 13 With probability at least $3/4$ it holds that $\sum_{t=1}^T p_t^\top v_t^2 \le 16T$.

Finally, the following lemma is a simple application of Hoeffding's inequality.

Lemma 14 For any $t \in [T]$ and $\epsilon > 0$ it holds that
$$\mathbb{P}\left(\left|\sum_{i=1}^m \tilde{p}_t(i) b_i - \sum_{i=1}^m p_t(i) b_i\right| > \epsilon\right) \le 2\exp\left(-2k\epsilon^2\right).$$

We can now prove Theorem 1.

Proof We assume that the guarantees of Lemmas 12 and 13 hold, and that for all $t \in [T]$ the guarantees of Lemmas 7, 11 and 14 hold. Using the union bound and our choice of $k, \delta$, these guarantees all hold simultaneously with probability at least $1/2$. From Lemma 10 and the use of the clipping procedure in line 12 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T v_t(i) + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 12 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + 5\eta T + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Plugging in Lemma 13 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + 5\eta T + \frac{\log m}{\eta} + 16\eta T.$$
Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 12 we have that

$$\sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet x_t x_t^\top - b_i) - 4\eta T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 21\eta T.$$
By Lemmas 7 and 14 we can replace $p_t(i)$ with $\tilde{p}_t(i)$ and get:
$$\sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet x_t x_t^\top - b_i) - 2\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
By the guarantee of Lemma 11 on the eigenvector computation in line 8 we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet X - b_i) - 3\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
It holds that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet X - b_i) \ge \max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet X - b_i) = T\,\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}(i)(A_i \bullet X - b_i).$$
Thus we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet X - b_i) - 3\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
Dividing through by $T$, rearranging and plugging in our choices for $T, \eta$, we get:
$$\max_{X\in\mathcal{S}} \sum_{i=1}^m \left(\frac{1}{T}\sum_{t=1}^T p_t(i)\right)(A_i \bullet X - b_i) - 3\epsilon' - \epsilon/2 \le \min_{i\in[m]}\, A_i \bullet \left(\frac{1}{T}\sum_{t=1}^T x_t x_t^\top\right) - b_i.$$
Plugging in our choice for $\epsilon'$ yields the approximation guarantee stated in the theorem.

We now turn to discuss the arithmetic complexity of the algorithm. The algorithm performs $O(\epsilon^{-2}\log m)$ iterations, each comprised of three major steps: (i) averaging $O(\epsilon^{-2}\log n)$ matrices sampled i.i.d. according to the distribution $p_t$, (ii) computing an approximated eigenvector, and (iii) updating the weights vector $w_t$. Step (i) is implemented trivially with $O(S\epsilon^{-2}\log n)$ arithmetic operations. Step (ii), according to Lemma 11, can be carried out with $O(\|M_t\|_0\,\epsilon^{-0.5}\log n)$ arithmetic operations, where $\|M_t\|_0$ is the number of non-zero entries in the matrix $M_t$.

This number is trivially bounded by $\min\{S\epsilon^{-2}\log n,\, n^2\}$. Note also that since $M_t$ is the average of matrices with spectral norm at most 1, it holds that $\|M_t\|_2 \le 1$, as required in Lemma 11. Step (iii) requires $O(\|x_t\|_0) \le O(\|M_t\|_0)$ preprocessing time to generate the $\ell_2$-distribution and additional $O(F^2(\log n + m))$ time to generate and process the samples. Summing all contributions yields the total arithmetic complexity. Finally, the conclusion regarding the rank of the returned solution $\bar{X}$ follows from the observation that $\bar{X}$ is the average of $O(\epsilon^{-2}\log m)$ rank-one matrices.

5 Increasing the Probability of Success

In this section we derive high-probability results for Algorithm 3. Our results are based on showing that, given a pair of primal-dual solutions generated by Algorithm 3, it is possible to verify their quality with high probability and in sublinear time. Our algorithm for verifying a candidate primal solution is given below. The algorithm assumes that the candidate $\bar{X}$ is of the form prescribed in Algorithm 3, that is $\bar{X} = \frac{1}{T}\sum_{t=1}^T x_t x_t^\top$, and is available implicitly as the unit vectors $x_1, x_2,\dots,x_T \in \mathbb{R}^n$.

Algorithm 4
1: Input: instance of Problem (1), candidate solution $\bar{X}$ given by unit vectors $x_1,\dots,x_T \in \mathbb{R}^n$, $F \in \mathbb{N}$ s.t. $\forall i \in [m]: \|A_i\|_F \le F$, error tolerance $\epsilon > 0$, failure probability $\delta$.
2: $k \leftarrow \frac{16\log(2m/\delta)}{\epsilon^2}$
3: for $s = 1$ to $k$ do
4:   Sample an index $t_s \in [T]$ uniformly at random
5:   Sample $F^2$ entries $(j_1, l_1),\dots,(j_{F^2}, l_{F^2}) \in [n]\times[n]$ i.i.d. by $(j_r, l_r) \leftarrow (j, l)$ w.p. $x_{t_s}(j)^2 x_{t_s}(l)^2$
6:   for $i = 1$ to $m$ do
7:     $\tilde{v}_i(s) \leftarrow \frac{1}{F^2}\sum_{r=1}^{F^2} \frac{A_i(j_r, l_r)}{x_{t_s}(j_r)\, x_{t_s}(l_r)}$
8:     $v_i(s) \leftarrow \mathrm{clip}(\tilde{v}_i(s), 4/\epsilon)$
9:   end for
10: end for
11: $\forall i \in [m]$: $v_i \leftarrow \frac{1}{k}\sum_{s=1}^k v_i(s)$
12: return $v_1,\dots,v_m$.

Lemma 15 Algorithm 4 runs in time $O\!\left(\frac{\log(m/\delta)}{\epsilon^2}\left(\min\{S\epsilon^{-2}\log n,\, n\} + mF^2\right)\right)$ and returns values $v_1,\dots,v_m$ such that with probability at least $1-\delta$ it holds that $\forall i \in [m]: |A_i \bullet \bar{X} - v_i| \le \epsilon$.
Proof As in the analysis of Algorithm 3, the $\ell_2$-sampling in lines 5-9 guarantees for all $s \in [k]$ and $i \in [m]$ that the conditional expectation satisfies $\mathbb{E}[\tilde{v}_i(s) \mid t_s] = A_i \bullet x_{t_s} x_{t_s}^\top$ and that the conditional second moment satisfies $\mathbb{E}[\tilde{v}_i(s)^2 \mid t_s] \le 1$. Since $t_s$ is chosen uniformly at random we have that
$$\mathbb{E}[\tilde{v}_i(s)] = \frac{1}{T}\sum_{t=1}^T A_i \bullet x_t x_t^\top = A_i \bullet \bar{X}.$$
Also it holds that $\mathbb{E}[\tilde{v}_i(s)^2] \le 1$. Thus for all $i \in [m]$ and $s \in [k]$ it holds that $\mathbb{E}[v_i(s)^2] \le \mathbb{E}[\tilde{v}_i(s)^2] \le 1$ and $|v_i(s)| \le 4/\epsilon$. By Bernstein's inequality we have that

$$\mathbb{P}\left(|v_i - \mathbb{E}[v_i]| \ge \epsilon/2\right) \le 2\exp\left(-\frac{k^2\epsilon^2/8}{k + \frac{4}{\epsilon}\cdot\frac{k\epsilon/2}{3}}\right) \le 2\exp\left(-\frac{k\epsilon^2}{16}\right).$$
Using the properties of the clipping procedure (see Lemma 5), and since $\mathbb{E}[\tilde{v}_i(s)] = A_i \bullet \bar{X}$, we have that with probability at least $1 - 2\exp(-k\epsilon^2/16)$:
$$|v_i - A_i \bullet \bar{X}| \le \left|v_i - \frac{1}{k}\sum_{s=1}^k \mathbb{E}[v_i(s)]\right| + \frac{1}{k}\sum_{s=1}^k \left|\mathbb{E}[v_i(s)] - \mathbb{E}[\tilde{v}_i(s)]\right| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$
Using the union bound, the last inequality holds for all $i \in [m]$ simultaneously with probability at least $1 - 2m\exp(-k\epsilon^2/16)$. Choosing $k = \frac{16\log(2m/\delta)}{\epsilon^2}$ guarantees that all inequalities hold with probability at least $1-\delta$.

The running time is due to the number of iterations $k$, each comprised of $O(\|x_{t_s}\|_0)$ preprocessing time to set up the $\ell_2$-sampling distribution from the vector $x_{t_s}$ and additional $O(mF^2)$ time to process the samples. As detailed in the analysis of Algorithm 3, it holds that $\|x_{t_s}\|_0 \le \min\{S\epsilon^{-2}\log n,\, n\}$.

Lemma 16 Given parameters $\epsilon, \delta > 0$, Algorithm 3 can be implemented to run in time
$$O\left(\log\frac{1}{\delta}\left(T_{ALG}(\epsilon) + T_{VER}(\epsilon, \delta) + \min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}\,\epsilon^{-0.5}\log\frac{n}{\delta}\right)\right),$$
where $T_{ALG}(\epsilon)$ denotes the running time of Algorithm 3 with tolerance $\epsilon$ and $T_{VER}(\epsilon, \delta)$ denotes the running time of the primal verification procedure given in Algorithm 4 with tolerance $\epsilon$ and failure probability $\delta$, such that with probability at least $1-\delta$ the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:
$$\min_{p\in\Delta_m} \sum_{i=1}^m p_i(A_i \bullet \bar{X} - b_i) \ge \max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i(A_i \bullet X - b_i) - \epsilon.$$

Proof We run Algorithm 3 $N = O(\log(1/\delta))$ times with error parameter $\epsilon/4$. This guarantees that with probability at least $1-\delta/3$ one of the returned primal-dual pairs $(\bar{X}_j, \bar{p}_j)$ satisfies Theorem 1 with error at most $\epsilon/4$. We can verify all primal solutions up to error $\epsilon/4$ and with failure probability $\delta/(3N)$ by running Algorithm 4 on all

primal solutions. In order to verify a dual solution $\bar{p}$ we need to approximate with high probability the value
$$\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i(A_i \bullet X - b_i) = \lambda_{\max}\left(\sum_{i=1}^m \bar{p}_i A_i\right) - \sum_{i=1}^m \bar{p}_i b_i.$$
We can perform this verification up to error $\epsilon/4$ and with failure probability $\delta/(3N)$ by applying Lemma 11 to the matrix $M = \sum_{i=1}^m \bar{p}(i) A_i$ with error parameter $\epsilon/4$ and failure probability $\delta/(3N)$. Note that $M$ can be aggregated during the run of Algorithm 3 without increasing the computational complexity. Thus with probability at least $1-\delta$ we get a pair of primal-dual solutions that satisfies Theorem 1, and all verifications are correct. Finally, note that on every iteration $t$ of Algorithm 3 it holds that $\|M_t\|_0 \le \min\{S\epsilon^{-2}\log n,\, n^2\}$, which means that $\|M\|_0 \le \min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}$. Thus verifying a dual solution according to Lemma 11 takes at most $O(\min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}\,\epsilon^{-0.5}\log(n/\delta))$ arithmetic operations.

6 Sublinear Time Algorithm for the Non-Affine Saddle-Point Problem

In this section we describe and analyse our algorithm for non-affine constraints $c_i(X) \ge 0$, $i \in [m]$, where each $c_i$ is concave and differentiable. Throughout this section we make the following assumptions for all $i \in [m]$ and $X \in \mathcal{S}$:
1. $|c_i(X)| \le 1$.
2. $\|\nabla c_i(X)\|_F \le 1$.
3. $\nabla c_i(X)$ can be computed in $O(n^2)$ time.
4. There exists a procedure Sample$(X, c_1,\dots,c_m)$ that returns in $O(m + n^2)$ time unbiased estimators of the values $c_1(X),\dots,c_m(X)$, each with variance at most 1.

Example and justification of assumptions. Although these assumptions are relatively strong, they hold for example for the important case of the square loss, that is $c(X) = (A \bullet X - b)^2$ for some matrix $A \in \mathbb{R}^{n\times n}$ and scalar $b \in \mathbb{R}$. Note that for this function conditions 1 and 2 hold after proper scaling, and condition 3 holds trivially.
Estimating the value of $c(X)$ in $O(1)$ time and with unit variance can be done by taking two independent unbiased estimators $v_1, v_2$ of the value $A \bullet X - b$, using the $\ell_2$-sampling technique described in Section 3, and forming the estimator $v = v_1 v_2$. Since $v_1, v_2$ are independent we have that $\mathbb{E}[v] = \mathbb{E}[v_1]\,\mathbb{E}[v_2] = (A \bullet X - b)^2$. Also $\mathbb{E}[v^2] = \mathbb{E}[v_1^2 v_2^2] = \mathbb{E}[v_1^2]\,\mathbb{E}[v_2^2]$, which means that in case $\|A\|_F \le 1$, the estimator $v$ also has unit variance.

Remark 1 In case only a constant number of the functions $c_1,\dots,c_m$ are non-affine and all the rest are affine, assumption 4 can be replaced with the assumption that $c_i(X)$ can be computed in $O(n^2)$ time for every non-affine function $c_i$ and every $X \in \mathcal{S}$. This allows the running time to remain sublinear without the need to implement the procedure Sample for the non-affine constraints (recall that for affine constraints we implement Sample using the $\ell_2$-sampling technique).
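The product-of-estimators construction above can be checked numerically. The sketch below (names and interfaces are illustrative) forms $v = v_1 v_2$ from two independent $\ell_2$-sampling estimates of $A \bullet xx^\top - b$ at a rank-one point $X = xx^\top$:

```python
import numpy as np

def l2_estimate_affine(A, x, b, rng):
    """One l2-sample estimate of A . xx^T - b for a unit vector x."""
    n = len(x)
    j = rng.choice(n, p=x ** 2)
    l = rng.choice(n, p=x ** 2)
    return A[j, l] / (x[j] * x[l]) - b

def square_loss_estimate(A, x, b, rng):
    """Unbiased estimate of c(X) = (A . X - b)^2 at X = xx^T: the product
    of two independent unbiased estimates of A . X - b, as in the text."""
    return l2_estimate_affine(A, x, b, rng) * l2_estimate_affine(A, x, b, rng)

# Monte Carlo check that the mean matches (x^T A x - b)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A /= 2 * np.linalg.norm(A)          # scale so that ||A||_F = 1/2
x = np.ones(4) / 2.0
b = 0.2
est = np.mean([square_loss_estimate(A, x, b, rng) for _ in range(20000)])
true = (x @ A @ x - b) ** 2
```

The independence of the two inner estimates is what makes both the mean and the second moment factor into products, which is exactly the argument used above.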

Our algorithm is based on Algorithm 2. It resembles Algorithm 3, except that instead of producing the iterates $X_t$ by offline optimization (eigenvector computations), they are produced by the approximated regularized follow-the-leader online algorithm defined in Definition 3 (line 6 in Algorithm 5), where the RFTL objective (5) is solved up to an additive error $\epsilon'$ using the Frank-Wolfe algorithm (see Lemma 9). A problem with applying the RFTL algorithm as done in Algorithm 2 (line 5) is that on each iteration we are required to compute the gradients of $m$ functions, $\nabla c_1(X_t),\dots,\nabla c_m(X_t)$, which under our assumptions takes $O(mn^2)$ time. To overcome this hurdle we use a standard trick in stochastic optimization: instead of using the entire gradient at time $t$, $\sum_{i=1}^m p_t(i)\nabla c_i(X_t)$, we sample a single gradient $\nabla c_i(X_t)$ with probability $p_t(i)$ (line 5 of the algorithm). Applying RFTL with the sampled gradients achieves with high probability the same regret bound as in Lemma 8. The $\ell_2$-sampling technique used to estimate the values $c_i(X_t)$ in Algorithm 3 is replaced with a call to the generic sampling oracle Sample$(X_t, c_1,\dots,c_m)$ (line 7 of the algorithm). The algorithm is given below.

Algorithm 5
1: Input: error tolerance $\epsilon \in (0,1]$, an instance of Problem (5) and a procedure Sample$(\cdot)$ that satisfy the requirements of Theorem 3.
2: Let $\epsilon' \leftarrow \frac{\epsilon^2}{9}$, $T \leftarrow 60^2\epsilon^{-2}\log m$, $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{4\log m}{3T}}$
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
5:   Choose $i_t \in [m]$ by $i_t \leftarrow i$ w.p. $p_t(i)$
6:   $X_{t+1} \leftarrow$ $\epsilon'$-approximated solution to $\max_{X\in\mathcal{S}} \frac{1}{\sqrt{2T}}\sum_{\tau=1}^t \nabla c_{i_\tau}(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$
7:   $\tilde{v}_t \leftarrow$ Sample$(X_t, c_1,\dots,c_m)$
8:   for $i \in [m]$ do
9:     $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
10:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
11:  end for
12: end for
13: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T e_{i_t}$

We turn to analyze Algorithm 5 and to prove Theorem 3. The proof of the following concentration result is given in the appendix.
Lemma 17 For $1/4 \ge \eta \ge \sqrt{\frac{4\log m}{3T}}$, it holds with probability at least $1 - O(1/m)$ that
1. $\max_{i\in[m]} \sum_{t=1}^T \left[v_t(i) - c_i(X_t)\right] \le 5\eta T$,

2. $\left|\sum_{t=1}^T c_{i_t}(X_t) - \sum_{t=1}^T p_t^\top v_t\right| \le 4\eta T$.

We can now prove Theorem 3.

Proof For the proof we assume that the guarantees stated in Lemmas 17 and 13 hold. Using the union bound this happens with probability at least $1/2$. From Lemma 10 and the use of clipping in line 9 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T v_t(i) + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 17 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + 5\eta T + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Plugging in Lemma 13 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + 5\eta T + \frac{\log m}{\eta} + 16\eta T.$$
Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 17 we have that
$$\sum_{t=1}^T c_{i_t}(X_t) - 4\eta T \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 21\eta T.$$
By Lemma 8 we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T c_{i_t}(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 25\eta T.$$
Since $\sum_{t=1}^T c_{i_t}(X) = \sum_{i=1}^m \sum_{t: i_t = i} c_i(X) = T\sum_{i=1}^m \bar{p}_i c_i(X)$, we have that
$$T\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 25\eta T.$$
By concavity of the functions $c_1,\dots,c_m$ we have that
$$T\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le T\min_{i\in[m]} c_i(\bar{X}) + \frac{\log m}{\eta} + 25\eta T.$$
The result follows by dividing through by $T$ and plugging in our choices for $\epsilon', \eta, T$.
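The RFTL update step used throughout this section maximizes a smooth, strongly concave function over the spectrahedron, and each Frank-Wolfe iteration reduces to a leading eigenvector computation. A minimal sketch of this step, assuming exact eigenvector computations rather than Lanczos and with illustrative names:

```python
import numpy as np

def frank_wolfe_spectrahedron(grad, n, num_iters):
    """Maximize a smooth concave function over S = {X >= 0, Tr(X) = 1}.
    Each Frank-Wolfe step moves toward vv^T, where v is a leading
    eigenvector of the current gradient (the linear maximizer over S)."""
    X = np.eye(n) / n  # an arbitrary starting point in S
    for t in range(1, num_iters + 1):
        G = grad(X)
        _, V = np.linalg.eigh((G + G.T) / 2)  # eigenvalues in ascending order
        v = V[:, -1]                          # leading eigenvector
        step = 2.0 / (t + 2)                  # standard Frank-Wolfe step size
        X = (1 - step) * X + step * np.outer(v, v)
    return X

# RFTL-style objective F(X) = C . X - 0.5 * ||X||_F^2, with C standing in
# for the scaled sum of sampled gradients; the gradient of F is C - X.
n = 5
C = np.diag(np.arange(n, dtype=float))
X = frank_wolfe_spectrahedron(lambda X: C - X, n, num_iters=500)
```

Since every step is a convex combination of trace-one PSD matrices, all iterates stay in the spectrahedron, and only eigenvector computations, never projections, are required.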

Algorithm 5 performs $O(\epsilon^{-2}\log m)$ iterations, each comprised of an update step of the approximated RFTL algorithm and a call to the procedure Sample$(\cdot)$. From Lemma 9 we have that the time complexity of the approximated RFTL update step is bounded by $O(n^2\epsilon^{-4}\log m)$, and under our assumptions a call to the procedure Sample$(\cdot)$ takes $O(m + n^2)$ time. Hence we get the running time stated in Theorem 3.

7 Lower Bounds

In this section we prove Theorem 2. Our lower bounds are information-theoretic, meaning that any successful algorithm must read a certain amount of entries from the input. The core machinery behind these bounds is the following folklore fact (see [8] for more details).

Proposition 1 Consider a one-dimensional array of length $r$ such that with probability $1/2$ the array contains only zeros, and with the remaining probability $1/2$ the array contains a single non-zero entry, whose position is chosen uniformly at random, and zeros in all other entries. Then any algorithm that determines with probability at least $2/3$ whether the array contains only zeros or not must read $\Omega(r)$ entries of the array.

The following lemma describes how we use the above proposition to prove our lower bounds.

Lemma 18 Fix an error tolerance $\epsilon$ and the parameters $m, n$. Suppose that there exists a distribution over pairs of sets of $m$ matrices, $A_1^1,\dots,A_m^1 \in \mathbb{R}^{n\times n}$ and $A_1^2,\dots,A_m^2 \in \mathbb{R}^{n\times n}$, and a set $B \subseteq [m]\times[n]\times[n]$, such that:
1. $\left|\max_{X\in\mathcal{S}}\min_{i\in[m]} A_i^1 \bullet X - \max_{X\in\mathcal{S}}\min_{i\in[m]} A_i^2 \bullet X\right| > \epsilon$.
2. There exists a triplet $(i,j,k) \in B$ such that $A_i^1(j,k) \ne A_i^2(j,k)$, and for all $(i',j',k') \in B$ with $(i',j',k') \ne (i,j,k)$ it holds that $A_{i'}^1(j',k') = A_{i'}^2(j',k')$.
3. The distinctive triplet $(i,j,k)$ in the previous condition is distributed uniformly among all triplets in $B$.
Then any $\epsilon$-additive approximation algorithm for Problem (1) that succeeds with probability at least $2/3$ has running time $\Omega(|B|)$.

Proof Assume that a distribution and a set $B$ that satisfy the conditions of the lemma exist.
Given matrices $A_1,\dots,A_m \in \mathbb{R}^{n\times n}$, define ARRAY$(A_1,\dots,A_m)$ to be a one-dimensional array of length $|B|$ whose entries are exactly the entries $A_i(j,k)$ such that $(i,j,k) \in B$, organized into a one-dimensional array in some fixed order. Let $A_1^1,\dots,A_m^1$ and $A_1^2,\dots,A_m^2$ be two sets of matrices drawn from the distribution. Denote by ARR$_1$ the array ARRAY$(A_1^1,\dots,A_m^1)$ and by ARR$_2$ the array ARRAY$(A_1^2,\dots,A_m^2)$. Now, given an $\epsilon$-additive approximation algorithm for Problem (1) that succeeds with probability at least $2/3$, we do the following. With probability $1/2$ we run it on the set of matrices $A_1^1,\dots,A_m^1$, and with the remaining probability we run it on $A_1^2,\dots,A_m^2$. Note that since the optima of these two instances differ by more than $\epsilon$, the algorithm must distinguish between these two instances.
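The needle-in-a-haystack structure of Proposition 1 is easy to simulate; the illustrative code below (not part of the proof) shows that a scanner which must certify "all zeros" reads, on average, a constant fraction of the array:

```python
import numpy as np

def make_hard_array(r, rng):
    """With probability 1/2 all zeros; otherwise a single 1 at a uniform position."""
    arr = np.zeros(r)
    if rng.random() < 0.5:
        arr[rng.integers(r)] = 1.0
    return arr

def reads_until_decision(arr):
    """A left-to-right scanner stops at the non-zero entry, or reads everything."""
    for idx, v in enumerate(arr):
        if v != 0.0:
            return idx + 1
    return len(arr)

rng = np.random.default_rng(0)
r = 1000
mean_reads = np.mean([reads_until_decision(make_hard_array(r, rng))
                      for _ in range(2000)])
# roughly 0.75 * r entries are read on average: r when the array is all-zero
# (probability 1/2), about r/2 when the needle sits at a uniform position
```

A left-to-right scan is only one reading strategy, but the information-theoretic argument above shows that no strategy, adaptive or randomized, can do better than $\Omega(r)$ reads while succeeding with probability $2/3$.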


More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Beating SGD: Learning SVMs in Sublinear Time

Beating SGD: Learning SVMs in Sublinear Time Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological

More information

Convex and Semidefinite Programming for Approximation

Convex and Semidefinite Programming for Approximation Convex and Semidefinite Programming for Approximation We have seen linear programming based methods to solve NP-hard problems. One perspective on this is that linear programming is a meta-method since

More information

Lecture 7: Passive Learning

Lecture 7: Passive Learning CS 880: Advanced Complexity Theory 2/8/2008 Lecture 7: Passive Learning Instructor: Dieter van Melkebeek Scribe: Tom Watson In the previous lectures, we studied harmonic analysis as a tool for analyzing

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization Iranian Journal of Operations Research Vol. 4, No. 1, 2013, pp. 88-107 Research Note A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization B. Kheirfam We

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Sometimes Sublinear Run Time Mengdi Wang Department of Operations Research and Financial Engineering, Princeton

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Online Learning with Experts & Multiplicative Weights Algorithms

Online Learning with Experts & Multiplicative Weights Algorithms Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Online Convex Optimization with Stochastic Constraints

Online Convex Optimization with Stochastic Constraints Online Convex Optimization with Stochastic Constraints Hao Yu, Michael J. Neely, Xiaohan Wei Department of Electrical Engineering University of Southern California 1 arxiv:1708.03741v1 [math.oc] 1 Aug

More information

Convex optimization. Javier Peña Carnegie Mellon University. Universidad de los Andes Bogotá, Colombia September 2014

Convex optimization. Javier Peña Carnegie Mellon University. Universidad de los Andes Bogotá, Colombia September 2014 Convex optimization Javier Peña Carnegie Mellon University Universidad de los Andes Bogotá, Colombia September 2014 1 / 41 Convex optimization Problem of the form where Q R n convex set: min x f(x) x Q,

More information

Online Submodular Minimization

Online Submodular Minimization Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com

More information

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as Chapter 8 Geometric problems 8.1 Projection on a set The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as dist(x 0,C) = inf{ x 0 x x C}. The infimum here is always achieved.

More information

Bandits for Online Optimization

Bandits for Online Optimization Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions

Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301

More information

Lecture Approximate Potentials from Approximate Flow

Lecture Approximate Potentials from Approximate Flow ORIE 6334 Spectral Graph Theory October 20, 2016 Lecturer: David P. Williamson Lecture 17 Scribe: Yingjie Bi 1 Approximate Potentials from Approximate Flow In the last lecture, we presented a combinatorial

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function Zhongyi Liu, Wenyu Sun Abstract This paper proposes an infeasible interior-point algorithm with

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

U.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017

U.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017 U.C. Berkeley CS94: Beyond Worst-Case Analysis Handout 1 Luca Trevisan October 3, 017 Scribed by Maxim Rabinovich Lecture 1 In which we begin to prove that the SDP relaxation exactly recovers communities

More information

Lecture 7: Semidefinite programming

Lecture 7: Semidefinite programming CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational

More information

A priori bounds on the condition numbers in interior-point methods

A priori bounds on the condition numbers in interior-point methods A priori bounds on the condition numbers in interior-point methods Florian Jarre, Mathematisches Institut, Heinrich-Heine Universität Düsseldorf, Germany. Abstract Interior-point methods are known to be

More information

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4 Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4 Instructor: Farid Alizadeh Scribe: Haengju Lee 10/1/2001 1 Overview We examine the dual of the Fermat-Weber Problem. Next we will

More information