Sublinear Time Algorithms for Approximate Semidefinite Programming


Noname manuscript No. (will be inserted by the editor)

Sublinear Time Algorithms for Approximate Semidefinite Programming

Dan Garber · Elad Hazan

Received: date / Accepted: date

Abstract We consider semidefinite optimization in a saddle point formulation in which the primal solution lies in the spectrahedron and the dual solution is a distribution over affine functions. We present an approximation algorithm for this problem that runs in time sublinear in the size of the data. To the best of our knowledge, this is the first algorithm to achieve this. Our algorithm is also guaranteed to produce low-rank solutions. We further prove lower bounds on the running time of any algorithm for this problem, showing that certain terms in the running time of our algorithm cannot be further improved. Finally, we consider a non-affine version of the saddle point problem and give an algorithm that under certain assumptions runs in sublinear time.

Keywords semidefinite programming · sublinear algorithms · online algorithms · large scale optimization

Mathematics Subject Classification (2000) 90C22 · 90C06 · 68W27 · 68W20

1 Introduction

Semidefinite programming has become a tool of great importance in optimization in recent years. In the field of combinatorial optimization, for example, numerous approximation algorithms have been discovered, starting with the seminal work of Goemans and Williamson on the MAX-CUT problem [12], which was followed

Dan Garber
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: dangar@tx.technion.ac.il

Elad Hazan
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: ehazan@ie.technion.ac.il

by [4,1,3]. In the field of machine learning, solving semidefinite programs is at the heart of many learning tasks, such as learning a distance metric [26,24], sparse PCA [10], multiple kernel learning [19], matrix completion [6], and more. In machine learning applications in particular, the data is often assumed to be noisy, so there is little point in solving the underlying optimization problem to very high precision. Moreover, the extremely large scale of the data in such applications makes the underlying optimization problem impractical to solve to relatively high precision, even for fast first-order methods. Indeed, in recent years there has been much research on improving the running time of first-order methods for semidefinite programming, with the aim of continually increasing the scale of problems that can be tackled with sufficient accuracy, and there now exist solvers that, for certain problems and certain parameter regimes, run in nearly linear time [9,5]. In an attempt to accelerate first-order methods for semidefinite programming even further, we explore in this paper the possibility of designing algorithms for semidefinite optimization that, in certain parameter regimes of interest, require running time that is only sublinear in the size of the problem.¹ The main semidefinite optimization problem under investigation in this paper is the following saddle point problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet X - b_i), \qquad (1)$$

where $A_1, \ldots, A_m \in \mathbb{R}^{n \times n}$ are all symmetric, $b_1, \ldots, b_m \in \mathbb{R}$, $\Delta_m = \{x \in \mathbb{R}^m \mid x \geq 0, \|x\|_1 = 1\}$ is the $(m-1)$-dimensional unit simplex, $\mathcal{S} = \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0, \mathrm{Tr}(X) = 1\}$ is the spectrahedron, that is, the set of unit-trace $n \times n$ positive semidefinite matrices, and $\bullet$ denotes the standard matrix dot product, that is, $A \bullet B = \sum_{i,j=1}^{n} A_{ij} B_{ij}$.
In case the optimal value of (1) is non-negative, solving (1) up to an additive error of $\epsilon$ is equivalent to finding a solution in the set $\mathcal{S}$ that satisfies all constraints $A_i \bullet X - b_i \geq 0$ up to an additive error of $\epsilon$. Note that the requirement that the solution have unit trace is equivalent, via standard reductions, to the requirement that the trace of the solution be bounded from above. Moreover, the formulation in (1) captures the notion of maximizing the margin from feasibility, that is, maximizing the scalar $\sigma$ that still satisfies $A_i \bullet X \geq b_i + \sigma$ for all $i \in [m]$, for some matrix $X \in \mathcal{S}$. This objective is of direct interest in certain machine learning tasks such as the problem of learning a distance metric [26,24]. The dual of Problem (1) is the following eigenvalue optimization problem:

$$\min_{p \in \Delta_m} \lambda_{\max}\left( \sum_{i=1}^{m} p_i (A_i - b_i I) \right), \qquad (2)$$

where $I$ denotes the $n \times n$ identity matrix. Note that duality here is strong, i.e., there is no duality gap between problems (1) and (2).

¹ The results presented in this paper are a continuation of preliminary results on sublinear semidefinite optimization presented in [11].
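To make the dual concrete, the following minimal numpy sketch (on a hypothetical toy instance; the function name `dual_objective` is ours) evaluates the dual objective (2) for a given distribution $p$:

```python
import numpy as np

def dual_objective(p, As, bs):
    """Evaluate the dual objective (2): lambda_max(sum_i p_i (A_i - b_i I))."""
    n = As[0].shape[0]
    M = sum(p_i * (A_i - b_i * np.eye(n)) for p_i, A_i, b_i in zip(p, As, bs))
    return np.linalg.eigvalsh(M)[-1]  # eigvalsh sorts ascending; take the largest

# Toy instance: one constraint A_1 = diag(2, 0), b_1 = 1.
A1 = np.diag([2.0, 0.0])
print(dual_objective([1.0], [A1], [1.0]))  # lambda_max(diag(1, -1)) = 1
```

By weak duality, the value returned for any $p \in \Delta_m$ upper bounds the saddle value of (1).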

Our main contribution is a randomized algorithm that solves Problem (1) up to a desired additive error $\epsilon$ in a total running time that, for large enough instances, is sublinear in the number of entries in the instance. To the best of our knowledge this is the first result to achieve this kind of bound on the running time. Our algorithm has an additional appealing feature: it always produces low-rank solutions. More precisely, we prove the following theorem, which is the main theorem of this paper.

Theorem 1 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (1) satisfying for all $i \in [m]$:
– $\|A_i\|_2 \leq 1$, i.e., all matrices are bounded in spectral (operator) norm,
– $|b_i| \leq 1$,
– the number of non-zero entries of each matrix $A_i$ is upper bounded by $S$,
– $\|A_i\|_F \leq F$, i.e., all matrices have Frobenius norm at most $F$,
finds after

$$O\left( \epsilon^{-2} \log m \left( F^2 (m + \log n) + S \epsilon^{-2} \log n + \min\{S \epsilon^{-2} \log n,\, n^2\}\, \epsilon^{-1/2} \log n \right) \right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i (A_i \bullet X - b_i) - \epsilon,$$

and the rank of $\bar{X}$ is upper bounded by $O(\epsilon^{-2} \log m)$.

Note that in the worst case $S = \Theta(n^2)$ and $F^2 = \Theta(n)$ (under the assumption of unit spectral norm), and the bound in the theorem becomes $\tilde{O}\left(\epsilon^{-2}(mn + \epsilon^{-2} n^2)\right)$ (suppressing poly-logarithmic factors), which is sublinear in the size of the problem (which is $\Theta(mn^2)$) for $\epsilon \geq \max\{n^{-1/2}, m^{-1/4}\}$. In case $F^2 = O(1)$ and $S = \Theta(n^2)$, the bound becomes $\tilde{O}\left(\epsilon^{-2}(m + n^2\epsilon^{-2})\right)$, which actually splits the dependency between $m$ and $n$. In this case the algorithm exhibits a sublinear running time for $\epsilon \geq \max\{n^{-1}, m^{-1/4}\}$. We also prove the following lower bounds on the running time of any $\epsilon$-additive approximation algorithm for Problem (1).
Theorem 2 Given a choice of parameters $m, n$, let $\Gamma^F_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_F \leq 1$, $b_i = 0$, and let $\Gamma^S_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_2 \leq 1$, $b_i = 0$. The following claims hold.
1. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $m$, and let $n \geq \frac{1}{4\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega\left(\frac{m}{\epsilon^2}\right)$ on instances from $\Gamma^F_{m,n}$.

2. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $n$, and let $m \geq \frac{1}{8\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega\left(\frac{n^2}{\epsilon}\right)$ on instances from $\Gamma^F_{m,n}$.
3. Fix the parameters $m, n$ and an error tolerance $\epsilon$ satisfying $\frac{1}{8} \geq \epsilon \geq \frac{1}{4n}$. Any algorithm that, given an instance from $\Gamma^S_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst-case running time $\Omega(mn)$ on instances from $\Gamma^S_{m,n}$.

Part 1 of Theorem 2 shows that the dependency of the running time in Theorem 1 on the term $\epsilon^{-2} m$ cannot be improved in general. Part 2 of the theorem shows that the dependency of the running time on the sparsity parameter $S$ cannot be sublinear in general. Finally, part 3 of the theorem shows that in general (under the unit spectral norm assumption) the dependency on $m, n$ in the running time cannot be separated, as happens for instance in Theorem 1 in the case $F = \Theta(1)$.

Finally, we also consider the following non-affine saddle-point optimization problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(X), \qquad (3)$$

where $c_1, \ldots, c_m : \mathcal{S} \to \mathbb{R}$ are all concave and differentiable everywhere in $\mathcal{S}$.² For this problem we have the following theorem.

Theorem 3 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (3) satisfying for all $i \in [m]$ and all $X \in \mathcal{S}$:
– $|c_i(X)| \leq 1$,
– $\|\nabla c_i(X)\|_F \leq 1$,
– $\nabla c_i(X)$ can be computed in $O(n^2)$ time,
– there exists a procedure $\mathrm{Sample}(X, c_1, \ldots, c_m)$ that returns in $O(m + n^2)$ time unbiased estimators of the values $c_1(X), \ldots, c_m(X)$, each with variance at most $1$,
finds after

$$O\left( \epsilon^{-2} (m + \epsilon^{-4} n^2) \log^2 m \right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i c_i(X) - \epsilon.$$

² Our results hold also under the weaker assumption that every $c_i$ has a supergradient everywhere in $\mathcal{S}$.

Our algorithm for the non-affine case may be viewed as a sublinear-time algorithm in the sense that, under our assumptions, the total running time of the algorithm for $\epsilon \geq \max\{n^{-1}, m^{-1/6}\}$ is less than the worst-case time required to evaluate the gradient of each function $c_i$ at least once. Although the conditions stated in Theorem 3 are somewhat restrictive, we show that, for instance, for the important case of the square-loss function $c(X) = -(A \bullet X - b)^2$ they indeed hold. This example is spelled out in Section 6.

1.1 Related Work

The problems (1), (2) under consideration in this paper are solvable up to an accuracy of $\epsilon$ via Interior Point methods in time $O\left( \sqrt{\max\{m, n\}} \left( n^3 + m^3 + m^2 n^2 \right) \log(1/\epsilon) \right)$. In recent years there have been extensive efforts to speed up the running time of first-order methods for these problems; see [9,20,21,17,5,15] and references therein. Current state-of-the-art first-order methods include the deterministic smoothing technique of Nesterov [21], the Mirror Descent method [9] and the recent randomized Mirror-Prox method of Baes, Bürgisser and Nemirovski [5], all three considering the eigenvalue optimization problem (2). The current best running time of first-order methods for solving Problem (2) (up to poly-logarithmic factors), under the assumption that all matrices $A_i$ have spectral norm at most $1$, is

$$\min\left\{ \frac{mS}{\epsilon} + \frac{n^2}{\epsilon^3} \text{ [5]},\quad \frac{mS}{\epsilon} + \frac{n^3}{\epsilon} \text{ [21]},\quad \frac{mS}{\epsilon^2} + \frac{n^2}{\epsilon^{2.5}} \text{ [9]} \right\}. \qquad (4)$$

In order to demonstrate the potential appeal of the algorithm proposed in this paper over previous art, we present in Table 1 different regimes of $m$ (the number of constraints), $S$ (the maximum number of non-zero entries in any of the matrices) and $F$ (an upper bound on the Frobenius norm of all matrices) for which our method improves over the bound (4), for a meaningful range of the error tolerance parameter $\epsilon$.
Please note that Table 1 does not give a full characterization of the settings in which our method may be preferable, but only presents a partial list. Our algorithms are based on applying algorithms from the paradigm of online convex optimization to solve saddle point optimization problems. For recent surveys on online convex optimization see [16,23]. We use the generic schemes presented in [14] for converting these online algorithms into algorithms for saddle point optimization, and we especially build on the well-known Multiplicative Weights algorithm for online linear optimization over the simplex [2]. In order to accelerate these schemes and derive a sublinear running time, we apply, among others, ideas introduced in [8], in which the authors presented several sublinear time algorithms for machine learning optimization tasks such as the supervised learning of separating hyperplanes. In [8] the authors accelerated the schemes from [14] using novel random estimation techniques instead of exact computations. This approach was also taken in the pioneering work of Grigoriadis and Khachiyan [13], who presented a randomized sublinear time algorithm for a saddle point optimization problem in which both the primal and the dual solutions lie in the simplex. The sublinear running time of their algorithm is also achieved by trading exact computations for random sampling.

regime                                                              | previous FOM                                      | IPM        | this paper ($\epsilon \geq n^{-1}$)            | improvement
$S = \Theta(n)$, $m = \Theta(n)$, $F^2 = \Theta(n)$                 | $\min\{\epsilon^{-1} n^3, \epsilon^{-5/2} n^2\}$  | $n^{9/2}$  | $\epsilon^{-2} n^2 + \epsilon^{-9/2} n$        | $\epsilon > n^{-4/7}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(1)$               | $\epsilon^{-1} n^4$                               | $n^{21/2}$ | $\epsilon^{-2} n^3$                            | $\epsilon > n^{-1}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$        | $\epsilon^{-1} n^4$                               | $n^{21/2}$ | $\epsilon^{-2} n^{7/2}$                        | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n)$, $F^2 = \Theta(n)$           | $\epsilon^{-1} n^{5/2}$                           | $n^{9/2}$  | $\epsilon^{-4} n^{3/2} + \epsilon^{-5/2} n^2$  | $\epsilon > n^{-1/3}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$         | $\epsilon^{-1} n^{7/2}$                           | $n^7$      | $\epsilon^{-2} n^3 + \epsilon^{-4} n^{3/2}$    | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$  | $\epsilon^{-1} n^{9/2}$                           | $n^{21/2}$ | $\epsilon^{-2} n^{7/2}$                        | $\epsilon > n^{-1}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$         | $\epsilon^{-1} n^{9/2}$                           | $n^{21/2}$ | $\epsilon^{-2} n^4$                            | $\epsilon > n^{-1/2}$
$S = \Theta(n^2)$, $m = \Theta(n)$, $F^2 = \Theta(n)$               | $\epsilon^{-1} n^3$                               | $n^{9/2}$  | $\epsilon^{-4} n^2$                            | $\epsilon > n^{-1/3}$
$S = \Theta(n^2)$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$             | $\epsilon^{-1} n^4$                               | $n^7$      | $\epsilon^{-2} n^3 + \epsilon^{-4} n^2$        | $\epsilon > n^{-2/3}$
$S = \Theta(n^2)$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$             | $\epsilon^{-1} n^5$                               | $n^{21/2}$ | $\epsilon^{-2} n^4$                            | $\epsilon > n^{-1}$

Table 1 Comparison of running times between the bound in (4), the running time of Interior Point Methods, and the bound stated in Theorem 1, for the eigenvalue optimization problem (2), in various regimes of $S$, $m$, $F^2$. In all settings the spectral norm of all matrices is assumed to be upper bounded by $1$. All bounds are stated without constants and without poly-logarithmic factors. The column titled "previous FOM" gives a lower bound on the expression in Equation (4). The column "IPM" gives the best running time of Interior Point methods for the chosen parameters. The column titled "this paper" gives an evaluation of the bound stated in Theorem 1 for the chosen $S$, $m$ and $F^2$; in order to further simplify the expressions in this column we assume that $\epsilon \geq n^{-1}$. Finally, the column titled "improvement" gives the range of the error tolerance $\epsilon$ for which the bound in Theorem 1 improves over the bound in (4).

Two more results that we apply in order to derive sublinear running times are a Bernstein-like concentration result for sums of random matrices presented in [22] and the Frank-Wolfe algorithm for smooth convex optimization over the spectrahedron presented in [15].
The rest of this paper is organized as follows. In section 2 we present our main algorithmic technique, algorithms for online convex optimization, and overview their application to saddle point optimization. In section 3 we detail further algorithmic tools that we apply in our sublinear time algorithms. In section 4 we describe our randomized sublinear algorithm for Problem (1) and prove Theorem 1. In section 5 we describe how to make the algorithm work with arbitrarily high probability while keeping the running time sublinear. In section 6 we present a sublinear algorithm for the non-affine problem (3) and prove Theorem 3. Finally, in section 7 we prove Theorem 2.

2 An Online Convex Optimization Approach to Saddle-Point Optimization

In this section we overview our main algorithmic machinery for solving problems (1) and (3): algorithms for online convex optimization and their application to saddle point optimization.

2.1 Online Convex Optimization

In the setting of online convex optimization (OCO) [27,16,23,7], a decision maker is iteratively required to choose a point $x_t$ in a fixed convex set $\mathcal{K}$. After the decision maker chooses $x_t$, a convex loss function $f_t : \mathcal{K} \to \mathbb{R}$ is revealed, the decision maker incurs a loss equal to the value of the function at the chosen point, $f_t(x_t)$, and the game proceeds to the next iteration. The emphasis in this setting is that the function $f_t$ revealed on each iteration of the game may be chosen completely arbitrarily, and even in an adversarial manner, given the previous choices of the decision maker and the previous loss functions. In this adversarial model the decision maker cannot hope to minimize his cumulative loss, $\sum_{t=1}^{T} f_t(x_t)$, as a function of his choices $x_1, \ldots, x_T$. Instead, the standard goal in this setting is to minimize the difference between the cumulative loss of the decision maker and that of an algorithm that has the benefit of hindsight but is only able to play a single fixed point in $\mathcal{K}$ on every iteration. This difference is known as the regret, and is formally given by

$$\mathrm{regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$

Usually one cannot guarantee zero (or negative) regret [7], and the goal is to design an efficient algorithm for the decision maker that guarantees a regret that grows sublinearly in the length of the game $T$, that is, $\mathrm{regret}_T / T \to 0$ as $T \to \infty$. Sometimes one wishes to maximize a cumulative gain instead of minimizing a cumulative loss. In this case we require that all functions $f_t$ are concave, and the regret is given by the difference between the cumulative gain of the best fixed point in $\mathcal{K}$ in hindsight and that of the decision maker. We conclude this subsection by presenting two algorithms for online convex optimization and their regret guarantees. In the next subsection we describe how these algorithms may be applied to saddle point optimization problems such as (1) and (3). The first algorithm we describe, known as the Multiplicative Weights algorithm [2], is an algorithm for online optimization over the simplex $\Delta_m$ with linear loss functions ($f_t(p) = q_t^\top p$).
Definition 1 (The Multiplicative Weights algorithm) Consider a sequence of linear losses $q_1, \ldots, q_T \in \mathbb{R}^m$ such that $\|q_t\|_\infty \leq 1$ for all $t$. The Multiplicative Weights (MW) algorithm is as follows. Let $0 < \eta \in \mathbb{R}$, $w_1 \leftarrow \mathbf{1}_m$, and for $t \geq 1$,

$$p_t \leftarrow w_t / \|w_t\|_1, \qquad w_{t+1}(i) \leftarrow w_t(i)(1 - \eta q_t(i)).$$

The following lemma bounds the regret of the above described MW algorithm. For a proof see [2].

Lemma 1 Assuming $T > \log m$, for the choice $\eta = \sqrt{\frac{\log m}{T}}$ the Multiplicative Weights algorithm satisfies

$$\mathrm{regret}_T(\mathrm{MW}) = \sum_{t=1}^{T} p_t^\top q_t - \min_{i \in [m]} \sum_{t=1}^{T} q_t(i) \leq 2\sqrt{T \log m}.$$

The second algorithm we present is an algorithm for online optimization over the set $\mathcal{S}$, that is, the set of unit-trace positive semidefinite matrices, with arbitrary concave functions. The algorithm is a variant of the more general meta-algorithm known as Regularized Follow The Leader (RFTL) [16,23].
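The update of Definition 1 can be sketched in a few lines of numpy (a minimal illustration; the helper name `multiplicative_weights` is ours, and losses are assumed to lie in $[-1, 1]$):

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Play the MW updates of Definition 1 on a given loss sequence.

    losses: list of loss vectors q_t with entries in [-1, 1].
    Returns the list of distributions p_t played."""
    m = len(losses[0])
    w = np.ones(m)                  # w_1 <- all-ones vector
    plays = []
    for q in losses:
        p = w / w.sum()             # p_t <- w_t / ||w_t||_1
        plays.append(p)
        w = w * (1.0 - eta * np.asarray(q))  # w_{t+1}(i) <- w_t(i)(1 - eta q_t(i))
    return plays

# Ten rounds in which coordinate 0 always suffers loss 1 and coordinate 1 loss 0:
plays = multiplicative_weights([[1.0, 0.0]] * 10, eta=0.1)
```

With $\eta = \sqrt{\log m / T}$, Lemma 1 bounds the regret by $2\sqrt{T \log m}$; in the toy run above the weight shifts toward the lossless coordinate.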

Definition 2 (The RFTL algorithm) Consider a sequence of concave functions $f_1, \ldots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t$ and all $X \in \mathcal{S}$. The RFTL algorithm is as follows. Let $X_1$ be an arbitrary point in $\mathcal{S}$, and for $t \geq 1$,

$$X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \left\{ \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2 \right\}. \qquad (5)$$

The following lemma bounds the regret of the above described RFTL algorithm. For a proof see [23].

Lemma 2 The Regularized Follow The Leader algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} f_t(X) - \sum_{t=1}^{T} f_t(X_t) \leq \sqrt{2T}.$$

2.2 Applying OCO algorithms to saddle point optimization

In this subsection we overview how online convex optimization algorithms may be applied to solve problems (1) and (3). We concentrate on presenting convergent algorithms without focusing on the running times; when we present our sublinear algorithms, we show how to derive sublinear time algorithms from the basic algorithms presented in this subsection. We approximate problems (1) and (3) by applying algorithms for OCO in two ways, which we term the dual method and the primal-dual method (see [14]).

2.2.1 The dual method

We apply the dual method to Problem (1), that is, we assume that all constraint functions $c_i(X)$ are affine: $c_i(X) = A_i \bullet X - b_i$, where $A_i \in \mathbb{R}^{n \times n}$ is symmetric and $b_i \in \mathbb{R}$. The basic algorithm is given below.

Algorithm 1 Dual Method
1: Input: $T$
2: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
3: for $t = 1$ to $T$ do
4:   $p_t(i) \leftarrow w_t(i) / \|w_t\|_1$
5:   $X_t \leftarrow \arg\max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X)$
6:   for $i = 1$ to $m$ do
7:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta c_i(X_t))$
8:   end for
9: end for
10: return $\bar{X} = \frac{1}{T}\sum_{t=1}^{T} X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^{T} p_t$

Since all functions $c_i$ are affine, $c_i(X) = A_i \bullet X - b_i$, the solution to the optimization step in line 5 of the algorithm is without loss of generality a rank-one matrix of the form $vv^\top$, where $v$ is an eigenvector corresponding to the largest eigenvalue of the matrix $\sum_{i=1}^{m} p_t(i) A_i$. Notice that Algorithm 1 applies the MW algorithm with respect to the sequence of linear losses $c_1, \ldots, c_T \in \mathbb{R}^m$, where $c_t(i) = c_i(X_t) = A_i \bullet X_t - b_i$.

Lemma 3 Given a tolerance $\epsilon > 0$, running Algorithm 1 for $T = 4\epsilon^{-2}\log m$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i (A_i \bullet X - b_i) - \epsilon.$$

Proof Denote by $c_t$ the vector in $\mathbb{R}^m$ such that $c_t(i) = c_i(X_t)$ for all $i \in [m]$. By the regret guarantee of the MW algorithm (Lemma 1) and the value of $T$ we have that

$$\frac{1}{T}\sum_{t=1}^{T} p_t^\top c_t - \min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^{T} p^\top c_t \leq \epsilon.$$

By the definition of $X_t$ we have that for all $t$, $\sum_{i=1}^{m} p_t(i) c_i(X_t) = \max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X)$, and thus

$$\min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^{T} p^\top c_t \geq \frac{1}{T}\sum_{t=1}^{T} \max_{X \in \mathcal{S}} \sum_{i=1}^{m} p_t(i) c_i(X) - \epsilon \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \epsilon.$$

Since $c_t(i) = c_i(X_t)$ for all $i$ and each $c_i(X)$ is affine, we have that

$$\frac{1}{T}\sum_{t=1}^{T} p^\top c_t = \sum_{i=1}^{m} p_i \cdot \frac{1}{T}\sum_{t=1}^{T} c_i(X_t) = \sum_{i=1}^{m} p_i c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right),$$

and thus we conclude that

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \epsilon,$$

which yields the lemma.
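For intuition, here is a minimal, exact (non-sublinear) numpy rendering of Algorithm 1 on a hypothetical toy instance; the best response in line 5 is computed as a top eigenvector, exactly as in the rank-one observation above (the function name `dual_method` is ours):

```python
import numpy as np

def dual_method(As, bs, T):
    """Exact version of Algorithm 1 (no sampling): MW on the dual distribution,
    rank-one best responses X_t = v v^T on the primal side."""
    m, n = len(As), As[0].shape[0]
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)
    X_bar, p_bar = np.zeros((n, n)), np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        S_t = sum(p_i * A_i for p_i, A_i in zip(p, As))
        v = np.linalg.eigh(S_t)[1][:, -1]   # leading eigenvector of sum_i p_t(i) A_i
        X_t = np.outer(v, v)
        c = np.array([np.trace(A_i @ X_t) - b_i for A_i, b_i in zip(As, bs)])
        w = w * (1.0 - eta * c)             # MW update with losses c_t(i) = c_i(X_t)
        X_bar += X_t / T
        p_bar += p / T
    return X_bar, p_bar

# Toy instance with saddle value 0, attained at X = I/2:
As = [np.diag([1.0, -1.0]), np.diag([-1.0, 1.0])]
X_bar, p_bar = dual_method(As, [0.0, 0.0], T=400)
```

On this instance the primal iterates oscillate between the two rank-one vertices, and the average $\bar{X}$ approaches $I/2$, as Lemma 3 predicts.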

2.2.2 The primal-dual method

We apply the primal-dual method to Problem (3), that is, we assume that all constraint functions $c_i(X)$ are concave and differentiable everywhere in $\mathcal{S}$.³ The basic algorithm is given below.

Algorithm 2 Primal-Dual Method
1: Input: $T$
2: Let $X_1$ be an arbitrary point in $\mathcal{S}$
3: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
4: for $t = 1$ to $T$ do
5:   $p_t(i) \leftarrow w_t(i) / \|w_t\|_1$
6:   $X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \left\{ \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \sum_{i=1}^{m} p_\tau(i) \nabla c_i(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2 \right\}$
7:   for $i = 1$ to $m$ do
8:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta c_i(X_t))$
9:   end for
10: end for
11: return $\bar{X} = \frac{1}{T}\sum_{t=1}^{T} X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^{T} p_t$

Notice that, as in the dual method, Algorithm 2 applies the MW algorithm with respect to the sequence of linear losses $c_1, \ldots, c_T$, where $c_t(i) = c_i(X_t)$, and the RFTL algorithm with respect to the sequence of concave functions $f_1(X), \ldots, f_T(X)$, where $f_t(X) = \sum_{i=1}^{m} p_t(i) c_i(X)$.

Lemma 4 Given a tolerance $\epsilon > 0$, running Algorithm 2 for $T = \epsilon^{-2}(\sqrt{2} + 2\sqrt{\log m})^2$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \bar{p}_i c_i(X) - \epsilon.$$

Proof By the regret guarantee of the MW algorithm (Lemma 1) we have that

$$\sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X_t) - \min_{p \in \Delta_m} \sum_{t=1}^{T} \sum_{i=1}^{m} p_i c_i(X_t) \leq 2\sqrt{T \log m}.$$

By the regret guarantee of the RFTL algorithm (Lemma 2) we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X) - \sum_{t=1}^{T} \sum_{i=1}^{m} p_t(i) c_i(X_t) \leq \sqrt{2T}.$$

Adding the above two inequalities and rearranging, we have that

$$\min_{p \in \Delta_m} \sum_{i=1}^{m} p_i \left( \frac{1}{T}\sum_{t=1}^{T} c_i(X_t) \right) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^{m} \left( \frac{1}{T}\sum_{t=1}^{T} p_t(i) \right) c_i(X) - \frac{\sqrt{2T} + 2\sqrt{T \log m}}{T}.$$

The lemma follows by plugging in our choice of $T$ and since, by the concavity of the functions $c_1, \ldots, c_m$, we have for all $i \in [m]$ that $\frac{1}{T}\sum_{t=1}^{T} c_i(X_t) \leq c_i\left( \frac{1}{T}\sum_{t=1}^{T} X_t \right)$.

³ As stated before, it suffices to assume that each $c_i$ has a supergradient everywhere in $\mathcal{S}$.

3 More Algorithmic Techniques

In this section we detail some algorithmic techniques that we apply in order to derive sublinear time algorithms from the high-level algorithms presented in section 2. These techniques include fast random estimation procedures and the Frank-Wolfe method for smooth optimization over the set of unit-trace positive semidefinite matrices.

3.1 Random Estimation Techniques

A key step in deriving our sublinear time algorithms is replacing exact computations with random estimations. We use random sampling for the following two problems.

3.1.1 Estimating matrix dot products with constant variance

Let $A, X \in \mathbb{R}^{n \times n}$ be such that $\|A\|_F \leq F$ (for ease of presentation, and without losing any generality, we assume that $F \in \mathbb{N}$) and $\|X\|_F \leq 1$, and assume we are interested in estimating the dot product $A \bullet X$ with constant variance in $O(F^2 \log n)$ time (not including a preprocessing step to construct the sampling distribution). A way to achieve this goal is by sampling entries from the matrix $A$ according to the $\ell_2$ norm of $X$. Consider picking a random entry $(i,j) \in [n] \times [n]$ with probability $\frac{X(i,j)^2}{\|X\|_F^2}$ and estimating the dot product by

$$\tilde{v} \leftarrow A(i,j)\, \frac{\|X\|_F^2}{X(i,j)}. \qquad (6)$$

Observe that

$$\mathbb{E}_{i,j}[\tilde{v}] = \sum_{i,j} \frac{X(i,j)^2}{\|X\|_F^2} \cdot A(i,j)\, \frac{\|X\|_F^2}{X(i,j)} = \sum_{i,j} A(i,j) X(i,j) = A \bullet X.$$

Thus $\tilde{v}$ is an unbiased estimator of the dot product $A \bullet X$. Moreover,

$$\mathrm{Var}[\tilde{v}] \leq \mathbb{E}_{i,j}[\tilde{v}^2] = \sum_{i,j} \frac{X(i,j)^2}{\|X\|_F^2} \cdot A(i,j)^2\, \frac{\|X\|_F^4}{X(i,j)^2} = \|X\|_F^2 \sum_{i,j} A(i,j)^2 \leq \sum_{i,j} A(i,j)^2 = \|A\|_F^2 \leq F^2.$$

In order to obtain an unbiased estimator with constant variance (say, unit variance) one can simply average $F^2$ i.i.d. estimators as in (6); that is, sample $F^2$ entries $(i_1, j_1), \ldots, (i_{F^2}, j_{F^2})$ i.i.d. according to the $\ell_2$ norm of $X$ and take the estimator

$$\tilde{v} \leftarrow \frac{1}{F^2} \sum_{r=1}^{F^2} A(i_r, j_r)\, \frac{\|X\|_F^2}{X(i_r, j_r)}. \qquad (7)$$

Setting up the $\ell_2$ sampling distribution takes preprocessing time that is linear in the number of non-zeros in $X$. Thus, given $m$ matrices $A_1, \ldots, A_m$ with Frobenius norm at most $F$, it is possible to estimate all dot products $A_i \bullet X$, $i \in [m]$, with unit variance in overall $O(\|X\|_0 + F^2(\log n + m))$ time (in particular, note that we need to sample only one set of $F^2$ entries, and not one such set per matrix $A_i$).

Although the estimator $\tilde{v}$ defined in (7) has unit variance, it may be unbounded, which does not allow us to apply standard concentration results. To tackle this problem we use a clipping procedure: we define $v = \mathrm{clip}(\tilde{v}, C) = \max\{-C, \min\{\tilde{v}, C\}\}$ and use $v$ instead of $\tilde{v}$. The new estimator $v$ is no longer unbiased (but it is bounded, and its variance is still at most $1$). The following lemma, which we prove in the appendix, ties the resulting bias to the variance.

Lemma 5 Let $X$ be a random variable with finite variance and let $\bar{X} = \mathrm{clip}(X, C) = \min\{C, \max\{-C, X\}\}$, where $C$ is a constant satisfying $|\mathbb{E}[X]| \leq C/2$. Then,

$$\left| \mathbb{E}[\bar{X}] - \mathbb{E}[X] \right| \leq \frac{2}{C}\, \mathrm{Var}[X].$$

Note that when we apply Lemma 5 to the dot-product estimator $\tilde{v}$ described above, choosing $C$ to be roughly $1/\epsilon$, we get from Lemma 5 that the resulting bias of the estimator $v$ is only $O(\epsilon)$, while $v$ is bounded by $O(1/\epsilon)$. There is a slight issue: choosing $C = \Theta(1/\epsilon)$ imposes an upper bound on $\epsilon$ because of the requirement in Lemma 5 that $|\mathbb{E}[X]| \leq C/2$.
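The estimator (7) and the clipping step can be sketched as follows (a minimal numpy illustration; the helper names are ours):

```python
import numpy as np

def clip(v, C):
    """clip(v, C) = max{-C, min{v, C}}, the bounding step used on the estimator."""
    return max(-C, min(v, C))

def l2_sample_dot(A, X, num_samples, rng):
    """Estimator (7): sample entries (i, j) w.p. X(i,j)^2 / ||X||_F^2 and average
    A(i,j) * ||X||_F^2 / X(i,j). Unbiased for A . X, with variance at most
    ||A||_F^2 when ||X||_F <= 1."""
    fro2 = (X ** 2).sum()
    probs = (X ** 2).ravel() / fro2
    idx = rng.choice(X.size, size=num_samples, p=probs)
    i, j = np.unravel_index(idx, X.shape)
    return np.mean(A[i, j] * fro2 / X[i, j])
```

As a quick sanity check, when $X$ has a single non-zero entry the sampling distribution is degenerate and the estimator returns $A \bullet X$ exactly.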
However, in all of the results presented in this paper, this requirement merely translates to the restriction $\epsilon \leq 1$, which is the only case of interest under our assumption of bounded affine/non-affine functions.

3.1.2 Spectral approximation of matrices

Recall that in the dual method (Algorithm 1) we need, on each iteration, to compute the leading eigenvector of a matrix $A$ given by $A = \sum_{i=1}^{m} p_t(i) A_i$, where $p_t \in \Delta_m$ and each $A_i \in \mathbb{R}^{n \times n}$ is symmetric. Computing $A$ takes $O(mS)$ time, where $S$ is an upper bound on the number of non-zeros in a single matrix. Instead of computing $A$, we will be interested in finding a matrix $\tilde{A}$ which is the average of only a few of the matrices $A_1, \ldots, A_m$, such that with probability at least $1 - \delta$,

$$\|\tilde{A} - A\|_2 \leq \epsilon, \qquad (8)$$

for some error tolerance $\epsilon$ and failure probability $\delta$. Note that if $\tilde{X}$ satisfies $\tilde{A} \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} \tilde{A} \bullet X - \epsilon$, then $A \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} A \bullet X - O(\epsilon)$, and thus we can find an approximation of the leading eigenvector of $A$ by approximating the leading eigenvector of $\tilde{A}$. In order to derive an approximation of $A$ in the sense of (8) we rely on the following Bernstein-like inequality for matrices (see [25], Theorem 1.4).

Lemma 6 Let $M_1, \ldots, M_k$ be independent zero-mean random symmetric matrices of dimension $n \times n$. Suppose that $\|M_i\|_2 \leq \gamma$ for all $i \in [k]$, and denote $\rho^2 = \left\| \sum_{i=1}^{k} \mathbb{E}[M_i^2] \right\|_2$. Then for any $\epsilon > 0$ it holds that

$$\mathbb{P}\left( \left\| \sum_{i=1}^{k} M_i \right\|_2 > \epsilon \right) \leq n \exp\left( \frac{-\epsilon^2/2}{\rho^2 + \gamma\epsilon/3} \right).$$

The next lemma follows directly from Lemma 6 and is tailored to our needs.

Lemma 7 Fix $\epsilon \in (0,1]$. Let $p \in \Delta_m$ be a distribution over $m$ elements and let $A_1, \ldots, A_m$ be symmetric matrices in $\mathbb{R}^{n \times n}$ such that $\|A_i\|_2 \leq 1$ for all $i \in [m]$. Let $j_1, \ldots, j_k$ be $k$ i.i.d. samples of indices from $[m]$ according to the distribution $p$ (that is, $j_l = i$ w.p. $p_i$). Then it holds that

$$\mathbb{P}\left( \left\| \frac{1}{k} \sum_{l=1}^{k} A_{j_l} - \sum_{i=1}^{m} p_i A_i \right\|_2 > \epsilon \right) \leq n \exp\left( -\frac{k\epsilon^2}{10} \right).$$
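The sampling scheme of Lemma 7 amounts to a few lines (a sketch; the name `sample_average` is ours):

```python
import numpy as np

def sample_average(p, As, k, rng):
    """Approximate sum_i p_i A_i in spectral norm by averaging k i.i.d. draws
    from the distribution p, as in Lemma 7."""
    idx = rng.choice(len(As), size=k, p=np.asarray(p))
    return sum(As[i] for i in idx) / k
```

By Lemma 7, $k = O(\epsilon^{-2} \log(n/\delta))$ samples already drive the failure probability below $\delta$, independently of $m$; this is what replaces the $O(mS)$-time exact computation of $A$.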
The algorithm performs iterations, where each iteration $t$ is comprised of solving the linear problem

$$V_t \leftarrow \arg\max_{V \in \mathcal{S}} V \bullet \nabla f(X_t), \qquad (9)$$

where $X_t$ is the current iterate. The algorithm then produces the next iterate by taking the convex combination $X_{t+1} \leftarrow X_t + \alpha_t (V_t - X_t)$, where $\alpha_t$ is roughly $1/t$. Note that the solution to (9) is without loss of generality just a leading eigenvector computation, which can be approximated in nearly linear time using the Lanczos method [18]. The Frank-Wolfe algorithm guarantees that after roughly $O(\epsilon^{-1})$ iterations, each of which is comprised of a single eigenvector computation, the final iterate $X_{sol}$ will satisfy

$$\max_{X \in \mathcal{S}} f(X) - f(X_{sol}) \leq \epsilon.$$

We now define the approximated RFTL algorithm and tie its regret bound to the approximation quality of Problem (5).

Definition 3 (Approximated RFTL algorithm) Consider a sequence of concave functions $f_1, \ldots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t \in [T]$ and $X \in \mathcal{S}$. The approximated RFTL algorithm is as follows. Let $\epsilon > 0$ and let $X_1$ be an arbitrary point in $\mathcal{S}$. For each time $t \geq 1$ define the function

$$F_t(X) = \frac{1}{\sqrt{2T}} \sum_{\tau=1}^{t} \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$$

and choose a point $X_{t+1} \in \mathcal{S}$ that satisfies:

$$\max_{X \in \mathcal{S}} F_t(X) - F_t(X_{t+1}) \leq \epsilon.$$

Lemma 8 The approximated RFTL algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} f_t(X) - \sum_{t=1}^{T} f_t(X_t) \leq \sqrt{2T} + T\sqrt{\epsilon}.$$

Proof Consider the sequence of linear functions $\hat{f}_t(X) = \nabla f_t(X_t) \bullet X$, $t \in [T]$, and denote by $\hat{X}_t$, $t \in [T]$, the sequence of points played by the original RFTL algorithm, as formulated in Equation (5), with respect to the sequence of functions $\hat{f}_1, \ldots, \hat{f}_T$. By Lemma 2 we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X - \sum_{t=1}^{T} \nabla f_t(X_t) \bullet \hat{X}_t \leq \sqrt{2T}. \qquad (10)$$

Note that $F_t(X)$ is 1-strongly-concave with respect to the norm $\|\cdot\|_F$. Thus, since by the definition of $X_{t+1}$ we have $F_t(\hat{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon$, it holds for all $t$ that

$$\|\hat{X}_{t+1} - X_{t+1}\|_F^2 \leq F_t(\hat{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon.$$

Using the Cauchy-Schwarz inequality we have that

$$\sum_{t=1}^{T} \nabla f_t(X_t) \bullet \hat{X}_t = \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + \sum_{t=1}^{T} \nabla f_t(X_t) \bullet (\hat{X}_t - X_t) \leq \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + \sum_{t=1}^{T} \|\nabla f_t(X_t)\|_F \|\hat{X}_t - X_t\|_F \leq \sum_{t=1}^{T} \nabla f_t(X_t) \bullet X_t + T\sqrt{\epsilon}. \qquad (11)$$
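The Frank-Wolfe iteration just described can be sketched as follows, on a hypothetical smooth concave test function $f(X) = -\frac{1}{2}\|X - M\|_F^2$ (whose gradient is $M - X$); the exact `eigh` call stands in for the Lanczos step, and the helper name is ours:

```python
import numpy as np

def frank_wolfe_spectrahedron(grad, X0, num_iters):
    """Frank-Wolfe over S = {X >= 0, Tr X = 1}: the linear problem (9) is solved
    by a leading-eigenvector computation, and iterates are convex combinations."""
    X = X0
    for t in range(1, num_iters + 1):
        G = grad(X)
        v = np.linalg.eigh(G)[1][:, -1]    # V_t = v v^T for a top eigenvector of grad
        V = np.outer(v, v)
        X = X + (2.0 / (t + 1)) * (V - X)  # step size alpha_t, roughly 1/t
    return X

# Target M = I/2 lies in S; Frank-Wolfe drives X toward it at rate O(1/t).
M = np.eye(2) / 2
X_sol = frank_wolfe_spectrahedron(lambda X: M - X, np.diag([1.0, 0.0]), 200)
```

Note that each iterate is a convex combination of rank-one vertices of $\mathcal{S}$, so the trace constraint is preserved exactly throughout.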

15 Sublinear Time Algorithms for Approximate Semidefinite Programming 15 Plugging (11) into (10) we have that max T f t (X t ) X T f t (X t ) X t 2T + T ɛ. Finally using the concavity of the functions f t we have that max T f t (X) T f t (X t ) 2T + T ɛ. Note that the objective in (5) is indeed concave and smooth as required by the Frank Wolfe algorithm in [15]. The following lemma states the computation time of the update step of the approximated RFTL algorithm described in definition 3 using the Frank-Wolfe algorithm [15]. For a proof see [15]. Lemma 9 Fix ɛ > 0, t [T ] and let F t (X) be as in definition 3. The Frank-Wolfe algorithm( in [15] computes a matrix X S such that max Y S F t (Y ) F t (X) ɛ n in time O 2 t ɛ ) Sublinear Time Algorithm for the Affine Saddle-Point Problem In this section we present our main algorithmic result, a sublinear time algorithm for approximately solving (1). As stated in Theorem 1, throughout this section we assume that there exists parameters F, S such that for any constraint i [m] it holds that A i 2 1, b i 1, A i F F and the matrix A i contains at most S non-zero entries. The algorithm is based on Algorithm 1 and the random acceleration techniques described in section 3. We replace the MW algorithm applied in Algorithm 1, with the following variant from [8] which is suitable for the case in which the linear losses are random variables with bounded variance but do not necessarily have constant l norm (line 13 of the algorithm). Definition 4 (Variance MW) Consider a sequence of linear losses q 1,..., q T R m. The MW algorithm is as follows. Let 0 < η R, w 1 1 m, and for t 1, p t w t / w t 1, w t+1 w t (i)(1 ηq t (i) + η 2 q t (i) 2 ). The following lemma bounds the regret of the above described MW algorithm. For a proof see [8]. Lemma 10 Let η be such that η < min, 1/q t (i). 
Then the multiplicative updates algorithm satisfies:
$$\sum_{t=1}^T p_t^\top q_t \le \min_{i\in[m]} \sum_{t=1}^T \max\left\{q_t(i), -\frac{1}{\eta}\right\} + \frac{\log m}{\eta} + \eta \sum_{t=1}^T p_t^\top q_t^2,$$
where we have denoted by $q_t^2$ the vector whose $i$th entry is $q_t(i)^2$.
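The Variance MW update of Definition 4 can be sketched in a few lines; the function name and interface below are illustrative, not from the paper:

```python
import numpy as np

def variance_mw(losses, eta):
    """Sketch of the Variance MW update from Definition 4:
    p_t = w_t / ||w_t||_1,  w_{t+1}(i) = w_t(i) * (1 - eta*q_t(i) + eta^2*q_t(i)^2).
    `losses` is a (T, m) array holding the loss vectors q_1, ..., q_T."""
    m = losses.shape[1]
    w = np.ones(m)
    plays = []
    for q in losses:
        p = w / w.sum()          # current distribution over the m coordinates
        plays.append(p.copy())
        w = w * (1.0 - eta * q + (eta * q) ** 2)
    return np.array(plays)

# Toy run: coordinate 0 always suffers zero loss, coordinate 1 loss 1;
# the distribution should concentrate on coordinate 0.
q = np.zeros((200, 2))
q[:, 1] = 1.0
plays = variance_mw(q, eta=0.1)
```

Note the second-order term $\eta^2 q_t(i)^2$: unlike the standard exponential update, this multiplier stays positive and well behaved when the losses are unbounded random estimates with bounded variance, which is exactly the setting of Lemma 10.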

Instead of feeding the MW algorithm with the exact values $c_i(X_t) = A_i \bullet X_t - b_i$ as done in Algorithm 1, we only use estimates of these values, obtained via the $\ell_2$-sampling technique described in Section 3 (lines 9-14). Also, instead of performing exact leading eigenvector computations with respect to the matrices $S_t = \sum_{i=1}^m p_t(i) A_i$ as suggested in Algorithm 1, we approximate the matrix $S_t$ by the matrix $M_t$ using roughly $O(\epsilon^{-2})$ samples from $A_1,\dots,A_m$ (lines 5-7), as explained in Subsection 3.1.2, and then compute an approximated leading eigenvector of $M_t$ using the randomized Lanczos method, up to additive error $\epsilon$ and with failure probability $\delta$ (line 8).

Algorithm 3
1: Input: error tolerance $\epsilon \in (0,1]$, an instance of Problem (1) and a parameter $F \in \mathbb{N}$ that satisfy the requirements of Theorem 1.
2: Let $\epsilon' \leftarrow \epsilon/6$, $T \leftarrow 60^2\epsilon^{-2}\log m$, $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{4\log m}{3T}}$, $k \leftarrow 10\,\epsilon'^{-2}\log(20nT)$, $\delta \leftarrow \frac{1}{20T}$
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
5:   Let $i_1,\dots,i_k$ be $k$ indices from $[m]$ sampled i.i.d. according to $p_t$
6:   $\tilde{p}_t \leftarrow \frac{1}{k}\sum_{l=1}^k e_{i_l}$
7:   $M_t \leftarrow \sum_{i=1}^m \tilde{p}_t(i) A_i$
8:   $x_t \leftarrow \mathrm{EV}(M_t, \epsilon', \delta)$
9:   Sample $F^2$ entries $(j_t^1, l_t^1),\dots,(j_t^{F^2}, l_t^{F^2}) \in [n]\times[n]$ by $(j_t^r, l_t^r) \leftarrow (j,l)$ w.p. $x_t(j)^2 x_t(l)^2$
10:  for $i \in [m]$ do
11:    $\tilde{v}_t(i) \leftarrow \frac{1}{F^2}\sum_{r=1}^{F^2} A_i(j_t^r, l_t^r)/\left(x_t(j_t^r)\, x_t(l_t^r)\right) - b_i$
12:    $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
13:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
14:  end for
15: end for
16: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T x_t x_t^\top$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$

We turn to analyze Algorithm 3. We begin by introducing some auxiliary lemmas that will be used in the proof of Theorem 1. The following lemma gives guarantees on the eigenvector computations carried out in line 8 of the algorithm. The proof relies on the Lanczos method for computing the largest eigenvalue of a positive semidefinite matrix, and is deferred to the appendix.
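Before stating the lemmas, here is a sketch of the $\ell_2$-sampling estimate used in lines 9-14 (the function name and interface are illustrative): for a unit vector $x$, drawing $(j,l)$ with probability $x(j)^2 x(l)^2$ and averaging $A(j,l)/(x(j)x(l))$ yields an unbiased estimate of $A \bullet xx^\top = x^\top A x$.

```python
import numpy as np

def l2_sample_estimate(A, x, b, num_samples, rng):
    """l2-sampling estimator sketch: draw (j, l) with probability
    x[j]^2 * x[l]^2 and average A[j, l] / (x[j] * x[l]).
    The average is an unbiased estimate of x^T A x (the raw estimator has
    second moment ||A||_F^2); subtracting b estimates A . xx^T - b."""
    n = len(x)
    probs = x ** 2  # a valid distribution since ||x||_2 = 1
    js = rng.choice(n, size=num_samples, p=probs)
    ls = rng.choice(n, size=num_samples, p=probs)
    return np.mean(A[js, ls] / (x[js] * x[ls])) - b

# Toy check against the exact value x^T A x
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = np.ones(4) / 2.0
est = l2_sample_estimate(A, x, b=0.0, num_samples=200000, rng=rng)
```

This is why the bounds throughout this section depend on $F \ge \|A_i\|_F$: the second moment of a single sample scales with $\|A_i\|_F^2$, so $F^2$ samples suffice for a constant-accuracy estimate.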
Lemma 11 Given a matrix $M$ with $N$ non-zero entries such that $\|M\|_2 = O(1)$, and parameters $\epsilon, \delta > 0$, the eigenvector computation in line 8 can be implemented using the Lanczos method to run in time $O\!\left(\frac{N}{\sqrt{\epsilon}}\log\frac{n}{\delta}\right)$, such that the returned vector $x$ satisfies with probability at least $1-\delta$: $x^\top M x \ge \lambda_{\max}(M) - \epsilon$.

The following lemma gives concentration bounds on our random variables from their expectations. The proof is deferred to the appendix.

Lemma 12 For $1/4 \ge \eta \ge \sqrt{\frac{4\log m}{3T}}$, it holds with probability at least $1 - O(1/m)$ that

1. $\max_{i\in[m]} \sum_{t=1}^T \left[v_t(i) - (A_i \bullet x_t x_t^\top - b_i)\right] \le 5\eta T$,
2. $\left|\sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet x_t x_t^\top - b_i) - \sum_{t=1}^T p_t^\top v_t\right| \le 4\eta T$.

The following lemma is a simple application of Markov's inequality. The proof is deferred to the appendix.

Lemma 13 With probability at least $3/4$ it holds that $\sum_{t=1}^T p_t^\top v_t^2 \le 16T$.

Finally, the following lemma is a simple application of Hoeffding's inequality.

Lemma 14 For any $t \in [T]$ and $\epsilon > 0$ it holds that
$$\mathbb{P}\left(\left|\sum_{i=1}^m \tilde{p}_t(i) b_i - \sum_{i=1}^m p_t(i) b_i\right| > \epsilon\right) \le 2\exp\left(-2k\epsilon^2\right).$$

We can now prove Theorem 1.

Proof We assume that the guarantees of Lemmas 12 and 13 hold, and that for all $t \in [T]$ the guarantees of Lemmas 7, 11 and 14 hold. Using the union bound and our choice of $k, \delta$, these guarantees all hold simultaneously with probability at least $1/2$. From Lemma 10 and the use of the clipping procedure in line 12 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T v_t(i) + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 12 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + 5\eta T + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Plugging in Lemma 13 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + 5\eta T + \frac{\log m}{\eta} + 16\eta T.$$
Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 12 we have that

$$\sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet x_t x_t^\top - b_i) - 4\eta T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 21\eta T.$$
By Lemmas 7 and 14 we can replace $p_t(i)$ with $\tilde{p}_t(i)$ and get:
$$\sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet x_t x_t^\top - b_i) - 2\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
By the guarantee of Lemma 11 on the eigenvector computation in line 8 we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet X - b_i) - 3\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
It holds that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m \tilde{p}_t(i)(A_i \bullet X - b_i) \ge \max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet X - b_i) = T\,\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}(i)(A_i \bullet X - b_i).$$
Thus we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m p_t(i)(A_i \bullet X - b_i) - 3\epsilon' T \le \min_{i\in[m]} \sum_{t=1}^T (A_i \bullet x_t x_t^\top - b_i) + \frac{\log m}{\eta} + 25\eta T.$$
Dividing through by $T$, rearranging and plugging in our choices for $T, \eta$, we get:
$$\max_{X\in\mathcal{S}} \sum_{i=1}^m \left(\frac{1}{T}\sum_{t=1}^T p_t(i)\right)(A_i \bullet X - b_i) - 3\epsilon' - \epsilon/2 \le \min_{i\in[m]}\, A_i \bullet \left(\frac{1}{T}\sum_{t=1}^T x_t x_t^\top\right) - b_i.$$
Plugging in our choice for $\epsilon'$ yields the approximation guarantee stated in the theorem.

We now turn to discuss the arithmetic complexity of the algorithm. The algorithm performs $O(\epsilon^{-2}\log m)$ iterations, each comprised of three major steps: (i) averaging $O(\epsilon^{-2}\log n)$ matrices sampled i.i.d. according to the distribution $p_t$, (ii) computing an approximated eigenvector, and (iii) updating the weights vector $w_t$. Step (i) is implemented trivially with $O(S\epsilon^{-2}\log n)$ arithmetic operations. Step (ii), according to Lemma 11, can be carried out with $O(\|M_t\|_0\,\epsilon^{-0.5}\log n)$ arithmetic operations, where $\|M_t\|_0$ is the number of non-zero entries in the matrix $M_t$.

This number is trivially bounded by $\min\{S\epsilon^{-2}\log n,\, n^2\}$. Note also that since $M_t$ is the average of matrices with spectral norm at most 1, it holds that $\|M_t\|_2 \le 1$, as required in Lemma 11. Step (iii) requires $O(\|x_t\|_0) \le O(\|M_t\|_0)$ preprocessing time to generate the $\ell_2$-distribution and additional $O(F^2(\log n + m))$ time to generate and process the samples. Summing all contributions yields the total arithmetic complexity. Finally, the conclusion regarding the rank of the returned solution $\bar{X}$ follows from the observation that $\bar{X}$ is the average of $O(\epsilon^{-2}\log m)$ rank-one matrices.

5 Increasing the Probability of Success

In this section we derive high-probability results for Algorithm 3. Our results are based on showing that, given a pair of primal-dual solutions generated by Algorithm 3, it is possible to verify their quality with high probability and in sublinear time. Our algorithm for verifying a candidate primal solution is given below. The algorithm assumes that the candidate $\bar{X}$ is of the form prescribed in Algorithm 3, that is $\bar{X} = \frac{1}{T}\sum_{t=1}^T x_t x_t^\top$, and is available implicitly as the unit vectors $x_1, x_2,\dots,x_T \in \mathbb{R}^n$.

Algorithm 4
1: Input: instance of Problem (1), candidate solution $\bar{X}$ given by unit vectors $x_1,\dots,x_T \in \mathbb{R}^n$, $F \in \mathbb{N}$ s.t. $\forall i \in [m]: \|A_i\|_F \le F$, error tolerance $\epsilon > 0$, failure probability $\delta$.
2: $k \leftarrow \frac{16\log(2m/\delta)}{\epsilon^2}$
3: for $s = 1$ to $k$ do
4:   Sample an index $t_s \in [T]$ uniformly at random
5:   Sample $F^2$ entries $(j_1, l_1),\dots,(j_{F^2}, l_{F^2}) \in [n]\times[n]$ i.i.d. by $(j_r, l_r) \leftarrow (j, l)$ w.p. $x_{t_s}(j)^2 x_{t_s}(l)^2$
6:   for $i = 1$ to $m$ do
7:     $\tilde{v}_i(s) \leftarrow \frac{1}{F^2}\sum_{r=1}^{F^2} \frac{A_i(j_r, l_r)}{x_{t_s}(j_r)\, x_{t_s}(l_r)}$
8:     $v_i(s) \leftarrow \mathrm{clip}(\tilde{v}_i(s), 4/\epsilon)$
9:   end for
10: end for
11: $\forall i \in [m]$: $v_i \leftarrow \frac{1}{k}\sum_{s=1}^k v_i(s)$
12: return $v_1,\dots,v_m$.

Lemma 15 Algorithm 4 runs in time $O\!\left(\frac{\log(m/\delta)}{\epsilon^2}\left(\min\{S\epsilon^{-2}\log n,\, n\} + mF^2\right)\right)$ and returns values $v_1,\dots,v_m$ such that with probability at least $1-\delta$ it holds that $\forall i \in [m]: |A_i \bullet \bar{X} - v_i| \le \epsilon$.
Proof As in the analysis of Algorithm 3, the $\ell_2$-sampling in lines 5-9 guarantees for all $s \in [k]$ and $i \in [m]$ that the conditional expectation satisfies $\mathbb{E}[\tilde{v}_i(s) \mid t_s] = A_i \bullet x_{t_s} x_{t_s}^\top$ and that the conditional second moment satisfies $\mathbb{E}[\tilde{v}_i(s)^2 \mid t_s] \le 1$. Since $t_s$ is chosen uniformly at random we have that
$$\mathbb{E}[\tilde{v}_i(s)] = \frac{1}{T}\sum_{t=1}^T A_i \bullet x_t x_t^\top = A_i \bullet \bar{X}.$$
Also it holds that $\mathbb{E}[\tilde{v}_i(s)^2] \le 1$. Thus for all $i \in [m]$ and $s \in [k]$ it holds that $\mathbb{E}[v_i(s)^2] \le \mathbb{E}[\tilde{v}_i(s)^2] \le 1$ and $|v_i(s)| \le 4/\epsilon$. By Bernstein's inequality we have that

$$\mathbb{P}\left(|v_i - \mathbb{E}[v_i]| \ge \epsilon/2\right) \le 2\exp\left(-\frac{k^2\epsilon^2/8}{k + \frac{4}{\epsilon}\cdot\frac{k\epsilon/2}{3}}\right) \le 2\exp\left(-\frac{k\epsilon^2}{16}\right).$$
Using the properties of the clipping procedure (see Lemma 5), and since $\mathbb{E}[\tilde{v}_i(s)] = A_i \bullet \bar{X}$, we have that with probability at least $1 - 2\exp(-k\epsilon^2/16)$:
$$|v_i - A_i \bullet \bar{X}| \le \left|v_i - \frac{1}{k}\sum_{s=1}^k \mathbb{E}[v_i(s)]\right| + \frac{1}{k}\sum_{s=1}^k \left|\mathbb{E}[v_i(s)] - \mathbb{E}[\tilde{v}_i(s)]\right| \le \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$
Using the union bound, the last inequality holds for all $i \in [m]$ simultaneously with probability at least $1 - 2m\exp(-k\epsilon^2/16)$. Choosing $k = \frac{16\log(2m/\delta)}{\epsilon^2}$ guarantees that all inequalities hold with probability at least $1-\delta$.

The running time is due to the number of iterations $k$, each comprised of $O(\|x_{t_s}\|_0)$ preprocessing time to set up the $\ell_2$-sampling distribution from the vector $x_{t_s}$ and additional $O(mF^2)$ time to process the samples. As detailed in the analysis of Algorithm 3, it holds that $\|x_{t_s}\|_0 \le \min\{S\epsilon^{-2}\log n,\, n\}$.

Lemma 16 Given parameters $\epsilon, \delta > 0$, Algorithm 3 can be implemented to run in time
$$O\left(\log\frac{1}{\delta}\left(T_{ALG}(\epsilon) + T_{VER}(\epsilon, \delta) + \min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}\,\epsilon^{-0.5}\log\frac{n}{\delta}\right)\right),$$
where $T_{ALG}(\epsilon)$ denotes the running time of Algorithm 3 with tolerance $\epsilon$ and $T_{VER}(\epsilon, \delta)$ denotes the running time of the primal verification procedure given in Algorithm 4 with tolerance $\epsilon$ and failure probability $\delta$, such that with probability at least $1-\delta$ the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:
$$\min_{p\in\Delta_m} \sum_{i=1}^m p_i(A_i \bullet \bar{X} - b_i) \ge \max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i(A_i \bullet X - b_i) - \epsilon.$$

Proof We run Algorithm 3 $N = O(\log(1/\delta))$ times with error parameter $\epsilon/4$. This guarantees that with probability at least $1-\delta/3$ one of the returned primal-dual pairs $(\bar{X}_j, \bar{p}_j)$ satisfies Theorem 1 with error at most $\epsilon/4$. We can verify all primal solutions up to error $\epsilon/4$ and with failure probability $\delta/(3N)$ by running Algorithm 4 on all

primal solutions. In order to verify a dual solution $\bar{p}$ we need to approximate with high probability the value
$$\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i(A_i \bullet X - b_i) = \lambda_{\max}\left(\sum_{i=1}^m \bar{p}_i A_i\right) - \sum_{i=1}^m \bar{p}_i b_i.$$
We can perform this verification up to error $\epsilon/4$ and with failure probability $\delta/(3N)$ by applying Lemma 11 to the matrix $M = \sum_{i=1}^m \bar{p}(i) A_i$ with error parameter $\epsilon/4$ and failure probability $\delta/(3N)$. Note that $M$ can be aggregated during the run of Algorithm 3 without increasing the computational complexity. Thus with probability at least $1-\delta$ we get a pair of primal-dual solutions that satisfies Theorem 1, and all verifications are correct. Finally, note that on every iteration $t$ of Algorithm 3 it holds that $\|M_t\|_0 \le \min\{S\epsilon^{-2}\log n,\, n^2\}$, which means that $\|M\|_0 \le \min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}$. Thus verifying a dual solution according to Lemma 11 takes at most $O(\min\{S\epsilon^{-4}\log(n)\log(m),\, n^2\}\,\epsilon^{-0.5}\log(n/\delta))$ arithmetic operations.

6 Sublinear Time Algorithm for the Non-Affine Saddle-Point Problem

In this section we describe and analyse our algorithm for non-affine constraints $c_i(X) \ge 0$, $i \in [m]$, where each $c_i$ is concave and differentiable. Throughout this section we make the following assumptions for all $i \in [m]$ and $X \in \mathcal{S}$:
1. $|c_i(X)| \le 1$.
2. $\|\nabla c_i(X)\|_F \le 1$.
3. $\nabla c_i(X)$ can be computed in $O(n^2)$ time.
4. There exists a procedure Sample$(X, c_1,\dots,c_m)$ that returns in $O(m + n^2)$ time unbiased estimators of the values $c_1(X),\dots,c_m(X)$, each with variance at most 1.

Example and justification of assumptions. Although these assumptions are relatively strong, they hold for example for the important case of the square loss, that is $c(X) = (A \bullet X - b)^2$ for some matrix $A \in \mathbb{R}^{n\times n}$ and scalar $b \in \mathbb{R}$. Note that for this function conditions 1 and 2 hold after proper scaling, and condition 3 holds trivially.
Estimating the value of $c(X)$ in $O(1)$ time and with unit variance can be done by taking two independent unbiased estimators $v_1, v_2$ of the value $A \bullet X - b$, using the $\ell_2$-sampling technique described in Section 3, and forming the estimator $v = v_1 v_2$. Since $v_1, v_2$ are independent we have that $\mathbb{E}[v] = \mathbb{E}[v_1]\,\mathbb{E}[v_2] = (A \bullet X - b)^2$. Also $\mathbb{E}[v^2] = \mathbb{E}[v_1^2 v_2^2] = \mathbb{E}[v_1^2]\,\mathbb{E}[v_2^2]$, which means that in case $\|A\|_F \le 1$, the estimator $v$ also has unit variance.

Remark 1 In case only a constant number of the functions $c_1,\dots,c_m$ are non-affine and all the rest are affine, assumption 4 can be replaced with the assumption that $c_i(X)$ can be computed in $O(n^2)$ time for every non-affine function $c_i$ and every $X \in \mathcal{S}$. This allows the running time to remain sublinear without the need to implement the procedure Sample for the non-affine constraints (recall that for affine constraints we implement Sample using the $\ell_2$-sampling technique).
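The product-of-estimators construction above can be checked numerically. The sketch below (names and interfaces are illustrative) forms $v = v_1 v_2$ from two independent $\ell_2$-sampling estimates of $A \bullet xx^\top - b$ at a rank-one point $X = xx^\top$:

```python
import numpy as np

def l2_estimate_affine(A, x, b, rng):
    """One l2-sample estimate of A . xx^T - b for a unit vector x."""
    n = len(x)
    j = rng.choice(n, p=x ** 2)
    l = rng.choice(n, p=x ** 2)
    return A[j, l] / (x[j] * x[l]) - b

def square_loss_estimate(A, x, b, rng):
    """Unbiased estimate of c(X) = (A . X - b)^2 at X = xx^T: the product
    of two independent unbiased estimates of A . X - b, as in the text."""
    return l2_estimate_affine(A, x, b, rng) * l2_estimate_affine(A, x, b, rng)

# Monte Carlo check that the mean matches (x^T A x - b)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A /= 2 * np.linalg.norm(A)          # scale so that ||A||_F = 1/2
x = np.ones(4) / 2.0
b = 0.2
est = np.mean([square_loss_estimate(A, x, b, rng) for _ in range(20000)])
true = (x @ A @ x - b) ** 2
```

The independence of the two inner estimates is what makes both the mean and the second moment factor into products, which is exactly the argument used above.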

Our algorithm is based on Algorithm 2. It resembles Algorithm 3, except that instead of producing the iterates $X_t$ by offline optimization (eigenvector computations), they are produced by the approximated regularized follow-the-leader online algorithm defined in Definition 3 (line 6 in Algorithm 5), where the RFTL objective (5) is solved up to an additive error $\epsilon'$ using the Frank-Wolfe algorithm (see Lemma 9). A problem with applying the RFTL algorithm as done in Algorithm 2 (line 5) is that on each iteration we are required to compute the gradients of $m$ functions, $\nabla c_1(X_t),\dots,\nabla c_m(X_t)$, which under our assumptions takes $O(mn^2)$ time. To overcome this hurdle we use a standard trick in stochastic optimization: instead of using the entire gradient at time $t$, $\sum_{i=1}^m p_t(i)\nabla c_i(X_t)$, we sample a single gradient $\nabla c_i(X_t)$ with probability $p_t(i)$ (line 5 of the algorithm). Applying RFTL with the sampled gradients achieves with high probability the same regret bound as in Lemma 8. The $\ell_2$-sampling technique used to estimate the values $c_i(X_t)$ in Algorithm 3 is replaced with a call to the generic sampling oracle Sample$(X_t, c_1,\dots,c_m)$ (line 7 of the algorithm). The algorithm is given below.

Algorithm 5
1: Input: error tolerance $\epsilon \in (0,1]$, an instance of Problem (5) and a procedure Sample$(\cdot)$ that satisfy the requirements of Theorem 3.
2: Let $\epsilon' \leftarrow \frac{\epsilon^2}{9}$, $T \leftarrow 60^2\epsilon^{-2}\log m$, $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{4\log m}{3T}}$
3: for $t = 1$ to $T$ do
4:   $p_t \leftarrow \frac{w_t}{\|w_t\|_1}$
5:   Choose $i_t \in [m]$ by $i_t \leftarrow i$ w.p. $p_t(i)$
6:   $X_{t+1} \leftarrow$ $\epsilon'$-approximated solution to $\max_{X\in\mathcal{S}} \frac{1}{\sqrt{2T}}\sum_{\tau=1}^t \nabla c_{i_\tau}(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$
7:   $\tilde{v}_t \leftarrow$ Sample$(X_t, c_1,\dots,c_m)$
8:   for $i \in [m]$ do
9:     $v_t(i) \leftarrow \mathrm{clip}(\tilde{v}_t(i), 1/\eta)$
10:    $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta v_t(i) + \eta^2 v_t(i)^2)$
11:  end for
12: end for
13: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T e_{i_t}$

We turn to analyze Algorithm 5 and to prove Theorem 3. The proof of the following concentration result is given in the appendix.
Lemma 17 For $1/4 \ge \eta \ge \sqrt{\frac{4\log m}{3T}}$, it holds with probability at least $1 - O(1/m)$ that
1. $\max_{i\in[m]} \sum_{t=1}^T \left[v_t(i) - c_i(X_t)\right] \le 5\eta T$,

2. $\left|\sum_{t=1}^T c_{i_t}(X_t) - \sum_{t=1}^T p_t^\top v_t\right| \le 4\eta T$.

We can now prove Theorem 3.

Proof For the proof we assume that the guarantees stated in Lemmas 17 and 13 hold. Using the union bound this happens with probability at least $1/2$. From Lemma 10 and the use of clipping in line 9 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T v_t(i) + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 17 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + 5\eta T + \frac{\log m}{\eta} + \eta\sum_{t=1}^T p_t^\top v_t^2.$$
Plugging in Lemma 13 we have that
$$\sum_{t=1}^T p_t^\top v_t \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + 5\eta T + \frac{\log m}{\eta} + 16\eta T.$$
Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 17 we have that
$$\sum_{t=1}^T c_{i_t}(X_t) - 4\eta T \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 21\eta T.$$
By Lemma 8 we have that
$$\max_{X\in\mathcal{S}} \sum_{t=1}^T c_{i_t}(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 25\eta T.$$
Since $\sum_{t=1}^T c_{i_t}(X) = \sum_{i=1}^m \sum_{t: i_t = i} c_i(X) = T\sum_{i=1}^m \bar{p}_i c_i(X)$, we have that
$$T\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le \min_{i\in[m]} \sum_{t=1}^T c_i(X_t) + \frac{\log m}{\eta} + 25\eta T.$$
By concavity of the functions $c_1,\dots,c_m$ we have that
$$T\max_{X\in\mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \sqrt{2T} - T\sqrt{2\epsilon'} \le T\min_{i\in[m]} c_i(\bar{X}) + \frac{\log m}{\eta} + 25\eta T.$$
The result follows by dividing through by $T$ and plugging in our choices for $\epsilon', \eta, T$.
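The RFTL update step used throughout this section maximizes a smooth, strongly concave function over the spectrahedron, and each Frank-Wolfe iteration reduces to a leading eigenvector computation. A minimal sketch of this step, assuming exact eigenvector computations rather than Lanczos and with illustrative names:

```python
import numpy as np

def frank_wolfe_spectrahedron(grad, n, num_iters):
    """Maximize a smooth concave function over S = {X >= 0, Tr(X) = 1}.
    Each Frank-Wolfe step moves toward vv^T, where v is a leading
    eigenvector of the current gradient (the linear maximizer over S)."""
    X = np.eye(n) / n  # an arbitrary starting point in S
    for t in range(1, num_iters + 1):
        G = grad(X)
        _, V = np.linalg.eigh((G + G.T) / 2)  # eigenvalues in ascending order
        v = V[:, -1]                          # leading eigenvector
        step = 2.0 / (t + 2)                  # standard Frank-Wolfe step size
        X = (1 - step) * X + step * np.outer(v, v)
    return X

# RFTL-style objective F(X) = C . X - 0.5 * ||X||_F^2, with C standing in
# for the scaled sum of sampled gradients; the gradient of F is C - X.
n = 5
C = np.diag(np.arange(n, dtype=float))
X = frank_wolfe_spectrahedron(lambda X: C - X, n, num_iters=500)
```

Since every step is a convex combination of trace-one PSD matrices, all iterates stay in the spectrahedron, and only eigenvector computations, never projections, are required.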

Algorithm 5 performs $O(\epsilon^{-2}\log m)$ iterations, each comprised of an update step of the approximated RFTL algorithm and a call to the procedure Sample$(\cdot)$. From Lemma 9 we have that the time complexity of the approximated RFTL update step is bounded by $O(n^2\epsilon^{-4}\log m)$, and under our assumptions a call to the procedure Sample$(\cdot)$ takes $O(m + n^2)$ time. Hence we get the running time stated in Theorem 3.

7 Lower Bounds

In this section we prove Theorem 2. Our lower bounds are information-theoretic, meaning that any successful algorithm must read a certain amount of entries from the input. The core machinery behind these bounds is the following folklore fact (see [8] for more details).

Proposition 1 Consider a one-dimensional array of length $r$ such that with probability $1/2$ the array contains only zeros, and with the remaining probability $1/2$ the array contains a single non-zero entry, whose position is chosen uniformly at random, and zeros in all other entries. Then any algorithm that determines with probability at least $2/3$ whether the array contains only zeros or not must read $\Omega(r)$ entries of the array.

The following lemma describes how we use the above proposition to prove our lower bounds.

Lemma 18 Fix an error tolerance $\epsilon$ and the parameters $m, n$. Suppose that there exists a distribution over pairs of sets of $m$ matrices, $A_1^1,\dots,A_m^1 \in \mathbb{R}^{n\times n}$ and $A_1^2,\dots,A_m^2 \in \mathbb{R}^{n\times n}$, and a set $B \subseteq [m]\times[n]\times[n]$, such that:
1. $\left|\max_{X\in\mathcal{S}}\min_{i\in[m]} A_i^1 \bullet X - \max_{X\in\mathcal{S}}\min_{i\in[m]} A_i^2 \bullet X\right| > \epsilon$.
2. There exists a triplet $(i,j,k) \in B$ such that $A_i^1(j,k) \ne A_i^2(j,k)$, and for all $(i',j',k') \in B$ with $(i',j',k') \ne (i,j,k)$ it holds that $A_{i'}^1(j',k') = A_{i'}^2(j',k')$.
3. The distinctive triplet $(i,j,k)$ in the previous condition is distributed uniformly among all triplets in $B$.
Then any $\epsilon$-additive approximation algorithm for Problem (1) that succeeds with probability at least $2/3$ has running time $\Omega(|B|)$.

Proof Assume that a distribution and a set $B$ that satisfy the conditions of the lemma exist.
Given matrices $A_1,\dots,A_m \in \mathbb{R}^{n\times n}$, define ARRAY$(A_1,\dots,A_m)$ to be a one-dimensional array of length $|B|$ whose entries are exactly the entries $A_i(j,k)$ such that $(i,j,k) \in B$, organized into a one-dimensional array in some fixed order. Let $A_1^1,\dots,A_m^1$ and $A_1^2,\dots,A_m^2$ be two sets of matrices drawn from the distribution. Denote by ARR$_1$ the array ARRAY$(A_1^1,\dots,A_m^1)$ and by ARR$_2$ the array ARRAY$(A_1^2,\dots,A_m^2)$. Now, given an $\epsilon$-additive approximation algorithm for Problem (1) that succeeds with probability at least $2/3$, we do the following. With probability $1/2$ we run it on the set of matrices $A_1^1,\dots,A_m^1$, and with the remaining probability we run it on $A_1^2,\dots,A_m^2$. Note that since the optima of these two instances differ by more than $\epsilon$, the algorithm must distinguish between these two instances.
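The needle-in-a-haystack structure of Proposition 1 is easy to simulate; the illustrative code below (not part of the proof) shows that a scanner which must certify "all zeros" reads, on average, a constant fraction of the array:

```python
import numpy as np

def make_hard_array(r, rng):
    """With probability 1/2 all zeros; otherwise a single 1 at a uniform position."""
    arr = np.zeros(r)
    if rng.random() < 0.5:
        arr[rng.integers(r)] = 1.0
    return arr

def reads_until_decision(arr):
    """A left-to-right scanner stops at the non-zero entry, or reads everything."""
    for idx, v in enumerate(arr):
        if v != 0.0:
            return idx + 1
    return len(arr)

rng = np.random.default_rng(0)
r = 1000
mean_reads = np.mean([reads_until_decision(make_hard_array(r, rng))
                      for _ in range(2000)])
# roughly 0.75 * r entries are read on average: r when the array is all-zero
# (probability 1/2), about r/2 when the needle sits at a uniform position
```

A left-to-right scan is only one reading strategy, but the information-theoretic argument above shows that no strategy, adaptive or randomized, can do better than $\Omega(r)$ reads while succeeding with probability $2/3$.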


More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Beating SGD: Learning SVMs in Sublinear Time

Beating SGD: Learning SVMs in Sublinear Time Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological

More information

Convex and Semidefinite Programming for Approximation

Convex and Semidefinite Programming for Approximation Convex and Semidefinite Programming for Approximation We have seen linear programming based methods to solve NP-hard problems. One perspective on this is that linear programming is a meta-method since

More information

Lecture 7: Passive Learning

Lecture 7: Passive Learning CS 880: Advanced Complexity Theory 2/8/2008 Lecture 7: Passive Learning Instructor: Dieter van Melkebeek Scribe: Tom Watson In the previous lectures, we studied harmonic analysis as a tool for analyzing

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization

Research Note. A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization Iranian Journal of Operations Research Vol. 4, No. 1, 2013, pp. 88-107 Research Note A New Infeasible Interior-Point Algorithm with Full Nesterov-Todd Step for Semi-Definite Optimization B. Kheirfam We

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Sometimes Sublinear Run Time Mengdi Wang Department of Operations Research and Financial Engineering, Princeton

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Online Learning with Experts & Multiplicative Weights Algorithms

Online Learning with Experts & Multiplicative Weights Algorithms Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Online Convex Optimization with Stochastic Constraints

Online Convex Optimization with Stochastic Constraints Online Convex Optimization with Stochastic Constraints Hao Yu, Michael J. Neely, Xiaohan Wei Department of Electrical Engineering University of Southern California 1 arxiv:1708.03741v1 [math.oc] 1 Aug

More information

Convex optimization. Javier Peña Carnegie Mellon University. Universidad de los Andes Bogotá, Colombia September 2014

Convex optimization. Javier Peña Carnegie Mellon University. Universidad de los Andes Bogotá, Colombia September 2014 Convex optimization Javier Peña Carnegie Mellon University Universidad de los Andes Bogotá, Colombia September 2014 1 / 41 Convex optimization Problem of the form where Q R n convex set: min x f(x) x Q,

More information

Online Submodular Minimization

Online Submodular Minimization Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com

More information

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as Chapter 8 Geometric problems 8.1 Projection on a set The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as dist(x 0,C) = inf{ x 0 x x C}. The infimum here is always achieved.

More information

Bandits for Online Optimization

Bandits for Online Optimization Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions

Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301

More information

Lecture Approximate Potentials from Approximate Flow

Lecture Approximate Potentials from Approximate Flow ORIE 6334 Spectral Graph Theory October 20, 2016 Lecturer: David P. Williamson Lecture 17 Scribe: Yingjie Bi 1 Approximate Potentials from Approximate Flow In the last lecture, we presented a combinatorial

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming

CSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function

A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function Zhongyi Liu, Wenyu Sun Abstract This paper proposes an infeasible interior-point algorithm with

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

U.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017

U.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017 U.C. Berkeley CS94: Beyond Worst-Case Analysis Handout 1 Luca Trevisan October 3, 017 Scribed by Maxim Rabinovich Lecture 1 In which we begin to prove that the SDP relaxation exactly recovers communities

More information

Lecture 7: Semidefinite programming

Lecture 7: Semidefinite programming CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational

More information

A priori bounds on the condition numbers in interior-point methods

A priori bounds on the condition numbers in interior-point methods A priori bounds on the condition numbers in interior-point methods Florian Jarre, Mathematisches Institut, Heinrich-Heine Universität Düsseldorf, Germany. Abstract Interior-point methods are known to be

More information

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4 Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4 Instructor: Farid Alizadeh Scribe: Haengju Lee 10/1/2001 1 Overview We examine the dual of the Fermat-Weber Problem. Next we will

More information