Sublinear Time Algorithms for Approximate Semidefinite Programming
Noname manuscript No. (will be inserted by the editor)

Dan Garber · Elad Hazan

Received: date / Accepted: date

Abstract We consider semidefinite optimization in a saddle point formulation where the primal solution is in the spectrahedron and the dual solution is a distribution over affine functions. We present an approximation algorithm for this problem that runs in sublinear time in the size of the data. To the best of our knowledge, this is the first algorithm to achieve this. Our algorithm is also guaranteed to produce low-rank solutions. We further prove lower bounds on the running time of any algorithm for this problem, showing that certain terms in the running time of our algorithm cannot be further improved. Finally, we consider a non-affine version of the saddle point problem and give an algorithm that under certain assumptions runs in sublinear time.

Keywords semidefinite programming · sublinear algorithms · online algorithms · large scale optimization

Mathematics Subject Classification (2000) 90C22 · 90C06 · 68W27 · 68W20

1 Introduction

Semidefinite programming has become a tool of great importance in optimization in the past years. In the field of combinatorial optimization, for example, numerous approximation algorithms have been discovered starting with the seminal work of Goemans and Williamson on the MAX-CUT problem [12], which was followed

Dan Garber
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: dangar@tx.technion.ac.il

Elad Hazan
Department of Industrial Engineering and Management, Technion, Haifa 32000, Israel.
E-mail: ehazan@ie.technion.ac.il
by [4,1,3]. In the field of machine learning, solving semidefinite programs is at the heart of many learning tasks such as learning a distance metric [26,24], sparse PCA [10], multiple kernel learning [19], matrix completion [6], and more.

In machine learning applications in particular, it is often the case that the data is assumed to be noisy, and thus there is no point in solving the underlying optimization problem to very high precision. Moreover, the extremely large scale of the data in such applications makes the underlying optimization problem impractical to solve to relatively high precision, even for fast first order methods. Indeed, in recent years there has been much research on improving the running time of first order methods for semidefinite programming, with the aim of continually increasing the scale of problems that can be tackled with sufficient accuracy, and there now exist solvers that, for certain problems and a certain regime of parameters, run in nearly linear time [9,5]. As an attempt to further accelerate first order methods for semidefinite programming, we explore in this paper the possibility of designing algorithms for semidefinite optimization that, in certain parameter regimes of interest, require running time that is only sublinear in the size of the problem^1.

The main semidefinite optimization problem under investigation in this paper is the following saddle point problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^m p_i (A_i \bullet X - b_i), \qquad (1)$$

where $A_1, \dots, A_m \in \mathbb{R}^{n \times n}$ are all symmetric, $b_1, \dots, b_m \in \mathbb{R}$, $\Delta_m = \{x \in \mathbb{R}^m \mid x \geq 0, \|x\|_1 = 1\}$ is the $(m-1)$-dimensional unit simplex, $\mathcal{S} = \{X \in \mathbb{R}^{n \times n} \mid X \succeq 0, \mathrm{Tr}(X) = 1\}$ is the spectrahedron, that is, the set of unit-trace $n \times n$ positive semidefinite matrices, and $\bullet$ denotes the standard matrix dot product, that is, $A \bullet B = \sum_{i,j=1}^n A_{ij} B_{ij}$.
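To make the formulation concrete: for a fixed $X$, the inner minimization in (1) is linear in $p$, so it is attained at a vertex of the simplex and equals $\min_i (A_i \bullet X - b_i)$. A minimal NumPy sketch (the toy matrices are ours, purely illustrative):

```python
import numpy as np

def saddle_objective(As, bs, X):
    # For fixed X, min over the simplex of sum_i p_i (A_i . X - b_i)
    # is attained at a vertex, i.e. it equals min_i (A_i . X - b_i).
    return min(np.tensordot(A, X) - b for A, b in zip(As, bs))

# Hypothetical toy instance: two symmetric 2x2 constraints.
A1 = np.array([[1.0, 0.0], [0.0, -1.0]])
A2 = np.array([[-1.0, 0.0], [0.0, 1.0]])
X = np.eye(2) / 2.0  # unit-trace PSD matrix: a point in the spectrahedron
print(saddle_objective([A1, A2], [0.0, 0.0], X))  # 0.0: both margins vanish
```

At $X = I/2$ both constraints have margin exactly zero, which is also the saddle-point value of this toy instance.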
In case the optimum value of (1) is non-negative, solving (1) up to an additive error of $\epsilon$ is equivalent to finding a solution in the set $\mathcal{S}$ that satisfies all constraints $A_i \bullet X - b_i \geq 0$ up to an additive error of $\epsilon$. Note that the demand that the solution have unit trace is equivalent, via standard reductions, to a demand that the trace of the solution be upper bounded somehow. Moreover, the formulation in (1) captures the notion of maximizing the margin from feasibility, that is, maximizing the scalar $\sigma$ that still satisfies $A_i \bullet X \geq b_i + \sigma$ for all $i \in [m]$, for some matrix $X \in \mathcal{S}$. This objective is of direct interest in certain machine learning tasks such as the problem of learning a distance metric [26,24].

The dual of Problem (1) is the following eigenvalue optimization problem:

$$\min_{p \in \Delta_m} \lambda_{\max}\left(\sum_{i=1}^m p_i (A_i - b_i I)\right), \qquad (2)$$

where $I$ denotes the $n \times n$ identity matrix. Note that duality between problems (1) and (2) is strong, i.e. there is no duality gap.

^1 The results presented in this paper are a continuation of preliminary results on sublinear semidefinite optimization presented in [11].
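Weak duality between (1) and (2) is easy to verify numerically: since $\lambda_{\max}(M) = \max_{X \in \mathcal{S}} M \bullet X$, every dual value upper-bounds every primal value. A short sketch under this observation (the toy data is ours):

```python
import numpy as np

def dual_objective(As, bs, p):
    # Objective of (2): lambda_max of the p-weighted sum_i p_i (A_i - b_i I).
    n = As[0].shape[0]
    M = sum(pi * (A - b * np.eye(n)) for pi, A, b in zip(p, As, bs))
    return np.linalg.eigvalsh(M)[-1]  # eigvalsh sorts eigenvalues ascending

def primal_inner(As, bs, X):
    return min(np.tensordot(A, X) - b for A, b in zip(As, bs))

# Hypothetical toy instance.
As = [np.array([[1.0, 0.0], [0.0, -1.0]]), np.array([[0.0, 1.0], [1.0, 0.0]])]
bs = [0.1, 0.2]
p, X = np.array([0.5, 0.5]), np.eye(2) / 2.0
# Weak duality: any dual value upper-bounds any primal value.
assert dual_objective(As, bs, p) >= primal_inner(As, bs, X)
```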
Our main contribution is a randomized algorithm that solves Problem (1) up to a desired additive error $\epsilon$ in a total running time that, for large enough instances, is sublinear in the number of entries in the instance. To the best of our knowledge this is the first result to achieve this kind of bound on the running time. Our algorithm has an additional appealing feature: it always produces low-rank solutions. More precisely, we prove the following theorem, which is the main theorem of this paper.

Theorem 1 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (1) satisfying for all $i \in [m]$:
- $\|A_i\|_2 \leq 1$ (all matrices are bounded in spectral (operator) norm),
- $|b_i| \leq 1$,
- the number of non-zero entries of each matrix $A_i$ is upper bounded by $S$,
- $\|A_i\|_F \leq F$ (all matrices have Frobenius norm at most $F$),

finds after

$$O\left(\epsilon^{-2}\log m\left(F^2(m + \log n) + S\epsilon^{-2}\log n + \min\{S\epsilon^{-2}\log n,\, n^2\}\,\epsilon^{-1/2}\log n\right)\right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \bar{p}_i (A_i \bullet X - b_i) - \epsilon,$$

and the rank of $\bar{X}$ is upper bounded by $O(\epsilon^{-2}\log m)$.

Note that in the worst case $S = \Theta(n^2)$ and $F^2 = \Theta(n)$ (under the assumption of unit spectral norm), and the bound in the theorem becomes $\tilde{O}\left(\epsilon^{-2}(mn + \epsilon^{-2}n^2)\right)$ (suppressing poly-logarithmic factors), which is sublinear in the size of the problem (which is $\Theta(mn^2)$) for $\epsilon \geq \max\{n^{-1/2}, m^{-1/4}\}$. In case $F^2 = O(1)$ and $S = \Theta(n^2)$, the bound becomes $\tilde{O}\left(\epsilon^{-2}(m + n^2\epsilon^{-2})\right)$, which actually splits the dependency between $m$ and $n$. In this case the algorithm exhibits a sublinear running time for $\epsilon \geq \max\{n^{-1}, m^{-1/4}\}$.

We also prove the following lower bounds on the running time of any $\epsilon$-additive approximation algorithm for Problem (1).
Theorem 2 Given a choice of parameters $m, n$, let $\Gamma^F_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_F \leq 1$, $b_i = 0$, and let $\Gamma^S_{m,n}$ denote the set of all instances of (1) for which for all $i \in [m]$, $A_i \in \mathbb{R}^{n \times n}$, $\|A_i\|_2 \leq 1$, $b_i = 0$. The following claims hold.

1. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $m$, and let $n \geq \frac{1}{4\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst case running time $\Omega\left(\frac{m}{\epsilon^2}\right)$ on instances from $\Gamma^F_{m,n}$.
2. Fix an error tolerance $\epsilon \in (0,1)$ and the parameter $n$, and let $m \geq \frac{1}{8\epsilon}$. Any algorithm that, given an instance from $\Gamma^F_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst case running time $\Omega\left(\frac{n^2}{\epsilon}\right)$ on instances from $\Gamma^F_{m,n}$.

3. Fix the parameters $m, n$ and an error tolerance $\epsilon$ satisfying $\frac{1}{8} \geq \epsilon \geq \frac{1}{4n}$. Any algorithm that, given an instance from $\Gamma^S_{m,n}$, finds with probability at least $2/3$ an $\epsilon$-approximated solution has worst case running time $\Omega(mn)$ on instances from $\Gamma^S_{m,n}$.

Part 1 of Theorem 2 shows that the dependency of the running time in Theorem 1 on the term $\epsilon^{-2}m$ cannot be improved in general. Part 2 of the theorem shows that the dependency of the running time on the sparsity parameter $S$ cannot be sublinear in general. Finally, part 3 of the theorem shows that in general (under the unit spectral norm assumption) the dependency on $m, n$ in the running time cannot be separated, as is suggested for instance in Theorem 1 in the case $F = \Theta(1)$.

Finally, we also consider the following non-affine saddle-point optimization problem:

$$\max_{X \in \mathcal{S}} \min_{p \in \Delta_m} \sum_{i=1}^m p_i c_i(X), \qquad (3)$$

where $c_1, \dots, c_m : \mathcal{S} \to \mathbb{R}$ are all concave and differentiable everywhere in $\mathcal{S}$^2. For this problem we have the following theorem.

Theorem 3 There exists an algorithm that, given an error tolerance $\epsilon < 1$ and an instance of Problem (3) satisfying for all $i \in [m]$ and for all $X \in \mathcal{S}$:
- $|c_i(X)| \leq 1$,
- $\|\nabla c_i(X)\|_F \leq 1$,
- $\nabla c_i(X)$ can be computed in $O(n^2)$ time,
- there exists a procedure Sample$(X, c_1, \dots, c_m)$ that returns in $O(m + n^2)$ time unbiased estimators of the values $c_1(X), \dots, c_m(X)$, each with variance at most 1,

finds after

$$O\left(\epsilon^{-2}\log m\,(m + \epsilon^{-4}n^2)\log m\right)$$

arithmetic operations a pair of primal-dual solutions $(\bar{X}, \bar{p}) \in \mathcal{S} \times \Delta_m$ satisfying with probability at least $1/2$:

$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \epsilon.$$

^2 Our results hold also under the weaker assumption that every $c_i$ has a supergradient everywhere in $\mathcal{S}$.
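The Sample procedure assumed in Theorem 3 is easy to realize for the square-loss case discussed in this paper: two independent $\ell_2$-sampling estimates $u_1, u_2$ of $A \bullet X$ satisfy $\mathbb{E}[(u_1 - b)(u_2 - b)] = (A \bullet X - b)^2$. A sketch under our own assumptions (the negation, which makes the function concave as (3) requires, and the toy data are ours):

```python
import numpy as np

def l2_sample_dot(A, X, rng):
    # One-sample unbiased estimator of A . X via l2-sampling of X's entries.
    p = (X ** 2).ravel() / np.sum(X ** 2)
    idx = rng.choice(X.size, p=p)
    i, j = divmod(idx, X.shape[1])
    return A[i, j] * np.sum(X ** 2) / X[i, j]

def sample_square_loss(A, b, X, rng):
    # Unbiased estimate of -(A . X - b)^2 (negated so the function is concave):
    # two independent estimates of A . X make the product unbiased.
    u1, u2 = l2_sample_dot(A, X, rng), l2_sample_dot(A, X, rng)
    return -(u1 - b) * (u2 - b)

rng = np.random.default_rng(0)
A, X, b = np.eye(2) / 2.0, np.eye(2) / 2.0, 0.1
est = np.mean([sample_square_loss(A, b, X, rng) for _ in range(200)])
print(est)  # -(0.5 - 0.1)^2: deterministic here since X's mass is on its diagonal
```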
Our algorithm for the non-affine case may be viewed as a sublinear-time algorithm in the sense that, under our assumptions, the total running time of the algorithm for $\epsilon \geq \max\{n^{-1}, m^{-1/6}\}$ is less than the worst case time required to evaluate the gradient of each function $c_i$ at least once. Although the conditions stated in Theorem 3 are somewhat restrictive, we show that, for instance, for the important case of the square-loss function $c(X) = (A \bullet X - b)^2$ they indeed hold. This example is spelled out in section 6.

1.1 Related Work

The problems (1), (2) under consideration in this paper are solvable up to an accuracy of $\epsilon$ via Interior Point methods in time $O\left(\sqrt{\max\{m,n\}}\,(n^3 + m^3 + m^2n^2)\log(1/\epsilon)\right)$. In recent years there have been extensive efforts to speed up the running time of first order methods for these problems; see [9,20,21,17,5,15] and references therein. Current state-of-the-art first order methods include the deterministic smoothing technique of Nesterov [21], the Mirror Descent method [9] and the recent randomized Mirror-Prox method of Baes, Bürgisser and Nemirovski [5], all three considering the eigenvalue optimization problem (2). The current best running time of first order methods for solving Problem (2) (up to poly-logarithmic factors), under the assumption that all matrices $A_i$ have spectral norm at most 1, is

$$\min\left\{\frac{mS}{\epsilon} + \frac{n^2}{\epsilon^3}\ \text{[5]},\quad \frac{mS + n^3}{\epsilon}\ \text{[21]},\quad \frac{mS}{\epsilon^2} + \frac{n^2}{\epsilon^{2.5}}\ \text{[9]}\right\}. \qquad (4)$$

In order to show the potential appeal of the algorithm proposed in this paper over previous art, we present in Table 1 different regimes of $m$ (number of constraints), $S$ (the maximum number of non-zero entries in any of the matrices) and $F$ (upper bound on the Frobenius norm of all matrices) for which our method improves over the bound (4), for a meaningful range of the error tolerance parameter $\epsilon$.
Please note that Table 1 does not give a full characterization of the settings in which our method may be preferable, but only presents a partial list.

Our algorithms are based on applying algorithms from the paradigm of online convex optimization to solve saddle point optimization problems. For recent surveys on online convex optimization see [16], [23]. We use the generic schemes presented in [14] for converting these online algorithms into algorithms for saddle point optimization, and we especially build on the well known Multiplicative Weights algorithm for online linear optimization over the simplex [2]. In order to accelerate these schemes and derive a sublinear running time, we apply, among others, ideas introduced in [8], in which the authors presented several sublinear time algorithms for machine learning optimization tasks such as the supervised learning of separating hyperplanes. In [8] the authors accelerated the schemes from [14] using novel random estimation techniques instead of exact computation. This approach was also the one taken in the pioneering work of Grigoriadis and Khachiyan [13], who presented a randomized sublinear time algorithm for a saddle point optimization problem in which both primal and dual solutions lie in the simplex. The sublinear running time of their algorithm is also achieved by trading exact computations for random sampling.
regime | previous FOM | IPM | this paper ($\epsilon \geq n^{-1}$) | improvement
$S = \Theta(n)$, $m = \Theta(n)$, $F^2 = \Theta(n)$ | $\min\{\epsilon^{-1}n^3, \epsilon^{-5/2}n^2\}$ | $n^{9/2}$ | $\epsilon^{-2}n^2 + \epsilon^{-9/2}n$ | $\epsilon > n^{-4/7}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(1)$ | $\epsilon^{-1}n^4$ | $n^{21/2}$ | $\epsilon^{-2}n^3$ | $\epsilon > n^{-1}$
$S = \Theta(n)$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$ | $\epsilon^{-1}n^4$ | $n^{21/2}$ | $\epsilon^{-2}n^{7/2}$ | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^{5/2}$ | $n^{9/2}$ | $\epsilon^{-4}n^{3/2} + \epsilon^{-5/2}n^2$ | $\epsilon > n^{-1/3}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^{7/2}$ | $n^7$ | $\epsilon^{-2}n^3 + \epsilon^{-4}n^{3/2}$ | $\epsilon > n^{-1/2}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(\sqrt{n})$ | $\epsilon^{-1}n^{9/2}$ | $n^{21/2}$ | $\epsilon^{-2}n^{7/2}$ | $\epsilon > n^{-1}$
$S = \Theta(n^{3/2})$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^{9/2}$ | $n^{21/2}$ | $\epsilon^{-2}n^4$ | $\epsilon > n^{-1/2}$
$S = \Theta(n^2)$, $m = \Theta(n)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^3$ | $n^{9/2}$ | $\epsilon^{-4}n^2$ | $\epsilon > n^{-1/3}$
$S = \Theta(n^2)$, $m = \Theta(n^2)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^4$ | $n^7$ | $\epsilon^{-2}n^3 + \epsilon^{-4}n^2$ | $\epsilon > n^{-2/3}$
$S = \Theta(n^2)$, $m = \Theta(n^3)$, $F^2 = \Theta(n)$ | $\epsilon^{-1}n^5$ | $n^{21/2}$ | $\epsilon^{-2}n^4$ | $\epsilon > n^{-1}$

Table 1 Comparison of running times between the bound in (4), the running time of Interior Point Methods and the bound stated in Theorem 1, for the eigenvalue optimization problem (2), in various regimes of $S, m, F^2$. In all settings the spectral norm of all matrices is assumed to be upper bounded by 1. All bounds are stated without constants and without poly-logarithmic factors. The column titled "previous FOM" gives a lower bound on the expression in Equation (4). The column "IPM" gives the best running time of Interior Point methods for the chosen parameters. The column titled "this paper" gives an evaluation of the bound stated in Theorem 1 for the choices of $S$, $m$ and $F^2$; in order to further simplify the expression in this column we assume that $\epsilon \geq n^{-1}$. Finally, the column titled "improvement" gives the range of the error tolerance $\epsilon$ for which the bound in Theorem 1 improves over the bound in (4).

Two more results that we apply in order to derive sublinear running times are a Bernstein-like concentration result for sums of random matrices presented in [22], and the Frank-Wolfe algorithm for smooth convex optimization over the spectrahedron presented in [15].
The rest of this paper is organized as follows. In section 2 we present our main algorithmic technique: algorithms for online convex optimization, and overview their application to saddle point optimization. In section 3 we detail more algorithmic tools that we apply in our sublinear time algorithms. In section 4 we describe our randomized sublinear algorithm for Problem (1) and prove Theorem 1. In section 5 we describe how to make the algorithm work with arbitrarily high probability while keeping the running time sublinear. In section 6 we present a sublinear algorithm for the non-affine problem (3) and prove Theorem 3. Finally, in section 7 we prove Theorem 2.

2 An Online Convex Optimization Approach to Saddle-Point Optimization

In this section we overview our main algorithmic machinery for solving problems (1) and (3): algorithms for online convex optimization and their application to saddle point optimization.

2.1 Online Convex Optimization

In the setting of online convex optimization (OCO) [27,16,23,7], a decision maker is iteratively required to choose a point $x_t$ in a fixed convex set $\mathcal{K}$. After the decision maker chooses $x_t$, a convex loss function $f_t : \mathcal{K} \to \mathbb{R}$ is revealed, and the decision
maker incurs a loss equal to the value of the function evaluated at the point chosen, $f_t(x_t)$, and the game proceeds to the next iteration. The emphasis in this setting is that the function $f_t$ revealed on each iteration of this game may be chosen completely arbitrarily, even in an adversarial manner given the previous choices of the decision maker and the previous loss functions. In this adversarial model the decision maker cannot hope to minimize his cumulative loss, $\sum_{t=1}^T f_t(x_t)$, as a function of his choices $x_1, \dots, x_T$. Instead, the standard goal in this setting is to minimize the difference between the cumulative loss of the decision maker and that of an algorithm that has the benefit of hindsight but is able to play only a single fixed point in $\mathcal{K}$ on every iteration. This difference is known as regret, and is formally given by

$$\mathrm{regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x).$$

Usually one cannot guarantee to achieve zero (or negative) regret [7], and the goal is to design an efficient algorithm for the decision maker that guarantees a regret that grows sublinearly in the length of the game $T$, that is, $\mathrm{regret}_T / T \to 0$ as $T \to \infty$. Sometimes one wishes to maximize a cumulative gain instead of minimizing a cumulative loss. In this case we require that all functions $f_t$ are concave, and the regret is given by the difference between the cumulative gain of the best fixed point in $\mathcal{K}$ in hindsight and that of the decision maker.

We conclude this subsection by presenting two algorithms for online convex optimization and their regret guarantees. In the next subsection we describe how these algorithms may be applied to saddle point optimization problems such as (1) and (3). The first algorithm we describe, known as the Multiplicative Weights algorithm [2], is an algorithm for online optimization over the simplex $\Delta_m$ with linear loss functions ($f_t(p) = q_t^\top p$).
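As a concrete instance of the regret notion above, a minimal NumPy sketch of the Multiplicative Weights strategy on random linear losses, checking the sublinear-regret guarantee (the loss data is synthetic, of our choosing):

```python
import numpy as np

def mw_regret(losses, eta):
    # Multiplicative Weights over the simplex: play p_t proportional to w_t,
    # then update w_{t+1}(i) = w_t(i) * (1 - eta * q_t(i)).
    T, m = losses.shape
    w, total = np.ones(m), 0.0
    for q in losses:
        p = w / w.sum()
        total += p @ q
        w = w * (1.0 - eta * q)
    best_fixed = losses.sum(axis=0).min()  # best single coordinate in hindsight
    return total - best_fixed

rng = np.random.default_rng(1)
T, m = 2000, 5
losses = rng.uniform(-1.0, 1.0, size=(T, m))  # entries in [-1, 1]
regret = mw_regret(losses, eta=np.sqrt(np.log(m) / T))
assert regret <= 2.0 * np.sqrt(T * np.log(m))  # the guarantee of Lemma 1
```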
Definition 1 (The Multiplicative Weights algorithm) Consider a sequence of linear losses $q_1, \dots, q_T \in \mathbb{R}^m$ such that $\|q_t\|_\infty \leq 1$ for all $t$. The Multiplicative Weights (MW) algorithm is as follows. Let $0 < \eta \in \mathbb{R}$, $w_1 \leftarrow \mathbf{1}_m$, and for $t \geq 1$,

$$p_t \leftarrow w_t / \|w_t\|_1, \qquad w_{t+1}(i) \leftarrow w_t(i)(1 - \eta q_t(i)).$$

The following lemma bounds the regret of the above described MW algorithm. For a proof see [2].

Lemma 1 Assuming $T > \log m$, for the choice $\eta = \sqrt{\frac{\log m}{T}}$ the Multiplicative Weights algorithm satisfies

$$\mathrm{regret}_T(MW) = \sum_{t=1}^T p_t^\top q_t - \min_{i \in [m]} \sum_{t=1}^T q_t(i) \leq 2\sqrt{T\log m}.$$

The second algorithm we present is an algorithm for online optimization over the set $\mathcal{S}$, that is, the set of unit-trace positive semidefinite matrices, with arbitrary concave functions. The algorithm is a variant of the more general meta-algorithm known as Regularized Follow The Leader (RFTL) [16,23].
Definition 2 (The RFTL algorithm) Consider a sequence of concave functions $f_1, \dots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t$ and $X \in \mathcal{S}$. The RFTL algorithm is as follows. Let $X_1$ be an arbitrary point in $\mathcal{S}$, and for $t \geq 1$,

$$X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \frac{1}{\sqrt{2T}}\sum_{\tau=1}^t \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2. \qquad (5)$$

The following lemma bounds the regret of the above described RFTL algorithm. For a proof see [23].

Lemma 2 The Regularized Follow The Leader algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T f_t(X) - \sum_{t=1}^T f_t(X_t) \leq \sqrt{2T}.$$

2.2 Applying OCO algorithms to saddle point optimization

In this subsection we overview how to apply online convex optimization algorithms to solve problems (1) and (3). We concentrate on presenting convergent algorithms without focusing on the running times. When we present our sublinear algorithms we show how to derive sublinear time algorithms from the basic algorithms presented in this subsection. We approximate the problems (1), (3) by applying algorithms for OCO in two ways, which we term the dual method and the primal-dual method (see [14]).

The dual method

We apply the dual method to Problem (1), that is, we assume that all constraint functions $c_i(X)$ are affine: $c_i(X) = A_i \bullet X - b_i$, where $A_i \in \mathbb{R}^{n \times n}$ is symmetric and $b_i \in \mathbb{R}$. The basic algorithm is given below.

Algorithm 1 Dual Method
1: Input: $T$
2: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
3: for $t = 1$ to $T$ do
4:   $p_t(i) \leftarrow w_t(i)/\|w_t\|_1$
5:   $X_t \leftarrow \arg\max_{X \in \mathcal{S}} \sum_{i=1}^m p_t(i)\,c_i(X)$
6:   for $i = 1$ to $m$ do
7:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta\, c_i(X_t))$
8:   end for
9: end for
10: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$

Since all functions $c_i$ are affine, $c_i(X) = A_i \bullet X - b_i$, the solution to the optimization step in line 5 of the algorithm is without loss of generality a rank-one matrix
of the form $vv^\top$, where $v$ is an eigenvector corresponding to the largest eigenvalue of the matrix $\sum_{i=1}^m p_t(i)A_i$. Notice that Algorithm 1 applies the MW algorithm with respect to the sequence of linear losses $c_1, \dots, c_T \in \mathbb{R}^m$, where $c_t(i) = c_i(X_t) = A_i \bullet X_t - b_i$.

Lemma 3 Given tolerance $\epsilon > 0$, running Algorithm 1 for $T = 4\epsilon^{-2}\log m$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i (A_i \bullet \bar{X} - b_i) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \bar{p}_i (A_i \bullet X - b_i) - \epsilon.$$

Proof Denote by $c_t$ the vector in $\mathbb{R}^m$ such that $c_t(i) = c_i(X_t)$ for all $i \in [m]$. By the regret guarantee of the MW algorithm (Lemma 1) and the value of $T$ we have that

$$\frac{1}{T}\sum_{t=1}^T p_t^\top c_t - \min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^T p^\top c_t \leq \epsilon.$$

By the definition of $X_t$ we have that for all $t$, $\sum_{i=1}^m p_t(i)c_i(X_t) = \max_{X \in \mathcal{S}} \sum_{i=1}^m p_t(i)c_i(X)$, and thus,

$$\min_{p \in \Delta_m} \frac{1}{T}\sum_{t=1}^T p^\top c_t \geq \frac{1}{T}\sum_{t=1}^T \max_{X \in \mathcal{S}} \sum_{i=1}^m p_t(i)c_i(X) - \epsilon \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \left(\frac{1}{T}\sum_{t=1}^T p_t(i)\right)c_i(X) - \epsilon.$$

Since $c_t(i) = c_i(X_t)$ for all $i$ and each $c_i$ is affine, we have that

$$\frac{1}{T}\sum_{t=1}^T p^\top c_t = \sum_{i=1}^m p_i\,\frac{1}{T}\sum_{t=1}^T c_i(X_t) = \sum_{i=1}^m p_i\, c_i\!\left(\frac{1}{T}\sum_{t=1}^T X_t\right),$$

and thus we conclude that

$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i\, c_i\!\left(\frac{1}{T}\sum_{t=1}^T X_t\right) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \left(\frac{1}{T}\sum_{t=1}^T p_t(i)\right)c_i(X) - \epsilon,$$

which yields the lemma.
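A compact numerical sketch of Algorithm 1 with exact computations (a full eigendecomposition stands in for the leading-eigenvector step, and the toy instance is ours), checking the duality-gap guarantee of Lemma 3:

```python
import numpy as np

def dual_method(As, bs, T):
    # Algorithm 1 with exact computations: MW over the constraints,
    # and a rank-one (leading-eigenvector) primal step.
    m, n = len(As), As[0].shape[0]
    eta = np.sqrt(np.log(m) / T)
    w, X_avg, p_avg = np.ones(m), np.zeros((n, n)), np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        S_t = sum(pi * A for pi, A in zip(p, As))  # sum_i p_t(i) A_i
        v = np.linalg.eigh(S_t)[1][:, -1]          # leading eigenvector
        X_t = np.outer(v, v)                       # rank-one maximizer over S
        c = np.array([np.tensordot(A, X_t) - b for A, b in zip(As, bs)])
        w = w * (1.0 - eta * c)
        X_avg, p_avg = X_avg + X_t / T, p_avg + p / T
    return X_avg, p_avg

As = [np.array([[1.0, 0.0], [0.0, -1.0]]), np.array([[-1.0, 0.0], [0.0, 1.0]])]
Xb, pb = dual_method(As, [0.0, 0.0], T=800)
primal = min(np.tensordot(A, Xb) for A in As)
dual = np.linalg.eigvalsh(sum(pi * A for pi, A in zip(pb, As)))[-1]
assert dual - primal <= 0.1  # within Lemma 3's epsilon for this T
```

For $m = 2$ and $T = 800$, Lemma 3 gives $\epsilon = 2\sqrt{\log m / T} \approx 0.06$, so the asserted gap bound of $0.1$ holds with room to spare.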
The primal-dual method

We apply the primal-dual method to Problem (3), that is, we assume that all constraint functions $c_i(X)$ are concave and differentiable everywhere in $\mathcal{S}$^3. The basic algorithm is given below.

Algorithm 2 Primal-Dual Method
1: Input: $T$
2: Let $X_1$ be an arbitrary point in $\mathcal{S}$
3: $w_1 \leftarrow \mathbf{1}_m$, $\eta \leftarrow \sqrt{\frac{\log m}{T}}$
4: for $t = 1$ to $T$ do
5:   $p_t(i) \leftarrow w_t(i)/\|w_t\|_1$
6:   $X_{t+1} \leftarrow \arg\max_{X \in \mathcal{S}} \frac{1}{\sqrt{2T}}\sum_{\tau=1}^t \sum_{i=1}^m p_\tau(i)\,\nabla c_i(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$
7:   for $i = 1$ to $m$ do
8:     $w_{t+1}(i) \leftarrow w_t(i)(1 - \eta\, c_i(X_t))$
9:   end for
10: end for
11: return $\bar{X} = \frac{1}{T}\sum_{t=1}^T X_t$, $\bar{p} = \frac{1}{T}\sum_{t=1}^T p_t$

Notice that, as in the dual method, Algorithm 2 applies the MW algorithm with respect to the sequence of linear losses $c_1, \dots, c_T$, where $c_t(i) = c_i(X_t)$, and the RFTL algorithm with respect to the sequence of concave functions $f_1(X), \dots, f_T(X)$, where $f_t(X) = \sum_{i=1}^m p_t(i)c_i(X)$.

Lemma 4 Given tolerance $\epsilon > 0$, running Algorithm 2 for $T = \epsilon^{-2}(\sqrt{2} + 2\sqrt{\log m})^2$ rounds guarantees that the returned primal-dual solutions $(\bar{X}, \bar{p})$ satisfy:

$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i c_i(\bar{X}) \geq \max_{X \in \mathcal{S}} \sum_{i=1}^m \bar{p}_i c_i(X) - \epsilon.$$

Proof By the regret guarantee of the MW algorithm (Lemma 1) we have that

$$\sum_{t=1}^T \sum_{i=1}^m p_t(i)c_i(X_t) - \min_{p \in \Delta_m} \sum_{t=1}^T \sum_{i=1}^m p_i c_i(X_t) \leq 2\sqrt{T\log m}.$$

By the regret guarantee of the RFTL algorithm (Lemma 2) we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T \sum_{i=1}^m p_t(i)c_i(X) - \sum_{t=1}^T \sum_{i=1}^m p_t(i)c_i(X_t) \leq \sqrt{2T}.$$

Adding the above two inequalities and rearranging, we have that

^3 As stated before, it suffices to assume that every $c_i$ has a supergradient everywhere in $\mathcal{S}$.
$$\min_{p \in \Delta_m} \sum_{i=1}^m p_i\left(\frac{1}{T}\sum_{t=1}^T c_i(X_t)\right) \geq \max_{X \in \mathcal{S}} \frac{1}{T}\sum_{t=1}^T \sum_{i=1}^m p_t(i)c_i(X) - \frac{\sqrt{2T} + 2\sqrt{T\log m}}{T} = \max_{X \in \mathcal{S}} \sum_{i=1}^m \left(\frac{1}{T}\sum_{t=1}^T p_t(i)\right)c_i(X) - \frac{\sqrt{2T} + 2\sqrt{T\log m}}{T}.$$

The lemma follows from plugging in our choice of $T$ and since, by the concavity of the functions $c_1, \dots, c_m$, we have that $\frac{1}{T}\sum_{t=1}^T c_i(X_t) \leq c_i\!\left(\frac{1}{T}\sum_{t=1}^T X_t\right)$ for all $i \in [m]$.

3 More Algorithmic Techniques

In this section we detail some algorithmic techniques that we apply in order to derive sublinear time algorithms out of the high-level algorithms presented in section 2. These techniques include fast random estimation procedures and the Frank-Wolfe method for smooth optimization over the set of unit-trace positive semidefinite matrices.

3.1 Random Estimation Techniques

A key step in deriving our sublinear time algorithms is replacing exact computations with random estimations. We use random sampling for the following two problems.

3.1.1 Estimating matrix dot products with constant variance

Let $A, X \in \mathbb{R}^{n \times n}$ be such that $\|A\|_F \leq F$ (for ease of presentation, and without losing any generality, we assume that $F \in \mathbb{N}$) and $\|X\|_F \leq 1$, and assume we are interested in estimating the dot product $A \bullet X$ with constant variance and in $O(F^2\log n)$ time (not including a preprocessing step to construct the sampling distribution). A way to achieve this goal is by sampling entries from the matrix $A$ according to the $\ell_2$ norm of $X$. Consider picking a random entry $(i,j) \in [n] \times [n]$ with probability $\frac{X(i,j)^2}{\|X\|_F^2}$ and estimating the dot product by

$$\tilde{v} \leftarrow A(i,j)\,\frac{\|X\|_F^2}{X(i,j)}. \qquad (6)$$

Observe that

$$\mathbb{E}_{i,j}[\tilde{v}] = \sum_{i,j}\frac{X(i,j)^2}{\|X\|_F^2}\cdot A(i,j)\,\frac{\|X\|_F^2}{X(i,j)} = \sum_{i,j} A(i,j)X(i,j) = A \bullet X.$$

Thus $\tilde{v}$ is an unbiased estimator for the dot product $A \bullet X$. Moreover,
$$\mathrm{Var}[\tilde{v}] \leq \mathbb{E}_{i,j}[\tilde{v}^2] = \sum_{i,j}\frac{X(i,j)^2}{\|X\|_F^2}\cdot\frac{A(i,j)^2\|X\|_F^4}{X(i,j)^2} = \|X\|_F^2\sum_{i,j}A(i,j)^2 \leq \sum_{i,j}A(i,j)^2 = \|A\|_F^2 \leq F^2.$$

In order to achieve an unbiased estimator with constant variance (say, unit variance), one can simply average $F^2$ i.i.d. estimators as in (6); that is, sample $F^2$ entries of $X$, $(i_1,j_1), \dots, (i_{F^2}, j_{F^2})$, i.i.d. according to the $\ell_2$ norm of $X$, and take the estimator

$$\tilde{v} \leftarrow \frac{1}{F^2}\sum_{r=1}^{F^2} A(i_r,j_r)\,\frac{\|X\|_F^2}{X(i_r,j_r)}. \qquad (7)$$

Setting up the $\ell_2$ sampling distribution takes preprocessing time that is linear in the number of non-zeros in $X$. Thus, given $m$ matrices $A_1, \dots, A_m$ with Frobenius norm at most $F$, it is possible to estimate all dot products $A_i \bullet X$, $i \in [m]$, with unit variance in overall $O(\|X\|_0 + F^2(\log n + m))$ time (in particular, note that we need to sample only one set of $F^2$ entries, and not one such set per matrix $A_i$).

Although the estimator $\tilde{v}$ defined in (7) has unit variance, it may be unbounded, which does not allow us to apply standard concentration results. To tackle this problem we use a clipping procedure. We define $v = \mathrm{clip}(\tilde{v}, C) = \max\{-C, \min\{\tilde{v}, C\}\}$ and use $v$ instead of $\tilde{v}$. The new estimator $v$ is no longer unbiased (but it is bounded and has variance at most 1). The following lemma, which we prove in the appendix, ties the resulting bias to the variance.

Lemma 5 Let $X$ be a random variable with finite variance, and let $\tilde{X} = \mathrm{clip}(X, C) = \min\{C, \max\{-C, X\}\}$, where $C$ is a constant satisfying $|\mathbb{E}[X]| \leq C/2$. Then,

$$\left|\mathbb{E}[\tilde{X}] - \mathbb{E}[X]\right| \leq \frac{2}{C}\,\mathrm{Var}[X].$$

Note that in case we apply Lemma 5 to the dot-product estimator $\tilde{v}$ described above, choosing $C$ to be roughly $1/\epsilon$, we get from Lemma 5 that the resulting bias in the estimator $v$ is only $O(\epsilon)$, while $v$ is upper bounded by $O(1/\epsilon)$. There is a slight issue in that choosing $C = \Theta(1/\epsilon)$ imposes an upper bound on $\epsilon$, because of the requirement in Lemma 5 that $|\mathbb{E}[X]| \leq C/2$.
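The estimator (7) and the clipping step can be sketched as follows (NumPy; the random test matrices are ours). Averaging $F^2$ entry samples brings the variance below 1, and the empirical mean stays close to $A \bullet X$:

```python
import numpy as np

def l2_dot_estimator(A, X, F2, rng):
    # Estimator (7): average F^2 i.i.d. single-entry estimators, sampling
    # entry (i, j) with probability X(i, j)^2 / ||X||_F^2.
    flat = X.ravel()
    norm2 = flat @ flat                        # ||X||_F^2
    idx = rng.choice(flat.size, size=F2, p=flat ** 2 / norm2)
    i, j = np.unravel_index(idx, X.shape)
    return np.mean(A[i, j] * norm2 / X[i, j])

def clip(v, C):
    # clip(v, C) = max{-C, min{v, C}}, the bias/variance trade-off of Lemma 5.
    return max(-C, min(v, C))

rng = np.random.default_rng(2)
n, F = 8, 4
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
A *= F / np.linalg.norm(A)                     # scale so that ||A||_F = F
x = rng.standard_normal(n); x /= np.linalg.norm(x)
X = np.outer(x, x)                             # rank one with ||X||_F = 1
ests = [l2_dot_estimator(A, X, F * F, rng) for _ in range(5000)]
print(abs(np.mean(ests) - np.tensordot(A, X)))  # small: the estimator is unbiased
```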
However, in all of the results presented in this paper, this requirement merely translates to the restriction $\epsilon \leq 1$, which is the only case of interest under our assumption of bounded affine/non-affine functions.

3.1.2 Spectral approximation of matrices

Recall that in the dual method (Algorithm 1) we need on each iteration to compute the leading eigenvector of a matrix $A$ given by $A = \sum_{i=1}^m p_t(i)A_i$, where $p_t \in \Delta_m$ and each $A_i \in \mathbb{R}^{n \times n}$ is symmetric. Computing $A$ takes $O(mS)$ time, where $S$ is an upper bound on the number of non-zeros in a single matrix. Instead of computing $A$ we will
be interested in finding a matrix $\tilde{A}$ which is the average of only a few of the matrices $A_1, \dots, A_m$, such that with probability at least $1 - \delta$,

$$\|\tilde{A} - A\|_2 \leq \epsilon, \qquad (8)$$

for some error tolerance $\epsilon$ and failure probability $\delta$. Note that if $\tilde{X}$ satisfies $\tilde{A} \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} \tilde{A} \bullet X - \epsilon$, then $A \bullet \tilde{X} \geq \max_{X \in \mathcal{S}} A \bullet X - O(\epsilon)$, and thus we can find an approximation to the leading eigenvector of $A$ by approximating the leading eigenvector of $\tilde{A}$.

In order to derive an approximation for $A$ in the sense of (8) we rely on the following Bernstein-like inequality for matrices (see [25], Theorem 1.4).

Lemma 6 Let $M_1, \dots, M_k$ be independent zero-mean random symmetric matrices of dimension $n \times n$. Suppose that $\|M_i\|_2 \leq \gamma$ for all $i \in [k]$, and denote $\rho^2 = \left\|\sum_{i=1}^k \mathbb{E}[M_i^2]\right\|_2$. Then for any $\epsilon > 0$ it holds that

$$\mathbb{P}\left(\left\|\sum_{i=1}^k M_i\right\|_2 > \epsilon\right) \leq n\exp\left(\frac{-\epsilon^2/2}{\rho^2 + \gamma\epsilon/3}\right).$$

The next lemma follows directly from Lemma 6 and is tailored to our needs.

Lemma 7 Fix $\epsilon \in (0,1]$. Let $p \in \Delta_m$ be a distribution over $m$ elements and let $A_1, \dots, A_m$ be symmetric matrices in $\mathbb{R}^{n \times n}$ such that $\|A_i\|_2 \leq 1$ for all $i \in [m]$. Let $j_1, \dots, j_k$ be $k$ i.i.d. samples of indices from $[m]$ according to the distribution $p$ (that is, $j_l = i$ w.p. $p_i$). Then it holds that

$$\mathbb{P}\left(\left\|\frac{1}{k}\sum_{l=1}^k A_{j_l} - \sum_{i=1}^m p_i A_i\right\|_2 > \epsilon\right) \leq n\exp\left(-\frac{k\epsilon^2}{10}\right).$$

3.2 The Frank-Wolfe algorithm and approximated RFTL

In order to apply the primal-dual method with the RFTL online algorithm, we need to iteratively solve Problem (5), which takes $O(n^3)$ time (notice that this is just a Euclidean projection onto the set $\mathcal{S}$), and is thus prohibitive when aiming for sublinear running times. Instead, we make do with finding only an approximate solution to objective (5), and show that if the approximation error is sufficiently small then our algorithm still converges. The benefit is that such an approximation can be found in $O(n^2\,\mathrm{poly}(\epsilon^{-1}))$ time.

The Frank-Wolfe method adapted to optimization over the spectrahedron, presented in [15], is an algorithm for maximizing a smooth concave function $f$ over the set $\mathcal{S}$.
The algorithm performs iterations, where each iteration $t$ is comprised of solving the linear problem

$$V_t \leftarrow \arg\max_{V \in \mathcal{S}} V \bullet \nabla f(X_t), \qquad (9)$$

where $X_t$ is the current iterate. The algorithm then produces the next iterate by taking the convex combination $X_{t+1} \leftarrow X_t + \alpha_t(V_t - X_t)$, where $\alpha_t$ is roughly $1/t$. Note that
the solution to (9) is without loss of generality just a leading eigenvector computation, which can be approximated in nearly linear time using the Lanczos method [18]. The Frank-Wolfe algorithm guarantees that after roughly $O(\epsilon^{-1})$ iterations, each of which is comprised of a single eigenvector computation, the final iterate $X_{sol}$ will satisfy

$$\max_{X \in \mathcal{S}} f(X) - f(X_{sol}) \leq \epsilon.$$

We now define the approximated RFTL algorithm and tie its regret bound to the approximation quality of Problem (5).

Definition 3 (Approximated RFTL algorithm) Consider a sequence of concave functions $f_1, \dots, f_T : \mathcal{S} \to \mathbb{R}$ such that $\|\nabla f_t(X)\|_F \leq 1$ for all $t \in [T]$ and $X \in \mathcal{S}$. The approximated RFTL algorithm is as follows. Let $\epsilon > 0$ and let $X_1$ be an arbitrary point in $\mathcal{S}$. For time $t \geq 1$ define the function

$$F_t(X) = \frac{1}{\sqrt{2T}}\sum_{\tau=1}^t \nabla f_\tau(X_\tau) \bullet X - \frac{1}{2}\|X\|_F^2$$

and choose a point $X_{t+1} \in \mathcal{S}$ that satisfies:

$$\max_{X \in \mathcal{S}} F_t(X) - F_t(X_{t+1}) \leq \epsilon.$$

Lemma 8 The approximated RFTL algorithm satisfies:

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T f_t(X) - \sum_{t=1}^T f_t(X_t) \leq \sqrt{2T} + T\sqrt{\epsilon}.$$

Proof Consider the sequence of linear functions $\tilde{f}_t(X) = \nabla f_t(X_t) \bullet X$, $t \in [T]$, and denote by $\tilde{X}_t$, $t \in [T]$, the sequence of points played by the original RFTL algorithm, as formulated in Equation (5), with respect to the sequence of functions $\tilde{f}_1, \dots, \tilde{f}_T$. By Lemma 2 we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T \nabla f_t(X_t) \bullet X - \sum_{t=1}^T \nabla f_t(X_t) \bullet \tilde{X}_t \leq \sqrt{2T}. \qquad (10)$$

Note that $F_t(X)$ is 1-strongly-concave with respect to the norm $\|\cdot\|_F$. Thus, since by the definition of $X_{t+1}$ we have $F_t(\tilde{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon$, it holds for all $t$ that

$$\|\tilde{X}_{t+1} - X_{t+1}\|_F^2 \leq F_t(\tilde{X}_{t+1}) - F_t(X_{t+1}) \leq \epsilon.$$

Using the Cauchy-Schwarz inequality we have that

$$\sum_{t=1}^T \nabla f_t(X_t) \bullet \tilde{X}_t = \sum_{t=1}^T \nabla f_t(X_t) \bullet X_t + \sum_{t=1}^T \nabla f_t(X_t) \bullet (\tilde{X}_t - X_t) \geq \sum_{t=1}^T \nabla f_t(X_t) \bullet X_t - \sum_{t=1}^T \|\nabla f_t(X_t)\|_F \|\tilde{X}_t - X_t\|_F \geq \sum_{t=1}^T \nabla f_t(X_t) \bullet X_t - T\sqrt{\epsilon}. \qquad (11)$$
Plugging (11) into (10), we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T \nabla f_t(X_t) \bullet X - \sum_{t=1}^T \nabla f_t(X_t) \bullet X_t \leq \sqrt{2T} + T\sqrt{\epsilon}.$$

Finally, using the concavity of the functions $f_t$, we have that

$$\max_{X \in \mathcal{S}} \sum_{t=1}^T f_t(X) - \sum_{t=1}^T f_t(X_t) \leq \sqrt{2T} + T\sqrt{\epsilon}.$$

Note that the objective in (5) is indeed concave and smooth, as required by the Frank-Wolfe algorithm in [15]. The following lemma states the computation time of the update step of the approximated RFTL algorithm described in Definition 3 using the Frank-Wolfe algorithm [15]. For a proof see [15].

Lemma 9 Fix $\epsilon > 0$ and $t \in [T]$, and let $F_t(X)$ be as in Definition 3. The Frank-Wolfe algorithm in [15] computes a matrix $X \in \mathcal{S}$ such that $\max_{Y \in \mathcal{S}} F_t(Y) - F_t(X) \leq \epsilon$ in time $O\left(\frac{n^2\sqrt{t}}{\epsilon}\right)$.

4 Sublinear Time Algorithm for the Affine Saddle-Point Problem

In this section we present our main algorithmic result, a sublinear time algorithm for approximately solving (1). As stated in Theorem 1, throughout this section we assume that there exist parameters $F, S$ such that for any constraint $i \in [m]$ it holds that $\|A_i\|_2 \leq 1$, $|b_i| \leq 1$, $\|A_i\|_F \leq F$, and the matrix $A_i$ contains at most $S$ non-zero entries.

The algorithm is based on Algorithm 1 and the random acceleration techniques described in section 3. We replace the MW algorithm applied in Algorithm 1 with the following variant from [8], which is suitable for the case in which the linear losses are random variables with bounded variance but do not necessarily have constant $\ell_\infty$ norm (line 13 of the algorithm).

Definition 4 (Variance MW) Consider a sequence of linear losses $q_1, \dots, q_T \in \mathbb{R}^m$. The Variance MW algorithm is as follows. Let $0 < \eta \in \mathbb{R}$, $w_1 \leftarrow \mathbf{1}_m$, and for $t \geq 1$,

$$p_t \leftarrow w_t / \|w_t\|_1, \qquad w_{t+1}(i) \leftarrow w_t(i)\left(1 - \eta q_t(i) + \eta^2 q_t(i)^2\right).$$

The following lemma bounds the regret of the above described algorithm. For a proof see [8].

Lemma 10 Let $\eta$ be such that $\eta \leq \min_{t \in [T],\, i \in [m]} 1/|q_t(i)|$.
Lemma 10 Let η be such that η < min_{t∈[T], i∈[m]} 1/|q_t(i)|. Then the multiplicative updates algorithm satisfies

Σ_{t=1}^T p_t • q_t ≤ min_{i∈[m]} Σ_{t=1}^T max{q_t(i), −1/η} + (log m)/η + η Σ_{t=1}^T p_t • q_t²,

where we have denoted by q_t² the vector whose ith entry is q_t(i)².
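As a quick illustration, the update rule of Definition 4 is a one-liner. The following NumPy sketch (our own toy code, not part of the formal algorithm) runs a few steps of it; since the quadratic 1 − x + x² is always positive, the weights never become non-positive:

```python
import numpy as np

def variance_mw_step(w, q, eta):
    # w_{t+1}(i) = w_t(i) * (1 - eta*q(i) + eta^2 * q(i)^2)  (Definition 4)
    return w * (1.0 - eta * q + (eta * q) ** 2)

rng = np.random.default_rng(0)
m, eta = 3, 0.1
w = np.ones(m)                     # w_1 = 1_m
for _ in range(5):
    p = w / w.sum()                # p_t = w_t / ||w_t||_1
    q = rng.uniform(-1.0, 1.0, m)  # losses with |q_t(i)| <= 1 < 1/eta
    w = variance_mw_step(w, q, eta)

assert np.all(w > 0)               # 1 - x + x^2 >= 3/4 > 0 for all real x
```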
Instead of feeding the MW algorithm with the exact values c_i(X_t) = A_i • X_t − b_i as done in Algorithm 1, we only use estimates of these values, obtained via the l_2-sampling technique described in Section 3 (lines 9-14 of the algorithm). Also, instead of performing exact leading eigenvector computations with respect to the matrices S_t = Σ_{i=1}^m p_t(i)A_i as suggested in Algorithm 1, we approximate the matrix S_t by the matrix M_t using roughly O(ε^{-2}) samples from A_1,...,A_m (lines 5-7), as explained in Subsection 3.1.2, and then compute an approximated leading eigenvector of M_t using the randomized Lanczos method, up to an additive error ε̄ and with failure probability δ (line 8).

Algorithm 3
1: Input: error tolerance ε ∈ (0,1], an instance of Problem (1) and a parameter F ∈ N that satisfy the requirements of Theorem 1.
2: Let ε̄ ← ε/6, T ← 60²ε̄^{-2} log m, w_1 ← 1_m, η ← √((4 log m)/(3T)), k ← 10³ε̄^{-2} log(20nT), δ ← 1/(20T)
3: for t = 1 to T do
4:   p_t ← w_t/||w_t||_1
5:   Let i_1,...,i_k be k indices from [m] sampled i.i.d. according to p_t
6:   p̃_t ← (1/k) Σ_{l=1}^k e_{i_l}
7:   M_t ← Σ_{i=1}^m p̃_t(i)A_i
8:   x_t ← EV(M_t, ε̄, δ)
9:   Sample F² entries (j_t^1, l_t^1),...,(j_t^{F²}, l_t^{F²}) ∈ [n]×[n] i.i.d. by (j_t^r, l_t^r) ← (j, l) w.p. x_t(j)²x_t(l)²
10:  for i ∈ [m] do
11:    ṽ_t(i) ← (1/F²) Σ_{r=1}^{F²} A_i(j_t^r, l_t^r)/(x_t(j_t^r)x_t(l_t^r)) − b_i
12:    v_t(i) ← clip(ṽ_t(i), 1/η)
13:    w_{t+1}(i) ← w_t(i)(1 − ηv_t(i) + η²v_t(i)²)
14:  end for
15: end for
16: return X̄ = (1/T) Σ_{t=1}^T x_t x_t^⊤, p̄ = (1/T) Σ_{t=1}^T p̃_t

We turn to analyze Algorithm 3. We begin by introducing some auxiliary lemmas that will be used in the proof of Theorem 1. The following lemma gives guarantees on the eigenvector computations carried out in line 8 of the algorithm. The proof relies on the Lanczos method for computing the largest eigenvalue of a positive semidefinite matrix, and is deferred to the appendix.
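For intuition, the ℓ2-sampling step of Algorithm 3 (lines 9-14) estimates A_i • x_t x_t^⊤ − b_i by sampling matrix entries with probability proportional to the squared entries of x_t. Here is a small NumPy sketch of this estimator; the code and variable names are our own illustration, not part of the paper's pseudocode:

```python
import numpy as np

def l2_sample_estimate(A, x, b, num_samples, rng):
    """Estimate A . (x x^T) - b for a unit vector x by sampling entry (j, l)
    w.p. x(j)^2 x(l)^2 and averaging A[j, l] / (x[j] * x[l]); the estimator
    is unbiased and its second moment is controlled by ||A||_F^2."""
    p = x ** 2                                   # sums to 1 since ||x||_2 = 1
    j = rng.choice(len(x), size=num_samples, p=p)
    l = rng.choice(len(x), size=num_samples, p=p)
    return np.mean(A[j, l] / (x[j] * x[l])) - b

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
A /= np.linalg.norm(A)                           # scale so that ||A||_F = 1
x = rng.standard_normal(n); x /= np.linalg.norm(x)
exact = x @ A @ x - 0.5
est = l2_sample_estimate(A, x, 0.5, 20000, rng)
assert abs(est - exact) < 0.1                    # concentrates around the truth
```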
Lemma 11 Given a matrix M with N non-zeros such that ||M||_2 = O(1), and parameters ε, δ > 0, the eigenvector computation in line 8 can be implemented using the Lanczos method to run in time O(Nε^{-1/2} log(n/δ)), such that the returned vector x satisfies with probability at least 1 − δ: x^⊤Mx ≥ λ_max(M) − ε.

The following lemma gives concentration bounds on our random variables from their expectations. The proof is deferred to the appendix.

Lemma 12 For √((4 log m)/(3T)) ≤ η ≤ 1/4, it holds with probability at least 1 − O(1/m) that
1. max_{i∈[m]} Σ_{t=1}^T [v_t(i) − (A_i • x_t x_t^⊤ − b_i)] ≤ 5ηT,
2. Σ_{t=1}^T Σ_{i=1}^m p_t(i)(A_i • x_t x_t^⊤ − b_i) − Σ_{t=1}^T p_t • v_t ≤ 4ηT.

The following lemma is a simple application of Markov's inequality. The proof is deferred to the appendix.

Lemma 13 With probability at least 3/4 it holds that Σ_{t=1}^T p_t • v_t² ≤ 16T.

Finally, the following lemma is a simple application of Hoeffding's inequality.

Lemma 14 For any t ∈ [T] and ε > 0 it holds that

P( |Σ_{i=1}^m p̃_t(i)b_i − Σ_{i=1}^m p_t(i)b_i| > ε ) ≤ 2 exp(−2kε²).

We can now prove Theorem 1.

Proof We assume that the guarantees of Lemmas 12 and 13 hold, and that for all t ∈ [T] the guarantees of Lemmas 7, 11 and 14 hold. Using the union bound and our choice of k, δ, these guarantees all hold simultaneously with probability at least 1/2. From Lemma 10 and the use of the clipping procedure in line 12 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T v_t(i) + (log m)/η + η Σ_{t=1}^T p_t • v_t².

Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 12 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + 5ηT + (log m)/η + η Σ_{t=1}^T p_t • v_t².

Plugging in Lemma 13 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + 5ηT + (log m)/η + 16ηT.

Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 12 we have that
Σ_{t=1}^T Σ_{i=1}^m p_t(i)(A_i • x_t x_t^⊤ − b_i) − 4ηT ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + (log m)/η + 21ηT.

By Lemmas 7 and 14 we can replace p_t(i) with p̃_t(i) and get:

Σ_{t=1}^T Σ_{i=1}^m p̃_t(i)(A_i • x_t x_t^⊤ − b_i) − 2ε̄T ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + (log m)/η + 25ηT.

By the guarantee of Lemma 11 on the eigenvector computation in line 8 we have that

Σ_{t=1}^T max_{X∈S} Σ_{i=1}^m p̃_t(i)(A_i • X − b_i) − 3ε̄T ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + (log m)/η + 25ηT.

It holds that

Σ_{t=1}^T max_{X∈S} Σ_{i=1}^m p̃_t(i)(A_i • X − b_i) ≥ max_{X∈S} Σ_{t=1}^T Σ_{i=1}^m p̃_t(i)(A_i • X − b_i) = T · max_{X∈S} Σ_{i=1}^m p̄(i)(A_i • X − b_i).

Thus we have that

T · max_{X∈S} Σ_{i=1}^m p̄(i)(A_i • X − b_i) − 3ε̄T ≤ min_{i∈[m]} Σ_{t=1}^T (A_i • x_t x_t^⊤ − b_i) + (log m)/η + 25ηT.

Dividing through by T, rearranging and plugging in our choices for T, η we get:

max_{X∈S} Σ_{i=1}^m p̄(i)(A_i • X − b_i) − 3ε̄ − ε̄/2 ≤ min_{i∈[m]} ( A_i • (1/T) Σ_{t=1}^T x_t x_t^⊤ − b_i ) = min_{i∈[m]} (A_i • X̄ − b_i).

Plugging in our choice of ε̄ yields the approximation guarantee stated in the theorem.

We now turn to discuss the arithmetic complexity of the algorithm. The algorithm performs O(ε^{-2} log m) iterations, each comprising three major steps: (i) averaging O(ε^{-2} log n) matrices sampled i.i.d. according to the distribution p_t, (ii) computing an approximated leading eigenvector, and (iii) updating the weights vector w_t. Step (i) is implemented trivially with O(Sε^{-2} log n) arithmetic operations. Step (ii), according to Lemma 11, can be carried out with O(||M_t||_0 ε^{-0.5} log n) arithmetic operations, where ||M_t||_0 is the number of non-zero entries in the matrix M_t.
This number is trivially bounded by min{Sε^{-2} log n, n²}. Note also that since M_t is the average of matrices with spectral norm at most 1, it holds that ||M_t||_2 ≤ 1, as required in Lemma 11. Step (iii) requires O(||x_t||_0) ≤ O(||M_t||_0) preprocessing time to generate the l_2-distribution, and additional O(F²(log n + m)) time to generate and process the samples. Summing all contributions yields the total arithmetic complexity. Finally, the conclusion regarding the rank of the returned solution X̄ follows from the observation that X̄ is the average of O(ε^{-2} log m) rank-one matrices.

5 Increasing the Probability of Success

In this section we derive high probability results for Algorithm 3. Our results are based on showing that, given a pair of primal-dual solutions generated by Algorithm 3, it is possible to verify their quality with high probability and in sublinear time. Our algorithm for verifying a candidate primal solution is given below. The algorithm assumes that the candidate X̄ is of the form prescribed in Algorithm 3, that is X̄ = (1/T) Σ_{t=1}^T x_t x_t^⊤, and is available implicitly as the unit vectors x_1, x_2,...,x_T ∈ R^n.

Algorithm 4
1: Input: an instance of Problem (1), a candidate solution X̄ given by unit vectors x_1,...,x_T ∈ R^n, F ∈ N s.t. for all i ∈ [m]: ||A_i||_F ≤ F, error tolerance ε > 0, failure probability δ.
2: k ← 16 log(2m/δ)/ε²
3: for s = 1 to k do
4:   Sample an index t_s ∈ [T] uniformly at random
5:   Sample F² entries (j_1, l_1),...,(j_{F²}, l_{F²}) ∈ [n]×[n] i.i.d. by (j_r, l_r) ← (j, l) w.p. x_{t_s}(j)²x_{t_s}(l)²
6:   for i = 1 to m do
7:     ṽ_i(s) ← (1/F²) Σ_{r=1}^{F²} A_i(j_r, l_r)/(x_{t_s}(j_r)x_{t_s}(l_r))
8:     v_i(s) ← clip(ṽ_i(s), 4/ε)
9:   end for
10: end for
11: return v_i = (1/k) Σ_{s=1}^k v_i(s) for all i ∈ [m].

Lemma 15 Algorithm 4 runs in time O( (log(m/δ)/ε²)(min{Sε^{-2} log n, n} + mF²) ) and returns values v_1,...,v_m such that with probability at least 1 − δ it holds that for all i ∈ [m]: |A_i • X̄ − v_i| ≤ ε.
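The sample budget in line 2 of Algorithm 4 grows only logarithmically with the number of constraints m and with 1/δ, and quadratically with 1/ε. A trivial helper (our own, purely illustrative) makes this dependence explicit:

```python
import math

def verification_samples(m, delta, eps):
    # k = 16 * log(2m / delta) / eps^2, as in line 2 of Algorithm 4
    return math.ceil(16.0 * math.log(2.0 * m / delta) / eps ** 2)

k1 = verification_samples(m=10**6, delta=0.01, eps=0.1)
k2 = verification_samples(m=10**12, delta=0.01, eps=0.1)  # 10^6-fold more constraints
assert k2 < 2 * k1                                        # ...less than doubles k
assert verification_samples(10**6, 0.01, 0.05) > 3 * k1   # halving eps ~quadruples k
```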
Proof As in the analysis of Algorithm 3, the l_2-sampling in lines 5-9 guarantees for all s ∈ [k] and i ∈ [m] that the conditional expectation satisfies E[ṽ_i(s) | t_s] = A_i • x_{t_s}x_{t_s}^⊤, and that the conditional second moment satisfies E[ṽ_i(s)² | t_s] ≤ 1. Since t_s is chosen uniformly at random we have that

E[ṽ_i(s)] = (1/T) Σ_{t=1}^T A_i • x_t x_t^⊤ = A_i • X̄.

Also it holds that E[ṽ_i(s)²] ≤ 1. Thus for all i ∈ [m] and s ∈ [k] it holds that E[v_i(s)²] ≤ E[ṽ_i(s)²] ≤ 1 and |v_i(s)| ≤ 4/ε. By Bernstein's inequality we have that
P( |v_i − E[v_i]| ≥ ε/2 ) ≤ 2 exp( −(k²ε²/8) / (k + (4/ε)(kε/2)/3) ) ≤ 2 exp(−kε²/16).

Using the properties of the clipping procedure (see Lemma 5), since E[ṽ_i(s)] = A_i • X̄ we have that, with probability at least 1 − 2 exp(−kε²/16),

|v_i − A_i • X̄| = |v_i − (1/k) Σ_{s=1}^k E[ṽ_i(s)]|
≤ |v_i − E[v_i]| + (1/k) Σ_{s=1}^k |E[v_i(s)] − E[ṽ_i(s)]|
≤ ε/2 + ε/2 = ε.

Using the union bound, the last inequality holds for all i ∈ [m] simultaneously with probability at least 1 − 2m exp(−kε²/16). Choosing k = 16 log(2m/δ)/ε² guarantees that all inequalities hold with probability at least 1 − δ. The running time is due to the number of iterations k, each comprising O(||x_{t_s}||_0) preprocessing time to set up the l_2-sampling distribution from the vector x_{t_s}, and additional O(mF²) time to process the samples. As detailed in the analysis of Algorithm 3, it holds that ||x_{t_s}||_0 ≤ min{Sε^{-2} log n, n}.

Lemma 16 Given parameters ε, δ > 0, Algorithm 3 can be implemented to run in time

O( log(1/δ) ( T_ALG(ε) + T_VER(ε, δ) + min{Sε^{-4} log(n) log(m), n²} ε^{-0.5} log(n/δ) ) ),

where T_ALG(ε) denotes the running time of Algorithm 3 with tolerance ε, and T_VER(ε, δ) denotes the running time of the primal verification procedure given in Algorithm 4 with tolerance ε and failure probability δ, such that with probability at least 1 − δ the returned primal-dual solutions (X̄, p̄) satisfy:

min_{p∈Δ_m} Σ_{i=1}^m p_i(A_i • X̄ − b_i) ≥ max_{X∈S} Σ_{i=1}^m p̄_i(A_i • X − b_i) − ε.

Proof We run Algorithm 3 N = O(log(1/δ)) times with error parameter ε/4. This guarantees that with probability at least 1 − δ/3, one of the returned primal-dual pairs (X̄_j, p̄_j) satisfies Theorem 1 with error at most ε/4. We can verify all primal solutions up to error ε/4 and with failure probability δ/(3N) by running Algorithm 4 on all
primal solutions. In order to verify a dual solution p̄ we need to approximate with high probability the value

max_{X∈S} Σ_{i=1}^m p̄_i(A_i • X − b_i) = λ_max( Σ_{i=1}^m p̄_i A_i ) − Σ_{i=1}^m p̄_i b_i.

We can perform this verification up to error ε/4 and with failure probability δ/(3N) by applying Lemma 11 to the matrix M = Σ_{i=1}^m p̄(i)A_i with error parameter ε/4 and failure probability δ/(3N). Note that M can be aggregated during the run of Algorithm 3 without increasing the computational complexity. Thus with probability at least 1 − δ we will get a pair of primal-dual solutions that satisfy Theorem 1, and all verifications will be correct. Finally, note that on every iteration t of Algorithm 3 it holds that ||M_t||_0 ≤ min{Sε^{-2} log n, n²}, which means that ||M||_0 ≤ min{Sε^{-4} log(n) log(m), n²}. Thus verifying a dual solution according to Lemma 11 takes at most O( min{Sε^{-4} log(n) log(m), n²} ε^{-0.5} log(n/δ) ) arithmetic operations.

6 Sublinear Time Algorithm for the Non-Affine Saddle-Point Problem

In this section we describe and analyse our algorithm for non-affine constraints c_i(X) ≥ 0, i ∈ [m], where each c_i is concave and differentiable. Throughout this section we make the following assumptions, for all i ∈ [m] and X ∈ S:

1. |c_i(X)| ≤ 1.
2. ||∇c_i(X)||_F ≤ 1.
3. ∇c_i(X) can be computed in O(n²) time.
4. There exists a procedure Sample(X, c_1,...,c_m) that returns in O(m + n²) time unbiased estimators of the values c_1(X),...,c_m(X), each with variance at most 1.

Example and justification of the assumptions. Although these assumptions are relatively strong, they hold for example for the important case of the square loss, that is c(X) = (A • X − b)² for some matrix A ∈ R^{n×n} and scalar b ∈ R. Note that for this function conditions 1, 2 hold since proper scaling can be applied, and condition 3 holds trivially.
Estimating the value of c(X) in O(1) time and with unit variance can be done by taking two independent unbiased estimators v_1, v_2 of the value A • X − b, using the l_2-sampling technique described in Section 3, and forming the estimator v = v_1v_2. Since v_1, v_2 are independent we have that E[v] = E[v_1]E[v_2] = (A • X − b)². Also E[v²] = E[v_1²v_2²] = E[v_1²]E[v_2²], which means that in case ||A||_F ≤ 1, the estimator v also has unit variance.

Remark 1 In case only a constant number of the functions c_1,...,c_m are non-affine and all the rest are affine, assumption 4 can be replaced with the assumption that c_i(X) can be computed in O(n²) time for every non-affine function c_i and X ∈ S. This allows keeping the running time sublinear without the need to implement the procedure Sample for the non-affine constraints (recall that for affine constraints we implement Sample using the l_2-sampling technique).
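A small simulation (our own sketch, with A scaled so that ||A||_F ≤ 1 as in the discussion above) of this product-of-estimators trick for the square loss:

```python
import numpy as np

def l2_estimates(A, x, b, size, rng):
    # independent l2-sampling estimates of A . (x x^T) - b (x a unit vector)
    p = x ** 2
    j = rng.choice(len(x), size=size, p=p)
    l = rng.choice(len(x), size=size, p=p)
    return A[j, l] / (x[j] * x[l]) - b

rng = np.random.default_rng(0)
n, b = 30, 0.3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
A /= np.linalg.norm(A)                  # scale so that ||A||_F = 1
x = rng.standard_normal(n); x /= np.linalg.norm(x)
target = (x @ A @ x - b) ** 2           # c(X) at X = x x^T

v1 = l2_estimates(A, x, b, 20000, rng)  # two independent batches
v2 = l2_estimates(A, x, b, 20000, rng)
assert abs(np.mean(v1 * v2) - target) < 0.1  # E[v1 * v2] = (A . X - b)^2
```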
Our algorithm is based on Algorithm 2. The algorithm resembles Algorithm 3 except that, instead of producing the iterates X_t by offline optimization (eigenvector computations), they are produced by the approximated regularized follow-the-leader online algorithm defined in Definition 3 (line 6 of Algorithm 5), where the RFTL objective (5) is solved up to an additive error ε̄ using the Frank-Wolfe algorithm (see Lemma 9). A problem with applying the RFTL algorithm as done in Algorithm 2 is that on each iteration we are required to compute the gradients of m functions, ∇c_1(X_t),...,∇c_m(X_t), which under our assumptions takes O(mn²) time. To overcome this hurdle we use a standard trick in stochastic optimization: instead of using the entire gradient at time t, Σ_{i=1}^m p_t(i)∇c_i(X_t), we sample a single gradient ∇c_{i_t}(X_t), where i_t ← i with probability p_t(i) (line 5 of the algorithm). Applying RFTL with the sampled gradients achieves with high probability the same regret bound as in Lemma 8. The l_2-sampling technique used to estimate the values c_i(X_t) in Algorithm 3 is replaced with a call to the generic sampling oracle Sample(X_t, c_1,...,c_m) (line 7 of the algorithm). The algorithm is given below.

Algorithm 5
1: Input: error tolerance ε ∈ (0,1], an instance of Problem (5) and a procedure Sample(·) that satisfy the requirements of Theorem 3.
2: Let ε̄ ← ε²/9, T ← 60²ε̄^{-2} log m, w_1 ← 1_m, η ← √((4 log m)/(3T))
3: for t = 1 to T do
4:   p_t ← w_t/||w_t||_1
5:   Choose i_t ∈ [m] by i_t ← i w.p. p_t(i)
6:   X_{t+1} ← ε̄-approximated solution to max_{X∈S} (1/√(2T)) Σ_{τ=1}^t ∇c_{i_τ}(X_τ) • X − (1/2)||X||_F²
7:   ṽ_t ← Sample(X_t, c_1,...,c_m)
8:   for i ∈ [m] do
9:     v_t(i) ← clip(ṽ_t(i), 1/η)
10:    w_{t+1}(i) ← w_t(i)(1 − ηv_t(i) + η²v_t(i)²)
11:  end for
12: end for
13: return X̄ = (1/T) Σ_{t=1}^T X_t, p̄ = (1/T) Σ_{t=1}^T e_{i_t}

We turn to analyze Algorithm 5 and to prove Theorem 3. The proof of the following concentration result is given in the appendix.
Lemma 17 For √((4 log m)/(3T)) ≤ η ≤ 1/4, it holds with probability at least 1 − O(1/m) that

1. max_{i∈[m]} Σ_{t=1}^T [v_t(i) − c_i(X_t)] ≤ 5ηT,
2. Σ_{t=1}^T c_{i_t}(X_t) − Σ_{t=1}^T p_t • v_t ≤ 4ηT.

We can now prove Theorem 3.

Proof For the proof we assume that the guarantees stated in Lemmas 17 and 13 hold. Using the union bound this happens with probability at least 1/2. From Lemma 10 and the use of the clipping procedure in line 9 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T v_t(i) + (log m)/η + η Σ_{t=1}^T p_t • v_t².

Starting from the right-hand side of the inequality, plugging in result (1) of Lemma 17 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T c_i(X_t) + 5ηT + (log m)/η + η Σ_{t=1}^T p_t • v_t².

Plugging in Lemma 13 we have that

Σ_{t=1}^T p_t • v_t ≤ min_{i∈[m]} Σ_{t=1}^T c_i(X_t) + 5ηT + (log m)/η + 16ηT.

Turning to the left-hand side of the inequality, plugging in result (2) of Lemma 17 we have that

Σ_{t=1}^T c_{i_t}(X_t) − 4ηT ≤ min_{i∈[m]} Σ_{t=1}^T c_i(X_t) + (log m)/η + 21ηT.

By Lemma 8 we have that

max_{X∈S} Σ_{t=1}^T c_{i_t}(X) − √(2T) − T√ε̄ ≤ min_{i∈[m]} Σ_{t=1}^T c_i(X_t) + (log m)/η + 25ηT.

Since Σ_{t=1}^T c_{i_t}(X) = Σ_{i=1}^m |{t : i_t = i}| c_i(X) = T Σ_{i=1}^m p̄_i c_i(X), we have that

T max_{X∈S} Σ_{i=1}^m p̄_i c_i(X) − √(2T) − T√ε̄ ≤ min_{i∈[m]} Σ_{t=1}^T c_i(X_t) + (log m)/η + 25ηT.

By the concavity of the functions c_1,...,c_m we have that

T max_{X∈S} Σ_{i=1}^m p̄_i c_i(X) − √(2T) − T√ε̄ ≤ T min_{i∈[m]} c_i(X̄) + (log m)/η + 25ηT.

The result follows by dividing through by T and plugging in our choices for ε̄, η, T.
Algorithm 5 performs T = O(ε̄^{-2} log m) iterations, each comprising an update step of the approximated RFTL algorithm and a call to the procedure Sample(·). From Lemma 9 we have that the time complexity of the approximated RFTL update step is bounded by O(n²ε^{-4} log m), and under our assumptions, a call to the procedure Sample(·) takes O(m + n²) time. Hence we get the running time stated in Theorem 3.

7 Lower bounds

In this section we prove Theorem 2. Our lower bounds are information-theoretic, meaning that any successful algorithm must read a certain number of entries from the input. The core machinery behind these bounds is the following folklore fact (see [8] for more details).

Proposition 1 Consider a one dimensional array of length r such that with probability 1/2 the array contains only zeros, and with the remaining probability 1/2 the array contains a single non-zero entry, located uniformly at random, and zeros in all other entries. Then any algorithm that with probability at least 2/3 determines whether the array contains only zeros or not must read Ω(r) entries from the array.

The following lemma describes how we use the above proposition to prove our lower bounds.

Lemma 18 Fix an error tolerance ε and the parameters m, n. Suppose that there exists a distribution over pairs of sets of m matrices, A_1^1,...,A_m^1 ∈ R^{n×n}, A_1^2,...,A_m^2 ∈ R^{n×n}, and a set B ⊆ [m]×[n]×[n], such that:

1. |max_{X∈S} min_{i∈[m]} A_i^1 • X − max_{X∈S} min_{i∈[m]} A_i^2 • X| > ε.
2. There exists a triplet (i, j, k) ∈ B such that A_i^1(j, k) ≠ A_i^2(j, k), and for all (i', j', k') ∈ B with (i', j', k') ≠ (i, j, k) it holds that A_{i'}^1(j', k') = A_{i'}^2(j', k').
3. The distinctive triplet (i, j, k) in the previous condition is distributed uniformly among all triplets in B.

Then any ε additive approximation algorithm for Problem (1) that succeeds with probability at least 2/3 has running time Ω(|B|).

Proof Assume that a distribution and a set B satisfying the conditions of the lemma exist.
Given matrices A_1,...,A_m ∈ R^{n×n}, define ARRAY(A_1,...,A_m) to be a one dimensional array of length |B| whose entries are exactly the entries A_i(j, k) with (i, j, k) ∈ B, organized into a one dimensional array in some fixed order. Let A_1^1,...,A_m^1, A_1^2,...,A_m^2 be two sets of matrices drawn from the distribution. Denote by ARR_1 the array ARRAY(A_1^1,...,A_m^1) and by ARR_2 the array ARRAY(A_1^2,...,A_m^2). Now, given an ε-additive approximation algorithm for Problem (1) that succeeds with probability at least 2/3, we do the following: with probability 1/2 we run it on the set of matrices A_1^1,...,A_m^1, and with the remaining probability we run it on A_1^2,...,A_m^2. Note that since the optima of these two instances differ by more than ε, the algorithm must distinguish between these two instances, which by Proposition 1 requires reading Ω(|B|) of the entries indexed by B.
More informationDesign and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016
Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)
More informationRandomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time
Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Sometimes Sublinear Run Time Mengdi Wang Department of Operations Research and Financial Engineering, Princeton
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationA Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming
A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationLecture 19: Follow The Regulerized Leader
COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationU.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016
U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationOnline Learning with Experts & Multiplicative Weights Algorithms
Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationOnline Convex Optimization with Stochastic Constraints
Online Convex Optimization with Stochastic Constraints Hao Yu, Michael J. Neely, Xiaohan Wei Department of Electrical Engineering University of Southern California 1 arxiv:1708.03741v1 [math.oc] 1 Aug
More informationConvex optimization. Javier Peña Carnegie Mellon University. Universidad de los Andes Bogotá, Colombia September 2014
Convex optimization Javier Peña Carnegie Mellon University Universidad de los Andes Bogotá, Colombia September 2014 1 / 41 Convex optimization Problem of the form where Q R n convex set: min x f(x) x Q,
More informationOnline Submodular Minimization
Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com
More informationGeometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as
Chapter 8 Geometric problems 8.1 Projection on a set The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as dist(x 0,C) = inf{ x 0 x x C}. The infimum here is always achieved.
More informationBandits for Online Optimization
Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationStochastic and Adversarial Online Learning without Hyperparameters
Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford
More informationConvex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions
Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationLecture Approximate Potentials from Approximate Flow
ORIE 6334 Spectral Graph Theory October 20, 2016 Lecturer: David P. Williamson Lecture 17 Scribe: Yingjie Bi 1 Approximate Potentials from Approximate Flow In the last lecture, we presented a combinatorial
More informationapproximation algorithms I
SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationCSC Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming
CSC2411 - Linear Programming and Combinatorial Optimization Lecture 10: Semidefinite Programming Notes taken by Mike Jamieson March 28, 2005 Summary: In this lecture, we introduce semidefinite programming
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationCS261: Problem Set #3
CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationIntroduction to Bandit Algorithms. Introduction to Bandit Algorithms
Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationA full-newton step infeasible interior-point algorithm for linear programming based on a kernel function
A full-newton step infeasible interior-point algorithm for linear programming based on a kernel function Zhongyi Liu, Wenyu Sun Abstract This paper proposes an infeasible interior-point algorithm with
More informationNonlinear Programming
Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week
More informationFirst Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate
58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationU.C. Berkeley CS294: Beyond Worst-Case Analysis Handout 12 Luca Trevisan October 3, 2017
U.C. Berkeley CS94: Beyond Worst-Case Analysis Handout 1 Luca Trevisan October 3, 017 Scribed by Maxim Rabinovich Lecture 1 In which we begin to prove that the SDP relaxation exactly recovers communities
More informationLecture 7: Semidefinite programming
CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational
More informationA priori bounds on the condition numbers in interior-point methods
A priori bounds on the condition numbers in interior-point methods Florian Jarre, Mathematisches Institut, Heinrich-Heine Universität Düsseldorf, Germany. Abstract Interior-point methods are known to be
More informationSemidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4
Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 4 Instructor: Farid Alizadeh Scribe: Haengju Lee 10/1/2001 1 Overview We examine the dual of the Fermat-Weber Problem. Next we will
More information