A Deterministic Rescaled Perceptron Algorithm


Javier Peña (Tepper School of Business, Carnegie Mellon University, USA, jfp@andrew.cmu.edu)
Negar Soheili (Tepper School of Business, Carnegie Mellon University, USA, nsoheili@andrew.cmu.edu)

June 5, 2013

Abstract

The perceptron algorithm is a simple iterative procedure for finding a point in a convex cone $F$. At each iteration, the algorithm only involves a query to a separation oracle for $F$ and a simple update on a trial solution. The perceptron algorithm is guaranteed to find a point in $F$ after $O(1/\tau_F^2)$ iterations, where $\tau_F$ is the width of the cone $F$. We propose a version of the perceptron algorithm that includes a periodic rescaling of the ambient space. In contrast to the classical version, our rescaled version finds a point in $F$ in $O(m^5 \log(1/\tau_F))$ perceptron updates. This result is inspired by and strengthens the previous work on randomized rescaling of the perceptron algorithm by Dunagan and Vempala [Math. Program. 114 (2006), 101-114] and by Belloni, Freund, and Vempala [Math. Oper. Res. 34 (2009), 621-641]. In particular, our algorithm and its complexity analysis are simpler and shorter. Furthermore, our algorithm does not require randomization or deep separation oracles.

1 Introduction

The relaxation method, introduced in the classical articles of Agmon [1], and Motzkin and Schoenberg [16], is a conceptual algorithmic scheme for solving the feasibility problem

$$y \in F. \qquad (1)$$

Here $F \subseteq \mathbb{R}^m$ is assumed to be an open convex set with an available separation oracle: given a test point $y \in \mathbb{R}^m$, the oracle either certifies that $y \in F$ or else it finds a hyperplane separating $y$ from $F$, that is, $u \in \mathbb{R}^m$, $b \in \mathbb{R}$ such that $\langle u, y \rangle \le b$ and $\langle u, v \rangle > b$ for all $v \in F$. The relaxation method starts with an arbitrary initial trial solution. At each iteration, the algorithm queries the separation oracle for $F$ at the current trial solution $y$. If $y \in F$ then the algorithm terminates. Otherwise, the algorithm generates a new trial point $y_+ = y + \eta u$ for some step length $\eta > 0$, where $u \in \mathbb{R}^m$, $b \in \mathbb{R}$ determine a hyperplane separating $y$ from $F$ as above.
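As a concrete illustration of the relaxation scheme just described, here is a minimal Python sketch of the generic update $y_+ = y + \eta u$ driven by a user-supplied separation oracle. The oracle interface (returning `None` or a pair $(u, b)$), the fixed step length, and the iteration cap are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def relaxation_method(oracle, m, eta=1.0, max_iters=10_000):
    """Generic relaxation scheme for the feasibility problem (1).

    `oracle(y)` is assumed to return None when y lies in the open convex
    set F, and otherwise a pair (u, b) describing a separating hyperplane
    with <u, y> <= b and <u, v> > b for every v in F.
    """
    y = np.zeros(m)                      # arbitrary initial trial solution
    for _ in range(max_iters):
        cut = oracle(y)
        if cut is None:                  # y is in F: done
            return y
        u, _b = cut
        y = y + eta * u                  # step toward the halfspace containing F
    return None                          # budget exhausted without certifying feasibility
```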

The perceptron algorithm can be seen as a particular type of relaxation method for the problem (1). It applies to the case when $F$ is the interior of a convex cone. It usually starts at the origin as the initial trial solution and each update is of the form $y_+ = y + u/\|u\|$. The perceptron algorithm was originally proposed by Rosenblatt [19] for the polyhedral feasibility problem $A^T y > 0$. As was noted by Belloni, Freund, and Vempala [6], the algorithm readily extends to the more general problem (1) when $F$ is the interior of a convex cone, as described above. Furthermore, Belloni et al. [6] showed that the classical perceptron iteration bound of Block [8] and Novikoff [17] also holds in general: the perceptron algorithm finds a solution to (1) in at most $O(1/\tau_F^2)$ perceptron updates, where $\tau_F$ is the width of the cone $F$:

$$\tau_F := \sup_{\|y\| = 1} \{ r \in \mathbb{R}_+ : B(y, r) \subseteq F \}. \qquad (2)$$

Here $B(y, r)$ denotes the Euclidean ball of radius $r$ centered at $y$, that is, $B(y, r) = \{u \in \mathbb{R}^m : \|u - y\| \le r\}$. Similar results also hold for the relaxation method, as established by Goffin [14].

Since their emergence in the fifties, both the perceptron algorithm and the relaxation method have played major roles in machine learning and in optimization. The perceptron algorithm has attractive properties concerning noise tolerance [9]. It is also closely related to large-margin classification [12] and to the highly popular and computationally effective Pegasos algorithm [20] for training support-vector machines. There are also numerous papers in the optimization literature related to various versions and variants of the relaxation method [2, 3, 4, 5, 10].

A major drawback of both the perceptron algorithm and the relaxation method is their lack of theoretical efficiency in the standard bit model of computation [15]. In particular, when $F = \{y : A^T y > 0\}$ with $A \in \mathbb{Z}^{m \times n}$, the perceptron algorithm may have exponential worst-case bit-model complexity because $\tau_F$ can be exponentially small in the bit-length representation of $A$.

Our main contribution is a variant of the perceptron algorithm that solves (1) in $O(m^5 \log(1/\tau_F))$ perceptron updates. In particular, when $F = \{y : A^T y > 0\}$ with $A \in \mathbb{Z}^{m \times n}$, our algorithm is polynomial in the bit-length representation of $A$. Aside from its theoretical merits, given the close connection between the perceptron algorithm and first-order methods [21], our algorithm provides a solid foundation for potential speed-ups in the convergence of the widely popular first-order methods for large-scale convex optimization. Some results of a similar nature have been recently obtained by Gilpin et al. [13] and by O'Donoghue and Candès [18].

Our algorithm is based on a periodic rescaling of the space $\mathbb{R}^m$ in the same spirit as in previous work by Dunagan and Vempala [11], and by Belloni, Freund, and Vempala [6]. In contrast to the rescaling procedure in [11, 6], which is randomized and relies on a deep separation oracle, our rescaling procedure is deterministic and relies only on a separation oracle. The algorithm performs at most $O(m \log(1/\tau_F))$ rescaling steps and at most $O(m^4)$ perceptron updates between rescaling steps. When $F = \{y \in \mathbb{R}^m : A^T y > 0\}$ for $A \in \mathbb{R}^{m \times n}$, a simplified version of the algorithm has iteration bound $O(m^2 n^2 \log(1/\tau_F))$. A smooth version of this algorithm, along the lines developed by Soheili and
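For reference, here is a minimal sketch of the classical perceptron iteration for the polyhedral case $A^T y > 0$ discussed above, assuming unit-norm columns; the function name and iteration cap are illustrative.

```python
import numpy as np

def classical_perceptron(A, max_iters=1_000_000):
    """Classical perceptron for A^T y > 0, with unit-norm columns a_1, ..., a_n.

    Each update is y_+ = y + a_j for some violated column (a_j^T y <= 0);
    when the system is feasible this stops within O(1/tau_F^2) updates.
    """
    m, n = A.shape
    y = np.zeros(m)
    for _ in range(max_iters):
        scores = A.T @ y
        if np.all(scores > 0):
            return y                     # A^T y > 0: feasible point found
        j = int(np.argmin(scores))       # a column with a_j^T y <= 0
        y = y + A[:, j]
    return None
```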

Peña [21], in turn has the improved iteration bound $O(mn\sqrt{m \log(n)}\,\log(1/\tau_F))$.

Our rescaled perceptron algorithm consists of an outer loop with two main phases. The first one is a perceptron phase and the second one is a rescaling phase. The perceptron phase applies a restricted number of perceptron updates. If this phase does not find a feasible solution, then it finds a unit vector $d \in \mathbb{R}^m$ such that

$$\left\{ y \in \mathbb{R}^m : 0 \le \langle d, y \rangle \le \frac{\|y\|}{\sqrt{6m}} \right\} \supseteq F.$$

This inclusion means that the feasible cone $F$ is nearly perpendicular to $d$. The second phase of the outer loop, namely a rescaling phase, stretches $\mathbb{R}^m$ along $d$ and is guaranteed to enlarge the volume of the set $\{y \in F : \|y\| = 1\}$ by a constant factor. This in turn implies that the algorithm must halt in at most $O(m \log(1/\tau_F))$ outer iterations.

2 Polyhedral case

For ease of exposition, we first consider the case $F = \{y \in \mathbb{R}^m : A^T y > 0\}$ for $A \in \mathbb{R}^{m \times n}$.

Assumption 1
(i) The space $\mathbb{R}^m$ is endowed with the canonical dot inner product $\langle u, v \rangle := u^T v$.
(ii) $A = [a_1 \ \cdots \ a_n]$ where $\|a_i\| = 1$ for $i = 1, \ldots, n$.
(iii) The problem $A^T y > 0$ is feasible. In particular, $\tau_F > 0$.

For $j = 1, \ldots, n$ let $e_j \in \mathbb{R}^n$ denote the vector with $j$th component equal to one and all other components equal to zero.

Rescaled Perceptron Algorithm

1. let $B := I$; $\tilde{A} := A$; $N := 6mn^2$.
2. (Perceptron Phase)
   $x_0 := 0 \in \mathbb{R}^n$; $y_0 := 0 \in \mathbb{R}^m$;
   for $k = 0, 1, \ldots, N-1$
     if $\tilde{A}^T y_k > 0$ then HALT and output $B y_k$
     else
       let $j \in \{1, \ldots, n\}$ be such that $\tilde{a}_j^T y_k \le 0$
       $x_{k+1} := x_k + e_j$
       $y_{k+1} := y_k + \tilde{a}_j$
     end if
   end for
3. (Rescaling Phase)
   $j = \operatorname{argmax}_{i = 1, \ldots, n} \langle e_i, x_N \rangle$
   $B := B(I - \tfrac{1}{2}\tilde{a}_j \tilde{a}_j^T)$; $\tilde{A} := (I - \tfrac{1}{2}\tilde{a}_j \tilde{a}_j^T)\tilde{A}$
   normalize the columns of $\tilde{A}$
4. Go back to Step 2.
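The Python sketch below is a near-direct transcription of the pseudocode above, assuming Assumption 1 holds (unit-norm columns, feasible system). The outer-loop cap and the tie-breaking rule for choosing the violated column are choices of this sketch.

```python
import numpy as np

def rescaled_perceptron(A, max_outer=1000):
    """Sketch of the Rescaled Perceptron Algorithm (polyhedral case).

    A is m x n with unit-norm columns and A^T y > 0 is assumed feasible
    (Assumption 1).  Returns y with A^T y > 0, or None if the outer-loop
    cap (an artifact of this sketch) is hit.
    """
    m, n = A.shape
    B = np.eye(m)
    A_t = A.copy()                       # the matrix \tilde{A}
    N = 6 * m * n * n
    for _ in range(max_outer):
        # Perceptron phase
        x = np.zeros(n)
        y = np.zeros(m)
        for _k in range(N):
            scores = A_t.T @ y
            if np.all(scores > 0):
                return B @ y             # B y solves the original system A^T y > 0
            j = int(np.argmin(scores))   # some column with \tilde{a}_j^T y <= 0
            x[j] += 1.0
            y = y + A_t[:, j]
        # Rescaling phase: stretch along the most frequently used column
        j = int(np.argmax(x))
        a = A_t[:, j].copy()
        M = np.eye(m) - 0.5 * np.outer(a, a)
        B = B @ M
        A_t = M @ A_t
        A_t = A_t / np.linalg.norm(A_t, axis=0)   # re-normalize the columns
    return None
```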

The rescaled perceptron algorithm changes the initial constraint matrix $A$ to a new matrix $\tilde{A} = B^T A$. Thus when $\tilde{A}^T y > 0$, the non-zero vector $By$ returned by the algorithm solves $A^T y > 0$. Now we can state a special version of our main theorem.

Theorem 1 Assume $A \in \mathbb{R}^{m \times n}$ satisfies Assumption 1. Then the rescaled perceptron algorithm terminates with a solution to $A^T y > 0$ after at most

$$\frac{(m-1)\left(\log\left(\frac{1}{\tau_F}\right) + \frac{1}{2}\log(\pi)\right)}{\log(1.5)} = O\left(m \log\left(\frac{1}{\tau_F}\right)\right)$$

rescaling steps. Since the algorithm performs $O(mn^2)$ perceptron updates between rescaling steps, the algorithm terminates after at most

$$O\left(m^2 n^2 \log\left(\frac{1}{\tau_F}\right)\right)$$

perceptron updates.

The key ingredients in the proof of Theorem 1 are the three lemmas below. The first of these lemmas states that if the perceptron phase does not solve $\tilde{A}^T y > 0$, then the rescaling phase identifies a column $\tilde{a}_j$ of $\tilde{A}$ that is nearly perpendicular to the feasible cone $\{y : \tilde{A}^T y \ge 0\}$. The second lemma in turn implies that the rescaling phase increases the volume of this cone by a constant factor. The third lemma states that the volume of the initial feasible cone $F = \{y : A^T y \ge 0\}$ is bounded below by a factor of $\tau_F^{m-1}$.

Lemma 1 If the perceptron phase in the rescaled perceptron algorithm does not find a solution to $\tilde{A}^T y > 0$ then the vector $\tilde{a}_j$ in the first step of the rescaling phase satisfies

$$\{y : \tilde{A}^T y \ge 0\} \subseteq \left\{ y : 0 \le \tilde{a}_j^T y \le \frac{\|y\|}{\sqrt{6m}} \right\}. \qquad (3)$$

Proof: Observe that at each iteration of the perceptron phase we have

$$\|y_{k+1}\|^2 = \|y_k\|^2 + 2\tilde{a}_j^T y_k + 1 \le \|y_k\|^2 + 1.$$

Hence $\|y_k\| \le \sqrt{k}$. Also, throughout the perceptron phase $x_k \ge 0$, $y_k = \tilde{A} x_k$ and $\|x_{k+1}\|_1 = \|x_k\|_1 + 1$. Thus if the perceptron phase does not find a solution to $\tilde{A}^T y > 0$ then the last iterates $y_N$ and $x_N$ satisfy $x_N \ge 0$, $\|x_N\|_1 = N = 6mn^2$ and $\|y_N\| = \|\tilde{A} x_N\| \le \sqrt{N} = n\sqrt{6m}$. In particular, the index $j$ in the first step of the rescaling phase satisfies $\langle e_j, x_N \rangle \ge \|x_N\|_1 / n = 6mn$. Next observe that if $\tilde{A}^T y \ge 0$ then

$$0 \le 6mn \cdot \tilde{a}_j^T y \le \langle e_j, x_N \rangle \, \tilde{a}_j^T y \le x_N^T \tilde{A}^T y \le \|\tilde{A} x_N\| \, \|y\| \le n\sqrt{6m}\,\|y\|.$$

So (3) follows.

The following two lemmas rely on geometric arguments concerning the unit sphere $S^{m-1} := \{u \in \mathbb{R}^m : \|u\| = 1\}$. Given a measurable set $C \subseteq S^{m-1}$, let $\mathrm{Vol}(C)$ denote its volume in $S^{m-1}$.
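To get a feel for the bound in Theorem 1, the short sketch below simply evaluates it for concrete values of $m$, $n$ and $\tau_F$; the numbers in the comment come from this arithmetic only, not from experiments.

```python
import math

def theorem1_bounds(m, n, tau):
    """Evaluate the worst-case counts in Theorem 1 for concrete m, n, tau_F."""
    rescalings = (m - 1) * (math.log(1.0 / tau) + 0.5 * math.log(math.pi)) / math.log(1.5)
    rescalings = math.ceil(rescalings)
    updates_per_phase = 6 * m * n * n            # N, the length of one perceptron phase
    return rescalings, rescalings * updates_per_phase

# For example, m = 20, n = 100, tau_F = 1e-6 gives about 675 rescaling steps
# and roughly 8.1e8 perceptron updates in the worst case.
print(theorem1_bounds(20, 100, 1e-6))
```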

We rely on the following construction proposed by Betke [7]. Given $a \in S^{m-1}$ and $\alpha > 1$, let $\Psi_{a,\alpha} : S^{m-1} \to S^{m-1}$ denote the transformation

$$\Psi_{a,\alpha}(u) = \frac{(I + (\alpha - 1) a a^T) u}{\|(I + (\alpha - 1) a a^T) u\|} = \frac{u + (\alpha - 1)(a^T u) a}{\sqrt{1 + (\alpha^2 - 1)(a^T u)^2}}.$$

This transformation stretches the sphere in the direction $a$. The magnitude of the stretch is determined by $\alpha$.

Lemma 2 Assume $a \in S^{m-1}$, $0 < \delta < 1$, and $\alpha > 1$. If $C \subseteq \{y \in S^{m-1} : 0 \le a^T y \le \delta\}$ is a measurable set, then

$$\mathrm{Vol}(\Psi_{a,\alpha}(C)) \ge \frac{\alpha}{\left(1 + \delta^2(\alpha^2 - 1)\right)^{m/2}} \, \mathrm{Vol}(C). \qquad (4)$$

In particular, if $\delta = \frac{1}{\sqrt{6m}}$ and $\alpha = 2$ then

$$\mathrm{Vol}(\Psi_{a,\alpha}(C)) \ge 1.5\,\mathrm{Vol}(C). \qquad (5)$$

Proof: Without loss of generality assume $a = e_m$. Also, for ease of notation, we shall write $\Psi$ as shorthand for $\Psi_{a,\alpha}$. Under these assumptions, for $y = (\bar{y}, y_m) \in S^{m-1}$ we have

$$\Psi(\bar{y}, y_m) = \frac{(\bar{y}, \alpha y_m)}{\sqrt{\alpha^2 + (1 - \alpha^2)\|\bar{y}\|^2}}.$$

To calculate the volume of $C$ and of $\Psi(C)$, consider the differentiable map $\Phi : B^{m-1} \to \mathbb{R}^m$, defined by $\Phi(\bar{v}) = (\bar{v}, \sqrt{1 - \|\bar{v}\|^2})$, that maps the unit ball $B^{m-1} := \{\bar{v} \in \mathbb{R}^{m-1} : \|\bar{v}\| \le 1\}$ to the surface of the hemisphere $\{(\bar{y}, y_m) \in S^{m-1} : y_m \ge 0\}$ containing the set $C$. The volume of $C$ is

$$\mathrm{Vol}(C) = \int_{\Phi^{-1}(C)} \|\Phi'\| \, d\bar{v},$$

where $\|\Phi'\|$ denotes the volume of the $(m-1)$-dimensional parallelepiped spanned by the vectors $\partial\Phi/\partial v_1, \ldots, \partial\Phi/\partial v_{m-1}$. Likewise, the volume of $\Psi(C)$ is

$$\mathrm{Vol}(\Psi(C)) = \int_{\Phi^{-1}(C)} \|(\Psi \circ \Phi)'\| \, d\bar{v}.$$

Hence to prove (4) it suffices to show that

$$\frac{\|(\Psi \circ \Phi)'(\bar{v})\|}{\|\Phi'(\bar{v})\|} \ge \frac{\alpha}{\left(1 + \delta^2(\alpha^2 - 1)\right)^{m/2}} \quad \text{for all } \bar{v} \in \Phi^{-1}(C). \qquad (6)$$

Some straightforward calculations show that for all $\bar{v} \in \mathrm{int}(B^{m-1})$

$$\|(\Psi \circ \Phi)'(\bar{v})\| = \frac{\alpha}{\left(\alpha^2 + (1 - \alpha^2)\|\bar{v}\|^2\right)^{m/2}\sqrt{1 - \|\bar{v}\|^2}} \quad \text{and} \quad \|\Phi'(\bar{v})\| = \frac{1}{\sqrt{1 - \|\bar{v}\|^2}}.$$

Hence for all $\bar{v} \in \mathrm{int}(B^{m-1})$

$$\frac{\|(\Psi \circ \Phi)'(\bar{v})\|}{\|\Phi'(\bar{v})\|} = \frac{\alpha}{\left(\alpha^2 + (1 - \alpha^2)\|\bar{v}\|^2\right)^{m/2}}.$$
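A quick numerical sanity check of the constants in Lemma 2: the sketch below implements the stretch map $\Psi_{a,\alpha}$ and evaluates the expansion factor $\alpha/(1 + \delta^2(\alpha^2 - 1))^{m/2}$ with $\delta = 1/\sqrt{6m}$ and $\alpha = 2$. Function names are illustrative.

```python
import numpy as np

def psi(u, a, alpha):
    """The stretch map Psi_{a,alpha} of Betke restricted to the unit sphere."""
    v = u + (alpha - 1.0) * np.dot(a, u) * a
    return v / np.linalg.norm(v)

def expansion_factor(m, alpha=2.0):
    """The factor alpha / (1 + delta^2 (alpha^2 - 1))^{m/2} with delta = 1/sqrt(6m)."""
    delta_sq = 1.0 / (6.0 * m)
    return alpha / (1.0 + delta_sq * (alpha ** 2 - 1.0)) ** (m / 2.0)

# The factor exceeds 1.5 in every dimension checked, as Lemma 2 asserts:
assert all(expansion_factor(m) > 1.5 for m in range(1, 1001))
```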

To obtain (6), observe that if $\bar{v} \in \Phi^{-1}(C)$ then $0 \le 1 - \|\bar{v}\|^2 \le \delta^2$ and thus

$$\alpha^2 + (1 - \alpha^2)\|\bar{v}\|^2 = 1 + (\alpha^2 - 1)(1 - \|\bar{v}\|^2) \le 1 + \delta^2(\alpha^2 - 1).$$

If $\delta = \frac{1}{\sqrt{6m}}$ and $\alpha = 2$ then

$$\frac{\alpha}{\left(1 + \delta^2(\alpha^2 - 1)\right)^{m/2}} = \frac{2}{\left(1 + \frac{1}{2m}\right)^{m/2}} \ge \frac{2}{\exp(0.25)} \ge 1.5.$$

Thus (5) follows from (4).

Lemma 3 Assume $F \subseteq \mathbb{R}^m$ is a closed convex cone. Then

$$\mathrm{Vol}(F \cap S^{m-1}) \ge \frac{1}{2}\left(\frac{\tau_F}{\sqrt{\pi}}\right)^{m-1} \mathrm{Vol}(S^{m-1}). \qquad (7)$$

Proof: From the definition of the cone width it follows that $B(z, \tau_F) \subseteq F$ for some $z$ with $\|z\| = 1$. Therefore $\sqrt{1 - \tau_F^2}\, z + v \in F$ for all $v \in \mathbb{R}^m$ such that $\|v\| \le \tau_F$ and $\langle z, v \rangle = 0$. This implies that $F \cap S^{m-1}$ contains a spherical cap of $S^{m-1}$ with base radius $\tau_F$. Hence

$$\mathrm{Vol}(F \cap S^{m-1}) \ge \tau_F^{m-1}\, \mathrm{Vol}(B^{m-1}).$$

The bound (7) now follows from the facts $\mathrm{Vol}(B^{m-1}) = \frac{\pi^{(m-1)/2}}{\Gamma\left(\frac{m+1}{2}\right)}$, $\mathrm{Vol}(S^{m-1}) = \frac{2\pi^{m/2}}{\Gamma\left(\frac{m}{2}\right)}$, and $\Gamma\left(\frac{m+1}{2}\right) \le \pi^{(m-2)/2}\,\Gamma\left(\frac{m}{2}\right)$.

Proof of Theorem 1: Let $\tilde{F} := \{y \in \mathbb{R}^m : \tilde{A}^T y \ge 0\}$. Observe that the rescaling phase rescales $\tilde{F}$ to $(I + \tilde{a}_j \tilde{a}_j^T)\tilde{F}$. Therefore, Lemma 1 and Lemma 2 imply that after each rescaling phase the quantity $\mathrm{Vol}(\tilde{F} \cap S^{m-1})$ increases by a factor of 1.5 or more. Since the set $\tilde{F} \cap S^{m-1}$ is always contained in a hemisphere, we conclude that the number of rescaling steps before the algorithm halts cannot be larger than

$$\log\left(\frac{\mathrm{Vol}(S^{m-1})}{2\,\mathrm{Vol}(F \cap S^{m-1})}\right) \Big/ \log(1.5).$$

To finish, apply Lemma 3.

3 General case

The gist of the algorithm for the general case of a convex cone is the same as that of the polyhedral case presented above. We just need a bit of extra work to identify a suitable direction for the rescaling phase. To do so, we maintain a collection of $2m$ index sets $S_j$, $j = \pm 1, \pm 2, \ldots, \pm m$. This collection of sets helps us determine a subset of update steps that align with each other. The sum of these steps in turn defines the appropriate direction for rescaling.

Assumption 2

(i) The space $\mathbb{R}^m$ is endowed with the canonical dot inner product $\langle \cdot, \cdot \rangle$.
(ii) $F \subseteq \mathbb{R}^m$ is the non-empty interior of a convex cone. In particular, $\tau_F > 0$.
(iii) There is an available separation oracle for the cone $F$: given $y \in \mathbb{R}^m$ the oracle either determines that $y \in F$ or else it finds a non-zero vector $u \in F^* := \{u : \langle u, v \rangle > 0 \text{ for all } v \in F\}$ such that $\langle u, y \rangle \le 0$.

For $j = 1, \ldots, m$ let $e_j \in \mathbb{R}^m$ denote the vector with $j$th component equal to one and all other components equal to zero.

Observe that for a non-singular matrix $B \in \mathbb{R}^{m \times m}$, we have $(BF)^* = B^{-T} F^*$. Thus a separation oracle for $\tilde{F} := B^{-1} F$ is readily available provided one for $F$ is: given $y \in \mathbb{R}^m$, apply the separation oracle for $F$ to the point $By$. If $By \in F$ then $y \in B^{-1} F = \tilde{F}$. If $By \notin F$, then let $u \in F^*$ be a non-zero vector such that $\langle u, By \rangle \le 0$. Thus $\langle B^T u, y \rangle = \langle u, By \rangle \le 0$ with $B^T u \in (B^{-1} F)^* = \tilde{F}^*$. Consequently, throughout the algorithm below we assume that a separation oracle for the rescaled cone $\tilde{F}$ is available.

General Rescaled Perceptron Algorithm

1. let $B := I$; $\tilde{F} := F$; $N := 24m^4$.
2. for $j = \pm 1, \pm 2, \ldots, \pm m$
     $S_j := \emptyset$
   end for
3. (Perceptron Phase)
   $u_0 := 0 \in \mathbb{R}^m$; $y_0 := 0 \in \mathbb{R}^m$;
   for $k = 0, 1, \ldots, N-1$
     if $y_k \in \tilde{F}$ then HALT and output $B y_k$
     else
       let $u_k \in \tilde{F}^*$ be such that $\langle u_k, y_k \rangle \le 0$ and $\|u_k\| = 1$
       $y_{k+1} := y_k + u_k$
       $j := \operatorname{argmax}_{i = 1, \ldots, m} |\langle e_i, u_k \rangle|$
       if $\langle e_j, u_k \rangle > 0$ then $S_j := S_j \cup \{k\}$ else $S_{-j} := S_{-j} \cup \{k\}$ end if
     end if
   end for
4. (Rescaling Phase)
   $i = \operatorname{argmax}_{j = \pm 1, \ldots, \pm m} |S_j|$
   $d := \dfrac{\sum_{k \in S_i} u_k}{\left\| \sum_{k \in S_i} u_k \right\|}$
   $B := B(I - \tfrac{1}{2} d d^T)$; $\tilde{F} := (I + d d^T)\tilde{F}$
5. Go back to Step 2.

The general rescaled perceptron algorithm changes the initial cone $F$ to $\tilde{F} = B^{-1} F$. Thus when $y \in \tilde{F}$, we have $By \in F$. Notice that although the above algorithm implicitly performs this transformation, its steps do not involve inverting any matrices or solving any systems of equations.
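A minimal Python sketch of the general algorithm follows, assuming a black-box separation oracle for $F$ with the interface described in the docstring; that interface, the function names, and the outer-loop cap are assumptions of the sketch. The rescaled cone $\tilde{F}$ is handled implicitly through $B$, as the preceding paragraph explains.

```python
import numpy as np

def general_rescaled_perceptron(sep_oracle, m, max_outer=100):
    """Sketch of the general rescaled perceptron algorithm.

    `sep_oracle(z)` is assumed to return None when z is in the open cone F,
    and otherwise a nonzero u in the dual cone F* with <u, z> <= 0.  The
    oracle for the rescaled cone is obtained by querying it at B y and
    returning B^T u, as described in the text.
    """
    B = np.eye(m)
    N = 24 * m ** 4
    for _ in range(max_outer):
        sums = {j: np.zeros(m) for j in range(-m, m + 1) if j != 0}   # running sums over S_j
        counts = {j: 0 for j in sums}
        y = np.zeros(m)
        for _k in range(N):
            u = sep_oracle(B @ y)
            if u is None:
                return B @ y              # B y solves the original problem y in F
            u = B.T @ u                   # separator for the rescaled cone
            u = u / np.linalg.norm(u)
            y = y + u
            i = int(np.argmax(np.abs(u))) + 1
            j = i if u[i - 1] > 0 else -i
            sums[j] += u
            counts[j] += 1
        i_star = max(counts, key=counts.get)              # largest index set S_j
        d = sums[i_star] / np.linalg.norm(sums[i_star])
        B = B @ (np.eye(m) - 0.5 * np.outer(d, d))        # B := B (I - d d^T / 2)
    return None
```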

Now we can state the general version of our main theorem.

Theorem 2 Assume $F \subseteq \mathbb{R}^m$ is such that Assumption 2 holds. Then the general rescaled perceptron algorithm terminates with a solution to $y \in F$ after at most

$$\frac{(m-1)\left(\log\left(\frac{1}{\tau_F}\right) + \frac{1}{2}\log(\pi)\right)}{\log(1.5)} = O\left(m \log\left(\frac{1}{\tau_F}\right)\right)$$

rescaling steps. Since the algorithm performs $O(m^4)$ perceptron updates between rescaling steps, the algorithm terminates after at most

$$O\left(m^5 \log\left(\frac{1}{\tau_F}\right)\right)$$

perceptron updates.

The proof of Theorem 2 is almost identical to the proof of Theorem 1. All we need is the following analog of Lemma 1.

Lemma 4 If the perceptron phase in the general rescaled perceptron algorithm does not find a solution to $y \in \tilde{F}$ then the vector $d$ in the rescaling phase satisfies

$$\tilde{F} \subseteq \left\{ y : 0 \le \langle d, y \rangle \le \frac{\|y\|}{\sqrt{6m}} \right\}. \qquad (8)$$

Proof: Proceeding as in the proof of Lemma 1, it is easy to see that if the perceptron phase does not find a solution to $y \in \tilde{F}$ then the last iterate $y_N = \sum_{k=0}^{N-1} u_k$ satisfies $\|y_N\| \le \sqrt{N} = \sqrt{24}\, m^2$. Since $\{e_1, \ldots, e_m\}$ is an orthonormal basis and each $u_k$ satisfies $\|u_k\| = 1$, we have $|\langle e_j, u_k \rangle| \ge 1/\sqrt{m}$ for $j = \operatorname{argmax}_{i = 1, \ldots, m} |\langle e_i, u_k \rangle|$. Furthermore, since $\sum_{j = \pm 1, \ldots, \pm m} |S_j| = N = 24m^4$ it follows that the set $S_i$ in the rescaling phase must have at least $12m^3$ elements. Thus (with the convention $e_{-j} := -e_j$)

$$\left\| \sum_{k \in S_i} u_k \right\| \ge \left| \left\langle e_i, \sum_{k \in S_i} u_k \right\rangle \right| = \sum_{k \in S_i} |\langle e_i, u_k \rangle| \ge \frac{|S_i|}{\sqrt{m}} \ge 12\, m^{5/2}. \qquad (9)$$

On the other hand, for all $y \in \tilde{F}$ we have

$$0 \le \left\langle \sum_{k \in S_i} u_k, y \right\rangle \le \left\langle \sum_{k=0}^{N-1} u_k, y \right\rangle = \langle y_N, y \rangle \le \|y_N\| \|y\| \le \sqrt{24}\, m^2 \|y\|. \qquad (10)$$

Putting (9) and (10) together, it follows that for all $y \in \tilde{F}$

$$0 \le \langle d, y \rangle = \frac{\left\langle \sum_{k \in S_i} u_k, y \right\rangle}{\left\| \sum_{k \in S_i} u_k \right\|} \le \frac{\sqrt{24}\, m^2 \|y\|}{12\, m^{5/2}} = \frac{\|y\|}{\sqrt{6m}}.$$

Hence (8) holds.

4 Smooth version for the polyhedral case

Consider again the case when $F = \{y \in \mathbb{R}^m : A^T y > 0\}$, where $A \in \mathbb{R}^{m \times n}$. We next show that in this case the perceptron phase can be substituted by a smooth perceptron phase by relying on the machinery developed by Soheili and Peña [21]. This leads to an algorithm with a substantially improved convergence rate but whose work per main iteration is roughly comparable to that in the rescaled perceptron algorithm.

Suppose $A$ satisfies Assumption 1. For $\mu > 0$ let $x_\mu : \mathbb{R}^m \to \mathbb{R}^n$ be defined by

$$x_\mu(y) = \frac{e^{-A^T y / \mu}}{\|e^{-A^T y / \mu}\|_1}.$$

In this expression $e^{-A^T y / \mu}$ is shorthand for the $n$-dimensional vector

$$e^{-A^T y / \mu} := \begin{bmatrix} e^{-a_1^T y / \mu} \\ \vdots \\ e^{-a_n^T y / \mu} \end{bmatrix}.$$

Let $\mathbf{1} \in \mathbb{R}^n$ denote the $n$-dimensional vector of all ones. Consider the following smooth version of the rescaled perceptron algorithm.

Smooth Rescaled Perceptron Algorithm

1. let $B := I$; $\tilde{A} := A$; $N := 7n\sqrt{m \log(n)}$.
2. (Smooth Perceptron Phase)
   $y_0 := \tilde{A}\mathbf{1}/n$; $\mu_0 := 2$; $x_0 := x_{\mu_0}(y_0)$;
   for $k = 0, 1, 2, \ldots, N-1$
     if $\tilde{A}^T y_k > 0$ then HALT and output $B y_k$
     else
       $\theta_k := \frac{2}{k+3}$;
       $y_{k+1} := (1 - \theta_k)(y_k + \theta_k \tilde{A} x_k) + \theta_k^2 \tilde{A} x_{\mu_k}(y_k)$;
       $\mu_{k+1} := (1 - \theta_k)\mu_k$;
       $x_{k+1} := (1 - \theta_k) x_k + \theta_k x_{\mu_{k+1}}(y_{k+1})$;
     end if
   end for
3. (Rescaling Phase)
   $j = \operatorname{argmax}_{i = 1, \ldots, n} \langle e_i, x_N \rangle$
   $B := B(I - \tfrac{1}{2}\tilde{a}_j \tilde{a}_j^T)$; $\tilde{A} := (I - \tfrac{1}{2}\tilde{a}_j \tilde{a}_j^T)\tilde{A}$
   normalize the columns of $\tilde{A}$
4. Go back to Step 2.
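The sketch below mirrors the smooth perceptron phase above. The negative exponent in `x_mu` and the starting value $\mu_0 = 2$ reflect the smooth perceptron scheme of Soheili and Peña [21] as reconstructed here; the function names and the return convention are illustrative.

```python
import numpy as np

def x_mu(A_t, y, mu):
    """The map x_mu(y): a softmax that concentrates on the most violated columns."""
    z = -A_t.T @ y / mu
    z = z - z.max()                      # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def smooth_perceptron_phase(A_t, N):
    """One smooth perceptron phase; returns (y, None) on success, else (None, x_N)."""
    m, n = A_t.shape
    y = A_t @ np.ones(n) / n
    mu = 2.0
    x = x_mu(A_t, y, mu)
    for k in range(N):
        if np.all(A_t.T @ y > 0):
            return y, None
        theta = 2.0 / (k + 3)
        y = (1 - theta) * (y + theta * (A_t @ x)) + theta ** 2 * (A_t @ x_mu(A_t, y, mu))
        mu = (1 - theta) * mu
        x = (1 - theta) * x + theta * x_mu(A_t, y, mu)
    return None, x
```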

Theorem 3 Assume $A \in \mathbb{R}^{m \times n}$ satisfies Assumption 1. Then the smooth rescaled perceptron algorithm terminates with a solution to $A^T y > 0$ after at most

$$\frac{(m-1)\left(\log\left(\frac{1}{\tau_F}\right) + \frac{1}{2}\log(\pi)\right)}{\log(1.5)} = O\left(m \log\left(\frac{1}{\tau_F}\right)\right)$$

rescaling steps. Since the algorithm performs $O(n\sqrt{m \log(n)})$ perceptron updates between rescaling steps, the algorithm terminates after at most

$$O\left(mn\sqrt{m \log(n)}\, \log\left(\frac{1}{\tau_F}\right)\right)$$

perceptron updates.

Proof: This proof is a modification of the proof of Theorem 1. It suffices to show that if the smooth perceptron phase in the rescaled perceptron algorithm does not find a solution to $\tilde{A}^T y > 0$ then the vector $\tilde{a}_j$ in the first step of the rescaling phase satisfies

$$\{y : \tilde{A}^T y \ge 0\} \subseteq \left\{ y : 0 \le \tilde{a}_j^T y \le \frac{\|y\|}{\sqrt{6m}} \right\}. \qquad (11)$$

Indeed, from [21] it follows that if the smooth perceptron phase does not find a solution to $\tilde{A}^T y > 0$, then

$$\|\tilde{A} x_N\| \le \frac{8 \log(n)}{(N+1)^2} \le \frac{8}{49 m n^2} \le \frac{1}{6 m n^2}.$$

Since $x_N \ge 0$ and $\|x_N\|_1 = 1$, the index $j$ in the rescaling phase satisfies $\langle e_j, x_N \rangle \ge 1/n$. Therefore, if $\tilde{A}^T y \ge 0$ then

$$0 \le \frac{1}{n}\, \tilde{a}_j^T y \le \langle e_j, x_N \rangle\, \tilde{a}_j^T y \le x_N^T \tilde{A}^T y \le \|\tilde{A} x_N\| \, \|y\| \le \frac{\|y\|}{6 m n^2}.$$

So (11) follows.

References

[1] S. Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):382-392, 1954.

[2] E. Amaldi, P. Belotti, and R. Hauser. A randomized algorithm for the maxFS problem. In IPCO, pages 249-264, 2005.

[3] E. Amaldi and R. Hauser. Boundedness theorems for the relaxation method. Math. Oper. Res., 30(4):939-955, 2005.

[4] H. H. Bauschke and J. M. Borwein. Legendre functions and the method of random Bregman projections. J. Convex Anal., 4:27-67, 1997.

[5] H. H. Bauschke, J. M. Borwein, and A. Lewis. The method of cyclic projections for closed convex sets in Hilbert space. Contemporary Mathematics, 204:1-38, 1997.

[6] A. Belloni, R. Freund, and S. Vempala. An efficient rescaled perceptron algorithm for conic systems. Math. Oper. Res., 34(3):621-641, 2009.

[7] U. Betke. Relaxation, new combinatorial and polynomial algorithms for the linear feasibility problem. Discrete & Computational Geometry, 32:317-338, 2004.

[8] H. D. Block. The perceptron: A model for brain functioning. Reviews of Modern Physics, 34:123-135, 1962.

[9] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1-2):35-52, 1998.

[10] S. Chubanov. A strongly polynomial algorithm for linear systems having a binary solution. Math. Program., 134:533-570, 2012.

[11] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101-114, 2006.

[12] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277-296, 1999.

[13] A. Gilpin, J. Peña, and T. Sandholm. First-order algorithm with $O(\ln(1/\epsilon))$ convergence for $\epsilon$-equilibrium in two-person zero-sum games. Math. Program., 133:279-298, 2012.

[14] J. Goffin. The relaxation method for solving systems of linear inequalities. Math. Oper. Res., 5:388-414, 1980.

[15] J. Goffin. On the non-polynomiality of the relaxation method for systems of linear inequalities. Math. Program., 22:93-103, 1982.

[16] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):393-404, 1954.

[17] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615-622, 1962.

[18] B. O'Donoghue and E. J. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, to appear.

[19] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Cornell Aeronautical Laboratory, Psychological Review, 65(6):386-408, 1958.

[20] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127:3-30, 2011.

[21] N. Soheili and J. Peña. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728-737, 2012.