A Deterministic Rescaled Perceptron Algorithm

Javier Peña    Negar Soheili

June 5, 2013

Abstract

The perceptron algorithm is a simple iterative procedure for finding a point in a convex cone F. At each iteration, the algorithm only involves a query to a separation oracle for F and a simple update on a trial solution. The perceptron algorithm is guaranteed to find a point in F after O(1/τ_F^2) iterations, where τ_F is the width of the cone F. We propose a version of the perceptron algorithm that includes a periodic rescaling of the ambient space. In contrast to the classical version, our rescaled version finds a point in F in O(m^5 log(1/τ_F)) perceptron updates. This result is inspired by and strengthens the previous work on randomized rescaling of the perceptron algorithm by Dunagan and Vempala [Math. Program. 114 (2006), pp. 101–114] and by Belloni, Freund, and Vempala [Math. Oper. Res. 34 (2009), pp. 621–641]. In particular, our algorithm and its complexity analysis are simpler and shorter. Furthermore, our algorithm does not require randomization or deep separation oracles.

1 Introduction

The relaxation method, introduced in the classical articles of Agmon [1], and Motzkin and Schoenberg [16], is a conceptual algorithmic scheme for solving the feasibility problem

y ∈ F.  (1)

Here F ⊆ R^m is assumed to be an open convex set with an available separation oracle: given a test point y ∈ R^m, the oracle either certifies that y ∈ F or else it finds a hyperplane separating y from F, that is, u ∈ R^m, b ∈ R such that ⟨u, y⟩ ≤ b and ⟨u, v⟩ > b for all v ∈ F. The relaxation method starts with an arbitrary initial trial solution. At each iteration, the algorithm queries the separation oracle for F at the current trial solution y. If y ∈ F then the algorithm terminates. Otherwise, the algorithm generates a new trial point y_+ = y + ηu for some step length η > 0, where u ∈ R^m, b ∈ R determine a hyperplane separating y from F as above.
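To make the oracle-based scheme above concrete, here is a minimal Python sketch of the relaxation method; the oracle interface, the constant step length, and the iteration cap are illustrative assumptions rather than anything prescribed in the text.

```python
def relaxation_method(oracle, y0, eta, max_iters=1000):
    """Relaxation method sketch: `oracle(y)` returns None when y lies in the
    open convex set F, or else a vector u defining a separating hyperplane
    with <u, y> <= b < <u, v> for all v in F.  (Hypothetical interface.)"""
    y = list(y0)
    for _ in range(max_iters):
        u = oracle(y)
        if u is None:
            return y                                  # y is in F
        y = [yi + eta * ui for yi, ui in zip(y, u)]   # step toward F
    return None                                       # budget exhausted
```

For instance, with F = {y in R^2 : y_1 + y_2 > 1}, the oracle can always return u = (1, 1) (with b = 1), and the iterates march across the separating hyperplane into F.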
Javier Peña: Tepper School of Business, Carnegie Mellon University, USA, jfp@andrew.cmu.edu
Negar Soheili: Tepper School of Business, Carnegie Mellon University, USA, nsoheili@andrew.cmu.edu
The perceptron algorithm can be seen as a particular type of relaxation method for the problem (1). It applies to the case when F is the interior of a convex cone. It usually starts at the origin as the initial trial solution, and each update is of the form y_+ = y + u/‖u‖. The perceptron algorithm was originally proposed by Rosenblatt [19] for the polyhedral feasibility problem A^T y > 0. As noted by Belloni, Freund, and Vempala [6], the algorithm readily extends to the more general problem (1) when F is the interior of a convex cone, as described above. Furthermore, Belloni et al. [6, Lemma 3.2] showed that the classical perceptron iteration bound of Block [8] and Novikoff [17] also holds in general: the perceptron algorithm finds a solution to (1) in at most O(1/τ_F^2) perceptron updates, where τ_F is the width of the cone F:

τ_F := sup_{‖y‖=1} { r ∈ R_+ : B(y, r) ⊆ F }.  (2)

Here B(y, r) denotes the Euclidean ball of radius r centered at y, that is, B(y, r) = {u ∈ R^m : ‖u − y‖ ≤ r}. Similar results also hold for the relaxation method, as established by Goffin [14].

Since their emergence in the fifties, both the perceptron algorithm and the relaxation method have played major roles in machine learning and in optimization. The perceptron algorithm has attractive properties concerning noise tolerance [9]. It is also closely related to large-margin classification [12] and to the highly popular and computationally effective Pegasos algorithm [20] for training support-vector machines. There are also numerous papers in the optimization literature related to various versions and variants of the relaxation method [2, 3, 4, 5, 10].

A major drawback of both the perceptron algorithm and the relaxation method is their lack of theoretical efficiency in the standard bit model of computation [15]. In particular, when F = {y : A^T y > 0} with A ∈ Z^{m×n}, the perceptron algorithm may have exponential worst-case bit-model complexity because τ_F can be exponentially small in the bit-length representation of A.
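For the polyhedral problem A^T y > 0, the perceptron update described above takes a particularly simple form. The sketch below uses a column-list representation and a fixed update budget, both of which are illustrative assumptions.

```python
def perceptron(A, max_updates=10000):
    """Classical perceptron sketch for A^T y > 0: A is a list of unit
    columns a_j; while some a_j^T y <= 0, set y := y + a_j.  Returns y
    with A^T y > 0, or None if the update budget runs out."""
    m = len(A[0])
    y = [0.0] * m
    for _ in range(max_updates):
        a = next((a for a in A
                  if sum(ai * yi for ai, yi in zip(a, y)) <= 0), None)
        if a is None:
            return y                              # A^T y > 0 holds
        y = [yi + ai for yi, ai in zip(y, a)]     # perceptron update
    return None
```

The Block–Novikoff bound quoted above guarantees that, on a feasible instance, this loop halts after at most O(1/τ_F^2) updates.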
Our main contribution is a variant of the perceptron algorithm that solves (1) in O(m^5 log(1/τ_F)) perceptron updates. In particular, when F = {y : A^T y > 0} with A ∈ Z^{m×n}, our algorithm is polynomial in the bit-length representation of A. Aside from its theoretical merits, given the close connection between the perceptron algorithm and first-order methods [21], our algorithm provides a solid foundation for potential speed-ups in the convergence of the widely popular first-order methods for large-scale convex optimization. Some results of a similar nature have recently been obtained by Gilpin et al. [13] and by O'Donoghue and Candès [18].

Our algorithm is based on a periodic rescaling of the space R^m, in the same spirit as the previous work by Dunagan and Vempala [11], and by Belloni, Freund, and Vempala [6]. In contrast to the rescaling procedure in [11, 6], which is randomized and relies on a deep separation oracle, our rescaling procedure is deterministic and relies only on a separation oracle. The algorithm performs at most O(m log(1/τ_F)) rescaling steps and at most O(m^4) perceptron updates between rescaling steps. When F = {y ∈ R^m : A^T y > 0} for A ∈ R^{m×n}, a simplified version of the algorithm has iteration bound O(m^2 n^2 log(1/τ_F)). A smooth version of this algorithm, along the lines developed by Soheili and
Peña [21], in turn has the improved iteration bound O(mn√(m log(n)) log(1/τ_F)).

Our rescaled perceptron algorithm consists of an outer loop with two main phases. The first one is a perceptron phase and the second one is a rescaling phase. The perceptron phase applies a restricted number of perceptron updates. If this phase does not find a feasible solution, then it finds a unit vector d ∈ R^m such that

F ⊆ { y ∈ R^m : 0 ≤ ⟨d, y⟩ ≤ ‖y‖/√(6m) }.

This inclusion means that the feasible cone F is nearly perpendicular to d. The second phase of the outer loop, namely the rescaling phase, stretches R^m along d and is guaranteed to enlarge the volume of the set {y ∈ F : ‖y‖ = 1} by a constant factor. This in turn implies that the algorithm must halt in at most O(m log(1/τ_F)) outer iterations.

2 Polyhedral case

For ease of exposition, we first consider the case F = {y ∈ R^m : A^T y > 0} for A ∈ R^{m×n}.

Assumption 1
(i) The space R^m is endowed with the canonical dot inner product ⟨u, v⟩ := u^T v.
(ii) A = [a_1 ⋯ a_n], where ‖a_i‖ = 1 for i = 1, …, n.
(iii) The problem A^T y > 0 is feasible. In particular, τ_F > 0.

For j = 1, …, n let e_j ∈ R^n denote the vector with jth component equal to one and all other components equal to zero.

Rescaled Perceptron Algorithm
1. let B := I; Ã := A; N := 6mn^2
2. (Perceptron Phase)
   x_0 := 0 ∈ R^n; y_0 := 0 ∈ R^m
   for k = 0, 1, …, N−1
     if Ã^T y_k > 0 then HALT and output By_k
     else
       let j ∈ {1, …, n} be such that ã_j^T y_k ≤ 0
       x_{k+1} := x_k + e_j
       y_{k+1} := y_k + ã_j
     end if
   end for
3. (Rescaling Phase)
   j := argmax_{i=1,…,n} ⟨e_i, x_N⟩
   B := B(I + ã_j ã_j^T); Ã := (I + ã_j ã_j^T)Ã
   normalize the columns of Ã
4. Go back to Step 2.
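The pseudocode above can be sketched in Python as follows, assuming A is given as a list of n unit columns of length m; the `max_rounds` safety cap is an illustrative addition (the analysis below bounds the number of rescalings anyway).

```python
import math

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))

def rescaled_perceptron(A, m, max_rounds=50):
    """Sketch of the rescaled perceptron algorithm for A^T y > 0.
    A is a list of n unit columns a_j of length m.  Each round runs
    N = 6 m n^2 perceptron updates; on failure, space is stretched
    along the most frequently used rescaled column."""
    n = len(A)
    N = 6 * m * n * n
    B = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    cols = [list(a) for a in A]            # current rescaled columns of A~
    for _ in range(max_rounds):
        x = [0] * n                        # update counts, i.e. x_N
        y = [0.0] * m
        for _ in range(N):
            j = next((i for i in range(n) if dot(cols[i], y) <= 0), None)
            if j is None:                  # A~^T y > 0 ...
                return [dot(row, y) for row in B]   # ... so By solves A^T y > 0
            x[j] += 1
            y = [yi + ai for yi, ai in zip(y, cols[j])]
        # rescaling phase: stretch along the most frequently used column
        a = cols[max(range(n), key=lambda i: x[i])]
        B = [[B[r][c] + dot(B[r], a) * a[c] for c in range(m)]
             for r in range(m)]            # B := B(I + a a^T)
        cols = [[ci + dot(a, col) * ai for ci, ai in zip(col, a)]
                for col in cols]           # A~ := (I + a a^T) A~
        cols = [[ci / norm(col) for ci in col] for col in cols]  # normalize
    return None
```

Each rescaling multiplies B on the right and Ã on the left by (I + ã_j ã_j^T), so Ã stays equal to B^T A up to positive column scalings, and By solves the original system whenever Ã^T y > 0.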
The rescaled perceptron algorithm changes the initial constraint matrix A to a new matrix Ã = B^T A, up to positive scaling of the columns. Thus when Ã^T y > 0, the non-zero vector By returned by the algorithm solves A^T y > 0. Now we can state a special version of our main theorem.

Theorem 1 Assume A ∈ R^{m×n} satisfies Assumption 1. Then the rescaled perceptron algorithm terminates with a solution to A^T y > 0 after at most

(1/log(1.5)) · (m − 1) · ( log(1/τ_F) + (1/2)·log(π) ) = O( m log(1/τ_F) )

rescaling steps. Since the algorithm performs O(mn^2) perceptron updates between rescaling steps, the algorithm terminates after at most

O( m^2 n^2 log(1/τ_F) )

perceptron updates.

The key ingredients in the proof of Theorem 1 are the three lemmas below. The first of these lemmas states that if the perceptron phase does not solve Ã^T y > 0, then the rescaling phase identifies a column ã_j of Ã that is nearly perpendicular to the feasible cone {y : Ã^T y ≥ 0}. The second lemma in turn implies that the rescaling phase increases the volume of this cone by a constant factor. The third lemma states that the volume of the initial feasible cone {y : A^T y ≥ 0} is bounded below by a factor of τ_F^{m−1}.

Lemma 1 If the perceptron phase in the rescaled perceptron algorithm does not find a solution to Ã^T y > 0, then the vector ã_j in the first step of the rescaling phase satisfies

{y : Ã^T y ≥ 0} ⊆ { y : 0 ≤ ã_j^T y ≤ ‖y‖/√(6m) }.  (3)

Proof: Observe that at each iteration of the perceptron phase we have

‖y_{k+1}‖^2 = ‖y_k‖^2 + 2·ã_j^T y_k + 1 ≤ ‖y_k‖^2 + 1.

Hence ‖y_k‖ ≤ √k. Also, throughout the perceptron phase x_k ≥ 0, y_k = Ãx_k, and ‖x_{k+1}‖_1 = ‖x_k‖_1 + 1. Thus if the perceptron phase does not find a solution to Ã^T y > 0, then the last iterates y_N and x_N satisfy x_N ≥ 0, ‖x_N‖_1 = N = 6mn^2, and ‖y_N‖ = ‖Ãx_N‖ ≤ √N = n√(6m). In particular, the index j in the first step of the rescaling phase satisfies ⟨e_j, x_N⟩ ≥ ‖x_N‖_1/n = 6mn. Next observe that if Ã^T y ≥ 0 then

0 ≤ 6mn · ã_j^T y ≤ ⟨e_j, x_N⟩ · ã_j^T y ≤ x_N^T Ã^T y ≤ ‖Ãx_N‖·‖y‖ ≤ n√(6m)·‖y‖.

So (3) follows.

The following two lemmas rely on geometric arguments concerning the unit sphere S^{m−1} := {u ∈ R^m : ‖u‖ = 1}.
Given a measurable set C ⊆ S^{m−1}, let Vol(C) denote its volume in S^{m−1}.
We rely on the following construction proposed by Betke [7]. Given a ∈ S^{m−1} and α > 1, let Ψ_{a,α} : S^{m−1} → S^{m−1} denote the transformation

u ↦ (I + (α − 1)aa^T)u / ‖(I + (α − 1)aa^T)u‖ = ( u + (α − 1)(a^T u)a ) / √( 1 + (α^2 − 1)(a^T u)^2 ).

This transformation stretches the sphere in the direction a. The magnitude of the stretch is determined by α.

Lemma 2 Assume a ∈ S^{m−1}, 0 < δ < 1, and α > 1. If C ⊆ {y ∈ S^{m−1} : 0 ≤ a^T y ≤ δ} is a measurable set, then

Vol( Ψ_{a,α}(C) ) ≥ ( α / (1 + δ^2(α^2 − 1))^{m/2} ) · Vol(C).  (4)

In particular, if δ = 1/√(6m) and α = 2 then

Vol( Ψ_{a,α}(C) ) ≥ 1.5 · Vol(C).  (5)

Proof: Without loss of generality assume a = e_m. Also, for ease of notation, we shall write Ψ as shorthand for Ψ_{a,α}. Under these assumptions, for y = (ȳ, y_m) ∈ S^{m−1} we have

Ψ(ȳ, y_m) = (ȳ, α·y_m) / √( α^2 + (1 − α^2)‖ȳ‖^2 ).

To calculate the volumes of C and of Ψ(C), consider the differentiable map Φ : B^{m−1} → R^m defined by

Φ(v̄) = ( v̄, √(1 − ‖v̄‖^2) )

that maps the unit ball B^{m−1} := {v̄ ∈ R^{m−1} : ‖v̄‖ ≤ 1} to the surface of the hemisphere {(ȳ, y_m) ∈ S^{m−1} : y_m ≥ 0} containing the set C. The volume of C is

Vol(C) = ∫_{Φ^{−1}(C)} |Φ′| dv̄,

where |Φ′| denotes the volume of the (m−1)-dimensional parallelepiped spanned by the vectors ∂Φ/∂v̄_1, …, ∂Φ/∂v̄_{m−1}. Likewise, the volume of Ψ(C) is

Vol(Ψ(C)) = ∫_{Φ^{−1}(C)} |(Ψ ∘ Φ)′| dv̄.

Hence to prove (4) it suffices to show that

|(Ψ ∘ Φ)′|(v̄) / |Φ′|(v̄) ≥ α / (1 + δ^2(α^2 − 1))^{m/2}  for all v̄ ∈ Φ^{−1}(C).  (6)

Some straightforward calculations show that for all v̄ ∈ int(B^{m−1})

|(Ψ ∘ Φ)′|(v̄) = α / ( (α^2 + (1 − α^2)‖v̄‖^2)^{m/2} · √(1 − ‖v̄‖^2) )  and  |Φ′|(v̄) = 1 / √(1 − ‖v̄‖^2).

Hence for all v̄ ∈ int(B^{m−1})

|(Ψ ∘ Φ)′|(v̄) / |Φ′|(v̄) = α / (α^2 + (1 − α^2)‖v̄‖^2)^{m/2}.
To obtain (6), observe that if v̄ ∈ Φ^{−1}(C) then 0 ≤ 1 − ‖v̄‖^2 ≤ δ^2 and thus

α^2 + (1 − α^2)‖v̄‖^2 ≤ 1 + δ^2(α^2 − 1).

If δ = 1/√(6m) and α = 2 then

α / (1 + δ^2(α^2 − 1))^{m/2} = 2 / (1 + 1/(2m))^{m/2} ≥ 2/exp(0.25) ≥ 1.5.

Thus (5) follows from (4).

Lemma 3 Assume F ⊆ R^m is a closed convex cone. Then

Vol( F ∩ S^{m−1} ) ≥ (1/2) · (τ_F/√π)^{m−1} · Vol(S^{m−1}).  (7)

Proof: From the definition of the cone width it follows that B(z, τ_F) ⊆ F for some z with ‖z‖ = 1. Therefore √(1 − τ_F^2)·z + v ∈ F for all v ∈ R^m such that ‖v‖ ≤ τ_F and ⟨z, v⟩ = 0. This implies that F ∩ S^{m−1} contains a spherical cap of S^{m−1} with base radius τ_F. Hence

Vol( F ∩ S^{m−1} ) ≥ τ_F^{m−1} · Vol(B^{m−1}).

The bound (7) now follows from the facts Vol(B^{m−1}) = π^{(m−1)/2} / Γ((m+1)/2), Vol(S^{m−1}) = 2π^{m/2} / Γ(m/2), and Γ((m+1)/2) ≤ π^{(m−2)/2} · Γ(m/2).

Proof of Theorem 1: Let F̃ := {y ∈ R^m : Ã^T y ≥ 0}. Observe that the rescaling phase rescales F̃ to (I + ã_j ã_j^T)F̃. Therefore, Lemma 1 and Lemma 2 imply that after each rescaling phase the quantity Vol(F̃ ∩ S^{m−1}) increases by a factor of 1.5 or more. Since the set F̃ ∩ S^{m−1} is always contained in a hemisphere, we conclude that the number of rescaling steps before the algorithm halts cannot be larger than

(1/log(1.5)) · log( Vol(S^{m−1}) / (2·Vol(F ∩ S^{m−1})) ).

To finish, apply Lemma 3.

3 General case

The gist of the algorithm for the general case of a convex cone is the same as that of the polyhedral case presented above. We just need a bit of extra work to identify a suitable direction for the rescaling phase. To do so, we maintain a collection of 2m index sets S_j, j = ±1, ±2, …, ±m. This collection of sets helps us determine a subset of update steps that align with each other. The sum of these steps in turn defines the appropriate direction for rescaling.

Assumption 2
(i) The space R^m is endowed with the canonical dot inner product ⟨·, ·⟩.
(ii) F ⊆ R^m is the non-empty interior of a convex cone. In particular, τ_F > 0.
(iii) There is an available separation oracle for the cone F: Given y ∈ R^m, the oracle either determines that y ∈ F or else it finds a non-zero vector u ∈ F* := {u : ⟨u, v⟩ > 0 for all v ∈ F} such that ⟨u, y⟩ ≤ 0.

For j = 1, …, m let e_j ∈ R^m denote the vector with jth component equal to one and all other components equal to zero.

Observe that for a non-singular matrix B ∈ R^{m×m}, we have (B^{−1}F)* = B^T F*. Thus a separation oracle for F̃ := B^{−1}F is readily available provided one for F is: Given y ∈ R^m, apply the separation oracle for F to the point By. If By ∈ F then y ∈ B^{−1}F = F̃. If By ∉ F, then let u ∈ F* be a non-zero vector such that ⟨u, By⟩ ≤ 0. Thus ⟨B^T u, y⟩ = ⟨u, By⟩ ≤ 0 with B^T u ∈ B^T F* = (B^{−1}F)* = F̃*. Consequently, throughout the algorithm below we assume that a separation oracle for the rescaled cone F̃ is available.

General Rescaled Perceptron Algorithm
1. let B := I; F̃ := F; N := 24m^4
2. for j = ±1, ±2, …, ±m
     S_j := ∅
   end for
3. (Perceptron Phase)
   u_0 := 0 ∈ R^m; y_0 := 0 ∈ R^m
   for k = 0, 1, …, N−1
     if y_k ∈ F̃ then HALT and output By_k
     else
       let u_k ∈ F̃* be such that ⟨u_k, y_k⟩ ≤ 0 and ‖u_k‖ = 1
       y_{k+1} := y_k + u_k
       j := argmax_{i=1,…,m} |⟨e_i, u_k⟩|
       if ⟨e_j, u_k⟩ > 0 then S_j := S_j ∪ {k} else S_{−j} := S_{−j} ∪ {k} end if
     end if
   end for
4. (Rescaling Phase)
   i := argmax_{j=±1,…,±m} |S_j|
   d := Σ_{k∈S_i} u_k / ‖ Σ_{k∈S_i} u_k ‖
   B := B(I − (1/2)·dd^T); F̃ := (I + dd^T)F̃
5. Go back to Step 2.

The general rescaled perceptron algorithm changes the initial cone F to F̃ = B^{−1}F. Thus when y ∈ F̃, we have By ∈ F. Notice that although the above algorithm implicitly performs this transformation, its steps do not involve inverting any matrices or solving any system of equations. Now we can state the general version of our main theorem.
Theorem 2 Assume F ⊆ R^m is such that Assumption 2 holds. Then the general rescaled perceptron algorithm terminates with a solution to y ∈ F after at most

(1/log(1.5)) · (m − 1) · ( log(1/τ_F) + (1/2)·log(π) ) = O( m log(1/τ_F) )

rescaling steps. Since the algorithm performs O(m^4) perceptron updates between rescaling steps, the algorithm terminates after at most

O( m^5 log(1/τ_F) )

perceptron updates.

The proof of Theorem 2 is almost identical to the proof of Theorem 1. All we need is the following analog of Lemma 1.

Lemma 4 If the perceptron phase in the general rescaled perceptron algorithm does not find a solution to y ∈ F̃, then the vector d in the rescaling phase satisfies

F̃ ⊆ { y : 0 ≤ ⟨d, y⟩ ≤ ‖y‖/√(6m) }.  (8)

Proof: Proceeding as in the proof of Lemma 1, it is easy to see that if the perceptron phase does not find a solution to y ∈ F̃, then the last iterate y_N = Σ_{k=0}^{N−1} u_k satisfies ‖y_N‖ ≤ √N = √24·m^2. Since {e_1, …, e_m} is an orthonormal basis and each u_k satisfies ‖u_k‖ = 1, we have |⟨e_j, u_k⟩| ≥ 1/√m for j = argmax_{i=1,…,m} |⟨e_i, u_k⟩|. Furthermore, since Σ_{j=±1,…,±m} |S_j| = N = 24m^4, it follows that the set S_i in the rescaling phase must have at least 12m^3 elements. Thus, writing e_{−j} := −e_j for j = 1, …, m,

‖ Σ_{k∈S_i} u_k ‖ ≥ ⟨ e_i, Σ_{k∈S_i} u_k ⟩ = Σ_{k∈S_i} ⟨e_i, u_k⟩ ≥ |S_i|/√m ≥ 12·m^{5/2}.  (9)

On the other hand, for all y ∈ F̃ we have

0 ≤ ⟨ Σ_{k∈S_i} u_k, y ⟩ ≤ ⟨ Σ_{k=0}^{N−1} u_k, y ⟩ = ⟨y_N, y⟩ ≤ ‖y_N‖·‖y‖ ≤ √24·m^2·‖y‖.  (10)

Putting (9) and (10) together, it follows that for all y ∈ F̃

0 ≤ ⟨d, y⟩ ≤ ( √24·m^2 / (12·m^{5/2}) )·‖y‖ = ‖y‖/√(6m).

Hence (8) holds.
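To make the general algorithm of this section and the rescaled-oracle construction preceding it concrete, here is a minimal Python sketch; the `oracle` interface and the `max_rounds` safety cap are illustrative assumptions, not part of the paper.

```python
import math

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))

def general_rescaled_perceptron(oracle, m, max_rounds=60):
    """Sketch of the general rescaled perceptron algorithm.  `oracle(y)` is
    a separation oracle for the original cone F: it returns None when y is
    in F, else a unit vector u in the dual cone with <u, y> <= 0.  The
    oracle for the rescaled cone is obtained by querying at By and
    returning B^T u, normalized.  (Hypothetical interface.)"""
    N = 24 * m ** 4
    B = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(max_rounds):
        counts = {j: 0 for j in range(-m, m + 1) if j != 0}   # |S_j|
        sums = {j: [0.0] * m for j in counts}   # sum of u_k over S_j
        y = [0.0] * m
        for _ in range(N):
            By = [dot(row, y) for row in B]
            u0 = oracle(By)                # query the oracle for F at By
            if u0 is None:
                return By                  # By lies in F
            # B^T u0 separates y from the rescaled cone; normalize it
            bu = [sum(B[r][c] * u0[r] for r in range(m)) for c in range(m)]
            nb = norm(bu)
            u = [c / nb for c in bu]
            y = [yi + ui for yi, ui in zip(y, u)]
            j = max(range(m), key=lambda i: abs(u[i])) + 1    # 1-based
            j = j if u[j - 1] > 0 else -j
            counts[j] += 1
            sums[j] = [si + ui for si, ui in zip(sums[j], u)]
        # rescaling phase: d = normalized sum over the largest index set
        i = max(counts, key=lambda j: counts[j])
        nd = norm(sums[i])
        d = [c / nd for c in sums[i]]
        B = [[B[r][c] - 0.5 * dot(B[r], d) * d[c] for c in range(m)]
             for r in range(m)]            # B := B(I - d d^T / 2)
    return None
```

Note that B is updated with the explicit inverse (I + dd^T)^{−1} = I − dd^T/2, so, as the text emphasizes, no matrices are inverted and no linear systems are solved.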
4 Smooth version for the polyhedral case

Consider again the case when F = {y ∈ R^m : A^T y > 0}, where A ∈ R^{m×n}. We next show that in this case the perceptron phase can be substituted by a smooth perceptron phase by relying on the machinery developed by Soheili and Peña [21]. This leads to an algorithm with a substantially improved convergence rate, but whose work per main iteration is roughly comparable to that in the rescaled perceptron algorithm.

Suppose A satisfies Assumption 1. For µ > 0 let x_µ : R^m → R^n be defined by

x_µ(y) := e^{−A^T y/µ} / ‖ e^{−A^T y/µ} ‖_1.

In this expression e^{−A^T y/µ} is shorthand for the n-dimensional vector

e^{−A^T y/µ} := ( e^{−a_1^T y/µ}, …, e^{−a_n^T y/µ} ).

Let 1 ∈ R^n denote the n-dimensional vector of all ones. Consider the following smooth version of the rescaled perceptron algorithm.

Smooth Rescaled Perceptron Algorithm
1. let B := I; Ã := A; N := 7n√(m·log(n))
2. (Smooth Perceptron Phase)
   y_0 := Ã1/n; µ_0 := 1; x_0 := x_{µ_0}(y_0)
   for k = 0, 1, 2, …, N−1
     if Ã^T y_k > 0 then HALT and output By_k
     else
       θ_k := 2/(k+3)
       y_{k+1} := (1 − θ_k)(y_k + θ_k·Ãx_k) + θ_k^2·Ãx_{µ_k}(y_k)
       µ_{k+1} := (1 − θ_k)·µ_k
       x_{k+1} := (1 − θ_k)·x_k + θ_k·x_{µ_{k+1}}(y_{k+1})
     end if
   end for
3. (Rescaling Phase)
   j := argmax_{i=1,…,n} ⟨e_i, x_N⟩
   B := B(I + ã_j ã_j^T); Ã := (I + ã_j ã_j^T)Ã
   normalize the columns of Ã
4. Go back to Step 2.

Theorem 3 Assume A ∈ R^{m×n} satisfies Assumption 1. Then the smooth rescaled perceptron algorithm terminates with a solution to A^T y > 0 after at
most

(1/log(1.5)) · (m − 1) · ( log(1/τ_F) + (1/2)·log(π) ) = O( m log(1/τ_F) )

rescaling steps. Since the algorithm performs O(n√(m·log(n))) perceptron updates between rescaling steps, the algorithm terminates after at most

O( mn√(m·log(n)) · log(1/τ_F) )

perceptron updates.

Proof: This proof is a modification of the proof of Theorem 1. It suffices to show that if the smooth perceptron phase in the smooth rescaled perceptron algorithm does not find a solution to Ã^T y > 0, then the vector ã_j in the first step of the rescaling phase satisfies

{y : Ã^T y ≥ 0} ⊆ { y : 0 ≤ ã_j^T y ≤ ‖y‖/√(6m) }.  (11)

Indeed, from [21, Lemma 4.1] it follows that if the smooth perceptron phase does not find a solution to Ã^T y > 0, then

‖Ãx_N‖ ≤ 8·log(n)/(N+1)^2 ≤ 8/(49mn^2) ≤ 1/(6mn^2).

Since x_N ≥ 0 and ‖x_N‖_1 = 1, the index j in the rescaling phase satisfies ⟨e_j, x_N⟩ ≥ 1/n. Therefore, if Ã^T y ≥ 0 then

0 ≤ (1/n)·ã_j^T y ≤ ⟨e_j, x_N⟩·ã_j^T y ≤ x_N^T Ã^T y ≤ ‖Ãx_N‖·‖y‖ ≤ ‖y‖/(6mn^2).

Hence ã_j^T y ≤ ‖y‖/(6mn) ≤ ‖y‖/√(6m), and (11) follows.

References

[1] S. Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):382–392, 1954.
[2] E. Amaldi, P. Belotti, and R. Hauser. A randomized algorithm for the MAX FS problem. In IPCO, pages 249–264, 2005.
[3] E. Amaldi and R. Hauser. Boundedness theorems for the relaxation method. Math. Oper. Res., 30(4):939–955, 2005.
[4] H. H. Bauschke and J. M. Borwein. Legendre functions and the method of random Bregman projections. J. Convex Anal., 4:27–67, 1997.
[5] H. H. Bauschke, J. M. Borwein, and A. Lewis. The method of cyclic projections for closed convex sets in Hilbert space. Contemporary Math., 204:1–38, 1997.
[6] A. Belloni, R. Freund, and S. Vempala. An efficient rescaled perceptron algorithm for conic systems. Math. Oper. Res., 34(3):621–641, 2009.
[7] U. Betke. Relaxation, new combinatorial and polynomial algorithms for the linear feasibility problem. Discrete & Computational Geometry, 32:317–338, 2004.
[8] H. D. Block. The perceptron: a model for brain functioning. Reviews of Modern Physics, 34:123–135, 1962.
[9] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1–2):35–52, 1998.
[10] S. Chubanov. A strongly polynomial algorithm for linear systems having a binary solution. Math. Program., 134:533–570, 2012.
[11] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101–114, 2006.
[12] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277–296, 1999.
[13] A. Gilpin, J. Peña, and T. Sandholm. First-order algorithm with O(ln(1/ε)) convergence for ε-equilibrium in two-person zero-sum games. Math. Program., 133:279–298, 2012.
[14] J. Goffin. The relaxation method for solving systems of linear inequalities. Math. Oper. Res., 5:388–414, 1980.
[15] J. Goffin. On the non-polynomiality of the relaxation method for systems of linear inequalities. Math. Program., 22:93–103, 1982.
[16] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):393–404, 1954.
[17] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[18] B. O'Donoghue and E. J. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, to appear.
[19] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Cornell Aeronautical Laboratory, Psychological Review, 65(6):386–408, 1958.
[20] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127:3–30, 2011.
[21] N. Soheili and J. Peña.
A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728–737, 2012.