Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation


Instructor: Moritz Hardt
Email: hardt+ee227c@berkeley.edu
Graduate Instructor: Max Simchowitz
Email: msimchow+ee227c@berkeley.edu

October 15, 2018

15 Fenchel duality and algorithms

In this section, we introduce the Fenchel conjugate. First, recall that for a real-valued convex function of a single variable $f(x)$, we call $f^*(p) := \sup_x \, px - f(x)$ its Legendre transform. This is illustrated in Figure 1. The generalization of the Legendre transform to (possibly nonconvex) functions of multiple variables is the Fenchel conjugate.

Definition 15.1 (Fenchel conjugate). The Fenchel conjugate of $f \colon \mathbb{R}^n \to \mathbb{R}$ is
$$f^*(p) = \sup_x \, \langle p, x \rangle - f(x).$$

We now present several useful facts about the Fenchel conjugate. The proofs are left as an exercise.

Fact 15.2. $f^*$ is convex.

Indeed, $f^*$ is the supremum of affine functions and therefore convex. Thus, the Fenchel conjugate of $f$ is also known as its convex conjugate.

Fact 15.3. $f^{**} = f$ if $f$ is convex.

In other words, the Fenchel conjugate is its own inverse for convex functions.
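As a quick sanity check of Facts 15.2 and 15.3, consider a standard example (not from the original notes): the scaled quadratic $f(x) = \frac{a}{2}x^2$ with $a > 0$. A short computation gives
$$f^*(p) = \sup_x \, px - \frac{a}{2}x^2 = \frac{p^2}{2a}, \qquad f^{**}(x) = \sup_p \, xp - \frac{p^2}{2a} = \frac{a}{2}x^2 = f(x),$$
so the biconjugate recovers $f$, and for $a = 1$ the function $\frac{1}{2}x^2$ is its own conjugate.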

Figure 1: Legendre transform. The function $f^*(p)$ is how much we need to vertically shift the epigraph of $f$ so that the linear function $px$ is tangent to $f$ at $x$.

Now, we can also relate the subdifferential of a function to that of its Fenchel conjugate. Intuitively, observe that
$$x \in \partial f^*(p) \iff 0 \in p - \partial f(x) \iff p \in \partial f(x).$$
This is summarized more generally in the following fact.

Fact 15.4. The subdifferential of $f^*$ at $p$ is $\partial f^*(p) = \{x : p \in \partial f(x)\}$.

Indeed, $\partial f^*(0)$ is the set of minimizers of $f$.

In the following theorem, we introduce Fenchel duality.

Theorem 15.5 (Fenchel duality). Suppose we have $f$ proper convex and $g$ proper concave. Then
$$\min_x f(x) - g(x) = \max_p g_*(p) - f^*(p),$$
where $g_*$ is the concave conjugate of $g$, defined as $g_*(p) = \inf_x \, \langle p, x \rangle - g(x)$.

In the one-dimensional case, we can illustrate Fenchel duality with Figure 2. In the minimization problem, we want to find $x$ such that the vertical distance between $f$ and $g$ at $x$ is as small as possible. In the (dual) maximization problem, we draw tangents to the graphs of $f$ and $g$ such that the tangent lines have the same slope $p$, and we want to find $p$ such that the vertical distance between the tangent lines is as large as possible. The duality theorem above states that strong duality holds, that is, the two problems have the same solution.

We can recover Fenchel duality from Lagrangian duality, which we have already studied. To do so, we need to introduce a constraint to our minimization problem in Theorem 15.5. A natural reformulation of the problem with a constraint is as follows:
$$\min_{x,z} \; f(x) - g(z) \quad \text{subject to} \quad x = z. \tag{1}$$
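To see Theorem 15.5 in action, here is a minimal numerical sketch (not part of the original notes) that approximates both conjugates on a grid for one convex/concave pair and compares the primal and dual optimal values; the particular choices $f(x) = \frac{1}{2}x^2$, $g(x) = -|x-1|$, and the grid are arbitrary illustrative assumptions.

```python
import numpy as np

# One-dimensional Fenchel duality check on a grid (illustrative sketch).
# Theorem 15.5:  min_x f(x) - g(x) = max_p g_*(p) - f^*(p).

xs = np.linspace(-5, 5, 2001)   # grid over x
ps = np.linspace(-5, 5, 2001)   # grid over slopes p

f = 0.5 * xs**2                 # f(x) = x^2 / 2  (convex)
g = -np.abs(xs - 1.0)           # g(x) = -|x - 1| (concave)

# Convex conjugate  f^*(p) = sup_x  p*x - f(x)
f_star = np.max(ps[:, None] * xs[None, :] - f[None, :], axis=1)
# Concave conjugate g_*(p) = inf_x  p*x - g(x)
g_star = np.min(ps[:, None] * xs[None, :] - g[None, :], axis=1)

primal = np.min(f - g)          # min_x f(x) - g(x)
dual = np.max(g_star - f_star)  # max_p g_*(p) - f^*(p)

print(f"primal optimum ~ {primal:.4f}, dual optimum ~ {dual:.4f}")
# Both values come out close to 0.5, attained at x = 1 and p = 1.
```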

Figure 2: Fenchel duality in one dimension.

15.1 Deriving the dual problem for empirical risk minimization

In empirical risk minimization, we often want to minimize a function of the following form:
$$P(w) = \frac{1}{m} \sum_{i=1}^{m} \phi_i(\langle w, x_i \rangle) + R(w). \tag{2}$$
We can think of $w \in \mathbb{R}^n$ as the model parameter that we want to optimize over (in this case it corresponds to picking a hyperplane), and $x_i$ as the features of the $i$-th example in the training set. $\phi_i(\langle \cdot, x_i \rangle)$ is the loss function for the $i$-th training example and may depend on its label. $R(w)$ is the regularizer, and we typically choose it to be of the form $R(w) = \frac{\lambda}{2}\|w\|^2$. The primal problem, $\min_{w \in \mathbb{R}^n} P(w)$, can be equivalently written as follows:
$$\min_{w,z} \; \frac{1}{m}\sum_{i=1}^{m} \phi_i(z_i) + R(w) \quad \text{subject to} \quad X^\top w = z, \tag{3}$$
where $X \in \mathbb{R}^{n \times m}$ is the matrix whose $i$-th column is $x_i$.

By Lagrangian duality, we know that the dual problem is the following:
$$
\begin{aligned}
&\max_{\alpha} \, \min_{z,w} \; \frac{1}{m}\sum_{i=1}^{m}\phi_i(z_i) + R(w) - \frac{1}{m}\alpha^\top\bigl(X^\top w - z\bigr) \\
&\quad= \max_{\alpha} \, \min_{w,z} \; \frac{1}{m}\sum_{i=1}^{m}\bigl(\phi_i(z_i) + \alpha_i z_i\bigr) + R(w) - \frac{1}{m}(X\alpha)^\top w \\
&\quad= \max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}\Bigl(-\max_{z_i}\bigl(-\alpha_i z_i - \phi_i(z_i)\bigr)\Bigr) - \max_{w}\Bigl(\Bigl\langle \tfrac{1}{m}X\alpha, w\Bigr\rangle - R(w)\Bigr) \\
&\quad= \max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - R^*\Bigl(\tfrac{1}{m}X\alpha\Bigr),
\end{aligned}
$$
where $\phi_i^*$ and $R^*$ are the Fenchel conjugates of $\phi_i$ and $R$, respectively. Let us denote the dual objective as
$$D(\alpha) = \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - R^*\Bigl(\tfrac{1}{m}X\alpha\Bigr).$$
By weak duality, $D(\alpha) \le P(w)$.

For $R(w) = \frac{\lambda}{2}\|w\|^2$, we have $R^*(p) = \frac{\lambda}{2}\bigl\|\tfrac{1}{\lambda}p\bigr\|^2$. So $R$ is its own convex conjugate (up to correction by a constant factor). In this case the dual becomes:
$$\max_{\alpha} \; \frac{1}{m}\sum_{i=1}^{m}-\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Bigl\|\frac{1}{\lambda m}\sum_{i=1}^{m}\alpha_i x_i\Bigr\|^2.$$
We can relate the primal and dual variables by the map $w(\alpha) = \frac{1}{\lambda m}\sum_{i=1}^{m}\alpha_i x_i$. In particular, this shows that the optimal hyperplane is in the span of the data.

Here are some examples of models we can use under this framework.

Example 15.6 (Linear SVM). We would use the hinge loss for $\phi_i$. This corresponds to
$$\phi_i(w) = \max(0,\, 1 - y_i x_i^\top w), \qquad \phi_i^*(-\alpha_i) = -\alpha_i y_i \quad \text{(for } \alpha_i y_i \in [0,1]\text{)}.$$

Example 15.7 (Least-squares linear regression). We would use the squared loss for $\phi_i$. This corresponds to
$$\phi_i(w) = (w^\top x_i - y_i)^2, \qquad \phi_i^*(-\alpha_i) = -\alpha_i y_i + \alpha_i^2/4.$$

We end with a fact that relates the smoothness of $\phi_i$ to the strong convexity of $\phi_i^*$.

Fact 15.8. If $\phi_i$ is $\frac{1}{\gamma}$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
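The following is a minimal numerical sketch (not from the notes) of the primal and dual objectives for the linear SVM of Example 15.6. The synthetic data, the helper names `primal` and `dual`, and the particular dual-feasible $\alpha$ are illustrative assumptions; the point is simply to see the weak duality inequality $D(\alpha) \le P(w)$ hold.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 5, 0.1

X = rng.normal(size=(n, m))                      # columns are the examples x_i
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))  # enforce ||x_i|| <= 1
y = np.sign(rng.normal(size=m))                  # labels in {-1, +1}

def primal(w):
    """P(w) = (1/m) sum_i max(0, 1 - y_i <w, x_i>) + (lam/2) ||w||^2."""
    margins = y * (X.T @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + 0.5 * lam * (w @ w)

def dual(alpha):
    """D(alpha) = (1/m) sum_i -phi_i^*(-alpha_i) - (lam/2) ||w(alpha)||^2,
    where for hinge loss phi_i^*(-alpha_i) = -alpha_i y_i (alpha_i y_i in [0,1])."""
    w = X @ alpha / (lam * m)   # w(alpha) = (1/(lam m)) sum_i alpha_i x_i
    return np.mean(alpha * y) - 0.5 * lam * (w @ w)

# Any dual-feasible alpha (alpha_i y_i in [0, 1]) lower-bounds any primal value.
alpha = 0.5 * y                 # alpha_i y_i = 0.5 for every i
w = X @ alpha / (lam * m)
print(f"D(alpha) = {dual(alpha):.4f} <= P(w(alpha)) = {primal(w):.4f}")
```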

15.2 Stochastic dual coordinate ascent (SDCA)

In this section we discuss a particular algorithm for empirical risk minimization which makes use of Fenchel duality. Its main idea is picking an index $i \in [m]$ at random, and then solving the dual problem at coordinate $i$, while keeping the other coordinates fixed. More precisely, the algorithm performs the following steps:

1. Start from $w^0 := w(\alpha^0)$.

2. For $t = 1, \ldots, T$:

   (a) Randomly pick $i \in [m]$.

   (b) Find $\Delta\alpha_i$ which maximizes
   $$-\phi_i^*\bigl(-(\alpha_i^{t-1} + \Delta\alpha_i)\bigr) - \frac{\lambda m}{2}\Bigl\|w^{t-1} + \frac{1}{\lambda m}\Delta\alpha_i x_i\Bigr\|^2.$$

3. Update the dual and primal solution:

   (a) $\alpha^t = \alpha^{t-1} + \Delta\alpha_i e_i$, where $e_i$ is the $i$-th standard basis vector,

   (b) $w^t = w^{t-1} + \frac{1}{\lambda m}\Delta\alpha_i x_i$.

For certain loss functions, the maximizer $\Delta\alpha_i$ is given in closed form. For example, for hinge loss it is given explicitly by:
$$\Delta\alpha_i = y_i \max\Bigl(0,\, \min\Bigl(1,\; \frac{1 - x_i^\top w^{t-1} y_i}{\|x_i\|^2/(\lambda m)} + \alpha_i^{t-1} y_i\Bigr)\Bigr) - \alpha_i^{t-1},$$
and for squared loss it is given by:
$$\Delta\alpha_i = \frac{y_i - x_i^\top w^{t-1} - 0.5\,\alpha_i^{t-1}}{0.5 + \|x_i\|^2/(\lambda m)}.$$
Note that these updates require both the primal and dual solutions to perform the update.
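Below is a compact sketch of the SDCA loop for the hinge loss, plugging in the closed-form $\Delta\alpha_i$ above. The function name `sdca_hinge`, the data layout (columns of `X` are the $x_i$), and the hyperparameters are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def sdca_hinge(X, y, lam, T, seed=0):
    """SDCA sketch for hinge-loss ERM, using the closed-form coordinate update.

    X has shape (n, m) with columns x_i; y has entries in {-1, +1}.
    Maintains alpha in R^m and the primal iterate w = (1/(lam m)) sum_i alpha_i x_i.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    alpha = np.zeros(m)
    w = np.zeros(n)                    # w^0 = w(alpha^0) = 0
    sq_norms = np.sum(X**2, axis=0)    # precomputed ||x_i||^2

    for _ in range(T):
        i = rng.integers(m)            # (a) pick a coordinate at random
        # (b) closed-form maximizer for the hinge loss:
        #     Delta = y_i max(0, min(1, (1 - x_i^T w y_i)/(||x_i||^2/(lam m))
        #                               + alpha_i y_i)) - alpha_i
        margin = (X[:, i] @ w) * y[i]
        delta = y[i] * max(0.0, min(1.0, (1.0 - margin) / (sq_norms[i] / (lam * m))
                                         + alpha[i] * y[i])) - alpha[i]
        # (3) update the dual and primal iterates
        alpha[i] += delta
        w += delta * X[:, i] / (lam * m)
    return w, alpha

# Illustrative usage on the synthetic (X, y, lam) from the previous sketch:
# w, alpha = sdca_hinge(X, y, lam=0.1, T=2000)
```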

Now we state a lemma given in [SSZ13], which implies linear convergence of SDCA. In what follows, assume that $\|x_i\| \le 1$, $\phi_i(x) \ge 0$ for all $x$, and $\phi_i(0) \le 1$.

Lemma 15.9. Assume $\phi_i^*$ is $\gamma$-strongly convex, where $\gamma > 0$. Then:
$$\mathbb{E}[D(\alpha^t) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[P(w^{t-1}) - D(\alpha^{t-1})],$$
where $s = \frac{\lambda m \gamma}{1 + \lambda m \gamma}$.

We leave out the proof of this result, but give a short argument that proves linear convergence of SDCA using this lemma. Denote $\epsilon_D^t := D(\alpha^*) - D(\alpha^t)$. Because the dual solution provides a lower bound on the primal solution, it follows that:
$$\epsilon_D^t \le P(w^t) - D(\alpha^t).$$
Further, we can write:
$$D(\alpha^t) - D(\alpha^{t-1}) = \epsilon_D^{t-1} - \epsilon_D^t.$$
By taking expectations on both sides of this equality and applying Lemma 15.9, we obtain:
$$\mathbb{E}[\epsilon_D^{t-1} - \epsilon_D^t] = \mathbb{E}[D(\alpha^t) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[P(w^{t-1}) - D(\alpha^{t-1})] \ge \frac{s}{m}\,\mathbb{E}[\epsilon_D^{t-1}].$$
Rearranging terms and recursively applying the previous argument yields:
$$\mathbb{E}[\epsilon_D^t] \le \Bigl(1 - \frac{s}{m}\Bigr)\mathbb{E}[\epsilon_D^{t-1}] \le \Bigl(1 - \frac{s}{m}\Bigr)^t \epsilon_D^0.$$
From this inequality we can conclude that we need $O\bigl((m + \frac{1}{\lambda\gamma})\log(1/\epsilon)\bigr)$ steps to achieve $\epsilon$ dual error.

Using Lemma 15.9, we can also bound the primal error. Again using the fact that the dual solution underestimates the primal solution, we obtain the bound in the following way:
$$\mathbb{E}[P(w^t) - P(w^*)] \le \mathbb{E}[P(w^t) - D(\alpha^t)] \le \frac{m}{s}\,\mathbb{E}[D(\alpha^{t+1}) - D(\alpha^t)] \le \frac{m}{s}\,\mathbb{E}[\epsilon_D^t],$$
where the last inequality ignores the negative term $-\mathbb{E}[\epsilon_D^{t+1}]$.

References

[SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.