Math 301, Winter 2015. Solutions to Homework 3. (Provided by Austin Benson and Victor Minden, with some modifications.)

1. Recognizing the convexity of g(x) := x log x, from Jensen's inequality we get

    d(x) = Σ_{i=1}^n x_i log x_i + log n ≥ n · ((x_1 + ⋯ + x_n)/n) log((x_1 + ⋯ + x_n)/n) + log n = log(1/n) + log n = 0,

since x_1 + ⋯ + x_n = 1 on the simplex; equality is attained only at x = (1/n, …, 1/n).

On the interior of the simplex, the Hessian ∇²d(x) is diagonal with i-th diagonal entry 1/x_i > 1. So ∇²d(x) ⪰ I ≻ 0. Hence, the strong convexity parameter is at least one.

Alternatively, it is acceptable to show that d(x) is 1-strongly convex in the ℓ1 norm. To this end, note that

    hᵀ ∇²d(x) h = Σ_{i=1}^n h_i²/x_i = (Σ_i x_i)(Σ_i h_i²/x_i) ≥ (Σ_i |h_i|)² = ‖h‖₁²,

where we make use of the fact Σ_i x_i = 1 and the Cauchy–Schwarz inequality.

2. We have that f_µ(x) = sup_{(u,v)∈Q} ⟨u − v, Ax − b⟩ − µ d(u, v). We can re-write this as the negated optimal value of

    minimize_{w ∈ ℝ^{2m}}  µ d(w) + wᵀc   subject to  wᵀ1 = 1

(with w ≥ 0 implicit in the domain of d), where w = (u, v) and c = (−Ax + b, Ax − b). We will follow some of the steps in Boyd & Vandenberghe [1] for the conjugate of the entropy function. For the function of a single variable f(z) = µ z log z + az, the conjugate is

    f*(y) = sup_{z ∈ [0,1]} yz − µ z log z − az.

Taking a derivative and setting it equal to zero gives z = e^{(y−a)/µ − 1}. Plugging in this solution gives f*(y) = µ e^{(y−a)/µ − 1}. Therefore, the conjugate of µ d(w) + wᵀc is

    f*(y) = Σ_{i=1}^{2m} µ e^{(y_i − c_i)/µ − 1}.

Hence, the dual function is

    g(λ) = −λ − f*(−λ·1) = −λ − Σ_{i=1}^{2m} µ e^{(−λ − c_i)/µ − 1} = −λ − µ e^{−(λ+µ)/µ} Σ_{i=1}^{2m} e^{−c_i/µ}.

We can maximize the dual over λ (setting g′(λ) = 0 gives λ = µ log Σ_i e^{−c_i/µ} − µ) and get that

    max_λ g(λ) = −µ log ( Σ_{i=1}^{2m} e^{−c_i/µ} ).
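For illustration (this sketch is not part of the original write-up; NumPy is assumed), both claims of problem 1 can be checked numerically at random interior points of the simplex: d(x) ≥ 0, and the quadratic form hᵀ∇²d(x)h dominates both ‖h‖₂² and ‖h‖₁².

```python
import numpy as np

def entropy_prox_checks(n=20, trials=100, seed=0):
    """Numerically verify d(x) >= 0 on the simplex and the strong convexity
    bounds h^T Hd(x) h >= ||h||_2^2 and h^T Hd(x) h >= ||h||_1^2."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.random(n) + 1e-3
        x /= x.sum()                          # interior point of the simplex
        d = np.sum(x * np.log(x)) + np.log(n)
        assert d >= -1e-12                    # Jensen: d(x) >= 0
        h = rng.standard_normal(n)
        quad = np.sum(h**2 / x)               # h^T Hd(x) h, with Hd = diag(1/x_i)
        assert quad >= h @ h - 1e-9           # Hd(x) >= I on the simplex
        assert quad >= np.linalg.norm(h, 1)**2 - 1e-9   # Cauchy-Schwarz bound
    return True
```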

But we know that c = (−Ax + b, Ax − b), so

    µ log ( Σ_{i=1}^{2m} e^{−c_i/µ} ) = µ log ( Σ_{i=1}^m 2 cosh((a_iᵀx − b_i)/µ) ).

By strong duality we know that this equals the optimal value of the primal minimization. Finally, we account for the µ log(2m) term and the fact that we need to negate the value (since we switched from maximizing to minimizing the objective). Putting everything together gives

    f_µ(x) = −µ log(2m) + µ log ( Σ_{i=1}^m 2 cosh((a_iᵀx − b_i)/µ) ) = µ log ( (1/m) Σ_{i=1}^m cosh((a_iᵀx − b_i)/µ) ).

We will need the gradient to run our algorithm for the next question. The derivative with respect to x_j is

    ∂f_µ(x)/∂x_j = ( Σ_{i=1}^m sinh((a_iᵀx − b_i)/µ) a_{ij} ) / ( Σ_{i=1}^m cosh((a_iᵀx − b_i)/µ) ),

so the gradient is given by

    ∇f_µ(x) = Aᵀ sinh((Ax − b)/µ) / ( 1ᵀ cosh((Ax − b)/µ) ),

where cosh and sinh are taken entry-wise.

3. We consider three algorithms for solving the optimization problem: (1) a standard optimization solver (cvx); (2) gradient descent on f_µ with the fixed value µ₀ = 0.05 (fixed smoothing); and (3) adaptive gradient descent on f_µ, where µ ranges from 5 down to µ₀ = 0.05, decreasing by a constant factor at each iteration (adaptive smoothing). We restrict each sub-problem of adaptive smoothing to one tenth the number of iterations used by fixed smoothing, in order to control the running time of the former algorithm. For gradient descent, we compute the step size from a line search.

Table 1 summarizes the performance results on a test problem where A ∈ ℝ^{00×50}.

    Table 1: Summary of performance of algorithms on the test problem when using entropy smoothing.

    Algorithm                    ‖Ax − b‖_∞   time (seconds)
    cvx                          0.54         0.38
    fixed smoothing (µ = 0.05)   0.05         .0
    adaptive smoothing           0.9          0.33

The cvx solution is the best, and its running time is about the same as that of adaptive smoothing. The fixed smoothing algorithm is by far the slowest. Figure 1 shows how the solution from adaptive smoothing varies with µ: adaptive smoothing approaches an objective value close to that of fixed smoothing, but the adaptive version is much faster.
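For illustration, the smoothed objective and the fixed/adaptive gradient-descent schemes can be sketched in Python (this is a NumPy reconstruction, not the original experiment code; the continuation factor, iteration budgets, and backtracking constants are assumptions, and the shift by max|r_i/µ| is a standard log-sum-exp trick to avoid overflow for small µ):

```python
import numpy as np

def f_grad_entropy(A, b, x, mu):
    """Entropy-smoothed objective f_mu(x) = mu*log((1/m) sum_i cosh(r_i/mu))
    and its gradient A^T sinh(r/mu) / (1^T cosh(r/mu)), with r = Ax - b."""
    m = A.shape[0]
    r = (A @ x - b) / mu
    t = np.max(np.abs(r))                    # stabilizing shift
    c = 0.5 * (np.exp(r - t) + np.exp(-r - t))   # e^{-t} * cosh(r)
    s = 0.5 * (np.exp(r - t) - np.exp(-r - t))   # e^{-t} * sinh(r)
    f = mu * (t + np.log(np.sum(c)) - np.log(m))
    g = A.T @ s / np.sum(c)
    return f, g

def grad_descent(oracle, x, iters):
    """Gradient descent with an Armijo backtracking line search."""
    for _ in range(iters):
        f, g = oracle(x)
        t = 1.0
        while oracle(x - t * g)[0] > f - 0.5 * t * (g @ g):
            t *= 0.5
        x = x - t * g
    return x

def fixed_smoothing(A, b, x, mu0=0.05, iters=500):
    return grad_descent(lambda y: f_grad_entropy(A, b, y, mu0), x, iters)

def adaptive_smoothing(A, b, x, mu_start=5.0, mu0=0.05, factor=1.5, iters=50):
    """Continuation with warm starts: a few gradient steps per value of mu,
    shrinking mu geometrically and reusing the last iterate as the next start."""
    mu = mu_start
    while mu >= mu0:
        x = grad_descent(lambda y: f_grad_entropy(A, b, y, mu), x, iters)
        mu /= factor
    return x
```

The warm start is the point of the adaptive variant: each sub-problem starts from the minimizer of a slightly smoother surrogate, so a small iteration budget per value of µ suffices.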

Figure 1: Performance of adaptive smoothing as µ varies for entropy smoothing (‖Ax − b‖_∞ vs. µ; curves for adaptive smoothing, fixed smoothing, and cvx).

4. Let r = Ax − b. We have

    f_µ(r) = sup_{‖u‖₁ ≤ 1} uᵀr − (µ/2)‖u‖₂²
           = sup_{‖u‖₁ ≤ 1} −(µ/2)‖u − r/µ‖₂² + (µ/2)‖r/µ‖₂²
           = (µ/2) [ ‖r/µ‖₂² − inf_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂² ].

The solution to inf_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂² is the Euclidean projection of r/µ onto the ℓ1 ball. By the optimality conditions, we know that this is achieved by soft thresholding at some unknown threshold λ [2]. Specifically,

    u_i = sgn(r_i) ( |r_i|/µ − λ )₊

for the λ that satisfies Σ_i ( |r_i|/µ − λ )₊ = 1. Furthermore, we can use bisection to find this λ, using the fact that λ ∈ [0, max_i |r_i|/µ].

Now we compute the gradient. We know that f_µ(r) is the conjugate function of f(u) = I_{‖u‖₁ ≤ 1}(u) + (µ/2)‖u‖₂². Thus, from the lecture notes,

    ∇f_µ(r) = argmax_{‖u‖₁ ≤ 1} [ rᵀu − (µ/2)‖u‖₂² ] = argmin_{‖u‖₁ ≤ 1} ‖u − r/µ‖₂²,

i.e., ∇f_µ(r) is just the projection of r/µ onto the ℓ1 ball. Applying the chain rule (since r = Ax − b), we arrive at

    ∇f_µ(x) = Aᵀ · argmin_{‖u‖₁ ≤ 1} ‖u − (Ax − b)/µ‖₂².

5. Now we solve the problem with quadratic smoothing and µ₀ = 0.01. Table 2 summarizes the performance results on the same test problem. We note that cvx is around two orders of
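The soft-threshold-plus-bisection step just described can be sketched as follows (an illustrative implementation, not the original code; the tolerance and the early return for points already inside the ball are assumptions):

```python
import numpy as np

def project_l1_ball(v, radius=1.0, tol=1e-12):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}, found by
    bisecting on the soft-threshold level lambda in [0, max_i |v_i|]."""
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()                       # already feasible: project to itself
    lo, hi = 0.0, a.max()
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.maximum(a - lam, 0.0).sum() > radius:
            lo = lam                          # threshold too small: still outside
        else:
            hi = lam
    return np.sign(v) * np.maximum(a - hi, 0.0)
```

With this routine, the gradient derived above is simply `A.T @ project_l1_ball((A @ x - b) / mu)`.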

magnitude faster than our solver. In this case, fixed smoothing did not converge within the allotted iterations, so the resulting objective value is large. (This is motivation to use an accelerated method, but we will not consider that here.) On the other hand, adaptive smoothing finds a solution with residual comparable to the cvx solution.

    Table 2: Summary of performance of algorithms on the same test problem when using quadratic smoothing.

    Algorithm                    ‖Ax − b‖_∞   time (seconds)
    cvx                          0.54         0.3769
    fixed smoothing (µ = 0.01)   5.374        3.49
    adaptive smoothing           0.554        8.43

Figure 2 shows how the solution from adaptive smoothing varies with µ; the figure illustrates the benefit of warm starts. Overall, adaptive smoothing performs much better than fixed smoothing.

Figure 2: Performance of adaptive smoothing as µ varies for quadratic smoothing (‖Ax − b‖_∞ vs. µ; curves for adaptive smoothing, fixed smoothing, and cvx).

6. We now test on a problem of size A ∈ ℝ^{3000×750}. To keep our analysis simple, we will just compare entropy and quadratic smoothing, using fixed smoothing with a few different values of µ. To compare the quality of the solutions, we use the true residual ‖Ax − b‖_∞ and the relative error ‖x₀ − x‖/‖x₀‖ of the computed solution x with respect to the vector x₀ used to generate the data (b = Ax₀ + z, where z is noise). Table 3 summarizes the performance of the two smoothing techniques. In general, entropy smoothing is much faster, while quadratic smoothing finds a solution that is as good or slightly better. Furthermore, entropy smoothing was much easier to implement. For these reasons, I prefer entropy smoothing.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Table 3: Summary of performance of fixed smoothing with the two different smoothing techniques for min ‖Ax − b‖_∞, where A ∈ ℝ^{3000×750}. The data is generated synthetically by b = Ax₀ + z, where A and x₀ have entries N(0, 1) and z has small Gaussian entries.

    µ     smoothing   ‖Ax − b‖_∞   ‖x₀ − x‖/‖x₀‖   time (seconds)
    000   entropy     0.030        .99e-4          .84
          quadratic   0.037        .98e-4          33.77
    500   entropy     0.037        .99e-4          0.89
          quadratic   0.037        .98e-4          33.77
    00    entropy     0.037        .98e-4          .33
          quadratic   0.037        .98e-4          64.3

[2] N. Parikh and S. Boyd. Proximal Algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2013.