Solution of Exercise Sheet

Size: px

Start display at page:

Download "Solution of Exercise Sheet"

Maximilian Robinson
5 years ago
Views:

1 Conve Optimization and Modeling Jun.-Prof. Matthias Hein Solution of Eercise Sheet Eercise 19 - Subgradient Steepest Descent a. (3 Points) Consider the minimization of the two-dimensional function { f( 1, 2 ) = , if 1 > 2, , if 1 2 using the steepest descent method, which moves from the current point in the opposite direction of the minimum norm subgradient with eact line search. Suppose that the algorithm starts anywhere within the set {( 1, 2 ) 1 > 2 > (9/16) 2 1 }. Verify computationally that it converges to the nonoptimal point (, ). What happens if a subgradient method with a constant stepsize is used instead? Check computationally. Provide for both cases the code you have used. Solution: function Eercise9 cur = [2;1]; alpha=.1; path=[cur]; fcur = f(cur); fvals = [fcur]; for i=1:5 subg = gradf(cur); %alpha = EactLineSearch(cur,fcur,subg); alpha=.5; new = cur - alpha*subg; cur = new; fvals = [fvals, f(cur)]; path=[path, new]; figure, hold on, [X,Y]=meshgrid(-1:.5:2,-1:.5:1); Z = zeros(size(x,1),size(x,2)); for i=1:size(x,1), for j=1:size(x,2) Z(i,j)=f([X(i,j);Y(i,j)]); surfc(x,y,z); nump = size(path,2); for i=1:nump 1

2 plot3(path(1,i),path(2,i),fvals(i),.-, Color,[i/numP,i/numP,i/numP], MarkerSize,2); hold off function alpha = EactLineSearch(cur,fcur,subg) % naive eact line search by bisection % first we search for a point along the line which has larger objective % than the current point alpha=1; NOTFOUND=1; while NOTFOUND new = cur - alpha*subg; fnew = f(new); if(fnew>fcur) NOTFOUND=; alpha=2*alpha; if(alpha>1e15) error([ Problem seems to be unbounded from below ]); left=; right=alpha; fleft=fcur; fright=f(cur-alpha*subg); while (right-left)>1e-15 new = cur -.5*(right+left)*subg; if(-subg *gradf(new)>) % ascent at the middle point right=.5*(right+left); % descent at the middle point left =.5*(right+left); alpha=right; function fval = f() % returns the function value at if((1)>abs((2))) fval = 5*sqrt(9*(1)^2 + 16*(2)^2); fval = 9*(1) + 16*abs((2)); function gradfval = gradf() % returns the (sub)-gradient of the function at if((1)>abs((2))) gradfval = 5/2/sqrt(9*(1)^2 + 16*(2)^2)*[18*(1); 32*(2)]; % for the point (,) we use the subgradient with minimal norm as % required in the steepest descent subgradient method gradfval = 9*[1; ] + 16*u; 2

3 Figure 1: Left: Path found by the subgradient method with eact line search. It converges to the origin (if one stops early enough as due to numerical precision the origin is not stable and so the method will eventually converge to but very, very slow!). The problem is the change of the function from the set where it is differentiable to the non-differentiable set at the origin - where the path k has to pass through if one starts in the described area. Right: With fied stepsize the problem does not occur as one simply jumps over the difficult place. The oscillating behavior comes from the jumping derivative and the fied stepsize. Eercise 2 - L 1 -Minimization We want to find the solution of a problem of form, min F () := f() + λ 1, R d where λ and f is a conve, continuously twice-differentiable function which has Lipschitzcontinuous gradient, that is, f() f(y) L y, where L is the Lipschitz-constant. Problems of this type arise in machine learning (loss function + l1-regularization e.g. the lasso), signal processing and image processing. a. (3 Points) Show that the Lipschitz-condition implies that, f(y) f() + f(), y + L 2 y 2. b. (3 Points) Using the result of the previous part, we derive the following upper bound on the objective F (). For any z R d, F () f(z) + f(z), z + L 2 z 2 + λ 1. Derive the closed-form solution of the following problem, which corresponds to the minimization of the upper bound for a particular point z, arg min f(z), z + L 2 z 2 + λ 1 c. (3 Points) Implement the following iterative method, (k+1) = arg min f( k ), k + L 2 3 k 2 + λ 1,

4 which minimizes the upper bound on the objective using the point from the previous iteration for the function. This method is called: Iterative Shrinkage Thresholding Algorithm (ISTA). Use the particular quadratic function f, This corresponds to the Lasso-Problem. webpage. f() = A Y 2, Data and dummy code will be provided on the Solution: a. The fundamental theorem of differential and integral calculus is as follows, f(y) f() = y f (t) dt. The transfer to the multivariate setting can be done by considering the curve, γ(t) = + t(y ) with tangent vector γ = y. The directional derivative of f(γ(t)) is given as f (γ(t)) = f(γ(t)), γ(t) = f( + t(y )), y. Note, that f(γ()) = and f(γ(1)) = y. Bringing everything together, we get f(y) f() f(), y = f( + t(y )) f(), y dt f( + t(y )) f() y dt L t y 2 dt = L 2 y 2. Thus the Lipschitz-condition for the gradient implies Hf L. b. We get the optimality condition, f(z) + L( z) + λu, where u i = sign( i ) if i and u i [ 1, 1] if i =. Reformulating in terms of, i z i 1 L ( f) i λ L u i. The argumentation is now in the same way as done in the lecture. If z i 1 L ( f) i > λ L, then i has to be positive and i = z i 1 L ( f) i λ L, and analogously if z i 1 L ( f) i < λ L, then i has to be negative and i = z i 1 L ( f) i + λ L. If z i 1 L ( f) i > λ L λ L the only way the equation can be fulfilled, is i =. In total we get, ( 1 = S λ z L L f), where S α is the so-called soft thresholding operator defined as, i α, if i > α, (S α ()) i =, if i α, i + α, if i < α. c. We first compute the Lipschitz constant of f, f() = 2A T (A Y ), and thus f() f(y) = 2A T A( y), from which we deduce L 2 A T A = 2λ ma (A T A). 4

5 Figure 2: We use as stopping criterion f(k+1 ) f( k ) f( k ) Left: The blue line shows the descent for the correct (tight) Lipschitz constant - convergence is achieved in 3 steps. The red line shows the descent when we replace A T A by A T A F the Frobenius-norm, which is easy to compute and known to be an upper bound on A T A. Convergence then takes 5433 steps - so almost 2 times as long. Right: The data Y (red curve) is a mass-spectrogram which consists of a few isotopic patterns, thus apart from the noise the signal has a sparse representation in the isotopic pattern basis (plot the columns of A). The solution is sparse (count the non-zeros of ) and the fit A (blue curve) is quite good of without fitting too much the noise. Note, that we have an underfit as we penalize the coefficients. 5

Optimization methods

Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to