The iterative convex minorant algorithm for nonparametric estimation


The iterative convex minorant algorithm for nonparametric estimation

Report 95-05

Geurt Jongbloed

Technische Universiteit Delft / Delft University of Technology
Faculteit der Technische Wiskunde en Informatica / Faculty of Technical Mathematics and Informatics

ISSN 0922-5641

Copyright © 1995 by the Faculty of Technical Mathematics and Informatics, Delft, The Netherlands. No part of this Journal may be reproduced in any form, by print, photoprint, microfilm, or any other means without permission from the Faculty of Technical Mathematics and Informatics, Delft University of Technology, The Netherlands. Copies of these reports may be obtained from the bureau of the Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, phone +31 15 2784568. A selection of these reports is available in PostScript form at the Faculty's anonymous ftp-site. They are located in the directory /pub/publications/tech-reports at ftp.twi.tudelft.nl

The iterative convex minorant algorithm for nonparametric estimation

Geurt Jongbloed
Department of Mathematics, Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands

Abstract

The problem of minimizing a smooth convex function over a basic cone in $\mathbb{R}^n$ is frequently encountered in nonparametric statistics. For that type of problem we suggest an algorithm and show that this algorithm converges to the solution of the minimization problem.

AMS 1991 subject classifications: primary 65U05, secondary 62G05.
Key words and phrases: global convergence, inverse problems, isotonic regression.

1 Introduction

Groeneboom & Wellner (1992) introduce the iterative convex minorant (ICM) algorithm to compute nonparametric maximum likelihood estimators (NPMLEs) for distribution functions in some statistical inverse problems. Using the specific structure of the interval censoring case II problem, Aragon & Eberly (1992) show the ICM algorithm to be locally convergent under the assumption that the points of jump of the NPMLE are known in advance. Determining these points of jump is, however, the main part of the problem. In this paper we describe the ICM algorithm in its general form, show that in general it need not converge, and propose a modified version that does converge under mild conditions.

The ICM algorithm is tailored to minimizing a smooth convex function over one of the cones $C$ or $C_+$ in $\mathbb{R}^n$, which are defined by

$$C = \{x \in \mathbb{R}^n : x_1 \le x_2 \le \cdots \le x_n\} \quad \text{and} \quad C_+ = \{x \in C : x_1 \ge 0\}. \tag{1}$$

Although this problem might seem rather specific, it is in fact quite general: convex optimization problems over more general finitely generated closed convex cones $K$ in $\mathbb{R}^n$ arising in statistics can often be rewritten in terms of one of the cones $C$ or $C_+$. Examples of estimation problems where the algorithm can be applied to compute the NPMLE are interval censoring case II, deconvolution, and Wicksell's problem (see also Jongbloed (1995)). Another example where it can be applied, maximum likelihood estimation of a convex decreasing density, is given in section 5. Least squares estimators of a convex regression function can also be computed by means of the ICM algorithm.

For some of these examples other algorithms have also been proposed. For censoring problems, the expectation maximization (EM) algorithm is frequently used (see e.g. Dempster et al. (1977), and Wu (1983) for a convergence proof of this algorithm). The experience with this algorithm is that it converges rather slowly to the solution of the optimization problem. Recently, a combination of the EM and ICM algorithms for censoring problems was proposed in Zhan & Wellner (1995); simulation results indicate that this hybrid algorithm behaves very well for the double censoring model. General optimization techniques such as interior point methods have also been applied to some statistical estimation problems; see e.g. Terlaky & Vial (1995).

Some known results from optimization theory and the theory of isotonic regression are reviewed in section 2. In section 3 we show that in general the ICM algorithm does not converge to the solution of the minimization problem, and we give a modified ICM algorithm in pseudo code. For this modified algorithm we prove a global convergence result in section 4. Finally, in section 5, we compute the maximum likelihood estimator of a convex and decreasing density on $[0,\infty)$ using the modified ICM algorithm. Additionally, a useful lemma is proved which states that the ICM algorithm can also be used to maximize loglikelihood-type functions over the intersection of a closed convex cone and a hyperplane in $\mathbb{R}^n$.

2 Review of some known results

Let $K$ be a cone in $\mathbb{R}^n$ and let $\phi$ satisfy

Condition 1. $\phi : \mathbb{R}^n \to (-\infty, \infty]$ is (1) convex and attains its minimum over $K$ at a unique point $\hat{x}$, (2) continuous, and (3) continuously differentiable on the set $\{x \in \mathbb{R}^n : \phi(x) < \infty\}$.

Writing $\nabla\phi$ for the vector of partial derivatives of $\phi$,

$$\nabla\phi(x) = \left( \frac{\partial\phi}{\partial x_1}(x), \ldots, \frac{\partial\phi}{\partial x_n}(x) \right)^T,$$

and $(\cdot\,,\cdot)$ for the usual inner product in $\mathbb{R}^n$, it is known from e.g. Robertson et al. (1988), section 6.2, that $\hat{x} = \mathrm{argmin}_{x \in K}\, \phi(x)$ if and only if $\hat{x} \in K$ satisfies

$$\forall x \in K : (x, \nabla\phi(\hat{x})) \ge 0 \quad \text{and} \quad (\hat{x}, \nabla\phi(\hat{x})) = 0. \tag{2}$$
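
For $K = C$ the conditions in (2) are easy to test numerically: $C$ is generated by the constant vectors $\pm(1,\ldots,1)^T$ and the vectors $(0,\ldots,0,1,\ldots,1)^T$, so the inequality part of (2) reduces to sign conditions on tail sums of the gradient (these reappear in the stopping criterion of section 3). A minimal Python/numpy sketch of such a test (the function name and tolerance are ours, added for illustration):

    import numpy as np

    def satisfies_opt_conditions_on_C(x, grad, tol=1e-8):
        """Check (2) for K = C: (v, grad) >= 0 for every generator v of C,
        i.e. sum_{i >= j} grad_i >= 0 for all j, with equality for j = 1,
        together with the complementarity condition (x, grad) = 0."""
        tails = np.cumsum(grad[::-1])[::-1]   # tails[j-1] = sum_{i >= j} grad_i
        return (tails.min() >= -tol and       # inequality part of (2)
                abs(tails[0]) <= tol and      # equality for j = 1 (the +-(1,...,1) directions)
                abs(x @ grad) <= tol)         # (x, grad phi(x)) = 0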

Taking for $K$ one of the cones $C$ or $C_+$ as defined in (1), and for $\phi$ the quadratic function

$$q(x) = \tfrac{1}{2}(x - y)^T W (x - y)$$

for some fixed $y \in \mathbb{R}^n$ and positive definite diagonal matrix $W = \mathrm{diag}(w_i)$, the optimality conditions in (2) have a nice geometric interpretation. Indeed, for $K = C$, $\hat{x}_i$ is the left derivative of the convex minorant of the cumulative sum diagram consisting of the points $P_0 = (0,0)$ and

$$P_j = \left( \sum_{l=1}^j w_l \,,\; \sum_{l=1}^j w_l y_l \right) \quad \text{for } 1 \le j \le n,$$

evaluated at the point $P_i$. If $K = C_+$, the only difference is that the negative components of $\hat{x}$ should be changed to zero. This geometric interpretation of the optimality conditions, when the object function is a quadratic form with a diagonal matrix of second order derivatives and the cone is $C$ or $C_+$, is the backbone of the theory of isotonic regression as it can be found in Barlow et al. (1972) and Robertson et al. (1988).

Let $x^{(0)} \in C$ be fixed and let $k = 0$. The idea behind the ICM algorithm is to approximate the convex function $\phi$ locally near $x^{(k)}$ by a quadratic form of the type

$$q(x; x^{(k)}) = \tfrac{1}{2} \left( x - x^{(k)} + W(x^{(k)})^{-1} \nabla\phi(x^{(k)}) \right)^T W(x^{(k)}) \left( x - x^{(k)} + W(x^{(k)})^{-1} \nabla\phi(x^{(k)}) \right),$$

where $W(x^{(k)})$ is a positive definite diagonal matrix depending on $x^{(k)}$. The next iterate $x^{(k+1)}$ is then defined as the minimizer of $q(\,\cdot\,; x^{(k)})$ over $C$. Incrementing $k$ by one and repeating the procedure gives the iterative algorithm. Since $x^{(k+1)}$ can be determined by taking the left derivative of the convex minorant of the cumulative sum diagram consisting of the points $P_0 = (0,0)$ and

$$P_j = \left( \sum_{l=1}^j w_l^{(k)} \,,\; \sum_{l=1}^j \left( w_l^{(k)} x_l^{(k)} - \frac{\partial\phi}{\partial x_l}(x^{(k)}) \right) \right) \quad \text{for } 1 \le j \le n,$$

where $w_l^{(k)}$ denotes the $l$-th diagonal entry of $W(x^{(k)})$, the name iterative convex minorant algorithm is justified.
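
The left derivative of the convex minorant of a cumulative sum diagram can be computed in $O(n)$ time by the pool adjacent violators algorithm from the isotonic regression literature (Barlow et al. (1972)). A minimal Python/numpy sketch of this computation (the naming is ours, added for illustration):

    import numpy as np

    def min_quadratic_over_C(y, w):
        """Minimize 0.5 * sum_i w_i (x_i - y_i)^2 over C = {x_1 <= ... <= x_n}.
        The block means produced by pooling adjacent violators are exactly the
        left derivatives of the convex minorant of the cusum diagram with
        points P_j = (sum_{l<=j} w_l, sum_{l<=j} w_l y_l)."""
        blocks = []                                    # each block: [weight, mean, length]
        for wi, yi in zip(w, y):
            blocks.append([wi, yi, 1])
            while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
                w2, m2, c2 = blocks.pop()              # pool the two rightmost blocks
                w1, m1, c1 = blocks.pop()
                blocks.append([w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2), c1 + c2])
        return np.concatenate([np.full(c, m) for _, m, c in blocks])

    def min_quadratic_over_C_plus(y, w):
        # for K = C_+: change the negative components to zero, as described above
        return np.maximum(min_quadratic_over_C(y, w), 0.0)

With $y$ replaced by $x^{(k)} - W(x^{(k)})^{-1}\nabla\phi(x^{(k)})$ and $w$ by the diagonal of $W(x^{(k)})$, one call to this routine performs a single (unmodified) ICM step as described above.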

3 Description of the algorithm

In this section $K$ denotes one of the cones $C$ and $C_+$ rather than a general cone in $\mathbb{R}^n$, and $\phi$ satisfies Condition 1. An iterative optimization algorithm to approximate

$$\hat{x} = \mathop{\mathrm{argmin}}_{y \in K} \phi(y)$$

is properly specified by an initial point $x^{(0)} \in K$, an algorithmic map $A$ and a termination criterion. An algorithmic map is a mapping $x \mapsto A(x)$ defined on $K$ and taking values in the class of nonempty subsets of $K$. The algorithm can then be formulated as: $k := 0$; while the stopping criterion is not satisfied, $x^{(k+1)} \in A(x^{(k)})$ and $k := k + 1$. Once a continuous mapping $x \mapsto W(x)$ from $K$ to the class of positive definite diagonal matrices (equipped with its usual matrix norm) is specified, the algorithmic map associated with the ICM algorithm is given by

$$B(x) = \mathop{\mathrm{argmin}}_{y \in K} \tfrac{1}{2} \left( y - x + W(x)^{-1}\nabla\phi(x) \right)^T W(x) \left( y - x + W(x)^{-1}\nabla\phi(x) \right),$$

where we adopt the convention of leaving out the curly brackets when the set returned by an algorithmic map is a singleton. Note that, by the continuity of $x \mapsto W(x)$ and Condition 1, the mapping $B$ is continuous at each point $x$ where $\phi(x) < \infty$.

Taking

$$\phi(x) = x_2 - x_1 + \tfrac{1}{2} x_1 x_2 + \tfrac{3}{4}(x_1^2 + x_2^2),$$

$K = C$, $x^{(0)} = (1,1)^T$ and $W \equiv I$, the identity matrix, it follows that an algorithm based on $B$ does not converge in general. (Indeed, $x^{(k)} = (1,1)^T$ for $k$ even and $x^{(k)} = (-1,-1)^T$ for $k$ odd in this example.) Moreover, it may happen that the value of $\phi$ at some iterate is infinite, so that the algorithm is not even well defined. However, and this we will use when we define the modified ICM algorithm, the algorithmic map $B$ generates a direction of descent for $\phi$ at each $x \in K \setminus \{\hat{x}\}$ such that $\phi(x) < \infty$. This result is stated in Lemma 1.

Lemma 1. Let $\phi$ satisfy Condition 1 and let $x \in K \setminus \{\hat{x}\}$ satisfy $\phi(x) < \infty$. Then

$$\phi(x + \lambda(B(x) - x)) < \phi(x)$$

for all $\lambda > 0$ sufficiently small.

Proof: Fix $x \in K \setminus \{\hat{x}\}$ with $\phi(x) < \infty$ and define the function $\psi$ on $[0,1]$ as follows:

$$\psi(\lambda) = \phi(x + \lambda(B(x) - x)).$$

It suffices to show that the right derivative of $\psi$ at zero,

$$\psi'(0) = (B(x) - x)^T \nabla\phi(x),$$

is strictly negative. From the definition of $B(x)$ and the fact that $x \in K$, it follows by (2) that

$$(B(x), W(x)(B(x) - x) + \nabla\phi(x)) = 0 \tag{3}$$

and

$$(x, W(x)(B(x) - x) + \nabla\phi(x)) \ge 0. \tag{4}$$

Subtracting (4) from (3) we see that

$$(B(x) - x, W(x)(B(x) - x)) + \psi'(0) \le 0. \tag{5}$$

Note that the assumption $x \ne \hat{x}$ implies that $x \ne B(x)$. Therefore, since $W(x)$ is positive definite, the first term on the left hand side of (5) is strictly positive, so that $\psi'(0) < 0$, as was to be shown. $\Box$
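
The cycling in the example above is easy to reproduce. For $n = 2$ and $W \equiv I$, $B(x)$ is the projection of $x - \nabla\phi(x)$ onto $C$, which simply averages the two coordinates when they are out of order. The following Python/numpy sketch (our illustration, with our function names) prints iterates alternating between $(1,1)^T$ and $(-1,-1)^T$, even though $\phi$ attains its minimum over $C$ at the origin; by Lemma 1, a small enough step along the segment towards $B(x)$ would nevertheless have decreased $\phi$.

    import numpy as np

    def grad_phi(x):
        # gradient of phi(x) = x2 - x1 + x1*x2/2 + 3*(x1^2 + x2^2)/4
        return np.array([-1.0 + 0.5 * x[1] + 1.5 * x[0],
                          1.0 + 0.5 * x[0] + 1.5 * x[1]])

    def B(x):
        # W = I: project x - grad phi(x) onto C = {x : x_1 <= x_2}
        p = x - grad_phi(x)
        return p if p[0] <= p[1] else np.full(2, p.mean())

    x = np.array([1.0, 1.0])
    for k in range(6):
        print(k, x)            # (1,1), (-1,-1), (1,1), (-1,-1), ...
        x = B(x)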

Using Lemma 1 we can construct an algorithm that converges to $\hat{x}$. The idea behind this modified iterative convex minorant algorithm is to select a point $x^{(k+1)}$ from the segment

$$\mathrm{seg}(x^{(k)}, B(x^{(k)})) = \left\{ x^{(k)} + \lambda (B(x^{(k)}) - x^{(k)}) : \lambda \in [0,1] \right\}$$

such that the value of $\phi$ decreases sufficiently when moving from $x^{(k)}$ to $x^{(k+1)}$. One way to formalize this idea is to define the algorithmic map $C$ by

$$C(x) = \begin{cases} \{B(x)\} & \text{if } \phi(B(x)) < \phi(x) + (1-\varepsilon)\nabla\phi(x)^T(B(x) - x), \\ \left\{ y \in \mathrm{seg}(x, B(x)) : (1-\varepsilon)\nabla\phi(x)^T(y - x) \le \phi(y) - \phi(x) \le \varepsilon\nabla\phi(x)^T(y - x) \right\} & \text{otherwise}, \end{cases} \tag{6}$$

where $\varepsilon \in (0, 1/2)$ is fixed. See Figure 1 for the idea behind the definition of $C$.

[Figure 1: The three possible forms of the set returned by the algorithmic map $C$, in the parametrization $\psi(\lambda) = \phi(x + \lambda(B(x) - x))$.]

To completely specify the algorithm we should fix an initial point, a rule to determine $x^{(k+1)}$ from $C(x^{(k)})$, and a termination criterion. As an initial point we take any $x^{(0)} \in K$ with $\phi(x^{(0)}) < \infty$. As a rule to choose $x^{(k+1)}$ from $C(x^{(k)})$ we propose to take $x^{(k+1)} = B(x^{(k)})$ whenever it belongs to $C(x^{(k)})$, and otherwise to perform a binary search for an element of $C(x^{(k)})$ in the segment $\mathrm{seg}(x^{(k)}, B(x^{(k)}))$. See the pseudo code below for an exact description of this binary search, which can easily be seen to terminate after a finite number of steps. Finally, we base our stopping criterion on (2), where we use that for $C$ the inequality part of (2) is equivalent to the conditions

$$\sum_{i=j}^n \frac{\partial\phi}{\partial x_i}(\hat{x}) \;\begin{cases} \ge 0 & \text{for } 1 \le j \le n, \\ = 0 & \text{for } j = 1. \end{cases}$$

Below we give a formal description of the algorithm obtained in this way (for $K = C$).

Modified iterative convex minorant algorithm

Input: $\eta > 0$: accuracy parameter; $\varepsilon \in (0, 1/2)$: line search parameter; $x^{(0)} \in K$: initial point satisfying $\phi(x^{(0)}) < \infty$;

    begin
      x := x^(0);
      while |sum_{i=1..n} x_i dphi/dx_i(x)| > eta
         or |sum_{i=1..n} dphi/dx_i(x)| > eta
         or min_{1<=j<=n} sum_{i=j..n} dphi/dx_i(x) < -eta do
      begin
        y~ := argmin_{y in K} (y - x + W(x)^{-1} grad phi(x))^T W(x) (y - x + W(x)^{-1} grad phi(x));
        if phi(y~) < phi(x) + (1 - eps) grad phi(x)^T (y~ - x)
        then x := y~
        else begin
          lambda := 1; s := 1/2; z := y~;
          while phi(z) < phi(x) + (1 - eps) grad phi(x)^T (z - x)   (I)
             or phi(z) > phi(x) + eps grad phi(x)^T (z - x)         (II) do
          begin
            if (I) then lambda := lambda + s;
            if (II) then lambda := lambda - s;
            z := x + lambda (y~ - x);
            s := s/2;
          end;
          x := z;
        end;
      end;
    end.

If the algorithm is used to minimize $\phi$ over $C_+$, then $C$ should be replaced by $C_+$ throughout the algorithm and the second condition in the first while statement should be removed.
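
In code, the inner while loop is a bisection search of Goldstein type on the segment $\mathrm{seg}(x, B(x))$. The following Python/numpy sketch of one iteration of the modified algorithm is our illustration (the names and the cap on the number of halvings are ours; B_map stands for any implementation of the map $B$, e.g. the pool adjacent violators sketch of section 2):

    import numpy as np

    def modified_icm_step(phi, grad, B_map, x, eps=0.1, max_halvings=60):
        """One step of the modified ICM algorithm: accept B(x) under the
        sufficient-decrease test, otherwise bisect for a point z on
        seg(x, B(x)) with (1-eps)*lam*d <= phi(z) - phi(x) <= eps*lam*d,
        where d = grad(x)'(B(x) - x) < 0 by Lemma 1."""
        Bx = B_map(x)
        d = grad(x) @ (Bx - x)
        if phi(Bx) < phi(x) + (1.0 - eps) * d:
            return Bx
        lam, s, z = 1.0, 0.5, Bx
        for _ in range(max_halvings):
            gain = phi(z) - phi(x)
            if gain < (1.0 - eps) * lam * d:    # (I): decrease still very large, move out
                lam += s
            elif gain > eps * lam * d:          # (II): decrease too small, move back
                lam -= s
            else:                               # z lies in the set C(x) of (6)
                break
            z = x + lam * (Bx - x)
            s /= 2.0
        return z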

In the next section we prove that under mild conditions the modified ICM algorithm generates a sequence $x^{(k)}$ such that $x^{(k)} \to \hat{x}$ as $k \to \infty$.

4 Convergence of the modified ICM algorithm

To prove that the modified ICM algorithm converges to the point $\hat{x}$ we will use a general convergence theorem (cf. Bazaraa et al. (1993), theorem 7.2.3, or Zangwill (1969), page 91; curiously enough, this theorem is also used in Wu (1983) to prove global convergence of the EM algorithm). This theorem assures convergence of the algorithm based on an algorithmic map $A$ under three conditions. The first is that the sequence of iterates generated by the algorithm is contained in a compact subset $\bar{K}$ of $K$. The second is that there exists a descent function: a continuous function $\zeta$ on $\bar{K}$ such that $\zeta(y) < \zeta(x)$ for all $y \in A(x)$ whenever $x \ne \hat{x}$. The third condition is that the algorithmic map $A$ is closed. This means that if $(x_k)$ and $(y_k)$ are sequences in $\bar{K}$ satisfying $x_k \to x$, $y_k \in A(x_k)$ and $y_k \to y$, then necessarily $y \in A(x)$.

Theorem 1. Let the function $\phi : \mathbb{R}^n \to (-\infty, \infty]$ satisfy Condition 1 and let $x^{(0)} \in K$ satisfy $\phi(x^{(0)}) < \infty$. Let the mapping $x \mapsto W(x)$ take values in the set of positive definite $(n \times n)$ diagonal matrices, such that $x \mapsto W(x)$ is continuous on the set

$$\bar{K} = \{x \in K : \phi(x) \le \phi(x^{(0)})\}. \tag{7}$$

Then an algorithm generated by the mapping $C$, as defined in (6), converges to $\hat{x}$.

Proof: From Lemma 1 it follows that the mapping $C$ is well defined and has $\phi$ as a descent function: for all $x \ne \hat{x}$ and for all $y \in C(x)$, $\phi(y) < \phi(x)$. From this observation it follows that

$$\{x^{(k)} : k \ge 0\} \subset \bar{K},$$

where $\bar{K}$ is as defined in (7). From parts (1) and (2) of Condition 1 and the fact that $\phi(x^{(0)}) < \infty$, it follows that $\bar{K}$ is compact. Therefore, in view of the remarks made above, closedness of $C$ at each $x \in \bar{K} \setminus \{\hat{x}\}$ would imply global convergence of the algorithm.

Fix $x \in \bar{K} \setminus \{\hat{x}\}$ and a sequence $(x_k)$ in $\bar{K}$ such that $x_k \to x$. Let $y_k \in C(x_k)$ with $y_k \to y$ for some $y \in \bar{K}$. To prove closedness of $C$ we have to prove that $y \in C(x)$. First note that continuity of the mapping $x \mapsto W(x)$ on $\bar{K}$ and part (3) of Condition 1 yield that

$$B(x_k) \to B(x) \quad \text{and} \quad \nabla\phi(x_k) \to \nabla\phi(x) \tag{8}$$

as $k \to \infty$. From this it follows that necessarily $y \in \mathrm{seg}(x, B(x))$. Now consider the two different situations that can occur. The first situation is that

$$\phi(B(x_k)) < \phi(x_k) + (1-\varepsilon)\nabla\phi(x_k)^T(B(x_k) - x_k)$$

for infinitely many values of $k$. Letting $k$ tend to infinity along a subsequence $(k_j)$ where this inequality holds, we get from (8) that

$$\phi(B(x)) \le \phi(x) + (1-\varepsilon)\nabla\phi(x)^T(B(x) - x),$$

so that $B(x) \in C(x)$. Moreover, along the same subsequence it follows from the definition of $C$ that $y_{k_j} = B(x_{k_j})$. Therefore, for $j \to \infty$, $y_{k_j} \to B(x)$ by the continuity of $B$. This shows that $y = B(x) \in C(x)$, as was to be proved. The other possibility is that for all $k$ sufficiently large

$$\phi(B(x_k)) \ge \phi(x_k) + (1-\varepsilon)\nabla\phi(x_k)^T(B(x_k) - x_k).$$

Letting $k \to \infty$ and using (8), it then follows that

$$\phi(B(x)) \ge \phi(x) + (1-\varepsilon)\nabla\phi(x)^T(B(x) - x).$$

Therefore, according to the definition of $C$ and the fact that $y \in \mathrm{seg}(x, B(x))$, we have $y \in C(x)$ whenever

$$\phi(y) - \phi(x) \in \left[ (1-\varepsilon)\nabla\phi(x)^T(y - x),\; \varepsilon\nabla\phi(x)^T(y - x) \right].$$

This, however, immediately follows from the fact that for all $k$ sufficiently large

$$\phi(y_k) - \phi(x_k) \in \left[ (1-\varepsilon)\nabla\phi(x_k)^T(y_k - x_k),\; \varepsilon\nabla\phi(x_k)^T(y_k - x_k) \right],$$

together with $x_k \to x$, $y_k \to y$ and $\nabla\phi(x_k) \to \nabla\phi(x)$. $\Box$
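
To illustrate Theorem 1 on the counterexample of section 3: with the line search in place, the iterates started at $(1,1)^T$ no longer cycle but converge to $\hat{x} = (0,0)^T$ (in this particular example they even reach it after two steps). The self-contained Python/numpy sketch below is our illustration; $\varepsilon = 0.1$ is the line search parameter also used in section 5.

    import numpy as np

    def phi(x):
        return x[1] - x[0] + 0.5 * x[0] * x[1] + 0.75 * (x[0]**2 + x[1]**2)

    def grad_phi(x):
        return np.array([-1.0 + 0.5 * x[1] + 1.5 * x[0],
                          1.0 + 0.5 * x[0] + 1.5 * x[1]])

    def B(x):
        p = x - grad_phi(x)                    # W = I
        return p if p[0] <= p[1] else np.full(2, p.mean())

    def step(x, eps=0.1):
        # one step of the modified ICM algorithm, mirroring the pseudo code
        Bx = B(x)
        d = grad_phi(x) @ (Bx - x)
        if phi(Bx) < phi(x) + (1 - eps) * d:
            return Bx
        lam, s, z = 1.0, 0.5, Bx
        while (phi(z) - phi(x) < (1 - eps) * lam * d or
               phi(z) - phi(x) > eps * lam * d):
            lam = lam + s if phi(z) - phi(x) < (1 - eps) * lam * d else lam - s
            z = x + lam * (Bx - x)
            s /= 2.0
        return z

    x = np.array([1.0, 1.0])
    for k in range(10):
        x = step(x)
    print(x, phi(x))                           # [0. 0.] 0.0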

5 Example

Let $z_1 < z_2 < \cdots < z_n$ denote an ordered realization of a sample from a density $g$ on $[0, \infty)$ which is known to be convex and decreasing, and define $z_{-1} = z_0 = 0$. Consider the problem of estimating $g$ from the data. This estimation problem can be found in Hampel (1987). In Groeneboom & Jongbloed (1995) a sieved nonparametric maximum likelihood estimator for $g$ is defined. This estimator is the maximizer of the function

$$g \mapsto n^{-1} \sum_{i=0}^{n-1} \log g(z_i)$$

over the class of convex decreasing densities $g$ on $[0, \infty)$ which are piecewise linear such that all the jumps in the derivative of $g$ are concentrated at the observation points. Therefore, defining $x_i = g(z_{n-i})$ for $0 \le i \le n$ (note that $x_0 = g(z_n) = 0$, since an integrable piecewise linear convex decreasing density necessarily vanishes at and beyond $z_n$), we see that this class of densities can be identified with the intersection of the closed convex cone $K$,

$$K = \left\{ x \in \mathbb{R}^n : x_1 \ge 0 \text{ and } \frac{x_i - x_{i-1}}{z_{n-i+1} - z_{n-i}} \le \frac{x_{i+1} - x_i}{z_{n-i} - z_{n-i-1}} \text{ for } 1 \le i \le n-1 \right\}, \tag{9}$$

and the affine subspace $A$,

$$A = \left\{ x \in \mathbb{R}^n : \tfrac{1}{2} \sum_{i=1}^n x_i (z_{n-i+1} - z_{n-i-1}) = 1 \right\},$$

in $\mathbb{R}^n$, where $A$ takes into account the fact that densities integrate to one. The problem of determining the maximum likelihood estimator is therefore equivalent to the problem of determining

$$\hat{x} = \mathop{\mathrm{argmin}}_{x \in K \cap A} \; -\sum_{i=1}^n \log x_i.$$

Lemma 2 shows that this problem is equivalent to minimizing a smooth strictly convex function over the whole cone $K$ rather than over the intersection of $K$ with the affine subspace $A$.

Lemma 2. Let $K$ be a cone in $\mathbb{R}^n$ and let the function $\phi$ be defined by

$$\phi(x) = -\sum_{i=1}^n \log x_i.$$

Let $c \ne 0$ be a vector in $\mathbb{R}^n$ and let $A$ be the affine subset of $\mathbb{R}^n$ given by $A = \{x \in \mathbb{R}^n : c^T x = \alpha\}$ for some given $\alpha \ne 0$. Then

$$\mathop{\mathrm{argmin}}_{K \cap A} \phi(x) = \mathop{\mathrm{argmin}}_K \left[ \phi(x) + \frac{n}{\alpha} c^T x \right].$$

Proof: Include the linear restriction $c^T x = \alpha$ in the object function via the Lagrangian multiplier $\lambda$ to obtain the function

$$\phi_\lambda(x) = \phi(x) + \lambda(c^T x - \alpha).$$

On $K \cap A$ this function coincides with $\phi$. When $\hat{x}_\lambda$ minimizes $\phi_\lambda$ over $K$ and $c^T \hat{x}_\lambda = \alpha$, then $\hat{x}_\lambda$ evidently minimizes $\phi$ over $K \cap A$. From the structure of $\phi$ together with the equality part of (2), it follows that

$$\sum_{i=1}^n \hat{x}_{\lambda,i} \left( -\frac{1}{\hat{x}_{\lambda,i}} + \lambda c_i \right) = 0,$$

so that we have

$$c^T \hat{x}_\lambda = \frac{n}{\lambda}.$$

Taking $\lambda = n/\alpha$, it is therefore clear that

$$\mathop{\mathrm{argmin}}_{K \cap A} \phi(x) = \mathop{\mathrm{argmin}}_K \phi_{n/\alpha}(x) = \mathop{\mathrm{argmin}}_K \left[ \phi(x) + \frac{n}{\alpha} c^T x \right],$$

which was to be proved. $\Box$

According to this lemma,

$$\hat{x} = \mathop{\mathrm{argmin}}_{x \in K} \sum_{i=1}^n \left\{ -\log x_i + \frac{n}{2}\, x_i (z_{n-i+1} - z_{n-i-1}) \right\}.$$

Noting that $x \in K$ if and only if

$$x_i = \sum_{j=1}^i (z_{n-j+1} - z_{n-j})\, y_j \quad \text{for some } y \in C_+, \tag{10}$$

we see that determining $\hat{x}$ is equivalent to determining $\hat{y} = \mathrm{argmin}_{y \in C_+} \psi(y)$, where

$$\psi(y) = -\sum_{i=1}^n \left\{ \log\left( \sum_{j=1}^i (z_{n-j+1} - z_{n-j})\, y_j \right) - \frac{n}{2} (z_{n-i+1} - z_{n-i-1}) \sum_{j=1}^i (z_{n-j+1} - z_{n-j})\, y_j \right\}$$
$$= -\sum_{i=1}^n \left\{ \log\left( \sum_{j=1}^i (z_{n-j+1} - z_{n-j})\, y_j \right) - \frac{n}{2}\, y_i (z_{n-i+1}^2 - z_{n-i}^2) \right\}. \tag{11}$$
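
The reparametrized objective (11) is straightforward to evaluate numerically. The Python/numpy sketch below is our illustration (our indexing conventions: 0-based arrays, with $z_0 = 0$ prepended to the sorted sample), followed by an evaluation at a constant slope vector for a simulated sample from the density $g(z) = 3(1-z)^2$ on $[0,1]$ used in the example below:

    import numpy as np

    def psi(y, z):
        """Evaluate (11): psi(y) = -sum_i log x_i + (n/2) sum_i y_i (z_{n-i+1}^2 - z_{n-i}^2),
        where x_i = sum_{j <= i} (z_{n-j+1} - z_{n-j}) y_j as in (10)."""
        n = len(z)
        zz = np.concatenate(([0.0], np.sort(z)))     # zz[k] = z_k, with z_0 = 0
        dz = zz[n:0:-1] - zz[n-1::-1]                # dz[j-1] = z_{n-j+1} - z_{n-j}
        d2 = zz[n:0:-1]**2 - zz[n-1::-1]**2          # d2[i-1] = z_{n-i+1}^2 - z_{n-i}^2
        x = np.cumsum(dz * y)                        # the density values (10)
        return -np.log(x).sum() + 0.5 * n * (y * d2).sum()

    # e.g., a sample of size 1000 from g(z) = 3(1-z)^2 on [0,1], by inversion
    rng = np.random.default_rng(0)
    z = 1.0 - (1.0 - rng.uniform(size=1000)) ** (1.0 / 3.0)
    print(psi(np.full(1000, 2.0), z))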

Figure 2 shows the maximum likelihood estimate of a convex decreasing density on $[0, \infty)$ based on a sample of size $n = 1000$ from the density

$$g(z) = 3(1-z)^2 \, 1_{[0,1]}(z),$$

computed by the modified ICM algorithm. We used the settings $\eta = 10^{-5}$, $\varepsilon = 0.1$, $y_i^{(0)} = 2$ ($1 \le i \le n$) and the weights

$$w(y)_i = \frac{z_{n-i+1} - z_{n-i}}{y_i} \sum_{j=i}^n \frac{1}{x_j} \quad (1 \le i \le n),$$

where $x$ depends on $y$ as in (10). On a NeXTSTEP machine the algorithm stopped after 105 iterations.

[Figure 2: Maximum likelihood estimate of the density based on a sample of size 1000; the dashed curve is the underlying density.]

References

[1] Aragon, J. and Eberly, D. (1992). On convergence of convex minorant algorithms for distribution estimation with interval-censored data. J. Comput. Graph. Statist. 1, 129-140.

[2] Barlow, R.E., Bartholomew, D.J., Bremner, J.M. and Brunk, H.D. (1972). Statistical Inference under Order Restrictions. Wiley, New York.

[3] Bazaraa, M.S., Sherali, H.D. and Shetty, C.M. (1993). Nonlinear Programming: Theory and Algorithms. Wiley, New York.

[4] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39, 1-38.

[5] Groeneboom, P. and Jongbloed, G. (1995). Maximum likelihood estimation of a convex decreasing density. In preparation.

[6] Groeneboom, P. and Wellner, J.A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.

[7] Hampel, F.R. (1987). Design, modelling, and analysis of some biological data sets. In: C.L. Mallows (ed.), Design, Data and Analysis, by Some Friends of Cuthbert Daniel. Wiley, New York.

[8] Jongbloed, G. (1995). Three Statistical Inverse Problems. Ph.D. thesis, Delft University of Technology, The Netherlands.

[9] Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference. Wiley, New York.

[10] Terlaky, T. and Vial, J.-Ph. (1995). Maximum likelihood estimation of convex density functions. Technical Report 95-49, Department of Mathematics, Delft University of Technology.

[11] Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11, 95-103.

[12] Zangwill, W.I. (1969). Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs, New Jersey.

[13] Zhan, Y. and Wellner, J.A. (1995). Double censoring: characterization and computation of the nonparametric maximum likelihood estimator. To appear as Technical Report, Department of Statistics, University of Washington.