Likelihood Based Inference for Monotone Response Models

Likelihood Based Inference for Monotone Response Models

Moulinath Banerjee, University of Michigan

September 5, 2005

Abstract

The behavior of maximum likelihood estimates (MLEs) and the likelihood ratio statistic in a family of problems involving pointwise nonparametric estimation of a monotone function is studied. This class of problems differs radically from the usual parametric or semiparametric situations in that the MLE of the monotone function at a point converges to the truth at rate n^{1/3} (slower than the usual n^{1/2} rate) with a non-Gaussian limit distribution. A framework for likelihood based estimation of monotone functions is developed and limit theorems describing the behavior of the MLEs and the likelihood ratio statistic are established. In particular, the likelihood ratio statistic is found to be asymptotically pivotal with a limit distribution that is no longer chi-squared but can be explicitly characterized in terms of a functional of Brownian motion. Applications of the main results are presented and potential extensions discussed.

Key words and phrases: greatest convex minorant, ICM, likelihood ratio statistic, monotone function, monotone response model, self-induced characterization, two-sided Brownian motion, universal limit.

1 Introduction

A common problem in nonparametric statistics is the need to estimate a function, like a density, a distribution, a hazard or a regression function. Background knowledge about the statistical problem can provide information about certain aspects of the function of interest, which, if incorporated in the analysis, enables one to draw meaningful conclusions from the data. Often, this manifests itself in the nature of shape-restrictions (on the function). Monotonicity, in particular, is a shape-restriction that shows up very naturally in

Supported by NSF Grant DMS

different areas of application like reliability, renewal theory, epidemiology and biomedical studies. Depending on the underlying problem, the monotone function of interest could be a distribution function or a cumulative hazard function (survival analysis), the mean function of a counting process (demography, reliability, clinical trials), a monotone regression function (dose-response modeling, modeling disease incidence as a function of distance from a toxic source), a monotone density (inference in renewal theory and other applications) or a monotone hazard rate (reliability). Consequently, monotone functions have been fairly well studied in the literature and several authors have addressed the problem of maximum likelihood estimation under monotonicity constraints. We point out some of the well known ones. One of the earliest results of this type goes back to Prakasa Rao (1969), who derived the asymptotic distribution of the Grenander estimator (the MLE of a decreasing density); Brunk (1970) explored the limit distribution of the MLE of a monotone regression function, Groeneboom and Wellner (1992) studied the limit distribution of the MLE of the survival time distribution with current status data, Huang and Zhang (1994) and Huang and Wellner (1995) obtained the asymptotics for the MLE of a monotone density and a monotone hazard respectively with right censored data (the asymptotics for a monotone hazard under no censoring had been earlier addressed by Prakasa Rao (1970)), while Wellner and Zhang (2000) deduced the large sample theory for a pseudo-likelihood estimator for the mean function of a counting process.

A common feature of these monotone function problems that sets them apart from the spectrum of regular parametric and semiparametric problems is the slower rate of convergence (n^{1/3}) of the maximum likelihood estimates of the value of the monotone function at a fixed point (recall that the usual rate of convergence in regular parametric and semiparametric problems is n^{1/2}). What happens in each case is the following: if ψ̂_n is the MLE of the monotone function ψ, then, provided that ψ'(z₀) does not vanish,

n^{1/3} (ψ̂_n(z₀) − ψ(z₀)) →_d C(z₀) ℤ,   (1.1)

where the random variable ℤ is a symmetric (about 0) but non-Gaussian random variable and C(z₀) is a constant depending upon the underlying parameters in the problem and the point of interest z₀. In fact, ℤ = argmin_{t ∈ ℝ} { W(t) + t² }, where W(t) is standard two-sided Brownian motion on the line. The distribution of ℤ was analytically characterized by Groeneboom (1989) and more recently its distribution and functionals thereof have been computed by Groeneboom and Wellner (2001).
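Although the distribution of ℤ has no simple closed form, it is easy to approximate by simulation. The following minimal sketch (not part of the original development; the truncation half-width, grid resolution and replication count are arbitrary illustrative choices) discretizes two-sided Brownian motion on a grid and records the minimizer of W(t) + t².

```python
import numpy as np

def simulate_chernoff_Z(n_rep=2000, half_width=3.0, n_grid=6000, seed=0):
    """Approximate draws of Z = argmin_t {W(t) + t^2}, W standard two-sided BM.

    The path is simulated on [-half_width, half_width]; since the quadratic drift
    dominates, the minimizer lies in a moderate window with high probability.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(-half_width, half_width, 2 * n_grid + 1)
    dt = t[1] - t[0]
    draws = np.empty(n_rep)
    for r in range(n_rep):
        # Two independent Brownian paths grown outwards from the origin.
        right = np.concatenate(([0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n_grid))))
        left = np.concatenate(([0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n_grid))))
        w = np.concatenate((left[::-1][:-1], right))   # W on the full grid
        draws[r] = t[np.argmin(w + t ** 2)]            # argmin of W(t) + t^2
    return draws

if __name__ == "__main__":
    z = simulate_chernoff_Z()
    print("empirical 0.975 quantile of Z:", np.quantile(z, 0.975))
```

Quantiles obtained this way can be compared with the tabulated values of Groeneboom and Wellner (2001).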

The above result is reminiscent of the asymptotic normality of MLEs in regular parametric (and semiparametric) settings. Classical parametric theory provides very general theorems on the asymptotic normality of the MLE: if X₁, X₂, ..., X_n are i.i.d. observations from a regular parametric model p(x, θ) and θ̂_n denotes the MLE of θ based on this data, then n^{1/2}(θ̂_n − θ) →_d N(0, I(θ)^{-1}), where I(θ) is the Fisher information. In monotone function estimation, the rate slows down to n^{1/3} and ℤ plays the role of the Gaussian limit.

In this paper, we study a class of conditionally parametric models, of the covariate-response type, where the conditional distribution of the response given the covariate comes from a regular parametric model, with the parameter being given by a monotone function of the covariate. We call these monotone response models. Here is a formal description. Let {p(·, θ) : θ ∈ Θ}, with Θ being an open subset of ℝ, be a one-parameter family of probability densities with respect to a dominating measure μ. Let ψ be an increasing or decreasing continuous function defined on an interval Z and taking values in Θ. Consider i.i.d. data {(Z_i, Y_i)}_{i=1}^n, where Z_i has a Lebesgue density p_Z defined on Z and, conditional on Z_i = z, the response Y_i has density p(·, ψ(z)). We think of Z_i as the covariate value for the i-th observation and Y_i as the value of the response. Interest focuses on estimating the function ψ, since it captures the nature of dependence between the response and the covariate. If the parametric family of densities p(·, θ) is parametrized by its mean, then ψ(z) = E(Y | Z = z) is precisely the regression function. It will be seen later that when p(·, θ) is a one-parameter full rank exponential family model, the regression of the sufficient statistic T(Y) on Z is a strictly increasing (known) transformation of ψ; consequently, estimating ψ is equivalent to estimating the regression of T(Y) on Z.

In this paper, we study the asymptotics of the MLE of ψ(z₀) and also the likelihood ratio statistic for testing ψ(z₀) = θ₀ at a fixed point of interest z₀, with a view to obtaining pointwise confidence sets for ψ of an assigned level of significance. Before we discuss this further, here are some motivating examples to illustrate the above framework.

(i) Monotone Regression Model: Consider the model Y = ψ(Z) + ε, where the errors ε have mean 0 and variance σ², the (Z_i, ε_i) are i.i.d. random variables, ε is independent of Z, each Z has a Lebesgue density p_Z, and ψ is

a monotone function. The above model and its variants have been fairly well studied in the literature on isotonic regression (see, for example, Brunk (1970), Wright (1981), Mukherjee (1988), Mammen (1991), Huang (2002)). Now suppose that the ε's are Gaussian. We are then in the above framework: conditional on Z = z, Y ∼ N(ψ(z), σ²). We want to estimate ψ(z₀) and test ψ(z₀) = θ₀ for an interior point z₀ in the domain of ψ.

(ii) Binary Choice Model: Here we have a dichotomous response variable Δ ∈ {0, 1} and a continuous covariate Z with a Lebesgue density p_Z, such that P(Δ = 1 | Z = z) = ψ(z), where ψ is a smooth function of z. Thus, conditional on Z = z, Δ has a Bernoulli distribution with parameter ψ(z). Models of this kind have been quite broadly studied in econometrics and statistics (see, for example, Dunson (2004), Newton, Czado and Chappell (1996), Salanti and Ulm (2003)). In a biomedical context one could think of Δ as representing the indicator of a disease/infection and Z the level of exposure to a toxin, or the measured level of a biomarker that is predictive of the disease/infection (see, for example, Ghosh, Banerjee and Biswas (2004)). In such cases it is often natural to impose a monotonicity assumption on ψ. As in (i), we want to make inference on ψ(z₀).

A special version of this model is the Case 1 interval censoring/current status model that is used extensively in biomedical studies and epidemiological contexts and has received much attention among biostatisticians and statisticians (see, for example, Groeneboom (1987), Groeneboom and Wellner (1992), Sun and Kalbfleisch (1993), Huang (1996), Shiboski (1998), Sun (1999), Banerjee and Wellner (2001)). Consider n individuals who are checked for infection at independent random times Z₁, Z₂, ..., Z_n; we set Δ_i = 1 if individual i is infected by time Z_i and Δ_i = 0 otherwise. We can think of Δ_i as 1{T_i ≤ Z_i}, where T_i is the (random) time to infection (measured from some baseline time). The T_i's are assumed to be independent, and also independent of the Z_i's, and are unknown. We are interested in making inference on F, the common (survival) distribution of the T_i's. We note that {(Z_i, Δ_i)}_{i=1}^n is an i.i.d. sample from the distribution of (Z, Δ), where Z has Lebesgue density p_Z and Δ | Z = z ∼ Bernoulli(F(z)). This is just a binary regression model with monotonicity constraints, where ψ = F.

(iii) Poisson Regression Model: Suppose that Y | Z = z ∼ Poisson(ψ(z)), where Z has a Lebesgue density p_Z and ψ is a monotone function. We have n i.i.d. observations from this model. Here one can think of Z as the distance of a region from a point source
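For the Gaussian and Bernoulli examples above, the MLE of ψ at the ordered covariate values is exactly the isotonic regression of the responses (this is made precise in Section 3 for exponential families). The sketch below, a minimal illustration assuming the Gaussian monotone regression model of Example (i), computes it with the pool adjacent violators algorithm (PAVA); the function and data names are illustrative only.

```python
import numpy as np

def pava(y, w=None):
    """Weighted non-decreasing isotonic least squares fit via pool adjacent violators."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    means, weights, counts = [], [], []          # one entry per current block
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); counts.append(1)
        # Merge adjacent blocks while the monotonicity constraint is violated.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, c2 = means.pop(), weights.pop(), counts.pop()
            m1, w1, c1 = means.pop(), weights.pop(), counts.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt); counts.append(c1 + c2)
    return np.repeat(means, counts)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = np.sort(rng.uniform(0, 1, 200))
    psi = 2 * z ** 2                         # a true increasing regression function
    y = psi + rng.normal(0, 0.3, z.size)     # Example (i): Gaussian errors
    psi_hat = pava(y)                        # MLE of psi at the ordered covariates
    print("estimate at the median covariate:", psi_hat[100])
```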

(for example, a nuclear processing plant) and Y as the number of cases of disease incidence at distance z. Given Z = z, the number of cases of disease incidence at distance z from the source is assumed to follow a Poisson distribution with mean ψ(z), where ψ can be expected to be monotonically decreasing in z. Variants of this model have received considerable attention in epidemiological contexts (Stone (1988), Diggle, Morris and Morton-Jones (1999), Morton-Jones, Diggle and Elliott (1999)).

A common feature of all three models described above is the fact that the conditional distribution of the response comes from a one-parameter full rank exponential family (in (i), the variance σ² needs to be held fixed). One-parameter full rank exponential family models have certain structural niceties that can be exploited to derive explicit solutions for the MLE of ψ. Our last example below considers a curved exponential family model for the response, which is fundamentally of a different flavor.

(iv) Conditional normality under a mean-variance relationship: Consider the scenario where Z has a Lebesgue density concentrated on an interval and, given Z = z, Y is normally distributed with mean ψ(z) and standard deviation σ(ψ(z)), for an increasing function ψ, φ(·; μ, σ²) being the normal density with mean μ and variance σ². For σ(θ) = a θ, for some real a > 0, this reduces to a normal density with a linear relationship between the mean and the standard deviation. Such a model could be postulated in a real life setting based on, say, exploratory plots of the mean-variance relationship using observed data, or background knowledge. The likelihood for the parametric model is analytically more complicated than the ones considered above and will be discussed later.

Based on existing work, one would expect ψ̂_n(z₀), the MLE of ψ at a pre-fixed point z₀, to satisfy (1.1). As will be seen, this indeed happens. This result permits the construction of (asymptotic) confidence intervals for ψ(z₀) using the quantiles of ℤ, which are well tabulated. The constant C(z₀), however, needs to be estimated and involves nuisance parameters depending on the underlying model, in particular the derivative of ψ at z₀, estimating which is a tricky affair. A more attractive way to achieve this goal is through the use of subsampling techniques as developed in Politis, Romano and Wolf (1999); because of the slower convergence rate of estimators and the lack of asymptotic normality, the usual Efron-type bootstrap is suspect in this situation. This however is computationally intensive; see, for example, Banerjee
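As a rough indication of how the subsampling approach alluded to above proceeds, the sketch below builds a confidence interval for ψ(z₀) from subsample replicates of the cube-root-rescaled estimator. It is only a schematic: `isotonic_mle_at(z_values, y_values, z0)` is a hypothetical helper (e.g. PAVA followed by evaluation at z₀, as in the previous sketch), and the subsample size rule b = n^{2/3} is one common illustrative choice, not a recommendation from the paper.

```python
import numpy as np

def subsampling_ci(z, y, z0, isotonic_mle_at, alpha=0.05, n_sub=500, seed=0):
    """Subsampling (Politis-Romano-Wolf style) CI for psi(z0) with n^{1/3} rescaling."""
    rng = np.random.default_rng(seed)
    n = len(z)
    b = max(30, int(n ** (2 / 3)))               # subsample size (illustrative rule)
    theta_hat = isotonic_mle_at(z, y, z0)        # full-sample estimate
    roots = np.empty(n_sub)
    for k in range(n_sub):
        idx = rng.choice(n, size=b, replace=False)          # subsample without replacement
        theta_b = isotonic_mle_at(z[idx], y[idx], z0)
        roots[k] = b ** (1 / 3) * (theta_b - theta_hat)     # mimics the law of n^{1/3}(est - truth)
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    # Invert the approximating distribution of the rescaled root.
    return theta_hat - n ** (-1 / 3) * hi, theta_hat - n ** (-1 / 3) * lo
```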

and Wellner (2005) and Sen and Banerjee (2005) for a discussion of some of the issues involved.

Another likelihood based method of constructing confidence sets for ψ(z₀) would involve testing a null hypothesis of the form ψ(z₀) = θ, using the likelihood ratio test, for different values of θ, and then inverting the acceptance region of the likelihood ratio test; in other words, the confidence set for ψ(z₀) is formed by compiling all values of θ for which the likelihood ratio statistic does not exceed a critical threshold. The threshold depends on α, where 1 − α is the level of confidence being sought, and on the asymptotic distribution of the likelihood ratio statistic when the null hypothesis is correct. Thus, we are interested in studying the asymptotics of the likelihood ratio statistic for testing the true (null) hypothesis ψ(z₀) = θ₀. Pointwise null hypotheses of this kind are very important from the perspective of estimation since they serve as a conduit for setting confidence limits for the value of ψ(z₀), through inversion.

A key motivation behind studying the likelihood ratio based method in our framework comes from our experience with (regular) parametric models, where, under modest regularity conditions, the likelihood ratio statistic for testing the null hypothesis θ = θ₀, based on an i.i.d. sample X₁, X₂, ..., X_n from the density p(x, θ), converges to a χ²₁ distribution under the null. This is a powerful result; since the limit distribution is free of nuisance parameters, this permits construction of confidence regions for the parameter with mere knowledge of the quantiles of the χ²₁ distribution. Nuisance parameters like I(θ) do not need to be estimated from the data. This would however be indispensable if one tried to use the asymptotic normality of θ̂_n to construct confidence intervals for the true value of the parameter. The question then naturally arises whether, similar to the classical parametric case, there exists a universal limit distribution that describes the limiting behavior of the likelihood ratio statistic when the null hypothesis holds, for the monotone response models introduced above. The existence of such a universal limit would facilitate the construction of confidence sets for the monotone function immensely, as it would preclude the need to estimate the constant C(z₀) that appears in the limit distribution of the MLE. Furthermore, it is well known that likelihood ratio based confidence sets are more data driven than MLE based ones and possess better finite sample properties in many different settings, so that one may expect to reap the same benefits for this class of problems. Note however that we do not expect a χ²₁ distribution for the limiting likelihood ratio statistic for monotone functions; the χ²₁ limit in regular settings is intimately connected with the n^{1/2} rate of convergence of MLEs to a Gaussian limit.

The hope that a universal limit may exist is bolstered by the work of Banerjee and Wellner (2001), who studied the limiting behavior of the likelihood ratio statistic for testing the value of the distribution function (F) of the survival time at a fixed point in the current status model. They found that in the limit, the likelihood ratio statistic behaves like 𝔻, which is a well defined functional of two-sided Brownian motion (𝔻 is described below). Based on fundamental similarities among different monotone function models, Banerjee and Wellner (2001) conjectured that 𝔻 could be expected to arise as the limit distribution of likelihood ratios in several different models where (1.1) holds. Thus the relationship of 𝔻 to ℤ in the context of monotone function estimation would be analogous to that of the χ²₁ distribution to N(0, 1) in the context of likelihood based inference in parametric models. Indeed, 𝔻 would be a non-regular version of the χ²₁ distribution. We will show that, for our monotone response models, this is the case.

We are now in a position to describe the agenda for this paper. In Section 2, we give regularity conditions on the monotone response models under which the results in this paper are developed. We state and prove the main theorems describing the limit distributions of the MLEs and the likelihood ratio statistic. In particular, the emergence of a fixed limit law for the pointwise likelihood ratio statistic for testing ψ(z₀) = θ₀ has some of the flavor of Wilks' classical result (Wilks (1938)) on the limiting distribution of likelihood ratios in standard parametric models. Section 3 discusses applications of the main theorems. Section 4 contains some discussion and concluding remarks. Section 5 contains the proofs of some of the lemmas used to establish the main results in Section 2, and is followed by references.

2 Model Assumptions, Characterizations of Estimators and Main Results

Consider the general monotone response model introduced in the previous section. Let z₀ be an interior point of Z at which one seeks to estimate ψ. Assume:

(a) p_Z is positive and continuous in a neighborhood of z₀;

(b) ψ is continuously differentiable in a neighborhood of z₀ with ψ'(z₀) ≠ 0.

The joint density of the data vector {(Z_i, Y_i)}_{i=1}^n (with respect to an appropriate dominating measure) can be written as:

∏_{i=1}^n p(Y_i, ψ(Z_i)) × ∏_{i=1}^n p_Z(Z_i).

The second factor on the right side of the above display does not involve ψ and hence is irrelevant as far as computation of MLEs is concerned. Absorbing this into the dominating measure, the likelihood function is given by the first factor on the right side of the display above. Denote by ψ̂_n the unconstrained MLE of ψ and by ψ̂⁰_n the MLE of ψ under the constraint imposed by the pointwise null hypothesis H₀: ψ(z₀) = θ₀. We assume:

(A) With probability increasing to 1 as n → ∞, the MLEs ψ̂_n and ψ̂⁰_n exist.

Consider the likelihood ratio statistic for testing the hypothesis H₀: ψ(z₀) = θ₀, where z₀ is an interior point of Z. Denoting the likelihood ratio statistic by 2 log λ_n, we have

2 log λ_n = 2 [ Σ_{i=1}^n log p(Y_i, ψ̂_n(Z_i)) − Σ_{i=1}^n log p(Y_i, ψ̂⁰_n(Z_i)) ].

In what follows, assume that the null hypothesis holds.

Further Assumptions: We now state our assumptions about the parametric model p(·, θ). Write ℓ(y, θ) = log p(y, θ).

(A1) The set {y : p(y, θ) > 0} does not depend on θ and is denoted by Y.

(A2) ℓ(y, θ) is at least three times differentiable with respect to θ and is strictly concave in θ for every fixed y. The first, second and third partial derivatives of ℓ with respect to θ will be denoted by ℓ', ℓ'' and ℓ'''.

(A3) If T(Y) is any statistic such that E_θ |T(Y)| < ∞ for θ in a neighborhood of θ₀, then differentiation with respect to θ under the integral sign in E_θ T(Y) is permissible. Under these assumptions, E_θ ℓ'(Y, θ) = 0 and E_θ [ℓ'(Y, θ)²] = −E_θ [ℓ''(Y, θ)] ≡ I(θ).

(A4) I(θ) is finite and continuous at θ₀ = ψ(z₀).

(A5) There exists a neighborhood N of θ₀ such that the conditional expectations E[ sup_{θ ∈ N} |ℓ''(Y, θ)| | Z = z ] and E[ sup_{θ ∈ N} |ℓ'''(Y, θ)| | Z = z ] are bounded, uniformly for z in a neighborhood of z₀.

(A6) The functions (z, θ) ↦ E[ ℓ'(Y, θ)² | Z = z ], (z, θ) ↦ E[ ℓ''(Y, θ)² | Z = z ] and (z, θ) ↦ E[ ℓ''(Y, θ) | Z = z ]

are continuous in a neighborhood of (z₀, θ₀).

(A7) Set m(z, θ) = E[ ℓ'(Y, θ)² | Z = z ]. Then m(z₀, θ₀) > 0. Also, the function m is uniformly bounded in a neighborhood of (z₀, θ₀), and the squared score satisfies a uniform integrability (lim sup) condition in shrinking neighborhoods of (z₀, θ₀), which is needed to control remainder terms in the expansions below.

We are interested in describing the asymptotic behavior of the MLEs of ψ in local neighborhoods of z₀, and that of the likelihood ratio statistic. In order to do so, we first need to introduce the basic spaces and processes (and relevant functionals of the processes) that will figure in the asymptotic theory. First define L to be the space of locally square integrable real valued functions on ℝ, equipped with the topology of L₂ convergence on compact sets. Thus L comprises all functions that are square integrable on every compact set, and h_n is said to converge to h if ∫_{[−K,K]} (h_n − h)² dt → 0 for every K > 0. The space L × L denotes the cartesian product of two copies of L with the usual product topology. Also define B_loc(ℝ) to be the set of all real valued functions defined on ℝ that are bounded on every compact set, equipped with the topology of uniform convergence on compacta. Thus f_n converges to f if sup_{t ∈ [−K,K]} |f_n(t) − f(t)| → 0 for every K > 0, whenever f_n and f are bounded on every compact interval.

For positive constants a and b define the process X_{a,b}(t) = a W(t) + b t², where W(t) is standard two-sided Brownian motion starting from 0. Let G_{a,b}(t) denote the GCM (greatest convex minorant) of X_{a,b}(t) and let g_{a,b}(t) denote its right derivative. It can be shown that the non-decreasing function g_{a,b} is piecewise constant, with finitely many jumps in any compact interval. For t ≤ 0, let G^L_{a,b}(t) denote the GCM of X_{a,b}(t) on the set (−∞, 0] and let g^L_{a,b}(t) denote its right derivative process. For t > 0, let G^R_{a,b}(t) denote the GCM of X_{a,b}(t) on the set (0, ∞) and let g^R_{a,b}(t) denote its right derivative process. Define g⁰_{a,b}(t) as g^L_{a,b}(t) ∧ 0 for t ≤ 0 and as g^R_{a,b}(t) ∨ 0 for t > 0. Then g⁰_{a,b}(t), like g_{a,b}(t), is non-decreasing and piecewise constant, with finitely many jumps in any compact interval, and differing (almost surely) from g_{a,b}(t) on

a finite interval containing 0. In fact, with probability 1, g⁰_{a,b}(t) is identically 0 in some (random) neighborhood of 0, whereas g_{a,b}(t) is almost surely non-zero in some (random) neighborhood of 0. Also, the length of the interval D_{a,b} on which g_{a,b} and g⁰_{a,b} differ is O_p(1). For more detailed descriptions of the processes g_{a,b} and g⁰_{a,b}, see Groeneboom (1989), Banerjee (2000), Banerjee and Wellner (2001) and Wellner (2003). Thus, g_{a,b} and g⁰_{a,b} are the unconstrained and constrained versions of the slope processes associated with the canonical process X_{a,b}(t). Finally define

𝔻 = ∫ [ g_{1,1}(t)² − g⁰_{1,1}(t)² ] dt.

The following theorem describes the limiting behavior of the unconstrained and constrained MLEs of ψ, appropriately normalized.

Theorem 2.1 Let X_n(t) = n^{1/3}(ψ̂_n(z₀ + t n^{−1/3}) − ψ(z₀)) and Y_n(t) = n^{1/3}(ψ̂⁰_n(z₀ + t n^{−1/3}) − ψ(z₀)). Let a = (p_Z(z₀) I(θ₀))^{−1/2} and b = ψ'(z₀)/2. Under assumptions (A)–(A7) and (a), (b),

(X_n(t), Y_n(t)) → (g_{a,b}(t), g⁰_{a,b}(t))

finite dimensionally and also in the space L × L.

Using Brownian scaling it follows that the following distributional equality holds in the space L × L:

( g_{a,b}(t), g⁰_{a,b}(t) ) =_d ( a^{2/3} b^{1/3} g_{1,1}( t (b/a)^{2/3} ), a^{2/3} b^{1/3} g⁰_{1,1}( t (b/a)^{2/3} ) ).   (2.2)

For a proof of this proposition, see, for example, Banerjee (2000). Using the fact that g_{1,1}(0) =_d 2 ℤ (see, for example, Prakasa Rao (1969)), we get:

n^{1/3}(ψ̂_n(z₀) − ψ(z₀)) →_d 2 a^{2/3} b^{1/3} ℤ ≡ C(z₀) ℤ.   (2.3)

This is precisely the phenomenon described in (1.1).
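The slope processes above are straightforward to approximate numerically, which is also how quantiles of functionals such as g_{1,1}(0) or 𝔻 can be tabulated by simulation. The sketch below (an illustration only, not from the paper; grid choices are arbitrary and boundary effects from truncating the time axis are ignored) computes the GCM of a discretized path of X_{1,1}(t) = W(t) + t² via a lower convex hull and reads off its slope at 0.

```python
import numpy as np

def gcm_slopes(t, x):
    """Right-hand slopes of the greatest convex minorant of the points (t_i, x_i)."""
    hull = [0]                                  # indices of GCM vertices (lower convex hull)
    for i in range(1, len(t)):
        hull.append(i)
        # Pop the middle vertex while the three last vertices are not in convex position.
        while len(hull) >= 3:
            i0, i1, i2 = hull[-3], hull[-2], hull[-1]
            if (x[i1] - x[i0]) * (t[i2] - t[i1]) >= (x[i2] - x[i1]) * (t[i1] - t[i0]):
                hull.pop(-2)
            else:
                break
    slopes = np.empty(len(t))
    for a, b in zip(hull[:-1], hull[1:]):
        slopes[a:b] = (x[b] - x[a]) / (t[b] - t[a])
    slopes[hull[-1]:] = slopes[hull[-1] - 1]    # carry the last slope to the final grid point
    return slopes

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    half, m = 4.0, 8000
    t = np.linspace(-half, half, 2 * m + 1)
    dt = t[1] - t[0]
    right = np.concatenate(([0.0], np.cumsum(rng.normal(0, np.sqrt(dt), m))))
    left = np.concatenate(([0.0], np.cumsum(rng.normal(0, np.sqrt(dt), m))))
    w = np.concatenate((left[::-1][:-1], right))
    g = gcm_slopes(t, w + t ** 2)               # slope process of the GCM of X_{1,1}
    print("g_{1,1}(0) for this path:", g[m])    # one draw; in law this equals 2*Z
```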

Our next theorem concerns the limit distribution of the likelihood ratio statistic for testing the null hypothesis, when it is true.

Theorem 2.2 Under assumptions (A)–(A7) and (a), (b), 2 log λ_n →_d 𝔻 when H₀: ψ(z₀) = θ₀ is true.

Remark 1: In this paper we work under the assumption that Z has a Lebesgue density on its support. However, Theorems 2.1 and 2.2, the main results of this paper, continue to hold under the assumption that the distribution function of Z is continuously differentiable (and hence has a Lebesgue density) in a neighborhood of z₀ with non-vanishing derivative at z₀. Also, subsequently we tacitly assume that MLEs always exist; this is not really a stronger assumption than (A), since our main results deal with convergences in distribution and we can, without loss of generality, restrict ourselves to sets with probability tending to 1. In this paper, we will focus on the case when ψ is increasing. The case where ψ is decreasing is incorporated into this framework by replacing Z by −Z and considering the (increasing) function z ↦ ψ(−z).

Remark 2: We briefly discuss the implications of Theorems 2.1 and 2.2 for constructing confidence sets for ψ(z₀). Theorem 2.1 leads to (2.3), based on which an asymptotic level 1 − α confidence interval for ψ(z₀) is given by ψ̂_n(z₀) ± n^{−1/3} Ĉ(z₀) q(ℤ, 1 − α/2), where q(ℤ, 1 − α/2) is the upper α/2-th quantile of the (symmetric) distribution of ℤ (this is approximately 1 when α = 0.05). Here Ĉ(z₀) is an estimate of C(z₀). To use Theorem 2.2 for constructing confidence sets, for each θ one constructs the likelihood ratio statistic for testing the null hypothesis ψ(z₀) = θ. Denoting this by 2 log λ_n(θ), the set {θ : 2 log λ_n(θ) ≤ q(𝔻, 1 − α)}, where q(𝔻, 1 − α) is the upper α-th quantile of 𝔻 (this is approximately 2.28 for α = 0.05), gives an asymptotic level 1 − α confidence set for ψ(z₀).

In order to study the asymptotic properties of the statistics of interest it is necessary to characterize the MLEs, and this is what we do below.

Characterizing the maximum likelihood estimators ψ̂_n and ψ̂⁰_n: In what follows, we write ℓ(y, θ) = log p(y, θ). We now discuss the characterization of the maximum likelihood estimators. We first write down the log likelihood function for the data.
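To make the inversion recipe of Remark 2 concrete, the following schematic collects the accepted null values into a confidence set. It assumes a hypothetical helper `lrt_stat(z, y, z0, theta)` returning 2 log λ_n(θ) for the model at hand (for the current status model such a routine is built from the constrained and unconstrained isotonic MLEs, as in Section 3), and 2.28 is the approximate 0.95 quantile of 𝔻 quoted above.

```python
import numpy as np

D_QUANTILE_95 = 2.28   # approximate upper 5% point of the universal limit D (see Remark 2)

def lr_confidence_set(z, y, z0, lrt_stat, theta_grid):
    """Collect all candidate values theta whose LR statistic stays below the threshold."""
    accepted = [th for th in theta_grid if lrt_stat(z, y, z0, th) <= D_QUANTILE_95]
    if not accepted:
        return None
    # The accepted set is typically an interval; report its endpoints.
    return min(accepted), max(accepted)

# usage sketch: ci = lr_confidence_set(z, y, 0.5, lrt_stat, np.linspace(0.01, 0.99, 197))
```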

This is given by

l_n(ψ) = Σ_{i=1}^n log p(Y_i, ψ(Z_i)) = Σ_{i=1}^n ℓ(Y_i, ψ(Z_i)).

The goal is to maximize this expression over all increasing functions ψ. Let Z_(1) < Z_(2) < ... < Z_(n) denote the ordered values of the Z_i's and let Y_(i) denote the observed value of the response corresponding to Z_(i). Since ψ is increasing, ψ(Z_(1)) ≤ ψ(Z_(2)) ≤ ... ≤ ψ(Z_(n)). Finding ψ̂_n therefore reduces to minimizing

φ(u) = − Σ_{i=1}^n ℓ(Y_(i), u_i)

over all u = (u₁, ..., u_n) with u₁ ≤ u₂ ≤ ... ≤ u_n. Once we obtain the (unique) minimizer û, the MLE ψ̂_n is given by ψ̂_n(Z_(i)) = û_i at the points Z_(i). By our assumptions, φ is a (continuous) convex function assuming values in ℝ; from the continuity and convexity of φ, it follows that its effective domain is an open convex subset of ℝⁿ. We will minimize φ over this set subject to the constraints u₁ ≤ u₂ ≤ ... ≤ u_n. Note that φ is finite and differentiable on its domain. Necessary and sufficient conditions characterizing the minimizer are obtained readily, using the Kuhn-Tucker theorem. We write the constraints as g_j(u) = u_j − u_{j+1} ≤ 0 for j = 1, ..., n − 1; then û is the minimizer, satisfying the constraints, if and only if there exists an (n − 1)-dimensional vector λ with non-negative components such that ∇φ(û) + Gᵀ λ = 0 and λ_j g_j(û) = 0 for all j, where G is the (n − 1) × n matrix of partial derivatives of the constraint functions. The conditions displayed above are often referred to as Fenchel conditions. In this case, the second condition boils down to λ_j (û_j − û_{j+1}) = 0 for j = 1, ..., n − 1. Solving recursively to obtain the λ_j's, we get

Σ_{i ≤ j} ℓ'(Y_(i), û_i) ≥ 0 for j = 1, ..., n − 1,   (2.4)

Σ_{i=1}^n ℓ'(Y_(i), û_i) = 0, with equality in (2.4) whenever û_j < û_{j+1}.   (2.5)

Now, let B₁, B₂, ..., B_k be the blocks of indices on which the solution û is constant and let v_j be the common value on block B_j. The equality conditions force

that, on each block B_j, the common value v_j is the unique solution to the equation

Σ_{i ∈ B_j} ℓ'(Y_(i), v) = 0.   (2.6)

Also, if B' is a head subset of the block B_j (i.e., B' is the ordered subset of the first few indices of the ordered set B_j), then it follows that

Σ_{i ∈ B'} ℓ'(Y_(i), v_j) ≥ 0.   (2.7)

The solution û can be characterized as the vector of left derivatives of the greatest convex minorant (GCM) of a (random) cumulative sum (cusum) diagram, as will be shown below. The cusum diagram will itself be characterized in terms of the solution û, giving us a self-induced characterization. Under sufficient structure on the underlying parametric model (for example, one-parameter full rank exponential families), the self-induced characterization can be avoided and the MLE can be characterized as the slope of the convex minorant of a cusum diagram that is explicitly computable from the data. This issue is discussed in greater detail in Section 3.

Before proceeding further, we introduce some notation. For points {(x₀, y₀), (x₁, y₁), ..., (x_n, y_n)}, where x₀ = y₀ = 0 and x₀ < x₁ < ... < x_n, consider the left-continuous function P such that P(x_i) = y_i and such that P is constant on (x_{i−1}, x_i]. We will denote the vector of slopes (left derivatives) of the GCM of P computed at the points x₁, x₂, ..., x_n by slogcm{(x_i, y_i)}_{i=0}^n.

Here is the idea behind the self-induced characterization. Suppose that û is the (unique) minimizer of φ over the region subject to the given constraints. Consider now the following (quadratic) function,

φ̃(u) = φ(û) + ∇φ(û)ᵀ (u − û) + (1/2) (u − û)ᵀ D (u − û),

where D is some positive definite matrix. Note that Hess(φ̃) = D, which is positive definite; thus φ̃ is a strictly convex function. It is also finite and differentiable over ℝⁿ. It can be shown (for the details of the argument, see Banerjee (2004)) that, irrespective of the choice of the positive definite matrix D, the function φ̃ is uniquely minimized, subject to the same constraints, at û. To compute û, it therefore suffices to try to minimize φ̃. Choosing D to be a diagonal matrix with the
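The self-induced characterization above suggests an iterative scheme of the kind referenced later in the paper (the iterative convex minorant algorithm, Jongbloed (1998)): at the current iterate, form the quadratic approximation with D diagonal and solve the resulting weighted isotonic regression, repeating until convergence. The sketch below is a bare-bones version of that idea for a generic log-likelihood ℓ; `lprime` and `ldoubleprime` are user-supplied, vectorized derivative functions (assumptions, not code from the paper), and no step-size safeguards of the kind a practical modified ICM would include are shown.

```python
import numpy as np

def weighted_pava(z, w):
    """Weighted non-decreasing isotonic regression of z with weights w (pool adjacent violators)."""
    means, weights, counts = [], [], []
    for zi, wi in zip(z, w):
        means.append(zi); weights.append(wi); counts.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, c2 = means.pop(), weights.pop(), counts.pop()
            m1, w1, c1 = means.pop(), weights.pop(), counts.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt); weights.append(wt); counts.append(c1 + c2)
    return np.repeat(means, counts)

def icm_mle(y_sorted, lprime, ldoubleprime, u0, n_iter=200, tol=1e-8):
    """Iterate the quadratic (self-induced) approximation: each step is a weighted isotonic fit."""
    u = np.asarray(u0, dtype=float)
    for _ in range(n_iter):
        d = -ldoubleprime(y_sorted, u)          # positive by strict concavity of l in theta
        zvals = u + lprime(y_sorted, u) / d     # working response of the quadratic approximation
        u_new = weighted_pava(zvals, d)
        if np.max(np.abs(u_new - u)) < tol:
            return u_new
        u = u_new
    return u
```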

i-th entry being d_i = −ℓ''(Y_(i), û_i) > 0, we see that minimizing φ̃ subject to the constraints reduces to minimizing

Σ_{i=1}^n d_i (z_i − u_i)²,  where z_i = û_i + ℓ'(Y_(i), û_i) / d_i,

over u₁ ≤ u₂ ≤ ... ≤ u_n; hence û is given by the isotonic regression of the function i ↦ z_i on the ordered set {1, 2, ..., n} with weight function i ↦ d_i. It is well known that the solution can be written in terms of the slopes of the GCM of a cusum diagram; see, for example, Theorem 1.2.1 of Robertson, Wright and Dykstra (1988). In terms of the slogcm notation introduced above, the solution can be written as:

û = slogcm{ ( Σ_{i ≤ j} d_i, Σ_{i ≤ j} d_i z_i ) }_{j=0}^n.   (2.8)

Recall that ψ̂_n(Z_(i)) = û_i; for a z that lies strictly between Z_(i) and Z_(i+1), we set ψ̂_n(z) = û_i. The MLE ψ̂_n thus defined is a piecewise constant right continuous function.

Characterizing ψ̂⁰_n: Let m be the number of Z_(i)'s that are less than or equal to z₀. Finding ψ̂⁰_n amounts to minimizing φ(u) subject to u₁ ≤ ... ≤ u_m ≤ θ₀ ≤ u_{m+1} ≤ ... ≤ u_n. This can be reduced to solving two separate optimization problems. These are:

(1) Minimize −Σ_{i ≤ m} ℓ(Y_(i), u_i) over u₁ ≤ u₂ ≤ ... ≤ u_m ≤ θ₀.

(2) Minimize −Σ_{i > m} ℓ(Y_(i), u_i) over θ₀ ≤ u_{m+1} ≤ ... ≤ u_n.

Consider (1) first. As in the unconstrained minimization problem one can write down the Kuhn-Tucker conditions characterizing the minimizer. It is then easy to see that the solution can be obtained through the recipe: minimize −Σ_{i ≤ m} ℓ(Y_(i), u_i) over u₁ ≤ ... ≤ u_m without the upper bound, to get ũ; then the solution to (1) is given by ũ_i ∧ θ₀ for i ≤ m. The solution vector to (2), say w̃, is similarly given by taking the unconstrained isotonic minimizer over the indices i > m and then taking its componentwise maximum with θ₀.

An examination of the connections between the unconstrained and the constrained solutions reveals that the following holds: the two solutions coincide outside a (random) block of consecutive indices containing m. Furthermore, for any index i in the set on which the two solutions differ, a simple ordering relation with respect to θ₀ holds, which will be used in the sequel:   (2.9)

Another important property of the constrained solution û⁰ is that on any block B of indices where it is constant and not equal to θ₀, the constant value, say v, is the unique solution to the equation:

Σ_{i ∈ B} ℓ'(Y_(i), v) = 0.   (2.10)

The constrained solution also has a self-induced characterization in terms of the slope of the greatest convex minorant of a cumulative sum diagram. This follows in the same way as for the unconstrained solution, by using the Kuhn-Tucker theorem and formulating a quadratic optimization problem based on the Fenchel conditions arising from this theorem. We skip the details but give the self-consistent characterization. With d⁰_i = −ℓ''(Y_(i), û⁰_i) and z⁰_i = û⁰_i + ℓ'(Y_(i), û⁰_i)/d⁰_i, the constrained solution satisfies

(û⁰₁, ..., û⁰_m) = slogcm{ ( Σ_{i ≤ j} d⁰_i, Σ_{i ≤ j} d⁰_i z⁰_i ) }_{j=0}^m ∧ θ₀,   (2.11)

(û⁰_{m+1}, ..., û⁰_n) = slogcm{ ( Σ_{m < i ≤ j} d⁰_i, Σ_{m < i ≤ j} d⁰_i z⁰_i ) }_{j=m}^n ∨ θ₀.   (2.12)

The constrained MLE ψ̂⁰_n is the piecewise constant right continuous function satisfying ψ̂⁰_n(Z_(i)) = û⁰_i and ψ̂⁰_n(z₀) = θ₀, and having no jump points outside the set {Z_(1), ..., Z_(n), z₀}.
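For least-squares problems (Example (i)), the recipe above has a particularly transparent form: fit the two halves separately and cap at θ₀. The sketch below is illustrative only (Gaussian case); it assumes scikit-learn's isotonic_regression is available and uses its y_min / y_max arguments, since for least squares clipping the separate isotonic fits at θ₀ coincides with the bound-constrained solution described in the text.

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def constrained_fit(y_sorted, m, theta0):
    """Constrained isotonic LS fit under psi(z0) = theta0: fit each side, then cap at theta0.

    y_sorted : responses ordered by covariate; the first m covariates are <= z0.
    """
    left = isotonic_regression(y_sorted[:m], y_max=theta0)    # u_1 <= ... <= u_m <= theta0
    right = isotonic_regression(y_sorted[m:], y_min=theta0)   # theta0 <= u_{m+1} <= ... <= u_n
    return np.concatenate([left, right])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    z = np.sort(rng.uniform(0, 1, 300))
    y = 2 * z ** 2 + rng.normal(0, 0.3, z.size)
    z0, theta0 = 0.5, 0.5                       # null value theta0 for psi(z0)
    m = int(np.searchsorted(z, z0, side="right"))
    unconstrained = isotonic_regression(y)
    constrained = constrained_fit(y, m, theta0)
    diff = np.nonzero(np.abs(unconstrained - constrained) > 1e-12)[0]
    # The two fits differ only on a stretch of indices around m.
    if diff.size:
        print("fits differ on indices", diff[0], "through", diff[-1])
    else:
        print("fits coincide for this sample")
```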

Remark 3: The characterization of the estimators above does not take into consideration boundary constraints on ψ. However, in certain models, the very nature of the problem imposes natural boundary constraints; for example, the parameter space for the parametric model may be naturally non-negative (Example (iv) discussed above), in which case the constraint ψ ≥ 0 needs to be enforced. Similarly, there can be situations where ψ is constrained to lie below some natural bound. In such cases, Fenchel conditions may be derived in the usual fashion by applying the Kuhn-Tucker theorem, and self-induced characterizations may be derived similarly as above. However, as the sample size grows, with probability increasing to 1, the Fenchel conditions characterizing the estimator in a neighborhood of z₀ will remain unaffected by these additional boundary constraints, since ψ(z₀) is assumed to lie in the interior of the parameter space, and the asymptotic distributional results will remain unaffected.

For this paper, we will assume the (uniform) almost sure consistency of the MLEs for ψ in a closed neighborhood of z₀. We state this formally below.

Lemma 2.1 There exists a neighborhood [σ, τ] of z₀ such that sup_{z ∈ [σ,τ]} |ψ̂_n(z) − ψ(z)| and sup_{z ∈ [σ,τ]} |ψ̂⁰_n(z) − ψ(z)| converge to 0 almost surely.

The global consistency result above can be established using a variety of methods; see, for example, the proof of Lemma 2.1 in Banerjee (2004). For the purposes of deducing the limit distribution of the MLEs and the likelihood ratio statistic, the following lemma, which guarantees local consistency at an appropriate rate, is crucial.

Lemma 2.2 For any M₀ > 0, we have:

sup_{|t| ≤ M₀} |ψ̂_n(z₀ + t n^{−1/3}) − ψ(z₀)| = O_p(n^{−1/3}) and sup_{|t| ≤ M₀} |ψ̂⁰_n(z₀ + t n^{−1/3}) − ψ(z₀)| = O_p(n^{−1/3}).

We next state a number of preparatory lemmas required in the proofs of Theorems 2.1 and 2.2. But before that we need to introduce further notation. Let ℙ_n denote the empirical measure based on the data {(Z_i, Y_i)}_{i=1}^n, which assigns mass 1/n to each observation. For a monotone function ψ̃ taking values in Θ, define the cusum processes obtained by accumulating, over the observations with Z_i ≤ z, the scores ℓ'(Y_i, ψ̃(Z_i)) and the negative second derivatives −ℓ''(Y_i, ψ̃(Z_i)).

Also, define normalized versions of these processes, obtained by localizing about z₀ on the n^{−1/3} scale (setting z = z₀ + t n^{−1/3}) and rescaling so that the localized cusum of scores is comparable to the canonical process X_{a,b}(t) and the localized cusum of negative second derivatives is comparable to a deterministic linear function.

Lemma 2.3 The normalized cusum-of-scores process converges in distribution to the process X_{a,b}(t) = a W(t) + b t² in the space B_loc(ℝ).

Lemma 2.4 For every M > 0, the normalized processes computed at ψ̂_n and at ψ̂⁰_n are asymptotically equivalent, uniformly over |t| ≤ M, to the corresponding processes computed at the constant θ₀, up to o_p(1) terms.

Lemma 2.5 The normalized information processes, computed at ψ̂_n and at ψ̂⁰_n, both converge uniformly in probability to the deterministic function t ↦ p_Z(z₀) I(θ₀) t on the compact interval [−M, M], for every M > 0.

The next lemma characterizes the set on which X_n(t) and Y_n(t) vary.

Lemma 2.6 Given any ε > 0, we can find an M > 0 such that, for all sufficiently large n, the set on which ψ̂_n and ψ̂⁰_n differ is contained in [z₀ − M n^{−1/3}, z₀ + M n^{−1/3}] with probability at least 1 − ε.

Lemma 2.7 For any M > 0, if τ_n^L denotes the first jump point of ψ̂_n to the left of z₀ − M n^{−1/3} and τ_n^R denotes the first jump point of ψ̂_n to the right of z₀ + M n^{−1/3}, then n^{1/3}(z₀ − τ_n^L) and n^{1/3}(τ_n^R − z₀) are O_p(1).

Lemma 2.8 Suppose that {ξ_n}, {ξ_n^M} and {ζ^M, ζ} (n ≥ 1, M > 0) are three sets of random vectors such that (i) lim_{M → ∞} lim sup_n P(ξ_n ≠ ξ_n^M) = 0, (ii) lim_{M → ∞} P(ζ^M ≠ ζ) = 0, and (iii) for every M > 0, ξ_n^M →_d ζ^M as n → ∞. Then ξ_n →_d ζ as n → ∞. The above lemma is adapted from Prakasa Rao (1969).

Proof of Theorem 2.1: The proof presented here relies on continuous mapping arguments for slopes of greatest convex minorant estimators. Setting ψ̃ = ψ̂_n in the cusum processes introduced above and interpreting an integral over the empty set as 0, we get that the unconstrained MLE is given by the slogcm of the corresponding cusum diagram; this is a direct consequence of (2.8). Consequently, the localized process X_n(t), evaluated at the local time points t_i = n^{1/3}(Z_(i) − z₀), coincides with the slope of the GCM of the normalized cusum diagram, evaluated at those points. This slope process is a piecewise constant function of t. By Lemmas 2.3 and 2.4, the normalized cusum-of-scores process converges in distribution to the process X_{a,b}(t) as a process in B_loc(ℝ), that is, under the topology of uniform convergence on compacta. Furthermore, by Lemma 2.5, the normalized information process

converges in probability, uniformly on compacta, to the deterministic function t ↦ p_Z(z₀) I(θ₀) t. Denote by S_n the vector of slopes of the GCM of the normalized cusum diagram, evaluated at the (local) points of interest, and denote by S the vector of slopes of the GCM of X_{a,b}(t), evaluated at the same points.

Now fix M > 0 such that the points of interest are in the interior of [−M, M]. For every ε > 0, find M' > M such that, with probability at least 1 − ε, the first kink point of the GCM of the normalized cusum diagram to the left of −M and the first kink point to the right of M both lie in [−M', M']. That this can be done is guaranteed by Lemma 2.7. So, consider a sample point for which the above happens. Noting that these kink points are kink points of the GCM of the full diagram, it follows that the GCM of the diagram restricted to [−M', M'] coincides on [−M', M'] with the GCM of the full diagram. Consequently, the slope of the GCM of the restricted diagram, evaluated at the points of interest, must agree with the slope of the GCM of the full diagram evaluated at the same points. We conclude that with probability more than 1 − ε, the vector of slopes of the restricted diagram eventually coincides with S_n.

Further, we can ensure that, with probability at least 1 − ε, the first kink point of the GCM of the process X_{a,b}(t) to the left of −M and the first kink point to the right of M lie in [−M', M']. And, as in the previous paragraph, we can argue that with probability more than 1 − ε, the vector of slopes of the GCM of X_{a,b} restricted to [−M', M'], evaluated at the points of interest, coincides with S.

Now, setting ξ_n = S_n, ξ_n^{M'} = the slopes of the restricted diagram, ζ^{M'} = the slopes of the restricted GCM of X_{a,b}, and ζ = S, we find that Conditions (i) and (ii) of Lemma 2.8 are satisfied. It only remains to check Condition (iii). Since, on [−M', M'], the normalized information process converges uniformly in probability to its deterministic limit and the normalized cusum-of-scores process converges in distribution to X_{a,b}(t) under the uniform topology, the GCM

of the restricted diagram converges in distribution to the GCM of the process X_{a,b}(t) on [−M', M'] (under the uniform topology); consequently, the vector of slopes of the restricted diagram, evaluated at the points of interest, converges in distribution to the vector of slopes of the restricted GCM of X_{a,b}, evaluated at the same points (without loss of generality, and since with probability 1 the GCM of X_{a,b} is differentiable at any fixed set of points, the evaluation map is continuous at the limit). This establishes (iii) of Lemma 2.8. Hence, by Lemma 2.8, we conclude that the vector (X_n(t₁), ..., X_n(t_k)) converges to (g_{a,b}(t₁), ..., g_{a,b}(t_k)) in distribution.

A similar argument applies to the constrained MLE, with some modification. The crux of the argument, which involves applying Lemma 2.8, is similar. The modification comes from the fact that, instead of dealing with the slope of the convex minorant of the entire diagram, we now need to deal with the one-sided GCMs to the left and to the right of 0, thresholded against 0 (the minimum with 0 being taken pointwise on one side and the maximum with 0 on the other). The convergence of the thresholded slopes of the diagram restricted to the negative axis, evaluated at points t_i < 0, to the corresponding thresholded slopes of the GCM of X_{a,b} restricted to (−∞, 0], is deduced similarly; but this is precisely the vector with components g⁰_{a,b}(t_i) for t_i < 0. Similarly, one argues the convergence for points t_i that are larger than 0. Finally, note that the above convergences happen jointly for the constrained and unconstrained estimators, so we can conclude that (X_n, Y_n) converges finite dimensionally to (g_{a,b}, g⁰_{a,b}). The above finite dimensional convergence, coupled with the monotonicity of the functions involved, allows us to conclude that convergence happens in the space L × L. The strengthening of finite dimensional convergence to convergence in the L × L metric is deduced from the monotonicity of the processes, as in Corollary 2 of Theorem 3 in Huang and Zhang (1994).

Proof of Theorem 2.2: We have

2 log λ_n = 2 Σ_{i ∈ J_n} [ ℓ(Y_(i), ψ̂_n(Z_(i))) − ℓ(Y_(i), ψ̂⁰_n(Z_(i))) ],

where J_n is the set of indices for which ψ̂_n(Z_(i)) and ψ̂⁰_n(Z_(i)) are different. By Taylor expansion about θ₀ = ψ(z₀), we find that 2 log λ_n equals

2 Σ_{i ∈ J_n} ℓ'(Y_(i), θ₀) ( ψ̂_n(Z_(i)) − ψ̂⁰_n(Z_(i)) ) + Σ_{i ∈ J_n} ℓ''(Y_(i), θ₀) [ (ψ̂_n(Z_(i)) − θ₀)² − (ψ̂⁰_n(Z_(i)) − θ₀)² ] + R_n,

where the remainder R_n involves the third derivative ℓ''' evaluated at points lying between θ₀ and ψ̂_n(Z_(i)) (respectively, between θ₀ and ψ̂⁰_n(Z_(i))). Under our assumptions R_n is o_p(1), as will be established later. Thus, we can write

2 log λ_n = I_n + II_n + o_p(1),

where I_n denotes the first (linear) term and II_n the second (quadratic) term in the display above.

Consider the term I_n. The set J_n can be written as the union of blocks of indices, say B₁, ..., B_r, such that the constrained solution is constant on each of these blocks. Let B denote a typical block and let v_B denote the constant value of the constrained MLE on this block; thus ψ̂⁰_n(Z_(i)) = v_B for each i ∈ B. For any block B where v_B ≠ θ₀, we can write ℓ'(Y_(i), θ₀) = ℓ'(Y_(i), v_B) + ℓ''(Y_(i), θ̃_i)(θ₀ − v_B), for some point θ̃_i lying between θ₀ and v_B. Using the fact that on each such block Σ_{i ∈ B} ℓ'(Y_(i), v_B) = 0 (from (2.10)), it follows that

Σ_{i ∈ J_n} ℓ'(Y_(i), θ₀) ( ψ̂⁰_n(Z_(i)) − θ₀ ) = − Σ_{i ∈ J_n} ℓ''(Y_(i), θ̃_i) ( ψ̂⁰_n(Z_(i)) − θ₀ )².

Replacing θ̃_i by θ₀ in the right-hand side introduces an error which is shown to be o_p(1) by the exact same reasoning as used for R_n. Hence

Σ_{i ∈ J_n} ℓ'(Y_(i), θ₀) ( ψ̂⁰_n(Z_(i)) − θ₀ ) = − Σ_{i ∈ J_n} ℓ''(Y_(i), θ₀) ( ψ̂⁰_n(Z_(i)) − θ₀ )² + o_p(1),

this last step following from a one-step Taylor expansion about the block values. Now, using the fact that the same argument applies to the unconstrained solution, with the block equations (2.6) in place of (2.10), and using the

representations for these terms derived above, we get

I_n = − 2 Σ_{i ∈ J_n} ℓ''(Y_(i), θ₀) ( ψ̂_n(Z_(i)) − θ₀ )² + 2 Σ_{i ∈ J_n} ℓ''(Y_(i), θ₀) ( ψ̂⁰_n(Z_(i)) − θ₀ )² + o_p(1),

whence, combining with II_n,

2 log λ_n = − Σ_{i ∈ J_n} ℓ''(Y_(i), θ₀) [ (ψ̂_n(Z_(i)) − θ₀)² − (ψ̂⁰_n(Z_(i)) − θ₀)² ] + o_p(1).

The sum on the right can be decomposed into a leading term in which −ℓ''(Y_(i), θ₀) is replaced by the information I(θ₀), plus remainder terms. The remainders comprise a "bias" term, which vanishes by the continuity conditions in (A6) together with the fact that the set D_n on which the two estimators differ shrinks at rate n^{−1/3}, and a centered empirical-process term of the form √n (ℙ_n − P) f_n, where f_n is the random function given (up to a deterministic rescaling) by the product of the centered second derivative −ℓ''(y, θ₀) − I(θ₀) with the difference of squared deviations (ψ̂_n(z) − θ₀)² − (ψ̂⁰_n(z) − θ₀)², localized to D_n. The latter term is o_p(1) by Lemma 2.9, stated and proved below. It now remains to deal with the leading term; as we shall see, it is this term that contributes to the likelihood ratio statistic in the limit. The leading term can be written as

n I(θ₀) ∫_{D_n} [ (ψ̂_n(z) − θ₀)² − (ψ̂⁰_n(z) − θ₀)² ] p_Z(z) dz + o_p(1).

On changing to the local variable t = n^{1/3}(z − z₀), the above becomes

I(θ₀) ∫ [ X_n(t)² − Y_n(t)² ] 1( z₀ + t n^{−1/3} ∈ D_n ) p_Z(z₀ + t n^{−1/3}) dt,

where X_n and Y_n denote the normalized MLE processes of Theorem 2.1.

The factor p_Z(z₀ + t n^{−1/3}) can be replaced by p_Z(z₀), on using the facts that, eventually with arbitrarily high probability, D_n is contained in an interval of the form [z₀ − M n^{−1/3}, z₀ + M n^{−1/3}], that the processes X_n and Y_n are O_p(1) uniformly on compacts, and that the relevant moment functions are continuous in a neighborhood of (z₀, θ₀) by (A6). Thus, up to o_p(1) terms, the leading term equals

p_Z(z₀) I(θ₀) ∫_{D̃_n} [ X_n(t)² − Y_n(t)² ] dt,

where D̃_n = n^{1/3}(D_n − z₀) is the local set on which the processes X_n and Y_n differ; the remaining replacements are justified by similar arguments.

We now deduce the asymptotic distribution of the expression on the right side of the above display, using Lemma 2.8 introduced above. Set ξ_n = p_Z(z₀) I(θ₀) ∫_{D̃_n} [X_n(t)² − Y_n(t)²] dt and ζ = p_Z(z₀) I(θ₀) ∫_{D_{a,b}} [g_{a,b}(t)² − g⁰_{a,b}(t)²] dt. Here D_{a,b} is the set on which the processes g_{a,b} and g⁰_{a,b} differ. Using Lemma 2.6, for each ε > 0 we can find a compact set of the form [−M, M) (a left-closed, right-open interval) such that eventually D̃_n ⊂ [−M, M) with probability greater than 1 − ε; also D_{a,b} ⊂ [−M, M) with probability greater than 1 − ε, increasing M if necessary. Now let ξ_n^M and ζ^M denote the corresponding integrals computed over [−M, M). Then P(ξ_n ≠ ξ_n^M) ≤ ε eventually, and similarly P(ζ ≠ ζ^M) ≤ ε. Also ξ_n^M →_d ζ^M as n → ∞ for every fixed M. This is so because, by Theorem 2.1, (X_n, Y_n) converges to (g_{a,b}, g⁰_{a,b}) as a process in L × L, and the map (h₁, h₂) ↦ ∫_{[−M,M)} [h₁(t)² − h₂(t)²] dt is a continuous real valued

function defined from L × L to the reals. Thus all conditions of Lemma 2.8 are satisfied, leading to the conclusion that

2 log λ_n →_d p_Z(z₀) I(θ₀) ∫ [ g_{a,b}(t)² − g⁰_{a,b}(t)² ] dt.

The fact that the limiting distribution is actually independent of the constants, thereby showing universality, falls out from Brownian scaling. Using (2.2) and noting that p_Z(z₀) I(θ₀) = 1/a², we obtain, on making the change of variable s = (b/a)^{2/3} t,

(1/a²) ∫ [ g_{a,b}(t)² − g⁰_{a,b}(t)² ] dt =_d ∫ [ g_{1,1}(s)² − g⁰_{1,1}(s)² ] ds = 𝔻.

It only remains to show that R_n is o_p(1). We outline the proof for the term involving ψ̂_n; the proof for ψ̂⁰_n is similar. The relevant term is bounded by a constant multiple of Σ_{i ∈ J_n} |ℓ'''(Y_(i), θ̃_i)| |ψ̂_n(Z_(i)) − θ₀|³, where θ̃_i is some point between θ₀ and ψ̂_n(Z_(i)). On using the facts that, eventually with arbitrarily high probability, D_n is contained in a set of the form [z₀ − M n^{−1/3}, z₀ + M n^{−1/3}] on which |ψ̂_n − θ₀| ≤ M n^{−1/3} for some constant M, and invoking the dominance condition (A5), we conclude that this term is bounded, up to constants, by

M³ { (ℙ_n − P) [ B(y) 1(z ∈ D_n) ] + P [ B(y) 1(z ∈ D_n) ] },

where B is the dominating function of (A5). That the first term on the right side goes to 0 in probability is a consequence of an extended Glivenko-Cantelli theorem (see, for example, Proposition 2 or Theorem 3 of Van der Vaart and Wellner (1999)), whereas the second term goes to 0 by direct computation, since P(Z ∈ D_n) → 0.

Lemma 2.9 With f_n as defined above, we have √n (ℙ_n − P) f_n →_p 0.

Proof of Lemma 2.9: By virtue of Lemmas 2.2 and 2.6, it is easy to see that with arbitrarily high pre-assigned probability (say, probability greater than 1 − ε,

where ε is pre-assigned and allowed to be arbitrarily small), the random function f_n is eventually contained in a class of functions F = G · H (the class of functions formed by taking the product of a function in G with a function in H). Here G consists of a fixed square-integrable function of y (a constant multiple of the centered second derivative appearing in f_n), and H is built from monotone increasing functions on the line, bounded in absolute value by a constant K, multiplied by indicators of intervals (K depends on ε). We choose K above in such a way that the envelope of the class is square-integrable; this is doable by the assumption on the second derivative in (A6). To complete the proof, it suffices to show that the class F is Donsker. By Example 2.10.23 of Van der Vaart and Wellner (1996), this follows if we can show that there exist envelope functions for G and for H which satisfy the uniform entropy condition (see Section 2.5 of Van der Vaart and Wellner (1996) for a formal definition of this condition). Choosing the envelope of G to be the absolute value of the fixed function itself, square integrability holds by Assumption (A6). That the indicator part of H satisfies the uniform entropy condition follows from the fact that the class of indicators of intervals forms a VC class of functions, and multiplying this class by a fixed function preserves the property; a VC class of functions readily satisfies the uniform entropy condition. To show that H itself satisfies the uniform entropy condition needs a little more work. If two classes of functions satisfy the uniform entropy condition for respective envelopes, it is not difficult to show that the class of pairwise products also satisfies the uniform entropy condition for the product envelope. We will show that the class of monotone increasing functions bounded in absolute value by K satisfies the uniform entropy condition for the constant envelope function K, whence it will follow that H satisfies the uniform entropy condition for the corresponding product envelope. The constant function K is an envelope for this class, and for every probability measure Q the L₂(Q) bracketing numbers of the class are bounded, uniformly in Q, using Theorem 2.7.5 of Van der Vaart and Wellner (1996), where the constant involved is completely free of Q. It follows by direct computation that the covering numbers are uniformly bounded for each fixed ε > 0, showing that the class of bounded monotone functions satisfies the uniform entropy condition with envelope K. Now, note that if two functions in the class are at a distance less than ε in the L₂(Q) metric, then

their products with the fixed function in G are at a correspondingly small distance. It follows that the covering numbers of the product class are controlled by those of its factors, showing that the product class satisfies the uniform entropy condition for the product envelope, whence F indeed satisfies the uniform entropy condition for this envelope.

3 Applications of the main results

In this section, we discuss some interesting special cases of monotone response models. Consider the case of a one-parameter full rank exponential family model, naturally parametrized. Thus,

p(y, θ) = exp( θ T(y) − A(θ) ) h(y),

where θ varies in an open interval Θ. The function A(θ) possesses derivatives of all orders. Suppose we have Y | Z = z ∼ p(·, ψ(z)), where ψ is increasing or decreasing in z. We are interested in making inference on ψ(z₀), where z₀ is an interior point in the support of Z. If p_Z and ψ satisfy conditions (a) and (b) of Section 2, the likelihood ratio statistic for testing ψ(z₀) = θ₀ converges to 𝔻 under the null hypothesis, since conditions (A)–(A7) are readily satisfied for exponential family models. Note that

ℓ(y, θ) = θ T(y) − A(θ) + log h(y), so that ℓ'(y, θ) = T(y) − A'(θ).

Since we are in an exponential family setting, differentiation under the integral sign is permissible up to all orders. We note that ℓ is infinitely differentiable with respect to θ for all y; since E_θ ℓ'(Y, θ) = 0, we have A'(θ) = E_θ T(Y). Also note that A''(θ) = Var_θ T(Y) = I(θ) > 0. Since ℓ''(y, θ) = −A''(θ) < 0, we get the strict concavity of ℓ in θ, and I(θ) = A''(θ) is continuous everywhere. Hence conditions (A1)–(A4) are satisfied readily. Condition (A6) is also satisfied easily: the relevant moment functions are clearly uniformly bounded in a neighborhood of (z₀, θ₀), since A is infinitely differentiable and ψ is continuous, and continuity follows on noting that ℓ'' does not depend on y. To check conditions (A5) and (A7), fix a neighborhood of θ₀, say [θ₀ − ε, θ₀ + ε].

Since ℓ''(y, θ) = −A''(θ) for all y, which is continuous, we can actually choose the dominating function in (A5) to be a constant. It remains to verify condition (A7); this follows from the uniform boundedness of the relevant derivatives of A on [θ₀ − ε, θ₀ + ε] together with an appeal to the DCT, using the fact that the relevant exponential moments are finite and integrable at parameter values in this interval.

The nice structure of exponential family models actually leads to a simpler characterization of the MLEs. For each block of indices B on which ψ̂_n(Z_(·)) is constant with common value equal to, say, v, it follows by (2.6) that Σ_{i ∈ B} ( T(Y_(i)) − A'(v) ) = 0; hence

A'(v) = (1/n_B) Σ_{i ∈ B} T(Y_(i)),   (3.13)

where n_B is the number of indices in the block. Furthermore, by (2.7), it follows that if B' is a head subset of the block B, then Σ_{i ∈ B'} ( T(Y_(i)) − A'(v) ) ≥ 0; i.e., (1/n_{B'}) Σ_{i ∈ B'} T(Y_(i)) ≥ A'(v), n_{B'} denoting the cardinality of B'. As a direct consequence of the above, we deduce that the unconstrained MLE can actually be written as:

A'( ψ̂_n(Z_(i)) ) = slogcm{ ( j/n, (1/n) Σ_{k ≤ j} T(Y_(k)) ) }_{j=0}^n evaluated at the i-th point,

so that ψ̂_n(Z_(i)) is obtained by applying (A')^{-1} to the slopes of the GCM of the cumulative sum diagram formed by the points { ( j/n, (1/n) Σ_{k ≤ j} T(Y_(k)) ) }. The MLE ψ̂⁰_n is characterized in a similar fashion, but in terms of appropriately constrained slopes of the cumulative sum diagram formed by the same points. Thus, the MLEs have explicit characterizations for these models, and their asymptotic distributions may also be obtained by direct methods. These would involve studying the asymptotic behavior of the unconstrained and constrained
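As an illustration of the explicit characterization, the sketch below computes the unconstrained and constrained MLEs and the resulting likelihood ratio statistic for the binary (current status) model of Example (ii). It is a minimal illustration, not code from the paper: for the Bernoulli family under the mean parametrization the isotonic MLE coincides with isotonic least squares of the indicators, the constrained fit follows the cap-at-θ₀ recipe of Section 2, scikit-learn's isotonic_regression is assumed available, and the log-likelihood is guarded against fitted values of exactly 0 or 1.

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def bernoulli_loglik(delta, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)        # guard against exactly 0/1 fitted blocks
    return np.sum(delta * np.log(p) + (1 - delta) * np.log(1 - p))

def current_status_lrt(z, delta, z0, theta0):
    """2 log lambda_n for H0: F(z0) = theta0 in the current status / binary choice model."""
    order = np.argsort(z)
    d = delta[order]
    m = int(np.searchsorted(z[order], z0, side="right"))
    # Unconstrained MLE: isotonic regression of the indicators.
    f_hat = isotonic_regression(d)
    # Constrained MLE: fit each side separately and cap at theta0.
    f0_hat = np.concatenate([
        isotonic_regression(d[:m], y_max=theta0),
        isotonic_regression(d[m:], y_min=theta0),
    ])
    return 2 * (bernoulli_loglik(d, f_hat) - bernoulli_loglik(d, f0_hat))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    n = 1000
    t = rng.exponential(1.0, n)                 # unobserved event times
    z = rng.uniform(0, 3, n)                    # inspection times
    delta = (t <= z).astype(float)              # current status indicators
    z0 = 1.0
    stat = current_status_lrt(z, delta, z0, theta0=1 - np.exp(-z0))   # test at the true value
    print("2 log lambda_n at the truth:", stat, "(compare with the 0.95 quantile 2.28)")
```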

convex minorants of the cusum diagram given by { ( j/n, (1/n) Σ_{k ≤ j} T(Y_(k)) ) } in local neighborhoods of z₀, and subsequently transforming by the inverse of A' (which is strictly increasing).

It is not difficult to check that Examples (i), (ii) and (iii) discussed in the introduction are special cases of the one-parameter full rank exponential family models discussed above. Theorems 2.1 and 2.2 therefore hold for these models, and the MLEs have explicit characterizations and are easily computable. We discuss Example (ii) briefly and leave (i) and (iii) to the reader. Under Assumptions (a) and (b) of Section 2, Example (ii) falls in the monotone response framework: P(Δ = 1 | Z = z) = ψ(z). Here ψ is monotone, and so is the reparametrized function η(z) = log( ψ(z) / (1 − ψ(z)) ). Under this reparametrization, the conditional distribution of Δ given Z = z is the one-parameter exponential family given by p(δ, η) = exp( η δ − log(1 + e^η) ). Testing ψ(z₀) = θ₀ is the same as testing η(z₀) = log( θ₀ / (1 − θ₀) ). It follows immediately from Theorem 2.2 that the likelihood ratio statistic in this model converges to 𝔻. This is the key result that was derived in Banerjee and Wellner (2001), but it now follows as a special case of our current results.

Consider now Example (iv) with a = 1 for simplicity. In this case, the conditional density of Y given Z = z is (1/(θ √(2π))) exp( −(y − θ)² / (2θ²) ) with θ = ψ(z) > 0. This is a one-parameter curved exponential family model with a two-dimensional sufficient statistic. It follows that

ℓ(y, θ) = − log θ − y²/(2θ²) + y/θ + constant,

whence ℓ'(y, θ) = − 1/θ + y²/θ³ − y/θ². Viewed as a function of the reciprocal parameter 1/θ, this derivative is clearly decreasing, showing that the log-likelihood, in terms of the reciprocal parameter, is strictly concave. Also note that, for y ≠ 0, ℓ(y, θ) diverges to −∞ as θ converges to 0 or diverges to ∞. Defining u_i = ψ(Z_(i)) as before, we can consider the problem of maximizing Σ_{i} ℓ(Y_(i), u_i) over all ordered (positive) vectors u. A unique maximum exists. If B is a block of indices on which the solution assumes a constant value, say v, then, as before, the Fenchel conditions imply that Σ_{i ∈ B} ℓ'(Y_(i), v) = 0, which translates to:

an equation, quadratic in v, involving Σ_{i ∈ B} Y_(i) and Σ_{i ∈ B} Y_(i)², where n_B is the cardinality of B. Also, if B' is a head subset of B, with n_{B'} denoting the cardinality of B', a corresponding inequality holds. For a general (real) a, the above equations and inequalities do not translate into an explicit characterization of the MLE as the solution to an isotonic regression problem. The mileage that we get in the full rank exponential family models is no longer available to us, owing to the more complex analytical structure of the current model. The self-induced characterization nevertheless allows us to write down the MLE as a slope of a greatest convex minorant and to determine its asymptotic behavior, following the route described in this paper (apart from log-concavity and the existence of a solution, which we argue above, the other regularity conditions on the parametric model need to be verified; we leave this as an exercise for the reader).

Remark 4: Example (iv) illustrates the point that simple explicit characterizations of MLEs may not always be available for the monotone response models we have considered in this paper. While the analytical structure of a particular model may lead to an explicit representation for the MLE of ψ, as we have witnessed above, this typically will not provide any information about the characterization of the MLE for a structurally different model. For example, the explicit characterization of the MLE for the one-parameter full rank exponential family models provides no insight into the characterization of the MLE in Example (iv). In our view, the advantage of the self-induced characterization lies in the fact that it does not depend on the analytical structure of the model and therefore provides an integrated representation of the MLEs as slopes of greatest convex minorant estimators across the entire class of monotone response models. From a computational standpoint, an explicit characterization of the MLE, if available, clearly has its advantages. Applying the self-induced characterization involves iteration, using, for example, the iterative convex minorant algorithm (Jongbloed (1998)), and can be time consuming. On the other hand, if the MLE also has an explicit characterization as the solution to an isotonic regression problem, then, for example, a simple (non-iterative) application of the PAVA will suffice, and will typically consume far less time.

4 Discussion

In this paper, we have studied the asymptotics of likelihood based inference in monotone response models. A crucial aspect of these models is the fact that, conditional on the

covariate Z, the response is generated from a parametric family that is regular in the usual sense; consequently, the conditional score functions, their derivatives and the conditional information play a key role in describing the asymptotic behavior of the maximum likelihood estimates of the function ψ. It must be noted, though, that there are monotone function models that asymptotically exhibit similar behavior though they are not monotone response models in the sense of this paper. The problems of estimating a monotone instantaneous hazard/density function based on i.i.d. observations from the underlying probability distribution, or from right censored data (with references provided in the Introduction), are cases in point. While methods similar to the ones developed in this paper apply to the hazard estimation problems and show that 𝔻 still arises as the limit distribution of the pointwise likelihood ratio statistic, the asymptotics of the likelihood ratio for the density estimation problems still remain to be established.

The formulation that we have developed in this paper admits some natural extensions, which we briefly indicate. For example, one would like to investigate what happens in the case of (conditionally) multidimensional parametric models. Suppose we have i.i.d. observations from the distribution of (Z, Y), where Z = (Z₁, ..., Z_k) is a random vector such that Z₁ < Z₂ < ... < Z_k with probability 1 and, given Z, the response Y is distributed as p(·, ψ(Z₁), ..., ψ(Z_k)), where ψ is strictly increasing and p(·, θ₁, ..., θ_k) is a k-dimensional parametric model. Under appropriate regularity conditions (in particular, log-concavity in the parameters would appear to be essential), would the likelihood ratio statistic for testing ψ(z₀) = θ₀ converge in this case to 𝔻 as well?

The above model is motivated by the Case k interval censoring problem. Here T is the survival time of an individual and T₁ < T₂ < ... < T_k is a random vector of follow-up times. One only observes the follow-up times together with the indicators Δ_j = 1{T_{j−1} < T ≤ T_j} (interpret T₀ = 0 and T_{k+1} = ∞). The goal is to estimate the monotone function F (the distribution function of T) based on the above data. It is not difficult to see that, with Z = (T₁, ..., T_k), this is a special case of the k-parameter model with p(·, θ₁, ..., θ_k) the multinomial distribution with cell probabilities θ₁, θ₂ − θ₁, ..., 1 − θ_k and ψ = F.

The above extension has a fundamental structural difference with the monotone response models treated in this paper. For the models of this paper, the values of the monotone function to be estimated appear in the log-likelihood function in a separated manner. Recall that the log-likelihood function can be written as Σ_i ℓ(Y_(i), u_i), where u_i is

the value of the monotone function at Z_(i), the i-th order statistic of the covariate values; furthermore, the additive components of the log-likelihood correspond to independent observations. These features simplify the characterization of the maximum likelihood estimates and the subsequent derivation of the asymptotics. However, the additive and separated structure present here is no longer encountered for the extended models of the previous paragraph. For simplicity, consider the Case 2 interval censoring model, where there are two observation times for each individual and one records in which of the three mutually disjoint intervals determined by them the individual fails. Letting t_(1) < t_(2) < ... < t_(2n) denote the distinct ordered values of the 2n observation times (here n is the number of individuals being observed), one can write down the log-likelihood for the data. It is seen that terms of the form log( F(t_(j)) − F(t_(i)) ) immediately enter into the log-likelihood (see, for example, Groeneboom and Wellner (1992), for a detailed treatment). One important consequence of this, in particular, is the fact that the computation of the constrained MLE of the survival distribution (under a hypothesis of the form F(t₀) = θ₀) can no longer be decomposed into two separate optimization problems, in contrast to the monotone response models of this paper. Consequently, an analytical treatment of the constrained estimator will involve techniques beyond those presented in this paper. Regarding the unconstrained estimator of F in this model, Groeneboom (1996) uses some hard analysis to show that, under a hypothesis of separation between the first and second observation times, the estimator converges to the truth at (pointwise) rate n^{1/3}, with limit distribution still given by that of ℤ. A general treatment of the extended models is expected to involve applications and extensions of the ideas in Groeneboom (1996), and is left as a topic for future research.

Another potential extension is to semiparametric models where the infinite dimensional component is a monotone function. Here is a general formulation: consider a random vector (Y, X, W) where W is unidimensional, but X can be vector valued. Suppose that the distribution of Y conditional on (X, W) is given by p(·, βᵀX + ψ(W)), where p(·, θ) is a one-dimensional parametric model. We are interested in making inference on both β and ψ. The above formulation is fairly general and includes, for example, the partially linear regression model Y = βᵀX + ψ(W) + ε, where ψ is a monotone function (certain aspects of this model have been studied by Huang (2002)), semiparametric logistic regression (with Y denoting the binary response and (X, W) covariates of interest) in which the log-odds of a positive outcome (Y = 1) is modelled as βᵀX + ψ(W), and other models of interest. In the light of previous results, we expect that, under appropriate conditions, the MLE of β will converge to a normal distribution at rate n^{1/2}, with asymptotic dispersion given by the inverse of the efficient information.


More information

Spiking problem in monotone regression : penalized residual sum of squares

Spiking problem in monotone regression : penalized residual sum of squares Spiking prolem in monotone regression : penalized residual sum of squares Jayanta Kumar Pal 12 SAMSI, NC 27606, U.S.A. Astract We consider the estimation of a monotone regression at its end-point, where

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

The Canonical Gaussian Measure on R

The Canonical Gaussian Measure on R The Canonical Gaussian Measure on R 1. Introduction The main goal of this course is to study Gaussian measures. The simplest example of a Gaussian measure is the canonical Gaussian measure P on R where

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

SOME CONVERSE LIMIT THEOREMS FOR EXCHANGEABLE BOOTSTRAPS

SOME CONVERSE LIMIT THEOREMS FOR EXCHANGEABLE BOOTSTRAPS SOME CONVERSE LIMIT THEOREMS OR EXCHANGEABLE BOOTSTRAPS Jon A. Wellner University of Washington The bootstrap Glivenko-Cantelli and bootstrap Donsker theorems of Giné and Zinn (990) contain both necessary

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

large number of i.i.d. observations from P. For concreteness, suppose

large number of i.i.d. observations from P. For concreteness, suppose 1 Subsampling Suppose X i, i = 1,..., n is an i.i.d. sequence of random variables with distribution P. Let θ(p ) be some real-valued parameter of interest, and let ˆθ n = ˆθ n (X 1,..., X n ) be some estimate

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano, 02LEu1 ttd ~Lt~S Testing Statistical Hypotheses Third Edition With 6 Illustrations ~Springer 2 The Probability Background 28 2.1 Probability and Measure 28 2.2 Integration.........

More information

1. Introduction In many biomedical studies, the random survival time of interest is never observed and is only known to lie before an inspection time

1. Introduction In many biomedical studies, the random survival time of interest is never observed and is only known to lie before an inspection time ASYMPTOTIC PROPERTIES OF THE GMLE WITH CASE 2 INTERVAL-CENSORED DATA By Qiqing Yu a;1 Anton Schick a, Linxiong Li b;2 and George Y. C. Wong c;3 a Dept. of Mathematical Sciences, Binghamton University,

More information

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero

We are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.

More information

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS

POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS POISSON PROCESSES 1. THE LAW OF SMALL NUMBERS 1.1. The Rutherford-Chadwick-Ellis Experiment. About 90 years ago Ernest Rutherford and his collaborators at the Cavendish Laboratory in Cambridge conducted

More information

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006

Analogy Principle. Asymptotic Theory Part II. James J. Heckman University of Chicago. Econ 312 This draft, April 5, 2006 Analogy Principle Asymptotic Theory Part II James J. Heckman University of Chicago Econ 312 This draft, April 5, 2006 Consider four methods: 1. Maximum Likelihood Estimation (MLE) 2. (Nonlinear) Least

More information

Chapter 4: Asymptotic Properties of the MLE

Chapter 4: Asymptotic Properties of the MLE Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider

More information

Discrete Dependent Variable Models

Discrete Dependent Variable Models Discrete Dependent Variable Models James J. Heckman University of Chicago This draft, April 10, 2006 Here s the general approach of this lecture: Economic model Decision rule (e.g. utility maximization)

More information

Chapter 2 Inference on Mean Residual Life-Overview

Chapter 2 Inference on Mean Residual Life-Overview Chapter 2 Inference on Mean Residual Life-Overview Statistical inference based on the remaining lifetimes would be intuitively more appealing than the popular hazard function defined as the risk of immediate

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Maximum likelihood estimation of a log-concave density based on censored data

Maximum likelihood estimation of a log-concave density based on censored data Maximum likelihood estimation of a log-concave density based on censored data Dominic Schuhmacher Institute of Mathematical Statistics and Actuarial Science University of Bern Joint work with Lutz Dümbgen

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Legendre-Fenchel transforms in a nutshell

Legendre-Fenchel transforms in a nutshell 1 2 3 Legendre-Fenchel transforms in a nutshell Hugo Touchette School of Mathematical Sciences, Queen Mary, University of London, London E1 4NS, UK Started: July 11, 2005; last compiled: August 14, 2007

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Lecture Notes on Sequent Calculus

Lecture Notes on Sequent Calculus Lecture Notes on Sequent Calculus 15-816: Modal Logic Frank Pfenning Lecture 8 February 9, 2010 1 Introduction In this lecture we present the sequent calculus and its theory. The sequent calculus was originally

More information

Construction of a general measure structure

Construction of a general measure structure Chapter 4 Construction of a general measure structure We turn to the development of general measure theory. The ingredients are a set describing the universe of points, a class of measurable subsets along

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

Open Problems in Mixed Models

Open Problems in Mixed Models xxiii Determining how to deal with a not positive definite covariance matrix of random effects, D during maximum likelihood estimation algorithms. Several strategies are discussed in Section 2.15. For

More information

Nonparametric estimation under Shape Restrictions

Nonparametric estimation under Shape Restrictions Nonparametric estimation under Shape Restrictions Jon A. Wellner University of Washington, Seattle Statistical Seminar, Frejus, France August 30 - September 3, 2010 Outline: Five Lectures on Shape Restrictions

More information

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University CURRENT STATUS LINEAR REGRESSION By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University We construct n-consistent and asymptotically normal estimates for the finite

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

1 Glivenko-Cantelli type theorems

1 Glivenko-Cantelli type theorems STA79 Lecture Spring Semester Glivenko-Cantelli type theorems Given i.i.d. observations X,..., X n with unknown distribution function F (t, consider the empirical (sample CDF ˆF n (t = I [Xi t]. n Then

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Stability Analysis and Synthesis for Scalar Linear Systems With a Quantized Feedback

Stability Analysis and Synthesis for Scalar Linear Systems With a Quantized Feedback IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 48, NO 9, SEPTEMBER 2003 1569 Stability Analysis and Synthesis for Scalar Linear Systems With a Quantized Feedback Fabio Fagnani and Sandro Zampieri Abstract

More information

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Yuval Filmus April 4, 2017 Abstract The seminal complete intersection theorem of Ahlswede and Khachatrian gives the maximum cardinality of

More information

Nonparametric estimation for current status data with competing risks

Nonparametric estimation for current status data with competing risks Nonparametric estimation for current status data with competing risks Marloes Henriëtte Maathuis A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

Computer simulation on homogeneity testing for weighted data sets used in HEP

Computer simulation on homogeneity testing for weighted data sets used in HEP Computer simulation on homogeneity testing for weighted data sets used in HEP Petr Bouř and Václav Kůs Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University

More information

Title: Estimation of smooth regression functions in monotone response models

Title: Estimation of smooth regression functions in monotone response models Elsevier Editorial System(tm) for Journal of Statistical Planning and Inference Manuscript Draft Manuscript Number: JSPI-D-07-00264 Title: Estimation of smooth regression functions in monotone response

More information

arxiv:submit/ [math.st] 6 May 2011

arxiv:submit/ [math.st] 6 May 2011 A Continuous Mapping Theorem for the Smallest Argmax Functional arxiv:submit/0243372 [math.st] 6 May 2011 Emilio Seijo and Bodhisattva Sen Columbia University Abstract This paper introduces a version of

More information

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures

Spring 2014 Advanced Probability Overview. Lecture Notes Set 1: Course Overview, σ-fields, and Measures 36-752 Spring 2014 Advanced Probability Overview Lecture Notes Set 1: Course Overview, σ-fields, and Measures Instructor: Jing Lei Associated reading: Sec 1.1-1.4 of Ash and Doléans-Dade; Sec 1.1 and A.1

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline Behaviour of Lipschitz functions on negligible sets G. Alberti 1 M. Csörnyei 2 D. Preiss 3 1 Università di Pisa 2 University College London 3 University of Warwick Lars Ahlfors Centennial Celebration Helsinki,

More information

University of California San Diego and Stanford University and

University of California San Diego and Stanford University and First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008 K-sample Subsampling Dimitris N. olitis andjoseph.romano University of California San Diego and Stanford

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Nonparametric estimation of log-concave densities

Nonparametric estimation of log-concave densities Nonparametric estimation of log-concave densities Jon A. Wellner University of Washington, Seattle Seminaire, Institut de Mathématiques de Toulouse 5 March 2012 Seminaire, Toulouse Based on joint work

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Econometrica, Vol. 71, No. 1 (January, 2003), CONSISTENT TESTS FOR STOCHASTIC DOMINANCE. By Garry F. Barrett and Stephen G.

Econometrica, Vol. 71, No. 1 (January, 2003), CONSISTENT TESTS FOR STOCHASTIC DOMINANCE. By Garry F. Barrett and Stephen G. Econometrica, Vol. 71, No. 1 January, 2003), 71 104 CONSISTENT TESTS FOR STOCHASTIC DOMINANCE By Garry F. Barrett and Stephen G. Donald 1 Methods are proposed for testing stochastic dominance of any pre-specified

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

Convex Optimization Notes

Convex Optimization Notes Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =

More information

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS By Piet Groeneboom and Geurt Jongbloed Delft University of Technology We study nonparametric isotonic confidence intervals for monotone functions.

More information

V. Properties of estimators {Parts C, D & E in this file}

V. Properties of estimators {Parts C, D & E in this file} A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared

More information

Maximum likelihood: counterexamples, examples, and open problems. Jon A. Wellner. University of Washington. Maximum likelihood: p.

Maximum likelihood: counterexamples, examples, and open problems. Jon A. Wellner. University of Washington. Maximum likelihood: p. Maximum likelihood: counterexamples, examples, and open problems Jon A. Wellner University of Washington Maximum likelihood: p. 1/7 Talk at University of Idaho, Department of Mathematics, September 15,

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

3 Joint Distributions 71

3 Joint Distributions 71 2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random

More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

9 Brownian Motion: Construction

9 Brownian Motion: Construction 9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of

More information

Measure Theory and Lebesgue Integration. Joshua H. Lifton

Measure Theory and Lebesgue Integration. Joshua H. Lifton Measure Theory and Lebesgue Integration Joshua H. Lifton Originally published 31 March 1999 Revised 5 September 2004 bstract This paper originally came out of my 1999 Swarthmore College Mathematics Senior

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Multistate Modeling and Applications

Multistate Modeling and Applications Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information