Maximum Likelihood Based Estimation of Hazard Function under Shape Restrictions and Related Statistical Inference


Maximum Likelihood Based Estimation of Hazard Function under Shape Restrictions and Related Statistical Inference

by Desale Habtzghi

(Under the direction of Somnath Datta and Mary Meyer)

Abstract

The problem of estimation of a hazard function has received considerable attention in the statistical literature. In particular, assumptions of increasing, decreasing, concave and bathtub-shaped hazard functions are common in the literature, but practical solutions are not well developed. In this dissertation, we introduce a new nonparametric method for estimation of the hazard function under shape restrictions to handle this problem. This is an important topic of practical utility because often, in survival analysis and reliability applications, one has a prior notion about the physical shape of the underlying hazard rate function. At the same time, it may not be appropriate to assume a totally parametric form for it. We adopt a nonparametric approach, assuming that the density and hazard rate have no specific parametric form, with the assumption that the shape of the underlying hazard rate is known (either decreasing, increasing, concave, convex or bathtub-shaped). We present an efficient algorithm for computing the shape restricted estimator, along with its theoretical justification. We also show how the estimation procedures can be used when dealing with right censored data. We evaluate the performance of the estimator via simulation studies and illustrate it on some real data sets.

We also consider testing the hypothesis that the lifetimes come from a population with a parametric hazard rate, such as the Weibull, against a shape restricted alternative which comprises a broad range of hazard rate shapes. The alternative may be appropriate when the shape of the parametric hazard is neither constant nor monotone. We use appropriate resampling based computation to conduct our tests, since the asymptotic distributions of the test statistics in these problems are mostly intractable.

Index words: Survival Analysis, Hazard Function, Survival Function, Right Censored Data, Nonparametric, Estimation, Parametric, Increasing, Decreasing, Bathtub-Shaped, Concave, Shape Restricted Estimator, Simulation, Testing, Resampling.

Maximum Likelihood Based Estimation of Hazard Function under Shape Restrictions and Related Statistical Inference

by Desale Habtzghi

B.S., University of Asmara, Eritrea, 1996
M.S., Southern Illinois University, U.S.A., 2001
M.S., University of Georgia, U.S.A., 2003

A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree

DOCTOR OF PHILOSOPHY

Athens, Georgia 2006

© 2006 Desale Habtzghi
All Rights Reserved

Maximum Likelihood Based Estimation of Hazard Function under Shape Restrictions and Related Statistical Inference

by Desale Habtzghi

Approved:

Major Professors: Somnath Datta and Mary Meyer

Committee: Ishwar Basawa, Daniel Hall, Lynne Seymour

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
May 2006

Dedication

To my brother Hagos Hadera Habtzghi

Acknowledgments

Writing acknowledgments is a time to reflect upon the glorious struggle that has just taken place and to remember each step along the way. At every turn there are many who have given their time, energy and expertise, and I wish to thank each of them for their help. I would like to express my sincere appreciation to my major professors, Dr. Somnath Datta and Dr. Mary Meyer, who provided not only the direction for the project, but also an enthusiasm and personal concern which greatly contributed to its progress. Dr. Meyer's innovative ideas have provided me with a new research avenue and a desire to learn more about nonparametric function estimation using shape restrictions. I appreciate her endless help in pushing me to fully understand the concepts of shape restrictions; without her open door and open mind it would have been impossible to complete this project. Dr. Datta broadened my horizons; I would particularly like to thank him for helping to open my eyes to the discipline of biostatistics. I really appreciate all the input, advice and encouragement I got from him. He was always there for me when I called him. I would like to thank Dr. Ishwar Basawa, Dr. Daniel Hall and Dr. Lynne Seymour for serving on my committee, as well as for their comments and for enhancing my professional development. I am grateful to have spent five years with most knowledgeable professors and the most friendly staff, as well as fellow students, building my solid professional background. In particular, I would like to thank Dr. Seymour for teaching me Fortran 90 while I was taking Stat.

I would like to express my appreciation to Dr. Robert Lund, Dr. Robert Taylor, Dr. Tharuvai Sriram and Dr. John Stufken for allowing me to teach in the Department of Statistics. I would like to thank Dr. Pike for always wishing me the best. I am especially appreciative of the support and love of several friends, including Mehari, Thomas, Tesfay, Ron, Musie, Simon, Mebrahtu, Aman, Abel, J. Park, Ross, Archan, Haitao, Lin Lin, Ghenet, Helen, Dipankar and others who made it easy to live away from home. I thank my parents for always being there for me. Finally, I would like to express my sincere thanks to my relatives for their endless love and support. Above all, my highest gratitude goes to my God. I would like to dedicate this dissertation to the memory of my brother, Hagos, who passed away in a tragic accident in 2002.

Table of Contents

Acknowledgments
List of Tables
List of Figures

Chapter
1 INTRODUCTION
2 LITERATURE REVIEW
   2.1 Distribution of failure time
   2.2 Censoring
   2.3 Estimation
   2.4 Shape Restricted Regression
3 ESTIMATION OF HAZARD FUNCTION UNDER SHAPE RESTRICTIONS
   Uncensored Sample
   Computing the Estimator
   Examples
   Right Censored Sample
4 SIMULATION STUDIES AND APPLICATION TO REAL DATA SETS
   4.1 Simulation Results
   4.2 Application To Real Data Sets
5 TESTING FOR SHAPE RESTRICTED HAZARD FUNCTION USING RESAMPLING TECHNIQUES
   Test Statistics
   Resampling Approach
   Bootstrap based tests
   Simulation Studies and Results
6 CONCLUSIONS AND FUTURE RESEARCH
   Summary
   Bayesian Approach To Shape Restricted Hazard Function
   Marginal Estimation of Hazard Function Under Shape Restriction in Presence of Dependent Censoring
   Hazard Function Estimation Using Splines Under Shape Restrictions

Bibliography

Appendix
A Head and Neck Cancer data for Arm A
B Bone Marrow Transplantation for leukemia data
C Data for Leukemia Survival Patients
D Generator fans failure data

List of Tables

2.1 Parametric Distributions with increasing and decreasing hazard rates
Comparison of SRE, Kaplan-Meier and kernel estimators using OMSE when the underlying hazard function is increasing convex
Comparison of SRE, Kaplan-Meier and kernel estimators using OMSE when the underlying hazard function is convex
Comparison of Direct and Weighted approaches for estimating an increasing convex hazard function
Simulation results of bias and mean square error for SRE, kernel and Kaplan-Meier estimators at 0, 25 and 50 percent censoring with n = 25 from an increasing convex hazard function (Weibull distribution with α = 3, λ = 6)
Simulation results of bias and mean square error for SRE, kernel and Kaplan-Meier estimators at 0, 25 and 50 percent censoring with n = 50 from an increasing convex hazard function (Weibull distribution with α = 3, λ = 6)
Simulation results of bias and mean square error for SRE, kernel and Kaplan-Meier estimators at 0, 25 and 50 percent censoring with n = 25 from a bathtub shaped hazard function (exponentiated Weibull distribution with α = 3, λ = 10 and θ = 0.2)
Simulation results of bias and mean square error for SRE, kernel and Kaplan-Meier estimators at 0, 25 and 50 percent censoring with n = 50 from a bathtub shaped hazard function (exponentiated Weibull distribution with α = 3, λ = 10 and θ = 0.2)
5.1 Power values for specific values of η, nominal level 0.05, and n = 25, 50 and 100 based on log rank (LR) and Kolmogorov-Smirnov goodness of fit (KS) tests at 0 and 25 percent censoring
Size-power comparison for shape constrained and unconstrained tests for specific values of η, nominal level 0.05, based on LR and KS without censoring
A.1 Survival times (in days) for patients in Arm A of the Head and Neck Cancer Trial. The 0 denotes observations lost to follow up
B.1 Bone Marrow Transplantation for acute lymphoblastic leukemia (ALL) group; status = 0 indicates alive or disease free, and status = 1 indicates dead or relapsed
C.1 Data for Leukemia Patients; status = 0 indicates still alive and status = 1 indicates dead
D.1 Generator fan failure data in thousands of hours of running time; status = 1 indicates failure, and status = 0 indicates censored

List of Figures

1.1 Typical Hazard Shapes
Examples of fits to a scatterplot. (a) The solid curve is the convex fit, the dashed curve is the quadratic fit and the dotted curve is the underlying convex function. (b) The solid curve is the convex fit, the dashed curve is the linear fit and the dotted curve is the underlying quadratic function
Estimation results using percentiles as data. The failure times are quantiles of the exponentiated Weibull distribution with parameters α = 4, η = 1 and λ = 10. The thin solid curve is the underlying hazard rate, the thick solid curve is the SRE estimate, the dotted curve is the kernel estimate, and the dashed curve is the Kaplan-Meier estimate
Estimation results using percentiles as data. The failure times are quantiles of the exponentiated Weibull distribution with parameters α = 3, η = 0.2 and λ = 10. The thin solid curve is the underlying hazard rate, the thick solid curve is the SRE estimate, the dotted curve is the kernel estimate, and the dashed curve is the Kaplan-Meier estimate
Estimation results using percentiles as data. The failure times are quantiles of a distribution function with quadratic hazard function. The thin curve is the underlying hazard rate, the thick solid curve is the SRE estimate, the dotted curve is the kernel estimate, and the dashed curve is the Kaplan-Meier estimate
3.4 Comparison of survival functions estimated by different methods. The thin solid curve is the underlying survival function, the thick solid curve is the shape restricted estimate, the dotted curve is the Kaplan-Meier estimate and the dashed curve is the kernel estimate
Estimates of hazard rates for the head and neck cancer data based on kernel (dashed curve), SRE (solid curve) and parametric (dotted curve) estimators
Estimates of hazard rates for the bone marrow transplantation data based on SRE (thick solid curve), kernel (dashed curve) and PMLE (dotted curve) estimators
Estimates of hazard rates for the Leukemia Survival Data based on SRE (solid curve), kernel (dotted curve), Kaplan-Meier (short dashed curve) and PMLE (long dashed curve) estimators
Graph of the hazard function for model (5.3.1) when α = 6, λ = 10 and η = 1, 0.75 and 0.5 (solid curves) from lowest to highest; η = 0.025 and 0.01 (dashed curves) from lowest to highest; and α = 1, η = 1 (dotted curve)
Power at selected η values for nominal level 0.05 for the log-rank test with sample sizes 25 (solid curve), 50 (dotted curve) and 100 (short dashed curve); the long dashed curve represents the nominal level α = 0.05
The edges for a convex piecewise quadratic fit when K = 5, with equally spaced knots
Comparison of SRE and quadratic spline; the failure times are generated from a Weibull distribution with shape and scale parameters 3 and 6. The dotted curve is the underlying hazard rate, the dashed curve is the SRE estimate and the solid curve is the shape restricted quadratic spline estimate

Chapter 1

INTRODUCTION

The problem of analyzing time to event data arises in many fields. In the biomedical sciences, the event of interest is most often the time of death of an individual, measured from the time of disease onset, diagnosis, or the time when a particular treatment was applied. In the social sciences, events of interest might include the timing of arrests, divorces, revolutions, etc. Time-to-event data are also common in engineering, where the focus is most often on analyzing the time until a piece of equipment fails. These fields use different terms for the analysis of the occurrence and timing of events. For example, the terms survival analysis, event-history analysis and failure-time analysis are used in the biomedical sciences, social sciences and engineering, respectively. We will use the term survival analysis throughout this dissertation.

Let T be the duration of time for which the subject is alive or does not fail. In survival analysis there are three functions that characterize the distribution of T: the survival function, which is the probability of an individual surviving beyond time t; the probability density (probability mass) function, which is the unconditional probability of the event occurring at time t; and the hazard rate (function), which is the probability that an individual dies in the time interval t ≤ T < t + Δt, no matter how small Δt is, provided that the individual has survived to time t. If we know one of these functions, then the other two can be uniquely determined.

The hazard function is a fundamental quantity in survival analysis. It is also termed the failure rate, the instantaneous death rate, or the force of mortality, and is defined mathematically as

h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.

The hazard function is usually more informative about the underlying mechanism of failure than the survival function. For this reason, modeling the hazard function is an important method for summarizing survival data. Hazard functions have various shapes: increasing, decreasing, constant, bathtub shaped, hump-shaped, or possessing other characteristics. See Figure 1.1 for a picture of typical hazard shapes occurring in practice. For instance, model (a) has an increasing hazard rate. This may arise when there is natural aging or wear. Model (b) has a bathtub shaped hazard. Most population mortality data follow this type of hazard function: during an early period, deaths result primarily from infant diseases, after which the death rate stabilizes, followed by an increasing hazard rate due to the natural aging process. Model (c) has a constant hazard rate. Individuals from a population whose only risks of death are accidents or rare illness show a constant hazard rate. Model (d) has a decreasing hazard rate. Decreasing hazard functions are less common but find occasional use when there is an elevated likelihood of early failure, such as in certain types of electronic devices.

Figure 1.1: Typical Hazard Shapes.

The problem of estimation of a hazard function has received considerable attention in the statistical literature. For discussions of some parametric and nonparametric hazard estimators see Chapter 2. Estimation and inference based on nonparametric methods have been shown to be less efficient than those based on suitably chosen parametric models (Miller, 1981). Hence, in the absence of any distributional assumptions about h(t) other than shape constraints, estimation and related inference for h(t) based on nonparametric methods can be even less efficient. So when the only information we have is that the underlying hazard function is decreasing, increasing, concave, convex or bathtub shaped, the shape restricted estimate may provide a more acceptable estimate.

In this dissertation, we introduce a new nonparametric method for estimation of hazard functions under shape restrictions to handle the above problem. This is an important topic of practical utility because often, in survival analysis and reliability applications, one has a prior notion about the physical shape of the underlying hazard rate function. At the same time, it may not be safe or appropriate to assume a totally parametric form for it. In such cases, the prior notion may translate into a restriction on its shape. Furthermore, we show how the estimation procedures can be used when dealing with right censored data.

We also study the problem of testing whether survival times can be modeled by certain parametric families which are often assumed in applications. Instead of omnibus tests, we compare hazard rates derived nonparametrically but under similar shape restrictions as the parametric hazard. We use appropriate resampling-based computation to conduct our tests, since the asymptotic distributions of the test statistics in these problems are largely intractable. Estimation and inference for tests involving shape restrictions are not easy, but methods for their numerical computation exist (Robertson, Wright, and Dykstra 1988; Fraser and Massam 1989; Meyer 1999a). We review this issue in detail in Chapter 2, Section 2.4.

In our approach, we consider the maximum likelihood technique for estimating the constrained hazard function. The shape restricted estimator can be obtained through iteratively reweighted least squares. This technique has been used in a variety of contexts. Meyer (1999b) used iteratively reweighted least squares to compute the maximum likelihood estimate of a constrained potency curve. Meyer and Lund (2003) also applied this technique to time series data for estimating shape restricted trend models. In addition, Fraser and Massam (1989) applied the weighted least squares method to obtain the least squares estimate in concave regression. The problem of finding the least squares estimator of a concave or convex function over the constraint space is a quadratic programming problem. There is no known closed form solution, but the estimator can be obtained by the hinge algorithm of Meyer (1999a) or the mixed primal-dual bases algorithm of Fraser and Massam (1989). These algorithms are given in Section 2.4.

The dissertation is organized as follows: In Chapter 2 we begin with a review of the literature. We discuss various estimation methods proposed for the hazard rate.
This chapter also presents a summary review of shape restricted regression and the constraint cone, over which we maximize the likelihood or minimize the sum of squared errors. In Chapter 3, the general formulation and some theoretical properties of our method are discussed. Section 3.1 deals with construction of the new estimator for uncensored data, and Section 3.2 deals with the problem of estimation of the hazard function for right censored data. For the right censored case, two approaches for obtaining the shape restricted estimator of the hazard are discussed. Simulation results and some real examples are given in Chapter 4. Chapter 5 is devoted to testing for a shape restricted hazard function using resampling techniques. Finally, Chapter 6 deals with future research: 1. Bayesian approaches to the shape restricted hazard function, 2. marginal estimation of the hazard function under shape restriction in the presence of dependent censoring, and 3. hazard function estimation using splines under shape restrictions.

Chapter 2

LITERATURE REVIEW

In this chapter we give basic definitions of functions related to lifetimes. We also review some pre-existing methods used in the estimation of the hazard function and provide some background on shape restricted regression.

2.1 Distribution of failure time

Let T denote a nonnegative random variable representing the lifetime of an individual in some population. Suppose that the lifetime T has distribution function F and density f. We then define the survival function of T as S(t) = P(T > t) = 1 − F(t). If T is a continuous random variable, then

h(t) = f(t)/S(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.

A related quantity is the cumulative hazard function H(t), defined by

H(t) = ∫₀ᵗ h(u) du = −log S(t).

Thus, for continuous lifetimes we have the following relationships:

1. S(t) = exp(−H(t)) = exp{−∫₀ᵗ h(u) du};
2. h(t) = −{log S(t)}′;

3. f(t) = −S′(t);
4. f(t) = h(t) exp{−H(t)}.

Some Parametric Distributions

The models discussed in this section are the most frequently used lifetime models. Reasons for the popularity of these models include their ability to fit different types of lifetime data and their mathematical and statistical tractability.

1. The Weibull distribution with parameters α and λ has

f(t) = (α/λ)(t/λ)^{α−1} exp[−(t/λ)^α],
h(t) = (α/λ)(t/λ)^{α−1},
S(t) = exp[−(t/λ)^α].

2. Exponentiated Weibull Family. The exponentiated Weibull distribution with parameters λ, η and α has

f(t) = (αη/λ)[1 − exp(−(t/λ)^α)]^{η−1} exp(−(t/λ)^α)(t/λ)^{α−1},
S(t) = 1 − [1 − exp(−(t/λ)^α)]^η,
h(t) = (αη/λ)[1 − exp(−(t/λ)^α)]^{η−1} exp(−(t/λ)^α)(t/λ)^{α−1} / (1 − [1 − exp(−(t/λ)^α)]^η).

When η = 1 the exponentiated Weibull distribution reduces to the familiar Weibull distribution with scale and shape parameters λ and α, respectively.

3. The Gompertz-Makeham distribution with parameters θ and α has

f(t) = θe^{αt} exp[(θ/α)(1 − e^{αt})],
h(t) = θe^{αt},
S(t) = exp[(θ/α)(1 − e^{αt})].

4. The Rayleigh distribution with parameters λ₀ and λ₁ has

f(t) = (λ₀ + λ₁t) exp(−λ₀t − 0.5λ₁t²),
h(t) = λ₀ + λ₁t,
S(t) = exp(−λ₀t − 0.5λ₁t²).

5. The Pareto distribution with parameters θ and λ has

f(t) = θλ^θ / t^{θ+1},
h(t) = θ/t,
S(t) = λ^θ / t^θ.

From these different models we can see that hazard functions can take quite different functional forms. It is hard to choose the appropriate model from among these parametric models without a theoretical basis. In the absence of any strong distributional assumptions about h(·) other than its shape, it may not be appropriate to use a totally parametric form of the hazard function. For example, the concept of a distribution function with an increasing hazard function is useful in engineering applications (Miller, 1981). However, many distributions have an increasing hazard function, which makes it difficult to select one without an appropriate theoretical basis (see Table 2.1). In addition, these models are not capable of producing other shapes of hazard function, such as a U-shaped or bimodal hazard function. For such situations, when the only information available is the shape (decreasing, increasing, concave, convex or bathtub) of the underlying hazard function, a new nonparametric estimator that incorporates shape is introduced in this dissertation to provide more acceptable estimates. In Table 2.1, IFR and DFR stand for increasing hazard rate and decreasing hazard rate, respectively.
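The identities of Section 2.1 and the shape claims above can be checked numerically. The sketch below (plain Python; the parameter values are borrowed from the simulation settings used later, α = 3, λ = 6 for the Weibull and α = 3, λ = 10, η = 0.2 for the exponentiated Weibull) verifies that h = f/S, that the Weibull hazard is increasing for α > 1 and decreasing for α < 1, and that the exponentiated Weibull hazard is bathtub-shaped:

```python
import math

def weibull_fns(alpha, lam):
    """Density, survival and hazard of the Weibull distribution of Section 2.1.1."""
    f = lambda t: (alpha / lam) * (t / lam) ** (alpha - 1) * math.exp(-(t / lam) ** alpha)
    S = lambda t: math.exp(-(t / lam) ** alpha)
    h = lambda t: (alpha / lam) * (t / lam) ** (alpha - 1)
    return f, S, h

def exp_weibull_hazard(alpha, lam, eta):
    """Hazard h = f/S of the exponentiated Weibull family."""
    def h(t):
        u = math.exp(-(t / lam) ** alpha)
        f = (alpha * eta / lam) * (1 - u) ** (eta - 1) * u * (t / lam) ** (alpha - 1)
        S = 1 - (1 - u) ** eta
        return f / S
    return h

# identity h(t) = f(t)/S(t) for the Weibull
f, S, h = weibull_fns(3.0, 6.0)
for t in (1.0, 2.0, 5.0):
    assert abs(f(t) / S(t) - h(t)) < 1e-12

# Table 2.1: Weibull is IFR for alpha > 1 and DFR for alpha < 1
_, _, h_ifr = weibull_fns(3.0, 6.0)
_, _, h_dfr = weibull_fns(0.5, 6.0)
assert h_ifr(1.0) < h_ifr(2.0) < h_ifr(3.0)
assert h_dfr(1.0) > h_dfr(2.0) > h_dfr(3.0)

# exponentiated Weibull with alpha = 3, eta = 0.2: bathtub shape
h_bt = exp_weibull_hazard(3.0, 10.0, 0.2)
assert h_bt(0.5) > h_bt(3.0)    # decreasing early
assert h_bt(3.0) < h_bt(15.0)   # increasing late
```
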

Table 2.1: Parametric Distributions with increasing and decreasing hazard rates

Constant       IFR                     DFR
Exponential    Weibull (α > 1)         Weibull (α < 1)
               Gamma (α > 1)           Gamma (α < 1)
               Rayleigh (λ₁ > 0)       Rayleigh (λ₁ < 0)
               Gompertz (θ, α > 0)     Pareto (t > λ)

2.2 Censoring

What distinguishes survival analysis from other fields of statistics is that censoring and truncation are common. A censored observation contains only partial information about the random variable of interest. In this dissertation we consider the problem of estimating and testing the constrained maximum likelihood estimator when the data may be subject to right censoring. Right censoring means that not all of a set of independent survival times or lifetimes are observed, so that for some of them it is only known that they are larger than given values. This is the most common type of censoring, and it arises often in medical studies. For example, in clinical trials patients may enter the study at different times, and each is then treated with one of several possible therapies. One wants to observe their lifetimes, but censoring occurs when a subject is lost to follow up, drops out, dies from another cause, or is still alive at the end of the study.

Let T₁, T₂, ..., T_n denote iid lifetimes (times to failure) from the continuous distribution function F, and Z₁, Z₂, ..., Z_n the corresponding iid censoring times from a continuous distribution G. The times T_i and Z_i are usually assumed to be independent. The observed random variables are then X_i and δ_i, where X_i = min(T_i, Z_i) and δ_i = I(T_i ≤ Z_i). Based on

this assumption, and since the distribution of Z does not involve any parameters of interest, we derive the likelihood function of the lifetimes in the next section.

2.3 Estimation

Parametric Procedures

Parametric methods rest on the assumption that h(t) is a member of some family of distributions h(t, θ), where h is known but depends on an unknown parameter θ, possibly vector-valued. In general, θ is estimated in some optimal fashion, and its estimator θ̂ is used in h(t, θ̂) to obtain a parametric estimator of h(t) (Lawless, 1982; Miller, 1981). The Weibull distribution is considered here as illustrative of the parametric approach. Because of its flexibility, the Weibull distribution has been widely used as a model for fitting lifetime data. Various problems associated with this distribution have been considered by Cohen (1965) and many other authors.

The likelihood function: Here we concentrate on methods based on the likelihood function for a right censored sample, and derive its general form. Let T denote a lifetime with distribution function F, probability density function (pdf) f and survival function S_f, and let Z denote a random censoring time with distribution function G, pdf g and survival function S_g. The derivation of the likelihood is as follows:

P(X = x, δ = 0) = P(Z = x, Z < T) = P(Z = x, x < T) = P(Z = x)P(x < T) = g(x)S_f(x)   (by independence),
P(X = x, δ = 1) = P(T = x, T < Z) = P(T = x, x < Z) = f(x)S_g(x)   (by independence).

Hence, the joint pdf of the pair (X_i, δ_i) is a mixed distribution, as X is continuous and δ discrete. It is given by the single expression

P(x, δ) = {g(x)S_f(x)}^{1−δ} {f(x)S_g(x)}^δ.

Then the likelihood function of the n iid pairs (X_i, δ_i) is given by

L = ∏_{i=1}^n {f(x_i)S_g(x_i)}^{δ_i} {g(x_i)S_f(x_i)}^{1−δ_i}
  = ∏_{i=1}^n {g(x_i)}^{1−δ_i} {S_g(x_i)}^{δ_i} · ∏_{i=1}^n {f(x_i)}^{δ_i} {S_f(x_i)}^{1−δ_i}.

If the distribution of Z does not involve any parameters of interest, then the first factor plays no role in the maximization process. Hence, the likelihood function can be taken to be

L = ∏_{i=1}^n {f(x_i)}^{δ_i} {S_f(x_i)}^{1−δ_i},  or  L = ∏_{i=1}^n {h(x_i)}^{δ_i} S_f(x_i),   (2.3.1)

since f(x_i) = h(x_i)S_f(x_i). The log-likelihood function is

l = log L = Σ_{i=1}^n {δ_i log h(x_i) + log S_f(x_i)}.

Replacing S_f(x) by exp(−H(x)), the log-likelihood becomes

l = Σ_{i=1}^n {δ_i log h(x_i) − H(x_i)} = Σ_{i=1}^n {δ_i log h(x_i) − ∫₀^{x_i} h(u) du}.   (2.3.2)

For the uncensored case, all δ_i = 1, so

l = Σ_{i=1}^n {log h(x_i) − H(x_i)} = Σ_{i=1}^n {log h(x_i) − ∫₀^{x_i} h(u) du}.   (2.3.3)
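A quick numerical sanity check of (2.3.2): the sketch below simulates right censored Weibull lifetimes as in Section 2.2 (the exponential censoring distribution and all numerical values are illustrative choices, not taken from the text) and confirms that the censored log-likelihood is larger at the data-generating hazard than at a badly misspecified one:

```python
import math
import random

def simulate_right_censored(n, shape, scale, cens_rate, rng):
    """(x_i, delta_i) pairs: T_i ~ Weibull(shape, scale), Z_i ~ Exp(cens_rate),
    x_i = min(T_i, Z_i), delta_i = I(T_i <= Z_i), as in Section 2.2."""
    pairs = []
    for _ in range(n):
        t = rng.weibullvariate(scale, shape)  # lifetime T_i
        z = rng.expovariate(cens_rate)        # censoring time Z_i
        pairs.append((min(t, z), 1 if t <= z else 0))
    return pairs

def log_lik(pairs, h, H):
    """Censored log-likelihood (2.3.2): sum_i delta_i log h(x_i) - H(x_i)."""
    return sum(d * math.log(h(x)) - H(x) for x, d in pairs)

rng = random.Random(2023)
pairs = simulate_right_censored(200, 3.0, 6.0, 0.1, rng)

h_true = lambda t: (3.0 / 6.0) * (t / 6.0) ** 2   # Weibull hazard, alpha = 3, lambda = 6
H_true = lambda t: (t / 6.0) ** 3
h_bad = lambda t: (3.0 / 2.0) * (t / 2.0) ** 2    # same shape, badly wrong scale
H_bad = lambda t: (t / 2.0) ** 3

# the likelihood should prefer the data-generating hazard
assert log_lik(pairs, h_true, H_true) > log_lik(pairs, h_bad, H_bad)
```
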

The maximum likelihood estimation for the Weibull distribution: The hazard and cumulative hazard functions of the Weibull distribution are h(t) = (α/λ)(t/λ)^{α−1} and H(t) = (t/λ)^α, respectively, with unknown scale λ and shape α parameters. The log-likelihood function from a right censored sample can be written in the following form:

l(λ, α) = Σ_{i=1}^n [δ_i log h(t_i) − H(t_i)]
        = Σ_{i=1}^n [δ_i log{(α/λ)(t_i/λ)^{α−1}} − (t_i/λ)^α]
        = Σ_{i=1}^n [δ_i log(α/λ) + (α − 1)δ_i log(t_i/λ) − (t_i/λ)^α]
        = d log α − dα log λ + (α − 1) Σ_{i=1}^n δ_i log t_i − Σ_{i=1}^n (t_i/λ)^α,

where d = Σ_{i=1}^n δ_i denotes the number of uncensored observations. Taking the first derivative of l with respect to λ and equating it to 0, we obtain

∂l/∂λ = −dα/λ + αλ^{−(α+1)} Σ_{i=1}^n t_i^α = 0,  so that  λ^α = (1/d) Σ_{i=1}^n t_i^α.   (2.3.4)

Similarly, equating the derivative of l with respect to α to 0 gives

∂l/∂α = d/α − d log λ + Σ_{i=1}^n δ_i log t_i − Σ_{i=1}^n (t_i/λ)^α log(t_i/λ) = 0.   (2.3.5)

Substituting (2.3.4) in (2.3.5), the following equation is obtained:

d/α + Σ_{i=1}^n δ_i log t_i − d Σ_{i=1}^n t_i^α log t_i / Σ_{i=1}^n t_i^α = 0.   (2.3.6)
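Equations (2.3.4) and (2.3.6) translate directly into code. The sketch below solves (2.3.6) for α by bisection (the text uses Newton-Raphson; bisection is substituted here purely for robustness of the illustration) and then recovers λ from (2.3.4); it is checked on a simulated uncensored Weibull sample with shape 3 and scale 6, the setting used in the simulation studies:

```python
import math
import random

def weibull_mle(pairs):
    """MLE of (alpha, lambda) from right censored pairs (x_i, delta_i):
    solve the profile equation (2.3.6) for alpha, then get lambda from (2.3.4)."""
    d = sum(delta for _, delta in pairs)          # number of uncensored values
    def g(a):                                     # left-hand side of (2.3.6)
        s0 = sum(x ** a for x, _ in pairs)
        s1 = sum(x ** a * math.log(x) for x, _ in pairs)
        return d / a + sum(delta * math.log(x) for x, delta in pairs) - d * s1 / s0
    lo, hi = 1e-3, 50.0                           # bracket: g(lo) > 0 > g(hi)
    for _ in range(100):                          # bisection on the monotone g
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    lam = (sum(x ** alpha for x, _ in pairs) / d) ** (1.0 / alpha)   # (2.3.4)
    return alpha, lam

rng = random.Random(1)
pairs = [(rng.weibullvariate(6.0, 3.0), 1) for _ in range(2000)]     # uncensored sample
a_hat, l_hat = weibull_mle(pairs)
assert abs(a_hat - 3.0) < 0.3
assert abs(l_hat - 6.0) < 0.3
```
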

If the shape parameter α is known, then the maximum likelihood estimator (MLE) of λ can be obtained explicitly using (2.3.4). However, if α is unknown, then the MLE has no explicit form. Equation (2.3.6) can be solved for α using the Newton-Raphson iterative method. The associated estimator of h(t; α, λ) is then h(t; α̂, λ̂), where α̂ and λ̂ are the MLEs of α and λ, respectively.

Nonparametric Procedures

Nonparametric procedures, on the other hand, do not require any distributional assumptions about h(t). Thus, they are more flexible than their parametric counterparts, and as a result they are widely used in the analysis of failure times (Kouassi and Singh, 1997). For discussions of some nonparametric hazard estimators see Aalen (1978); Cox (1972); Watson and Leadbetter (1964b); Antoniadis et al. (1999); Liu and Van Ryzin (1984); Ramlau-Hansen (1983); and Kouassi and Singh (1997). We next review several of these nonparametric approaches.

a) Kernel Hazard Estimator: Kernel smoothing for general nonparametric function estimation is widely used in statistical applications, particularly for density, hazard and regression functions. Kernel estimation of the hazard in the uncensored situation was first proposed and studied by Watson and Leadbetter (1964). Ramlau-Hansen (1983) and Tanner and Wong (1983) then extended the idea to right censored data. They described a fixed bandwidth kernel-smoothed estimator of the hazard rate function as follows:

ĥ(t) = (1/b) Σ_{i=1}^n K((t − t_i)/b) δ_i/(n − i + 1),   (2.3.7)

where K(·) is a kernel function, b is the bandwidth which determines the degree of smoothness, and the observations t_i are arranged in increasing order.
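Estimator (2.3.7) is straightforward to implement. The sketch below uses the Epanechnikov kernel K(x) = 0.75(1 − x²) on [−1, 1] (the kernel used throughout this dissertation) and checks the estimate on simulated uncensored exponential data, whose true hazard is constant; the sample size, bandwidth and tolerance are illustrative choices:

```python
import math
import random

def epanechnikov(x):
    """K(x) = 0.75(1 - x^2) on [-1, 1]."""
    return 0.75 * (1.0 - x * x) if -1.0 <= x <= 1.0 else 0.0

def kernel_hazard(times, deltas, b, K=epanechnikov):
    """Fixed-bandwidth kernel hazard estimator (2.3.7); after sorting, the i-th
    smallest time has n - i + 1 subjects still at risk (1-based i)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    ts = [times[i] for i in order]
    ds = [deltas[i] for i in order]
    n = len(ts)
    def h_hat(t):
        return sum(K((t - ts[i]) / b) * ds[i] / (n - i) for i in range(n)) / b
    return h_hat

# check on uncensored exponential data: the true hazard is constant at 1
rng = random.Random(11)
times = [rng.expovariate(1.0) for _ in range(500)]
deltas = [1] * 500
h_hat = kernel_hazard(times, deltas, b=0.5)
assert 0.5 < h_hat(1.0) < 1.5
```
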

In this dissertation the Epanechnikov kernel, K(x) = 0.75(1 − x²) for −1 ≤ x ≤ 1, is used throughout the examples and simulation studies.

b) Kaplan-Meier Type Estimate: Smith (2002), among other authors, discusses the following estimate of the hazard function. Let t_i, i = 1, ..., r ≤ n, denote the distinct ordered death times. The hazard rate function is then estimated by ĥ(t_i) = d_i/n_i at an observed death time t_i, and by ĥ(t) = d_i/{n_i(t_{i+1} − t_i)} in the interval t_i ≤ t < t_{i+1}. Here d_i is the number of deaths at the i-th death time and n_i is the number of individuals at risk of death at time t_i.

c) Semiparametric Approach to Hazard Estimation: Kouassi and Singh (1997) proposed a mixture of parametric and nonparametric hazard rate estimators, instead of using either exclusively. Let

h_{α_t}(t, θ̂) = α_t ĥ(t, θ̂) + (1 − α_t) ĥ(t),   (2.3.8)

where ĥ(t, θ̂) and ĥ(t) are parametric and nonparametric estimators, respectively, and α_t is estimated by minimizing the mean square error of h_{α_t}(t, θ̂).

d) Cox's Proportional Hazards Model: Introduced by Cox (1972), this approach was developed in order to estimate the effects of different covariates influencing the times to failure of a system. The proportional hazards model assumes that the hazard rate of a unit is the product of an unspecified baseline failure rate, which is a function of time only, and a positive function g(Z, A), independent of time, which incorporates the effects of a number of covariates. The failure rate of a unit is then given by

h(t, Z) = h₀(t)g(Z, A),

where h₀ is the baseline hazard rate, Z is a row vector consisting of the covariates, and A is a column vector consisting of the unknown parameters (also called regression parameters) of the model. It is assumed that the form of g(Z, A) is known, while h₀(t) is left unspecified.

2.4 Shape Restricted Regression

In this section, before introducing our new nonparametric shape restricted estimator, we review some fundamental concepts that lay the groundwork for its construction. The definitions, results and proofs, along with more details about the properties of the constraint cone and polar cones, can be found in Rockafellar (1970), Robertson et al. (1988), Fraser and Massam (1989), and Meyer (1999a). Suppose we have the following model:

y_i = f(x_i) + σε_i,  i = 1, ..., n.

In this model the errors ε_i are independent and have a standard normal distribution, f ∈ Λ, and Λ is a class of regression functions sharing a qualitative property such as monotonicity, convexity or concavity. The constrained set over which we maximize the likelihood or minimize the sum of squared errors is constructed as follows. Let θ_i = f(x_i), where the x_i are known, distinct and ordered for 1 ≤ i ≤ n. The monotone nondecreasing constraints can be written as

θ_1 ≤ θ_2 ≤ ... ≤ θ_n.

If we consider piecewise linear approximations to the regression function with knots at the x values, the nondecreasing convex, nondecreasing concave and convex shape restrictions can be written as sets of linear inequality constraints. For example, if we are considering the convex case, then we have

$$\frac{\theta_2 - \theta_1}{x_2 - x_1} \le \frac{\theta_3 - \theta_2}{x_3 - x_2} \le \cdots \le \frac{\theta_n - \theta_{n-1}}{x_n - x_{n-1}}.$$
The constraints for nondecreasing convex can be written as
$$\frac{\theta_2 - \theta_1}{x_2 - x_1} \le \frac{\theta_3 - \theta_2}{x_3 - x_2} \le \cdots \le \frac{\theta_n - \theta_{n-1}}{x_n - x_{n-1}}, \quad \theta_1 \le \theta_2,$$
and the constraints for nondecreasing concave are given by
$$\frac{\theta_2 - \theta_1}{x_2 - x_1} \ge \frac{\theta_3 - \theta_2}{x_3 - x_2} \ge \cdots \ge \frac{\theta_n - \theta_{n-1}}{x_n - x_{n-1}}, \quad \theta_{n-1} \le \theta_n.$$
Any of these sets of inequalities defines $m$ half spaces in $\mathbb{R}^n$, and their intersection forms a closed polyhedral convex cone in $\mathbb{R}^n$. The cone is designated by $C = \{\theta : A\theta \ge 0\}$ for an $m \times n$ constraint matrix $A$ (see Rockafellar, 1970, p. 170). For monotone, nondecreasing convex, and nondecreasing concave we have $m = n - 1$, and for convex $m = n - 2$. The nonzero elements of the $m \times n$ matrix $A$ are:

1. For monotone constraints, $A_{i,i} = -1$ and $A_{i,i+1} = 1$ for $1 \le i \le n - 1$.
2. For nondecreasing convex, $A_{1,1} = -1$, $A_{1,2} = 1$, and $A_{i,i-1} = x_{i+1} - x_i$, $A_{i,i} = x_{i-1} - x_{i+1}$, $A_{i,i+1} = x_i - x_{i-1}$ for $2 \le i \le n - 1$.
3. For nondecreasing concave, $A_{i,i} = -(x_{i+2} - x_{i+1})$, $A_{i,i+1} = -(x_i - x_{i+2})$, $A_{i,i+2} = -(x_{i+1} - x_i)$ for $1 \le i \le n - 2$, and $A_{n-1,n-1} = -1$, $A_{n-1,n} = 1$.
4. For convex, $A_{i,i} = x_{i+2} - x_{i+1}$, $A_{i,i+1} = x_i - x_{i+2}$, and $A_{i,i+2} = x_{i+1} - x_i$ for $1 \le i \le n - 2$.

For example, if $n = 5$, the monotone constraint matrix $A$ is given by
$$A = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}.$$
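The nonzero-element rules above translate directly into code. The sketch below (function names are ours, not the dissertation's) builds the constraint matrix $A$ for the monotone and convex cases and checks the cone condition $A\theta \ge 0$:

```python
def monotone_A(n):
    """Monotone constraints: row i encodes theta_{i+1} - theta_i >= 0,
    so A[i][i] = -1 and A[i][i+1] = 1, giving m = n - 1 rows."""
    A = [[0.0] * n for _ in range(n - 1)]
    for i in range(n - 1):
        A[i][i], A[i][i + 1] = -1.0, 1.0
    return A

def convex_A(x):
    """Convex constraints: row i says the slope over [x_i, x_{i+1}] is at
    most the slope over [x_{i+1}, x_{i+2}], cleared of denominators;
    m = n - 2 rows."""
    n = len(x)
    A = [[0.0] * n for _ in range(n - 2)]
    for i in range(n - 2):
        A[i][i] = x[i + 2] - x[i + 1]
        A[i][i + 1] = x[i] - x[i + 2]
        A[i][i + 2] = x[i + 1] - x[i]
    return A

def satisfies(A, theta):
    """Cone membership: every constraint row has A theta >= 0."""
    return all(sum(a * t for a, t in zip(row, theta)) >= 0 for row in A)

# theta = x^2 on x = 1..5 is both nondecreasing and convex:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
theta = [v * v for v in x]
```

A quick check with `satisfies(monotone_A(5), theta)` and `satisfies(convex_A(x), theta)` confirms that a convex increasing vector lies in both cones, while an oscillating vector is rejected by the convexity rows.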

If $n = 5$ and the $x$ coordinates are equally spaced, the nondecreasing convex, nondecreasing concave, and convex constraints are given by the following constraint matrices, respectively:
$$A = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \end{pmatrix}, \quad A = \begin{pmatrix} -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}, \quad \text{and} \quad A = \begin{pmatrix} 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \end{pmatrix}.$$

Projection on a closed convex set

The ordinary least-squares regression estimator is the projection of the data vector $y$ onto a lower-dimensional linear subspace of $\mathbb{R}^n$, whereas the shape restricted estimator is obtained through the projection of $y$ onto an $m$-dimensional polyhedral convex cone in $\mathbb{R}^n$ (Meyer, 2003). The following useful proposition shows the existence and uniqueness of the projection of the vector $y$ onto a closed convex set (see Rockafellar, 1970, p. 332).

Proposition 1 Let $C$ be a closed convex subset of $\mathbb{R}^n$.
1. For $y \in \mathbb{R}^n$ and $\hat{\theta} \in C$, the following properties are equivalent:

(a) $\|y - \hat{\theta}\| = \min_{\theta \in C} \|y - \theta\|$;
(b) $\langle y - \hat{\theta},\ \theta - \hat{\theta} \rangle \le 0$ for all $\theta \in C$.
2. For every $y \in \mathbb{R}^n$, there exists a unique point $\hat{\theta} \in C$ satisfying (a) and (b).

Here $\hat{\theta}$ is said to be the projection of $y$ onto $C$, where the notation $\langle y, x \rangle = \sum_i x_i y_i$ refers to the vector inner product of $x$ and $y$. If $C$ is also a cone, it is easy to see that (b) of Proposition 1 becomes
$$\langle y - \hat{\theta}, \hat{\theta} \rangle = 0 \quad \text{and} \quad \langle y - \hat{\theta}, \theta \rangle \le 0, \quad \forall \theta \in C,$$
which are the necessary and sufficient conditions for $\hat{\theta}$ to minimize $\|y - \theta\|^2$ over $C$ (see Robertson et al. 1988, p. 17). For monotone regression there is a closed form solution (see Robertson et al. 1988, p. 23). For nondecreasing convex, nondecreasing concave, and convex regression, the problem of finding the least-squares estimator $\hat{\theta}$ is a quadratic programming problem with no known closed-form solution; however, $\hat{\theta}$ can be found using the mixed primal-dual bases algorithm (Fraser and Massam, 1989) or the hinge algorithm (Meyer, 1999a).

Constraint Cone

Let $V$ be the linear space spanned by $\mathbf{1} = (1, \ldots, 1)^T$ for monotone, nondecreasing convex, and nondecreasing concave regression, and let $V$ be the linear space spanned by $\mathbf{1}$ and $x = (x_1, \ldots, x_n)^T$ for convex regression. Note that $V \subset C$ and $V$ is perpendicular to the rows of the corresponding constraint matrix. Let $\Omega = C \cap V^\perp$, where $V^\perp$ is the orthogonal complement of $V$. This implies $C = \Omega \oplus V$. We refer to $\Omega$ as the constraint cone. By partitioning $C$ into the two orthogonal pieces $\Omega$ and $V$, the projection of a vector $y \in \mathbb{R}^n$ onto $C$ is the sum

of the projection of $y$ onto $\Omega$ and the projection of $y$ onto $V$, which simplifies the computation. Moreover, the edges of $\Omega$ are unique up to a multiplicative factor. The edges are a set of vectors in the constraint cone such that any vector in $\Omega$ can be written as a nonnegative linear combination of edges, and no edge is itself a nonnegative linear combination of the other edges. For a more detailed discussion, see Meyer (1999a) or Fraser and Massam (1989).

Edges of Constraint Cone and Polar Cone

The constraint space can be specified by a set of linearly independent vectors $\delta_1, \ldots, \delta_m$, so that
$$\Omega = \Big\{\theta : \theta = \sum_{j=1}^m b_j \delta_j,\ b_1, \ldots, b_m \ge 0\Big\}$$
and the constraint set
$$C = \Big\{\theta : \theta = \sum_{j=1}^m b_j \delta_j + \nu,\ b_1, \ldots, b_m \ge 0 \text{ and } \nu \in V\Big\},$$
where $m = n - 1$ for monotone, nondecreasing concave, and nondecreasing convex, and $m = n - 2$ for convex. For example, if $\Omega$ is the set of all nondecreasing concave, nondecreasing convex, or convex vectors in $\mathbb{R}^n$, it can be specified using the vectors $\delta_j$, which can be obtained from the formula
$$\Delta = A'(AA')^{-1} = [\delta_1, \ldots, \delta_m].$$
For $n = 5$ and equally spaced $x$ values, $\Delta$ is given by, for convex,
$$\Delta = \begin{pmatrix} 2 & 4 & 1 \\ -2 & -1 & 0 \\ -1 & -6 & -1 \\ 0 & -1 & -2 \\ 1 & 4 & 2 \end{pmatrix},$$
for nondecreasing convex,
$$\Delta = \begin{pmatrix} -2 & -6 & -3 & -1 \\ -1 & -6 & -3 & -1 \\ 0 & -1 & -3 & -1 \\ 1 & 4 & 2 & -1 \\ 2 & 9 & 7 & 4 \end{pmatrix},$$
and for nondecreasing concave,

$$\Delta = \begin{pmatrix} -4 & -7 & -9 & -2 \\ 1 & -2 & -4 & -1 \\ 1 & 3 & 1 & 0 \\ 1 & 3 & 6 & 1 \\ 1 & 3 & 6 & 2 \end{pmatrix}, \quad \text{and for monotone,} \quad \Delta = \begin{pmatrix} -4 & -3 & -2 & -1 \\ 1 & -3 & -2 & -1 \\ 1 & 2 & -2 & -1 \\ 1 & 2 & 3 & -1 \\ 1 & 2 & 3 & 4 \end{pmatrix}.$$
For convenience of presentation, the smallest possible multiplicative factors are chosen so that all entries of $\Delta$ are integers. Any convex vector $\theta \in C$ is a nonnegative linear combination of the columns of the corresponding $\Delta$ plus a linear combination of $\mathbf{1}$ and $x$. If $C$ is the set of all convex vectors in $\mathbb{R}^n$, we can also define the vectors $\delta_j$ to be the rows of the following matrix:
$$\begin{pmatrix} 0 & 0 & x_3 - x_2 & x_4 - x_2 & \cdots & x_n - x_2 \\ 0 & 0 & 0 & x_4 - x_3 & \cdots & x_n - x_3 \\ \vdots & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & x_n - x_{n-1} \end{pmatrix},$$
that is, $(\delta_j)_i = (x_i - x_{j+1})_+$ for $j = 1, \ldots, n-2$. For a large data set it is better to use these vectors $\delta_j$, because the previous method of obtaining the edges is computationally intensive. Another advantage is that the computations of inner products with the second approach are faster because of all the zero entries in the vectors.
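The sparse second construction can be checked numerically: each ramp vector satisfies every convexity constraint with equality except at its own bend, which is the defining property of an edge. A small sketch (function names are ours; equally spaced $x$ is used only as test data):

```python
def convex_A(x):
    """Convexity constraint matrix with denominators cleared: row i has
    entries x[i+2]-x[i+1], x[i]-x[i+2], x[i+1]-x[i] at columns i, i+1, i+2."""
    n = len(x)
    A = [[0.0] * n for _ in range(n - 2)]
    for i in range(n - 2):
        A[i][i] = x[i + 2] - x[i + 1]
        A[i][i + 1] = x[i] - x[i + 2]
        A[i][i + 2] = x[i + 1] - x[i]
    return A

def ramp_edges(x):
    """Sparse generators of the convex cone (modulo span{1, x}):
    delta_j(i) = max(x_i - x_{j+1}, 0), a ramp bending at x_{j+1}."""
    n = len(x)
    return [[max(xi - x[j + 1], 0.0) for xi in x] for j in range(n - 2)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
A, deltas = convex_A(x), ramp_edges(x)
# bends[j][i] = inner product of constraint row i with ramp j; it should
# be strictly positive when i == j (the bend) and zero otherwise:
bends = [[sum(a * d for a, d in zip(row, delta)) for row in A]
         for delta in deltas]
```

The single-nonzero pattern of `bends` is exactly why inner products with these vectors are cheap: each ramp is zero up to its knot.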

The polar cone of the constraint cone $\Omega$ is (Rockafellar, 1970, p. 121)
$$\Omega^0 = \{\rho : \langle \rho, \theta \rangle \le 0,\ \forall \theta \in \Omega\}.$$
Geometrically, the polar cone is the set of points in $\mathbb{R}^n$ that make an obtuse angle with all points in $\Omega$. Let us note some straightforward properties of $\Omega^0$:
1. $\Omega^0$ is a closed convex cone;
2. the only possible element in $\Omega \cap \Omega^0$ is $0$;
3. $\gamma_1, \ldots, \gamma_m \in \Omega^0$, where the $\gamma_j$ are the negative rows of $A$, i.e., $[\gamma_1, \ldots, \gamma_m] = -A'$.

The relationship between the $\delta_j$ and the $\gamma_i$ is (Fraser and Massam, 1989)
$$\langle \delta_j, \gamma_i \rangle = \begin{cases} -1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}$$
These vectors are generators of the polar cone; that is, each $\rho \in \Omega^0$ can be written as a nonnegative linear combination of the $\gamma_j$. To see this, let $K$ be the cone generated by the $\gamma_i$,
$$K = \Big\{\kappa : \kappa = \sum_{i=1}^m a_i \gamma_i,\ a_i \ge 0\Big\};$$
then for any $\theta \in \Omega$ we have
$$\langle \theta, \kappa \rangle = \sum_{i=1}^m a_i \langle \theta, \gamma_i \rangle \le 0, \quad \forall \kappa \in K.$$
This shows that $\Omega \subseteq K^0$, where $K^0$ is the polar cone of $K$. For any $\zeta \in K^0$, we have $\langle \zeta, \gamma_i \rangle \le 0$, $i = 1, \ldots, m$, which shows that $K^0 \subseteq \Omega$. Therefore, $\Omega = K^0$. Since $K^{00} = K$ (Rockafellar, 1970, p. 121), we have $\Omega^0 = K^{00} = K$.
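For the monotone cone, both the exact edges and the polar-cone generators can be written down explicitly, so the duality relation above can be verified in exact rational arithmetic. The closed step-vector form of the columns of $A'(AA')^{-1}$ is our shortcut for this special case:

```python
from fractions import Fraction

def monotone_dual_pair(n):
    """For the monotone cone: gamma_i = -(row i of A), and the exact edges
    (columns of A'(AA')^{-1}) are centered step vectors with a jump of 1
    at position j and mean zero.  The explicit step form is our shortcut."""
    gammas = []
    for i in range(n - 1):
        g = [Fraction(0)] * n
        g[i], g[i + 1] = Fraction(1), Fraction(-1)   # negated difference row
        gammas.append(g)
    deltas = []
    for j in range(1, n):
        # -(n-j)/n on the first j coordinates, j/n on the rest: mean zero,
        # and A delta_j = e_j since the only jump (of size 1) is at j.
        d = [Fraction(-(n - j), n)] * j + [Fraction(j, n)] * (n - j)
        deltas.append(d)
    return deltas, gammas

deltas, gammas = monotone_dual_pair(5)
# inner[j][i] realizes <delta_j, gamma_i> = -1 if i == j, else 0:
inner = [[sum(a * b for a, b in zip(d, g)) for g in gammas] for d in deltas]
```

Because `Fraction` arithmetic is exact, the $-1/0$ pattern comes out with no rounding error, and the zero column sums confirm that each edge lies in $V^\perp$.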

Faces and Sectors

The faces of the constraint cone are constructed from subsets of the constraint cone edges. Any subset $J \subseteq \{1, \ldots, m\}$ defines a face of the constraint cone; i.e., a face consists of all nonnegative linear combinations of the constraint cone edges $\delta_j$, $j \in J$. Note that $\Omega$ itself is the face for $J = \{1, \ldots, m\}$. The subsets $J$ also define sectors, which are themselves polyhedral convex cones. Let the sector $C_J$ be the set of all $y$ in $\mathbb{R}^n$ such that
$$y = \sum_{j \in J} b_j \delta_j + \sum_{j \notin J} b_j \gamma_j + \nu, \qquad (2.4.1)$$
where $b_j \ge 0$ for $j \in J$, $b_j > 0$ for $j \notin J$, and $\nu \in V$. The sectors $C_J$ partition $\mathbb{R}^n$, with $J = \emptyset$ corresponding to the interior of the polar cone and the sector with $J = \{1, 2, \ldots, m\}$ coinciding with the constraint cone. Further, the representation of $y \in C_J$ given in (2.4.1) is unique (Meyer, 1999a). The following propositions are useful tools for finding the constrained least squares estimator; their proofs are discussed in detail by Meyer (1999a).

Proposition 2 Given $y \in \mathbb{R}^n$ such that $y = \sum_{j \in J} b_j \delta_j + \sum_{j \notin J} b_j \gamma_j + \nu$, the projection of $y$ onto the constraint set $C$ is
$$\hat{\theta} = \sum_{j \in J} b_j \delta_j + \nu, \qquad (2.4.2)$$
and the residual vector $\hat{\rho} = y - \hat{\theta} = \sum_{j \notin J} b_j \gamma_j$ is the projection of $y$ onto the polar cone $\Omega^0$.

Proposition 3 If $y \in C_J$, then $\hat{\theta}$ is the projection of $y$ onto the linear space spanned by the vectors $\delta_j$, $j \in J$, plus the projection of $y$ onto $V$. Similarly, $\hat{\rho}$ is the projection of $y$ onto the linear space spanned by the vectors $\gamma_j$, $j \notin J$.

Once the set $J$ is determined, then using Propositions 2 and 3 the constrained least squares estimate $\hat{\theta}$ can be found through ordinary least-squares regression (OLS), using $\nu \in V$

and the $\delta_j$ for $j \in J$ as regressors. Alternatively, $\hat{\rho}$ can be obtained through OLS using the $\gamma_j$ for $j \notin J$ as regressors, and then $\hat{\theta} = y - \hat{\rho}$. To find the set $J$ and $\hat{\theta}$, Fraser and Massam (1989) and Meyer (1999a) proposed the mixed primal-dual bases algorithm and the hinge algorithm, respectively. The method chosen here is the hinge algorithm, because it is fast, useful within iterative projection algorithms, and computationally more efficient.

The hinge algorithm

This algorithm uses a set of vectors $\delta_1, \ldots, \delta_m$ and $\nu$ to characterize the constraint space. The algorithm finds $\hat{\theta}$ by finding $\hat{J}$ through a series of guesses $J_k$. At a typical iteration, the current estimate $\theta_k$ is obtained by the least-squares regression of $y$ on the $\delta_j$, $j \in J_k$, and $\nu$. We call the $\delta_j$ hinges since, for the convex regression problem, the points $(x_j, \theta_j)$, $j \in J$, are the bending points at which the line segments change slope, and there is only one way that the bends are allowed to go. The initial guess $J_0$ is set to be empty. The algorithm can be summarized in four steps:

1. Use $\nu$ as the regressors to obtain a least-squares estimate $\theta_0$; for convex, $\nu = \{\mathbf{1}, x\}$, and for monotone, nondecreasing convex, and nondecreasing concave, $\nu = \mathbf{1}$.
2. At the $k$th iteration, compute $\langle y - \theta_k, \delta_j \rangle$ for each $j \notin J_k$. If these are all non-positive, then stop. If not, add to the model the vector $\delta_j$ for which this inner product is largest.
3. Get the least-squares fit with the new set of $\delta$-vectors.
4. Check whether the fit satisfies the constraints on the coefficients, i.e., whether $b_j \ge 0$ for $j \in J$.

If yes, go to step 2. If no, choose the hinge with the largest negative coefficient, remove it from the current set $J$, and go to step 3.

At each stage, the new hinge is added where it is most needed, and other hinges are removed if the new fit does not satisfy the constraints. It is clear that if the algorithm ends it gives the correct solution, and the algorithm does end; see Meyer (1999a) for the proof.

The mixed primal-dual bases algorithm

The mixed primal-dual bases algorithm is also used to find the projection onto a closed convex cone. In this algorithm, the $\gamma_j$ are the primal vectors and the $\delta_j$ are the dual vectors. The algorithm finds the correct set $\hat{J}$ by moving along the line segment connecting the point $z_0 = \sum_{j=1}^m \delta_j$ with $\bar{z}$, where $\bar{z}$ is the projection of the data $y$ onto the subspace spanned by $\delta_j$, $j = 1, \ldots, m$. At the $k$th iteration, a point $z_k$ on the line segment is reached such that the distance between $z_k$ and $\bar{z}$ is strictly decreasing in $k$; this point is also on a face of $\Omega_{J_k}$. The next iteration finds $z_{k+1}$ farther along the segment, on a face of $\Omega_{J_{k+1}}$. At the beginning of the iteration, both $\bar{z}$ and $z_k$ are expressed in the basis defined by $J_k$, as
$$\bar{z} = \sum_{j \in J_k} b_j \delta_j + \sum_{j \notin J_k} b_j \gamma_j$$
and
$$z_k = \sum_{j \in J_k} a_j \delta_j + \sum_{j \notin J_k} a_j \gamma_j,$$
where $a_j \ge 0$ for $j \in J_k$ and $a_j > 0$ for $j \notin J_k$. If $b_j \ge 0$ for $j \in J_k$ and $b_j > 0$ for $j \notin J_k$, the algorithm stops. Otherwise, find
$$z_{k+1} = z_k + \alpha_{k+1}(\bar{z} - z_k),$$

where $\alpha_{k+1} \in (0, 1)$ is taken as large as possible while the coefficients of $z_{k+1}$ remain nonnegative for $j \in J_k$ and positive for $j \notin J_k$. The point $z_{k+1}$ lies on a face of $\Omega_{J_k}$ that divides $\Omega_{J_k}$ and $\Omega_{J_{k+1}}$. The algorithm terminates at the face of the sector containing $\bar{z}$. It clearly takes a finite number of iterations, since there are a finite number of sectors.

Example of Shape Restricted Regression

The following are two examples of shape restricted fits. In Figure 2.1 (a), the data were generated from the convex function $f(x_i) = 2x_i + 1/x_i$ with independent zero-mean normal errors and fitted by convex and quadratic regressions. The solid curve is the convex fit, the dashed curve is the quadratic fit, and the dotted curve is the underlying convex function. In Figure 2.1 (b), the data were generated from the quadratic function $f(x_i) = x_i^2$ with independent zero-mean normal errors and fitted by convex and linear regressions. The solid curve is the convex fit, the dashed curve is the linear fit, and the dotted curve is the underlying quadratic function. In both cases it can be clearly seen that the shape restricted regressions fit the data better.

Figure 2.1: Examples of fits to a scatterplot. (a) The solid curve is the convex fit, the dashed curve is the quadratic fit, and the dotted curve is the underlying convex function. (b) The solid curve is the convex fit, the dashed curve is the linear fit, and the dotted curve is the underlying quadratic function.
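As an illustration of the hinge algorithm described above, the sketch below specializes it to the monotone cone, where least squares on a set of step-function hinges plus the constant vector reduces to block means over the current breakpoints. The block-mean shortcut and the prefix-sum form of the residual inner products are our simplifications for this special case, not the dissertation's general implementation:

```python
def hinge_isotonic(y, tol=1e-12):
    """Hinge-algorithm sketch for monotone (isotonic) least squares.
    J holds the current breakpoints; the LS fit for J is the mean of y
    within each block, since the step hinges plus the constant vector
    span exactly the piecewise-constant vectors with breaks in J."""
    n = len(y)
    J = set()
    while True:
        # Least-squares fit for the current hinge set: block means.
        cuts = sorted(J) + [n]
        theta, start = [], 0
        for c in cuts:
            block = y[start:c]
            theta += [sum(block) / len(block)] * len(block)
            start = c
        # Step 4: a hinge's coefficient is the step size at its breakpoint,
        # so a negative coefficient means the fit steps DOWN there.
        down = [j for j in J if theta[j] < theta[j - 1] - tol]
        if down:
            J.remove(max(down, key=lambda j: theta[j - 1] - theta[j]))
            continue                       # step 3: refit without it
        # Step 2: <y - theta, delta_j> for unused hinges; for centered
        # step hinges this equals -n * (prefix sum of the residuals).
        r = [yi - ti for yi, ti in zip(y, theta)]
        prefix, score = 0.0, {}
        for j in range(1, n):
            prefix += r[j - 1]
            if j not in J:
                score[j] = -n * prefix
        if not score or max(score.values()) <= tol:
            return theta                   # all non-positive: done
        J.add(max(score, key=score.get))   # add the most-needed hinge
```

On small examples this reproduces the pool-adjacent-violators answer, e.g. `hinge_isotonic([3.0, 1.0, 2.0])` pools all three points at their common mean.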

Chapter 3

ESTIMATION OF HAZARD FUNCTION UNDER SHAPE RESTRICTIONS

In this chapter, we introduce a new nonparametric method for estimating the hazard function that imposes shape restrictions on it, such as increasing, concave, convex, nondecreasing concave, nondecreasing convex, or concave-convex. We derive the shape restricted estimator of the hazard rate based on the maximum likelihood method from uncensored and right censored samples. We also examine how the estimated hazard function behaves for a Weibull distribution, an exponentiated Weibull distribution, and a distribution with a polynomial hazard function, with different parameters, using the new estimator and some pre-existing estimators.

3.1 Uncensored Sample

Let $X_1, X_2, \ldots, X_n$ be a random sample of lifetimes from the distribution with density $f$, and let $F$ and $S = 1 - F$ be the corresponding distribution and survival functions, respectively. The associated hazard rate is $h = f/S$ for $F(x) < 1$. The problem is to estimate $f$, $S$, or $h$ by maximizing
$$\log\Big(\prod_{i=1}^n f(x_i)\Big) = \sum_{i=1}^n \log f(x_i)$$
subject to $h \in \Lambda$, where $\Lambda$ is a class of hazard functions sharing a qualitative property such as monotonicity, convexity, or concavity. Let $0 = x_0 < x_1 < \cdots < x_n$ be the order statistics of the random sample of lifetimes;

recall that $f(x)$ can be written as
$$f(x) = h(x)S(x) = h(x)\exp\Big\{-\int_0^x h(u)\,du\Big\};$$
then the log-likelihood function is
$$l = \sum_{i=1}^n \log f(x_i) = \sum_{i=1}^n \log h(x_i) - \sum_{i=1}^n \int_0^{x_i} h(u)\,du. \qquad (3.1.1)$$

Numerical Integration

If $h(t)$ is approximated by a piecewise linear function with knots at the data, the integral of $h(t)$ is the sum of trapezoid areas, and (3.1.1) becomes
$$l = \sum_{i=1}^n \log h(x_i) - \sum_{i=1}^n \sum_{j=1}^{i} \frac{1}{2}\big[h(x_j) + h(x_{j-1})\big](x_j - x_{j-1}).$$
Expanding the summation, the expression can be simplified to
$$l = \sum_{i=1}^n \log h(x_i) - \sum_{i=1}^n c_i h(x_i), \qquad (3.1.2)$$
where the $c_i$ depend on the $x_j$. They can be derived by applying the trapezoidal rule to each segment and summing the results as follows:
$$\sum_{i=1}^n \int_0^{x_i} h(u)\,du = x_1\frac{h(0)+h(x_1)}{2}$$
$$+\ x_1\frac{h(0)+h(x_1)}{2} + (x_2-x_1)\frac{h(x_1)+h(x_2)}{2}$$
$$+\ x_1\frac{h(0)+h(x_1)}{2} + (x_2-x_1)\frac{h(x_1)+h(x_2)}{2} + (x_3-x_2)\frac{h(x_2)+h(x_3)}{2}$$
$$\vdots$$
$$+\ x_1\frac{h(0)+h(x_1)}{2} + (x_2-x_1)\frac{h(x_1)+h(x_2)}{2} + \cdots + (x_n-x_{n-1})\frac{h(x_{n-1})+h(x_n)}{2}.$$

Collecting $h(x_i)$ terms and simplifying yields the following:
$$2\sum_{i=1}^n \int_0^{x_i} h(u)\,du = n x_1 h(0) + \big(x_1 + (n-1)x_2\big)h(x_1) + \big(x_2 + (n-2)x_3 - (n-1)x_1\big)h(x_2)$$
$$+\ \big(x_3 + (n-3)x_4 - (n-2)x_2\big)h(x_3) + \cdots + (x_n - x_{n-1})h(x_n). \qquad (3.1.3)$$

Note that $h(0)$ must be a function of the elements of the vector $(h(x_1), \ldots, h(x_n))$ in accordance with the shape restrictions. For example, if we are assuming an increasing hazard function, it is clear that $h(0) = 0$ is the choice that satisfies the shape restriction and maximizes the likelihood. If $h$ is constrained to be convex, then we define
$$h(0) = \max\Big\{0,\ \frac{h(x_1)x_2}{x_2 - x_1} - \frac{h(x_2)x_1}{x_2 - x_1}\Big\} \qquad (3.1.4)$$
as the choice that preserves the convex shape and maximizes the likelihood under the assumptions. If $h(x_1)x_2/(x_2 - x_1) - h(x_2)x_1/(x_2 - x_1) > 0$, then plugging Equation (3.1.4) into Equation (3.1.3) gives
$$2\sum_{i=1}^n \int_0^{x_i} h(u)\,du = n x_1\Big(\frac{h(x_1)x_2}{x_2 - x_1} - \frac{h(x_2)x_1}{x_2 - x_1}\Big) + \big(x_1 + (n-1)x_2\big)h(x_1)$$
$$+\ \big(x_2 + (n-2)x_3 - (n-1)x_1\big)h(x_2) + \big(x_3 + (n-3)x_4 - (n-2)x_2\big)h(x_3) + \cdots$$
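The bookkeeping behind (3.1.2) and (3.1.3) can be checked numerically: the double trapezoid sum and the collected-coefficient form must agree. In the sketch below (hypothetical data; the general coefficient $x_k + (n-k)x_{k+1} - (n-k+1)x_{k-1}$ is our closed form for the pattern displayed in (3.1.3), and the array padding is ours):

```python
def total_integral_direct(x, h):
    """Sum over i of the trapezoid-rule integral of h from 0 to x_i.
    x = [x_1, ..., x_n] ordered with x_0 = 0 implicit;
    h = [h(0), h(x_1), ..., h(x_n)]."""
    xs = [0.0] + x
    total = 0.0
    for i in range(1, len(xs)):            # the integral up to x_i ...
        for j in range(1, i + 1):          # ... as a sum of trapezoids
            total += 0.5 * (h[j] + h[j - 1]) * (xs[j] - xs[j - 1])
    return total

def total_integral_collected(x, h):
    """Same quantity via the collected coefficients of equation (3.1.3):
    2 * sum_i integral = n*x_1*h(0)
        + sum_k [x_k + (n-k) x_{k+1} - (n-k+1) x_{k-1}] h(x_k)."""
    n = len(x)
    xs = [0.0] + x + [0.0]   # pad x_{n+1}; its weight (n - k) is zero at k = n
    acc = n * xs[1] * h[0]
    for k in range(1, n + 1):
        acc += (xs[k] + (n - k) * xs[k + 1] - (n - k + 1) * xs[k - 1]) * h[k]
    return acc / 2.0

# Hypothetical ordered lifetimes and hazard values at 0 and at each x_i:
x = [0.5, 1.2, 2.0, 3.5]
h = [0.1, 0.3, 0.4, 0.9, 1.6]
```

Agreement of the two functions confirms that the $c_i$ in (3.1.2) are exactly half of the collected coefficients, with the $h(0)$ term folded in according to the chosen shape restriction.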


More information

ESTIMATING STATISTICAL CHARACTERISTICS UNDER INTERVAL UNCERTAINTY AND CONSTRAINTS: MEAN, VARIANCE, COVARIANCE, AND CORRELATION ALI JALAL-KAMALI

ESTIMATING STATISTICAL CHARACTERISTICS UNDER INTERVAL UNCERTAINTY AND CONSTRAINTS: MEAN, VARIANCE, COVARIANCE, AND CORRELATION ALI JALAL-KAMALI ESTIMATING STATISTICAL CHARACTERISTICS UNDER INTERVAL UNCERTAINTY AND CONSTRAINTS: MEAN, VARIANCE, COVARIANCE, AND CORRELATION ALI JALAL-KAMALI Department of Computer Science APPROVED: Vladik Kreinovich,

More information

Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring

Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring A. Ganguly, S. Mitra, D. Samanta, D. Kundu,2 Abstract Epstein [9] introduced the Type-I hybrid censoring scheme

More information

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates Communications in Statistics - Theory and Methods ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20 Analysis of Gamma and Weibull Lifetime Data under a

More information

Statistical Inference and Methods

Statistical Inference and Methods Department of Mathematics Imperial College London d.stephens@imperial.ac.uk http://stats.ma.ic.ac.uk/ das01/ 31st January 2006 Part VI Session 6: Filtering and Time to Event Data Session 6: Filtering and

More information

Censoring mechanisms

Censoring mechanisms Censoring mechanisms Patrick Breheny September 3 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Fixed vs. random censoring In the previous lecture, we derived the contribution to the likelihood

More information

Survival Regression Models

Survival Regression Models Survival Regression Models David M. Rocke May 18, 2017 David M. Rocke Survival Regression Models May 18, 2017 1 / 32 Background on the Proportional Hazards Model The exponential distribution has constant

More information

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 Department of Statistics North Carolina State University Presented by: Butch Tsiatis, Department of Statistics, NCSU

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

Survival Analysis I (CHL5209H)

Survival Analysis I (CHL5209H) Survival Analysis Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca January 7, 2015 31-1 Literature Clayton D & Hills M (1993): Statistical Models in Epidemiology. Not really

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Estimation for Modified Data

Estimation for Modified Data Definition. Estimation for Modified Data 1. Empirical distribution for complete individual data (section 11.) An observation X is truncated from below ( left truncated) at d if when it is at or below d

More information

Exercises. (a) Prove that m(t) =

Exercises. (a) Prove that m(t) = Exercises 1. Lack of memory. Verify that the exponential distribution has the lack of memory property, that is, if T is exponentially distributed with parameter λ > then so is T t given that T > t for

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Constrained estimation for binary and survival data

Constrained estimation for binary and survival data Constrained estimation for binary and survival data Jeremy M. G. Taylor Yong Seok Park John D. Kalbfleisch Biostatistics, University of Michigan May, 2010 () Constrained estimation May, 2010 1 / 43 Outline

More information

ST495: Survival Analysis: Maximum likelihood

ST495: Survival Analysis: Maximum likelihood ST495: Survival Analysis: Maximum likelihood Eric B. Laber Department of Statistics, North Carolina State University February 11, 2014 Everything is deception: seeking the minimum of illusion, keeping

More information

Statistical Inference on Constant Stress Accelerated Life Tests Under Generalized Gamma Lifetime Distributions

Statistical Inference on Constant Stress Accelerated Life Tests Under Generalized Gamma Lifetime Distributions Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS040) p.4828 Statistical Inference on Constant Stress Accelerated Life Tests Under Generalized Gamma Lifetime Distributions

More information

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes: Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Hybrid Censoring; An Introduction 2

Hybrid Censoring; An Introduction 2 Hybrid Censoring; An Introduction 2 Debasis Kundu Department of Mathematics & Statistics Indian Institute of Technology Kanpur 23-rd November, 2010 2 This is a joint work with N. Balakrishnan Debasis Kundu

More information

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Overview of today s class Kaplan-Meier Curve

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

CIMAT Taller de Modelos de Capture y Recaptura Known Fate Survival Analysis

CIMAT Taller de Modelos de Capture y Recaptura Known Fate Survival Analysis CIMAT Taller de Modelos de Capture y Recaptura 2010 Known Fate urvival Analysis B D BALANCE MODEL implest population model N = λ t+ 1 N t Deeper understanding of dynamics can be gained by identifying variation

More information

Consider Table 1 (Note connection to start-stop process).

Consider Table 1 (Note connection to start-stop process). Discrete-Time Data and Models Discretized duration data are still duration data! Consider Table 1 (Note connection to start-stop process). Table 1: Example of Discrete-Time Event History Data Case Event

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

Bayesian Nonparametric Inference Methods for Mean Residual Life Functions

Bayesian Nonparametric Inference Methods for Mean Residual Life Functions Bayesian Nonparametric Inference Methods for Mean Residual Life Functions Valerie Poynor Department of Applied Mathematics and Statistics, University of California, Santa Cruz April 28, 212 1/3 Outline

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

SOLUTION FOR HOMEWORK 8, STAT 4372

SOLUTION FOR HOMEWORK 8, STAT 4372 SOLUTION FOR HOMEWORK 8, STAT 4372 Welcome to your 8th homework. Here you have an opportunity to solve classical estimation problems which are the must to solve on the exam due to their simplicity. 1.

More information

FAILURE-TIME WITH DELAYED ONSET

FAILURE-TIME WITH DELAYED ONSET REVSTAT Statistical Journal Volume 13 Number 3 November 2015 227 231 FAILURE-TIME WITH DELAYED ONSET Authors: Man Yu Wong Department of Mathematics Hong Kong University of Science and Technology Hong Kong

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Frailty Models and Copulas: Similarities and Differences

Frailty Models and Copulas: Similarities and Differences Frailty Models and Copulas: Similarities and Differences KLARA GOETHALS, PAUL JANSSEN & LUC DUCHATEAU Department of Physiology and Biometrics, Ghent University, Belgium; Center for Statistics, Hasselt

More information

Lecture 4 - Survival Models

Lecture 4 - Survival Models Lecture 4 - Survival Models Survival Models Definition and Hazards Kaplan Meier Proportional Hazards Model Estimation of Survival in R GLM Extensions: Survival Models Survival Models are a common and incredibly

More information

Analysing geoadditive regression data: a mixed model approach

Analysing geoadditive regression data: a mixed model approach Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression

More information

Chapter 4 Parametric Families of Lifetime Distributions

Chapter 4 Parametric Families of Lifetime Distributions Chapter 4 Parametric Families of Lifetime istributions In this chapter, we present four parametric families of lifetime distributions Weibull distribution, gamma distribution, change-point model, and mixture

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information

There are two things that are particularly nice about the first basis

There are two things that are particularly nice about the first basis Orthogonality and the Gram-Schmidt Process In Chapter 4, we spent a great deal of time studying the problem of finding a basis for a vector space We know that a basis for a vector space can potentially

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Modeling Arbitrarily Interval-Censored Survival Data with External Time-Dependent Covariates

Modeling Arbitrarily Interval-Censored Survival Data with External Time-Dependent Covariates University of Northern Colorado Scholarship & Creative Works @ Digital UNC Dissertations Student Research 12-9-2015 Modeling Arbitrarily Interval-Censored Survival Data with External Time-Dependent Covariates

More information

10 Introduction to Reliability

10 Introduction to Reliability 0 Introduction to Reliability 10 Introduction to Reliability The following notes are based on Volume 6: How to Analyze Reliability Data, by Wayne Nelson (1993), ASQC Press. When considering the reliability

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Philosophy and Features of the mstate package

Philosophy and Features of the mstate package Introduction Mathematical theory Practice Discussion Philosophy and Features of the mstate package Liesbeth de Wreede, Hein Putter Department of Medical Statistics and Bioinformatics Leiden University

More information

Linear Models Review

Linear Models Review Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes

Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes by Se Hee Kim A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial

More information

Hybrid Censoring Scheme: An Introduction

Hybrid Censoring Scheme: An Introduction Department of Mathematics & Statistics Indian Institute of Technology Kanpur August 19, 2014 Outline 1 2 3 4 5 Outline 1 2 3 4 5 What is? Lifetime data analysis is used to analyze data in which the time

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information