Hazard Function, Failure Rate, and A Rule of Thumb for Calculating Empirical Hazard Function of Continuous-Time Failure Data


Feng-feng Li 1,2, Gang Xie 1,2, Yong Sun 1,2, Lin Ma 1,2

1 CRC for Infrastructure and Engineering Asset Management (CIEAM)
2 School of Engineering Systems & Mathematical Sciences, Queensland University of Technology, Brisbane, Australia

Abstract

The hazard function plays an essential role in engineering reliability studies. Distribution-free hazard rate values calculated from observed sample data define the empirical hazard function. A theoretically sound and accurate empirical hazard function may be used directly for analysing the lifetime distribution of complete or censored continuous-time failure data, or as a basis for further parametric modelling in asset management. To bridge the gap between academic theory and data analysis practice, this paper starts by clarifying the relationship between the concepts of hazard function and failure rate. Two often-used empirical hazard function formulas for continuous-time data are then derived directly by discretising the theoretic definitions of the hazard function. The properties of these two formulas are investigated, and their estimation performance against the true hazard function values is compared using simulation samples from an exponential and a Weibull distribution. It is found that one formula calculates the average hazard rate over a specified time interval while the other underestimates the true hazard function values; however, we also show that in most cases the relative error of the underestimation is less than 6%. Both formulas are valid for right censored data and, under certain conditions, for left and interval censored data too. The simulation results show that the average hazard formula always gives more accurate estimates while the other one consistently underestimates, matching the theoretic conclusions completely. Based on the results of this study, we propose a rule of thumb for applying these two most often-used empirical hazard function formulas in data analysis practice.

Keywords: hazard function; failure rate; empirical hazard function; continuous-time failure data.

1 Introduction

The hazard function plays an essential role in the application of probability theory to engineering reliability studies. For example, the Mean Time To Failure (MTTF) is calculated

as the inverse of the hazard rate if we assume the asset system lifetime follows an exponential distribution. In the data analysis stage of asset management, however, the term failure rate is more often used when we try to work out the MTTF. In a sense, there is a gap between probability theory and data analysis when we talk about hazard function and failure rate, because people can be confused by questions like: are these two terms interchangeable; if yes, why not just use one of them; and if they are different, what are the differences? A short answer is: the hazard, or hazard rate, h_i ≡ h(t_i) is the instantaneous failure rate (for non-repairable asset systems) at a time instant t_i, i = 1, 2, .... However, when we talk about failure rate in data analysis it is more often shorthand for the Average Failure Rate (AFR) over a time period t_2 - t_1 (assuming 0 ≤ t_1 < t_2). The AFR can be calculated using the formula [8]

    AFR = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} h(u)\,du.    (1)

Equation (1) is nothing but the average hazard function formula, which is considered the most typical estimate of the true hazard function values [5]. Therefore, we need an empirical hazard function formula so that we can estimate the hazard function h(t) from observed sample data.

We may treat sample failure time data as discrete data, i.e. we consider the observed sample failure times as events that occur at pre-assigned times 0 ≤ t_1 < t_2 < ..., and assume that under a parametric model of interest the hazard function at t_i is h_i = h_i(θ). Let us consider a set of intervals I_i = [t_i, t_{i+1}) covering [0, ∞) for an engineering asset system with N functional components at t = 0. Let us also denote d_i = N(t_i) - N(t_{i+1}), where N(t_i) and N(t_{i+1}) are the numbers of components which are functional at time t_i and time t_{i+1}, respectively. Then the quantity d_i is the number of failures in interval I_i and r_i ≡ N(t_i) is the number of components at risk (i.e. having the potential to fail) at t_i. It can be shown that the maximum likelihood estimator (MLE) is

    \hat{h}_i = d_i / r_i,    (2)

from which the well-known Kaplan-Meier estimator of the reliability function,

    \hat{R}(y) = \prod_{i:\, t_i < y} (1 - \hat{h}_i) = \prod_{i:\, t_i < y} \left(1 - \frac{d_i}{r_i}\right),

is derived. Equation (2) is valid under independent right censoring [2] (pp. 93-97) and [9] (pp. 268-270). However, in data analysis practice, we may be interested in treating the sample failure time data as continuous-time data, as in Equation (1). Two often-used empirical hazard function formulas for continuous-time data are

    \hat{h}_i = \frac{N(t_i) - N(t_i + \Delta t)}{\Delta t \, N(t_i)} = \frac{1}{\Delta t} \frac{d_i}{r_i} \equiv \hat{h}1_i,    (3)

and

    \hat{h}_i = -\frac{1}{\Delta t} \log \frac{N(t_i + \Delta t)}{N(t_i)} = -\frac{1}{\Delta t} \log\left(1 - \frac{d_i}{r_i}\right) \equiv \hat{h}2_i,    (4)

where Δt ≡ t_{i+1} - t_i, written this way to emphasize that failures can happen at any time instant, not necessarily at t_i, i = 1, 2, ..., under the continuous-time data setting.

At first glance, Equations (3) and (4) look very different. When people need to choose one of these two formulas for calculating the empirical hazard function, questions like "which one should I use and why" naturally arise. In addition, industry people may well not know, and hence want to know, how Equations (3) and (4) relate to Equation (1). These questions need to be answered for the true hazard function values to be estimated correctly from sample failure time data in asset management practice. Although these questions are not theoretically difficult, they seem to have been ignored in the literature so far. This paper aims to fill this gap.

The rest of the paper is arranged as follows. In Section 2 we derive Equations (3) and (4) directly by discretising the theoretic definitions of the hazard function, followed by a detailed discussion of the properties of these two formulas in terms of estimating the true hazard function values. In Section 3, we verify our theoretic conclusions by calculating the empirical hazards for two simulation samples, one generated from an exponential distribution and the other from a Weibull distribution. Section 4 shows how to use Equations (3) and (4) properly in a real life scenario. Section 5 concludes the paper with a proposed rule of thumb for applying Equations (3) and (4) in engineering reliability analysis practice.

2 Empirical hazard function derivation and discussion

The empirical hazard function formulas can be derived in various ways. For example, Equation (3) was given in [4] and [7]; Equation (4) was derived from a discussion of the probability of failure in the period [t_i, t_{i+1}) given survival to t_i in [2]. Here we derive Equations (3) and (4) directly from the definition of the hazard function. As can be found in any standard textbook on failure time data analysis, we have the following definition and relationship equations for the hazard function. Assuming the time to failure T is a random variable which can take any value in the interval [0, ∞), the hazard function of T is defined as

    h(t) = \frac{f(t)}{1 - F(t)} = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t \,(1 - F(t))},    (5)

where f(t) and F(t) are the probability density function (pdf) and the cumulative distribution function (cdf) of T, respectively. Since f(t) = dF(t)/dt, after some algebra we get another form of the definition of the hazard function,

    h(t) = -\frac{d[\log(1 - F(t))]}{dt} = -\lim_{\Delta t \to 0} \frac{\log(1 - F(t + \Delta t)) - \log(1 - F(t))}{\Delta t}.    (6)
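For instance (a quick sanity check of ours, not in the original), for the exponential distribution F(t) = 1 - e^{-λt}, Equation (6) gives h(t) = -d[log e^{-λt}]/dt = λ, recovering the familiar constant hazard rate of the exponential model.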

By discretising Equations (5) and (6) respectively, we get

    \hat{h}(t) = \frac{F(t + \Delta t) - F(t)}{\Delta t \,(1 - F(t))},    (7)

and

    \hat{h}(t) = -\frac{\log(1 - F(t + \Delta t)) - \log(1 - F(t))}{\Delta t} = -\frac{1}{\Delta t} \log\left[\frac{1 - F(t + \Delta t)}{1 - F(t)}\right].    (8)

Given our earlier notations N, N(t_i), Δt ≡ t_{i+1} - t_i and h_i ≡ h(t_i), and using the relative frequency as the estimator of F(t_i), we have

    \hat{F}(t_i) = \frac{N - N(t_i)}{N} = 1 - \frac{N(t_i)}{N}.    (9)

By applying Equation (9) to Equations (7) and (8) accordingly, Equations (3) and (4) fall out after some trivial but tedious algebra.

Up to this point it is clear that both formulas (3) and (4) converge to the true values of h_i as Δt approaches zero. Note that this asymptotic convergence property still holds after the introduction of Equation (9) in the derivation, by the law of large numbers [1]. We now investigate their properties when Δt > 0. First, let us rewrite Equation (7) as

    \hat{h}(t) = \frac{\frac{1}{\Delta t}\int_t^{t + \Delta t} f(u)\,du}{1 - F(t)}.    (10)

Equation (10) implies that Equation (3) will underestimate the true hazard function values: the numerator (1/Δt) ∫_t^{t+Δt} f(u) du is the average density over Δt, while 1 - F(u) is monotonically decreasing, so f(u)/(1 - F(t)) ≤ f(u)/(1 - F(u)) = h(u) for every u in [t, t + Δt]. Another way to see that Equation (3) underestimates the true h_i values is to consider Δt as a unit time interval, e.g. one hour, one day, or one year. Then we have

    \hat{h}1_i = \frac{N(t_i) - N(t_i + \Delta t)}{N(t_i)} \le 1,

which implies that these empirical hazard values can never be greater than 1 per unit time. Now let us rewrite Equation (8) as

    \hat{h}(t) = \frac{H(t + \Delta t) - H(t)}{\Delta t},    (11)

where H(t) = \int_0^t h(u)\,du = -\log(1 - F(t)) is the cumulative hazard function. Equation (11) implies that Equation (4) calculates the average values of the true hazard function. Therefore, we should expect Equation (4) to give more accurate and unbiased estimates of the true hazard function values than Equation (3) does. If we denote t + Δt ≡ t_2 and t ≡ t_1, hence Δt = t_2 - t_1, we see that Equation (11) and Equation (1) are identical. This is how Equation (4) relates to the AFR; Equation (3) does not have this direct connection.
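The AFR identity can be checked numerically. The following R sketch is ours, not the authors' script; the helper names S, h and the interval [10, 12) are our own choices, and the distribution parameters anticipate the Weibull sample of Section 3.

```r
## Check that Equation (8)/(4) reproduces the AFR of Equation (1), while
## Equation (7)/(3) undershoots it, for a Weibull(shape = 1.8, scale = 30)
## lifetime over the interval [t1, t2) = [10, 12).
S <- function(t) pweibull(t, shape = 1.8, scale = 30, lower.tail = FALSE)  # 1 - F(t)
h <- function(t) dweibull(t, shape = 1.8, scale = 30) / S(t)               # hazard f/(1 - F)
t1 <- 10; t2 <- 12; dt <- t2 - t1
AFR <- integrate(h, t1, t2)$value / dt        # Equation (1), by numerical integration
h1  <- (S(t1) - S(t2)) / (dt * S(t1))         # Equation (7), i.e. Equation (3)
h2  <- -log(S(t2) / S(t1)) / dt               # Equation (8), i.e. Equation (4)
c(AFR = AFR, h1 = h1, h2 = h2)                # h2 equals the AFR; h1 is slightly smaller
```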

As seen from Equation (5), the hazard function h(t), also referred to as the hazard rate at time t, is defined as a conditional density, i.e. the ratio of the probability density f(t) over the reliability 1 - F(t) (a probability), which is not as intuitive to interpret as the concept of failure rate used in data analysis. The direct connection of Equation (4) with the AFR fills this mental gap between probability theory and data analysis.

Theoretically, the difference between formulas (3) and (4) is significant. In data analysis practice, however, the numeric results from the two formulas can be very close. Before we verify this conclusion in the next section, let us examine how different the estimates from Equations (3) and (4) can be. As a standard mathematical result [1], it is known that, if |x| ≤ 2/3, then log(1 + x) = x - x²/2 + θ(x), where |θ(x)| ≤ |x|³. Therefore, it is straightforward to show that if 0 < x ≤ 0.1, the relative difference between -log(1 - x) and x, i.e. [-log(1 - x) - x] / [-log(1 - x)], is less than 6%. (For example, at x = 0.1 we have -log(0.9) ≈ 0.10536, giving a relative difference of about 5.1%.) We are now ready to compare the estimation performance of Equations (3) and (4) to verify the theoretic results obtained so far.

3 Comparison of empirical hazard function formulas using simulation samples

In this section the open source statistical package R [6] is used for the data analysis. A random sample of size n = 10000 is generated from an exponential distribution with rate = 0.1 (using random seed 0 for exact repeatability of the analysis results); a second random sample of size n = 10000 is generated from a Weibull distribution with shape = 1.8 and scale = 30 (random seed = 0). Based on these two simulation samples, the empirical hazard values ĥ1_i of Equation (3) and ĥ2_i of Equation (4) are calculated and compared with the true hazard function values to verify the theoretic results of Section 2.

Figure 1 presents the simulation results comparing the empirical hazard values ĥ1_i and ĥ2_i (vertical bars) against the true hazard function values (circles connected by a fine solid line) for the exponential sample. In calculating ĥ1_i and ĥ2_i, the most important setting is the number of intervals over the full sample data range; specifying the number of intervals is equivalent to specifying the length of Δt. We would therefore expect that the larger the number of intervals, the better the approximation of the ĥ1_i and ĥ2_i values to the true hazard values. In Figure 1, the empirical hazards in the top two panels are calculated using 20 intervals, and in the bottom two panels the number of intervals is 50. The graph shows that ĥ2_i always performs better than ĥ1_i, which consistently underestimates the true hazards; the difference is much more significant when the number of intervals is small. We also notice that ĥ1_i is much more sensitive to the specified number of intervals, while the estimates from ĥ2_i are very robust (i.e. almost unaffected by changing the number of intervals).

With this particular exponential sample, the 99% quantile value is about 45 time units, which covers less than 60% of the full sample data range.
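For concreteness, a minimal R sketch of these calculations follows. It is our reconstruction rather than the authors' script; the helper name emp.hazard and the interval bookkeeping are our own choices.

```r
## Empirical hazards h1.hat (Equation (3)) and h2.hat (Equation (4)) for the
## exponential sample of Section 3.
set.seed(0)
x <- rexp(10000, rate = 0.1)                 # simulated failure times

emp.hazard <- function(x, n.int) {
  breaks <- seq(0, max(x), length.out = n.int + 1)
  dt  <- diff(breaks)[1]                     # interval width (Delta t)
  t.i <- breaks[-length(breaks)]             # left endpoints t_i
  r.i <- sapply(t.i, function(t) sum(x >= t))                  # at risk: r_i = N(t_i)
  d.i <- r.i - sapply(t.i + dt, function(t) sum(x >= t))       # failures d_i in I_i
  data.frame(t  = t.i,
             h1 = (d.i / r.i) / dt,          # Equation (3)
             h2 = -log(1 - d.i / r.i) / dt)  # Equation (4)
}

h20 <- emp.hazard(x, 20)                     # 20 intervals, top panels of Figure 1
h50 <- emp.hazard(x, 50)                     # 50 intervals, bottom panels
colMeans(h20[-nrow(h20), c("h1", "h2")])     # last interval discarded, cf. Table 1
```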

[Figure 1 here: four panels for the exponential(rate = 0.1) sample; x-axis: failure times; y-axis: hazard, 0.00-0.15.]

Figure 1: Empirical hazard function values calculated using ĥ1_i (the top and third panels) and ĥ2_i (the second and bottom panels): circles are the true hazard function values, connected by a fine solid line; vertical bars are the empirical hazard function values.

Note that, for both ĥ1_i and ĥ2_i, the estimates fluctuate wildly after the 99% quantile point because of the sparseness of observations over the upper part of the range. In fact, with a complete sample, ĥ2_i always has an infinitely large value for the last interval, because all items must eventually fail there (d_i = r_i), while ĥ1_i always equals 1/Δt for the last interval. Therefore, the empirical values of the very last interval should not be included. We further propose using only the estimates calculated from sample observations up to the 99% quantile point.

Table 1 presents the numeric averages of ĥ1_i and ĥ2_i under different conditions, compared with the true hazard value.

Table 1: Comparison of calculated empirical hazard values versus the true hazard value (exponential sample).

    True hazard   Average ĥ1_i   Average ĥ2_i   Data set range   Number of intervals
    0.0999        0.08486        0.1072         full range       20
    0.0999        0.0884         0.1003         99% quantile     20
    0.0999        0.0964         0.1039         full range       50
    0.0999        0.09248        0.1006         99% quantile     50

Since this is an exponential sample, which has a constant hazard rate, the true hazard is given in column 1. The conditions under which the empirical hazards are calculated are specified in columns 4 and 5. For example, the first numeric row shows that, with 20 intervals and using the full-range empirical hazard values (values of the last interval discarded), the average of ĥ1_i is 0.08486 and the average of ĥ2_i is 0.1072; this is not a very good estimate of the true hazard value, which is 0.0999. From the numeric results in Table 1 we conclude that (a) the conclusions drawn from Figure 1 are confirmed; and (b) only those empirical hazard estimates calculated up to the 99% quantile point are reliable and robust.

Figure 2 examines the simulation results comparing the empirical hazard values ĥ1_i (top panel) and ĥ2_i (bottom panel) against the true hazard function values for the Weibull sample. Figure 2 follows the same drawing format as Figure 1, i.e. the empirical hazard values ĥ1_i and ĥ2_i are represented by vertical bars against the true hazard function values (circles connected by a fine solid line). The number of intervals is chosen to be 45, i.e. Δt = 2 time units. In addition, approximate 95% confidence bands for the ĥ1_i and ĥ2_i values are constructed using the parametric bootstrap method [3]. Based on the Weibull distribution specification, 500 bootstrap samples (each of n = 10000) are generated, and ĥ1_i and ĥ2_i are calculated for each bootstrap sample. The medians of the empirical hazards are superimposed as a thick solid line (in blue), with dashed lines (in grey) for the lower and upper limits respectively.

Examining the comparison of the empirical hazards ĥ1_i (top panel) and ĥ2_i (bottom panel) against the true hazard values in Figure 2, we find once again what we already found in Figure 1. With the specified Weibull sample, the 99% quantile point is at about 70 time units. The superimposed confidence bands of Figure 2 show visually how much sampling variation there can be over the upper part of the sample data range.
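A sketch of how such parametric bootstrap bands can be produced follows; it is our reconstruction, reusing the interval bookkeeping of the earlier listing with fixed breakpoints (the grid of 45 intervals of width 2 over [0, 90] is an assumption consistent with the description above).

```r
## Parametric bootstrap bands for h2.hat (Equation (4)): 500 replicates,
## each a fresh Weibull(1.8, 30) sample evaluated on a fixed interval grid.
h2.hat <- function(x, breaks) {
  dt  <- diff(breaks)[1]
  t.i <- breaks[-length(breaks)]
  r.i <- sapply(t.i, function(t) sum(x >= t))             # at risk at t_i
  d.i <- r.i - sapply(t.i + dt, function(t) sum(x >= t))  # failures in [t_i, t_i + dt)
  -log(1 - d.i / r.i) / dt                                # Equation (4)
}
breaks <- seq(0, 90, by = 2)                              # 45 intervals, width 2
set.seed(0)
boot <- replicate(500, h2.hat(rweibull(10000, shape = 1.8, scale = 30), breaks))
band <- apply(boot, 1, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE)
## band[2, ] is the median curve; band[1, ] and band[3, ] the approximate 95% limits
```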

[Figure 2 here: two panels for the Weibull(shape = 1.8, scale = 30) complete sample; x-axis: failure times; y-axis: hazard, 0.00-0.30.]

Figure 2: Empirical hazard function values calculated using ĥ1_i (top panel) and ĥ2_i (bottom panel): circles are the true hazard function values, connected by a fine solid line; vertical bars are the empirical hazard function values; the thick blue line shows the medians of the empirical hazard function values calculated from 500 bootstrap samples (each of sample size n = 10000); the two grey dashed lines form the approximate 95% confidence band.

So far, the simulation verification has been done with complete sample data sets. In the next section, we examine how ĥ1_i and ĥ2_i perform with right censored data, based on the same Weibull random sample specified in this section. Furthermore, based on a real life scenario involving a water pipelines data set, we examine the different types of data censoring and how they may affect the calculation of ĥ1_i and ĥ2_i.

4 Empirical hazard function and censored failure time data

Figure 3 examines the simulation results comparing the empirical hazard values ĥ1_i (top panel) and ĥ2_i (bottom panel) against the true hazard function values for a right censored Weibull random sample.

Figure 3 follows the same drawing format as Figure 2 in Section 3.

[Figure 3 here: two panels for the Weibull(shape = 1.8, scale = 30) sample, right censored at 50; x-axis: failure times; y-axis: hazard, 0.00-0.12.]

Figure 3: Empirical hazard function values calculated using ĥ1_i (top panel) and ĥ2_i (bottom panel) with a right censored sample: circles are the true hazard function values, connected by a fine solid line; vertical bars are the empirical hazard function values; the thick blue line shows the medians of the empirical hazard function values calculated from 500 bootstrap samples (each of sample size n = 10000); the two grey dashed lines form the approximate 95% confidence band.

In the analysis of a censored data set, we should distinguish a censored random sample from a truncated sample. For example, in this study we created a right censored sample with the censoring time at 50, as shown in Figure 3: we set any observations greater than 50 to be 50 in the full sample, whereas we would discard any observations greater than 50 if what we were after was a truncated sample.

A close look at Equations (3) and (4) reveals that the calculation of the empirical hazards ĥ_i does not depend on the observations that failed before t_i, and the calculation is not affected by right censoring. Therefore, as expected, Figure 3 is simply the part of Figure 2 with t ≤ 50, and all the relevant conclusions from the examination of Figure 2 in Section 3 still hold. In fact, the estimation of the true hazards with right censored data is in general more reliable than the estimation based on the complete data set, because the wild fluctuation of the empirical hazards over the upper part of the sample range is more or less avoided.
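In R, the distinction amounts to the following (a sketch with our own variable names, using the Section 3 Weibull sample):

```r
## Right censoring caps observations at the censoring time; truncation drops them.
set.seed(0)
x <- rweibull(10000, shape = 1.8, scale = 30)
x.censored  <- pmin(x, 50)   # right censored at 50: values above 50 recorded as 50
x.truncated <- x[x <= 50]    # truncated at 50: values above 50 discarded entirely
```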

Of course, the cost is that we are unable to estimate the true hazards beyond the right censoring time.

[Figure 4 here: schematic timelines of pipeline sections labelled 0-4; x-axis in years, with ticks at 0, 50 and 60.]

Figure 4: A schematic of the data types in a continuous-time failure events sample, for a water pipelines scenario: small vertical bars represent the starting/installation times; circles represent missing records or unknown times (either installation or failure times); crosses represent the failure time records; each horizontal line segment represents one pipeline section with the same asset ID, and its length represents the corresponding time period in years.

The motivation of this study was to justify and verify the proper calculation of empirical hazards in a real life case: the analysis of a water pipelines data set obtained from a water company located in Queensland, Australia [10]. In the raw data treatment stage, we found that the classification of data types did not match the normally defined censoring categories of standard failure time data analysis. We now present our findings on the classification of the water pipelines data; these findings may apply to linear assets in

general. Finally, we give a very brief discussion of how the calculation of ĥ1_i and ĥ2_i may be affected by these different data types.

The data types of the water pipeline assets are shown schematically in Figure 4. The earliest water pipelines were installed about 60 years ago in the region, but the asset management data have been properly recorded only for about 10 years to date. Since linear assets like water pipelines are long-lived, it is not surprising that no failure/repair records are found for the majority of the pipelines over the observation period, i.e. they are right censored. As shown in Figure 4, these right censored observations are labelled 0 at the right end of the horizontal line segments. Observations labelled 1 are pipelines with known installation date and observed failures; observations with unknown installation date but known failure date are labelled 2; observations with both installation date and failure date unknown, but functional over the whole observation period and beyond, are labelled 3; observations with both installation date and failure date unknown, and failed before the observation period, are labelled 4.

Obviously, observations labelled 4 are missing values of which we are not even aware. The existence of this type of missing value will make the calculated empirical hazards overestimate the true hazards, because what we want is the age-specific hazard distribution of the assets. Observations labelled 3 may be treated as right censored data; by doing so, we will underestimate the true hazards. Similarly, we may treat observations labelled 2 as fully observed failure data, and again we will underestimate the true hazards. Overall, it is reasonable to believe that the bias effects caused by data labelled 2, 3, and 4 may cancel each other out to some extent. If we can reasonably assume that the asset management records have been well collected and kept, i.e. that missing values or loss of installation information are not serious, we conclude that ĥ1_i and ĥ2_i are valid estimators of the true hazard function values.

5 Conclusions

In this paper, we have presented theoretic proofs and numeric verification of the proper use of two often-used formulas (reproduced below from Section 1 to refresh our minds) for calculating the empirical hazard function in reliability analysis of complete or censored continuous-time failure data:

    \hat{h}_i = \frac{N(t_i) - N(t_i + \Delta t)}{\Delta t \, N(t_i)} = \frac{1}{\Delta t} \frac{d_i}{r_i} \equiv \hat{h}1_i,

and

    \hat{h}_i = -\frac{1}{\Delta t} \log \frac{N(t_i + \Delta t)}{N(t_i)} = -\frac{1}{\Delta t} \log\left(1 - \frac{d_i}{r_i}\right) \equiv \hat{h}2_i.

Our research shows that ĥ2_i is nothing but a finite approximation of the AFR, whereas ĥ1_i is a finite approximation of the instantaneous hazard rate. In their limiting forms, however, both ĥ1_i and ĥ2_i converge to the true hazard function h_i.

For data analysis purposes, a rule of thumb for calculating the empirical hazard function of continuous-time failure data may be summarised as follows: if the maximum failure rate over the time intervals of concern is less than 0.1, both ĥ1_i and ĥ2_i are good estimators of the true hazard function values, and most asset management reliability study cases should fall into this category; otherwise, ĥ2_i should be used for calculating the empirical hazard function. A compact statement of this rule is sketched below.
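The rule can be written as a one-line helper (our illustration only; d and r are the per-interval failure and at-risk counts defined in Section 1):

```r
## Rule of thumb: if max d_i/r_i <= 0.1, Equations (3) and (4) differ by
## less than about 6%, so either estimator may be used; otherwise use (4).
choose.formula <- function(d, r) {
  if (max(d / r) <= 0.1) "either h1.hat or h2.hat" else "h2.hat (Equation (4))"
}
```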

Note that both formulas are valid for right censored continuous-time failure data. If the data contain left censored, interval censored, or missing value cases, one must be aware of the limitations of these formulas. We also recommend that, when using ĥ1_i and ĥ2_i to estimate the true hazard function values, any empirical hazard values calculated from sample observations beyond the 99% quantile be discarded: as shown in Section 3, such values are inaccurate and fluctuate widely because of the sparse observations over a large interval of the lifetime distribution. The proposed rule of thumb should help fill the gap between probability theory and data analysis practice in applications of the hazard function.

References

[1] Kai Lai Chung and Farid AitSahlia. Elementary Probability Theory: With Stochastic Processes and an Introduction to Mathematical Finance. Springer-Verlag, New York, fourth edition, 2003.

[2] A. C. Davison. Statistical Models. Cambridge University Press, 2003.

[3] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.

[4] E. A. Elsayed. Reliability Engineering. Addison Wesley Longman, Reading, Massachusetts, pp. 378-395, 1996.

[5] William Q. Meeker and Luis A. Escobar. Statistical Methods for Reliability Data. John Wiley & Sons, Inc., 1998.

[6] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[7] B. Rai and N. Singh. Hazard rate estimation from incomplete and unclean warranty data. Reliability Engineering & System Safety, 81:79-92, 2003.

[8] R. Ramakumar. Engineering Reliability: Fundamentals and Applications. Prentice Hall, 1993.

[9] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus. Springer-Verlag New York, Inc., corrected fourth printing, 1996.

[10] Yong Sun, Lin Ma, and Colin Fidge. Reliability prediction of long-lived linear assets with incomplete failure data. In Quality, Reliability, Risk, Maintenance, and Safety Engineering (ICQR2MSE), 2011 International Conference on, Xi'an, IEEE Conference Publications, pages 43-47, 2011.