BAYESIAN MODELING OF DYNAMIC SOFTWARE GROWTH CURVE MODELS

Zhaohui Liu, Nalini Ravishanker, University of Connecticut
Bonnie K. Ray, IBM Watson Research Center
Department of Mathematical Sciences, IBM Watson Research Center, P.O. Box 218, Yorktown Hts., NY 10598
bonnier@us.ibm.com

Key Words: hierarchical Bayes, reliability growth, software engineering

1. Introduction

Models for characterizing the reliability of software have traditionally focused on using the observed number of failures, or the time between observed failures, to estimate how defects will be uncovered over time. (We use the terms failure, fault, and defect interchangeably in this paper, to denote a unique error in the code that causes the software not to function properly.) These models, which are called growth curve (GC) models, are most often used during the latter stages of development, as an aid in determining when the software is ready to be released or to predict the failure rate in the field. The models rely strictly on characterizing the instantaneous rate of failure as a function of the number of failures detected up to time t. Most often, both the failure detection rate and the total number of expected failures are assumed to remain constant over time. Additional information, such as data concerning the operational profile at different testing stages or expert opinion concerning the expected reliability of the current product, is seldom incorporated into the models. In this paper, we develop an extended GC methodology for estimating failure rate that allows for evolution of the growth curve parameters as a function of the dynamic operational profile of the product. The model is fit using hierarchical Bayesian methods, which allow for incorporation of available prior information that may be relevant to estimating the ultimate failure rate. In Section 2, we provide some additional background on the software reliability GC models that form the basis of our extended modeling framework.
Section 3 discusses a Bayesian framework for model estimation. Section 4 gives an illustration using software failure data from the system test stage of two releases of an IBM middleware product for a large operating system. Section 5 concludes.

2. Background and Model Formulation

2.1 Background

A wide variety of software reliability growth models have been postulated in the literature, including those of Jelinski and Moranda (1972), Goel and Okumoto (1979), and Yamada et al. (1983). Many of these models assume that the underlying software failure process can be described using a nonhomogeneous Poisson process (NHPP). If it is assumed that no new defects are introduced at each repair, and that there is a finite, but Poisson distributed, random number N of defects remaining in the software at time t = 0, then the failure times observed up to time t can be taken to be the first n order statistics from N independent and identically distributed (i.i.d.) observations having probability distribution f(t). Different choices of f(·) determine the exact shape of the failure rate function. For example, in the Jelinski-Moranda model, failure times are assumed to follow an exponential distribution having f(t) = β exp(-βt), β, t > 0, which gives rise to an NHPP with mean rate of occurrence up to time t of m(t) = θ[1 - exp(-βt)], where θ denotes the mean of N. This is just the
GC model proposed by Goel and Okumoto (1979). The parameter β can be interpreted as the defect discovery rate, while m'(t) = λ(t) is called the failure rate function. See Musa, Iannino, and Okumoto (1987) and Kuo and Yang (1996) for further discussion of commonly used models for characterizing failure times and/or the number of failures up to time t. The above-mentioned models assume that both θ and β remain constant over the observed time interval. However, many different actions may occur during the testing or release phases of the software that make this assumption questionable. For example, it is typically assumed that code remains frozen during testing, i.e., the number of defects in a system under test does not change. However, code drops, affecting the total lines of code (LOC) in the software, are not uncommon during function testing or even later phases of software development. Thus, the size of the code and, consequently, the number of failures in a large system, can vary widely during testing. Other variables relevant to the observed in-process defect discovery rate are the number of test cases run, the size of the test team, etc. Defect discovery in the field may be affected by such things as the rate at which customers install and exercise the software. If these influences are not incorporated into models used for characterizing reliability, the resulting models are likely to have increased variability and poor predictive performance. A model containing covariate information could also be useful for what-if scenarios, for example to estimate when 90% of defects would be found under various test strategies. The importance of research into new models for software reliability growth that incorporate covariate information was noted in a 1996 National Academy of Science Panel Report on Statistical Methods in Software Engineering.
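To make the constant-parameter model concrete, the Goel-Okumoto mean and failure-rate functions can be sketched in a few lines of code. This is a minimal illustration in Python (the paper's later computations used R), and the parameter values are hypothetical, not taken from the paper:

```python
import math

def go_mean(t, theta, beta):
    """Goel-Okumoto mean number of failures up to time t: m(t) = theta*(1 - exp(-beta*t))."""
    return theta * (1.0 - math.exp(-beta * t))

def go_rate(t, theta, beta):
    """Failure rate lambda(t) = m'(t) = theta*beta*exp(-beta*t)."""
    return theta * beta * math.exp(-beta * t)

# Hypothetical values (not from the paper): theta = 100 expected total
# defects, beta = 0.1 defect discovery rate per week.
print(go_mean(10.0, 100.0, 0.1))   # ~63.2 defects expected by week 10
print(go_rate(0.0, 100.0, 0.1))    # initial failure rate: 10.0 per week
```

Note that λ(t) decreases monotonically toward zero as defects are removed, while m(t) increases toward θ, the expected total number of defects.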
Although dynamic models for software reliability were investigated by Singpurwalla and Soyer (1985), the models proposed there had failure time distributions changing randomly, as opposed to being driven by operational characteristics of the software development process. Incorporation of deterministic or stochastic covariate information into the software reliability modeling framework was mentioned by Singpurwalla and Wilson (1999, Section 7.1), but to the best of our knowledge, there have been no published applications which use this type of information.

2.2 Model Formulation

We concentrate here on failure times assumed to follow a Weibull distribution, f(t) = αβt^(α-1) exp(-βt^α), α, β, t > 0, which reduces to an exponential distribution when α = 1. The Weibull distribution gives rise to an NHPP with mean rate of occurrence up to time t of m(t) = θ[1 - exp(-βt^α)], and has been used to model discovery of defects in the field, where the α parameter was interpreted as being related to the customer usage rate (Kenney, 1993). Here, we call α a measure of the exercise rate of the software. We allow θ, β, and α to change over time following a log-linear model. Let x_{t,i} denote the i-th covariate value at time t, and let k_θ, k_β, and k_α denote the number of covariates used to model each parameter. We have

log(θ_t) = η_0 + Σ_{i=1}^{k_θ} η_i x_{t,i},
log(β_t) = γ_0 + Σ_{i=1}^{k_β} γ_i x_{t,i},
log(α_t) = κ_0 + Σ_{i=1}^{k_α} κ_i x_{t,i}.

Thus the effect of the covariates is to scale the associated parameter by an amount corresponding to the exponentiated linear model. For example, the baseline defect discovery rate exp(γ_0) is scaled by the amount exp(Σ_{i=1}^{k_β} γ_i x_{t,i}). Under the assumption that x_{t_j,i} is constant in the interval (t_j, t_{j+1}), the Weibull distribution with stochastic covariate information gives rise to an NHPP with mean rate of occurrence in the interval (t_j, t_{j+1}) of

m_{j+1} = θ_{t_j} [exp(-β_{t_j} t_j^{α_{t_j}}) - exp(-β_{t_j} t_{j+1}^{α_{t_j}})].   (1)

It is not necessary that the covariates be the same for each parameter.
As discussed above, LOC may be a reasonable covariate for θ, whereas test cases may be a reasonable covariate for β or α. The time intervals (t_j, t_{j+1}) can also be of varying length. In the next section, we describe a Bayesian estimation method for fitting these extended growth curve models.
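The covariate-driven parameterization and the interval mean in (1) can be sketched numerically as follows. This is a Python sketch (the paper's computations used R), and the baseline values, covariate, and coefficient below are hypothetical:

```python
import math

def loglinear(baseline, coefs, x):
    """Exponentiated linear model, e.g. theta_t = exp(eta_0 + sum_i eta_i * x_{t,i})."""
    return math.exp(baseline + sum(c * xi for c, xi in zip(coefs, x)))

def weibull_mean(t, theta, beta, alpha):
    """Weibull NHPP mean number of defects up to time t: theta*(1 - exp(-beta*t^alpha))."""
    return theta * (1.0 - math.exp(-beta * t ** alpha))

def interval_mean(t_j, t_j1, theta_t, beta_t, alpha_t):
    """Expected defect count in (t_j, t_{j+1}], eq. (1):
    theta_{t_j} * [exp(-beta_{t_j}*t_j^alpha) - exp(-beta_{t_j}*t_{j+1}^alpha)]."""
    return theta_t * (math.exp(-beta_t * t_j ** alpha_t)
                      - math.exp(-beta_t * t_j1 ** alpha_t))

# Hypothetical setup: baseline of 50 expected defects, one covariate on theta
# (e.g. a scaled LOC-drop measure) with coefficient 0.2; no covariates on beta.
theta_t = loglinear(math.log(50.0), [0.2], [1.0])   # 50 * exp(0.2)
beta_t = loglinear(math.log(0.05), [], [])          # 0.05, no covariate scaling
alpha_t = 1.0                                       # exponential special case
print(interval_mean(2.0, 3.0, theta_t, beta_t, alpha_t))
```

The interval means telescope: summing interval_mean over adjacent intervals starting from t = 0 recovers the cumulative mean m(t), which is what makes the Poisson count likelihood over disjoint intervals valid.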
3. Bayesian modeling framework

The Bayesian framework provides a way to incorporate, through informative prior distributions on model parameters, reliability information from historical releases, test information for the current release completed prior to the time frame of the model (for example, code and function test defect information; Jeske et al., 2000), and expert opinion (Singpurwalla and Wilson, 1999, Section 4.3). When few failures have been observed, such as might be expected early in system testing (when growth curve models are most useful for planning purposes), maximum likelihood estimates may be difficult to obtain. Additionally, Jeske and Pham (2001) show that standard maximum likelihood estimation (MLE) techniques do not yield asymptotically efficient estimates for the exponential GC models in some instances. The incorporation of strong prior information mitigates these problems, while providing a context for expressing subjective judgement as to how a release is expected to perform. The following subsection gives details of the Bayesian estimation method and discusses the choice of prior distributions.

3.1 Sampling-based Bayesian estimation

The most common format for reporting software defects at IBM is that of recording the defect information, along with the day on which the defect was logged. Usually, more than one defect is found in a single day. This type of data gives rise to defect counts per unit time, as opposed to individual failure times, and necessitates using the NHPP form of the likelihood for model estimation. We formulate our model as a hierarchical Bayesian model and estimate it using sampling-based procedures (Gelfand and Smith, 1990). Unfortunately, the NHPP framework does not result in standard conditional posterior distributions, as were obtained by Kuo and Yang (1996) and Singpurwalla and Wilson (1999, Chapter 4) for failure time data.
Thus we use a Metropolis-Hastings algorithm to sample from those conditional densities having non-standard distributions. Let Y_{t_j} represent the number of defects observed in the interval (t_{j-1}, t_j). In general, given data Y_{t_j}, j = 1, ..., n, along with parameters Ψ = (η, γ, κ), the Bayesian model specification requires a likelihood f(Ỹ_n; Ψ) and a prior π(Ψ). By Bayes' theorem, we then obtain the posterior density as π(Ψ | Ỹ_n) ∝ f(Ỹ_n; Ψ) π(Ψ). The likelihood function for the NHPP with mean function as in (1) is computed as

f(Ỹ_n; Ψ) = Π_{j=1}^{n} m_j^{Y_{t_j}} exp(-m_j) / Y_{t_j}!.   (2)

We assume that the priors for all parameters are independent and Gaussian. The hyperparameters for the distributions of η_0, γ_0, and κ_0 are selected to incorporate prior information provided in the form of an expected mean and variance for the total number of defects, the defect density rate, and the exercise rate. This information can come from experts, e.g., product development managers who may give an expected mean and upper and lower bounds for these parameters, or from estimation results for the constant-parameter Weibull model fitted to defect data from previous releases. The choice of which prior information to use depends on the planned use of the model. For instance, if the model is fit to partially collected in-process data as a way of determining, e.g., when 90% of in-process defects will be found, prior information should be based on complete in-process data for a previous release of the same product or for a product having similar characteristics, such as similar LOC, functionality, operating environment, etc. If the GC model is fit as a way of projecting field defect discovery rates, previous release field data should be used for prior determination (see Jeske et al., 2000).
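The count likelihood in (2) is easiest to work with on the log scale, both for numerical stability and for use inside an MCMC sampler. A minimal Python sketch, with the interval means m_j assumed to come from (1) (the example counts and means below are hypothetical):

```python
import math

def nhpp_loglik(counts, interval_means):
    """Log of the NHPP likelihood (2) for interval defect counts:
    sum_j [ y_j*log(m_j) - m_j - log(y_j!) ], using lgamma(y+1) = log(y!)."""
    ll = 0.0
    for y, m in zip(counts, interval_means):
        ll += y * math.log(m) - m - math.lgamma(y + 1)
    return ll

# Hypothetical weekly defect counts and their model-implied interval means
print(nhpp_loglik([9, 7, 6], [8.5, 7.2, 5.9]))
```

Because the counts over disjoint intervals are independent Poisson variables under the NHPP, the log-likelihood is simply additive over intervals.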
For the remaining parameters, which reflect the dependence on covariate information, we use independent Gaussian priors centered at zero and having large variance relative to the specified priors of the baseline parameters η_0, γ_0, κ_0. This results in non-informative prior information reflecting considerable uncertainty about the effect of the covariates. The rationale is that development managers will rarely be able to provide any reasonable guess as to how these parameters change, and that only a simple model with no covariate information is fit to historical data. In the MCMC framework, extensive posterior and predictive analysis is facilitated through the
use of numerical summary statistics and graphical displays of samples from the joint posterior and predictive distributions. For example, it is straightforward to obtain a credible interval for T_p, the time at which p percent of the remaining defects are expected to be discovered, which is not the case for GC models fit using MLE techniques. In the next section, we illustrate these ideas using system test defect data from an IBM product.

4. Data Example

We use system test data from two releases of an IBM middleware product for a large operating system to illustrate our method. The two releases were of similar size and functionality. Data from the first release represent a complete system test cycle, whereas system test had just begun when the data for the second release were obtained.

4.1 Orthogonal Defect Classification

Almost all of the published literature on GC modeling has focused on characterizing a product's reliability based on all defects considered together, not distinguished by type. However, several papers in the last few years have pointed out the information that can be gained, in terms of understanding the evolving software process, by consideration of the type of defects observed. In particular, IBM uses a scheme called Orthogonal Defect Classification (ODC) to distinguish different types of defects found during the development process. The failure process associated with each of the defect types can provide additional insight into the overall reliability of the product. Chillarege et al. (1992) defined seven standard defect types and established a cause-and-effect relationship between type of defect and reliability growth. To exploit this, Chillarege and Biryani (1994) used separate growth curve models to track defects of different types over time. However, as mentioned in Chillarege et al.
(1992), dependence relationships may exist between defects of different types, the rationale being that certain defects cannot be discovered until defects of a different type are first found. Bhandari et al. (1992) developed a reliability growth model that explicitly incorporates such relationships in the case of two defect types. The primary example given in Bhandari et al. (1992) is that of Assignment/Initialization (AI) defects that must be discovered before certain Checking (CH) defects are found. It is reasonable to believe then that an increase in the discovery rate for CH defects might result as more AI defects are found. Feedback relationships between CH and AI defect discovery may also exist. Although the model of Bhandari et al. (1992) is useful, extension to allow for dependence on more than one defect type is mathematically intractable. The dependence of discovery of defects of type l on an arbitrary number of other defect types can be handled in our extended GC framework by modeling the reliability growth of type l defects using covariates x_{t_j,i} to represent the number of defects of type i found up to time t_j, i = 1, ..., k, i ≠ l. Here, we illustrate the use of such covariates in the extended GC model for AI and CH defects.

[Figure 1: Cumulative Assignment/Initialization and Checking defects per week during system test for the second release of the product.]

Figure 1 shows the cumulative number of AI and CH defects per week for the second release of the project described above. The plot for CH defects suggests that checking defects do tend to be discovered a little later than AI defects in the development process. We compare four different models for characterizing the cumulative growth of AI and CH defects over the second release system test time frame. Model 1 is the simple exponential model. Model 2 is an exponential model in which the defect discovery rate is allowed to vary as a function of the number of defects of the other type discovered in the previous time interval. Model 3 is a simple Weibull model, while Model 4 is a Weibull model allowing for varying defect discovery rates as in Model 2. System test data from the earlier release of the product were used to obtain meaningful priors for θ and γ_0 through fitting of the simple exponential model using maximum likelihood estimation. As discussed in the previous section, the prior distribution for γ_1 is taken to be Normal(0, 1). The prior distribution for α was set to lognormal(1, 1), to indicate a prior belief in a simple exponential model relative to the Weibull model. Results based on other values of the hyperparameters, in particular differing standard deviations, did not significantly alter the results. Normal proposal densities were used to generate samples in the Metropolis-Hastings step. The means and variances of the proposal densities were determined using the last 500 samples from an initial run of 1000 iterations of the modified Gibbs algorithm. The algorithm was then run a second time for 5000 iterations, each time taking 25 replications within each MH step. The last 2500 iterations were used to compute the results. All computations were done using the R programming language, freely available from http://cran.r-project.org/. Table 1 shows the means, standard deviations, and 95% credible sets for the parameters of each model for both AI and CH type defects. We see that Model 2, which allows the defect discovery rate to change as a function of the number of discovered defects of the other type in the previous time period, yields posterior means of γ_1 which are significantly different from zero for CH defects but not for AI defects. This indicates that there is indeed a feedforward relationship between AI and CH defects, as hypothesized, but not vice versa.
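The sampling scheme used here (Normal proposals within a Metropolis-Hastings step) can be sketched in simplified form for the constant-parameter exponential model (Model 1). This is a plain random-walk Metropolis sampler written in Python, not the exact Metropolis-within-Gibbs scheme or the R code used for the paper's results; the counts, interval breakpoints, and prior settings below are hypothetical:

```python
import math
import random

def interval_means(params, breaks):
    """m_j for the constant-parameter exponential GC, with
    theta = exp(eta_0) and beta = exp(gamma_0)."""
    eta0, gamma0 = params
    theta, beta = math.exp(eta0), math.exp(gamma0)
    return [theta * (math.exp(-beta * a) - math.exp(-beta * b))
            for a, b in zip(breaks[:-1], breaks[1:])]

def log_post(params, counts, breaks, prior_mean, prior_sd):
    """Log posterior: NHPP count log-likelihood (2) plus independent
    Gaussian priors on the log-scale parameters eta_0 and gamma_0."""
    lp = sum(-0.5 * ((p - m) / s) ** 2
             for p, m, s in zip(params, prior_mean, prior_sd))
    for y, m in zip(counts, interval_means(params, breaks)):
        lp += y * math.log(m) - m - math.lgamma(y + 1)
    return lp

def metropolis(counts, breaks, prior_mean, prior_sd,
               n_iter=2000, step=0.1, seed=1):
    """Random-walk Metropolis with Gaussian proposals."""
    random.seed(seed)
    cur = list(prior_mean)
    cur_lp = log_post(cur, counts, breaks, prior_mean, prior_sd)
    draws = []
    for _ in range(n_iter):
        prop = [p + random.gauss(0.0, step) for p in cur]
        prop_lp = log_post(prop, counts, breaks, prior_mean, prior_sd)
        if math.log(random.random()) < prop_lp - cur_lp:   # accept/reject
            cur, cur_lp = prop, prop_lp
        draws.append(list(cur))
    return draws

# Hypothetical weekly counts over ten one-week intervals, with priors
# centered at theta = 50 total defects and beta = 0.2 per week.
counts = [9, 7, 6, 5, 4, 3, 3, 2, 2, 1]
breaks = list(range(len(counts) + 1))
draws = metropolis(counts, breaks,
                   prior_mean=[math.log(50.0), math.log(0.2)],
                   prior_sd=[1.0, 1.0])
theta_draws = [math.exp(d[0]) for d in draws[1000:]]   # discard burn-in
print(sum(theta_draws) / len(theta_draws))             # posterior mean of theta
```

Posterior summaries of the kind reported in Table 1 then come from the retained draws: posterior means as sample averages of exp(η_0), exp(γ_0), etc., and 95% credible bounds from the 2.5% and 97.5% sample quantiles.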
This relationship again shows up in Model 4, although not as significantly as for Model 2. Based on the 95% credible sets, the α parameter of Models 3 and 4 does not differ significantly from one, indicating that the exponential model is sufficient for these data. However, the mean α value for the CH Weibull models is larger than one, while for AI defects it is very close to one, indicating that there may be some small exercise rate effect for CH defects. It is also interesting to note that the total expected defects of each type is higher when the defect discovery rate is allowed to change, in this case increasing as more defects of the other type are found. This suggests that models which fail to incorporate the feedback relationship may underestimate the total defects for a software development project.

5. Discussion

We have shown how time-dependent covariate information can easily be incorporated into reliability growth curve modeling using sampling-based Bayesian methods. The Bayesian framework also allows incorporation of useful prior information. Although we have focused on time-dependent covariates here, the method also applies to the case in which failure data from several products are available, along with product-specific information, such as LOC and operating environment, and a common model is fit to the data with parameters varying across products as a function of the covariate information. This is closer to the regression model framework of classical reliability modeling, although the issue of appropriate scaling must be addressed when the defect rates vary dramatically in size.

6. References

1. Chillarege, R., Bhandari, I., Chaar, J., Halliday, M., Moebus, D., Ray, B., and Wong, M. (1992). Orthogonal defect classification - a concept for in-process measurement, IEEE Transactions on Software Engineering, 18, 943-956.

2. Chillarege, R. and Biryani, S. (1994). Identifying risk using ODC based growth curve models,
Proceedings of the Fifth International Symposium on Software Reliability Engineering, 282-288.

3. Gelfand, A.E. and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association, 85, 398-409.
Table 1: Parameter estimates for Exponential and Weibull GC models of AI and CH system test defects. Lower and upper bounds give the 95% credible set.

                              AI                                  CH
                 θ      γ_0     γ_1     α         θ      γ_0     γ_1     α
Model 1
  Mean         19.86   -4.12     -      -       34.89   -3.76     -      -
  Std. dev.     8.76    0.57     -      -       10.25    0.41     -      -
  Lower bound   7.92   -5.27     -      -       19.19   -4.57     -      -
  Upper bound  41.28   -3.02     -      -       58.36   -2.96     -      -
Model 2
  Mean         22.60   -4.42    0.63    -       40.83   -4.11    1.15    -
  Std. dev.     9.45    0.47    0.34    -       11.62    0.36    0.51    -
  Lower bound   9.12   -5.35   -0.13    -       22.75   -4.84    0.13    -
  Upper bound  46.20   -3.51    1.21    -       67.69   -3.43    2.15    -
Model 3
  Mean         18.50   -4.23     -     1.12     28.85   -4.53     -     1.44
  Std. dev.     9.42    0.71     -     0.34      9.29    0.67     -     0.29
  Lower bound   6.40   -5.56     -     0.55     15.69   -5.86     -     0.92
  Upper bound  42.38   -2.89     -     1.86     52.85   -3.28     -     2.05
Model 4
  Mean         22.56   -4.38    0.60   1.00     33.93   -4.57    1.01   1.31
  Std. dev.    11.39    0.55    0.36   0.29     10.84    0.51    0.51   0.23
  Lower bound   7.94   -5.49   -0.18   0.48     17.99   -5.62   -0.02   0.89
  Upper bound  49.02   -3.36    1.23   1.63     59.50   -3.59    1.99   1.82

4. Goel, A. L. and Okumoto, K. (1979). Time-dependent error-detection rate model for software reliability and other performance measures, IEEE Transactions on Reliability, R-28(1), 206-211.

5. Jelinski, Z. and Moranda, P. (1972). Software reliability research, in Statistical Computer Performance Evaluation, ed. W. Freiberger, New York: Academic Press.

6. Jeske, D., Qureshi, M., and Muldoon, E. (2000). A Bayesian methodology for estimating the failure rate of software, International Journal of Reliability, Quality, and Safety Engineering, 7, 153-168.

7. Jeske, D. and Pham, H. (2001). On the maximum likelihood estimates for the Goel-Okumoto software reliability model, The American Statistician, 55, 219-222.

8. Kenney, G. (1993). Estimating defects in commercial software during operational use, IEEE Transactions on Reliability, 42, 107-115.

9. Kuo, L. and Yang, T.Y. (1996). Bayesian computation for nonhomogeneous Poisson processes in software reliability, Journal of the American Statistical Association, 91, 763-773.

10. Musa, J., Iannino, A., and Okumoto, K. (1987). Software Reliability: Measurement, Prediction, Application, McGraw-Hill: New York.

11.
Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods, Springer: New York.

12. Singpurwalla, N. and Soyer, R. (1985). Assessing (software) reliability growth using a random coefficient autoregressive process and its ramifications, IEEE Transactions on Software Engineering, SE-11(12), 1456-1464.

13. Singpurwalla, N. and Wilson, S. (1999). Statistical Methods in Software Engineering: Reliability and Risk, Springer: New York.

14. Yamada, S., Ohba, M., and Osaki, S. (1983). S-shaped reliability growth modeling for software error detection, IEEE Transactions on Reliability, 32, 475-478.