Proceedings of the 2017 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds.

Similar documents
On the estimation of the mean time to failure by simulation

Estimation of the large covariance matrix with two-step monotone missing data

Understanding and Using Availability

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

arxiv: v1 [physics.data-an] 26 Oct 2012

Understanding and Using Availability

Approximating min-max k-clustering

Distributed Rule-Based Inference in the Presence of Redundant Information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Calculation of MTTF values with Markov Models for Safety Instrumented Systems

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

ESTIMATION OF THE RECIPROCAL OF THE MEAN OF THE INVERSE GAUSSIAN DISTRIBUTION WITH PRIOR INFORMATION

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Analysis of some entrance probabilities for killed birth-death processes

Sums of independent random variables

t 0 Xt sup X t p c p inf t 0

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

On Doob s Maximal Inequality for Brownian Motion

Elementary Analysis in Q p

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA

Information collection on a graph

Information collection on a graph

Availability and Maintainability. Piero Baraldi

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression

Probability Estimates for Multi-class Classification by Pairwise Coupling

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Introduction to Probability and Statistics

1 Extremum Estimators

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

E-companion to A risk- and ambiguity-averse extension of the max-min newsvendor order formula

BOUNDS FOR THE COUPLING TIME IN QUEUEING NETWORKS PERFECT SIMULATION

MATH 2710: NOTES FOR ANALYSIS

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

1 Probability Spaces and Random Variables

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data

General Linear Model Introduction, Classes of Linear models and Estimation

Performance of lag length selection criteria in three different situations

On Wald-Type Optimal Stopping for Brownian Motion

ASYMPTOTIC ROBUSTNESS OF ESTIMATORS IN RARE-EVENT SIMULATION. Bruno Tuffin. IRISA/INRIA, Campus de Beaulieu Rennes Cedex, France

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

On Derivative Estimation of the Mean Time to Failure in Simulations of Highly Reliable Markovian Systems

An Ant Colony Optimization Approach to the Probabilistic Traveling Salesman Problem

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)

Supplementary Materials for Robust Estimation of the False Discovery Rate

Statics and dynamics: some elementary concepts

Hotelling s Two- Sample T 2

ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE. 1. Introduction

4. Score normalization technical details We now discuss the technical details of the score normalization method.

Principles of Computed Tomography (CT)

Uniform Law on the Unit Sphere of a Banach Space

START Selected Topics in Assurance

Universal Finite Memory Coding of Binary Sequences

Notes on Instrumental Variables Methods

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

1 Gambler s Ruin Problem

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Age of Information: Whittle Index for Scheduling Stochastic Arrivals

A New Asymmetric Interaction Ridge (AIR) Regression Method

On split sample and randomized confidence intervals for binomial proportions

Feedback-error control

The non-stochastic multi-armed bandit problem

Generalized Coiflets: A New Family of Orthonormal Wavelets

On-Line Appendix. Matching on the Estimated Propensity Score (Abadie and Imbens, 2015)

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

Convex Optimization methods for Computing Channel Capacity

1-way quantum finite automata: strengths, weaknesses and generalizations

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI **

State Estimation with ARMarkov Models

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

PETER J. GRABNER AND ARNOLD KNOPFMACHER

6 Stationary Distributions

Brownian Motion and Random Prime Factorization

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.

Outline. Markov Chains and Markov Models. Outline. Markov Chains. Markov Chains Definitions Huizhen Yu

Some results of convex programming complexity

Inference for Empirical Wasserstein Distances on Finite Spaces: Supplementary Material

Journal of Mathematical Analysis and Applications

Estimation of component redundancy in optimal age maintenance

A MIXED CONTROL CHART ADAPTED TO THE TRUNCATED LIFE TEST BASED ON THE WEIBULL DISTRIBUTION

Chapter 7: Special Distributions

Estimating function analysis for a class of Tweedie regression models

An Outdoor Recreation Use Model with Applications to Evaluating Survey Estimators

On the Chvatál-Complexity of Knapsack Problems

Improved Capacity Bounds for the Binary Energy Harvesting Channel

On the asymptotic sizes of subset Anderson-Rubin and Lagrange multiplier tests in linear instrumental variables regression

Stochastic integration II: the Itô integral

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Multiplicative group law on the folium of Descartes

On the Relationship Between Packet Size and Router Performance for Heavy-Tailed Traffic 1

Haar type and Carleson Constants

Solution: (Course X071570: Stochastic Processes)

Transcription:

Proceedings of the 207 Winter Simulation Conference W. K. V. Chan, A. D Ambrogio, G. Zacharewicz, N. Mustafee, G. Wainer, and E. Page, eds. ON THE ESTIMATION OF THE MEAN TIME TO FAILURE BY SIMULATION Peter W. Glynn Deartment of Management Science and Engineering Stanford University 475 Via Ortega Stanford, CA 94305, USA Marvin K. Nakayama Deartment of Comuter Science New Jersey Institute of Technology Newark, NJ 0702, USA Bruno Tuffin Inria Camus de Beaulieu, 263 Avenue Général Leclerc 35042 Rennes, FRANCE ABSTRACT The mean time to failure MTTF) of a stochastic system is often estimated by simulation. One natural estimator, which we call the direct estimator, simly averages indeendent and identically distributed coies of simulated times to failure. When the system is regenerative, an alternative aroach is based on a ratio reresentation of the MTTF. The urose of this aer is to comare the two estimators. We first analyze them in the setting of crude simulation i.e., no imortance samling), showing that they are actually asymtotically identical in a rare-event context. The two crude estimators are inefficient in different but closely related ways: the direct estimator requires a large comutational time because times to failure often include many transitions, whereas the ratio estimator entails estimating a rare-event robability. We then discuss the two aroaches when emloying imortance samling; for highly reliable Markovian systems, we show that using a ratio estimator is advised. INTRODUCTION Deendability analysis is of rimary imortance in many areas, such as nuclear ower lants, telecommunications, manufacturing, transort systems, and comuter science; for examles, see Heidelberger 995) and Rubino and Tuffin 2009b). Even if system failures are rare, their occurrence may have dramatic consequences and therefore need to be analyzed with care. We focus here on one common deendability metric, the mean time to failure MTTF), which is the exected value of the random time to reach failure. An examle of the tye of system we are considering is one with comonents subject to failures and reairs exonentially distributed over time. Such a system is then reresented by a Markov chain which can in rincile be solved analytically, but for ractical roblems the state sace is usually so large that it would require an enormous comutation time. Monte Carlo simulation then becomes a relevant otion. A crude simulation of the model entails simulating failures and reairs of comonents u to the failure of the whole system. We obtain the direct estimator of the MTTF by reeating the exeriment many indeendent times and averaging the obtained times to system failure. But in the case when individual comonents are highly reliable in the sense that failure rates are much smaller than the reair rates), this often requires a very long comutation time because it involves, with high robability, a large number of transitions before a failure of the system since when comonents are failed, it is more likely to have reairs than other failures. In the literature, another estimator is often instead used. It exloits the regenerative 978--5386-3428-8/7/$3.00 207 IEEE 844

structure of the model and exresses the MTTF as a ratio of quantities over regenerative cycles. Estimating by crude simulation the numerator in this exression is efficient, but it is not the case for the denominator, which is the robability of a rare event. Many rare-event simulation techniques have been develoed to obtain efficient estimators using the ratio exression Heidelberger 995, L Ecuyer and Tuffin 202, Rubino and Tuffin 2009a). The urose of this aer is to review and discuss the relative merits of the two estimators: the direct and a ratio-based one. We highlight the following results we obtain: We first show that crude estimators based on direct simulation of times to failure and on the ratio exression are asymtotically similar in erformance, in rare-event settings. Both estimators are inefficient, suffering from different but closely related issues: the direct estimator requires large comutation times because relications are often very long, whereas the ratio estimator encounters difficulties from estimating a rare-event robability. To analyze the asymtotics as the event of interest becomes rarer, we consider a sequence A b : b ) of failure) sets, with PA b ) 0 as b. We rove that the two estimators are asymtotically equivalent when estimating the exected hitting time to A b as b. Moreover, we rovide numerical results that the same is true for highly reliable Markovian systems, in which the asymtotic regime differs in that the failure set is fixed but the comonent failure rates shrink with the reair rates fixed. Given that crude estimators are equivalent in erformance, we next comare the imortance samling IS) versions of the estimators. We show that in the setting of highly reliable Markovian systems, it is not ossible in full generality to design efficient direct IS estimators, so the ratio-based estimators are then rather advised. The rest of the aer develos as follows. Section 2 resents the two versions of the crude estimators of the MTTF, with their associated central limit theorems CLTs), from which one can construct asymtotically valid confidence intervals. Section 3 comares the two estimators in rare-event settings and shows that they are equivalent in terms of accuracy as the comutational budget grows large. Instead of the well-known IS ratio estimation, Section 4 discusses direct estimators via IS. Based on simle examles, we show that, and exlain why, designing efficient direct estimators is difficult. Section 5 concludes the aer. 2 CRUDE MTTF ESTIMATORS 2. The estimators Let X = Xt) : t 0) be an S-valued non-delayed classically) regenerative rocess Smith 955) with regeneration times 0 = Γ0) < Γ) <. For k, let τk) = Γk) Γk ) be the length of the kth regenerative cycle. Given a set A S, the goal is to comute α = E[T ], where T = inf{t 0 : Xt) A} is the hitting time of A and E[ ] is the exectation oerator. When A corresonds to the set of states for which the simulated system is failed, α reresents the MTTF. We assume throughout that E[τ 2 )] <. Because X is classically regenerative, we have that τk),xγk ) + s) : 0 s < τk)) : k ) is a sequence of indeendent and identically distributed IID) cycles. For real-valued x and y, define x y = minx,y). For k, let Wk) = inf{t 0 : XΓk ) + t) A} be the first hitting to A after regeneration time Γk ). The classical regenerative roerty imlies that τk),wk) τk), Ik)) : k ) is an IID sequence of trilets, where Ik) = I Wk) < τk)) and I ) is the indicator function. Define τ = τ), W = W) = T, and = PT < τ). A roof of the following ratio reresentation for α aears, e.g., in Goyal et al. 992). Proosition If > 0, then α = E[T τ] PT < τ). ) 845

Set N0) = 0, and for j, let N j) = inf{k > N j ) : Ik) = } be the index k of the cycle corresonding to the jth cycle in which A is hit. Let T ) = T, and T j) = inf{t 0 : XΓN j ))+t) A} for j 2. Then α can be estimated either by the direct estimator or via the ratio estimator 2.2 Central Limit Theorems α m) = m m T j) j= α 2 n) = /n)n k= [Wk) τk)] /n) n k= Ik). Let denote weak convergence e.g., Billingsley 999), and let N a,s 2 ) be a normal random variable with mean a and variance s 2. Then the direct estimator α m) satisfies the following CLT. Proosition 2 If > 0, then m /2 [α m) α] σ N 0,) as m, where σ 2 = α 2 + E[T τ)2 ] 2α E[T I T < τ)]. 2) Proof. Because > 0 and E[τ 2 ] <, all the exectations below are finite. Note that σ 2 = E[T 2 ] α 2 because α m) averages IID coies of T. To show that σ 2 satisfies 2), observe that T D = T τ)+i T τ)t, where D = denotes equality in distribution and T D = T is indeendent of T τ,i T τ)). Hence, E[T 2 ] = E[T τ) 2 ] + 2E[T τ)i T τ)]e[t ] + E[I T τ)]e[t 2 ]. Therefore, because = E[I T τ)] and α = E[T ] = E[T τ]/ by ), we get establishing 2). E[T 2 ] = E[T τ)2 ] = E[T τ)2 ] = E[T τ)2 ] E[T τ)i T τ)] + 2α E[T τ] + 2α + 2α 2 2α ) E[T I T < τ)] E[T I T < τ)], On the other hand, the second estimator α 2 n) satisfies the following CLT, which also aears in Goyal et al. 992) but we include its roof here for comleteness. Proosition 3 If > 0, then n /2 [α 2 n) α] σ 2 N 0,) as n, where Proof. Note that σ 2 2 = E[T τ)2 ] 2 E[T I T < τ)] 2α 2 + α2. 3) n /2 [α 2 n) α] = n /2 n k= [Wk) τk)) αik)] n k= Ik)/n. But n n k= Ik) almost surely as n, and n /2 n k= [Wk) τk)) αik)] σ 2N 0,) as n, where σ 2 2 = E[T τ) αi T < τ))2 ] = E[T τ) 2 ] 2αE[T I T < τ)] + α 2 PT < τ), so 3) holds by Slutsky s theorem. 846

3 COMPARISON OF THE CRUDE ESTIMATORS Although the CLTs in Proositions 2 and 3 for α m) and α 2 n), resectively, may aear to be different, they actually are very similar. In fact, they agree at the instants at which the T j) s occur i.e., hitting times of A), which Shahabuddin et al. 988) also note. Proosition 4 For m, we have α 2 Nm)) = α m). Proof. For j, note that T j) = τn j ) + ) + τn j ) + 2) + + τn j) ) +WN j)). We have that τk) < Wk) when N j ) < k < N j), whereas Wk) < τk) for k = N j). Thus, T j) = N j) τk) Wk). Also, observe that m = Nm) k= Ik), so that k=n j )+ α m) = m l= N j) k=n j )+ τk) Wk) Nm) k= Ik) = α 2 Nm)). 3. Equivalence with a decreasing sequence of reachable sets We will now show that these estimators are asymtotically identical in the rare-event setting in which A is a rare set. Consider a sequence A b : b ) of subsets of S for which b PT b < τ) 0 as b, where T b = inf{t 0 : Xt) A b }. For each fixed b, we can then define W b k), I b k), N b j), T b j), etc., analogously to Wk), Ik), N j), T j), etc., but with A b instead of A; e.g., W b k) = inf{t 0 : XΓk ) +t) A b } and I b k) = I W b k) < τi)). Suose that it takes c > 0 units of comuter time to generate c simulated time units of the rocess X, where the comuter time units may differ from the time units of X, e.g., milliseconds vs. days. The number of the T b j) generated in c units of comuter time is then given by β b c) = su{m 0 : m j= T b j) c}, so that the estimator that is analogous to α m)) available after exending c units of comuter time is ˆα,b c) = β b c) β b c) when β b c), and ˆα,b c) = 0 when β b c) = 0. Similarly, let Λ b c) be the number of W b k) τk),i b k)) generated in c units of comuter time, so that Λ b c) = su{l 0 : l k= [W bk) τk)] c}. The estimator that is analogous to α 2 n)) available after c units of comuter time is j= T b j) ˆα 2,b c) = Λ bc) k= W bk) τk) Λ bc) k= I bk) if Λ b c), and ˆα 2,b c) = 0 if Λ b c) = 0. Note that it takes on average roughly E[τ]/ b units of comuter time to observe one visit to A b. So, to hoe for consistency and CLTs, we need a comutational budget t b for which t b b as b. We next consider the relative accuracy of the two estimators ˆα,b ) and ˆα 2,b ), showing that ˆα,b c) and ˆα 2,b c) are asymtotically identical in the regime in which t b / b. Proosition 5 Assume E[τ 3 ] <. If t b b as b, then we have that as b, ) ˆαi,b t b ) tb b E[τ]N 0,), i =,2, and 4) E[T b ] tb b ˆα,b t b ) E[T b ] ˆα 2,bt b ) E[T b ] ) 0. 5) 847

Proof. Glynn, Nakayama, and Tuffin To rove 4) for i =, we will first establish that t tb b b s ) Tb j) b t b b s j= E[T b ] : 0 < s < ) ) Bs) : 0 < s < s 6) as b in D0, ), where B = Bs) : 0 < s < ) is a standard Brownian motion and D0, ) is the sace of right-continuous functions with left limits on 0, ), and then emloy a random-time-change argument. We now show 6) by alying Theorem.4b),. 339, of Ethier and Kurtz 986) after verifying the two sufficient conditions in their.9) and.7). Note that the rocess A b ) of their theorem is just A b s) = t b b s /t b b ))Var[T b /E[T b ]]. But our Proosition 2 imlies that Var[T b ] = E[T b ]) 2 +o)) as b because E[T b I T b < τ)] = o) as b, where we use the notation that a function f b) = ogb)) for another function g if f b)/gb) 0 as b. Consequently, for each s 0, we have that A b s) s as b, establishing the condition.9) of Theorem.4b) of Ethier and Kurtz 986). The other condition,.7), in Theorem.4b) of Ethier and Kurtz 986) requires verifying that t b b E [ ] max V b j) j t b b s 0 7) as b, for each fixed s 0, where the V b j), j, are IID coies of T b /E[T b ]) ) 2. But [ ] E max t b V b j) = ) P max b j t b b s t b V b j) > x dx 8) b 0 j t b b s and Pmax j tb b s V b j) > x)/t b b ) 0 as b for each x 0. Hence, 7) follows from 8) if we can aly the dominated convergence theorem to 8). Note that ) P max t b V b j) > x t b b s ) 2 Tb b j t b b s t b b E[T b ] > x) s x 3/2 E j= PV b j) > x) = t b b s P t b b ) ] 3 [ Tb E[T b ], 9) where the last ste follows from Markov s inequality. Because E[τ 3 ] < by assumtion, we can aly an argument similar to that used to rove Proosition 2 to comute E[Tb 3 ]. This comutation shows that E[T b /E[T b ]) ) 3 ] is bounded as a function of b. The inequality 9) then roves that the integrand of 8) is uniformly dominated by an integrable function of x, so that 7) holds. This roves 6). To enable alying a random-time-change theorem to 6), we next rove that β b t b s) t b b s E[τ] 0) as b. We note that for ε > 0, P β b t b s) t ) b b s + ε) = PT b ) + + T b n b ) t b s), ) E[τ] where n b = t b b s + ε)/e[τ], and observe that n b E[T b ] = t b s + ε) + o)) as b. Chebyshev s inequality then imlies that ) is bounded above by P T b ) + + T b n b ) n b E[T b ] > t b sε + o))) n b Var[ b T b ] t b b sε) 2 + o)) 0 848

as b, because Var[ b T b ] is bounded as a function of b since E[T b /E[T b ]) 2 ] as b ). A similar argument shows that Pβ b t b s) t b b s ε)/e[τ]) 0 as b, for each fixed ε > 0, thus roving 0). In view of 6) and 0), the random-time-change theorem e.g., Theorem 4.4 of Billingsley 999) then yields the conclusion that tb b β b t b s) β b t b s) ) Tb j) j= E[T b ] Bs/E[τ]) s/e[τ] as b. Consequently, 4) for i = follows by setting s =. We next show that 5) holds, which will imly 4) holds for i = 2 by 4) for i = and the convergingtogether lemma. First note that ˆα,b c) ˆα 2,b c) β bc)+ j= T b j)/β b c) and [ β bc)+ j= T b j)/β b c)] ˆα,b c) = T b β b c) + )/β b c), which imlies that 5) holds if tb b max j t b b s T b j) E[T b ] 0 2) as b for s 0. But this follows immediately from 7) and Markov s inequality. 3.2 Equivalence in a Highly Reliable Markovian Systems Setting We next consider a model of highly reliable Markovian systems HRMS) commonly studied in the literature, e.g., see Cancela, Rubino, and Tuffin 2002), Shahabuddin 994a), Shahabuddin 994b) or Shahabuddin et al. 988), among others. Basically the reader is advised to read the above references for more details) the state sace S is decomosed into the set A of failed states and the set of oerational states. Transitions of the Markov chains are reairs and failures of comonents. Failures are assumed to be rare events with resect to reairs, so that a rarity arameter 0 < ε is introduced. Failure transitions are assumed to have a rate Oε), while reair transitions have a rate Θ), where we use the notation that a function f ε) is Ogε)) if f ε)/gε) remains bounded when ε 0, and it is Θgε)) if f ε)/gε) is bounded and also bounded away from 0, when ε 0. The smaller ε is, the smaller the robability to reach A from an initial oerational state. In contrast to Section 3., where b was the rarity arameter and we considered a sequence of sets A b : b ) as b, we now change the rarity arameter to ε and examine the asymtotics as ε 0 for a fixed set A of failed states. Index by ε the robability measure driving the system and denote it P ε. Indexing by ε will allow us to highlight the roerties of estimators when ε gets close to 0. The time to failure that is, to reach a failed state) is denoted T ε, and the direct and ratio estimators can be used for both crude simulation and imortance samling. Here we consider the embedded discrete-time Markov chain DTMC), where the time sent in each state is taken as the exected value of the exonential holding time in the state of the continuous-time chain. This discrete-time conversion reduces the variance of the estimator and simlifies the analysis in next section. But looking at either the continuous or discrete version does not change the conclusions of the comarison between the direct and the ratio-based estimators. As was done in Section 3., we can also examine the asymtotic equivalence of the two estimators as ε 0. Instead, we simly resent numerical results to illustrate this roerty. A numerical comarison. Consider a system with 3 comonent tyes, with n = n 2 = n 3 = 3, where n i is the redundancy of comonent tye i. Each comonent has an exonentially distributed time to failure with rate λ i for comonents of tye i, where λ i = ε, for some arameter ε. Any failed comonent has an exonentially distributed reair time with rate. Times to failure and reair times are all indeendent. The system is down whenever fewer than two comonents of any one tye are oerational. The results with various values of ε and various samle sizes m are rovided in Table for the direct estimator and Table 2 for the regenerative ratio-based) estimator n is then the number of indeendent cycles). The last column dislays the work-normalized variance, defined as the variance of the estimator 849

multilied by the CPU time. It balances the comutational effort and variance and basically reresents the exected variance for a unit of comutational budget. It can be seen that the work-normalized variances are basically the same for each value of ε. Table : Results for the direct DTMC crude estimator. m ε Est. Confidence Interval Variance CPU Work Norm. Var. 0 5 0. 8.755 8.708e+00, 8.802e+00) 5.840e+0 0.7 9.733e-05 0 7 0. 8.769 8.764e+00, 8.774e+00) 5.879e+0 7.7.04e-04 0 5 0.0 5.88e+02 5.782e+02, 5.854e+02) 3.366e+05.33 4.488e+00 0 7 0.0 5.84+02 5.838e+02, 5.845e+02) 3.343e+05 34 4.482e+00 0 5 0.00 5.5925825e+04 5.558e+04, 5.627e+04) 3.26e+09 2.86 4.022e+05 0 7 0.00 5.5844640e+04 5.58e+04, 5.588e+04) 3.7e+09 36.5 4.04e+05 Table 2: Results for the regenerative crude DTMC estimator. n ε Est. Confidence Interval Variance CPU Work Norm. Var. 0 5 0. 8.692 8.595e+00, 8.788e+00) 2.42e+02 0.05.206e-04 0 7 0. 8.772 8.762e+00, 8.782e+00) 2.484e+02 4.283.064e-04 0 5 0.0 5.805e+02 5.558e+02, 6.05e+02).580e+07 0.066 2.633e+00 0 7 0.0 5.82e+02 5.788e+02, 5.837e+02).586e+07 2.97 4.627e+00 0 5 0.00 5.496e+04 4.742e+04, 6.249e+04).478e+2 0.066 2.463e+05 0 7 0.00 5.535e+04 5.459e+04, 5.6e+04).50e+2 2.800 4.227e+05 It is interesting to note that the direct estimator does not encounter rare-event roblems: as ε 0, the relative variance i.e., variance divided by the square of the exected value) is ket bounded in Table. But the comutation CPU) time increases because the number of stes before reaching system failure increases. Table 2 instead shows the oosite for the ratio-based estimator: the comutational time is bounded but we have a rare-event estimation issue. In other words, the comutational time issue is relaced by a rare-event estimation roblem. Comared to comutational time issues, rare-event roblems are more extensively studied and variance-reduction techniques can be alied. 4 IMPORTANCE SAMPLING ESTIMATORS Because the crude estimators are asymtotically equivalent, one may wonder why in the rare-event setting IS techniques have only been develoed for the ratio estimator Heidelberger 995, Rubino and Tuffin 2009b) but not for its direct counterart. To illustrate our arguments and give counter-examles, we introduce the following very simle examle which will be used throughout the section. Examle Consider a system made of a single tye of comonents with two comonents failing with rate ε and a single reairman with reair rate. The state sace and transition-rate diagram of the continuous-time Markov chain are reresented in Figure, where state x means x failed comonents. The system is failed when both comonents are failed state 2). For this examle, the embedded DTMC is described in Figure 2, 2ε 0 2 Figure : HRMS with a single tye of comonents and 2 comonents. with mean sojourn times /2ε) and / + ε) in states 0 and, resectively. A ath to failure is simly ε 850

ε/ + ε) 0 2 / + ε) Figure 2: HRMS with a single tye of comonents and 2 comonents: embedded DTMC. described as a transition from 0 to, followed by n 0 cycles 0, and finally a transition from to 2. Hence, the MTTF can be directly comuted as E ε T ε ) = n + ) 2ε + ) ) n ε + ε + ε + ε = + 3ε 2ε 2. n=0 Note that this exression could also have been found from the system of equations obtained by conditioning on the first move: E ε T ε ) = /2ε) + E ε, T ε ) and E ε, T ε ) = / + ε) + / + ε)e ε T ε ) with E ε, ) the exected value when starting from state instead of state 0), but it is insightful to decomose into the sum over aths with n cycles, whose exected length is n + ) 2ε + ) +ε and whose robability is ) n ε +ε +ε. For the crude Monte Carlo estimator using the embedded DTMC, the second moment is E ε [T ε ) 2 ] = n=0 n + ) 2 2ε + + ε ) 2 ) n ε + ε + ε = 2 + ε) + 3ε)2 4 + ε)ε 4 leading to a variance of +3ε) 2 4. The relative variance is bounded, illustrating again that this estimator +ε)ε 4 does not encounter a rare-event roblem. Let N be the random) number of transitions in a run to which the comutation time is roortional), and its exectation is ) n ε 2 + ε) E ε N) = 2 + 2n) =. + ε + ε ε n=0 The work-normalized relative variance W NRV, defined as the variance multilied by the comutation time and divided by the square of the exected value, is then WNRV = 2 ε. The work-normalized relative variance is a good measure of efficiency of an estimator, since if bounded as ε 0, it means that the comutational budget to ensure a redefined accuracy level is indeendent of the rarity. Here the comutational time to achieve a given accuracy level increases as ε 0, as we had reviously noted. 4. Failure Biasing If we wish to aly IS to the direct estimator, a natural idea is to aly failure biasing similarly to what has been done for the regenerative estimator: from any state excet the initial one with all comonents u, change the robability of making a failure transition to be ρ, indeendent of ε. This will make reaching a failed state more likely because under the original system dynamics, the robability of taking a failure transition is Oε) as failure rates are Oε) but reair rates are Θ). Several imlementations exist, including simle failure biasing SFB) and balanced failure biasing BFB) Shahabuddin 994b). Under SFB, the robability of any failure given that a failure occur is roortional to its original robability, but with BFB, 85

it is uniform; i.e., if f failure transitions are ossible out of a state, then each failure transition is given robability / f. For reairs, the conditional robabilities are always taken roortional to the original ones. Note that when simulating a Markov chain, the number of stes can be large and a change of robability matrix can lead to very oor results if the robabilities are changed too much because the likelihood ratio is subject to large variations. But usually for HRMS the number of stes in direct aths to failure is small otherwise the failure biasing methods would not be efficient), so we could wrongly as we will see) exect failure biasing to be efficient. Examle 2 When alying IS to Examle by changing the robabilities of the Markov chain, the only latitude we have is to change to robabilities from state as described in Figure 3 by using ρ as the robability to reach the failed state from. Failure biasing is then the only ossibility. ρ 0 2 ρ Figure 3: HRMS with a single tye of comonents and 2 comonents: Failure biasing on the embedded DTMC. The robability under IS of the ath to failure with n 0 cycles is ρ) n ρ. Letting Ẽ ε denote the exectation oerator under the IS robability distribution, the second moment of the IS estimator is Ẽ ε [T ε L) 2 ] = E ε [T ε ) 2 L] = n=0 n + ) 2 2ε + + ε For the sum to converge, we need / + ε) 2 ρ)) <, or equivalently, ρ < ) 2 ) ) n 2 ε +ε +ε ρ) n. ρ + ε) 2 = 2ε 3ε2 + oε 2 ). 3) In other words, the failure robability from state cannot be increased too much; otherwise, the likelihood ratio will build u too much and lead to infinite variance. This is not a new story as this tye of issue has already been encountered in the simulation of HRMS by Juneja and Shahabuddin 992) and Juneja and Shahabuddin 200), where deferred reairs are considered so that at some states, failures are not rare something we also have here from state 0). The oint is that failures from intermediate states can have a robability Θ), introducing cycles with large robability, which can cause the variance of the likelihood to otentially exlode. As a remedy, they roose to aly a so-called small failure biasing, assigning a failure robability δ ρ to the whole set of failures from some states. While the issue is the same, here the assumtions are different:. We do not have deferred reairs and the roblem is only at the initial state 0, which is not a roblem with regenerative simulation because the simulation of a cycle stos when we have a return to this state; 2. In Juneja and Shahabuddin 992) and Juneja and Shahabuddin 200), the small robability δ is indeendent of ε, which is not ossible here from what we have seen to avoid an infinite variance. Thus the imlementation of IS to the direct estimator is roblematic since, already on our simle examle, it is not ossible to make unrare the failure transition from state. 852

The average number of transitions under IS is Glynn, Nakayama, and Tuffin Ẽ ε N) = n=0 2 + 2n) ρ) n ρ = 2 ρ. 4) So the average simulation time for a single run will increase to infinity as ε 0 by 3). Remark: This analysis can be related to Glynn 994), where it is shown that we can tyically exect the IS variance to grow exonentially in the deterministic) number of simulated stes. With HRMS, the asymtotic analysis is instead with resect to ε 0 rather than an increasing fixed) number of transitions. Although failure biasing with ρ = Θ) leads to the system failing after usually only a few transitions, 3) stiulates that ρ = Oε) to ensure finite variance, resulting in the random) number of transitions until failure having infinite mean, as seen in 4). But even if the comutation time increases as ε 0, the direct estimator could still be efficient if the relative variance vanishes with ε. To see if that can haen, let us focus on the zero-variance IS scheme. 4.2 Zero-Variance Aroximation In the literature, there indeed exists a zero-variance IS scheme for Markov chains and HRMS, as derived by Awad, Glynn, and Rubinstein 203) and L Ecuyer and Tuffin 202). In order to imlement it, we cannot use the estimator T ε L but rather the still-unbiased Tε IS = τ F j=0 /λy k))l k, where L k is the likelihood ratio for ste 0 to ste k. This tye of estimator is often called a filtered imortance samling estimator Awad, Glynn, and Rubinstein 203, Glasserman 993). For any states x and y, let E ε,y T ε ) be the MTTF starting from y and P x,y ) x,y S the original transition matrix of the DTMC. Then using the IS matrix P x,y = P x,y /λx) + E ε,y T ε ) E ε,x T ε ) yields an estimator with variance zero, which follows from a direct alication of the framework described in Awad, Glynn, and Rubinstein 203), L Ecuyer and Tuffin 2007), L Ecuyer and Tuffin 202). Examle 3 On our examle in Figure 3, we can modify the robabilities from state only, with ρ = ε + ε +ε + 0 = +2ε 2ε 2 2ε 3 + ε) 2 + 2ε). We can see that the robability to reach State 2 directly from is Θε 3 ), so even if the variance is zero, the estimation takes on average longer time, 2 ρ = Θε 3 ), as ε gets closer to zero. If the otimal ρ is not known and an aroximation of order ρ = θε 3 ) is used, we need a relative variance Oε 3 ), that is a variance Oε), to ensure a bounded work-normalized variance. Let us investigate if it is easily attainable. The second moment of the estimator Tε IS with ρ as the robability to go directly from to 2 is Ẽ ε [Tε IS ) 2 n ] = n=0 2ε + k= + ε + ) ) ) k 2 2ε + ε ρ) k + ε/ + ε)/ + ε)) n + ε ρ ρ) n ρ) n ρ. A closed-form exression is easily obtainable, but the derivation is very long and not insightful. However, we make the following observations: 2ε For ρ = 3 +ε) 2 +2ε), we retrieve Ẽ ε [Tε IS ) 2 ] = +3ε) 2 4 ε 4 = Ẽ ε [Tε IS ]) 2, that is, a variance zero. For ρ = ε 3 i.e, an aroximation of the robability of good asymtotic order), the variance is Θε 2 ), two orders of magnitude better than the variance of the crude estimator, but this gain is lost on the comutational time. 853

For ρ = 2ε 3 i.e., the exact first-order term of the zero-variance change of measure), the variance is Θ), which is better but still not sufficient to yield a bounded work-normalized variance. Thus, we see that much better than an exact first-order aroximation of transition robabilities is required. This seems hard to obtain in ractice. With the ratio estimator, a first-order aroximation is used in L Ecuyer and Tuffin 202), which yields bounded normalized variance even a vanishing one), and the estimator does not suffer from an increasing comutational time. 5 CONCLUSIONS In conclusion, we have reviewed and comared two standard estimators of the MTTF for classically) regenerative rocesses: a direct one exressed as the average of simulated times to failure, and a second one making use of the regenerative structure and exressing the MTTF as a ratio of exected values. We have highlighted that. Crude direct and ratio-based estimators are asymtotically equivalent as the robability to reach the secified failure set decreases, with the comutational issue for the direct estimator just being relaced by a rare-event roblem for the ratio estimator. 2. When failures are rare, we may want to aly IS to obtain more efficient estimators, but we have illustrated that for the direct estimator, designing an efficient IS rocedure might be difficult, while many efficient ones exist for the ratio exression. The latter is then advised. ACKNOWLEDGMENTS This work has been suorted in art by the National Science Foundation under Grant No. CMMI-537322. Any oinions, findings, and conclusions or recommendations exressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. REFERENCES Awad, H. P., P. W. Glynn, and R. Y. Rubinstein. 203. Zero-Variance Imortance Samling Estimators for Markov Process Exectations. Mathematics of Oerations Research 38 2): 358 388. Billingsley, P. 999. Convergence of Probability Measures, Volume Second. New York: John Wiley and Sons. Cancela, H., G. Rubino, and B. Tuffin. 2002. MTTF Estimation by Monte Carlo Methods Using Markov Models. Monte Carlo Methods and Alications 8 4): 32 34. Ethier, S., and T. Kurtz. 986. Markov Processes: Characterization and Convergence. New York: Sringer- Verlag. Glasserman, P. 993. Filtered Monte Carlo. Mathematics of Oerations Research 8 3): 60 634. Glynn, P. W. 994. Imortance Samling for Markov Chains: Asymtotics for the Variance. Stochastic Models 0 4): 70 77. Goyal, A., P. Shahabuddin, P. Heidelberger, V. F. Nicola, and P. W. Glynn. 992. A Unified Framework for Simulating Markovian Models of Highly Reliable Systems. IEEE Transactions on Comuters C- 4:36 5. Heidelberger, P. 995. Fast Simulation of Rare Events in Queueing and Reliability Models. ACM Transactions on Modeling and Comuter Simulation 5 ): 43 85. Juneja, S., and P. Shahabuddin. 992. Fast Simulation of Markovian Reliability/Availability Models with General Reair Policies. In Proceedinds of the Twenty-Second International Symosium on Fault- Tolerant Comuting, 50 59: IEEE Comuter Society Press. Juneja, S., and P. Shahabuddin. 200. Fast Simulation of Markov Chains with Small Transition Probabilities. Management Science 47 4): 547 562. 854

L Ecuyer, P., and B. Tuffin. 2007. Effective Aroximation of Zero-Variance Simulation in a Reliability Setting. In Proceedings of the 2007 Euroean Simulation and Modeling Conference, 48 54. Ghent, Belgium: EUROSIS. L Ecuyer, P., and B. Tuffin. 202. Aroximating Zero-Variance Imortance Samling in a Reliability Setting. Annals of Oerations Research 89 ): 277 297. Rubino, G., and B. Tuffin. 2009a. Markovian Models for Deendability Analysis. In Rare Event Simulation using Monte Carlo Methods, edited by G. Rubino and B. Tuffin, 25 44. John Wiley & Sons. Chater 6. Rubino, G., and B. Tuffin. 2009b. Rare Event Simulation using Monte Carlo Methods. Wiley. Shahabuddin, P. 994a. Fast Transient Simulation of Markovian Models of Highly Deendable Systems. Performance Evaluation 20:267 286. Shahabuddin, P. 994b. Imortance Samling for the Simulation of Highly Reliable Markovian Systems. Management Science 40 3): 333 352. Shahabuddin, P., V. F. Nicola, P. Heidelberger, A. Goyal, and P. W. Glynn. 988. Variance Reduction in Mean Time to Failure Simulations. In Proceedings of the 988 Winter Simulation Conference, edited by M. Abrams, P. Haigh, and J. Comfort, 49 499. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, Inc. Smith, W. L. 955. Regenerative Stochastic Processes. Proceedings of the Royal Society: Series A. Mathematical and Physical Sciences 232:6 3. AUTHOR BIOGRAPHIES PETER W. GLYNN is the Thomas Ford Professor in the Deartment of Management Science and Engineering MS&E) at Stanford University. He is a Fellow of INFORMS and of the Institute of Mathematical Statistics, has been co-winner of Best Publication Awards from the INFORMS Simulation Society in 993, 2008, and 206, and was the co-winner of the John von Neumann Theory Prize from INFORMS in 200. In 202, he was elected to the National Academy of Engineering. His research interests lie in stochastic simulation, queueing theory, and statistical inference for stochastic rocesses. His email address glynn@stanford.edu. MARVIN K. NAKAYAMA is a rofessor in the Deartment of Comuter Science at the New Jersey Institute of Technology. He received his Ph.D. in oerations research from Stanford University and a B.A. in mathematics-comuter science from U.C. San Diego. He is a reciient of a CAREER Award from the National Science Foundation, and a aer he co-authored received the Best Theoretical Paer Award for the 204 Winter Simulation Conference. He served as the simulation area editor for the INFORMS Journal on Comuting from 2007 206, and is an associate editor for ACM Transactions on Modeling and Comuter Simulation. His research interests include simulation, modeling, statistics, risk analysis, and energy. His email address is marvin@njit.edu. BRUNO TUFFIN received his PhD degree in alied mathematics from the University of Rennes France) in 997. Since then, he has been with INRIA in Rennes. His research interests include develoing Monte Carlo and quasi-monte Carlo simulation techniques for the erformance evaluation of telecommunication systems and telecommunication-related economical models. He is currently Area Editor for INFORMS Journal on Comuting and Associate Editor for ACM Transactions on Modeling and Comuter Simulation. He has written or co-written three books two devoted to simulation): Rare event simulation using Monte Carlo methods ublished by John Wiley & Sons in 2009, La simulation de Monte Carlo in French), ublished by Hermes Editions in 200, and Telecommunication Network Economics: From Theory to Alications, ublished by Cambridge University Press in 204. His email address is bruno.tuffin@inria.fr. 855