VALIDATION OF TRACE-DRIVEN SIMULATION MODELS: MORE ON BOOTSTRAP TESTS. Russell C.H. Cheng

Proceedings of the 2000 Winter Simulation Conference
J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, eds.

VALIDATION OF TRACE-DRIVEN SIMULATION MODELS: MORE ON BOOTSTRAP TESTS

Jack P.C. Kleijnen
Department of Information Systems (BIK)/Center for Economic Research (CentER)
Tilburg University (KUB)
5000 LE Tilburg, Netherlands

Russell C.H. Cheng
Department of Mathematical Sciences
University of Southampton, Highfield, Southampton SO17 1BJ, UK

Bert Bettonvil
Department of Information Systems (BIK)/Center for Economic Research (CentER)
Tilburg University (KUB)
5000 LE Tilburg, Netherlands

ABSTRACT

Trace-driven or correlated inspection simulation means that the simulated and the real systems have some common inputs (say, arrival times) so the two systems' outputs are cross-correlated. To validate such simulation models, this paper formulates six validation statistics, which are inspired by practice and statistical analysis; for example, the simplest statistic is the difference between the average simulated and real responses. To evaluate these validation statistics, the paper develops novel types of bootstrapping based on subruns. Three basic bootstrap procedures are devised, depending on the number of simulation replicates: one, two, or more replicates. Moreover, for the case of more than two replicates the paper considers conditional versus unconditional resampling. These six validation statistics and four bootstrap procedures are evaluated in extensive Monte Carlo experiments with single-server queueing systems. The main conclusion is that bootstrapping of the simplest validation statistic gives the correct type I error probability, and has relatively high power.

1 INTRODUCTION

Validation has many aspects; for a recent review and references see Kleijnen (1999). In this paper, however, we limit ourselves to statistical testing of the validity of trace-driven simulations.

Consider the following trace-driven simulation; also see Table 1. The simulated and the real systems have some common inputs (say) $A_i$; for example, the same historical sequence of arrival times (we use capital letters for random variables, lower-case letters for realized values, and bold letters for matrices including vectors). The real system generates a time series of outputs $W_{i;t}$, whereas the simulation generates outputs $V_{i;t}$, with $i = 1, \ldots, n$ and $t = 1, 2, \ldots, k$; for example, sojourn time of job $t$ on day $i$. To evaluate the real system, its manager characterizes the output time series by a single performance measure (response) $X_i$; for example, average sojourn time on day $i$. To validate the simulation statistically, this real performance $X_i$ is compared with the simulated performance (say) $Y_i$ - for the same situation (same circumstances, same scenario) characterized by the trace $A_i$.

But how should we compare $X_i$ and $Y_i$? Some solutions are presented in Moors and Strijbosch (1998), but we focus on Kleijnen, Bettonvil, and Van Groenendaal (1998), abbreviated here to KLEIJ. Like KLEIJ we assume that all simulation responses $Y_i$ are identically and independently distributed (iid). More specifically, each subrun starts in the empty state, and stops after a fixed number $k$ of jobs. The real responses $X_i$ are also iid. Unlike KLEIJ we do not assume that $(X_i, Y_i)$ are bivariate normal. Indeed, in case of short subruns (say, $k = 10$) the responses are seriously nonnormal. This nonnormality - together with a small $n$ (number of subruns) - is not well handled by conventional non-bootstrap techniques. (Obviously, trace-driven simulation implies that the two members of the pair $(X_i, Y_i)$ are cross-correlated.)

We suppose that the simulation model has at least one more input variable (e.g., service time) not recorded on the trace, so this input is sampled using a pseudorandom number stream $R$. There are $s$ simulation replications (using the same trace $A_i$), which yield $Y_i^{(r)}$ with $r = 1, \ldots, s$. We distinguish three cases for $s$, namely 1, 2, or more - namely, five or ten.

To solve this problem, we use bootstrapping, which in general samples - randomly with replacement - iid observations; see the seminal book on bootstrapping (outside simulation), Efron and Tibshirani (1993), here abbreviated to EFRON. (Other monographs on bootstrapping are Davison and Hinkley (19 ), Mooney and Duval (1993), and Shao and Tu (1995).)

Table 1: Trace-driven Simulation

Subrun number              1          ...   i          ...   n
Trace:                     A_1        ...   A_i        ...   A_n
Real performance:          X_1        ...   X_i        ...   X_n
Simulated performance:
  replicate 1              Y_1^(1)    ...   Y_i^(1)    ...   Y_n^(1)
  ...
  replicate r              Y_1^(r)    ...   Y_i^(r)    ...   Y_n^(r)
  ...
  replicate s              Y_1^(s)    ...   Y_i^(s)    ...   Y_n^(s)

We wish to test the hypothesis that the simulation model is valid. For hypothesis testing through bootstrapping outside simulation we refer to EFRON and also Shao and Tu (1995, pp. 176, 189). Our main discovery will be: one simulation replicate is certainly a valid model for another simulation replicate. So if $s \geq 2$ we can obtain the bootstrap distribution of any validation statistic under the null-hypothesis of a valid trace-driven simulation model!

Note that - instead of generating responses through bootstrapping - we may generate more simulation responses. In practice, however, replicating a simulation generally requires much more computer time than bootstrapping a simulation. We assume that the number of simulation replicates (symbol $s$) is given, and is small compared with the bootstrap sample size $b$. (Breiman 1992, p. 750 also discusses bootstrapping versus replicating, but not in a simulation context.)

To provide some background of our research, we now summarize the literature on bootstrapping in simulation. Friedman and Friedman (1995) provide two academic examples. Kim, Willemain, Haddock, and Runger (1993) formulate their so-called threshold bootstrap for the analysis of autocorrelated simulation outputs. Several authors investigate bootstrapping of empirical input distributions in simulation: Barton and Schruben (1993), Cheng (1995), Cheng and Holland (1997), and Pritsker (1998). Bootstrapping for validation of metamodels is done by Kleijnen, Feelders, and Cheng (1998). A summary of the present paper is Kleijnen, Cheng, and Bettonvil (2000).

Our main conclusion will be: if a trace-driven simulation model is run more than twice ($s > 2$), then bootstrapping any statistic gives acceptable (albeit conservative) type I error probability; the simplest statistic (the average deviation) has good power compared with the more complicated statistics.

The remainder of this article is organized as follows. §2 summarizes KLEIJ's F-statistic based on regression analysis, and proposes five more validation statistics. §3 recapitulates EFRON's bootstrapping of time series; EFRON uses blocks, which we interpret as terminating subruns. §4 derives three bootstrap procedures for trace-driven simulations, using one, two, or more than two simulation replications per subrun; moreover, in case of more than two replicates the resampling may be either conditional or unconditional. To evaluate these six validation statistics and four bootstrap techniques, §5 designs a Monte Carlo experiment with queueing models that generate 'real' and simulated sojourn times. §6 interprets the results of this extensive Monte Carlo experiment. §7 presents conclusions and topics for future research.

2 SIX TESTS FOR VALIDATION

The bootstrap enables estimating the distribution of any statistic, provided the statistic is a continuous function of the observations (e.g., the median is not a continuous function). For the validation of trace-driven simulations we investigate six statistics, denoted as $T_1$ through $T_6$.

KLEIJ calls a simulation model valid if the real and the simulated systems have (i) identical means (say) $\mu_x = \mu_y$ and (ii) identical variances $\sigma_x^2 = \sigma_y^2$. To test this composite hypothesis, KLEIJ computes the differences $D_i = X_i - Y_i$ and the sums $Q_i = X_i + Y_i$, and regresses $D$ on $Q$:

$$E(D \mid Q = q) = \gamma_0 + \gamma_1 q.$$

The null-hypothesis then becomes $H_0: \gamma_0 = 0$ and $\gamma_1 = 0$. To test this $H_0$, KLEIJ computes the two Sums of Squared Errors or SSEs that correspond with the full and the reduced regression model:

$$SSE_{full} = \sum_i (D_i - \hat{D}_i)^2 \quad \text{with} \quad \hat{D}_i = C_0 + C_1 Q_i,$$

where $C_0$ and $C_1$ are the Ordinary Least Squares (OLS) estimators of $\gamma_0$ and $\gamma_1$; and

$$SSE_{reduced} = \sum_i D_i^2.$$

These two SSEs give the first validation statistic:

$$T_1 = \frac{[SSE_{reduced} - SSE_{full}]/2}{SSE_{full}/(n - 2)}. \qquad (1)$$

If $X_i$ and $Y_i$ are nid (see §1), the statistic in Equation (1) has an F-distribution with 2 and $n - 2$ degrees of freedom (df). If this statistic is significantly high, then KLEIJ concludes that the simulation model is not valid.

We propose another validation statistic with intuitive appeal to simulation practitioners, namely the average absolute prediction error, $T_2 = \sum_i |D_i|/n$ (also see Kleijnen and Sargent, 1999).

A third statistic related to the two preceding statistics is the mean squared deviation (MSE), $T_3 = \sum_i D_i^2/n$.

A fourth statistic is the average deviation, $T_4 = \sum_i D_i/n = \bar{X} - \bar{Y}$. A disadvantage of this statistic is that positive model errors may compensate negative errors, and vice versa. (This phenomenon may be ignored if a wrong simulation model always underestimates - or always overestimates - the real response whatever the trace is; moreover, this statistic allows bootstrapping in case of a single simulation run; see §4.1.)
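As a hedged illustration of the four statistics defined so far, the following Python sketch computes $T_1$ through $T_4$ from paired responses. The function name `validation_statistics` and the sample data are hypothetical and do not come from the paper; the regression of $D$ on $Q$ is fitted with the textbook OLS formulas, and the sketch assumes the $Q_i$ are not all equal and the fit is not perfect, so no denominator is zero.

```python
from statistics import mean

def validation_statistics(x, y):
    """Return (T1, T2, T3, T4) for real responses x and simulated responses y."""
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]  # D_i = X_i - Y_i
    q = [xi + yi for xi, yi in zip(x, y)]  # Q_i = X_i + Y_i
    d_bar, q_bar = mean(d), mean(q)

    # OLS estimates C1 (slope) and C0 (intercept) of the regression of D on Q.
    sqq = sum((qi - q_bar) ** 2 for qi in q)
    c1 = sum((qi - q_bar) * (di - d_bar) for qi, di in zip(q, d)) / sqq
    c0 = d_bar - c1 * q_bar

    sse_full = sum((di - (c0 + c1 * qi)) ** 2 for di, qi in zip(d, q))
    sse_reduced = sum(di ** 2 for di in d)

    t1 = ((sse_reduced - sse_full) / 2) / (sse_full / (n - 2))  # Equation (1)
    t2 = sum(abs(di) for di in d) / n  # average absolute prediction error
    t3 = sse_reduced / n               # mean squared deviation
    t4 = d_bar                         # average deviation
    return t1, t2, t3, t4

# Hypothetical data for n = 5 subruns (illustration only).
x = [2.0, 3.5, 1.0, 4.2, 2.8]
y = [1.5, 3.9, 1.2, 3.0, 2.9]
t1, t2, t3, t4 = validation_statistics(x, y)
```

In this toy data set the positive and negative deviations partly cancel in $T_4$ but not in $T_2$ or $T_3$, which is the compensation effect mentioned above.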

The next statistic is the average relative error, $T_5 = \sum_i (Y_i/X_i)/n$, which is often used in practice. Obviously this statistic assumes that no $X_i$ is zero; actually, the event $X_i = 0$ may occur with non-negligible probability in queueing applications with empty starting states, no excessively saturated traffic rates, and short subruns (see §6).

Finally, $T_6$ compares $\hat{F}_x$ and $\hat{G}_y$, the estimated distribution functions (EDFs) computed from the $n$ observations on $X$ and $Y$ respectively:

$$T_6 = \int_{-\infty}^{\infty} |\hat{F}_x(z) - \hat{G}_y(z)| \, dz. \qquad (2)$$

Note that in Equation (2) we use the $L_1$ norm, not the $L_2$ or the $L_\infty$ norms.

KLEIJ's statistic $T_1$ also tests equality of variances, whereas $T_2$ through $T_5$ consider only equality of means. More criteria or measures for model selection are examined in detail in the monograph by Linhart and Zucchini (1986).

3 EFRON's BOOTSTRAP FOR TIME SERIES

EFRON (p. 91) assumes a sample of $n$ iid observations $Z_i$ with $i = 1, \ldots, n$. (Hence, in our case we define $Z_i = (X_i, Y_i)$; see §1.) EFRON summarizes the sample data through a statistic $T = s(Z_1, \ldots, Z_n)$. (In our case: $T_j = s_j(Z_1, \ldots, Z_n)$ with $j = 1, \ldots, 6$.)

Bootstrapping means that the original values $z_i$ are randomly resampled with replacement, $n$ times. So, if the superscript * indicates bootstrapping, then the bootstrap observations are $Z_i^*$. This bootstrap sample gives one observation on the bootstrap statistic $T^* = s(Z_1^*, \ldots, Z_n^*)$.

To estimate the distribution of this statistic, the whole bootstrap procedure is repeated $b$ times. Sorting these $b$ observations on $T^*$ gives the order statistics $T^*_{(1)}, \ldots, T^*_{(b)}$, and the estimated $\alpha$ quantile of its distribution, $T^*_{(\lceil b\alpha \rceil)}$. This procedure gives a two-sided $1 - \alpha$ confidence interval for the original statistic $T$, ranging from the lower estimated $\alpha/2$ quantile to the upper $1 - \alpha/2$ quantile. (Alternative confidence intervals are discussed in EFRON and Shao and Tu 1995.) This interval can be used for hypothesis testing, as we shall see; also see EFRON (p. 169).

We saw that this bootstrap assumes iid sample observations $Z_i$, but EFRON (pp. 99-102) also presents a bootstrap for time series, called moving blocks (also see Shao and Tu 1995, pp. 387-392, 407-415). In our simulation context we interpret these blocks as subruns. So we have $n$ non-overlapping subruns, each starting in the empty state and each of length $k$; we do not eliminate the transient phase. We shall elaborate our approach in the next section.

4 BOOTSTRAP OF VALIDATION TESTS IN TRACE-DRIVEN SIMULATION

We assume a reasonable number of iid subruns; more specifically, we use the same numbers as KLEIJ (p. 815): $n$ is either 10 or 25. We distinguish three situations for the number of simulation runs, for which we develop different bootstrapping techniques: $s$ is 1, 2, or more.

4.1 A Single Simulation Run: s = 1

By assumption, the $n$ pairs $(X_i, Y_i)$ are mutually independent (as the $n$ subruns are assumed independent). Moreover, these pairs are identically distributed if we do not condition on the trace variable $A_i$; we assumed the latter variable to be iid. So we bootstrap the $n$ original pairs, which gives the $n$ bootstrap pairs $(X_i^*, Y_i^*)$. These bootstrap pairs result in the bootstrap validation statistics $T_1^*$ through $T_6^*$, albeit not necessarily under the null-hypothesis of a valid trace-driven simulation model. We repeat this bootstrapping $b$ times, to obtain an estimated $1 - \alpha$ confidence interval for each validation statistic.

We have an intuitive target or hypothesized value for $T_4$ ($= \sum_i D_i/n$), namely zero. The two-sided confidence interval ranges from the lower $\alpha/2$ quantile to the upper $1 - \alpha/2$ quantile of the bootstrap distribution. If the confidence interval does not cover this target value, then we reject the simulation model.

We follow a similar approach for $T_5$ ($= \sum_i (Y_i/X_i)/n$), but now with a target value of one.

We have no target values for the other four statistics. However, we may compare the first statistic, $T_1$, with the tabulated $1 - \alpha$ quantile of the F-statistic with 2 and $n - 2$ degrees of freedom, $F^{1-\alpha}_{2,\,n-2}$ (no bootstrapping). Moreover, for this statistic we first apply the normalizing logarithmic transformation: replace $x$ by $\log(x)$ and $y$ by $\log(y)$ provided $x$ and $y$ are not zero (also see KLEIJ).

4.2 Only Two Simulation Replicates: s = 2

When $s = 2$ we bootstrap the two replicates of the simulation model: we replace the pair $(X_i, Y_i)$ by $(Y_i^{(1)}, Y_i^{(2)})$. This yields a bootstrap confidence interval per statistic $T$, under the null-hypothesis of a valid trace-driven simulation model. We also have two observations on each original validation statistic under the alternative hypothesis, namely

$$T^{(r)} = s((X_1, Y_1^{(r)}), \ldots, (X_n, Y_n^{(r)})) \quad \text{with } r = 1, 2.$$

We reject the simulation model if any of these two observations on $T$ falls outside the $1 - \alpha/2$ bootstrap confidence interval: we use $\alpha/2$ instead of $\alpha$ because of Bonferroni's inequality (obviously we may also replace 'any' by 'the maximum').
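The procedure for $s = 2$ can be sketched in Python for the statistic $T_4$. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data, and the order-statistic index used for the percentile interval (`k = max(1, int(alpha / 2 * (b + 1)))`) are all hypothetical choices made for illustration.

```python
import random

def t4(pairs):
    """Average deviation: mean of (first - second) over the n pairs."""
    return sum(a - b for a, b in pairs) / len(pairs)

def bootstrap_interval(pairs, stat, b, alpha, rng):
    """Two-sided 1 - alpha percentile interval from b resamples of the pairs."""
    n = len(pairs)
    values = sorted(
        stat([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(b)
    )
    # Assumed index convention: the k-th smallest and k-th largest values.
    k = max(1, int(alpha / 2 * (b + 1)))
    return values[k - 1], values[b - k]

def validate_two_replicates(x, y1, y2, alpha=0.10, b=1000, seed=0):
    """Bootstrap (Y1, Y2) under H0, then test T4 of (X, Y1) and of (X, Y2)."""
    rng = random.Random(seed)
    # Bonferroni: each of the two comparisons uses alpha / 2.
    lo, hi = bootstrap_interval(list(zip(y1, y2)), t4, b, alpha / 2, rng)
    observed = [t4(list(zip(x, y))) for y in (y1, y2)]
    reject = any(t < lo or t > hi for t in observed)
    return reject, (lo, hi), observed

# Hypothetical responses for n = 10 subruns (illustration only).
y1 = [2.1, 3.4, 1.9, 4.0, 2.7, 3.1, 2.2, 3.8, 2.9, 3.3]
y2 = [2.4, 3.0, 2.2, 3.7, 2.5, 3.5, 2.0, 3.6, 3.2, 3.0]
x_bad = [v + 100.0 for v in y1]  # a grossly different "real" system
reject_bad, interval, observed = validate_two_replicates(x_bad, y1, y2)
```

Because the bootstrap pairs consist of two simulation replicates only, the interval describes $T_4$ under a valid model; the real responses enter only through the two observed values that are compared with it.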

4.3 More than Two Simulation Replicates: s > 2

When $s > 2$ we proceed similarly to the case $s = 2$. However, we now distinguish two approaches: (a) condition on the trace; (b) do not condition on the trace.

(a) Conditioning: From each column of Table 1 we sample two observations $Y_i^{(r)}$ and $Y_i^{(r')}$ with $r \neq r'$ (in the original sample the probability of a pair with identical values is zero in case of continuous $X$ and $Y$, so we require $r \neq r'$). From these $n$ bootstrap pairs we compute the validation statistic $T^*$. After $b$ repetitions we compute a $1 - \alpha$ confidence interval for this $T^*$, as in the case $s = 2$.

(b) No conditioning: This approach assumes that the traced variables $A_i$ are iid. So now we resample $n$ pairs from the whole table. More precisely, first we sample one value from the $sn$ values of $Y$; next we sample without replacement a second value from the remaining $sn - 1$ values, giving one bootstrap pair; the next pair is sampled after replacing the preceding pair, etc.

Let us compare approaches (a) and (b), focusing on the simplest validation statistic $T_4$. Then we see that the expected values of all differences between replicated simulation responses are zero, in both approaches. Their variances, however, are smaller in approach (a): blocking is a well-known variance reduction technique in the design of experiments. So we expect conditional resampling to yield more powerful tests (this will turn out to be true: see §6).

Analogous to the $s = 2$ case, we again compare one real response $X_i$ with each of the $s$ simulated responses $Y_i^{(r)}$. We reject the simulation model if any of these $s$ values falls outside the $1 - \alpha/s$ confidence interval (Bonferroni).

4.4 Asymptotic Results: Large n

In the Appendix we derive asymptotic results for the simplest bootstrap validation statistic $T_4^*$ (this statistic will turn out to have the greatest practical relevance; see §6). We can prove that as $n$ tends to infinity, the EDF of $T_4^*$ tends uniformly to the EDF of the original statistic $T_4$, for all four bootstrap methods defined in §4.1 through §4.3. This uniform convergence is important if confidence intervals with the correct coverage are to be constructed. Of course, this convergence is only asymptotic; our Monte Carlo experiments in §6 estimate small-sample performance.

4.5 Minimal Bootstrap Sample Size

A classic value for $b$ is 1,000; see EFRON (p. 275), Andrews and Buchinsky (1996), and also Barton and Schruben (1993) and Shao and Tu (1995, pp. 206-210). We shall use this classic value, but also a much smaller value.

Actually, we are not interested in the whole distribution function (say) $g$ of the bootstrapped statistic $T^*$, but only in its $\alpha/2$ and $1 - \alpha/2$ quantiles (we reject the null-hypothesis if the value of the original statistic $T$ does not fall between these two quantiles). To estimate this distribution function $g$, we sort the $b$ observations on $T^*$, which gives $T^*_{(1)}, \ldots, T^*_{(b)}$. Hence, $g(T^*_{(1)}), \ldots, g(T^*_{(i)}), \ldots, g(T^*_{(b)})$ is an ordered sample from a uniform distribution on [0, 1). The expected value of $g(T^*_{(i)})$ is $i/(b + 1)$. Consequently, if we take the minimal bootstrap sample size, then our estimator of the lower $\alpha/2$ quantile is the smallest order statistic, namely $T^*_{(1)}$. Likewise the largest order statistic $T^*_{(b)}$ estimates the upper $1 - \alpha/2$ quantile. It is easy to prove (set the expected value for $i = 1$, namely $1/(b + 1)$, equal to $\alpha/2$) that the minimum value for $b$ is

$$b_{min} = (2/\alpha) - 1. \qquad (3)$$

For example, $\alpha = 0.1$ gives $b = 19$; we shall use this value (besides the classic value of 1,000; see §5).

However, when we have more than one simulation replicate ($s > 1$), then we apply Bonferroni's inequality so $\alpha$ is replaced by $\alpha/s$. For example, for $\alpha = 0.1$ and $s = 10$ Equation (3) gives 199 (still much smaller than 1,000). Actually, we shall report on $b = 19$ even when $s > 1$: we then avoid Bonferroni's inequality by randomly selecting a single value from the $s$ values for the validation statistic computed from the original (non-bootstrapped) observations on $X_i$ and $Y_i^{(r)}$. We reject the simulation model if this one value lies outside the bootstrap confidence interval.

5 DESIGN OF QUEUEING EXPERIMENTS

For the type I error rate of the validation tests we use an $\alpha$ of 0.01, 0.05, and 0.10 respectively. These values determine which quantiles of the bootstrap distribution should be used as thresholds. (Of course, the higher $\alpha$ is, the higher the power is.) We focus on $\alpha = 0.10$ because it gives the smallest relative variance for our Monte Carlo results (see §6); besides, this value is the only one that we can use for $b = 19$.

Following KLEIJ, we start with M/M/1 simulation models, which generate 'real' and simulated individual sojourn times $W$ and $V$. So these models have Poisson arrival and service parameters (say) $\lambda_a = 1/\mu_a$ and $\lambda_s = 1/\mu_s$ where $\mu_a$ and $\mu_s$ denote the means of the interarrival and service times. We use a tilde to denote a parameter of the simulation model; for example, $\tilde{\lambda}_s$ refers to the simulation model, whereas $\lambda_s$ denotes the real parameter.
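To make the setup of §4.3 and §5 more concrete, the following Python sketch generates trace-driven M/M/1 responses and applies conditional resampling to $T_4$. It is an illustration under assumptions, not the authors' code: sojourn times are computed with the standard Lindley recursion for a FIFO single server, Python's default generator stands in for the generator used in the paper, and all function names, seeds, and the quantile index convention are hypothetical.

```python
import random

def average_sojourn_time(interarrivals, services):
    """Average sojourn time of one FIFO single-server subrun that starts empty.

    interarrivals[t] is the gap between jobs t-1 and t (entry 0 is unused,
    since job 0 finds the system empty); services[t] is job t's service time.
    """
    wait, total = 0.0, 0.0
    for t, service in enumerate(services):
        if t > 0:
            # Lindley recursion: waiting time of job t from that of job t-1.
            wait = max(0.0, wait + services[t - 1] - interarrivals[t])
        total += wait + service
    return total / len(services)

def generate_data(n, k, s, arrival_rate, real_rate, sim_rate, seed=12345):
    """Return real responses x[i] and simulated responses y[r][i]."""
    rng = random.Random(seed)
    x, y = [], [[] for _ in range(s)]
    for _ in range(n):
        # The trace (interarrival times) is shared by real and simulated runs.
        trace = [rng.expovariate(arrival_rate) for _ in range(k)]
        real = [rng.expovariate(real_rate) for _ in range(k)]
        x.append(average_sojourn_time(trace, real))
        for r in range(s):
            # Each replicate draws fresh service times on the same trace.
            sim = [rng.expovariate(sim_rate) for _ in range(k)]
            y[r].append(average_sojourn_time(trace, sim))
    return x, y

def conditional_bootstrap_t4(y, b, rng):
    """Sorted bootstrap values of T4, resampling within each subrun (column)."""
    s, n = len(y), len(y[0])
    values = []
    for _ in range(b):
        total = 0.0
        for i in range(n):
            u, v = rng.sample(range(s), 2)  # two distinct replicates, r != r'
            total += y[u][i] - y[v][i]
        values.append(total / n)
    return sorted(values)

alpha, b = 0.10, 1000
x, y = generate_data(n=10, k=10, s=5, arrival_rate=1.0, real_rate=1.0, sim_rate=1.0)
t4_star = conditional_bootstrap_t4(y, b, random.Random(2000))
# Bonferroni: 1 - alpha/s interval; assumed order-statistic index convention.
k_tail = max(1, int(alpha / len(y) / 2 * (b + 1)))
lo, hi = t4_star[k_tail - 1], t4_star[b - k_tail]
observed = [sum(xi - yi for xi, yi in zip(x, y_r)) / len(x) for y_r in y]
reject = any(t < lo or t > hi for t in observed)
```

Setting `sim_rate` different from `real_rate` gives the unequal traffic rates used to study power; with equal rates, as here, a rejection would be a type I error.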

To study the type I error of the validation tests, we use a simulation model and a real system with equal service rates (arrival times are on the trace, so simulated and real arrival times are the same); hence simulated and real traffic rates are the same: $\tilde{\rho} = \rho$. We use an imperfect simulation model: the real and the simulated service times use different pseudorandom numbers.

We examine the following three factors - following KLEIJ (p. 815) - in a $2^3$ design:
(i) number of jobs per subrun, $k$: 10 and 1,000 (affects the degree of nonnormality);
(ii) number of subruns, $n$: 10 and 25 (affects the convergence of the bootstrap distribution);
(iii) real traffic load, $\rho$: 0.5 and 1.0 (affects the cross-correlation caused by the common trace).

To study the type II error, we use unequal simulated and real rates. For real load $\rho = 0.5$ and number of jobs per subrun $k = 1,000$ we use $\tilde{\rho}$ = 0.46, 0.48, 0.52, and 0.54; for $k = 10$ we use 0.3, 0.4, 0.6, and 0.7. For $\rho = 1$ and $k = 1,000$ we use 0.96, 0.98, 1.02, and 1.04; for $k = 10$ we use 0.8, 0.9, 1.2, and 1.4. (For more extreme values of $\tilde{\rho}$ the estimated power reaches 1.)

Still following KLEIJ (p. 815), we use 1,000 macro-replications; by definition, each macro-replication either rejects or accepts a specific simulation model. (Each macro-replication requires $b$ bootstraps; each bootstrap requires $kn$ observations on the real and the simulated individual outputs.)

Because we use many pseudorandom numbers, we select our generator with some care: we use a generator proposed by L'Ecuyer (1999), called MRG32k3a with a cycle length of the order $2^{191}$. We select seeds randomly.

All six validation tests use the same data $(X_i, Y_i^{(r)})$, which improves the comparison of these tests. The three values for $\alpha$ also give positively correlated results.

To obtain more general results, we extend KLEIJ: we also use M/G/1 simulation models where we let G stand for service times with a gamma distribution (Cheng 1998 gives generators for this distribution family; the exponential distribution belongs to this family). The real system remains M/M/1. We limit the design to a single combination of the three factors: traffic load 1.0, number of jobs per subrun 1,000, number of subruns 10.

Finally, we extend our Monte Carlo study to simulations with other priority rules, namely shortest processing time (SPT) and longest processing time (LPT). We use the same factor combination as for M/G/1.

6 MONTE CARLO RESULTS

Our Monte Carlo experiments with various single-server queues result in estimated type I and II error probabilities of our six validation statistics for five bootstrap procedures.

If this type I error probability equals the prespecified (nominal) value $\alpha$, we call the validation test acceptable:

$$H_0: E(\hat{A}) = \alpha \qquad (4)$$

where $\hat{A}$ denotes the Monte Carlo estimator of that probability - with values $\hat{\alpha}$. If no statistic satisfies this condition, we accept a conservative validation procedure (Bonferroni's inequality implies such conservatism): in Equation (4) we replace $=$ by $\leq$.

Given Equation (4), this error probability has a binomial distribution with variance $\alpha(1 - \alpha)/1000$ (we have 1,000 macro-replications). For example, $\alpha = 0.10$ gives a standard deviation of 0.0095. We use the normal approximation's factor 1.96 (95% confidence interval) to test the significance of the deviation between observed and nominal type I error probability: we reject $H_0$ if $|\hat{\alpha} - \alpha| > 0.0186$. In case of a conservative, one-sided test we accept an $\hat{\alpha}$ smaller than 0.1156; see the results printed in bold in the tables below. (There is no need for multiple comparisons or joint inferences, which might use Bonferroni.)

If several statistics have acceptable type I error probabilities, then we compare their estimated type II error probabilities (power complement).

How to interpret the massive amount of data generated by our Monte Carlo experiments? We think that the primary user question is: which validation statistic should be used, given that it is known how many simulation replicates are available?
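The acceptance thresholds quoted in this section can be reproduced with a few lines of Python. This is only a check of the arithmetic: the two-sided factor 1.96 is stated in the text, whereas the one-sided factor 1.645 is an assumption that happens to reproduce the quoted bound of 0.1156.

```python
import math

alpha = 0.10   # nominal type I error probability
m = 1000       # number of macro-replications

# Standard deviation of the binomial proportion under H0.
sd = math.sqrt(alpha * (1 - alpha) / m)

# Two-sided test: reject H0 if |alpha_hat - alpha| exceeds this half-width.
two_sided_half_width = 1.96 * sd

# One-sided (conservative) test: accept alpha_hat below this bound.
# The factor 1.645 (one-sided 95% normal quantile) is assumed, not quoted.
one_sided_upper = alpha + 1.645 * sd

print(round(sd, 4), round(two_sided_half_width, 4), round(one_sided_upper, 4))
```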
Remember that when $s = 1$ we should bootstrap only those two statistics that have intuitive target values, namely $T_4$ and $T_5$ (for $T_1$ we use the F table). The answer may also depend on other known characteristics of the given simulation, namely the number of iid subruns, $n$. If the simulation represents a queueing system, then another known characteristic might be the number of customers per subrun ($k$), the traffic load ($\rho$), and the queueing discipline (FIFO, LPT, etc.). Some queueing simulations, however, may be much more complicated than the single-server systems that we study, so these characteristics are of secondary interest.

We start our analysis of all these Monte Carlo results by studying $\hat{\alpha}$ (type I error). Though we have $2^3$ combinations of $\rho$, $k$, and $n$ (see §5), we present data only for the high $\rho$ and the low $n$; see Table 2. We do give results for both $k$ values, because this factor may exclude the use of certain validation statistics (namely, $T_5$) and strongly affect nonnormality of the performance measures $X$ and $Y$. Further, for $s = 1$ we also present the statistic $T_1$ as applied by KLEIJ using the F-table (instead of bootstrapping) after the normalizing transformation $\log(X)$ and $\log(Y)$. Finally, for $s > 2$ we may condition on the trace or not, but Table 2 shows results for conditioning only: we found that conditioning does indeed improve the power while maintaining the type I error.

Part A gives results for short subruns ($k = 10$).
Case s = 1: Not applicable (N/A) holds for $T_2$, $T_3$, and $T_6$ because they have no practical thresholds; $T_5$ has a denominator $X_i = 0$ with high probability so it is also N/A. The table look-up of $T_1$ gives a worse error probability than bootstrapping the simple statistic $T_4$; nevertheless, even the latter statistic gives significantly high $\hat{\alpha}$.
Case s = 2: Acceptable - though conservative - results are given by bootstrapping $T_6$.
Case s = 5: Our bootstrapping gives acceptable - but conservative - $\hat{\alpha}$, except for $T_2$ and $T_3$.
Case s = 10: Bootstrapping any statistic gives acceptable $\hat{\alpha}$. This case gives results more conservative than $s = 5$: Bonferroni becomes more conservative as $s$ increases. We can prove that as $s$ increases for fixed $n$, then the EDFs of the original statistic $T$ and the bootstrap statistic $T^*$ converge. Because this proof is rather technical we do not give it here.

Part B gives results for long subruns ($k = 1,000$).
Case s = 1: KLEIJ's procedure gives an acceptable result; in long runs the nonnormality disappears after the log transformation.
Case s = 2: Acceptable but conservative results are again given by bootstrapping $T_6$.
Case s = 5: Bootstrapping the simple statistic $T_4$ gives acceptable $\hat{\alpha}$.
Case s = 10: Our bootstrap gives acceptable - but conservative - $\hat{\alpha}$ for any statistic except $T_2$ and $T_3$.

Altogether Table 2 suggests the following conclusions.
Case s = 1: All validation statistics give observed type I error probabilities significantly higher than the nominal $\alpha$, except for KLEIJ's procedure when long subruns are used.
Case s = 2: Bootstrapping $T_6$ gives best conservative results.
Case s = 5: Bootstrapping the simple statistic $T_4$ gives acceptable $\hat{\alpha}$.
Case s = 10: Bootstrapping any statistic - except for $T_2$ and $T_3$ - gives acceptable $\hat{\alpha}$, albeit rather conservative for short subruns.

The next question is: which of the acceptable validation statistics has the highest power?
Table 3 shows the estimated power for these statistics, for a given combination of $s$ and $k$. We select four simulated traffic rates $\tilde{\rho}$ that differ from the real rate $\rho = 1$ (see the four rows). Obviously, any statistic has more power as the simulated load deviates more from the real load (read within columns). Further, any statistic can detect smaller deviations between real and simulated traffic rates when $k$ is larger (10 versus 1,000). For $s > 2$ the bootstrapped simple statistic $T_4$ has good power compared with the more complicated statistics.

We also obtain results for other systems than M/M/1/FIFO (see §5). However, given the conclusions so far, we focus on $T_4$ when interpreting these results. Then it suffices to state that the above conclusions also hold for these systems!

Table 4 gives estimated type I error probabilities in case of the minimum bootstrap sample size ($b = 19$). These probabilities are similar to Table 2, though less conservative when $k = 10$. Our results (not displayed to save space) further show that the power is smaller than in case of a large bootstrap sample size (for $s > 1$ we use Bonferroni's inequality in Table 3, whereas we now randomly select one of the $s$ values; which confounds the effects of small $b$ and using only one of the $s$ values).

7 CONCLUSIONS AND FUTURE RESEARCH

In general, bootstrapping is a versatile tool, as it allows the estimation of the distribution of any statistic $T(Z)$ for any type of input distribution for $Z$. However, this tool requires mastering the art of modeling: the researchers still have to interpret their problems. Indeed, EFRON (pp. 115, 383) states that bootstrapping is not a uniquely defined concept: alternative bootstrap methods may coexist.

More specifically, for validation in simulation we focused on statistical tests for the validation of trace-driven terminating simulations with iid response $Y$. Given the iid real response $X$, we proposed six validation statistics $T_j(X, Y)$ ($j = 1, \ldots, 6$). The pairs $(X, Y)$ are correlated, and may be non-normally distributed.

We developed different bootstrap methods that vary with the number of simulation replicates (symbol $s$). All these methods use subruns. When we have more than two replicates ($s > 2$), we either condition or we do not condition on the trace.

To evaluate and illustrate the resulting tests, we applied them to single-server queueing simulation models with different priority rules. Whether these Monte Carlo results hold for other applications, requires further research; the current results might be seen as rules of thumb. These rules are as follows.
Case s = 1: Most validation statistics give type I error probabilities higher than the nominal $\alpha$. If a normalizing transformation can be found, then follow KLEIJ; that is, use the F-table without bootstrapping.
Case s = 2: Bootstrapping $T_6$ gives acceptable - but conservative - results.
Case s > 2: Many statistics give acceptable - possibly conservative - $\hat{\alpha}$. So we recommend to run a trace-driven simulation model more than twice. The simplest statistic, namely the average deviation $T_4 = \sum_i D_i/n$, has good power compared with the more complicated statistics.

Further, these conclusions suggest that - for bootstrapped validation - a trace-driven simulation model be run more than twice (using different random numbers).

A surprisingly small bootstrap sample size might suffice to quickly decide on the validity of a simulation model. Then, little extra computer time is needed for bootstrapping. Nevertheless, if the small bootstrap sample results in a borderline value for the validation statistic, then we recommend a larger bootstrap sample - especially since in practice bootstrapping requires far less computer time than simulation does.

In future research we might extend our analysis to other terminating simulations (e.g., queueing networks), and to steady-state and non-stationary simulations. For example, if the trace does not remain stationary over subruns, then we may condition and resample one response from each subrun (column in Table 1; see §4.3). Whereas we use subruns, EFRON uses overlapping blocks; also see Shao and Tu (1995, pp. 391-392). Such a sampling procedure has also been explored in nonterminating, stationary simulation: see Sherman (1995).

We might also study a complication that KLEIJ mentioned but did not solve: a more general null-hypothesis states that the difference between the real and the simulated systems' expected values is smaller than some positive constant $\delta$, not necessarily zero: $|E(X) - E(Y)| < \delta$.

Since bootstrapping uses simulation (Monte Carlo for resampling the original values $z$), typical simulation problems may be further explored in a bootstrapping context. For example,
the determination of the sample size in quantile estimation is a standard problem in simulation; see Alexopoulos and Seila (1998). We add that computer time may be saved by not taking a fixed sample size $b$ for the bootstrap. Instead, we may use Wald's sequential probability ratio test (SPRT); see Ghosh and Sen (1991). Variance reduction techniques may also be applied to bootstrapping. Indeed, Shao and Tu (1995, pp. 221-2228) discuss antithetic and importance sampling in bootstrapping.

We assumed that the number of replicates $s$ is so small that bootstrapping is needed. If, however, (say) $s = 100$, then we can use classic tests such as Student's t test, a distribution-free test (e.g., sign test, rank test), or goodness-of-fit tests (see D'Agostino and Stephens (1986) and Vincent (1998)).

APPENDIX: CONVERGENCE OF EDFs OF $T_4^*$ AND $T_4$ AS n INCREASES

We give a theoretical backing for the conditional sampling bootstrap method described in §4.3: for $T_4$ (the statistic we recommend) we show that $T_4^* - E(T_4^*)$ has the same asymptotic distribution as $T_4 - E(T_4)$, as $n$ tends to infinity.

Conditional sampling is both the most interesting and the most difficult case. Here a bootstrap sample has the form

$$\{Z_i^* = Y_i^{U(i)} - Y_i^{V(i)};\ i = 1, \ldots, n\} \qquad \text{(A-1)}$$

where $(U(i), V(i))$ are iid pairs of random values selected from the $s(s - 1)$ distinct pairs $C = \{(r, r');\ r, r' = 1, \ldots, s,\ r \neq r'\}$, with all pairs being equally likely to be selected. This gives

$$E(Z_i^*) = \frac{1}{s(s - 1)} \sum_{(u, v) \in C} (Y_i^u - Y_i^v), \qquad E(Z_i^{*2}) = \frac{1}{s(s - 1)} \sum_{(u, v) \in C} (Y_i^u - Y_i^v)^2. \qquad \text{(A-2)}$$

Elementary considerations show that $E(T_4^*)$ and $Var(T_4^*)$ are exactly the same as in the unconditional case; moreover with probability 1, $E(T_4^*) \to E(T_4)$ and $Var(T_4^*) \to Var(T_4)$. However the form of the moments in Equation (A-2) shows that the $Z_i^*$ are not identically distributed. Thus we need an additional assumption to guarantee that $T_4^*$ is asymptotically normal.

Theorem: Let $T_4^*$ be calculated from the conditional bootstrap sample in Equation (A-1) where $s > 2$. Let

$$\tau = E_Y[Z_i^* - E(Z_i^*)]^2 < \infty \quad \text{and} \quad \kappa = E_Y|Z_i^* - E(Z_i^*)|^3 < \infty,$$

where the outer expectations are taken with respect to $Y = (Y^{(1)}, \ldots, Y^{(s)})$, the $s$ observations simulated. Further, let

$$c(z) = \Pr[\sqrt{n}\,(T_4 - E(T_4)) \leq z], \qquad c^*(z) = \Pr[\sqrt{n}\,(T_4^* - T_4) \leq z].$$

Then with probability 1 we have

$$\sup_z |c(z) - c^*(z)| \to 0.$$

Proof: Let

$$B_n = \sum_i Var(Z_i^*), \qquad C_n = \sum_i E(|Z_i^* - E(Z_i^*)|^3).$$

Then by the strong law of large numbers

$$n^{1/2} B_n^{-3/2} C_n \to \tau^{-3/2} \kappa$$

with probability 1 as $n \to \infty$. Thus

$$B_n^{-3/2} C_n \to 0 \qquad \text{(A-3)}$$

with probability 1 as $n \to \infty$. It follows by Lyapunov's Theorem (given in e.g. Petrov (1995) as Theorem 4.9) that $T_4^*$ is asymptotically normally distributed with probability 1. With probability 1 we have $E(T_4^*) \to E(T_4)$ and $Var(T_4^*) \to Var(T_4)$ so we can apply Theorem 6.7 in Hjorth (1994) (see also Singh (1981) and Bickel and Freedman (1981)), to show that Equation (A-3) holds.

REFERENCES

Alexopoulos, C. and A.F. Seila 1998, Output data analysis. Handbook of Simulation, edited by Jerry Banks, Wiley, New York.
Andrews, D.W.K. and M. Buchinsky 1996, On the number of bootstrap repetitions for bootstrap standard error estimates. Cowles Foundation Discussion Paper no. 1141, Yale University, P.O. Box 208281, New Haven, Connecticut 06520-8281.
Barton, R.R. and L.W. Schruben 1993, Uniform and bootstrap resampling of empirical distributions. In Proceedings of the 1993 Winter Simulation Conference, 503-508. ed. G.W. Evans et al., IEEE, Piscataway, NJ.
Bickel, P.J. and D.A. Freedman 1981, Some asymptotic theory for the bootstrap. Annals of Statistics, 9, 1196-1197.
Breiman, L. 1992, The little bootstrap and other methods for dimensionality selection in regression: x-fixed prediction error. Journal American Statistical Association, 87, no. 419, pp. 738-754.
Cheng, R.C.H. 1995, Bootstrap methods for computer simulation experiments. Proceedings of the 1995 Winter Simulation Conference, 171-177. ed. C. Alexopoulos, K. Kang, W.R. Lilegdon, and D. Goldsman.
--- 1998, Random variate generation. Handbook of Simulation, edited by J. Banks, Wiley, New York.
--- and W. Holland 1997, Sensitivity of computer simulation experiments to errors in input data. Journal Statistical Computation and Simulation, 57(1-4): 219-241.
D'Agostino, R.D. and H.A. Stephens, editors 1986, Goodness-of-fit distributions. Marcel Dekker, New York.
Davison, A.C. and D.V. Hinkley, Bootstrap methods and their application, CUP.
Efron, B. and R.J. Tibshirani (1993), Introduction to the Bootstrap. Chapman & Hall, New York.
Friedman, L.W. and H.H. Friedman (1995), Analyzing simulation output using the bootstrap method. Simulation, 64(2): 95-100.
Ghosh, B.K. and P.K. Sen (1991), Handbook of Sequential Analysis. Marcel Dekker, New York.
Hjorth, J.S.U. (1994), Computer intensive statistical methods, Chapman & Hall, London.
Kim, Y.B., T.R. Willemain, J. Haddock, and G.C. Runger (1993), The threshold bootstrap: a new approach to simulation output analysis. In Proceedings of the 1993 Winter Simulation Conference, 498-502. ed. G.W. Evans, M. Mollaghasemi, E.C. Russell, and W.E. Biles.
Kleijnen (1999), Validation of models: statistical techniques and data availability. Proceedings of the 1999 Winter Simulation Conference, 647-654 (ed. by P.A. Farrington, H.B. Nembhard, D.T. Sturrock, and G.W. Evans).
---, B. Bettonvil, and W. Van Groenendaal (1998), Validation of trace driven simulation models: a novel regression test. Management Science, 44: 812-819.
---, R.C.H. Cheng, and B. Bettonvil (2000), Validation of trace-driven simulation models: bootstrapped tests. Management Science (under review).
---, A.J. Feelders, and R.C.H. Cheng (1998), Bootstrapping and validation of metamodels in simulation. Proceedings of the 1998 Winter Simulation Conference.
--- and R.G. Sargent (1999), A methodology for fitting and validating metamodels in simulation. European Journal Operational Research (accepted).
L'Ecuyer, P.L. (1999), Good parameter sets for combined multiple recursive random number generators. Operations Research, 47(1).
Linhart, H. and W. Zucchini (1986), Model selection. Wiley, New York.

Mooney, C.Z. and R.D. Duval (1993), Bootstrapping: a nonparametric approach to statistical inference. Sage Publications, Newbury Park, California 91320.
Moors, J.J.A. and L.W.G. Strijbosch (1998), New proposals for the validation of trace-driven simulations. Communications in Statistics: Simulation and Computation, 27(4): 1051-1073.
Petrov, V.V. (1995), Limit theorems of probability theory, Oxford University Press, Oxford.
Pritsker, A.A. (1998), Life & death decisions. OR/MS Today, 25(4): 22-28.
Shao, J. and D. Tu (1995), The jackknife and bootstrap. Springer-Verlag, New York.
Sherman, M. (1995), On batch means in the simulation and statistics communities. In Proceedings of the 1995 Winter Simulation Conference, 297-302, ed. C. Alexopoulos, K. Kang, W.R. Lilegdon, and D. Goldsman.
Singh, K. (1981), On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics, 9: 1187-1195.
Vincent, S. (1998), Input data analysis. Handbook of Simulation, edited by J. Banks, Wiley, New York.

ACKNOWLEDGMENT

Cheng and Kleijnen acknowledge the NATO Collaborative Research Grants Programme's financial support for their project on 'Sensitivity analysis for improved simulation modeling'.

Table 2: Estimated Type I Error Probability of Validation Statistic (T) for Varying Number of Simulation Replicates (s) of M/M/1/FIFO with Number of Customers per Subrun k, Traffic Rate ρ = 1; Number of Subruns n = 10; Nominal α = 0.10; Bootstrap Sample Size b = 1,000; Bold Numbers Denote Acceptable Results

1) F-table used (instead of bootstrap) after normalizing transformation log(X) and log(Y)

(A) Number of Customers per Subrun k = 10

s     T1         T2      T3      T4      T5      T6
1     .212 1)    N/A     N/A     .174    N/A     N/A
2     .021       .142    .180    .172    .180    .044
5     .055       .127    .142    .063    .046    .068
10    .024       .046    .059    .028    .023    .033

(B) Number of Customers per Subrun k = 1,000

s     T1         T2      T3      T4      T5      T6
1     .098 1)    N/A     N/A     .167    .235    N/A
2     .027       .196    .265    .364    .358    .050
5     .124       .252    .301    .107    .118    .122
10    .096       .126    .146    .088    .095    .080

Table 3: Estimated Power of Acceptable Statistics T for Varying Simulated Traffic Rates ρ and Fixed Real Traffic Rate ρ = 1 (for Remaining Symbols See Table 2)

s = 1; k = 1,000

ρ       T1
.96     .622
.98     .276
1.02    .264
1.04    .618

s = 2; k = 10

ρ       T1      T6
.8      .098    .401
.9      .039    .161
1.2     .172    .045
1.4     .428    .272

s = 2; k = 1,000

ρ       T1      T6
.96     .249    .661
.98     .086    .265
1.02    .088    .098
1.04    .250    .419

s = 5; k = 10

ρ       T1      T4      T5      T6
.8      .186    .444    .148    .453
.9      .068    .204    .072    .219
1.2     .220    .142    .175    .108
1.4     .490    .434    .511    .358

s = 5; k = 1,000

ρ       T4
.96     .874
.98     .434
1.02    .335
1.04    .782

s = 10; k = 10

ρ       T1      T2      T3      T4      T5      T6
.8      .098    .377    .350    .394    .088    .401
.9      .039    .165    .185    .149    .025    .161
1.2     .172    0       .003    .072    .119    .045
1.4     .428    .002    .002    .353    .424    .272

s = 10; k = 1,000

ρ       T1      T4      T5      T6
.96     .534    .874    .873    .865
.98     .239    .404    .415    .391
1.02    .236    .338    .350    .299
1.04    .565    .808    .831    .782

Table 4: Estimated Type I Error Probability Using Small Bootstrap Sample Size b = 19 (Remaining Symbols Defined in Table 2)

k = 10

s     T1         T2      T3      T4      T5      T6
1     .212 1)    N/A     N/A     .160    N/A     N/A
2     .046       .198    .256    .179    .245    .061
5     .118       .185    .210    .118    .173    .139
10    .100       .137    .131    .112    .115    .105

k = 1,000

s     T1         T2      T3      T4      T5      T6
1     .098 1)    N/A     N/A     .150    .210    N/A
2     .034       .183    .223    .160    .172    .058
5     .121       .178    .199    .120    .126    .124
10    .096       .121    .132    .113    .108    .118

AUTHOR BIOGRAPHIES

JACK P.C. KLEIJNEN is a Professor of Simulation and Information Systems. His research concerns simulation, mathematical statistics, information systems, and logistics; this research resulted in six books and nearly 160 articles. He has been a consultant for several organizations in the USA and Europe, and has served on many international editorial boards and scientific committees. He spent several years in the USA, at both universities and companies, and received a number of international fellowships and awards. More information is provided on his web page: <http://center.kub.nl/staff/kleijnen>

RUSSELL C.H. CHENG is Professor of Operational Research at the University of Southampton. He has an M.A. and the Diploma in Mathematical Statistics from Cambridge University, England. He obtained his Ph.D. from Bath University. He is Chairman of the U.K. Simulation Society, a Fellow of the Royal Statistical Society, and a Member of the Operational Research Society. His research interests include variance reduction methods and parametric estimation methods. He is Joint Editor of the IMA Journal on Mathematics Applied to Business and Industry.

BERT BETTONVIL is Associate Professor at the Department of Information Systems of Tilburg University. Educated as a mathematical statistician, his research interests are in the field of statistical aspects of simulation and in research methodology.