Inferential Statistics and Methods for Tuning and Configuring

Size: px

Start display at page:

Download "Inferential Statistics and Methods for Tuning and Configuring"

Brent Allison
5 years ago
Views:

DM63 HEURISTICS FOR COMBINATORIAL OPTIMIZATION Lecture 13 and Methods for Tuning and Configuring Marco Chiarandini 1 Summar 2 3 Eperimental Designs 4 Sequential Testing (Racing) Summar 1 Summar 2 3

1 DM63 HEURISTICS FOR COMBINATORIAL OPTIMIZATION Lecture 13 and Methods for Tuning and Configuring Marco Chiarandini 1 Summar 2 3 Eperimental Designs 4 Sequential Testing (Racing) Summar 1 Summar 2 3 Eperimental Designs 1 Algorithm Engineering and Eperimental Analsis Definitions Performance Measures 2 Eplorator Data Analsis Representation of Sampled Data Regression Analsis 4 Sequential Testing (Racing) Algorithm Engineering and Eperimental Algoritmics Summar Single-pass heuristics Asmptotic heuristics: Two approaches: 1 Univariate 1a Time as an eternal parameter decided a priori 1b Solution qualit as an eternal parameter decided a priori 2 Cost dependent on running time: 1 Summar 2 We work with samples (instances, solution qualit) But we want sound conclusions: generalization over a given population (all possible instances) Thus we need statistical inference 3 Eperimental Designs 4 Sequential Testing (Racing)

2 We work with samples (instances, solution qualit) But we want sound conclusions: generalization over a given population (all possible instances) Thus we need statistical inference We work with samples (instances, solution qualit) But we want sound conclusions: generalization over a given population (all possible instances) Thus we need statistical inference Random Sample X n Statistical Estimator Inference Population P(, θ) Parameter θ Random Sample X n Statistical Estimator Inference Population P(, θ) Parameter θ Since the analsis is based on finite-sized sampled data, statements like the cost of solutions returned b algorithm A is smaller than that of algorithm B must be completed b at a level of significance of 5% A Motivating Eample A Motivating Eample There is a competition and two algorithms A 1 and A 2 are submitted We run both algorithms once on n instances On each instance either A 1 wins (+) or A 2 wins (-) or the make a tie (=) Questions: 1 If we have onl 10 instances and algorithm A 1 wins 7 times how confident are we in claiming that algorithm A 1 is the best? 2 How man instances and how man wins should we observe to gain a confidence of 95% that the algorithm A 1 is the best? A Motivating Eample A Motivating Eample Y b(n, p) : Pr[Y = ] = ( ) n p (1 p) n Y b(n, p) : Pr[Y = ] = ( ) n p (1 p) n Under the conditions of question 1, we can check how unlikel the situation is if it were p(+) p( ) If p = 05 then the chance that algorithm A 1 wins 7 or more times out of 10 is 172%: quite high! b(10,05) A Motivating Eample Y b(n, p) : Pr[Y = ] = Under the conditions of question 1, we can check how unlikel the situation is if it were p(+) p( ) If p = 05 then the chance that algorithm A 1 wins 7 or more times out of 10 is 172%: quite high! ( ) n p (1 p) n To answer question 2 we compute the 95% quantile, ie, : Pr[Y ] < 005 with p = 05 at different values of n: n General procedure: Assume that data are consistent with a null hpothesis H 0 (eg, sample data are drawn from distributions with the same mean value) Use a statistical test to compute how likel this is to be true, given the data collected This likel is quantified as the p-value Accept H 0 as true if the p-value is larger than an user defined threshold called level of significance α Alternativel (p-value < α), H 0 is rejected in favor of an alternative hpothesis, H 1, at a level of significance of α n

3 Two kinds of errors ma be committed when testing hpothesis: General rule: α = P(tpe I error) = P(reject H 0 H 0 is true) β = P(tpe II error) = P(fail to reject H 0 H 0 is false) 1 specif the tpe I error or level of significance α 2 seek the test with a suitable large statistical power, ie, 1 β = P(reject H 0 H 0 is false) Theorem: Central Limit Theorem If X n is a random sample from an arbitrar distribution with mean µ and variance σ then the average X n is asmptoticall normall distributed, ie, X n N(µ, σ2 n ) or z = X n µ σ/ N(0, 1) n Consequences: allows inference from a sample allows to model errors in measurements: X = µ + ɛ Issues: n should be enough large µ and σ must be known Weibull distribution Inference: Hpothesis Testing and Confidence Intervals dweibull(, shape = 14) z = X µ σ/ n A test of hpothesis determines how likel an sampled estimate ^θ is to occur under some assumptions on the parameter θ of the population σ Pr µ z 1 n X σ } µ+z 2 n = 1 α µ X1 X2 X3 Samples of size 1, 5, 15, 50 repeated 100 times n=1 n=5 n= n=50 A confidence interval contains all those values that a parameter θ is likel to assume with probabilit 1 α: Pr(^θ 1 < θ < ^θ 2 ) = 1 α σ Pr X z1 n µ X+z σ } 2 n = 1 α µ X1 X2 X3 The Procedure of Test of Hpothesis The Confidence Intervals Procedure µ 1 µ 2 θ 1 Specif the parameter θ and the test hpothesis, 2 Obtain P(θ θ = 0), the null distribution of θ 3 Compare ^θ with the α/2-quantiles (for two-sided tests) of P(θ θ = 0) and reject or not H 0 according to whether ^θ is larger or smaller than this value µ 1 µ 2 θ θ = 0 1 Specif the parameter θ and the test hpothesis, 2 Obtain P(θ, θ = 0), the null distribution of θ in correspondence of the observed estimate ^θ of the sample X 3 Determine (^θ, ^θ + ) such that Pr^θ θ ^θ + } = 1 α 4 Do not reject H 0 if θ = 0 falls inside the interval (^θ, ^θ + ) Otherwise reject H 0 The Confidence Intervals Procedure The Confidence Intervals Procedure N(µ 1, σ) N(µ 2, σ) ( X 1, S X1 ) ( X 2, S X2 ) θ = 0 T = ( X 1 X 2 ) `µ 1 µ r 2 SX1 S X2 r T Student s t Distribution 1 Specif the parameter θ and the test hpothesis, 2 Obtain P(θ, θ = 0), the null distribution of θ in correspondence of the observed estimate ^θ of the sample X 3 Determine (^θ, ^θ + ) such that Pr^θ θ ^θ + } = 1 α 4 Do not reject H 0 if θ = 0 falls inside the interval (^θ, ^θ + ) Otherwise reject H 0 P(θ 1 ) P(θ 2 ) θ = X 1 X 2 θ = 0 1 Specif the parameter θ and the test hpothesis, 2 Obtain P(θ, θ = 0), the null distribution of θ in correspondence of the observed estimate ^θ of the sample X 3 Determine (^θ, ^θ + ) such that Pr^θ θ ^θ + } = 1 α 4 Do not reject H 0 if θ = 0 falls inside the interval (^θ, ^θ + ) Otherwise reject H 0

4 Kolmogorov-Smirnov Tests Parametric vs Nonparametric The test compares empirical cumulative distribution functions F() F() It uses maimal difference between the two curves, sup F 1 () F 2 (), and assesses how likel this value is under the null hpothesis that the two curves come from the same data The test can be used as a two-samples or single-sample test (in this case to test against theoretical distributions: goodness of fit) R functions: F() Preparation of the Eperiments 2 Parametric assumptions: independence homoschedasticit normalit N(µ, σ) Nonparametric assumptions: independence homoschedasticit P(θ) tests Permutation tests Eact Conditional Monte Carlo The test can be done in R with kstest Variance reduction techniques Same pseudo random seed 1 Summar Sample Sizes If the sample size is large enough (infinit) an difference in the means of the factors, no matter how small, will be significant Real vs Statistical significance Stud factors until the improvement in the response variable is deemed small Desired statistical power + practical precision sample size Note: If resources available for N runs then the optimal design is one run on N instances [?] 2 3 Eperimental Designs 4 Sequential Testing (Racing) The Design of Eperiments for Algorithms Eperimental Design Statement of the objectives of the eperiment Comparison of different algorithms Impact of algorithm components How instance features affect the algorithms Identification of the sources of variance Treatment factors (qualitative and quantitative) Controllable nuisance factors blocking Uncontrollable nuisance factors measuring Definition of factor combinations to test Easiest design: Unreplicated or Replicated Full Factorial Design Running a pilot eperiment and refine the design Bugs and no eternal biases Ceiling or floor effects Rescaling levels of quantitative factors Detect the number of eperiments needed to obtained the desired power Algorithms Treatment Factor; Instances Blocking Factor Design A: One run on various instances (Unreplicated Factorial) Algorithm 1 Algorithm 2 Algorithm k Instance 1 X 11 X 12 X 1k Instance b X b1 X b2 X bk Design B: Several runs on various instances (Replicated Factorial) Algorithm 1 Algorithm 2 Algorithm k Instance 1 X 111,, X 11r X 121,, X 12r X 1k1,, X 1kr Instance 2 X 211,, X 21r X 221,, X 22r X 2k1,, X 2kr Instance b X b11,, X b1r X b21,, X b2r X bk1,, X bkr Multiple Comparisons Univariate Analsis: On a single instance H 0 : µ 1 = µ 2 = µ 3 = H 1 : at least one differs} Appling a statistical test to all pairs the error of Tpe I is not α but higher: α EX = 1 (1 α) c Eg, for α = 005 and c = 3 α EX = 014! Adjustment methods Protected versions: global test + no adjustments Bonferroni α = α EX /c (conservative) Tuke Honest Significance Method (for parametric analsis) Holm (step-wise) Other step procedures Post-hoc analsis: Once the effect of factors has been recognized a finer grained analsis is performed to distinguish where important differences are Global tests Parametric KS tpe Replicated F-test Kruskall-Wallis Test Pooled Permutations Birnbaum-Hall test

5 Univariate Analsis: On a single instance Univariate Analsis: On a Class of Instances Pairwise tests Replicated Parametric KS tpe t-test Tuke HSD Kruskall-Wallis Test or Mann-Whitne test Wilcoon Rank Sum Test Pooled Permutations Birnbaum-Hall test Global tests Unreplicated Replicated Parametric F-test F-test Simple Permutations Snchronized Permutations Matched pairs versions: when, when not t-test with different variances Univariate Analsis: On a Class of Instances An Eample Pairwise tests Unreplicated Replicated Parametric t-test Tuke HSD or Wilcoon Signed Rank Test Simple Permutations Matched pairs versions: when, when not t-test Welch variant: no assumption of equal variances t-test Tuke HSD Snchronized Permutations SLS algorithms for Graph Coloring: Results collected on a set of benchmark instances Instance TSN1 Instance Succ k Succ k Succ k Succ k Succ k flat flat flat flat flat flat GLS SAN2 TSN3 Instance Succ k Succ k Succ k Succ k flat flat flat flat flat flat An Eample Raw data on the instances: flat300_20_ flat1000_50_0 flat300_26_ flat1000_60_0 flat300_28_ flat1000_76_0 R functions: > load( gcp all classesdatar ) > G <- F[F$class== Flat,] > bwplot(alg ~ col inst,data=g,scales=list(=list(relation= free )),pch= ) > boplot(err3~alg,data=g,horizontal=true,main=epression(paste( Invariant error:, frac(-^(opt),^(worst)-^(opt)))),notch=true,col= pink ) > boplot(rank~alg,data=g,horizontal=true,main= Ranks,notch=TRUE,col= pink ) col An Eample An Eample (opt) Invariant error: (worst) (opt) (opt) Invariant error: (worst) (opt) Ranks Ranks Note: notches are not appropriate for comparative inference

6 R functions: R functions: > par(las=1,mar=c(3,8,3,1)) > plot(tukehsd(aov(err3~alg*inst,data=g),which= alg ),las=1,mar=c(3,7,3,1)) > pairwisewilcotest(g$err3,g$alg,paired=true) Pairwise comparisons using Wilcoon rank sum test data: G$err3 and G$alg e e e e-07 30e e-08 58e-10 58e-10 58e-10 58e-10 58e-10 58e-10 58e-10 P value adjustment method: holm 95% famil wise confidence level An Eample An Eample Alg 3 Alg 2 Alg 1 MSD 2 X1 X2 X1 X2 X3 Minimal Significant Difference (MSD) interval that satisfies simultaneousl each comparison Differences are statisticall significant if the confidence intervals do not overlap Average Inveriant Error (Tuke's Honset Significance Difference) Average Inveriant Error (Permutation Test) Average Rank () Unreplicated Designs 1 Summar 2 3 Eperimental Designs 4 Sequential Testing (Racing) Procedure Race [?]: repeat Randoml select an unseen instance and run all candidates on it Perform all-pairwise comparison statistical tests Drop all candidates that are significantl inferior to the best algorithm until onl one candidate left or no more unseen instances ; F-Race use Friedman test Note: If resources available for N runs then the optimal design is one run on N instances [?] Replicated Designs Sequential Testing 7 algorithms, 5 instances, 3 runs Procedure Race repeat Run all candidates once on all instances Perform all-pairwise comparison statistical tests Drop all candidates that are significantl inferior to the best algorithm until onl one candidate left or maimal number of runs eceeded ;

Sequential Testing Sequential Testing 7 algorithms, 5 instances, 3 runs 7 algorithms, 5 instances, 3 runs 3 algorithms, 5 instances, 9 runs 3 algorithms, 5 instances, 9 runs 2 algorithms, 5

7 Sequential Testing Sequential Testing 7 algorithms, 5 instances, 3 runs 7 algorithms, 5 instances, 3 runs 3 algorithms, 5 instances, 9 runs 3 algorithms, 5 instances, 9 runs 2 algorithms, 5 instances, 30 runs Sequential Testing class GEOMb (11 Instances) S_Seq_SL_Y O_DCFA O_DCFB O_CCFB S_RLF_Y O_CCFA S_RLF_N O_DRRB O_DRRA S_D_s_N O_CRRB O_DCRA O_CRRA S_D_g_N O_DCRB O_CCRA O_CCRB S_D_g_Y S_D_s_Y Stage

Outline. 2. Sequential Testing (Racing) Marco Chiarandini. Tests for our Scenarios Example. Parametric F-test F-test

Outline. 2. Sequential Testing (Racing) Marco Chiarandini. Tests for our Scenarios Example. Parametric F-test F-test Outline DM812 METAHEURISTICS Lecture 4 Empirical Methods for Configuring and Tuning 1. Statistical 2. (Racing) Marco Chiarandini Department of Mathematics and Computer Science University of Southern Denmark,