SPRINT Multi-Objective Model Racing Tiantian Zhang, Michael Georgiopoulos, Georgios C. Anagnostopoulos Machine Learning Lab University of Central Florida
Overview
- Racing Algorithms
- Multi-Objective Model Selection
- SPRINT-Race
  - Motivation
  - Sequential Probability Ratio Test
  - (Non-)Dominance Inference via Dual-SPRT
- Experimental Results and Discussions
- Conclusion and Future Direction
Racing Algorithm
- Model selection: consider an ensemble of initial models and stick with the best (an elimination-type multi-armed bandit).
- Loop: sample validation instances, evaluate all surviving models, abandon bad performers.
- On stopping, return the best performers.
Racing Algorithm: Trade-off
- Benefit: identifies the best model(s)
- Price to be paid: computational cost
Racing algorithms trade off model optimality vs. computational effort by automatically allocating computational resources, occupying the middle ground between a cheap one-shot search and exhaustive brute-force search.
Racing Algorithm
Name             | Year | Statistical Test       | Criterion of Goodness  | Application
Hoeffding        | 1994 | Hoeffding's inequality | Prediction accuracy    | Classification & function approximation
BRACE            | 1994 | Bayesian statistics    | Cross-validation error | Classification
F-Race           | 2002 | Friedman test          | Function optimization  | Optimal parameters of Max-Min Ant System for TSP
Bernstein Racing | 2008 | Bernstein's inequality | Function optimization  | Policy selection in CMA-ES
S-Race           | 2013 | Sign test              | Prediction accuracy    | Classification
Multi-Objective Optimization
Pareto dominance (minimization case).
[Figure: objective space (f1, f2) with the Pareto front highlighted]
Multi-Objective Model Selection
Multi-criteria recommendation: items are rated on several criteria, e.g. (Story: 2, Actors: 5), (Story: 3, Actors: 4), (Story: 4, Actors: 3), none of which dominates the others.
Multi-Objective Model Selection Multi-task Learning
Multi-Objective Model Selection: Motivation
Model selection often involves more than one criterion of goodness. Examples: generalization performance vs. model complexity; a single model addressing multiple tasks. How can racing be adapted to a multi-objective model selection (MOMS) setting?
Approaches
- Scalarization of criteria into a single criterion
  - Pros: existing racing algorithms may be applicable
  - Cons: good models may be left undiscovered (due to non-convexity of the relevant Pareto front)
- Vector optimization
  - Pros: selects models based on Pareto optimality
  - Cons: no racing algorithm available
Multi-Objective Racing Algorithms
- S-Race: fixed-budget, offline; maximizes selection accuracy
- SPRINT-Race: fixed-confidence, online; minimizes computational cost
SPRINT-Race: Racing via Sequential Probability Ratio with INdifference zone Test
- Tries to minimize the computational effort needed to achieve a predefined confidence about the quality of the returned models.
- A near-optimal, non-parametric, ternary-decision sequential analogue of the sign test is adopted to identify pair-wise dominance/non-dominance relationships.
- Strictly confines the probability of returning dominated models and, simultaneously, of abandoning non-dominated models at a predefined level.
- Stops automatically.
- Introduces the concept of an indifference zone.
Non-sequential vs. Sequential Tests
Non-sequential test
- Sample size n is fixed.
- Given α and a rejection region, β is a function of the sample size.
- Uniformly most efficient test: given n and α, find the test procedure that minimizes β.
Sequential test
- Sample size n is a random variable.
- At each step, either accept H0, accept H1, or continue sampling.
- Properties: OC function (probability of accepting H0), ASN (average sample number).
- Uniformly most efficient test: given α and β, find the test that minimizes the expected sample size.
SPRT -- Sequential Probability Ratio Test
A locally most efficient test procedure.
SPRT with a Bernoulli distribution
- At step n, assume x1, x2, ..., xn are i.i.d. samples from a Bernoulli distribution with P(xi = 1) = p, and let d = #{xi = 1}.
- Two simple hypotheses: H0: p = p0 vs. H1: p = p1.
- Let A = (1 - β)/α and B = β/(1 - α), and form the likelihood ratio
  τn = [p1^d (1 - p1)^(n-d)] / [p0^d (1 - p0)^(n-d)]
- If τn ≤ B, accept H0; if τn ≥ A, accept H1; otherwise, continue sampling.
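The Bernoulli SPRT above can be sketched in a few lines of Python. This is a minimal illustration of Wald's decision rule, not the paper's implementation; the log of the likelihood ratio is accumulated so the test stays numerically stable.

```python
import math

def sprt_bernoulli(p0, p1, alpha, beta, samples):
    """SPRT for H0: p = p0 vs. H1: p = p1 (p1 > p0) on Bernoulli draws.

    Consumes `samples` sequentially and returns "H0", "H1", or
    "continue" if neither boundary was crossed.
    """
    # Wald's thresholds on the likelihood ratio tau_n.
    log_A = math.log((1 - beta) / alpha)   # accept H1 when tau_n >= A
    log_B = math.log(beta / (1 - alpha))   # accept H0 when tau_n <= B
    log_ratio = 0.0
    for x in samples:
        # Log-likelihood-ratio increment of one Bernoulli observation.
        if x == 1:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= log_A:
            return "H1"
        if log_ratio <= log_B:
            return "H0"
    return "continue"
```

For example, with p0 = 0.4, p1 = 0.6 and α = β = 0.05, a run of all-ones crosses the upper boundary after only a handful of samples, which is exactly the early-stopping behavior that makes the sequential test attractive for racing.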
(Non-)Dominance Inference
Given models Ci and Cj:
- nij: the number of instances on which the performance vector of Ci dominates that of Cj
- nji: vice versa
- p: the probability that Ci dominates Cj
- nij ~ Binomial(nij + nji, p)
[Figure: the indifference zone, an interval of half-width δ around p = 1/2]
Dual-SPRT (Sobel-Wald's test)
The three-hypothesis problem is composed of two component two-hypothesis tests:
- SPRT1: H0: p ≤ 1/2 - δ vs. H1: p ≥ 1/2
- SPRT2: H0: p ≤ 1/2 vs. H1: p ≥ 1/2 + δ
[Figure: number line showing the hypotheses of the two component tests around p = 1/2]
Dual-SPRT (Sobel-Wald's test)
Assume that common α and β values are shared between SPRT1 and SPRT2.
[Figure: sample path in the (nij + nji, nij) plane, with a continuation region and acceptance regions for H0, H1, and H2]
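The ternary decision can be sketched as follows. This is my own minimal Python rendering, assuming the win counts nij/nji are in hand and that each composite boundary hypothesis is tested as a simple hypothesis at the endpoint of the indifference zone (a common way to instantiate Wald's SPRT); the paper's exact boundary handling may differ.

```python
import math

def dual_sprt(n_ij, n_ji, delta=0.1, alpha=0.05, beta=0.05):
    """Sobel-Wald-style ternary decision from pairwise win counts.

    n_ij: decisive instances won by C_i; n_ji: won by C_j.
    Returns "H0" (C_i dominated), "H1" (non-dominance, keep both),
    "H2" (C_j dominated), or "continue".
    """
    log_A = math.log((1 - beta) / alpha)
    log_B = math.log(beta / (1 - alpha))

    def llr(p_lo, p_hi):
        # Log-likelihood ratio of p = p_hi vs. p = p_lo given the counts.
        return (n_ij * math.log(p_hi / p_lo)
                + n_ji * math.log((1 - p_hi) / (1 - p_lo)))

    # SPRT1: p = 1/2 - delta vs. p = 1/2
    lr1 = llr(0.5 - delta, 0.5)
    s1 = "hi" if lr1 >= log_A else ("lo" if lr1 <= log_B else None)
    # SPRT2: p = 1/2 vs. p = 1/2 + delta
    lr2 = llr(0.5, 0.5 + delta)
    s2 = "hi" if lr2 >= log_A else ("lo" if lr2 <= log_B else None)

    if s1 == "lo":                 # p <= 1/2 - delta: C_i is dominated
        return "H0"
    if s2 == "hi":                 # p >= 1/2 + delta: C_j is dominated
        return "H2"
    if s1 == "hi" and s2 == "lo":  # p near 1/2: neither dominates
        return "H1"
    return "continue"
```

Note how balanced counts (e.g. 100 wins each) land in the middle acceptance region H1, while a lopsided record drives one of the component tests to its boundary first.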
Dual-SPRT Accuracy Analysis
Interval            | Wrong decisions    | γ(p)
p ≤ 1/2 - δ         | accept H1 or H2    | γ(p) ≤ α
1/2 - δ < p < 1/2   | accept H2          | γ(p) ≤ α
p = 1/2             | accept H0 or H2    | γ(p) ≤ α + β
1/2 < p < 1/2 + δ   | accept H0          | γ(p) < β
p ≥ 1/2 + δ         | accept H0 or H1    | γ(p) ≤ β
Overall: γ = max{γ(p)} ≈ α + β
Overall Error Control: Bonferroni Approach
Σ_{i=1}^{M(M-1)/2} γi ≤ Σ_{i=1}^{M(M-1)/2} (αi + βi)
Let αi = βi = ε for i = 1, ..., M(M-1)/2, and let max_err denote the maximum error probability allowed for SPRINT-Race; then
ε = max_err / (M(M - 1))
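The split is a one-liner, shown here only to make the bookkeeping concrete (the function name is mine): with M(M-1)/2 dual-SPRTs each contributing at most α + β = 2ε of error, the total stays below max_err.

```python
def per_sprt_error(M, max_err):
    """Bonferroni split of the overall error budget max_err.

    There are M*(M-1)/2 dual-SPRTs and each contributes at most
    alpha + beta = 2*eps, so eps = max_err / (M*(M-1)) caps the
    total error probability at max_err.
    """
    return max_err / (M * (M - 1))
```

For M = 10 models and max_err = 0.09, each component SPRT runs at ε = 0.001.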
SPRINT-Race
Initialize Pool ← {C1, C2, ..., CM}, M ≥ 2; t ← 1
repeat
    Randomly sample a problem instance
    for each pair (Ci, Cj) ⊆ Pool s.t. i < j do
        if the corresponding dual-SPRT continues then
            Evaluate Ci and Cj; update nij, nji
            if H0 is accepted then
                Pool ← Pool \ {Ci}; stop all dual-SPRTs involving Ci
            else if H2 is accepted then
                Pool ← Pool \ {Cj}; stop all dual-SPRTs involving Cj
            else if H1 is accepted then
                stop the dual-SPRT for (Ci, Cj)
            end if
        end if
    end for
until all dual-SPRTs have terminated
return the Pareto-front models found
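The loop above can be sketched end to end in Python. This is a self-contained illustration, not the authors' code: `compare`, `demo_compare`, and the endpoint-likelihood-ratio decision rule are my own assumptions, with `compare(ci, cj)` standing in for "sample an instance and check which performance vector dominates".

```python
import itertools
import math
import random

def sprint_race(models, compare, delta=0.1, max_err=0.05):
    """Sketch of SPRINT-Race. `compare(ci, cj)` samples one problem
    instance and returns +1 if ci's performance vector dominates cj's,
    -1 for the reverse, and 0 if neither dominates."""
    M = len(models)
    eps = max_err / (M * (M - 1))      # Bonferroni split: alpha = beta = eps
    log_A = math.log((1 - eps) / eps)
    log_B = math.log(eps / (1 - eps))

    def decide(nij, nji):
        # Ternary Sobel-Wald decision from the pair's win counts.
        def llr(p_lo, p_hi):
            return (nij * math.log(p_hi / p_lo)
                    + nji * math.log((1 - p_hi) / (1 - p_lo)))
        lr1 = llr(0.5 - delta, 0.5)    # SPRT1
        lr2 = llr(0.5, 0.5 + delta)    # SPRT2
        s1 = "hi" if lr1 >= log_A else ("lo" if lr1 <= log_B else None)
        s2 = "hi" if lr2 >= log_A else ("lo" if lr2 <= log_B else None)
        if s1 == "lo":
            return "H0"                # C_i is dominated -> drop C_i
        if s2 == "hi":
            return "H2"                # C_j is dominated -> drop C_j
        if s1 == "hi" and s2 == "lo":
            return "H1"                # non-dominance -> keep both
        return None                    # keep sampling this pair

    pool = list(models)
    counts = {p: [0, 0] for p in itertools.combinations(range(M), 2)}
    live = set(counts)
    while live:
        for (i, j) in sorted(live):
            if (i, j) not in live:     # pair stopped earlier in this pass
                continue
            w = compare(models[i], models[j])
            if w > 0:
                counts[(i, j)][0] += 1
            elif w < 0:
                counts[(i, j)][1] += 1
            d = decide(*counts[(i, j)])
            if d == "H0":
                pool.remove(models[i])
                live = {p for p in live if i not in p}
            elif d == "H2":
                pool.remove(models[j])
                live = {p for p in live if j not in p}
            elif d == "H1":
                live.discard((i, j))
    return pool                        # approximate Pareto front

# Demo (hypothetical): model "quality" is a number; the better model's
# performance vector wins a sampled instance with probability 0.9.
def demo_compare(a, b):
    p_win = 0.9 if a > b else (0.1 if a < b else 0.5)
    return 1 if random.random() < p_win else -1
```

Dropping a model immediately retires every dual-SPRT it takes part in, which is where the computational savings over a fixed-budget race come from.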
Experiments - Artificially Constructed MOMS Problems
Given M models, construct M(M-1)/2 Bernoulli distributions with known p values.
Three performance metrics:
- R (retention) = |P_R ∩ P_PF| / |P_PF|
- E (excess) = |P_R \ P_PF| / |P_R| = 1 - |P_R ∩ P_PF| / |P_R|
- T: sample complexity
where P_PF is the set of true Pareto-front models and P_R is the set of models returned by SPRINT-Race.
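The two set-based metrics translate directly into code; a small sketch (function name mine) with the sets from the definitions above:

```python
def retention_excess(returned, pareto_front):
    """R: fraction of true Pareto-front models that were retained.
    E: fraction of returned models that are not on the true front.
    Both arguments are Python sets of model identifiers."""
    R = len(returned & pareto_front) / len(pareto_front)
    E = len(returned - pareto_front) / len(returned)
    return R, E
```

An ideal run has R = 1 (nothing on the front was lost) and E = 0 (nothing spurious was returned).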
Impact of the Number of Objectives
[Figure: retention R, excess E, and sample complexity T as the number of objectives varies from 2 to 14]
Impact of δ and max_err Values
[Figure: sample complexity T and 1 - R + E as functions of δ, for 2- and 3-objective problems]
Decreasing max_err necessarily increases sample complexity, while the R and E values vary only slightly.
Experiments - ACO Selection for TSPs
- Parameter tuning of ACOs is time-consuming.
- A pool of 125 candidate models was initialized with diverse configurations in terms of different combinations of three parameters:
  - α_ACO: the influence of pheromone trails
  - β_ACO: the influence of heuristic information
  - ρ_ACO: the pheromone trail evaporation rate
- Two objectives:
  - minimize the TSP tour length
  - minimize the actual computation time to find this tour
- Setup: ACOTSPJava + the DIMACS TSP instance generator
Experiments - ACO Selection for TSPs
[Figures: (left) normalized tour length vs. normalized computation time for dominated and non-dominated models; (right) minimum probability of dominance for each of the 125 models, with dominated and non-dominated models separated near p = 0.45]
Conclusions
- SPRINT-Race is a multi-objective racing algorithm for solving multi-objective model selection.
- Dual-SPRT is adopted for dominance and non-dominance inference.
- The total probability of falsely retaining any dominated model or removing any non-dominated model is strictly controlled.
- It stops automatically with fixed confidence.
- SPRINT-Race returns almost exactly the true Pareto-front models at a reduced cost:
  - artificially constructed MOMS problems;
  - selecting Pareto-optimal ACO parameters (optimal model configurations) for solving TSPs.
References
[1] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems, 6:59-66, 1994.
[2] T. Zhang, M. Georgiopoulos, and G. C. Anagnostopoulos. S-Race: A multi-objective racing algorithm. In GECCO '13: Proc. of the Genet. and Evol. Comput. Conf., pages 1565-1572, 2013.
[3] B. K. Ghosh. Sequential Tests of Statistical Hypotheses. Addison-Wesley Publishing Company, Inc., 1970.
[4] Y. Jin and B. Sendhoff. Pareto-based multi-objective machine learning: An overview and case studies. IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., 38:397-415, 2008.
[5] W. Mendenhall, D. D. Wackerly, and R. L. Scheaffer. Nonparametric statistics. Math. Stat. with App., pages 674-679, 1989.
[6] M. Sobel and A. Wald. A sequential decision procedure for choosing one of three hypotheses concerning the unknown mean of a normal distribution. The Ann. of Math. Stat., 20:502-522, 1949.
[7] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. F-race and iterated F-race: An overview. Technical report, IRIDIA, 2011.
Thank you! Any Questions?
Backup Slides
Multi-Objective Optimization
Performance vector of model C over L objectives:
f(C) = (f1(C), f2(C), ..., fL(C))  (minimization)
C dominates C' if and only if
fi(C) ≤ fi(C') for all i ∈ {1, 2, ..., L}, and
fj(C) < fj(C') for some j ∈ {1, 2, ..., L}.
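The definition above maps directly to a two-condition check; a minimal sketch (function name mine), with performance vectors given as tuples:

```python
def dominates(f_c, f_c_prime):
    """Pareto dominance for minimization: C dominates C' iff C is no
    worse in every objective and strictly better in at least one."""
    pairs = list(zip(f_c, f_c_prime))
    return (all(a <= b for a, b in pairs)
            and any(a < b for a, b in pairs))
```

The strict-inequality clause is what rules out "dominance" between identical performance vectors.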
Multi-Objective Racing Algorithm: S-Race (GECCO '13)
- Tests Pareto dominance non-parametrically via the sign test.
- Considers family-wise testing procedures.
Limitations of S-Race
✗ No inference on the non-dominance relationship.
✗ No control over the overall probability of making any Type II errors.
✗ The sign test is not an optimal test procedure in a sequential setting.
Dual-SPRT (Sobel-Wald's test)
The three-hypothesis problem is composed of two component two-hypothesis tests:
- SPRT1: H0: p ≤ 1/2 - δ vs. H1: p ≥ 1/2
- SPRT2: H0: p ≤ 1/2 vs. H1: p ≥ 1/2 + δ

if SPRT1 accepts  | if SPRT2 accepts  | then the dual-SPRT accepts
H0: p ≤ 1/2 - δ   | H0: p ≤ 1/2       | H0: p ≤ 1/2 - δ (remove Ci)
H1: p ≥ 1/2       | H0: p ≤ 1/2       | H1: p = 1/2 (keep both)
H1: p ≥ 1/2       | H1: p ≥ 1/2 + δ   | H2: p ≥ 1/2 + δ (remove Cj)